* [PATCH nbd 0/4] Enable multi-conn NBD [for discussion only]
@ 2023-03-09 11:39 Richard W.M. Jones
  2023-03-09 11:39 ` [PATCH nbd 1/4] nbd: Add multi-conn option Richard W.M. Jones
                   ` (5 more replies)
  0 siblings, 6 replies; 11+ messages in thread
From: Richard W.M. Jones @ 2023-03-09 11:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, vsementsov, eblake, hreitz, kwolf

[ Patch series also available here, along with this cover letter and the
  script used to generate test results:
  https://gitlab.com/rwmjones/qemu/-/commits/2023-nbd-multi-conn-v1 ]

This patch series adds multi-conn support to the NBD block driver in
qemu.  It is only meant for discussion and testing because it has a
number of obvious shortcomings (see "XXX" in commit messages and
code).  If we decide this is a good idea, we can work on a better
patch.

The Network Block Device (NBD) protocol allows servers to advertise
that they are capable of multi-conn.  This means they obey certain
requirements about how data is cached, allowing multiple connections
to be safely opened to the NBD server.  For example, a flush or FUA
operation on one connection is guaranteed to flush the cache on all
connections.
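
As an aside (not part of this series), you can usually check whether a
particular server advertises multi-conn by inspecting the export
flags, for example with libnbd's nbdinfo tool (the exact output format
may vary between versions):

  $ nbdinfo 'nbd+unix:///?socket=/tmp/nbd.sock' | grep multi_conn
        can_multi_conn: true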

Clients that use multi-conn can achieve better performance.  This
seems to be down to at least two factors:

 - It avoids "head of line blocking", where one large request delays
   the smaller requests queued behind it.

 - With NBD over Unix domain sockets, more cores can be used.

qemu-nbd, nbdkit and libnbd have all supported multi-conn for ages,
but qemu's built-in NBD client does not, which is what this patch
series fixes.

Below I've produced a few benchmarks.

Note these are mostly concocted to test NBD performance and may not
make sense on their own terms (eg. qemu's disk image layer has a
built-in curl client, so you wouldn't need to run one separately).  In
the real world we use long pipelines of NBD operations where different
tools are mixed together to achieve efficient downloads, conversions,
disk modifications and sparsification, and those would exercise
different aspects of this.

I've also included nbdcopy as a client for comparison in some tests.

All tests were run 4 times, the first result discarded, and the last 3
averaged.  If any of the last 3 were > 10% away from the average then
the test was stopped.

My summary:

 - It works effectively for qemu client & nbdkit server, especially in
   cases where the server does large, heavyweight requests.  This is
   important for us because virt-v2v uses an nbdkit Python plugin and
   various other heavyweight plugins (eg. plugins that access remote
   servers for each request).

 - It seems to make little or no difference with qemu + qemu-nbd
   server.  I speculate that's because qemu-nbd doesn't support system
   threads, so networking is bottlenecked through a single core.  Even
   though there are coroutines handling different sockets, they must
    all wait in turn to issue send(2) or recv(2) calls on the same
   core.

 - qemu-img unfortunately uses a single thread for all coroutines so
   it suffers from a similar problem to qemu-nbd.  This change would
   be much more effective if we could distribute coroutines across
   threads.

 - For tests which are highly bottlenecked on disk I/O (eg. the large
   local file test and null test) multi-conn doesn't make much
   difference.

 - Multi-conn, even with only 2 connections, can make up for the
   overhead of range requests, exceeding the performance of wget.

 - In the curlremote test, qemu-nbd is especially slow, for unknown
   reasons.


Integrity test (./multi-conn.pl integrity)
==========================================

nbdkit-sparse-random-plugin
  |                 ^
  | nbd+unix        | nbd+unix
  v                 |
   qemu-img convert

Reading from, and writing the same data back to, the nbdkit
sparse-random plugin checks that the data written is the same as the
data read.  This uses two Unix domain sockets, with or without
multi-conn.  This test is mainly here to check that we don't crash or
corrupt data with this patch.

  server          client        multi-conn   time
  ---------------------------------------------------------------
    nbdkit	  qemu-img	[u/s]	9.07s	
    nbdkit	  qemu-img	1	9.05s	
    nbdkit	  qemu-img	2	9.02s	
    nbdkit	  qemu-img	4	8.98s	

[u/s] = upstream qemu 7.2.0


Curl local server test (./multi-conn.pl curlhttp)
=================================================

Localhost Apache serving a file over http
                  |
                  | http
                  v
nbdkit-curl-plugin   or   qemu-nbd
                  |
                  | nbd+unix
                  v
qemu-img convert   or   nbdcopy

We download an image from a local web server through
nbdkit-curl-plugin or qemu-nbd using the curl block driver, over NBD.
The image is copied to /dev/null.

  server          client        multi-conn   time
  ---------------------------------------------------------------
  qemu-nbd	   nbdcopy	1	8.88s	
  qemu-nbd	   nbdcopy	2	8.64s	
  qemu-nbd	   nbdcopy	4	8.37s	
  qemu-nbd	  qemu-img	[u/s]	6.47s	
  qemu-nbd	  qemu-img	1	6.56s	
  qemu-nbd	  qemu-img	2	6.63s	
  qemu-nbd	  qemu-img	4	6.50s	
    nbdkit	   nbdcopy	1	12.15s	
    nbdkit	   nbdcopy	2	7.05s	(72.36% better)
    nbdkit	   nbdcopy	4	3.54s	(242.90% better)
    nbdkit	  qemu-img	[u/s]	6.90s	
    nbdkit	  qemu-img	1	7.00s	
    nbdkit	  qemu-img	2	3.85s	(79.15% better)
    nbdkit	  qemu-img	4	3.85s	(79.15% better)


Curl local file test (./multi-conn.pl curlfile)
===============================================

nbdkit-curl-plugin   using file:/// URI
                  |
                  | nbd+unix
                  v
qemu-img convert   or   nbdcopy

We download from a file:/// URI.  This test is designed to exercise
NBD and some curl internal paths without the overhead of an external
server.  qemu-nbd doesn't support file:/// URIs, so we cannot
duplicate the test with qemu-nbd as the server.

  server          client        multi-conn   time
  ---------------------------------------------------------------
    nbdkit	   nbdcopy	1	31.32s	
    nbdkit	   nbdcopy	2	20.29s	(54.38% better)
    nbdkit	   nbdcopy	4	13.22s	(136.91% better)
    nbdkit	  qemu-img	[u/s]	31.55s	
    nbdkit	  qemu-img	1	31.70s	
    nbdkit	  qemu-img	2	21.60s	(46.07% better)
    nbdkit	  qemu-img	4	13.88s	(127.25% better)


Curl remote server test (./multi-conn.pl curlremote)
====================================================

nbdkit-curl-plugin   using http://remote/*.qcow2 URI
         |
         | nbd+unix
         v
qemu-img convert

We download from a remote qcow2 file to a local raw file, converting
between formats during copying.

qemu-nbd   using http://remote/*.qcow2 URI
    |
    | nbd+unix
    v
qemu-img convert

Similarly, replacing nbdkit with qemu-nbd (treating the remote file as
if it were raw, so the conversion is still done by qemu-img).

Additionally, we compare downloading the file with wget (note this
doesn't include the time for conversion, but that should only add a
few seconds).

  server          client        multi-conn   time
  ---------------------------------------------------------------
         -	      wget	1	58.19s	
    nbdkit	  qemu-img	[u/s]	68.29s	(17.36% worse)
    nbdkit	  qemu-img	1	67.85s	(16.60% worse)
    nbdkit	  qemu-img	2	58.17s	
    nbdkit	  qemu-img	4	59.80s	
    nbdkit	  qemu-img	6	59.15s	
    nbdkit	  qemu-img	8	59.52s	

  qemu-nbd	  qemu-img	[u/s]	202.55s
  qemu-nbd	  qemu-img	1	204.61s	
  qemu-nbd	  qemu-img	2	196.73s	
  qemu-nbd	  qemu-img	4	179.53s	(12.83% better)
  qemu-nbd	  qemu-img	6	181.70s	(11.48% better)
  qemu-nbd	  qemu-img	8	181.05s	(11.88% better)


Local file test (./multi-conn.pl file)
======================================

qemu-nbd or nbdkit serving a large local file
                  |
                  | nbd+unix
                  v
qemu-img convert   or   nbdcopy

We download a local file over NBD.  The image is copied to /dev/null.

  server          client        multi-conn   time
  ---------------------------------------------------------------
  qemu-nbd	   nbdcopy	1	15.50s	
  qemu-nbd	   nbdcopy	2	14.36s	
  qemu-nbd	   nbdcopy	4	14.32s	
  qemu-nbd	  qemu-img	[u/s]	10.16s	
  qemu-nbd	  qemu-img	1	11.17s	(10.01% worse)
  qemu-nbd	  qemu-img	2	10.35s	
  qemu-nbd	  qemu-img	4	10.39s	
    nbdkit	   nbdcopy	1	9.10s	
    nbdkit	   nbdcopy	2	8.25s	
    nbdkit	   nbdcopy	4	8.60s	
    nbdkit	  qemu-img	[u/s]	8.64s	
    nbdkit	  qemu-img	1	9.38s	
    nbdkit	  qemu-img	2	8.69s	
    nbdkit	  qemu-img	4	8.87s	


Null test (./multi-conn.pl null)
================================

qemu-nbd with null-co driver  or  nbdkit-null-plugin + noextents filter
                  |
                  | nbd+unix
                  v
qemu-img convert   or   nbdcopy

This is like the local file test above, but without needing a file.
Instead all zeroes (fully allocated) are downloaded over NBD.

  server          client        multi-conn   time
  ---------------------------------------------------------------
  qemu-nbd	   nbdcopy	1	14.86s	
  qemu-nbd	   nbdcopy	2	17.08s	(14.90% worse)
  qemu-nbd	   nbdcopy	4	17.89s	(20.37% worse)
  qemu-nbd	  qemu-img	[u/s]	13.29s	
  qemu-nbd	  qemu-img	1	13.31s	
  qemu-nbd	  qemu-img	2	13.00s	
  qemu-nbd	  qemu-img	4	12.62s	
    nbdkit	   nbdcopy	1	15.06s	
    nbdkit	   nbdcopy	2	12.21s	(23.32% better)
    nbdkit	   nbdcopy	4	11.67s	(29.10% better)
    nbdkit	  qemu-img	[u/s]	17.13s	
    nbdkit	  qemu-img	1	17.11s	
    nbdkit	  qemu-img	2	16.82s	
    nbdkit	  qemu-img	4	18.81s	






* [PATCH nbd 1/4] nbd: Add multi-conn option
  2023-03-09 11:39 [PATCH nbd 0/4] Enable multi-conn NBD [for discussion only] Richard W.M. Jones
@ 2023-03-09 11:39 ` Richard W.M. Jones
  2023-03-10 22:17   ` Eric Blake
  2023-03-09 11:39 ` [PATCH nbd 2/4] nbd: Split out block device state from underlying NBD connections Richard W.M. Jones
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 11+ messages in thread
From: Richard W.M. Jones @ 2023-03-09 11:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, vsementsov, eblake, hreitz, kwolf

Add a multi-conn option to the NBD client.  This commit just adds the
option; it is not yet functional.

Setting this to a value > 1 permits multiple connections to the NBD
server; a typical value might be 4.  The default is 1, meaning only a
single connection is made.  If the NBD server does not advertise that
it is safe for multi-conn then this setting is forced to 1.
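
As an illustration (not part of this commit), assuming the new option
can be given through --image-opts in the same way as the existing
reconnect-delay and open-timeout options, a client could request up to
4 connections with something like:

  qemu-img convert -p -O raw --image-opts \
      driver=nbd,server.type=unix,server.path=/tmp/nbd.sock,multi-conn=4 \
      /var/tmp/out.img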

Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
---
 block/nbd.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/block/nbd.c b/block/nbd.c
index bf2894ad5c..5ffae0b798 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -49,6 +49,7 @@
 
 #define EN_OPTSTR ":exportname="
 #define MAX_NBD_REQUESTS    16
+#define MAX_MULTI_CONN      16
 
 #define HANDLE_TO_INDEX(bs, handle) ((handle) ^ (uint64_t)(intptr_t)(bs))
 #define INDEX_TO_HANDLE(bs, index)  ((index)  ^ (uint64_t)(intptr_t)(bs))
@@ -98,6 +99,7 @@ typedef struct BDRVNBDState {
     /* Connection parameters */
     uint32_t reconnect_delay;
     uint32_t open_timeout;
+    uint32_t multi_conn;
     SocketAddress *saddr;
     char *export;
     char *tlscredsid;
@@ -1803,6 +1805,15 @@ static QemuOptsList nbd_runtime_opts = {
                     "attempts until successful or until @open-timeout seconds "
                     "have elapsed. Default 0",
         },
+        {
+            .name = "multi-conn",
+            .type = QEMU_OPT_NUMBER,
+            .help = "If > 1 permit up to this number of connections to the "
+                    "server. The server must also advertise multi-conn "
+                    "support.  If <= 1, only a single connection is made "
+                    "to the server even if the server advertises multi-conn. "
+                    "Default 1",
+        },
         { /* end of list */ }
     },
 };
@@ -1858,6 +1869,10 @@ static int nbd_process_options(BlockDriverState *bs, QDict *options,
 
     s->reconnect_delay = qemu_opt_get_number(opts, "reconnect-delay", 0);
     s->open_timeout = qemu_opt_get_number(opts, "open-timeout", 0);
+    s->multi_conn = qemu_opt_get_number(opts, "multi-conn", 1);
+    if (s->multi_conn > MAX_MULTI_CONN) {
+        s->multi_conn = MAX_MULTI_CONN;
+    }
 
     ret = 0;
 
@@ -1912,6 +1927,15 @@ static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
 
     nbd_client_connection_enable_retry(s->conn);
 
+    /*
+     * We set s->multi_conn in nbd_process_options above, but now that
+     * we have connected if the server doesn't advertise that it is
+     * safe for multi-conn, force it to 1.
+     */
+    if (!(s->info.flags & NBD_FLAG_CAN_MULTI_CONN)) {
+        s->multi_conn = 1;
+    }
+
     return 0;
 
 fail:
-- 
2.39.2




* [PATCH nbd 2/4] nbd: Split out block device state from underlying NBD connections
  2023-03-09 11:39 [PATCH nbd 0/4] Enable multi-conn NBD [for discussion only] Richard W.M. Jones
  2023-03-09 11:39 ` [PATCH nbd 1/4] nbd: Add multi-conn option Richard W.M. Jones
@ 2023-03-09 11:39 ` Richard W.M. Jones
  2023-03-14 12:13   ` Eric Blake
  2023-03-09 11:39 ` [PATCH nbd 3/4] nbd: Open multiple NBD connections if multi-conn is set Richard W.M. Jones
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 11+ messages in thread
From: Richard W.M. Jones @ 2023-03-09 11:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, vsementsov, eblake, hreitz, kwolf

To implement multi-conn, we will put multiple underlying NBD
connections (ie. NBDClientConnection) inside the NBD block device
handle (BDRVNBDState).  This requires first breaking the one-to-one
relationship between NBDClientConnection and BDRVNBDState.

To do this, a new structure (NBDConnState) is introduced.
NBDConnState takes all the per-connection state out of the block
driver struct.  BDRVNBDState now contains a conns[] array of pointers
to NBDConnState, one for each underlying multi-conn connection.

After this change there is still a one-to-one relationship because we
only ever use the zeroth slot in the conns[] array.  Thus this does
not implement multi-conn yet.

Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
---
 block/coroutines.h |   5 +-
 block/nbd.c        | 674 ++++++++++++++++++++++++---------------------
 2 files changed, 358 insertions(+), 321 deletions(-)

diff --git a/block/coroutines.h b/block/coroutines.h
index dd9f3d449b..14b01d8591 100644
--- a/block/coroutines.h
+++ b/block/coroutines.h
@@ -62,7 +62,7 @@ int coroutine_fn GRAPH_RDLOCK
 bdrv_co_writev_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos);
 
 int coroutine_fn
-nbd_co_do_establish_connection(BlockDriverState *bs, bool blocking,
+nbd_co_do_establish_connection(BlockDriverState *bs, void *cs, bool blocking,
                                Error **errp);
 
 
@@ -86,6 +86,7 @@ bdrv_common_block_status_above(BlockDriverState *bs,
                                BlockDriverState **file,
                                int *depth);
 int co_wrapper_mixed
-nbd_do_establish_connection(BlockDriverState *bs, bool blocking, Error **errp);
+nbd_do_establish_connection(BlockDriverState *bs, void *cs, bool blocking,
+                            Error **errp);
 
 #endif /* BLOCK_COROUTINES_H */
diff --git a/block/nbd.c b/block/nbd.c
index 5ffae0b798..84e8a1add0 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -51,8 +51,8 @@
 #define MAX_NBD_REQUESTS    16
 #define MAX_MULTI_CONN      16
 
-#define HANDLE_TO_INDEX(bs, handle) ((handle) ^ (uint64_t)(intptr_t)(bs))
-#define INDEX_TO_HANDLE(bs, index)  ((index)  ^ (uint64_t)(intptr_t)(bs))
+#define HANDLE_TO_INDEX(cs, handle) ((handle) ^ (uint64_t)(intptr_t)(cs))
+#define INDEX_TO_HANDLE(cs, index)  ((index)  ^ (uint64_t)(intptr_t)(cs))
 
 typedef struct {
     Coroutine *coroutine;
@@ -67,7 +67,10 @@ typedef enum NBDClientState {
     NBD_CLIENT_QUIT
 } NBDClientState;
 
-typedef struct BDRVNBDState {
+/* A single client connection. */
+typedef struct NBDConnState {
+    struct BDRVNBDState *s; /* Points to enclosing BDRVNBDState */
+
     QIOChannel *ioc; /* The current I/O channel */
     NBDExportInfo info;
 
@@ -94,7 +97,12 @@ typedef struct BDRVNBDState {
 
     QEMUTimer *open_timer;
 
-    BlockDriverState *bs;
+    NBDClientConnection *conn;
+} NBDConnState;
+
+typedef struct BDRVNBDState {
+    /* The underlying NBD connections */
+    NBDConnState *conns[MAX_MULTI_CONN];
 
     /* Connection parameters */
     uint32_t reconnect_delay;
@@ -108,7 +116,7 @@ typedef struct BDRVNBDState {
     char *x_dirty_bitmap;
     bool alloc_depth;
 
-    NBDClientConnection *conn;
+    BlockDriverState *bs;
 } BDRVNBDState;
 
 static void nbd_yank(void *opaque);
@@ -117,14 +125,16 @@ static void nbd_clear_bdrvstate(BlockDriverState *bs)
 {
     BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
 
-    nbd_client_connection_release(s->conn);
-    s->conn = NULL;
+    nbd_client_connection_release(s->conns[0]->conn);
+    s->conns[0]->conn = NULL;
 
     yank_unregister_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name));
 
     /* Must not leave timers behind that would access freed data */
-    assert(!s->reconnect_delay_timer);
-    assert(!s->open_timer);
+    assert(!s->conns[0]->reconnect_delay_timer);
+    assert(!s->conns[0]->open_timer);
+
+    g_free(s->conns[0]);
 
     object_unref(OBJECT(s->tlscreds));
     qapi_free_SocketAddress(s->saddr);
@@ -151,139 +161,143 @@ static bool coroutine_fn nbd_recv_coroutine_wake_one(NBDClientRequest *req)
     return false;
 }
 
-static void coroutine_fn nbd_recv_coroutines_wake(BDRVNBDState *s)
+static void coroutine_fn nbd_recv_coroutines_wake(NBDConnState *cs)
 {
     int i;
 
-    QEMU_LOCK_GUARD(&s->receive_mutex);
+    QEMU_LOCK_GUARD(&cs->receive_mutex);
     for (i = 0; i < MAX_NBD_REQUESTS; i++) {
-        if (nbd_recv_coroutine_wake_one(&s->requests[i])) {
+        if (nbd_recv_coroutine_wake_one(&cs->requests[i])) {
             return;
         }
     }
 }
 
 /* Called with s->requests_lock held.  */
-static void coroutine_fn nbd_channel_error_locked(BDRVNBDState *s, int ret)
+static void coroutine_fn nbd_channel_error_locked(NBDConnState *cs, int ret)
 {
-    if (s->state == NBD_CLIENT_CONNECTED) {
-        qio_channel_shutdown(s->ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
+    if (cs->state == NBD_CLIENT_CONNECTED) {
+        qio_channel_shutdown(cs->ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
     }
 
     if (ret == -EIO) {
-        if (s->state == NBD_CLIENT_CONNECTED) {
-            s->state = s->reconnect_delay ? NBD_CLIENT_CONNECTING_WAIT :
-                                            NBD_CLIENT_CONNECTING_NOWAIT;
+        if (cs->state == NBD_CLIENT_CONNECTED) {
+            cs->state = cs->s->reconnect_delay ? NBD_CLIENT_CONNECTING_WAIT :
+                                                 NBD_CLIENT_CONNECTING_NOWAIT;
         }
     } else {
-        s->state = NBD_CLIENT_QUIT;
+        cs->state = NBD_CLIENT_QUIT;
     }
 }
 
-static void coroutine_fn nbd_channel_error(BDRVNBDState *s, int ret)
+static void coroutine_fn nbd_channel_error(NBDConnState *cs, int ret)
 {
-    QEMU_LOCK_GUARD(&s->requests_lock);
-    nbd_channel_error_locked(s, ret);
+    QEMU_LOCK_GUARD(&cs->requests_lock);
+    nbd_channel_error_locked(cs, ret);
 }
 
-static void reconnect_delay_timer_del(BDRVNBDState *s)
+static void reconnect_delay_timer_del(NBDConnState *cs)
 {
-    if (s->reconnect_delay_timer) {
-        timer_free(s->reconnect_delay_timer);
-        s->reconnect_delay_timer = NULL;
+    if (cs->reconnect_delay_timer) {
+        timer_free(cs->reconnect_delay_timer);
+        cs->reconnect_delay_timer = NULL;
     }
 }
 
 static void reconnect_delay_timer_cb(void *opaque)
 {
-    BDRVNBDState *s = opaque;
+    NBDConnState *cs = opaque;
 
-    reconnect_delay_timer_del(s);
-    WITH_QEMU_LOCK_GUARD(&s->requests_lock) {
-        if (s->state != NBD_CLIENT_CONNECTING_WAIT) {
+    reconnect_delay_timer_del(cs);
+    WITH_QEMU_LOCK_GUARD(&cs->requests_lock) {
+        if (cs->state != NBD_CLIENT_CONNECTING_WAIT) {
             return;
         }
-        s->state = NBD_CLIENT_CONNECTING_NOWAIT;
+        cs->state = NBD_CLIENT_CONNECTING_NOWAIT;
     }
-    nbd_co_establish_connection_cancel(s->conn);
+    nbd_co_establish_connection_cancel(cs->conn);
 }
 
-static void reconnect_delay_timer_init(BDRVNBDState *s, uint64_t expire_time_ns)
+static void reconnect_delay_timer_init(NBDConnState *cs,
+                                       uint64_t expire_time_ns)
 {
-    assert(!s->reconnect_delay_timer);
-    s->reconnect_delay_timer = aio_timer_new(bdrv_get_aio_context(s->bs),
-                                             QEMU_CLOCK_REALTIME,
-                                             SCALE_NS,
-                                             reconnect_delay_timer_cb, s);
-    timer_mod(s->reconnect_delay_timer, expire_time_ns);
+    assert(!cs->reconnect_delay_timer);
+    cs->reconnect_delay_timer = aio_timer_new(bdrv_get_aio_context(cs->s->bs),
+                                              QEMU_CLOCK_REALTIME,
+                                              SCALE_NS,
+                                              reconnect_delay_timer_cb, cs);
+    timer_mod(cs->reconnect_delay_timer, expire_time_ns);
 }
 
-static void nbd_teardown_connection(BlockDriverState *bs)
+static void nbd_teardown_connection(NBDConnState *cs)
 {
-    BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
+    assert(!cs->in_flight);
 
-    assert(!s->in_flight);
-
-    if (s->ioc) {
-        qio_channel_shutdown(s->ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
-        yank_unregister_function(BLOCKDEV_YANK_INSTANCE(s->bs->node_name),
-                                 nbd_yank, s->bs);
-        object_unref(OBJECT(s->ioc));
-        s->ioc = NULL;
+    if (cs->ioc) {
+        qio_channel_shutdown(cs->ioc, QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
+        yank_unregister_function(BLOCKDEV_YANK_INSTANCE(cs->s->bs->node_name),
+                                 nbd_yank, cs);
+        object_unref(OBJECT(cs->ioc));
+        cs->ioc = NULL;
     }
 
-    WITH_QEMU_LOCK_GUARD(&s->requests_lock) {
-        s->state = NBD_CLIENT_QUIT;
+    WITH_QEMU_LOCK_GUARD(&cs->requests_lock) {
+        cs->state = NBD_CLIENT_QUIT;
     }
 }
 
-static void open_timer_del(BDRVNBDState *s)
+static void open_timer_del(NBDConnState *cs)
 {
-    if (s->open_timer) {
-        timer_free(s->open_timer);
-        s->open_timer = NULL;
+    if (cs->open_timer) {
+        timer_free(cs->open_timer);
+        cs->open_timer = NULL;
     }
 }
 
 static void open_timer_cb(void *opaque)
 {
-    BDRVNBDState *s = opaque;
+    NBDConnState *cs = opaque;
 
-    nbd_co_establish_connection_cancel(s->conn);
-    open_timer_del(s);
+    nbd_co_establish_connection_cancel(cs->conn);
+    open_timer_del(cs);
 }
 
-static void open_timer_init(BDRVNBDState *s, uint64_t expire_time_ns)
+static void open_timer_init(NBDConnState *cs, uint64_t expire_time_ns)
 {
-    assert(!s->open_timer);
-    s->open_timer = aio_timer_new(bdrv_get_aio_context(s->bs),
-                                  QEMU_CLOCK_REALTIME,
-                                  SCALE_NS,
-                                  open_timer_cb, s);
-    timer_mod(s->open_timer, expire_time_ns);
+    assert(!cs->open_timer);
+    cs->open_timer = aio_timer_new(bdrv_get_aio_context(cs->s->bs),
+                                   QEMU_CLOCK_REALTIME,
+                                   SCALE_NS,
+                                   open_timer_cb, cs);
+    timer_mod(cs->open_timer, expire_time_ns);
 }
 
-static bool nbd_client_will_reconnect(BDRVNBDState *s)
+static bool nbd_client_will_reconnect(NBDConnState *cs)
 {
     /*
      * Called only after a socket error, so this is not performance sensitive.
      */
-    QEMU_LOCK_GUARD(&s->requests_lock);
-    return s->state == NBD_CLIENT_CONNECTING_WAIT;
+    QEMU_LOCK_GUARD(&cs->requests_lock);
+    return cs->state == NBD_CLIENT_CONNECTING_WAIT;
 }
 
 /*
  * Update @bs with information learned during a completed negotiation process.
  * Return failure if the server's advertised options are incompatible with the
  * client's needs.
+ *
+ * Note that we are only called for the first connection (s->conns[0])
+ * because multi-conn assumes that all other connections are alike.
+ * We don't check that assumption but probably should (XXX).
  */
-static int nbd_handle_updated_info(BlockDriverState *bs, Error **errp)
+static int nbd_handle_updated_info(BlockDriverState *bs,
+                                   NBDConnState *cs, Error **errp)
 {
     BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
     int ret;
 
     if (s->x_dirty_bitmap) {
-        if (!s->info.base_allocation) {
+        if (!cs->info.base_allocation) {
             error_setg(errp, "requested x-dirty-bitmap %s not found",
                        s->x_dirty_bitmap);
             return -EINVAL;
@@ -293,21 +307,21 @@ static int nbd_handle_updated_info(BlockDriverState *bs, Error **errp)
         }
     }
 
-    if (s->info.flags & NBD_FLAG_READ_ONLY) {
+    if (cs->info.flags & NBD_FLAG_READ_ONLY) {
         ret = bdrv_apply_auto_read_only(bs, "NBD export is read-only", errp);
         if (ret < 0) {
             return ret;
         }
     }
 
-    if (s->info.flags & NBD_FLAG_SEND_FUA) {
+    if (cs->info.flags & NBD_FLAG_SEND_FUA) {
         bs->supported_write_flags = BDRV_REQ_FUA;
         bs->supported_zero_flags |= BDRV_REQ_FUA;
     }
 
-    if (s->info.flags & NBD_FLAG_SEND_WRITE_ZEROES) {
+    if (cs->info.flags & NBD_FLAG_SEND_WRITE_ZEROES) {
         bs->supported_zero_flags |= BDRV_REQ_MAY_UNMAP;
-        if (s->info.flags & NBD_FLAG_SEND_FAST_ZERO) {
+        if (cs->info.flags & NBD_FLAG_SEND_FAST_ZERO) {
             bs->supported_zero_flags |= BDRV_REQ_NO_FALLBACK;
         }
     }
@@ -318,129 +332,132 @@ static int nbd_handle_updated_info(BlockDriverState *bs, Error **errp)
 }
 
 int coroutine_fn nbd_co_do_establish_connection(BlockDriverState *bs,
+                                                void *csvp,
                                                 bool blocking, Error **errp)
 {
     BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
+    NBDConnState *cs = csvp;
     int ret;
     IO_CODE();
 
-    assert(!s->ioc);
+    assert(!cs->ioc);
 
-    s->ioc = nbd_co_establish_connection(s->conn, &s->info, blocking, errp);
-    if (!s->ioc) {
+    cs->ioc = nbd_co_establish_connection(cs->conn, &cs->info, blocking, errp);
+    if (!cs->ioc) {
         return -ECONNREFUSED;
     }
 
-    yank_register_function(BLOCKDEV_YANK_INSTANCE(s->bs->node_name), nbd_yank,
-                           bs);
+    yank_register_function(BLOCKDEV_YANK_INSTANCE(bs->node_name), nbd_yank, cs);
 
-    ret = nbd_handle_updated_info(s->bs, NULL);
-    if (ret < 0) {
-        /*
-         * We have connected, but must fail for other reasons.
-         * Send NBD_CMD_DISC as a courtesy to the server.
-         */
-        NBDRequest request = { .type = NBD_CMD_DISC };
+    if (cs == s->conns[0]) {
+        ret = nbd_handle_updated_info(s->bs, cs, NULL);
+        if (ret < 0) {
+            /*
+             * We have connected, but must fail for other reasons.
+             * Send NBD_CMD_DISC as a courtesy to the server.
+             */
+            NBDRequest request = { .type = NBD_CMD_DISC };
 
-        nbd_send_request(s->ioc, &request);
+            nbd_send_request(cs->ioc, &request);
 
-        yank_unregister_function(BLOCKDEV_YANK_INSTANCE(s->bs->node_name),
-                                 nbd_yank, bs);
-        object_unref(OBJECT(s->ioc));
-        s->ioc = NULL;
+            yank_unregister_function(BLOCKDEV_YANK_INSTANCE(s->bs->node_name),
+                                     nbd_yank, cs);
+            object_unref(OBJECT(cs->ioc));
+            cs->ioc = NULL;
 
-        return ret;
+            return ret;
+        }
     }
 
-    qio_channel_set_blocking(s->ioc, false, NULL);
-    qio_channel_attach_aio_context(s->ioc, bdrv_get_aio_context(bs));
+    qio_channel_set_blocking(cs->ioc, false, NULL);
+    qio_channel_attach_aio_context(cs->ioc, bdrv_get_aio_context(bs));
 
     /* successfully connected */
-    WITH_QEMU_LOCK_GUARD(&s->requests_lock) {
-        s->state = NBD_CLIENT_CONNECTED;
+    WITH_QEMU_LOCK_GUARD(&cs->requests_lock) {
+        cs->state = NBD_CLIENT_CONNECTED;
     }
 
     return 0;
 }
 
 /* Called with s->requests_lock held.  */
-static bool nbd_client_connecting(BDRVNBDState *s)
+static bool nbd_client_connecting(NBDConnState *cs)
 {
-    return s->state == NBD_CLIENT_CONNECTING_WAIT ||
-        s->state == NBD_CLIENT_CONNECTING_NOWAIT;
+    return cs->state == NBD_CLIENT_CONNECTING_WAIT ||
+        cs->state == NBD_CLIENT_CONNECTING_NOWAIT;
 }
 
 /* Called with s->requests_lock taken.  */
-static coroutine_fn void nbd_reconnect_attempt(BDRVNBDState *s)
+static coroutine_fn void nbd_reconnect_attempt(NBDConnState *cs)
 {
     int ret;
-    bool blocking = s->state == NBD_CLIENT_CONNECTING_WAIT;
+    bool blocking = cs->state == NBD_CLIENT_CONNECTING_WAIT;
 
     /*
      * Now we are sure that nobody is accessing the channel, and no one will
      * try until we set the state to CONNECTED.
      */
-    assert(nbd_client_connecting(s));
-    assert(s->in_flight == 1);
+    assert(nbd_client_connecting(cs));
+    assert(cs->in_flight == 1);
 
-    trace_nbd_reconnect_attempt(s->bs->in_flight);
+    trace_nbd_reconnect_attempt(cs->s->bs->in_flight);
 
-    if (blocking && !s->reconnect_delay_timer) {
+    if (blocking && !cs->reconnect_delay_timer) {
         /*
          * It's the first reconnect attempt after switching to
          * NBD_CLIENT_CONNECTING_WAIT
          */
-        g_assert(s->reconnect_delay);
-        reconnect_delay_timer_init(s,
+        g_assert(cs->s->reconnect_delay);
+        reconnect_delay_timer_init(cs,
             qemu_clock_get_ns(QEMU_CLOCK_REALTIME) +
-            s->reconnect_delay * NANOSECONDS_PER_SECOND);
+            cs->s->reconnect_delay * NANOSECONDS_PER_SECOND);
     }
 
     /* Finalize previous connection if any */
-    if (s->ioc) {
-        qio_channel_detach_aio_context(QIO_CHANNEL(s->ioc));
-        yank_unregister_function(BLOCKDEV_YANK_INSTANCE(s->bs->node_name),
-                                 nbd_yank, s->bs);
-        object_unref(OBJECT(s->ioc));
-        s->ioc = NULL;
+    if (cs->ioc) {
+        qio_channel_detach_aio_context(QIO_CHANNEL(cs->ioc));
+        yank_unregister_function(BLOCKDEV_YANK_INSTANCE(cs->s->bs->node_name),
+                                 nbd_yank, cs);
+        object_unref(OBJECT(cs->ioc));
+        cs->ioc = NULL;
     }
 
-    qemu_mutex_unlock(&s->requests_lock);
-    ret = nbd_co_do_establish_connection(s->bs, blocking, NULL);
-    trace_nbd_reconnect_attempt_result(ret, s->bs->in_flight);
-    qemu_mutex_lock(&s->requests_lock);
+    qemu_mutex_unlock(&cs->requests_lock);
+    ret = nbd_co_do_establish_connection(cs->s->bs, cs, blocking, NULL);
+    trace_nbd_reconnect_attempt_result(ret, cs->s->bs->in_flight);
+    qemu_mutex_lock(&cs->requests_lock);
 
     /*
      * The reconnect attempt is done (maybe successfully, maybe not), so
      * we no longer need this timer.  Delete it so it will not outlive
      * this I/O request (so draining removes all timers).
      */
-    reconnect_delay_timer_del(s);
+    reconnect_delay_timer_del(cs);
 }
 
-static coroutine_fn int nbd_receive_replies(BDRVNBDState *s, uint64_t handle)
+static coroutine_fn int nbd_receive_replies(NBDConnState *cs, uint64_t handle)
 {
     int ret;
-    uint64_t ind = HANDLE_TO_INDEX(s, handle), ind2;
-    QEMU_LOCK_GUARD(&s->receive_mutex);
+    uint64_t ind = HANDLE_TO_INDEX(cs, handle), ind2;
+    QEMU_LOCK_GUARD(&cs->receive_mutex);
 
     while (true) {
-        if (s->reply.handle == handle) {
+        if (cs->reply.handle == handle) {
             /* We are done */
             return 0;
         }
 
-        if (s->reply.handle != 0) {
+        if (cs->reply.handle != 0) {
             /*
              * Some other request is being handled now. It should already be
-             * woken by whoever set s->reply.handle (or never wait in this
+             * woken by whoever set cs->reply.handle (or never wait in this
              * yield). So, we should not wake it here.
              */
-            ind2 = HANDLE_TO_INDEX(s, s->reply.handle);
-            assert(!s->requests[ind2].receiving);
+            ind2 = HANDLE_TO_INDEX(cs, cs->reply.handle);
+            assert(!cs->requests[ind2].receiving);
 
-            s->requests[ind].receiving = true;
-            qemu_co_mutex_unlock(&s->receive_mutex);
+            cs->requests[ind].receiving = true;
+            qemu_co_mutex_unlock(&cs->receive_mutex);
 
             qemu_coroutine_yield();
             /*
@@ -448,105 +465,105 @@ static coroutine_fn int nbd_receive_replies(BDRVNBDState *s, uint64_t handle)
              * 1. From this function, executing in parallel coroutine, when our
              *    handle is received.
              * 2. From nbd_co_receive_one_chunk(), when previous request is
-             *    finished and s->reply.handle set to 0.
+             *    finished and cs->reply.handle set to 0.
              * Anyway, it's OK to lock the mutex and go to the next iteration.
              */
 
-            qemu_co_mutex_lock(&s->receive_mutex);
-            assert(!s->requests[ind].receiving);
+            qemu_co_mutex_lock(&cs->receive_mutex);
+            assert(!cs->requests[ind].receiving);
             continue;
         }
 
         /* We are under mutex and handle is 0. We have to do the dirty work. */
-        assert(s->reply.handle == 0);
-        ret = nbd_receive_reply(s->bs, s->ioc, &s->reply, NULL);
+        assert(cs->reply.handle == 0);
+        ret = nbd_receive_reply(cs->s->bs, cs->ioc, &cs->reply, NULL);
         if (ret <= 0) {
             ret = ret ? ret : -EIO;
-            nbd_channel_error(s, ret);
+            nbd_channel_error(cs, ret);
             return ret;
         }
-        if (nbd_reply_is_structured(&s->reply) && !s->info.structured_reply) {
-            nbd_channel_error(s, -EINVAL);
+        if (nbd_reply_is_structured(&cs->reply) &&
+            !cs->info.structured_reply) {
+            nbd_channel_error(cs, -EINVAL);
             return -EINVAL;
         }
-        ind2 = HANDLE_TO_INDEX(s, s->reply.handle);
-        if (ind2 >= MAX_NBD_REQUESTS || !s->requests[ind2].coroutine) {
-            nbd_channel_error(s, -EINVAL);
+        ind2 = HANDLE_TO_INDEX(cs, cs->reply.handle);
+        if (ind2 >= MAX_NBD_REQUESTS || !cs->requests[ind2].coroutine) {
+            nbd_channel_error(cs, -EINVAL);
             return -EINVAL;
         }
-        if (s->reply.handle == handle) {
+        if (cs->reply.handle == handle) {
             /* We are done */
             return 0;
         }
-        nbd_recv_coroutine_wake_one(&s->requests[ind2]);
+        nbd_recv_coroutine_wake_one(&cs->requests[ind2]);
     }
 }
 
-static int coroutine_fn nbd_co_send_request(BlockDriverState *bs,
+static int coroutine_fn nbd_co_send_request(NBDConnState *cs,
                                             NBDRequest *request,
                                             QEMUIOVector *qiov)
 {
-    BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
     int rc, i = -1;
 
-    qemu_mutex_lock(&s->requests_lock);
-    while (s->in_flight == MAX_NBD_REQUESTS ||
-           (s->state != NBD_CLIENT_CONNECTED && s->in_flight > 0)) {
-        qemu_co_queue_wait(&s->free_sema, &s->requests_lock);
+    qemu_mutex_lock(&cs->requests_lock);
+    while (cs->in_flight == MAX_NBD_REQUESTS ||
+           (cs->state != NBD_CLIENT_CONNECTED && cs->in_flight > 0)) {
+        qemu_co_queue_wait(&cs->free_sema, &cs->requests_lock);
     }
 
-    s->in_flight++;
-    if (s->state != NBD_CLIENT_CONNECTED) {
-        if (nbd_client_connecting(s)) {
-            nbd_reconnect_attempt(s);
-            qemu_co_queue_restart_all(&s->free_sema);
+    cs->in_flight++;
+    if (cs->state != NBD_CLIENT_CONNECTED) {
+        if (nbd_client_connecting(cs)) {
+            nbd_reconnect_attempt(cs);
+            qemu_co_queue_restart_all(&cs->free_sema);
         }
-        if (s->state != NBD_CLIENT_CONNECTED) {
+        if (cs->state != NBD_CLIENT_CONNECTED) {
             rc = -EIO;
             goto err;
         }
     }
 
     for (i = 0; i < MAX_NBD_REQUESTS; i++) {
-        if (s->requests[i].coroutine == NULL) {
+        if (cs->requests[i].coroutine == NULL) {
             break;
         }
     }
 
     assert(i < MAX_NBD_REQUESTS);
-    s->requests[i].coroutine = qemu_coroutine_self();
-    s->requests[i].offset = request->from;
-    s->requests[i].receiving = false;
-    qemu_mutex_unlock(&s->requests_lock);
+    cs->requests[i].coroutine = qemu_coroutine_self();
+    cs->requests[i].offset = request->from;
+    cs->requests[i].receiving = false;
+    qemu_mutex_unlock(&cs->requests_lock);
 
-    qemu_co_mutex_lock(&s->send_mutex);
-    request->handle = INDEX_TO_HANDLE(s, i);
+    qemu_co_mutex_lock(&cs->send_mutex);
+    request->handle = INDEX_TO_HANDLE(cs, i);
 
-    assert(s->ioc);
+    assert(cs->ioc);
 
     if (qiov) {
-        qio_channel_set_cork(s->ioc, true);
-        rc = nbd_send_request(s->ioc, request);
-        if (rc >= 0 && qio_channel_writev_all(s->ioc, qiov->iov, qiov->niov,
+        qio_channel_set_cork(cs->ioc, true);
+        rc = nbd_send_request(cs->ioc, request);
+        if (rc >= 0 && qio_channel_writev_all(cs->ioc, qiov->iov, qiov->niov,
                                               NULL) < 0) {
             rc = -EIO;
         }
-        qio_channel_set_cork(s->ioc, false);
+        qio_channel_set_cork(cs->ioc, false);
     } else {
-        rc = nbd_send_request(s->ioc, request);
+        rc = nbd_send_request(cs->ioc, request);
     }
-    qemu_co_mutex_unlock(&s->send_mutex);
+    qemu_co_mutex_unlock(&cs->send_mutex);
 
     if (rc < 0) {
-        qemu_mutex_lock(&s->requests_lock);
+        qemu_mutex_lock(&cs->requests_lock);
 err:
-        nbd_channel_error_locked(s, rc);
+        nbd_channel_error_locked(cs, rc);
         if (i != -1) {
-            s->requests[i].coroutine = NULL;
+            cs->requests[i].coroutine = NULL;
         }
-        s->in_flight--;
-        qemu_co_queue_next(&s->free_sema);
-        qemu_mutex_unlock(&s->requests_lock);
+        cs->in_flight--;
+        qemu_co_queue_next(&cs->free_sema);
+        qemu_mutex_unlock(&cs->requests_lock);
     }
     return rc;
 }
@@ -569,7 +586,7 @@ static inline uint64_t payload_advance64(uint8_t **payload)
     return ldq_be_p(*payload - 8);
 }
 
-static int nbd_parse_offset_hole_payload(BDRVNBDState *s,
+static int nbd_parse_offset_hole_payload(NBDConnState *cs,
                                          NBDStructuredReplyChunk *chunk,
                                          uint8_t *payload, uint64_t orig_offset,
                                          QEMUIOVector *qiov, Error **errp)
@@ -592,8 +609,8 @@ static int nbd_parse_offset_hole_payload(BDRVNBDState *s,
                          " region");
         return -EINVAL;
     }
-    if (s->info.min_block &&
-        !QEMU_IS_ALIGNED(hole_size, s->info.min_block)) {
+    if (cs->info.min_block &&
+        !QEMU_IS_ALIGNED(hole_size, cs->info.min_block)) {
         trace_nbd_structured_read_compliance("hole");
     }
 
@@ -607,7 +624,7 @@ static int nbd_parse_offset_hole_payload(BDRVNBDState *s,
  * Based on our request, we expect only one extent in reply, for the
  * base:allocation context.
  */
-static int nbd_parse_blockstatus_payload(BDRVNBDState *s,
+static int nbd_parse_blockstatus_payload(NBDConnState *cs,
                                          NBDStructuredReplyChunk *chunk,
                                          uint8_t *payload, uint64_t orig_length,
                                          NBDExtent *extent, Error **errp)
@@ -622,11 +639,11 @@ static int nbd_parse_blockstatus_payload(BDRVNBDState *s,
     }
 
     context_id = payload_advance32(&payload);
-    if (s->info.context_id != context_id) {
+    if (cs->info.context_id != context_id) {
         error_setg(errp, "Protocol error: unexpected context id %d for "
                          "NBD_REPLY_TYPE_BLOCK_STATUS, when negotiated context "
                          "id is %d", context_id,
-                         s->info.context_id);
+                         cs->info.context_id);
         return -EINVAL;
     }
 
@@ -651,14 +668,14 @@ static int nbd_parse_blockstatus_payload(BDRVNBDState *s,
      * up to the full block and change the status to fully-allocated
      * (always a safe status, even if it loses information).
      */
-    if (s->info.min_block && !QEMU_IS_ALIGNED(extent->length,
-                                                   s->info.min_block)) {
+    if (cs->info.min_block &&
+        !QEMU_IS_ALIGNED(extent->length, cs->info.min_block)) {
         trace_nbd_parse_blockstatus_compliance("extent length is unaligned");
-        if (extent->length > s->info.min_block) {
+        if (extent->length > cs->info.min_block) {
             extent->length = QEMU_ALIGN_DOWN(extent->length,
-                                             s->info.min_block);
+                                             cs->info.min_block);
         } else {
-            extent->length = s->info.min_block;
+            extent->length = cs->info.min_block;
             extent->flags = 0;
         }
     }
@@ -685,7 +702,7 @@ static int nbd_parse_blockstatus_payload(BDRVNBDState *s,
      * since nbd_client_co_block_status is only expecting the low two
      * bits to be set.
      */
-    if (s->alloc_depth && extent->flags > 2) {
+    if (cs->s->alloc_depth && extent->flags > 2) {
         extent->flags = 2;
     }
 
@@ -735,16 +752,16 @@ static int nbd_parse_error_payload(NBDStructuredReplyChunk *chunk,
 }
 
 static int coroutine_fn
-nbd_co_receive_offset_data_payload(BDRVNBDState *s, uint64_t orig_offset,
+nbd_co_receive_offset_data_payload(NBDConnState *cs, uint64_t orig_offset,
                                    QEMUIOVector *qiov, Error **errp)
 {
     QEMUIOVector sub_qiov;
     uint64_t offset;
     size_t data_size;
     int ret;
-    NBDStructuredReplyChunk *chunk = &s->reply.structured;
+    NBDStructuredReplyChunk *chunk = &cs->reply.structured;
 
-    assert(nbd_reply_is_structured(&s->reply));
+    assert(nbd_reply_is_structured(&cs->reply));
 
     /* The NBD spec requires at least one byte of payload */
     if (chunk->length <= sizeof(offset)) {
@@ -753,7 +770,7 @@ nbd_co_receive_offset_data_payload(BDRVNBDState *s, uint64_t orig_offset,
         return -EINVAL;
     }
 
-    if (nbd_read64(s->ioc, &offset, "OFFSET_DATA offset", errp) < 0) {
+    if (nbd_read64(cs->ioc, &offset, "OFFSET_DATA offset", errp) < 0) {
         return -EIO;
     }
 
@@ -765,13 +782,14 @@ nbd_co_receive_offset_data_payload(BDRVNBDState *s, uint64_t orig_offset,
                          " region");
         return -EINVAL;
     }
-    if (s->info.min_block && !QEMU_IS_ALIGNED(data_size, s->info.min_block)) {
+    if (cs->info.min_block &&
+        !QEMU_IS_ALIGNED(data_size, cs->info.min_block)) {
         trace_nbd_structured_read_compliance("data");
     }
 
     qemu_iovec_init(&sub_qiov, qiov->niov);
     qemu_iovec_concat(&sub_qiov, qiov, offset - orig_offset, data_size);
-    ret = qio_channel_readv_all(s->ioc, sub_qiov.iov, sub_qiov.niov, errp);
+    ret = qio_channel_readv_all(cs->ioc, sub_qiov.iov, sub_qiov.niov, errp);
     qemu_iovec_destroy(&sub_qiov);
 
     return ret < 0 ? -EIO : 0;
@@ -779,14 +797,14 @@ nbd_co_receive_offset_data_payload(BDRVNBDState *s, uint64_t orig_offset,
 
 #define NBD_MAX_MALLOC_PAYLOAD 1000
 static coroutine_fn int nbd_co_receive_structured_payload(
-        BDRVNBDState *s, void **payload, Error **errp)
+        NBDConnState *cs, void **payload, Error **errp)
 {
     int ret;
     uint32_t len;
 
-    assert(nbd_reply_is_structured(&s->reply));
+    assert(nbd_reply_is_structured(&cs->reply));
 
-    len = s->reply.structured.length;
+    len = cs->reply.structured.length;
 
     if (len == 0) {
         return 0;
@@ -803,7 +821,7 @@ static coroutine_fn int nbd_co_receive_structured_payload(
     }
 
     *payload = g_new(char, len);
-    ret = nbd_read(s->ioc, *payload, len, "structured payload", errp);
+    ret = nbd_read(cs->ioc, *payload, len, "structured payload", errp);
     if (ret < 0) {
         g_free(*payload);
         *payload = NULL;
@@ -829,11 +847,11 @@ static coroutine_fn int nbd_co_receive_structured_payload(
  * corresponding to the server's error reply), and errp is unchanged.
  */
 static coroutine_fn int nbd_co_do_receive_one_chunk(
-        BDRVNBDState *s, uint64_t handle, bool only_structured,
+        NBDConnState *cs, uint64_t handle, bool only_structured,
         int *request_ret, QEMUIOVector *qiov, void **payload, Error **errp)
 {
     int ret;
-    int i = HANDLE_TO_INDEX(s, handle);
+    int i = HANDLE_TO_INDEX(cs, handle);
     void *local_payload = NULL;
     NBDStructuredReplyChunk *chunk;
 
@@ -842,34 +860,34 @@ static coroutine_fn int nbd_co_do_receive_one_chunk(
     }
     *request_ret = 0;
 
-    ret = nbd_receive_replies(s, handle);
+    ret = nbd_receive_replies(cs, handle);
     if (ret < 0) {
         error_setg(errp, "Connection closed");
         return -EIO;
     }
-    assert(s->ioc);
+    assert(cs->ioc);
 
-    assert(s->reply.handle == handle);
+    assert(cs->reply.handle == handle);
 
-    if (nbd_reply_is_simple(&s->reply)) {
+    if (nbd_reply_is_simple(&cs->reply)) {
         if (only_structured) {
             error_setg(errp, "Protocol error: simple reply when structured "
                              "reply chunk was expected");
             return -EINVAL;
         }
 
-        *request_ret = -nbd_errno_to_system_errno(s->reply.simple.error);
+        *request_ret = -nbd_errno_to_system_errno(cs->reply.simple.error);
         if (*request_ret < 0 || !qiov) {
             return 0;
         }
 
-        return qio_channel_readv_all(s->ioc, qiov->iov, qiov->niov,
+        return qio_channel_readv_all(cs->ioc, qiov->iov, qiov->niov,
                                      errp) < 0 ? -EIO : 0;
     }
 
     /* handle structured reply chunk */
-    assert(s->info.structured_reply);
-    chunk = &s->reply.structured;
+    assert(cs->info.structured_reply);
+    chunk = &cs->reply.structured;
 
     if (chunk->type == NBD_REPLY_TYPE_NONE) {
         if (!(chunk->flags & NBD_REPLY_FLAG_DONE)) {
@@ -891,7 +909,7 @@ static coroutine_fn int nbd_co_do_receive_one_chunk(
             return -EINVAL;
         }
 
-        return nbd_co_receive_offset_data_payload(s, s->requests[i].offset,
+        return nbd_co_receive_offset_data_payload(cs, cs->requests[i].offset,
                                                   qiov, errp);
     }
 
@@ -899,7 +917,7 @@ static coroutine_fn int nbd_co_do_receive_one_chunk(
         payload = &local_payload;
     }
 
-    ret = nbd_co_receive_structured_payload(s, payload, errp);
+    ret = nbd_co_receive_structured_payload(cs, payload, errp);
     if (ret < 0) {
         return ret;
     }
@@ -919,23 +937,23 @@ static coroutine_fn int nbd_co_do_receive_one_chunk(
  * Return value is a fatal error code or normal nbd reply error code
  */
 static coroutine_fn int nbd_co_receive_one_chunk(
-        BDRVNBDState *s, uint64_t handle, bool only_structured,
+        NBDConnState *cs, uint64_t handle, bool only_structured,
         int *request_ret, QEMUIOVector *qiov, NBDReply *reply, void **payload,
         Error **errp)
 {
-    int ret = nbd_co_do_receive_one_chunk(s, handle, only_structured,
+    int ret = nbd_co_do_receive_one_chunk(cs, handle, only_structured,
                                           request_ret, qiov, payload, errp);
 
     if (ret < 0) {
         memset(reply, 0, sizeof(*reply));
-        nbd_channel_error(s, ret);
+        nbd_channel_error(cs, ret);
     } else {
         /* For assert at loop start in nbd_connection_entry */
-        *reply = s->reply;
+        *reply = cs->reply;
     }
-    s->reply.handle = 0;
+    cs->reply.handle = 0;
 
-    nbd_recv_coroutines_wake(s);
+    nbd_recv_coroutines_wake(cs);
 
     return ret;
 }
@@ -976,16 +994,16 @@ static void nbd_iter_request_error(NBDReplyChunkIter *iter, int ret)
  * NBD_FOREACH_REPLY_CHUNK
  * The pointer stored in @payload requires g_free() to free it.
  */
-#define NBD_FOREACH_REPLY_CHUNK(s, iter, handle, structured, \
+#define NBD_FOREACH_REPLY_CHUNK(cs, iter, handle, structured, \
                                 qiov, reply, payload) \
     for (iter = (NBDReplyChunkIter) { .only_structured = structured }; \
-         nbd_reply_chunk_iter_receive(s, &iter, handle, qiov, reply, payload);)
+         nbd_reply_chunk_iter_receive(cs, &iter, handle, qiov, reply, payload);)
 
 /*
  * nbd_reply_chunk_iter_receive
  * The pointer stored in @payload requires g_free() to free it.
  */
-static bool coroutine_fn nbd_reply_chunk_iter_receive(BDRVNBDState *s,
+static bool coroutine_fn nbd_reply_chunk_iter_receive(NBDConnState *cs,
                                                       NBDReplyChunkIter *iter,
                                                       uint64_t handle,
                                                       QEMUIOVector *qiov,
@@ -1006,7 +1024,7 @@ static bool coroutine_fn nbd_reply_chunk_iter_receive(BDRVNBDState *s,
         reply = &local_reply;
     }
 
-    ret = nbd_co_receive_one_chunk(s, handle, iter->only_structured,
+    ret = nbd_co_receive_one_chunk(cs, handle, iter->only_structured,
                                    &request_ret, qiov, reply, payload,
                                    &local_err);
     if (ret < 0) {
@@ -1038,21 +1056,21 @@ static bool coroutine_fn nbd_reply_chunk_iter_receive(BDRVNBDState *s,
     return true;
 
 break_loop:
-    qemu_mutex_lock(&s->requests_lock);
-    s->requests[HANDLE_TO_INDEX(s, handle)].coroutine = NULL;
-    s->in_flight--;
-    qemu_co_queue_next(&s->free_sema);
-    qemu_mutex_unlock(&s->requests_lock);
+    qemu_mutex_lock(&cs->requests_lock);
+    cs->requests[HANDLE_TO_INDEX(cs, handle)].coroutine = NULL;
+    cs->in_flight--;
+    qemu_co_queue_next(&cs->free_sema);
+    qemu_mutex_unlock(&cs->requests_lock);
 
     return false;
 }
 
-static int coroutine_fn nbd_co_receive_return_code(BDRVNBDState *s, uint64_t handle,
+static int coroutine_fn nbd_co_receive_return_code(NBDConnState *cs, uint64_t handle,
                                                    int *request_ret, Error **errp)
 {
     NBDReplyChunkIter iter;
 
-    NBD_FOREACH_REPLY_CHUNK(s, iter, handle, false, NULL, NULL, NULL) {
+    NBD_FOREACH_REPLY_CHUNK(cs, iter, handle, false, NULL, NULL, NULL) {
         /* nbd_reply_chunk_iter_receive does all the work */
     }
 
@@ -1061,7 +1079,7 @@ static int coroutine_fn nbd_co_receive_return_code(BDRVNBDState *s, uint64_t han
     return iter.ret;
 }
 
-static int coroutine_fn nbd_co_receive_cmdread_reply(BDRVNBDState *s, uint64_t handle,
+static int coroutine_fn nbd_co_receive_cmdread_reply(NBDConnState *cs, uint64_t handle,
                                                      uint64_t offset, QEMUIOVector *qiov,
                                                      int *request_ret, Error **errp)
 {
@@ -1070,7 +1088,7 @@ static int coroutine_fn nbd_co_receive_cmdread_reply(BDRVNBDState *s, uint64_t h
     void *payload = NULL;
     Error *local_err = NULL;
 
-    NBD_FOREACH_REPLY_CHUNK(s, iter, handle, s->info.structured_reply,
+    NBD_FOREACH_REPLY_CHUNK(cs, iter, handle, cs->info.structured_reply,
                             qiov, &reply, &payload)
     {
         int ret;
@@ -1086,17 +1104,17 @@ static int coroutine_fn nbd_co_receive_cmdread_reply(BDRVNBDState *s, uint64_t h
              */
             break;
         case NBD_REPLY_TYPE_OFFSET_HOLE:
-            ret = nbd_parse_offset_hole_payload(s, &reply.structured, payload,
+            ret = nbd_parse_offset_hole_payload(cs, &reply.structured, payload,
                                                 offset, qiov, &local_err);
             if (ret < 0) {
-                nbd_channel_error(s, ret);
+                nbd_channel_error(cs, ret);
                 nbd_iter_channel_error(&iter, ret, &local_err);
             }
             break;
         default:
             if (!nbd_reply_type_is_error(chunk->type)) {
                 /* not allowed reply type */
-                nbd_channel_error(s, -EINVAL);
+                nbd_channel_error(cs, -EINVAL);
                 error_setg(&local_err,
                            "Unexpected reply type: %d (%s) for CMD_READ",
                            chunk->type, nbd_reply_type_lookup(chunk->type));
@@ -1113,7 +1131,7 @@ static int coroutine_fn nbd_co_receive_cmdread_reply(BDRVNBDState *s, uint64_t h
     return iter.ret;
 }
 
-static int coroutine_fn nbd_co_receive_blockstatus_reply(BDRVNBDState *s,
+static int coroutine_fn nbd_co_receive_blockstatus_reply(NBDConnState *cs,
                                                          uint64_t handle, uint64_t length,
                                                          NBDExtent *extent,
                                                          int *request_ret, Error **errp)
@@ -1125,7 +1143,7 @@ static int coroutine_fn nbd_co_receive_blockstatus_reply(BDRVNBDState *s,
     bool received = false;
 
     assert(!extent->length);
-    NBD_FOREACH_REPLY_CHUNK(s, iter, handle, false, NULL, &reply, &payload) {
+    NBD_FOREACH_REPLY_CHUNK(cs, iter, handle, false, NULL, &reply, &payload) {
         int ret;
         NBDStructuredReplyChunk *chunk = &reply.structured;
 
@@ -1134,23 +1152,23 @@ static int coroutine_fn nbd_co_receive_blockstatus_reply(BDRVNBDState *s,
         switch (chunk->type) {
         case NBD_REPLY_TYPE_BLOCK_STATUS:
             if (received) {
-                nbd_channel_error(s, -EINVAL);
+                nbd_channel_error(cs, -EINVAL);
                 error_setg(&local_err, "Several BLOCK_STATUS chunks in reply");
                 nbd_iter_channel_error(&iter, -EINVAL, &local_err);
             }
             received = true;
 
-            ret = nbd_parse_blockstatus_payload(s, &reply.structured,
+            ret = nbd_parse_blockstatus_payload(cs, &reply.structured,
                                                 payload, length, extent,
                                                 &local_err);
             if (ret < 0) {
-                nbd_channel_error(s, ret);
+                nbd_channel_error(cs, ret);
                 nbd_iter_channel_error(&iter, ret, &local_err);
             }
             break;
         default:
             if (!nbd_reply_type_is_error(chunk->type)) {
-                nbd_channel_error(s, -EINVAL);
+                nbd_channel_error(cs, -EINVAL);
                 error_setg(&local_err,
                            "Unexpected reply type: %d (%s) "
                            "for CMD_BLOCK_STATUS",
@@ -1173,12 +1191,11 @@ static int coroutine_fn nbd_co_receive_blockstatus_reply(BDRVNBDState *s,
     return iter.ret;
 }
 
-static int coroutine_fn nbd_co_request(BlockDriverState *bs, NBDRequest *request,
+static int coroutine_fn nbd_co_request(NBDConnState *cs, NBDRequest *request,
                                        QEMUIOVector *write_qiov)
 {
     int ret, request_ret;
     Error *local_err = NULL;
-    BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
 
     assert(request->type != NBD_CMD_READ);
     if (write_qiov) {
@@ -1189,12 +1206,12 @@ static int coroutine_fn nbd_co_request(BlockDriverState *bs, NBDRequest *request
     }
 
     do {
-        ret = nbd_co_send_request(bs, request, write_qiov);
+        ret = nbd_co_send_request(cs, request, write_qiov);
         if (ret < 0) {
             continue;
         }
 
-        ret = nbd_co_receive_return_code(s, request->handle,
+        ret = nbd_co_receive_return_code(cs, request->handle,
                                          &request_ret, &local_err);
         if (local_err) {
             trace_nbd_co_request_fail(request->from, request->len,
@@ -1205,7 +1222,7 @@ static int coroutine_fn nbd_co_request(BlockDriverState *bs, NBDRequest *request
             error_free(local_err);
             local_err = NULL;
         }
-    } while (ret < 0 && nbd_client_will_reconnect(s));
+    } while (ret < 0 && nbd_client_will_reconnect(cs));
 
     return ret ? ret : request_ret;
 }
@@ -1222,6 +1239,7 @@ static int coroutine_fn nbd_client_co_preadv(BlockDriverState *bs, int64_t offse
         .from = offset,
         .len = bytes,
     };
+    NBDConnState * const cs = s->conns[0];
 
     assert(bytes <= NBD_MAX_BUFFER_SIZE);
 
@@ -1234,13 +1252,13 @@ static int coroutine_fn nbd_client_co_preadv(BlockDriverState *bs, int64_t offse
      * advertised size because the block layer rounded size up, then
      * truncate the request to the server and tail-pad with zero.
      */
-    if (offset >= s->info.size) {
+    if (offset >= cs->info.size) {
         assert(bytes < BDRV_SECTOR_SIZE);
         qemu_iovec_memset(qiov, 0, 0, bytes);
         return 0;
     }
-    if (offset + bytes > s->info.size) {
-        uint64_t slop = offset + bytes - s->info.size;
+    if (offset + bytes > cs->info.size) {
+        uint64_t slop = offset + bytes - cs->info.size;
 
         assert(slop < BDRV_SECTOR_SIZE);
         qemu_iovec_memset(qiov, bytes - slop, 0, slop);
@@ -1248,12 +1266,12 @@ static int coroutine_fn nbd_client_co_preadv(BlockDriverState *bs, int64_t offse
     }
 
     do {
-        ret = nbd_co_send_request(bs, &request, NULL);
+        ret = nbd_co_send_request(cs, &request, NULL);
         if (ret < 0) {
             continue;
         }
 
-        ret = nbd_co_receive_cmdread_reply(s, request.handle, offset, qiov,
+        ret = nbd_co_receive_cmdread_reply(cs, request.handle, offset, qiov,
                                            &request_ret, &local_err);
         if (local_err) {
             trace_nbd_co_request_fail(request.from, request.len, request.handle,
@@ -1263,7 +1281,7 @@ static int coroutine_fn nbd_client_co_preadv(BlockDriverState *bs, int64_t offse
             error_free(local_err);
             local_err = NULL;
         }
-    } while (ret < 0 && nbd_client_will_reconnect(s));
+    } while (ret < 0 && nbd_client_will_reconnect(cs));
 
     return ret ? ret : request_ret;
 }
@@ -1278,10 +1296,11 @@ static int coroutine_fn nbd_client_co_pwritev(BlockDriverState *bs, int64_t offs
         .from = offset,
         .len = bytes,
     };
+    NBDConnState * const cs = s->conns[0];
 
-    assert(!(s->info.flags & NBD_FLAG_READ_ONLY));
+    assert(!(cs->info.flags & NBD_FLAG_READ_ONLY));
     if (flags & BDRV_REQ_FUA) {
-        assert(s->info.flags & NBD_FLAG_SEND_FUA);
+        assert(cs->info.flags & NBD_FLAG_SEND_FUA);
         request.flags |= NBD_CMD_FLAG_FUA;
     }
 
@@ -1290,7 +1309,7 @@ static int coroutine_fn nbd_client_co_pwritev(BlockDriverState *bs, int64_t offs
     if (!bytes) {
         return 0;
     }
-    return nbd_co_request(bs, &request, qiov);
+    return nbd_co_request(cs, &request, qiov);
 }
 
 static int coroutine_fn nbd_client_co_pwrite_zeroes(BlockDriverState *bs, int64_t offset,
@@ -1302,45 +1321,47 @@ static int coroutine_fn nbd_client_co_pwrite_zeroes(BlockDriverState *bs, int64_
         .from = offset,
         .len = bytes,  /* .len is uint32_t actually */
     };
+    NBDConnState * const cs = s->conns[0];
 
     assert(bytes <= UINT32_MAX); /* rely on max_pwrite_zeroes */
 
-    assert(!(s->info.flags & NBD_FLAG_READ_ONLY));
-    if (!(s->info.flags & NBD_FLAG_SEND_WRITE_ZEROES)) {
+    assert(!(cs->info.flags & NBD_FLAG_READ_ONLY));
+    if (!(cs->info.flags & NBD_FLAG_SEND_WRITE_ZEROES)) {
         return -ENOTSUP;
     }
 
     if (flags & BDRV_REQ_FUA) {
-        assert(s->info.flags & NBD_FLAG_SEND_FUA);
+        assert(cs->info.flags & NBD_FLAG_SEND_FUA);
         request.flags |= NBD_CMD_FLAG_FUA;
     }
     if (!(flags & BDRV_REQ_MAY_UNMAP)) {
         request.flags |= NBD_CMD_FLAG_NO_HOLE;
     }
     if (flags & BDRV_REQ_NO_FALLBACK) {
-        assert(s->info.flags & NBD_FLAG_SEND_FAST_ZERO);
+        assert(cs->info.flags & NBD_FLAG_SEND_FAST_ZERO);
         request.flags |= NBD_CMD_FLAG_FAST_ZERO;
     }
 
     if (!bytes) {
         return 0;
     }
-    return nbd_co_request(bs, &request, NULL);
+    return nbd_co_request(cs, &request, NULL);
 }
 
 static int coroutine_fn nbd_client_co_flush(BlockDriverState *bs)
 {
     BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
     NBDRequest request = { .type = NBD_CMD_FLUSH };
+    NBDConnState * const cs = s->conns[0];
 
-    if (!(s->info.flags & NBD_FLAG_SEND_FLUSH)) {
+    if (!(cs->info.flags & NBD_FLAG_SEND_FLUSH)) {
         return 0;
     }
 
     request.from = 0;
     request.len = 0;
 
-    return nbd_co_request(bs, &request, NULL);
+    return nbd_co_request(cs, &request, NULL);
 }
 
 static int coroutine_fn nbd_client_co_pdiscard(BlockDriverState *bs, int64_t offset,
@@ -1352,15 +1373,16 @@ static int coroutine_fn nbd_client_co_pdiscard(BlockDriverState *bs, int64_t off
         .from = offset,
         .len = bytes, /* len is uint32_t */
     };
+    NBDConnState * const cs = s->conns[0];
 
     assert(bytes <= UINT32_MAX); /* rely on max_pdiscard */
 
-    assert(!(s->info.flags & NBD_FLAG_READ_ONLY));
-    if (!(s->info.flags & NBD_FLAG_SEND_TRIM) || !bytes) {
+    assert(!(cs->info.flags & NBD_FLAG_READ_ONLY));
+    if (!(cs->info.flags & NBD_FLAG_SEND_TRIM) || !bytes) {
         return 0;
     }
 
-    return nbd_co_request(bs, &request, NULL);
+    return nbd_co_request(cs, &request, NULL);
 }
 
 static int coroutine_fn nbd_client_co_block_status(
@@ -1371,16 +1393,17 @@ static int coroutine_fn nbd_client_co_block_status(
     NBDExtent extent = { 0 };
     BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
     Error *local_err = NULL;
+    NBDConnState * const cs = s->conns[0];
 
     NBDRequest request = {
         .type = NBD_CMD_BLOCK_STATUS,
         .from = offset,
         .len = MIN(QEMU_ALIGN_DOWN(INT_MAX, bs->bl.request_alignment),
-                   MIN(bytes, s->info.size - offset)),
+                   MIN(bytes, cs->info.size - offset)),
         .flags = NBD_CMD_FLAG_REQ_ONE,
     };
 
-    if (!s->info.base_allocation) {
+    if (!cs->info.base_allocation) {
         *pnum = bytes;
         *map = offset;
         *file = bs;
@@ -1394,23 +1417,23 @@ static int coroutine_fn nbd_client_co_block_status(
      * up, we truncated the request to the server (above), or are
      * called on just the hole.
      */
-    if (offset >= s->info.size) {
+    if (offset >= cs->info.size) {
         *pnum = bytes;
         assert(bytes < BDRV_SECTOR_SIZE);
         /* Intentionally don't report offset_valid for the hole */
         return BDRV_BLOCK_ZERO;
     }
 
-    if (s->info.min_block) {
-        assert(QEMU_IS_ALIGNED(request.len, s->info.min_block));
+    if (cs->info.min_block) {
+        assert(QEMU_IS_ALIGNED(request.len, cs->info.min_block));
     }
     do {
-        ret = nbd_co_send_request(bs, &request, NULL);
+        ret = nbd_co_send_request(cs, &request, NULL);
         if (ret < 0) {
             continue;
         }
 
-        ret = nbd_co_receive_blockstatus_reply(s, request.handle, bytes,
+        ret = nbd_co_receive_blockstatus_reply(cs, request.handle, bytes,
                                                &extent, &request_ret,
                                                &local_err);
         if (local_err) {
@@ -1421,7 +1444,7 @@ static int coroutine_fn nbd_client_co_block_status(
             error_free(local_err);
             local_err = NULL;
         }
-    } while (ret < 0 && nbd_client_will_reconnect(s));
+    } while (ret < 0 && nbd_client_will_reconnect(cs));
 
     if (ret < 0 || request_ret < 0) {
         return ret ? ret : request_ret;
@@ -1440,8 +1463,10 @@ static int nbd_client_reopen_prepare(BDRVReopenState *state,
                                      BlockReopenQueue *queue, Error **errp)
 {
     BDRVNBDState *s = (BDRVNBDState *)state->bs->opaque;
+    NBDConnState * const cs = s->conns[0];
 
-    if ((state->flags & BDRV_O_RDWR) && (s->info.flags & NBD_FLAG_READ_ONLY)) {
+    if ((state->flags & BDRV_O_RDWR) &&
+        (cs->info.flags & NBD_FLAG_READ_ONLY)) {
         error_setg(errp, "Can't reopen read-only NBD mount as read/write");
         return -EACCES;
     }
@@ -1450,24 +1475,23 @@ static int nbd_client_reopen_prepare(BDRVReopenState *state,
 
 static void nbd_yank(void *opaque)
 {
-    BlockDriverState *bs = opaque;
-    BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
+    NBDConnState *cs = opaque;
 
-    QEMU_LOCK_GUARD(&s->requests_lock);
-    qio_channel_shutdown(QIO_CHANNEL(s->ioc), QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
-    s->state = NBD_CLIENT_QUIT;
+    QEMU_LOCK_GUARD(&cs->requests_lock);
+    qio_channel_shutdown(QIO_CHANNEL(cs->ioc),
+                         QIO_CHANNEL_SHUTDOWN_BOTH, NULL);
+    cs->state = NBD_CLIENT_QUIT;
 }
 
-static void nbd_client_close(BlockDriverState *bs)
+static void nbd_client_close(NBDConnState *cs)
 {
-    BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
     NBDRequest request = { .type = NBD_CMD_DISC };
 
-    if (s->ioc) {
-        nbd_send_request(s->ioc, &request);
+    if (cs->ioc) {
+        nbd_send_request(cs->ioc, &request);
     }
 
-    nbd_teardown_connection(bs);
+    nbd_teardown_connection(cs);
 }
 
 
@@ -1888,10 +1912,13 @@ static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
     BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
 
     s->bs = bs;
-    qemu_mutex_init(&s->requests_lock);
-    qemu_co_queue_init(&s->free_sema);
-    qemu_co_mutex_init(&s->send_mutex);
-    qemu_co_mutex_init(&s->receive_mutex);
+
+    s->conns[0] = g_new0(NBDConnState, 1);
+    s->conns[0]->s = s;
+    qemu_mutex_init(&s->conns[0]->requests_lock);
+    qemu_co_queue_init(&s->conns[0]->free_sema);
+    qemu_co_mutex_init(&s->conns[0]->send_mutex);
+    qemu_co_mutex_init(&s->conns[0]->receive_mutex);
 
     if (!yank_register_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name), errp)) {
         return -EEXIST;
@@ -1902,18 +1929,19 @@ static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
         goto fail;
     }
 
-    s->conn = nbd_client_connection_new(s->saddr, true, s->export,
-                                        s->x_dirty_bitmap, s->tlscreds,
-                                        s->tlshostname);
+    s->conns[0]->conn =
+        nbd_client_connection_new(s->saddr, true, s->export,
+                                  s->x_dirty_bitmap, s->tlscreds,
+                                  s->tlshostname);
 
     if (s->open_timeout) {
-        nbd_client_connection_enable_retry(s->conn);
-        open_timer_init(s, qemu_clock_get_ns(QEMU_CLOCK_REALTIME) +
+        nbd_client_connection_enable_retry(s->conns[0]->conn);
+        open_timer_init(s->conns[0], qemu_clock_get_ns(QEMU_CLOCK_REALTIME) +
                         s->open_timeout * NANOSECONDS_PER_SECOND);
     }
 
-    s->state = NBD_CLIENT_CONNECTING_WAIT;
-    ret = nbd_do_establish_connection(bs, true, errp);
+    s->conns[0]->state = NBD_CLIENT_CONNECTING_WAIT;
+    ret = nbd_do_establish_connection(bs, s->conns[0], true, errp);
     if (ret < 0) {
         goto fail;
     }
@@ -1923,23 +1951,23 @@ static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
      * Delete it, because we do not want it to be around when this node
      * is drained or closed.
      */
-    open_timer_del(s);
+    open_timer_del(s->conns[0]);
 
-    nbd_client_connection_enable_retry(s->conn);
+    nbd_client_connection_enable_retry(s->conns[0]->conn);
 
     /*
      * We set s->multi_conn in nbd_process_options above, but now that
      * we have connected if the server doesn't advertise that it is
      * safe for multi-conn, force it to 1.
      */
-    if (!(s->info.flags & NBD_FLAG_CAN_MULTI_CONN)) {
+    if (!(s->conns[0]->info.flags & NBD_FLAG_CAN_MULTI_CONN)) {
         s->multi_conn = 1;
     }
 
     return 0;
 
 fail:
-    open_timer_del(s);
+    open_timer_del(s->conns[0]);
     nbd_clear_bdrvstate(bs);
     return ret;
 }
@@ -1952,8 +1980,9 @@ static int coroutine_fn nbd_co_flush(BlockDriverState *bs)
 static void nbd_refresh_limits(BlockDriverState *bs, Error **errp)
 {
     BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
-    uint32_t min = s->info.min_block;
-    uint32_t max = MIN_NON_ZERO(NBD_MAX_BUFFER_SIZE, s->info.max_block);
+    NBDConnState * const cs = s->conns[0];
+    uint32_t min = cs->info.min_block;
+    uint32_t max = MIN_NON_ZERO(NBD_MAX_BUFFER_SIZE, cs->info.max_block);
 
     /*
      * If the server did not advertise an alignment:
@@ -1968,8 +1997,8 @@ static void nbd_refresh_limits(BlockDriverState *bs, Error **errp)
      *   sub-sector requests
      */
     if (!min) {
-        min = (!QEMU_IS_ALIGNED(s->info.size, BDRV_SECTOR_SIZE) ||
-               s->info.base_allocation) ? 1 : BDRV_SECTOR_SIZE;
+        min = (!QEMU_IS_ALIGNED(cs->info.size, BDRV_SECTOR_SIZE) ||
+               cs->info.base_allocation) ? 1 : BDRV_SECTOR_SIZE;
     }
 
     bs->bl.request_alignment = min;
@@ -1977,15 +2006,17 @@ static void nbd_refresh_limits(BlockDriverState *bs, Error **errp)
     bs->bl.max_pwrite_zeroes = max;
     bs->bl.max_transfer = max;
 
-    if (s->info.opt_block &&
-        s->info.opt_block > bs->bl.opt_transfer) {
-        bs->bl.opt_transfer = s->info.opt_block;
+    if (cs->info.opt_block &&
+        cs->info.opt_block > bs->bl.opt_transfer) {
+        bs->bl.opt_transfer = cs->info.opt_block;
     }
 }
 
 static void nbd_close(BlockDriverState *bs)
 {
-    nbd_client_close(bs);
+    BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
+
+    nbd_client_close(s->conns[0]);
     nbd_clear_bdrvstate(bs);
 }
 
@@ -2002,13 +2033,14 @@ static int coroutine_fn nbd_co_truncate(BlockDriverState *bs, int64_t offset,
                                         BdrvRequestFlags flags, Error **errp)
 {
     BDRVNBDState *s = bs->opaque;
+    NBDConnState * const cs = s->conns[0];
 
-    if (offset != s->info.size && exact) {
+    if (offset != cs->info.size && exact) {
         error_setg(errp, "Cannot resize NBD nodes");
         return -ENOTSUP;
     }
 
-    if (offset > s->info.size) {
+    if (offset > cs->info.size) {
         error_setg(errp, "Cannot grow NBD nodes");
         return -EINVAL;
     }
@@ -2019,8 +2051,9 @@ static int coroutine_fn nbd_co_truncate(BlockDriverState *bs, int64_t offset,
 static int64_t coroutine_fn nbd_co_getlength(BlockDriverState *bs)
 {
     BDRVNBDState *s = bs->opaque;
+    NBDConnState * const cs = s->conns[0];
 
-    return s->info.size;
+    return cs->info.size;
 }
 
 static void nbd_refresh_filename(BlockDriverState *bs)
@@ -2083,25 +2116,27 @@ static const char *const nbd_strong_runtime_opts[] = {
 static void nbd_cancel_in_flight(BlockDriverState *bs)
 {
     BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
+    NBDConnState * const cs = s->conns[0];
 
-    reconnect_delay_timer_del(s);
+    reconnect_delay_timer_del(cs);
 
-    qemu_mutex_lock(&s->requests_lock);
-    if (s->state == NBD_CLIENT_CONNECTING_WAIT) {
-        s->state = NBD_CLIENT_CONNECTING_NOWAIT;
+    qemu_mutex_lock(&cs->requests_lock);
+    if (cs->state == NBD_CLIENT_CONNECTING_WAIT) {
+        cs->state = NBD_CLIENT_CONNECTING_NOWAIT;
     }
-    qemu_mutex_unlock(&s->requests_lock);
+    qemu_mutex_unlock(&cs->requests_lock);
 
-    nbd_co_establish_connection_cancel(s->conn);
+    nbd_co_establish_connection_cancel(cs->conn);
 }
 
 static void nbd_attach_aio_context(BlockDriverState *bs,
                                    AioContext *new_context)
 {
     BDRVNBDState *s = bs->opaque;
+    NBDConnState * const cs = s->conns[0];
 
     /* The open_timer is used only during nbd_open() */
-    assert(!s->open_timer);
+    assert(!cs->open_timer);
 
     /*
      * The reconnect_delay_timer is scheduled in I/O paths when the
@@ -2112,22 +2147,23 @@ static void nbd_attach_aio_context(BlockDriverState *bs,
      * Since the AioContext can only be changed when a node is drained,
      * the reconnect_delay_timer cannot be active here.
      */
-    assert(!s->reconnect_delay_timer);
+    assert(!cs->reconnect_delay_timer);
 
-    if (s->ioc) {
-        qio_channel_attach_aio_context(s->ioc, new_context);
+    if (cs->ioc) {
+        qio_channel_attach_aio_context(cs->ioc, new_context);
     }
 }
 
 static void nbd_detach_aio_context(BlockDriverState *bs)
 {
     BDRVNBDState *s = bs->opaque;
+    NBDConnState * const cs = s->conns[0];
 
-    assert(!s->open_timer);
-    assert(!s->reconnect_delay_timer);
+    assert(!cs->open_timer);
+    assert(!cs->reconnect_delay_timer);
 
-    if (s->ioc) {
-        qio_channel_detach_aio_context(s->ioc);
+    if (cs->ioc) {
+        qio_channel_detach_aio_context(cs->ioc);
     }
 }
 
-- 
2.39.2



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH nbd 3/4] nbd: Open multiple NBD connections if multi-conn is set
  2023-03-09 11:39 [PATCH nbd 0/4] Enable multi-conn NBD [for discussion only] Richard W.M. Jones
  2023-03-09 11:39 ` [PATCH nbd 1/4] nbd: Add multi-conn option Richard W.M. Jones
  2023-03-09 11:39 ` [PATCH nbd 2/4] nbd: Split out block device state from underlying NBD connections Richard W.M. Jones
@ 2023-03-09 11:39 ` Richard W.M. Jones
  2023-03-09 11:39 ` [PATCH nbd 4/4] nbd: Enable multi-conn using round-robin Richard W.M. Jones
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 11+ messages in thread
From: Richard W.M. Jones @ 2023-03-09 11:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, vsementsov, eblake, hreitz, kwolf

If the user opts in to multi-conn and the server advertises it, open
multiple NBD connections.  They are opened with the same parameters as
the initial connection.  Similarly, on the close path all of the
connections are closed.

However only the zeroth connection is used, so this commit does not
actually enable multi-conn capabilities.

(XXX) The strategy here is very naive.  Firstly, if you were going to
open them, you'd probably want to open them in parallel.  Secondly, it
would make sense to delay opening until multiple parallel requests are
seen (perhaps above some threshold), so that simple or short-lived NBD
operations do not require multiple connections to be made.
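
For illustration only, here is a very rough sketch of the threshold
idea.  It reuses init_conn_state() and conn_state_connect() from this
patch, but the conns_open and in_flight counters are assumptions that
this series does not add:

    /* Hypothetical: called from the I/O path when load increases. */
    static void maybe_open_extra_conn(BlockDriverState *bs, BDRVNBDState *s)
    {
        /* conns_open / in_flight are assumed counters, not added by this series. */
        if (s->conns_open >= s->multi_conn ||
            s->in_flight <= s->conns_open * 4 /* arbitrary threshold */) {
            return;
        }
        s->conns[s->conns_open] = init_conn_state(s);
        if (conn_state_connect(bs, s->conns[s->conns_open], NULL) == 0) {
            s->conns_open++;
        } else {
            g_free(s->conns[s->conns_open]);
            s->conns[s->conns_open] = NULL;
        }
    }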

Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
---
 block/nbd.c | 128 ++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 90 insertions(+), 38 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index 84e8a1add0..4c99c3f865 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -124,18 +124,23 @@ static void nbd_yank(void *opaque);
 static void nbd_clear_bdrvstate(BlockDriverState *bs)
 {
     BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
+    size_t i;
 
-    nbd_client_connection_release(s->conns[0]->conn);
-    s->conns[0]->conn = NULL;
+    for (i = 0; i < MAX_MULTI_CONN; ++i) {
+        if (s->conns[i]) {
+            nbd_client_connection_release(s->conns[i]->conn);
+            s->conns[i]->conn = NULL;
+
+            /* Must not leave timers behind that would access freed data */
+            assert(!s->conns[i]->reconnect_delay_timer);
+            assert(!s->conns[i]->open_timer);
+
+            g_free(s->conns[i]);
+        }
+    }
 
     yank_unregister_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name));
 
-    /* Must not leave timers behind that would access freed data */
-    assert(!s->conns[0]->reconnect_delay_timer);
-    assert(!s->conns[0]->open_timer);
-
-    g_free(s->conns[0]);
-
     object_unref(OBJECT(s->tlscreds));
     qapi_free_SocketAddress(s->saddr);
     s->saddr = NULL;
@@ -1905,43 +1910,39 @@ static int nbd_process_options(BlockDriverState *bs, QDict *options,
     return ret;
 }
 
-static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
-                    Error **errp)
+static NBDConnState *init_conn_state(BDRVNBDState *s)
 {
+    NBDConnState *cs;
+
+    cs = g_new0(NBDConnState, 1);
+    cs->s = s;
+    qemu_mutex_init(&cs->requests_lock);
+    qemu_co_queue_init(&cs->free_sema);
+    qemu_co_mutex_init(&cs->send_mutex);
+    qemu_co_mutex_init(&cs->receive_mutex);
+
+    return cs;
+}
+
+static int conn_state_connect(BlockDriverState *bs, NBDConnState *cs,
+                              Error **errp)
+{
+    BDRVNBDState *s = cs->s;
     int ret;
-    BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
 
-    s->bs = bs;
-
-    s->conns[0] = g_new0(NBDConnState, 1);
-    s->conns[0]->s = s;
-    qemu_mutex_init(&s->conns[0]->requests_lock);
-    qemu_co_queue_init(&s->conns[0]->free_sema);
-    qemu_co_mutex_init(&s->conns[0]->send_mutex);
-    qemu_co_mutex_init(&s->conns[0]->receive_mutex);
-
-    if (!yank_register_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name), errp)) {
-        return -EEXIST;
-    }
-
-    ret = nbd_process_options(bs, options, errp);
-    if (ret < 0) {
-        goto fail;
-    }
-
-    s->conns[0]->conn =
+    cs->conn =
         nbd_client_connection_new(s->saddr, true, s->export,
                                   s->x_dirty_bitmap, s->tlscreds,
                                   s->tlshostname);
 
     if (s->open_timeout) {
-        nbd_client_connection_enable_retry(s->conns[0]->conn);
-        open_timer_init(s->conns[0], qemu_clock_get_ns(QEMU_CLOCK_REALTIME) +
+        nbd_client_connection_enable_retry(cs->conn);
+        open_timer_init(cs, qemu_clock_get_ns(QEMU_CLOCK_REALTIME) +
                         s->open_timeout * NANOSECONDS_PER_SECOND);
     }
 
-    s->conns[0]->state = NBD_CLIENT_CONNECTING_WAIT;
-    ret = nbd_do_establish_connection(bs, s->conns[0], true, errp);
+    cs->state = NBD_CLIENT_CONNECTING_WAIT;
+    ret = nbd_do_establish_connection(bs, cs, true, errp);
     if (ret < 0) {
         goto fail;
     }
@@ -1951,9 +1952,44 @@ static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
      * Delete it, because we do not want it to be around when this node
      * is drained or closed.
      */
-    open_timer_del(s->conns[0]);
+    open_timer_del(cs);
 
-    nbd_client_connection_enable_retry(s->conns[0]->conn);
+    nbd_client_connection_enable_retry(cs->conn);
+
+    return 0;
+
+fail:
+    open_timer_del(cs);
+    return ret;
+}
+
+static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
+                    Error **errp)
+{
+    int ret;
+    BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
+    size_t i;
+
+    s->bs = bs;
+
+    if (!yank_register_instance(BLOCKDEV_YANK_INSTANCE(bs->node_name), errp)) {
+        return -EEXIST;
+    }
+
+    ret = nbd_process_options(bs, options, errp);
+    if (ret < 0) {
+        goto fail;
+    }
+
+    /*
+     * Open the first NBD connection.
+     */
+    s->conns[0] = init_conn_state(s);
+
+    ret = conn_state_connect(bs, s->conns[0], errp);
+    if (ret < 0) {
+        goto fail;
+    }
 
     /*
      * We set s->multi_conn in nbd_process_options above, but now that
@@ -1964,10 +2000,21 @@ static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
         s->multi_conn = 1;
     }
 
+    /*
+     * Open remaining multi-conn NBD connections (if any).
+     */
+    for (i = 1; i < s->multi_conn; ++i) {
+        s->conns[i] = init_conn_state(s);
+
+        ret = conn_state_connect(bs, s->conns[i], errp);
+        if (ret < 0) {
+            goto fail;
+        }
+    }
+
     return 0;
 
 fail:
-    open_timer_del(s->conns[0]);
     nbd_clear_bdrvstate(bs);
     return ret;
 }
@@ -2015,8 +2062,13 @@ static void nbd_refresh_limits(BlockDriverState *bs, Error **errp)
 static void nbd_close(BlockDriverState *bs)
 {
     BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
+    size_t i;
 
-    nbd_client_close(s->conns[0]);
+    for (i = 0; i < MAX_MULTI_CONN; ++i) {
+        if (s->conns[i]) {
+            nbd_client_close(s->conns[i]);
+        }
+    }
     nbd_clear_bdrvstate(bs);
 }
 
-- 
2.39.2



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH nbd 4/4] nbd: Enable multi-conn using round-robin
  2023-03-09 11:39 [PATCH nbd 0/4] Enable multi-conn NBD [for discussion only] Richard W.M. Jones
                   ` (2 preceding siblings ...)
  2023-03-09 11:39 ` [PATCH nbd 3/4] nbd: Open multiple NBD connections if multi-conn is set Richard W.M. Jones
@ 2023-03-09 11:39 ` Richard W.M. Jones
  2023-03-10 17:43 ` [PATCH nbd 0/4] Enable multi-conn NBD [for discussion only] Vladimir Sementsov-Ogievskiy
  2023-03-10 19:04 ` Eric Blake
  5 siblings, 0 replies; 11+ messages in thread
From: Richard W.M. Jones @ 2023-03-09 11:39 UTC (permalink / raw)
  To: qemu-devel; +Cc: qemu-block, vsementsov, eblake, hreitz, kwolf

Enable NBD multi-conn by spreading operations across multiple
connections.

(XXX) This uses a naive round-robin approach that could be improved.
For example, we could look at how many requests are in flight and
assign operations to the connection with the fewest.  Or we could try
to estimate the load on each connection based on the size of its
outstanding requests.  But this implementation doesn't do any of that.
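
For illustration only, a minimal sketch of the first idea, assuming a
per-connection in_flight counter of outstanding requests which this
series does not add:

    static NBDConnState *choose_least_loaded(BDRVNBDState *s)
    {
        NBDConnState *best = s->conns[0];
        size_t i;

        for (i = 1; i < s->multi_conn && s->conns[i]; ++i) {
            /* in_flight is an assumed counter of outstanding requests. */
            if (s->conns[i]->in_flight < best->in_flight) {
                best = s->conns[i];
            }
        }
        return best;
    }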

Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
---
 block/nbd.c | 67 +++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 49 insertions(+), 18 deletions(-)

diff --git a/block/nbd.c b/block/nbd.c
index 4c99c3f865..df32ba67ed 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -1232,6 +1232,26 @@ static int coroutine_fn nbd_co_request(NBDConnState *cs, NBDRequest *request,
     return ret ? ret : request_ret;
 }
 
+/*
+ * If multi-conn, choose a connection for this operation.
+ */
+static NBDConnState *choose_connection(BDRVNBDState *s)
+{
+    static size_t next;
+    size_t i;
+
+    if (s->multi_conn <= 1) {
+        return s->conns[0];
+    }
+
+    /* XXX Stupid simple round robin. */
+    i = qatomic_fetch_inc(&next);
+    i %= s->multi_conn;
+
+    assert(s->conns[i] != NULL);
+    return s->conns[i];
+}
+
 static int coroutine_fn nbd_client_co_preadv(BlockDriverState *bs, int64_t offset,
                                              int64_t bytes, QEMUIOVector *qiov,
                                              BdrvRequestFlags flags)
@@ -1244,7 +1264,7 @@ static int coroutine_fn nbd_client_co_preadv(BlockDriverState *bs, int64_t offse
         .from = offset,
         .len = bytes,
     };
-    NBDConnState * const cs = s->conns[0];
+    NBDConnState * const cs = choose_connection(s);
 
     assert(bytes <= NBD_MAX_BUFFER_SIZE);
 
@@ -1301,7 +1321,7 @@ static int coroutine_fn nbd_client_co_pwritev(BlockDriverState *bs, int64_t offs
         .from = offset,
         .len = bytes,
     };
-    NBDConnState * const cs = s->conns[0];
+    NBDConnState * const cs = choose_connection(s);
 
     assert(!(cs->info.flags & NBD_FLAG_READ_ONLY));
     if (flags & BDRV_REQ_FUA) {
@@ -1326,7 +1346,7 @@ static int coroutine_fn nbd_client_co_pwrite_zeroes(BlockDriverState *bs, int64_
         .from = offset,
         .len = bytes,  /* .len is uint32_t actually */
     };
-    NBDConnState * const cs = s->conns[0];
+    NBDConnState * const cs = choose_connection(s);
 
     assert(bytes <= UINT32_MAX); /* rely on max_pwrite_zeroes */
 
@@ -1357,7 +1377,13 @@ static int coroutine_fn nbd_client_co_flush(BlockDriverState *bs)
 {
     BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
     NBDRequest request = { .type = NBD_CMD_FLUSH };
-    NBDConnState * const cs = s->conns[0];
+
+    /*
+     * Multi-conn (if used) guarantees that flushing on any connection
+     * flushes caches on all connections, so we can perform this
+     * operation on any.
+     */
+    NBDConnState * const cs = choose_connection(s);
 
     if (!(cs->info.flags & NBD_FLAG_SEND_FLUSH)) {
         return 0;
@@ -1378,7 +1404,7 @@ static int coroutine_fn nbd_client_co_pdiscard(BlockDriverState *bs, int64_t off
         .from = offset,
         .len = bytes, /* len is uint32_t */
     };
-    NBDConnState * const cs = s->conns[0];
+    NBDConnState * const cs = choose_connection(s);
 
     assert(bytes <= UINT32_MAX); /* rely on max_pdiscard */
 
@@ -1398,7 +1424,7 @@ static int coroutine_fn nbd_client_co_block_status(
     NBDExtent extent = { 0 };
     BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
     Error *local_err = NULL;
-    NBDConnState * const cs = s->conns[0];
+    NBDConnState * const cs = choose_connection(s);
 
     NBDRequest request = {
         .type = NBD_CMD_BLOCK_STATUS,
@@ -2027,7 +2053,7 @@ static int coroutine_fn nbd_co_flush(BlockDriverState *bs)
 static void nbd_refresh_limits(BlockDriverState *bs, Error **errp)
 {
     BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
-    NBDConnState * const cs = s->conns[0];
+    NBDConnState * const cs = choose_connection(s);
     uint32_t min = cs->info.min_block;
     uint32_t max = MIN_NON_ZERO(NBD_MAX_BUFFER_SIZE, cs->info.max_block);
 
@@ -2085,7 +2111,7 @@ static int coroutine_fn nbd_co_truncate(BlockDriverState *bs, int64_t offset,
                                         BdrvRequestFlags flags, Error **errp)
 {
     BDRVNBDState *s = bs->opaque;
-    NBDConnState * const cs = s->conns[0];
+    NBDConnState * const cs = choose_connection(s);
 
     if (offset != cs->info.size && exact) {
         error_setg(errp, "Cannot resize NBD nodes");
@@ -2168,24 +2194,29 @@ static const char *const nbd_strong_runtime_opts[] = {
 static void nbd_cancel_in_flight(BlockDriverState *bs)
 {
     BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
-    NBDConnState * const cs = s->conns[0];
+    size_t i;
+    NBDConnState *cs;
 
-    reconnect_delay_timer_del(cs);
+    for (i = 0; i < MAX_MULTI_CONN; ++i) {
+        cs = s->conns[i];
 
-    qemu_mutex_lock(&cs->requests_lock);
-    if (cs->state == NBD_CLIENT_CONNECTING_WAIT) {
-        cs->state = NBD_CLIENT_CONNECTING_NOWAIT;
+        reconnect_delay_timer_del(cs);
+
+        qemu_mutex_lock(&cs->requests_lock);
+        if (cs->state == NBD_CLIENT_CONNECTING_WAIT) {
+            cs->state = NBD_CLIENT_CONNECTING_NOWAIT;
+        }
+        qemu_mutex_unlock(&cs->requests_lock);
+
+        nbd_co_establish_connection_cancel(cs->conn);
     }
-    qemu_mutex_unlock(&cs->requests_lock);
-
-    nbd_co_establish_connection_cancel(cs->conn);
 }
 
 static void nbd_attach_aio_context(BlockDriverState *bs,
                                    AioContext *new_context)
 {
     BDRVNBDState *s = bs->opaque;
-    NBDConnState * const cs = s->conns[0];
+    NBDConnState * const cs = choose_connection(s);
 
     /* The open_timer is used only during nbd_open() */
     assert(!cs->open_timer);
@@ -2209,7 +2240,7 @@ static void nbd_attach_aio_context(BlockDriverState *bs,
 static void nbd_detach_aio_context(BlockDriverState *bs)
 {
     BDRVNBDState *s = bs->opaque;
-    NBDConnState * const cs = s->conns[0];
+    NBDConnState * const cs = choose_connection(s);
 
     assert(!cs->open_timer);
     assert(!cs->reconnect_delay_timer);
-- 
2.39.2



^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH nbd 0/4] Enable multi-conn NBD [for discussion only]
  2023-03-09 11:39 [PATCH nbd 0/4] Enable multi-conn NBD [for discussion only] Richard W.M. Jones
                   ` (3 preceding siblings ...)
  2023-03-09 11:39 ` [PATCH nbd 4/4] nbd: Enable multi-conn using round-robin Richard W.M. Jones
@ 2023-03-10 17:43 ` Vladimir Sementsov-Ogievskiy
  2023-03-10 19:04 ` Eric Blake
  5 siblings, 0 replies; 11+ messages in thread
From: Vladimir Sementsov-Ogievskiy @ 2023-03-10 17:43 UTC (permalink / raw)
  To: Richard W.M. Jones, qemu-devel; +Cc: qemu-block, eblake, hreitz, kwolf

On 09.03.23 14:39, Richard W.M. Jones wrote:
> [ Patch series also available here, along with this cover letter and the
>    script used to generate test results:
>    https://gitlab.com/rwmjones/qemu/-/commits/2023-nbd-multi-conn-v1  ]
> 
> This patch series adds multi-conn support to the NBD block driver in
> qemu.  It is only meant for discussion and testing because it has a
> number of obvious shortcomings (see "XXX" in commit messages and
> code).  If we decided this was a good idea, we can work on a better
> patch.

I looked through the results and the code, and I think it's certainly a good idea!

We still need smarter integration with reconnect logic.

At least, we shouldn't create several open_timer instances.


Currently, on open() we have open-timeout.  That's just a time limit for the whole of nbd_open(); we can make several connection attempts within that time.

It seems we should proceed successfully if we managed at least one connection.  Postponing establishing the additional connections until after open() seems good too [*].


Next, we have reconnect-delay.  When a connection is lost, the NBD client tries to reconnect with no limit on the number of attempts, but after reconnect-delay seconds of reconnecting, all in-flight requests that are waiting for the connection are simply failed.

When we have several connections and one is broken, I think we shouldn't wait, but instead retry the requests on the other working connections.  That way we don't need several reconnect_delay_timer objects: we only need one, for the case when all connections are lost.
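
Purely as an illustration of that idea (nothing like this exists in the
series), a retry-aware chooser might look roughly like this, assuming
cs->state can be read safely at this point:

    /* Hypothetical: prefer a connected channel, fall back to round robin. */
    static NBDConnState *choose_live_connection(BDRVNBDState *s)
    {
        static size_t next;
        size_t start = qatomic_fetch_inc(&next) % s->multi_conn;
        size_t i;

        for (i = 0; i < s->multi_conn; ++i) {
            NBDConnState *cs = s->conns[(start + i) % s->multi_conn];

            if (cs && cs->state == NBD_CLIENT_CONNECTED) {
                return cs;
            }
        }
        /* All connections are down: let the reconnect logic take over. */
        return s->conns[start];
    }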


Re-establishing the additional connections is better done in the background, without blocking in-flight requests.  And that is the same way that postponing the additional connections until after open() should work ([*]).

-- 
Best regards,
Vladimir



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH nbd 0/4] Enable multi-conn NBD [for discussion only]
  2023-03-09 11:39 [PATCH nbd 0/4] Enable multi-conn NBD [for discussion only] Richard W.M. Jones
                   ` (4 preceding siblings ...)
  2023-03-10 17:43 ` [PATCH nbd 0/4] Enable multi-conn NBD [for discussion only] Vladimir Sementsov-Ogievskiy
@ 2023-03-10 19:04 ` Eric Blake
  2023-03-10 19:19   ` Richard W.M. Jones
  5 siblings, 1 reply; 11+ messages in thread
From: Eric Blake @ 2023-03-10 19:04 UTC (permalink / raw)
  To: Richard W.M. Jones; +Cc: qemu-devel, qemu-block, vsementsov, hreitz, kwolf

On Thu, Mar 09, 2023 at 11:39:42AM +0000, Richard W.M. Jones wrote:
> [ Patch series also available here, along with this cover letter and the
>   script used to generate test results:
>   https://gitlab.com/rwmjones/qemu/-/commits/2023-nbd-multi-conn-v1 ]
> 
> This patch series adds multi-conn support to the NBD block driver in
> qemu.  It is only meant for discussion and testing because it has a
> number of obvious shortcomings (see "XXX" in commit messages and
> code).  If we decided this was a good idea, we can work on a better
> patch.

Overall, I'm in favor of this.  A longer term project might be to have
qemu's NBD client code call into libnbd instead of reimplementing
things itself, at which point having libnbd manage multi-conn under
the hood would be awesome, but as that's a much bigger effort, a
shorter-term task of having qemu itself handle parallel sockets seems
worthwhile.

> 
>  - It works effectively for qemu client & nbdkit server, especially in
>    cases where the server does large, heavyweight requests.  This is
>    important for us because virt-v2v uses an nbdkit Python plugin and
>    various other heavyweight plugins (eg. plugins that access remote
>    servers for each request).
> 
>  - It seems to make little or no difference with qemu + qemu-nbd
>    server.  I speculate that's because qemu-nbd doesn't support system
>    threads, so networking is bottlenecked through a single core.  Even
>    though there are coroutines handling different sockets, they must
>    all wait in turn to issue send(3) or recv(3) calls on the same
>    core.

Is the current work to teach qemu to do multi-queue (that is, spread
the I/O load for a single block device across multiple cores) going to
help here?  I haven't been following the multi-queue efforts closely
enough to know if the approach used in this series will play nicely,
or need even further overhaul.

> 
>  - qemu-img unfortunately uses a single thread for all coroutines so
>    it suffers from a similar problem to qemu-nbd.  This change would
>    be much more effective if we could distribute coroutines across
>    threads.

qemu-img uses the same client code as qemu-nbd; any multi-queue
improvements that can spread the send()/recv() load of multiple
sockets across multiple cores will benefit both programs
simultaneously.

> 
>  - For tests which are highly bottlenecked on disk I/O (eg. the large
>    local file test and null test) multi-conn doesn't make much
>    difference.

As long as it isn't adding too much penalty, that's okay.  If the
saturation is truly at the point of how fast disk requests can be
served, it doesn't matter if we can queue up more of those requests in
parallel across multiple NBD sockets.

> 
>  - Multi-conn even with only 2 connections can make up for the
>    overhead of range requests, exceeding the performance of wget.

That alone is a rather cool result, and an argument in favor of
further developing this.

> 
>  - In the curlremote test, qemu-nbd is especially slow, for unknown
>    reasons.
> 
> 
> Integrity test (./multi-conn.pl integrity)
> ==========================================
> 
> nbdkit-sparse-random-plugin
>   |                 ^
>   | nbd+unix        | nbd+unix
>   v                 |
>    qemu-img convert
> 
> Reading from and writing the same data back to nbdkit sparse-random
> plugin checks that the data written is the same as the data read.
> This uses two Unix domain sockets, with or without multi-conn.  This
> test is mainly here to check we don't crash or corrupt data with this
> patch.
> 
>   server          client        multi-conn
>   ---------------------------------------------------------------
>     nbdkit	  qemu-img	[u/s]	9.07s	
>     nbdkit	  qemu-img	1	9.05s	
>     nbdkit	  qemu-img	2	9.02s	
>     nbdkit	  qemu-img	4	8.98s	
> 
> [u/s] = upstream qemu 7.2.0

How many of these timing numbers can be repeated with TLS in the mix?

> 
> 
> Curl local server test (./multi-conn.pl curlhttp)
> =================================================
> 
> Localhost Apache serving a file over http
>                   |
>                   | http
>                   v
> nbdkit-curl-plugin   or   qemu-nbd
>                   |
>                   | nbd+unix
>                   v
> qemu-img convert   or   nbdcopy
> 
> We download an image from a local web server through
> nbdkit-curl-plugin or qemu-nbd using the curl block driver, over NBD.
> The image is copied to /dev/null.
> 
>   server          client        multi-conn
>   ---------------------------------------------------------------
>   qemu-nbd	   nbdcopy	1	8.88s	
>   qemu-nbd	   nbdcopy	2	8.64s	
>   qemu-nbd	   nbdcopy	4	8.37s	
>   qemu-nbd	  qemu-img	[u/s]	6.47s

Do we have any good feel for why qemu-img is faster than nbdcopy in
the baseline?  But improving that is orthogonal to this series.

>   qemu-nbd	  qemu-img	1	6.56s	
>   qemu-nbd	  qemu-img	2	6.63s	
>   qemu-nbd	  qemu-img	4	6.50s	
>     nbdkit	   nbdcopy	1	12.15s  

I'm assuming this is nbdkit with your recent in-progress patches to
have the curl plugin serve parallel requests.  But this is another
place where we can investigate why nbdkit is not as performant as
qemu-nbd at utilizing curl.

>     nbdkit	   nbdcopy	2	7.05s	(72.36% better)
>     nbdkit	   nbdcopy	4	3.54s	(242.90% better)

That one is impressive!

>     nbdkit	  qemu-img	[u/s]	6.90s	
>     nbdkit	  qemu-img	1	7.00s   

Minimal penalty for adding the code but not utilizing it...

>     nbdkit	  qemu-img	2	3.85s	(79.15% better)
>     nbdkit	  qemu-img	4	3.85s	(79.15% better)

...and definitely shows its worth.

> 
> 
> Curl local file test (./multi-conn.pl curlfile)
> ===============================================
> 
> nbdkit-curl-plugin   using file:/// URI
>                   |
>                   | nbd+unix
>                   v
> qemu-img convert   or   nbdcopy
> 
> We download from a file:/// URI.  This test is designed to exercise
> NBD and some curl internal paths without the overhead from an external
> server.  qemu-nbd doesn't support file:/// URIs so we cannot duplicate
> the test for qemu as server.
> 
>   server          client        multi-conn
>   ---------------------------------------------------------------
>     nbdkit	   nbdcopy	1	31.32s	
>     nbdkit	   nbdcopy	2	20.29s	(54.38% better)
>     nbdkit	   nbdcopy	4	13.22s	(136.91% better)
>     nbdkit	  qemu-img	[u/s]	31.55s	

Here, the baseline is already comparable; both nbdcopy and qemu-img
are parsing the image off nbdkit in about the same amount of time.

>     nbdkit	  qemu-img	1	31.70s	

And again, minimal penalty for having the new code in place but not
exploiting it.

>     nbdkit	  qemu-img	2	21.60s	(46.07% better)
>     nbdkit	  qemu-img	4	13.88s	(127.25% better)

Plus an obvious benefit when the parallel sockets matter.

> 
> 
> Curl remote server test (./multi-conn.pl curlremote)
> ====================================================
> 
> nbdkit-curl-plugin   using http://remote/*.qcow2 URI
>          |
>          | nbd+unix
>          v
> qemu-img convert
> 
> We download from a remote qcow2 file to a local raw file, converting
> between formats during copying.
> 
> qemu-nbd   using http://remote/*.qcow2 URI
>     |
>     | nbd+unix
>     v
> qemu-img convert
> 
> Similarly, replacing nbdkit with qemu-nbd (treating the remote file as
> if it is raw, so the conversion is still done by qemu-img).
> 
> Additionally we compare downloading the file with wget (note this
> doesn't include the time for conversion, but that should only be a few
> seconds).
> 
>   server          client        multi-conn
>   ---------------------------------------------------------------
>          -	      wget	1	58.19s	
>     nbdkit	  qemu-img	[u/s]	68.29s	(17.36% worse)
>     nbdkit	  qemu-img	1	67.85s	(16.60% worse)
>     nbdkit	  qemu-img	2	58.17s	

Comparable to wget on paper, but a win in practice (since the wget
step also has to add a post-download qemu-img local conversion step).

>     nbdkit	  qemu-img	4	59.80s	
>     nbdkit	  qemu-img	6	59.15s	
>     nbdkit	  qemu-img	8	59.52s	
> 
>   qemu-nbd	  qemu-img	[u/s]	202.55s
>   qemu-nbd	  qemu-img	1	204.61s	
>   qemu-nbd	  qemu-img	2	196.73s	
>   qemu-nbd	  qemu-img	4	179.53s	(12.83% better)
>   qemu-nbd	  qemu-img	6	181.70s	(11.48% better)
>   qemu-nbd	  qemu-img	8	181.05s	(11.88% better)
>

Less dramatic results here, but still nothing horrible.

> 
> Local file test (./multi-conn.pl file)
> ======================================
> 
> qemu-nbd or nbdkit serving a large local file
>                   |
>                   | nbd+unix
>                   v
> qemu-img convert   or   nbdcopy
> 
> We download a local file over NBD.  The image is copied to /dev/null.
> 
>   server          client        multi-conn
>   ---------------------------------------------------------------
>   qemu-nbd	   nbdcopy	1	15.50s	
>   qemu-nbd	   nbdcopy	2	14.36s	
>   qemu-nbd	   nbdcopy	4	14.32s	
>   qemu-nbd	  qemu-img	[u/s]	10.16s	

Once again, we're seeing qemu-img baseline faster than nbdcopy as
client.  But throwing more sockets at either client does improve
performance, except for...

>   qemu-nbd	  qemu-img	1	11.17s	(10.01% worse)

...this one looks bad.  Is it a case of this series adding more mutex
work (qemu-img is making parallel requests; each request then contends
for the mutex only to learn that it will be using the same NBD
connection)?  And your comments about smarter round-robin schemes mean
there may still be room to avoid this much of a penalty.

>   qemu-nbd	  qemu-img	2	10.35s	
>   qemu-nbd	  qemu-img	4	10.39s	
>     nbdkit	   nbdcopy	1	9.10s	

This one is interesting: nbdkit as server performs better than
qemu-nbd.

>     nbdkit	   nbdcopy	2	8.25s	
>     nbdkit	   nbdcopy	4	8.60s	
>     nbdkit	  qemu-img	[u/s]	8.64s	
>     nbdkit	  qemu-img	1	9.38s	
>     nbdkit	  qemu-img	2	8.69s	
>     nbdkit	  qemu-img	4	8.87s	
> 
> 
> Null test (./multi-conn.pl null)
> ================================
> 
> qemu-nbd with null-co driver  or  nbdkit-null-plugin + noextents filter
>                   |
>                   | nbd+unix
>                   v
> qemu-img convert   or   nbdcopy
> 
> This is like the local file test above, but without needing a file.
> Instead all zeroes (fully allocated) are downloaded over NBD.

And I'm sure that if you allowed block status to show the holes, the
performance would be a lot faster, but that would be testing something
completely different ;)

> 
>   server          client        multi-conn
>   ---------------------------------------------------------------
>   qemu-nbd	   nbdcopy	1	14.86s	
>   qemu-nbd	   nbdcopy	2	17.08s	(14.90% worse)
>   qemu-nbd	   nbdcopy	4	17.89s	(20.37% worse)

Oh, that's weird.  I wonder if qemu's null-co driver has some poor
mutex behavior when being hit by parallel I/O.  Seems like
investigating that can be separate from this series, though.

>   qemu-nbd	  qemu-img	[u/s]	13.29s	

And another point where qemu-img is faster than nbdcopy as a
single-client baseline.

>   qemu-nbd	  qemu-img	1	13.31s	
>   qemu-nbd	  qemu-img	2	13.00s	
>   qemu-nbd	  qemu-img	4	12.62s	
>     nbdkit	   nbdcopy	1	15.06s	
>     nbdkit	   nbdcopy	2	12.21s	(23.32% better)
>     nbdkit	   nbdcopy	4	11.67s	(29.10% better)
>     nbdkit	  qemu-img	[u/s]	17.13s	
>     nbdkit	  qemu-img	1	17.11s	
>     nbdkit	  qemu-img	2	16.82s	
>     nbdkit	  qemu-img	4	18.81s	

Overall, I'm looking forward to seeing this go in (8.1 material; we're
too close to 8.0).

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH nbd 0/4] Enable multi-conn NBD [for discussion only]
  2023-03-10 19:04 ` Eric Blake
@ 2023-03-10 19:19   ` Richard W.M. Jones
  0 siblings, 0 replies; 11+ messages in thread
From: Richard W.M. Jones @ 2023-03-10 19:19 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, qemu-block, vsementsov, hreitz, kwolf

On Fri, Mar 10, 2023 at 01:04:12PM -0600, Eric Blake wrote:
> How many of these timing numbers can be repeated with TLS in the mix?

While I have been playing with TLS and kTLS recently, it's not
something that is especially important to v2v since all NBD traffic
goes over Unix domain sockets only (i.e. it's used as a kind of
interprocess communication).

I could certainly provide benchmarks, although as I'm going on holiday
shortly it may be a little while.

> > Curl local server test (./multi-conn.pl curlhttp)
> > =================================================
> > 
> > Localhost Apache serving a file over http
> >                   |
> >                   | http
> >                   v
> > nbdkit-curl-plugin   or   qemu-nbd
> >                   |
> >                   | nbd+unix
> >                   v
> > qemu-img convert   or   nbdcopy
> > 
> > We download an image from a local web server through
> > nbdkit-curl-plugin or qemu-nbd using the curl block driver, over NBD.
> > The image is copied to /dev/null.
> > 
> >   server          client        multi-conn
> >   ---------------------------------------------------------------
> >   qemu-nbd	   nbdcopy	1	8.88s	
> >   qemu-nbd	   nbdcopy	2	8.64s	
> >   qemu-nbd	   nbdcopy	4	8.37s	
> >   qemu-nbd	  qemu-img	[u/s]	6.47s
> 
> Do we have any good feel for why qemu-img is faster than nbdcopy in
> the baseline?  But improving that is orthogonal to this series.

I do not, but we have in the past found that results can be very
sensitive to request size.  By default (and also in all of these
tests) nbdcopy is using a request size of 256K, and qemu-img is using
a request size of 2M.

> >   qemu-nbd	  qemu-img	1	6.56s	
> >   qemu-nbd	  qemu-img	2	6.63s	
> >   qemu-nbd	  qemu-img	4	6.50s	
> >     nbdkit	   nbdcopy	1	12.15s  
> 
> I'm assuming this is nbdkit with your recent in-progress patches to
> have the curl plugin serve parallel requests.  But another place where
> we can investigate why nbdkit is not as performant as qemu-nbd at
> utilizing curl.
> 
> >     nbdkit	   nbdcopy	2	7.05s	(72.36% better)
> >     nbdkit	   nbdcopy	4	3.54s	(242.90% better)
> 
> That one is impressive!
> 
> >     nbdkit	  qemu-img	[u/s]	6.90s	
> >     nbdkit	  qemu-img	1	7.00s   
> 
> Minimal penalty for adding the code but not utilizing it...

[u/s] and qemu-img with multi-conn:1 ought to be identical actually.
After all, the only difference should be the restructuring of the code
to add the intermediate NBDConnState struct.  In this case it's probably
just measurement error.

> >     nbdkit	  qemu-img	2	3.85s	(79.15% better)
> >     nbdkit	  qemu-img	4	3.85s	(79.15% better)
> 
> ...and definitely shows its worth.
> 
> > 
> > 
> > Curl local file test (./multi-conn.pl curlfile)
> > ===============================================
> > 
> > nbdkit-curl-plugin   using file:/// URI
> >                   |
> >                   | nbd+unix
> >                   v
> > qemu-img convert   or   nbdcopy
> > 
> > We download from a file:/// URI.  This test is designed to exercise
> > NBD and some curl internal paths without the overhead from an external
> > server.  qemu-nbd doesn't support file:/// URIs so we cannot duplicate
> > the test for qemu as server.
> > 
> >   server          client        multi-conn
> >   ---------------------------------------------------------------
> >     nbdkit	   nbdcopy	1	31.32s	
> >     nbdkit	   nbdcopy	2	20.29s	(54.38% better)
> >     nbdkit	   nbdcopy	4	13.22s	(136.91% better)
> >     nbdkit	  qemu-img	[u/s]	31.55s	
> 
> Here, the baseline is already comparable; both nbdcopy and qemu-img
> are parsing the image off nbdkit in about the same amount of time.
> 
> >     nbdkit	  qemu-img	1	31.70s	
> 
> And again, minimal penalty for having the new code in place but not
> exploiting it.
> 
> >     nbdkit	  qemu-img	2	21.60s	(46.07% better)
> >     nbdkit	  qemu-img	4	13.88s	(127.25% better)
> 
> Plus an obvious benefit when the parallel sockets matter.
> 
> > 
> > 
> > Curl remote server test (./multi-conn.pl curlremote)
> > ====================================================
> > 
> > nbdkit-curl-plugin   using http://remote/*.qcow2 URI
> >          |
> >          | nbd+unix
> >          v
> > qemu-img convert
> > 
> > We download from a remote qcow2 file to a local raw file, converting
> > between formats during copying.
> > 
> > qemu-nbd   using http://remote/*.qcow2 URI
> >     |
> >     | nbd+unix
> >     v
> > qemu-img convert
> > 
> > Similarly, replacing nbdkit with qemu-nbd (treating the remote file as
> > if it is raw, so the conversion is still done by qemu-img).
> > 
> > Additionally we compare downloading the file with wget (note this
> > doesn't include the time for conversion, but that should only be a few
> > seconds).
> > 
> >   server          client        multi-conn
> >   ---------------------------------------------------------------
> >          -	      wget	1	58.19s	
> >     nbdkit	  qemu-img	[u/s]	68.29s	(17.36% worse)
> >     nbdkit	  qemu-img	1	67.85s	(16.60% worse)
> >     nbdkit	  qemu-img	2	58.17s	
> 
> Comparable to wget on paper, but a win in practice (since the wget
> step also has to add a post-download qemu-img local conversion step).

Yes, correct.  Best case that would be another ~ 2-3 seconds on this
machine.

> >     nbdkit	  qemu-img	4	59.80s	
> >     nbdkit	  qemu-img	6	59.15s	
> >     nbdkit	  qemu-img	8	59.52s	
> > 
> >   qemu-nbd	  qemu-img	[u/s]	202.55s
> >   qemu-nbd	  qemu-img	1	204.61s	
> >   qemu-nbd	  qemu-img	2	196.73s	
> >   qemu-nbd	  qemu-img	4	179.53s	(12.83% better)
> >   qemu-nbd	  qemu-img	6	181.70s	(11.48% better)
> >   qemu-nbd	  qemu-img	8	181.05s	(11.88% better)
> >
> 
> Less dramatic results here, but still nothing horrible.
> 
> > 
> > Local file test (./multi-conn.pl file)
> > ======================================
> > 
> > qemu-nbd or nbdkit serving a large local file
> >                   |
> >                   | nbd+unix
> >                   v
> > qemu-img convert   or   nbdcopy
> > 
> > We download a local file over NBD.  The image is copied to /dev/null.
> > 
> >   server          client        multi-conn
> >   ---------------------------------------------------------------
> >   qemu-nbd	   nbdcopy	1	15.50s	
> >   qemu-nbd	   nbdcopy	2	14.36s	
> >   qemu-nbd	   nbdcopy	4	14.32s	
> >   qemu-nbd	  qemu-img	[u/s]	10.16s	
> 
> Once again, we're seeing qemu-img baseline faster than nbdcopy as
> client.  But throwing more sockets at either client does improve
> performance, except for...
> 
> >   qemu-nbd	  qemu-img	1	11.17s	(10.01% worse)
> 
> ...this one looks bad.  Is it a case of this series adding more mutex
> work (qemu-img is making parallel requests; each request then contends
> for the mutex only to learn that it will be using the same NBD
> connection)?  And your comments about smarter round-robin schemes mean
> there may still be room to avoid this much of a penalty.

This was reproducible and I don't have a good explanation for it.  As
far as I know, just adding the NBDConnState struct should not add any
overhead.  The only locking is the call to choose_connection, and
that's just an access to an atomic variable, which I can't imagine
could cause such a difference.

> >   qemu-nbd	  qemu-img	2	10.35s	
> >   qemu-nbd	  qemu-img	4	10.39s	
> >     nbdkit	   nbdcopy	1	9.10s	
> 
> This one in interesting: nbdkit as server performs better than
> qemu-nbd.
> 
> >     nbdkit	   nbdcopy	2	8.25s	
> >     nbdkit	   nbdcopy	4	8.60s	
> >     nbdkit	  qemu-img	[u/s]	8.64s	
> >     nbdkit	  qemu-img	1	9.38s	
> >     nbdkit	  qemu-img	2	8.69s	
> >     nbdkit	  qemu-img	4	8.87s	
> > 
> > 
> > Null test (./multi-conn.pl null)
> > ================================
> > 
> > qemu-nbd with null-co driver  or  nbdkit-null-plugin + noextents filter
> >                   |
> >                   | nbd+unix
> >                   v
> > qemu-img convert   or   nbdcopy
> > 
> > This is like the local file test above, but without needing a file.
> > Instead all zeroes (fully allocated) are downloaded over NBD.
> 
> And I'm sure that if you allowed block status to show the holes, the
> performance would be a lot faster, but that would be testing something
> completely differently ;)
> 
> > 
> >   server          client        multi-conn
> >   ---------------------------------------------------------------
> >   qemu-nbd	   nbdcopy	1	14.86s	
> >   qemu-nbd	   nbdcopy	2	17.08s	(14.90% worse)
> >   qemu-nbd	   nbdcopy	4	17.89s	(20.37% worse)
> 
> Oh, that's weird.  I wonder if qemu's null-co driver has some poor
> mutex behavior when being hit by parallel I/O.  Seems like
> investigating that can be separate from this series, though.

Yes, I noticed in other tests that null-co has some odd behaviour, but
I couldn't understand it from looking at the code, which seems very
simple.  It does a memset; maybe that is expensive because it uses
newly allocated buffers every time, or something like that?

> >   qemu-nbd	  qemu-img	[u/s]	13.29s	
> 
> And another point where qemu-img is faster than nbdcopy as a
> single-client baseline.
> 
> >   qemu-nbd	  qemu-img	1	13.31s	
> >   qemu-nbd	  qemu-img	2	13.00s	
> >   qemu-nbd	  qemu-img	4	12.62s	
> >     nbdkit	   nbdcopy	1	15.06s	
> >     nbdkit	   nbdcopy	2	12.21s	(23.32% better)
> >     nbdkit	   nbdcopy	4	11.67s	(29.10% better)
> >     nbdkit	  qemu-img	[u/s]	17.13s	
> >     nbdkit	  qemu-img	1	17.11s	
> >     nbdkit	  qemu-img	2	16.82s	
> >     nbdkit	  qemu-img	4	18.81s	
> 
> Overall, I'm looking forward to seeing this go in (8.1 material; we're
> too close to 8.0)

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
libguestfs lets you edit virtual machines.  Supports shell scripting,
bindings from many languages.  http://libguestfs.org



^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH nbd 1/4] nbd: Add multi-conn option
  2023-03-09 11:39 ` [PATCH nbd 1/4] nbd: Add multi-conn option Richard W.M. Jones
@ 2023-03-10 22:17   ` Eric Blake
  2023-03-10 22:29     ` Richard W.M. Jones
  0 siblings, 1 reply; 11+ messages in thread
From: Eric Blake @ 2023-03-10 22:17 UTC (permalink / raw)
  To: Richard W.M. Jones; +Cc: qemu-devel, qemu-block, vsementsov, hreitz, kwolf

On Thu, Mar 09, 2023 at 11:39:43AM +0000, Richard W.M. Jones wrote:
> Add multi-conn option to the NBD client.  This commit just adds the
> option, it is not functional.

Maybe add the phrase "until later in this patch series" ?

> 
> Setting this to a value > 1 permits multiple connections to the NBD
> server; a typical value might be 4.  The default is 1, meaning only a
> single connection is made.  If the NBD server does not advertise that
> it is safe for multi-conn then this setting is forced to 1.
> 
> Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
> ---
>  block/nbd.c | 24 ++++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
> 
> diff --git a/block/nbd.c b/block/nbd.c
> index bf2894ad5c..5ffae0b798 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -49,6 +49,7 @@
>  
>  #define EN_OPTSTR ":exportname="
>  #define MAX_NBD_REQUESTS    16
> +#define MAX_MULTI_CONN      16
>  
>  #define HANDLE_TO_INDEX(bs, handle) ((handle) ^ (uint64_t)(intptr_t)(bs))
>  #define INDEX_TO_HANDLE(bs, index)  ((index)  ^ (uint64_t)(intptr_t)(bs))
> @@ -98,6 +99,7 @@ typedef struct BDRVNBDState {
>      /* Connection parameters */
>      uint32_t reconnect_delay;
>      uint32_t open_timeout;
> +    uint32_t multi_conn;
>      SocketAddress *saddr;
>      char *export;
>      char *tlscredsid;
> @@ -1803,6 +1805,15 @@ static QemuOptsList nbd_runtime_opts = {
>                      "attempts until successful or until @open-timeout seconds "
>                      "have elapsed. Default 0",
>          },
> +        {
> +            .name = "multi-conn",
> +            .type = QEMU_OPT_NUMBER,
> +            .help = "If > 1 permit up to this number of connections to the "
> +                    "server. The server must also advertise multi-conn "
> +                    "support.  If <= 1, only a single connection is made "
> +                    "to the server even if the server advertises multi-conn. "
> +                    "Default 1",
> +        },
>          { /* end of list */ }
>      },
>  };
> @@ -1858,6 +1869,10 @@ static int nbd_process_options(BlockDriverState *bs, QDict *options,
>  
>      s->reconnect_delay = qemu_opt_get_number(opts, "reconnect-delay", 0);
>      s->open_timeout = qemu_opt_get_number(opts, "open-timeout", 0);
> +    s->multi_conn = qemu_opt_get_number(opts, "multi-conn", 1);
> +    if (s->multi_conn > MAX_MULTI_CONN) {
> +        s->multi_conn = MAX_MULTI_CONN;
> +    }

This silently ignores out-of-range values (negative, or greater than
16) and treats 0 as a synonym for 1.  The latter I'm okay with; for
the former, I wonder if we should instead raise an error that the
user is requesting something we can't honor, rather than silently
bounding it.
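
A rough sketch of that stricter validation, for illustration only
(hypothetical; it assumes nbd_process_options() takes the usual
Error **errp parameter and has an error exit label, as its other
option checks do):

    /*
     * Hypothetical sketch: reject out-of-range values instead of
     * clamping them.  Not part of the posted patch.
     */
    uint64_t multi_conn = qemu_opt_get_number(opts, "multi-conn", 1);

    if (multi_conn > MAX_MULTI_CONN) {
        error_setg(errp, "multi-conn must be between 1 and %d",
                   MAX_MULTI_CONN);
        goto error;
    }
    s->multi_conn = multi_conn ? multi_conn : 1;  /* still treat 0 as 1 */

Whether 0 should also be rejected is a separate question; the sketch
keeps the patch's 0-means-1 behaviour.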

>  
>      ret = 0;
>  
> @@ -1912,6 +1927,15 @@ static int nbd_open(BlockDriverState *bs, QDict *options, int flags,
>  
>      nbd_client_connection_enable_retry(s->conn);
>  
> +    /*
> +     * We set s->multi_conn in nbd_process_options above, but now that
> +     * we have connected if the server doesn't advertise that it is

s/connected/connected,/

> +     * safe for multi-conn, force it to 1.
> +     */
> +    if (!(s->info.flags & NBD_FLAG_CAN_MULTI_CONN)) {
> +        s->multi_conn = 1;
> +    }
> +
>      return 0;

Is there an intended QAPI counterpart for this option?  We'll need
that if it is to be set from the command line of
qemu-storage-daemon.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org




* Re: [PATCH nbd 1/4] nbd: Add multi-conn option
  2023-03-10 22:17   ` Eric Blake
@ 2023-03-10 22:29     ` Richard W.M. Jones
  0 siblings, 0 replies; 11+ messages in thread
From: Richard W.M. Jones @ 2023-03-10 22:29 UTC (permalink / raw)
  To: Eric Blake; +Cc: qemu-devel, qemu-block, vsementsov, hreitz, kwolf

On Fri, Mar 10, 2023 at 04:17:17PM -0600, Eric Blake wrote:
> On Thu, Mar 09, 2023 at 11:39:43AM +0000, Richard W.M. Jones wrote:
> > +     * safe for multi-conn, force it to 1.
> > +     */
> > +    if (!(s->info.flags & NBD_FLAG_CAN_MULTI_CONN)) {
> > +        s->multi_conn = 1;
> > +    }
> > +
> >      return 0;
> 
> Is there an intended QAPI counterpart for this command?  We'll need
> that if it is to be set during the command line of
> qemu-storage-daemon.

Does it just need to be added to qapi/block-core.json?

It's a shame we can't add the API in one place and have everything
generated from there.  Like some kind of 'generator' ...

Rich.

-- 
Richard Jones, Virtualization Group, Red Hat http://people.redhat.com/~rjones
Read my programming and virtualization blog: http://rwmj.wordpress.com
virt-builder quickly builds VMs from scratch
http://libguestfs.org/virt-builder.1.html




* Re: [PATCH nbd 2/4] nbd: Split out block device state from underlying NBD connections
  2023-03-09 11:39 ` [PATCH nbd 2/4] nbd: Split out block device state from underlying NBD connections Richard W.M. Jones
@ 2023-03-14 12:13   ` Eric Blake
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Blake @ 2023-03-14 12:13 UTC (permalink / raw)
  To: Richard W.M. Jones; +Cc: qemu-devel, qemu-block, vsementsov, hreitz, kwolf

On Thu, Mar 09, 2023 at 11:39:44AM +0000, Richard W.M. Jones wrote:
> To implement multi-conn, we will put multiple underlying NBD
> connections (ie. NBDClientConnection) inside the NBD block device
> handle (BDRVNBDState).  This requires first breaking the one-to-one
> relationship between NBDClientConnection and BDRVNBDState.
> 
> To do this, a new structure (NBDConnState) is implemented.
> NBDConnState takes all the per-connection state out of the block
> driver struct.  BDRVNBDState now contains a conns[] array of pointers
> to NBDConnState, one for each underlying multi-conn connection.
> 
> After this change there is still a one-to-one relationship because we
> only ever use the zeroth slot in the conns[] array.  Thus this does
> not implement multi-conn yet.
> 
> Signed-off-by: Richard W.M. Jones <rjones@redhat.com>
> ---
>  block/coroutines.h |   5 +-
>  block/nbd.c        | 674 ++++++++++++++++++++++++---------------------
>  2 files changed, 358 insertions(+), 321 deletions(-)
> 
> diff --git a/block/coroutines.h b/block/coroutines.h
> index dd9f3d449b..14b01d8591 100644
> --- a/block/coroutines.h
> +++ b/block/coroutines.h
> @@ -62,7 +62,7 @@ int coroutine_fn GRAPH_RDLOCK
>  bdrv_co_writev_vmstate(BlockDriverState *bs, QEMUIOVector *qiov, int64_t pos);
>  
>  int coroutine_fn
> -nbd_co_do_establish_connection(BlockDriverState *bs, bool blocking,
> +nbd_co_do_establish_connection(BlockDriverState *bs, void *cs, bool blocking,
>                                 Error **errp);

I guess the void* here is because you couldn't find a way to expose
the new type through the coroutine wrapper generator?

>  
>  
> @@ -86,6 +86,7 @@ bdrv_common_block_status_above(BlockDriverState *bs,
>                                 BlockDriverState **file,
>                                 int *depth);
>  int co_wrapper_mixed
> -nbd_do_establish_connection(BlockDriverState *bs, bool blocking, Error **errp);
> +nbd_do_establish_connection(BlockDriverState *bs, void *cs, bool blocking,
> +                            Error **errp);
>

same here

>  #endif /* BLOCK_COROUTINES_H */
> diff --git a/block/nbd.c b/block/nbd.c
> index 5ffae0b798..84e8a1add0 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -51,8 +51,8 @@
>  #define MAX_NBD_REQUESTS    16
>  #define MAX_MULTI_CONN      16
>  
> -#define HANDLE_TO_INDEX(bs, handle) ((handle) ^ (uint64_t)(intptr_t)(bs))
> -#define INDEX_TO_HANDLE(bs, index)  ((index)  ^ (uint64_t)(intptr_t)(bs))
> +#define HANDLE_TO_INDEX(cs, handle) ((handle) ^ (uint64_t)(intptr_t)(cs))
> +#define INDEX_TO_HANDLE(cs, index)  ((index)  ^ (uint64_t)(intptr_t)(cs))

Independently of your patches, these macros are odd.  I think we could
just as easily define

#define HANDLE_TO_INDEX(XX, handle) ((handle) - 1)
#define INDEX_TO_HANDLE(XX, index) ((index) + 1)

and then refactor to drop the unused parameter, since we never really
depend on it (I audited the code when you first asked me about it on
IRC, and we do a bounds check that whatever the server returns lies
within our expected index of 0-15 before dereferencing any memory with
it, so we are NOT relying on the server passing us a pointer that we
depend on).
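
For reference, a sketch of that refactor (hypothetical, not part of
this series; the call sites shown are illustrative):

    /* Drop the unused pointer argument; the +1 presumably just keeps
     * handle 0 unused. */
    #define HANDLE_TO_INDEX(handle) ((handle) - 1)
    #define INDEX_TO_HANDLE(index)  ((index) + 1)

    /* Illustrative call sites once the extra parameter is gone: */
    request->handle = INDEX_TO_HANDLE(i);      /* when sending */
    i = HANDLE_TO_INDEX(reply.handle);         /* when matching a reply */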

Overall, your patch is mostly mechanical and looks nice to me.

>  
>  /*
>   * Update @bs with information learned during a completed negotiation process.
>   * Return failure if the server's advertised options are incompatible with the
>   * client's needs.
> + *
> + * Note that we are only called for the first connection (s->conns[0])
> + * because multi-conn assumes that all other connections are alike.
> + * We don't check that assumption but probably should (XXX).
>   */
> -static int nbd_handle_updated_info(BlockDriverState *bs, Error **errp)
> +static int nbd_handle_updated_info(BlockDriverState *bs,
> +                                   NBDConnState *cs, Error **errp)

But as you pointed out in your cover letter, we'll probably want to
address the XXX comments like this one prior to actually committing
the series.  We really should be making sure that the secondary
clients see the same server parameters as the first one, regardless of
whether they are open in parallel (once multi-conn is enabled) or in
series after a reopen (which is what we currently try to support).
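
One possible shape for such a check (a sketch only; it assumes
NBDConnState ends up carrying the negotiated parameters as cs->info,
the way the pre-patch BDRVNBDState does):

    /*
     * Hypothetical: called for conns[1..n-1] after they finish
     * negotiating, to confirm they saw the same export parameters as
     * conns[0].  Not part of the posted series.
     */
    static int nbd_check_secondary_conn(BDRVNBDState *s, NBDConnState *cs,
                                        Error **errp)
    {
        NBDExportInfo *first = &s->conns[0]->info;

        if (cs->info.size != first->size ||
            cs->info.flags != first->flags ||
            cs->info.min_block != first->min_block ||
            cs->info.max_block != first->max_block) {
            error_setg(errp, "NBD server advertised different parameters "
                       "on a secondary multi-conn connection");
            return -EINVAL;
        }
        return 0;
    }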

> @@ -318,129 +332,132 @@ static int nbd_handle_updated_info(BlockDriverState *bs, Error **errp)
>  }
>  
>  int coroutine_fn nbd_co_do_establish_connection(BlockDriverState *bs,
> +                                                void *csvp,
>                                                  bool blocking, Error **errp)
>  {
>      BDRVNBDState *s = (BDRVNBDState *)bs->opaque;
> +    NBDConnState *cs = csvp;

Is there a way to get the new type passed through the coroutine
generator without the use of void*?
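
One way that might work (a sketch, assuming the generated wrapper only
ever passes the pointer through, so an opaque forward declaration in
block/coroutines.h would be enough):

    /* Hypothetical alternative to void*: forward-declare the type so
     * the prototypes stay opaque but typed. */
    typedef struct NBDConnState NBDConnState;

    int coroutine_fn
    nbd_co_do_establish_connection(BlockDriverState *bs, NBDConnState *cs,
                                   bool blocking, Error **errp);

    int co_wrapper_mixed
    nbd_do_establish_connection(BlockDriverState *bs, NBDConnState *cs,
                                bool blocking, Error **errp);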

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org



end of thread, other threads:[~2023-03-14 12:15 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-03-09 11:39 [PATCH nbd 0/4] Enable multi-conn NBD [for discussion only] Richard W.M. Jones
2023-03-09 11:39 ` [PATCH nbd 1/4] nbd: Add multi-conn option Richard W.M. Jones
2023-03-10 22:17   ` Eric Blake
2023-03-10 22:29     ` Richard W.M. Jones
2023-03-09 11:39 ` [PATCH nbd 2/4] nbd: Split out block device state from underlying NBD connections Richard W.M. Jones
2023-03-14 12:13   ` Eric Blake
2023-03-09 11:39 ` [PATCH nbd 3/4] nbd: Open multiple NBD connections if multi-conn is set Richard W.M. Jones
2023-03-09 11:39 ` [PATCH nbd 4/4] nbd: Enable multi-conn using round-robin Richard W.M. Jones
2023-03-10 17:43 ` [PATCH nbd 0/4] Enable multi-conn NBD [for discussion only] Vladimir Sementsov-Ogievskiy
2023-03-10 19:04 ` Eric Blake
2023-03-10 19:19   ` Richard W.M. Jones
