* [Qemu-devel] [PATCH 0/3] NBD reconnection behaviour
From: nick @ 2012-10-22 11:09 UTC (permalink / raw)
To: qemu-devel; +Cc: pbonzini, Nick Thomas
Hi all,
This patchset is about making the NBD backend more useful. Currently,
when the NBD server disconnects, the block device in the guest becomes
unusable with no option to recover except to restart the QEMU process.
These patches introduce a reconnect timer that fires every five
seconds until we successfully reconnect.
I/O requests that are inflight when the disconnection occurs, or
requested while disconnected, are failed with an EIO - so the usual
werror/rerror settings apply in those circumstances.
All this means that, assuming you can get the NBD server up again,
only some I/O requests are failed, rather than all of them.
I've got a few more changes to make - specifically:
* Allowing the reconnect timer delay to be configurable
* Queuing and retrying I/O requests instead of EIO
* Proactively killing the TCP connection if the server doesn't
respond after a timeout.
The rationale for the second is that some guests remount discs r/o
if I/O requests fail (rather than apparently hang), which is a pain.
The third allows us to quickly detect if a TCP connection disappears
without a trace.
However, I think these patches stand as an improvement on the current
situation, and I'd rather like some feedback on the best way to do
the further bits - assuming these patches eventually get accepted!
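As a rough sketch of the third item above (not part of this series), an
idle-connection watchdog could reuse the timer API that patch 3 already
uses; IDLE_TIMEOUT, the idle_timer field and the function name are invented
for the illustration:
/* Sketch only: kill the TCP connection if the server has been silent for
 * too long while requests are outstanding, so a vanished peer is noticed
 * quickly.  IDLE_TIMEOUT, s->idle_timer and this function are hypothetical;
 * nbd_is_connected() comes from patch 1 and the timer calls are the same
 * ones patch 3 uses. */
#define IDLE_TIMEOUT 30000   /* ms without a reply before giving up */

static void nbd_idle_timeout(void *opaque)
{
    BDRVNBDState *s = opaque;

    if (nbd_is_connected(s) && s->in_flight > 0) {
        /* Force any blocked socket reads/writes to fail, which drives the
         * normal disconnect-and-reconnect path. */
        shutdown(s->sock, 2);
    }
}

/* Re-armed whenever a reply arrives, e.g. at the top of nbd_reply_ready():
 *     qemu_mod_timer(s->idle_timer,
 *                    qemu_get_clock_ms(rt_clock) + IDLE_TIMEOUT);
 */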
Nick Thomas (3):
nbd: Only try to send flush/discard commands if connected to the NBD
server
nbd: Explicitly disconnect and fail inflight I/O requests on error,
then reconnect next I/O request.
nbd: Move reconnection attempts from each new I/O request to a
5-second timer
block/nbd.c | 117 +++++++++++++++++++++++++++++++++++++++++++++++------------
1 files changed, 93 insertions(+), 24 deletions(-)
--
1.7.2.5
* [Qemu-devel] [PATCH 1/3] nbd: Only try to send flush/discard commands if connected to the NBD server
From: nick @ 2012-10-22 11:09 UTC (permalink / raw)
To: qemu-devel; +Cc: pbonzini, Nick Thomas
This is unlikely to come up now, but is a necessary prerequisite for reconnection
behaviour.
Signed-off-by: Nick Thomas <nick@bytemark.co.uk>
---
block/nbd.c | 13 +++++++++++--
1 files changed, 11 insertions(+), 2 deletions(-)
diff --git a/block/nbd.c b/block/nbd.c
index 2bce47b..9536408 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -71,6 +71,11 @@ typedef struct BDRVNBDState {
char *host_spec;
} BDRVNBDState;
+static bool nbd_is_connected(BDRVNBDState *s)
+{
+ return s->sock >= 0;
+}
+
static int nbd_config(BDRVNBDState *s, const char *filename, int flags)
{
char *file;
@@ -309,6 +314,7 @@ static void nbd_teardown_connection(BlockDriverState *bs)
qemu_aio_set_fd_handler(s->sock, NULL, NULL, NULL, NULL);
closesocket(s->sock);
+ s->sock = -1;
}
static int nbd_open(BlockDriverState *bs, const char* filename, int flags)
@@ -316,6 +322,8 @@ static int nbd_open(BlockDriverState *bs, const char* filename, int flags)
BDRVNBDState *s = bs->opaque;
int result;
+ s->sock = -1;
+
qemu_co_mutex_init(&s->send_mutex);
qemu_co_mutex_init(&s->free_sema);
@@ -431,7 +439,7 @@ static int nbd_co_flush(BlockDriverState *bs)
struct nbd_reply reply;
ssize_t ret;
- if (!(s->nbdflags & NBD_FLAG_SEND_FLUSH)) {
+ if (!nbd_is_connected(s) || !(s->nbdflags & NBD_FLAG_SEND_FLUSH)) {
return 0;
}
@@ -462,7 +470,7 @@ static int nbd_co_discard(BlockDriverState *bs, int64_t sector_num,
struct nbd_reply reply;
ssize_t ret;
- if (!(s->nbdflags & NBD_FLAG_SEND_TRIM)) {
+ if (!nbd_is_connected(s) || !(s->nbdflags & NBD_FLAG_SEND_TRIM)) {
return 0;
}
request.type = NBD_CMD_TRIM;
@@ -515,3 +523,4 @@ static void bdrv_nbd_init(void)
}
block_init(bdrv_nbd_init);
+
* [Qemu-devel] [PATCH 2/3] nbd: Explicitly disconnect and fail inflight I/O requests on error, then reconnect next I/O request.
From: nick @ 2012-10-22 11:09 UTC (permalink / raw)
To: qemu-devel; +Cc: pbonzini, Nick Thomas
The previous behaviour when an NBD connection broke was to fail just the broken I/O request
and (sometimes) never unlock send_mutex. Now we explicitly call nbd_teardown_connection and
fail all NBD requests in the "inflight" state - this allows werror/rerror settings to be
applied to those requests all at once.
When a new request (or a request that was pending, but not yet inflight) is issued against
the NBD driver, if we're not connected to the NBD server, we attempt to connect (and fail
the new request immediately if that doesn't work). This isn't ideal, as it can lead to many
failed attempts to connect to the NBD server in short order, but at least allows us to get
over temporary failures in this state.
Signed-off-by: Nick Thomas <nick@bytemark.co.uk>
---
block/nbd.c | 72 ++++++++++++++++++++++++++++++++++++++++------------------
1 files changed, 49 insertions(+), 23 deletions(-)
diff --git a/block/nbd.c b/block/nbd.c
index 9536408..1caae89 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -71,6 +71,9 @@ typedef struct BDRVNBDState {
char *host_spec;
} BDRVNBDState;
+static int nbd_establish_connection(BDRVNBDState *s);
+static void nbd_teardown_connection(BDRVNBDState *s, bool send_disconnect);
+
static bool nbd_is_connected(BDRVNBDState *s)
{
return s->sock >= 0;
@@ -152,6 +155,20 @@ static int nbd_have_request(void *opaque)
return s->in_flight > 0;
}
+static void nbd_abort_inflight_requests(BDRVNBDState *s)
+{
+ int i;
+ Coroutine *co;
+
+ s->reply.handle = 0;
+ for (i = 0; i < MAX_NBD_REQUESTS; i++) {
+ co = s->recv_coroutine[i];
+ if (co && co != qemu_coroutine_self()) {
+ qemu_coroutine_enter(co, NULL);
+ }
+ }
+}
+
static void nbd_reply_ready(void *opaque)
{
BDRVNBDState *s = opaque;
@@ -168,8 +185,9 @@ static void nbd_reply_ready(void *opaque)
return;
}
if (ret < 0) {
- s->reply.handle = 0;
- goto fail;
+ nbd_teardown_connection(s, false);
+ nbd_abort_inflight_requests(s);
+ return;
}
}
@@ -178,20 +196,15 @@ static void nbd_reply_ready(void *opaque)
* one coroutine is called until the reply finishes. */
i = HANDLE_TO_INDEX(s, s->reply.handle);
if (i >= MAX_NBD_REQUESTS) {
- goto fail;
+ nbd_teardown_connection(s, false);
+ nbd_abort_inflight_requests(s);
+ return;
}
if (s->recv_coroutine[i]) {
qemu_coroutine_enter(s->recv_coroutine[i], NULL);
return;
}
-
-fail:
- for (i = 0; i < MAX_NBD_REQUESTS; i++) {
- if (s->recv_coroutine[i]) {
- qemu_coroutine_enter(s->recv_coroutine[i], NULL);
- }
- }
}
static void nbd_restart_write(void *opaque)
@@ -206,6 +219,13 @@ static int nbd_co_send_request(BDRVNBDState *s, struct nbd_request *request,
int rc, ret;
qemu_co_mutex_lock(&s->send_mutex);
+
+ if (!nbd_is_connected(s) && nbd_establish_connection(s) != 0) {
+ nbd_abort_inflight_requests(s);
+ qemu_co_mutex_unlock(&s->send_mutex);
+ return -EIO;
+ }
+
s->send_coroutine = qemu_coroutine_self();
qemu_aio_set_fd_handler(s->sock, nbd_reply_ready, nbd_restart_write,
nbd_have_request, s);
@@ -214,6 +234,9 @@ static int nbd_co_send_request(BDRVNBDState *s, struct nbd_request *request,
ret = qemu_co_sendv(s->sock, qiov->iov, qiov->niov,
offset, request->len);
if (ret != request->len) {
+ s->send_coroutine = NULL;
+ nbd_teardown_connection(s, false);
+ qemu_co_mutex_unlock(&s->send_mutex);
return -EIO;
}
}
@@ -259,9 +282,8 @@ static void nbd_coroutine_end(BDRVNBDState *s, struct nbd_request *request)
}
}
-static int nbd_establish_connection(BlockDriverState *bs)
+static int nbd_establish_connection(BDRVNBDState *s)
{
- BDRVNBDState *s = bs->opaque;
int sock;
int ret;
off_t size;
@@ -302,19 +324,23 @@ static int nbd_establish_connection(BlockDriverState *bs)
return 0;
}
-static void nbd_teardown_connection(BlockDriverState *bs)
+static void nbd_teardown_connection(BDRVNBDState *s, bool send_disconnect)
{
- BDRVNBDState *s = bs->opaque;
+
struct nbd_request request;
- request.type = NBD_CMD_DISC;
- request.from = 0;
- request.len = 0;
- nbd_send_request(s->sock, &request);
+ if (nbd_is_connected(s)) {
+ if (send_disconnect) {
+ request.type = NBD_CMD_DISC;
+ request.from = 0;
+ request.len = 0;
+ nbd_send_request(s->sock, &request);
+ }
- qemu_aio_set_fd_handler(s->sock, NULL, NULL, NULL, NULL);
- closesocket(s->sock);
- s->sock = -1;
+ qemu_aio_set_fd_handler(s->sock, NULL, NULL, NULL, NULL);
+ closesocket(s->sock);
+ s->sock = -1; /* Makes nbd_is_connected() return false */
+ }
}
static int nbd_open(BlockDriverState *bs, const char* filename, int flags)
@@ -336,7 +362,7 @@ static int nbd_open(BlockDriverState *bs, const char* filename, int flags)
/* establish TCP connection, return error if it fails
* TODO: Configurable retry-until-timeout behaviour.
*/
- result = nbd_establish_connection(bs);
+ result = nbd_establish_connection(s);
return result;
}
@@ -494,7 +520,7 @@ static void nbd_close(BlockDriverState *bs)
g_free(s->export_name);
g_free(s->host_spec);
- nbd_teardown_connection(bs);
+ nbd_teardown_connection(s, true);
}
static int64_t nbd_getlength(BlockDriverState *bs)
* [Qemu-devel] [PATCH 3/3] nbd: Move reconnection attempts from each new I/O request to a 5-second timer
From: nick @ 2012-10-22 11:09 UTC (permalink / raw)
To: qemu-devel; +Cc: pbonzini, Nick Thomas
We don't want the guest to be able to control the reconnection rate (by sending
lots of I/O requests to the block device), so we try once every 5 seconds and
fail I/O requests that happen while we're disconnected.
TODO: Allow the reconnect delay to be specified as an option.
TODO: Queue, rather than fail, I/O requests when disconnected
Signed-off-by: Nick Thomas <nick@bytemark.co.uk>
---
block/nbd.c | 54 ++++++++++++++++++++++++++++++++++++++++++++----------
1 files changed, 44 insertions(+), 10 deletions(-)
diff --git a/block/nbd.c b/block/nbd.c
index 1caae89..e55a9a7 100644
--- a/block/nbd.c
+++ b/block/nbd.c
@@ -50,6 +50,8 @@
#define HANDLE_TO_INDEX(bs, handle) ((handle) ^ ((uint64_t)(intptr_t)bs))
#define INDEX_TO_HANDLE(bs, index) ((index) ^ ((uint64_t)(intptr_t)bs))
+#define RECONNECT_DELAY 5000
+
typedef struct BDRVNBDState {
int sock;
uint32_t nbdflags;
@@ -62,6 +64,8 @@ typedef struct BDRVNBDState {
Coroutine *send_coroutine;
int in_flight;
+ QEMUTimer *recon_timer;
+
Coroutine *recv_coroutine[MAX_NBD_REQUESTS];
struct nbd_reply reply;
@@ -125,6 +129,27 @@ out:
return err;
}
+static void nbd_reconnect(void *opaque)
+{
+ BDRVNBDState *s = opaque;
+
+ if (nbd_is_connected(s)) {
+ logout("Already connected to NBD server, disarming timer\n");
+ qemu_del_timer(s->recon_timer);
+ return;
+ }
+
+ if (nbd_establish_connection(s) == 0) {
+ logout("Reconnected to server, disabling reconnect timer\n");
+ qemu_del_timer(s->recon_timer);
+ } else {
+ logout("Failed to reconnect, trying again in %ims\n", RECONNECT_DELAY);
+ int64_t delay = qemu_get_clock_ms(rt_clock) + RECONNECT_DELAY;
+ qemu_mod_timer(s->recon_timer, delay);
+ }
+
+}
+
static void nbd_coroutine_start(BDRVNBDState *s, struct nbd_request *request)
{
int i;
@@ -155,11 +180,15 @@ static int nbd_have_request(void *opaque)
return s->in_flight > 0;
}
-static void nbd_abort_inflight_requests(BDRVNBDState *s)
+static void nbd_disconnect_for_reconnect(BDRVNBDState *s)
{
int i;
Coroutine *co;
+ int64_t recon_time;
+ nbd_teardown_connection(s, false);
+
+ /* Abort all in-flight requests. TODO: these should be retried */
s->reply.handle = 0;
for (i = 0; i < MAX_NBD_REQUESTS; i++) {
co = s->recv_coroutine[i];
@@ -167,6 +196,9 @@ static void nbd_abort_inflight_requests(BDRVNBDState *s)
qemu_coroutine_enter(co, NULL);
}
}
+
+ recon_time = qemu_get_clock_ms(rt_clock) + RECONNECT_DELAY;
+ qemu_mod_timer(s->recon_timer, recon_time);
}
static void nbd_reply_ready(void *opaque)
@@ -185,8 +217,7 @@ static void nbd_reply_ready(void *opaque)
return;
}
if (ret < 0) {
- nbd_teardown_connection(s, false);
- nbd_abort_inflight_requests(s);
+ nbd_disconnect_for_reconnect(s);
return;
}
}
@@ -196,8 +227,7 @@ static void nbd_reply_ready(void *opaque)
* one coroutine is called until the reply finishes. */
i = HANDLE_TO_INDEX(s, s->reply.handle);
if (i >= MAX_NBD_REQUESTS) {
- nbd_teardown_connection(s, false);
- nbd_abort_inflight_requests(s);
+ nbd_disconnect_for_reconnect(s);
return;
}
@@ -218,14 +248,14 @@ static int nbd_co_send_request(BDRVNBDState *s, struct nbd_request *request,
{
int rc, ret;
- qemu_co_mutex_lock(&s->send_mutex);
-
- if (!nbd_is_connected(s) && nbd_establish_connection(s) != 0) {
- nbd_abort_inflight_requests(s);
- qemu_co_mutex_unlock(&s->send_mutex);
+ /* TODO: We should be able to add this to the queue regardless, once
+ * coroutines retry if not connected */
+ if (!nbd_is_connected(s)) {
return -EIO;
}
+ qemu_co_mutex_lock(&s->send_mutex);
+
s->send_coroutine = qemu_coroutine_self();
qemu_aio_set_fd_handler(s->sock, nbd_reply_ready, nbd_restart_write,
nbd_have_request, s);
@@ -349,6 +379,7 @@ static int nbd_open(BlockDriverState *bs, const char* filename, int flags)
int result;
s->sock = -1;
+ s->recon_timer = qemu_new_timer_ms(rt_clock, nbd_reconnect, s);
qemu_co_mutex_init(&s->send_mutex);
qemu_co_mutex_init(&s->free_sema);
@@ -520,6 +551,9 @@ static void nbd_close(BlockDriverState *bs)
g_free(s->export_name);
g_free(s->host_spec);
+ qemu_del_timer(s->recon_timer);
+ qemu_free_timer(s->recon_timer);
+
nbd_teardown_connection(s, true);
}
* Re: [Qemu-devel] [PATCH 1/3] nbd: Only try to send flush/discard commands if connected to the NBD server
From: Kevin Wolf @ 2012-10-23 10:33 UTC (permalink / raw)
To: nick; +Cc: pbonzini, qemu-devel
On 22.10.2012 13:09, nick@bytemark.co.uk wrote:
>
> This is unlikely to come up now, but is a necessary prerequisite for reconnection
> behaviour.
>
> Signed-off-by: Nick Thomas <nick@bytemark.co.uk>
> ---
> block/nbd.c | 13 +++++++++++--
> 1 files changed, 11 insertions(+), 2 deletions(-)
What's the real requirement here? Silently ignoring a flush and
returning success for it feels wrong. Why is it correct?
Kevin
* Re: [Qemu-devel] [PATCH 2/3] nbd: Explicitly disconnect and fail inflight I/O requests on error, then reconnect next I/O request.
From: Kevin Wolf @ 2012-10-23 10:40 UTC (permalink / raw)
To: nick; +Cc: pbonzini, qemu-devel
On 22.10.2012 13:09, nick@bytemark.co.uk wrote:
>
> The previous behaviour when an NBD connection broke was to fail just the broken I/O request
> and (sometimes) never unlock send_mutex. Now we explicitly call nbd_teardown_connection and
> fail all NBD requests in the "inflight" state - this allows werror/rerror settings to be
> applied to those requests all at once.
>
> When a new request (or a request that was pending, but not yet inflight) is issued against
> the NBD driver, if we're not connected to the NBD server, we attempt to connect (and fail
> the new request immediately if that doesn't work).
Doesn't this block the vcpu while qemu is trying to establish a new
connection? When the network is down, I think this can take quite a
while. I would actually expect that this is one of the cases where qemu
as a whole would seem to hang.
Kevin
* Re: [Qemu-devel] [PATCH 1/3] nbd: Only try to send flush/discard commands if connected to the NBD server
From: Nicholas Thomas @ 2012-10-23 11:08 UTC (permalink / raw)
To: Kevin Wolf; +Cc: pbonzini, qemu-devel
On Tue, 2012-10-23 at 12:33 +0200, Kevin Wolf wrote:
> On 22.10.2012 13:09, nick@bytemark.co.uk wrote:
> >
> > This is unlikely to come up now, but is a necessary prerequisite for reconnection
> > behaviour.
> >
> > Signed-off-by: Nick Thomas <nick@bytemark.co.uk>
> > ---
> > block/nbd.c | 13 +++++++++++--
> > 1 files changed, 11 insertions(+), 2 deletions(-)
>
> What's the real requirement here? Silently ignoring a flush and
> returning success for it feels wrong. Why is it correct?
>
> Kevin
I just needed to avoid socket operations while s->sock == -1, and
extending the existing case of "can't do the command, so pretend I did
it" to "can't do the command right now, so pretend..." seemed like an
easy way out.
In the Bytemark case, the NBD server always opens the file O_SYNC, so
nbd_co_flush could check in_flight == 0 and return 0/1 based on that;
but I'd be surprised if that's true for all NBD servers. Should we be
returning 1 here for both "not supported" and "can't do it right now",
instead?
/Nick
* Re: [Qemu-devel] [PATCH 1/3] nbd: Only try to send flush/discard commands if connected to the NBD server
From: Kevin Wolf @ 2012-10-23 11:26 UTC (permalink / raw)
To: Nicholas Thomas; +Cc: pbonzini, qemu-devel
On 23.10.2012 13:08, Nicholas Thomas wrote:
> On Tue, 2012-10-23 at 12:33 +0200, Kevin Wolf wrote:
>> On 22.10.2012 13:09, nick@bytemark.co.uk wrote:
>>>
>>> This is unlikely to come up now, but is a necessary prerequisite for reconnection
>>> behaviour.
>>>
>>> Signed-off-by: Nick Thomas <nick@bytemark.co.uk>
>>> ---
>>> block/nbd.c | 13 +++++++++++--
>>> 1 files changed, 11 insertions(+), 2 deletions(-)
>>
>> What's the real requirement here? Silently ignoring a flush and
>> returning success for it feels wrong. Why is it correct?
>>
>> Kevin
>
> I just needed to avoid socket operations while s->sock == -1, and
> extending the existing case of "can't do the command, so pretend I did
> it" to "can't do the command right now, so pretend..." seemed like an
> easy way out.
The difference is that NBD_FLAG_SEND_FLUSH is only clear if the server
is configured like that, i.e. it's completely in the control of the
user, and it ignores flushes consistently (for example because it
already uses a writethrough mode so that flushes aren't required).
Network outage usually isn't in the control of the user and ignoring a
flush request when the server actually asked for flushes is completely
surprising and dangerous.
> In the Bytemark case, the NBD server always opens the file O_SYNC, so
> nbd_co_flush could check in_flight == 0 and return 0/1 based on that;
> but I'd be surprised if that's true for all NBD servers. Should we be
> returning 1 here for both "not supported" and "can't do it right now",
> instead?
The best return value is probably -ENOTCONN or -EIO if you must return
an error. But maybe using the same logic as for writes would be
appropriate, i.e. try to reestablish the connection before you return an
error.
Kevin
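As an illustration of the behaviour Kevin suggests, here is a minimal sketch
of nbd_co_flush() that tries to reconnect and otherwise reports an error
instead of pretending the flush succeeded. It assumes the nbd_is_connected()
and nbd_establish_connection() helpers from patches 1 and 2, and otherwise
mirrors the existing function (FUA handling omitted):
/* Sketch only: fail, or first try to reconnect, rather than silently
 * ignoring a flush while the connection is down. */
static int nbd_co_flush(BlockDriverState *bs)
{
    BDRVNBDState *s = bs->opaque;
    struct nbd_request request;
    struct nbd_reply reply;
    ssize_t ret;

    if (!(s->nbdflags & NBD_FLAG_SEND_FLUSH)) {
        /* The server never advertised flush, so ignoring it is expected. */
        return 0;
    }

    if (!nbd_is_connected(s) && nbd_establish_connection(s) != 0) {
        /* Still disconnected: report an error rather than pretending the
         * flush succeeded (-ENOTCONN would work here too). */
        return -EIO;
    }

    request.type = NBD_CMD_FLUSH;
    request.from = 0;
    request.len = 0;

    nbd_coroutine_start(s, &request);
    ret = nbd_co_send_request(s, &request, NULL, 0);
    if (ret < 0) {
        reply.error = -ret;
    } else {
        nbd_co_receive_reply(s, &request, &reply, NULL, 0);
    }
    nbd_coroutine_end(s, &request);
    return -reply.error;
}
Whether -EIO or -ENOTCONN is the better error, and whether the guest should
instead be made to wait for the reconnection, is exactly the open question
in this subthread.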
* Re: [Qemu-devel] [PATCH 1/3] nbd: Only try to send flush/discard commands if connected to the NBD server
From: Jamie Lokier @ 2012-10-23 15:02 UTC (permalink / raw)
To: Nicholas Thomas; +Cc: Kevin Wolf, pbonzini, qemu-devel
Nicholas Thomas wrote:
> On Tue, 2012-10-23 at 12:33 +0200, Kevin Wolf wrote:
> > On 22.10.2012 13:09, nick@bytemark.co.uk wrote:
> > >
> > > This is unlikely to come up now, but is a necessary prerequisite for reconnection
> > > behaviour.
> > >
> > > Signed-off-by: Nick Thomas <nick@bytemark.co.uk>
> > > ---
> > > block/nbd.c | 13 +++++++++++--
> > > 1 files changed, 11 insertions(+), 2 deletions(-)
> >
> > What's the real requirement here? Silently ignoring a flush and
> > returning success for it feels wrong. Why is it correct?
> >
> > Kevin
>
> I just needed to avoid socket operations while s->sock == -1, and
> extending the existing case of "can't do the command, so pretend I did
> it" to "can't do the command right now, so pretend..." seemed like an
> easy way out.
Hi Nicholas,
Ignoring a flush is another way of saying "corrupt my data" in some
circumstances. We have options in QEMU already to say whether flushes
are ignored on normal discs, but if someone's chosen the "I really
care about my database/filesystem" option, and verified that their NBD
setup really performs them (in normal circumstances), silently
dropping flushes from time to time isn't nice.
I would much rather the guest is forced to wait until reconnection and
then get a successful flush, if the problem is just that the server
was down briefly. Or, if that is too hard, that the flush is failed
with an error rather than silently ignored.
Since the I/O _order_ before, and sometimes after, flush, is important
for data integrity, this needs to be maintained when I/Os are queued in
the disconnected state -- including those which were inflight at the
time disconnect was detected and then retried on reconnect.
Ignoring a discard is not too bad. However, if discard is retried,
then I/O order is important in relation to those as well.
> In the Bytemark case, the NBD server always opens the file O_SYNC, so
> nbd_co_flush could check in_flight == 0 and return 0/1 based on that;
> but I'd be surprised if that's true for all NBD servers. Should we be
> returning 1 here for both "not supported" and "can't do it right now",
> instead?
When the server is opening the file O_SYNC, wouldn't it make sense to
tell QEMU -- and the guest -- that there's no need to send flushes at
all, as it's equivalent to a disk with no write-cache (or disabled)?
Best,
-- Jamie
* Re: [Qemu-devel] [PATCH 1/3] nbd: Only try to send flush/discard commands if connected to the NBD server
From: Nicholas Thomas @ 2012-10-24 12:16 UTC (permalink / raw)
To: Jamie Lokier; +Cc: Kevin Wolf, pbonzini, qemu-devel
On Tue, 2012-10-23 at 16:02 +0100, Jamie Lokier wrote:
> Nicholas Thomas wrote:
> > On Tue, 2012-10-23 at 12:33 +0200, Kevin Wolf wrote:
> > > On 22.10.2012 13:09, nick@bytemark.co.uk wrote:
> > > >
> > > > This is unlikely to come up now, but is a necessary prerequisite for reconnection
> > > > behaviour.
> > > >
> > > > Signed-off-by: Nick Thomas <nick@bytemark.co.uk>
> > > > ---
> > > > block/nbd.c | 13 +++++++++++--
> > > > 1 files changed, 11 insertions(+), 2 deletions(-)
> > >
> > > What's the real requirement here? Silently ignoring a flush and
> > > returning success for it feels wrong. Why is it correct?
> > >
> > > Kevin
> >
> > I just needed to avoid socket operations while s->sock == -1, and
> > extending the existing case of "can't do the command, so pretend I did
> > it" to "can't do the command right now, so pretend..." seemed like an
> > easy way out.
>
> Hi Nicholas,
>
> Ignoring a flush is another way of saying "corrupt my data" in some
> circumstances. We have options in QEMU already to say whether flushes
> are ignored on normal discs, but if someone's chosen the "I really
> care about my database/filesystem" option, and verified that their NBD
> setup really performs them (in normal circumstances), silently
> dropping flushes from time to time isn't nice.
>
> I would much rather the guest is forced to wait until reconnection and
> then get a successful flush, if the problem is just that the server
> was down briefly. Or, if that is too hard, that the flush is failed
> with an error rather than silently ignored.
OK, that's certainly doable. Revised patches hopefully showing up this
week or (at a push) next.
> Since the I/O _order_ before, and sometimes after, flush, is important
> for data integrity, this needs to be maintained when I/Os are queued in
> the disconnected state -- including those which were inflight at the
> time disconnect was detected and then retried on reconnect.
Hmm, discussing this on IRC I was told that it wasn't necessary to
preserve order - although I forget the fine detail. Depending on the
implementation of qemu's coroutine mutexes, operations may not actually
be performed in order right now - it's not too easy to work out what's
happening.
I've also just noticed that flush & discard don't take the send_mutex
before writing to the socket. That can't be intentional, surely? Paolo?
I've got a proof-of-concept that makes flush, discard, readv_1 and
writev_1 call a common nbd_co_handle_request function. This retries I/Os
when a connection goes away and comes back, and lets me make this patch
unnecessary. It doesn't preserve I/O ordering, but it won't be too hard
to add that if it is actually necessary.
> Ignoring a discard is not too bad. However, if discard is retried,
> then I/O order is important in relation to those as well.
>
> > In the Bytemark case, the NBD server always opens the file O_SYNC, so
> > nbd_co_flush could check in_flight == 0 and return 0/1 based on that;
> > but I'd be surprised if that's true for all NBD servers. Should we be
> > returning 1 here for both "not supported" and "can't do it right now",
> > instead?
>
> When the server is opening the file O_SYNC, wouldn't it make sense to
> tell QEMU -- and the guest -- that there's no need to send flushes at
> all, as it's equivalent to a disk with no write-cache (or disabled)?
Probably; our NBD server still implements the old-style handshakes, as
far as I recall, and doesn't actually implement the flush or discard
commands as a result.
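For reference, a very rough sketch of the shape that common path might take,
parking callers on a coroutine queue until the reconnect timer from patch 3
brings the connection back. The nbd_co_handle_request() name matches the
proof-of-concept mentioned above, but the body and the reconnect_wait CoQueue
are guesses, and no attempt is made to preserve ordering:
/* Sketch only: one request path for readv/writev/flush/discard that waits
 * for the reconnect timer instead of failing with -EIO while disconnected.
 * s->reconnect_wait is a hypothetical CoQueue, initialised in nbd_open()
 * and woken (qemu_co_queue_restart_all()) once nbd_reconnect() succeeds. */
static int coroutine_fn nbd_co_handle_request(BDRVNBDState *s,
                                              struct nbd_request *request,
                                              QEMUIOVector *qiov,
                                              int offset)
{
    struct nbd_reply reply;
    ssize_t ret;

    while (!nbd_is_connected(s)) {
        /* Parked here until the 5-second timer re-establishes the
         * connection; ordering concerns are discussed above. */
        qemu_co_queue_wait(&s->reconnect_wait);
    }

    nbd_coroutine_start(s, request);
    ret = nbd_co_send_request(s, request, qiov, offset);
    if (ret < 0) {
        reply.error = -ret;
    } else {
        nbd_co_receive_reply(s, request, &reply, qiov, offset);
    }
    nbd_coroutine_end(s, request);
    return -reply.error;
}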
* Re: [Qemu-devel] [PATCH 1/3] nbd: Only try to send flush/discard commands if connected to the NBD server
From: Kevin Wolf @ 2012-10-24 12:57 UTC (permalink / raw)
To: Nicholas Thomas; +Cc: pbonzini, qemu-devel
On 24.10.2012 14:16, Nicholas Thomas wrote:
> On Tue, 2012-10-23 at 16:02 +0100, Jamie Lokier wrote:
>> Since the I/O _order_ before, and sometimes after, flush, is important
>> for data integrity, this needs to be maintained when I/Os are queued in
>> the disconnected state -- including those which were inflight at the
>> time disconnect was detected and then retried on reconnect.
>
> Hmm, discussing this on IRC I was told that it wasn't necessary to
> preserve order - although I forget the fine detail. Depending on the
> implementation of qemu's coroutine mutexes, operations may not actually
> be performed in order right now - it's not too easy to work out what's
> happening.
It's possible to reorder, but it must be consistent with the order in
which completion is signalled to the guest. The semantics of flush is
that at the point that the flush completes, all writes to the disk that
already have completed successfully are stable. It doesn't say anything
about writes that are still in flight, they may or may not be flushed to
disk.
Kevin
* Re: [Qemu-devel] [PATCH 1/3] nbd: Only try to send flush/discard commands if connected to the NBD server
From: Paolo Bonzini @ 2012-10-24 14:03 UTC (permalink / raw)
To: Nicholas Thomas; +Cc: Kevin Wolf, qemu-devel
On 24/10/2012 14:16, Nicholas Thomas wrote:
>
> I've also just noticed that flush & discard don't take the send_mutex
> before writing to the socket. That can't be intentional, surely? Paolo?
No, it's a bug.
Paolo
* Re: [Qemu-devel] [PATCH 1/3] nbd: Only try to send flush/discard commands if connected to the NBD server
From: Paolo Bonzini @ 2012-10-24 14:10 UTC (permalink / raw)
Cc: Kevin Wolf, qemu-devel, Nicholas Thomas
On 24/10/2012 16:03, Paolo Bonzini wrote:
> On 24/10/2012 14:16, Nicholas Thomas wrote:
>>
>> I've also just noticed that flush & discard don't take the send_mutex
>> before writing to the socket. That can't be intentional, surely? Paolo?
>
> No, it's a bug.
Hmm, why do you say it doesn't take the mutex? nbd_co_flush and
nbd_co_discard all call nbd_co_send_request, which takes the mutex.
Paolo
* Re: [Qemu-devel] [PATCH 1/3] nbd: Only try to send flush/discard commands if connected to the NBD server
From: Nicholas Thomas @ 2012-10-24 14:12 UTC (permalink / raw)
To: Paolo Bonzini; +Cc: Kevin Wolf, qemu-devel
On Wed, 2012-10-24 at 16:10 +0200, Paolo Bonzini wrote:
> On 24/10/2012 16:03, Paolo Bonzini wrote:
> > On 24/10/2012 14:16, Nicholas Thomas wrote:
> >>
> >> I've also just noticed that flush & discard don't take the send_mutex
> >> before writing to the socket. That can't be intentional, surely? Paolo?
> >
> > No, it's a bug.
>
> Hmm, why do you say it doesn't take the mutex? nbd_co_flush and
> nbd_co_discard all call nbd_co_send_request, which takes the mutex.
>
> Paolo
Sorry, brain fail. You're right: no bug :)
/Nick
* Re: [Qemu-devel] [PATCH 2/3] nbd: Explicitly disconnect and fail inflight I/O requests on error, then reconnect next I/O request.
From: Paolo Bonzini @ 2012-10-24 14:31 UTC (permalink / raw)
To: nick; +Cc: qemu-devel
On 22/10/2012 13:09, nick@bytemark.co.uk wrote:
> diff --git a/block/nbd.c b/block/nbd.c
> index 9536408..1caae89 100644
> --- a/block/nbd.c
> +++ b/block/nbd.c
> @@ -71,6 +71,9 @@ typedef struct BDRVNBDState {
> char *host_spec;
> } BDRVNBDState;
>
> +static int nbd_establish_connection(BDRVNBDState *s);
> +static void nbd_teardown_connection(BDRVNBDState *s, bool send_disconnect);
> +
> static bool nbd_is_connected(BDRVNBDState *s)
> {
> return s->sock >= 0;
> @@ -152,6 +155,20 @@ static int nbd_have_request(void *opaque)
> return s->in_flight > 0;
> }
>
> +static void nbd_abort_inflight_requests(BDRVNBDState *s)
> +{
> + int i;
> + Coroutine *co;
> +
> + s->reply.handle = 0;
> + for (i = 0; i < MAX_NBD_REQUESTS; i++) {
> + co = s->recv_coroutine[i];
> + if (co && co != qemu_coroutine_self()) {
> + qemu_coroutine_enter(co, NULL);
> + }
> + }
> +}
I think this is quite risky. Does it work if you shutdown(s, 2) the
socket, then wait for the number of pending requests (not just those
in_flight---also those that are waiting) to become 0, and only then
close it?
(For the wait you can add another Coroutine * field to BDRVNBDState, and
reenter it in nbd_coroutine_end if the number of pending requests
becomes zero).
> static void nbd_reply_ready(void *opaque)
> {
> BDRVNBDState *s = opaque;
> @@ -168,8 +185,9 @@ static void nbd_reply_ready(void *opaque)
> return;
> }
> if (ret < 0) {
> - s->reply.handle = 0;
> - goto fail;
> + nbd_teardown_connection(s, false);
> + nbd_abort_inflight_requests(s);
This is also problematic because you first close the socket, which means
that something else can grab the file descriptor. But the original file
descriptor is in use (in qemu_co_recvv or qemu_co_sendv) until after
nbd_abort_inflight_requests returns.
So the correct order would be something like this:
assert(nbd_is_connected(s));
shutdown(s->sock, 2);
nbd_abort_inflight_requests(s);
nbd_teardown_connection(s, false);
where (if my theory is right) the shutdown should immediately cause the
socket to become readable and writable.
> + return;
> }
> }
>
> @@ -178,20 +196,15 @@ static void nbd_reply_ready(void *opaque)
> * one coroutine is called until the reply finishes. */
> i = HANDLE_TO_INDEX(s, s->reply.handle);
> if (i >= MAX_NBD_REQUESTS) {
> - goto fail;
> + nbd_teardown_connection(s, false);
> + nbd_abort_inflight_requests(s);
> + return;
> }
>
> if (s->recv_coroutine[i]) {
> qemu_coroutine_enter(s->recv_coroutine[i], NULL);
> return;
> }
> -
> -fail:
> - for (i = 0; i < MAX_NBD_REQUESTS; i++) {
> - if (s->recv_coroutine[i]) {
> - qemu_coroutine_enter(s->recv_coroutine[i], NULL);
> - }
> - }
> }
>
> static void nbd_restart_write(void *opaque)
> @@ -206,6 +219,13 @@ static int nbd_co_send_request(BDRVNBDState *s, struct nbd_request *request,
> int rc, ret;
>
> qemu_co_mutex_lock(&s->send_mutex);
> +
> + if (!nbd_is_connected(s) && nbd_establish_connection(s) != 0) {
> + nbd_abort_inflight_requests(s);
> + qemu_co_mutex_unlock(&s->send_mutex);
> + return -EIO;
Do you really need to abort all the requests, or only this one?
Paolo
> + }
> +
> s->send_coroutine = qemu_coroutine_self();
> qemu_aio_set_fd_handler(s->sock, nbd_reply_ready, nbd_restart_write,
> nbd_have_request, s);
> @@ -214,6 +234,9 @@ static int nbd_co_send_request(BDRVNBDState *s, struct nbd_request *request,
> ret = qemu_co_sendv(s->sock, qiov->iov, qiov->niov,
> offset, request->len);
> if (ret != request->len) {
> + s->send_coroutine = NULL;
> + nbd_teardown_connection(s, false);
> + qemu_co_mutex_unlock(&s->send_mutex);
> return -EIO;
> }
> }
> @@ -259,9 +282,8 @@ static void nbd_coroutine_end(BDRVNBDState *s, struct nbd_request *request)
> }
> }
>
> -static int nbd_establish_connection(BlockDriverState *bs)
> +static int nbd_establish_connection(BDRVNBDState *s)
> {
> - BDRVNBDState *s = bs->opaque;
> int sock;
> int ret;
> off_t size;
> @@ -302,19 +324,23 @@ static int nbd_establish_connection(BlockDriverState *bs)
> return 0;
> }
>
> -static void nbd_teardown_connection(BlockDriverState *bs)
> +static void nbd_teardown_connection(BDRVNBDState *s, bool send_disconnect)
> {
> - BDRVNBDState *s = bs->opaque;
> +
> struct nbd_request request;
>
> - request.type = NBD_CMD_DISC;
> - request.from = 0;
> - request.len = 0;
> - nbd_send_request(s->sock, &request);
> + if (nbd_is_connected(s)) {
> + if (send_disconnect) {
> + request.type = NBD_CMD_DISC;
> + request.from = 0;
> + request.len = 0;
> + nbd_send_request(s->sock, &request);
> + }
>
> - qemu_aio_set_fd_handler(s->sock, NULL, NULL, NULL, NULL);
> - closesocket(s->sock);
> - s->sock = -1;
> + qemu_aio_set_fd_handler(s->sock, NULL, NULL, NULL, NULL);
> + closesocket(s->sock);
> + s->sock = -1; /* Makes nbd_is_connected() return false */
> + }
> }
>
> static int nbd_open(BlockDriverState *bs, const char* filename, int flags)
> @@ -336,7 +362,7 @@ static int nbd_open(BlockDriverState *bs, const char* filename, int flags)
> /* establish TCP connection, return error if it fails
> * TODO: Configurable retry-until-timeout behaviour.
> */
> - result = nbd_establish_connection(bs);
> + result = nbd_establish_connection(s);
>
> return result;
> }
> @@ -494,7 +520,7 @@ static void nbd_close(BlockDriverState *bs)
> g_free(s->export_name);
> g_free(s->host_spec);
>
> - nbd_teardown_connection(bs);
> + nbd_teardown_connection(s, true);
> }
>
> static int64_t nbd_getlength(BlockDriverState *bs)
>
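Restating the ordering Paolo describes as a sketch of a helper (the function
name is invented; the calls are the ones already used in this series):
/* Sketch only: handle a broken connection in the order suggested above,
 * so no coroutine is still using the file descriptor when it is closed.
 * nbd_is_connected(), nbd_abort_inflight_requests() and
 * nbd_teardown_connection() come from patches 1 and 2. */
static void nbd_connection_broke(BDRVNBDState *s)
{
    if (!nbd_is_connected(s)) {
        return;
    }

    /* Make the socket immediately readable/writable with an error, so any
     * coroutine blocked in qemu_co_recvv()/qemu_co_sendv() returns. */
    shutdown(s->sock, 2);

    /* Wake and fail (or, later, requeue) everything that was in flight. */
    nbd_abort_inflight_requests(s);

    /* Only now is it safe to drop the fd handlers and close the socket. */
    nbd_teardown_connection(s, false);
}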
* Re: [Qemu-devel] [PATCH 1/3] nbd: Only try to send flush/discard commands if connected to the NBD server
From: Jamie Lokier @ 2012-10-24 14:32 UTC (permalink / raw)
To: Kevin Wolf; +Cc: pbonzini, qemu-devel, Nicholas Thomas
Kevin Wolf wrote:
> On 24.10.2012 14:16, Nicholas Thomas wrote:
> > On Tue, 2012-10-23 at 16:02 +0100, Jamie Lokier wrote:
> >> Since the I/O _order_ before, and sometimes after, flush, is important
> >> for data integrity, this needs to be maintained when I/Os are queued in
> >> the disconnected state -- including those which were inflight at the
> >> time disconnect was detected and then retried on reconnect.
> >
> > Hmm, discussing this on IRC I was told that it wasn't necessary to
> > preserve order - although I forget the fine detail. Depending on the
> > implementation of qemu's coroutine mutexes, operations may not actually
> > be performed in order right now - it's not too easy to work out what's
> > happening.
>
> It's possible to reorder, but it must be consistent with the order in
> which completion is signalled to the guest. The semantics of flush is
> that at the point that the flush completes, all writes to the disk that
> already have completed successfully are stable. It doesn't say anything
> about writes that are still in flight, they may or may not be flushed to
> disk.
I admit I wasn't thinking clearly how much ordering NBD actually
guarantees (or if there's ordering the guest depends on implicitly
even if it's not guaranteed in specification), and how that is related
within QEMU to virtio/FUA/NCQ/TCQ/SCSI-ORDERED ordering guarantees
that the guest expects for various emulated devices and their settings.
The ordering (if any) needed from the NBD driver (or any backend) is
going to depend on the assumptions baked into the interface between
QEMU device emulation <-> backend.
E.g. if every device emulation waited for all outstanding writes to
complete before sending a flush, then it wouldn't matter how the
backend reordered its requests, even getting the completions out of
order.
Is that relationship documented (and conformed to)?
-- Jamie
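A small sketch of the pattern Jamie describes, with a hypothetical device
model; only bdrv_co_flush() and qemu_co_queue_wait() are existing QEMU calls,
MyDevice and its fields are made up:
/* Sketch only: a device model that wants "everything I have completed so
 * far is stable" semantics can wait for its own outstanding writes before
 * issuing the flush; then no backend reordering can hurt it. */
typedef struct MyDevice {
    BlockDriverState *bs;
    int in_flight_writes;   /* decremented as each write completes */
    CoQueue write_done;     /* restarted whenever a write completes */
} MyDevice;

static int coroutine_fn my_device_co_flush(MyDevice *dev)
{
    while (dev->in_flight_writes > 0) {
        qemu_co_queue_wait(&dev->write_done);
    }

    /* The flush now only has to cover writes that already completed. */
    return bdrv_co_flush(dev->bs);
}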
* Re: [Qemu-devel] [PATCH 1/3] nbd: Only try to send flush/discard commands if connected to the NBD server
From: Paolo Bonzini @ 2012-10-24 15:16 UTC (permalink / raw)
To: Jamie Lokier; +Cc: Kevin Wolf, qemu-devel, Nicholas Thomas
On 24/10/2012 16:32, Jamie Lokier wrote:
> Kevin Wolf wrote:
>> On 24.10.2012 14:16, Nicholas Thomas wrote:
>>> On Tue, 2012-10-23 at 16:02 +0100, Jamie Lokier wrote:
>>>> Since the I/O _order_ before, and sometimes after, flush, is important
>>>> for data integrity, this needs to be maintained when I/Os are queued in
>>>> the disconnected state -- including those which were inflight at the
>>>> time disconnect was detected and then retried on reconnect.
>>>
>>> Hmm, discussing this on IRC I was told that it wasn't necessary to
>>> preserve order - although I forget the fine detail. Depending on the
>>> implementation of qemu's coroutine mutexes, operations may not actually
>>> be performed in order right now - it's not too easy to work out what's
>>> happening.
>>
>> It's possible to reorder, but it must be consistent with the order in
>> which completion is signalled to the guest. The semantics of flush is
>> that at the point that the flush completes, all writes to the disk that
>> already have completed successfully are stable. It doesn't say anything
>> about writes that are still in flight, they may or may not be flushed to
>> disk.
>
> I admit I wasn't thinking clearly how much ordering NBD actually
> guarantees (or if there's ordering the guest depends on implicitly
> even if it's not guaranteed in specification),
Here NBD is used just as a transport on the host and totally invisible
to the guest. So NBD pretty much has to implement whatever ordering
guarantees the guest needs.
> E.g. if every device emulation waited for all outstanding writes to
> complete before sending a flush, then it wouldn't matter how the
> backend reordered its requests, even getting the completions out of
> order.
The NBD implementation in QEMU (which Nick is using) is completely
asynchronous.
Paolo
> Is that relationship documented (and conformed to)?
>
> -- Jamie
>
* Re: [Qemu-devel] [PATCH 1/3] nbd: Only try to send flush/discard commands if connected to the NBD server
From: Kevin Wolf @ 2012-10-25 6:36 UTC (permalink / raw)
To: Jamie Lokier; +Cc: pbonzini, qemu-devel, Nicholas Thomas
On 24.10.2012 16:32, Jamie Lokier wrote:
> Kevin Wolf wrote:
>> On 24.10.2012 14:16, Nicholas Thomas wrote:
>>> On Tue, 2012-10-23 at 16:02 +0100, Jamie Lokier wrote:
>>>> Since the I/O _order_ before, and sometimes after, flush, is important
>>>> for data integrity, this needs to be maintained when I/Os are queued in
>>>> the disconnected state -- including those which were inflight at the
>>>> time disconnect was detected and then retried on reconnect.
>>>
>>> Hmm, discussing this on IRC I was told that it wasn't necessary to
>>> preserve order - although I forget the fine detail. Depending on the
>>> implementation of qemu's coroutine mutexes, operations may not actually
>>> be performed in order right now - it's not too easy to work out what's
>>> happening.
>>
>> It's possible to reorder, but it must be consistent with the order in
>> which completion is signalled to the guest. The semantics of flush is
>> that at the point that the flush completes, all writes to the disk that
>> already have completed successfully are stable. It doesn't say anything
>> about writes that are still in flight, they may or may not be flushed to
>> disk.
>
> I admit I wasn't thinking clearly how much ordering NBD actually
> guarantees (or if there's ordering the guest depends on implicitly
> even if it's not guaranteed in specification), and how that is related
> within QEMU to virtio/FUA/NCQ/TCQ/SCSI-ORDERED ordering guarantees
> that the guest expects for various emulated devices and their settings.
>
> The ordering (if any) needed from the NBD driver (or any backend) is
> going to depend on the assumptions baked into the interface between
> QEMU device emulation <-> backend.
>
> E.g. if every device emulation waited for all outstanding writes to
> complete before sending a flush, then it wouldn't matter how the
> backend reordered its requests, even getting the completions out of
> order.
>
> Is that relationship documented (and conformed to)?
No, like so many other things in qemu it's not spelt out explicitly.
However, as I understand it it's the same behaviour as real hardware
has, so device emulation at least for the common devices doesn't have to
implement anything special for it. If the hardware even supports
parallel requests, otherwise it would automatically only have a single
request in flight (like IDE).
Kevin
* Re: [Qemu-devel] [PATCH 1/3] nbd: Only try to send flush/discard commands if connected to the NBD server
From: Jamie Lokier @ 2012-10-25 17:09 UTC (permalink / raw)
To: Kevin Wolf; +Cc: pbonzini, qemu-devel, Nicholas Thomas
Kevin Wolf wrote:
> On 24.10.2012 16:32, Jamie Lokier wrote:
> > Kevin Wolf wrote:
> >> On 24.10.2012 14:16, Nicholas Thomas wrote:
> >>> On Tue, 2012-10-23 at 16:02 +0100, Jamie Lokier wrote:
> >>>> Since the I/O _order_ before, and sometimes after, flush, is important
> >>>> for data integrity, this needs to be maintained when I/Os are queued in
> >>>> the disconnected state -- including those which were inflight at the
> >>>> time disconnect was detected and then retried on reconnect.
> >>>
> >>> Hmm, discussing this on IRC I was told that it wasn't necessary to
> >>> preserve order - although I forget the fine detail. Depending on the
> >>> implementation of qemu's coroutine mutexes, operations may not actually
> >>> be performed in order right now - it's not too easy to work out what's
> >>> happening.
> >>
> >> It's possible to reorder, but it must be consistent with the order in
> >> which completion is signalled to the guest. The semantics of flush is
> >> that at the point that the flush completes, all writes to the disk that
> >> already have completed successfully are stable. It doesn't say anything
> >> about writes that are still in flight, they may or may not be flushed to
> >> disk.
> >
> > I admit I wasn't thinking clearly how much ordering NBD actually
> > guarantees (or if there's ordering the guest depends on implicitly
> > even if it's not guaranteed in specification), and how that is related
> > within QEMU to virtio/FUA/NCQ/TCQ/SCSI-ORDERED ordering guarantees
> > that the guest expects for various emulated devices and their settings.
> >
> > The ordering (if any) needed from the NBD driver (or any backend) is
> > going to depend on the assumptions baked into the interface between
> > QEMU device emulation <-> backend.
> >
> > E.g. if every device emulation waited for all outstanding writes to
> > complete before sending a flush, then it wouldn't matter how the
> > backend reordered its requests, even getting the completions out of
> > order.
> >
> > Is that relationship documented (and conformed to)?
>
> No, like so many other things in qemu it's not spelt out explicitly.
> However, as I understand it it's the same behaviour as real hardware
> has, so device emulation at least for the common devices doesn't have to
> implement anything special for it. If the hardware even supports
> parallel requests, otherwise it would automatically only have a single
> request in flight (like IDE).
That's why I mention virtio/FUA/NCQ/TCQ/SCSI-ORDERED, which are quite
common.
They are features of devices which support multiple parallel requests,
but with certain ordering constraints conveyed by or expected by the
guest, which has to be ensured when it's mapped onto a QEMU fully
asynchronous backend.
That means they are features of the hardware which device emulations
_do_ have to implement. If they don't, the storage is unreliable on
things like host power removal and virtual power removal.
If the backends are allowed to explicitly have no coupling between
different request types (even flush/discard and write), and ordering
constraints are being enforced by the order in which device emulations
submit and wait, that's fine.
I mention this, because POSIX aio_fsync() is _not_ fully decoupled
according to its specification.
So it might be that some device emulations are depending on the
semantics of aio_fsync() or the QEMU equivalent by now; and randomly
reordering in the NBD driver in unusual circumstances (or any other
backend), would break those semantics.
-- Jamie
* Re: [Qemu-devel] [PATCH 1/3] nbd: Only try to send flush/discard commands if connected to the NBD server
From: Kevin Wolf @ 2012-10-26 7:59 UTC (permalink / raw)
To: Jamie Lokier; +Cc: pbonzini, qemu-devel, Nicholas Thomas
On 25.10.2012 19:09, Jamie Lokier wrote:
> Kevin Wolf wrote:
>> On 24.10.2012 16:32, Jamie Lokier wrote:
>>> Kevin Wolf wrote:
>>>> On 24.10.2012 14:16, Nicholas Thomas wrote:
>>>>> On Tue, 2012-10-23 at 16:02 +0100, Jamie Lokier wrote:
>>>>>> Since the I/O _order_ before, and sometimes after, flush, is important
>>>>>> for data integrity, this needs to be maintained when I/Os are queued in
>>>>>> the disconnected state -- including those which were inflight at the
>>>>>> time disconnect was detected and then retried on reconnect.
>>>>>
>>>>> Hmm, discussing this on IRC I was told that it wasn't necessary to
>>>>> preserve order - although I forget the fine detail. Depending on the
>>>>> implementation of qemu's coroutine mutexes, operations may not actually
>>>>> be performed in order right now - it's not too easy to work out what's
>>>>> happening.
>>>>
>>>> It's possible to reorder, but it must be consistent with the order in
>>>> which completion is signalled to the guest. The semantics of flush is
>>>> that at the point that the flush completes, all writes to the disk that
>>>> already have completed successfully are stable. It doesn't say anything
>>>> about writes that are still in flight, they may or may not be flushed to
>>>> disk.
>>>
>>> I admit I wasn't thinking clearly how much ordering NBD actually
>>> guarantees (or if there's ordering the guest depends on implicitly
>>> even if it's not guaranteed in specification), and how that is related
>>> within QEMU to virtio/FUA/NCQ/TCQ/SCSI-ORDERED ordering guarantees
>>> that the guest expects for various emulated devices and their settings.
>>>
>>> The ordering (if any) needed from the NBD driver (or any backend) is
>>> going to depend on the assumptions baked into the interface between
>>> QEMU device emulation <-> backend.
>>>
>>> E.g. if every device emulation waited for all outstanding writes to
>>> complete before sending a flush, then it wouldn't matter how the
>>> backend reordered its requests, even getting the completions out of
>>> order.
>>>
>>> Is that relationship documented (and conformed to)?
>>
>> No, like so many other things in qemu it's not spelt out explicitly.
>> However, as I understand it it's the same behaviour as real hardware
>> has, so device emulation at least for the common devices doesn't have to
>> implement anything special for it. If the hardware even supports
>> parallel requests, otherwise it would automatically only have a single
>> request in flight (like IDE).
>
> That's why I mention virtio/FUA/NCQ/TCQ/SCSI-ORDERED, which are quite
> common.
>
> They are features of devices which support multiple parallel requests,
> but with certain ordering constraints conveyed by or expected by the
> guest, which has to be ensured when it's mapped onto a QEMU fully
> asynchronous backend.
>
> That means they are features of the hardware which device emulations
> _do_ have to implement. If they don't, the storage is unreliable on
> things like host power removal and virtual power removal.
Yes, device emulations that need to maintain a given order must pay
attention to wait for completion of the previous requests.
> If the backends are allowed to explicitly have no coupling between
> different request types (even flush/discard and write), and ordering
> constraints are being enforced by the order in which device emulations
> submit and wait, that's fine.
>
> I mention this, because POSIX aio_fsync() is _not_ fully decoupled
> according to its specification.
>
> So it might be that some device emulations are depending on the
> semantics of aio_fsync() or the QEMU equivalent by now; and randomly
> reordering in the NBD driver in unusual circumstances (or any other
> backend), would break those semantics.
qemu AIO has always had this semantics since bdrv_aio_flush() was
introduced. It behaves the same way for image files. So I don't see any
problem with NBD making use of the same.
Kevin