* [Qemu-devel] [PATCH RFC 0/5] disk deadlines
From: Denis V. Lunev @ 2015-09-08 8:00 UTC (permalink / raw)
Cc: Kevin Wolf, Denis V. Lunev, Stefan Hajnoczi, qemu-devel,
Raushaniya Maksudova
Description of the problem:
The client and the server interact via Network File System (NFS) or other
network storage such as Ceph. The server holds an image of a Virtual
Machine (VM) with Linux inside. The disk is exposed to the VM as SATA or
IDE, and the VM is started on the client as usual. In the case of a
network outage, requests to the virtual disk cannot be completed in
predictable time. If such a request is e.g. an ext3/4 journal write, the
guest resets the controller and retries the request the first time; on
the next such event it remounts the affected filesystem read-only. From
the end-user point of view this looks like a fatal crash that requires a
manual reboot.

To avoid this situation, this patchset introduces the per-drive option
"disk-deadlines=on|off", which is unset by default. When the option is
enabled, all disk requests are tracked; if requests are not completed in
time, countermeasures are applied (see below). The timeout is
configurable; the default was chosen based on observations.
Test description to reproduce the problem:
1) configure and start NFS server:
$sudo /etc/init.d/nfs-kernel-server restart
2) put a Virtual Machine image with a preinstalled Operating System on the server
3) on the client, mount the server folder that contains the Virtual Machine image:
$sudo mount -t nfs -O uid=1000,iocharset=utf-8 server_ip:/path/to/folder/on/
server /path/to/folder/on/client
4) start Virtual Machine with QEMU on the client (for example):
$qemu-system-x86_64 -enable-kvm -vga std -balloon virtio -monitor stdio
-drive file=/path/to/folder/on/client/vdisk.img,media=disk,if=ide,disk-deadlines=on
-boot d -m 12288
5) inside the VM run the following command:
$dd if=/dev/urandom of=testfile bs=10M count=300
AND stop the server (or disconnect network) by running:
$sudo /etc/init.d/nfs-kernel-server stop
6) inside the VM periodically run:
$dmesg
and check error messages.
One can get one of the following error messages (only the main lines are shown):
1) After the server restarts, the guest OS continues to run as usual with
the following messages in dmesg:
a) [ 1108.131474] nfs: server 10.30.23.163 not responding, still trying
[ 1203.164903] INFO: task qemu-system-x86:3256 blocked for more
than 120 seconds
b) [ 581.184311] ata1.00: qc timeout (cmd 0xe7)
[ 581.184321] ata1.00: FLUSH failed Emask 0x4
[ 581.744271] ata1: soft resetting link
[ 581.900346] ata1.01: NODEV after polling detection
[ 581.900877] ata1.00: configured for MWDMA2
[ 581.900879] ata1.00: retrying FLUSH 0xe7 Emask 0x4
[ 581.901203] ata1.00: device reported invalid CHS sector 0
[ 581.901213] ata1: EH complete
2) The guest OS remounts its filesystem read-only:
"remounting filesystem read-only"
3) The guest OS does not respond at all, even after the server restarts.
Tested on:
Virtual Machine - Linux 3.11.0 SMP x86_64 Ubuntu 13.10 saucy;
client - Linux 3.11.10 SMP x86_64, Ubuntu 13.10 saucy;
server - Linux 3.13.0 SMP x86_64, Ubuntu 14.04.1 LTS.
How does the given solution work?

If the disk-deadlines option is enabled for a drive, the completion time
of this drive's requests is monitored. The method is as follows (assume
below that the option is enabled).

Every drive has its own red-black tree for keeping its requests. The
expiration time of a request is the key, and the cookie (the id of the
request) is the corresponding value. Every request is assumed to have
8 seconds to complete. If a request is not completed in time for some
reason (server crash or the like), the drive's timer fires and a callback
requests to stop the Virtual Machine (VM).

The VM remains stopped until all requests from the disk that caused the
stop are completed. Furthermore, if there are other disks with
'disk-deadlines=on' whose requests are still pending, the VM is not
restarted either: it waits for completion of all "late" requests from
all disks.

All requests which caused the VM to stop (or were simply not completed in
time) can be printed using the "info disk-deadlines" qemu monitor command
as follows:
$(qemu) info disk-deadlines
disk_id type size total_time start_time
.--------------------------------------------------------
ide0-hd1 FLUSH 0b 46.403s 22232930059574ns
ide0-hd1 FLUSH 0b 57.591s 22451499241285ns
ide0-hd1 FLUSH 0b 103.482s 22574100547397ns
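
As a quick illustration of the mechanism described above, here is a small
standalone sketch (glib only, not taken from the patches; the helper names
and the pointer cast for keys are illustrative, and a 64-bit host is
assumed) of the expiration-tree idea: in-flight requests are keyed by their
deadline, and the smallest key tells when the per-drive timer has to fire.
The real implementation is in patch 4/5.

    /* sketch.c - standalone illustration of the deadline-tree idea.
     * Build with: gcc sketch.c $(pkg-config --cflags --libs glib-2.0)
     */
    #include <glib.h>
    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    #define EXPIRE_DEFAULT_NS 8000000000LL   /* 8 seconds, as in the series */

    /* Order nodes by the 64-bit deadline stored in the key pointer. */
    static gint compare_keys(gconstpointer a, gconstpointer b)
    {
        int64_t ka = (intptr_t)a, kb = (intptr_t)b;
        return (ka < kb) ? -1 : (ka > kb) ? 1 : 0;
    }

    /* g_tree_foreach() visits keys in ascending order, so the first key we
     * see is the soonest deadline; returning TRUE stops the traversal. */
    static gboolean grab_min(gpointer key, gpointer value, gpointer data)
    {
        *(int64_t *)data = (intptr_t)key;
        return TRUE;
    }

    int main(void)
    {
        GTree *requests = g_tree_new(compare_keys);
        int64_t now = 1000, min_deadline = 0;

        /* Two in-flight requests: key = start time + allowed completion time */
        g_tree_insert(requests, (gpointer)(intptr_t)(now + EXPIRE_DEFAULT_NS),
                      (gpointer)"request-1");
        g_tree_insert(requests, (gpointer)(intptr_t)(now + 42 + EXPIRE_DEFAULT_NS),
                      (gpointer)"request-2");

        g_tree_foreach(requests, grab_min, &min_deadline);
        printf("per-drive timer must fire at %" PRId64 " ns\n", min_deadline);

        g_tree_destroy(requests);
        return 0;
    }

If the timer fires and the smallest deadline is already in the past, the
series stops the VM instead of completing the request with an error.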
This set is sent in the hope that it might be useful.
Signed-off-by: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
Raushaniya Maksudova (5):
add QEMU style defines for __sync_add_and_fetch
disk_deadlines: add request to resume Virtual Machine
disk_deadlines: add disk-deadlines option per drive
disk_deadlines: add control of requests time expiration
disk_deadlines: add info disk-deadlines option
block/Makefile.objs | 1 +
block/accounting.c | 8 ++
block/disk-deadlines.c | 280 +++++++++++++++++++++++++++++++++++++++++
blockdev.c | 20 +++
hmp.c | 37 ++++++
hmp.h | 1 +
include/block/accounting.h | 2 +
include/block/disk-deadlines.h | 48 +++++++
include/qemu/atomic.h | 3 +
include/sysemu/sysemu.h | 1 +
monitor.c | 7 ++
qapi-schema.json | 33 +++++
stubs/vm-stop.c | 5 +
vl.c | 18 +++
14 files changed, 464 insertions(+)
create mode 100644 block/disk-deadlines.c
create mode 100644 include/block/disk-deadlines.h
--
2.1.4
* [Qemu-devel] [PATCH 1/5] add QEMU style defines for __sync_add_and_fetch
From: Denis V. Lunev @ 2015-09-08 8:00 UTC (permalink / raw)
Cc: Kevin Wolf, qemu-devel, Raushaniya Maksudova, Stefan Hajnoczi,
Paolo Bonzini, Denis V. Lunev
From: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
Signed-off-by: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
---
include/qemu/atomic.h | 3 +++
1 file changed, 3 insertions(+)
diff --git a/include/qemu/atomic.h b/include/qemu/atomic.h
index bd2c075..d26729a 100644
--- a/include/qemu/atomic.h
+++ b/include/qemu/atomic.h
@@ -249,6 +249,9 @@
#endif
#endif
+#define atomic_inc_fetch(ptr) __sync_add_and_fetch(ptr, 1)
+#define atomic_dec_fetch(ptr) __sync_add_and_fetch(ptr, -1)
+
/* Provide shorter names for GCC atomic builtins. */
#define atomic_fetch_inc(ptr) __sync_fetch_and_add(ptr, 1)
#define atomic_fetch_dec(ptr) __sync_fetch_and_add(ptr, -1)
--
2.1.4
* [Qemu-devel] [PATCH 2/5] disk_deadlines: add request to resume Virtual Machine
From: Denis V. Lunev @ 2015-09-08 8:00 UTC (permalink / raw)
Cc: Kevin Wolf, qemu-devel, Raushaniya Maksudova, Stefan Hajnoczi,
Paolo Bonzini, Denis V. Lunev
From: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
In some cases one needs to pause and resume a Virtual Machine from inside
QEMU. Currently there are request functions to pause the VM (vmstop), but
there is no corresponding one to resume it.
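
As a hedged sketch of how the new request is meant to be used (the
subsystem functions below are hypothetical; the real caller is the
disk-deadlines code later in this series), a caller pairs it with the
existing vmstop request API:

    /* Hypothetical subsystem (not part of this patch) that pauses the guest
     * when it detects a problem and asks for a resume once the problem is
     * gone.  Both calls only set a flag and kick the main loop; the actual
     * vm_stop()/vm_start() happens in main_loop_should_exit(). */
    #include "sysemu/sysemu.h"

    static void my_subsystem_problem_detected(void)
    {
        qemu_system_vmstop_request_prepare();
        qemu_system_vmstop_request(RUN_STATE_PAUSED);
    }

    static void my_subsystem_problem_resolved(void)
    {
        qemu_system_vmstart_request();   /* new request added by this patch */
    }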
Signed-off-by: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Paolo Bonzini <pbonzini@redhat.com>
---
include/sysemu/sysemu.h | 1 +
stubs/vm-stop.c | 5 +++++
vl.c | 18 ++++++++++++++++++
3 files changed, 24 insertions(+)
diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
index 44570d1..a382ae1 100644
--- a/include/sysemu/sysemu.h
+++ b/include/sysemu/sysemu.h
@@ -62,6 +62,7 @@ void qemu_system_shutdown_request(void);
void qemu_system_powerdown_request(void);
void qemu_register_powerdown_notifier(Notifier *notifier);
void qemu_system_debug_request(void);
+void qemu_system_vmstart_request(void);
void qemu_system_vmstop_request(RunState reason);
void qemu_system_vmstop_request_prepare(void);
int qemu_shutdown_requested_get(void);
diff --git a/stubs/vm-stop.c b/stubs/vm-stop.c
index 69fd86b..c8e2cdd 100644
--- a/stubs/vm-stop.c
+++ b/stubs/vm-stop.c
@@ -10,3 +10,8 @@ void qemu_system_vmstop_request(RunState state)
{
abort();
}
+
+void qemu_system_vmstart_request(void)
+{
+ abort();
+}
diff --git a/vl.c b/vl.c
index 584ca88..63f10d3 100644
--- a/vl.c
+++ b/vl.c
@@ -563,6 +563,7 @@ static RunState current_run_state = RUN_STATE_PRELAUNCH;
/* We use RUN_STATE_MAX but any invalid value will do */
static RunState vmstop_requested = RUN_STATE_MAX;
static QemuMutex vmstop_lock;
+static bool vmstart_requested;
typedef struct {
RunState from;
@@ -723,6 +724,19 @@ void qemu_system_vmstop_request(RunState state)
qemu_notify_event();
}
+static bool qemu_vmstart_requested(void)
+{
+ bool r = vmstart_requested;
+ vmstart_requested = false;
+ return r;
+}
+
+void qemu_system_vmstart_request(void)
+{
+ vmstart_requested = true;
+ qemu_notify_event();
+}
+
void vm_start(void)
{
RunState requested;
@@ -1884,6 +1898,10 @@ static bool main_loop_should_exit(void)
if (qemu_vmstop_requested(&r)) {
vm_stop(r);
}
+ if (qemu_vmstart_requested()) {
+ vm_start();
+ }
+
return false;
}
--
2.1.4
* [Qemu-devel] [PATCH 3/5] disk_deadlines: add disk-deadlines option per drive
From: Denis V. Lunev @ 2015-09-08 8:00 UTC (permalink / raw)
Cc: Kevin Wolf, qemu-devel, Raushaniya Maksudova, Markus Armbruster,
Stefan Hajnoczi, Denis V. Lunev
From: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
This patch adds the per-drive option disk-deadlines. When it is enabled,
the drive's requests are tracked and those that were not completed in time
are detected. By default the option is unset.
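
For reference, the option is enabled per drive on the command line just as
in the cover letter (the path and unrelated options are shortened here):

    $qemu-system-x86_64 [...] \
        -drive file=/path/to/vdisk.img,media=disk,if=ide,disk-deadlines=on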
Signed-off-by: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Markus Armbruster <armbru@redhat.com>
---
block/Makefile.objs | 1 +
block/disk-deadlines.c | 30 ++++++++++++++++++++++++++++++
blockdev.c | 19 +++++++++++++++++++
include/block/accounting.h | 2 ++
include/block/disk-deadlines.h | 35 +++++++++++++++++++++++++++++++++++
5 files changed, 87 insertions(+)
create mode 100644 block/disk-deadlines.c
create mode 100644 include/block/disk-deadlines.h
diff --git a/block/Makefile.objs b/block/Makefile.objs
index 58ef2ef..cf30ce5 100644
--- a/block/Makefile.objs
+++ b/block/Makefile.objs
@@ -20,6 +20,7 @@ block-obj-$(CONFIG_RBD) += rbd.o
block-obj-$(CONFIG_GLUSTERFS) += gluster.o
block-obj-$(CONFIG_ARCHIPELAGO) += archipelago.o
block-obj-$(CONFIG_LIBSSH2) += ssh.o
+block-obj-y += disk-deadlines.o
block-obj-y += accounting.o
block-obj-y += write-threshold.o
diff --git a/block/disk-deadlines.c b/block/disk-deadlines.c
new file mode 100644
index 0000000..39dec53
--- /dev/null
+++ b/block/disk-deadlines.c
@@ -0,0 +1,30 @@
+/*
+ * QEMU System Emulator disk deadlines control
+ *
+ * Copyright (c) 2015 Raushaniya Maksudova <rmaksudova@virtuozzo.com>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+
+#include "block/disk-deadlines.h"
+
+void disk_deadlines_init(DiskDeadlines *disk_deadlines, bool enabled)
+{
+ disk_deadlines->enabled = enabled;
+}
diff --git a/blockdev.c b/blockdev.c
index 6b48be6..6cd9c6e 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -361,6 +361,7 @@ static BlockBackend *blockdev_init(const char *file, QDict *bs_opts,
ThrottleConfig cfg;
int snapshot = 0;
bool copy_on_read;
+ bool disk_deadlines;
Error *error = NULL;
QemuOpts *opts;
const char *id;
@@ -394,6 +395,11 @@ static BlockBackend *blockdev_init(const char *file, QDict *bs_opts,
ro = qemu_opt_get_bool(opts, "read-only", 0);
copy_on_read = qemu_opt_get_bool(opts, "copy-on-read", false);
+ disk_deadlines = qdict_get_try_bool(bs_opts, "disk-deadlines", false);
+ if (disk_deadlines) {
+ qdict_del(bs_opts, "disk-deadlines");
+ }
+
if ((buf = qemu_opt_get(opts, "discard")) != NULL) {
if (bdrv_parse_discard_flags(buf, &bdrv_flags) != 0) {
error_setg(errp, "invalid discard option");
@@ -555,6 +561,8 @@ static BlockBackend *blockdev_init(const char *file, QDict *bs_opts,
bs->detect_zeroes = detect_zeroes;
+ disk_deadlines_init(&bs->stats.disk_deadlines, disk_deadlines);
+
bdrv_set_on_error(bs, on_read_error, on_write_error);
/* disk I/O throttling */
@@ -658,6 +666,10 @@ QemuOptsList qemu_legacy_drive_opts = {
.name = "file",
.type = QEMU_OPT_STRING,
.help = "file name",
+ },{
+ .name = "disk-deadlines",
+ .type = QEMU_OPT_BOOL,
+ .help = "control of disk requests' time execution",
},
/* Options that are passed on, but have special semantics with -drive */
@@ -698,6 +710,7 @@ DriveInfo *drive_new(QemuOpts *all_opts, BlockInterfaceType block_default_type)
const char *werror, *rerror;
bool read_only = false;
bool copy_on_read;
+ bool disk_deadlines;
const char *serial;
const char *filename;
Error *local_err = NULL;
@@ -812,6 +825,12 @@ DriveInfo *drive_new(QemuOpts *all_opts, BlockInterfaceType block_default_type)
qdict_put(bs_opts, "copy-on-read",
qstring_from_str(copy_on_read ? "on" :"off"));
+ /* Enable control of disk requests' time execution */
+ disk_deadlines = qemu_opt_get_bool(legacy_opts, "disk-deadlines", false);
+ if (disk_deadlines) {
+ qdict_put(bs_opts, "disk-deadlines", qbool_from_bool(disk_deadlines));
+ }
+
/* Controller type */
value = qemu_opt_get(legacy_opts, "if");
if (value) {
diff --git a/include/block/accounting.h b/include/block/accounting.h
index 4c406cf..4e2b345 100644
--- a/include/block/accounting.h
+++ b/include/block/accounting.h
@@ -27,6 +27,7 @@
#include <stdint.h>
#include "qemu/typedefs.h"
+#include "block/disk-deadlines.h"
enum BlockAcctType {
BLOCK_ACCT_READ,
@@ -41,6 +42,7 @@ typedef struct BlockAcctStats {
uint64_t total_time_ns[BLOCK_MAX_IOTYPE];
uint64_t merged[BLOCK_MAX_IOTYPE];
uint64_t wr_highest_sector;
+ DiskDeadlines disk_deadlines;
} BlockAcctStats;
typedef struct BlockAcctCookie {
diff --git a/include/block/disk-deadlines.h b/include/block/disk-deadlines.h
new file mode 100644
index 0000000..2ea193b
--- /dev/null
+++ b/include/block/disk-deadlines.h
@@ -0,0 +1,35 @@
+/*
+ * QEMU System Emulator disk deadlines control
+ *
+ * Copyright (c) 2015 Raushaniya Maksudova <rmaksudova@virtuozzo.com>
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this software and associated documentation files (the "Software"), to deal
+ * in the Software without restriction, including without limitation the rights
+ * to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+ * copies of the Software, and to permit persons to whom the Software is
+ * furnished to do so, subject to the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL
+ * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+ * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN
+ * THE SOFTWARE.
+ */
+#ifndef DISK_DEADLINES_H
+#define DISK_DEADLINES_H
+
+#include <stdbool.h>
+
+typedef struct DiskDeadlines {
+ bool enabled;
+} DiskDeadlines;
+
+void disk_deadlines_init(DiskDeadlines *disk_deadlines, bool enabled);
+
+#endif
--
2.1.4
* [Qemu-devel] [PATCH 4/5] disk_deadlines: add control of requests time expiration
From: Denis V. Lunev @ 2015-09-08 8:00 UTC (permalink / raw)
Cc: Kevin Wolf, Denis V. Lunev, Stefan Hajnoczi, qemu-devel,
Raushaniya Maksudova
From: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
If the disk-deadlines option is enabled for a drive, the completion time
of this drive's requests is monitored. The method is as follows (assume
below that the option is enabled).

Every drive has its own red-black tree for keeping its requests. The
expiration time of a request is the key, and the cookie (the id of the
request) is the corresponding value. Every request is assumed to have
8 seconds to complete. If a request is not completed in time for some
reason (server crash or the like), the drive's timer fires and a callback
requests to stop the Virtual Machine (VM).

The VM remains stopped until all requests from the disk that caused the
stop are completed. Furthermore, if there are other disks whose requests
are waiting to be completed, the VM is not restarted either: it waits for
completion of all "late" requests from all disks.
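
To make the "only the last recovered disk restarts the VM" rule easier to
follow, here is a standalone toy (not QEMU code; only the two macros are
copied from patch 1/5) of the counter gating used below with
num_requests_vmstopped:

    /* gating.c - the first expired drive stops the VM, the last recovered
     * drive resumes it.  Build with: gcc gating.c */
    #include <stdbool.h>
    #include <stdio.h>

    /* Same definitions as patch 1/5 */
    #define atomic_fetch_inc(ptr) __sync_fetch_and_add(ptr, 1)   /* returns old value */
    #define atomic_dec_fetch(ptr) __sync_add_and_fetch(ptr, -1)  /* returns new value */

    static unsigned long num_requests_vmstopped;

    static bool tree_expired(void)     /* called when a drive's tree expires */
    {
        return atomic_fetch_inc(&num_requests_vmstopped) == 0;  /* 0 -> 1: stop  */
    }

    static bool tree_recovered(void)   /* called when a drive's tree drains  */
    {
        return atomic_dec_fetch(&num_requests_vmstopped) == 0;  /* 1 -> 0: start */
    }

    int main(void)
    {
        printf("drive A expires:  need vmstop?  %d\n", tree_expired());    /* 1 */
        printf("drive B expires:  need vmstop?  %d\n", tree_expired());    /* 0 */
        printf("drive A recovers: need vmstart? %d\n", tree_recovered());  /* 0 */
        printf("drive B recovers: need vmstart? %d\n", tree_recovered());  /* 1 */
        return 0;
    }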
Signed-off-by: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
---
block/accounting.c | 8 ++
block/disk-deadlines.c | 167 +++++++++++++++++++++++++++++++++++++++++
include/block/disk-deadlines.h | 11 +++
3 files changed, 186 insertions(+)
diff --git a/block/accounting.c b/block/accounting.c
index 01d594f..7b913fd 100644
--- a/block/accounting.c
+++ b/block/accounting.c
@@ -34,6 +34,10 @@ void block_acct_start(BlockAcctStats *stats, BlockAcctCookie *cookie,
cookie->bytes = bytes;
cookie->start_time_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
cookie->type = type;
+
+ if (stats->disk_deadlines.enabled) {
+ insert_request(&stats->disk_deadlines, cookie);
+ }
}
void block_acct_done(BlockAcctStats *stats, BlockAcctCookie *cookie)
@@ -44,6 +48,10 @@ void block_acct_done(BlockAcctStats *stats, BlockAcctCookie *cookie)
stats->nr_ops[cookie->type]++;
stats->total_time_ns[cookie->type] +=
qemu_clock_get_ns(QEMU_CLOCK_REALTIME) - cookie->start_time_ns;
+
+ if (stats->disk_deadlines.enabled) {
+ remove_request(&stats->disk_deadlines, cookie);
+ }
}
diff --git a/block/disk-deadlines.c b/block/disk-deadlines.c
index 39dec53..acb44bc 100644
--- a/block/disk-deadlines.c
+++ b/block/disk-deadlines.c
@@ -23,8 +23,175 @@
*/
#include "block/disk-deadlines.h"
+#include "block/accounting.h"
+#include "sysemu/sysemu.h"
+#include "qemu/atomic.h"
+
+/*
+ * Number of late requests which were not completed in time
+ * (its timer has expired) and as a result it caused VM's stopping
+ */
+uint64_t num_requests_vmstopped;
+
+/* Give 8 seconds for request to complete by default */
+const uint64_t EXPIRE_DEFAULT_NS = 8000000000;
+
+typedef struct RequestInfo {
+ BlockAcctCookie *cookie;
+ int64_t expire_time;
+} RequestInfo;
+
+static gint compare(gconstpointer a, gconstpointer b)
+{
+ return (int64_t)a - (int64_t)b;
+}
+
+static gboolean find_request(gpointer key, gpointer value, gpointer data)
+{
+ BlockAcctCookie *cookie = value;
+ RequestInfo *request = data;
+ if (cookie == request->cookie) {
+ request->expire_time = (int64_t)key;
+ return true;
+ }
+ return false;
+}
+
+static gint search_min_key(gpointer key, gpointer data)
+{
+ int64_t tree_key = (int64_t)key;
+ int64_t *ptr_curr_min_key = data;
+
+ if ((tree_key <= *ptr_curr_min_key) || (*ptr_curr_min_key == 0)) {
+ *ptr_curr_min_key = tree_key;
+ }
+ /*
+ * We always want to proceed searching among key/value pairs
+ * with smaller key => return -1
+ */
+ return -1;
+}
+
+static int64_t soonest_expire_time(GTree *requests_tree)
+{
+ int64_t min_timestamp = 0;
+ /*
+ * g_tree_search() will always return NULL, because there is no
+ * key = 0 in the tree, we simply search for node the with the minimal key
+ */
+ g_tree_search(requests_tree, (GCompareFunc)search_min_key, &min_timestamp);
+ return min_timestamp;
+}
+
+static void disk_deadlines_callback(void *opaque)
+{
+ bool need_vmstop = false;
+ int64_t current_time, expire_time;
+ DiskDeadlines *disk_deadlines = opaque;
+
+ /*
+ * Check whether the request that triggered callback invocation
+ * is still in the tree of requests.
+ */
+ current_time = qemu_clock_get_ns(QEMU_CLOCK_REALTIME);
+ pthread_mutex_lock(&disk_deadlines->mtx_tree);
+ if (g_tree_nnodes(disk_deadlines->requests_tree) == 0) {
+ /* There are no requests in the tree, do nothing */
+ pthread_mutex_unlock(&disk_deadlines->mtx_tree);
+ return;
+ }
+ expire_time = soonest_expire_time(disk_deadlines->requests_tree);
+
+ /*
+ * If the request was not found, then there is no disk deadline detected,
+ * just update the timer with new value
+ */
+ if (expire_time > current_time) {
+ timer_mod_ns(disk_deadlines->request_timer, expire_time);
+ pthread_mutex_unlock(&disk_deadlines->mtx_tree);
+ return;
+ }
+
+ disk_deadlines->expired_tree = true;
+ need_vmstop = !atomic_fetch_inc(&num_requests_vmstopped);
+ pthread_mutex_unlock(&disk_deadlines->mtx_tree);
+
+ if (need_vmstop) {
+ qemu_system_vmstop_request_prepare();
+ qemu_system_vmstop_request(RUN_STATE_PAUSED);
+ }
+}
void disk_deadlines_init(DiskDeadlines *disk_deadlines, bool enabled)
{
disk_deadlines->enabled = enabled;
+ if (!disk_deadlines->enabled) {
+ return;
+ }
+
+ disk_deadlines->requests_tree = g_tree_new(compare);
+ if (disk_deadlines->requests_tree == NULL) {
+ disk_deadlines->enabled = false;
+ fprintf(stderr,
+ "disk_deadlines_init: failed to allocate requests_tree\n");
+ return;
+ }
+
+ pthread_mutex_init(&disk_deadlines->mtx_tree, NULL);
+ disk_deadlines->expired_tree = false;
+ disk_deadlines->request_timer = timer_new_ns(QEMU_CLOCK_REALTIME,
+ disk_deadlines_callback,
+ (void *)disk_deadlines);
+}
+
+void insert_request(DiskDeadlines *disk_deadlines, void *request)
+{
+ BlockAcctCookie *cookie = request;
+
+ int64_t expire_time = cookie->start_time_ns + EXPIRE_DEFAULT_NS;
+
+ pthread_mutex_lock(&disk_deadlines->mtx_tree);
+ /* Set up expire time for the current disk if it is not set yet */
+ if (timer_expired(disk_deadlines->request_timer,
+ qemu_clock_get_ns(QEMU_CLOCK_REALTIME))) {
+ timer_mod_ns(disk_deadlines->request_timer, expire_time);
+ }
+
+ g_tree_insert(disk_deadlines->requests_tree, (int64_t *)expire_time,
+ cookie);
+ pthread_mutex_unlock(&disk_deadlines->mtx_tree);
+}
+
+void remove_request(DiskDeadlines *disk_deadlines, void *request)
+{
+ bool need_vmstart = false;
+ RequestInfo request_info = {
+ .cookie = request,
+ .expire_time = 0,
+ };
+
+ /* Find the request to remove */
+ pthread_mutex_lock(&disk_deadlines->mtx_tree);
+ g_tree_foreach(disk_deadlines->requests_tree, find_request, &request_info);
+ g_tree_remove(disk_deadlines->requests_tree,
+ (int64_t *)request_info.expire_time);
+
+ /*
+ * If tree is empty, but marked as expired, then one needs to
+ * unset "expired_tree" flag and check whether VM can be resumed
+ */
+ if (!g_tree_nnodes(disk_deadlines->requests_tree) &&
+ disk_deadlines->expired_tree) {
+ disk_deadlines->expired_tree = false;
+ /*
+ * If all requests (from all disks with enabled
+ * "disk-deadlines" feature) are completed, resume VM
+ */
+ need_vmstart = !atomic_dec_fetch(&num_requests_vmstopped);
+ }
+ pthread_mutex_unlock(&disk_deadlines->mtx_tree);
+
+ if (need_vmstart) {
+ qemu_system_vmstart_request();
+ }
}
diff --git a/include/block/disk-deadlines.h b/include/block/disk-deadlines.h
index 2ea193b..9672aff 100644
--- a/include/block/disk-deadlines.h
+++ b/include/block/disk-deadlines.h
@@ -25,11 +25,22 @@
#define DISK_DEADLINES_H
#include <stdbool.h>
+#include <stdint.h>
+#include <glib.h>
+
+#include "qemu/typedefs.h"
+#include "qemu/timer.h"
typedef struct DiskDeadlines {
bool enabled;
+ bool expired_tree;
+ pthread_mutex_t mtx_tree;
+ GTree *requests_tree;
+ QEMUTimer *request_timer;
} DiskDeadlines;
void disk_deadlines_init(DiskDeadlines *disk_deadlines, bool enabled);
+void insert_request(DiskDeadlines *disk_deadlines, void *request);
+void remove_request(DiskDeadlines *disk_deadlines, void *request);
#endif
--
2.1.4
* [Qemu-devel] [PATCH 5/5] disk_deadlines: add info disk-deadlines option
From: Denis V. Lunev @ 2015-09-08 8:00 UTC (permalink / raw)
Cc: Kevin Wolf, qemu-devel, Raushaniya Maksudova, Markus Armbruster,
Stefan Hajnoczi, Denis V. Lunev, Luiz Capitulino
From: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
This patch adds "info disk-deadlines" qemu-monitor option that prints
dump of all disk requests which caused a disk deadline in Guest OS
from the very start of Virtual Machine:
disk_id type size total_time start_time
.--------------------------------------------------------
ide0-hd1 FLUSH 0b 46.403s 22232930059574ns
ide0-hd1 FLUSH 0b 57.591s 22451499241285ns
ide0-hd1 FLUSH 0b 103.482s 22574100547397ns
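
The same information is available over QMP through the query-disk-deadlines
command added to qapi-schema.json below. A session would look roughly as
follows; the response is shaped after the schema, with values taken from
the first row of the table above (the total time is a rounded
approximation):

    -> { "execute": "query-disk-deadlines" }
    <- { "return": [
           { "disk-id": "ide0-hd1",
             "type": "FLUSH",
             "size": 0,
             "total-time-ns": 46403000000,
             "start-time-ns": 22232930059574 } ] }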
Signed-off-by: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
Signed-off-by: Denis V. Lunev <den@openvz.org>
CC: Stefan Hajnoczi <stefanha@redhat.com>
CC: Kevin Wolf <kwolf@redhat.com>
CC: Markus Armbruster <armbru@redhat.com>
CC: Luiz Capitulino <lcapitulino@redhat.com>
---
block/disk-deadlines.c | 85 +++++++++++++++++++++++++++++++++++++++++-
blockdev.c | 3 +-
hmp.c | 37 ++++++++++++++++++
hmp.h | 1 +
include/block/disk-deadlines.h | 4 +-
monitor.c | 7 ++++
qapi-schema.json | 33 ++++++++++++++++
7 files changed, 167 insertions(+), 3 deletions(-)
diff --git a/block/disk-deadlines.c b/block/disk-deadlines.c
index acb44bc..6f76e4f 100644
--- a/block/disk-deadlines.c
+++ b/block/disk-deadlines.c
@@ -26,6 +26,7 @@
#include "block/accounting.h"
#include "sysemu/sysemu.h"
#include "qemu/atomic.h"
+#include "qmp-commands.h"
/*
* Number of late requests which were not completed in time
@@ -41,6 +42,80 @@ typedef struct RequestInfo {
int64_t expire_time;
} RequestInfo;
+const char *types[] = {
+ "READ",
+ "WRITE",
+ "FLUSH",
+ "IOTYPE",
+};
+
+typedef struct Log {
+ GSList *head;
+ pthread_mutex_t mtx;
+} Log;
+
+Log ddinfo_list = {
+ NULL,
+ PTHREAD_MUTEX_INITIALIZER,
+};
+
+static void copy_disk_deadlines_info(DiskDeadlinesInfo *ddinfo_new,
+ DiskDeadlinesInfo *ddinfo_old)
+{
+ ddinfo_new->total_time_ns = ddinfo_old->total_time_ns;
+ ddinfo_new->start_time_ns = ddinfo_old->start_time_ns;
+ ddinfo_new->size = ddinfo_old->size;
+ ddinfo_new->type = g_strdup(ddinfo_old->type);
+ ddinfo_new->has_type = !!ddinfo_new->type;
+ ddinfo_new->disk_id = g_strdup(ddinfo_old->disk_id);
+ ddinfo_new->has_disk_id = !!ddinfo_new->disk_id;
+}
+
+static void fill_disk_deadlines_info(DiskDeadlinesInfo *ddinfo,
+ BlockAcctCookie *cookie,
+ DiskDeadlines *disk_deadlines)
+{
+ ddinfo->total_time_ns = qemu_clock_get_ns(QEMU_CLOCK_REALTIME) -
+ cookie->start_time_ns;
+ ddinfo->start_time_ns = cookie->start_time_ns;
+ ddinfo->size = cookie->bytes;
+ ddinfo->type = g_strdup(types[cookie->type]);
+ ddinfo->has_type = !!ddinfo->type;
+ ddinfo->disk_id = g_strdup(disk_deadlines->disk_id);
+ ddinfo->has_disk_id = !!ddinfo->disk_id;
+}
+
+DiskDeadlinesInfoList *qmp_query_disk_deadlines(Error **errp)
+{
+ DiskDeadlinesInfoList *list = NULL, *entry;
+ DiskDeadlinesInfo *ddinfo;
+ GSList *curr = ddinfo_list.head;
+
+ pthread_mutex_lock(&ddinfo_list.mtx);
+ for (curr = ddinfo_list.head; curr; curr = g_slist_next(curr)) {
+ ddinfo = g_new(DiskDeadlinesInfo, 1);
+ copy_disk_deadlines_info(ddinfo, curr->data);
+
+ entry = g_new(DiskDeadlinesInfoList, 1);
+ entry->value = ddinfo;
+ entry->next = list;
+ list = entry;
+ }
+ pthread_mutex_unlock(&ddinfo_list.mtx);
+ return list;
+}
+
+static void log_disk_deadlines_info(BlockAcctCookie *cookie,
+ DiskDeadlines *disk_deadlines)
+{
+ DiskDeadlinesInfo *data = g_new(DiskDeadlinesInfo, 1);
+ fill_disk_deadlines_info(data, cookie, disk_deadlines);
+
+ pthread_mutex_lock(&ddinfo_list.mtx);
+ ddinfo_list.head = g_slist_prepend(ddinfo_list.head, data);
+ pthread_mutex_unlock(&ddinfo_list.mtx);
+}
+
static gint compare(gconstpointer a, gconstpointer b)
{
return (int64_t)a - (int64_t)b;
@@ -122,7 +197,8 @@ static void disk_deadlines_callback(void *opaque)
}
}
-void disk_deadlines_init(DiskDeadlines *disk_deadlines, bool enabled)
+void disk_deadlines_init(DiskDeadlines *disk_deadlines, bool enabled,
+ const char *disk_id)
{
disk_deadlines->enabled = enabled;
if (!disk_deadlines->enabled) {
@@ -139,6 +215,7 @@ void disk_deadlines_init(DiskDeadlines *disk_deadlines, bool enabled)
pthread_mutex_init(&disk_deadlines->mtx_tree, NULL);
disk_deadlines->expired_tree = false;
+ disk_deadlines->disk_id = g_strdup(disk_id);
disk_deadlines->request_timer = timer_new_ns(QEMU_CLOCK_REALTIME,
disk_deadlines_callback,
(void *)disk_deadlines);
@@ -165,6 +242,7 @@ void insert_request(DiskDeadlines *disk_deadlines, void *request)
void remove_request(DiskDeadlines *disk_deadlines, void *request)
{
bool need_vmstart = false;
+ bool need_log_disk_deadline = false;
RequestInfo request_info = {
.cookie = request,
.expire_time = 0,
@@ -176,6 +254,7 @@ void remove_request(DiskDeadlines *disk_deadlines, void *request)
g_tree_remove(disk_deadlines->requests_tree,
(int64_t *)request_info.expire_time);
+ need_log_disk_deadline = disk_deadlines->expired_tree;
/*
* If tree is empty, but marked as expired, then one needs to
* unset "expired_tree" flag and check whether VM can be resumed
@@ -194,4 +273,8 @@ void remove_request(DiskDeadlines *disk_deadlines, void *request)
if (need_vmstart) {
qemu_system_vmstart_request();
}
+
+ if (need_log_disk_deadline) {
+ log_disk_deadlines_info(request, disk_deadlines);
+ }
}
diff --git a/blockdev.c b/blockdev.c
index 6cd9c6e..9a38c43 100644
--- a/blockdev.c
+++ b/blockdev.c
@@ -561,7 +561,8 @@ static BlockBackend *blockdev_init(const char *file, QDict *bs_opts,
bs->detect_zeroes = detect_zeroes;
- disk_deadlines_init(&bs->stats.disk_deadlines, disk_deadlines);
+ disk_deadlines_init(&bs->stats.disk_deadlines, disk_deadlines,
+ qemu_opts_id(opts));
bdrv_set_on_error(bs, on_read_error, on_write_error);
diff --git a/hmp.c b/hmp.c
index 3f807b7..2c3660a 100644
--- a/hmp.c
+++ b/hmp.c
@@ -850,6 +850,43 @@ void hmp_info_tpm(Monitor *mon, const QDict *qdict)
qapi_free_TPMInfoList(info_list);
}
+static double nano_to_seconds(int64_t value)
+{
+ return ((double)value)/1000000000.0;
+}
+
+void hmp_info_disk_deadlines(Monitor *mon, const QDict *qdict)
+{
+ int i;
+ DiskDeadlinesInfoList *ddinfo_list, *curr;
+
+ ddinfo_list = qmp_query_disk_deadlines(NULL);
+ if (!ddinfo_list) {
+ monitor_printf(mon, "No disk deadlines occured\n");
+ return;
+ }
+
+ monitor_printf(mon, "\n%10s %5s %10s %-10s %17s\n",
+ "disk_id", "type", "size",
+ "total_time", "start_time");
+
+ /* Print line-delimiter */
+ for (i = 0; i < 3; i++) {
+ monitor_printf(mon, "-------------------");
+ }
+
+ for (curr = ddinfo_list; curr != NULL; curr = curr->next) {
+ monitor_printf(mon, "\n%10s %5s %9"PRIu64"b %-6.3lfs %18"PRIu64"ns",
+ curr->value->has_disk_id ? curr->value->disk_id : "",
+ curr->value->has_type ? curr->value->type : "",
+ curr->value->size,
+ nano_to_seconds(curr->value->total_time_ns),
+ curr->value->start_time_ns);
+ }
+ monitor_printf(mon, "\n");
+ qapi_free_DiskDeadlinesInfoList(ddinfo_list);
+}
+
void hmp_quit(Monitor *mon, const QDict *qdict)
{
monitor_suspend(mon);
diff --git a/hmp.h b/hmp.h
index 81656c3..8fe0150 100644
--- a/hmp.h
+++ b/hmp.h
@@ -38,6 +38,7 @@ void hmp_info_spice(Monitor *mon, const QDict *qdict);
void hmp_info_balloon(Monitor *mon, const QDict *qdict);
void hmp_info_pci(Monitor *mon, const QDict *qdict);
void hmp_info_block_jobs(Monitor *mon, const QDict *qdict);
+void hmp_info_disk_deadlines(Monitor *mon, const QDict *qdict);
void hmp_info_tpm(Monitor *mon, const QDict *qdict);
void hmp_info_iothreads(Monitor *mon, const QDict *qdict);
void hmp_quit(Monitor *mon, const QDict *qdict);
diff --git a/include/block/disk-deadlines.h b/include/block/disk-deadlines.h
index 9672aff..d9b4143 100644
--- a/include/block/disk-deadlines.h
+++ b/include/block/disk-deadlines.h
@@ -34,12 +34,14 @@
typedef struct DiskDeadlines {
bool enabled;
bool expired_tree;
+ char *disk_id;
pthread_mutex_t mtx_tree;
GTree *requests_tree;
QEMUTimer *request_timer;
} DiskDeadlines;
-void disk_deadlines_init(DiskDeadlines *disk_deadlines, bool enabled);
+void disk_deadlines_init(DiskDeadlines *disk_deadlines, bool enabled,
+ const char *disk_id);
void insert_request(DiskDeadlines *disk_deadlines, void *request);
void remove_request(DiskDeadlines *disk_deadlines, void *request);
diff --git a/monitor.c b/monitor.c
index 5455ab9..065effa 100644
--- a/monitor.c
+++ b/monitor.c
@@ -2898,6 +2898,13 @@ static mon_cmd_t info_cmds[] = {
},
#endif
{
+ .name = "disk-deadlines",
+ .args_type = "",
+ .params = "",
+ .help = "show dump of late disk requests",
+ .mhandler.cmd = hmp_info_disk_deadlines,
+ },
+ {
.name = NULL,
},
};
diff --git a/qapi-schema.json b/qapi-schema.json
index 67fef37..ffc1445 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -3808,3 +3808,36 @@
# Rocker ethernet network switch
{ 'include': 'qapi/rocker.json' }
+
+## @DiskDeadlinesInfo
+#
+# Contains info about late requests which caused VM stopping
+#
+# @disk-id: name of disk (unique for each disk)
+#
+# @type: type of request could be READ, WRITE or FLUSH
+#
+# @size: size in bytes
+#
+# @total-time-ns: total time of request execution
+#
+# @start-time-ns: indicates the start of request execution
+#
+# Since: 2.5
+##
+{ 'struct': 'DiskDeadlinesInfo',
+ 'data' : { '*disk-id': 'str',
+ '*type': 'str',
+ 'size': 'uint64',
+ 'total-time-ns': 'uint64',
+ 'start-time-ns': 'uint64' } }
+##
+# @query-disk-deadlines:
+#
+# Returns information about last late disk requests.
+#
+# Returns: a list of @DiskDeadlinesInfo
+#
+# Since: 2.5
+##
+{ 'command': 'query-disk-deadlines', 'returns': ['DiskDeadlinesInfo'] }
--
2.1.4
* Re: [Qemu-devel] [PATCH RFC 0/5] disk deadlines
From: Vasiliy Tolstov @ 2015-09-08 8:58 UTC (permalink / raw)
To: Denis V. Lunev
Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi, Raushaniya Maksudova
2015-09-08 11:00 GMT+03:00 Denis V. Lunev <den@openvz.org>:
>
> VM remains stopped until all requests from the disk which caused VM's stopping
> are completed. Furthermore, if there is another disks with 'disk-deadlines=on'
> whose requests are waiting to be completed, do not start VM : wait completion
> of all "late" requests from all disks.
>
> Furthermore, all requests which caused VM stopping (or those that just were not
> completed in time) could be printed using "info disk-deadlines" qemu monitor
> option as follows:
Nice feature for networked filesystems and block storages. Thanks for
this nice patch!
--
Vasiliy Tolstov,
e-mail: v.tolstov@selfip.ru
* Re: [Qemu-devel] [PATCH RFC 0/5] disk deadlines
From: Fam Zheng @ 2015-09-08 9:20 UTC (permalink / raw)
To: Denis V. Lunev
Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi,
qemu-block@nongnu.org, Raushaniya Maksudova
[Cc'ing qemu-block@nongnu.org]
On Tue, 09/08 11:00, Denis V. Lunev wrote:
> To avoid such situation this patchset introduces patch per-drive option
> "disk-deadlines=on|off" which is unset by default.
The general idea sounds very nice. Thanks!
Should we allow user configuration of the timeout? If so, the option should be
something like "timeout-seconds=0,1,2...". Also I think we could use werror
and rerror to control the handling policy (whether to ignore/report/stop on
timeout).
Fam
* Re: [Qemu-devel] [PATCH RFC 0/5] disk deadlines
From: Paolo Bonzini @ 2015-09-08 9:33 UTC (permalink / raw)
To: Denis V. Lunev
Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi, Raushaniya Maksudova
On 08/09/2015 10:00, Denis V. Lunev wrote:
> How the given solution works?
>
> If disk-deadlines option is enabled for a drive, one controls time completion
> of this drive's requests. The method is as follows (further assume that this
> option is enabled).
>
> Every drive has its own red-black tree for keeping its requests.
> Expiration time of the request is a key, cookie (as id of request) is an
> appropriate node. Assume that every requests has 8 seconds to be completed.
> If request was not accomplished in time for some reasons (server crash or smth
> else), timer of this drive is fired and an appropriate callback requests to
> stop Virtial Machine (VM).
>
> VM remains stopped until all requests from the disk which caused VM's stopping
> are completed. Furthermore, if there is another disks with 'disk-deadlines=on'
> whose requests are waiting to be completed, do not start VM : wait completion
> of all "late" requests from all disks.
>
> Furthermore, all requests which caused VM stopping (or those that just were not
> completed in time) could be printed using "info disk-deadlines" qemu monitor
> option as follows:
This topic has come up several times in the past.
I agree that the current behavior is not great, but I am not sure that
timeouts are safe. For example, how is disk-deadlines=on different from
NFS soft mounts? The NFS man page says
NB: A so-called "soft" timeout can cause silent data corruption in
certain cases. As such, use the soft option only when client
responsiveness is more important than data integrity. Using NFS
over TCP or increasing the value of the retrans option may
mitigate some of the risks of using the soft option.
Note how it only says "mitigate", not solve.
Paolo
* Re: [Qemu-devel] [PATCH 4/5] disk_deadlines: add control of requests time expiration
From: Fam Zheng @ 2015-09-08 9:35 UTC (permalink / raw)
To: Denis V. Lunev
Cc: Kevin Wolf, qemu-block, qemu-devel, Stefan Hajnoczi,
Raushaniya Maksudova
On Tue, 09/08 11:00, Denis V. Lunev wrote:
> typedef struct DiskDeadlines {
> bool enabled;
> + bool expired_tree;
> + pthread_mutex_t mtx_tree;
This won't compile on win32, probably use QemuMutex instead?
In file included from /tmp/qemu-build/include/block/accounting.h:30:0,
from /tmp/qemu-build/include/block/block.h:8,
from /tmp/qemu-build/include/monitor/monitor.h:6,
from /tmp/qemu-build/util/osdep.c:51:
/tmp/qemu-build/include/block/disk-deadlines.h:38:5: error: unknown type name 'pthread_mutex_t'
pthread_mutex_t mtx_tree;
^
/tmp/qemu-build/rules.mak:57: recipe for target 'util/osdep.o' failed
* Re: [Qemu-devel] [PATCH RFC 0/5] disk deadlines
From: Denis V. Lunev @ 2015-09-08 9:41 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi, Raushaniya Maksudova
On 09/08/2015 12:33 PM, Paolo Bonzini wrote:
>
> On 08/09/2015 10:00, Denis V. Lunev wrote:
>> How the given solution works?
>>
>> If disk-deadlines option is enabled for a drive, one controls time completion
>> of this drive's requests. The method is as follows (further assume that this
>> option is enabled).
>>
>> Every drive has its own red-black tree for keeping its requests.
>> Expiration time of the request is a key, cookie (as id of request) is an
>> appropriate node. Assume that every requests has 8 seconds to be completed.
>> If request was not accomplished in time for some reasons (server crash or smth
>> else), timer of this drive is fired and an appropriate callback requests to
>> stop Virtial Machine (VM).
>>
>> VM remains stopped until all requests from the disk which caused VM's stopping
>> are completed. Furthermore, if there is another disks with 'disk-deadlines=on'
>> whose requests are waiting to be completed, do not start VM : wait completion
>> of all "late" requests from all disks.
>>
>> Furthermore, all requests which caused VM stopping (or those that just were not
>> completed in time) could be printed using "info disk-deadlines" qemu monitor
>> option as follows:
> This topic has come up several times in the past.
>
> I agree that the current behavior is not great, but I am not sure that
> timeouts are safe. For example, how is disk-deadlines=on different from
> NFS soft mounts? The NFS man page says
>
> NB: A so-called "soft" timeout can cause silent data corruption in
> certain cases. As such, use the soft option only when client
> responsiveness is more important than data integrity. Using NFS
> over TCP or increasing the value of the retrans option may
> mitigate some of the risks of using the soft option.
>
> Note how it only says "mitigate", not solve.
>
> Paolo
This solution is far from perfect as there is a race window for
request completion anyway. Still, the amount of failures is
reduced by 2-3 orders of magnitude.

The behavior is similar not to soft mounts, which could corrupt
the data, but to hard mounts, which are the default AFAIR. It
will not corrupt the data and patiently waits for the request
to complete.

Without the disk the guest is not able to serve any requests,
so keeping it running does not make much sense.

This approach has been used by Odin in production for years and
it allowed us to significantly reduce the amount of end-user
complaints. We were unable to come up with any reasonable
solution without guest modifications or timeout tuning.

Anyway, this code is off by default, storage-agnostic and
self-contained. Yes, we would be able to maintain it out-of-tree
for ourselves, but...
Den
* Re: [Qemu-devel] [PATCH 4/5] disk_deadlines: add control of requests time expiration
From: Denis V. Lunev @ 2015-09-08 9:42 UTC (permalink / raw)
To: Fam Zheng
Cc: Kevin Wolf, qemu-block, qemu-devel, Stefan Hajnoczi,
Raushaniya Maksudova
On 09/08/2015 12:35 PM, Fam Zheng wrote:
> On Tue, 09/08 11:00, Denis V. Lunev wrote:
>> typedef struct DiskDeadlines {
>> bool enabled;
>> + bool expired_tree;
>> + pthread_mutex_t mtx_tree;
> This won't compile on win32, probably use QemuMutex instead?
>
> In file included from /tmp/qemu-build/include/block/accounting.h:30:0,
> from /tmp/qemu-build/include/block/block.h:8,
> from /tmp/qemu-build/include/monitor/monitor.h:6,
> from /tmp/qemu-build/util/osdep.c:51:
> /tmp/qemu-build/include/block/disk-deadlines.h:38:5: error: unknown type name 'pthread_mutex_t'
> pthread_mutex_t mtx_tree;
> ^
> /tmp/qemu-build/rules.mak:57: recipe for target 'util/osdep.o' failed
>
got this. Thank you
* Re: [Qemu-devel] [PATCH RFC 0/5] disk deadlines
From: Paolo Bonzini @ 2015-09-08 9:43 UTC (permalink / raw)
To: Denis V. Lunev
Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi, Raushaniya Maksudova
On 08/09/2015 11:41, Denis V. Lunev wrote:
> This solution is far not perfect as there is a race window for
> request complete anyway. Though the amount of failures is
> reduced by 2-3 orders of magnitude.
>
> The behavior is similar not for soft mounts, which could
> corrupt the data but to hard mounts which are default AFAIR.
> It will not corrupt the data and should patiently wait
> request complete.
>
> Without the disk the guest is not able to serve any request and
> thus keeping it running does not make serious sense.
>
> This approach is used by Odin in production for years and
> we were able to seriously reduce the amount of end-user
> reclamations. We were unable to invent any reasonable
> solution without guest modification/timeouts tuning.
>
> Anyway, this code is off by default, storage agnostic, separated.
> Yes, we will be able to maintain it for us out-of-tree, but...
I'm not saying the patches are unacceptable, not at all. It just needs
a bit of documentation to understand the tradeoffs. I admit I have not
even started reading the code.
Your experience is very valuable, and it's great that you are
contributing it to QEMU and KVM!
Paolo
* Re: [Qemu-devel] [PATCH RFC 0/5] disk deadlines
From: Kevin Wolf @ 2015-09-08 10:07 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Denis V. Lunev, qemu-devel, Stefan Hajnoczi, Raushaniya Maksudova
Am 08.09.2015 um 11:33 hat Paolo Bonzini geschrieben:
>
>
> On 08/09/2015 10:00, Denis V. Lunev wrote:
> > How the given solution works?
> >
> > If disk-deadlines option is enabled for a drive, one controls time completion
> > of this drive's requests. The method is as follows (further assume that this
> > option is enabled).
> >
> > Every drive has its own red-black tree for keeping its requests.
> > Expiration time of the request is a key, cookie (as id of request) is an
> > appropriate node. Assume that every requests has 8 seconds to be completed.
> > If request was not accomplished in time for some reasons (server crash or smth
> > else), timer of this drive is fired and an appropriate callback requests to
> > stop Virtial Machine (VM).
> >
> > VM remains stopped until all requests from the disk which caused VM's stopping
> > are completed. Furthermore, if there is another disks with 'disk-deadlines=on'
> > whose requests are waiting to be completed, do not start VM : wait completion
> > of all "late" requests from all disks.
> >
> > Furthermore, all requests which caused VM stopping (or those that just were not
> > completed in time) could be printed using "info disk-deadlines" qemu monitor
> > option as follows:
>
> This topic has come up several times in the past.
>
> I agree that the current behavior is not great, but I am not sure that
> timeouts are safe. For example, how is disk-deadlines=on different from
> NFS soft mounts?
I think the main difference is that it stops the VM and only allows to
continue once the request has completed, either successfully or with a
final failure (if I understand the cover letter correctly, I haven't
looked at the patches yet).
Kevin
> The NFS man page says
>
> NB: A so-called "soft" timeout can cause silent data corruption in
> certain cases. As such, use the soft option only when client
> responsiveness is more important than data integrity. Using NFS
> over TCP or increasing the value of the retrans option may
> mitigate some of the risks of using the soft option.
>
> Note how it only says "mitigate", not solve.
>
> Paolo
* Re: [Qemu-devel] [PATCH RFC 0/5] disk deadlines
From: Denis V. Lunev @ 2015-09-08 10:08 UTC (permalink / raw)
To: Kevin Wolf, Paolo Bonzini
Cc: qemu-devel, Stefan Hajnoczi, Raushaniya Maksudova
On 09/08/2015 01:07 PM, Kevin Wolf wrote:
> Am 08.09.2015 um 11:33 hat Paolo Bonzini geschrieben:
>>
>> On 08/09/2015 10:00, Denis V. Lunev wrote:
>>> How the given solution works?
>>>
>>> If disk-deadlines option is enabled for a drive, one controls time completion
>>> of this drive's requests. The method is as follows (further assume that this
>>> option is enabled).
>>>
>>> Every drive has its own red-black tree for keeping its requests.
>>> Expiration time of the request is a key, cookie (as id of request) is an
>>> appropriate node. Assume that every requests has 8 seconds to be completed.
>>> If request was not accomplished in time for some reasons (server crash or smth
>>> else), timer of this drive is fired and an appropriate callback requests to
>>> stop Virtial Machine (VM).
>>>
>>> VM remains stopped until all requests from the disk which caused VM's stopping
>>> are completed. Furthermore, if there is another disks with 'disk-deadlines=on'
>>> whose requests are waiting to be completed, do not start VM : wait completion
>>> of all "late" requests from all disks.
>>>
>>> Furthermore, all requests which caused VM stopping (or those that just were not
>>> completed in time) could be printed using "info disk-deadlines" qemu monitor
>>> option as follows:
>> This topic has come up several times in the past.
>>
>> I agree that the current behavior is not great, but I am not sure that
>> timeouts are safe. For example, how is disk-deadlines=on different from
>> NFS soft mounts?
> I think the main difference is that it stops the VM and only allows to
> continue once the request has completed, either successfully or with a
> final failure (if I understand the cover letter correctly, I haven't
> looked at the patches yet).
Exactly. The VM is paused until the I/O is finally completed.
> Kevin
>
>> The NFS man page says
>>
>> NB: A so-called "soft" timeout can cause silent data corruption in
>> certain cases. As such, use the soft option only when client
>> responsiveness is more important than data integrity. Using NFS
>> over TCP or increasing the value of the retrans option may
>> mitigate some of the risks of using the soft option.
>>
>> Note how it only says "mitigate", not solve.
>>
>> Paolo
* Re: [Qemu-devel] [PATCH RFC 0/5] disk deadlines
From: Kevin Wolf @ 2015-09-08 10:11 UTC (permalink / raw)
To: Fam Zheng
Cc: Denis V. Lunev, qemu-devel, Stefan Hajnoczi,
qemu-block@nongnu.org, Raushaniya Maksudova
Am 08.09.2015 um 11:20 hat Fam Zheng geschrieben:
> [Cc'ing qemu-block@nongnu.org]
>
> On Tue, 09/08 11:00, Denis V. Lunev wrote:
> > To avoid such situation this patchset introduces patch per-drive option
> > "disk-deadlines=on|off" which is unset by default.
>
> The general idea sounds very nice. Thanks!
>
> Should we allow user configuration on the timeout? If so, the option should be
> something like "timeout-seconds=0,1,2...". Also I think we could use werror
> and rerror to control the handling policy (whether to ignore/report/stop on
> timeout).
Yes, I think the timeout needs to be configurable. However, the only
action that makes sense is stop. Everything else would be unsafe because
the running request could still complete at a later point.
Another question I have related to safety is whether (and how) you can
migrate away from a host that has been stopped because of a timeout. I
guess migration needs to be blocked then?
Kevin
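
(As a rough sketch of one possible answer, QEMU's generic migration-blocker
API could be registered while any deadline tree is expired and dropped once
the VM is allowed to run again; the exact migrate_add_blocker() signature
differs between QEMU versions, so treat this as an assumption rather than
part of the series:)

    /* Sketch only: block migration while a disk-deadlines stop is active. */
    #include "migration/migration.h"
    #include "qapi/error.h"

    static Error *disk_deadlines_mig_blocker;

    static void block_migration_on_expire(void)
    {
        error_setg(&disk_deadlines_mig_blocker,
                   "disk-deadlines: request timed out, VM stopped");
        migrate_add_blocker(disk_deadlines_mig_blocker);
    }

    static void unblock_migration_on_recover(void)
    {
        migrate_del_blocker(disk_deadlines_mig_blocker);
        error_free(disk_deadlines_mig_blocker);
        disk_deadlines_mig_blocker = NULL;
    }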
* Re: [Qemu-devel] [PATCH RFC 0/5] disk deadlines
From: Denis V. Lunev @ 2015-09-08 10:13 UTC (permalink / raw)
To: Kevin Wolf, Fam Zheng
Cc: qemu-devel, Stefan Hajnoczi,
qemu-block@nongnu.org, Raushaniya Maksudova
On 09/08/2015 01:11 PM, Kevin Wolf wrote:
> Am 08.09.2015 um 11:20 hat Fam Zheng geschrieben:
>> [Cc'ing qemu-block@nongnu.org]
>>
>> On Tue, 09/08 11:00, Denis V. Lunev wrote:
>>> To avoid such situation this patchset introduces patch per-drive option
>>> "disk-deadlines=on|off" which is unset by default.
>> The general idea sounds very nice. Thanks!
>>
>> Should we allow user configuration on the timeout? If so, the option should be
>> something like "timeout-seconds=0,1,2...". Also I think we could use werror
>> and rerror to control the handling policy (whether to ignore/report/stop on
>> timeout).
> Yes, I think the timeout needs to be configurable. However, the only
> action that makes sense is stop. Everything else would be unsafe because
> the running request could still complete at a later point.
>
> Another question I have related to safety is whether (and how) you can
> migrate away from a host that has been stopped because of a timeout. I
> guess migration needs to be blocked then?
>
> Kevin
This sounds reasonable. Noted for the next iteration.
* Re: [Qemu-devel] [PATCH RFC 0/5] disk deadlines
From: Fam Zheng @ 2015-09-08 10:20 UTC (permalink / raw)
To: Kevin Wolf
Cc: Denis V. Lunev, qemu-devel, Stefan Hajnoczi,
qemu-block@nongnu.org, Raushaniya Maksudova
On Tue, 09/08 12:11, Kevin Wolf wrote:
> Am 08.09.2015 um 11:20 hat Fam Zheng geschrieben:
> > [Cc'ing qemu-block@nongnu.org]
> >
> > On Tue, 09/08 11:00, Denis V. Lunev wrote:
> > > To avoid such situation this patchset introduces patch per-drive option
> > > "disk-deadlines=on|off" which is unset by default.
> >
> > The general idea sounds very nice. Thanks!
> >
> > Should we allow user configuration on the timeout? If so, the option should be
> > something like "timeout-seconds=0,1,2...". Also I think we could use werror
> > and rerror to control the handling policy (whether to ignore/report/stop on
> > timeout).
>
> Yes, I think the timeout needs to be configurable. However, the only
> action that makes sense is stop. Everything else would be unsafe because
> the running request could still complete at a later point.
What if the timeout happens on a quorum child? The management can replace it
transparently without stopping the VM.
Fam
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 0/5] disk deadlines
2015-09-08 9:33 ` Paolo Bonzini
2015-09-08 9:41 ` Denis V. Lunev
2015-09-08 10:07 ` Kevin Wolf
@ 2015-09-08 10:22 ` Stefan Hajnoczi
2015-09-08 10:26 ` Paolo Bonzini
2015-09-08 10:36 ` Denis V. Lunev
2 siblings, 2 replies; 48+ messages in thread
From: Stefan Hajnoczi @ 2015-09-08 10:22 UTC (permalink / raw)
To: Paolo Bonzini
Cc: Kevin Wolf, Denis V. Lunev, qemu-devel, Raushaniya Maksudova
On Tue, Sep 08, 2015 at 11:33:09AM +0200, Paolo Bonzini wrote:
>
>
> On 08/09/2015 10:00, Denis V. Lunev wrote:
> > How the given solution works?
> >
> > If disk-deadlines option is enabled for a drive, one controls time completion
> > of this drive's requests. The method is as follows (further assume that this
> > option is enabled).
> >
> > Every drive has its own red-black tree for keeping its requests.
> > Expiration time of the request is a key, cookie (as id of request) is an
> > appropriate node. Assume that every requests has 8 seconds to be completed.
> > If request was not accomplished in time for some reasons (server crash or smth
> > else), timer of this drive is fired and an appropriate callback requests to
> > stop Virtial Machine (VM).
> >
> > VM remains stopped until all requests from the disk which caused VM's stopping
> > are completed. Furthermore, if there is another disks with 'disk-deadlines=on'
> > whose requests are waiting to be completed, do not start VM : wait completion
> > of all "late" requests from all disks.
> >
> > Furthermore, all requests which caused VM stopping (or those that just were not
> > completed in time) could be printed using "info disk-deadlines" qemu monitor
> > option as follows:
>
> This topic has come up several times in the past.
>
> I agree that the current behavior is not great, but I am not sure that
> timeouts are safe. For example, how is disk-deadlines=on different from
> NFS soft mounts? The NFS man page says
>
> NB: A so-called "soft" timeout can cause silent data corruption in
> certain cases. As such, use the soft option only when client
> responsiveness is more important than data integrity. Using NFS
> over TCP or increasing the value of the retrans option may
> mitigate some of the risks of using the soft option.
>
> Note how it only says "mitigate", not solve.
The risky part of "soft" mounts is probably that the client doesn't know
whether or not the request completed. So it doesn't know the state of
the data on the server after a write request. This is the classic
Byzantine fault tolerance problem in distributed systems.
This patch series pauses the guest like rerror=stop. Therefore it's
different from NFS "soft" mounts, which are like rerror=report.
Guests running without this patch series may suffer from the NFS "soft"
mounts problem when they time out and give up on the I/O request just as
it actually completes on the server, leaving the data in a different
state than expected.
This patch series solves that problem by pausing the guest. Action can
be taken on the host to bring storage back and resume (similar to
ENOSPC).
In order for this to work well, QEMU's timeout value must be shorter
than the guest's own timeout value.
Stefan
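As a rough sketch of the mechanism this relies on (placeholder names and
fields, not the actual patch code; qemu_system_vmstop_request_prepare(),
qemu_system_vmstop_request() and RUN_STATE_PAUSED are existing QEMU
interfaces, the timer API is QEMU's standard one):

    #include "qemu/timer.h"
    #include "sysemu/sysemu.h"

    #define DISK_DEADLINE_NS (8LL * 1000 * 1000 * 1000)  /* default: 8 seconds */

    typedef struct DiskDeadlinesSketch {
        QEMUTimer *timer;    /* fires when the earliest tracked request expires */
        bool expired_tree;   /* at least one request has missed its deadline    */
        /* the series keeps in-flight requests in a per-drive red-black tree    */
    } DiskDeadlinesSketch;

    /* Timer callback: instead of completing the request with an error (the
     * NFS "soft" hazard), ask the main loop to pause the VM, like rerror=stop. */
    static void deadline_expired_cb(void *opaque)
    {
        DiskDeadlinesSketch *dd = opaque;

        dd->expired_tree = true;
        qemu_system_vmstop_request_prepare();
        qemu_system_vmstop_request(RUN_STATE_PAUSED);
    }

    static void deadline_init(DiskDeadlinesSketch *dd)
    {
        dd->expired_tree = false;
        dd->timer = timer_new_ns(QEMU_CLOCK_REALTIME, deadline_expired_cb, dd);
    }

    /* Called when a request is submitted on a tracked drive: (re-)arm the
     * timer.  The deadline must stay below the guest's own ATA/SCSI timeout. */
    static void deadline_track_request(DiskDeadlinesSketch *dd)
    {
        timer_mod(dd->timer,
                  qemu_clock_get_ns(QEMU_CLOCK_REALTIME) + DISK_DEADLINE_NS);
    }

Once all late requests have finally completed, the series uses the vmstart
request added in patch 2 to resume the guest.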
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 0/5] disk deadlines
2015-09-08 10:22 ` Stefan Hajnoczi
@ 2015-09-08 10:26 ` Paolo Bonzini
2015-09-08 10:36 ` Denis V. Lunev
1 sibling, 0 replies; 48+ messages in thread
From: Paolo Bonzini @ 2015-09-08 10:26 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: Kevin Wolf, Denis V. Lunev, qemu-devel, Raushaniya Maksudova
On 08/09/2015 12:22, Stefan Hajnoczi wrote:
> Guests running without this patch series may suffer from the NFS "soft"
> mounts problem when they time out and give up on the I/O request just as
> it actually completes on the server, leaving the data in a different
> state than expected.
>
> This patch series solves that problem by pausing the guest. Action can
> be taken on the host to bring storage back and resume (similar to
> ENOSPC).
Serves me right for going too fast through the cover letter. This
sounds like a great solution to the problem.
Paolo
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 0/5] disk deadlines
2015-09-08 10:22 ` Stefan Hajnoczi
2015-09-08 10:26 ` Paolo Bonzini
@ 2015-09-08 10:36 ` Denis V. Lunev
1 sibling, 0 replies; 48+ messages in thread
From: Denis V. Lunev @ 2015-09-08 10:36 UTC (permalink / raw)
To: Stefan Hajnoczi, Paolo Bonzini
Cc: Kevin Wolf, qemu-devel, Raushaniya Maksudova
On 09/08/2015 01:22 PM, Stefan Hajnoczi wrote:
> On Tue, Sep 08, 2015 at 11:33:09AM +0200, Paolo Bonzini wrote:
>>
>> On 08/09/2015 10:00, Denis V. Lunev wrote:
>>> How the given solution works?
>>>
>>> If disk-deadlines option is enabled for a drive, one controls time completion
>>> of this drive's requests. The method is as follows (further assume that this
>>> option is enabled).
>>>
>>> Every drive has its own red-black tree for keeping its requests.
>>> Expiration time of the request is a key, cookie (as id of request) is an
>>> appropriate node. Assume that every requests has 8 seconds to be completed.
>>> If request was not accomplished in time for some reasons (server crash or smth
>>> else), timer of this drive is fired and an appropriate callback requests to
>>> stop Virtial Machine (VM).
>>>
>>> VM remains stopped until all requests from the disk which caused VM's stopping
>>> are completed. Furthermore, if there is another disks with 'disk-deadlines=on'
>>> whose requests are waiting to be completed, do not start VM : wait completion
>>> of all "late" requests from all disks.
>>>
>>> Furthermore, all requests which caused VM stopping (or those that just were not
>>> completed in time) could be printed using "info disk-deadlines" qemu monitor
>>> option as follows:
>> This topic has come up several times in the past.
>>
>> I agree that the current behavior is not great, but I am not sure that
>> timeouts are safe. For example, how is disk-deadlines=on different from
>> NFS soft mounts? The NFS man page says
>>
>> NB: A so-called "soft" timeout can cause silent data corruption in
>> certain cases. As such, use the soft option only when client
>> responsiveness is more important than data integrity. Using NFS
>> over TCP or increasing the value of the retrans option may
>> mitigate some of the risks of using the soft option.
>>
>> Note how it only says "mitigate", not solve.
> The risky part of "soft" mounts is probably that the client doesn't know
> whether or not the request completed. So it doesn't know the state of
> the data on the server after a write request. This is the classic
> Byzantine fault tolerance problem in distributed systems.
>
> This patch series pauses the guest like rerror=stop. Therefore it's
> different from NFS "soft" mounts, which are like rerror=report.
>
> Guests running without this patch series may suffer from the NFS "soft"
> mounts problem when they time out and give up on the I/O request just as
> it actually completes on the server, leaving the data in a different
> state than expected.
>
> This patch series solves that problem by pausing the guest. Action can
> be taken on the host to bring storage back and resume (similar to
> ENOSPC).
>
> In order for this to work well, QEMU's timeout value must be shorter
> than the guest's own timeout value.
>
> Stefan
nice summary, thank you :)
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 0/5] disk deadlines
2015-09-08 9:41 ` Denis V. Lunev
2015-09-08 9:43 ` Paolo Bonzini
@ 2015-09-08 10:37 ` Andrey Korolyov
2015-09-08 10:50 ` Denis V. Lunev
1 sibling, 1 reply; 48+ messages in thread
From: Andrey Korolyov @ 2015-09-08 10:37 UTC (permalink / raw)
To: Denis V. Lunev
Cc: Kevin Wolf, Paolo Bonzini, qemu-devel@nongnu.org, Stefan Hajnoczi,
Raushaniya Maksudova
On Tue, Sep 8, 2015 at 12:41 PM, Denis V. Lunev <den@openvz.org> wrote:
> On 09/08/2015 12:33 PM, Paolo Bonzini wrote:
>>
>>
>> On 08/09/2015 10:00, Denis V. Lunev wrote:
>>>
>>> How the given solution works?
>>>
>>> If disk-deadlines option is enabled for a drive, one controls time
>>> completion
>>> of this drive's requests. The method is as follows (further assume that
>>> this
>>> option is enabled).
>>>
>>> Every drive has its own red-black tree for keeping its requests.
>>> Expiration time of the request is a key, cookie (as id of request) is an
>>> appropriate node. Assume that every requests has 8 seconds to be
>>> completed.
>>> If request was not accomplished in time for some reasons (server crash or
>>> smth
>>> else), timer of this drive is fired and an appropriate callback requests
>>> to
>>> stop Virtial Machine (VM).
>>>
>>> VM remains stopped until all requests from the disk which caused VM's
>>> stopping
>>> are completed. Furthermore, if there is another disks with
>>> 'disk-deadlines=on'
>>> whose requests are waiting to be completed, do not start VM : wait
>>> completion
>>> of all "late" requests from all disks.
>>>
>>> Furthermore, all requests which caused VM stopping (or those that just
>>> were not
>>> completed in time) could be printed using "info disk-deadlines" qemu
>>> monitor
>>> option as follows:
>>
>> This topic has come up several times in the past.
>>
>> I agree that the current behavior is not great, but I am not sure that
>> timeouts are safe. For example, how is disk-deadlines=on different from
>> NFS soft mounts? The NFS man page says
>>
>> NB: A so-called "soft" timeout can cause silent data corruption in
>> certain cases. As such, use the soft option only when client
>> responsiveness is more important than data integrity. Using NFS
>> over TCP or increasing the value of the retrans option may
>> mitigate some of the risks of using the soft option.
>>
>> Note how it only says "mitigate", not solve.
>>
>> Paolo
>
> This solution is far not perfect as there is a race window for
> request complete anyway. Though the amount of failures is
> reduced by 2-3 orders of magnitude.
>
> The behavior is similar not for soft mounts, which could
> corrupt the data but to hard mounts which are default AFAIR.
> It will not corrupt the data and should patiently wait
> request complete.
>
> Without the disk the guest is not able to serve any request and
> thus keeping it running does not make serious sense.
>
> This approach is used by Odin in production for years and
> we were able to seriously reduce the amount of end-user
> reclamations. We were unable to invent any reasonable
> solution without guest modification/timeouts tuning.
>
> Anyway, this code is off by default, storage agnostic, separated.
> Yes, we will be able to maintain it for us out-of-tree, but...
> Den
>
Thanks, the series looks very promising. I have a rather tangential question:
assuming that we have a guest for which SCSI/IDE is the only option, wouldn't
the timekeeping issues from the pause/resume action be a corner problem there?
The assumption is based on the fact that guests with appropriate kvmclock
settings can handle the resulting timer jump rather gracefully and at the same
time are usually not limited to the 'legacy' storage interfaces, while guests
stuck with interfaces which are not prone to timing out can commonly misbehave
from a large timer jump as well. For IDE, the approach proposed by the patch
is the only option, while for SCSI it is better to tune the guest driver
timeout instead, if the guest OS allows that. So yes, a description of the
possible drawbacks would be very useful there.
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 0/5] disk deadlines
2015-09-08 10:20 ` Fam Zheng
@ 2015-09-08 10:46 ` Denis V. Lunev
2015-09-08 10:49 ` Kevin Wolf
1 sibling, 0 replies; 48+ messages in thread
From: Denis V. Lunev @ 2015-09-08 10:46 UTC (permalink / raw)
To: Fam Zheng, Kevin Wolf
Cc: qemu-devel, Stefan Hajnoczi,
qemu-block@nongnu.org, Raushaniya Maksudova
On 09/08/2015 01:20 PM, Fam Zheng wrote:
> On Tue, 09/08 12:11, Kevin Wolf wrote:
>> Am 08.09.2015 um 11:20 hat Fam Zheng geschrieben:
>>> [Cc'ing qemu-block@nongnu.org]
>>>
>>> On Tue, 09/08 11:00, Denis V. Lunev wrote:
>>>> To avoid such situation this patchset introduces patch per-drive option
>>>> "disk-deadlines=on|off" which is unset by default.
>>> The general idea sounds very nice. Thanks!
>>>
>>> Should we allow user configuration on the timeout? If so, the option should be
>>> something like "timeout-seconds=0,1,2...". Also I think we could use werror
>>> and rerror to control the handling policy (whether to ignore/report/stop on
>>> timeout).
>> Yes, I think the timeout needs to be configurable. However, the only
>> action that makes sense is stop. Everything else would be unsafe because
>> the running request could still complete at a later point.
> What if the timeout happens on a quorum child? The management can replace it
> transparently without stopping the VM.
>
> Fam
I have not thought about this at all as we do not use
quorum in our setups. But some sort of management
stuff can be added on top of this for sure.
Den
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 0/5] disk deadlines
2015-09-08 10:20 ` Fam Zheng
2015-09-08 10:46 ` Denis V. Lunev
@ 2015-09-08 10:49 ` Kevin Wolf
2015-09-08 13:20 ` Fam Zheng
1 sibling, 1 reply; 48+ messages in thread
From: Kevin Wolf @ 2015-09-08 10:49 UTC (permalink / raw)
To: Fam Zheng
Cc: Denis V. Lunev, qemu-devel, Stefan Hajnoczi,
qemu-block@nongnu.org, Raushaniya Maksudova
Am 08.09.2015 um 12:20 hat Fam Zheng geschrieben:
> On Tue, 09/08 12:11, Kevin Wolf wrote:
> > Am 08.09.2015 um 11:20 hat Fam Zheng geschrieben:
> > > [Cc'ing qemu-block@nongnu.org]
> > >
> > > On Tue, 09/08 11:00, Denis V. Lunev wrote:
> > > > To avoid such situation this patchset introduces patch per-drive option
> > > > "disk-deadlines=on|off" which is unset by default.
> > >
> > > The general idea sounds very nice. Thanks!
> > >
> > > Should we allow user configuration on the timeout? If so, the option should be
> > > something like "timeout-seconds=0,1,2...". Also I think we could use werror
> > > and rerror to control the handling policy (whether to ignore/report/stop on
> > > timeout).
> >
> > Yes, I think the timeout needs to be configurable. However, the only
> > action that makes sense is stop. Everything else would be unsafe because
> > the running request could still complete at a later point.
>
> What if the timeout happens on a quorum child? The management can replace it
> transparently without stopping the VM.
This is getting tricky...
I'll try this: We need to attribute timed out requests to a specific BDS.
A user of a BlockBackend can run if all of its (recursive) children
don't have timed out requests. So if the only thing that is blocked is a
BDS used for an NBD server, but it isn't used by the guest, the guest
can keep running. The same way, after removing a bad quorum child, the
guest can be continued again.
Somehow we must make sure that timeouts are propagated through the BDS
tree (do we need parent notifiers?), and that at the same time the
quorum BDS's timeout status is updated when the bad child is removed.
The trickier part might actually be to remove a BDS from quorum while a
request is still in flight. The traditional approach is bdrv_drain(),
but that won't work here. We want to remove the child while quorum still
has a request pending on it.
I don't think this will result automatically from doing the timeout
work. It will instead need some serious design work.
Kevin
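As a toy model of that rule (standalone C with an invented structure; this
is not QEMU code, only an illustration of the recursive check):

    #include <stdbool.h>
    #include <stddef.h>

    typedef struct NodeSketch {
        int timed_out_requests;        /* requests past their deadline here   */
        struct NodeSketch **children;  /* e.g. quorum children, file, backing */
        size_t nb_children;
    } NodeSketch;

    /* True if this node or anything below it still has a timed-out request. */
    static bool subtree_timed_out(const NodeSketch *n)
    {
        if (n->timed_out_requests > 0) {
            return true;
        }
        for (size_t i = 0; i < n->nb_children; i++) {
            if (subtree_timed_out(n->children[i])) {
                return true;
            }
        }
        return false;
    }

A guest-facing BlockBackend would only be allowed to run while this check is
false for its root, so a stuck BDS that is merely exported via NBD does not
keep the guest paused, and detaching a bad quorum child (and clearing its
contribution) lets the guest continue.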
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 0/5] disk deadlines
2015-09-08 10:37 ` Andrey Korolyov
@ 2015-09-08 10:50 ` Denis V. Lunev
0 siblings, 0 replies; 48+ messages in thread
From: Denis V. Lunev @ 2015-09-08 10:50 UTC (permalink / raw)
To: Andrey Korolyov
Cc: Kevin Wolf, Paolo Bonzini, qemu-devel@nongnu.org, Stefan Hajnoczi,
Raushaniya Maksudova
On 09/08/2015 01:37 PM, Andrey Korolyov wrote:
> On Tue, Sep 8, 2015 at 12:41 PM, Denis V. Lunev <den@openvz.org> wrote:
>> On 09/08/2015 12:33 PM, Paolo Bonzini wrote:
>>>
>>> On 08/09/2015 10:00, Denis V. Lunev wrote:
>>>> How the given solution works?
>>>>
>>>> If disk-deadlines option is enabled for a drive, one controls time
>>>> completion
>>>> of this drive's requests. The method is as follows (further assume that
>>>> this
>>>> option is enabled).
>>>>
>>>> Every drive has its own red-black tree for keeping its requests.
>>>> Expiration time of the request is a key, cookie (as id of request) is an
>>>> appropriate node. Assume that every requests has 8 seconds to be
>>>> completed.
>>>> If request was not accomplished in time for some reasons (server crash or
>>>> smth
>>>> else), timer of this drive is fired and an appropriate callback requests
>>>> to
>>>> stop Virtial Machine (VM).
>>>>
>>>> VM remains stopped until all requests from the disk which caused VM's
>>>> stopping
>>>> are completed. Furthermore, if there is another disks with
>>>> 'disk-deadlines=on'
>>>> whose requests are waiting to be completed, do not start VM : wait
>>>> completion
>>>> of all "late" requests from all disks.
>>>>
>>>> Furthermore, all requests which caused VM stopping (or those that just
>>>> were not
>>>> completed in time) could be printed using "info disk-deadlines" qemu
>>>> monitor
>>>> option as follows:
>>> This topic has come up several times in the past.
>>>
>>> I agree that the current behavior is not great, but I am not sure that
>>> timeouts are safe. For example, how is disk-deadlines=on different from
>>> NFS soft mounts? The NFS man page says
>>>
>>> NB: A so-called "soft" timeout can cause silent data corruption in
>>> certain cases. As such, use the soft option only when client
>>> responsiveness is more important than data integrity. Using NFS
>>> over TCP or increasing the value of the retrans option may
>>> mitigate some of the risks of using the soft option.
>>>
>>> Note how it only says "mitigate", not solve.
>>>
>>> Paolo
>> This solution is far not perfect as there is a race window for
>> request complete anyway. Though the amount of failures is
>> reduced by 2-3 orders of magnitude.
>>
>> The behavior is similar not for soft mounts, which could
>> corrupt the data but to hard mounts which are default AFAIR.
>> It will not corrupt the data and should patiently wait
>> request complete.
>>
>> Without the disk the guest is not able to serve any request and
>> thus keeping it running does not make serious sense.
>>
>> This approach is used by Odin in production for years and
>> we were able to seriously reduce the amount of end-user
>> reclamations. We were unable to invent any reasonable
>> solution without guest modification/timeouts tuning.
>>
>> Anyway, this code is off by default, storage agnostic, separated.
>> Yes, we will be able to maintain it for us out-of-tree, but...
>> Den
>>
> Thanks, the series looks very promising. I have a rather side question
> - assuming that we have a guest for which scsi/ide usage is only an
> option, wouldn`t the timekeeping issues from the pause/resume action
> be a corner problem there?
I do not think so. The guest can be paused/suspended by the
management and resumed. Normally it takes some time for the
guest to start seeing the time difference, and the speedup
is limited.
> The assumption based on a fact that the
> guests with appropriate kvmclock settings can rather softly handle a
> resulting timer jump and at the same moment they are not bounded at
> most to the 'legacy' storage interfaces, but those guests with
> interfaces which are not prone to 'time-outing' can commonly misbehave
> as well from a large timer jump. For an IDE, the approach proposed by
> a patch is an only option, and for SCSI it is better to tune guest
> driver timeout instead, if guest OS allows that. So yes, description
> for possible drawbacks would be very useful there.
OK. I will add a note about this.
Though there are cases when this timeout cannot be tuned at
all even for SCSI, e.g. Windows will BSOD with error 0x7B
early on boot without this solution applied, and I do not
know a good way to tweak that timeout in the guest. It is
far too guest-specific.
Den
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Qemu-devel] [PATCH 4/5] disk_deadlines: add control of requests time expiration
2015-09-08 8:00 ` [Qemu-devel] [PATCH 4/5] disk_deadlines: add control of requests time expiration Denis V. Lunev
2015-09-08 9:35 ` Fam Zheng
@ 2015-09-08 11:06 ` Kevin Wolf
2015-09-08 11:27 ` Denis V. Lunev
1 sibling, 1 reply; 48+ messages in thread
From: Kevin Wolf @ 2015-09-08 11:06 UTC (permalink / raw)
To: Denis V. Lunev; +Cc: Stefan Hajnoczi, qemu-devel, Raushaniya Maksudova
Am 08.09.2015 um 10:00 hat Denis V. Lunev geschrieben:
> From: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
>
> If disk-deadlines option is enabled for a drive, one controls time
> completion of this drive's requests. The method is as follows (further
> assume that this option is enabled).
>
> Every drive has its own red-black tree for keeping its requests.
> Expiration time of the request is a key, cookie (as id of request) is an
> appropriate node. Assume that every requests has 8 seconds to be completed.
> If request was not accomplished in time for some reasons (server crash or
> smth else), timer of this drive is fired and an appropriate callback
> requests to stop Virtial Machine (VM).
>
> VM remains stopped until all requests from the disk which caused VM's
> stopping are completed. Furthermore, if there is another disks whose
> requests are waiting to be completed, do not start VM : wait completion
> of all "late" requests from all disks.
>
> Signed-off-by: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> + disk_deadlines->expired_tree = true;
> + need_vmstop = !atomic_fetch_inc(&num_requests_vmstopped);
> + pthread_mutex_unlock(&disk_deadlines->mtx_tree);
> +
> + if (need_vmstop) {
> + qemu_system_vmstop_request_prepare();
> + qemu_system_vmstop_request(RUN_STATE_PAUSED);
> + }
> +}
What behaviour does this result in? If I understand correctly, this is
an indirect call of do_vm_stop(), which involves a bdrv_drain_all(). In
this case, qemu would completely block (including unresponsive monitor)
until the request can complete.
Is this what you are seeing with this patch, or why doesn't the
bdrv_drain_all() call cause such effects?
Kevin
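For reference, a condensed paraphrase of the stop path in question (the
function names are real ones, but the body is abridged and not verbatim
QEMU source):

    static int do_vm_stop_paraphrased(RunState state)
    {
        if (runstate_is_running()) {
            pause_all_vcpus();          /* the vCPUs stop first ...            */
            runstate_set(state);
            vm_state_notify(0, state);
        }
        bdrv_drain_all();               /* ... but this waits for every
                                         * in-flight request, including the one
                                         * that already timed out, so the main
                                         * loop and the monitor block here     */
        return bdrv_flush_all();
    }

The guest is already in the paused runstate when the drain blocks, which
matches the observation in the follow-up below.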
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Qemu-devel] [PATCH 4/5] disk_deadlines: add control of requests time expiration
2015-09-08 11:06 ` Kevin Wolf
@ 2015-09-08 11:27 ` Denis V. Lunev
2015-09-08 13:05 ` Kevin Wolf
0 siblings, 1 reply; 48+ messages in thread
From: Denis V. Lunev @ 2015-09-08 11:27 UTC (permalink / raw)
To: Kevin Wolf; +Cc: Stefan Hajnoczi, qemu-devel, Raushaniya Maksudova
On 09/08/2015 02:06 PM, Kevin Wolf wrote:
> Am 08.09.2015 um 10:00 hat Denis V. Lunev geschrieben:
>> From: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
>>
>> If disk-deadlines option is enabled for a drive, one controls time
>> completion of this drive's requests. The method is as follows (further
>> assume that this option is enabled).
>>
>> Every drive has its own red-black tree for keeping its requests.
>> Expiration time of the request is a key, cookie (as id of request) is an
>> appropriate node. Assume that every requests has 8 seconds to be completed.
>> If request was not accomplished in time for some reasons (server crash or
>> smth else), timer of this drive is fired and an appropriate callback
>> requests to stop Virtial Machine (VM).
>>
>> VM remains stopped until all requests from the disk which caused VM's
>> stopping are completed. Furthermore, if there is another disks whose
>> requests are waiting to be completed, do not start VM : wait completion
>> of all "late" requests from all disks.
>>
>> Signed-off-by: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> + disk_deadlines->expired_tree = true;
>> + need_vmstop = !atomic_fetch_inc(&num_requests_vmstopped);
>> + pthread_mutex_unlock(&disk_deadlines->mtx_tree);
>> +
>> + if (need_vmstop) {
>> + qemu_system_vmstop_request_prepare();
>> + qemu_system_vmstop_request(RUN_STATE_PAUSED);
>> + }
>> +}
> What behaviour does this result in? If I understand correctly, this is
> an indirect call of do_vm_stop(), which involves a bdrv_drain_all(). In
> this case, qemu would completely block (including unresponsive monitor)
> until the request can complete.
>
> Is this what you are seeing with this patch, or why doesn't the
> bdrv_drain_all() call cause such effects?
>
> Kevin
Interesting point. Yes, it flushes all requests and most likely
hangs inside, waiting for the requests to complete. But fortunately
this happens after the switch to the paused state, so the guest
does become paused. That's why I had missed this fact.
This could be considered a problem, but I have no good solution
at the moment. I should think a bit about it.
Nice catch, though!
Den
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Qemu-devel] [PATCH 4/5] disk_deadlines: add control of requests time expiration
2015-09-08 11:27 ` Denis V. Lunev
@ 2015-09-08 13:05 ` Kevin Wolf
2015-09-08 14:23 ` Denis V. Lunev
0 siblings, 1 reply; 48+ messages in thread
From: Kevin Wolf @ 2015-09-08 13:05 UTC (permalink / raw)
To: Denis V. Lunev; +Cc: Stefan Hajnoczi, qemu-devel, Raushaniya Maksudova
Am 08.09.2015 um 13:27 hat Denis V. Lunev geschrieben:
> interesting point. Yes, it flushes all requests and most likely
> hangs inside waiting requests to complete. But fortunately
> this happens after the switch to paused state thus
> the guest becomes paused. That's why I have missed this
> fact.
>
> This (could) be considered as a problem but I have no (good)
> solution at the moment. Should think a bit on.
Let me suggest a radically different design. Note that I don't say this
is necessarily how things should be done, I'm just trying to introduce
some new ideas and broaden the discussion, so that we have a larger set
of ideas from which we can pick the right solution(s).
The core of my idea would be a new filter block driver 'timeout' that
can be added on top of each BDS that could potentially fail, like a
raw-posix BDS pointing to a file on NFS. This way most pieces of the
solution are nicely modularised and don't touch the block layer core.
During normal operation the driver would just be passing through
requests to the lower layer. When it detects a timeout, however, it
completes the request it received with -ETIMEDOUT. It also completes any
new request it receives with -ETIMEDOUT without passing the request on
until the request that originally timed out returns. This is our safety
measure against anyone seeing whether or how the timed out request
modified data.
We need to make sure that bdrv_drain() doesn't wait for this request.
Possibly we need to introduce a .bdrv_drain callback that replaces the
default handling, because bdrv_requests_pending() in the default
handling considers bs->file, which would still have the timed out
request. We don't want to see this; bdrv_drain_all() should complete
even though that request is still pending internally (externally, we
returned -ETIMEDOUT, so we can consider it completed). This way the
monitor stays responsive and background jobs can go on if they don't use
the failing block device.
And then we essentially reuse the rerror/werror mechanism that we
already have to stop the VM. The device models would be extended to
always stop the VM on -ETIMEDOUT, regardless of the error policy. In
this state, the VM would even be migratable if you make sure that the
pending request can't modify the image on the destination host any more.
Do you think this could work, or did I miss something important?
Kevin
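A minimal sketch of such a 'timeout' filter, assuming the sector-based
coroutine callbacks of the block layer at the time; the state layout, the
detection of the timeout itself and the proposed .bdrv_drain hook are
placeholders, not an implementation:

    #include <errno.h>
    #include "block/block_int.h"

    typedef struct BDRVTimeoutState {
        bool timed_out;   /* a request on bs->file has missed its deadline */
    } BDRVTimeoutState;

    static coroutine_fn int timeout_co_readv(BlockDriverState *bs,
                                             int64_t sector_num, int nb_sectors,
                                             QEMUIOVector *qiov)
    {
        BDRVTimeoutState *s = bs->opaque;

        if (s->timed_out) {
            /* Fail fast and do not touch the lower layer until the request
             * that originally timed out has really returned. */
            return -ETIMEDOUT;
        }
        /* Normal operation: pass the request through to the child.
         * (Arming and cancelling the per-request timer is omitted here.) */
        return bdrv_co_readv(bs->file, sector_num, nb_sectors, qiov);
    }

    static BlockDriver bdrv_timeout = {
        .format_name   = "timeout",
        .instance_size = sizeof(BDRVTimeoutState),
        .bdrv_co_readv = timeout_co_readv,
        /* .bdrv_co_writev and .bdrv_co_flush_to_disk would follow the same
         * pattern; the new .bdrv_drain callback proposed above would report
         * nothing pending while timed_out is set, so bdrv_drain_all() and
         * the monitor are not blocked by the stuck request. */
    };

On top of this, the existing rerror/werror machinery in the device models
could map -ETIMEDOUT to a VM stop, as described above.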
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 0/5] disk deadlines
2015-09-08 10:49 ` Kevin Wolf
@ 2015-09-08 13:20 ` Fam Zheng
0 siblings, 0 replies; 48+ messages in thread
From: Fam Zheng @ 2015-09-08 13:20 UTC (permalink / raw)
To: Kevin Wolf
Cc: Denis V. Lunev, qemu-devel, Stefan Hajnoczi,
qemu-block@nongnu.org, Raushaniya Maksudova
On Tue, 09/08 12:49, Kevin Wolf wrote:
> Am 08.09.2015 um 12:20 hat Fam Zheng geschrieben:
> > On Tue, 09/08 12:11, Kevin Wolf wrote:
> > > Am 08.09.2015 um 11:20 hat Fam Zheng geschrieben:
> > > > [Cc'ing qemu-block@nongnu.org]
> > > >
> > > > On Tue, 09/08 11:00, Denis V. Lunev wrote:
> > > > > To avoid such situation this patchset introduces patch per-drive option
> > > > > "disk-deadlines=on|off" which is unset by default.
> > > >
> > > > The general idea sounds very nice. Thanks!
> > > >
> > > > Should we allow user configuration on the timeout? If so, the option should be
> > > > something like "timeout-seconds=0,1,2...". Also I think we could use werror
> > > > and rerror to control the handling policy (whether to ignore/report/stop on
> > > > timeout).
> > >
> > > Yes, I think the timeout needs to be configurable. However, the only
> > > action that makes sense is stop. Everything else would be unsafe because
> > > the running request could still complete at a later point.
> >
> > What if the timeout happens on a quorum child? The management can replace it
> > transparently without stopping the VM.
>
> This is getting tricky...
>
> I'll try this: We need to attribute timed out requests to a specific BDS.
> A user of a BlockBackend can run if all of its (recursive) children
> don't have timed out requests. So if the only thing that is blocked is a
> BDS used for an NBD server, but it isn't used by the guest, the guest
> can keep running. The same way, after removing a bad quorum child, the
> guest can be continued again.
>
> Somehow we must make sure that timeouts are propagated through the BDS
> tree (do we need parent notifiers?), and that at the same time the
> quorum BDS's timeout status is updated when the bad child is removed.
IIUC the implementation in this series already handles this cleanly with an
RB tree data structure, without messing with the BDS tree hierarchy.
>
> The trickier part might actually be to remove a BDS from quorum while a
> request is still in flight. The traditional approach is bdrv_drain(),
> but that won't work here. We want to remove the child while quorum has
> still a request pending on it.
I think the point here is to avoid accessing a dangling pointer, which
shouldn't be too hard given the BDS's reference counting.
Fam
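A small sketch of that point, using the existing bdrv_ref()/bdrv_unref()
calls; the two quorum-side helpers are hypothetical:

    /* When quorum issues a request to a child, it takes a reference so the
     * BDS cannot be freed even if management detaches it in the meantime. */
    static void quorum_issue_child_request(BlockDriverState *child_bs)
    {
        bdrv_ref(child_bs);
        /* ... submit the actual request to child_bs ... */
    }

    /* In the completion path the reference is dropped; only now may a
     * previously detached child actually go away. */
    static void quorum_child_request_done(BlockDriverState *child_bs)
    {
        /* ... record the result for the quorum vote ... */
        bdrv_unref(child_bs);
    }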
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Qemu-devel] [PATCH 4/5] disk_deadlines: add control of requests time expiration
2015-09-08 13:05 ` Kevin Wolf
@ 2015-09-08 14:23 ` Denis V. Lunev
2015-09-08 14:48 ` Kevin Wolf
0 siblings, 1 reply; 48+ messages in thread
From: Denis V. Lunev @ 2015-09-08 14:23 UTC (permalink / raw)
To: Kevin Wolf; +Cc: Stefan Hajnoczi, qemu-devel, Raushaniya Maksudova
On 09/08/2015 04:05 PM, Kevin Wolf wrote:
> Am 08.09.2015 um 13:27 hat Denis V. Lunev geschrieben:
>> interesting point. Yes, it flushes all requests and most likely
>> hangs inside waiting requests to complete. But fortunately
>> this happens after the switch to paused state thus
>> the guest becomes paused. That's why I have missed this
>> fact.
>>
>> This (could) be considered as a problem but I have no (good)
>> solution at the moment. Should think a bit on.
> Let me suggest a radically different design. Note that I don't say this
> is necessarily how things should be done, I'm just trying to introduce
> some new ideas and broaden the discussion, so that we have a larger set
> of ideas from which we can pick the right solution(s).
>
> The core of my idea would be a new filter block driver 'timeout' that
> can be added on top of each BDS that could potentially fail, like a
> raw-posix BDS pointing to a file on NFS. This way most pieces of the
> solution are nicely modularised and don't touch the block layer core.
>
> During normal operation the driver would just be passing through
> requests to the lower layer. When it detects a timeout, however, it
> completes the request it received with -ETIMEDOUT. It also completes any
> new request it receives with -ETIMEDOUT without passing the request on
> until the request that originally timed out returns. This is our safety
> measure against anyone seeing whether or how the timed out request
> modified data.
>
> We need to make sure that bdrv_drain() doesn't wait for this request.
> Possibly we need to introduce a .bdrv_drain callback that replaces the
> default handling, because bdrv_requests_pending() in the default
> handling considers bs->file, which would still have the timed out
> request. We don't want to see this; bdrv_drain_all() should complete
> even though that request is still pending internally (externally, we
> returned -ETIMEDOUT, so we can consider it completed). This way the
> monitor stays responsive and background jobs can go on if they don't use
> the failing block device.
>
> And then we essentially reuse the rerror/werror mechanism that we
> already have to stop the VM. The device models would be extended to
> always stop the VM on -ETIMEDOUT, regardless of the error policy. In
> this state, the VM would even be migratable if you make sure that the
> pending request can't modify the image on the destination host any more.
>
> Do you think this could work, or did I miss something important?
>
> Kevin
Could I propose an even more radical solution then?
My original approach was based on the requirement that
this code should be maintainable out-of-tree.
If the patch is merged, this constraint can be dropped.
Why not invent a 'terror' field in BdrvOptions
and process things in the core block layer without
a filter? The RB tree entry would simply not be created
if the policy is set to 'ignore'.
Den
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Qemu-devel] [PATCH 4/5] disk_deadlines: add control of requests time expiration
2015-09-08 14:23 ` Denis V. Lunev
@ 2015-09-08 14:48 ` Kevin Wolf
2015-09-10 10:27 ` Stefan Hajnoczi
0 siblings, 1 reply; 48+ messages in thread
From: Kevin Wolf @ 2015-09-08 14:48 UTC (permalink / raw)
To: Denis V. Lunev; +Cc: Stefan Hajnoczi, qemu-devel, Raushaniya Maksudova
Am 08.09.2015 um 16:23 hat Denis V. Lunev geschrieben:
> On 09/08/2015 04:05 PM, Kevin Wolf wrote:
> >Am 08.09.2015 um 13:27 hat Denis V. Lunev geschrieben:
> >>interesting point. Yes, it flushes all requests and most likely
> >>hangs inside waiting requests to complete. But fortunately
> >>this happens after the switch to paused state thus
> >>the guest becomes paused. That's why I have missed this
> >>fact.
> >>
> >>This (could) be considered as a problem but I have no (good)
> >>solution at the moment. Should think a bit on.
> >Let me suggest a radically different design. Note that I don't say this
> >is necessarily how things should be done, I'm just trying to introduce
> >some new ideas and broaden the discussion, so that we have a larger set
> >of ideas from which we can pick the right solution(s).
> >
> >The core of my idea would be a new filter block driver 'timeout' that
> >can be added on top of each BDS that could potentially fail, like a
> >raw-posix BDS pointing to a file on NFS. This way most pieces of the
> >solution are nicely modularised and don't touch the block layer core.
> >
> >During normal operation the driver would just be passing through
> >requests to the lower layer. When it detects a timeout, however, it
> >completes the request it received with -ETIMEDOUT. It also completes any
> >new request it receives with -ETIMEDOUT without passing the request on
> >until the request that originally timed out returns. This is our safety
> >measure against anyone seeing whether or how the timed out request
> >modified data.
> >
> >We need to make sure that bdrv_drain() doesn't wait for this request.
> >Possibly we need to introduce a .bdrv_drain callback that replaces the
> >default handling, because bdrv_requests_pending() in the default
> >handling considers bs->file, which would still have the timed out
> >request. We don't want to see this; bdrv_drain_all() should complete
> >even though that request is still pending internally (externally, we
> >returned -ETIMEDOUT, so we can consider it completed). This way the
> >monitor stays responsive and background jobs can go on if they don't use
> >the failing block device.
> >
> >And then we essentially reuse the rerror/werror mechanism that we
> >already have to stop the VM. The device models would be extended to
> >always stop the VM on -ETIMEDOUT, regardless of the error policy. In
> >this state, the VM would even be migratable if you make sure that the
> >pending request can't modify the image on the destination host any more.
> >
> >Do you think this could work, or did I miss something important?
> >
> >Kevin
> could I propose even more radical solution then?
>
> My original approach was based on the fact that
> this could should be maintainable out-of-stream.
> If the patch will be merged - this boundary condition
> could be dropped.
>
> Why not to invent 'terror' field on BdrvOptions
> and process things in core block layer without
> a filter? RB Tree entry will just not created if
> the policy will be set to 'ignore'.
'terror' might not be the most fortunate name... ;-)
The reason why I would prefer a filter driver is so the code and the
associated data structures are cleanly modularised and we can keep the
actual block layer core small and clean. The same is true for some other
functions that I would rather move out of the core into filter drivers
than add new cases (e.g. I/O throttling, backup notifiers, etc.), but
which are a bit harder to actually move because we already have old
interfaces that we can't break (we'll probably do it anyway eventually,
even if it needs a bit more compatibility code).
However, it seems that you are mostly touching code that is maintained
by Stefan, and Stefan used to be a bit more open to adding functionality
to the core, so my opinion might not be the last word.
Kevin
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Qemu-devel] [PATCH 5/5] disk_deadlines: add info disk-deadlines option
2015-09-08 8:00 ` [Qemu-devel] [PATCH 5/5] disk_deadlines: add info disk-deadlines option Denis V. Lunev
@ 2015-09-08 16:20 ` Eric Blake
2015-09-08 16:26 ` Eric Blake
2015-09-10 19:13 ` Denis V. Lunev
0 siblings, 2 replies; 48+ messages in thread
From: Eric Blake @ 2015-09-08 16:20 UTC (permalink / raw)
To: Denis V. Lunev
Cc: Kevin Wolf, Markus Armbruster, qemu-devel, Raushaniya Maksudova,
Luiz Capitulino, Stefan Hajnoczi
On 09/08/2015 02:00 AM, Denis V. Lunev wrote:
> From: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
>
> This patch adds "info disk-deadlines" qemu-monitor option that prints
> dump of all disk requests which caused a disk deadline in Guest OS
> from the very start of Virtual Machine:
>
> disk_id type size total_time start_time
> .--------------------------------------------------------
> ide0-hd1 FLUSH 0b 46.403s 22232930059574ns
> ide0-hd1 FLUSH 0b 57.591s 22451499241285ns
> ide0-hd1 FLUSH 0b 103.482s 22574100547397ns
>
> Signed-off-by: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Markus Armbruster <armbru@redhat.com>
> CC: Luiz Capitulino <lcapitulino@redhat.com>
> ---
qapi interface review only:
> +++ b/qapi-schema.json
> @@ -3808,3 +3808,36 @@
>
> # Rocker ethernet network switch
> { 'include': 'qapi/rocker.json' }
> +
> +## @DiskDeadlinesInfo
> +#
> +# Contains info about late requests which caused VM stopping
> +#
> +# @disk-id: name of disk (unique for each disk)
Mark this with '#optional', and maybe describe why it would be missing.
Does this correspond to the BDS node name where the deadline expired,
in which case 'node' might be a nicer name than 'disk-id'?
> +#
> +# @type: type of request could be READ, WRITE or FLUSH
Likewise for using #optional. Please make this an enum type, not an
open-coded string.
> +#
> +# @size: size in bytes
of the failed request? Should you also mention which offset the failed
request started at?
> +#
> +# @total-time-ns: total time of request execution
> +#
> +# @start-time-ns: indicates the start of request execution
> +#
> +# Since: 2.5
> +##
> +{ 'struct': 'DiskDeadlinesInfo',
> + 'data' : { '*disk-id': 'str',
> + '*type': 'str',
> + 'size': 'uint64',
> + 'total-time-ns': 'uint64',
> + 'start-time-ns': 'uint64' } }
> +##
> +# @query-disk-deadlines:
> +#
> +# Returns information about last late disk requests.
> +#
> +# Returns: a list of @DiskDeadlinesInfo
> +#
> +# Since: 2.5
> +##
> +{ 'command': 'query-disk-deadlines', 'returns': ['DiskDeadlinesInfo'] }
Should it be possible to filter to deadlines missed for a specific node,
by having an argument with an optional node name?
Should any of the existing query-block or similar commands be modified
to make it obvious that there are missed deadline stats, and that it
would be useful to call query-disk-deadlines to learn more about them?
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Qemu-devel] [PATCH 5/5] disk_deadlines: add info disk-deadlines option
2015-09-08 16:20 ` Eric Blake
@ 2015-09-08 16:26 ` Eric Blake
2015-09-10 18:53 ` Denis V. Lunev
2015-09-10 19:13 ` Denis V. Lunev
1 sibling, 1 reply; 48+ messages in thread
From: Eric Blake @ 2015-09-08 16:26 UTC (permalink / raw)
To: Denis V. Lunev
Cc: Kevin Wolf, qemu-devel, Markus Armbruster, Raushaniya Maksudova,
Luiz Capitulino, Stefan Hajnoczi
On 09/08/2015 10:20 AM, Eric Blake wrote:
> On 09/08/2015 02:00 AM, Denis V. Lunev wrote:
>> From: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
>>
>> This patch adds "info disk-deadlines" qemu-monitor option that prints
>> dump of all disk requests which caused a disk deadline in Guest OS
>> from the very start of Virtual Machine:
>>
>
> qapi interface review only:
>
> Should it be possible to filter to deadlines missed for a specific node,
> by having an arguments with an optional node name?
>
> Should any of the existing query-block or similar commands be modified
> to make it obvious that there are missed deadline stats, and that it
> would be useful to call query-disk-deadlines to learn more about them?
Also, should there be an event raised when a timeout occurs, so that
management doesn't have to poll this API?
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Qemu-devel] [PATCH RFC 0/5] disk deadlines
2015-09-08 8:00 [Qemu-devel] [PATCH RFC 0/5] disk deadlines Denis V. Lunev
` (7 preceding siblings ...)
2015-09-08 9:33 ` Paolo Bonzini
@ 2015-09-08 19:11 ` John Snow
2015-09-10 19:29 ` [Qemu-devel] Summary: " Denis V. Lunev
9 siblings, 0 replies; 48+ messages in thread
From: John Snow @ 2015-09-08 19:11 UTC (permalink / raw)
To: Denis V. Lunev
Cc: Kevin Wolf, qemu-devel, Stefan Hajnoczi, Raushaniya Maksudova
On 09/08/2015 04:00 AM, Denis V. Lunev wrote:
> Description of the problem:
> Client and server interacts via Network File System (NFS) or using other
> network storage like CEPH. The server contains an image of the Virtual
> Machine (VM) with Linux inside. The disk is exposed as SATA or IDE
> to VM. VM is started on the client as usual. In the case of network shortage
> requests from the virtial disk can not be completed in predictable time.
> If this request is f.e. ext3/4 journal write then the guest will reset
> the controller and restart the request for the first time. On next such
> event the guest will remount victim filesystem read-only. From the
> end-user point of view this will look like a fatal crash with a manual
> reboot required.
>
> To avoid such situation this patchset introduces patch per-drive option
> "disk-deadlines=on|off" which is unset by default. All disk requests
> will become tracked if the option is enabled. If requests are not completed
> in time some countermeasures applied (see below). The timeout could be
> configured, default one is chosen by observations.
>
> Test description that let reproduce the problem:
> 1) configure and start NFS server:
> $sudo /etc/init.d/nfs-kernel-server restart
> 2) put Virtial Machine image with preinstalled Operating System on the server
> 3) on the client mount server folder that contains Virtial Machine image:
> $sudo mount -t nfs -O uid=1000,iocharset=utf-8 server_ip:/path/to/folder/on/
> server /path/to/folder/on/client
> 4) start Virtual Machine with QEMU on the client (for example):
> $qemu-system-x86_64 -enable-kvm -vga std -balloon virtio -monitor stdio
> -drive file=/path/to/folder/on/client/vdisk.img,media=disk,if=ide,disk-deadlines=on
> -boot d -m 12288
> 5) inside of VM rum the following command:
> $dd if=/dev/urandom of=testfile bs=10M count=300
> AND stop the server (or disconnect network) by running:
> $sudo /etc/init.d/nfs-kernel-server stop
> 6) inside of VM periodically run:
> $dmesg
> and check error messages.
>
> One can get one of the error messages (just the main lines):
> 1) After server restarting Guest OS continues run as usual with
> the following messages in dmesg:
> a) [ 1108.131474] nfs: server 10.30.23.163 not responding, still trying
> [ 1203.164903] INFO: task qemu-system-x86:3256 blocked for more
> than 120 seconds
>
> b) [ 581.184311] ata1.00: qc timeout (cmd 0xe7)
> [ 581.184321] ata1.00: FLUSH failed Emask 0x4
> [ 581.744271] ata1: soft resetting link
> [ 581.900346] ata1.01: NODEV after polling detection
> [ 581.900877] ata1.00: configured for MWDMA2
> [ 581.900879] ata1.00: retrying FLUSH 0xe7 Emask 0x4
> [ 581.901203] ata1.00: device reported invalid CHS sector 0
> [ 581.901213] ata1: EH complete
> 2) Guest OS remounts its Filesystem as read-only:
> "remounting filesystem read-only"
> 3) Guest OS does not respond at all even after server restart
>
> Tested on:
> Virtual Machine - Linux 3.11.0 SMP x86_64 Ubuntu 13.10 saucy;
> client - Linux 3.11.10 SMP x86_64, Ubuntu 13.10 saucy;
> server - Linux 3.13.0 SMP x86_64, Ubuntu 14.04.1 LTS.
>
> How the given solution works?
>
> If disk-deadlines option is enabled for a drive, one controls time completion
> of this drive's requests. The method is as follows (further assume that this
> option is enabled).
>
> Every drive has its own red-black tree for keeping its requests.
> Expiration time of the request is a key, cookie (as id of request) is an
> appropriate node. Assume that every requests has 8 seconds to be completed.
> If request was not accomplished in time for some reasons (server crash or smth
> else), timer of this drive is fired and an appropriate callback requests to
> stop Virtial Machine (VM).
>
This sounds like an appropriate tool to have in the QEMU toolbox! We
certainly want to be able to control whether the guest sees a "hardware
failure", much in the same way we can already prevent it from seeing
disk-full and similar errors with rerror/werror.
The timeout idea seems quite welcome.
Thanks,
--js
> VM remains stopped until all requests from the disk which caused VM's stopping
> are completed. Furthermore, if there is another disks with 'disk-deadlines=on'
> whose requests are waiting to be completed, do not start VM : wait completion
> of all "late" requests from all disks.
>
> Furthermore, all requests which caused VM stopping (or those that just were not
> completed in time) could be printed using "info disk-deadlines" qemu monitor
> option as follows:
> $(qemu) info disk-deadlines
>
> disk_id type size total_time start_time
> .--------------------------------------------------------
> ide0-hd1 FLUSH 0b 46.403s 22232930059574ns
> ide0-hd1 FLUSH 0b 57.591s 22451499241285ns
> ide0-hd1 FLUSH 0b 103.482s 22574100547397ns
>
> This set is sent in the hope that it might be useful.
>
> Signed-off-by: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
>
> Raushaniya Maksudova (5):
> add QEMU style defines for __sync_add_and_fetch
> disk_deadlines: add request to resume Virtual Machine
> disk_deadlines: add disk-deadlines option per drive
> disk_deadlines: add control of requests time expiration
> disk_deadlines: add info disk-deadlines option
>
> block/Makefile.objs | 1 +
> block/accounting.c | 8 ++
> block/disk-deadlines.c | 280 +++++++++++++++++++++++++++++++++++++++++
> blockdev.c | 20 +++
> hmp.c | 37 ++++++
> hmp.h | 1 +
> include/block/accounting.h | 2 +
> include/block/disk-deadlines.h | 48 +++++++
> include/qemu/atomic.h | 3 +
> include/sysemu/sysemu.h | 1 +
> monitor.c | 7 ++
> qapi-schema.json | 33 +++++
> stubs/vm-stop.c | 5 +
> vl.c | 18 +++
> 14 files changed, 464 insertions(+)
> create mode 100644 block/disk-deadlines.c
> create mode 100644 include/block/disk-deadlines.h
>
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Qemu-devel] [PATCH 1/5] add QEMU style defines for __sync_add_and_fetch
2015-09-08 8:00 ` [Qemu-devel] [PATCH 1/5] add QEMU style defines for __sync_add_and_fetch Denis V. Lunev
@ 2015-09-10 8:19 ` Stefan Hajnoczi
0 siblings, 0 replies; 48+ messages in thread
From: Stefan Hajnoczi @ 2015-09-10 8:19 UTC (permalink / raw)
To: Denis V. Lunev
Cc: Kevin Wolf, Paolo Bonzini, Stefan Hajnoczi, qemu-devel,
Raushaniya Maksudova
On Tue, Sep 08, 2015 at 11:00:24AM +0300, Denis V. Lunev wrote:
> From: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
>
> Signed-off-by: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Paolo Bonzini <pbonzini@redhat.com>
> ---
> include/qemu/atomic.h | 3 +++
> 1 file changed, 3 insertions(+)
Reviewed-by: Stefan Hajnoczi <stefanha@redhat.com>
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Qemu-devel] [PATCH 2/5] disk_deadlines: add request to resume Virtual Machine
2015-09-08 8:00 ` [Qemu-devel] [PATCH 2/5] disk_deadlines: add request to resume Virtual Machine Denis V. Lunev
@ 2015-09-10 8:51 ` Stefan Hajnoczi
2015-09-10 19:18 ` Denis V. Lunev
0 siblings, 1 reply; 48+ messages in thread
From: Stefan Hajnoczi @ 2015-09-10 8:51 UTC (permalink / raw)
To: Denis V. Lunev
Cc: Kevin Wolf, Paolo Bonzini, Stefan Hajnoczi, qemu-devel,
Raushaniya Maksudova
On Tue, Sep 08, 2015 at 11:00:25AM +0300, Denis V. Lunev wrote:
> From: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
>
> In some cases one needs to pause and resume a Virtual Machine from inside
> of Qemu. Currently there are request functions to pause VM (vmstop), but
> there are no respective ones to resume VM.
>
> Signed-off-by: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
> CC: Paolo Bonzini <pbonzini@redhat.com>
> ---
> include/sysemu/sysemu.h | 1 +
> stubs/vm-stop.c | 5 +++++
> vl.c | 18 ++++++++++++++++++
> 3 files changed, 24 insertions(+)
Why can't vm_start() be used?
> diff --git a/include/sysemu/sysemu.h b/include/sysemu/sysemu.h
> index 44570d1..a382ae1 100644
> --- a/include/sysemu/sysemu.h
> +++ b/include/sysemu/sysemu.h
> @@ -62,6 +62,7 @@ void qemu_system_shutdown_request(void);
> void qemu_system_powerdown_request(void);
> void qemu_register_powerdown_notifier(Notifier *notifier);
> void qemu_system_debug_request(void);
> +void qemu_system_vmstart_request(void);
> void qemu_system_vmstop_request(RunState reason);
> void qemu_system_vmstop_request_prepare(void);
> int qemu_shutdown_requested_get(void);
> diff --git a/stubs/vm-stop.c b/stubs/vm-stop.c
> index 69fd86b..c8e2cdd 100644
> --- a/stubs/vm-stop.c
> +++ b/stubs/vm-stop.c
> @@ -10,3 +10,8 @@ void qemu_system_vmstop_request(RunState state)
> {
> abort();
> }
> +
> +void qemu_system_vmstart_request(void)
> +{
> + abort();
> +}
> diff --git a/vl.c b/vl.c
> index 584ca88..63f10d3 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -563,6 +563,7 @@ static RunState current_run_state = RUN_STATE_PRELAUNCH;
> /* We use RUN_STATE_MAX but any invalid value will do */
> static RunState vmstop_requested = RUN_STATE_MAX;
> static QemuMutex vmstop_lock;
> +static bool vmstart_requested;
>
> typedef struct {
> RunState from;
> @@ -723,6 +724,19 @@ void qemu_system_vmstop_request(RunState state)
> qemu_notify_event();
> }
>
> +static bool qemu_vmstart_requested(void)
> +{
> + bool r = vmstart_requested;
> + vmstart_requested = false;
> + return r;
> +}
> +
> +void qemu_system_vmstart_request(void)
> +{
> + vmstart_requested = true;
> + qemu_notify_event();
> +}
> +
> void vm_start(void)
> {
> RunState requested;
> @@ -1884,6 +1898,10 @@ static bool main_loop_should_exit(void)
> if (qemu_vmstop_requested(&r)) {
> vm_stop(r);
> }
> + if (qemu_vmstart_requested()) {
> + vm_start();
> + }
> +
> return false;
> }
>
> --
> 2.1.4
>
>
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Qemu-devel] [PATCH 3/5] disk_deadlines: add disk-deadlines option per drive
2015-09-08 8:00 ` [Qemu-devel] [PATCH 3/5] disk_deadlines: add disk-deadlines option per drive Denis V. Lunev
@ 2015-09-10 9:05 ` Stefan Hajnoczi
0 siblings, 0 replies; 48+ messages in thread
From: Stefan Hajnoczi @ 2015-09-10 9:05 UTC (permalink / raw)
To: Denis V. Lunev
Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, Raushaniya Maksudova,
Markus Armbruster
On Tue, Sep 08, 2015 at 11:00:26AM +0300, Denis V. Lunev wrote:
> diff --git a/blockdev.c b/blockdev.c
> index 6b48be6..6cd9c6e 100644
> --- a/blockdev.c
> +++ b/blockdev.c
> @@ -361,6 +361,7 @@ static BlockBackend *blockdev_init(const char *file, QDict *bs_opts,
> ThrottleConfig cfg;
> int snapshot = 0;
> bool copy_on_read;
> + bool disk_deadlines;
> Error *error = NULL;
> QemuOpts *opts;
> const char *id;
> @@ -394,6 +395,11 @@ static BlockBackend *blockdev_init(const char *file, QDict *bs_opts,
> ro = qemu_opt_get_bool(opts, "read-only", 0);
> copy_on_read = qemu_opt_get_bool(opts, "copy-on-read", false);
>
> + disk_deadlines = qdict_get_try_bool(bs_opts, "disk-deadlines", false);
> + if (disk_deadlines) {
> + qdict_del(bs_opts, "disk-deadlines");
qdict_del() should be unconditional so that -drive disk-deadlines=off
works. qdict_del() is a nop if the key cannot be found in the dict, so
it is always safe to call it.
> + }
> +
> if ((buf = qemu_opt_get(opts, "discard")) != NULL) {
> if (bdrv_parse_discard_flags(buf, &bdrv_flags) != 0) {
> error_setg(errp, "invalid discard option");
> @@ -555,6 +561,8 @@ static BlockBackend *blockdev_init(const char *file, QDict *bs_opts,
>
> bs->detect_zeroes = detect_zeroes;
>
> + disk_deadlines_init(&bs->stats.disk_deadlines, disk_deadlines);
> +
> bdrv_set_on_error(bs, on_read_error, on_write_error);
>
> /* disk I/O throttling */
> @@ -658,6 +666,10 @@ QemuOptsList qemu_legacy_drive_opts = {
> .name = "file",
> .type = QEMU_OPT_STRING,
> .help = "file name",
> + },{
> + .name = "disk-deadlines",
> + .type = QEMU_OPT_BOOL,
> + .help = "control of disk requests' time execution",
It would be nice to mention that the guest will be paused:
"pause guest if disk request timeout expires"
> diff --git a/include/block/accounting.h b/include/block/accounting.h
> index 4c406cf..4e2b345 100644
> --- a/include/block/accounting.h
> +++ b/include/block/accounting.h
> @@ -27,6 +27,7 @@
> #include <stdint.h>
>
> #include "qemu/typedefs.h"
> +#include "block/disk-deadlines.h"
>
> enum BlockAcctType {
> BLOCK_ACCT_READ,
> @@ -41,6 +42,7 @@ typedef struct BlockAcctStats {
> uint64_t total_time_ns[BLOCK_MAX_IOTYPE];
> uint64_t merged[BLOCK_MAX_IOTYPE];
> uint64_t wr_highest_sector;
> + DiskDeadlines disk_deadlines;
I'm not sure that BlockAcctStats is the most appropriate place for
DiskDeadlines. BlockAcctStats holds accounting information which can be
queried on the QEMU monitor. It is for reporting disk statistics.
Please add the DiskDeadlines field to BlockDriverState instead.
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Qemu-devel] [PATCH 4/5] disk_deadlines: add control of requests time expiration
2015-09-08 14:48 ` Kevin Wolf
@ 2015-09-10 10:27 ` Stefan Hajnoczi
2015-09-10 11:39 ` Kevin Wolf
2015-09-25 12:34 ` Dr. David Alan Gilbert
0 siblings, 2 replies; 48+ messages in thread
From: Stefan Hajnoczi @ 2015-09-10 10:27 UTC (permalink / raw)
To: Kevin Wolf
Cc: Denis V. Lunev, qemu-devel, Stefan Hajnoczi, Raushaniya Maksudova
On Tue, Sep 08, 2015 at 04:48:24PM +0200, Kevin Wolf wrote:
> Am 08.09.2015 um 16:23 hat Denis V. Lunev geschrieben:
> > On 09/08/2015 04:05 PM, Kevin Wolf wrote:
> > >Am 08.09.2015 um 13:27 hat Denis V. Lunev geschrieben:
> > >>interesting point. Yes, it flushes all requests and most likely
> > >>hangs inside waiting requests to complete. But fortunately
> > >>this happens after the switch to paused state thus
> > >>the guest becomes paused. That's why I have missed this
> > >>fact.
> > >>
> > >>This (could) be considered as a problem but I have no (good)
> > >>solution at the moment. Should think a bit on.
> > >Let me suggest a radically different design. Note that I don't say this
> > >is necessarily how things should be done, I'm just trying to introduce
> > >some new ideas and broaden the discussion, so that we have a larger set
> > >of ideas from which we can pick the right solution(s).
> > >
> > >The core of my idea would be a new filter block driver 'timeout' that
> > >can be added on top of each BDS that could potentially fail, like a
> > >raw-posix BDS pointing to a file on NFS. This way most pieces of the
> > >solution are nicely modularised and don't touch the block layer core.
> > >
> > >During normal operation the driver would just be passing through
> > >requests to the lower layer. When it detects a timeout, however, it
> > >completes the request it received with -ETIMEDOUT. It also completes any
> > >new request it receives with -ETIMEDOUT without passing the request on
> > >until the request that originally timed out returns. This is our safety
> > >measure against anyone seeing whether or how the timed out request
> > >modified data.
> > >
> > >We need to make sure that bdrv_drain() doesn't wait for this request.
> > >Possibly we need to introduce a .bdrv_drain callback that replaces the
> > >default handling, because bdrv_requests_pending() in the default
> > >handling considers bs->file, which would still have the timed out
> > >request. We don't want to see this; bdrv_drain_all() should complete
> > >even though that request is still pending internally (externally, we
> > >returned -ETIMEDOUT, so we can consider it completed). This way the
> > >monitor stays responsive and background jobs can go on if they don't use
> > >the failing block device.
> > >
> > >And then we essentially reuse the rerror/werror mechanism that we
> > >already have to stop the VM. The device models would be extended to
> > >always stop the VM on -ETIMEDOUT, regardless of the error policy. In
> > >this state, the VM would even be migratable if you make sure that the
> > >pending request can't modify the image on the destination host any more.
> > >
> > >Do you think this could work, or did I miss something important?
> > >
> > >Kevin
> > could I propose even more radical solution then?
> >
> > My original approach was based on the fact that
> > this could should be maintainable out-of-stream.
> > If the patch will be merged - this boundary condition
> > could be dropped.
> >
> > Why not to invent 'terror' field on BdrvOptions
> > and process things in core block layer without
> > a filter? RB Tree entry will just not created if
> > the policy will be set to 'ignore'.
>
> 'terror' might not be the most fortunate name... ;-)
>
> The reason why I would prefer a filter driver is so the code and the
> associated data structures are cleanly modularised and we can keep the
> actual block layer core small and clean. The same is true for some other
> functions that I would rather move out of the core into filter drivers
> than add new cases (e.g. I/O throttling, backup notifiers, etc.), but
> which are a bit harder to actually move because we already have old
> interfaces that we can't break (we'll probably do it anyway eventually,
> even if it needs a bit more compatibility code).
>
> However, it seems that you are mostly touching code that is maintained
> by Stefan, and Stefan used to be a bit more open to adding functionality
> to the core, so my opinion might not be the last word.
I've been thinking more about the correctness of this feature:
QEMU cannot cancel I/O because there is no Linux userspace API for doing
so. Linux AIO's io_cancel(2) syscall is a nop since file systems don't
implement a kiocb_cancel_fn. Sending a signal to a task blocked in
O_DIRECT preadv(2)/pwritev(2) doesn't work either because the task is in
uninterruptible sleep.
The only way to make sure a request has finished is to wait for
completion. If we treat a request as failed/cancelled but it's actually
still pending at a layer of the storage stack:
1. Read requests may modify guest memory.
2. Write requests may modify disk sectors.
Today the guest times out and tries to do IDE/ATA recovery, for example.
This causes QEMU to eventually call the synchronous bdrv_drain_all()
function and the guest hangs. Also, if the guest mounts the file system
read-only in response to the timeout, then game over.
The disk-deadlines feature lets QEMU detect timeouts before the guest so
we can pause the guest. The part I have been thinking about is that the
only option is to wait until the request completes.
We cannot abandon the timed out request because we'll face #1 or #2
above. This means it doesn't make sense to retry the request like
rerror=/werror=. rerror=/werror= can retry safely because the original
request has failed but that is not the case for timed out requests.
This also means that live migration isn't safe, at least if a write
request is pending. If the guest migrates, the pending write request on
the source host could still complete after live migration handover,
corrupting the disk.
Getting back to these patches: I think the implementation is correct in
that the only policy is to wait for timed out requests to complete and
then resume the guest.
However, these patches need to violate the constraint that guest memory
isn't dirtied when the guest is paused. This is an important constraint
for the correctness of live migration, since we need to be able to track
all changes to guest memory.
Just wanted to post this in case anyone disagrees.
Stefan
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Qemu-devel] [PATCH 4/5] disk_deadlines: add control of requests time expiration
2015-09-10 10:27 ` Stefan Hajnoczi
@ 2015-09-10 11:39 ` Kevin Wolf
2015-09-14 16:53 ` Stefan Hajnoczi
2015-09-25 12:34 ` Dr. David Alan Gilbert
1 sibling, 1 reply; 48+ messages in thread
From: Kevin Wolf @ 2015-09-10 11:39 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: Denis V. Lunev, qemu-devel, Stefan Hajnoczi, Raushaniya Maksudova
Am 10.09.2015 um 12:27 hat Stefan Hajnoczi geschrieben:
> On Tue, Sep 08, 2015 at 04:48:24PM +0200, Kevin Wolf wrote:
> > Am 08.09.2015 um 16:23 hat Denis V. Lunev geschrieben:
> > > On 09/08/2015 04:05 PM, Kevin Wolf wrote:
> > > >Am 08.09.2015 um 13:27 hat Denis V. Lunev geschrieben:
> > > >>interesting point. Yes, it flushes all requests and most likely
> > > >>hangs inside waiting requests to complete. But fortunately
> > > >>this happens after the switch to paused state thus
> > > >>the guest becomes paused. That's why I have missed this
> > > >>fact.
> > > >>
> > > >>This (could) be considered as a problem but I have no (good)
> > > >>solution at the moment. Should think a bit on.
> > > >Let me suggest a radically different design. Note that I don't say this
> > > >is necessarily how things should be done, I'm just trying to introduce
> > > >some new ideas and broaden the discussion, so that we have a larger set
> > > >of ideas from which we can pick the right solution(s).
> > > >
> > > >The core of my idea would be a new filter block driver 'timeout' that
> > > >can be added on top of each BDS that could potentially fail, like a
> > > >raw-posix BDS pointing to a file on NFS. This way most pieces of the
> > > >solution are nicely modularised and don't touch the block layer core.
> > > >
> > > >During normal operation the driver would just be passing through
> > > >requests to the lower layer. When it detects a timeout, however, it
> > > >completes the request it received with -ETIMEDOUT. It also completes any
> > > >new request it receives with -ETIMEDOUT without passing the request on
> > > >until the request that originally timed out returns. This is our safety
> > > >measure against anyone seeing whether or how the timed out request
> > > >modified data.
> > > >
> > > >We need to make sure that bdrv_drain() doesn't wait for this request.
> > > >Possibly we need to introduce a .bdrv_drain callback that replaces the
> > > >default handling, because bdrv_requests_pending() in the default
> > > >handling considers bs->file, which would still have the timed out
> > > >request. We don't want to see this; bdrv_drain_all() should complete
> > > >even though that request is still pending internally (externally, we
> > > >returned -ETIMEDOUT, so we can consider it completed). This way the
> > > >monitor stays responsive and background jobs can go on if they don't use
> > > >the failing block device.
> > > >
> > > >And then we essentially reuse the rerror/werror mechanism that we
> > > >already have to stop the VM. The device models would be extended to
> > > >always stop the VM on -ETIMEDOUT, regardless of the error policy. In
> > > >this state, the VM would even be migratable if you make sure that the
> > > >pending request can't modify the image on the destination host any more.
> > > >
> > > >Do you think this could work, or did I miss something important?
> > > >
> > > >Kevin
> > > could I propose even more radical solution then?
> > >
> > > My original approach was based on the fact that
> > > this could should be maintainable out-of-stream.
> > > If the patch will be merged - this boundary condition
> > > could be dropped.
> > >
> > > Why not to invent 'terror' field on BdrvOptions
> > > and process things in core block layer without
> > > a filter? RB Tree entry will just not created if
> > > the policy will be set to 'ignore'.
> >
> > 'terror' might not be the most fortunate name... ;-)
> >
> > The reason why I would prefer a filter driver is so the code and the
> > associated data structures are cleanly modularised and we can keep the
> > actual block layer core small and clean. The same is true for some other
> > functions that I would rather move out of the core into filter drivers
> > than add new cases (e.g. I/O throttling, backup notifiers, etc.), but
> > which are a bit harder to actually move because we already have old
> > interfaces that we can't break (we'll probably do it anyway eventually,
> > even if it needs a bit more compatibility code).
> >
> > However, it seems that you are mostly touching code that is maintained
> > by Stefan, and Stefan used to be a bit more open to adding functionality
> > to the core, so my opinion might not be the last word.
>
> I've been thinking more about the correctness of this feature:
>
> QEMU cannot cancel I/O because there is no Linux userspace API for doing
> so. Linux AIO's io_cancel(2) syscall is a nop since file systems don't
> implement a kiocb_cancel_fn. Sending a signal to a task blocked in
> O_DIRECT preadv(2)/pwritev(2) doesn't work either because the task is in
> uninterruptible sleep.
>
> The only way to make sure a request has finished is to wait for
> completion. If we treat a request as failed/cancelled but it's actually
> still pending at a layer of the storage stack:
> 1. Read requests may modify guest memory.
> 2. Write requests may modify disk sectors.
>
> Today the guest times out and tries to do IDE/ATA recovery, for example.
> This causes QEMU to eventually call the synchronous bdrv_drain_all()
> function and the guest hangs. Also, if the guest mounts the file system
> read-only in response to the timeout, then game over.
>
> The disk-deadlines feature lets QEMU detect timeouts before the guest so
> we can pause the guest. The part I have been thinking about is that the
> only option is to wait until the request completes.
>
> We cannot abandon the timed out request because we'll face #1 or #2
> above. This means it doesn't make sense to retry the request like
> rerror=/werror=. rerror=/werror= can retry safely because the original
> request has failed but that is not the case for timed out requests.
>
> This also means that live migration isn't safe, at least if a write
> request is pending. If the guest migrates, the pending write request on
> the source host could still complete after live migration handover,
> corrupting the disk.
>
> Getting back to these patches: I think the implementation is correct in
> that the only policy is to wait for timed out requests to complete and
> then resume the guest.
>
> However, these patches need to violate the constraint that guest memory
> isn't dirtied when the guest is paused. This is an important constraint
> for the correctness of live migration, since we need to be able to track
> all changes to guest memory.
>
> Just wanted to post this in case anyone disagrees.
You're making a few good points here.
I thought that migration with a pending write request could be safe given
some additional knowledge: if you know that the write is hanging because
the connection to the NFS server is down, and you make sure that it
remains disconnected, that would work. However, the hanging request is
already in the kernel, so you could never bring the connection up again
without rebooting the host, which is clearly not a realistic assumption.
I hadn't thought of the live migration constraints either, so it seems
read requests are equally problematic.
So it appears that the filter driver would have to add a migration
blocker whenever it sees any request time out, and only clear it again
when all pending requests have completed.
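A rough sketch of what that could look like in the filter driver (illustrative
only: TimeoutFilterState is a made-up state struct for the hypothetical
'timeout' driver, and the migrate_add_blocker()/migrate_del_blocker()
prototypes are assumed to be the current void ones taking an Error *):

    #include "qapi/error.h"
    #include "migration/migration.h"

    typedef struct TimeoutFilterState {
        unsigned timed_out_requests;  /* requests that hit their deadline */
        Error *migration_blocker;     /* set while any of them is still pending */
    } TimeoutFilterState;

    static void timeout_request_timed_out(TimeoutFilterState *s)
    {
        if (s->timed_out_requests++ == 0) {
            error_setg(&s->migration_blocker,
                       "disk request timed out, on-disk state is unknown");
            migrate_add_blocker(s->migration_blocker);
        }
    }

    /* called when a request that previously timed out finally completes */
    static void timeout_request_completed_late(TimeoutFilterState *s)
    {
        if (--s->timed_out_requests == 0) {
            migrate_del_blocker(s->migration_blocker);
            error_free(s->migration_blocker);
            s->migration_blocker = NULL;
        }
    }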
Kevin
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Qemu-devel] [PATCH 5/5] disk_deadlines: add info disk-deadlines option
2015-09-08 16:26 ` Eric Blake
@ 2015-09-10 18:53 ` Denis V. Lunev
0 siblings, 0 replies; 48+ messages in thread
From: Denis V. Lunev @ 2015-09-10 18:53 UTC (permalink / raw)
To: Eric Blake
Cc: Kevin Wolf, qemu-devel, Markus Armbruster, Raushaniya Maksudova,
Luiz Capitulino, Stefan Hajnoczi
On 09/08/2015 07:26 PM, Eric Blake wrote:
> On 09/08/2015 10:20 AM, Eric Blake wrote:
>> On 09/08/2015 02:00 AM, Denis V. Lunev wrote:
>>> From: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
>>>
>>> This patch adds "info disk-deadlines" qemu-monitor option that prints
>>> dump of all disk requests which caused a disk deadline in Guest OS
>>> from the very start of Virtual Machine:
>>>
>> qapi interface review only:
>>
>> Should it be possible to filter to deadlines missed for a specific node,
>> by having an arguments with an optional node name?
>>
>> Should any of the existing query-block or similar commands be modified
>> to make it obvious that there are missed deadline stats, and that it
>> would be useful to call query-disk-deadlines to learn more about them?
> Also, should there be an event raised when a timeout occurs, so that
> management doesn't have to poll this API?
>
ok, this seems a nice addition
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Qemu-devel] [PATCH 5/5] disk_deadlines: add info disk-deadlines option
2015-09-08 16:20 ` Eric Blake
2015-09-08 16:26 ` Eric Blake
@ 2015-09-10 19:13 ` Denis V. Lunev
1 sibling, 0 replies; 48+ messages in thread
From: Denis V. Lunev @ 2015-09-10 19:13 UTC (permalink / raw)
To: Eric Blake
Cc: Kevin Wolf, Markus Armbruster, qemu-devel, Raushaniya Maksudova,
Luiz Capitulino, Stefan Hajnoczi
On 09/08/2015 07:20 PM, Eric Blake wrote:
> On 09/08/2015 02:00 AM, Denis V. Lunev wrote:
>> From: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
>>
>> This patch adds "info disk-deadlines" qemu-monitor option that prints
>> dump of all disk requests which caused a disk deadline in Guest OS
>> from the very start of Virtual Machine:
>>
>> disk_id type size total_time start_time
>> .--------------------------------------------------------
>> ide0-hd1 FLUSH 0b 46.403s 22232930059574ns
>> ide0-hd1 FLUSH 0b 57.591s 22451499241285ns
>> ide0-hd1 FLUSH 0b 103.482s 22574100547397ns
>>
>> Signed-off-by: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Markus Armbruster <armbru@redhat.com>
>> CC: Luiz Capitulino <lcapitulino@redhat.com>
>> ---
> qapi interface review only:
>
>
>> +++ b/qapi-schema.json
>> @@ -3808,3 +3808,36 @@
>>
>> # Rocker ethernet network switch
>> { 'include': 'qapi/rocker.json' }
>> +
>> +## @DiskDeadlinesInfo
>> +#
>> +# Contains info about late requests which caused VM stopping
>> +#
>> +# @disk-id: name of disk (unique for each disk)
> Mark this with '#optional', and maybe describe why it would be missing.
> Does this correspond to the BDS node name where the deadline expired,
> in which case 'node' might be a nicer name than 'disk-id'?
As far as I understand the code, this is not a BDS node name. The name is
bound to the name of a hardware device, under which we can have several
block drivers at the moment. There is no query by this name yet, and the
identifier collected is good enough for us for now to understand and
debug the code.
The exact name is still to be defined, though. Originally I wanted to bind
it to the name of the device the deadlines are attached to, since it would
not be good to calculate them in each BDS.
Anyway, the exact meaning of this 'id' will be defined once we decide on
the proper attach point, whether that is generic block code, a filter
driver, or something else.
Does that sound good to you? All your suggestions are welcome.
>> +#
>> +# @type: type of request could be READ, WRITE or FLUSH
> Likewise for using #optional. Please make this an enum type, not an
> open-coded string.
ok
>> +#
>> +# @size: size in bytes
> of the failed request? Should you also mention which offset the failed
> request started at?
I'll add this. This info is not accessible in the stats, which is why it
was not added. If the code becomes part of the block layer or lives in the
filter driver, that will not be a problem.
>> +#
>> +# @total-time-ns: total time of request execution
>> +#
>> +# @start-time-ns: indicates the start of request execution
>> +#
>> +# Since: 2.5
>> +##
>> +{ 'struct': 'DiskDeadlinesInfo',
>> + 'data' : { '*disk-id': 'str',
>> + '*type': 'str',
>> + 'size': 'uint64',
>> + 'total-time-ns': 'uint64',
>> + 'start-time-ns': 'uint64' } }
>> +##
>> +# @query-disk-deadlines:
>> +#
>> +# Returns information about last late disk requests.
>> +#
>> +# Returns: a list of @DiskDeadlinesInfo
>> +#
>> +# Since: 2.5
>> +##
>> +{ 'command': 'query-disk-deadlines', 'returns': ['DiskDeadlinesInfo'] }
> Should it be possible to filter to deadlines missed for a specific node,
> by having an arguments with an optional node name?
ok, this seems quite reasonable.
> Should any of the existing query-block or similar commands be modified
> to make it obvious that there are missed deadline stats, and that it
> would be useful to call query-disk-deadlines to learn more about them?
>
What do you think about also providing the list of pending requests using a
similar API? Would that be useful for others?
Den
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Qemu-devel] [PATCH 2/5] disk_deadlines: add request to resume Virtual Machine
2015-09-10 8:51 ` Stefan Hajnoczi
@ 2015-09-10 19:18 ` Denis V. Lunev
2015-09-14 16:46 ` Stefan Hajnoczi
0 siblings, 1 reply; 48+ messages in thread
From: Denis V. Lunev @ 2015-09-10 19:18 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: Kevin Wolf, Paolo Bonzini, Stefan Hajnoczi, qemu-devel,
Raushaniya Maksudova
On 09/10/2015 11:51 AM, Stefan Hajnoczi wrote:
> On Tue, Sep 08, 2015 at 11:00:25AM +0300, Denis V. Lunev wrote:
>> From: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
>>
>> In some cases one needs to pause and resume a Virtual Machine from inside
>> of Qemu. Currently there are request functions to pause VM (vmstop), but
>> there are no respective ones to resume VM.
>>
>> Signed-off-by: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
>> Signed-off-by: Denis V. Lunev <den@openvz.org>
>> CC: Stefan Hajnoczi <stefanha@redhat.com>
>> CC: Kevin Wolf <kwolf@redhat.com>
>> CC: Paolo Bonzini <pbonzini@redhat.com>
>> ---
>> include/sysemu/sysemu.h | 1 +
>> stubs/vm-stop.c | 5 +++++
>> vl.c | 18 ++++++++++++++++++
>> 3 files changed, 24 insertions(+)
> Why can't vm_start() be used?
>
We are worried about which thread is the correct one to perform this
operation. This code eventually redirects the state change into the
main event loop.
^ permalink raw reply [flat|nested] 48+ messages in thread
* [Qemu-devel] Summary: [PATCH RFC 0/5] disk deadlines
2015-09-08 8:00 [Qemu-devel] [PATCH RFC 0/5] disk deadlines Denis V. Lunev
` (8 preceding siblings ...)
2015-09-08 19:11 ` John Snow
@ 2015-09-10 19:29 ` Denis V. Lunev
9 siblings, 0 replies; 48+ messages in thread
From: Denis V. Lunev @ 2015-09-10 19:29 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: Kevin Wolf, qemu-devel, Raushaniya Maksudova
On 09/08/2015 11:00 AM, Denis V. Lunev wrote:
> Description of the problem:
> Client and server interacts via Network File System (NFS) or using other
> network storage like CEPH. The server contains an image of the Virtual
> Machine (VM) with Linux inside. The disk is exposed as SATA or IDE
> to VM. VM is started on the client as usual. In the case of network shortage
> requests from the virtial disk can not be completed in predictable time.
> If this request is f.e. ext3/4 journal write then the guest will reset
> the controller and restart the request for the first time. On next such
> event the guest will remount victim filesystem read-only. From the
> end-user point of view this will look like a fatal crash with a manual
> reboot required.
>
> To avoid such situation this patchset introduces patch per-drive option
> "disk-deadlines=on|off" which is unset by default. All disk requests
> will become tracked if the option is enabled. If requests are not completed
> in time some countermeasures applied (see below). The timeout could be
> configured, default one is chosen by observations.
>
> Test description that let reproduce the problem:
> 1) configure and start NFS server:
> $sudo /etc/init.d/nfs-kernel-server restart
> 2) put Virtial Machine image with preinstalled Operating System on the server
> 3) on the client mount server folder that contains Virtial Machine image:
> $sudo mount -t nfs -O uid=1000,iocharset=utf-8 server_ip:/path/to/folder/on/
> server /path/to/folder/on/client
> 4) start Virtual Machine with QEMU on the client (for example):
> $qemu-system-x86_64 -enable-kvm -vga std -balloon virtio -monitor stdio
> -drive file=/path/to/folder/on/client/vdisk.img,media=disk,if=ide,disk-deadlines=on
> -boot d -m 12288
> 5) inside of VM rum the following command:
> $dd if=/dev/urandom of=testfile bs=10M count=300
> AND stop the server (or disconnect network) by running:
> $sudo /etc/init.d/nfs-kernel-server stop
> 6) inside of VM periodically run:
> $dmesg
> and check error messages.
>
> One can get one of the error messages (just the main lines):
> 1) After server restarting Guest OS continues run as usual with
> the following messages in dmesg:
> a) [ 1108.131474] nfs: server 10.30.23.163 not responding, still trying
> [ 1203.164903] INFO: task qemu-system-x86:3256 blocked for more
> than 120 seconds
>
> b) [ 581.184311] ata1.00: qc timeout (cmd 0xe7)
> [ 581.184321] ata1.00: FLUSH failed Emask 0x4
> [ 581.744271] ata1: soft resetting link
> [ 581.900346] ata1.01: NODEV after polling detection
> [ 581.900877] ata1.00: configured for MWDMA2
> [ 581.900879] ata1.00: retrying FLUSH 0xe7 Emask 0x4
> [ 581.901203] ata1.00: device reported invalid CHS sector 0
> [ 581.901213] ata1: EH complete
> 2) Guest OS remounts its Filesystem as read-only:
> "remounting filesystem read-only"
> 3) Guest OS does not respond at all even after server restart
>
> Tested on:
> Virtual Machine - Linux 3.11.0 SMP x86_64 Ubuntu 13.10 saucy;
> client - Linux 3.11.10 SMP x86_64, Ubuntu 13.10 saucy;
> server - Linux 3.13.0 SMP x86_64, Ubuntu 14.04.1 LTS.
>
> How the given solution works?
>
> If disk-deadlines option is enabled for a drive, one controls time completion
> of this drive's requests. The method is as follows (further assume that this
> option is enabled).
>
> Every drive has its own red-black tree for keeping its requests.
> Expiration time of the request is a key, cookie (as id of request) is an
> appropriate node. Assume that every requests has 8 seconds to be completed.
> If request was not accomplished in time for some reasons (server crash or smth
> else), timer of this drive is fired and an appropriate callback requests to
> stop Virtial Machine (VM).
>
> VM remains stopped until all requests from the disk which caused VM's stopping
> are completed. Furthermore, if there is another disks with 'disk-deadlines=on'
> whose requests are waiting to be completed, do not start VM : wait completion
> of all "late" requests from all disks.
>
> Furthermore, all requests which caused VM stopping (or those that just were not
> completed in time) could be printed using "info disk-deadlines" qemu monitor
> option as follows:
> $(qemu) info disk-deadlines
>
> disk_id type size total_time start_time
> .--------------------------------------------------------
> ide0-hd1 FLUSH 0b 46.403s 22232930059574ns
> ide0-hd1 FLUSH 0b 57.591s 22451499241285ns
> ide0-hd1 FLUSH 0b 103.482s 22574100547397ns
>
> This set is sent in the hope that it might be useful.
>
> Signed-off-by: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
> Signed-off-by: Denis V. Lunev <den@openvz.org>
> CC: Stefan Hajnoczi <stefanha@redhat.com>
> CC: Kevin Wolf <kwolf@redhat.com>
>
> Raushaniya Maksudova (5):
> add QEMU style defines for __sync_add_and_fetch
> disk_deadlines: add request to resume Virtual Machine
> disk_deadlines: add disk-deadlines option per drive
> disk_deadlines: add control of requests time expiration
> disk_deadlines: add info disk-deadlines option
>
> block/Makefile.objs | 1 +
> block/accounting.c | 8 ++
> block/disk-deadlines.c | 280 +++++++++++++++++++++++++++++++++++++++++
> blockdev.c | 20 +++
> hmp.c | 37 ++++++
> hmp.h | 1 +
> include/block/accounting.h | 2 +
> include/block/disk-deadlines.h | 48 +++++++
> include/qemu/atomic.h | 3 +
> include/sysemu/sysemu.h | 1 +
> monitor.c | 7 ++
> qapi-schema.json | 33 +++++
> stubs/vm-stop.c | 5 +
> vl.c | 18 +++
> 14 files changed, 464 insertions(+)
> create mode 100644 block/disk-deadlines.c
> create mode 100644 include/block/disk-deadlines.h
>
Discussion summary:
- the idea itself is OK
- there are some technical faults, like using a Linux-specific API for
synchronization
- libvirt should be notified when a deadline expires
- the deadline timeout should be configurable
- deadlines should not be added to guest statistics; a new architectural
and configuration approach is necessary.
There are 2 main options:
- filter driver
- code could be embedded into current main block code
- another question is how to configure deadlines. With a filter driver the
approach is clear: this would be an option of that driver. On the other
hand, we could add an 'io-timeout' option to the generic block driver code
and avoid any further options (-1 would mean the default timeout, 0 would
mean no timeout), something like the command line sketched below.
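Purely as an illustration of that last idea (the 'io-timeout' option does not
exist in this series or in QEMU; the name and semantics here are only a
sketch), the drive line from the cover letter could then become:

    -drive file=/path/to/folder/on/client/vdisk.img,media=disk,if=ide,io-timeout=8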
I will spend the next couple of days analysing which architectural approach
is better, but any suggestion is welcome.
At the moment I tend towards integrating it into the generic code. On the
other hand, we could implement it as a filter but enable it with a simple
option in the generic code, to avoid unnecessary complexity for the end
user. Though maybe I am a bit confused and puzzled here.
Den
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Qemu-devel] [PATCH 2/5] disk_deadlines: add request to resume Virtual Machine
2015-09-10 19:18 ` Denis V. Lunev
@ 2015-09-14 16:46 ` Stefan Hajnoczi
0 siblings, 0 replies; 48+ messages in thread
From: Stefan Hajnoczi @ 2015-09-14 16:46 UTC (permalink / raw)
To: Denis V. Lunev
Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, Raushaniya Maksudova,
Paolo Bonzini
On Thu, Sep 10, 2015 at 10:18:32PM +0300, Denis V. Lunev wrote:
> On 09/10/2015 11:51 AM, Stefan Hajnoczi wrote:
> >On Tue, Sep 08, 2015 at 11:00:25AM +0300, Denis V. Lunev wrote:
> >>From: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
> >>
> >>In some cases one needs to pause and resume a Virtual Machine from inside
> >>of Qemu. Currently there are request functions to pause VM (vmstop), but
> >>there are no respective ones to resume VM.
> >>
> >>Signed-off-by: Raushaniya Maksudova <rmaksudova@virtuozzo.com>
> >>Signed-off-by: Denis V. Lunev <den@openvz.org>
> >>CC: Stefan Hajnoczi <stefanha@redhat.com>
> >>CC: Kevin Wolf <kwolf@redhat.com>
> >>CC: Paolo Bonzini <pbonzini@redhat.com>
> >>---
> >> include/sysemu/sysemu.h | 1 +
> >> stubs/vm-stop.c | 5 +++++
> >> vl.c | 18 ++++++++++++++++++
> >> 3 files changed, 24 insertions(+)
> >Why can't vm_start() be used?
> >
>
> we do fear about correct thread to perform this operation.
> this code eventually redirect state changing code into
> main event loop.
The code isn't thread-safe though:
There is a race condition if qemu_vmstart_requested() runs while
qemu_vmstart_request() is called: qemu_vmstart_requested() might return
false and the request made by qemu_vmstart_request() is missed.
Please add doc comments to these functions explaining assumptions about
thread-safety and environment.
If the guest is resumed in an I/O completion handler function it
probably needs to be truly thread-safe.
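One possible way to close that race, sketched against this patch (untested;
it assumes the atomic_set()/atomic_xchg() helpers from
include/qemu/atomic.h):

    static int vmstart_requested;

    void qemu_system_vmstart_request(void)
    {
        atomic_set(&vmstart_requested, 1);
        qemu_notify_event();
    }

    static bool qemu_vmstart_requested(void)
    {
        /* read and clear in one step so a concurrent request is not lost */
        return atomic_xchg(&vmstart_requested, 0);
    }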
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Qemu-devel] [PATCH 4/5] disk_deadlines: add control of requests time expiration
2015-09-10 11:39 ` Kevin Wolf
@ 2015-09-14 16:53 ` Stefan Hajnoczi
0 siblings, 0 replies; 48+ messages in thread
From: Stefan Hajnoczi @ 2015-09-14 16:53 UTC (permalink / raw)
To: Kevin Wolf
Cc: Stefan Hajnoczi, qemu-devel, Raushaniya Maksudova, Denis V. Lunev
On Thu, Sep 10, 2015 at 01:39:20PM +0200, Kevin Wolf wrote:
> Am 10.09.2015 um 12:27 hat Stefan Hajnoczi geschrieben:
> > On Tue, Sep 08, 2015 at 04:48:24PM +0200, Kevin Wolf wrote:
> > > Am 08.09.2015 um 16:23 hat Denis V. Lunev geschrieben:
> > > > On 09/08/2015 04:05 PM, Kevin Wolf wrote:
> > > > >Am 08.09.2015 um 13:27 hat Denis V. Lunev geschrieben:
> > > > >>interesting point. Yes, it flushes all requests and most likely
> > > > >>hangs inside waiting requests to complete. But fortunately
> > > > >>this happens after the switch to paused state thus
> > > > >>the guest becomes paused. That's why I have missed this
> > > > >>fact.
> > > > >>
> > > > >>This (could) be considered as a problem but I have no (good)
> > > > >>solution at the moment. Should think a bit on.
> > > > >Let me suggest a radically different design. Note that I don't say this
> > > > >is necessarily how things should be done, I'm just trying to introduce
> > > > >some new ideas and broaden the discussion, so that we have a larger set
> > > > >of ideas from which we can pick the right solution(s).
> > > > >
> > > > >The core of my idea would be a new filter block driver 'timeout' that
> > > > >can be added on top of each BDS that could potentially fail, like a
> > > > >raw-posix BDS pointing to a file on NFS. This way most pieces of the
> > > > >solution are nicely modularised and don't touch the block layer core.
> > > > >
> > > > >During normal operation the driver would just be passing through
> > > > >requests to the lower layer. When it detects a timeout, however, it
> > > > >completes the request it received with -ETIMEDOUT. It also completes any
> > > > >new request it receives with -ETIMEDOUT without passing the request on
> > > > >until the request that originally timed out returns. This is our safety
> > > > >measure against anyone seeing whether or how the timed out request
> > > > >modified data.
> > > > >
> > > > >We need to make sure that bdrv_drain() doesn't wait for this request.
> > > > >Possibly we need to introduce a .bdrv_drain callback that replaces the
> > > > >default handling, because bdrv_requests_pending() in the default
> > > > >handling considers bs->file, which would still have the timed out
> > > > >request. We don't want to see this; bdrv_drain_all() should complete
> > > > >even though that request is still pending internally (externally, we
> > > > >returned -ETIMEDOUT, so we can consider it completed). This way the
> > > > >monitor stays responsive and background jobs can go on if they don't use
> > > > >the failing block device.
> > > > >
> > > > >And then we essentially reuse the rerror/werror mechanism that we
> > > > >already have to stop the VM. The device models would be extended to
> > > > >always stop the VM on -ETIMEDOUT, regardless of the error policy. In
> > > > >this state, the VM would even be migratable if you make sure that the
> > > > >pending request can't modify the image on the destination host any more.
> > > > >
> > > > >Do you think this could work, or did I miss something important?
> > > > >
> > > > >Kevin
> > > > could I propose even more radical solution then?
> > > >
> > > > My original approach was based on the fact that
> > > > this could should be maintainable out-of-stream.
> > > > If the patch will be merged - this boundary condition
> > > > could be dropped.
> > > >
> > > > Why not to invent 'terror' field on BdrvOptions
> > > > and process things in core block layer without
> > > > a filter? RB Tree entry will just not created if
> > > > the policy will be set to 'ignore'.
> > >
> > > 'terror' might not be the most fortunate name... ;-)
> > >
> > > The reason why I would prefer a filter driver is so the code and the
> > > associated data structures are cleanly modularised and we can keep the
> > > actual block layer core small and clean. The same is true for some other
> > > functions that I would rather move out of the core into filter drivers
> > > than add new cases (e.g. I/O throttling, backup notifiers, etc.), but
> > > which are a bit harder to actually move because we already have old
> > > interfaces that we can't break (we'll probably do it anyway eventually,
> > > even if it needs a bit more compatibility code).
> > >
> > > However, it seems that you are mostly touching code that is maintained
> > > by Stefan, and Stefan used to be a bit more open to adding functionality
> > > to the core, so my opinion might not be the last word.
> >
> > I've been thinking more about the correctness of this feature:
> >
> > QEMU cannot cancel I/O because there is no Linux userspace API for doing
> > so. Linux AIO's io_cancel(2) syscall is a nop since file systems don't
> > implement a kiocb_cancel_fn. Sending a signal to a task blocked in
> > O_DIRECT preadv(2)/pwritev(2) doesn't work either because the task is in
> > uninterruptible sleep.
> >
> > The only way to make sure a request has finished is to wait for
> > completion. If we treat a request as failed/cancelled but it's actually
> > still pending at a layer of the storage stack:
> > 1. Read requests may modify guest memory.
> > 2. Write requests may modify disk sectors.
> >
> > Today the guest times out and tries to do IDE/ATA recovery, for example.
> > This causes QEMU to eventually call the synchronous bdrv_drain_all()
> > function and the guest hangs. Also, if the guest mounts the file system
> > read-only in response to the timeout, then game over.
> >
> > The disk-deadlines feature lets QEMU detect timeouts before the guest so
> > we can pause the guest. The part I have been thinking about is that the
> > only option is to wait until the request completes.
> >
> > We cannot abandon the timed out request because we'll face #1 or #2
> > above. This means it doesn't make sense to retry the request like
> > rerror=/werror=. rerror=/werror= can retry safely because the original
> > request has failed but that is not the case for timed out requests.
> >
> > This also means that live migration isn't safe, at least if a write
> > request is pending. If the guest migrates, the pending write request on
> > the source host could still complete after live migration handover,
> > corrupting the disk.
> >
> > Getting back to these patches: I think the implementation is correct in
> > that the only policy is to wait for timed out requests to complete and
> > then resume the guest.
> >
> > However, these patches need to violate the constraint that guest memory
> > isn't dirtied when the guest is paused. This is an important constraint
> > for the correctness of live migration, since we need to be able to track
> > all changes to guest memory.
> >
> > Just wanted to post this in case anyone disagrees.
>
> You're making a few good points here.
>
> I thought that migration with a pending write request could be safe with
> some additional knowledge because if you know that the write is hanging
> because the connection to the NFS server is down and you make sure that
> it remains disconnected, that would work. However, the hanging request
> is already in the kernel, so you could never bring the connection up
> again without rebooting the host, which is clearly not a realistic
> assumption.
>
> Never thought of the constraints of live migration either, so it seems
> reads requests are equally problematic.
>
> So it appears that the filter driver would have to add a migration
> blocker whenever it sees any request time out, and only clear it again
> when all pending requests have completed.
Adding new features as filters (like quorum) instead of adding them to the
core block layer is a good thing.
Kevin: Can you post an example of the syntax so it's clear what you
mean?
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Qemu-devel] [PATCH 4/5] disk_deadlines: add control of requests time expiration
2015-09-10 10:27 ` Stefan Hajnoczi
2015-09-10 11:39 ` Kevin Wolf
@ 2015-09-25 12:34 ` Dr. David Alan Gilbert
2015-09-28 12:42 ` Stefan Hajnoczi
1 sibling, 1 reply; 48+ messages in thread
From: Dr. David Alan Gilbert @ 2015-09-25 12:34 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: Kevin Wolf, Denis V. Lunev, qemu-devel, Stefan Hajnoczi,
Raushaniya Maksudova
* Stefan Hajnoczi (stefanha@gmail.com) wrote:
> On Tue, Sep 08, 2015 at 04:48:24PM +0200, Kevin Wolf wrote:
> > Am 08.09.2015 um 16:23 hat Denis V. Lunev geschrieben:
> > > On 09/08/2015 04:05 PM, Kevin Wolf wrote:
> > > >Am 08.09.2015 um 13:27 hat Denis V. Lunev geschrieben:
> > > >>interesting point. Yes, it flushes all requests and most likely
> > > >>hangs inside waiting requests to complete. But fortunately
> > > >>this happens after the switch to paused state thus
> > > >>the guest becomes paused. That's why I have missed this
> > > >>fact.
> > > >>
> > > >>This (could) be considered as a problem but I have no (good)
> > > >>solution at the moment. Should think a bit on.
> > > >Let me suggest a radically different design. Note that I don't say this
> > > >is necessarily how things should be done, I'm just trying to introduce
> > > >some new ideas and broaden the discussion, so that we have a larger set
> > > >of ideas from which we can pick the right solution(s).
> > > >
> > > >The core of my idea would be a new filter block driver 'timeout' that
> > > >can be added on top of each BDS that could potentially fail, like a
> > > >raw-posix BDS pointing to a file on NFS. This way most pieces of the
> > > >solution are nicely modularised and don't touch the block layer core.
> > > >
> > > >During normal operation the driver would just be passing through
> > > >requests to the lower layer. When it detects a timeout, however, it
> > > >completes the request it received with -ETIMEDOUT. It also completes any
> > > >new request it receives with -ETIMEDOUT without passing the request on
> > > >until the request that originally timed out returns. This is our safety
> > > >measure against anyone seeing whether or how the timed out request
> > > >modified data.
> > > >
> > > >We need to make sure that bdrv_drain() doesn't wait for this request.
> > > >Possibly we need to introduce a .bdrv_drain callback that replaces the
> > > >default handling, because bdrv_requests_pending() in the default
> > > >handling considers bs->file, which would still have the timed out
> > > >request. We don't want to see this; bdrv_drain_all() should complete
> > > >even though that request is still pending internally (externally, we
> > > >returned -ETIMEDOUT, so we can consider it completed). This way the
> > > >monitor stays responsive and background jobs can go on if they don't use
> > > >the failing block device.
> > > >
> > > >And then we essentially reuse the rerror/werror mechanism that we
> > > >already have to stop the VM. The device models would be extended to
> > > >always stop the VM on -ETIMEDOUT, regardless of the error policy. In
> > > >this state, the VM would even be migratable if you make sure that the
> > > >pending request can't modify the image on the destination host any more.
> > > >
> > > >Do you think this could work, or did I miss something important?
> > > >
> > > >Kevin
> > > could I propose even more radical solution then?
> > >
> > > My original approach was based on the fact that
> > > this could should be maintainable out-of-stream.
> > > If the patch will be merged - this boundary condition
> > > could be dropped.
> > >
> > > Why not to invent 'terror' field on BdrvOptions
> > > and process things in core block layer without
> > > a filter? RB Tree entry will just not created if
> > > the policy will be set to 'ignore'.
> >
> > 'terror' might not be the most fortunate name... ;-)
> >
> > The reason why I would prefer a filter driver is so the code and the
> > associated data structures are cleanly modularised and we can keep the
> > actual block layer core small and clean. The same is true for some other
> > functions that I would rather move out of the core into filter drivers
> > than add new cases (e.g. I/O throttling, backup notifiers, etc.), but
> > which are a bit harder to actually move because we already have old
> > interfaces that we can't break (we'll probably do it anyway eventually,
> > even if it needs a bit more compatibility code).
> >
> > However, it seems that you are mostly touching code that is maintained
> > by Stefan, and Stefan used to be a bit more open to adding functionality
> > to the core, so my opinion might not be the last word.
>
> I've been thinking more about the correctness of this feature:
>
> QEMU cannot cancel I/O because there is no Linux userspace API for doing
> so. Linux AIO's io_cancel(2) syscall is a nop since file systems don't
> implement a kiocb_cancel_fn. Sending a signal to a task blocked in
> O_DIRECT preadv(2)/pwritev(2) doesn't work either because the task is in
> uninterruptible sleep.
There are things that work on some devices, but nothing generic.
For NBD/iSCSI/(ceph?) you should be able to issue a shutdown(2) on the socket
that connects to the server, and that should cause all existing IO to fail
quickly. Then you could do a drain and be done. This would be very useful
for the fault-tolerant uses (e.g. Wen Congyang's block replication).
There are even ways of killing hard NFS mounts; for example, adding
an unreachable route to the NFS server (ip route add unreachable hostname)
and then umount -f seems to cause I/O errors to tasks. (I can't find
a way to do a remount to change the hard flag.) This isn't pretty, but
it's a reasonable way of getting your host back to usable if one NFS
server has died.
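A minimal illustration of the shutdown(2) idea (plain POSIX, not QEMU code;
sockfd is assumed to be the TCP socket carrying the NBD/iSCSI connection):

    #include <sys/socket.h>

    /* Force pending and future I/O on this connection to fail quickly
     * instead of hanging: disallow further transmissions and receptions. */
    static int abort_storage_connection(int sockfd)
    {
        return shutdown(sockfd, SHUT_RDWR);
    }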
Dave
>
> The only way to make sure a request has finished is to wait for
> completion. If we treat a request as failed/cancelled but it's actually
> still pending at a layer of the storage stack:
> 1. Read requests may modify guest memory.
> 2. Write requests may modify disk sectors.
>
> Today the guest times out and tries to do IDE/ATA recovery, for example.
> This causes QEMU to eventually call the synchronous bdrv_drain_all()
> function and the guest hangs. Also, if the guest mounts the file system
> read-only in response to the timeout, then game over.
>
> The disk-deadlines feature lets QEMU detect timeouts before the guest so
> we can pause the guest. The part I have been thinking about is that the
> only option is to wait until the request completes.
>
> We cannot abandon the timed out request because we'll face #1 or #2
> above. This means it doesn't make sense to retry the request like
> rerror=/werror=. rerror=/werror= can retry safely because the original
> request has failed but that is not the case for timed out requests.
>
> This also means that live migration isn't safe, at least if a write
> request is pending. If the guest migrates, the pending write request on
> the source host could still complete after live migration handover,
> corrupting the disk.
>
> Getting back to these patches: I think the implementation is correct in
> that the only policy is to wait for timed out requests to complete and
> then resume the guest.
>
> However, these patches need to violate the constraint that guest memory
> isn't dirtied when the guest is paused. This is an important constraint
> for the correctness of live migration, since we need to be able to track
> all changes to guest memory.
>
> Just wanted to post this in case anyone disagrees.
>
> Stefan
>
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Qemu-devel] [PATCH 4/5] disk_deadlines: add control of requests time expiration
2015-09-25 12:34 ` Dr. David Alan Gilbert
@ 2015-09-28 12:42 ` Stefan Hajnoczi
2015-09-28 13:55 ` Dr. David Alan Gilbert
0 siblings, 1 reply; 48+ messages in thread
From: Stefan Hajnoczi @ 2015-09-28 12:42 UTC (permalink / raw)
To: Dr. David Alan Gilbert
Cc: Kevin Wolf, Stefan Hajnoczi, qemu-devel, Raushaniya Maksudova,
Denis V. Lunev
On Fri, Sep 25, 2015 at 01:34:22PM +0100, Dr. David Alan Gilbert wrote:
> * Stefan Hajnoczi (stefanha@gmail.com) wrote:
> > On Tue, Sep 08, 2015 at 04:48:24PM +0200, Kevin Wolf wrote:
> > > Am 08.09.2015 um 16:23 hat Denis V. Lunev geschrieben:
> > > > On 09/08/2015 04:05 PM, Kevin Wolf wrote:
> > > > >Am 08.09.2015 um 13:27 hat Denis V. Lunev geschrieben:
> > > > >>interesting point. Yes, it flushes all requests and most likely
> > > > >>hangs inside waiting requests to complete. But fortunately
> > > > >>this happens after the switch to paused state thus
> > > > >>the guest becomes paused. That's why I have missed this
> > > > >>fact.
> > > > >>
> > > > >>This (could) be considered as a problem but I have no (good)
> > > > >>solution at the moment. Should think a bit on.
> > > > >Let me suggest a radically different design. Note that I don't say this
> > > > >is necessarily how things should be done, I'm just trying to introduce
> > > > >some new ideas and broaden the discussion, so that we have a larger set
> > > > >of ideas from which we can pick the right solution(s).
> > > > >
> > > > >The core of my idea would be a new filter block driver 'timeout' that
> > > > >can be added on top of each BDS that could potentially fail, like a
> > > > >raw-posix BDS pointing to a file on NFS. This way most pieces of the
> > > > >solution are nicely modularised and don't touch the block layer core.
> > > > >
> > > > >During normal operation the driver would just be passing through
> > > > >requests to the lower layer. When it detects a timeout, however, it
> > > > >completes the request it received with -ETIMEDOUT. It also completes any
> > > > >new request it receives with -ETIMEDOUT without passing the request on
> > > > >until the request that originally timed out returns. This is our safety
> > > > >measure against anyone seeing whether or how the timed out request
> > > > >modified data.
> > > > >
> > > > >We need to make sure that bdrv_drain() doesn't wait for this request.
> > > > >Possibly we need to introduce a .bdrv_drain callback that replaces the
> > > > >default handling, because bdrv_requests_pending() in the default
> > > > >handling considers bs->file, which would still have the timed out
> > > > >request. We don't want to see this; bdrv_drain_all() should complete
> > > > >even though that request is still pending internally (externally, we
> > > > >returned -ETIMEDOUT, so we can consider it completed). This way the
> > > > >monitor stays responsive and background jobs can go on if they don't use
> > > > >the failing block device.
> > > > >
> > > > >And then we essentially reuse the rerror/werror mechanism that we
> > > > >already have to stop the VM. The device models would be extended to
> > > > >always stop the VM on -ETIMEDOUT, regardless of the error policy. In
> > > > >this state, the VM would even be migratable if you make sure that the
> > > > >pending request can't modify the image on the destination host any more.
> > > > >
> > > > >Do you think this could work, or did I miss something important?
> > > > >
> > > > >Kevin
> > > > could I propose even more radical solution then?
> > > >
> > > > My original approach was based on the fact that
> > > > this could should be maintainable out-of-stream.
> > > > If the patch will be merged - this boundary condition
> > > > could be dropped.
> > > >
> > > > Why not to invent 'terror' field on BdrvOptions
> > > > and process things in core block layer without
> > > > a filter? RB Tree entry will just not created if
> > > > the policy will be set to 'ignore'.
> > >
> > > 'terror' might not be the most fortunate name... ;-)
> > >
> > > The reason why I would prefer a filter driver is so the code and the
> > > associated data structures are cleanly modularised and we can keep the
> > > actual block layer core small and clean. The same is true for some other
> > > functions that I would rather move out of the core into filter drivers
> > > than add new cases (e.g. I/O throttling, backup notifiers, etc.), but
> > > which are a bit harder to actually move because we already have old
> > > interfaces that we can't break (we'll probably do it anyway eventually,
> > > even if it needs a bit more compatibility code).
> > >
> > > However, it seems that you are mostly touching code that is maintained
> > > by Stefan, and Stefan used to be a bit more open to adding functionality
> > > to the core, so my opinion might not be the last word.
> >
> > I've been thinking more about the correctness of this feature:
> >
> > QEMU cannot cancel I/O because there is no Linux userspace API for doing
> > so. Linux AIO's io_cancel(2) syscall is a nop since file systems don't
> > implement a kiocb_cancel_fn. Sending a signal to a task blocked in
> > O_DIRECT preadv(2)/pwritev(2) doesn't work either because the task is in
> > uninterruptible sleep.
>
> There are things that work on some devices, but nothing generic.
> For NBD/iSCSI/(ceph?) you should be able to issue a shutdown(2) on the socket
> that connects to the server and that should call all existing IO to fail
> quickly. Then you could do a drain and be done. This would
> be very useful for the fault-tolerant uses (e.g. Wen Congyang's block replication).
>
> There are even ways of killing hard NFS mounts; for example adding
> a unreachable route to the NFS server (ip route add unreachable hostname),
> and then umount -f seems to cause I/O errors to tasks. (I can't find
> a way to do a remount to change the hard flag). This isn't pretty but
> it's a reasonable way of getting your host back to useable if one NFS
> server has died.
If you just throw away a socket, you don't know the state of the disk
since some requests may have been handled by the server and others were
not handled.
So I doubt these approaches work because cleanly closing a connection
requires communication between the client and server to determine that
the connection was closed and which pending requests were completed.
The trade-off is that the client no longer has DMA buffers that might
get written to, but now you no longer know the state of the disk!
Stefan
^ permalink raw reply [flat|nested] 48+ messages in thread
* Re: [Qemu-devel] [PATCH 4/5] disk_deadlines: add control of requests time expiration
2015-09-28 12:42 ` Stefan Hajnoczi
@ 2015-09-28 13:55 ` Dr. David Alan Gilbert
0 siblings, 0 replies; 48+ messages in thread
From: Dr. David Alan Gilbert @ 2015-09-28 13:55 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: Kevin Wolf, Denis V. Lunev, qemu-devel, Raushaniya Maksudova
* Stefan Hajnoczi (stefanha@redhat.com) wrote:
> On Fri, Sep 25, 2015 at 01:34:22PM +0100, Dr. David Alan Gilbert wrote:
> > * Stefan Hajnoczi (stefanha@gmail.com) wrote:
> > > On Tue, Sep 08, 2015 at 04:48:24PM +0200, Kevin Wolf wrote:
> > > > Am 08.09.2015 um 16:23 hat Denis V. Lunev geschrieben:
> > > > > On 09/08/2015 04:05 PM, Kevin Wolf wrote:
> > > > > >Am 08.09.2015 um 13:27 hat Denis V. Lunev geschrieben:
> > > > > >>interesting point. Yes, it flushes all requests and most likely
> > > > > >>hangs inside waiting requests to complete. But fortunately
> > > > > >>this happens after the switch to paused state thus
> > > > > >>the guest becomes paused. That's why I have missed this
> > > > > >>fact.
> > > > > >>
> > > > > >>This (could) be considered as a problem but I have no (good)
> > > > > >>solution at the moment. Should think a bit on.
> > > > > >Let me suggest a radically different design. Note that I don't say this
> > > > > >is necessarily how things should be done, I'm just trying to introduce
> > > > > >some new ideas and broaden the discussion, so that we have a larger set
> > > > > >of ideas from which we can pick the right solution(s).
> > > > > >
> > > > > >The core of my idea would be a new filter block driver 'timeout' that
> > > > > >can be added on top of each BDS that could potentially fail, like a
> > > > > >raw-posix BDS pointing to a file on NFS. This way most pieces of the
> > > > > >solution are nicely modularised and don't touch the block layer core.
> > > > > >
> > > > > >During normal operation the driver would just be passing through
> > > > > >requests to the lower layer. When it detects a timeout, however, it
> > > > > >completes the request it received with -ETIMEDOUT. It also completes any
> > > > > >new request it receives with -ETIMEDOUT without passing the request on
> > > > > >until the request that originally timed out returns. This is our safety
> > > > > >measure against anyone seeing whether or how the timed out request
> > > > > >modified data.
> > > > > >
> > > > > >We need to make sure that bdrv_drain() doesn't wait for this request.
> > > > > >Possibly we need to introduce a .bdrv_drain callback that replaces the
> > > > > >default handling, because bdrv_requests_pending() in the default
> > > > > >handling considers bs->file, which would still have the timed out
> > > > > >request. We don't want to see this; bdrv_drain_all() should complete
> > > > > >even though that request is still pending internally (externally, we
> > > > > >returned -ETIMEDOUT, so we can consider it completed). This way the
> > > > > >monitor stays responsive and background jobs can go on if they don't use
> > > > > >the failing block device.
> > > > > >
> > > > > >And then we essentially reuse the rerror/werror mechanism that we
> > > > > >already have to stop the VM. The device models would be extended to
> > > > > >always stop the VM on -ETIMEDOUT, regardless of the error policy. In
> > > > > >this state, the VM would even be migratable if you make sure that the
> > > > > >pending request can't modify the image on the destination host any more.
> > > > > >
> > > > > >Do you think this could work, or did I miss something important?
> > > > > >
> > > > > >Kevin
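To make the fencing semantics above concrete, here is a toy model in plain C; it is deliberately not the QEMU block-driver API, and every name in it is illustrative only:

    /* Toy model of the proposed 'timeout' filter semantics.  Plain C,
     * not the QEMU block-driver API; all names are illustrative. */
    #include <errno.h>
    #include <stdbool.h>

    struct timeout_filter {
        bool stuck;   /* a timed-out request is still pending in the lower layer */
    };

    /* Invoked when a passed-through request exceeds its deadline. */
    int filter_on_timeout(struct timeout_filter *f)
    {
        f->stuck = true;     /* raise the fence */
        return -ETIMEDOUT;   /* complete the guest-visible request early */
    }

    /* Invoked for every new guest request. */
    int filter_on_new_request(struct timeout_filter *f)
    {
        if (f->stuck) {
            /* Nobody may observe whether or how the stuck request
             * modified the image, so fail fast without passing it on. */
            return -ETIMEDOUT;
        }
        return 0;            /* pass the request through to the lower layer */
    }

    /* Invoked when the request that originally timed out finally returns. */
    void filter_on_stuck_request_done(struct timeout_filter *f)
    {
        f->stuck = false;    /* drop the fence, resume normal pass-through */
    }

In the real driver, the .bdrv_drain handling described above would additionally have to ignore the request that is still pending below the fence.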
> > > > > Could I propose an even more radical solution then?
> > > > >
> > > > > My original approach was based on the fact that
> > > > > this code should be maintainable out of tree.
> > > > > If the patch is merged, this boundary condition
> > > > > can be dropped.
> > > > >
> > > > > Why not invent a 'terror' field on BdrvOptions
> > > > > and process things in the core block layer without
> > > > > a filter? The RB tree entry will simply not be created if
> > > > > the policy is set to 'ignore'.
> > > >
> > > > 'terror' might not be the most fortunate name... ;-)
> > > >
> > > > The reason why I would prefer a filter driver is so the code and the
> > > > associated data structures are cleanly modularised and we can keep the
> > > > actual block layer core small and clean. The same is true for some other
> > > > functions that I would rather move out of the core into filter drivers
> > > > than add new cases (e.g. I/O throttling, backup notifiers, etc.), but
> > > > which are a bit harder to actually move because we already have old
> > > > interfaces that we can't break (we'll probably do it anyway eventually,
> > > > even if it needs a bit more compatibility code).
> > > >
> > > > However, it seems that you are mostly touching code that is maintained
> > > > by Stefan, and Stefan used to be a bit more open to adding functionality
> > > > to the core, so my opinion might not be the last word.
> > >
> > > I've been thinking more about the correctness of this feature:
> > >
> > > QEMU cannot cancel I/O because there is no Linux userspace API for doing
> > > so. Linux AIO's io_cancel(2) syscall is a nop since file systems don't
> > > implement a kiocb_cancel_fn. Sending a signal to a task blocked in
> > > O_DIRECT preadv(2)/pwritev(2) doesn't work either because the task is in
> > > uninterruptible sleep.
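The io_cancel() limitation is easy to demonstrate with libaio against a regular file. The sketch below (illustrative only, error handling trimmed, link with -laio) submits an O_DIRECT read and then tries to cancel it; the cancel call fails because file systems never register a cancel method:

    /* Sketch: try to cancel a Linux AIO read on a regular file. */
    #define _GNU_SOURCE          /* for O_DIRECT */
    #include <libaio.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>

    int main(void)
    {
        io_context_t ctx = 0;
        struct iocb cb, *cbs[1] = { &cb };
        struct io_event ev;
        void *buf;
        int fd = open("testfile", O_RDONLY | O_DIRECT);

        posix_memalign(&buf, 512, 4096);    /* O_DIRECT needs aligned buffers */
        io_setup(128, &ctx);
        io_prep_pread(&cb, fd, buf, 4096, 0);
        io_submit(ctx, 1, cbs);

        /* Fails (the exact errno depends on the kernel) instead of
         * cancelling, since no file system provides a cancel callback. */
        int ret = io_cancel(ctx, &cb, &ev);
        printf("io_cancel() = %d\n", ret);

        io_destroy(ctx);
        free(buf);
        return 0;
    }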
> >
> > There are things that work on some devices, but nothing generic.
> > For NBD/iSCSI/(ceph?) you should be able to issue a shutdown(2) on the socket
> > that connects to the server, and that should cause all existing IO to fail
> > quickly. Then you could do a drain and be done. This would
> > be very useful for the fault-tolerant uses (e.g. Wen Congyang's block replication).
> >
> > There are even ways of killing hard NFS mounts; for example, adding
> > an unreachable route to the NFS server (ip route add unreachable hostname)
> > and then running umount -f seems to cause I/O errors in the tasks. (I can't find
> > a way to do a remount to change the hard flag.) This isn't pretty, but
> > it's a reasonable way of getting your host back to usable if one NFS
> > server has died.
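For the socket-backed transports, the idea is simply to tear the connection down underneath the pending requests. A minimal sketch, with sockfd standing for the already-connected NBD/iSCSI socket:

    /* Sketch: force pending I/O on a dead NBD/iSCSI connection to fail fast.
     * sockfd is assumed to be the already-connected socket to the server. */
    #include <sys/socket.h>
    #include <stdio.h>

    void fail_pending_io(int sockfd)
    {
        /* Shut down both directions so blocked reads and writes on this
         * socket return promptly (end-of-file or an error) instead of
         * hanging forever. */
        if (shutdown(sockfd, SHUT_RDWR) < 0) {
            perror("shutdown");
        }
    }

After that, pending requests fail quickly and the drain mentioned above can complete.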
>
> If you just throw away a socket, you don't know the state of the disk
> since some requests may have been handled by the server and others were
> not handled.
>
> So I doubt these approaches work because cleanly closing a connection
> requires communication between the client and server to determine that
> the connection was closed and which pending requests were completed.
>
> The trade-off is that the client no longer has DMA buffers that might
> get written to, but now you no longer know the state of the disk!
Right, you don't know what the last successful IOs really were, but if
you know that the NBD/iSCSI/NFS server is dead and is going to need to
be rebooted/replaced anyway, then your current state is that you have
some QEMUs that are running fine except for one disk, but are now very
fragile because anything that tries to do a drain will hang. There's no
way to recover the knowledge of which IOs completed, but
you can recover all your guests that don't critically depend on that device.
Dave
--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK
end of thread
Thread overview: 48+ messages
2015-09-08 8:00 [Qemu-devel] [PATCH RFC 0/5] disk deadlines Denis V. Lunev
2015-09-08 8:00 ` [Qemu-devel] [PATCH 1/5] add QEMU style defines for __sync_add_and_fetch Denis V. Lunev
2015-09-10 8:19 ` Stefan Hajnoczi
2015-09-08 8:00 ` [Qemu-devel] [PATCH 2/5] disk_deadlines: add request to resume Virtual Machine Denis V. Lunev
2015-09-10 8:51 ` Stefan Hajnoczi
2015-09-10 19:18 ` Denis V. Lunev
2015-09-14 16:46 ` Stefan Hajnoczi
2015-09-08 8:00 ` [Qemu-devel] [PATCH 3/5] disk_deadlines: add disk-deadlines option per drive Denis V. Lunev
2015-09-10 9:05 ` Stefan Hajnoczi
2015-09-08 8:00 ` [Qemu-devel] [PATCH 4/5] disk_deadlines: add control of requests time expiration Denis V. Lunev
2015-09-08 9:35 ` Fam Zheng
2015-09-08 9:42 ` Denis V. Lunev
2015-09-08 11:06 ` Kevin Wolf
2015-09-08 11:27 ` Denis V. Lunev
2015-09-08 13:05 ` Kevin Wolf
2015-09-08 14:23 ` Denis V. Lunev
2015-09-08 14:48 ` Kevin Wolf
2015-09-10 10:27 ` Stefan Hajnoczi
2015-09-10 11:39 ` Kevin Wolf
2015-09-14 16:53 ` Stefan Hajnoczi
2015-09-25 12:34 ` Dr. David Alan Gilbert
2015-09-28 12:42 ` Stefan Hajnoczi
2015-09-28 13:55 ` Dr. David Alan Gilbert
2015-09-08 8:00 ` [Qemu-devel] [PATCH 5/5] disk_deadlines: add info disk-deadlines option Denis V. Lunev
2015-09-08 16:20 ` Eric Blake
2015-09-08 16:26 ` Eric Blake
2015-09-10 18:53 ` Denis V. Lunev
2015-09-10 19:13 ` Denis V. Lunev
2015-09-08 8:58 ` [Qemu-devel] [PATCH RFC 0/5] disk deadlines Vasiliy Tolstov
2015-09-08 9:20 ` Fam Zheng
2015-09-08 10:11 ` Kevin Wolf
2015-09-08 10:13 ` Denis V. Lunev
2015-09-08 10:20 ` Fam Zheng
2015-09-08 10:46 ` Denis V. Lunev
2015-09-08 10:49 ` Kevin Wolf
2015-09-08 13:20 ` Fam Zheng
2015-09-08 9:33 ` Paolo Bonzini
2015-09-08 9:41 ` Denis V. Lunev
2015-09-08 9:43 ` Paolo Bonzini
2015-09-08 10:37 ` Andrey Korolyov
2015-09-08 10:50 ` Denis V. Lunev
2015-09-08 10:07 ` Kevin Wolf
2015-09-08 10:08 ` Denis V. Lunev
2015-09-08 10:22 ` Stefan Hajnoczi
2015-09-08 10:26 ` Paolo Bonzini
2015-09-08 10:36 ` Denis V. Lunev
2015-09-08 19:11 ` John Snow
2015-09-10 19:29 ` [Qemu-devel] Summary: " Denis V. Lunev