* [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
@ 2010-11-25 6:06 Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 01/21] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer() Yoshiaki Tamura
` (21 more replies)
0 siblings, 22 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-25 6:06 UTC (permalink / raw)
To: kvm, qemu-devel
Cc: aliguori, mtosatti, ananth, ohmura.kei, dlaor, vatsa,
Yoshiaki Tamura, avi, psuriset, stefanha
Hi,
This patch series is a revised version of Kemari for KVM, which
incorporates the comments on the previous post and the feedback from
KVM Forum 2010. The current code is based on qemu.git
f711df67d611e4762966a249742a5f7499e19f99.
For general information about Kemari, I've made a wiki page at
qemu.org.
http://wiki.qemu.org/Features/FaultTolerance
The changes from v0.1.1 -> v0.2 are:
- Introduce a queue in event-tap to make VM sync live.
- Change transaction receiver to a state machine for async receiving.
- Replace net/block layer functions with event-tap proxy functions.
- Remove dirty bitmap optimization for now.
- Convert DPRINTF() in ft_trans_file to trace functions.
- Convert fprintf() in ft_trans_file to error_report().
- Improve error handling in ft_trans_file.
- Add a tmp pointer to qemu_del_vm_change_state_handler.
The changes from v0.1 -> v0.1.1 are:
- Events are tapped in the net/block layer instead of the device emulation layer.
- Introduce a new option for -incoming to accept FT transactions.
- Removed writev() support from QEMUFile and FdMigrationState for now. I will
post this work in a separate series.
- Modified the virtio-blk save/load handler to send the inuse variable for
correct replay.
- Removed configure --enable-ft-mode.
- Removed an unnecessary check for qemu_realloc().
The first 6 patches modify several QEMU functions to prepare for
introducing the Kemari-specific components.
The next 6 patches are the components of Kemari. They introduce
event-tap and the FT transaction protocol file, which is based on
buffered_file. The design document of the FT transaction protocol can
be found at:
http://wiki.qemu.org/images/b/b1/Kemari_sender_receiver_0.5a.pdf
Then the following 4 patches modify dma-helpers, virtio-blk,
virtio-net and e1000 to replace net/block layer functions with
event-tap proxy functions. Please note that if Kemari is off,
event-tap simply passes requests through, so there is almost no
intrusion into existing code paths, including normal live migration.
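For reference, the block-side proxy introduced in patch 09 boils down
to the following shape (condensed here for illustration; the queueing
internals are in the patch itself):

BlockDriverAIOCB *bdrv_aio_writev_proxy(BlockDriverState *bs,
                                        int64_t sector_num, QEMUIOVector *iov,
                                        int nb_sectors,
                                        BlockDriverCompletionFunc *cb,
                                        void *opaque)
{
    if (event_tap_state == EVENT_TAP_ON) {
        /* Kemari on: queue the request in event-tap; it is issued when
         * the current transaction is flushed */
        BlockRequest req = {
            .sector = sector_num, .nb_sectors = nb_sectors,
            .qiov = iov, .cb = cb, .opaque = opaque,
        };
        bdrv_event_tap(bs, &req, 1, 0);
        return &dummy_acb;
    }
    /* Kemari off: plain passthrough, no behavioral change */
    return bdrv_aio_writev(bs, sector_num, iov, nb_sectors, cb, opaque);
}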
Finally, the migration layer is modified to support Kemari in the
last 5 patches. Again, there shouldn't be any effect if a user
doesn't specify the Kemari-specific options. The transaction is now
asynchronous on both the sender and the receiver side. The sender
respects max_downtime to decide when to switch from async to sync mode.
The following is a demo video of the latest version. The left window
is the primary and the right window is the secondary. As you can see,
the secondary window keeps getting updated because the transaction
receiver is now asynchronous.
http://www.osrg.net/kemari/download/kemari-v0.2-fedora11.mov
The repository below contains all patches I'm sending with this
message. For those who want to try it, please pull the following
tree. It also includes the dirty bitmap optimization, which isn't
ready for posting yet; to remove it, please look at HEAD~4 of the
tree. Also, please note that the tree is based on a slightly older
version of qemu.git for testing reasons. There are no major conflicts
with the patch series posted here.
git://kemari.git.sourceforge.net/gitroot/kemari/kemari
As always, I'm looking forward to suggestions/comments.
Thanks,
Yoshi
Yoshiaki Tamura (21):
Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and
qemu_clear_buffer().
Introduce read() to FdMigrationState.
Introduce skip_header parameter to qemu_loadvm_state().
qemu-char: export socket_set_nodelay().
virtio: modify save/load handler to handle inuse variable.
vl: add a tmp pointer so that a handler can delete the entry to which
it belongs.
Introduce fault tolerant VM transaction QEMUFile and ft_mode.
savevm: introduce util functions to control ft_trans_file from savevm
layer.
Introduce event-tap.
Call init handler of event-tap at main() in vl.c.
ioport: insert event_tap_ioport() to ioport_write().
Insert event_tap_mmio() to cpu_physical_memory_rw() in exec.c.
dma-helpers: replace bdrv_aio_writev() with bdrv_aio_writev_proxy().
virtio-blk: replace bdrv_aio_multiwrite() with
bdrv_aio_multiwrite_proxy().
virtio-net: replace qemu_sendv_packet_async() with
qemu_sendv_packet_async_proxy().
e1000: replace qemu_send_packet() with qemu_send_packet_proxy().
savevm: introduce qemu_savevm_trans_{begin,commit}.
migration: introduce migrate_ft_trans_{put,get}_ready(), and modify
migrate_fd_put_ready() when ft_mode is on.
migration-tcp: modify tcp_accept_incoming_migration() to handle
ft_mode, and add a hack not to close fd when ft_mode is enabled.
Introduce -k option to enable FT migration mode (Kemari).
migration: add a parser to accept FT migration incoming mode.
Makefile.objs | 1 +
Makefile.target | 1 +
block.h | 9 +
dma-helpers.c | 4 +-
event-tap.c | 794 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
event-tap.h | 34 +++
exec.c | 4 +
ft_trans_file.c | 603 +++++++++++++++++++++++++++++++++++++++++
ft_trans_file.h | 72 +++++
hmp-commands.hx | 7 +-
hw/e1000.c | 4 +-
hw/hw.h | 7 +
hw/virtio-blk.c | 2 +-
hw/virtio-net.c | 4 +-
hw/virtio.c | 8 +-
ioport.c | 2 +
migration-tcp.c | 58 ++++-
migration.c | 277 +++++++++++++++++++-
migration.h | 3 +
net.h | 4 +
net/queue.c | 1 +
qemu-char.c | 2 +-
qemu_socket.h | 1 +
qmp-commands.hx | 7 +-
savevm.c | 298 ++++++++++++++++++++-
sysemu.h | 4 +-
trace-events | 15 +
vl.c | 8 +-
28 files changed, 2197 insertions(+), 37 deletions(-)
create mode 100644 event-tap.c
create mode 100644 event-tap.h
create mode 100644 ft_trans_file.c
create mode 100644 ft_trans_file.h
* [Qemu-devel] [PATCH 01/21] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer().
2010-11-25 6:06 [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Yoshiaki Tamura
@ 2010-11-25 6:06 ` Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 02/21] Introduce read() to FdMigrationState Yoshiaki Tamura
` (20 subsequent siblings)
21 siblings, 0 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-25 6:06 UTC (permalink / raw)
To: kvm, qemu-devel
Cc: aliguori, mtosatti, ananth, ohmura.kei, dlaor, vatsa,
Yoshiaki Tamura, avi, psuriset, stefanha
Currently the QEMUFile buffer size is fixed at 32KB. It would be
useful if it could be resized on demand.
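For context, these helpers are used by the ft_trans_file receiver
later in this series, roughly like this (field names abbreviated;
payload_len stands for the incoming transaction's payload size):

    /* grow the QEMUFile buffer when an incoming payload doesn't fit */
    if (payload_len > s->buf_max_size - s->get_offset) {
        s->buf_max_size += payload_len - (s->buf_max_size - s->get_offset);
        s->buf = qemu_realloc_buffer(f, s->buf_max_size);
    }

    /* between transactions the buffer is zeroed and its indices reset */
    qemu_clear_buffer(f);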
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
hw/hw.h | 2 ++
savevm.c | 21 ++++++++++++++++++++-
2 files changed, 22 insertions(+), 1 deletions(-)
diff --git a/hw/hw.h b/hw/hw.h
index 9d2cfc2..b67f504 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -58,6 +58,8 @@ void qemu_fflush(QEMUFile *f);
int qemu_fclose(QEMUFile *f);
void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size);
void qemu_put_byte(QEMUFile *f, int v);
+void *qemu_realloc_buffer(QEMUFile *f, int size);
+void qemu_clear_buffer(QEMUFile *f);
static inline void qemu_put_ubyte(QEMUFile *f, unsigned int v)
{
diff --git a/savevm.c b/savevm.c
index 4e49765..6f651b3 100644
--- a/savevm.c
+++ b/savevm.c
@@ -172,7 +172,8 @@ struct QEMUFile {
when reading */
int buf_index;
int buf_size; /* 0 when writing */
- uint8_t buf[IO_BUF_SIZE];
+ int buf_max_size;
+ uint8_t *buf;
int has_error;
};
@@ -423,6 +424,9 @@ QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer,
f->get_rate_limit = get_rate_limit;
f->is_write = 0;
+ f->buf_max_size = IO_BUF_SIZE;
+ f->buf = qemu_malloc(sizeof(uint8_t) * f->buf_max_size);
+
return f;
}
@@ -453,6 +457,20 @@ void qemu_fflush(QEMUFile *f)
}
}
+void *qemu_realloc_buffer(QEMUFile *f, int size)
+{
+ f->buf_max_size = size;
+ f->buf = qemu_realloc(f->buf, f->buf_max_size);
+
+ return f->buf;
+}
+
+void qemu_clear_buffer(QEMUFile *f)
+{
+ f->buf_size = f->buf_index = f->buf_offset = 0;
+ memset(f->buf, 0, f->buf_max_size);
+}
+
static void qemu_fill_buffer(QEMUFile *f)
{
int len;
@@ -478,6 +496,7 @@ int qemu_fclose(QEMUFile *f)
qemu_fflush(f);
if (f->close)
ret = f->close(f->opaque);
+ qemu_free(f->buf);
qemu_free(f);
return ret;
}
--
1.7.1.2
* [Qemu-devel] [PATCH 02/21] Introduce read() to FdMigrationState.
2010-11-25 6:06 [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 01/21] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer() Yoshiaki Tamura
@ 2010-11-25 6:06 ` Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 03/21] Introduce skip_header parameter to qemu_loadvm_state() Yoshiaki Tamura
` (19 subsequent siblings)
21 siblings, 0 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-25 6:06 UTC (permalink / raw)
To: kvm, qemu-devel
Cc: aliguori, mtosatti, ananth, ohmura.kei, dlaor, vatsa,
Yoshiaki Tamura, avi, psuriset, stefanha
Currently FdMigrationState doesn't support read(), so this patch
introduces it to receive responses from the other side.
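A minimal (hypothetical) use on the sender side would be:

    /* read a small response from the destination; the return value is
     * the byte count, or -errno on failure (hypothetical caller) */
    uint8_t resp[64];
    int ret = migrate_fd_get_buffer(s, resp, 0, sizeof(resp));
    if (ret < 0) {
        /* e.g. -EAGAIN if the socket is non-blocking and empty */
    }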
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
migration-tcp.c | 15 +++++++++++++++
migration.c | 12 ++++++++++++
migration.h | 3 +++
3 files changed, 30 insertions(+), 0 deletions(-)
diff --git a/migration-tcp.c b/migration-tcp.c
index b55f419..96e2411 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -39,6 +39,20 @@ static int socket_write(FdMigrationState *s, const void * buf, size_t size)
return send(s->fd, buf, size, 0);
}
+static int socket_read(FdMigrationState *s, const void * buf, size_t size)
+{
+ ssize_t len;
+
+ do {
+ len = recv(s->fd, (void *)buf, size, 0);
+ } while (len == -1 && socket_error() == EINTR);
+ if (len == -1) {
+ len = -socket_error();
+ }
+
+ return len;
+}
+
static int tcp_close(FdMigrationState *s)
{
DPRINTF("tcp_close\n");
@@ -94,6 +108,7 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon,
s->get_error = socket_errno;
s->write = socket_write;
+ s->read = socket_read;
s->close = tcp_close;
s->mig_state.cancel = migrate_fd_cancel;
s->mig_state.get_status = migrate_fd_get_status;
diff --git a/migration.c b/migration.c
index 9ee8b17..6500714 100644
--- a/migration.c
+++ b/migration.c
@@ -328,6 +328,18 @@ ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size)
return ret;
}
+int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, int size)
+{
+ FdMigrationState *s = opaque;
+ ssize_t ret;
+ ret = s->read(s, data, size);
+
+ if (ret == -1)
+ ret = -(s->get_error(s));
+
+ return ret;
+}
+
void migrate_fd_connect(FdMigrationState *s)
{
int ret;
diff --git a/migration.h b/migration.h
index d13ed4f..f033262 100644
--- a/migration.h
+++ b/migration.h
@@ -47,6 +47,7 @@ struct FdMigrationState
int (*get_error)(struct FdMigrationState*);
int (*close)(struct FdMigrationState*);
int (*write)(struct FdMigrationState*, const void *, size_t);
+ int (*read)(struct FdMigrationState *, const void *, size_t);
void *opaque;
};
@@ -115,6 +116,8 @@ void migrate_fd_put_notify(void *opaque);
ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size);
+int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, int size);
+
void migrate_fd_connect(FdMigrationState *s);
void migrate_fd_put_ready(void *opaque);
--
1.7.1.2
* [Qemu-devel] [PATCH 03/21] Introduce skip_header parameter to qemu_loadvm_state().
2010-11-25 6:06 [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 01/21] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer() Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 02/21] Introduce read() to FdMigrationState Yoshiaki Tamura
@ 2010-11-25 6:06 ` Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 04/21] qemu-char: export socket_set_nodelay() Yoshiaki Tamura
` (18 subsequent siblings)
21 siblings, 0 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-25 6:06 UTC (permalink / raw)
To: kvm, qemu-devel
Cc: aliguori, mtosatti, ananth, ohmura.kei, dlaor, vatsa,
Yoshiaki Tamura, avi, psuriset, stefanha
Introduce skip_header parameter to qemu_loadvm_state() so that it can
be called iteratively without reading the header.
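The resulting call pattern is roughly:

    /* regular incoming migration / loadvm: read and check the header */
    ret = qemu_loadvm_state(f, 0);

    /* FT transaction receiver (patch 08): the same stream is loaded
     * once per transaction, so the header is skipped */
    ret = qemu_loadvm_state(f, 1);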
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
migration.c | 2 +-
savevm.c | 24 +++++++++++++-----------
sysemu.h | 2 +-
3 files changed, 15 insertions(+), 13 deletions(-)
diff --git a/migration.c b/migration.c
index 6500714..8f4ffcb 100644
--- a/migration.c
+++ b/migration.c
@@ -60,7 +60,7 @@ int qemu_start_incoming_migration(const char *uri)
void process_incoming_migration(QEMUFile *f)
{
- if (qemu_loadvm_state(f) < 0) {
+ if (qemu_loadvm_state(f, 0) < 0) {
fprintf(stderr, "load of migration failed\n");
exit(0);
}
diff --git a/savevm.c b/savevm.c
index 6f651b3..8917416 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1704,7 +1704,7 @@ typedef struct LoadStateEntry {
int version_id;
} LoadStateEntry;
-int qemu_loadvm_state(QEMUFile *f)
+int qemu_loadvm_state(QEMUFile *f, int skip_header)
{
QLIST_HEAD(, LoadStateEntry) loadvm_handlers =
QLIST_HEAD_INITIALIZER(loadvm_handlers);
@@ -1713,17 +1713,19 @@ int qemu_loadvm_state(QEMUFile *f)
unsigned int v;
int ret;
- v = qemu_get_be32(f);
- if (v != QEMU_VM_FILE_MAGIC)
- return -EINVAL;
+ if (!skip_header) {
+ v = qemu_get_be32(f);
+ if (v != QEMU_VM_FILE_MAGIC)
+ return -EINVAL;
- v = qemu_get_be32(f);
- if (v == QEMU_VM_FILE_VERSION_COMPAT) {
- fprintf(stderr, "SaveVM v2 format is obsolete and don't work anymore\n");
- return -ENOTSUP;
+ v = qemu_get_be32(f);
+ if (v == QEMU_VM_FILE_VERSION_COMPAT) {
+ fprintf(stderr, "SaveVM v2 format is obsolete and don't work anymore\n");
+ return -ENOTSUP;
+ }
+ if (v != QEMU_VM_FILE_VERSION)
+ return -ENOTSUP;
}
- if (v != QEMU_VM_FILE_VERSION)
- return -ENOTSUP;
while ((section_type = qemu_get_byte(f)) != QEMU_VM_EOF) {
uint32_t instance_id, version_id, section_id;
@@ -2048,7 +2050,7 @@ int load_vmstate(const char *name)
return -EINVAL;
}
- ret = qemu_loadvm_state(f);
+ ret = qemu_loadvm_state(f, 0);
qemu_fclose(f);
if (ret < 0) {
diff --git a/sysemu.h b/sysemu.h
index b81a70e..588548a 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -78,7 +78,7 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
-int qemu_loadvm_state(QEMUFile *f);
+int qemu_loadvm_state(QEMUFile *f, int skip_header);
/* SLIRP */
void do_info_slirp(Monitor *mon);
--
1.7.1.2
* [Qemu-devel] [PATCH 04/21] qemu-char: export socket_set_nodelay().
2010-11-25 6:06 [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Yoshiaki Tamura
` (2 preceding siblings ...)
2010-11-25 6:06 ` [Qemu-devel] [PATCH 03/21] Introduce skip_header parameter to qemu_loadvm_state() Yoshiaki Tamura
@ 2010-11-25 6:06 ` Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 05/21] virtio: modify save/load handler to handle inuse variable Yoshiaki Tamura
` (17 subsequent siblings)
21 siblings, 0 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-25 6:06 UTC (permalink / raw)
To: kvm, qemu-devel
Cc: aliguori, mtosatti, ananth, ohmura.kei, dlaor, vatsa,
Yoshiaki Tamura, avi, psuriset, stefanha
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
qemu-char.c | 2 +-
qemu_socket.h | 1 +
2 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/qemu-char.c b/qemu-char.c
index 88997f9..8ef4760 100644
--- a/qemu-char.c
+++ b/qemu-char.c
@@ -2116,7 +2116,7 @@ static void tcp_chr_telnet_init(int fd)
send(fd, (char *)buf, 3, 0);
}
-static void socket_set_nodelay(int fd)
+void socket_set_nodelay(int fd)
{
int val = 1;
setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, (char *)&val, sizeof(val));
diff --git a/qemu_socket.h b/qemu_socket.h
index 897a8ae..b7f8465 100644
--- a/qemu_socket.h
+++ b/qemu_socket.h
@@ -36,6 +36,7 @@ int inet_aton(const char *cp, struct in_addr *ia);
int qemu_socket(int domain, int type, int protocol);
int qemu_accept(int s, struct sockaddr *addr, socklen_t *addrlen);
void socket_set_nonblock(int fd);
+void socket_set_nodelay(int fd);
int send_all(int fd, const void *buf, int len1);
/* New, ipv6-ready socket helper functions, see qemu-sockets.c */
--
1.7.1.2
* [Qemu-devel] [PATCH 05/21] virtio: modify save/load handler to handle inuse variable.
2010-11-25 6:06 [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Yoshiaki Tamura
` (3 preceding siblings ...)
2010-11-25 6:06 ` [Qemu-devel] [PATCH 04/21] qemu-char: export socket_set_nodelay() Yoshiaki Tamura
@ 2010-11-25 6:06 ` Yoshiaki Tamura
2010-11-28 9:28 ` [Qemu-devel] " Michael S. Tsirkin
2010-11-25 6:06 ` [Qemu-devel] [PATCH 06/21] vl: add a tmp pointer so that a handler can delete the entry to which it belongs Yoshiaki Tamura
` (16 subsequent siblings)
21 siblings, 1 reply; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-25 6:06 UTC (permalink / raw)
To: kvm, qemu-devel
Cc: aliguori, mtosatti, ananth, ohmura.kei, dlaor, vatsa,
Yoshiaki Tamura, avi, psuriset, stefanha
Modify the inuse type to uint16_t, let save/load handle it, and rewind
last_avail_idx by inuse at load time if there is outstanding emulation.
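As a concrete example: if last_avail_idx is 10 and inuse is 2 at save
time, two requests have been popped from the avail ring but not yet
completed. After load, last_avail_idx is rewound to 8, so the
destination pops and emulates those two requests again instead of
losing them.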
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
hw/virtio.c | 8 +++++++-
1 files changed, 7 insertions(+), 1 deletions(-)
diff --git a/hw/virtio.c b/hw/virtio.c
index 849a60f..5509644 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -72,7 +72,7 @@ struct VirtQueue
VRing vring;
target_phys_addr_t pa;
uint16_t last_avail_idx;
- int inuse;
+ uint16_t inuse;
uint16_t vector;
void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
VirtIODevice *vdev;
@@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
qemu_put_be32(f, vdev->vq[i].vring.num);
qemu_put_be64(f, vdev->vq[i].pa);
qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
+ qemu_put_be16s(f, &vdev->vq[i].inuse);
if (vdev->binding->save_queue)
vdev->binding->save_queue(vdev->binding_opaque, i, f);
}
@@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
vdev->vq[i].vring.num = qemu_get_be32(f);
vdev->vq[i].pa = qemu_get_be64(f);
qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
+ qemu_get_be16s(f, &vdev->vq[i].inuse);
+
+ /* revert last_avail_idx if there are outstanding emulation. */
+ vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
+ vdev->vq[i].inuse = 0;
if (vdev->vq[i].pa) {
virtqueue_init(&vdev->vq[i]);
--
1.7.1.2
* [Qemu-devel] [PATCH 06/21] vl: add a tmp pointer so that a handler can delete the entry to which it belongs.
2010-11-25 6:06 [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Yoshiaki Tamura
` (4 preceding siblings ...)
2010-11-25 6:06 ` [Qemu-devel] [PATCH 05/21] virtio: modify save/load handler to handle inuse variable Yoshiaki Tamura
@ 2010-11-25 6:06 ` Yoshiaki Tamura
2010-12-08 7:03 ` Isaku Yamahata
2010-11-25 6:06 ` [Qemu-devel] [PATCH 07/21] Introduce fault tolerant VM transaction QEMUFile and ft_mode Yoshiaki Tamura
` (15 subsequent siblings)
21 siblings, 1 reply; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-25 6:06 UTC (permalink / raw)
To: kvm, qemu-devel
Cc: aliguori, mtosatti, ananth, ohmura.kei, dlaor, vatsa,
Yoshiaki Tamura, avi, psuriset, stefanha
By copying the next entry to a tmp pointer,
qemu_del_vm_change_state_handler() can be called from within a handler.
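For illustration, a (hypothetical) handler that removes its own entry,
which this change makes safe; patch 08 relies on the same pattern when
the ft_trans socket is torn down. MyState and one_shot_resume are
illustrative names only:

    typedef struct MyState {
        VMChangeStateEntry *entry;  /* from qemu_add_vm_change_state_handler() */
    } MyState;

    static void one_shot_resume(void *opaque, int running, int reason)
    {
        MyState *st = opaque;

        if (!running) {
            return;
        }
        /* deleting the entry frees it; the tmp pointer in vm_state_notify()
         * keeps the list walk valid after this returns */
        qemu_del_vm_change_state_handler(st->entry);
    }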
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
vl.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/vl.c b/vl.c
index 805e11f..6b6aec0 100644
--- a/vl.c
+++ b/vl.c
@@ -1073,11 +1073,12 @@ void qemu_del_vm_change_state_handler(VMChangeStateEntry *e)
void vm_state_notify(int running, int reason)
{
- VMChangeStateEntry *e;
+ VMChangeStateEntry *e, *ne;
trace_vm_state_notify(running, reason);
- for (e = vm_change_state_head.lh_first; e; e = e->entries.le_next) {
+ for (e = vm_change_state_head.lh_first; e; e = ne) {
+ ne = e->entries.le_next;
e->cb(e->opaque, running, reason);
}
}
--
1.7.1.2
* [Qemu-devel] [PATCH 07/21] Introduce fault tolerant VM transaction QEMUFile and ft_mode.
2010-11-25 6:06 [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Yoshiaki Tamura
` (5 preceding siblings ...)
2010-11-25 6:06 ` [Qemu-devel] [PATCH 06/21] vl: add a tmp pointer so that a handler can delete the entry to which it belongs Yoshiaki Tamura
@ 2010-11-25 6:06 ` Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 08/21] savevm: introduce util functions to control ft_trans_file from savevm layer Yoshiaki Tamura
` (14 subsequent siblings)
21 siblings, 0 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-25 6:06 UTC (permalink / raw)
To: kvm, qemu-devel
Cc: aliguori, mtosatti, ananth, ohmura.kei, dlaor, vatsa,
Yoshiaki Tamura, avi, psuriset, stefanha
This code implements the VM transaction protocol. Like buffered_file,
it sits between the savevm and migration layers. With this
architecture, the VM transaction protocol is implemented mostly
independently of other existing code.
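In short, a single transaction on the wire looks roughly like this
(read from the state handling below; the arrows only illustrate
direction):

    sender                              receiver
      |  <------------- ACK -------------  |  (receiver is ready)
      |  ------------- BEGIN ------------>  |
      |  ------ CONTINUE + payload ------>  |  (repeated per chunk)
      |  ------------- COMMIT ----------->  |
      |  <------------- ACK -------------  |  (commit acknowledged and
      |                                        the VM state is loaded)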
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
---
Makefile.objs | 1 +
ft_trans_file.c | 603 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
ft_trans_file.h | 72 +++++++
migration.c | 3 +
trace-events | 15 ++
5 files changed, 694 insertions(+), 0 deletions(-)
create mode 100644 ft_trans_file.c
create mode 100644 ft_trans_file.h
diff --git a/Makefile.objs b/Makefile.objs
index 23b17ce..d42f2d1 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -92,6 +92,7 @@ common-obj-y += msmouse.o ps2.o
common-obj-y += qdev.o qdev-properties.o
common-obj-y += block-migration.o
common-obj-y += pflib.o
+common-obj-y += ft_trans_file.o
common-obj-$(CONFIG_BRLAPI) += baum.o
common-obj-$(CONFIG_POSIX) += migration-exec.o migration-unix.o migration-fd.o
diff --git a/ft_trans_file.c b/ft_trans_file.c
new file mode 100644
index 0000000..4e33034
--- /dev/null
+++ b/ft_trans_file.c
@@ -0,0 +1,603 @@
+/*
+ * Fault tolerant VM transaction QEMUFile
+ *
+ * Copyright (c) 2010 Nippon Telegraph and Telephone Corporation.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ *
+ * This source code is based on buffered_file.c.
+ * Copyright IBM, Corp. 2008
+ * Authors:
+ * Anthony Liguori <aliguori@us.ibm.com>
+ */
+
+#include "qemu-common.h"
+#include "qemu-error.h"
+#include "hw/hw.h"
+#include "qemu-timer.h"
+#include "sysemu.h"
+#include "qemu-char.h"
+#include "trace.h"
+#include "ft_trans_file.h"
+
+// #define DEBUG_FT_TRANSACTION
+
+typedef struct FtTransHdr
+{
+ uint16_t cmd;
+ uint16_t id;
+ uint32_t seq;
+ uint32_t payload_len;
+} FtTransHdr;
+
+typedef struct QEMUFileFtTrans
+{
+ FtTransPutBufferFunc *put_buffer;
+ FtTransGetBufferFunc *get_buffer;
+ FtTransPutReadyFunc *put_ready;
+ FtTransGetReadyFunc *get_ready;
+ FtTransWaitForUnfreezeFunc *wait_for_unfreeze;
+ FtTransCloseFunc *close;
+ void *opaque;
+ QEMUFile *file;
+
+ enum QEMU_VM_TRANSACTION_STATE state;
+ uint32_t seq;
+ uint16_t id;
+
+ int has_error;
+
+ bool freeze_output;
+ bool freeze_input;
+ bool is_sender;
+ bool is_payload;
+
+ uint8_t *buf;
+ size_t buf_max_size;
+ size_t put_offset;
+ size_t get_offset;
+
+ FtTransHdr header;
+ size_t header_offset;
+} QEMUFileFtTrans;
+
+#define IO_BUF_SIZE 32768
+
+static void ft_trans_append(QEMUFileFtTrans *s,
+ const uint8_t *buf, size_t size)
+{
+ if (size > (s->buf_max_size - s->put_offset)) {
+ trace_ft_trans_realloc(s->buf_max_size, size + 1024);
+ s->buf_max_size += size + 1024;
+ s->buf = qemu_realloc(s->buf, s->buf_max_size);
+ }
+
+ trace_ft_trans_append(size);
+ memcpy(s->buf + s->put_offset, buf, size);
+ s->put_offset += size;
+}
+
+static void ft_trans_flush(QEMUFileFtTrans *s)
+{
+ size_t offset = 0;
+
+ if (s->has_error) {
+ error_report("flush when error %d, bailing\n", s->has_error);
+ return;
+ }
+
+ while (offset < s->put_offset) {
+ ssize_t ret;
+
+ ret = s->put_buffer(s->opaque, s->buf + offset, s->put_offset - offset);
+ if (ret == -EAGAIN) {
+ break;
+ }
+
+ if (ret <= 0) {
+ error_report("error flushing data, %s\n", strerror(errno));
+ s->has_error = FT_TRANS_ERR_FLUSH;
+ break;
+ } else {
+ offset += ret;
+ }
+ }
+
+ trace_ft_trans_flush(offset, s->put_offset);
+ memmove(s->buf, s->buf + offset, s->put_offset - offset);
+ s->put_offset -= offset;
+ s->freeze_output = !!s->put_offset;
+}
+
+static ssize_t ft_trans_put(void *opaque, void *buf, int size)
+{
+ QEMUFileFtTrans *s = opaque;
+ size_t offset = 0;
+ ssize_t len;
+
+ /* flush buffered data before putting next */
+ if (!s->freeze_output && s->put_offset) {
+ ft_trans_flush(s);
+ }
+
+ while (!s->freeze_output && offset < size) {
+ len = s->put_buffer(s->opaque, (uint8_t *)buf + offset, size - offset);
+
+ if (len == -EAGAIN) {
+ trace_ft_trans_freeze_output();
+ s->freeze_output = 1;
+ break;
+ }
+
+ if (len <= 0) {
+ error_report("putting data failed, %s\n", strerror(errno));
+ s->has_error = 1;
+ offset = -EINVAL;
+ break;
+ }
+
+ offset += len;
+ }
+
+ if (s->freeze_output) {
+ ft_trans_append(s, buf + offset, size - offset);
+ offset = size;
+ }
+
+ return offset;
+}
+
+static int ft_trans_send_header(QEMUFileFtTrans *s,
+ enum QEMU_VM_TRANSACTION_STATE state,
+ uint32_t payload_len)
+{
+ int ret;
+ FtTransHdr *hdr = &s->header;
+
+ trace_ft_trans_send_header(state);
+
+ hdr->cmd = s->state = state;
+ hdr->id = s->id;
+ hdr->seq = s->seq;
+ hdr->payload_len = payload_len;
+
+ ret = ft_trans_put(s, hdr, sizeof(*hdr));
+ if (ret < 0) {
+ error_report("send header failed\n");
+ s->has_error = FT_TRANS_ERR_SEND_HDR;
+ }
+
+ return ret;
+}
+
+static int ft_trans_put_buffer(void *opaque, const uint8_t *buf, int64_t pos, int size)
+{
+ QEMUFileFtTrans *s = opaque;
+ ssize_t ret;
+
+ trace_ft_trans_put_buffer(size, pos);
+
+ if (s->has_error) {
+ error_report("put_buffer when error %d, bailing\n", s->has_error);
+ return -EINVAL;
+ }
+
+ /* assuming qemu_file_put_notify() is calling */
+ if (pos == 0 && size == 0) {
+ trace_ft_trans_put_ready();
+ ft_trans_flush(s);
+
+ if (!s->freeze_output) {
+ trace_ft_trans_cb(s->put_ready);
+ ret = s->put_ready();
+ }
+
+ goto out;
+ }
+
+ ret = ft_trans_send_header(s, QEMU_VM_TRANSACTION_CONTINUE, size);
+ if (ret < 0) {
+ goto out;
+ }
+
+ ret = ft_trans_put(s, (uint8_t *)buf, size);
+ if (ret < 0) {
+ error_report("send palyload failed\n");
+ s->has_error = FT_TRANS_ERR_SEND_PAYLOAD;
+ goto out;
+ }
+
+ s->seq++;
+
+out:
+ return ret;
+}
+
+static int ft_trans_fill_buffer(void *opaque, void *buf, int size)
+{
+ QEMUFileFtTrans *s = opaque;
+ size_t offset = 0;
+ ssize_t len;
+
+ while (!s->freeze_input && offset < size) {
+ len = s->get_buffer(s->opaque, (uint8_t *)buf + offset,
+ 0, size - offset);
+ if (len == -EAGAIN) {
+ trace_ft_trans_freeze_input();
+ s->freeze_input = 1;
+ break;
+ }
+
+ if (len <= 0) {
+ error_report("fill buffer failed, %s\n", strerror(errno));
+ s->has_error = 1;
+ return -EINVAL;
+ }
+
+ offset += len;
+ }
+
+ return offset;
+}
+
+static int ft_trans_recv_header(QEMUFileFtTrans *s)
+{
+ int ret;
+ char *buf = (char *)&s->header + s->header_offset;
+
+ ret = ft_trans_fill_buffer(s, buf, sizeof(FtTransHdr) - s->header_offset);
+ if (ret < 0) {
+ error_report("recv header failed\n");
+ s->has_error = FT_TRANS_ERR_RECV_HDR;
+ goto out;
+ }
+
+ s->header_offset += ret;
+ if (s->header_offset == sizeof(FtTransHdr)) {
+ trace_ft_trans_recv_header(s->header.cmd);
+ s->state = s->header.cmd;
+ s->header_offset = 0;
+
+ if (!s->is_sender) {
+ s->id = s->header.id;
+ s->seq = s->header.seq;
+ }
+ }
+
+out:
+ return ret;
+}
+
+static int ft_trans_recv_payload(QEMUFileFtTrans *s)
+{
+ QEMUFile *f = s->file;
+ int ret = -1;
+
+ /* extend QEMUFile buf if there weren't enough space */
+ if (s->header.payload_len > (s->buf_max_size - s->get_offset)) {
+ s->buf_max_size += (s->header.payload_len -
+ (s->buf_max_size - s->get_offset));
+ s->buf = qemu_realloc_buffer(f, s->buf_max_size);
+ }
+
+ ret = ft_trans_fill_buffer(s, s->buf + s->get_offset,
+ s->header.payload_len);
+ if (ret < 0) {
+ error_report("recv payload failed\n");
+ s->has_error = FT_TRANS_ERR_RECV_PAYLOAD;
+ goto out;
+ }
+
+ trace_ft_trans_recv_payload(ret, s->header.payload_len, s->get_offset);
+
+ s->header.payload_len -= ret;
+ s->get_offset += ret;
+ s->is_payload = !!s->header.payload_len;
+
+out:
+ return ret;
+}
+
+static int ft_trans_recv(QEMUFileFtTrans *s)
+{
+ int ret;
+
+ /* get payload and return */
+ if (s->is_payload) {
+ ret = ft_trans_recv_payload(s);
+ goto out;
+ }
+
+ ret = ft_trans_recv_header(s);
+ if (ret < 0 || s->freeze_input) {
+ goto out;
+ }
+
+ switch (s->state) {
+ case QEMU_VM_TRANSACTION_BEGIN:
+ /* CONTINUE or COMMIT should come shortly */
+ s->is_payload = 0;
+ break;
+
+ case QEMU_VM_TRANSACTION_CONTINUE:
+ /* get payload */
+ s->is_payload = 1;
+ break;
+
+ case QEMU_VM_TRANSACTION_COMMIT:
+ ret = ft_trans_send_header(s, QEMU_VM_TRANSACTION_ACK, 0);
+ if (ret < 0) {
+ goto out;
+ }
+
+ trace_ft_trans_cb(s->get_ready);
+ if ((ret = s->get_ready(s->opaque)) < 0) {
+ goto out;
+ }
+
+ s->get_offset = 0;
+ s->is_payload = 0;
+
+ break;
+
+ case QEMU_VM_TRANSACTION_ATOMIC:
+ /* not implemented yet */
+ error_report("QEMU_VM_TRANSACTION_ATOMIC not implemented. %d\n",
+ ret);
+ break;
+
+ case QEMU_VM_TRANSACTION_CANCEL:
+ /* return -EINVAL until migrate cancel on recevier side is supported */
+ ret = -EINVAL;
+ break;
+
+ default:
+ error_report("unknown QEMU_VM_TRANSACTION_STATE %d\n", ret);
+ s->has_error = FT_TRANS_ERR_STATE_INVALID;
+ ret = -EINVAL;
+ }
+
+out:
+ return ret;
+}
+
+static int ft_trans_get_buffer(void *opaque, uint8_t *buf,
+ int64_t pos, int size)
+{
+ QEMUFileFtTrans *s = opaque;
+ int ret;
+
+ if (s->has_error) {
+ error_report("get_buffer when error %d, bailing\n", s->has_error);
+ return -EINVAL;
+ }
+
+ /* assuming qemu_file_get_notify() is calling */
+ if (pos == 0 && size == 0) {
+ trace_ft_trans_get_ready();
+ s->freeze_input = 0;
+
+ /* sender should be waiting for ACK */
+ if (s->is_sender) {
+ ret = ft_trans_recv_header(s);
+ if (s->freeze_input) {
+ ret = 0;
+ goto out;
+ }
+ if (ret < 0) {
+ error_report("recv ack failed\n");
+ goto out;
+ }
+
+ if (s->state != QEMU_VM_TRANSACTION_ACK) {
+ error_report("recv invalid state %d\n", s->state);
+ s->has_error = FT_TRANS_ERR_STATE_INVALID;
+ ret = -EINVAL;
+ goto out;
+ }
+
+ trace_ft_trans_cb(s->get_ready);
+ ret = s->get_ready(s->opaque);
+ if (ret < 0) {
+ goto out;
+ }
+
+ /* proceed trans id */
+ s->id++;
+
+ return 0;
+ }
+
+ /* set QEMUFile buf at beginning */
+ if (!s->buf) {
+ s->buf = buf;
+ }
+
+ ret = ft_trans_recv(s);
+ goto out;
+ }
+
+ ret = s->get_offset;
+
+out:
+ return ret;
+}
+
+static int ft_trans_close(void *opaque)
+{
+ QEMUFileFtTrans *s = opaque;
+ int ret;
+
+ trace_ft_trans_close();
+ ret = s->close(s->opaque);
+ if (s->is_sender) {
+ qemu_free(s->buf);
+ }
+ qemu_free(s);
+
+ return ret;
+}
+
+static int ft_trans_rate_limit(void *opaque)
+{
+ QEMUFileFtTrans *s = opaque;
+
+ if (s->has_error) {
+ return 0;
+ }
+
+ if (s->freeze_output) {
+ return 1;
+ }
+
+ return 0;
+}
+
+int ft_trans_begin(void *opaque)
+{
+ QEMUFileFtTrans *s = opaque;
+ int ret;
+ s->seq = 0;
+
+ /* receiver sends QEMU_VM_TRANSACTION_ACK to start transaction */
+ if (!s->is_sender) {
+ if (s->state != QEMU_VM_TRANSACTION_INIT) {
+ error_report("invalid state %d\n", s->state);
+ s->has_error = FT_TRANS_ERR_STATE_INVALID;
+ ret = -EINVAL;
+ }
+
+ ret = ft_trans_send_header(s, QEMU_VM_TRANSACTION_ACK, 0);
+ goto out;
+ }
+
+ /* sender waits for QEMU_VM_TRANSACTION_ACK to start transaction */
+ if (s->state == QEMU_VM_TRANSACTION_INIT) {
+ retry:
+ ret = ft_trans_recv_header(s);
+ if (ret < 0) {
+ if (!s->freeze_input) {
+ error_report("recv ack failed\n");
+ goto out;
+ }
+ goto retry;
+ }
+
+ if (s->state != QEMU_VM_TRANSACTION_ACK) {
+ error_report("recv invalid state %d\n", s->state);
+ s->has_error = FT_TRANS_ERR_STATE_INVALID;
+ ret = -EINVAL;
+ goto out;
+ }
+ }
+
+ ret = ft_trans_send_header(s, QEMU_VM_TRANSACTION_BEGIN, 0);
+ if (ret < 0) {
+ goto out;
+ }
+
+ s->state = QEMU_VM_TRANSACTION_CONTINUE;
+
+out:
+ return ret;
+}
+
+int ft_trans_commit(void *opaque)
+{
+ QEMUFileFtTrans *s = opaque;
+ int ret;
+
+ if (!s->is_sender) {
+ ret = ft_trans_send_header(s, QEMU_VM_TRANSACTION_ACK, 0);
+ goto out;
+ }
+
+ /* sender should flush buf before sending COMMIT */
+ qemu_fflush(s->file);
+
+ ret = ft_trans_send_header(s, QEMU_VM_TRANSACTION_COMMIT, 0);
+ if (ret < 0) {
+ goto out;
+ }
+
+ while (!s->has_error && s->put_offset) {
+ ft_trans_flush(s);
+ if (s->freeze_output) {
+ s->wait_for_unfreeze(s);
+ }
+ }
+
+ if (s->has_error) {
+ ret = -EINVAL;
+ goto out;
+ }
+
+ ret = ft_trans_recv_header(s);
+ if (s->freeze_input) {
+ ret = -EAGAIN;
+ goto out;
+ }
+ if (ret < 0) {
+ error_report("recv ack failed\n");
+ goto out;
+ }
+
+ if (s->state != QEMU_VM_TRANSACTION_ACK) {
+ error_report("recv invalid state %d\n", s->state);
+ s->has_error = FT_TRANS_ERR_STATE_INVALID;
+ ret = -EINVAL;
+ goto out;
+ }
+
+ s->id++;
+
+out:
+ return ret;
+}
+
+int ft_trans_cancel(void *opaque)
+{
+ QEMUFileFtTrans *s = opaque;
+
+ /* invalid until migrate cancel on recevier side is supported */
+ if (!s->is_sender) {
+ return -EINVAL;
+ }
+
+ return ft_trans_send_header(s, QEMU_VM_TRANSACTION_CANCEL, 0);
+}
+
+QEMUFile *qemu_fopen_ops_ft_trans(void *opaque,
+ FtTransPutBufferFunc *put_buffer,
+ FtTransGetBufferFunc *get_buffer,
+ FtTransPutReadyFunc *put_ready,
+ FtTransGetReadyFunc *get_ready,
+ FtTransWaitForUnfreezeFunc *wait_for_unfreeze,
+ FtTransCloseFunc *close,
+ bool is_sender)
+{
+ QEMUFileFtTrans *s;
+
+ s = qemu_mallocz(sizeof(*s));
+
+ s->opaque = opaque;
+ s->put_buffer = put_buffer;
+ s->get_buffer = get_buffer;
+ s->put_ready = put_ready;
+ s->get_ready = get_ready;
+ s->wait_for_unfreeze = wait_for_unfreeze;
+ s->close = close;
+ s->is_sender = is_sender;
+ s->id = 0;
+ s->seq = 0;
+
+ if (!s->is_sender) {
+ s->buf_max_size = IO_BUF_SIZE;
+ }
+
+ s->file = qemu_fopen_ops(s, ft_trans_put_buffer, ft_trans_get_buffer,
+ ft_trans_close, ft_trans_rate_limit, NULL, NULL);
+
+ return s->file;
+}
diff --git a/ft_trans_file.h b/ft_trans_file.h
new file mode 100644
index 0000000..d7e221b
--- /dev/null
+++ b/ft_trans_file.h
@@ -0,0 +1,72 @@
+/*
+ * Fault tolerant VM transaction QEMUFile
+ *
+ * Copyright (c) 2010 Nippon Telegraph and Telephone Corporation.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ *
+ * This source code is based on buffered_file.h.
+ * Copyright IBM, Corp. 2008
+ * Authors:
+ * Anthony Liguori <aliguori@us.ibm.com>
+ */
+
+#ifndef QEMU_FT_TRANSACTION_FILE_H
+#define QEMU_FT_TRANSACTION_FILE_H
+
+#include "hw/hw.h"
+
+enum QEMU_VM_TRANSACTION_STATE {
+ QEMU_VM_TRANSACTION_NACK = -1,
+ QEMU_VM_TRANSACTION_INIT,
+ QEMU_VM_TRANSACTION_BEGIN,
+ QEMU_VM_TRANSACTION_CONTINUE,
+ QEMU_VM_TRANSACTION_COMMIT,
+ QEMU_VM_TRANSACTION_CANCEL,
+ QEMU_VM_TRANSACTION_ATOMIC,
+ QEMU_VM_TRANSACTION_ACK,
+};
+
+enum FT_MODE {
+ FT_ERROR = -1,
+ FT_OFF,
+ FT_INIT,
+ FT_TRANSACTION_BEGIN,
+ FT_TRANSACTION_ITER,
+ FT_TRANSACTION_COMMIT,
+ FT_TRANSACTION_ATOMIC,
+ FT_TRANSACTION_RECV,
+};
+extern enum FT_MODE ft_mode;
+
+#define FT_TRANS_ERR_UNKNOWN 0x01 /* Unknown error */
+#define FT_TRANS_ERR_SEND_HDR 0x02 /* Send header failed */
+#define FT_TRANS_ERR_RECV_HDR 0x03 /* Recv header failed */
+#define FT_TRANS_ERR_SEND_PAYLOAD 0x04 /* Send payload failed */
+#define FT_TRANS_ERR_RECV_PAYLOAD 0x05 /* Recv payload failed */
+#define FT_TRANS_ERR_FLUSH 0x06 /* Flush buffered data failed */
+#define FT_TRANS_ERR_STATE_INVALID 0x07 /* Invalid state */
+
+typedef ssize_t (FtTransPutBufferFunc)(void *opaque, const void *data, size_t size);
+typedef int (FtTransGetBufferFunc)(void *opaque, uint8_t *buf, int64_t pos, size_t size);
+typedef ssize_t (FtTransPutVectorFunc)(void *opaque, const struct iovec *iov, int iovcnt);
+typedef int (FtTransPutReadyFunc)(void);
+typedef int (FtTransGetReadyFunc)(void *opaque);
+typedef void (FtTransWaitForUnfreezeFunc)(void *opaque);
+typedef int (FtTransCloseFunc)(void *opaque);
+
+int ft_trans_begin(void *opaque);
+int ft_trans_commit(void *opaque);
+int ft_trans_cancel(void *opaque);
+
+QEMUFile *qemu_fopen_ops_ft_trans(void *opaque,
+ FtTransPutBufferFunc *put_buffer,
+ FtTransGetBufferFunc *get_buffer,
+ FtTransPutReadyFunc *put_ready,
+ FtTransGetReadyFunc *get_ready,
+ FtTransWaitForUnfreezeFunc *wait_for_unfreeze,
+ FtTransCloseFunc *close,
+ bool is_sender);
+
+#endif
diff --git a/migration.c b/migration.c
index 8f4ffcb..40e4945 100644
--- a/migration.c
+++ b/migration.c
@@ -15,6 +15,7 @@
#include "migration.h"
#include "monitor.h"
#include "buffered_file.h"
+#include "ft_trans_file.h"
#include "sysemu.h"
#include "block.h"
#include "qemu_socket.h"
@@ -31,6 +32,8 @@
do { } while (0)
#endif
+enum FT_MODE ft_mode = FT_OFF;
+
/* Migration speed throttling */
static uint32_t max_throttle = (32 << 20);
diff --git a/trace-events b/trace-events
index da03d4b..a294bad 100644
--- a/trace-events
+++ b/trace-events
@@ -192,3 +192,18 @@ disable sun4m_iommu_bad_addr(uint64_t addr) "bad addr %"PRIx64""
# vl.c
disable vm_state_notify(int running, int reason) "running %d reason %d"
+
+# ft_trans_file.c
+disable ft_trans_realloc(size_t old_size, size_t new_size) "increasing buffer from %zu by %zu"
+disable ft_trans_append(size_t size) "buffering %zu bytes"
+disable ft_trans_flush(size_t size, size_t req) "flushed %zu of %zu bytes"
+disable ft_trans_send_header(uint16_t cmd) "send header %d"
+disable ft_trans_recv_header(uint16_t cmd) "recv header %d"
+disable ft_trans_put_buffer(size_t size, int64_t pos) "putting %d bytes at %"PRId64""
+disable ft_trans_recv_payload(size_t len, uint32_t hdr, size_t total) "recv %d of %d total %d"
+disable ft_trans_close(void) "closing"
+disable ft_trans_freeze_output(void) "backend not ready, freezing output"
+disable ft_trans_freeze_input(void) "backend not ready, freezing input"
+disable ft_trans_put_ready(void) "file is ready to put"
+disable ft_trans_get_ready(void) "file is ready to get"
+disable ft_trans_cb(void *cb) "callback %p"
--
1.7.1.2
* [Qemu-devel] [PATCH 08/21] savevm: introduce util functions to control ft_trans_file from savevm layer.
2010-11-25 6:06 [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Yoshiaki Tamura
` (6 preceding siblings ...)
2010-11-25 6:06 ` [Qemu-devel] [PATCH 07/21] Introduce fault tolerant VM transaction QEMUFile and ft_mode Yoshiaki Tamura
@ 2010-11-25 6:06 ` Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 09/21] Introduce event-tap Yoshiaki Tamura
` (13 subsequent siblings)
21 siblings, 0 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-25 6:06 UTC (permalink / raw)
To: kvm, qemu-devel
Cc: aliguori, mtosatti, ananth, ohmura.kei, dlaor, vatsa,
Yoshiaki Tamura, avi, psuriset, stefanha
To utilize the ft_trans_file functions, savevm needs to export several
helper interfaces.
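A rough sketch of how the sender side is expected to drive these
helpers (the actual callers appear in the later migration patches of
this series; the wrapper name and the elided savevm step are
illustrative only):

    static int ft_trans_step(QEMUFile *f)
    {
        int ret;

        if (qemu_ft_trans_begin(f) < 0) {
            return -1;
        }

        /* queue the dirty RAM/device state into f via the savevm layer
         * (elided; see the qemu_savevm_trans_* patches later) */

        ret = qemu_ft_trans_commit(f);
        if (ret == 1) {
            return 0;   /* the commit hit -EAGAIN and will be retried */
        }
        return ret < 0 ? -1 : 0;
    }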
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
hw/hw.h | 5 ++
savevm.c | 165 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 170 insertions(+), 0 deletions(-)
diff --git a/hw/hw.h b/hw/hw.h
index b67f504..e9f71bc 100644
--- a/hw/hw.h
+++ b/hw/hw.h
@@ -51,6 +51,7 @@ QEMUFile *qemu_fopen_ops(void *opaque, QEMUFilePutBufferFunc *put_buffer,
QEMUFile *qemu_fopen(const char *filename, const char *mode);
QEMUFile *qemu_fdopen(int fd, const char *mode);
QEMUFile *qemu_fopen_socket(int fd);
+QEMUFile *qemu_fopen_ft_trans(int s_fd, int c_fd);
QEMUFile *qemu_popen(FILE *popen_file, const char *mode);
QEMUFile *qemu_popen_cmd(const char *command, const char *mode);
int qemu_stdio_fd(QEMUFile *f);
@@ -60,6 +61,9 @@ void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size);
void qemu_put_byte(QEMUFile *f, int v);
void *qemu_realloc_buffer(QEMUFile *f, int size);
void qemu_clear_buffer(QEMUFile *f);
+int qemu_ft_trans_begin(QEMUFile *f);
+int qemu_ft_trans_commit(QEMUFile *f);
+int qemu_ft_trans_cancel(QEMUFile *f);
static inline void qemu_put_ubyte(QEMUFile *f, unsigned int v)
{
@@ -94,6 +98,7 @@ void qemu_file_set_error(QEMUFile *f);
* halted due to rate limiting or EAGAIN errors occur as it can be used to
* resume output. */
void qemu_file_put_notify(QEMUFile *f);
+void qemu_file_get_notify(void *opaque);
static inline void qemu_put_be64s(QEMUFile *f, const uint64_t *pv)
{
diff --git a/savevm.c b/savevm.c
index 8917416..afd4046 100644
--- a/savevm.c
+++ b/savevm.c
@@ -83,6 +83,7 @@
#include "migration.h"
#include "qemu_socket.h"
#include "qemu-queue.h"
+#include "ft_trans_file.h"
#define SELF_ANNOUNCE_ROUNDS 5
@@ -190,6 +191,13 @@ typedef struct QEMUFileSocket
QEMUFile *file;
} QEMUFileSocket;
+typedef struct QEMUFileSocketTrans
+{
+ int fd;
+ QEMUFileSocket *s;
+ VMChangeStateEntry *e;
+} QEMUFileSocketTrans;
+
static int socket_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
{
QEMUFileSocket *s = opaque;
@@ -205,6 +213,21 @@ static int socket_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
return len;
}
+static ssize_t socket_put_buffer(void *opaque, const void *buf, size_t size)
+{
+ QEMUFileSocket *s = opaque;
+ ssize_t len;
+
+ do {
+ len = send(s->fd, (void *)buf, size, 0);
+ } while (len == -1 && socket_error() == EINTR);
+
+ if (len == -1)
+ len = -socket_error();
+
+ return len;
+}
+
static int socket_close(void *opaque)
{
QEMUFileSocket *s = opaque;
@@ -212,6 +235,87 @@ static int socket_close(void *opaque)
return 0;
}
+static int socket_trans_get_buffer(void *opaque, uint8_t *buf, int64_t pos, size_t size)
+{
+ QEMUFileSocketTrans *t = opaque;
+ QEMUFileSocket *s = t->s;
+ ssize_t len;
+
+ len = socket_get_buffer(s, buf, pos, size);
+
+ return len;
+}
+
+static ssize_t socket_trans_put_buffer(void *opaque, const void *buf, size_t size)
+{
+ QEMUFileSocketTrans *t = opaque;
+
+ return socket_put_buffer(t->s, buf, size);
+}
+
+
+static int socket_trans_get_ready(void *opaque)
+{
+ QEMUFileSocketTrans *t = opaque;
+ QEMUFileSocket *s = t->s;
+ QEMUFile *f = s->file;
+ int ret = 0;
+
+ ret = qemu_loadvm_state(f, 1);
+ if (ret < 0) {
+ fprintf(stderr,
+ "socket_trans_get_ready: error while loading vmstate\n");
+ ft_mode = FT_ERROR;
+ goto out;
+ }
+
+ if (ft_mode == FT_OFF) {
+ /* if migrate_cancel was called at the sender */
+ goto out;
+ }
+
+ if (ft_mode == FT_ERROR) {
+ qemu_announce_self();
+ goto out;
+ }
+
+ return 0;
+
+out:
+ qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
+ qemu_fclose(f);
+ return -1;
+}
+
+static int socket_trans_close(void *opaque)
+{
+ QEMUFileSocketTrans *t = opaque;
+ QEMUFileSocket *s = t->s;
+
+ qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
+ qemu_set_fd_handler2(t->fd, NULL, NULL, NULL, NULL);
+ qemu_del_vm_change_state_handler(t->e);
+ close(s->fd);
+ close(t->fd);
+ qemu_free(s);
+ qemu_free(t);
+
+ return 0;
+}
+
+static void socket_trans_resume(void *opaque, int running, int reason)
+{
+ QEMUFileSocketTrans *t = opaque;
+ QEMUFileSocket *s = t->s;
+
+ if (!running) {
+ return;
+ }
+
+ qemu_announce_self();
+ qemu_fclose(s->file);
+}
+
static int stdio_put_buffer(void *opaque, const uint8_t *buf, int64_t pos, int size)
{
QEMUFileStdio *s = opaque;
@@ -334,6 +438,26 @@ QEMUFile *qemu_fopen_socket(int fd)
return s->file;
}
+QEMUFile *qemu_fopen_ft_trans(int s_fd, int c_fd)
+{
+ QEMUFileSocketTrans *t = qemu_mallocz(sizeof(QEMUFileSocketTrans));
+ QEMUFileSocket *s = qemu_mallocz(sizeof(QEMUFileSocket));
+
+ t->s = s;
+ t->fd = s_fd;
+ t->e = qemu_add_vm_change_state_handler(socket_trans_resume, t);
+
+ s->fd = c_fd;
+ s->file = qemu_fopen_ops_ft_trans(t, socket_trans_put_buffer,
+ socket_trans_get_buffer, NULL,
+ socket_trans_get_ready,
+ migrate_fd_wait_for_unfreeze,
+ socket_trans_close, 0);
+ socket_set_nonblock(s->fd);
+
+ return s->file;
+}
+
static int file_put_buffer(void *opaque, const uint8_t *buf,
int64_t pos, int size)
{
@@ -471,6 +595,39 @@ void qemu_clear_buffer(QEMUFile *f)
memset(f->buf, 0, f->buf_max_size);
}
+int qemu_ft_trans_begin(QEMUFile *f)
+{
+ int ret;
+ ret= ft_trans_begin(f->opaque);
+ if (ret < 0) {
+ f->has_error = 1;
+ }
+ return ret;
+}
+
+int qemu_ft_trans_commit(QEMUFile *f)
+{
+ int ret;
+ ret = ft_trans_commit(f->opaque);
+ if (ret == -EAGAIN) {
+ return 1;
+ }
+ if (ret < 0) {
+ f->has_error = 1;
+ }
+ return ret;
+}
+
+int qemu_ft_trans_cancel(QEMUFile *f)
+{
+ int ret;
+ ret = ft_trans_cancel(f->opaque);
+ if (ret < 0) {
+ f->has_error = 1;
+ }
+ return ret;
+}
+
static void qemu_fill_buffer(QEMUFile *f)
{
int len;
@@ -506,6 +663,14 @@ void qemu_file_put_notify(QEMUFile *f)
f->put_buffer(f->opaque, NULL, 0, 0);
}
+void qemu_file_get_notify(void *opaque)
+{
+ QEMUFile *f = opaque;
+ if (f->get_buffer(f->opaque, f->buf, 0, 0) < 0) {
+ f->has_error = 1;
+ }
+}
+
void qemu_put_buffer(QEMUFile *f, const uint8_t *buf, int size)
{
int l;
--
1.7.1.2
* [Qemu-devel] [PATCH 09/21] Introduce event-tap.
2010-11-25 6:06 [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Yoshiaki Tamura
` (7 preceding siblings ...)
2010-11-25 6:06 ` [Qemu-devel] [PATCH 08/21] savevm: introduce util functions to control ft_trans_file from savevm layer Yoshiaki Tamura
@ 2010-11-25 6:06 ` Yoshiaki Tamura
2010-11-29 11:00 ` [Qemu-devel] " Stefan Hajnoczi
[not found] ` <20101130011914.GA9015@amt.cnet>
2010-11-25 6:06 ` [Qemu-devel] [PATCH 10/21] Call init handler of event-tap at main() in vl.c Yoshiaki Tamura
` (12 subsequent siblings)
21 siblings, 2 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-25 6:06 UTC (permalink / raw)
To: kvm, qemu-devel
Cc: aliguori, mtosatti, ananth, ohmura.kei, dlaor, vatsa,
Yoshiaki Tamura, avi, psuriset, stefanha
event-tap controls when to start an FT transaction, and provides proxy
functions to be called from the net/block devices. During an FT
transaction, it queues up net/block requests and flushes them when the
transaction completes.
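A minimal sketch of the registration flow (the callback name and the
wrapper function are illustrative; the real caller is the migration
layer later in the series):

    static int ft_tap_cb(void)
    {
        /* kick the next FT transaction; illustrative only */
        return 0;
    }

    static void ft_start_stop_example(void)
    {
        /* start tapping: from now on the proxy functions queue requests
         * instead of issuing them, and ft_tap_cb is run via a bottom half */
        event_tap_register(ft_tap_cb);

        /* ... per-transaction work ... */

        /* stop tapping: queued requests are flushed and the log pool freed */
        event_tap_unregister();
    }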
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
---
Makefile.target | 1 +
block.h | 9 +
event-tap.c | 794 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
event-tap.h | 34 +++
net.h | 4 +
net/queue.c | 1 +
6 files changed, 843 insertions(+), 0 deletions(-)
create mode 100644 event-tap.c
create mode 100644 event-tap.h
diff --git a/Makefile.target b/Makefile.target
index 2800f47..3922d79 100644
--- a/Makefile.target
+++ b/Makefile.target
@@ -197,6 +197,7 @@ obj-y += rwhandler.o
obj-$(CONFIG_KVM) += kvm.o kvm-all.o
obj-$(CONFIG_NO_KVM) += kvm-stub.o
LIBS+=-lz
+obj-y += event-tap.o
QEMU_CFLAGS += $(VNC_TLS_CFLAGS)
QEMU_CFLAGS += $(VNC_SASL_CFLAGS)
diff --git a/block.h b/block.h
index 78ecfac..0f07617 100644
--- a/block.h
+++ b/block.h
@@ -116,6 +116,12 @@ BlockDriverAIOCB *bdrv_aio_readv(BlockDriverState *bs, int64_t sector_num,
BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
QEMUIOVector *iov, int nb_sectors,
BlockDriverCompletionFunc *cb, void *opaque);
+
+BlockDriverAIOCB *bdrv_aio_writev_proxy(BlockDriverState *bs,
+ int64_t sector_num, QEMUIOVector *iov,
+ int nb_sectors,
+ BlockDriverCompletionFunc *cb,
+ void *opaque);
BlockDriverAIOCB *bdrv_aio_flush(BlockDriverState *bs,
BlockDriverCompletionFunc *cb, void *opaque);
void bdrv_aio_cancel(BlockDriverAIOCB *acb);
@@ -134,6 +140,9 @@ typedef struct BlockRequest {
int bdrv_aio_multiwrite(BlockDriverState *bs, BlockRequest *reqs,
int num_reqs);
+int bdrv_aio_multiwrite_proxy(BlockDriverState *bs, BlockRequest *reqs,
+ int num_reqs);
+
/* sg packet commands */
int bdrv_ioctl(BlockDriverState *bs, unsigned long int req, void *buf);
diff --git a/event-tap.c b/event-tap.c
new file mode 100644
index 0000000..cf7a38a
--- /dev/null
+++ b/event-tap.c
@@ -0,0 +1,794 @@
+/*
+ * Event Tap functions for QEMU
+ *
+ * Copyright (c) 2010 Nippon Telegraph and Telephone Corporation.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#include "qemu-common.h"
+#include "block.h"
+#include "block_int.h"
+#include "ioport.h"
+#include "osdep.h"
+#include "sysemu.h"
+#include "hw/hw.h"
+#include "net.h"
+#include "event-tap.h"
+
+// #define DEBUG_EVENT_TAP
+
+#ifdef DEBUG_EVENT_TAP
+#define DPRINTF(fmt, ...) \
+ do { printf("event-tap: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DPRINTF(fmt, ...) \
+ do { } while (0)
+#endif
+
+static enum EVENT_TAP_STATE event_tap_state = EVENT_TAP_OFF;
+static BlockDriverAIOCB dummy_acb; /* we may need a pool for dummies */
+
+typedef struct EventTapIOport {
+ uint32_t address;
+ uint32_t data;
+ int index;
+} EventTapIOport;
+
+#define MMIO_BUF_SIZE 8
+
+typedef struct EventTapMMIO {
+ uint64_t address;
+ uint8_t buf[MMIO_BUF_SIZE];
+ int len;
+} EventTapMMIO;
+
+typedef struct EventTapNetReq {
+ char *device_name;
+ int iovcnt;
+ struct iovec *iov;
+ int vlan_id;
+ bool vlan_needed;
+ bool async;
+} EventTapNetReq;
+
+#define MAX_BLOCK_REQUEST 32
+
+typedef struct EventTapBlkReq {
+ char *device_name;
+ int num_reqs;
+ int num_cbs;
+ bool is_multiwrite;
+ BlockRequest reqs[MAX_BLOCK_REQUEST];
+ BlockDriverCompletionFunc *cb[MAX_BLOCK_REQUEST];
+ void *opaque[MAX_BLOCK_REQUEST];
+} EventTapBlkReq;
+
+#define EVENT_TAP_IOPORT (1 << 0)
+#define EVENT_TAP_MMIO (1 << 1)
+#define EVENT_TAP_NET (1 << 2)
+#define EVENT_TAP_BLK (1 << 3)
+
+#define EVENT_TAP_TYPE_MASK (EVENT_TAP_NET - 1)
+
+typedef struct EventTapLog {
+ int mode;
+ union {
+ EventTapIOport ioport ;
+ EventTapMMIO mmio;
+ };
+ union {
+ EventTapNetReq net_req;
+ EventTapBlkReq blk_req;
+ };
+ QTAILQ_ENTRY(EventTapLog) node;
+} EventTapLog;
+
+static EventTapLog *last_event_tap;
+
+static QTAILQ_HEAD(, EventTapLog) event_list;
+static QTAILQ_HEAD(, EventTapLog) event_pool;
+
+static int (*event_tap_cb)(void);
+static QEMUBH *event_tap_bh;
+static VMChangeStateEntry *vmstate;
+
+static void event_tap_bh_cb(void *p)
+{
+ event_tap_cb();
+ qemu_bh_delete(event_tap_bh);
+ event_tap_bh = NULL;
+}
+
+static int event_tap_schedule_bh(void)
+{
+ /* if bh is already set, we ignore it for now */
+ if (event_tap_bh) {
+ DPRINTF("event_tap_bh is already scheduled\n");
+ return 0;
+ }
+
+ event_tap_bh = qemu_bh_new(event_tap_bh_cb, NULL);
+ qemu_bh_schedule(event_tap_bh);
+
+ return 0;
+}
+
+static int event_tap_alloc_net_req(EventTapNetReq *net_req,
+ VLANClientState *vc,
+ const struct iovec *iov, int iovcnt,
+ NetPacketSent *sent_cb, bool async)
+{
+ int i, ret = 0;
+
+ net_req->iovcnt = iovcnt;
+ net_req->async = async;
+ net_req->device_name = qemu_strdup(vc->name);
+
+ if (vc->vlan) {
+ net_req->vlan_needed = 1;
+ net_req->vlan_id = vc->vlan->id;
+ } else {
+ net_req->vlan_needed = 0;
+ }
+
+ net_req->iov = qemu_malloc(sizeof(struct iovec) * iovcnt);
+
+ for (i = 0; i < iovcnt; i++) {
+ net_req->iov[i].iov_base = qemu_malloc(iov[i].iov_len);
+ memcpy(net_req->iov[i].iov_base, iov[i].iov_base, iov[i].iov_len);
+ net_req->iov[i].iov_len = iov[i].iov_len;
+ ret += iov[i].iov_len;
+ }
+
+ return ret;
+}
+
+static void event_tap_alloc_blk_req(EventTapBlkReq *blk_req,
+ BlockDriverState *bs, BlockRequest *reqs,
+ int num_reqs, BlockDriverCompletionFunc *cb,
+ void *opaque, bool is_multiwrite)
+{
+ int i;
+
+ blk_req->num_reqs = num_reqs;
+ blk_req->num_cbs = num_reqs;
+ blk_req->device_name = qemu_strdup(bs->device_name);
+ blk_req->is_multiwrite = is_multiwrite;
+
+ for (i = 0; i < num_reqs; i++) {
+ blk_req->reqs[i].sector = reqs[i].sector;
+ blk_req->reqs[i].nb_sectors = reqs[i].nb_sectors;
+ blk_req->reqs[i].qiov = reqs[i].qiov;
+ blk_req->reqs[i].cb = cb;
+ blk_req->reqs[i].opaque = opaque;
+ blk_req->cb[i] = reqs[i].cb;
+ blk_req->opaque[i] = reqs[i].opaque;
+ }
+}
+
+static void *event_tap_alloc_log(void)
+{
+ EventTapLog *log;
+
+ if (QTAILQ_EMPTY(&event_pool)) {
+ log = qemu_mallocz(sizeof(EventTapLog));
+ } else {
+ log = QTAILQ_FIRST(&event_pool);
+ QTAILQ_REMOVE(&event_pool, log, node);
+ }
+
+ return log;
+}
+
+static void event_tap_free_log(EventTapLog *log)
+{
+ int i, mode = log->mode & ~EVENT_TAP_TYPE_MASK;
+
+ if (mode == EVENT_TAP_NET) {
+ EventTapNetReq *net_req = &log->net_req;
+ for (i = 0; i < net_req->iovcnt; i++) {
+ qemu_free(net_req->iov[i].iov_base);
+ }
+ qemu_free(net_req->iov);
+ qemu_free(net_req->device_name);
+ } else if (mode == EVENT_TAP_BLK) {
+ EventTapBlkReq *blk_req = &log->blk_req;
+
+ if (event_tap_state >= EVENT_TAP_LOAD) {
+ for (i = 0; i < blk_req->num_reqs; i++) {
+ qemu_free(blk_req->reqs[i].qiov->iov);
+ qemu_free(blk_req->reqs[i].qiov);
+ }
+ }
+ qemu_free(blk_req->device_name);
+ }
+
+ log->mode = 0;
+
+ /* return the log to event_pool */
+ QTAILQ_INSERT_HEAD(&event_pool, log, node);
+}
+
+static void event_tap_free_pool(void)
+{
+ EventTapLog *log, *next;
+
+ QTAILQ_FOREACH_SAFE(log, &event_pool, node, next) {
+ QTAILQ_REMOVE(&event_pool, log, node);
+ qemu_free(log);
+ }
+}
+
+/* This func is called by qemu_net_queue_flush() when a packet is appended */
+static void event_tap_net_cb(VLANClientState *vc, ssize_t len)
+{
+ DPRINTF("%s: %zd bytes packet was sended\n", vc->name, len);
+}
+
+static void event_tap_blk_cb(void *opaque, int ret)
+{
+ EventTapLog *log = container_of(opaque, EventTapLog, blk_req);
+ EventTapBlkReq *blk_req = opaque;
+ int i;
+
+ blk_req->num_cbs--;
+ if (blk_req->num_cbs == 0) {
+ /* all outstanding requests are flushed */
+ for (i = 0; i < blk_req->num_reqs; i++) {
+ blk_req->cb[i](blk_req->opaque[i], ret);
+ }
+ event_tap_free_log(log);
+ }
+}
+
+static int net_event_tap(VLANClientState *vc, const struct iovec *iov,
+ int iovcnt, NetPacketSent *sent_cb, bool async)
+{
+ int ret = 0, empty;
+ EventTapLog *log = last_event_tap;
+
+ if (!log) {
+ DPRINTF("no last_event_tap\n");
+ log = event_tap_alloc_log();
+ }
+
+ if (log->mode & ~EVENT_TAP_TYPE_MASK) {
+ DPRINTF("last_event_tap already used %d\n",
+ log->mode & ~EVENT_TAP_TYPE_MASK);
+ return ret;
+ }
+
+ log->mode |= EVENT_TAP_NET;
+ ret = event_tap_alloc_net_req(&log->net_req, vc, iov, iovcnt, sent_cb,
+ async);
+
+ empty = QTAILQ_EMPTY(&event_list);
+ QTAILQ_INSERT_TAIL(&event_list, log, node);
+ last_event_tap = NULL;
+
+ if (empty) {
+ event_tap_schedule_bh();
+ }
+
+ return ret;
+}
+
+static void bdrv_event_tap(BlockDriverState *bs, BlockRequest *reqs,
+ int num_reqs, bool is_multiwrite)
+{
+ EventTapLog *log = last_event_tap;
+ int empty;
+
+ if (!log) {
+ DPRINTF("no last_event_tap\n");
+ log = event_tap_alloc_log();
+ }
+ if (log->mode & ~EVENT_TAP_TYPE_MASK) {
+ DPRINTF("last_event_tap already used\n");
+ return;
+ }
+
+ log->mode |= EVENT_TAP_BLK;
+ event_tap_alloc_blk_req(&log->blk_req, bs, reqs, num_reqs, event_tap_blk_cb,
+ &log->blk_req, is_multiwrite);
+
+ empty = QTAILQ_EMPTY(&event_list);
+ QTAILQ_INSERT_TAIL(&event_list, log, node);
+ last_event_tap = NULL;
+
+ if (empty) {
+ event_tap_schedule_bh();
+ }
+}
+
+BlockDriverAIOCB *bdrv_aio_writev_proxy(BlockDriverState *bs,
+ int64_t sector_num,
+ QEMUIOVector *iov,
+ int nb_sectors,
+ BlockDriverCompletionFunc *cb,
+ void *opaque)
+{
+ if (event_tap_state == EVENT_TAP_ON) {
+ BlockRequest req;
+
+ req.sector = sector_num;
+ req.nb_sectors = nb_sectors;
+ req.qiov = iov;
+ req.cb = cb;
+ req.opaque = opaque;
+ bdrv_event_tap(bs, &req, 1, 0);
+
+ /* return a dummy_acb pointer to prevent from failing */
+ return &dummy_acb;
+ }
+
+ return bdrv_aio_writev(bs, sector_num, iov, nb_sectors, cb, opaque);
+}
+
+int bdrv_aio_multiwrite_proxy(BlockDriverState *bs, BlockRequest *reqs,
+ int num_reqs)
+{
+ if (event_tap_state == EVENT_TAP_ON) {
+ bdrv_event_tap(bs, reqs, num_reqs, 1);
+ return 0;
+ }
+
+ return bdrv_aio_multiwrite(bs, reqs, num_reqs);
+}
+
+void qemu_send_packet_proxy(VLANClientState *vc, const uint8_t *buf, int size)
+{
+ if (event_tap_state == EVENT_TAP_ON) {
+ struct iovec iov;
+ iov.iov_base = (uint8_t*)buf;
+ iov.iov_len = size;
+
+ net_event_tap(vc, &iov, 1, NULL, 0);
+ return;
+ }
+
+ return qemu_send_packet(vc, buf, size);
+}
+ssize_t qemu_sendv_packet_async_proxy(VLANClientState *vc,
+ const struct iovec *iov,
+ int iovcnt, NetPacketSent *sent_cb)
+{
+ if (event_tap_state == EVENT_TAP_ON) {
+ return net_event_tap(vc, iov, iovcnt, sent_cb, 1);
+ }
+
+ return qemu_sendv_packet_async(vc, iov, iovcnt, sent_cb);
+}
+
+int event_tap_register(int (*cb)(void))
+{
+ if (cb == NULL || event_tap_state != EVENT_TAP_OFF)
+ return -1;
+ if (event_tap_cb == NULL)
+ event_tap_cb = cb;
+
+ event_tap_state = EVENT_TAP_ON;
+
+ return 0;
+}
+
+int event_tap_unregister(void)
+{
+ if (event_tap_state == EVENT_TAP_OFF)
+ return -1;
+
+ event_tap_state = EVENT_TAP_OFF;
+ event_tap_cb = NULL;
+
+ event_tap_flush();
+ event_tap_free_pool();
+
+ return 0;
+}
+
+void event_tap_suspend(void)
+{
+ if (event_tap_state == EVENT_TAP_ON) {
+ event_tap_state = EVENT_TAP_SUSPEND;
+ }
+}
+
+void event_tap_resume(void)
+{
+ if (event_tap_state == EVENT_TAP_SUSPEND) {
+ event_tap_state = EVENT_TAP_ON;
+ }
+}
+
+int event_tap_get_state(void)
+{
+ return event_tap_state;
+}
+
+void event_tap_ioport(int index, uint32_t address, uint32_t data)
+{
+ if (event_tap_state != EVENT_TAP_ON) {
+ return;
+ }
+
+ if (!last_event_tap) {
+ last_event_tap = event_tap_alloc_log();
+ }
+
+ last_event_tap->mode = EVENT_TAP_IOPORT;
+ last_event_tap->ioport.index = index;
+ last_event_tap->ioport.address = address;
+ last_event_tap->ioport.data = data;
+}
+
+void event_tap_mmio(uint64_t address, uint8_t *buf, int len)
+{
+ if (event_tap_state != EVENT_TAP_ON || len > MMIO_BUF_SIZE) {
+ return;
+ }
+
+ if (!last_event_tap) {
+ last_event_tap = event_tap_alloc_log();
+ }
+
+ last_event_tap->mode = EVENT_TAP_MMIO;
+ last_event_tap->mmio.address = address;
+ last_event_tap->mmio.len = len;
+ memcpy(last_event_tap->mmio.buf, buf, len);
+}
+
+static void event_tap_net_flush(EventTapNetReq *net_req)
+{
+ VLANClientState *vc;
+ ssize_t len;
+
+ if (net_req->vlan_needed) {
+ vc = qemu_find_vlan_client_by_name(NULL, net_req->vlan_id,
+ net_req->device_name);
+ } else {
+ vc = qemu_find_netdev(net_req->device_name);
+ }
+
+ if (net_req->async) {
+ len = qemu_sendv_packet_async(vc, net_req->iov, net_req->iovcnt,
+ event_tap_net_cb);
+ if (len == 0) {
+ DPRINTF("This packet is appended\n");
+ }
+ } else {
+ qemu_send_packet(vc, net_req->iov[0].iov_base,
+ net_req->iov[0].iov_len);
+ }
+}
+
+static void event_tap_blk_flush(EventTapBlkReq *blk_req)
+{
+ BlockDriverState *bs;
+
+ bs = bdrv_find(blk_req->device_name);
+
+ if (blk_req->is_multiwrite) {
+ bdrv_aio_multiwrite(bs, blk_req->reqs, blk_req->num_reqs);
+ } else {
+ bdrv_aio_writev(bs, blk_req->reqs[0].sector, blk_req->reqs[0].qiov,
+ blk_req->reqs[0].nb_sectors, blk_req->cb[0],
+ blk_req->opaque[0]);
+ }
+}
+
+/* returns 1 if the queue becomes empty */
+int event_tap_flush_one(void)
+{
+ EventTapLog *log;
+
+ if (QTAILQ_EMPTY(&event_list)) {
+ return 1;
+ }
+
+ log = QTAILQ_FIRST(&event_list);
+ switch (log->mode & ~EVENT_TAP_TYPE_MASK) {
+ case EVENT_TAP_NET:
+ event_tap_net_flush(&log->net_req);
+ QTAILQ_REMOVE(&event_list, log, node);
+ event_tap_free_log(log);
+ break;
+ case EVENT_TAP_BLK:
+ event_tap_blk_flush(&log->blk_req);
+ QTAILQ_REMOVE(&event_list, log, node);
+ break;
+ default:
+ fprintf(stderr, "Unknown state %d\n", log->mode);
+ return -1;
+ }
+
+ return QTAILQ_EMPTY(&event_list);
+}
+
+void event_tap_flush(void)
+{
+ int ret;
+ do {
+ ret = event_tap_flush_one();
+ } while (ret == 0);
+}
+
+static void event_tap_replay(void *opaque, int running, int reason)
+{
+ EventTapLog *log, *next;
+
+ if (!running) {
+ return;
+ }
+
+ if (event_tap_state != EVENT_TAP_LOAD) {
+ return;
+ }
+
+ event_tap_state = EVENT_TAP_REPLAY;
+
+ QTAILQ_FOREACH(log, &event_list, node) {
+ EventTapBlkReq *blk_req;
+
+ /* event resume */
+ switch (log->mode & ~EVENT_TAP_TYPE_MASK) {
+ case EVENT_TAP_NET:
+ event_tap_net_flush(&log->net_req);
+ break;
+ case EVENT_TAP_BLK:
+ blk_req = &log->blk_req;
+ if ((log->mode & EVENT_TAP_TYPE_MASK) == EVENT_TAP_IOPORT) {
+ switch (log->ioport.index) {
+ case 0:
+ cpu_outb(log->ioport.address, log->ioport.data);
+ break;
+ case 1:
+ cpu_outw(log->ioport.address, log->ioport.data);
+ break;
+ case 2:
+ cpu_outl(log->ioport.address, log->ioport.data);
+ break;
+ }
+ } else {
+ /* EVENT_TAP_MMIO */
+ cpu_physical_memory_rw(log->mmio.address,
+ log->mmio.buf,
+ log->mmio.len, 1);
+ }
+ break;
+ case 0:
+ DPRINTF("No event\n");
+ break;
+ default:
+ fprintf(stderr, "Unknown state %d\n", log->mode);
+ return;
+ }
+ }
+
+ /* remove event logs from queue */
+ QTAILQ_FOREACH_SAFE(log, &event_list, node, next) {
+ QTAILQ_REMOVE(&event_list, log, node);
+ event_tap_free_log(log);
+ }
+
+ event_tap_state = EVENT_TAP_OFF;
+ qemu_del_vm_change_state_handler(vmstate);
+}
+
+static inline void event_tap_ioport_save(QEMUFile *f, EventTapIOport *ioport)
+{
+ qemu_put_be32(f, ioport->index);
+ qemu_put_be32(f, ioport->address);
+ qemu_put_byte(f, ioport->data);
+}
+
+static inline void event_tap_ioport_load(QEMUFile *f,
+ EventTapIOport *ioport)
+{
+ ioport->index = qemu_get_be32(f);
+ ioport->address = qemu_get_be32(f);
+ ioport->data = qemu_get_byte(f);
+}
+
+static inline void event_tap_mmio_save(QEMUFile *f, EventTapMMIO *mmio)
+{
+ qemu_put_be64(f, mmio->address);
+ qemu_put_byte(f, mmio->len);
+ qemu_put_buffer(f, mmio->buf, mmio->len);
+}
+
+static inline void event_tap_mmio_load(QEMUFile *f, EventTapMMIO *mmio)
+{
+ mmio->address = qemu_get_be64(f);
+ mmio->len = qemu_get_byte(f);
+ qemu_get_buffer(f, mmio->buf, mmio->len);
+}
+
+static void event_tap_net_save(QEMUFile *f, EventTapNetReq *net_req)
+{
+ int i, len;
+
+ len = strlen(net_req->device_name);
+ qemu_put_byte(f, len);
+ qemu_put_buffer(f, (uint8_t *)net_req->device_name, len);
+ qemu_put_byte(f, net_req->vlan_id);
+ qemu_put_byte(f, net_req->vlan_needed);
+ qemu_put_byte(f, net_req->iovcnt);
+
+ for (i = 0; i < net_req->iovcnt; i++) {
+ qemu_put_be64(f, net_req->iov[i].iov_len);
+ qemu_put_buffer(f, (uint8_t *)net_req->iov[i].iov_base,
+ net_req->iov[i].iov_len);
+ }
+}
+
+static void event_tap_net_load(QEMUFile *f, EventTapNetReq *net_req)
+{
+ int i, len;
+
+ len = qemu_get_byte(f);
+ net_req->device_name = qemu_malloc(len + 1);
+ qemu_get_buffer(f, (uint8_t *)net_req->device_name, len);
+ net_req->device_name[len] = '\0';
+ net_req->vlan_id = qemu_get_byte(f);
+ net_req->vlan_needed = qemu_get_byte(f);
+ net_req->iovcnt = qemu_get_byte(f);
+ net_req->iov = qemu_malloc(sizeof(struct iovec) * net_req->iovcnt);
+
+ for (i = 0; i < net_req->iovcnt; i++) {
+ net_req->iov[i].iov_len = qemu_get_be64(f);
+ net_req->iov[i].iov_base = qemu_malloc(net_req->iov[i].iov_len);
+ qemu_get_buffer(f, (uint8_t *)net_req->iov[i].iov_base,
+ net_req->iov[i].iov_len);
+ }
+}
+
+static void event_tap_blk_save(QEMUFile *f, EventTapBlkReq *blk_req)
+{
+ BlockRequest *req;
+ ram_addr_t page_addr;
+ int i, j, len;
+
+ len = strlen(blk_req->device_name);
+ qemu_put_byte(f, len);
+ qemu_put_buffer(f, (uint8_t *)blk_req->device_name, len);
+ qemu_put_byte(f, blk_req->num_reqs);
+
+ for (i = 0; i < blk_req->num_reqs; i++) {
+ req = &blk_req->reqs[i];
+ qemu_put_be64(f, req->sector);
+ qemu_put_be32(f, req->nb_sectors);
+ qemu_put_byte(f, req->qiov->niov);
+ for (j = 0; j < req->qiov->niov; j++) {
+ page_addr =
+ qemu_ram_addr_from_host_nofail(req->qiov->iov[j].iov_base);
+ qemu_put_be64(f, page_addr);
+ qemu_put_be64(f, req->qiov->iov[j].iov_len);
+ }
+ }
+}
+
+static void event_tap_blk_load(QEMUFile *f, EventTapBlkReq *blk_req)
+{
+ BlockRequest *req;
+ ram_addr_t page_addr;
+ int i, j, len;
+
+ len = qemu_get_byte(f);
+ blk_req->device_name = qemu_malloc(len + 1);
+ qemu_get_buffer(f, (uint8_t *)blk_req->device_name, len);
+ blk_req->device_name[len] = '\0';
+ blk_req->num_reqs = qemu_get_byte(f);
+
+ for (i = 0; i < blk_req->num_reqs; i++) {
+ req = &blk_req->reqs[i];
+ req->sector = qemu_get_be64(f);
+ req->nb_sectors = qemu_get_be32(f);
+ req->qiov = qemu_malloc(sizeof(QEMUIOVector));
+ req->qiov->niov = qemu_get_byte(f);
+ req->qiov->iov = qemu_malloc(sizeof(struct iovec) * req->qiov->niov);
+ for (j = 0; j < req->qiov->niov; j++) {
+ page_addr = qemu_get_be64(f);
+ req->qiov->iov[j].iov_base = qemu_get_ram_ptr(page_addr);
+ req->qiov->iov[j].iov_len = qemu_get_be64(f);
+ }
+ }
+}
+
+static void event_tap_save(QEMUFile *f, void *opaque)
+{
+ EventTapLog *log;
+
+ QTAILQ_FOREACH(log, &event_list, node) {
+ qemu_put_byte(f, log->mode);
+ DPRINTF("log->mode=%d\n", log->mode);
+ switch (log->mode & EVENT_TAP_TYPE_MASK) {
+ case EVENT_TAP_IOPORT:
+ event_tap_ioport_save(f, &log->ioport);
+ break;
+ case EVENT_TAP_MMIO:
+ event_tap_mmio_save(f, &log->mmio);
+ break;
+ case 0:
+ DPRINTF("No event\n");
+ break;
+ default:
+ fprintf(stderr, "Unknown state %d\n", log->mode);
+ return;
+ }
+
+ switch (log->mode & ~EVENT_TAP_TYPE_MASK) {
+ case EVENT_TAP_NET:
+ event_tap_net_save(f, &log->net_req);
+ break;
+ case EVENT_TAP_BLK:
+ event_tap_blk_save(f, &log->blk_req);
+ break;
+ default:
+ fprintf(stderr, "Unknown state %d\n", log->mode);
+ return;
+ }
+ }
+
+ qemu_put_byte(f, 0); /* EOF */
+}
+
+static int event_tap_load(QEMUFile *f, void *opaque, int version_id)
+{
+ EventTapLog *log, *next;
+ int mode;
+
+ event_tap_state = EVENT_TAP_LOAD;
+
+ QTAILQ_FOREACH_SAFE(log, &event_list, node, next) {
+ QTAILQ_REMOVE(&event_list, log, node);
+ event_tap_free_log(log);
+ }
+
+ /* loop until EOF */
+ while ((mode = qemu_get_byte(f)) != 0) {
+ EventTapLog *log = event_tap_alloc_log();
+
+ log->mode = mode;
+ switch (log->mode & EVENT_TAP_TYPE_MASK) {
+ case EVENT_TAP_IOPORT:
+ event_tap_ioport_load(f, &log->ioport);
+ break;
+ case EVENT_TAP_MMIO:
+ event_tap_mmio_load(f, &log->mmio);
+ break;
+ case 0:
+ DPRINTF("No event\n");
+ break;
+ default:
+ fprintf(stderr, "Unknown state %d\n", log->mode);
+ return -1;
+ }
+
+ switch (log->mode & ~EVENT_TAP_TYPE_MASK) {
+ case EVENT_TAP_NET:
+ event_tap_net_load(f, &log->net_req);
+ break;
+ case EVENT_TAP_BLK:
+ event_tap_blk_load(f, &log->blk_req);
+ break;
+ default:
+ fprintf(stderr, "Unknown state %d\n", log->mode);
+ return -1;
+ }
+
+ QTAILQ_INSERT_TAIL(&event_list, log, node);
+ }
+
+ return 0;
+}
+
+void event_tap_init(void)
+{
+ QTAILQ_INIT(&event_list);
+ QTAILQ_INIT(&event_pool);
+ register_savevm(NULL, "event-tap", 0, 1,
+ event_tap_save, event_tap_load, &last_event_tap);
+ vmstate = qemu_add_vm_change_state_handler(event_tap_replay, NULL);
+}
diff --git a/event-tap.h b/event-tap.h
new file mode 100644
index 0000000..61b9bbc
--- /dev/null
+++ b/event-tap.h
@@ -0,0 +1,34 @@
+/*
+ * Event Tap functions for QEMU
+ *
+ * Copyright (c) 2010 Nippon Telegraph and Telephone Corporation.
+ *
+ * This work is licensed under the terms of the GNU GPL, version 2. See
+ * the COPYING file in the top-level directory.
+ */
+
+#ifndef EVENT_TAP_H
+#define EVENT_TAP_H
+
+#include "qemu-common.h"
+
+enum EVENT_TAP_STATE {
+ EVENT_TAP_OFF,
+ EVENT_TAP_ON,
+ EVENT_TAP_SUSPEND,
+ EVENT_TAP_LOAD,
+ EVENT_TAP_REPLAY,
+};
+
+int event_tap_register(int (*cb)(void));
+int event_tap_unregister(void);
+void event_tap_suspend(void);
+void event_tap_resume(void);
+int event_tap_get_state(void);
+void event_tap_ioport(int index, uint32_t address, uint32_t data);
+void event_tap_mmio(uint64_t address, uint8_t *buf, int len);
+void event_tap_init(void);
+void event_tap_flush(void);
+int event_tap_flush_one(void);
+
+#endif
diff --git a/net.h b/net.h
index 44c31a9..93fd403 100644
--- a/net.h
+++ b/net.h
@@ -105,6 +105,10 @@ ssize_t qemu_sendv_packet(VLANClientState *vc, const struct iovec *iov,
ssize_t qemu_sendv_packet_async(VLANClientState *vc, const struct iovec *iov,
int iovcnt, NetPacketSent *sent_cb);
void qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size);
+void qemu_send_packet_proxy(VLANClientState *vc, const uint8_t *buf, int size);
+ssize_t qemu_sendv_packet_async_proxy(VLANClientState *vc,
+ const struct iovec *iov,
+ int iovcnt, NetPacketSent *sent_cb);
ssize_t qemu_send_packet_raw(VLANClientState *vc, const uint8_t *buf, int size);
ssize_t qemu_send_packet_async(VLANClientState *vc, const uint8_t *buf,
int size, NetPacketSent *sent_cb);
diff --git a/net/queue.c b/net/queue.c
index 2ea6cd0..e7a35b0 100644
--- a/net/queue.c
+++ b/net/queue.c
@@ -258,3 +258,4 @@ void qemu_net_queue_flush(NetQueue *queue)
qemu_free(packet);
}
}
+
--
1.7.1.2
^ permalink raw reply related [flat|nested] 112+ messages in thread
* [Qemu-devel] [PATCH 10/21] Call init handler of event-tap at main() in vl.c.
2010-11-25 6:06 [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Yoshiaki Tamura
` (8 preceding siblings ...)
2010-11-25 6:06 ` [Qemu-devel] [PATCH 09/21] Introduce event-tap Yoshiaki Tamura
@ 2010-11-25 6:06 ` Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 11/21] ioport: insert event_tap_ioport() to ioport_write() Yoshiaki Tamura
` (11 subsequent siblings)
21 siblings, 0 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-25 6:06 UTC (permalink / raw)
To: kvm, qemu-devel
Cc: aliguori, mtosatti, ananth, ohmura.kei, dlaor, vatsa,
Yoshiaki Tamura, avi, psuriset, stefanha
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
vl.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/vl.c b/vl.c
index 6b6aec0..ea6fe71 100644
--- a/vl.c
+++ b/vl.c
@@ -162,6 +162,7 @@ int main(int argc, char **argv)
#include "qemu-queue.h"
#include "cpus.h"
#include "arch_init.h"
+#include "event-tap.h"
#include "ui/qemu-spice.h"
@@ -2779,6 +2780,8 @@ int main(int argc, char **argv, char **envp)
blk_mig_init();
+ event_tap_init();
+
if (default_cdrom) {
/* we always create the cdrom drive, even if no disk is there */
drive_add(NULL, CDROM_ALIAS);
--
1.7.1.2
^ permalink raw reply related [flat|nested] 112+ messages in thread
* [Qemu-devel] [PATCH 11/21] ioport: insert event_tap_ioport() to ioport_write().
2010-11-25 6:06 [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Yoshiaki Tamura
` (9 preceding siblings ...)
2010-11-25 6:06 ` [Qemu-devel] [PATCH 10/21] Call init handler of event-tap at main() in vl.c Yoshiaki Tamura
@ 2010-11-25 6:06 ` Yoshiaki Tamura
2010-11-28 9:40 ` [Qemu-devel] " Michael S. Tsirkin
2010-11-25 6:06 ` [Qemu-devel] [PATCH 12/21] Insert event_tap_mmio() to cpu_physical_memory_rw() in exec.c Yoshiaki Tamura
` (10 subsequent siblings)
21 siblings, 1 reply; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-25 6:06 UTC (permalink / raw)
To: kvm, qemu-devel
Cc: aliguori, mtosatti, ananth, ohmura.kei, dlaor, vatsa,
Yoshiaki Tamura, avi, psuriset, stefanha
Record ioport writes so that they can be replayed upon failover.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
ioport.c | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
diff --git a/ioport.c b/ioport.c
index aa4188a..74aebf5 100644
--- a/ioport.c
+++ b/ioport.c
@@ -27,6 +27,7 @@
#include "ioport.h"
#include "trace.h"
+#include "event-tap.h"
/***********************************************************/
/* IO Port */
@@ -76,6 +77,7 @@ static void ioport_write(int index, uint32_t address, uint32_t data)
default_ioport_writel
};
IOPortWriteFunc *func = ioport_write_table[index][address];
+ event_tap_ioport(index, address, data);
if (!func)
func = default_func[index];
func(ioport_opaque[address], address, data);
--
1.7.1.2
^ permalink raw reply related [flat|nested] 112+ messages in thread
* [Qemu-devel] [PATCH 12/21] Insert event_tap_mmio() to cpu_physical_memory_rw() in exec.c.
2010-11-25 6:06 [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Yoshiaki Tamura
` (10 preceding siblings ...)
2010-11-25 6:06 ` [Qemu-devel] [PATCH 11/21] ioport: insert event_tap_ioport() to ioport_write() Yoshiaki Tamura
@ 2010-11-25 6:06 ` Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 13/21] dma-helpers: replace bdrv_aio_writev() with bdrv_aio_writev_proxy() Yoshiaki Tamura
` (9 subsequent siblings)
21 siblings, 0 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-25 6:06 UTC (permalink / raw)
To: kvm, qemu-devel
Cc: aliguori, mtosatti, ananth, ohmura.kei, dlaor, vatsa,
Yoshiaki Tamura, avi, psuriset, stefanha
Record mmio writes so that they can be replayed upon failover.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
exec.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/exec.c b/exec.c
index db9ff55..fd8823e 100644
--- a/exec.c
+++ b/exec.c
@@ -33,6 +33,7 @@
#include "osdep.h"
#include "kvm.h"
#include "qemu-timer.h"
+#include "event-tap.h"
#if defined(CONFIG_USER_ONLY)
#include <qemu.h>
#include <signal.h>
@@ -3479,6 +3480,9 @@ void cpu_physical_memory_rw(target_phys_addr_t addr, uint8_t *buf,
io_index = (pd >> IO_MEM_SHIFT) & (IO_MEM_NB_ENTRIES - 1);
if (p)
addr1 = (addr & ~TARGET_PAGE_MASK) + p->region_offset;
+
+ event_tap_mmio(addr, buf, len);
+
/* XXX: could force cpu_single_env to NULL to avoid
potential bugs */
if (l >= 4 && ((addr1 & 3) == 0)) {
--
1.7.1.2
^ permalink raw reply related [flat|nested] 112+ messages in thread
* [Qemu-devel] [PATCH 13/21] dma-helpers: replace bdrv_aio_writev() with bdrv_aio_writev_proxy().
2010-11-25 6:06 [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Yoshiaki Tamura
` (11 preceding siblings ...)
2010-11-25 6:06 ` [Qemu-devel] [PATCH 12/21] Insert event_tap_mmio() to cpu_physical_memory_rw() in exec.c Yoshiaki Tamura
@ 2010-11-25 6:06 ` Yoshiaki Tamura
2010-11-28 9:33 ` [Qemu-devel] " Michael S. Tsirkin
2010-11-25 6:06 ` [Qemu-devel] [PATCH 14/21] virtio-blk: replace bdrv_aio_multiwrite() with bdrv_aio_multiwrite_proxy() Yoshiaki Tamura
` (8 subsequent siblings)
21 siblings, 1 reply; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-25 6:06 UTC (permalink / raw)
To: kvm, qemu-devel
Cc: aliguori, mtosatti, ananth, ohmura.kei, dlaor, vatsa,
Yoshiaki Tamura, avi, psuriset, stefanha
Replace bdrv_aio_writev() with bdrv_aio_writev_proxy() to let
event-tap capture events from dma-helpers.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
dma-helpers.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/dma-helpers.c b/dma-helpers.c
index 712ed89..8ab2c26 100644
--- a/dma-helpers.c
+++ b/dma-helpers.c
@@ -117,8 +117,8 @@ static void dma_bdrv_cb(void *opaque, int ret)
}
if (dbs->is_write) {
- dbs->acb = bdrv_aio_writev(dbs->bs, dbs->sector_num, &dbs->iov,
- dbs->iov.size / 512, dma_bdrv_cb, dbs);
+ dbs->acb = bdrv_aio_writev_proxy(dbs->bs, dbs->sector_num, &dbs->iov,
+ dbs->iov.size / 512, dma_bdrv_cb, dbs);
} else {
dbs->acb = bdrv_aio_readv(dbs->bs, dbs->sector_num, &dbs->iov,
dbs->iov.size / 512, dma_bdrv_cb, dbs);
--
1.7.1.2
^ permalink raw reply related [flat|nested] 112+ messages in thread
* [Qemu-devel] [PATCH 14/21] virtio-blk: replace bdrv_aio_multiwrite() with bdrv_aio_multiwrite_proxy().
2010-11-25 6:06 [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Yoshiaki Tamura
` (12 preceding siblings ...)
2010-11-25 6:06 ` [Qemu-devel] [PATCH 13/21] dma-helpers: replace bdrv_aio_writev() with bdrv_aio_writev_proxy() Yoshiaki Tamura
@ 2010-11-25 6:06 ` Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 15/21] virtio-net: replace qemu_sendv_packet_async() with qemu_sendv_packet_async_proxy() Yoshiaki Tamura
` (7 subsequent siblings)
21 siblings, 0 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-25 6:06 UTC (permalink / raw)
To: kvm, qemu-devel
Cc: aliguori, mtosatti, ananth, ohmura.kei, dlaor, vatsa,
Yoshiaki Tamura, avi, psuriset, stefanha
Replace bdrv_aio_multiwrite() with bdrv_aio_multiwrite_proxy()
to let event-tap capture events from virtio-blk.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
hw/virtio-blk.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/hw/virtio-blk.c b/hw/virtio-blk.c
index e5f9b27..aa0c866 100644
--- a/hw/virtio-blk.c
+++ b/hw/virtio-blk.c
@@ -250,7 +250,7 @@ static void virtio_submit_multiwrite(BlockDriverState *bs, MultiReqBuffer *mrb)
return;
}
- ret = bdrv_aio_multiwrite(bs, mrb->blkreq, mrb->num_writes);
+ ret = bdrv_aio_multiwrite_proxy(bs, mrb->blkreq, mrb->num_writes);
if (ret != 0) {
for (i = 0; i < mrb->num_writes; i++) {
if (mrb->blkreq[i].error) {
--
1.7.1.2
^ permalink raw reply related [flat|nested] 112+ messages in thread
* [Qemu-devel] [PATCH 15/21] virtio-net: replace qemu_sendv_packet_async() with qemu_sendv_packet_async_proxy().
2010-11-25 6:06 [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Yoshiaki Tamura
` (13 preceding siblings ...)
2010-11-25 6:06 ` [Qemu-devel] [PATCH 14/21] virtio-blk: replace bdrv_aio_multiwrite() with bdrv_aio_multiwrite_proxy() Yoshiaki Tamura
@ 2010-11-25 6:06 ` Yoshiaki Tamura
2010-11-28 9:31 ` [Qemu-devel] " Michael S. Tsirkin
2010-11-25 6:06 ` [Qemu-devel] [PATCH 16/21] e1000: replace qemu_send_packet() with qemu_send_packet_proxy() Yoshiaki Tamura
` (6 subsequent siblings)
21 siblings, 1 reply; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-25 6:06 UTC (permalink / raw)
To: kvm, qemu-devel
Cc: aliguori, mtosatti, ananth, ohmura.kei, dlaor, vatsa,
Yoshiaki Tamura, avi, psuriset, stefanha
Replace qemu_sendv_packet_async() with
qemu_sendv_packet_async_proxy() to let event-tap capture events from
virtio-net.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
hw/virtio-net.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/hw/virtio-net.c b/hw/virtio-net.c
index 1d61f19..8c76346 100644
--- a/hw/virtio-net.c
+++ b/hw/virtio-net.c
@@ -710,8 +710,8 @@ static int32_t virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
len += hdr_len;
}
- ret = qemu_sendv_packet_async(&n->nic->nc, out_sg, out_num,
- virtio_net_tx_complete);
+ ret = qemu_sendv_packet_async_proxy(&n->nic->nc, out_sg, out_num,
+ virtio_net_tx_complete);
if (ret == 0) {
virtio_queue_set_notification(n->tx_vq, 0);
n->async_tx.elem = elem;
--
1.7.1.2
^ permalink raw reply related [flat|nested] 112+ messages in thread
* [Qemu-devel] [PATCH 16/21] e1000: replace qemu_send_packet() with qemu_send_packet_proxy().
2010-11-25 6:06 [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Yoshiaki Tamura
` (14 preceding siblings ...)
2010-11-25 6:06 ` [Qemu-devel] [PATCH 15/21] virtio-net: replace qemu_sendv_packet_async() with qemu_sendv_packet_async_proxy() Yoshiaki Tamura
@ 2010-11-25 6:06 ` Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 17/21] savevm: introduce qemu_savevm_trans_{begin, commit} Yoshiaki Tamura
` (5 subsequent siblings)
21 siblings, 0 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-25 6:06 UTC (permalink / raw)
To: kvm, qemu-devel
Cc: aliguori, mtosatti, ananth, ohmura.kei, dlaor, vatsa,
Yoshiaki Tamura, avi, psuriset, stefanha
Replace qemu_send_packet() with qemu_send_packet_proxy() to
let event-tap capture events from e1000.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
hw/e1000.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)
diff --git a/hw/e1000.c b/hw/e1000.c
index 7811699..51a35ec 100644
--- a/hw/e1000.c
+++ b/hw/e1000.c
@@ -402,9 +402,9 @@ xmit_seg(E1000State *s)
memmove(tp->vlan, tp->data, 4);
memmove(tp->data, tp->data + 4, 8);
memcpy(tp->data + 8, tp->vlan_header, 4);
- qemu_send_packet(&s->nic->nc, tp->vlan, tp->size + 4);
+ qemu_send_packet_proxy(&s->nic->nc, tp->vlan, tp->size + 4);
} else
- qemu_send_packet(&s->nic->nc, tp->data, tp->size);
+ qemu_send_packet_proxy(&s->nic->nc, tp->data, tp->size);
s->mac_reg[TPT]++;
s->mac_reg[GPTC]++;
n = s->mac_reg[TOTL];
--
1.7.1.2
^ permalink raw reply related [flat|nested] 112+ messages in thread
* [Qemu-devel] [PATCH 17/21] savevm: introduce qemu_savevm_trans_{begin, commit}.
2010-11-25 6:06 [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Yoshiaki Tamura
` (15 preceding siblings ...)
2010-11-25 6:06 ` [Qemu-devel] [PATCH 16/21] e1000: replace qemu_send_packet() with qemu_send_packet_proxy() Yoshiaki Tamura
@ 2010-11-25 6:06 ` Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 18/21] migration: introduce migrate_ft_trans_{put, get}_ready(), and modify migrate_fd_put_ready() when ft_mode is on Yoshiaki Tamura
` (4 subsequent siblings)
21 siblings, 0 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-25 6:06 UTC (permalink / raw)
To: kvm, qemu-devel
Cc: aliguori, mtosatti, ananth, ohmura.kei, dlaor, vatsa,
Yoshiaki Tamura, avi, psuriset, stefanha
Introduce qemu_savevm_trans_{begin,complete} to send the memory and
device state together, while avoiding cancelling memory state tracking.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
savevm.c | 88 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
sysemu.h | 2 +
2 files changed, 90 insertions(+), 0 deletions(-)
diff --git a/savevm.c b/savevm.c
index afd4046..4e4be7c 100644
--- a/savevm.c
+++ b/savevm.c
@@ -1740,6 +1740,94 @@ int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f)
return 0;
}
+int qemu_savevm_trans_begin(Monitor *mon, QEMUFile *f, int init)
+{
+ SaveStateEntry *se;
+ int skipped = 0;
+
+ QTAILQ_FOREACH(se, &savevm_handlers, entry) {
+ int len, stage, ret;
+
+ if (se->save_live_state == NULL)
+ continue;
+
+ /* Section type */
+ qemu_put_byte(f, QEMU_VM_SECTION_START);
+ qemu_put_be32(f, se->section_id);
+
+ /* ID string */
+ len = strlen(se->idstr);
+ qemu_put_byte(f, len);
+ qemu_put_buffer(f, (uint8_t *)se->idstr, len);
+
+ qemu_put_be32(f, se->instance_id);
+ qemu_put_be32(f, se->version_id);
+
+ stage = init ? QEMU_VM_SECTION_START : QEMU_VM_SECTION_PART;
+ ret = se->save_live_state(mon, f, stage, se->opaque);
+ if (!ret) {
+ skipped++;
+ }
+ }
+
+ if (qemu_file_has_error(f))
+ return -EIO;
+
+ return skipped;
+}
+
+int qemu_savevm_trans_complete(Monitor *mon, QEMUFile *f)
+{
+ SaveStateEntry *se;
+
+ cpu_synchronize_all_states();
+
+ QTAILQ_FOREACH(se, &savevm_handlers, entry) {
+ int ret;
+
+ if (se->save_live_state == NULL)
+ continue;
+
+ /* Section type */
+ qemu_put_byte(f, QEMU_VM_SECTION_PART);
+ qemu_put_be32(f, se->section_id);
+
+ ret = se->save_live_state(mon, f, QEMU_VM_SECTION_PART, se->opaque);
+ if (!ret) {
+ /* do not proceed to the next vmstate. */
+ return 1;
+ }
+ }
+
+ QTAILQ_FOREACH(se, &savevm_handlers, entry) {
+ int len;
+
+ if (se->save_state == NULL && se->vmsd == NULL)
+ continue;
+
+ /* Section type */
+ qemu_put_byte(f, QEMU_VM_SECTION_FULL);
+ qemu_put_be32(f, se->section_id);
+
+ /* ID string */
+ len = strlen(se->idstr);
+ qemu_put_byte(f, len);
+ qemu_put_buffer(f, (uint8_t *)se->idstr, len);
+
+ qemu_put_be32(f, se->instance_id);
+ qemu_put_be32(f, se->version_id);
+
+ vmstate_save(f, se);
+ }
+
+ qemu_put_byte(f, QEMU_VM_EOF);
+
+ if (qemu_file_has_error(f))
+ return -EIO;
+
+ return 0;
+}
+
void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f)
{
SaveStateEntry *se;
diff --git a/sysemu.h b/sysemu.h
index 588548a..5516493 100644
--- a/sysemu.h
+++ b/sysemu.h
@@ -78,6 +78,8 @@ int qemu_savevm_state_begin(Monitor *mon, QEMUFile *f, int blk_enable,
int qemu_savevm_state_iterate(Monitor *mon, QEMUFile *f);
int qemu_savevm_state_complete(Monitor *mon, QEMUFile *f);
void qemu_savevm_state_cancel(Monitor *mon, QEMUFile *f);
+int qemu_savevm_trans_begin(Monitor *mon, QEMUFile *f, int init);
+int qemu_savevm_trans_complete(Monitor *mon, QEMUFile *f);
int qemu_loadvm_state(QEMUFile *f, int skip_header);
/* SLIRP */
--
1.7.1.2
^ permalink raw reply related [flat|nested] 112+ messages in thread
* [Qemu-devel] [PATCH 18/21] migration: introduce migrate_ft_trans_{put, get}_ready(), and modify migrate_fd_put_ready() when ft_mode is on.
2010-11-25 6:06 [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Yoshiaki Tamura
` (16 preceding siblings ...)
2010-11-25 6:06 ` [Qemu-devel] [PATCH 17/21] savevm: introduce qemu_savevm_trans_{begin, commit} Yoshiaki Tamura
@ 2010-11-25 6:06 ` Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 19/21] migration-tcp: modify tcp_accept_incoming_migration() to handle ft_mode, and add a hack not to close fd when ft_mode is enabled Yoshiaki Tamura
` (3 subsequent siblings)
21 siblings, 0 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-25 6:06 UTC (permalink / raw)
To: kvm, qemu-devel
Cc: aliguori, mtosatti, ananth, ohmura.kei, dlaor, vatsa,
Yoshiaki Tamura, avi, psuriset, stefanha
Introduce migrate_ft_trans_put_ready(), which kicks the FT transaction
cycle. When ft_mode is on, migrate_fd_put_ready() opens ft_trans_file
and turns on event_tap. To end or cancel the FT transaction, ft_mode
and event_tap are turned off. migrate_ft_trans_get_ready() is called
to receive the ack from the receiver.
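Roughly, the sender-side ft_mode cycle implemented below is the
following (an informal outline of the code, not a spec):

  FT_TRANSACTION_BEGIN  -> start a transaction: send the transaction
                           header and an initial chunk of dirty RAM
  FT_TRANSACTION_ITER   -> keep sending dirty RAM asynchronously until
                           max_downtime is exceeded or the RAM iteration
                           reports completion
  FT_TRANSACTION_COMMIT -> stop the VM, flush outstanding aio requests,
                           send the remaining RAM and the device state,
                           commit, then wait for the receiver's ack
                           (FT_TRANSACTION_RECV)
  FT_TRANSACTION_RECV   -> on ack, flush one tapped event; if the event
                           queue is then empty the VM is resumed and the
                           next cycle starts at FT_TRANSACTION_BEGIN,
                           otherwise another transaction is run back to
                           back (FT_TRANSACTION_ATOMIC) while the VM
                           stays stopped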
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
migration.c | 256 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++--
migration.h | 2 +-
2 files changed, 249 insertions(+), 9 deletions(-)
diff --git a/migration.c b/migration.c
index 40e4945..c03d660 100644
--- a/migration.c
+++ b/migration.c
@@ -21,6 +21,7 @@
#include "qemu_socket.h"
#include "block-migration.h"
#include "qemu-objects.h"
+#include "event-tap.h"
//#define DEBUG_MIGRATION
@@ -307,6 +308,20 @@ void migrate_fd_put_notify(void *opaque)
qemu_file_put_notify(s->file);
}
+static void migrate_fd_get_notify(void *opaque)
+{
+ FdMigrationState *s = opaque;
+
+ qemu_set_fd_handler2(s->fd, NULL, NULL, NULL, NULL);
+ qemu_file_get_notify(s->file);
+ if (qemu_file_has_error(s->file)) {
+ ft_mode = FT_ERROR;
+ qemu_savevm_state_cancel(s->mon, s->file);
+ migrate_fd_error(s);
+ event_tap_unregister();
+ }
+}
+
ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size)
{
FdMigrationState *s = opaque;
@@ -331,15 +346,20 @@ ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size)
return ret;
}
-int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, int size)
+int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, size_t size)
{
FdMigrationState *s = opaque;
- ssize_t ret;
+ int ret;
ret = s->read(s, data, size);
- if (ret == -1)
+ if (ret == -1) {
ret = -(s->get_error(s));
-
+ }
+
+ if (ret == -EAGAIN) {
+ qemu_set_fd_handler2(s->fd, NULL, migrate_fd_get_notify, NULL, s);
+ }
+
return ret;
}
@@ -366,6 +386,195 @@ void migrate_fd_connect(FdMigrationState *s)
migrate_fd_put_ready(s);
}
+static int migrate_ft_trans_commit(void *opaque)
+{
+ FdMigrationState *s = opaque;
+ int ret = -1;
+
+ if (ft_mode != FT_TRANSACTION_COMMIT && ft_mode != FT_TRANSACTION_ATOMIC) {
+ fprintf(stderr,
+ "migrate_ft_trans_commit: invalid ft_mode %d\n", ft_mode);
+ goto out;
+ }
+
+ do {
+ if (ft_mode == FT_TRANSACTION_ATOMIC) {
+ if (qemu_ft_trans_begin(s->file) < 0) {
+ fprintf(stderr, "qemu_ft_trans_begin failed\n");
+ goto out;
+ }
+
+ if ((ret = qemu_savevm_trans_begin(s->mon, s->file, 0)) < 0) {
+ fprintf(stderr, "qemu_savevm_trans_begin failed\n");
+ goto out;
+ }
+
+ ft_mode = FT_TRANSACTION_COMMIT;
+ if (ret) {
+ /* don't proceed until the fd is ready */
+ goto out;
+ }
+ }
+
+ /* make the VM state consistent by flushing outstanding events */
+ vm_stop(0);
+ qemu_aio_flush();
+ bdrv_flush_all();
+
+ if ((ret = qemu_savevm_trans_complete(s->mon, s->file)) < 0) {
+ fprintf(stderr, "qemu_savevm_trans_complete failed\n");
+ goto out;
+ }
+
+ if (ret) {
+ /* don't proceed until the fd is ready */
+ ret = 1;
+ goto out;
+ }
+
+ if ((ret = qemu_ft_trans_commit(s->file)) < 0) {
+ fprintf(stderr, "qemu_ft_trans_commit failed\n");
+ goto out;
+ }
+
+ if (ret) {
+ ft_mode = FT_TRANSACTION_RECV;
+ ret = 1;
+ goto out;
+ }
+
+ /* flush and check if events are remaining */
+ if ((ret = event_tap_flush_one()) < 0) {
+ fprintf(stderr, "event_tap_flush_one failed\n");
+ goto out;
+ }
+
+ ft_mode = ret ? FT_TRANSACTION_BEGIN : FT_TRANSACTION_ATOMIC;
+ } while (ft_mode != FT_TRANSACTION_BEGIN);
+
+ vm_start();
+ ret = 0;
+
+out:
+ return ret;
+}
+
+static int migrate_ft_trans_get_ready(void *opaque)
+{
+ FdMigrationState *s = opaque;
+ int ret = -1;
+
+ if (ft_mode != FT_TRANSACTION_RECV) {
+ fprintf(stderr,
+ "migrate_ft_trans_get_ready: invalid ft_mode %d\n", ft_mode);
+ goto error_out;
+ }
+
+ /* flush and check if events are remaining */
+ if ((ret = event_tap_flush_one()) < 0) {
+ fprintf(stderr, "event_tap_flush_one failed\n");
+ goto error_out;
+ }
+
+ if (ret) {
+ ft_mode = FT_TRANSACTION_BEGIN;
+ } else {
+ ft_mode = FT_TRANSACTION_ATOMIC;
+ if ((ret = migrate_ft_trans_commit(s)) < 0) {
+ goto error_out;
+ }
+ if (ret) {
+ goto out;
+ }
+ }
+
+ vm_start();
+ ret = 0;
+ goto out;
+
+error_out:
+ ft_mode = FT_ERROR;
+ qemu_savevm_state_cancel(s->mon, s->file);
+ migrate_fd_error(s);
+ event_tap_unregister();
+
+out:
+ return ret;
+}
+
+static int migrate_ft_trans_put_ready(void)
+{
+ FdMigrationState *s = migrate_to_fms(current_migration);
+ int ret = -1, init = 0, timeout;
+ static int64_t start, now;
+
+ switch (ft_mode) {
+ case FT_INIT:
+ init = 1;
+ case FT_TRANSACTION_BEGIN:
+ now = start = qemu_get_clock(vm_clock);
+ if (qemu_ft_trans_begin(s->file) < 0) {
+ fprintf(stderr, "qemu_transaction_begin failed\n");
+ goto error_out;
+ }
+
+ if ((ret = qemu_savevm_trans_begin(s->mon, s->file, init)) < 0) {
+ fprintf(stderr, "qemu_savevm_trans_begin\n");
+ goto error_out;
+ }
+
+ if (ret) {
+ ft_mode = FT_TRANSACTION_ITER;
+ } else {
+ ft_mode = FT_TRANSACTION_COMMIT;
+ if (migrate_ft_trans_commit(s) < 0) {
+ goto error_out;
+ }
+ }
+ break;
+
+ case FT_TRANSACTION_ITER:
+ now = qemu_get_clock(vm_clock);
+ timeout = ((now - start) >= max_downtime);
+ if (timeout || qemu_savevm_state_iterate(s->mon, s->file) == 1) {
+ DPRINTF("ft trans iter timeout %d\n", timeout);
+
+ ft_mode = FT_TRANSACTION_COMMIT;
+ if (migrate_ft_trans_commit(s) < 0) {
+ goto error_out;
+ }
+ return 1;
+ }
+
+ ft_mode = FT_TRANSACTION_ITER;
+ break;
+
+ case FT_TRANSACTION_ATOMIC:
+ case FT_TRANSACTION_COMMIT:
+ if (migrate_ft_trans_commit(s) < 0) {
+ goto error_out;
+ }
+ break;
+
+ default:
+ fprintf(stderr,
+ "migrate_ft_trans_put_ready: invalid ft_mode %d", ft_mode);
+ goto error_out;
+ }
+
+ ret = 0;
+ goto out;
+
+error_out:
+ ft_mode = FT_ERROR;
+ qemu_savevm_state_cancel(s->mon, s->file);
+ migrate_fd_error(s);
+ event_tap_unregister();
+
+out:
+ return ret;
+}
+
void migrate_fd_put_ready(void *opaque)
{
FdMigrationState *s = opaque;
@@ -393,13 +602,38 @@ void migrate_fd_put_ready(void *opaque)
} else {
state = MIG_STATE_COMPLETED;
}
- if (migrate_fd_cleanup(s) < 0) {
+
+ if (ft_mode && state == MIG_STATE_COMPLETED) {
+ /* close buffered_file and open ft_trans_file
+ * NB: fd won't get closed, and reused by ft_trans_file
+ */
+ qemu_fclose(s->file);
+
+ s->file = qemu_fopen_ops_ft_trans(s,
+ migrate_fd_put_buffer,
+ migrate_fd_get_buffer,
+ migrate_ft_trans_put_ready,
+ migrate_ft_trans_get_ready,
+ migrate_fd_wait_for_unfreeze,
+ migrate_fd_close,
+ 1);
+ socket_set_nodelay(s->fd);
+
+ /* events are tapped from now */
+ event_tap_register(migrate_ft_trans_put_ready);
+
if (old_vm_running) {
vm_start();
}
- state = MIG_STATE_ERROR;
+ } else {
+ if (migrate_fd_cleanup(s) < 0) {
+ if (old_vm_running) {
+ vm_start();
+ }
+ state = MIG_STATE_ERROR;
+ }
+ s->state = state;
}
- s->state = state;
}
}
@@ -419,8 +653,14 @@ void migrate_fd_cancel(MigrationState *mig_state)
DPRINTF("cancelling migration\n");
s->state = MIG_STATE_CANCELLED;
- qemu_savevm_state_cancel(s->mon, s->file);
+ if (ft_mode) {
+ qemu_ft_trans_cancel(s->file);
+ ft_mode = FT_OFF;
+ event_tap_unregister();
+ }
+
+ qemu_savevm_state_cancel(s->mon, s->file);
migrate_fd_cleanup(s);
}
diff --git a/migration.h b/migration.h
index f033262..7bf6747 100644
--- a/migration.h
+++ b/migration.h
@@ -116,7 +116,7 @@ void migrate_fd_put_notify(void *opaque);
ssize_t migrate_fd_put_buffer(void *opaque, const void *data, size_t size);
-int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, int size);
+int migrate_fd_get_buffer(void *opaque, uint8_t *data, int64_t pos, size_t size);
void migrate_fd_connect(FdMigrationState *s);
--
1.7.1.2
^ permalink raw reply related [flat|nested] 112+ messages in thread
* [Qemu-devel] [PATCH 19/21] migration-tcp: modify tcp_accept_incoming_migration() to handle ft_mode, and add a hack not to close fd when ft_mode is enabled.
2010-11-25 6:06 [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Yoshiaki Tamura
` (17 preceding siblings ...)
2010-11-25 6:06 ` [Qemu-devel] [PATCH 18/21] migration: introduce migrate_ft_trans_{put, get}_ready(), and modify migrate_fd_put_ready() when ft_mode is on Yoshiaki Tamura
@ 2010-11-25 6:06 ` Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 20/21] Introduce -k option to enable FT migration mode (Kemari) Yoshiaki Tamura
` (2 subsequent siblings)
21 siblings, 0 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-25 6:06 UTC (permalink / raw)
To: kvm, qemu-devel
Cc: aliguori, mtosatti, ananth, ohmura.kei, dlaor, vatsa,
Yoshiaki Tamura, avi, psuriset, stefanha
When ft_mode is set in the header, tcp_accept_incoming_migration()
sets ft_trans_incoming() as a callback, and calls
qemu_file_get_notify() to receive FT transactions iteratively. We also
need a hack not to close the fd before moving to ft_transaction mode,
so that we can reuse the fd for it.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
migration-tcp.c | 43 ++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 42 insertions(+), 1 deletions(-)
diff --git a/migration-tcp.c b/migration-tcp.c
index 96e2411..669f9f8 100644
--- a/migration-tcp.c
+++ b/migration-tcp.c
@@ -18,6 +18,7 @@
#include "sysemu.h"
#include "buffered_file.h"
#include "block.h"
+#include "ft_trans_file.h"
//#define DEBUG_MIGRATION_TCP
@@ -56,7 +57,8 @@ static int socket_read(FdMigrationState *s, const void * buf, size_t size)
static int tcp_close(FdMigrationState *s)
{
DPRINTF("tcp_close\n");
- if (s->fd != -1) {
+ /* FIX ME: accessing ft_mode here isn't clean */
+ if (s->fd != -1 && ft_mode != FT_INIT) {
close(s->fd);
s->fd = -1;
}
@@ -150,6 +152,16 @@ MigrationState *tcp_start_outgoing_migration(Monitor *mon,
return &s->mig_state;
}
+static void ft_trans_incoming(void *opaque) {
+ QEMUFile *f = opaque;
+
+ qemu_file_get_notify(f);
+ if (qemu_file_has_error(f)) {
+ ft_mode = FT_ERROR;
+ qemu_fclose(f);
+ }
+}
+
static void tcp_accept_incoming_migration(void *opaque)
{
struct sockaddr_in addr;
@@ -175,8 +187,37 @@ static void tcp_accept_incoming_migration(void *opaque)
goto out;
}
+ if (ft_mode == FT_INIT) {
+ autostart = 0;
+ }
+
process_incoming_migration(f);
+
+ if (ft_mode == FT_INIT) {
+ int ret;
+
+ socket_set_nodelay(c);
+
+ f = qemu_fopen_ft_trans(s, c);
+ if (f == NULL) {
+ fprintf(stderr, "could not qemu_fopen_ft_trans\n");
+ goto out;
+ }
+
+ /* need to wait for the sender to set up */
+ ret = qemu_ft_trans_begin(f);
+ if (ret < 0) {
+ goto out;
+ }
+
+ qemu_set_fd_handler2(c, NULL, ft_trans_incoming, NULL, f);
+ ft_mode = FT_TRANSACTION_RECV;
+
+ return;
+ }
+
qemu_fclose(f);
+
out:
close(c);
out2:
--
1.7.1.2
^ permalink raw reply related [flat|nested] 112+ messages in thread
* [Qemu-devel] [PATCH 20/21] Introduce -k option to enable FT migration mode (Kemari).
2010-11-25 6:06 [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Yoshiaki Tamura
` (18 preceding siblings ...)
2010-11-25 6:06 ` [Qemu-devel] [PATCH 19/21] migration-tcp: modify tcp_accept_incoming_migration() to handle ft_mode, and add a hack not to close fd when ft_mode is enabled Yoshiaki Tamura
@ 2010-11-25 6:06 ` Yoshiaki Tamura
2010-11-25 6:07 ` [Qemu-devel] [PATCH 21/21] migration: add a parser to accept FT migration incoming mode Yoshiaki Tamura
2010-11-26 18:39 ` [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Blue Swirl
21 siblings, 0 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-25 6:06 UTC (permalink / raw)
To: kvm, qemu-devel
Cc: aliguori, mtosatti, ananth, ohmura.kei, dlaor, vatsa,
Yoshiaki Tamura, avi, psuriset, stefanha
When the -k option is passed to the migrate command, ft_mode is turned
on to start FT migration mode (Kemari).
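For example, FT mode would be started from the monitor on the primary
with something like the following (the secondary host name and port are
placeholders):

  (qemu) migrate -d -k tcp:<secondary-host>:4444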
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
hmp-commands.hx | 7 ++++---
migration.c | 3 +++
qmp-commands.hx | 7 ++++---
3 files changed, 11 insertions(+), 6 deletions(-)
diff --git a/hmp-commands.hx b/hmp-commands.hx
index e5585ba..75b11ca 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -717,13 +717,14 @@ ETEXI
{
.name = "migrate",
- .args_type = "detach:-d,blk:-b,inc:-i,uri:s",
- .params = "[-d] [-b] [-i] uri",
+ .args_type = "detach:-d,blk:-b,inc:-i,ft:-k,uri:s",
+ .params = "[-d] [-b] [-i] [-k] uri",
.help = "migrate to URI (using -d to not wait for completion)"
"\n\t\t\t -b for migration without shared storage with"
" full copy of disk\n\t\t\t -i for migration without "
"shared storage with incremental copy of disk "
- "(base image shared between src and destination)",
+ "(base image shared between src and destination)"
+ "\n\t\t\t -k for FT migration mode (Kemari)",
.user_print = monitor_user_noop,
.mhandler.cmd_new = do_migrate,
},
diff --git a/migration.c b/migration.c
index c03d660..f0cfa37 100644
--- a/migration.c
+++ b/migration.c
@@ -92,6 +92,9 @@ int do_migrate(Monitor *mon, const QDict *qdict, QObject **ret_data)
return -1;
}
+ if (qdict_get_try_bool(qdict, "ft", 0))
+ ft_mode = FT_INIT;
+
if (strstart(uri, "tcp:", &p)) {
s = tcp_start_outgoing_migration(mon, p, max_throttle, detach,
blk, inc);
diff --git a/qmp-commands.hx b/qmp-commands.hx
index 793cf1c..0698e4f 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -431,13 +431,14 @@ EQMP
{
.name = "migrate",
- .args_type = "detach:-d,blk:-b,inc:-i,uri:s",
- .params = "[-d] [-b] [-i] uri",
+ .args_type = "detach:-d,blk:-b,inc:-i,ft:-k,uri:s",
+ .params = "[-d] [-b] [-i] [-k] uri",
.help = "migrate to URI (using -d to not wait for completion)"
"\n\t\t\t -b for migration without shared storage with"
" full copy of disk\n\t\t\t -i for migration without "
"shared storage with incremental copy of disk "
- "(base image shared between src and destination)",
+ "(base image shared between src and destination)"
+ "\n\t\t\t -k for FT migration mode (Kemari)",
.user_print = monitor_user_noop,
.mhandler.cmd_new = do_migrate,
},
--
1.7.1.2
^ permalink raw reply related [flat|nested] 112+ messages in thread
* [Qemu-devel] [PATCH 21/21] migration: add a parser to accept FT migration incoming mode.
2010-11-25 6:06 [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Yoshiaki Tamura
` (19 preceding siblings ...)
2010-11-25 6:06 ` [Qemu-devel] [PATCH 20/21] Introduce -k option to enable FT migration mode (Kemari) Yoshiaki Tamura
@ 2010-11-25 6:07 ` Yoshiaki Tamura
2010-11-26 18:39 ` [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Blue Swirl
21 siblings, 0 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-25 6:07 UTC (permalink / raw)
To: kvm, qemu-devel
Cc: aliguori, mtosatti, ananth, ohmura.kei, dlaor, vatsa,
Yoshiaki Tamura, avi, psuriset, stefanha
The option looks like: -incoming <protocol>:<address>:<port>,ft_mode
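For example, the secondary (receiver) side would be launched with
something like the following, where the port is a placeholder and the
rest of the command line is elided:

  qemu-system-x86_64 ... -incoming tcp:0:4444,ft_mode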
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
---
migration.c | 9 ++++++++-
1 files changed, 8 insertions(+), 1 deletions(-)
diff --git a/migration.c b/migration.c
index f0cfa37..58f1158 100644
--- a/migration.c
+++ b/migration.c
@@ -42,9 +42,16 @@ static MigrationState *current_migration;
int qemu_start_incoming_migration(const char *uri)
{
- const char *p;
+ const char *p = uri;
int ret;
+ /* check ft_mode option */
+ if ((p = strstr(p, "ft_mode"))) {
+ if (!strcmp(p, "ft_mode")) {
+ ft_mode = FT_INIT;
+ }
+ }
+
if (strstart(uri, "tcp:", &p))
ret = tcp_start_incoming_migration(p);
#if !defined(WIN32)
--
1.7.1.2
^ permalink raw reply related [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
2010-11-25 6:06 [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Yoshiaki Tamura
` (20 preceding siblings ...)
2010-11-25 6:07 ` [Qemu-devel] [PATCH 21/21] migration: add a parser to accept FT migration incoming mode Yoshiaki Tamura
@ 2010-11-26 18:39 ` Blue Swirl
2010-11-27 4:29 ` Yoshiaki Tamura
21 siblings, 1 reply; 112+ messages in thread
From: Blue Swirl @ 2010-11-26 18:39 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: ohmura.kei, dlaor, ananth, kvm, mtosatti, aliguori, qemu-devel,
avi, vatsa, psuriset, stefanha
On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura
<tamura.yoshiaki@lab.ntt.co.jp> wrote:
> Hi,
>
> This patch series is a revised version of Kemari for KVM, which
> applied comments for the previous post and KVM Forum 2010. The
> current code is based on qemu.git
> f711df67d611e4762966a249742a5f7499e19f99.
>
> For general information about Kemari, I've made a wiki page at
> qemu.org.
>
> http://wiki.qemu.org/Features/FaultTolerance
>
> The changes from v0.1.1 -> v0.2 are:
>
> - Introduce a queue in event-tap to make VM sync live.
> - Change transaction receiver to a state machine for async receiving.
> - Replace net/block layer functions with event-tap proxy functions.
> - Remove dirty bitmap optimization for now.
> - convert DPRINTF() in ft_trans_file to trace functions.
> - convert fprintf() in ft_trans_file to error_report().
> - improved error handling in ft_trans_file.
> - add a tmp pointer to qemu_del_vm_change_state_handler.
>
> The changes from v0.1 -> v0.1.1 are:
>
> - events are tapped in net/block layer instead of device emulation layer.
> - Introduce a new option for -incoming to accept FT transaction.
> - Removed writev() support to QEMUFile and FdMigrationState for now. I would
> post this work in a different series.
> - Modified virtio-blk save/load handler to send inuse variable to
> correctly replay.
> - Removed configure --enable-ft-mode.
> - Removed unnecessary check for qemu_realloc().
>
> The first 6 patches modify several functions of qemu to prepare
> introducing Kemari specific components.
>
> The next 6 patches are the components of Kemari. They introduce
> event-tap and the FT transaction protocol file based on buffered file.
> The design document of FT transaction protocol can be found at,
> http://wiki.qemu.org/images/b/b1/Kemari_sender_receiver_0.5a.pdf
>
> Then the following 4 patches modifies dma-helpers, virtio-blk
> virtio-net and e1000 to replace net/block layer functions with
> event-tap proxy functions. Please note that if Kemari is off,
> event-tap will just passthrough, and there is most no intrusion to
> exisiting functions including normal live migration.
Would it be possible to make the changes only in the block/net layer,
so that the devices are not modified at all? That is, the proxy
function would always replace the unproxied version.
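Something along these lines, purely as a sketch to illustrate the idea
(both helper names below are made up; this ignores how it would build,
it's just to show the device call sites staying unchanged):

  BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
                                    QEMUIOVector *qiov, int nb_sectors,
                                    BlockDriverCompletionFunc *cb,
                                    void *opaque)
  {
      /* block.c itself is the tap point; hw/* keeps calling
       * bdrv_aio_writev() and never needs a _proxy variant */
      if (event_tap_get_state() == EVENT_TAP_ON) {
          return bdrv_event_tap_aio_writev(bs, sector_num, qiov, nb_sectors,
                                           cb, opaque);
      }
      return bdrv_aio_writev_unproxied(bs, sector_num, qiov, nb_sectors,
                                       cb, opaque);
  }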
Somehow I find some similarities to instrumentation patches. Perhaps
the instrumentation framework could be used (maybe with some changes)
for Kemari as well? That could be beneficial to both.
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
2010-11-26 18:39 ` [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Blue Swirl
@ 2010-11-27 4:29 ` Yoshiaki Tamura
2010-11-27 7:23 ` Stefan Hajnoczi
0 siblings, 1 reply; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-27 4:29 UTC (permalink / raw)
To: Blue Swirl, stefanha
Cc: ohmura.kei, dlaor, ananth, kvm, mtosatti, aliguori, qemu-devel,
avi, vatsa, psuriset
2010/11/27 Blue Swirl <blauwirbel@gmail.com>:
> On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura
> <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>> Hi,
>>
>> This patch series is a revised version of Kemari for KVM, which
>> applied comments for the previous post and KVM Forum 2010. The
>> current code is based on qemu.git
>> f711df67d611e4762966a249742a5f7499e19f99.
>>
>> For general information about Kemari, I've made a wiki page at
>> qemu.org.
>>
>> http://wiki.qemu.org/Features/FaultTolerance
>>
>> The changes from v0.1.1 -> v0.2 are:
>>
>> - Introduce a queue in event-tap to make VM sync live.
>> - Change transaction receiver to a state machine for async receiving.
>> - Replace net/block layer functions with event-tap proxy functions.
>> - Remove dirty bitmap optimization for now.
>> - convert DPRINTF() in ft_trans_file to trace functions.
>> - convert fprintf() in ft_trans_file to error_report().
>> - improved error handling in ft_trans_file.
>> - add a tmp pointer to qemu_del_vm_change_state_handler.
>>
>> The changes from v0.1 -> v0.1.1 are:
>>
>> - events are tapped in net/block layer instead of device emulation layer.
>> - Introduce a new option for -incoming to accept FT transaction.
>> - Removed writev() support to QEMUFile and FdMigrationState for now. I would
>> post this work in a different series.
>> - Modified virtio-blk save/load handler to send inuse variable to
>> correctly replay.
>> - Removed configure --enable-ft-mode.
>> - Removed unnecessary check for qemu_realloc().
>>
>> The first 6 patches modify several functions of qemu to prepare
>> introducing Kemari specific components.
>>
>> The next 6 patches are the components of Kemari. They introduce
>> event-tap and the FT transaction protocol file based on buffered file.
>> The design document of FT transaction protocol can be found at,
>> http://wiki.qemu.org/images/b/b1/Kemari_sender_receiver_0.5a.pdf
>>
>> Then the following 4 patches modifies dma-helpers, virtio-blk
>> virtio-net and e1000 to replace net/block layer functions with
>> event-tap proxy functions. Please note that if Kemari is off,
>> event-tap will just passthrough, and there is most no intrusion to
>> exisiting functions including normal live migration.
>
> Would it be possible to make the changes only in the block/net layer,
> so that the devices are not modified at all? That is, the proxy
> function would always replaces the unproxied version.
I understand the benefit of your suggestion. However, it seems a bit
tricky: event-tap uses functions from the emulators and the net layer,
but block.c is also linked into utilities like qemu-img that don't
need emulators or net. In the previous version, I added function
pointers to get around this.
http://lists.nongnu.org/archive/html/qemu-devel/2010-05/msg02378.html
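(Roughly, the function-pointer approach looked something like this --
just a sketch, not the actual code from that series, and the names
here are made up:)

  /* block.c: a settable hook, so qemu-img never links against event-tap */
  typedef BlockDriverAIOCB *BdrvAioWritevHook(BlockDriverState *bs,
                                              int64_t sector_num,
                                              QEMUIOVector *qiov,
                                              int nb_sectors,
                                              BlockDriverCompletionFunc *cb,
                                              void *opaque);

  static BdrvAioWritevHook *aio_writev_hook; /* set by event-tap at startup */

  void bdrv_set_aio_writev_hook(BdrvAioWritevHook *hook)
  {
      aio_writev_hook = hook;
  }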
I wasn't confident in this approach, so I discussed it at KVM Forum and
decided to try replacing the emulator functions with proxies.
Suggestions are of course welcome.
> Somehow I find some similarities to instrumentation patches. Perhaps
> the instrumentation framework could be used (maybe with some changes)
> for Kemari as well? That could be beneficial to both.
Yes. I had the same idea but I'm not sure how tracing works. I think
Stefan Hajnoczi knows it better.
Stefan, is it possible to call arbitrary functions from the trace
points?
> --
> To unsubscribe from this list: send the line "unsubscribe kvm" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
2010-11-27 4:29 ` Yoshiaki Tamura
@ 2010-11-27 7:23 ` Stefan Hajnoczi
2010-11-27 8:53 ` Yoshiaki Tamura
2010-11-27 11:20 ` Paul Brook
0 siblings, 2 replies; 112+ messages in thread
From: Stefan Hajnoczi @ 2010-11-27 7:23 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: ohmura.kei, mtosatti, stefanha, kvm, dlaor, aliguori, qemu-devel,
Blue Swirl, avi, vatsa, psuriset, ananth
On Sat, Nov 27, 2010 at 4:29 AM, Yoshiaki Tamura
<tamura.yoshiaki@lab.ntt.co.jp> wrote:
> 2010/11/27 Blue Swirl <blauwirbel@gmail.com>:
>> On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura
>> <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>>> Hi,
>>>
>>> This patch series is a revised version of Kemari for KVM, which
>>> applied comments for the previous post and KVM Forum 2010. The
>>> current code is based on qemu.git
>>> f711df67d611e4762966a249742a5f7499e19f99.
>>>
>>> For general information about Kemari, I've made a wiki page at
>>> qemu.org.
>>>
>>> http://wiki.qemu.org/Features/FaultTolerance
>>>
>>> The changes from v0.1.1 -> v0.2 are:
>>>
>>> - Introduce a queue in event-tap to make VM sync live.
>>> - Change transaction receiver to a state machine for async receiving.
>>> - Replace net/block layer functions with event-tap proxy functions.
>>> - Remove dirty bitmap optimization for now.
>>> - convert DPRINTF() in ft_trans_file to trace functions.
>>> - convert fprintf() in ft_trans_file to error_report().
>>> - improved error handling in ft_trans_file.
>>> - add a tmp pointer to qemu_del_vm_change_state_handler.
>>>
>>> The changes from v0.1 -> v0.1.1 are:
>>>
>>> - events are tapped in net/block layer instead of device emulation layer.
>>> - Introduce a new option for -incoming to accept FT transaction.
>>> - Removed writev() support to QEMUFile and FdMigrationState for now. I would
>>> post this work in a different series.
>>> - Modified virtio-blk save/load handler to send inuse variable to
>>> correctly replay.
>>> - Removed configure --enable-ft-mode.
>>> - Removed unnecessary check for qemu_realloc().
>>>
>>> The first 6 patches modify several functions of qemu to prepare
>>> introducing Kemari specific components.
>>>
>>> The next 6 patches are the components of Kemari. They introduce
>>> event-tap and the FT transaction protocol file based on buffered file.
>>> The design document of FT transaction protocol can be found at,
>>> http://wiki.qemu.org/images/b/b1/Kemari_sender_receiver_0.5a.pdf
>>>
>>> Then the following 4 patches modifies dma-helpers, virtio-blk
>>> virtio-net and e1000 to replace net/block layer functions with
>>> event-tap proxy functions. Please note that if Kemari is off,
>>> event-tap will just passthrough, and there is most no intrusion to
>>> exisiting functions including normal live migration.
>>
>> Would it be possible to make the changes only in the block/net layer,
>> so that the devices are not modified at all? That is, the proxy
>> function would always replaces the unproxied version.
>
> I understand the benefit of your suggestion. However it seems a bit
> tricky. It's because event-tap uses functions of emulators and net,
> but block.c is also linked for utilities like qemu-img that doesn't
> need emulators or net. In the previous version, I added function
> pointers to get around.
>
> http://lists.nongnu.org/archive/html/qemu-devel/2010-05/msg02378.html
>
> I wasn't confident of this approach and discussed it at KVM Forum, and
> decided to give a try to replace emulator functions with proxies.
> Suggestions are welcomed of course.
>
>> Somehow I find some similarities to instrumentation patches. Perhaps
>> the instrumentation framework could be used (maybe with some changes)
>> for Kemari as well? That could be beneficial to both.
>
> Yes. I had the same idea but I'm not sure how tracing works. I think
> Stefan Hajnoczi knows it better.
>
> Stefan, is it possible to call arbitrary functions from the trace
> points?
Yes, if you add code to ./tracetool. I'm not sure I see the
connection between Kemari and tracing though.
One question I have about Kemari is whether it adds new constraints to
the QEMU codebase? Fault tolerance seems like a cross-cutting concern
- everyone writing device emulation or core QEMU code may need to be
aware of new constraints. For example, "you are not allowed to
release I/O operations to the outside world directly, instead you need
to go through Kemari code which makes I/O transactional and
communicates with the passive host". You have converted e1000,
virtio-net, and virtio-blk. How do we make sure new devices that are
merged into qemu.git don't break Kemari? How do we go about
supporting the existing hw/* devices?
Stefan
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
2010-11-27 7:23 ` Stefan Hajnoczi
@ 2010-11-27 8:53 ` Yoshiaki Tamura
2010-11-27 11:03 ` Blue Swirl
2010-11-27 11:54 ` Stefan Hajnoczi
2010-11-27 11:20 ` Paul Brook
1 sibling, 2 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-27 8:53 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: ohmura.kei, mtosatti, stefanha, kvm, dlaor, aliguori, qemu-devel,
Blue Swirl, avi, vatsa, psuriset, ananth
2010/11/27 Stefan Hajnoczi <stefanha@gmail.com>:
> On Sat, Nov 27, 2010 at 4:29 AM, Yoshiaki Tamura
> <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>> 2010/11/27 Blue Swirl <blauwirbel@gmail.com>:
>>> On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura
>>> <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>>>> Hi,
>>>>
>>>> This patch series is a revised version of Kemari for KVM, which
>>>> applied comments for the previous post and KVM Forum 2010. The
>>>> current code is based on qemu.git
>>>> f711df67d611e4762966a249742a5f7499e19f99.
>>>>
>>>> For general information about Kemari, I've made a wiki page at
>>>> qemu.org.
>>>>
>>>> http://wiki.qemu.org/Features/FaultTolerance
>>>>
>>>> The changes from v0.1.1 -> v0.2 are:
>>>>
>>>> - Introduce a queue in event-tap to make VM sync live.
>>>> - Change transaction receiver to a state machine for async receiving.
>>>> - Replace net/block layer functions with event-tap proxy functions.
>>>> - Remove dirty bitmap optimization for now.
>>>> - convert DPRINTF() in ft_trans_file to trace functions.
>>>> - convert fprintf() in ft_trans_file to error_report().
>>>> - improved error handling in ft_trans_file.
>>>> - add a tmp pointer to qemu_del_vm_change_state_handler.
>>>>
>>>> The changes from v0.1 -> v0.1.1 are:
>>>>
>>>> - events are tapped in net/block layer instead of device emulation layer.
>>>> - Introduce a new option for -incoming to accept FT transaction.
>>>> - Removed writev() support to QEMUFile and FdMigrationState for now. I would
>>>> post this work in a different series.
>>>> - Modified virtio-blk save/load handler to send inuse variable to
>>>> correctly replay.
>>>> - Removed configure --enable-ft-mode.
>>>> - Removed unnecessary check for qemu_realloc().
>>>>
>>>> The first 6 patches modify several functions of qemu to prepare
>>>> introducing Kemari specific components.
>>>>
>>>> The next 6 patches are the components of Kemari. They introduce
>>>> event-tap and the FT transaction protocol file based on buffered file.
>>>> The design document of FT transaction protocol can be found at,
>>>> http://wiki.qemu.org/images/b/b1/Kemari_sender_receiver_0.5a.pdf
>>>>
>>>> Then the following 4 patches modifies dma-helpers, virtio-blk
>>>> virtio-net and e1000 to replace net/block layer functions with
>>>> event-tap proxy functions. Please note that if Kemari is off,
>>>> event-tap will just passthrough, and there is most no intrusion to
>>>> exisiting functions including normal live migration.
>>>
>>> Would it be possible to make the changes only in the block/net layer,
>>> so that the devices are not modified at all? That is, the proxy
>>> function would always replaces the unproxied version.
>>
>> I understand the benefit of your suggestion. However it seems a bit
>> tricky. It's because event-tap uses functions of emulators and net,
>> but block.c is also linked for utilities like qemu-img that doesn't
>> need emulators or net. In the previous version, I added function
>> pointers to get around.
>>
>> http://lists.nongnu.org/archive/html/qemu-devel/2010-05/msg02378.html
>>
>> I wasn't confident of this approach and discussed it at KVM Forum, and
>> decided to give a try to replace emulator functions with proxies.
>> Suggestions are welcomed of course.
>>
>>> Somehow I find some similarities to instrumentation patches. Perhaps
>>> the instrumentation framework could be used (maybe with some changes)
>>> for Kemari as well? That could be beneficial to both.
>>
>> Yes. I had the same idea but I'm not sure how tracing works. I think
>> Stefan Hajnoczi knows it better.
>>
>> Stefan, is it possible to call arbitrary functions from the trace
>> points?
>
> Yes, if you add code to ./tracetool. I'm not sure I see the
> connection between Kemari and tracing though.
The connection is that it may be possible to remove the Kemari-specific
hook points, like those in ioport.c and exec.c, and let tracing
notify Kemari instead.
> One question I have about Kemari is whether it adds new constraints to
> the QEMU codebase? Fault tolerance seems like a cross-cutting concern
> - everyone writing device emulation or core QEMU code may need to be
> aware of new constraints. For example, "you are not allowed to
> release I/O operations to the outside world directly, instead you need
> to go through Kemari code which makes I/O transactional and
> communicates with the passive host". You have converted e1000,
> virtio-net, and virtio-blk. How do we make sure new devices that are
> merged into qemu.git don't break Kemari? How do we go about
> supporting the existing hw/* devices?
As to whether Kemari adds constraints such as you mentioned: yes. If
the devices (including existing ones) don't call Kemari code,
they would certainly break Kemari. Although using proxies has the
advantage of being explicit, to keep people writing device emulation
from having to be aware of Kemari, it's possible to remove the proxies
and put the changes only into the block/net layer, as Blue suggested.
Yoshi
> Stefan
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
2010-11-27 8:53 ` Yoshiaki Tamura
@ 2010-11-27 11:03 ` Blue Swirl
2010-11-27 12:21 ` Yoshiaki Tamura
2010-11-27 11:54 ` Stefan Hajnoczi
1 sibling, 1 reply; 112+ messages in thread
From: Blue Swirl @ 2010-11-27 11:03 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: ohmura.kei, mtosatti, stefanha, kvm, Stefan Hajnoczi, dlaor,
aliguori, qemu-devel, avi, vatsa, psuriset, ananth
On Sat, Nov 27, 2010 at 8:53 AM, Yoshiaki Tamura
<tamura.yoshiaki@lab.ntt.co.jp> wrote:
> 2010/11/27 Stefan Hajnoczi <stefanha@gmail.com>:
>> On Sat, Nov 27, 2010 at 4:29 AM, Yoshiaki Tamura
>> <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>>> 2010/11/27 Blue Swirl <blauwirbel@gmail.com>:
>>>> On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura
>>>> <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>>>>> Hi,
>>>>>
>>>>> This patch series is a revised version of Kemari for KVM, which
>>>>> applied comments for the previous post and KVM Forum 2010. The
>>>>> current code is based on qemu.git
>>>>> f711df67d611e4762966a249742a5f7499e19f99.
>>>>>
>>>>> For general information about Kemari, I've made a wiki page at
>>>>> qemu.org.
>>>>>
>>>>> http://wiki.qemu.org/Features/FaultTolerance
>>>>>
>>>>> The changes from v0.1.1 -> v0.2 are:
>>>>>
>>>>> - Introduce a queue in event-tap to make VM sync live.
>>>>> - Change transaction receiver to a state machine for async receiving.
>>>>> - Replace net/block layer functions with event-tap proxy functions.
>>>>> - Remove dirty bitmap optimization for now.
>>>>> - convert DPRINTF() in ft_trans_file to trace functions.
>>>>> - convert fprintf() in ft_trans_file to error_report().
>>>>> - improved error handling in ft_trans_file.
>>>>> - add a tmp pointer to qemu_del_vm_change_state_handler.
>>>>>
>>>>> The changes from v0.1 -> v0.1.1 are:
>>>>>
>>>>> - events are tapped in net/block layer instead of device emulation layer.
>>>>> - Introduce a new option for -incoming to accept FT transaction.
>>>>> - Removed writev() support to QEMUFile and FdMigrationState for now. I would
>>>>> post this work in a different series.
>>>>> - Modified virtio-blk save/load handler to send inuse variable to
>>>>> correctly replay.
>>>>> - Removed configure --enable-ft-mode.
>>>>> - Removed unnecessary check for qemu_realloc().
>>>>>
>>>>> The first 6 patches modify several functions of qemu to prepare
>>>>> introducing Kemari specific components.
>>>>>
>>>>> The next 6 patches are the components of Kemari. They introduce
>>>>> event-tap and the FT transaction protocol file based on buffered file.
>>>>> The design document of FT transaction protocol can be found at,
>>>>> http://wiki.qemu.org/images/b/b1/Kemari_sender_receiver_0.5a.pdf
>>>>>
>>>>> Then the following 4 patches modifies dma-helpers, virtio-blk
>>>>> virtio-net and e1000 to replace net/block layer functions with
>>>>> event-tap proxy functions. Please note that if Kemari is off,
>>>>> event-tap will just passthrough, and there is most no intrusion to
>>>>> exisiting functions including normal live migration.
>>>>
>>>> Would it be possible to make the changes only in the block/net layer,
>>>> so that the devices are not modified at all? That is, the proxy
>>>> function would always replaces the unproxied version.
>>>
>>> I understand the benefit of your suggestion. However it seems a bit
>>> tricky. It's because event-tap uses functions of emulators and net,
>>> but block.c is also linked for utilities like qemu-img that doesn't
>>> need emulators or net. In the previous version, I added function
>>> pointers to get around.
>>>
>>> http://lists.nongnu.org/archive/html/qemu-devel/2010-05/msg02378.html
>>>
>>> I wasn't confident of this approach and discussed it at KVM Forum, and
>>> decided to give a try to replace emulator functions with proxies.
>>> Suggestions are welcomed of course.
>>>
>>>> Somehow I find some similarities to instrumentation patches. Perhaps
>>>> the instrumentation framework could be used (maybe with some changes)
>>>> for Kemari as well? That could be beneficial to both.
>>>
>>> Yes. I had the same idea but I'm not sure how tracing works. I think
>>> Stefan Hajnoczi knows it better.
>>>
>>> Stefan, is it possible to call arbitrary functions from the trace
>>> points?
>>
>> Yes, if you add code to ./tracetool. I'm not sure I see the
>> connection between Kemari and tracing though.
>
> The connection is that it may be possible to remove Kemari
> specific hook point like in ioport.c and exec.c, and let tracing
> notify Kemari instead.
This all depends on how generic we want the trace points to become.
One possible extension to the event injection or instrumentation could
be fault injection: based on some rule, make the instrumented function
return an error. That would be interesting for testing how the guest
handles failure cases.
Maybe it should also be possible to handle event injection in a
generic way: split the instrumented function in two, before and after
the tracepoint, and have the tracepoint register the tail function in
addition to the parameters. This may require a lot of refactoring though.
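A rough sketch of that split, with hypothetical names (event_inject_hook()
and the *_tail/*_split functions are illustrations only, not existing QEMU
or tracetool code):

    #include <stdbool.h>
    #include <stdint.h>

    typedef void IOPortWriteTail(void *opaque, uint32_t addr, uint32_t val);

    /* Returns true if a registered hook took over the tail call, e.g. to
     * queue it, rewrite its arguments, or inject a failure. */
    bool event_inject_hook(IOPortWriteTail *tail, void *opaque,
                           uint32_t addr, uint32_t val);

    static void ioport_write_tail(void *opaque, uint32_t addr, uint32_t val)
    {
        /* the part of the original function with externally visible effects */
    }

    static void ioport_write_split(void *opaque, uint32_t addr, uint32_t val)
    {
        /* head: argument checks only, no side effects yet */
        if (!event_inject_hook(ioport_write_tail, opaque, addr, val)) {
            /* no hook registered, or the hook declined: run the tail now */
            ioport_write_tail(opaque, addr, val);
        }
    }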
>> One question I have about Kemari is whether it adds new constraints to
>> the QEMU codebase? Fault tolerance seems like a cross-cutting concern
>> - everyone writing device emulation or core QEMU code may need to be
>> aware of new constraints. For example, "you are not allowed to
>> release I/O operations to the outside world directly, instead you need
>> to go through Kemari code which makes I/O transactional and
>> communicates with the passive host". You have converted e1000,
>> virtio-net, and virtio-blk. How do we make sure new devices that are
>> merged into qemu.git don't break Kemari? How do we go about
>> supporting the existing hw/* devices?
>
> Whether Kemari adds constraints such as you mentioned, yes. If
> the devices (including existing ones) don't call Kemari code,
> they would certainly break Kemari. Altough using proxies looks
> explicit, to make it unaware from people writing device
> emulation, it's possible to remove proxies and put changes only
> into the block/net layer as Blue suggested.
I'd prefer that approach if possible.
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
2010-11-27 7:23 ` Stefan Hajnoczi
2010-11-27 8:53 ` Yoshiaki Tamura
@ 2010-11-27 11:20 ` Paul Brook
2010-11-27 12:35 ` Yoshiaki Tamura
1 sibling, 1 reply; 112+ messages in thread
From: Paul Brook @ 2010-11-27 11:20 UTC (permalink / raw)
To: qemu-devel
Cc: ohmura.kei, dlaor, stefanha, kvm, Stefan Hajnoczi, mtosatti,
Yoshiaki Tamura, vatsa, Blue Swirl, aliguori, ananth, psuriset,
avi
> One question I have about Kemari is whether it adds new constraints to
> the QEMU codebase? Fault tolerance seems like a cross-cutting concern
> - everyone writing device emulation or core QEMU code may need to be
> aware of new constraints. For example, "you are not allowed to
> release I/O operations to the outside world directly, instead you need
> to go through Kemari code which makes I/O transactional and
> communicates with the passive host". You have converted e1000,
> virtio-net, and virtio-blk. How do we make sure new devices that are
> merged into qemu.git don't break Kemari? How do we go about
> supporting the existing hw/* devices?
IMO anything that requires devices to act differently is wrong. All external
I/O already goes through a common API (e.g. qemu_send_packet). You should be
putting your transaction code there, not hacking individual devices.
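One way to read that suggestion, as a minimal sketch: keep qemu_send_packet()'s
device-facing signature and branch to the Kemari path inside the net layer.
The signature is only approximate for the 2010-era tree, and event_tap_is_on(),
event_tap_send_packet() and qemu_send_packet_real() are hypothetical names:

    /* sketch: the tap lives in the common net layer, invisible to devices */
    ssize_t qemu_send_packet(VLANClientState *vc, const uint8_t *buf, int size)
    {
        if (event_tap_is_on()) {
            /* queue the packet as part of the current FT transaction instead
             * of releasing it to the outside world right away */
            return event_tap_send_packet(vc, buf, size);
        }
        return qemu_send_packet_real(vc, buf, size);    /* existing send path */
    }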
Paul
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
2010-11-27 8:53 ` Yoshiaki Tamura
2010-11-27 11:03 ` Blue Swirl
@ 2010-11-27 11:54 ` Stefan Hajnoczi
2010-11-27 13:11 ` Yoshiaki Tamura
1 sibling, 1 reply; 112+ messages in thread
From: Stefan Hajnoczi @ 2010-11-27 11:54 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: ohmura.kei, mtosatti, stefanha, kvm, dlaor, aliguori, qemu-devel,
Blue Swirl, avi, vatsa, psuriset, ananth
On Sat, Nov 27, 2010 at 8:53 AM, Yoshiaki Tamura
<tamura.yoshiaki@lab.ntt.co.jp> wrote:
> 2010/11/27 Stefan Hajnoczi <stefanha@gmail.com>:
>> On Sat, Nov 27, 2010 at 4:29 AM, Yoshiaki Tamura
>> <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>>> 2010/11/27 Blue Swirl <blauwirbel@gmail.com>:
>>>> On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura
>>>> <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>>>> Somehow I find some similarities to instrumentation patches. Perhaps
>>>> the instrumentation framework could be used (maybe with some changes)
>>>> for Kemari as well? That could be beneficial to both.
>>>
>>> Yes. I had the same idea but I'm not sure how tracing works. I think
>>> Stefan Hajnoczi knows it better.
>>>
>>> Stefan, is it possible to call arbitrary functions from the trace
>>> points?
>>
>> Yes, if you add code to ./tracetool. I'm not sure I see the
>> connection between Kemari and tracing though.
>
> The connection is that it may be possible to remove Kemari
> specific hook point like in ioport.c and exec.c, and let tracing
> notify Kemari instead.
I actually think the other way. Tracing just instruments and stashes
away values. It does not change inputs or outputs, it does not change
control flow, it does not affect state.
Going down the route of side-effects mixes two different things:
hooking into a subsystem and instrumentation. For hooking into a
subsystem we should define proper interfaces. That interface can
explicitly support modifying inputs/outputs or changing control flow.
Tracing is much more ad-hoc and not a clean interface. It's also
based on a layer of indirection via the tracetool code generator.
That's okay because it doesn't affect the code it is called from and
you don't need to debug trace events (they are simple and have almost
no behavior).
Hooking via tracing is just taking advantage of the cheap layer of
indirection in order to get at interesting events in a subsystem.
It's easy to hook up and quick to develop, but it's not a proper
interface and will be hard to understand for other developers.
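To illustrate the distinction, a proper interface in this sense would be an
explicitly registered hook whose return value is allowed to change control
flow, along these lines (hypothetical names, not an existing QEMU API):

    #include <stdbool.h>
    #include <stdint.h>

    /* sketch: an explicit net-layer hook that may consume a packet, unlike a
     * tracepoint, which can only observe it */
    typedef struct NetTxHook {
        /* return true if the hook took ownership and the caller must not
         * transmit the packet itself */
        bool (*tx)(void *opaque, const uint8_t *buf, int size);
        void *opaque;
    } NetTxHook;

    void qemu_register_net_tx_hook(NetTxHook *hook);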
>> One question I have about Kemari is whether it adds new constraints to
>> the QEMU codebase? Fault tolerance seems like a cross-cutting concern
>> - everyone writing device emulation or core QEMU code may need to be
>> aware of new constraints. For example, "you are not allowed to
>> release I/O operations to the outside world directly, instead you need
>> to go through Kemari code which makes I/O transactional and
>> communicates with the passive host". You have converted e1000,
>> virtio-net, and virtio-blk. How do we make sure new devices that are
>> merged into qemu.git don't break Kemari? How do we go about
>> supporting the existing hw/* devices?
>
> Whether Kemari adds constraints such as you mentioned, yes. If
> the devices (including existing ones) don't call Kemari code,
> they would certainly break Kemari. Altough using proxies looks
> explicit, to make it unaware from people writing device
> emulation, it's possible to remove proxies and put changes only
> into the block/net layer as Blue suggested.
Anything that makes it hard to violate the constraints is good.
Otherwise Kemari might get broken in the future and no one will know
until a failover behaves incorrectly.
Could you formulate the constraints so developers are aware of them in
the future and can protect the codebase? How about expanding the
Kemari wiki pages?
Stefan
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
2010-11-27 11:03 ` Blue Swirl
@ 2010-11-27 12:21 ` Yoshiaki Tamura
0 siblings, 0 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-27 12:21 UTC (permalink / raw)
To: Blue Swirl
Cc: ohmura.kei, mtosatti, stefanha, kvm, Stefan Hajnoczi, dlaor,
aliguori, qemu-devel, avi, vatsa, psuriset, ananth
2010/11/27 Blue Swirl <blauwirbel@gmail.com>:
> On Sat, Nov 27, 2010 at 8:53 AM, Yoshiaki Tamura
> <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>> 2010/11/27 Stefan Hajnoczi <stefanha@gmail.com>:
>>> On Sat, Nov 27, 2010 at 4:29 AM, Yoshiaki Tamura
>>> <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>>>> 2010/11/27 Blue Swirl <blauwirbel@gmail.com>:
>>>>> On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura
>>>>> <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> This patch series is a revised version of Kemari for KVM, which
>>>>>> applied comments for the previous post and KVM Forum 2010. The
>>>>>> current code is based on qemu.git
>>>>>> f711df67d611e4762966a249742a5f7499e19f99.
>>>>>>
>>>>>> For general information about Kemari, I've made a wiki page at
>>>>>> qemu.org.
>>>>>>
>>>>>> http://wiki.qemu.org/Features/FaultTolerance
>>>>>>
>>>>>> The changes from v0.1.1 -> v0.2 are:
>>>>>>
>>>>>> - Introduce a queue in event-tap to make VM sync live.
>>>>>> - Change transaction receiver to a state machine for async receiving.
>>>>>> - Replace net/block layer functions with event-tap proxy functions.
>>>>>> - Remove dirty bitmap optimization for now.
>>>>>> - convert DPRINTF() in ft_trans_file to trace functions.
>>>>>> - convert fprintf() in ft_trans_file to error_report().
>>>>>> - improved error handling in ft_trans_file.
>>>>>> - add a tmp pointer to qemu_del_vm_change_state_handler.
>>>>>>
>>>>>> The changes from v0.1 -> v0.1.1 are:
>>>>>>
>>>>>> - events are tapped in net/block layer instead of device emulation layer.
>>>>>> - Introduce a new option for -incoming to accept FT transaction.
>>>>>> - Removed writev() support to QEMUFile and FdMigrationState for now. I would
>>>>>> post this work in a different series.
>>>>>> - Modified virtio-blk save/load handler to send inuse variable to
>>>>>> correctly replay.
>>>>>> - Removed configure --enable-ft-mode.
>>>>>> - Removed unnecessary check for qemu_realloc().
>>>>>>
>>>>>> The first 6 patches modify several functions of qemu to prepare
>>>>>> introducing Kemari specific components.
>>>>>>
>>>>>> The next 6 patches are the components of Kemari. They introduce
>>>>>> event-tap and the FT transaction protocol file based on buffered file.
>>>>>> The design document of FT transaction protocol can be found at,
>>>>>> http://wiki.qemu.org/images/b/b1/Kemari_sender_receiver_0.5a.pdf
>>>>>>
>>>>>> Then the following 4 patches modifies dma-helpers, virtio-blk
>>>>>> virtio-net and e1000 to replace net/block layer functions with
>>>>>> event-tap proxy functions. Please note that if Kemari is off,
>>>>>> event-tap will just passthrough, and there is most no intrusion to
>>>>>> exisiting functions including normal live migration.
>>>>>
>>>>> Would it be possible to make the changes only in the block/net layer,
>>>>> so that the devices are not modified at all? That is, the proxy
>>>>> function would always replaces the unproxied version.
>>>>
>>>> I understand the benefit of your suggestion. However it seems a bit
>>>> tricky. It's because event-tap uses functions of emulators and net,
>>>> but block.c is also linked for utilities like qemu-img that doesn't
>>>> need emulators or net. In the previous version, I added function
>>>> pointers to get around.
>>>>
>>>> http://lists.nongnu.org/archive/html/qemu-devel/2010-05/msg02378.html
>>>>
>>>> I wasn't confident of this approach and discussed it at KVM Forum, and
>>>> decided to give a try to replace emulator functions with proxies.
>>>> Suggestions are welcomed of course.
>>>>
>>>>> Somehow I find some similarities to instrumentation patches. Perhaps
>>>>> the instrumentation framework could be used (maybe with some changes)
>>>>> for Kemari as well? That could be beneficial to both.
>>>>
>>>> Yes. I had the same idea but I'm not sure how tracing works. I think
>>>> Stefan Hajnoczi knows it better.
>>>>
>>>> Stefan, is it possible to call arbitrary functions from the trace
>>>> points?
>>>
>>> Yes, if you add code to ./tracetool. I'm not sure I see the
>>> connection between Kemari and tracing though.
>>
>> The connection is that it may be possible to remove Kemari
>> specific hook point like in ioport.c and exec.c, and let tracing
>> notify Kemari instead.
>
> This all depends on how generic we want the trace points become.
>
> One possible extension to the event injection or instrumentation could
> be fault injection: based on some rule, make the instrumented function
> return error. That would be interesting for testing how guest handles
> failure cases.
>
> Maybe it should be also possible to handle event injection in a
> generic way. Split the instrumented function to two, before and after
> the tracepoint. The tracepoint registers the tail function in addition
> to the parameters. This may require a lot of refactoring though.
The idea looks cool, but it's a bit beyond what I can handle now :-)
Let's keep the idea of binding to trace points for now,
and focus on how to insert the net/block tap points.
>>> One question I have about Kemari is whether it adds new constraints to
>>> the QEMU codebase? Fault tolerance seems like a cross-cutting concern
>>> - everyone writing device emulation or core QEMU code may need to be
>>> aware of new constraints. For example, "you are not allowed to
>>> release I/O operations to the outside world directly, instead you need
>>> to go through Kemari code which makes I/O transactional and
>>> communicates with the passive host". You have converted e1000,
>>> virtio-net, and virtio-blk. How do we make sure new devices that are
>>> merged into qemu.git don't break Kemari? How do we go about
>>> supporting the existing hw/* devices?
>>
>> Whether Kemari adds constraints such as you mentioned, yes. If
>> the devices (including existing ones) don't call Kemari code,
>> they would certainly break Kemari. Altough using proxies looks
>> explicit, to make it unaware from people writing device
>> emulation, it's possible to remove proxies and put changes only
>> into the block/net layer as Blue suggested.
>
> I'd prefer that approach if possible.
Thanks. Let me see how others think too.
Yoshi
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
2010-11-27 11:20 ` Paul Brook
@ 2010-11-27 12:35 ` Yoshiaki Tamura
0 siblings, 0 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-27 12:35 UTC (permalink / raw)
To: Paul Brook
Cc: ohmura.kei, dlaor, stefanha, kvm, Stefan Hajnoczi, mtosatti,
qemu-devel, vatsa, Blue Swirl, aliguori, ananth, psuriset, avi
2010/11/27 Paul Brook <paul@codesourcery.com>:
>> One question I have about Kemari is whether it adds new constraints to
>> the QEMU codebase? Fault tolerance seems like a cross-cutting concern
>> - everyone writing device emulation or core QEMU code may need to be
>> aware of new constraints. For example, "you are not allowed to
>> release I/O operations to the outside world directly, instead you need
>> to go through Kemari code which makes I/O transactional and
>> communicates with the passive host". You have converted e1000,
>> virtio-net, and virtio-blk. How do we make sure new devices that are
>> merged into qemu.git don't break Kemari? How do we go about
>> supporting the existing hw/* devices?
>
> IMO anything that requires devices to act differently is wrong. All external
> IO already goes though a common API (e.g. qemu_send_packet). You should be
> putting your transaction code there, not hacking individual devices.
So you're with Blue's idea of putting them in the block/net layer.
Yoshi
>
> Paul
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
2010-11-27 11:54 ` Stefan Hajnoczi
@ 2010-11-27 13:11 ` Yoshiaki Tamura
2010-11-29 10:17 ` Stefan Hajnoczi
0 siblings, 1 reply; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-27 13:11 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: ohmura.kei, mtosatti, stefanha, kvm, dlaor, aliguori, qemu-devel,
Blue Swirl, avi, vatsa, psuriset, ananth
2010/11/27 Stefan Hajnoczi <stefanha@gmail.com>:
> On Sat, Nov 27, 2010 at 8:53 AM, Yoshiaki Tamura
> <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>> 2010/11/27 Stefan Hajnoczi <stefanha@gmail.com>:
>>> On Sat, Nov 27, 2010 at 4:29 AM, Yoshiaki Tamura
>>> <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>>>> 2010/11/27 Blue Swirl <blauwirbel@gmail.com>:
>>>>> On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura
>>>>> <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>>>>> Somehow I find some similarities to instrumentation patches. Perhaps
>>>>> the instrumentation framework could be used (maybe with some changes)
>>>>> for Kemari as well? That could be beneficial to both.
>>>>
>>>> Yes. I had the same idea but I'm not sure how tracing works. I think
>>>> Stefan Hajnoczi knows it better.
>>>>
>>>> Stefan, is it possible to call arbitrary functions from the trace
>>>> points?
>>>
>>> Yes, if you add code to ./tracetool. I'm not sure I see the
>>> connection between Kemari and tracing though.
>>
>> The connection is that it may be possible to remove Kemari
>> specific hook point like in ioport.c and exec.c, and let tracing
>> notify Kemari instead.
>
> I actually think the other way. Tracing just instruments and stashes
> away values. It does not change inputs or outputs, it does not change
> control flow, it does not affect state.
>
> Going down the route of side-effects mixes two different things:
> hooking into a subsystem and instrumentation. For hooking into a
> subsystem we should define proper interfaces. That interface can
> explicitly support modifying inputs/outputs or changing control flow.
>
> Tracing is much more ad-hoc and not a clean interface. It's also
> based on a layer of indirection via the tracetool code generator.
> That's okay because it doesn't affect the code it is called from and
> you don't need to debug trace events (they are simple and have almost
> no behavior).
>
> Hooking via tracing is just taking advantage of the cheap layer of
> indirection in order to get at interesting events in a subsystem.
> It's easy to hook up and quick to develop, but it's not a proper
> interface and will be hard to understand for other developers.
>
>>> One question I have about Kemari is whether it adds new constraints to
>>> the QEMU codebase? Fault tolerance seems like a cross-cutting concern
>>> - everyone writing device emulation or core QEMU code may need to be
>>> aware of new constraints. For example, "you are not allowed to
>>> release I/O operations to the outside world directly, instead you need
>>> to go through Kemari code which makes I/O transactional and
>>> communicates with the passive host". You have converted e1000,
>>> virtio-net, and virtio-blk. How do we make sure new devices that are
>>> merged into qemu.git don't break Kemari? How do we go about
>>> supporting the existing hw/* devices?
>>
>> Whether Kemari adds constraints such as you mentioned, yes. If
>> the devices (including existing ones) don't call Kemari code,
>> they would certainly break Kemari. Altough using proxies looks
>> explicit, to make it unaware from people writing device
>> emulation, it's possible to remove proxies and put changes only
>> into the block/net layer as Blue suggested.
>
> Anything that makes it hard to violate the constraints is good.
> Otherwise Kemari might get broken in the future and no one will know
> until a failover behaves incorrectly.
Blue and Paul prefer putting it into the block/net layer, and you
think it's better to provide an API.
I have an idea which may fit both, which is to put the hook
into the block/net layer and to keep a list of devices that Kemari
supports. Before turning Kemari on, we can check whether the tapped
devices are on the list. It's Kemari's responsibility to
keep checking which devices can be supported.
At this point, the devices with proxies are on the list.
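A minimal sketch of such a check, assuming a hypothetical helper and list
(none of these names are in the posted series):

    #include <stdbool.h>
    #include <string.h>

    /* sketch: devices Kemari currently knows how to replay */
    static const char *const kemari_supported_devices[] = {
        "e1000", "virtio-net", "virtio-blk", NULL
    };

    static bool kemari_device_supported(const char *name)
    {
        int i;
        for (i = 0; kemari_supported_devices[i] != NULL; i++) {
            if (strcmp(name, kemari_supported_devices[i]) == 0) {
                return true;
            }
        }
        return false;
    }

Before enabling Kemari, the tapped devices would be walked and the setup
refused if kemari_device_supported() returns false for any of them.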
> Could you formulate the constraints so developers are aware of them in
> the future and can protect the codebase. How about expanding the
> Kemari wiki pages?
If you like the idea above, I'm happy to add the list to the wiki
page as well.
Yoshi
>
> Stefan
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse varialble.
2010-11-25 6:06 ` [Qemu-devel] [PATCH 05/21] virtio: modify save/load handler to handle inuse varialble Yoshiaki Tamura
@ 2010-11-28 9:28 ` Michael S. Tsirkin
2010-11-28 11:27 ` Yoshiaki Tamura
0 siblings, 1 reply; 112+ messages in thread
From: Michael S. Tsirkin @ 2010-11-28 9:28 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, mtosatti, qemu-devel,
vatsa, avi, psuriset, stefanha
On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
> Modify inuse type to uint16_t, let save/load to handle, and revert
> last_avail_idx with inuse if there are outstanding emulation.
>
> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
This changes the migration format, so it will break compatibility with
existing drivers. More generally, I think migrating internal
state that is not guest visible is always a mistake,
as it ties the migration format to an internal implementation
(yes, I know we do this sometimes, but we should at least
try not to add such cases). I think the right thing to do in this case
is to flush the outstanding work when the VM is stopped. Then we are
guaranteed that inuse is 0.
I sent patches that do this for virtio net and block.
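The general shape of that approach might be a VM state change handler that
drains pending I/O when the machine stops, so inuse has returned to 0 by the
time device state is saved. This is only a sketch of the idea, not Michael's
actual patches (which are referenced later in the thread); the handler
signature follows the 2010-era VMChangeStateHandler:

    /* sketch: drain outstanding AIO whenever the VM stops, so nothing
     * in-flight needs to be carried in the migration stream */
    static void virtio_flush_on_stop(void *opaque, int running, int reason)
    {
        if (!running) {
            qemu_aio_flush();   /* wait for all pending requests to complete */
        }
    }

    /* registered once at setup, e.g.:
     *   qemu_add_vm_change_state_handler(virtio_flush_on_stop, vdev);
     */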
> ---
> hw/virtio.c | 8 +++++++-
> 1 files changed, 7 insertions(+), 1 deletions(-)
>
> diff --git a/hw/virtio.c b/hw/virtio.c
> index 849a60f..5509644 100644
> --- a/hw/virtio.c
> +++ b/hw/virtio.c
> @@ -72,7 +72,7 @@ struct VirtQueue
> VRing vring;
> target_phys_addr_t pa;
> uint16_t last_avail_idx;
> - int inuse;
> + uint16_t inuse;
> uint16_t vector;
> void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
> VirtIODevice *vdev;
> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> qemu_put_be32(f, vdev->vq[i].vring.num);
> qemu_put_be64(f, vdev->vq[i].pa);
> qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> + qemu_put_be16s(f, &vdev->vq[i].inuse);
> if (vdev->binding->save_queue)
> vdev->binding->save_queue(vdev->binding_opaque, i, f);
> }
> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
> vdev->vq[i].vring.num = qemu_get_be32(f);
> vdev->vq[i].pa = qemu_get_be64(f);
> qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
> + qemu_get_be16s(f, &vdev->vq[i].inuse);
> +
> + /* revert last_avail_idx if there are outstanding emulation. */
> + vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
> + vdev->vq[i].inuse = 0;
>
> if (vdev->vq[i].pa) {
> virtqueue_init(&vdev->vq[i]);
> --
> 1.7.1.2
>
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 15/21] virtio-net: replace qemu_sendv_packet_async() with qemu_sendv_packet_async_proxy().
2010-11-25 6:06 ` [Qemu-devel] [PATCH 15/21] virtio-net: replace qemu_sendv_packet_async() with qemu_sendv_packet_async_proxy() Yoshiaki Tamura
@ 2010-11-28 9:31 ` Michael S. Tsirkin
2010-11-28 11:43 ` Yoshiaki Tamura
0 siblings, 1 reply; 112+ messages in thread
From: Michael S. Tsirkin @ 2010-11-28 9:31 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, mtosatti, qemu-devel,
vatsa, avi, psuriset, stefanha
On Thu, Nov 25, 2010 at 03:06:54PM +0900, Yoshiaki Tamura wrote:
> Replace replace qemu_sendv_packet_async() with
> qemu_sendv_packet_async_proxy() to let event-tap capture events from
> virtio-net.
>
> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
Why does every device need to know about event tap?
Can qemu_sendv_packet_async just do the right thing instead?
> ---
> hw/virtio-net.c | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/hw/virtio-net.c b/hw/virtio-net.c
> index 1d61f19..8c76346 100644
> --- a/hw/virtio-net.c
> +++ b/hw/virtio-net.c
> @@ -710,8 +710,8 @@ static int32_t virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
> len += hdr_len;
> }
>
> - ret = qemu_sendv_packet_async(&n->nic->nc, out_sg, out_num,
> - virtio_net_tx_complete);
> + ret = qemu_sendv_packet_async_proxy(&n->nic->nc, out_sg, out_num,
> + virtio_net_tx_complete);
> if (ret == 0) {
> virtio_queue_set_notification(n->tx_vq, 0);
> n->async_tx.elem = elem;
> --
> 1.7.1.2
>
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 13/21] dma-helpers: replace bdrv_aio_writev() with bdrv_aio_writev_proxy().
2010-11-25 6:06 ` [Qemu-devel] [PATCH 13/21] dma-helpers: replace bdrv_aio_writev() with bdrv_aio_writev_proxy() Yoshiaki Tamura
@ 2010-11-28 9:33 ` Michael S. Tsirkin
2010-11-28 11:55 ` Yoshiaki Tamura
0 siblings, 1 reply; 112+ messages in thread
From: Michael S. Tsirkin @ 2010-11-28 9:33 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, mtosatti, qemu-devel,
vatsa, avi, psuriset, stefanha
On Thu, Nov 25, 2010 at 03:06:52PM +0900, Yoshiaki Tamura wrote:
> Replace bdrv_aio_writev() with bdrv_aio_writev_proxy() to let
> event-tap capture events from dma-helpers.
>
> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
Same comment as for -net here: it's not clear when a device should
use bdrv_aio_writev_proxy and when bdrv_aio_writev.
If all devices should just use _proxy, let's
just make bdrv_aio_writev DTRT instead.
> ---
> dma-helpers.c | 4 ++--
> 1 files changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/dma-helpers.c b/dma-helpers.c
> index 712ed89..8ab2c26 100644
> --- a/dma-helpers.c
> +++ b/dma-helpers.c
> @@ -117,8 +117,8 @@ static void dma_bdrv_cb(void *opaque, int ret)
> }
>
> if (dbs->is_write) {
> - dbs->acb = bdrv_aio_writev(dbs->bs, dbs->sector_num, &dbs->iov,
> - dbs->iov.size / 512, dma_bdrv_cb, dbs);
> + dbs->acb = bdrv_aio_writev_proxy(dbs->bs, dbs->sector_num, &dbs->iov,
> + dbs->iov.size / 512, dma_bdrv_cb, dbs);
> } else {
> dbs->acb = bdrv_aio_readv(dbs->bs, dbs->sector_num, &dbs->iov,
> dbs->iov.size / 512, dma_bdrv_cb, dbs);
> --
> 1.7.1.2
>
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 11/21] ioport: insert event_tap_ioport() to ioport_write().
2010-11-25 6:06 ` [Qemu-devel] [PATCH 11/21] ioport: insert event_tap_ioport() to ioport_write() Yoshiaki Tamura
@ 2010-11-28 9:40 ` Michael S. Tsirkin
2010-11-28 12:00 ` Yoshiaki Tamura
0 siblings, 1 reply; 112+ messages in thread
From: Michael S. Tsirkin @ 2010-11-28 9:40 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, mtosatti, qemu-devel,
vatsa, avi, psuriset, stefanha
On Thu, Nov 25, 2010 at 03:06:50PM +0900, Yoshiaki Tamura wrote:
> Record ioport event to replay it upon failover.
>
> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
Interesting. This will have to be extended to support ioeventfd.
Since each eventfd is really just a binary trigger
it should be enough to read out the fd state.
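For reference, reading out an eventfd's state is just a read of its 64-bit
counter; a standalone POSIX sketch, not code from the series:

    #include <stdint.h>
    #include <unistd.h>

    /* sketch: returns the pending counter value, or 0 if the (non-blocking)
     * eventfd has not been signalled since it was last read */
    static uint64_t eventfd_pending(int fd)
    {
        uint64_t counter = 0;
        if (read(fd, &counter, sizeof(counter)) != (ssize_t)sizeof(counter)) {
            return 0;   /* EAGAIN (nothing pending) or error: treat as idle */
        }
        return counter;
    }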
> ---
> ioport.c | 2 ++
> 1 files changed, 2 insertions(+), 0 deletions(-)
>
> diff --git a/ioport.c b/ioport.c
> index aa4188a..74aebf5 100644
> --- a/ioport.c
> +++ b/ioport.c
> @@ -27,6 +27,7 @@
>
> #include "ioport.h"
> #include "trace.h"
> +#include "event-tap.h"
>
> /***********************************************************/
> /* IO Port */
> @@ -76,6 +77,7 @@ static void ioport_write(int index, uint32_t address, uint32_t data)
> default_ioport_writel
> };
> IOPortWriteFunc *func = ioport_write_table[index][address];
> + event_tap_ioport(index, address, data);
> if (!func)
> func = default_func[index];
> func(ioport_opaque[address], address, data);
> --
> 1.7.1.2
>
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse varialble.
2010-11-28 9:28 ` [Qemu-devel] " Michael S. Tsirkin
@ 2010-11-28 11:27 ` Yoshiaki Tamura
2010-11-28 11:46 ` Michael S. Tsirkin
0 siblings, 1 reply; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-28 11:27 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, mtosatti, qemu-devel,
vatsa, avi, psuriset, stefanha
2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
>> Modify inuse type to uint16_t, let save/load to handle, and revert
>> last_avail_idx with inuse if there are outstanding emulation.
>>
>> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>
> This changes migration format, so it will break compatibility with
> existing drivers. More generally, I think migrating internal
> state that is not guest visible is always a mistake
> as it ties migration format to an internal implementation
> (yes, I know we do this sometimes, but we should at least
> try not to add such cases). I think the right thing to do in this case
> is to flush outstanding
> work when vm is stopped. Then, we are guaranteed that inuse is 0.
> I sent patches that do this for virtio net and block.
Could you give me the link to your patches? I'd like to test
whether they work with Kemari upon failover. If they do, I'm
happy to drop this patch.
Yoshi
>
>> ---
>> hw/virtio.c | 8 +++++++-
>> 1 files changed, 7 insertions(+), 1 deletions(-)
>>
>> diff --git a/hw/virtio.c b/hw/virtio.c
>> index 849a60f..5509644 100644
>> --- a/hw/virtio.c
>> +++ b/hw/virtio.c
>> @@ -72,7 +72,7 @@ struct VirtQueue
>> VRing vring;
>> target_phys_addr_t pa;
>> uint16_t last_avail_idx;
>> - int inuse;
>> + uint16_t inuse;
>> uint16_t vector;
>> void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
>> VirtIODevice *vdev;
>> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>> qemu_put_be32(f, vdev->vq[i].vring.num);
>> qemu_put_be64(f, vdev->vq[i].pa);
>> qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>> + qemu_put_be16s(f, &vdev->vq[i].inuse);
>> if (vdev->binding->save_queue)
>> vdev->binding->save_queue(vdev->binding_opaque, i, f);
>> }
>> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
>> vdev->vq[i].vring.num = qemu_get_be32(f);
>> vdev->vq[i].pa = qemu_get_be64(f);
>> qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
>> + qemu_get_be16s(f, &vdev->vq[i].inuse);
>> +
>> + /* revert last_avail_idx if there are outstanding emulation. */
>> + vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
>> + vdev->vq[i].inuse = 0;
>>
>> if (vdev->vq[i].pa) {
>> virtqueue_init(&vdev->vq[i]);
>> --
>> 1.7.1.2
>>
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 15/21] virtio-net: replace qemu_sendv_packet_async() with qemu_sendv_packet_async_proxy().
2010-11-28 9:31 ` [Qemu-devel] " Michael S. Tsirkin
@ 2010-11-28 11:43 ` Yoshiaki Tamura
0 siblings, 0 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-28 11:43 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, mtosatti, qemu-devel,
vatsa, avi, psuriset, stefanha
2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> On Thu, Nov 25, 2010 at 03:06:54PM +0900, Yoshiaki Tamura wrote:
>> Replace replace qemu_sendv_packet_async() with
>> qemu_sendv_packet_async_proxy() to let event-tap capture events from
>> virtio-net.
>>
>> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>
> Why does every device need to know about eent tap?
> Can qemu_sendv_packet_async just do the right thing instead?
If we let the net layer notify event-tap, devices don't have to know
about event-tap. Most people want to go in that direction, and I'll
follow it in the next spin.
Yoshi
>
>> ---
>> hw/virtio-net.c | 4 ++--
>> 1 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/hw/virtio-net.c b/hw/virtio-net.c
>> index 1d61f19..8c76346 100644
>> --- a/hw/virtio-net.c
>> +++ b/hw/virtio-net.c
>> @@ -710,8 +710,8 @@ static int32_t virtio_net_flush_tx(VirtIONet *n, VirtQueue *vq)
>> len += hdr_len;
>> }
>>
>> - ret = qemu_sendv_packet_async(&n->nic->nc, out_sg, out_num,
>> - virtio_net_tx_complete);
>> + ret = qemu_sendv_packet_async_proxy(&n->nic->nc, out_sg, out_num,
>> + virtio_net_tx_complete);
>> if (ret == 0) {
>> virtio_queue_set_notification(n->tx_vq, 0);
>> n->async_tx.elem = elem;
>> --
>> 1.7.1.2
>>
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse varialble.
2010-11-28 11:27 ` Yoshiaki Tamura
@ 2010-11-28 11:46 ` Michael S. Tsirkin
2010-12-01 8:03 ` Yoshiaki Tamura
0 siblings, 1 reply; 112+ messages in thread
From: Michael S. Tsirkin @ 2010-11-28 11:46 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, mtosatti, qemu-devel,
vatsa, avi, psuriset, stefanha
On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
> >> Modify inuse type to uint16_t, let save/load to handle, and revert
> >> last_avail_idx with inuse if there are outstanding emulation.
> >>
> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >
> > This changes migration format, so it will break compatibility with
> > existing drivers. More generally, I think migrating internal
> > state that is not guest visible is always a mistake
> > as it ties migration format to an internal implementation
> > (yes, I know we do this sometimes, but we should at least
> > try not to add such cases). I think the right thing to do in this case
> > is to flush outstanding
> > work when vm is stopped. Then, we are guaranteed that inuse is 0.
> > I sent patches that do this for virtio net and block.
>
> Could you give me the link of your patches? I'd like to test
> whether they work with Kemari upon failover. If they do, I'm
> happy to drop this patch.
>
> Yoshi
Look for this:
stable migration image on a stopped vm
sent on:
Wed, 24 Nov 2010 17:52:49 +0200
> >
> >> ---
> >> hw/virtio.c | 8 +++++++-
> >> 1 files changed, 7 insertions(+), 1 deletions(-)
> >>
> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> index 849a60f..5509644 100644
> >> --- a/hw/virtio.c
> >> +++ b/hw/virtio.c
> >> @@ -72,7 +72,7 @@ struct VirtQueue
> >> VRing vring;
> >> target_phys_addr_t pa;
> >> uint16_t last_avail_idx;
> >> - int inuse;
> >> + uint16_t inuse;
> >> uint16_t vector;
> >> void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
> >> VirtIODevice *vdev;
> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >> qemu_put_be32(f, vdev->vq[i].vring.num);
> >> qemu_put_be64(f, vdev->vq[i].pa);
> >> qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >> + qemu_put_be16s(f, &vdev->vq[i].inuse);
> >> if (vdev->binding->save_queue)
> >> vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >> }
> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
> >> vdev->vq[i].vring.num = qemu_get_be32(f);
> >> vdev->vq[i].pa = qemu_get_be64(f);
> >> qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
> >> + qemu_get_be16s(f, &vdev->vq[i].inuse);
> >> +
> >> + /* revert last_avail_idx if there are outstanding emulation. */
> >> + vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
> >> + vdev->vq[i].inuse = 0;
> >>
> >> if (vdev->vq[i].pa) {
> >> virtqueue_init(&vdev->vq[i]);
> >> --
> >> 1.7.1.2
> >>
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 13/21] dma-helpers: replace bdrv_aio_writev() with bdrv_aio_writev_proxy().
2010-11-28 9:33 ` [Qemu-devel] " Michael S. Tsirkin
@ 2010-11-28 11:55 ` Yoshiaki Tamura
2010-11-28 12:28 ` Michael S. Tsirkin
2010-11-29 9:52 ` Kevin Wolf
0 siblings, 2 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-28 11:55 UTC (permalink / raw)
To: Michael S. Tsirkin, Kevin Wolf
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, mtosatti, qemu-devel,
vatsa, avi, psuriset, stefanha
2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> On Thu, Nov 25, 2010 at 03:06:52PM +0900, Yoshiaki Tamura wrote:
>> Replace bdrv_aio_writev() with bdrv_aio_writev_proxy() to let
>> event-tap capture events from dma-helpers.
>>
>> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>
> Same comment as -net here: it's not clear when should
> a device use bdrv_aio_writev_proxy and when bdrv_aio_writev.
> If all devices should just use _proxy, let's
> just make bdrv_aio_writev DTRT instead.
Same answer as I gave to the net layer question. However, I had
trouble inserting event-tap functions into block.c before:
block.c gets linked into utilities like qemu-img, but those aren't
linked with the emulator code that event-tap uses. So I want
to avoid linking block and event-tap together for the utilities, but
I guess we don't want to use ifdefs for this. I'm wondering how I can
solve this problem cleanly.
Kevin, do you have suggestions here?
Yoshi
>
>> ---
>> dma-helpers.c | 4 ++--
>> 1 files changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/dma-helpers.c b/dma-helpers.c
>> index 712ed89..8ab2c26 100644
>> --- a/dma-helpers.c
>> +++ b/dma-helpers.c
>> @@ -117,8 +117,8 @@ static void dma_bdrv_cb(void *opaque, int ret)
>> }
>>
>> if (dbs->is_write) {
>> - dbs->acb = bdrv_aio_writev(dbs->bs, dbs->sector_num, &dbs->iov,
>> - dbs->iov.size / 512, dma_bdrv_cb, dbs);
>> + dbs->acb = bdrv_aio_writev_proxy(dbs->bs, dbs->sector_num, &dbs->iov,
>> + dbs->iov.size / 512, dma_bdrv_cb, dbs);
>> } else {
>> dbs->acb = bdrv_aio_readv(dbs->bs, dbs->sector_num, &dbs->iov,
>> dbs->iov.size / 512, dma_bdrv_cb, dbs);
>> --
>> 1.7.1.2
>>
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 11/21] ioport: insert event_tap_ioport() to ioport_write().
2010-11-28 9:40 ` [Qemu-devel] " Michael S. Tsirkin
@ 2010-11-28 12:00 ` Yoshiaki Tamura
2010-12-16 7:37 ` Yoshiaki Tamura
0 siblings, 1 reply; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-28 12:00 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, mtosatti, qemu-devel,
vatsa, avi, psuriset, stefanha
2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> On Thu, Nov 25, 2010 at 03:06:50PM +0900, Yoshiaki Tamura wrote:
>> Record ioport event to replay it upon failover.
>>
>> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>
> Interesting. This will have to be extended to support ioeventfd.
> Since each eventfd is really just a binary trigger
> it should be enough to read out the fd state.
Haven't thought about eventfd yet. Will try doing it in the next
spin.
Yoshi
>
>> ---
>> ioport.c | 2 ++
>> 1 files changed, 2 insertions(+), 0 deletions(-)
>>
>> diff --git a/ioport.c b/ioport.c
>> index aa4188a..74aebf5 100644
>> --- a/ioport.c
>> +++ b/ioport.c
>> @@ -27,6 +27,7 @@
>>
>> #include "ioport.h"
>> #include "trace.h"
>> +#include "event-tap.h"
>>
>> /***********************************************************/
>> /* IO Port */
>> @@ -76,6 +77,7 @@ static void ioport_write(int index, uint32_t address, uint32_t data)
>> default_ioport_writel
>> };
>> IOPortWriteFunc *func = ioport_write_table[index][address];
>> + event_tap_ioport(index, address, data);
>> if (!func)
>> func = default_func[index];
>> func(ioport_opaque[address], address, data);
>> --
>> 1.7.1.2
>>
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 13/21] dma-helpers: replace bdrv_aio_writev() with bdrv_aio_writev_proxy().
2010-11-28 11:55 ` Yoshiaki Tamura
@ 2010-11-28 12:28 ` Michael S. Tsirkin
2010-11-29 9:52 ` Kevin Wolf
1 sibling, 0 replies; 112+ messages in thread
From: Michael S. Tsirkin @ 2010-11-28 12:28 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: Kevin Wolf, aliguori, dlaor, ananth, kvm, ohmura.kei, mtosatti,
qemu-devel, vatsa, avi, psuriset, stefanha
On Sun, Nov 28, 2010 at 08:55:28PM +0900, Yoshiaki Tamura wrote:
> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> > On Thu, Nov 25, 2010 at 03:06:52PM +0900, Yoshiaki Tamura wrote:
> >> Replace bdrv_aio_writev() with bdrv_aio_writev_proxy() to let
> >> event-tap capture events from dma-helpers.
> >>
> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >
> > Same comment as -net here: it's not clear when should
> > a device use bdrv_aio_writev_proxy and when bdrv_aio_writev.
> > If all devices should just use _proxy, let's
> > just make bdrv_aio_writev DTRT instead.
>
> Same as I replied to the net layer question. However, I had
> troubles with inserting event-tap functions into block.c before.
> block.c gets linked with utils like qemu-img, but they don't get
> linked with emulators code which event-tap uses in it. So I want
> to avoid linking block and event-tap for utils, but I guess we
> don't want to use ifdefs for this. I'm wondering how I can solve
> this problem cleanly.
Add stubs, the same as we have for other functions.
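For what it's worth, such stubs would presumably be no-op definitions linked
only into the tools build; the event_tap_* names here are hypothetical
placeholders, not the series' actual symbols:

    /* qemu-tool.c style stubs (sketch): satisfy the linker for qemu-img and
     * friends without pulling in the emulator-side event-tap code */
    #include <stdbool.h>

    bool event_tap_is_on(void)
    {
        return false;   /* the tools never run with Kemari enabled */
    }

    void event_tap_init(void)
    {
        /* nothing to do in the tools */
    }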
> Kevin, do you have suggestions here?
>
> Yoshi
>
> >
> >> ---
> >> dma-helpers.c | 4 ++--
> >> 1 files changed, 2 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/dma-helpers.c b/dma-helpers.c
> >> index 712ed89..8ab2c26 100644
> >> --- a/dma-helpers.c
> >> +++ b/dma-helpers.c
> >> @@ -117,8 +117,8 @@ static void dma_bdrv_cb(void *opaque, int ret)
> >> }
> >>
> >> if (dbs->is_write) {
> >> - dbs->acb = bdrv_aio_writev(dbs->bs, dbs->sector_num, &dbs->iov,
> >> - dbs->iov.size / 512, dma_bdrv_cb, dbs);
> >> + dbs->acb = bdrv_aio_writev_proxy(dbs->bs, dbs->sector_num, &dbs->iov,
> >> + dbs->iov.size / 512, dma_bdrv_cb, dbs);
> >> } else {
> >> dbs->acb = bdrv_aio_readv(dbs->bs, dbs->sector_num, &dbs->iov,
> >> dbs->iov.size / 512, dma_bdrv_cb, dbs);
> >> --
> >> 1.7.1.2
> >>
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] Re: [PATCH 13/21] dma-helpers: replace bdrv_aio_writev() with bdrv_aio_writev_proxy().
2010-11-28 11:55 ` Yoshiaki Tamura
2010-11-28 12:28 ` Michael S. Tsirkin
@ 2010-11-29 9:52 ` Kevin Wolf
2010-11-29 12:56 ` Yoshiaki Tamura
1 sibling, 1 reply; 112+ messages in thread
From: Kevin Wolf @ 2010-11-29 9:52 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: ohmura.kei, mtosatti, ananth, kvm, Michael S. Tsirkin, dlaor,
aliguori, qemu-devel, avi, vatsa, psuriset, stefanha
Am 28.11.2010 12:55, schrieb Yoshiaki Tamura:
> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>> On Thu, Nov 25, 2010 at 03:06:52PM +0900, Yoshiaki Tamura wrote:
>>> Replace bdrv_aio_writev() with bdrv_aio_writev_proxy() to let
>>> event-tap capture events from dma-helpers.
>>>
>>> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>>
>> Same comment as -net here: it's not clear when should
>> a device use bdrv_aio_writev_proxy and when bdrv_aio_writev.
>> If all devices should just use _proxy, let's
>> just make bdrv_aio_writev DTRT instead.
>
> Same as I replied to the net layer question. However, I had
> troubles with inserting event-tap functions into block.c before.
> block.c gets linked with utils like qemu-img, but they don't get
> linked with emulators code which event-tap uses in it. So I want
> to avoid linking block and event-tap for utils, but I guess we
> don't want to use ifdefs for this. I'm wondering how I can solve
> this problem cleanly.
>
> Kevin, do you have suggestions here?
Michael's stubs (probably in qemu-tool.c) seem to be the right solution.
Which requests do you actually want to intercept? I assume you're aware
that for example qcow2 internally calls another bdrv_aio_readv/writev
that accesses the image file.
So if you only want the requests that come directly from
devices, maybe you'll have to restrict it to BlockDriverStates that
belong to a drive. I think this is the case if the BlockDriverState has
a non-empty device name.
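In code, that restriction could be a device-name check at the tap point;
bdrv_get_device_name() is the existing accessor, while event_tap_is_on(),
event_tap_bdrv_aio_writev() and bdrv_aio_writev_real() are hypothetical and
the signature is only approximate:

    /* sketch: only requests on a named drive (i.e. coming from a device model,
     * not qcow2's internal metadata or backing-file I/O) go through event-tap */
    BlockDriverAIOCB *bdrv_aio_writev(BlockDriverState *bs, int64_t sector_num,
                                      QEMUIOVector *qiov, int nb_sectors,
                                      BlockDriverCompletionFunc *cb, void *opaque)
    {
        if (event_tap_is_on() && bdrv_get_device_name(bs)[0] != '\0') {
            return event_tap_bdrv_aio_writev(bs, sector_num, qiov, nb_sectors,
                                             cb, opaque);
        }
        return bdrv_aio_writev_real(bs, sector_num, qiov, nb_sectors, cb, opaque);
    }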
Kevin
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
2010-11-27 13:11 ` Yoshiaki Tamura
@ 2010-11-29 10:17 ` Stefan Hajnoczi
2010-11-29 13:00 ` Paul Brook
0 siblings, 1 reply; 112+ messages in thread
From: Stefan Hajnoczi @ 2010-11-29 10:17 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: ohmura.kei, mtosatti, stefanha, kvm, dlaor, aliguori, qemu-devel,
Blue Swirl, avi, vatsa, psuriset, ananth
On Sat, Nov 27, 2010 at 1:11 PM, Yoshiaki Tamura
<tamura.yoshiaki@lab.ntt.co.jp> wrote:
> 2010/11/27 Stefan Hajnoczi <stefanha@gmail.com>:
>> On Sat, Nov 27, 2010 at 8:53 AM, Yoshiaki Tamura
>> <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>>> 2010/11/27 Stefan Hajnoczi <stefanha@gmail.com>:
>>>> On Sat, Nov 27, 2010 at 4:29 AM, Yoshiaki Tamura
>>>> <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>>>>> 2010/11/27 Blue Swirl <blauwirbel@gmail.com>:
>>>>>> On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura
>>>>>> <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>>>>>> Somehow I find some similarities to instrumentation patches. Perhaps
>>>>>> the instrumentation framework could be used (maybe with some changes)
>>>>>> for Kemari as well? That could be beneficial to both.
>>>>>
>>>>> Yes. I had the same idea but I'm not sure how tracing works. I think
>>>>> Stefan Hajnoczi knows it better.
>>>>>
>>>>> Stefan, is it possible to call arbitrary functions from the trace
>>>>> points?
>>>>
>>>> Yes, if you add code to ./tracetool. I'm not sure I see the
>>>> connection between Kemari and tracing though.
>>>
>>> The connection is that it may be possible to remove Kemari
>>> specific hook point like in ioport.c and exec.c, and let tracing
>>> notify Kemari instead.
>>
>> I actually think the other way. Tracing just instruments and stashes
>> away values. It does not change inputs or outputs, it does not change
>> control flow, it does not affect state.
>>
>> Going down the route of side-effects mixes two different things:
>> hooking into a subsystem and instrumentation. For hooking into a
>> subsystem we should define proper interfaces. That interface can
>> explicitly support modifying inputs/outputs or changing control flow.
>>
>> Tracing is much more ad-hoc and not a clean interface. It's also
>> based on a layer of indirection via the tracetool code generator.
>> That's okay because it doesn't affect the code it is called from and
>> you don't need to debug trace events (they are simple and have almost
>> no behavior).
>>
>> Hooking via tracing is just taking advantage of the cheap layer of
>> indirection in order to get at interesting events in a subsystem.
>> It's easy to hook up and quick to develop, but it's not a proper
>> interface and will be hard to understand for other developers.
>>
>>>> One question I have about Kemari is whether it adds new constraints to
>>>> the QEMU codebase? Fault tolerance seems like a cross-cutting concern
>>>> - everyone writing device emulation or core QEMU code may need to be
>>>> aware of new constraints. For example, "you are not allowed to
>>>> release I/O operations to the outside world directly, instead you need
>>>> to go through Kemari code which makes I/O transactional and
>>>> communicates with the passive host". You have converted e1000,
>>>> virtio-net, and virtio-blk. How do we make sure new devices that are
>>>> merged into qemu.git don't break Kemari? How do we go about
>>>> supporting the existing hw/* devices?
>>>
>>> Whether Kemari adds constraints such as you mentioned: yes. If
>>> the devices (including existing ones) don't call Kemari code,
>>> they would certainly break Kemari. Although using proxies looks
>>> explicit, to keep this transparent to people writing device
>>> emulation, it's possible to remove the proxies and put the changes
>>> only into the block/net layer as Blue suggested.
>>
>> Anything that makes it hard to violate the constraints is good.
>> Otherwise Kemari might get broken in the future and no one will know
>> until a failover behaves incorrectly.
>
> Blue and Paul prefer to put it into block/net layer, and you
> think it's better to provide API.
Sorry, I wasn't clear. I agree that event tap behavior should be in
generic block and net layer code. That way we're guaranteeing that
all net and block I/O goes through event tap.
>> Could you formulate the constraints so developers are aware of them in
>> the future and can protect the codebase. How about expanding the
>> Kemari wiki pages?
>
> If you like the idea above, I'm happy to make the list also on
> the wiki page.
Here's a different question: what requirements must an emulated device
meet in order to be added to the Kemari supported whitelist? That's
what I want to know so that I don't break existing devices and can add
new devices that work with Kemari :).
Stefan
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 09/21] Introduce event-tap.
2010-11-25 6:06 ` [Qemu-devel] [PATCH 09/21] Introduce event-tap Yoshiaki Tamura
@ 2010-11-29 11:00 ` Stefan Hajnoczi
2010-11-30 9:50 ` Yoshiaki Tamura
2011-01-04 11:02 ` Yoshiaki Tamura
[not found] ` <20101130011914.GA9015@amt.cnet>
1 sibling, 2 replies; 112+ messages in thread
From: Stefan Hajnoczi @ 2010-11-29 11:00 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, mtosatti, qemu-devel,
vatsa, avi, psuriset, stefanha
On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura
<tamura.yoshiaki@lab.ntt.co.jp> wrote:
> event-tap controls when to start an FT transaction, and provides proxy
> functions to be called from net/block devices. During an FT transaction,
> it queues up net/block requests and flushes them when the transaction
> completes.
>
> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
> ---
> Makefile.target | 1 +
> block.h | 9 +
> event-tap.c | 794 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> event-tap.h | 34 +++
> net.h | 4 +
> net/queue.c | 1 +
> 6 files changed, 843 insertions(+), 0 deletions(-)
> create mode 100644 event-tap.c
> create mode 100644 event-tap.h
event_tap_state is checked at the beginning of several functions. If
there is an unexpected state the function silently returns. Should
these checks really be assert() so there is an abort and backtrace if
the program ever reaches this state?
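For illustration, the silent-return check at the top of event_tap_replay()
would then become an assertion along these lines (EVENT_TAP_LOAD and
EVENT_TAP_REPLAY are the patch's states; the helper itself is made up):

    #include <assert.h>

    /* abort with a backtrace instead of silently ignoring a bad state */
    static void event_tap_enter_replay(void)
    {
        assert(event_tap_state == EVENT_TAP_LOAD);
        event_tap_state = EVENT_TAP_REPLAY;
    }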
> +typedef struct EventTapBlkReq {
> + char *device_name;
> + int num_reqs;
> + int num_cbs;
> + bool is_multiwrite;
Is multiwrite logging necessary? If event tap is called from within
the block layer then multiwrite is turned into one or more
bdrv_aio_writev() calls.
> +static void event_tap_replay(void *opaque, int running, int reason)
> +{
> + EventTapLog *log, *next;
> +
> + if (!running) {
> + return;
> + }
> +
> + if (event_tap_state != EVENT_TAP_LOAD) {
> + return;
> + }
> +
> + event_tap_state = EVENT_TAP_REPLAY;
> +
> + QTAILQ_FOREACH(log, &event_list, node) {
> + EventTapBlkReq *blk_req;
> +
> + /* event resume */
> + switch (log->mode & ~EVENT_TAP_TYPE_MASK) {
> + case EVENT_TAP_NET:
> + event_tap_net_flush(&log->net_req);
> + break;
> + case EVENT_TAP_BLK:
> + blk_req = &log->blk_req;
> + if ((log->mode & EVENT_TAP_TYPE_MASK) == EVENT_TAP_IOPORT) {
> + switch (log->ioport.index) {
> + case 0:
> + cpu_outb(log->ioport.address, log->ioport.data);
> + break;
> + case 1:
> + cpu_outw(log->ioport.address, log->ioport.data);
> + break;
> + case 2:
> + cpu_outl(log->ioport.address, log->ioport.data);
> + break;
> + }
> + } else {
> + /* EVENT_TAP_MMIO */
> + cpu_physical_memory_rw(log->mmio.address,
> + log->mmio.buf,
> + log->mmio.len, 1);
> + }
> + break;
Why are net tx packets replayed at the net level but blk requests are
replayed at the pio/mmio level?
I expected everything to replay either as pio/mmio or as net/block.
> +static void event_tap_blk_load(QEMUFile *f, EventTapBlkReq *blk_req)
> +{
> + BlockRequest *req;
> + ram_addr_t page_addr;
> + int i, j, len;
> +
> + len = qemu_get_byte(f);
> + blk_req->device_name = qemu_malloc(len + 1);
> + qemu_get_buffer(f, (uint8_t *)blk_req->device_name, len);
> + blk_req->device_name[len] = '\0';
> + blk_req->num_reqs = qemu_get_byte(f);
> +
> + for (i = 0; i < blk_req->num_reqs; i++) {
> + req = &blk_req->reqs[i];
> + req->sector = qemu_get_be64(f);
> + req->nb_sectors = qemu_get_be32(f);
> + req->qiov = qemu_malloc(sizeof(QEMUIOVector));
It would make sense to have common QEMUIOVector load/save functions
instead of inlining this code here.
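Something along these lines, as a sketch: helpers built only on existing
QEMUFile and qemu_iovec_* APIs, with invented names:

    #include "qemu-common.h"
    #include "hw/hw.h"    /* QEMUFile */

    static void qemu_put_qiov(QEMUFile *f, const QEMUIOVector *qiov)
    {
        int i;

        qemu_put_be32(f, qiov->niov);
        for (i = 0; i < qiov->niov; i++) {
            qemu_put_be64(f, qiov->iov[i].iov_len);
            qemu_put_buffer(f, qiov->iov[i].iov_base, qiov->iov[i].iov_len);
        }
    }

    static void qemu_get_qiov(QEMUFile *f, QEMUIOVector *qiov)
    {
        int i, niov = qemu_get_be32(f);

        qemu_iovec_init(qiov, niov);
        for (i = 0; i < niov; i++) {
            size_t len = qemu_get_be64(f);
            uint8_t *buf = qemu_malloc(len);

            qemu_get_buffer(f, buf, len);
            qemu_iovec_add(qiov, buf, len);
        }
    }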
> +static int event_tap_load(QEMUFile *f, void *opaque, int version_id)
> +{
> + EventTapLog *log, *next;
> + int mode;
> +
> + event_tap_state = EVENT_TAP_LOAD;
> +
> + QTAILQ_FOREACH_SAFE(log, &event_list, node, next) {
> + QTAILQ_REMOVE(&event_list, log, node);
> + event_tap_free_log(log);
> + }
> +
> + /* loop until EOF */
> + while ((mode = qemu_get_byte(f)) != 0) {
> + EventTapLog *log = event_tap_alloc_log();
> +
> + log->mode = mode;
> + switch (log->mode & EVENT_TAP_TYPE_MASK) {
> + case EVENT_TAP_IOPORT:
> + event_tap_ioport_load(f, &log->ioport);
> + break;
> + case EVENT_TAP_MMIO:
> + event_tap_mmio_load(f, &log->mmio);
> + break;
> + case 0:
> + DPRINTF("No event\n");
> + break;
> + default:
> + fprintf(stderr, "Unknown state %d\n", log->mode);
> + return -1;
log is leaked here...
> + }
> +
> + switch (log->mode & ~EVENT_TAP_TYPE_MASK) {
> + case EVENT_TAP_NET:
> + event_tap_net_load(f, &log->net_req);
> + break;
> + case EVENT_TAP_BLK:
> + event_tap_blk_load(f, &log->blk_req);
> + break;
> + default:
> + fprintf(stderr, "Unknown state %d\n", log->mode);
> + return -1;
...and here.
Stefan
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] Re: [PATCH 13/21] dma-helpers: replace bdrv_aio_writev() with bdrv_aio_writev_proxy().
2010-11-29 9:52 ` Kevin Wolf
@ 2010-11-29 12:56 ` Yoshiaki Tamura
0 siblings, 0 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-29 12:56 UTC (permalink / raw)
To: Kevin Wolf
Cc: ohmura.kei, mtosatti, ananth, kvm, Michael S. Tsirkin, dlaor,
aliguori, qemu-devel, avi, vatsa, psuriset, stefanha
2010/11/29 Kevin Wolf <kwolf@redhat.com>:
> Am 28.11.2010 12:55, schrieb Yoshiaki Tamura:
>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>>> On Thu, Nov 25, 2010 at 03:06:52PM +0900, Yoshiaki Tamura wrote:
>>>> Replace bdrv_aio_writev() with bdrv_aio_writev_proxy() to let
>>>> event-tap capture events from dma-helpers.
>>>>
>>>> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>>>
>>> Same comment as -net here: it's not clear when should
>>> a device use bdrv_aio_writev_proxy and when bdrv_aio_writev.
>>> If all devices should just use _proxy, let's
>>> just make bdrv_aio_writev DTRT instead.
>>
>> Same as I replied to the net layer question. However, I had
>> troubles with inserting event-tap functions into block.c before.
>> block.c gets linked with utils like qemu-img, but they don't get
>> linked with emulators code which event-tap uses in it. So I want
>> to avoid linking block and event-tap for utils, but I guess we
>> don't want to use ifdefs for this. I'm wondering how I can solve
>> this problem cleanly.
>>
>> Kevin, do you have suggestions here?
>
> Michael's stubs (probably in qemu-tool.c) seem to be the right solution.
Same here. I noticed kvm-stub to be a good example.
> Which requests do you actually want to intercept? I assume you're aware
> that for example qcow2 internally calls another bdrv_aio_readv/writev
> that accesses the image file.
>
> So if you only want to have the requests that come directly from
> devices, maybe you'll have to restrict it to BlockDriverStates that
> belongs to a drive. I think this is the case if it has a non-empty
> device name.
Yes, exactly. I noticed that a little while ago. Thanks for
making it clear.
>
> Kevin
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
2010-11-29 10:17 ` Stefan Hajnoczi
@ 2010-11-29 13:00 ` Paul Brook
2010-11-29 13:13 ` Yoshiaki Tamura
0 siblings, 1 reply; 112+ messages in thread
From: Paul Brook @ 2010-11-29 13:00 UTC (permalink / raw)
To: qemu-devel
Cc: ohmura.kei, dlaor, stefanha, kvm, Stefan Hajnoczi, mtosatti,
Yoshiaki Tamura, vatsa, Blue Swirl, aliguori, ananth, psuriset,
avi
> >> Could you formulate the constraints so developers are aware of them in
> >> the future and can protect the codebase. How about expanding the
> >> Kemari wiki pages?
> >
> > If you like the idea above, I'm happy to make the list also on
> > the wiki page.
>
> Here's a different question: what requirements must an emulated device
> meet in order to be added to the Kemari supported whitelist? That's
> what I want to know so that I don't break existing devices and can add
> new devices that work with Kemari :).
Why isn't it completely device agnostic? i.e. if a device has to care about
Kemari at all (or vice versa) then IMO you're doing it wrong. The whole point
of the internal block/net APIs is that they isolate the host implementation
details from the device emulation.
Paul
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
2010-11-29 13:00 ` Paul Brook
@ 2010-11-29 13:13 ` Yoshiaki Tamura
2010-11-29 13:19 ` Paul Brook
0 siblings, 1 reply; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-29 13:13 UTC (permalink / raw)
To: Paul Brook
Cc: ohmura.kei, dlaor, stefanha, kvm, Stefan Hajnoczi, mtosatti,
qemu-devel, vatsa, Blue Swirl, aliguori, ananth, psuriset, avi
2010/11/29 Paul Brook <paul@codesourcery.com>:
>> >> Could you formulate the constraints so developers are aware of them in
>> >> the future and can protect the codebase. How about expanding the
>> >> Kemari wiki pages?
>> >
>> > If you like the idea above, I'm happy to make the list also on
>> > the wiki page.
>>
>> Here's a different question: what requirements must an emulated device
>> meet in order to be added to the Kemari supported whitelist? That's
>> what I want to know so that I don't break existing devices and can add
>> new devices that work with Kemari :).
>
> Why isn't it completely device agnostic? i.e. if a device has to care about
> Kemari at all (or vice versa) then IMO you're doing it wrong. The whole point
> of the internal block/net APIs is that they isolate the host implementation
> details from the device emulation.
You're right "theoretically". But what I've learned so far,
there are cases like virtio-net and e1000 woks but virtio-blk
doesn't. "Theoretically", any emulated device should be able to
get into the whitelist if the event-tap is properly implemented
but sometimes it doesn't seem to be that simple.
To answer Stefan's question, there shouldn't be any requirement
for a device, but must be tested with Kemari. If it doesn't work
correctly, the problems must be fixed before adding to the list.
Yoshi
>
> Paul
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
2010-11-29 13:13 ` Yoshiaki Tamura
@ 2010-11-29 13:19 ` Paul Brook
2010-11-29 13:41 ` Yoshiaki Tamura
0 siblings, 1 reply; 112+ messages in thread
From: Paul Brook @ 2010-11-29 13:19 UTC (permalink / raw)
To: qemu-devel
Cc: ohmura.kei, mtosatti, stefanha, kvm, Stefan Hajnoczi, dlaor,
Yoshiaki Tamura, vatsa, Blue Swirl, aliguori, avi, psuriset,
ananth
> 2010/11/29 Paul Brook <paul@codesourcery.com>:
> >> >> Could you formulate the constraints so developers are aware of them
> >> >> in the future and can protect the codebase. How about expanding the
> >> >> Kemari wiki pages?
> >> >
> >> > If you like the idea above, I'm happy to make the list also on
> >> > the wiki page.
> >>
> >> Here's a different question: what requirements must an emulated device
> >> meet in order to be added to the Kemari supported whitelist? That's
> >> what I want to know so that I don't break existing devices and can add
> >> new devices that work with Kemari :).
> >
> > Why isn't it completely device agnostic? i.e. if a device has to care
> > about Kemari at all (of vice-versa) then IMO you're doing it wrong. The
> > whole point of the internal block/net APIs is that they isolate the host
> > implementation details from the device emulation.
>
> You're right "theoretically". But what I've learned so far,
> there are cases like virtio-net and e1000 woks but virtio-blk
> doesn't. "Theoretically", any emulated device should be able to
> get into the whitelist if the event-tap is properly implemented
> but sometimes it doesn't seem to be that simple.
>
> To answer Stefan's question, there shouldn't be any requirement
> for a device, but must be tested with Kemari. If it doesn't work
> correctly, the problems must be fixed before adding to the list.
What exactly are the problems? Is this a device bug or a Kemari bug?
If it's the former then that implies you're imposing additional requirements
that weren't previously part of the API. If the latter, then it's a bug like
any other.
Paul
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
2010-11-29 13:19 ` Paul Brook
@ 2010-11-29 13:41 ` Yoshiaki Tamura
2010-11-29 14:12 ` Paul Brook
0 siblings, 1 reply; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-29 13:41 UTC (permalink / raw)
To: Paul Brook
Cc: ohmura.kei, mtosatti, stefanha, kvm, Stefan Hajnoczi, dlaor,
qemu-devel, vatsa, Blue Swirl, aliguori, avi, psuriset, ananth
2010/11/29 Paul Brook <paul@codesourcery.com>:
>> 2010/11/29 Paul Brook <paul@codesourcery.com>:
>> >> >> Could you formulate the constraints so developers are aware of them
>> >> >> in the future and can protect the codebase. How about expanding the
>> >> >> Kemari wiki pages?
>> >> >
>> >> > If you like the idea above, I'm happy to make the list also on
>> >> > the wiki page.
>> >>
>> >> Here's a different question: what requirements must an emulated device
>> >> meet in order to be added to the Kemari supported whitelist? That's
>> >> what I want to know so that I don't break existing devices and can add
>> >> new devices that work with Kemari :).
>> >
>> > Why isn't it completely device agnostic? i.e. if a device has to care
>> > about Kemari at all (of vice-versa) then IMO you're doing it wrong. The
>> > whole point of the internal block/net APIs is that they isolate the host
>> > implementation details from the device emulation.
>>
>> You're right "theoretically". But what I've learned so far,
>> there are cases like virtio-net and e1000 woks but virtio-blk
>> doesn't. "Theoretically", any emulated device should be able to
>> get into the whitelist if the event-tap is properly implemented
>> but sometimes it doesn't seem to be that simple.
>>
>> To answer Stefan's question, there shouldn't be any requirement
>> for a device, but must be tested with Kemari. If it doesn't work
>> correctly, the problems must be fixed before adding to the list.
>
> What exactly are the problems? Is this a device bug or a Kemari bug?
> If it's the former then that implies you're imposing additional requirements
> that weren't previously part of the API. If the latter, then it's a bug like
> any other.
It's a problem if devices don't continue correctly upon failover.
I would say it's a live migration bug (not in all cases, of course)
because Kemari is just doing live migration at specific points.
Yoshi
>
> Paul
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
2010-11-29 13:41 ` Yoshiaki Tamura
@ 2010-11-29 14:12 ` Paul Brook
2010-11-29 14:37 ` Yoshiaki Tamura
0 siblings, 1 reply; 112+ messages in thread
From: Paul Brook @ 2010-11-29 14:12 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: ohmura.kei, mtosatti, stefanha, kvm, Stefan Hajnoczi, dlaor,
qemu-devel, vatsa, Blue Swirl, aliguori, avi, psuriset, ananth
> >> To answer Stefan's question, there shouldn't be any requirement
> >> for a device, but must be tested with Kemari. If it doesn't work
> >> correctly, the problems must be fixed before adding to the list.
> >
> > What exactly are the problems? Is this a device bus of a Kemari bug?
> > If it's the former then that implies you're imposing additional
> > requirements that weren't previously part of the API. If the latter,
> > then it's a bug like any other.
>
> It's a problem if devices don't continue correctly upon failover.
> I would say it's a bug of live migration (not all of course)
> because Kemari is just live migrating at specific points.
Ah, now we're getting somewhere. So you're saying that these devices are
broken anyway, and Kemari happens to trigger that brokenness more frequently?
If the requirement is that a device must support live migration, then that
should be the criteria for enabling Kemari, not some arbitrary whitelist.
If devices incorrectly claim support for live migration, then that should also
be fixed, either by removing the broken code or by making it work.
AFAICT your current proposal is just feeding back the results of some fairly
specific QA testing. I'd rather not get into that game. The correct response
in the context of upstream development is to file a bug and/or fix the code.
We already have config files that allow third party packagers to remove
devices they don't want to support.
Paul
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
2010-11-29 14:12 ` Paul Brook
@ 2010-11-29 14:37 ` Yoshiaki Tamura
2010-11-29 14:56 ` Paul Brook
0 siblings, 1 reply; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-29 14:37 UTC (permalink / raw)
To: Paul Brook
Cc: ohmura.kei, dlaor, stefanha, kvm, Stefan Hajnoczi, mtosatti,
qemu-devel, vatsa, Blue Swirl, aliguori, ananth, psuriset, avi
2010/11/29 Paul Brook <paul@codesourcery.com>:
>> >> To answer Stefan's question, there shouldn't be any requirement
>> >> for a device, but must be tested with Kemari. If it doesn't work
>> >> correctly, the problems must be fixed before adding to the list.
>> >
>> > What exactly are the problems? Is this a device bus of a Kemari bug?
>> > If it's the former then that implies you're imposing additional
>> > requirements that weren't previously part of the API. If the latter,
>> > then it's a bug like any other.
>>
>> It's a problem if devices don't continue correctly upon failover.
>> I would say it's a bug of live migration (not all of course)
>> because Kemari is just live migrating at specific points.
>
> Ah, now we're getting somewhere. So you're saying that these devices are
> broken anyway, and Kemari happens to trigger that brokenness more frequently?
>
> If the requirement is that a device must support live migration, then that
> should be the criteria for enabling Kemari, not some arbitrary whitelist.
Sorry, I thought that criterion was the obvious one and didn't think
to clarify. The whitelist is a guard to keep users from getting into
trouble with arbitrary devices.
> If devices incorrectly claim support for live migration, then that should also
> be fixed, either by removing the broken code or by making it work.
I totally agree with you.
> AFAICT your current proposal is just feeding back the results of some fairly
> specific QA testing. I'd rather not get into that game. The correct response
> in the context of upstream development is to file a bug and/or fix the code.
> We already have config files that allow third party packagers to remove
> devices they don't want to support.
Sorry, I didn't get what you're trying to tell me. My plan would
be to initially start from a subset of devices, and gradually
grow the number of devices that Kemari works with. During this
process, we'll do what you said above: file a bug and/or fix
the code. Am I missing what you're saying?
Yoshi
>
> Paul
>
>
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
2010-11-29 14:37 ` Yoshiaki Tamura
@ 2010-11-29 14:56 ` Paul Brook
2010-11-29 15:00 ` Yoshiaki Tamura
0 siblings, 1 reply; 112+ messages in thread
From: Paul Brook @ 2010-11-29 14:56 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: ohmura.kei, dlaor, stefanha, kvm, Stefan Hajnoczi, mtosatti,
qemu-devel, vatsa, Blue Swirl, aliguori, ananth, psuriset, avi
> > If devices incorrectly claim support for live migration, then that should
> > also be fixed, either by removing the broken code or by making it work.
>
> I totally agree with you.
>
> > AFAICT your current proposal is just feeding back the results of some
> > fairly specific QA testing. I'd rather not get into that game. The
> > correct response in the context of upstream development is to file a bug
> > and/or fix the code. We already have config files that allow third party
> > packagers to remove devices they don't want to support.
>
> Sorry, I didn't get what you're trying to tell me. My plan would
> be to initially start from a subset of devices, and gradually
> grow the number of devices that Kemari works with. While this
> process, it'll include what you said above, file a but and/or fix
> the code. Am I missing what you're saying?
My point is that the whitelist shouldn't exist at all. Devices either support
migration or they don't. Having some sort of separate whitelist is the wrong
way to determine which devices support migration.
Paul
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
2010-11-29 14:56 ` Paul Brook
@ 2010-11-29 15:00 ` Yoshiaki Tamura
2010-11-29 15:56 ` Paul Brook
2010-11-29 16:23 ` Stefan Hajnoczi
0 siblings, 2 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-29 15:00 UTC (permalink / raw)
To: Paul Brook
Cc: ohmura.kei, dlaor, stefanha, kvm, Stefan Hajnoczi, mtosatti,
qemu-devel, vatsa, Blue Swirl, aliguori, ananth, psuriset, avi
2010/11/29 Paul Brook <paul@codesourcery.com>:
>> > If devices incorrectly claim support for live migration, then that should
>> > also be fixed, either by removing the broken code or by making it work.
>>
>> I totally agree with you.
>>
>> > AFAICT your current proposal is just feeding back the results of some
>> > fairly specific QA testing. I'd rather not get into that game. The
>> > correct response in the context of upstream development is to file a bug
>> > and/or fix the code. We already have config files that allow third party
>> > packagers to remove devices they don't want to support.
>>
>> Sorry, I didn't get what you're trying to tell me. My plan would
>> be to initially start from a subset of devices, and gradually
>> grow the number of devices that Kemari works with. While this
>> process, it'll include what you said above, file a but and/or fix
>> the code. Am I missing what you're saying?
>
> My point is that the whitelist shouldn't exist at all. Devices either support
> migration or they don't. Having some sort of separate whitelist is the wrong
> way to determine which devices support migration.
Alright!
Then if a user encounters a problem with Kemari, we'll fix Kemari
or the devices or both. Correct?
Yoshi
>
> Paul
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
2010-11-29 15:00 ` Yoshiaki Tamura
@ 2010-11-29 15:56 ` Paul Brook
2010-11-29 16:23 ` Stefan Hajnoczi
1 sibling, 0 replies; 112+ messages in thread
From: Paul Brook @ 2010-11-29 15:56 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: ohmura.kei, dlaor, stefanha, kvm, Stefan Hajnoczi, mtosatti,
qemu-devel, vatsa, Blue Swirl, aliguori, ananth, psuriset, avi
> >> Sorry, I didn't get what you're trying to tell me. My plan would
> >> be to initially start from a subset of devices, and gradually
> >> grow the number of devices that Kemari works with. While this
> >> process, it'll include what you said above, file a but and/or fix
> >> the code. Am I missing what you're saying?
> >
> > My point is that the whitelist shouldn't exist at all. Devices either
> > support migration or they don't. Having some sort of separate whitelist
> > is the wrong way to determine which devices support migration.
>
> Alright!
>
> Then if a user encounters a problem with Kemari, we'll fix Kemari
> or the devices or both. Correct?
Correct.
Paul
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
2010-11-29 15:00 ` Yoshiaki Tamura
2010-11-29 15:56 ` Paul Brook
@ 2010-11-29 16:23 ` Stefan Hajnoczi
2010-11-29 16:41 ` Dor Laor
1 sibling, 1 reply; 112+ messages in thread
From: Stefan Hajnoczi @ 2010-11-29 16:23 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: ohmura.kei, mtosatti, stefanha, kvm, aliguori, dlaor, qemu-devel,
vatsa, Blue Swirl, Paul Brook, ananth, psuriset, avi
On Mon, Nov 29, 2010 at 3:00 PM, Yoshiaki Tamura
<tamura.yoshiaki@lab.ntt.co.jp> wrote:
> 2010/11/29 Paul Brook <paul@codesourcery.com>:
>>> > If devices incorrectly claim support for live migration, then that should
>>> > also be fixed, either by removing the broken code or by making it work.
>>>
>>> I totally agree with you.
>>>
>>> > AFAICT your current proposal is just feeding back the results of some
>>> > fairly specific QA testing. I'd rather not get into that game. The
>>> > correct response in the context of upstream development is to file a bug
>>> > and/or fix the code. We already have config files that allow third party
>>> > packagers to remove devices they don't want to support.
>>>
>>> Sorry, I didn't get what you're trying to tell me. My plan would
>>> be to initially start from a subset of devices, and gradually
>>> grow the number of devices that Kemari works with. While this
>>> process, it'll include what you said above, file a but and/or fix
>>> the code. Am I missing what you're saying?
>>
>> My point is that the whitelist shouldn't exist at all. Devices either support
>> migration or they don't. Having some sort of separate whitelist is the wrong
>> way to determine which devices support migration.
>
> Alright!
>
> Then if a user encounters a problem with Kemari, we'll fix Kemari
> or the devices or both. Correct?
Is this a fair summary: any device that supports live migration works
under Kemari?
(If such a device does not work under Kemari then this is a bug that
needs to be fixed in live migration, Kemari, or the device.)
Stefan
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
2010-11-29 16:23 ` Stefan Hajnoczi
@ 2010-11-29 16:41 ` Dor Laor
2010-11-29 16:53 ` Paul Brook
` (2 more replies)
0 siblings, 3 replies; 112+ messages in thread
From: Dor Laor @ 2010-11-29 16:41 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: ohmura.kei, aliguori, stefanha, kvm, mtosatti, Yoshiaki Tamura,
ananth, qemu-devel, Blue Swirl, Paul Brook, vatsa, psuriset, avi
On 11/29/2010 06:23 PM, Stefan Hajnoczi wrote:
> On Mon, Nov 29, 2010 at 3:00 PM, Yoshiaki Tamura
> <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>> 2010/11/29 Paul Brook<paul@codesourcery.com>:
>>>>> If devices incorrectly claim support for live migration, then that should
>>>>> also be fixed, either by removing the broken code or by making it work.
>>>>
>>>> I totally agree with you.
>>>>
>>>>> AFAICT your current proposal is just feeding back the results of some
>>>>> fairly specific QA testing. I'd rather not get into that game. The
>>>>> correct response in the context of upstream development is to file a bug
>>>>> and/or fix the code. We already have config files that allow third party
>>>>> packagers to remove devices they don't want to support.
>>>>
>>>> Sorry, I didn't get what you're trying to tell me. My plan would
>>>> be to initially start from a subset of devices, and gradually
>>>> grow the number of devices that Kemari works with. While this
>>>> process, it'll include what you said above, file a but and/or fix
>>>> the code. Am I missing what you're saying?
>>>
>>> My point is that the whitelist shouldn't exist at all. Devices either support
>>> migration or they don't. Having some sort of separate whitelist is the wrong
>>> way to determine which devices support migration.
>>
>> Alright!
>>
>> Then if a user encounters a problem with Kemari, we'll fix Kemari
>> or the devices or both. Correct?
>
> Is this a fair summary: any device that supports live migration workw
> under Kemari?
It might be a fair summary, but practically we barely have live migration
working w/o Kemari. In addition, last I checked Kemari needs additional
hooks, and it will be too hard to keep those out of tree until all devices
get them.
>
> (If such a device does not work under Kemari then this is a bug that
> needs to be fixed in live migration, Kemari, or the device.)
>
> Stefan
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
2010-11-29 16:41 ` Dor Laor
@ 2010-11-29 16:53 ` Paul Brook
2010-11-29 17:05 ` Anthony Liguori
2010-11-30 6:43 ` Yoshiaki Tamura
2010-11-30 9:13 ` Takuya Yoshikawa
2 siblings, 1 reply; 112+ messages in thread
From: Paul Brook @ 2010-11-29 16:53 UTC (permalink / raw)
To: dlaor
Cc: ohmura.kei, stefanha, kvm, Stefan Hajnoczi, mtosatti,
Yoshiaki Tamura, ananth, qemu-devel, Blue Swirl, aliguori, vatsa,
psuriset, avi
> > Is this a fair summary: any device that supports live migration workw
> > under Kemari?
>
> It might be fair summary but practically we barely have live migration
> working w/o Kemari. In addition, last I checked Kemari needs additional
> hooks and it will be too hard to keep that out of tree until all devices
> get it.
That's not what I've been hearing earlier in this thread.
The responses from Yoshi indicate that Stefan's summary is correct. i.e. the
current Kemari implementation may require per-device hooks, but that's a bug
and should be fixed before merging.
Paul
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
2010-11-29 16:53 ` Paul Brook
@ 2010-11-29 17:05 ` Anthony Liguori
2010-11-29 17:18 ` Paul Brook
2010-11-30 7:13 ` Yoshiaki Tamura
0 siblings, 2 replies; 112+ messages in thread
From: Anthony Liguori @ 2010-11-29 17:05 UTC (permalink / raw)
To: Paul Brook
Cc: ohmura.kei, mtosatti, stefanha, kvm, Stefan Hajnoczi, dlaor,
Yoshiaki Tamura, ananth, qemu-devel, Blue Swirl, aliguori, vatsa,
psuriset, avi
On 11/29/2010 10:53 AM, Paul Brook wrote:
>>> Is this a fair summary: any device that supports live migration workw
>>> under Kemari?
>>>
>> It might be fair summary but practically we barely have live migration
>> working w/o Kemari. In addition, last I checked Kemari needs additional
>> hooks and it will be too hard to keep that out of tree until all devices
>> get it.
>>
> That's not what I've been hearing earlier in this thread.
> The responses from Yoshi indicate that Stefan's summary is correct. i.e. the
> current Kemari implementation may require per-device hooks, but that's a bug
> and should be fixed before merging.
>
It's actually really important that Kemari make use of an intermediate
layer such that the hooks can distinguish between a device access and a
recursive access.
You could s/bdrv_aio_multiwrite/bdrv_aio_multiwrite_internal/g and then
within kemari, s/bdrv_aio_multiwrite_proxy/bdrv_aio_multiwrite/ but I
don't think that results in a cleaner interface.
I don't like the _proxy naming and I think it has led to some
confusion. I think having a dev_aio_multiwrite interface is a better
naming scheme and ultimately provides a clearer idea of why a separate
interface is needed--to distinguish between device accesses and internal
accesses.
BTW, dev_aio_multiwrite should take a DeviceState * and a BlockDriverState.
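For illustration, something like the following sketch (the event-tap entry
point here is hypothetical; bdrv_aio_multiwrite() and the types already
exist in QEMU):

    #include "block.h"
    #include "hw/qdev.h"    /* DeviceState */

    /* device-facing entry point: the name records that this is a device
     * access, leaving bdrv_aio_multiwrite() for internal/recursive use */
    int dev_aio_multiwrite(DeviceState *dev, BlockDriverState *bs,
                           BlockRequest *reqs, int num_reqs)
    {
        if (event_tap_is_on()) {
            /* hypothetical event-tap hook for device-originated I/O */
            return event_tap_bdrv_aio_multiwrite(dev, bs, reqs, num_reqs);
        }
        return bdrv_aio_multiwrite(bs, reqs, num_reqs);
    }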
Regards,
Anthony Liguori
> Paul
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
2010-11-29 17:05 ` Anthony Liguori
@ 2010-11-29 17:18 ` Paul Brook
2010-11-29 17:33 ` Anthony Liguori
2010-11-30 7:13 ` Yoshiaki Tamura
1 sibling, 1 reply; 112+ messages in thread
From: Paul Brook @ 2010-11-29 17:18 UTC (permalink / raw)
To: qemu-devel
Cc: ohmura.kei, dlaor, stefanha, kvm, Stefan Hajnoczi, mtosatti,
Yoshiaki Tamura, vatsa, Blue Swirl, aliguori, avi, psuriset,
ananth
> On 11/29/2010 10:53 AM, Paul Brook wrote:
> >>> Is this a fair summary: any device that supports live migration workw
> >>> under Kemari?
> >>
> >> It might be fair summary but practically we barely have live migration
> >> working w/o Kemari. In addition, last I checked Kemari needs additional
> >> hooks and it will be too hard to keep that out of tree until all devices
> >> get it.
> >
> > That's not what I've been hearing earlier in this thread.
> > The responses from Yoshi indicate that Stefan's summary is correct. i.e.
> > the current Kemari implementation may require per-device hooks, but
> > that's a bug and should be fixed before merging.
>
> It's actually really important that Kemari make use of an intermediate
> layer such that the hooks can distinguish between a device access and a
> recursive access.
I'm failing to understand how this is anything other than running sed over
block/*.c (or hw/*.c, depending on whether you choose to rename the internal or
external API).
Paul
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
2010-11-29 17:18 ` Paul Brook
@ 2010-11-29 17:33 ` Anthony Liguori
0 siblings, 0 replies; 112+ messages in thread
From: Anthony Liguori @ 2010-11-29 17:33 UTC (permalink / raw)
To: Paul Brook
Cc: ohmura.kei, dlaor, stefanha, kvm, Stefan Hajnoczi, mtosatti,
qemu-devel, Yoshiaki Tamura, Blue Swirl, aliguori, vatsa, avi,
psuriset, ananth
On 11/29/2010 11:18 AM, Paul Brook wrote:
>> On 11/29/2010 10:53 AM, Paul Brook wrote:
>>
>>>>> Is this a fair summary: any device that supports live migration workw
>>>>> under Kemari?
>>>>>
>>>> It might be fair summary but practically we barely have live migration
>>>> working w/o Kemari. In addition, last I checked Kemari needs additional
>>>> hooks and it will be too hard to keep that out of tree until all devices
>>>> get it.
>>>>
>>> That's not what I've been hearing earlier in this thread.
>>> The responses from Yoshi indicate that Stefan's summary is correct. i.e.
>>> the current Kemari implementation may require per-device hooks, but
>>> that's a bug and should be fixed before merging.
>>>
>> It's actually really important that Kemari make use of an intermediate
>> layer such that the hooks can distinguish between a device access and a
>> recursive access.
>>
> I'm failing to understand how this is anything other than running sed over
> block/*.c (or hw/*.c, depending whether you choose to rename the internal or
> external API).
>
You're right, it's not a big deal, and requiring everything in hw/ to use
the new interface is not a bad idea.
If a device doesn't work with Kemari, that's okay as long as the
non-Kemari case is essentially a nop.
Regards,
Anthony Liguori
> Paul
>
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
2010-11-29 16:41 ` Dor Laor
2010-11-29 16:53 ` Paul Brook
@ 2010-11-30 6:43 ` Yoshiaki Tamura
2010-11-30 9:13 ` Takuya Yoshikawa
2 siblings, 0 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-30 6:43 UTC (permalink / raw)
To: dlaor
Cc: ohmura.kei, stefanha, kvm, Stefan Hajnoczi, mtosatti, aliguori,
qemu-devel, Blue Swirl, Paul Brook, vatsa, avi, psuriset, ananth
2010/11/30 Dor Laor <dlaor@redhat.com>:
> On 11/29/2010 06:23 PM, Stefan Hajnoczi wrote:
>>
>> On Mon, Nov 29, 2010 at 3:00 PM, Yoshiaki Tamura
>> <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>>>
>>> 2010/11/29 Paul Brook<paul@codesourcery.com>:
>>>>>>
>>>>>> If devices incorrectly claim support for live migration, then that
>>>>>> should
>>>>>> also be fixed, either by removing the broken code or by making it
>>>>>> work.
>>>>>
>>>>> I totally agree with you.
>>>>>
>>>>>> AFAICT your current proposal is just feeding back the results of some
>>>>>> fairly specific QA testing. I'd rather not get into that game. The
>>>>>> correct response in the context of upstream development is to file a
>>>>>> bug
>>>>>> and/or fix the code. We already have config files that allow third
>>>>>> party
>>>>>> packagers to remove devices they don't want to support.
>>>>>
>>>>> Sorry, I didn't get what you're trying to tell me. My plan would
>>>>> be to initially start from a subset of devices, and gradually
>>>>> grow the number of devices that Kemari works with. While this
>>>>> process, it'll include what you said above, file a but and/or fix
>>>>> the code. Am I missing what you're saying?
>>>>
>>>> My point is that the whitelist shouldn't exist at all. Devices either
>>>> support
>>>> migration or they don't. Having some sort of separate whitelist is the
>>>> wrong
>>>> way to determine which devices support migration.
>>>
>>> Alright!
>>>
>>> Then if a user encounters a problem with Kemari, we'll fix Kemari
>>> or the devices or both. Correct?
>>
>> Is this a fair summary: any device that supports live migration workw
>> under Kemari?
>
> It might be fair summary but practically we barely have live migration
> working w/o Kemari. In addition, last I checked Kemari needs additional
> hooks and it will be too hard to keep that out of tree until all devices get
> it.
IIUC, the additional hook you're mentioning is the hack for
virtio. Michael has commented on it; I hope his patch makes the
hack unnecessary.
Yoshi
>
>>
>> (If such a device does not work under Kemari then this is a bug that
>> needs to be fixed in live migration, Kemari, or the device.)
>>
>> Stefan
>
>
>
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
2010-11-29 17:05 ` Anthony Liguori
2010-11-29 17:18 ` Paul Brook
@ 2010-11-30 7:13 ` Yoshiaki Tamura
1 sibling, 0 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-30 7:13 UTC (permalink / raw)
To: Anthony Liguori
Cc: ohmura.kei, mtosatti, stefanha, kvm, aliguori, Stefan Hajnoczi,
dlaor, qemu-devel, vatsa, Blue Swirl, Paul Brook, ananth,
psuriset, avi
2010/11/30 Anthony Liguori <anthony@codemonkey.ws>:
> On 11/29/2010 10:53 AM, Paul Brook wrote:
>>>>
>>>> Is this a fair summary: any device that supports live migration workw
>>>> under Kemari?
>>>>
>>>
>>> It might be fair summary but practically we barely have live migration
>>> working w/o Kemari. In addition, last I checked Kemari needs additional
>>> hooks and it will be too hard to keep that out of tree until all devices
>>> get it.
>>>
>>
>> That's not what I've been hearing earlier in this thread.
>> The responses from Yoshi indicate that Stefan's summary is correct. i.e.
>> the
>> current Kemari implementation may require per-device hooks, but that's a
>> bug
>> and should be fixed before merging.
>>
>
> It's actually really important that Kemari make use of an intermediate layer
> such that the hooks can distinguish between a device access and a recursive
> access.
>
> You could s/bdrv_aio_multiwrite/bdrv_aio_multiwrite_internal/g and then
> within kemari, s/bdrv_aio_multiwrite_proxy/bdrv_aio_multiwrite/ but I don't
> think that results in a cleaner interface.
>
> I don't like the _proxy naming and I think it has led to some confusion. I
> think having a dev_aio_multiwrite interface is a better naming scheme and
> ultimately provides a clearer idea of why a separate interface is needed--to
> distinguish between device accesses and internal accesses.
Sorry about the naming. But from the discussion so far, adding
an intermediate layer and exporting it to some/all devices needs
a strong reason. Kemari itself can be implemented w/ or w/o the
intermediate layer, and that makes folding the layer into
block/net look appropriate. I think there are two perspectives
for deciding which way to go:
- What is a clean interface for the upper/lower layers?
- If we introduce the intermediate layer, is there anyone who may
use it now or in the future? If not, it may not be worth adding.
Yoshi
>
> BTW, dev_aio_multiwrite should take a DeviceState * and a BlockDriverState.
>
> Regards,
>
> Anthony Liguori
>
>> Paul
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2
2010-11-29 16:41 ` Dor Laor
2010-11-29 16:53 ` Paul Brook
2010-11-30 6:43 ` Yoshiaki Tamura
@ 2010-11-30 9:13 ` Takuya Yoshikawa
2 siblings, 0 replies; 112+ messages in thread
From: Takuya Yoshikawa @ 2010-11-30 9:13 UTC (permalink / raw)
To: dlaor
Cc: ohmura.kei, aliguori, stefanha, kvm, Stefan Hajnoczi, mtosatti,
Yoshiaki Tamura, ananth, qemu-devel, Blue Swirl, Paul Brook,
vatsa, psuriset, avi
(2010/11/30 1:41), Dor Laor wrote:
>> Is this a fair summary: any device that supports live migration workw
>> under Kemari?
>
> It might be fair summary but practically we barely have live migration working w/o Kemari. In addition, last I checked Kemari needs additional hooks and it will be too hard to keep that out of tree until all devices get it.
You mean there are few potential users of live migration?
We are planning to use live migration. So if there are known issues with
live migration, apart from Kemari, it would be really helpful to have the list.
Thanks,
Takuya
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] Re: [PATCH 09/21] Introduce event-tap.
[not found] ` <20101130011914.GA9015@amt.cnet>
@ 2010-11-30 9:28 ` Yoshiaki Tamura
2010-11-30 10:25 ` Marcelo Tosatti
0 siblings, 1 reply; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-30 9:28 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: aliguori, ananth, kvm, ohmura.kei, dlaor, qemu-devel, vatsa, avi,
psuriset, stefanha
2010/11/30 Marcelo Tosatti <mtosatti@redhat.com>:
> On Thu, Nov 25, 2010 at 03:06:48PM +0900, Yoshiaki Tamura wrote:
>> event-tap controls when to start an FT transaction, and provides proxy
>> functions to be called from net/block devices. During an FT transaction,
>> it queues up net/block requests and flushes them when the transaction
>> completes.
>>
>> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
>
>> +static void event_tap_alloc_blk_req(EventTapBlkReq *blk_req,
>> + BlockDriverState *bs, BlockRequest *reqs,
>> + int num_reqs, BlockDriverCompletionFunc *cb,
>> + void *opaque, bool is_multiwrite)
>> +{
>> + int i;
>> +
>> + blk_req->num_reqs = num_reqs;
>> + blk_req->num_cbs = num_reqs;
>> + blk_req->device_name = qemu_strdup(bs->device_name);
>> + blk_req->is_multiwrite = is_multiwrite;
>> +
>> + for (i = 0; i < num_reqs; i++) {
>> + blk_req->reqs[i].sector = reqs[i].sector;
>> + blk_req->reqs[i].nb_sectors = reqs[i].nb_sectors;
>> + blk_req->reqs[i].qiov = reqs[i].qiov;
>> + blk_req->reqs[i].cb = cb;
>> + blk_req->reqs[i].opaque = opaque;
>> + blk_req->cb[i] = reqs[i].cb;
>> + blk_req->opaque[i] = reqs[i].opaque;
>> + }
>> +}
>
> bdrv_aio_flush should also be logged, so that guest initiated flush is
> respected on replay.
In the current implementation w/o flush logging, there might be
an order inversion after replay?
Yoshi
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 09/21] Introduce event-tap.
2010-11-29 11:00 ` [Qemu-devel] " Stefan Hajnoczi
@ 2010-11-30 9:50 ` Yoshiaki Tamura
2010-11-30 10:04 ` Stefan Hajnoczi
2011-01-04 11:02 ` Yoshiaki Tamura
1 sibling, 1 reply; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-30 9:50 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, mtosatti, qemu-devel,
vatsa, avi, psuriset, stefanha
2010/11/29 Stefan Hajnoczi <stefanha@gmail.com>:
> On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura
> <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>> event-tap controls when to start an FT transaction, and provides proxy
>> functions to be called from net/block devices. During an FT transaction,
>> it queues up net/block requests and flushes them when the transaction
>> completes.
>>
>> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
>> ---
>> Makefile.target | 1 +
>> block.h | 9 +
>> event-tap.c | 794 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> event-tap.h | 34 +++
>> net.h | 4 +
>> net/queue.c | 1 +
>> 6 files changed, 843 insertions(+), 0 deletions(-)
>> create mode 100644 event-tap.c
>> create mode 100644 event-tap.h
>
> event_tap_state is checked at the beginning of several functions. If
> there is an unexpected state the function silently returns. Should
> these checks really be assert() so there is an abort and backtrace if
> the program ever reaches this state?
I'm wondering whether abort is too strong, but I think you're
right, because just stopping Kemari may not be enough as error
handling.
>
>> +typedef struct EventTapBlkReq {
>> + char *device_name;
>> + int num_reqs;
>> + int num_cbs;
>> + bool is_multiwrite;
>
> Is multiwrite logging necessary? If event tap is called from within
> the block layer then multiwrite is turned into one or more
> bdrv_aio_writev() calls.
If we move event-tap into the block layer, I guess it won't be
necessary.
>> +static void event_tap_replay(void *opaque, int running, int reason)
>> +{
>> + EventTapLog *log, *next;
>> +
>> + if (!running) {
>> + return;
>> + }
>> +
>> + if (event_tap_state != EVENT_TAP_LOAD) {
>> + return;
>> + }
>> +
>> + event_tap_state = EVENT_TAP_REPLAY;
>> +
>> + QTAILQ_FOREACH(log, &event_list, node) {
>> + EventTapBlkReq *blk_req;
>> +
>> + /* event resume */
>> + switch (log->mode & ~EVENT_TAP_TYPE_MASK) {
>> + case EVENT_TAP_NET:
>> + event_tap_net_flush(&log->net_req);
>> + break;
>> + case EVENT_TAP_BLK:
>> + blk_req = &log->blk_req;
>> + if ((log->mode & EVENT_TAP_TYPE_MASK) == EVENT_TAP_IOPORT) {
>> + switch (log->ioport.index) {
>> + case 0:
>> + cpu_outb(log->ioport.address, log->ioport.data);
>> + break;
>> + case 1:
>> + cpu_outw(log->ioport.address, log->ioport.data);
>> + break;
>> + case 2:
>> + cpu_outl(log->ioport.address, log->ioport.data);
>> + break;
>> + }
>> + } else {
>> + /* EVENT_TAP_MMIO */
>> + cpu_physical_memory_rw(log->mmio.address,
>> + log->mmio.buf,
>> + log->mmio.len, 1);
>> + }
>> + break;
>
> Why are net tx packets replayed at the net level but blk requests are
> replayed at the pio/mmio level?
>
> I expected everything to replay either as pio/mmio or as net/block.
It's my mistake, sorry about that. We're just in the middle of moving
replay from pio/mmio to net/block, and I mixed things up. I'll revert
it to pio/mmio replay in the next spin.
BTW, I would like to ask a question regarding this. There is a
callback which net/block calls after processing the requests;
is there a clean way to set this callback on the failed-over
host upon replay?
>> +static void event_tap_blk_load(QEMUFile *f, EventTapBlkReq *blk_req)
>> +{
>> + BlockRequest *req;
>> + ram_addr_t page_addr;
>> + int i, j, len;
>> +
>> + len = qemu_get_byte(f);
>> + blk_req->device_name = qemu_malloc(len + 1);
>> + qemu_get_buffer(f, (uint8_t *)blk_req->device_name, len);
>> + blk_req->device_name[len] = '\0';
>> + blk_req->num_reqs = qemu_get_byte(f);
>> +
>> + for (i = 0; i < blk_req->num_reqs; i++) {
>> + req = &blk_req->reqs[i];
>> + req->sector = qemu_get_be64(f);
>> + req->nb_sectors = qemu_get_be32(f);
>> + req->qiov = qemu_malloc(sizeof(QEMUIOVector));
>
> It would make sense to have common QEMUIOVector load/save functions
> instead of inlining this code here.
OK.
>> +static int event_tap_load(QEMUFile *f, void *opaque, int version_id)
>> +{
>> + EventTapLog *log, *next;
>> + int mode;
>> +
>> + event_tap_state = EVENT_TAP_LOAD;
>> +
>> + QTAILQ_FOREACH_SAFE(log, &event_list, node, next) {
>> + QTAILQ_REMOVE(&event_list, log, node);
>> + event_tap_free_log(log);
>> + }
>> +
>> + /* loop until EOF */
>> + while ((mode = qemu_get_byte(f)) != 0) {
>> + EventTapLog *log = event_tap_alloc_log();
>> +
>> + log->mode = mode;
>> + switch (log->mode & EVENT_TAP_TYPE_MASK) {
>> + case EVENT_TAP_IOPORT:
>> + event_tap_ioport_load(f, &log->ioport);
>> + break;
>> + case EVENT_TAP_MMIO:
>> + event_tap_mmio_load(f, &log->mmio);
>> + break;
>> + case 0:
>> + DPRINTF("No event\n");
>> + break;
>> + default:
>> + fprintf(stderr, "Unknown state %d\n", log->mode);
>> + return -1;
>
> log is leaked here...
Oops:(
>
>> + }
>> +
>> + switch (log->mode & ~EVENT_TAP_TYPE_MASK) {
>> + case EVENT_TAP_NET:
>> + event_tap_net_load(f, &log->net_req);
>> + break;
>> + case EVENT_TAP_BLK:
>> + event_tap_blk_load(f, &log->blk_req);
>> + break;
>> + default:
>> + fprintf(stderr, "Unknown state %d\n", log->mode);
>> + return -1;
>
> ...and here.
Oops again:(
Will fix them.
Yoshi
>
> Stefan
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 09/21] Introduce event-tap.
2010-11-30 9:50 ` Yoshiaki Tamura
@ 2010-11-30 10:04 ` Stefan Hajnoczi
2010-11-30 10:20 ` Yoshiaki Tamura
0 siblings, 1 reply; 112+ messages in thread
From: Stefan Hajnoczi @ 2010-11-30 10:04 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, mtosatti, qemu-devel,
vatsa, avi, psuriset, stefanha
On Tue, Nov 30, 2010 at 9:50 AM, Yoshiaki Tamura
<tamura.yoshiaki@lab.ntt.co.jp> wrote:
> 2010/11/29 Stefan Hajnoczi <stefanha@gmail.com>:
>> On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura
>> <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>>> event-tap controls when to start FT transaction, and provides proxy
>>> functions to called from net/block devices. While FT transaction, it
>>> queues up net/block requests, and flush them when the transaction gets
>>> completed.
>>>
>>> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>>> Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
>>> ---
>>> Makefile.target | 1 +
>>> block.h | 9 +
>>> event-tap.c | 794 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>> event-tap.h | 34 +++
>>> net.h | 4 +
>>> net/queue.c | 1 +
>>> 6 files changed, 843 insertions(+), 0 deletions(-)
>>> create mode 100644 event-tap.c
>>> create mode 100644 event-tap.h
>>
>> event_tap_state is checked at the beginning of several functions. If
>> there is an unexpected state the function silently returns. Should
>> these checks really be assert() so there is an abort and backtrace if
>> the program ever reaches this state?
Fancier error handling would work too. For example cleaning up,
turning off Kemari, and producing an error message with
error_report(). In that case we need to think through the state of
the environment carefully and make sure we don't cause secondary
failures (like memory leaks).
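As a rough sketch of that direction (event_tap_cancel() and
event_tap_off() are hypothetical helpers on top of the patch's state
machine; error_report() is the existing qemu-error.h API):

    #include "qemu-error.h"

    static int event_tap_check_state(int expected)
    {
        if (event_tap_state == expected) {
            return 0;
        }
        error_report("event-tap: unexpected state %d, disabling Kemari",
                     event_tap_state);
        event_tap_cancel();  /* cancel the in-flight transaction, notify peer */
        event_tap_off();     /* stop tapping further events */
        return -1;
    }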
> BTW, I would like to ask a question regarding this. There is a
> callback which net/block calls after processing the requests, and
> is there a clean way to set this callback on the failovered
> host upon replay?
I think this is a limitation in the current design. If requests are
re-issued by Kemari at the net/block level, how will the higher layers
know about these requests? How will they be prepared to accept
callbacks?
Stefan
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 09/21] Introduce event-tap.
2010-11-30 10:04 ` Stefan Hajnoczi
@ 2010-11-30 10:20 ` Yoshiaki Tamura
0 siblings, 0 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-30 10:20 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, mtosatti, qemu-devel,
vatsa, avi, psuriset, stefanha
2010/11/30 Stefan Hajnoczi <stefanha@gmail.com>:
> On Tue, Nov 30, 2010 at 9:50 AM, Yoshiaki Tamura
> <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>> 2010/11/29 Stefan Hajnoczi <stefanha@gmail.com>:
>>> On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura
>>> <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>>>> event-tap controls when to start FT transaction, and provides proxy
>>>> functions to called from net/block devices. While FT transaction, it
>>>> queues up net/block requests, and flush them when the transaction gets
>>>> completed.
>>>>
>>>> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>>>> Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
>>>> ---
>>>> Makefile.target | 1 +
>>>> block.h | 9 +
>>>> event-tap.c | 794 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>>> event-tap.h | 34 +++
>>>> net.h | 4 +
>>>> net/queue.c | 1 +
>>>> 6 files changed, 843 insertions(+), 0 deletions(-)
>>>> create mode 100644 event-tap.c
>>>> create mode 100644 event-tap.h
>>>
>>> event_tap_state is checked at the beginning of several functions. If
>>> there is an unexpected state the function silently returns. Should
>>> these checks really be assert() so there is an abort and backtrace if
>>> the program ever reaches this state?
>
> Fancier error handling would work too. For example cleaning up,
> turning off Kemari, and producing an error message with
> error_report(). In that case we need to think through the state of
> the environment carefully and make sure we don't cause secondary
> failures (like memory leaks).
Turning off Kemari should include canceling the transaction, which
notifies the secondary. So, same as you commented for the
ft_trans_file error handling, I'll implement better handling
for event-tap as well.
>> BTW, I would like to ask a question regarding this. There is a
>> callback which net/block calls after processing the requests, and
>> is there a clean way to set this callback on the failovered
>> host upon replay?
>
> I think this is a limitation in the current design. If requests are
> re-issued by Kemari at the net/block level, how will the higher layers
> know about these requests? How will they be prepared to accept
> callbacks?
That's why we're using pio/mmio replay at the moment. With a
dirty hack in the device emulators that sets the callbacks before
replay, block/net replay seems to work, but I don't think that is
a correct solution.
Yoshi
>
> Stefan
>
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] Re: [PATCH 09/21] Introduce event-tap.
2010-11-30 9:28 ` Yoshiaki Tamura
@ 2010-11-30 10:25 ` Marcelo Tosatti
2010-11-30 10:35 ` Yoshiaki Tamura
0 siblings, 1 reply; 112+ messages in thread
From: Marcelo Tosatti @ 2010-11-30 10:25 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: aliguori, ananth, kvm, ohmura.kei, dlaor, qemu-devel, vatsa, avi,
psuriset, stefanha
On Tue, Nov 30, 2010 at 06:28:55PM +0900, Yoshiaki Tamura wrote:
> 2010/11/30 Marcelo Tosatti <mtosatti@redhat.com>:
> > On Thu, Nov 25, 2010 at 03:06:48PM +0900, Yoshiaki Tamura wrote:
> >> event-tap controls when to start FT transaction, and provides proxy
> >> functions to called from net/block devices. While FT transaction, it
> >> queues up net/block requests, and flush them when the transaction gets
> >> completed.
> >>
> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
> >
> >> +static void event_tap_alloc_blk_req(EventTapBlkReq *blk_req,
> >> + BlockDriverState *bs, BlockRequest *reqs,
> >> + int num_reqs, BlockDriverCompletionFunc *cb,
> >> + void *opaque, bool is_multiwrite)
> >> +{
> >> + int i;
> >> +
> >> + blk_req->num_reqs = num_reqs;
> >> + blk_req->num_cbs = num_reqs;
> >> + blk_req->device_name = qemu_strdup(bs->device_name);
> >> + blk_req->is_multiwrite = is_multiwrite;
> >> +
> >> + for (i = 0; i < num_reqs; i++) {
> >> + blk_req->reqs[i].sector = reqs[i].sector;
> >> + blk_req->reqs[i].nb_sectors = reqs[i].nb_sectors;
> >> + blk_req->reqs[i].qiov = reqs[i].qiov;
> >> + blk_req->reqs[i].cb = cb;
> >> + blk_req->reqs[i].opaque = opaque;
> >> + blk_req->cb[i] = reqs[i].cb;
> >> + blk_req->opaque[i] = reqs[i].opaque;
> >> + }
> >> +}
> >
> > bdrv_aio_flush should also be logged, so that guest initiated flush is
> > respected on replay.
>
> In the current implementation w/o flush logging, there might be
> order inversion after replay?
>
> Yoshi
Yes, since a vcpu is allowed to continue after synchronization is
scheduled via a bh. For virtio-blk, for example:
1) bdrv_aio_write, event queued.
2) bdrv_aio_flush
3) bdrv_aio_write, event queued.
On replay, there is no flush between the two writes.
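A rough sketch of the fix being asked for: an event-tap proxy for
bdrv_aio_flush() that queues the flush in order with the writes, so the
write-flush-write ordering survives replay. The event_tap_is_on() and
event_tap_queue_flush() helpers are hypothetical names mirroring the
existing write proxies:

BlockDriverAIOCB *event_tap_bdrv_aio_flush(BlockDriverState *bs,
                                           BlockDriverCompletionFunc *cb,
                                           void *opaque)
{
    if (event_tap_is_on()) {
        /* Record the flush at its position in the request queue; the
         * callback fires when the queued events are flushed after the
         * FT transaction completes. */
        return event_tap_queue_flush(bs, cb, opaque);
    }

    return bdrv_aio_flush(bs, cb, opaque);
}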
Why can't synchronization be done from event-tap itself, synchronously,
to avoid this kind of problem?
The way you hook synchronization into savevm seems unclean. Perhaps
better separation between standard savevm path and FT savevm would make
it cleaner.
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] Re: [PATCH 09/21] Introduce event-tap.
2010-11-30 10:25 ` Marcelo Tosatti
@ 2010-11-30 10:35 ` Yoshiaki Tamura
2010-11-30 13:11 ` Marcelo Tosatti
0 siblings, 1 reply; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-11-30 10:35 UTC (permalink / raw)
To: Marcelo Tosatti
Cc: aliguori, ananth, kvm, ohmura.kei, dlaor, qemu-devel, vatsa, avi,
psuriset, stefanha
Marcelo Tosatti wrote:
> On Tue, Nov 30, 2010 at 06:28:55PM +0900, Yoshiaki Tamura wrote:
>> 2010/11/30 Marcelo Tosatti<mtosatti@redhat.com>:
>>> On Thu, Nov 25, 2010 at 03:06:48PM +0900, Yoshiaki Tamura wrote:
>>>> event-tap controls when to start FT transaction, and provides proxy
>>>> functions to called from net/block devices. While FT transaction, it
>>>> queues up net/block requests, and flush them when the transaction gets
>>>> completed.
>>>>
>>>> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
>>>> Signed-off-by: OHMURA Kei<ohmura.kei@lab.ntt.co.jp>
>>>
>>>> +static void event_tap_alloc_blk_req(EventTapBlkReq *blk_req,
>>>> + BlockDriverState *bs, BlockRequest *reqs,
>>>> + int num_reqs, BlockDriverCompletionFunc *cb,
>>>> + void *opaque, bool is_multiwrite)
>>>> +{
>>>> + int i;
>>>> +
>>>> + blk_req->num_reqs = num_reqs;
>>>> + blk_req->num_cbs = num_reqs;
>>>> + blk_req->device_name = qemu_strdup(bs->device_name);
>>>> + blk_req->is_multiwrite = is_multiwrite;
>>>> +
>>>> + for (i = 0; i< num_reqs; i++) {
>>>> + blk_req->reqs[i].sector = reqs[i].sector;
>>>> + blk_req->reqs[i].nb_sectors = reqs[i].nb_sectors;
>>>> + blk_req->reqs[i].qiov = reqs[i].qiov;
>>>> + blk_req->reqs[i].cb = cb;
>>>> + blk_req->reqs[i].opaque = opaque;
>>>> + blk_req->cb[i] = reqs[i].cb;
>>>> + blk_req->opaque[i] = reqs[i].opaque;
>>>> + }
>>>> +}
>>>
>>> bdrv_aio_flush should also be logged, so that guest initiated flush is
>>> respected on replay.
>>
>> In the current implementation w/o flush logging, there might be
>> order inversion after replay?
>>
>> Yoshi
>
> Yes, since a vcpu is allowed to continue after synchronization is
> scheduled via a bh. For virtio-blk, for example:
>
> 1) bdrv_aio_write, event queued.
> 2) bdrv_aio_flush
> 3) bdrv_aio_write, event queued.
>
> On replay, there is no flush between the two writes.
>
> Why can't synchronization be done from event-tap itself, synchronously,
> to avoid this kind of problem?
Thanks. I will fix it.
> The way you hook synchronization into savevm seems unclean. Perhaps
> better separation between standard savevm path and FT savevm would make
> it cleaner.
I think you're referring to the changes in migration.c?
Yoshi
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] Re: [PATCH 09/21] Introduce event-tap.
2010-11-30 10:35 ` Yoshiaki Tamura
@ 2010-11-30 13:11 ` Marcelo Tosatti
0 siblings, 0 replies; 112+ messages in thread
From: Marcelo Tosatti @ 2010-11-30 13:11 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: aliguori, ananth, kvm, ohmura.kei, dlaor, qemu-devel, vatsa,
Michael S. Tsirkin, avi, psuriset, stefanha
On Tue, Nov 30, 2010 at 07:35:54PM +0900, Yoshiaki Tamura wrote:
> Marcelo Tosatti wrote:
> >On Tue, Nov 30, 2010 at 06:28:55PM +0900, Yoshiaki Tamura wrote:
> >>2010/11/30 Marcelo Tosatti<mtosatti@redhat.com>:
> >>>On Thu, Nov 25, 2010 at 03:06:48PM +0900, Yoshiaki Tamura wrote:
> >>>>event-tap controls when to start FT transaction, and provides proxy
> >>>>functions to called from net/block devices. While FT transaction, it
> >>>>queues up net/block requests, and flush them when the transaction gets
> >>>>completed.
> >>>>
> >>>>Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
> >>>>Signed-off-by: OHMURA Kei<ohmura.kei@lab.ntt.co.jp>
> >>>
> >>>>+static void event_tap_alloc_blk_req(EventTapBlkReq *blk_req,
> >>>>+ BlockDriverState *bs, BlockRequest *reqs,
> >>>>+ int num_reqs, BlockDriverCompletionFunc *cb,
> >>>>+ void *opaque, bool is_multiwrite)
> >>>>+{
> >>>>+ int i;
> >>>>+
> >>>>+ blk_req->num_reqs = num_reqs;
> >>>>+ blk_req->num_cbs = num_reqs;
> >>>>+ blk_req->device_name = qemu_strdup(bs->device_name);
> >>>>+ blk_req->is_multiwrite = is_multiwrite;
> >>>>+
> >>>>+ for (i = 0; i< num_reqs; i++) {
> >>>>+ blk_req->reqs[i].sector = reqs[i].sector;
> >>>>+ blk_req->reqs[i].nb_sectors = reqs[i].nb_sectors;
> >>>>+ blk_req->reqs[i].qiov = reqs[i].qiov;
> >>>>+ blk_req->reqs[i].cb = cb;
> >>>>+ blk_req->reqs[i].opaque = opaque;
> >>>>+ blk_req->cb[i] = reqs[i].cb;
> >>>>+ blk_req->opaque[i] = reqs[i].opaque;
> >>>>+ }
> >>>>+}
> >>>
> >>>bdrv_aio_flush should also be logged, so that guest initiated flush is
> >>>respected on replay.
> >>
> >>In the current implementation w/o flush logging, there might be
> >>order inversion after replay?
> >>
> >>Yoshi
> >
> >Yes, since a vcpu is allowed to continue after synchronization is
> >scheduled via a bh. For virtio-blk, for example:
> >
> >1) bdrv_aio_write, event queued.
> >2) bdrv_aio_flush
> >3) bdrv_aio_write, event queued.
> >
> >On replay, there is no flush between the two writes.
> >
> >Why can't synchronization be done from event-tap itself, synchronously,
> >to avoid this kind of problem?
>
> Thanks. I would fix it.
>
> >The way you hook synchronization into savevm seems unclean. Perhaps
> >better separation between standard savevm path and FT savevm would make
> >it cleaner.
>
> I think you're mentioning about the changes in migration.c?
>
> Yoshi
The important point is to stop vcpu activity after the event is queued,
and resume once synchronization is performed. Stopping the vm after
Kemari event queueing should do it, once Michael's "stable migration"
patchset is in (and net/block layers fixed).
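A rough sketch of that sequencing (the event-tap names and the EventTapLog
type are illustrative; vm_stop()/vm_start() and the bottom half are the
existing qemu primitives):

/* Queue the tapped event, then stop vcpus until the FT transaction is done. */
static void event_tap_queue_and_stop(EventTapLog *log)
{
    event_tap_queue(log);              /* record the net/block request */
    vm_stop(0);                        /* no further vcpu activity until sync */
    qemu_bh_schedule(event_tap_bh);    /* kick the synchronization */
}

/* Bottom half: run the transaction, then resume vcpus. */
static void event_tap_bh_cb(void *opaque)
{
    event_tap_synchronize();           /* send queued events to the secondary */
    vm_start();                        /* resume once the secondary has acked */
}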
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse varialble.
2010-11-28 11:46 ` Michael S. Tsirkin
@ 2010-12-01 8:03 ` Yoshiaki Tamura
2010-12-02 12:02 ` Michael S. Tsirkin
0 siblings, 1 reply; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-12-01 8:03 UTC (permalink / raw)
To: Michael S. Tsirkin, Marcelo Tosatti
Cc: aliguori, ananth, kvm, ohmura.kei, dlaor, qemu-devel, vatsa, avi,
psuriset, stefanha
2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
>> >> Modify inuse type to uint16_t, let save/load to handle, and revert
>> >> last_avail_idx with inuse if there are outstanding emulation.
>> >>
>> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >
>> > This changes migration format, so it will break compatibility with
>> > existing drivers. More generally, I think migrating internal
>> > state that is not guest visible is always a mistake
>> > as it ties migration format to an internal implementation
>> > (yes, I know we do this sometimes, but we should at least
>> > try not to add such cases). I think the right thing to do in this case
>> > is to flush outstanding
>> > work when vm is stopped. Then, we are guaranteed that inuse is 0.
>> > I sent patches that do this for virtio net and block.
>>
>> Could you give me the link of your patches? I'd like to test
>> whether they work with Kemari upon failover. If they do, I'm
>> happy to drop this patch.
>>
>> Yoshi
>
> Look for this:
> stable migration image on a stopped vm
> sent on:
> Wed, 24 Nov 2010 17:52:49 +0200
Thanks for the info.
However, the patch series above didn't solve the issue. In the
case of Kemari, inuse is mostly > 0 because it queues the
output, and while last_avail_idx gets incremented
immediately, not sending inuse makes the state inconsistent
between the Primary and the Secondary. I'm wondering why
last_avail_idx is OK to send but inuse is not.
The following patch does the same thing as the original, yet
keeps the virtio migration format. It shouldn't break live
migration either, because inuse should be 0.
Yoshi
diff --git a/hw/virtio.c b/hw/virtio.c
index c8a0fc6..875c7ca 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
     qemu_put_be32(f, i);

     for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
+        uint16_t last_avail_idx;
+
         if (vdev->vq[i].vring.num == 0)
             break;

+        last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
+
         qemu_put_be32(f, vdev->vq[i].vring.num);
         qemu_put_be64(f, vdev->vq[i].pa);
-        qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
+        qemu_put_be16s(f, &last_avail_idx);
         if (vdev->binding->save_queue)
             vdev->binding->save_queue(vdev->binding_opaque, i, f);
     }
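To make the effect concrete (the numbers are illustrative): if the primary
is saved with last_avail_idx = 5 and inuse = 2, the value sent is 3, so
after failover the secondary pops the two in-flight requests again and
replays them; in a normal live migration inuse is 0 and the saved value is
unchanged.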
>
>> >
>> >> ---
>> >> hw/virtio.c | 8 +++++++-
>> >> 1 files changed, 7 insertions(+), 1 deletions(-)
>> >>
>> >> diff --git a/hw/virtio.c b/hw/virtio.c
>> >> index 849a60f..5509644 100644
>> >> --- a/hw/virtio.c
>> >> +++ b/hw/virtio.c
>> >> @@ -72,7 +72,7 @@ struct VirtQueue
>> >> VRing vring;
>> >> target_phys_addr_t pa;
>> >> uint16_t last_avail_idx;
>> >> - int inuse;
>> >> + uint16_t inuse;
>> >> uint16_t vector;
>> >> void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
>> >> VirtIODevice *vdev;
>> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>> >> qemu_put_be32(f, vdev->vq[i].vring.num);
>> >> qemu_put_be64(f, vdev->vq[i].pa);
>> >> qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>> >> + qemu_put_be16s(f, &vdev->vq[i].inuse);
>> >> if (vdev->binding->save_queue)
>> >> vdev->binding->save_queue(vdev->binding_opaque, i, f);
>> >> }
>> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
>> >> vdev->vq[i].vring.num = qemu_get_be32(f);
>> >> vdev->vq[i].pa = qemu_get_be64(f);
>> >> qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
>> >> + qemu_get_be16s(f, &vdev->vq[i].inuse);
>> >> +
>> >> + /* revert last_avail_idx if there are outstanding emulation. */
>> >> + vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
>> >> + vdev->vq[i].inuse = 0;
>> >>
>> >> if (vdev->vq[i].pa) {
>> >> virtqueue_init(&vdev->vq[i]);
>> >> --
>> >> 1.7.1.2
>> >>
>
^ permalink raw reply related [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse varialble.
2010-12-01 8:03 ` Yoshiaki Tamura
@ 2010-12-02 12:02 ` Michael S. Tsirkin
2010-12-03 6:28 ` Yoshiaki Tamura
0 siblings, 1 reply; 112+ messages in thread
From: Michael S. Tsirkin @ 2010-12-02 12:02 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, Marcelo Tosatti,
qemu-devel, vatsa, avi, psuriset, stefanha
On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
> >> >> last_avail_idx with inuse if there are outstanding emulation.
> >> >>
> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >
> >> > This changes migration format, so it will break compatibility with
> >> > existing drivers. More generally, I think migrating internal
> >> > state that is not guest visible is always a mistake
> >> > as it ties migration format to an internal implementation
> >> > (yes, I know we do this sometimes, but we should at least
> >> > try not to add such cases). I think the right thing to do in this case
> >> > is to flush outstanding
> >> > work when vm is stopped. Then, we are guaranteed that inuse is 0.
> >> > I sent patches that do this for virtio net and block.
> >>
> >> Could you give me the link of your patches? I'd like to test
> >> whether they work with Kemari upon failover. If they do, I'm
> >> happy to drop this patch.
> >>
> >> Yoshi
> >
> > Look for this:
> > stable migration image on a stopped vm
> > sent on:
> > Wed, 24 Nov 2010 17:52:49 +0200
>
> Thanks for the info.
>
> However, The patch series above didn't solve the issue. In
> case of Kemari, inuse is mostly > 0 because it queues the
> output, and while last_avail_idx gets incremented
> immediately, not sending inuse makes the state inconsistent
> between Primary and Secondary.
Hmm. Can we simply avoid incrementing last_avail_idx?
> I'm wondering why
> last_avail_idx is OK to send but not inuse.
last_avail_idx is at some level a mistake, it exposes part of
our internal implementation, but it does *also* express
a guest observable state.
Here's the problem that it solves: just looking at the rings in virtio
there is no way to detect that a specific request has already been
completed. And the protocol forbids completing the same request twice.
Our implementation always starts processing the requests
in order, and since we flush outstanding requests
before save, it works to just tell the remote 'process only requests
after this place'.
But there's no such requirement in the virtio protocol,
so to be really generic we could add a bitmask of valid avail
ring entries that did not complete yet. This would be
the exact representation of the guest observable state.
In practice we have rings of up to 512 entries.
That's 64 byte per ring, not a lot at all.
However, if we ever do change the protocol to send the bitmask,
we would need some code to resubmit requests
out of order, so it's not trivial.
Another minor mistake with last_avail_idx is that it has
some redundancy: the high bits in the index
(> vq size) are not necessary as they can be
got from avail idx. There's a consistency check
in load but we really should try to use formats
that are always consistent.
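For illustration, a per-ring in-flight bitmask along those lines could look
like this; nothing here exists in virtio or in the patch, it is only a
sketch of the idea:

#define VRING_MAX 512               /* 512 bits = 64 bytes per ring */

typedef struct VRingInflight {
    uint8_t pending[VRING_MAX / 8]; /* one bit per popped-but-uncompleted entry */
} VRingInflight;

static inline void inflight_set(VRingInflight *m, unsigned int idx)
{
    m->pending[(idx % VRING_MAX) / 8] |= 1 << (idx % 8);
}

static inline void inflight_clear(VRingInflight *m, unsigned int idx)
{
    m->pending[(idx % VRING_MAX) / 8] &= ~(1 << (idx % 8));
}

/* Set in virtqueue_pop(), cleared in virtqueue_fill()/flush(); virtio_save()
 * would write the whole array so the destination can resubmit the still
 * pending requests, possibly out of order. */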
> The following patch does the same thing as original, yet
> keeps the format of the virtio. It shouldn't break live
> migration either because inuse should be 0.
>
> Yoshi
Question is, can you flush to make inuse 0 in kemari too?
And if not, how do you handle the fact that some requests
are in flight on the primary?
> diff --git a/hw/virtio.c b/hw/virtio.c
> index c8a0fc6..875c7ca 100644
> --- a/hw/virtio.c
> +++ b/hw/virtio.c
> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> qemu_put_be32(f, i);
>
> for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
> + uint16_t last_avail_idx;
> +
> if (vdev->vq[i].vring.num == 0)
> break;
>
> + last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
> +
> qemu_put_be32(f, vdev->vq[i].vring.num);
> qemu_put_be64(f, vdev->vq[i].pa);
> - qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> + qemu_put_be16s(f, &last_avail_idx);
> if (vdev->binding->save_queue)
> vdev->binding->save_queue(vdev->binding_opaque, i, f);
> }
>
>
This looks wrong to me. Requests can complete in any order, can they
not? So if request 0 did not complete and request 1 did not,
you send avail - inuse and on the secondary you will process and
complete request 1 the second time, crashing the guest.
>
> >
> >> >
> >> >> ---
> >> >> hw/virtio.c | 8 +++++++-
> >> >> 1 files changed, 7 insertions(+), 1 deletions(-)
> >> >>
> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> index 849a60f..5509644 100644
> >> >> --- a/hw/virtio.c
> >> >> +++ b/hw/virtio.c
> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
> >> >> VRing vring;
> >> >> target_phys_addr_t pa;
> >> >> uint16_t last_avail_idx;
> >> >> - int inuse;
> >> >> + uint16_t inuse;
> >> >> uint16_t vector;
> >> >> void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
> >> >> VirtIODevice *vdev;
> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >> >> qemu_put_be32(f, vdev->vq[i].vring.num);
> >> >> qemu_put_be64(f, vdev->vq[i].pa);
> >> >> qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >> + qemu_put_be16s(f, &vdev->vq[i].inuse);
> >> >> if (vdev->binding->save_queue)
> >> >> vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >> >> }
> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
> >> >> vdev->vq[i].vring.num = qemu_get_be32(f);
> >> >> vdev->vq[i].pa = qemu_get_be64(f);
> >> >> qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >> + qemu_get_be16s(f, &vdev->vq[i].inuse);
> >> >> +
> >> >> + /* revert last_avail_idx if there are outstanding emulation. */
> >> >> + vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
> >> >> + vdev->vq[i].inuse = 0;
> >> >>
> >> >> if (vdev->vq[i].pa) {
> >> >> virtqueue_init(&vdev->vq[i]);
> >> >> --
> >> >> 1.7.1.2
> >> >>
> >
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse varialble.
2010-12-02 12:02 ` Michael S. Tsirkin
@ 2010-12-03 6:28 ` Yoshiaki Tamura
2010-12-16 7:36 ` Yoshiaki Tamura
0 siblings, 1 reply; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-12-03 6:28 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, Marcelo Tosatti,
qemu-devel, vatsa, avi, psuriset, stefanha
2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
>> >> >> last_avail_idx with inuse if there are outstanding emulation.
>> >> >>
>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >> >
>> >> > This changes migration format, so it will break compatibility with
>> >> > existing drivers. More generally, I think migrating internal
>> >> > state that is not guest visible is always a mistake
>> >> > as it ties migration format to an internal implementation
>> >> > (yes, I know we do this sometimes, but we should at least
>> >> > try not to add such cases). I think the right thing to do in this case
>> >> > is to flush outstanding
>> >> > work when vm is stopped. Then, we are guaranteed that inuse is 0.
>> >> > I sent patches that do this for virtio net and block.
>> >>
>> >> Could you give me the link of your patches? I'd like to test
>> >> whether they work with Kemari upon failover. If they do, I'm
>> >> happy to drop this patch.
>> >>
>> >> Yoshi
>> >
>> > Look for this:
>> > stable migration image on a stopped vm
>> > sent on:
>> > Wed, 24 Nov 2010 17:52:49 +0200
>>
>> Thanks for the info.
>>
>> However, The patch series above didn't solve the issue. In
>> case of Kemari, inuse is mostly > 0 because it queues the
>> output, and while last_avail_idx gets incremented
>> immediately, not sending inuse makes the state inconsistent
>> between Primary and Secondary.
>
> Hmm. Can we simply avoid incrementing last_avail_idx?
I think we can calculate or prepare an internal last_avail_idx,
and update the external one when inuse is decremented. I'll try
whether it works both with and without Kemari.
>
>> I'm wondering why
>> last_avail_idx is OK to send but not inuse.
>
> last_avail_idx is at some level a mistake, it exposes part of
> our internal implementation, but it does *also* express
> a guest observable state.
>
> Here's the problem that it solves: just looking at the rings in virtio
> there is no way to detect that a specific request has already been
> completed. And the protocol forbids completing the same request twice.
>
> Our implementation always starts processing the requests
> in order, and since we flush outstanding requests
> before save, it works to just tell the remote 'process only requests
> after this place'.
>
> But there's no such requirement in the virtio protocol,
> so to be really generic we could add a bitmask of valid avail
> ring entries that did not complete yet. This would be
> the exact representation of the guest observable state.
> In practice we have rings of up to 512 entries.
> That's 64 byte per ring, not a lot at all.
>
> However, if we ever do change the protocol to send the bitmask,
> we would need some code to resubmit requests
> out of order, so it's not trivial.
>
> Another minor mistake with last_avail_idx is that it has
> some redundancy: the high bits in the index
> (> vq size) are not necessary as they can be
> got from avail idx. There's a consistency check
> in load but we really should try to use formats
> that are always consistent.
>
>> The following patch does the same thing as original, yet
>> keeps the format of the virtio. It shouldn't break live
>> migration either because inuse should be 0.
>>
>> Yoshi
>
> Question is, can you flush to make inuse 0 in kemari too?
> And if not, how do you handle the fact that some requests
> are in flight on the primary?
Although we try flushing the requests one by one to make inuse 0,
there are cases where failover to the secondary happens while
inuse isn't 0. We handle these in-flight requests from the
primary by replaying them on the secondary.
>
>> diff --git a/hw/virtio.c b/hw/virtio.c
>> index c8a0fc6..875c7ca 100644
>> --- a/hw/virtio.c
>> +++ b/hw/virtio.c
>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>> qemu_put_be32(f, i);
>>
>> for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
>> + uint16_t last_avail_idx;
>> +
>> if (vdev->vq[i].vring.num == 0)
>> break;
>>
>> + last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
>> +
>> qemu_put_be32(f, vdev->vq[i].vring.num);
>> qemu_put_be64(f, vdev->vq[i].pa);
>> - qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>> + qemu_put_be16s(f, &last_avail_idx);
>> if (vdev->binding->save_queue)
>> vdev->binding->save_queue(vdev->binding_opaque, i, f);
>> }
>>
>>
>
> This looks wrong to me. Requests can complete in any order, can they
> not? So if request 0 did not complete and request 1 did not,
> you send avail - inuse and on the secondary you will process and
> complete request 1 the second time, crashing the guest.
In the case of Kemari, no. We sit between the devices and net/block,
and queue the requests. After completing each transaction, we flush
the requests one by one, so there won't be completion inversion,
and therefore it won't be visible to the guest.
Yoshi
>
>>
>> >
>> >> >
>> >> >> ---
>> >> >> hw/virtio.c | 8 +++++++-
>> >> >> 1 files changed, 7 insertions(+), 1 deletions(-)
>> >> >>
>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
>> >> >> index 849a60f..5509644 100644
>> >> >> --- a/hw/virtio.c
>> >> >> +++ b/hw/virtio.c
>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
>> >> >> VRing vring;
>> >> >> target_phys_addr_t pa;
>> >> >> uint16_t last_avail_idx;
>> >> >> - int inuse;
>> >> >> + uint16_t inuse;
>> >> >> uint16_t vector;
>> >> >> void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
>> >> >> VirtIODevice *vdev;
>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>> >> >> qemu_put_be32(f, vdev->vq[i].vring.num);
>> >> >> qemu_put_be64(f, vdev->vq[i].pa);
>> >> >> qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>> >> >> + qemu_put_be16s(f, &vdev->vq[i].inuse);
>> >> >> if (vdev->binding->save_queue)
>> >> >> vdev->binding->save_queue(vdev->binding_opaque, i, f);
>> >> >> }
>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
>> >> >> vdev->vq[i].vring.num = qemu_get_be32(f);
>> >> >> vdev->vq[i].pa = qemu_get_be64(f);
>> >> >> qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
>> >> >> + qemu_get_be16s(f, &vdev->vq[i].inuse);
>> >> >> +
>> >> >> + /* revert last_avail_idx if there are outstanding emulation. */
>> >> >> + vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
>> >> >> + vdev->vq[i].inuse = 0;
>> >> >>
>> >> >> if (vdev->vq[i].pa) {
>> >> >> virtqueue_init(&vdev->vq[i]);
>> >> >> --
>> >> >> 1.7.1.2
>> >> >>
>
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 06/21] vl: add a tmp pointer so that a handler can delete the entry to which it belongs.
2010-11-25 6:06 ` [Qemu-devel] [PATCH 06/21] vl: add a tmp pointer so that a handler can delete the entry to which it belongs Yoshiaki Tamura
@ 2010-12-08 7:03 ` Isaku Yamahata
2010-12-08 8:11 ` Yoshiaki Tamura
0 siblings, 1 reply; 112+ messages in thread
From: Isaku Yamahata @ 2010-12-08 7:03 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: ohmura.kei, dlaor, ananth, kvm, mtosatti, aliguori, qemu-devel,
avi, vatsa, psuriset, stefanha
QLIST_FOREACH_SAFE?
On Thu, Nov 25, 2010 at 03:06:45PM +0900, Yoshiaki Tamura wrote:
> By copying the next entry to a tmp pointer,
> qemu_del_vm_change_state_handler() can be called in the handler.
>
> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> ---
> vl.c | 5 +++--
> 1 files changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/vl.c b/vl.c
> index 805e11f..6b6aec0 100644
> --- a/vl.c
> +++ b/vl.c
> @@ -1073,11 +1073,12 @@ void qemu_del_vm_change_state_handler(VMChangeStateEntry *e)
>
> void vm_state_notify(int running, int reason)
> {
> - VMChangeStateEntry *e;
> + VMChangeStateEntry *e, *ne;
>
> trace_vm_state_notify(running, reason);
>
> - for (e = vm_change_state_head.lh_first; e; e = e->entries.le_next) {
> + for (e = vm_change_state_head.lh_first; e; e = ne) {
> + ne = e->entries.le_next;
> e->cb(e->opaque, running, reason);
> }
> }
> --
> 1.7.1.2
>
>
--
yamahata
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 06/21] vl: add a tmp pointer so that a handler can delete the entry to which it belongs.
2010-12-08 7:03 ` Isaku Yamahata
@ 2010-12-08 8:11 ` Yoshiaki Tamura
2010-12-08 14:22 ` Anthony Liguori
0 siblings, 1 reply; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-12-08 8:11 UTC (permalink / raw)
To: Isaku Yamahata
Cc: ohmura.kei, dlaor, ananth, kvm, mtosatti, aliguori, qemu-devel,
avi, vatsa, psuriset, stefanha
2010/12/8 Isaku Yamahata <yamahata@valinux.co.jp>:
> QLIST_FOREACH_SAFE?
Thanks! So, it should be,
QLIST_FOREACH_SAFE(e, &vm_change_state_head, entries, ne) {
e->cb(e->opaque, running, reason);
}
I'll put it in the next spin.
Yoshi
>
> On Thu, Nov 25, 2010 at 03:06:45PM +0900, Yoshiaki Tamura wrote:
>> By copying the next entry to a tmp pointer,
>> qemu_del_vm_change_state_handler() can be called in the handler.
>>
>> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> ---
>> vl.c | 5 +++--
>> 1 files changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/vl.c b/vl.c
>> index 805e11f..6b6aec0 100644
>> --- a/vl.c
>> +++ b/vl.c
>> @@ -1073,11 +1073,12 @@ void qemu_del_vm_change_state_handler(VMChangeStateEntry *e)
>>
>> void vm_state_notify(int running, int reason)
>> {
>> - VMChangeStateEntry *e;
>> + VMChangeStateEntry *e, *ne;
>>
>> trace_vm_state_notify(running, reason);
>>
>> - for (e = vm_change_state_head.lh_first; e; e = e->entries.le_next) {
>> + for (e = vm_change_state_head.lh_first; e; e = ne) {
>> + ne = e->entries.le_next;
>> e->cb(e->opaque, running, reason);
>> }
>> }
>> --
>> 1.7.1.2
>>
>>
>
> --
> yamahata
>
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] [PATCH 06/21] vl: add a tmp pointer so that a handler can delete the entry to which it belongs.
2010-12-08 8:11 ` Yoshiaki Tamura
@ 2010-12-08 14:22 ` Anthony Liguori
0 siblings, 0 replies; 112+ messages in thread
From: Anthony Liguori @ 2010-12-08 14:22 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: Anthony Liguori, dlaor, ananth, kvm, ohmura.kei, mtosatti,
qemu-devel, vatsa, Isaku Yamahata, avi, psuriset, stefanha
On 12/08/2010 02:11 AM, Yoshiaki Tamura wrote:
> 2010/12/8 Isaku Yamahata<yamahata@valinux.co.jp>:
>
>> QLIST_FOREACH_SAFE?
>>
> Thanks! So, it should be,
>
> QLIST_FOREACH_SAFE(e,&vm_change_state_head, entries, ne) {
> e->cb(e->opaque, running, reason);
> }
>
> I'll put it in the next spin.
>
This is still brittle though because it only allows the current handler
to delete itself. A better approach is to borrow the technique we use
with file descriptors (using a deleted flag) as that is robust against
deletion of any elements in a handler.
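A sketch of that approach applied here, assuming a new 'deleted' field is
added to VMChangeStateEntry (modeled on the deleted flag used for fd
handlers; illustrative only, not a finished patch):

void qemu_del_vm_change_state_handler(VMChangeStateEntry *e)
{
    e->deleted = 1;    /* defer the unlink; safe to call from any handler */
}

void vm_state_notify(int running, int reason)
{
    VMChangeStateEntry *e, *ne;

    trace_vm_state_notify(running, reason);

    QLIST_FOREACH_SAFE(e, &vm_change_state_head, entries, ne) {
        if (e->deleted) {
            /* reap entries flagged by any handler, current or not */
            QLIST_REMOVE(e, entries);
            qemu_free(e);
        } else {
            e->cb(e->opaque, running, reason);
        }
    }
}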
Regards,
Anthony Liguori
> Yoshi
>
>
>> On Thu, Nov 25, 2010 at 03:06:45PM +0900, Yoshiaki Tamura wrote:
>>
>>> By copying the next entry to a tmp pointer,
>>> qemu_del_vm_change_state_handler() can be called in the handler.
>>>
>>> Signed-off-by: Yoshiaki Tamura<tamura.yoshiaki@lab.ntt.co.jp>
>>> ---
>>> vl.c | 5 +++--
>>> 1 files changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/vl.c b/vl.c
>>> index 805e11f..6b6aec0 100644
>>> --- a/vl.c
>>> +++ b/vl.c
>>> @@ -1073,11 +1073,12 @@ void qemu_del_vm_change_state_handler(VMChangeStateEntry *e)
>>>
>>> void vm_state_notify(int running, int reason)
>>> {
>>> - VMChangeStateEntry *e;
>>> + VMChangeStateEntry *e, *ne;
>>>
>>> trace_vm_state_notify(running, reason);
>>>
>>> - for (e = vm_change_state_head.lh_first; e; e = e->entries.le_next) {
>>> + for (e = vm_change_state_head.lh_first; e; e = ne) {
>>> + ne = e->entries.le_next;
>>> e->cb(e->opaque, running, reason);
>>> }
>>> }
>>> --
>>> 1.7.1.2
>>>
>>>
>>>
>> --
>> yamahata
>>
>>
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse varialble.
2010-12-03 6:28 ` Yoshiaki Tamura
@ 2010-12-16 7:36 ` Yoshiaki Tamura
2010-12-16 9:51 ` Michael S. Tsirkin
0 siblings, 1 reply; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-12-16 7:36 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, Marcelo Tosatti,
qemu-devel, vatsa, avi, psuriset, stefanha
2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
> 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
>> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
>>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
>>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
>>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
>>> >> >> last_avail_idx with inuse if there are outstanding emulation.
>>> >> >>
>>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>>> >> >
>>> >> > This changes migration format, so it will break compatibility with
>>> >> > existing drivers. More generally, I think migrating internal
>>> >> > state that is not guest visible is always a mistake
>>> >> > as it ties migration format to an internal implementation
>>> >> > (yes, I know we do this sometimes, but we should at least
>>> >> > try not to add such cases). I think the right thing to do in this case
>>> >> > is to flush outstanding
>>> >> > work when vm is stopped. Then, we are guaranteed that inuse is 0.
>>> >> > I sent patches that do this for virtio net and block.
>>> >>
>>> >> Could you give me the link of your patches? I'd like to test
>>> >> whether they work with Kemari upon failover. If they do, I'm
>>> >> happy to drop this patch.
>>> >>
>>> >> Yoshi
>>> >
>>> > Look for this:
>>> > stable migration image on a stopped vm
>>> > sent on:
>>> > Wed, 24 Nov 2010 17:52:49 +0200
>>>
>>> Thanks for the info.
>>>
>>> However, The patch series above didn't solve the issue. In
>>> case of Kemari, inuse is mostly > 0 because it queues the
>>> output, and while last_avail_idx gets incremented
>>> immediately, not sending inuse makes the state inconsistent
>>> between Primary and Secondary.
>>
>> Hmm. Can we simply avoid incrementing last_avail_idx?
>
> I think we can calculate or prepare an internal last_avail_idx,
> and update the external when inuse is decremented. I'll try
> whether it work w/ w/o Kemari.
Hi Michael,
Could you please take a look at the following patch?
commit 36ee7910059e6b236fe9467a609f5b4aed866912
Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
Date: Thu Dec 16 14:50:54 2010 +0900
virtio: update last_avail_idx when inuse is decreased.
Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
diff --git a/hw/virtio.c b/hw/virtio.c
index c8a0fc6..6688c02 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
     wmb();
     trace_virtqueue_flush(vq, count);
     vring_used_idx_increment(vq, count);
+    vq->last_avail_idx += count;
     vq->inuse -= count;
 }

@@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
     unsigned int i, head, max;
     target_phys_addr_t desc_pa = vq->vring.desc;

-    if (!virtqueue_num_heads(vq, vq->last_avail_idx))
+    if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
         return 0;

     /* When we start there are none of either input nor output. */
@@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)

     max = vq->vring.num;

-    i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
+    i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);

     if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
         if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
>
>>
>>> I'm wondering why
>>> last_avail_idx is OK to send but not inuse.
>>
>> last_avail_idx is at some level a mistake, it exposes part of
>> our internal implementation, but it does *also* express
>> a guest observable state.
>>
>> Here's the problem that it solves: just looking at the rings in virtio
>> there is no way to detect that a specific request has already been
>> completed. And the protocol forbids completing the same request twice.
>>
>> Our implementation always starts processing the requests
>> in order, and since we flush outstanding requests
>> before save, it works to just tell the remote 'process only requests
>> after this place'.
>>
>> But there's no such requirement in the virtio protocol,
>> so to be really generic we could add a bitmask of valid avail
>> ring entries that did not complete yet. This would be
>> the exact representation of the guest observable state.
>> In practice we have rings of up to 512 entries.
>> That's 64 byte per ring, not a lot at all.
>>
>> However, if we ever do change the protocol to send the bitmask,
>> we would need some code to resubmit requests
>> out of order, so it's not trivial.
>>
>> Another minor mistake with last_avail_idx is that it has
>> some redundancy: the high bits in the index
>> (> vq size) are not necessary as they can be
>> got from avail idx. There's a consistency check
>> in load but we really should try to use formats
>> that are always consistent.
>>
>>> The following patch does the same thing as original, yet
>>> keeps the format of the virtio. It shouldn't break live
>>> migration either because inuse should be 0.
>>>
>>> Yoshi
>>
>> Question is, can you flush to make inuse 0 in kemari too?
>> And if not, how do you handle the fact that some requests
>> are in flight on the primary?
>
> Although we try flushing requests one by one making inuse 0,
> there are cases when it failovers to the secondary when inuse
> isn't 0. We handle these in flight request on the primary by
> replaying on the secondary.
>
>>
>>> diff --git a/hw/virtio.c b/hw/virtio.c
>>> index c8a0fc6..875c7ca 100644
>>> --- a/hw/virtio.c
>>> +++ b/hw/virtio.c
>>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>>> qemu_put_be32(f, i);
>>>
>>> for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
>>> + uint16_t last_avail_idx;
>>> +
>>> if (vdev->vq[i].vring.num == 0)
>>> break;
>>>
>>> + last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
>>> +
>>> qemu_put_be32(f, vdev->vq[i].vring.num);
>>> qemu_put_be64(f, vdev->vq[i].pa);
>>> - qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>>> + qemu_put_be16s(f, &last_avail_idx);
>>> if (vdev->binding->save_queue)
>>> vdev->binding->save_queue(vdev->binding_opaque, i, f);
>>> }
>>>
>>>
>>
>> This looks wrong to me. Requests can complete in any order, can they
>> not? So if request 0 did not complete and request 1 did not,
>> you send avail - inuse and on the secondary you will process and
>> complete request 1 the second time, crashing the guest.
>
> In case of Kemari, no. We sit between devices and net/block, and
> queue the requests. After completing each transaction, we flush
> the requests one by one. So there won't be completion inversion,
> and therefore won't be visible to the guest.
>
> Yoshi
>
>>
>>>
>>> >
>>> >> >
>>> >> >> ---
>>> >> >> hw/virtio.c | 8 +++++++-
>>> >> >> 1 files changed, 7 insertions(+), 1 deletions(-)
>>> >> >>
>>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
>>> >> >> index 849a60f..5509644 100644
>>> >> >> --- a/hw/virtio.c
>>> >> >> +++ b/hw/virtio.c
>>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
>>> >> >> VRing vring;
>>> >> >> target_phys_addr_t pa;
>>> >> >> uint16_t last_avail_idx;
>>> >> >> - int inuse;
>>> >> >> + uint16_t inuse;
>>> >> >> uint16_t vector;
>>> >> >> void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
>>> >> >> VirtIODevice *vdev;
>>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>>> >> >> qemu_put_be32(f, vdev->vq[i].vring.num);
>>> >> >> qemu_put_be64(f, vdev->vq[i].pa);
>>> >> >> qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>>> >> >> + qemu_put_be16s(f, &vdev->vq[i].inuse);
>>> >> >> if (vdev->binding->save_queue)
>>> >> >> vdev->binding->save_queue(vdev->binding_opaque, i, f);
>>> >> >> }
>>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
>>> >> >> vdev->vq[i].vring.num = qemu_get_be32(f);
>>> >> >> vdev->vq[i].pa = qemu_get_be64(f);
>>> >> >> qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
>>> >> >> + qemu_get_be16s(f, &vdev->vq[i].inuse);
>>> >> >> +
>>> >> >> + /* revert last_avail_idx if there are outstanding emulation. */
>>> >> >> + vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
>>> >> >> + vdev->vq[i].inuse = 0;
>>> >> >>
>>> >> >> if (vdev->vq[i].pa) {
>>> >> >> virtqueue_init(&vdev->vq[i]);
>>> >> >> --
>>> >> >> 1.7.1.2
>>> >> >>
>>
>
^ permalink raw reply related [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 11/21] ioport: insert event_tap_ioport() to ioport_write().
2010-11-28 12:00 ` Yoshiaki Tamura
@ 2010-12-16 7:37 ` Yoshiaki Tamura
2010-12-16 9:22 ` Michael S. Tsirkin
0 siblings, 1 reply; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-12-16 7:37 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, mtosatti, qemu-devel,
vatsa, avi, psuriset, stefanha
2010/11/28 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>> On Thu, Nov 25, 2010 at 03:06:50PM +0900, Yoshiaki Tamura wrote:
>>> Record ioport event to replay it upon failover.
>>>
>>> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>>
>> Interesting. This will have to be extended to support ioeventfd.
>> Since each eventfd is really just a binary trigger
>> it should be enough to read out the fd state.
>
> Haven't thought about eventfd yet. Will try doing it in the next
> spin.
Hi Michael,
I looked into eventfd and realized it's only used with vhost now. However, I
believe vhost bypasses the net layer in qemu, so there is no way for Kemari to
detect the outputs. To me, it doesn't make sense to extend this patch to
support eventfd...
Thanks,
Yoshi
>
> Yoshi
>
>>
>>> ---
>>> ioport.c | 2 ++
>>> 1 files changed, 2 insertions(+), 0 deletions(-)
>>>
>>> diff --git a/ioport.c b/ioport.c
>>> index aa4188a..74aebf5 100644
>>> --- a/ioport.c
>>> +++ b/ioport.c
>>> @@ -27,6 +27,7 @@
>>>
>>> #include "ioport.h"
>>> #include "trace.h"
>>> +#include "event-tap.h"
>>>
>>> /***********************************************************/
>>> /* IO Port */
>>> @@ -76,6 +77,7 @@ static void ioport_write(int index, uint32_t address, uint32_t data)
>>> default_ioport_writel
>>> };
>>> IOPortWriteFunc *func = ioport_write_table[index][address];
>>> + event_tap_ioport(index, address, data);
>>> if (!func)
>>> func = default_func[index];
>>> func(ioport_opaque[address], address, data);
>>> --
>>> 1.7.1.2
>>>
>>
>
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 11/21] ioport: insert event_tap_ioport() to ioport_write().
2010-12-16 7:37 ` Yoshiaki Tamura
@ 2010-12-16 9:22 ` Michael S. Tsirkin
2010-12-16 9:50 ` Yoshiaki Tamura
0 siblings, 1 reply; 112+ messages in thread
From: Michael S. Tsirkin @ 2010-12-16 9:22 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, mtosatti, qemu-devel,
vatsa, avi, psuriset, stefanha
On Thu, Dec 16, 2010 at 04:37:41PM +0900, Yoshiaki Tamura wrote:
> 2010/11/28 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
> > 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> On Thu, Nov 25, 2010 at 03:06:50PM +0900, Yoshiaki Tamura wrote:
> >>> Record ioport event to replay it upon failover.
> >>>
> >>> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >>
> >> Interesting. This will have to be extended to support ioeventfd.
> >> Since each eventfd is really just a binary trigger
> >> it should be enough to read out the fd state.
> >
> > Haven't thought about eventfd yet. Will try doing it in the next
> > spin.
>
> Hi Michael,
>
> I looked into eventfd and realized it's only used with vhost now.
There are patches on list to use it for block/userspace net.
> However, I
> believe vhost bypass the net layer in qemu, and there is no way for Kemari to
> detect the outputs. To me, it doesn't make sense to extend this patch to
> support eventfd...
>
> Thanks,
>
> Yoshi
>
> >
> > Yoshi
> >
> >>
> >>> ---
> >>> ioport.c | 2 ++
> >>> 1 files changed, 2 insertions(+), 0 deletions(-)
> >>>
> >>> diff --git a/ioport.c b/ioport.c
> >>> index aa4188a..74aebf5 100644
> >>> --- a/ioport.c
> >>> +++ b/ioport.c
> >>> @@ -27,6 +27,7 @@
> >>>
> >>> #include "ioport.h"
> >>> #include "trace.h"
> >>> +#include "event-tap.h"
> >>>
> >>> /***********************************************************/
> >>> /* IO Port */
> >>> @@ -76,6 +77,7 @@ static void ioport_write(int index, uint32_t address, uint32_t data)
> >>> default_ioport_writel
> >>> };
> >>> IOPortWriteFunc *func = ioport_write_table[index][address];
> >>> + event_tap_ioport(index, address, data);
> >>> if (!func)
> >>> func = default_func[index];
> >>> func(ioport_opaque[address], address, data);
> >>> --
> >>> 1.7.1.2
> >>>
> >>
> >
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 11/21] ioport: insert event_tap_ioport() to ioport_write().
2010-12-16 9:22 ` Michael S. Tsirkin
@ 2010-12-16 9:50 ` Yoshiaki Tamura
2010-12-16 9:54 ` Michael S. Tsirkin
2010-12-16 16:27 ` Stefan Hajnoczi
0 siblings, 2 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-12-16 9:50 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, mtosatti, qemu-devel,
vatsa, avi, psuriset, stefanha
2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
> On Thu, Dec 16, 2010 at 04:37:41PM +0900, Yoshiaki Tamura wrote:
>> 2010/11/28 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
>> > 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>> >> On Thu, Nov 25, 2010 at 03:06:50PM +0900, Yoshiaki Tamura wrote:
>> >>> Record ioport event to replay it upon failover.
>> >>>
>> >>> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >>
>> >> Interesting. This will have to be extended to support ioeventfd.
>> >> Since each eventfd is really just a binary trigger
>> >> it should be enough to read out the fd state.
>> >
>> > Haven't thought about eventfd yet. Will try doing it in the next
>> > spin.
>>
>> Hi Michael,
>>
>> I looked into eventfd and realized it's only used with vhost now.
>
> There are patches on list to use it for block/userspace net.
Thanks. Now I understand.
In that case, would inserting an event-tap function into the following
code be appropriate?
int event_notifier_test_and_clear(EventNotifier *e)
{
    uint64_t value;
    int r = read(e->fd, &value, sizeof(value));
    return r == sizeof(value);
}
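So the hook would go roughly here; event_tap_ioeventfd() is a hypothetical
proxy name, mirroring event_tap_ioport():

int event_notifier_test_and_clear(EventNotifier *e)
{
    uint64_t value;
    int r = read(e->fd, &value, sizeof(value));

    if (r == sizeof(value)) {
        event_tap_ioeventfd(e);    /* record the kick for replay on failover */
    }
    return r == sizeof(value);
}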
>
>> However, I
>> believe vhost bypass the net layer in qemu, and there is no way for Kemari to
>> detect the outputs. To me, it doesn't make sense to extend this patch to
>> support eventfd...
>>
>> Thanks,
>>
>> Yoshi
>>
>> >
>> > Yoshi
>> >
>> >>
>> >>> ---
>> >>> ioport.c | 2 ++
>> >>> 1 files changed, 2 insertions(+), 0 deletions(-)
>> >>>
>> >>> diff --git a/ioport.c b/ioport.c
>> >>> index aa4188a..74aebf5 100644
>> >>> --- a/ioport.c
>> >>> +++ b/ioport.c
>> >>> @@ -27,6 +27,7 @@
>> >>>
>> >>> #include "ioport.h"
>> >>> #include "trace.h"
>> >>> +#include "event-tap.h"
>> >>>
>> >>> /***********************************************************/
>> >>> /* IO Port */
>> >>> @@ -76,6 +77,7 @@ static void ioport_write(int index, uint32_t address, uint32_t data)
>> >>> default_ioport_writel
>> >>> };
>> >>> IOPortWriteFunc *func = ioport_write_table[index][address];
>> >>> + event_tap_ioport(index, address, data);
>> >>> if (!func)
>> >>> func = default_func[index];
>> >>> func(ioport_opaque[address], address, data);
>> >>> --
>> >>> 1.7.1.2
>> >>>
>
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse varialble.
2010-12-16 7:36 ` Yoshiaki Tamura
@ 2010-12-16 9:51 ` Michael S. Tsirkin
2010-12-16 14:28 ` Yoshiaki Tamura
0 siblings, 1 reply; 112+ messages in thread
From: Michael S. Tsirkin @ 2010-12-16 9:51 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, Marcelo Tosatti,
qemu-devel, vatsa, avi, psuriset, stefanha
On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
> > 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
> >>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
> >>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
> >>> >> >>
> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >>> >> >
> >>> >> > This changes migration format, so it will break compatibility with
> >>> >> > existing drivers. More generally, I think migrating internal
> >>> >> > state that is not guest visible is always a mistake
> >>> >> > as it ties migration format to an internal implementation
> >>> >> > (yes, I know we do this sometimes, but we should at least
> >>> >> > try not to add such cases). I think the right thing to do in this case
> >>> >> > is to flush outstanding
> >>> >> > work when vm is stopped. Then, we are guaranteed that inuse is 0.
> >>> >> > I sent patches that do this for virtio net and block.
> >>> >>
> >>> >> Could you give me the link of your patches? I'd like to test
> >>> >> whether they work with Kemari upon failover. If they do, I'm
> >>> >> happy to drop this patch.
> >>> >>
> >>> >> Yoshi
> >>> >
> >>> > Look for this:
> >>> > stable migration image on a stopped vm
> >>> > sent on:
> >>> > Wed, 24 Nov 2010 17:52:49 +0200
> >>>
> >>> Thanks for the info.
> >>>
> >>> However, The patch series above didn't solve the issue. In
> >>> case of Kemari, inuse is mostly > 0 because it queues the
> >>> output, and while last_avail_idx gets incremented
> >>> immediately, not sending inuse makes the state inconsistent
> >>> between Primary and Secondary.
> >>
> >> Hmm. Can we simply avoid incrementing last_avail_idx?
> >
> > I think we can calculate or prepare an internal last_avail_idx,
> > and update the external when inuse is decremented. I'll try
> > whether it work w/ w/o Kemari.
>
> Hi Michael,
>
> Could you please take a look at the following patch?
Which version is this against?
> commit 36ee7910059e6b236fe9467a609f5b4aed866912
> Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> Date: Thu Dec 16 14:50:54 2010 +0900
>
> virtio: update last_avail_idx when inuse is decreased.
>
> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
It would be better to have a commit description explaining why a change
is made, and why it is correct, not just repeating what can be seen from
the diff anyway.
> diff --git a/hw/virtio.c b/hw/virtio.c
> index c8a0fc6..6688c02 100644
> --- a/hw/virtio.c
> +++ b/hw/virtio.c
> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
> wmb();
> trace_virtqueue_flush(vq, count);
> vring_used_idx_increment(vq, count);
> + vq->last_avail_idx += count;
> vq->inuse -= count;
> }
>
> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> unsigned int i, head, max;
> target_phys_addr_t desc_pa = vq->vring.desc;
>
> - if (!virtqueue_num_heads(vq, vq->last_avail_idx))
> + if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
> return 0;
>
> /* When we start there are none of either input nor output. */
> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>
> max = vq->vring.num;
>
> - i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
> + i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
>
> if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
> if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
>
Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
Previous patch version sure looked simpler, and this seems functionally
equivalent, so my question still stands: here it is rephrased in a
different way:
assume that we have in avail ring 2 requests at start of ring: A and B in this order
host pops A, then B, then completes B and flushes
now with this patch last_avail_idx will be 1, and then
remote will get it, it will execute B again. As a result
B will complete twice, and apparently A will never complete.
This is what I was saying below: assuming that there are
outstanding requests when we migrate, there is no way
a single index can be enough to figure out which requests
need to be handled and which are in flight already.
We must add some kind of bitmask to tell us which is which.
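To make that concrete, such a bitmask could be carried next to
last_avail_idx, roughly like the sketch below (the struct and helper
names are made up for illustration; this is not a proposed migration
format):
/* Sketch only: remember which popped avail-ring entries have not
 * completed yet, so the destination can resubmit exactly those.
 * VQ_MAX_SIZE and the names below are assumptions, not existing code. */
#define VQ_MAX_SIZE 512
typedef struct VirtQueueInflight {
    uint16_t last_avail_idx;              /* as saved today */
    uint8_t  inflight[VQ_MAX_SIZE / 8];   /* bit n set: avail entry n popped, not completed */
} VirtQueueInflight;
static void save_vq_inflight(QEMUFile *f, VirtQueueInflight *s)
{
    qemu_put_be16s(f, &s->last_avail_idx);
    qemu_put_buffer(f, s->inflight, sizeof(s->inflight));
}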
> >
> >>
> >>> I'm wondering why
> >>> last_avail_idx is OK to send but not inuse.
> >>
> >> last_avail_idx is at some level a mistake, it exposes part of
> >> our internal implementation, but it does *also* express
> >> a guest observable state.
> >>
> >> Here's the problem that it solves: just looking at the rings in virtio
> >> there is no way to detect that a specific request has already been
> >> completed. And the protocol forbids completing the same request twice.
> >>
> >> Our implementation always starts processing the requests
> >> in order, and since we flush outstanding requests
> >> before save, it works to just tell the remote 'process only requests
> >> after this place'.
> >>
> >> But there's no such requirement in the virtio protocol,
> >> so to be really generic we could add a bitmask of valid avail
> >> ring entries that did not complete yet. This would be
> >> the exact representation of the guest observable state.
> >> In practice we have rings of up to 512 entries.
> >> That's 64 byte per ring, not a lot at all.
> >>
> >> However, if we ever do change the protocol to send the bitmask,
> >> we would need some code to resubmit requests
> >> out of order, so it's not trivial.
> >>
> >> Another minor mistake with last_avail_idx is that it has
> >> some redundancy: the high bits in the index
> >> (> vq size) are not necessary as they can be
> >> got from avail idx. There's a consistency check
> >> in load but we really should try to use formats
> >> that are always consistent.
> >>
> >>> The following patch does the same thing as original, yet
> >>> keeps the format of the virtio. It shouldn't break live
> >>> migration either because inuse should be 0.
> >>>
> >>> Yoshi
> >>
> >> Question is, can you flush to make inuse 0 in kemari too?
> >> And if not, how do you handle the fact that some requests
> >> are in flight on the primary?
> >
> > Although we try flushing requests one by one making inuse 0,
> > there are cases when it fails over to the secondary when inuse
> > isn't 0. We handle these in-flight requests on the primary by
> > replaying on the secondary.
> >
> >>
> >>> diff --git a/hw/virtio.c b/hw/virtio.c
> >>> index c8a0fc6..875c7ca 100644
> >>> --- a/hw/virtio.c
> >>> +++ b/hw/virtio.c
> >>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >>> qemu_put_be32(f, i);
> >>>
> >>> for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
> >>> + uint16_t last_avail_idx;
> >>> +
> >>> if (vdev->vq[i].vring.num == 0)
> >>> break;
> >>>
> >>> + last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
> >>> +
> >>> qemu_put_be32(f, vdev->vq[i].vring.num);
> >>> qemu_put_be64(f, vdev->vq[i].pa);
> >>> - qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >>> + qemu_put_be16s(f, &last_avail_idx);
> >>> if (vdev->binding->save_queue)
> >>> vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >>> }
> >>>
> >>>
> >>
> >> This looks wrong to me. Requests can complete in any order, can they
> >> not? So if request 0 did not complete and request 1 did not,
> >> you send avail - inuse and on the secondary you will process and
> >> complete request 1 the second time, crashing the guest.
> >
> > In case of Kemari, no. We sit between devices and net/block, and
> > queue the requests. After completing each transaction, we flush
> > the requests one by one. So there won't be completion inversion,
> > and therefore won't be visible to the guest.
> >
> > Yoshi
> >
> >>
> >>>
> >>> >
> >>> >> >
> >>> >> >> ---
> >>> >> >> hw/virtio.c | 8 +++++++-
> >>> >> >> 1 files changed, 7 insertions(+), 1 deletions(-)
> >>> >> >>
> >>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >>> >> >> index 849a60f..5509644 100644
> >>> >> >> --- a/hw/virtio.c
> >>> >> >> +++ b/hw/virtio.c
> >>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
> >>> >> >> VRing vring;
> >>> >> >> target_phys_addr_t pa;
> >>> >> >> uint16_t last_avail_idx;
> >>> >> >> - int inuse;
> >>> >> >> + uint16_t inuse;
> >>> >> >> uint16_t vector;
> >>> >> >> void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
> >>> >> >> VirtIODevice *vdev;
> >>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >>> >> >> qemu_put_be32(f, vdev->vq[i].vring.num);
> >>> >> >> qemu_put_be64(f, vdev->vq[i].pa);
> >>> >> >> qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >>> >> >> + qemu_put_be16s(f, &vdev->vq[i].inuse);
> >>> >> >> if (vdev->binding->save_queue)
> >>> >> >> vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >>> >> >> }
> >>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
> >>> >> >> vdev->vq[i].vring.num = qemu_get_be32(f);
> >>> >> >> vdev->vq[i].pa = qemu_get_be64(f);
> >>> >> >> qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
> >>> >> >> + qemu_get_be16s(f, &vdev->vq[i].inuse);
> >>> >> >> +
> >>> >> >> + /* revert last_avail_idx if there are outstanding emulation. */
> >>> >> >> + vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
> >>> >> >> + vdev->vq[i].inuse = 0;
> >>> >> >>
> >>> >> >> if (vdev->vq[i].pa) {
> >>> >> >> virtqueue_init(&vdev->vq[i]);
> >>> >> >> --
> >>> >> >> 1.7.1.2
> >>> >> >>
* [Qemu-devel] Re: [PATCH 11/21] ioport: insert event_tap_ioport() to ioport_write().
2010-12-16 9:50 ` Yoshiaki Tamura
@ 2010-12-16 9:54 ` Michael S. Tsirkin
2010-12-16 16:27 ` Stefan Hajnoczi
1 sibling, 0 replies; 112+ messages in thread
From: Michael S. Tsirkin @ 2010-12-16 9:54 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, mtosatti, qemu-devel,
vatsa, avi, psuriset, stefanha
On Thu, Dec 16, 2010 at 06:50:04PM +0900, Yoshiaki Tamura wrote:
> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
> > On Thu, Dec 16, 2010 at 04:37:41PM +0900, Yoshiaki Tamura wrote:
> >> 2010/11/28 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
> >> > 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> >> On Thu, Nov 25, 2010 at 03:06:50PM +0900, Yoshiaki Tamura wrote:
> >> >>> Record ioport event to replay it upon failover.
> >> >>>
> >> >>> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >>
> >> >> Interesting. This will have to be extended to support ioeventfd.
> >> >> Since each eventfd is really just a binary trigger
> >> >> it should be enough to read out the fd state.
> >> >
> >> > Haven't thought about eventfd yet. Will try doing it in the next
> >> > spin.
> >>
> >> Hi Michael,
> >>
> >> I looked into eventfd and realized it's only used with vhost now.
> >
> > There are patches on list to use it for block/userspace net.
>
> Thanks. Now I understand.
> In that case, inserting an event-tap function into the following code
> should be appropriate?
>
> int event_notifier_test_and_clear(EventNotifier *e)
> {
> uint64_t value;
> int r = read(e->fd, &value, sizeof(value));
> return r == sizeof(value);
> }
Possibly.
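For illustration, the hook could sit right after a successful read,
along these lines (event_tap_notifier() is a placeholder name, not an
existing function):
int event_notifier_test_and_clear(EventNotifier *e)
{
    uint64_t value;
    int r = read(e->fd, &value, sizeof(value));
    if (r == sizeof(value)) {
        /* placeholder: record the kick so event-tap can replay it on failover */
        event_tap_notifier(e);
    }
    return r == sizeof(value);
}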
> >
> >> However, I
> >> believe vhost bypass the net layer in qemu, and there is no way for Kemari to
> >> detect the outputs.
Then maybe you should check for this combination and either disable
vhost-net on the backend when Kemari is active, or fail.
> >> To me, it doesn't make sense to extend this patch to
> >> support eventfd...
> >> Thanks,
> >>
> >> Yoshi
> >>
> >> >
> >> > Yoshi
> >> >
> >> >>
> >> >>> ---
> >> >>> ioport.c | 2 ++
> >> >>> 1 files changed, 2 insertions(+), 0 deletions(-)
> >> >>>
> >> >>> diff --git a/ioport.c b/ioport.c
> >> >>> index aa4188a..74aebf5 100644
> >> >>> --- a/ioport.c
> >> >>> +++ b/ioport.c
> >> >>> @@ -27,6 +27,7 @@
> >> >>>
> >> >>> #include "ioport.h"
> >> >>> #include "trace.h"
> >> >>> +#include "event-tap.h"
> >> >>>
> >> >>> /***********************************************************/
> >> >>> /* IO Port */
> >> >>> @@ -76,6 +77,7 @@ static void ioport_write(int index, uint32_t address, uint32_t data)
> >> >>> default_ioport_writel
> >> >>> };
> >> >>> IOPortWriteFunc *func = ioport_write_table[index][address];
> >> >>> + event_tap_ioport(index, address, data);
> >> >>> if (!func)
> >> >>> func = default_func[index];
> >> >>> func(ioport_opaque[address], address, data);
> >> >>> --
> >> >>> 1.7.1.2
> >> >>>
* [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse variable.
2010-12-16 9:51 ` Michael S. Tsirkin
@ 2010-12-16 14:28 ` Yoshiaki Tamura
2010-12-16 14:40 ` Michael S. Tsirkin
0 siblings, 1 reply; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-12-16 14:28 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, Marcelo Tosatti,
qemu-devel, vatsa, avi, psuriset, stefanha
2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
> On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
>> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
>> > 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
>> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
>> >>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
>> >>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
>> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
>> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
>> >>> >> >>
>> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >>> >> >
>> >>> >> > This changes migration format, so it will break compatibility with
>> >>> >> > existing drivers. More generally, I think migrating internal
>> >>> >> > state that is not guest visible is always a mistake
>> >>> >> > as it ties migration format to an internal implementation
>> >>> >> > (yes, I know we do this sometimes, but we should at least
>> >>> >> > try not to add such cases). I think the right thing to do in this case
>> >>> >> > is to flush outstanding
>> >>> >> > work when vm is stopped. Then, we are guaranteed that inuse is 0.
>> >>> >> > I sent patches that do this for virtio net and block.
>> >>> >>
>> >>> >> Could you give me the link of your patches? I'd like to test
>> >>> >> whether they work with Kemari upon failover. If they do, I'm
>> >>> >> happy to drop this patch.
>> >>> >>
>> >>> >> Yoshi
>> >>> >
>> >>> > Look for this:
>> >>> > stable migration image on a stopped vm
>> >>> > sent on:
>> >>> > Wed, 24 Nov 2010 17:52:49 +0200
>> >>>
>> >>> Thanks for the info.
>> >>>
>> >>> However, the patch series above didn't solve the issue. In
>> >>> case of Kemari, inuse is mostly > 0 because it queues the
>> >>> output, and while last_avail_idx gets incremented
>> >>> immediately, not sending inuse makes the state inconsistent
>> >>> between Primary and Secondary.
>> >>
>> >> Hmm. Can we simply avoid incrementing last_avail_idx?
>> >
>> > I think we can calculate or prepare an internal last_avail_idx,
>> > and update the external when inuse is decremented. I'll try
>> > whether it work w/ w/o Kemari.
>>
>> Hi Michael,
>>
>> Could you please take a look at the following patch?
>
> Which version is this against?
Oops. It should be very old.
67f895bfe69f323b427b284430b6219c8a62e8d4
>> commit 36ee7910059e6b236fe9467a609f5b4aed866912
>> Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> Date: Thu Dec 16 14:50:54 2010 +0900
>>
>> virtio: update last_avail_idx when inuse is decreased.
>>
>> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>
> It would be better to have a commit description explaining why a change
> is made, and why it is correct, not just repeating what can be seen from
> the diff anyway.
Sorry for being lazy here.
>> diff --git a/hw/virtio.c b/hw/virtio.c
>> index c8a0fc6..6688c02 100644
>> --- a/hw/virtio.c
>> +++ b/hw/virtio.c
>> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
>> wmb();
>> trace_virtqueue_flush(vq, count);
>> vring_used_idx_increment(vq, count);
>> + vq->last_avail_idx += count;
>> vq->inuse -= count;
>> }
>>
>> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>> unsigned int i, head, max;
>> target_phys_addr_t desc_pa = vq->vring.desc;
>>
>> - if (!virtqueue_num_heads(vq, vq->last_avail_idx))
>> + if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
>> return 0;
>>
>> /* When we start there are none of either input nor output. */
>> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>>
>> max = vq->vring.num;
>>
>> - i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
>> + i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
>>
>> if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
>> if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
>>
>
> Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
I think there are two problems.
1. When to update last_avail_idx.
2. The ordering issue you're mentioning below.
The patch above is only trying to address 1 because last time you
mentioned that modifying last_avail_idx upon save may break the
guest, which I agree. If virtio_queue_empty and
virtqueue_avail_bytes are only used internally, meaning invisible
to the guest, I guess the approach above can be applied too.
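For example, applying the same offset to virtio_queue_empty would look
roughly like this (just a sketch, untested):
int virtio_queue_empty(VirtQueue *vq)
{
    /* account for entries popped but not yet flushed, as in virtqueue_pop above */
    return vring_avail_idx(vq) == vq->last_avail_idx + vq->inuse;
}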
> Previous patch version sure looked simpler, and this seems functionally
> equivalent, so my question still stands: here it is rephrased in a
> different way:
>
> assume that we have in avail ring 2 requests at start of ring: A and B in this order
>
> host pops A, then B, then completes B and flushes
>
> now with this patch last_avail_idx will be 1, and then
> remote will get it, it will execute B again. As a result
> B will complete twice, and apparently A will never complete.
>
>
> This is what I was saying below: assuming that there are
> outstanding requests when we migrate, there is no way
> a single index can be enough to figure out which requests
> need to be handled and which are in flight already.
>
> We must add some kind of bitmask to tell us which is which.
I should understand why this inversion can happen before solving
the issue. Currently, how are you making virtio-net flush
every request for live migration? Is it qemu_aio_flush()?
Yoshi
>
>> >
>> >>
>> >>> I'm wondering why
>> >>> last_avail_idx is OK to send but not inuse.
>> >>
>> >> last_avail_idx is at some level a mistake, it exposes part of
>> >> our internal implementation, but it does *also* express
>> >> a guest observable state.
>> >>
>> >> Here's the problem that it solves: just looking at the rings in virtio
>> >> there is no way to detect that a specific request has already been
>> >> completed. And the protocol forbids completing the same request twice.
>> >>
>> >> Our implementation always starts processing the requests
>> >> in order, and since we flush outstanding requests
>> >> before save, it works to just tell the remote 'process only requests
>> >> after this place'.
>> >>
>> >> But there's no such requirement in the virtio protocol,
>> >> so to be really generic we could add a bitmask of valid avail
>> >> ring entries that did not complete yet. This would be
>> >> the exact representation of the guest observable state.
>> >> In practice we have rings of up to 512 entries.
>> >> That's 64 byte per ring, not a lot at all.
>> >>
>> >> However, if we ever do change the protocol to send the bitmask,
>> >> we would need some code to resubmit requests
>> >> out of order, so it's not trivial.
>> >>
>> >> Another minor mistake with last_avail_idx is that it has
>> >> some redundancy: the high bits in the index
>> >> (> vq size) are not necessary as they can be
>> >> got from avail idx. There's a consistency check
>> >> in load but we really should try to use formats
>> >> that are always consistent.
>> >>
>> >>> The following patch does the same thing as original, yet
>> >>> keeps the format of the virtio. It shouldn't break live
>> >>> migration either because inuse should be 0.
>> >>>
>> >>> Yoshi
>> >>
>> >> Question is, can you flush to make inuse 0 in kemari too?
>> >> And if not, how do you handle the fact that some requests
>> >> are in flight on the primary?
>> >
>> > Although we try flushing requests one by one making inuse 0,
>> > there are cases when it fails over to the secondary when inuse
>> > isn't 0. We handle these in-flight requests on the primary by
>> > replaying on the secondary.
>> >
>> >>
>> >>> diff --git a/hw/virtio.c b/hw/virtio.c
>> >>> index c8a0fc6..875c7ca 100644
>> >>> --- a/hw/virtio.c
>> >>> +++ b/hw/virtio.c
>> >>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>> >>> qemu_put_be32(f, i);
>> >>>
>> >>> for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
>> >>> + uint16_t last_avail_idx;
>> >>> +
>> >>> if (vdev->vq[i].vring.num == 0)
>> >>> break;
>> >>>
>> >>> + last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
>> >>> +
>> >>> qemu_put_be32(f, vdev->vq[i].vring.num);
>> >>> qemu_put_be64(f, vdev->vq[i].pa);
>> >>> - qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>> >>> + qemu_put_be16s(f, &last_avail_idx);
>> >>> if (vdev->binding->save_queue)
>> >>> vdev->binding->save_queue(vdev->binding_opaque, i, f);
>> >>> }
>> >>>
>> >>>
>> >>
>> >> This looks wrong to me. Requests can complete in any order, can they
>> >> not? So if request 0 did not complete and request 1 did not,
>> >> you send avail - inuse and on the secondary you will process and
>> >> complete request 1 the second time, crashing the guest.
>> >
>> > In case of Kemari, no. We sit between devices and net/block, and
>> > queue the requests. After completing each transaction, we flush
>> > the requests one by one. So there won't be completion inversion,
>> > and therefore won't be visible to the guest.
>> >
>> > Yoshi
>> >
>> >>
>> >>>
>> >>> >
>> >>> >> >
>> >>> >> >> ---
>> >>> >> >> hw/virtio.c | 8 +++++++-
>> >>> >> >> 1 files changed, 7 insertions(+), 1 deletions(-)
>> >>> >> >>
>> >>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
>> >>> >> >> index 849a60f..5509644 100644
>> >>> >> >> --- a/hw/virtio.c
>> >>> >> >> +++ b/hw/virtio.c
>> >>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
>> >>> >> >> VRing vring;
>> >>> >> >> target_phys_addr_t pa;
>> >>> >> >> uint16_t last_avail_idx;
>> >>> >> >> - int inuse;
>> >>> >> >> + uint16_t inuse;
>> >>> >> >> uint16_t vector;
>> >>> >> >> void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
>> >>> >> >> VirtIODevice *vdev;
>> >>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>> >>> >> >> qemu_put_be32(f, vdev->vq[i].vring.num);
>> >>> >> >> qemu_put_be64(f, vdev->vq[i].pa);
>> >>> >> >> qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>> >>> >> >> + qemu_put_be16s(f, &vdev->vq[i].inuse);
>> >>> >> >> if (vdev->binding->save_queue)
>> >>> >> >> vdev->binding->save_queue(vdev->binding_opaque, i, f);
>> >>> >> >> }
>> >>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
>> >>> >> >> vdev->vq[i].vring.num = qemu_get_be32(f);
>> >>> >> >> vdev->vq[i].pa = qemu_get_be64(f);
>> >>> >> >> qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
>> >>> >> >> + qemu_get_be16s(f, &vdev->vq[i].inuse);
>> >>> >> >> +
>> >>> >> >> + /* revert last_avail_idx if there are outstanding emulation. */
>> >>> >> >> + vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
>> >>> >> >> + vdev->vq[i].inuse = 0;
>> >>> >> >>
>> >>> >> >> if (vdev->vq[i].pa) {
>> >>> >> >> virtqueue_init(&vdev->vq[i]);
>> >>> >> >> --
>> >>> >> >> 1.7.1.2
>> >>> >> >>
* [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse variable.
2010-12-16 14:28 ` Yoshiaki Tamura
@ 2010-12-16 14:40 ` Michael S. Tsirkin
2010-12-16 15:59 ` Yoshiaki Tamura
0 siblings, 1 reply; 112+ messages in thread
From: Michael S. Tsirkin @ 2010-12-16 14:40 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, Marcelo Tosatti,
qemu-devel, vatsa, avi, psuriset, stefanha
On Thu, Dec 16, 2010 at 11:28:46PM +0900, Yoshiaki Tamura wrote:
> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
> > On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
> >> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
> >> > 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
> >> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
> >> >>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
> >> >>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
> >> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
> >> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
> >> >>> >> >>
> >> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >>> >> >
> >> >>> >> > This changes migration format, so it will break compatibility with
> >> >>> >> > existing drivers. More generally, I think migrating internal
> >> >>> >> > state that is not guest visible is always a mistake
> >> >>> >> > as it ties migration format to an internal implementation
> >> >>> >> > (yes, I know we do this sometimes, but we should at least
> >> >>> >> > try not to add such cases). I think the right thing to do in this case
> >> >>> >> > is to flush outstanding
> >> >>> >> > work when vm is stopped. Then, we are guaranteed that inuse is 0.
> >> >>> >> > I sent patches that do this for virtio net and block.
> >> >>> >>
> >> >>> >> Could you give me the link of your patches? I'd like to test
> >> >>> >> whether they work with Kemari upon failover. If they do, I'm
> >> >>> >> happy to drop this patch.
> >> >>> >>
> >> >>> >> Yoshi
> >> >>> >
> >> >>> > Look for this:
> >> >>> > stable migration image on a stopped vm
> >> >>> > sent on:
> >> >>> > Wed, 24 Nov 2010 17:52:49 +0200
> >> >>>
> >> >>> Thanks for the info.
> >> >>>
> >> >>> However, the patch series above didn't solve the issue. In
> >> >>> case of Kemari, inuse is mostly > 0 because it queues the
> >> >>> output, and while last_avail_idx gets incremented
> >> >>> immediately, not sending inuse makes the state inconsistent
> >> >>> between Primary and Secondary.
> >> >>
> >> >> Hmm. Can we simply avoid incrementing last_avail_idx?
> >> >
> >> > I think we can calculate or prepare an internal last_avail_idx,
> >> > and update the external when inuse is decremented. I'll try
> >> > whether it work w/ w/o Kemari.
> >>
> >> Hi Michael,
> >>
> >> Could you please take a look at the following patch?
> >
> > Which version is this against?
>
> Oops. It should be very old.
> 67f895bfe69f323b427b284430b6219c8a62e8d4
>
> >> commit 36ee7910059e6b236fe9467a609f5b4aed866912
> >> Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> Date: Thu Dec 16 14:50:54 2010 +0900
> >>
> >> virtio: update last_avail_idx when inuse is decreased.
> >>
> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >
> > It would be better to have a commit description explaining why a change
> > is made, and why it is correct, not just repeating what can be seen from
> > the diff anyway.
>
> Sorry for being lazy here.
>
> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> index c8a0fc6..6688c02 100644
> >> --- a/hw/virtio.c
> >> +++ b/hw/virtio.c
> >> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
> >> wmb();
> >> trace_virtqueue_flush(vq, count);
> >> vring_used_idx_increment(vq, count);
> >> + vq->last_avail_idx += count;
> >> vq->inuse -= count;
> >> }
> >>
> >> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >> unsigned int i, head, max;
> >> target_phys_addr_t desc_pa = vq->vring.desc;
> >>
> >> - if (!virtqueue_num_heads(vq, vq->last_avail_idx))
> >> + if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
> >> return 0;
> >>
> >> /* When we start there are none of either input nor output. */
> >> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >>
> >> max = vq->vring.num;
> >>
> >> - i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
> >> + i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
> >>
> >> if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
> >> if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
> >>
> >
> > Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
>
> I think there are two problems.
>
> 1. When to update last_avail_idx.
> 2. The ordering issue you're mentioning below.
>
> The patch above is only trying to address 1 because last time you
> mentioned that modifying last_avail_idx upon save may break the
> guest, which I agree. If virtio_queue_empty and
> virtqueue_avail_bytes are only used internally, meaning invisible
> to the guest, I guess the approach above can be applied too.
So IMHO 2 is the real issue. This is what was problematic
with the save patch, otherwise of course changes in save
are better than changes all over the codebase.
> > Previous patch version sure looked simpler, and this seems functionally
> > equivalent, so my question still stands: here it is rephrased in a
> > different way:
> >
> > assume that we have in avail ring 2 requests at start of ring: A and B in this order
> >
> > host pops A, then B, then completes B and flushes
> >
> > now with this patch last_avail_idx will be 1, and then
> > remote will get it, it will execute B again. As a result
> > B will complete twice, and apparently A will never complete.
> >
> >
> > This is what I was saying below: assuming that there are
> > outstanding requests when we migrate, there is no way
> > a single index can be enough to figure out which requests
> > need to be handled and which are in flight already.
> >
> > We must add some kind of bitmask to tell us which is which.
>
> I should understand why this inversion can happen before solving
> the issue.
It's a fundamental thing in virtio.
I think it is currently only likely to happen with block; tap
currently completes things in order. In any case, relying on this in the
frontend is a mistake.
> Currently, how are you making virtio-net flush
> every request for live migration? Is it qemu_aio_flush()?
>
> Yoshi
Think so.
> >
> >> >
> >> >>
> >> >>> I'm wondering why
> >> >>> last_avail_idx is OK to send but not inuse.
> >> >>
> >> >> last_avail_idx is at some level a mistake, it exposes part of
> >> >> our internal implementation, but it does *also* express
> >> >> a guest observable state.
> >> >>
> >> >> Here's the problem that it solves: just looking at the rings in virtio
> >> >> there is no way to detect that a specific request has already been
> >> >> completed. And the protocol forbids completing the same request twice.
> >> >>
> >> >> Our implementation always starts processing the requests
> >> >> in order, and since we flush outstanding requests
> >> >> before save, it works to just tell the remote 'process only requests
> >> >> after this place'.
> >> >>
> >> >> But there's no such requirement in the virtio protocol,
> >> >> so to be really generic we could add a bitmask of valid avail
> >> >> ring entries that did not complete yet. This would be
> >> >> the exact representation of the guest observable state.
> >> >> In practice we have rings of up to 512 entries.
> >> >> That's 64 byte per ring, not a lot at all.
> >> >>
> >> >> However, if we ever do change the protocol to send the bitmask,
> >> >> we would need some code to resubmit requests
> >> >> out of order, so it's not trivial.
> >> >>
> >> >> Another minor mistake with last_avail_idx is that it has
> >> >> some redundancy: the high bits in the index
> >> >> (> vq size) are not necessary as they can be
> >> >> got from avail idx. There's a consistency check
> >> >> in load but we really should try to use formats
> >> >> that are always consistent.
> >> >>
> >> >>> The following patch does the same thing as original, yet
> >> >>> keeps the format of the virtio. It shouldn't break live
> >> >>> migration either because inuse should be 0.
> >> >>>
> >> >>> Yoshi
> >> >>
> >> >> Question is, can you flush to make inuse 0 in kemari too?
> >> >> And if not, how do you handle the fact that some requests
> >> >> are in flight on the primary?
> >> >
> >> > Although we try flushing requests one by one making inuse 0,
> >> > there are cases when it fails over to the secondary when inuse
> >> > isn't 0. We handle these in-flight requests on the primary by
> >> > replaying on the secondary.
> >> >
> >> >>
> >> >>> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >>> index c8a0fc6..875c7ca 100644
> >> >>> --- a/hw/virtio.c
> >> >>> +++ b/hw/virtio.c
> >> >>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >> >>> qemu_put_be32(f, i);
> >> >>>
> >> >>> for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
> >> >>> + uint16_t last_avail_idx;
> >> >>> +
> >> >>> if (vdev->vq[i].vring.num == 0)
> >> >>> break;
> >> >>>
> >> >>> + last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
> >> >>> +
> >> >>> qemu_put_be32(f, vdev->vq[i].vring.num);
> >> >>> qemu_put_be64(f, vdev->vq[i].pa);
> >> >>> - qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >>> + qemu_put_be16s(f, &last_avail_idx);
> >> >>> if (vdev->binding->save_queue)
> >> >>> vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >> >>> }
> >> >>>
> >> >>>
> >> >>
> >> >> This looks wrong to me. Requests can complete in any order, can they
> >> >> not? So if request 0 did not complete and request 1 did not,
> >> >> you send avail - inuse and on the secondary you will process and
> >> >> complete request 1 the second time, crashing the guest.
> >> >
> >> > In case of Kemari, no. We sit between devices and net/block, and
> >> > queue the requests. After completing each transaction, we flush
> >> > the requests one by one. So there won't be completion inversion,
> >> > and therefore won't be visible to the guest.
> >> >
> >> > Yoshi
> >> >
> >> >>
> >> >>>
> >> >>> >
> >> >>> >> >
> >> >>> >> >> ---
> >> >>> >> >> hw/virtio.c | 8 +++++++-
> >> >>> >> >> 1 files changed, 7 insertions(+), 1 deletions(-)
> >> >>> >> >>
> >> >>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >>> >> >> index 849a60f..5509644 100644
> >> >>> >> >> --- a/hw/virtio.c
> >> >>> >> >> +++ b/hw/virtio.c
> >> >>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
> >> >>> >> >> VRing vring;
> >> >>> >> >> target_phys_addr_t pa;
> >> >>> >> >> uint16_t last_avail_idx;
> >> >>> >> >> - int inuse;
> >> >>> >> >> + uint16_t inuse;
> >> >>> >> >> uint16_t vector;
> >> >>> >> >> void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
> >> >>> >> >> VirtIODevice *vdev;
> >> >>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >> >>> >> >> qemu_put_be32(f, vdev->vq[i].vring.num);
> >> >>> >> >> qemu_put_be64(f, vdev->vq[i].pa);
> >> >>> >> >> qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >>> >> >> + qemu_put_be16s(f, &vdev->vq[i].inuse);
> >> >>> >> >> if (vdev->binding->save_queue)
> >> >>> >> >> vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >> >>> >> >> }
> >> >>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
> >> >>> >> >> vdev->vq[i].vring.num = qemu_get_be32(f);
> >> >>> >> >> vdev->vq[i].pa = qemu_get_be64(f);
> >> >>> >> >> qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >>> >> >> + qemu_get_be16s(f, &vdev->vq[i].inuse);
> >> >>> >> >> +
> >> >>> >> >> + /* revert last_avail_idx if there are outstanding emulation. */
> >> >>> >> >> + vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
> >> >>> >> >> + vdev->vq[i].inuse = 0;
> >> >>> >> >>
> >> >>> >> >> if (vdev->vq[i].pa) {
> >> >>> >> >> virtqueue_init(&vdev->vq[i]);
> >> >>> >> >> --
> >> >>> >> >> 1.7.1.2
> >> >>> >> >>
* [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse variable.
2010-12-16 14:40 ` Michael S. Tsirkin
@ 2010-12-16 15:59 ` Yoshiaki Tamura
2010-12-17 16:22 ` Yoshiaki Tamura
2010-12-24 9:27 ` Michael S. Tsirkin
0 siblings, 2 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-12-16 15:59 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, Marcelo Tosatti,
qemu-devel, vatsa, avi, psuriset, stefanha
2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
> On Thu, Dec 16, 2010 at 11:28:46PM +0900, Yoshiaki Tamura wrote:
>> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
>> > On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
>> >> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
>> >> > 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
>> >> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
>> >> >>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>> >> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
>> >> >>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>> >> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
>> >> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
>> >> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
>> >> >>> >> >>
>> >> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >> >>> >> >
>> >> >>> >> > This changes migration format, so it will break compatibility with
>> >> >>> >> > existing drivers. More generally, I think migrating internal
>> >> >>> >> > state that is not guest visible is always a mistake
>> >> >>> >> > as it ties migration format to an internal implementation
>> >> >>> >> > (yes, I know we do this sometimes, but we should at least
>> >> >>> >> > try not to add such cases). I think the right thing to do in this case
>> >> >>> >> > is to flush outstanding
>> >> >>> >> > work when vm is stopped. Then, we are guaranteed that inuse is 0.
>> >> >>> >> > I sent patches that do this for virtio net and block.
>> >> >>> >>
>> >> >>> >> Could you give me the link of your patches? I'd like to test
>> >> >>> >> whether they work with Kemari upon failover. If they do, I'm
>> >> >>> >> happy to drop this patch.
>> >> >>> >>
>> >> >>> >> Yoshi
>> >> >>> >
>> >> >>> > Look for this:
>> >> >>> > stable migration image on a stopped vm
>> >> >>> > sent on:
>> >> >>> > Wed, 24 Nov 2010 17:52:49 +0200
>> >> >>>
>> >> >>> Thanks for the info.
>> >> >>>
>> >> >>> However, the patch series above didn't solve the issue. In
>> >> >>> case of Kemari, inuse is mostly > 0 because it queues the
>> >> >>> output, and while last_avail_idx gets incremented
>> >> >>> immediately, not sending inuse makes the state inconsistent
>> >> >>> between Primary and Secondary.
>> >> >>
>> >> >> Hmm. Can we simply avoid incrementing last_avail_idx?
>> >> >
>> >> > I think we can calculate or prepare an internal last_avail_idx,
>> >> > and update the external when inuse is decremented. I'll try
>> >> > whether it work w/ w/o Kemari.
>> >>
>> >> Hi Michael,
>> >>
>> >> Could you please take a look at the following patch?
>> >
>> > Which version is this against?
>>
>> Oops. It should be very old.
>> 67f895bfe69f323b427b284430b6219c8a62e8d4
>>
>> >> commit 36ee7910059e6b236fe9467a609f5b4aed866912
>> >> Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >> Date: Thu Dec 16 14:50:54 2010 +0900
>> >>
>> >> virtio: update last_avail_idx when inuse is decreased.
>> >>
>> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >
>> > It would be better to have a commit description explaining why a change
>> > is made, and why it is correct, not just repeating what can be seen from
>> > the diff anyway.
>>
>> Sorry for being lazy here.
>>
>> >> diff --git a/hw/virtio.c b/hw/virtio.c
>> >> index c8a0fc6..6688c02 100644
>> >> --- a/hw/virtio.c
>> >> +++ b/hw/virtio.c
>> >> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
>> >> wmb();
>> >> trace_virtqueue_flush(vq, count);
>> >> vring_used_idx_increment(vq, count);
>> >> + vq->last_avail_idx += count;
>> >> vq->inuse -= count;
>> >> }
>> >>
>> >> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>> >> unsigned int i, head, max;
>> >> target_phys_addr_t desc_pa = vq->vring.desc;
>> >>
>> >> - if (!virtqueue_num_heads(vq, vq->last_avail_idx))
>> >> + if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
>> >> return 0;
>> >>
>> >> /* When we start there are none of either input nor output. */
>> >> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>> >>
>> >> max = vq->vring.num;
>> >>
>> >> - i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
>> >> + i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
>> >>
>> >> if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
>> >> if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
>> >>
>> >
>> > Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
>>
>> I think there are two problems.
>>
>> 1. When to update last_avail_idx.
>> 2. The ordering issue you're mentioning below.
>>
>> The patch above is only trying to address 1 because last time you
>> mentioned that modifying last_avail_idx upon save may break the
>> guest, which I agree. If virtio_queue_empty and
>> virtqueue_avail_bytes are only used internally, meaning invisible
>> to the guest, I guess the approach above can be applied too.
>
> So IMHO 2 is the real issue. This is what was problematic
> with the save patch, otherwise of course changes in save
> are better than changes all over the codebase.
All right. Then let's focus on 2 first.
>> > Previous patch version sure looked simpler, and this seems functionally
>> > equivalent, so my question still stands: here it is rephrased in a
>> > different way:
>> >
>> > assume that we have in avail ring 2 requests at start of ring: A and B in this order
>> >
>> > host pops A, then B, then completes B and flushes
>> >
>> > now with this patch last_avail_idx will be 1, and then
>> > remote will get it, it will execute B again. As a result
>> > B will complete twice, and apparently A will never complete.
>> >
>> >
>> > This is what I was saying below: assuming that there are
>> > outstanding requests when we migrate, there is no way
>> > a single index can be enough to figure out which requests
>> > need to be handled and which are in flight already.
>> >
>> > We must add some kind of bitmask to tell us which is which.
>>
>> I should understand why this inversion can happen before solving
>> the issue.
>
> It's a fundamental thing in virtio.
> I think it is currently only likely to happen with block; tap
> currently completes things in order. In any case, relying on this in the
> frontend is a mistake.
>
>> Currently, how are you making virtio-net flush
>> every request for live migration? Is it qemu_aio_flush()?
>
> Think so.
If qemu_aio_flush() is responsible for flushing the outstanding
virtio-net requests, I'm wondering why it's a problem for Kemari.
As I described in the previous message, Kemari queues the
requests first. So in your example above, it should start with
virtio-net: last_avail_idx 0 inuse 2
event-tap: {A,B}
As you know, the requests are still in order because the net
layer initiates them in order; this is about issuing, not completing.
In the first synchronization, the status above is transferred. In
the next synchronization, the status will be as follows.
virtio-net: last_avail_idx 1 inuse 1
event-tap: {B}
Why? Because Kemari flushes the first virtio-net request using
qemu_aio_flush() before each synchronization. If
qemu_aio_flush() doesn't guarantee the order, what you pointed
out would be problematic. So in the final synchronization, the
state should be
virtio-net: last_avail_idx 2 inuse 0
event-tap: {}
where A and B were completed in order.
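In pseudo-C, the per-synchronization flushing described above is
roughly the following (the helper names are placeholders, not the
posted event-tap API):
while (!event_tap_queue_empty()) {
    event_tap_flush_one();    /* hand the oldest queued request to net/block */
    qemu_aio_flush();         /* wait for its completion, so inuse drops by one */
    kemari_synchronize();     /* then transfer the now-consistent state */
}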
Yoshi
>
>> >
>> >> >
>> >> >>
>> >> >>> I'm wondering why
>> >> >>> last_avail_idx is OK to send but not inuse.
>> >> >>
>> >> >> last_avail_idx is at some level a mistake, it exposes part of
>> >> >> our internal implementation, but it does *also* express
>> >> >> a guest observable state.
>> >> >>
>> >> >> Here's the problem that it solves: just looking at the rings in virtio
>> >> >> there is no way to detect that a specific request has already been
>> >> >> completed. And the protocol forbids completing the same request twice.
>> >> >>
>> >> >> Our implementation always starts processing the requests
>> >> >> in order, and since we flush outstanding requests
>> >> >> before save, it works to just tell the remote 'process only requests
>> >> >> after this place'.
>> >> >>
>> >> >> But there's no such requirement in the virtio protocol,
>> >> >> so to be really generic we could add a bitmask of valid avail
>> >> >> ring entries that did not complete yet. This would be
>> >> >> the exact representation of the guest observable state.
>> >> >> In practice we have rings of up to 512 entries.
>> >> >> That's 64 byte per ring, not a lot at all.
>> >> >>
>> >> >> However, if we ever do change the protocol to send the bitmask,
>> >> >> we would need some code to resubmit requests
>> >> >> out of order, so it's not trivial.
>> >> >>
>> >> >> Another minor mistake with last_avail_idx is that it has
>> >> >> some redundancy: the high bits in the index
>> >> >> (> vq size) are not necessary as they can be
>> >> >> got from avail idx. There's a consistency check
>> >> >> in load but we really should try to use formats
>> >> >> that are always consistent.
>> >> >>
>> >> >>> The following patch does the same thing as original, yet
>> >> >>> keeps the format of the virtio. It shouldn't break live
>> >> >>> migration either because inuse should be 0.
>> >> >>>
>> >> >>> Yoshi
>> >> >>
>> >> >> Question is, can you flush to make inuse 0 in kemari too?
>> >> >> And if not, how do you handle the fact that some requests
>> >> >> are in flight on the primary?
>> >> >
>> >> > Although we try flushing requests one by one making inuse 0,
>> >> > there are cases when it fails over to the secondary when inuse
>> >> > isn't 0. We handle these in-flight requests on the primary by
>> >> > replaying on the secondary.
>> >> >
>> >> >>
>> >> >>> diff --git a/hw/virtio.c b/hw/virtio.c
>> >> >>> index c8a0fc6..875c7ca 100644
>> >> >>> --- a/hw/virtio.c
>> >> >>> +++ b/hw/virtio.c
>> >> >>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>> >> >>> qemu_put_be32(f, i);
>> >> >>>
>> >> >>> for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
>> >> >>> + uint16_t last_avail_idx;
>> >> >>> +
>> >> >>> if (vdev->vq[i].vring.num == 0)
>> >> >>> break;
>> >> >>>
>> >> >>> + last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
>> >> >>> +
>> >> >>> qemu_put_be32(f, vdev->vq[i].vring.num);
>> >> >>> qemu_put_be64(f, vdev->vq[i].pa);
>> >> >>> - qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>> >> >>> + qemu_put_be16s(f, &last_avail_idx);
>> >> >>> if (vdev->binding->save_queue)
>> >> >>> vdev->binding->save_queue(vdev->binding_opaque, i, f);
>> >> >>> }
>> >> >>>
>> >> >>>
>> >> >>
>> >> >> This looks wrong to me. Requests can complete in any order, can they
>> >> >> not? So if request 0 did not complete and request 1 did not,
>> >> >> you send avail - inuse and on the secondary you will process and
>> >> >> complete request 1 the second time, crashing the guest.
>> >> >
>> >> > In case of Kemari, no. We sit between devices and net/block, and
>> >> > queue the requests. After completing each transaction, we flush
>> >> > the requests one by one. So there won't be completion inversion,
>> >> > and therefore won't be visible to the guest.
>> >> >
>> >> > Yoshi
>> >> >
>> >> >>
>> >> >>>
>> >> >>> >
>> >> >>> >> >
>> >> >>> >> >> ---
>> >> >>> >> >> hw/virtio.c | 8 +++++++-
>> >> >>> >> >> 1 files changed, 7 insertions(+), 1 deletions(-)
>> >> >>> >> >>
>> >> >>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
>> >> >>> >> >> index 849a60f..5509644 100644
>> >> >>> >> >> --- a/hw/virtio.c
>> >> >>> >> >> +++ b/hw/virtio.c
>> >> >>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
>> >> >>> >> >> VRing vring;
>> >> >>> >> >> target_phys_addr_t pa;
>> >> >>> >> >> uint16_t last_avail_idx;
>> >> >>> >> >> - int inuse;
>> >> >>> >> >> + uint16_t inuse;
>> >> >>> >> >> uint16_t vector;
>> >> >>> >> >> void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
>> >> >>> >> >> VirtIODevice *vdev;
>> >> >>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>> >> >>> >> >> qemu_put_be32(f, vdev->vq[i].vring.num);
>> >> >>> >> >> qemu_put_be64(f, vdev->vq[i].pa);
>> >> >>> >> >> qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>> >> >>> >> >> + qemu_put_be16s(f, &vdev->vq[i].inuse);
>> >> >>> >> >> if (vdev->binding->save_queue)
>> >> >>> >> >> vdev->binding->save_queue(vdev->binding_opaque, i, f);
>> >> >>> >> >> }
>> >> >>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
>> >> >>> >> >> vdev->vq[i].vring.num = qemu_get_be32(f);
>> >> >>> >> >> vdev->vq[i].pa = qemu_get_be64(f);
>> >> >>> >> >> qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
>> >> >>> >> >> + qemu_get_be16s(f, &vdev->vq[i].inuse);
>> >> >>> >> >> +
>> >> >>> >> >> + /* revert last_avail_idx if there are outstanding emulation. */
>> >> >>> >> >> + vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
>> >> >>> >> >> + vdev->vq[i].inuse = 0;
>> >> >>> >> >>
>> >> >>> >> >> if (vdev->vq[i].pa) {
>> >> >>> >> >> virtqueue_init(&vdev->vq[i]);
>> >> >>> >> >> --
>> >> >>> >> >> 1.7.1.2
>> >> >>> >> >>
* [Qemu-devel] Re: [PATCH 11/21] ioport: insert event_tap_ioport() to ioport_write().
2010-12-16 9:50 ` Yoshiaki Tamura
2010-12-16 9:54 ` Michael S. Tsirkin
@ 2010-12-16 16:27 ` Stefan Hajnoczi
2010-12-17 16:19 ` Yoshiaki Tamura
1 sibling, 1 reply; 112+ messages in thread
From: Stefan Hajnoczi @ 2010-12-16 16:27 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: aliguori, dlaor, ananth, kvm, Michael S. Tsirkin, mtosatti,
qemu-devel, vatsa, ohmura.kei, avi, psuriset, stefanha
On Thu, Dec 16, 2010 at 9:50 AM, Yoshiaki Tamura
<tamura.yoshiaki@lab.ntt.co.jp> wrote:
> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
>> On Thu, Dec 16, 2010 at 04:37:41PM +0900, Yoshiaki Tamura wrote:
>>> 2010/11/28 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
>>> > 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>>> >> On Thu, Nov 25, 2010 at 03:06:50PM +0900, Yoshiaki Tamura wrote:
>>> >>> Record ioport event to replay it upon failover.
>>> >>>
>>> >>> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>>> >>
>>> >> Interesting. This will have to be extended to support ioeventfd.
>>> >> Since each eventfd is really just a binary trigger
>>> >> it should be enough to read out the fd state.
>>> >
>>> > Haven't thought about eventfd yet. Will try doing it in the next
>>> > spin.
>>>
>>> Hi Michael,
>>>
>>> I looked into eventfd and realized it's only used with vhost now.
>>
>> There are patches on list to use it for block/userspace net.
>
> Thanks. Now I understand.
> In that case, inserting an event-tap function into the following code
> should be appropriate?
>
> int event_notifier_test_and_clear(EventNotifier *e)
> {
> uint64_t value;
> int r = read(e->fd, &value, sizeof(value));
> return r == sizeof(value);
> }
>
>>
>>> However, I
>>> believe vhost bypass the net layer in qemu, and there is no way for Kemari to
>>> detect the outputs. To me, it doesn't make sense to extend this patch to
>>> support eventfd...
Here is the userspace ioeventfd patch series:
http://www.mail-archive.com/qemu-devel@nongnu.org/msg49208.html
Instead of switching to QEMU userspace to handle the virtqueue kick
pio write, we signal the eventfd inside the kernel and resume guest
code execution. The I/O thread can then process the virtqueue kick in
parallel to guest code execution.
I think this can still be tied into Kemari. If you are switching to a
pure net/block-layer event tap instead of pio/mmio, then I think it
should just work.
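Schematically, such a net-layer tap is just a proxy in front of the
existing send path, e.g. (illustrative names, not necessarily the
functions in the series):
static void event_tap_send_packet(VLANClientState *vc, const uint8_t *buf, int size)
{
    event_tap_queue_packet(vc, buf, size);   /* assumed: record for replay on the secondary */
    qemu_send_packet(vc, buf, size);         /* then deliver through the normal net layer */
}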
For vhost it would be more difficult to integrate with Kemari.
Stefan
* [Qemu-devel] Re: [PATCH 11/21] ioport: insert event_tap_ioport() to ioport_write().
2010-12-16 16:27 ` Stefan Hajnoczi
@ 2010-12-17 16:19 ` Yoshiaki Tamura
2010-12-18 8:36 ` Stefan Hajnoczi
0 siblings, 1 reply; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-12-17 16:19 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: aliguori, dlaor, ananth, kvm, Michael S. Tsirkin, mtosatti,
qemu-devel, vatsa, ohmura.kei, avi, psuriset, stefanha
2010/12/17 Stefan Hajnoczi <stefanha@gmail.com>:
> On Thu, Dec 16, 2010 at 9:50 AM, Yoshiaki Tamura
> <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
>>> On Thu, Dec 16, 2010 at 04:37:41PM +0900, Yoshiaki Tamura wrote:
>>>> 2010/11/28 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
>>>> > 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>>>> >> On Thu, Nov 25, 2010 at 03:06:50PM +0900, Yoshiaki Tamura wrote:
>>>> >>> Record ioport event to replay it upon failover.
>>>> >>>
>>>> >>> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>>>> >>
>>>> >> Interesting. This will have to be extended to support ioeventfd.
>>>> >> Since each eventfd is really just a binary trigger
>>>> >> it should be enough to read out the fd state.
>>>> >
>>>> > Haven't thought about eventfd yet. Will try doing it in the next
>>>> > spin.
>>>>
>>>> Hi Michael,
>>>>
>>>> I looked into eventfd and realized it's only used with vhost now.
>>>
>>> There are patches on list to use it for block/userspace net.
>>
>> Thanks. Now I understand.
>> In that case, inserting an event-tap function into the following code
>> should be appropriate?
>>
>> int event_notifier_test_and_clear(EventNotifier *e)
>> {
>> uint64_t value;
>> int r = read(e->fd, &value, sizeof(value));
>> return r == sizeof(value);
>> }
>>
>>>
>>>> However, I
>>>> believe vhost bypass the net layer in qemu, and there is no way for Kemari to
>>>> detect the outputs. To me, it doesn't make sense to extend this patch to
>>>> support eventfd...
>
> Here is the userspace ioeventfd patch series:
> http://www.mail-archive.com/qemu-devel@nongnu.org/msg49208.html
>
> Instead of switching to QEMU userspace to handle the virtqueue kick
> pio write, we signal the eventfd inside the kernel and resume guest
> code execution. The I/O thread can then process the virtqueue kick in
> parallel to guest code execution.
>
> I think this can still be tied into Kemari. If you are switching to a
> pure net/block-layer event tap instead of pio/mmio, then I think it
> should just work.
That will take a while, until we solve how to set the correct
callbacks on the secondary upon failover. BTW, do you have a
plan to move the eventfd framework to the upper layer, like
pio/mmio? Not only would Kemari work for free, other emulators
should be able to benefit from it as well.
> For vhost it would be more difficult to integrate with Kemari.
At this point, it's impossible. As Michael said, I should
prevent starting Kemari when vhost=on.
Yoshi
>
> Stefan
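As a reference for the hook idea quoted above, a minimal sketch of what
tapping the eventfd at its consumption point could look like; the body is
the event_notifier_test_and_clear() shown in the quote, and
event_tap_notifier() is a hypothetical hook, not an existing function.

int event_notifier_test_and_clear(EventNotifier *e)
{
    uint64_t value;
    int r = read(e->fd, &value, sizeof(value));

    if (r == sizeof(value)) {
        /* hypothetical hook: record the event so it can be replayed
         * on the secondary upon failover */
        event_tap_notifier(e);
    }
    return r == sizeof(value);
}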
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse variable.
2010-12-16 15:59 ` Yoshiaki Tamura
@ 2010-12-17 16:22 ` Yoshiaki Tamura
2010-12-24 9:27 ` Michael S. Tsirkin
1 sibling, 0 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-12-17 16:22 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, Marcelo Tosatti,
qemu-devel, vatsa, avi, psuriset, stefanha
2010/12/17 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
>> On Thu, Dec 16, 2010 at 11:28:46PM +0900, Yoshiaki Tamura wrote:
>>> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
>>> > On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
>>> >> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
>>> >> > 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
>>> >> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
>>> >> >>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>>> >> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
>>> >> >>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>>> >> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
>>> >> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
>>> >> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
>>> >> >>> >> >>
>>> >> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>>> >> >>> >> >
>>> >> >>> >> > This changes migration format, so it will break compatibility with
>>> >> >>> >> > existing drivers. More generally, I think migrating internal
>>> >> >>> >> > state that is not guest visible is always a mistake
>>> >> >>> >> > as it ties migration format to an internal implementation
>>> >> >>> >> > (yes, I know we do this sometimes, but we should at least
>>> >> >>> >> > try not to add such cases). I think the right thing to do in this case
>>> >> >>> >> > is to flush outstanding
>>> >> >>> >> > work when vm is stopped. Then, we are guaranteed that inuse is 0.
>>> >> >>> >> > I sent patches that do this for virtio net and block.
>>> >> >>> >>
>>> >> >>> >> Could you give me the link of your patches? I'd like to test
>>> >> >>> >> whether they work with Kemari upon failover. If they do, I'm
>>> >> >>> >> happy to drop this patch.
>>> >> >>> >>
>>> >> >>> >> Yoshi
>>> >> >>> >
>>> >> >>> > Look for this:
>>> >> >>> > stable migration image on a stopped vm
>>> >> >>> > sent on:
>>> >> >>> > Wed, 24 Nov 2010 17:52:49 +0200
>>> >> >>>
>>> >> >>> Thanks for the info.
>>> >> >>>
>>> >> >>> However, The patch series above didn't solve the issue. In
>>> >> >>> case of Kemari, inuse is mostly > 0 because it queues the
>>> >> >>> output, and while last_avail_idx gets incremented
>>> >> >>> immediately, not sending inuse makes the state inconsistent
>>> >> >>> between Primary and Secondary.
>>> >> >>
>>> >> >> Hmm. Can we simply avoid incrementing last_avail_idx?
>>> >> >
>>> >> > I think we can calculate or prepare an internal last_avail_idx,
>>> >> > and update the external when inuse is decremented. I'll try
>>> >> > whether it work w/ w/o Kemari.
>>> >>
>>> >> Hi Michael,
>>> >>
>>> >> Could you please take a look at the following patch?
>>> >
>>> > Which version is this against?
>>>
>>> Oops. It should be very old.
>>> 67f895bfe69f323b427b284430b6219c8a62e8d4
>>>
>>> >> commit 36ee7910059e6b236fe9467a609f5b4aed866912
>>> >> Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>>> >> Date: Thu Dec 16 14:50:54 2010 +0900
>>> >>
>>> >> virtio: update last_avail_idx when inuse is decreased.
>>> >>
>>> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>>> >
>>> > It would be better to have a commit description explaining why a change
>>> > is made, and why it is correct, not just repeating what can be seen from
>>> > the diff anyway.
>>>
>>> Sorry for being lazy here.
>>>
>>> >> diff --git a/hw/virtio.c b/hw/virtio.c
>>> >> index c8a0fc6..6688c02 100644
>>> >> --- a/hw/virtio.c
>>> >> +++ b/hw/virtio.c
>>> >> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
>>> >> wmb();
>>> >> trace_virtqueue_flush(vq, count);
>>> >> vring_used_idx_increment(vq, count);
>>> >> + vq->last_avail_idx += count;
>>> >> vq->inuse -= count;
>>> >> }
>>> >>
>>> >> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>>> >> unsigned int i, head, max;
>>> >> target_phys_addr_t desc_pa = vq->vring.desc;
>>> >>
>>> >> - if (!virtqueue_num_heads(vq, vq->last_avail_idx))
>>> >> + if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
>>> >> return 0;
>>> >>
>>> >> /* When we start there are none of either input nor output. */
>>> >> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>>> >>
>>> >> max = vq->vring.num;
>>> >>
>>> >> - i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
>>> >> + i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
>>> >>
>>> >> if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
>>> >> if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
>>> >>
>>> >
>>> > Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
>>>
>>> I think there are two problems.
>>>
>>> 1. When to update last_avail_idx.
>>> 2. The ordering issue you're mentioning below.
>>>
>>> The patch above is only trying to address 1 because last time you
>>> mentioned that modifying last_avail_idx upon save may break the
>>> guest, which I agree. If virtio_queue_empty and
>>> virtqueue_avail_bytes are only used internally, meaning invisible
>>> to the guest, I guess the approach above can be applied too.
>>
>> So IMHO 2 is the real issue. This is what was problematic
>> with the save patch, otherwise of course changes in save
>> are better than changes all over the codebase.
>
> All right. Then let's focus on 2 first.
>
>>> > Previous patch version sure looked simpler, and this seems functionally
>>> > equivalent, so my question still stands: here it is rephrased in a
>>> > different way:
>>> >
>>> > assume that we have in avail ring 2 requests at start of ring: A and B in this order
>>> >
>>> > host pops A, then B, then completes B and flushes
>>> >
>>> > now with this patch last_avail_idx will be 1, and then
>>> > remote will get it, it will execute B again. As a result
>>> > B will complete twice, and apparently A will never complete.
>>> >
>>> >
>>> > This is what I was saying below: assuming that there are
>>> > outstanding requests when we migrate, there is no way
>>> > a single index can be enough to figure out which requests
>>> > need to be handled and which are in flight already.
>>> >
>>> > We must add some kind of bitmask to tell us which is which.
>>>
>>> I should understand why this inversion can happen before solving
>>> the issue.
>>
>> It's a fundamental thing in virtio.
>> I think it is currently only likely to happen with block, I think tap
>> currently completes things in order. In any case relying on this in the
>> frontend is a mistake.
>>
>>> Currently, how are you making virio-net to flush
>>> every requests for live migration? Is it qemu_aio_flush()?
>>
>> Think so.
>
> If qemu_aio_flush() is responsible for flushing the outstanding
> virtio-net requests, I'm wondering why it's a problem for Kemari.
> As I described in the previous message, Kemari queues the
> requests first. So in you example above, it should start with
>
> virtio-net: last_avai_idx 0 inuse 2
> event-tap: {A,B}
>
> As you know, the requests are still in order still because net
> layer initiates in order. Not about completing.
>
> In the first synchronization, the status above is transferred. In
> the next synchronization, the status will be as following.
>
> virtio-net: last_avai_idx 1 inuse 1
> event-tap: {B}
>
> Why? Because Kemari flushes the first virtio-net request using
> qemu_aio_flush() before each synchronization. If
> qemu_aio_flush() doesn't guarantee the order, what you pointed
> should be problematic. So in the final synchronization, the
> state should be,
>
> virtio-net: last_avai_idx 2 inuse 0
> event-tap: {}
>
> where A,B were completed in order.
>
> Yoshi
Hi Michael,
Please let me know if the discussion has gone in the wrong
direction or if my explanation was incorrect. I believe live
migration shouldn't be a problem either, because it flushes
until inuse reaches 0.
Yoshi
>
>
>>
>>> >
>>> >> >
>>> >> >>
>>> >> >>> I'm wondering why
>>> >> >>> last_avail_idx is OK to send but not inuse.
>>> >> >>
>>> >> >> last_avail_idx is at some level a mistake, it exposes part of
>>> >> >> our internal implementation, but it does *also* express
>>> >> >> a guest observable state.
>>> >> >>
>>> >> >> Here's the problem that it solves: just looking at the rings in virtio
>>> >> >> there is no way to detect that a specific request has already been
>>> >> >> completed. And the protocol forbids completing the same request twice.
>>> >> >>
>>> >> >> Our implementation always starts processing the requests
>>> >> >> in order, and since we flush outstanding requests
>>> >> >> before save, it works to just tell the remote 'process only requests
>>> >> >> after this place'.
>>> >> >>
>>> >> >> But there's no such requirement in the virtio protocol,
>>> >> >> so to be really generic we could add a bitmask of valid avail
>>> >> >> ring entries that did not complete yet. This would be
>>> >> >> the exact representation of the guest observable state.
>>> >> >> In practice we have rings of up to 512 entries.
>>> >> >> That's 64 byte per ring, not a lot at all.
>>> >> >>
>>> >> >> However, if we ever do change the protocol to send the bitmask,
>>> >> >> we would need some code to resubmit requests
>>> >> >> out of order, so it's not trivial.
>>> >> >>
>>> >> >> Another minor mistake with last_avail_idx is that it has
>>> >> >> some redundancy: the high bits in the index
>>> >> >> (> vq size) are not necessary as they can be
>>> >> >> got from avail idx. There's a consistency check
>>> >> >> in load but we really should try to use formats
>>> >> >> that are always consistent.
>>> >> >>
>>> >> >>> The following patch does the same thing as original, yet
>>> >> >>> keeps the format of the virtio. It shouldn't break live
>>> >> >>> migration either because inuse should be 0.
>>> >> >>>
>>> >> >>> Yoshi
>>> >> >>
>>> >> >> Question is, can you flush to make inuse 0 in kemari too?
>>> >> >> And if not, how do you handle the fact that some requests
>>> >> >> are in flight on the primary?
>>> >> >
>>> >> > Although we try flushing requests one by one making inuse 0,
>>> >> > there are cases when it failovers to the secondary when inuse
>>> >> > isn't 0. We handle these in flight request on the primary by
>>> >> > replaying on the secondary.
>>> >> >
>>> >> >>
>>> >> >>> diff --git a/hw/virtio.c b/hw/virtio.c
>>> >> >>> index c8a0fc6..875c7ca 100644
>>> >> >>> --- a/hw/virtio.c
>>> >> >>> +++ b/hw/virtio.c
>>> >> >>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>>> >> >>> qemu_put_be32(f, i);
>>> >> >>>
>>> >> >>> for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
>>> >> >>> + uint16_t last_avail_idx;
>>> >> >>> +
>>> >> >>> if (vdev->vq[i].vring.num == 0)
>>> >> >>> break;
>>> >> >>>
>>> >> >>> + last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
>>> >> >>> +
>>> >> >>> qemu_put_be32(f, vdev->vq[i].vring.num);
>>> >> >>> qemu_put_be64(f, vdev->vq[i].pa);
>>> >> >>> - qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>>> >> >>> + qemu_put_be16s(f, &last_avail_idx);
>>> >> >>> if (vdev->binding->save_queue)
>>> >> >>> vdev->binding->save_queue(vdev->binding_opaque, i, f);
>>> >> >>> }
>>> >> >>>
>>> >> >>>
>>> >> >>
>>> >> >> This looks wrong to me. Requests can complete in any order, can they
>>> >> >> not? So if request 0 did not complete and request 1 did not,
>>> >> >> you send avail - inuse and on the secondary you will process and
>>> >> >> complete request 1 the second time, crashing the guest.
>>> >> >
>>> >> > In case of Kemari, no. We sit between devices and net/block, and
>>> >> > queue the requests. After completing each transaction, we flush
>>> >> > the requests one by one. So there won't be completion inversion,
>>> >> > and therefore won't be visible to the guest.
>>> >> >
>>> >> > Yoshi
>>> >> >
>>> >> >>
>>> >> >>>
>>> >> >>> >
>>> >> >>> >> >
>>> >> >>> >> >> ---
>>> >> >>> >> >> hw/virtio.c | 8 +++++++-
>>> >> >>> >> >> 1 files changed, 7 insertions(+), 1 deletions(-)
>>> >> >>> >> >>
>>> >> >>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
>>> >> >>> >> >> index 849a60f..5509644 100644
>>> >> >>> >> >> --- a/hw/virtio.c
>>> >> >>> >> >> +++ b/hw/virtio.c
>>> >> >>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
>>> >> >>> >> >> VRing vring;
>>> >> >>> >> >> target_phys_addr_t pa;
>>> >> >>> >> >> uint16_t last_avail_idx;
>>> >> >>> >> >> - int inuse;
>>> >> >>> >> >> + uint16_t inuse;
>>> >> >>> >> >> uint16_t vector;
>>> >> >>> >> >> void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
>>> >> >>> >> >> VirtIODevice *vdev;
>>> >> >>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>>> >> >>> >> >> qemu_put_be32(f, vdev->vq[i].vring.num);
>>> >> >>> >> >> qemu_put_be64(f, vdev->vq[i].pa);
>>> >> >>> >> >> qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>>> >> >>> >> >> + qemu_put_be16s(f, &vdev->vq[i].inuse);
>>> >> >>> >> >> if (vdev->binding->save_queue)
>>> >> >>> >> >> vdev->binding->save_queue(vdev->binding_opaque, i, f);
>>> >> >>> >> >> }
>>> >> >>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
>>> >> >>> >> >> vdev->vq[i].vring.num = qemu_get_be32(f);
>>> >> >>> >> >> vdev->vq[i].pa = qemu_get_be64(f);
>>> >> >>> >> >> qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
>>> >> >>> >> >> + qemu_get_be16s(f, &vdev->vq[i].inuse);
>>> >> >>> >> >> +
>>> >> >>> >> >> + /* revert last_avail_idx if there are outstanding emulation. */
>>> >> >>> >> >> + vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
>>> >> >>> >> >> + vdev->vq[i].inuse = 0;
>>> >> >>> >> >>
>>> >> >>> >> >> if (vdev->vq[i].pa) {
>>> >> >>> >> >> virtqueue_init(&vdev->vq[i]);
>>> >> >>> >> >> --
>>> >> >>> >> >> 1.7.1.2
>>> >> >>> >> >>
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 11/21] ioport: insert event_tap_ioport() to ioport_write().
2010-12-17 16:19 ` Yoshiaki Tamura
@ 2010-12-18 8:36 ` Stefan Hajnoczi
0 siblings, 0 replies; 112+ messages in thread
From: Stefan Hajnoczi @ 2010-12-18 8:36 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: aliguori, dlaor, ananth, kvm, Michael S. Tsirkin, mtosatti,
qemu-devel, vatsa, ohmura.kei, avi, psuriset, stefanha
On Fri, Dec 17, 2010 at 4:19 PM, Yoshiaki Tamura
<tamura.yoshiaki@lab.ntt.co.jp> wrote:
> 2010/12/17 Stefan Hajnoczi <stefanha@gmail.com>:
>> On Thu, Dec 16, 2010 at 9:50 AM, Yoshiaki Tamura
>> <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>>> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
>>>> On Thu, Dec 16, 2010 at 04:37:41PM +0900, Yoshiaki Tamura wrote:
>>>>> 2010/11/28 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
>>>>> > 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>>>>> >> On Thu, Nov 25, 2010 at 03:06:50PM +0900, Yoshiaki Tamura wrote:
>>>>> >>> Record ioport event to replay it upon failover.
>>>>> >>>
>>>>> >>> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>>>>> >>
>>>>> >> Interesting. This will have to be extended to support ioeventfd.
>>>>> >> Since each eventfd is really just a binary trigger
>>>>> >> it should be enough to read out the fd state.
>>>>> >
>>>>> > Haven't thought about eventfd yet. Will try doing it in the next
>>>>> > spin.
>>>>>
>>>>> Hi Michael,
>>>>>
>>>>> I looked into eventfd and realized it's only used with vhost now.
>>>>
>>>> There are patches on list to use it for block/userspace net.
>>>
>>> Thanks. Now I understand.
>>> In that case, inserting an even-tap function to the following code
>>> should be appropriate?
>>>
>>> int event_notifier_test_and_clear(EventNotifier *e)
>>> {
>>> uint64_t value;
>>> int r = read(e->fd, &value, sizeof(value));
>>> return r == sizeof(value);
>>> }
>>>
>>>>
>>>>> However, I
>>>>> believe vhost bypass the net layer in qemu, and there is no way for Kemari to
>>>>> detect the outputs. To me, it doesn't make sense to extend this patch to
>>>>> support eventfd...
>>
>> Here is the userspace ioeventfd patch series:
>> http://www.mail-archive.com/qemu-devel@nongnu.org/msg49208.html
>>
>> Instead of switching to QEMU userspace to handle the virtqueue kick
>> pio write, we signal the eventfd inside the kernel and resume guest
>> code execution. The I/O thread can then process the virtqueue kick in
>> parallel to guest code execution.
>>
>> I think this can still be tied into Kemari. If you are switching to a
>> pure net/block-layer event tap instead of pio/mmio, then I think it
>> should just work.
>
> That should take a while until we solve how to set correct
> callbacks to the secondary upon failover. BTW, do you have a
> plan to move the eventfd framework to the upper layer as
> pio/mmio. Not only Kemari works for free, other emulators should
> be able to benefit from it.
I'm not sure I understand the question, but I have considered making
ioeventfd a first-class interface like register_ioport_write(). In
some ways that would be cleaner than the way we use ioeventfd in vhost
and virtio-pci today.
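As a rough illustration of that idea, a first-class registration could
look like the following; register_ioeventfd() and its parameters are
hypothetical, sketched only by analogy with register_ioport_write().

/* hypothetical: bind an EventNotifier to a pio range the same way
 * register_ioport_write() binds a write callback */
int register_ioeventfd(pio_addr_t start, int length, int size,
                       uint32_t match_data, EventNotifier *e);

/* a virtio-pci style caller might then do something like: */
register_ioeventfd(pio_base + VIRTIO_PCI_QUEUE_NOTIFY, 2, 2,
                   vq_index, notifier);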
>> For vhost it would be more difficult to integrate with Kemari.
>
> At this point, it's impossible. As Michael said, I should
> prevent starting Kemari when vhost=on.
If you add some functionality to vhost it might be possible, although
that would slow it down. So perhaps for the near future using vhost
with Kemari is pointless anyway since you won't be able to reach the
performance that vhost-net can achieve.
Stefan
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse variable.
2010-12-16 15:59 ` Yoshiaki Tamura
2010-12-17 16:22 ` Yoshiaki Tamura
@ 2010-12-24 9:27 ` Michael S. Tsirkin
2010-12-24 11:42 ` Yoshiaki Tamura
1 sibling, 1 reply; 112+ messages in thread
From: Michael S. Tsirkin @ 2010-12-24 9:27 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, Marcelo Tosatti,
qemu-devel, vatsa, avi, psuriset, stefanha
On Fri, Dec 17, 2010 at 12:59:58AM +0900, Yoshiaki Tamura wrote:
> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
> > On Thu, Dec 16, 2010 at 11:28:46PM +0900, Yoshiaki Tamura wrote:
> >> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
> >> > On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
> >> >> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
> >> >> > 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
> >> >> >>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
> >> >> >>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
> >> >> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
> >> >> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
> >> >> >>> >> >>
> >> >> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> >>> >> >
> >> >> >>> >> > This changes migration format, so it will break compatibility with
> >> >> >>> >> > existing drivers. More generally, I think migrating internal
> >> >> >>> >> > state that is not guest visible is always a mistake
> >> >> >>> >> > as it ties migration format to an internal implementation
> >> >> >>> >> > (yes, I know we do this sometimes, but we should at least
> >> >> >>> >> > try not to add such cases). I think the right thing to do in this case
> >> >> >>> >> > is to flush outstanding
> >> >> >>> >> > work when vm is stopped. Then, we are guaranteed that inuse is 0.
> >> >> >>> >> > I sent patches that do this for virtio net and block.
> >> >> >>> >>
> >> >> >>> >> Could you give me the link of your patches? I'd like to test
> >> >> >>> >> whether they work with Kemari upon failover. If they do, I'm
> >> >> >>> >> happy to drop this patch.
> >> >> >>> >>
> >> >> >>> >> Yoshi
> >> >> >>> >
> >> >> >>> > Look for this:
> >> >> >>> > stable migration image on a stopped vm
> >> >> >>> > sent on:
> >> >> >>> > Wed, 24 Nov 2010 17:52:49 +0200
> >> >> >>>
> >> >> >>> Thanks for the info.
> >> >> >>>
> >> >> >>> However, The patch series above didn't solve the issue. In
> >> >> >>> case of Kemari, inuse is mostly > 0 because it queues the
> >> >> >>> output, and while last_avail_idx gets incremented
> >> >> >>> immediately, not sending inuse makes the state inconsistent
> >> >> >>> between Primary and Secondary.
> >> >> >>
> >> >> >> Hmm. Can we simply avoid incrementing last_avail_idx?
> >> >> >
> >> >> > I think we can calculate or prepare an internal last_avail_idx,
> >> >> > and update the external when inuse is decremented. I'll try
> >> >> > whether it work w/ w/o Kemari.
> >> >>
> >> >> Hi Michael,
> >> >>
> >> >> Could you please take a look at the following patch?
> >> >
> >> > Which version is this against?
> >>
> >> Oops. It should be very old.
> >> 67f895bfe69f323b427b284430b6219c8a62e8d4
> >>
> >> >> commit 36ee7910059e6b236fe9467a609f5b4aed866912
> >> >> Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> Date: Thu Dec 16 14:50:54 2010 +0900
> >> >>
> >> >> virtio: update last_avail_idx when inuse is decreased.
> >> >>
> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >
> >> > It would be better to have a commit description explaining why a change
> >> > is made, and why it is correct, not just repeating what can be seen from
> >> > the diff anyway.
> >>
> >> Sorry for being lazy here.
> >>
> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> index c8a0fc6..6688c02 100644
> >> >> --- a/hw/virtio.c
> >> >> +++ b/hw/virtio.c
> >> >> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
> >> >> wmb();
> >> >> trace_virtqueue_flush(vq, count);
> >> >> vring_used_idx_increment(vq, count);
> >> >> + vq->last_avail_idx += count;
> >> >> vq->inuse -= count;
> >> >> }
> >> >>
> >> >> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >> >> unsigned int i, head, max;
> >> >> target_phys_addr_t desc_pa = vq->vring.desc;
> >> >>
> >> >> - if (!virtqueue_num_heads(vq, vq->last_avail_idx))
> >> >> + if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
> >> >> return 0;
> >> >>
> >> >> /* When we start there are none of either input nor output. */
> >> >> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >> >>
> >> >> max = vq->vring.num;
> >> >>
> >> >> - i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
> >> >> + i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
> >> >>
> >> >> if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
> >> >> if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
> >> >>
> >> >
> >> > Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
> >>
> >> I think there are two problems.
> >>
> >> 1. When to update last_avail_idx.
> >> 2. The ordering issue you're mentioning below.
> >>
> >> The patch above is only trying to address 1 because last time you
> >> mentioned that modifying last_avail_idx upon save may break the
> >> guest, which I agree. If virtio_queue_empty and
> >> virtqueue_avail_bytes are only used internally, meaning invisible
> >> to the guest, I guess the approach above can be applied too.
> >
> > So IMHO 2 is the real issue. This is what was problematic
> > with the save patch, otherwise of course changes in save
> > are better than changes all over the codebase.
>
> All right. Then let's focus on 2 first.
>
> >> > Previous patch version sure looked simpler, and this seems functionally
> >> > equivalent, so my question still stands: here it is rephrased in a
> >> > different way:
> >> >
> >> > assume that we have in avail ring 2 requests at start of ring: A and B in this order
> >> >
> >> > host pops A, then B, then completes B and flushes
> >> >
> >> > now with this patch last_avail_idx will be 1, and then
> >> > remote will get it, it will execute B again. As a result
> >> > B will complete twice, and apparently A will never complete.
> >> >
> >> >
> >> > This is what I was saying below: assuming that there are
> >> > outstanding requests when we migrate, there is no way
> >> > a single index can be enough to figure out which requests
> >> > need to be handled and which are in flight already.
> >> >
> >> > We must add some kind of bitmask to tell us which is which.
> >>
> >> I should understand why this inversion can happen before solving
> >> the issue.
> >
> > It's a fundamental thing in virtio.
> > I think it is currently only likely to happen with block, I think tap
> > currently completes things in order. In any case relying on this in the
> > frontend is a mistake.
> >
> >> Currently, how are you making virio-net to flush
> >> every requests for live migration? Is it qemu_aio_flush()?
> >
> > Think so.
>
> If qemu_aio_flush() is responsible for flushing the outstanding
> virtio-net requests, I'm wondering why it's a problem for Kemari.
> As I described in the previous message, Kemari queues the
> requests first. So in you example above, it should start with
>
> virtio-net: last_avai_idx 0 inuse 2
> event-tap: {A,B}
>
> As you know, the requests are still in order still because net
> layer initiates in order. Not about completing.
>
> In the first synchronization, the status above is transferred. In
> the next synchronization, the status will be as following.
>
> virtio-net: last_avai_idx 1 inuse 1
> event-tap: {B}
OK, this answers the ordering question.
Another question: at this point we transfer this status: both
event-tap and virtio ring have the command B,
so the remote will have:
virtio-net: inuse 0
event-tap: {B}
Is this right? This already seems to be a problem: when B completes,
won't inuse go negative?
Next it seems that the remote virtio will resubmit B to event-tap. The
remote will then have:
virtio-net: inuse 1
event-tap: {B, B}
This looks kind of wrong ... will two packets go out?
> Why? Because Kemari flushes the first virtio-net request using
> qemu_aio_flush() before each synchronization. If
> qemu_aio_flush() doesn't guarantee the order, what you pointed
> should be problematic. So in the final synchronization, the
> state should be,
>
> virtio-net: last_avai_idx 2 inuse 0
> event-tap: {}
>
> where A,B were completed in order.
>
> Yoshi
It might be better to discuss block because that's where
requests can complete out of order.
So let me see if I understand:
- each command passed to event tap is queued by it,
it is not passed directly to the backend
- later requests are passed to the backend,
always in the same order that they were submitted
- each synchronization point flushes all requests
passed to the backend so far
- each synchronization transfers all requests not passed to the backend,
to the remote, and they are replayed there
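Restating those four points as a compressed sketch (hypothetical names,
not the actual event-tap code), just to pin down the ordering being
assumed:

typedef struct Request Request;
struct Request {
    Request *next;
    void (*submit)(Request *r);    /* hands the request to net/block */
};

static Request *queued_head, *queued_tail;  /* at event-tap, not yet at backend */

static void event_tap_submit(Request *r)    /* called from the device layer */
{
    r->next = NULL;
    if (queued_tail) {
        queued_tail->next = r;
    } else {
        queued_head = r;
    }
    queued_tail = r;                        /* submission order is preserved */
}

static void event_tap_sync_point(void)
{
    qemu_aio_flush();     /* flush requests the backend already has */
    /* ...transfer queued_head..queued_tail to the secondary for replay... */
    if (queued_head) {    /* then pass the next queued request on, in order */
        Request *r = queued_head;
        queued_head = r->next;
        if (!queued_head) {
            queued_tail = NULL;
        }
        r->submit(r);
    }
}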
Now, to analyse this for correctness, I am looking at the original patch
because it is smaller and so easier to analyse, and I think it is
functionally equivalent; correct me if I am wrong on this.
So the reason there's no out-of-order issue is this (and it might
be a good thing to put in the commit log or a comment somewhere):
at the point of the save callback, event-tap has already flushed
the commands it passed to the backend. Thus, at the point of the
save callback, if a command has completed, all previous commands
have been flushed and completed.
Therefore inuse is in fact the number of requests passed to
event-tap but not yet passed to the backend (in the non-event-tap
case all commands are passed to the backend immediately, which is
why inuse is 0 there), and these are the last inuse commands
submitted.
Right?
Now a question:
When we pass last_used_index - inuse to the remote,
the remote virtio will resubmit the request.
Since the request is also passed by event-tap, we get the request
twice; why is this not a problem?
> >
> >> >
> >> >> >
> >> >> >>
> >> >> >>> I'm wondering why
> >> >> >>> last_avail_idx is OK to send but not inuse.
> >> >> >>
> >> >> >> last_avail_idx is at some level a mistake, it exposes part of
> >> >> >> our internal implementation, but it does *also* express
> >> >> >> a guest observable state.
> >> >> >>
> >> >> >> Here's the problem that it solves: just looking at the rings in virtio
> >> >> >> there is no way to detect that a specific request has already been
> >> >> >> completed. And the protocol forbids completing the same request twice.
> >> >> >>
> >> >> >> Our implementation always starts processing the requests
> >> >> >> in order, and since we flush outstanding requests
> >> >> >> before save, it works to just tell the remote 'process only requests
> >> >> >> after this place'.
> >> >> >>
> >> >> >> But there's no such requirement in the virtio protocol,
> >> >> >> so to be really generic we could add a bitmask of valid avail
> >> >> >> ring entries that did not complete yet. This would be
> >> >> >> the exact representation of the guest observable state.
> >> >> >> In practice we have rings of up to 512 entries.
> >> >> >> That's 64 byte per ring, not a lot at all.
> >> >> >>
> >> >> >> However, if we ever do change the protocol to send the bitmask,
> >> >> >> we would need some code to resubmit requests
> >> >> >> out of order, so it's not trivial.
> >> >> >>
> >> >> >> Another minor mistake with last_avail_idx is that it has
> >> >> >> some redundancy: the high bits in the index
> >> >> >> (> vq size) are not necessary as they can be
> >> >> >> got from avail idx. There's a consistency check
> >> >> >> in load but we really should try to use formats
> >> >> >> that are always consistent.
> >> >> >>
> >> >> >>> The following patch does the same thing as original, yet
> >> >> >>> keeps the format of the virtio. It shouldn't break live
> >> >> >>> migration either because inuse should be 0.
> >> >> >>>
> >> >> >>> Yoshi
> >> >> >>
> >> >> >> Question is, can you flush to make inuse 0 in kemari too?
> >> >> >> And if not, how do you handle the fact that some requests
> >> >> >> are in flight on the primary?
> >> >> >
> >> >> > Although we try flushing requests one by one making inuse 0,
> >> >> > there are cases when it failovers to the secondary when inuse
> >> >> > isn't 0. We handle these in flight request on the primary by
> >> >> > replaying on the secondary.
> >> >> >
> >> >> >>
> >> >> >>> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> >>> index c8a0fc6..875c7ca 100644
> >> >> >>> --- a/hw/virtio.c
> >> >> >>> +++ b/hw/virtio.c
> >> >> >>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >> >> >>> qemu_put_be32(f, i);
> >> >> >>>
> >> >> >>> for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
> >> >> >>> + uint16_t last_avail_idx;
> >> >> >>> +
> >> >> >>> if (vdev->vq[i].vring.num == 0)
> >> >> >>> break;
> >> >> >>>
> >> >> >>> + last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
> >> >> >>> +
> >> >> >>> qemu_put_be32(f, vdev->vq[i].vring.num);
> >> >> >>> qemu_put_be64(f, vdev->vq[i].pa);
> >> >> >>> - qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >> >>> + qemu_put_be16s(f, &last_avail_idx);
> >> >> >>> if (vdev->binding->save_queue)
> >> >> >>> vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >> >> >>> }
> >> >> >>>
> >> >> >>>
> >> >> >>
> >> >> >> This looks wrong to me. Requests can complete in any order, can they
> >> >> >> not? So if request 0 did not complete and request 1 did not,
> >> >> >> you send avail - inuse and on the secondary you will process and
> >> >> >> complete request 1 the second time, crashing the guest.
> >> >> >
> >> >> > In case of Kemari, no. We sit between devices and net/block, and
> >> >> > queue the requests. After completing each transaction, we flush
> >> >> > the requests one by one. So there won't be completion inversion,
> >> >> > and therefore won't be visible to the guest.
> >> >> >
> >> >> > Yoshi
> >> >> >
> >> >> >>
> >> >> >>>
> >> >> >>> >
> >> >> >>> >> >
> >> >> >>> >> >> ---
> >> >> >>> >> >> hw/virtio.c | 8 +++++++-
> >> >> >>> >> >> 1 files changed, 7 insertions(+), 1 deletions(-)
> >> >> >>> >> >>
> >> >> >>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> >>> >> >> index 849a60f..5509644 100644
> >> >> >>> >> >> --- a/hw/virtio.c
> >> >> >>> >> >> +++ b/hw/virtio.c
> >> >> >>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
> >> >> >>> >> >> VRing vring;
> >> >> >>> >> >> target_phys_addr_t pa;
> >> >> >>> >> >> uint16_t last_avail_idx;
> >> >> >>> >> >> - int inuse;
> >> >> >>> >> >> + uint16_t inuse;
> >> >> >>> >> >> uint16_t vector;
> >> >> >>> >> >> void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
> >> >> >>> >> >> VirtIODevice *vdev;
> >> >> >>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >> >> >>> >> >> qemu_put_be32(f, vdev->vq[i].vring.num);
> >> >> >>> >> >> qemu_put_be64(f, vdev->vq[i].pa);
> >> >> >>> >> >> qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >> >>> >> >> + qemu_put_be16s(f, &vdev->vq[i].inuse);
> >> >> >>> >> >> if (vdev->binding->save_queue)
> >> >> >>> >> >> vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >> >> >>> >> >> }
> >> >> >>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
> >> >> >>> >> >> vdev->vq[i].vring.num = qemu_get_be32(f);
> >> >> >>> >> >> vdev->vq[i].pa = qemu_get_be64(f);
> >> >> >>> >> >> qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >> >>> >> >> + qemu_get_be16s(f, &vdev->vq[i].inuse);
> >> >> >>> >> >> +
> >> >> >>> >> >> + /* revert last_avail_idx if there are outstanding emulation. */
> >> >> >>> >> >> + vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
> >> >> >>> >> >> + vdev->vq[i].inuse = 0;
> >> >> >>> >> >>
> >> >> >>> >> >> if (vdev->vq[i].pa) {
> >> >> >>> >> >> virtqueue_init(&vdev->vq[i]);
> >> >> >>> >> >> --
> >> >> >>> >> >> 1.7.1.2
> >> >> >>> >> >>
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse variable.
2010-12-24 9:27 ` Michael S. Tsirkin
@ 2010-12-24 11:42 ` Yoshiaki Tamura
2010-12-24 13:21 ` Michael S. Tsirkin
` (2 more replies)
0 siblings, 3 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-12-24 11:42 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, Marcelo Tosatti,
qemu-devel, vatsa, avi, psuriset, stefanha
2010/12/24 Michael S. Tsirkin <mst@redhat.com>:
> On Fri, Dec 17, 2010 at 12:59:58AM +0900, Yoshiaki Tamura wrote:
>> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
>> > On Thu, Dec 16, 2010 at 11:28:46PM +0900, Yoshiaki Tamura wrote:
>> >> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
>> >> > On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
>> >> >> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
>> >> >> > 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
>> >> >> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
>> >> >> >>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>> >> >> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
>> >> >> >>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>> >> >> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
>> >> >> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
>> >> >> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
>> >> >> >>> >> >>
>> >> >> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >> >> >>> >> >
>> >> >> >>> >> > This changes migration format, so it will break compatibility with
>> >> >> >>> >> > existing drivers. More generally, I think migrating internal
>> >> >> >>> >> > state that is not guest visible is always a mistake
>> >> >> >>> >> > as it ties migration format to an internal implementation
>> >> >> >>> >> > (yes, I know we do this sometimes, but we should at least
>> >> >> >>> >> > try not to add such cases). I think the right thing to do in this case
>> >> >> >>> >> > is to flush outstanding
>> >> >> >>> >> > work when vm is stopped. Then, we are guaranteed that inuse is 0.
>> >> >> >>> >> > I sent patches that do this for virtio net and block.
>> >> >> >>> >>
>> >> >> >>> >> Could you give me the link of your patches? I'd like to test
>> >> >> >>> >> whether they work with Kemari upon failover. If they do, I'm
>> >> >> >>> >> happy to drop this patch.
>> >> >> >>> >>
>> >> >> >>> >> Yoshi
>> >> >> >>> >
>> >> >> >>> > Look for this:
>> >> >> >>> > stable migration image on a stopped vm
>> >> >> >>> > sent on:
>> >> >> >>> > Wed, 24 Nov 2010 17:52:49 +0200
>> >> >> >>>
>> >> >> >>> Thanks for the info.
>> >> >> >>>
>> >> >> >>> However, The patch series above didn't solve the issue. In
>> >> >> >>> case of Kemari, inuse is mostly > 0 because it queues the
>> >> >> >>> output, and while last_avail_idx gets incremented
>> >> >> >>> immediately, not sending inuse makes the state inconsistent
>> >> >> >>> between Primary and Secondary.
>> >> >> >>
>> >> >> >> Hmm. Can we simply avoid incrementing last_avail_idx?
>> >> >> >
>> >> >> > I think we can calculate or prepare an internal last_avail_idx,
>> >> >> > and update the external when inuse is decremented. I'll try
>> >> >> > whether it work w/ w/o Kemari.
>> >> >>
>> >> >> Hi Michael,
>> >> >>
>> >> >> Could you please take a look at the following patch?
>> >> >
>> >> > Which version is this against?
>> >>
>> >> Oops. It should be very old.
>> >> 67f895bfe69f323b427b284430b6219c8a62e8d4
>> >>
>> >> >> commit 36ee7910059e6b236fe9467a609f5b4aed866912
>> >> >> Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >> >> Date: Thu Dec 16 14:50:54 2010 +0900
>> >> >>
>> >> >> virtio: update last_avail_idx when inuse is decreased.
>> >> >>
>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >> >
>> >> > It would be better to have a commit description explaining why a change
>> >> > is made, and why it is correct, not just repeating what can be seen from
>> >> > the diff anyway.
>> >>
>> >> Sorry for being lazy here.
>> >>
>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
>> >> >> index c8a0fc6..6688c02 100644
>> >> >> --- a/hw/virtio.c
>> >> >> +++ b/hw/virtio.c
>> >> >> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
>> >> >> wmb();
>> >> >> trace_virtqueue_flush(vq, count);
>> >> >> vring_used_idx_increment(vq, count);
>> >> >> + vq->last_avail_idx += count;
>> >> >> vq->inuse -= count;
>> >> >> }
>> >> >>
>> >> >> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>> >> >> unsigned int i, head, max;
>> >> >> target_phys_addr_t desc_pa = vq->vring.desc;
>> >> >>
>> >> >> - if (!virtqueue_num_heads(vq, vq->last_avail_idx))
>> >> >> + if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
>> >> >> return 0;
>> >> >>
>> >> >> /* When we start there are none of either input nor output. */
>> >> >> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>> >> >>
>> >> >> max = vq->vring.num;
>> >> >>
>> >> >> - i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
>> >> >> + i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
>> >> >>
>> >> >> if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
>> >> >> if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
>> >> >>
>> >> >
>> >> > Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
>> >>
>> >> I think there are two problems.
>> >>
>> >> 1. When to update last_avail_idx.
>> >> 2. The ordering issue you're mentioning below.
>> >>
>> >> The patch above is only trying to address 1 because last time you
>> >> mentioned that modifying last_avail_idx upon save may break the
>> >> guest, which I agree. If virtio_queue_empty and
>> >> virtqueue_avail_bytes are only used internally, meaning invisible
>> >> to the guest, I guess the approach above can be applied too.
>> >
>> > So IMHO 2 is the real issue. This is what was problematic
>> > with the save patch, otherwise of course changes in save
>> > are better than changes all over the codebase.
>>
>> All right. Then let's focus on 2 first.
>>
>> >> > Previous patch version sure looked simpler, and this seems functionally
>> >> > equivalent, so my question still stands: here it is rephrased in a
>> >> > different way:
>> >> >
>> >> > assume that we have in avail ring 2 requests at start of ring: A and B in this order
>> >> >
>> >> > host pops A, then B, then completes B and flushes
>> >> >
>> >> > now with this patch last_avail_idx will be 1, and then
>> >> > remote will get it, it will execute B again. As a result
>> >> > B will complete twice, and apparently A will never complete.
>> >> >
>> >> >
>> >> > This is what I was saying below: assuming that there are
>> >> > outstanding requests when we migrate, there is no way
>> >> > a single index can be enough to figure out which requests
>> >> > need to be handled and which are in flight already.
>> >> >
>> >> > We must add some kind of bitmask to tell us which is which.
>> >>
>> >> I should understand why this inversion can happen before solving
>> >> the issue.
>> >
>> > It's a fundamental thing in virtio.
>> > I think it is currently only likely to happen with block, I think tap
>> > currently completes things in order. In any case relying on this in the
>> > frontend is a mistake.
>> >
>> >> Currently, how are you making virio-net to flush
>> >> every requests for live migration? Is it qemu_aio_flush()?
>> >
>> > Think so.
>>
>> If qemu_aio_flush() is responsible for flushing the outstanding
>> virtio-net requests, I'm wondering why it's a problem for Kemari.
>> As I described in the previous message, Kemari queues the
>> requests first. So in you example above, it should start with
>>
>> virtio-net: last_avai_idx 0 inuse 2
>> event-tap: {A,B}
>>
>> As you know, the requests are still in order still because net
>> layer initiates in order. Not about completing.
>>
>> In the first synchronization, the status above is transferred. In
>> the next synchronization, the status will be as following.
>>
>> virtio-net: last_avai_idx 1 inuse 1
>> event-tap: {B}
>
> OK, this answers the ordering question.
Glad to hear that!
> Another question: at this point we transfer this status: both
> event-tap and virtio ring have the command B,
> so the remote will have:
>
> virtio-net: inuse 0
> event-tap: {B}
>
> Is this right? This already seems to be a problem as when B completes
> inuse will go negative?
I think the state above is wrong. inuse 0 means there shouldn't be
any requests in event-tap. Note that the callback is called only
when event-tap flushes the requests.
> Next it seems that the remote virtio will resubmit B to event-tap. The
> remote will then have:
>
> virtio-net: inuse 1
> event-tap: {B, B}
>
> This looks kind of wrong ... will two packets go out?
No. Currently, we're just replaying the requests with pio/mmio.
In the situation above, it should be,
virtio-net: inuse 1
event-tap: {B}
>> Why? Because Kemari flushes the first virtio-net request using
>> qemu_aio_flush() before each synchronization. If
>> qemu_aio_flush() doesn't guarantee the order, what you pointed
>> should be problematic. So in the final synchronization, the
>> state should be,
>>
>> virtio-net: last_avai_idx 2 inuse 0
>> event-tap: {}
>>
>> where A,B were completed in order.
>>
>> Yoshi
>
>
> It might be better to discuss block because that's where
> requests can complete out of order.
It's the same as net. We queue requests and call bdrv_flush() each
time we send a request to the block layer, so there shouldn't be
any inversion.
> So let me see if I understand:
> - each command passed to event tap is queued by it,
> it is not passed directly to the backend
> - later requests are passed to the backend,
> always in the same order that they were submitted
> - each synchronization point flushes all requests
> passed to the backend so far
> - each synchronization transfers all requests not passed to the backend,
> to the remote, and they are replayed there
Correct.
> Now to analyse this for correctness I am looking at the original patch
> because it is smaller so easier to analyse and I think it is
> functionally equivalent, correct me if I am wrong in this.
So you think decreasing last_avail_idx upon save is better than
updating it in the callback?
> So the reason there's no out of order issue is this
> (and might be a good thing to put in commit log
> or a comment somewhere):
I've added some in the latest patch. Please point it out if it
isn't enough.
> At point of save callback event tap has flushed commands
> passed to the backend already. Thus at the point of
> the save callback if a command has completed
> all previous commands have been flushed and completed.
>
>
> Therefore inuse is
> in fact the # of requests passed to event tap but not yet
> passed to the backend (for non-event tap case all commands are
> passed to the backend immediately and because of this
> inuse is 0) and these are the last inuse commands submitted.
>
>
> Right?
Yep.
> Now a question:
>
> When we pass last_used_index - inuse to the remote,
> the remote virtio will resubmit the request.
> Since request is also passed by event tap, we get
> the request twice, why is this not a problem?
It's not a problem because event-tap currently replays with
pio/mmio only, as I mentioned above. Although event-tap receives
information about the queued requests, it won't pass them to the
backend. The reason is the difficulty of setting the callbacks,
which are specific to the devices, on the secondary. These are
pointers, and even worse, usually static functions, which
event-tap has no way to restore upon failover. I do want to
change event-tap replay to work that way in the future, but
pio/mmio replay is what is implemented for now.
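A minimal sketch of that replay path (names are hypothetical; only a
cpu_outl()-style dispatch is assumed): event-tap records the guest's
write and re-issues it, so the secondary's own device model regenerates
the backend request with its local callbacks.

typedef struct {
    uint32_t addr;    /* ioport address of the virtqueue kick */
    uint32_t data;    /* value the guest wrote                */
} TapPioEvent;

static void event_tap_replay_pio(const TapPioEvent *ev)
{
    /* re-issue the kick on the secondary; the device's handle_output
     * then re-pops the virtqueue and resubmits with its own callbacks,
     * so no function pointers ever have to be transferred */
    cpu_outl(ev->addr, ev->data);
}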
Thanks,
Yoshi
>
>
>> >
>> >> >
>> >> >> >
>> >> >> >>
>> >> >> >>> I'm wondering why
>> >> >> >>> last_avail_idx is OK to send but not inuse.
>> >> >> >>
>> >> >> >> last_avail_idx is at some level a mistake, it exposes part of
>> >> >> >> our internal implementation, but it does *also* express
>> >> >> >> a guest observable state.
>> >> >> >>
>> >> >> >> Here's the problem that it solves: just looking at the rings in virtio
>> >> >> >> there is no way to detect that a specific request has already been
>> >> >> >> completed. And the protocol forbids completing the same request twice.
>> >> >> >>
>> >> >> >> Our implementation always starts processing the requests
>> >> >> >> in order, and since we flush outstanding requests
>> >> >> >> before save, it works to just tell the remote 'process only requests
>> >> >> >> after this place'.
>> >> >> >>
>> >> >> >> But there's no such requirement in the virtio protocol,
>> >> >> >> so to be really generic we could add a bitmask of valid avail
>> >> >> >> ring entries that did not complete yet. This would be
>> >> >> >> the exact representation of the guest observable state.
>> >> >> >> In practice we have rings of up to 512 entries.
>> >> >> >> That's 64 byte per ring, not a lot at all.
>> >> >> >>
>> >> >> >> However, if we ever do change the protocol to send the bitmask,
>> >> >> >> we would need some code to resubmit requests
>> >> >> >> out of order, so it's not trivial.
>> >> >> >>
>> >> >> >> Another minor mistake with last_avail_idx is that it has
>> >> >> >> some redundancy: the high bits in the index
>> >> >> >> (> vq size) are not necessary as they can be
>> >> >> >> got from avail idx. There's a consistency check
>> >> >> >> in load but we really should try to use formats
>> >> >> >> that are always consistent.
>> >> >> >>
>> >> >> >>> The following patch does the same thing as original, yet
>> >> >> >>> keeps the format of the virtio. It shouldn't break live
>> >> >> >>> migration either because inuse should be 0.
>> >> >> >>>
>> >> >> >>> Yoshi
>> >> >> >>
>> >> >> >> Question is, can you flush to make inuse 0 in kemari too?
>> >> >> >> And if not, how do you handle the fact that some requests
>> >> >> >> are in flight on the primary?
>> >> >> >
>> >> >> > Although we try flushing requests one by one making inuse 0,
>> >> >> > there are cases when it failovers to the secondary when inuse
>> >> >> > isn't 0. We handle these in flight request on the primary by
>> >> >> > replaying on the secondary.
>> >> >> >
>> >> >> >>
>> >> >> >>> diff --git a/hw/virtio.c b/hw/virtio.c
>> >> >> >>> index c8a0fc6..875c7ca 100644
>> >> >> >>> --- a/hw/virtio.c
>> >> >> >>> +++ b/hw/virtio.c
>> >> >> >>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>> >> >> >>> qemu_put_be32(f, i);
>> >> >> >>>
>> >> >> >>> for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
>> >> >> >>> + uint16_t last_avail_idx;
>> >> >> >>> +
>> >> >> >>> if (vdev->vq[i].vring.num == 0)
>> >> >> >>> break;
>> >> >> >>>
>> >> >> >>> + last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
>> >> >> >>> +
>> >> >> >>> qemu_put_be32(f, vdev->vq[i].vring.num);
>> >> >> >>> qemu_put_be64(f, vdev->vq[i].pa);
>> >> >> >>> - qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>> >> >> >>> + qemu_put_be16s(f, &last_avail_idx);
>> >> >> >>> if (vdev->binding->save_queue)
>> >> >> >>> vdev->binding->save_queue(vdev->binding_opaque, i, f);
>> >> >> >>> }
>> >> >> >>>
>> >> >> >>>
>> >> >> >>
>> >> >> >> This looks wrong to me. Requests can complete in any order, can they
>> >> >> >> not? So if request 0 did not complete and request 1 did not,
>> >> >> >> you send avail - inuse and on the secondary you will process and
>> >> >> >> complete request 1 the second time, crashing the guest.
>> >> >> >
>> >> >> > In case of Kemari, no. We sit between devices and net/block, and
>> >> >> > queue the requests. After completing each transaction, we flush
>> >> >> > the requests one by one. So there won't be completion inversion,
>> >> >> > and therefore won't be visible to the guest.
>> >> >> >
>> >> >> > Yoshi
>> >> >> >
>> >> >> >>
>> >> >> >>>
>> >> >> >>> >
>> >> >> >>> >> >
>> >> >> >>> >> >> ---
>> >> >> >>> >> >> hw/virtio.c | 8 +++++++-
>> >> >> >>> >> >> 1 files changed, 7 insertions(+), 1 deletions(-)
>> >> >> >>> >> >>
>> >> >> >>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
>> >> >> >>> >> >> index 849a60f..5509644 100644
>> >> >> >>> >> >> --- a/hw/virtio.c
>> >> >> >>> >> >> +++ b/hw/virtio.c
>> >> >> >>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
>> >> >> >>> >> >> VRing vring;
>> >> >> >>> >> >> target_phys_addr_t pa;
>> >> >> >>> >> >> uint16_t last_avail_idx;
>> >> >> >>> >> >> - int inuse;
>> >> >> >>> >> >> + uint16_t inuse;
>> >> >> >>> >> >> uint16_t vector;
>> >> >> >>> >> >> void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
>> >> >> >>> >> >> VirtIODevice *vdev;
>> >> >> >>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>> >> >> >>> >> >> qemu_put_be32(f, vdev->vq[i].vring.num);
>> >> >> >>> >> >> qemu_put_be64(f, vdev->vq[i].pa);
>> >> >> >>> >> >> qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>> >> >> >>> >> >> + qemu_put_be16s(f, &vdev->vq[i].inuse);
>> >> >> >>> >> >> if (vdev->binding->save_queue)
>> >> >> >>> >> >> vdev->binding->save_queue(vdev->binding_opaque, i, f);
>> >> >> >>> >> >> }
>> >> >> >>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
>> >> >> >>> >> >> vdev->vq[i].vring.num = qemu_get_be32(f);
>> >> >> >>> >> >> vdev->vq[i].pa = qemu_get_be64(f);
>> >> >> >>> >> >> qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
>> >> >> >>> >> >> + qemu_get_be16s(f, &vdev->vq[i].inuse);
>> >> >> >>> >> >> +
>> >> >> >>> >> >> + /* revert last_avail_idx if there are outstanding emulation. */
>> >> >> >>> >> >> + vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
>> >> >> >>> >> >> + vdev->vq[i].inuse = 0;
>> >> >> >>> >> >>
>> >> >> >>> >> >> if (vdev->vq[i].pa) {
>> >> >> >>> >> >> virtqueue_init(&vdev->vq[i]);
>> >> >> >>> >> >> --
>> >> >> >>> >> >> 1.7.1.2
>> >> >> >>> >> >>
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse variable.
2010-12-24 11:42 ` Yoshiaki Tamura
@ 2010-12-24 13:21 ` Michael S. Tsirkin
2010-12-26 9:05 ` Michael S. Tsirkin
2010-12-26 10:49 ` Michael S. Tsirkin
2 siblings, 0 replies; 112+ messages in thread
From: Michael S. Tsirkin @ 2010-12-24 13:21 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, Marcelo Tosatti,
qemu-devel, vatsa, avi, psuriset, stefanha
On Fri, Dec 24, 2010 at 08:42:19PM +0900, Yoshiaki Tamura wrote:
> 2010/12/24 Michael S. Tsirkin <mst@redhat.com>:
> > On Fri, Dec 17, 2010 at 12:59:58AM +0900, Yoshiaki Tamura wrote:
> >> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
> >> > On Thu, Dec 16, 2010 at 11:28:46PM +0900, Yoshiaki Tamura wrote:
> >> >> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
> >> >> > On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
> >> >> >> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
> >> >> >> > 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
> >> >> >> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
> >> >> >> >>> >> >>
> >> >> >> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> >> >>> >> >
> >> >> >> >>> >> > This changes migration format, so it will break compatibility with
> >> >> >> >>> >> > existing drivers. More generally, I think migrating internal
> >> >> >> >>> >> > state that is not guest visible is always a mistake
> >> >> >> >>> >> > as it ties migration format to an internal implementation
> >> >> >> >>> >> > (yes, I know we do this sometimes, but we should at least
> >> >> >> >>> >> > try not to add such cases). I think the right thing to do in this case
> >> >> >> >>> >> > is to flush outstanding
> >> >> >> >>> >> > work when vm is stopped. Then, we are guaranteed that inuse is 0.
> >> >> >> >>> >> > I sent patches that do this for virtio net and block.
> >> >> >> >>> >>
> >> >> >> >>> >> Could you give me the link of your patches? I'd like to test
> >> >> >> >>> >> whether they work with Kemari upon failover. If they do, I'm
> >> >> >> >>> >> happy to drop this patch.
> >> >> >> >>> >>
> >> >> >> >>> >> Yoshi
> >> >> >> >>> >
> >> >> >> >>> > Look for this:
> >> >> >> >>> > stable migration image on a stopped vm
> >> >> >> >>> > sent on:
> >> >> >> >>> > Wed, 24 Nov 2010 17:52:49 +0200
> >> >> >> >>>
> >> >> >> >>> Thanks for the info.
> >> >> >> >>>
> >> >> >> >>> However, The patch series above didn't solve the issue. In
> >> >> >> >>> case of Kemari, inuse is mostly > 0 because it queues the
> >> >> >> >>> output, and while last_avail_idx gets incremented
> >> >> >> >>> immediately, not sending inuse makes the state inconsistent
> >> >> >> >>> between Primary and Secondary.
> >> >> >> >>
> >> >> >> >> Hmm. Can we simply avoid incrementing last_avail_idx?
> >> >> >> >
> >> >> >> > I think we can calculate or prepare an internal last_avail_idx,
> >> >> >> > and update the external when inuse is decremented. I'll try
> >> >> >> > whether it work w/ w/o Kemari.
> >> >> >>
> >> >> >> Hi Michael,
> >> >> >>
> >> >> >> Could you please take a look at the following patch?
> >> >> >
> >> >> > Which version is this against?
> >> >>
> >> >> Oops. It should be very old.
> >> >> 67f895bfe69f323b427b284430b6219c8a62e8d4
> >> >>
> >> >> >> commit 36ee7910059e6b236fe9467a609f5b4aed866912
> >> >> >> Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> >> Date: Thu Dec 16 14:50:54 2010 +0900
> >> >> >>
> >> >> >> virtio: update last_avail_idx when inuse is decreased.
> >> >> >>
> >> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> >
> >> >> > It would be better to have a commit description explaining why a change
> >> >> > is made, and why it is correct, not just repeating what can be seen from
> >> >> > the diff anyway.
> >> >>
> >> >> Sorry for being lazy here.
> >> >>
> >> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> >> index c8a0fc6..6688c02 100644
> >> >> >> --- a/hw/virtio.c
> >> >> >> +++ b/hw/virtio.c
> >> >> >> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
> >> >> >> wmb();
> >> >> >> trace_virtqueue_flush(vq, count);
> >> >> >> vring_used_idx_increment(vq, count);
> >> >> >> + vq->last_avail_idx += count;
> >> >> >> vq->inuse -= count;
> >> >> >> }
> >> >> >>
> >> >> >> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >> >> >> unsigned int i, head, max;
> >> >> >> target_phys_addr_t desc_pa = vq->vring.desc;
> >> >> >>
> >> >> >> - if (!virtqueue_num_heads(vq, vq->last_avail_idx))
> >> >> >> + if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
> >> >> >> return 0;
> >> >> >>
> >> >> >> /* When we start there are none of either input nor output. */
> >> >> >> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >> >> >>
> >> >> >> max = vq->vring.num;
> >> >> >>
> >> >> >> - i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
> >> >> >> + i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
> >> >> >>
> >> >> >> if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
> >> >> >> if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
> >> >> >>
> >> >> >
> >> >> > Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
> >> >>
> >> >> I think there are two problems.
> >> >>
> >> >> 1. When to update last_avail_idx.
> >> >> 2. The ordering issue you're mentioning below.
> >> >>
> >> >> The patch above is only trying to address 1 because last time you
> >> >> mentioned that modifying last_avail_idx upon save may break the
> >> >> guest, which I agree. If virtio_queue_empty and
> >> >> virtqueue_avail_bytes are only used internally, meaning invisible
> >> >> to the guest, I guess the approach above can be applied too.
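For concreteness, the kind of adjustment those helpers would then need (a sketch against the hw/virtio.c of that tree, not a posted patch):

int virtio_queue_empty(VirtQueue *vq)
{
    /* the scan position moves to last_avail_idx + inuse, so the emptiness
     * check has to use the same offset to stay consistent */
    return vring_avail_idx(vq) == vq->last_avail_idx + vq->inuse;
}

/* virtqueue_avail_bytes() would need the equivalent change where it starts
 * walking the ring from last_avail_idx. */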
> >> >
> >> > So IMHO 2 is the real issue. This is what was problematic
> >> > with the save patch, otherwise of course changes in save
> >> > are better than changes all over the codebase.
> >>
> >> All right. Then let's focus on 2 first.
> >>
> >> >> > Previous patch version sure looked simpler, and this seems functionally
> >> >> > equivalent, so my question still stands: here it is rephrased in a
> >> >> > different way:
> >> >> >
> >> >> > assume that we have in avail ring 2 requests at start of ring: A and B in this order
> >> >> >
> >> >> > host pops A, then B, then completes B and flushes
> >> >> >
> >> >> > now with this patch last_avail_idx will be 1, and then
> >> >> > remote will get it, it will execute B again. As a result
> >> >> > B will complete twice, and apparently A will never complete.
> >> >> >
> >> >> >
> >> >> > This is what I was saying below: assuming that there are
> >> >> > outstanding requests when we migrate, there is no way
> >> >> > a single index can be enough to figure out which requests
> >> >> > need to be handled and which are in flight already.
> >> >> >
> >> >> > We must add some kind of bitmask to tell us which is which.
> >> >>
> >> >> I should understand why this inversion can happen before solving
> >> >> the issue.
> >> >
> >> > It's a fundamental thing in virtio.
> >> > I think it is currently only likely to happen with block, I think tap
> >> > currently completes things in order. In any case relying on this in the
> >> > frontend is a mistake.
> >> >
> >> >> Currently, how are you making virio-net to flush
> >> >> every requests for live migration? Is it qemu_aio_flush()?
> >> >
> >> > Think so.
> >>
> >> If qemu_aio_flush() is responsible for flushing the outstanding
> >> virtio-net requests, I'm wondering why it's a problem for Kemari.
> >> As I described in the previous message, Kemari queues the
> >> requests first. So in you example above, it should start with
> >>
> >> virtio-net: last_avai_idx 0 inuse 2
> >> event-tap: {A,B}
> >>
> >> As you know, the requests are still in order still because net
> >> layer initiates in order. Not about completing.
> >>
> >> In the first synchronization, the status above is transferred. In
> >> the next synchronization, the status will be as following.
> >>
> >> virtio-net: last_avai_idx 1 inuse 1
> >> event-tap: {B}
> >
> > OK, this answers the ordering question.
>
> Glad to hear that!
>
> > Another question: at this point we transfer this status: both
> > event-tap and virtio ring have the command B,
> > so the remote will have:
> >
> > virtio-net: inuse 0
> > event-tap: {B}
> >
> > Is this right? This already seems to be a problem as when B completes
> > inuse will go negative?
>
> I think state above is wrong. inuse 0 means there shouldn't be
> any requests in event-tap. Note that the callback is called only
> when event-tap flushes the requests.
>
> > Next it seems that the remote virtio will resubmit B to event-tap. The
> > remote will then have:
> >
> > virtio-net: inuse 1
> > event-tap: {B, B}
> >
> > This looks kind of wrong ... will two packets go out?
>
> No. Currently, we're just replaying the requests with pio/mmio.
> In the situation above, it should be,
>
> virtio-net: inuse 1
> event-tap: {B}
>
> >> Why? Because Kemari flushes the first virtio-net request using
> >> qemu_aio_flush() before each synchronization. If
> >> qemu_aio_flush() doesn't guarantee the order, what you pointed
> >> should be problematic. So in the final synchronization, the
> >> state should be,
> >>
> >> virtio-net: last_avai_idx 2 inuse 0
> >> event-tap: {}
> >>
> >> where A,B were completed in order.
> >>
> >> Yoshi
> >
> >
> > It might be better to discuss block because that's where
> > requests can complete out of order.
>
> It's same as net. We queue requests and call bdrv_flush per
> sending requests to the block. So there shouldn't be any
> inversion.
>
> > So let me see if I understand:
> > - each command passed to event tap is queued by it,
> > it is not passed directly to the backend
> > - later requests are passed to the backend,
> > always in the same order that they were submitted
> > - each synchronization point flushes all requests
> > passed to the backend so far
> > - each synchronization transfers all requests not passed to the backend,
> > to the remote, and they are replayed there
>
> Correct.
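A minimal C sketch of the queueing discipline just confirmed (hypothetical names, not the posted event-tap code): requests from the device are only queued, and are later released to the backend strictly in submission order.

#include <stddef.h>

typedef struct TapReq {
    struct TapReq *next;
    void (*issue)(void *opaque);   /* e.g. would call bdrv_aio_writev()/qemu_send_packet() */
    void *opaque;
} TapReq;

static TapReq *tap_head;
static TapReq **tap_tail = &tap_head;

/* device -> event-tap: queue only, nothing reaches the backend yet */
static void event_tap_submit(TapReq *r)
{
    r->next = NULL;
    *tap_tail = r;                 /* FIFO keeps submission order */
    tap_tail = &r->next;
}

/* event-tap -> backend: release exactly one request, oldest first */
static void event_tap_release_one(void)
{
    TapReq *r = tap_head;

    if (!r) {
        return;
    }
    tap_head = r->next;
    if (!tap_head) {
        tap_tail = &tap_head;
    }
    r->issue(r->opaque);           /* only now does the backend see the request */
}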
>
> > Now to analyse this for correctness I am looking at the original patch
> > because it is smaller so easier to analyse and I think it is
> > functionally equivalent, correct me if I am wrong in this.
>
> So you think decreasing last_avail_idx upon save is better than
> updating it in the callback?
If this is correct, then of the two equivalent approaches the one
that only touches save/load seems superior.
> > So the reason there's no out of order issue is this
> > (and might be a good thing to put in commit log
> > or a comment somewhere):
>
> I've done some in the latest patch. Please point it out if it
> wasn't enough.
>
> > At point of save callback event tap has flushed commands
> > passed to the backend already. Thus at the point of
> > the save callback if a command has completed
> > all previous commands have been flushed and completed.
> >
> >
> > Therefore inuse is
> > in fact the # of requests passed to event tap but not yet
> > passed to the backend (for non-event tap case all commands are
> > passed to the backend immediately and because of this
> > inuse is 0) and these are the last inuse commands submitted.
> >
> >
> > Right?
>
> Yep.
>
> > Now a question:
> >
> > When we pass last_used_index - inuse to the remote,
> > the remote virtio will resubmit the request.
> > Since request is also passed by event tap, we get
> > the request twice, why is this not a problem?
>
> It's not a problem because event-tap currently replays with
> pio/mmio only, as I mentioned above. Although event-tap receives
> information about the queued requests, it won't pass it to the
> backend. The reason is the problem in setting the callbacks
> which are specific to devices on the secondary. These are
> pointers, and even worse, are usually static functions, which
> event-tap has no way to restore it upon failover. I do want to
> change event-tap replay to be this way in the future, pio/mmio
> replay is implemented for now.
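To make the callback problem concrete, a sketch (a hypothetical struct, not the actual event-tap code) of why a queued request cannot simply be re-issued to the secondary's backend:

/* Each queued request carries device-private state and a completion
 * callback, e.g. a static function such as virtio_blk_rw_complete(). */
typedef struct QueuedReq {
    void *opaque;                            /* device-private state, a host pointer */
    void (*complete)(void *opaque, int ret); /* static function inside the device model */
} QueuedReq;

/* Both fields are host-virtual addresses of the primary process; after
 * failover they mean nothing in the secondary's address space, and a static
 * function cannot be looked up again.  Replaying the original pio/mmio
 * access lets the secondary's own device model rebuild this state instead. */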
>
> Thanks,
>
> Yoshi
>
> >
> >
> >> >
> >> >> >
> >> >> >> >
> >> >> >> >>
> >> >> >> >>> I'm wondering why
> >> >> >> >>> last_avail_idx is OK to send but not inuse.
> >> >> >> >>
> >> >> >> >> last_avail_idx is at some level a mistake, it exposes part of
> >> >> >> >> our internal implementation, but it does *also* express
> >> >> >> >> a guest observable state.
> >> >> >> >>
> >> >> >> >> Here's the problem that it solves: just looking at the rings in virtio
> >> >> >> >> there is no way to detect that a specific request has already been
> >> >> >> >> completed. And the protocol forbids completing the same request twice.
> >> >> >> >>
> >> >> >> >> Our implementation always starts processing the requests
> >> >> >> >> in order, and since we flush outstanding requests
> >> >> >> >> before save, it works to just tell the remote 'process only requests
> >> >> >> >> after this place'.
> >> >> >> >>
> >> >> >> >> But there's no such requirement in the virtio protocol,
> >> >> >> >> so to be really generic we could add a bitmask of valid avail
> >> >> >> >> ring entries that did not complete yet. This would be
> >> >> >> >> the exact representation of the guest observable state.
> >> >> >> >> In practice we have rings of up to 512 entries.
> >> >> >> >> That's 64 byte per ring, not a lot at all.
> >> >> >> >>
> >> >> >> >> However, if we ever do change the protocol to send the bitmask,
> >> >> >> >> we would need some code to resubmit requests
> >> >> >> >> out of order, so it's not trivial.
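To make the idea concrete, a rough sketch of such a bitmask in the save/load path (not an existing QEMU interface; qemu_put_buffer()/qemu_get_buffer() are the regular QEMUFile helpers, everything else is hypothetical):

#define VQ_MAX_SIZE 512                      /* rings of up to 512 entries */

typedef struct VirtQueueOutstanding {
    uint8_t bits[VQ_MAX_SIZE / 8];           /* bit i set: avail entry i popped but not yet completed */
} VirtQueueOutstanding;

static void vq_save_outstanding(QEMUFile *f, VirtQueueOutstanding *o, unsigned num)
{
    qemu_put_buffer(f, o->bits, (num + 7) / 8);   /* 64 bytes for a 512-entry ring */
}

static void vq_load_outstanding(QEMUFile *f, VirtQueueOutstanding *o, unsigned num)
{
    qemu_get_buffer(f, o->bits, (num + 7) / 8);
    /* The loader would then resubmit exactly the entries whose bit is set,
     * i.e. the out-of-order resubmission code noted above as non-trivial. */
}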
> >> >> >> >>
> >> >> >> >> Another minor mistake with last_avail_idx is that it has
> >> >> >> >> some redundancy: the high bits in the index
> >> >> >> >> (> vq size) are not necessary as they can be
> >> >> >> >> got from avail idx. There's a consistency check
> >> >> >> >> in load but we really should try to use formats
> >> >> >> >> that are always consistent.
> >> >> >> >>
> >> >> >> >>> The following patch does the same thing as original, yet
> >> >> >> >>> keeps the format of the virtio. It shouldn't break live
> >> >> >> >>> migration either because inuse should be 0.
> >> >> >> >>>
> >> >> >> >>> Yoshi
> >> >> >> >>
> >> >> >> >> Question is, can you flush to make inuse 0 in kemari too?
> >> >> >> >> And if not, how do you handle the fact that some requests
> >> >> >> >> are in flight on the primary?
> >> >> >> >
> >> >> >> > Although we try flushing requests one by one making inuse 0,
> >> >> >> > there are cases when it failovers to the secondary when inuse
> >> >> >> > isn't 0. We handle these in flight request on the primary by
> >> >> >> > replaying on the secondary.
> >> >> >> >
> >> >> >> >>
> >> >> >> >>> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> >> >>> index c8a0fc6..875c7ca 100644
> >> >> >> >>> --- a/hw/virtio.c
> >> >> >> >>> +++ b/hw/virtio.c
> >> >> >> >>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >> >> >> >>> qemu_put_be32(f, i);
> >> >> >> >>>
> >> >> >> >>> for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
> >> >> >> >>> + uint16_t last_avail_idx;
> >> >> >> >>> +
> >> >> >> >>> if (vdev->vq[i].vring.num == 0)
> >> >> >> >>> break;
> >> >> >> >>>
> >> >> >> >>> + last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
> >> >> >> >>> +
> >> >> >> >>> qemu_put_be32(f, vdev->vq[i].vring.num);
> >> >> >> >>> qemu_put_be64(f, vdev->vq[i].pa);
> >> >> >> >>> - qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >> >> >>> + qemu_put_be16s(f, &last_avail_idx);
> >> >> >> >>> if (vdev->binding->save_queue)
> >> >> >> >>> vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >> >> >> >>> }
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>
> >> >> >> >> This looks wrong to me. Requests can complete in any order, can they
> >> >> >> >> not? So if request 0 did not complete and request 1 did not,
> >> >> >> >> you send avail - inuse and on the secondary you will process and
> >> >> >> >> complete request 1 the second time, crashing the guest.
> >> >> >> >
> >> >> >> > In case of Kemari, no. We sit between devices and net/block, and
> >> >> >> > queue the requests. After completing each transaction, we flush
> >> >> >> > the requests one by one. So there won't be completion inversion,
> >> >> >> > and therefore won't be visible to the guest.
> >> >> >> >
> >> >> >> > Yoshi
> >> >> >> >
> >> >> >> >>
> >> >> >> >>>
> >> >> >> >>> >
> >> >> >> >>> >> >
> >> >> >> >>> >> >> ---
> >> >> >> >>> >> >> hw/virtio.c | 8 +++++++-
> >> >> >> >>> >> >> 1 files changed, 7 insertions(+), 1 deletions(-)
> >> >> >> >>> >> >>
> >> >> >> >>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> >> >>> >> >> index 849a60f..5509644 100644
> >> >> >> >>> >> >> --- a/hw/virtio.c
> >> >> >> >>> >> >> +++ b/hw/virtio.c
> >> >> >> >>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
> >> >> >> >>> >> >> VRing vring;
> >> >> >> >>> >> >> target_phys_addr_t pa;
> >> >> >> >>> >> >> uint16_t last_avail_idx;
> >> >> >> >>> >> >> - int inuse;
> >> >> >> >>> >> >> + uint16_t inuse;
> >> >> >> >>> >> >> uint16_t vector;
> >> >> >> >>> >> >> void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
> >> >> >> >>> >> >> VirtIODevice *vdev;
> >> >> >> >>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >> >> >> >>> >> >> qemu_put_be32(f, vdev->vq[i].vring.num);
> >> >> >> >>> >> >> qemu_put_be64(f, vdev->vq[i].pa);
> >> >> >> >>> >> >> qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >> >> >>> >> >> + qemu_put_be16s(f, &vdev->vq[i].inuse);
> >> >> >> >>> >> >> if (vdev->binding->save_queue)
> >> >> >> >>> >> >> vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >> >> >> >>> >> >> }
> >> >> >> >>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
> >> >> >> >>> >> >> vdev->vq[i].vring.num = qemu_get_be32(f);
> >> >> >> >>> >> >> vdev->vq[i].pa = qemu_get_be64(f);
> >> >> >> >>> >> >> qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >> >> >>> >> >> + qemu_get_be16s(f, &vdev->vq[i].inuse);
> >> >> >> >>> >> >> +
> >> >> >> >>> >> >> + /* revert last_avail_idx if there are outstanding emulation. */
> >> >> >> >>> >> >> + vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
> >> >> >> >>> >> >> + vdev->vq[i].inuse = 0;
> >> >> >> >>> >> >>
> >> >> >> >>> >> >> if (vdev->vq[i].pa) {
> >> >> >> >>> >> >> virtqueue_init(&vdev->vq[i]);
> >> >> >> >>> >> >> --
> >> >> >> >>> >> >> 1.7.1.2
> >> >> >> >>> >> >>
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse variable.
2010-12-24 11:42 ` Yoshiaki Tamura
2010-12-24 13:21 ` Michael S. Tsirkin
@ 2010-12-26 9:05 ` Michael S. Tsirkin
2010-12-26 10:14 ` Yoshiaki Tamura
2010-12-26 10:49 ` Michael S. Tsirkin
2 siblings, 1 reply; 112+ messages in thread
From: Michael S. Tsirkin @ 2010-12-26 9:05 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, Marcelo Tosatti,
qemu-devel, vatsa, avi, psuriset, stefanha
On Fri, Dec 24, 2010 at 08:42:19PM +0900, Yoshiaki Tamura wrote:
> >> If qemu_aio_flush() is responsible for flushing the outstanding
> >> virtio-net requests, I'm wondering why it's a problem for Kemari.
> >> As I described in the previous message, Kemari queues the
> >> requests first. So in you example above, it should start with
> >>
> >> virtio-net: last_avai_idx 0 inuse 2
> >> event-tap: {A,B}
> >>
> >> As you know, the requests are still in order still because net
> >> layer initiates in order. Not about completing.
> >>
> >> In the first synchronization, the status above is transferred. In
> >> the next synchronization, the status will be as following.
> >>
> >> virtio-net: last_avai_idx 1 inuse 1
> >> event-tap: {B}
> >
> > OK, this answers the ordering question.
>
> Glad to hear that!
>
> > Another question: at this point we transfer this status: both
> > event-tap and virtio ring have the command B,
> > so the remote will have:
> >
> > virtio-net: inuse 0
> > event-tap: {B}
> >
> > Is this right? This already seems to be a problem as when B completes
> > inuse will go negative?
>
> I think state above is wrong. inuse 0 means there shouldn't be
> any requests in event-tap. Note that the callback is called only
> when event-tap flushes the requests.
>
> > Next it seems that the remote virtio will resubmit B to event-tap. The
> > remote will then have:
> >
> > virtio-net: inuse 1
> > event-tap: {B, B}
> >
> > This looks kind of wrong ... will two packets go out?
>
> No. Currently, we're just replaying the requests with pio/mmio.
> In the situation above, it should be,
>
> virtio-net: inuse 1
> event-tap: {B}
> >> Why? Because Kemari flushes the first virtio-net request using
> >> qemu_aio_flush() before each synchronization. If
> >> qemu_aio_flush() doesn't guarantee the order, what you pointed
> >> should be problematic. So in the final synchronization, the
> >> state should be,
> >>
> >> virtio-net: last_avai_idx 2 inuse 0
> >> event-tap: {}
> >>
> >> where A,B were completed in order.
> >>
> >> Yoshi
> >
> >
> > It might be better to discuss block because that's where
> > requests can complete out of order.
>
> It's same as net. We queue requests and call bdrv_flush per
> sending requests to the block. So there shouldn't be any
> inversion.
>
> > So let me see if I understand:
> > - each command passed to event tap is queued by it,
> > it is not passed directly to the backend
> > - later requests are passed to the backend,
> > always in the same order that they were submitted
> > - each synchronization point flushes all requests
> > passed to the backend so far
> > - each synchronization transfers all requests not passed to the backend,
> > to the remote, and they are replayed there
>
> Correct.
>
> > Now to analyse this for correctness I am looking at the original patch
> > because it is smaller so easier to analyse and I think it is
> > functionally equivalent, correct me if I am wrong in this.
>
> So you think decreasing last_avail_idx upon save is better than
> updating it in the callback?
>
> > So the reason there's no out of order issue is this
> > (and might be a good thing to put in commit log
> > or a comment somewhere):
>
> I've done some in the latest patch. Please point it out if it
> wasn't enough.
>
> > At point of save callback event tap has flushed commands
> > passed to the backend already. Thus at the point of
> > the save callback if a command has completed
> > all previous commands have been flushed and completed.
> >
> >
> > Therefore inuse is
> > in fact the # of requests passed to event tap but not yet
> > passed to the backend (for non-event tap case all commands are
> > passed to the backend immediately and because of this
> > inuse is 0) and these are the last inuse commands submitted.
> >
> >
> > Right?
>
> Yep.
>
> > Now a question:
> >
> > When we pass last_used_index - inuse to the remote,
> > the remote virtio will resubmit the request.
> > Since request is also passed by event tap, we get
> > the request twice, why is this not a problem?
>
> It's not a problem because event-tap currently replays with
> pio/mmio only, as I mentioned above. Although event-tap receives
> information about the queued requests, it won't pass it to the
> backend. The reason is the problem in setting the callbacks
> which are specific to devices on the secondary. These are
> pointers, and even worse, are usually static functions, which
> event-tap has no way to restore it upon failover. I do want to
> change event-tap replay to be this way in the future, pio/mmio
> replay is implemented for now.
>
> Thanks,
>
> Yoshi
>
Then I am still confused, sorry. inuse != 0 means that some requests
were passed to the backend but did not complete. I think that if you do
a flush, it waits until all requests passed to the backend have
completed. Why doesn't this guarantee inuse = 0 on the origin at the
synchronization point?
--
MST
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse variable.
2010-12-26 9:05 ` Michael S. Tsirkin
@ 2010-12-26 10:14 ` Yoshiaki Tamura
2010-12-26 10:46 ` Michael S. Tsirkin
0 siblings, 1 reply; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-12-26 10:14 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: aliguori, dlaor, ananth, kvm, Marcelo Tosatti, ohmura.kei,
qemu-devel, avi, vatsa, psuriset, stefanha
2010/12/26 Michael S. Tsirkin <mst@redhat.com>:
> On Fri, Dec 24, 2010 at 08:42:19PM +0900, Yoshiaki Tamura wrote:
>> >> If qemu_aio_flush() is responsible for flushing the outstanding
>> >> virtio-net requests, I'm wondering why it's a problem for Kemari.
>> >> As I described in the previous message, Kemari queues the
>> >> requests first. So in you example above, it should start with
>> >>
>> >> virtio-net: last_avai_idx 0 inuse 2
>> >> event-tap: {A,B}
>> >>
>> >> As you know, the requests are still in order still because net
>> >> layer initiates in order. Not about completing.
>> >>
>> >> In the first synchronization, the status above is transferred. In
>> >> the next synchronization, the status will be as following.
>> >>
>> >> virtio-net: last_avai_idx 1 inuse 1
>> >> event-tap: {B}
>> >
>> > OK, this answers the ordering question.
>>
>> Glad to hear that!
>>
>> > Another question: at this point we transfer this status: both
>> > event-tap and virtio ring have the command B,
>> > so the remote will have:
>> >
>> > virtio-net: inuse 0
>> > event-tap: {B}
>> >
>> > Is this right? This already seems to be a problem as when B completes
>> > inuse will go negative?
>>
>> I think state above is wrong. inuse 0 means there shouldn't be
>> any requests in event-tap. Note that the callback is called only
>> when event-tap flushes the requests.
>>
>> > Next it seems that the remote virtio will resubmit B to event-tap. The
>> > remote will then have:
>> >
>> > virtio-net: inuse 1
>> > event-tap: {B, B}
>> >
>> > This looks kind of wrong ... will two packets go out?
>>
>> No. Currently, we're just replaying the requests with pio/mmio.
>> In the situation above, it should be,
>>
>> virtio-net: inuse 1
>> event-tap: {B}
>> >> Why? Because Kemari flushes the first virtio-net request using
>> >> qemu_aio_flush() before each synchronization. If
>> >> qemu_aio_flush() doesn't guarantee the order, what you pointed
>> >> should be problematic. So in the final synchronization, the
>> >> state should be,
>> >>
>> >> virtio-net: last_avai_idx 2 inuse 0
>> >> event-tap: {}
>> >>
>> >> where A,B were completed in order.
>> >>
>> >> Yoshi
>> >
>> >
>> > It might be better to discuss block because that's where
>> > requests can complete out of order.
>>
>> It's same as net. We queue requests and call bdrv_flush per
>> sending requests to the block. So there shouldn't be any
>> inversion.
>>
>> > So let me see if I understand:
>> > - each command passed to event tap is queued by it,
>> > it is not passed directly to the backend
>> > - later requests are passed to the backend,
>> > always in the same order that they were submitted
>> > - each synchronization point flushes all requests
>> > passed to the backend so far
>> > - each synchronization transfers all requests not passed to the backend,
>> > to the remote, and they are replayed there
>>
>> Correct.
>>
>> > Now to analyse this for correctness I am looking at the original patch
>> > because it is smaller so easier to analyse and I think it is
>> > functionally equivalent, correct me if I am wrong in this.
>>
>> So you think decreasing last_avail_idx upon save is better than
>> updating it in the callback?
>>
>> > So the reason there's no out of order issue is this
>> > (and might be a good thing to put in commit log
>> > or a comment somewhere):
>>
>> I've done some in the latest patch. Please point it out if it
>> wasn't enough.
>>
>> > At point of save callback event tap has flushed commands
>> > passed to the backend already. Thus at the point of
>> > the save callback if a command has completed
>> > all previous commands have been flushed and completed.
>> >
>> >
>> > Therefore inuse is
>> > in fact the # of requests passed to event tap but not yet
>> > passed to the backend (for non-event tap case all commands are
>> > passed to the backend immediately and because of this
>> > inuse is 0) and these are the last inuse commands submitted.
>> >
>> >
>> > Right?
>>
>> Yep.
>>
>> > Now a question:
>> >
>> > When we pass last_used_index - inuse to the remote,
>> > the remote virtio will resubmit the request.
>> > Since request is also passed by event tap, we get
>> > the request twice, why is this not a problem?
>>
>> It's not a problem because event-tap currently replays with
>> pio/mmio only, as I mentioned above. Although event-tap receives
>> information about the queued requests, it won't pass it to the
>> backend. The reason is the problem in setting the callbacks
>> which are specific to devices on the secondary. These are
>> pointers, and even worse, are usually static functions, which
>> event-tap has no way to restore it upon failover. I do want to
>> change event-tap replay to be this way in the future, pio/mmio
>> replay is implemented for now.
>>
>> Thanks,
>>
>> Yoshi
>>
>
> Then I am still confused, sorry. inuse != 0 means that some requests
> were passed to the backend but did not complete. I think that if you do
> a flush, this waits until all requests passed to the backend will
> complete. Why does not this guarantee inuse = 0 on the origin at the
> synchronization point?
The synchronization is done before event-tap releases requests to
the backend, so there are two types of flush: the event-tap flush and
the backend block/net flush. I assume you're confused by the fact that
flushing the backend with qemu_aio_flush/bdrv_flush doesn't necessarily
decrease inuse while event-tap still has queued requests, because those
requests have not been passed to the backend yet. Let me do a case study
again.
virtio: inuse 4
event-tap: {A,B,C}
backend: {D}

Synchronization starts; the backend gets flushed.

virtio: inuse 3
event-tap: {A,B,C}
backend: {}

Synchronization gets done (the secondary now also records virtio inuse 3).

event-tap then flushes one request.

virtio: inuse 2
event-tap: {B,C}
backend: {}

Repeating the above, the final state is:

virtio: inuse 0
event-tap: {}
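A rough C sketch of the per-transaction sequence walked through above (qemu_aio_flush() is the real QEMU call; the other helpers are hypothetical names for the event-tap/Kemari steps):

/* Run once per transaction; repeating it drains inuse to 0 exactly as in
 * the case study above. */
static void kemari_transaction(void)
{
    qemu_aio_flush();                  /* drain requests already passed to the backend */
    transaction_send_and_wait_ack();   /* synchronization point with the secondary */
    event_tap_release_one();           /* hand the oldest queued request to the backend */
}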
Hope this helps.
Yoshi
>
> --
> MST
>
>
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse variable.
2010-12-26 10:14 ` Yoshiaki Tamura
@ 2010-12-26 10:46 ` Michael S. Tsirkin
2010-12-26 10:50 ` Yoshiaki Tamura
0 siblings, 1 reply; 112+ messages in thread
From: Michael S. Tsirkin @ 2010-12-26 10:46 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: aliguori, dlaor, ananth, kvm, Marcelo Tosatti, ohmura.kei,
qemu-devel, avi, vatsa, psuriset, stefanha
On Sun, Dec 26, 2010 at 07:14:44PM +0900, Yoshiaki Tamura wrote:
> 2010/12/26 Michael S. Tsirkin <mst@redhat.com>:
> > On Fri, Dec 24, 2010 at 08:42:19PM +0900, Yoshiaki Tamura wrote:
> >> >> If qemu_aio_flush() is responsible for flushing the outstanding
> >> >> virtio-net requests, I'm wondering why it's a problem for Kemari.
> >> >> As I described in the previous message, Kemari queues the
> >> >> requests first. So in you example above, it should start with
> >> >>
> >> >> virtio-net: last_avai_idx 0 inuse 2
> >> >> event-tap: {A,B}
> >> >>
> >> >> As you know, the requests are still in order still because net
> >> >> layer initiates in order. Not about completing.
> >> >>
> >> >> In the first synchronization, the status above is transferred. In
> >> >> the next synchronization, the status will be as following.
> >> >>
> >> >> virtio-net: last_avai_idx 1 inuse 1
> >> >> event-tap: {B}
> >> >
> >> > OK, this answers the ordering question.
> >>
> >> Glad to hear that!
> >>
> >> > Another question: at this point we transfer this status: both
> >> > event-tap and virtio ring have the command B,
> >> > so the remote will have:
> >> >
> >> > virtio-net: inuse 0
> >> > event-tap: {B}
> >> >
> >> > Is this right? This already seems to be a problem as when B completes
> >> > inuse will go negative?
> >>
> >> I think state above is wrong. inuse 0 means there shouldn't be
> >> any requests in event-tap. Note that the callback is called only
> >> when event-tap flushes the requests.
> >>
> >> > Next it seems that the remote virtio will resubmit B to event-tap. The
> >> > remote will then have:
> >> >
> >> > virtio-net: inuse 1
> >> > event-tap: {B, B}
> >> >
> >> > This looks kind of wrong ... will two packets go out?
> >>
> >> No. Currently, we're just replaying the requests with pio/mmio.
> >> In the situation above, it should be,
> >>
> >> virtio-net: inuse 1
> >> event-tap: {B}
> >> >> Why? Because Kemari flushes the first virtio-net request using
> >> >> qemu_aio_flush() before each synchronization. If
> >> >> qemu_aio_flush() doesn't guarantee the order, what you pointed
> >> >> should be problematic. So in the final synchronization, the
> >> >> state should be,
> >> >>
> >> >> virtio-net: last_avai_idx 2 inuse 0
> >> >> event-tap: {}
> >> >>
> >> >> where A,B were completed in order.
> >> >>
> >> >> Yoshi
> >> >
> >> >
> >> > It might be better to discuss block because that's where
> >> > requests can complete out of order.
> >>
> >> It's same as net. We queue requests and call bdrv_flush per
> >> sending requests to the block. So there shouldn't be any
> >> inversion.
> >>
> >> > So let me see if I understand:
> >> > - each command passed to event tap is queued by it,
> >> > it is not passed directly to the backend
> >> > - later requests are passed to the backend,
> >> > always in the same order that they were submitted
> >> > - each synchronization point flushes all requests
> >> > passed to the backend so far
> >> > - each synchronization transfers all requests not passed to the backend,
> >> > to the remote, and they are replayed there
> >>
> >> Correct.
> >>
> >> > Now to analyse this for correctness I am looking at the original patch
> >> > because it is smaller so easier to analyse and I think it is
> >> > functionally equivalent, correct me if I am wrong in this.
> >>
> >> So you think decreasing last_avail_idx upon save is better than
> >> updating it in the callback?
> >>
> >> > So the reason there's no out of order issue is this
> >> > (and might be a good thing to put in commit log
> >> > or a comment somewhere):
> >>
> >> I've done some in the latest patch. Please point it out if it
> >> wasn't enough.
> >>
> >> > At point of save callback event tap has flushed commands
> >> > passed to the backend already. Thus at the point of
> >> > the save callback if a command has completed
> >> > all previous commands have been flushed and completed.
> >> >
> >> >
> >> > Therefore inuse is
> >> > in fact the # of requests passed to event tap but not yet
> >> > passed to the backend (for non-event tap case all commands are
> >> > passed to the backend immediately and because of this
> >> > inuse is 0) and these are the last inuse commands submitted.
> >> >
> >> >
> >> > Right?
> >>
> >> Yep.
> >>
> >> > Now a question:
> >> >
> >> > When we pass last_used_index - inuse to the remote,
> >> > the remote virtio will resubmit the request.
> >> > Since request is also passed by event tap, we get
> >> > the request twice, why is this not a problem?
> >>
> >> It's not a problem because event-tap currently replays with
> >> pio/mmio only, as I mentioned above. Although event-tap receives
> >> information about the queued requests, it won't pass it to the
> >> backend. The reason is the problem in setting the callbacks
> >> which are specific to devices on the secondary. These are
> >> pointers, and even worse, are usually static functions, which
> >> event-tap has no way to restore it upon failover. I do want to
> >> change event-tap replay to be this way in the future, pio/mmio
> >> replay is implemented for now.
> >>
> >> Thanks,
> >>
> >> Yoshi
> >>
> >
> > Then I am still confused, sorry. inuse != 0 means that some requests
> > were passed to the backend but did not complete. I think that if you do
> > a flush, this waits until all requests passed to the backend will
> > complete. Why does not this guarantee inuse = 0 on the origin at the
> > synchronization point?
>
> The synchronization is done before event-tap releases requests to
> the backend, so there are two types of flush: event-tap and
> backend block/net. I assume you're confused with the fact that
> flushing backend with qemu_aio_flush/bdrv_flush doesn't necessary
> decrease inuse if event-tap has queued requests because there are
> no requests passed to the backend. Let me do a case study again.
>
> virtio: inuse 4
> event-tap: {A,B,C}
> backend: {D}
>
There are two event-tap devices, right?
The PIO one is above virtio, and the AIO one is between virtio and the
backend (e.g. bdrv)?  Which one is meant here?
> synchronization starts. backend gets flushed.
>
> virtio: inuse 3
> event-tap: {A,B,C}
> backend: {}
> synchronization gets done.
> # secondary is virtio: inuse 3
>
> event-tap flushes one request.
>
> virtio: inuse 2
> event-tap: {B,C}
> backend: {}
> repeats above and finally it should be,
>
> virtio: inuse 0
> event-tap: {}
>
> Hope this helps.
>
> Yoshi
>
> >
> > --
> > MST
> >
> >
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse variable.
2010-12-24 11:42 ` Yoshiaki Tamura
2010-12-24 13:21 ` Michael S. Tsirkin
2010-12-26 9:05 ` Michael S. Tsirkin
@ 2010-12-26 10:49 ` Michael S. Tsirkin
2010-12-26 10:57 ` Yoshiaki Tamura
2 siblings, 1 reply; 112+ messages in thread
From: Michael S. Tsirkin @ 2010-12-26 10:49 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, Marcelo Tosatti,
qemu-devel, vatsa, avi, psuriset, stefanha
On Fri, Dec 24, 2010 at 08:42:19PM +0900, Yoshiaki Tamura wrote:
> 2010/12/24 Michael S. Tsirkin <mst@redhat.com>:
> > On Fri, Dec 17, 2010 at 12:59:58AM +0900, Yoshiaki Tamura wrote:
> >> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
> >> > On Thu, Dec 16, 2010 at 11:28:46PM +0900, Yoshiaki Tamura wrote:
> >> >> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
> >> >> > On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
> >> >> >> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
> >> >> >> > 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
> >> >> >> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
> >> >> >> >>> >> >>
> >> >> >> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> >> >>> >> >
> >> >> >> >>> >> > This changes migration format, so it will break compatibility with
> >> >> >> >>> >> > existing drivers. More generally, I think migrating internal
> >> >> >> >>> >> > state that is not guest visible is always a mistake
> >> >> >> >>> >> > as it ties migration format to an internal implementation
> >> >> >> >>> >> > (yes, I know we do this sometimes, but we should at least
> >> >> >> >>> >> > try not to add such cases). I think the right thing to do in this case
> >> >> >> >>> >> > is to flush outstanding
> >> >> >> >>> >> > work when vm is stopped. Then, we are guaranteed that inuse is 0.
> >> >> >> >>> >> > I sent patches that do this for virtio net and block.
> >> >> >> >>> >>
> >> >> >> >>> >> Could you give me the link of your patches? I'd like to test
> >> >> >> >>> >> whether they work with Kemari upon failover. If they do, I'm
> >> >> >> >>> >> happy to drop this patch.
> >> >> >> >>> >>
> >> >> >> >>> >> Yoshi
> >> >> >> >>> >
> >> >> >> >>> > Look for this:
> >> >> >> >>> > stable migration image on a stopped vm
> >> >> >> >>> > sent on:
> >> >> >> >>> > Wed, 24 Nov 2010 17:52:49 +0200
> >> >> >> >>>
> >> >> >> >>> Thanks for the info.
> >> >> >> >>>
> >> >> >> >>> However, The patch series above didn't solve the issue. In
> >> >> >> >>> case of Kemari, inuse is mostly > 0 because it queues the
> >> >> >> >>> output, and while last_avail_idx gets incremented
> >> >> >> >>> immediately, not sending inuse makes the state inconsistent
> >> >> >> >>> between Primary and Secondary.
> >> >> >> >>
> >> >> >> >> Hmm. Can we simply avoid incrementing last_avail_idx?
> >> >> >> >
> >> >> >> > I think we can calculate or prepare an internal last_avail_idx,
> >> >> >> > and update the external when inuse is decremented. I'll try
> >> >> >> > whether it work w/ w/o Kemari.
> >> >> >>
> >> >> >> Hi Michael,
> >> >> >>
> >> >> >> Could you please take a look at the following patch?
> >> >> >
> >> >> > Which version is this against?
> >> >>
> >> >> Oops. It should be very old.
> >> >> 67f895bfe69f323b427b284430b6219c8a62e8d4
> >> >>
> >> >> >> commit 36ee7910059e6b236fe9467a609f5b4aed866912
> >> >> >> Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> >> Date: Thu Dec 16 14:50:54 2010 +0900
> >> >> >>
> >> >> >> virtio: update last_avail_idx when inuse is decreased.
> >> >> >>
> >> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> >
> >> >> > It would be better to have a commit description explaining why a change
> >> >> > is made, and why it is correct, not just repeating what can be seen from
> >> >> > the diff anyway.
> >> >>
> >> >> Sorry for being lazy here.
> >> >>
> >> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> >> index c8a0fc6..6688c02 100644
> >> >> >> --- a/hw/virtio.c
> >> >> >> +++ b/hw/virtio.c
> >> >> >> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
> >> >> >> wmb();
> >> >> >> trace_virtqueue_flush(vq, count);
> >> >> >> vring_used_idx_increment(vq, count);
> >> >> >> + vq->last_avail_idx += count;
> >> >> >> vq->inuse -= count;
> >> >> >> }
> >> >> >>
> >> >> >> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >> >> >> unsigned int i, head, max;
> >> >> >> target_phys_addr_t desc_pa = vq->vring.desc;
> >> >> >>
> >> >> >> - if (!virtqueue_num_heads(vq, vq->last_avail_idx))
> >> >> >> + if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
> >> >> >> return 0;
> >> >> >>
> >> >> >> /* When we start there are none of either input nor output. */
> >> >> >> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >> >> >>
> >> >> >> max = vq->vring.num;
> >> >> >>
> >> >> >> - i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
> >> >> >> + i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
> >> >> >>
> >> >> >> if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
> >> >> >> if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
> >> >> >>
> >> >> >
> >> >> > Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
> >> >>
> >> >> I think there are two problems.
> >> >>
> >> >> 1. When to update last_avail_idx.
> >> >> 2. The ordering issue you're mentioning below.
> >> >>
> >> >> The patch above is only trying to address 1 because last time you
> >> >> mentioned that modifying last_avail_idx upon save may break the
> >> >> guest, which I agree. If virtio_queue_empty and
> >> >> virtqueue_avail_bytes are only used internally, meaning invisible
> >> >> to the guest, I guess the approach above can be applied too.
> >> >
> >> > So IMHO 2 is the real issue. This is what was problematic
> >> > with the save patch, otherwise of course changes in save
> >> > are better than changes all over the codebase.
> >>
> >> All right. Then let's focus on 2 first.
> >>
> >> >> > Previous patch version sure looked simpler, and this seems functionally
> >> >> > equivalent, so my question still stands: here it is rephrased in a
> >> >> > different way:
> >> >> >
> >> >> > assume that we have in avail ring 2 requests at start of ring: A and B in this order
> >> >> >
> >> >> > host pops A, then B, then completes B and flushes
> >> >> >
> >> >> > now with this patch last_avail_idx will be 1, and then
> >> >> > remote will get it, it will execute B again. As a result
> >> >> > B will complete twice, and apparently A will never complete.
> >> >> >
> >> >> >
> >> >> > This is what I was saying below: assuming that there are
> >> >> > outstanding requests when we migrate, there is no way
> >> >> > a single index can be enough to figure out which requests
> >> >> > need to be handled and which are in flight already.
> >> >> >
> >> >> > We must add some kind of bitmask to tell us which is which.
> >> >>
> >> >> I should understand why this inversion can happen before solving
> >> >> the issue.
> >> >
> >> > It's a fundamental thing in virtio.
> >> > I think it is currently only likely to happen with block, I think tap
> >> > currently completes things in order. In any case relying on this in the
> >> > frontend is a mistake.
> >> >
> >> >> Currently, how are you making virio-net to flush
> >> >> every requests for live migration? Is it qemu_aio_flush()?
> >> >
> >> > Think so.
> >>
> >> If qemu_aio_flush() is responsible for flushing the outstanding
> >> virtio-net requests, I'm wondering why it's a problem for Kemari.
> >> As I described in the previous message, Kemari queues the
> >> requests first. So in you example above, it should start with
> >>
> >> virtio-net: last_avai_idx 0 inuse 2
> >> event-tap: {A,B}
> >>
> >> As you know, the requests are still in order still because net
> >> layer initiates in order. Not about completing.
> >>
> >> In the first synchronization, the status above is transferred. In
> >> the next synchronization, the status will be as following.
> >>
> >> virtio-net: last_avai_idx 1 inuse 1
> >> event-tap: {B}
> >
> > OK, this answers the ordering question.
>
> Glad to hear that!
>
> > Another question: at this point we transfer this status: both
> > event-tap and virtio ring have the command B,
> > so the remote will have:
> >
> > virtio-net: inuse 0
> > event-tap: {B}
> >
> > Is this right? This already seems to be a problem as when B completes
> > inuse will go negative?
>
> I think state above is wrong. inuse 0 means there shouldn't be
> any requests in event-tap. Note that the callback is called only
> when event-tap flushes the requests.
>
> > Next it seems that the remote virtio will resubmit B to event-tap. The
> > remote will then have:
> >
> > virtio-net: inuse 1
> > event-tap: {B, B}
> >
> > This looks kind of wrong ... will two packets go out?
>
> No. Currently, we're just replaying the requests with pio/mmio.
You do? What purpose do the hooks in bdrv/net serve then?
A placeholder for the future?
> In the situation above, it should be,
>
> virtio-net: inuse 1
> event-tap: {B}
>
> >> Why? Because Kemari flushes the first virtio-net request using
> >> qemu_aio_flush() before each synchronization. If
> >> qemu_aio_flush() doesn't guarantee the order, what you pointed
> >> should be problematic. So in the final synchronization, the
> >> state should be,
> >>
> >> virtio-net: last_avai_idx 2 inuse 0
> >> event-tap: {}
> >>
> >> where A,B were completed in order.
> >>
> >> Yoshi
> >
> >
> > It might be better to discuss block because that's where
> > requests can complete out of order.
>
> It's same as net. We queue requests and call bdrv_flush per
> sending requests to the block. So there shouldn't be any
> inversion.
>
> > So let me see if I understand:
> > - each command passed to event tap is queued by it,
> > it is not passed directly to the backend
> > - later requests are passed to the backend,
> > always in the same order that they were submitted
> > - each synchronization point flushes all requests
> > passed to the backend so far
> > - each synchronization transfers all requests not passed to the backend,
> > to the remote, and they are replayed there
>
> Correct.
>
> > Now to analyse this for correctness I am looking at the original patch
> > because it is smaller so easier to analyse and I think it is
> > functionally equivalent, correct me if I am wrong in this.
>
> So you think decreasing last_avail_idx upon save is better than
> updating it in the callback?
>
> > So the reason there's no out of order issue is this
> > (and might be a good thing to put in commit log
> > or a comment somewhere):
>
> I've done some in the latest patch. Please point it out if it
> wasn't enough.
>
> > At point of save callback event tap has flushed commands
> > passed to the backend already. Thus at the point of
> > the save callback if a command has completed
> > all previous commands have been flushed and completed.
> >
> >
> > Therefore inuse is
> > in fact the # of requests passed to event tap but not yet
> > passed to the backend (for non-event tap case all commands are
> > passed to the backend immediately and because of this
> > inuse is 0) and these are the last inuse commands submitted.
> >
> >
> > Right?
>
> Yep.
>
> > Now a question:
> >
> > When we pass last_used_index - inuse to the remote,
> > the remote virtio will resubmit the request.
> > Since request is also passed by event tap, we get
> > the request twice, why is this not a problem?
>
> It's not a problem because event-tap currently replays with
> pio/mmio only, as I mentioned above. Although event-tap receives
> information about the queued requests, it won't pass it to the
> backend. The reason is the problem in setting the callbacks
> which are specific to devices on the secondary. These are
> pointers, and even worse, are usually static functions, which
> event-tap has no way to restore it upon failover. I do want to
> change event-tap replay to be this way in the future, pio/mmio
> replay is implemented for now.
>
> Thanks,
>
> Yoshi
>
> >
> >
> >> >
> >> >> >
> >> >> >> >
> >> >> >> >>
> >> >> >> >>> I'm wondering why
> >> >> >> >>> last_avail_idx is OK to send but not inuse.
> >> >> >> >>
> >> >> >> >> last_avail_idx is at some level a mistake, it exposes part of
> >> >> >> >> our internal implementation, but it does *also* express
> >> >> >> >> a guest observable state.
> >> >> >> >>
> >> >> >> >> Here's the problem that it solves: just looking at the rings in virtio
> >> >> >> >> there is no way to detect that a specific request has already been
> >> >> >> >> completed. And the protocol forbids completing the same request twice.
> >> >> >> >>
> >> >> >> >> Our implementation always starts processing the requests
> >> >> >> >> in order, and since we flush outstanding requests
> >> >> >> >> before save, it works to just tell the remote 'process only requests
> >> >> >> >> after this place'.
> >> >> >> >>
> >> >> >> >> But there's no such requirement in the virtio protocol,
> >> >> >> >> so to be really generic we could add a bitmask of valid avail
> >> >> >> >> ring entries that did not complete yet. This would be
> >> >> >> >> the exact representation of the guest observable state.
> >> >> >> >> In practice we have rings of up to 512 entries.
> >> >> >> >> That's 64 byte per ring, not a lot at all.
> >> >> >> >>
> >> >> >> >> However, if we ever do change the protocol to send the bitmask,
> >> >> >> >> we would need some code to resubmit requests
> >> >> >> >> out of order, so it's not trivial.
> >> >> >> >>
> >> >> >> >> Another minor mistake with last_avail_idx is that it has
> >> >> >> >> some redundancy: the high bits in the index
> >> >> >> >> (> vq size) are not necessary as they can be
> >> >> >> >> got from avail idx. There's a consistency check
> >> >> >> >> in load but we really should try to use formats
> >> >> >> >> that are always consistent.
> >> >> >> >>
> >> >> >> >>> The following patch does the same thing as original, yet
> >> >> >> >>> keeps the format of the virtio. It shouldn't break live
> >> >> >> >>> migration either because inuse should be 0.
> >> >> >> >>>
> >> >> >> >>> Yoshi
> >> >> >> >>
> >> >> >> >> Question is, can you flush to make inuse 0 in kemari too?
> >> >> >> >> And if not, how do you handle the fact that some requests
> >> >> >> >> are in flight on the primary?
> >> >> >> >
> >> >> >> > Although we try flushing requests one by one making inuse 0,
> >> >> >> > there are cases when it failovers to the secondary when inuse
> >> >> >> > isn't 0. We handle these in flight request on the primary by
> >> >> >> > replaying on the secondary.
> >> >> >> >
> >> >> >> >>
> >> >> >> >>> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> >> >>> index c8a0fc6..875c7ca 100644
> >> >> >> >>> --- a/hw/virtio.c
> >> >> >> >>> +++ b/hw/virtio.c
> >> >> >> >>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >> >> >> >>> qemu_put_be32(f, i);
> >> >> >> >>>
> >> >> >> >>> for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
> >> >> >> >>> + uint16_t last_avail_idx;
> >> >> >> >>> +
> >> >> >> >>> if (vdev->vq[i].vring.num == 0)
> >> >> >> >>> break;
> >> >> >> >>>
> >> >> >> >>> + last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
> >> >> >> >>> +
> >> >> >> >>> qemu_put_be32(f, vdev->vq[i].vring.num);
> >> >> >> >>> qemu_put_be64(f, vdev->vq[i].pa);
> >> >> >> >>> - qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >> >> >>> + qemu_put_be16s(f, &last_avail_idx);
> >> >> >> >>> if (vdev->binding->save_queue)
> >> >> >> >>> vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >> >> >> >>> }
> >> >> >> >>>
> >> >> >> >>>
> >> >> >> >>
> >> >> >> >> This looks wrong to me. Requests can complete in any order, can they
> >> >> >> >> not? So if request 0 did not complete and request 1 did not,
> >> >> >> >> you send avail - inuse and on the secondary you will process and
> >> >> >> >> complete request 1 the second time, crashing the guest.
> >> >> >> >
> >> >> >> > In case of Kemari, no. We sit between devices and net/block, and
> >> >> >> > queue the requests. After completing each transaction, we flush
> >> >> >> > the requests one by one. So there won't be completion inversion,
> >> >> >> > and therefore won't be visible to the guest.
> >> >> >> >
> >> >> >> > Yoshi
> >> >> >> >
> >> >> >> >>
> >> >> >> >>>
> >> >> >> >>> >
> >> >> >> >>> >> >
> >> >> >> >>> >> >> ---
> >> >> >> >>> >> >> hw/virtio.c | 8 +++++++-
> >> >> >> >>> >> >> 1 files changed, 7 insertions(+), 1 deletions(-)
> >> >> >> >>> >> >>
> >> >> >> >>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> >> >>> >> >> index 849a60f..5509644 100644
> >> >> >> >>> >> >> --- a/hw/virtio.c
> >> >> >> >>> >> >> +++ b/hw/virtio.c
> >> >> >> >>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
> >> >> >> >>> >> >> VRing vring;
> >> >> >> >>> >> >> target_phys_addr_t pa;
> >> >> >> >>> >> >> uint16_t last_avail_idx;
> >> >> >> >>> >> >> - int inuse;
> >> >> >> >>> >> >> + uint16_t inuse;
> >> >> >> >>> >> >> uint16_t vector;
> >> >> >> >>> >> >> void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
> >> >> >> >>> >> >> VirtIODevice *vdev;
> >> >> >> >>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> >> >> >> >>> >> >> qemu_put_be32(f, vdev->vq[i].vring.num);
> >> >> >> >>> >> >> qemu_put_be64(f, vdev->vq[i].pa);
> >> >> >> >>> >> >> qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >> >> >>> >> >> + qemu_put_be16s(f, &vdev->vq[i].inuse);
> >> >> >> >>> >> >> if (vdev->binding->save_queue)
> >> >> >> >>> >> >> vdev->binding->save_queue(vdev->binding_opaque, i, f);
> >> >> >> >>> >> >> }
> >> >> >> >>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
> >> >> >> >>> >> >> vdev->vq[i].vring.num = qemu_get_be32(f);
> >> >> >> >>> >> >> vdev->vq[i].pa = qemu_get_be64(f);
> >> >> >> >>> >> >> qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
> >> >> >> >>> >> >> + qemu_get_be16s(f, &vdev->vq[i].inuse);
> >> >> >> >>> >> >> +
> >> >> >> >>> >> >> + /* revert last_avail_idx if there are outstanding emulation. */
> >> >> >> >>> >> >> + vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
> >> >> >> >>> >> >> + vdev->vq[i].inuse = 0;
> >> >> >> >>> >> >>
> >> >> >> >>> >> >> if (vdev->vq[i].pa) {
> >> >> >> >>> >> >> virtqueue_init(&vdev->vq[i]);
> >> >> >> >>> >> >> --
> >> >> >> >>> >> >> 1.7.1.2
> >> >> >> >>> >> >>
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse variable.
2010-12-26 10:46 ` Michael S. Tsirkin
@ 2010-12-26 10:50 ` Yoshiaki Tamura
0 siblings, 0 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-12-26 10:50 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: aliguori, dlaor, ananth, kvm, Marcelo Tosatti, ohmura.kei,
qemu-devel, avi, vatsa, psuriset, stefanha
2010/12/26 Michael S. Tsirkin <mst@redhat.com>:
> On Sun, Dec 26, 2010 at 07:14:44PM +0900, Yoshiaki Tamura wrote:
>> 2010/12/26 Michael S. Tsirkin <mst@redhat.com>:
>> > On Fri, Dec 24, 2010 at 08:42:19PM +0900, Yoshiaki Tamura wrote:
>> >> >> If qemu_aio_flush() is responsible for flushing the outstanding
>> >> >> virtio-net requests, I'm wondering why it's a problem for Kemari.
>> >> >> As I described in the previous message, Kemari queues the
>> >> >> requests first. So in you example above, it should start with
>> >> >>
>> >> >> virtio-net: last_avai_idx 0 inuse 2
>> >> >> event-tap: {A,B}
>> >> >>
>> >> >> As you know, the requests are still in order still because net
>> >> >> layer initiates in order. Not about completing.
>> >> >>
>> >> >> In the first synchronization, the status above is transferred. In
>> >> >> the next synchronization, the status will be as following.
>> >> >>
>> >> >> virtio-net: last_avai_idx 1 inuse 1
>> >> >> event-tap: {B}
>> >> >
>> >> > OK, this answers the ordering question.
>> >>
>> >> Glad to hear that!
>> >>
>> >> > Another question: at this point we transfer this status: both
>> >> > event-tap and virtio ring have the command B,
>> >> > so the remote will have:
>> >> >
>> >> > virtio-net: inuse 0
>> >> > event-tap: {B}
>> >> >
>> >> > Is this right? This already seems to be a problem as when B completes
>> >> > inuse will go negative?
>> >>
>> >> I think state above is wrong. inuse 0 means there shouldn't be
>> >> any requests in event-tap. Note that the callback is called only
>> >> when event-tap flushes the requests.
>> >>
>> >> > Next it seems that the remote virtio will resubmit B to event-tap. The
>> >> > remote will then have:
>> >> >
>> >> > virtio-net: inuse 1
>> >> > event-tap: {B, B}
>> >> >
>> >> > This looks kind of wrong ... will two packets go out?
>> >>
>> >> No. Currently, we're just replaying the requests with pio/mmio.
>> >> In the situation above, it should be,
>> >>
>> >> virtio-net: inuse 1
>> >> event-tap: {B}
>> >> >> Why? Because Kemari flushes the first virtio-net request using
>> >> >> qemu_aio_flush() before each synchronization. If
>> >> >> qemu_aio_flush() doesn't guarantee the order, what you pointed
>> >> >> should be problematic. So in the final synchronization, the
>> >> >> state should be,
>> >> >>
>> >> >> virtio-net: last_avai_idx 2 inuse 0
>> >> >> event-tap: {}
>> >> >>
>> >> >> where A,B were completed in order.
>> >> >>
>> >> >> Yoshi
>> >> >
>> >> >
>> >> > It might be better to discuss block because that's where
>> >> > requests can complete out of order.
>> >>
>> >> It's same as net. We queue requests and call bdrv_flush per
>> >> sending requests to the block. So there shouldn't be any
>> >> inversion.
>> >>
>> >> > So let me see if I understand:
>> >> > - each command passed to event tap is queued by it,
>> >> > it is not passed directly to the backend
>> >> > - later requests are passed to the backend,
>> >> > always in the same order that they were submitted
>> >> > - each synchronization point flushes all requests
>> >> > passed to the backend so far
>> >> > - each synchronization transfers all requests not passed to the backend,
>> >> > to the remote, and they are replayed there
>> >>
>> >> Correct.
>> >>
>> >> > Now to analyse this for correctness I am looking at the original patch
>> >> > because it is smaller so easier to analyse and I think it is
>> >> > functionally equivalent, correct me if I am wrong in this.
>> >>
>> >> So you think decreasing last_avail_idx upon save is better than
>> >> updating it in the callback?
>> >>
>> >> > So the reason there's no out of order issue is this
>> >> > (and might be a good thing to put in commit log
>> >> > or a comment somewhere):
>> >>
>> >> I've done some in the latest patch. Please point it out if it
>> >> wasn't enough.
>> >>
>> >> > At point of save callback event tap has flushed commands
>> >> > passed to the backend already. Thus at the point of
>> >> > the save callback if a command has completed
>> >> > all previous commands have been flushed and completed.
>> >> >
>> >> >
>> >> > Therefore inuse is
>> >> > in fact the # of requests passed to event tap but not yet
>> >> > passed to the backend (for non-event tap case all commands are
>> >> > passed to the backend immediately and because of this
>> >> > inuse is 0) and these are the last inuse commands submitted.
>> >> >
>> >> >
>> >> > Right?
>> >>
>> >> Yep.
>> >>
>> >> > Now a question:
>> >> >
>> >> > When we pass last_used_index - inuse to the remote,
>> >> > the remote virtio will resubmit the request.
>> >> > Since request is also passed by event tap, we get
>> >> > the request twice, why is this not a problem?
>> >>
>> >> It's not a problem because event-tap currently replays with
>> >> pio/mmio only, as I mentioned above. Although event-tap receives
>> >> information about the queued requests, it won't pass it to the
>> >> backend. The reason is the problem in setting the callbacks
>> >> which are specific to devices on the secondary. These are
>> >> pointers, and even worse, are usually static functions, which
>> >> event-tap has no way to restore it upon failover. I do want to
>> >> change event-tap replay to be this way in the future, pio/mmio
>> >> replay is implemented for now.
>> >>
>> >> Thanks,
>> >>
>> >> Yoshi
>> >>
>> >
>> > Then I am still confused, sorry. inuse != 0 means that some requests
>> > were passed to the backend but did not complete. I think that if you do
>> > a flush, this waits until all requests passed to the backend will
>> > complete. Why does not this guarantee inuse = 0 on the origin at the
>> > synchronization point?
>>
>> The synchronization is done before event-tap releases requests to
>> the backend, so there are two types of flush: event-tap and
>> backend block/net. I assume you're confused with the fact that
>> flushing backend with qemu_aio_flush/bdrv_flush doesn't necessary
>> decrease inuse if event-tap has queued requests because there are
>> no requests passed to the backend. Let me do a case study again.
>>
>> virtio: inuse 4
>> event-tap: {A,B,C}
>> backend: {D}
>>
>
>
> There are two event-tap devices, right?
> PIO one is above virtio, AIO one is between virtio and backend
> (e.g. bdrv)? Which one is meant here?
Right. I'm referring to the latter, between virtio and the
backend. Note that the event-tap function in pio/mmio doesn't queue
requests but just records what initiated them.
Yoshi
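
[Editor's note: to make the accounting in the case study quoted around this
reply concrete, here is a minimal model of the two layers. It is an
illustration only, not the Kemari code; all names in it are invented. It
reproduces the "virtio: inuse 4 / event-tap: {A,B,C} / backend: {D}"
scenario and shows why flushing the backend alone cannot bring inuse to 0.]

/* Minimal model (illustration only; not Kemari functions). virtio's inuse
 * counts requests popped from the ring; event-tap holds requests not yet
 * passed to the backend, so a backend flush only completes backend requests. */
#include <stdio.h>
#include <string.h>

enum { QSIZE = 16 };

typedef struct {
    int inuse;                 /* virtio: popped but not completed      */
    int tap_q[QSIZE], tap_n;   /* event-tap queue (not in backend yet)  */
    int be_n;                  /* requests in flight in the backend     */
} Model;

static void guest_submit(Model *m, int req)
{
    m->inuse++;                        /* last_avail_idx also advances   */
    m->tap_q[m->tap_n++] = req;        /* queued in event-tap first      */
}

static void backend_flush(Model *m)    /* think qemu_aio_flush()         */
{
    m->inuse -= m->be_n;               /* only backend requests complete */
    m->be_n = 0;
}

static void event_tap_release_one(Model *m)   /* after a sync point      */
{
    if (m->tap_n) {
        m->be_n++;
        memmove(m->tap_q, m->tap_q + 1, --m->tap_n * sizeof(int));
    }
}

int main(void)
{
    Model m = { .inuse = 1, .be_n = 1 };       /* D already in the backend */
    guest_submit(&m, 'A');
    guest_submit(&m, 'B');
    guest_submit(&m, 'C');
    printf("start:       inuse=%d tap=%d be=%d\n", m.inuse, m.tap_n, m.be_n);
    backend_flush(&m);                          /* sync starts: D completes */
    printf("synced:      inuse=%d tap=%d\n", m.inuse, m.tap_n);   /* 3, 3 */
    while (m.tap_n) {                           /* flush event-tap one by one */
        event_tap_release_one(&m);
        backend_flush(&m);
    }
    printf("tap flushed: inuse=%d\n", m.inuse);                   /* 0    */
    return 0;
}
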
>
>
>> synchronization starts. backend gets flushed.
>>
>> virtio: inuse 3
>> event-tap: {A,B,C}
>> backend: {}
>> synchronization gets done.
>> # secondary is virtio: inuse 3
>>
>> event-tap flushes one request.
>>
>> virtio: inuse 2
>> event-tap: {B,C}
>> backend: {}
>> repeats above and finally it should be,
>>
>> virtio: inuse 0
>> event-tap: {}
>>
>> Hope this helps.
>>
>> Yoshi
>>
>> >
>> > --
>> > MST
>> >
>> >
>
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse variable.
2010-12-26 10:49 ` Michael S. Tsirkin
@ 2010-12-26 10:57 ` Yoshiaki Tamura
2010-12-26 12:01 ` Michael S. Tsirkin
0 siblings, 1 reply; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-12-26 10:57 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, Marcelo Tosatti,
qemu-devel, vatsa, avi, psuriset, stefanha
2010/12/26 Michael S. Tsirkin <mst@redhat.com>:
> On Fri, Dec 24, 2010 at 08:42:19PM +0900, Yoshiaki Tamura wrote:
>> 2010/12/24 Michael S. Tsirkin <mst@redhat.com>:
>> > On Fri, Dec 17, 2010 at 12:59:58AM +0900, Yoshiaki Tamura wrote:
>> >> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
>> >> > On Thu, Dec 16, 2010 at 11:28:46PM +0900, Yoshiaki Tamura wrote:
>> >> >> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
>> >> >> > On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
>> >> >> >> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
>> >> >> >> > 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
>> >> >> >> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
>> >> >> >> >>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>> >> >> >> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
>> >> >> >> >>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>> >> >> >> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
>> >> >> >> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
>> >> >> >> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
>> >> >> >> >>> >> >>
>> >> >> >> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >> >> >> >>> >> >
>> >> >> >> >>> >> > This changes migration format, so it will break compatibility with
>> >> >> >> >>> >> > existing drivers. More generally, I think migrating internal
>> >> >> >> >>> >> > state that is not guest visible is always a mistake
>> >> >> >> >>> >> > as it ties migration format to an internal implementation
>> >> >> >> >>> >> > (yes, I know we do this sometimes, but we should at least
>> >> >> >> >>> >> > try not to add such cases). I think the right thing to do in this case
>> >> >> >> >>> >> > is to flush outstanding
>> >> >> >> >>> >> > work when vm is stopped. Then, we are guaranteed that inuse is 0.
>> >> >> >> >>> >> > I sent patches that do this for virtio net and block.
>> >> >> >> >>> >>
>> >> >> >> >>> >> Could you give me the link of your patches? I'd like to test
>> >> >> >> >>> >> whether they work with Kemari upon failover. If they do, I'm
>> >> >> >> >>> >> happy to drop this patch.
>> >> >> >> >>> >>
>> >> >> >> >>> >> Yoshi
>> >> >> >> >>> >
>> >> >> >> >>> > Look for this:
>> >> >> >> >>> > stable migration image on a stopped vm
>> >> >> >> >>> > sent on:
>> >> >> >> >>> > Wed, 24 Nov 2010 17:52:49 +0200
>> >> >> >> >>>
>> >> >> >> >>> Thanks for the info.
>> >> >> >> >>>
>> >> >> >> >>> However, The patch series above didn't solve the issue. In
>> >> >> >> >>> case of Kemari, inuse is mostly > 0 because it queues the
>> >> >> >> >>> output, and while last_avail_idx gets incremented
>> >> >> >> >>> immediately, not sending inuse makes the state inconsistent
>> >> >> >> >>> between Primary and Secondary.
>> >> >> >> >>
>> >> >> >> >> Hmm. Can we simply avoid incrementing last_avail_idx?
>> >> >> >> >
>> >> >> >> > I think we can calculate or prepare an internal last_avail_idx,
>> >> >> >> > and update the external when inuse is decremented. I'll try
>> >> >> >> > whether it work w/ w/o Kemari.
>> >> >> >>
>> >> >> >> Hi Michael,
>> >> >> >>
>> >> >> >> Could you please take a look at the following patch?
>> >> >> >
>> >> >> > Which version is this against?
>> >> >>
>> >> >> Oops. It should be very old.
>> >> >> 67f895bfe69f323b427b284430b6219c8a62e8d4
>> >> >>
>> >> >> >> commit 36ee7910059e6b236fe9467a609f5b4aed866912
>> >> >> >> Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >> >> >> Date: Thu Dec 16 14:50:54 2010 +0900
>> >> >> >>
>> >> >> >> virtio: update last_avail_idx when inuse is decreased.
>> >> >> >>
>> >> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >> >> >
>> >> >> > It would be better to have a commit description explaining why a change
>> >> >> > is made, and why it is correct, not just repeating what can be seen from
>> >> >> > the diff anyway.
>> >> >>
>> >> >> Sorry for being lazy here.
>> >> >>
>> >> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
>> >> >> >> index c8a0fc6..6688c02 100644
>> >> >> >> --- a/hw/virtio.c
>> >> >> >> +++ b/hw/virtio.c
>> >> >> >> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
>> >> >> >> wmb();
>> >> >> >> trace_virtqueue_flush(vq, count);
>> >> >> >> vring_used_idx_increment(vq, count);
>> >> >> >> + vq->last_avail_idx += count;
>> >> >> >> vq->inuse -= count;
>> >> >> >> }
>> >> >> >>
>> >> >> >> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>> >> >> >> unsigned int i, head, max;
>> >> >> >> target_phys_addr_t desc_pa = vq->vring.desc;
>> >> >> >>
>> >> >> >> - if (!virtqueue_num_heads(vq, vq->last_avail_idx))
>> >> >> >> + if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
>> >> >> >> return 0;
>> >> >> >>
>> >> >> >> /* When we start there are none of either input nor output. */
>> >> >> >> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>> >> >> >>
>> >> >> >> max = vq->vring.num;
>> >> >> >>
>> >> >> >> - i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
>> >> >> >> + i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
>> >> >> >>
>> >> >> >> if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
>> >> >> >> if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
>> >> >> >>
>> >> >> >
>> >> >> > Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
>> >> >>
>> >> >> I think there are two problems.
>> >> >>
>> >> >> 1. When to update last_avail_idx.
>> >> >> 2. The ordering issue you're mentioning below.
>> >> >>
>> >> >> The patch above is only trying to address 1 because last time you
>> >> >> mentioned that modifying last_avail_idx upon save may break the
>> >> >> guest, which I agree. If virtio_queue_empty and
>> >> >> virtqueue_avail_bytes are only used internally, meaning invisible
>> >> >> to the guest, I guess the approach above can be applied too.
>> >> >
>> >> > So IMHO 2 is the real issue. This is what was problematic
>> >> > with the save patch, otherwise of course changes in save
>> >> > are better than changes all over the codebase.
>> >>
>> >> All right. Then let's focus on 2 first.
>> >>
>> >> >> > Previous patch version sure looked simpler, and this seems functionally
>> >> >> > equivalent, so my question still stands: here it is rephrased in a
>> >> >> > different way:
>> >> >> >
>> >> >> > assume that we have in avail ring 2 requests at start of ring: A and B in this order
>> >> >> >
>> >> >> > host pops A, then B, then completes B and flushes
>> >> >> >
>> >> >> > now with this patch last_avail_idx will be 1, and then
>> >> >> > remote will get it, it will execute B again. As a result
>> >> >> > B will complete twice, and apparently A will never complete.
>> >> >> >
>> >> >> >
>> >> >> > This is what I was saying below: assuming that there are
>> >> >> > outstanding requests when we migrate, there is no way
>> >> >> > a single index can be enough to figure out which requests
>> >> >> > need to be handled and which are in flight already.
>> >> >> >
>> >> >> > We must add some kind of bitmask to tell us which is which.
>> >> >>
>> >> >> I should understand why this inversion can happen before solving
>> >> >> the issue.
>> >> >
>> >> > It's a fundamental thing in virtio.
>> >> > I think it is currently only likely to happen with block, I think tap
>> >> > currently completes things in order. In any case relying on this in the
>> >> > frontend is a mistake.
>> >> >
>> >> >> Currently, how are you making virio-net to flush
>> >> >> every requests for live migration? Is it qemu_aio_flush()?
>> >> >
>> >> > Think so.
>> >>
>> >> If qemu_aio_flush() is responsible for flushing the outstanding
>> >> virtio-net requests, I'm wondering why it's a problem for Kemari.
>> >> As I described in the previous message, Kemari queues the
>> >> requests first. So in you example above, it should start with
>> >>
>> >> virtio-net: last_avai_idx 0 inuse 2
>> >> event-tap: {A,B}
>> >>
>> >> As you know, the requests are still in order still because net
>> >> layer initiates in order. Not about completing.
>> >>
>> >> In the first synchronization, the status above is transferred. In
>> >> the next synchronization, the status will be as following.
>> >>
>> >> virtio-net: last_avai_idx 1 inuse 1
>> >> event-tap: {B}
>> >
>> > OK, this answers the ordering question.
>>
>> Glad to hear that!
>>
>> > Another question: at this point we transfer this status: both
>> > event-tap and virtio ring have the command B,
>> > so the remote will have:
>> >
>> > virtio-net: inuse 0
>> > event-tap: {B}
>> >
>> > Is this right? This already seems to be a problem as when B completes
>> > inuse will go negative?
>>
>> I think state above is wrong. inuse 0 means there shouldn't be
>> any requests in event-tap. Note that the callback is called only
>> when event-tap flushes the requests.
>>
>> > Next it seems that the remote virtio will resubmit B to event-tap. The
>> > remote will then have:
>> >
>> > virtio-net: inuse 1
>> > event-tap: {B, B}
>> >
>> > This looks kind of wrong ... will two packets go out?
>>
>> No. Currently, we're just replaying the requests with pio/mmio.
>
> You do? What purpose do the hooks in bdrv/net serve then?
> A placeholder for the future?
Not only for that reason. The hooks in bdrv/net are the main
functions that queue requests and start synchronization. The
pio/mmio hooks are there to record what initiated the requests
monitored in the bdrv/net layer. I would like to remove the
pio/mmio part once bdrv/net-level replay becomes possible.
Yoshi
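
[Editor's note: a simplified sketch of what such a bdrv-level proxy hook
could look like follows. The function names (event_tap_queue_write,
event_tap_start_sync, tap_write_cb) and the trimmed signature are invented
for illustration; they are not the series' actual API.]

#include "qemu-common.h"
#include "qemu-queue.h"
#include "block.h"

/* Hypothetical helpers, assumed to exist elsewhere for this sketch. */
void event_tap_start_sync(void);
void tap_write_cb(void *opaque, int ret);

typedef struct TapReq {
    BlockDriverState *bs;          /* enough state to replay the write */
    int64_t sector_num;
    QEMUIOVector *qiov;
    int nb_sectors;
    QTAILQ_ENTRY(TapReq) node;
} TapReq;

static QTAILQ_HEAD(, TapReq) tap_queue = QTAILQ_HEAD_INITIALIZER(tap_queue);

/* Device models would call this instead of bdrv_aio_writev(). */
static void event_tap_queue_write(BlockDriverState *bs, int64_t sector_num,
                                  QEMUIOVector *qiov, int nb_sectors)
{
    TapReq *req = qemu_mallocz(sizeof(*req));

    req->bs = bs;
    req->sector_num = sector_num;
    req->qiov = qiov;
    req->nb_sectors = nb_sectors;
    QTAILQ_INSERT_TAIL(&tap_queue, req, node);  /* 1. queue, don't submit */

    event_tap_start_sync();  /* 2. hypothetical: kick a Kemari transaction */
}

/* Called once the secondary has acked the transaction. */
static void event_tap_flush_one(void)
{
    TapReq *req = QTAILQ_FIRST(&tap_queue);

    if (!req) {
        return;
    }
    QTAILQ_REMOVE(&tap_queue, req, node);
    /* 3. only now does the request reach the real backend; tap_write_cb
     * (hypothetical) would complete the virtio request, decrementing
     * inuse, and free req. */
    bdrv_aio_writev(req->bs, req->sector_num, req->qiov, req->nb_sectors,
                    tap_write_cb, req);
}
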
>
>> In the situation above, it should be,
>>
>> virtio-net: inuse 1
>> event-tap: {B}
>>
>> >> Why? Because Kemari flushes the first virtio-net request using
>> >> qemu_aio_flush() before each synchronization. If
>> >> qemu_aio_flush() doesn't guarantee the order, what you pointed
>> >> should be problematic. So in the final synchronization, the
>> >> state should be,
>> >>
>> >> virtio-net: last_avai_idx 2 inuse 0
>> >> event-tap: {}
>> >>
>> >> where A,B were completed in order.
>> >>
>> >> Yoshi
>> >
>> >
>> > It might be better to discuss block because that's where
>> > requests can complete out of order.
>>
>> It's same as net. We queue requests and call bdrv_flush per
>> sending requests to the block. So there shouldn't be any
>> inversion.
>>
>> > So let me see if I understand:
>> > - each command passed to event tap is queued by it,
>> > it is not passed directly to the backend
>> > - later requests are passed to the backend,
>> > always in the same order that they were submitted
>> > - each synchronization point flushes all requests
>> > passed to the backend so far
>> > - each synchronization transfers all requests not passed to the backend,
>> > to the remote, and they are replayed there
>>
>> Correct.
>>
>> > Now to analyse this for correctness I am looking at the original patch
>> > because it is smaller so easier to analyse and I think it is
>> > functionally equivalent, correct me if I am wrong in this.
>>
>> So you think decreasing last_avail_idx upon save is better than
>> updating it in the callback?
>>
>> > So the reason there's no out of order issue is this
>> > (and might be a good thing to put in commit log
>> > or a comment somewhere):
>>
>> I've done some in the latest patch. Please point it out if it
>> wasn't enough.
>>
>> > At point of save callback event tap has flushed commands
>> > passed to the backend already. Thus at the point of
>> > the save callback if a command has completed
>> > all previous commands have been flushed and completed.
>> >
>> >
>> > Therefore inuse is
>> > in fact the # of requests passed to event tap but not yet
>> > passed to the backend (for non-event tap case all commands are
>> > passed to the backend immediately and because of this
>> > inuse is 0) and these are the last inuse commands submitted.
>> >
>> >
>> > Right?
>>
>> Yep.
>>
>> > Now a question:
>> >
>> > When we pass last_used_index - inuse to the remote,
>> > the remote virtio will resubmit the request.
>> > Since request is also passed by event tap, we get
>> > the request twice, why is this not a problem?
>>
>> It's not a problem because event-tap currently replays with
>> pio/mmio only, as I mentioned above. Although event-tap receives
>> information about the queued requests, it won't pass it to the
>> backend. The reason is the problem in setting the callbacks
>> which are specific to devices on the secondary. These are
>> pointers, and even worse, are usually static functions, which
>> event-tap has no way to restore it upon failover. I do want to
>> change event-tap replay to be this way in the future, pio/mmio
>> replay is implemented for now.
>>
>> Thanks,
>>
>> Yoshi
>>
>> >
>> >
>> >> >
>> >> >> >
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >>> I'm wondering why
>> >> >> >> >>> last_avail_idx is OK to send but not inuse.
>> >> >> >> >>
>> >> >> >> >> last_avail_idx is at some level a mistake, it exposes part of
>> >> >> >> >> our internal implementation, but it does *also* express
>> >> >> >> >> a guest observable state.
>> >> >> >> >>
>> >> >> >> >> Here's the problem that it solves: just looking at the rings in virtio
>> >> >> >> >> there is no way to detect that a specific request has already been
>> >> >> >> >> completed. And the protocol forbids completing the same request twice.
>> >> >> >> >>
>> >> >> >> >> Our implementation always starts processing the requests
>> >> >> >> >> in order, and since we flush outstanding requests
>> >> >> >> >> before save, it works to just tell the remote 'process only requests
>> >> >> >> >> after this place'.
>> >> >> >> >>
>> >> >> >> >> But there's no such requirement in the virtio protocol,
>> >> >> >> >> so to be really generic we could add a bitmask of valid avail
>> >> >> >> >> ring entries that did not complete yet. This would be
>> >> >> >> >> the exact representation of the guest observable state.
>> >> >> >> >> In practice we have rings of up to 512 entries.
>> >> >> >> >> That's 64 byte per ring, not a lot at all.
>> >> >> >> >>
>> >> >> >> >> However, if we ever do change the protocol to send the bitmask,
>> >> >> >> >> we would need some code to resubmit requests
>> >> >> >> >> out of order, so it's not trivial.
>> >> >> >> >>
>> >> >> >> >> Another minor mistake with last_avail_idx is that it has
>> >> >> >> >> some redundancy: the high bits in the index
>> >> >> >> >> (> vq size) are not necessary as they can be
>> >> >> >> >> got from avail idx. There's a consistency check
>> >> >> >> >> in load but we really should try to use formats
>> >> >> >> >> that are always consistent.
>> >> >> >> >>
>> >> >> >> >>> The following patch does the same thing as original, yet
>> >> >> >> >>> keeps the format of the virtio. It shouldn't break live
>> >> >> >> >>> migration either because inuse should be 0.
>> >> >> >> >>>
>> >> >> >> >>> Yoshi
>> >> >> >> >>
>> >> >> >> >> Question is, can you flush to make inuse 0 in kemari too?
>> >> >> >> >> And if not, how do you handle the fact that some requests
>> >> >> >> >> are in flight on the primary?
>> >> >> >> >
>> >> >> >> > Although we try flushing requests one by one making inuse 0,
>> >> >> >> > there are cases when it failovers to the secondary when inuse
>> >> >> >> > isn't 0. We handle these in flight request on the primary by
>> >> >> >> > replaying on the secondary.
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >>> diff --git a/hw/virtio.c b/hw/virtio.c
>> >> >> >> >>> index c8a0fc6..875c7ca 100644
>> >> >> >> >>> --- a/hw/virtio.c
>> >> >> >> >>> +++ b/hw/virtio.c
>> >> >> >> >>> @@ -664,12 +664,16 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>> >> >> >> >>> qemu_put_be32(f, i);
>> >> >> >> >>>
>> >> >> >> >>> for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
>> >> >> >> >>> + uint16_t last_avail_idx;
>> >> >> >> >>> +
>> >> >> >> >>> if (vdev->vq[i].vring.num == 0)
>> >> >> >> >>> break;
>> >> >> >> >>>
>> >> >> >> >>> + last_avail_idx = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
>> >> >> >> >>> +
>> >> >> >> >>> qemu_put_be32(f, vdev->vq[i].vring.num);
>> >> >> >> >>> qemu_put_be64(f, vdev->vq[i].pa);
>> >> >> >> >>> - qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>> >> >> >> >>> + qemu_put_be16s(f, &last_avail_idx);
>> >> >> >> >>> if (vdev->binding->save_queue)
>> >> >> >> >>> vdev->binding->save_queue(vdev->binding_opaque, i, f);
>> >> >> >> >>> }
>> >> >> >> >>>
>> >> >> >> >>>
>> >> >> >> >>
>> >> >> >> >> This looks wrong to me. Requests can complete in any order, can they
>> >> >> >> >> not? So if request 0 did not complete and request 1 did not,
>> >> >> >> >> you send avail - inuse and on the secondary you will process and
>> >> >> >> >> complete request 1 the second time, crashing the guest.
>> >> >> >> >
>> >> >> >> > In case of Kemari, no. We sit between devices and net/block, and
>> >> >> >> > queue the requests. After completing each transaction, we flush
>> >> >> >> > the requests one by one. So there won't be completion inversion,
>> >> >> >> > and therefore won't be visible to the guest.
>> >> >> >> >
>> >> >> >> > Yoshi
>> >> >> >> >
>> >> >> >> >>
>> >> >> >> >>>
>> >> >> >> >>> >
>> >> >> >> >>> >> >
>> >> >> >> >>> >> >> ---
>> >> >> >> >>> >> >> hw/virtio.c | 8 +++++++-
>> >> >> >> >>> >> >> 1 files changed, 7 insertions(+), 1 deletions(-)
>> >> >> >> >>> >> >>
>> >> >> >> >>> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
>> >> >> >> >>> >> >> index 849a60f..5509644 100644
>> >> >> >> >>> >> >> --- a/hw/virtio.c
>> >> >> >> >>> >> >> +++ b/hw/virtio.c
>> >> >> >> >>> >> >> @@ -72,7 +72,7 @@ struct VirtQueue
>> >> >> >> >>> >> >> VRing vring;
>> >> >> >> >>> >> >> target_phys_addr_t pa;
>> >> >> >> >>> >> >> uint16_t last_avail_idx;
>> >> >> >> >>> >> >> - int inuse;
>> >> >> >> >>> >> >> + uint16_t inuse;
>> >> >> >> >>> >> >> uint16_t vector;
>> >> >> >> >>> >> >> void (*handle_output)(VirtIODevice *vdev, VirtQueue *vq);
>> >> >> >> >>> >> >> VirtIODevice *vdev;
>> >> >> >> >>> >> >> @@ -671,6 +671,7 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
>> >> >> >> >>> >> >> qemu_put_be32(f, vdev->vq[i].vring.num);
>> >> >> >> >>> >> >> qemu_put_be64(f, vdev->vq[i].pa);
>> >> >> >> >>> >> >> qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
>> >> >> >> >>> >> >> + qemu_put_be16s(f, &vdev->vq[i].inuse);
>> >> >> >> >>> >> >> if (vdev->binding->save_queue)
>> >> >> >> >>> >> >> vdev->binding->save_queue(vdev->binding_opaque, i, f);
>> >> >> >> >>> >> >> }
>> >> >> >> >>> >> >> @@ -711,6 +712,11 @@ int virtio_load(VirtIODevice *vdev, QEMUFile *f)
>> >> >> >> >>> >> >> vdev->vq[i].vring.num = qemu_get_be32(f);
>> >> >> >> >>> >> >> vdev->vq[i].pa = qemu_get_be64(f);
>> >> >> >> >>> >> >> qemu_get_be16s(f, &vdev->vq[i].last_avail_idx);
>> >> >> >> >>> >> >> + qemu_get_be16s(f, &vdev->vq[i].inuse);
>> >> >> >> >>> >> >> +
>> >> >> >> >>> >> >> + /* revert last_avail_idx if there are outstanding emulation. */
>> >> >> >> >>> >> >> + vdev->vq[i].last_avail_idx -= vdev->vq[i].inuse;
>> >> >> >> >>> >> >> + vdev->vq[i].inuse = 0;
>> >> >> >> >>> >> >>
>> >> >> >> >>> >> >> if (vdev->vq[i].pa) {
>> >> >> >> >>> >> >> virtqueue_init(&vdev->vq[i]);
>> >> >> >> >>> >> >> --
>> >> >> >> >>> >> >> 1.7.1.2
>> >> >> >> >>> >> >>
>
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse variable.
2010-12-26 10:57 ` Yoshiaki Tamura
@ 2010-12-26 12:01 ` Michael S. Tsirkin
2010-12-26 12:16 ` Yoshiaki Tamura
0 siblings, 1 reply; 112+ messages in thread
From: Michael S. Tsirkin @ 2010-12-26 12:01 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, Marcelo Tosatti,
qemu-devel, vatsa, avi, psuriset, stefanha
On Sun, Dec 26, 2010 at 07:57:52PM +0900, Yoshiaki Tamura wrote:
> 2010/12/26 Michael S. Tsirkin <mst@redhat.com>:
> > On Fri, Dec 24, 2010 at 08:42:19PM +0900, Yoshiaki Tamura wrote:
> >> 2010/12/24 Michael S. Tsirkin <mst@redhat.com>:
> >> > On Fri, Dec 17, 2010 at 12:59:58AM +0900, Yoshiaki Tamura wrote:
> >> >> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
> >> >> > On Thu, Dec 16, 2010 at 11:28:46PM +0900, Yoshiaki Tamura wrote:
> >> >> >> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> > On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
> >> >> >> >> > 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> >> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >> >>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> >> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >> >>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> >> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
> >> >> >> >> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
> >> >> >> >> >>> >> >>
> >> >> >> >> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> >> >> >>> >> >
> >> >> >> >> >>> >> > This changes migration format, so it will break compatibility with
> >> >> >> >> >>> >> > existing drivers. More generally, I think migrating internal
> >> >> >> >> >>> >> > state that is not guest visible is always a mistake
> >> >> >> >> >>> >> > as it ties migration format to an internal implementation
> >> >> >> >> >>> >> > (yes, I know we do this sometimes, but we should at least
> >> >> >> >> >>> >> > try not to add such cases). I think the right thing to do in this case
> >> >> >> >> >>> >> > is to flush outstanding
> >> >> >> >> >>> >> > work when vm is stopped. Then, we are guaranteed that inuse is 0.
> >> >> >> >> >>> >> > I sent patches that do this for virtio net and block.
> >> >> >> >> >>> >>
> >> >> >> >> >>> >> Could you give me the link of your patches? I'd like to test
> >> >> >> >> >>> >> whether they work with Kemari upon failover. If they do, I'm
> >> >> >> >> >>> >> happy to drop this patch.
> >> >> >> >> >>> >>
> >> >> >> >> >>> >> Yoshi
> >> >> >> >> >>> >
> >> >> >> >> >>> > Look for this:
> >> >> >> >> >>> > stable migration image on a stopped vm
> >> >> >> >> >>> > sent on:
> >> >> >> >> >>> > Wed, 24 Nov 2010 17:52:49 +0200
> >> >> >> >> >>>
> >> >> >> >> >>> Thanks for the info.
> >> >> >> >> >>>
> >> >> >> >> >>> However, The patch series above didn't solve the issue. In
> >> >> >> >> >>> case of Kemari, inuse is mostly > 0 because it queues the
> >> >> >> >> >>> output, and while last_avail_idx gets incremented
> >> >> >> >> >>> immediately, not sending inuse makes the state inconsistent
> >> >> >> >> >>> between Primary and Secondary.
> >> >> >> >> >>
> >> >> >> >> >> Hmm. Can we simply avoid incrementing last_avail_idx?
> >> >> >> >> >
> >> >> >> >> > I think we can calculate or prepare an internal last_avail_idx,
> >> >> >> >> > and update the external when inuse is decremented. I'll try
> >> >> >> >> > whether it work w/ w/o Kemari.
> >> >> >> >>
> >> >> >> >> Hi Michael,
> >> >> >> >>
> >> >> >> >> Could you please take a look at the following patch?
> >> >> >> >
> >> >> >> > Which version is this against?
> >> >> >>
> >> >> >> Oops. It should be very old.
> >> >> >> 67f895bfe69f323b427b284430b6219c8a62e8d4
> >> >> >>
> >> >> >> >> commit 36ee7910059e6b236fe9467a609f5b4aed866912
> >> >> >> >> Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> >> >> Date: Thu Dec 16 14:50:54 2010 +0900
> >> >> >> >>
> >> >> >> >> virtio: update last_avail_idx when inuse is decreased.
> >> >> >> >>
> >> >> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> >> >
> >> >> >> > It would be better to have a commit description explaining why a change
> >> >> >> > is made, and why it is correct, not just repeating what can be seen from
> >> >> >> > the diff anyway.
> >> >> >>
> >> >> >> Sorry for being lazy here.
> >> >> >>
> >> >> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> >> >> index c8a0fc6..6688c02 100644
> >> >> >> >> --- a/hw/virtio.c
> >> >> >> >> +++ b/hw/virtio.c
> >> >> >> >> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
> >> >> >> >> wmb();
> >> >> >> >> trace_virtqueue_flush(vq, count);
> >> >> >> >> vring_used_idx_increment(vq, count);
> >> >> >> >> + vq->last_avail_idx += count;
> >> >> >> >> vq->inuse -= count;
> >> >> >> >> }
> >> >> >> >>
> >> >> >> >> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >> >> >> >> unsigned int i, head, max;
> >> >> >> >> target_phys_addr_t desc_pa = vq->vring.desc;
> >> >> >> >>
> >> >> >> >> - if (!virtqueue_num_heads(vq, vq->last_avail_idx))
> >> >> >> >> + if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
> >> >> >> >> return 0;
> >> >> >> >>
> >> >> >> >> /* When we start there are none of either input nor output. */
> >> >> >> >> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >> >> >> >>
> >> >> >> >> max = vq->vring.num;
> >> >> >> >>
> >> >> >> >> - i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
> >> >> >> >> + i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
> >> >> >> >>
> >> >> >> >> if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
> >> >> >> >> if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
> >> >> >> >>
> >> >> >> >
> >> >> >> > Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
> >> >> >>
> >> >> >> I think there are two problems.
> >> >> >>
> >> >> >> 1. When to update last_avail_idx.
> >> >> >> 2. The ordering issue you're mentioning below.
> >> >> >>
> >> >> >> The patch above is only trying to address 1 because last time you
> >> >> >> mentioned that modifying last_avail_idx upon save may break the
> >> >> >> guest, which I agree. If virtio_queue_empty and
> >> >> >> virtqueue_avail_bytes are only used internally, meaning invisible
> >> >> >> to the guest, I guess the approach above can be applied too.
> >> >> >
> >> >> > So IMHO 2 is the real issue. This is what was problematic
> >> >> > with the save patch, otherwise of course changes in save
> >> >> > are better than changes all over the codebase.
> >> >>
> >> >> All right. Then let's focus on 2 first.
> >> >>
> >> >> >> > Previous patch version sure looked simpler, and this seems functionally
> >> >> >> > equivalent, so my question still stands: here it is rephrased in a
> >> >> >> > different way:
> >> >> >> >
> >> >> >> > assume that we have in avail ring 2 requests at start of ring: A and B in this order
> >> >> >> >
> >> >> >> > host pops A, then B, then completes B and flushes
> >> >> >> >
> >> >> >> > now with this patch last_avail_idx will be 1, and then
> >> >> >> > remote will get it, it will execute B again. As a result
> >> >> >> > B will complete twice, and apparently A will never complete.
> >> >> >> >
> >> >> >> >
> >> >> >> > This is what I was saying below: assuming that there are
> >> >> >> > outstanding requests when we migrate, there is no way
> >> >> >> > a single index can be enough to figure out which requests
> >> >> >> > need to be handled and which are in flight already.
> >> >> >> >
> >> >> >> > We must add some kind of bitmask to tell us which is which.
> >> >> >>
> >> >> >> I should understand why this inversion can happen before solving
> >> >> >> the issue.
> >> >> >
> >> >> > It's a fundamental thing in virtio.
> >> >> > I think it is currently only likely to happen with block, I think tap
> >> >> > currently completes things in order. In any case relying on this in the
> >> >> > frontend is a mistake.
> >> >> >
> >> >> >> Currently, how are you making virio-net to flush
> >> >> >> every requests for live migration? Is it qemu_aio_flush()?
> >> >> >
> >> >> > Think so.
> >> >>
> >> >> If qemu_aio_flush() is responsible for flushing the outstanding
> >> >> virtio-net requests, I'm wondering why it's a problem for Kemari.
> >> >> As I described in the previous message, Kemari queues the
> >> >> requests first. So in you example above, it should start with
> >> >>
> >> >> virtio-net: last_avai_idx 0 inuse 2
> >> >> event-tap: {A,B}
> >> >>
> >> >> As you know, the requests are still in order still because net
> >> >> layer initiates in order. Not about completing.
> >> >>
> >> >> In the first synchronization, the status above is transferred. In
> >> >> the next synchronization, the status will be as following.
> >> >>
> >> >> virtio-net: last_avai_idx 1 inuse 1
> >> >> event-tap: {B}
> >> >
> >> > OK, this answers the ordering question.
> >>
> >> Glad to hear that!
> >>
> >> > Another question: at this point we transfer this status: both
> >> > event-tap and virtio ring have the command B,
> >> > so the remote will have:
> >> >
> >> > virtio-net: inuse 0
> >> > event-tap: {B}
> >> >
> >> > Is this right? This already seems to be a problem as when B completes
> >> > inuse will go negative?
> >>
> >> I think state above is wrong. inuse 0 means there shouldn't be
> >> any requests in event-tap. Note that the callback is called only
> >> when event-tap flushes the requests.
> >>
> >> > Next it seems that the remote virtio will resubmit B to event-tap. The
> >> > remote will then have:
> >> >
> >> > virtio-net: inuse 1
> >> > event-tap: {B, B}
> >> >
> >> > This looks kind of wrong ... will two packets go out?
> >>
> >> No. Currently, we're just replaying the requests with pio/mmio.
> >
> > You do? What purpose do the hooks in bdrv/net serve then?
> > A placeholder for the future?
>
> Not only for that reason. The hooks in bdrv/net are the main
> functions that queue requests and start synchronization. The
> pio/mmio hooks are there to record what initiated the requests
> monitored in the bdrv/net layer. I would like to remove the
> pio/mmio part once bdrv/net-level replay becomes possible.
>
> Yoshi
I think I'm beginning to see. So when event-tap does a replay,
we will probably need to pass the inuse value.
But since we generally don't try to support new->old
cross-version migrations in qemu, my guess is that
it is better not to change the format in anticipation
right now.
So basically for now we just need to add a comment explaining
the reason for moving last_avail_idx back.
Does something like the below (completely untested) make sense?
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
diff --git a/hw/virtio.c b/hw/virtio.c
index 07dbf86..d1509f28 100644
--- a/hw/virtio.c
+++ b/hw/virtio.c
@@ -665,12 +665,20 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
qemu_put_be32(f, i);
for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
+ /* For regular migration inuse == 0 always as
+ * requests are flushed before save. However,
+ * event-tap log when enabled introduces an extra
+ * queue for requests which is not being flushed,
+ * thus the last inuse requests are left in the event-tap queue.
+ * Move the last_avail_idx value sent to the remote back
+ * to make it repeat the last inuse requests. */
+ uint16_t last_avail = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
if (vdev->vq[i].vring.num == 0)
break;
qemu_put_be32(f, vdev->vq[i].vring.num);
qemu_put_be64(f, vdev->vq[i].pa);
- qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
+ qemu_put_be16s(f, &last_avail);
if (vdev->binding->save_queue)
vdev->binding->save_queue(vdev->binding_opaque, i, f);
}
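
[Editor's note: a worked example of what the subtraction above does on the
wire, with invented numbers; 16-bit wraparound of the index is intentional.]

/* Worked example (invented numbers):
 *
 *   primary:  last_avail_idx = 5, inuse = 2  (event-tap still holds 2 reqs)
 *   saved:    last_avail     = 5 - 2 = 3
 *
 * The secondary loads last_avail_idx = 3 with inuse = 0, so it pops avail
 * entries 3 and 4 again -- exactly the two requests that never reached the
 * backend on the primary -- and replays them after failover. For ordinary
 * live migration inuse is 0, so the saved value is unchanged. */
uint16_t saved = last_avail_idx - inuse;   /* 5 - 2 = 3 */
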
^ permalink raw reply related [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse variable.
2010-12-26 12:01 ` Michael S. Tsirkin
@ 2010-12-26 12:16 ` Yoshiaki Tamura
2010-12-26 12:17 ` Michael S. Tsirkin
0 siblings, 1 reply; 112+ messages in thread
From: Yoshiaki Tamura @ 2010-12-26 12:16 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: aliguori, dlaor, ananth, kvm, Marcelo Tosatti, ohmura.kei,
qemu-devel, avi, vatsa, psuriset, stefanha
2010/12/26 Michael S. Tsirkin <mst@redhat.com>:
> On Sun, Dec 26, 2010 at 07:57:52PM +0900, Yoshiaki Tamura wrote:
>> 2010/12/26 Michael S. Tsirkin <mst@redhat.com>:
>> > On Fri, Dec 24, 2010 at 08:42:19PM +0900, Yoshiaki Tamura wrote:
>> >> 2010/12/24 Michael S. Tsirkin <mst@redhat.com>:
>> >> > On Fri, Dec 17, 2010 at 12:59:58AM +0900, Yoshiaki Tamura wrote:
>> >> >> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
>> >> >> > On Thu, Dec 16, 2010 at 11:28:46PM +0900, Yoshiaki Tamura wrote:
>> >> >> >> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
>> >> >> >> > On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
>> >> >> >> >> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
>> >> >> >> >> > 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
>> >> >> >> >> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
>> >> >> >> >> >>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>> >> >> >> >> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
>> >> >> >> >> >>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
>> >> >> >> >> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
>> >> >> >> >> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
>> >> >> >> >> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
>> >> >> >> >> >>> >> >>
>> >> >> >> >> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >> >> >> >> >>> >> >
>> >> >> >> >> >>> >> > This changes migration format, so it will break compatibility with
>> >> >> >> >> >>> >> > existing drivers. More generally, I think migrating internal
>> >> >> >> >> >>> >> > state that is not guest visible is always a mistake
>> >> >> >> >> >>> >> > as it ties migration format to an internal implementation
>> >> >> >> >> >>> >> > (yes, I know we do this sometimes, but we should at least
>> >> >> >> >> >>> >> > try not to add such cases). I think the right thing to do in this case
>> >> >> >> >> >>> >> > is to flush outstanding
>> >> >> >> >> >>> >> > work when vm is stopped. Then, we are guaranteed that inuse is 0.
>> >> >> >> >> >>> >> > I sent patches that do this for virtio net and block.
>> >> >> >> >> >>> >>
>> >> >> >> >> >>> >> Could you give me the link of your patches? I'd like to test
>> >> >> >> >> >>> >> whether they work with Kemari upon failover. If they do, I'm
>> >> >> >> >> >>> >> happy to drop this patch.
>> >> >> >> >> >>> >>
>> >> >> >> >> >>> >> Yoshi
>> >> >> >> >> >>> >
>> >> >> >> >> >>> > Look for this:
>> >> >> >> >> >>> > stable migration image on a stopped vm
>> >> >> >> >> >>> > sent on:
>> >> >> >> >> >>> > Wed, 24 Nov 2010 17:52:49 +0200
>> >> >> >> >> >>>
>> >> >> >> >> >>> Thanks for the info.
>> >> >> >> >> >>>
>> >> >> >> >> >>> However, The patch series above didn't solve the issue. In
>> >> >> >> >> >>> case of Kemari, inuse is mostly > 0 because it queues the
>> >> >> >> >> >>> output, and while last_avail_idx gets incremented
>> >> >> >> >> >>> immediately, not sending inuse makes the state inconsistent
>> >> >> >> >> >>> between Primary and Secondary.
>> >> >> >> >> >>
>> >> >> >> >> >> Hmm. Can we simply avoid incrementing last_avail_idx?
>> >> >> >> >> >
>> >> >> >> >> > I think we can calculate or prepare an internal last_avail_idx,
>> >> >> >> >> > and update the external when inuse is decremented. I'll try
>> >> >> >> >> > whether it work w/ w/o Kemari.
>> >> >> >> >>
>> >> >> >> >> Hi Michael,
>> >> >> >> >>
>> >> >> >> >> Could you please take a look at the following patch?
>> >> >> >> >
>> >> >> >> > Which version is this against?
>> >> >> >>
>> >> >> >> Oops. It should be very old.
>> >> >> >> 67f895bfe69f323b427b284430b6219c8a62e8d4
>> >> >> >>
>> >> >> >> >> commit 36ee7910059e6b236fe9467a609f5b4aed866912
>> >> >> >> >> Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >> >> >> >> Date: Thu Dec 16 14:50:54 2010 +0900
>> >> >> >> >>
>> >> >> >> >> virtio: update last_avail_idx when inuse is decreased.
>> >> >> >> >>
>> >> >> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >> >> >> >
>> >> >> >> > It would be better to have a commit description explaining why a change
>> >> >> >> > is made, and why it is correct, not just repeating what can be seen from
>> >> >> >> > the diff anyway.
>> >> >> >>
>> >> >> >> Sorry for being lazy here.
>> >> >> >>
>> >> >> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
>> >> >> >> >> index c8a0fc6..6688c02 100644
>> >> >> >> >> --- a/hw/virtio.c
>> >> >> >> >> +++ b/hw/virtio.c
>> >> >> >> >> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
>> >> >> >> >> wmb();
>> >> >> >> >> trace_virtqueue_flush(vq, count);
>> >> >> >> >> vring_used_idx_increment(vq, count);
>> >> >> >> >> + vq->last_avail_idx += count;
>> >> >> >> >> vq->inuse -= count;
>> >> >> >> >> }
>> >> >> >> >>
>> >> >> >> >> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>> >> >> >> >> unsigned int i, head, max;
>> >> >> >> >> target_phys_addr_t desc_pa = vq->vring.desc;
>> >> >> >> >>
>> >> >> >> >> - if (!virtqueue_num_heads(vq, vq->last_avail_idx))
>> >> >> >> >> + if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
>> >> >> >> >> return 0;
>> >> >> >> >>
>> >> >> >> >> /* When we start there are none of either input nor output. */
>> >> >> >> >> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
>> >> >> >> >>
>> >> >> >> >> max = vq->vring.num;
>> >> >> >> >>
>> >> >> >> >> - i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
>> >> >> >> >> + i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
>> >> >> >> >>
>> >> >> >> >> if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
>> >> >> >> >> if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
>> >> >> >> >>
>> >> >> >> >
>> >> >> >> > Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
>> >> >> >>
>> >> >> >> I think there are two problems.
>> >> >> >>
>> >> >> >> 1. When to update last_avail_idx.
>> >> >> >> 2. The ordering issue you're mentioning below.
>> >> >> >>
>> >> >> >> The patch above is only trying to address 1 because last time you
>> >> >> >> mentioned that modifying last_avail_idx upon save may break the
>> >> >> >> guest, which I agree. If virtio_queue_empty and
>> >> >> >> virtqueue_avail_bytes are only used internally, meaning invisible
>> >> >> >> to the guest, I guess the approach above can be applied too.
>> >> >> >
>> >> >> > So IMHO 2 is the real issue. This is what was problematic
>> >> >> > with the save patch, otherwise of course changes in save
>> >> >> > are better than changes all over the codebase.
>> >> >>
>> >> >> All right. Then let's focus on 2 first.
>> >> >>
>> >> >> >> > Previous patch version sure looked simpler, and this seems functionally
>> >> >> >> > equivalent, so my question still stands: here it is rephrased in a
>> >> >> >> > different way:
>> >> >> >> >
>> >> >> >> > assume that we have in avail ring 2 requests at start of ring: A and B in this order
>> >> >> >> >
>> >> >> >> > host pops A, then B, then completes B and flushes
>> >> >> >> >
>> >> >> >> > now with this patch last_avail_idx will be 1, and then
>> >> >> >> > remote will get it, it will execute B again. As a result
>> >> >> >> > B will complete twice, and apparently A will never complete.
>> >> >> >> >
>> >> >> >> >
>> >> >> >> > This is what I was saying below: assuming that there are
>> >> >> >> > outstanding requests when we migrate, there is no way
>> >> >> >> > a single index can be enough to figure out which requests
>> >> >> >> > need to be handled and which are in flight already.
>> >> >> >> >
>> >> >> >> > We must add some kind of bitmask to tell us which is which.
>> >> >> >>
>> >> >> >> I should understand why this inversion can happen before solving
>> >> >> >> the issue.
>> >> >> >
>> >> >> > It's a fundamental thing in virtio.
>> >> >> > I think it is currently only likely to happen with block, I think tap
>> >> >> > currently completes things in order. In any case relying on this in the
>> >> >> > frontend is a mistake.
>> >> >> >
>> >> >> >> Currently, how are you making virio-net to flush
>> >> >> >> every requests for live migration? Is it qemu_aio_flush()?
>> >> >> >
>> >> >> > Think so.
>> >> >>
>> >> >> If qemu_aio_flush() is responsible for flushing the outstanding
>> >> >> virtio-net requests, I'm wondering why it's a problem for Kemari.
>> >> >> As I described in the previous message, Kemari queues the
>> >> >> requests first. So in you example above, it should start with
>> >> >>
>> >> >> virtio-net: last_avai_idx 0 inuse 2
>> >> >> event-tap: {A,B}
>> >> >>
>> >> >> As you know, the requests are still in order still because net
>> >> >> layer initiates in order. Not about completing.
>> >> >>
>> >> >> In the first synchronization, the status above is transferred. In
>> >> >> the next synchronization, the status will be as following.
>> >> >>
>> >> >> virtio-net: last_avai_idx 1 inuse 1
>> >> >> event-tap: {B}
>> >> >
>> >> > OK, this answers the ordering question.
>> >>
>> >> Glad to hear that!
>> >>
>> >> > Another question: at this point we transfer this status: both
>> >> > event-tap and virtio ring have the command B,
>> >> > so the remote will have:
>> >> >
>> >> > virtio-net: inuse 0
>> >> > event-tap: {B}
>> >> >
>> >> > Is this right? This already seems to be a problem as when B completes
>> >> > inuse will go negative?
>> >>
>> >> I think state above is wrong. inuse 0 means there shouldn't be
>> >> any requests in event-tap. Note that the callback is called only
>> >> when event-tap flushes the requests.
>> >>
>> >> > Next it seems that the remote virtio will resubmit B to event-tap. The
>> >> > remote will then have:
>> >> >
>> >> > virtio-net: inuse 1
>> >> > event-tap: {B, B}
>> >> >
>> >> > This looks kind of wrong ... will two packets go out?
>> >>
>> >> No. Currently, we're just replaying the requests with pio/mmio.
>> >
>> > You do? What purpose do the hooks in bdrv/net serve then?
>> > A placeholder for the future?
>>
>> Not only for that reason. The hooks in bdrv/net are the main
>> functions that queue requests and start synchronization. The
>> pio/mmio hooks are there to record what initiated the requests
>> monitored in the bdrv/net layer. I would like to remove the
>> pio/mmio part once bdrv/net-level replay becomes possible.
>>
>> Yoshi
>
> I think I'm beginning to see. So when event-tap does a replay,
> we will probably need to pass the inuse value.
Completely correct.
> But since we generally don't try to support new->old
> cross-version migrations in qemu, my guess is that
> it is better not to change the format in anticipation
> right now.
I agree.
> So basically for now we just need to add a comment explaining
> the reason for moving last_avail_idx back.
> Does something like the below (completely untested) make sense?
Yes, it does. Thank you for adding a proper comment. Can I put
the patch into my series as is?
Yoshi
>
> Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>
> diff --git a/hw/virtio.c b/hw/virtio.c
> index 07dbf86..d1509f28 100644
> --- a/hw/virtio.c
> +++ b/hw/virtio.c
> @@ -665,12 +665,20 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> qemu_put_be32(f, i);
>
> for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
> + /* For regular migration inuse == 0 always as
> + * requests are flushed before save. However,
> + * event-tap log when enabled introduces an extra
> + * queue for requests which is not being flushed,
> + * thus the last inuse requests are left in the event-tap queue.
> + * Move the last_avail_idx value sent to the remote back
> + * to make it repeat the last inuse requests. */
> + uint16_t last_avail = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
> if (vdev->vq[i].vring.num == 0)
> break;
>
> qemu_put_be32(f, vdev->vq[i].vring.num);
> qemu_put_be64(f, vdev->vq[i].pa);
> - qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> + qemu_put_be16s(f, &last_avail);
> if (vdev->binding->save_queue)
> vdev->binding->save_queue(vdev->binding_opaque, i, f);
> }
>
>
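To make the arithmetic in the hunk above concrete, here is a minimal, self-contained model (plain C, not QEMU code); the two-request scenario is an assumption based on the A/B example discussed earlier in the thread:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    /* The primary has popped requests A and B (last_avail_idx == 2) but
     * neither has completed yet, so both are still held in the
     * event-tap queue (inuse == 2). */
    uint16_t last_avail_idx = 2;
    uint16_t inuse = 2;

    /* Value written to the migration stream by the hunk above. */
    uint16_t last_avail = last_avail_idx - inuse;

    printf("secondary resumes popping at avail index %u\n", last_avail);
    /* The secondary therefore re-pops indices 0 and 1 (A and B),
     * i.e. exactly the in-flight requests left in the event-tap queue. */
    return 0;
}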
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] Re: [PATCH 05/21] virtio: modify save/load handler to handle inuse varialble.
2010-12-26 12:16 ` Yoshiaki Tamura
@ 2010-12-26 12:17 ` Michael S. Tsirkin
0 siblings, 0 replies; 112+ messages in thread
From: Michael S. Tsirkin @ 2010-12-26 12:17 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: aliguori, dlaor, ananth, kvm, Marcelo Tosatti, ohmura.kei,
qemu-devel, avi, vatsa, psuriset, stefanha
On Sun, Dec 26, 2010 at 09:16:28PM +0900, Yoshiaki Tamura wrote:
> 2010/12/26 Michael S. Tsirkin <mst@redhat.com>:
> > On Sun, Dec 26, 2010 at 07:57:52PM +0900, Yoshiaki Tamura wrote:
> >> 2010/12/26 Michael S. Tsirkin <mst@redhat.com>:
> >> > On Fri, Dec 24, 2010 at 08:42:19PM +0900, Yoshiaki Tamura wrote:
> >> >> 2010/12/24 Michael S. Tsirkin <mst@redhat.com>:
> >> >> > On Fri, Dec 17, 2010 at 12:59:58AM +0900, Yoshiaki Tamura wrote:
> >> >> >> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> > On Thu, Dec 16, 2010 at 11:28:46PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >> 2010/12/16 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> >> > On Thu, Dec 16, 2010 at 04:36:16PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >> >> 2010/12/3 Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>:
> >> >> >> >> >> > 2010/12/2 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> >> >> >> On Wed, Dec 01, 2010 at 05:03:43PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >> >> >>> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> >> >> >>> > On Sun, Nov 28, 2010 at 08:27:58PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >> >> >>> >> 2010/11/28 Michael S. Tsirkin <mst@redhat.com>:
> >> >> >> >> >> >>> >> > On Thu, Nov 25, 2010 at 03:06:44PM +0900, Yoshiaki Tamura wrote:
> >> >> >> >> >> >>> >> >> Modify inuse type to uint16_t, let save/load to handle, and revert
> >> >> >> >> >> >>> >> >> last_avail_idx with inuse if there are outstanding emulation.
> >> >> >> >> >> >>> >> >>
> >> >> >> >> >> >>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> >> >> >> >>> >> >
> >> >> >> >> >> >>> >> > This changes migration format, so it will break compatibility with
> >> >> >> >> >> >>> >> > existing drivers. More generally, I think migrating internal
> >> >> >> >> >> >>> >> > state that is not guest visible is always a mistake
> >> >> >> >> >> >>> >> > as it ties migration format to an internal implementation
> >> >> >> >> >> >>> >> > (yes, I know we do this sometimes, but we should at least
> >> >> >> >> >> >>> >> > try not to add such cases). I think the right thing to do in this case
> >> >> >> >> >> >>> >> > is to flush outstanding
> >> >> >> >> >> >>> >> > work when vm is stopped. Then, we are guaranteed that inuse is 0.
> >> >> >> >> >> >>> >> > I sent patches that do this for virtio net and block.
> >> >> >> >> >> >>> >>
> >> >> >> >> >> >>> >> Could you give me the link of your patches? I'd like to test
> >> >> >> >> >> >>> >> whether they work with Kemari upon failover. If they do, I'm
> >> >> >> >> >> >>> >> happy to drop this patch.
> >> >> >> >> >> >>> >>
> >> >> >> >> >> >>> >> Yoshi
> >> >> >> >> >> >>> >
> >> >> >> >> >> >>> > Look for this:
> >> >> >> >> >> >>> > stable migration image on a stopped vm
> >> >> >> >> >> >>> > sent on:
> >> >> >> >> >> >>> > Wed, 24 Nov 2010 17:52:49 +0200
> >> >> >> >> >> >>>
> >> >> >> >> >> >>> Thanks for the info.
> >> >> >> >> >> >>>
> >> >> >> >> >> >>> However, The patch series above didn't solve the issue. In
> >> >> >> >> >> >>> case of Kemari, inuse is mostly > 0 because it queues the
> >> >> >> >> >> >>> output, and while last_avail_idx gets incremented
> >> >> >> >> >> >>> immediately, not sending inuse makes the state inconsistent
> >> >> >> >> >> >>> between Primary and Secondary.
> >> >> >> >> >> >>
> >> >> >> >> >> >> Hmm. Can we simply avoid incrementing last_avail_idx?
> >> >> >> >> >> >
> >> >> >> >> >> > I think we can calculate or prepare an internal last_avail_idx,
> >> >> >> >> >> > and update the external when inuse is decremented. I'll try
> >> >> >> >> >> > whether it work w/ w/o Kemari.
> >> >> >> >> >>
> >> >> >> >> >> Hi Michael,
> >> >> >> >> >>
> >> >> >> >> >> Could you please take a look at the following patch?
> >> >> >> >> >
> >> >> >> >> > Which version is this against?
> >> >> >> >>
> >> >> >> >> Oops. It should be very old.
> >> >> >> >> 67f895bfe69f323b427b284430b6219c8a62e8d4
> >> >> >> >>
> >> >> >> >> >> commit 36ee7910059e6b236fe9467a609f5b4aed866912
> >> >> >> >> >> Author: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> >> >> >> Date: Thu Dec 16 14:50:54 2010 +0900
> >> >> >> >> >>
> >> >> >> >> >> virtio: update last_avail_idx when inuse is decreased.
> >> >> >> >> >>
> >> >> >> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> >> >> >
> >> >> >> >> > It would be better to have a commit description explaining why a change
> >> >> >> >> > is made, and why it is correct, not just repeating what can be seen from
> >> >> >> >> > the diff anyway.
> >> >> >> >>
> >> >> >> >> Sorry for being lazy here.
> >> >> >> >>
> >> >> >> >> >> diff --git a/hw/virtio.c b/hw/virtio.c
> >> >> >> >> >> index c8a0fc6..6688c02 100644
> >> >> >> >> >> --- a/hw/virtio.c
> >> >> >> >> >> +++ b/hw/virtio.c
> >> >> >> >> >> @@ -237,6 +237,7 @@ void virtqueue_flush(VirtQueue *vq, unsigned int count)
> >> >> >> >> >> wmb();
> >> >> >> >> >> trace_virtqueue_flush(vq, count);
> >> >> >> >> >> vring_used_idx_increment(vq, count);
> >> >> >> >> >> + vq->last_avail_idx += count;
> >> >> >> >> >> vq->inuse -= count;
> >> >> >> >> >> }
> >> >> >> >> >>
> >> >> >> >> >> @@ -385,7 +386,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >> >> >> >> >> unsigned int i, head, max;
> >> >> >> >> >> target_phys_addr_t desc_pa = vq->vring.desc;
> >> >> >> >> >>
> >> >> >> >> >> - if (!virtqueue_num_heads(vq, vq->last_avail_idx))
> >> >> >> >> >> + if (!virtqueue_num_heads(vq, vq->last_avail_idx + vq->inuse))
> >> >> >> >> >> return 0;
> >> >> >> >> >>
> >> >> >> >> >> /* When we start there are none of either input nor output. */
> >> >> >> >> >> @@ -393,7 +394,7 @@ int virtqueue_pop(VirtQueue *vq, VirtQueueElement *elem)
> >> >> >> >> >>
> >> >> >> >> >> max = vq->vring.num;
> >> >> >> >> >>
> >> >> >> >> >> - i = head = virtqueue_get_head(vq, vq->last_avail_idx++);
> >> >> >> >> >> + i = head = virtqueue_get_head(vq, vq->last_avail_idx + vq->inuse);
> >> >> >> >> >>
> >> >> >> >> >> if (vring_desc_flags(desc_pa, i) & VRING_DESC_F_INDIRECT) {
> >> >> >> >> >> if (vring_desc_len(desc_pa, i) % sizeof(VRingDesc)) {
> >> >> >> >> >>
> >> >> >> >> >
> >> >> >> >> > Hmm, will virtio_queue_empty be wrong now? What about virtqueue_avail_bytes?
> >> >> >> >>
> >> >> >> >> I think there are two problems.
> >> >> >> >>
> >> >> >> >> 1. When to update last_avail_idx.
> >> >> >> >> 2. The ordering issue you're mentioning below.
> >> >> >> >>
> >> >> >> >> The patch above is only trying to address 1 because last time you
> >> >> >> >> mentioned that modifying last_avail_idx upon save may break the
> >> >> >> >> guest, which I agree. If virtio_queue_empty and
> >> >> >> >> virtqueue_avail_bytes are only used internally, meaning invisible
> >> >> >> >> to the guest, I guess the approach above can be applied too.
> >> >> >> >
> >> >> >> > So IMHO 2 is the real issue. This is what was problematic
> >> >> >> > with the save patch, otherwise of course changes in save
> >> >> >> > are better than changes all over the codebase.
> >> >> >>
> >> >> >> All right. Then let's focus on 2 first.
> >> >> >>
> >> >> >> >> > Previous patch version sure looked simpler, and this seems functionally
> >> >> >> >> > equivalent, so my question still stands: here it is rephrased in a
> >> >> >> >> > different way:
> >> >> >> >> >
> >> >> >> >> > assume that we have in avail ring 2 requests at start of ring: A and B in this order
> >> >> >> >> >
> >> >> >> >> > host pops A, then B, then completes B and flushes
> >> >> >> >> >
> >> >> >> >> > now with this patch last_avail_idx will be 1, and then
> >> >> >> >> > remote will get it, it will execute B again. As a result
> >> >> >> >> > B will complete twice, and apparently A will never complete.
> >> >> >> >> >
> >> >> >> >> >
> >> >> >> >> > This is what I was saying below: assuming that there are
> >> >> >> >> > outstanding requests when we migrate, there is no way
> >> >> >> >> > a single index can be enough to figure out which requests
> >> >> >> >> > need to be handled and which are in flight already.
> >> >> >> >> >
> >> >> >> >> > We must add some kind of bitmask to tell us which is which.
> >> >> >> >>
> >> >> >> >> I should understand why this inversion can happen before solving
> >> >> >> >> the issue.
> >> >> >> >
> >> >> >> > It's a fundamental thing in virtio.
> >> >> >> > I think it is currently only likely to happen with block, I think tap
> >> >> >> > currently completes things in order. In any case relying on this in the
> >> >> >> > frontend is a mistake.
> >> >> >> >
> >> >> >> >> Currently, how are you making virio-net to flush
> >> >> >> >> every requests for live migration? Is it qemu_aio_flush()?
> >> >> >> >
> >> >> >> > Think so.
> >> >> >>
> >> >> >> If qemu_aio_flush() is responsible for flushing the outstanding
> >> >> >> virtio-net requests, I'm wondering why it's a problem for Kemari.
> >> >> >> As I described in the previous message, Kemari queues the
> >> >> >> requests first. So in you example above, it should start with
> >> >> >>
> >> >> >> virtio-net: last_avai_idx 0 inuse 2
> >> >> >> event-tap: {A,B}
> >> >> >>
> >> >> >> As you know, the requests are still in order still because net
> >> >> >> layer initiates in order. Not about completing.
> >> >> >>
> >> >> >> In the first synchronization, the status above is transferred. In
> >> >> >> the next synchronization, the status will be as following.
> >> >> >>
> >> >> >> virtio-net: last_avai_idx 1 inuse 1
> >> >> >> event-tap: {B}
> >> >> >
> >> >> > OK, this answers the ordering question.
> >> >>
> >> >> Glad to hear that!
> >> >>
> >> >> > Another question: at this point we transfer this status: both
> >> >> > event-tap and virtio ring have the command B,
> >> >> > so the remote will have:
> >> >> >
> >> >> > virtio-net: inuse 0
> >> >> > event-tap: {B}
> >> >> >
> >> >> > Is this right? This already seems to be a problem as when B completes
> >> >> > inuse will go negative?
> >> >>
> >> >> I think state above is wrong. inuse 0 means there shouldn't be
> >> >> any requests in event-tap. Note that the callback is called only
> >> >> when event-tap flushes the requests.
> >> >>
> >> >> > Next it seems that the remote virtio will resubmit B to event-tap. The
> >> >> > remote will then have:
> >> >> >
> >> >> > virtio-net: inuse 1
> >> >> > event-tap: {B, B}
> >> >> >
> >> >> > This looks kind of wrong ... will two packets go out?
> >> >>
> >> >> No. Currently, we're just replaying the requests with pio/mmio.
> >> >
> >> > You do? What purpose do the hooks in bdrv/net serve then?
> >> > A placeholder for the future?
> >>
> >> Not only for that reason. The hooks in bdrv/net is the main
> >> function that queues requests and starts synchronization.
> >> pio/mmio hooks are there for recording what initiated the
> >> requests monitored in bdrv/net layer. I would like to remove
> >> pio/mmio part if we could make bdrv/net level replay is possible.
> >>
> >> Yoshi
> >
> > I think I begin see. So when event-tap does a replay,
> > we will probably need to pass the inuse value.
>
> Completely correct.
>
> > But since we generally don't try to support new->old
> > cross-version migrations in qemu, my guess is that
> > it is better not to change the format in anticipation
> > right now.
>
> I agree.
>
> > So basically for now we just need to add a comment explaining
> > the reason for moving last_avail_idx back.
> > Does something like the below (completely untested) make sense?
>
> Yes, it does. Thank you for putting a decent comment. Can I put
> the patch into my series as is?
>
> Yoshi
Sure.
> >
> > Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
> >
> > diff --git a/hw/virtio.c b/hw/virtio.c
> > index 07dbf86..d1509f28 100644
> > --- a/hw/virtio.c
> > +++ b/hw/virtio.c
> > @@ -665,12 +665,20 @@ void virtio_save(VirtIODevice *vdev, QEMUFile *f)
> > qemu_put_be32(f, i);
> >
> > for (i = 0; i < VIRTIO_PCI_QUEUE_MAX; i++) {
> > + /* For regular migration inuse == 0 always as
> > + * requests are flushed before save. However,
> > + * event-tap log when enabled introduces an extra
> > + * queue for requests which is not being flushed,
> > + * thus the last inuse requests are left in the event-tap queue.
> > + * Move the last_avail_idx value sent to the remote back
> > + * to make it repeat the last inuse requests. */
> > + uint16_t last_avail = vdev->vq[i].last_avail_idx - vdev->vq[i].inuse;
> > if (vdev->vq[i].vring.num == 0)
> > break;
> >
> > qemu_put_be32(f, vdev->vq[i].vring.num);
> > qemu_put_be64(f, vdev->vq[i].pa);
> > - qemu_put_be16s(f, &vdev->vq[i].last_avail_idx);
> > + qemu_put_be16s(f, &last_avail);
> > if (vdev->binding->save_queue)
> > vdev->binding->save_queue(vdev->binding_opaque, i, f);
> > }
> >
> >
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 09/21] Introduce event-tap.
2010-11-29 11:00 ` [Qemu-devel] " Stefan Hajnoczi
2010-11-30 9:50 ` Yoshiaki Tamura
@ 2011-01-04 11:02 ` Yoshiaki Tamura
2011-01-04 11:14 ` Stefan Hajnoczi
2011-01-04 11:19 ` Michael S. Tsirkin
1 sibling, 2 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2011-01-04 11:02 UTC (permalink / raw)
To: Stefan Hajnoczi
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, mtosatti, qemu-devel,
vatsa, Michael S. Tsirkin, avi, psuriset, stefanha
2010/11/29 Stefan Hajnoczi <stefanha@gmail.com>:
> On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura
> <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>> event-tap controls when to start FT transaction, and provides proxy
>> functions to called from net/block devices. While FT transaction, it
>> queues up net/block requests, and flush them when the transaction gets
>> completed.
>>
>> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
>> ---
>> Makefile.target | 1 +
>> block.h | 9 +
>> event-tap.c | 794 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> event-tap.h | 34 +++
>> net.h | 4 +
>> net/queue.c | 1 +
>> 6 files changed, 843 insertions(+), 0 deletions(-)
>> create mode 100644 event-tap.c
>> create mode 100644 event-tap.h
>
> event_tap_state is checked at the beginning of several functions. If
> there is an unexpected state the function silently returns. Should
> these checks really be assert() so there is an abort and backtrace if
> the program ever reaches this state?
>
>> +typedef struct EventTapBlkReq {
>> + char *device_name;
>> + int num_reqs;
>> + int num_cbs;
>> + bool is_multiwrite;
>
> Is multiwrite logging necessary? If event tap is called from within
> the block layer then multiwrite is turned into one or more
> bdrv_aio_writev() calls.
>
>> +static void event_tap_replay(void *opaque, int running, int reason)
>> +{
>> + EventTapLog *log, *next;
>> +
>> + if (!running) {
>> + return;
>> + }
>> +
>> + if (event_tap_state != EVENT_TAP_LOAD) {
>> + return;
>> + }
>> +
>> + event_tap_state = EVENT_TAP_REPLAY;
>> +
>> + QTAILQ_FOREACH(log, &event_list, node) {
>> + EventTapBlkReq *blk_req;
>> +
>> + /* event resume */
>> + switch (log->mode & ~EVENT_TAP_TYPE_MASK) {
>> + case EVENT_TAP_NET:
>> + event_tap_net_flush(&log->net_req);
>> + break;
>> + case EVENT_TAP_BLK:
>> + blk_req = &log->blk_req;
>> + if ((log->mode & EVENT_TAP_TYPE_MASK) == EVENT_TAP_IOPORT) {
>> + switch (log->ioport.index) {
>> + case 0:
>> + cpu_outb(log->ioport.address, log->ioport.data);
>> + break;
>> + case 1:
>> + cpu_outw(log->ioport.address, log->ioport.data);
>> + break;
>> + case 2:
>> + cpu_outl(log->ioport.address, log->ioport.data);
>> + break;
>> + }
>> + } else {
>> + /* EVENT_TAP_MMIO */
>> + cpu_physical_memory_rw(log->mmio.address,
>> + log->mmio.buf,
>> + log->mmio.len, 1);
>> + }
>> + break;
>
> Why are net tx packets replayed at the net level but blk requests are
> replayed at the pio/mmio level?
>
> I expected everything to replay either as pio/mmio or as net/block.
Stefan,
After doing some heavy load tests, I realized that we have to
take a hybrid approach to replay for now. This is because the
point at which a device moves to the next state (e.g. when virtio
decreases inuse) differs between net and block. For example,
virtio-net decreases inuse upon returning from the net layer, but
virtio-blk does that inside the callback. If we only use pio/mmio
replay, some net requests get lost even though event-tap tries to
replay them, because the device state has already advanced. This
doesn't happen with block, because the state is still old enough
to replay. Note that the hybrid approach won't cause duplicated
requests on the secondary.
Thanks,
Yoshi
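The timing difference described above can be illustrated with a small standalone model (plain C; the function bodies are deliberate simplifications, not the real virtio code paths):

#include <stdio.h>

static int net_inuse, blk_inuse;
static void (*pending_blk_cb)(void);

static void blk_complete(void) { blk_inuse--; }

static void net_tx(void)
{
    net_inuse++;                   /* virtqueue_pop() */
    /* synchronous send through the net layer ... */
    net_inuse--;                   /* completed right after the send returns */
}

static void blk_write(void)
{
    blk_inuse++;                   /* virtqueue_pop() */
    pending_blk_cb = blk_complete; /* aio write: completion deferred to a callback */
}

int main(void)
{
    net_tx();
    blk_write();
    printf("at sync time: net inuse=%d, blk inuse=%d\n", net_inuse, blk_inuse);
    /* Prints net inuse=0, blk inuse=1.  The block request can still be
     * re-driven by replaying the original pio/mmio write, but the net
     * request has already left the ring and must be replayed at the net
     * level from event-tap's copy. */
    if (pending_blk_cb) {
        pending_blk_cb();          /* would normally run later, from the aio callback */
    }
    return 0;
}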
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 09/21] Introduce event-tap.
2011-01-04 11:02 ` Yoshiaki Tamura
@ 2011-01-04 11:14 ` Stefan Hajnoczi
2011-01-04 11:19 ` Michael S. Tsirkin
1 sibling, 0 replies; 112+ messages in thread
From: Stefan Hajnoczi @ 2011-01-04 11:14 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, mtosatti, qemu-devel,
vatsa, Michael S. Tsirkin, avi, psuriset, stefanha
On Tue, Jan 4, 2011 at 11:02 AM, Yoshiaki Tamura
<tamura.yoshiaki@lab.ntt.co.jp> wrote:
> After doing some heavy load tests, I realized that we have to
> take a hybrid approach to replay for now. This is because when a
> device moves to the next state (e.g. virtio decreases inuse) is
> different between net and block. For example, virtio-net
> decreases inuse upon returning from the net layer, but virtio-blk
> does that inside of the callback. If we only use pio/mmio
> replay, even though event-tap tries to replay net requests, some
> get lost because the state has proceeded already. This doesn't
> happen with block, because the state is still old enough to
> replay. Note that using hybrid approach won't cause duplicated
> requests on the secondary.
Thanks Yoshi. I think I understand what you're saying.
Stefan
^ permalink raw reply [flat|nested] 112+ messages in thread
* [Qemu-devel] Re: [PATCH 09/21] Introduce event-tap.
2011-01-04 11:02 ` Yoshiaki Tamura
2011-01-04 11:14 ` Stefan Hajnoczi
@ 2011-01-04 11:19 ` Michael S. Tsirkin
2011-01-04 12:20 ` Yoshiaki Tamura
1 sibling, 1 reply; 112+ messages in thread
From: Michael S. Tsirkin @ 2011-01-04 11:19 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: aliguori, dlaor, ananth, kvm, ohmura.kei, Stefan Hajnoczi,
mtosatti, qemu-devel, vatsa, avi, psuriset, stefanha
On Tue, Jan 04, 2011 at 08:02:54PM +0900, Yoshiaki Tamura wrote:
> 2010/11/29 Stefan Hajnoczi <stefanha@gmail.com>:
> > On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura
> > <tamura.yoshiaki@lab.ntt.co.jp> wrote:
> >> event-tap controls when to start FT transaction, and provides proxy
> >> functions to called from net/block devices. While FT transaction, it
> >> queues up net/block requests, and flush them when the transaction gets
> >> completed.
> >>
> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
> >> ---
> >> Makefile.target | 1 +
> >> block.h | 9 +
> >> event-tap.c | 794 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> event-tap.h | 34 +++
> >> net.h | 4 +
> >> net/queue.c | 1 +
> >> 6 files changed, 843 insertions(+), 0 deletions(-)
> >> create mode 100644 event-tap.c
> >> create mode 100644 event-tap.h
> >
> > event_tap_state is checked at the beginning of several functions. If
> > there is an unexpected state the function silently returns. Should
> > these checks really be assert() so there is an abort and backtrace if
> > the program ever reaches this state?
> >
> >> +typedef struct EventTapBlkReq {
> >> + char *device_name;
> >> + int num_reqs;
> >> + int num_cbs;
> >> + bool is_multiwrite;
> >
> > Is multiwrite logging necessary? If event tap is called from within
> > the block layer then multiwrite is turned into one or more
> > bdrv_aio_writev() calls.
> >
> >> +static void event_tap_replay(void *opaque, int running, int reason)
> >> +{
> >> + EventTapLog *log, *next;
> >> +
> >> + if (!running) {
> >> + return;
> >> + }
> >> +
> >> + if (event_tap_state != EVENT_TAP_LOAD) {
> >> + return;
> >> + }
> >> +
> >> + event_tap_state = EVENT_TAP_REPLAY;
> >> +
> >> + QTAILQ_FOREACH(log, &event_list, node) {
> >> + EventTapBlkReq *blk_req;
> >> +
> >> + /* event resume */
> >> + switch (log->mode & ~EVENT_TAP_TYPE_MASK) {
> >> + case EVENT_TAP_NET:
> >> + event_tap_net_flush(&log->net_req);
> >> + break;
> >> + case EVENT_TAP_BLK:
> >> + blk_req = &log->blk_req;
> >> + if ((log->mode & EVENT_TAP_TYPE_MASK) == EVENT_TAP_IOPORT) {
> >> + switch (log->ioport.index) {
> >> + case 0:
> >> + cpu_outb(log->ioport.address, log->ioport.data);
> >> + break;
> >> + case 1:
> >> + cpu_outw(log->ioport.address, log->ioport.data);
> >> + break;
> >> + case 2:
> >> + cpu_outl(log->ioport.address, log->ioport.data);
> >> + break;
> >> + }
> >> + } else {
> >> + /* EVENT_TAP_MMIO */
> >> + cpu_physical_memory_rw(log->mmio.address,
> >> + log->mmio.buf,
> >> + log->mmio.len, 1);
> >> + }
> >> + break;
> >
> > Why are net tx packets replayed at the net level but blk requests are
> > replayed at the pio/mmio level?
> >
> > I expected everything to replay either as pio/mmio or as net/block.
>
> Stefan,
>
> After doing some heavy load tests, I realized that we have to
> take a hybrid approach to replay for now. This is because when a
> device moves to the next state (e.g. virtio decreases inuse) is
> different between net and block. For example, virtio-net
> decreases inuse upon returning from the net layer,
> but virtio-blk
> does that inside of the callback.
For TX, virtio-net calls virtqueue_push from virtio_net_tx_complete.
For RX, virtio-net calls virtqueue_flush from virtio_net_receive.
Both are invoked from a callback.
> If we only use pio/mmio
> replay, even though event-tap tries to replay net requests, some
> get lost because the state has proceeded already.
It seems that all you need to do to avoid this is to
delay the callback?
> This doesn't
> happen with block, because the state is still old enough to
> replay. Note that using hybrid approach won't cause duplicated
> requests on the secondary.
An assumption devices make is that a buffer is unused once the
completion callback has been invoked. Does this violate that assumption?
--
MST
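A rough sketch of what delaying the completion could look like, assuming hypothetical event-tap helpers and an illustrative callback type (this is only a suggestion, not existing code):

#include <stddef.h>
#include <sys/types.h>

typedef void SentCallback(void *opaque, ssize_t len);

typedef struct DeferredSend {
    SentCallback *cb;
    void *opaque;
} DeferredSend;

static DeferredSend deferred;

/* Instead of completing the send right away, remember the device's
 * completion callback... */
static void event_tap_defer_completion(SentCallback *cb, void *opaque)
{
    deferred.cb = cb;
    deferred.opaque = opaque;
}

/* ...and invoke it only once the FT transaction has finished, so the
 * virtio ring state (inuse, used index) does not advance before the
 * synchronization point. */
static void event_tap_transaction_done(ssize_t len)
{
    if (deferred.cb) {
        deferred.cb(deferred.opaque, len);
        deferred.cb = NULL;
    }
}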
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] Re: [PATCH 09/21] Introduce event-tap.
2011-01-04 11:19 ` Michael S. Tsirkin
@ 2011-01-04 12:20 ` Yoshiaki Tamura
2011-01-04 13:10 ` Michael S. Tsirkin
0 siblings, 1 reply; 112+ messages in thread
From: Yoshiaki Tamura @ 2011-01-04 12:20 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: aliguori, mtosatti, ananth, kvm, Stefan Hajnoczi, dlaor,
ohmura.kei, qemu-devel, avi, vatsa, psuriset, stefanha
2011/1/4 Michael S. Tsirkin <mst@redhat.com>:
> On Tue, Jan 04, 2011 at 08:02:54PM +0900, Yoshiaki Tamura wrote:
>> 2010/11/29 Stefan Hajnoczi <stefanha@gmail.com>:
>> > On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura
>> > <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>> >> event-tap controls when to start FT transaction, and provides proxy
>> >> functions to called from net/block devices. While FT transaction, it
>> >> queues up net/block requests, and flush them when the transaction gets
>> >> completed.
>> >>
>> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >> Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
>> >> ---
>> >> Makefile.target | 1 +
>> >> block.h | 9 +
>> >> event-tap.c | 794 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >> event-tap.h | 34 +++
>> >> net.h | 4 +
>> >> net/queue.c | 1 +
>> >> 6 files changed, 843 insertions(+), 0 deletions(-)
>> >> create mode 100644 event-tap.c
>> >> create mode 100644 event-tap.h
>> >
>> > event_tap_state is checked at the beginning of several functions. If
>> > there is an unexpected state the function silently returns. Should
>> > these checks really be assert() so there is an abort and backtrace if
>> > the program ever reaches this state?
>> >
>> >> +typedef struct EventTapBlkReq {
>> >> + char *device_name;
>> >> + int num_reqs;
>> >> + int num_cbs;
>> >> + bool is_multiwrite;
>> >
>> > Is multiwrite logging necessary? If event tap is called from within
>> > the block layer then multiwrite is turned into one or more
>> > bdrv_aio_writev() calls.
>> >
>> >> +static void event_tap_replay(void *opaque, int running, int reason)
>> >> +{
>> >> + EventTapLog *log, *next;
>> >> +
>> >> + if (!running) {
>> >> + return;
>> >> + }
>> >> +
>> >> + if (event_tap_state != EVENT_TAP_LOAD) {
>> >> + return;
>> >> + }
>> >> +
>> >> + event_tap_state = EVENT_TAP_REPLAY;
>> >> +
>> >> + QTAILQ_FOREACH(log, &event_list, node) {
>> >> + EventTapBlkReq *blk_req;
>> >> +
>> >> + /* event resume */
>> >> + switch (log->mode & ~EVENT_TAP_TYPE_MASK) {
>> >> + case EVENT_TAP_NET:
>> >> + event_tap_net_flush(&log->net_req);
>> >> + break;
>> >> + case EVENT_TAP_BLK:
>> >> + blk_req = &log->blk_req;
>> >> + if ((log->mode & EVENT_TAP_TYPE_MASK) == EVENT_TAP_IOPORT) {
>> >> + switch (log->ioport.index) {
>> >> + case 0:
>> >> + cpu_outb(log->ioport.address, log->ioport.data);
>> >> + break;
>> >> + case 1:
>> >> + cpu_outw(log->ioport.address, log->ioport.data);
>> >> + break;
>> >> + case 2:
>> >> + cpu_outl(log->ioport.address, log->ioport.data);
>> >> + break;
>> >> + }
>> >> + } else {
>> >> + /* EVENT_TAP_MMIO */
>> >> + cpu_physical_memory_rw(log->mmio.address,
>> >> + log->mmio.buf,
>> >> + log->mmio.len, 1);
>> >> + }
>> >> + break;
>> >
>> > Why are net tx packets replayed at the net level but blk requests are
>> > replayed at the pio/mmio level?
>> >
>> > I expected everything to replay either as pio/mmio or as net/block.
>>
>> Stefan,
>>
>> After doing some heavy load tests, I realized that we have to
>> take a hybrid approach to replay for now. This is because when a
>> device moves to the next state (e.g. virtio decreases inuse) is
>> different between net and block. For example, virtio-net
>> decreases inuse upon returning from the net layer,
>> but virtio-blk
>> does that inside of the callback.
>
> For TX, virtio-net calls virtqueue_push from virtio_net_tx_complete.
> For RX, virtio-net calls virtqueue_flush from virtio_net_receive.
> Both are invoked from a callback.
>
>> If we only use pio/mmio
>> replay, even though event-tap tries to replay net requests, some
>> get lost because the state has proceeded already.
>
> It seems that all you need to do to avoid this is to
> delay the callback?
Yeah, if that's possible. But if you take a look at virtio-net,
you'll see that virtqueue_push is called immediately after calling
qemu_sendv_packet, while virtio-blk does that in the callback.
>
>> This doesn't
>> happen with block, because the state is still old enough to
>> replay. Note that using hybrid approach won't cause duplicated
>> requests on the secondary.
>
> An assumption devices make is that a buffer is unused once
> completion callback was invoked. Does this violate that assumption?
No, it shouldn't. In the case of net with net-level replay, we copy
the contents of the requests, and in the case of block, because we
haven't called the callback yet, the requests remain fresh.
Yoshi
>
> --
> MST
>
>
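A minimal sketch of the kind of copy described above, assuming illustrative names and omitting error handling (this is not the event-tap implementation):

#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

typedef struct LoggedPacket {
    size_t len;
    uint8_t *data;
} LoggedPacket;

/* Flatten a scatter-gather request into a private copy so it can be
 * replayed at the net level later, even after the device has reused
 * its own buffers. */
static LoggedPacket *log_packet(const struct iovec *iov, int iovcnt)
{
    LoggedPacket *pkt = malloc(sizeof(*pkt));
    size_t off = 0;
    int i;

    pkt->len = 0;
    for (i = 0; i < iovcnt; i++) {
        pkt->len += iov[i].iov_len;
    }
    pkt->data = malloc(pkt->len);
    for (i = 0; i < iovcnt; i++) {
        memcpy(pkt->data + off, iov[i].iov_base, iov[i].iov_len);
        off += iov[i].iov_len;
    }
    return pkt;
}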
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] Re: [PATCH 09/21] Introduce event-tap.
2011-01-04 12:20 ` Yoshiaki Tamura
@ 2011-01-04 13:10 ` Michael S. Tsirkin
2011-01-04 13:45 ` Yoshiaki Tamura
0 siblings, 1 reply; 112+ messages in thread
From: Michael S. Tsirkin @ 2011-01-04 13:10 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: aliguori, mtosatti, ananth, kvm, Stefan Hajnoczi, dlaor,
ohmura.kei, qemu-devel, avi, vatsa, psuriset, stefanha
On Tue, Jan 04, 2011 at 09:20:53PM +0900, Yoshiaki Tamura wrote:
> 2011/1/4 Michael S. Tsirkin <mst@redhat.com>:
> > On Tue, Jan 04, 2011 at 08:02:54PM +0900, Yoshiaki Tamura wrote:
> >> 2010/11/29 Stefan Hajnoczi <stefanha@gmail.com>:
> >> > On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura
> >> > <tamura.yoshiaki@lab.ntt.co.jp> wrote:
> >> >> event-tap controls when to start FT transaction, and provides proxy
> >> >> functions to called from net/block devices. While FT transaction, it
> >> >> queues up net/block requests, and flush them when the transaction gets
> >> >> completed.
> >> >>
> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
> >> >> ---
> >> >> Makefile.target | 1 +
> >> >> block.h | 9 +
> >> >> event-tap.c | 794 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> >> event-tap.h | 34 +++
> >> >> net.h | 4 +
> >> >> net/queue.c | 1 +
> >> >> 6 files changed, 843 insertions(+), 0 deletions(-)
> >> >> create mode 100644 event-tap.c
> >> >> create mode 100644 event-tap.h
> >> >
> >> > event_tap_state is checked at the beginning of several functions. If
> >> > there is an unexpected state the function silently returns. Should
> >> > these checks really be assert() so there is an abort and backtrace if
> >> > the program ever reaches this state?
> >> >
> >> >> +typedef struct EventTapBlkReq {
> >> >> + char *device_name;
> >> >> + int num_reqs;
> >> >> + int num_cbs;
> >> >> + bool is_multiwrite;
> >> >
> >> > Is multiwrite logging necessary? If event tap is called from within
> >> > the block layer then multiwrite is turned into one or more
> >> > bdrv_aio_writev() calls.
> >> >
> >> >> +static void event_tap_replay(void *opaque, int running, int reason)
> >> >> +{
> >> >> + EventTapLog *log, *next;
> >> >> +
> >> >> + if (!running) {
> >> >> + return;
> >> >> + }
> >> >> +
> >> >> + if (event_tap_state != EVENT_TAP_LOAD) {
> >> >> + return;
> >> >> + }
> >> >> +
> >> >> + event_tap_state = EVENT_TAP_REPLAY;
> >> >> +
> >> >> + QTAILQ_FOREACH(log, &event_list, node) {
> >> >> + EventTapBlkReq *blk_req;
> >> >> +
> >> >> + /* event resume */
> >> >> + switch (log->mode & ~EVENT_TAP_TYPE_MASK) {
> >> >> + case EVENT_TAP_NET:
> >> >> + event_tap_net_flush(&log->net_req);
> >> >> + break;
> >> >> + case EVENT_TAP_BLK:
> >> >> + blk_req = &log->blk_req;
> >> >> + if ((log->mode & EVENT_TAP_TYPE_MASK) == EVENT_TAP_IOPORT) {
> >> >> + switch (log->ioport.index) {
> >> >> + case 0:
> >> >> + cpu_outb(log->ioport.address, log->ioport.data);
> >> >> + break;
> >> >> + case 1:
> >> >> + cpu_outw(log->ioport.address, log->ioport.data);
> >> >> + break;
> >> >> + case 2:
> >> >> + cpu_outl(log->ioport.address, log->ioport.data);
> >> >> + break;
> >> >> + }
> >> >> + } else {
> >> >> + /* EVENT_TAP_MMIO */
> >> >> + cpu_physical_memory_rw(log->mmio.address,
> >> >> + log->mmio.buf,
> >> >> + log->mmio.len, 1);
> >> >> + }
> >> >> + break;
> >> >
> >> > Why are net tx packets replayed at the net level but blk requests are
> >> > replayed at the pio/mmio level?
> >> >
> >> > I expected everything to replay either as pio/mmio or as net/block.
> >>
> >> Stefan,
> >>
> >> After doing some heavy load tests, I realized that we have to
> >> take a hybrid approach to replay for now. This is because when a
> >> device moves to the next state (e.g. virtio decreases inuse) is
> >> different between net and block. For example, virtio-net
> >> decreases inuse upon returning from the net layer,
> >> but virtio-blk
> >> does that inside of the callback.
> >
> > For TX, virtio-net calls virtqueue_push from virtio_net_tx_complete.
> > For RX, virtio-net calls virtqueue_flush from virtio_net_receive.
> > Both are invoked from a callback.
> >
> >> If we only use pio/mmio
> >> replay, even though event-tap tries to replay net requests, some
> >> get lost because the state has proceeded already.
> >
> > It seems that all you need to do to avoid this is to
> > delay the callback?
>
> Yeah, if it's possible. But if you take a look at virtio-net,
> you'll see that virtio_push is called immediately after calling
> qemu_sendv_packet
> while virtio-blk does that in the callback.
This is only if the packet was sent immediately.
I was referring to the case where the packet is queued.
> >
> >> This doesn't
> >> happen with block, because the state is still old enough to
> >> replay. Note that using hybrid approach won't cause duplicated
> >> requests on the secondary.
> >
> > An assumption devices make is that a buffer is unused once
> > completion callback was invoked. Does this violate that assumption?
>
> No, it shouldn't. In case of net with net layer replay, we copy
> the content of the requests, and in case of block, because we
> haven't called the callback yet, the requests remains fresh.
>
> Yoshi
>
Yes, as long as you copy, it should be fine. Maybe it's a good idea for
event-tap to queue all packets to avoid the copy and avoid the need to
replay at the net level.
> >
> > --
> > MST
> >
> >
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] Re: [PATCH 09/21] Introduce event-tap.
2011-01-04 13:10 ` Michael S. Tsirkin
@ 2011-01-04 13:45 ` Yoshiaki Tamura
2011-01-04 14:42 ` Michael S. Tsirkin
0 siblings, 1 reply; 112+ messages in thread
From: Yoshiaki Tamura @ 2011-01-04 13:45 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: aliguori, dlaor, ananth, kvm, Stefan Hajnoczi, mtosatti,
ohmura.kei, qemu-devel, avi, vatsa, psuriset, stefanha
2011/1/4 Michael S. Tsirkin <mst@redhat.com>:
> On Tue, Jan 04, 2011 at 09:20:53PM +0900, Yoshiaki Tamura wrote:
>> 2011/1/4 Michael S. Tsirkin <mst@redhat.com>:
>> > On Tue, Jan 04, 2011 at 08:02:54PM +0900, Yoshiaki Tamura wrote:
>> >> 2010/11/29 Stefan Hajnoczi <stefanha@gmail.com>:
>> >> > On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura
>> >> > <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>> >> >> event-tap controls when to start FT transaction, and provides proxy
>> >> >> functions to called from net/block devices. While FT transaction, it
>> >> >> queues up net/block requests, and flush them when the transaction gets
>> >> >> completed.
>> >> >>
>> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >> >> Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
>> >> >> ---
>> >> >> Makefile.target | 1 +
>> >> >> block.h | 9 +
>> >> >> event-tap.c | 794 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >> >> event-tap.h | 34 +++
>> >> >> net.h | 4 +
>> >> >> net/queue.c | 1 +
>> >> >> 6 files changed, 843 insertions(+), 0 deletions(-)
>> >> >> create mode 100644 event-tap.c
>> >> >> create mode 100644 event-tap.h
>> >> >
>> >> > event_tap_state is checked at the beginning of several functions. If
>> >> > there is an unexpected state the function silently returns. Should
>> >> > these checks really be assert() so there is an abort and backtrace if
>> >> > the program ever reaches this state?
>> >> >
>> >> >> +typedef struct EventTapBlkReq {
>> >> >> + char *device_name;
>> >> >> + int num_reqs;
>> >> >> + int num_cbs;
>> >> >> + bool is_multiwrite;
>> >> >
>> >> > Is multiwrite logging necessary? If event tap is called from within
>> >> > the block layer then multiwrite is turned into one or more
>> >> > bdrv_aio_writev() calls.
>> >> >
>> >> >> +static void event_tap_replay(void *opaque, int running, int reason)
>> >> >> +{
>> >> >> + EventTapLog *log, *next;
>> >> >> +
>> >> >> + if (!running) {
>> >> >> + return;
>> >> >> + }
>> >> >> +
>> >> >> + if (event_tap_state != EVENT_TAP_LOAD) {
>> >> >> + return;
>> >> >> + }
>> >> >> +
>> >> >> + event_tap_state = EVENT_TAP_REPLAY;
>> >> >> +
>> >> >> + QTAILQ_FOREACH(log, &event_list, node) {
>> >> >> + EventTapBlkReq *blk_req;
>> >> >> +
>> >> >> + /* event resume */
>> >> >> + switch (log->mode & ~EVENT_TAP_TYPE_MASK) {
>> >> >> + case EVENT_TAP_NET:
>> >> >> + event_tap_net_flush(&log->net_req);
>> >> >> + break;
>> >> >> + case EVENT_TAP_BLK:
>> >> >> + blk_req = &log->blk_req;
>> >> >> + if ((log->mode & EVENT_TAP_TYPE_MASK) == EVENT_TAP_IOPORT) {
>> >> >> + switch (log->ioport.index) {
>> >> >> + case 0:
>> >> >> + cpu_outb(log->ioport.address, log->ioport.data);
>> >> >> + break;
>> >> >> + case 1:
>> >> >> + cpu_outw(log->ioport.address, log->ioport.data);
>> >> >> + break;
>> >> >> + case 2:
>> >> >> + cpu_outl(log->ioport.address, log->ioport.data);
>> >> >> + break;
>> >> >> + }
>> >> >> + } else {
>> >> >> + /* EVENT_TAP_MMIO */
>> >> >> + cpu_physical_memory_rw(log->mmio.address,
>> >> >> + log->mmio.buf,
>> >> >> + log->mmio.len, 1);
>> >> >> + }
>> >> >> + break;
>> >> >
>> >> > Why are net tx packets replayed at the net level but blk requests are
>> >> > replayed at the pio/mmio level?
>> >> >
>> >> > I expected everything to replay either as pio/mmio or as net/block.
>> >>
>> >> Stefan,
>> >>
>> >> After doing some heavy load tests, I realized that we have to
>> >> take a hybrid approach to replay for now. This is because when a
>> >> device moves to the next state (e.g. virtio decreases inuse) is
>> >> different between net and block. For example, virtio-net
>> >> decreases inuse upon returning from the net layer,
>> >> but virtio-blk
>> >> does that inside of the callback.
>> >
>> > For TX, virtio-net calls virtqueue_push from virtio_net_tx_complete.
>> > For RX, virtio-net calls virtqueue_flush from virtio_net_receive.
>> > Both are invoked from a callback.
>> >
>> >> If we only use pio/mmio
>> >> replay, even though event-tap tries to replay net requests, some
>> >> get lost because the state has proceeded already.
>> >
>> > It seems that all you need to do to avoid this is to
>> > delay the callback?
>>
>> Yeah, if it's possible. But if you take a look at virtio-net,
>> you'll see that virtio_push is called immediately after calling
>> qemu_sendv_packet
>> while virtio-blk does that in the callback.
>
> This is only if the packet was sent immediately.
> I was referring to the case where the packet is queued.
I see. I usually don't see packets get queued in the net layer.
What would the effect be on devices? Would they refrain from sending packets?
>
>> >
>> >> This doesn't
>> >> happen with block, because the state is still old enough to
>> >> replay. Note that using hybrid approach won't cause duplicated
>> >> requests on the secondary.
>> >
>> > An assumption devices make is that a buffer is unused once
>> > completion callback was invoked. Does this violate that assumption?
>>
>> No, it shouldn't. In case of net with net layer replay, we copy
>> the content of the requests, and in case of block, because we
>> haven't called the callback yet, the requests remains fresh.
>>
>> Yoshi
>>
>
> Yes, as long as you copy it should be fine. Maybe it's a good idea for
> event-tap to queue all packets to avoid the copy and avoid the need to
> replay at the net level.
If queuing works fine for the devices, it seems to be a good
idea. I think the ordering issue still doesn't arise.
Yoshi
>
>> >
>> > --
>> > MST
>> >
>> >
>
>
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] Re: [PATCH 09/21] Introduce event-tap.
2011-01-04 13:45 ` Yoshiaki Tamura
@ 2011-01-04 14:42 ` Michael S. Tsirkin
2011-01-06 8:47 ` Yoshiaki Tamura
0 siblings, 1 reply; 112+ messages in thread
From: Michael S. Tsirkin @ 2011-01-04 14:42 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: aliguori, dlaor, ananth, kvm, Stefan Hajnoczi, mtosatti,
ohmura.kei, qemu-devel, avi, vatsa, psuriset, stefanha
On Tue, Jan 04, 2011 at 10:45:13PM +0900, Yoshiaki Tamura wrote:
> 2011/1/4 Michael S. Tsirkin <mst@redhat.com>:
> > On Tue, Jan 04, 2011 at 09:20:53PM +0900, Yoshiaki Tamura wrote:
> >> 2011/1/4 Michael S. Tsirkin <mst@redhat.com>:
> >> > On Tue, Jan 04, 2011 at 08:02:54PM +0900, Yoshiaki Tamura wrote:
> >> >> 2010/11/29 Stefan Hajnoczi <stefanha@gmail.com>:
> >> >> > On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura
> >> >> > <tamura.yoshiaki@lab.ntt.co.jp> wrote:
> >> >> >> event-tap controls when to start FT transaction, and provides proxy
> >> >> >> functions to called from net/block devices. While FT transaction, it
> >> >> >> queues up net/block requests, and flush them when the transaction gets
> >> >> >> completed.
> >> >> >>
> >> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> >> Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
> >> >> >> ---
> >> >> >> Makefile.target | 1 +
> >> >> >> block.h | 9 +
> >> >> >> event-tap.c | 794 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> >> >> event-tap.h | 34 +++
> >> >> >> net.h | 4 +
> >> >> >> net/queue.c | 1 +
> >> >> >> 6 files changed, 843 insertions(+), 0 deletions(-)
> >> >> >> create mode 100644 event-tap.c
> >> >> >> create mode 100644 event-tap.h
> >> >> >
> >> >> > event_tap_state is checked at the beginning of several functions. If
> >> >> > there is an unexpected state the function silently returns. Should
> >> >> > these checks really be assert() so there is an abort and backtrace if
> >> >> > the program ever reaches this state?
> >> >> >
> >> >> >> +typedef struct EventTapBlkReq {
> >> >> >> + char *device_name;
> >> >> >> + int num_reqs;
> >> >> >> + int num_cbs;
> >> >> >> + bool is_multiwrite;
> >> >> >
> >> >> > Is multiwrite logging necessary? If event tap is called from within
> >> >> > the block layer then multiwrite is turned into one or more
> >> >> > bdrv_aio_writev() calls.
> >> >> >
> >> >> >> +static void event_tap_replay(void *opaque, int running, int reason)
> >> >> >> +{
> >> >> >> + EventTapLog *log, *next;
> >> >> >> +
> >> >> >> + if (!running) {
> >> >> >> + return;
> >> >> >> + }
> >> >> >> +
> >> >> >> + if (event_tap_state != EVENT_TAP_LOAD) {
> >> >> >> + return;
> >> >> >> + }
> >> >> >> +
> >> >> >> + event_tap_state = EVENT_TAP_REPLAY;
> >> >> >> +
> >> >> >> + QTAILQ_FOREACH(log, &event_list, node) {
> >> >> >> + EventTapBlkReq *blk_req;
> >> >> >> +
> >> >> >> + /* event resume */
> >> >> >> + switch (log->mode & ~EVENT_TAP_TYPE_MASK) {
> >> >> >> + case EVENT_TAP_NET:
> >> >> >> + event_tap_net_flush(&log->net_req);
> >> >> >> + break;
> >> >> >> + case EVENT_TAP_BLK:
> >> >> >> + blk_req = &log->blk_req;
> >> >> >> + if ((log->mode & EVENT_TAP_TYPE_MASK) == EVENT_TAP_IOPORT) {
> >> >> >> + switch (log->ioport.index) {
> >> >> >> + case 0:
> >> >> >> + cpu_outb(log->ioport.address, log->ioport.data);
> >> >> >> + break;
> >> >> >> + case 1:
> >> >> >> + cpu_outw(log->ioport.address, log->ioport.data);
> >> >> >> + break;
> >> >> >> + case 2:
> >> >> >> + cpu_outl(log->ioport.address, log->ioport.data);
> >> >> >> + break;
> >> >> >> + }
> >> >> >> + } else {
> >> >> >> + /* EVENT_TAP_MMIO */
> >> >> >> + cpu_physical_memory_rw(log->mmio.address,
> >> >> >> + log->mmio.buf,
> >> >> >> + log->mmio.len, 1);
> >> >> >> + }
> >> >> >> + break;
> >> >> >
> >> >> > Why are net tx packets replayed at the net level but blk requests are
> >> >> > replayed at the pio/mmio level?
> >> >> >
> >> >> > I expected everything to replay either as pio/mmio or as net/block.
> >> >>
> >> >> Stefan,
> >> >>
> >> >> After doing some heavy load tests, I realized that we have to
> >> >> take a hybrid approach to replay for now. This is because when a
> >> >> device moves to the next state (e.g. virtio decreases inuse) is
> >> >> different between net and block. For example, virtio-net
> >> >> decreases inuse upon returning from the net layer,
> >> >> but virtio-blk
> >> >> does that inside of the callback.
> >> >
> >> > For TX, virtio-net calls virtqueue_push from virtio_net_tx_complete.
> >> > For RX, virtio-net calls virtqueue_flush from virtio_net_receive.
> >> > Both are invoked from a callback.
> >> >
> >> >> If we only use pio/mmio
> >> >> replay, even though event-tap tries to replay net requests, some
> >> >> get lost because the state has proceeded already.
> >> >
> >> > It seems that all you need to do to avoid this is to
> >> > delay the callback?
> >>
> >> Yeah, if it's possible. But if you take a look at virtio-net,
> >> you'll see that virtio_push is called immediately after calling
> >> qemu_sendv_packet
> >> while virtio-blk does that in the callback.
> >
> > This is only if the packet was sent immediately.
> > I was referring to the case where the packet is queued.
>
> I see. I usually don't see packets get queued in the net layer.
> What would be the effect to devices? Restraint sending packets?
Yes.
> >
> >> >
> >> >> This doesn't
> >> >> happen with block, because the state is still old enough to
> >> >> replay. Note that using hybrid approach won't cause duplicated
> >> >> requests on the secondary.
> >> >
> >> > An assumption devices make is that a buffer is unused once
> >> > completion callback was invoked. Does this violate that assumption?
> >>
> >> No, it shouldn't. In case of net with net layer replay, we copy
> >> the content of the requests, and in case of block, because we
> >> haven't called the callback yet, the requests remains fresh.
> >>
> >> Yoshi
> >>
> >
> > Yes, as long as you copy it should be fine. Maybe it's a good idea for
> > event-tap to queue all packets to avoid the copy and avoid the need to
> > replay at the net level.
>
> If queuing works fine for the devices, it seems to be a good
> idea. I think the ordering issue doesn't happen still.
>
> Yoshi
If you replay at both the net and pio levels, it becomes complex.
Maybe it's ok, but it is certainly harder to reason about.
> >
> >> >
> >> > --
> >> > MST
> >> >
> >> >
> >
> >
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] Re: [PATCH 09/21] Introduce event-tap.
2011-01-04 14:42 ` Michael S. Tsirkin
@ 2011-01-06 8:47 ` Yoshiaki Tamura
2011-01-06 9:36 ` Michael S. Tsirkin
0 siblings, 1 reply; 112+ messages in thread
From: Yoshiaki Tamura @ 2011-01-06 8:47 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: aliguori, mtosatti, ananth, kvm, Stefan Hajnoczi, dlaor,
ohmura.kei, qemu-devel, avi, vatsa, psuriset, stefanha
2011/1/4 Michael S. Tsirkin <mst@redhat.com>:
> On Tue, Jan 04, 2011 at 10:45:13PM +0900, Yoshiaki Tamura wrote:
>> 2011/1/4 Michael S. Tsirkin <mst@redhat.com>:
>> > On Tue, Jan 04, 2011 at 09:20:53PM +0900, Yoshiaki Tamura wrote:
>> >> 2011/1/4 Michael S. Tsirkin <mst@redhat.com>:
>> >> > On Tue, Jan 04, 2011 at 08:02:54PM +0900, Yoshiaki Tamura wrote:
>> >> >> 2010/11/29 Stefan Hajnoczi <stefanha@gmail.com>:
>> >> >> > On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura
>> >> >> > <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>> >> >> >> event-tap controls when to start FT transaction, and provides proxy
>> >> >> >> functions to called from net/block devices. While FT transaction, it
>> >> >> >> queues up net/block requests, and flush them when the transaction gets
>> >> >> >> completed.
>> >> >> >>
>> >> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >> >> >> Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
>> >> >> >> ---
>> >> >> >> Makefile.target | 1 +
>> >> >> >> block.h | 9 +
>> >> >> >> event-tap.c | 794 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >> >> >> event-tap.h | 34 +++
>> >> >> >> net.h | 4 +
>> >> >> >> net/queue.c | 1 +
>> >> >> >> 6 files changed, 843 insertions(+), 0 deletions(-)
>> >> >> >> create mode 100644 event-tap.c
>> >> >> >> create mode 100644 event-tap.h
>> >> >> >
>> >> >> > event_tap_state is checked at the beginning of several functions. If
>> >> >> > there is an unexpected state the function silently returns. Should
>> >> >> > these checks really be assert() so there is an abort and backtrace if
>> >> >> > the program ever reaches this state?
>> >> >> >
>> >> >> >> +typedef struct EventTapBlkReq {
>> >> >> >> + char *device_name;
>> >> >> >> + int num_reqs;
>> >> >> >> + int num_cbs;
>> >> >> >> + bool is_multiwrite;
>> >> >> >
>> >> >> > Is multiwrite logging necessary? If event tap is called from within
>> >> >> > the block layer then multiwrite is turned into one or more
>> >> >> > bdrv_aio_writev() calls.
>> >> >> >
>> >> >> >> +static void event_tap_replay(void *opaque, int running, int reason)
>> >> >> >> +{
>> >> >> >> + EventTapLog *log, *next;
>> >> >> >> +
>> >> >> >> + if (!running) {
>> >> >> >> + return;
>> >> >> >> + }
>> >> >> >> +
>> >> >> >> + if (event_tap_state != EVENT_TAP_LOAD) {
>> >> >> >> + return;
>> >> >> >> + }
>> >> >> >> +
>> >> >> >> + event_tap_state = EVENT_TAP_REPLAY;
>> >> >> >> +
>> >> >> >> + QTAILQ_FOREACH(log, &event_list, node) {
>> >> >> >> + EventTapBlkReq *blk_req;
>> >> >> >> +
>> >> >> >> + /* event resume */
>> >> >> >> + switch (log->mode & ~EVENT_TAP_TYPE_MASK) {
>> >> >> >> + case EVENT_TAP_NET:
>> >> >> >> + event_tap_net_flush(&log->net_req);
>> >> >> >> + break;
>> >> >> >> + case EVENT_TAP_BLK:
>> >> >> >> + blk_req = &log->blk_req;
>> >> >> >> + if ((log->mode & EVENT_TAP_TYPE_MASK) == EVENT_TAP_IOPORT) {
>> >> >> >> + switch (log->ioport.index) {
>> >> >> >> + case 0:
>> >> >> >> + cpu_outb(log->ioport.address, log->ioport.data);
>> >> >> >> + break;
>> >> >> >> + case 1:
>> >> >> >> + cpu_outw(log->ioport.address, log->ioport.data);
>> >> >> >> + break;
>> >> >> >> + case 2:
>> >> >> >> + cpu_outl(log->ioport.address, log->ioport.data);
>> >> >> >> + break;
>> >> >> >> + }
>> >> >> >> + } else {
>> >> >> >> + /* EVENT_TAP_MMIO */
>> >> >> >> + cpu_physical_memory_rw(log->mmio.address,
>> >> >> >> + log->mmio.buf,
>> >> >> >> + log->mmio.len, 1);
>> >> >> >> + }
>> >> >> >> + break;
>> >> >> >
>> >> >> > Why are net tx packets replayed at the net level but blk requests are
>> >> >> > replayed at the pio/mmio level?
>> >> >> >
>> >> >> > I expected everything to replay either as pio/mmio or as net/block.
>> >> >>
>> >> >> Stefan,
>> >> >>
>> >> >> After doing some heavy load tests, I realized that we have to
>> >> >> take a hybrid approach to replay for now. This is because when a
>> >> >> device moves to the next state (e.g. virtio decreases inuse) is
>> >> >> different between net and block. For example, virtio-net
>> >> >> decreases inuse upon returning from the net layer,
>> >> >> but virtio-blk
>> >> >> does that inside of the callback.
>> >> >
>> >> > For TX, virtio-net calls virtqueue_push from virtio_net_tx_complete.
>> >> > For RX, virtio-net calls virtqueue_flush from virtio_net_receive.
>> >> > Both are invoked from a callback.
>> >> >
>> >> >> If we only use pio/mmio
>> >> >> replay, even though event-tap tries to replay net requests, some
>> >> >> get lost because the state has proceeded already.
>> >> >
>> >> > It seems that all you need to do to avoid this is to
>> >> > delay the callback?
>> >>
>> >> Yeah, if it's possible. But if you take a look at virtio-net,
>> >> you'll see that virtio_push is called immediately after calling
>> >> qemu_sendv_packet
>> >> while virtio-blk does that in the callback.
>> >
>> > This is only if the packet was sent immediately.
>> > I was referring to the case where the packet is queued.
>>
>> I see. I usually don't see packets get queued in the net layer.
>> What would be the effect to devices? Restraint sending packets?
>
> Yes.
>
>> >
>> >> >
>> >> >> This doesn't
>> >> >> happen with block, because the state is still old enough to
>> >> >> replay. Note that using hybrid approach won't cause duplicated
>> >> >> requests on the secondary.
>> >> >
>> >> > An assumption devices make is that a buffer is unused once
>> >> > completion callback was invoked. Does this violate that assumption?
>> >>
>> >> No, it shouldn't. In case of net with net layer replay, we copy
>> >> the content of the requests, and in case of block, because we
>> >> haven't called the callback yet, the requests remains fresh.
>> >>
>> >> Yoshi
>> >>
>> >
>> > Yes, as long as you copy it should be fine. Maybe it's a good idea for
>> > event-tap to queue all packets to avoid the copy and avoid the need to
>> > replay at the net level.
>>
>> If queuing works fine for the devices, it seems to be a good
>> idea. I think the ordering issue doesn't happen still.
>>
>> Yoshi
>
> If you replay and both net and pio level, it becomes complex.
> Maybe it's ok, but certainly harder to reason about.
Michael,
It seems that queuing in event-tap, like the net layer does, works
for devices that use qemu_send_packet_async, as you suggested. But
for those that use qemu_send_packet, we still need to copy the
contents just like net-layer queuing does, and net-level replay
should be kept to handle it.
Thanks,
Yoshi
>
>> >
>> >> >
>> >> > --
>> >> > MST
>> >> >
>> >> >
>> >
>> >
>
>
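A hedged sketch of the split described above; event_tap_queue_ref() and event_tap_log_copy() are hypothetical helpers (stubbed here so the example compiles), and the callback type is illustrative rather than QEMU's NetPacketSent:

#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

typedef void SentCallback(void *opaque, ssize_t len);

static ssize_t event_tap_queue_ref(const struct iovec *iov, int iovcnt,
                                   SentCallback *cb, void *opaque)
{
    /* queue {iov, iovcnt, cb, opaque} by reference; flush and invoke cb
     * when the FT transaction completes (stubbed) */
    (void)iov; (void)iovcnt; (void)cb; (void)opaque;
    return 0;
}

static ssize_t event_tap_log_copy(const struct iovec *iov, int iovcnt)
{
    /* copy the packet contents for later net-level replay (stubbed) */
    (void)iov; (void)iovcnt;
    return 0;
}

ssize_t event_tap_sendv(const struct iovec *iov, int iovcnt,
                        SentCallback *cb, void *opaque)
{
    if (cb) {
        /* qemu_send_packet_async() style: the device waits for the
         * completion callback, so the packet can be queued by reference
         * and the callback delayed until the transaction completes. */
        return event_tap_queue_ref(iov, iovcnt, cb, opaque);
    }
    /* qemu_send_packet() style: the device assumes the buffer is free as
     * soon as the call returns, so the contents must be copied and
     * replayed later at the net level. */
    return event_tap_log_copy(iov, iovcnt);
}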
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] Re: [PATCH 09/21] Introduce event-tap.
2011-01-06 8:47 ` Yoshiaki Tamura
@ 2011-01-06 9:36 ` Michael S. Tsirkin
2011-01-06 9:41 ` Yoshiaki Tamura
0 siblings, 1 reply; 112+ messages in thread
From: Michael S. Tsirkin @ 2011-01-06 9:36 UTC (permalink / raw)
To: Yoshiaki Tamura
Cc: aliguori, mtosatti, ananth, kvm, Stefan Hajnoczi, dlaor,
ohmura.kei, qemu-devel, avi, vatsa, psuriset, stefanha
On Thu, Jan 06, 2011 at 05:47:27PM +0900, Yoshiaki Tamura wrote:
> 2011/1/4 Michael S. Tsirkin <mst@redhat.com>:
> > On Tue, Jan 04, 2011 at 10:45:13PM +0900, Yoshiaki Tamura wrote:
> >> 2011/1/4 Michael S. Tsirkin <mst@redhat.com>:
> >> > On Tue, Jan 04, 2011 at 09:20:53PM +0900, Yoshiaki Tamura wrote:
> >> >> 2011/1/4 Michael S. Tsirkin <mst@redhat.com>:
> >> >> > On Tue, Jan 04, 2011 at 08:02:54PM +0900, Yoshiaki Tamura wrote:
> >> >> >> 2010/11/29 Stefan Hajnoczi <stefanha@gmail.com>:
> >> >> >> > On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura
> >> >> >> > <tamura.yoshiaki@lab.ntt.co.jp> wrote:
> >> >> >> >> event-tap controls when to start FT transaction, and provides proxy
> >> >> >> >> functions to called from net/block devices. While FT transaction, it
> >> >> >> >> queues up net/block requests, and flush them when the transaction gets
> >> >> >> >> completed.
> >> >> >> >>
> >> >> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
> >> >> >> >> Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
> >> >> >> >> ---
> >> >> >> >> Makefile.target | 1 +
> >> >> >> >> block.h | 9 +
> >> >> >> >> event-tap.c | 794 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
> >> >> >> >> event-tap.h | 34 +++
> >> >> >> >> net.h | 4 +
> >> >> >> >> net/queue.c | 1 +
> >> >> >> >> 6 files changed, 843 insertions(+), 0 deletions(-)
> >> >> >> >> create mode 100644 event-tap.c
> >> >> >> >> create mode 100644 event-tap.h
> >> >> >> >
> >> >> >> > event_tap_state is checked at the beginning of several functions. If
> >> >> >> > there is an unexpected state the function silently returns. Should
> >> >> >> > these checks really be assert() so there is an abort and backtrace if
> >> >> >> > the program ever reaches this state?
> >> >> >> >
> >> >> >> >> +typedef struct EventTapBlkReq {
> >> >> >> >> + char *device_name;
> >> >> >> >> + int num_reqs;
> >> >> >> >> + int num_cbs;
> >> >> >> >> + bool is_multiwrite;
> >> >> >> >
> >> >> >> > Is multiwrite logging necessary? If event tap is called from within
> >> >> >> > the block layer then multiwrite is turned into one or more
> >> >> >> > bdrv_aio_writev() calls.
> >> >> >> >
> >> >> >> >> +static void event_tap_replay(void *opaque, int running, int reason)
> >> >> >> >> +{
> >> >> >> >> + EventTapLog *log, *next;
> >> >> >> >> +
> >> >> >> >> + if (!running) {
> >> >> >> >> + return;
> >> >> >> >> + }
> >> >> >> >> +
> >> >> >> >> + if (event_tap_state != EVENT_TAP_LOAD) {
> >> >> >> >> + return;
> >> >> >> >> + }
> >> >> >> >> +
> >> >> >> >> + event_tap_state = EVENT_TAP_REPLAY;
> >> >> >> >> +
> >> >> >> >> + QTAILQ_FOREACH(log, &event_list, node) {
> >> >> >> >> + EventTapBlkReq *blk_req;
> >> >> >> >> +
> >> >> >> >> + /* event resume */
> >> >> >> >> + switch (log->mode & ~EVENT_TAP_TYPE_MASK) {
> >> >> >> >> + case EVENT_TAP_NET:
> >> >> >> >> + event_tap_net_flush(&log->net_req);
> >> >> >> >> + break;
> >> >> >> >> + case EVENT_TAP_BLK:
> >> >> >> >> + blk_req = &log->blk_req;
> >> >> >> >> + if ((log->mode & EVENT_TAP_TYPE_MASK) == EVENT_TAP_IOPORT) {
> >> >> >> >> + switch (log->ioport.index) {
> >> >> >> >> + case 0:
> >> >> >> >> + cpu_outb(log->ioport.address, log->ioport.data);
> >> >> >> >> + break;
> >> >> >> >> + case 1:
> >> >> >> >> + cpu_outw(log->ioport.address, log->ioport.data);
> >> >> >> >> + break;
> >> >> >> >> + case 2:
> >> >> >> >> + cpu_outl(log->ioport.address, log->ioport.data);
> >> >> >> >> + break;
> >> >> >> >> + }
> >> >> >> >> + } else {
> >> >> >> >> + /* EVENT_TAP_MMIO */
> >> >> >> >> + cpu_physical_memory_rw(log->mmio.address,
> >> >> >> >> + log->mmio.buf,
> >> >> >> >> + log->mmio.len, 1);
> >> >> >> >> + }
> >> >> >> >> + break;
> >> >> >> >
> >> >> >> > Why are net tx packets replayed at the net level but blk requests are
> >> >> >> > replayed at the pio/mmio level?
> >> >> >> >
> >> >> >> > I expected everything to replay either as pio/mmio or as net/block.
> >> >> >>
> >> >> >> Stefan,
> >> >> >>
> >> >> >> After doing some heavy-load tests, I realized that we have to
> >> >> >> take a hybrid approach to replay for now. This is because the point
> >> >> >> at which a device moves to the next state (e.g. virtio decreases
> >> >> >> inuse) differs between net and block. For example, virtio-net
> >> >> >> decreases inuse upon returning from the net layer, but virtio-blk
> >> >> >> does that inside the callback.
> >> >> >
> >> >> > For TX, virtio-net calls virtqueue_push from virtio_net_tx_complete.
> >> >> > For RX, virtio-net calls virtqueue_flush from virtio_net_receive.
> >> >> > Both are invoked from a callback.
> >> >> >
> >> >> >> If we only use pio/mmio
> >> >> >> replay, even though event-tap tries to replay net requests, some
> >> >> >> get lost because the state has proceeded already.
> >> >> >
> >> >> > It seems that all you need to do to avoid this is to
> >> >> > delay the callback?
> >> >>
> >> >> Yeah, if it's possible. But if you take a look at virtio-net,
> >> >> you'll see that virtio_push is called immediately after calling
> >> >> qemu_sendv_packet
> >> >> while virtio-blk does that in the callback.
> >> >
> >> > This is only if the packet was sent immediately.
> >> > I was referring to the case where the packet is queued.
> >>
> >> I see. I usually don't see packets get queued in the net layer.
> >> What would be the effect on devices? Restrain sending packets?
> >
> > Yes.
> >
> >> >
> >> >> >
> >> >> >> This doesn't
> >> >> >> happen with block, because the state is still old enough to
> >> >> >> replay. Note that using hybrid approach won't cause duplicated
> >> >> >> requests on the secondary.
> >> >> >
> >> >> > An assumption devices make is that a buffer is unused once
> >> >> > completion callback was invoked. Does this violate that assumption?
> >> >>
> >> >> No, it shouldn't. In the case of net with net-layer replay, we copy
> >> >> the contents of the requests, and in the case of block, because we
> >> >> haven't called the callback yet, the requests remain fresh.
> >> >>
> >> >> Yoshi
> >> >>
> >> >
> >> > Yes, as long as you copy it should be fine. Maybe it's a good idea for
> >> > event-tap to queue all packets to avoid the copy and avoid the need to
> >> > replay at the net level.
> >>
> >> If queuing works fine for the devices, it seems to be a good
> >> idea. I think the ordering issue still doesn't happen.
> >>
> >> Yoshi
> >
> > If you replay at both net and pio level, it becomes complex.
> > Maybe it's ok, but certainly harder to reason about.
>
> Michael,
>
> It seems queuing at event-tap, as in the net layer, works for devices
> that use qemu_send_packet_async, as you suggested. But for those
> that use qemu_send_packet, we still need to copy the contents
> just as net-layer queuing does, and net-level replay should be
> kept to handle that case.
> Thanks,
>
> Yoshi
Right. And I think it's fine. What I found confusing was
the case where both virtio (because the avail idx is moved back) and
the net layer replay the packet.
> >
> >> >
> >> >> >
> >> >> > --
> >> >> > MST
> >> >> >
> >> >> >
> >> >
> >> >
> >
> >
^ permalink raw reply [flat|nested] 112+ messages in thread
* Re: [Qemu-devel] Re: [PATCH 09/21] Introduce event-tap.
2011-01-06 9:36 ` Michael S. Tsirkin
@ 2011-01-06 9:41 ` Yoshiaki Tamura
0 siblings, 0 replies; 112+ messages in thread
From: Yoshiaki Tamura @ 2011-01-06 9:41 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: aliguori, mtosatti, ananth, kvm, Stefan Hajnoczi, dlaor,
ohmura.kei, qemu-devel, avi, vatsa, psuriset, stefanha
2011/1/6 Michael S. Tsirkin <mst@redhat.com>:
> On Thu, Jan 06, 2011 at 05:47:27PM +0900, Yoshiaki Tamura wrote:
>> 2011/1/4 Michael S. Tsirkin <mst@redhat.com>:
>> > On Tue, Jan 04, 2011 at 10:45:13PM +0900, Yoshiaki Tamura wrote:
>> >> 2011/1/4 Michael S. Tsirkin <mst@redhat.com>:
>> >> > On Tue, Jan 04, 2011 at 09:20:53PM +0900, Yoshiaki Tamura wrote:
>> >> >> 2011/1/4 Michael S. Tsirkin <mst@redhat.com>:
>> >> >> > On Tue, Jan 04, 2011 at 08:02:54PM +0900, Yoshiaki Tamura wrote:
>> >> >> >> 2010/11/29 Stefan Hajnoczi <stefanha@gmail.com>:
>> >> >> >> > On Thu, Nov 25, 2010 at 6:06 AM, Yoshiaki Tamura
>> >> >> >> > <tamura.yoshiaki@lab.ntt.co.jp> wrote:
>> >> >> >> >> event-tap controls when to start the FT transaction, and provides proxy
>> >> >> >> >> functions to be called from net/block devices. During an FT transaction, it
>> >> >> >> >> queues up net/block requests and flushes them when the transaction gets
>> >> >> >> >> completed.
>> >> >> >> >>
>> >> >> >> >> Signed-off-by: Yoshiaki Tamura <tamura.yoshiaki@lab.ntt.co.jp>
>> >> >> >> >> Signed-off-by: OHMURA Kei <ohmura.kei@lab.ntt.co.jp>
>> >> >> >> >> ---
>> >> >> >> >> Makefile.target | 1 +
>> >> >> >> >> block.h | 9 +
>> >> >> >> >> event-tap.c | 794 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> >> >> >> >> event-tap.h | 34 +++
>> >> >> >> >> net.h | 4 +
>> >> >> >> >> net/queue.c | 1 +
>> >> >> >> >> 6 files changed, 843 insertions(+), 0 deletions(-)
>> >> >> >> >> create mode 100644 event-tap.c
>> >> >> >> >> create mode 100644 event-tap.h
>> >> >> >> >
>> >> >> >> > event_tap_state is checked at the beginning of several functions. If
>> >> >> >> > there is an unexpected state the function silently returns. Should
>> >> >> >> > these checks really be assert() so there is an abort and backtrace if
>> >> >> >> > the program ever reaches this state?
>> >> >> >> >
>> >> >> >> >> +typedef struct EventTapBlkReq {
>> >> >> >> >> + char *device_name;
>> >> >> >> >> + int num_reqs;
>> >> >> >> >> + int num_cbs;
>> >> >> >> >> + bool is_multiwrite;
>> >> >> >> >
>> >> >> >> > Is multiwrite logging necessary? If event tap is called from within
>> >> >> >> > the block layer then multiwrite is turned into one or more
>> >> >> >> > bdrv_aio_writev() calls.
>> >> >> >> >
>> >> >> >> >> +static void event_tap_replay(void *opaque, int running, int reason)
>> >> >> >> >> +{
>> >> >> >> >> + EventTapLog *log, *next;
>> >> >> >> >> +
>> >> >> >> >> + if (!running) {
>> >> >> >> >> + return;
>> >> >> >> >> + }
>> >> >> >> >> +
>> >> >> >> >> + if (event_tap_state != EVENT_TAP_LOAD) {
>> >> >> >> >> + return;
>> >> >> >> >> + }
>> >> >> >> >> +
>> >> >> >> >> + event_tap_state = EVENT_TAP_REPLAY;
>> >> >> >> >> +
>> >> >> >> >> + QTAILQ_FOREACH(log, &event_list, node) {
>> >> >> >> >> + EventTapBlkReq *blk_req;
>> >> >> >> >> +
>> >> >> >> >> + /* event resume */
>> >> >> >> >> + switch (log->mode & ~EVENT_TAP_TYPE_MASK) {
>> >> >> >> >> + case EVENT_TAP_NET:
>> >> >> >> >> + event_tap_net_flush(&log->net_req);
>> >> >> >> >> + break;
>> >> >> >> >> + case EVENT_TAP_BLK:
>> >> >> >> >> + blk_req = &log->blk_req;
>> >> >> >> >> + if ((log->mode & EVENT_TAP_TYPE_MASK) == EVENT_TAP_IOPORT) {
>> >> >> >> >> + switch (log->ioport.index) {
>> >> >> >> >> + case 0:
>> >> >> >> >> + cpu_outb(log->ioport.address, log->ioport.data);
>> >> >> >> >> + break;
>> >> >> >> >> + case 1:
>> >> >> >> >> + cpu_outw(log->ioport.address, log->ioport.data);
>> >> >> >> >> + break;
>> >> >> >> >> + case 2:
>> >> >> >> >> + cpu_outl(log->ioport.address, log->ioport.data);
>> >> >> >> >> + break;
>> >> >> >> >> + }
>> >> >> >> >> + } else {
>> >> >> >> >> + /* EVENT_TAP_MMIO */
>> >> >> >> >> + cpu_physical_memory_rw(log->mmio.address,
>> >> >> >> >> + log->mmio.buf,
>> >> >> >> >> + log->mmio.len, 1);
>> >> >> >> >> + }
>> >> >> >> >> + break;
>> >> >> >> >
>> >> >> >> > Why are net tx packets replayed at the net level but blk requests are
>> >> >> >> > replayed at the pio/mmio level?
>> >> >> >> >
>> >> >> >> > I expected everything to replay either as pio/mmio or as net/block.
>> >> >> >>
>> >> >> >> Stefan,
>> >> >> >>
>> >> >> >> After doing some heavy-load tests, I realized that we have to
>> >> >> >> take a hybrid approach to replay for now. This is because the point
>> >> >> >> at which a device moves to the next state (e.g. virtio decreases
>> >> >> >> inuse) differs between net and block. For example, virtio-net
>> >> >> >> decreases inuse upon returning from the net layer, but virtio-blk
>> >> >> >> does that inside the callback.
>> >> >> >
>> >> >> > For TX, virtio-net calls virtqueue_push from virtio_net_tx_complete.
>> >> >> > For RX, virtio-net calls virtqueue_flush from virtio_net_receive.
>> >> >> > Both are invoked from a callback.
>> >> >> >
>> >> >> >> If we only use pio/mmio
>> >> >> >> replay, even though event-tap tries to replay net requests, some
>> >> >> >> get lost because the state has proceeded already.
>> >> >> >
>> >> >> > It seems that all you need to do to avoid this is to
>> >> >> > delay the callback?
>> >> >>
>> >> >> Yeah, if it's possible. But if you take a look at virtio-net,
>> >> >> you'll see that virtio_push is called immediately after calling
>> >> >> qemu_sendv_packet
>> >> >> while virtio-blk does that in the callback.
>> >> >
>> >> > This is only if the packet was sent immediately.
>> >> > I was referring to the case where the packet is queued.
>> >>
>> >> I see. I usually don't see packets get queued in the net layer.
>> >> What would be the effect on devices? Restrain sending packets?
>> >
>> > Yes.
>> >
>> >> >
>> >> >> >
>> >> >> >> This doesn't
>> >> >> >> happen with block, because the state is still old enough to
>> >> >> >> replay. Note that using hybrid approach won't cause duplicated
>> >> >> >> requests on the secondary.
>> >> >> >
>> >> >> > An assumption devices make is that a buffer is unused once
>> >> >> > completion callback was invoked. Does this violate that assumption?
>> >> >>
>> >> >> No, it shouldn't. In the case of net with net-layer replay, we copy
>> >> >> the contents of the requests, and in the case of block, because we
>> >> >> haven't called the callback yet, the requests remain fresh.
>> >> >>
>> >> >> Yoshi
>> >> >>
>> >> >
>> >> > Yes, as long as you copy it should be fine. Maybe it's a good idea for
>> >> > event-tap to queue all packets to avoid the copy and avoid the need to
>> >> > replay at the net level.
>> >>
>> >> If queuing works fine for the devices, it seems to be a good
>> >> idea. I think the ordering issue still doesn't happen.
>> >>
>> >> Yoshi
>> >
>> > If you replay at both net and pio level, it becomes complex.
>> > Maybe it's ok, but certainly harder to reason about.
>>
>> Michael,
>>
>> It seems queuing at event-tap, as in the net layer, works for devices
>> that use qemu_send_packet_async, as you suggested. But for those
>> that use qemu_send_packet, we still need to copy the contents
>> just as net-layer queuing does, and net-level replay should be
>> kept to handle that case.
>> Thanks,
>>
>> Yoshi
>
> Right. And I think it's fine. What I found confusing was
> the case where both virtio (because the avail idx is moved back) and
> the net layer replay the packet.
I agree, and that part is fixed. There won't be double-layer
replay for the same device.
Yoshi
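A simplified reading of the replay dispatch in the event_tap_replay() hunk
quoted above shows why a given device's request is replayed only once: each
logged event carries one mode, and that mode selects exactly one replay path
(net-layer flush for net requests, pio/mmio re-execution for block requests).
The constant values and the replay_* helpers below are placeholders for
illustration, not the actual patch code:

/* Values chosen only so the mask arithmetic matches the quoted hunk. */
enum {
    EVENT_TAP_NET = 1, EVENT_TAP_BLK = 2,
    EVENT_TAP_TYPE_MASK = 0xf0, EVENT_TAP_IOPORT = 0x10, EVENT_TAP_MMIO = 0x20
};

struct event_tap_log { int mode; /* plus the saved net/block/pio/mmio payload */ };

static void replay_net(struct event_tap_log *log)  { (void)log; /* flush queued net request */ }
static void replay_pio(struct event_tap_log *log)  { (void)log; /* re-issue logged outb/outw/outl */ }
static void replay_mmio(struct event_tap_log *log) { (void)log; /* re-issue logged MMIO write */ }

static void replay_one(struct event_tap_log *log)
{
    switch (log->mode & ~EVENT_TAP_TYPE_MASK) {
    case EVENT_TAP_NET:
        replay_net(log);        /* net requests replay at the net layer only */
        break;
    case EVENT_TAP_BLK:
        if ((log->mode & EVENT_TAP_TYPE_MASK) == EVENT_TAP_IOPORT) {
            replay_pio(log);    /* block requests replay at the pio level... */
        } else {
            replay_mmio(log);   /* ...or at the mmio level, never both */
        }
        break;
    }
}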
>
>
>> >
>> >> >
>> >> >> >
>> >> >> > --
>> >> >> > MST
>> >> >> >
>> >> >> >
>> >> >
>> >> >
>> >
>> >
^ permalink raw reply [flat|nested] 112+ messages in thread
end of thread, other threads:[~2011-01-06 9:41 UTC | newest]
Thread overview: 112+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2010-11-25 6:06 [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 01/21] Make QEMUFile buf expandable, and introduce qemu_realloc_buffer() and qemu_clear_buffer() Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 02/21] Introduce read() to FdMigrationState Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 03/21] Introduce skip_header parameter to qemu_loadvm_state() Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 04/21] qemu-char: export socket_set_nodelay() Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 05/21] virtio: modify save/load handler to handle inuse varialble Yoshiaki Tamura
2010-11-28 9:28 ` [Qemu-devel] " Michael S. Tsirkin
2010-11-28 11:27 ` Yoshiaki Tamura
2010-11-28 11:46 ` Michael S. Tsirkin
2010-12-01 8:03 ` Yoshiaki Tamura
2010-12-02 12:02 ` Michael S. Tsirkin
2010-12-03 6:28 ` Yoshiaki Tamura
2010-12-16 7:36 ` Yoshiaki Tamura
2010-12-16 9:51 ` Michael S. Tsirkin
2010-12-16 14:28 ` Yoshiaki Tamura
2010-12-16 14:40 ` Michael S. Tsirkin
2010-12-16 15:59 ` Yoshiaki Tamura
2010-12-17 16:22 ` Yoshiaki Tamura
2010-12-24 9:27 ` Michael S. Tsirkin
2010-12-24 11:42 ` Yoshiaki Tamura
2010-12-24 13:21 ` Michael S. Tsirkin
2010-12-26 9:05 ` Michael S. Tsirkin
2010-12-26 10:14 ` Yoshiaki Tamura
2010-12-26 10:46 ` Michael S. Tsirkin
2010-12-26 10:50 ` Yoshiaki Tamura
2010-12-26 10:49 ` Michael S. Tsirkin
2010-12-26 10:57 ` Yoshiaki Tamura
2010-12-26 12:01 ` Michael S. Tsirkin
2010-12-26 12:16 ` Yoshiaki Tamura
2010-12-26 12:17 ` Michael S. Tsirkin
2010-11-25 6:06 ` [Qemu-devel] [PATCH 06/21] vl: add a tmp pointer so that a handler can delete the entry to which it belongs Yoshiaki Tamura
2010-12-08 7:03 ` Isaku Yamahata
2010-12-08 8:11 ` Yoshiaki Tamura
2010-12-08 14:22 ` Anthony Liguori
2010-11-25 6:06 ` [Qemu-devel] [PATCH 07/21] Introduce fault tolerant VM transaction QEMUFile and ft_mode Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 08/21] savevm: introduce util functions to control ft_trans_file from savevm layer Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 09/21] Introduce event-tap Yoshiaki Tamura
2010-11-29 11:00 ` [Qemu-devel] " Stefan Hajnoczi
2010-11-30 9:50 ` Yoshiaki Tamura
2010-11-30 10:04 ` Stefan Hajnoczi
2010-11-30 10:20 ` Yoshiaki Tamura
2011-01-04 11:02 ` Yoshiaki Tamura
2011-01-04 11:14 ` Stefan Hajnoczi
2011-01-04 11:19 ` Michael S. Tsirkin
2011-01-04 12:20 ` Yoshiaki Tamura
2011-01-04 13:10 ` Michael S. Tsirkin
2011-01-04 13:45 ` Yoshiaki Tamura
2011-01-04 14:42 ` Michael S. Tsirkin
2011-01-06 8:47 ` Yoshiaki Tamura
2011-01-06 9:36 ` Michael S. Tsirkin
2011-01-06 9:41 ` Yoshiaki Tamura
[not found] ` <20101130011914.GA9015@amt.cnet>
2010-11-30 9:28 ` Yoshiaki Tamura
2010-11-30 10:25 ` Marcelo Tosatti
2010-11-30 10:35 ` Yoshiaki Tamura
2010-11-30 13:11 ` Marcelo Tosatti
2010-11-25 6:06 ` [Qemu-devel] [PATCH 10/21] Call init handler of event-tap at main() in vl.c Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 11/21] ioport: insert event_tap_ioport() to ioport_write() Yoshiaki Tamura
2010-11-28 9:40 ` [Qemu-devel] " Michael S. Tsirkin
2010-11-28 12:00 ` Yoshiaki Tamura
2010-12-16 7:37 ` Yoshiaki Tamura
2010-12-16 9:22 ` Michael S. Tsirkin
2010-12-16 9:50 ` Yoshiaki Tamura
2010-12-16 9:54 ` Michael S. Tsirkin
2010-12-16 16:27 ` Stefan Hajnoczi
2010-12-17 16:19 ` Yoshiaki Tamura
2010-12-18 8:36 ` Stefan Hajnoczi
2010-11-25 6:06 ` [Qemu-devel] [PATCH 12/21] Insert event_tap_mmio() to cpu_physical_memory_rw() in exec.c Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 13/21] dma-helpers: replace bdrv_aio_writev() with bdrv_aio_writev_proxy() Yoshiaki Tamura
2010-11-28 9:33 ` [Qemu-devel] " Michael S. Tsirkin
2010-11-28 11:55 ` Yoshiaki Tamura
2010-11-28 12:28 ` Michael S. Tsirkin
2010-11-29 9:52 ` Kevin Wolf
2010-11-29 12:56 ` Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 14/21] virtio-blk: replace bdrv_aio_multiwrite() with bdrv_aio_multiwrite_proxy() Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 15/21] virtio-net: replace qemu_sendv_packet_async() with qemu_sendv_packet_async_proxy() Yoshiaki Tamura
2010-11-28 9:31 ` [Qemu-devel] " Michael S. Tsirkin
2010-11-28 11:43 ` Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 16/21] e1000: replace qemu_send_packet() with qemu_send_packet_proxy() Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 17/21] savevm: introduce qemu_savevm_trans_{begin, commit} Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 18/21] migration: introduce migrate_ft_trans_{put, get}_ready(), and modify migrate_fd_put_ready() when ft_mode is on Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 19/21] migration-tcp: modify tcp_accept_incoming_migration() to handle ft_mode, and add a hack not to close fd when ft_mode is enabled Yoshiaki Tamura
2010-11-25 6:06 ` [Qemu-devel] [PATCH 20/21] Introduce -k option to enable FT migration mode (Kemari) Yoshiaki Tamura
2010-11-25 6:07 ` [Qemu-devel] [PATCH 21/21] migration: add a parser to accept FT migration incoming mode Yoshiaki Tamura
2010-11-26 18:39 ` [Qemu-devel] [PATCH 00/21] Kemari for KVM 0.2 Blue Swirl
2010-11-27 4:29 ` Yoshiaki Tamura
2010-11-27 7:23 ` Stefan Hajnoczi
2010-11-27 8:53 ` Yoshiaki Tamura
2010-11-27 11:03 ` Blue Swirl
2010-11-27 12:21 ` Yoshiaki Tamura
2010-11-27 11:54 ` Stefan Hajnoczi
2010-11-27 13:11 ` Yoshiaki Tamura
2010-11-29 10:17 ` Stefan Hajnoczi
2010-11-29 13:00 ` Paul Brook
2010-11-29 13:13 ` Yoshiaki Tamura
2010-11-29 13:19 ` Paul Brook
2010-11-29 13:41 ` Yoshiaki Tamura
2010-11-29 14:12 ` Paul Brook
2010-11-29 14:37 ` Yoshiaki Tamura
2010-11-29 14:56 ` Paul Brook
2010-11-29 15:00 ` Yoshiaki Tamura
2010-11-29 15:56 ` Paul Brook
2010-11-29 16:23 ` Stefan Hajnoczi
2010-11-29 16:41 ` Dor Laor
2010-11-29 16:53 ` Paul Brook
2010-11-29 17:05 ` Anthony Liguori
2010-11-29 17:18 ` Paul Brook
2010-11-29 17:33 ` Anthony Liguori
2010-11-30 7:13 ` Yoshiaki Tamura
2010-11-30 6:43 ` Yoshiaki Tamura
2010-11-30 9:13 ` Takuya Yoshikawa
2010-11-27 11:20 ` Paul Brook
2010-11-27 12:35 ` Yoshiaki Tamura