[Qemu-devel] [RFC PATCH RDMA support v7: 00/13] rdma cleanup and reordering

qemu-devel.nongnu.org archive mirror
 help / color / mirror / Atom feed

* [Qemu-devel] [RFC PATCH RDMA support v7: 00/13] rdma cleanup and reordering
@ 2013-04-10 22:28 mrhines
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 01/13] introduce qemu_ram_foreach_block() mrhines
                   ` (13 more replies)
  0 siblings, 14 replies; 52+ messages in thread
From: mrhines @ 2013-04-10 22:28 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

Changes since v6:

(Thanks, Paolo - things look much cleaner now.)

- Try to get patch-ordering correct =)
- Much cleaner use of QEMUFileOps
- Much fewer header files changes
- Convert zero check capability to QMP command instead
- Updated documentation

Wiki: http://wiki.qemu.org/Features/RDMALiveMigration
Github: git@github.com:hinesmr/qemu.git

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [Qemu-devel] [RFC PATCH RDMA support v1: 01/13] introduce qemu_ram_foreach_block()
  2013-04-10 22:28 [Qemu-devel] [RFC PATCH RDMA support v7: 00/13] rdma cleanup and reordering mrhines
@ 2013-04-10 22:28 ` mrhines
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 02/13] Core RMDA logic mrhines
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 52+ messages in thread
From: mrhines @ 2013-04-10 22:28 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

This is used during RDMA initialization in order to transmit
a description of all the RAM blocks to the peer for later
dynamic chunk registration purposes.

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 exec.c                    |    9 +++++++++
 include/exec/cpu-common.h |    5 +++++
 2 files changed, 14 insertions(+)

diff --git a/exec.c b/exec.c
index fa1e0c3..0e5a2c3 100644
--- a/exec.c
+++ b/exec.c
@@ -2631,3 +2631,12 @@ bool cpu_physical_memory_is_io(hwaddr phys_addr)
              memory_region_is_romd(section->mr));
 }
 #endif
+
+void qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque)
+{
+    RAMBlock *block;
+
+    QTAILQ_FOREACH(block, &ram_list.blocks, next) {
+        func(block->host, block->offset, block->length, opaque);
+    }
+}
diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
index 2e5f11f..88cb741 100644
--- a/include/exec/cpu-common.h
+++ b/include/exec/cpu-common.h
@@ -119,6 +119,11 @@ extern struct MemoryRegion io_mem_rom;
 extern struct MemoryRegion io_mem_unassigned;
 extern struct MemoryRegion io_mem_notdirty;
 
+typedef void  (RAMBlockIterFunc)(void *host_addr, 
+    ram_addr_t offset, ram_addr_t length, void *opaque); 
+
+void qemu_ram_foreach_block(RAMBlockIterFunc func, void *opaque);
+
 #endif
 
 #endif /* !CPU_COMMON_H */
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [Qemu-devel] [RFC PATCH RDMA support v1: 02/13] Core RMDA logic
  2013-04-10 22:28 [Qemu-devel] [RFC PATCH RDMA support v7: 00/13] rdma cleanup and reordering mrhines
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 01/13] introduce qemu_ram_foreach_block() mrhines
@ 2013-04-10 22:28 ` mrhines
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 03/13] RDMA is enabled by default per the usual ./configure testing mrhines
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 52+ messages in thread
From: mrhines @ 2013-04-10 22:28 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

As requested, code that does need to be visible is kept
well contained inside this file and this is the only new
additional file to the entire patch - good progress.

This file includes the entire protocol and interfaces
required to perform RDMA migration.

Full documentation is in docs/rdma.txt

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 migration-rdma.c | 2546 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 2546 insertions(+)
 create mode 100644 migration-rdma.c

diff --git a/migration-rdma.c b/migration-rdma.c
new file mode 100644
index 0000000..365d5f3
--- /dev/null
+++ b/migration-rdma.c
@@ -0,0 +1,2546 @@
+/*
+ *  Copyright (C) 2013 Michael R. Hines <mrhines@us.ibm.com>
+ *  Copyright (C) 2010 Jiuxing Liu <jl@us.ibm.com>
+ *
+ *  RDMA protocol and interfaces
+ *
+ *  This program is free software; you can redistribute it and/or modify
+ *  it under the terms of the GNU General Public License as published by
+ *  the Free Software Foundation; under version 2 of the License.
+ *
+ *  This program is distributed in the hope that it will be useful,
+ *  but WITHOUT ANY WARRANTY; without even the implied warranty of
+ *  MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ *  GNU General Public License for more details.
+ *
+ *  You should have received a copy of the GNU General Public License
+ *  along with this program; if not, see <http://www.gnu.org/licenses/>.
+ */
+#include "qemu-common.h"
+#include "migration/migration.h"
+#include "migration/qemu-file.h"
+#include "exec/cpu-common.h"
+#include "qemu/main-loop.h"
+#include "qemu/sockets.h"
+#include <stdio.h>
+#include <sys/types.h>
+#include <sys/socket.h>
+#include <netdb.h>
+#include <arpa/inet.h>
+#include <string.h>
+#include <rdma/rdma_cma.h>
+
+//#define DEBUG_RDMA
+//#define DEBUG_RDMA_VERBOSE
+
+#ifdef DEBUG_RDMA
+#define DPRINTF(fmt, ...) \
+    do { printf("rdma: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DPRINTF(fmt, ...) \
+    do { } while (0)
+#endif
+
+#ifdef DEBUG_RDMA_VERBOSE
+#define DDPRINTF(fmt, ...) \
+    do { printf("rdma: " fmt, ## __VA_ARGS__); } while (0)
+#else
+#define DDPRINTF(fmt, ...) \
+    do { } while (0)
+#endif
+
+#define RDMA_RESOLVE_TIMEOUT_MS 10000
+
+#define RDMA_CHUNK_REGISTRATION
+
+#define RDMA_LAZY_CLIENT_REGISTRATION
+
+/* Do not merge data if larger than this. */
+#define RDMA_MERGE_MAX (4 * 1024 * 1024)
+#define RDMA_UNSIGNALED_SEND_MAX 64
+
+#define RDMA_REG_CHUNK_SHIFT 20
+#define RDMA_REG_CHUNK_SIZE (1UL << (RDMA_REG_CHUNK_SHIFT))
+#define RDMA_REG_CHUNK_INDEX(start_addr, host_addr) \
+            (((unsigned long)(host_addr) >> RDMA_REG_CHUNK_SHIFT) - \
+            ((unsigned long)(start_addr) >> RDMA_REG_CHUNK_SHIFT))
+#define RDMA_REG_NUM_CHUNKS(rdma_ram_block) \
+            (RDMA_REG_CHUNK_INDEX((rdma_ram_block)->local_host_addr,\
+                (rdma_ram_block)->local_host_addr +\
+                (rdma_ram_block)->length) + 1)
+#define RDMA_REG_CHUNK_START(rdma_ram_block, i) ((uint8_t *)\
+            ((((unsigned long)((rdma_ram_block)->local_host_addr) >> \
+                RDMA_REG_CHUNK_SHIFT) + (i)) << \
+                RDMA_REG_CHUNK_SHIFT))
+#define RDMA_REG_CHUNK_END(rdma_ram_block, i) \
+            (RDMA_REG_CHUNK_START(rdma_ram_block, i) + \
+             RDMA_REG_CHUNK_SIZE)
+
+/*
+ * This is only for non-live state being migrated.
+ * Instead of RDMA_WRITE messages, we use RDMA_SEND
+ * messages for that state, which requires a different
+ * delivery design than main memory.
+ */
+#define RDMA_SEND_INCREMENT 32768
+
+#define RDMA_BLOCKING
+/*
+ * Completion queue can be filled by both read and write work requests, 
+ * so must reflect the sum of both possible queue sizes.
+ */
+#define RDMA_QP_SIZE 1000
+#define RDMA_CQ_SIZE (RDMA_QP_SIZE * 3)
+
+/*
+ * Maximum size infiniband SEND message
+ */
+#define RDMA_CONTROL_MAX_BUFFER (512 * 1024)
+#define RDMA_CONTROL_MAX_WR 2
+
+/*
+ * Capabilities for negotiation.
+ */
+#define RDMA_CAPABILITY_CHUNK_REGISTER 0x01
+#define RDMA_CAPABILITY_NEXT_FEATURE   0x02
+
+/*
+ * RDMA migration protocol:
+ * 1. RDMA Writes (data messages, i.e. RAM)
+ * 2. IB Send/Recv (control channel messages)
+ */
+enum {
+    RDMA_WRID_NONE = 0,
+    RDMA_WRID_RDMA_WRITE,
+    RDMA_WRID_SEND_CONTROL = 1000,
+    RDMA_WRID_RECV_CONTROL = 2000,
+};
+
+const char * wrid_desc[] = {
+        [RDMA_WRID_NONE] = "NONE",
+        [RDMA_WRID_RDMA_WRITE] = "WRITE RDMA",
+        [RDMA_WRID_SEND_CONTROL] = "CONTROL SEND",
+        [RDMA_WRID_RECV_CONTROL] = "CONTROL RECV",
+};
+
+/*
+ * SEND/RECV IB Control Messages.
+ */ 
+enum {
+    RDMA_CONTROL_NONE = 0,
+    RDMA_CONTROL_READY,             /* ready to receive */
+    RDMA_CONTROL_QEMU_FILE,         /* QEMUFile-transmitted bytes */
+    RDMA_CONTROL_RAM_BLOCKS,       /* RAMBlock synchronization */
+    RDMA_CONTROL_REGISTER_REQUEST,  /* dynamic page registration */
+    RDMA_CONTROL_REGISTER_RESULT,   /* key to use after registration */
+    RDMA_CONTROL_REGISTER_FINISHED, /* current iteration finished */
+};
+
+const char * control_desc[] = {
+        [RDMA_CONTROL_NONE] = "NONE",
+        [RDMA_CONTROL_READY] = "READY",
+        [RDMA_CONTROL_QEMU_FILE] = "QEMU FILE",
+        [RDMA_CONTROL_RAM_BLOCKS] = "REMOTE INFO",
+        [RDMA_CONTROL_REGISTER_REQUEST] = "REGISTER REQUEST",
+        [RDMA_CONTROL_REGISTER_RESULT] = "REGISTER RESULT",
+        [RDMA_CONTROL_REGISTER_FINISHED] = "REGISTER FINISHED",
+};
+
+/*
+ * Memory and MR structures used to represen an IB Send/Recv work request.
+ * This is *not* used for RDMA, only IB Send/Recv.
+ */
+typedef struct {
+    uint8_t  control[RDMA_CONTROL_MAX_BUFFER]; /* actual buffer to register */
+    struct   ibv_mr *control_mr;               /* registration metadata */
+    size_t   control_len;                      /* length of the message */
+    uint8_t *control_curr;                     /* start of unconsumed bytes */
+} RDMAWorkRequestData;
+
+/*
+ * Negotiate RDMA capabilities during connection-setup time.
+ */
+typedef struct {
+    uint32_t version;
+    uint32_t flags;
+} RDMACapabilities;
+
+/*
+ * Main data structure for RDMA state.
+ * While there is only one copy of this structure being allocated right now,
+ * this is the place where one would start if you wanted to consider 
+ * having more than one RDMA connection open at the same time.
+ */
+typedef struct RDMAContext {
+    char *host;
+    int port;
+
+    /* This is used by the migration protocol to transmit
+     * control messages (such as device state and registration commands)
+     * 
+     * WR #0 is for control channel ready messages from the server.
+     * WR #1 is for control channel data messages from the server. 
+     * WR #2 is for control channel send messages.
+     *
+     * We could use more WRs, but we have enough for now.
+     */
+    RDMAWorkRequestData wr_data[RDMA_CONTROL_MAX_WR + 1];
+
+    /* 
+     * This is used by *_exchange_send() to figure out whether or not
+     * the initial "READY" message has already been received or not.
+     * This is because other functions may potentially poll() and detect
+     * the READY message before send() does, in which case we need to
+     * know if it completed.
+     */
+    int control_ready_expected;
+
+    /* The rest is only for the initiator of the migration. */
+    int client_init_done;
+
+    /* number of outstanding unsignaled send */
+    int num_unsignaled_send;
+
+    /* number of outstanding signaled send */
+    int num_signaled_send;
+
+    /* store info about current buffer so that we can
+       merge it with future sends */
+    uint64_t current_offset;
+    uint64_t current_length;
+    /* index of ram block the current buffer belongs to */
+    int current_index;
+    /* index of the chunk in the current ram block */
+    int current_chunk;
+
+    bool chunk_register_destination;
+
+    /* 
+     * infiniband-specific variables for opening the device
+     * and maintaining connection state and so forth.
+     * 
+     * cm_id also has ibv_context, rdma_event_channel, and ibv_qp in
+     * cm_id->verbs, cm_id->channel, and cm_id->qp.
+     */
+    struct rdma_cm_id *cm_id;               /* connection manager ID */
+    struct rdma_cm_id *listen_id; 
+
+    struct ibv_context *verbs;
+    struct rdma_event_channel *channel;
+    struct ibv_qp *qp;                      /* queue pair */
+    struct ibv_comp_channel *comp_channel;  /* completion channel */
+    struct ibv_pd *pd;                      /* protection domain */
+    struct ibv_cq *cq;                      /* completion queue */
+} RDMAContext;
+
+/*
+ * Interface to the rest of the migration call stack. 
+ */
+typedef struct QEMUFileRDMA
+{
+    RDMAContext *rdma;
+    size_t len;
+    void *file;
+} QEMUFileRDMA;
+
+/*
+ * Representation of a RAMBlock from an RDMA perspective.
+ * This an subsequent structures cannot be linked lists
+ * because we're using a single IB message to transmit
+ * the information. It's small anyway, so a list is overkill.
+ */
+typedef struct RDMALocalBlock {
+    uint8_t  *local_host_addr; /* local virtual address */
+    uint64_t remote_host_addr; /* remote virtual address */
+    uint64_t offset;
+    uint64_t length;
+    struct   ibv_mr **pmr;     /* MRs for chunk-level registration */
+    struct   ibv_mr *mr;       /* MR for non-chunk-level registration */
+    uint32_t *remote_keys;     /* rkeys for chunk-level registration */ 
+    uint32_t remote_rkey;      /* rkeys for non-chunk-level registration */
+} RDMALocalBlock;
+
+/*
+ * Also represents a RAMblock, but only on the server.
+ * This gets transmitted by the server during connection-time 
+ * to the client / primary VM and then is used to populate the 
+ * corresponding RDMALocalBlock with
+ * the information needed to perform the actual RDMA.
+ *
+ */
+typedef struct RDMARemoteBlock {
+    uint64_t remote_host_addr;
+    uint64_t offset;
+    uint64_t length;
+    uint32_t remote_rkey;
+} RDMARemoteBlock;
+
+/*
+ * Virtual address of the above structures used for transmitting
+ * the RAMBlock descriptions at connection-time.
+ */
+typedef struct RDMALocalBlocks {
+    int num_blocks;
+    RDMALocalBlock *block;
+} RDMALocalBlocks;
+
+/*
+ * Same as above
+ */
+typedef struct RDMARemoteBlocks {
+    int * num_blocks;
+    RDMARemoteBlock *block;
+    void * remote_area;
+    int remote_size;
+} RDMARemoteBlocks;
+
+#define RDMA_CONTROL_VERSION_1      1
+//#define RDMA_CONTROL_VERSION_2      2  /* next version */
+#define RDMA_CONTROL_VERSION_MAX    1
+#define RDMA_CONTROL_VERSION_MIN    1    /* change on next version */
+
+#define RDMA_CONTROL_CURRENT_VERSION RDMA_CONTROL_VERSION_1
+
+/*
+ * Main structure for IB Send/Recv control messages.
+ * This gets prepended at the beginning of every Send/Recv.
+ */
+typedef struct {
+    uint32_t    len;
+    uint32_t    type;
+    uint32_t    version;
+} RDMAControlHeader;
+
+/*
+ * Register a single Chunk.
+ * Information sent by the primary VM to inform the server
+ * to register an single chunk of memory before we can perform
+ * the actual RDMA operation.
+ */
+typedef struct {
+    uint32_t   len;              /* length of the chunk to be registered */
+    uint32_t   current_index;    /* which ramblock the chunk belongs to */
+    uint64_t   offset;           /* offset into the ramblock of the chunk */
+} RDMARegister;
+
+/*
+ * The result of the server's memory registration produces an "rkey"
+ * which the primary VM must reference in order to perform
+ * the RDMA operation.
+ */
+typedef struct {
+    uint32_t rkey;
+} RDMARegisterResult;
+
+#define RDMAControlHeaderSize sizeof(RDMAControlHeader)
+
+RDMALocalBlocks local_ram_blocks;
+RDMARemoteBlocks remote_ram_blocks;
+
+/*
+ * Memory regions need to be registered with the device and queue pairs setup
+ * in advanced before the migration starts. This tells us where the RAM blocks
+ * are so that we can register them individually.
+ */
+static void qemu_rdma_init_one_block(void *host_addr, 
+    ram_addr_t offset, ram_addr_t length, void *opaque)
+{
+    RDMALocalBlocks *rdma_local_ram_blocks = opaque;
+    int num_blocks = rdma_local_ram_blocks->num_blocks;
+
+    rdma_local_ram_blocks->block[num_blocks].local_host_addr = host_addr;
+    rdma_local_ram_blocks->block[num_blocks].offset = (uint64_t)offset;
+    rdma_local_ram_blocks->block[num_blocks].length = (uint64_t)length;
+    rdma_local_ram_blocks->num_blocks++;
+}
+
+static void qemu_rdma_ram_block_counter(void *host_addr, 
+    ram_addr_t offset, ram_addr_t length, void *opaque)
+{
+    int * num_blocks = opaque;
+    *num_blocks = *num_blocks + 1;
+}
+
+/*
+ * Identify the RAMBlocks and their quantity. They will be references to
+ * identify chunk boundaries inside each RAMBlock and also be referenced
+ * during dynamic page registration.
+ */
+static int qemu_rdma_init_ram_blocks(RDMALocalBlocks *rdma_local_ram_blocks)
+{
+    int num_blocks = 0;
+    
+    qemu_ram_foreach_block(qemu_rdma_ram_block_counter, &num_blocks);  
+
+    memset(rdma_local_ram_blocks, 0, sizeof *rdma_local_ram_blocks);
+    rdma_local_ram_blocks->block = g_malloc0(sizeof(RDMALocalBlock) *
+                                    num_blocks);
+
+    rdma_local_ram_blocks->num_blocks = 0;
+    qemu_ram_foreach_block(qemu_rdma_init_one_block, rdma_local_ram_blocks);
+
+    DPRINTF("Allocated %d local ram block structures\n", 
+                    rdma_local_ram_blocks->num_blocks);
+    return 0;
+}
+
+/*
+ * Put in the log file which RDMA device was opened and the details
+ * associated with that device.
+ */
+static void qemu_rdma_dump_id(const char * who, struct ibv_context * verbs)
+{
+    printf("%s RDMA Device opened: kernel name %s "
+           "uverbs device name %s, "
+           "infiniband_verbs class device path %s,"
+           " infiniband class device path %s\n", 
+                who, 
+                verbs->device->name, 
+                verbs->device->dev_name, 
+                verbs->device->dev_path, 
+                verbs->device->ibdev_path);
+}
+
+/*
+ * Put in the log file the RDMA gid addressing information,
+ * useful for folks who have trouble understanding the
+ * RDMA device hierarchy in the kernel. 
+ */
+static void qemu_rdma_dump_gid(const char * who, struct rdma_cm_id * id)
+{
+    char sgid[33];
+    char dgid[33];
+    inet_ntop(AF_INET6, &id->route.addr.addr.ibaddr.sgid, sgid, sizeof sgid);
+    inet_ntop(AF_INET6, &id->route.addr.addr.ibaddr.dgid, dgid, sizeof dgid);
+    DPRINTF("%s Source GID: %s, Dest GID: %s\n", who, sgid, dgid);
+}
+
+/*
+ * Figure out which RDMA device corresponds to the requested IP hostname
+ * Also create the initial connection manager identifiers for opening
+ * the connection.
+ */
+static int qemu_rdma_resolve_host(RDMAContext *rdma)
+{
+    int ret;
+    struct addrinfo *res;
+    char port_str[16];
+    struct rdma_cm_event *cm_event;
+    char ip[40] = "unknown";
+
+    if (rdma->host == NULL || !strcmp(rdma->host, "")) {
+        fprintf(stderr, "RDMA hostname has not been set\n");
+        return -1;
+    }
+
+    /* create CM channel */
+    rdma->channel = rdma_create_event_channel();
+    if (!rdma->channel) {
+        fprintf(stderr, "could not create CM channel\n");
+        return -1;
+    }
+
+    /* create CM id */
+    ret = rdma_create_id(rdma->channel, &rdma->cm_id, NULL, RDMA_PS_TCP);
+    if (ret) {
+        fprintf(stderr, "could not create channel id\n");
+        goto err_resolve_create_id;
+    }
+
+    snprintf(port_str, 16, "%d", rdma->port);
+    port_str[15] = '\0';
+
+    ret = getaddrinfo(rdma->host, port_str, NULL, &res);
+    if (ret < 0) {
+        fprintf(stderr, "could not getaddrinfo destination address %s\n", rdma->host);
+        goto err_resolve_get_addr;
+    }
+
+    inet_ntop(AF_INET, &((struct sockaddr_in *) res->ai_addr)->sin_addr, 
+                                ip, sizeof ip);
+    printf("%s => %s\n", rdma->host, ip);
+
+    /* resolve the first address */
+    ret = rdma_resolve_addr(rdma->cm_id, NULL, res->ai_addr,
+            RDMA_RESOLVE_TIMEOUT_MS);
+    if (ret) {
+        fprintf(stderr, "could not resolve address %s\n", rdma->host);
+        goto err_resolve_get_addr;
+    }
+
+    qemu_rdma_dump_gid("client_resolve_addr", rdma->cm_id);
+
+    ret = rdma_get_cm_event(rdma->channel, &cm_event);
+    if (ret) {
+        fprintf(stderr, "could not perform event_addr_resolved\n");
+        goto err_resolve_get_addr;
+    }
+
+    if (cm_event->event != RDMA_CM_EVENT_ADDR_RESOLVED) {
+        fprintf(stderr, "result not equal to event_addr_resolved %s\n", 
+                rdma_event_str(cm_event->event));
+        perror("rdma_resolve_addr");
+        rdma_ack_cm_event(cm_event);
+        goto err_resolve_get_addr;
+    }
+    rdma_ack_cm_event(cm_event);
+
+    /* resolve route */
+    ret = rdma_resolve_route(rdma->cm_id, RDMA_RESOLVE_TIMEOUT_MS);
+    if (ret) {
+        fprintf(stderr, "could not resolve rdma route\n");
+        goto err_resolve_get_addr;
+    }
+
+    ret = rdma_get_cm_event(rdma->channel, &cm_event);
+    if (ret) {
+        fprintf(stderr, "could not perform event_route_resolved\n");
+        goto err_resolve_get_addr;
+    }
+    if (cm_event->event != RDMA_CM_EVENT_ROUTE_RESOLVED) {
+        fprintf(stderr, "result not equal to event_route_resolved: %s\n", rdma_event_str(cm_event->event));
+        rdma_ack_cm_event(cm_event);
+        goto err_resolve_get_addr;
+    }
+    rdma_ack_cm_event(cm_event);
+    rdma->verbs = rdma->cm_id->verbs;
+    qemu_rdma_dump_id("client_resolve_host", rdma->cm_id->verbs);
+    qemu_rdma_dump_gid("client_resolve_host", rdma->cm_id);
+    return 0;
+
+err_resolve_get_addr:
+    rdma_destroy_id(rdma->cm_id);
+err_resolve_create_id:
+    rdma_destroy_event_channel(rdma->channel);
+    rdma->channel = NULL;
+
+    return -1;
+}
+
+/*
+ * Create protection domain and completion queues
+ */
+static int qemu_rdma_alloc_pd_cq(RDMAContext *rdma)
+{
+    /* allocate pd */
+    rdma->pd = ibv_alloc_pd(rdma->verbs);
+    if (!rdma->pd) {
+        return -1;
+    }
+
+#ifdef RDMA_BLOCKING
+    /* create completion channel */
+    rdma->comp_channel = ibv_create_comp_channel(rdma->verbs);
+    if (!rdma->comp_channel) {
+        goto err_alloc_pd_cq;
+    }
+#endif
+
+    /* create cq */
+    rdma->cq = ibv_create_cq(rdma->verbs, RDMA_CQ_SIZE,
+            NULL, rdma->comp_channel, 0);
+    if (!rdma->cq) {
+        goto err_alloc_pd_cq;
+    }
+
+    return 0;
+
+err_alloc_pd_cq:
+    if (rdma->pd) {
+        ibv_dealloc_pd(rdma->pd);
+    }
+    if (rdma->comp_channel) {
+        ibv_destroy_comp_channel(rdma->comp_channel);
+    }
+    rdma->pd = NULL;
+    rdma->comp_channel = NULL;
+    return -1;
+
+}
+
+/*
+ * Create queue pairs.
+ */
+static int qemu_rdma_alloc_qp(RDMAContext *rdma)
+{
+    struct ibv_qp_init_attr attr = { 0 };
+    int ret;
+
+    attr.cap.max_send_wr = RDMA_QP_SIZE;
+    attr.cap.max_recv_wr = 3;
+    attr.cap.max_send_sge = 1;
+    attr.cap.max_recv_sge = 1;
+    attr.send_cq = rdma->cq;
+    attr.recv_cq = rdma->cq;
+    attr.qp_type = IBV_QPT_RC;
+
+    ret = rdma_create_qp(rdma->cm_id, rdma->pd, &attr);
+    if (ret) {
+        return -1;
+    }
+
+    rdma->qp = rdma->cm_id->qp;
+    return 0;
+}
+
+static int qemu_rdma_get_fd(void *opaque)
+{
+    return -2;
+}
+
+/*
+ * This is probably dead code, but its here anyway for testing.
+ * Sometimes nice to know the performance tradeoffs of pinning.
+ */
+#if !defined(RDMA_LAZY_CLIENT_REGISTRATION)
+static int qemu_rdma_reg_chunk_ram_blocks(RDMAContext *rdma,
+        RDMALocalBlocks *rdma_local_ram_blocks)
+{
+    int i, j;
+    for (i = 0; i < rdma_local_ram_blocks->num_blocks; i++) {
+        RDMALocalBlock *block = &(rdma_local_ram_blocks->block[i]);
+        int num_chunks = RDMA_REG_NUM_CHUNKS(block);
+        /* allocate memory to store chunk MRs */
+        rdma_local_ram_blocks->block[i].pmr = g_malloc0(
+                                num_chunks * sizeof(struct ibv_mr *));
+
+        if (!block->pmr) {
+            goto err_reg_chunk_ram_blocks;
+        }
+
+        for (j = 0; j < num_chunks; j++) {
+            uint8_t *start_addr = RDMA_REG_CHUNK_START(block, j);
+            uint8_t *end_addr = RDMA_REG_CHUNK_END(block, j);
+            if (start_addr < block->local_host_addr) {
+                start_addr = block->local_host_addr;
+            }
+            if (end_addr > block->local_host_addr + block->length) {
+                end_addr = block->local_host_addr + block->length;
+            }
+            block->pmr[j] = ibv_reg_mr(rdma->pd,
+                                start_addr,
+                                end_addr - start_addr,
+                                //IBV_ACCESS_LOCAL_WRITE |
+                                //IBV_ACCESS_REMOTE_WRITE |
+                                //IBV_ACCESS_GIFT |
+                                IBV_ACCESS_REMOTE_READ
+                                );
+            if (!block->pmr[j]) {
+                break;
+            }
+        }
+        if (j < num_chunks) {
+            for (j--; j >= 0; j--) {
+                ibv_dereg_mr(block->pmr[j]);
+            }
+            block->pmr[i] = NULL;
+            goto err_reg_chunk_ram_blocks;
+        }
+    }
+
+    return 0;
+
+err_reg_chunk_ram_blocks:
+    for (i--; i >= 0; i--) {
+        int num_chunks =
+            RDMA_REG_NUM_CHUNKS(&(rdma_local_ram_blocks->block[i]));
+        for (j = 0; j < num_chunks; j++) {
+            ibv_dereg_mr(rdma_local_ram_blocks->block[i].pmr[j]);
+        }
+        free(rdma_local_ram_blocks->block[i].pmr);
+        rdma_local_ram_blocks->block[i].pmr = NULL;
+    }
+
+    return -1;
+
+}
+#endif
+
+/*
+ * Also probably dead code, but for the same reason, its nice
+ * to know the performance tradeoffs of dynamic registration
+ * on both sides of the connection.
+ */
+static int qemu_rdma_reg_whole_ram_blocks(RDMAContext *rdma, 
+                                RDMALocalBlocks *rdma_local_ram_blocks)
+{
+    int i;
+    for (i = 0; i < rdma_local_ram_blocks->num_blocks; i++) {
+        rdma_local_ram_blocks->block[i].mr =
+            ibv_reg_mr(rdma->pd,
+                    rdma_local_ram_blocks->block[i].local_host_addr,
+                    rdma_local_ram_blocks->block[i].length,
+                    IBV_ACCESS_LOCAL_WRITE |
+                    IBV_ACCESS_REMOTE_WRITE
+                    );
+        if (!rdma_local_ram_blocks->block[i].mr) {
+            fprintf(stderr, "Failed to register local server ram block!\n");
+            break;
+        }
+    }
+
+    if (i >= rdma_local_ram_blocks->num_blocks) {
+        return 0;
+    }
+
+    for (i--; i >= 0; i--) {
+        ibv_dereg_mr(rdma_local_ram_blocks->block[i].mr);
+    }
+
+    return -1;
+
+}
+
+static int qemu_rdma_client_reg_ram_blocks(RDMAContext *rdma,
+                                RDMALocalBlocks *rdma_local_ram_blocks)
+{
+#ifdef RDMA_CHUNK_REGISTRATION
+#ifdef RDMA_LAZY_CLIENT_REGISTRATION
+    return 0;
+#else
+    return qemu_rdma_reg_chunk_ram_blocks(rdma, rdma_local_ram_blocks);
+#endif
+#else
+    return qemu_rdma_reg_whole_ram_blocks(rdma, rdma_local_ram_blocks);
+#endif
+}
+
+static int qemu_rdma_server_reg_ram_blocks(RDMAContext *rdma,
+                                RDMALocalBlocks *rdma_local_ram_blocks)
+{
+    return qemu_rdma_reg_whole_ram_blocks(rdma, rdma_local_ram_blocks);
+}
+
+/*
+ * Shutdown and clean things up.
+ */
+static void qemu_rdma_dereg_ram_blocks(RDMALocalBlocks *rdma_local_ram_blocks)
+{
+    int i, j;
+    for (i = 0; i < rdma_local_ram_blocks->num_blocks; i++) {
+        int num_chunks;
+        if (!rdma_local_ram_blocks->block[i].pmr) {
+            continue;
+        }
+        num_chunks = RDMA_REG_NUM_CHUNKS(&(rdma_local_ram_blocks->block[i]));
+        for (j = 0; j < num_chunks; j++) {
+            if (!rdma_local_ram_blocks->block[i].pmr[j]) {
+                continue;
+            }
+            ibv_dereg_mr(rdma_local_ram_blocks->block[i].pmr[j]);
+        }
+        free(rdma_local_ram_blocks->block[i].pmr);
+        rdma_local_ram_blocks->block[i].pmr = NULL;
+    }
+    for (i = 0; i < rdma_local_ram_blocks->num_blocks; i++) {
+        if (!rdma_local_ram_blocks->block[i].mr) {
+            continue;
+        }
+        ibv_dereg_mr(rdma_local_ram_blocks->block[i].mr);
+        rdma_local_ram_blocks->block[i].mr = NULL;
+    }
+}
+
+/*
+ * Server uses this to prepare to transmit the RAMBlock descriptions
+ * to the primary VM after connection setup.
+ * Both sides use the "remote" structure to communicate and update
+ * their "local" descriptions with what was sent.
+ */
+static void qemu_rdma_copy_to_remote_ram_blocks(RDMAContext *rdma,
+                                                RDMALocalBlocks *local,
+                                                RDMARemoteBlocks *remote)
+{
+    int i;
+    DPRINTF("Allocating %d remote ram block structures\n", local->num_blocks);
+    *remote->num_blocks = local->num_blocks;
+
+    for (i = 0; i < local->num_blocks; i++) {
+            remote->block[i].remote_host_addr =
+                (uint64_t)(local->block[i].local_host_addr);
+
+            if(rdma->chunk_register_destination == false)
+                remote->block[i].remote_rkey = local->block[i].mr->rkey;
+
+            remote->block[i].offset = local->block[i].offset;
+            remote->block[i].length = local->block[i].length;
+    }
+}
+
+/*
+ * Client then propogates the remote ram block descriptions to his local copy.
+ * Really, only the virtual addresses are useful, but we propogate everything
+ * anyway.
+ *
+ * If we're using dynamic registration on the server side (the default), then
+ * the 'rkeys' are not useful because we will re-ask for them later during
+ * runtime.
+ */
+static int qemu_rdma_process_remote_ram_blocks(RDMALocalBlocks *local, RDMARemoteBlocks *remote)
+{
+    int i, j;
+
+    if (local->num_blocks != *remote->num_blocks) {
+        fprintf(stderr, "local %d != remote %d\n", 
+            local->num_blocks, *remote->num_blocks);
+        return -1;
+    }
+
+    for (i = 0; i < *remote->num_blocks; i++) {
+        /* search local ram blocks */
+        for (j = 0; j < local->num_blocks; j++) {
+            if (remote->block[i].offset != local->block[j].offset) {
+                continue;
+            }
+            if (remote->block[i].length != local->block[j].length) {
+                return -1;
+            }
+            local->block[j].remote_host_addr =
+                remote->block[i].remote_host_addr;
+            local->block[j].remote_rkey = remote->block[i].remote_rkey;
+            break;
+        }
+        if (j >= local->num_blocks) {
+            return -1;
+        }
+    }
+
+    return 0;
+}
+
+/*
+ * Find the ram block that corresponds to the page requested to be
+ * transmitted by QEMU.
+ *
+ * Once the block is found, also identify which 'chunk' within that
+ * block that the page belongs to.
+ *
+ * This search cannot fail or the migration will fail. 
+ */
+static int qemu_rdma_search_ram_block(uint64_t offset, uint64_t length,
+        RDMALocalBlocks *blocks, int *block_index, int *chunk_index)
+{
+    int i;
+    for (i = 0; i < blocks->num_blocks; i++) {
+        if (offset < blocks->block[i].offset) {
+            continue;
+        }
+        if (offset + length >
+                blocks->block[i].offset + blocks->block[i].length) {
+            continue;
+        }
+        *block_index = i;
+        if (chunk_index) {
+            uint8_t *host_addr = blocks->block[i].local_host_addr +
+                (offset - blocks->block[i].offset);
+            *chunk_index = RDMA_REG_CHUNK_INDEX(
+                    blocks->block[i].local_host_addr, host_addr);
+        }
+        return 0;
+    }
+    return -1;
+}
+
+/*
+ * Register a chunk with IB. If the chunk was already registered
+ * previously, then skip.
+ *
+ * Also return the keys associated with the registration needed
+ * to perform the actual RDMA operation.
+ */ 
+static int qemu_rdma_register_and_get_keys(RDMAContext *rdma,
+        RDMALocalBlock *block, uint64_t host_addr,
+        uint32_t *lkey, uint32_t *rkey)
+{
+    int chunk;
+    if (block->mr) {
+        if(lkey)
+            *lkey = block->mr->lkey;
+        if(rkey)
+            *rkey = block->mr->rkey;
+        return 0;
+    }
+
+    /* allocate memory to store chunk MRs */
+    if (!block->pmr) {
+        int num_chunks = RDMA_REG_NUM_CHUNKS(block);
+        block->pmr = g_malloc0(num_chunks *
+                sizeof(struct ibv_mr *));
+        if (!block->pmr) {
+            return -1;
+        }
+    }
+
+    /*
+     * If 'rkey', then we're the server performing a dynamic
+     * registration, so grant access to the client.
+     *
+     * If 'lkey', then we're the primary VM performing a dynamic
+     * registration, so grant access only to ourselves.
+     */
+    chunk = RDMA_REG_CHUNK_INDEX(block->local_host_addr, host_addr);
+    if (!block->pmr[chunk]) {
+        uint8_t *start_addr = RDMA_REG_CHUNK_START(block, chunk);
+        uint8_t *end_addr = RDMA_REG_CHUNK_END(block, chunk);
+        if (start_addr < block->local_host_addr) {
+            start_addr = block->local_host_addr;
+        }
+        if (end_addr > block->local_host_addr + block->length) {
+            end_addr = block->local_host_addr + block->length;
+        }
+        block->pmr[chunk] = ibv_reg_mr(rdma->pd,
+                start_addr,
+                end_addr - start_addr,
+                //(lkey ? IBV_ACCESS_GIFT : 0) |
+                (rkey ? (IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_WRITE) : 0)
+                | IBV_ACCESS_REMOTE_READ);
+        if (!block->pmr[chunk]) {
+            fprintf(stderr, "Failed to register chunk!\n");
+            return -1;
+        }
+    }
+    if(lkey)
+        *lkey = block->pmr[chunk]->lkey;
+    if(rkey)
+        *rkey = block->pmr[chunk]->rkey;
+    return 0;
+}
+
+/*
+ * Register (at connection time) the memory used for control
+ * channel messages.
+ */
+static int qemu_rdma_reg_control(RDMAContext *rdma, int idx)
+{
+    rdma->wr_data[idx].control_mr = ibv_reg_mr(rdma->pd,
+            rdma->wr_data[idx].control, RDMA_CONTROL_MAX_BUFFER,
+            IBV_ACCESS_LOCAL_WRITE |
+            IBV_ACCESS_REMOTE_WRITE |
+            IBV_ACCESS_REMOTE_READ);
+    if (rdma->wr_data[idx].control_mr) {
+        return 0;
+    }
+    return -1;
+}
+
+static int qemu_rdma_dereg_control(RDMAContext *rdma, int idx)
+{
+    return ibv_dereg_mr(rdma->wr_data[idx].control_mr);
+}
+
+#if defined(DEBUG_RDMA) || defined(DEBUG_RDMA_VERBOSE)
+static const char * print_wrid(int wrid) {
+    if(wrid >= RDMA_WRID_RECV_CONTROL)
+        return wrid_desc[RDMA_WRID_RECV_CONTROL];
+    return wrid_desc[wrid];
+}
+#endif
+
+/*
+ * Consult the connection manager to see a work request
+ * (of any kind) has completed.
+ * Return the work request ID that completed.
+ */
+static int qemu_rdma_poll(RDMAContext *rdma)
+{
+    int ret;
+    struct ibv_wc wc;
+
+    ret = ibv_poll_cq(rdma->cq, 1, &wc);
+    if (!ret) {
+        return RDMA_WRID_NONE;
+    }
+    if (ret < 0) {
+        fprintf(stderr, "ibv_poll_cq return %d!\n", ret);
+        return ret;
+    }
+    if (wc.status != IBV_WC_SUCCESS) {
+        fprintf(stderr, "ibv_poll_cq wc.status=%d %s!\n",
+                        wc.status, ibv_wc_status_str(wc.status));
+        fprintf(stderr, "ibv_poll_cq wrid=%s!\n", wrid_desc[wc.wr_id]);
+
+        return -1;
+    }
+
+    if(rdma->control_ready_expected &&
+        (wc.wr_id >= RDMA_WRID_RECV_CONTROL)) {
+        DPRINTF("completion %s #%" PRId64 " received (%" PRId64 ")\n", 
+            wrid_desc[RDMA_WRID_RECV_CONTROL], wc.wr_id -
+            RDMA_WRID_RECV_CONTROL, wc.wr_id);
+        rdma->control_ready_expected = 0;
+    }
+
+    if(wc.wr_id == RDMA_WRID_RDMA_WRITE) {
+        rdma->num_signaled_send--;
+        DPRINTF("completions %s (%" PRId64 ") left %d\n", 
+            print_wrid(wc.wr_id), wc.wr_id, rdma->num_signaled_send);
+    } else {
+        DPRINTF("other completion %s (%" PRId64 ") received left %d\n", 
+            print_wrid(wc.wr_id), wc.wr_id, rdma->num_signaled_send);
+    }
+   
+    return  (int)wc.wr_id;
+}
+
+/*
+ * Block until the next work request has completed.
+ * 
+ * First poll to see if a work request has already completed,
+ * otherwise block.
+ *
+ * If we encounter completed work requests for IDs other than
+ * the one we're interested in, then that's generally an error.
+ *
+ * The only exception is actual RDMA Write completions. These
+ * completions only need to be recorded, but do not actually
+ * need further processing.
+ */
+#ifdef RDMA_BLOCKING
+static int qemu_rdma_block_for_wrid(RDMAContext *rdma, int wrid)
+{
+    int num_cq_events = 0;
+    int r = RDMA_WRID_NONE;
+    struct ibv_cq *cq;
+    void *cq_ctx;
+
+    if (ibv_req_notify_cq(rdma->cq, 0)) {
+        return -1;
+    }
+    /* poll cq first */
+    while (r != wrid) {
+        r = qemu_rdma_poll(rdma);
+        if (r < 0) {
+            return r;
+        }
+        if (r == RDMA_WRID_NONE) {
+            break;
+        }
+        if(r != wrid) {
+            DPRINTF("A Wanted wrid %s (%d) but got %s (%d)\n", 
+                print_wrid(wrid), wrid, print_wrid(r), r);
+        }
+    }
+    if (r == wrid) {
+        return 0;
+    }
+
+    while (1) {
+        if (ibv_get_cq_event(rdma->comp_channel, &cq, &cq_ctx)) {
+            goto err_block_for_wrid;
+        }
+        num_cq_events++;
+        if (ibv_req_notify_cq(cq, 0)) {
+            goto err_block_for_wrid;
+        }
+        /* poll cq */
+        while (r != wrid) {
+            r = qemu_rdma_poll(rdma);
+            if (r < 0) {
+                goto err_block_for_wrid;
+            }
+            if (r == RDMA_WRID_NONE) {
+                break;
+            }
+            if(r != wrid) {
+                DPRINTF("B Wanted wrid %s (%d) but got %s (%d)\n", 
+                    print_wrid(wrid), wrid, print_wrid(r), r);
+            }
+        }
+        if (r == wrid) {
+            goto success_block_for_wrid;
+        }
+    }
+
+success_block_for_wrid:
+    if (num_cq_events) {
+        ibv_ack_cq_events(cq, num_cq_events);
+    }
+    return 0;
+
+err_block_for_wrid:
+    if (num_cq_events) {
+        ibv_ack_cq_events(cq, num_cq_events);
+    }
+    return -1;
+}
+#else
+static int qemu_rdma_poll_for_wrid(RDMAContext *rdma, int wrid)
+{
+    int r = RDMA_WRID_NONE;
+    while (r != wrid) {
+        r = qemu_rdma_poll(rdma);
+        if (r < 0) {
+            return r;
+        }
+    }
+    return 0;
+}
+#endif
+
+
+static int wait_for_wrid(RDMAContext *rdma, int wrid)
+{
+#ifdef RDMA_BLOCKING
+    return qemu_rdma_block_for_wrid(rdma, wrid);
+#else
+    return qemu_rdma_poll_for_wrid(rdma, wrid);
+#endif
+}
+
+static void control_to_network(RDMAControlHeader *control)
+{
+    control->version = htonl(control->version);
+    control->type = htonl(control->type);
+    control->len = htonl(control->len);
+}
+
+static void network_to_control(RDMAControlHeader *control)
+{
+    control->version = ntohl(control->version);
+    control->type = ntohl(control->type);
+    control->len = ntohl(control->len);
+}
+
+/*
+ * Post a SEND message work request for the control channel
+ * containing some data and block until the post completes.
+ */
+static int qemu_rdma_post_send_control(RDMAContext *rdma, uint8_t * buf, RDMAControlHeader * head)
+{
+    int ret = 0;
+    RDMAWorkRequestData * wr = &rdma->wr_data[RDMA_CONTROL_MAX_WR];
+    struct ibv_send_wr *bad_wr;
+    struct ibv_sge sge = 
+                    {
+                       .addr = (uint64_t)(wr->control),
+                       .length = head->len + RDMAControlHeaderSize,
+                       .lkey = wr->control_mr->lkey,
+                    };
+    struct ibv_send_wr send_wr = 
+                   {
+                       .wr_id = RDMA_WRID_SEND_CONTROL,
+                       .opcode = IBV_WR_SEND,
+                       .send_flags = IBV_SEND_SIGNALED,
+                       .sg_list = &sge,
+                       .num_sge = 1,
+                   };
+
+    if (head->version < RDMA_CONTROL_VERSION_MIN || 
+            head->version > RDMA_CONTROL_VERSION_MAX) {
+        fprintf(stderr, "SEND: Invalid control message version: %d,"
+                        " min: %d, max: %d\n", 
+                        head->version, RDMA_CONTROL_VERSION_MIN,
+                        RDMA_CONTROL_VERSION_MAX);
+        return -1;
+    }
+
+    DPRINTF("CONTROL: sending %s..\n", control_desc[head->type]);
+
+    /*
+     * We don't actually need to do a memcpy() in here if we used
+     * the "sge" properly, but since we're only sending control messages 
+     * (not RAM in a performance-critical path), then its OK for now.
+     *
+     * The copy makes the RDMAControlHeader simpler to manipulate
+     * for the time being.
+     */
+    memcpy(wr->control, head, RDMAControlHeaderSize);
+    control_to_network((void *) wr->control);
+
+    if(buf)
+        memcpy(wr->control + RDMAControlHeaderSize, buf, head->len);
+
+
+    if (ibv_post_send(rdma->qp, &send_wr, &bad_wr)) {
+        return -1;
+    }
+
+    if (ret < 0) {
+        fprintf(stderr, "Failed to use post IB SEND for control!\n");
+        return ret;
+    }
+
+    ret = wait_for_wrid(rdma, RDMA_WRID_SEND_CONTROL);
+    if (ret < 0) {
+        fprintf(stderr, "rdma migration: polling control error!");
+    }
+
+    return ret;
+}
+
+/*
+ * Post a RECV work request in anticipation of some future receipt
+ * of data on the control channel.
+ */
+static int qemu_rdma_post_recv_control(RDMAContext *rdma, int idx)
+{
+    struct ibv_recv_wr *bad_wr;
+    struct ibv_sge sge = {
+                            .addr = (uint64_t)(rdma->wr_data[idx].control),
+                            .length = RDMA_CONTROL_MAX_BUFFER,
+                            .lkey = rdma->wr_data[idx].control_mr->lkey,
+                         };
+
+    struct ibv_recv_wr recv_wr = 
+                         { 
+                            .wr_id = RDMA_WRID_RECV_CONTROL + idx,
+                            .sg_list = &sge,
+                            .num_sge = 1,
+                         };
+
+    if (ibv_post_recv(rdma->qp, &recv_wr, &bad_wr)) {
+        return -1;
+    }
+
+    return 0;
+}
+
+/*
+ * Block and wait for a RECV control channel message to arrive. 
+ */
+static int qemu_rdma_exchange_get_response(RDMAContext *rdma,
+                RDMAControlHeader *head, int expecting, int idx)
+{
+    int ret = wait_for_wrid(rdma, RDMA_WRID_RECV_CONTROL + idx);
+
+    if (ret < 0) {
+        fprintf(stderr, "rdma migration: polling control error!\n");
+        return ret;
+    }
+
+    network_to_control((void *) rdma->wr_data[idx].control);
+    memcpy(head, rdma->wr_data[idx].control, RDMAControlHeaderSize);
+
+    if (head->version < RDMA_CONTROL_VERSION_MIN || 
+            head->version > RDMA_CONTROL_VERSION_MAX) {
+        fprintf(stderr, "RECV: Invalid control message version: %d,"
+                        " min: %d, max: %d\n", 
+                        head->version, RDMA_CONTROL_VERSION_MIN,
+                        RDMA_CONTROL_VERSION_MAX);
+        return -1;
+    }
+
+    DPRINTF("CONTROL: %s received\n", control_desc[expecting]);
+
+    if (expecting != RDMA_CONTROL_NONE && head->type != expecting) {
+        fprintf(stderr, "Was expecting a %s control message"
+                ", but got: %s, length: %d\n", 
+                control_desc[expecting], 
+                control_desc[head->type], head->len);
+        return -EIO;
+    }
+
+    return 0;
+}
+
+/*
+ * When a RECV work request has completed, the work request's
+ * buffer is pointed at the header. 
+ * 
+ * This will advance the pointer to the data portion 
+ * of the control message of the work request's buffer that
+ * was populated after the work request finished.
+ */ 
+static void qemu_rdma_move_header(RDMAContext *rdma, int idx, 
+                                  RDMAControlHeader *head)
+{
+    rdma->wr_data[idx].control_len = head->len;
+    rdma->wr_data[idx].control_curr = rdma->wr_data[idx].control + RDMAControlHeaderSize;
+}
+
+/*
+ * This is an 'atomic' high-level operation to deliver a single, unified
+ * control-channel message.
+ * 
+ * Additionally, if the user is expecting some kind of reply to this message,
+ * they can request a 'resp' response message be filled in by posting an
+ * additional work request on behalf of the user and waiting for an additional
+ * completion. 
+ * 
+ * The extra (optional) response is used during registration to us from having
+ * to perform an *additional* exchange of message just to provide a response by
+ * instead piggy-backing on the acknowledgement.
+ */
+static int qemu_rdma_exchange_send(RDMAContext *rdma, RDMAControlHeader *head, 
+                                   uint8_t * data, RDMAControlHeader *resp, 
+                                   int * resp_idx)
+{
+    int ret = 0;
+    int idx = 0;
+
+    /*
+     * Wait until the server is ready before attempting to deliver the message
+     * by waiting for a READY message.
+     */
+    if(rdma->control_ready_expected) {
+        RDMAControlHeader resp;
+        ret = qemu_rdma_exchange_get_response(rdma, 
+                                    &resp, RDMA_CONTROL_READY, idx);
+        if(ret < 0)
+            return ret;
+    }
+
+    /*
+     * If the user is expecting a response, post a WR in anticipation of it.
+     */
+    if(resp) {
+        ret = qemu_rdma_post_recv_control(rdma, idx + 1);
+        if (ret) {
+            fprintf(stderr, "rdma migration: error posting"
+                    " extra control recv for anticipated result!");
+            return ret;
+        }
+    }
+
+    /*
+     * Post a WR to replace the one we just consumed for the READY message.
+     */
+    ret = qemu_rdma_post_recv_control(rdma, idx);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error posting first control recv!");
+        return ret;
+    }
+
+    /*
+     * Deliver the control message that was requested.
+     */
+    ret = qemu_rdma_post_send_control(rdma, data, head);
+
+    if(ret < 0) {
+        fprintf(stderr, "Failed to send control buffer!\n");
+        return ret;
+    }
+
+    /*
+     * If we're expecting a response, block and wait for it.
+     */
+    if(resp) {
+        DPRINTF("Waiting for response %s\n", control_desc[resp->type]);
+        ret = qemu_rdma_exchange_get_response(rdma, resp, resp->type, idx + 1);
+
+        if (ret < 0)
+            return ret;
+
+        qemu_rdma_move_header(rdma, idx + 1, resp);
+        *resp_idx = idx + 1;
+        DPRINTF("Response %s received.\n", control_desc[resp->type]);
+    }
+
+    rdma->control_ready_expected = 1;
+
+    return 0;
+}
+
+/*
+ * This is an 'atomic' high-level operation to receive a single, unified
+ * control-channel message.
+ */
+static int qemu_rdma_exchange_recv(RDMAContext *rdma, RDMAControlHeader *head, 
+                                int expecting)
+{
+    RDMAControlHeader ready = { 
+                                .len = 0, 
+                                .type = RDMA_CONTROL_READY,
+                                .version = RDMA_CONTROL_CURRENT_VERSION, 
+                              };
+    int ret;
+    int idx = 0;
+
+    /*
+     * Inform the client that we're ready to receive a message.
+     */
+    ret = qemu_rdma_post_send_control(rdma, NULL, &ready);
+
+    if (ret < 0) {
+        fprintf(stderr, "Failed to send control buffer!\n");
+        return ret;
+    }
+
+    /*
+     * Block and wait for the message.
+     */
+    ret = qemu_rdma_exchange_get_response(rdma, head, expecting, idx);
+
+    if (ret < 0)
+        return ret;
+
+    qemu_rdma_move_header(rdma, idx, head);
+
+    /*
+     * Post a new RECV work request to replace the one we just consumed.
+     */
+    ret = qemu_rdma_post_recv_control(rdma, idx);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error posting second control recv!");
+        return ret;
+    }
+
+    return 0;
+}
+
+/*
+ * Write an actual chunk of memory using RDMA.
+ *
+ * If we're using dynamic registration on the server-side, we have to
+ * send a registration command first.
+ */
+static int __qemu_rdma_write(QEMUFile *f, RDMAContext *rdma,
+        int current_index,
+        uint64_t offset, uint64_t length,
+        uint64_t wr_id, enum ibv_send_flags flag)
+{
+    struct ibv_sge sge;
+    struct ibv_send_wr send_wr = { 0 };
+    struct ibv_send_wr *bad_wr;
+    RDMALocalBlock *block = &(local_ram_blocks.block[current_index]);
+    int chunk;
+    RDMARegister reg;
+    RDMARegisterResult *reg_result;
+    int reg_result_idx;
+    RDMAControlHeader resp = { .len = sizeof(RDMARegisterResult),
+                               .type = RDMA_CONTROL_REGISTER_RESULT,
+                               .version = RDMA_CONTROL_CURRENT_VERSION, 
+                              };
+    RDMAControlHeader head = { .len = sizeof(RDMARegister), 
+                               .type = RDMA_CONTROL_REGISTER_REQUEST,
+                               .version = RDMA_CONTROL_CURRENT_VERSION, 
+                             };
+    int ret;
+
+    sge.addr = (uint64_t)(block->local_host_addr + (offset - block->offset));
+    sge.length = length;
+    if (qemu_rdma_register_and_get_keys(rdma, block, sge.addr, &sge.lkey, NULL)) {
+        fprintf(stderr, "cannot get lkey!\n");
+        return -EINVAL;
+    }
+
+    send_wr.wr_id = wr_id;
+    send_wr.opcode = IBV_WR_RDMA_WRITE;
+    send_wr.send_flags = flag;
+    send_wr.sg_list = &sge;
+    send_wr.num_sge = 1;
+    send_wr.wr.rdma.remote_addr = block->remote_host_addr +
+        (offset - block->offset);
+
+    if(rdma->chunk_register_destination) {
+        chunk = RDMA_REG_CHUNK_INDEX(block->local_host_addr, sge.addr);
+        if (!block->remote_keys[chunk]) {
+            /*
+             * Tell other side to register.
+             */
+            reg.len = sge.length;
+            reg.current_index = current_index;
+            reg.offset = offset;
+
+            DPRINTF("Sending registration request chunk %d for %d bytes...\n", chunk, sge.length);
+            ret = qemu_rdma_exchange_send(rdma, &head, (uint8_t *) &reg, &resp, &reg_result_idx);
+            if(ret < 0)
+                return ret;
+
+            reg_result = (RDMARegisterResult *) rdma->wr_data[reg_result_idx].control_curr;
+            DPRINTF("Received registration result:"
+                    " my key: %x their key %x, chunk %d\n",
+                    block->remote_keys[chunk], reg_result->rkey, chunk);
+
+            block->remote_keys[chunk] = reg_result->rkey;
+        }
+
+        send_wr.wr.rdma.rkey = block->remote_keys[chunk];
+    } else {
+        send_wr.wr.rdma.rkey = block->remote_rkey;
+    }
+
+    return ibv_post_send(rdma->qp, &send_wr, &bad_wr);
+}
+
+/*
+ * Push out any unwritten RDMA operations.
+ *
+ * We support sending out multiple chunks at the same time.
+ * Not all of them need to get signaled in the completion queue.
+ */
+static int qemu_rdma_write_flush(QEMUFile *f, RDMAContext *rdma)
+{
+    int ret;
+    enum ibv_send_flags flags = 0;
+
+    if (!rdma->current_length) {
+        return 0;
+    }
+    if (rdma->num_unsignaled_send >=
+            RDMA_UNSIGNALED_SEND_MAX) {
+        flags = IBV_SEND_SIGNALED;
+    }
+
+    while(1) {
+        ret = __qemu_rdma_write(f, rdma,
+                rdma->current_index,
+                rdma->current_offset,
+                rdma->current_length,
+                RDMA_WRID_RDMA_WRITE, flags);
+        if(ret) {
+            if(ret == ENOMEM) {
+                DPRINTF("send queue is full. wait a little....\n");
+                ret = wait_for_wrid(rdma, RDMA_WRID_RDMA_WRITE);
+                if(ret < 0) {
+                    fprintf(stderr, "rdma migration: failed to make room in full send queue! %d\n", ret);
+                    return -EIO;
+                }
+            } else {
+                 fprintf(stderr, "rdma migration: write flush error! %d\n", ret);
+                 perror("write flush error");
+                 return -EIO;
+            }
+        } else {
+                break;
+        }
+    }
+
+    if (rdma->num_unsignaled_send >=
+            RDMA_UNSIGNALED_SEND_MAX) {
+        rdma->num_unsignaled_send = 0;
+        rdma->num_signaled_send++;
+        DPRINTF("signaled total: %d\n", rdma->num_signaled_send);
+    } else {
+        rdma->num_unsignaled_send++;
+    }
+
+    rdma->current_length = 0;
+    rdma->current_offset = 0;
+
+    return 0;
+}
+
+static inline int qemu_rdma_in_current_block(RDMAContext *rdma,
+                uint64_t offset, uint64_t len)
+{
+    RDMALocalBlock *block =
+        &(local_ram_blocks.block[rdma->current_index]);
+    if (rdma->current_index < 0) {
+        return 0;
+    }
+    if (offset < block->offset) {
+        return 0;
+    }
+    if (offset + len > block->offset + block->length) {
+        return 0;
+    }
+    return 1;
+}
+
+static inline int qemu_rdma_in_current_chunk(RDMAContext *rdma,
+                uint64_t offset, uint64_t len)
+{
+    RDMALocalBlock *block = &(local_ram_blocks.block[rdma->current_index]);
+    uint8_t *chunk_start, *chunk_end, *host_addr;
+    if (rdma->current_chunk < 0) {
+        return 0;
+    }
+    host_addr = block->local_host_addr + (offset - block->offset);
+    chunk_start = RDMA_REG_CHUNK_START(block, rdma->current_chunk);
+    if (chunk_start < block->local_host_addr) {
+        chunk_start = block->local_host_addr;
+    }
+    if (host_addr < chunk_start) {
+        return 0;
+    }
+    chunk_end = RDMA_REG_CHUNK_END(block, rdma->current_chunk);
+    if (chunk_end > chunk_start + block->length) {
+        chunk_end = chunk_start + block->length;
+    }
+    if (host_addr + len > chunk_end) {
+        return 0;
+    }
+    return 1;
+}
+
+static inline int qemu_rdma_buffer_mergable(RDMAContext *rdma,
+                    uint64_t offset, uint64_t len)
+{
+    if (rdma->current_length == 0) {
+        return 0;
+    }
+    if (offset != rdma->current_offset + rdma->current_length) {
+        return 0;
+    }
+    if (!qemu_rdma_in_current_block(rdma, offset, len)) {
+        return 0;
+    }
+#ifdef RDMA_CHUNK_REGISTRATION
+    if (!qemu_rdma_in_current_chunk(rdma, offset, len)) {
+        return 0;
+    }
+#endif
+    return 1;
+}
+
+/*
+ * We're not actually writing here, but doing three things:
+ *
+ * 1. Identify the chunk the buffer belongs to.
+ * 2. If the chunk is full or the buffer doesn't belong to the current
+      chunk, then start a new chunk and flush() the old chunk.
+ * 3. To keep the hardware busy, we also group chunks into batches
+      and only require that a batch gets acknowledged in the completion
+      qeueue instead of each individual chunk. 
+ */
+static int qemu_rdma_write(QEMUFile *f, RDMAContext *rdma, uint64_t offset, uint64_t len)
+{
+    int index = rdma->current_index;
+    int chunk_index = rdma->current_chunk;
+    int ret;
+
+    /* If we cannot merge it, we flush the current buffer first. */
+    if (!qemu_rdma_buffer_mergable(rdma, offset, len)) {
+        ret = qemu_rdma_write_flush(f, rdma);
+        if (ret) {
+            return ret;
+        }
+        rdma->current_length = 0;
+        rdma->current_offset = offset;
+
+        if ((ret = qemu_rdma_search_ram_block(offset, len,
+                    &local_ram_blocks, &index, &chunk_index))) {
+            fprintf(stderr, "ram block search failed\n");
+            return ret;
+        }
+        rdma->current_index = index;
+        rdma->current_chunk = chunk_index;
+    }
+
+    /* merge it */
+    rdma->current_length += len;
+
+    /* flush it if buffer is too large */
+    if (rdma->current_length >= RDMA_MERGE_MAX) {
+        return qemu_rdma_write_flush(f, rdma);
+    }
+
+    return 0;
+}
+
+static void qemu_rdma_cleanup(void * opaque)
+{
+    RDMAContext *rdma = opaque;
+    struct rdma_cm_event *cm_event;
+    int ret, idx;
+
+    if(rdma->cm_id) {
+        DPRINTF("Disconnecting...\n");
+        ret = rdma_disconnect(rdma->cm_id);
+        if (!ret) {
+            ret = rdma_get_cm_event(rdma->channel, &cm_event);
+            if (!ret) {
+                rdma_ack_cm_event(cm_event);
+            }
+        }
+        DPRINTF("Disconnected.\n");
+    }
+
+    if (remote_ram_blocks.remote_area) {
+        g_free(remote_ram_blocks.remote_area);
+    }
+
+    for(idx = 0; idx < (RDMA_CONTROL_MAX_WR + 1); idx++) {
+        if (rdma->wr_data[idx].control_mr) {
+            qemu_rdma_dereg_control(rdma, idx);
+        }
+        rdma->wr_data[idx].control_mr = NULL;
+    }
+
+    qemu_rdma_dereg_ram_blocks(&local_ram_blocks);
+
+    if(local_ram_blocks.block) {
+        if(rdma->chunk_register_destination) {
+            for (idx = 0; idx < local_ram_blocks.num_blocks; idx++) {
+                RDMALocalBlock *block = &(local_ram_blocks.block[idx]);
+                if(block->remote_keys)
+                    g_free(block->remote_keys);
+            }
+        }
+        g_free(local_ram_blocks.block);
+    }
+
+    if (rdma->qp) {
+        ibv_destroy_qp(rdma->qp);
+    }
+    if (rdma->cq) {
+        ibv_destroy_cq(rdma->cq);
+    }
+    if (rdma->comp_channel) {
+        ibv_destroy_comp_channel(rdma->comp_channel);
+    }
+    if (rdma->pd) {
+        ibv_dealloc_pd(rdma->pd);
+    }
+    if (rdma->listen_id) {
+        rdma_destroy_id(rdma->listen_id);
+    }
+    if (rdma->cm_id) {
+        rdma_destroy_id(rdma->cm_id);
+        rdma->cm_id = 0;
+    }
+    if (rdma->channel) {
+        rdma_destroy_event_channel(rdma->channel);
+    }
+}
+
+static void qemu_rdma_remote_ram_blocks_init(void)
+{
+    int remote_size = (sizeof(RDMARemoteBlock) * 
+                        local_ram_blocks.num_blocks)
+                        +   sizeof(*remote_ram_blocks.num_blocks);
+
+    DPRINTF("Preparing %d bytes for remote info\n", remote_size);
+
+    remote_ram_blocks.remote_area = g_malloc0(remote_size);
+    remote_ram_blocks.remote_size = remote_size;
+    remote_ram_blocks.num_blocks = remote_ram_blocks.remote_area;
+    remote_ram_blocks.block = (void *) (remote_ram_blocks.num_blocks + 1);
+}
+
+static int qemu_rdma_client_init(void * opaque, Error **errp,
+                          bool chunk_register_destination)
+{
+    RDMAContext *rdma = opaque;
+    int ret, idx;
+
+    if (rdma->client_init_done) {
+        return 0;
+    }
+
+    rdma->chunk_register_destination = chunk_register_destination;
+
+    ret = qemu_rdma_resolve_host(rdma);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error resolving host!");
+        goto err_rdma_client_init;
+    }
+
+    ret = qemu_rdma_alloc_pd_cq(rdma);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error allocating pd and cq!");
+        goto err_rdma_client_init;
+    }
+
+    ret = qemu_rdma_alloc_qp(rdma);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error allocating qp!");
+        goto err_rdma_client_init;
+    }
+
+    ret = qemu_rdma_init_ram_blocks(&local_ram_blocks);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error initializing ram blocks!");
+        goto err_rdma_client_init;
+    }
+
+    ret = qemu_rdma_client_reg_ram_blocks(rdma, &local_ram_blocks);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error client registering ram blocks!");
+        goto err_rdma_client_init;
+    }
+
+    for(idx = 0; idx < (RDMA_CONTROL_MAX_WR + 1); idx++) {
+        ret = qemu_rdma_reg_control(rdma, idx);
+        if (ret) {
+            fprintf(stderr, "rdma migration: error registering %d control!", idx);
+            goto err_rdma_client_init;
+        }
+    }
+
+    qemu_rdma_remote_ram_blocks_init();
+
+    rdma->client_init_done = 1;
+    return 0;
+
+err_rdma_client_init:
+    qemu_rdma_cleanup(rdma);
+    return -1;
+}
+
+static void caps_to_network(RDMACapabilities *cap)
+{
+    cap->version = htonl(cap->version);
+    cap->flags = htonl(cap->flags);
+}
+
+static void network_to_caps(RDMACapabilities *cap)
+{
+    cap->version = ntohl(cap->version);
+    cap->flags = ntohl(cap->flags);
+}
+
+static int qemu_rdma_connect(void * opaque, Error **errp)
+{
+    RDMAControlHeader head;
+    RDMAContext *rdma = opaque;
+    struct rdma_cm_event *cm_event;
+    RDMACapabilities cap = 
+                {
+                    .version = RDMA_CONTROL_CURRENT_VERSION,
+                    .flags = 0,
+                };
+    struct rdma_conn_param conn_param = { .initiator_depth = 2,
+                                          .retry_count = 5,
+                                          .private_data = &cap,
+                                          .private_data_len = sizeof(cap), 
+                                        };
+    int ret;
+    int idx = 0;
+    int x;
+
+    if(rdma->chunk_register_destination)
+        cap.flags |= RDMA_CAPABILITY_CHUNK_REGISTER;
+
+    caps_to_network(&cap);
+
+    ret = rdma_connect(rdma->cm_id, &conn_param);
+    if (ret) {
+        perror("rdma_connect");
+        fprintf(stderr, "rdma migration: error connecting!");
+        goto err_rdma_client_connect;
+    }
+
+    ret = rdma_get_cm_event(rdma->channel, &cm_event);
+    if (ret) {
+        perror("rdma_get_cm_event after rdma_connect");
+        fprintf(stderr, "rdma migration: error connecting!");
+        goto err_rdma_client_connect;
+    }
+
+    if (cm_event->event != RDMA_CM_EVENT_ESTABLISHED) {
+        perror("rdma_get_cm_event != EVENT_ESTABLISHED after rdma_connect");
+        fprintf(stderr, "rdma migration: error connecting!");
+        goto err_rdma_client_connect;
+    }
+
+    rdma_ack_cm_event(cm_event);
+
+    ret = qemu_rdma_post_recv_control(rdma, idx + 1);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error posting second control recv!");
+        goto err_rdma_client_connect;
+    }
+
+    ret = qemu_rdma_post_recv_control(rdma, idx);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error posting second control recv!");
+        goto err_rdma_client_connect;
+    }
+
+
+    ret = qemu_rdma_exchange_get_response(rdma, 
+                                &head, RDMA_CONTROL_RAM_BLOCKS, idx + 1);
+
+    if(ret < 0) {
+        fprintf(stderr, "rdma migration: error sending remote info!");
+        goto err_rdma_client_connect;
+    }
+
+    qemu_rdma_move_header(rdma, idx + 1, &head);
+    memcpy(remote_ram_blocks.remote_area, rdma->wr_data[idx + 1].control_curr, 
+                    remote_ram_blocks.remote_size);
+
+    ret = qemu_rdma_process_remote_ram_blocks(
+                            &local_ram_blocks, &remote_ram_blocks);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error processing remote ram blocks!\n");
+        goto err_rdma_client_connect;
+    }
+
+    if(rdma->chunk_register_destination) {
+        for (x = 0; x < local_ram_blocks.num_blocks; x++) {
+            RDMALocalBlock *block = &(local_ram_blocks.block[x]);
+            int num_chunks = RDMA_REG_NUM_CHUNKS(block);
+            /* allocate memory to store remote rkeys */
+            block->remote_keys = g_malloc0(num_chunks * sizeof(uint32_t));
+        }
+    }
+    rdma->control_ready_expected = 1;
+    rdma->num_signaled_send = 0;
+    return 0;
+
+err_rdma_client_connect:
+    qemu_rdma_cleanup(rdma);
+    return -1;
+}
+
+static int qemu_rdma_server_init(void * opaque, Error **errp)
+{
+    RDMAContext *rdma = opaque;
+    int ret, idx;
+    struct sockaddr_in sin;
+    struct rdma_cm_id *listen_id;
+    char ip[40] = "unknown";
+
+    for(idx = 0; idx < RDMA_CONTROL_MAX_WR; idx++) {
+        rdma->wr_data[idx].control_len = 0;
+        rdma->wr_data[idx].control_curr = NULL;  
+    }
+
+    if(rdma->host == NULL) {
+        fprintf(stderr, "Error: RDMA host is not set!");
+        return -1;
+    }
+    /* create CM channel */
+    rdma->channel = rdma_create_event_channel();
+    if (!rdma->channel) {
+        fprintf(stderr, "Error: could not create rdma event channel");
+        return -1;
+    }
+
+    /* create CM id */
+    ret = rdma_create_id(rdma->channel, &listen_id, NULL, RDMA_PS_TCP);
+    if (ret) {
+        fprintf(stderr, "Error: could not create cm_id!");
+        goto err_server_init_create_listen_id;
+    }
+
+    memset(&sin, 0, sizeof(sin));
+    sin.sin_family = AF_INET;
+    sin.sin_port = htons(rdma->port);
+
+    if (rdma->host && strcmp("", rdma->host)) {
+        struct hostent *server_addr;
+        server_addr = gethostbyname(rdma->host);
+        if (!server_addr) {
+            fprintf(stderr, "Error: migration could not gethostbyname!");
+            goto err_server_init_bind_addr;
+        }
+        memcpy(&sin.sin_addr.s_addr, server_addr->h_addr,
+                server_addr->h_length);
+        inet_ntop(AF_INET, server_addr->h_addr, ip, sizeof ip);
+    } else {
+        sin.sin_addr.s_addr = INADDR_ANY;
+    }
+
+    DPRINTF("%s => %s\n", rdma->host, ip);
+
+    ret = rdma_bind_addr(listen_id, (struct sockaddr *)&sin);
+    if (ret) {
+        fprintf(stderr, "Error: could not rdma_bind_addr!");
+        goto err_server_init_bind_addr;
+    }
+
+    rdma->listen_id = listen_id;
+    if (listen_id->verbs) {
+        rdma->verbs = listen_id->verbs;
+    }
+    qemu_rdma_dump_id("server_init", rdma->verbs);
+    qemu_rdma_dump_gid("server_init", listen_id);
+    return 0;
+
+err_server_init_bind_addr:
+    rdma_destroy_id(listen_id);
+err_server_init_create_listen_id:
+    rdma_destroy_event_channel(rdma->channel);
+    rdma->channel = NULL;
+    return -1;
+
+}
+
+static int qemu_rdma_server_prepare(void * opaque, Error **errp)
+{
+    RDMAContext *rdma = opaque;
+    int ret;
+    int idx;
+
+    if (!rdma->verbs) {
+        fprintf(stderr, "rdma migration: no verbs context!");
+        return 0;
+    }
+
+    ret = qemu_rdma_alloc_pd_cq(rdma);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error allocating pd and cq!");
+        goto err_rdma_server_prepare;
+    }
+
+    ret = qemu_rdma_init_ram_blocks(&local_ram_blocks);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error initializing ram blocks!");
+        goto err_rdma_server_prepare;
+    }
+
+    qemu_rdma_remote_ram_blocks_init();
+
+    /* Extra one for the send buffer */
+    for(idx = 0; idx < (RDMA_CONTROL_MAX_WR + 1); idx++) {
+        ret = qemu_rdma_reg_control(rdma, idx);
+        if (ret) {
+            fprintf(stderr, "rdma migration: error registering %d control!", idx);
+            goto err_rdma_server_prepare;
+        }
+    }
+
+    ret = rdma_listen(rdma->listen_id, 5);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error listening on socket!");
+        goto err_rdma_server_prepare;
+    }
+
+    return 0;
+
+err_rdma_server_prepare:
+    qemu_rdma_cleanup(rdma);
+    return -1;
+}
+
+static void *qemu_rdma_data_init(const char *host_port, Error **errp)
+{
+    RDMAContext *rdma = NULL;
+    InetSocketAddress *addr;
+
+    if(host_port) {
+        rdma = g_malloc0(sizeof(RDMAContext));
+        memset(rdma, 0, sizeof(RDMAContext));
+        rdma->current_index = -1;
+        rdma->current_chunk = -1;
+
+        addr = inet_parse(host_port, errp);
+        if (addr != NULL) {
+            rdma->port = atoi(addr->port);
+            rdma->host = g_strdup(addr->host);
+            printf("rdma host: %s\n", rdma->host);
+            printf("rdma port: %d\n", rdma->port);
+        } else {
+            error_setg(errp, "bad RDMA migration address '%s'", host_port);
+            g_free(rdma);
+            return NULL;
+        }
+    }
+
+    return rdma;
+}
+
+/*
+ * QEMUFile interface to the control channel.
+ * SEND messages for control only.
+ * pc.ram is handled with regular RDMA messages.
+ */
+static int qemu_rdma_put_buffer(void *opaque, const uint8_t *buf, int64_t pos, int size)
+{
+    QEMUFileRDMA *r = opaque;
+    QEMUFile *f = r->file;
+    RDMAContext *rdma = r->rdma;
+    size_t remaining = size;
+    uint8_t * data = (void *) buf;
+    int ret;
+
+    /*
+     * Push out any writes that
+     * we're queued up for pc.ram.
+     */
+    if (qemu_rdma_write_flush(f, rdma) < 0)
+        return -EIO;
+
+    while(remaining) {
+        RDMAControlHeader head;
+
+        r->len = MIN(remaining, RDMA_SEND_INCREMENT);
+        remaining -= r->len;
+
+        head.len = r->len;
+        head.type = RDMA_CONTROL_QEMU_FILE;
+        head.version = RDMA_CONTROL_CURRENT_VERSION;
+
+        ret = qemu_rdma_exchange_send(rdma, &head, data, NULL, NULL);
+
+        if(ret < 0)
+            return ret;
+
+        data += r->len;
+    }
+
+    return size;
+} 
+
+static size_t qemu_rdma_fill(RDMAContext * rdma, uint8_t *buf, int size, int idx)
+{
+    size_t len = 0;
+
+    if(rdma->wr_data[idx].control_len) {
+        DPRINTF("RDMA %" PRId64 " of %d bytes already in buffer\n",
+	    rdma->wr_data[idx].control_len, size);
+
+        len = MIN(size, rdma->wr_data[idx].control_len);
+        memcpy(buf, rdma->wr_data[idx].control_curr, len);
+        rdma->wr_data[idx].control_curr += len;
+        rdma->wr_data[idx].control_len -= len;
+    }
+
+    return len;
+}
+
+/*
+ * QEMUFile interface to the control channel.
+ * RDMA links don't use bytestreams, so we have to
+ * return bytes to QEMUFile opportunistically.
+ */
+static int qemu_rdma_get_buffer(void *opaque, uint8_t *buf, int64_t pos, int size)
+{
+    QEMUFileRDMA *r = opaque;
+    RDMAContext *rdma = r->rdma;
+    RDMAControlHeader head;
+    int ret = 0;
+
+    /*
+     * First, we hold on to the last SEND message we 
+     * were given and dish out the bytes until we run 
+     * out of bytes.
+     */
+    if((r->len = qemu_rdma_fill(r->rdma, buf, size, 0)))
+        return r->len; 
+
+     /*
+      * Once we run out, we block and wait for another
+      * SEND message to arrive.
+      */
+    ret = qemu_rdma_exchange_recv(rdma, &head, RDMA_CONTROL_QEMU_FILE);
+
+    if(ret < 0)
+        return ret;
+
+    /*
+     * SEND was received with new bytes, now try again.
+     */
+    return qemu_rdma_fill(r->rdma, buf, size, 0);
+} 
+
+/*
+ * Block until all the outstanding chunks have been delivered by the hardware.
+ */
+static int qemu_rdma_drain_cq(QEMUFile *f, RDMAContext *rdma)
+{
+    int ret;
+
+    if (qemu_rdma_write_flush(f, rdma) < 0) {
+        return -EIO;
+    }
+
+    while (rdma->num_signaled_send) {
+        ret = wait_for_wrid(rdma, RDMA_WRID_RDMA_WRITE);
+        if (ret < 0) {
+            fprintf(stderr, "rdma migration: complete polling error!\n");
+            return -EIO;
+        }
+    }
+
+    return 0;
+}
+
+static int qemu_rdma_close(void *opaque)
+{
+    QEMUFileRDMA *r = opaque;
+    if(r->rdma) {
+        qemu_rdma_cleanup(r->rdma);
+        g_free(r->rdma);
+    }
+    g_free(r);
+    return 0;
+}
+
+static size_t qemu_rdma_save_page(QEMUFile *f, void *opaque,
+                           ram_addr_t block_offset, 
+                           ram_addr_t offset,
+                           int cont, size_t size, 
+                           bool zero)
+{
+    ram_addr_t current_addr = block_offset + offset;
+    QEMUFileRDMA * rfile = opaque;
+    RDMAContext * rdma;
+    int ret;
+
+    if(rfile) {
+        rdma = rfile->rdma;
+    } else
+        return -ENOTSUP;
+
+    qemu_ftell(f);
+
+    if(zero)
+        return 0;
+
+    /*
+     * Add this page to the current 'chunk'. If the chunk
+     * is full, or the page doen't belong to the current chunk,
+     * an actual RDMA write will occur and a new chunk will be formed.
+     */
+    if ((ret = qemu_rdma_write(f, rdma, current_addr, size)) < 0) {
+        fprintf(stderr, "rdma migration: write error! %d\n", ret);
+        return ret;
+    }
+
+    /*
+     * Drain the Completion Queue if possible.
+     * If not, the end of the iteration will do this
+     * again to make sure we don't overflow the
+     * request queue. 
+     */
+    while (1) {
+        int ret = qemu_rdma_poll(rdma);
+        if (ret == RDMA_WRID_NONE) {
+            break;
+        }
+        if (ret < 0) {
+            fprintf(stderr, "rdma migration: polling error! %d\n", ret);
+            return ret;
+        }
+    }
+
+    return size;
+}
+
+static int qemu_rdma_accept(void * opaque)
+{
+    RDMAContext *rdma = opaque;
+    RDMAControlHeader head = { .len = remote_ram_blocks.remote_size, 
+                               .type = RDMA_CONTROL_RAM_BLOCKS,
+                               .version = RDMA_CONTROL_CURRENT_VERSION, 
+                             };
+    RDMACapabilities cap; 
+    struct rdma_conn_param conn_param = { 
+                                            .responder_resources = 2,
+                                            .private_data = NULL,
+                                            .private_data_len = 0, 
+                                         };
+    struct rdma_cm_event *cm_event;
+    struct ibv_context *verbs;
+    int ret;
+
+    ret = rdma_get_cm_event(rdma->channel, &cm_event);
+    if (ret) {
+        goto err_rdma_server_wait;
+    }
+
+    if (cm_event->event != RDMA_CM_EVENT_CONNECT_REQUEST) {
+        rdma_ack_cm_event(cm_event);
+        goto err_rdma_server_wait;
+    }
+
+    memcpy(&cap, cm_event->param.conn.private_data, sizeof(cap));
+
+    network_to_caps(&cap);
+
+    if(cap.version < RDMA_CONTROL_VERSION_MIN || 
+            cap.version > RDMA_CONTROL_VERSION_MAX) {
+            fprintf(stderr, "Unknown client RDMA version: %d, bailing...\n",
+                            cap.version);
+            goto err_rdma_server_wait;
+    }
+
+    if(cap.version == RDMA_CONTROL_VERSION_1) {
+        if(cap.flags & RDMA_CAPABILITY_CHUNK_REGISTER) {
+            printf("Enabling chunk registration\n");
+            rdma->chunk_register_destination = true;
+        } else if(cap.flags & RDMA_CAPABILITY_NEXT_FEATURE) {
+            // handle new capability
+        }
+    } else {
+        fprintf(stderr, "Unknown client RDMA version: %d, bailing...\n",
+                        cap.version);
+        goto err_rdma_server_wait;
+    }
+
+    rdma->cm_id = cm_event->id;
+    verbs = cm_event->id->verbs;
+
+    rdma_ack_cm_event(cm_event);
+
+    DPRINTF("verbs context after listen: %p\n", verbs);
+
+    if (!rdma->verbs) {
+        rdma->verbs = verbs;
+        ret = qemu_rdma_server_prepare(rdma, NULL);
+        if (ret) {
+            fprintf(stderr, "rdma migration: error preparing server!\n");
+            goto err_rdma_server_wait;
+        }
+    } else if (rdma->verbs != verbs) {
+            fprintf(stderr, "ibv context not matching %p, %p!\n",
+                    rdma->verbs, verbs);
+            goto err_rdma_server_wait;
+    }
+
+    /* xxx destroy listen_id ??? */
+
+    qemu_set_fd_handler2(rdma->channel->fd, NULL, NULL, NULL, NULL);
+
+    ret = qemu_rdma_alloc_qp(rdma);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error allocating qp!");
+        goto err_rdma_server_wait;
+    }
+
+    ret = rdma_accept(rdma->cm_id, &conn_param);
+    if (ret) {
+        fprintf(stderr, "rdma_accept returns %d!\n", ret);
+        goto err_rdma_server_wait;
+    }
+
+    ret = rdma_get_cm_event(rdma->channel, &cm_event);
+    if (ret) {
+        fprintf(stderr, "rdma_accept get_cm_event failed %d!\n", ret);
+        goto err_rdma_server_wait;
+    }
+
+    if (cm_event->event != RDMA_CM_EVENT_ESTABLISHED) {
+        fprintf(stderr, "rdma_accept not event established!\n");
+        rdma_ack_cm_event(cm_event);
+        goto err_rdma_server_wait;
+    }
+
+    rdma_ack_cm_event(cm_event);
+
+    ret = qemu_rdma_post_recv_control(rdma, 0);
+    if (ret) {
+        fprintf(stderr, "rdma migration: error posting second control recv!");
+        goto err_rdma_server_wait;
+    }
+
+    if(rdma->chunk_register_destination == false) {
+        ret = qemu_rdma_server_reg_ram_blocks(rdma, &local_ram_blocks);
+        if (ret) {
+            fprintf(stderr, "rdma migration: error server registering ram blocks!");
+            goto err_rdma_server_wait;
+        }
+    }
+
+    qemu_rdma_copy_to_remote_ram_blocks(rdma, &local_ram_blocks, &remote_ram_blocks);
+
+    ret = qemu_rdma_post_send_control(rdma, (uint8_t *) remote_ram_blocks.remote_area, &head);
+
+    if(ret < 0) {
+        fprintf(stderr, "rdma migration: error sending remote info!");
+        goto err_rdma_server_wait;
+    }
+
+    qemu_rdma_dump_gid("server_connect", rdma->cm_id);
+
+    return 0;
+
+err_rdma_server_wait:
+    qemu_rdma_cleanup(rdma);
+    return ret;
+}
+
+/*
+ * During each iteration of the migration, we listen for instructions
+ * by the primary VM to perform dynamic page registrations before they
+ * can perform RDMA operations.
+ *
+ * We respond with the 'rkey'.
+ *
+ * Keep doing this until the primary tells us to stop.
+ */
+static int qemu_rdma_registration_handle(QEMUFile *f, void *opaque, uint32_t
+flags)
+{
+    RDMAControlHeader resp = { .len = sizeof(RDMARegisterResult),
+                               .type = RDMA_CONTROL_REGISTER_RESULT,
+                               .version = RDMA_CONTROL_CURRENT_VERSION, 
+                             };
+    QEMUFileRDMA * rfile = opaque;
+    RDMAContext * rdma = rfile->rdma;
+    RDMAControlHeader head;
+    RDMARegister * reg;
+    RDMARegisterResult reg_result;
+    RDMALocalBlock *block;
+    uint64_t host_addr;
+    int ret = 0;
+    int idx = 0;
+
+    DPRINTF("Waiting for next registration %d...\n", flags);
+
+    do {
+        ret = qemu_rdma_exchange_recv(rdma, &head, RDMA_CONTROL_NONE);
+
+        if(ret < 0)
+            break;
+
+        switch(head.type) {
+            case RDMA_CONTROL_REGISTER_FINISHED:
+                DPRINTF("Current registrations complete.\n");
+                goto out;
+            case RDMA_CONTROL_REGISTER_REQUEST:
+                reg = (RDMARegister *) rdma->wr_data[idx].control_curr;
+
+                DPRINTF("Registration request: %" PRId64 
+                    " bytes, index %d, offset %" PRId64 "\n", 
+                    reg->len, reg->current_index, reg->offset);
+
+                block = &(local_ram_blocks.block[reg->current_index]);
+                host_addr = (uint64_t)(block->local_host_addr + (reg->offset - block->offset));
+                if (qemu_rdma_register_and_get_keys(rdma, block, host_addr, NULL, &reg_result.rkey)) {
+                    fprintf(stderr, "cannot get rkey!\n");
+                    ret = -EINVAL;
+                    goto out;
+                }
+
+                DPRINTF("Registerd rkey for this request: %x\n", reg_result.rkey);
+                ret = qemu_rdma_post_send_control(rdma, (uint8_t *) &reg_result, &resp);
+
+                if(ret < 0) {
+                    fprintf(stderr, "Failed to send control buffer!\n");
+                    goto out;
+                }
+                break;
+            case RDMA_CONTROL_REGISTER_RESULT:
+                fprintf(stderr, "Invalid RESULT message at server.\n");
+                ret = -EIO;
+                goto out;
+            default:
+                fprintf(stderr, "Unknown control message %s\n", control_desc[head.type]);
+                ret = -EIO;
+                goto out;
+        }
+    } while(1);
+        
+out:
+    return ret;
+}
+
+/*
+ * Inform server that dynamic registrations are done for now.
+ * First, flush writes, if any.
+ */
+static int qemu_rdma_registration_stop(QEMUFile *f, void *opaque, uint32_t flags)
+{
+    QEMUFileRDMA * rfile = opaque;
+    RDMAContext * rdma = rfile->rdma;
+    RDMAControlHeader head = { .len = 0,
+                               .type = RDMA_CONTROL_REGISTER_FINISHED,
+                               .version = RDMA_CONTROL_CURRENT_VERSION, 
+                             };
+    int ret = qemu_rdma_drain_cq(f, rdma);
+
+    if(ret >= 0) {
+        DPRINTF("Sending registration finish %d...\n", flags);
+
+        ret = qemu_rdma_exchange_send(rdma, &head, NULL, NULL, NULL);
+    }
+
+    return ret;
+}
+
+const QEMUFileOps rdma_read_ops = {
+    .get_buffer    = qemu_rdma_get_buffer,
+    .close         = qemu_rdma_close,
+    .get_fd        = qemu_rdma_get_fd,
+    .hook_ram_load = qemu_rdma_registration_handle,
+};
+
+const QEMUFileOps rdma_write_ops = {
+    .put_buffer           = qemu_rdma_put_buffer,
+    .close                = qemu_rdma_close,
+    .get_fd               = qemu_rdma_get_fd,
+    .before_ram_iterate   = qemu_rdma_registration_start,
+    .after_ram_iterate    = qemu_rdma_registration_stop,
+    .save_page            = qemu_rdma_save_page, 
+};
+
+static void *qemu_fopen_rdma(void * opaque, const char * mode)
+{
+    RDMAContext *rdma = opaque;
+    QEMUFileRDMA *r = g_malloc0(sizeof(QEMUFileRDMA));
+
+    if(qemu_file_mode_is_not_valid(mode))
+        return NULL;
+
+    r->rdma = rdma;
+
+    if (mode[0] == 'w') {
+        r->file = qemu_fopen_ops(r, &rdma_write_ops);
+    } else {
+        r->file = qemu_fopen_ops(r, &rdma_read_ops);
+    }
+
+    return r->file;
+}
+
+static void rdma_accept_incoming_migration(void *opaque)
+{
+    int ret;
+    QEMUFile *f;
+
+    DPRINTF("Accepting rdma connection...\n");
+
+    if ((ret = qemu_rdma_accept(opaque))) {
+        fprintf(stderr, "RDMA Migration initialization failed!\n");
+        goto err;
+    }
+
+    DPRINTF("Accepted migration\n");
+
+    f = qemu_fopen_rdma(opaque, "rb");
+    if (f == NULL) {
+        fprintf(stderr, "could not qemu_fopen_rdma!\n");
+        goto err;
+    }
+
+    process_incoming_migration(f);
+    return;
+
+err:
+    qemu_rdma_cleanup(opaque);
+}
+
+void rdma_start_incoming_migration(const char * host_port, Error **errp)
+{
+    int ret;
+    RDMAContext *rdma;
+
+    DPRINTF("Starting RDMA-based incoming migration\n");
+
+    if ((rdma = qemu_rdma_data_init(host_port, errp)) == NULL) {
+        return;
+    }
+
+    ret = qemu_rdma_server_init(rdma, NULL);
+
+    if (!ret) {
+        DPRINTF("qemu_rdma_server_init success\n");
+        ret = qemu_rdma_server_prepare(rdma, NULL);
+
+        if (!ret) {
+            DPRINTF("qemu_rdma_server_prepare success\n");
+
+            qemu_set_fd_handler2(rdma->channel->fd, NULL, 
+                                 rdma_accept_incoming_migration, NULL,
+                                    (void *)(intptr_t) rdma);
+            return;
+        }
+    }
+
+    g_free(rdma);
+}
+
+void rdma_start_outgoing_migration(void *opaque, const char *host_port, Error **errp)
+{
+    MigrationState *s = opaque;
+    RDMAContext *rdma = NULL;
+    int ret;
+
+    if ((rdma = qemu_rdma_data_init(host_port, errp)) == NULL)
+        return; 
+
+    ret = qemu_rdma_client_init(rdma, NULL,
+        s->enabled_capabilities[MIGRATION_CAPABILITY_CHUNK_REGISTER_DESTINATION]);
+
+    if(!ret) {
+        DPRINTF("qemu_rdma_client_init success\n");
+        ret = qemu_rdma_connect(rdma, NULL);
+
+        if(!ret) {
+            s->file = qemu_fopen_rdma(rdma, "wb");
+            DPRINTF("qemu_rdma_client_connect success\n");
+            migrate_fd_connect(s);
+            return;
+        }
+    }
+
+    g_free(rdma);
+    migrate_fd_error(s);
+}
+
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [Qemu-devel] [RFC PATCH RDMA support v1: 03/13] RDMA is enabled by default per the usual ./configure testing.
  2013-04-10 22:28 [Qemu-devel] [RFC PATCH RDMA support v7: 00/13] rdma cleanup and reordering mrhines
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 01/13] introduce qemu_ram_foreach_block() mrhines
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 02/13] Core RMDA logic mrhines
@ 2013-04-10 22:28 ` mrhines
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 04/13] update QEMUFileOps with new hooks mrhines
                   ` (10 subsequent siblings)
  13 siblings, 0 replies; 52+ messages in thread
From: mrhines @ 2013-04-10 22:28 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

Only one new file is added in the patch now (migration-rdma.c),
which is conditionalized by CONFIG_RDMA.

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 Makefile.objs |    1 +
 configure     |   29 +++++++++++++++++++++++++++++
 2 files changed, 30 insertions(+)

diff --git a/Makefile.objs b/Makefile.objs
index e568c01..10431bd 100644
--- a/Makefile.objs
+++ b/Makefile.objs
@@ -49,6 +49,7 @@ common-obj-$(CONFIG_POSIX) += os-posix.o
 common-obj-$(CONFIG_LINUX) += fsdev/
 
 common-obj-y += migration.o migration-tcp.o
+common-obj-$(CONFIG_RDMA) += migration-rdma.o
 common-obj-y += qemu-char.o #aio.o
 common-obj-y += block-migration.o
 common-obj-y += page_cache.o xbzrle.o
diff --git a/configure b/configure
index 1ed939a..8ade7ce 100755
--- a/configure
+++ b/configure
@@ -180,6 +180,7 @@ xfs=""
 
 vhost_net="no"
 kvm="no"
+rdma="yes"
 gprof="no"
 debug_tcg="no"
 debug="no"
@@ -918,6 +919,10 @@ for opt do
   ;;
   --enable-gtk) gtk="yes"
   ;;
+  --enable-rdma) rdma="yes"
+  ;;
+  --disable-rdma) rdma="no"
+  ;;
   --with-gtkabi=*) gtkabi="$optarg"
   ;;
   --enable-tpm) tpm="yes"
@@ -1122,6 +1127,8 @@ echo "  --enable-bluez           enable bluez stack connectivity"
 echo "  --disable-slirp          disable SLIRP userspace network connectivity"
 echo "  --disable-kvm            disable KVM acceleration support"
 echo "  --enable-kvm             enable KVM acceleration support"
+echo "  --disable-rdma           disable RDMA-based migration support"
+echo "  --enable-rdma            enable RDMA-based migration support"
 echo "  --enable-tcg-interpreter enable TCG with bytecode interpreter (TCI)"
 echo "  --disable-nptl           disable usermode NPTL support"
 echo "  --enable-nptl            enable usermode NPTL support"
@@ -1767,6 +1774,23 @@ EOF
   libs_softmmu="$sdl_libs $libs_softmmu"
 fi
 
+if test "$rdma" != "no" ; then
+  cat > $TMPC <<EOF
+#include <rdma/rdma_cma.h>
+int main(void) { return 0; }
+EOF
+  rdma_libs="-lrdmacm -libverbs"
+  if compile_prog "-Werror" "$rdma_libs" ; then
+    rdma="yes"
+    libs_softmmu="$libs_softmmu $rdma_libs"
+  else
+    if test "$rdma" = "yes" ; then
+      feature_not_found "rdma"
+    fi
+    rdma="no"
+  fi
+fi
+
 ##########################################
 # VNC TLS/WS detection
 if test "$vnc" = "yes" -a \( "$vnc_tls" != "no" -o "$vnc_ws" != "no" \) ; then
@@ -3408,6 +3432,7 @@ echo "Linux AIO support $linux_aio"
 echo "ATTR/XATTR support $attr"
 echo "Install blobs     $blobs"
 echo "KVM support       $kvm"
+echo "RDMA support      $rdma"
 echo "TCG interpreter   $tcg_interpreter"
 echo "fdt support       $fdt"
 echo "preadv support    $preadv"
@@ -4377,6 +4402,10 @@ if [ "$pixman" = "internal" ]; then
   echo "config-host.h: subdir-pixman" >> $config_host_mak
 fi
 
+if test "$rdma" = "yes" ; then
+echo "CONFIG_RDMA=y" >> $config_host_mak
+fi
+
 # build tree in object directory in case the source is not in the current directory
 DIRS="tests tests/tcg tests/tcg/cris tests/tcg/lm32"
 DIRS="$DIRS pc-bios/optionrom pc-bios/spapr-rtas"
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [Qemu-devel] [RFC PATCH RDMA support v1: 04/13] update QEMUFileOps with new hooks
  2013-04-10 22:28 [Qemu-devel] [RFC PATCH RDMA support v7: 00/13] rdma cleanup and reordering mrhines
                   ` (2 preceding siblings ...)
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 03/13] RDMA is enabled by default per the usual ./configure testing mrhines
@ 2013-04-10 22:28 ` mrhines
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 05/13] accessor function prototypes for new QEMUFileOps hooks mrhines
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 52+ messages in thread
From: mrhines @ 2013-04-10 22:28 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

These are just the prototypes for optional new hooks that RDMA
takes advantage of to perform dynamic page registration.

An optional hook is also introduce for a custom function
to be able to override the default save_page function.

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 include/migration/qemu-file.h |   30 ++++++++++++++++++++++++++++++
 1 file changed, 30 insertions(+)

diff --git a/include/migration/qemu-file.h b/include/migration/qemu-file.h
index 623c434..e2eca28 100644
--- a/include/migration/qemu-file.h
+++ b/include/migration/qemu-file.h
@@ -23,6 +23,7 @@
  */
 #ifndef QEMU_FILE_H
 #define QEMU_FILE_H 1
+#include "exec/cpu-common.h"
 
 /* This function writes a chunk of data to a file at the given position.
  * The pos argument can be ignored if the file is only being used for
@@ -57,12 +58,39 @@ typedef int (QEMUFileGetFD)(void *opaque);
 typedef ssize_t (QEMUFileWritevBufferFunc)(void *opaque, struct iovec *iov,
                                            int iovcnt);
 
+/*
+ * This function provides hooks around different
+ * stages of RAM migration.
+ */
+typedef int (QEMURamHookFunc)(QEMUFile *f, void *opaque, uint32_t flags);
+
+/*
+ * Constants used by QEMURamFunc.
+ */
+#define RAM_CONTROL_SETUP    0
+#define RAM_CONTROL_ROUND    1
+#define RAM_CONTROL_REGISTER 2
+#define RAM_CONTROL_FINISH   3
+
+/*
+ * This function allows override of where the RAM page
+ * is saved (such as RDMA, for example.)
+ */
+typedef size_t (QEMURamSaveFunc)(QEMUFile *f, void *opaque,
+                               ram_addr_t block_offset, 
+                               ram_addr_t offset,
+                               int cont, size_t size, bool zero);
+
 typedef struct QEMUFileOps {
     QEMUFilePutBufferFunc *put_buffer;
     QEMUFileGetBufferFunc *get_buffer;
     QEMUFileCloseFunc *close;
     QEMUFileGetFD *get_fd;
     QEMUFileWritevBufferFunc *writev_buffer;
+    QEMURamHookFunc *before_ram_iterate;
+    QEMURamHookFunc *after_ram_iterate;
+    QEMURamHookFunc *hook_ram_load;
+    QEMURamSaveFunc *save_page;
 } QEMUFileOps;
 
 QEMUFile *qemu_fopen_ops(void *opaque, const QEMUFileOps *ops);
@@ -81,6 +109,8 @@ void qemu_put_byte(QEMUFile *f, int v);
  */
 void qemu_put_buffer_async(QEMUFile *f, const uint8_t *buf, int size);
 
+bool qemu_file_mode_is_not_valid(const char * mode);
+
 static inline void qemu_put_ubyte(QEMUFile *f, unsigned int v)
 {
     qemu_put_byte(f, (int)v);
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [Qemu-devel] [RFC PATCH RDMA support v1: 05/13] accessor function prototypes for new QEMUFileOps hooks
  2013-04-10 22:28 [Qemu-devel] [RFC PATCH RDMA support v7: 00/13] rdma cleanup and reordering mrhines
                   ` (3 preceding siblings ...)
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 04/13] update QEMUFileOps with new hooks mrhines
@ 2013-04-10 22:28 ` mrhines
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 06/13] implementation of " mrhines
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 52+ messages in thread
From: mrhines @ 2013-04-10 22:28 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

These are just the protytpes of the accessor methods used by
arch_init.c which invoke functions inside savevm.c to call
out to the hooks that may (or may not) have been overridden
inside of QEMUFileOps.

The actual definitions come later in the patch series.

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 include/migration/migration.h |   23 +++++++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/include/migration/migration.h b/include/migration/migration.h
index e2acec6..a5222f5 100644
--- a/include/migration/migration.h
+++ b/include/migration/migration.h
@@ -21,6 +21,7 @@
 #include "qapi/error.h"
 #include "migration/vmstate.h"
 #include "qapi-types.h"
+#include "exec/cpu-common.h"
 
 struct MigrationParams {
     bool blk;
@@ -75,6 +76,10 @@ void fd_start_incoming_migration(const char *path, Error **errp);
 
 void fd_start_outgoing_migration(MigrationState *s, const char *fdname, Error **errp);
 
+void rdma_start_outgoing_migration(void *opaque, const char *host_port, Error **errp);
+
+void rdma_start_incoming_migration(const char * host_port, Error **errp);
+
 void migrate_fd_error(MigrationState *s);
 
 void migrate_fd_connect(MigrationState *s);
@@ -127,4 +132,22 @@ int migrate_use_xbzrle(void);
 int64_t migrate_xbzrle_cache_size(void);
 
 int64_t xbzrle_cache_resize(int64_t new_size);
+
+bool migrate_check_for_zero(void);
+bool migrate_chunk_register_destination(void);
+
+void ram_control_before_iterate(QEMUFile *f, uint32_t flags);
+void ram_control_after_iterate(QEMUFile *f, uint32_t flags);
+void ram_control_load_hook(QEMUFile *f, uint32_t flags);
+size_t ram_control_save_page(QEMUFile *f,
+                             ram_addr_t block_offset, 
+                             ram_addr_t offset, int cont, 
+                             size_t size, bool zero);
+
+/*
+ * Prototype used by both arch_init.c and migration_rdma.c
+ * because of RAM_SAVE_FLAG_HOOK
+ */
+int qemu_rdma_registration_start(QEMUFile *f, void *opaque, uint32_t flags);
+
 #endif
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [Qemu-devel] [RFC PATCH RDMA support v1: 06/13] implementation of new QEMUFileOps hooks
  2013-04-10 22:28 [Qemu-devel] [RFC PATCH RDMA support v7: 00/13] rdma cleanup and reordering mrhines
                   ` (4 preceding siblings ...)
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 05/13] accessor function prototypes for new QEMUFileOps hooks mrhines
@ 2013-04-10 22:28 ` mrhines
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 07/13] introduce capability for dynamic chunk registration mrhines
                   ` (7 subsequent siblings)
  13 siblings, 0 replies; 52+ messages in thread
From: mrhines @ 2013-04-10 22:28 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

These are the actual definitions of the accessor methods
which call out to QEMUFileOps hooks during the RAM iteration
faces. These hooks are accessed by arch_init.c,
which comes later in the patch series.

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 savevm.c |   78 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--------
 1 file changed, 69 insertions(+), 9 deletions(-)

diff --git a/savevm.c b/savevm.c
index b1d8988..0a20e65 100644
--- a/savevm.c
+++ b/savevm.c
@@ -409,16 +409,24 @@ static const QEMUFileOps socket_write_ops = {
     .close =      socket_close
 };
 
-QEMUFile *qemu_fopen_socket(int fd, const char *mode)
+bool qemu_file_mode_is_not_valid(const char * mode)
 {
-    QEMUFileSocket *s = g_malloc0(sizeof(QEMUFileSocket));
-
     if (mode == NULL ||
         (mode[0] != 'r' && mode[0] != 'w') ||
         mode[1] != 'b' || mode[2] != 0) {
         fprintf(stderr, "qemu_fopen: Argument validity check failed\n");
-        return NULL;
+        return true;
     }
+    
+    return false;
+}
+
+QEMUFile *qemu_fopen_socket(int fd, const char *mode)
+{
+    QEMUFileSocket *s = g_malloc0(sizeof(QEMUFileSocket));
+
+    if(qemu_file_mode_is_not_valid(mode))
+	return NULL;
 
     s->fd = fd;
     if (mode[0] == 'w') {
@@ -434,12 +442,8 @@ QEMUFile *qemu_fopen(const char *filename, const char *mode)
 {
     QEMUFileStdio *s;
 
-    if (mode == NULL ||
-	(mode[0] != 'r' && mode[0] != 'w') ||
-	mode[1] != 'b' || mode[2] != 0) {
-        fprintf(stderr, "qemu_fopen: Argument validity check failed\n");
+    if(qemu_file_mode_is_not_valid(mode))
         return NULL;
-    }
 
     s = g_malloc0(sizeof(QEMUFileStdio));
 
@@ -554,6 +558,62 @@ static void qemu_fflush(QEMUFile *f)
     }
 }
 
+void ram_control_before_iterate(QEMUFile *f, uint32_t flags)
+{
+    int ret = 0;
+
+    if (f->ops->before_ram_iterate) {
+        qemu_fflush(f);
+        ret = f->ops->before_ram_iterate(f, f->opaque, flags);
+        if (ret < 0)
+            qemu_file_set_error(f, ret);
+    }
+}
+
+void ram_control_after_iterate(QEMUFile *f, uint32_t flags)
+{
+    int ret = 0;
+
+    if (f->ops->after_ram_iterate) {
+        qemu_fflush(f);
+        ret = f->ops->after_ram_iterate(f, f->opaque, flags);
+        if (ret < 0)
+            qemu_file_set_error(f, ret);
+    }
+}
+
+void ram_control_load_hook(QEMUFile *f, uint32_t flags)
+{
+    int ret = 0;
+
+    if (f->ops->hook_ram_load) {
+        qemu_fflush(f);
+        ret = f->ops->hook_ram_load(f, f->opaque, flags);
+        if (ret < 0)
+            qemu_file_set_error(f, ret);
+    }
+}
+
+size_t ram_control_save_page(QEMUFile *f, ram_addr_t block_offset, 
+                                    ram_addr_t offset, int cont, 
+                                    size_t size, bool zero)
+{
+    if (f->ops->save_page) {
+        size_t bytes;
+
+        qemu_fflush(f);
+
+        bytes = f->ops->save_page(f, f->opaque, block_offset, offset, cont, size, zero);
+
+        if (bytes > 0)
+            f->pos += bytes;
+
+        return bytes;
+    }
+
+    return -ENOTSUP;
+}
+
 static void qemu_fill_buffer(QEMUFile *f)
 {
     int len;
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [Qemu-devel] [RFC PATCH RDMA support v1: 07/13] introduce capability for dynamic chunk registration
  2013-04-10 22:28 [Qemu-devel] [RFC PATCH RDMA support v7: 00/13] rdma cleanup and reordering mrhines
                   ` (5 preceding siblings ...)
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 06/13] implementation of " mrhines
@ 2013-04-10 22:28 ` mrhines
  2013-04-11  2:24   ` Eric Blake
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 08/13] default chunk registration to true mrhines
                   ` (6 subsequent siblings)
  13 siblings, 1 reply; 52+ messages in thread
From: mrhines @ 2013-04-10 22:28 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

This capability allows you to disable dynamic chunk registration
for better throughput on high-performance links.

It is enabled by default.

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 migration.c      |    9 +++++++++
 qapi-schema.json |    2 +-
 2 files changed, 10 insertions(+), 1 deletion(-)

diff --git a/migration.c b/migration.c
index 3439629..404c19a 100644
--- a/migration.c
+++ b/migration.c
@@ -477,6 +477,15 @@ void qmp_migrate_set_downtime(double value, Error **errp)
     max_downtime = (uint64_t)value;
 }
 
+bool migrate_chunk_register_destination(void)
+{
+    MigrationState *s;
+
+    s = migrate_get_current();
+
+    return s->enabled_capabilities[MIGRATION_CAPABILITY_CHUNK_REGISTER_DESTINATION];
+}
+
 int migrate_use_xbzrle(void)
 {
     MigrationState *s;
diff --git a/qapi-schema.json b/qapi-schema.json
index db542f6..7fe7e5c 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -602,7 +602,7 @@
 # Since: 1.2
 ##
 { 'enum': 'MigrationCapability',
-  'data': ['xbzrle'] }
+  'data': ['xbzrle', 'chunk_register_destination'] }
 
 ##
 # @MigrationCapabilityStatus
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 07/13] introduce capability for dynamic chunk registration
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 07/13] introduce capability for dynamic chunk registration mrhines
@ 2013-04-11  2:24   ` Eric Blake
  2013-04-11  2:39     ` Michael R. Hines
  0 siblings, 1 reply; 52+ messages in thread
From: Eric Blake @ 2013-04-11  2:24 UTC (permalink / raw)
  To: mrhines
  Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul,
	pbonzini

[-- Attachment #1: Type: text/plain, Size: 918 bytes --]

On 04/10/2013 04:28 PM, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> This capability allows you to disable dynamic chunk registration
> for better throughput on high-performance links.
> 
> It is enabled by default.

Actually, it isn't enabled until 8/13 - I'd squash 7 and 8 together, to
make this statement true.

> +++ b/qapi-schema.json
> @@ -602,7 +602,7 @@
>  # Since: 1.2

Missing documentation of the new capability.  Should look something like:

# @chunk-register-destination: Migration does XYZ on the destination.
#                              Enabled by default (since 1.5)

>  ##
>  { 'enum': 'MigrationCapability',
> -  'data': ['xbzrle'] }
> +  'data': ['xbzrle', 'chunk_register_destination'] }

QMP prefers '-' over '_'/

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 621 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 07/13] introduce capability for dynamic chunk registration
  2013-04-11  2:24   ` Eric Blake
@ 2013-04-11  2:39     ` Michael R. Hines
  0 siblings, 0 replies; 52+ messages in thread
From: Michael R. Hines @ 2013-04-11  2:39 UTC (permalink / raw)
  To: Eric Blake
  Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul,
	pbonzini

Acknowledged.

On 04/10/2013 10:24 PM, Eric Blake wrote:
> On 04/10/2013 04:28 PM, mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> This capability allows you to disable dynamic chunk registration
>> for better throughput on high-performance links.
>>
>> It is enabled by default.
> Actually, it isn't enabled until 8/13 - I'd squash 7 and 8 together, to
> make this statement true.
>
>> +++ b/qapi-schema.json
>> @@ -602,7 +602,7 @@
>>   # Since: 1.2
> Missing documentation of the new capability.  Should look something like:
>
> # @chunk-register-destination: Migration does XYZ on the destination.
> #                              Enabled by default (since 1.5)
>
>>   ##
>>   { 'enum': 'MigrationCapability',
>> -  'data': ['xbzrle'] }
>> +  'data': ['xbzrle', 'chunk_register_destination'] }
> QMP prefers '-' over '_'/
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [Qemu-devel] [RFC PATCH RDMA support v1: 08/13] default chunk registration to true
  2013-04-10 22:28 [Qemu-devel] [RFC PATCH RDMA support v7: 00/13] rdma cleanup and reordering mrhines
                   ` (6 preceding siblings ...)
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 07/13] introduce capability for dynamic chunk registration mrhines
@ 2013-04-10 22:28 ` mrhines
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 09/13] parse QMP string for new 'rdma' protocol mrhines
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 52+ messages in thread
From: mrhines @ 2013-04-10 22:28 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

Just enable it by default. User can now disable if they want to.

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 migration.c |    1 +
 1 file changed, 1 insertion(+)

diff --git a/migration.c b/migration.c
index 404c19a..41cf5ba 100644
--- a/migration.c
+++ b/migration.c
@@ -69,6 +69,7 @@ MigrationState *migrate_get_current(void)
         .state = MIG_STATE_SETUP,
         .bandwidth_limit = MAX_THROTTLE,
         .xbzrle_cache_size = DEFAULT_MIGRATE_CACHE_SIZE,
+        .enabled_capabilities[MIGRATION_CAPABILITY_CHUNK_REGISTER_DESTINATION] = true,
     };
 
     return &current_migration;
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [Qemu-devel] [RFC PATCH RDMA support v1: 09/13] parse QMP string for new 'rdma' protocol
  2013-04-10 22:28 [Qemu-devel] [RFC PATCH RDMA support v7: 00/13] rdma cleanup and reordering mrhines
                   ` (7 preceding siblings ...)
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 08/13] default chunk registration to true mrhines
@ 2013-04-10 22:28 ` mrhines
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero mrhines
                   ` (4 subsequent siblings)
  13 siblings, 0 replies; 52+ messages in thread
From: mrhines @ 2013-04-10 22:28 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

This parse the QMP string for the new 'rdma' protocol
and calls out to the appropriate funtions to initiate
the connection before the migration starts.

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 migration.c |    9 ++++++++-
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/migration.c b/migration.c
index 41cf5ba..a2fcacf 100644
--- a/migration.c
+++ b/migration.c
@@ -81,6 +81,10 @@ void qemu_start_incoming_migration(const char *uri, Error **errp)
 
     if (strstart(uri, "tcp:", &p))
         tcp_start_incoming_migration(p, errp);
+#ifdef CONFIG_RDMA
+    else if (strstart(uri, "rdma:", &p))
+        rdma_start_incoming_migration(p, errp);
+#endif
 #if !defined(WIN32)
     else if (strstart(uri, "exec:", &p))
         exec_start_incoming_migration(p, errp);
@@ -124,7 +128,6 @@ void process_incoming_migration(QEMUFile *f)
     Coroutine *co = qemu_coroutine_create(process_incoming_migration_co);
     int fd = qemu_get_fd(f);
 
-    assert(fd != -1);
     qemu_set_nonblock(fd);
     qemu_coroutine_enter(co, f);
 }
@@ -409,6 +412,10 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
 
     if (strstart(uri, "tcp:", &p)) {
         tcp_start_outgoing_migration(s, p, &local_err);
+#ifdef CONFIG_RDMA
+    } else if (strstart(uri, "rdma:", &p)) {
+        rdma_start_outgoing_migration(s, p, &local_err);
+#endif
 #if !defined(WIN32)
     } else if (strstart(uri, "exec:", &p)) {
         exec_start_outgoing_migration(s, p, &local_err);
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-10 22:28 [Qemu-devel] [RFC PATCH RDMA support v7: 00/13] rdma cleanup and reordering mrhines
                   ` (8 preceding siblings ...)
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 09/13] parse QMP string for new 'rdma' protocol mrhines
@ 2013-04-10 22:28 ` mrhines
  2013-04-11  2:26   ` Eric Blake
  2013-04-11  7:38   ` Michael S. Tsirkin
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 11/13] send pc.ram over RDMA mrhines
                   ` (3 subsequent siblings)
  13 siblings, 2 replies; 52+ messages in thread
From: mrhines @ 2013-04-10 22:28 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

This allows the user to disable zero page checking during migration

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 hmp-commands.hx  |   14 ++++++++++++++
 hmp.c            |    6 ++++++
 hmp.h            |    1 +
 migration.c      |   12 ++++++++++++
 qapi-schema.json |   13 +++++++++++++
 qmp-commands.hx  |   23 +++++++++++++++++++++++
 6 files changed, 69 insertions(+)

diff --git a/hmp-commands.hx b/hmp-commands.hx
index 3d98604..b593095 100644
--- a/hmp-commands.hx
+++ b/hmp-commands.hx
@@ -962,6 +962,20 @@ Set maximum tolerated downtime (in seconds) for migration.
 ETEXI
 
     {
+        .name       = "migrate_check_for_zero",
+        .args_type  = "value:b",
+        .params     = "value",
+        .help       = "Control whether or not to check for zero pages",
+        .mhandler.cmd = hmp_migrate_check_for_zero,
+    },
+
+STEXI
+@item migrate_check_for_zero @var{value}
+@findex migrate_check_for_zero
+Control whether or not to check for zero pages.
+ETEXI
+
+    {
         .name       = "migrate_set_capability",
         .args_type  = "capability:s,state:b",
         .params     = "capability state",
diff --git a/hmp.c b/hmp.c
index dbe9b90..68ba93a 100644
--- a/hmp.c
+++ b/hmp.c
@@ -909,6 +909,12 @@ void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict)
     qmp_migrate_set_downtime(value, NULL);
 }
 
+void hmp_migrate_check_for_zero(Monitor *mon, const QDict *qdict)
+{
+    bool value = qdict_get_bool(qdict, "value");
+    qmp_migrate_check_for_zero(value, NULL);
+}
+
 void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict)
 {
     int64_t value = qdict_get_int(qdict, "value");
diff --git a/hmp.h b/hmp.h
index 80e8b41..a6595da 100644
--- a/hmp.h
+++ b/hmp.h
@@ -58,6 +58,7 @@ void hmp_snapshot_blkdev(Monitor *mon, const QDict *qdict);
 void hmp_drive_mirror(Monitor *mon, const QDict *qdict);
 void hmp_migrate_cancel(Monitor *mon, const QDict *qdict);
 void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict);
+void hmp_migrate_check_for_zero(Monitor *mon, const QDict *qdict);
 void hmp_migrate_set_speed(Monitor *mon, const QDict *qdict);
 void hmp_migrate_set_capability(Monitor *mon, const QDict *qdict);
 void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict);
diff --git a/migration.c b/migration.c
index a2fcacf..9072479 100644
--- a/migration.c
+++ b/migration.c
@@ -485,6 +485,18 @@ void qmp_migrate_set_downtime(double value, Error **errp)
     max_downtime = (uint64_t)value;
 }
 
+static bool check_for_zero = true;
+
+void qmp_migrate_check_for_zero(bool value, Error **errp)
+{
+    check_for_zero = value;
+}
+
+bool migrate_check_for_zero(void)
+{
+    return check_for_zero;
+}
+
 bool migrate_chunk_register_destination(void)
 {
     MigrationState *s;
diff --git a/qapi-schema.json b/qapi-schema.json
index 7fe7e5c..1ca939f 100644
--- a/qapi-schema.json
+++ b/qapi-schema.json
@@ -1792,6 +1792,19 @@
 { 'command': 'migrate_set_downtime', 'data': {'value': 'number'} }
 
 ##
+# @migrate_check_for_zero
+#
+# Control whether or not to check for zero pages during migration.
+#
+# @value: on|off 
+#
+# Returns: nothing on success
+#
+# Since: 1.5.0
+##
+{ 'command': 'migrate_check_for_zero', 'data': {'value': 'bool'} }
+
+##
 # @migrate_set_speed
 #
 # Set maximum speed for migration.
diff --git a/qmp-commands.hx b/qmp-commands.hx
index 1e0e11e..78cda67 100644
--- a/qmp-commands.hx
+++ b/qmp-commands.hx
@@ -750,6 +750,29 @@ Example:
 EQMP
 
     {
+        .name       = "migrate_check_for_zero",
+        .args_type  = "value:b",
+        .mhandler.cmd_new = qmp_marshal_input_migrate_check_for_zero,
+    },
+
+SQMP
+migrate_check_for_zero
+----------------------
+
+Control whether or not to check for zero pages.
+
+Arguments:
+
+- "value": true or false (json-bool) 
+
+Example:
+
+-> { "execute": "migrate_check_for_zero", "arguments": { "value": true } }
+<- { "return": {} }
+
+EQMP
+
+    {
         .name       = "client_migrate_info",
         .args_type  = "protocol:s,hostname:s,port:i?,tls-port:i?,cert-subject:s?",
         .params     = "protocol hostname port tls-port cert-subject",
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero mrhines
@ 2013-04-11  2:26   ` Eric Blake
  2013-04-11  2:39     ` Michael R. Hines
  2013-04-11  3:11     ` Michael R. Hines
  2013-04-11  7:38   ` Michael S. Tsirkin
  1 sibling, 2 replies; 52+ messages in thread
From: Eric Blake @ 2013-04-11  2:26 UTC (permalink / raw)
  To: mrhines
  Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul,
	pbonzini

[-- Attachment #1: Type: text/plain, Size: 1094 bytes --]

On 04/10/2013 04:28 PM, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> This allows the user to disable zero page checking during migration
> 
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---

> +++ b/qapi-schema.json
> @@ -1792,6 +1792,19 @@
>  { 'command': 'migrate_set_downtime', 'data': {'value': 'number'} }
>  
>  ##
> +# @migrate_check_for_zero
> +#
> +# Control whether or not to check for zero pages during migration.

New QMP commands should be named with '-' rather than '_', as in
'migrate-check-for-zero'.

Why do we need a new command, instead of adding a new capability to the
already-existing capability command?

> +#
> +# @value: on|off 
> +#
> +# Returns: nothing on success
> +#
> +# Since: 1.5.0
> +##
> +{ 'command': 'migrate_check_for_zero', 'data': {'value': 'bool'} }

You can set the capability, but how do you query its current setting?  I
dislike write-only interfaces.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 621 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-11  2:26   ` Eric Blake
@ 2013-04-11  2:39     ` Michael R. Hines
  2013-04-11  7:52       ` Orit Wasserman
  2013-04-11  3:11     ` Michael R. Hines
  1 sibling, 1 reply; 52+ messages in thread
From: Michael R. Hines @ 2013-04-11  2:39 UTC (permalink / raw)
  To: Eric Blake
  Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul,
	pbonzini

On 04/10/2013 10:26 PM, Eric Blake wrote:
>
> New QMP commands should be named with '-' rather than '_', as in
> 'migrate-check-for-zero'.
>
> Why do we need a new command, instead of adding a new capability to the
> already-existing capability command?
>

Orit told me to convert the capability to a command =)
(It was originally a capability)

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-11  2:39     ` Michael R. Hines
@ 2013-04-11  7:52       ` Orit Wasserman
  2013-04-11 12:30         ` Eric Blake
  0 siblings, 1 reply; 52+ messages in thread
From: Orit Wasserman @ 2013-04-11  7:52 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, mst, qemu-devel, abali, mrhines, gokul, pbonzini

On 04/11/2013 05:39 AM, Michael R. Hines wrote:
> On 04/10/2013 10:26 PM, Eric Blake wrote:
>>
>> New QMP commands should be named with '-' rather than '_', as in
>> 'migrate-check-for-zero'.
>>
>> Why do we need a new command, instead of adding a new capability to the
>> already-existing capability command?
>>
> 
> Orit told me to convert the capability to a command =)
> (It was originally a capability)
> 
> 
I prefer it a command because it is not related directly to RDMA I can
see it used in regular live migration too.

Orit

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-11  7:52       ` Orit Wasserman
@ 2013-04-11 12:30         ` Eric Blake
  2013-04-11 12:36           ` Orit Wasserman
  0 siblings, 1 reply; 52+ messages in thread
From: Eric Blake @ 2013-04-11 12:30 UTC (permalink / raw)
  To: Orit Wasserman
  Cc: aliguori, mst, qemu-devel, Michael R. Hines, abali, mrhines,
	gokul, pbonzini

[-- Attachment #1: Type: text/plain, Size: 892 bytes --]

On 04/11/2013 01:52 AM, Orit Wasserman wrote:
> On 04/11/2013 05:39 AM, Michael R. Hines wrote:
>> On 04/10/2013 10:26 PM, Eric Blake wrote:
>>>
>>> New QMP commands should be named with '-' rather than '_', as in
>>> 'migrate-check-for-zero'.
>>>
>>> Why do we need a new command, instead of adding a new capability to the
>>> already-existing capability command?
>>>
>>
>> Orit told me to convert the capability to a command =)
>> (It was originally a capability)
>>
>>
> I prefer it a command because it is not related directly to RDMA I can
> see it used in regular live migration too.

But how is a new command any different than a new capability?  Both can
be used in regular live migration, and for all intents and purposes, it
feels like a capability.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 621 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-11 12:30         ` Eric Blake
@ 2013-04-11 12:36           ` Orit Wasserman
  2013-04-11 17:53             ` Michael R. Hines
  0 siblings, 1 reply; 52+ messages in thread
From: Orit Wasserman @ 2013-04-11 12:36 UTC (permalink / raw)
  To: Eric Blake
  Cc: aliguori, mst, qemu-devel, Michael R. Hines, abali, mrhines,
	gokul, pbonzini

On 04/11/2013 03:30 PM, Eric Blake wrote:
> On 04/11/2013 01:52 AM, Orit Wasserman wrote:
>> On 04/11/2013 05:39 AM, Michael R. Hines wrote:
>>> On 04/10/2013 10:26 PM, Eric Blake wrote:
>>>>
>>>> New QMP commands should be named with '-' rather than '_', as in
>>>> 'migrate-check-for-zero'.
>>>>
>>>> Why do we need a new command, instead of adding a new capability to the
>>>> already-existing capability command?
>>>>
>>>
>>> Orit told me to convert the capability to a command =)
>>> (It was originally a capability)
>>>
>>>
>> I prefer it a command because it is not related directly to RDMA I can
>> see it used in regular live migration too.
> 
> But how is a new command any different than a new capability?  Both can
> be used in regular live migration, and for all intents and purposes, it
> feels like a capability.
> 
It has no meaning for incoming migration only for outgoing.
Anyway Paolo think it should not be needed so this patch will be removed.

Orit

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-11 12:36           ` Orit Wasserman
@ 2013-04-11 17:53             ` Michael R. Hines
  0 siblings, 0 replies; 52+ messages in thread
From: Michael R. Hines @ 2013-04-11 17:53 UTC (permalink / raw)
  To: Orit Wasserman; +Cc: aliguori, mst, qemu-devel, abali, mrhines, gokul, pbonzini

On 04/11/2013 08:36 AM, Orit Wasserman wrote:
> On 04/11/2013 03:30 PM, Eric Blake wrote:
>> On 04/11/2013 01:52 AM, Orit Wasserman wrote:
>>> On 04/11/2013 05:39 AM, Michael R. Hines wrote:
>>>> On 04/10/2013 10:26 PM, Eric Blake wrote:
>>>>> New QMP commands should be named with '-' rather than '_', as in
>>>>> 'migrate-check-for-zero'.
>>>>>
>>>>> Why do we need a new command, instead of adding a new capability to the
>>>>> already-existing capability command?
>>>>>
>>>> Orit told me to convert the capability to a command =)
>>>> (It was originally a capability)
>>>>
>>>>
>>> I prefer it a command because it is not related directly to RDMA I can
>>> see it used in regular live migration too.
>> But how is a new command any different than a new capability?  Both can
>> be used in regular live migration, and for all intents and purposes, it
>> feels like a capability.
>>
> It has no meaning for incoming migration only for outgoing.
> Anyway Paolo think it should not be needed so this patch will be removed.
>
> Orit
>

Yes, I will delete the command altogether.

- Michael

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-11  2:26   ` Eric Blake
  2013-04-11  2:39     ` Michael R. Hines
@ 2013-04-11  3:11     ` Michael R. Hines
  1 sibling, 0 replies; 52+ messages in thread
From: Michael R. Hines @ 2013-04-11  3:11 UTC (permalink / raw)
  To: Eric Blake
  Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul,
	pbonzini

On 04/10/2013 10:26 PM, Eric Blake wrote:
>
>> +#
>> +# @value: on|off
>> +#
>> +# Returns: nothing on success
>> +#
>> +# Since: 1.5.0
>> +##
>> +{ 'command': 'migrate_check_for_zero', 'data': {'value': 'bool'} }
> You can set the capability, but how do you query its current setting?  I
> dislike write-only interfaces.
>

I will add a patch to update the "query-migrate" QMP command to reflect 
the current state of the option.

- Michael

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero mrhines
  2013-04-11  2:26   ` Eric Blake
@ 2013-04-11  7:38   ` Michael S. Tsirkin
  2013-04-11  9:18     ` Paolo Bonzini
  1 sibling, 1 reply; 52+ messages in thread
From: Michael S. Tsirkin @ 2013-04-11  7:38 UTC (permalink / raw)
  To: mrhines; +Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini

On Wed, Apr 10, 2013 at 06:28:18PM -0400, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> This allows the user to disable zero page checking during migration
> 
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>

IMO this knob is too low level to expose to management.
Why not disable this automatically when migrating with rdma?

> ---
>  hmp-commands.hx  |   14 ++++++++++++++
>  hmp.c            |    6 ++++++
>  hmp.h            |    1 +
>  migration.c      |   12 ++++++++++++
>  qapi-schema.json |   13 +++++++++++++
>  qmp-commands.hx  |   23 +++++++++++++++++++++++
>  6 files changed, 69 insertions(+)
> 
> diff --git a/hmp-commands.hx b/hmp-commands.hx
> index 3d98604..b593095 100644
> --- a/hmp-commands.hx
> +++ b/hmp-commands.hx
> @@ -962,6 +962,20 @@ Set maximum tolerated downtime (in seconds) for migration.
>  ETEXI
>  
>      {
> +        .name       = "migrate_check_for_zero",
> +        .args_type  = "value:b",
> +        .params     = "value",
> +        .help       = "Control whether or not to check for zero pages",
> +        .mhandler.cmd = hmp_migrate_check_for_zero,
> +    },
> +
> +STEXI
> +@item migrate_check_for_zero @var{value}
> +@findex migrate_check_for_zero
> +Control whether or not to check for zero pages.
> +ETEXI
> +
> +    {
>          .name       = "migrate_set_capability",
>          .args_type  = "capability:s,state:b",
>          .params     = "capability state",
> diff --git a/hmp.c b/hmp.c
> index dbe9b90..68ba93a 100644
> --- a/hmp.c
> +++ b/hmp.c
> @@ -909,6 +909,12 @@ void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict)
>      qmp_migrate_set_downtime(value, NULL);
>  }
>  
> +void hmp_migrate_check_for_zero(Monitor *mon, const QDict *qdict)
> +{
> +    bool value = qdict_get_bool(qdict, "value");
> +    qmp_migrate_check_for_zero(value, NULL);
> +}
> +
>  void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict)
>  {
>      int64_t value = qdict_get_int(qdict, "value");
> diff --git a/hmp.h b/hmp.h
> index 80e8b41..a6595da 100644
> --- a/hmp.h
> +++ b/hmp.h
> @@ -58,6 +58,7 @@ void hmp_snapshot_blkdev(Monitor *mon, const QDict *qdict);
>  void hmp_drive_mirror(Monitor *mon, const QDict *qdict);
>  void hmp_migrate_cancel(Monitor *mon, const QDict *qdict);
>  void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict);
> +void hmp_migrate_check_for_zero(Monitor *mon, const QDict *qdict);
>  void hmp_migrate_set_speed(Monitor *mon, const QDict *qdict);
>  void hmp_migrate_set_capability(Monitor *mon, const QDict *qdict);
>  void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict);
> diff --git a/migration.c b/migration.c
> index a2fcacf..9072479 100644
> --- a/migration.c
> +++ b/migration.c
> @@ -485,6 +485,18 @@ void qmp_migrate_set_downtime(double value, Error **errp)
>      max_downtime = (uint64_t)value;
>  }
>  
> +static bool check_for_zero = true;
> +
> +void qmp_migrate_check_for_zero(bool value, Error **errp)
> +{
> +    check_for_zero = value;
> +}
> +
> +bool migrate_check_for_zero(void)
> +{
> +    return check_for_zero;
> +}
> +
>  bool migrate_chunk_register_destination(void)
>  {
>      MigrationState *s;
> diff --git a/qapi-schema.json b/qapi-schema.json
> index 7fe7e5c..1ca939f 100644
> --- a/qapi-schema.json
> +++ b/qapi-schema.json
> @@ -1792,6 +1792,19 @@
>  { 'command': 'migrate_set_downtime', 'data': {'value': 'number'} }
>  
>  ##
> +# @migrate_check_for_zero
> +#
> +# Control whether or not to check for zero pages during migration.
> +#
> +# @value: on|off 
> +#
> +# Returns: nothing on success
> +#
> +# Since: 1.5.0
> +##
> +{ 'command': 'migrate_check_for_zero', 'data': {'value': 'bool'} }
> +
> +##
>  # @migrate_set_speed
>  #
>  # Set maximum speed for migration.
> diff --git a/qmp-commands.hx b/qmp-commands.hx
> index 1e0e11e..78cda67 100644
> --- a/qmp-commands.hx
> +++ b/qmp-commands.hx
> @@ -750,6 +750,29 @@ Example:
>  EQMP
>  
>      {
> +        .name       = "migrate_check_for_zero",
> +        .args_type  = "value:b",
> +        .mhandler.cmd_new = qmp_marshal_input_migrate_check_for_zero,
> +    },
> +
> +SQMP
> +migrate_check_for_zero
> +----------------------
> +
> +Control whether or not to check for zero pages.
> +
> +Arguments:
> +
> +- "value": true or false (json-bool) 
> +
> +Example:
> +
> +-> { "execute": "migrate_check_for_zero", "arguments": { "value": true } }
> +<- { "return": {} }
> +
> +EQMP
> +
> +    {
>          .name       = "client_migrate_info",
>          .args_type  = "protocol:s,hostname:s,port:i?,tls-port:i?,cert-subject:s?",
>          .params     = "protocol hostname port tls-port cert-subject",
> -- 
> 1.7.10.4

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-11  7:38   ` Michael S. Tsirkin
@ 2013-04-11  9:18     ` Paolo Bonzini
  2013-04-11 11:13       ` Michael S. Tsirkin
  2013-04-11 13:24       ` Michael R. Hines
  0 siblings, 2 replies; 52+ messages in thread
From: Paolo Bonzini @ 2013-04-11  9:18 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, mrhines, owasserm, abali, mrhines, gokul

Il 11/04/2013 09:38, Michael S. Tsirkin ha scritto:
> On Wed, Apr 10, 2013 at 06:28:18PM -0400, mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> This allows the user to disable zero page checking during migration
>>
>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> 
> IMO this knob is too low level to expose to management.
> Why not disable this automatically when migrating with rdma?

Thinking more about it, I'm not sure why it is important to disable it.

As observed earlier:

1) non-zero pages typically have a non-zero word in the first 32 bytes,
as measured by Peter Lieven, so the cost of is_dup_page can be ignored
for non-zero pages.

2) all-zero pages typically change little, so they are rare after the
bulk phase where all memory is sent once to the destination.

Hence, the cost of is_dup_page can be ignored after the bulk phase.  In
the bulk phase, checking for zero pages it may be expensive and lower
throughput, sure, but what matters for convergence is throughput and
latency _after_ the bulk phase.

At least this is the theory.  mrhines, what testcase were you using?  If
it is an idle guest, it is not a realistic one and the decreased
latency/throughput does not really matter.

Paolo

> 
>> ---
>>  hmp-commands.hx  |   14 ++++++++++++++
>>  hmp.c            |    6 ++++++
>>  hmp.h            |    1 +
>>  migration.c      |   12 ++++++++++++
>>  qapi-schema.json |   13 +++++++++++++
>>  qmp-commands.hx  |   23 +++++++++++++++++++++++
>>  6 files changed, 69 insertions(+)
>>
>> diff --git a/hmp-commands.hx b/hmp-commands.hx
>> index 3d98604..b593095 100644
>> --- a/hmp-commands.hx
>> +++ b/hmp-commands.hx
>> @@ -962,6 +962,20 @@ Set maximum tolerated downtime (in seconds) for migration.
>>  ETEXI
>>  
>>      {
>> +        .name       = "migrate_check_for_zero",
>> +        .args_type  = "value:b",
>> +        .params     = "value",
>> +        .help       = "Control whether or not to check for zero pages",
>> +        .mhandler.cmd = hmp_migrate_check_for_zero,
>> +    },
>> +
>> +STEXI
>> +@item migrate_check_for_zero @var{value}
>> +@findex migrate_check_for_zero
>> +Control whether or not to check for zero pages.
>> +ETEXI
>> +
>> +    {
>>          .name       = "migrate_set_capability",
>>          .args_type  = "capability:s,state:b",
>>          .params     = "capability state",
>> diff --git a/hmp.c b/hmp.c
>> index dbe9b90..68ba93a 100644
>> --- a/hmp.c
>> +++ b/hmp.c
>> @@ -909,6 +909,12 @@ void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict)
>>      qmp_migrate_set_downtime(value, NULL);
>>  }
>>  
>> +void hmp_migrate_check_for_zero(Monitor *mon, const QDict *qdict)
>> +{
>> +    bool value = qdict_get_bool(qdict, "value");
>> +    qmp_migrate_check_for_zero(value, NULL);
>> +}
>> +
>>  void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict)
>>  {
>>      int64_t value = qdict_get_int(qdict, "value");
>> diff --git a/hmp.h b/hmp.h
>> index 80e8b41..a6595da 100644
>> --- a/hmp.h
>> +++ b/hmp.h
>> @@ -58,6 +58,7 @@ void hmp_snapshot_blkdev(Monitor *mon, const QDict *qdict);
>>  void hmp_drive_mirror(Monitor *mon, const QDict *qdict);
>>  void hmp_migrate_cancel(Monitor *mon, const QDict *qdict);
>>  void hmp_migrate_set_downtime(Monitor *mon, const QDict *qdict);
>> +void hmp_migrate_check_for_zero(Monitor *mon, const QDict *qdict);
>>  void hmp_migrate_set_speed(Monitor *mon, const QDict *qdict);
>>  void hmp_migrate_set_capability(Monitor *mon, const QDict *qdict);
>>  void hmp_migrate_set_cache_size(Monitor *mon, const QDict *qdict);
>> diff --git a/migration.c b/migration.c
>> index a2fcacf..9072479 100644
>> --- a/migration.c
>> +++ b/migration.c
>> @@ -485,6 +485,18 @@ void qmp_migrate_set_downtime(double value, Error **errp)
>>      max_downtime = (uint64_t)value;
>>  }
>>  
>> +static bool check_for_zero = true;
>> +
>> +void qmp_migrate_check_for_zero(bool value, Error **errp)
>> +{
>> +    check_for_zero = value;
>> +}
>> +
>> +bool migrate_check_for_zero(void)
>> +{
>> +    return check_for_zero;
>> +}
>> +
>>  bool migrate_chunk_register_destination(void)
>>  {
>>      MigrationState *s;
>> diff --git a/qapi-schema.json b/qapi-schema.json
>> index 7fe7e5c..1ca939f 100644
>> --- a/qapi-schema.json
>> +++ b/qapi-schema.json
>> @@ -1792,6 +1792,19 @@
>>  { 'command': 'migrate_set_downtime', 'data': {'value': 'number'} }
>>  
>>  ##
>> +# @migrate_check_for_zero
>> +#
>> +# Control whether or not to check for zero pages during migration.
>> +#
>> +# @value: on|off 
>> +#
>> +# Returns: nothing on success
>> +#
>> +# Since: 1.5.0
>> +##
>> +{ 'command': 'migrate_check_for_zero', 'data': {'value': 'bool'} }
>> +
>> +##
>>  # @migrate_set_speed
>>  #
>>  # Set maximum speed for migration.
>> diff --git a/qmp-commands.hx b/qmp-commands.hx
>> index 1e0e11e..78cda67 100644
>> --- a/qmp-commands.hx
>> +++ b/qmp-commands.hx
>> @@ -750,6 +750,29 @@ Example:
>>  EQMP
>>  
>>      {
>> +        .name       = "migrate_check_for_zero",
>> +        .args_type  = "value:b",
>> +        .mhandler.cmd_new = qmp_marshal_input_migrate_check_for_zero,
>> +    },
>> +
>> +SQMP
>> +migrate_check_for_zero
>> +----------------------
>> +
>> +Control whether or not to check for zero pages.
>> +
>> +Arguments:
>> +
>> +- "value": true or false (json-bool) 
>> +
>> +Example:
>> +
>> +-> { "execute": "migrate_check_for_zero", "arguments": { "value": true } }
>> +<- { "return": {} }
>> +
>> +EQMP
>> +
>> +    {
>>          .name       = "client_migrate_info",
>>          .args_type  = "protocol:s,hostname:s,port:i?,tls-port:i?,cert-subject:s?",
>>          .params     = "protocol hostname port tls-port cert-subject",
>> -- 
>> 1.7.10.4

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-11  9:18     ` Paolo Bonzini
@ 2013-04-11 11:13       ` Michael S. Tsirkin
  2013-04-11 13:19         ` Michael R. Hines
  2013-04-11 13:24       ` Michael R. Hines
  1 sibling, 1 reply; 52+ messages in thread
From: Michael S. Tsirkin @ 2013-04-11 11:13 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aliguori, qemu-devel, mrhines, owasserm, abali, mrhines, gokul

On Thu, Apr 11, 2013 at 11:18:38AM +0200, Paolo Bonzini wrote:
> Il 11/04/2013 09:38, Michael S. Tsirkin ha scritto:
> > On Wed, Apr 10, 2013 at 06:28:18PM -0400, mrhines@linux.vnet.ibm.com wrote:
> >> From: "Michael R. Hines" <mrhines@us.ibm.com>
> >>
> >> This allows the user to disable zero page checking during migration
> >>
> >> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> > 
> > IMO this knob is too low level to expose to management.
> > Why not disable this automatically when migrating with rdma?
> 
> Thinking more about it, I'm not sure why it is important to disable it.

This just illustrates the point. There's no place for such low level
knobs in the management interface.



-- 
MST

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-11 11:13       ` Michael S. Tsirkin
@ 2013-04-11 13:19         ` Michael R. Hines
  2013-04-11 13:51           ` Michael S. Tsirkin
  0 siblings, 1 reply; 52+ messages in thread
From: Michael R. Hines @ 2013-04-11 13:19 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul,
	Paolo Bonzini

On 04/11/2013 07:13 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 11, 2013 at 11:18:38AM +0200, Paolo Bonzini wrote:
>> Il 11/04/2013 09:38, Michael S. Tsirkin ha scritto:
>>> On Wed, Apr 10, 2013 at 06:28:18PM -0400, mrhines@linux.vnet.ibm.com wrote:
>>>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>>>
>>>> This allows the user to disable zero page checking during migration
>>>>
>>>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>>> IMO this knob is too low level to expose to management.
>>> Why not disable this automatically when migrating with rdma?
>> Thinking more about it, I'm not sure why it is important to disable it.
> This just illustrates the point. There's no place for such low level
> knobs in the management interface.
>

I disagree with that: We already have precedent for this in the
XBZRLE capability. Zero page checking is no "more" low-level than this
capability already is and the community has already agreed to expose
this capability to management.

Since zero page scanning does in fact affect performance, we not give
the user the option?

Why would the community agree to expose one low-level feature and not 
expose another?

- Michael

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-11 13:19         ` Michael R. Hines
@ 2013-04-11 13:51           ` Michael S. Tsirkin
  2013-04-11 14:06             ` Michael R. Hines
  0 siblings, 1 reply; 52+ messages in thread
From: Michael S. Tsirkin @ 2013-04-11 13:51 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul,
	Paolo Bonzini

On Thu, Apr 11, 2013 at 09:19:43AM -0400, Michael R. Hines wrote:
> On 04/11/2013 07:13 AM, Michael S. Tsirkin wrote:
> >On Thu, Apr 11, 2013 at 11:18:38AM +0200, Paolo Bonzini wrote:
> >>Il 11/04/2013 09:38, Michael S. Tsirkin ha scritto:
> >>>On Wed, Apr 10, 2013 at 06:28:18PM -0400, mrhines@linux.vnet.ibm.com wrote:
> >>>>From: "Michael R. Hines" <mrhines@us.ibm.com>
> >>>>
> >>>>This allows the user to disable zero page checking during migration
> >>>>
> >>>>Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> >>>IMO this knob is too low level to expose to management.
> >>>Why not disable this automatically when migrating with rdma?
> >>Thinking more about it, I'm not sure why it is important to disable it.
> >This just illustrates the point. There's no place for such low level
> >knobs in the management interface.
> >
> 
> I disagree with that: We already have precedent for this in the
> XBZRLE capability.

My understanding is the issue is protocol compatibility,
not optimization. E.g. you can migrate to file, for each
new feature you need a way to disable it to stay compatible.

> Zero page checking is no "more" low-level than this
> capability already is and the community has already agreed to expose
> this capability to management.
> 
> Since zero page scanning does in fact affect performance, we not give
> the user the option?
> 
> Why would the community agree to expose one low-level feature and
> not expose another?
> 
> - Michael

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-11 13:51           ` Michael S. Tsirkin
@ 2013-04-11 14:06             ` Michael R. Hines
  2013-04-11 14:17               ` Paolo Bonzini
  0 siblings, 1 reply; 52+ messages in thread
From: Michael R. Hines @ 2013-04-11 14:06 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul,
	Paolo Bonzini

On 04/11/2013 09:51 AM, Michael S. Tsirkin wrote:
> On Thu, Apr 11, 2013 at 09:19:43AM -0400, Michael R. Hines wrote:
>> On 04/11/2013 07:13 AM, Michael S. Tsirkin wrote:
>>> On Thu, Apr 11, 2013 at 11:18:38AM +0200, Paolo Bonzini wrote:
>>>> Il 11/04/2013 09:38, Michael S. Tsirkin ha scritto:
>>>>> On Wed, Apr 10, 2013 at 06:28:18PM -0400, mrhines@linux.vnet.ibm.com wrote:
>>>>>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>>>>>
>>>>>> This allows the user to disable zero page checking during migration
>>>>>>
>>>>>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>>>>> IMO this knob is too low level to expose to management.
>>>>> Why not disable this automatically when migrating with rdma?
>>>> Thinking more about it, I'm not sure why it is important to disable it.
>>> This just illustrates the point. There's no place for such low level
>>> knobs in the management interface.
>>>
>> I disagree with that: We already have precedent for this in the
>> XBZRLE capability.
> My understanding is the issue is protocol compatibility,
> not optimization. E.g. you can migrate to file, for each
> new feature you need a way to disable it to stay compatible.

Ok, understood.

I would be happy to add a check for the other migration URI
protocols (like 'unix', 'tcp', etc) which says rejects disabling
the zero page checking only if the URI is for rdma.

Would that be OK?

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-11 14:06             ` Michael R. Hines
@ 2013-04-11 14:17               ` Paolo Bonzini
  2013-04-11 14:35                 ` Michael R. Hines
  0 siblings, 1 reply; 52+ messages in thread
From: Paolo Bonzini @ 2013-04-11 14:17 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

Il 11/04/2013 16:06, Michael R. Hines ha scritto:
>>>
>> My understanding is the issue is protocol compatibility,
>> not optimization. E.g. you can migrate to file, for each
>> new feature you need a way to disable it to stay compatible.
> 
> Ok, understood.
> 
> I would be happy to add a check for the other migration URI
> protocols (like 'unix', 'tcp', etc) which says rejects disabling
> the zero page checking only if the URI is for rdma.
> 
> Would that be OK?

I would like to see is_dup_page() on top of a "perf" profile for a
real-world scenario, and throughput numbers for the same real-world
scenario with/without is_dup_page().  Once you show that, yes.

Paolo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-11 14:17               ` Paolo Bonzini
@ 2013-04-11 14:35                 ` Michael R. Hines
  2013-04-11 14:45                   ` Paolo Bonzini
  0 siblings, 1 reply; 52+ messages in thread
From: Michael R. Hines @ 2013-04-11 14:35 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

Can I at least get a firm yes or no whether the maintainer will
accept this capability or not?

What you ask would require defining what a "real world scenario" is, and
I don't think that's a good discussion to have right now. Even if we did 
know the
definition, I do not have the infrastructure in place to do an exhaustive
search of such a workload.

My personal view is: new software should define APIs, not hide APIs.

The capability already has a default 'true' value, which is the same 
behavior
that the value has always been and nobody's threatening to get rid of that.

- Michael

On 04/11/2013 10:17 AM, Paolo Bonzini wrote:
> Ok, understood.
>
> I would be happy to add a check for the other migration URI
> protocols (like 'unix', 'tcp', etc) which says rejects disabling
> the zero page checking only if the URI is for rdma.
>
> Would that be OK?
> I would like to see is_dup_page() on top of a "perf" profile for a
> real-world scenario, and throughput numbers for the same real-world
> scenario with/without is_dup_page().  Once you show that, yes.
>
> Paolo
>

If th

- Michael

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-11 14:35                 ` Michael R. Hines
@ 2013-04-11 14:45                   ` Paolo Bonzini
  2013-04-11 15:37                     ` Michael R. Hines
  0 siblings, 1 reply; 52+ messages in thread
From: Paolo Bonzini @ 2013-04-11 14:45 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

Il 11/04/2013 16:35, Michael R. Hines ha scritto:
> Can I at least get a firm yes or no whether the maintainer will
> accept this capability or not?
> 
> What you ask would require defining what a "real world scenario" is,

A TPC benchmark would be a real world scenario.

> and I don't think that's a good discussion to have right now. Even if we did
> know the definition, I do not have the infrastructure in place to do an exhaustive
> search of such a workload.
> 
> My personal view is: new software should define APIs, not hide APIs.

Right, but introducing new APIs is not free.

Let's leave is_dup_page unconditionally in now.  We can always remove it
later if it turns out to be useful.

The important thing is to have the code in early to give it wider
exposure.  Once it is in, people can test it more, benchmark
with/without is_dup_page, etc.  We can declare it experimental, and
break the protocol later if it turns out to be bad.

I think all that's needed is:

1) benchmark the various chunk sizes (with is_dup_page disabled and your
current stress test -- better than nothing).  Please confirm that the
source can modify the chunk size and the destination will just pick it up.

2) remove the patch to disable is_dup_page

3) rename the transport to "x-rdma" (just in migration.c).

And that's it.  The patches should be ready.

We have converged on a good interface between RDMA and the generic
migration code, and that's the important thing because later
implementations will not throw away that work.

Paolo

> The capability already has a default 'true' value, which is the same behavior
> that the value has always been and nobody's threatening to get rid of that.
> 
> - Michael

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-11 14:45                   ` Paolo Bonzini
@ 2013-04-11 15:37                     ` Michael R. Hines
  0 siblings, 0 replies; 52+ messages in thread
From: Michael R. Hines @ 2013-04-11 15:37 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

On 04/11/2013 10:45 AM, Paolo Bonzini wrote:
>
> Right, but introducing new APIs is not free.
>
> Let's leave is_dup_page unconditionally in now.  We can always remove it
> later if it turns out to be useful.
>
> The important thing is to have the code in early to give it wider
> exposure.  Once it is in, people can test it more, benchmark
> with/without is_dup_page, etc.  We can declare it experimental, and
> break the protocol later if it turns out to be bad.
>
> I think all that's needed is:
>
> 1) benchmark the various chunk sizes (with is_dup_page disabled and your
> current stress test -- better than nothing).  Please confirm that the
> source can modify the chunk size and the destination will just pick it up.
>
> 2) remove the patch to disable is_dup_page
>
> 3) rename the transport to "x-rdma" (just in migration.c).
>
> And that's it.  The patches should be ready.
>
> We have converged on a good interface between RDMA and the generic
> migration code, and that's the important thing because later
> implementations will not throw away that work.
>
> Paolo

Ok, acknowledged =)

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-11  9:18     ` Paolo Bonzini
  2013-04-11 11:13       ` Michael S. Tsirkin
@ 2013-04-11 13:24       ` Michael R. Hines
  2013-04-11 14:15         ` Paolo Bonzini
  1 sibling, 1 reply; 52+ messages in thread
From: Michael R. Hines @ 2013-04-11 13:24 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

That's very accurate. Zero page scanning *after* the bulk phase
is not very helpful in general.

Are we proposing to skip is_dup_page() after the bulk phase
has finished?

The testcase I'm using is a "worst-case" stress memory hog command
(apt-get install stress) - but against this does not affect anything until
we assume the bulk phase has already completed.


On 04/11/2013 05:18 AM, Paolo Bonzini wrote:
> Il 11/04/2013 09:38, Michael S. Tsirkin ha scritto:
>> On Wed, Apr 10, 2013 at 06:28:18PM -0400, mrhines@linux.vnet.ibm.com wrote:
>>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>>
>>> This allows the user to disable zero page checking during migration
>>>
>>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>> IMO this knob is too low level to expose to management.
>> Why not disable this automatically when migrating with rdma?
> Thinking more about it, I'm not sure why it is important to disable it.
>
> As observed earlier:
>
> 1) non-zero pages typically have a non-zero word in the first 32 bytes,
> as measured by Peter Lieven, so the cost of is_dup_page can be ignored
> for non-zero pages.
>
> 2) all-zero pages typically change little, so they are rare after the
> bulk phase where all memory is sent once to the destination.
>
> Hence, the cost of is_dup_page can be ignored after the bulk phase.  In
> the bulk phase, checking for zero pages it may be expensive and lower
> throughput, sure, but what matters for convergence is throughput and
> latency _after_ the bulk phase.
>
> At least this is the theory.  mrhines, what testcase were you using?  If
> it is an idle guest, it is not a realistic one and the decreased
> latency/throughput does not really matter.
>
> Paolo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-11 13:24       ` Michael R. Hines
@ 2013-04-11 14:15         ` Paolo Bonzini
  2013-04-11 14:45           ` Michael S. Tsirkin
  2013-04-11 14:57           ` Michael R. Hines
  0 siblings, 2 replies; 52+ messages in thread
From: Paolo Bonzini @ 2013-04-11 14:15 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

Il 11/04/2013 15:24, Michael R. Hines ha scritto:
> That's very accurate. Zero page scanning *after* the bulk phase
> is not very helpful in general.
> 
> Are we proposing to skip is_dup_page() after the bulk phase
> has finished?

No, I'm saying that is_dup_page() should not be a problem.  I'm saying
it should only loop a lot during the bulk phase.  The only effect I can
imagine after the bulk phase is one cache miss.

Perhaps the stress-test you're using does not reproduce realistic
conditions with respect to zero pages.  Peter Lieven benchmarked real
guests, both Linux and Windows, and confirmed the theory that I
mentioned upthread.  Almost all non-zero pages are detected within the
first few words, and almost all zero pages come from the bulk phase.

Considering that one cache miss, RDMA is indeed different here.  TCP
would have this cache miss later anyway, RDMA does not.  Let's say 300
cycles/miss; at 2.5 GHz that is 300/2500 microseconds, i.e 0.12
microseconds per page.  This would say that we can run is_dup_page on 30
GB worth of nonzero pages every second or more.  Ok, the estimate is
quite generous in many ways, but is_dup_page() is only a bottleneck if
it can do less than 5 GB/s.

Paolo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-11 14:15         ` Paolo Bonzini
@ 2013-04-11 14:45           ` Michael S. Tsirkin
  2013-04-11 14:57           ` Michael R. Hines
  1 sibling, 0 replies; 52+ messages in thread
From: Michael S. Tsirkin @ 2013-04-11 14:45 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aliguori, qemu-devel, Michael R. Hines, owasserm, abali, mrhines,
	gokul

On Thu, Apr 11, 2013 at 04:15:54PM +0200, Paolo Bonzini wrote:
> Il 11/04/2013 15:24, Michael R. Hines ha scritto:
> > That's very accurate. Zero page scanning *after* the bulk phase
> > is not very helpful in general.
> > 
> > Are we proposing to skip is_dup_page() after the bulk phase
> > has finished?
> 
> No, I'm saying that is_dup_page() should not be a problem.  I'm saying
> it should only loop a lot during the bulk phase.  The only effect I can
> imagine after the bulk phase is one cache miss.
> 
> Perhaps the stress-test you're using does not reproduce realistic
> conditions with respect to zero pages.  Peter Lieven benchmarked real
> guests, both Linux and Windows, and confirmed the theory that I
> mentioned upthread.  Almost all non-zero pages are detected within the
> first few words, and almost all zero pages come from the bulk phase.
> 
> Considering that one cache miss, RDMA is indeed different here.  TCP
> would have this cache miss later anyway, RDMA does not.  Let's say 300
> cycles/miss; at 2.5 GHz that is 300/2500 microseconds, i.e 0.12
> microseconds per page.  This would say that we can run is_dup_page on 30
> GB worth of nonzero pages every second or more.  Ok, the estimate is
> quite generous in many ways, but is_dup_page() is only a bottleneck if
> it can do less than 5 GB/s.
> 
> Paolo

Further, if we read the pagemap to detect duplicates,
we won't need to read the page for RDMA either.
This might or might not prove to be a win, but
one thing for sure, management will not be able
to know if it's a win.

-- 
MST

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-11 14:15         ` Paolo Bonzini
  2013-04-11 14:45           ` Michael S. Tsirkin
@ 2013-04-11 14:57           ` Michael R. Hines
  2013-04-11 15:01             ` Michael S. Tsirkin
  2013-04-11 15:08             ` Paolo Bonzini
  1 sibling, 2 replies; 52+ messages in thread
From: Michael R. Hines @ 2013-04-11 14:57 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

We have hardware already with front side bus speeds of 13 GB/s.

We also already have 5 GB/s RDMA hardware, and we will likely
have even faster RDMA hardware in the future.

This analysis is not factoring into account the cycles it takes to
map the pages before they are checked for duplicate bytes,
regardless whether or not very little of the page is actually
cached on the processor.

This analysis is also not taking into account the possibility that the
VM may be CPU-bound at the same time that QEMU is competing
to execute is_dup_page().

Thus, as you mentioned, a worst-case 5 GB/s memory bandwidth
for is_dup_page() could be very easily reached given the right
conditions - and we do have many workloads both HPC and Multi-tier
which can easily cause QEMU's zero scanning performance to suffer.

- Michael

On 04/11/2013 10:15 AM, Paolo Bonzini wrote:
> No, I'm saying that is_dup_page() should not be a problem.  I'm saying
> it should only loop a lot during the bulk phase.  The only effect I can
> imagine after the bulk phase is one cache miss.
>
> Perhaps the stress-test you're using does not reproduce realistic
> conditions with respect to zero pages.  Peter Lieven benchmarked real
> guests, both Linux and Windows, and confirmed the theory that I
> mentioned upthread.  Almost all non-zero pages are detected within the
> first few words, and almost all zero pages come from the bulk phase.
>
> Considering that one cache miss, RDMA is indeed different here.  TCP
> would have this cache miss later anyway, RDMA does not.  Let's say 300
> cycles/miss; at 2.5 GHz that is 300/2500 microseconds, i.e 0.12
> microseconds per page.  This would say that we can run is_dup_page on 30
> GB worth of nonzero pages every second or more.  Ok, the estimate is
> quite generous in many ways, but is_dup_page() is only a bottleneck if
> it can do less than 5 GB/s.
>
> Paolo
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-11 14:57           ` Michael R. Hines
@ 2013-04-11 15:01             ` Michael S. Tsirkin
  2013-04-11 15:08             ` Paolo Bonzini
  1 sibling, 0 replies; 52+ messages in thread
From: Michael S. Tsirkin @ 2013-04-11 15:01 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul,
	Paolo Bonzini

On Thu, Apr 11, 2013 at 10:57:26AM -0400, Michael R. Hines wrote:
> We have hardware already with front side bus speeds of 13 GB/s.
> 
> We also already have 5 GB/s RDMA hardware, and we will likely
> have even faster RDMA hardware in the future.
> 
> This analysis is not factoring into account the cycles it takes to
> map the pages before they are checked for duplicate bytes,
> regardless whether or not very little of the page is actually
> cached on the processor.
> 
> This analysis is also not taking into account the possibility that the
> VM may be CPU-bound at the same time that QEMU is competing
> to execute is_dup_page().
> 
> Thus, as you mentioned, a worst-case 5 GB/s memory bandwidth
> for is_dup_page() could be very easily reached given the right
> conditions - and we do have many workloads both HPC and Multi-tier
> which can easily cause QEMU's zero scanning performance to suffer.
> 
> - Michael

Well, then you can make is_dup_page faster e.g. using the
pagemap trick as we discussed earlier.
Why does management need a "go fast" option? Just make it go fast...

-- 
MST

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-11 14:57           ` Michael R. Hines
  2013-04-11 15:01             ` Michael S. Tsirkin
@ 2013-04-11 15:08             ` Paolo Bonzini
  2013-04-11 15:35               ` Michael R. Hines
  1 sibling, 1 reply; 52+ messages in thread
From: Paolo Bonzini @ 2013-04-11 15:08 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

Il 11/04/2013 16:57, Michael R. Hines ha scritto:
> We have hardware already with front side bus speeds of 13 GB/s.
> 
> We also already have 5 GB/s RDMA hardware, and we will likely
> have even faster RDMA hardware in the future.
> 
> This analysis is not factoring into account the cycles it takes to
> map the pages before they are checked for duplicate bytes,

Do you mean the TLB misses?

> regardless whether or not very little of the page is actually
> cached on the processor.
> 
> This analysis is also not taking into account the possibility that the
> VM may be CPU-bound at the same time that QEMU is competing
> to execute is_dup_page().

is_dup_page() is memory-bound, not CPU-bound.  Note that is_dup_page
only needs 1% of the bandwidth it scans (32 bytes for a cache line out
of 4096 bytes/page).  Scanning 30 GB/s only requires reading 250 MB/s
from memory to the FSB.

> Thus, as you mentioned, a worst-case 5 GB/s memory bandwidth
> for is_dup_page() could be very easily reached given the right
> conditions - and we do have many workloads both HPC and Multi-tier
> which can easily cause QEMU's zero scanning performance to suffer.

These are the real world scenarios that I was talking about.  Do you
have profiles of these, with the latest QEMU code, that show
is_dup_page() to be expensive?

We could try prefetching the first cache line *of the next page* before
running is_dup_page.  There's a lot of things to test before giving up
and inventing a new API.

Paolo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-11 15:08             ` Paolo Bonzini
@ 2013-04-11 15:35               ` Michael R. Hines
  2013-04-11 15:45                 ` Paolo Bonzini
  0 siblings, 1 reply; 52+ messages in thread
From: Michael R. Hines @ 2013-04-11 15:35 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

On 04/11/2013 11:08 AM, Paolo Bonzini wrote:
> Il 11/04/2013 16:57, Michael R. Hines ha scritto:
>> We have hardware already with front side bus speeds of 13 GB/s.
>>
>> We also already have 5 GB/s RDMA hardware, and we will likely
>> have even faster RDMA hardware in the future.
>>
>> This analysis is not factoring into account the cycles it takes to
>> map the pages before they are checked for duplicate bytes,
> Do you mean the TLB misses?

Keeping in mind that this primarily happens during the bulk-phase round,
then yes, both TLB missing + the time it takes to trap into the
kernel, map the page, and let the TLB re-walk the page table.

But, as you pointed out, I do conceded that since most of the pages
will already have been mapped after the bulk phase round, this
should not be a problem anymore *after* that round has finished.

Using the /proc/<pid>/pagemap will probably go much further towards
solving the problem than disabling zero page scanning.
If its already possible to know if a page is not mapped, then there
won't be any need to scan it in the first place.

Once the page is mapped already, yes, I do see clearly that is_dup_page()
performance would probably be minimal.

Nevertheless, the initial "burst" of the bulk phase round is still important
to optimize, and I would like to know if the maintainer would accept this
API for disabling the scan or not. We think it's important because the total
migration time can be much smaller with high-throughput RDMA links
by optimizing the bulk-phase round and that lower total migration time 
is very
valuable to many of our workloads, in addition to the low-downtime 
benefits you get with RDMA.

> These are the real world scenarios that I was talking about.  Do you
> have profiles of these, with the latest QEMU code, that show
> is_dup_page() to be expensive?

I have expensive numbers only for the bulk phase round. Other than that,
I would be breaking confidentiality outside of the paper we have already 
published.

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-11 15:35               ` Michael R. Hines
@ 2013-04-11 15:45                 ` Paolo Bonzini
  2013-04-11 16:02                   ` Michael R. Hines
  2013-04-11 16:07                   ` Eric Blake
  0 siblings, 2 replies; 52+ messages in thread
From: Paolo Bonzini @ 2013-04-11 15:45 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

Il 11/04/2013 17:35, Michael R. Hines ha scritto:
> Nevertheless, the initial "burst" of the bulk phase round is still 
> important to optimize, and I would like to know if the maintainer
> would accept this API for disabling the scan or not

I'm not a maintainer, but every opinion counts... and my opinion is "not
yet".  Maybe for 1.6, and only after someone else tried it out.  That's
why it's important to merge the code early.

> We think it's important because the total migration time can be much
> smaller with high-throughput RDMA links by optimizing the bulk-phase
> round and that lower total migration time is very valuable to many of
> our workloads, in addition to the low-downtime benefits you get with 
> RDMA.
> 
> I have expensive numbers only for the bulk phase round. Other than that,
> I would be breaking confidentiality outside of the paper we have already
> published.

Fair enough.

Paolo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-11 15:45                 ` Paolo Bonzini
@ 2013-04-11 16:02                   ` Michael R. Hines
  2013-04-11 16:12                     ` Paolo Bonzini
  2013-04-11 16:07                   ` Eric Blake
  1 sibling, 1 reply; 52+ messages in thread
From: Michael R. Hines @ 2013-04-11 16:02 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

Alright, so here's a slightly different management decision
which tries to accomplish all the requests,
tell me if you like it:

1. QEMU starts up
2. *if and only if* chunk registration is disabled
     and *if and only* RDMA is enabled
             then, is_dup_page() is skipped
     Otherwise,
             everything is same as before, no change in code path
             and no zero page capability needs to be exposed to management

In this case there would be *no* capability for zero pages,
but we would still be able to detect the motivation of the
user indirectly through the chunk registration capability
by implying that since the capability was disabled then the
user is trying to optimize metrics for total migration time.

On the other hand, if the chunk registration capability is
enabled, then there is no change in the code path we because
zero page checking is mandatory to take of chunk registration
in the first place.

How does that sound? No zero page capability, but allow for
disabling only if chunk registration is disabled?

- Michael



On 04/11/2013 11:45 AM, Paolo Bonzini wrote:
> Il 11/04/2013 17:35, Michael R. Hines ha scritto:
>> Nevertheless, the initial "burst" of the bulk phase round is still
>> important to optimize, and I would like to know if the maintainer
>> would accept this API for disabling the scan or not
> I'm not a maintainer, but every opinion counts... and my opinion is "not
> yet".  Maybe for 1.6, and only after someone else tried it out.  That's
> why it's important to merge the code early.
>
>> We think it's important because the total migration time can be much
>> smaller with high-throughput RDMA links by optimizing the bulk-phase
>> round and that lower total migration time is very valuable to many of
>> our workloads, in addition to the low-downtime benefits you get with
>> RDMA.
>>
>> I have expensive numbers only for the bulk phase round. Other than that,
>> I would be breaking confidentiality outside of the paper we have already
>> published.
> Fair enough.
>
> Paolo
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-11 16:02                   ` Michael R. Hines
@ 2013-04-11 16:12                     ` Paolo Bonzini
  0 siblings, 0 replies; 52+ messages in thread
From: Paolo Bonzini @ 2013-04-11 16:12 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul

Il 11/04/2013 18:02, Michael R. Hines ha scritto:
> Alright, so here's a slightly different management decision
> which tries to accomplish all the requests,
> tell me if you like it:
> 
> 1. QEMU starts up
> 2. *if and only if* chunk registration is disabled
>     and *if and only* RDMA is enabled
>             then, is_dup_page() is skipped
>     Otherwise,
>             everything is same as before, no change in code path
>             and no zero page capability needs to be exposed to management
> 
> In this case there would be *no* capability for zero pages,
> but we would still be able to detect the motivation of the
> user indirectly through the chunk registration capability
> by implying that since the capability was disabled then the
> user is trying to optimize metrics for total migration time.
> 
> On the other hand, if the chunk registration capability is
> enabled, then there is no change in the code path we because
> zero page checking is mandatory to take of chunk registration
> in the first place.
> 
> How does that sound? No zero page capability, but allow for
> disabling only if chunk registration is disabled?

It makes sense, but I prefer to keep the code simple for this first
iteration.  Let's move zero page detection off the table for now.

Paolo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-11 15:45                 ` Paolo Bonzini
  2013-04-11 16:02                   ` Michael R. Hines
@ 2013-04-11 16:07                   ` Eric Blake
  2013-04-11 16:29                     ` Michael R. Hines
  1 sibling, 1 reply; 52+ messages in thread
From: Eric Blake @ 2013-04-11 16:07 UTC (permalink / raw)
  To: Paolo Bonzini
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, Michael R. Hines,
	owasserm, abali, mrhines, gokul

[-- Attachment #1: Type: text/plain, Size: 1164 bytes --]

On 04/11/2013 09:45 AM, Paolo Bonzini wrote:
> Il 11/04/2013 17:35, Michael R. Hines ha scritto:
>> Nevertheless, the initial "burst" of the bulk phase round is still 
>> important to optimize, and I would like to know if the maintainer
>> would accept this API for disabling the scan or not
> 
> I'm not a maintainer, but every opinion counts... and my opinion is "not
> yet".  Maybe for 1.6, and only after someone else tried it out.  That's
> why it's important to merge the code early.

Agreed on that point - it's always easier to add an interface later,
when we have field usage suggesting that it would be useful, than it is
to remove an interface once provided, but where field usage says it is
never tweaked from the default.

Having a knob for disabling zero detection might make sense in the
future, but no need to rush it into 1.5 and regret the design, and no
need to hold up getting RDMA into 1.5 just because of a debate about a
knob that can be deferred to a later release when we've had more time to
play with RDMA.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 621 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-11 16:07                   ` Eric Blake
@ 2013-04-11 16:29                     ` Michael R. Hines
  2013-04-11 16:36                       ` Eric Blake
  0 siblings, 1 reply; 52+ messages in thread
From: Michael R. Hines @ 2013-04-11 16:29 UTC (permalink / raw)
  To: Eric Blake
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul, Paolo Bonzini

On 04/11/2013 12:07 PM, Eric Blake wrote:
> On 04/11/2013 09:45 AM, Paolo Bonzini wrote:
>> Il 11/04/2013 17:35, Michael R. Hines ha scritto:
>>> Nevertheless, the initial "burst" of the bulk phase round is still
>>> important to optimize, and I would like to know if the maintainer
>>> would accept this API for disabling the scan or not
>> I'm not a maintainer, but every opinion counts... and my opinion is "not
>> yet".  Maybe for 1.6, and only after someone else tried it out.  That's
>> why it's important to merge the code early.
> Agreed on that point - it's always easier to add an interface later,
> when we have field usage suggesting that it would be useful, than it is
> to remove an interface once provided, but where field usage says it is
> never tweaked from the default.
>
> Having a knob for disabling zero detection might make sense in the
> future, but no need to rush it into 1.5 and regret the design, and no
> need to hold up getting RDMA into 1.5 just because of a debate about a
> knob that can be deferred to a later release when we've had more time to
> play with RDMA.
>

Agreed, so what about my second proposal?

Disabling zero detection "on demand" if and only if RDMA is enabled
and if and only if chunk registration is disabled?

- Michael

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero
  2013-04-11 16:29                     ` Michael R. Hines
@ 2013-04-11 16:36                       ` Eric Blake
  0 siblings, 0 replies; 52+ messages in thread
From: Eric Blake @ 2013-04-11 16:36 UTC (permalink / raw)
  To: Michael R. Hines
  Cc: aliguori, Michael S. Tsirkin, qemu-devel, owasserm, abali,
	mrhines, gokul, Paolo Bonzini

[-- Attachment #1: Type: text/plain, Size: 518 bytes --]

On 04/11/2013 10:29 AM, Michael R. Hines wrote:
> Agreed, so what about my second proposal?
> 
> Disabling zero detection "on demand" if and only if RDMA is enabled
> and if and only if chunk registration is disabled?

I haven't been following the discussion closely, but that sounds like it
is making the internal state use a sane default based on existing
external interface, and should be fine.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 621 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [Qemu-devel] [RFC PATCH RDMA support v1: 11/13] send pc.ram over RDMA
  2013-04-10 22:28 [Qemu-devel] [RFC PATCH RDMA support v7: 00/13] rdma cleanup and reordering mrhines
                   ` (9 preceding siblings ...)
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero mrhines
@ 2013-04-10 22:28 ` mrhines
  2013-04-11  6:26   ` Paolo Bonzini
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 12/13] updated protocol documentation mrhines
                   ` (2 subsequent siblings)
  13 siblings, 1 reply; 52+ messages in thread
From: mrhines @ 2013-04-10 22:28 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

This takes advantages of the previous patches:
1. use the new QEMUFileOps hook 'save_page' and return
   ENOTSUP if not supported.
2. call out to the right accessor methods to invoke
   the iteration hooks defined in QEMUFileOps

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 arch_init.c |   46 ++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 44 insertions(+), 2 deletions(-)

diff --git a/arch_init.c b/arch_init.c
index 769ce77..eea3091 100644
--- a/arch_init.c
+++ b/arch_init.c
@@ -115,6 +115,7 @@ const uint32_t arch_type = QEMU_ARCH;
 #define RAM_SAVE_FLAG_EOS      0x10
 #define RAM_SAVE_FLAG_CONTINUE 0x20
 #define RAM_SAVE_FLAG_XBZRLE   0x40
+#define RAM_SAVE_FLAG_HOOK     0x80 /* perform hook during iteration */
 
 
 static struct defconfig_file {
@@ -170,6 +171,14 @@ static struct {
     .cache = NULL,
 };
 
+#ifdef CONFIG_RDMA
+int qemu_rdma_registration_start(QEMUFile *f, void *opaque, uint32_t flags)
+{
+    DPRINTF("start section: %d\n", flags);
+    qemu_put_be64(f, RAM_SAVE_FLAG_HOOK);
+    return 0;
+}
+#endif
 
 int64_t xbzrle_cache_resize(int64_t new_size)
 {
@@ -447,15 +456,22 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
                 ram_bulk_stage = false;
             }
         } else {
+            bool zero;
             uint8_t *p;
             int cont = (block == last_sent_block) ?
                 RAM_SAVE_FLAG_CONTINUE : 0;
 
             p = memory_region_get_ram_ptr(mr) + offset;
 
+            /* use capability now, defaults to true */
+            zero = migrate_check_for_zero() ? is_zero_page(p) : false;
+
             /* In doubt sent page as normal */
             bytes_sent = -1;
-            if (is_zero_page(p)) {
+            if ((bytes_sent = ram_control_save_page(f, block->offset, 
+                            offset, cont, TARGET_PAGE_SIZE, zero)) >= 0) {
+                acct_info.norm_pages++;
+            } else if (zero) {
                 acct_info.dup_pages++;
                 if (!ram_bulk_stage) {
                     bytes_sent = save_block_hdr(f, block, offset, cont,
@@ -476,7 +492,7 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
             }
 
             /* XBZRLE overflow or normal page */
-            if (bytes_sent == -1) {
+            if (bytes_sent == -1 || bytes_sent == -ENOTSUP) {
                 bytes_sent = save_block_hdr(f, block, offset, cont, RAM_SAVE_FLAG_PAGE);
                 qemu_put_buffer_async(f, p, TARGET_PAGE_SIZE);
                 bytes_sent += TARGET_PAGE_SIZE;
@@ -598,6 +614,18 @@ static int ram_save_setup(QEMUFile *f, void *opaque)
     }
 
     qemu_mutex_unlock_ramlist();
+
+    /*
+     * These following calls generate reserved messages for future expansion of the RDMA
+     * protocol. If the ops are not defined, nothing will happen.
+     *
+     * Please leave in place. They are intended to be used to pre-register
+     * memory in the future to mitigate the extremely high cost of dynamic page
+     * registration.
+     */
+    ram_control_before_iterate(f, RAM_CONTROL_SETUP);
+    ram_control_after_iterate(f, RAM_CONTROL_SETUP);
+
     qemu_put_be64(f, RAM_SAVE_FLAG_EOS);
 
     return 0;
@@ -616,6 +644,8 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
         reset_ram_globals();
     }
 
+    ram_control_before_iterate(f, RAM_CONTROL_ROUND);
+
     t0 = qemu_get_clock_ns(rt_clock);
     i = 0;
     while ((ret = qemu_file_rate_limit(f)) == 0) {
@@ -646,6 +676,12 @@ static int ram_save_iterate(QEMUFile *f, void *opaque)
 
     qemu_mutex_unlock_ramlist();
 
+    /* 
+     * must occur before EOS (or any QEMUFile operation) 
+     * because of RDMA protocol 
+     */
+    ram_control_after_iterate(f, RAM_CONTROL_ROUND);
+
     if (ret < 0) {
         bytes_transferred += total_sent;
         return ret;
@@ -663,6 +699,8 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
     qemu_mutex_lock_ramlist();
     migration_bitmap_sync();
 
+    ram_control_before_iterate(f, RAM_CONTROL_FINISH);
+
     /* try transferring iterative blocks of memory */
 
     /* flush all remaining blocks regardless of rate limiting */
@@ -676,6 +714,8 @@ static int ram_save_complete(QEMUFile *f, void *opaque)
         }
         bytes_transferred += bytes_sent;
     }
+
+    ram_control_after_iterate(f, RAM_CONTROL_FINISH);
     migration_end();
 
     qemu_mutex_unlock_ramlist();
@@ -864,6 +904,8 @@ static int ram_load(QEMUFile *f, void *opaque, int version_id)
                 ret = -EINVAL;
                 goto done;
             }
+        } else if (flags & RAM_SAVE_FLAG_HOOK) {
+            ram_control_load_hook(f, RAM_CONTROL_REGISTER); 
         }
         error = qemu_file_get_error(f);
         if (error) {
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 11/13] send pc.ram over RDMA
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 11/13] send pc.ram over RDMA mrhines
@ 2013-04-11  6:26   ` Paolo Bonzini
  2013-04-11 12:41     ` Michael R. Hines
  0 siblings, 1 reply; 52+ messages in thread
From: Paolo Bonzini @ 2013-04-11  6:26 UTC (permalink / raw)
  To: mrhines; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Il 11/04/2013 00:28, mrhines@linux.vnet.ibm.com ha scritto:
>  
> +#ifdef CONFIG_RDMA
> +int qemu_rdma_registration_start(QEMUFile *f, void *opaque, uint32_t flags)
> +{
> +    DPRINTF("start section: %d\n", flags);
> +    qemu_put_be64(f, RAM_SAVE_FLAG_HOOK);
> +    return 0;
> +}
> +#endif
>  

This must be in migration-rdma.c.  Move RAM_SAVE_FLAG_HOOK to
migration/migration.h with a comment like this:

/* Whenever this is found in the data stream, the flags
 * will be passed to ram_load_hook in the incoming-migration
 * side.  This lets before_ram_iterate/after_ram_iterate add
 * transport-specific sections to the RAM migration data.
 */

Paolo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 11/13] send pc.ram over RDMA
  2013-04-11  6:26   ` Paolo Bonzini
@ 2013-04-11 12:41     ` Michael R. Hines
  0 siblings, 0 replies; 52+ messages in thread
From: Michael R. Hines @ 2013-04-11 12:41 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Acknowledged.

On 04/11/2013 02:26 AM, Paolo Bonzini wrote:
> Il 11/04/2013 00:28, mrhines@linux.vnet.ibm.com ha scritto:
>>   
>> +#ifdef CONFIG_RDMA
>> +int qemu_rdma_registration_start(QEMUFile *f, void *opaque, uint32_t flags)
>> +{
>> +    DPRINTF("start section: %d\n", flags);
>> +    qemu_put_be64(f, RAM_SAVE_FLAG_HOOK);
>> +    return 0;
>> +}
>> +#endif
>>   
> This must be in migration-rdma.c.  Move RAM_SAVE_FLAG_HOOK to
> migration/migration.h with a comment like this:
>
> /* Whenever this is found in the data stream, the flags
>   * will be passed to ram_load_hook in the incoming-migration
>   * side.  This lets before_ram_iterate/after_ram_iterate add
>   * transport-specific sections to the RAM migration data.
>   */
>
> Paolo
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [Qemu-devel] [RFC PATCH RDMA support v1: 12/13] updated protocol documentation
  2013-04-10 22:28 [Qemu-devel] [RFC PATCH RDMA support v7: 00/13] rdma cleanup and reordering mrhines
                   ` (10 preceding siblings ...)
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 11/13] send pc.ram over RDMA mrhines
@ 2013-04-10 22:28 ` mrhines
  2013-04-11  2:43   ` Eric Blake
  2013-04-11  6:29   ` Paolo Bonzini
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 13/13] print out migration throughput while debugging mrhines
  2013-04-10 22:32 ` [Qemu-devel] [RFC PATCH RDMA support v7: 00/13] rdma cleanup and reordering Michael R. Hines
  13 siblings, 2 replies; 52+ messages in thread
From: mrhines @ 2013-04-10 22:28 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

Full documentation on the rdma protocol: docs/rdma.txt

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 docs/rdma.txt |  331 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 331 insertions(+)
 create mode 100644 docs/rdma.txt

diff --git a/docs/rdma.txt b/docs/rdma.txt
new file mode 100644
index 0000000..ae68d2f
--- /dev/null
+++ b/docs/rdma.txt
@@ -0,0 +1,331 @@
+Changes since v6:
+
+(Thanks, Paolo - things look much cleaner now.)
+
+- Try to get patch-ordering correct =)
+- Much cleaner use of QEMUFileOps
+- Much fewer header files changes
+- Convert zero check capability to QMP command instead
+- Updated documentation
+
+Wiki: http://wiki.qemu.org/Features/RDMALiveMigration
+Github: git@github.com:hinesmr/qemu.git
+Contact: Michael R. Hines, mrhines@us.ibm.com
+
+RDMA Live Migration Specification, Version # 1
+
+Contents:
+=================================
+* Running
+* RDMA Protocol Description
+* Versioning and Capabilities
+* QEMUFileRDMA Interface
+* Migration of pc.ram
+* Error handling
+* TODO
+* Performance
+
+RUNNING:
+===============================
+
+First, decide if you want dynamic page registration on the server-side.
+This always happens on the primary-VM side, but is optional on the server.
+Doing this allows you to support overcommit (such as cgroups or ballooning)
+with a smaller footprint on the server-side without having to register the
+entire VM memory footprint. 
+NOTE: This significantly slows down RDMA throughput (about 30% slower).
+
+$ virsh qemu-monitor-command --hmp \
+    --cmd "migrate_set_capability chunk_register_destination off" # enabled by default
+
+Next, if you decided *not* to use chunked registration on the server,
+it is recommended to also disable zero page detection. While this is not
+strictly necessary, zero page detection also significantly slows down
+throughput on higher-performance links (by about 50%), like 40 gbps infiniband cards:
+
+$ virsh qemu-monitor-command --hmp \
+    --cmd "migrate_check_for_zero off" # enabled by default
+
+Finally, set the migration speed to match your hardware's capabilities:
+
+$ virsh qemu-monitor-command --hmp \
+    --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device
+
+Finally, perform the actual migration:
+
+$ virsh migrate domain rdma:xx.xx.xx.xx:port
+
+RDMA Protocol Description:
+=================================
+
+Migration with RDMA is separated into two parts:
+
+1. The transmission of the pages using RDMA
+2. Everything else (a control channel is introduced)
+
+"Everything else" is transmitted using a formal 
+protocol now, consisting of infiniband SEND / RECV messages.
+
+An infiniband SEND message is the standard ibverbs
+message used by applications of infiniband hardware.
+The only difference between a SEND message and an RDMA
+message is that SEND message cause completion notifications
+to be posted to the completion queue (CQ) on the 
+infiniband receiver side, whereas RDMA messages (used
+for pc.ram) do not (to behave like an actual DMA).
+    
+Messages in infiniband require two things:
+
+1. registration of the memory that will be transmitted
+2. (SEND/RECV only) work requests to be posted on both
+   sides of the network before the actual transmission
+   can occur.
+
+RDMA messages much easier to deal with. Once the memory
+on the receiver side is registered and pinned, we're
+basically done. All that is required is for the sender
+side to start dumping bytes onto the link.
+
+(Memory is not released from pinning until the migration
+completes, given that RDMA migrations are very fast.)
+
+SEND messages require more coordination because the
+receiver must have reserved space (using a receive
+work request) on the receive queue (RQ) before QEMUFileRDMA
+can start using them to carry all the bytes as
+a transport for migration of device state.
+
+To begin the migration, the initial connection setup is
+as follows (migration-rdma.c):
+
+1. Receiver and Sender are started (command line or libvirt):
+2. Both sides post two RQ work requests
+3. Receiver does listen()
+4. Sender does connect()
+5. Receiver accept()
+6. Check versioning and capabilities (described later)
+
+At this point, we define a control channel on top of SEND messages
+which is described by a formal protocol. Each SEND message has a 
+header portion and a data portion (but together are transmitted 
+as a single SEND message).
+
+Header:
+    * Length  (of the data portion, uint32, network byte order)
+    * Type    (what command to perform, uint32, network byte order)
+    * Version (protocol version validated before send/recv occurs, uint32, network byte order
+
+The 'type' field has 7 different command values:
+    1. None
+    2. Ready             (control-channel is available) 
+    3. QEMU File         (for sending non-live device state) 
+    4. RAM Blocks        (used right after connection setup)
+    5. Register request  (dynamic chunk registration) 
+    6. Register result   ('rkey' to be used by sender)
+    7. Register finished (registration for current iteration finished)
+
+After connection setup is completed, we have two protocol-level
+functions, responsible for communicating control-channel commands
+using the above list of values: 
+
+Logically:
+
+qemu_rdma_exchange_recv(header, expected command type)
+
+1. We transmit a READY command to let the sender know that 
+   we are *ready* to receive some data bytes on the control channel.
+2. Before attempting to receive the expected command, we post another
+   RQ work request to replace the one we just used up.
+3. Block on a CQ event channel and wait for the SEND to arrive.
+4. When the send arrives, librdmacm will unblock us.
+5. Verify that the command-type and version received matches the one we expected.
+
+qemu_rdma_exchange_send(header, data, optional response header & data): 
+
+1. Block on the CQ event channel waiting for a READY command
+   from the receiver to tell us that the receiver
+   is *ready* for us to transmit some new bytes.
+2. Optionally: if we are expecting a response from the command
+   (that we have no yet transmitted), let's post an RQ
+   work request to receive that data a few moments later. 
+3. When the READY arrives, librdmacm will 
+   unblock us and we immediately post a RQ work request
+   to replace the one we just used up.
+4. Now, we can actually post the work request to SEND
+   the requested command type of the header we were asked for.
+5. Optionally, if we are expecting a response (as before),
+   we block again and wait for that response using the additional
+   work request we previously posted. (This is used to carry
+   'Register result' commands #6 back to the sender which
+   hold the rkey need to perform RDMA.
+
+All of the remaining command types (not including 'ready')
+described above all use the aformentioned two functions to do the hard work:
+
+1. After connection setup, RAMBlock information is exchanged using
+   this protocol before the actual migration begins. This information includes
+   a description of each RAMBlock on the server side as well as the virtual addresses
+   and lengths of each RAMBlock. This is used by the client to determine the
+   start and stop locations of chunks and how to register them dynamically
+   before performing the RDMA operations.
+2. During runtime, once a 'chunk' becomes full of pages ready to
+   be sent with RDMA, the registration commands are used to ask the
+   other side to register the memory for this chunk and respond
+   with the result (rkey) of the registration.
+3. Also, the QEMUFile interfaces also call these functions (described below)
+   when transmitting non-live state, such as devices or to send
+   its own protocol information during the migration process.
+
+Versioning and Capabilities
+==================================
+Current version of the protocol is version #1, both for protocol
+traffic and capabilities negotiation. (i.e. There is only one version
+number that is referred to by all communication).
+
+librdmacm provides the user with a 'private data' area to be exchanged
+at connection-setup time before any infiniband traffic is generated.
+
+Header:
+    * Version (protocol version validated before send/recv occurs), uint32, network byte order
+    * Flags   (bitwise OR of each capability), uint32, network byte order
+
+There is no data portion of this header right now, so there is
+no length field. The maximum size of the 'private data' section
+is only 192 bytes per the Infiniband specification, so it's not
+very useful for data anyway. This structure needs to remain small.
+
+This private data area is a convenient place to check for protocol 
+versioning because the user does not need to register memory to 
+transmit a few bytes of version information.
+
+This is also a convenient place to negotiate capabilities
+(like dynamic page registration).
+
+If the version is invalid, we throw an error.
+
+If the version is new, we only negotiate the capabilities that the
+requested version is able to perform and ignore the rest.
+
+Currently there is only *one* capability in Version #1: dynamic page registration
+
+QEMUFileRDMA Interface:
+==================================
+
+QEMUFileRDMA introduces a couple of new functions:
+
+1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
+2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)
+
+These two functions are very short and simply used the protocol
+describe above to deliver bytes without changing the upper-level
+users of QEMUFile that depend on a bytstream abstraction.
+
+Finally, how do we handoff the actual bytes to get_buffer()?
+
+Again, because we're trying to "fake" a bytestream abstraction
+using an analogy not unlike individual UDP frames, we have
+to hold on to the bytes received from control-channel's SEND 
+messages in memory.
+
+Each time we receive a complete "QEMU File" control-channel 
+message, the bytes from SEND are copied into a small local holding area.
+
+Then, we return the number of bytes requested by get_buffer()
+and leave the remaining bytes in the holding area until get_buffer()
+comes around for another pass.
+
+If the buffer is empty, then we follow the same steps
+listed above and issue another "QEMU File" protocol command,
+asking for a new SEND message to re-fill the buffer.
+
+Migration of pc.ram:
+===============================
+
+At the beginning of the migration, (migration-rdma.c),
+the sender and the receiver populate the list of RAMBlocks
+to be registered with each other into a structure.
+Then, using the aforementioned protocol, they exchange a
+description of these blocks with each other, to be used later 
+during the iteration of main memory. This description includes
+a list of all the RAMBlocks, their offsets and lengths and
+possibly includes pre-registered RDMA keys in case dynamic
+page registration was disabled on the server-side, otherwise not.
+
+Main memory is not migrated with the aforementioned protocol, 
+but is instead migrated with normal RDMA Write operations.
+
+Pages are migrated in "chunks" (hard-coded to 1 Megabyte right now).
+Chunk size is not dynamic, but it could be in a future implementation.
+There's nothing to indicate that this is useful right now.
+
+When a chunk is full (or a flush() occurs), the memory backed by 
+the chunk is registered with librdmacm and pinned in memory on 
+both sides using the aforementioned protocol.
+
+After pinning, an RDMA Write is generated and tramsmitted
+for the entire chunk.
+
+Chunks are also transmitted in batches: This means that we
+do not request that the hardware signal the completion queue
+for the completion of *every* chunk. The current batch size
+is about 64 chunks (corresponding to 64 MB of memory).
+Only the last chunk in a batch must be signaled.
+This helps keep everything as asynchronous as possible
+and helps keep the hardware busy performing RDMA operations.
+
+Error-handling:
+===============================
+
+Infiniband has what is called a "Reliable, Connected"
+link (one of 4 choices). This is the mode in which
+we use for RDMA migration.
+
+If a *single* message fails,
+the decision is to abort the migration entirely and
+cleanup all the RDMA descriptors and unregister all
+the memory.
+
+After cleanup, the Virtual Machine is returned to normal
+operation the same way that would happen if the TCP
+socket is broken during a non-RDMA based migration.
+
+TODO:
+=================================
+1. Chunk server registration could be improved:
+   This can be done by holding chunks for a certain amount
+   of time and then register all of the chunks at the same
+   time using a fewer number of control messages. The
+   performance of this approach is unclear.
+2. Currently, cgroups swap limits for *both* TCP and RDMA
+   on the sender-side is broken. This is more poignant for
+   RDMA because RDMA requires memory registration.
+   Fixing this requires infiniband page registrations to be
+   zero-page aware, and this does not yet work properly.
+3. Currently overcommit for the the *receiver* side of
+   TCP works, but not for RDMA. While dynamic page registration
+   *does* work, it is only useful if the is_zero_page() capability
+   is remained enabled (which it is by default).
+   However, leaving this capability turned on *significantly* slows
+   down the RDMA throughput, particularly on hardware capable
+   of transmitting faster than 10 gbps (such as 40gbps links).
+4. Use of the recent /dev/<pid>/pagemap would likely solve some
+   of these problems.
+5. Also, some form of balloon-device usage tracking would also
+   help aleviate some of these issues.
+
+PERFORMANCE
+===================
+
+Using a 40gbps infinband link performing a worst-case stress test:
+
+RDMA Throughput With $ stress --vm-bytes 1024M --vm 1 --vm-keep
+Approximately 30 gpbs (little better than the paper)
+1. Average worst-case throughput 
+TCP Throughput With $ stress --vm-bytes 1024M --vm 1 --vm-keep
+2. Approximately 8 gpbs (using IPOIB IP over Infiniband)
+3. Using chunked registration: approximately 6 gbps.
+
+Average downtime (stop time) ranges between 15 and 33 milliseconds.
+
+An *exhaustive* paper (2010) shows additional performance details
+linked on the QEMU wiki:
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 12/13] updated protocol documentation
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 12/13] updated protocol documentation mrhines
@ 2013-04-11  2:43   ` Eric Blake
  2013-04-11  2:47     ` Michael R. Hines
  2013-04-11  6:29   ` Paolo Bonzini
  1 sibling, 1 reply; 52+ messages in thread
From: Eric Blake @ 2013-04-11  2:43 UTC (permalink / raw)
  To: mrhines
  Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul,
	pbonzini

[-- Attachment #1: Type: text/plain, Size: 5143 bytes --]

On 04/10/2013 04:28 PM, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
> 
> Full documentation on the rdma protocol: docs/rdma.txt
> 
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
>  docs/rdma.txt |  331 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 331 insertions(+)
>  create mode 100644 docs/rdma.txt
> 
> diff --git a/docs/rdma.txt b/docs/rdma.txt
> new file mode 100644
> index 0000000..ae68d2f
> --- /dev/null
> +++ b/docs/rdma.txt
> @@ -0,0 +1,331 @@
> +Changes since v6:
> +
> +(Thanks, Paolo - things look much cleaner now.)
> +
> +- Try to get patch-ordering correct =)
> +- Much cleaner use of QEMUFileOps
> +- Much fewer header files changes
> +- Convert zero check capability to QMP command instead
> +- Updated documentation

The above text probably shouldn't be in the file.

> +
> +Wiki: http://wiki.qemu.org/Features/RDMALiveMigration
> +Github: git@github.com:hinesmr/qemu.git
> +Contact: Michael R. Hines, mrhines@us.ibm.com

Missing a copyright statement, but that's just following the example of
other docs, so I guess it's okay?

> +
> +RDMA Live Migration Specification, Version # 1
> +
> +Contents:
> +=================================
> +* Running
> +* RDMA Protocol Description
> +* Versioning and Capabilities
> +* QEMUFileRDMA Interface
> +* Migration of pc.ram
> +* Error handling
> +* TODO
> +* Performance
> +

No high-level overview of what the acronym RDMA even stands for?

> +RUNNING:
> +===============================
> +
> +First, decide if you want dynamic page registration on the server-side.
> +This always happens on the primary-VM side, but is optional on the server.
> +Doing this allows you to support overcommit (such as cgroups or ballooning)
> +with a smaller footprint on the server-side without having to register the
> +entire VM memory footprint. 
> +NOTE: This significantly slows down RDMA throughput (about 30% slower).
> +
> +$ virsh qemu-monitor-command --hmp \
> +    --cmd "migrate_set_capability chunk_register_destination off" # enabled by default

'virsh qemu-monitor-command' is documented as unsupported by libvirt
(it's intended solely as a development/debugging aid); but I guess until
libvirt learns to expose RDMA support by default, this is okay for a
first cut of documentation.  Furthermore, you are missing a domain argument.

Do you really want to be requiring the user to do everything through
libvirt?  This is qemu documentation, so you should document how things
work without needing libvirt in the picture.

> +
> +Next, if you decided *not* to use chunked registration on the server,
> +it is recommended to also disable zero page detection. While this is not
> +strictly necessary, zero page detection also significantly slows down
> +throughput on higher-performance links (by about 50%), like 40 gbps infiniband cards:
> +
> +$ virsh qemu-monitor-command --hmp \
> +    --cmd "migrate_check_for_zero off" # enabled by default

Missing a domain argument.

> +
> +Finally, set the migration speed to match your hardware's capabilities:
> +
> +$ virsh qemu-monitor-command --hmp \
> +    --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device

This modifies qemu state behind libvirt's back, and won't necessarily do
what you want if libvirt tries to change things back to the speed it
thought it was managing.  Instead, use 'virsh migrate-setspeed $dom 40'.

> +
> +Finally, perform the actual migration:
> +
> +$ virsh migrate domain rdma:xx.xx.xx.xx:port

That's not quite valid syntax for 'virsh migrate'.  Again, do you really
want to be documenting libvirt's interface, or qemu's interface?

> +
> +RDMA Protocol Description:
> +=================================

Aesthetics: match the length of === to the line above it.

<snip> I'm not reviewing technical content, just face value...

> +
> +These two functions are very short and simply used the protocol
> +describe above to deliver bytes without changing the upper-level
> +users of QEMUFile that depend on a bytstream abstraction.

s/bytstream/bytestream/

...
> +
> +After pinning, an RDMA Write is generated and tramsmitted
> +for the entire chunk.

s/tramsmitted/transmitted/

> +5. Also, some form of balloon-device usage tracking would also
> +   help aleviate some of these issues.

s/aleviate/alleviate/

> +
> +PERFORMANCE
> +===================
> +
> +Using a 40gbps infinband link performing a worst-case stress test:

s/infinband/infiniband/

> +
> +RDMA Throughput With $ stress --vm-bytes 1024M --vm 1 --vm-keep
> +Approximately 30 gpbs (little better than the paper)

which paper? Call that out in your high-level summary

...
> +
> +An *exhaustive* paper (2010) shows additional performance details
> +linked on the QEMU wiki:

Missing the actual reference?  And it would help to mention it at the
beginning of the file.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org


[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 621 bytes --]

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 12/13] updated protocol documentation
  2013-04-11  2:43   ` Eric Blake
@ 2013-04-11  2:47     ` Michael R. Hines
  0 siblings, 0 replies; 52+ messages in thread
From: Michael R. Hines @ 2013-04-11  2:47 UTC (permalink / raw)
  To: Eric Blake
  Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul,
	pbonzini

Great comments, thanks.

On 04/10/2013 10:43 PM, Eric Blake wrote:
> On 04/10/2013 04:28 PM, mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> Full documentation on the rdma protocol: docs/rdma.txt
>>
>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>> ---
>>   docs/rdma.txt |  331 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
>>   1 file changed, 331 insertions(+)
>>   create mode 100644 docs/rdma.txt
>>
>> diff --git a/docs/rdma.txt b/docs/rdma.txt
>> new file mode 100644
>> index 0000000..ae68d2f
>> --- /dev/null
>> +++ b/docs/rdma.txt
>> @@ -0,0 +1,331 @@
>> +Changes since v6:
>> +
>> +(Thanks, Paolo - things look much cleaner now.)
>> +
>> +- Try to get patch-ordering correct =)
>> +- Much cleaner use of QEMUFileOps
>> +- Much fewer header files changes
>> +- Convert zero check capability to QMP command instead
>> +- Updated documentation
> The above text probably shouldn't be in the file.
>
>> +
>> +Wiki: http://wiki.qemu.org/Features/RDMALiveMigration
>> +Github: git@github.com:hinesmr/qemu.git
>> +Contact: Michael R. Hines, mrhines@us.ibm.com
> Missing a copyright statement, but that's just following the example of
> other docs, so I guess it's okay?
>
>> +
>> +RDMA Live Migration Specification, Version # 1
>> +
>> +Contents:
>> +=================================
>> +* Running
>> +* RDMA Protocol Description
>> +* Versioning and Capabilities
>> +* QEMUFileRDMA Interface
>> +* Migration of pc.ram
>> +* Error handling
>> +* TODO
>> +* Performance
>> +
> No high-level overview of what the acronym RDMA even stands for?
>
>> +RUNNING:
>> +===============================
>> +
>> +First, decide if you want dynamic page registration on the server-side.
>> +This always happens on the primary-VM side, but is optional on the server.
>> +Doing this allows you to support overcommit (such as cgroups or ballooning)
>> +with a smaller footprint on the server-side without having to register the
>> +entire VM memory footprint.
>> +NOTE: This significantly slows down RDMA throughput (about 30% slower).
>> +
>> +$ virsh qemu-monitor-command --hmp \
>> +    --cmd "migrate_set_capability chunk_register_destination off" # enabled by default
> 'virsh qemu-monitor-command' is documented as unsupported by libvirt
> (it's intended solely as a development/debugging aid); but I guess until
> libvirt learns to expose RDMA support by default, this is okay for a
> first cut of documentation.  Furthermore, you are missing a domain argument.
>
> Do you really want to be requiring the user to do everything through
> libvirt?  This is qemu documentation, so you should document how things
> work without needing libvirt in the picture.
>
>> +
>> +Next, if you decided *not* to use chunked registration on the server,
>> +it is recommended to also disable zero page detection. While this is not
>> +strictly necessary, zero page detection also significantly slows down
>> +throughput on higher-performance links (by about 50%), like 40 gbps infiniband cards:
>> +
>> +$ virsh qemu-monitor-command --hmp \
>> +    --cmd "migrate_check_for_zero off" # enabled by default
> Missing a domain argument.
>
>> +
>> +Finally, set the migration speed to match your hardware's capabilities:
>> +
>> +$ virsh qemu-monitor-command --hmp \
>> +    --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device
> This modifies qemu state behind libvirt's back, and won't necessarily do
> what you want if libvirt tries to change things back to the speed it
> thought it was managing.  Instead, use 'virsh migrate-setspeed $dom 40'.
>
>> +
>> +Finally, perform the actual migration:
>> +
>> +$ virsh migrate domain rdma:xx.xx.xx.xx:port
> That's not quite valid syntax for 'virsh migrate'.  Again, do you really
> want to be documenting libvirt's interface, or qemu's interface?
>
>> +
>> +RDMA Protocol Description:
>> +=================================
> Aesthetics: match the length of === to the line above it.
>
> <snip> I'm not reviewing technical content, just face value...
>
>> +
>> +These two functions are very short and simply used the protocol
>> +describe above to deliver bytes without changing the upper-level
>> +users of QEMUFile that depend on a bytstream abstraction.
> s/bytstream/bytestream/
>
> ...
>> +
>> +After pinning, an RDMA Write is generated and tramsmitted
>> +for the entire chunk.
> s/tramsmitted/transmitted/
>
>> +5. Also, some form of balloon-device usage tracking would also
>> +   help aleviate some of these issues.
> s/aleviate/alleviate/
>
>> +
>> +PERFORMANCE
>> +===================
>> +
>> +Using a 40gbps infinband link performing a worst-case stress test:
> s/infinband/infiniband/
>
>> +
>> +RDMA Throughput With $ stress --vm-bytes 1024M --vm 1 --vm-keep
>> +Approximately 30 gpbs (little better than the paper)
> which paper? Call that out in your high-level summary
>
> ...
>> +
>> +An *exhaustive* paper (2010) shows additional performance details
>> +linked on the QEMU wiki:
> Missing the actual reference?  And it would help to mention it at the
> beginning of the file.
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v1: 12/13] updated protocol documentation
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 12/13] updated protocol documentation mrhines
  2013-04-11  2:43   ` Eric Blake
@ 2013-04-11  6:29   ` Paolo Bonzini
  1 sibling, 0 replies; 52+ messages in thread
From: Paolo Bonzini @ 2013-04-11  6:29 UTC (permalink / raw)
  To: mrhines; +Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul

Il 11/04/2013 00:28, mrhines@linux.vnet.ibm.com ha scritto:
> +
> +(Thanks, Paolo - things look much cleaner now.)

I agree!  No need to immortalize me in the docs, though. :)

Paolo

^ permalink raw reply	[flat|nested] 52+ messages in thread

* [Qemu-devel] [RFC PATCH RDMA support v1: 13/13] print out migration throughput while debugging
  2013-04-10 22:28 [Qemu-devel] [RFC PATCH RDMA support v7: 00/13] rdma cleanup and reordering mrhines
                   ` (11 preceding siblings ...)
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 12/13] updated protocol documentation mrhines
@ 2013-04-10 22:28 ` mrhines
  2013-04-10 22:32 ` [Qemu-devel] [RFC PATCH RDMA support v7: 00/13] rdma cleanup and reordering Michael R. Hines
  13 siblings, 0 replies; 52+ messages in thread
From: mrhines @ 2013-04-10 22:28 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

From: "Michael R. Hines" <mrhines@us.ibm.com>

It's very helpful when debugging to print out migration throughput
after each iteration round to compare the different migration
technologies.

Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
 migration.c |    8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/migration.c b/migration.c
index 3b4b467..3439629 100644
--- a/migration.c
+++ b/migration.c
@@ -35,6 +35,9 @@
     do { } while (0)
 #endif
 
+#define MBPS(bytes, time) time ? ((((double) bytes * 8)         \
+        / ((double) time / 1000.0)) / 1000.0 / 1000.0) : -1.0
+
 enum {
     MIG_STATE_ERROR,
     MIG_STATE_SETUP,
@@ -546,8 +549,9 @@ static void *migration_thread(void *opaque)
             max_size = bandwidth * migrate_max_downtime() / 1000000;
 
             DPRINTF("transferred %" PRIu64 " time_spent %" PRIu64
-                    " bandwidth %g max_size %" PRId64 "\n",
-                    transferred_bytes, time_spent, bandwidth, max_size);
+                    " bandwidth %g throughput %f max_size %" PRId64 "\n",
+                    transferred_bytes, time_spent, bandwidth, 
+                    MBPS(transferred_bytes, time_spent), max_size);
             /* if we haven't sent anything, we don't want to recalculate
                10000 is a small enough number for our purposes */
             if (s->dirty_bytes_rate && transferred_bytes > 10000) {
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 52+ messages in thread

* Re: [Qemu-devel] [RFC PATCH RDMA support v7: 00/13] rdma cleanup and reordering
  2013-04-10 22:28 [Qemu-devel] [RFC PATCH RDMA support v7: 00/13] rdma cleanup and reordering mrhines
                   ` (12 preceding siblings ...)
  2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 13/13] print out migration throughput while debugging mrhines
@ 2013-04-10 22:32 ` Michael R. Hines
  13 siblings, 0 replies; 52+ messages in thread
From: Michael R. Hines @ 2013-04-10 22:32 UTC (permalink / raw)
  To: qemu-devel; +Cc: aliguori, mst, owasserm, abali, mrhines, gokul, pbonzini

Don't know why it says "v1" in the other subject lines - please ignore.

On 04/10/2013 06:28 PM, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
>
> Changes since v6:
>
> (Thanks, Paolo - things look much cleaner now.)
>
> - Try to get patch-ordering correct =)
> - Much cleaner use of QEMUFileOps
> - Much fewer header files changes
> - Convert zero check capability to QMP command instead
> - Updated documentation
>
> Wiki: http://wiki.qemu.org/Features/RDMALiveMigration
> Github: git@github.com:hinesmr/qemu.git
>
>

^ permalink raw reply	[flat|nested] 52+ messages in thread

end of thread, other threads:[~2013-04-11 17:54 UTC | newest]

Thread overview: 52+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-04-10 22:28 [Qemu-devel] [RFC PATCH RDMA support v7: 00/13] rdma cleanup and reordering mrhines
2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 01/13] introduce qemu_ram_foreach_block() mrhines
2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 02/13] Core RMDA logic mrhines
2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 03/13] RDMA is enabled by default per the usual ./configure testing mrhines
2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 04/13] update QEMUFileOps with new hooks mrhines
2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 05/13] accessor function prototypes for new QEMUFileOps hooks mrhines
2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 06/13] implementation of " mrhines
2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 07/13] introduce capability for dynamic chunk registration mrhines
2013-04-11  2:24   ` Eric Blake
2013-04-11  2:39     ` Michael R. Hines
2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 08/13] default chunk registration to true mrhines
2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 09/13] parse QMP string for new 'rdma' protocol mrhines
2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 10/13] introduce new command migrate_check_for_zero mrhines
2013-04-11  2:26   ` Eric Blake
2013-04-11  2:39     ` Michael R. Hines
2013-04-11  7:52       ` Orit Wasserman
2013-04-11 12:30         ` Eric Blake
2013-04-11 12:36           ` Orit Wasserman
2013-04-11 17:53             ` Michael R. Hines
2013-04-11  3:11     ` Michael R. Hines
2013-04-11  7:38   ` Michael S. Tsirkin
2013-04-11  9:18     ` Paolo Bonzini
2013-04-11 11:13       ` Michael S. Tsirkin
2013-04-11 13:19         ` Michael R. Hines
2013-04-11 13:51           ` Michael S. Tsirkin
2013-04-11 14:06             ` Michael R. Hines
2013-04-11 14:17               ` Paolo Bonzini
2013-04-11 14:35                 ` Michael R. Hines
2013-04-11 14:45                   ` Paolo Bonzini
2013-04-11 15:37                     ` Michael R. Hines
2013-04-11 13:24       ` Michael R. Hines
2013-04-11 14:15         ` Paolo Bonzini
2013-04-11 14:45           ` Michael S. Tsirkin
2013-04-11 14:57           ` Michael R. Hines
2013-04-11 15:01             ` Michael S. Tsirkin
2013-04-11 15:08             ` Paolo Bonzini
2013-04-11 15:35               ` Michael R. Hines
2013-04-11 15:45                 ` Paolo Bonzini
2013-04-11 16:02                   ` Michael R. Hines
2013-04-11 16:12                     ` Paolo Bonzini
2013-04-11 16:07                   ` Eric Blake
2013-04-11 16:29                     ` Michael R. Hines
2013-04-11 16:36                       ` Eric Blake
2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 11/13] send pc.ram over RDMA mrhines
2013-04-11  6:26   ` Paolo Bonzini
2013-04-11 12:41     ` Michael R. Hines
2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 12/13] updated protocol documentation mrhines
2013-04-11  2:43   ` Eric Blake
2013-04-11  2:47     ` Michael R. Hines
2013-04-11  6:29   ` Paolo Bonzini
2013-04-10 22:28 ` [Qemu-devel] [RFC PATCH RDMA support v1: 13/13] print out migration throughput while debugging mrhines
2013-04-10 22:32 ` [Qemu-devel] [RFC PATCH RDMA support v7: 00/13] rdma cleanup and reordering Michael R. Hines

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).