* Re: [Qemu-devel] [RFC PATCH RDMA support v3: 03/10] documentation of RDMA protocol in docs/rdma.txt
[not found] ` <1362976414-21396-4-git-send-email-mrhines@us.ibm.com>
@ 2013-03-11 11:51 ` Michael S. Tsirkin
2013-03-11 16:24 ` Michael R. Hines
0 siblings, 1 reply; 22+ messages in thread
From: Michael S. Tsirkin @ 2013-03-11 11:51 UTC (permalink / raw)
To: Michael.R.Hines.mrhines
Cc: aliguori, qemu-devel, owasserm, abali, mrhines, gokul, pbonzini
On Mon, Mar 11, 2013 at 12:33:27AM -0400, Michael.R.Hines.mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
>
>
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
> docs/rdma.txt | 93 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 93 insertions(+)
> create mode 100644 docs/rdma.txt
>
> diff --git a/docs/rdma.txt b/docs/rdma.txt
> new file mode 100644
> index 0000000..a38ce1c
> --- /dev/null
> +++ b/docs/rdma.txt
> @@ -0,0 +1,93 @@
> +Changes since v2:
> +
> +- TCP channel has been eliminated. All device state uses SEND messages
> +- type 'migrate rdma:host:port' to start the migration
> +- QEMUFileRDMA is introduced
> +- librdmacm calls distinguished from qemu RDMA calls
> +- lots of code cleanup
> +
> +RDMA-based live migration protocol
> +==================================
> +
> +We use two kinds of RDMA messages:
> +
> +1. RDMA WRITES (to the receiver)
> +2. RDMA SEND (for non-live state, like devices and CPU)
Something's missing here.
Don't you need to know remote addresses before doing RDMA writes?
> +
> +First, migration-rdma.c does the initial connection establishment
> +using the URI 'rdma:host:port' on the QMP command line.
> +
> +Second, the normal live migration process kicks in for 'pc.ram'.
> +
> +During iterative phase of the migration, only RDMA WRITE messages
> +are used. Messages are grouped into "chunks" which get pinned by
> +the hardware in 64-page increments. Each chunk is acknowledged in
> +the Queue Pair's completion queue (not the individual pages).
> +
> +During iteration of RAM, there are no messages sent, just RDMA writes.
> +During the last iteration, once the device and CPU state is ready to be
> +sent, we begin to use the RDMA SEND messages.
It's unclear whether you are switching modes here. If so,
assuming CPU/device state is only sent during the last
iteration would break post-migration, so it is probably
not a good choice for a protocol.
> +Due to the asynchronous nature of RDMA, the receiver of the migration
> +must post Receive work requests in the queue *before* a SEND work request
> +can be posted.
> +
> +To achieve this, both sides perform an initial 'barrier' synchronization.
> +Before the barrier, we already know that both sides have a receive work
> +request posted,
How?
> and then both sides exchange and block on the completion
> +queue waiting for each other to know the other peer is alive and ready
> +to send the rest of the live migration state (qemu_send/recv_barrier()).
How much?
> +At this point, the use of QEMUFile between both sides for communication
> +proceeds as normal.
> +The difference between TCP and SEND comes in migration-rdma.c: Since
> +we cannot simply dump the bytes into a socket, instead a SEND message
> +must be preceded by one side instructing the other side *exactly* how
> +many bytes the SEND message will contain.
instructing how? Presumably you use some protocol for this?
> +Each time a SEND is received, the receiver buffers the message and
> +divvies out the bytes from the SEND to the qemu_loadvm_state() function
> +until all the bytes from the buffered SEND message have been exhausted.
> +
> +Before the SEND is exhausted, the receiver sends an 'ack' SEND back
> +to the sender to let the savevm_state_* functions know that they
> +can resume and start generating more SEND messages.
The above two paragraphs seem very opaque to me.
What's an 'ack' SEND, and how do you know when a SEND
is exhausted?
> +This ping-pong of SEND messages
BTW, if by ping-pong you mean something like this:
source "I have X bytes"
destination "ok send me X bytes"
source sends X bytes
then you could put the address in the destination response and
use RDMA for sending X bytes.
It's up to you but it might simplify the protocol as
the only thing you send would be buffer management messages.
> happens until the live migration completes.
Any way to tear down the connection in case of errors?
> +
> +USAGE
> +===============================
> +
> +Compiling:
> +
> +$ ./configure --enable-rdma --target-list=x86_64-softmmu
> +
> +$ make
> +
> +Command-line on the Source machine AND Destination:
> +
> +$ virsh qemu-monitor-command --hmp --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device
> +
> +Finally, perform the actual migration:
> +
> +$ virsh migrate domain rdma:xx.xx.xx.xx:port
> +
> +PERFORMANCE
> +===================
> +
> +Using a 40 Gbps InfiniBand link, performing a worst-case stress test:
> +
> +1. Average worst-case RDMA throughput, with $ stress --vm-bytes 1024M --vm 1 --vm-keep:
> +   approximately 30 gbps (slightly better than the paper)
> +2. Average worst-case TCP throughput, with the same stress command:
> +   approximately 8 gbps (using IPoIB, IP over InfiniBand)
> +
> +Average downtime (stop time) ranges between 28 and 33 milliseconds.
> +
> +An *exhaustive* paper (2010) with additional performance details is
> +linked on the QEMU wiki:
> +
> +http://wiki.qemu.org/Features/RDMALiveMigration
> --
> 1.7.10.4
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] [RFC PATCH RDMA support v3: 02/10] Link in new migration-rdma.c and rmda.c files
[not found] ` <1362976414-21396-3-git-send-email-mrhines@us.ibm.com>
@ 2013-03-11 13:35 ` Paolo Bonzini
2013-03-11 16:25 ` Michael R. Hines
0 siblings, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2013-03-11 13:35 UTC (permalink / raw)
To: Michael.R.Hines.mrhines
Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul
On 11/03/2013 05:33, Michael.R.Hines.mrhines@linux.vnet.ibm.com wrote:
> common-obj-y += migration.o migration-tcp.o
> +common-obj-$(CONFIG_RDMA) += migration-rdma.o
> common-obj-y += qemu-char.o #aio.o
> common-obj-y += block-migration.o
> -common-obj-y += page_cache.o xbzrle.o
> +common-obj-y += page_cache.o xbzrle.o rdma.o
Why is rdma.o not conditionalized by $(CONFIG_RDMA)?
Paolo
* Re: [Qemu-devel] [RFC PATCH RDMA support v3: 08/10] Introduce QEMUFileRDMA
[not found] ` <1362976414-21396-9-git-send-email-mrhines@us.ibm.com>
@ 2013-03-11 13:40 ` Paolo Bonzini
2013-03-11 16:26 ` Michael R. Hines
2013-03-11 16:26 ` Michael R. Hines
0 siblings, 2 replies; 22+ messages in thread
From: Paolo Bonzini @ 2013-03-11 13:40 UTC (permalink / raw)
To: Michael.R.Hines.mrhines
Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul
On 11/03/2013 05:33, Michael.R.Hines.mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
>
> This is the loadvm() side of the connection which is RDMA-aware,
> so that transfer of device state can use the same abstractions
> as all of the other migration protocols.
>
> Full documentation of the protocol is in docs/rdma.txt
>
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
> include/migration/qemu-file.h | 17 +++++
> savevm.c | 165 ++++++++++++++++++++++++++++++++++++++++-
> 2 files changed, 180 insertions(+), 2 deletions(-)
Please move this to rdma.c, otherwise it won't compile without librdmacm.
Paolo
* Re: [Qemu-devel] [RFC PATCH RDMA support v3: 05/10] RDMA connection establishment (migration-rdma.c).
[not found] ` <1362976414-21396-6-git-send-email-mrhines@us.ibm.com>
@ 2013-03-11 13:41 ` Paolo Bonzini
2013-03-11 16:28 ` Michael R. Hines
2013-03-11 20:20 ` Michael R. Hines
0 siblings, 2 replies; 22+ messages in thread
From: Paolo Bonzini @ 2013-03-11 13:41 UTC (permalink / raw)
To: Michael.R.Hines.mrhines
Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul
On 11/03/2013 05:33, Michael.R.Hines.mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
>
> Use 'migrate rdma:host:port' to begin the migration.
>
> The TCP control channel has finally been eliminated
> when RDMA is used. Documentation of the use of SEND message
> for transferring device state is covered in docs/rdma.txt
>
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
> migration-rdma.c | 198 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 file changed, 198 insertions(+)
> create mode 100644 migration-rdma.c
>
> diff --git a/migration-rdma.c b/migration-rdma.c
> new file mode 100644
> index 0000000..822b17a
> --- /dev/null
> +++ b/migration-rdma.c
> @@ -0,0 +1,198 @@
> +/*
> + * Copyright (C) 2013 Michael R. Hines <mrhines@us.ibm.com>
> + * Copyright (C) 2013 Jiuxing Liu <jl@us.ibm.com>
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; under version 2 of the License.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, see <http://www.gnu.org/licenses/>.
> + */
> +#include "migration/rdma.h"
> +#include "qemu-common.h"
> +#include "migration/migration.h"
> +#include "migration/qemu-file.h"
> +#include <stdio.h>
> +#include <sys/types.h>
> +#include <sys/socket.h>
> +#include <netdb.h>
> +#include <arpa/inet.h>
> +#include <string.h>
> +
> +//#define DEBUG_MIGRATION_RDMA
> +
> +#ifdef DEBUG_MIGRATION_RDMA
> +#define DPRINTF(fmt, ...) \
> + do { printf("migration-rdma: " fmt, ## __VA_ARGS__); } while (0)
> +#else
> +#define DPRINTF(fmt, ...) \
> + do { } while (0)
> +#endif
> +
> +static int rdma_accept_incoming_migration(RDMAData *rdma, Error **errp)
> +{
> + int ret;
> +
> + ret = qemu_rdma_migrate_listen(rdma, rdma->host, rdma->port);
> + if (ret) {
> + qemu_rdma_print("rdma migration: error listening!");
> + goto err_rdma_server_wait;
> + }
> +
> + ret = qemu_rdma_alloc_qp(&rdma->rdma_ctx);
> + if (ret) {
> + qemu_rdma_print("rdma migration: error allocating qp!");
> + goto err_rdma_server_wait;
> + }
> +
> + ret = qemu_rdma_migrate_accept(&rdma->rdma_ctx, NULL, NULL, NULL, 0);
> + if (ret) {
> + qemu_rdma_print("rdma migration: error accepting connection!");
> + goto err_rdma_server_wait;
> + }
> +
> + ret = qemu_rdma_post_recv_qemu_file(rdma);
> + if (ret) {
> + qemu_rdma_print("rdma migration: error posting second qemu file recv!");
> + goto err_rdma_server_wait;
> + }
> +
> + ret = qemu_rdma_post_send_remote_info(rdma);
> + if (ret) {
> + qemu_rdma_print("rdma migration: error sending remote info!");
> + goto err_rdma_server_wait;
> + }
> +
> + ret = qemu_rdma_wait_for_wrid(rdma, RDMA_WRID_SEND_REMOTE_INFO);
> + if (ret < 0) {
> + qemu_rdma_print("rdma migration: polling remote info error!");
> + goto err_rdma_server_wait;
> + }
> +
> + rdma->total_bytes = 0;
> + rdma->enabled = 1;
> + qemu_rdma_dump_gid("server_connect", rdma->rdma_ctx.cm_id);
> + return 0;
> +
> +err_rdma_server_wait:
> + qemu_rdma_cleanup(rdma);
> + return -1;
> +
> +}
> +
> +int rdma_start_incoming_migration(const char * host_port, Error **errp)
> +{
> + QEMUFile *f;
> + int ret;
> + RDMAData rdma;
> +
> + if ((ret = qemu_rdma_data_init(&rdma, host_port, errp)) < 0)
> + return ret;
> +
> + ret = qemu_rdma_server_init(&rdma, NULL);
> +
> + DPRINTF("Starting RDMA-based incoming migration\n");
> +
> + if (!ret) {
> + DPRINTF("qemu_rdma_server_init success\n");
> + ret = qemu_rdma_server_prepare(&rdma, NULL);
> +
> + if (!ret) {
> + DPRINTF("qemu_rdma_server_prepare success\n");
> +
> + ret = rdma_accept_incoming_migration(&rdma, NULL);
> + if(!ret)
> + DPRINTF("qemu_rdma_accept_incoming_migration success\n");
> + f = qemu_fopen_rdma(&rdma);
> + if (f == NULL) {
> + fprintf(stderr, "could not qemu_fopen RDMA\n");
> + ret = -EIO;
> + }
> +
> + process_incoming_migration(f);
> + }
> + }
> +
> + return ret;
> +}
> +
> +/*
> + * Not sure what this is for yet...
> + */
> +static int qemu_rdma_errno(MigrationState *s)
> +{
> + return 0;
> +}
> +
> +/*
> + * SEND messages for device state only.
> + * pc.ram is handled elsewhere...
> + */
> +static int qemu_rdma_send(MigrationState *s, const void * buf, size_t size)
> +{
> + size_t len, remaining = size;
> + uint8_t * data = (void *) buf;
> +
> + if (qemu_rdma_write_flush(&s->rdma) < 0) {
> + qemu_file_set_error(s->file, -EIO);
> + return -EIO;
> + }
> +
> + while(remaining) {
> + len = MIN(remaining, RDMA_SEND_INCREMENT);
> + remaining -= len;
> + DPRINTF("iter Sending %" PRId64 "-byte SEND\n", len);
> +
> + if(qemu_rdma_exchange_send(&s->rdma, data, len) < 0)
> + return -EINVAL;
> +
> + data += len;
> + }
> +
> + return size;
> +}
> +
> +static int qemu_rdma_close(MigrationState *s)
> +{
> + DPRINTF("qemu_rdma_close\n");
> + qemu_rdma_cleanup(&s->rdma);
> + return 0;
> +}
> +
> +void rdma_start_outgoing_migration(MigrationState *s, const char *host_port, Error **errp)
> +{
> + int ret;
> +
> +#ifndef CONFIG_RDMA
> + error_set(errp, QERR_FEATURE_DISABLED, "rdma migration");
> + return;
> +#endif
> +
> + if (qemu_rdma_data_init(&s->rdma, host_port, errp) < 0)
> + return;
> +
> + ret = qemu_rdma_client_init(&s->rdma, NULL);
> + if(!ret) {
> + DPRINTF("qemu_rdma_client_init success\n");
> + ret = qemu_rdma_client_connect(&s->rdma, NULL);
> +
> + if(!ret) {
> + s->get_error = qemu_rdma_errno;
> + s->write = qemu_rdma_send;
> + s->close = qemu_rdma_close;
> + s->fd = -2;
> + DPRINTF("qemu_rdma_client_connect success\n");
> + migrate_fd_connect(s);
> + return;
> + }
> + }
> +
> + s->fd = -1;
> + migrate_fd_error(s);
> +}
>
Note that as soon as the pending migration patches hit upstream, almost
all of this code will move to QEMUFileRDMA (in particular qemu_rdma_send
and qemu_rdma_close). It should become simpler.
Paolo
* Re: [Qemu-devel] [RFC PATCH RDMA support v3: 06/10] Introduce 'max_iterations' and Call out to migration-rdma.c when requested
[not found] ` <1362976414-21396-7-git-send-email-mrhines@us.ibm.com>
@ 2013-03-11 13:49 ` Paolo Bonzini
2013-03-11 16:30 ` Michael R. Hines
0 siblings, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2013-03-11 13:49 UTC (permalink / raw)
To: Michael.R.Hines.mrhines
Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul
On 11/03/2013 05:33, Michael.R.Hines.mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
>
> Very little changes here except for halting the migration after a maximum
> number of iterations is reached.
>
> When comparing against TCP, the migration never ends if we don't cap
> the migrations somehow... just an idea for now.
This makes sense, but please: a) make it a separate patch; b) add QMP
commands for it; c) make it disabled by default.
There are two uses of migrate_use_rdma(). One is to disable the search
for zero pages. Perhaps we can do that automatically based on the
current computed bandwidth? At some point, it costs less to just send
the data down the wire.
The other is to use the RDMA-specific primitive to send pages. I hope
that Orit's work will make that unnecessary; in the meanwhile, however,
the latter is okay.
The "verbose logging" should be yet another patch. Many of the messages
you touched are gone in the most recent version of the code. I suspect
that, for the others, it's better to use tracepoints (see the
trace-events file) instead.
Paolo
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
> include/migration/migration.h | 10 ++++++++
> migration.c | 56 +++++++++++++++++++++++++++++++++++------
> 2 files changed, 58 insertions(+), 8 deletions(-)
>
> diff --git a/include/migration/migration.h b/include/migration/migration.h
> index d121409..796cf3d 100644
> --- a/include/migration/migration.h
> +++ b/include/migration/migration.h
> @@ -20,6 +20,7 @@
> #include "qemu/notify.h"
> #include "qapi/error.h"
> #include "migration/vmstate.h"
> +#include "migration/rdma.h"
> #include "qapi-types.h"
>
> struct MigrationParams {
> @@ -55,6 +56,9 @@ struct MigrationState
> bool enabled_capabilities[MIGRATION_CAPABILITY_MAX];
> int64_t xbzrle_cache_size;
> bool complete;
> +
> + RDMAData rdma;
> + double mbps;
> };
>
> void process_incoming_migration(QEMUFile *f);
> @@ -75,6 +79,10 @@ void tcp_start_incoming_migration(const char *host_port, Error **errp);
>
> void tcp_start_outgoing_migration(MigrationState *s, const char *host_port, Error **errp);
>
> +void rdma_start_outgoing_migration(MigrationState *s, const char *host_port, Error **errp);
> +
> +int rdma_start_incoming_migration(const char * host_port, Error **errp);
> +
> void unix_start_incoming_migration(const char *path, Error **errp);
>
> void unix_start_outgoing_migration(MigrationState *s, const char *path, Error **errp);
> @@ -106,6 +114,7 @@ uint64_t dup_mig_bytes_transferred(void);
> uint64_t dup_mig_pages_transferred(void);
> uint64_t norm_mig_bytes_transferred(void);
> uint64_t norm_mig_pages_transferred(void);
> +uint64_t delta_norm_mig_bytes_transferred(void);
> uint64_t xbzrle_mig_bytes_transferred(void);
> uint64_t xbzrle_mig_pages_transferred(void);
> uint64_t xbzrle_mig_pages_overflow(void);
> @@ -130,6 +139,7 @@ int xbzrle_encode_buffer(uint8_t *old_buf, uint8_t *new_buf, int slen,
> int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t *dst, int dlen);
>
> int migrate_use_xbzrle(void);
> +int migrate_use_rdma(void);
> int64_t migrate_xbzrle_cache_size(void);
>
> int64_t xbzrle_cache_resize(int64_t new_size);
> diff --git a/migration.c b/migration.c
> index 11725ae..aae2f66 100644
> --- a/migration.c
> +++ b/migration.c
> @@ -25,6 +25,7 @@
> #include "qmp-commands.h"
>
> //#define DEBUG_MIGRATION
> +//#define DEBUG_MIGRATION_VERBOSE
>
> #ifdef DEBUG_MIGRATION
> #define DPRINTF(fmt, ...) \
> @@ -34,6 +35,14 @@
> do { } while (0)
> #endif
>
> +#ifdef DEBUG_MIGRATION_VERBOSE
> +#define DDPRINTF(fmt, ...) \
> + do { printf("migration: " fmt, ## __VA_ARGS__); } while (0)
> +#else
> +#define DDPRINTF(fmt, ...) \
> + do { } while (0)
> +#endif
> +
> enum {
> MIG_STATE_ERROR,
> MIG_STATE_SETUP,
> @@ -76,6 +85,8 @@ void qemu_start_incoming_migration(const char *uri, Error **errp)
>
> if (strstart(uri, "tcp:", &p))
> tcp_start_incoming_migration(p, errp);
> + else if (strstart(uri, "rdma:", &p))
> + rdma_start_incoming_migration(p, errp);
> #if !defined(WIN32)
> else if (strstart(uri, "exec:", &p))
> exec_start_incoming_migration(p, errp);
> @@ -130,6 +141,14 @@ void process_incoming_migration(QEMUFile *f)
> * units must be in seconds */
> static uint64_t max_downtime = 30000000;
>
> +/*
> + * RFC: We probably need a QMP setting for this, but the point
> + * of it is that it's hard to compare RDMA workloads
> + * vs. TCP workloads because the TCP migrations never
> + * complete without some kind of iteration cap.
> + */
> +static int max_iterations = 30;
> +
> uint64_t migrate_max_downtime(void)
> {
> return max_downtime;
> @@ -429,6 +448,8 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
>
> if (strstart(uri, "tcp:", &p)) {
> tcp_start_outgoing_migration(s, p, &local_err);
> + } else if (strstart(uri, "rdma:", &p)) {
> + rdma_start_outgoing_migration(s, p, &local_err);
> #if !defined(WIN32)
> } else if (strstart(uri, "exec:", &p)) {
> exec_start_outgoing_migration(s, p, &local_err);
> @@ -502,6 +523,16 @@ int migrate_use_xbzrle(void)
> return s->enabled_capabilities[MIGRATION_CAPABILITY_XBZRLE];
> }
>
> +/*
> + * Don't think we need a 'capability' here
> + * because 'rdma:host:port' must be specified
> + * on the QMP command line...
> + */
> +int migrate_use_rdma(void)
> +{
> + return migrate_get_current()->rdma.enabled;
> +}
> +
> int64_t migrate_xbzrle_cache_size(void)
> {
> MigrationState *s;
> @@ -550,7 +581,7 @@ static int buffered_put_buffer(void *opaque, const uint8_t *buf,
> MigrationState *s = opaque;
> ssize_t error;
>
> - DPRINTF("putting %d bytes at %" PRId64 "\n", size, pos);
> + DDPRINTF("putting %d bytes at %" PRId64 "\n", size, pos);
>
> error = qemu_file_get_error(s->file);
> if (error) {
> @@ -563,7 +594,7 @@ static int buffered_put_buffer(void *opaque, const uint8_t *buf,
> }
>
> if (size > (s->buffer_capacity - s->buffer_size)) {
> - DPRINTF("increasing buffer capacity from %zu by %zu\n",
> + DDPRINTF("increasing buffer capacity from %zu by %d\n",
> s->buffer_capacity, size + 1024);
>
> s->buffer_capacity += size + 1024;
> @@ -661,7 +692,7 @@ static void *buffered_file_thread(void *opaque)
> int64_t sleep_time = 0;
> int64_t max_size = 0;
> bool last_round = false;
> - int ret;
> + int ret, iterations = 0;
>
> qemu_mutex_lock_iothread();
> DPRINTF("beginning savevm\n");
> @@ -687,11 +718,15 @@ static void *buffered_file_thread(void *opaque)
> qemu_mutex_unlock_iothread();
> break;
> }
> +
> + iterations++;
> +
> if (s->bytes_xfer < s->xfer_limit) {
> - DPRINTF("iterate\n");
> + DPRINTF("iterate %d max %d\n", iterations, max_iterations);
> pending_size = qemu_savevm_state_pending(s->file, max_size);
> DPRINTF("pending size %lu max %lu\n", pending_size, max_size);
> - if (pending_size && pending_size >= max_size) {
> + if (pending_size && pending_size >= max_size &&
> + (max_iterations == -1 || iterations < max_iterations)) {
> ret = qemu_savevm_state_iterate(s->file);
> if (ret < 0) {
> qemu_mutex_unlock_iothread();
> @@ -730,14 +765,18 @@ static void *buffered_file_thread(void *opaque)
> qemu_mutex_unlock_iothread();
> current_time = qemu_get_clock_ms(rt_clock);
> if (current_time >= initial_time + BUFFER_DELAY) {
> - uint64_t transferred_bytes = s->bytes_xfer;
> + uint64_t transferred_bytes = migrate_use_rdma() ?
> + delta_norm_mig_bytes_transferred() : s->bytes_xfer;
> uint64_t time_spent = current_time - initial_time - sleep_time;
> double bandwidth = transferred_bytes / time_spent;
> max_size = bandwidth * migrate_max_downtime() / 1000000;
> + s->mbps = ((double) transferred_bytes * 8.0 /
> + ((double) time_spent / 1000.0)) / 1000.0 / 1000.0;
>
> DPRINTF("transferred %" PRIu64 " time_spent %" PRIu64
> - " bandwidth %g max_size %" PRId64 "\n",
> - transferred_bytes, time_spent, bandwidth, max_size);
> + " bandwidth %g (%0.2f mbps) max_size %" PRId64 "\n",
> + transferred_bytes, time_spent,
> + bandwidth, s->mbps, max_size);
> /* if we haven't sent anything, we don't want to recalculate
> 10000 is a small enough number for our purposes */
> if (s->dirty_bytes_rate && transferred_bytes > 10000) {
> @@ -774,6 +813,7 @@ static const QEMUFileOps buffered_file_ops = {
> .rate_limit = buffered_rate_limit,
> .get_rate_limit = buffered_get_rate_limit,
> .set_rate_limit = buffered_set_rate_limit,
> + .send_barrier = qemu_rdma_send_barrier,
> };
>
> void migrate_fd_connect(MigrationState *s)
>
* Re: [Qemu-devel] [RFC PATCH RDMA support v3: 07/10] Send the actual pages over RDMA.
[not found] ` <1362976414-21396-8-git-send-email-mrhines@us.ibm.com>
@ 2013-03-11 13:59 ` Paolo Bonzini
2013-03-11 16:31 ` Michael R. Hines
0 siblings, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2013-03-11 13:59 UTC (permalink / raw)
To: Michael.R.Hines.mrhines
Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul
On 11/03/2013 05:33, Michael.R.Hines.mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
>
> For performance reasons, dup_page() and xbzrle() is skipped because
> they are too expensive for zero-copy RDMA.
>
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
> arch_init.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 56 insertions(+), 1 deletion(-)
>
> diff --git a/arch_init.c b/arch_init.c
> index 8daeafa..437cb47 100644
> --- a/arch_init.c
> +++ b/arch_init.c
> @@ -45,6 +45,7 @@
> #include "exec/address-spaces.h"
> #include "hw/pcspk.h"
> #include "migration/page_cache.h"
> +#include "migration/rdma.h"
> #include "qemu/config-file.h"
> #include "qmp-commands.h"
> #include "trace.h"
> @@ -245,6 +246,18 @@ uint64_t norm_mig_pages_transferred(void)
> return acct_info.norm_pages;
> }
>
> +/*
> + * RDMA does not use the buffered_file,
> + * but we still need a way to do accounting...
> + */
> +uint64_t delta_norm_mig_bytes_transferred(void)
> +{
> + static uint64_t last_norm_pages = 0;
> + uint64_t delta_bytes = (acct_info.norm_pages - last_norm_pages) * TARGET_PAGE_SIZE;
> + last_norm_pages = acct_info.norm_pages;
> + return delta_bytes;
> +}
> +
> uint64_t xbzrle_mig_bytes_transferred(void)
> {
> return acct_info.xbzrle_bytes;
> @@ -282,6 +295,45 @@ static size_t save_block_hdr(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
> return size;
> }
>
> +static size_t save_rdma_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
> + int cont)
> +{
> + int ret;
> + size_t bytes_sent = 0;
> + ram_addr_t current_addr;
> + RDMAData * rdma = &migrate_get_current()->rdma;
> +
> + acct_info.norm_pages++;
> +
> + /*
> + * use RDMA to send page
> + */
Not quite true, the page is added to the current chunk. Please make the
comments a quick-and-dirty reference of the protocol, or leave them out
altogether.
> + current_addr = block->offset + offset;
> + if ((ret = qemu_rdma_write(rdma, current_addr, TARGET_PAGE_SIZE)) < 0) {
> + fprintf(stderr, "rdma migration: write error! %d\n", ret);
> + qemu_file_set_error(f, ret);
> + return ret;
> + }
> +
> + /*
> + * do some polling
> + */
Again, that's quite self-evident. Poll for what though? :)
> + while (1) {
> + int ret = qemu_rdma_poll(rdma);
> + if (ret == RDMA_WRID_NONE) {
> + break;
> + }
> + if (ret < 0) {
> + fprintf(stderr, "rdma migration: polling error! %d\n", ret);
> + qemu_file_set_error(f, ret);
> + return ret;
> + }
> + }
> +
> + bytes_sent += TARGET_PAGE_SIZE;
> + return bytes_sent;
> +}
As written in the other message, I think this should be an additional
QEMUFile operation, hopefully the same that Orit is introducing in her
patches.
> #define ENCODING_FLAG_XBZRLE 0x1
>
> static int save_xbzrle_page(QEMUFile *f, uint8_t *current_data,
> @@ -462,7 +514,10 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
>
> /* In doubt sent page as normal */
> bytes_sent = -1;
> - if (is_dup_page(p)) {
> + if (migrate_use_rdma()) {
> + /* searching for zeros is still too expensive for RDMA */
> + bytes_sent = save_rdma_page(f, block, offset, cont);
Again as written in the other message, this is not really an RDMA thing,
it's mostly the effect of a fast link. Of course to some extent it
depends on the CPU and RAM speed, but we can fake that it isn't.
> + } else if (is_dup_page(p)) {
> acct_info.dup_pages++;
> bytes_sent = save_block_hdr(f, block, offset, cont,
> RAM_SAVE_FLAG_COMPRESS);
>
Thanks,
Paolo
* Re: [Qemu-devel] [RFC PATCH RDMA support v3: 10/10] Parse RDMA host/port out of the QMP string.
[not found] ` <1362976414-21396-11-git-send-email-mrhines@us.ibm.com>
@ 2013-03-11 14:00 ` Paolo Bonzini
2013-03-11 16:32 ` Michael R. Hines
0 siblings, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2013-03-11 14:00 UTC (permalink / raw)
To: Michael.R.Hines.mrhines
Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul
On 11/03/2013 05:33, Michael.R.Hines.mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
>
> ... want to use existing functions to do this.
Remember that each of the 10 steps should compile and link, both with
and without librdmacm installed. So this patch should come much
earlier in the series.
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
> include/qemu/sockets.h | 1 +
> util/oslib-posix.c | 4 ++++
> util/qemu-sockets.c | 2 +-
> 3 files changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/include/qemu/sockets.h b/include/qemu/sockets.h
> index 6125bf7..86fe4da 100644
> --- a/include/qemu/sockets.h
> +++ b/include/qemu/sockets.h
> @@ -47,6 +47,7 @@ typedef void NonBlockingConnectHandler(int fd, void *opaque);
> int inet_listen_opts(QemuOpts *opts, int port_offset, Error **errp);
> int inet_listen(const char *str, char *ostr, int olen,
> int socktype, int port_offset, Error **errp);
> +InetSocketAddress *inet_parse(const char *str, Error **errp);
> int inet_connect_opts(QemuOpts *opts, Error **errp,
> NonBlockingConnectHandler *callback, void *opaque);
> int inet_connect(const char *str, Error **errp);
> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
> index 433dd68..dc369ae 100644
> --- a/util/oslib-posix.c
> +++ b/util/oslib-posix.c
> @@ -137,6 +137,8 @@ void qemu_vfree(void *ptr)
> void socket_set_block(int fd)
> {
> int f;
> + if(fd < 0)
> + return;
What is this needed for? Make it a separate patch and describe the
rationale in the commit message.
> f = fcntl(fd, F_GETFL);
> fcntl(fd, F_SETFL, f & ~O_NONBLOCK);
> }
> @@ -144,6 +146,8 @@ void socket_set_block(int fd)
> void socket_set_nonblock(int fd)
> {
> int f;
> + if(fd < 0)
> + return;
> f = fcntl(fd, F_GETFL);
> fcntl(fd, F_SETFL, f | O_NONBLOCK);
> }
> diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
> index 3f12296..58e4bcd 100644
> --- a/util/qemu-sockets.c
> +++ b/util/qemu-sockets.c
> @@ -485,7 +485,7 @@ err:
> }
>
> /* compatibility wrapper */
> -static InetSocketAddress *inet_parse(const char *str, Error **errp)
> +InetSocketAddress *inet_parse(const char *str, Error **errp)
> {
> InetSocketAddress *addr;
> const char *optstr, *h;
>
Paolo
* Re: [Qemu-devel] [RFC PATCH RDMA support v3: 09/10] Move RAMBlock to cpu-common.h
[not found] ` <1362976414-21396-10-git-send-email-mrhines@us.ibm.com>
@ 2013-03-11 14:07 ` Paolo Bonzini
2013-03-11 16:34 ` Michael R. Hines
0 siblings, 1 reply; 22+ messages in thread
From: Paolo Bonzini @ 2013-03-11 14:07 UTC (permalink / raw)
To: Michael.R.Hines.mrhines
Cc: aliguori, mst, qemu-devel, owasserm, abali, mrhines, gokul
On 11/03/2013 05:33, Michael.R.Hines.mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
>
> RDMA needs access to this structure by including cpu-common.h
> (Including cpu-all.h causes things to throw up on me).
>
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
> include/exec/cpu-all.h | 27 ---------------------------
> include/exec/cpu-common.h | 29 +++++++++++++++++++++++++++++
> 2 files changed, 29 insertions(+), 27 deletions(-)
>
> diff --git a/include/exec/cpu-all.h b/include/exec/cpu-all.h
> index 249e046..02a2808 100644
> --- a/include/exec/cpu-all.h
> +++ b/include/exec/cpu-all.h
> @@ -480,33 +480,6 @@ extern ram_addr_t ram_size;
> /* RAM is pre-allocated and passed into qemu_ram_alloc_from_ptr */
> #define RAM_PREALLOC_MASK (1 << 0)
>
> -typedef struct RAMBlock {
> - struct MemoryRegion *mr;
> - uint8_t *host;
> - ram_addr_t offset;
> - ram_addr_t length;
> - uint32_t flags;
> - char idstr[256];
> - /* Reads can take either the iothread or the ramlist lock.
> - * Writes must take both locks.
> - */
> - QTAILQ_ENTRY(RAMBlock) next;
> -#if defined(__linux__) && !defined(TARGET_S390X)
> - int fd;
> -#endif
> -} RAMBlock;
> -
> -typedef struct RAMList {
> - QemuMutex mutex;
> - /* Protected by the iothread lock. */
> - uint8_t *phys_dirty;
> - RAMBlock *mru_block;
> - /* Protected by the ramlist lock. */
> - QTAILQ_HEAD(, RAMBlock) blocks;
> - uint32_t version;
> -} RAMList;
> -extern RAMList ram_list;
> -
> extern const char *mem_path;
> extern int mem_prealloc;
>
> diff --git a/include/exec/cpu-common.h b/include/exec/cpu-common.h
> index 2e5f11f..763cef3 100644
> --- a/include/exec/cpu-common.h
> +++ b/include/exec/cpu-common.h
> @@ -11,6 +11,7 @@
>
> #include "qemu/bswap.h"
> #include "qemu/queue.h"
> +#include "qemu/thread.h"
>
> /**
> * CPUListState:
> @@ -121,4 +122,32 @@ extern struct MemoryRegion io_mem_notdirty;
>
> #endif
>
> +typedef struct RAMBlock {
> + struct MemoryRegion *mr;
> + uint8_t *host;
> + ram_addr_t offset;
> + ram_addr_t length;
> + uint32_t flags;
> + char idstr[256];
> + /* Reads can take either the iothread or the ramlist lock.
> + * Writes must take both locks.
> + */
> + QTAILQ_ENTRY(RAMBlock) next;
> +#if defined(__linux__) && !defined(TARGET_S390X)
> + int fd;
> +#endif
> +} RAMBlock;
> +
> +typedef struct RAMList {
> + QemuMutex mutex;
> + /* Protected by the iothread lock. */
> + uint8_t *phys_dirty;
> + RAMBlock *mru_block;
> + /* Protected by the ramlist lock. */
> + QTAILQ_HEAD(, RAMBlock) blocks;
> + uint32_t version;
> +} RAMList;
> +
> +extern RAMList ram_list;
> +
> #endif /* !CPU_COMMON_H */
>
Only used in qemu_rdma_init_ram_blocks. Can you add instead an API like
qemu_ram_foreach_block(
void (*fn)(void *host_addr, ram_addr_t offset, ram_addr_t length,
void *opaque),
void *opaque)
?
BTW, please avoid arbitrary limits like RDMA_MAX_RAM_BLOCKS. Can you
use a list (see qemu-queue.h; it is the same as the BSD queue.h) instead
of current_index?
Paolo
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] [RFC PATCH RDMA support v3: 03/10] documentation of RDMA protocol in docs/rdma.txt
2013-03-11 11:51 ` [Qemu-devel] [RFC PATCH RDMA support v3: 03/10] documentation of RDMA protocol in docs/rdma.txt Michael S. Tsirkin
@ 2013-03-11 16:24 ` Michael R. Hines
2013-03-11 17:05 ` Michael S. Tsirkin
0 siblings, 1 reply; 22+ messages in thread
From: Michael R. Hines @ 2013-03-11 16:24 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: aliguori, michael.r.hines.mrhines, qemu-devel, owasserm, abali,
mrhines, gokul, pbonzini
Excellent questions: answers inline.........
On 03/11/2013 07:51 AM, Michael S. Tsirkin wrote:
> +RDMA-based live migration protocol
> +==================================
> +
> +We use two kinds of RDMA messages:
> +
> +1. RDMA WRITES (to the receiver)
> +2. RDMA SEND (for non-live state, like devices and CPU)
> Something's missing here.
> Don't you need to know remote addresses before doing RDMA writes?
Yes, it looks like I need to do some more "teaching" about infiniband / RDMA
inside the documentation.
I was trying not to make it too long, but it seems I over-estimated
the ubiquity of RDMA and I'll have to include some background information
about the programming model and memory model used by RDMA.
>> +
>> +First, migration-rdma.c does the initial connection establishment
>> +using the URI 'rdma:host:port' on the QMP command line.
>> +
>> +Second, the normal live migration process kicks in for 'pc.ram'.
>> +
>> +During the iterative phase of the migration, only RDMA WRITE messages
>> +are used. Messages are grouped into "chunks" which get pinned by
>> +the hardware in 64-page increments. Each chunk is acknowledged in
>> +the Queue Pair's completion queue (not the individual pages).
>> +
>> +During iteration of RAM, there are no messages sent, just RDMA writes.
>> +During the last iteration, once the devices and CPU is ready to be
>> +sent, we begin to use the RDMA SEND messages.
> It's unclear whether you are switching modes here, if yes
> assuming CPU/device state is only sent during
> the last iteration would break post-migration so
> is probably not a good choice for a protocol.
I made a bad choice of words ...... I'll correct the documentation.
>
>> +Due to the asynchronous nature of RDMA, the receiver of the migration
>> +must post Receive work requests in the queue *before* a SEND work request
>> +can be posted.
>> +
>> +To achieve this, both sides perform an initial 'barrier' synchronization.
>> +Before the barrier, we already know that both sides have a receive work
>> +request posted,
> How?
While I was coding last night, I was able to eliminate this barrier.
>> and then both sides exchange and block on the completion
>> +queue waiting for each other to know the other peer is alive and ready
>> +to send the rest of the live migration state (qemu_send/recv_barrier()).
> How much?
The remaining migration state is typically < 100K (usually more like 17-32K)
Most of this gets sent during qemu_savevm_state_complete() during the
last iteration.
>> +At this point, the use of QEMUFile between both sides for communication
>> +proceeds as normal.
>> +The difference between TCP and SEND comes in migration-rdma.c: Since
>> +we cannot simply dump the bytes into a socket, instead a SEND message
>> +must be preceded by one side instructing the other side *exactly* how
>> +many bytes the SEND message will contain.
> instructing how? Presumably you use some protocol for this?
Yes, I'll be more verbose. Sorry about that =)
(Basically, the length of the SEND is stored inside the SEND message itself.)
>> +Each time a SEND is received, the receiver buffers the message and
>> +divvies out the bytes from the SEND to the qemu_loadvm_state() function
>> +until all the bytes from the buffered SEND message have been exhausted.
>> +
>> +Before the SEND is exhausted, the receiver sends an 'ack' SEND back
>> +to the sender to let the savevm_state_* functions know that they
>> +can resume and start generating more SEND messages.
> The above two paragraphs seem very opaque to me.
> what's an 'ack' SEND, how do you know whether SEND
> is exhausted?
More verbosity needed here too =). Exhaustion is detected as follows:
the SEND bytes are copied into a local buffer, and whenever the
QEMUFile functions request more bytes, we check how many bytes remain
from the last SEND message (which was copied locally) and hand those
back to the QEMUFile functions.
If there are no bytes left in the buffer, we block and wait for another
SEND message.
>> +This ping-pong of SEND messages
> BTW, if by ping-pong you mean something like this:
> source "I have X bytes"
> destination "ok send me X bytes"
> source sends X bytes
> then you could put the address in the destination response and
> use RDMA for sending X bytes.
> It's up to you but it might simplify the protocol as
> the only thing you send would be buffer management messages.
No, you can't do that because RDMA writes do not produce
completion queue (CQ) notifications on the receiver side.
Thus, there's no way for the receiver to know data was received.
You still need regular SEND message to handle it.
>> happens until the live migration completes.
> Any way to tear down the connection in case of errors?
Yes, I'll add all these questions to the update documentation ASAP.
>> +
>> +USAGE
>> +===============================
>> +
>> +Compiling:
>> +
>> +$ ./configure --enable-rdma --target-list=x86_64-softmmu
>> +
>> +$ make
>> +
>> +Command-line on the Source machine AND Destination:
>> +
>> +$ virsh qemu-monitor-command --hmp --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device
>> +
>> +Finally, perform the actual migration:
>> +
>> +$ virsh migrate domain rdma:xx.xx.xx.xx:port
>> +
>> +PERFORMANCE
>> +===================
>> +
>> +Using a 40gbps Infiniband link, performing a worst-case stress test:
>> +
>> +1. Average worst-case RDMA throughput with $ stress --vm-bytes 1024M --vm 1 --vm-keep:
>> +   Approximately 30 gbps (slightly better than the paper)
>> +2. Average worst-case TCP throughput with the same stress test:
>> +   Approximately 8 gbps (using IPoIB, IP over Infiniband)
>> +
>> +Average downtime (stop time) ranges between 28 and 33 milliseconds.
>> +
>> +An *exhaustive* paper (2010) shows additional performance details
>> +linked on the QEMU wiki:
>> +
>> +http://wiki.qemu.org/Features/RDMALiveMigration
>> --
>> 1.7.10.4
* Re: [Qemu-devel] [RFC PATCH RDMA support v3: 02/10] Link in new migration-rdma.c and rmda.c files
2013-03-11 13:35 ` [Qemu-devel] [RFC PATCH RDMA support v3: 02/10] Link in new migration-rdma.c and rmda.c files Paolo Bonzini
@ 2013-03-11 16:25 ` Michael R. Hines
0 siblings, 0 replies; 22+ messages in thread
From: Michael R. Hines @ 2013-03-11 16:25 UTC (permalink / raw)
To: qemu-devel
will fix....acknowledged
On 03/11/2013 09:35 AM, Paolo Bonzini wrote:
> On 11/03/2013 05:33, Michael.R.Hines.mrhines@linux.vnet.ibm.com wrote:
>> common-obj-y += migration.o migration-tcp.o
>> +common-obj-$(CONFIG_RDMA) += migration-rdma.o
>> common-obj-y += qemu-char.o #aio.o
>> common-obj-y += block-migration.o
>> -common-obj-y += page_cache.o xbzrle.o
>> +common-obj-y += page_cache.o xbzrle.o rdma.o
> Why is rdma.o not conditionalized by $(CONFIG_RDMA)?
>
> Paolo
>
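The fix being acknowledged is a one-line guard in Makefile.objs, mirroring how migration-rdma.o is already conditionalized in the quoted hunk (sketch):

```make
common-obj-y += page_cache.o xbzrle.o
common-obj-$(CONFIG_RDMA) += rdma.o
```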
* Re: [Qemu-devel] [RFC PATCH RDMA support v3: 08/10] Introduce QEMUFileRDMA
2013-03-11 13:40 ` [Qemu-devel] [RFC PATCH RDMA support v3: 08/10] Introduce QEMUFileRDMA Paolo Bonzini
@ 2013-03-11 16:26 ` Michael R. Hines
2013-03-11 16:26 ` Michael R. Hines
1 sibling, 0 replies; 22+ messages in thread
From: Michael R. Hines @ 2013-03-11 16:26 UTC (permalink / raw)
To: qemu-devel
Acknowledged.
On 03/11/2013 09:40 AM, Paolo Bonzini wrote:
> On 11/03/2013 05:33, Michael.R.Hines.mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> This is the loadvm() side of the connection which is RDMA-aware,
>> so that transfer of device state can use the same abstractions
>> as all of the other migration protocols.
>>
>> Full documentation of the protocol is in docs/rdma.txt
>>
>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>> ---
>> include/migration/qemu-file.h | 17 +++++
>> savevm.c | 165 ++++++++++++++++++++++++++++++++++++++++-
>> 2 files changed, 180 insertions(+), 2 deletions(-)
> Please move this to rdma.c, otherwise it won't compile without librdmacm.
>
> Paolo
>
* Re: [Qemu-devel] [RFC PATCH RDMA support v3: 08/10] Introduce QEMUFileRDMA
2013-03-11 13:40 ` [Qemu-devel] [RFC PATCH RDMA support v3: 08/10] Introduce QEMUFileRDMA Paolo Bonzini
2013-03-11 16:26 ` Michael R. Hines
@ 2013-03-11 16:26 ` Michael R. Hines
1 sibling, 0 replies; 22+ messages in thread
From: Michael R. Hines @ 2013-03-11 16:26 UTC (permalink / raw)
To: Paolo Bonzini
Cc: aliguori, mst, michael.r.hines.mrhines, qemu-devel, owasserm,
abali, mrhines, gokul
Acknowledged.
On 03/11/2013 09:40 AM, Paolo Bonzini wrote:
> On 11/03/2013 05:33, Michael.R.Hines.mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> This is the loadvm() side of the connection which is RDMA-aware,
>> so that transfer of device state can use the same abstractions
>> as all of the other migration protocols.
>>
>> Full documentation of the protocol is in docs/rdma.txt
>>
>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>> ---
>> include/migration/qemu-file.h | 17 +++++
>> savevm.c | 165 ++++++++++++++++++++++++++++++++++++++++-
>> 2 files changed, 180 insertions(+), 2 deletions(-)
> Please move this to rdma.c, otherwise it won't compile without librdmacm.
>
> Paolo
>
* Re: [Qemu-devel] [RFC PATCH RDMA support v3: 05/10] RDMA connection establishment (migration-rdma.c).
2013-03-11 13:41 ` [Qemu-devel] [RFC PATCH RDMA support v3: 05/10] RDMA connection establishment (migration-rdma.c) Paolo Bonzini
@ 2013-03-11 16:28 ` Michael R. Hines
2013-03-11 20:20 ` Michael R. Hines
1 sibling, 0 replies; 22+ messages in thread
From: Michael R. Hines @ 2013-03-11 16:28 UTC (permalink / raw)
To: Paolo Bonzini
Cc: aliguori, mst, michael.r.hines.mrhines, qemu-devel, owasserm,
abali, mrhines, gokul
Yes, I was hoping to see your patch queue pulled already, but I guess
everyone's busy =)
On 03/11/2013 09:41 AM, Paolo Bonzini wrote:
> Note that as soon as the pending migration patches hit upstream, almost
> all of this code will move to QEMUFileRDMA (in particular qemu_rdma_send
> and qemu_rdma_close). It should become simpler.
> Paolo
* Re: [Qemu-devel] [RFC PATCH RDMA support v3: 06/10] Introduce 'max_iterations' and Call out to migration-rdma.c when requested
2013-03-11 13:49 ` [Qemu-devel] [RFC PATCH RDMA support v3: 06/10] Introduce 'max_iterations' and Call out to migration-rdma.c when requested Paolo Bonzini
@ 2013-03-11 16:30 ` Michael R. Hines
0 siblings, 0 replies; 22+ messages in thread
From: Michael R. Hines @ 2013-03-11 16:30 UTC (permalink / raw)
To: Paolo Bonzini
Cc: aliguori, mst, michael.r.hines.mrhines, qemu-devel, owasserm,
abali, mrhines, gokul
Will do - I'll make another patch for these.
I don't have a good answer for the "computed bandwidth" idea,
but at least some initial max_iterations (even if disabled by default)
would go a long way to helping the problem.....
On 03/11/2013 09:49 AM, Paolo Bonzini wrote:
> On 11/03/2013 05:33, Michael.R.Hines.mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> Very little changes here except for halting the migration after a maximum
>> number of iterations is reached.
>>
>> When comparing against TCP, the migration never ends if we don't cap
>> the migrations somehow..... just an idea for now.
> This makes sense, but please: a) make it a separate patch; b) add QMP
> commands for it; c) make it disabled by default.
>
> There are two uses of migrate_use_rdma(). One is to disable the search
> for zero pages. Perhaps we can do that automatically based on the
> current computed bandwidth? At some point, it costs less to just send
> the data down the wire.
>
> The other is to use the RDMA-specific primitive to send pages. I hope
> that Orit's work will make that unnecessary; in the meanwhile, however,
> the latter is okay.
>
> The "verbose logging" should be yet another patch. Many of the messages
> you touched are gone in the most recent version of the code. I suspect
> that, for the others, it's better to use tracepoints (see the
> trace-events file) instead.
>
> Paolo
>
>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>> ---
>> include/migration/migration.h | 10 ++++++++
>> migration.c | 56 +++++++++++++++++++++++++++++++++++------
>> 2 files changed, 58 insertions(+), 8 deletions(-)
>>
>> diff --git a/include/migration/migration.h b/include/migration/migration.h
>> index d121409..796cf3d 100644
>> --- a/include/migration/migration.h
>> +++ b/include/migration/migration.h
>> @@ -20,6 +20,7 @@
>> #include "qemu/notify.h"
>> #include "qapi/error.h"
>> #include "migration/vmstate.h"
>> +#include "migration/rdma.h"
>> #include "qapi-types.h"
>>
>> struct MigrationParams {
>> @@ -55,6 +56,9 @@ struct MigrationState
>> bool enabled_capabilities[MIGRATION_CAPABILITY_MAX];
>> int64_t xbzrle_cache_size;
>> bool complete;
>> +
>> + RDMAData rdma;
>> + double mbps;
>> };
>>
>> void process_incoming_migration(QEMUFile *f);
>> @@ -75,6 +79,10 @@ void tcp_start_incoming_migration(const char *host_port, Error **errp);
>>
>> void tcp_start_outgoing_migration(MigrationState *s, const char *host_port, Error **errp);
>>
>> +void rdma_start_outgoing_migration(MigrationState *s, const char *host_port, Error **errp);
>> +
>> +int rdma_start_incoming_migration(const char * host_port, Error **errp);
>> +
>> void unix_start_incoming_migration(const char *path, Error **errp);
>>
>> void unix_start_outgoing_migration(MigrationState *s, const char *path, Error **errp);
>> @@ -106,6 +114,7 @@ uint64_t dup_mig_bytes_transferred(void);
>> uint64_t dup_mig_pages_transferred(void);
>> uint64_t norm_mig_bytes_transferred(void);
>> uint64_t norm_mig_pages_transferred(void);
>> +uint64_t delta_norm_mig_bytes_transferred(void);
>> uint64_t xbzrle_mig_bytes_transferred(void);
>> uint64_t xbzrle_mig_pages_transferred(void);
>> uint64_t xbzrle_mig_pages_overflow(void);
>> @@ -130,6 +139,7 @@ int xbzrle_encode_buffer(uint8_t *old_buf, uint8_t *new_buf, int slen,
>> int xbzrle_decode_buffer(uint8_t *src, int slen, uint8_t *dst, int dlen);
>>
>> int migrate_use_xbzrle(void);
>> +int migrate_use_rdma(void);
>> int64_t migrate_xbzrle_cache_size(void);
>>
>> int64_t xbzrle_cache_resize(int64_t new_size);
>> diff --git a/migration.c b/migration.c
>> index 11725ae..aae2f66 100644
>> --- a/migration.c
>> +++ b/migration.c
>> @@ -25,6 +25,7 @@
>> #include "qmp-commands.h"
>>
>> //#define DEBUG_MIGRATION
>> +//#define DEBUG_MIGRATION_VERBOSE
>>
>> #ifdef DEBUG_MIGRATION
>> #define DPRINTF(fmt, ...) \
>> @@ -34,6 +35,14 @@
>> do { } while (0)
>> #endif
>>
>> +#ifdef DEBUG_MIGRATION_VERBOSE
>> +#define DDPRINTF(fmt, ...) \
>> + do { printf("migration: " fmt, ## __VA_ARGS__); } while (0)
>> +#else
>> +#define DDPRINTF(fmt, ...) \
>> + do { } while (0)
>> +#endif
>> +
>> enum {
>> MIG_STATE_ERROR,
>> MIG_STATE_SETUP,
>> @@ -76,6 +85,8 @@ void qemu_start_incoming_migration(const char *uri, Error **errp)
>>
>> if (strstart(uri, "tcp:", &p))
>> tcp_start_incoming_migration(p, errp);
>> + else if (strstart(uri, "rdma:", &p))
>> + rdma_start_incoming_migration(p, errp);
>> #if !defined(WIN32)
>> else if (strstart(uri, "exec:", &p))
>> exec_start_incoming_migration(p, errp);
>> @@ -130,6 +141,14 @@ void process_incoming_migration(QEMUFile *f)
>> * units must be in seconds */
>> static uint64_t max_downtime = 30000000;
>>
>> +/*
>> + * RFC: We probably need a QMP setting for this, but the point
>> + * of it is that it's hard to compare RDMA workloads
>> + * vs. TCP workloads because the TCP migrations never
>> + * complete without some kind of iteration cap.
>> + */
>> +static int max_iterations = 30;
>> +
>> uint64_t migrate_max_downtime(void)
>> {
>> return max_downtime;
>> @@ -429,6 +448,8 @@ void qmp_migrate(const char *uri, bool has_blk, bool blk,
>>
>> if (strstart(uri, "tcp:", &p)) {
>> tcp_start_outgoing_migration(s, p, &local_err);
>> + } else if (strstart(uri, "rdma:", &p)) {
>> + rdma_start_outgoing_migration(s, p, &local_err);
>> #if !defined(WIN32)
>> } else if (strstart(uri, "exec:", &p)) {
>> exec_start_outgoing_migration(s, p, &local_err);
>> @@ -502,6 +523,16 @@ int migrate_use_xbzrle(void)
>> return s->enabled_capabilities[MIGRATION_CAPABILITY_XBZRLE];
>> }
>>
>> +/*
>> + * Don't think we need a 'capability' here
>> + * because 'rdma:host:port' must be specified
>> + * on the QMP command line...
>> + */
>> +int migrate_use_rdma(void)
>> +{
>> + return migrate_get_current()->rdma.enabled;
>> +}
>> +
>> int64_t migrate_xbzrle_cache_size(void)
>> {
>> MigrationState *s;
>> @@ -550,7 +581,7 @@ static int buffered_put_buffer(void *opaque, const uint8_t *buf,
>> MigrationState *s = opaque;
>> ssize_t error;
>>
>> - DPRINTF("putting %d bytes at %" PRId64 "\n", size, pos);
>> + DDPRINTF("putting %d bytes at %" PRId64 "\n", size, pos);
>>
>> error = qemu_file_get_error(s->file);
>> if (error) {
>> @@ -563,7 +594,7 @@ static int buffered_put_buffer(void *opaque, const uint8_t *buf,
>> }
>>
>> if (size > (s->buffer_capacity - s->buffer_size)) {
>> - DPRINTF("increasing buffer capacity from %zu by %zu\n",
>> + DDPRINTF("increasing buffer capacity from %zu by %d\n",
>> s->buffer_capacity, size + 1024);
>>
>> s->buffer_capacity += size + 1024;
>> @@ -661,7 +692,7 @@ static void *buffered_file_thread(void *opaque)
>> int64_t sleep_time = 0;
>> int64_t max_size = 0;
>> bool last_round = false;
>> - int ret;
>> + int ret, iterations = 0;
>>
>> qemu_mutex_lock_iothread();
>> DPRINTF("beginning savevm\n");
>> @@ -687,11 +718,15 @@ static void *buffered_file_thread(void *opaque)
>> qemu_mutex_unlock_iothread();
>> break;
>> }
>> +
>> + iterations++;
>> +
>> if (s->bytes_xfer < s->xfer_limit) {
>> - DPRINTF("iterate\n");
>> + DPRINTF("iterate %d max %d\n", iterations, max_iterations);
>> pending_size = qemu_savevm_state_pending(s->file, max_size);
>> DPRINTF("pending size %lu max %lu\n", pending_size, max_size);
>> - if (pending_size && pending_size >= max_size) {
>> + if (pending_size && pending_size >= max_size &&
>> + (max_iterations == -1 || iterations < max_iterations)) {
>> ret = qemu_savevm_state_iterate(s->file);
>> if (ret < 0) {
>> qemu_mutex_unlock_iothread();
>> @@ -730,14 +765,18 @@ static void *buffered_file_thread(void *opaque)
>> qemu_mutex_unlock_iothread();
>> current_time = qemu_get_clock_ms(rt_clock);
>> if (current_time >= initial_time + BUFFER_DELAY) {
>> - uint64_t transferred_bytes = s->bytes_xfer;
>> + uint64_t transferred_bytes = migrate_use_rdma() ?
>> + delta_norm_mig_bytes_transferred() : s->bytes_xfer;
>> uint64_t time_spent = current_time - initial_time - sleep_time;
>> double bandwidth = transferred_bytes / time_spent;
>> max_size = bandwidth * migrate_max_downtime() / 1000000;
>> + s->mbps = ((double) transferred_bytes * 8.0 /
>> + ((double) time_spent / 1000.0)) / 1000.0 / 1000.0;
>>
>> DPRINTF("transferred %" PRIu64 " time_spent %" PRIu64
>> - " bandwidth %g max_size %" PRId64 "\n",
>> - transferred_bytes, time_spent, bandwidth, max_size);
>> + " bandwidth %g (%0.2f mbps) max_size %" PRId64 "\n",
>> + transferred_bytes, time_spent,
>> + bandwidth, s->mbps, max_size);
>> /* if we haven't sent anything, we don't want to recalculate
>> 10000 is a small enough number for our purposes */
>> if (s->dirty_bytes_rate && transferred_bytes > 10000) {
>> @@ -774,6 +813,7 @@ static const QEMUFileOps buffered_file_ops = {
>> .rate_limit = buffered_rate_limit,
>> .get_rate_limit = buffered_get_rate_limit,
>> .set_rate_limit = buffered_set_rate_limit,
>> + .send_barrier = qemu_rdma_send_barrier,
>> };
>>
>> void migrate_fd_connect(MigrationState *s)
>>
>
* Re: [Qemu-devel] [RFC PATCH RDMA support v3: 07/10] Send the actual pages over RDMA.
2013-03-11 13:59 ` [Qemu-devel] [RFC PATCH RDMA support v3: 07/10] Send the actual pages over RDMA Paolo Bonzini
@ 2013-03-11 16:31 ` Michael R. Hines
0 siblings, 0 replies; 22+ messages in thread
From: Michael R. Hines @ 2013-03-11 16:31 UTC (permalink / raw)
To: Paolo Bonzini
Cc: aliguori, mst, michael.r.hines.mrhines, qemu-devel, owasserm,
abali, mrhines, gokul
Acknowledged all...
On 03/11/2013 09:59 AM, Paolo Bonzini wrote:
> On 11/03/2013 05:33, Michael.R.Hines.mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> For performance reasons, dup_page() and xbzrle() is skipped because
>> they are too expensive for zero-copy RDMA.
>>
>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>> ---
>> arch_init.c | 57 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++-
>> 1 file changed, 56 insertions(+), 1 deletion(-)
>>
>> diff --git a/arch_init.c b/arch_init.c
>> index 8daeafa..437cb47 100644
>> --- a/arch_init.c
>> +++ b/arch_init.c
>> @@ -45,6 +45,7 @@
>> #include "exec/address-spaces.h"
>> #include "hw/pcspk.h"
>> #include "migration/page_cache.h"
>> +#include "migration/rdma.h"
>> #include "qemu/config-file.h"
>> #include "qmp-commands.h"
>> #include "trace.h"
>> @@ -245,6 +246,18 @@ uint64_t norm_mig_pages_transferred(void)
>> return acct_info.norm_pages;
>> }
>>
>> +/*
>> + * RDMA does not use the buffered_file,
>> + * but we still need a way to do accounting...
>> + */
>> +uint64_t delta_norm_mig_bytes_transferred(void)
>> +{
>> + static uint64_t last_norm_pages = 0;
>> + uint64_t delta_bytes = (acct_info.norm_pages - last_norm_pages) * TARGET_PAGE_SIZE;
>> + last_norm_pages = acct_info.norm_pages;
>> + return delta_bytes;
>> +}
>> +
>> uint64_t xbzrle_mig_bytes_transferred(void)
>> {
>> return acct_info.xbzrle_bytes;
>> @@ -282,6 +295,45 @@ static size_t save_block_hdr(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
>> return size;
>> }
>>
>> +static size_t save_rdma_page(QEMUFile *f, RAMBlock *block, ram_addr_t offset,
>> + int cont)
>> +{
>> + int ret;
>> + size_t bytes_sent = 0;
>> + ram_addr_t current_addr;
>> + RDMAData * rdma = &migrate_get_current()->rdma;
>> +
>> + acct_info.norm_pages++;
>> +
>> + /*
>> + * use RDMA to send page
>> + */
> Not quite true, the page is added to the current chunk. Please make the
> comments a quick-and-dirty reference of the protocol, or leave them out
> altogether.
>
>> + current_addr = block->offset + offset;
>> + if ((ret = qemu_rdma_write(rdma, current_addr, TARGET_PAGE_SIZE)) < 0) {
>> + fprintf(stderr, "rdma migration: write error! %d\n", ret);
>> + qemu_file_set_error(f, ret);
>> + return ret;
>> + }
>> +
>> + /*
>> + * do some polling
>> + */
> Again, that's quite self-evident. Poll for what though? :)
>
>> + while (1) {
>> + int ret = qemu_rdma_poll(rdma);
>> + if (ret == RDMA_WRID_NONE) {
>> + break;
>> + }
>> + if (ret < 0) {
>> + fprintf(stderr, "rdma migration: polling error! %d\n", ret);
>> + qemu_file_set_error(f, ret);
>> + return ret;
>> + }
>> + }
>> +
>> + bytes_sent += TARGET_PAGE_SIZE;
>> + return bytes_sent;
>> +}
> As written in the other message, I think this should be an additional
> QEMUFile operation, hopefully the same that Orit is introducing in her
> patches.
>
>> #define ENCODING_FLAG_XBZRLE 0x1
>>
>> static int save_xbzrle_page(QEMUFile *f, uint8_t *current_data,
>> @@ -462,7 +514,10 @@ static int ram_save_block(QEMUFile *f, bool last_stage)
>>
>> /* In doubt sent page as normal */
>> bytes_sent = -1;
>> - if (is_dup_page(p)) {
>> + if (migrate_use_rdma()) {
>> + /* searching for zeros is still too expensive for RDMA */
>> + bytes_sent = save_rdma_page(f, block, offset, cont);
> Again as written in the other message, this is not really an RDMA thing,
> it's mostly the effect of a fast link. Of course to some extent it
> depends on the CPU and RAM speed, but we can fake that it isn't.
>
>> + } else if (is_dup_page(p)) {
>> acct_info.dup_pages++;
>> bytes_sent = save_block_hdr(f, block, offset, cont,
>> RAM_SAVE_FLAG_COMPRESS);
>>
> Thanks,
>
> Paolo
>
* Re: [Qemu-devel] [RFC PATCH RDMA support v3: 10/10] Parse RDMA host/port out of the QMP string.
2013-03-11 14:00 ` [Qemu-devel] [RFC PATCH RDMA support v3: 10/10] Parse RDMA host/port out of the QMP string Paolo Bonzini
@ 2013-03-11 16:32 ` Michael R. Hines
0 siblings, 0 replies; 22+ messages in thread
From: Michael R. Hines @ 2013-03-11 16:32 UTC (permalink / raw)
To: Paolo Bonzini
Cc: aliguori, mst, michael.r.hines.mrhines, qemu-devel, owasserm,
abali, mrhines, gokul
acknowledged....
On 03/11/2013 10:00 AM, Paolo Bonzini wrote:
> On 11/03/2013 05:33, Michael.R.Hines.mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> ... want to use existing functions to do this.
> Remember that each of the 10 steps should compile and link, both with
> and without librdmacm installed. So this should be quite earlier.
>
>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>> ---
>> include/qemu/sockets.h | 1 +
>> util/oslib-posix.c | 4 ++++
>> util/qemu-sockets.c | 2 +-
>> 3 files changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/qemu/sockets.h b/include/qemu/sockets.h
>> index 6125bf7..86fe4da 100644
>> --- a/include/qemu/sockets.h
>> +++ b/include/qemu/sockets.h
>> @@ -47,6 +47,7 @@ typedef void NonBlockingConnectHandler(int fd, void *opaque);
>> int inet_listen_opts(QemuOpts *opts, int port_offset, Error **errp);
>> int inet_listen(const char *str, char *ostr, int olen,
>> int socktype, int port_offset, Error **errp);
>> +InetSocketAddress *inet_parse(const char *str, Error **errp);
>> int inet_connect_opts(QemuOpts *opts, Error **errp,
>> NonBlockingConnectHandler *callback, void *opaque);
>> int inet_connect(const char *str, Error **errp);
>> diff --git a/util/oslib-posix.c b/util/oslib-posix.c
>> index 433dd68..dc369ae 100644
>> --- a/util/oslib-posix.c
>> +++ b/util/oslib-posix.c
>> @@ -137,6 +137,8 @@ void qemu_vfree(void *ptr)
>> void socket_set_block(int fd)
>> {
>> int f;
>> + if(fd < 0)
>> + return;
> What is this needed for? Make it a separate patch and describe the
> rationale in the commit message.
>
>> f = fcntl(fd, F_GETFL);
>> fcntl(fd, F_SETFL, f & ~O_NONBLOCK);
>> }
>> @@ -144,6 +146,8 @@ void socket_set_block(int fd)
>> void socket_set_nonblock(int fd)
>> {
>> int f;
>> + if(fd < 0)
>> + return;
>> f = fcntl(fd, F_GETFL);
>> fcntl(fd, F_SETFL, f | O_NONBLOCK);
>> }
>> diff --git a/util/qemu-sockets.c b/util/qemu-sockets.c
>> index 3f12296..58e4bcd 100644
>> --- a/util/qemu-sockets.c
>> +++ b/util/qemu-sockets.c
>> @@ -485,7 +485,7 @@ err:
>> }
>>
>> /* compatibility wrapper */
>> -static InetSocketAddress *inet_parse(const char *str, Error **errp)
>> +InetSocketAddress *inet_parse(const char *str, Error **errp)
>> {
>> InetSocketAddress *addr;
>> const char *optstr, *h;
>>
> Paolo
>
* Re: [Qemu-devel] [RFC PATCH RDMA support v3: 09/10] Move RAMBlock to cpu-common.h
2013-03-11 14:07 ` [Qemu-devel] [RFC PATCH RDMA support v3: 09/10] Move RAMBlock to cpu-common.h Paolo Bonzini
@ 2013-03-11 16:34 ` Michael R. Hines
0 siblings, 0 replies; 22+ messages in thread
From: Michael R. Hines @ 2013-03-11 16:34 UTC (permalink / raw)
To: Paolo Bonzini
Cc: aliguori, mst, michael.r.hines.mrhines, qemu-devel, owasserm,
abali, mrhines, gokul
Understood......will take care of both of these.
On 03/11/2013 10:07 AM, Paolo Bonzini wrote:
> Only used in qemu_rdma_init_ram_blocks. Can you add instead an API like
> qemu_ram_foreach_block(
>     void (*fn)(void *host_addr, ram_addr_t offset, ram_addr_t length,
>                void *opaque),
>     void *opaque)
> ? BTW, please avoid arbitrary limits like RDMA_MAX_RAM_BLOCKS. Can you
> use a list (see qemu-queue.h; it is the same as the BSD queue.h) instead
> of current_index?
> Paolo
* Re: [Qemu-devel] [RFC PATCH RDMA support v3: 03/10] documentation of RDMA protocol in docs/rdma.txt
2013-03-11 16:24 ` Michael R. Hines
@ 2013-03-11 17:05 ` Michael S. Tsirkin
2013-03-11 17:17 ` Michael R. Hines
0 siblings, 1 reply; 22+ messages in thread
From: Michael S. Tsirkin @ 2013-03-11 17:05 UTC (permalink / raw)
To: Michael R. Hines
Cc: aliguori, michael.r.hines.mrhines, qemu-devel, owasserm, abali,
mrhines, gokul, pbonzini
On Mon, Mar 11, 2013 at 12:24:53PM -0400, Michael R. Hines wrote:
> Excellent questions: answers inline.........
>
> On 03/11/2013 07:51 AM, Michael S. Tsirkin wrote:
> >+RDMA-based live migration protocol
> >+==================================
> >+
> >+We use two kinds of RDMA messages:
> >+
> >+1. RDMA WRITES (to the receiver)
> >+2. RDMA SEND (for non-live state, like devices and CPU)
> >Something's missing here.
> >Don't you need to know remote addresses before doing RDMA writes?
>
> Yes, it looks like I need to do some more "teaching" about infiniband / RDMA
> inside the documentation.
>
> I was trying not to make it too long, but it seems I over-estimated
> the ubiquity of RDMA and I'll have to include some background information
> about the programming model and memory model used by RDMA.
Well that's exactly the question. As far as I remember the
RDMA memory model, you need to know a key and address to
execute RDMA writes. Remote memory also needs to be locked,
so you need some mechanism to lock chunks of memory,
do RDMA write and unlock when done.
> >>+
> >>+First, migration-rdma.c does the initial connection establishment
> >>+using the URI 'rdma:host:port' on the QMP command line.
> >>+
> >>+Second, the normal live migration process kicks in for 'pc.ram'.
> >>+
> >>+During the iterative phase of the migration, only RDMA WRITE messages
> >>+are used. Messages are grouped into "chunks" which get pinned by
> >>+the hardware in 64-page increments. Each chunk is acknowledged in
> >>+the Queue Pair's completion queue (not the individual pages).
> >>+
> >>+During iteration of RAM, there are no messages sent, just RDMA writes.
> >>+During the last iteration, once the devices and CPU is ready to be
> >>+sent, we begin to use the RDMA SEND messages.
> >It's unclear whether you are switching modes here, if yes
> >assuming CPU/device state is only sent during
> >the last iteration would break post-migration so
> >is probably not a good choice for a protocol.
>
> I made a bad choice of words ...... I'll correct the documentation.
>
> >
> >>+Due to the asynchronous nature of RDMA, the receiver of the migration
> >>+must post Receive work requests in the queue *before* a SEND work request
> >>+can be posted.
> >>+
> >>+To achieve this, both sides perform an initial 'barrier' synchronization.
> >>+Before the barrier, we already know that both sides have a receive work
> >>+request posted,
> >How?
>
> While I was coding last night, I was able to eliminate this barrier.
>
> >>and then both sides exchange and block on the completion
> >>+queue waiting for each other to know the other peer is alive and ready
> >>+to send the rest of the live migration state (qemu_send/recv_barrier()).
> >How much?
>
> The remaining migration state is typically < 100K (usually more like 17-32K)
>
> Most of this gets sent during qemu_savevm_state_complete() during
> the last iteration.
>
> >>+At this point, the use of QEMUFile between both sides for communication
> >>+proceeds as normal.
> >>+The difference between TCP and SEND comes in migration-rdma.c: since
> >>+we cannot simply dump the bytes into a socket, each SEND message
> >>+must be preceded by one side instructing the other side *exactly* how
> >>+many bytes the SEND message will contain.
> >instructing how? Presumably you use some protocol for this?
>
> Yes, I'll be more verbose. Sorry about that =)
>
> (Basically, the length of the SEND is stored inside the SEND message itself.)
>
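A hypothetical sketch of that framing, where each SEND carries its own payload length in a fixed header (the struct and helper are illustrative, not the patch's actual layout):

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical framing: each SEND begins with a length field, so the
 * receiver knows exactly how many payload bytes follow without a
 * separate negotiation round-trip. */
typedef struct {
    uint32_t len;            /* payload bytes in this SEND */
    uint8_t  payload[4096];  /* device-state bytes */
} __attribute__((packed)) SendMsg;

/* Fill in the header and payload; returns the number of bytes that
 * would actually be posted to the queue pair for this SEND. */
static size_t send_msg_build(SendMsg *m, const void *data, uint32_t len)
{
    m->len = len;
    memcpy(m->payload, data, len);
    return sizeof(m->len) + len;
}
```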
> >>+Each time a SEND is received, the receiver buffers the message and
> >>+divvies out the bytes from the SEND to the qemu_loadvm_state() function
> >>+until all the bytes from the buffered SEND message have been exhausted.
> >>+
> >>+Before the SEND is exhausted, the receiver sends an 'ack' SEND back
> >>+to the sender to let the savevm_state_* functions know that they
> >>+can resume and start generating more SEND messages.
> >The above two paragraphs seem very opaque to me.
> >what's an 'ack' SEND, how do you know whether SEND
> >is exhausted?
>
> More verbosity needed here too =). Exhaustion is detected because
> the SEND bytes are copied into a buffer and then whenever
> QEMUFile functions request more bytes from the buffer, we check
> how many bytes are available from the last SEND message (which
> was copied locally) to be handed back to QEMUFile functions.
>
> If there are no bytes left in the buffer, we block and wait for
> another SEND message.
You need some way to make sure there's a buffer available
for that SEND message though.
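The exhaustion accounting described above can be sketched in a few lines; this is an illustrative model only (names are invented), showing how reads are served from the last buffered SEND until a zero return tells the caller to block for the next one:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical receive-side bookkeeping: the payload of the most
 * recent SEND is copied into a local buffer, and QEMUFile reads are
 * served from it until it is exhausted. */
typedef struct {
    unsigned char buf[4096];
    size_t len;   /* bytes copied from the last SEND */
    size_t pos;   /* bytes already handed back to QEMUFile */
} RecvBuffer;

/* Hand out up to 'want' bytes from the buffered SEND.  A return of 0
 * means the buffer is exhausted and the caller must block waiting for
 * the next SEND message before retrying. */
static size_t recv_buffer_read(RecvBuffer *r, void *out, size_t want)
{
    size_t avail = r->len - r->pos;
    size_t n = want < avail ? want : avail;
    memcpy(out, r->buf + r->pos, n);
    r->pos += n;
    return n;
}
```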
> >>+This ping-pong of SEND messages
> >BTW, if by ping-pong you mean something like this:
> > source "I have X bytes"
> > destination "ok send me X bytes"
> > source sends X bytes
> >then you could put the address in the destination response and
> >use RDMA for sending X bytes.
> >It's up to you but it might simplify the protocol as
> >the only thing you send would be buffer management messages.
> No, you can't do that because RDMA writes do not produce
> completion queue (CQ) notifications on the receiver side.
>
> Thus, there's no way for the receiver to know data was received.
>
> You still need regular SEND message to handle it.
>
> >>happens until the live migration completes.
> >Any way to tear down the connection in case of errors?
>
> Yes, I'll add all these questions to the update documentation ASAP.
>
>
> >>+
> >>+USAGE
> >>+===============================
> >>+
> >>+Compiling:
> >>+
> >>+$ ./configure --enable-rdma --target-list=x86_64-softmmu
> >>+
> >>+$ make
> >>+
> >>+Command-line on the Source machine AND Destination:
> >>+
> >>+$ virsh qemu-monitor-command --hmp --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device
> >>+
> >>+Finally, perform the actual migration:
> >>+
> >>+$ virsh migrate domain rdma:xx.xx.xx.xx:port
> >>+
> >>+PERFORMANCE
> >>+===================
> >>+
> >>+Using a 40gbps InfiniBand link, performing a worst-case stress test:
> >>+
> >>+1. Average worst-case RDMA throughput with $ stress --vm-bytes 1024M --vm 1 --vm-keep:
> >>+   approximately 30 gbps (slightly better than the paper)
> >>+2. Average worst-case TCP throughput with the same stress test:
> >>+   approximately 8 gbps (using IPoIB, IP over InfiniBand)
> >>+
> >>+Average downtime (stop time) ranges between 28 and 33 milliseconds.
> >>+
> >>+An *exhaustive* paper (2010) shows additional performance details
> >>+linked on the QEMU wiki:
> >>+
> >>+http://wiki.qemu.org/Features/RDMALiveMigration
> >>--
> >>1.7.10.4
^ permalink raw reply [flat|nested] 22+ messages in thread
* Re: [Qemu-devel] [RFC PATCH RDMA support v3: 03/10] documentation of RDMA protocol in docs/rdma.txt
2013-03-11 17:05 ` Michael S. Tsirkin
@ 2013-03-11 17:17 ` Michael R. Hines
2013-03-11 17:19 ` Michael S. Tsirkin
0 siblings, 1 reply; 22+ messages in thread
From: Michael R. Hines @ 2013-03-11 17:17 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: aliguori, michael.r.hines.mrhines, qemu-devel, owasserm, abali,
mrhines, gokul, pbonzini
On 03/11/2013 01:05 PM, Michael S. Tsirkin wrote:
> Well that's exactly the question. As far as I remember the RDMA memory
> model, you need to know a key and address to execute RDMA writes.
> Remote memory also needs to be locked, so you need some mechanism to
> lock chunks of memory, do RDMA write and unlock when done.
Yes, memory is registered before the write occurs - that's all taken
care of in rdma.c (patch #04/10).
Same answer for the SEND messages: memory for each send must be
registered before you send.
The pinning (mlock()) is already handled by libibverbs (by calling the
function ibv_reg_mr()) - standard InfiniBand initialization.
- Michael
* Re: [Qemu-devel] [RFC PATCH RDMA support v3: 03/10] documentation of RDMA protocol in docs/rdma.txt
2013-03-11 17:17 ` Michael R. Hines
@ 2013-03-11 17:19 ` Michael S. Tsirkin
2013-03-11 17:35 ` Michael R. Hines
0 siblings, 1 reply; 22+ messages in thread
From: Michael S. Tsirkin @ 2013-03-11 17:19 UTC (permalink / raw)
To: Michael R. Hines
Cc: aliguori, michael.r.hines.mrhines, qemu-devel, owasserm, abali,
mrhines, gokul, pbonzini
On Mon, Mar 11, 2013 at 01:17:19PM -0400, Michael R. Hines wrote:
> On 03/11/2013 01:05 PM, Michael S. Tsirkin wrote:
> >Well that's exactly the question. As far as I remember the RDMA
> >memory model, you need to know a key and address to execute RDMA
> >writes. Remote memory also needs to be locked, so you need some
> >mechanism to lock chunks of memory, do RDMA write and unlock when
> >done.
>
> Yes, memory is registered before the write occurs - that's all
> taken care of in rdma.c (patch #04/10)
>
> Same answer for the SEND messages: memory for each send must be
> registered before you send.
>
> The pinning (mlock()) is already handled by libibverbs (by calling
> the function ibv_reg_mr()) - standard infiniband protocol
> initialization.
>
> - Michael
Yes, but the document should describe when the destination
memory is registered and how keys/addresses are passed back to
the source.
* Re: [Qemu-devel] [RFC PATCH RDMA support v3: 03/10] documentation of RDMA protocol in docs/rdma.txt
2013-03-11 17:19 ` Michael S. Tsirkin
@ 2013-03-11 17:35 ` Michael R. Hines
0 siblings, 0 replies; 22+ messages in thread
From: Michael R. Hines @ 2013-03-11 17:35 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: aliguori, michael.r.hines.mrhines, qemu-devel, owasserm, abali,
mrhines, gokul, pbonzini
Acknowledged.
On 03/11/2013 01:19 PM, Michael S. Tsirkin wrote:
> Yes, but the document should describe when the destination memory
> is registered and how keys/addresses are passed back to the source.
* Re: [Qemu-devel] [RFC PATCH RDMA support v3: 05/10] RDMA connection establishment (migration-rdma.c).
2013-03-11 13:41 ` [Qemu-devel] [RFC PATCH RDMA support v3: 05/10] RDMA connection establishment (migration-rdma.c) Paolo Bonzini
2013-03-11 16:28 ` Michael R. Hines
@ 2013-03-11 20:20 ` Michael R. Hines
1 sibling, 0 replies; 22+ messages in thread
From: Michael R. Hines @ 2013-03-11 20:20 UTC (permalink / raw)
To: Paolo Bonzini
Cc: aliguori, mst, michael.r.hines.mrhines, qemu-devel, owasserm,
abali, mrhines, gokul
Paolo, I just pulled your changes from master.
Will get started on merging with all the new comments....
- Michael
On 03/11/2013 09:41 AM, Paolo Bonzini wrote:
> On 11/03/2013 05:33, Michael.R.Hines.mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> Use 'migrate rdma:host:port' to begin the migration.
>>
>> The TCP control channel has finally been eliminated
>> when RDMA is used. Documentation of the use of SEND message
>> for transferring device state is covered in docs/rdma.txt
>>
>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>> ---
>> migration-rdma.c | 198 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 198 insertions(+)
>> create mode 100644 migration-rdma.c
>>
>> diff --git a/migration-rdma.c b/migration-rdma.c
>> new file mode 100644
>> index 0000000..822b17a
>> --- /dev/null
>> +++ b/migration-rdma.c
>> @@ -0,0 +1,198 @@
>> +/*
>> + * Copyright (C) 2013 Michael R. Hines <mrhines@us.ibm.com>
>> + * Copyright (C) 2013 Jiuxing Liu <jl@us.ibm.com>
>> + *
>> + * This program is free software; you can redistribute it and/or modify
>> + * it under the terms of the GNU General Public License as published by
>> + * the Free Software Foundation; under version 2 of the License.
>> + *
>> + * This program is distributed in the hope that it will be useful,
>> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
>> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
>> + * GNU General Public License for more details.
>> + *
>> + * You should have received a copy of the GNU General Public License
>> + * along with this program; if not, see <http://www.gnu.org/licenses/>.
>> + */
>> +#include "migration/rdma.h"
>> +#include "qemu-common.h"
>> +#include "migration/migration.h"
>> +#include "migration/qemu-file.h"
>> +#include <stdio.h>
>> +#include <sys/types.h>
>> +#include <sys/socket.h>
>> +#include <netdb.h>
>> +#include <arpa/inet.h>
>> +#include <string.h>
>> +
>> +//#define DEBUG_MIGRATION_RDMA
>> +
>> +#ifdef DEBUG_MIGRATION_RDMA
>> +#define DPRINTF(fmt, ...) \
>> + do { printf("migration-rdma: " fmt, ## __VA_ARGS__); } while (0)
>> +#else
>> +#define DPRINTF(fmt, ...) \
>> + do { } while (0)
>> +#endif
>> +
>> +static int rdma_accept_incoming_migration(RDMAData *rdma, Error **errp)
>> +{
>> + int ret;
>> +
>> + ret = qemu_rdma_migrate_listen(rdma, rdma->host, rdma->port);
>> + if (ret) {
>> + qemu_rdma_print("rdma migration: error listening!");
>> + goto err_rdma_server_wait;
>> + }
>> +
>> + ret = qemu_rdma_alloc_qp(&rdma->rdma_ctx);
>> + if (ret) {
>> + qemu_rdma_print("rdma migration: error allocating qp!");
>> + goto err_rdma_server_wait;
>> + }
>> +
>> + ret = qemu_rdma_migrate_accept(&rdma->rdma_ctx, NULL, NULL, NULL, 0);
>> + if (ret) {
>> + qemu_rdma_print("rdma migration: error accepting connection!");
>> + goto err_rdma_server_wait;
>> + }
>> +
>> + ret = qemu_rdma_post_recv_qemu_file(rdma);
>> + if (ret) {
>> + qemu_rdma_print("rdma migration: error posting second qemu file recv!");
>> + goto err_rdma_server_wait;
>> + }
>> +
>> + ret = qemu_rdma_post_send_remote_info(rdma);
>> + if (ret) {
>> + qemu_rdma_print("rdma migration: error sending remote info!");
>> + goto err_rdma_server_wait;
>> + }
>> +
>> + ret = qemu_rdma_wait_for_wrid(rdma, RDMA_WRID_SEND_REMOTE_INFO);
>> + if (ret < 0) {
>> + qemu_rdma_print("rdma migration: polling remote info error!");
>> + goto err_rdma_server_wait;
>> + }
>> +
>> + rdma->total_bytes = 0;
>> + rdma->enabled = 1;
>> + qemu_rdma_dump_gid("server_connect", rdma->rdma_ctx.cm_id);
>> + return 0;
>> +
>> +err_rdma_server_wait:
>> + qemu_rdma_cleanup(rdma);
>> + return -1;
>> +
>> +}
>> +
>> +int rdma_start_incoming_migration(const char * host_port, Error **errp)
>> +{
>> + QEMUFile *f;
>> + int ret;
>> + RDMAData rdma;
>> +
>> + if ((ret = qemu_rdma_data_init(&rdma, host_port, errp)) < 0)
>> + return ret;
>> +
>> + ret = qemu_rdma_server_init(&rdma, NULL);
>> +
>> + DPRINTF("Starting RDMA-based incoming migration\n");
>> +
>> + if (!ret) {
>> + DPRINTF("qemu_rdma_server_init success\n");
>> + ret = qemu_rdma_server_prepare(&rdma, NULL);
>> +
>> + if (!ret) {
>> + DPRINTF("qemu_rdma_server_prepare success\n");
>> +
>> + ret = rdma_accept_incoming_migration(&rdma, NULL);
>> + if(!ret)
>> + DPRINTF("qemu_rdma_accept_incoming_migration success\n");
>> + f = qemu_fopen_rdma(&rdma);
>> + if (f == NULL) {
>> + fprintf(stderr, "could not qemu_fopen RDMA\n");
>> + ret = -EIO;
>> + }
>> +
>> + process_incoming_migration(f);
>> + }
>> + }
>> +
>> + return ret;
>> +}
>> +
>> +/*
>> + * Not sure what this is for yet...
>> + */
>> +static int qemu_rdma_errno(MigrationState *s)
>> +{
>> + return 0;
>> +}
>> +
>> +/*
>> + * SEND messages for device state only.
>> + * pc.ram is handled elsewhere...
>> + */
>> +static int qemu_rdma_send(MigrationState *s, const void * buf, size_t size)
>> +{
>> + size_t len, remaining = size;
>> + uint8_t * data = (void *) buf;
>> +
>> + if (qemu_rdma_write_flush(&s->rdma) < 0) {
>> + qemu_file_set_error(s->file, -EIO);
>> + return -EIO;
>> + }
>> +
>> + while(remaining) {
>> + len = MIN(remaining, RDMA_SEND_INCREMENT);
>> + remaining -= len;
>> + DPRINTF("iter Sending %" PRId64 "-byte SEND\n", len);
>> +
>> + if(qemu_rdma_exchange_send(&s->rdma, data, len) < 0)
>> + return -EINVAL;
>> +
>> + data += len;
>> + }
>> +
>> + return size;
>> +}
>> +
>> +static int qemu_rdma_close(MigrationState *s)
>> +{
>> + DPRINTF("qemu_rdma_close\n");
>> + qemu_rdma_cleanup(&s->rdma);
>> + return 0;
>> +}
>> +
>> +void rdma_start_outgoing_migration(MigrationState *s, const char *host_port, Error **errp)
>> +{
>> + int ret;
>> +
>> +#ifndef CONFIG_RDMA
>> + error_set(errp, QERR_FEATURE_DISABLED, "rdma migration");
>> + return;
>> +#endif
>> +
>> + if (qemu_rdma_data_init(&s->rdma, host_port, errp) < 0)
>> + return;
>> +
>> + ret = qemu_rdma_client_init(&s->rdma, NULL);
>> + if(!ret) {
>> + DPRINTF("qemu_rdma_client_init success\n");
>> + ret = qemu_rdma_client_connect(&s->rdma, NULL);
>> +
>> + if(!ret) {
>> + s->get_error = qemu_rdma_errno;
>> + s->write = qemu_rdma_send;
>> + s->close = qemu_rdma_close;
>> + s->fd = -2;
>> + DPRINTF("qemu_rdma_client_connect success\n");
>> + migrate_fd_connect(s);
>> + return;
>> + }
>> + }
>> +
>> + s->fd = -1;
>> + migrate_fd_error(s);
>> +}
>>
> Note that as soon as the pending migration patches hit upstream, almost
> all of this code will move to QEMUFileRDMA (in particular qemu_rdma_send
> and qemu_rdma_close). It should become simpler.
>
> Paolo
>
end of thread, other threads: [~2013-03-11 20:20 UTC | newest]
Thread overview: 22+ messages
-- links below jump to the message on this page --
[not found] <1362976414-21396-1-git-send-email-mrhines@us.ibm.com>
[not found] ` <1362976414-21396-4-git-send-email-mrhines@us.ibm.com>
2013-03-11 11:51 ` [Qemu-devel] [RFC PATCH RDMA support v3: 03/10] documentation of RDMA protocol in docs/rdma.txt Michael S. Tsirkin
2013-03-11 16:24 ` Michael R. Hines
2013-03-11 17:05 ` Michael S. Tsirkin
2013-03-11 17:17 ` Michael R. Hines
2013-03-11 17:19 ` Michael S. Tsirkin
2013-03-11 17:35 ` Michael R. Hines
[not found] ` <1362976414-21396-3-git-send-email-mrhines@us.ibm.com>
2013-03-11 13:35 ` [Qemu-devel] [RFC PATCH RDMA support v3: 02/10] Link in new migration-rdma.c and rmda.c files Paolo Bonzini
2013-03-11 16:25 ` Michael R. Hines
[not found] ` <1362976414-21396-9-git-send-email-mrhines@us.ibm.com>
2013-03-11 13:40 ` [Qemu-devel] [RFC PATCH RDMA support v3: 08/10] Introduce QEMUFileRDMA Paolo Bonzini
2013-03-11 16:26 ` Michael R. Hines
2013-03-11 16:26 ` Michael R. Hines
[not found] ` <1362976414-21396-6-git-send-email-mrhines@us.ibm.com>
2013-03-11 13:41 ` [Qemu-devel] [RFC PATCH RDMA support v3: 05/10] RDMA connection establishment (migration-rdma.c) Paolo Bonzini
2013-03-11 16:28 ` Michael R. Hines
2013-03-11 20:20 ` Michael R. Hines
[not found] ` <1362976414-21396-7-git-send-email-mrhines@us.ibm.com>
2013-03-11 13:49 ` [Qemu-devel] [RFC PATCH RDMA support v3: 06/10] Introduce 'max_iterations' and Call out to migration-rdma.c when requested Paolo Bonzini
2013-03-11 16:30 ` Michael R. Hines
[not found] ` <1362976414-21396-8-git-send-email-mrhines@us.ibm.com>
2013-03-11 13:59 ` [Qemu-devel] [RFC PATCH RDMA support v3: 07/10] Send the actual pages over RDMA Paolo Bonzini
2013-03-11 16:31 ` Michael R. Hines
[not found] ` <1362976414-21396-11-git-send-email-mrhines@us.ibm.com>
2013-03-11 14:00 ` [Qemu-devel] [RFC PATCH RDMA support v3: 10/10] Parse RDMA host/port out of the QMP string Paolo Bonzini
2013-03-11 16:32 ` Michael R. Hines
[not found] ` <1362976414-21396-10-git-send-email-mrhines@us.ibm.com>
2013-03-11 14:07 ` [Qemu-devel] [RFC PATCH RDMA support v3: 09/10] Move RAMBlock to cpu-common.h Paolo Bonzini
2013-03-11 16:34 ` Michael R. Hines