From: "Michael S. Tsirkin" <mst@redhat.com>
To: "Michael R. Hines" <mrhines@linux.vnet.ibm.com>
Cc: aliguori@us.ibm.com, michael.r.hines.mrhines@linux.vnet.ibm.com,
qemu-devel@nongnu.org, owasserm@redhat.com, abali@us.ibm.com,
mrhines@us.ibm.com, gokul@us.ibm.com, pbonzini@redhat.com
Subject: Re: [Qemu-devel] [RFC PATCH RDMA support v3: 03/10] documentation of RDMA protocol in docs/rdma.txt
Date: Mon, 11 Mar 2013 19:05:15 +0200
Message-ID: <20130311170515.GB28930@redhat.com>
In-Reply-To: <513E0555.1060205@linux.vnet.ibm.com>
On Mon, Mar 11, 2013 at 12:24:53PM -0400, Michael R. Hines wrote:
> Excellent questions: answers inline.........
>
> On 03/11/2013 07:51 AM, Michael S. Tsirkin wrote:
> >+RDMA-based live migration protocol
> >+==================================
> >+
> >+We use two kinds of RDMA messages:
> >+
> >+1. RDMA WRITES (to the receiver)
> >+2. RDMA SEND (for non-live state, like devices and CPU)
> >Something's missing here.
> >Don't you need to know remote addresses before doing RDMA writes?
>
> Yes, it looks like I need to do some more "teaching" about InfiniBand / RDMA
> inside the documentation.
>
> I was trying not to make it too long, but it seems I over-estimated
> the ubiquity of RDMA and I'll have to include some background information
> about the programming model and memory model used by RDMA.
Well that's exactly the question. As far as I remember the
RDMA memory model, you need to know a key and address to
execute RDMA writes. Remote memory also needs to be locked,
so you need some mechanism to lock chunks of memory,
do RDMA write and unlock when done.
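
To make the question concrete, here is a toy model of the sequence I have in mind: pin a region, hand out a key, write, unpin. It is only a sketch in Python, not the verbs API; the class and method names (RemoteMemory, register, rdma_write, unregister) are made up for illustration.

```python
import itertools


class RemoteMemory:
    """Toy stand-in for the destination's RAM plus its registered regions."""

    def __init__(self, size):
        self.ram = bytearray(size)
        self.regions = {}                 # key -> (start, length)
        self._keys = itertools.count(1)   # hypothetical key allocator

    def register(self, start, length):
        """'Lock' [start, start+length) and return the key a peer must present."""
        key = next(self._keys)
        self.regions[key] = (start, length)
        return key

    def unregister(self, key):
        """'Unlock' the region; later writes with this key are rejected."""
        del self.regions[key]

    def rdma_write(self, key, addr, data):
        """Accept a write only for a registered key and an in-range address."""
        if key not in self.regions:
            raise PermissionError("unknown or stale key")
        start, length = self.regions[key]
        if addr < start or addr + len(data) > start + length:
            raise PermissionError("write outside registered region")
        self.ram[addr:addr + len(data)] = data


dest = RemoteMemory(1024)
key = dest.register(256, 128)      # destination pins a chunk, shares (key, address)
dest.rdma_write(key, 256, b"abc")  # source writes using that key and address
dest.unregister(key)               # chunk is unpinned when done
```

The point is that the source cannot write until it has learned both the key and the address from the destination, so the document needs to say how that exchange happens.
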
> >>+
> >>+First, migration-rdma.c does the initial connection establishment
> >>+using the URI 'rdma:host:port' on the QMP command line.
> >>+
> >>+Second, the normal live migration process kicks in for 'pc.ram'.
> >>+
> >>+During iterative phase of the migration, only RDMA WRITE messages
> >>+are used. Messages are grouped into "chunks" which get pinned by
> >>+the hardware in 64-page increments. Each chunk is acknowledged in
> >>+the Queue Pair's completion queue (not the individual pages).
> >>+
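
As I read the paragraph above, the chunk arithmetic would look roughly like this. A sketch only: the 64-page figure is from the text, the 4 KiB page size is my assumption, and the function names are hypothetical.

```python
PAGE_SIZE = 4096          # assumption: 4 KiB pages
PAGES_PER_CHUNK = 64      # from the description above
CHUNK_SIZE = PAGE_SIZE * PAGES_PER_CHUNK


def chunk_of(offset):
    """Index of the chunk containing a byte offset into a RAM block."""
    return offset // CHUNK_SIZE


def chunks_for_write(offset, length):
    """Chunk indexes touched by the byte range [offset, offset + length)."""
    if length <= 0:
        return range(0)
    return range(chunk_of(offset), chunk_of(offset + length - 1) + 1)
```

Under these assumptions a chunk is 256 KiB, and a write that straddles a chunk boundary needs both chunks pinned before it is posted.
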
> >>+During iteration of RAM, there are no messages sent, just RDMA writes.
> >>+During the last iteration, once the devices and CPU are ready to be
> >>+sent, we begin to use the RDMA SEND messages.
> >It's unclear whether you are switching modes here, if yes
> >assuming CPU/device state is only sent during
> >the last iteration would break post-migration so
> >is probably not a good choice for a protocol.
>
> I made a bad choice of words ...... I'll correct the documentation.
>
> >
> >>+Due to the asynchronous nature of RDMA, the receiver of the migration
> >>+must post Receive work requests in the queue *before* a SEND work request
> >>+can be posted.
> >>+
> >>+To achieve this, both sides perform an initial 'barrier' synchronization.
> >>+Before the barrier, we already know that both sides have a receive work
> >>+request posted,
> >How?
>
> While I was coding last night, I was able to eliminate this barrier.
>
> >>and then both sides exchange and block on the completion
> >>+queue waiting for each other to know the other peer is alive and ready
> >>+to send the rest of the live migration state (qemu_send/recv_barrier()).
> >How much?
>
> The remaining migration state is typically < 100K (usually more like 17-32K)
>
> Most of this gets sent during qemu_savevm_state_complete() during
> the last iteration.
>
> >>+At this point, the use of QEMUFile between both sides for communication
> >>+proceeds as normal.
> >>+The difference between TCP and SEND comes in migration-rdma.c: Since
> >>+we cannot simply dump the bytes into a socket, instead a SEND message
> >>+must be preceded by one side instructing the other side *exactly* how
> >>+many bytes the SEND message will contain.
> >instructing how? Presumably you use some protocol for this?
>
> Yes, I'll be more verbose. Sorry about that =)
>
> (Basically, the length of the SEND is stored inside the SEND message itself.)
>
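
If the length travels inside the SEND itself, I would expect framing along these lines. This is a guess at the shape, not the patch's wire format: the 4-byte big-endian header and the names pack_send / unpack_send are assumptions.

```python
import struct

# Assumed header: one unsigned 32-bit length in network byte order.
HEADER = struct.Struct("!I")


def pack_send(payload):
    """Prefix the payload with its length so the receiver knows how much to read."""
    return HEADER.pack(len(payload)) + payload


def unpack_send(message):
    """Read the length header, then return exactly that many payload bytes."""
    (length,) = HEADER.unpack_from(message)
    payload = message[HEADER.size:HEADER.size + length]
    if len(payload) != length:
        raise ValueError("SEND shorter than its header claims")
    return payload
```

Whatever the real layout is, the document should spell it out, including byte order.
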
> >>+Each time a SEND is received, the receiver buffers the message and
> >>+divvies out the bytes from the SEND to the qemu_loadvm_state() function
> >>+until all the bytes from the buffered SEND message have been exhausted.
> >>+
> >>+Before the SEND is exhausted, the receiver sends an 'ack' SEND back
> >>+to the sender to let the savevm_state_* functions know that they
> >>+can resume and start generating more SEND messages.
> >The above two paragraphs seem very opaque to me.
> >what's an 'ack' SEND, how do you know whether SEND
> >is exhausted?
>
> More verbosity needed here too =). Exhaustion is detected because
> the SEND bytes are copied into a buffer and then whenever
> QEMUFile functions request more bytes from the buffer, we check
> how many bytes are available from the last SEND message (which
> was copied locally) to be handed back to QEMUFile functions.
>
> If there are no bytes left in the buffer, we block and wait for
> another SEND message.
You need some way to make sure there's a buffer available
for that SEND message though.
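
Something like the following toy model is what I mean. It is a sketch, not your code: the Receiver class and its methods are hypothetical, and returning an empty result stands in for blocking on the completion queue.

```python
from collections import deque


class Receiver:
    """Toy receiver: drains one buffered SEND at a time and re-posts a buffer."""

    def __init__(self):
        self.posted = 0          # receive work requests currently posted
        self.arrived = deque()   # SEND payloads that landed in posted buffers
        self.current = b""       # local copy of the last SEND
        self.pos = 0             # bytes of self.current already handed out

    def post_recv(self):
        self.posted += 1

    def deliver(self, payload):
        """Sender side of the model: a SEND needs a posted receive buffer."""
        if self.posted == 0:
            raise RuntimeError("no receive buffer posted")
        self.posted -= 1
        self.arrived.append(bytes(payload))

    def read(self, n):
        """Hand up to n bytes to the caller (the QEMUFile role)."""
        if self.pos == len(self.current):
            if not self.arrived:
                return b""               # real code would block here
            self.current = self.arrived.popleft()
            self.pos = 0
            self.post_recv()             # make room for the next SEND
        chunk = self.current[self.pos:self.pos + n]
        self.pos += len(chunk)
        return chunk
```

In this model the sender may only SEND again once it knows post_recv() has run, which is presumably what the 'ack' is for; the document should state that explicitly.
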
> >>+This ping-pong of SEND messages
> >BTW, if by ping-pong you mean something like this:
> > source "I have X bytes"
> > destination "ok send me X bytes"
> > source sends X bytes
> >then you could put the address in the destination response and
> >use RDMA for sending X bytes.
> >It's up to you but it might simplify the protocol as
> >the only thing you send would be buffer management messages.
> No, you can't do that because RDMA writes do not produce
> completion queue (CQ) notifications on the receiver side.
>
> Thus, there's no way for the receiver to know data was received.
>
> You still need regular SEND message to handle it.
>
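
To check that I follow the argument above, here it is as a toy model. The Destination class is hypothetical and models only plain RDMA writes and SENDs as described in this thread.

```python
class Destination:
    """Toy destination: RDMA writes change RAM silently, SENDs show up in the CQ."""

    def __init__(self, size):
        self.ram = bytearray(size)
        self.cq = []   # completion entries the receiver can poll

    def rdma_write(self, addr, data):
        # Data lands in memory, but the receiver gets no completion for it.
        self.ram[addr:addr + len(data)] = data

    def send(self, message):
        # A SEND consumes a posted receive and produces a completion entry.
        self.cq.append(message)


dest = Destination(16)
dest.rdma_write(0, b"hi")          # receiver cannot tell this happened
dest.send(("written", 0, 2))       # a SEND is what tells it where to look
```

So in this model the SEND after the write is what carries the notification.
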
> >>happens until the live migration completes.
> >Any way to tear down the connection in case of errors?
>
> Yes, I'll add all these questions to the updated documentation ASAP.
>
>
> >>+
> >>+USAGE
> >>+===============================
> >>+
> >>+Compiling:
> >>+
> >>+$ ./configure --enable-rdma --target-list=x86_64-softmmu
> >>+
> >>+$ make
> >>+
> >>+Command-line on the Source machine AND Destination:
> >>+
> >>+$ virsh qemu-monitor-command --hmp --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device
> >>+
> >>+Finally, perform the actual migration:
> >>+
> >>+$ virsh migrate domain rdma:xx.xx.xx.xx:port
> >>+
> >>+PERFORMANCE
> >>+===================
> >>+
> >>+Using a 40 Gbps InfiniBand link performing a worst-case stress test:
> >>+
> >>+Average worst-case throughput:
> >>+
> >>+1. RDMA throughput with $ stress --vm-bytes 1024M --vm 1 --vm-keep
> >>+   Approximately 30 Gbps (a little better than the paper)
> >>+2. TCP throughput with $ stress --vm-bytes 1024M --vm 1 --vm-keep
> >>+   Approximately 8 Gbps (using IPoIB, IP over InfiniBand)
> >>+
> >>+Average downtime (stop time) ranges between 28 and 33 milliseconds.
> >>+
> >>+An *exhaustive* paper (2010) shows additional performance details
> >>+linked on the QEMU wiki:
> >>+
> >>+http://wiki.qemu.org/Features/RDMALiveMigration
> >>--
> >>1.7.10.4