From: mrhines@linux.vnet.ibm.com
To: qemu-devel@nongnu.org
Cc: aliguori@us.ibm.com, mst@redhat.com, owasserm@redhat.com,
abali@us.ibm.com, mrhines@us.ibm.com, gokul@us.ibm.com,
pbonzini@redhat.com
Subject: [Qemu-devel] [RFC PATCH RDMA support v4: 03/10] more verbose documentation of the RDMA transport
Date: Sun, 17 Mar 2013 23:18:56 -0400 [thread overview]
Message-ID: <1363576743-6146-4-git-send-email-mrhines@linux.vnet.ibm.com> (raw)
In-Reply-To: <1363576743-6146-1-git-send-email-mrhines@linux.vnet.ibm.com>
From: "Michael R. Hines" <mrhines@us.ibm.com>
This tries to cover all the questions I got the last time.
Please do tell me what is not clear, and I'll revise again.
Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
---
docs/rdma.txt | 208 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 208 insertions(+)
create mode 100644 docs/rdma.txt
diff --git a/docs/rdma.txt b/docs/rdma.txt
new file mode 100644
index 0000000..2a48ab0
--- /dev/null
+++ b/docs/rdma.txt
@@ -0,0 +1,208 @@
+Changes since v3:
+
+- Compile-tested both with and without --enable-rdma; both builds work.
+- Updated docs/rdma.txt (included below)
+- Merged with latest pull queue from Paolo
+- Implemented qemu_ram_foreach_block()
+
+mrhines@mrhinesdev:~/qemu$ git diff --stat master
+Makefile.objs | 1 +
+arch_init.c | 28 +-
+configure | 25 ++
+docs/rdma.txt | 190 +++++++++++
+exec.c | 21 ++
+include/exec/cpu-common.h | 6 +
+include/migration/migration.h | 3 +
+include/migration/qemu-file.h | 10 +
+include/migration/rdma.h | 269 ++++++++++++++++
+include/qemu/sockets.h | 1 +
+migration-rdma.c | 205 ++++++++++++
+migration.c | 19 +-
+rdma.c | 1511 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
+savevm.c | 172 +++++++++-
+util/qemu-sockets.c | 2 +-
+15 files changed, 2445 insertions(+), 18 deletions(-)
+
+QEMUFileRDMA:
+==================================
+
+QEMUFileRDMA introduces a couple of new functions:
+
+1. qemu_rdma_get_buffer() (QEMUFileOps rdma_read_ops)
+2. qemu_rdma_put_buffer() (QEMUFileOps rdma_write_ops)
+
+These two functions provide an RDMA transport
+(not a protocol) without changing the upper-level
+users of QEMUFile that depend on a bytestream abstraction.
+
+In order to provide the same bytestream interface
+for RDMA, we use SEND messages instead of sockets.
+The operations themselves and the protocol built on
+top of QEMUFile used throughout the migration
+process do not change whatsoever.
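+
+As a reference point, here is a minimal sketch of how the two
+hooks could be wired into QEMUFile. The QEMUFileOps fields match
+this era of qemu-file.h, but qemu_rdma_close(), RDMAContext, and
+qemu_fopen_rdma() are illustrative assumptions, not the patch's
+exact code:
+
+    static const QEMUFileOps rdma_read_ops = {
+        .get_buffer = qemu_rdma_get_buffer,   /* destination side */
+        .close      = qemu_rdma_close,
+    };
+
+    static const QEMUFileOps rdma_write_ops = {
+        .put_buffer = qemu_rdma_put_buffer,   /* source side */
+        .close      = qemu_rdma_close,
+    };
+
+    QEMUFile *qemu_fopen_rdma(RDMAContext *rdma, const char *mode)
+    {
+        const QEMUFileOps *ops =
+            (mode[0] == 'w') ? &rdma_write_ops : &rdma_read_ops;
+        return qemu_fopen_ops(rdma, ops);
+    }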
+
+An infiniband SEND message is the standard ibverbs
+message used by applications of infiniband hardware.
+The only difference between a SEND message and an RDMA
+message is that SEND messages cause completion notifications
+to be posted to the completion queue (CQ) on the
+infiniband receiver side, whereas RDMA messages (used
+for pc.ram) do not (to behave like an actual DMA).
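+
+To make the distinction concrete, a sketch using the standard
+ibverbs work-request API (assumes <infiniband/verbs.h>; the 'sge',
+'qp', and remote address/rkey variables would come from connection
+setup and are assumptions here):
+
+    struct ibv_send_wr wr = { 0 }, *bad_wr;
+
+    wr.sg_list = &sge;
+    wr.num_sge = 1;
+
+    if (is_device_state) {
+        /* SEND: posts a completion to the receiver's CQ */
+        wr.opcode = IBV_WR_SEND;
+    } else {
+        /* RDMA write (pc.ram): silent at the receiver, like a DMA */
+        wr.opcode = IBV_WR_RDMA_WRITE;
+        wr.wr.rdma.remote_addr = remote_addr;
+        wr.wr.rdma.rkey        = remote_rkey;
+    }
+
+    ibv_post_send(qp, &wr, &bad_wr);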
+
+Messages in infiniband require two things (sketched below):
+
+1. registration of the memory that will be transmitted
+2. (SEND only) work requests to be posted on both
+ sides of the network before the actual transmission
+ can occur.
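+
+Both requirements map onto single ibverbs calls. A minimal sketch,
+assuming a protection domain 'pd' and queue pair 'qp' from
+connection establishment:
+
+    /* 1. registration pins the memory and yields the access keys */
+    struct ibv_mr *mr = ibv_reg_mr(pd, buf, len,
+                                   IBV_ACCESS_LOCAL_WRITE |
+                                   IBV_ACCESS_REMOTE_WRITE);
+
+    /* 2. (SEND only) reserve space on the receive queue (RQ) */
+    struct ibv_sge sge = {
+        .addr   = (uintptr_t) buf,
+        .length = len,
+        .lkey   = mr->lkey,
+    };
+    struct ibv_recv_wr rwr = { .sg_list = &sge, .num_sge = 1 }, *bad_rwr;
+    ibv_post_recv(qp, &rwr, &bad_rwr);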
+
+RDMA messages are much easier to deal with. Once the memory
+on the receiver side is registered and pinned, we're
+basically done. All that is required is for the sender
+side to start dumping bytes onto the link.
+
+SEND messages require more coordination because the
+receiver must have reserved space (using a receive
+work request) on the receive queue (RQ) before QEMUFileRDMA
+can start using them to carry all the bytes as
+a transport for migration of device state.
+
+After the initial connection setup (migration-rdma.c),
+this coordination starts by having both sides post
+a single work request to the RQ before any users
+of QEMUFile are activated.
+
+Once an initial receive work request is posted,
+we have a put_buffer()/get_buffer() implementation
+that looks like this:
+
+Logically:
+
+qemu_rdma_get_buffer():
+
+1. A user on top of QEMUFile calls ops->get_buffer(),
+ which calls us.
+2. We transmit an empty SEND to let the sender know that
+ we are *ready* to receive some bytes from QEMUFileRDMA.
+ These bytes will come in the form of another SEND.
+3. Before attempting to receive that SEND, we post another
+ RQ work request to replace the one we just used up.
+4. Block on a CQ event channel and wait for the SEND
+ to arrive.
+5. When the SEND arrives, librdmacm will unblock us
+ and we can consume the bytes (described later; the
+ blocking idiom for steps 4-5 is sketched below).
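+
+Steps 4-5 are the standard ibverbs completion-channel idiom. A
+sketch, assuming 'cq' and 'comp_channel' were created during
+connection setup:
+
+    struct ibv_cq *ev_cq;
+    void *ev_ctx;
+    struct ibv_wc wc;
+
+    ibv_req_notify_cq(cq, 0);                        /* arm notification */
+    ibv_get_cq_event(comp_channel, &ev_cq, &ev_ctx); /* block until CQE  */
+    ibv_ack_cq_events(ev_cq, 1);
+    ibv_req_notify_cq(ev_cq, 0);                     /* re-arm           */
+
+    while (ibv_poll_cq(cq, 1, &wc) > 0) {
+        /* wc.opcode == IBV_WC_RECV: the SEND with our bytes arrived */
+    }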
+
+qemu_rdma_put_buffer():
+
+1. A user on top of QEMUFile calls ops->put_buffer(),
+ which calls us.
+2. Block on the CQ event channel waiting for a SEND
+ from the receiver to tell us that the receiver
+ is *ready* for us to transmit some new bytes.
+3. When the "ready" SEND arrives, librdmacm will
+ unblock us and we immediately post an RQ work request
+ to replace the one we just used up.
+4. Now, we can actually deliver the bytes that
+ put_buffer() wants and return (the whole sequence
+ is condensed in the sketch below).
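+
+Condensed into code, the sequence looks roughly like this. The
+helpers wait_for_ready(), post_recv() and send_bytes() are
+hypothetical names for the steps above; only the signature
+matches QEMUFileOps:
+
+    static int qemu_rdma_put_buffer(void *opaque, const uint8_t *buf,
+                                    int64_t pos, int size)
+    {
+        RDMAContext *rdma = opaque;
+
+        wait_for_ready(rdma);        /* 2. block for the "ready" SEND  */
+        post_recv(rdma);             /* 3. replace the used RQ request */
+        send_bytes(rdma, buf, size); /* 4. SEND the actual payload     */
+        return size;
+    }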
+
+NOTE: This entire sequence of events is designed this
+way to mimic the operations of a bytestream and is not
+typical of an infiniband application. (Something like MPI
+would not 'ping-pong' messages like this and would not
+block after every request, which would normally defeat
+the purpose of using zero-copy infiniband in the first place).
+
+Finally, how do we hand off the actual bytes to get_buffer()?
+
+Again, because we're trying to "fake" a bytestream abstraction
+using an analogy not unlike individual UDP frames, we have
+to hold on to the bytes received from SEND in memory.
+
+Each time we get to "Step 5" above for get_buffer(),
+the bytes from SEND are copied into a local holding buffer.
+
+Then, we return the number of bytes requested by get_buffer()
+and leave the remaining bytes in the buffer until get_buffer()
+comes around for another pass.
+
+If the buffer is empty, then we follow the same steps
+listed above for qemu_rdma_get_buffer() and block waiting
+for another SEND message to re-fill the buffer.
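+
+Putting those cases together, a sketch of the holding-buffer logic
+(the field and helper names are illustrative, not the patch's
+actual ones):
+
+    static int qemu_rdma_get_buffer(void *opaque, uint8_t *buf,
+                                    int64_t pos, int size)
+    {
+        RDMAContext *rdma = opaque;
+
+        if (rdma->hold_len == 0) {
+            /* empty: block for another SEND and refill (steps 2-5) */
+            qemu_rdma_fill_buffer(rdma);
+        }
+
+        int len = MIN(size, rdma->hold_len);
+        memcpy(buf, rdma->hold_buf + rdma->hold_off, len);
+        rdma->hold_off += len;  /* leave the rest for the next pass */
+        rdma->hold_len -= len;
+        return len;
+    }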
+
+Migration of pc.ram:
+===============================
+
+At the beginning of the migration (migration-rdma.c),
+the sender and the receiver each populate a structure
+with the list of RAMBlocks to be registered with the
+other side.
+
+Then, using a single SEND message, they exchange this
+structure with each other, to be used later during the
+iteration of main memory. This structure includes a list
+of all the RAMBlocks, their offsets and lengths.
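+
+A hypothetical layout for one entry of that structure (the field
+names are illustrative):
+
+    typedef struct RDMARemoteBlock {
+        uint64_t remote_host_addr; /* base address on the remote side */
+        uint64_t offset;           /* offset in the ram_addr_t space  */
+        uint64_t length;           /* length of the RAMBlock          */
+        uint32_t remote_rkey;      /* rkey for RDMA writes into it    */
+    } RDMARemoteBlock;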
+
+Main memory is not migrated with SEND infiniband
+messages, but is instead migrated with RDMA infiniband
+messages.
+
+Memory is migrated in "chunks" (about 64 pages right now).
+Chunk size is not dynamic, but it could be in a future
+implementation.
+
+When 64 pages have been aggregated (or a flush() occurs),
+the memory backed by the chunk on the sender side is
+registered with librdmacm and pinned in memory.
+
+After pinning, an RDMA write is generated and transmitted
+for the entire chunk.
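+
+Roughly, the sender-side chunk path looks like this (the chunk
+bookkeeping and RDMA_CHUNK_PAGES are illustrative; ibv_reg_mr() is
+the real registration call):
+
+    if (chunk->num_pages == RDMA_CHUNK_PAGES || flushing) {
+        if (!chunk->mr) {
+            /* register + pin the chunk's backing memory on first use */
+            chunk->mr = ibv_reg_mr(pd, chunk->start, chunk->len,
+                                   IBV_ACCESS_LOCAL_WRITE);
+        }
+        rdma_write_chunk(rdma, chunk); /* one RDMA write per chunk */
+        chunk->num_pages = 0;
+    }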
+
+Error-handling:
+===============================
+
+Infiniband has what is called a "Reliable, Connected"
+link (one of 4 choices). This is the mode we use
+for RDMA migration.
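+
+With librdmacm, reliable-connected semantics fall out of the
+RDMA_PS_TCP port space (assumes <rdma/rdma_cma.h>); a sketch of
+the id creation:
+
+    struct rdma_event_channel *ec = rdma_create_event_channel();
+    struct rdma_cm_id *id;
+
+    /* RDMA_PS_TCP yields a reliable, connected queue pair */
+    rdma_create_id(ec, &id, NULL, RDMA_PS_TCP);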
+
+If a *single* message fails,
+the decision is to abort the migration entirely and
+clean up all the RDMA descriptors and unregister all
+the memory.
+
+After cleanup, the Virtual Machine is returned to normal
+operation the same way it would be if the TCP socket
+were broken during a non-RDMA based migration.
+
+USAGE
+===============================
+
+Compiling:
+
+$ ./configure --enable-rdma --target-list=x86_64-softmmu
+
+$ make
+
+Command-line on the Source machine AND Destination:
+
+$ virsh qemu-monitor-command --hmp --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device
+
+Finally, perform the actual migration:
+
+$ virsh migrate domain rdma:xx.xx.xx.xx:port
+
+PERFORMANCE
+===================
+
+Using a 40gbps infiniband link, performing a worst-case stress test:
+
+1. Average worst-case RDMA throughput with
+   $ stress --vm-bytes 1024M --vm 1 --vm-keep
+   Approximately 30 gbps (a little better than the paper)
+
+2. Average worst-case TCP throughput with the same test
+   Approximately 8 gbps (using IPoIB, IP over Infiniband)
+
+Average downtime (stop time) ranges between 28 and 33 milliseconds.
+
+An *exhaustive* paper (2010) with additional performance
+details is linked on the QEMU wiki:
+
+http://wiki.qemu.org/Features/RDMALiveMigration
--
1.7.10.4