From: "Michael R. Hines" <mrhines@linux.vnet.ibm.com>
To: qemu-devel@nongnu.org
Cc: aliguori@us.ibm.com, mst@redhat.com, owasserm@redhat.com,
abali@us.ibm.com, mrhines@us.ibm.com, gokul@us.ibm.com,
pbonzini@redhat.com
Subject: Re: [Qemu-devel] [RFC PATCH RDMA support v5: 00/12] new formal protocol design
Date: Tue, 09 Apr 2013 00:24:48 -0400
Message-ID: <51639810.8050607@linux.vnet.ibm.com>
In-Reply-To: <1365476681-31593-1-git-send-email-mrhines@linux.vnet.ibm.com>
FYI: A testable patchset can be found here: github.com:hinesmr/qemu.git,
'rdma' branch
- Michael
On 04/08/2013 11:04 PM, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
>
> Changes since v4:
>
> - Created a "formal" protocol for the RDMA control channel
> - Dynamic, chunked page registration now implemented on *both* the server and client
> - Created new 'capability' for page registration
> - Created new 'capability' for is_zero_page() - enabled by default
> (needed to test dynamic page registration)
> - Created version-check before protocol begins at connection-time
> - no more migrate_use_rdma() !
>
> NOTE: While dynamic registration works on both sides now,
> it does *not* work with cgroups swap limits. This functionality with infiniband
> remains broken. (It works fine with TCP). So, in order to take full
> advantage of this feature, a fix will have to be developed on the kernel side.
> The proposed alternative is to use /proc/<pid>/pagemap. A patch will be submitted.
>
> Contents:
> =================================
> * Compiling
> * Running (please read before running)
> * RDMA Protocol Description
> * Versioning
> * QEMUFileRDMA Interface
> * Migration of pc.ram
> * Error handling
> * TODO
> * Performance
>
> COMPILING:
> ===============================
>
> $ ./configure --enable-rdma --target-list=x86_64-softmmu
> $ make
>
> RUNNING:
> ===============================
>
> First, decide if you want dynamic page registration on the server-side.
> This always happens on the primary-VM side, but is optional on the server.
> Doing this allows you to support overcommit (such as cgroups or ballooning)
> with a smaller footprint on the server-side without having to register the
> entire VM memory footprint.
> NOTE: This significantly slows down performance (about 30% slower).
>
> $ virsh qemu-monitor-command --hmp \
> --cmd "migrate_set_capability chunk_register_destination on" # disabled by default
>
> Next, if you decided *not* to use chunked registration on the server,
> it is recommended to disable zero page detection as well. While this is
> not strictly necessary, zero page detection significantly slows down
> performance on higher-throughput links (by about 50%), such as 40 gbps infiniband cards:
>
> $ virsh qemu-monitor-command --hmp \
> --cmd "migrate_set_capability check_for_zero off" # always enabled by default
>
> Next, set the migration speed to match your hardware's capabilities:
>
> $ virsh qemu-monitor-command --hmp \
> --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device
>
> Finally, perform the actual migration:
>
> $ virsh migrate domain rdma:xx.xx.xx.xx:port
>
> RDMA Protocol Description:
> =================================
>
> Migration with RDMA is separated into two parts:
>
> 1. The transmission of the pages using RDMA
> 2. Everything else (a control channel is introduced)
>
> "Everything else" is transmitted using a formal
> protocol now, consisting of infiniband SEND / RECV messages.
>
> An infiniband SEND message is the standard ibverbs
> message used by applications of infiniband hardware.
> The only difference between a SEND message and an RDMA
> message is that SEND messages cause completion notifications
> to be posted to the completion queue (CQ) on the
> infiniband receiver side, whereas RDMA messages (used
> for pc.ram) do not (to behave like an actual DMA).
>
> Messages in infiniband require two things:
>
> 1. registration of the memory that will be transmitted
> 2. (SEND/RECV only) work requests to be posted on both
> sides of the network before the actual transmission
> can occur.
>
> RDMA messages are much easier to deal with. Once the memory
> on the receiver side is registered and pinned, we're
> basically done. All that is required is for the sender
> side to start dumping bytes onto the link.
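>
> For illustration, that registration step is a single verbs call. A
> minimal sketch (the pd, chunk_start, and chunk_len variables are
> assumed to exist; error handling omitted):
>
>     #include <infiniband/verbs.h>
>
>     /* Register (and thereby pin) a region so the HCA can DMA into
>      * it; the rkey in the returned ibv_mr is what the remote peer
>      * needs in order to target this memory with RDMA Writes. */
>     struct ibv_mr *mr = ibv_reg_mr(pd, chunk_start, chunk_len,
>                                    IBV_ACCESS_LOCAL_WRITE |
>                                    IBV_ACCESS_REMOTE_WRITE);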
>
> SEND messages require more coordination because the
> receiver must have reserved space (using a receive
> work request) on the receive queue (RQ) before QEMUFileRDMA
> can start using them to carry all the bytes as
> a transport for migration of device state.
>
> To begin the migration, the initial connection setup is
> as follows (migration-rdma.c):
>
> 1. Receiver and Sender are started (command line or libvirt)
> 2. Both sides post two RQ work requests
> 3. Receiver does listen()
> 4. Sender does connect()
> 5. Receiver accept()
> 6. Check versioning and capabilities (described later)
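>
> As a rough sketch, the receiver's half of these steps using
> librdmacm might look like this (QP/CQ creation and error handling
> omitted; variable names are illustrative, not the patch's actual
> internals):
>
>     #include <rdma/rdma_cma.h>
>
>     struct rdma_event_channel *ec = rdma_create_event_channel();
>     struct rdma_cm_id *listen_id;
>     struct rdma_cm_event *ev;
>     struct rdma_conn_param param = { 0 };
>
>     rdma_create_id(ec, &listen_id, NULL, RDMA_PS_TCP);
>     rdma_bind_addr(listen_id, (struct sockaddr *)&addr);
>     rdma_listen(listen_id, 1);                  /* 3. listen()   */
>     rdma_get_cm_event(ec, &ev);                 /* connect req   */
>     /* ... create the QP and post the two RQ work requests here,
>      *     so no SEND can arrive before we are ready (step 2) ... */
>     rdma_accept(ev->id, &param);                /* 5. accept()   */
>     rdma_ack_cm_event(ev);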
>
> At this point, we define a control channel on top of SEND messages
> which is described by a formal protocol. Each SEND message has a
> header portion and a data portion (but the two are transmitted
> together as a single SEND message).
>
> Header:
> * Length (of the data portion)
> * Type (what command to perform, described below)
> * Version (protocol version validated before send/recv occurs)
>
> The 'type' field has 7 different command values:
> 1. None
> 2. Ready (control-channel is available)
> 3. QEMU File (for sending non-live device state)
> 4. RAM Blocks (used right after connection setup)
> 5. Register request (dynamic chunk registration)
> 6. Register result ('rkey' to be used by sender)
> 7. Register finished (registration for current iteration finished)
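>
> In C this maps naturally onto a fixed-size header struct plus an
> enum for the 'type' field (the field and constant names here are
> illustrative, not necessarily those used in the patch):
>
>     typedef struct RDMAControlHeader {
>         uint32_t len;     /* length of the data portion        */
>         uint32_t type;    /* command to perform, see below     */
>         uint32_t version; /* validated before send/recv occurs */
>     } RDMAControlHeader;
>
>     enum {
>         RDMA_CONTROL_NONE = 0,
>         RDMA_CONTROL_READY,              /* channel is available   */
>         RDMA_CONTROL_QEMU_FILE,          /* non-live device state  */
>         RDMA_CONTROL_RAM_BLOCKS,         /* after connection setup */
>         RDMA_CONTROL_REGISTER_REQUEST,   /* dynamic registration   */
>         RDMA_CONTROL_REGISTER_RESULT,    /* rkey for the sender    */
>         RDMA_CONTROL_REGISTER_FINISHED,  /* iteration finished     */
>     };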
>
> After connection setup is completed, we have two protocol-level
> functions, responsible for communicating control-channel commands
> using the above list of values:
>
> Logically:
>
> qemu_rdma_exchange_recv(header, expected command type)
>
> 1. We transmit a READY command to let the sender know that
> we are *ready* to receive some data bytes on the control channel.
> 2. Before attempting to receive the expected command, we post another
> RQ work request to replace the one we just used up.
> 3. Block on a CQ event channel and wait for the SEND to arrive.
> 4. When the send arrives, librdmacm will unblock us.
> 5. Verify that the command-type and version received match the ones we expected.
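>
> Condensed into code, the receive side might look like this (the
> helper functions and the RDMA_CONTROL_VERSION constant are
> assumptions for the sketch, not the patch's actual internals):
>
>     static int qemu_rdma_exchange_recv(RDMAContext *rdma,
>                                        RDMAControlHeader *head,
>                                        uint32_t expected)
>     {
>         send_control(rdma, RDMA_CONTROL_READY); /* 1. we're ready  */
>         post_recv_control(rdma);         /* 2. replace used RQ WR  */
>         block_on_cq_event_channel(rdma); /* 3-4. wait for the SEND */
>         copy_header(rdma, head);
>         if (head->type != expected ||    /* 5. validate arrival    */
>             head->version != RDMA_CONTROL_VERSION) {
>             return -1;
>         }
>         return 0;
>     }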
>
> qemu_rdma_exchange_send(header, data, optional response header & data):
>
> 1. Block on the CQ event channel waiting for a READY command
> from the receiver to tell us that the receiver
> is *ready* for us to transmit some new bytes.
> 2. Optionally: if we are expecting a response from the command
> (that we have not yet transmitted), let's post an RQ
> work request to receive that data a few moments later.
> 3. When the READY arrives, librdmacm will
> unblock us and we immediately post an RQ work request
> to replace the one we just used up.
> 4. Now, we can actually post the work request to SEND
> the command (header plus data) we were asked to transmit.
> 5. Optionally, if we are expecting a response (as before),
> we block again and wait for that response using the additional
> work request we previously posted. (This is used to carry
> 'Register result' commands (#6) back to the sender, which
> hold the rkey needed to perform RDMA.)
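>
> And the send side, in the same sketch style (again with assumed
> helper names rather than the patch's actual internals):
>
>     static int qemu_rdma_exchange_send(RDMAContext *rdma,
>                                        RDMAControlHeader *head,
>                                        uint8_t *data,
>                                        RDMAControlHeader *resp)
>     {
>         block_for_ready(rdma);           /* 1. receiver is ready    */
>         if (resp) {
>             post_recv_control(rdma);     /* 2. WR for the response  */
>         }
>         post_recv_control(rdma);         /* 3. replace used RQ WR   */
>         post_send_control(rdma, head, data); /* 4. SEND the command */
>         if (resp) {
>             block_for_response(rdma, resp);  /* 5. e.g. an rkey     */
>         }
>         return 0;
>     }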
>
> All of the remaining command types (not including 'Ready')
> described above use the aforementioned two functions to do the hard work:
>
> 1. After connection setup, RAMBlock information is exchanged using
> this protocol before the actual migration begins.
> 2. During runtime, once a 'chunk' becomes full of pages ready to
> be sent with RDMA, the registration commands are used to ask the
> other side to register the memory for this chunk and respond
> with the result (rkey) of the registration.
> 3. The QEMUFile interfaces also call these functions (described below)
> when transmitting non-live state, such as device state, or to send
> their own protocol information during the migration process.
>
> Versioning
> ==================================
>
> librdmacm provides the user with a 'private data' area to be exchanged
> at connection-setup time before any infiniband traffic is generated.
>
> This is a convenient place to check for protocol versioning because the
> user does not need to register memory to transmit a few bytes of version
> information.
>
> This is also a convenient place to negotiate capabilities
> (like dynamic page registration).
>
> If the version is invalid, we throw an error.
>
> If the version is new, we only negotiate the capabilities that the
> requested version is able to perform and ignore the rest.
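>
> A sketch of what the sender side of this negotiation could look
> like (the capability struct and the RDMA_CAPABILITY_CHUNK_REGISTER
> and RDMA_CONTROL_VERSION names are made up for illustration):
>
>     typedef struct {
>         uint32_t version;
>         uint32_t flags;          /* capability bits */
>     } RDMACapabilities;
>
>     RDMACapabilities caps = {
>         .version = RDMA_CONTROL_VERSION,
>         .flags   = RDMA_CAPABILITY_CHUNK_REGISTER,
>     };
>     struct rdma_conn_param param = {
>         .private_data     = &caps,
>         .private_data_len = sizeof(caps),
>     };
>     rdma_connect(cm_id, &param);
>
>     /* The receiver reads the same bytes back out of the
>      * connection event (event->param.conn.private_data),
>      * errors out on an invalid version, and masks off any
>      * capability bits it does not understand. */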
>
> QEMUFileRDMA Interface:
> ==================================
>
> QEMUFileRDMA introduces a couple of new functions:
>
> 1. qemu_rdma_get_buffer() (QEMUFileOps rdma_read_ops)
> 2. qemu_rdma_put_buffer() (QEMUFileOps rdma_write_ops)
>
> These two functions are very short and simply use the protocol
> described above to deliver bytes without changing the upper-level
> users of QEMUFile that depend on a bytestream abstraction.
>
> Finally, how do we handoff the actual bytes to get_buffer()?
>
> Again, because we're trying to "fake" a bytestream abstraction
> using an analogy not unlike individual UDP frames, we have
> to hold on to the bytes received from control-channel's SEND
> messages in memory.
>
> Each time we receive a complete "QEMU File" control-channel
> message, the bytes from SEND are copied into a small local holding area.
>
> Then, we return the number of bytes requested by get_buffer()
> and leave the remaining bytes in the holding area until get_buffer()
> comes around for another pass.
>
> If the buffer is empty, then we follow the same steps
> listed above and issue another "QEMU File" protocol command,
> asking for a new SEND message to re-fill the buffer.
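>
> In code, the holding-area logic amounts to something like this
> (field and helper names are assumed; the signature follows the
> QEMUFileOps get_buffer hook referenced above):
>
>     static int qemu_rdma_get_buffer(void *opaque, uint8_t *buf,
>                                     int64_t pos, int size)
>     {
>         RDMAContext *rdma = opaque;
>         int len;
>
>         if (rdma->control_len == 0) {
>             /* Holding area is empty: issue another "QEMU File"
>              * command so the sender refills it with a new SEND. */
>             refill_from_control_channel(rdma);
>         }
>         len = MIN(size, rdma->control_len);
>         memcpy(buf, rdma->control_curr, len); /* hand over bytes   */
>         rdma->control_curr += len;            /* leftovers wait    */
>         rdma->control_len  -= len;            /* for the next call */
>         return len;
>     }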
>
> Migration of pc.ram:
> ===============================
>
> At the beginning of the migration (migration-rdma.c),
> the sender and the receiver each populate a structure with the
> list of RAMBlocks to be registered with the other side.
> Then, using the aforementioned protocol, they exchange a
> description of these blocks with each other, to be used later
> during the iteration of main memory. This description includes
> a list of all the RAMBlocks, their offsets and lengths, and,
> if dynamic page registration was disabled on the server-side,
> the pre-registered RDMA keys.
>
> Main memory is not migrated with the aforementioned protocol,
> but is instead migrated with normal RDMA Write operations.
>
> Pages are migrated in "chunks" (about 1 Megabyte right now).
> Chunk size is not dynamic, but it could be in a future implementation.
> There's nothing to indicate that this is useful right now.
>
> When a chunk is full (or a flush() occurs), the memory backed by
> the chunk is registered with librdmacm and pinned in memory on
> both sides using the aforementioned protocol.
>
> After pinning, an RDMA Write is generated and transmitted
> for the entire chunk.
>
> Chunks are also transmitted in batches: This means that we
> do not request that the hardware signal the completion queue
> for the completion of *every* chunk. The current batch size
> is about 64 chunks (corresponding to 64 MB of memory).
> Only the last chunk in a batch must be signaled.
> This helps keep everything as asynchronous as possible
> and helps keep the hardware busy performing RDMA operations.
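>
> As a sketch, the per-chunk RDMA Write looks like this; note that
> only the last work request of a batch carries IBV_SEND_SIGNALED.
> The remote address and rkey come from the registration protocol
> above; the variable names are illustrative:
>
>     struct ibv_sge sge = {
>         .addr   = (uintptr_t)chunk_start,
>         .length = chunk_len,
>         .lkey   = chunk_mr->lkey,
>     };
>     struct ibv_send_wr wr = {
>         .opcode     = IBV_WR_RDMA_WRITE,
>         .sg_list    = &sge,
>         .num_sge    = 1,
>         /* unsignaled unless this chunk closes the batch */
>         .send_flags = last_in_batch ? IBV_SEND_SIGNALED : 0,
>         .wr.rdma.remote_addr = remote_chunk_addr,
>         .wr.rdma.rkey        = remote_rkey,
>     };
>     struct ibv_send_wr *bad_wr;
>     ibv_post_send(qp, &wr, &bad_wr);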
>
> Error-handling:
> ===============================
>
> Infiniband has what is called a "Reliable, Connected"
> link (one of 4 choices). This is the mode we use for
> RDMA migration.
>
> If a *single* message fails,
> the decision is to abort the migration entirely,
> clean up all the RDMA descriptors, and unregister all
> the memory.
>
> After cleanup, the Virtual Machine is returned to normal
> operation the same way it would be if the TCP
> socket were broken during a non-RDMA based migration.
>
> TODO:
> =================================
> 1. Currently, cgroups swap limits for *both* TCP and RDMA
> on the sender-side are broken. This is more acute for
> RDMA because RDMA requires memory registration.
> Fixing this requires infiniband page registrations to be
> zero-page aware, and this does not yet work properly.
> 2. Currently, overcommit for the *receiver* side of
> TCP works, but not for RDMA. While dynamic page registration
> *does* work, it is only useful if the is_zero_page() capability
> remains enabled (which it is by default).
> However, leaving this capability turned on *significantly* slows
> down the RDMA throughput, particularly on hardware capable
> of transmitting faster than 10 gbps (such as 40gbps links).
> 3. Use of the recent /proc/<pid>/pagemap would likely solve some
> of these problems.
> 4. Some form of balloon-device usage tracking would also
> help alleviate some of these issues.
>
> PERFORMANCE
> ===================
>
> Using a 40 gbps infiniband link, performing a worst-case stress test:
>
> Average worst-case throughput:
>
> 1. RDMA, with $ stress --vm-bytes 1024M --vm 1 --vm-keep:
>    approximately 30 gbps (a little better than the paper)
> 2. TCP, with $ stress --vm-bytes 1024M --vm 1 --vm-keep:
>    approximately 8 gbps (using IPoIB, IP over Infiniband)
>
> Average downtime (stop time) ranges between 28 and 33 milliseconds.
>
> An *exhaustive* paper (2010) with additional performance details
> is linked on the QEMU wiki:
>
> http://wiki.qemu.org/Features/RDMALiveMigration
>
>