From: "Michael R. Hines" <mrhines@linux.vnet.ibm.com>
To: qemu-devel@nongnu.org
Cc: aliguori@us.ibm.com, mst@redhat.com, owasserm@redhat.com,
	abali@us.ibm.com, mrhines@us.ibm.com, gokul@us.ibm.com,
	pbonzini@redhat.com
Subject: Re: [Qemu-devel] [RFC PATCH RDMA support v5: 00/12] new formal protocol design
Date: Tue, 09 Apr 2013 00:24:48 -0400	[thread overview]
Message-ID: <51639810.8050607@linux.vnet.ibm.com> (raw)
In-Reply-To: <1365476681-31593-1-git-send-email-mrhines@linux.vnet.ibm.com>

FYI: Testable patchset can be found here: github.com:hinesmr/qemu.git, 
'rdma' branch

- Michael

On 04/08/2013 11:04 PM, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
>
> Changes since v4:
>
> - Created a "formal" protocol for the RDMA control channel
> - Dynamic, chunked page registration now implemented on *both* the server and client
> - Created new 'capability' for page registration
> - Created new 'capability' for is_zero_page() - enabled by default
>    (needed to test dynamic page registration)
> - Created version-check before protocol begins at connection-time
> - no more migrate_use_rdma() !
>
> NOTE: While dynamic registration works on both sides now,
>        it does *not* work with cgroups swap limits. This functionality
>        remains broken with infiniband. (It works fine with TCP.) So, in order to take full
>        advantage of this feature, a fix will have to be developed on the kernel side.
>        The proposed alternative is to use /proc/<pid>/pagemap. A patch will be submitted.
>
> Contents:
> =================================
> * Compiling
> * Running (please read before running)
> * RDMA Protocol Description
> * Versioning
> * QEMUFileRDMA Interface
> * Migration of pc.ram
> * Error handling
> * TODO
> * Performance
>
> COMPILING:
> ===============================
>
> $ ./configure --enable-rdma --target-list=x86_64-softmmu
> $ make
>
> RUNNING:
> ===============================
>
> First, decide if you want dynamic page registration on the server-side.
> This always happens on the primary-VM side, but is optional on the server.
> Doing this allows you to support overcommit (such as cgroups or ballooning)
> with a smaller footprint on the server-side without having to register the
> entire VM memory footprint.
> NOTE: This significantly slows down performance (about 30% slower).
>
> $ virsh qemu-monitor-command --hmp \
>      --cmd "migrate_set_capability chunk_register_destination on" # disabled by default
>
> Next, if you decided *not* to use chunked registration on the server,
> it is recommended to also disable zero page detection. While this is not
> strictly necessary, zero page detection also significantly slows down
> performance on higher-throughput links (by about 50%), like 40 gbps infiniband cards:
>
> $ virsh qemu-monitor-command --hmp \
>      --cmd "migrate_set_capability check_for_zero off" # always enabled by default
>
> Next, set the migration speed to match your hardware's capabilities:
>
> $ virsh qemu-monitor-command --hmp \
>      --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device
>
> Finally, perform the actual migration:
>
> $ virsh migrate domain rdma:xx.xx.xx.xx:port
>
> RDMA Protocol Description:
> =================================
>
> Migration with RDMA is separated into two parts:
>
> 1. The transmission of the pages using RDMA
> 2. Everything else (a control channel is introduced)
>
> "Everything else" is transmitted using a formal
> protocol now, consisting of infiniband SEND / RECV messages.
>
> An infiniband SEND message is the standard ibverbs
> message used by applications of infiniband hardware.
> The only difference between a SEND message and an RDMA
> message is that SEND messages cause completion notifications
> to be posted to the completion queue (CQ) on the
> infiniband receiver side, whereas RDMA messages (used
> for pc.ram) do not (to behave like an actual DMA).
>      
> Messages in infiniband require two things:
>
> 1. registration of the memory that will be transmitted
> 2. (SEND/RECV only) work requests to be posted on both
>     sides of the network before the actual transmission
>     can occur.
>
> RDMA messages are much easier to deal with. Once the memory
> on the receiver side is registered and pinned, we're
> basically done. All that is required is for the sender
> side to start dumping bytes onto the link.
>
> SEND messages require more coordination because the
> receiver must have reserved space (using a receive
> work request) on the receive queue (RQ) before QEMUFileRDMA
> can start using them to carry all the bytes as
> a transport for migration of device state.
>
> To begin the migration, the initial connection setup is
> as follows (migration-rdma.c):
>
> 1. Receiver and Sender are started (command line or libvirt)
> 2. Both sides post two RQ work requests
> 3. Receiver does listen()
> 4. Sender does connect()
> 5. Receiver accept()
> 6. Check versioning and capabilities (described later)
>
> At this point, we define a control channel on top of SEND messages
> which is described by a formal protocol. Each SEND message has a
> header portion and a data portion (but together are transmitted
> as a single SEND message).
>
> Header:
>      * Length  (of the data portion)
>      * Type    (what command to perform, described below)
>      * Version (protocol version validated before send/recv occurs)
>
> The 'type' field has 7 different command values:
>      1. None
>      2. Ready             (control-channel is available)
>      3. QEMU File         (for sending non-live device state)
>      4. RAM Blocks        (used right after connection setup)
>      5. Register request  (dynamic chunk registration)
>      6. Register result   ('rkey' to be used by sender)
>      7. Register finished (registration for current iteration finished)
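The header and command list above can be sketched in C. This is an illustrative model, not the patchset's actual definitions: the field widths, struct layout, and names here are assumptions for the purpose of the example.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical encoding of the control-channel header described above.
 * The actual patchset defines its own wire format. */
typedef enum {
    RDMA_CONTROL_NONE = 0,
    RDMA_CONTROL_READY,             /* control-channel is available */
    RDMA_CONTROL_QEMU_FILE,         /* non-live device state */
    RDMA_CONTROL_RAM_BLOCKS,        /* used right after connection setup */
    RDMA_CONTROL_REGISTER_REQUEST,  /* dynamic chunk registration */
    RDMA_CONTROL_REGISTER_RESULT,   /* carries the 'rkey' back to the sender */
    RDMA_CONTROL_REGISTER_FINISHED, /* registration done for this iteration */
} RDMAControlType;

typedef struct {
    uint32_t len;     /* length of the data portion */
    uint32_t type;    /* one of RDMAControlType */
    uint32_t version; /* validated before any send/recv occurs */
} RDMAControlHeader;

/* Serialize the header in front of the payload, so header + data go out
 * together as a single SEND message. Returns total bytes written. */
static size_t control_pack(uint8_t *wire, const RDMAControlHeader *h,
                           const uint8_t *data)
{
    memcpy(wire, h, sizeof(*h));
    memcpy(wire + sizeof(*h), data, h->len);
    return sizeof(*h) + h->len;
}
```

The key point the sketch captures is that header and data travel in one SEND, so the receiver can read the length and type before touching the payload.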
>
> After connection setup is completed, we have two protocol-level
> functions, responsible for communicating control-channel commands
> using the above list of values:
>
> Logically:
>
> qemu_rdma_exchange_recv(header, expected command type)
>
> 1. We transmit a READY command to let the sender know that
>     we are *ready* to receive some data bytes on the control channel.
> 2. Before attempting to receive the expected command, we post another
>     RQ work request to replace the one we just used up.
> 3. Block on a CQ event channel and wait for the SEND to arrive.
> 4. When the SEND arrives, librdmacm will unblock us.
> 5. Verify that the command-type and version received matches the one we expected.
>
> qemu_rdma_exchange_send(header, data, optional response header & data):
>
> 1. Block on the CQ event channel waiting for a READY command
>     from the receiver to tell us that the receiver
>     is *ready* for us to transmit some new bytes.
> 2. Optionally: if we are expecting a response to the command
>     (that we have not yet transmitted), post an RQ
>     work request to receive that data a few moments later.
> 3. When the READY arrives, librdmacm will
>     unblock us and we immediately post a RQ work request
>     to replace the one we just used up.
> 4. Now, we can actually post the work request to SEND
>     the requested command type of the header we were asked for.
> 5. Optionally, if we are expecting a response (as before),
>     we block again and wait for that response using the additional
>     work request we previously posted. (This is used to carry
>     'Register result' commands (#6) back to the sender, which
>     hold the rkey needed to perform RDMA.)
>
> All of the remaining command types (excluding 'Ready')
> use the aforementioned two functions to do the hard work:
>
> 1. After connection setup, RAMBlock information is exchanged using
>     this protocol before the actual migration begins.
> 2. During runtime, once a 'chunk' becomes full of pages ready to
>     be sent with RDMA, the registration commands are used to ask the
>     other side to register the memory for this chunk and respond
>     with the result (rkey) of the registration.
> 3. The QEMUFile interfaces also call these functions (described below)
>     when transmitting non-live state, such as device state, or to send
>     their own protocol information during the migration process.
>
> Versioning
> ==================================
>
> librdmacm provides the user with a 'private data' area to be exchanged
> at connection-setup time before any infiniband traffic is generated.
>
> This is a convenient place to check for protocol versioning because the
> user does not need to register memory to transmit a few bytes of version
> information.
>
> This is also a convenient place to negotiate capabilities
> (like dynamic page registration).
>
> If the version is invalid, we throw an error.
>
> If the version is new, we only negotiate the capabilities that the
> requested version is able to perform and ignore the rest.
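The rules in the last two paragraphs — reject an invalid version, and for a newer peer keep only the capability bits we understand — amount to a small masking function. The capability names and bit assignments below are hypothetical, chosen only to make the sketch self-contained:

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative capability bits; the real patchset defines its own. */
#define RDMA_CAP_CHUNK_REGISTER (1u << 0)
#define RDMA_CAP_ZERO_PAGE      (1u << 1)

/* Hypothetical negotiation over the librdmacm private-data area.
 * Returns -1 on an invalid version; otherwise writes the agreed
 * capability set into *agreed. */
static int negotiate(uint32_t peer_version, uint32_t peer_caps,
                     uint32_t *agreed)
{
    const uint32_t known = RDMA_CAP_CHUNK_REGISTER | RDMA_CAP_ZERO_PAGE;

    if (peer_version == 0) {
        return -1;              /* invalid version: throw an error */
    }
    /* A newer peer may advertise bits we don't know about;
     * negotiate only the ones we can perform and ignore the rest. */
    *agreed = peer_caps & known;
    return 0;
}
```

Doing this in the private-data exchange is what makes it free: no memory registration is needed before these few bytes cross the link.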
>
> QEMUFileRDMA Interface:
> ==================================
>
> QEMUFileRDMA introduces a couple of new functions:
>
> 1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
> 2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)
>
> These two functions are very short and simply use the protocol
> described above to deliver bytes without changing the upper-level
> users of QEMUFile that depend on a bytestream abstraction.
>
> Finally, how do we handoff the actual bytes to get_buffer()?
>
> Again, because we're trying to "fake" a bytestream abstraction
> using an analogy not unlike individual UDP frames, we have
> to hold on to the bytes received from control-channel's SEND
> messages in memory.
>
> Each time we receive a complete "QEMU File" control-channel
> message, the bytes from SEND are copied into a small local holding area.
>
> Then, we return the number of bytes requested by get_buffer()
> and leave the remaining bytes in the holding area until get_buffer()
> comes around for another pass.
>
> If the buffer is empty, then we follow the same steps
> listed above and issue another "QEMU File" protocol command,
> asking for a new SEND message to re-fill the buffer.
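The holding-area mechanics described above can be sketched as follows. This is a toy model with invented names and a fixed-size buffer; the actual patchset sizes and manages its own storage:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy holding area behind a get_buffer()-style read: bytes from one
 * "QEMU File" SEND land here and are drained over multiple passes. */
typedef struct {
    uint8_t hold[4096];
    size_t  len;   /* bytes currently held */
    size_t  pos;   /* bytes already handed back to the caller */
} HoldingArea;

/* Called when a complete "QEMU File" control message arrives. */
static void refill(HoldingArea *a, const uint8_t *data, size_t n)
{
    memcpy(a->hold, data, n);
    a->len = n;
    a->pos = 0;
}

/* Copies up to `want` bytes out; returns the count actually copied.
 * A return of 0 means the caller must issue another "QEMU File"
 * command to trigger a new SEND and refill the buffer. */
static size_t get_buffer(HoldingArea *a, uint8_t *out, size_t want)
{
    size_t avail = a->len - a->pos;
    size_t n = want < avail ? want : avail;
    memcpy(out, a->hold + a->pos, n);
    a->pos += n;
    return n;
}
```

This is the UDP-frame analogy in miniature: each SEND is a discrete message, and the holding area is what turns those discrete messages back into a bytestream for the upper layers.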
>
> Migration of pc.ram:
> ===============================
>
> At the beginning of the migration, (migration-rdma.c),
> the sender and the receiver populate the list of RAMBlocks
> to be registered with each other into a structure.
> Then, using the aforementioned protocol, they exchange a
> description of these blocks with each other, to be used later
> during the iteration of main memory. This description includes
> a list of all the RAMBlocks, their offsets and lengths, and,
> if dynamic page registration was disabled on the server-side,
> pre-registered RDMA keys.
>
> Main memory is not migrated with the aforementioned protocol,
> but is instead migrated with normal RDMA Write operations.
>
> Pages are migrated in "chunks" (about 1 Megabyte right now).
> Chunk size is not dynamic, but it could be in a future implementation.
> There's nothing to indicate that this is useful right now.
>
> When a chunk is full (or a flush() occurs), the memory backed by
> the chunk is registered with librdmacm and pinned in memory on
> both sides using the aforementioned protocol.
>
> After pinning, an RDMA Write is generated and transmitted
> for the entire chunk.
>
> Chunks are also transmitted in batches: This means that we
> do not request that the hardware signal the completion queue
> for the completion of *every* chunk. The current batch size
> is about 64 chunks (corresponding to 64 MB of memory).
> Only the last chunk in a batch must be signaled.
> This helps keep everything as asynchronous as possible
> and helps keep the hardware busy performing RDMA operations.
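The chunking and batching arithmetic above is simple enough to write down directly. The constants are the ones the text gives (1 MB chunks, 64-chunk batches); the function names are illustrative:

```c
#include <assert.h>
#include <stdint.h>

#define CHUNK_SIZE   (1ULL << 20)  /* ~1 Megabyte per chunk */
#define BATCH_CHUNKS 64            /* 64 chunks = 64 MB per batch */

/* Which chunk a RAM offset falls into. */
static uint64_t chunk_index(uint64_t ram_offset)
{
    return ram_offset / CHUNK_SIZE;
}

/* Request a completion-queue signal only for the last chunk of each
 * batch; all earlier chunks in the batch are posted unsignaled. */
static int chunk_is_signaled(uint64_t chunk)
{
    return (chunk % BATCH_CHUNKS) == BATCH_CHUNKS - 1;
}
```

Signaling one chunk in 64 instead of every chunk is what keeps the pipeline asynchronous: the sender only pays the completion-processing cost once per 64 MB.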
>
> Error-handling:
> ===============================
>
> Infiniband has what is called a "Reliable, Connected"
> link (one of 4 choices). This is the mode we use for
> RDMA migration.
>
> If a *single* message fails,
> the decision is to abort the migration entirely and
> cleanup all the RDMA descriptors and unregister all
> the memory.
>
> After cleanup, the Virtual Machine is returned to normal
> operation the same way that would happen if the TCP
> socket is broken during a non-RDMA based migration.
>
> TODO:
> =================================
> 1. Currently, cgroups swap limits for *both* TCP and RDMA
>     on the sender-side are broken. This is more pronounced for
>     RDMA because RDMA requires memory registration.
>     Fixing this requires infiniband page registrations to be
>     zero-page aware, and this does not yet work properly.
> 2. Currently, overcommit for the *receiver* side of
>     TCP works, but not for RDMA. While dynamic page registration
>     *does* work, it is only useful if the is_zero_page() capability
>     remains enabled (which it is by default).
>     However, leaving this capability turned on *significantly* slows
>     down the RDMA throughput, particularly on hardware capable
>     of transmitting faster than 10 gbps (such as 40 gbps links).
> 3. Use of the recent /proc/<pid>/pagemap would likely solve some
>     of these problems.
> 4. Some form of balloon-device usage tracking would also
>     help alleviate some of these issues.
>
> PERFORMANCE
> ===================
>
> Using a 40 gbps infiniband link, performing a worst-case stress test:
>
> 1. RDMA throughput with $ stress --vm-bytes 1024M --vm 1 --vm-keep:
>     approximately 30 gbps average worst-case (a little better than the paper).
> 2. TCP throughput with the same stress test:
>     approximately 8 gbps (using IPoIB, IP over Infiniband).
>
> Average downtime (stop time) ranges between 28 and 33 milliseconds.
>
> An *exhaustive* paper (2010) with additional performance details
> is linked on the QEMU wiki:
>
> http://wiki.qemu.org/Features/RDMALiveMigration
>
>
