From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([208.118.235.92]:50176)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <mrhines@linux.vnet.ibm.com>) id 1UPQ77-0006rd-Ev
	for qemu-devel@nongnu.org; Tue, 09 Apr 2013 00:25:00 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <mrhines@linux.vnet.ibm.com>) id 1UPQ73-00032B-Lb
	for qemu-devel@nongnu.org; Tue, 09 Apr 2013 00:24:57 -0400
Received: from e34.co.us.ibm.com ([32.97.110.152]:43706)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <mrhines@linux.vnet.ibm.com>) id 1UPQ73-00031L-CB
	for qemu-devel@nongnu.org; Tue, 09 Apr 2013 00:24:53 -0400
Received: from /spool/local
	by e34.co.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only!
	Violators will be prosecuted
	for <qemu-devel@nongnu.org> from <mrhines@linux.vnet.ibm.com>;
	Mon, 8 Apr 2013 22:24:52 -0600
Received: from d03relay01.boulder.ibm.com (d03relay01.boulder.ibm.com
	[9.17.195.226])
	by d03dlp01.boulder.ibm.com (Postfix) with ESMTP id 8D1621FF0039
	for <qemu-devel@nongnu.org>; Mon,  8 Apr 2013 22:19:50 -0600 (MDT)
Received: from d03av04.boulder.ibm.com (d03av04.boulder.ibm.com [9.17.195.170])
	by d03relay01.boulder.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id
	r394Oojn120056
	for <qemu-devel@nongnu.org>; Mon, 8 Apr 2013 22:24:50 -0600
Received: from d03av04.boulder.ibm.com (loopback [127.0.0.1])
	by d03av04.boulder.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP
	id r394OnZZ001607
	for <qemu-devel@nongnu.org>; Mon, 8 Apr 2013 22:24:49 -0600
Message-ID: <51639810.8050607@linux.vnet.ibm.com>
Date: Tue, 09 Apr 2013 00:24:48 -0400
From: "Michael R. Hines" <mrhines@linux.vnet.ibm.com>
MIME-Version: 1.0
References: <1365476681-31593-1-git-send-email-mrhines@linux.vnet.ibm.com>
In-Reply-To: <1365476681-31593-1-git-send-email-mrhines@linux.vnet.ibm.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [RFC PATCH RDMA support v5: 00/12] new formal
 protocol design
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: qemu-devel@nongnu.org
Cc: aliguori@us.ibm.com, mst@redhat.com, owasserm@redhat.com, abali@us.ibm.com, mrhines@us.ibm.com, gokul@us.ibm.com, pbonzini@redhat.com

FYI: Testable patchset can be found here: github.com:hinesmr/qemu.git, 
'rdma' branch

- Michael

On 04/08/2013 11:04 PM, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
>
> Changes since v4:
>
> - Created a "formal" protocol for the RDMA control channel
> - Dynamic, chunked page registration now implemented on *both* the server and client
> - Created new 'capability' for page registration
> - Created new 'capability' for is_zero_page() - enabled by default
>    (needed to test dynamic page registration)
> - Created version-check before protocol begins at connection-time
> - no more migrate_use_rdma() !
>
> NOTE: While dynamic registration works on both sides now,
>        it does *not* work with cgroups swap limits. This functionality with infiniband
>        remains broken. (It works fine with TCP). So, in order to take full
>        advantage of this feature, a fix will have to be developed on the kernel side.
>        Alternative proposed is use /dev/<pid>/pagemap. Patch will be submitted.
>
> Contents:
> =================================
> * Compiling
> * Running (please readme before running)
> * RDMA Protocol Description
> * Versioning
> * QEMUFileRDMA Interface
> * Migration of pc.ram
> * Error handling
> * TODO
> * Performance
>
> COMPILING:
> ===============================
>
> $ ./configure --enable-rdma --target-list=x86_64-softmmu
> $ make
>
> RUNNING:
> ===============================
>
> First, decide if you want dynamic page registration on the server-side.
> This always happens on the primary-VM side, but is optional on the server.
> Doing this allows you to support overcommit (such as cgroups or ballooning)
> with a smaller footprint on the server-side without having to register the
> entire VM memory footprint.
> NOTE: This significantly slows down performance (about 30% slower).
>
> $ virsh qemu-monitor-command --hmp \
>      --cmd "migrate_set_capability chunk_register_destination on" # disabled by default
>
> Next, if you decided *not* to use chunked registration on the server,
> it is recommended to also disable zero page detection. While this is not
> strictly necessary, zero page detection also significantly slows down
> performance on higher-throughput links (by about 50%), like 40 gbps infiniband cards:
>
> $ virsh qemu-monitor-command --hmp \
>      --cmd "migrate_set_capability check_for_zero off" # always enabled by default
>
> Finally, set the migration speed to match your hardware's capabilities:
>
> $ virsh qemu-monitor-command --hmp \
>      --cmd "migrate_set_speed 40g" # or whatever is the MAX of your RDMA device
>
> Finally, perform the actual migration:
>
> $ virsh migrate domain rdma:xx.xx.xx.xx:port
>
> RDMA Protocol Description:
> =================================
>
> Migration with RDMA is separated into two parts:
>
> 1. The transmission of the pages using RDMA
> 2. Everything else (a control channel is introduced)
>
> "Everything else" is transmitted using a formal
> protocol now, consisting of infiniband SEND / RECV messages.
>
> An infiniband SEND message is the standard ibverbs
> message used by applications of infiniband hardware.
> The only difference between a SEND message and an RDMA
> message is that SEND message cause completion notifications
> to be posted to the completion queue (CQ) on the
> infiniband receiver side, whereas RDMA messages (used
> for pc.ram) do not (to behave like an actual DMA).
>      
> Messages in infiniband require two things:
>
> 1. registration of the memory that will be transmitted
> 2. (SEND/RECV only) work requests to be posted on both
>     sides of the network before the actual transmission
>     can occur.
>
> RDMA messages much easier to deal with. Once the memory
> on the receiver side is registered and pinned, we're
> basically done. All that is required is for the sender
> side to start dumping bytes onto the link.
>
> SEND messages require more coordination because the
> receiver must have reserved space (using a receive
> work request) on the receive queue (RQ) before QEMUFileRDMA
> can start using them to carry all the bytes as
> a transport for migration of device state.
>
> To begin the migration, the initial connection setup is
> as follows (migration-rdma.c):
>
> 1. Receiver and Sender are started (command line or libvirt):
> 2. Both sides post two RQ work requests
> 3. Receiver does listen()
> 4. Sender does connect()
> 5. Receiver accept()
> 6. Check versioning and capabilities (described later)
>
> At this point, we define a control channel on top of SEND messages
> which is described by a formal protocol. Each SEND message has a
> header portion and a data portion (but together are transmitted
> as a single SEND message).
>
> Header:
>      * Length  (of the data portion)
>      * Type    (what command to perform, described below)
>      * Version (protocol version validated before send/recv occurs)
>
> The 'type' field has 7 different command values:
>      1. None
>      2. Ready             (control-channel is available)
>      3. QEMU File         (for sending non-live device state)
>      4. RAM Blocks        (used right after connection setup)
>      5. Register request  (dynamic chunk registration)
>      6. Register result   ('rkey' to be used by sender)
>      7. Register finished (registration for current iteration finished)
>
> After connection setup is completed, we have two protocol-level
> functions, responsible for communicating control-channel commands
> using the above list of values:
>
> Logically:
>
> qemu_rdma_exchange_recv(header, expected command type)
>
> 1. We transmit a READY command to let the sender know that
>     we are *ready* to receive some data bytes on the control channel.
> 2. Before attempting to receive the expected command, we post another
>     RQ work request to replace the one we just used up.
> 3. Block on a CQ event channel and wait for the SEND to arrive.
> 4. When the send arrives, librdmacm will unblock us.
> 5. Verify that the command-type and version received matches the one we expected.
>
> qemu_rdma_exchange_send(header, data, optional response header & data):
>
> 1. Block on the CQ event channel waiting for a READY command
>     from the receiver to tell us that the receiver
>     is *ready* for us to transmit some new bytes.
> 2. Optionally: if we are expecting a response from the command
>     (that we have no yet transmitted), let's post an RQ
>     work request to receive that data a few moments later.
> 3. When the READY arrives, librdmacm will
>     unblock us and we immediately post a RQ work request
>     to replace the one we just used up.
> 4. Now, we can actually post the work request to SEND
>     the requested command type of the header we were asked for.
> 5. Optionally, if we are expecting a response (as before),
>     we block again and wait for that response using the additional
>     work request we previously posted. (This is used to carry
>     'Register result' commands #6 back to the sender which
>     hold the rkey need to perform RDMA.
>
> All of the remaining command types (not including 'ready')
> described above all use the aformentioned two functions to do the hard work:
>
> 1. After connection setup, RAMBlock information is exchanged using
>     this protocol before the actual migration begins.
> 2. During runtime, once a 'chunk' becomes full of pages ready to
>     be sent with RDMA, the registration commands are used to ask the
>     other side to register the memory for this chunk and respond
>     with the result (rkey) of the registration.
> 3. Also, the QEMUFile interfaces also call these functions (described below)
>     when transmitting non-live state, such as devices or to send
>     its own protocol information during the migration process.
>
> Versioning
> ==================================
>
> librdmacm provides the user with a 'private data' area to be exchanged
> at connection-setup time before any infiniband traffic is generated.
>
> This is a convenient place to check for protocol versioning because the
> user does not need to register memory to transmit a few bytes of version
> information.
>
> This is also a convenient place to negotiate capabilities
> (like dynamic page registration).
>
> If the version is invalid, we throw an error.
>
> If the version is new, we only negotiate the capabilities that the
> requested version is able to perform and ignore the rest.
>
> QEMUFileRDMA Interface:
> ==================================
>
> QEMUFileRDMA introduces a couple of new functions:
>
> 1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
> 2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)
>
> These two functions are very short and simply used the protocol
> describe above to deliver bytes without changing the upper-level
> users of QEMUFile that depend on a bytstream abstraction.
>
> Finally, how do we handoff the actual bytes to get_buffer()?
>
> Again, because we're trying to "fake" a bytestream abstraction
> using an analogy not unlike individual UDP frames, we have
> to hold on to the bytes received from control-channel's SEND
> messages in memory.
>
> Each time we receive a complete "QEMU File" control-channel
> message, the bytes from SEND are copied into a small local holding area.
>
> Then, we return the number of bytes requested by get_buffer()
> and leave the remaining bytes in the holding area until get_buffer()
> comes around for another pass.
>
> If the buffer is empty, then we follow the same steps
> listed above and issue another "QEMU File" protocol command,
> asking for a new SEND message to re-fill the buffer.
>
> Migration of pc.ram:
> ===============================
>
> At the beginning of the migration, (migration-rdma.c),
> the sender and the receiver populate the list of RAMBlocks
> to be registered with each other into a structure.
> Then, using the aforementioned protocol, they exchange a
> description of these blocks with each other, to be used later
> during the iteration of main memory. This description includes
> a list of all the RAMBlocks, their offsets and lengths and
> possibly includes pre-registered RDMA keys in case dynamic
> page registration was disabled on the server-side, otherwise not.
>
> Main memory is not migrated with the aforementioned protocol,
> but is instead migrated with normal RDMA Write operations.
>
> Pages are migrated in "chunks" (about 1 Megabyte right now).
> Chunk size is not dynamic, but it could be in a future implementation.
> There's nothing to indicate that this is useful right now.
>
> When a chunk is full (or a flush() occurs), the memory backed by
> the chunk is registered with librdmacm and pinned in memory on
> both sides using the aforementioned protocol.
>
> After pinning, an RDMA Write is generated and tramsmitted
> for the entire chunk.
>
> Chunks are also transmitted in batches: This means that we
> do not request that the hardware signal the completion queue
> for the completion of *every* chunk. The current batch size
> is about 64 chunks (corresponding to 64 MB of memory).
> Only the last chunk in a batch must be signaled.
> This helps keep everything as asynchronous as possible
> and helps keep the hardware busy performing RDMA operations.
>
> Error-handling:
> ===============================
>
> Infiniband has what is called a "Reliable, Connected"
> link (one of 4 choices). This is the mode in which
> we use for RDMA migration.
>
> If a *single* message fails,
> the decision is to abort the migration entirely and
> cleanup all the RDMA descriptors and unregister all
> the memory.
>
> After cleanup, the Virtual Machine is returned to normal
> operation the same way that would happen if the TCP
> socket is broken during a non-RDMA based migration.
>
> TODO:
> =================================
> 1. Currently, cgroups swap limits for *both* TCP and RDMA
>     on the sender-side is broken. This is more poignant for
>     RDMA because RDMA requires memory registration.
>     Fixing this requires infiniband page registrations to be
>     zero-page aware, and this does not yet work properly.
> 2. Currently overcommit for the the *receiver* side of
>     TCP works, but not for RDMA. While dynamic page registration
>     *does* work, it is only useful if the is_zero_page() capability
>     is remained enabled (which it is by default).
>     However, leaving this capability turned on *significantly* slows
>     down the RDMA throughput, particularly on hardware capable
>     of transmitting faster than 10 gbps (such as 40gbps links).
> 3. Use of the recent /dev/<pid>/pagemap would likely solve some
>     of these problems.
> 4. Also, some form of balloon-device usage tracking would also
>     help aleviate some of these issues.
>
> PERFORMANCE
> ===================
>
> Using a 40gbps infinband link performing a worst-case stress test:
>
> RDMA Throughput With $ stress --vm-bytes 1024M --vm 1 --vm-keep
> Approximately 30 gpbs (little better than the paper)
> 1. Average worst-case throughput
> TCP Throughput With $ stress --vm-bytes 1024M --vm 1 --vm-keep
> 2. Approximately 8 gpbs (using IPOIB IP over Infiniband)
>
> Average downtime (stop time) ranges between 28 and 33 milliseconds.
>
> An *exhaustive* paper (2010) shows additional performance details
> linked on the QEMU wiki:
>
> http://wiki.qemu.org/Features/RDMALiveMigration
>
>