From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:60597) by lists.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Uxh6w-00030Y-0A for qemu-devel@nongnu.org; Fri, 12 Jul 2013 13:26:30 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71) (envelope-from ) id 1Uxh6s-0002cr-C5 for qemu-devel@nongnu.org; Fri, 12 Jul 2013 13:26:25 -0400
Received: from e32.co.us.ibm.com ([32.97.110.150]:49777) by eggs.gnu.org with esmtp (Exim 4.71) (envelope-from ) id 1Uxh6s-0002cP-11 for qemu-devel@nongnu.org; Fri, 12 Jul 2013 13:26:22 -0400
Received: from /spool/local by e32.co.us.ibm.com with IBM ESMTP SMTP Gateway: Authorized Use Only! Violators will be prosecuted for from ; Fri, 12 Jul 2013 11:26:12 -0600
Received: from d03relay02.boulder.ibm.com (d03relay02.boulder.ibm.com [9.17.195.227]) by d03dlp01.boulder.ibm.com (Postfix) with ESMTP id 9B1C11FF001F for ; Fri, 12 Jul 2013 11:20:50 -0600 (MDT)
Received: from d03av04.boulder.ibm.com (d03av04.boulder.ibm.com [9.17.195.170]) by d03relay02.boulder.ibm.com (8.13.8/8.13.8/NCO v10.0) with ESMTP id r6CHQ79V054428 for ; Fri, 12 Jul 2013 11:26:08 -0600
Received: from d03av04.boulder.ibm.com (loopback [127.0.0.1]) by d03av04.boulder.ibm.com (8.14.4/8.13.1/NCO v10.0 AVout) with ESMTP id r6CHQ65M017598 for ; Fri, 12 Jul 2013 11:26:07 -0600
Message-ID: <51E03C2D.9020206@linux.vnet.ibm.com>
Date: Fri, 12 Jul 2013 13:26:05 -0400
From: "Michael R. Hines"
MIME-Version: 1.0
References: <1373640028-5138-1-git-send-email-mrhines@linux.vnet.ibm.com> <1373640028-5138-2-git-send-email-mrhines@linux.vnet.ibm.com> <51E0382E.5030209@redhat.com>
In-Reply-To: <51E0382E.5030209@redhat.com>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH v3 resend/cleanup 1/8] rdma: update documentation to reflect new unpin support
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
To: Eric Blake
Cc: aliguori@us.ibm.com, quintela@redhat.com, qemu-devel@nongnu.org, owasserm@redhat.com, abali@us.ibm.com, mrhines@us.ibm.com, gokul@us.ibm.com, pbonzini@redhat.com, chegu_vinod@hp.com, knoel@redhat.com

On 07/12/2013 01:09 PM, Eric Blake wrote:
> On 07/12/2013 08:40 AM, mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines"
>>
>> As requested, the protocol now includes memory unpinning support.
>> This has been implemented in a non-optimized manner, in such a way
>> that one could devise an LRU or other workload-specific information
>> on top of the basic mechanism to influence the way unpinning happens
>> during runtime.
>>
>> The feature is not yet user-facing, and thus can only be enabled
>> at compile-time.
>>
>> Reviewed-by: Eric Blake
>> Signed-off-by: Michael R. Hines
>> ---
>> docs/rdma.txt | 51 ++++++++++++++++++++++++++++++---------------------
>> 1 file changed, 30 insertions(+), 21 deletions(-)
> I suggest splitting this patch into two; and cc-ing the first of the two
> patches through qemu-trivial (since formatting cleanups can be applied
> now, even while still waiting for a comprehensive review of the
> algorithm in the rest of the series)

My understanding is that the reviews have completed already, including
a very extensive test series that I performed which included both
virt-test results and non-virt-test results from both myself and Chegu.
Am I mistaken?
>
>> diff --git a/docs/rdma.txt b/docs/rdma.txt
>> index 45a4b1d..45d1c8a 100644
>> --- a/docs/rdma.txt
>> +++ b/docs/rdma.txt
>> @@ -35,7 +35,7 @@ memory tracked during each live migration iteration round cannot keep pace
>> with the rate of dirty memory produced by the workload.
>>
>> RDMA currently comes in two flavors: both Ethernet based (RoCE, or RDMA
>> -over Convered Ethernet) as well as Infiniband-based. This implementation of
>> +over Converged Ethernet) as well as Infiniband-based. This implementation of
> Trivial
>
>> migration using RDMA is capable of using both technologies because of
>> the use of the OpenFabrics OFED software stack that abstracts out the
>> programming model irrespective of the underlying hardware.
>> @@ -188,9 +188,9 @@ header portion and a data portion (but together are transmitted
>> as a single SEND message).
>>
>> Header:
>> - * Length (of the data portion, uint32, network byte order)
>> - * Type (what command to perform, uint32, network byte order)
>> - * Repeat (Number of commands in data portion, same type only)
>> + * Length (of the data portion, uint32, network byte order)
>> + * Type (what command to perform, uint32, network byte order)
>> + * Repeat (Number of commands in data portion, same type only)
> trivial
>
>>
>> The 'Repeat' field is here to support future multiple page registrations
>> in a single message without any need to change the protocol itself
>> @@ -202,17 +202,19 @@ The maximum number of repeats is hard-coded to 4096. This is a conservative
>> limit based on the maximum size of a SEND message along with emperical
>> observations on the maximum future benefit of simultaneous page registrations.
>>
>> -The 'type' field has 10 different command values:
>> - 1. Unused
>> - 2. Error (sent to the source during bad things)
>> - 3. Ready (control-channel is available)
>> - 4. QEMU File (for sending non-live device state)
>> - 5. RAM Blocks request (used right after connection setup)
>> - 6. RAM Blocks result (used right after connection setup)
>> - 7. Compress page (zap zero page and skip registration)
>> - 8. Register request (dynamic chunk registration)
>> - 9. Register result ('rkey' to be used by sender)
>> - 10. Register finished (registration for current iteration finished)
>> +The 'type' field has 12 different command values:
>> + 1. Unused
>> + 2. Error (sent to the source during bad things)
>> + 3. Ready (control-channel is available)
>> + 4. QEMU File (for sending non-live device state)
>> + 5. RAM Blocks request (used right after connection setup)
>> + 6. RAM Blocks result (used right after connection setup)
>> + 7. Compress page (zap zero page and skip registration)
>> + 8. Register request (dynamic chunk registration)
>> + 9. Register result ('rkey' to be used by sender)
>> + 10. Register finished (registration for current iteration finished)
> reformatting is trivial,
>
>> + 11. Unregister request (unpin previously registered memory)
>> + 12. Unregister finished (confirmation that unpin completed)
> addition belongs in the second patch (so that we don't have to wade
> through that much trivial stuff to find the real changes)
>
>>
>> A single control message, as hinted above, can contain within the data
>> portion an array of many commands of the same type. If there is more than
>> @@ -243,7 +245,7 @@ qemu_rdma_exchange_send(header, data, optional response header & data):
>> from the receiver to tell us that the receiver
>> is *ready* for us to transmit some new bytes.
>> 2. Optionally: if we are expecting a response from the command
>> - (that we have no yet transmitted), let's post an RQ
>> + (that we have not yet transmitted), let's post an RQ
> trivial
>
>> work request to receive that data a few moments later.
>> 3. When the READY arrives, librdmacm will
>> unblock us and we immediately post a RQ work request
>> @@ -293,8 +295,10 @@ librdmacm provides the user with a 'private data' area to be exchanged
>> at connection-setup time before any infiniband traffic is generated.
>>
>> Header:
>> - * Version (protocol version validated before send/recv occurs), uint32, network byte order
>> - * Flags (bitwise OR of each capability), uint32, network byte order
>> + * Version (protocol version validated before send/recv occurs),
>> + uint32, network byte order
>> + * Flags (bitwise OR of each capability),
>> + uint32, network byte order
> trivial
>
>>
>> There is no data portion of this header right now, so there is
>> no length field. The maximum size of the 'private data' section
>> @@ -313,7 +317,7 @@ If the version is invalid, we throw an error.
>> If the version is new, we only negotiate the capabilities that the
>> requested version is able to perform and ignore the rest.
>>
>> -Currently there is only *one* capability in Version #1: dynamic page registration
>> +Currently there is only one capability in Version #1: dynamic page registration
> trivial
>
>>
>> Finally: Negotiation happens with the Flags field: If the primary-VM
>> sets a flag, but the destination does not support this capability, it
>> @@ -326,8 +330,8 @@ QEMUFileRDMA Interface:
>>
>> QEMUFileRDMA introduces a couple of new functions:
>>
>> -1. qemu_rdma_get_buffer() (QEMUFileOps rdma_read_ops)
>> -2. qemu_rdma_put_buffer() (QEMUFileOps rdma_write_ops)
>> +1. qemu_rdma_get_buffer() (QEMUFileOps rdma_read_ops)
>> +2. qemu_rdma_put_buffer() (QEMUFileOps rdma_write_ops)
> trivial
>
>>
>> These two functions are very short and simply use the protocol
>> describe above to deliver bytes without changing the upper-level
>> @@ -413,3 +417,8 @@ TODO:
>> the use of KSM and ballooning while using RDMA.
>> 4. Also, some form of balloon-device usage tracking would also
>> help alleviate some issues.
>> +5. Move UNREGISTER requests to a separate thread.
>> +6. Use LRU to provide more fine-grained direction of UNREGISTER
>> + requests for unpinning memory in an overcommitted environment.
>> +7. Expose UNREGISTER support to the user by way of workload-specific
>> + hints about application behavior.
>>
> new content
>
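
For anyone reading the control-message layout quoted above without the source at hand, here is a minimal Python sketch of how such a header could be packed and parsed. It is an illustration only, not the QEMU implementation: the names ControlType, pack_control and unpack_control are hypothetical, the numeric values simply mirror the list numbering in docs/rdma.txt (the values actually sent on the wire may differ), and the width of 'Repeat' is assumed to be uint32 because the text does not state it.

```python
import struct
from enum import IntEnum


class ControlType(IntEnum):
    """Hypothetical names; values follow the numbering of the quoted list."""
    UNUSED = 1
    ERROR = 2
    READY = 3
    QEMU_FILE = 4
    RAM_BLOCKS_REQUEST = 5
    RAM_BLOCKS_RESULT = 6
    COMPRESS = 7
    REGISTER_REQUEST = 8
    REGISTER_RESULT = 9
    REGISTER_FINISHED = 10
    UNREGISTER_REQUEST = 11
    UNREGISTER_FINISHED = 12


# "!" selects network byte order. Fields: length, type, repeat.
# Assumption: 'Repeat' is a uint32 like the other two fields.
HEADER = struct.Struct("!III")
MAX_REPEAT = 4096  # hard-coded limit mentioned in the quoted text


def pack_control(ctype, data=b"", repeat=1):
    """Build one control message: header immediately followed by data."""
    if not 0 <= repeat <= MAX_REPEAT:
        raise ValueError("repeat out of range")
    return HEADER.pack(len(data), int(ctype), repeat) + data


def unpack_control(message):
    """Split a control message back into (type, repeat, data)."""
    length, ctype, repeat = HEADER.unpack_from(message)
    data = message[HEADER.size:HEADER.size + length]
    if len(data) != length:
        raise ValueError("truncated data portion")
    return ControlType(ctype), repeat, data


if __name__ == "__main__":
    # An unregister request carrying 8 bytes of made-up payload.
    msg = pack_control(ControlType.UNREGISTER_REQUEST, b"\x00" * 8)
    print(unpack_control(msg))
```

Keeping the header as three fixed-width fields means a receiver can read the first 12 bytes, learn the data length, and then decide how many commands of the same type to expect from 'Repeat'.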