Message-ID: <51E0382E.5030209@redhat.com>
Date: Fri, 12 Jul 2013 11:09:02 -0600
From: Eric Blake
References: <1373640028-5138-1-git-send-email-mrhines@linux.vnet.ibm.com>
 <1373640028-5138-2-git-send-email-mrhines@linux.vnet.ibm.com>
In-Reply-To: <1373640028-5138-2-git-send-email-mrhines@linux.vnet.ibm.com>
Subject: Re: [Qemu-devel] [PATCH v3 resend/cleanup 1/8] rdma: update documentation to reflect new unpin support
To: mrhines@linux.vnet.ibm.com
Cc: aliguori@us.ibm.com, quintela@redhat.com, qemu-devel@nongnu.org,
 owasserm@redhat.com, abali@us.ibm.com, mrhines@us.ibm.com,
 gokul@us.ibm.com, pbonzini@redhat.com, chegu_vinod@hp.com,
 knoel@redhat.com

On 07/12/2013 08:40 AM, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines"
>
> As requested, the protocol now includes memory unpinning support.
> This has been implemented in a non-optimized manner, in such a way
> that one could devise an LRU or other workload-specific information
> on top of the basic mechanism to influence the way unpinning happens
> during runtime.
>
> The feature is not yet user-facing, and is thus can only be enabled
> at compile-time.
>
> Reviewed-by: Eric Blake
> Signed-off-by: Michael R. Hines
> ---
>  docs/rdma.txt | 51 ++++++++++++++++++++++++++++++---------------------
>  1 file changed, 30 insertions(+), 21 deletions(-)

I suggest splitting this patch into two, and cc-ing the first of the two
patches through qemu-trivial (since formatting cleanups can be applied
now, even while still waiting for a comprehensive review of the
algorithm in the rest of the series).

>
> diff --git a/docs/rdma.txt b/docs/rdma.txt
> index 45a4b1d..45d1c8a 100644
> --- a/docs/rdma.txt
> +++ b/docs/rdma.txt
> @@ -35,7 +35,7 @@ memory tracked during each live migration iteration round cannot keep pace
>  with the rate of dirty memory produced by the workload.
>
>  RDMA currently comes in two flavors: both Ethernet based (RoCE, or RDMA
> -over Convered Ethernet) as well as Infiniband-based. This implementation of
> +over Converged Ethernet) as well as Infiniband-based. This implementation of

Trivial

>  migration using RDMA is capable of using both technologies because of
>  the use of the OpenFabrics OFED software stack that abstracts out the
>  programming model irrespective of the underlying hardware.
> @@ -188,9 +188,9 @@ header portion and a data portion (but together are transmitted
>  as a single SEND message).
>
>  Header:
> -    * Length  (of the data portion, uint32, network byte order)
> -    * Type    (what command to perform, uint32, network byte order)
> -    * Repeat  (Number of commands in data portion, same type only)
> +    * Length               (of the data portion, uint32, network byte order)
> +    * Type                 (what command to perform, uint32, network byte order)
> +    * Repeat               (Number of commands in data portion, same type only)

trivial

>
>  The 'Repeat' field is here to support future multiple page registrations
>  in a single message without any need to change the protocol itself
> @@ -202,17 +202,19 @@ The maximum number of repeats is hard-coded to 4096. This is a conservative
>  limit based on the maximum size of a SEND message along with emperical
>  observations on the maximum future benefit of simultaneous page registrations.
>
> -The 'type' field has 10 different command values:
> -    1. Unused
> -    2. Error              (sent to the source during bad things)
> -    3. Ready              (control-channel is available)
> -    4. QEMU File          (for sending non-live device state)
> -    5. RAM Blocks request (used right after connection setup)
> -    6. RAM Blocks result  (used right after connection setup)
> -    7. Compress page      (zap zero page and skip registration)
> -    8. Register request   (dynamic chunk registration)
> -    9. Register result    ('rkey' to be used by sender)
> -    10. Register finished  (registration for current iteration finished)
> +The 'type' field has 12 different command values:
> +     1. Unused
> +     2. Error                      (sent to the source during bad things)
> +     3. Ready                      (control-channel is available)
> +     4. QEMU File                  (for sending non-live device state)
> +     5. RAM Blocks request         (used right after connection setup)
> +     6. RAM Blocks result          (used right after connection setup)
> +     7. Compress page              (zap zero page and skip registration)
> +     8. Register request           (dynamic chunk registration)
> +     9. Register result            ('rkey' to be used by sender)
> +    10. Register finished          (registration for current iteration finished)

reformatting is trivial,

> +    11. Unregister request         (unpin previously registered memory)
> +    12. Unregister finished        (confirmation that unpin completed)

addition belongs in the second patch (so that we don't have to wade
through that much trivial stuff to find the real changes)

>
>  A single control message, as hinted above, can contain within the data
>  portion an array of many commands of the same type. If there is more than
> @@ -243,7 +245,7 @@ qemu_rdma_exchange_send(header, data, optional response header & data):
>     from the receiver to tell us that the receiver
>     is *ready* for us to transmit some new bytes.
>  2. Optionally: if we are expecting a response from the command
> -   (that we have no yet transmitted), let's post an RQ
> +   (that we have not yet transmitted), let's post an RQ

trivial

>     work request to receive that data a few moments later.
>  3. When the READY arrives, librdmacm will
>     unblock us and we immediately post a RQ work request
> @@ -293,8 +295,10 @@ librdmacm provides the user with a 'private data' area to be exchanged
>  at connection-setup time before any infiniband traffic is generated.
>
>  Header:
> -    * Version (protocol version validated before send/recv occurs), uint32, network byte order
> -    * Flags   (bitwise OR of each capability), uint32, network byte order
> +    * Version (protocol version validated before send/recv occurs),
> +                                               uint32, network byte order
> +    * Flags   (bitwise OR of each capability),
> +                                               uint32, network byte order

trivial

>
>  There is no data portion of this header right now, so there is
>  no length field. The maximum size of the 'private data' section
> @@ -313,7 +317,7 @@ If the version is invalid, we throw an error.
>  If the version is new, we only negotiate the capabilities that the
>  requested version is able to perform and ignore the rest.
>
> -Currently there is only *one* capability in Version #1: dynamic page registration
> +Currently there is only one capability in Version #1: dynamic page registration

trivial

>
>  Finally: Negotiation happens with the Flags field: If the primary-VM
>  sets a flag, but the destination does not support this capability, it
> @@ -326,8 +330,8 @@ QEMUFileRDMA Interface:
>
>  QEMUFileRDMA introduces a couple of new functions:
>
> -1. qemu_rdma_get_buffer()  (QEMUFileOps rdma_read_ops)
> -2. qemu_rdma_put_buffer()  (QEMUFileOps rdma_write_ops)
> +1. qemu_rdma_get_buffer()               (QEMUFileOps rdma_read_ops)
> +2. qemu_rdma_put_buffer()               (QEMUFileOps rdma_write_ops)

trivial

>
>  These two functions are very short and simply use the protocol
>  describe above to deliver bytes without changing the upper-level
> @@ -413,3 +417,8 @@ TODO:
>     the use of KSM and ballooning while using RDMA.
>  4. Also, some form of balloon-device usage tracking would also
>     help alleviate some issues.
> +5. Move UNREGISTER requests to a separate thread.
> +6. Use LRU to provide more fine-grained direction of UNREGISTER
> +   requests for unpinning memory in an overcommitted environment.
> +7. Expose UNREGISTER support to the user by way of workload-specific
> +   hints about application behavior.
>

new content

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
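
For readers following the quoted control-message description (a header of
Length, Type and Repeat, followed by a data portion, sent together as one
SEND message), here is a minimal Python sketch of how such a header could be
packed and checked. It is an illustration under stated assumptions, not the
code from the series: the names pack_control, unpack_control and Command are
hypothetical, the Command values simply follow the 1..12 numbering of the
list in the quoted text (the constants in the real source may be numbered
differently), and Repeat is assumed to be a uint32 in network byte order
like the other two fields, since the quoted text does not give its width.

```python
import struct
from enum import IntEnum

# Length and Type are documented as uint32 in network byte order.
# ASSUMPTION: Repeat is packed the same way; the quoted text gives no width.
HEADER = struct.Struct("!III")
MAX_REPEAT = 4096  # hard-coded limit stated in the quoted documentation


class Command(IntEnum):
    """Hypothetical names; values follow the 1..12 list numbering only."""
    UNUSED = 1
    ERROR = 2
    READY = 3
    QEMU_FILE = 4
    RAM_BLOCKS_REQUEST = 5
    RAM_BLOCKS_RESULT = 6
    COMPRESS_PAGE = 7
    REGISTER_REQUEST = 8
    REGISTER_RESULT = 9
    REGISTER_FINISHED = 10
    UNREGISTER_REQUEST = 11
    UNREGISTER_FINISHED = 12


def pack_control(command, data=b"", repeat=1):
    """Build header + data as one buffer, as for a single SEND message."""
    if not 0 <= repeat <= MAX_REPEAT:
        raise ValueError(f"repeat {repeat} outside 0..{MAX_REPEAT}")
    return HEADER.pack(len(data), int(command), repeat) + data


def unpack_control(message):
    """Split a buffer into (command, repeat, data), checking the fields."""
    length, raw_type, repeat = HEADER.unpack_from(message)
    data = message[HEADER.size:]
    if len(data) != length:
        raise ValueError("length field does not match data portion")
    if repeat > MAX_REPEAT:
        raise ValueError("repeat exceeds the documented maximum")
    # Command() raises ValueError for a type outside the 12 listed values.
    return Command(raw_type), repeat, data


if __name__ == "__main__":
    wire = pack_control(Command.UNREGISTER_REQUEST, b"\x00" * 16, repeat=2)
    print(wire[:HEADER.size].hex())
    print(unpack_control(wire))
```

The "!" prefix in the struct format selects network (big-endian) byte order
with no padding, which matches the "uint32, network byte order" wording for
Length and Type. Rejecting an oversized Repeat on both the send and receive
side is a choice made for this sketch; the quoted text only says the maximum
is hard-coded to 4096.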