From: Eric Blake <eblake@redhat.com>
To: mrhines@linux.vnet.ibm.com
Cc: aliguori@us.ibm.com, quintela@redhat.com, qemu-devel@nongnu.org,
owasserm@redhat.com, abali@us.ibm.com, mrhines@us.ibm.com,
gokul@us.ibm.com, pbonzini@redhat.com, chegu_vinod@hp.com,
knoel@redhat.com
Subject: Re: [Qemu-devel] [PATCH v3 resend/cleanup 1/8] rdma: update documentation to reflect new unpin support
Date: Fri, 12 Jul 2013 11:09:02 -0600
Message-ID: <51E0382E.5030209@redhat.com>
In-Reply-To: <1373640028-5138-2-git-send-email-mrhines@linux.vnet.ibm.com>
On 07/12/2013 08:40 AM, mrhines@linux.vnet.ibm.com wrote:
> From: "Michael R. Hines" <mrhines@us.ibm.com>
>
> As requested, the protocol now includes memory unpinning support.
> This has been implemented in a non-optimized manner, in such a way
> that one could devise an LRU or other workload-specific information
> on top of the basic mechanism to influence the way unpinning happens
> during runtime.
>
> The feature is not yet user-facing, and is thus can only be enabled
> at compile-time.
>
> Reviewed-by: Eric Blake <eblake@redhat.com>
> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> ---
> docs/rdma.txt | 51 ++++++++++++++++++++++++++++++---------------------
> 1 file changed, 30 insertions(+), 21 deletions(-)
I suggest splitting this patch into two and cc-ing the first of the two
patches to qemu-trivial, since formatting cleanups can be applied now,
even while the rest of the series is still waiting for a comprehensive
review of the algorithm.
>
> diff --git a/docs/rdma.txt b/docs/rdma.txt
> index 45a4b1d..45d1c8a 100644
> --- a/docs/rdma.txt
> +++ b/docs/rdma.txt
> @@ -35,7 +35,7 @@ memory tracked during each live migration iteration round cannot keep pace
> with the rate of dirty memory produced by the workload.
>
> RDMA currently comes in two flavors: both Ethernet based (RoCE, or RDMA
> -over Convered Ethernet) as well as Infiniband-based. This implementation of
> +over Converged Ethernet) as well as Infiniband-based. This implementation of
Trivial
> migration using RDMA is capable of using both technologies because of
> the use of the OpenFabrics OFED software stack that abstracts out the
> programming model irrespective of the underlying hardware.
> @@ -188,9 +188,9 @@ header portion and a data portion (but together are transmitted
> as a single SEND message).
>
> Header:
> - * Length (of the data portion, uint32, network byte order)
> - * Type (what command to perform, uint32, network byte order)
> - * Repeat (Number of commands in data portion, same type only)
> + * Length (of the data portion, uint32, network byte order)
> + * Type (what command to perform, uint32, network byte order)
> + * Repeat (Number of commands in data portion, same type only)
trivial
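
For anyone checking the layout this hunk documents, here is a minimal
sketch of the control header, not the actual QEMU code. It assumes all
three fields are unsigned 32-bit values in network byte order; the text
only states uint32 for Length and Type, so the width of Repeat is an
assumption. The function names are hypothetical, and the 4096 limit is
the one quoted in the next hunk.

```python
import struct

# Three unsigned 32-bit fields in network byte order ("!" = big-endian,
# no padding). Repeat being uint32 is an assumption of this sketch.
HEADER_FORMAT = "!III"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)
MAX_REPEAT = 4096  # hard-coded maximum number of repeats per docs/rdma.txt


def pack_control_message(msg_type, repeat, data=b""):
    """Return header + data as one buffer (sent as a single SEND)."""
    if not 0 <= repeat <= MAX_REPEAT:
        raise ValueError("repeat out of range")
    # Length describes only the data portion, not the header itself.
    return struct.pack(HEADER_FORMAT, len(data), msg_type, repeat) + data


def unpack_control_message(buf):
    """Split a received buffer back into (type, repeat, data)."""
    length, msg_type, repeat = struct.unpack_from(HEADER_FORMAT, buf)
    data = buf[HEADER_SIZE:HEADER_SIZE + length]
    if len(data) != length:
        raise ValueError("truncated data portion")
    return msg_type, repeat, data
```

Packing and then unpacking a message should return the same type,
repeat count and data, which makes the field order easy to verify
against the documentation.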
>
> The 'Repeat' field is here to support future multiple page registrations
> in a single message without any need to change the protocol itself
> @@ -202,17 +202,19 @@ The maximum number of repeats is hard-coded to 4096. This is a conservative
> limit based on the maximum size of a SEND message along with emperical
> observations on the maximum future benefit of simultaneous page registrations.
>
> -The 'type' field has 10 different command values:
> - 1. Unused
> - 2. Error (sent to the source during bad things)
> - 3. Ready (control-channel is available)
> - 4. QEMU File (for sending non-live device state)
> - 5. RAM Blocks request (used right after connection setup)
> - 6. RAM Blocks result (used right after connection setup)
> - 7. Compress page (zap zero page and skip registration)
> - 8. Register request (dynamic chunk registration)
> - 9. Register result ('rkey' to be used by sender)
> - 10. Register finished (registration for current iteration finished)
> +The 'type' field has 12 different command values:
> + 1. Unused
> + 2. Error (sent to the source during bad things)
> + 3. Ready (control-channel is available)
> + 4. QEMU File (for sending non-live device state)
> + 5. RAM Blocks request (used right after connection setup)
> + 6. RAM Blocks result (used right after connection setup)
> + 7. Compress page (zap zero page and skip registration)
> + 8. Register request (dynamic chunk registration)
> + 9. Register result ('rkey' to be used by sender)
> + 10. Register finished (registration for current iteration finished)
The reformatting above is trivial.
> + 11. Unregister request (unpin previously registered memory)
> + 12. Unregister finished (confirmation that unpin completed)
This addition belongs in the second patch, so that we don't have to wade
through that much trivial stuff to find the real changes.
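
To make the real change easier to spot, the command list can be written
as an enumeration. This is only a sketch: the member names are
hypothetical, and the numeric values simply mirror the list positions
in docs/rdma.txt (1 through 12), which may not match the values used on
the wire.

```python
from enum import IntEnum


class ControlType(IntEnum):
    # Values follow the list numbering in the document; this is an
    # assumption of the sketch, not the confirmed wire encoding.
    UNUSED = 1
    ERROR = 2
    READY = 3
    QEMU_FILE = 4
    RAM_BLOCKS_REQUEST = 5
    RAM_BLOCKS_RESULT = 6
    COMPRESS = 7
    REGISTER_REQUEST = 8
    REGISTER_RESULT = 9
    REGISTER_FINISHED = 10
    UNREGISTER_REQUEST = 11   # added by this patch
    UNREGISTER_FINISHED = 12  # added by this patch


def is_unpin_command(value):
    """True if the type is one of the two commands this patch adds."""
    return ControlType(value) in (
        ControlType.UNREGISTER_REQUEST,
        ControlType.UNREGISTER_FINISHED,
    )
```

Only the last two members are new; the first ten are unchanged apart
from whitespace.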
>
> A single control message, as hinted above, can contain within the data
> portion an array of many commands of the same type. If there is more than
> @@ -243,7 +245,7 @@ qemu_rdma_exchange_send(header, data, optional response header & data):
> from the receiver to tell us that the receiver
> is *ready* for us to transmit some new bytes.
> 2. Optionally: if we are expecting a response from the command
> - (that we have no yet transmitted), let's post an RQ
> + (that we have not yet transmitted), let's post an RQ
trivial
> work request to receive that data a few moments later.
> 3. When the READY arrives, librdmacm will
> unblock us and we immediately post a RQ work request
> @@ -293,8 +295,10 @@ librdmacm provides the user with a 'private data' area to be exchanged
> at connection-setup time before any infiniband traffic is generated.
>
> Header:
> - * Version (protocol version validated before send/recv occurs), uint32, network byte order
> - * Flags (bitwise OR of each capability), uint32, network byte order
> + * Version (protocol version validated before send/recv occurs),
> + uint32, network byte order
> + * Flags (bitwise OR of each capability),
> + uint32, network byte order
trivial
>
> There is no data portion of this header right now, so there is
> no length field. The maximum size of the 'private data' section
> @@ -313,7 +317,7 @@ If the version is invalid, we throw an error.
> If the version is new, we only negotiate the capabilities that the
> requested version is able to perform and ignore the rest.
>
> -Currently there is only *one* capability in Version #1: dynamic page registration
> +Currently there is only one capability in Version #1: dynamic page registration
trivial
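
The private-data header and the flag negotiation described around this
hunk can be sketched as follows. This is an illustration under stated
assumptions, not the QEMU implementation: the bit value chosen for
dynamic page registration and the function names are hypothetical, and
the sketch assumes the destination simply clears any flag it does not
support.

```python
import struct

# Version and Flags: both uint32, network byte order, no length field.
CAP_FORMAT = "!II"
CAP_DYNAMIC_PAGE_REGISTRATION = 0x01  # hypothetical bit assignment


def pack_capabilities(version, flags):
    """Build the private-data header exchanged at connection setup."""
    return struct.pack(CAP_FORMAT, version, flags)


def unpack_capabilities(buf):
    """Return (version, flags) from a received private-data header."""
    return struct.unpack_from(CAP_FORMAT, buf)


def negotiate(requested_flags, supported_flags):
    """Keep only capabilities both sides support (assumed behaviour)."""
    return requested_flags & supported_flags
```

With a single capability in Version #1, the negotiated flags are either
that one bit or zero.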
>
> Finally: Negotiation happens with the Flags field: If the primary-VM
> sets a flag, but the destination does not support this capability, it
> @@ -326,8 +330,8 @@ QEMUFileRDMA Interface:
>
> QEMUFileRDMA introduces a couple of new functions:
>
> -1. qemu_rdma_get_buffer() (QEMUFileOps rdma_read_ops)
> -2. qemu_rdma_put_buffer() (QEMUFileOps rdma_write_ops)
> +1. qemu_rdma_get_buffer() (QEMUFileOps rdma_read_ops)
> +2. qemu_rdma_put_buffer() (QEMUFileOps rdma_write_ops)
trivial
>
> These two functions are very short and simply use the protocol
> describe above to deliver bytes without changing the upper-level
> @@ -413,3 +417,8 @@ TODO:
> the use of KSM and ballooning while using RDMA.
> 4. Also, some form of balloon-device usage tracking would also
> help alleviate some issues.
> +5. Move UNREGISTER requests to a separate thread.
> +6. Use LRU to provide more fine-grained direction of UNREGISTER
> + requests for unpinning memory in an overcommitted environment.
> +7. Expose UNREGISTER support to the user by way of workload-specific
> + hints about application behavior.
>
new content
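
As an illustration of TODO item 6, here is one way an LRU policy could
sit on top of the basic unregister mechanism. It is a sketch only: the
class, its methods and the max_pinned limit are hypothetical and do not
come from the patch series.

```python
from collections import OrderedDict


class LruChunkTracker:
    """Track registered (pinned) chunks and pick unpin candidates."""

    def __init__(self, max_pinned):
        self.max_pinned = max_pinned
        self._chunks = OrderedDict()  # chunk id -> None, oldest first

    def touch(self, chunk):
        """Record a use of chunk; return chunks that should be unpinned."""
        self._chunks[chunk] = None
        self._chunks.move_to_end(chunk)  # mark as most recently used
        evicted = []
        while len(self._chunks) > self.max_pinned:
            oldest, _ = self._chunks.popitem(last=False)
            evicted.append(oldest)
        return evicted
```

Each chunk returned by touch() would become the target of an UNREGISTER
request; issuing those requests from a separate thread (TODO item 5) is
outside the scope of this sketch.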
--
Eric Blake eblake redhat com +1-919-301-3266
Libvirt virtualization library http://libvirt.org