From: "Michael R. Hines" <mrhines@linux.vnet.ibm.com>
To: Eric Blake <eblake@redhat.com>
Cc: aliguori@us.ibm.com, quintela@redhat.com, qemu-devel@nongnu.org,
owasserm@redhat.com, abali@us.ibm.com, mrhines@us.ibm.com,
gokul@us.ibm.com, pbonzini@redhat.com, chegu_vinod@hp.com,
knoel@redhat.com
Subject: Re: [Qemu-devel] [PATCH v3 resend/cleanup 1/8] rdma: update documentation to reflect new unpin support
Date: Fri, 12 Jul 2013 13:26:05 -0400
Message-ID: <51E03C2D.9020206@linux.vnet.ibm.com>
In-Reply-To: <51E0382E.5030209@redhat.com>

On 07/12/2013 01:09 PM, Eric Blake wrote:
> On 07/12/2013 08:40 AM, mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> As requested, the protocol now includes memory unpinning support.
>> This has been implemented in a non-optimized manner, in such a way
>> that one could devise an LRU or other workload-specific information
>> on top of the basic mechanism to influence the way unpinning happens
>> during runtime.
>>
>> The feature is not yet user-facing, and thus can only be enabled
>> at compile-time.
>>
>> Reviewed-by: Eric Blake <eblake@redhat.com>
>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
>> ---
>> docs/rdma.txt | 51 ++++++++++++++++++++++++++++++---------------------
>> 1 file changed, 30 insertions(+), 21 deletions(-)
> I suggest splitting this patch into two; and cc-ing the first of the two
> patches through qemu-trivial (since formatting cleanups can be applied
> now, even while still waiting for a comprehensive review of the
> algorithm in the rest of the series)

My understanding is that the reviews have already been completed,
including a very extensive test series that I performed, with both
virt-test and non-virt-test results from Chegu and me.

Am I mistaken?

>
>> diff --git a/docs/rdma.txt b/docs/rdma.txt
>> index 45a4b1d..45d1c8a 100644
>> --- a/docs/rdma.txt
>> +++ b/docs/rdma.txt
>> @@ -35,7 +35,7 @@ memory tracked during each live migration iteration round cannot keep pace
>> with the rate of dirty memory produced by the workload.
>>
>> RDMA currently comes in two flavors: both Ethernet based (RoCE, or RDMA
>> -over Convered Ethernet) as well as Infiniband-based. This implementation of
>> +over Converged Ethernet) as well as Infiniband-based. This implementation of
> Trivial
>
>> migration using RDMA is capable of using both technologies because of
>> the use of the OpenFabrics OFED software stack that abstracts out the
>> programming model irrespective of the underlying hardware.
>> @@ -188,9 +188,9 @@ header portion and a data portion (but together are transmitted
>> as a single SEND message).
>>
>> Header:
>> - * Length (of the data portion, uint32, network byte order)
>> - * Type (what command to perform, uint32, network byte order)
>> - * Repeat (Number of commands in data portion, same type only)
>> + * Length (of the data portion, uint32, network byte order)
>> + * Type (what command to perform, uint32, network byte order)
>> + * Repeat (Number of commands in data portion, same type only)
> trivial
>
>>
>> The 'Repeat' field is here to support future multiple page registrations
>> in a single message without any need to change the protocol itself
>> @@ -202,17 +202,19 @@ The maximum number of repeats is hard-coded to 4096. This is a conservative
>> limit based on the maximum size of a SEND message along with emperical
>> observations on the maximum future benefit of simultaneous page registrations.
>>
>> -The 'type' field has 10 different command values:
>> - 1. Unused
>> - 2. Error (sent to the source during bad things)
>> - 3. Ready (control-channel is available)
>> - 4. QEMU File (for sending non-live device state)
>> - 5. RAM Blocks request (used right after connection setup)
>> - 6. RAM Blocks result (used right after connection setup)
>> - 7. Compress page (zap zero page and skip registration)
>> - 8. Register request (dynamic chunk registration)
>> - 9. Register result ('rkey' to be used by sender)
>> - 10. Register finished (registration for current iteration finished)
>> +The 'type' field has 12 different command values:
>> + 1. Unused
>> + 2. Error (sent to the source during bad things)
>> + 3. Ready (control-channel is available)
>> + 4. QEMU File (for sending non-live device state)
>> + 5. RAM Blocks request (used right after connection setup)
>> + 6. RAM Blocks result (used right after connection setup)
>> + 7. Compress page (zap zero page and skip registration)
>> + 8. Register request (dynamic chunk registration)
>> + 9. Register result ('rkey' to be used by sender)
>> + 10. Register finished (registration for current iteration finished)
> reformatting is trivial,
>
>> + 11. Unregister request (unpin previously registered memory)
>> + 12. Unregister finished (confirmation that unpin completed)
> addition belongs in the second patch (so that we don't have to wade
> through that much trivial stuff to find the real changes)
>
>>
>> A single control message, as hinted above, can contain within the data
>> portion an array of many commands of the same type. If there is more than
>> @@ -243,7 +245,7 @@ qemu_rdma_exchange_send(header, data, optional response header & data):
>> from the receiver to tell us that the receiver
>> is *ready* for us to transmit some new bytes.
>> 2. Optionally: if we are expecting a response from the command
>> - (that we have no yet transmitted), let's post an RQ
>> + (that we have not yet transmitted), let's post an RQ
> trivial
>
>> work request to receive that data a few moments later.
>> 3. When the READY arrives, librdmacm will
>> unblock us and we immediately post a RQ work request
>> @@ -293,8 +295,10 @@ librdmacm provides the user with a 'private data' area to be exchanged
>> at connection-setup time before any infiniband traffic is generated.
>>
>> Header:
>> - * Version (protocol version validated before send/recv occurs), uint32, network byte order
>> - * Flags (bitwise OR of each capability), uint32, network byte order
>> + * Version (protocol version validated before send/recv occurs),
>> + uint32, network byte order
>> + * Flags (bitwise OR of each capability),
>> + uint32, network byte order
> trivial
>
>>
>> There is no data portion of this header right now, so there is
>> no length field. The maximum size of the 'private data' section
>> @@ -313,7 +317,7 @@ If the version is invalid, we throw an error.
>> If the version is new, we only negotiate the capabilities that the
>> requested version is able to perform and ignore the rest.
>>
>> -Currently there is only *one* capability in Version #1: dynamic page registration
>> +Currently there is only one capability in Version #1: dynamic page registration
> trivial
>
>>
>> Finally: Negotiation happens with the Flags field: If the primary-VM
>> sets a flag, but the destination does not support this capability, it
>> @@ -326,8 +330,8 @@ QEMUFileRDMA Interface:
>>
>> QEMUFileRDMA introduces a couple of new functions:
>>
>> -1. qemu_rdma_get_buffer() (QEMUFileOps rdma_read_ops)
>> -2. qemu_rdma_put_buffer() (QEMUFileOps rdma_write_ops)
>> +1. qemu_rdma_get_buffer() (QEMUFileOps rdma_read_ops)
>> +2. qemu_rdma_put_buffer() (QEMUFileOps rdma_write_ops)
> trivial
>
>>
>> These two functions are very short and simply use the protocol
>> describe above to deliver bytes without changing the upper-level
>> @@ -413,3 +417,8 @@ TODO:
>> the use of KSM and ballooning while using RDMA.
>> 4. Also, some form of balloon-device usage tracking would also
>> help alleviate some issues.
>> +5. Move UNREGISTER requests to a separate thread.
>> +6. Use LRU to provide more fine-grained direction of UNREGISTER
>> + requests for unpinning memory in an overcommitted environment.
>> +7. Expose UNREGISTER support to the user by way of workload-specific
>> + hints about application behavior.
>>
> new content
>
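
For reference, the control header and the extended command list from
the quoted hunks could be serialized roughly as below. This is only a
minimal sketch, not the code from the series: every name with a
sketch_/SKETCH_ prefix is hypothetical, the enum values simply follow
the 1-based numbering in docs/rdma.txt (the actual wire values may
differ), and 'Repeat' is assumed to be a uint32 in network byte order
like the other two fields, which the document does not state.

```c
#include <arpa/inet.h>  /* htonl, ntohl */
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical names; values mirror the numbered list in docs/rdma.txt. */
enum sketch_ctrl_type {
    SKETCH_CTRL_UNUSED = 1,
    SKETCH_CTRL_ERROR,
    SKETCH_CTRL_READY,
    SKETCH_CTRL_QEMU_FILE,
    SKETCH_CTRL_RAM_BLOCKS_REQUEST,
    SKETCH_CTRL_RAM_BLOCKS_RESULT,
    SKETCH_CTRL_COMPRESS,
    SKETCH_CTRL_REGISTER_REQUEST,
    SKETCH_CTRL_REGISTER_RESULT,
    SKETCH_CTRL_REGISTER_FINISHED,
    SKETCH_CTRL_UNREGISTER_REQUEST,   /* 11: unpin registered memory */
    SKETCH_CTRL_UNREGISTER_FINISHED   /* 12: unpin completed */
};

#define SKETCH_CTRL_MAX_REPEAT  4096u  /* hard-coded limit from the doc */
#define SKETCH_CTRL_HEADER_SIZE 12     /* three uint32 fields */

struct sketch_ctrl_header {
    uint32_t len;     /* length of the data portion */
    uint32_t type;    /* command to perform */
    uint32_t repeat;  /* number of same-type commands in the data portion */
};

/*
 * Write the header into buf (at least SKETCH_CTRL_HEADER_SIZE bytes)
 * in network byte order. Returns 0 on success, or -1 if 'repeat'
 * exceeds the documented maximum.
 */
int sketch_ctrl_header_pack(const struct sketch_ctrl_header *h, uint8_t *buf)
{
    uint32_t wire[3];

    assert(h != NULL && buf != NULL);
    if (h->repeat > SKETCH_CTRL_MAX_REPEAT) {
        return -1;
    }
    wire[0] = htonl(h->len);
    wire[1] = htonl(h->type);
    wire[2] = htonl(h->repeat);
    memcpy(buf, wire, sizeof(wire));
    return 0;
}

/* Read a header back from buf, converting to host byte order. */
void sketch_ctrl_header_unpack(const uint8_t *buf, struct sketch_ctrl_header *h)
{
    uint32_t wire[3];

    assert(h != NULL && buf != NULL);
    memcpy(wire, buf, sizeof(wire));
    h->len = ntohl(wire[0]);
    h->type = ntohl(wire[1]);
    h->repeat = ntohl(wire[2]);
}
```

Going through memcpy rather than casting the buffer to a struct
pointer avoids alignment and padding assumptions about the SEND
buffer.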