From: "Michael R. Hines" <mrhines@linux.vnet.ibm.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: aliguori@us.ibm.com, quintela@redhat.com, qemu-devel@nongnu.org,
owasserm@redhat.com, abali@us.ibm.com, mrhines@us.ibm.com,
gokul@us.ibm.com, pbonzini@redhat.com
Subject: Re: [Qemu-devel] [PULL v4 11/11] rdma: add documentation
Date: Thu, 18 Apr 2013 20:57:30 -0400
Message-ID: <5170967A.1030709@linux.vnet.ibm.com>
In-Reply-To: <20130418065545.GA13787@redhat.com>
I'm very sorry. I totally missed this email. My apologies.
On 04/18/2013 02:55 AM, Michael S. Tsirkin wrote:
> On Wed, Apr 17, 2013 at 07:07:20PM -0400, mrhines@linux.vnet.ibm.com wrote:
>> From: "Michael R. Hines" <mrhines@us.ibm.com>
>>
>> docs/rdma.txt contains full documentation,
>> wiki links, github url and contact information.
>>
>> Signed-off-by: Michael R. Hines <mrhines@us.ibm.com>
> OK that's better. Need to improve the following areas:
> - fix half-sentences such as 'faster' (without saying than what)
> - document tradeoffs
Besides the tradeoff between memory registration cost and the latency
and throughput gains, are there other tradeoffs that you would
specifically like to see commented on that are not already listed?
The documentation already has both a "Before running" section and
a brief "performance" section.
The documentation is quite long now. A full paper has already been
written which clearly documents the tradeoffs.
Anything more than that and we would be re-writing the paper that is
already linked to - which is quite thorough, given the work done in 2010.
> - better document how to run
Better? I don't understand.
What more is needed besides the QMP migrate command?
One of the libvirt developers already told me not to include
any libvirt commands in the QEMU documentation.
> - add more examples
Examples of what?
The only option left over from the review process
is chunk registration, and the documentation already has instructions
on how to toggle that option.
>> ---
>>
>> +BEFORE RUNNING:
>> +===============
>> +
>> +RDMA helps make your migration more deterministic under heavy load because
>> +of the significantly lower latency and higher throughput
> Higher and lower than what? Above is not helpful and subtly wrong. Say instead
>
> 'On infiniband networks, RDMA can achieve lower latency and higher
> throughput than IP over infiniband based networking by reducing the
> amount of interrupts and data copies and bypassing the host networking
> stack. Using RDMA for VM migration makes migration more deterministic
> under heavy VM load'.
>
> And add an example what 'more deterministic' means.
>
Acknowledged.
>> provided by infiniband.
> Does this works on top of other RDMA transports or just infiniband?
> Needs clarification.
Acknowledged. I will include RoCE in the description.
>> +
>> +Use of RDMA during migration requires pinning and registering memory
>> +with the hardware. This means that memory must be resident in memory
>> +before the hardware can transmit that memory to another machine.
> Above is too vague to be of real use. Please insert here the
> implications on host versus total VMs memory size.
> Also add some examples.
I already included a simple 8GB VM example. Can you be more specific?
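If it would make the implication more concrete, I could add a rough
sketch like the one below (purely illustrative - it assumes a single
qemu-system-x86_64 process on the destination host), so the reader can
watch the pinning happen once a non-chunked RDMA migration starts:

$ grep -E 'VmPin|VmLck' /proc/$(pidof qemu-system-x86_64)/status
                        # pinned memory approaches the full guest RAM size
                        # (reported as VmPin or VmLck depending on the kernel)
$ free -m               # shows how much host memory is left for everything else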
>> +If this is not acceptable for your application or product,
>> +then the use of RDMA migration is strongly discouraged and you
>> +should revert back to standard TCP-based migration.
> Above is not helpful and will just lead to more questions.
> Remove.
Why is it not helpful?
It is a clear warning that RDMA can be
harmful to other software running on the hypervisor if the relocation
is not planned for in advance by management software.
>> +
>> +Experimental: Next, decide if you want dynamic page registration.
>> +For example, if you have an 8GB RAM virtual machine, but only 1GB
>> +is in active use,
> This is wrong, isn't it? You only skip zero pages, so any page
> that has data, even if it's not in active use, will be pinned.
Active use != dirty. That's why I chose the word "used".
Used includes both accessed and dirty pages.
A page can be used and later transition to dirty.
To be used means that a page *must* have been accessed
at least once - which is what causes the operating system
to map it in the first place.
I don't think it's our job to get into the "finer points"
of kernel memory management in higher-level documentation
like QEMU's.
>> then disabling this feature will cause all 8GB to
>> +be pinned and resident in memory.
> Add as opposed to the default behaviour which is ....
With all due respect, aren't we micro-managing here?
That was clearly described at the beginning of the documentation.
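If spelling out the default right next to the command would settle it,
something like this could be added (capability name as it currently
stands in this series; the 8GB figure is just the running example):

QEMU Monitor Command:
$ migrate_set_capability x-chunk-register-destination on   # default: pages are
                                                            # registered in chunks as they are sent
$ migrate_set_capability x-chunk-register-destination off  # pins all guest RAM
                                                            # (all 8GB) up front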
>> This feature mostly affects the
>> +bulk-phase round of the migration and can be disabled for extremely
>> +high-performance RDMA hardware
> Above is meaningless, it does not help user to know whether her hardware
> is "extremely high-performance". Put numbers here please.
> Does it help 40G cards but not 20g ones? By how much?
Acknowledged.
>
>> using the following command:
>> +
>> +QEMU Monitor Command:
>> +$ migrate_set_capability x-chunk-register-destination off # enabled by default
>> +
>> +Performing this action will cause all 8GB to be pinned, so if that's
>> +not what you want, then please ignore this step altogether.
>> +
>> +On the other hand, this will also significantly speed up the bulk round
>> +of the migration, which can greatly reduce the "total" time of your migration.
>
> Please add some example numbers so people know what the tradeoff is.
Acknowledged.
>> +
>> +RUNNING:
>> +========
>> +
>> +First, set the migration speed to match your hardware's capabilities:
>> +
>> +QEMU Monitor Command:
>> +$ migrate_set_speed 40g # or whatever is the MAX of your RDMA device
>> +
>> +Next, on the destination machine, add the following to the QEMU command line:
>> +
>> +qemu ..... -incoming x-rdma:host:port
>> +
>> +Finally, perform the actual migration:
>> +
>> +QEMU Monitor Command:
>> +$ migrate -d x-rdma:host:port
>> +
> Note users stop reading here, below is info for developers.
> So please add here the requirement to do ulimit and with what value.
> Also add an example with VM size.
Exactly what is the right ulimit value? Only the administrator
can determine that value based on how much free memory
is available on the hypervisor. If there is plenty of memory,
then no ulimit command is required at all, because QEMU can
safely pin the entire VM. If there is not enough free memory,
then ulimit must be limited to the amount of available free memory.
Should I just make a general statement like that?
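For example, I could add a hypothetical snippet like this to the
RUNNING section (the 8GB guest and the 4GB limit are purely
illustrative numbers), run in the shell before launching QEMU on the
destination:

$ ulimit -l 4194304     # cap locked/pinned memory at 4GB (value is in KB)
$ qemu ..... -incoming x-rdma:host:port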
>> +TODO:
>> +=====
>> +1. 'migrate x-rdma:host:port' and '-incoming x-rdma' options will be
>> + renamed to 'rdma' after the experimental phase of this work has
>> + completed upstream.
>> +2. Currently, 'ulimit -l' mlock() limits as well as cgroups swap limits
>> +   are not compatible with infiniband memory pinning and will result in
>> + an aborted migration (but with the source VM left unaffected).
>> +3. Use of the recent /proc/<pid>/pagemap would likely speed up
>> + the use of KSM and ballooning while using RDMA.
> For KSM you'll need the _GIFT patch for this I think, maybe note this.
I would prefer not to document features that do not yet exist.
In the near future, I will probably use the pagemap, in which
case I can update the documentation at that time.
- Michael