From: Ronen Hod
Date: Tue, 20 Dec 2011 09:06:52 +0200
Message-ID: <4EF0340C.9000005@redhat.com>
To: qemu-devel@nongnu.org
Subject: [Qemu-devel] [RFC] Migration convergence - a suggestion

Well the issue is not new, anyhow, following a conversation with Orit ...

Since we want the migration to finish, I believe that the "migration speed" parameter alone cannot do the job.
I suggest using two distinct parameters:
1. Migration speed - limits the network resources that the migration may use.
2. aggressionLevel - A number between 0.0 and 1.0, where low values imply minimal interruption to the guest, and 1.0 means that the guest will be completely stalled.

In any case, the migration has to finish at whatever migration speed is actually achieved, so even low aggressionLevel values will sometimes imply that the guest is throttled substantially.

The algorithm:
The aggressionLevel should determine the targetGuest%CPU (how much CPU time we want to allocate to the guest).
With aggressionLevel = 1.0, the guest gets no CPU-resources (stalled).
With aggressionLevel = 0.0, the guest gets minGuest%CPU, such that migrationRate == dirtyPagesRate. This minGuest%CPU is continuously updated based on the running average of the recent samples (more below).

Note that the targetGuest%CPU allocation is continuously updated due to changes in guest behavior, network congestion, and the like.
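
A minimal Python sketch of the mapping from aggressionLevel to targetGuest%CPU may help here. The proposal fixes only the two endpoints (1.0 stalls the guest, 0.0 gives it minGuest%CPU); the linear interpolation between them is an assumption made for illustration, as are the names target_guest_cpu, aggression_level and min_guest_cpu.

```python
def target_guest_cpu(aggression_level: float, min_guest_cpu: float) -> float:
    """Return the CPU fraction (0.0..1.0) to allocate to the guest.

    The endpoints follow the proposal; the linear interpolation between
    them is an assumption, since the proposal does not specify it.
    """
    if not 0.0 <= aggression_level <= 1.0:
        raise ValueError("aggression_level must be in [0.0, 1.0]")
    # The estimate can exceed 1.0 when migration easily outpaces dirtying,
    # but the guest can never receive more than its full CPU share.
    cap = min(max(min_guest_cpu, 0.0), 1.0)
    # aggression_level == 0.0 -> cap, aggression_level == 1.0 -> 0.0 (stalled).
    return (1.0 - aggression_level) * cap
```

With min_guest_cpu = 0.4, an aggression_level of 0.0 yields 0.4, 1.0 yields 0.0, and anything in between scales linearly under this assumption.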

Some more details
- minGuest%CPU (i.e., for dirtyPagesRate == migrationRate) is easy to calculate as a running average of
  (migrationRate / dirtyPagesRate * guest%CPU)
- There are several methods to calculate the running average; my favorite is a first-order IIR filter, where, roughly speaking,
  newVal = 0.99 * oldVal + 0.01 * newSample
- I would use two measures to ensure that there are more migrated pages than "dirty" pages.
  1. The running average (based on recent samples) of the migrated pages is larger than that of the new dirty pages
  2. The total number of migrated pages so far is larger than the total number of new dirty pages.
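
The details above can be sketched in Python as follows. This is only an illustration under stated assumptions: samples are taken over fixed-length intervals (so page counts per interval stand in for rates), the filter is seeded with its first sample, and the names Iir, ConvergenceTracker, sample and is_converging are hypothetical, not existing QEMU code.

```python
class Iir:
    """First-order IIR running average: new = (1 - alpha) * old + alpha * sample."""

    def __init__(self, alpha: float = 0.01):
        self.alpha = alpha
        self.value = None  # no sample seen yet

    def update(self, sample: float) -> float:
        if self.value is None:
            # Seeding with the first sample is an assumption.
            self.value = sample
        else:
            self.value = (1.0 - self.alpha) * self.value + self.alpha * sample
        return self.value


class ConvergenceTracker:
    """Tracks the minGuest%CPU estimate and the two convergence measures."""

    def __init__(self, alpha: float = 0.01):
        self.min_guest_cpu = Iir(alpha)
        self.avg_migrated = Iir(alpha)
        self.avg_dirtied = Iir(alpha)
        self.total_migrated = 0
        self.total_dirtied = 0

    def sample(self, migrated_pages: int, dirtied_pages: int, guest_cpu: float) -> None:
        """Feed one interval; guest_cpu is the CPU fraction the guest received."""
        self.total_migrated += migrated_pages
        self.total_dirtied += dirtied_pages
        self.avg_migrated.update(migrated_pages)
        self.avg_dirtied.update(dirtied_pages)
        if dirtied_pages > 0:
            # Running average of migrationRate / dirtyPagesRate * guest%CPU.
            # Intervals with no dirtied pages are skipped (an assumption).
            self.min_guest_cpu.update(migrated_pages / dirtied_pages * guest_cpu)

    def is_converging(self) -> bool:
        """True only if both measures hold: recent average and running total."""
        if self.avg_migrated.value is None:
            return False
        recent_ok = self.avg_migrated.value > self.avg_dirtied.value
        total_ok = self.total_migrated > self.total_dirtied
        return recent_ok and total_ok
```

Note that the minGuest%CPU estimate produced this way can exceed 1.0 when the migration easily outpaces the guest, so a caller would likely want to clamp it before using it as a CPU share.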

And yes, many details are still missing.

Ronen.
