Well the issue is not new,
anyhow, following a conversation with Orit ...
Since we want the migration to finish, I believe that the
"migration speed" parameter alone cannot do the job.
I suggest using two distinct parameters:
1. Migration speed - limits the network resources that the
migration may use
2. aggressionLevel - A number between 0.0 and 1.0, where low
values imply minimal interruption to the guest, and 1.0 means that
the guest will be completely stalled.
In any case the migration will have to do its work and finish
given any actual migration-speed, so even low aggressionLevel
values will sometimes imply that the guest will be throttled
substantially.
The algorithm:
The aggressionLevel should determine the targetGuest%CPU
(how much CPU time we want to allocate to the guest)
With aggressionLevel = 1.0, the guest gets no CPU-resources
(stalled).
With aggressionLevel = 0.0, the guest gets minGuest%CPU, such that
migrationRate == dirtyPagesRate. This minGuest%CPU is continuously
updated based on the running average of the recent samples (more
below).
Note that the targetGuest%CPU allocation is continuously updated
due to changes in guest behavior, network congestion, and the like.
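
To make the mapping concrete, here is a rough Python sketch. Only the
two endpoints come from the description above (0.0 gives minGuest%CPU,
1.0 gives no CPU at all); the linear interpolation in between is an
assumption of mine, and the function name target_guest_cpu is made up
for illustration.

```python
def target_guest_cpu(aggression_level: float, min_guest_cpu: float) -> float:
    """Return the guest CPU share (in percent) to aim for.

    The endpoints follow the proposal. The linear interpolation between
    them is an assumption made for this sketch, not part of the proposal.
    """
    if not 0.0 <= aggression_level <= 1.0:
        raise ValueError("aggression_level must be within [0.0, 1.0]")
    # Clamp the estimate so the result is always a valid percentage.
    min_guest_cpu = max(0.0, min(100.0, min_guest_cpu))
    # 0.0 -> min_guest_cpu, 1.0 -> 0.0 (guest stalled).
    return (1.0 - aggression_level) * min_guest_cpu
```

The function would have to be re-evaluated every time the minGuest%CPU
estimate changes, since that estimate follows the guest and the network.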
Some more details:
- minGuest%CPU (i.e., for dirtyPagesRate == migrationRate) is easy
to calculate as a running average of
(migrationRate / dirtyPagesRate * guest%CPU)
- There are several methods to calculate the running average, my
favorite is IIR, where, roughly speaking,
newVal = 0.99 * oldVal + 0.01 * newSample
- I would use two measures to ensure that there are more migrated
pages than "dirty" pages.
1. The running average (based on recent samples) of the migrated
pages is larger than that of the new dirty pages
2. The total number of migrated pages so far is larger than the
total number of new dirty pages.
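
The bookkeeping in the points above could look roughly like the Python
sketch below. It is an illustration under assumptions, not a worked-out
design: the names Iir and MigrationTracker are hypothetical, the filter
is seeded with its first sample, page counts per equal-length sampling
interval stand in for the rates, and intervals with no dirtied pages
are skipped for the minGuest%CPU estimate.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Iir:
    """First-order IIR running average: new = (1 - alpha) * old + alpha * sample."""
    alpha: float = 0.01            # weight of the new sample, as in the text
    value: Optional[float] = None  # None until the first sample arrives

    def update(self, sample: float) -> float:
        if self.value is None:
            # Assumption: seed with the first sample instead of zero.
            self.value = sample
        else:
            self.value = (1.0 - self.alpha) * self.value + self.alpha * sample
        return self.value


class MigrationTracker:
    """Tracks the minGuest%CPU estimate and the two 'are we ahead' measures."""

    def __init__(self, alpha: float = 0.01) -> None:
        self.min_guest_cpu = Iir(alpha)
        self.migrated_avg = Iir(alpha)
        self.dirtied_avg = Iir(alpha)
        self.migrated_total = 0
        self.dirtied_total = 0

    def sample(self, migrated_pages: int, dirtied_pages: int, guest_cpu: float) -> None:
        """Feed one interval: pages sent, pages newly dirtied, guest %CPU used."""
        self.migrated_total += migrated_pages
        self.dirtied_total += dirtied_pages
        self.migrated_avg.update(migrated_pages)
        self.dirtied_avg.update(dirtied_pages)
        if dirtied_pages > 0:
            # migrationRate / dirtyPagesRate * guest%CPU; over equal intervals
            # the ratio of page counts equals the ratio of rates.
            self.min_guest_cpu.update(migrated_pages / dirtied_pages * guest_cpu)

    def is_ahead(self) -> bool:
        """True only if both measures say migration outpaces dirtying."""
        if self.migrated_avg.value is None or self.dirtied_avg.value is None:
            return False  # no samples yet
        recent_ok = self.migrated_avg.value > self.dirtied_avg.value
        total_ok = self.migrated_total > self.dirtied_total
        return recent_ok and total_ok
```

With alpha = 0.01 each new sample moves the average by only 1%, so the
estimate reacts slowly; a larger alpha would follow bursts faster at
the cost of more noise.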
And yes, many details are still missing.
Ronen.