Date: Thu, 12 Mar 2015 10:32:23 +0000
From: "Dr. David Alan Gilbert"
Message-ID: <20150312103218.GD2330@work-vm>
In-Reply-To: <54F4D076.3040402@linux.vnet.ibm.com>
Subject: Re: [Qemu-devel] Migration auto-converge problem
To: "Jason J. Herne"
Cc: Christian Borntraeger, qemu-devel@nongnu.org

* Jason J. Herne (jjherne@linux.vnet.ibm.com) wrote:
> We have a test case that dirties memory very quickly. When we run this
> test case in a guest and attempt a migration, that migration never
> converges, even when done with auto-converge on.
>
> The auto-converge behaviour of QEMU serves a different purpose than I
> had expected. In my mind, I expected auto-converge to continuously apply
> adaptive throttling of the CPU utilization of a busy guest if QEMU
> detects that progress is not being made quickly enough in the guest
> memory transfer.

Yes; it's not as 'auto' as you might hope for an 'auto converge'.

> The idea is that a guest dirtying pages too quickly will be adaptively
> slowed down by the throttling until migration is able to transfer pages
> fast enough to complete the migration within the max downtime.
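[Editor's note: the convergence condition being discussed can be sketched numerically. This is an illustrative model, not QEMU code; the function name and all rates below are made-up example values.]

```python
def passes_to_converge(initial_mb, dirty_rate, transfer_rate, max_downtime):
    """Simulate pre-copy migration passes.

    Each pass sends the currently dirty set; pages dirtied meanwhile form
    the next pass.  Rates are in MB/s, max_downtime in seconds.  Returns
    the number of passes needed, or None if the dirty set never shrinks
    enough (i.e. the guest dirties pages faster than we can send them).
    """
    dirty_mb = initial_mb
    for n in range(1, 1000):
        pass_time = dirty_mb / transfer_rate
        if pass_time <= max_downtime:
            return n               # final stop-the-guest pass fits in downtime
        next_dirty = dirty_rate * pass_time
        if next_dirty >= dirty_mb:
            return None            # dirty/transfer ratio >= 1: never catch up
        dirty_mb = next_dirty
    return None
```

With a dirty rate safely below the transfer rate the dirty set shrinks geometrically each pass; once the dirty rate reaches the transfer rate, no number of passes helps, which is exactly the case throttling is meant to rescue.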
> QEMU's current auto-converge does not appear to do this in practice.
>
> A quick look at the source code shows the following:
> - Auto-converge keeps a counter. This counter is only incremented if,
>   for a completed memory pass, the guest is dirtying pages at a rate of
>   50% (or more) of our transfer rate.
> - The counter increments at most once per pass through memory.
> - The counter must reach 4 before any throttling is done (so a minimum
>   of four memory passes have to occur).
> - Once the counter reaches 4, it is immediately reset to 0, and then the
>   throttling action is taken.
> - Throttling occurs by doing an async sleep on each guest CPU for 30ms,
>   exactly one time.
>
> Now consider the scenario auto-converge is meant to solve (I think): a
> guest touching lots of memory very quickly. Each pass through memory is
> going to send a lot of pages, and thus take a decent amount of time to
> complete. If, for every four passes, we are *only* sleeping the guest
> for 30ms, the guest is still going to be able to dirty pages faster than
> we can transfer them. We will never catch up, because the sleep time is
> very small relative to guest execution time.

And the problem is that magic numbers vary in usefulness wildly depending
on CPU speed, memory bandwidth, link bandwidth/type, load type and the
phase of the moon.

> Auto-converge, as it is implemented today, does not address the problem
> I expected it to solve. However, after rapidly prototyping a new version
> of auto-converge that performs adaptive modeling, I've learned
> something: the workload I'm attempting to migrate is actually a
> pathological case. It is an excellent example of why throttling CPU is
> not always a good method of limiting memory access. In this test case we
> are able to touch over 600 MB of pages in 50 ms of continuous execution.
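[Editor's note: the counter heuristic summarised in the list above can be sketched as follows. This is a paraphrase of the behaviour described in the mail, not the actual QEMU source; names and structure are illustrative.]

```python
DIRTY_RATE_THRESHOLD = 0.50   # dirty rate >= 50% of transfer rate
NEEDED_PASSES = 4             # counter must reach 4 before acting
THROTTLE_SLEEP_MS = 30        # each vCPU async-sleeps once, for 30 ms

class AutoConverge:
    """Sketch of the 2015-era auto-converge heuristic described above."""

    def __init__(self):
        self.counter = 0

    def end_of_pass(self, dirty_rate, transfer_rate):
        """Called at most once per complete pass over guest memory.

        Returns the ms each vCPU should sleep (0 = no throttling).
        """
        if dirty_rate >= DIRTY_RATE_THRESHOLD * transfer_rate:
            self.counter += 1
        if self.counter >= NEEDED_PASSES:
            self.counter = 0            # reset immediately...
            return THROTTLE_SLEEP_MS    # ...then throttle exactly once
        return 0
```

Note how rarely the throttle fires: four full memory passes buy the guest a single 30 ms sleep per vCPU, which is the core of the complaint.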
> In this case, even if I throttle the guest to 5% (50 ms runtime, 950 ms
> sleep) we still cannot come close to catching up, even with a fairly
> speedy network link (which not every user will have).

Right; the worst case is very bad; touching one line in each page can be
done with very little CPU. However, you do hope that most real-world apps
aren't quite that bad, and if one is that bad then turning XBZRLE on
should help.

> Given the above, I believe that some workloads touch memory too fast and
> we'll never be able to live migrate them with auto-converge. At the
> lower end there are workloads with a very small/stagnant working set,
> which will be live migratable without the need for auto-converge.
> Lastly, we have "the nebulous middle": workloads that would benefit from
> auto-converge because they touch pages too fast for migration to be able
> to deal with them, AND (important conditional here) throttling
> will(may?) actually reduce their rate of page modifications. I would
> like to try to define this "middle" set of workloads.
>
> A question with no obvious answer: how much throttling is acceptable? If
> I have to throttle a guest 90% and it ends up failing 75% of whatever
> transactions it is attempting to process, then we have quite likely
> defeated the entire purpose of "live" migration. Perhaps it would be
> better in this case to just stop the guest and do a non-live migration.
> Maybe by reverting to non-live we actually save time, and thus more
> transactions would have completed. This may take some experimenting to
> get a good idea of what makes the most sense. Maybe even make the
> maximum throttling user configurable.

But then you could just increase the 'max-downtime' setting, which
produces similar behaviour. Of course the reality is that users want no
appreciable throttling and instant migration - something has to give.
Postcopy is one answer, but it still gives a performance degradation.
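[Editor's note: the arithmetic behind the pathological case above is worth making explicit. The 600 MB / 50 ms figures come from the mail; the link speeds are assumed example values, and whether a faster link actually helps depends on how much of the touched memory is re-dirtied between passes.]

```python
def throttled_dirty_rate(mb_per_burst, burst_ms, duty_cycle):
    """MB dirtied per wall-clock second at a given CPU duty cycle.

    mb_per_burst MB are touched in burst_ms of continuous execution;
    duty_cycle is the fraction of wall-clock time the guest runs.
    """
    mb_per_run_second = mb_per_burst * (1000.0 / burst_ms)
    return mb_per_run_second * duty_cycle

# The guest touches 600 MB in 50 ms of runtime.  Throttled to 5%
# (50 ms run, 950 ms sleep), that is still 600 MB dirtied per
# wall-clock second:
rate = throttled_dirty_rate(600, 50, 0.05)

# Assumed example links, for comparison:
gigabit_link = 1000 / 8.0      # ~125 MB/s on a 1 Gbps link
ten_gig_link = 10000 / 8.0     # ~1250 MB/s on a 10 Gbps link
```

So even at a 5% duty cycle this guest out-dirties a 1 Gbps link several times over, which is why no amount of the 30 ms-style throttling can rescue it.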
> With all this said, I still wonder exactly how big this "nebulous
> middle" really is. If, in practice, that "middle" only accounts for 1%
> of the workloads out there, then is it really worth spending time fixing
> it? Keep in mind this is a two-pronged test:
> 1. The guest cannot migrate because it changes memory too fast.
> 2. CPU throttling slows the guest's memory writes down enough that it
>    can now migrate.
>
> I'm interested in any thoughts anyone has. Thanks!

It doesn't really matter what percentage of the workloads it is - as long
as real-world workloads hit it (rather than just evil worst cases), 1%
still means that a lot of people are going to hit it; and you can never
usefully approximate just what that percentage is. I think it's certainly
worth improving auto-converge; but as you say, it's very difficult to
know just how many people it helps.

Dave

> --
> -- Jason J. Herne (jjherne@linux.vnet.ibm.com)

--
Dr. David Alan Gilbert / dgilbert@redhat.com / Manchester, UK