Message-ID: <4A12F744.3080809@codemonkey.ws>
Date: Tue, 19 May 2009 13:15:32 -0500
From: Anthony Liguori <anthony@codemonkey.ws>
MIME-Version: 1.0
Subject: Re: [Qemu-devel] [PATCH] ram_save_live: add a no-progress convergence
	rule
References: <1242731347-1558-1-git-send-email-uril@redhat.com>	<4A12AD80.701@codemonkey.ws>
	<20090519144101.GA16372@shell.devel.redhat.com>
	<4A12C942.4090708@redhat.com>
In-Reply-To: <4A12C942.4090708@redhat.com>
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
To: dlaor@redhat.com
Cc: Glauber Costa <glommer@redhat.com>, Uri Lublin <uril@redhat.com>, qemu-devel@nongnu.org

Dor Laor wrote: 
> The problem is that if migration is not progressing because the guest
> is dirtying pages faster than the migration protocol can send them,
> then we just waste time and CPU.
> At a minimum, we should notify the monitor interface so that the
> management daemon can trap it.
> We can easily see this issue while running iperf in the guest, or in
> any other high-load/dirty-pages scenario.

The problem is, what's the metric for determining that the guest isn't 
progressing?  A raw iteration count is not a valid metric.  It may be 
expected that the migration takes 50 iterations.
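To illustrate why a raw iteration count says little, here is a minimal sketch under a deliberately simplified model (an assumption, not QEMU's actual accounting): the guest dirties pages at a constant rate, each pass resends the pages dirtied during the previous pass, and no page is dirtied twice within one pass. The function name and its parameters are hypothetical.

```c
#include <assert.h>

/* Simplified pre-copy model (assumption): a pass over R pages takes
 * R / bandwidth seconds, during which the guest dirties
 * dirty_rate * R / bandwidth new pages, so the dirty set shrinks by the
 * factor dirty_rate / bandwidth on every pass.
 * Returns the number of passes needed to get at or below target_pages,
 * or -1 if that does not happen within max_iters passes. */
int iterations_to_converge(double initial_pages, double target_pages,
                           double dirty_rate, double bandwidth, int max_iters)
{
    assert(bandwidth > 0.0);
    double ratio = dirty_rate / bandwidth; /* fraction re-dirtied per pass */
    double remaining = initial_pages;
    int iters = 0;

    while (remaining > target_pages) {
        if (iters == max_iters)
            return -1;        /* not converging within the allowed passes */
        remaining *= ratio;   /* pages dirtied while this pass was sent */
        iters++;
    }
    return iters;
}
```

For example, 1 GiB of 4 KiB pages (262144 pages) with the guest dirtying memory at 90% of the link rate needs 66 passes to drop to 256 pages or fewer, so a fixed cap of 50 would abort a migration that is still converging. At 100% of the link rate the dirty set never shrinks and the function returns -1.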

The management tool knows the guest isn't progressing when it decides 
that a guest isn't progressing :-)

> We can also make it configurable using the monitor migrate command. 
> For example:
> migrate -d -no_progress -threshold=x tcp:....

Threshold is really a bad metric to use.  You have no idea how much data 
has been transferred in each iteration.  If you only needed one more 
iteration, then stopping the migration short was a really bad idea.
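A metric that accounts for how much data is left, rather than how many passes have run, could look like the sketch below. It assumes the migration code can measure the remaining dirty bytes and the achieved bandwidth; both helper names and the max_downtime_ms knob are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper: expected guest pause in milliseconds if the guest
 * were stopped now and everything still dirty were sent at the measured
 * bandwidth. Rounds up so the estimate is never optimistic. */
uint64_t expected_downtime_ms(uint64_t remaining_bytes, uint64_t bytes_per_sec)
{
    assert(bytes_per_sec > 0);
    return (remaining_bytes * 1000 + bytes_per_sec - 1) / bytes_per_sec;
}

/* Enter the final stop-and-copy phase only when the remaining data fits
 * in the allowed pause, however many iterations that takes. */
bool ready_for_stop_and_copy(uint64_t remaining_bytes, uint64_t bytes_per_sec,
                             uint64_t max_downtime_ms)
{
    return expected_downtime_ms(remaining_bytes, bytes_per_sec) <= max_downtime_ms;
}
```

With about 100 MiB/s of bandwidth, 1 MiB left gives an expected pause of about 10 ms and the migration can complete, while 1 GiB left would mean a pause of about 10 s, so the loop keeps iterating instead of stopping short.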

The only thing that this does is give a false sense of security.  
Management tools have to deal with forcing migration convergence based 
on policies.  If a management tool isn't doing this today, it's broken IMHO.
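As an illustration of the kind of policy a management tool could apply, here is a hypothetical sketch: on every status poll the tool feeds in the elapsed time and the remaining byte count, and the policy asks to force convergence once a deadline passes or the remaining data has stopped shrinking for several polls. The type, field, and function names are invented for this example, and what forcing convergence means (for example pausing the guest so it stops dirtying memory, or cancelling the migration) is left to the tool.

```c
#include <assert.h>
#include <stdint.h>

typedef enum { MIG_CONTINUE, MIG_FORCE_CONVERGENCE } mig_action;

typedef struct {
    unsigned deadline_s;      /* max seconds the live phase may run */
    unsigned max_stall_polls; /* polls allowed without a new minimum */
    uint64_t best_remaining;  /* smallest remaining byte count seen */
    unsigned stall_polls;     /* consecutive polls without improvement */
} mig_policy;

void mig_policy_init(mig_policy *p, unsigned deadline_s,
                     unsigned max_stall_polls)
{
    assert(max_stall_polls > 0);
    p->deadline_s = deadline_s;
    p->max_stall_polls = max_stall_polls;
    p->best_remaining = UINT64_MAX;
    p->stall_polls = 0;
}

/* Called by the management tool on each status poll. */
mig_action mig_policy_poll(mig_policy *p, unsigned elapsed_s,
                           uint64_t remaining_bytes)
{
    if (remaining_bytes < p->best_remaining) {
        p->best_remaining = remaining_bytes; /* still making progress */
        p->stall_polls = 0;
    } else {
        p->stall_polls++;
    }
    if (elapsed_s >= p->deadline_s || p->stall_polls >= p->max_stall_polls)
        return MIG_FORCE_CONVERGENCE;
    return MIG_CONTINUE;
}
```

Keeping this logic in the management tool means the deadline and stall limits can differ per guest or per site without changing the migration code itself.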

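Basically, a threshold introduces a regression.  If you run iperf and 
migrate a guest with a very large memory size, the threshold forces the 
final phase while a lot of memory is still dirty.  The guest stays paused 
while all of it is sent, so after migration you'll get soft lockups 
because the guest hasn't been running for 10 seconds.  This is bad.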
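The length of that pause can be estimated with back-of-the-envelope arithmetic, assuming the final stop-and-copy phase sends every remaining dirty byte D at a sustained bandwidth B while the guest is paused. The 1 GiB of remaining dirty memory and the fully used 1 Gbit/s link below are illustrative assumptions, not measurements:

```latex
t_{\text{pause}} \approx \frac{D}{B}
  = \frac{2^{30}\ \text{bytes}}{1.25 \times 10^{8}\ \text{bytes/s}}
  \approx 8.6\ \text{s}
```

That is on the order of the 10-second pause described above, and it grows linearly with the amount of memory still dirty when the iteration is cut short.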

Regards,

Anthony Liguori