From mboxrd@z Thu Jan  1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:43049)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from <nick@bytemark.co.uk>) id 1Whyix-0002u4-Rc
	for qemu-devel@nongnu.org; Wed, 07 May 2014 06:05:21 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from <nick@bytemark.co.uk>) id 1Whyir-00064C-QM
	for qemu-devel@nongnu.org; Wed, 07 May 2014 06:05:15 -0400
Received: from egg.sh.bytemark.co.uk ([212.110.161.171]:34080)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from <nick@bytemark.co.uk>) id 1Whyir-0005rO-F2
	for qemu-devel@nongnu.org; Wed, 07 May 2014 06:05:09 -0400
Message-ID: <536A0547.1080706@bytemark.co.uk>
Date: Wed, 07 May 2014 11:04:55 +0100
From: Nick Thomas <nick@bytemark.co.uk>
MIME-Version: 1.0
References: <1399297882-3444-1-git-send-email-agraf@suse.de>
	<20140505232343.GA20638@amt.cnet> <53688C56.6020109@suse.de>
In-Reply-To: <53688C56.6020109@suse.de>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH] kvmclock: Ensure time in migration never
	goes backward
List-Id: <qemu-devel.nongnu.org>
List-Unsubscribe: <https://lists.nongnu.org/mailman/options/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=unsubscribe>
List-Archive: <http://lists.nongnu.org/archive/html/qemu-devel>
List-Post: <mailto:qemu-devel@nongnu.org>
List-Help: <mailto:qemu-devel-request@nongnu.org?subject=help>
List-Subscribe: <https://lists.nongnu.org/mailman/listinfo/qemu-devel>,
	<mailto:qemu-devel-request@nongnu.org?subject=subscribe>
To: Alexander Graf <agraf@suse.de>, Marcelo Tosatti <mtosatti@redhat.com>
Cc: qemu-devel@nongnu.org, kvm@vger.kernel.org

Hi all,

On 06/05/14 08:16, Alexander Graf wrote:
> 
> On 06.05.14 01:23, Marcelo Tosatti wrote:
>
>> 1) By what algorithm you retrieve
>> and compare time in kvmclock guest structure and KVM_GET_CLOCK.
>> What are the results of the comparison.
>> And whether any backwards time was visible in the guest.
> 
> I've managed to get my hands on a broken migration stream from Nick.
> There I looked at the curr_clocksource structure and saw that the last
> seen time on the kvmclock clock source was greater than the value that
> the kvmclock device migrated.

We've been seeing live migration failures where the guest sees time go
backwards (which the kernel apparently treats as a massive forward leap)
for a while now. They affect perhaps 5-10% of our migrations, usually a
large proportion of the migrations on a few hosts rather than an even
spread. We first hit them in December, when we tried an upgrade from
QEMU 1.5.0 and Debian's 3.2 kernel to QEMU 1.7.1 and a 3.mumble (3.10?)
kernel.
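
To illustrate why time going backwards can look like a forward leap,
here is a minimal sketch. It assumes elapsed time is computed as an
unsigned 64-bit difference, which is my assumption rather than something
I've verified in the guest's clocksource code. Both helper names are
hypothetical, and the clamp only shows the general idea of never handing
the guest an older clock value; it is not the code from the patch.

```c
#include <stdint.h>

/*
 * Hypothetical helper: elapsed nanoseconds as an unsigned difference.
 * If now_ns is smaller than last_ns, the subtraction wraps modulo 2^64
 * and the result is an enormous positive value instead of a negative one.
 */
static inline uint64_t elapsed_ns(uint64_t now_ns, uint64_t last_ns)
{
    return now_ns - last_ns;
}

/*
 * Hypothetical clamp: return the later of the migrated clock value and
 * the last value the guest has already seen, so the result never moves
 * backwards relative to last_seen_ns.
 */
static inline uint64_t clamp_monotonic(uint64_t migrated_ns,
                                       uint64_t last_seen_ns)
{
    return migrated_ns < last_seen_ns ? last_seen_ns : migrated_ns;
}
```

With these, a clock that steps back by 100ns yields an elapsed value just
short of 2^64 ns, while the clamped value yields an elapsed time of zero.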

My testing at the time seemed to indicate that either upgrade on its
own, QEMU or kernel, was enough to make the problem show up. The guest
symptom is that the kernel enters a tight loop in __run_timers and stays
there. In the end, I gave up and downgraded us again without any clear
idea of what was happening, or why.

In April, we finally put together a fairly reliable test case. This
patch resolves the guest hangs in that test, and I've also been able to
run more than 1000 migrations of production guests without seeing the
issue recur. So,

Tested-by: Nick Thomas <nick@bytemark.co.uk>

/Nick