From mboxrd@z Thu Jan 1 00:00:00 1970
Received: from eggs.gnu.org ([2001:4830:134:3::10]:43049)
	by lists.gnu.org with esmtp (Exim 4.71)
	(envelope-from )
	id 1Whyix-0002u4-Rc
	for qemu-devel@nongnu.org; Wed, 07 May 2014 06:05:21 -0400
Received: from Debian-exim by eggs.gnu.org with spam-scanned (Exim 4.71)
	(envelope-from )
	id 1Whyir-00064C-QM
	for qemu-devel@nongnu.org; Wed, 07 May 2014 06:05:15 -0400
Received: from egg.sh.bytemark.co.uk ([212.110.161.171]:34080)
	by eggs.gnu.org with esmtp (Exim 4.71)
	(envelope-from )
	id 1Whyir-0005rO-F2
	for qemu-devel@nongnu.org; Wed, 07 May 2014 06:05:09 -0400
Message-ID: <536A0547.1080706@bytemark.co.uk>
Date: Wed, 07 May 2014 11:04:55 +0100
From: Nick Thomas
MIME-Version: 1.0
References: <1399297882-3444-1-git-send-email-agraf@suse.de>
	<20140505232343.GA20638@amt.cnet>
	<53688C56.6020109@suse.de>
In-Reply-To: <53688C56.6020109@suse.de>
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: 7bit
Subject: Re: [Qemu-devel] [PATCH] kvmclock: Ensure time in migration never goes backward
List-Id:
List-Unsubscribe: ,
List-Archive:
List-Post:
List-Help:
List-Subscribe: ,
To: Alexander Graf , Marcelo Tosatti
Cc: qemu-devel@nongnu.org, kvm@vger.kernel.org

Hi all,

On 06/05/14 08:16, Alexander Graf wrote:
>
> On 06.05.14 01:23, Marcelo Tosatti wrote:
>
>> 1) By what algorithm you retrieve
>> and compare time in kvmclock guest structure and KVM_GET_CLOCK.
>> What are the results of the comparison.
>> And whether and backwards time was visible in the guest.
>
> I've managed to get my hands on a broken migration stream from Nick.
> There I looked at the curr_clocksource structure and saw that the last
> seen time on the kvmclock clock source was greater than the value that
> the kvmclock device migrated.

We've been seeing live migration failures where the guest sees time go
backwards (= massive forward leap to the kernel, apparently) for a while
now, affecting perhaps 5-10% of migrations we'd do (usually a large
proportion of the migrations on a few hosts, rather than an even
spread); initially in December, when we tried an upgrade to QEMU 1.7.1
and a 3.mumble (3.10?) kernel, from 1.5.0 and Debian's 3.2.

My testing at the time seemed to indicate that either upgrade - qemu or
kernel - caused the problems to show up. Guest symptoms are that the
kernel enters a tight loop in __run_timers and stays there. In the end,
I gave up and downgraded us again without any clear idea of what was
happening, or why.

In April, we finally got together a fairly reliable test case. This
patch resolves the guest hangs in that test, and I've also been able to
conduct > 1000 migrations of production guests without seeing the issue
recur. So,

Tested-by: Nick Thomas

/Nick