From mboxrd@z Thu Jan 1 00:00:00 1970 From: Zachary Amsden Subject: Re: Bug in KVM clock backwards compensation Date: Thu, 28 Apr 2011 20:00:57 -0700 Message-ID: <4DBA29E9.9000905@redhat.com> References: <4DB9106D.6040203@redhat.com> <20110428071316.GG20365@amd.com> <4DB9B344.9010301@redhat.com> <20110428202002.GF13402@8bytes.org> Mime-Version: 1.0 Content-Type: text/plain; charset=ISO-8859-1; format=flowed Content-Transfer-Encoding: 7bit Cc: "Roedel, Joerg" , kvm , Jan Kiszka To: Joerg Roedel Return-path: Received: from mx1.redhat.com ([209.132.183.28]:29047 "EHLO mx1.redhat.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1758486Ab1D2DBE (ORCPT ); Thu, 28 Apr 2011 23:01:04 -0400 In-Reply-To: <20110428202002.GF13402@8bytes.org> Sender: kvm-owner@vger.kernel.org List-ID: On 04/28/2011 01:20 PM, Joerg Roedel wrote: > On Thu, Apr 28, 2011 at 11:34:44AM -0700, Zachary Amsden wrote: > >> On 04/28/2011 12:13 AM, Roedel, Joerg wrote: >> > >>> I see it different. This code wants to check if the _guest_ tsc moves >>> forwared (or at least not backwards). So it is fully legitimate to just >>> do this by reading the guest-tsc and compare it to the last one the >>> guest had. >>> >> That wasn't the intention when I wrote that code. It's simply there to >> detect backwards motion of the host TSC. The guest TSC can legally go >> backwards whenever the guest decides to change it, so checking the guest >> TSC doesn't make sense here. >> > This code checks how many guest tsc cycles have passed since this vCPU > was de-scheduled last time (and before it is running again). So since > the vCPU hasn't run in the meantime it had no chance to change its TSC. > Further, the other parameters like the TSC offset and the scaling > multiplier havn't changed too, so the only variable in the guest-tsc > calculation is the host-tsc. > So this calculation using the guest-tsc can detect backwards going > host-tsc as good as the old one. The benefit here is that we can feed > consistent values into adjust_tsc_offset(). > While true, this is more complex than the original code. The original code here doesn't try to actually compensate for the guest TSC difference - instead what it does is NULL any discovered host TSC delta: if (tsc_delta < 0) mark_tsc_unstable("KVM discovered backwards TSC"); if (check_tsc_unstable()) { kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta); vcpu->arch.tsc_catchup = 1; } kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu); Erasing that delta also erases elapsed time since the VCPU has last been run, which isn't desirable, so it then sets tsc_catchup mode, which will restore the proper TSC. The request here triggers code which later updates the TSC offset again. To avoid complexity, I think it's simplest to do the first computation in terms of the host TSC. >> Yes, with tsc-scaling, the machines already have stable TSCs - the above >> test is for older hardware which could have problems, and can be >> reverted back to the original code without worrying about switching >> units. >> > This is the case pratically. But architecturally the tsc-scaling feature > does not depend on a constant tsc, so we can make no such assumtions. > Additionally, it may happen that Linux mis-detects an unstable tsc for > some reason (broken BIOS, bug in the code, ...). Therefore I think it > is dangerous to assume that this code will never run on tsc-scaling > capable hosts. And if it does and we don't manage the tsc-offset units > right, we may see very weird behavior. > I agree, it is best to handle this case - hardware can and will change - but the TSC adjustment in terms of guest rate should be done under the atomic protection right before entering hardware virtualized mode - here: I left compute_guest_tsc in place to recompute time in guest units here, even if the underlying hardware rate changes. /* * We may have to catch up the TSC to match elapsed wall clock * time for two reasons, even if kvmclock is used. * 1) CPU could have been running below the maximum TSC rate * 2) Broken TSC compensation resets the base at each VCPU * entry to avoid unknown leaps of TSC even when running * again on the same CPU. This may cause apparent elapsed * time to disappear, and the guest to stand still or run * very slowly. */ if (vcpu->tsc_catchup) { u64 tsc = compute_guest_tsc(v, kernel_ns); if (tsc > tsc_timestamp) { kvm_x86_ops->adjust_tsc_offset(v, tsc - tsc_timestamp); tsc_timestamp = tsc; } } So yeah, the code is getting pretty complex but we'd like to avoid that as much as possible - so I would prefer to have the hardware backwards compensation separate from the guest rate computation by doing this: step 1) remove any backwards hardware TSC delta (in hardware units) step 2) recompute guest TSC from a stable clock (gotten from kernel_ns) and apply adjustment (in guest units) So it appears you can just share most of the logic of guest TSC catchup mode. What do you think? Zach