From mboxrd@z Thu Jan  1 00:00:00 1970
From: Zachary Amsden <zamsden@redhat.com>
Subject: Re: Bug in KVM clock backwards compensation
Date: Thu, 28 Apr 2011 20:00:57 -0700
Message-ID: <4DBA29E9.9000905@redhat.com>
References: <4DB9106D.6040203@redhat.com> <20110428071316.GG20365@amd.com> <4DB9B344.9010301@redhat.com> <20110428202002.GF13402@8bytes.org>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1; format=flowed
Content-Transfer-Encoding: 7bit
Cc: "Roedel, Joerg" <Joerg.Roedel@amd.com>, kvm <kvm@vger.kernel.org>,
	Jan Kiszka <jan.kiszka@web.de>
To: Joerg Roedel <joro@8bytes.org>
Return-path: <kvm-owner@vger.kernel.org>
Received: from mx1.redhat.com ([209.132.183.28]:29047 "EHLO mx1.redhat.com"
	rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP
	id S1758486Ab1D2DBE (ORCPT <rfc822;kvm@vger.kernel.org>);
	Thu, 28 Apr 2011 23:01:04 -0400
In-Reply-To: <20110428202002.GF13402@8bytes.org>
Sender: kvm-owner@vger.kernel.org
List-ID: <kvm.vger.kernel.org>

On 04/28/2011 01:20 PM, Joerg Roedel wrote:
> On Thu, Apr 28, 2011 at 11:34:44AM -0700, Zachary Amsden wrote:
>    
>> On 04/28/2011 12:13 AM, Roedel, Joerg wrote:
>>      
>    
>>> I see it different. This code wants to check if the _guest_ tsc moves
>>> forwared (or at least not backwards). So it is fully legitimate to just
>>> do this by reading the guest-tsc and compare it to the last one the
>>> guest had.
>>>        
>> That wasn't the intention when I wrote that code.  It's simply there to
>> detect backwards motion of the host TSC.  The guest TSC can legally go
>> backwards whenever the guest decides to change it, so checking the guest
>> TSC doesn't make sense here.
>>      
> This code checks how many guest tsc cycles have passed since this vCPU
> was de-scheduled last time (and before it is running again). So since
> the vCPU hasn't run in the meantime it had no chance to change its TSC.
> Further, the other parameters like the TSC offset and the scaling
> multiplier havn't changed too, so the only variable in the guest-tsc
> calculation is the host-tsc.
> So this calculation using the guest-tsc can detect backwards going
> host-tsc as good as the old one. The benefit here is that we can feed
> consistent values into adjust_tsc_offset().
>    

While true, this is more complex than the original code.  The original 
code here doesn't try to actually compensate for the guest TSC 
difference - instead what it does is NULL any discovered host TSC delta:

         if (tsc_delta < 0)
             mark_tsc_unstable("KVM discovered backwards TSC");
         if (check_tsc_unstable()) {
             kvm_x86_ops->adjust_tsc_offset(vcpu, -tsc_delta);
             vcpu->arch.tsc_catchup = 1;
         }
         kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);

Erasing that delta also erases elapsed time since the VCPU has last been 
run, which isn't desirable, so it then sets tsc_catchup mode, which will 
restore the proper TSC.  The request here triggers code which later 
updates the TSC offset again.

To avoid complexity, I think it's simplest to do the first computation 
in terms of the host TSC.

>> Yes, with tsc-scaling, the machines already have stable TSCs - the above
>> test is for older hardware which could have problems, and can be
>> reverted back to the original code without worrying about switching
>> units.
>>      
> This is the case pratically. But architecturally the tsc-scaling feature
> does not depend on a constant tsc, so we can make no such assumtions.
> Additionally, it may happen that Linux mis-detects an unstable tsc for
> some reason (broken BIOS, bug in the code, ...).  Therefore I think it
> is dangerous to assume that this code will never run on tsc-scaling
> capable hosts. And if it does and we don't manage the tsc-offset units
> right, we may see very weird behavior.
>    

I agree, it is best to handle this case - hardware can and will change - 
but the TSC adjustment in terms of guest rate should be done under the 
atomic protection right before entering hardware virtualized mode - here:

I left compute_guest_tsc in place to recompute time in guest units here, 
even if the underlying hardware rate changes.

         /*
          * We may have to catch up the TSC to match elapsed wall clock
          * time for two reasons, even if kvmclock is used.
          *   1) CPU could have been running below the maximum TSC rate
          *   2) Broken TSC compensation resets the base at each VCPU
          *      entry to avoid unknown leaps of TSC even when running
          *      again on the same CPU.  This may cause apparent elapsed
          *      time to disappear, and the guest to stand still or run
          *      very slowly.
          */
         if (vcpu->tsc_catchup) {
                 u64 tsc = compute_guest_tsc(v, kernel_ns);
                 if (tsc > tsc_timestamp) {
                         kvm_x86_ops->adjust_tsc_offset(v, tsc - 
tsc_timestamp);
                         tsc_timestamp = tsc;
                 }
         }

So yeah, the code is getting pretty complex but we'd like to avoid that 
as much as possible - so I would prefer to have the hardware backwards 
compensation separate from the guest rate computation by doing this:

step 1) remove any backwards hardware TSC delta (in hardware units)
step 2) recompute guest TSC from a stable clock (gotten from kernel_ns) 
and apply adjustment (in guest units)

So it appears you can just share most of the logic of guest TSC catchup 
mode.

What do you think?

Zach