From mboxrd@z Thu Jan 1 00:00:00 1970
Date: Tue, 13 Aug 2024 19:58:16 -0700
In-Reply-To: <20240522001817.619072-10-dwmw2@infradead.org>
Precedence: bulk
X-Mailing-List:
 linux-doc@vger.kernel.org
Mime-Version: 1.0
References: <20240522001817.619072-1-dwmw2@infradead.org>
 <20240522001817.619072-10-dwmw2@infradead.org>
Subject: Re: [RFC PATCH v3 09/21] KVM: x86: Fix KVM clock precision in __get_kvmclock()
From: Sean Christopherson
To: David Woodhouse
Cc: kvm@vger.kernel.org, Paolo Bonzini, Jonathan Corbet, Thomas Gleixner,
 Ingo Molnar, Borislav Petkov, Dave Hansen, x86@kernel.org,
 "H. Peter Anvin", Paul Durrant, Peter Zijlstra, Juri Lelli,
 Vincent Guittot, Dietmar Eggemann, Steven Rostedt, Ben Segall,
 Mel Gorman, Daniel Bristot de Oliveira, Valentin Schneider, Shuah Khan,
 linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
 jalliste@amazon.co.uk, sveith@amazon.de, zide.chen@intel.com,
 Dongli Zhang, Chenyi Qiang
Content-Type: text/plain; charset="us-ascii"

On Wed, May 22, 2024, David Woodhouse wrote:
> From: David Woodhouse
>
> When in 'master clock mode' (i.e. when host and guest TSCs are behaving
> sanely and in sync), the KVM clock is defined in terms of the guest TSC.
>
> When TSC scaling is used, calculating the KVM clock directly from *host*
> TSC cycles leads to a systemic drift from the values calculated by the
> guest from its TSC.
>
> Commit 451a707813ae ("KVM: x86/xen: improve accuracy of Xen timers")
> had a simple workaround for the specific case of Xen timers, as it had an
> actual vCPU to hand and could use its scaling information. That commit
> noted that it was broken for the general case of get_kvmclock_ns(), and
> said "I'll come back to that".
>
> Since __get_kvmclock() is invoked without a specific CPU, it needs to
> be able to find or generate the scaling values required to perform the
> correct calculation.
>
> Thankfully, TSC scaling can only happen with X86_FEATURE_CONSTANT_TSC,
> so it isn't as complex as it might have been.
>
> In __kvm_synchronize_tsc(), note the current vCPU's scaling ratio in
> kvm->arch.last_tsc_scaling_ratio.
> That is only protected by the
> tsc_write_lock, so in pvclock_update_vm_gtod_copy(), copy it into a
> separate kvm->arch.master_tsc_scaling_ratio so that it can be accessed
> using the kvm->arch.pvclock_sc seqcount lock. Also generate the mul and
> shift factors to convert to nanoseconds for the corresponding KVM clock,
> just as kvm_guest_time_update() would.
>
> In __get_kvmclock(), which runs within a seqcount retry loop, use those
> values to convert host to guest TSC and then to nanoseconds. Only fall
> back to using get_kvmclock_base_ns() when not in master clock mode.
>
> There was previously a code path in __get_kvmclock() which looked like
> it could set KVM_CLOCK_TSC_STABLE without KVM_CLOCK_REALTIME, perhaps
> even on 32-bit hosts. In practice that could never happen as the
> ka->use_master_clock flag couldn't be set on 32-bit, and even on 64-bit
> hosts it would never be set when the system clock isn't TSC-based. So
> that code path is now removed.

This should be a separate patch.  Actually, patches, plural.  More below.

> The kvm_get_wall_clock_epoch() function had the same problem; make it
> just call get_kvmclock() and subtract kvmclock from wallclock, with
> the same fallback as before.
>
> Signed-off-by: David Woodhouse
> ---

...

> @@ -3100,36 +3131,49 @@ static unsigned long get_cpu_tsc_khz(void)
>  static void __get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
>  {
>  	struct kvm_arch *ka = &kvm->arch;
> -	struct pvclock_vcpu_time_info hv_clock;
> +
> +#ifdef CONFIG_X86_64
> +	uint64_t cur_tsc_khz = 0;
> +	struct timespec64 ts;
>
>  	/* both __this_cpu_read() and rdtsc() should be on the same cpu */
>  	get_cpu();
>
> -	data->flags = 0;
>  	if (ka->use_master_clock &&
> -	    (static_cpu_has(X86_FEATURE_CONSTANT_TSC) || __this_cpu_read(cpu_tsc_khz))) {
> -#ifdef CONFIG_X86_64
> -		struct timespec64 ts;
> +	    (cur_tsc_khz = get_cpu_tsc_khz()) &&

That is mean.
And if you push it inside the if-statement, the {get,put}_cpu() can be
avoided when the master clock isn't being used, e.g.

	if (ka->use_master_clock) {
		/*
		 * The RDTSC needs to happen on the same CPU whose frequency is
		 * used to compute kvmclock's time.
		 */
		get_cpu();

		cur_tsc_khz = get_cpu_tsc_khz();
		if (cur_tsc_khz &&
		    !kvm_get_walltime_and_clockread(&ts, &data->host_tsc))
			cur_tsc_khz = 0;

		put_cpu();
	}

However, the changelog essentially claims kvm_get_walltime_and_clockread()
should never fail when use_master_clock is enabled, which suggests a WARN is
warranted.

  There was previously a code path in __get_kvmclock() which looked like
  it could set KVM_CLOCK_TSC_STABLE without KVM_CLOCK_REALTIME, perhaps
  even on 32-bit hosts. In practice that could never happen as the
  ka->use_master_clock flag couldn't be set on 32-bit, and even on 64-bit
  hosts it would never be set when the system clock isn't TSC-based. So
  that code path is now removed.

But, I think kvm_get_walltime_and_clockread() can fail when use_master_clock
is true, i.e. I don't think a WARN is viable as it could get false positives.

Ah, this is protected by pvclock_sc, so a stale use_master_clock should result
in a retry.  What if we WARN on that?

Hrm, that requires plumbing in the original sequence count.  Ah, but looking at
the patch as a whole, if we keep kvm_get_wall_clock_epoch()'s style, then it's
much easier.  And FWIW, I like the existing kvm_get_wall_clock_epoch() style a
lot more than the get_kvmclock() => __get_kvmclock() approach.

So, can we do this as prep patch #1?

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9c14d0f5a684..98806a59e110 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3360,9 +3360,16 @@ uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm)

 		local_tsc_khz = get_cpu_tsc_khz();

+		/*
+		 * The master clock depends on the pvclock being based on TSC,
+		 * so the only way kvm_get_walltime_and_clockread() can fail is
+		 * if the clocksource changed and use_master_clock is stale, in
+		 * which case a seqcount retry should be pending.
+		 */
 		if (local_tsc_khz &&
-		    !kvm_get_walltime_and_clockread(&ts, &host_tsc))
-			local_tsc_khz = 0; /* Fall back to old method */
+		    !kvm_get_walltime_and_clockread(&ts, &host_tsc) &&
+		    WARN_ON_ONCE(!read_seqcount_retry(&ka->pvclock_sc, seq)))
+			local_tsc_khz = 0; /* Fall back to old method */

 		put_cpu();

And then as patch(es) 2..7 (give or take)

  (2) fold __get_kvmclock() into get_kvmclock()

  (3) add the same WARN on the seqcount in get_kvmclock() (but skimp on the
      comments)

  (4) use get_kvmclock_base_ns() as the fallback in get_kvmclock(), i.e. delete
      the raw rdtsc() and setting of KVM_CLOCK_TSC_STABLE w/o KVM_CLOCK_REALTIME

  (5) use get_cpu_tsc_khz() instead of open coding something similar

  (6) scale TSC when computing kvmclock (the core of this patch)

  (7) use get_kvmclock() in kvm_get_wall_clock_epoch() as they will be 100%
      equivalent at this point.
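
For anyone following along, the seqcount reasoning behind the WARN in the prep
patch can be illustrated with a userspace sketch.  This is not kernel code and
is not what KVM does verbatim: struct clock_state and all of the clock_*()
helpers are hypothetical names, C11 atomics stand in for the kernel's seqcount
API, and a single writer is assumed.  The only point is that a failed clock
read is "expected" exactly when the reader's snapshot of use_master_clock is
stale, i.e. when a retry is already pending.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the fields protected by pvclock_sc. */
struct clock_state {
	atomic_uint seq;		/* even: stable, odd: update in flight */
	atomic_bool use_master_clock;
	atomic_uint_fast64_t master_cycle_now;
};

/* Reader side: spin until no update is in flight, return the snapshot. */
static unsigned int clock_read_begin(struct clock_state *s)
{
	unsigned int seq;

	while ((seq = atomic_load(&s->seq)) & 1)
		;
	return seq;
}

/* Reader side: true if a writer ran since clock_read_begin(). */
static bool clock_read_retry(struct clock_state *s, unsigned int seq)
{
	return atomic_load(&s->seq) != seq;
}

/* Writer side (single writer assumed): odd while updating, even when done. */
static void clock_update(struct clock_state *s, bool master, uint64_t cycles)
{
	assert(!(atomic_load(&s->seq) & 1));	/* no update already in flight */
	atomic_fetch_add(&s->seq, 1);
	atomic_store(&s->use_master_clock, master);
	atomic_store(&s->master_cycle_now, cycles);
	atomic_fetch_add(&s->seq, 1);
}

/*
 * A failed TSC-based clock read while use_master_clock was observed as true
 * is only "expected" if that observation is stale, i.e. a retry is pending.
 * Anything else is what the WARN is meant to catch.
 */
static bool failure_is_expected(struct clock_state *s, unsigned int seq)
{
	return clock_read_retry(s, seq);
}
```

In this sketch, a reader that took its snapshot before clock_update() ran sees
failure_is_expected() return true afterwards and simply retries; a reader whose
snapshot is still current gets false, which is the would-be WARN case.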

> +	    !kvm_get_walltime_and_clockread(&ts, &data->host_tsc))
> +		cur_tsc_khz = 0;
>
> -		if (kvm_get_walltime_and_clockread(&ts, &data->host_tsc)) {
> -			data->realtime = ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec;
> -			data->flags |= KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC;
> -		} else
> -#endif
> -		data->host_tsc = rdtsc();
> -
> -		data->flags |= KVM_CLOCK_TSC_STABLE;
> -		hv_clock.tsc_timestamp = ka->master_cycle_now;
> -		hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
> -		kvm_get_time_scale(NSEC_PER_SEC, get_cpu_tsc_khz() * 1000LL,
> -				   &hv_clock.tsc_shift,
> -				   &hv_clock.tsc_to_system_mul);
> -		data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
> -	} else {
> -		data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
> +	put_cpu();
> +
> +	if (cur_tsc_khz) {
> +		uint64_t tsc_cycles;
> +		uint32_t mul;
> +		int8_t shift;
> +
> +		tsc_cycles = data->host_tsc - ka->master_cycle_now;
> +
> +		if (kvm_caps.has_tsc_control)
> +			tsc_cycles = kvm_scale_tsc(tsc_cycles,
> +						   ka->master_tsc_scaling_ratio);
> +
> +		if (static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
> +			mul = ka->master_tsc_mul;
> +			shift = ka->master_tsc_shift;
> +		} else {
> +			kvm_get_time_scale(NSEC_PER_SEC, cur_tsc_khz * 1000LL,
> +					   &shift, &mul);
> +		}
> +		data->clock = ka->master_kernel_ns + ka->kvmclock_offset +
> +			pvclock_scale_delta(tsc_cycles, mul, shift);
> +		data->realtime = ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec;
> +		data->flags = KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC | KVM_CLOCK_TSC_STABLE;
> +		return;
>  	}
> +#endif
>
> -	put_cpu();
> +	data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
> +	data->flags = 0;
>  }
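
The arithmetic in the quoted hunk (host TSC delta => guest TSC delta =>
nanoseconds) can be sketched in userspace as below.  Treat it as an
illustration under stated assumptions, not as the kernel's implementation:
scale_tsc() and scale_delta() are hypothetical stand-ins for kvm_scale_tsc()
and pvclock_scale_delta(), the 48 fractional bits in the ratio are assumed
(the real value is hardware dependent), and unsigned __int128 is a GCC/Clang
extension on 64-bit targets.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: the scaling ratio has 48 fractional bits. */
#define RATIO_FRAC_BITS	48

/* Stand-in for kvm_scale_tsc(): (tsc * ratio) >> frac_bits, 128-bit math. */
static uint64_t scale_tsc(uint64_t tsc, uint64_t ratio)
{
	return (uint64_t)(((unsigned __int128)tsc * ratio) >> RATIO_FRAC_BITS);
}

/*
 * Stand-in for pvclock_scale_delta(): apply the binary shift first, then
 * multiply by a 32-bit fixed-point fraction and drop the low 32 bits.
 */
static uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
	assert(shift > -64 && shift < 64);	/* wider shifts are undefined */

	if (shift < 0)
		delta >>= -shift;
	else
		delta <<= shift;

	return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

/* Host TSC cycles since the master clock snapshot => guest nanoseconds. */
static uint64_t host_delta_to_ns(uint64_t host_delta, uint64_t ratio,
				 uint32_t mul, int8_t shift)
{
	uint64_t guest_delta = scale_tsc(host_delta, ratio);

	return scale_delta(guest_delta, mul, shift);
}
```

E.g. with a ratio of 0.5 (1ULL << 47) and mul/shift describing a 1 GHz guest
TSC (mul = 0x80000000, shift = 1), 2,000,000,000 host cycles convert to
1,000,000,000 guest cycles and then to 1,000,000,000 ns.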