* Re: [PATCH v3 17/41] x86/tsc: WARN if TSC sched_clock save/restore used with PV sched_clock
From: David Woodhouse @ 2026-05-20 22:27 UTC (permalink / raw)
To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
Thomas Gleixner, John Stultz
Cc: Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-18-seanjc@google.com>
[-- Attachment #1: Type: text/plain, Size: 396 bytes --]
On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Now that all PV clocksources override the sched_clock save/restore hooks
> when overriding sched_clock, WARN if the "default" TSC hooks are invoked
> when using a PV sched_clock, e.g. to guard against regressions.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply
* Re: [PATCH v2 15/15] KVM: x86: Move the bulk of register specific code from x86.c to regs.c
From: Huang, Kai @ 2026-05-20 22:22 UTC (permalink / raw)
To: seanjc@google.com
Cc: dwmw2@infradead.org, Edgecombe, Rick P, x86@kernel.org,
binbin.wu@linux.intel.com, kas@kernel.org,
dave.hansen@linux.intel.com, vkuznets@redhat.com, paul@xen.org,
yosry@kernel.org, pbonzini@redhat.com, kvm@vger.kernel.org,
linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org
In-Reply-To: <ag35RgwowvR9Q4hA@google.com>
On Wed, 2026-05-20 at 11:11 -0700, Sean Christopherson wrote:
> On Wed, May 20, 2026, Kai Huang wrote:
> > On Tue, 2026-05-19 at 18:25 -0700, Sean Christopherson wrote:
> > > On Wed, May 20, 2026, Kai Huang wrote:
> > > > I am not sure whether there's a mandatory requirement that "struct kvm_arch" and
> > > > "struct kvm_vcpu_arch" must be fully embedded, and it would be kinda painful to
> > > > covert to a pointer (e.g., there's kvm_x86_ops::vm_size), but perhaps that is
> > > > also an option to consider?
> > >
> > > The idea I had in the past, and where I was going with things before s390's love
> > > for arm64 came along, was to add a kvm_arch.h in arch/<arch>/kvm, and have virt/kvm
> > > include _that_ instead of kvm_host.h.
> > >
> >
> > Not sure whether there's other code doing so? :-)
> >
> > > That way we don't need to make any fundamental
> > > changes to structures, but we can still significantly cut down on what's exposed
> > > via kvm_host.h.
> > >
> >
> > Yeah.
> >
> > I saw below from you in [1]:
> >
> > --
> > We've explore several alternatives to the #ifdef __KVM__ approach, and
> > they all sucked, hard. What I really wanted (and still want) to do, is to
> > bury the bulk of kvm_host.h (and other KVM headers) in virt/kvm, but every
> > attempt to do that ended in flames. Even with the __KVM__ guards in place,
> > each architecture's kvm_host.h is too intertwined with the common kvm_host.h,
> > and trying to extract small-ish pieces just doesn't work (each patch
> > inevitably snowballed into a gigantic beast).
> >
> > The other idea we considered (which I thought of, and feel dirty for even
> > proposing it internally), is to move all headers under virt/kvm, add
> > virt/kvm/include to the global header path, and then have KVM x86 omit
> > virt/kvm/include when configured to hide KVM internals. I hate this idea
> > because it sets a bad precedent, and requires a lot of file movement
> > without providing any benefit to other architectures. E.g. I hope that
> > guarding KVM internals with #ifdef __KVM__ will allow us to slowly clean
> > things up so that some day KVM only exposes a handful of APIs to the rest
> > of the kernel (probably a pipe dream).
> > --
> >
> > I haven't looked into details of your #ifdef __KVM__ approach yet, but seems you
> > don't quite like moving KVM internal staff to virt/kvm/include/ ?
>
> Not for arch code, which is the trickiest bit to handle.
>
> > But if we want to hide KVM internal structures, I don't see any other options
> > except virt/kvm/include/ is the place to go?
>
> arch/$(ARCH)/kvm/kvm_arch.h is the obvious approach. Code in virt/kvm can reach
> arch/$(ARCH)/kvm, we just need to add it to the include path. That's why I was
> working on unifying the include definitions.
Yeah, for asm/kvm_host.h.
But if I am still following you we still need a place for linux/kvm_host.h, for
which I thought virt/kvm/include/ would be the place at first glance.
>
> > Btw, have you considered reverting the inclusion of "strut kvm" and "struct
> > kvm_arch" (and the vcpu structure), i.e., to make "struct kvm_arch" include
> > "struct kvm"? I don't have any clue of whether it is feasible or how much
> > effort it needs, though -- it's just something came to mind when replying.
> >
> > [1]: https://lore.kernel.org/all/20230916003118.2540661-1-seanjc@google.com/
> >
> > > At some point I'll try to take another look; it's really the
> > > s390+arm64 combo that's problematic :-/
> >
> > If you want, I can take a look. I think I'll have bandwidth in near feature.
> >
> > Given you have tried multiple times so I am not sure what I can achieve, though.
>
> Consolidating includes and creating arch/$(ARCH)/kvm/kvm_arch.h should be more
> doable,
>
Right.
> what has failed spectacularly over and over is effectively trying to hide
> most of asm/kvm_host.h, which sounds *really* stupid when I phrase it that way :-)
I see :-)
>
> > Anyway, seems "allow loading a new (or old) KVM module without needing to
> > rebuild and reboot the entire kernel" is a good reason to do this.
>
> Note, we've scrapped upstreaming multi-KVM, and I don't see it ever getting accepted
> upstream, as live update is far superior, e.g. isn't limited to just KVM, doesn't
> have the same orchestration or testing problems, etc.
>
> And FWIW, today, you _can_ reload a new (or old) KVM module without rebuilding
> the kernel or rebooting the host, it's just that you can't modify certain structures.
Right.
^ permalink raw reply
* Re: [PATCH v3 16/41] x86/vmware: Nullify save/restore hooks when using VMware's sched_clock
From: David Woodhouse @ 2026-05-20 22:15 UTC (permalink / raw)
To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
Thomas Gleixner, John Stultz
Cc: Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-17-seanjc@google.com>
[-- Attachment #1: Type: text/plain, Size: 629 bytes --]
On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Nullify the sched_clock save/restore hooks when using VMware's version of
> sched_clock. This will allow extending paravirt_set_sched_clock() to set
> the save/restore hooks, without having to simultaneously change the
> behavior of VMware guests.
>
> Note, it's not at all obvious that it's safe/correct for VMware guests to
> do nothing on suspend/resume, but that's a pre-existing problem. Leave it
> for a VMware expert to sort out.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply
* Re: [PATCH v3 15/41] x86/xen/time: Nullify x86_platform's sched_clock save/restore hooks
From: David Woodhouse @ 2026-05-20 22:11 UTC (permalink / raw)
To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
Thomas Gleixner, John Stultz
Cc: Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-16-seanjc@google.com>
[-- Attachment #1: Type: text/plain, Size: 1283 bytes --]
On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Nullify the x86_platform sched_clock save/restore hooks when setting up
> Xen's PV clock to make it somewhat obvious the hooks aren't used when
> running as a Xen guest (Xen uses a paravirtualized suspend/resume flow).
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Hm. Xen HVM guests *do* use a standard ACPI S4 hibernation flow.
Does that need testing?
> ---
> arch/x86/xen/time.c | 6 ++++++
> 1 file changed, 6 insertions(+)
>
> diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
> index 3d3165eef821..21d366d01985 100644
> --- a/arch/x86/xen/time.c
> +++ b/arch/x86/xen/time.c
> @@ -568,6 +568,12 @@ static void __init xen_init_time_common(void)
> xen_sched_clock_offset = xen_clocksource_read();
> static_call_update(pv_steal_clock, xen_steal_clock);
> paravirt_set_sched_clock(xen_sched_clock);
> + /*
> + * Xen has paravirtualized suspend/resume and so doesn't use the common
> + * x86 sched_clock save/restore hooks.
> + */
> + x86_platform.save_sched_clock_state = NULL;
> + x86_platform.restore_sched_clock_state = NULL;
>
> tsc_register_calibration_routines(xen_tsc_khz, NULL);
> x86_platform.get_wallclock = xen_get_wallclock;
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply
* Re: [PATCH v6 20/43] KVM: guest_memfd: Enable INIT_SHARED on guest_memfd for x86 Coco VMs
From: Ackerley Tng @ 2026-05-20 22:04 UTC (permalink / raw)
To: Ackerley Tng via B4 Relay, aik, andrew.jones, binbin.wu, brauner,
chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
Paolo Bonzini, Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-20-91ab5a8b19a4@google.com>
Ackerley Tng via B4 Relay <devnull+ackerleytng.google.com@kernel.org>
writes:
> From: Sean Christopherson <seanjc@google.com>
>
> Now that guest_memfd supports tracking private vs. shared within gmem
> itself, allow userspace to specify INIT_SHARED on a guest_memfd instance
> for x86 Confidential Computing (CoCo) VMs, so long as per-VM attributes
> are disabled, i.e. when it's actually possible for a guest_memfd instance
> to contain shared memory.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
> arch/x86/kvm/x86.c | 11 +++++------
> 1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1560de1e95be0..6609957ecfea3 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -14172,14 +14172,13 @@ bool kvm_arch_no_poll(struct kvm_vcpu *vcpu)
> }
>
> #ifdef CONFIG_KVM_GUEST_MEMFD
> -/*
> - * KVM doesn't yet support initializing guest_memfd memory as shared for VMs
> - * with private memory (the private vs. shared tracking needs to be moved into
> - * guest_memfd).
> - */
> bool kvm_arch_supports_gmem_init_shared(struct kvm *kvm)
> {
> - return !kvm_arch_has_private_mem(kvm);
> + /*
> + * INIT_SHARED isn't supported if the memory attributes are per-VM,
> + * in which case guest_memfd can _only_ be used for private memory.
> + */
> + return !vm_memory_attributes || !kvm_arch_has_private_mem(kvm);
Adding a note here from PUCK on 2026-05-20:
Michael pointed out that it's odd that when vm_memory_attributes is
available, guest_memfd still can only be used for private memory.
It is a little odd, but we don't want to investigate the complexities of
supporting it, and Sean says this is working as intended, in line with
deprecating vm_memory_attributes=true.
> }
>
> #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_PREPARE
>
> --
> 2.54.0.563.g4f69b47b94-goog
^ permalink raw reply
* Re: [PATCH v3 14/41] x86/kvmclock: Move sched_clock save/restore helpers up in kvmclock.c
From: David Woodhouse @ 2026-05-20 21:59 UTC (permalink / raw)
To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
Thomas Gleixner, John Stultz
Cc: Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-15-seanjc@google.com>
[-- Attachment #1: Type: text/plain, Size: 345 bytes --]
On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Move kvmclock's sched_clock save/restore helper "up" so that they can
> (eventually) be referenced by kvm_sched_clock_init().
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply
* Re: [PATCH v3 13/41] x86/paravirt: Move handling of unstable PV clocks into paravirt_set_sched_clock()
From: David Woodhouse @ 2026-05-20 21:57 UTC (permalink / raw)
To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
Thomas Gleixner, John Stultz
Cc: Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-14-seanjc@google.com>
[-- Attachment #1: Type: text/plain, Size: 853 bytes --]
On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Move the handling of unstable PV clocks, of which kvmclock is the only
> example, into paravirt_set_sched_clock(). This will allow modifying
> paravirt_set_sched_clock() to keep using the TSC for sched_clock in
> certain scenarios without unintentionally marking the TSC-based clock as
> unstable.
>
> No functional change intended.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
> -void paravirt_set_sched_clock(u64 (*func)(void));
> +void __paravirt_set_sched_clock(u64 (*func)(void), bool stable);
> +
> +static inline void paravirt_set_sched_clock(u64 (*func)(void))
> +{
> + __paravirt_set_sched_clock(func, true);
> +}
One of the few things I actually like about C++ is default arguments...
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply
* Re: [PATCH v3 12/41] x86/paravirt: Remove unnecessary PARAVIRT=n stub for paravirt_set_sched_clock()
From: David Woodhouse @ 2026-05-20 21:53 UTC (permalink / raw)
To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
Thomas Gleixner, John Stultz
Cc: Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-13-seanjc@google.com>
[-- Attachment #1: Type: text/plain, Size: 526 bytes --]
On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Remove the unnecessary paravirt_set_sched_clock() stub for PARAVIRT=n, as
> all callers are gated by PARAVIRT=y. Eliminating the stub will avoid a
> pile of pointless churn as the "real" implementation evolves.
>
> No functional change intended.
>
> Fixes: 39965afb1151 ("x86/paravirt: Move paravirt_sched_clock() related code into tsc.c")
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply
* Re: [PATCH v3 11/41] x86/kvm: Don't disable kvmclock on BSP in syscore_suspend()
From: David Woodhouse @ 2026-05-20 21:51 UTC (permalink / raw)
To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
Thomas Gleixner, John Stultz
Cc: Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-12-seanjc@google.com>
[-- Attachment #1: Type: text/plain, Size: 1214 bytes --]
On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Don't disable kvmclock on the BSP during syscore_suspend(), as the BSP's
> clock is NOT restored during syscore_resume(), but is instead restored
> earlier via the sched_clock restore callback. If suspend is aborted, e.g.
> due to a late wakeup, the BSP will run without its clock enabled, which
> "works" only because KVM-the-hypervisor is kind enough to not clobber the
> shared memory when the clock is disabled. But over time, the BSP's view
> of time will drift from APs.
The work that Dongli and I have been doing elsewhere makes me want to
phrase that last sentence far more judgmentally, as the kvmclock
absolutely should *not* be randomly drifting over time without good
reason, and what was written to it earlier really *ought* to still be
valid.
But OK :)
> Plumb in an "action" to KVM-as-a-guest and kvmclock code in preparation
> for additional cleanups to kvmclock's suspend/resume logic.
>
> Fixes: c02027b5742b ("x86/kvm: Disable kvmclock on all CPUs on shutdown")
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply
* Re: [PATCH v3 10/41] x86/kvmclock: Setup kvmclock for secondary CPUs iff CONFIG_SMP=y
From: David Woodhouse @ 2026-05-20 21:45 UTC (permalink / raw)
To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
Thomas Gleixner, John Stultz
Cc: Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-11-seanjc@google.com>
[-- Attachment #1: Type: text/plain, Size: 514 bytes --]
On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Gate kvmclock's secondary CPU code on CONFIG_SMP, not CONFIG_X86_LOCAL_APIC.
> Originally, kvmclock piggybacked PV APIC ops to setup secondary CPUs.
> When that wart was fixed by commit df156f90a0f9 ("x86: Introduce
> x86_cpuinit.early_percpu_clock_init hook"), the dependency on a local APIC
> got carried forward unnecessarily.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply
* Re: [PATCH v6 05/43] KVM: guest_memfd: Wire up kvm_get_memory_attributes() to per-gmem attributes
From: Ackerley Tng @ 2026-05-20 21:44 UTC (permalink / raw)
To: Fuad Tabba
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CA+EHjTw-cUM=FrJevtSDtR7K6MwUfGfOx21LMFDn7DAy5bFzYw@mail.gmail.com>
Fuad Tabba <tabba@google.com> writes:
>
> [...snip...]
>
>> +unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
>> +{
>> + struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
>> + struct inode *inode;
>> +
>> + /*
>> + * If this gfn has no associated memslot, there's no chance of the gfn
>> + * being backed by private memory, since guest_memfd must be used for
>> + * private memory, and guest_memfd must be associated with some memslot.
>> + */
>> + if (!slot)
>> + return 0;
>> +
>> + CLASS(gmem_get_file, file)(slot);
>> + if (!file)
>> + return 0;
>> +
>> + inode = file_inode(file);
>> +
>> + /*
>> + * Rely on the maple tree's internal RCU lock to ensure a
>> + * stable result. This result can become stale as soon as the
>> + * lock is dropped, so the caller _must_ still protect
>> + * consumption of private vs. shared by checking
>> + * mmu_invalidate_retry_gfn() under mmu_lock to serialize
>> + * against ongoing attribute updates.
>> + */
>> + return kvm_gmem_get_attributes(inode, kvm_gmem_get_index(slot, gfn));
>> +}
>
> Doesn't this imply that all consumers of kvm_mem_is_private() should
> validate the result using mmu_lock and the invalidation sequence?
Let me know how I can improve the comment.
I think the "consumption" of private vs shared here actually means
something like "don't commit a page being faulted into page tables based
on the result of kvm_gmem_get_memory_attributes() without checking
kvm->mmu_invalidate_in_progress.", since a racing conversion may
complete before you commit.
kvm_mem_is_private() is used from these places:
1. Fault handling in KVM, like page_fault_can_be_fast(),
kvm_mmu_faultin_pfn(), kvm_mmu_page_fault(): this already handles the
entire mmu_lock and invalidation dance. No fault will be committed if
a racing conversion happened after kvm_mem_is_private() but before
the commit.
2. kvm_mmu_max_mapping_level() from recovering huge pages after
disabling dirty logging: Other than that it can't be used with
guest_memfd now since dirty logging can't be used with guest_memfd
and guest_memfd memslots are not updatable, this holds mmu_lock
throughout until the huge page recovery is done. invalidate_begin
also involves zapping the pages in the range, so if the order of
events is
| Thread A | Thread B |
|------------------------------|-------------------|
| invalidate_begin + zap | |
| update attributes maple_tree | recover huge page |
| invalidate_end | |
Then recovering will never see the zapped pages, nothing to
recover, no kvm_mem_is_private() lookup.
3. kvm_arch_vcpu_pre_fault_memory()
This eventually calls kvm_tdp_mmu_page_fault(), which checks
is_page_fault_stale(), so it does check before committing.
Were there any other calls I missed?
> sev_handle_rmp_fault() calls kvm_mem_is_private() without holding
> mmu_lock and without any retry mechanism. Is that a problem?
>
Sean already replied on your actual question separately :)
> Cheers,
> /fuad
>
>
>>
>> [...snip...]
>>
^ permalink raw reply
* Re: [PATCH v3 02/41] x86/tsc: Add helper to register CPU and TSC freq calibration routines
From: David Woodhouse @ 2026-05-20 21:44 UTC (permalink / raw)
To: Sean Christopherson
Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <ag4or9-9c6VZxqya@google.com>
[-- Attachment #1: Type: text/plain, Size: 3077 bytes --]
On Wed, 2026-05-20 at 14:33 -0700, Sean Christopherson wrote:
> On Wed, May 20, 2026, David Woodhouse wrote:
> > On Wed, 2026-05-20 at 13:44 -0700, Sean Christopherson wrote:
> > >
> > > + /*
> > > + * If the TSC counts at a constant frequency across P/T states, counts
> > > + * in deep C-states, and the TSC hasn't been marked unstable, treat the
> > > + * TSC reliable, as guaranteed by KVM. Note, the TSC unstable check
> > > + * exists purely to honor the TSC being marked unstable via command
> > > + * line, any runtime detection of an unstable will happen after this.
> > > + */
> > > + if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> > > + boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> > > + !check_tsc_unstable())
> > {
> > > + tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;
> >
> > kvmclock = 0; /* Why use it if the TSC works? The kvmclock exists
> > *purely* to work around a TSC which *doesn't*
> > have those properties checked above. */
>
> kvmclock still provides SYSTEM_TIME and WALL_CLOCK :-/
Um, SYSTEM_TIME *is* the kvmclock of which we speak, isn't it?
WALL_CLOCK is something else, and I guess we should still consume that,
yes.
>
> > > +
> > > + kvm_tsc_khz_cpuid = kvm_para_tsc_khz();
> > > +
> > > + /*
> > > + * If provided, use the TSC (and APIC bus) frequency provided in KVM's
> > > + * PV CPUID leaf even if kvmclock itself is disabled via command line.
> > > + * The PV CPUID information isn't dependent on kvmclock in any way, and
> > > + * in fact using the precise information is *more* important when the
> > > + * user has explicitly disabled kvmclock to force the kernel to use the
> > > + * TSC as its clocksource.
> > > + */
> > > + if (!kvmclock) {
> > > + if (kvm_tsc_khz_cpuid)
> > > + tsc_register_calibration_routines(kvm_get_tsc_khz,
> > > + kvm_get_cpu_khz,
> > > + tsc_properties);
> > > + return;
> > > + }
> > > +
> >
> >
> > Regardless of the above, why not just register these here
> > unconditionally, and remove the later call that does the same?
>
> Because if kvmclock=n, it's only safe to call kvm_get_tsc_khz() if kvm_tsc_khz_cpuid
> is non-zero, other wise the "else" path will hit a NULL pointer deref when trying
> to get the frequency from the PV clock struct:
>
> return kvm_tsc_khz_cpuid ? : pvclock_tsc_khz(this_cpu_pvti());
Ah, right. Thanks. Ick though :)
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply
* Re: [PATCH v3 02/41] x86/tsc: Add helper to register CPU and TSC freq calibration routines
From: Sean Christopherson @ 2026-05-20 21:33 UTC (permalink / raw)
To: David Woodhouse
Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <621e10bdc9e297c6c600b561d8fa25c3b62968bc.camel@infradead.org>
On Wed, May 20, 2026, David Woodhouse wrote:
> On Wed, 2026-05-20 at 13:44 -0700, Sean Christopherson wrote:
> >
> > + /*
> > + * If the TSC counts at a constant frequency across P/T states, counts
> > + * in deep C-states, and the TSC hasn't been marked unstable, treat the
> > + * TSC reliable, as guaranteed by KVM. Note, the TSC unstable check
> > + * exists purely to honor the TSC being marked unstable via command
> > + * line, any runtime detection of an unstable will happen after this.
> > + */
> > + if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> > + boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> > + !check_tsc_unstable())
> {
> > + tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;
>
> kvmclock = 0; /* Why use it if the TSC works? The kvmclock exists
> *purely* to work around a TSC which *doesn't*
> have those properties checked above. */
kvmclock still provides SYSTEM_TIME and WALL_CLOCK :-/
> }
>
> I was going to say PVCLOCK_TSC_STABLE_BIT, and maybe we should check
> that *too* for paranoia?
No? PVCLOCK_TSC_STABLE is property of kvmclock more than it's a property of the
TSC itself. And for the no-kvmclock case, we most definitely don't want to setup
kvmclock just to query that flag.
> But hopefully the checks you have above are equivalent?
They aren't as paranoid, but if the host enumerates CONSTANT+NONSTOP TSC despite
KVM-the-host not being able to advertise PVCLOCK_TSC_STABLE_BIT, then the VMM done
messed up.
> > +
> > + kvm_tsc_khz_cpuid = kvm_para_tsc_khz();
> > +
> > + /*
> > + * If provided, use the TSC (and APIC bus) frequency provided in KVM's
> > + * PV CPUID leaf even if kvmclock itself is disabled via command line.
> > + * The PV CPUID information isn't dependent on kvmclock in any way, and
> > + * in fact using the precise information is *more* important when the
> > + * user has explicitly disabled kvmclock to force the kernel to use the
> > + * TSC as its clocksource.
> > + */
> > + if (!kvmclock) {
> > + if (kvm_tsc_khz_cpuid)
> > + tsc_register_calibration_routines(kvm_get_tsc_khz,
> > + kvm_get_cpu_khz,
> > + tsc_properties);
> > + return;
> > + }
> > +
>
>
> Regardless of the above, why not just register these here
> unconditionally, and remove the later call that does the same?
Because if kvmclock=n, it's only safe to call kvm_get_tsc_khz() if kvm_tsc_khz_cpuid
is non-zero, other wise the "else" path will hit a NULL pointer deref when trying
to get the frequency from the PV clock struct:
return kvm_tsc_khz_cpuid ? : pvclock_tsc_khz(this_cpu_pvti());
^ permalink raw reply
* Re: [PATCH v3 09/41] clocksource: hyper-v: Don't save/restore TSC offset when using HV sched_clock
From: David Woodhouse @ 2026-05-20 21:30 UTC (permalink / raw)
To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
Thomas Gleixner, John Stultz
Cc: Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-10-seanjc@google.com>
[-- Attachment #1: Type: text/plain, Size: 957 bytes --]
On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Now that Hyper-V overrides the sched_clock save/restore hooks if and only
> sched_clock itself is set to the Hyper-V reference counter, drop the
> invocation of the "old" save/restore callbacks. When the registration of
> the PV sched_clock was done separately from overriding the save/restore
> hooks, it was possible for Hyper-V to clobber the TSC save/restore
> callbacks without actually switching to the Hyper-V refcounter.
>
> Enabling a PV sched_clock is a one-way street, i.e. the kernel will never
> revert to using TSC for sched_clock, and so there is no need to invoke the
> TSC save/restore hooks (and if there was, it belongs in common PV code).
>
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply
* Re: [PATCH v3 08/41] clocksource: hyper-v: Drop wrappers to sched_clock save/restore helpers
From: David Woodhouse @ 2026-05-20 21:29 UTC (permalink / raw)
To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
Thomas Gleixner, John Stultz
Cc: Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-9-seanjc@google.com>
[-- Attachment #1: Type: text/plain, Size: 477 bytes --]
On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Now that all of the Hyper-V reference counter sched_clock code is located
> in a single file, drop the superfluous wrappers for the save/restore flows.
>
> No functional change intended.
>
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply
* Re: [PATCH v3 07/41] clocksource: hyper-v: Register sched_clock save/restore iff it's necessary
From: David Woodhouse @ 2026-05-20 21:27 UTC (permalink / raw)
To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
Thomas Gleixner, John Stultz
Cc: Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-8-seanjc@google.com>
[-- Attachment #1: Type: text/plain, Size: 1346 bytes --]
On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Register the Hyper-V reference counter (refcounter) callbacks for saving
> and restoring its PV sched_clock, if and only if the refcounter is
> actually being used for sched_clock. Currently, Hyper-V overrides the
> save/restore hooks if the reference TSC available, whereas the Hyper-V
> refcounter code only overrides sched_clock if the reference TSC is
> available *and* it's not invariant. The flaw is effectively papered over
> by invoking the "old" save/restore callbacks as part of save/restore, but
> that's unnecessary and fragile.
>
> To avoid introducing more complexity, and to allow for additional cleanups
> of the PV sched_clock code, move the save/restore hooks and logic into
> hyperv_timer.c and simply wire up the hooks when overriding sched_clock
> itself.
>
> Note, while the Hyper-V refcounter code is intended to be architecture
> neutral, CONFIG_PARAVIRT is firmly x86-only, i.e. adding a small amount of
> x86 specific code (which will be reduced in future cleanups) doesn't
> meaningfully pollute generic code.
>
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply
* Re: [PATCH v3 02/41] x86/tsc: Add helper to register CPU and TSC freq calibration routines
From: David Woodhouse @ 2026-05-20 21:19 UTC (permalink / raw)
To: Sean Christopherson
Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <ag4dMc2B3JQi4vxU@google.com>
[-- Attachment #1: Type: text/plain, Size: 2384 bytes --]
On Wed, 2026-05-20 at 13:44 -0700, Sean Christopherson wrote:
>
> + /*
> + * If the TSC counts at a constant frequency across P/T states, counts
> + * in deep C-states, and the TSC hasn't been marked unstable, treat the
> + * TSC reliable, as guaranteed by KVM. Note, the TSC unstable check
> + * exists purely to honor the TSC being marked unstable via command
> + * line, any runtime detection of an unstable will happen after this.
> + */
> + if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> + boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> + !check_tsc_unstable())
{
> + tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;
kvmclock = 0; /* Why use it if the TSC works? The kvmclock exists
*purely* to work around a TSC which *doesn't*
have those properties checked above. */
}
I was going to say PVCLOCK_TSC_STABLE_BIT, and maybe we should check
that *too* for paranoia? But hopefully the checks you have above are
equivalent?
> +
> + kvm_tsc_khz_cpuid = kvm_para_tsc_khz();
> +
> + /*
> + * If provided, use the TSC (and APIC bus) frequency provided in KVM's
> + * PV CPUID leaf even if kvmclock itself is disabled via command line.
> + * The PV CPUID information isn't dependent on kvmclock in any way, and
> + * in fact using the precise information is *more* important when the
> + * user has explicitly disabled kvmclock to force the kernel to use the
> + * TSC as its clocksource.
> + */
> + if (!kvmclock) {
> + if (kvm_tsc_khz_cpuid)
> + tsc_register_calibration_routines(kvm_get_tsc_khz,
> + kvm_get_cpu_khz,
> + tsc_properties);
> + return;
> + }
> +
Regardless of the above, why not just register these here
unconditionally, and remove the later call that does the same?
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply
* Re: [PATCH v3 06/41] x86/acrn: Mark TSC frequency as known when using ACRN for calibration
From: Sean Christopherson @ 2026-05-20 21:02 UTC (permalink / raw)
To: David Woodhouse
Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <c347d65f555ad1e10a0e87ee57c5879c7046d0e7.camel@infradead.org>
On Wed, May 20, 2026, David Woodhouse wrote:
> On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > Mark the TSC frequency as known when using ACRN's PV CPUID information.
> > Per commit 81a71f51b89e ("x86/acrn: Set up timekeeping") and common sense,
> > the TSC freq is explicitly provided by the hypervisor.
> >
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> > ---
> > arch/x86/kernel/cpu/acrn.c | 1 +
> > 1 file changed, 1 insertion(+)
> >
> > diff --git a/arch/x86/kernel/cpu/acrn.c b/arch/x86/kernel/cpu/acrn.c
> > index c1506cb87d8c..2da3de4d470e 100644
> > --- a/arch/x86/kernel/cpu/acrn.c
> > +++ b/arch/x86/kernel/cpu/acrn.c
> > @@ -29,6 +29,7 @@ static void __init acrn_init_platform(void)
> > /* Install system interrupt handler for ACRN hypervisor callback */
> > sysvec_install(HYPERVISOR_CALLBACK_VECTOR, sysvec_acrn_hv_callback);
> >
> > + setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> > tsc_register_calibration_routines(acrn_get_tsc_khz,
> > acrn_get_tsc_khz);
>
> I'd feel slightly happier doing that from within acrn_get_tsc_khz()
> itself....
Ya, though as you note below, this is really a comment on the overall series.
Waiting to set the cap until the calibration routine is actually run does prevent
the case where the something registers a calibration routine, but its routine is
never run.
However, because the cap is sticky, it doesn't handle the scenario where its
routine _does_ run, but the kernel ultimately throws away its calibration in favor
of something else.
Further complicating things is that ~half of the paravirt flows already force set
caps before their routines are invoked:
snp_secure_tsc_init(), jailhouse_init_platform(), ms_hyperv_init_platform(),
vmware_set_capabilities()
Rather than trying to convince everyone that waiting is better, despite that
approach still being flawed, I chose to handle this by ensuring that once the TSC
is marked known and/or reliable, the kernel won't replace the calibration routine
with a "lesser" source:
x86/tsc: Rejects attempts to override TSC calibration with lesser routine
That doesn't completely prevent the kernel from being stupid, but it should prevent
both the case where X's routine is registered but never run, as well as the case
where it's run but ultimately ignored.
> which I note is 'static inline'. I'm vaguely surprised that even builds
> (although it does).
Heh, KVM x86 (in the host) does stupid things like this too. E.g. kvm_pdptr_read()
is a static inline, but then wired up as a function pointer in three different
places.
> Probably nicer to move it explicitly out of line in acrn.c though.
>
> Most of that should be a comment on patch 2 of the series, I guess?
^ permalink raw reply
* Re: [PATCH v3 02/41] x86/tsc: Add helper to register CPU and TSC freq calibration routines
From: Sean Christopherson @ 2026-05-20 20:44 UTC (permalink / raw)
To: David Woodhouse
Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <44e0d60548d317fd59895f18bd17220dfb2f834b.camel@infradead.org>
On Wed, May 20, 2026, David Woodhouse wrote:
> On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
> > index b5991d53fc0e..e9e7394140dd 100644
> > --- a/arch/x86/kernel/kvmclock.c
> > +++ b/arch/x86/kernel/kvmclock.c
> > @@ -321,8 +321,8 @@ void __init kvmclock_init(void)
> > flags = pvclock_read_flags(&hv_clock_boot[0].pvti);
> > kvm_sched_clock_init(flags & PVCLOCK_TSC_STABLE_BIT);
> >
> > - x86_platform.calibrate_tsc = kvm_get_tsc_khz;
> > - x86_platform.calibrate_cpu = kvm_get_tsc_khz;
> > + tsc_register_calibration_routines(kvm_get_tsc_khz, kvm_get_tsc_khz);
> > +
> > x86_platform.get_wallclock = kvm_get_wallclock;
> > x86_platform.set_wallclock = kvm_set_wallclock;
> > #ifdef CONFIG_X86_LOCAL_APIC
>
> Can we move those (and maybe everything in the context there too) up
> *before* the check for no-kvmclock at the top of the function?
Oof, I was going to say "no", but disabling kvmclock is exactly the workaround
I've told people to use to get the kernel to use the TSC instead of kvmclock.
> Probably in a separate patch.
Ya. I think this?
diff --git a/arch/x86/kernel/kvmclock.c b/arch/x86/kernel/kvmclock.c
index 08ee4bc304c8..92a1ebf31e4d 100644
--- a/arch/x86/kernel/kvmclock.c
+++ b/arch/x86/kernel/kvmclock.c
@@ -27,6 +27,7 @@ static int kvmclock_vsyscall __initdata = 1;
static int msr_kvm_system_time __ro_after_init;
static int msr_kvm_wall_clock __ro_after_init;
static u64 kvm_sched_clock_offset __ro_after_init;
+static unsigned int kvm_tsc_khz_cpuid __ro_after_init;
static int __init parse_no_kvmclock(char *arg)
{
@@ -207,7 +208,7 @@ static unsigned long kvm_get_tsc_khz(void)
lapic_timer_period = apic_khz * 1000 / HZ;
#endif
- return kvm_para_tsc_khz() ? : pvclock_tsc_khz(this_cpu_pvti());
+ return kvm_tsc_khz_cpuid ? : pvclock_tsc_khz(this_cpu_pvti());
}
static unsigned long kvm_get_cpu_khz(void)
@@ -387,9 +388,39 @@ void __init kvmclock_init(void)
enum tsc_properties tsc_properties = TSC_FREQUENCY_KNOWN;
bool stable = false;
- if (!kvm_para_available() || !kvmclock)
+ if (!kvm_para_available())
return;
+ /*
+ * If the TSC counts at a constant frequency across P/T states, counts
+ * in deep C-states, and the TSC hasn't been marked unstable, treat the
+ * TSC reliable, as guaranteed by KVM. Note, the TSC unstable check
+ * exists purely to honor the TSC being marked unstable via command
+ * line, any runtime detection of an unstable will happen after this.
+ */
+ if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
+ boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
+ !check_tsc_unstable())
+ tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;
+
+ kvm_tsc_khz_cpuid = kvm_para_tsc_khz();
+
+ /*
+ * If provided, use the TSC (and APIC bus) frequency provided in KVM's
+ * PV CPUID leaf even if kvmclock itself is disabled via command line.
+ * The PV CPUID information isn't dependent on kvmclock in any way, and
+ * in fact using the precise information is *more* important when the
+ * user has explicitly disabled kvmclock to force the kernel to use the
+ * TSC as its clocksource.
+ */
+ if (!kvmclock) {
+ if (kvm_tsc_khz_cpuid)
+ tsc_register_calibration_routines(kvm_get_tsc_khz,
+ kvm_get_cpu_khz,
+ tsc_properties);
+ return;
+ }
+
if (kvm_para_has_feature(KVM_FEATURE_CLOCKSOURCE2)) {
msr_kvm_system_time = MSR_KVM_SYSTEM_TIME_NEW;
msr_kvm_wall_clock = MSR_KVM_WALL_CLOCK_NEW;
@@ -424,21 +455,14 @@ void __init kvmclock_init(void)
}
/*
- * If the TSC counts at a constant frequency across P/T states, counts
- * in deep C-states, and the TSC hasn't been marked unstable, prefer
- * the TSC over kvmclock for sched_clock and drop kvmclock's rating so
- * that TSC is chosen as the clocksource. Note, the TSC unstable check
- * exists purely to honor the TSC being marked unstable via command
- * line, any runtime detection of an unstable will happen after this.
+ * If the TSC is reliable (see above), prefer the TSC over kvmclock for
+ * sched_clock and drop kvmclock's rating so that TSC is chosen as the
+ * clocksource.
*/
- if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
- boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
- !check_tsc_unstable()) {
+ if (tsc_properties & TSC_RELIABLE)
kvm_clock.rating = 299;
- tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;
- } else {
+ else
kvm_sched_clock_init(stable);
- }
tsc_register_calibration_routines(kvm_get_tsc_khz, kvm_get_cpu_khz,
tsc_properties);
^ permalink raw reply related
* Re: [PATCH v6 06/43] KVM: x86/mmu: Bug the VM if gmem attributes are queried to determine max mapping level
From: Sean Christopherson @ 2026-05-20 20:39 UTC (permalink / raw)
To: Ackerley Tng
Cc: Fuad Tabba, aik, andrew.jones, binbin.wu, brauner, chao.p.peng,
david, ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CAEvNRgGHHkvfJ-mn9rfDvS+_1ht08YatFWo-Swt+5wFSPnC9Nw@mail.gmail.com>
On Wed, May 20, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> >> > } else {
> >> > + /*
> >> > + * Memory attributes cannot be obtained from guest_memfd while
> >> > + * the MMU lock is held.
> >> > + */
> >> > + if (KVM_BUG_ON(static_call_query(__kvm_get_memory_attributes) ==
> >> > + kvm_gmem_get_memory_attributes, kvm)) {
> >> > + return 0;
> >> > + }
> >> > +
> >>
> >> This directly takes the address of kvm_gmem_get_memory_attributes,
> >> which is only compiled if CONFIG_KVM_GUEST_MEMFD=y. This breaks
> >> ARCH=i386.
> >
> > And this bleeds guest_memfd implementation details into places they don't belong.
> > The right way to deal with this is to use lockdep_assert_not_held() in whatever
> > code mustn't run with mmu_lock held. E.g.
> >
> > diff --git virt/kvm/guest_memfd.c virt/kvm/guest_memfd.c
> > index c9f155c2dc5c..3bea9c1137ef 100644
> > --- virt/kvm/guest_memfd.c
> > +++ virt/kvm/guest_memfd.c
> > @@ -547,6 +547,9 @@ unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> > struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
> > struct inode *inode;
> >
> > + /* Comment goes here. */
> > + lockdep_assert_not_held(&kvm->mmu_lock);
> > +
> > /*
> > * If this gfn has no associated memslot, there's no chance of the gfn
> > * being backed by private memory, since guest_memfd must be used for
> >
> > But I'm confused, because kvm_gmem_get_memory_attributes() doesn't actually take
> > filemap_invalidate_lock(), so what exactly is the problem?
> >
>
> Ahh I can drop this patch now. kvm_gmem_get_memory_attributes() used to
> take the filemap_invalidate_lock(), but after Liam pointed out that
> the attributes maple tree should be using MT_FLAGS_USE_RCU, I stopped
> taking filemap_invalidate_lock() and forgot to undo this.
>
> I'll wait a bit for more reviews and then put out another revision
> without this patch.
If this is the only issue with v6, don't send a new version, I'll just drop it
when applying.
^ permalink raw reply
* Re: [PATCH v6 12/43] KVM: guest_memfd: Call arch invalidate hooks on conversion
From: Ackerley Tng @ 2026-05-20 20:35 UTC (permalink / raw)
To: Fuad Tabba
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Sean Christopherson, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <CA+EHjTwOfJ=nCRoX3m5uDt=CcF_zrqGsZVBBmo4MscmPqrBxOA@mail.gmail.com>
Fuad Tabba <tabba@google.com> writes:
> On Thu, 7 May 2026 at 21:22, Ackerley Tng via B4 Relay
> <devnull+ackerleytng.google.com@kernel.org> wrote:
>>
>> From: Ackerley Tng <ackerleytng@google.com>
>>
>> When memory in guest_memfd is converted from private to shared, the
>> platform-specific state associated with the guest-private pages must be
>> invalidated or cleaned up.
>>
>> Iterate over the folios in the affected range and call the
>> kvm_arch_gmem_invalidate() hook for each PFN range. This allows
>> architectures to perform necessary teardown, such as updating hardware
>> metadata or encryption states, before the pages are transitioned to the
>> shared state.
>>
>> Invoke this helper after indicating to KVM's mmu code that an invalidation
>> is in progress to stop in-flight page faults from succeeding.
>>
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>
> Minor nit below, but lgtm.
>
> Reviewed-by: Fuad Tabba <tabba@google.com>
>
> Cheers,
> /fuad
>
>> ---
>> virt/kvm/guest_memfd.c | 41 +++++++++++++++++++++++++++++++++++++++++
>> 1 file changed, 41 insertions(+)
>>
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index 9d82642a025e9..baf4b88dead1f 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -603,6 +603,42 @@ static bool kvm_gmem_is_safe_for_conversion(struct inode *inode, pgoff_t start,
>> return safe;
>> }
>>
>> +#ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
>> +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end)
>> +{
>> + struct folio_batch fbatch;
>> + pgoff_t next = start;
>> + int i;
>> +
>> + folio_batch_init(&fbatch);
>> + while (filemap_get_folios(inode->i_mapping, &next, end - 1, &fbatch)) {
>> + for (i = 0; i < folio_batch_count(&fbatch); ++i) {
>> + struct folio *folio = fbatch.folios[i];
>> + pgoff_t start_index, end_index;
>> + kvm_pfn_t start_pfn, end_pfn;
>> +
>> + start_index = max(start, folio->index);
>> + end_index = min(end, folio_next_index(folio));
>> + /*
>> + * end_index is either in folio or points to
>> + * the first page of the next folio. Hence,
>> + * all pages in range [start_index, end_index)
>> + * are contiguous.
>> + */
>> + start_pfn = folio_file_pfn(folio, start_index);
>> + end_pfn = start_pfn + end_index - start_index;
>> +
>> + kvm_arch_gmem_invalidate(start_pfn, end_pfn);
>> + }
>> +
>> + folio_batch_release(&fbatch);
>> + cond_resched();
>> + }
>> +}
>> +#else
>> +static void kvm_gmem_invalidate(struct inode *inode, pgoff_t start, pgoff_t end) {}
>> +#endif
>> +
>> static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
>> size_t nr_pages, uint64_t attrs,
>> pgoff_t *err_index)
>> @@ -643,7 +679,12 @@ static int __kvm_gmem_set_attributes(struct inode *inode, pgoff_t start,
>> */
>>
>> kvm_gmem_invalidate_begin(inode, start, end);
>> +
>> + if (!to_private)
>> + kvm_gmem_invalidate(inode, start, end);
>> +
>> mas_store_prealloc(&mas, xa_mk_value(attrs));
>> +
>
> Why the unrelated extra space?
>
Hmm this space provides vertical space between invalidate_{begin,end}
and the STUFF it wraps, like
invalidate_begin
STUFF
invalidate_end
More STUFF is going to go here in future patch series, such as splitting
private pages in TDX.
>> kvm_gmem_invalidate_end(inode, start, end);
>> out:
>> filemap_invalidate_unlock(mapping);
>>
>> --
>> 2.54.0.563.g4f69b47b94-goog
>>
>>
^ permalink raw reply
* Re: [PATCH v6 06/43] KVM: x86/mmu: Bug the VM if gmem attributes are queried to determine max mapping level
From: Ackerley Tng @ 2026-05-20 20:25 UTC (permalink / raw)
To: Sean Christopherson, Fuad Tabba
Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
ira.weiny, jmattson, jthoughton, michael.roth, oupton,
pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
H. Peter Anvin, Steven Rostedt, Masami Hiramatsu,
Mathieu Desnoyers, Jonathan Corbet, Shuah Khan, Shuah Khan,
Vishal Annapurve, Andrew Morton, Chris Li, Kairui Song,
Kemeng Shi, Nhat Pham, Baoquan He, Barry Song, Axel Rasmussen,
Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng, Shakeel Butt,
Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka, kvm,
linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
linux-mm, linux-coco
In-Reply-To: <ag3DawWCcrpCkD0e@google.com>
Sean Christopherson <seanjc@google.com> writes:
>
> [...snip...]
>
>> > @@ -3357,6 +3357,15 @@ int kvm_mmu_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault,
>> > max_level = fault->max_level;
>> > is_private = fault->is_private;
>> > } else {
>> > + /*
>> > + * Memory attributes cannot be obtained from guest_memfd while
>> > + * the MMU lock is held.
>> > + */
>> > + if (KVM_BUG_ON(static_call_query(__kvm_get_memory_attributes) ==
>> > + kvm_gmem_get_memory_attributes, kvm)) {
>> > + return 0;
>> > + }
>> > +
>>
>> This directly takes the address of kvm_gmem_get_memory_attributes,
>> which is only compiled if CONFIG_KVM_GUEST_MEMFD=y. This breaks
>> ARCH=i386.
>
> And this bleeds guest_memfd implementation details into places they don't belong.
> The right way to deal with this is to use lockdep_assert_not_held() in whatever
> code mustn't run with mmu_lock held. E.g.
>
> diff --git virt/kvm/guest_memfd.c virt/kvm/guest_memfd.c
> index c9f155c2dc5c..3bea9c1137ef 100644
> --- virt/kvm/guest_memfd.c
> +++ virt/kvm/guest_memfd.c
> @@ -547,6 +547,9 @@ unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
> struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
> struct inode *inode;
>
> + /* Comment goes here. */
> + lockdep_assert_not_held(&kvm->mmu_lock);
> +
> /*
> * If this gfn has no associated memslot, there's no chance of the gfn
> * being backed by private memory, since guest_memfd must be used for
>
> But I'm confused, because kvm_gmem_get_memory_attributes() doesn't actually take
> filemap_invalidate_lock(), so what exactly is the problem?
>
Ahh I can drop this patch now. kvm_gmem_get_memory_attributes() used to
take the filemap_invalidate_lock(), but after Liam pointed out that
the attributes maple tree should be using MT_FLAGS_USE_RCU, I stopped
taking filemap_invalidate_lock() and forgot to undo this.
I'll wait a bit for more reviews and then put out another revision
without this patch.
>> > max_level = PG_LEVEL_NUM;
>> > is_private = kvm_mem_is_private(kvm, gfn);
>> > }
>> >
>> > --
>> > 2.54.0.563.g4f69b47b94-goog
>> >
>> >
^ permalink raw reply
* Re: [PATCH v3 06/41] x86/acrn: Mark TSC frequency as known when using ACRN for calibration
From: David Woodhouse @ 2026-05-20 20:01 UTC (permalink / raw)
To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
Thomas Gleixner, John Stultz
Cc: Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-7-seanjc@google.com>
[-- Attachment #1: Type: text/plain, Size: 1288 bytes --]
On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Mark the TSC frequency as known when using ACRN's PV CPUID information.
> Per commit 81a71f51b89e ("x86/acrn: Set up timekeeping") and common sense,
> the TSC freq is explicitly provided by the hypervisor.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
> arch/x86/kernel/cpu/acrn.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/arch/x86/kernel/cpu/acrn.c b/arch/x86/kernel/cpu/acrn.c
> index c1506cb87d8c..2da3de4d470e 100644
> --- a/arch/x86/kernel/cpu/acrn.c
> +++ b/arch/x86/kernel/cpu/acrn.c
> @@ -29,6 +29,7 @@ static void __init acrn_init_platform(void)
> /* Install system interrupt handler for ACRN hypervisor callback */
> sysvec_install(HYPERVISOR_CALLBACK_VECTOR, sysvec_acrn_hv_callback);
>
> + setup_force_cpu_cap(X86_FEATURE_TSC_KNOWN_FREQ);
> tsc_register_calibration_routines(acrn_get_tsc_khz,
> acrn_get_tsc_khz);
I'd feel slightly happier doing that from within acrn_get_tsc_khz()
itself.... which I note is 'static inline'. I'm vaguely surprised that
even builds (although it does). Probably nicer to move it explicitly
out of line in acrn.c though.
Most of that should be a comment on patch 2 of the series, I guess?
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply
* Re: [PATCH v3 05/41] x86/tdx: Override PV calibration routines with CPUID-based calibration
From: David Woodhouse @ 2026-05-20 19:52 UTC (permalink / raw)
To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
Thomas Gleixner, John Stultz
Cc: Rick Edgecombe, Vitaly Kuznetsov,
Broadcom internal kernel review list, Boris Ostrovsky,
Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-6-seanjc@google.com>
[-- Attachment #1: Type: text/plain, Size: 2003 bytes --]
On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> When running as a TDX guest, explicitly override the TSC frequency
> calibration routine with CPUID-based calibration instead of potentially
> relying on a hypervisor-controlled PV routine. For TDX guests, CPUID.0x15
> is always emulated by the TDX-Module, i.e. the information from CPUID is
> more trustworthy than the information provided by the hypervisor.
>
> To maintain backwards compatibility with TDX guest kernels that use native
> calibration, and because it's the least awful option, retain
> native_calibrate_tsc()'s stuffing of the local APIC bus period using the
> core crystal frequency. While it's entirely possible for the hypervisor
> to emulate the APIC timer at a different frequency than the core crystal
> frequency, the commonly accepted interpretation of Intel's SDM is that APIC
> timer runs at the core crystal frequency when that latter is enumerated via
> CPUID:
>
> The APIC timer frequency will be the processor’s bus clock or core
> crystal clock frequency (when TSC/core crystal clock ratio is enumerated
> in CPUID leaf 0x15).
>
> If the hypervisor is malicious and deliberately runs the APIC timer at the
> wrong frequency, nothing would stop the hypervisor from modifying the
> frequency at any time, i.e. attempting to manually calibrate the frequency
> out of paranoia would be futile.
>
> Deliberately leave the CPU frequency calibration routine as is, since the
> TDX-Module doesn't provide any guarantees with respect to CPUID.0x16.
>
> Opportunistically add a comment explaining that CoCo TSC initialization
> needs to come after hypervisor specific initialization.
>
> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
I don't much like stuffing the lapic_timer_period... but I'll give you
'least awful option'. For now.
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]
^ permalink raw reply
* Re: [PATCH v10 00/25] Runtime TDX module update support
From: Dave Hansen @ 2026-05-20 19:42 UTC (permalink / raw)
To: Chao Gao, kvm, linux-coco, x86, linux-kernel, linux-rt-devel,
linux-doc
Cc: binbin.wu, dave.hansen, djbw, ira.weiny, kai.huang, kas,
nik.borisov, paulmck, pbonzini, reinette.chatre, rick.p.edgecombe,
sagis, seanjc, tony.lindgren, vannapurve, vishal.l.verma,
yilun.xu, xiaoyao.li, yan.y.zhao, Thomas Gleixner, Ingo Molnar,
Borislav Petkov, H. Peter Anvin, Sebastian Andrzej Siewior,
Clark Williams, Steven Rostedt, Jonathan Corbet, Shuah Khan
In-Reply-To: <20260520133909.409394-1-chao.gao@intel.com>
So a little cat|sort|uniq says:
11 Reviewed-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
17 Reviewed-by: Kiryl Shutsemau (Meta) <kas@kernel.org>
We're on v10 here. There are 6 patches in here with no review tags:
M: Kiryl Shutsemau <kas@kernel.org>
R: Dave Hansen <dave.hansen@linux.intel.com>
R: Rick Edgecombe <rick.p.edgecombe@intel.com>
Are all of those 6 new patches? Or do the reviewers still have some work
to do?
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox