Linux Confidential Computing Development
 help / color / mirror / Atom feed
* Re: [PATCH v3 25/41] x86/kvmclock: Hook clocksource.suspend/resume when kvmclock isn't sched_clock
From: David Woodhouse @ 2026-05-20 23:01 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-26-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 409 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Save/restore kvmclock across suspend/resume via clocksource hooks when
> kvmclock isn't being used for sched_clock.  This will allow using kvmclock
> as a clocksource (or for wallclock!) without also using it for sched_clock.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 15/41] x86/xen/time: Nullify x86_platform's sched_clock save/restore hooks
From: David Woodhouse @ 2026-05-20 22:54 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <93f8fad19e2c617ca17f544d11483c517024bfc4.camel@infradead.org>

[-- Attachment #1: Type: text/plain, Size: 1508 bytes --]

On Wed, 2026-05-20 at 23:11 +0100, David Woodhouse wrote:
> On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> > Nullify the x86_platform sched_clock save/restore hooks when setting up
> > Xen's PV clock to make it somewhat obvious the hooks aren't used when
> > running as a Xen guest (Xen uses a paravirtualized suspend/resume flow).
> > 
> > Signed-off-by: Sean Christopherson <seanjc@google.com>
> 
> Hm. Xen HVM guests *do* use a standard ACPI S4 hibernation flow.
> 
> Does that need testing?

Ah, this looks as sane as we can expect.

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

> 
> > ---
> >  arch/x86/xen/time.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> > 
> > diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
> > index 3d3165eef821..21d366d01985 100644
> > --- a/arch/x86/xen/time.c
> > +++ b/arch/x86/xen/time.c
> > @@ -568,6 +568,12 @@ static void __init xen_init_time_common(void)
> >  	xen_sched_clock_offset = xen_clocksource_read();
> >  	static_call_update(pv_steal_clock, xen_steal_clock);
> >  	paravirt_set_sched_clock(xen_sched_clock);
> > +	/*
> > +	 * Xen has paravirtualized suspend/resume and so doesn't use the common
> > +	 * x86 sched_clock save/restore hooks.
> > +	 */
> > +	x86_platform.save_sched_clock_state = NULL;
> > +	x86_platform.restore_sched_clock_state = NULL;
> >  
> >  	tsc_register_calibration_routines(xen_tsc_khz, NULL);
> >  	x86_platform.get_wallclock = xen_get_wallclock;
> 


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 24/41] timekeeping: Resume clocksources before reading persistent clock
From: David Woodhouse @ 2026-05-20 22:52 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-25-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 919 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> When resuming timekeeping after suspend, restore clocksources prior to
> reading the persistent clock.  Paravirt clocks, e.g. kvmclock, tie the
> validity of a PV persistent clock to a clocksource, i.e. reading the PV
> persistent clock will return garbage if the underlying PV clocksource
> hasn't been enabled.  The flaw has gone unnoticed because kvmclock is a
> mess and uses its own suspend/resume hooks instead of the clocksource
> suspend/resume hooks, which happens to work by sheer dumb luck (the
> kvmclock resume hook runs before timekeeping_resume()).
> 
> Note, there is no evidence that any clocksource supported by the kernel
> depends on a persistent clock.
> 
> Reviewed-by: Thomas Gleixner <tglx@linutronix.de>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 23/41] x86/kvmclock: Refactor handling of PVCLOCK_TSC_STABLE_BIT during kvmclock_init()
From: David Woodhouse @ 2026-05-20 22:46 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-24-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 660 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Clean up the setting of PVCLOCK_TSC_STABLE_BIT during kvmclock init to
> make it somewhat obvious that pvclock_read_flags() must be called *after*
> pvclock_set_flags().
> 
> Note, in theory, a different PV clock could have set PVCLOCK_TSC_STABLE_BIT
> in the supported flags, i.e. reading flags only if
> KVM_FEATURE_CLOCKSOURCE_STABLE_BIT is set could very, very theoretically
> result in a change in behavior.  In practice, the kernel only supports a
> single PV clock.

> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 22/41] x86/pvclock: WARN if pvclock's valid_flags are overwritten
From: David Woodhouse @ 2026-05-20 22:44 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-23-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 374 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> WARN if the common PV clock valid_flags are overwritten; all PV clocks
> expect that they are the one and only PV clock, i.e. don't guard against
> another PV clock having modified the flags.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 21/41] x86/pvclock: Mark setup helpers and related various as __init/__ro_after_init
From: Woodhouse, David @ 2026-05-20 22:43 UTC (permalink / raw)
  To: tglx@kernel.org, longli@microsoft.com, luto@kernel.org,
	alexey.makhalov@broadcom.com, jstultz@google.com,
	dave.hansen@linux.intel.com, ajay.kaher@broadcom.com,
	jan.kiszka@siemens.com, haiyangz@microsoft.com, kas@kernel.org,
	seanjc@google.com, pbonzini@redhat.com, kys@microsoft.com,
	decui@microsoft.com, daniel.lezcano@kernel.org,
	wei.liu@kernel.org, peterz@infradead.org, jgross@suse.com
  Cc: boris.ostrovsky@oracle.com, linux-coco@lists.linux.dev,
	kvm@vger.kernel.org, mhklinux@outlook.com,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	bcm-kernel-feedback-list@broadcom.com, tglx@linutronix.de,
	nikunj@amd.com, xen-devel@lists.xenproject.org,
	linux-hyperv@vger.kernel.org, vkuznets@redhat.com,
	rick.p.edgecombe@intel.com, virtualization@lists.linux.dev,
	sboyd@kernel.org, x86@kernel.org
In-Reply-To: <20260515191942.1892718-22-seanjc@google.com>


[-- Attachment #1.1: Type: text/plain, Size: 340 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Now that Xen PV clock and kvmclock explicitly do setup only during init,
> tag the common PV clock flags/vsyscall variables and their mutators with
> __init.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #1.2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5964 bytes --]

[-- Attachment #2.1: Type: text/plain, Size: 215 bytes --]




Amazon Development Centre (London) Ltd. Registered in England and Wales with registration number 04543232 with its registered office at 1 Principal Place, Worship Street, London EC2A 2FA, United Kingdom.



[-- Attachment #2.2: Type: text/html, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH v3 20/41] x86/xen/time: Mark xen_setup_vsyscall_time_info() as __init
From: David Woodhouse @ 2026-05-20 22:40 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-21-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 344 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Annotate xen_setup_vsyscall_time_info() as being used only during kernel
> initialization; it's called only by xen_time_init(), which is already
> tagged __init.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 19/41] x86/kvmclock: Move kvm_sched_clock_init() down in kvmclock.c
From: David Woodhouse @ 2026-05-20 22:39 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-20-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 763 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Move kvm_sched_clock_init() "down" so that it can reference the global
> kvm_clock structure without needing a forward declaration.
> 
> Opportunistically mark the helper as "__init" instead of "inline" to make
> its usage more obvious; modern compilers don't need a hint to inline a
> single-use function, and an extra CALL+RET pair during boot is a complete
> non-issue.  And, if the compiler ignores the hint and does NOT inline the
> function, the resulting code may not get discarded after boot due lack of
> an __init annotation.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 18/41] x86/paravirt: Pass sched_clock save/restore helpers during registration
From: Woodhouse, David @ 2026-05-20 22:35 UTC (permalink / raw)
  To: tglx@kernel.org, longli@microsoft.com, luto@kernel.org,
	alexey.makhalov@broadcom.com, jstultz@google.com,
	dave.hansen@linux.intel.com, ajay.kaher@broadcom.com,
	jan.kiszka@siemens.com, haiyangz@microsoft.com, kas@kernel.org,
	seanjc@google.com, pbonzini@redhat.com, kys@microsoft.com,
	decui@microsoft.com, daniel.lezcano@kernel.org,
	wei.liu@kernel.org, peterz@infradead.org, jgross@suse.com
  Cc: boris.ostrovsky@oracle.com, linux-coco@lists.linux.dev,
	kvm@vger.kernel.org, mhklinux@outlook.com,
	thomas.lendacky@amd.com, linux-kernel@vger.kernel.org,
	bcm-kernel-feedback-list@broadcom.com, tglx@linutronix.de,
	nikunj@amd.com, xen-devel@lists.xenproject.org,
	linux-hyperv@vger.kernel.org, vkuznets@redhat.com,
	rick.p.edgecombe@intel.com, virtualization@lists.linux.dev,
	sboyd@kernel.org, x86@kernel.org
In-Reply-To: <20260515191942.1892718-19-seanjc@google.com>


[-- Attachment #1.1: Type: text/plain, Size: 1541 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Pass in a PV clock's save/restore helpers when configuring sched_clock
> instead of relying on each PV clock to manually set the save/restore hooks.
> In addition to bringing sanity to the code, this will allow gracefully
> "rejecting" a PV sched_clock, e.g. when running as a CoCo guest that has
> access to a "secure" TSC.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Nice!

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

> --- a/arch/x86/xen/time.c
> +++ b/arch/x86/xen/time.c
> @@ -567,13 +567,12 @@ static void __init xen_init_time_common(void)
>  {
>  	xen_sched_clock_offset = xen_clocksource_read();
>  	static_call_update(pv_steal_clock, xen_steal_clock);
> -	paravirt_set_sched_clock(xen_sched_clock);
> +
>  	/*
>  	 * Xen has paravirtualized suspend/resume and so doesn't use the common
>  	 * x86 sched_clock save/restore hooks.
>  	 */
> -	x86_platform.save_sched_clock_state = NULL;
> -	x86_platform.restore_sched_clock_state = NULL;
> +	paravirt_set_sched_clock(xen_sched_clock, NULL, NULL);
>  
>  	tsc_register_calibration_routines(xen_tsc_khz, NULL);
>  	x86_platform.get_wallclock = xen_get_wallclock;


Same deal here as with kvmclock, FWIW. If the TSC is viable, then it's
the better choice. 

Especially now there's a significant chance that a "Xen guest" is
actually running under KVM. Xen never screwed its pvclock up quite as
much as KVM did :)

[-- Attachment #1.2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5964 bytes --]

[-- Attachment #2.1: Type: text/plain, Size: 215 bytes --]




Amazon Development Centre (London) Ltd. Registered in England and Wales with registration number 04543232 with its registered office at 1 Principal Place, Worship Street, London EC2A 2FA, United Kingdom.



[-- Attachment #2.2: Type: text/html, Size: 228 bytes --]

^ permalink raw reply

* Re: [PATCH v3 17/41] x86/tsc: WARN if TSC sched_clock save/restore used with PV sched_clock
From: David Woodhouse @ 2026-05-20 22:27 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-18-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 396 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Now that all PV clocksources override the sched_clock save/restore hooks
> when overriding sched_clock, WARN if the "default" TSC hooks are invoked
> when using a PV sched_clock, e.g. to guard against regressions.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v2 15/15] KVM: x86: Move the bulk of register specific code from x86.c to regs.c
From: Huang, Kai @ 2026-05-20 22:22 UTC (permalink / raw)
  To: seanjc@google.com
  Cc: dwmw2@infradead.org, Edgecombe, Rick P, x86@kernel.org,
	binbin.wu@linux.intel.com, kas@kernel.org,
	dave.hansen@linux.intel.com, vkuznets@redhat.com, paul@xen.org,
	yosry@kernel.org, pbonzini@redhat.com, kvm@vger.kernel.org,
	linux-coco@lists.linux.dev, linux-kernel@vger.kernel.org
In-Reply-To: <ag35RgwowvR9Q4hA@google.com>

On Wed, 2026-05-20 at 11:11 -0700, Sean Christopherson wrote:
> On Wed, May 20, 2026, Kai Huang wrote:
> > On Tue, 2026-05-19 at 18:25 -0700, Sean Christopherson wrote:
> > > On Wed, May 20, 2026, Kai Huang wrote:
> > > > I am not sure whether there's a mandatory requirement that "struct kvm_arch" and
> > > > "struct kvm_vcpu_arch" must be fully embedded, and it would be kinda painful to
> > > > covert to a pointer (e.g., there's kvm_x86_ops::vm_size), but perhaps that is
> > > > also an option to consider?
> > > 
> > > The idea I had in the past, and where I was going with things before s390's love
> > > for arm64 came along, was to add a kvm_arch.h in arch/<arch>/kvm, and have virt/kvm
> > > include _that_ instead of kvm_host.h.  
> > > 
> > 
> > Not sure whether there's other code doing so? :-)
> > 
> > > That way we don't need to make any fundamental
> > > changes to structures, but we can still significantly cut down on what's exposed
> > > via kvm_host.h.  
> > > 
> > 
> > Yeah.
> > 
> > I saw below from you in [1]:
> > 
> >   --
> >   We've explore several alternatives to the #ifdef __KVM__ approach, and
> >   they all sucked, hard.  What I really wanted (and still want) to do, is to
> >   bury the bulk of kvm_host.h (and other KVM headers) in virt/kvm, but every
> >   attempt to do that ended in flames.  Even with the __KVM__ guards in place,
> >   each architecture's kvm_host.h is too intertwined with the common kvm_host.h,
> >   and trying to extract small-ish pieces just doesn't work (each patch
> >   inevitably snowballed into a gigantic beast).
> > 
> >   The other idea we considered (which I thought of, and feel dirty for even
> >   proposing it internally), is to move all headers under virt/kvm, add
> >   virt/kvm/include to the global header path, and then have KVM x86 omit
> >   virt/kvm/include when configured to hide KVM internals.  I hate this idea
> >   because it sets a bad precedent, and requires a lot of file movement
> >   without providing any benefit to other architectures.  E.g. I hope that
> >   guarding KVM internals with #ifdef __KVM__ will allow us to slowly clean
> >   things up so that some day KVM only exposes a handful of APIs to the rest
> >   of the kernel (probably a pipe dream).
> >   --
> > 
> > I haven't looked into details of your #ifdef __KVM__ approach yet, but seems you
> > don't quite like moving KVM internal staff to virt/kvm/include/ ?
> 
> Not for arch code, which is the trickiest bit to handle.
> 
> > But if we want to hide KVM internal structures, I don't see any other options
> > except virt/kvm/include/ is the place to go?
> 
> arch/$(ARCH)/kvm/kvm_arch.h is the obvious approach.  Code in virt/kvm can reach
> arch/$(ARCH)/kvm, we just need to add it to the include path.  That's why I was
> working on unifying the include definitions.

Yeah, for asm/kvm_host.h.

But if I am still following you we still need a place for linux/kvm_host.h, for
which I thought virt/kvm/include/ would be the place at first glance.

> 
> > Btw, have you considered reverting the inclusion of "strut kvm" and "struct
> > kvm_arch" (and the vcpu structure), i.e., to make "struct kvm_arch" include
> > "struct kvm"?  I don't have any clue of whether it is feasible or how much
> > effort it needs, though -- it's just something came to mind when replying.
> > 
> > [1]: https://lore.kernel.org/all/20230916003118.2540661-1-seanjc@google.com/
> > 
> > > At some point I'll try to take another look; it's really the
> > > s390+arm64 combo that's problematic :-/
> > 
> > If you want, I can take a look.  I think I'll have bandwidth in near feature.
> > 
> > Given you have tried multiple times so I am not sure what I can achieve, though.
> 
> Consolidating includes and creating arch/$(ARCH)/kvm/kvm_arch.h should be more
> doable, 
> 

Right.

> what has failed spectacularly over and over is effectively trying to hide
> most of asm/kvm_host.h, which sounds *really* stupid when I phrase it that way :-)

I see :-)

> 
> > Anyway, seems "allow loading a new (or old) KVM module without needing to
> > rebuild and reboot the entire kernel" is a good reason to do this.
> 
> Note, we've scrapped upstreaming multi-KVM, and I don't see it ever getting accepted
> upstream, as live update is far superior, e.g. isn't limited to just KVM, doesn't
> have the same orchestration or testing problems, etc.
> 
> And FWIW, today, you _can_ reload a new (or old) KVM module without rebuilding
> the kernel or rebooting the host, it's just that you can't modify certain structures.

Right.

^ permalink raw reply

* Re: [PATCH v3 16/41] x86/vmware: Nullify save/restore hooks when using VMware's sched_clock
From: David Woodhouse @ 2026-05-20 22:15 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-17-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 629 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Nullify the sched_clock save/restore hooks when using VMware's version of
> sched_clock.  This will allow extending paravirt_set_sched_clock() to set
> the save/restore hooks, without having to simultaneously change the
> behavior of VMware guests.
> 
> Note, it's not at all obvious that it's safe/correct for VMware guests to
> do nothing on suspend/resume, but that's a pre-existing problem.  Leave it
> for a VMware expert to sort out.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 15/41] x86/xen/time: Nullify x86_platform's sched_clock save/restore hooks
From: David Woodhouse @ 2026-05-20 22:11 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-16-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 1283 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Nullify the x86_platform sched_clock save/restore hooks when setting up
> Xen's PV clock to make it somewhat obvious the hooks aren't used when
> running as a Xen guest (Xen uses a paravirtualized suspend/resume flow).
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Hm. Xen HVM guests *do* use a standard ACPI S4 hibernation flow.

Does that need testing?

> ---
>  arch/x86/xen/time.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
> index 3d3165eef821..21d366d01985 100644
> --- a/arch/x86/xen/time.c
> +++ b/arch/x86/xen/time.c
> @@ -568,6 +568,12 @@ static void __init xen_init_time_common(void)
>  	xen_sched_clock_offset = xen_clocksource_read();
>  	static_call_update(pv_steal_clock, xen_steal_clock);
>  	paravirt_set_sched_clock(xen_sched_clock);
> +	/*
> +	 * Xen has paravirtualized suspend/resume and so doesn't use the common
> +	 * x86 sched_clock save/restore hooks.
> +	 */
> +	x86_platform.save_sched_clock_state = NULL;
> +	x86_platform.restore_sched_clock_state = NULL;
>  
>  	tsc_register_calibration_routines(xen_tsc_khz, NULL);
>  	x86_platform.get_wallclock = xen_get_wallclock;


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v6 20/43] KVM: guest_memfd: Enable INIT_SHARED on guest_memfd for x86 Coco VMs
From: Ackerley Tng @ 2026-05-20 22:04 UTC (permalink / raw)
  To: Ackerley Tng via B4 Relay, aik, andrew.jones, binbin.wu, brauner,
	chao.p.peng, david, ira.weiny, jmattson, jthoughton, michael.roth,
	oupton, pankaj.gupta, qperret, rick.p.edgecombe, rientjes,
	shivankg, steven.price, tabba, willy, wyihan, yan.y.zhao,
	forkloop, pratyush, suzuki.poulose, aneesh.kumar, liam,
	Paolo Bonzini, Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka
  Cc: kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <20260507-gmem-inplace-conversion-v6-20-91ab5a8b19a4@google.com>

Ackerley Tng via B4 Relay <devnull+ackerleytng.google.com@kernel.org>
writes:

> From: Sean Christopherson <seanjc@google.com>
>
> Now that guest_memfd supports tracking private vs. shared within gmem
> itself, allow userspace to specify INIT_SHARED on a guest_memfd instance
> for x86 Confidential Computing (CoCo) VMs, so long as per-VM attributes
> are disabled, i.e. when it's actually possible for a guest_memfd instance
> to contain shared memory.
>
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>  arch/x86/kvm/x86.c | 11 +++++------
>  1 file changed, 5 insertions(+), 6 deletions(-)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1560de1e95be0..6609957ecfea3 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -14172,14 +14172,13 @@ bool kvm_arch_no_poll(struct kvm_vcpu *vcpu)
>  }
>
>  #ifdef CONFIG_KVM_GUEST_MEMFD
> -/*
> - * KVM doesn't yet support initializing guest_memfd memory as shared for VMs
> - * with private memory (the private vs. shared tracking needs to be moved into
> - * guest_memfd).
> - */
>  bool kvm_arch_supports_gmem_init_shared(struct kvm *kvm)
>  {
> -	return !kvm_arch_has_private_mem(kvm);
> +	/*
> +	 * INIT_SHARED isn't supported if the memory attributes are per-VM,
> +	 * in which case guest_memfd can _only_ be used for private memory.
> +	 */
> +	return !vm_memory_attributes || !kvm_arch_has_private_mem(kvm);

Adding a note here from PUCK on 2026-05-20:

Michael pointed out that it's odd that when vm_memory_attributes is
available, guest_memfd still can only be used for private memory.

It is a little odd, but we don't want to investigate the complexities of
supporting it, and Sean says this is working as intended, in line with
deprecating vm_memory_attributes=true.

>  }
>
>  #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_PREPARE
>
> --
> 2.54.0.563.g4f69b47b94-goog

^ permalink raw reply

* Re: [PATCH v3 14/41] x86/kvmclock: Move sched_clock save/restore helpers up in kvmclock.c
From: David Woodhouse @ 2026-05-20 21:59 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-15-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 345 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Move kvmclock's sched_clock save/restore helper "up" so that they can
> (eventually) be referenced by kvm_sched_clock_init().
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 13/41] x86/paravirt: Move handling of unstable PV clocks into paravirt_set_sched_clock()
From: David Woodhouse @ 2026-05-20 21:57 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-14-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 853 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Move the handling of unstable PV clocks, of which kvmclock is the only
> example, into paravirt_set_sched_clock().  This will allow modifying
> paravirt_set_sched_clock() to keep using the TSC for sched_clock in
> certain scenarios without unintentionally marking the TSC-based clock as
> unstable.
> 
> No functional change intended.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

> -void paravirt_set_sched_clock(u64 (*func)(void));
> +void __paravirt_set_sched_clock(u64 (*func)(void), bool stable);
> +
> +static inline void paravirt_set_sched_clock(u64 (*func)(void))
> +{
> +	__paravirt_set_sched_clock(func, true);
> +}

One of the few things I actually like about C++ is default arguments...


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 12/41] x86/paravirt: Remove unnecessary PARAVIRT=n stub for paravirt_set_sched_clock()
From: David Woodhouse @ 2026-05-20 21:53 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-13-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 526 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Remove the unnecessary paravirt_set_sched_clock() stub for PARAVIRT=n, as
> all callers are gated by PARAVIRT=y.  Eliminating the stub will avoid a
> pile of pointless churn as the "real" implementation evolves.
> 
> No functional change intended.
> 
> Fixes: 39965afb1151 ("x86/paravirt: Move paravirt_sched_clock() related code into tsc.c")
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 11/41] x86/kvm: Don't disable kvmclock on BSP in syscore_suspend()
From: David Woodhouse @ 2026-05-20 21:51 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-12-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 1214 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Don't disable kvmclock on the BSP during syscore_suspend(), as the BSP's
> clock is NOT restored during syscore_resume(), but is instead restored
> earlier via the sched_clock restore callback.  If suspend is aborted, e.g.
> due to a late wakeup, the BSP will run without its clock enabled, which
> "works" only because KVM-the-hypervisor is kind enough to not clobber the
> shared memory when the clock is disabled.  But over time, the BSP's view
> of time will drift from APs.

The work that Dongli and I have been doing elsewhere makes me want to
phrase that last sentence far more judgmentally, as the kvmclock
absolutely should *not* be randomly drifting over time without good
reason, and what was written to it earlier really *ought* to still be
valid.

But OK :)

> Plumb in an "action" to KVM-as-a-guest and kvmclock code in preparation
> for additional cleanups to kvmclock's suspend/resume logic.
> 
> Fixes: c02027b5742b ("x86/kvm: Disable kvmclock on all CPUs on shutdown")
> Cc: stable@vger.kernel.org
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 10/41] x86/kvmclock: Setup kvmclock for secondary CPUs iff CONFIG_SMP=y
From: David Woodhouse @ 2026-05-20 21:45 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-11-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 514 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Gate kvmclock's secondary CPU code on CONFIG_SMP, not CONFIG_X86_LOCAL_APIC.
> Originally, kvmclock piggybacked PV APIC ops to setup secondary CPUs.
> When that wart was fixed by commit df156f90a0f9 ("x86: Introduce
> x86_cpuinit.early_percpu_clock_init hook"), the dependency on a local APIC
> got carried forward unnecessarily.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v6 05/43] KVM: guest_memfd: Wire up kvm_get_memory_attributes() to per-gmem attributes
From: Ackerley Tng @ 2026-05-20 21:44 UTC (permalink / raw)
  To: Fuad Tabba
  Cc: aik, andrew.jones, binbin.wu, brauner, chao.p.peng, david,
	ira.weiny, jmattson, jthoughton, michael.roth, oupton,
	pankaj.gupta, qperret, rick.p.edgecombe, rientjes, shivankg,
	steven.price, willy, wyihan, yan.y.zhao, forkloop, pratyush,
	suzuki.poulose, aneesh.kumar, liam, Paolo Bonzini,
	Sean Christopherson, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, x86, H. Peter Anvin, Steven Rostedt,
	Masami Hiramatsu, Mathieu Desnoyers, Jonathan Corbet, Shuah Khan,
	Shuah Khan, Vishal Annapurve, Andrew Morton, Chris Li,
	Kairui Song, Kemeng Shi, Nhat Pham, Baoquan He, Barry Song,
	Axel Rasmussen, Yuanchu Xie, Wei Xu, Youngjun Park, Qi Zheng,
	Shakeel Butt, Kiryl Shutsemau, Jason Gunthorpe, Vlastimil Babka,
	kvm, linux-kernel, linux-trace-kernel, linux-doc, linux-kselftest,
	linux-mm, linux-coco
In-Reply-To: <CA+EHjTw-cUM=FrJevtSDtR7K6MwUfGfOx21LMFDn7DAy5bFzYw@mail.gmail.com>

Fuad Tabba <tabba@google.com> writes:

>
> [...snip...]
>
>> +unsigned long kvm_gmem_get_memory_attributes(struct kvm *kvm, gfn_t gfn)
>> +{
>> +       struct kvm_memory_slot *slot = gfn_to_memslot(kvm, gfn);
>> +       struct inode *inode;
>> +
>> +       /*
>> +        * If this gfn has no associated memslot, there's no chance of the gfn
>> +        * being backed by private memory, since guest_memfd must be used for
>> +        * private memory, and guest_memfd must be associated with some memslot.
>> +        */
>> +       if (!slot)
>> +               return 0;
>> +
>> +       CLASS(gmem_get_file, file)(slot);
>> +       if (!file)
>> +               return 0;
>> +
>> +       inode = file_inode(file);
>> +
>> +       /*
>> +        * Rely on the maple tree's internal RCU lock to ensure a
>> +        * stable result. This result can become stale as soon as the
>> +        * lock is dropped, so the caller _must_ still protect
>> +        * consumption of private vs. shared by checking
>> +        * mmu_invalidate_retry_gfn() under mmu_lock to serialize
>> +        * against ongoing attribute updates.
>> +        */
>> +       return kvm_gmem_get_attributes(inode, kvm_gmem_get_index(slot, gfn));
>> +}
>
> Doesn't this imply that all consumers of kvm_mem_is_private() should
> validate the result using mmu_lock and the invalidation sequence?

Let me know how I can improve the comment.

I think the "consumption" of private vs shared here actually means
something like "don't commit a page being faulted into page tables based
on the result of kvm_gmem_get_memory_attributes() without checking
kvm->mmu_invalidate_in_progress.", since a racing conversion may
complete before you commit.

kvm_mem_is_private() is used from these places:

1. Fault handling in KVM, like page_fault_can_be_fast(),
   kvm_mmu_faultin_pfn(), kvm_mmu_page_fault(): this already handles the
   entire mmu_lock and invalidation dance. No fault will be committed if
   a racing conversion happened after kvm_mem_is_private() but before
   the commit.

2. kvm_mmu_max_mapping_level() from recovering huge pages after
   disabling dirty logging: Other than that it can't be used with
   guest_memfd now since dirty logging can't be used with guest_memfd
   and guest_memfd memslots are not updatable, this holds mmu_lock
   throughout until the huge page recovery is done. invalidate_begin
   also involves zapping the pages in the range, so if the order of
   events is

   | Thread A                     | Thread B          |
   |------------------------------|-------------------|
   | invalidate_begin + zap       |                   |
   | update attributes maple_tree | recover huge page |
   | invalidate_end               |                   |

   Then recovering will never see the zapped pages, nothing to
   recover, no kvm_mem_is_private() lookup.

3. kvm_arch_vcpu_pre_fault_memory()

   This eventually calls kvm_tdp_mmu_page_fault(), which checks
   is_page_fault_stale(), so it does check before committing.

Were there any other calls I missed?

> sev_handle_rmp_fault() calls kvm_mem_is_private() without holding
> mmu_lock and without any retry mechanism. Is that a problem?
>

Sean already replied on your actual question separately :)

> Cheers,
> /fuad
>
>
>>
>> [...snip...]
>>

^ permalink raw reply

* Re: [PATCH v3 02/41] x86/tsc: Add helper to register CPU and TSC freq calibration routines
From: David Woodhouse @ 2026-05-20 21:44 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <ag4or9-9c6VZxqya@google.com>

[-- Attachment #1: Type: text/plain, Size: 3077 bytes --]

On Wed, 2026-05-20 at 14:33 -0700, Sean Christopherson wrote:
> On Wed, May 20, 2026, David Woodhouse wrote:
> > On Wed, 2026-05-20 at 13:44 -0700, Sean Christopherson wrote:
> > > 
> > > +       /*
> > > +        * If the TSC counts at a constant frequency across P/T states, counts
> > > +        * in deep C-states, and the TSC hasn't been marked unstable, treat the
> > > +        * TSC reliable, as guaranteed by KVM.  Note, the TSC unstable check
> > > +        * exists purely to honor the TSC being marked unstable via command
> > > +        * line, any runtime detection of an unstable will happen after this.
> > > +        */
> > > +       if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> > > +           boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> > > +           !check_tsc_unstable())
> >     { 
> > > +               tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;
> > 
> >     kvmclock = 0; /* Why use it if the TSC works? The kvmclock exists
> >                      *purely* to work around a TSC which *doesn't*
> >                      have those properties checked above. */
> 
> kvmclock still provides SYSTEM_TIME and WALL_CLOCK :-/

Um, SYSTEM_TIME *is* the kvmclock of which we speak, isn't it?

WALL_CLOCK is something else, and I guess we should still consume that,
yes.

> 
> > > +
> > > +       kvm_tsc_khz_cpuid = kvm_para_tsc_khz();
> > > +
> > > +       /*
> > > +        * If provided, use the TSC (and APIC bus) frequency provided in KVM's
> > > +        * PV CPUID leaf even if kvmclock itself is disabled via command line.
> > > +        * The PV CPUID information isn't dependent on kvmclock in any way, and
> > > +        * in fact using the precise information is *more* important when the
> > > +        * user has explicitly disabled kvmclock to force the kernel to use the
> > > +        * TSC as its clocksource.
> > > +        */
> > > +       if (!kvmclock) {
> > > +               if (kvm_tsc_khz_cpuid)
> > > +                       tsc_register_calibration_routines(kvm_get_tsc_khz,
> > > +                                                         kvm_get_cpu_khz,
> > > +                                                         tsc_properties);
> > > +               return;
> > > +       }
> > > +
> > 
> > 
> > Regardless of the above, why not just register these here
> > unconditionally, and remove the later call that does the same?
> 
> Because if kvmclock=n, it's only safe to call kvm_get_tsc_khz() if kvm_tsc_khz_cpuid
> is non-zero, other wise the "else" path will hit a NULL pointer deref when trying
> to get the frequency from the PV clock struct:
> 
> 	return kvm_tsc_khz_cpuid ? : pvclock_tsc_khz(this_cpu_pvti());

Ah, right. Thanks. Ick though :)

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 02/41] x86/tsc: Add helper to register CPU and TSC freq calibration routines
From: Sean Christopherson @ 2026-05-20 21:33 UTC (permalink / raw)
  To: David Woodhouse
  Cc: Kiryl Shutsemau, Paolo Bonzini, K. Y. Srinivasan, Haiyang Zhang,
	Wei Liu, Dexuan Cui, Long Li, Ajay Kaher, Alexey Makhalov,
	Jan Kiszka, Dave Hansen, Andy Lutomirski, Peter Zijlstra,
	Juergen Gross, Daniel Lezcano, Thomas Gleixner, John Stultz,
	Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <621e10bdc9e297c6c600b561d8fa25c3b62968bc.camel@infradead.org>

On Wed, May 20, 2026, David Woodhouse wrote:
> On Wed, 2026-05-20 at 13:44 -0700, Sean Christopherson wrote:
> > 
> > +       /*
> > +        * If the TSC counts at a constant frequency across P/T states, counts
> > +        * in deep C-states, and the TSC hasn't been marked unstable, treat the
> > +        * TSC reliable, as guaranteed by KVM.  Note, the TSC unstable check
> > +        * exists purely to honor the TSC being marked unstable via command
> > +        * line, any runtime detection of an unstable will happen after this.
> > +        */
> > +       if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC) &&
> > +           boot_cpu_has(X86_FEATURE_NONSTOP_TSC) &&
> > +           !check_tsc_unstable())
>     { 
> > +               tsc_properties = TSC_FREQ_KNOWN_AND_RELIABLE;
> 
>     kvmclock = 0; /* Why use it if the TSC works? The kvmclock exists
>                      *purely* to work around a TSC which *doesn't*
>                      have those properties checked above. */

kvmclock still provides SYSTEM_TIME and WALL_CLOCK :-/

>     }
> 
> I was going to say PVCLOCK_TSC_STABLE_BIT, and maybe we should check
> that *too* for paranoia?

No? PVCLOCK_TSC_STABLE is property of kvmclock more than it's a property of the
TSC itself.  And for the no-kvmclock case, we most definitely don't want to setup
kvmclock just to query that flag.

> But hopefully the checks you have above are equivalent?

They aren't as paranoid, but if the host enumerates CONSTANT+NONSTOP TSC despite
KVM-the-host not being able to advertise PVCLOCK_TSC_STABLE_BIT, then the VMM done
messed up.

> > +
> > +       kvm_tsc_khz_cpuid = kvm_para_tsc_khz();
> > +
> > +       /*
> > +        * If provided, use the TSC (and APIC bus) frequency provided in KVM's
> > +        * PV CPUID leaf even if kvmclock itself is disabled via command line.
> > +        * The PV CPUID information isn't dependent on kvmclock in any way, and
> > +        * in fact using the precise information is *more* important when the
> > +        * user has explicitly disabled kvmclock to force the kernel to use the
> > +        * TSC as its clocksource.
> > +        */
> > +       if (!kvmclock) {
> > +               if (kvm_tsc_khz_cpuid)
> > +                       tsc_register_calibration_routines(kvm_get_tsc_khz,
> > +                                                         kvm_get_cpu_khz,
> > +                                                         tsc_properties);
> > +               return;
> > +       }
> > +
> 
> 
> Regardless of the above, why not just register these here
> unconditionally, and remove the later call that does the same?

Because if kvmclock=n, it's only safe to call kvm_get_tsc_khz() if kvm_tsc_khz_cpuid
is non-zero, other wise the "else" path will hit a NULL pointer deref when trying
to get the frequency from the PV clock struct:

	return kvm_tsc_khz_cpuid ? : pvclock_tsc_khz(this_cpu_pvti());

^ permalink raw reply

* Re: [PATCH v3 09/41] clocksource: hyper-v: Don't save/restore TSC offset when using HV sched_clock
From: David Woodhouse @ 2026-05-20 21:30 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-10-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 957 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Now that Hyper-V overrides the sched_clock save/restore hooks if and only
> sched_clock itself is set to the Hyper-V reference counter, drop the
> invocation of the "old" save/restore callbacks.  When the registration of
> the PV sched_clock was done separately from overriding the save/restore
> hooks, it was possible for Hyper-V to clobber the TSC save/restore
> callbacks without actually switching to the Hyper-V refcounter.
> 
> Enabling a PV sched_clock is a one-way street, i.e. the kernel will never
> revert to using TSC for sched_clock, and so there is no need to invoke the
> TSC save/restore hooks (and if there was, it belongs in common PV code).
> 
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 08/41] clocksource: hyper-v: Drop wrappers to sched_clock save/restore helpers
From: David Woodhouse @ 2026-05-20 21:29 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-9-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 477 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Now that all of the Hyper-V reference counter sched_clock code is located
> in a single file, drop the superfluous wrappers for the save/restore flows.
> 
> No functional change intended.
> 
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v3 07/41] clocksource: hyper-v: Register sched_clock save/restore iff it's necessary
From: David Woodhouse @ 2026-05-20 21:27 UTC (permalink / raw)
  To: Sean Christopherson, Kiryl Shutsemau, Paolo Bonzini,
	K. Y. Srinivasan, Haiyang Zhang, Wei Liu, Dexuan Cui, Long Li,
	Ajay Kaher, Alexey Makhalov, Jan Kiszka, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Juergen Gross, Daniel Lezcano,
	Thomas Gleixner, John Stultz
  Cc: Rick Edgecombe, Vitaly Kuznetsov,
	Broadcom internal kernel review list, Boris Ostrovsky,
	Stephen Boyd, x86, linux-coco, kvm, linux-hyperv, virtualization,
	linux-kernel, xen-devel, Michael Kelley, Tom Lendacky,
	Nikunj A Dadhania, Thomas Gleixner
In-Reply-To: <20260515191942.1892718-8-seanjc@google.com>

[-- Attachment #1: Type: text/plain, Size: 1346 bytes --]

On Fri, 2026-05-15 at 12:19 -0700, Sean Christopherson wrote:
> Register the Hyper-V reference counter (refcounter) callbacks for saving
> and restoring its PV sched_clock, if and only if the refcounter is
> actually being used for sched_clock.  Currently, Hyper-V overrides the
> save/restore hooks if the reference TSC available, whereas the Hyper-V
> refcounter code only overrides sched_clock if the reference TSC is
> available *and* it's not invariant.  The flaw is effectively papered over
> by invoking the "old" save/restore callbacks as part of save/restore, but
> that's unnecessary and fragile.
> 
> To avoid introducing more complexity, and to allow for additional cleanups
> of the PV sched_clock code, move the save/restore hooks and logic into
> hyperv_timer.c and simply wire up the hooks when overriding sched_clock
> itself.
> 
> Note, while the Hyper-V refcounter code is intended to be architecture
> neutral, CONFIG_PARAVIRT is firmly x86-only, i.e. adding a small amount of
> x86 specific code (which will be reduced in future cleanups) doesn't
> meaningfully pollute generic code.
> 
> Reviewed-by: Michael Kelley <mhklinux@outlook.com>
> Tested-by: Michael Kelley <mhklinux@outlook.com>
> Signed-off-by: Sean Christopherson <seanjc@google.com>

Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply


This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox