Kernel KVM virtualization development

Kernel KVM virtualization development
 help / color / mirror / Atom feed

* Re: [PATCH v13 16/22] KVM: selftests: Load per-vCPU guest stack in TDX boot parameters
From: Binbin Wu @ 2026-06-09  5:37 UTC (permalink / raw)
  To: Lisa Wang
  Cc: Andrew Jones, Ackerley Tng, Chao Gao, Chenyi Qiang, Dave Hansen,
	Erdem Aktas, Ira Weiny, Isaku Yamahata, Kiryl Shutsemau,
	linux-kselftest, Paolo Bonzini, Pratik R. Sampat, Reinette Chatre,
	Rick Edgecombe, Roger Wang, Ryan Afranji, Sagi Shahar,
	Sean Christopherson, Shuah Khan, Oliver Upton,
	Jeremiah McReynolds, kvm, linux-coco, linux-kernel, x86
In-Reply-To: <20260521-tdx-selftests-v13-v13-16-6983ae4c3a4d@google.com>



On 5/22/2026 7:16 AM, Lisa Wang wrote:
> From: Sagi Shahar <sagis@google.com>
> 
> Allocate a guest stack for each vCPU and record the GVA in the TDX boot
> parameters region to allow proper vCPU initialization.
> 
> Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Sagi Shahar <sagis@google.com>
> Signed-off-by: Lisa Wang <wyihan@google.com>

Reviewed-by: Binbin Wu <binbin.wu@linux.intel.com>

One nit below.

[...]>  
> +void tdx_vcpu_load_boot_parameters(struct kvm_vm *vm, struct kvm_vcpu *vcpu)
> +{
> +	struct td_boot_parameters *params =
> +		addr_gpa2hva(vm, TD_BOOT_PARAMETERS_GPA);
> +	struct td_per_vcpu_parameters *vcpu_params =
> +		&params->per_vcpu[vcpu->id];
> +
> +	vcpu_params->esp_gva = kvm_allocate_vcpu_stack(vm);
> +}
> +
> +

An extra empty line.

>  static struct kvm_tdx_capabilities *tdx_read_capabilities(struct kvm_vm *vm)
>  {
>  	struct kvm_tdx_capabilities *tdx_cap = NULL;
> 


^ permalink raw reply

* [PATCH v3] KVM: PPC: Book3S HV: Validate arch_compat against host compatibility mode
From: Amit Machhiwal @ 2026-06-09  5:33 UTC (permalink / raw)
  To: linuxppc-dev, Madhavan Srinivasan
  Cc: Amit Machhiwal, Vaibhav Jain, Harsh Prateek Bora, Ritesh Harjani,
	Anushree Mathur, Gautam Menghani, Mukesh Kumar Chaurasiya,
	Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP),
	Thomas Huth, kvm, stable, linux-kernel

On IBM POWER systems, newer processor generations can operate in
compatibility modes corresponding to earlier generations. This becomes
relevant for nested virtualization, where nested KVM guests may need to
run with a specific processor compatibility level.

Currently, when running a nested KVM guest (L2) inside a Power11 pSeries
logical partition (L1) booted in Power10 compatibility mode, the guest
fails to boot while setting 'arch_compat'. This happens because the CPU
class is derived from the hardware PVR (via mfspr()), which reflects the
physical processor generation (Power11), rather than the effective
compatibility mode (Power10).

As a result, userspace may request a Power11 arch_compat for the L2
guest. However, the L1 partition, running in Power10 compatibility, has
only negotiated support up to Power10 with the Power Hypervisor (L0).
When H_GUEST_SET_STATE is invoked with a Power11 Logical PVR, the
hypervisor rejects the request, leading to a late guest boot failure:

  KVM-NESTEDv2: couldn't set guest wide elements
  [..KVM reg dump..]

This situation should be detected earlier and rejected by KVM. Without
proper validation, if userspace ignores the error, the guest may continue
to boot in Power11 raw mode on a Power10 compatibility host, which should
not be allowed.

Introduce a validation mechanism that detects unsupported arch_compat
values early in the guest initialization path. When an unsupported
arch_compat is requested (e.g., Power11 on a Power10 compatibility mode
host), kvmppc_set_arch_compat() uses cpu_has_feature(CPU_FTR_P11_PVR) to
detect the mismatch and sets arch_compat to PVR_ARCH_INVALID. This
triggers kvmppc_sanity_check() to mark the vCPU as invalid by setting
vcpu->arch.sane to false. On the next vCPU run, kvmppc_vcpu_run_hv()
checks this flag and returns -EINVAL, preventing the guest from running
with an invalid processor compatibility configuration.

With this, when a Power11 arch_compat is requested on a Power10
compatibility mode host, the guest fails early during boot with:

  error: kvm run failed Invalid argument

This provides a much clearer failure mode compared to the previous
behavior where the guest could boot in Power11 raw mode (if userspace
ignored the error) or fail late during H_GUEST_SET_STATE.

Suggested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Reviewed-by: Vaibhav Jain <vaibhav@linux.ibm.com>
Cc: stable@vger.kernel.org # v6.13+
Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>
---
Changes in v3:
* Fixed null pointer dereference in kvmppc_sanity_check(): added check for
  vcpu->arch.vcore before accessing arch_compat, as vcore is NULL for Book3S
  PR and BookE guests (only Book3S HV uses vcore) [Reported by Sashiko AI]
* Added Reviewed-by tag from Vaibhav

Changes in v2:
* Fixed issue where v1 allowed guest to boot in Power11 raw mode when
  userspace ignored the error, by adding validation in kvmppc_sanity_check()
  to ensure early failure during vCPU run [Found the issue after posting v1,
  also reported by Gautam.]
* Introduced PVR_ARCH_INVALID constant for marking invalid arch_compat
* Dropped all Reviewed-by and Tested-by tags due to code changes; requesting
  fresh reviews
* v1: https://lore.kernel.org/all/20260603141539.47620-1-amachhiw@linux.ibm.com/

Changes in v1:
* Moved this patch out of the v3 series [1] as discussed here [2]
* Addressed below review comments from Ritesh:
  - Based the PVR validation on cpu features
  - Fixed hcall name typo
  - Stable backport

[1] https://lore.kernel.org/all/20260522152744.55251-1-amachhiw@linux.ibm.com/
[2] https://lore.kernel.org/all/20260522152744.55251-2-amachhiw@linux.ibm.com/
---
 arch/powerpc/include/asm/reg.h |  1 +
 arch/powerpc/kvm/book3s_hv.c   | 15 ++++++++++++++-
 arch/powerpc/kvm/powerpc.c     |  4 ++++
 3 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/reg.h b/arch/powerpc/include/asm/reg.h
index 3449dd2b577d..7472b9522f71 100644
--- a/arch/powerpc/include/asm/reg.h
+++ b/arch/powerpc/include/asm/reg.h
@@ -1356,6 +1356,7 @@
 #define PVR_ARCH_300	0x0f000005
 #define PVR_ARCH_31	0x0f000006
 #define PVR_ARCH_31_P11	0x0f000007
+#define PVR_ARCH_INVALID	0xffffffff

 /* Macros for setting and retrieving special purpose registers */
 #ifndef __ASSEMBLER__
diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
index 61dbeea317f3..f9380ef65750 100644
--- a/arch/powerpc/kvm/book3s_hv.c
+++ b/arch/powerpc/kvm/book3s_hv.c
@@ -446,7 +446,19 @@ static int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, u32 arch_compat)
 			guest_pcr_bit = PCR_ARCH_300;
 			break;
 		case PVR_ARCH_31:
+			guest_pcr_bit = PCR_ARCH_31;
+			break;
 		case PVR_ARCH_31_P11:
+			/*
+			 * Need to check this for ISA 3.1, as Power10 and
+			 * Power11 share the same PCR. For any subsequent ISA
+			 * versions, this will be taken care of by the guest vs
+			 * host PCR comparison below.
+			 */
+			if (!cpu_has_feature(CPU_FTR_P11_PVR)) {
+				arch_compat = PVR_ARCH_INVALID;
+				goto out;
+			}
 			guest_pcr_bit = PCR_ARCH_31;
 			break;
 		default:
@@ -469,6 +481,7 @@ static int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, u32 arch_compat)
 			return -EINVAL;
 	}

+out:
 	spin_lock(&vc->lock);
 	vc->arch_compat = arch_compat;
 	kvmhv_nestedv2_mark_dirty(vcpu, KVMPPC_GSID_LOGICAL_PVR);
@@ -479,7 +492,7 @@ static int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, u32 arch_compat)
 	vc->pcr = (host_pcr_bit - guest_pcr_bit) | PCR_MASK;
 	spin_unlock(&vc->lock);

-	return 0;
+	return kvmppc_sanity_check(vcpu);
 }

 static void kvmppc_dump_regs(struct kvm_vcpu *vcpu)
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 00302399fc37..98de68379b18 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -258,6 +258,10 @@ int kvmppc_sanity_check(struct kvm_vcpu *vcpu)
 	if (!vcpu->arch.pvr)
 		goto out;

+	if (vcpu->arch.vcore &&
+		vcpu->arch.vcore->arch_compat == PVR_ARCH_INVALID)
+		goto out;
+
 	/* PAPR only works with book3s_64 */
 	if ((vcpu->arch.cpu_type != KVM_CPU_3S_64) && vcpu->arch.papr_enabled)
 		goto out;

base-commit: 2d3090a8aeb596a26935db0955d46c9a5db5c6ce
-- 
2.50.1 (Apple Git-155)

^ permalink raw reply related

* Re: [PATCH v2] KVM: PPC: Book3S HV: Validate arch_compat against host compatibility mode
From: Amit Machhiwal @ 2026-06-09  5:11 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: Amit Machhiwal, kvm
In-Reply-To: <20260608202747.E521A1F00893@smtp.kernel.org>

On 2026/06/08 08:27 PM, sashiko-bot@kernel.org wrote:
> Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
> - [High] Unconditional dereference of `vcpu->arch.vcore` in `kvmppc_sanity_check()` causes a deterministic kernel panic for Book3S PR and BookE KVM guests.
> - [Medium] Bypassing PCR capability checks corrupts shared VCORE state on older architectures and inconsistently blocks sibling VCPUs.
> --
> 
> commit 8aae14bd53c26cd9bec6b3a9d75dc94d32e6e296
> Author: Amit Machhiwal <amachhiw@linux.ibm.com>
> 
> KVM: PPC: Book3S HV: Validate arch_compat against host compatibility mode
> 
> This commit detects unsupported arch_compat values early in the guest
> initialization path. It sets arch_compat to PVR_ARCH_INVALID and marks
> the vCPU as invalid, preventing the guest from running with an invalid
> processor compatibility configuration.
> 
> Cc: stable@vger.kernel.org # v6.13+
> 
> > diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> > index 61dbeea317f3b..f9380ef657500 100644
> > --- a/arch/powerpc/kvm/book3s_hv.c
> > +++ b/arch/powerpc/kvm/book3s_hv.c
> [ ... ]
> > @@ -446,7 +446,19 @@ static int kvmppc_set_arch_compat(struct kvm_vcpu *vcpu, u32 arch_compat)
> >  			guest_pcr_bit = PCR_ARCH_300;
> >  			break;
> >  		case PVR_ARCH_31:
> > +			guest_pcr_bit = PCR_ARCH_31;
> > +			break;
> >  		case PVR_ARCH_31_P11:
> > +			/*
> > +			 * Need to check this for ISA 3.1, as Power10 and
> > +			 * Power11 share the same PCR. For any subsequent ISA
> > +			 * versions, this will be taken care of by the guest vs
> > +			 * host PCR comparison below.
> > +			 */
> > +			if (!cpu_has_feature(CPU_FTR_P11_PVR)) {
> > +				arch_compat = PVR_ARCH_INVALID;
> > +				goto out;
> 
> [Severity: Medium]
> When jumping to the out label, the code applies the invalid arch_compat and
> a restricted pcr to the shared vcore structure before returning an error.
> Since vcore can be shared among multiple vCPUs on older Power architectures,
> does this state affect sibling vCPUs?

Setting arch_compat to PVR_ARCH_INVALID is intentional. The value is
stored in the shared vcore so that all vCPUs associated with the VM
observe the invalid compatibility state. Subsequent calls to
kvmppc_sanity_check() fail and prevent any vCPU from entering the guest.

> 
> While kvmppc_sanity_check() marks the current vCPU as insane, sibling vCPUs
> might retain sane == true. Could they then execute the guest using the
> updated pcr and invalid arch_compat state?
> 
> > +			}
> >  			guest_pcr_bit = PCR_ARCH_31;
> >  			break;
> >  		default:
> [ ... ]
> > diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
> > index 00302399fc37b..7e2d489af6423 100644
> > --- a/arch/powerpc/kvm/powerpc.c
> > +++ b/arch/powerpc/kvm/powerpc.c
> > @@ -258,6 +258,9 @@ int kvmppc_sanity_check(struct kvm_vcpu *vcpu)
> >  	if (!vcpu->arch.pvr)
> >  		goto out;
> >  
> > +	if (vcpu->arch.vcore->arch_compat == PVR_ARCH_INVALID)
> > +		goto out;
> > +
> 
> [Severity: High]
> Does this code introduce a null pointer dereference for Book3S PR and BookE
> guests?
> 
> kvmppc_sanity_check() is a generic helper used across Book3S HV, PR, and
> BookE KVM implementations. While vcpu->arch.vcore is populated for Book3S HV,
> it remains null for Book3S PR and BookE guests.

Good catch! I see vcore is only set for KVM-HV and may cause a null
deref on PR and BookE. I'll introduce a NULL check in the next version.

# git grep "arch.vcore.*="
[...]
arch/powerpc/kvm/book3s_hv.c:   vcpu->arch.vcore = vcore;

Thanks,
Amit

> 
> During vCPU creation for PR or BookE, the initialization sequence sets
> vcpu->arch.pvr and calls kvmppc_sanity_check(). Because pvr is non-zero,
> execution will reach this check and dereference the null vcore pointer,
> which can cause a host panic.
> 
> >  	/* PAPR only works with book3s_64 */
> >  	if ((vcpu->arch.cpu_type != KVM_CPU_3S_64) && vcpu->arch.papr_enabled)
> >  		goto out;
> 
> -- 
> Sashiko AI review · https://sashiko.dev/#/patchset/20260608201001.65760-1-amachhiw@linux.ibm.com?part=1

^ permalink raw reply

* Re: [PATCH] Documentation: KVM: Synchronize x86 VM types
From: Nikunj A Dadhania @ 2026-06-09  3:53 UTC (permalink / raw)
  To: Carlos López, kvm, linux-doc, linux-kernel
  Cc: Carlos López, Paolo Bonzini, Jonathan Corbet, Shuah Khan,
	Ashish Kalra, Michael Roth, Brijesh Singh, Isaku Yamahata,
	Binbin Wu
In-Reply-To: <20260603114504.814647-2-clopez@suse.de>

Carlos López <clopez@suse.de> writes:

> KVM has reflected KVM_X86_SNP_VM to userspace since 1dfe571c12cf
> ("KVM: SEV: Add initial SEV-SNP support"), and KVM_X86_TDX_VM since
> 161d34609f9b ("KVM: TDX: Make TDX VM type supported"). Update the
> documentation to reflect this fact.
>
> Fixes: 1dfe571c12cf ("KVM: SEV: Add initial SEV-SNP support")
> Fixes: 161d34609f9b ("KVM: TDX: Make TDX VM type supported")
> Signed-off-by: Carlos López <clopez@suse.de>

Reviewed-by: Nikunj A Dadhania <nikunj@amd.com>

> ---
>  Documentation/virt/kvm/api.rst | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 52bbbb553ce1..3ec574a41f60 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -9363,6 +9363,8 @@ means the VM type with value @n is supported.  Possible values of @n are::
>    #define KVM_X86_SW_PROTECTED_VM	1
>    #define KVM_X86_SEV_VM	2
>    #define KVM_X86_SEV_ES_VM	3
> +  #define KVM_X86_SNP_VM		4
> +  #define KVM_X86_TDX_VM		5
>  
>  Note, KVM_X86_SW_PROTECTED_VM is currently only for development and testing.
>  Do not use KVM_X86_SW_PROTECTED_VM for "real" VMs, and especially not in
> -- 
> 2.51.0

^ permalink raw reply

* Re: [PATCH v2] KVM: PPC: Book3S HV: Validate arch_compat against host compatibility mode
From: Vaibhav Jain @ 2026-06-09  3:49 UTC (permalink / raw)
  To: Amit Machhiwal, linuxppc-dev, Madhavan Srinivasan
  Cc: Amit Machhiwal, Harsh Prateek Bora, Ritesh Harjani,
	Anushree Mathur, Gautam Menghani, Mukesh Kumar Chaurasiya,
	Nicholas Piggin, Michael Ellerman, Christophe Leroy (CS GROUP),
	Thomas Huth, kvm, stable, linux-kernel
In-Reply-To: <20260608201001.65760-1-amachhiw@linux.ibm.com>


Amit Machhiwal <amachhiw@linux.ibm.com> writes:

<snip>
> Introduce a validation mechanism that detects unsupported arch_compat
> values early in the guest initialization path. When an unsupported
> arch_compat is requested (e.g., Power11 on a Power10 compatibility mode
> host), kvmppc_set_arch_compat() uses cpu_has_feature(CPU_FTR_P11_PVR) to
> detect the mismatch and sets arch_compat to PVR_ARCH_INVALID. This
> triggers kvmppc_sanity_check() to mark the vCPU as invalid by setting
> vcpu->arch.sane to false. On the next vCPU run, kvmppc_vcpu_run_hv()
> checks this flag and returns -EINVAL, preventing the guest from running
> with an invalid processor compatibility configuration.
>
> With this, when a Power11 arch_compat is requested on a Power10
> compatibility mode host, the guest fails early during boot with:
>
>   error: kvm run failed Invalid argument
>
<snip>
>
> Suggested-by: Vaibhav Jain <vaibhav@linux.ibm.com>
> Cc: stable@vger.kernel.org # v6.13+
> Signed-off-by: Amit Machhiwal <amachhiw@linux.ibm.com>

Thanks Amit for addressing the issue reported on v1. The v2 patch
changes look aligned to what we discussed offline. Hence,


Reviewed-by: Vaibhav Jain <vaibhav@linux.ibm.com>


-- 
Cheers
~ Vaibhav

^ permalink raw reply

* Re: [PATCH 1/3] KVM: nVMX: unwind PDPTR load if processor triggers a nested VMFail
From: Sean Christopherson @ 2026-06-09  3:31 UTC (permalink / raw)
  To: Paolo Bonzini; +Cc: linux-kernel, kvm, stable
In-Reply-To: <20260604160733.12555-2-pbonzini@redhat.com>

On Thu, Jun 04, 2026, Paolo Bonzini wrote:
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 4690a4d23709..d612a5d071fc 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -4947,6 +4947,7 @@ static inline u64 nested_vmx_get_vmcs01_guest_efer(struct vcpu_vmx *vmx)
>  
>  static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
>  {
> +	enum vm_entry_failure_code ignored;
>  	struct vmcs12 *vmcs12 = get_vmcs12(vcpu);
>  	struct vcpu_vmx *vmx = to_vmx(vcpu);
>  	struct vmx_msr_entry g, h;
> @@ -4984,20 +4985,19 @@ static void nested_vmx_restore_host_state(struct kvm_vcpu *vcpu)
>  	vmx_set_cr4(vcpu, vmcs_readl(CR4_READ_SHADOW));
>  
>  	nested_ept_uninit_mmu_context(vcpu);
> -	vcpu->arch.cr3 = vmcs_readl(GUEST_CR3);
> -	kvm_register_mark_available(vcpu, VCPU_REG_CR3);
>  
>  	/*
> -	 * Use ept_save_pdptrs(vcpu) to load the MMU's cached PDPTRs
> -	 * from vmcs01 (if necessary).  The PDPTRs are not loaded on
> -	 * VMFail, like everything else we just need to ensure our
> -	 * software model is up-to-date.
> +	 * Now that nested EPT has been disabled, load the MMU's CR3 and
> +	 * possibly PDPTRs from vmcs01 (if necessary).  This should not
> +	 * happen for VMFail, but we get here if the check was caught by
> +	 * the processor and therefore the guest CR3 was loaded prematurely.
>  	 */
> +	kvm_mmu_unload(vcpu);
> +	if (nested_vmx_load_cr3(vcpu, vmcs_readl(GUEST_CR3), false, !enable_ept, &ignored))
> +		nested_vmx_abort(vcpu, VMX_ABORT_LOAD_HOST_PDPTE_FAIL);

This isn't quite correct either.  I mean, none of this is architecturally correct,
but this is less correct than the other incorrect code here :-)

To do this "right", KVM should snapshot the PDPTRs and shove them into the MMU,
without touching guest memory.

On a very related topic, I have a patch to stash CR3 in software instead of
abusing vmcs01.GUEST_CR3, as KVM fails to restore vmcs01.GUEST_CR3 to its proper
state if nested_vmx_enter_non_root_mode() bails after clobbering vmcs01.GUEST_CR3,
but before loading guest state.  We could probably do the same thing for PDPTRs?

https://lore.kernel.org/all/20260603223418.1720035-3-seanjc@google.com

>  	if (enable_ept && is_pae_paging(vcpu))
>  		ept_save_pdptrs(vcpu);
>  
> -	kvm_mmu_reset_context(vcpu);
> -
>  	/*
>  	 * This nasty bit of open coding is a compromise between blindly
>  	 * loading L1's MSRs using the exit load lists (incorrect emulation
> -- 
> 2.52.0
> 
> 

^ permalink raw reply

* Re: [PATCH] KVM: selftests: Make guest_code_xsave more friendly
From: Sean Christopherson @ 2026-06-09  3:08 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Paolo Bonzini, Shuah Khan, kvm, linux-kselftest, linux-kernel,
	michael.roth, jackyli, liruxin, darwinguo, jacobhxu
In-Reply-To: <CAEvNRgHtpFa+HAhPVBvB4=ixUuEW-r_Xh49SHEPktUmHgzYcAw@mail.gmail.com>

On Fri, Jun 05, 2026, Ackerley Tng wrote:
> Sean Christopherson <seanjc@google.com> writes:
> > Uh, so as the comment says, the goal is to stash state before _any_ compiled
> > code runs.  Shoving the code into inline asm breaks that.  The compiler *probably*
> > won't shove anything before the first inline assembly, but there are absolutely
> > no guarantees.  E.g. you're subtly relying on a tail-call optimization to avoid
> > any stack operations.  If I force guest_sev_es_code() to be inlined, then the
> > prologue becomes:
> >
> > 0000000000402a10 <guest_code_xsave>:
> >   402a10:       48 83 ec 08                                     sub    $0x8,%rsp
> >   402a14:       b8 07 00 00 00                                  mov    $0x7,%eax
> >   402a19:       31 d2                                           xor    %edx,%edx
> >   402a1b:       0f ae 27                                        xsave  (%rdi)
> >   402a1e:       b9 31 01 01 c0                                  mov    $0xc0010131,%ecx
> >
> > and we're hosed.
> >
> 
> And omg, I thought I was running the tests!!

FWIW, you probably were, I had to force it to be inline via __always_inline.

> >> +		"xsave (%0)"
> >> +		:
> >> +		: "r"(addr)
> >> +		: "eax", "edx", "memory"
> >> +	 );
> >> +}
> >> +
> >>
> >> [...snip...]
> >>
> 
> My bad, this patch is such a goof!
> 
> I still feel that it'd be nice to allow commenting out functions parts
> of selftests while developing other parts, but let's shelve this for
> now.

I mean, it's always "allowed", but that's firmly "Here be dragons!" territory.
There are any number of things that can go sideways when adding and/or removing
code for development and debug purposes, many of which are far more nefarious
and subtle than obscure compiler errors.

^ permalink raw reply

* Re: [RFC PATCH 01/27] KVM: x86: Fix emulated CPUID features being applied to wrong sub-leaf
From: Sean Christopherson @ 2026-06-09  3:00 UTC (permalink / raw)
  To: Binbin Wu
  Cc: pbonzini, kvm, rick.p.edgecombe, xiaoyao.li, chao.gao, kai.huang
In-Reply-To: <cb8d64e6-dfc6-4984-a46a-157e57e55aaf@linux.intel.com>

On Tue, Jun 09, 2026, Binbin Wu wrote:
> On 6/8/2026 10:19 PM, Sean Christopherson wrote:
> > On Mon, Jun 08, 2026, Binbin Wu wrote:
> >>
> >>
> >> On 4/17/2026 3:35 PM, Binbin Wu wrote:
> >>> Guard the use of cpuid_func_emulated() with a check that the CPUID
> >>> sub-leaf index is 0, as cpuid_func_emulated() unconditionally returns
> >>> emulated features for index 0 and does not account for indexed leaves.
> >>>
> >>> Without the guard, when iterating over reverse_cpuid[] entries that
> >>> share the same CPUID function but have a non-zero index, e.g.
> >>> CPUID_7_1_ECX (function=7, index=1), the emulated features for index 0
> >>> are incorrectly OR'd into the wrong capability word.  For example,
> >>> RDPID (CPUID.7.0:ECX[22]) gets erroneously applied to CPUID_7_1_ECX,
> >>> which would allow userspace to set bit 22 of CPUID.7.1:ECX in the vCPU's
> >>> capabilities.
> >>>
> >>> This is currently benign as the affected bits in the non-zero index
> >>> words happen to not correspond to meaningful features, but it could
> >>> cause problems as new features are defined in those positions.
> >>>
> >>
> >> Hi Sean/Paolo,
> >>
> >> While the bug is currently benign, I still think it's worth fixing.
> >>
> >> There could be two options:
> >> - Considering cpuid_func_emulated() probably would not be extended
> >>   in the future, still keep this simple implementation, but add some
> >>   comments to cpuid_func_emulated() to call out only subleaf 0 is emulated
> >>   as suggested by Xiaoyao.
> >>   https://lore.kernel.org/kvm/fa47d520-1de9-4ddc-94ca-6bb53ac30f10@intel.com/
> >> - Add the index input parameter to cpuid_func_emulated().
> > 
> > Definitely handle this in cpuid_func_emulated().
> 
> Thanks, the diff looks good to me.
> I will send a separate patch for this.
> 
> BTW, do you want to provide your SOB?

Nah, you did the hard part, claim it as your own.

^ permalink raw reply

* Re: [PATCH v1] vsock/virtio: rework MSG_ZEROCOPY flag handling
From: Jakub Kicinski @ 2026-06-09  2:08 UTC (permalink / raw)
  To: Arseniy Krasnov
  Cc: Stefan Hajnoczi, Stefano Garzarella, David S. Miller,
	Eric Dumazet, Paolo Abeni, Michael S. Tsirkin, Jason Wang,
	Bobby Eshleman, Xuan Zhuo, Eugenio Pérez, Simon Horman, kvm,
	virtualization, netdev, linux-kernel, oxffffaa, rulkc
In-Reply-To: <20260605115314.552321-1-avkrasnov@rulkc.org>

On Fri,  5 Jun 2026 14:53:14 +0300 Arseniy Krasnov wrote:
> Logically it was based on TCP implementation, so to make further
> support easier, rewrite it in the TCP way.

Does not apply:

$ git pw series apply 1106582
Failed to apply patch:
Applying: vsock/virtio: rework MSG_ZEROCOPY flag handling
error: sha1 information is lacking or useless (net/vmw_vsock/virtio_transport_common.c).
error: could not build fake ancestor
hint: Use 'git am --show-current-patch=diff' to see the failed patch
hint: When you have resolved this problem, run "git am --continue".
hint: If you prefer to skip this patch, run "git am --skip" instead.
hint: To restore the original branch and stop patching, run "git am --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Patch failed at 0001 vsock/virtio: rework MSG_ZEROCOPY flag handling
-- 
pw-bot: cr

^ permalink raw reply

* Re: [RFC PATCH 01/27] KVM: x86: Fix emulated CPUID features being applied to wrong sub-leaf
From: Binbin Wu @ 2026-06-09  1:56 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, kvm, rick.p.edgecombe, xiaoyao.li, chao.gao, kai.huang
In-Reply-To: <aibPZMfxabh6meOk@google.com>



On 6/8/2026 10:19 PM, Sean Christopherson wrote:
> On Mon, Jun 08, 2026, Binbin Wu wrote:
>>
>>
>> On 4/17/2026 3:35 PM, Binbin Wu wrote:
>>> Guard the use of cpuid_func_emulated() with a check that the CPUID
>>> sub-leaf index is 0, as cpuid_func_emulated() unconditionally returns
>>> emulated features for index 0 and does not account for indexed leaves.
>>>
>>> Without the guard, when iterating over reverse_cpuid[] entries that
>>> share the same CPUID function but have a non-zero index, e.g.
>>> CPUID_7_1_ECX (function=7, index=1), the emulated features for index 0
>>> are incorrectly OR'd into the wrong capability word.  For example,
>>> RDPID (CPUID.7.0:ECX[22]) gets erroneously applied to CPUID_7_1_ECX,
>>> which would allow userspace to set bit 22 of CPUID.7.1:ECX in the vCPU's
>>> capabilities.
>>>
>>> This is currently benign as the affected bits in the non-zero index
>>> words happen to not correspond to meaningful features, but it could
>>> cause problems as new features are defined in those positions.
>>>
>>
>> Hi Sean/Paolo,
>>
>> While the bug is currently benign, I still think it's worth fixing.
>>
>> There could be two options:
>> - Considering cpuid_func_emulated() probably would not be extended
>>   in the future, still keep this simple implementation, but add some
>>   comments to cpuid_func_emulated() to call out only subleaf 0 is emulated
>>   as suggested by Xiaoyao.
>>   https://lore.kernel.org/kvm/fa47d520-1de9-4ddc-94ca-6bb53ac30f10@intel.com/
>> - Add the index input parameter to cpuid_func_emulated().
> 
> Definitely handle this in cpuid_func_emulated().

Thanks, the diff looks good to me.
I will send a separate patch for this.

BTW, do you want to provide your SOB?

> 
> diff --git arch/x86/kvm/cpuid.c arch/x86/kvm/cpuid.c
> index db8be9173bd0..737ca3e7bda2 100644
> --- arch/x86/kvm/cpuid.c
> +++ arch/x86/kvm/cpuid.c
> @@ -400,7 +400,7 @@ void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>                 if (!entry)
>                         continue;
>  
> -               cpuid_func_emulated(&emulated, cpuid.function, true);
> +               cpuid_func_emulated(&emulated, cpuid.function, cpuid.index, true);
>  
>                 /*
>                  * A vCPU has a feature if it's supported by KVM and is enabled
> @@ -1369,11 +1369,15 @@ static struct kvm_cpuid_entry2 *do_host_cpuid(struct kvm_cpuid_array *array,
>         return entry;
>  }
>  
> -static int cpuid_func_emulated(struct kvm_cpuid_entry2 *entry, u32 func,
> +static int cpuid_func_emulated(struct kvm_cpuid_entry2 *entry, u32 func, u32 index,
>                                bool include_partially_emulated)
>  {
>         memset(entry, 0, sizeof(*entry));
>  
> +       /* KVM doesn't currently emulate any non-zero indices. */
> +       if (cpuid_function_is_indexed(func) && index)
> +               return 0;
> +
>         entry->function = func;
>         entry->index = 0;
>         entry->flags = 0;
> @@ -1411,7 +1415,7 @@ static int __do_cpuid_func_emulated(struct kvm_cpuid_array *array, u32 func)
>         if (array->nent >= array->maxnent)
>                 return -E2BIG;
>  
> -       array->nent += cpuid_func_emulated(&array->entries[array->nent], func, false);
> +       array->nent += cpuid_func_emulated(&array->entries[array->nent], func, 0, false);
>         return 0;
>  }
>  
> 
> 


^ permalink raw reply

* Re: [PATCH] KVM: guest_memfd: Elaborate on how release() vs. get_pfn() is safe against UAF
From: Sean Christopherson @ 2026-06-09  1:41 UTC (permalink / raw)
  To: Yan Zhao; +Cc: Paolo Bonzini, kvm, linux-kernel, Vishal Annapurve
In-Reply-To: <aiKJt5j6vIqfh7S9@yzhao56-desk.sh.intel.com>

On Fri, Jun 05, 2026, Yan Zhao wrote:
> On Wed, Jun 03, 2026 at 05:10:59PM -0700, Sean Christopherson wrote:
> > > > +	/*
> > > > +	 * Note!  synchronize_srcu() is _not_ needed after nullifying memslot
> > > > +	 * bindings as slot->gmem.file cannot be set back to a non-null value
> > > > +	 * without the memslot first being deleted.  I.e. this relies on the
> > > > +	 * synchronize_srcu_expedited() in kvm_swap_active_memslots() to ensure
> > > > +	 * kvm_gmem_get_pfn() (which runs with kvm->srcu held for read) can't
> > > > +	 * grab a reference to slot->gmem.file even if the struct file object
> > > > +	 * is reallocated.
> > > Do you mean that as kvm_gmem_get_pfn() can't find a stale slot, it can't grab
> > > reference to stale slot->gmem.file, even if slot->gmem.file has been set to a
> > > different value, i.e., after invoking unbind(), bind() ?
> > 
> > I was specifically trying to say this:
> > 
> >  1. slot->gmem.file == old  ==
> >     synchronize_srcu()       |===> window #1 for readers
> >  2. slot->gmem.file == NULL ==
> >     synchronize_srcu()       |===> window #2 for readers
> >  3. slot->gmem.file == new  ==
> > 
> > kvm_gmem_get_pfn() can run in two "windows".  In the window #1, it can see @old
> > or NULL.  In window #2, it can see NULL or @new.  The synchronization ensures
> > it can't grab a reference to @old once window #2 has been opened.
> Thanks for the clarification.
> 
> > > But I'm not sure why to put the kvm_gmem_get_pfn() relying on
> > > synchronize_srcu_expedited() in kvm_swap_active_memslots() in the comment of
> > > release().  Without the guard of kvm_gmem_get_file(), kvm_gmem_get_pfn() may
> > > need to provide some RCU read-critial section too for release() to wait by
> > > synchronize_srcu()?
> > 
> > No, what I'm saying is that without synchronize_srcu() to create distinct
> > window #1 and window #2 above, KVM would need this:
> > 
> > diff --git virt/kvm/guest_memfd.c virt/kvm/guest_memfd.c
> > index a1cb72e66288..e66e431bdb98 100644
> > --- virt/kvm/guest_memfd.c
> > +++ virt/kvm/guest_memfd.c
> > @@ -347,6 +347,8 @@ static int kvm_gmem_release(struct inode *inode, struct file *file)
> >         xa_for_each(&f->bindings, index, slot)
> >                 WRITE_ONCE(slot->gmem.file, NULL);
> >  
> > +       synchronize_srcu();
> > +
> >         /*
> >          * All in-flight operations are gone and new bindings can be created.
> >          * Zap all SPTEs pointed at by this file.  Do not free the backing
> > 
> > Otherwise there is no guarantee in-flight readers can't reach the old
> > slot->gmem.file.
> Hmm, however, if the reader (kvm_gmem_get_pfn()) has grabbed a reference to @old,
> kvm_gmem_release() can't come; if kvm_gmem_release() comes, kvm_gmem_get_pfn()
> should not have been able to successfully invoke get_file_active() on @old.
> After kvm_gmem_release() returns, any subsequent calls to get_file_active() on
> @old should fail and return NULL.
> 
> So, even if the reader can see @old, we're still fine without synchronize_srcu(),
> as get_file_active() and SLAB_TYPESAFE_BY_RCU should have provide the safety.
> 
> Am I missing something? 

In kvm_gmem_get_pfn(), because slots_lock isn't held, this could happen if
kvm_set_memory_region() didn't synchronize_srcu().
 
  CPU0                                    CPU1
  kvm_gmem_get_pfn()                     
    f = X (from slot->gmem.file)

                                          kvm_gmem_release())
                                            slot->gmem.file = NULL

                                          kvm_set_memory_region()
                                            slot deleted

                                          kvm_set_memory_region()
                                            slot created
                                            slot->gmem.file = f (alloc the same object)

  get_file_active()
    file = f
    file_reloaded = f

  <KVM does weird things with an old memslot+file> 

Obviously KVM would be wildly broken for other reasons, but just saying
get_file_active() takes care of everything is misleading because it only handles
reallocation of the object, it doesn't guarantee a reference to the correct file
was obtained.

E.g. AFAICT, users of get_mm_exe_file() don't care if they race with
replace_mm_exe_file(), they only care that they have a reference to a live file.

KVM on the other hand cares that kvm_gmem_get_pfn() gets the exact file that is
associated with its memslot point.

^ permalink raw reply

* Re: [PATCH v5 26/34] KVM: x86: Avoid redundant masterclock updates from multiple vCPUs
From: David Woodhouse @ 2026-06-09  1:34 UTC (permalink / raw)
  To: sashiko-bot; +Cc: dwmw2, kvm, sashiko-reviews
In-Reply-To: <20260608201120.B2A511F00893@smtp.kernel.org>

[-- Attachment #1: Type: text/plain, Size: 2436 bytes --]

On Mon, 2026-06-08 at 20:11 +0000, sashiko-bot@kernel.org wrote:
> [Severity: High]
> O(N) vCPU iteration loops moved inside a raw spinlock

This is an existing pattern in KVM — kvm_end_pvclock_update()
always iterated over all vCPUs under the lock. The per-vCPU
operations are just atomic bit set/clear, so the hold time
is bounded by N * (one atomic op) which is a few microseconds
even for very large VMs.

> [Severity: High]
> Unilateral clearing of KVM_REQ_MCLOCK_INPROGRESS in the bailout
> path creates a race condition

Hm, I don't think so.

All vCPUs might react to KVM_REQ_MASTERCLOCK_UPDATE in parallel and
each set KVM_REQ_MCLOCK_INPROGRESS on the others (which can't be done
under spinlock as it has to kick), then they all try to take the lock
in __kvm_start_pvclock_update().

The *winner* will actually do the work, then clear INPROGRESS on the
rest. The *losers* then each take the lock, realise there's nothing to
do, and also clear INPROGRESS (just in case they were late to the party
and they had *set* it themselves, long after everyone else was done).

Could that bailout clearing of INPROGRESS race with a *new* round of
KVM_REQ_MASTERCLOCK_UPDATE? Not without the requesting vCPU actually
having that REQ set again... in which case it wouldn't take the bailout
at all, would it?

Hm, I think there's a different, pre-existing race here, with a vCPU
that was *going* to bail out, but actually it sees a
KVM_REQ_MASTERCLOCK_UPDATE for a *next* generation, and processes it
even though MCLOCK_INPROGRESS already got *cleared* again for the last
generation, and hasn't been reasserted yet?

That existed in the old code where everyone was a winner: the *last*
winner could be clearing MCLOCK_INPROGRESS, just after the first winner
has set it for the next round.

I *think* it can be resolved by taking the bailout path if
MCLOCK_INPROGRESS is already clear, as well as if MASTERCLOCK_UPDATE is
already clear. But it's pre-existing, makes my head hurt, and is for a
very rare condition of back-to-back updates. So it's going on the 'for
later' list.

> [Severity: High]
> Does moving write_seqcount_end() after the loops risk stalling
> seqcount readers?

The loops are O(N) atomic bit operations. On the guest side,
readers that observe the seqcount mid-write simply retry —
they don't spin indefinitely. The window is microseconds
even for large VMs.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v5 33/34] KVM: x86: Replace pvclock_gtod_data vclock_mode with boolean
From: sashiko-bot @ 2026-06-09  1:23 UTC (permalink / raw)
  To: David Woodhouse; +Cc: kvm
In-Reply-To: <20260608145455.89187-34-dwmw2@infradead.org>

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [High] `kvm_check_tsc_unstable()` incorrectly masks TSC instability when the host uses a native TSC clocksource, breaking backward TSC compensation.
- [Medium] Tracepoints expect a clock mode enum but now receive a collapsed boolean value, changing trace output and breaking observability tooling.
--

commit e7585c139257b84c3968a9648d71f5c5478115f9
Author: David Woodhouse <dwmw@amazon.co.uk>
Subject: KVM: x86: Replace vclock_mode checks with kvm_host_has_tsc_clocksource

This commit replaces remaining users of pvclock_gtod_data with a simple
kvm_host_has_tsc_clocksource boolean. This removes the need for tracking
the exact vclock_mode when determining if the host clocksource is
TSC-based.

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index c18947c5b63f..93a428c37847 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
[ ... ]
> @@ -2726,7 +2727,7 @@ static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation)
>  
>  	trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc,
>  			    atomic_read(&vcpu->kvm->online_vcpus),
> -		            ka->use_master_clock, gtod->clock.vclock_mode);
> +		            ka->use_master_clock, kvm_host_has_tsc_clocksource);

[Severity: Medium]
Will this change break ABI for observability tooling that expects the
tracepoint to log the original vclock_mode integer enum?

Since kvm_host_has_tsc_clocksource is a boolean, a true value will
implicitly cast to 1 (VDSO_CLOCKMODE_TSC). If the actual mode is
VDSO_CLOCKMODE_HVCLOCK (3), the boolean still evaluates to 1, causing the
trace output to erroneously log 'tsc' instead of the expected integer 3.

[ ... ]
> @@ -2850,7 +2851,7 @@ static inline bool kvm_check_tsc_unstable(void)
>  	 * TSC is marked unstable when we're running on Hyper-V,
>  	 * 'TSC page' clocksource is good.
>  	 */
> -	if (pvclock_gtod_data.clock.vclock_mode == VDSO_CLOCKMODE_HVCLOCK)
> +	if (kvm_host_has_tsc_clocksource)
>  		return false;

[Severity: High]
Does this change unintentionally mask TSC instability when the host uses a
native TSC clocksource?

Previously, this check bypassed check_tsc_unstable() only when the clocksource
was VDSO_CLOCKMODE_HVCLOCK. Now, it evaluates to true for both HVCLOCK and
native TSC clocksources.

When using a native TSC clocksource, KVM detects backwards TSCs in
kvm_arch_vcpu_load() and synchronously calls mark_tsc_unstable(). However,
because kvm_host_has_tsc_clocksource remains true until the timekeeper
notifier fires, the subsequent call to kvm_check_tsc_unstable() will
immediately return false.

Will this completely bypass the TSC offset catchup logic, directly exposing
the backwards TSC jump to the guest?

[ ... ]
> @@ -3315,7 +3316,7 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
>  			ka->use_master_clock = false;
>  	}
>  
> -	vclock_mode = pvclock_gtod_data.clock.vclock_mode;
> +	vclock_mode = kvm_host_has_tsc_clocksource;
>  	trace_kvm_update_master_clock(ka->use_master_clock, vclock_mode,
>  					ka->all_vcpus_matched_freq);
>  #endif

[Severity: Medium]
Similar to trace_kvm_track_tsc above, does passing a boolean here break
tools expecting the original vclock_mode enum for
trace_kvm_update_master_clock?

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260608145455.89187-1-dwmw2@infradead.org?part=33

^ permalink raw reply

* Re: [PATCH v5 31/34] KVM: x86: Use ktime_get_snapshot_id() for master clock
From: sashiko-bot @ 2026-06-09  1:03 UTC (permalink / raw)
  To: David Woodhouse; +Cc: kvm
In-Reply-To: <20260608145455.89187-32-dwmw2@infradead.org>

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [Critical] Uninitialized stack memory used on failure path in KVM timekeeping.
- [Low] Unused static functions left in the code, causing `-Wunused-function` warnings.
--

commit 25989314a763f4ae38067d7da21492c86ffa8fa4
Author: David Woodhouse <dwmw@amazon.co.uk>

Replace the KVM-private vgettsc()/do_kvmclock_base()/do_monotonic()/
do_realtime() timekeeping reimplementation with calls to the generic
ktime_get_snapshot_id() interface.

This patch replaces the KVM-private timekeeping functions with generic
ktime_get_snapshot_id() calls. It provides system time and raw cycles 
atomically paired to simplify the timekeeping state.

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 96250264d403..2713aebb96ae 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
[ ... ]
> @@ -3162,14 +3163,32 @@ static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)

[Severity: Low]
The patch description mentions replacing the private timekeeping functions
(vgettsc, do_kvmclock_base, do_monotonic, and do_realtime), but it appears
they were not removed from the source code.

Does leaving these unused static functions trigger -Wunused-function
compiler warnings during the build?

>   * reports the TSC value from which it do so. Returns true if host is
>   * using TSC based clocksource.
>   */
> +static bool kvm_snapshot_has_tsc(struct system_time_snapshot *snap,
> +				u64 *tsc_timestamp)
> +{
> +	if (snap->cs_id == CSID_X86_TSC) {
> +		*tsc_timestamp = snap->cycles;
> +		return true;
> +	}
> +
> +	if (snap->hw_csid == CSID_X86_TSC && snap->hw_cycles) {
> +		*tsc_timestamp = snap->hw_cycles;
> +		return true;
> +	}
> +
> +	return false;
> +}
> +
>  static bool kvm_get_time_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
>  {
> -	/* checked again under seqlock below */
> -	if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
> +	struct system_time_snapshot snap;
> +
> +	ktime_get_snapshot_id(CLOCK_MONOTONIC_RAW, &snap);
> +	if (!kvm_snapshot_has_tsc(&snap, tsc_timestamp))

[Severity: Critical]
If ktime_get_snapshot_id() encounters a failure (such as when timekeeping is
suspended), it only sets snap.valid = false and returns immediately, leaving
the rest of the struct uninitialized.

Since snap.valid is not checked before calling kvm_snapshot_has_tsc(), could
this pass uninitialized stack memory into the helper?

If the uninitialized snap->cs_id happens to match CSID_X86_TSC, could KVM
expose uninitialized host kernel stack data to the guest VM?

>  		return false;
>  
> -	return gtod_is_based_on_tsc(do_kvmclock_base(kernel_ns,
> -						     tsc_timestamp));
> +	*kernel_ns = ktime_to_ns(ktime_mono_to_any(snap.systime, TK_OFFS_BOOT));
> +	return true;
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260608145455.89187-1-dwmw2@infradead.org?part=31

^ permalink raw reply

* Re: [PATCH v5 25/34] KVM: x86/xen: Prevent runstate times from becoming negative
From: David Woodhouse @ 2026-06-09  1:02 UTC (permalink / raw)
  To: sashiko-bot; +Cc: dwmw2, kvm, sashiko-reviews
In-Reply-To: <20260608195835.CCFD71F00893@smtp.kernel.org>

[-- Attachment #1: Type: text/plain, Size: 424 bytes --]

On Mon, 2026-06-08 at 19:58 +0000, sashiko-bot@kernel.org wrote:
> [Severity: High]
> can this read-modify-write sequence be safely executed with
> preemption enabled?

This is hurting my brain and I might need to come back to it.
The reentrancy concern is pre-existing and out of scope for
this series — the patch only adds handling for retrograde clock
values, it doesn't change the calling context or locking.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v2 15/20] KVM: x86/xen: Convert kvm_xen_set_evtchn_fast() to gpc's CLASS() APIs
From: Sean Christopherson @ 2026-06-09  0:53 UTC (permalink / raw)
  To: David Woodhouse
  Cc: pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa, kvm,
	linux-kernel, sashiko-reviews, dwmw
In-Reply-To: <c1f0dee379e282ffb35296da4b666d93025f1c95.camel@infradead.org>

On Tue, Jun 02, 2026, David Woodhouse wrote:
> On Fri, 29 May 2026 12:11:37 -0700, Sean Christopherson wrote:
> > On Fri, May 29, 2026, sashiko-bot@kernel.org wrote:
> > > [Severity: High]
> > > Does this unintentionally hold the gpc lock across the IPI kick?
> > 
> > No, it intentionally holds the gpc lock across _sending_ the IPI kick.
> > 
> > > [Severity: High]
> > > Does this now hold the VM-wide shinfo_cache lock while calling
> > > __kvm_xen_set_evtchn_fast() and kicking the vCPU?
> > > 
> > > Since shinfo_map is a function-scoped CLASS() variable, its destructor
> > > won't release the lock until after __kvm_xen_set_evtchn_fast() returns.
> > > This creates a nested locking dependency and holds locks over expensive
> > > cross-vCPU operations, potentially serializing event channel deliveries
> > > across the entire VM on the fast path.
> > 
> > __kvm_vcpu_kick() is neither expensive nor cross-vCPU.  In the wait=false case,
> > which is the behavior of kvm_vcpu_kick(), it sends IPIs via smp_send_reschedule(),
> > i.e. it's more or less just __apic_send_IPI(cpu, RESCHEDULE_VECTOR), which is a
> > single WRMSR on modern harware.
> 
> Hm, if we really are going to use a raw rwlock then I suppose we should
> take care not to spend any extra cycles with the lock held unless we
> actually *need* to?
> 
> In which case I wonder if we really need the scoped CLASS() thing (that
> encourages people to run to the end of the function under the lock), or
> if the helpers you add for that are sufficient on their own?

Eh, I would argue that "the code is easier to read" is sufficient justification
so long as holding the lock for a few tens of extra cycles doesn't cause problems
in practice.

^ permalink raw reply

* Re: [PATCH v5 30/34] KVM: selftests: Add Xen runstate migration test
From: sashiko-bot @ 2026-06-09  0:50 UTC (permalink / raw)
  To: David Woodhouse; +Cc: kvm
In-Reply-To: <20260608145455.89187-31-dwmw2@infradead.org>

Thank you for your contribution! Sashiko AI review found 3 potential issue(s) to consider:
- [Medium] Memory leak of the VM metadata structures and userspace memory regions.
- [Low] The new test file is missing from the KVM selftests Makefile, preventing it from being compiled.
- [Low] The test uses undefined KVM UAPI ioctls, causing compilation failures.
--

commit 0780d159e1213b2e5be1d76bb2647a1007af3fdf
Author: David Woodhouse <dwmw@amazon.co.uk>

Test that Xen runstate (steal time) is correctly accounted across a
simulated live migration using KVM_XEN_VCPU_ATTR and KVM_[GS]ET_CLOCK_GUEST.

This commit adds a new KVM selftest to verify Xen runstate tracking across
a simulated VM migration. It creates a VM with Xen HVM configuration,
accumulates runstate time, saves and restores the VM state, and verifies
that the elapsed time during migration is properly accounted as steal time.

>  .../selftests/kvm/x86/xen_migration_test.c    | 194 ++++++++++++++++++
>  1 file changed, 194 insertions(+)
>  create mode 100644 tools/testing/selftests/kvm/x86/xen_migration_test.c

[Severity: Low]
Is the new xen_migration_test.c file missing from the KVM selftests Makefile?
Without adding it to TEST_GEN_PROGS_x86 (or an equivalent target) in
tools/testing/selftests/kvm/Makefile, the test is ignored by the build system
and will never be compiled or executed.

> diff --git a/tools/testing/selftests/kvm/x86/xen_migration_test.c b/tools/testing/selftests/kvm/x86/xen_migration_test.c
> new file mode 100644
> index 000000000000..37e8ace00611
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/x86/xen_migration_test.c
[ ... ]
> +int main(void)
> +{
> +	struct pvclock_vcpu_time_info pvti;
> +	struct kvm_xen_vcpu_attr runstate_save;
[ ... ]
> +	pr_info("  Guest sees: runnable=%" PRIu64 " running=%" PRIu64 "\n",
> +		uc.args[2], uc.args[3]);
> +
> +	/* Save clock state */
> +	ret = __vcpu_ioctl(vcpu, KVM_GET_CLOCK_GUEST, &pvti);

[Severity: Low]
Does this code fail to compile because KVM_GET_CLOCK_GUEST is an undeclared
identifier? It does not appear to exist in the standard KVM UAPI headers.

> +	TEST_ASSERT(!ret, "KVM_GET_CLOCK_GUEST failed");
> +
> +	/* Save TSC offset */
[ ... ]
> +	uint64_t saved_runnable = runstate_save.u.runstate.time_runnable;
> +
> +	kvm_vm_release(vm);

[Severity: Medium]
Does calling kvm_vm_release() here instead of kvm_vm_free() leak the 
struct kvm_vm object and its mapped userspace memory regions?

kvm_vm_release() is an internal helper that releases vCPUs and file
descriptors, but it leaves the VM structure and memory maps allocated
right before the vm pointer is overwritten with a new VM.

> +
> +	/* === MIGRATION GAP === */
> +	pr_info("=== Simulating migration (sleeping 10ms) ===\n");
> +	usleep(10000);
> +
> +	/* === DESTINATION SIDE === */
> +	pr_info("=== Destination: create new VM and restore ===\n");
> +	vm = create_xen_vm(&vcpu);
> +
> +	/* Restore TSC offset */
> +	vcpu_set_msr(vcpu, MSR_IA32_TSC_ADJUST, tsc_offset);
> +
> +	/* Restore clock — kvmclock will now be ~10ms ahead of the snapshot */
> +	vcpu_ioctl(vcpu, KVM_SET_CLOCK_GUEST, &pvti);

[Severity: Low]
Similarly, does this fail to compile because KVM_SET_CLOCK_GUEST does not
exist in the standard KVM UAPI headers?

> +
> +	/* Restore runstate exactly as saved (state=runnable) */
[ ... ]
> +	TEST_ASSERT(steal_increase >= 10000000ULL &&
> +		    steal_increase < 5000000000ULL,
> +		    "Steal time increase %" PRIu64 " ns not in expected range "
> +		    "[10ms, 5s]", steal_increase);
> +
> +	kvm_vm_release(vm);

[Severity: Medium]
Does this also permanently leak the second VM's metadata structures and
memory maps before the test exits?

> +	pr_info("PASS: Migration gap correctly accounted as steal time\n");
> +	return 0;
> +}

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260608145455.89187-1-dwmw2@infradead.org?part=30

^ permalink raw reply

* Re: [PATCH v5 22/34] KVM: selftests: Add master clock offset test
From: David Woodhouse @ 2026-06-09  0:50 UTC (permalink / raw)
  To: sashiko-bot; +Cc: dwmw2, kvm, sashiko-reviews
In-Reply-To: <20260608192659.042CE1F00893@smtp.kernel.org>

[-- Attachment #1: Type: text/plain, Size: 1235 bytes --]

On Mon, 2026-06-08 at 19:26 +0000, sashiko-bot@kernel.org wrote:
> [Severity: High]
> Does the KVM_GET_CLOCK_GUEST macro exist?

Yes — it's defined in include/uapi/linux/kvm.h by patch 04
in this series.

> [Severity: Medium]
> Will this assertion fail? KVM disables the master clock if all
> online vCPUs do not have matching TSC offsets

No — patch 21 in this series decouples master clock from offset
matching. Master clock only requires matching *frequency*.
Different offsets are fine. That's kind of the point in the test!

> [Severity: Medium]
> The commit message claims KVM_CLOCK_TSC_STABLE is NOT set, but
> the code asserts it is present.

You're right — the commit message is wrong. KVM_CLOCK_TSC_STABLE
(VM-level) IS set in master clock mode. What is NOT set is
PVCLOCK_TSC_STABLE_BIT (per-vCPU), which requires matching
offsets. Will fix the commit message in v6.

> [Severity: Medium]
> PVCLOCK_TSC_STABLE_BIT should be set when master clock is active

No — patch 21 gates PVCLOCK_TSC_STABLE_BIT on
all_vcpus_matched_tsc (full offset match), not just
use_master_clock. This is the intended decoupling, as guests use
PVCLOCK_TSC_STABLE_BIT to infer that TSC is stable across vCPUs.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* [PATCH net] vsock/virtio: restore msg_iter on transmission failure
From: Octavian Purdila @ 2026-06-09  0:48 UTC (permalink / raw)
  To: netdev
  Cc: Octavian Purdila, syzbot+28e5f3d207b14bae122a, Stefan Hajnoczi,
	Stefano Garzarella, Michael S. Tsirkin, Jason Wang, Xuan Zhuo,
	Eugenio Pérez, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Arseniy Krasnov, kvm, virtualization,
	linux-kernel

When transmission fails in virtio_transport_send_pkt_info, the msg_iter
might have been partially advanced. If we don't restore it, the next
attempt to send data will use an incorrect iterator state, leading to
desync and warnings like "send_pkt() returns 0, but X expected".

Specifically, this can happen in the following scenario, triggered by
the syzkaller repro:

1. A write-only VMA (PROT_WRITE only) is partially populated by a
   prior TUN write that failed with -EIO but still faulted in some
   pages).
2. A vsock sendmmsg call with MSG_ZEROCOPY requests transmission of a
   buffer from this VMA.
3. The first packet (64KB) is sent successfully because the pages are
   populated.
4. The second packet allocation fails because GUP fast pins the first page
   but GUP slow fails on the next unpopulated page due to PROT_WRITE-only
   permissions.
5. The iterator is advanced by the partially successful GUP (68KB total
   advanced: 64KB from first packet + 4KB from second), but the send loop
   breaks and only reports 64KB sent. This creates a 4KB desync.
6. The next retry starts with a non-zero iov_offset, disabling zerocopy
   and falling back to copy mode.
7. In copy mode, the transmission succeeds for the next packets but
   exhausts the iterator early because of the desync.
8. The final retry sees an empty iterator but zerocopy is re-enabled
   (offset resets). It attempts to send the remaining bytes with zerocopy
   but pins 0 pages, creating an empty packet.
9. The transport sends the empty packet, triggering the warning because
   the returned bytes (header only) do not match the expected payload size.
10. The loop continues to spin, allocating ubuf_info each time, eventually
    exhausting sysctl_optmem_max and returning -ENOMEM to userspace.

Restore msg_iter to its original state before the packet allocation
and transmission attempt if they fail.

Fixes: e0718bd82e27 ("vsock: enable setting SO_ZEROCOPY")
Reported-by: syzbot+28e5f3d207b14bae122a@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=28e5f3d207b14bae122a
Assisted-by: gemini:gemini-3.1-pro
Signed-off-by: Octavian Purdila <tavip@google.com>
---
 net/vmw_vsock/virtio_transport_common.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
index b10666937c490..588623a3e2bbc 100644
--- a/net/vmw_vsock/virtio_transport_common.c
+++ b/net/vmw_vsock/virtio_transport_common.c
@@ -367,6 +367,10 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
 	do {
 		struct sk_buff *skb;
 		size_t skb_len;
+		struct iov_iter saved_iter;
+
+		if (info->msg)
+			saved_iter = info->msg->msg_iter;
 
 		skb_len = min(max_skb_len, rest_len);
 
@@ -375,6 +379,8 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
 						 src_cid, src_port,
 						 dst_cid, dst_port);
 		if (!skb) {
+			if (info->msg)
+				info->msg->msg_iter = saved_iter;
 			ret = -ENOMEM;
 			break;
 		}
@@ -382,8 +388,11 @@ static int virtio_transport_send_pkt_info(struct vsock_sock *vsk,
 		virtio_transport_inc_tx_pkt(vvs, skb);
 
 		ret = t_ops->send_pkt(skb, info->net);
-		if (ret < 0)
+		if (ret < 0) {
+			if (info->msg)
+				info->msg->msg_iter = saved_iter;
 			break;
+		}
 
 		/* Both virtio and vhost 'send_pkt()' returns 'skb_len',
 		 * but for reliability use 'ret' instead of 'skb_len'.
-- 
2.54.0.1064.gd145956f57-goog


^ permalink raw reply related

* Re: [PATCH v2 20/20] KVM: x86: Use gfn_to_pfn_cache for record_steal_time
From: Sean Christopherson @ 2026-06-09  0:45 UTC (permalink / raw)
  To: David Woodhouse
  Cc: sashiko-bot, pbonzini, tglx, mingo, bp, dave.hansen, x86, hpa,
	kvm, linux-kernel, sashiko-reviews, stollmc, dwmw
In-Reply-To: <3291d47ea5fdd6ee2f284cadb518473130954cfc.camel@infradead.org>

On Tue, Jun 02, 2026, David Woodhouse wrote:
> On Sat, 30 May 2026 06:19:32 +0000, sashiko-bot@kernel.org wrote:
> > [Severity: High]
> > Does this introduce a scheduling while atomic bug on non-PREEMPT_RT kernels?
> >
> > The CLASS(gpc_map_local, st_map) macro acquires a read_lock on gpc->lock,
> > which disables preemption. While this lock is held, if the guest supports
> > PV TLB flush, the code calls kvm_vcpu_flush_tlb_guest().
> >
> > If TDP is disabled (shadow paging), kvm_vcpu_flush_tlb_guest() calls
> > kvm_mmu_sync_roots() and eventually mmu_sync_children(). This path can yield
> > via cond_resched_rwlock_write(). Yielding while preemption is disabled by
> > the gpc read lock will trigger a BUG.
> 
> Ah, that issue exists in the previous versions too, but it's simple
> enough to fix. There's no particular timing constraint for flushing the
> TLB; it just have to be done before this vCPU ever runs again. It can
> just be moved to the end of the function after the lock is dropped.
> 
> That does mean record_steal_time() should use the explicit
> gpc_map_local_lock()/gpc_map_local_unlock() instead of the CLASS()
> macro, but that's easy enough.

Actually, we use KVM_REQ_TLB_FLUSH_GUEST and "optimize" the code for the rare
case where KVM already have a TLB flushed queued for the vCPU.  E.g. over two
patches (so that changing the order of the request processing is isolated):

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1b27dd9ba0aa..48234eeb246b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3764,7 +3764,7 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
                trace_kvm_pv_tlb_flush(vcpu->vcpu_id,
                                       st_preempted & KVM_VCPU_FLUSH_TLB);
                if (st_preempted & KVM_VCPU_FLUSH_TLB)
-                       kvm_vcpu_flush_tlb_guest(vcpu);
+                       kvm_make_request(KVM_REQ_TLB_FLUSH_GUEST, vcpu);
        } else {
                WRITE_ONCE(st->preempted, 0);
                vcpu->arch.st.preempted = 0;
@@ -11165,6 +11165,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
                        if (unlikely(r))
                                goto out;
                }
+               if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))
+                       record_steal_time(vcpu);
                if (kvm_check_request(KVM_REQ_MMU_SYNC, vcpu))
                        kvm_mmu_sync_roots(vcpu);
                if (kvm_check_request(KVM_REQ_LOAD_MMU_PGD, vcpu))
@@ -11214,8 +11216,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
                        r = 1;
                        goto out;
                }
-               if (kvm_check_request(KVM_REQ_STEAL_UPDATE, vcpu))
-                       record_steal_time(vcpu);
                if (kvm_check_request(KVM_REQ_PMU, vcpu))
                        kvm_pmu_handle_event(vcpu);
                if (kvm_check_request(KVM_REQ_PMI, vcpu))


KVM needs to ensure the RMW on st->preempted is atomic, to avoid re-introducing
the bug fixed by commit b043138246a4 ("x86/KVM: Make sure KVM_VCPU_FLUSH_TLB flag
is not missed"), but AFAICT there's nothing that requires to complete the TLB
flush before bumping the version, KVM just needs to service the flush before
entering the guest on that vCPU.

^ permalink raw reply related

* Re: [PATCH v5 29/34] KVM: x86: Re-synchronize TSC after KVM_SET_TSC_KHZ
From: sashiko-bot @ 2026-06-09  0:37 UTC (permalink / raw)
  To: David Woodhouse; +Cc: kvm
In-Reply-To: <20260608145455.89187-30-dwmw2@infradead.org>

Thank you for your contribution! Sashiko AI review found 1 potential issue(s) to consider:
- [High] Calling `kvm_read_l1_tsc` after `set_tsc_khz` calculates the current TSC using the newly set scaling ratio but the old offset, causing a massive discontinuity in the guest's observed TSC instead of preserving continuity.
--

commit c224c306d9d055ced6e04bed3638e47c8d31fb69
Author: David Woodhouse <dwmw@amazon.co.uk>

KVM_SET_TSC_KHZ changes the vCPU's TSC scaling ratio but does not update

This commit fixes an issue where get_kvmclock() uses a stale (1:1) ratio
when computing the KVM clock. It calls kvm_synchronize_tsc() after
changing the TSC frequency to ensure cur_tsc_scaling_ratio is updated
and the TSC value remains continuous across the frequency change.

> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 54d4b1b3cfe4..96250264d403 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c

[ ... ]

> @@ -2611,7 +2612,20 @@ static int kvm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz)
>  			 user_tsc_khz, thresh_lo, thresh_hi);
>  		use_scaling = 1;
>  	}
> -	return set_tsc_khz(vcpu, user_tsc_khz, use_scaling);
> +	if (set_tsc_khz(vcpu, user_tsc_khz, use_scaling))
> +		return -1;
> +
> +	/*
> +	 * Re-synchronize the TSC after changing frequency. This ensures
> +	 * cur_tsc_scaling_ratio is updated (used by get_kvmclock) and
> +	 * the TSC value is continuous across the frequency change.
> +	 */
> +	{
> +		u64 tsc = kvm_read_l1_tsc(vcpu, rdtsc());
> +
> +		kvm_synchronize_tsc(vcpu, &tsc);

[Severity: High]
Will this cause a discontinuity in the guest's observed TSC instead of
preserving continuity?

If kvm_read_l1_tsc() is called after set_tsc_khz() has already updated the
scaling ratio, the current guest TSC is calculated using the newly set
scaling ratio combined with the old offset:

tsc = old_offset + rdtsc() * new_ratio

When kvm_synchronize_tsc() is then called to compute the new offset:

new_offset = tsc - rdtsc() * new_ratio

This algebraically simplifies back to the old offset. Because the offset
remains essentially unchanged while the scaling ratio has changed, the
guest's observed TSC would immediately jump by the difference between the
old and new ratios multiplied by rdtsc().

Since rdtsc() represents the host's uptime in cycles, could this cause a
massive time jump that breaks guest timekeeping or leads to lockups?

Would it be safer to evaluate kvm_read_l1_tsc(vcpu, rdtsc()) before
calling set_tsc_khz(), so the old scaling ratio is used to establish the
baseline for continuity?

> +	}
> +	return 0;
>  }

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260608145455.89187-1-dwmw2@infradead.org?part=29

^ permalink raw reply

* Re: [PATCH v5 19/34] KVM: x86: Kill last_tsc_{nsec,write,offset} fields
From: David Woodhouse @ 2026-06-09  0:34 UTC (permalink / raw)
  To: sashiko-bot; +Cc: dwmw2, kvm, sashiko-reviews
In-Reply-To: <20260608185301.326AC1F00893@smtp.kernel.org>

[-- Attachment #1: Type: text/plain, Size: 1717 bytes --]

On Mon, 2026-06-08 at 18:53 +0000, sashiko-bot@kernel.org wrote:
> [Severity: High]
> Does this S4 resume adjustment to cur_tsc_write permanently break
> cross-vCPU synchronization by omitting the symmetrical adjustment
> to cur_tsc_offset?

Yes, good catch. Fixed in v6 by also adjusting cur_tsc_offset +=
delta_cyc alongside the cur_tsc_write adjustment, so new vCPUs
created after a backwards TSC event get the correct offset.

> [Severity: High]
> Can this lockless read-modify-write operation race with concurrent
> KVM_SET_TSC ioctls?

Also fixed in v6 — the cur_tsc_write adjustment is now done under
tsc_write_lock. This path runs on CPU hotplug (not just S4) so
it can genuinely race with concurrent vCPU operations.

Theoretically I guess we could store the delta_cyc in the kvm
somewhere, and let the vCPU which wins the race to call
kvm_update_masterclock() ingest them? But for an esoteric case on
ancient crappy hardware than I'm hoping we delete soon... can I be
bothered?

> [Severity: Medium]
> The commit silently changes the logic for TSC matching after an S4
> resume without documenting the functional change.

Will update the commit message in v6. Basically, if we adjust the
fields correctly we shouldn't *need* to disable matching, and that TODO
about VM creation across S4 suspend can go away... although I wasn't
actually planning to test that case.

> [Severity: Critical]
> Unprotected iteration over vm_list in CPU hotplug path

I guess that's because we can't take a mutex in a hotplug routine?
I'm going to note it as preexisting and try to ignore it for now.
I shall probably fail (to ignore it) in the long run... but let's see
how I get on.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5069 bytes --]

^ permalink raw reply

* Re: [PATCH v4 26/47] x86/kvm: Don't disable kvmclock on BSP in syscore_suspend()
From: Sean Christopherson @ 2026-06-09  0:31 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: linux-hyperv, kvm
In-Reply-To: <20260530070848.158521F00893@smtp.kernel.org>

On Sat, May 30, 2026, sashiko-bot@kernel.org wrote:
> > diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
> > index c81a24d0efdf6..fd1c417b4f9b9 100644
> > --- a/arch/x86/kernel/kvm.c
> > +++ b/arch/x86/kernel/kvm.c
> > @@ -460,7 +460,7 @@ static void __init sev_map_percpu_data(void)
> 
> [ ... ]
> 
> > -static void kvm_guest_cpu_offline(bool shutdown)
> > +static void kvm_guest_cpu_offline(enum kvm_guest_cpu_action action)
> >  {
> >  	kvm_disable_steal_time();
> >  	if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
> > @@ -468,9 +468,10 @@ static void kvm_guest_cpu_offline(bool shutdown)
> >  	if (kvm_para_has_feature(KVM_FEATURE_MIGRATION_CONTROL))
> >  		wrmsrq(MSR_KVM_MIGRATION_CONTROL, 0);
> 
> [Severity: High]
> This isn't a bug introduced by this patch, but I noticed that
> MSR_KVM_MIGRATION_CONTROL is cleared to 0 here during CPU offline and system
> suspend, but it does not appear to be re-enabled upon online/resume.
> 
> Since this patch introduces the kvm_guest_cpu_action enum to solve an
> identical state-machine issue for kvmclock, should we also use this action
> enum to ensure the migration control MSR isn't incorrectly cleared on
> suspend? Otherwise, wouldn't a single CPU hotplug event or a suspend/resume
> cycle permanently disable live migration for the entire VM?

Looks like.  That's someone else's future problem though.

^ permalink raw reply

* Re: [PATCH v4 21/47] x86/xen: Obtain TSC frequency from CPUID if present
From: Sean Christopherson @ 2026-06-09  0:29 UTC (permalink / raw)
  To: sashiko-reviews; +Cc: kvm, linux-hyperv
In-Reply-To: <20260530063531.D034A1F00898@smtp.kernel.org>

On Sat, May 30, 2026, sashiko-bot@kernel.org wrote:
> > diff --git a/arch/x86/xen/time.c b/arch/x86/xen/time.c
> > index 487ad838c4418..36d66abf53796 100644
> > --- a/arch/x86/xen/time.c
> > +++ b/arch/x86/xen/time.c
> > @@ -42,6 +42,17 @@ static unsigned int __init xen_tsc_khz(void)
> >  {
> >  	struct pvclock_vcpu_time_info *info =
> >  		&HYPERVISOR_shared_info->vcpu_info[0].time;
> > +	u32 base = xen_cpuid_base();
> > +	u32 eax, ebx, ecx, edx;
> > +
> > +	/*
> > +	 * If Xen provides the guest TSC frequency directly in CPUID
> > +	 * (leaf 3, sub-leaf 0, ECX), use that instead of reverse-
> > +	 * calculating from the pvclock mul/shift.
> > +	 */
> > +	cpuid_count(base + 3, 0, &eax, &ebx, &ecx, &edx);
> > +	if (ecx)
> > +		return ecx;
> 
> [Severity: High]
> Is it possible for the hypervisor to support a maximum leaf of base + 2,
> resulting in an out-of-bounds CPUID query here?
> 
> Looking at xen_cpuid_base(), it calls cpuid_base_hypervisor(XEN_SIGNATURE, 2)
> which only guarantees that the hypervisor supports up to leaf base + 2.
> 
> If the hypervisor does not support base + 3, this CPUID query could be out
> of bounds. Processors handling out-of-bounds CPUID requests typically return
> data from the maximum basic leaf.

Heh, depends on the hypervisor.  This quirk is specific to Intel CPUs, and so
KVM emulates this behavior only when the advertised vCPU vendor is Intel.

Anyways, AFAICT, Sashiko is right to be skeptical, I don't see anything obvious
that guarantees +3 will be supported.

David, can you send this as a standalone patch, and either address Sashiko's
concern or add a blurb/comment explaining why it's safe?  Unlike the KVM changes,
this won't conflict with any of the other changes in this series.  So while it's
themetatically very related to this series, in practice it can go in separately,
and I'd strongly prefer to let the Xen folks handle this one.

^ permalink raw reply

* Re: [PATCH v5 28/34] KVM: selftests: Add Xen/generic CPUID timing leaf test
From: sashiko-bot @ 2026-06-09  0:27 UTC (permalink / raw)
  To: David Woodhouse; +Cc: kvm
In-Reply-To: <20260608145455.89187-29-dwmw2@infradead.org>

Thank you for your contribution! Sashiko AI review found 2 potential issue(s) to consider:
- [Medium] Using kvm_vm_release() instead of kvm_vm_free() leaks the struct kvm_vm and its memory tracking structures on each test iteration.
- [Low] KVM_VCPU_TSC_SCALE is undeclared in UAPI headers, causing a build failure.
--

commit 520a5187b17e7c7df17d704853e3e3ea37d182f3
Author: David Woodhouse <dwmw@amazon.co.uk>

Verify that userspace can correctly populate Xen and generic CPUID
timing leaves using the KVM_VCPU_TSC_SCALE and
KVM_VCPU_TSC_SCALE attributes.

This patch adds a KVM selftest to validate that removing KVM's runtime
Xen CPUID modifications does not break guests. It queries effective
frequencies, computes pvclock variables, and verifies that the guest CPUID
instructions return the expected values.

> diff --git a/tools/testing/selftests/kvm/x86/xen_cpuid_timing_test.c b/tools/testing/selftests/kvm/x86/xen_cpuid_timing_test.c
> new file mode 100644
> index 000000000000..a0c262b8db89
> --- /dev/null
> +++ b/tools/testing/selftests/kvm/x86/xen_cpuid_timing_test.c

[ ... ]

> +static void run_test(uint64_t tsc_khz)
> +{

[ ... ]

> +	/* If scaling wasn't applied, skip this frequency */
> +	if (tsc_khz && effective_tsc_khz == host_khz) {
> +		pr_info("  TSC scaling not available, skipping\n");
> +		kvm_vm_release(vm);

[Severity: Medium]
Should this be a call to kvm_vm_free() instead of kvm_vm_release()?

In KVM selftests, kvm_vm_release() only closes file descriptors.

Calling it here leaves the struct kvm_vm and its associated memory structures
dynamically allocated, which causes a memory leak on each skipped or
completed test iteration.

> +		return;
> +	}

[ ... ]

> +	switch (get_ucall(vcpu, &uc)) {
> +	case UCALL_ABORT:
> +		REPORT_GUEST_ASSERT(uc);
> +		break;
> +	case UCALL_SYNC:
> +		break;
> +	default:
> +		TEST_FAIL("Unexpected ucall");
> +	}
> +
> +	kvm_vm_release(vm);

[Severity: Medium]
As mentioned above, does this code leak the struct kvm_vm and its tracking
structures?

Using kvm_vm_free() here would free the tracking structures in addition to
closing the fds.

> +}
> +
> +int main(void)
> +{
> +	uint64_t freq;
> +	struct kvm_vcpu *vcpu;
> +	struct kvm_vm *vm;
> +	struct kvm_device_attr attr = {
> +		.group = KVM_VCPU_TSC_CTRL,
> +		.attr = KVM_VCPU_TSC_SCALE,

[Severity: Low]
Does this break the build due to KVM_VCPU_TSC_SCALE being undeclared?

It appears KVM_VCPU_TSC_SCALE isn't defined in the UAPI headers.

There is a hardcoded fallback 1 used earlier in run_test() that avoids
this issue, but it seems to have been missed here.

> +	};
> +
> +	TEST_REQUIRE(sys_clocksource_is_based_on_tsc());
> +
> +	/* Check KVM_VCPU_TSC_SCALE is supported (implies TSC scaling) */
> +	vm = vm_create_with_one_vcpu(&vcpu, guest_code);
> +	TEST_REQUIRE(!__vcpu_ioctl(vcpu, KVM_HAS_DEVICE_ATTR, &attr));
> +	host_khz = __vcpu_ioctl(vcpu, KVM_GET_TSC_KHZ, NULL);
> +	kvm_vm_release(vm);

[Severity: Medium]
Similarly, could this leak the capability check VM?

It seems kvm_vm_free() is the intended cleanup function here.

> +
> +	/* Native frequency */
> +	run_test(0);

[ ... ]

-- 
Sashiko AI review · https://sashiko.dev/#/patchset/20260608145455.89187-1-dwmw2@infradead.org?part=28

^ permalink raw reply

page: next (older)
- recent:[subjects (threaded)|topics (new)|topics (active)]

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox