[PATCH] x86/kvm: Account for fpstate->user

public inbox for stable@vger.kernel.org
 help / color / mirror / Atom feed

* [PATCH] x86/kvm: Account for fpstate->user_xfeatures changes
@ 2023-09-14  1:00 Tyler Stachecki
  2023-09-14  1:33 ` Tyler Stachecki
  2023-09-14  7:15 ` Leonardo Bras
  0 siblings, 2 replies; 16+ messages in thread
From: Tyler Stachecki @ 2023-09-14  1:00 UTC (permalink / raw)
  To: kvm
  Cc: stachecki.tyler, leobras, seanjc, pbonzini, dgilbert, tglx, mingo,
	dave.hansen, bp, Tyler Stachecki, stable

Live-migrations under qemu result in guest corruption when
the following three conditions are all met:

  * The source host CPU has capabilities that itself
    extend that of the guest CPU fpstate->user_xfeatures

  * The source kernel emits guest_fpu->user_xfeatures
    with respect to the host CPU (i.e. it *does not* have
    the "Fixes:" commit)

  * The destination kernel enforces that the xfeatures
    in the buffer given to KVM_SET_IOCTL are compatible
    with the guest CPU (i.e., it *does* have the "Fixes:"
    commit)

When these conditions are met, the semantical changes to
fpstate->user_features trigger a subtle bug in qemu that
results in qemu failing to put the XSAVE architectural
state into KVM.

qemu then both ceases to put the remaining (non-XSAVE) x86
architectural state into KVM and makes the fateful mistake
of resuming the guest anyways. This usually results in
immediate guest corruption, silent or not.

Due to the grave nature of this qemu bug, attempt to
retain behavior of old kernels by clamping the xfeatures
specified in the buffer given to KVM_SET_IOCTL such that
it aligns with the guests fpstate->user_xfeatures instead
of returning an error.

Fixes: ad856280ddea ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0")
Cc: stable@vger.kernel.org
Cc: Leonardo Bras <leobras@redhat.com>
Signed-off-by: Tyler Stachecki <tstachecki@bloomberg.net>
---
 arch/x86/kvm/x86.c | 14 ++++++++++++--
 1 file changed, 12 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6c9c81e82e65..baad160b592f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5407,11 +5407,21 @@ static void kvm_vcpu_ioctl_x86_get_xsave2(struct kvm_vcpu *vcpu,
 static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
 					struct kvm_xsave *guest_xsave)
 {
+	union fpregs_state *ustate = (union fpregs_state *) guest_xsave->region;
+	u64 user_xfeatures = vcpu->arch.guest_fpu.fpstate->user_xfeatures;
+
 	if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
 		return 0;
 
-	return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu,
-					      guest_xsave->region,
+	/*
+	 * In previous kernels, kvm_arch_vcpu_create() set the guest's fpstate
+	 * based on what the host CPU supported. Recent kernels changed this
+	 * and only accept ustate containing xfeatures that the guest CPU is
+	 * capable of supporting.
+	 */
+	ustate->xsave.header.xfeatures &= user_xfeatures;
+
+	return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu, ustate,
 					      kvm_caps.supported_xcr0,
 					      &vcpu->arch.pkru);
 }
-- 
2.30.2


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/kvm: Account for fpstate->user_xfeatures changes
  2023-09-14  1:00 [PATCH] x86/kvm: Account for fpstate->user_xfeatures changes Tyler Stachecki
@ 2023-09-14  1:33 ` Tyler Stachecki
  2023-09-14  7:15 ` Leonardo Bras
  1 sibling, 0 replies; 16+ messages in thread
From: Tyler Stachecki @ 2023-09-14  1:33 UTC (permalink / raw)
  To: kvm
  Cc: leobras, seanjc, pbonzini, dgilbert, tglx, mingo, dave.hansen, bp,
	Tyler Stachecki, stable

On Wed, Sep 13, 2023 at 09:00:03PM -0400, Tyler Stachecki wrote:
> qemu then both ceases to put the remaining (non-XSAVE) x86
> architectural state into KVM and makes the fateful mistake
> of resuming the guest anyways. This usually results in
> immediate guest corruption, silent or not.

I just want to highlight that although this is probably more of a bug with
respect to how qemu is handling things, the original patches from Leo are
starting to appear in many distro stable kernels and are really putting a
spanner in the works for maintaining VMs that are long-lived in nature.

At present, if you take the fix for PKRU migration issues (or if you are just
in need a more recent kernel), you are dealt with a situation where live-
migrating VMs to a kernel patched for the PKRU issue from one that is not
potentially crashes or corrupts skads of VMs.

There is no fix for qemu that I am aware of yet. Although, I am willing to
look into one if that is more palatable, I filed this patch on the premise
of "don't break userspace"...

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/kvm: Account for fpstate->user_xfeatures changes
  2023-09-14  1:00 [PATCH] x86/kvm: Account for fpstate->user_xfeatures changes Tyler Stachecki
  2023-09-14  1:33 ` Tyler Stachecki
@ 2023-09-14  7:15 ` Leonardo Bras
  2023-09-14  9:11   ` Tyler Stachecki
  1 sibling, 1 reply; 16+ messages in thread
From: Leonardo Bras @ 2023-09-14  7:15 UTC (permalink / raw)
  To: Tyler Stachecki
  Cc: kvm, seanjc, pbonzini, dgilbert, tglx, mingo, dave.hansen, bp,
	Tyler Stachecki, stable

On Wed, Sep 13, 2023 at 09:00:03PM -0400, Tyler Stachecki wrote:
> Live-migrations under qemu result in guest corruption when
> the following three conditions are all met:
> 
>   * The source host CPU has capabilities that itself
>     extend that of the guest CPU fpstate->user_xfeatures
> 
>   * The source kernel emits guest_fpu->user_xfeatures
>     with respect to the host CPU (i.e. it *does not* have
>     the "Fixes:" commit)
> 
>   * The destination kernel enforces that the xfeatures
>     in the buffer given to KVM_SET_IOCTL are compatible
>     with the guest CPU (i.e., it *does* have the "Fixes:"
>     commit)
> 
> When these conditions are met, the semantical changes to
> fpstate->user_features trigger a subtle bug in qemu that
> results in qemu failing to put the XSAVE architectural
> state into KVM.
> 
> qemu then both ceases to put the remaining (non-XSAVE) x86
> architectural state into KVM and makes the fateful mistake
> of resuming the guest anyways. This usually results in
> immediate guest corruption, silent or not.
> 
> Due to the grave nature of this qemu bug, attempt to
> retain behavior of old kernels by clamping the xfeatures
> specified in the buffer given to KVM_SET_IOCTL such that
> it aligns with the guests fpstate->user_xfeatures instead
> of returning an error.

So, IIUC, the xfeatures from the source guest will be different than the 
xfeatures of the target (destination) guest. Is that correct?

It does not seem right to me. I mean, from the guest viewpoint, some 
features will simply vanish during execution, and this could lead to major 
issues in the guest.

The idea here is that if the target (destination) host can't provide those 
features for the guest, then migration should fail.

I mean, qemu should fail the migration, and that's correct behavior.
Is it what is happening?

Regards,
Leo

> 
> Fixes: ad856280ddea ("x86/kvm/fpu: Limit guest user_xfeatures to supported bits of XCR0")
> Cc: stable@vger.kernel.org
> Cc: Leonardo Bras <leobras@redhat.com>
> Signed-off-by: Tyler Stachecki <tstachecki@bloomberg.net>
> ---
>  arch/x86/kvm/x86.c | 14 ++++++++++++--
>  1 file changed, 12 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6c9c81e82e65..baad160b592f 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5407,11 +5407,21 @@ static void kvm_vcpu_ioctl_x86_get_xsave2(struct kvm_vcpu *vcpu,
>  static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
>  					struct kvm_xsave *guest_xsave)
>  {
> +	union fpregs_state *ustate = (union fpregs_state *) guest_xsave->region;
> +	u64 user_xfeatures = vcpu->arch.guest_fpu.fpstate->user_xfeatures;
> +
>  	if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
>  		return 0;
>  
> -	return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu,
> -					      guest_xsave->region,
> +	/*
> +	 * In previous kernels, kvm_arch_vcpu_create() set the guest's fpstate
> +	 * based on what the host CPU supported. Recent kernels changed this
> +	 * and only accept ustate containing xfeatures that the guest CPU is
> +	 * capable of supporting.
> +	 */
> +	ustate->xsave.header.xfeatures &= user_xfeatures;
> +
> +	return fpu_copy_uabi_to_guest_fpstate(&vcpu->arch.guest_fpu, ustate,
>  					      kvm_caps.supported_xcr0,
>  					      &vcpu->arch.pkru);
>  }
> -- 
> 2.30.2
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/kvm: Account for fpstate->user_xfeatures changes
  2023-09-14  7:15 ` Leonardo Bras
@ 2023-09-14  9:11   ` Tyler Stachecki
  2023-09-14 17:05     ` Dongli Zhang
  2023-09-15  7:11     ` Leonardo Bras
  0 siblings, 2 replies; 16+ messages in thread
From: Tyler Stachecki @ 2023-09-14  9:11 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: kvm, seanjc, pbonzini, dgilbert, tglx, mingo, dave.hansen, bp,
	Tyler Stachecki, stable

On Thu, Sep 14, 2023 at 04:15:54AM -0300, Leonardo Bras wrote:
> So, IIUC, the xfeatures from the source guest will be different than the 
> xfeatures of the target (destination) guest. Is that correct?

Correct.

> It does not seem right to me. I mean, from the guest viewpoint, some 
> features will simply vanish during execution, and this could lead to major 
> issues in the guest.

My assumption is that the guest CPU model should confine access to registers
that make sense for that (guest) CPU.

e.g., take a host CPU capable of AVX-512 running a guest CPU model that only
has AVX-256. If the guest suddenly loses the top 256 bits of %zmm*, it should
not really be perceivable as %ymm architecturally remains unchanged.

Though maybe I'm being too rash here? Is there a case where this assumption
breaks down?

> The idea here is that if the target (destination) host can't provide those 
> features for the guest, then migration should fail.
> 
> I mean, qemu should fail the migration, and that's correct behavior.
> Is it what is happening?

Unfortunately, no, it is not... and that is biggest concern right now.

I do see some discussion between Peter and you on this topic and see that
there was an RFC to implement such behavior stemming from it, here:
https://lore.kernel.org/qemu-devel/20220607230645.53950-1-peterx@redhat.com/

... though I do not believe that work ever landed in the tree. Looking at
qemu's master branch now, the error from kvm_arch_put_registers is just
discarded in do_kvm_cpu_synchronize_post_init...

```
static void do_kvm_cpu_synchronize_post_init(CPUState *cpu, run_on_cpu_data arg)
{
    kvm_arch_put_registers(cpu, KVM_PUT_FULL_STATE);
    cpu->vcpu_dirty = false;
}
```

Best,
Tyler

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/kvm: Account for fpstate->user_xfeatures changes
  2023-09-14  9:11   ` Tyler Stachecki
@ 2023-09-14 17:05     ` Dongli Zhang
  2023-09-15  0:58       ` Tyler Stachecki
  2023-09-15  7:13       ` Leonardo Bras
  2023-09-15  7:11     ` Leonardo Bras
  1 sibling, 2 replies; 16+ messages in thread
From: Dongli Zhang @ 2023-09-14 17:05 UTC (permalink / raw)
  To: Tyler Stachecki, Leonardo Bras
  Cc: kvm, seanjc, pbonzini, dgilbert, tglx, mingo, dave.hansen, bp,
	Tyler Stachecki, stable



On 9/14/23 2:11 AM, Tyler Stachecki wrote:
> On Thu, Sep 14, 2023 at 04:15:54AM -0300, Leonardo Bras wrote:
>> So, IIUC, the xfeatures from the source guest will be different than the 
>> xfeatures of the target (destination) guest. Is that correct?
> 
> Correct.
>  
>> It does not seem right to me. I mean, from the guest viewpoint, some 
>> features will simply vanish during execution, and this could lead to major 
>> issues in the guest.

I fully agree with this.

I think the original commit ad856280ddea ("x86/kvm/fpu: Limit guest
user_xfeatures to supported bits of XCR0") is for the source server, not
destination server.

That is:

1. Without the commit (src and dst), something bad may happen.

2. With the commit on src, issue is fixed.

3. With the commit only dst, it is expected that issue is not fixed.

Therefore, from administrator's perspective, the bugfix should always be applied
no the source server, in order to succeed the migration.


BTW, we may not be able to use commit ad856280ddea in the Fixes tag.

> 
> My assumption is that the guest CPU model should confine access to registers
> that make sense for that (guest) CPU.
> 
> e.g., take a host CPU capable of AVX-512 running a guest CPU model that only
> has AVX-256. If the guest suddenly loses the top 256 bits of %zmm*, it should
> not really be perceivable as %ymm architecturally remains unchanged.
> 
> Though maybe I'm being too rash here? Is there a case where this assumption
> breaks down?
> 
>> The idea here is that if the target (destination) host can't provide those 
>> features for the guest, then migration should fail.
>>
>> I mean, qemu should fail the migration, and that's correct behavior.
>> Is it what is happening?
> 
> Unfortunately, no, it is not... and that is biggest concern right now.
> 
> I do see some discussion between Peter and you on this topic and see that
> there was an RFC to implement such behavior stemming from it, here:
> https://lore.kernel.org/qemu-devel/20220607230645.53950-1-peterx@redhat.com/

I agree that bug is at QEMU side, not KVM side.

It is better to improve at QEMU side.

4508 int kvm_arch_put_registers(CPUState *cpu, int level)
4509 {
4510     X86CPU *x86_cpu = X86_CPU(cpu);
4511     int ret;
... ...
4546     ret = kvm_put_xsave(x86_cpu);
4547     if (ret < 0) {
4548         return ret;
4549     }
... ...--> the rest of kvm_arch_put_registers() won't execute !!!

Thank you very much!

Dongli Zhang

> 
> ... though I do not believe that work ever landed in the tree. Looking at
> qemu's master branch now, the error from kvm_arch_put_registers is just
> discarded in do_kvm_cpu_synchronize_post_init...
> 
> ```
> static void do_kvm_cpu_synchronize_post_init(CPUState *cpu, run_on_cpu_data arg)
> {
>     kvm_arch_put_registers(cpu, KVM_PUT_FULL_STATE);
>     cpu->vcpu_dirty = false;
> }
> ```
> 
> Best,
> Tyler

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/kvm: Account for fpstate->user_xfeatures changes
  2023-09-14 17:05     ` Dongli Zhang
@ 2023-09-15  0:58       ` Tyler Stachecki
  2023-09-15  7:41         ` Leonardo Bras
  2023-09-15  7:13       ` Leonardo Bras
  1 sibling, 1 reply; 16+ messages in thread
From: Tyler Stachecki @ 2023-09-15  0:58 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: Leonardo Bras, kvm, seanjc, pbonzini, dgilbert, tglx, mingo,
	dave.hansen, bp, Tyler Stachecki, stable

On Thu, Sep 14, 2023 at 10:05:57AM -0700, Dongli Zhang wrote:
> That is:
> 
> 1. Without the commit (src and dst), something bad may happen.
> 
> 2. With the commit on src, issue is fixed.
> 
> 3. With the commit only dst, it is expected that issue is not fixed.
> 
> Therefore, from administrator's perspective, the bugfix should always be applied
> no the source server, in order to succeed the migration.

I fully agree. Though, I think this boils down to:
The commit must be on the source or something bad may happen.

It then follows that you cannot live-migrate guests off the source to patch it
without potentially corrupting the guests currently running on that source...

Regards,
Tyler

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/kvm: Account for fpstate->user_xfeatures changes
  2023-09-14  9:11   ` Tyler Stachecki
  2023-09-14 17:05     ` Dongli Zhang
@ 2023-09-15  7:11     ` Leonardo Bras
  1 sibling, 0 replies; 16+ messages in thread
From: Leonardo Bras @ 2023-09-15  7:11 UTC (permalink / raw)
  To: Tyler Stachecki
  Cc: kvm, seanjc, pbonzini, dgilbert, tglx, mingo, dave.hansen, bp,
	Tyler Stachecki, stable

On Thu, Sep 14, 2023 at 05:11:50AM -0400, Tyler Stachecki wrote:
> On Thu, Sep 14, 2023 at 04:15:54AM -0300, Leonardo Bras wrote:
> > So, IIUC, the xfeatures from the source guest will be different than the 
> > xfeatures of the target (destination) guest. Is that correct?
> 
> Correct.
>  
> > It does not seem right to me. I mean, from the guest viewpoint, some 
> > features will simply vanish during execution, and this could lead to major 
> > issues in the guest.
> 
> My assumption is that the guest CPU model should confine access to registers
> that make sense for that (guest) CPU.
> 
> e.g., take a host CPU capable of AVX-512 running a guest CPU model that only
> has AVX-256. If the guest suddenly loses the top 256 bits of %zmm*, it should
> not really be perceivable as %ymm architecturally remains unchanged.
> 
> Though maybe I'm being too rash here? Is there a case where this assumption
> breaks down?

There is no guarantee that it would be ok to simple remove a feature from 
the guest. Maybe it's fine for 99% of the cases for given feature, but it 
could always go wrong.

> 
> > The idea here is that if the target (destination) host can't provide those 
> > features for the guest, then migration should fail.
> > 
> > I mean, qemu should fail the migration, and that's correct behavior.
> > Is it what is happening?
> 
> Unfortunately, no, it is not... and that is biggest concern right now.
> 
> I do see some discussion between Peter and you on this topic and see that
> there was an RFC to implement such behavior stemming from it, here:
> https://lore.kernel.org/qemu-devel/20220607230645.53950-1-peterx@redhat.com/
> 
> ... though I do not believe that work ever landed in the tree. Looking at
> qemu's master branch now, the error from kvm_arch_put_registers is just
> discarded in do_kvm_cpu_synchronize_post_init...

This is wrong, then. QEMU should abort the migration in this case, so the 
VM is not lost.

Of course, with this issue fixed, there is another issue to deal with:
- VMs running on hosts with older kernel get stuck in hosts without the 
fixes.

Thanks,
Leo

> 
> ```
> static void do_kvm_cpu_synchronize_post_init(CPUState *cpu, run_on_cpu_data arg)
> {
>     kvm_arch_put_registers(cpu, KVM_PUT_FULL_STATE);
>     cpu->vcpu_dirty = false;
> }
> ```
> 
> Best,
> Tyler
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/kvm: Account for fpstate->user_xfeatures changes
  2023-09-14 17:05     ` Dongli Zhang
  2023-09-15  0:58       ` Tyler Stachecki
@ 2023-09-15  7:13       ` Leonardo Bras
  1 sibling, 0 replies; 16+ messages in thread
From: Leonardo Bras @ 2023-09-15  7:13 UTC (permalink / raw)
  To: Dongli Zhang
  Cc: Tyler Stachecki, kvm, seanjc, pbonzini, dgilbert, tglx, mingo,
	dave.hansen, bp, Tyler Stachecki, stable

On Thu, Sep 14, 2023 at 10:05:57AM -0700, Dongli Zhang wrote:
> 
> 
> On 9/14/23 2:11 AM, Tyler Stachecki wrote:
> > On Thu, Sep 14, 2023 at 04:15:54AM -0300, Leonardo Bras wrote:
> >> So, IIUC, the xfeatures from the source guest will be different than the 
> >> xfeatures of the target (destination) guest. Is that correct?
> > 
> > Correct.
> >  
> >> It does not seem right to me. I mean, from the guest viewpoint, some 
> >> features will simply vanish during execution, and this could lead to major 
> >> issues in the guest.
> 
> I fully agree with this.
> 
> I think the original commit ad856280ddea ("x86/kvm/fpu: Limit guest
> user_xfeatures to supported bits of XCR0") is for the source server, not
> destination server.
> 
> That is:
> 
> 1. Without the commit (src and dst), something bad may happen.
> 
> 2. With the commit on src, issue is fixed.
> 
> 3. With the commit only dst, it is expected that issue is not fixed.
> 
> Therefore, from administrator's perspective, the bugfix should always be applied
> no the source server, in order to succeed the migration.
> 
> 
> BTW, we may not be able to use commit ad856280ddea in the Fixes tag.
> 
> > 
> > My assumption is that the guest CPU model should confine access to registers
> > that make sense for that (guest) CPU.
> > 
> > e.g., take a host CPU capable of AVX-512 running a guest CPU model that only
> > has AVX-256. If the guest suddenly loses the top 256 bits of %zmm*, it should
> > not really be perceivable as %ymm architecturally remains unchanged.
> > 
> > Though maybe I'm being too rash here? Is there a case where this assumption
> > breaks down?
> > 
> >> The idea here is that if the target (destination) host can't provide those 
> >> features for the guest, then migration should fail.
> >>
> >> I mean, qemu should fail the migration, and that's correct behavior.
> >> Is it what is happening?
> > 
> > Unfortunately, no, it is not... and that is biggest concern right now.
> > 
> > I do see some discussion between Peter and you on this topic and see that
> > there was an RFC to implement such behavior stemming from it, here:
> > https://lore.kernel.org/qemu-devel/20220607230645.53950-1-peterx@redhat.com/
> 
> I agree that bug is at QEMU side, not KVM side.
> 
> It is better to improve at QEMU side.

I agree fixing QEMU is the best solution we have.

> 
> 4508 int kvm_arch_put_registers(CPUState *cpu, int level)
> 4509 {
> 4510     X86CPU *x86_cpu = X86_CPU(cpu);
> 4511     int ret;
> ... ...
> 4546     ret = kvm_put_xsave(x86_cpu);
> 4547     if (ret < 0) {
> 4548         return ret;
> 4549     }
> ... ...--> the rest of kvm_arch_put_registers() won't execute !!!
> 
> Thank you very much!
> 
> Dongli Zhang
> 
> > 
> > ... though I do not believe that work ever landed in the tree. Looking at
> > qemu's master branch now, the error from kvm_arch_put_registers is just
> > discarded in do_kvm_cpu_synchronize_post_init...
> > 
> > ```
> > static void do_kvm_cpu_synchronize_post_init(CPUState *cpu, run_on_cpu_data arg)
> > {
> >     kvm_arch_put_registers(cpu, KVM_PUT_FULL_STATE);
> >     cpu->vcpu_dirty = false;
> > }
> > ```
> > 
> > Best,
> > Tyler
> 


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/kvm: Account for fpstate->user_xfeatures changes
  2023-09-15  0:58       ` Tyler Stachecki
@ 2023-09-15  7:41         ` Leonardo Bras
  2023-09-15 12:27           ` Tyler Stachecki
  0 siblings, 1 reply; 16+ messages in thread
From: Leonardo Bras @ 2023-09-15  7:41 UTC (permalink / raw)
  To: Tyler Stachecki
  Cc: Dongli Zhang, kvm, seanjc, pbonzini, dgilbert, tglx, mingo,
	dave.hansen, bp, Tyler Stachecki, stable

On Thu, Sep 14, 2023 at 08:58:42PM -0400, Tyler Stachecki wrote:
> On Thu, Sep 14, 2023 at 10:05:57AM -0700, Dongli Zhang wrote:
> > That is:
> > 
> > 1. Without the commit (src and dst), something bad may happen.
> > 
> > 2. With the commit on src, issue is fixed.
> > 
> > 3. With the commit only dst, it is expected that issue is not fixed.
> > 
> > Therefore, from administrator's perspective, the bugfix should always be applied
> > no the source server, in order to succeed the migration.
> 
> I fully agree. Though, I think this boils down to:
> The commit must be on the source or something bad may happen.
> 
> It then follows that you cannot live-migrate guests off the source to patch it
> without potentially corrupting the guests currently running on that source...

Well, the bug was a real bad issue, and even the solution does not solve 
all problems.

As we discussed, there is no way of safely removing any feature from the 
guest without potential issues. One potential solution would be having 
hosts that implement the missing guest features needed for the VMs, but 
this may be far from easy depending on the missing feature.

Other than that, all I can think of is removing the features from guest:

As you commented, there may be some features that would not be a problem 
to be removed, and also there may be features which are not used by the 
workload, and could be removed. But this would depend on the feature, and 
the workload, beind a custom solution for every case.

For this (removing guest features), from kernel side, I would suggest using 
SystemTap (and eBPF, IIRC). The procedures should be something like:
- Try to migrate VM from host with older kernel: fail
- Look at qemu error, which features are missing?
- Are those features safely removable from guest ? 
  - If so, get an SystemTap / eBPF script masking out the undesired bits.
  - Try the migration again, it should succeed.

IIRC, this could also be done in qemu side, with a custom qemu:
- Try to migrate VM from host with older kernel: fail
- Look at qemu error, which features are missing?
- Are those features safely removable from guest ?
  - If so, get a custom qemu which mask-out the desired flags before the VM 
    starts
  - Live migrate (can be inside the source host) to the custom qemu
  - Live migrate from custom qemu to target host.
- The custom qemu could be on a auxiliary host, and used only for this

Yes, it's hard, takes time, and may not solve every case, but it gets a 
higher chance of the VM surviving in the long run.

But keep in mind this is a hack.
Taking features from a live guest is not supported in any way, and has a 
high chance of crashing the VM.

Best regards,
Leo

> 
> Regards,
> Tyler
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/kvm: Account for fpstate->user_xfeatures changes
  2023-09-15  7:41         ` Leonardo Bras
@ 2023-09-15 12:27           ` Tyler Stachecki
  2023-09-25 21:26             ` Sean Christopherson
  0 siblings, 1 reply; 16+ messages in thread
From: Tyler Stachecki @ 2023-09-15 12:27 UTC (permalink / raw)
  To: Leonardo Bras
  Cc: Dongli Zhang, kvm, seanjc, pbonzini, dgilbert, tglx, mingo,
	dave.hansen, bp, Tyler Stachecki, stable

On Fri, Sep 15, 2023 at 04:41:20AM -0300, Leonardo Bras wrote:
> Other than that, all I can think of is removing the features from guest:
> 
> As you commented, there may be some features that would not be a problem 
> to be removed, and also there may be features which are not used by the 
> workload, and could be removed. But this would depend on the feature, and 
> the workload, beind a custom solution for every case.

Yes, the "fixup back" should be refined to pointed and verified cases.

> For this (removing guest features), from kernel side, I would suggest using 
> SystemTap (and eBPF, IIRC). The procedures should be something like:
> - Try to migrate VM from host with older kernel: fail
> - Look at qemu error, which features are missing?
> - Are those features safely removable from guest ? 
>   - If so, get an SystemTap / eBPF script masking out the undesired bits.
>   - Try the migration again, it should succeed.
> 
> IIRC, this could also be done in qemu side, with a custom qemu:
> - Try to migrate VM from host with older kernel: fail
> - Look at qemu error, which features are missing?
> - Are those features safely removable from guest ?
>   - If so, get a custom qemu which mask-out the desired flags before the VM 
>     starts
>   - Live migrate (can be inside the source host) to the custom qemu
>   - Live migrate from custom qemu to target host.
> - The custom qemu could be on a auxiliary host, and used only for this
> 
> Yes, it's hard, takes time, and may not solve every case, but it gets a 
> higher chance of the VM surviving in the long run.

Thank you for taking the time to throughly consider the issue and suggest some
ways out - I really appreciate it.

> But keep in mind this is a hack.
> Taking features from a live guest is not supported in any way, and has a 
> high chance of crashing the VM.

OK - if there's no interest in the below, I will not push for including this
patch in the kernel tree any longer. I do think the specific case below is what
a vast majority of KVM users will struggle with in the near future, though:

I have a test environment with Broadwell-based (have only AVX-256) guests
running under Skylake (PKRU, AVX512, ...) hypervisors.

I added some pr_debug statements to a guest kernel running under a hypervisor,
with said hypervisor containing neither your nor my patches, and printed the
guests view of `fpu_kernel_cfg.max_features` at boot. It was 0x7, or:
  XFEATURE_MASK_FP, XFEATURE_MASK_SSE, XFEATURE_MASK_YMM

Thus, I'm pretty sure that all that's happening here is that the guest's FP
context is having PKRU/ZMM. saved and restored needlessly by the hypervisor.
Stripping it on a live-migration does not seem to have any ill-effects in
all the testing I have done.

Cheers,
Tyler

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/kvm: Account for fpstate->user_xfeatures changes
  2023-09-15 12:27           ` Tyler Stachecki
@ 2023-09-25 21:26             ` Sean Christopherson
  2023-09-26  3:02               ` Tyler Stachecki
  0 siblings, 1 reply; 16+ messages in thread
From: Sean Christopherson @ 2023-09-25 21:26 UTC (permalink / raw)
  To: Tyler Stachecki
  Cc: Leonardo Bras, Dongli Zhang, kvm, pbonzini, dgilbert, tglx, mingo,
	dave.hansen, bp, Tyler Stachecki, stable

On Fri, Sep 15, 2023, Tyler Stachecki wrote:
> On Fri, Sep 15, 2023 at 04:41:20AM -0300, Leonardo Bras wrote:
> > Other than that, all I can think of is removing the features from guest:
> > 
> > As you commented, there may be some features that would not be a problem 
> > to be removed, and also there may be features which are not used by the 
> > workload, and could be removed. But this would depend on the feature, and 
> > the workload, beind a custom solution for every case.
> 
> Yes, the "fixup back" should be refined to pointed and verified cases.
>  
> > For this (removing guest features), from kernel side, I would suggest using 
> > SystemTap (and eBPF, IIRC). The procedures should be something like:
> > - Try to migrate VM from host with older kernel: fail
> > - Look at qemu error, which features are missing?
> > - Are those features safely removable from guest ? 
> >   - If so, get an SystemTap / eBPF script masking out the undesired bits.
> >   - Try the migration again, it should succeed.
> > 
> > IIRC, this could also be done in qemu side, with a custom qemu:
> > - Try to migrate VM from host with older kernel: fail
> > - Look at qemu error, which features are missing?
> > - Are those features safely removable from guest ?
> >   - If so, get a custom qemu which mask-out the desired flags before the VM 
> >     starts
> >   - Live migrate (can be inside the source host) to the custom qemu
> >   - Live migrate from custom qemu to target host.
> > - The custom qemu could be on a auxiliary host, and used only for this
> > 
> > Yes, it's hard, takes time, and may not solve every case, but it gets a 
> > higher chance of the VM surviving in the long run.
> 
> Thank you for taking the time to throughly consider the issue and suggest some
> ways out - I really appreciate it.
> 
> > But keep in mind this is a hack.
> > Taking features from a live guest is not supported in any way, and has a 
> > high chance of crashing the VM.
>
> OK - if there's no interest in the below, I will not push for including this
> patch in the kernel tree any longer. I do think the specific case below is what
> a vast majority of KVM users will struggle with in the near future, though:
>
> I have a test environment with Broadwell-based (have only AVX-256) guests
> running under Skylake (PKRU, AVX512, ...) hypervisors.

I definitely don't want to take the proposed patch.  As Leo pointed out, silently
dropping features that userspace explicitly requests is a recipe for disaster.

However, I do agree with Tyler that is an egregious kernel/KVM bug, as essentially
requiring KVM_SET_XSAVE to be a subset of guest supported XCR0, i.e. guest CPUID,
is a clearcut breakage of userspace.  KVM_SET_XSAVE worked on kernel X and failed
on kernel X+1, there's really no wiggle room there.

Luckily, I'm pretty sure there's no need to take features away from the guest in
order to fix the bug Tyler is experiencing.  Prior to commit ad856280ddea, KVM's
ABI was that KVM_SET_SAVE just needs a subset of the *host* features, i.e. this
chunk from the changelog simply needs to be undone:

    As a bonus, it will also fail if userspace tries to set fpu features
    (with the KVM_SET_XSAVE ioctl) that are not compatible to the guest
    configuration.  Such features will never be returned by KVM_GET_XSAVE
    or KVM_GET_XSAVE2.

That can be done by applying guest_supported_xcr0 to *only* the KVM_GET_XSAVE{2}
path.  It's not ideal since it means that KVM_GET_XSAVE{2} won't be consistent
with the guest model if userspace does KVM_GET_XSAVE{2} before KVM_SET_CPUID, but
practically speaking I don't think there's a real world userspace VMM that does
that.

Compile tested only, and it needs a changelog, but I think this will do the trick:

---
 arch/x86/include/asm/fpu/api.h |  3 ++-
 arch/x86/kernel/fpu/core.c     |  5 +++--
 arch/x86/kernel/fpu/xstate.c   |  7 +++++--
 arch/x86/kernel/fpu/xstate.h   |  3 ++-
 arch/x86/kvm/cpuid.c           |  8 --------
 arch/x86/kvm/x86.c             | 37 ++++++++++++++++++++++------------
 6 files changed, 36 insertions(+), 27 deletions(-)

diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
index 31089b851c4f..a2be3aefff9f 100644
--- a/arch/x86/include/asm/fpu/api.h
+++ b/arch/x86/include/asm/fpu/api.h
@@ -157,7 +157,8 @@ static inline void fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd) {
 static inline void fpu_sync_guest_vmexit_xfd_state(void) { }
 #endif
 
-extern void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf, unsigned int size, u32 pkru);
+extern void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf,
+					   unsigned int size, u64 xfeatures, u32 pkru);
 extern int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf, u64 xcr0, u32 *vpkru);
 
 static inline void fpstate_set_confidential(struct fpu_guest *gfpu)
diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
index a86d37052a64..a21a4d0ecc34 100644
--- a/arch/x86/kernel/fpu/core.c
+++ b/arch/x86/kernel/fpu/core.c
@@ -369,14 +369,15 @@ int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
 EXPORT_SYMBOL_GPL(fpu_swap_kvm_fpstate);
 
 void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf,
-				    unsigned int size, u32 pkru)
+				    unsigned int size, u64 xfeatures, u32 pkru)
 {
 	struct fpstate *kstate = gfpu->fpstate;
 	union fpregs_state *ustate = buf;
 	struct membuf mb = { .p = buf, .left = size };
 
 	if (cpu_feature_enabled(X86_FEATURE_XSAVE)) {
-		__copy_xstate_to_uabi_buf(mb, kstate, pkru, XSTATE_COPY_XSAVE);
+		__copy_xstate_to_uabi_buf(mb, kstate, xfeatures, pkru,
+					  XSTATE_COPY_XSAVE);
 	} else {
 		memcpy(&ustate->fxsave, &kstate->regs.fxsave,
 		       sizeof(ustate->fxsave));
diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
index cadf68737e6b..7d31033d176e 100644
--- a/arch/x86/kernel/fpu/xstate.c
+++ b/arch/x86/kernel/fpu/xstate.c
@@ -1049,6 +1049,7 @@ static void copy_feature(bool from_xstate, struct membuf *to, void *xstate,
  * __copy_xstate_to_uabi_buf - Copy kernel saved xstate to a UABI buffer
  * @to:		membuf descriptor
  * @fpstate:	The fpstate buffer from which to copy
+ * @xfeatures:	Constraint which of user xfeatures to save (XSAVE only)
  * @pkru_val:	The PKRU value to store in the PKRU component
  * @copy_mode:	The requested copy mode
  *
@@ -1059,7 +1060,8 @@ static void copy_feature(bool from_xstate, struct membuf *to, void *xstate,
  * It supports partial copy but @to.pos always starts from zero.
  */
 void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
-			       u32 pkru_val, enum xstate_copy_mode copy_mode)
+			       u64 xfeatures, u32 pkru_val,
+			       enum xstate_copy_mode copy_mode)
 {
 	const unsigned int off_mxcsr = offsetof(struct fxregs_state, mxcsr);
 	struct xregs_state *xinit = &init_fpstate.regs.xsave;
@@ -1083,7 +1085,7 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
 		break;
 
 	case XSTATE_COPY_XSAVE:
-		header.xfeatures &= fpstate->user_xfeatures;
+		header.xfeatures &= fpstate->user_xfeatures & xfeatures;
 		break;
 	}
 
@@ -1185,6 +1187,7 @@ void copy_xstate_to_uabi_buf(struct membuf to, struct task_struct *tsk,
 			     enum xstate_copy_mode copy_mode)
 {
 	__copy_xstate_to_uabi_buf(to, tsk->thread.fpu.fpstate,
+				  tsk->thread.fpu.fpstate->user_xfeatures,
 				  tsk->thread.pkru, copy_mode);
 }
 
diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
index a4ecb04d8d64..3518fb26d06b 100644
--- a/arch/x86/kernel/fpu/xstate.h
+++ b/arch/x86/kernel/fpu/xstate.h
@@ -43,7 +43,8 @@ enum xstate_copy_mode {
 
 struct membuf;
 extern void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
-				      u32 pkru_val, enum xstate_copy_mode copy_mode);
+				      u64 xfeatures, u32 pkru_val,
+				      enum xstate_copy_mode copy_mode);
 extern void copy_xstate_to_uabi_buf(struct membuf to, struct task_struct *tsk,
 				    enum xstate_copy_mode mode);
 extern int copy_uabi_from_kernel_to_xstate(struct fpstate *fpstate, const void *kbuf, u32 *pkru);
diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
index 0544e30b4946..773132c3bf5a 100644
--- a/arch/x86/kvm/cpuid.c
+++ b/arch/x86/kvm/cpuid.c
@@ -360,14 +360,6 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
 	vcpu->arch.guest_supported_xcr0 =
 		cpuid_get_supported_xcr0(vcpu->arch.cpuid_entries, vcpu->arch.cpuid_nent);
 
-	/*
-	 * FP+SSE can always be saved/restored via KVM_{G,S}ET_XSAVE, even if
-	 * XSAVE/XCRO are not exposed to the guest, and even if XSAVE isn't
-	 * supported by the host.
-	 */
-	vcpu->arch.guest_fpu.fpstate->user_xfeatures = vcpu->arch.guest_supported_xcr0 |
-						       XFEATURE_MASK_FPSSE;
-
 	kvm_update_pv_runtime(vcpu);
 
 	vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9f18b06bbda6..734e2d69329b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5382,26 +5382,37 @@ static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu,
 	return 0;
 }
 
-static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
-					 struct kvm_xsave *guest_xsave)
-{
-	if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
-		return;
-
-	fpu_copy_guest_fpstate_to_uabi(&vcpu->arch.guest_fpu,
-				       guest_xsave->region,
-				       sizeof(guest_xsave->region),
-				       vcpu->arch.pkru);
-}
 
 static void kvm_vcpu_ioctl_x86_get_xsave2(struct kvm_vcpu *vcpu,
 					  u8 *state, unsigned int size)
 {
+	/*
+	 * Only copy state for features that are enabled for the guest.  The
+	 * state itself isn't problematic, but setting bits in the header for
+	 * features that are supported in *this* host but not exposed to the
+	 * guest can result in KVM_SET_XSAVE failing when live migrating to a
+	 * compatible host, i.e. a host without the features that are NOT
+	 * exposed to the guest.
+	 *
+	 * FP+SSE can always be saved/restored via KVM_{G,S}ET_XSAVE, even if
+	 * XSAVE/XCRO are not exposed to the guest, and even if XSAVE isn't
+	 * supported by the host.
+	 */
+	u64 supported_xcr0 = vcpu->arch.guest_supported_xcr0 |
+			     XFEATURE_MASK_FPSSE;
+
 	if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
 		return;
 
-	fpu_copy_guest_fpstate_to_uabi(&vcpu->arch.guest_fpu,
-				       state, size, vcpu->arch.pkru);
+	fpu_copy_guest_fpstate_to_uabi(&vcpu->arch.guest_fpu, state, size,
+				       supported_xcr0, vcpu->arch.pkru);
+}
+
+static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
+					 struct kvm_xsave *guest_xsave)
+{
+	return kvm_vcpu_ioctl_x86_get_xsave2(vcpu, (void *)guest_xsave->region,
+					     sizeof(guest_xsave->region));
 }
 
 static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,

base-commit: 5804c19b80bf625c6a9925317f845e497434d6d3
-- 


^ permalink raw reply related	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/kvm: Account for fpstate->user_xfeatures changes
  2023-09-25 21:26             ` Sean Christopherson
@ 2023-09-26  3:02               ` Tyler Stachecki
  2023-09-26 16:31                 ` Sean Christopherson
  0 siblings, 1 reply; 16+ messages in thread
From: Tyler Stachecki @ 2023-09-26  3:02 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: Leonardo Bras, Dongli Zhang, kvm, pbonzini, dgilbert, tglx, mingo,
	dave.hansen, bp, Tyler Stachecki, stable

On Mon, Sep 25, 2023 at 02:26:47PM -0700, Sean Christopherson wrote:
> On Fri, Sep 15, 2023, Tyler Stachecki wrote:
> > On Fri, Sep 15, 2023 at 04:41:20AM -0300, Leonardo Bras wrote:
> > > Other than that, all I can think of is removing the features from guest:
> > > 
> > > As you commented, there may be some features that would not be a problem 
> > > to be removed, and also there may be features which are not used by the 
> > > workload, and could be removed. But this would depend on the feature, and 
> > > the workload, beind a custom solution for every case.
> > 
> > Yes, the "fixup back" should be refined to pointed and verified cases.
> >  
> > > For this (removing guest features), from kernel side, I would suggest using 
> > > SystemTap (and eBPF, IIRC). The procedures should be something like:
> > > - Try to migrate VM from host with older kernel: fail
> > > - Look at qemu error, which features are missing?
> > > - Are those features safely removable from guest ? 
> > >   - If so, get an SystemTap / eBPF script masking out the undesired bits.
> > >   - Try the migration again, it should succeed.
> > > 
> > > IIRC, this could also be done in qemu side, with a custom qemu:
> > > - Try to migrate VM from host with older kernel: fail
> > > - Look at qemu error, which features are missing?
> > > - Are those features safely removable from guest ?
> > >   - If so, get a custom qemu which mask-out the desired flags before the VM 
> > >     starts
> > >   - Live migrate (can be inside the source host) to the custom qemu
> > >   - Live migrate from custom qemu to target host.
> > > - The custom qemu could be on a auxiliary host, and used only for this
> > > 
> > > Yes, it's hard, takes time, and may not solve every case, but it gets a 
> > > higher chance of the VM surviving in the long run.
> > 
> > Thank you for taking the time to throughly consider the issue and suggest some
> > ways out - I really appreciate it.
> > 
> > > But keep in mind this is a hack.
> > > Taking features from a live guest is not supported in any way, and has a 
> > > high chance of crashing the VM.
> >
> > OK - if there's no interest in the below, I will not push for including this
> > patch in the kernel tree any longer. I do think the specific case below is what
> > a vast majority of KVM users will struggle with in the near future, though:
> >
> > I have a test environment with Broadwell-based (have only AVX-256) guests
> > running under Skylake (PKRU, AVX512, ...) hypervisors.
> 
> I definitely don't want to take the proposed patch.  As Leo pointed out, silently
> dropping features that userspace explicitly requests is a recipe for disaster.
> 
> However, I do agree with Tyler that is an egregious kernel/KVM bug, as essentially
> requiring KVM_SET_XSAVE to be a subset of guest supported XCR0, i.e. guest CPUID,
> is a clearcut breakage of userspace.  KVM_SET_XSAVE worked on kernel X and failed
> on kernel X+1, there's really no wiggle room there.
> 
> Luckily, I'm pretty sure there's no need to take features away from the guest in
> order to fix the bug Tyler is experiencing.  Prior to commit ad856280ddea, KVM's
> ABI was that KVM_SET_SAVE just needs a subset of the *host* features, i.e. this
> chunk from the changelog simply needs to be undone:
> 
>     As a bonus, it will also fail if userspace tries to set fpu features
>     (with the KVM_SET_XSAVE ioctl) that are not compatible to the guest
>     configuration.  Such features will never be returned by KVM_GET_XSAVE
>     or KVM_GET_XSAVE2.
> 
> That can be done by applying guest_supported_xcr0 to *only* the KVM_GET_XSAVE{2}
> path.  It's not ideal since it means that KVM_GET_XSAVE{2} won't be consistent
> with the guest model if userspace does KVM_GET_XSAVE{2} before KVM_SET_CPUID, but
> practically speaking I don't think there's a real world userspace VMM that does
> that.
> 
> Compile tested only, and it needs a changelog, but I think this will do the trick:
> 
> ---
>  arch/x86/include/asm/fpu/api.h |  3 ++-
>  arch/x86/kernel/fpu/core.c     |  5 +++--
>  arch/x86/kernel/fpu/xstate.c   |  7 +++++--
>  arch/x86/kernel/fpu/xstate.h   |  3 ++-
>  arch/x86/kvm/cpuid.c           |  8 --------
>  arch/x86/kvm/x86.c             | 37 ++++++++++++++++++++++------------
>  6 files changed, 36 insertions(+), 27 deletions(-)
> 
> diff --git a/arch/x86/include/asm/fpu/api.h b/arch/x86/include/asm/fpu/api.h
> index 31089b851c4f..a2be3aefff9f 100644
> --- a/arch/x86/include/asm/fpu/api.h
> +++ b/arch/x86/include/asm/fpu/api.h
> @@ -157,7 +157,8 @@ static inline void fpu_update_guest_xfd(struct fpu_guest *guest_fpu, u64 xfd) {
>  static inline void fpu_sync_guest_vmexit_xfd_state(void) { }
>  #endif
>  
> -extern void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf, unsigned int size, u32 pkru);
> +extern void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf,
> +					   unsigned int size, u64 xfeatures, u32 pkru);
>  extern int fpu_copy_uabi_to_guest_fpstate(struct fpu_guest *gfpu, const void *buf, u64 xcr0, u32 *vpkru);
>  
>  static inline void fpstate_set_confidential(struct fpu_guest *gfpu)
> diff --git a/arch/x86/kernel/fpu/core.c b/arch/x86/kernel/fpu/core.c
> index a86d37052a64..a21a4d0ecc34 100644
> --- a/arch/x86/kernel/fpu/core.c
> +++ b/arch/x86/kernel/fpu/core.c
> @@ -369,14 +369,15 @@ int fpu_swap_kvm_fpstate(struct fpu_guest *guest_fpu, bool enter_guest)
>  EXPORT_SYMBOL_GPL(fpu_swap_kvm_fpstate);
>  
>  void fpu_copy_guest_fpstate_to_uabi(struct fpu_guest *gfpu, void *buf,
> -				    unsigned int size, u32 pkru)
> +				    unsigned int size, u64 xfeatures, u32 pkru)
>  {
>  	struct fpstate *kstate = gfpu->fpstate;
>  	union fpregs_state *ustate = buf;
>  	struct membuf mb = { .p = buf, .left = size };
>  
>  	if (cpu_feature_enabled(X86_FEATURE_XSAVE)) {
> -		__copy_xstate_to_uabi_buf(mb, kstate, pkru, XSTATE_COPY_XSAVE);
> +		__copy_xstate_to_uabi_buf(mb, kstate, xfeatures, pkru,
> +					  XSTATE_COPY_XSAVE);
>  	} else {
>  		memcpy(&ustate->fxsave, &kstate->regs.fxsave,
>  		       sizeof(ustate->fxsave));
> diff --git a/arch/x86/kernel/fpu/xstate.c b/arch/x86/kernel/fpu/xstate.c
> index cadf68737e6b..7d31033d176e 100644
> --- a/arch/x86/kernel/fpu/xstate.c
> +++ b/arch/x86/kernel/fpu/xstate.c
> @@ -1049,6 +1049,7 @@ static void copy_feature(bool from_xstate, struct membuf *to, void *xstate,
>   * __copy_xstate_to_uabi_buf - Copy kernel saved xstate to a UABI buffer
>   * @to:		membuf descriptor
>   * @fpstate:	The fpstate buffer from which to copy
> + * @xfeatures:	Constraint which of user xfeatures to save (XSAVE only)
>   * @pkru_val:	The PKRU value to store in the PKRU component
>   * @copy_mode:	The requested copy mode
>   *
> @@ -1059,7 +1060,8 @@ static void copy_feature(bool from_xstate, struct membuf *to, void *xstate,
>   * It supports partial copy but @to.pos always starts from zero.
>   */
>  void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
> -			       u32 pkru_val, enum xstate_copy_mode copy_mode)
> +			       u64 xfeatures, u32 pkru_val,
> +			       enum xstate_copy_mode copy_mode)
>  {
>  	const unsigned int off_mxcsr = offsetof(struct fxregs_state, mxcsr);
>  	struct xregs_state *xinit = &init_fpstate.regs.xsave;
> @@ -1083,7 +1085,7 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
>  		break;
>  
>  	case XSTATE_COPY_XSAVE:
> -		header.xfeatures &= fpstate->user_xfeatures;
> +		header.xfeatures &= fpstate->user_xfeatures & xfeatures;
>  		break;

This changes the consideration of the xfeatures copied *into* the uabi buffer
with respect to the guest xfeatures IIUC (approx guest XCR0, less FP/SSE only).

IOW: we are still trimming guest xfeatures, just at the source...?

That being said: the patch that I gave siliently allows unacceptable things to
be accepted at the destination, whereas this maintains status-quo and signals
an error when the destination cannot wholly process the uabi buffer (i.e.,
asked to restore more state than the destination processor has present).

The downside of my approach is above -- the flip side, though is that this
approach requires a patch to be applied on the source. However, we cannot
apply a patch at the source until it is evacuated of VMs -- chicken and egg
problem...

Unless I am grossly misunderstanding things here -- forgive me... :-)

It almost feels like userspace needs a flag to say: yes, old pre-Leo's-patched
kernel was broken and sent more state than you might need or want -- eat what
you can by default. If an additional flag is set, be conservative and ensure
that you are capable of restoring the xfeatures specified in the uabi buffer.

Cheers,
Tyler

>  	}
>  
> @@ -1185,6 +1187,7 @@ void copy_xstate_to_uabi_buf(struct membuf to, struct task_struct *tsk,
>  			     enum xstate_copy_mode copy_mode)
>  {
>  	__copy_xstate_to_uabi_buf(to, tsk->thread.fpu.fpstate,
> +				  tsk->thread.fpu.fpstate->user_xfeatures,
>  				  tsk->thread.pkru, copy_mode);
>  }
>  
> diff --git a/arch/x86/kernel/fpu/xstate.h b/arch/x86/kernel/fpu/xstate.h
> index a4ecb04d8d64..3518fb26d06b 100644
> --- a/arch/x86/kernel/fpu/xstate.h
> +++ b/arch/x86/kernel/fpu/xstate.h
> @@ -43,7 +43,8 @@ enum xstate_copy_mode {
>  
>  struct membuf;
>  extern void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
> -				      u32 pkru_val, enum xstate_copy_mode copy_mode);
> +				      u64 xfeatures, u32 pkru_val,
> +				      enum xstate_copy_mode copy_mode);
>  extern void copy_xstate_to_uabi_buf(struct membuf to, struct task_struct *tsk,
>  				    enum xstate_copy_mode mode);
>  extern int copy_uabi_from_kernel_to_xstate(struct fpstate *fpstate, const void *kbuf, u32 *pkru);
> diff --git a/arch/x86/kvm/cpuid.c b/arch/x86/kvm/cpuid.c
> index 0544e30b4946..773132c3bf5a 100644
> --- a/arch/x86/kvm/cpuid.c
> +++ b/arch/x86/kvm/cpuid.c
> @@ -360,14 +360,6 @@ static void kvm_vcpu_after_set_cpuid(struct kvm_vcpu *vcpu)
>  	vcpu->arch.guest_supported_xcr0 =
>  		cpuid_get_supported_xcr0(vcpu->arch.cpuid_entries, vcpu->arch.cpuid_nent);
>  
> -	/*
> -	 * FP+SSE can always be saved/restored via KVM_{G,S}ET_XSAVE, even if
> -	 * XSAVE/XCRO are not exposed to the guest, and even if XSAVE isn't
> -	 * supported by the host.
> -	 */
> -	vcpu->arch.guest_fpu.fpstate->user_xfeatures = vcpu->arch.guest_supported_xcr0 |
> -						       XFEATURE_MASK_FPSSE;
> -
>  	kvm_update_pv_runtime(vcpu);
>  
>  	vcpu->arch.maxphyaddr = cpuid_query_maxphyaddr(vcpu);
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 9f18b06bbda6..734e2d69329b 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5382,26 +5382,37 @@ static int kvm_vcpu_ioctl_x86_set_debugregs(struct kvm_vcpu *vcpu,
>  	return 0;
>  }
>  
> -static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
> -					 struct kvm_xsave *guest_xsave)
> -{
> -	if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
> -		return;
> -
> -	fpu_copy_guest_fpstate_to_uabi(&vcpu->arch.guest_fpu,
> -				       guest_xsave->region,
> -				       sizeof(guest_xsave->region),
> -				       vcpu->arch.pkru);
> -}
>  
>  static void kvm_vcpu_ioctl_x86_get_xsave2(struct kvm_vcpu *vcpu,
>  					  u8 *state, unsigned int size)
>  {
> +	/*
> +	 * Only copy state for features that are enabled for the guest.  The
> +	 * state itself isn't problematic, but setting bits in the header for
> +	 * features that are supported in *this* host but not exposed to the
> +	 * guest can result in KVM_SET_XSAVE failing when live migrating to a
> +	 * compatible host, i.e. a host without the features that are NOT
> +	 * exposed to the guest.
> +	 *
> +	 * FP+SSE can always be saved/restored via KVM_{G,S}ET_XSAVE, even if
> +	 * XSAVE/XCRO are not exposed to the guest, and even if XSAVE isn't
> +	 * supported by the host.
> +	 */
> +	u64 supported_xcr0 = vcpu->arch.guest_supported_xcr0 |
> +			     XFEATURE_MASK_FPSSE;
> +
>  	if (fpstate_is_confidential(&vcpu->arch.guest_fpu))
>  		return;
>  
> -	fpu_copy_guest_fpstate_to_uabi(&vcpu->arch.guest_fpu,
> -				       state, size, vcpu->arch.pkru);
> +	fpu_copy_guest_fpstate_to_uabi(&vcpu->arch.guest_fpu, state, size,
> +				       supported_xcr0, vcpu->arch.pkru);
> +}
> +
> +static void kvm_vcpu_ioctl_x86_get_xsave(struct kvm_vcpu *vcpu,
> +					 struct kvm_xsave *guest_xsave)
> +{
> +	return kvm_vcpu_ioctl_x86_get_xsave2(vcpu, (void *)guest_xsave->region,
> +					     sizeof(guest_xsave->region));
>  }
>  
>  static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
> 
> base-commit: 5804c19b80bf625c6a9925317f845e497434d6d3
> -- 
> 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/kvm: Account for fpstate->user_xfeatures changes
  2023-09-26  3:02               ` Tyler Stachecki
@ 2023-09-26 16:31                 ` Sean Christopherson
  2023-09-26 17:31                   ` Sean Christopherson
  2023-09-26 19:22                   ` Sean Christopherson
  0 siblings, 2 replies; 16+ messages in thread
From: Sean Christopherson @ 2023-09-26 16:31 UTC (permalink / raw)
  To: Tyler Stachecki
  Cc: Leonardo Bras, Dongli Zhang, kvm, pbonzini, dgilbert, tglx, mingo,
	dave.hansen, bp, Tyler Stachecki, stable

On Mon, Sep 25, 2023, Tyler Stachecki wrote:
> On Mon, Sep 25, 2023 at 02:26:47PM -0700, Sean Christopherson wrote:
> > On Fri, Sep 15, 2023, Tyler Stachecki wrote:
> > > OK - if there's no interest in the below, I will not push for including this
> > > patch in the kernel tree any longer. I do think the specific case below is what
> > > a vast majority of KVM users will struggle with in the near future, though:
> > >
> > > I have a test environment with Broadwell-based (have only AVX-256) guests
> > > running under Skylake (PKRU, AVX512, ...) hypervisors.
> > 
> > I definitely don't want to take the proposed patch.  As Leo pointed out, silently
> > dropping features that userspace explicitly requests is a recipe for disaster.
> > 
> > However, I do agree with Tyler that is an egregious kernel/KVM bug, as essentially
> > requiring KVM_SET_XSAVE to be a subset of guest supported XCR0, i.e. guest CPUID,
> > is a clearcut breakage of userspace.  KVM_SET_XSAVE worked on kernel X and failed
> > on kernel X+1, there's really no wiggle room there.
> > 
> > Luckily, I'm pretty sure there's no need to take features away from the guest in
> > order to fix the bug Tyler is experiencing.  Prior to commit ad856280ddea, KVM's
> > ABI was that KVM_SET_SAVE just needs a subset of the *host* features, i.e. this
> > chunk from the changelog simply needs to be undone:
> > 
> >     As a bonus, it will also fail if userspace tries to set fpu features
> >     (with the KVM_SET_XSAVE ioctl) that are not compatible to the guest
> >     configuration.  Such features will never be returned by KVM_GET_XSAVE
> >     or KVM_GET_XSAVE2.
> > 
> > That can be done by applying guest_supported_xcr0 to *only* the KVM_GET_XSAVE{2}
> > path.  It's not ideal since it means that KVM_GET_XSAVE{2} won't be consistent
> > with the guest model if userspace does KVM_GET_XSAVE{2} before KVM_SET_CPUID, but
> > practically speaking I don't think there's a real world userspace VMM that does
> > that.
> > 
> > Compile tested only, and it needs a changelog, but I think this will do the trick:

...

> > @@ -1059,7 +1060,8 @@ static void copy_feature(bool from_xstate, struct membuf *to, void *xstate,
> >   * It supports partial copy but @to.pos always starts from zero.
> >   */
> >  void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
> > -			       u32 pkru_val, enum xstate_copy_mode copy_mode)
> > +			       u64 xfeatures, u32 pkru_val,
> > +			       enum xstate_copy_mode copy_mode)
> >  {
> >  	const unsigned int off_mxcsr = offsetof(struct fxregs_state, mxcsr);
> >  	struct xregs_state *xinit = &init_fpstate.regs.xsave;
> > @@ -1083,7 +1085,7 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
> >  		break;
> >  
> >  	case XSTATE_COPY_XSAVE:
> > -		header.xfeatures &= fpstate->user_xfeatures;
> > +		header.xfeatures &= fpstate->user_xfeatures & xfeatures;
> >  		break;
> 
> This changes the consideration of the xfeatures copied *into* the uabi buffer
> with respect to the guest xfeatures IIUC (approx guest XCR0, less FP/SSE only).
> 
> IOW: we are still trimming guest xfeatures, just at the source...?

KVM *must* "trim" features when servicing KVM_GET_SAVE{2}, because that's been KVM's
ABI for a very long time, and userspace absolutely relies on that functionality
to ensure that a VM can be migrated within a pool of heterogenous systems so long
as the features that are *exposed* to the guest are supported on all platforms.

Before the big FPU rework, KVM manually filled xregs_state and directly "trimmed"
based on the guest supported XCR0 (see fill_xsave() below).

The major difference between my proposed patch and the current code is that KVM
trims *only* at KVM_GET_XSAVE{2}, which in addition to being KVM's historical ABI,
(see load_xsave() below), ensures that the *destination* can load state irrespective
of the possibly-not-yet-defined guest CPUID model.

Masking fpstate->user_xfeatures is buggy for another reason: it's destructive if
userspace calls KVM_SET_CPUID multiple times.  No real world userspace actually
calls KVM_SET_CPUID to "expand" features, but it's technically possible and KVM
is supposed to allow it.

static void fill_xsave(u8 *dest, struct kvm_vcpu *vcpu)
{
        struct xregs_state *xsave = &vcpu->arch.guest_fpu->state.xsave;
        u64 xstate_bv = xsave->header.xfeatures;
        u64 valid;

        /*
         * Copy legacy XSAVE area, to avoid complications with CPUID
         * leaves 0 and 1 in the loop below.
         */
        memcpy(dest, xsave, XSAVE_HDR_OFFSET);

        /* Set XSTATE_BV */
        xstate_bv &= vcpu->arch.guest_supported_xcr0 | XFEATURE_MASK_FPSSE;  <= here lies the trimming
        *(u64 *)(dest + XSAVE_HDR_OFFSET) = xstate_bv;

        /*
         * Copy each region from the possibly compacted offset to the
         * non-compacted offset.
         */
        valid = xstate_bv & ~XFEATURE_MASK_FPSSE;
        while (valid) {
                u64 feature = valid & -valid;
                int index = fls64(feature) - 1;
                void *src = get_xsave_addr(xsave, feature);

                if (src) {
                        u32 size, offset, ecx, edx;
                        cpuid_count(XSTATE_CPUID, index,
                                    &size, &offset, &ecx, &edx);
                        if (feature == XFEATURE_MASK_PKRU)
                                memcpy(dest + offset, &vcpu->arch.pkru,
                                       sizeof(vcpu->arch.pkru));
                        else
                                memcpy(dest + offset, src, size);

                }

                valid -= feature;
        }
}

static void load_xsave(struct kvm_vcpu *vcpu, u8 *src)
{
        struct xregs_state *xsave = &vcpu->arch.guest_fpu->state.xsave;
        u64 xstate_bv = *(u64 *)(src + XSAVE_HDR_OFFSET);
        u64 valid;

        /*
         * Copy legacy XSAVE area, to avoid complications with CPUID
         * leaves 0 and 1 in the loop below.
         */
        memcpy(xsave, src, XSAVE_HDR_OFFSET);

        /* Set XSTATE_BV and possibly XCOMP_BV.  */
        xsave->header.xfeatures = xstate_bv;                <= features NOT limited to guest support
        if (boot_cpu_has(X86_FEATURE_XSAVES))
                xsave->header.xcomp_bv = host_xcr0 | XSTATE_COMPACTION_ENABLED;

        /*
         * Copy each region from the non-compacted offset to the
         * possibly compacted offset.
         */
        valid = xstate_bv & ~XFEATURE_MASK_FPSSE;
        while (valid) {
                u64 feature = valid & -valid;
                int index = fls64(feature) - 1;
                void *dest = get_xsave_addr(xsave, feature);

                if (dest) {
                        u32 size, offset, ecx, edx;
                        cpuid_count(XSTATE_CPUID, index,
                                    &size, &offset, &ecx, &edx);
                        if (feature == XFEATURE_MASK_PKRU)
                                memcpy(&vcpu->arch.pkru, src + offset,
                                       sizeof(vcpu->arch.pkru));
                        else
                                memcpy(dest, src + offset, size);
                }

                valid -= feature;
        }
}

static int kvm_vcpu_ioctl_x86_set_xsave(struct kvm_vcpu *vcpu,
                                        struct kvm_xsave *guest_xsave)
{
        u64 xstate_bv =
                *(u64 *)&guest_xsave->region[XSAVE_HDR_OFFSET / sizeof(u32)];
        u32 mxcsr = *(u32 *)&guest_xsave->region[XSAVE_MXCSR_OFFSET / sizeof(u32)];

        if (boot_cpu_has(X86_FEATURE_XSAVE)) {
                /*
                 * Here we allow setting states that are not present in
                 * CPUID leaf 0xD, index 0, EDX:EAX.  This is for compatibility
                 * with old userspace.
                 */
                if (xstate_bv & ~kvm_supported_xcr0() ||  <= loading more than KVM supports is disallowed
                        mxcsr & ~mxcsr_feature_mask)
                        return -EINVAL;
                load_xsave(vcpu, (u8 *)guest_xsave->region);
        } else {
                if (xstate_bv & ~XFEATURE_MASK_FPSSE ||
                        mxcsr & ~mxcsr_feature_mask)
                        return -EINVAL;
                memcpy(&vcpu->arch.guest_fpu->state.fxsave,
                        guest_xsave->region, sizeof(struct fxregs_state));
        }
        return 0;
}


> That being said: the patch that I gave siliently allows unacceptable things to
> be accepted at the destination, whereas this maintains status-quo and signals
> an error when the destination cannot wholly process the uabi buffer (i.e.,
> asked to restore more state than the destination processor has present).
> 
> The downside of my approach is above -- the flip side, though is that this
> approach requires a patch to be applied on the source. However, we cannot
> apply a patch at the source until it is evacuated of VMs -- chicken and egg
> problem...
> 
> Unless I am grossly misunderstanding things here -- forgive me... :-)

I suspect you're overlooking the impact on the destination.  Trimming features
only when saving XSAVE state into the userspace buffer means that the destination
will play nice with the "bad" snapshot.  It won't help setups where a VM is being
migrated from a host with more XSAVE features to a host with fewer XSAVE features,
but I don't see a sane way to retroactively fix that situation.  And IIUC, that's
not the situation you're in (unless the Skylake host vs. Broadwell guests is only
the test environment).

Silently dropping features would break KVM's ABI (see kvm_vcpu_ioctl_x86_set_xsave()
above).  Giving userspace a flag to opt-in is pointless because that requires a
userspace update, and if userspace needs to be modified then it's simpler to just
have userspace sanitize the xfeatures field.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/kvm: Account for fpstate->user_xfeatures changes
  2023-09-26 16:31                 ` Sean Christopherson
@ 2023-09-26 17:31                   ` Sean Christopherson
  2023-09-26 19:22                   ` Sean Christopherson
  1 sibling, 0 replies; 16+ messages in thread
From: Sean Christopherson @ 2023-09-26 17:31 UTC (permalink / raw)
  To: Tyler Stachecki
  Cc: Leonardo Bras, Dongli Zhang, kvm, pbonzini, dgilbert, tglx, mingo,
	dave.hansen, bp, Tyler Stachecki, stable

On Tue, Sep 26, 2023, Sean Christopherson wrote:
> Masking fpstate->user_xfeatures is buggy for another reason: it's destructive if
> userspace calls KVM_SET_CPUID multiple times.  No real world userspace actually
> calls KVM_SET_CPUID to "expand" features, but it's technically possible and KVM
> is supposed to allow it.

This particular bit is wrong, KVM overwrites user_xfeatures, it doesn't AND it.
I misremembered the code.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/kvm: Account for fpstate->user_xfeatures changes
  2023-09-26 16:31                 ` Sean Christopherson
  2023-09-26 17:31                   ` Sean Christopherson
@ 2023-09-26 19:22                   ` Sean Christopherson
  2023-09-26 20:27                     ` Sean Christopherson
  1 sibling, 1 reply; 16+ messages in thread
From: Sean Christopherson @ 2023-09-26 19:22 UTC (permalink / raw)
  To: Tyler Stachecki
  Cc: Leonardo Bras, Dongli Zhang, kvm, pbonzini, dgilbert, tglx, mingo,
	dave.hansen, bp, Tyler Stachecki, stable

On Tue, Sep 26, 2023, Sean Christopherson wrote:
> On Mon, Sep 25, 2023, Tyler Stachecki wrote:
> > > @@ -1059,7 +1060,8 @@ static void copy_feature(bool from_xstate, struct membuf *to, void *xstate,
> > >   * It supports partial copy but @to.pos always starts from zero.
> > >   */
> > >  void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
> > > -			       u32 pkru_val, enum xstate_copy_mode copy_mode)
> > > +			       u64 xfeatures, u32 pkru_val,
> > > +			       enum xstate_copy_mode copy_mode)
> > >  {
> > >  	const unsigned int off_mxcsr = offsetof(struct fxregs_state, mxcsr);
> > >  	struct xregs_state *xinit = &init_fpstate.regs.xsave;
> > > @@ -1083,7 +1085,7 @@ void __copy_xstate_to_uabi_buf(struct membuf to, struct fpstate *fpstate,
> > >  		break;
> > >  
> > >  	case XSTATE_COPY_XSAVE:
> > > -		header.xfeatures &= fpstate->user_xfeatures;
> > > +		header.xfeatures &= fpstate->user_xfeatures & xfeatures;
> > >  		break;
> > 
> > This changes the consideration of the xfeatures copied *into* the uabi buffer
> > with respect to the guest xfeatures IIUC (approx guest XCR0, less FP/SSE only).
> > 
> > IOW: we are still trimming guest xfeatures, just at the source...?
> 
> KVM *must* "trim" features when servicing KVM_GET_SAVE{2}, because that's been KVM's
> ABI for a very long time, and userspace absolutely relies on that functionality
> to ensure that a VM can be migrated within a pool of heterogenous systems so long
> as the features that are *exposed* to the guest are supported on all platforms.
> 
> Before the big FPU rework, KVM manually filled xregs_state and directly "trimmed"
> based on the guest supported XCR0 (see fill_xsave() below).
> 
> The major difference between my proposed patch and the current code is that KVM
> trims *only* at KVM_GET_XSAVE{2}, which in addition to being KVM's historical ABI,
> (see load_xsave() below), ensures that the *destination* can load state irrespective
> of the possibly-not-yet-defined guest CPUID model.

...

> > That being said: the patch that I gave siliently allows unacceptable things to
> > be accepted at the destination, whereas this maintains status-quo and signals
> > an error when the destination cannot wholly process the uabi buffer (i.e.,
> > asked to restore more state than the destination processor has present).
> > 
> > The downside of my approach is above -- the flip side, though is that this
> > approach requires a patch to be applied on the source. However, we cannot
> > apply a patch at the source until it is evacuated of VMs -- chicken and egg
> > problem...
> > 
> > Unless I am grossly misunderstanding things here -- forgive me... :-)
> 
> I suspect you're overlooking the impact on the destination.  Trimming features
> only when saving XSAVE state into the userspace buffer means that the destination
> will play nice with the "bad" snapshot.  It won't help setups where a VM is being
> migrated from a host with more XSAVE features to a host with fewer XSAVE features,
> but I don't see a sane way to retroactively fix that situation.  And IIUC, that's
> not the situation you're in (unless the Skylake host vs. Broadwell guests is only
> the test environment).

One clarification: this comment from Tyler's proposed patch is somewhat misleading.
 
I do think KVM should filter state only at KVM_GET_XSAVE, because otherwise any
off-by-default feature will have different behavior than on-by-default features,
e.g. restoring AMX state requires KVM_SET_CPUID, whereas every other features
depends only on host capabilities.


+       /*
+        * In previous kernels, kvm_arch_vcpu_create() set the guest's fpstate
+        * based on what the host CPU supported. Recent kernels changed this
+        * and only accept ustate containing xfeatures that the guest CPU is
+        * capable of supporting.
+        */
+       ustate->xsave.header.xfeatures &= user_xfeatures;


It's only the "realloc" path used when granting access to an off-by-default feature,
i.e. AMX, that skips initialize user_xfeatures.   __fpstate_reset() does initialize
user_xfeatures to the default set for guest state, i.e. the problem is not in
kvm_arch_vcpu_create().

The problem is specifically that KVM rejects KVM_SET_XSAVE if userspace limits
guest supported xfeatures via KVM_SET_CPUID.  In a perfect world, that would be
desirable, e.g. KVM's ABI when loading MSRs from userspace is to (mostly) honor
guest CPUID, but again this is a clear breakage of userspace.  I.e. QEMU does
KVM_SET_CPUID2, and *then* KVM_SET_XSAVE, which fails.  If QEMU reversed those,
KVM_SET_XSAVE would actually succeed.
 
There's another related oddity that will be fixed by my approach, assuming the realloc
change is also reverted (I missed that in my pasted patch).  Userspace *must* do
KVM_SET_CPUID{2} in order to load off-by-default state, whereas there is no such
requirement for on-by-default state.

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH] x86/kvm: Account for fpstate->user_xfeatures changes
  2023-09-26 19:22                   ` Sean Christopherson
@ 2023-09-26 20:27                     ` Sean Christopherson
  0 siblings, 0 replies; 16+ messages in thread
From: Sean Christopherson @ 2023-09-26 20:27 UTC (permalink / raw)
  To: Tyler Stachecki
  Cc: Leonardo Bras, Dongli Zhang, kvm, pbonzini, dgilbert, tglx, mingo,
	dave.hansen, bp, Tyler Stachecki, stable

On Tue, Sep 26, 2023, Sean Christopherson wrote:
> There's another related oddity that will be fixed by my approach, assuming the realloc
> change is also reverted (I missed that in my pasted patch).  Userspace *must* do
> KVM_SET_CPUID{2} in order to load off-by-default state, whereas there is no such
> requirement for on-by-default state.

Scratch that, KVM explicitly requires KVM_SET_CPUID2 to grant the guest access to
off-by-default features, e.g. so that the kernel/KVM doesn't need to context AMX
state if it's not exposed to the guest.  Thankfully, that has always been true for
XFD-based features, i.e. AMX, so it's safe to keep that behavior even though it
diverges from on-by-default features.

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2023-09-26 20:27 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-09-14  1:00 [PATCH] x86/kvm: Account for fpstate->user_xfeatures changes Tyler Stachecki
2023-09-14  1:33 ` Tyler Stachecki
2023-09-14  7:15 ` Leonardo Bras
2023-09-14  9:11   ` Tyler Stachecki
2023-09-14 17:05     ` Dongli Zhang
2023-09-15  0:58       ` Tyler Stachecki
2023-09-15  7:41         ` Leonardo Bras
2023-09-15 12:27           ` Tyler Stachecki
2023-09-25 21:26             ` Sean Christopherson
2023-09-26  3:02               ` Tyler Stachecki
2023-09-26 16:31                 ` Sean Christopherson
2023-09-26 17:31                   ` Sean Christopherson
2023-09-26 19:22                   ` Sean Christopherson
2023-09-26 20:27                     ` Sean Christopherson
2023-09-15  7:13       ` Leonardo Bras
2023-09-15  7:11     ` Leonardo Bras

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox