public inbox for kvm@vger.kernel.org
* [PATCH 0/2] Fix the #PF injection logic for smaller MAXPHYADDR in nested.
@ 2022-07-15 11:42 Yu Zhang
  2022-07-15 11:42 ` [PATCH 1/2] KVM: X86: Initialize 'fault' in kvm_fixup_and_inject_pf_error() Yu Zhang
  2022-07-15 11:42 ` [PATCH 2/2] KVM: X86: Fix the comments in prepare_vmcs02_rare() Yu Zhang
  0 siblings, 2 replies; 8+ messages in thread
From: Yu Zhang @ 2022-07-15 11:42 UTC (permalink / raw)
  To: pbonzini, seanjc, vkuznets, jmattson, joro, wanpengli, kvm; +Cc: linux-kernel

KVM should inject the #PF directly into L2 when L2 has a smaller MAXPHYADDR
and L1 has no desire to intercept it, yet currently the #PF is delivered
to L1 anyway.

Patch 1 fixes this, by initializing the 'fault' variable.
Patch 2 is just a cleanup for the comments.

Yu Zhang (2):
  KVM: X86: Initialize 'fault' in kvm_fixup_and_inject_pf_error().
  KVM: X86: Fix the comments in prepare_vmcs02_rare()

 arch/x86/kvm/vmx/nested.c | 8 ++++----
 arch/x86/kvm/x86.c        | 2 +-
 2 files changed, 5 insertions(+), 5 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH 1/2] KVM: X86: Initialize 'fault' in kvm_fixup_and_inject_pf_error().
  2022-07-15 11:42 [PATCH 0/2] Fix the #PF injection logic for smaller MAXPHYADDR in nested Yu Zhang
@ 2022-07-15 11:42 ` Yu Zhang
  2022-07-15 14:47   ` Sean Christopherson
  2022-07-15 11:42 ` [PATCH 2/2] KVM: X86: Fix the comments in prepare_vmcs02_rare() Yu Zhang
  1 sibling, 1 reply; 8+ messages in thread
From: Yu Zhang @ 2022-07-15 11:42 UTC (permalink / raw)
  To: pbonzini, seanjc, vkuznets, jmattson, joro, wanpengli, kvm; +Cc: linux-kernel

kvm_fixup_and_inject_pf_error() was introduced to fix up the error code
(e.g., to add the RSVD flag) and inject the #PF into the guest when the
guest MAXPHYADDR is smaller than the host's.

When it comes to nested, L0 is expected to intercept and fix up the #PF
and then inject to L2 directly if
- L2.MAXPHYADDR < L0.MAXPHYADDR and
- L1 has no intention to intercept L2's #PF (e.g., L2 and L1 have the
  same MAXPHYADDR value && L1 is using EPT for L2),
instead of constructing a #PF VM Exit to L1. Currently, with PFEC_MASK
and PFEC_MATCH both set to 0 in vmcs02, the interception and injection
may happen on all L2 #PFs.

However, failing to initialize 'fault' in kvm_fixup_and_inject_pf_error()
may leave fault.async_page_fault un-zeroed, causing the #PF to later be
treated as a nested async page fault and injected into L1 instead.
Fix this by initializing the 'fault' structure at the beginning.

Fixes: 897861479c064 ("KVM: x86: Add helper functions for illegal GPA checking and page fault injection")
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216178
Reported-by: Yang Lixiao <lixiao.yang@intel.com>
Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
---
 arch/x86/kvm/x86.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 031678eff28e..3246b3c9dfb3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12983,7 +12983,7 @@ EXPORT_SYMBOL_GPL(kvm_spec_ctrl_test_value);
 void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_code)
 {
 	struct kvm_mmu *mmu = vcpu->arch.walk_mmu;
-	struct x86_exception fault;
+	struct x86_exception fault = {0};
 	u64 access = error_code &
 		(PFERR_WRITE_MASK | PFERR_FETCH_MASK | PFERR_USER_MASK);
 
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread
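
The interception rule the commit message relies on (PFEC_MASK and PFEC_MATCH
both zero in vmcs02 matching every L2 #PF) follows from the VMX condition
EB.PF=1 && ((PFEC & MASK) == MATCH). A standalone sketch of that predicate
(plain C, not KVM code; names are illustrative):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PF_VECTOR 14

/* Illustrative model of the VMX #PF interception rule: a #PF causes a
 * VM exit iff the exception bitmap has bit 14 (PF) set and the fault's
 * error code satisfies (error_code & PFEC_MASK) == PFEC_MATCH. */
static bool pf_causes_vmexit(uint32_t exception_bitmap, uint32_t pfec_mask,
			     uint32_t pfec_match, uint32_t error_code)
{
	if (!(exception_bitmap & (1u << PF_VECTOR)))
		return false;
	return (error_code & pfec_mask) == pfec_match;
}
```

With MASK=MATCH=0, as vmcs02 is configured here, every error code matches, so
interception is governed purely by EB.PF.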

* [PATCH 2/2] KVM: X86: Fix the comments in prepare_vmcs02_rare()
  2022-07-15 11:42 [PATCH 0/2] Fix the #PF injection logic for smaller MAXPHYADDR in nested Yu Zhang
  2022-07-15 11:42 ` [PATCH 1/2] KVM: X86: Initialize 'fault' in kvm_fixup_and_inject_pf_error() Yu Zhang
@ 2022-07-15 11:42 ` Yu Zhang
  2022-07-15 15:56   ` Sean Christopherson
  1 sibling, 1 reply; 8+ messages in thread
From: Yu Zhang @ 2022-07-15 11:42 UTC (permalink / raw)
  To: pbonzini, seanjc, vkuznets, jmattson, joro, wanpengli, kvm; +Cc: linux-kernel

Although EB.PF in vmcs02 is still set by simply "or"ing the EB of
vmcs01 and vmcs12, the explanation is obsolete: "enable_ept" being
set is no longer the only reason for L0 to clear its EB.PF.

Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
---
 arch/x86/kvm/vmx/nested.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
index 778f82015f03..634a7d218048 100644
--- a/arch/x86/kvm/vmx/nested.c
+++ b/arch/x86/kvm/vmx/nested.c
@@ -2451,10 +2451,10 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
 	 * is not easy (if at all possible?) to merge L0 and L1's desires, we
 	 * simply ask to exit on each and every L2 page fault. This is done by
 	 * setting MASK=MATCH=0 and (see below) EB.PF=1.
-	 * Note that below we don't need special code to set EB.PF beyond the
-	 * "or"ing of the EB of vmcs01 and vmcs12, because when enable_ept,
-	 * vmcs01's EB.PF is 0 so the "or" will take vmcs12's value, and when
-	 * !enable_ept, EB.PF is 1, so the "or" will always be 1.
+	 * Note that EB.PF is set by "or"ing the EB of vmcs01 and vmcs12:
+	 * when L0 has no desire to intercept #PF, vmcs01's EB.PF is 0, so
+	 * the "or" takes vmcs12's value; otherwise EB.PF is 1, so the "or"
+	 * is always 1.
 	 */
 	if (vmx_need_pf_intercept(&vmx->vcpu)) {
 		/*
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 8+ messages in thread
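
The "or"ing behavior the comment describes can be sketched as a tiny model
(illustrative only, not the actual vmx code):

```c
#include <assert.h>
#include <stdint.h>

#define PF_VECTOR 14
#define EB_PF (1u << PF_VECTOR)

/* Illustrative model: vmcs02's exception bitmap is the "or" of vmcs01's
 * (L0's desires) and vmcs12's (L1's desires).  If L0 does not want to
 * intercept #PF, vmcs01's bit is clear and the result simply takes L1's
 * value; if L0 does, the resulting bit is always set. */
static uint32_t vmcs02_eb(uint32_t vmcs01_eb, uint32_t vmcs12_eb)
{
	return vmcs01_eb | vmcs12_eb;
}
```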

* Re: [PATCH 1/2] KVM: X86: Initialize 'fault' in kvm_fixup_and_inject_pf_error().
  2022-07-15 11:42 ` [PATCH 1/2] KVM: X86: Initialize 'fault' in kvm_fixup_and_inject_pf_error() Yu Zhang
@ 2022-07-15 14:47   ` Sean Christopherson
  2022-07-18  6:52     ` Yu Zhang
  0 siblings, 1 reply; 8+ messages in thread
From: Sean Christopherson @ 2022-07-15 14:47 UTC (permalink / raw)
  To: Yu Zhang; +Cc: pbonzini, vkuznets, jmattson, joro, wanpengli, kvm, linux-kernel

On Fri, Jul 15, 2022, Yu Zhang wrote:
> kvm_fixup_and_inject_pf_error() was introduced to fixup the error code(
> e.g., to add RSVD flag) and inject the #PF to the guest, when guest
> MAXPHYADDR is smaller than the host one.
> 
> When it comes to nested, L0 is expected to intercept and fix up the #PF
> and then inject to L2 directly if
> - L2.MAXPHYADDR < L0.MAXPHYADDR and
> - L1 has no intention to intercept L2's #PF (e.g., L2 and L1 have the
>   same MAXPHYADDR value && L1 is using EPT for L2),
> instead of constructing a #PF VM Exit to L1. Currently, with PFEC_MASK
> and PFEC_MATCH both set to 0 in vmcs02, the interception and injection
> may happen on all L2 #PFs.
> 
> However, failing to initialize 'fault' in kvm_fixup_and_inject_pf_error()
> may cause the fault.async_page_fault being NOT zeroed, and later the #PF
> being treated as a nested async page fault, and then being injected to L1.
> So just fix it by initialize the 'fault' value in the beginning.

Ouch.

> Fixes: 897861479c064 ("KVM: x86: Add helper functions for illegal GPA checking and page fault injection")
> Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216178
> Reported-by: Yang Lixiao <lixiao.yang@intel.com>
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> ---
>  arch/x86/kvm/x86.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 031678eff28e..3246b3c9dfb3 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12983,7 +12983,7 @@ EXPORT_SYMBOL_GPL(kvm_spec_ctrl_test_value);
>  void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_code)
>  {
>  	struct kvm_mmu *mmu = vcpu->arch.walk_mmu;
> -	struct x86_exception fault;
> +	struct x86_exception fault = {0};
>  	u64 access = error_code &
>  		(PFERR_WRITE_MASK | PFERR_FETCH_MASK | PFERR_USER_MASK);

As stupid as it may be to intentionally not fix the uninitialized data in a robust
way, I'd actually prefer to manually clear fault.async_page_fault instead of
zero-initializing the struct.  Unlike a similar bug fix in commit 159e037d2e36
("KVM: x86: Fully initialize 'struct kvm_lapic_irq' in kvm_pv_kick_cpu_op()"),
this code actually cares about async_page_fault being false as opposed to just
being _initialized_.

And if another field is added to struct x86_exception in the future, leaving the
struct uninitialized means that if such a patch were to miss this case, running
with various sanitizers should in theory be able to detect such a bug.  I suspect
no one has found this with syzkaller due to the need to opt into running with
allow_smaller_maxphyaddr=1.

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f389691d8c04..aeed737b55c2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12996,6 +12996,7 @@ void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_c
                fault.error_code = error_code;
                fault.nested_page_fault = false;
                fault.address = gva;
+               fault.async_page_fault = false;
        }
        vcpu->arch.walk_mmu->inject_page_fault(vcpu, &fault);
 }
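
The trade-off between zero-initialization and explicit per-field assignment can
be illustrated with a standalone sketch (plain C; the struct below is a made-up
stand-in, not the real struct x86_exception layout):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for a fault descriptor. */
struct fault_info {
	uint16_t error_code;
	bool nested_page_fault;
	uint64_t address;
	bool async_page_fault;
};

/* Patch 1's approach: zero everything up front.  Safe, but a field a
 * future patch forgets to set silently becomes 0 instead of being
 * flagged by a sanitizer. */
static struct fault_info make_fault_zeroed(uint16_t ec, uint64_t gva)
{
	struct fault_info f = {0};

	f.error_code = ec;
	f.address = gva;
	return f;
}

/* The review's preferred approach: set each relevant field on purpose.
 * A field a future patch forgets to set stays uninitialized, which
 * sanitizers can then detect at the point of use. */
static struct fault_info make_fault_explicit(uint16_t ec, uint64_t gva)
{
	struct fault_info f;

	f.error_code = ec;
	f.nested_page_fault = false;
	f.address = gva;
	f.async_page_fault = false;	/* the one-line fix suggested above */
	return f;
}
```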


^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] KVM: X86: Fix the comments in prepare_vmcs02_rare()
  2022-07-15 11:42 ` [PATCH 2/2] KVM: X86: Fix the comments in prepare_vmcs02_rare() Yu Zhang
@ 2022-07-15 15:56   ` Sean Christopherson
  2022-07-18  7:58     ` Yu Zhang
  0 siblings, 1 reply; 8+ messages in thread
From: Sean Christopherson @ 2022-07-15 15:56 UTC (permalink / raw)
  To: Yu Zhang; +Cc: pbonzini, vkuznets, jmattson, joro, wanpengli, kvm, linux-kernel

On Fri, Jul 15, 2022, Yu Zhang wrote:
> Although EB.PF in vmcs02 is still set by simply "or"ing the EB of
> vmcs01 and vmcs12, the explanation is obsolete. "enable_ept" being
> set is not the only reason for L0 to clear its EB.PF.
> 
> Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> ---
>  arch/x86/kvm/vmx/nested.c | 8 ++++----
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> index 778f82015f03..634a7d218048 100644
> --- a/arch/x86/kvm/vmx/nested.c
> +++ b/arch/x86/kvm/vmx/nested.c
> @@ -2451,10 +2451,10 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
>  	 * is not easy (if at all possible?) to merge L0 and L1's desires, we
>  	 * simply ask to exit on each and every L2 page fault. This is done by
>  	 * setting MASK=MATCH=0 and (see below) EB.PF=1.
> -	 * Note that below we don't need special code to set EB.PF beyond the
> -	 * "or"ing of the EB of vmcs01 and vmcs12, because when enable_ept,
> -	 * vmcs01's EB.PF is 0 so the "or" will take vmcs12's value, and when
> -	 * !enable_ept, EB.PF is 1, so the "or" will always be 1.
> +	 * Note that EB.PF is set by "or"ing of the EB of vmcs01 and vmcs12,
> +	 * because when L0 has no desire to intercept #PF, vmcs01's EB.PF is 0
> +	 * so the "or" will take vmcs12's value, otherwise EB.PF is 1, so the
> +	 * "or" will always be 1.

Oof!  I was going to respond with a variety of nits (about the existing comment),
and even suggest that we address the TODO just out of sight, but looking at all
of this made me realize there's a bug here!  vmx_update_exception_bitmap() doesn't
update MASK and MATCH!

Hitting the bug is extremely unlikely, as it would require changing the guest's
MAXPHYADDR via KVM_SET_CPUID2 _after_ KVM_SET_NESTED_STATE, but before KVM_RUN
(because KVM now disallows changing CPUID after KVM_RUN).

During KVM_SET_CPUID2, KVM will invoke vmx_update_exception_bitmap() to refresh
the exception bitmap to handle the ept=1 && allow_smaller_maxphyaddr=1 scenario.
But when L2 is active, vmx_update_exception_bitmap() assumes vmcs02 already has
the correct MASK+MATCH because of the "clear both if KVM and L1 both want #PF"
behavior.  But if KVM's desire to intercept #PF changes from 0=>1, then KVM will
run L2 with the MASK+MATCH from vmcs12 because vmx_need_pf_intercept() would have
returned false at the time of prepare_vmcs02_rare().

Fixing the bug is fairly straightforward, and presents a good opportunity to
clean up the code (and this comment) and address the TODO.

Unless someone objects to my suggestion for patch 01, can you send a new version
of patch 01?  I'll send a separate series to fix this theoretical bug, avoid
writing MASK+MATCH when vmcs0x.EXCEPTION_BITMAP.PF=0, and to address the TODO.

E.g. I believe this is what we want to end up with:

	if (vmcs12)
		eb |= vmcs12->exception_bitmap;

	/*
	 * #PF is conditionally intercepted based on the #PF error code (PFEC)
	 * combined with the exception bitmap.  #PF is intercepted if:
	 *
	 *    EXCEPTION_BITMAP.PF=1 && ((PFEC & MASK) == MATCH).
	 *
	 * If any #PF is being intercepted, update MASK+MATCH; otherwise leave
	 * them alone, as they do not affect interception (EXCEPTION_BITMAP.PF=0).
	 */
	if (eb & (1u << PF_VECTOR)) {
		/*
		 * If EPT is enabled, #PF is only intercepted if MAXPHYADDR is
		 * smaller on the guest than on the host.  In that case, KVM
		 * only needs to intercept present, non-reserved #PF.  If EPT
		 * is disabled, i.e. KVM is using shadow paging, KVM needs to
		 * intercept all #PF.  Note, whether or not KVM wants to
		 * intercept _any_ #PF is handled below.
		 */
		if (enable_ept) {
			pfec_mask = PFERR_PRESENT_MASK | PFERR_RSVD_MASK;
			pfec_match = PFERR_PRESENT_MASK;
		} else {
			pfec_mask = 0;
			pfec_match = 0;
		}

		if (!(vmcs12->exception_bitmap & (1u << PF_VECTOR))) {
			/* L1 doesn't want to intercept #PF, use KVM's MASK+MATCH. */
		} else if (!kvm_needs_pf_intercept) {
			/* KVM doesn't want to intercept #PF, use L1's MASK+MATCH. */
			pfec_mask = vmcs12->page_fault_error_code_mask;
			pfec_match = vmcs12->page_fault_error_code_match;
		} else if (pfec_mask != vmcs12->page_fault_error_code_mask ||
			   pfec_match != vmcs12->page_fault_error_code_match) {
			/*
			 * KVM and L1 want to intercept #PF with different MASK
			 * and/or MATCH.  For simplicity, intercept all #PF by
			 * clearing MASK+MATCH.  Merging KVM's and L1's desires
			 * is quite complex, while the odds of meaningfully
			 * reducing which #PFs are intercepted are low.
			 */
			pfec_mask = 0;
			pfec_match = 0;
		} else {
			/* KVM and L1 have identical MASK+MATCH. */
		}
		vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, pfec_mask);
		vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, pfec_match);
	}
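
The decision tree in the sketch above can be modeled outside the kernel. The
following is illustrative only (names are made up; PFEC bit positions follow
the SDM, P = bit 0, RSVD = bit 3):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PFERR_PRESENT_MASK (1u << 0)
#define PFERR_RSVD_MASK    (1u << 3)

struct pfec { uint32_t mask, match; };

/* Illustrative model of merging KVM's and L1's PFEC MASK/MATCH when at
 * least one of them intercepts #PF. */
static struct pfec merge_pfec(bool enable_ept, bool kvm_wants_pf,
			      bool l1_wants_pf, struct pfec l1)
{
	/* KVM's own filter: with EPT, only present, non-reserved #PF
	 * (the smaller-guest-MAXPHYADDR case); with shadow paging, all
	 * #PF (mask = match = 0 matches everything). */
	struct pfec kvm = enable_ept
		? (struct pfec){ PFERR_PRESENT_MASK | PFERR_RSVD_MASK,
				 PFERR_PRESENT_MASK }
		: (struct pfec){ 0, 0 };

	if (!l1_wants_pf)
		return kvm;	/* only KVM cares: use KVM's filter */
	if (!kvm_wants_pf)
		return l1;	/* only L1 cares: use L1's filter */
	if (kvm.mask != l1.mask || kvm.match != l1.match)
		return (struct pfec){ 0, 0 };	/* differ: intercept all */
	return kvm;		/* identical filters: either works */
}
```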


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 1/2] KVM: X86: Initialize 'fault' in kvm_fixup_and_inject_pf_error().
  2022-07-15 14:47   ` Sean Christopherson
@ 2022-07-18  6:52     ` Yu Zhang
  0 siblings, 0 replies; 8+ messages in thread
From: Yu Zhang @ 2022-07-18  6:52 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, vkuznets, jmattson, joro, wanpengli, kvm, linux-kernel

On Fri, Jul 15, 2022 at 02:47:36PM +0000, Sean Christopherson wrote:
> On Fri, Jul 15, 2022, Yu Zhang wrote:
> > kvm_fixup_and_inject_pf_error() was introduced to fixup the error code(
> > e.g., to add RSVD flag) and inject the #PF to the guest, when guest
> > MAXPHYADDR is smaller than the host one.
> > 
> > When it comes to nested, L0 is expected to intercept and fix up the #PF
> > and then inject to L2 directly if
> > - L2.MAXPHYADDR < L0.MAXPHYADDR and
> > - L1 has no intention to intercept L2's #PF (e.g., L2 and L1 have the
> >   same MAXPHYADDR value && L1 is using EPT for L2),
> > instead of constructing a #PF VM Exit to L1. Currently, with PFEC_MASK
> > and PFEC_MATCH both set to 0 in vmcs02, the interception and injection
> > may happen on all L2 #PFs.
> > 
> > However, failing to initialize 'fault' in kvm_fixup_and_inject_pf_error()
> > may cause the fault.async_page_fault being NOT zeroed, and later the #PF
> > being treated as a nested async page fault, and then being injected to L1.
> > So just fix it by initialize the 'fault' value in the beginning.
> 
> Ouch.
> 
> > Fixes: 897861479c064 ("KVM: x86: Add helper functions for illegal GPA checking and page fault injection")
> > Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=216178
> > Reported-by: Yang Lixiao <lixiao.yang@intel.com>
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > ---
> >  arch/x86/kvm/x86.c | 2 +-
> >  1 file changed, 1 insertion(+), 1 deletion(-)
> > 
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 031678eff28e..3246b3c9dfb3 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -12983,7 +12983,7 @@ EXPORT_SYMBOL_GPL(kvm_spec_ctrl_test_value);
> >  void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_code)
> >  {
> >  	struct kvm_mmu *mmu = vcpu->arch.walk_mmu;
> > -	struct x86_exception fault;
> > +	struct x86_exception fault = {0};
> >  	u64 access = error_code &
> >  		(PFERR_WRITE_MASK | PFERR_FETCH_MASK | PFERR_USER_MASK);
> 
> As stupid as it may be to intentionally not fix the uninitialized data in a robust
> way, I'd actually prefer to manually clear fault.async_page_fault instead of
> zero-initializing the struct.  Unlike a similar bug fix in commit 159e037d2e36
> ("KVM: x86: Fully initialize 'struct kvm_lapic_irq' in kvm_pv_kick_cpu_op()"),
> this code actually cares about async_page_fault being false as opposed to just
> being _initialized_.
> 
> And if another field is added to struct x86_exception in the future, leaving the
> struct uninitialized means that if such a patch were to miss this case, running
> with various sanitizers should in theory be able to detect such a bug.  I suspect
> no one has found this with syzkaller due to the need to opt into running with
> allow_smaller_maxphyaddr=1.
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index f389691d8c04..aeed737b55c2 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12996,6 +12996,7 @@ void kvm_fixup_and_inject_pf_error(struct kvm_vcpu *vcpu, gva_t gva, u16 error_c
>                 fault.error_code = error_code;
>                 fault.nested_page_fault = false;
>                 fault.address = gva;
> +               fault.async_page_fault = false;
>         }
>         vcpu->arch.walk_mmu->inject_page_fault(vcpu, &fault);
>  }
> 

Fair enough. Thanks!

B.R.
Yu

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] KVM: X86: Fix the comments in prepare_vmcs02_rare()
  2022-07-15 15:56   ` Sean Christopherson
@ 2022-07-18  7:58     ` Yu Zhang
  2022-07-19  0:09       ` Sean Christopherson
  0 siblings, 1 reply; 8+ messages in thread
From: Yu Zhang @ 2022-07-18  7:58 UTC (permalink / raw)
  To: Sean Christopherson
  Cc: pbonzini, vkuznets, jmattson, joro, wanpengli, kvm, linux-kernel

On Fri, Jul 15, 2022 at 03:56:31PM +0000, Sean Christopherson wrote:
> On Fri, Jul 15, 2022, Yu Zhang wrote:
> > Although EB.PF in vmcs02 is still set by simply "or"ing the EB of
> > vmcs01 and vmcs12, the explanation is obsolete. "enable_ept" being
> > set is not the only reason for L0 to clear its EB.PF.
> > 
> > Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com>
> > ---
> >  arch/x86/kvm/vmx/nested.c | 8 ++++----
> >  1 file changed, 4 insertions(+), 4 deletions(-)
> > 
> > diff --git a/arch/x86/kvm/vmx/nested.c b/arch/x86/kvm/vmx/nested.c
> > index 778f82015f03..634a7d218048 100644
> > --- a/arch/x86/kvm/vmx/nested.c
> > +++ b/arch/x86/kvm/vmx/nested.c
> > @@ -2451,10 +2451,10 @@ static void prepare_vmcs02_rare(struct vcpu_vmx *vmx, struct vmcs12 *vmcs12)
> >  	 * is not easy (if at all possible?) to merge L0 and L1's desires, we
> >  	 * simply ask to exit on each and every L2 page fault. This is done by
> >  	 * setting MASK=MATCH=0 and (see below) EB.PF=1.
> > -	 * Note that below we don't need special code to set EB.PF beyond the
> > -	 * "or"ing of the EB of vmcs01 and vmcs12, because when enable_ept,
> > -	 * vmcs01's EB.PF is 0 so the "or" will take vmcs12's value, and when
> > -	 * !enable_ept, EB.PF is 1, so the "or" will always be 1.
> > +	 * Note that EB.PF is set by "or"ing of the EB of vmcs01 and vmcs12,
> > +	 * because when L0 has no desire to intercept #PF, vmcs01's EB.PF is 0
> > +	 * so the "or" will take vmcs12's value, otherwise EB.PF is 1, so the
> > +	 * "or" will always be 1.
> 
> Oof!  I was going to respond with a variety of nits (about the existing comment),
> and even suggest that we address the TODO just out of sight, but looking at all
> of this made me realize there's a bug here!  vmx_update_exception_bitmap() doesn't
> update MASK and MATCH!
> 
> Hitting the bug is extremely unlikely, as it would require changing the guest's
> MAXPHYADDR via KVM_SET_CPUID2 _after_ KVM_SET_NESTED_STATE, but before KVM_RUN
> (because KVM now disallows changin CPUID after KVM_RUN).
> 
> During KVM_SET_CPUID2, KVM will invoke vmx_update_exception_bitmap() to refresh
> the exception bitmap to handle the ept=1 && allow_smaller_maxphyaddr=1 scenario.
> But when L2 is active, vmx_update_exception_bitmap() assumes vmcs02 already has
> the correct MASK+MATCH because of the "clear both if KVM and L1 both want #PF"
> behavior.  But if KVM's desire to intercept #PF changes from 0=>1, then KVM will
> run L2 with the MASK+MATCH from vmcs12 because vmx_need_pf_intercept() would have
> returned false at the time of prepare_vmcs02_rare().

And then the #PF could be missed in L0, because previously neither L1 nor L0
had any desire to intercept it, and KVM fails to update this after migration
(I guess migration is the only scenario where this can happen?). Is this
understanding correct?

> 
> Fixing the bug is fairly straightforward, and presents a good opportunity to
> clean up the code (and this comment) and address the TODO.
> 
> Unless someone objects to my suggestion for patch 01, can you send a new version
> of patch 01?  I'll send a separate series to fix this theoretical bug, avoid
> writing MASK+MATCH when vmcs0x.EXCEPTION_BITMAP.PF+0, and to address the TODO.

Sure, I will send another version of patch 01.

> 
> E.g. I believe this is what we want to end up with:
> 
> 	if (vmcs12)
> 		eb |= vmcs12->exception_bitmap;
> 
> 	/*
> 	 * #PF is conditionally intercepted based on the #PF error code (PFEC)
> 	 * combined with the exception bitmap.  #PF is intercept if:
> 	 *
> 	 *    EXCEPTION_BITMAP.PF=1 && ((PFEC & MASK) == MATCH).
> 	 *
> 	 * If any #PF is being intercepted, update MASK+MATCH, otherwise leave
> 	 * them alone they do not affect interception (EXCEPTION_BITMAP.PF=0).
> 	 */
> 	if (eb & (1u << PF_VECTOR)) {
> 		/*
> 		 * If EPT is enabled, #PF is only intercepted if MAXPHYADDR is
> 		 * smaller on the guest than on the host.  In that case, KVM
> 		 * only needs to intercept present, non-reserved #PF.  If EPT
> 		 * is disabled, i.e. KVM is using shadow paging, KVM needs to
> 		 * intercept all #PF.  Note, whether or not KVM wants to
> 		 * intercept _any_ #PF is handled below.
> 		 */
> 		if (enable_ept) {
> 			pfec_mask = PFERR_PRESENT_MASK | PFERR_RSVD_MASK;
> 			pfec_match = PFERR_PRESENT_MASK;
> 		} else {
> 			pfec_mask = 0;
> 			pfec_match = 0;
> 		}
> 
> 		if (!(vmcs12->exception_bitmap & (1u << PF_VECTOR))) {
> 			/* L1 doesn't want to intercept #PF, use KVM's MASK+MATCH. */
> 		} else if (!kvm_needs_pf_intercept) {
> 			/* KVM doesn't want to intercept #PF, use L1's MASK+MATCH. */
> 			pfec_mask = vmcs12->page_fault_error_code_mask;
> 			pfec_match = vmcs12->page_fault_error_code_match;
> 		} else if (pfec_mask != vmcs12->page_fault_error_code_mask ||
> 			   pfec_match != vmcs12->page_fault_error_code_mask) {
> 			/*
> 			 * KVM and L1 want to intercept #PF with different MASK
> 			 * and/or MATCH.  For simplicity, intercept all #PF by
> 			 * clearing MASK+MATCH.  Merging KVM's and L1's desires
> 			 * is quite complex, while the odds of meaningfully
> 			 * reducing what #PFs are intercept are low.
> 			 */
> 			pfec_mask = 0;
> 			pfec_match = 0;
> 		} else {
> 			/* KVM and L1 have identical MASK+MATCH. */
> 		}
> 		vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, pfec_mask);
> 		vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, pfec_match);
> 	}

And we do not need to update the PFEC_MASK & PFEC_MATCH in prepare_vmcs02_rare()
anymore, right? Thanks!

B.R.
Yu

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH 2/2] KVM: X86: Fix the comments in prepare_vmcs02_rare()
  2022-07-18  7:58     ` Yu Zhang
@ 2022-07-19  0:09       ` Sean Christopherson
  0 siblings, 0 replies; 8+ messages in thread
From: Sean Christopherson @ 2022-07-19  0:09 UTC (permalink / raw)
  To: Yu Zhang; +Cc: pbonzini, vkuznets, jmattson, joro, wanpengli, kvm, linux-kernel

On Mon, Jul 18, 2022, Yu Zhang wrote:
> On Fri, Jul 15, 2022 at 03:56:31PM +0000, Sean Christopherson wrote:
> > On Fri, Jul 15, 2022, Yu Zhang wrote:
> > Oof!  I was going to respond with a variety of nits (about the existing comment),
> > and even suggest that we address the TODO just out of sight, but looking at all
> > of this made me realize there's a bug here!  vmx_update_exception_bitmap() doesn't
> > update MASK and MATCH!
> > 
> > Hitting the bug is extremely unlikely, as it would require changing the guest's
> > MAXPHYADDR via KVM_SET_CPUID2 _after_ KVM_SET_NESTED_STATE, but before KVM_RUN
> > (because KVM now disallows changin CPUID after KVM_RUN).
> > 
> > During KVM_SET_CPUID2, KVM will invoke vmx_update_exception_bitmap() to refresh
> > the exception bitmap to handle the ept=1 && allow_smaller_maxphyaddr=1 scenario.
> > But when L2 is active, vmx_update_exception_bitmap() assumes vmcs02 already has
> > the correct MASK+MATCH because of the "clear both if KVM and L1 both want #PF"
> > behavior.  But if KVM's desire to intercept #PF changes from 0=>1, then KVM will
> > run L2 with the MASK+MATCH from vmcs12 because vmx_need_pf_intercept() would have
> > returned false at the time of prepare_vmcs02_rare().
> 
> And then the #PF could be missed in L0 because previously both L1 and L0 has no
> desire to intercept it, meanwhile KVM fails to update this after migration(I guess
> the only scenario for this to happen is migration?). Is this understanding correct? 

Yes, I think that will be the scenario (I hadn't thought too much about the actual
result).

> > Fixing the bug is fairly straightforward, and presents a good opportunity to
> > clean up the code (and this comment) and address the TODO.
> > 
> > Unless someone objects to my suggestion for patch 01, can you send a new version
> > of patch 01?  I'll send a separate series to fix this theoretical bug, avoid
> > writing MASK+MATCH when vmcs0x.EXCEPTION_BITMAP.PF+0, and to address the TODO.
> 
> Sure, I will send another version of patch 01.
> 
> > 
> > E.g. I believe this is what we want to end up with:
> > 
> > 	if (vmcs12)
> > 		eb |= vmcs12->exception_bitmap;
> > 
> > 	/*
> > 	 * #PF is conditionally intercepted based on the #PF error code (PFEC)
> > 	 * combined with the exception bitmap.  #PF is intercept if:
> > 	 *
> > 	 *    EXCEPTION_BITMAP.PF=1 && ((PFEC & MASK) == MATCH).
> > 	 *
> > 	 * If any #PF is being intercepted, update MASK+MATCH, otherwise leave
> > 	 * them alone they do not affect interception (EXCEPTION_BITMAP.PF=0).
> > 	 */
> > 	if (eb & (1u << PF_VECTOR)) {
> > 		/*
> > 		 * If EPT is enabled, #PF is only intercepted if MAXPHYADDR is
> > 		 * smaller on the guest than on the host.  In that case, KVM
> > 		 * only needs to intercept present, non-reserved #PF.  If EPT
> > 		 * is disabled, i.e. KVM is using shadow paging, KVM needs to
> > 		 * intercept all #PF.  Note, whether or not KVM wants to
> > 		 * intercept _any_ #PF is handled below.
> > 		 */
> > 		if (enable_ept) {
> > 			pfec_mask = PFERR_PRESENT_MASK | PFERR_RSVD_MASK;
> > 			pfec_match = PFERR_PRESENT_MASK;
> > 		} else {
> > 			pfec_mask = 0;
> > 			pfec_match = 0;
> > 		}
> > 
> > 		if (!(vmcs12->exception_bitmap & (1u << PF_VECTOR))) {
> > 			/* L1 doesn't want to intercept #PF, use KVM's MASK+MATCH. */
> > 		} else if (!kvm_needs_pf_intercept) {
> > 			/* KVM doesn't want to intercept #PF, use L1's MASK+MATCH. */
> > 			pfec_mask = vmcs12->page_fault_error_code_mask;
> > 			pfec_match = vmcs12->page_fault_error_code_match;
> > 		} else if (pfec_mask != vmcs12->page_fault_error_code_mask ||
> > 			   pfec_match != vmcs12->page_fault_error_code_mask) {
> > 			/*
> > 			 * KVM and L1 want to intercept #PF with different MASK
> > 			 * and/or MATCH.  For simplicity, intercept all #PF by
> > 			 * clearing MASK+MATCH.  Merging KVM's and L1's desires
> > 			 * is quite complex, while the odds of meaningfully
> > 			 * reducing what #PFs are intercept are low.
> > 			 */
> > 			pfec_mask = 0;
> > 			pfec_match = 0;
> > 		} else {
> > 			/* KVM and L1 have identical MASK+MATCH. */
> > 		}
> > 		vmcs_write32(PAGE_FAULT_ERROR_CODE_MASK, pfec_mask);
> > 		vmcs_write32(PAGE_FAULT_ERROR_CODE_MATCH, pfec_match);
> > 	}
> 
> And we do not need to update the PFEC_MASK & PFEC_MATCH in prepare_vmcs02_rare()
> anymore, right? Thanks!

Yep!

Also, IIRC I have a goof or two in the above, i.e. don't waste any time trying to test it.

^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2022-07-19  0:09 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-07-15 11:42 [PATCH 0/2] Fix the #PF injection logic for smaller MAXPHYADDR in nested Yu Zhang
2022-07-15 11:42 ` [PATCH 1/2] KVM: X86: Initialize 'fault' in kvm_fixup_and_inject_pf_error() Yu Zhang
2022-07-15 14:47   ` Sean Christopherson
2022-07-18  6:52     ` Yu Zhang
2022-07-15 11:42 ` [PATCH 2/2] KVM: X86: Fix the comments in prepare_vmcs02_rare() Yu Zhang
2022-07-15 15:56   ` Sean Christopherson
2022-07-18  7:58     ` Yu Zhang
2022-07-19  0:09       ` Sean Christopherson
