From: Sean Christopherson <seanjc@google.com>
To: Lai Jiangshan <laijs@linux.alibaba.com>
Cc: Lai Jiangshan <jiangshanlai@gmail.com>,
	LKML <linux-kernel@vger.kernel.org>,
	kvm@vger.kernel.org, Paolo Bonzini <pbonzini@redhat.com>,
	Vitaly Kuznetsov <vkuznets@redhat.com>,
	Wanpeng Li <wanpengli@tencent.com>,
	Jim Mattson <jmattson@google.com>, Joerg Roedel <joro@8bytes.org>,
	Thomas Gleixner <tglx@linutronix.de>,
	Ingo Molnar <mingo@redhat.com>, Borislav Petkov <bp@alien8.de>,
	Dave Hansen <dave.hansen@linux.intel.com>,
	X86 ML <x86@kernel.org>, "H. Peter Anvin" <hpa@zytor.com>
Subject: Re: [RFC PATCH 5/6] KVM: X86: Alloc pae_root shadow page
Date: Thu, 6 Jan 2022 19:41:47 +0000
Message-ID: <YddF+6eX7ycAsZLr@google.com>
In-Reply-To: <dc8f2508-35ac-0dee-2465-4b5a8e3879ca@linux.alibaba.com>

On Thu, Jan 06, 2022, Lai Jiangshan wrote:
> 
> 
> On 2022/1/6 00:45, Sean Christopherson wrote:
> > On Wed, Jan 05, 2022, Lai Jiangshan wrote:
> > > On Wed, Jan 5, 2022 at 5:54 AM Sean Christopherson <seanjc@google.com> wrote:
> > > 
> > > > > 
> > > > > default_pae_pdpte is needed because the CPU expects PAE PDPTEs to be
> > > > > present at VM-Enter.
> > > > 
> > > > That's incorrect.  Neither Intel nor AMD requires PDPTEs to be present.  A
> > > > not-present PDPTE is perfectly ok; present with reserved bits set is what's
> > > > not allowed.
> > > > 
> > > > Intel SDM:
> > > >    A VM entry that checks the validity of the PDPTEs uses the same checks that are
> > > >    used when CR3 is loaded with MOV to CR3 when PAE paging is in use[7].  If MOV to CR3
> > > >    would cause a general-protection exception due to the PDPTEs that would be loaded
> > > >    (e.g., because a reserved bit is set), the VM entry fails.
> > > > 
> > > >    7. This implies that (1) bits 11:9 in each PDPTE are ignored; and (2) if bit 0
> > > >       (present) is clear in one of the PDPTEs, bits 63:1 of that PDPTE are ignored.
> > > 
> > > But in practice, the VM entry fails if the present bit is not set in the
> > > PDPTE for the linear address being accessed (at least when EPT is enabled).
> > > The host KVM complains and dumps the VMCS state.
> > 
> > That doesn't make any sense.  If EPT is enabled, KVM should never use a pae_root.
> > The vmcs.GUEST_PDPTRn fields are in play, but those shouldn't derive from KVM's
> > shadow page tables.
> 
> Oh, I wrote the negative of what I wanted to say again; it happens when I try
> to emphasize something after writing a sentence and modifying it several times.
> 
> I meant "EPT not enabled" on VMX.

Heh, that makes a lot more sense.

> The VM entry fails when the guest is at a very early stage of booting, which
> might still be in real mode.
> 
> VMEXIT: intr_info=00000000 errorcode=0000000 ilen=00000000
> reason=80000021 qualification=0000000000000002
> IDTVectoring: info=00000000 errorcode=00000000

Yep, that's the signature for an illegal PDPTE at VM-Enter.  But as noted above,
a not-present PDPTE is perfectly legal; VM-Enter should fail if and only if a
PDPTE is present and has reserved bits set.
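
A rough, untested sketch of that check, purely for illustration (my own code,
not KVM's; the reserved-bit layout assumes a 52-bit MAXPHYADDR, see the SDM
for the authoritative definition):

static bool pae_pdpte_is_legal(u64 pdpte)
{
	/* Bits 63:1 are ignored when the PDPTE is not present. */
	if (!(pdpte & BIT_ULL(0)))
		return true;

	/*
	 * Reserved bits in a present PAE PDPTE: 2:1, 8:5, and 63:MAXPHYADDR
	 * (63:52 assumed here).  Bits 11:9 are ignored, not reserved.
	 */
	return !(pdpte & (GENMASK_ULL(2, 1) | GENMASK_ULL(8, 5) |
			  GENMASK_ULL(63, 52)));
}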

> > 
> > And I doubt there is a VMX ucode bug at play, as KVM currently uses '0' in its
> > shadow page tables for not-present PDPTEs.
> > 
> > If you can post/provide the patches that lead to VM-Fail, I'd be happy to help
> > debug.
> 
> If you can try this patchset, you can just set default_pae_pdpte to 0 to test
> it.

I can't reproduce the failure with this on top of your series + kvm/queue (commit
cc0e35f9c2d4 ("KVM: SVM: Nullify vcpu_(un)blocking() hooks if AVIC is disabled")).

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index f6f7caf76b70..b7170a840330 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -728,22 +728,11 @@ static u64 default_pae_pdpte;

 static void free_default_pae_pdpte(void)
 {
-       free_page((unsigned long)__va(default_pae_pdpte & PAGE_MASK));
        default_pae_pdpte = 0;
 }

 static int alloc_default_pae_pdpte(void)
 {
-       unsigned long p = __get_free_page(GFP_KERNEL | __GFP_ZERO);
-
-       if (!p)
-               return -ENOMEM;
-       default_pae_pdpte = __pa(p) | PT_PRESENT_MASK | shadow_me_mask;
-       if (WARN_ON(is_shadow_present_pte(default_pae_pdpte) ||
-                   is_mmio_spte(default_pae_pdpte))) {
-               free_default_pae_pdpte();
-               return -EINVAL;
-       }
        return 0;
 }
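
With that change, not-present entries in the shadow pae_root are plain zeros,
which per the SDM footnote quoted above should be perfectly legal at VM-Enter.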

Are you using a different base and/or running with other changes?

To aid debug, the below patch will dump the PDPTEs from the current MMU root on
failure (I'll also submit this as a formal patch).  On failure, I would expect
that at least one of the PDPTEs will be present with reserved bits set.

diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index fe06b02994e6..c13f37ef1bbc 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -5773,11 +5773,19 @@ void dump_vmcs(struct kvm_vcpu *vcpu)
        pr_err("CR4: actual=0x%016lx, shadow=0x%016lx, gh_mask=%016lx\n",
               cr4, vmcs_readl(CR4_READ_SHADOW), vmcs_readl(CR4_GUEST_HOST_MASK));
        pr_err("CR3 = 0x%016lx\n", vmcs_readl(GUEST_CR3));
-       if (cpu_has_vmx_ept()) {
+       if (enable_ept) {
                pr_err("PDPTR0 = 0x%016llx  PDPTR1 = 0x%016llx\n",
                       vmcs_read64(GUEST_PDPTR0), vmcs_read64(GUEST_PDPTR1));
                pr_err("PDPTR2 = 0x%016llx  PDPTR3 = 0x%016llx\n",
                       vmcs_read64(GUEST_PDPTR2), vmcs_read64(GUEST_PDPTR3));
+       } else if (vcpu->arch.mmu->shadow_root_level == PT32E_ROOT_LEVEL &&
+                  VALID_PAGE(vcpu->arch.mmu->root_hpa)) {
+               u64 *pdpte = __va(vcpu->arch.mmu->root_hpa);
+
+               pr_err("PDPTE0 = 0x%016llx  PDPTE1 = 0x%016llx\n",
+                      pdpte[0], pdpte[1]);
+               pr_err("PDPTE2 = 0x%016llx  PDPTE3 = 0x%016llx\n",
+                      pdpte[2], pdpte[3]);
        }
        pr_err("RSP = 0x%016lx  RIP = 0x%016lx\n",
               vmcs_readl(GUEST_RSP), vmcs_readl(GUEST_RIP));
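
(Note the check is on enable_ept rather than cpu_has_vmx_ept(): the CPU can
support EPT while KVM runs with ept=0, in which case the GUEST_PDPTRn fields
aren't in use and the live PDPTEs are the ones in the shadow pae_root.)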

> If you can't try this patchset, mmu->pae_root could be modified directly to
> test it.
> 
> I guess VMX fails to translate %rip at VM-Enter in this case.

No, the CPU doesn't translate RIP at VM-Enter; vmcs.GUEST_RIP is only checked
for legality, e.g. that it's canonical.  Translating RIP through page tables is
firmly a post-VM-Enter code fetch action.
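
For illustration only, a toy sketch of a canonical-address check (my own code,
assuming 48 virtual-address bits; the CPU uses its actual linear-address
width):

static bool is_canonical(u64 addr, unsigned int va_bits)
{
	/* Bits 63:va_bits-1 must be a sign extension of bit va_bits-1. */
	return ((s64)(addr << (64 - va_bits)) >> (64 - va_bits)) == (s64)addr;
}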

