From: Sean Christopherson <seanjc@google.com>
To: Vitaly Kuznetsov <vkuznets@redhat.com>
Cc: linux-kernel@vger.kernel.org, kvm@vger.kernel.org,
Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [FYI PATCH] Revert "KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()"
Date: Fri, 25 Mar 2022 20:18:32 +0000 [thread overview]
Message-ID: <Yj4jmAjj5ZTJodQM@google.com> (raw)
In-Reply-To: <87r16qnkgl.fsf@redhat.com>
On Fri, Mar 25, 2022, Vitaly Kuznetsov wrote:
> Sean Christopherson <seanjc@google.com> writes:
> > Actually, since this is apparently specific to kvm_zap_gfn_range(), can you add
> > printk "tracing" in update_mtrr(), kvm_post_set_cr0(), and __kvm_request_apicv_update()
> > to see what is actually triggering zaps? Capturing the start and end GFNs would be very
> > helpful for the MTRR case.
> >
> > The APICv update seems unlikely to affect only Hyper-V guests, though there is the auto
> > EOI crud. And the other two only come into play with non-coherent DMA. In other words,
> > figuring out exactly what sequence leads to failure should be straightforward.
>
> The tricky part here is that Hyper-V doesn't crash immediately, the
> crash is always different (if you look at the BSOD) and happens at
> different times. Crashes mention various stuff like trying to execute
> non-executable memory, ...
>
> I've added tracing you've suggested:
> - __kvm_request_apicv_update() happens only once in the very beginning.
And thinking through this again, APICv changes should never result in a shadow
page being zapped as they'll only zap a 4k range.
> - update_mtrr() never actually reaches kvm_zap_gfn_range()
>
> - kvm_post_set_cr0() happen in early boot but the crash happen much much
> later. E.g.:
Ah rats, I got the sequencing of the revert messed up. mmu_notifier is also in
play, via kvm_tdp_mmu_unmap_gfn_range().
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4fa4d8269e5b..db7c5a05e574 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -870,6 +870,8 @@ EXPORT_SYMBOL_GPL(load_pdptrs);
>
> void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned long cr0)
> {
> + trace_printk("vCPU %d %lx %lx\n", vcpu->vcpu_id, old_cr0, cr0);
This doesn't guarantee kvm_zap_gfn_range() will be reached. The guest has to
have non-coherente DMA and be running with the CD/NW memtyp quirk. Moving the
print inside the if statement would show if KVM is actually zapping in those
cases.
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index d3a9ce07a565..25c7d8fc3287 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -887,8 +887,10 @@ void kvm_post_set_cr0(struct kvm_vcpu *vcpu, unsigned long old_cr0, unsigned lon
if (((cr0 ^ old_cr0) & X86_CR0_CD) &&
kvm_arch_has_noncoherent_dma(vcpu->kvm) &&
- !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED))
+ !kvm_check_has_quirk(vcpu->kvm, KVM_X86_QUIRK_CD_NW_CLEARED)) {
+ trace_printk("vCPU %d %lx %lx\n", vcpu->vcpu_id, old_cr0, cr0);
kvm_zap_gfn_range(vcpu->kvm, 0, ~0ULL);
+ }
}
EXPORT_SYMBOL_GPL(kvm_post_set_cr0);
> kvm_hv_set_msr_pw() call is when Hyper-V writes to HV_X64_MSR_CRASH_CTL
> ('hv-crash' QEMU flag is needed to enable the feature). The debug patch
> is:
>
> diff --git a/arch/x86/kvm/hyperv.c b/arch/x86/kvm/hyperv.c
> index a32f54ab84a2..59a72f6ced99 100644
> --- a/arch/x86/kvm/hyperv.c
> +++ b/arch/x86/kvm/hyperv.c
> @@ -1391,6 +1391,7 @@ static int kvm_hv_set_msr_pw(struct kvm_vcpu *vcpu, u32 msr, u64 data,
>
> /* Send notification about crash to user space */
> kvm_make_request(KVM_REQ_HV_CRASH, vcpu);
> + trace_printk("%d\n", vcpu->vcpu_id);
> }
> break;
> case HV_X64_MSR_RESET:
>
> So it's 20 seconds (!) between the last kvm_post_set_cr0() call and the
> crash. My (disappointing) conclusion is: the problem can be anywhere and
> Hyper-V detects it much much later.
And reproduced... 'twas indeed the mmu_notifier. Hyper-V 2019 booted just fine,
until I turned on KSM and cranked up the scanning.
The bug has nothing to do with zapping only leafs, it's a simple goof where the
TLB flush gets lost. Not sure why only Hyper-V detects the issue; maybe because
it maintains a pool of zeroed pages that are KSM-friendly?
I'll send a patch to reintroduce the reverted code.
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index c71debdbc732..a641737725d1 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -885,7 +885,7 @@ bool kvm_tdp_mmu_zap_leafs(struct kvm *kvm, int as_id, gfn_t start, gfn_t end,
struct kvm_mmu_page *root;
for_each_tdp_mmu_root_yield_safe(kvm, root, as_id)
- flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, false);
+ flush = tdp_mmu_zap_leafs(kvm, root, start, end, can_yield, flush);
return flush;
}
prev parent reply other threads:[~2022-03-25 20:18 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-03-18 16:48 [FYI PATCH] Revert "KVM: x86/mmu: Zap only TDP MMU leafs in kvm_zap_gfn_range()" Paolo Bonzini
2022-03-21 9:13 ` Paolo Bonzini
2022-03-24 23:57 ` Sean Christopherson
2022-03-25 10:38 ` Vitaly Kuznetsov
2022-03-25 11:21 ` Paolo Bonzini
2022-03-25 20:22 ` Sean Christopherson
2022-03-25 15:03 ` Vitaly Kuznetsov
2022-03-25 20:18 ` Sean Christopherson [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Yj4jmAjj5ZTJodQM@google.com \
--to=seanjc@google.com \
--cc=kvm@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=pbonzini@redhat.com \
--cc=vkuznets@redhat.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.