public inbox for kvm@vger.kernel.org
From: bugzilla-daemon@kernel.org
To: kvm@vger.kernel.org
Subject: [Bug 219588] [6.13.0-rc2+]WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001 tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm]
Date: Mon, 16 Dec 2024 05:42:30 +0000	[thread overview]
Message-ID: <bug-219588-28872-3UuJWx01n1@https.bugzilla.kernel.org/> (raw)
In-Reply-To: <bug-219588-28872@https.bugzilla.kernel.org/>

https://bugzilla.kernel.org/show_bug.cgi?id=219588

--- Comment #2 from leiyang@redhat.com ---
(In reply to Sean Christopherson from comment #1)
> On Wed, Dec 11, 2024, bugzilla-daemon@kernel.org wrote:
> > I hit a bug on the intel host, this problem occurs randomly:
> > [  406.127925] ------------[ cut here ]------------
> > [  406.132572] WARNING: CPU: 52 PID: 12253 at
> arch/x86/kvm/mmu/tdp_mmu.c:1001
> > tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm]
> 
> Can you describe the host activity at the time of the WARN?  E.g. is it under
> memory pressure and potentially swapping, is KSM or NUMA balancing active? I
> have a sound theory for how the scenario occurs on KVM's end, but I still
> think
> it's wrong for KVM to overwrite a writable SPTE with a read-only SPTE in this
> situation.

I spent some time on this problem, hence the late reply. The host prints these
error messages while a new guest is being installed via automation. I found
that the reproducer is to run this install case the first time the machine
runs this kernel (let me elaborate to avoid ambiguity: 1. It must be tested
the first time the machine runs this kernel; that means if I hit the problem
and then reboot the host, it cannot be reproduced again even if I run the
same tests. 2. Based on 1, I must also run this guest-installation case; it
cannot be reproduced with other cases). Comparing the cases, the only
difference is that this installation case uses a PXE install based on an
internal KS cfg, while the other cases do not.

Sure, I think it's running under memory pressure and swapping. Based on the
automation log, KSM is disabled, and I don't add NUMA options to the qemu
command line.

If you have a machine where you can clone avocado and run tp-qemu tests, you
can prepare the env and then run this case:
unattended_install.cdrom.extra_cdrom_ks.default_install.aio_threads

> 
> And does the VM have device memory or any other type of VM_PFNMAP or VM_IO
> memory exposed to it?  E.g. an assigned device?  If so, can you provide the
> register
> state from the other WARNs?  If the PFNs are all in the same range, then
> maybe
> this is something funky with the VM_PFNMAP | VM_IO path.

I can confirm it has VM_IO because it is running the installation case, so
the VM is constantly performing I/O operations.

Here are the memory and CPU configurations my tests used; I hope they help
you debug this problem:
-m 29G \
-smp 48,maxcpus=48,cores=24,threads=1,dies=1,sockets=2  \

And it looks like there are no other devices and VM_PFNMAP is not used.
Please correct me if I'm wrong.

> 
> The WARN is a sanity check I added because it should be impossible for KVM to
> install a non-writable SPTE overtop an existing writable SPTE.  Or so I
> thought.
> The WARN is benign in the sense that nothing bad will happen _in KVM_; KVM
> correctly handles the unexpected change, the WARN is there purely to flag
> that
> something unexpected happened.
> 
>       if (new_spte == iter->old_spte)
>               ret = RET_PF_SPURIOUS;
>       else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
>               return RET_PF_RETRY;
>       else if (is_shadow_present_pte(iter->old_spte) &&
>                (!is_last_spte(iter->old_spte, iter->level) ||
>                 WARN_ON_ONCE(leaf_spte_change_needs_tlb_flush(iter->old_spte,
> new_spte)))) <====
>               kvm_flush_remote_tlbs_gfn(vcpu->kvm, iter->gfn, iter->level);
> 
> Cross referencing the register state
> 
>   RAX: 860000025e000bf7 RBX: ff4af92c619cf920 RCX: 0400000000000000
>   RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000000000000015
>   RBP: ff4af92c619cf9e8 R08: 800000025e0009f5 R09: 0000000000000002
>   R10: 000000005e000901 R11: 0000000000000001 R12: ff1e70694fc68000
>   R13: 0000000000000005 R14: 0000000000000000 R15: ff4af92c619a1000
> 
> with the disassembly
> 
>   4885C8                          TEST RAX,RCX
>   0F84EEFEFFFF                    JE 0000000000000-F1
>   4985C8                          TEST R8,RCX
>   0F85E5FEFFFF                    JNE 0000000000000-F1
>   0F0B                            UD2
>   
> RAX is the old SPTE and RCX is the new SPTE, i.e. the SPTE change is:
> 
>   860000025e000bf7
>   800000025e0009f5
> 
> On Intel, bits 57 and 58 are the host-writable and MMU-writable flags
> 
>   #define EPT_SPTE_HOST_WRITABLE      BIT_ULL(57)
>   #define EPT_SPTE_MMU_WRITABLE               BIT_ULL(58)
> 
> which means KVM is overwriting a writable SPTE with a non-writable SPTE
> because
> the current vCPU (a) hit a READ or EXEC fault on a non-present SPTE and (b)
> retrieved
> a non-writable PFN from the primary MMU, and that fault raced with a WRITE
> fault
> on a different vCPU that retrieved and installed a writable PFN.
> 
> On a READ or EXEC fault, this code in hva_to_pfn_slow() should get a
> writable PFN.
> Given that KVM has a valid writable SPTE, the corresponding PTE in the
> primary MMU
> *must* be writable, otherwise there's a missing mmu_notifier invalidation.
> 
>       /* map read fault as writable if possible */
>       if (!(flags & FOLL_WRITE) && kfp->map_writable &&
>           get_user_page_fast_only(kfp->hva, FOLL_WRITE, &wpage)) {
>               put_page(page);
>               page = wpage;
>               flags |= FOLL_WRITE;
>       }
> 
> out:
>       *pfn = kvm_resolve_pfn(kfp, page, NULL, flags & FOLL_WRITE);
>       return npages;
> 
> Hmm, gup_fast_folio_allowed() has a few conditions where it will reject fast
> GUP,
> but they should be mutually exclusive with KVM having a writable SPTE.  If
> the
> mapping is truncated or the folio is swapped out, secondary MMUs need to be
> invalidated before folio->mapping is nullified.
> 
>       /*
>        * The mapping may have been truncated, in any case we cannot determine
>        * if this mapping is safe - fall back to slow path to determine how to
>        * proceed.
>        */
>       if (!mapping)
>               return false;
> 
> And secretmem can't be GUP'd, and it's not a long-term pin, so these checks
> don't
> apply either:
> 
>       if (check_secretmem && secretmem_mapping(mapping))
>               return false;
>       /* The only remaining allowed file system is shmem. */
>       return !reject_file_backed || shmem_mapping(mapping);
> 
> Similarly, hva_to_pfn_remapped() should get a writable PFN if said PFN is
> writable
> in the primary MMU, regardless of the fault type.
> 
> If this turns out to be a legitimate scenario, then I think it makes sense
> to
> add an is_access_allowed() check and treat the fault as spurious.  But I
> would
> like to try to bottom out on what exactly is happening, because I'm mildly
> concerned something is buggy in the primary MMU.

If you need me to provide more info, please feel free to let me know. And if
you send a patch to fix this problem, I can help verify it, since I think I
have found a stable reproducer.


Thread overview: 9+ messages
2024-12-11  5:41 [Bug 219588] New: [6.13.0-rc2+]WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001 tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm] bugzilla-daemon
2024-12-11 16:12 ` Sean Christopherson
2024-12-11 16:12 ` [Bug 219588] " bugzilla-daemon
2024-12-16  5:42 ` bugzilla-daemon [this message]
2024-12-16 23:52   ` Sean Christopherson
2024-12-16 23:53 ` bugzilla-daemon
2024-12-17  9:03 ` bugzilla-daemon
2025-03-20 14:48 ` bugzilla-daemon
2025-03-20 14:48 ` bugzilla-daemon
