[Bug 219588] New: [6.13.0-rc2+]WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001 tdp_mmu_map_handle_target

public inbox for kvm@vger.kernel.org
 help / color / mirror / Atom feed

* [Bug 219588] New: [6.13.0-rc2+]WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001 tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm]
@ 2024-12-11  5:41 bugzilla-daemon
  2024-12-11 16:12 ` Sean Christopherson
                   ` (6 more replies)
  0 siblings, 7 replies; 9+ messages in thread
From: bugzilla-daemon @ 2024-12-11  5:41 UTC (permalink / raw)
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=219588

            Bug ID: 219588
           Summary: [6.13.0-rc2+]WARNING: CPU: 52 PID: 12253 at
                    arch/x86/kvm/mmu/tdp_mmu.c:1001
                    tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm]
           Product: Virtualization
           Version: unspecified
          Hardware: Intel
                OS: Linux
            Status: NEW
          Severity: normal
          Priority: P3
         Component: kvm
          Assignee: virtualization_kvm@kernel-bugs.osdl.org
          Reporter: leiyang@redhat.com
        Regression: No

Hello

I hit a bug on the intel host, this problem occurs randomly:
[  406.127925] ------------[ cut here ]------------
[  406.132572] WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001
tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm]
[  406.143883] Modules linked in: vhost_net vhost vhost_iotlb tap tun isofs
cdrom bluetooth nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs netfs tls nft_masq
nft_ct nft_reject_ipv4 nf_reject_ipv4 nft_reject nft_chain_nat nf_nat
nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 nf_tables rpcrdma rdma_cm iw_cm
ib_cm ib_core bridge stp llc rfkill intel_rapl_msr intel_rapl_common
intel_uncore_frequency intel_uncore_frequency_common intel_ifs i10nm_edac
skx_edac_common nfit libnvdimm x86_pkg_temp_thermal intel_powerclamp coretemp
kvm_intel dax_hmem pmt_telemetry cxl_acpi iTCO_wdt kvm ipmi_ssif dell_pc
iTCO_vendor_support cxl_port mgag200 cxl_core rapl isst_if_mbox_pci dell_smbios
intel_sdsi isst_if_mmio pmt_class i2c_i801 i2c_algo_bit dell_wmi_descriptor tg3
intel_cstate intel_vsec wmi_bmof einj platform_profile isst_if_common dcdbas
intel_uncore ipmi_si acpi_power_meter i2c_smbus sg i2c_ismt mei_me mei
acpi_ipmi pcspkr ipmi_devintf ipmi_msghandler nfsd auth_rpcgss nfs_acl fuse
loop lockd grace nfs_localio sunrpc nfnetlink xfs sd_mod
[  406.143941]  iaa_crypto ahci libahci crct10dif_pclmul crc32_pclmul idxd
crc32c_intel libata megaraid_sas ghash_clmulni_intel idxd_bus wmi
pinctrl_emmitsburg dm_mirror dm_region_hash dm_log dm_mod
[  406.251266] CPU: 52 UID: 0 PID: 12253 Comm: qemu-kvm Kdump: loaded Not
tainted 6.13.0-rc2+ #1
[  406.259802] Hardware name: Dell Inc. PowerEdge R760/0024FG, BIOS 2.1.5
03/14/2024
[  406.267298] RIP: 0010:tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm]
[  406.273973] Code: 01 74 04 a8 80 74 25 48 8b 0d 54 5f 1b 00 4c 8b 04 24 41
bd 05 00 00 00 48 85 c8 0f 84 ee fe ff ff 49 85 c8 0f 85 e5 fe ff ff <0f> 0b 8d
44 d2 03 ba 01 00 00 00 49 8b 3c 24 c4 e2 f9 f7 d2 48 c1
[  406.292734] RSP: 0018:ff4af92c619cf8c0 EFLAGS: 00010246
[  406.297980] RAX: 860000025e000bf7 RBX: ff4af92c619cf920 RCX:
0400000000000000
[  406.305130] RDX: 0000000000000002 RSI: 0000000000000000 RDI:
0000000000000015
[  406.312284] RBP: ff4af92c619cf9e8 R08: 800000025e0009f5 R09:
0000000000000002
[  406.319438] R10: 000000005e000901 R11: 0000000000000001 R12:
ff1e70694fc68000
[  406.326587] R13: 0000000000000005 R14: 0000000000000000 R15:
ff4af92c619a1000
[  406.333741] FS:  00007efceb5fe6c0(0000) GS:ff1e7087bf700000(0000)
knlGS:0000000000000000
[  406.341881] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  406.347632] CR2: 00007fbb241f0710 CR3: 00000001139f0003 CR4:
0000000000f73ef0
[  406.354782] PKRU: 55555554
[  406.357512] Call Trace:
[  406.359985]  <TASK>
[  406.362108]  ? show_trace_log_lvl+0x1b0/0x2f0
[  406.366491]  ? show_trace_log_lvl+0x1b0/0x2f0
[  406.370869]  ? kvm_tdp_mmu_map+0x304/0x320 [kvm]
[  406.375550]  ? tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm]
[  406.381614]  ? __warn.cold+0x93/0xf4
[  406.385211]  ? tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm]
[  406.391277]  ? report_bug+0xff/0x140
[  406.394876]  ? handle_bug+0x53/0x90
[  406.398382]  ? exc_invalid_op+0x17/0x70
[  406.402242]  ? asm_exc_invalid_op+0x1a/0x20
[  406.406445]  ? tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm]
[  406.412511]  kvm_tdp_mmu_map+0x304/0x320 [kvm]
[  406.417020]  kvm_tdp_page_fault+0xbf/0xf0 [kvm]
[  406.421623]  kvm_mmu_do_page_fault+0x1d9/0x210 [kvm]
[  406.426657]  kvm_mmu_page_fault+0x89/0x360 [kvm]
[  406.431343]  vmx_handle_exit+0x12/0x50 [kvm_intel]
[  406.436165]  vcpu_enter_guest.constprop.0+0x5cf/0xe80 [kvm]
[  406.441806]  ? kvm_apic_update_irr+0x23/0x50 [kvm]
[  406.446667]  ? vmx_sync_pir_to_irr+0xd5/0x110 [kvm_intel]
[  406.452095]  vcpu_run+0x2f/0x240 [kvm]
[  406.455913]  kvm_arch_vcpu_ioctl_run+0x103/0x490 [kvm]
[  406.461114]  kvm_vcpu_ioctl+0x20b/0x770 [kvm]
[  406.465519]  __x64_sys_ioctl+0x94/0xc0
[  406.469288]  do_syscall_64+0x7d/0x160
[  406.472971]  ? select_idle_sibling+0x41/0x750
[  406.477350]  ? select_task_rq_fair+0x1f5/0x3b0
[  406.481821]  ? sched_clock+0x10/0x30
[  406.485417]  ? sched_clock_cpu+0xf/0x1d0
[  406.489361]  ? __smp_call_single_queue+0xab/0x110
[  406.494087]  ? ttwu_queue_wakelist+0xe5/0x100
[  406.498469]  ? try_to_wake_up+0x2a2/0x600
[  406.502499]  ? wake_up_q+0x4e/0x90
[  406.505923]  ? futex_wake+0x177/0x1a0
[  406.509608]  ? do_futex+0x125/0x190
[  406.513116]  ? __x64_sys_futex+0x127/0x1e0
[  406.517225]  ? syscall_exit_to_user_mode_prepare+0x15e/0x1a0
[  406.522899]  ? syscall_exit_to_user_mode+0x32/0x190
[  406.527799]  ? do_syscall_64+0x89/0x160
[  406.531656]  ? syscall_exit_to_user_mode_prepare+0x15e/0x1a0
[  406.537333]  ? syscall_exit_to_user_mode+0x32/0x190
[  406.542227]  ? do_syscall_64+0x89/0x160
[  406.546087]  ? exc_page_fault+0x73/0x160
[  406.550028]  entry_SYSCALL_64_after_hwframe+0x76/0x7e
[  406.555100] RIP: 0033:0x7f1bbab08ebf
[  406.558721] Code: 00 48 89 44 24 18 31 c0 48 8d 44 24 60 c7 04 24 10 00 00
00 48 89 44 24 08 48 8d 44 24 20 48 89 44 24 10 b8 10 00 00 00 0f 05 <89> c2 3d
00 f0 ff ff 77 18 48 8b 44 24 18 64 48 2b 04 25 28 00 00
[  406.577484] RSP: 002b:00007efceb5fd4c0 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[  406.585068] RAX: ffffffffffffffda RBX: 0000559776de3e80 RCX:
00007f1bbab08ebf
[  406.592217] RDX: 0000000000000000 RSI: 000000000000ae80 RDI:
000000000000001e
[  406.599367] RBP: 0000000000000000 R08: 0000000000000000 R09:
0000000000000000
[  406.606520] R10: 0000000000000000 R11: 0000000000000246 R12:
948ed561ee19de00
[  406.613670] R13: 000000000000ae80 R14: 00007efcebdffe50 R15:
fffffffffffffd10
[  406.620822]  </TASK>
[  406.623029] ---[ end trace 0000000000000000 ]---

I also can not sure it is a CPU issues or a memory issues, due to the "12253"
is a active vcpu PID, so report this bug on kvm.Please correct me if I'm wrong.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Bug 219588] New: [6.13.0-rc2+]WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001 tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm]
  2024-12-11  5:41 [Bug 219588] New: [6.13.0-rc2+]WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001 tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm] bugzilla-daemon
@ 2024-12-11 16:12 ` Sean Christopherson
  2024-12-11 16:12 ` [Bug 219588] " bugzilla-daemon
                   ` (5 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: Sean Christopherson @ 2024-12-11 16:12 UTC (permalink / raw)
  To: bugzilla-daemon; +Cc: kvm

On Wed, Dec 11, 2024, bugzilla-daemon@kernel.org wrote:
> I hit a bug on the intel host, this problem occurs randomly:
> [  406.127925] ------------[ cut here ]------------
> [  406.132572] WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001
> tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm]

Can you describe the host activity at the time of the WARN?  E.g. is it under
memory pressure and potentially swapping, is KSM or NUMA balancing active? I
have a sound theory for how the scenario occurs on KVM's end, but I still think
it's wrong for KVM to overwrite a writable SPTE with a read-only SPTE in this
situation.

And does the VM have device memory or any other type of VM_PFNMAP or VM_IO
memory exposed to it?  E.g. an assigned device?  If so, can you provide the register
state from the other WARNs?  If the PFNs are all in the same range, then maybe
this is something funky with the VM_PFNMAP | VM_IO path.

The WARN is a sanity check I added because it should be impossible for KVM to
install a non-writable SPTE overtop an existing writable SPTE.  Or so I thought.
The WARN is benign in the sense that nothing bad will happen _in KVM_; KVM
correctly handles the unexpected change, the WARN is there purely to flag that
something unexpected happen.

	if (new_spte == iter->old_spte)
		ret = RET_PF_SPURIOUS;
	else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
		return RET_PF_RETRY;
	else if (is_shadow_present_pte(iter->old_spte) &&
		 (!is_last_spte(iter->old_spte, iter->level) ||
		  WARN_ON_ONCE(leaf_spte_change_needs_tlb_flush(iter->old_spte, new_spte)))) <====
		kvm_flush_remote_tlbs_gfn(vcpu->kvm, iter->gfn, iter->level);

Cross referencing the register state

  RAX: 860000025e000bf7 RBX: ff4af92c619cf920 RCX: 0400000000000000
  RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000000000000015
  RBP: ff4af92c619cf9e8 R08: 800000025e0009f5 R09: 0000000000000002
  R10: 000000005e000901 R11: 0000000000000001 R12: ff1e70694fc68000
  R13: 0000000000000005 R14: 0000000000000000 R15: ff4af92c619a1000

with the disassembly

  4885C8                          TEST RAX,RCX
  0F84EEFEFFFF                    JE 0000000000000-F1
  4985C8                          TEST R8,RCX
  0F85E5FEFFFF                    JNE 0000000000000-F1
  0F0B                            UD2

RAX is the old SPTE and RCX is the new SPTE, i.e. the SPTE change is:

  860000025e000bf7
  800000025e0009f5

On Intel, bits 57 and 58 are the host-writable and MMU-writable flags

  #define EPT_SPTE_HOST_WRITABLE	BIT_ULL(57)
  #define EPT_SPTE_MMU_WRITABLE		BIT_ULL(58)

which means KVM is overwriting a writable SPTE with a non-writable SPTE because
the current vCPU (a) hit a READ or EXEC fault on a non-present SPTE and (b) retrieved
a non-writable PFN from the primary MMU, and that fault raced with a WRITE fault
on a different vCPU that retrieved and installed a writable PFN.

On a READ or EXEC fault, this code in hva_to_pfn_slow() should get a writable PFN.
Given that KVM has an valid writable SPTE, the corresponding PTE in the primary MMU
*must* be writable, otherwise there's a missing mmu_notifier invalidation.

	/* map read fault as writable if possible */
	if (!(flags & FOLL_WRITE) && kfp->map_writable &&
	    get_user_page_fast_only(kfp->hva, FOLL_WRITE, &wpage)) {
		put_page(page);
		page = wpage;
		flags |= FOLL_WRITE;
	}

out:
	*pfn = kvm_resolve_pfn(kfp, page, NULL, flags & FOLL_WRITE);
	return npages;

Hmm, gup_fast_folio_allowed() has a few conditions where it will reject fast GUP,
but they should be mutually exclusive with KVM having a writable SPTE.  If the
mapping is truncated or the folio is swapped out, secondary MMUs need to be
invalidated before folio->mapping is nullified.

	/*
	 * The mapping may have been truncated, in any case we cannot determine
	 * if this mapping is safe - fall back to slow path to determine how to
	 * proceed.
	 */
	if (!mapping)
		return false;

And secretmem can't be GUP'd, and it's not a long-term pin, so these checks don't
apply either:

	if (check_secretmem && secretmem_mapping(mapping))
		return false;
	/* The only remaining allowed file system is shmem. */
	return !reject_file_backed || shmem_mapping(mapping);

Similarly, hva_to_pfn_remapped() should get a writable PFN if said PFN is writable
in the primary MMU, regardless of the fault type.

If this turns out to get a legitimate scenario, then I think it makes sense to
add an is_access_allowed() check and treat the fault as spurious.  But I would
like to try to bottom out on what exactly is happening, because I'm mildly
concerned something is buggy in the primary MMU.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug 219588] [6.13.0-rc2+]WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001 tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm]
  2024-12-11  5:41 [Bug 219588] New: [6.13.0-rc2+]WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001 tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm] bugzilla-daemon
  2024-12-11 16:12 ` Sean Christopherson
@ 2024-12-11 16:12 ` bugzilla-daemon
  2024-12-16  5:42 ` bugzilla-daemon
                   ` (4 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: bugzilla-daemon @ 2024-12-11 16:12 UTC (permalink / raw)
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=219588

--- Comment #1 from Sean Christopherson (seanjc@google.com) ---
On Wed, Dec 11, 2024, bugzilla-daemon@kernel.org wrote:
> I hit a bug on the intel host, this problem occurs randomly:
> [  406.127925] ------------[ cut here ]------------
> [  406.132572] WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001
> tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm]

Can you describe the host activity at the time of the WARN?  E.g. is it under
memory pressure and potentially swapping, is KSM or NUMA balancing active? I
have a sound theory for how the scenario occurs on KVM's end, but I still think
it's wrong for KVM to overwrite a writable SPTE with a read-only SPTE in this
situation.

And does the VM have device memory or any other type of VM_PFNMAP or VM_IO
memory exposed to it?  E.g. an assigned device?  If so, can you provide the
register
state from the other WARNs?  If the PFNs are all in the same range, then maybe
this is something funky with the VM_PFNMAP | VM_IO path.

The WARN is a sanity check I added because it should be impossible for KVM to
install a non-writable SPTE overtop an existing writable SPTE.  Or so I
thought.
The WARN is benign in the sense that nothing bad will happen _in KVM_; KVM
correctly handles the unexpected change, the WARN is there purely to flag that
something unexpected happen.

        if (new_spte == iter->old_spte)
                ret = RET_PF_SPURIOUS;
        else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
                return RET_PF_RETRY;
        else if (is_shadow_present_pte(iter->old_spte) &&
                 (!is_last_spte(iter->old_spte, iter->level) ||
                  WARN_ON_ONCE(leaf_spte_change_needs_tlb_flush(iter->old_spte,
new_spte)))) <====
                kvm_flush_remote_tlbs_gfn(vcpu->kvm, iter->gfn, iter->level);

Cross referencing the register state

  RAX: 860000025e000bf7 RBX: ff4af92c619cf920 RCX: 0400000000000000
  RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000000000000015
  RBP: ff4af92c619cf9e8 R08: 800000025e0009f5 R09: 0000000000000002
  R10: 000000005e000901 R11: 0000000000000001 R12: ff1e70694fc68000
  R13: 0000000000000005 R14: 0000000000000000 R15: ff4af92c619a1000

with the disassembly

  4885C8                          TEST RAX,RCX
  0F84EEFEFFFF                    JE 0000000000000-F1
  4985C8                          TEST R8,RCX
  0F85E5FEFFFF                    JNE 0000000000000-F1
  0F0B                            UD2

RAX is the old SPTE and RCX is the new SPTE, i.e. the SPTE change is:

  860000025e000bf7
  800000025e0009f5

On Intel, bits 57 and 58 are the host-writable and MMU-writable flags

  #define EPT_SPTE_HOST_WRITABLE        BIT_ULL(57)
  #define EPT_SPTE_MMU_WRITABLE         BIT_ULL(58)

which means KVM is overwriting a writable SPTE with a non-writable SPTE because
the current vCPU (a) hit a READ or EXEC fault on a non-present SPTE and (b)
retrieved
a non-writable PFN from the primary MMU, and that fault raced with a WRITE
fault
on a different vCPU that retrieved and installed a writable PFN.

On a READ or EXEC fault, this code in hva_to_pfn_slow() should get a writable
PFN.
Given that KVM has an valid writable SPTE, the corresponding PTE in the primary
MMU
*must* be writable, otherwise there's a missing mmu_notifier invalidation.

        /* map read fault as writable if possible */
        if (!(flags & FOLL_WRITE) && kfp->map_writable &&
            get_user_page_fast_only(kfp->hva, FOLL_WRITE, &wpage)) {
                put_page(page);
                page = wpage;
                flags |= FOLL_WRITE;
        }

out:
        *pfn = kvm_resolve_pfn(kfp, page, NULL, flags & FOLL_WRITE);
        return npages;

Hmm, gup_fast_folio_allowed() has a few conditions where it will reject fast
GUP,
but they should be mutually exclusive with KVM having a writable SPTE.  If the
mapping is truncated or the folio is swapped out, secondary MMUs need to be
invalidated before folio->mapping is nullified.

        /*
         * The mapping may have been truncated, in any case we cannot determine
         * if this mapping is safe - fall back to slow path to determine how to
         * proceed.
         */
        if (!mapping)
                return false;

And secretmem can't be GUP'd, and it's not a long-term pin, so these checks
don't
apply either:

        if (check_secretmem && secretmem_mapping(mapping))
                return false;
        /* The only remaining allowed file system is shmem. */
        return !reject_file_backed || shmem_mapping(mapping);

Similarly, hva_to_pfn_remapped() should get a writable PFN if said PFN is
writable
in the primary MMU, regardless of the fault type.

If this turns out to get a legitimate scenario, then I think it makes sense to
add an is_access_allowed() check and treat the fault as spurious.  But I would
like to try to bottom out on what exactly is happening, because I'm mildly
concerned something is buggy in the primary MMU.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug 219588] [6.13.0-rc2+]WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001 tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm]
  2024-12-11  5:41 [Bug 219588] New: [6.13.0-rc2+]WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001 tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm] bugzilla-daemon
  2024-12-11 16:12 ` Sean Christopherson
  2024-12-11 16:12 ` [Bug 219588] " bugzilla-daemon
@ 2024-12-16  5:42 ` bugzilla-daemon
  2024-12-16 23:52   ` Sean Christopherson
  2024-12-16 23:53 ` bugzilla-daemon
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 9+ messages in thread
From: bugzilla-daemon @ 2024-12-16  5:42 UTC (permalink / raw)
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=219588

--- Comment #2 from leiyang@redhat.com ---
(In reply to Sean Christopherson from comment #1)
> On Wed, Dec 11, 2024, bugzilla-daemon@kernel.org wrote:
> > I hit a bug on the intel host, this problem occurs randomly:
> > [  406.127925] ------------[ cut here ]------------
> > [  406.132572] WARNING: CPU: 52 PID: 12253 at
> arch/x86/kvm/mmu/tdp_mmu.c:1001
> > tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm]
> 
> Can you describe the host activity at the time of the WARN?  E.g. is it under
> memory pressure and potentially swapping, is KSM or NUMA balancing active? I
> have a sound theory for how the scenario occurs on KVM's end, but I still
> think
> it's wrong for KVM to overwrite a writable SPTE with a read-only SPTE in this
> situation.

I spent some time for this problem so late reply. When host dmesg print this
error messages which running install a new guest via automation. And I found
this bug's reproducer is run this install case after the mchine first time
running(Let me introduce more to avoid ambiguity： 1. Must to test it when the
machine first time running this kernel,that mean's if I hit this problem then
reboot host, it can not reproduced again even if I run the same tests.  2.
Based on 1, I also must test this installation guest case, it can not
reporduced on other cases.). But through compare, this installation cases only
used pxe install based on a internal KS cfg is different other cases.

Sure, I think it's running under memory pressure and swapping. Based on
automation log, KSM is disable and I don't add NUMA in qemu command line. 

If you have a machine can clone avocado and run tp-qemu tests, you can prepare
env then run this case:
unattended_install.cdrom.extra_cdrom_ks.default_install.aio_threads

> 
> And does the VM have device memory or any other type of VM_PFNMAP or VM_IO
> memory exposed to it?  E.g. an assigned device?  If so, can you provide the
> register
> state from the other WARNs?  If the PFNs are all in the same range, then
> maybe
> this is something funky with the VM_PFNMAP | VM_IO path.

I can confirm it has VM_IO because it runing installation case, VM is
constantly performing I/O operations

This my tests used memory and CPU configured, hope it help you debug this
problem:
-m 29G \
-smp 48,maxcpus=48,cores=24,threads=1,dies=1,sockets=2  \

And looks like there are no other device  and no using VM_PFNMAP. Please
correct me if I'm wrong.

> 
> The WARN is a sanity check I added because it should be impossible for KVM to
> install a non-writable SPTE overtop an existing writable SPTE.  Or so I
> thought.
> The WARN is benign in the sense that nothing bad will happen _in KVM_; KVM
> correctly handles the unexpected change, the WARN is there purely to flag
> that
> something unexpected happen.
> 
>       if (new_spte == iter->old_spte)
>               ret = RET_PF_SPURIOUS;
>       else if (tdp_mmu_set_spte_atomic(vcpu->kvm, iter, new_spte))
>               return RET_PF_RETRY;
>       else if (is_shadow_present_pte(iter->old_spte) &&
>                (!is_last_spte(iter->old_spte, iter->level) ||
>                 WARN_ON_ONCE(leaf_spte_change_needs_tlb_flush(iter->old_spte,
> new_spte)))) <====
>               kvm_flush_remote_tlbs_gfn(vcpu->kvm, iter->gfn, iter->level);
> 
> Cross referencing the register state
> 
>   RAX: 860000025e000bf7 RBX: ff4af92c619cf920 RCX: 0400000000000000
>   RDX: 0000000000000002 RSI: 0000000000000000 RDI: 0000000000000015
>   RBP: ff4af92c619cf9e8 R08: 800000025e0009f5 R09: 0000000000000002
>   R10: 000000005e000901 R11: 0000000000000001 R12: ff1e70694fc68000
>   R13: 0000000000000005 R14: 0000000000000000 R15: ff4af92c619a1000
> 
> with the disassembly
> 
>   4885C8                          TEST RAX,RCX
>   0F84EEFEFFFF                    JE 0000000000000-F1
>   4985C8                          TEST R8,RCX
>   0F85E5FEFFFF                    JNE 0000000000000-F1
>   0F0B                            UD2
>   
> RAX is the old SPTE and RCX is the new SPTE, i.e. the SPTE change is:
> 
>   860000025e000bf7
>   800000025e0009f5
> 
> On Intel, bits 57 and 58 are the host-writable and MMU-writable flags
> 
>   #define EPT_SPTE_HOST_WRITABLE      BIT_ULL(57)
>   #define EPT_SPTE_MMU_WRITABLE               BIT_ULL(58)
> 
> which means KVM is overwriting a writable SPTE with a non-writable SPTE
> because
> the current vCPU (a) hit a READ or EXEC fault on a non-present SPTE and (b)
> retrieved
> a non-writable PFN from the primary MMU, and that fault raced with a WRITE
> fault
> on a different vCPU that retrieved and installed a writable PFN.
> 
> On a READ or EXEC fault, this code in hva_to_pfn_slow() should get a
> writable PFN.
> Given that KVM has an valid writable SPTE, the corresponding PTE in the
> primary MMU
> *must* be writable, otherwise there's a missing mmu_notifier invalidation.
> 
>       /* map read fault as writable if possible */
>       if (!(flags & FOLL_WRITE) && kfp->map_writable &&
>           get_user_page_fast_only(kfp->hva, FOLL_WRITE, &wpage)) {
>               put_page(page);
>               page = wpage;
>               flags |= FOLL_WRITE;
>       }
> 
> out:
>       *pfn = kvm_resolve_pfn(kfp, page, NULL, flags & FOLL_WRITE);
>       return npages;
> 
> Hmm, gup_fast_folio_allowed() has a few conditions where it will reject fast
> GUP,
> but they should be mutually exclusive with KVM having a writable SPTE.  If
> the
> mapping is truncated or the folio is swapped out, secondary MMUs need to be
> invalidated before folio->mapping is nullified.
> 
>       /*
>        * The mapping may have been truncated, in any case we cannot determine
>        * if this mapping is safe - fall back to slow path to determine how to
>        * proceed.
>        */
>       if (!mapping)
>               return false;
> 
> And secretmem can't be GUP'd, and it's not a long-term pin, so these checks
> don't
> apply either:
> 
>       if (check_secretmem && secretmem_mapping(mapping))
>               return false;
>       /* The only remaining allowed file system is shmem. */
>       return !reject_file_backed || shmem_mapping(mapping);
> 
> Similarly, hva_to_pfn_remapped() should get a writable PFN if said PFN is
> writable
> in the primary MMU, regardless of the fault type.
> 
> If this turns out to get a legitimate scenario, then I think it makes sense
> to
> add an is_access_allowed() check and treat the fault as spurious.  But I
> would
> like to try to bottom out on what exactly is happening, because I'm mildly
> concerned something is buggy in the primary MMU.

If you need me to provide more info, please feel free to let me know. And if
you  sent a patch to fix this problem I can help to verified it, since I think
I found the stable reproducer.

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [Bug 219588] [6.13.0-rc2+]WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001 tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm]
  2024-12-16  5:42 ` bugzilla-daemon
@ 2024-12-16 23:52   ` Sean Christopherson
  0 siblings, 0 replies; 9+ messages in thread
From: Sean Christopherson @ 2024-12-16 23:52 UTC (permalink / raw)
  To: bugzilla-daemon; +Cc: kvm

On Mon, Dec 16, 2024, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=219588
> 
> --- Comment #2 from leiyang@redhat.com ---
> (In reply to Sean Christopherson from comment #1)
> > On Wed, Dec 11, 2024, bugzilla-daemon@kernel.org wrote:
> > > I hit a bug on the intel host, this problem occurs randomly:
> > > [  406.127925] ------------[ cut here ]------------
> > > [  406.132572] WARNING: CPU: 52 PID: 12253 at
> > arch/x86/kvm/mmu/tdp_mmu.c:1001
> > > tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm]
> > 
> > Can you describe the host activity at the time of the WARN?  E.g. is it under
> > memory pressure and potentially swapping, is KSM or NUMA balancing active? I
> > have a sound theory for how the scenario occurs on KVM's end, but I still
> > think
> > it's wrong for KVM to overwrite a writable SPTE with a read-only SPTE in this
> > situation.
> 
> I spent some time for this problem so late reply. When host dmesg print this
> error messages which running install a new guest via automation. And I found
> this bug's reproducer is run this install case after the mchine first time
> running(Let me introduce more to avoid ambiguity： 1. Must to test it when the
> machine first time running this kernel,that mean's if I hit this problem then
> reboot host, it can not reproduced again even if I run the same tests.

Yeah, WARN_ON_ONCE() suppresses the splat after a single occurrence.  If KVM is
built as a module, you can can unload and reload kvm.ko (and whatever vendor
you're using) instead of rebooting.  The flag that controls the "once" behavior
is tied to the lifetime of kvm.ko.

> 2.  Based on 1, I also must test this installation guest case, it can not
> reporduced on other cases.). But through compare, this installation cases
> only used pxe install based on a internal KS cfg is different other cases.

Mostly out of curiosity, what's "KS cfg"?

> Sure, I think it's running under memory pressure and swapping. Based on
> automation log, KSM is disable and I don't add NUMA in qemu command line. 
> 
> If you have a machine can clone avocado and run tp-qemu tests, you can prepare
> env then run this case:
> unattended_install.cdrom.extra_cdrom_ks.default_install.aio_threads
> 
> > 
> > And does the VM have device memory or any other type of VM_PFNMAP or VM_IO
> > memory exposed to it?  E.g. an assigned device?  If so, can you provide the
> > register
> > state from the other WARNs?  If the PFNs are all in the same range, then
> > maybe
> > this is something funky with the VM_PFNMAP | VM_IO path.
> 
> I can confirm it has VM_IO because it runing installation case, VM is
> constantly performing I/O operations

Heh, VM_IO and emulated I/O for the guest are two very different things.  VM_IO
would typically only be present if you're doing some form of device passthrough.

Regardless, I don't know that it's worth trying to figure out exactly what's
happening in the primary MMU.  What's happening is perfectly legal, and the odds
of there being an underlying bug *and* it not having any other symptoms is very
low.  And it definitely doesn't make sense to withold a fix just because I'm
paranoid :-)

So, assuming you can test custom kernels, can you test this and confirm it makes
the WARN go away?  If it does, or if you can't test it, I'll post a proper patch.
If it doesn't fix the issue, then something completely unexpected is happening
and we'll have to dig deeper.

---
From: Sean Christopherson <seanjc@google.com>
Date: Mon, 16 Dec 2024 13:18:01 -0800
Subject: [PATCH] KVM: x86/mmu: Treat TDP MMU faults as spurious if access is
 already allowed

Treat slow-path TDP MMU faults as spurious if the access is allowed given
the existing SPTE to fix a benign warning (other than the WARN itself)
due to replacing a writable SPTE with a read-only SPTE, and to avoid the
unnecessary LOCK CMPXCHG and subsequent TLB flush.

If a read fault races with a write fault, fast GUP fails for any reason
when trying to "promote" the read fault to a writable mapping, and KVM
resolves the write fault first, then KVM will end up trying to install a
read-only SPTE (for a !map_writable fault) overtop a writable SPTE.

Note, it's not entirely clear why fast GUP fails, or if that's even how
KVM ends up with a !map_writable fault with a writable SPTE.  If something
else is going awry, e.g. due to a bug in mmu_notifiers, then treating read
faults as spurious in this scenario could effectively mask the underlying
problem.

However, retrying the faulting access instead of overwriting an existing
SPTE is functionally correct and desirable irrespective of the WARN, and
fast GUP _can_ legitimately fail with a writable VMA, e.g. if the Accessed
bit in primary MMU's PTE is toggled and causes a PTE value mismatch.  The
WARN was also recently added, specifically to track down scenarios where
KVM is unnecessarily overwrites SPTEs, i.e. treating the fault as spurious
doesn't regress KVM's bug-finding capabilities in any way.  In short,
letting the WARN linger because there's a tiny chance it's due to a bug
elsewhere would be excessively paranoid.

Fixes: 1a175082b190 ("KVM: x86/mmu: WARN and flush if resolving a TDP MMU fault clears MMU-writable")
Reported-by: leiyang@redhat.com
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219588
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c     | 12 ------------
 arch/x86/kvm/mmu/spte.h    | 17 +++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.c |  5 +++++
 3 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 22e7ad235123..2401606db260 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3364,18 +3364,6 @@ static bool fast_pf_fix_direct_spte(struct kvm_vcpu *vcpu,
 	return true;
 }

-static bool is_access_allowed(struct kvm_page_fault *fault, u64 spte)
-{
-	if (fault->exec)
-		return is_executable_pte(spte);
-
-	if (fault->write)
-		return is_writable_pte(spte);
-
-	/* Fault was on Read access */
-	return spte & PT_PRESENT_MASK;
-}
-
 /*
  * Returns the last level spte pointer of the shadow page walk for the given
  * gpa, and sets *spte to the spte value. This spte may be non-preset. If no
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index f332b33bc817..6285c45fa56d 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -461,6 +461,23 @@ static inline bool is_mmu_writable_spte(u64 spte)
 	return spte & shadow_mmu_writable_mask;
 }

+/*
+ * Returns true if the access indiciated by @fault is allowed by the existing
+ * SPTE protections.  Note, the caller is responsible for checking that the
+ * SPTE is a shadow-present, leaf SPTE (either before or after).
+ */
+static inline bool is_access_allowed(struct kvm_page_fault *fault, u64 spte)
+{
+	if (fault->exec)
+		return is_executable_pte(spte);
+
+	if (fault->write)
+		return is_writable_pte(spte);
+
+	/* Fault was on Read access */
+	return spte & PT_PRESENT_MASK;
+}
+
 /*
  * If the MMU-writable flag is cleared, i.e. the SPTE is write-protected for
  * write-tracking, remote TLBs must be flushed, even if the SPTE was read-only,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 4508d868f1cd..2f15e0e33903 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -985,6 +985,11 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
 	if (fault->prefetch && is_shadow_present_pte(iter->old_spte))
 		return RET_PF_SPURIOUS;

+	if (is_shadow_present_pte(iter->old_spte) &&
+	    is_access_allowed(fault, iter->old_spte) &&
+	    is_last_spte(iter->old_spte, iter->level))
+		return RET_PF_SPURIOUS;
+
 	if (unlikely(!fault->slot))
 		new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
 	else

base-commit: 3522c419758ee8dca5a0e8753ee0070a22157bc1
-- 

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [Bug 219588] [6.13.0-rc2+]WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001 tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm]
  2024-12-11  5:41 [Bug 219588] New: [6.13.0-rc2+]WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001 tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm] bugzilla-daemon
                   ` (2 preceding siblings ...)
  2024-12-16  5:42 ` bugzilla-daemon
@ 2024-12-16 23:53 ` bugzilla-daemon
  2024-12-17  9:03 ` bugzilla-daemon
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 9+ messages in thread
From: bugzilla-daemon @ 2024-12-16 23:53 UTC (permalink / raw)
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=219588

--- Comment #3 from Sean Christopherson (seanjc@google.com) ---
On Mon, Dec 16, 2024, bugzilla-daemon@kernel.org wrote:
> https://bugzilla.kernel.org/show_bug.cgi?id=219588
> 
> --- Comment #2 from leiyang@redhat.com ---
> (In reply to Sean Christopherson from comment #1)
> > On Wed, Dec 11, 2024, bugzilla-daemon@kernel.org wrote:
> > > I hit a bug on the intel host, this problem occurs randomly:
> > > [  406.127925] ------------[ cut here ]------------
> > > [  406.132572] WARNING: CPU: 52 PID: 12253 at
> > arch/x86/kvm/mmu/tdp_mmu.c:1001
> > > tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm]
> > 
> > Can you describe the host activity at the time of the WARN?  E.g. is it
> under
> > memory pressure and potentially swapping, is KSM or NUMA balancing active?
> I
> > have a sound theory for how the scenario occurs on KVM's end, but I still
> > think
> > it's wrong for KVM to overwrite a writable SPTE with a read-only SPTE in
> this
> > situation.
> 
> I spent some time for this problem so late reply. When host dmesg print this
> error messages which running install a new guest via automation. And I found
> this bug's reproducer is run this install case after the mchine first time
> running(Let me introduce more to avoid ambiguity： 1. Must to test it when the
> machine first time running this kernel,that mean's if I hit this problem then
> reboot host, it can not reproduced again even if I run the same tests.

Yeah, WARN_ON_ONCE() suppresses the splat after a single occurrence.  If KVM is
built as a module, you can can unload and reload kvm.ko (and whatever vendor
you're using) instead of rebooting.  The flag that controls the "once" behavior
is tied to the lifetime of kvm.ko.

> 2.  Based on 1, I also must test this installation guest case, it can not
> reporduced on other cases.). But through compare, this installation cases
> only used pxe install based on a internal KS cfg is different other cases.

Mostly out of curiosity, what's "KS cfg"?

> Sure, I think it's running under memory pressure and swapping. Based on
> automation log, KSM is disable and I don't add NUMA in qemu command line. 
> 
> If you have a machine can clone avocado and run tp-qemu tests, you can
> prepare
> env then run this case:
> unattended_install.cdrom.extra_cdrom_ks.default_install.aio_threads
> 
> > 
> > And does the VM have device memory or any other type of VM_PFNMAP or VM_IO
> > memory exposed to it?  E.g. an assigned device?  If so, can you provide the
> > register
> > state from the other WARNs?  If the PFNs are all in the same range, then
> > maybe
> > this is something funky with the VM_PFNMAP | VM_IO path.
> 
> I can confirm it has VM_IO because it runing installation case, VM is
> constantly performing I/O operations

Heh, VM_IO and emulated I/O for the guest are two very different things.  VM_IO
would typically only be present if you're doing some form of device
passthrough.

Regardless, I don't know that it's worth trying to figure out exactly what's
happening in the primary MMU.  What's happening is perfectly legal, and the
odds
of there being an underlying bug *and* it not having any other symptoms is very
low.  And it definitely doesn't make sense to withold a fix just because I'm
paranoid :-)

So, assuming you can test custom kernels, can you test this and confirm it
makes
the WARN go away?  If it does, or if you can't test it, I'll post a proper
patch.
If it doesn't fix the issue, then something completely unexpected is happening
and we'll have to dig deeper.

---
From: Sean Christopherson <seanjc@google.com>
Date: Mon, 16 Dec 2024 13:18:01 -0800
Subject: [PATCH] KVM: x86/mmu: Treat TDP MMU faults as spurious if access is
 already allowed

Treat slow-path TDP MMU faults as spurious if the access is allowed given
the existing SPTE to fix a benign warning (other than the WARN itself)
due to replacing a writable SPTE with a read-only SPTE, and to avoid the
unnecessary LOCK CMPXCHG and subsequent TLB flush.

If a read fault races with a write fault, fast GUP fails for any reason
when trying to "promote" the read fault to a writable mapping, and KVM
resolves the write fault first, then KVM will end up trying to install a
read-only SPTE (for a !map_writable fault) overtop a writable SPTE.

Note, it's not entirely clear why fast GUP fails, or if that's even how
KVM ends up with a !map_writable fault with a writable SPTE.  If something
else is going awry, e.g. due to a bug in mmu_notifiers, then treating read
faults as spurious in this scenario could effectively mask the underlying
problem.

However, retrying the faulting access instead of overwriting an existing
SPTE is functionally correct and desirable irrespective of the WARN, and
fast GUP _can_ legitimately fail with a writable VMA, e.g. if the Accessed
bit in primary MMU's PTE is toggled and causes a PTE value mismatch.  The
WARN was also recently added, specifically to track down scenarios where
KVM is unnecessarily overwrites SPTEs, i.e. treating the fault as spurious
doesn't regress KVM's bug-finding capabilities in any way.  In short,
letting the WARN linger because there's a tiny chance it's due to a bug
elsewhere would be excessively paranoid.

Fixes: 1a175082b190 ("KVM: x86/mmu: WARN and flush if resolving a TDP MMU fault
clears MMU-writable")
Reported-by: leiyang@redhat.com
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219588
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 arch/x86/kvm/mmu/mmu.c     | 12 ------------
 arch/x86/kvm/mmu/spte.h    | 17 +++++++++++++++++
 arch/x86/kvm/mmu/tdp_mmu.c |  5 +++++
 3 files changed, 22 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 22e7ad235123..2401606db260 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3364,18 +3364,6 @@ static bool fast_pf_fix_direct_spte(struct kvm_vcpu
*vcpu,
        return true;
 }

-static bool is_access_allowed(struct kvm_page_fault *fault, u64 spte)
-{
-       if (fault->exec)
-               return is_executable_pte(spte);
-
-       if (fault->write)
-               return is_writable_pte(spte);
-
-       /* Fault was on Read access */
-       return spte & PT_PRESENT_MASK;
-}
-
 /*
  * Returns the last level spte pointer of the shadow page walk for the given
  * gpa, and sets *spte to the spte value. This spte may be non-preset. If no
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index f332b33bc817..6285c45fa56d 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -461,6 +461,23 @@ static inline bool is_mmu_writable_spte(u64 spte)
        return spte & shadow_mmu_writable_mask;
 }

+/*
+ * Returns true if the access indiciated by @fault is allowed by the existing
+ * SPTE protections.  Note, the caller is responsible for checking that the
+ * SPTE is a shadow-present, leaf SPTE (either before or after).
+ */
+static inline bool is_access_allowed(struct kvm_page_fault *fault, u64 spte)
+{
+       if (fault->exec)
+               return is_executable_pte(spte);
+
+       if (fault->write)
+               return is_writable_pte(spte);
+
+       /* Fault was on Read access */
+       return spte & PT_PRESENT_MASK;
+}
+
 /*
  * If the MMU-writable flag is cleared, i.e. the SPTE is write-protected for
  * write-tracking, remote TLBs must be flushed, even if the SPTE was
read-only,
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 4508d868f1cd..2f15e0e33903 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -985,6 +985,11 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu
*vcpu,
        if (fault->prefetch && is_shadow_present_pte(iter->old_spte))
                return RET_PF_SPURIOUS;

+       if (is_shadow_present_pte(iter->old_spte) &&
+           is_access_allowed(fault, iter->old_spte) &&
+           is_last_spte(iter->old_spte, iter->level))
+               return RET_PF_SPURIOUS;
+
        if (unlikely(!fault->slot))
                new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
        else

base-commit: 3522c419758ee8dca5a0e8753ee0070a22157bc1

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [Bug 219588] [6.13.0-rc2+]WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001 tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm]
  2024-12-11  5:41 [Bug 219588] New: [6.13.0-rc2+]WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001 tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm] bugzilla-daemon
                   ` (3 preceding siblings ...)
  2024-12-16 23:53 ` bugzilla-daemon
@ 2024-12-17  9:03 ` bugzilla-daemon
  2025-03-20 14:48 ` bugzilla-daemon
  2025-03-20 14:48 ` bugzilla-daemon
  6 siblings, 0 replies; 9+ messages in thread
From: bugzilla-daemon @ 2024-12-17  9:03 UTC (permalink / raw)
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=219588

--- Comment #4 from leiyang@redhat.com ---
Hi (In reply to Sean Christopherson from comment #3)
> On Mon, Dec 16, 2024, bugzilla-daemon@kernel.org wrote
> However, retrying the faulting access instead of overwriting an existing
> SPTE is functionally correct and desirable irrespective of the WARN, and
> fast GUP _can_ legitimately fail with a writable VMA, e.g. if the Accessed
> bit in primary MMU's PTE is toggled and causes a PTE value mismatch.  The
> WARN was also recently added, specifically to track down scenarios where
> KVM is unnecessarily overwrites SPTEs, i.e. treating the fault as spurious
> doesn't regress KVM's bug-finding capabilities in any way.  In short,
> letting the WARN linger because there's a tiny chance it's due to a bug
> elsewhere would be excessively paranoid.
> 
> Fixes: 1a175082b190 ("KVM: x86/mmu: WARN and flush if resolving a TDP MMU
> fault clears MMU-writable")
> Reported-by: leiyang@redhat.com
> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219588
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>  arch/x86/kvm/mmu/mmu.c     | 12 ------------
>  arch/x86/kvm/mmu/spte.h    | 17 +++++++++++++++++
>  arch/x86/kvm/mmu/tdp_mmu.c |  5 +++++
>  3 files changed, 22 insertions(+), 12 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 22e7ad235123..2401606db260 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3364,18 +3364,6 @@ static bool fast_pf_fix_direct_spte(struct kvm_vcpu
> *vcpu,
>       return true;
>  }
>  
> -static bool is_access_allowed(struct kvm_page_fault *fault, u64 spte)
> -{
> -     if (fault->exec)
> -             return is_executable_pte(spte);
> -
> -     if (fault->write)
> -             return is_writable_pte(spte);
> -
> -     /* Fault was on Read access */
> -     return spte & PT_PRESENT_MASK;
> -}
> -
>  /*
>   * Returns the last level spte pointer of the shadow page walk for the given
>   * gpa, and sets *spte to the spte value. This spte may be non-preset. If no
> diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
> index f332b33bc817..6285c45fa56d 100644
> --- a/arch/x86/kvm/mmu/spte.h
> +++ b/arch/x86/kvm/mmu/spte.h
> @@ -461,6 +461,23 @@ static inline bool is_mmu_writable_spte(u64 spte)
>       return spte & shadow_mmu_writable_mask;
>  }
>  
> +/*
> + * Returns true if the access indiciated by @fault is allowed by the
> existing
> + * SPTE protections.  Note, the caller is responsible for checking that the
> + * SPTE is a shadow-present, leaf SPTE (either before or after).
> + */
> +static inline bool is_access_allowed(struct kvm_page_fault *fault, u64 spte)
> +{
> +     if (fault->exec)
> +             return is_executable_pte(spte);
> +
> +     if (fault->write)
> +             return is_writable_pte(spte);
> +
> +     /* Fault was on Read access */
> +     return spte & PT_PRESENT_MASK;
> +}
> +
>  /*
>   * If the MMU-writable flag is cleared, i.e. the SPTE is write-protected for
>   * write-tracking, remote TLBs must be flushed, even if the SPTE was
> read-only,
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 4508d868f1cd..2f15e0e33903 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -985,6 +985,11 @@ static int tdp_mmu_map_handle_target_level(struct
> kvm_vcpu *vcpu,
>       if (fault->prefetch && is_shadow_present_pte(iter->old_spte))
>               return RET_PF_SPURIOUS;
>  
> +     if (is_shadow_present_pte(iter->old_spte) &&
> +         is_access_allowed(fault, iter->old_spte) &&
> +         is_last_spte(iter->old_spte, iter->level))
> +             return RET_PF_SPURIOUS;
> +
>       if (unlikely(!fault->slot))
>               new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
>       else
> 
> base-commit: 3522c419758ee8dca5a0e8753ee0070a22157bc1

Hi Sean

This problem has gone after applied your provide patch.

Thanks
Lei

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug 219588] [6.13.0-rc2+]WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001 tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm]
  2024-12-11  5:41 [Bug 219588] New: [6.13.0-rc2+]WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001 tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm] bugzilla-daemon
                   ` (4 preceding siblings ...)
  2024-12-17  9:03 ` bugzilla-daemon
@ 2025-03-20 14:48 ` bugzilla-daemon
  2025-03-20 14:48 ` bugzilla-daemon
  6 siblings, 0 replies; 9+ messages in thread
From: bugzilla-daemon @ 2025-03-20 14:48 UTC (permalink / raw)
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=219588

--- Comment #5 from leiyang@redhat.com ---
Due to the fixed patch has been merge into the upstream master branch, so close
done this bug.

commit 386d69f9f29b0814881fa4f92ac7b8dfa9b4f44a
Author: Sean Christopherson <seanjc@google.com>
Date:   Wed Dec 18 13:36:11 2024 -0800

    KVM: x86/mmu: Treat TDP MMU faults as spurious if access is already allowed

    Treat slow-path TDP MMU faults as spurious if the access is allowed given
    the existing SPTE to fix a benign warning (other than the WARN itself)
    due to replacing a writable SPTE with a read-only SPTE, and to avoid the
    unnecessary LOCK CMPXCHG and subsequent TLB flush.

    If a read fault races with a write fault, fast GUP fails for any reason
    when trying to "promote" the read fault to a writable mapping, and KVM
    resolves the write fault first, then KVM will end up trying to install a
    read-only SPTE (for a !map_writable fault) overtop a writable SPTE.

    Note, it's not entirely clear why fast GUP fails, or if that's even how
    KVM ends up with a !map_writable fault with a writable SPTE.  If something
    else is going awry, e.g. due to a bug in mmu_notifiers, then treating read
    faults as spurious in this scenario could effectively mask the underlying
    problem.

    However, retrying the faulting access instead of overwriting an existing
    SPTE is functionally correct and desirable irrespective of the WARN, and
    fast GUP _can_ legitimately fail with a writable VMA, e.g. if the Accessed
    bit in primary MMU's PTE is toggled and causes a PTE value mismatch.  The
    WARN was also recently added, specifically to track down scenarios where
    KVM is unnecessarily overwrites SPTEs, i.e. treating the fault as spurious
    doesn't regress KVM's bug-finding capabilities in any way.  In short,
    letting the WARN linger because there's a tiny chance it's due to a bug
    elsewhere would be excessively paranoid.

    Fixes: 1a175082b190 ("KVM: x86/mmu: WARN and flush if resolving a TDP MMU
fault clears MMU-writable")
    Reported-by: Lei Yang <leiyang@redhat.com>
    Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219588

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [Bug 219588] [6.13.0-rc2+]WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001 tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm]
  2024-12-11  5:41 [Bug 219588] New: [6.13.0-rc2+]WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001 tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm] bugzilla-daemon
                   ` (5 preceding siblings ...)
  2025-03-20 14:48 ` bugzilla-daemon
@ 2025-03-20 14:48 ` bugzilla-daemon
  6 siblings, 0 replies; 9+ messages in thread
From: bugzilla-daemon @ 2025-03-20 14:48 UTC (permalink / raw)
  To: kvm

https://bugzilla.kernel.org/show_bug.cgi?id=219588

leiyang@redhat.com changed:

           What    |Removed                     |Added
----------------------------------------------------------------------------
             Status|NEW                         |RESOLVED
         Resolution|---                         |CODE_FIX

-- 
You may reply to this email to add a comment.

You are receiving this mail because:
You are watching the assignee of the bug.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2025-03-20 14:48 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-12-11  5:41 [Bug 219588] New: [6.13.0-rc2+]WARNING: CPU: 52 PID: 12253 at arch/x86/kvm/mmu/tdp_mmu.c:1001 tdp_mmu_map_handle_target_level+0x1f0/0x310 [kvm] bugzilla-daemon
2024-12-11 16:12 ` Sean Christopherson
2024-12-11 16:12 ` [Bug 219588] " bugzilla-daemon
2024-12-16  5:42 ` bugzilla-daemon
2024-12-16 23:52   ` Sean Christopherson
2024-12-16 23:53 ` bugzilla-daemon
2024-12-17  9:03 ` bugzilla-daemon
2025-03-20 14:48 ` bugzilla-daemon
2025-03-20 14:48 ` bugzilla-daemon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox