From: Xiaoyao Li <xiaoyao.li@intel.com>
To: Paolo Bonzini <pbonzini@redhat.com>,
Tom Lendacky <thomas.lendacky@amd.com>,
qemu-devel@nongnu.org, kvm@vger.kernel.org
Cc: Marcelo Tosatti <mtosatti@redhat.com>,
Michael Roth <michael.roth@amd.com>
Subject: Re: [PATCH] i386/kvm: Prefault memory on page state change
Date: Tue, 3 Jun 2025 23:31:52 +0800 [thread overview]
Message-ID: <9962b691-056e-4ee0-8345-5b5657c5376b@intel.com> (raw)
In-Reply-To: <d0983ba3-383b-4c81-9cfd-b5b0d26a5d17@redhat.com>
On 6/3/2025 11:00 PM, Paolo Bonzini wrote:
> On 6/3/25 13:47, Xiaoyao Li wrote:
>> On 6/3/2025 3:41 PM, Xiaoyao Li wrote:
>>> On 3/29/2025 4:30 AM, Tom Lendacky wrote:
>>>> A page state change is typically followed by an access of the
>>>> page(s) and
>>>> results in another VMEXIT in order to map the page into the nested page
>>>> table. Depending on the size of page state change request, this can
>>>> generate a number of additional VMEXITs. For example, under SNP, when
>>>> Linux is utilizing lazy memory acceptance, memory is typically
>>>> accepted in
>>>> 4M chunks. A page state change request is submitted to mark the
>>>> pages as
>>>> private, followed by validation of the memory. Since the guest_memfd
>>>> currently only supports 4K pages, each page validation will result in
>>>> VMEXIT to map the page, resulting in 1024 additional exits.
>>>>
>>>> When performing a page state change, invoke KVM_PRE_FAULT_MEMORY for
>>>> the
>>>> size of the page state change in order to pre-map the pages and
>>>> avoid the
>>>> additional VMEXITs. This helps speed up boot times.
>>>
>>> Unfortunately, it breaks TDX guest.
>>>
>>> kvm_hc_map_gpa_range gpa 0x80000000 size 0x200000 attributes 0x0 flags 0x1
>>>
>>> A TDX guest uses MAPGPA to map the range [0x80000000, +0x200000] to
>>> shared. Calling KVM_PRE_FAULT_MEMORY on such a range leads to the TD
>>> being marked as bugged:
>>>
>>> [353467.266761] WARNING: CPU: 109 PID: 295970 at arch/x86/kvm/mmu/tdp_mmu.c:674 tdp_mmu_map_handle_target_level+0x301/0x460 [kvm]
>>
>> It turns out to be a KVM bug.
>>
>> The gpa passed in via KVM_PRE_FAULT_MEMORY, i.e., range->gpa, carries no
>> indication of shared vs. private. KVM passes range->gpa directly to
>> kvm_tdp_map_page() in kvm_arch_vcpu_pre_fault_memory(), where it is
>> assigned to fault.addr.
>>
>> However, fault.addr is supposed to be the gpa of a real guest access,
>> which for TDX means it must have the shared bit set when the mapping is
>> for shared access. tdp_mmu_get_root_for_fault() uses it to determine
>> which root to use.
>>
>> In this case, the pre-fault targets shared memory, but fault.addr
>> selects the mirror_root, which is for private memory. Thus it triggers
>> the KVM_BUG_ON().
> So this would fix it?
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 7b3f1783ab3c..66f96476fade 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4895,6 +4895,7 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> {
> u64 error_code = PFERR_GUEST_FINAL_MASK;
> u8 level = PG_LEVEL_4K;
> + u64 direct_bits;
> u64 end;
> int r;
>
> @@ -4909,15 +4910,18 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> if (r)
> return r;
>
> + direct_bits = 0;
> if (kvm_arch_has_private_mem(vcpu->kvm) &&
> kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(range->gpa)))
> error_code |= PFERR_PRIVATE_ACCESS;
> + else
> + direct_bits = kvm_gfn_direct_bits(vcpu->kvm);
should be
direct_bits = gfn_to_gpa(kvm_gfn_direct_bits(vcpu->kvm));
>
> /*
> * Shadow paging uses GVA for kvm page fault, so restrict to
> * two-dimensional paging.
> */
> - r = kvm_tdp_map_page(vcpu, range->gpa, error_code, &level);
> -    r = kvm_tdp_map_page(vcpu, range->gpa | direct_bits, error_code, &level);
> if (r < 0)
> return r;
>
>
>
> I'm applying Tom's patch to get it out of his queue, but will delay sending
> a pull request until the Linux-side fix is accepted.
With the above-mentioned change, the issue is fixed.

I synced with Yan offline, and our fix is:
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 52acf99d40a0..209103bf0f30 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -48,7 +48,7 @@ static inline enum kvm_tdp_mmu_root_types kvm_gfn_range_filter_to_root_types(str
 static inline struct kvm_mmu_page *tdp_mmu_get_root_for_fault(struct kvm_vcpu *vcpu,
							       struct kvm_page_fault *fault)
 {
-	if (unlikely(!kvm_is_addr_direct(vcpu->kvm, fault->addr)))
+	if (unlikely(fault->is_private))
 		return root_to_sp(vcpu->arch.mmu->mirror_root_hpa);
 	return root_to_sp(vcpu->arch.mmu->root.hpa);
> Paolo
>
>
Thread overview: 8+ messages
2025-03-28 20:30 [PATCH] i386/kvm: Prefault memory on page state change Tom Lendacky
2025-06-02 13:17 ` Tom Lendacky
2025-06-03 7:41 ` Xiaoyao Li
2025-06-03 11:47 ` Xiaoyao Li
2025-06-03 13:40 ` Tom Lendacky
2025-06-03 15:00 ` Paolo Bonzini
2025-06-03 15:31 ` Xiaoyao Li [this message]
2025-06-03 16:12 ` Xiaoyao Li