From: Xiaoyao Li <xiaoyao.li@intel.com>
To: Paolo Bonzini <pbonzini@redhat.com>,
Tom Lendacky <thomas.lendacky@amd.com>,
qemu-devel@nongnu.org, kvm@vger.kernel.org
Cc: Marcelo Tosatti <mtosatti@redhat.com>,
Michael Roth <michael.roth@amd.com>
Subject: Re: [PATCH] i386/kvm: Prefault memory on page state change
Date: Tue, 3 Jun 2025 23:31:52 +0800 [thread overview]
Message-ID: <9962b691-056e-4ee0-8345-5b5657c5376b@intel.com> (raw)
In-Reply-To: <d0983ba3-383b-4c81-9cfd-b5b0d26a5d17@redhat.com>
On 6/3/2025 11:00 PM, Paolo Bonzini wrote:
> On 6/3/25 13:47, Xiaoyao Li wrote:
>> On 6/3/2025 3:41 PM, Xiaoyao Li wrote:
>>> On 3/29/2025 4:30 AM, Tom Lendacky wrote:
>>>> A page state change is typically followed by an access of the
>>>> page(s) and
>>>> results in another VMEXIT in order to map the page into the nested page
>>>> table. Depending on the size of page state change request, this can
>>>> generate a number of additional VMEXITs. For example, under SNP, when
>>>> Linux is utilizing lazy memory acceptance, memory is typically
>>>> accepted in
>>>> 4M chunks. A page state change request is submitted to mark the
>>>> pages as
>>>> private, followed by validation of the memory. Since the guest_memfd
>>>> currently only supports 4K pages, each page validation will result in
>>>> VMEXIT to map the page, resulting in 1024 additional exits.
>>>>
>>>> When performing a page state change, invoke KVM_PRE_FAULT_MEMORY for
>>>> the
>>>> size of the page state change in order to pre-map the pages and
>>>> avoid the
>>>> additional VMEXITs. This helps speed up boot times.
>>>
>>> Unfortunately, it breaks TDX guest.
>>>
>>> kvm_hc_map_gpa_range gpa 0x80000000 size 0x200000 attributes 0x0 flags 0x1
>>>
>>> A TDX guest uses MAPGPA to map the range [0x80000000, +0x200000] to
>>> shared. Calling KVM_PRE_FAULT_MEMORY on such a range leads to the TD
>>> being marked as bugged:
>>>
>>> [353467.266761] WARNING: CPU: 109 PID: 295970 at arch/x86/kvm/mmu/tdp_mmu.c:674 tdp_mmu_map_handle_target_level+0x301/0x460 [kvm]
>>
>> It turns out to be a KVM bug.
>>
>> The gpa passed in via KVM_PRE_FAULT_MEMORY, i.e., range->gpa, carries no
>> indication of shared vs. private. KVM passes range->gpa directly to
>> kvm_tdp_map_page() in kvm_arch_vcpu_pre_fault_memory(), where it is
>> assigned to fault.addr.
>>
>> However, fault.addr is supposed to be the gpa of a real guest access,
>> which for TDX means it must have the shared bit set when the mapping is
>> for shared access. tdp_mmu_get_root_for_fault() uses it to determine
>> which root to use.
>>
>> In this case, the pre-fault targets shared memory, but fault.addr
>> selects the mirror_root, which is for private memory. Thus it triggers
>> the KVM_BUG_ON().
> So this would fix it?
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 7b3f1783ab3c..66f96476fade 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -4895,6 +4895,7 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> {
> u64 error_code = PFERR_GUEST_FINAL_MASK;
> u8 level = PG_LEVEL_4K;
> + u64 direct_bits;
> u64 end;
> int r;
>
> @@ -4909,15 +4910,18 @@ long kvm_arch_vcpu_pre_fault_memory(struct kvm_vcpu *vcpu,
> if (r)
> return r;
>
> + direct_bits = 0;
> if (kvm_arch_has_private_mem(vcpu->kvm) &&
> kvm_mem_is_private(vcpu->kvm, gpa_to_gfn(range->gpa)))
> error_code |= PFERR_PRIVATE_ACCESS;
> + else
> + direct_bits = kvm_gfn_direct_bits(vcpu->kvm);
should be
direct_bits = gfn_to_gpa(kvm_gfn_direct_bits(vcpu->kvm));
>
> /*
> * Shadow paging uses GVA for kvm page fault, so restrict to
> * two-dimensional paging.
> */
> - r = kvm_tdp_map_page(vcpu, range->gpa, error_code, &level);
> -    r = kvm_tdp_map_page(vcpu, range->gpa | direct_bits, error_code, &level);
> if (r < 0)
> return r;
>
>
>
> I'm applying Tom's patch to get it out of his queue, but will delay sending
> a pull request until the Linux-side fix is accepted.
With the above-mentioned change, the issue is fixed.

I synced with Yan offline, and our fix is:
diff --git a/arch/x86/kvm/mmu/tdp_mmu.h b/arch/x86/kvm/mmu/tdp_mmu.h
index 52acf99d40a0..209103bf0f30 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.h
+++ b/arch/x86/kvm/mmu/tdp_mmu.h
@@ -48,7 +48,7 @@ static inline enum kvm_tdp_mmu_root_types kvm_gfn_range_filter_to_root_types(str
 static inline struct kvm_mmu_page *tdp_mmu_get_root_for_fault(struct kvm_vcpu *vcpu,
							       struct kvm_page_fault *fault)
 {
-	if (unlikely(!kvm_is_addr_direct(vcpu->kvm, fault->addr)))
+	if (unlikely(fault->is_private))
 		return root_to_sp(vcpu->arch.mmu->mirror_root_hpa);
 	return root_to_sp(vcpu->arch.mmu->root.hpa);
> Paolo
>
>
Thread overview: 8+ messages
2025-03-28 20:30 [PATCH] i386/kvm: Prefault memory on page state change Tom Lendacky
2025-06-02 13:17 ` Tom Lendacky
2025-06-03 7:41 ` Xiaoyao Li
2025-06-03 11:47 ` Xiaoyao Li
2025-06-03 13:40 ` Tom Lendacky
2025-06-03 15:00 ` Paolo Bonzini
2025-06-03 15:31 ` Xiaoyao Li [this message]
2025-06-03 16:12 ` Xiaoyao Li