Re: [PATCH v17 14/24] KVM: x86/mmu: Enforce guest_memfd's max order when recovering hugepages

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Xiaoyao Li <xiaoyao.li@intel.com>
To: Sean Christopherson <seanjc@google.com>,
	Paolo Bonzini <pbonzini@redhat.com>,
	Marc Zyngier <maz@kernel.org>,
	Oliver Upton <oliver.upton@linux.dev>
Cc: kvm@vger.kernel.org, linux-arm-kernel@lists.infradead.org,
	kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org,
	Ira Weiny <ira.weiny@intel.com>, Gavin Shan <gshan@redhat.com>,
	Shivank Garg <shivankg@amd.com>, Vlastimil Babka <vbabka@suse.cz>,
	David Hildenbrand <david@redhat.com>,
	Fuad Tabba <tabba@google.com>,
	Ackerley Tng <ackerleytng@google.com>,
	Tao Chan <chentao@kylinos.cn>,
	James Houghton <jthoughton@google.com>
Subject: Re: [PATCH v17 14/24] KVM: x86/mmu: Enforce guest_memfd's max order when recovering hugepages
Date: Wed, 30 Jul 2025 15:33:51 +0800	[thread overview]
Message-ID: <e67b0825-abcb-4bac-bc3c-2e8b513f1d57@intel.com> (raw)
In-Reply-To: <20250729225455.670324-15-seanjc@google.com>

On 7/30/2025 6:54 AM, Sean Christopherson wrote:
> Rework kvm_mmu_max_mapping_level() to provide the plumbing to consult
> guest_memfd (and relevant vendor code) when recovering hugepages, e.g.
> after disabling live migration.  The flaw has existed since guest_memfd was
> originally added, but has gone unnoticed due to lack of guest_memfd support
> for hugepages or dirty logging.
> 
> Don't actually call into guest_memfd at this time, as it's unclear as to
> what the API should be.  Ideally, KVM would simply use kvm_gmem_get_pfn(),
> but invoking kvm_gmem_get_pfn() would lead to sleeping in atomic context
> if guest_memfd needed to allocate memory (mmu_lock is held).  Luckily,
> the path isn't actually reachable, so just add a TODO and WARN to ensure
> the functionality is added alongisde guest_memfd hugepage support, and
> punt the guest_memfd API design question to the future.
> 
> Note, calling kvm_mem_is_private() in the non-fault path is safe, so long
> as mmu_lock is held, as hugepage recovery operates on shadow-present SPTEs,
> i.e. calling kvm_mmu_max_mapping_level() with @fault=NULL is mutually
> exclusive with kvm_vm_set_mem_attributes() changing the PRIVATE attribute
> of the gfn.
> 
> Signed-off-by: Sean Christopherson <seanjc@google.com>
> ---
>   arch/x86/kvm/mmu/mmu.c          | 82 +++++++++++++++++++--------------
>   arch/x86/kvm/mmu/mmu_internal.h |  2 +-
>   arch/x86/kvm/mmu/tdp_mmu.c      |  2 +-
>   3 files changed, 49 insertions(+), 37 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index 20dd9f64156e..61eb9f723675 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -3302,31 +3302,54 @@ static u8 kvm_max_level_for_order(int order)
>   	return PG_LEVEL_4K;
>   }
>   
> -static u8 kvm_max_private_mapping_level(struct kvm *kvm, kvm_pfn_t pfn,
> -					u8 max_level, int gmem_order)
> +static u8 kvm_max_private_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault,
> +					const struct kvm_memory_slot *slot, gfn_t gfn)

I don't see why slot and gfn are needed here. Just to keep consistent 
with host_pfn_mapping_level()?

>   {
> -	u8 req_max_level;
> +	u8 max_level, coco_level;
> +	kvm_pfn_t pfn;
>   
> -	if (max_level == PG_LEVEL_4K)
> -		return PG_LEVEL_4K;
> +	/* For faults, use the gmem information that was resolved earlier. */
> +	if (fault) {
> +		pfn = fault->pfn;
> +		max_level = fault->max_level;
> +	} else {
> +		/* TODO: Call into guest_memfd once hugepages are supported. */
> +		WARN_ONCE(1, "Get pfn+order from guest_memfd");
> +		pfn = KVM_PFN_ERR_FAULT;
> +		max_level = PG_LEVEL_4K;
> +	}
>   
> -	max_level = min(kvm_max_level_for_order(gmem_order), max_level);
>   	if (max_level == PG_LEVEL_4K)
> -		return PG_LEVEL_4K;
> +		return max_level;
>   
> -	req_max_level = kvm_x86_call(gmem_max_mapping_level)(kvm, pfn);
> -	if (req_max_level)
> -		max_level = min(max_level, req_max_level);
> +	/*
> +	 * CoCo may influence the max mapping level, e.g. due to RMP or S-EPT
> +	 * restrictions.  A return of '0' means "no additional restrictions", to
> +	 * allow for using an optional "ret0" static call.
> +	 */
> +	coco_level = kvm_x86_call(gmem_max_mapping_level)(kvm, pfn);
> +	if (coco_level)
> +		max_level = min(max_level, coco_level);
>   
>   	return max_level;
>   }
>   
> -static int __kvm_mmu_max_mapping_level(struct kvm *kvm,
> -				       const struct kvm_memory_slot *slot,
> -				       gfn_t gfn, int max_level, bool is_private)
> +int kvm_mmu_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault,
> +			      const struct kvm_memory_slot *slot, gfn_t gfn)
>   {
>   	struct kvm_lpage_info *linfo;
> -	int host_level;
> +	int host_level, max_level;
> +	bool is_private;
> +
> +	lockdep_assert_held(&kvm->mmu_lock);
> +
> +	if (fault) {
> +		max_level = fault->max_level;
> +		is_private = fault->is_private;
> +	} else {
> +		max_level = PG_LEVEL_NUM;
> +		is_private = kvm_mem_is_private(kvm, gfn);
> +	}
>   
>   	max_level = min(max_level, max_huge_page_level);
>   	for ( ; max_level > PG_LEVEL_4K; max_level--) {
> @@ -3335,25 +3358,16 @@ static int __kvm_mmu_max_mapping_level(struct kvm *kvm,
>   			break;
>   	}
>   
> +	if (max_level == PG_LEVEL_4K)
> +		return PG_LEVEL_4K;
> +
>   	if (is_private)
> -		return max_level;
> -
> -	if (max_level == PG_LEVEL_4K)
> -		return PG_LEVEL_4K;
> -
> -	host_level = host_pfn_mapping_level(kvm, gfn, slot);
> +		host_level = kvm_max_private_mapping_level(kvm, fault, slot, gfn);
> +	else
> +		host_level = host_pfn_mapping_level(kvm, gfn, slot);
>   	return min(host_level, max_level);
>   }
>   
> -int kvm_mmu_max_mapping_level(struct kvm *kvm,
> -			      const struct kvm_memory_slot *slot, gfn_t gfn)
> -{
> -	bool is_private = kvm_slot_has_gmem(slot) &&
> -			  kvm_mem_is_private(kvm, gfn);
> -
> -	return __kvm_mmu_max_mapping_level(kvm, slot, gfn, PG_LEVEL_NUM, is_private);
> -}
> -
>   void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault)
>   {
>   	struct kvm_memory_slot *slot = fault->slot;
> @@ -3374,9 +3388,8 @@ void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>   	 * Enforce the iTLB multihit workaround after capturing the requested
>   	 * level, which will be used to do precise, accurate accounting.
>   	 */
> -	fault->req_level = __kvm_mmu_max_mapping_level(vcpu->kvm, slot,
> -						       fault->gfn, fault->max_level,
> -						       fault->is_private);
> +	fault->req_level = kvm_mmu_max_mapping_level(vcpu->kvm, fault,
> +						     fault->slot, fault->gfn);
>   	if (fault->req_level == PG_LEVEL_4K || fault->huge_page_disallowed)
>   		return;
>   
> @@ -4564,8 +4577,7 @@ static int kvm_mmu_faultin_pfn_private(struct kvm_vcpu *vcpu,
>   	}
>   
>   	fault->map_writable = !(fault->slot->flags & KVM_MEM_READONLY);
> -	fault->max_level = kvm_max_private_mapping_level(vcpu->kvm, fault->pfn,
> -							 fault->max_level, max_order);
> +	fault->max_level = kvm_max_level_for_order(max_order);
>   
>   	return RET_PF_CONTINUE;
>   }
> @@ -7165,7 +7177,7 @@ static bool kvm_mmu_zap_collapsible_spte(struct kvm *kvm,
>   		 * mapping if the indirect sp has level = 1.
>   		 */
>   		if (sp->role.direct &&
> -		    sp->role.level < kvm_mmu_max_mapping_level(kvm, slot, sp->gfn)) {
> +		    sp->role.level < kvm_mmu_max_mapping_level(kvm, NULL, slot, sp->gfn)) {
>   			kvm_zap_one_rmap_spte(kvm, rmap_head, sptep);
>   
>   			if (kvm_available_flush_remote_tlbs_range())
> diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
> index 65f3c89d7c5d..b776be783a2f 100644
> --- a/arch/x86/kvm/mmu/mmu_internal.h
> +++ b/arch/x86/kvm/mmu/mmu_internal.h
> @@ -411,7 +411,7 @@ static inline int kvm_mmu_do_page_fault(struct kvm_vcpu *vcpu, gpa_t cr2_or_gpa,
>   	return r;
>   }
>   
> -int kvm_mmu_max_mapping_level(struct kvm *kvm,
> +int kvm_mmu_max_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault,
>   			      const struct kvm_memory_slot *slot, gfn_t gfn);
>   void kvm_mmu_hugepage_adjust(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault);
>   void disallowed_hugepage_adjust(struct kvm_page_fault *fault, u64 spte, int cur_level);
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index 7f3d7229b2c1..740cb06accdb 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1813,7 +1813,7 @@ static void recover_huge_pages_range(struct kvm *kvm,
>   		if (iter.gfn < start || iter.gfn >= end)
>   			continue;
>   
> -		max_mapping_level = kvm_mmu_max_mapping_level(kvm, slot, iter.gfn);
> +		max_mapping_level = kvm_mmu_max_mapping_level(kvm, NULL, slot, iter.gfn);
>   		if (max_mapping_level < iter.level)
>   			continue;
>

next prev parent reply	other threads:[~2025-07-30  7:34 UTC|newest]

Thread overview: 54+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-29 22:54 [PATCH v17 00/24] KVM: Enable mmap() for guest_memfd Sean Christopherson
2025-07-29 22:54 ` [PATCH v17 01/24] KVM: Rename CONFIG_KVM_PRIVATE_MEM to CONFIG_KVM_GUEST_MEMFD Sean Christopherson
2025-07-29 22:54 ` [PATCH v17 02/24] KVM: x86: Have all vendor neutral sub-configs depend on KVM_X86, not just KVM Sean Christopherson
2025-07-31  8:08   ` Fuad Tabba
2025-07-29 22:54 ` [PATCH v17 03/24] KVM: x86: Select KVM_GENERIC_PRIVATE_MEM directly from KVM_SW_PROTECTED_VM Sean Christopherson
2025-07-31  8:08   ` Fuad Tabba
2025-07-29 22:54 ` [PATCH v17 04/24] KVM: x86: Select TDX's KVM_GENERIC_xxx dependencies iff CONFIG_KVM_INTEL_TDX=y Sean Christopherson
2025-07-31  8:07   ` Fuad Tabba
2025-07-29 22:54 ` [PATCH v17 05/24] KVM: Rename CONFIG_KVM_GENERIC_PRIVATE_MEM to CONFIG_HAVE_KVM_ARCH_GMEM_POPULATE Sean Christopherson
2025-07-29 22:54 ` [PATCH v17 06/24] KVM: Rename kvm_slot_can_be_private() to kvm_slot_has_gmem() Sean Christopherson
2025-07-29 22:54 ` [PATCH v17 07/24] KVM: Fix comments that refer to slots_lock Sean Christopherson
2025-07-29 22:54 ` [PATCH v17 08/24] KVM: Fix comment that refers to kvm uapi header path Sean Christopherson
2025-07-29 22:54 ` [PATCH v17 09/24] KVM: x86: Enable KVM_GUEST_MEMFD for all 64-bit builds Sean Christopherson
2025-07-29 22:54 ` [PATCH v17 10/24] KVM: guest_memfd: Add plumbing to host to map guest_memfd pages Sean Christopherson
2025-07-29 22:54 ` [PATCH v17 11/24] KVM: guest_memfd: Track guest_memfd mmap support in memslot Sean Christopherson
2025-07-29 22:54 ` [PATCH v17 12/24] KVM: x86/mmu: Rename .private_max_mapping_level() to .gmem_max_mapping_level() Sean Christopherson
2025-07-31  8:15   ` Fuad Tabba
2025-07-31  8:29     ` David Hildenbrand
2025-07-31  8:33       ` Fuad Tabba
2025-07-29 22:54 ` [PATCH v17 13/24] KVM: x86/mmu: Hoist guest_memfd max level/order helpers "up" in mmu.c Sean Christopherson
2025-07-31  7:59   ` David Hildenbrand
2025-07-31  8:06   ` Fuad Tabba
2025-07-29 22:54 ` [PATCH v17 14/24] KVM: x86/mmu: Enforce guest_memfd's max order when recovering hugepages Sean Christopherson
2025-07-30  7:33   ` Xiaoyao Li [this message]
2025-07-31  8:06     ` David Hildenbrand
2025-07-31  8:10   ` Fuad Tabba
2025-07-29 22:54 ` [PATCH v17 15/24] KVM: x86/mmu: Extend guest_memfd's max mapping level to shared mappings Sean Christopherson
2025-07-30  7:36   ` Xiaoyao Li
2025-07-31  8:01   ` David Hildenbrand
2025-07-31  8:05   ` Fuad Tabba
2025-07-29 22:54 ` [PATCH v17 16/24] KVM: x86/mmu: Handle guest page faults for guest_memfd with shared memory Sean Christopherson
2025-07-30  7:37   ` Xiaoyao Li
2025-07-29 22:54 ` [PATCH v17 17/24] KVM: arm64: Refactor user_mem_abort() Sean Christopherson
2025-07-29 22:54 ` [PATCH v17 18/24] KVM: arm64: Handle guest_memfd-backed guest page faults Sean Christopherson
2025-07-29 22:54 ` [PATCH v17 19/24] KVM: arm64: nv: Handle VNCR_EL2-triggered faults backed by guest_memfd Sean Christopherson
2025-07-29 22:54 ` [PATCH v17 20/24] KVM: arm64: Enable support for guest_memfd backed memory Sean Christopherson
2025-07-29 22:54 ` [PATCH v17 21/24] KVM: Allow and advertise support for host mmap() on guest_memfd files Sean Christopherson
2025-07-29 22:54 ` [PATCH v17 22/24] KVM: selftests: Do not use hardcoded page sizes in guest_memfd test Sean Christopherson
2025-07-30 11:04   ` Xiaoyao Li
2025-07-29 22:54 ` [PATCH v17 23/24] KVM: selftests: guest_memfd mmap() test when mmap is supported Sean Christopherson
2025-07-30 11:39   ` Xiaoyao Li
2025-07-30 12:57     ` Sean Christopherson
2025-07-31  7:49       ` Xiaoyao Li
2025-08-07  8:12   ` Shivank Garg
2025-07-29 22:54 ` [PATCH v17 24/24] KVM: selftests: Add guest_memfd testcase to fault-in on !mmap()'d memory Sean Christopherson
2025-07-30  8:20   ` Xiaoyao Li
2025-07-30 15:51   ` Fuad Tabba
2025-07-30 21:34 ` [PATCH v17 00/24] KVM: Enable mmap() for guest_memfd Ackerley Tng
2025-07-30 22:44   ` Ackerley Tng
2025-08-27  8:43 ` Paolo Bonzini
2025-08-27 12:57   ` Sean Christopherson
2025-08-27 13:08   ` Marc Zyngier
2025-08-27 13:11     ` Paolo Bonzini
2025-08-27 13:14       ` Marc Zyngier

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e67b0825-abcb-4bac-bc3c-2e8b513f1d57@intel.com \
    --to=xiaoyao.li@intel.com \
    --cc=ackerleytng@google.com \
    --cc=chentao@kylinos.cn \
    --cc=david@redhat.com \
    --cc=gshan@redhat.com \
    --cc=ira.weiny@intel.com \
    --cc=jthoughton@google.com \
    --cc=kvm@vger.kernel.org \
    --cc=kvmarm@lists.linux.dev \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=maz@kernel.org \
    --cc=oliver.upton@linux.dev \
    --cc=pbonzini@redhat.com \
    --cc=seanjc@google.com \
    --cc=shivankg@amd.com \
    --cc=tabba@google.com \
    --cc=vbabka@suse.cz \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).