Re: [PATCH v16 14/22] KVM: x86/mmu: Enforce guest_memfd's max order when recovering hugepages

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Sean Christopherson <seanjc@google.com>
To: Ackerley Tng <ackerleytng@google.com>
Cc: Xiaoyao Li <xiaoyao.li@intel.com>, Fuad Tabba <tabba@google.com>,
	kvm@vger.kernel.org,  linux-arm-msm@vger.kernel.org,
	linux-mm@kvack.org, kvmarm@lists.linux.dev,  pbonzini@redhat.com,
	chenhuacai@kernel.org, mpe@ellerman.id.au,  anup@brainfault.org,
	paul.walmsley@sifive.com, palmer@dabbelt.com,
	 aou@eecs.berkeley.edu, viro@zeniv.linux.org.uk,
	brauner@kernel.org,  willy@infradead.org,
	akpm@linux-foundation.org, yilun.xu@intel.com,
	 chao.p.peng@linux.intel.com, jarkko@kernel.org,
	amoorthy@google.com,  dmatlack@google.com,
	isaku.yamahata@intel.com, mic@digikod.net,  vbabka@suse.cz,
	vannapurve@google.com, mail@maciej.szmigiero.name,
	 david@redhat.com, michael.roth@amd.com, wei.w.wang@intel.com,
	 liam.merwick@oracle.com, isaku.yamahata@gmail.com,
	 kirill.shutemov@linux.intel.com, suzuki.poulose@arm.com,
	steven.price@arm.com,  quic_eberman@quicinc.com,
	quic_mnalajal@quicinc.com, quic_tsoni@quicinc.com,
	 quic_svaddagi@quicinc.com, quic_cvanscha@quicinc.com,
	 quic_pderrin@quicinc.com, quic_pheragu@quicinc.com,
	catalin.marinas@arm.com,  james.morse@arm.com,
	yuzenghui@huawei.com, oliver.upton@linux.dev,  maz@kernel.org,
	will@kernel.org, qperret@google.com, keirf@google.com,
	 roypat@amazon.co.uk, shuah@kernel.org, hch@infradead.org,
	jgg@nvidia.com,  rientjes@google.com, jhubbard@nvidia.com,
	fvdl@google.com, hughd@google.com,  jthoughton@google.com,
	peterx@redhat.com, pankaj.gupta@amd.com,  ira.weiny@intel.com
Subject: Re: [PATCH v16 14/22] KVM: x86/mmu: Enforce guest_memfd's max order when recovering hugepages
Date: Fri, 25 Jul 2025 07:31:17 -0700	[thread overview]
Message-ID: <aIOVNcp7p2hU-YHM@google.com> (raw)
In-Reply-To: <diqz7bzxduyv.fsf@ackerleytng-ctop.c.googlers.com>

On Thu, Jul 24, 2025, Ackerley Tng wrote:
> Ackerley Tng <ackerleytng@google.com> writes:
> 
> > Sean Christopherson <seanjc@google.com> writes:
> >> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> >> index 20dd9f64156e..c4ff8b4028df 100644
> >> --- a/arch/x86/kvm/mmu/mmu.c
> >> +++ b/arch/x86/kvm/mmu/mmu.c
> >> @@ -3302,31 +3302,63 @@ static u8 kvm_max_level_for_order(int order)
> >>  	return PG_LEVEL_4K;
> >>  }
> >>  
> >> -static u8 kvm_max_private_mapping_level(struct kvm *kvm, kvm_pfn_t pfn,
> >> -					u8 max_level, int gmem_order)
> >> +static u8 kvm_max_private_mapping_level(struct kvm *kvm, struct kvm_page_fault *fault,
> >> +					const struct kvm_memory_slot *slot, gfn_t gfn)
> >
> > Would you consider renaming this kvm_max_gmem_mapping_level()? Or
> > something that doesn't limit the use of this function to private memory?

Heh, see the next patch, which does exactly that and is appropriately titled
"KVM: x86/mmu: Extend guest_memfd's max mapping level to shared mappings".

> >> -	u8 req_max_level;
> >> +	u8 max_level, coco_level;
> >> +	struct page *page;
> >> +	kvm_pfn_t pfn;
> >>  
> >> -	if (max_level == PG_LEVEL_4K)
> >> -		return PG_LEVEL_4K;
> >> +	/* For faults, use the gmem information that was resolved earlier. */
> >> +	if (fault) {
> >> +		pfn = fault->pfn;
> >> +		max_level = fault->max_level;
> >> +	} else {
> >> +		/* TODO: Constify the guest_memfd chain. */
> >> +		struct kvm_memory_slot *__slot = (struct kvm_memory_slot *)slot;
> >> +		int max_order, r;
> >> +
> >> +		r = kvm_gmem_get_pfn(kvm, __slot, gfn, &pfn, &page, &max_order);
> >> +		if (r)
> >> +			return PG_LEVEL_4K;
> >> +
> >> +		if (page)
> >> +			put_page(page);
> >
> > When I was working on this, I added a kvm_gmem_mapping_order() [1] where
> > guest_memfd could return the order that this gfn would be allocated at
> > without actually doing the allocation. Is it okay that an
> > allocation may be performed here?

No, it's not.  From a guest_memfd semantics perspective, it'd be ok.  But allocating
can block, and mmu_lock is held here.

I routed this through kvm_gmem_get_pfn(), because for this code to do the right
thing, KVM needs to the PFN.  That could be plumbed in from the existing SPTE, but
I don't love the idea of potentially mixing the gmem order for pfn X with pfn Y
from the SPTE, e.g. if the gmem backing has changed and an invalidation  is pending.

KVM kinda sorta has such races with non-gmem memory, but for non-gmem KVM will never
fully consume a "bad" PFN, whereas for this path, KVM could (at least in theory)
immediately consume the pfn via an RMP lookup.  Which is probably fine?  but I
don't love it.

I assume getting the order will basically get the page/pfn as well, so plumbing
in the pfn from the SPTE, *knowing* that it could be stale, feels all kinds of
wrong.

I also don't want to effectively speculatively add kvm_gmem_mapping_order() or
expand kvm_gmem_get_pfn(), e.g. to say "no create", so what if we just do this?

	/* For faults, use the gmem information that was resolved earlier. */
	if (fault) {
		pfn = fault->pfn;
		max_level = fault->max_level;
	} else {
		/* TODO: Call into guest_memfd once hugepages are supported. */
		pfn = KVM_PFN_ERR_FAULT;
		max_level = PG_LEVEL_4K;
	}

	if (max_level == PG_LEVEL_4K)
		return max_level;

or alternatively:

	/* For faults, use the gmem information that was resolved earlier. */
	if (fault) {
		pfn = fault->pfn;
		max_level = fault->max_level;
	} else {
		/* TODO: Call into guest_memfd once hugepages are supported. */
		return PG_LEVEL_4K;
	}

	if (max_level == PG_LEVEL_4K)
		return max_level;

Functionally, it's 100% safe, even if/when guest_memfd supports hugepages.  E.g.
if we fail/forget to update this code, the worst case scneario is that KVM will
neglect to recover hugepages.

While it's kinda weird/silly, I'm leaning toward the first option of setting
max_level and relying on the common "max_level == PG_LEVEL_4K" check to avoid
doing an RMP looking with KVM_PFN_ERR_FAULT.  I like that it helps visually
captures that KVM needs to get both the max_level *and* the pfn from guest_memfd.

> > [1] https://lore.kernel.org/all/20250717162731.446579-13-tabba@google.com/
> >
> >> +
> >> +		max_level = kvm_max_level_for_order(max_order);
> >> +	}
> >>  
> >> -	max_level = min(kvm_max_level_for_order(gmem_order), max_level);
> >>  	if (max_level == PG_LEVEL_4K)
> >> -		return PG_LEVEL_4K;
> >> +		return max_level;
> >
> > I think the above line is a git-introduced issue, there probably
> > shouldn't be a return here.
> >
> 
> My bad, this is a correct short-circuiting of the rest of the function
> since there's no smaller PG_LEVEL than PG_LEVEL_4K.

Off topic: please trim your replies.

next prev parent reply	other threads:[~2025-07-25 14:31 UTC|newest]

Thread overview: 62+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-23 10:46 [PATCH v16 00/22] KVM: Enable host userspace mapping for guest_memfd-backed memory for non-CoCo VMs Fuad Tabba
2025-07-23 10:46 ` [PATCH v16 01/22] KVM: Rename CONFIG_KVM_PRIVATE_MEM to CONFIG_KVM_GUEST_MEMFD Fuad Tabba
2025-07-23 10:46 ` [PATCH v16 02/22] KVM: x86: Have all vendor neutral sub-configs depend on KVM_X86, not just KVM Fuad Tabba
2025-07-23 13:06   ` Xiaoyao Li
2025-07-23 13:13   ` David Hildenbrand
2025-07-23 10:46 ` [PATCH v16 03/22] KVM: x86: Select KVM_GENERIC_PRIVATE_MEM directly from KVM_SW_PROTECTED_VM Fuad Tabba
2025-07-23 13:13   ` David Hildenbrand
2025-07-23 13:17   ` Xiaoyao Li
2025-07-23 10:46 ` [PATCH v16 04/22] KVM: x86: Select TDX's KVM_GENERIC_xxx dependencies iff CONFIG_KVM_INTEL_TDX=y Fuad Tabba
2025-07-23 13:14   ` David Hildenbrand
2025-07-23 13:22   ` Xiaoyao Li
2025-07-24 22:35     ` Sean Christopherson
2025-07-23 10:46 ` [PATCH v16 05/22] KVM: Rename CONFIG_KVM_GENERIC_PRIVATE_MEM to CONFIG_HAVE_KVM_ARCH_GMEM_POPULATE Fuad Tabba
2025-07-23 13:27   ` Xiaoyao Li
2025-07-24 22:41     ` Sean Christopherson
2025-07-25 15:13       ` Xiaoyao Li
2025-07-23 10:46 ` [PATCH v16 06/22] KVM: Rename kvm_slot_can_be_private() to kvm_slot_has_gmem() Fuad Tabba
2025-07-23 10:46 ` [PATCH v16 07/22] KVM: Fix comments that refer to slots_lock Fuad Tabba
2025-07-23 10:47 ` [PATCH v16 08/22] KVM: Fix comment that refers to kvm uapi header path Fuad Tabba
2025-07-23 10:47 ` [PATCH v16 09/22] KVM: x86: Enable KVM_GUEST_MEMFD for all 64-bit builds Fuad Tabba
2025-07-23 13:17   ` David Hildenbrand
2025-07-23 13:42   ` Xiaoyao Li
2025-07-23 10:47 ` [PATCH v16 10/22] KVM: guest_memfd: Add plumbing to host to map guest_memfd pages Fuad Tabba
2025-07-23 14:03   ` Xiaoyao Li
2025-07-24 22:33     ` Sean Christopherson
2025-07-23 10:47 ` [PATCH v16 11/22] KVM: guest_memfd: Track guest_memfd mmap support in memslot Fuad Tabba
2025-07-23 10:47 ` [PATCH v16 12/22] KVM: x86/mmu: Rename .private_max_mapping_level() to .gmem_max_mapping_level() Fuad Tabba
2025-07-23 10:47 ` [PATCH v16 13/22] KVM: x86/mmu: Hoist guest_memfd max level/order helpers "up" in mmu.c Fuad Tabba
2025-07-23 13:51   ` Xiaoyao Li
2025-07-24 23:03   ` Ackerley Tng
2025-07-24 23:04   ` Ackerley Tng
2025-07-23 10:47 ` [PATCH v16 14/22] KVM: x86/mmu: Enforce guest_memfd's max order when recovering hugepages Fuad Tabba
2025-07-23 13:55   ` Xiaoyao Li
2025-07-24 22:32     ` Sean Christopherson
2025-07-24 23:21       ` Ackerley Tng
2025-07-24 23:34         ` Ackerley Tng
2025-07-25 14:31           ` Sean Christopherson [this message]
2025-07-25 17:24             ` Sean Christopherson
2025-07-25 19:16               ` Ackerley Tng
2025-07-23 10:47 ` [PATCH v16 15/22] KVM: x86/mmu: Extend guest_memfd's max mapping level to shared mappings Fuad Tabba
2025-07-24 23:31   ` Ackerley Tng
2025-07-25 13:53     ` Sean Christopherson
2025-07-25 16:40       ` Ackerley Tng
2025-07-25 17:13         ` Sean Christopherson
2025-07-25 19:34           ` Ackerley Tng
2025-07-25 19:52             ` Sean Christopherson
2025-07-25 21:31               ` Ackerley Tng
2025-07-25 22:01                 ` Sean Christopherson
2025-07-25 22:25                   ` Ackerley Tng
2025-07-23 10:47 ` [PATCH v16 16/22] KVM: arm64: Refactor user_mem_abort() Fuad Tabba
2025-07-23 10:47 ` [PATCH v16 17/22] KVM: arm64: Handle guest_memfd-backed guest page faults Fuad Tabba
2025-07-23 10:47 ` [PATCH v16 18/22] KVM: arm64: nv: Handle VNCR_EL2-triggered faults backed by guest_memfd Fuad Tabba
2025-07-23 10:47 ` [PATCH v16 19/22] KVM: arm64: Enable support for guest_memfd backed memory Fuad Tabba
2025-07-23 10:47 ` [PATCH v16 20/22] KVM: Allow and advertise support for host mmap() on guest_memfd files Fuad Tabba
2025-07-23 10:47 ` [PATCH v16 21/22] KVM: selftests: Do not use hardcoded page sizes in guest_memfd test Fuad Tabba
2025-07-23 10:47 ` [PATCH v16 22/22] KVM: selftests: guest_memfd mmap() test when mmap is supported Fuad Tabba
2025-07-24 22:15   ` Sean Christopherson
2025-07-28  7:00     ` Fuad Tabba
2025-07-24 22:44 ` [PATCH v16 00/22] KVM: Enable host userspace mapping for guest_memfd-backed memory for non-CoCo VMs Sean Christopherson
2025-07-24 23:46 ` Ackerley Tng
2025-07-25 14:56   ` Sean Christopherson
2025-07-28  7:05     ` Fuad Tabba

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aIOVNcp7p2hU-YHM@google.com \
    --to=seanjc@google.com \
    --cc=ackerleytng@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=amoorthy@google.com \
    --cc=anup@brainfault.org \
    --cc=aou@eecs.berkeley.edu \
    --cc=brauner@kernel.org \
    --cc=catalin.marinas@arm.com \
    --cc=chao.p.peng@linux.intel.com \
    --cc=chenhuacai@kernel.org \
    --cc=david@redhat.com \
    --cc=dmatlack@google.com \
    --cc=fvdl@google.com \
    --cc=hch@infradead.org \
    --cc=hughd@google.com \
    --cc=ira.weiny@intel.com \
    --cc=isaku.yamahata@gmail.com \
    --cc=isaku.yamahata@intel.com \
    --cc=james.morse@arm.com \
    --cc=jarkko@kernel.org \
    --cc=jgg@nvidia.com \
    --cc=jhubbard@nvidia.com \
    --cc=jthoughton@google.com \
    --cc=keirf@google.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=kvm@vger.kernel.org \
    --cc=kvmarm@lists.linux.dev \
    --cc=liam.merwick@oracle.com \
    --cc=linux-arm-msm@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=mail@maciej.szmigiero.name \
    --cc=maz@kernel.org \
    --cc=mic@digikod.net \
    --cc=michael.roth@amd.com \
    --cc=mpe@ellerman.id.au \
    --cc=oliver.upton@linux.dev \
    --cc=palmer@dabbelt.com \
    --cc=pankaj.gupta@amd.com \
    --cc=paul.walmsley@sifive.com \
    --cc=pbonzini@redhat.com \
    --cc=peterx@redhat.com \
    --cc=qperret@google.com \
    --cc=quic_cvanscha@quicinc.com \
    --cc=quic_eberman@quicinc.com \
    --cc=quic_mnalajal@quicinc.com \
    --cc=quic_pderrin@quicinc.com \
    --cc=quic_pheragu@quicinc.com \
    --cc=quic_svaddagi@quicinc.com \
    --cc=quic_tsoni@quicinc.com \
    --cc=rientjes@google.com \
    --cc=roypat@amazon.co.uk \
    --cc=shuah@kernel.org \
    --cc=steven.price@arm.com \
    --cc=suzuki.poulose@arm.com \
    --cc=tabba@google.com \
    --cc=vannapurve@google.com \
    --cc=vbabka@suse.cz \
    --cc=viro@zeniv.linux.org.uk \
    --cc=wei.w.wang@intel.com \
    --cc=will@kernel.org \
    --cc=willy@infradead.org \
    --cc=xiaoyao.li@intel.com \
    --cc=yilun.xu@intel.com \
    --cc=yuzenghui@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.