From: Yan Zhao <yan.y.zhao@intel.com>
To: Ackerley Tng <ackerleytng@google.com>
Cc: <kvm@vger.kernel.org>, <linux-mm@kvack.org>,
	<linux-kernel@vger.kernel.org>, <x86@kernel.org>,
	<linux-fsdevel@vger.kernel.org>, <aik@amd.com>,
	<ajones@ventanamicro.com>, <akpm@linux-foundation.org>,
	<amoorthy@google.com>, <anthony.yznaga@oracle.com>,
	<anup@brainfault.org>, <aou@eecs.berkeley.edu>,
	<bfoster@redhat.com>, <binbin.wu@linux.intel.com>,
	<brauner@kernel.org>, <catalin.marinas@arm.com>,
	<chao.p.peng@intel.com>, <chenhuacai@kernel.org>,
	<dave.hansen@intel.com>, <david@redhat.com>,
	<dmatlack@google.com>, <dwmw@amazon.co.uk>,
	<erdemaktas@google.com>, <fan.du@intel.com>, <fvdl@google.com>,
	<graf@amazon.com>, <haibo1.xu@intel.com>, <hch@infradead.org>,
	<hughd@google.com>, <ira.weiny@intel.com>,
	<isaku.yamahata@intel.com>, <jack@suse.cz>, <james.morse@arm.com>,
	<jarkko@kernel.org>, <jgg@ziepe.ca>, <jgowans@amazon.com>,
	<jhubbard@nvidia.com>, <jroedel@suse.de>, <jthoughton@google.com>,
	<jun.miao@intel.com>, <kai.huang@intel.com>, <keirf@google.com>,
	<kent.overstreet@linux.dev>, <kirill.shutemov@intel.com>,
	<liam.merwick@oracle.com>, <maciej.wieczor-retman@intel.com>,
	<mail@maciej.szmigiero.name>, <maz@kernel.org>, <mic@digikod.net>,
	<michael.roth@amd.com>, <mpe@ellerman.id.au>,
	<muchun.song@linux.dev>, <nikunj@amd.com>, <nsaenz@amazon.es>,
	<oliver.upton@linux.dev>, <palmer@dabbelt.com>,
	<pankaj.gupta@amd.com>, <paul.walmsley@sifive.com>,
	<pbonzini@redhat.com>, <pdurrant@amazon.co.uk>,
	<peterx@redhat.com>, <pgonda@google.com>, <pvorel@suse.cz>,
	<qperret@google.com>, <quic_cvanscha@quicinc.com>,
	<quic_eberman@quicinc.com>, <quic_mnalajal@quicinc.com>,
	<quic_pderrin@quicinc.com>, <quic_pheragu@quicinc.com>,
	<quic_svaddagi@quicinc.com>, <quic_tsoni@quicinc.com>,
	<richard.weiyang@gmail.com>, <rick.p.edgecombe@intel.com>,
	<rientjes@google.com>, <roypat@amazon.co.uk>, <rppt@kernel.org>,
	<seanjc@google.com>, <shuah@kernel.org>, <steven.price@arm.com>,
	<steven.sistare@oracle.com>, <suzuki.poulose@arm.com>,
	<tabba@google.com>, <thomas.lendacky@amd.com>,
	<usama.arif@bytedance.com>, <vannapurve@google.com>,
	<vbabka@suse.cz>, <viro@zeniv.linux.org.uk>,
	<vkuznets@redhat.com>, <wei.w.wang@intel.com>, <will@kernel.org>,
	<willy@infradead.org>, <xiaoyao.li@intel.com>,
	<yilun.xu@intel.com>, <yuzenghui@huawei.com>,
	<zhiquan1.li@intel.com>
Subject: Re: [RFC PATCH v2 38/51] KVM: guest_memfd: Split allocator pages for guest_memfd use
Date: Mon, 16 Jun 2025 19:15:34 +0800	[thread overview]
Message-ID: <aE/81hmduC08B8Lt@yzhao56-desk.sh.intel.com> (raw)
In-Reply-To: <diqztt4uhunj.fsf@ackerleytng-ctop.c.googlers.com>

On Thu, Jun 05, 2025 at 12:10:08PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > On Wed, May 14, 2025 at 04:42:17PM -0700, Ackerley Tng wrote:
> >> +static int kvm_gmem_convert_execute_work(struct inode *inode,
> >> +					 struct conversion_work *work,
> >> +					 bool to_shared)
> >> +{
> >> +	enum shareability m;
> >> +	int ret;
> >> +
> >> +	m = to_shared ? SHAREABILITY_ALL : SHAREABILITY_GUEST;
> >> +	ret = kvm_gmem_shareability_apply(inode, work, m);
> >> +	if (ret)
> >> +		return ret;
> >> +	/*
> >> +	 * Apply shareability first so split/merge can operate on new
> >> +	 * shareability state.
> >> +	 */
> >> +	ret = kvm_gmem_restructure_folios_in_range(
> >> +		inode, work->start, work->nr_pages, to_shared);
> >> +
> >> +	return ret;
> >> +}
> >> +
> 
> Hi Yan,
> 
> Thanks for your thorough reviews and your alternative suggestion in the
> other discussion at [1]! I'll try to bring the conversion-related parts
> of that discussion over here.
> 
> >>  static int kvm_gmem_convert_range(struct file *file, pgoff_t start,
> >>  				  size_t nr_pages, bool shared,
> >>  				  pgoff_t *error_index)
> 
> The guiding principle I was using for the conversion ioctls is
> 
> * Have the shareability updates and any necessary page restructuring
>   (aka splitting/merging) either fully complete or not at all by the
>   time the conversion ioctl returns.
> * Any unmapping (from host or guest page tables) will not be re-mapped
>   on errors.
> * Rollback undoes changes if conversion failed, and in those cases any
>   errors are turned into WARNings.
> 
> The rationale is that we want page sizes to be in sync with shareability
> so that any faults after the (successful or failed) conversion will not
> wrongly map in a larger page than allowed and cause any host crashes.
> 
> We considered 3 places where the memory can be mapped for conversions:
> 
> 1. Host page tables
> 2. Guest page tables
> 3. IOMMU page tables
> 
> Unmapping from host page tables is the simplest case. We unmap any
> shared ranges from the host page tables. Any accesses after the failed
> conversion would just fault the memory back in and proceed as usual.
> 
> guest_memfd memory is not unmapped from IOMMUs in conversions. This case
> is handled because IOMMU mappings hold refcounts. After unmapping from
> the host, we check for unexpected refcounts and fail the conversion if any
> are found.
> 
> We also unmap from guest page tables. Considering failed conversions, if
> the pages are shared, we're good since the next time the guest accesses
> the page, the page will be faulted in as before.
> 
> If the pages are private, on the next guest access, the pages will be
> faulted in again as well. This is fine for software-protected VMs IIUC.
> 
> For TDX (and SNP) IIUC the memory would have been cleared, and the
> memory would also need to be re-accepted. I was thinking that this is by
> design, since when a TDX guest requests a conversion it knows that the
> contents are not to be used again.
This is not guaranteed.

On a private-to-shared conversion failure, the guest may either leak the page or
release it. If the guest chooses the latter (e.g. in kvmclock_init_mem() or
kvm_arch_ptp_init()), the page is still regarded as private by the guest OS, so
re-acceptance will not happen before the guest next accesses it.

So, it's better for the host to keep the original SEPT if a private-to-shared
conversion fails.
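
Roughly, the ordering I have in mind looks like the below (an illustrative
sketch only, with made-up helper names; not code from this series):

static int convert_private_to_shared(struct kvm *kvm, struct inode *inode,
				     pgoff_t start, size_t nr_pages)
{
	int ret;

	/*
	 * Do all of the fallible work (shareability update, folio split,
	 * refcount checks) while the SEPT still maps the range as private.
	 */
	ret = prepare_private_to_shared(inode, start, nr_pages);
	if (ret)
		return ret;	/* SEPT untouched, no re-acceptance needed */

	/* Only zap the private SEPT entries once nothing can fail any more. */
	zap_private_mappings(kvm, start, nr_pages);

	return 0;
}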


 
> The userspace VMM is obligated to keep trying to convert, and if it gives
> up, the userspace VMM should inform the guest that the conversion
> failed. The guest should handle conversion failures too and not assume
> that conversion always succeeds.
I don't think relying on userspace to keep retrying the conversion endlessly is
a good design.

> 
> Putting TDX aside for a moment, so far, there are a few ways this
> conversion could fail:
> 
> a. Unexpected refcounts. Userspace should clear up the unexpected
>    refcounts and report failure to the guest if it can't for whatever
>    reason.
This is acceptable. Unmapping shared mappings in the primary MMU or shared EPT
is harmless.

> b. ENOMEM because (i) we ran out of memory updating the shareability
>    maple_tree or (ii) since splitting involves allocating more memory
>    for struct pages and we ran out of memory there. In this case the
>    userspace VMM gets -ENOMEM and can make more memory available and
>    then retry, or if it can't, also report failure to the guest.
This is unacceptable. Why not reserve the memory before committing to the real
conversion? If -ENOMEM is returned before executing the conversion, we don't
need to handle a restore error, which is impossible to handle gracefully.
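
For example, something along these lines (a rough sketch with hypothetical
helper names, just to illustrate the idea):

static int kvm_gmem_convert_range(struct file *file, pgoff_t start,
				  size_t nr_pages, bool shared,
				  pgoff_t *error_index)
{
	struct inode *inode = file_inode(file);
	int ret;

	/*
	 * Reserve everything the conversion may need to allocate (maple_tree
	 * nodes for the shareability update, metadata for folio splits) up
	 * front, so that -ENOMEM is returned here and the apply/rollback
	 * steps below never have to allocate.
	 */
	ret = kvm_gmem_convert_reserve(inode, start, nr_pages, shared);
	if (ret)
		return ret;

	ret = kvm_gmem_convert_apply(file, start, nr_pages, shared, error_index);

	kvm_gmem_convert_unreserve(inode, start, nr_pages, shared);

	return ret;
}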


> TDX introduces TDX-specific conversion failures (see discussion at
> [1]), which this series doesn't handle, but I think we still have a line
> of sight to handle new errors.
The errors can be divided into two categories:
(1) errors due to kernel bugs
(2) errors that could occur in normal or bad conditions (e.g. the -ENOMEM)

We can't handle (1), so a BUG_ON or leaking memory is allowed.
However, we should try to avoid (2), especially in the rollback path.


> In the other thread [1], I was proposing to have guest_memfd decide what
> to do on errors, but I think that might be baking more TDX-specific
> details into guest_memfd/KVM, and perhaps this is better:
The TDX-specific errors in the unmapping path are of category (1).
So, we hope to resolve them with a BUG_ON and by leaking the memory.

The other conversion error for TDX comes from splitting memory. We hope to do
the splitting before executing the real conversion.

Please check the proposal details at
https://lore.kernel.org/all/aE%2Fq9VKkmaCcuwpU@yzhao56-desk.sh.intel.com.

> We could return the errors to userspace and let userspace determine what
> to do. For retryable errors (as determined by userspace), it should do
> what it needs to do, and retry. For errors like TDX being unable to
> reclaim the memory, it could tell guest_memfd to leak that memory.
> 
> If userspace gives up, it should report conversion failure to the guest
> if userspace thinks the guest can continue (to a clean shutdown or
> otherwise). If something terrible happened during conversion, then
> userspace might have to exit itself or shutdown the host.
> 
> In [2], for TDX-specific conversion failures, you proposed prepping to
> eliminate errors and exiting early on failure, then actually
> unmapping. I think that could work too.
> 
> I'm a little concerned that prepping could be complicated, since the
> nature of conversion depends on the current state of shareability, and
> there's a lot to prepare, everything from counting memory required for
>   maple_tree allocation (and merging ranges in the maple_tree) to counting
>   the number of pages required for undoing vmemmap optimization
> in the case of splitting...
> 
> And even after doing all the prep to eliminate errors, the unmapping
> could fail in TDX-specific cases anyway, which still needs to be
> handled.
> 
> Hence I'm hoping you'll consider to let TDX-specific failures be
> built-in and handled alongside other failures by getting help from the
> userspace VMM, and in the worst case letting the guest know the
> conversion failed.
> 
> I also appreciate comments or suggestions from anyone else!
> 
> [1] https://lore.kernel.org/all/diqzfrgfp95d.fsf@ackerleytng-ctop.c.googlers.com/
> [2] https://lore.kernel.org/all/aEEEJbTzlncbRaRA@yzhao56-desk.sh.intel.com/
> 
> >> @@ -371,18 +539,21 @@ static int kvm_gmem_convert_range(struct file *file, pgoff_t start,
> >>  
> >>  	list_for_each_entry(work, &work_list, list) {
> >>  		rollback_stop_item = work;
> >> -		ret = kvm_gmem_shareability_apply(inode, work, m);
> >> +
> >> +		ret = kvm_gmem_convert_execute_work(inode, work, shared);
> >>  		if (ret)
> >>  			break;
> >>  	}
> >>  
> >>  	if (ret) {
> >> -		m = shared ? SHAREABILITY_GUEST : SHAREABILITY_ALL;
> >>  		list_for_each_entry(work, &work_list, list) {
> >> +			int r;
> >> +
> >> +			r = kvm_gmem_convert_execute_work(inode, work, !shared);
> >> +			WARN_ON(r);
> >> +
> >>  			if (work == rollback_stop_item)
> >>  				break;
> >> -
> >> -			WARN_ON(kvm_gmem_shareability_apply(inode, work, m));
> > Could kvm_gmem_shareability_apply() fail here?
> >
> 
> Yes, it could. If shareability cannot be updated, then we probably ran
> out of memory. Userspace VMM will probably get -ENOMEM set on some
> earlier ret and should handle that accordingly.
> 
> On -ENOMEM in a rollback, the host is in a very tough spot anyway, and a
> clean guest shutdown may be the only way out, hence this is a WARN and
> not returned to userspace.
> 
> >>  		}
> >>  	}
> >>  
> >> @@ -434,6 +605,277 @@ static int kvm_gmem_ioctl_convert_range(struct file *file,
> >>  	return ret;
> >>  }
> >>  
> >> +#ifdef CONFIG_KVM_GMEM_HUGETLB
> >> +
> >> +static inline void __filemap_remove_folio_for_restructuring(struct folio *folio)
> >> +{
> >> +	struct address_space *mapping = folio->mapping;
> >> +
> >> +	spin_lock(&mapping->host->i_lock);
> >> +	xa_lock_irq(&mapping->i_pages);
> >> +
> >> +	__filemap_remove_folio(folio, NULL);
> >> +
> >> +	xa_unlock_irq(&mapping->i_pages);
> >> +	spin_unlock(&mapping->host->i_lock);
> >> +}
> >> +
> >> +/**
> >> + * filemap_remove_folio_for_restructuring() - Remove @folio from filemap for
> >> + * split/merge.
> >> + *
> >> + * @folio: the folio to be removed.
> >> + *
> >> + * Similar to filemap_remove_folio(), but skips LRU-related calls (meaningless
> >> + * for guest_memfd), and skips call to ->free_folio() to maintain folio flags.
> >> + *
> >> + * Context: Expects only the filemap's refcounts to be left on the folio. Will
> >> + *          freeze these refcounts away so that no other users will interfere
> >> + *          with restructuring.
> >> + */
> >> +static inline void filemap_remove_folio_for_restructuring(struct folio *folio)
> >> +{
> >> +	int filemap_refcount;
> >> +
> >> +	filemap_refcount = folio_nr_pages(folio);
> >> +	while (!folio_ref_freeze(folio, filemap_refcount)) {
> >> +		/*
> >> +		 * At this point only filemap refcounts are expected, hence okay
> >> +		 * to spin until speculative refcounts go away.
> >> +		 */
> >> +		WARN_ONCE(1, "Spinning on folio=%p refcount=%d", folio, folio_ref_count(folio));
> >> +	}
> >> +
> >> +	folio_lock(folio);
> >> +	__filemap_remove_folio_for_restructuring(folio);
> >> +	folio_unlock(folio);
> >> +}
> >> +
> >> +/**
> >> + * kvm_gmem_split_folio_in_filemap() - Split @folio within filemap in @inode.
> >> + *
> >> + * @inode: inode containing the folio.
> >> + * @folio: folio to be split.
> >> + *
> >> + * Split a folio into folios of size PAGE_SIZE. Will clean up folio from filemap
> >> + * and add back the split folios.
> >> + *
> >> + * Context: Expects that before this call, folio's refcount is just the
> >> + *          filemap's refcounts. After this function returns, the split folios'
> >> + *          refcounts will also be filemap's refcounts.
> >> + * Return: 0 on success or negative error otherwise.
> >> + */
> >> +static int kvm_gmem_split_folio_in_filemap(struct inode *inode, struct folio *folio)
> >> +{
> >> +	size_t orig_nr_pages;
> >> +	pgoff_t orig_index;
> >> +	size_t i, j;
> >> +	int ret;
> >> +
> >> +	orig_nr_pages = folio_nr_pages(folio);
> >> +	if (orig_nr_pages == 1)
> >> +		return 0;
> >> +
> >> +	orig_index = folio->index;
> >> +
> >> +	filemap_remove_folio_for_restructuring(folio);
> >> +
> >> +	ret = kvm_gmem_allocator_ops(inode)->split_folio(folio);
> >> +	if (ret)
> >> +		goto err;
> >> +
> >> +	for (i = 0; i < orig_nr_pages; ++i) {
> >> +		struct folio *f = page_folio(folio_page(folio, i));
> >> +
> >> +		ret = __kvm_gmem_filemap_add_folio(inode->i_mapping, f,
> >> +						   orig_index + i);
> > Why does the failure of __kvm_gmem_filemap_add_folio() here lead to rollback,    
> > while the failure of the one under rollback only triggers WARN_ON()?
> >
> 
> Mostly because I don't really have a choice on rollback. On rollback we
> try to restore the merged folio back into the filemap, and if we
> couldn't, the host is probably in rather bad shape in terms of memory
> availability and there may not be many options for the userspace VMM.
Out of memory is not a good excuse for an error in the rollback path.

> Perhaps the different possible errors from
> __kvm_gmem_filemap_add_folio() in both should be handled differently. Do
> you have any suggestions on that?
Maybe reserve the memory, or rule out other factors that could lead to
conversion failure, before executing the conversion?
Then BUG_ON if the failure is caused by a bug.
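
Roughly something like the below (hypothetical helpers, just to show the shape
I have in mind; the reservation step would cover both the stash metadata and
the filemap slots):

static int kvm_gmem_split_folio_in_filemap(struct inode *inode,
					   struct folio *folio)
{
	int ret;

	/* Fallible preparation first: reserve filemap slots and metadata. */
	ret = kvm_gmem_reserve_split_resources(inode, folio); /* hypothetical */
	if (ret)
		return ret;

	filemap_remove_folio_for_restructuring(folio);

	/*
	 * With the resources reserved above, a failure past this point can
	 * only be caused by a kernel bug.
	 */
	ret = kvm_gmem_allocator_ops(inode)->split_folio(folio);
	BUG_ON(ret);

	/* Re-adding must not fail either, since the slots were reserved. */
	kvm_gmem_add_split_folios_back(inode, folio); /* hypothetical */

	return 0;
}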

> >> +		if (ret)
> >> +			goto rollback;
> >> +	}
> >> +
> >> +	return ret;
> >> +
> >> +rollback:
> >> +	for (j = 0; j < i; ++j) {
> >> +		struct folio *f = page_folio(folio_page(folio, j));
> >> +
> >> +		filemap_remove_folio_for_restructuring(f);
> >> +	}
> >> +
> >> +	kvm_gmem_allocator_ops(inode)->merge_folio(folio);
> >> +err:
> >> +	WARN_ON(__kvm_gmem_filemap_add_folio(inode->i_mapping, folio, orig_index));
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static inline int kvm_gmem_try_split_folio_in_filemap(struct inode *inode,
> >> +						      struct folio *folio)
> >> +{
> >> +	size_t to_nr_pages;
> >> +	void *priv;
> >> +
> >> +	if (!kvm_gmem_has_custom_allocator(inode))
> >> +		return 0;
> >> +
> >> +	priv = kvm_gmem_allocator_private(inode);
> >> +	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_page(priv);
> >> +
> >> +	if (kvm_gmem_has_some_shared(inode, folio->index, to_nr_pages))
> >> +		return kvm_gmem_split_folio_in_filemap(inode, folio);
> >> +
> >> +	return 0;
> >> +}
> >> +
> >> +/**
> >> + * kvm_gmem_merge_folio_in_filemap() - Merge @first_folio within filemap in
> >> + * @inode.
> >> + *
> >> + * @inode: inode containing the folio.
> >> + * @first_folio: first folio among folios to be merged.
> >> + *
> >> + * Will clean up subfolios from filemap and add back the merged folio.
> >> + *
> >> + * Context: Expects that before this call, all subfolios only have filemap
> >> + *          refcounts. After this function returns, the merged folio will only
> >> + *          have filemap refcounts.
> >> + * Return: 0 on success or negative error otherwise.
> >> + */
> >> +static int kvm_gmem_merge_folio_in_filemap(struct inode *inode,
> >> +					   struct folio *first_folio)
> >> +{
> >> +	size_t to_nr_pages;
> >> +	pgoff_t index;
> >> +	void *priv;
> >> +	size_t i;
> >> +	int ret;
> >> +
> >> +	index = first_folio->index;
> >> +
> >> +	priv = kvm_gmem_allocator_private(inode);
> >> +	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
> >> +	if (folio_nr_pages(first_folio) == to_nr_pages)
> >> +		return 0;
> >> +
> >> +	for (i = 0; i < to_nr_pages; ++i) {
> >> +		struct folio *f = page_folio(folio_page(first_folio, i));
> >> +
> >> +		filemap_remove_folio_for_restructuring(f);
> >> +	}
> >> +
> >> +	kvm_gmem_allocator_ops(inode)->merge_folio(first_folio);
> >> +
> >> +	ret = __kvm_gmem_filemap_add_folio(inode->i_mapping, first_folio, index);
> >> +	if (ret)
> >> +		goto err_split;
> >> +
> >> +	return ret;
> >> +
> >> +err_split:
> >> +	WARN_ON(kvm_gmem_allocator_ops(inode)->split_folio(first_folio));
> > guestmem_hugetlb_split_folio() can fail, e.g.:
> > After the stash is freed by guestmem_hugetlb_unstash_free_metadata() in
> > guestmem_hugetlb_merge_folio(), it's possible to get -ENOMEM for the stash
> > allocation in guestmem_hugetlb_stash_metadata() in
> > guestmem_hugetlb_split_folio().
> >
> >
> 
> Yes. This is also on the error path. In line with all the other error
> and rollback paths, I don't really have other options at this point,
> since on error, I probably ran out of memory, so I try my best to
> restore the original state but give up with a WARN otherwise.
> 
> >> +	for (i = 0; i < to_nr_pages; ++i) {
> >> +		struct folio *f = page_folio(folio_page(first_folio, i));
> >> +
> >> +		WARN_ON(__kvm_gmem_filemap_add_folio(inode->i_mapping, f, index + i));
> >> +	}
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +static inline int kvm_gmem_try_merge_folio_in_filemap(struct inode *inode,
> >> +						      struct folio *first_folio)
> >> +{
> >> +	size_t to_nr_pages;
> >> +	void *priv;
> >> +
> >> +	priv = kvm_gmem_allocator_private(inode);
> >> +	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
> >> +
> >> +	if (kvm_gmem_has_some_shared(inode, first_folio->index, to_nr_pages))
> >> +		return 0;
> >> +
> >> +	return kvm_gmem_merge_folio_in_filemap(inode, first_folio);
> >> +}
> >> +
> >> +static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
> >> +						pgoff_t start, size_t nr_pages,
> >> +						bool is_split_operation)
> >> +{
> >> +	size_t to_nr_pages;
> >> +	pgoff_t index;
> >> +	pgoff_t end;
> >> +	void *priv;
> >> +	int ret;
> >> +
> >> +	if (!kvm_gmem_has_custom_allocator(inode))
> >> +		return 0;
> >> +
> >> +	end = start + nr_pages;
> >> +
> >> +	/* Round to allocator page size, to check all (huge) pages in range. */
> >> +	priv = kvm_gmem_allocator_private(inode);
> >> +	to_nr_pages = kvm_gmem_allocator_ops(inode)->nr_pages_in_folio(priv);
> >> +
> >> +	start = round_down(start, to_nr_pages);
> >> +	end = round_up(end, to_nr_pages);
> >> +
> >> +	for (index = start; index < end; index += to_nr_pages) {
> >> +		struct folio *f;
> >> +
> >> +		f = filemap_get_folio(inode->i_mapping, index);
> >> +		if (IS_ERR(f))
> >> +			continue;
> >> +
> >> +		/* Leave just filemap's refcounts on the folio. */
> >> +		folio_put(f);
> >> +
> >> +		if (is_split_operation)
> >> +			ret = kvm_gmem_split_folio_in_filemap(inode, f);
> > kvm_gmem_try_split_folio_in_filemap()?
> >
> 
> Here we know for sure that this was a private-to-shared
> conversion. Hence, we know that there are at least some shared parts in
> this huge page and we can skip checking that. 
Ok.

> >> +		else
> >> +			ret = kvm_gmem_try_merge_folio_in_filemap(inode, f);
> >> +
> 
> For merge, we don't know if the entire huge page might perhaps contain
> some other shared subpages, hence we "try" to merge by first checking
> against shareability to find shared subpages.
Makes sense.


> >> +		if (ret)
> >> +			goto rollback;
> >> +	}
> >> +	return ret;
> >> +
> >> +rollback:
> >> +	for (index -= to_nr_pages; index >= start; index -= to_nr_pages) {
> 
> Note to self: the first index -= to_nr_pages was meant to skip the index
> that caused the failure, but this could cause an underflow if index = 0
> when entering rollback. Need to fix this in the next revision.
Yes :)
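
e.g., something like the below would avoid the underflow (untested, just to
illustrate):

rollback:
	while (index > start) {
		struct folio *f;

		index -= to_nr_pages;

		f = filemap_get_folio(inode->i_mapping, index);
		if (IS_ERR(f))
			continue;

		/* Leave just filemap's refcounts on the folio. */
		folio_put(f);

		if (is_split_operation)
			WARN_ON(kvm_gmem_merge_folio_in_filemap(inode, f));
		else
			WARN_ON(kvm_gmem_split_folio_in_filemap(inode, f));
	}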

> >> +		struct folio *f;
> >> +
> >> +		f = filemap_get_folio(inode->i_mapping, index);
> >> +		if (IS_ERR(f))
> >> +			continue;
> >> +
> >> +		/* Leave just filemap's refcounts on the folio. */
> >> +		folio_put(f);
> >> +
> >> +		if (is_split_operation)
> >> +			WARN_ON(kvm_gmem_merge_folio_in_filemap(inode, f));
> >> +		else
> >> +			WARN_ON(kvm_gmem_split_folio_in_filemap(inode, f));
> > Is it safe to just leave WARN_ON()s in the rollback case?
> >
> 
> Same as above. I don't think we have much of a choice.
> 
> > Besides, are the kvm_gmem_merge_folio_in_filemap() and
> > kvm_gmem_split_folio_in_filemap() here duplicated with the
> > kvm_gmem_split_folio_in_filemap() and kvm_gmem_try_merge_folio_in_filemap() in
> > the following "r = kvm_gmem_convert_execute_work(inode, work, !shared)"?
> >
> 
> This handles the case where some pages in the range [start, start +
> nr_pages) were split and the failure was halfway through. I could call
> kvm_gmem_convert_execute_work() with !shared but that would go over all
> the folios again from the start.
> 
> >> +	}
> >> +
> >> +	return ret;
> >> +}
> >> +
> >> +#else
> >> +
> >> +static inline int kvm_gmem_try_split_folio_in_filemap(struct inode *inode,
> >> +						      struct folio *folio)
> >> +{
> >> +	return 0;
> >> +}
> >> +
> >> +static int kvm_gmem_restructure_folios_in_range(struct inode *inode,
> >> +						pgoff_t start, size_t nr_pages,
> >> +						bool is_split_operation)
> >> +{
> >> +	return 0;
> >> +}
> >> +
> >> +#endif
> >> +
> >  


