[RFC PATCH 00/39] 1G page support for guest

* [RFC PATCH 00/39] 1G page support for guest_memfd
@ 2024-09-10 23:43 Ackerley Tng
  2024-09-10 23:43 ` [RFC PATCH 01/39] mm: hugetlb: Simplify logic in dequeue_hugetlb_folio_vma() Ackerley Tng
                   ` (41 more replies)
  0 siblings, 42 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:43 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

Hello,

This patchset is our exploration of how to support 1G pages in guest_memfd, and
how the pages will be used in Confidential VMs.

The patchset covers:

+ How to get 1G pages
+ Allowing mmap() of guest_memfd to userspace so that both private and shared
  memory can use the same physical pages
+ Splitting and reconstructing pages to support conversions and mmap()
+ How the VM, userspace and guest_memfd interact to support conversions
+ Selftests to test all the above
    + Selftests also demonstrate the conversion flow between VM, userspace and
      guest_memfd.

Why 1G pages in guest_memfd?

Bring guest_memfd to performance and memory savings parity with VMs that are
backed by HugeTLBfs.

+ Performance is improved with 1G pages by more TLB hits and faster page walks
  on TLB misses.
+ Memory savings from 1G pages come from HugeTLB Vmemmap Optimization (HVO).
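
As a rough illustration of the HVO savings mentioned above, the arithmetic can
be sketched as follows. The constants are assumptions that are not stated in
this cover letter: 4 KiB base pages, a 64-byte struct page, and one vmemmap
page retained per HugeTLB page after optimization. The helper names are
hypothetical.

```c
/*
 * Assumed values (not taken from this cover letter): 4 KiB base pages, a
 * 64-byte struct page, and one vmemmap page kept per HugeTLB page by HVO.
 */
#define BASE_PAGE_SIZE		4096UL
#define STRUCT_PAGE_SIZE	64UL
#define HVO_RETAINED_PAGES	1UL

/* Base pages of vmemmap needed to describe one huge page. */
static unsigned long vmemmap_pages(unsigned long huge_page_size)
{
	unsigned long nr_struct_pages = huge_page_size / BASE_PAGE_SIZE;

	return nr_struct_pages * STRUCT_PAGE_SIZE / BASE_PAGE_SIZE;
}

/* Base pages of vmemmap that could be freed per huge page. */
static unsigned long hvo_saved_pages(unsigned long huge_page_size)
{
	unsigned long total = vmemmap_pages(huge_page_size);

	return total > HVO_RETAINED_PAGES ? total - HVO_RETAINED_PAGES : 0;
}
```

Under these assumptions a 1G page needs 4096 base pages (16 MiB) of vmemmap,
of which 4095 could be freed, while a 2M page frees 7 of its 8.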

Options for 1G page support:

1. HugeTLB
2. Contiguous Memory Allocator (CMA)
3. Other suggestions are welcome!

Comparison between options:

1. HugeTLB
    + Refactor HugeTLB to separate allocator from the rest of HugeTLB
    + Pro: Graceful transition for VMs backed with HugeTLB to guest_memfd
        + Near term: Allows co-tenancy of HugeTLB and guest_memfd backed VMs
    + Pro: Can provide iterative steps toward new future allocator
        + Unexplored: Managing userspace-visible changes
            + e.g. HugeTLB's free_hugepages will decrease if HugeTLB is used,
              but not when future allocator is used
2. CMA
    + Port some HugeTLB features to be applied on CMA
    + Pro: Clean slate

What would refactoring HugeTLB involve?

(Some refactoring was done in this RFC, more can be done.)

1. Broadly involves separating the HugeTLB allocator from the rest of HugeTLB
    + Brings more modularity to HugeTLB
    + No functionality change intended
    + Likely step towards HugeTLB's integration into core-mm
2. guest_memfd will use just the allocator component of HugeTLB, not including
   the complex parts of HugeTLB like
    + Userspace reservations (resv_map)
    + Shared PMD mappings
    + Special page walkers

What features would need to be ported to CMA?

+ Improved allocation guarantees
    + Per NUMA node pool of huge pages
    + Subpools per guest_memfd
+ Memory savings
    + Something like HugeTLB Vmemmap Optimization
+ Configuration/reporting features
    + Configuration of number of pages available (and per NUMA node) at and
      after host boot
    + Reporting of memory usage/availability statistics at runtime

HugeTLB was picked as the source of 1G pages for this RFC because it allows a
graceful transition, and retains memory savings from HVO.

To illustrate this, consider the CMA option: if a host machine uses HugeTLBfs to
back VMs and a confidential VM is scheduled on that host, some HugeTLBfs pages
would have to be given up and returned to CMA so that guest_memfd pages can be
rebuilt from that memory. HVO would have to be removed from those pages and
reapplied on the new guest_memfd memory, which requires memory to be reserved on
the host for the transition. This not only slows down memory allocation but also
trims the benefits of HVO.

Improving how guest_memfd uses the allocator in a future revision of this RFC:

To provide an easier transition away from HugeTLB, guest_memfd's use of HugeTLB
should be limited to these allocator functions:

+ reserve(node, page_size, num_pages) => opaque handle
    + Used when a guest_memfd inode is created to reserve memory from backend
      allocator
+ allocate(handle, mempolicy, page_size) => folio
    + To allocate a folio from guest_memfd's reservation
+ split(handle, folio, target_page_size) => void
    + To take a huge folio, split it into smaller folios, and restore them to
      the filemap
+ reconstruct(handle, first_folio, nr_pages) => void
    + To take a folio, and reconstruct a huge folio out of nr_pages pages
      starting from first_folio
+ free(handle, folio) => void
    + To return folio to guest_memfd's reservation
+ error(handle, folio) => void
    + To handle memory errors
+ unreserve(handle) => void
    + To return guest_memfd's reservation to allocator backend

Userspace should only provide a page size when creating a guest_memfd and should
not have to specify HugeTLB.
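
The allocator functions listed above could be expressed as an ops table. The
following is only a userspace sketch under stated assumptions: all type and
function names (toy_folio, toy_mempolicy, gmem_resv, gmem_allocator_ops, toy_*)
are hypothetical and do not appear in this series, the handle is a plain
counter rather than a real HugeTLB subpool, and split/reconstruct/error are
declared but not implemented.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-ins for struct folio and struct mempolicy. */
struct toy_folio { size_t size; };
struct toy_mempolicy;

/* The handle; in a real design this would be opaque to guest_memfd. */
struct gmem_resv {
	int node;
	size_t page_size;
	size_t nr_free;		/* pages reserved but not yet handed out */
};

struct gmem_allocator_ops {
	struct gmem_resv *(*reserve)(int node, size_t page_size,
				     size_t num_pages);
	struct toy_folio *(*allocate)(struct gmem_resv *h,
				      struct toy_mempolicy *mpol,
				      size_t page_size);
	void (*split)(struct gmem_resv *h, struct toy_folio *folio,
		      size_t target_page_size);
	void (*reconstruct)(struct gmem_resv *h, struct toy_folio *first_folio,
			    size_t nr_pages);
	/* Corresponds to free(handle, folio) in the list above. */
	void (*free_folio)(struct gmem_resv *h, struct toy_folio *folio);
	void (*error)(struct gmem_resv *h, struct toy_folio *folio);
	void (*unreserve)(struct gmem_resv *h);
};

static struct gmem_resv *toy_reserve(int node, size_t page_size,
				     size_t num_pages)
{
	struct gmem_resv *h = malloc(sizeof(*h));

	if (!h)
		return NULL;
	h->node = node;
	h->page_size = page_size;
	h->nr_free = num_pages;
	return h;
}

static struct toy_folio *toy_allocate(struct gmem_resv *h,
				      struct toy_mempolicy *mpol,
				      size_t page_size)
{
	struct toy_folio *folio;

	(void)mpol;	/* placement policy is ignored in this sketch */
	if (page_size != h->page_size || h->nr_free == 0)
		return NULL;
	folio = malloc(sizeof(*folio));
	if (!folio)
		return NULL;
	folio->size = page_size;
	h->nr_free--;
	return folio;
}

static void toy_free_folio(struct gmem_resv *h, struct toy_folio *folio)
{
	free(folio);
	h->nr_free++;	/* the page goes back to this reservation */
}

static void toy_unreserve(struct gmem_resv *h)
{
	free(h);
}

/* split, reconstruct and error are left NULL in this sketch. */
static const struct gmem_allocator_ops toy_ops = {
	.reserve	= toy_reserve,
	.allocate	= toy_allocate,
	.free_folio	= toy_free_folio,
	.unreserve	= toy_unreserve,
};
```

The point of the table is that guest_memfd would only ever hold the handle and
call through the ops, so a different backend could be swapped in later without
touching guest_memfd itself.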

Overview of patches:

+ Patches 01-12
    + Many small changes to HugeTLB, mostly to separate HugeTLBfs concepts from
      HugeTLB, and to expose HugeTLB functions.
+ Patches 13-16
    + Letting guest_memfd use HugeTLB
    + Creation of each guest_memfd reserves pages from HugeTLB's global hstate
      and puts them into the guest_memfd inode's subpool
    + Each folio allocation takes a page from the guest_memfd inode's subpool
+ Patches 17-21
    + Selftests for new HugeTLB features in guest_memfd
+ Patches 22-24
    + More small changes on the HugeTLB side to expose functions needed by
      guest_memfd
+ Patch 25:
    + Uses the newly available functions from patches 22-24 to split HugeTLB
      pages. In this patch, HugeTLB folios are always split to 4K before any
      usage, private or shared.
+ Patches 26-28
    + Allow mmap() in guest_memfd and faulting in shared pages
+ Patch 29
    + Enables conversion between private/shared pages
+ Patch 30
    + Required to zero folios after conversions to avoid leaking initialized
      kernel memory
+ Patches 31-38
    + Add selftests to test mapping pages to userspace, guest/host memory
      sharing and update conversions tests
    + Patch 33 illustrates the conversion flow between VM/userspace/guest_memfd
+ Patch 39
    + Dynamically split and reconstruct HugeTLB pages instead of always
      splitting before use. All earlier selftests are expected to still pass.

TODOs:

+ Add logic to wait for safe_refcount [1]
+ Look into lazy splitting/reconstruction of pages
    + Currently, when the KVM_SET_MEMORY_ATTRIBUTES ioctl is invoked, not only
      are the mem_attr_array and faultability updated, the pages in the
      requested range are also split/reconstructed as necessary. We want to
      look into delaying splitting/reconstruction to fault time.
+ Solve race between folios being faulted in and being truncated
    + When running private_mem_conversions_test with more than 1 vCPU, a folio
      getting truncated may get faulted in by another process, causing elevated
      mapcounts when the folio is freed (VM_BUG_ON_FOLIO).
+ Add intermediate splits (1G should first split to 2M and not split directly to
  4K)
+ Use guest's lock instead of hugetlb_lock
+ Use multi-index xarray/replace xarray with some other data struct for
  faultability flag
+ Refactor HugeTLB better, present generic allocator interface

Please let us know your thoughts on:

+ HugeTLB as the choice of transitional allocator backend
+ Refactoring HugeTLB to provide generic allocator interface
+ Shared/private conversion flow
    + Requiring user to request kernel to unmap pages from userspace using
      madvise(MADV_DONTNEED)
    + Failing conversion on elevated mapcounts/pincounts/refcounts
+ Process of splitting/reconstructing page
+ Anything else!

[1] https://lore.kernel.org/all/20240829-guest-memfd-lib-v2-0-b9afc1ff3656@quicinc.com/T/

Ackerley Tng (37):
  mm: hugetlb: Simplify logic in dequeue_hugetlb_folio_vma()
  mm: hugetlb: Refactor vma_has_reserves() to should_use_hstate_resv()
  mm: hugetlb: Remove unnecessary check for avoid_reserve
  mm: mempolicy: Refactor out policy_node_nodemask()
  mm: hugetlb: Refactor alloc_buddy_hugetlb_folio_with_mpol() to
    interpret mempolicy instead of vma
  mm: hugetlb: Refactor dequeue_hugetlb_folio_vma() to use mpol
  mm: hugetlb: Refactor out hugetlb_alloc_folio
  mm: truncate: Expose preparation steps for truncate_inode_pages_final
  mm: hugetlb: Expose hugetlb_subpool_{get,put}_pages()
  mm: hugetlb: Add option to create new subpool without using surplus
  mm: hugetlb: Expose hugetlb_acct_memory()
  mm: hugetlb: Move and expose hugetlb_zero_partial_page()
  KVM: guest_memfd: Make guest mem use guest mem inodes instead of
    anonymous inodes
  KVM: guest_memfd: hugetlb: initialization and cleanup
  KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb
  KVM: guest_memfd: Add page alignment check for hugetlb guest_memfd
  KVM: selftests: Add basic selftests for hugetlb-backed guest_memfd
  KVM: selftests: Support various types of backing sources for private
    memory
  KVM: selftests: Update test for various private memory backing source
    types
  KVM: selftests: Add private_mem_conversions_test.sh
  KVM: selftests: Test that guest_memfd usage is reported via hugetlb
  mm: hugetlb: Expose vmemmap optimization functions
  mm: hugetlb: Expose HugeTLB functions for promoting/demoting pages
  mm: hugetlb: Add functions to add/move/remove from hugetlb lists
  KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  KVM: guest_memfd: Allow mmapping guest_memfd files
  KVM: guest_memfd: Use vm_type to determine default faultability
  KVM: Handle conversions in the SET_MEMORY_ATTRIBUTES ioctl
  KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
  KVM: selftests: Allow vm_set_memory_attributes to be used without
    asserting return value of 0
  KVM: selftests: Test using guest_memfd memory from userspace
  KVM: selftests: Test guest_memfd memory sharing between guest and host
  KVM: selftests: Add notes in private_mem_kvm_exits_test for mmap-able
    guest_memfd
  KVM: selftests: Test that pinned pages block KVM from setting memory
    attributes to PRIVATE
  KVM: selftests: Refactor vm_mem_add to be more flexible
  KVM: selftests: Add helper to perform madvise by memslots
  KVM: selftests: Update private_mem_conversions_test for mmap()able
    guest_memfd

Vishal Annapurve (2):
  KVM: guest_memfd: Split HugeTLB pages for guest_memfd use
  KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page

 fs/hugetlbfs/inode.c                          |   35 +-
 include/linux/hugetlb.h                       |   54 +-
 include/linux/kvm_host.h                      |    1 +
 include/linux/mempolicy.h                     |    2 +
 include/linux/mm.h                            |    1 +
 include/uapi/linux/kvm.h                      |   26 +
 include/uapi/linux/magic.h                    |    1 +
 mm/hugetlb.c                                  |  346 ++--
 mm/hugetlb_vmemmap.h                          |   11 -
 mm/mempolicy.c                                |   36 +-
 mm/truncate.c                                 |   26 +-
 tools/include/linux/kernel.h                  |    4 +-
 tools/testing/selftests/kvm/Makefile          |    3 +
 .../kvm/guest_memfd_hugetlb_reporting_test.c  |  222 +++
 .../selftests/kvm/guest_memfd_pin_test.c      |  104 ++
 .../selftests/kvm/guest_memfd_sharing_test.c  |  160 ++
 .../testing/selftests/kvm/guest_memfd_test.c  |  238 ++-
 .../testing/selftests/kvm/include/kvm_util.h  |   45 +-
 .../testing/selftests/kvm/include/test_util.h |   18 +
 tools/testing/selftests/kvm/lib/kvm_util.c    |  443 +++--
 tools/testing/selftests/kvm/lib/test_util.c   |   99 ++
 .../kvm/x86_64/private_mem_conversions_test.c |  158 +-
 .../x86_64/private_mem_conversions_test.sh    |   91 +
 .../kvm/x86_64/private_mem_kvm_exits_test.c   |   11 +-
 virt/kvm/guest_memfd.c                        | 1563 ++++++++++++++++-
 virt/kvm/kvm_main.c                           |   17 +
 virt/kvm/kvm_mm.h                             |   16 +
 27 files changed, 3288 insertions(+), 443 deletions(-)
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_hugetlb_reporting_test.c
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_pin_test.c
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_sharing_test.c
 create mode 100755 tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.sh

--
2.46.0.598.g6f2099f65c-goog

* [RFC PATCH 01/39] mm: hugetlb: Simplify logic in dequeue_hugetlb_folio_vma()
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
@ 2024-09-10 23:43 ` Ackerley Tng
  2024-09-10 23:43 ` [RFC PATCH 02/39] mm: hugetlb: Refactor vma_has_reserves() to should_use_hstate_resv() Ackerley Tng
                   ` (40 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:43 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

Replace the avoid_reserve and chg arguments of dequeue_hugetlb_folio_vma()
with a single argument so that the function is easier to understand.

The new argument, use_hstate_resv, indicates whether the folio to be
dequeued should be taken from reservations in hstate.

If use_hstate_resv is true, the folio to be dequeued should be taken
from reservations in hstate and hence h->resv_huge_pages is
decremented, and the folio is marked so that the reservation is
restored.

If use_hstate_resv is false, then a folio needs to be taken from the
pool and hence there must exist available_huge_pages(h), failing
which, goto err.

The bool use_hstate_resv can be reused within
dequeue_hugetlb_folio_vma()'s caller, alloc_hugetlb_folio().

No functional changes are intended.

As proof, the original two if conditions

!vma_has_reserves(vma, chg) && !available_huge_pages(h)

and

avoid_reserve && !available_huge_pages(h)

can be combined into

(avoid_reserve || !vma_has_reserves(vma, chg))
&& !available_huge_pages(h).

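Applying de Morgan's theorem to the negation of

avoid_reserve || !vma_has_reserves(vma, chg)

yields

!avoid_reserve && vma_has_reserves(vma, chg),

which is exactly use_hstate_resv. The combined condition is therefore
!use_hstate_resv && !available_huge_pages(h), hence the simplification is
correct.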
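
The equivalence can also be checked exhaustively. The sketch below is not part
of the patch; it models the three inputs as plain booleans (has_reserves
stands for vma_has_reserves(vma, chg), available for a nonzero
available_huge_pages(h)) and the function names are hypothetical.

```c
#include <stdbool.h>

/* The two original checks that led to "goto err". */
static bool old_goto_err(bool avoid_reserve, bool has_reserves, bool available)
{
	if (!has_reserves && !available)
		return true;
	if (avoid_reserve && !available)
		return true;
	return false;
}

/* The single check after this patch. */
static bool new_goto_err(bool avoid_reserve, bool has_reserves, bool available)
{
	bool use_hstate_resv = !avoid_reserve && has_reserves;

	return !use_hstate_resv && !available;
}

/* Number of input combinations on which the two forms disagree. */
static int count_mismatches(void)
{
	int mismatches = 0;

	for (int bits = 0; bits < 8; bits++) {
		bool a = bits & 1, r = bits & 2, v = bits & 4;

		if (old_goto_err(a, r, v) != new_goto_err(a, r, v))
			mismatches++;
	}
	return mismatches;
}
```

All eight combinations agree, so count_mismatches() returns 0.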

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 mm/hugetlb.c | 33 +++++++++++----------------------
 1 file changed, 11 insertions(+), 22 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index aaf508be0a2b..af5c6bbc9ff0 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1280,8 +1280,9 @@ static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
 	}
 
 	/*
-	 * Only the process that called mmap() has reserves for
-	 * private mappings.
+	 * Only the process that called mmap() has reserves for private
+	 * mappings. A child process with MAP_PRIVATE mappings created by their
+	 * parent have no page reserves.
 	 */
 	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
 		/*
@@ -1393,8 +1394,7 @@ static unsigned long available_huge_pages(struct hstate *h)
 
 static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
 				struct vm_area_struct *vma,
-				unsigned long address, int avoid_reserve,
-				long chg)
+				unsigned long address, bool use_hstate_resv)
 {
 	struct folio *folio = NULL;
 	struct mempolicy *mpol;
@@ -1402,16 +1402,7 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
 	nodemask_t *nodemask;
 	int nid;
 
-	/*
-	 * A child process with MAP_PRIVATE mappings created by their parent
-	 * have no page reserves. This check ensures that reservations are
-	 * not "stolen". The child may still get SIGKILLed
-	 */
-	if (!vma_has_reserves(vma, chg) && !available_huge_pages(h))
-		goto err;
-
-	/* If reserves cannot be used, ensure enough pages are in the pool */
-	if (avoid_reserve && !available_huge_pages(h))
+	if (!use_hstate_resv && !available_huge_pages(h))
 		goto err;
 
 	gfp_mask = htlb_alloc_mask(h);
@@ -1429,7 +1420,7 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
 		folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask,
 							nid, nodemask);
 
-	if (folio && !avoid_reserve && vma_has_reserves(vma, chg)) {
+	if (folio && use_hstate_resv) {
 		folio_set_hugetlb_restore_reserve(folio);
 		h->resv_huge_pages--;
 	}
@@ -3130,6 +3121,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	struct mem_cgroup *memcg;
 	bool deferred_reserve;
 	gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
+	bool use_hstate_resv;
 
 	memcg = get_mem_cgroup_from_current();
 	memcg_charge_ret = mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages);
@@ -3190,20 +3182,17 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	if (ret)
 		goto out_uncharge_cgroup_reservation;
 
+	use_hstate_resv = !avoid_reserve && vma_has_reserves(vma, gbl_chg);
+
 	spin_lock_irq(&hugetlb_lock);
-	/*
-	 * glb_chg is passed to indicate whether or not a page must be taken
-	 * from the global free pool (global change).  gbl_chg == 0 indicates
-	 * a reservation exists for the allocation.
-	 */
-	folio = dequeue_hugetlb_folio_vma(h, vma, addr, avoid_reserve, gbl_chg);
+	folio = dequeue_hugetlb_folio_vma(h, vma, addr, use_hstate_resv);
 	if (!folio) {
 		spin_unlock_irq(&hugetlb_lock);
 		folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, addr);
 		if (!folio)
 			goto out_uncharge_cgroup;
 		spin_lock_irq(&hugetlb_lock);
-		if (!avoid_reserve && vma_has_reserves(vma, gbl_chg)) {
+		if (use_hstate_resv) {
 			folio_set_hugetlb_restore_reserve(folio);
 			h->resv_huge_pages--;
 		}
-- 
2.46.0.598.g6f2099f65c-goog


* [RFC PATCH 02/39] mm: hugetlb: Refactor vma_has_reserves() to should_use_hstate_resv()
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
  2024-09-10 23:43 ` [RFC PATCH 01/39] mm: hugetlb: Simplify logic in dequeue_hugetlb_folio_vma() Ackerley Tng
@ 2024-09-10 23:43 ` Ackerley Tng
  2024-09-10 23:43 ` [RFC PATCH 03/39] mm: hugetlb: Remove unnecessary check for avoid_reserve Ackerley Tng
                   ` (39 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:43 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

With the addition of the chg parameter, vma_has_reserves() no longer
just determines whether the vma has reserves.

The comment in the vma->vm_flags & VM_NORESERVE block indicates that
this function actually computes whether or not the reserved count
should be decremented.

This refactoring also takes into account the allocation's request
parameter avoid_reserve, which helps to further simplify the calling
function alloc_hugetlb_folio().

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 mm/hugetlb.c | 16 +++++++++++++---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index af5c6bbc9ff0..597102ed224b 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1245,9 +1245,19 @@ void clear_vma_resv_huge_pages(struct vm_area_struct *vma)
 	hugetlb_dup_vma_private(vma);
 }
 
-/* Returns true if the VMA has associated reserve pages */
-static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
+/*
+ * Returns true if this allocation should use (debit) hstate reservations, based on
+ *
+ * @vma: VMA config
+ * @chg: Whether the page requirement can be satisfied using subpool reservations
+ * @avoid_reserve: Whether allocation was requested to avoid using reservations
+ */
+static bool should_use_hstate_resv(struct vm_area_struct *vma, long chg,
+				   bool avoid_reserve)
 {
+	if (avoid_reserve)
+		return false;
+
 	if (vma->vm_flags & VM_NORESERVE) {
 		/*
 		 * This address is already reserved by other process(chg == 0),
@@ -3182,7 +3192,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	if (ret)
 		goto out_uncharge_cgroup_reservation;
 
-	use_hstate_resv = !avoid_reserve && vma_has_reserves(vma, gbl_chg);
+	use_hstate_resv = should_use_hstate_resv(vma, gbl_chg, avoid_reserve);
 
 	spin_lock_irq(&hugetlb_lock);
 	folio = dequeue_hugetlb_folio_vma(h, vma, addr, use_hstate_resv);
-- 
2.46.0.598.g6f2099f65c-goog


* [RFC PATCH 03/39] mm: hugetlb: Remove unnecessary check for avoid_reserve
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
  2024-09-10 23:43 ` [RFC PATCH 01/39] mm: hugetlb: Simplify logic in dequeue_hugetlb_folio_vma() Ackerley Tng
  2024-09-10 23:43 ` [RFC PATCH 02/39] mm: hugetlb: Refactor vma_has_reserves() to should_use_hstate_resv() Ackerley Tng
@ 2024-09-10 23:43 ` Ackerley Tng
  2024-09-10 23:43 ` [RFC PATCH 04/39] mm: mempolicy: Refactor out policy_node_nodemask() Ackerley Tng
                   ` (38 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:43 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

If avoid_reserve is true, gbl_chg is not used anyway, so there is no
point in setting gbl_chg.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 mm/hugetlb.c | 10 ----------
 1 file changed, 10 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 597102ed224b..5cf7fb117e9d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3166,16 +3166,6 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 		if (gbl_chg < 0)
 			goto out_end_reservation;
 
-		/*
-		 * Even though there was no reservation in the region/reserve
-		 * map, there could be reservations associated with the
-		 * subpool that can be used.  This would be indicated if the
-		 * return value of hugepage_subpool_get_pages() is zero.
-		 * However, if avoid_reserve is specified we still avoid even
-		 * the subpool reservations.
-		 */
-		if (avoid_reserve)
-			gbl_chg = 1;
 	}
 
 	/* If this allocation is not consuming a reservation, charge it now.
-- 
2.46.0.598.g6f2099f65c-goog


* [RFC PATCH 04/39] mm: mempolicy: Refactor out policy_node_nodemask()
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (2 preceding siblings ...)
  2024-09-10 23:43 ` [RFC PATCH 03/39] mm: hugetlb: Remove unnecessary check for avoid_reserve Ackerley Tng
@ 2024-09-10 23:43 ` Ackerley Tng
  2024-09-11 16:46   ` Gregory Price
  2024-09-10 23:43 ` [RFC PATCH 05/39] mm: hugetlb: Refactor alloc_buddy_hugetlb_folio_with_mpol() to interpret mempolicy instead of vma Ackerley Tng
                   ` (37 subsequent siblings)
  41 siblings, 1 reply; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:43 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

This was refactored out of huge_node().

huge_node()'s interpretation of vma for order assumes the
hugetlb-specific storage of the hstate information in the
inode. policy_node_nodemask() does not assume that, and can be used
more generically.

This refactoring also enforces that nid default to the current node
id, which was not previously enforced.

alloc_pages_mpol_noprof() is the last remaining direct user of
policy_nodemask(). All its callers begin with nid being the current
node id as well. More refactoring is required to simplify that.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 include/linux/mempolicy.h |  2 ++
 mm/mempolicy.c            | 36 ++++++++++++++++++++++++++----------
 2 files changed, 28 insertions(+), 10 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 1add16f21612..a49631e47421 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -138,6 +138,8 @@ extern void numa_policy_init(void);
 extern void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new);
 extern void mpol_rebind_mm(struct mm_struct *mm, nodemask_t *new);
 
+extern int policy_node_nodemask(struct mempolicy *mpol, gfp_t gfp_flags,
+				pgoff_t ilx, nodemask_t **nodemask);
 extern int huge_node(struct vm_area_struct *vma,
 				unsigned long addr, gfp_t gfp_flags,
 				struct mempolicy **mpol, nodemask_t **nodemask);
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index b858e22b259d..f3e572e17775 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1212,7 +1212,6 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
 	struct mempolicy *pol = mmpol->pol;
 	pgoff_t ilx = mmpol->ilx;
 	unsigned int order;
-	int nid = numa_node_id();
 	gfp_t gfp;
 
 	order = folio_order(src);
@@ -1221,10 +1220,11 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
 	if (folio_test_hugetlb(src)) {
 		nodemask_t *nodemask;
 		struct hstate *h;
+		int nid;
 
 		h = folio_hstate(src);
 		gfp = htlb_alloc_mask(h);
-		nodemask = policy_nodemask(gfp, pol, ilx, &nid);
+		nid = policy_node_nodemask(pol, gfp, ilx, &nodemask);
 		return alloc_hugetlb_folio_nodemask(h, nid, nodemask, gfp,
 				htlb_allow_alloc_fallback(MR_MEMPOLICY_MBIND));
 	}
@@ -1234,7 +1234,7 @@ static struct folio *alloc_migration_target_by_mpol(struct folio *src,
 	else
 		gfp = GFP_HIGHUSER_MOVABLE | __GFP_RETRY_MAYFAIL | __GFP_COMP;
 
-	return folio_alloc_mpol(gfp, order, pol, ilx, nid);
+	return folio_alloc_mpol(gfp, order, pol, ilx, numa_node_id());
 }
 #else
 
@@ -2084,6 +2084,27 @@ static nodemask_t *policy_nodemask(gfp_t gfp, struct mempolicy *pol,
 	return nodemask;
 }
 
+/**
+ * policy_node_nodemask(@mpol, @gfp_flags, @ilx, @nodemask)
+ * @mpol: the memory policy to interpret. Reference must be taken.
+ * @gfp_flags: for this request
+ * @ilx: interleave index, for use only when MPOL_INTERLEAVE or
+ *       MPOL_WEIGHTED_INTERLEAVE
+ * @nodemask: (output) pointer to nodemask pointer for 'bind' and 'prefer-many'
+ *            policy
+ *
+ * Returns a nid suitable for a page allocation and a pointer. If the effective
+ * policy is 'bind' or 'prefer-many', returns a pointer to the mempolicy's
+ * @nodemask for filtering the zonelist.
+ */
+int policy_node_nodemask(struct mempolicy *mpol, gfp_t gfp_flags,
+			 pgoff_t ilx, nodemask_t **nodemask)
+{
+	int nid = numa_node_id();
+	*nodemask = policy_nodemask(gfp_flags, mpol, ilx, &nid);
+	return nid;
+}
+
 #ifdef CONFIG_HUGETLBFS
 /*
  * huge_node(@vma, @addr, @gfp_flags, @mpol)
@@ -2102,12 +2123,8 @@ int huge_node(struct vm_area_struct *vma, unsigned long addr, gfp_t gfp_flags,
 		struct mempolicy **mpol, nodemask_t **nodemask)
 {
 	pgoff_t ilx;
-	int nid;
-
-	nid = numa_node_id();
 	*mpol = get_vma_policy(vma, addr, hstate_vma(vma)->order, &ilx);
-	*nodemask = policy_nodemask(gfp_flags, *mpol, ilx, &nid);
-	return nid;
+	return policy_node_nodemask(*mpol, gfp_flags, ilx, nodemask);
 }
 
 /*
@@ -2549,8 +2566,7 @@ unsigned long alloc_pages_bulk_array_mempolicy_noprof(gfp_t gfp,
 		return alloc_pages_bulk_array_preferred_many(gfp,
 				numa_node_id(), pol, nr_pages, page_array);
 
-	nid = numa_node_id();
-	nodemask = policy_nodemask(gfp, pol, NO_INTERLEAVE_INDEX, &nid);
+	nid = policy_node_nodemask(pol, gfp, NO_INTERLEAVE_INDEX, &nodemask);
 	return alloc_pages_bulk_noprof(gfp, nid, nodemask,
 				       nr_pages, NULL, page_array);
 }
-- 
2.46.0.598.g6f2099f65c-goog


* Re: [RFC PATCH 04/39] mm: mempolicy: Refactor out policy_node_nodemask()
  2024-09-10 23:43 ` [RFC PATCH 04/39] mm: mempolicy: Refactor out policy_node_nodemask() Ackerley Tng
@ 2024-09-11 16:46   ` Gregory Price
  0 siblings, 0 replies; 130+ messages in thread
From: Gregory Price @ 2024-09-11 16:46 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest, linux-fsdevel

On Tue, Sep 10, 2024 at 11:43:35PM +0000, Ackerley Tng wrote:
> This was refactored out of huge_node().
> 
> huge_node()'s interpretation of vma for order assumes the
> hugetlb-specific storage of the hstate information in the
> inode. policy_node_nodemask() does not assume that, and can be used
> more generically.
> 
> This refactoring also enforces that nid default to the current node
> id, which was not previously enforced.
> 
> alloc_pages_mpol_noprof() is the last remaining direct user of
> policy_nodemask(). All its callers begin with nid being the current
> node id as well. More refactoring is required to simplify that.
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>

Reviewed-by: Gregory Price <gourry@gourry.net>

> +/**
> + * policy_node_nodemask(@mpol, @gfp_flags, @ilx, @nodemask)
> + * @mpol: the memory policy to interpret. Reference must be taken.
> + * @gfp_flags: for this request
> + * @ilx: interleave index, for use only when MPOL_INTERLEAVE or
> + *       MPOL_WEIGHTED_INTERLEAVE
> + * @nodemask: (output) pointer to nodemask pointer for 'bind' and 'prefer-many'
> + *            policy
> + *
> + * Returns a nid suitable for a page allocation and a pointer. If the effective
> + * policy is 'bind' or 'prefer-many', returns a pointer to the mempolicy's
> + * @nodemask for filtering the zonelist.

Technically it's possible for nid to contain MAX_NUMNODES upon return
if weighted interleave is used and the nodemask is somehow invalid
(contains no nodes, including the local node). I would expect this to
be indicative of a larger problem (i.e. should functionally never happen).

Now that I'm looking at it, it's possible the weighted interleave path
should default to returning numa_node_id() if node == MAX_NUMNODES, which
would not require any changes to this patch.

> + */
> +int policy_node_nodemask(struct mempolicy *mpol, gfp_t gfp_flags,
> +			 pgoff_t ilx, nodemask_t **nodemask)
> +{
> +	int nid = numa_node_id();
> +	*nodemask = policy_nodemask(gfp_flags, mpol, ilx, &nid);
> +	return nid;
> +}
> +
>  #ifdef CONFIG_HUGETLBFS
>  /*
>   * huge_node(@vma, @addr, @gfp_flags, @mpol)

* [RFC PATCH 05/39] mm: hugetlb: Refactor alloc_buddy_hugetlb_folio_with_mpol() to interpret mempolicy instead of vma
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (3 preceding siblings ...)
  2024-09-10 23:43 ` [RFC PATCH 04/39] mm: mempolicy: Refactor out policy_node_nodemask() Ackerley Tng
@ 2024-09-10 23:43 ` Ackerley Tng
  2024-09-10 23:43 ` [RFC PATCH 06/39] mm: hugetlb: Refactor dequeue_hugetlb_folio_vma() to use mpol Ackerley Tng
                   ` (36 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:43 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

Reducing dependence on vma avoids the hugetlb-specific assumption of
where the mempolicy is stored. This will open up other ways of using
hugetlb.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 mm/hugetlb.c | 37 +++++++++++++++++++++++--------------
 1 file changed, 23 insertions(+), 14 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5cf7fb117e9d..2f2bd2444ae2 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2536,32 +2536,31 @@ static struct folio *alloc_migrate_hugetlb_folio(struct hstate *h, gfp_t gfp_mas
 }
 
 /*
- * Use the VMA's mpolicy to allocate a huge page from the buddy.
+ * Allocate a huge page from the buddy allocator, given memory policy, node id
+ * and nodemask.
  */
-static
-struct folio *alloc_buddy_hugetlb_folio_with_mpol(struct hstate *h,
-		struct vm_area_struct *vma, unsigned long addr)
+static struct folio *alloc_buddy_hugetlb_folio_from_node(struct hstate *h,
+							 struct mempolicy *mpol,
+							 int nid,
+							 nodemask_t *nodemask)
 {
-	struct folio *folio = NULL;
-	struct mempolicy *mpol;
 	gfp_t gfp_mask = htlb_alloc_mask(h);
-	int nid;
-	nodemask_t *nodemask;
+	struct folio *folio = NULL;
 
-	nid = huge_node(vma, addr, gfp_mask, &mpol, &nodemask);
 	if (mpol_is_preferred_many(mpol)) {
 		gfp_t gfp = gfp_mask | __GFP_NOWARN;
 
 		gfp &=  ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
 		folio = alloc_surplus_hugetlb_folio(h, gfp, nid, nodemask);
+	}
 
-		/* Fallback to all nodes if page==NULL */
+	if (!folio) {
+		/* Fallback to all nodes if earlier allocation failed */
 		nodemask = NULL;
-	}
 
-	if (!folio)
 		folio = alloc_surplus_hugetlb_folio(h, gfp_mask, nid, nodemask);
-	mpol_cond_put(mpol);
+	}
+
 	return folio;
 }
 
@@ -3187,8 +3186,18 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	spin_lock_irq(&hugetlb_lock);
 	folio = dequeue_hugetlb_folio_vma(h, vma, addr, use_hstate_resv);
 	if (!folio) {
+		struct mempolicy *mpol;
+		nodemask_t *nodemask;
+		pgoff_t ilx;
+		int nid;
+
 		spin_unlock_irq(&hugetlb_lock);
-		folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, addr);
+
+		mpol = get_vma_policy(vma, addr, hstate_vma(vma)->order, &ilx);
+		nid = policy_node_nodemask(mpol, htlb_alloc_mask(h), ilx, &nodemask);
+		folio = alloc_buddy_hugetlb_folio_from_node(h, mpol, nid, nodemask);
+		mpol_cond_put(mpol);
+
 		if (!folio)
 			goto out_uncharge_cgroup;
 		spin_lock_irq(&hugetlb_lock);
-- 
2.46.0.598.g6f2099f65c-goog


* [RFC PATCH 06/39] mm: hugetlb: Refactor dequeue_hugetlb_folio_vma() to use mpol
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (4 preceding siblings ...)
  2024-09-10 23:43 ` [RFC PATCH 05/39] mm: hugetlb: Refactor alloc_buddy_hugetlb_folio_with_mpol() to interpret mempolicy instead of vma Ackerley Tng
@ 2024-09-10 23:43 ` Ackerley Tng
  2024-09-10 23:43 ` [RFC PATCH 07/39] mm: hugetlb: Refactor out hugetlb_alloc_folio Ackerley Tng
                   ` (35 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:43 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

Reduce dependence on vma since the use of huge_node() assumes
that the mempolicy is stored in a specific place in the inode,
accessed via the vma.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 mm/hugetlb.c | 55 ++++++++++++++++++++++------------------------------
 1 file changed, 23 insertions(+), 32 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2f2bd2444ae2..e341bc0eb49a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1402,44 +1402,33 @@ static unsigned long available_huge_pages(struct hstate *h)
 	return h->free_huge_pages - h->resv_huge_pages;
 }
 
-static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
-				struct vm_area_struct *vma,
-				unsigned long address, bool use_hstate_resv)
+static struct folio *dequeue_hugetlb_folio(struct hstate *h,
+					   struct mempolicy *mpol, int nid,
+					   nodemask_t *nodemask,
+					   bool use_hstate_resv)
 {
 	struct folio *folio = NULL;
-	struct mempolicy *mpol;
 	gfp_t gfp_mask;
-	nodemask_t *nodemask;
-	int nid;
 
 	if (!use_hstate_resv && !available_huge_pages(h))
-		goto err;
+		return NULL;
 
 	gfp_mask = htlb_alloc_mask(h);
-	nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
 
-	if (mpol_is_preferred_many(mpol)) {
-		folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask,
-							nid, nodemask);
+	if (mpol_is_preferred_many(mpol))
+		folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, nid, nodemask);
 
-		/* Fallback to all nodes if page==NULL */
-		nodemask = NULL;
+	if (!folio) {
+		/* Fallback to all nodes if earlier allocation failed */
+		folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask, nid, NULL);
 	}
 
-	if (!folio)
-		folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask,
-							nid, nodemask);
-
 	if (folio && use_hstate_resv) {
 		folio_set_hugetlb_restore_reserve(folio);
 		h->resv_huge_pages--;
 	}
 
-	mpol_cond_put(mpol);
 	return folio;
-
-err:
-	return NULL;
 }
 
 /*
@@ -3131,6 +3120,10 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	bool deferred_reserve;
 	gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
 	bool use_hstate_resv;
+	struct mempolicy *mpol;
+	nodemask_t *nodemask;
+	pgoff_t ilx;
+	int nid;
 
 	memcg = get_mem_cgroup_from_current();
 	memcg_charge_ret = mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages);
@@ -3184,22 +3177,19 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	use_hstate_resv = should_use_hstate_resv(vma, gbl_chg, avoid_reserve);
 
 	spin_lock_irq(&hugetlb_lock);
-	folio = dequeue_hugetlb_folio_vma(h, vma, addr, use_hstate_resv);
-	if (!folio) {
-		struct mempolicy *mpol;
-		nodemask_t *nodemask;
-		pgoff_t ilx;
-		int nid;
 
+	mpol = get_vma_policy(vma, addr, hstate_vma(vma)->order, &ilx);
+	nid = policy_node_nodemask(mpol, htlb_alloc_mask(h), ilx, &nodemask);
+	folio = dequeue_hugetlb_folio(h, mpol, nid, nodemask, use_hstate_resv);
+	if (!folio) {
 		spin_unlock_irq(&hugetlb_lock);
 
-		mpol = get_vma_policy(vma, addr, hstate_vma(vma)->order, &ilx);
-		nid = policy_node_nodemask(mpol, htlb_alloc_mask(h), ilx, &nodemask);
 		folio = alloc_buddy_hugetlb_folio_from_node(h, mpol, nid, nodemask);
-		mpol_cond_put(mpol);
-
-		if (!folio)
+		if (!folio) {
+			mpol_cond_put(mpol);
 			goto out_uncharge_cgroup;
+		}
+
 		spin_lock_irq(&hugetlb_lock);
 		if (use_hstate_resv) {
 			folio_set_hugetlb_restore_reserve(folio);
@@ -3209,6 +3199,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 		folio_ref_unfreeze(folio, 1);
 		/* Fall through */
 	}
+	mpol_cond_put(mpol);
 
 	hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, folio);
 	/* If allocation is not consuming a reservation, also store the
-- 
2.46.0.598.g6f2099f65c-goog


* [RFC PATCH 07/39] mm: hugetlb: Refactor out hugetlb_alloc_folio
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (5 preceding siblings ...)
  2024-09-10 23:43 ` [RFC PATCH 06/39] mm: hugetlb: Refactor dequeue_hugetlb_folio_vma() to use mpol Ackerley Tng
@ 2024-09-10 23:43 ` Ackerley Tng
  2024-09-10 23:43 ` [RFC PATCH 08/39] mm: truncate: Expose preparation steps for truncate_inode_pages_final Ackerley Tng
                   ` (34 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:43 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

hugetlb_alloc_folio() allocates a hugetlb folio without handling
reservations in the vma and subpool, since some of those reservation
concepts are hugetlbfs-specific.
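
The charge/allocate/unwind ordering that hugetlb_alloc_folio() follows
can be sketched in user space as below. This is only a simplified model
under stated assumptions: struct model_cgroup, its single limit field
and the pages_available flag are hypothetical and stand in for the real
hugetlb cgroup counters and the dequeue/buddy allocation paths.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the two cgroup counters that get charged. */
struct model_cgroup {
	long rsvd_charged;	/* reservation charge */
	long charged;		/* regular usage charge */
	long limit;		/* assumed cap applied to each counter */
};

static int model_charge(long *counter, long limit, long nr)
{
	if (*counter + nr > limit)
		return -1;
	*counter += nr;
	return 0;
}

/*
 * Charge the reservation counter (optionally), then the usage counter,
 * then "allocate". On failure, undo the charges in reverse order.
 * Returns true if a folio would have been returned.
 */
static bool model_alloc_folio(struct model_cgroup *cg, long nr_pages,
			      bool charge_rsvd, bool pages_available)
{
	assert(nr_pages > 0);

	if (charge_rsvd &&
	    model_charge(&cg->rsvd_charged, cg->limit, nr_pages))
		return false;

	if (model_charge(&cg->charged, cg->limit, nr_pages))
		goto err_uncharge_rsvd;

	if (!pages_available)
		goto err_uncharge;

	return true;

err_uncharge:
	cg->charged -= nr_pages;
err_uncharge_rsvd:
	if (charge_rsvd)
		cg->rsvd_charged -= nr_pages;
	return false;
}
```

The point of the model is that a failed allocation leaves both counters
exactly as they were, so the caller only has to unwind its own vma and
subpool reservations.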

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 include/linux/hugetlb.h |  12 ++++
 mm/hugetlb.c            | 144 ++++++++++++++++++++++++----------------
 2 files changed, 98 insertions(+), 58 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c9bf68c239a0..e4a05a421623 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -690,6 +690,10 @@ struct huge_bootmem_page {
 };
 
 int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
+struct folio *hugetlb_alloc_folio(struct hstate *h, struct mempolicy *mpol,
+				  int nid, nodemask_t *nodemask,
+				  bool charge_cgroup_reservation,
+				  bool use_hstate_resv);
 struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 				unsigned long addr, int avoid_reserve);
 struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
@@ -1027,6 +1031,14 @@ static inline int isolate_or_dissolve_huge_page(struct page *page,
 	return -ENOMEM;
 }
 
+static inline struct folio *
+hugetlb_alloc_folio(struct hstate *h, struct mempolicy *mpol, int nid,
+		    nodemask_t *nodemask, bool charge_cgroup_reservation,
+		    bool use_hstate_resv)
+{
+	return NULL;
+}
+
 static inline struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 					   unsigned long addr,
 					   int avoid_reserve)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e341bc0eb49a..7e73ebcc0f26 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3106,6 +3106,75 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
 	return ret;
 }
 
+/**
+ * Allocates a hugetlb folio either by dequeueing or from buddy allocator.
+ */
+struct folio *hugetlb_alloc_folio(struct hstate *h, struct mempolicy *mpol,
+				  int nid, nodemask_t *nodemask,
+				  bool charge_cgroup_reservation,
+				  bool use_hstate_resv)
+{
+	struct hugetlb_cgroup *h_cg = NULL;
+	struct folio *folio;
+	int ret;
+	int idx;
+
+	idx = hstate_index(h);
+
+	if (charge_cgroup_reservation) {
+		ret = hugetlb_cgroup_charge_cgroup_rsvd(
+			idx, pages_per_huge_page(h), &h_cg);
+		if (ret)
+			return NULL;
+	}
+
+	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
+	if (ret)
+		goto err_uncharge_cgroup_reservation;
+
+	spin_lock_irq(&hugetlb_lock);
+
+	folio = dequeue_hugetlb_folio(h, mpol, nid, nodemask, use_hstate_resv);
+	if (!folio) {
+		spin_unlock_irq(&hugetlb_lock);
+
+		folio = alloc_buddy_hugetlb_folio_from_node(h, mpol, nid, nodemask);
+		if (!folio)
+			goto err_uncharge_cgroup;
+
+		spin_lock_irq(&hugetlb_lock);
+		if (use_hstate_resv) {
+			folio_set_hugetlb_restore_reserve(folio);
+			h->resv_huge_pages--;
+		}
+		list_add(&folio->lru, &h->hugepage_activelist);
+		folio_ref_unfreeze(folio, 1);
+		/* Fall through */
+	}
+
+	hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, folio);
+
+	if (charge_cgroup_reservation) {
+		hugetlb_cgroup_commit_charge_rsvd(idx, pages_per_huge_page(h),
+						  h_cg, folio);
+	}
+
+	spin_unlock_irq(&hugetlb_lock);
+
+	return folio;
+
+err_uncharge_cgroup:
+	hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
+
+err_uncharge_cgroup_reservation:
+	if (charge_cgroup_reservation) {
+		hugetlb_cgroup_uncharge_cgroup_rsvd(idx, pages_per_huge_page(h),
+						    h_cg);
+	}
+
+	return NULL;
+}
+
 struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 				    unsigned long addr, int avoid_reserve)
 {
@@ -3114,11 +3183,10 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	struct folio *folio;
 	long map_chg, map_commit, nr_pages = pages_per_huge_page(h);
 	long gbl_chg;
-	int memcg_charge_ret, ret, idx;
-	struct hugetlb_cgroup *h_cg = NULL;
+	int memcg_charge_ret;
 	struct mem_cgroup *memcg;
-	bool deferred_reserve;
-	gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
+	bool charge_cgroup_reservation;
+	gfp_t gfp = htlb_alloc_mask(h);
 	bool use_hstate_resv;
 	struct mempolicy *mpol;
 	nodemask_t *nodemask;
@@ -3126,13 +3194,14 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	int nid;
 
 	memcg = get_mem_cgroup_from_current();
-	memcg_charge_ret = mem_cgroup_hugetlb_try_charge(memcg, gfp, nr_pages);
+	memcg_charge_ret =
+		mem_cgroup_hugetlb_try_charge(memcg, gfp | __GFP_RETRY_MAYFAIL,
+					      nr_pages);
 	if (memcg_charge_ret == -ENOMEM) {
 		mem_cgroup_put(memcg);
 		return ERR_PTR(-ENOMEM);
 	}
 
-	idx = hstate_index(h);
 	/*
 	 * Examine the region/reserve map to determine if the process
 	 * has a reservation for the page to be allocated.  A return
@@ -3160,57 +3229,22 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 
 	}
 
-	/* If this allocation is not consuming a reservation, charge it now.
-	 */
-	deferred_reserve = map_chg || avoid_reserve;
-	if (deferred_reserve) {
-		ret = hugetlb_cgroup_charge_cgroup_rsvd(
-			idx, pages_per_huge_page(h), &h_cg);
-		if (ret)
-			goto out_subpool_put;
-	}
-
-	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
-	if (ret)
-		goto out_uncharge_cgroup_reservation;
-
 	use_hstate_resv = should_use_hstate_resv(vma, gbl_chg, avoid_reserve);
 
-	spin_lock_irq(&hugetlb_lock);
+	/*
+	 * charge_cgroup_reservation if this allocation is not consuming a
+	 * reservation
+	 */
+	charge_cgroup_reservation = map_chg || avoid_reserve;
 
 	mpol = get_vma_policy(vma, addr, hstate_vma(vma)->order, &ilx);
-	nid = policy_node_nodemask(mpol, htlb_alloc_mask(h), ilx, &nodemask);
-	folio = dequeue_hugetlb_folio(h, mpol, nid, nodemask, use_hstate_resv);
-	if (!folio) {
-		spin_unlock_irq(&hugetlb_lock);
-
-		folio = alloc_buddy_hugetlb_folio_from_node(h, mpol, nid, nodemask);
-		if (!folio) {
-			mpol_cond_put(mpol);
-			goto out_uncharge_cgroup;
-		}
-
-		spin_lock_irq(&hugetlb_lock);
-		if (use_hstate_resv) {
-			folio_set_hugetlb_restore_reserve(folio);
-			h->resv_huge_pages--;
-		}
-		list_add(&folio->lru, &h->hugepage_activelist);
-		folio_ref_unfreeze(folio, 1);
-		/* Fall through */
-	}
+	nid = policy_node_nodemask(mpol, gfp, ilx, &nodemask);
+	folio = hugetlb_alloc_folio(h, mpol, nid, nodemask,
+				    charge_cgroup_reservation, use_hstate_resv);
 	mpol_cond_put(mpol);
 
-	hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, folio);
-	/* If allocation is not consuming a reservation, also store the
-	 * hugetlb_cgroup pointer on the page.
-	 */
-	if (deferred_reserve) {
-		hugetlb_cgroup_commit_charge_rsvd(idx, pages_per_huge_page(h),
-						  h_cg, folio);
-	}
-
-	spin_unlock_irq(&hugetlb_lock);
+	if (!folio)
+		goto out_subpool_put;
 
 	hugetlb_set_folio_subpool(folio, spool);
 
@@ -3229,7 +3263,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 
 		rsv_adjust = hugepage_subpool_put_pages(spool, 1);
 		hugetlb_acct_memory(h, -rsv_adjust);
-		if (deferred_reserve) {
+		if (charge_cgroup_reservation) {
 			spin_lock_irq(&hugetlb_lock);
 			hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
 					pages_per_huge_page(h), folio);
@@ -3243,12 +3277,6 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 
 	return folio;
 
-out_uncharge_cgroup:
-	hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
-out_uncharge_cgroup_reservation:
-	if (deferred_reserve)
-		hugetlb_cgroup_uncharge_cgroup_rsvd(idx, pages_per_huge_page(h),
-						    h_cg);
 out_subpool_put:
 	if (map_chg || avoid_reserve)
 		hugepage_subpool_put_pages(spool, 1);
-- 
2.46.0.598.g6f2099f65c-goog


* [RFC PATCH 08/39] mm: truncate: Expose preparation steps for truncate_inode_pages_final
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (6 preceding siblings ...)
  2024-09-10 23:43 ` [RFC PATCH 07/39] mm: hugetlb: Refactor out hugetlb_alloc_folio Ackerley Tng
@ 2024-09-10 23:43 ` Ackerley Tng
  2024-09-10 23:43 ` [RFC PATCH 09/39] mm: hugetlb: Expose hugetlb_subpool_{get,put}_pages() Ackerley Tng
                   ` (33 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:43 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

This will allow the preparation steps to be shared.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 include/linux/mm.h |  1 +
 mm/truncate.c      | 26 ++++++++++++++++----------
 2 files changed, 17 insertions(+), 10 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index c4b238a20b76..ffb4788295b4 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3442,6 +3442,7 @@ extern unsigned long vm_unmapped_area(struct vm_unmapped_area_info *info);
 extern void truncate_inode_pages(struct address_space *, loff_t);
 extern void truncate_inode_pages_range(struct address_space *,
 				       loff_t lstart, loff_t lend);
+extern void truncate_inode_pages_final_prepare(struct address_space *);
 extern void truncate_inode_pages_final(struct address_space *);
 
 /* generic vm_area_ops exported for stackable file systems */
diff --git a/mm/truncate.c b/mm/truncate.c
index 4d61fbdd4b2f..28cca86424f8 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -424,16 +424,7 @@ void truncate_inode_pages(struct address_space *mapping, loff_t lstart)
 }
 EXPORT_SYMBOL(truncate_inode_pages);
 
-/**
- * truncate_inode_pages_final - truncate *all* pages before inode dies
- * @mapping: mapping to truncate
- *
- * Called under (and serialized by) inode->i_rwsem.
- *
- * Filesystems have to use this in the .evict_inode path to inform the
- * VM that this is the final truncate and the inode is going away.
- */
-void truncate_inode_pages_final(struct address_space *mapping)
+void truncate_inode_pages_final_prepare(struct address_space *mapping)
 {
 	/*
 	 * Page reclaim can not participate in regular inode lifetime
@@ -454,6 +445,21 @@ void truncate_inode_pages_final(struct address_space *mapping)
 		xa_lock_irq(&mapping->i_pages);
 		xa_unlock_irq(&mapping->i_pages);
 	}
+}
+EXPORT_SYMBOL(truncate_inode_pages_final_prepare);
+
+/**
+ * truncate_inode_pages_final - truncate *all* pages before inode dies
+ * @mapping: mapping to truncate
+ *
+ * Called under (and serialized by) inode->i_rwsem.
+ *
+ * Filesystems have to use this in the .evict_inode path to inform the
+ * VM that this is the final truncate and the inode is going away.
+ */
+void truncate_inode_pages_final(struct address_space *mapping)
+{
+	truncate_inode_pages_final_prepare(mapping);
 
 	truncate_inode_pages(mapping, 0);
 }
-- 
2.46.0.598.g6f2099f65c-goog


* [RFC PATCH 09/39] mm: hugetlb: Expose hugetlb_subpool_{get,put}_pages()
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (7 preceding siblings ...)
  2024-09-10 23:43 ` [RFC PATCH 08/39] mm: truncate: Expose preparation steps for truncate_inode_pages_final Ackerley Tng
@ 2024-09-10 23:43 ` Ackerley Tng
  2024-09-10 23:43 ` [RFC PATCH 10/39] mm: hugetlb: Add option to create new subpool without using surplus Ackerley Tng
                   ` (32 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:43 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

This will allow hugetlb subpools to be used by guest_memfd.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 include/linux/hugetlb.h | 3 +++
 mm/hugetlb.c            | 6 ++----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e4a05a421623..907cfbbd9e24 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -119,6 +119,9 @@ struct hugepage_subpool *hugepage_new_subpool(struct hstate *h, long max_hpages,
 						long min_hpages);
 void hugepage_put_subpool(struct hugepage_subpool *spool);
 
+long hugepage_subpool_get_pages(struct hugepage_subpool *spool, long delta);
+long hugepage_subpool_put_pages(struct hugepage_subpool *spool, long delta);
+
 void hugetlb_dup_vma_private(struct vm_area_struct *vma);
 void clear_vma_resv_huge_pages(struct vm_area_struct *vma);
 int move_hugetlb_page_tables(struct vm_area_struct *vma,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7e73ebcc0f26..808915108126 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -170,8 +170,7 @@ void hugepage_put_subpool(struct hugepage_subpool *spool)
  * only be different than the passed value (delta) in the case where
  * a subpool minimum size must be maintained.
  */
-static long hugepage_subpool_get_pages(struct hugepage_subpool *spool,
-				      long delta)
+long hugepage_subpool_get_pages(struct hugepage_subpool *spool, long delta)
 {
 	long ret = delta;
 
@@ -215,8 +214,7 @@ static long hugepage_subpool_get_pages(struct hugepage_subpool *spool,
  * The return value may only be different than the passed value (delta)
  * in the case where a subpool minimum size must be maintained.
  */
-static long hugepage_subpool_put_pages(struct hugepage_subpool *spool,
-				       long delta)
+long hugepage_subpool_put_pages(struct hugepage_subpool *spool, long delta)
 {
 	long ret = delta;
 	unsigned long flags;
-- 
2.46.0.598.g6f2099f65c-goog


* [RFC PATCH 10/39] mm: hugetlb: Add option to create new subpool without using surplus
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (8 preceding siblings ...)
  2024-09-10 23:43 ` [RFC PATCH 09/39] mm: hugetlb: Expose hugetlb_subpool_{get,put}_pages() Ackerley Tng
@ 2024-09-10 23:43 ` Ackerley Tng
  2024-09-10 23:43 ` [RFC PATCH 11/39] mm: hugetlb: Expose hugetlb_acct_memory() Ackerley Tng
                   ` (31 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:43 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

__hugetlb_acct_memory() today does more than just memory
accounting. When there are insufficient HugeTLB pages,
__hugetlb_acct_memory() will attempt to get surplus pages.

This change adds a flag to disable getting surplus pages if there are
insufficient HugeTLB pages.
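
A minimal user-space sketch of the intended behaviour follows. It is a
simplified model, not the kernel implementation: struct model_hstate,
the model_* functions and the surplus_budget field are hypothetical,
and the real gather_surplus_pages() allocates from the buddy allocator
rather than drawing from a fixed budget.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, stripped-down stand-in for struct hstate. */
struct model_hstate {
	long free_huge_pages;
	long resv_huge_pages;
	long surplus_budget;	/* assumed: surplus pages still obtainable */
};

/*
 * Reserve @delta pages if enough free pages exist. Returns 0 on success,
 * otherwise the number of additional pages that would be needed.
 */
static long model_reserve_pages(struct model_hstate *h, long delta)
{
	long needed = (h->resv_huge_pages + delta) - h->free_huge_pages;

	if (needed <= 0) {
		h->resv_huge_pages += delta;
		return 0;
	}
	return needed;
}

/* Account for @delta pages; only grow the pool if @use_surplus is set. */
static int model_acct_memory(struct model_hstate *h, long delta,
			     bool use_surplus)
{
	long needed;

	assert(delta > 0);

	needed = model_reserve_pages(h, delta);
	if (needed == 0)
		return 0;

	/* The new flag: fail instead of gathering surplus pages. */
	if (!use_surplus || needed > h->surplus_budget)
		return -1;

	h->surplus_budget -= needed;
	h->free_huge_pages += needed;
	h->resv_huge_pages += delta;
	return 0;
}
```

In this model, a caller passing use_surplus == false either gets its
reservation from pages that are already free or gets an error, and the
pool size is never changed on its behalf.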

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 fs/hugetlbfs/inode.c    |  2 +-
 include/linux/hugetlb.h |  2 +-
 mm/hugetlb.c            | 43 ++++++++++++++++++++++++++++++-----------
 3 files changed, 34 insertions(+), 13 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 9f6cff356796..300a6ef300c1 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -1488,7 +1488,7 @@ hugetlbfs_fill_super(struct super_block *sb, struct fs_context *fc)
 	if (ctx->max_hpages != -1 || ctx->min_hpages != -1) {
 		sbinfo->spool = hugepage_new_subpool(ctx->hstate,
 						     ctx->max_hpages,
-						     ctx->min_hpages);
+						     ctx->min_hpages, true);
 		if (!sbinfo->spool)
 			goto out_free;
 	}
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 907cfbbd9e24..9ef1adbd3207 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -116,7 +116,7 @@ extern int hugetlb_max_hstate __read_mostly;
 	for ((h) = hstates; (h) < &hstates[hugetlb_max_hstate]; (h)++)
 
 struct hugepage_subpool *hugepage_new_subpool(struct hstate *h, long max_hpages,
-						long min_hpages);
+					      long min_hpages, bool use_surplus);
 void hugepage_put_subpool(struct hugepage_subpool *spool);
 
 long hugepage_subpool_get_pages(struct hugepage_subpool *spool, long delta);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 808915108126..efdb5772b367 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -92,6 +92,7 @@ static int num_fault_mutexes;
 struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
 
 /* Forward declaration */
+static int __hugetlb_acct_memory(struct hstate *h, long delta, bool use_surplus);
 static int hugetlb_acct_memory(struct hstate *h, long delta);
 static void hugetlb_vma_lock_free(struct vm_area_struct *vma);
 static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma);
@@ -129,7 +130,7 @@ static inline void unlock_or_release_subpool(struct hugepage_subpool *spool,
 }
 
 struct hugepage_subpool *hugepage_new_subpool(struct hstate *h, long max_hpages,
-						long min_hpages)
+					      long min_hpages, bool use_surplus)
 {
 	struct hugepage_subpool *spool;
 
@@ -143,7 +144,8 @@ struct hugepage_subpool *hugepage_new_subpool(struct hstate *h, long max_hpages,
 	spool->hstate = h;
 	spool->min_hpages = min_hpages;
 
-	if (min_hpages != -1 && hugetlb_acct_memory(h, min_hpages)) {
+	if (min_hpages != -1 &&
+	    __hugetlb_acct_memory(h, min_hpages, use_surplus)) {
 		kfree(spool);
 		return NULL;
 	}
@@ -2592,6 +2594,21 @@ static nodemask_t *policy_mbind_nodemask(gfp_t gfp)
 	return NULL;
 }
 
+static int hugetlb_hstate_reserve_pages(struct hstate *h,
+					long num_pages_to_reserve)
+	__must_hold(&hugetlb_lock)
+{
+	long needed;
+
+	needed = (h->resv_huge_pages + num_pages_to_reserve) - h->free_huge_pages;
+	if (needed <= 0) {
+		h->resv_huge_pages += num_pages_to_reserve;
+		return 0;
+	}
+
+	return needed;
+}
+
 /*
  * Increase the hugetlb pool such that it can accommodate a reservation
  * of size 'delta'.
@@ -2608,13 +2625,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 	int node;
 	nodemask_t *mbind_nodemask = policy_mbind_nodemask(htlb_alloc_mask(h));
 
-	lockdep_assert_held(&hugetlb_lock);
-	needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
-	if (needed <= 0) {
-		h->resv_huge_pages += delta;
-		return 0;
-	}
-
+	needed = delta;
 	allocated = 0;
 
 	ret = -ENOMEM;
@@ -5104,7 +5115,7 @@ unsigned long hugetlb_total_pages(void)
 	return nr_total_pages;
 }
 
-static int hugetlb_acct_memory(struct hstate *h, long delta)
+static int __hugetlb_acct_memory(struct hstate *h, long delta, bool use_surplus)
 {
 	int ret = -ENOMEM;
 
@@ -5136,7 +5147,12 @@ static int hugetlb_acct_memory(struct hstate *h, long delta)
 	 * above.
 	 */
 	if (delta > 0) {
-		if (gather_surplus_pages(h, delta) < 0)
+		long required_surplus = hugetlb_hstate_reserve_pages(h, delta);
+
+		if (!use_surplus && required_surplus > 0)
+			goto out;
+
+		if (gather_surplus_pages(h, required_surplus) < 0)
 			goto out;
 
 		if (delta > allowed_mems_nr(h)) {
@@ -5154,6 +5170,11 @@ static int hugetlb_acct_memory(struct hstate *h, long delta)
 	return ret;
 }
 
+static int hugetlb_acct_memory(struct hstate *h, long delta)
+{
+	return __hugetlb_acct_memory(h, delta, true);
+}
+
 static void hugetlb_vm_op_open(struct vm_area_struct *vma)
 {
 	struct resv_map *resv = vma_resv_map(vma);
-- 
2.46.0.598.g6f2099f65c-goog


* [RFC PATCH 11/39] mm: hugetlb: Expose hugetlb_acct_memory()
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (9 preceding siblings ...)
  2024-09-10 23:43 ` [RFC PATCH 10/39] mm: hugetlb: Add option to create new subpool without using surplus Ackerley Tng
@ 2024-09-10 23:43 ` Ackerley Tng
  2024-09-10 23:43 ` [RFC PATCH 12/39] mm: hugetlb: Move and expose hugetlb_zero_partial_page() Ackerley Tng
                   ` (30 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:43 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

This will be used by guest_memfd in a later patch.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 include/linux/hugetlb.h | 2 ++
 mm/hugetlb.c            | 4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 9ef1adbd3207..4d47bf94c211 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -122,6 +122,8 @@ void hugepage_put_subpool(struct hugepage_subpool *spool);
 long hugepage_subpool_get_pages(struct hugepage_subpool *spool, long delta);
 long hugepage_subpool_put_pages(struct hugepage_subpool *spool, long delta);
 
+int hugetlb_acct_memory(struct hstate *h, long delta);
+
 void hugetlb_dup_vma_private(struct vm_area_struct *vma);
 void clear_vma_resv_huge_pages(struct vm_area_struct *vma);
 int move_hugetlb_page_tables(struct vm_area_struct *vma,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index efdb5772b367..5a37b03e1361 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -93,7 +93,7 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
 
 /* Forward declaration */
 static int __hugetlb_acct_memory(struct hstate *h, long delta, bool use_surplus);
-static int hugetlb_acct_memory(struct hstate *h, long delta);
+int hugetlb_acct_memory(struct hstate *h, long delta);
 static void hugetlb_vma_lock_free(struct vm_area_struct *vma);
 static void hugetlb_vma_lock_alloc(struct vm_area_struct *vma);
 static void __hugetlb_vma_unlock_write_free(struct vm_area_struct *vma);
@@ -5170,7 +5170,7 @@ static int __hugetlb_acct_memory(struct hstate *h, long delta, bool use_surplus)
 	return ret;
 }
 
-static int hugetlb_acct_memory(struct hstate *h, long delta)
+int hugetlb_acct_memory(struct hstate *h, long delta)
 {
 	return __hugetlb_acct_memory(h, delta, true);
 }
-- 
2.46.0.598.g6f2099f65c-goog


^ permalink raw reply related	[flat|nested] 130+ messages in thread

* [RFC PATCH 12/39] mm: hugetlb: Move and expose hugetlb_zero_partial_page()
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (10 preceding siblings ...)
  2024-09-10 23:43 ` [RFC PATCH 11/39] mm: hugetlb: Expose hugetlb_acct_memory() Ackerley Tng
@ 2024-09-10 23:43 ` Ackerley Tng
  2024-09-10 23:43 ` [RFC PATCH 13/39] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes Ackerley Tng
                   ` (29 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:43 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

This will be used by guest_memfd in a later patch.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 fs/hugetlbfs/inode.c    | 33 +++++----------------------------
 include/linux/hugetlb.h |  3 +++
 mm/hugetlb.c            | 21 +++++++++++++++++++++
 3 files changed, 29 insertions(+), 28 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 300a6ef300c1..f76001418672 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -720,29 +720,6 @@ static void hugetlb_vmtruncate(struct inode *inode, loff_t offset)
 	remove_inode_hugepages(inode, offset, LLONG_MAX);
 }
 
-static void hugetlbfs_zero_partial_page(struct hstate *h,
-					struct address_space *mapping,
-					loff_t start,
-					loff_t end)
-{
-	pgoff_t idx = start >> huge_page_shift(h);
-	struct folio *folio;
-
-	folio = filemap_lock_hugetlb_folio(h, mapping, idx);
-	if (IS_ERR(folio))
-		return;
-
-	start = start & ~huge_page_mask(h);
-	end = end & ~huge_page_mask(h);
-	if (!end)
-		end = huge_page_size(h);
-
-	folio_zero_segment(folio, (size_t)start, (size_t)end);
-
-	folio_unlock(folio);
-	folio_put(folio);
-}
-
 static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 {
 	struct hugetlbfs_inode_info *info = HUGETLBFS_I(inode);
@@ -768,9 +745,10 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	i_mmap_lock_write(mapping);
 
 	/* If range starts before first full page, zero partial page. */
-	if (offset < hole_start)
-		hugetlbfs_zero_partial_page(h, mapping,
-				offset, min(offset + len, hole_start));
+	if (offset < hole_start) {
+		hugetlb_zero_partial_page(h, mapping, offset,
+					  min(offset + len, hole_start));
+	}
 
 	/* Unmap users of full pages in the hole. */
 	if (hole_end > hole_start) {
@@ -782,8 +760,7 @@ static long hugetlbfs_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 
 	/* If range extends beyond last full page, zero partial page. */
 	if ((offset + len) > hole_end && (offset + len) > hole_start)
-		hugetlbfs_zero_partial_page(h, mapping,
-				hole_end, offset + len);
+		hugetlb_zero_partial_page(h, mapping, hole_end, offset + len);
 
 	i_mmap_unlock_write(mapping);
 
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 4d47bf94c211..752062044b0b 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -124,6 +124,9 @@ long hugepage_subpool_put_pages(struct hugepage_subpool *spool, long delta);
 
 int hugetlb_acct_memory(struct hstate *h, long delta);
 
+void hugetlb_zero_partial_page(struct hstate *h, struct address_space *mapping,
+			       loff_t start, loff_t end);
+
 void hugetlb_dup_vma_private(struct vm_area_struct *vma);
 void clear_vma_resv_huge_pages(struct vm_area_struct *vma);
 int move_hugetlb_page_tables(struct vm_area_struct *vma,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5a37b03e1361..372d8294fb2f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1989,6 +1989,27 @@ void free_huge_folio(struct folio *folio)
 	}
 }
 
+void hugetlb_zero_partial_page(struct hstate *h, struct address_space *mapping,
+			       loff_t start, loff_t end)
+{
+	pgoff_t idx = start >> huge_page_shift(h);
+	struct folio *folio;
+
+	folio = filemap_lock_hugetlb_folio(h, mapping, idx);
+	if (IS_ERR(folio))
+		return;
+
+	start = start & ~huge_page_mask(h);
+	end = end & ~huge_page_mask(h);
+	if (!end)
+		end = huge_page_size(h);
+
+	folio_zero_segment(folio, (size_t)start, (size_t)end);
+
+	folio_unlock(folio);
+	folio_put(folio);
+}
+
 /*
  * Must be called with the hugetlb lock held
  */
-- 
2.46.0.598.g6f2099f65c-goog


^ permalink raw reply related	[flat|nested] 130+ messages in thread

* [RFC PATCH 13/39] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (11 preceding siblings ...)
  2024-09-10 23:43 ` [RFC PATCH 12/39] mm: hugetlb: Move and expose hugetlb_zero_partial_page() Ackerley Tng
@ 2024-09-10 23:43 ` Ackerley Tng
  2025-04-02  4:01   ` Yan Zhao
  2024-09-10 23:43 ` [RFC PATCH 14/39] KVM: guest_memfd: hugetlb: initialization and cleanup Ackerley Tng
                   ` (28 subsequent siblings)
  41 siblings, 1 reply; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:43 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

Using guest mem inodes allows us to store metadata for the backing
memory on the inode. Metadata will be added in a later patch to
support HugeTLB pages.

Metadata about backing memory should not be stored on the file, since
the file represents a guest_memfd's binding with a struct kvm, and
metadata about backing memory is not unique to a specific binding and
struct kvm.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 include/uapi/linux/magic.h |   1 +
 virt/kvm/guest_memfd.c     | 119 ++++++++++++++++++++++++++++++-------
 2 files changed, 100 insertions(+), 20 deletions(-)

diff --git a/include/uapi/linux/magic.h b/include/uapi/linux/magic.h
index bb575f3ab45e..169dba2a6920 100644
--- a/include/uapi/linux/magic.h
+++ b/include/uapi/linux/magic.h
@@ -103,5 +103,6 @@
 #define DEVMEM_MAGIC		0x454d444d	/* "DMEM" */
 #define SECRETMEM_MAGIC		0x5345434d	/* "SECM" */
 #define PID_FS_MAGIC		0x50494446	/* "PIDF" */
+#define GUEST_MEMORY_MAGIC	0x474d454d	/* "GMEM" */
 
 #endif /* __LINUX_MAGIC_H__ */
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 8f079a61a56d..5d7fd1f708a6 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -1,12 +1,17 @@
 // SPDX-License-Identifier: GPL-2.0
+#include <linux/fs.h>
+#include <linux/mount.h>
 #include <linux/backing-dev.h>
 #include <linux/falloc.h>
 #include <linux/kvm_host.h>
+#include <linux/pseudo_fs.h>
 #include <linux/pagemap.h>
 #include <linux/anon_inodes.h>
 
 #include "kvm_mm.h"
 
+static struct vfsmount *kvm_gmem_mnt;
+
 struct kvm_gmem {
 	struct kvm *kvm;
 	struct xarray bindings;
@@ -302,6 +307,38 @@ static inline struct file *kvm_gmem_get_file(struct kvm_memory_slot *slot)
 	return get_file_active(&slot->gmem.file);
 }
 
+static const struct super_operations kvm_gmem_super_operations = {
+	.statfs		= simple_statfs,
+};
+
+static int kvm_gmem_init_fs_context(struct fs_context *fc)
+{
+	struct pseudo_fs_context *ctx;
+
+	if (!init_pseudo(fc, GUEST_MEMORY_MAGIC))
+		return -ENOMEM;
+
+	ctx = fc->fs_private;
+	ctx->ops = &kvm_gmem_super_operations;
+
+	return 0;
+}
+
+static struct file_system_type kvm_gmem_fs = {
+	.name		 = "kvm_guest_memory",
+	.init_fs_context = kvm_gmem_init_fs_context,
+	.kill_sb	 = kill_anon_super,
+};
+
+static void kvm_gmem_init_mount(void)
+{
+	kvm_gmem_mnt = kern_mount(&kvm_gmem_fs);
+	BUG_ON(IS_ERR(kvm_gmem_mnt));
+
+	/* For giggles. Userspace can never map this anyways. */
+	kvm_gmem_mnt->mnt_flags |= MNT_NOEXEC;
+}
+
 static struct file_operations kvm_gmem_fops = {
 	.open		= generic_file_open,
 	.release	= kvm_gmem_release,
@@ -311,6 +348,8 @@ static struct file_operations kvm_gmem_fops = {
 void kvm_gmem_init(struct module *module)
 {
 	kvm_gmem_fops.owner = module;
+
+	kvm_gmem_init_mount();
 }
 
 static int kvm_gmem_migrate_folio(struct address_space *mapping,
@@ -392,11 +431,67 @@ static const struct inode_operations kvm_gmem_iops = {
 	.setattr	= kvm_gmem_setattr,
 };
 
+static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
+						      loff_t size, u64 flags)
+{
+	const struct qstr qname = QSTR_INIT(name, strlen(name));
+	struct inode *inode;
+	int err;
+
+	inode = alloc_anon_inode(kvm_gmem_mnt->mnt_sb);
+	if (IS_ERR(inode))
+		return inode;
+
+	err = security_inode_init_security_anon(inode, &qname, NULL);
+	if (err) {
+		iput(inode);
+		return ERR_PTR(err);
+	}
+
+	inode->i_private = (void *)(unsigned long)flags;
+	inode->i_op = &kvm_gmem_iops;
+	inode->i_mapping->a_ops = &kvm_gmem_aops;
+	inode->i_mode |= S_IFREG;
+	inode->i_size = size;
+	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
+	mapping_set_inaccessible(inode->i_mapping);
+	/* Unmovable mappings are supposed to be marked unevictable as well. */
+	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
+
+	return inode;
+}
+
+static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
+						  u64 flags)
+{
+	static const char *name = "[kvm-gmem]";
+	struct inode *inode;
+	struct file *file;
+
+	if (kvm_gmem_fops.owner && !try_module_get(kvm_gmem_fops.owner))
+		return ERR_PTR(-ENOENT);
+
+	inode = kvm_gmem_inode_make_secure_inode(name, size, flags);
+	if (IS_ERR(inode))
+		return ERR_CAST(inode);
+
+	file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR,
+				 &kvm_gmem_fops);
+	if (IS_ERR(file)) {
+		iput(inode);
+		return file;
+	}
+
+	file->f_mapping = inode->i_mapping;
+	file->f_flags |= O_LARGEFILE;
+	file->private_data = priv;
+
+	return file;
+}
+
 static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 {
-	const char *anon_name = "[kvm-gmem]";
 	struct kvm_gmem *gmem;
-	struct inode *inode;
 	struct file *file;
 	int fd, err;
 
@@ -410,32 +505,16 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 		goto err_fd;
 	}
 
-	file = anon_inode_create_getfile(anon_name, &kvm_gmem_fops, gmem,
-					 O_RDWR, NULL);
+	file = kvm_gmem_inode_create_getfile(gmem, size, flags);
 	if (IS_ERR(file)) {
 		err = PTR_ERR(file);
 		goto err_gmem;
 	}
 
-	file->f_flags |= O_LARGEFILE;
-
-	inode = file->f_inode;
-	WARN_ON(file->f_mapping != inode->i_mapping);
-
-	inode->i_private = (void *)(unsigned long)flags;
-	inode->i_op = &kvm_gmem_iops;
-	inode->i_mapping->a_ops = &kvm_gmem_aops;
-	inode->i_mode |= S_IFREG;
-	inode->i_size = size;
-	mapping_set_gfp_mask(inode->i_mapping, GFP_HIGHUSER);
-	mapping_set_inaccessible(inode->i_mapping);
-	/* Unmovable mappings are supposed to be marked unevictable as well. */
-	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
-
 	kvm_get_kvm(kvm);
 	gmem->kvm = kvm;
 	xa_init(&gmem->bindings);
-	list_add(&gmem->entry, &inode->i_mapping->i_private_list);
+	list_add(&gmem->entry, &file_inode(file)->i_mapping->i_private_list);
 
 	fd_install(fd, file);
 	return fd;
-- 
2.46.0.598.g6f2099f65c-goog


^ permalink raw reply related	[flat|nested] 130+ messages in thread

* Re: [RFC PATCH 13/39] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes
  2024-09-10 23:43 ` [RFC PATCH 13/39] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes Ackerley Tng
@ 2025-04-02  4:01   ` Yan Zhao
  2025-04-23 20:22     ` Ackerley Tng
  0 siblings, 1 reply; 130+ messages in thread
From: Yan Zhao @ 2025-04-02  4:01 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest, linux-fsdevel

Hi Ackerley,

Not sure if the nits below have been resolved in your latest code.
I came across them and felt it's better to report them anyway.

Apologies for any redundancy if you've already addressed them.

On Tue, Sep 10, 2024 at 11:43:44PM +0000, Ackerley Tng wrote:
> +static void kvm_gmem_init_mount(void)                                         
> +{                                                                             
> +     kvm_gmem_mnt = kern_mount(&kvm_gmem_fs);                                 
> +     BUG_ON(IS_ERR(kvm_gmem_mnt));                                            
> +                                                                              
> +     /* For giggles. Userspace can never map this anyways. */                 
> +     kvm_gmem_mnt->mnt_flags |= MNT_NOEXEC;                                   
> +}                                                                             
> +                                                                              
>  static struct file_operations kvm_gmem_fops = {                               
>       .open           = generic_file_open,                                     
>       .release        = kvm_gmem_release,                                      
> @@ -311,6 +348,8 @@ static struct file_operations kvm_gmem_fops = {            
>  void kvm_gmem_init(struct module *module)                                     
>  {                                                                             
>       kvm_gmem_fops.owner = module;                                            
> +                                                                              
> +     kvm_gmem_init_mount();                                                   
>  } 
When KVM is compiled as a module, it looks like "kern_unmount(kvm_gmem_mnt)" is
missing in the kvm_exit() path.

This may lead to a kernel oops when executing "sync" after KVM is unloaded or
reloaded.

BTW, there are lots of symbols under mm that are not exported.

> +static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
> +						  u64 flags)
> +{
> +	static const char *name = "[kvm-gmem]";
> +	struct inode *inode;
> +	struct file *file;
> +
> +	if (kvm_gmem_fops.owner && !try_module_get(kvm_gmem_fops.owner))
> +		return ERR_PTR(-ENOENT);
> +
> +	inode = kvm_gmem_inode_make_secure_inode(name, size, flags);
> +	if (IS_ERR(inode))
Missing module_put() here. i.e.,

-       if (IS_ERR(inode))
+       if (IS_ERR(inode)) {
+               if (kvm_gmem_fops.owner)
+                       module_put(kvm_gmem_fops.owner);
+
                return ERR_CAST(inode);
+       }

> +		return ERR_CAST(inode);
> +
> +	file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR,
> +				 &kvm_gmem_fops);
> +	if (IS_ERR(file)) {
> +		iput(inode);
> +		return file;
> +	}
> +
> +	file->f_mapping = inode->i_mapping;
> +	file->f_flags |= O_LARGEFILE;
> +	file->private_data = priv;
> +
> +	return file;
> +}
> +

Thanks
Yan

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [RFC PATCH 13/39] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes
  2025-04-02  4:01   ` Yan Zhao
@ 2025-04-23 20:22     ` Ackerley Tng
  2025-04-24  3:53       ` Yan Zhao
  0 siblings, 1 reply; 130+ messages in thread
From: Ackerley Tng @ 2025-04-23 20:22 UTC (permalink / raw)
  To: Yan Zhao
  Cc: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest, linux-fsdevel

Yan Zhao <yan.y.zhao@intel.com> writes:

> Hi Ackerley,
>
> Not sure if below nits have been resolved in your latest code.
> I came across them and felt it's better to report them anyway.
>
> Apologies for any redundancy if you've already addressed them.

No worries, thank you so much for your reviews!

>
> On Tue, Sep 10, 2024 at 11:43:44PM +0000, Ackerley Tng wrote:
>> +static void kvm_gmem_init_mount(void)                                         
>> +{                                                                             
>> +     kvm_gmem_mnt = kern_mount(&kvm_gmem_fs);                                 
>> +     BUG_ON(IS_ERR(kvm_gmem_mnt));                                            
>> +                                                                              
>> +     /* For giggles. Userspace can never map this anyways. */                 
>> +     kvm_gmem_mnt->mnt_flags |= MNT_NOEXEC;                                   
>> +}                                                                             
>> +                                                                              
>>  static struct file_operations kvm_gmem_fops = {                               
>>       .open           = generic_file_open,                                     
>>       .release        = kvm_gmem_release,                                      
>> @@ -311,6 +348,8 @@ static struct file_operations kvm_gmem_fops = {            
>>  void kvm_gmem_init(struct module *module)                                     
>>  {                                                                             
>>       kvm_gmem_fops.owner = module;                                            
>> +                                                                              
>> +     kvm_gmem_init_mount();                                                   
>>  } 
> When KVM is compiled as a module, looks "kern_unmount(kvm_gmem_mnt)" is
> missing in the kvm_exit() path.
>
> This may lead to kernel oops when executing "sync" after KVM is unloaded or
> reloaded.
>

Thanks, Fuad will be addressing this in a revision of [1].

> BTW, there're lots of symbols not exported under mm.
>

Thanks again, is there a good way to do a build test for symbols not
being exported?  What CONFIG flags do you use?

>> +static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
>> +						  u64 flags)
>> +{
>> +	static const char *name = "[kvm-gmem]";
>> +	struct inode *inode;
>> +	struct file *file;
>> +
>> +	if (kvm_gmem_fops.owner && !try_module_get(kvm_gmem_fops.owner))
>> +		return ERR_PTR(-ENOENT);
>> +
>> +	inode = kvm_gmem_inode_make_secure_inode(name, size, flags);
>> +	if (IS_ERR(inode))
> Missing module_put() here. i.e.,
>
> -       if (IS_ERR(inode))
> +       if (IS_ERR(inode)) {
> +               if (kvm_gmem_fops.owner)
> +                       module_put(kvm_gmem_fops.owner);
> +
>                 return ERR_CAST(inode);
> +       }
>

Thanks, Fuad will be addressing this in a revision of [1].

>> +		return ERR_CAST(inode);
>> +
>> +	file = alloc_file_pseudo(inode, kvm_gmem_mnt, name, O_RDWR,
>> +				 &kvm_gmem_fops);
>> +	if (IS_ERR(file)) {
>> +		iput(inode);
>> +		return file;
>> +	}
>> +
>> +	file->f_mapping = inode->i_mapping;
>> +	file->f_flags |= O_LARGEFILE;
>> +	file->private_data = priv;
>> +
>> +	return file;
>> +}
>> +
>
> Thanks
> Yan

[1] https://lore.kernel.org/all/20250328153133.3504118-2-tabba@google.com/

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [RFC PATCH 13/39] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes
  2025-04-23 20:22     ` Ackerley Tng
@ 2025-04-24  3:53       ` Yan Zhao
  0 siblings, 0 replies; 130+ messages in thread
From: Yan Zhao @ 2025-04-24  3:53 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest, linux-fsdevel

On Wed, Apr 23, 2025 at 01:22:00PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > Hi Ackerley,
> >
> > Not sure if below nits have been resolved in your latest code.
> > I came across them and felt it's better to report them anyway.
> >
> > Apologies for any redundancy if you've already addressed them.
> 
> No worries, thank you so much for your reviews!
> 
> >
> > On Tue, Sep 10, 2024 at 11:43:44PM +0000, Ackerley Tng wrote:
> >> +static void kvm_gmem_init_mount(void)                                         
> >> +{                                                                             
> >> +     kvm_gmem_mnt = kern_mount(&kvm_gmem_fs);                                 
> >> +     BUG_ON(IS_ERR(kvm_gmem_mnt));                                            
> >> +                                                                              
> >> +     /* For giggles. Userspace can never map this anyways. */                 
> >> +     kvm_gmem_mnt->mnt_flags |= MNT_NOEXEC;                                   
> >> +}                                                                             
> >> +                                                                              
> >>  static struct file_operations kvm_gmem_fops = {                               
> >>       .open           = generic_file_open,                                     
> >>       .release        = kvm_gmem_release,                                      
> >> @@ -311,6 +348,8 @@ static struct file_operations kvm_gmem_fops = {            
> >>  void kvm_gmem_init(struct module *module)                                     
> >>  {                                                                             
> >>       kvm_gmem_fops.owner = module;                                            
> >> +                                                                              
> >> +     kvm_gmem_init_mount();                                                   
> >>  } 
> > When KVM is compiled as a module, looks "kern_unmount(kvm_gmem_mnt)" is
> > missing in the kvm_exit() path.
> >
> > This may lead to kernel oops when executing "sync" after KVM is unloaded or
> > reloaded.
> >
> 
> Thanks, Fuad will be addressing this in a revision of [1].
> 
> > BTW, there're lots of symbols not exported under mm.
> >
> 
> Thanks again, is there a good way to do a build test for symbols not
> being exported?  What CONFIG flags do you use?
I compiled kvm.ko and kvm-intel.ko as modules.

CONFIG_KVM=m
CONFIG_KVM_INTEL=m
CONFIG_KVM_INTEL_TDX=y

 

^ permalink raw reply	[flat|nested] 130+ messages in thread

* [RFC PATCH 14/39] KVM: guest_memfd: hugetlb: initialization and cleanup
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (12 preceding siblings ...)
  2024-09-10 23:43 ` [RFC PATCH 13/39] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes Ackerley Tng
@ 2024-09-10 23:43 ` Ackerley Tng
  2024-09-20  9:17   ` Vishal Annapurve
                     ` (2 more replies)
  2024-09-10 23:43 ` [RFC PATCH 15/39] KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb Ackerley Tng
                   ` (27 subsequent siblings)
  41 siblings, 3 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:43 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

First stage of hugetlb support: add initialization and cleanup
routines.

Now that an earlier patch has switched guest_mem from anonymous inodes
to guest_mem inodes, the .evict_inode handler can be overridden to do
hugetlb metadata cleanup.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 include/uapi/linux/kvm.h |  26 ++++++
 virt/kvm/guest_memfd.c   | 177 +++++++++++++++++++++++++++++++++++++--
 2 files changed, 197 insertions(+), 6 deletions(-)

diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 637efc055145..77de7c4432f6 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -13,6 +13,7 @@
 #include <linux/compiler.h>
 #include <linux/ioctl.h>
 #include <asm/kvm.h>
+#include <asm-generic/hugetlb_encode.h>
 
 #define KVM_API_VERSION 12
 
@@ -1558,6 +1559,31 @@ struct kvm_memory_attributes {
 
 #define KVM_CREATE_GUEST_MEMFD	_IOWR(KVMIO,  0xd4, struct kvm_create_guest_memfd)
 
+#define KVM_GUEST_MEMFD_HUGETLB (1ULL << 1)
+
+/*
+ * Huge page size encoding when KVM_GUEST_MEMFD_HUGETLB is specified, and a huge
+ * page size other than the default is desired.  See hugetlb_encode.h.  All
+ * known huge page size encodings are provided here.  It is the responsibility
+ * of the application to know which sizes are supported on the running system.
+ * See mmap(2) man page for details.
+ */
+#define KVM_GUEST_MEMFD_HUGE_SHIFT     HUGETLB_FLAG_ENCODE_SHIFT
+#define KVM_GUEST_MEMFD_HUGE_MASK      HUGETLB_FLAG_ENCODE_MASK
+
+#define KVM_GUEST_MEMFD_HUGE_64KB      HUGETLB_FLAG_ENCODE_64KB
+#define KVM_GUEST_MEMFD_HUGE_512KB     HUGETLB_FLAG_ENCODE_512KB
+#define KVM_GUEST_MEMFD_HUGE_1MB       HUGETLB_FLAG_ENCODE_1MB
+#define KVM_GUEST_MEMFD_HUGE_2MB       HUGETLB_FLAG_ENCODE_2MB
+#define KVM_GUEST_MEMFD_HUGE_8MB       HUGETLB_FLAG_ENCODE_8MB
+#define KVM_GUEST_MEMFD_HUGE_16MB      HUGETLB_FLAG_ENCODE_16MB
+#define KVM_GUEST_MEMFD_HUGE_32MB      HUGETLB_FLAG_ENCODE_32MB
+#define KVM_GUEST_MEMFD_HUGE_256MB     HUGETLB_FLAG_ENCODE_256MB
+#define KVM_GUEST_MEMFD_HUGE_512MB     HUGETLB_FLAG_ENCODE_512MB
+#define KVM_GUEST_MEMFD_HUGE_1GB       HUGETLB_FLAG_ENCODE_1GB
+#define KVM_GUEST_MEMFD_HUGE_2GB       HUGETLB_FLAG_ENCODE_2GB
+#define KVM_GUEST_MEMFD_HUGE_16GB      HUGETLB_FLAG_ENCODE_16GB
+
 struct kvm_create_guest_memfd {
 	__u64 size;
 	__u64 flags;
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 5d7fd1f708a6..31e1115273e1 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -3,6 +3,7 @@
 #include <linux/mount.h>
 #include <linux/backing-dev.h>
 #include <linux/falloc.h>
+#include <linux/hugetlb.h>
 #include <linux/kvm_host.h>
 #include <linux/pseudo_fs.h>
 #include <linux/pagemap.h>
@@ -18,6 +19,16 @@ struct kvm_gmem {
 	struct list_head entry;
 };
 
+struct kvm_gmem_hugetlb {
+	struct hstate *h;
+	struct hugepage_subpool *spool;
+};
+
+static struct kvm_gmem_hugetlb *kvm_gmem_hgmem(struct inode *inode)
+{
+	return inode->i_mapping->i_private_data;
+}
+
 /**
  * folio_file_pfn - like folio_file_page, but return a pfn.
  * @folio: The folio which contains this index.
@@ -154,6 +165,82 @@ static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
 	}
 }
 
+static inline void kvm_gmem_hugetlb_filemap_remove_folio(struct folio *folio)
+{
+	folio_lock(folio);
+
+	folio_clear_dirty(folio);
+	folio_clear_uptodate(folio);
+	filemap_remove_folio(folio);
+
+	folio_unlock(folio);
+}
+
+/**
+ * Removes folios in range [@lstart, @lend) from page cache/filemap (@mapping),
+ * returning the number of pages freed.
+ */
+static int kvm_gmem_hugetlb_filemap_remove_folios(struct address_space *mapping,
+						  struct hstate *h,
+						  loff_t lstart, loff_t lend)
+{
+	const pgoff_t end = lend >> PAGE_SHIFT;
+	pgoff_t next = lstart >> PAGE_SHIFT;
+	struct folio_batch fbatch;
+	int num_freed = 0;
+
+	folio_batch_init(&fbatch);
+	while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
+		int i;
+		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
+			struct folio *folio;
+			pgoff_t hindex;
+			u32 hash;
+
+			folio = fbatch.folios[i];
+			hindex = folio->index >> huge_page_order(h);
+			hash = hugetlb_fault_mutex_hash(mapping, hindex);
+
+			mutex_lock(&hugetlb_fault_mutex_table[hash]);
+			kvm_gmem_hugetlb_filemap_remove_folio(folio);
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+
+			num_freed++;
+		}
+		folio_batch_release(&fbatch);
+		cond_resched();
+	}
+
+	return num_freed;
+}
+
+/**
+ * Removes folios in range [@lstart, @lend) from page cache of inode, updates
+ * inode metadata and hugetlb reservations.
+ */
+static void kvm_gmem_hugetlb_truncate_folios_range(struct inode *inode,
+						   loff_t lstart, loff_t lend)
+{
+	struct kvm_gmem_hugetlb *hgmem;
+	struct hstate *h;
+	int gbl_reserve;
+	int num_freed;
+
+	hgmem = kvm_gmem_hgmem(inode);
+	h = hgmem->h;
+
+	num_freed = kvm_gmem_hugetlb_filemap_remove_folios(inode->i_mapping,
+							   h, lstart, lend);
+
+	gbl_reserve = hugepage_subpool_put_pages(hgmem->spool, num_freed);
+	hugetlb_acct_memory(h, -gbl_reserve);
+
+	spin_lock(&inode->i_lock);
+	inode->i_blocks -= blocks_per_huge_page(h) * num_freed;
+	spin_unlock(&inode->i_lock);
+}
+
+
 static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 {
 	struct list_head *gmem_list = &inode->i_mapping->i_private_list;
@@ -307,8 +394,33 @@ static inline struct file *kvm_gmem_get_file(struct kvm_memory_slot *slot)
 	return get_file_active(&slot->gmem.file);
 }
 
+static void kvm_gmem_hugetlb_teardown(struct inode *inode)
+{
+	struct kvm_gmem_hugetlb *hgmem;
+
+	truncate_inode_pages_final_prepare(inode->i_mapping);
+	kvm_gmem_hugetlb_truncate_folios_range(inode, 0, LLONG_MAX);
+
+	hgmem = kvm_gmem_hgmem(inode);
+	hugepage_put_subpool(hgmem->spool);
+	kfree(hgmem);
+}
+
+static void kvm_gmem_evict_inode(struct inode *inode)
+{
+	u64 flags = (u64)inode->i_private;
+
+	if (flags & KVM_GUEST_MEMFD_HUGETLB)
+		kvm_gmem_hugetlb_teardown(inode);
+	else
+		truncate_inode_pages_final(inode->i_mapping);
+
+	clear_inode(inode);
+}
+
 static const struct super_operations kvm_gmem_super_operations = {
 	.statfs		= simple_statfs,
+	.evict_inode	= kvm_gmem_evict_inode,
 };
 
 static int kvm_gmem_init_fs_context(struct fs_context *fc)
@@ -431,6 +543,42 @@ static const struct inode_operations kvm_gmem_iops = {
 	.setattr	= kvm_gmem_setattr,
 };
 
+static int kvm_gmem_hugetlb_setup(struct inode *inode, loff_t size, u64 flags)
+{
+	struct kvm_gmem_hugetlb *hgmem;
+	struct hugepage_subpool *spool;
+	int page_size_log;
+	struct hstate *h;
+	long hpages;
+
+	page_size_log = (flags >> KVM_GUEST_MEMFD_HUGE_SHIFT) & KVM_GUEST_MEMFD_HUGE_MASK;
+	h = hstate_sizelog(page_size_log);
+
+	/* Round up to accommodate size requests that don't align with huge pages */
+	hpages = round_up(size, huge_page_size(h)) >> huge_page_shift(h);
+
+	spool = hugepage_new_subpool(h, hpages, hpages, false);
+	if (!spool)
+		goto err;
+
+	hgmem = kzalloc(sizeof(*hgmem), GFP_KERNEL);
+	if (!hgmem)
+		goto err_subpool;
+
+	inode->i_blkbits = huge_page_shift(h);
+
+	hgmem->h = h;
+	hgmem->spool = spool;
+	inode->i_mapping->i_private_data = hgmem;
+
+	return 0;
+
+err_subpool:
+	kfree(spool);
+err:
+	return -ENOMEM;
+}
+
 static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
 						      loff_t size, u64 flags)
 {
@@ -443,9 +591,13 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
 		return inode;
 
 	err = security_inode_init_security_anon(inode, &qname, NULL);
-	if (err) {
-		iput(inode);
-		return ERR_PTR(err);
+	if (err)
+		goto out;
+
+	if (flags & KVM_GUEST_MEMFD_HUGETLB) {
+		err = kvm_gmem_hugetlb_setup(inode, size, flags);
+		if (err)
+			goto out;
 	}
 
 	inode->i_private = (void *)(unsigned long)flags;
@@ -459,6 +611,11 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
 	WARN_ON_ONCE(!mapping_unevictable(inode->i_mapping));
 
 	return inode;
+
+out:
+	iput(inode);
+
+	return ERR_PTR(err);
 }
 
 static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
@@ -526,14 +683,22 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 	return err;
 }
 
+#define KVM_GUEST_MEMFD_ALL_FLAGS KVM_GUEST_MEMFD_HUGETLB
+
 int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 {
 	loff_t size = args->size;
 	u64 flags = args->flags;
-	u64 valid_flags = 0;
 
-	if (flags & ~valid_flags)
-		return -EINVAL;
+	if (flags & KVM_GUEST_MEMFD_HUGETLB) {
+		/* Allow huge page size encoding in flags */
+		if (flags & ~(KVM_GUEST_MEMFD_ALL_FLAGS |
+			      (KVM_GUEST_MEMFD_HUGE_MASK << KVM_GUEST_MEMFD_HUGE_SHIFT)))
+			return -EINVAL;
+	} else {
+		if (flags & ~KVM_GUEST_MEMFD_ALL_FLAGS)
+			return -EINVAL;
+	}
 
 	if (size <= 0 || !PAGE_ALIGNED(size))
 		return -EINVAL;
-- 
2.46.0.598.g6f2099f65c-goog



* Re: [RFC PATCH 14/39] KVM: guest_memfd: hugetlb: initialization and cleanup
  2024-09-10 23:43 ` [RFC PATCH 14/39] KVM: guest_memfd: hugetlb: initialization and cleanup Ackerley Tng
@ 2024-09-20  9:17   ` Vishal Annapurve
  2024-10-01 23:00     ` Ackerley Tng
  2024-12-01 17:59   ` Peter Xu
  2025-03-06 17:33   ` Peter Xu
  2 siblings, 1 reply; 130+ messages in thread
From: Vishal Annapurve @ 2024-09-20  9:17 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, qperret,
	jhubbard, willy, shuah, brauner, bfoster, kent.overstreet, pvorel,
	rppt, richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

On Wed, Sep 11, 2024 at 1:44 AM Ackerley Tng <ackerleytng@google.com> wrote:
>
> ...
> +}
> +
> +static void kvm_gmem_evict_inode(struct inode *inode)
> +{
> +       u64 flags = (u64)inode->i_private;
> +
> +       if (flags & KVM_GUEST_MEMFD_HUGETLB)
> +               kvm_gmem_hugetlb_teardown(inode);
> +       else
> +               truncate_inode_pages_final(inode->i_mapping);
> +
> +       clear_inode(inode);
> +}
> +
>  static const struct super_operations kvm_gmem_super_operations = {
>         .statfs         = simple_statfs,
> +       .evict_inode    = kvm_gmem_evict_inode,

Ackerley, can we use free_inode[1] callback to free any special
metadata associated with the inode instead of relying on
super_operations?

[1] https://elixir.bootlin.com/linux/v6.11/source/include/linux/fs.h#L719

> ...


>
>         if (size <= 0 || !PAGE_ALIGNED(size))
>                 return -EINVAL;
> --
> 2.46.0.598.g6f2099f65c-goog
>


* Re: [RFC PATCH 14/39] KVM: guest_memfd: hugetlb: initialization and cleanup
  2024-09-20  9:17   ` Vishal Annapurve
@ 2024-10-01 23:00     ` Ackerley Tng
  0 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-10-01 23:00 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, qperret,
	jhubbard, willy, shuah, brauner, bfoster, kent.overstreet, pvorel,
	rppt, richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

Vishal Annapurve <vannapurve@google.com> writes:

> On Wed, Sep 11, 2024 at 1:44 AM Ackerley Tng <ackerleytng@google.com> wrote:
>>
>> ...
>> +}
>> +
>> +static void kvm_gmem_evict_inode(struct inode *inode)
>> +{
>> +       u64 flags = (u64)inode->i_private;
>> +
>> +       if (flags & KVM_GUEST_MEMFD_HUGETLB)
>> +               kvm_gmem_hugetlb_teardown(inode);
>> +       else
>> +               truncate_inode_pages_final(inode->i_mapping);
>> +
>> +       clear_inode(inode);
>> +}
>> +
>>  static const struct super_operations kvm_gmem_super_operations = {
>>         .statfs         = simple_statfs,
>> +       .evict_inode    = kvm_gmem_evict_inode,
>
> Ackerley, can we use free_inode[1] callback to free any special
> metadata associated with the inode instead of relying on
> super_operations?
>
> [1] https://elixir.bootlin.com/linux/v6.11/source/include/linux/fs.h#L719
>

.free_inode() is not a direct replacement for .evict_inode().

If the .free_inode() op is NULL, free_inode_nonrcu(inode) handles freeing the
struct inode itself. Hence, the .free_inode() op is meant for freeing the inode
struct.

.free_inode() should undo what .alloc_inode() does.

There's more information about the .free_inode() op at
https://docs.kernel.org/filesystems/porting.html, specifically:

| Rules for inode destruction:
|
| + if ->destroy_inode() is non-NULL, it gets called
| + if ->free_inode() is non-NULL, it gets scheduled by call_rcu()
| + combination of NULL ->destroy_inode and NULL ->free_inode is treated as
|   NULL/free_inode_nonrcu, to preserve the compatibility.

The common setup is to have a larger struct that embeds a struct inode, and
the .free_inode() op then frees the larger struct. In our case, we're not using
a containing struct for the metadata, so .free_inode() isn't the appropriate
op.
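
To illustrate the containing-struct convention described above, here is a
minimal userspace sketch. It is only a model, not kernel code: struct
demo_inode stands in for struct inode, and struct demo_inode_info and the
demo_* helpers are hypothetical names made up for this example (guest_memfd
defines nothing like them).

```c
#include <stddef.h>
#include <stdlib.h>

/* Userspace stand-in for the kernel's struct inode (illustrative only). */
struct demo_inode {
	unsigned long i_ino;
};

/* Hypothetical containing struct holding filesystem-private metadata. */
struct demo_inode_info {
	int metadata;
	struct demo_inode vfs_inode;	/* embedded, not a pointer */
};

/* Recover the container from a pointer to the embedded inode. */
static struct demo_inode_info *demo_info(struct demo_inode *inode)
{
	return (struct demo_inode_info *)
		((char *)inode - offsetof(struct demo_inode_info, vfs_inode));
}

/* Models .alloc_inode(): allocate the container, return the inner inode. */
static struct demo_inode *demo_alloc_inode(void)
{
	struct demo_inode_info *info = calloc(1, sizeof(*info));

	return info ? &info->vfs_inode : NULL;
}

/* Models .free_inode(): undo demo_alloc_inode() by freeing the container. */
static void demo_free_inode(struct demo_inode *inode)
{
	free(demo_info(inode));
}
```

In this model the metadata lives and dies with the allocation that holds the
inode, which is why the convention pairs a custom .alloc_inode() with
.free_inode().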

I think this question might be related to Sean's question at LPC about whether
it is necessary for guest_memfd to have its own mount, as opposed to using the
anon_inode_mnt.

I believe having its own mount is the correct approach. My reasoning is as
follows:

1. We want to clean up this inode metadata when the last reference to the inode
   is dropped
2. That means using some op on the iput_final() path.
3. All the ops on the iput_final() path are in struct super_operations, which is
   part of struct super_block
4. struct super_block should be used together with a mount

Hence, I think it is correct to have a guest_memfd mount. I guess it might be
possible to have a static super_block without a mount, but that seems hacky and
brittle, and I'm not aware of any precedent for a static super_block.

Sean, what are your concerns with having a guest_memfd mount?

Comparing the callbacks along the iput_final() path, we have these:

+ .drop_inode() determines whether to evict the inode, so that's not the
  appropriate op.
+ .evict_inode() is the current proposal, which is a place where the inode's
  fields are cleaned up. HugeTLB uses this to clean up resv_map, which it also
  stores in inode->i_mapping->i_private_data.
+ .destroy_inode() should clean up inode allocation if inode allocation involves
  a containing struct (like shmem_inode_info). Shmem uses this to clean up a
  struct shared_policy, which we will eventually need to store as well.
+ .free_inode() is the rcu-delayed part that completes inode cleanup.

Using .free_inode() implies using a containing struct to follow the
convention. Between putting metadata in a containing struct and using
inode->i_mapping->i_private_data, I think using inode->i_mapping->i_private_data
is less complex since it avoids needing a custom .alloc_inode() op.

Other than using inode->i_mapping->i_private_data, there's the option of
combining the metadata with guest_memfd flags, and storing everything in
inode->i_private.

Because inode->i_mapping actually points to inode->i_data and i_data is
a part of the inode (not a pointer), .evict_inode() is still the op to
use to clean both inode->i_mapping->i_private_data and inode->i_private.

I think we should stick with storing metadata (faultability xarray and hugetlb
pool reference) in inode->i_mapping->i_private_data because both of these are
properties of the page cache/filemap.

When we need to store a memory policy, we might want to use .destroy_inode() to
align with shmem.

What do you all think?

And there's no way to set inode->free_inode directly and skip copying
from inode->i_sb->s_op. All the code paths going to i_callback() copy
inode->i_sb->s_op->free_inode to inode->free_inode before calling
.free_inode() in i_callback() to complete the inode cleanup.
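
As a rough illustration of the destruction rules quoted above, here is a
userspace model of the dispatch as I understand it. It is a sketch under
assumptions: all demo_* names are hypothetical, the counters exist only so the
behaviour can be observed, and the RCU deferral is replaced by a direct call.

```c
#include <stddef.h>

struct demo_inode;

/* Models the two relevant members of struct super_operations. */
struct demo_super_operations {
	void (*destroy_inode)(struct demo_inode *inode);
	void (*free_inode)(struct demo_inode *inode);
};

struct demo_inode {
	const struct demo_super_operations *s_op; /* stands in for i_sb->s_op */
	void (*free_inode)(struct demo_inode *inode);
};

/* Counters recording which path ran (for illustration only). */
enum { DEMO_DESTROY, DEMO_FREE, DEMO_NONRCU, DEMO_NR };
static int demo_calls[DEMO_NR];

static void demo_count_destroy(struct demo_inode *inode)
{
	(void)inode;
	demo_calls[DEMO_DESTROY]++;
}

static void demo_count_free(struct demo_inode *inode)
{
	(void)inode;
	demo_calls[DEMO_FREE]++;
}

/* Default used when neither op is set; models free_inode_nonrcu(). */
static void demo_free_inode_nonrcu(struct demo_inode *inode)
{
	(void)inode;
	demo_calls[DEMO_NONRCU]++;
}

/* Models i_callback(): run the copied op, or fall back to the default. */
static void demo_i_callback(struct demo_inode *inode)
{
	if (inode->free_inode)
		inode->free_inode(inode);
	else
		demo_free_inode_nonrcu(inode);
}

/* Models the dispatch; the kernel defers the callback via call_rcu(). */
static void demo_destroy_inode(struct demo_inode *inode)
{
	const struct demo_super_operations *ops = inode->s_op;

	if (ops->destroy_inode) {
		ops->destroy_inode(inode);
		if (!ops->free_inode)
			return;
	}
	/* The op is always copied from the superblock ops at this point. */
	inode->free_inode = ops->free_inode;
	demo_i_callback(inode);
}
```

In this model there is no path that reaches demo_i_callback() without first
overwriting inode->free_inode from the superblock ops, which is the point made
above.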


* Re: [RFC PATCH 14/39] KVM: guest_memfd: hugetlb: initialization and cleanup
  2024-09-10 23:43 ` [RFC PATCH 14/39] KVM: guest_memfd: hugetlb: initialization and cleanup Ackerley Tng
  2024-09-20  9:17   ` Vishal Annapurve
@ 2024-12-01 17:59   ` Peter Xu
  2025-02-13  9:47     ` Ackerley Tng
  2025-03-06 17:33   ` Peter Xu
  2 siblings, 1 reply; 130+ messages in thread
From: Peter Xu @ 2024-12-01 17:59 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: tabba, quic_eberman, roypat, jgg, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest, linux-fsdevel

On Tue, Sep 10, 2024 at 11:43:45PM +0000, Ackerley Tng wrote:
> +/**
> + * Removes folios in range [@lstart, @lend) from page cache of inode, updates
> + * inode metadata and hugetlb reservations.
> + */
> +static void kvm_gmem_hugetlb_truncate_folios_range(struct inode *inode,
> +						   loff_t lstart, loff_t lend)
> +{
> +	struct kvm_gmem_hugetlb *hgmem;
> +	struct hstate *h;
> +	int gbl_reserve;
> +	int num_freed;
> +
> +	hgmem = kvm_gmem_hgmem(inode);
> +	h = hgmem->h;
> +
> +	num_freed = kvm_gmem_hugetlb_filemap_remove_folios(inode->i_mapping,
> +							   h, lstart, lend);
> +
> +	gbl_reserve = hugepage_subpool_put_pages(hgmem->spool, num_freed);
> +	hugetlb_acct_memory(h, -gbl_reserve);

I wonder whether this is needed, and whether hugetlb_acct_memory() needs to
be exported in the other patch.

IIUC subpools manage the global reservation on their own when min_pages is
set (which should be gmem's case, where both max and min are set to the gmem
size).  That's in hugepage_put_subpool() -> unlock_or_release_subpool().
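
To make the accounting concrete, here is a simplified userspace model of the
subpool get/put arithmetic as I read it. It is a sketch, not the kernel code:
struct demo_subpool and the demo_* functions are hypothetical names, locking
is omitted, and only the fields needed for the arithmetic are kept.

```c
#include <stddef.h>

/* Simplified model of a hugetlb subpool (illustrative only). */
struct demo_subpool {
	long max_hpages;	/* cap on pages in use, -1 for no cap */
	long min_hpages;	/* pages reserved up front, -1 for none */
	long used_hpages;	/* pages currently handed out */
	long rsv_hpages;	/* reserved pages not yet handed out */
};

/*
 * Take @delta pages. Returns how many extra pages the caller must reserve
 * globally, or -1 if the request would exceed max_hpages.
 */
static long demo_subpool_get_pages(struct demo_subpool *sp, long delta)
{
	long ret = delta;

	if (sp->max_hpages != -1) {
		if (sp->used_hpages + delta > sp->max_hpages)
			return -1;
		sp->used_hpages += delta;
	}

	if (sp->min_hpages != -1 && sp->rsv_hpages) {
		if (delta > sp->rsv_hpages) {
			ret = delta - sp->rsv_hpages;
			sp->rsv_hpages = 0;
		} else {
			ret = 0;	/* covered by the subpool's reserve */
			sp->rsv_hpages -= delta;
		}
	}
	return ret;
}

/*
 * Give back @delta pages. Returns how many pages of global reservation the
 * caller should drop.
 */
static long demo_subpool_put_pages(struct demo_subpool *sp, long delta)
{
	long ret = delta;

	if (sp->max_hpages != -1)
		sp->used_hpages -= delta;

	if (sp->min_hpages != -1 && sp->used_hpages < sp->min_hpages) {
		if (sp->rsv_hpages + delta <= sp->min_hpages)
			ret = 0;	/* pages go back into the reserve */
		else
			ret = sp->rsv_hpages + delta - sp->min_hpages;

		sp->rsv_hpages += delta;
		if (sp->rsv_hpages > sp->min_hpages)
			sp->rsv_hpages = sp->min_hpages;
	}
	return ret;
}
```

With max == min in this model, every put returns 0, so the following
hugetlb_acct_memory(h, -gbl_reserve) would be passed 0 each time.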

> +
> +	spin_lock(&inode->i_lock);
> +	inode->i_blocks -= blocks_per_huge_page(h) * num_freed;
> +	spin_unlock(&inode->i_lock);
> +}

-- 
Peter Xu



* Re: [RFC PATCH 14/39] KVM: guest_memfd: hugetlb: initialization and cleanup
  2024-12-01 17:59   ` Peter Xu
@ 2025-02-13  9:47     ` Ackerley Tng
  2025-02-26 18:55       ` Ackerley Tng
  0 siblings, 1 reply; 130+ messages in thread
From: Ackerley Tng @ 2025-02-13  9:47 UTC (permalink / raw)
  To: Peter Xu
  Cc: tabba, quic_eberman, roypat, jgg, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest, linux-fsdevel

Peter Xu <peterx@redhat.com> writes:

> On Tue, Sep 10, 2024 at 11:43:45PM +0000, Ackerley Tng wrote:
>> +/**
>> + * Removes folios in range [@lstart, @lend) from page cache of inode, updates
>> + * inode metadata and hugetlb reservations.
>> + */
>> +static void kvm_gmem_hugetlb_truncate_folios_range(struct inode *inode,
>> +						   loff_t lstart, loff_t lend)
>> +{
>> +	struct kvm_gmem_hugetlb *hgmem;
>> +	struct hstate *h;
>> +	int gbl_reserve;
>> +	int num_freed;
>> +
>> +	hgmem = kvm_gmem_hgmem(inode);
>> +	h = hgmem->h;
>> +
>> +	num_freed = kvm_gmem_hugetlb_filemap_remove_folios(inode->i_mapping,
>> +							   h, lstart, lend);
>> +
>> +	gbl_reserve = hugepage_subpool_put_pages(hgmem->spool, num_freed);
>> +	hugetlb_acct_memory(h, -gbl_reserve);
>
> I wonder whether this is needed, and whether hugetlb_acct_memory() needs to
> be exported in the other patch.
>
> IIUC subpools manage the global reservation on their own when min_pages is
> set (which should be gmem's case, where both max and min are set to the gmem
> size).  That's in hugepage_put_subpool() -> unlock_or_release_subpool().
>

Thank you for pointing this out! You are right and I will remove
hugetlb_acct_memory() from here.

>> +
>> +	spin_lock(&inode->i_lock);
>> +	inode->i_blocks -= blocks_per_huge_page(h) * num_freed;
>> +	spin_unlock(&inode->i_lock);
>> +}


* Re: [RFC PATCH 14/39] KVM: guest_memfd: hugetlb: initialization and cleanup
  2025-02-13  9:47     ` Ackerley Tng
@ 2025-02-26 18:55       ` Ackerley Tng
  0 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2025-02-26 18:55 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: peterx, tabba, quic_eberman, roypat, jgg, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest, linux-fsdevel

Ackerley Tng <ackerleytng@google.com> writes:

> Peter Xu <peterx@redhat.com> writes:
>
>> On Tue, Sep 10, 2024 at 11:43:45PM +0000, Ackerley Tng wrote:
>>> +/**
>>> + * Removes folios in range [@lstart, @lend) from page cache of inode, updates
>>> + * inode metadata and hugetlb reservations.
>>> + */
>>> +static void kvm_gmem_hugetlb_truncate_folios_range(struct inode *inode,
>>> +						   loff_t lstart, loff_t lend)
>>> +{
>>> +	struct kvm_gmem_hugetlb *hgmem;
>>> +	struct hstate *h;
>>> +	int gbl_reserve;
>>> +	int num_freed;
>>> +
>>> +	hgmem = kvm_gmem_hgmem(inode);
>>> +	h = hgmem->h;
>>> +
>>> +	num_freed = kvm_gmem_hugetlb_filemap_remove_folios(inode->i_mapping,
>>> +							   h, lstart, lend);
>>> +
>>> +	gbl_reserve = hugepage_subpool_put_pages(hgmem->spool, num_freed);
>>> +	hugetlb_acct_memory(h, -gbl_reserve);
>>
>> I wonder whether this is needed, and whether hugetlb_acct_memory() needs to
>> be exported in the other patch.
>>
>> IIUC subpools manage the global reservation on their own when min_pages is
>> set (which should be gmem's case, where both max and min are set to the gmem
>> size).  That's in hugepage_put_subpool() -> unlock_or_release_subpool().
>>
>
> Thank you for pointing this out! You are right and I will remove
> hugetlb_acct_memory() from here.
>

I looked further at the folio cleanup process in free_huge_folio() and I
realized I should be returning the pages to the subpool via
free_huge_folio(). There should be no call to
hugepage_subpool_put_pages() directly from this truncate function.

To use free_huge_folio() to return the pages to the subpool, I will
clear the restore_reserve flag once guest_memfd allocates a folio. All
the guest_memfd hugetlb folios will always have the restore_reserve flag
cleared.

With the restore_reserve flag cleared, free_huge_folio() will do
hugepage_subpool_put_pages(), and then restore the reservation in hstate
as well.

Returning the folio to the subpool on freeing is important and correct,
since if/when the folio_put() callback is used, the filemap may not hold
the last refcount on the folio, so truncation may not be the point at
which the folio should be returned to the subpool.

>>> +
>>> +	spin_lock(&inode->i_lock);
>>> +	inode->i_blocks -= blocks_per_huge_page(h) * num_freed;
>>> +	spin_unlock(&inode->i_lock);
>>> +}


* Re: [RFC PATCH 14/39] KVM: guest_memfd: hugetlb: initialization and cleanup
  2024-09-10 23:43 ` [RFC PATCH 14/39] KVM: guest_memfd: hugetlb: initialization and cleanup Ackerley Tng
  2024-09-20  9:17   ` Vishal Annapurve
  2024-12-01 17:59   ` Peter Xu
@ 2025-03-06 17:33   ` Peter Xu
  2 siblings, 0 replies; 130+ messages in thread
From: Peter Xu @ 2025-03-06 17:33 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: tabba, quic_eberman, roypat, jgg, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest, linux-fsdevel

On Tue, Sep 10, 2024 at 11:43:45PM +0000, Ackerley Tng wrote:
> +static int kvm_gmem_hugetlb_filemap_remove_folios(struct address_space *mapping,
> +						  struct hstate *h,
> +						  loff_t lstart, loff_t lend)
> +{
> +	const pgoff_t end = lend >> PAGE_SHIFT;
> +	pgoff_t next = lstart >> PAGE_SHIFT;
> +	struct folio_batch fbatch;
> +	int num_freed = 0;
> +
> +	folio_batch_init(&fbatch);
> +	while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
> +		int i;
> +		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
> +			struct folio *folio;
> +			pgoff_t hindex;
> +			u32 hash;
> +
> +			folio = fbatch.folios[i];
> +			hindex = folio->index >> huge_page_order(h);
> +			hash = hugetlb_fault_mutex_hash(mapping, hindex);
> +
> +			mutex_lock(&hugetlb_fault_mutex_table[hash]);

I'm debugging an issue and this caught my attention.  IIUC we need to unmap
one last time here while holding the fault mutex, right?  Something like:

        unmap_mapping_range(mapping, lstart, lend, 0);

Otherwise I don't know what protects against a concurrent fault happening
while the folio is being removed from the page cache.  See
remove_inode_single_folio() for hugetlbfs.  For generic folios, unmapping
normally needs the folio lock, iiuc, but here the mutex should be fine.

So far, even with the line added, my issue still hasn't gone away.  However,
I figured I should raise this here anyway, at least as a pure question.

> +			kvm_gmem_hugetlb_filemap_remove_folio(folio);
> +			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> +
> +			num_freed++;
> +		}
> +		folio_batch_release(&fbatch);
> +		cond_resched();
> +	}
> +
> +	return num_freed;
> +}

-- 
Peter Xu



* [RFC PATCH 15/39] KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (13 preceding siblings ...)
  2024-09-10 23:43 ` [RFC PATCH 14/39] KVM: guest_memfd: hugetlb: initialization and cleanup Ackerley Tng
@ 2024-09-10 23:43 ` Ackerley Tng
  2024-09-13 22:26   ` Elliot Berman
                     ` (2 more replies)
  2024-09-10 23:43 ` [RFC PATCH 16/39] KVM: guest_memfd: Add page alignment check for hugetlb guest_memfd Ackerley Tng
                   ` (26 subsequent siblings)
  41 siblings, 3 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:43 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

If HugeTLB is requested at guest_memfd creation time, HugeTLB pages
will be used to back guest_memfd.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 virt/kvm/guest_memfd.c | 252 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 239 insertions(+), 13 deletions(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 31e1115273e1..2e6f12e2bac8 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -8,6 +8,8 @@
 #include <linux/pseudo_fs.h>
 #include <linux/pagemap.h>
 #include <linux/anon_inodes.h>
+#include <linux/memcontrol.h>
+#include <linux/mempolicy.h>
 
 #include "kvm_mm.h"
 
@@ -29,6 +31,13 @@ static struct kvm_gmem_hugetlb *kvm_gmem_hgmem(struct inode *inode)
 	return inode->i_mapping->i_private_data;
 }
 
+static bool is_kvm_gmem_hugetlb(struct inode *inode)
+{
+	u64 flags = (u64)inode->i_private;
+
+	return flags & KVM_GUEST_MEMFD_HUGETLB;
+}
+
 /**
  * folio_file_pfn - like folio_file_page, but return a pfn.
  * @folio: The folio which contains this index.
@@ -58,6 +67,9 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo
 	return 0;
 }
 
+/**
+ * Use the uptodate flag to indicate that the folio is prepared for KVM's usage.
+ */
 static inline void kvm_gmem_mark_prepared(struct folio *folio)
 {
 	folio_mark_uptodate(folio);
@@ -72,13 +84,18 @@ static inline void kvm_gmem_mark_prepared(struct folio *folio)
 static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
 				  gfn_t gfn, struct folio *folio)
 {
-	unsigned long nr_pages, i;
 	pgoff_t index;
 	int r;
 
-	nr_pages = folio_nr_pages(folio);
-	for (i = 0; i < nr_pages; i++)
-		clear_highpage(folio_page(folio, i));
+	if (folio_test_hugetlb(folio)) {
+		folio_zero_user(folio, folio->index << PAGE_SHIFT);
+	} else {
+		unsigned long nr_pages, i;
+
+		nr_pages = folio_nr_pages(folio);
+		for (i = 0; i < nr_pages; i++)
+			clear_highpage(folio_page(folio, i));
+	}
 
 	/*
 	 * Preparing huge folios should always be safe, since it should
@@ -103,6 +120,174 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
 	return r;
 }
 
+static int kvm_gmem_get_mpol_node_nodemask(gfp_t gfp_mask,
+					   struct mempolicy **mpol,
+					   nodemask_t **nodemask)
+{
+	/*
+	 * TODO: mempolicy would probably have to be stored on the inode, use
+	 * task policy for now.
+	 */
+	*mpol = get_task_policy(current);
+
+	/* TODO: ignore interleaving (set ilx to 0) for now. */
+	return policy_node_nodemask(*mpol, gfp_mask, 0, nodemask);
+}
+
+static struct folio *kvm_gmem_hugetlb_alloc_folio(struct hstate *h,
+						  struct hugepage_subpool *spool)
+{
+	bool memcg_charge_was_prepared;
+	struct mem_cgroup *memcg;
+	struct mempolicy *mpol;
+	nodemask_t *nodemask;
+	struct folio *folio;
+	gfp_t gfp_mask;
+	int ret;
+	int nid;
+
+	gfp_mask = htlb_alloc_mask(h);
+
+	memcg = get_mem_cgroup_from_current();
+	ret = mem_cgroup_hugetlb_try_charge(memcg,
+					    gfp_mask | __GFP_RETRY_MAYFAIL,
+					    pages_per_huge_page(h));
+	if (ret == -ENOMEM)
+		goto err;
+
+	memcg_charge_was_prepared = ret != -EOPNOTSUPP;
+
+	/* Pages are only to be taken from guest_memfd subpool and nowhere else. */
+	if (hugepage_subpool_get_pages(spool, 1))
+		goto err_cancel_charge;
+
+	nid = kvm_gmem_get_mpol_node_nodemask(htlb_alloc_mask(h), &mpol,
+					      &nodemask);
+	/*
+	 * charge_cgroup_reservation is false because we didn't make any cgroup
+	 * reservations when creating the guest_memfd subpool.
+	 *
+	 * use_hstate_resv is true because we reserved from global hstate when
+	 * creating the guest_memfd subpool.
+	 */
+	folio = hugetlb_alloc_folio(h, mpol, nid, nodemask, false, true);
+	mpol_cond_put(mpol);
+
+	if (!folio)
+		goto err_put_pages;
+
+	hugetlb_set_folio_subpool(folio, spool);
+
+	if (memcg_charge_was_prepared)
+		mem_cgroup_commit_charge(folio, memcg);
+
+out:
+	mem_cgroup_put(memcg);
+
+	return folio;
+
+err_put_pages:
+	hugepage_subpool_put_pages(spool, 1);
+
+err_cancel_charge:
+	if (memcg_charge_was_prepared)
+		mem_cgroup_cancel_charge(memcg, pages_per_huge_page(h));
+
+err:
+	folio = ERR_PTR(-ENOMEM);
+	goto out;
+}
+
+static int kvm_gmem_hugetlb_filemap_add_folio(struct address_space *mapping,
+					      struct folio *folio, pgoff_t index,
+					      gfp_t gfp)
+{
+	int ret;
+
+	__folio_set_locked(folio);
+	ret = __filemap_add_folio(mapping, folio, index, gfp, NULL);
+	if (unlikely(ret)) {
+		__folio_clear_locked(folio);
+		return ret;
+	}
+
+	/*
+	 * In hugetlb_add_to_page_cache(), there is a call to
+	 * folio_clear_hugetlb_restore_reserve(). This is handled when the pages
+	 * are removed from the page cache in unmap_hugepage_range() ->
+	 * __unmap_hugepage_range() by conditionally calling
+	 * folio_set_hugetlb_restore_reserve(). In kvm_gmem_hugetlb's usage of
+	 * hugetlb, there are no VMAs involved, and pages are never taken from
+	 * the surplus, so when pages are freed, the hstate reserve must be
+	 * restored. Hence, this function makes no call to
+	 * folio_clear_hugetlb_restore_reserve().
+	 */
+
+	/* mark folio dirty so that it will not be removed from cache/inode */
+	folio_mark_dirty(folio);
+
+	return 0;
+}
+
+static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
+							    pgoff_t index)
+{
+	struct kvm_gmem_hugetlb *hgmem;
+	struct folio *folio;
+	int ret;
+
+	hgmem = kvm_gmem_hgmem(inode);
+	folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
+	if (IS_ERR(folio))
+		return folio;
+
+	/* TODO: Fix index here to be aligned to huge page size. */
+	ret = kvm_gmem_hugetlb_filemap_add_folio(
+		inode->i_mapping, folio, index, htlb_alloc_mask(hgmem->h));
+	if (ret) {
+		folio_put(folio);
+		return ERR_PTR(ret);
+	}
+
+	spin_lock(&inode->i_lock);
+	inode->i_blocks += blocks_per_huge_page(hgmem->h);
+	spin_unlock(&inode->i_lock);
+
+	return folio;
+}
+
+static struct folio *kvm_gmem_get_hugetlb_folio(struct inode *inode,
+						pgoff_t index)
+{
+	struct address_space *mapping;
+	struct folio *folio;
+	struct hstate *h;
+	pgoff_t hindex;
+	u32 hash;
+
+	h = kvm_gmem_hgmem(inode)->h;
+	hindex = index >> huge_page_order(h);
+	mapping = inode->i_mapping;
+
+	/* To lock, we calculate the hash using the hindex and not index. */
+	hash = hugetlb_fault_mutex_hash(mapping, hindex);
+	mutex_lock(&hugetlb_fault_mutex_table[hash]);
+
+	/*
+	 * The filemap is indexed with index and not hindex. Taking lock on
+	 * folio to align with kvm_gmem_get_regular_folio()
+	 */
+	folio = filemap_lock_folio(mapping, index);
+	if (!IS_ERR(folio))
+		goto out;
+
+	folio = kvm_gmem_hugetlb_alloc_and_cache_folio(inode, index);
+out:
+	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+
+	return folio;
+}
+
 /*
  * Returns a locked folio on success.  The caller is responsible for
  * setting the up-to-date flag before the memory is mapped into the guest.
@@ -114,8 +299,10 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
  */
 static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
 {
-	/* TODO: Support huge pages. */
-	return filemap_grab_folio(inode->i_mapping, index);
+	if (is_kvm_gmem_hugetlb(inode))
+		return kvm_gmem_get_hugetlb_folio(inode, index);
+	else
+		return filemap_grab_folio(inode->i_mapping, index);
 }
 
 static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
@@ -240,6 +427,35 @@ static void kvm_gmem_hugetlb_truncate_folios_range(struct inode *inode,
 	spin_unlock(&inode->i_lock);
 }
 
+static void kvm_gmem_hugetlb_truncate_range(struct inode *inode, loff_t lstart,
+					    loff_t lend)
+{
+	loff_t full_hpage_start;
+	loff_t full_hpage_end;
+	unsigned long hsize;
+	struct hstate *h;
+
+	h = kvm_gmem_hgmem(inode)->h;
+	hsize = huge_page_size(h);
+
+	full_hpage_start = round_up(lstart, hsize);
+	full_hpage_end = round_down(lend, hsize);
+
+	if (lstart < full_hpage_start) {
+		hugetlb_zero_partial_page(h, inode->i_mapping, lstart,
+					  full_hpage_start);
+	}
+
+	if (full_hpage_end > full_hpage_start) {
+		kvm_gmem_hugetlb_truncate_folios_range(inode, full_hpage_start,
+						       full_hpage_end);
+	}
+
+	if (lend > full_hpage_end) {
+		hugetlb_zero_partial_page(h, inode->i_mapping, full_hpage_end,
+					  lend);
+	}
+}
 
 static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 {
@@ -257,7 +473,12 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	list_for_each_entry(gmem, gmem_list, entry)
 		kvm_gmem_invalidate_begin(gmem, start, end);
 
-	truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);
+	if (is_kvm_gmem_hugetlb(inode)) {
+		kvm_gmem_hugetlb_truncate_range(inode, offset, offset + len);
+	} else {
+		truncate_inode_pages_range(inode->i_mapping, offset,
+					   offset + len - 1);
+	}
 
 	list_for_each_entry(gmem, gmem_list, entry)
 		kvm_gmem_invalidate_end(gmem, start, end);
@@ -279,8 +500,15 @@ static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
 
 	filemap_invalidate_lock_shared(mapping);
 
-	start = offset >> PAGE_SHIFT;
-	end = (offset + len) >> PAGE_SHIFT;
+	if (is_kvm_gmem_hugetlb(inode)) {
+		unsigned long hsize = huge_page_size(kvm_gmem_hgmem(inode)->h);
+
+		start = round_down(offset, hsize) >> PAGE_SHIFT;
+		end = round_down(offset + len, hsize) >> PAGE_SHIFT;
+	} else {
+		start = offset >> PAGE_SHIFT;
+		end = (offset + len) >> PAGE_SHIFT;
+	}
 
 	r = 0;
 	for (index = start; index < end; ) {
@@ -408,9 +636,7 @@ static void kvm_gmem_hugetlb_teardown(struct inode *inode)
 
 static void kvm_gmem_evict_inode(struct inode *inode)
 {
-	u64 flags = (u64)inode->i_private;
-
-	if (flags & KVM_GUEST_MEMFD_HUGETLB)
+	if (is_kvm_gmem_hugetlb(inode))
 		kvm_gmem_hugetlb_teardown(inode);
 	else
 		truncate_inode_pages_final(inode->i_mapping);
@@ -827,7 +1053,7 @@ __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
 
 	*pfn = folio_file_pfn(folio, index);
 	if (max_order)
-		*max_order = 0;
+		*max_order = folio_order(folio);
 
 	*is_prepared = folio_test_uptodate(folio);
 	return folio;
-- 
2.46.0.598.g6f2099f65c-goog



* Re: [RFC PATCH 15/39] KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb
  2024-09-10 23:43 ` [RFC PATCH 15/39] KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb Ackerley Tng
@ 2024-09-13 22:26   ` Elliot Berman
  2024-10-03 20:23     ` Ackerley Tng
  2024-10-30  9:01   ` Jun Miao
  2024-12-01 17:55   ` Peter Xu
  2 siblings, 1 reply; 130+ messages in thread
From: Elliot Berman @ 2024-09-13 22:26 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: tabba, roypat, jgg, peterx, david, rientjes, fvdl, jthoughton,
	seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao, isaku.yamahata,
	muchun.song, mike.kravetz, erdemaktas, vannapurve, qperret,
	jhubbard, willy, shuah, brauner, bfoster, kent.overstreet, pvorel,
	rppt, richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

On Tue, Sep 10, 2024 at 11:43:46PM +0000, Ackerley Tng wrote:
> If HugeTLB is requested at guest_memfd creation time, HugeTLB pages
> will be used to back guest_memfd.
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>  virt/kvm/guest_memfd.c | 252 ++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 239 insertions(+), 13 deletions(-)
> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 31e1115273e1..2e6f12e2bac8 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -8,6 +8,8 @@
>  #include <linux/pseudo_fs.h>
>  #include <linux/pagemap.h>
>  #include <linux/anon_inodes.h>
> +#include <linux/memcontrol.h>
> +#include <linux/mempolicy.h>
>  
>  #include "kvm_mm.h"
>  
> @@ -29,6 +31,13 @@ static struct kvm_gmem_hugetlb *kvm_gmem_hgmem(struct inode *inode)
>  	return inode->i_mapping->i_private_data;
>  }
>  
> +static bool is_kvm_gmem_hugetlb(struct inode *inode)
> +{
> +	u64 flags = (u64)inode->i_private;
> +
> +	return flags & KVM_GUEST_MEMFD_HUGETLB;
> +}
> +
>  /**
>   * folio_file_pfn - like folio_file_page, but return a pfn.
>   * @folio: The folio which contains this index.
> @@ -58,6 +67,9 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo
>  	return 0;
>  }
>  
> +/**
> + * Use the uptodate flag to indicate that the folio is prepared for KVM's usage.
> + */
>  static inline void kvm_gmem_mark_prepared(struct folio *folio)
>  {
>  	folio_mark_uptodate(folio);
> @@ -72,13 +84,18 @@ static inline void kvm_gmem_mark_prepared(struct folio *folio)
>  static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
>  				  gfn_t gfn, struct folio *folio)
>  {
> -	unsigned long nr_pages, i;
>  	pgoff_t index;
>  	int r;
>  
> -	nr_pages = folio_nr_pages(folio);
> -	for (i = 0; i < nr_pages; i++)
> -		clear_highpage(folio_page(folio, i));
> +	if (folio_test_hugetlb(folio)) {
> +		folio_zero_user(folio, folio->index << PAGE_SHIFT);

Is (folio->index << PAGE_SHIFT) the right address hint to provide?
I don't think we can say the folio will be mapped at this address since
this value is an offset into the file.  In most cases, I believe it
won't be mapped anywhere since we just allocated it.

Thanks,
Elliot


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [RFC PATCH 15/39] KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb
  2024-09-13 22:26   ` Elliot Berman
@ 2024-10-03 20:23     ` Ackerley Tng
  0 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-10-03 20:23 UTC (permalink / raw)
  To: Elliot Berman
  Cc: tabba, roypat, jgg, peterx, david, rientjes, fvdl, jthoughton,
	seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao, isaku.yamahata,
	muchun.song, mike.kravetz, erdemaktas, vannapurve, qperret,
	jhubbard, willy, shuah, brauner, bfoster, kent.overstreet, pvorel,
	rppt, richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

Elliot Berman <quic_eberman@quicinc.com> writes:

> On Tue, Sep 10, 2024 at 11:43:46PM +0000, Ackerley Tng wrote:
>> If HugeTLB is requested at guest_memfd creation time, HugeTLB pages
>> will be used to back guest_memfd.
>> 
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>>
>> <snip>
>>
>> +/**
>> + * Use the uptodate flag to indicate that the folio is prepared for KVM's usage.
>> + */
>>  static inline void kvm_gmem_mark_prepared(struct folio *folio)
>>  {
>>  	folio_mark_uptodate(folio);
>> @@ -72,13 +84,18 @@ static inline void kvm_gmem_mark_prepared(struct folio *folio)
>>  static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
>>  				  gfn_t gfn, struct folio *folio)
>>  {
>> -	unsigned long nr_pages, i;
>>  	pgoff_t index;
>>  	int r;
>>  
>> -	nr_pages = folio_nr_pages(folio);
>> -	for (i = 0; i < nr_pages; i++)
>> -		clear_highpage(folio_page(folio, i));
>> +	if (folio_test_hugetlb(folio)) {
>> +		folio_zero_user(folio, folio->index << PAGE_SHIFT);
>
> Is (folio->index << PAGE_SHIFT) the right address hint to provide?
> I don't think we can say the folio will be mapped at this address since
> this value is an offset into the file.  In most cases, I believe it
> won't be mapped anywhere since we just allocated it.

vaddr in folio_zero_user(folio, vaddr) is eventually passed to
clear_user_page(). clear_user_page() uses vaddr to clean up dcaches on
some architectures, according to Documentation/core-api/cachetlb.rst.

In this patch series, folio_zero_user() is used in 2 places:

+ kvm_gmem_prepare_folio()
+ kvm_gmem_fault()

folio->index is valid by the time folio_zero_user() is called in
kvm_gmem_prepare_folio(), because when kvm_gmem_prepare_folio() is called, the
folio is already in the filemap, and folio->index is set when the folio is
added to the filemap.

In kvm_gmem_fault(), kvm_gmem_get_folio() also returns a folio in the filemap
and so folio->index is valid by the time folio_zero_user() is called.

Hence in both cases where folio_zero_user() is called, folio->index <<
PAGE_SHIFT returns the offset in the file.
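
To make the arithmetic concrete, here is a minimal userspace sketch (not kernel
code) of that conversion. SKETCH_PAGE_SHIFT and both helper names are made up
for illustration, and 4 KiB base pages are assumed:

```c
#include <assert.h>
#include <stdint.h>

/* Assumption for this sketch: 4 KiB base pages, i.e. PAGE_SHIFT == 12. */
#define SKETCH_PAGE_SHIFT 12

/* Byte offset into the file of the page at page cache index `index`. */
static uint64_t sketch_index_to_offset(uint64_t index)
{
	/* Guard against shifting bits off the top of the 64-bit offset. */
	assert(index <= (UINT64_MAX >> SKETCH_PAGE_SHIFT));
	return index << SKETCH_PAGE_SHIFT;
}

/* Page cache index of the page that contains byte offset `offset`. */
static uint64_t sketch_offset_to_index(uint64_t offset)
{
	return offset >> SKETCH_PAGE_SHIFT;
}
```

Under that assumption, index 512 corresponds to file offset 2 MiB and index
262144 to file offset 1 GiB.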

In hugetlb's fallocate, the offset within the file is passed in the call to
folio_zero_user(), which is why the offset within the file was used here.

In the next revision I will refactor this to something like
kvm_gmem_prepare_folio_shared() and kvm_gmem_prepare_folio_private().

In kvm_gmem_prepare_folio_private(), folio->index << PAGE_SHIFT can still be
passed as addr_hint to align with HugeTLB. When being prepared as a private
folio, the folio will be mapped by KVM: addr_hint won't matter since this folio
isn't going to be mapped into userspace. If the folio was previously used as a
shared page, unmapping would have flushed the dcache.

In kvm_gmem_prepare_folio_shared(), the folio will subsequently be mapped and
vmf->real_address should be passed as addr_hint.

Thanks for this question!

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [RFC PATCH 15/39] KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb
  2024-09-10 23:43 ` [RFC PATCH 15/39] KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb Ackerley Tng
  2024-09-13 22:26   ` Elliot Berman
@ 2024-10-30  9:01   ` Jun Miao
  2025-02-11  1:21     ` Ackerley Tng
  2024-12-01 17:55   ` Peter Xu
  2 siblings, 1 reply; 130+ messages in thread
From: Jun Miao @ 2024-10-30  9:01 UTC (permalink / raw)
  To: Ackerley Tng, tabba, quic_eberman, roypat, jgg, peterx, david,
	rientjes, fvdl, jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, qperret, jhubbard, willy, shuah, brauner,
	bfoster, kent.overstreet, pvorel, rppt, richard.weiyang, anup,
	haibo1.xu, ajones, vkuznets, maciej.wieczor-retman, pgonda,
	oliver.upton, linux-kernel, linux-mm, kvm, linux-kselftest,
	linux-fsdevel, Li, Zhiquan1, Du, Fan, Miao, Jun

Hi Ackerley,
Due to actual customer requirements (such as ByteDance), I have added
support for NUMA policy based on your foundation.
Standing on the shoulders of giants, please correct me if there is
anything wrong.

--- Thanks Jun.miao

On 2024/9/11 07:43, Ackerley Tng wrote:
> If HugeTLB is requested at guest_memfd creation time, HugeTLB pages
> will be used to back guest_memfd.
>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> ---
>   virt/kvm/guest_memfd.c | 252 ++++++++++++++++++++++++++++++++++++++---
>   1 file changed, 239 insertions(+), 13 deletions(-)
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index 31e1115273e1..2e6f12e2bac8 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -8,6 +8,8 @@
>   #include <linux/pseudo_fs.h>
>   #include <linux/pagemap.h>
>   #include <linux/anon_inodes.h>
> +#include <linux/memcontrol.h>
> +#include <linux/mempolicy.h>
>   
>   #include "kvm_mm.h"
>   
> @@ -29,6 +31,13 @@ static struct kvm_gmem_hugetlb *kvm_gmem_hgmem(struct inode *inode)
>   	return inode->i_mapping->i_private_data;
>   }
>   
> +static bool is_kvm_gmem_hugetlb(struct inode *inode)
> +{
> +	u64 flags = (u64)inode->i_private;
> +
> +	return flags & KVM_GUEST_MEMFD_HUGETLB;
> +}
> +
>   /**
>    * folio_file_pfn - like folio_file_page, but return a pfn.
>    * @folio: The folio which contains this index.
> @@ -58,6 +67,9 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo
>   	return 0;
>   }
>   
> +/**
> + * Use the uptodate flag to indicate that the folio is prepared for KVM's usage.
> + */
>   static inline void kvm_gmem_mark_prepared(struct folio *folio)
>   {
>   	folio_mark_uptodate(folio);
> @@ -72,13 +84,18 @@ static inline void kvm_gmem_mark_prepared(struct folio *folio)
>   static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
>   				  gfn_t gfn, struct folio *folio)
>   {
> -	unsigned long nr_pages, i;
>   	pgoff_t index;
>   	int r;
>   
> -	nr_pages = folio_nr_pages(folio);
> -	for (i = 0; i < nr_pages; i++)
> -		clear_highpage(folio_page(folio, i));
> +	if (folio_test_hugetlb(folio)) {
> +		folio_zero_user(folio, folio->index << PAGE_SHIFT);
> +	} else {
> +		unsigned long nr_pages, i;
> +
> +		nr_pages = folio_nr_pages(folio);
> +		for (i = 0; i < nr_pages; i++)
> +			clear_highpage(folio_page(folio, i));
> +	}
>   
>   	/*
>   	 * Preparing huge folios should always be safe, since it should
> @@ -103,6 +120,174 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
>   	return r;
>   }
>   
> +static int kvm_gmem_get_mpol_node_nodemask(gfp_t gfp_mask,
> +					   struct mempolicy **mpol,
> +					   nodemask_t **nodemask)
> +{
> +	/*
> +	 * TODO: mempolicy would probably have to be stored on the inode, use
> +	 * task policy for now.
> +	 */
> +	*mpol = get_task_policy(current);
commit bbb0b86af11574516fe78bc1340f49c9e6b7e588 (HEAD -> my-gmem-hugetlb-rfc-v2)
Author: Jun Miao <jun.miao@intel.com>
Date:   Wed Oct 30 11:07:16 2024 -0400

     KVM: guest_memfd: add TDX numa policy in hugetlb support

     Support NUMA policy for HugeTLB-backed guest_memfd. This needs a
     corresponding QEMU patch to work; the NUMA policy is then set in
     QEMU like this:
     "--object host-nodes=0,policy=bind".

     If no policy is set in QEMU, the current task policy is used for now.

     Signed-off-by: Jun Miao <jun.miao@intel.com>

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index a49631e47421..cf569fe0740d 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -91,6 +91,17 @@ static inline struct mempolicy *mpol_dup(struct mempolicy *pol)
         return pol;
  }

+struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
+                           nodemask_t *nodes);
+
+int mpol_set_nodemask(struct mempolicy *pol,
+                const nodemask_t *nodes, struct nodemask_scratch *nsc);
+
+int sanitize_mpol_flags(int *mode, unsigned short *flags);
+
+int get_nodes(nodemask_t *nodes, const unsigned long __user *nmask,
+                unsigned long maxnode);
+
  static inline void mpol_get(struct mempolicy *pol)
  {
         if (pol)
@@ -202,6 +213,25 @@ static inline void mpol_cond_put(struct mempolicy *pol)
  {
  }

+struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
+                           nodemask_t *nodes);
+{
+}
+
+int mpol_set_nodemask(struct mempolicy *pol,
+                const nodemask_t *nodes, struct nodemask_scratch *nsc);
+{
+}
+
+int sanitize_mpol_flags(int *mode, unsigned short *flags);
+{
+}
+
+int get_nodes(nodemask_t *nodes, const unsigned long __user *nmask,
+                unsigned long maxnode);
+{
+}
+
  static inline void mpol_get(struct mempolicy *pol)
  {
  }
diff --git a/include/uapi/linux/mempolicy.h b/include/uapi/linux/mempolicy.h
index 1f9bb10d1a47..6ba4eb0935de 100644
--- a/include/uapi/linux/mempolicy.h
+++ b/include/uapi/linux/mempolicy.h
@@ -24,6 +24,7 @@ enum {
         MPOL_LOCAL,
         MPOL_PREFERRED_MANY,
         MPOL_WEIGHTED_INTERLEAVE,
+       MPOL_INVALID,   /* Invalid parameter passing, come from and keep consistent with QEMU */
         MPOL_MAX,       /* always last member of enum */
  };

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index f3e572e17775..b465ed5091c2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -259,7 +259,7 @@ static int mpol_new_preferred(struct mempolicy *pol, const nodemask_t *nodes)
   * Must be called holding task's alloc_lock to protect task's mems_allowed
   * and mempolicy.  May also be called holding the mmap_lock for write.
   */
-static int mpol_set_nodemask(struct mempolicy *pol,
+int mpol_set_nodemask(struct mempolicy *pol,
                      const nodemask_t *nodes, struct nodemask_scratch *nsc)
  {
         int ret;
@@ -291,12 +291,13 @@ static int mpol_set_nodemask(struct mempolicy *pol,
         ret = mpol_ops[pol->mode].create(pol, &nsc->mask2);
         return ret;
  }
+EXPORT_SYMBOL_GPL(mpol_set_nodemask);

  /*
   * This function just creates a new policy, does some check and simple
   * initialization. You must invoke mpol_set_nodemask() to set nodes.
   */
-static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
+struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,
                                   nodemask_t *nodes)
  {
         struct mempolicy *policy;
@@ -339,6 +340,7 @@ static struct mempolicy *mpol_new(unsigned short mode, unsigned short flags,

         return policy;
  }
+EXPORT_SYMBOL_GPL(mpol_new);

  /* Slow path of a mpol destructor. */
  void __mpol_put(struct mempolicy *pol)
@@ -1429,7 +1431,7 @@ static int get_bitmap(unsigned long *mask, const unsigned long __user *nmask,
  }

  /* Copy a node mask from user space. */
-static int get_nodes(nodemask_t *nodes, const unsigned long __user *nmask,
+int get_nodes(nodemask_t *nodes, const unsigned long __user *nmask,
                      unsigned long maxnode)
  {
         --maxnode;
@@ -1463,6 +1465,7 @@ static int get_nodes(nodemask_t *nodes, const unsigned long __user *nmask,

         return get_bitmap(nodes_addr(*nodes), nmask, maxnode);
  }
+EXPORT_SYMBOL(get_nodes);

  /* Copy a kernel node mask to user space */
  static int copy_nodes_to_user(unsigned long __user *mask, unsigned long maxnode,
@@ -1492,7 +1495,7 @@ static int copy_nodes_to_user(unsigned long __user *mask, unsigned long maxnode,
  }

  /* Basic parameter sanity check used by both mbind() and set_mempolicy() */
-static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
+inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
  {
         *flags = *mode & MPOL_MODE_FLAGS;
         *mode &= ~MPOL_MODE_FLAGS;
@@ -1509,6 +1512,7 @@ static inline int sanitize_mpol_flags(int *mode, unsigned short *flags)
         }
         return 0;
  }
+EXPORT_SYMBOL_GPL(sanitize_mpol_flags);

  static long kernel_mbind(unsigned long start, unsigned long len,
                          unsigned long mode, const unsigned long __user *nmask,
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index f34aff971628..7570aa38e519 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -20,6 +20,7 @@ struct kvm_gmem {
         struct kvm *kvm;
         struct xarray bindings;
         struct list_head entry;
+       struct mempolicy *gmemfd_policy;
  };

  struct kvm_gmem_hugetlb {
@@ -154,21 +155,21 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
         return r;
  }

-static int kvm_gmem_get_mpol_node_nodemask(gfp_t gfp_mask,
+static int kvm_gmem_get_mpol_node_nodemask(struct kvm_gmem *gmem, gfp_t gfp_mask,
                                            struct mempolicy **mpol,
                                            nodemask_t **nodemask)
  {
         /*
-        * TODO: mempolicy would probably have to be stored on the inode, use
-        * task policy for now.
+        *  Mempolicy would probably have to be stored on the inode, if no setting in QEMU
+        *  use task policy for now.
          */
-       *mpol = get_task_policy(current);
+       *mpol = gmem->gmemfd_policy;

         /* TODO: ignore interleaving (set ilx to 0) for now. */
         return policy_node_nodemask(*mpol, gfp_mask, 0, nodemask);
  }

-static struct folio *kvm_gmem_hugetlb_alloc_folio(struct hstate *h,
+static struct folio *kvm_gmem_hugetlb_alloc_folio(struct kvm_gmem *gmem, struct hstate *h,
                                                   struct hugepage_subpool *spool)
  {
         bool memcg_charge_was_prepared;
@@ -195,7 +196,7 @@ static struct folio *kvm_gmem_hugetlb_alloc_folio(struct hstate *h,
         if (hugepage_subpool_get_pages(spool, 1))
                 goto err_cancel_charge;

-       nid = kvm_gmem_get_mpol_node_nodemask(htlb_alloc_mask(h), &mpol,
+       nid = kvm_gmem_get_mpol_node_nodemask(gmem, htlb_alloc_mask(h), &mpol,
                                               &nodemask);
         /*
          * charge_cgroup_reservation is false because we didn't make any cgroup
@@ -268,10 +269,12 @@ static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
  {
         struct kvm_gmem_hugetlb *hgmem;
         struct folio *folio;
+       struct kvm_gmem *gmem;
         int ret;

         hgmem = kvm_gmem_hgmem(inode);
-       folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
+       gmem = inode->i_mapping->i_private_data;
+       folio = kvm_gmem_hugetlb_alloc_folio(gmem, hgmem->h, hgmem->spool);
         if (IS_ERR(folio))
                 return folio;

@@ -905,7 +908,7 @@ static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
         return file;
  }

-static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
+static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags, struct mempolicy *new)
  {
         struct kvm_gmem *gmem;
         struct file *file;
@@ -927,6 +930,8 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
                 goto err_gmem;
         }

+       file_inode(file)->i_mapping->i_private_data = gmem;
+       gmem->gmemfd_policy = new;
         kvm_get_kvm(kvm);
         gmem->kvm = kvm;
         xa_init(&gmem->bindings);
@@ -955,6 +960,40 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
  {
         loff_t size = args->size;
         u64 flags = args->flags;
+       nodemask_t nodes;
+       struct mempolicy *new;
+       int err, ret;
+       u64 mode = args->reserved[0];
+       u64 maxnode = args->reserved[1];
+       const unsigned long host_nodes = (unsigned long)args->reserved[2];
+       unsigned short mode_flags;
+       int lmode = mode;
+       NODEMASK_SCRATCH(scratch);
+       if(!scratch)
+               return -ENOMEM;
+
+       if (mode == MPOL_INVALID)
+               goto task_policy;
+       else {
+               err = sanitize_mpol_flags(&lmode, &mode_flags);
+               if (err)
+                       goto task_policy;
+
+               err = get_nodes(&nodes, &host_nodes, maxnode);
+               if (err)
+                       goto task_policy;
+
+               new = mpol_new(mode, mode_flags, &nodes);
+               if (IS_ERR(new))
+                       goto task_policy;
+               else
+                       goto numa_policy;
+}
+
+task_policy:
+       new = get_task_policy(current);
+numa_policy:
+       ret = mpol_set_nodemask(new, &nodes, scratch);

         if (flags & KVM_GUEST_MEMFD_HUGETLB) {
                 /* Allow huge page size encoding in flags */
@@ -975,7 +1014,7 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
         if (size <= 0)
                 return -EINVAL;

-       return __kvm_gmem_create(kvm, size, flags);
+       return __kvm_gmem_create(kvm, size, flags, new);
  }

  int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,

> +
> +	/* TODO: ignore interleaving (set ilx to 0) for now. */
> +	return policy_node_nodemask(*mpol, gfp_mask, 0, nodemask);
> +}
> +
> +static struct folio *kvm_gmem_hugetlb_alloc_folio(struct hstate *h,
> +						  struct hugepage_subpool *spool)
> +{
> +	bool memcg_charge_was_prepared;
> +	struct mem_cgroup *memcg;
> +	struct mempolicy *mpol;
> +	nodemask_t *nodemask;
> +	struct folio *folio;
> +	gfp_t gfp_mask;
> +	int ret;
> +	int nid;
> +
> +	gfp_mask = htlb_alloc_mask(h);
> +
> +	memcg = get_mem_cgroup_from_current();
> +	ret = mem_cgroup_hugetlb_try_charge(memcg,
> +					    gfp_mask | __GFP_RETRY_MAYFAIL,
> +					    pages_per_huge_page(h));
> +	if (ret == -ENOMEM)
> +		goto err;
> +
> +	memcg_charge_was_prepared = ret != -EOPNOTSUPP;
> +
> +	/* Pages are only to be taken from guest_memfd subpool and nowhere else. */
> +	if (hugepage_subpool_get_pages(spool, 1))
> +		goto err_cancel_charge;
> +
> +	nid = kvm_gmem_get_mpol_node_nodemask(htlb_alloc_mask(h), &mpol,
> +					      &nodemask);
> +	/*
> +	 * charge_cgroup_reservation is false because we didn't make any cgroup
> +	 * reservations when creating the guest_memfd subpool.
> +	 *
> +	 * use_hstate_resv is true because we reserved from global hstate when
> +	 * creating the guest_memfd subpool.
> +	 */
> +	folio = hugetlb_alloc_folio(h, mpol, nid, nodemask, false, true);
> +	mpol_cond_put(mpol);
> +
> +	if (!folio)
> +		goto err_put_pages;
> +
> +	hugetlb_set_folio_subpool(folio, spool);
> +
> +	if (memcg_charge_was_prepared)
> +		mem_cgroup_commit_charge(folio, memcg);
> +
> +out:
> +	mem_cgroup_put(memcg);
> +
> +	return folio;
> +
> +err_put_pages:
> +	hugepage_subpool_put_pages(spool, 1);
> +
> +err_cancel_charge:
> +	if (memcg_charge_was_prepared)
> +		mem_cgroup_cancel_charge(memcg, pages_per_huge_page(h));
> +
> +err:
> +	folio = ERR_PTR(-ENOMEM);
> +	goto out;
> +}
> +
> +static int kvm_gmem_hugetlb_filemap_add_folio(struct address_space *mapping,
> +					      struct folio *folio, pgoff_t index,
> +					      gfp_t gfp)
> +{
> +	int ret;
> +
> +	__folio_set_locked(folio);
> +	ret = __filemap_add_folio(mapping, folio, index, gfp, NULL);
> +	if (unlikely(ret)) {
> +		__folio_clear_locked(folio);
> +		return ret;
> +	}
> +
> +	/*
> +	 * In hugetlb_add_to_page_cache(), there is a call to
> +	 * folio_clear_hugetlb_restore_reserve(). This is handled when the pages
> +	 * are removed from the page cache in unmap_hugepage_range() ->
> +	 * __unmap_hugepage_range() by conditionally calling
> +	 * folio_set_hugetlb_restore_reserve(). In kvm_gmem_hugetlb's usage of
> +	 * hugetlb, there are no VMAs involved, and pages are never taken from
> +	 * the surplus, so when pages are freed, the hstate reserve must be
> +	 * restored. Hence, this function makes no call to
> +	 * folio_clear_hugetlb_restore_reserve().
> +	 */
> +
> +	/* mark folio dirty so that it will not be removed from cache/inode */
> +	folio_mark_dirty(folio);
> +
> +	return 0;
> +}
> +
> +static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
> +							    pgoff_t index)
> +{
> +	struct kvm_gmem_hugetlb *hgmem;
> +	struct folio *folio;
> +	int ret;
> +
> +	hgmem = kvm_gmem_hgmem(inode);
> +	folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
> +	if (IS_ERR(folio))
> +		return folio;
> +
> +	/* TODO: Fix index here to be aligned to huge page size. */
> +	ret = kvm_gmem_hugetlb_filemap_add_folio(
> +		inode->i_mapping, folio, index, htlb_alloc_mask(hgmem->h));
> +	if (ret) {
> +		folio_put(folio);
> +		return ERR_PTR(ret);
> +	}
> +
> +	spin_lock(&inode->i_lock);
> +	inode->i_blocks += blocks_per_huge_page(hgmem->h);
> +	spin_unlock(&inode->i_lock);
> +
> +	return folio;
> +}
> +
> +static struct folio *kvm_gmem_get_hugetlb_folio(struct inode *inode,
> +						pgoff_t index)
> +{
> +	struct address_space *mapping;
> +	struct folio *folio;
> +	struct hstate *h;
> +	pgoff_t hindex;
> +	u32 hash;
> +
> +	h = kvm_gmem_hgmem(inode)->h;
> +	hindex = index >> huge_page_order(h);
> +	mapping = inode->i_mapping;
> +
> +	/* To lock, we calculate the hash using the hindex and not index. */
> +	hash = hugetlb_fault_mutex_hash(mapping, hindex);
> +	mutex_lock(&hugetlb_fault_mutex_table[hash]);
> +
> +	/*
> +	 * The filemap is indexed with index and not hindex. Taking lock on
> +	 * folio to align with kvm_gmem_get_regular_folio()
> +	 */
> +	folio = filemap_lock_folio(mapping, index);
> +	if (!IS_ERR(folio))
> +		goto out;
> +
> +	folio = kvm_gmem_hugetlb_alloc_and_cache_folio(inode, index);
> +out:
> +	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
> +
> +	return folio;
> +}
> +
>   /*
>    * Returns a locked folio on success.  The caller is responsible for
>    * setting the up-to-date flag before the memory is mapped into the guest.
> @@ -114,8 +299,10 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
>    */
>   static struct folio *kvm_gmem_get_folio(struct inode *inode, pgoff_t index)
>   {
> -	/* TODO: Support huge pages. */
> -	return filemap_grab_folio(inode->i_mapping, index);
> +	if (is_kvm_gmem_hugetlb(inode))
> +		return kvm_gmem_get_hugetlb_folio(inode, index);
> +	else
> +		return filemap_grab_folio(inode->i_mapping, index);
>   }
>   
>   static void kvm_gmem_invalidate_begin(struct kvm_gmem *gmem, pgoff_t start,
> @@ -240,6 +427,35 @@ static void kvm_gmem_hugetlb_truncate_folios_range(struct inode *inode,
>   	spin_unlock(&inode->i_lock);
>   }
>   
> +static void kvm_gmem_hugetlb_truncate_range(struct inode *inode, loff_t lstart,
> +					    loff_t lend)
> +{
> +	loff_t full_hpage_start;
> +	loff_t full_hpage_end;
> +	unsigned long hsize;
> +	struct hstate *h;
> +
> +	h = kvm_gmem_hgmem(inode)->h;
> +	hsize = huge_page_size(h);
> +
> +	full_hpage_start = round_up(lstart, hsize);
> +	full_hpage_end = round_down(lend, hsize);
> +
> +	if (lstart < full_hpage_start) {
> +		hugetlb_zero_partial_page(h, inode->i_mapping, lstart,
> +					  full_hpage_start);
> +	}
> +
> +	if (full_hpage_end > full_hpage_start) {
> +		kvm_gmem_hugetlb_truncate_folios_range(inode, full_hpage_start,
> +						       full_hpage_end);
> +	}
> +
> +	if (lend > full_hpage_end) {
> +		hugetlb_zero_partial_page(h, inode->i_mapping, full_hpage_end,
> +					  lend);
> +	}
> +}
>   
>   static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>   {
> @@ -257,7 +473,12 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>   	list_for_each_entry(gmem, gmem_list, entry)
>   		kvm_gmem_invalidate_begin(gmem, start, end);
>   
> -	truncate_inode_pages_range(inode->i_mapping, offset, offset + len - 1);
> +	if (is_kvm_gmem_hugetlb(inode)) {
> +		kvm_gmem_hugetlb_truncate_range(inode, offset, offset + len);
> +	} else {
> +		truncate_inode_pages_range(inode->i_mapping, offset,
> +					   offset + len - 1);
> +	}
>   
>   	list_for_each_entry(gmem, gmem_list, entry)
>   		kvm_gmem_invalidate_end(gmem, start, end);
> @@ -279,8 +500,15 @@ static long kvm_gmem_allocate(struct inode *inode, loff_t offset, loff_t len)
>   
>   	filemap_invalidate_lock_shared(mapping);
>   
> -	start = offset >> PAGE_SHIFT;
> -	end = (offset + len) >> PAGE_SHIFT;
> +	if (is_kvm_gmem_hugetlb(inode)) {
> +		unsigned long hsize = huge_page_size(kvm_gmem_hgmem(inode)->h);
> +
> +		start = round_down(offset, hsize) >> PAGE_SHIFT;
> +		end = round_down(offset + len, hsize) >> PAGE_SHIFT;
> +	} else {
> +		start = offset >> PAGE_SHIFT;
> +		end = (offset + len) >> PAGE_SHIFT;
> +	}
>   
>   	r = 0;
>   	for (index = start; index < end; ) {
> @@ -408,9 +636,7 @@ static void kvm_gmem_hugetlb_teardown(struct inode *inode)
>   
>   static void kvm_gmem_evict_inode(struct inode *inode)
>   {
> -	u64 flags = (u64)inode->i_private;
> -
> -	if (flags & KVM_GUEST_MEMFD_HUGETLB)
> +	if (is_kvm_gmem_hugetlb(inode))
>   		kvm_gmem_hugetlb_teardown(inode);
>   	else
>   		truncate_inode_pages_final(inode->i_mapping);
> @@ -827,7 +1053,7 @@ __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
>   
>   	*pfn = folio_file_pfn(folio, index);
>   	if (max_order)
> -		*max_order = 0;
> +		*max_order = folio_order(folio);
>   
>   	*is_prepared = folio_test_uptodate(folio);
>   	return folio;

^ permalink raw reply related	[flat|nested] 130+ messages in thread

* Re: [RFC PATCH 15/39] KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb
  2024-10-30  9:01   ` Jun Miao
@ 2025-02-11  1:21     ` Ackerley Tng
  0 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2025-02-11  1:21 UTC (permalink / raw)
  To: Jun Miao
  Cc: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, isaku.yamahata,
	muchun.song, mike.kravetz, erdemaktas, vannapurve, qperret,
	jhubbard, willy, shuah, brauner, bfoster, kent.overstreet, pvorel,
	rppt, richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel, jun.miao

Jun Miao <jun.miao@intel.com> writes:

> Hi Ackerley,
> Due to actual customer requirements (such as ByteDance), I have added
> support for NUMA policy based on your foundation.
> Standing on the shoulders of giants, please correct me if there is
> anything wrong.
>
> --- Thanks Jun.miao
>
> <snip>

Hi Jun,

Thank you for your email, and sorry about the delayed reply. I haven't had
a chance to look at NUMA support yet.

Shivank Garg just posted a series for NUMA mempolicy support [1], which
is dependent on mmap() and then mbind(). Does that work for your use
case, or must you have mempolicy set up at guest_memfd creation time?
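
For reference, the userspace side of an mmap()-then-mbind() flow could look
roughly like the sketch below. It is not taken from [1]: nodemask_for() and
bind_range_to_node() are hypothetical helpers, the raw mbind(2) syscall is used
to avoid depending on libnuma, SKETCH_MPOL_BIND mirrors the value of MPOL_BIND
in include/uapi/linux/mempolicy.h, and addr is assumed to be an existing
page-aligned mapping (of guest_memfd, once it can be mmap()ed):

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Mirrors MPOL_BIND from include/uapi/linux/mempolicy.h. */
#define SKETCH_MPOL_BIND 2

/* Number of bits in the single-word node mask used by this sketch. */
#define SKETCH_MASK_BITS (8 * sizeof(unsigned long))

/* Build a one-word node mask with only `node` set. */
unsigned long nodemask_for(unsigned int node)
{
	/* mbind() ignores the top bit of maxnode, so stay below it. */
	assert(node < SKETCH_MASK_BITS - 1);
	return 1UL << node;
}

/*
 * Bind the already-mapped range [addr, addr + len) to a single NUMA node.
 * Returns 0 on success, -1 with errno set on failure.
 */
long bind_range_to_node(void *addr, size_t len, unsigned int node)
{
	unsigned long mask = nodemask_for(node);

	return syscall(SYS_mbind, addr, (unsigned long)len,
		       (long)SKETCH_MPOL_BIND, &mask,
		       (unsigned long)SKETCH_MASK_BITS, 0L);
}
```

The point of the sketch is only the ordering: the policy is applied to a
mapping after the fd has been created, rather than being passed in at
guest_memfd creation time.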


Ackerley

[1] https://lore.kernel.org/all/20250210063227.41125-1-shivankg@amd.com/T/

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [RFC PATCH 15/39] KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb
  2024-09-10 23:43 ` [RFC PATCH 15/39] KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb Ackerley Tng
  2024-09-13 22:26   ` Elliot Berman
  2024-10-30  9:01   ` Jun Miao
@ 2024-12-01 17:55   ` Peter Xu
  2025-02-13  7:52     ` Ackerley Tng
  2 siblings, 1 reply; 130+ messages in thread
From: Peter Xu @ 2024-12-01 17:55 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: tabba, quic_eberman, roypat, jgg, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest, linux-fsdevel

On Tue, Sep 10, 2024 at 11:43:46PM +0000, Ackerley Tng wrote:
> +static struct folio *kvm_gmem_hugetlb_alloc_folio(struct hstate *h,
> +						  struct hugepage_subpool *spool)
> +{
> +	bool memcg_charge_was_prepared;
> +	struct mem_cgroup *memcg;
> +	struct mempolicy *mpol;
> +	nodemask_t *nodemask;
> +	struct folio *folio;
> +	gfp_t gfp_mask;
> +	int ret;
> +	int nid;
> +
> +	gfp_mask = htlb_alloc_mask(h);
> +
> +	memcg = get_mem_cgroup_from_current();
> +	ret = mem_cgroup_hugetlb_try_charge(memcg,
> +					    gfp_mask | __GFP_RETRY_MAYFAIL,
> +					    pages_per_huge_page(h));
> +	if (ret == -ENOMEM)
> +		goto err;
> +
> +	memcg_charge_was_prepared = ret != -EOPNOTSUPP;
> +
> +	/* Pages are only to be taken from guest_memfd subpool and nowhere else. */
> +	if (hugepage_subpool_get_pages(spool, 1))
> +		goto err_cancel_charge;
> +
> +	nid = kvm_gmem_get_mpol_node_nodemask(htlb_alloc_mask(h), &mpol,
> +					      &nodemask);
> +	/*
> +	 * charge_cgroup_reservation is false because we didn't make any cgroup
> +	 * reservations when creating the guest_memfd subpool.

Hmm.. isn't this the exact reason to set charge_cgroup_reservation==true
instead?

IIUC gmem hugetlb pages should participate in the hugetlb cgroup resv
charge as well.  It is already involved in the rest of the cgroup charges,
and I wonder whether it's intended that the patch treats the resv accounting
specially.

Thanks,

> +	 *
> +	 * use_hstate_resv is true because we reserved from global hstate when
> +	 * creating the guest_memfd subpool.
> +	 */
> +	folio = hugetlb_alloc_folio(h, mpol, nid, nodemask, false, true);
> +	mpol_cond_put(mpol);
> +
> +	if (!folio)
> +		goto err_put_pages;
> +
> +	hugetlb_set_folio_subpool(folio, spool);
> +
> +	if (memcg_charge_was_prepared)
> +		mem_cgroup_commit_charge(folio, memcg);
> +
> +out:
> +	mem_cgroup_put(memcg);
> +
> +	return folio;
> +
> +err_put_pages:
> +	hugepage_subpool_put_pages(spool, 1);
> +
> +err_cancel_charge:
> +	if (memcg_charge_was_prepared)
> +		mem_cgroup_cancel_charge(memcg, pages_per_huge_page(h));
> +
> +err:
> +	folio = ERR_PTR(-ENOMEM);
> +	goto out;
> +}

-- 
Peter Xu


* Re: [RFC PATCH 15/39] KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb
  2024-12-01 17:55   ` Peter Xu
@ 2025-02-13  7:52     ` Ackerley Tng
  2025-02-13 16:48       ` Peter Xu
  0 siblings, 1 reply; 130+ messages in thread
From: Ackerley Tng @ 2025-02-13  7:52 UTC (permalink / raw)
  To: Peter Xu
  Cc: tabba, quic_eberman, roypat, jgg, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest, linux-fsdevel

Peter Xu <peterx@redhat.com> writes:

> On Tue, Sep 10, 2024 at 11:43:46PM +0000, Ackerley Tng wrote:
>> +static struct folio *kvm_gmem_hugetlb_alloc_folio(struct hstate *h,
>> +						  struct hugepage_subpool *spool)
>> +{
>> +	bool memcg_charge_was_prepared;
>> +	struct mem_cgroup *memcg;
>> +	struct mempolicy *mpol;
>> +	nodemask_t *nodemask;
>> +	struct folio *folio;
>> +	gfp_t gfp_mask;
>> +	int ret;
>> +	int nid;
>> +
>> +	gfp_mask = htlb_alloc_mask(h);
>> +
>> +	memcg = get_mem_cgroup_from_current();
>> +	ret = mem_cgroup_hugetlb_try_charge(memcg,
>> +					    gfp_mask | __GFP_RETRY_MAYFAIL,
>> +					    pages_per_huge_page(h));
>> +	if (ret == -ENOMEM)
>> +		goto err;
>> +
>> +	memcg_charge_was_prepared = ret != -EOPNOTSUPP;
>> +
>> +	/* Pages are only to be taken from guest_memfd subpool and nowhere else. */
>> +	if (hugepage_subpool_get_pages(spool, 1))
>> +		goto err_cancel_charge;
>> +
>> +	nid = kvm_gmem_get_mpol_node_nodemask(htlb_alloc_mask(h), &mpol,
>> +					      &nodemask);
>> +	/*
>> +	 * charge_cgroup_reservation is false because we didn't make any cgroup
>> +	 * reservations when creating the guest_memfd subpool.
>
> Hmm.. isn't this the exact reason to set charge_cgroup_reservation==true
> instead?
>
> IIUC gmem hugetlb pages should participate in the hugetlb cgroup resv
> charge as well.  It is already involved in the rest cgroup charges, and I
> wonder whether it's intended that the patch treated the resv accounting
> specially.
>
> Thanks,
>

Thank you for your careful reviews!

I misunderstood charging a cgroup for hugetlb reservations when I was
working on this patch.

Before this, I thought hugetlb_cgroup_charge_cgroup_rsvd() was only for
resv_map reservations, so I set charge_cgroup_reservation to false since
guest_memfd didn't use resv_map, but I understand better now. Please
help me check my understanding:

+ All reservations are made at the hstate
+ In addition, every reservation is associated with a subpool (through
  spool->rsv_hpages) or recorded in a resv_map
    + Reservations are either in a subpool or in a resv_map but not both
+ hugetlb_cgroup_charge_cgroup_rsvd() is for any reservation

Regarding the time that a cgroup is charged for reservations:

+ If a reservation is made during subpool creation, the cgroup is not
  charged during the reservation by the subpool, probably by design
  since the process doing the mount may not be the process using the
  pages
+ Charging a cgroup for the reservation happens in
  hugetlb_reserve_pages(), which is called at mmap() time.

For guest_memfd, I see two options:

Option 1: Charge cgroup for reservations at fault time

Pros:

+ Similar in behavior to a fd on a hugetlbfs mount, where the cgroup of
  the process calling fallocate() is charged for the reservation.
+ Symmetric approach, since uncharging happens when the hugetlb folio is
  freed.

Cons:

+ Room for allocation failure after guest_memfd creation. Even though
  this guest_memfd had been created with a subpool and pages have been
  reserved, there is a chance of hitting the cgroup's hugetlb
  reservation cap and failing to allocate a page.

Option 2 (preferred): Charge cgroup for reservations at guest_memfd
creation time

Pros:

+ Once guest_memfd file is created, a page is guaranteed at fault time.
+ Simplifies/doesn't carry over the complexities of the hugetlb(fs)
  reservation system

Cons:

+ The cgroup being charged is the cgroup of the process creating
  guest_memfd, which might be an issue if users expect the process
  faulting the page to be charged.

Implementation:

+ At guest_memfd creation time, when creating the subpool, charge the
  cgroups for everything:
   + for hugetlb usage
   + hugetlb reservation usage and
   + hugetlb usage by page count (as in mem_cgroup_charge_hugetlb(),
     which is new since [1])
+ Refactoring in [1] would be focused on just dequeueing a folio or
  failing which, allocating a surplus folio.
   + After allocation, don't set cgroup on the folio so that the freeing
     process doesn't uncharge anything
+ Uncharge when the file is closed

Please let me know if anyone has any thoughts/suggestions!

>> +	 *
>> +	 * use_hstate_resv is true because we reserved from global hstate when
>> +	 * creating the guest_memfd subpool.
>> +	 */
>> +	folio = hugetlb_alloc_folio(h, mpol, nid, nodemask, false, true);
>> +	mpol_cond_put(mpol);
>> +
>> +	if (!folio)
>> +		goto err_put_pages;
>> +
>> +	hugetlb_set_folio_subpool(folio, spool);
>> +
>> +	if (memcg_charge_was_prepared)
>> +		mem_cgroup_commit_charge(folio, memcg);
>> +
>> +out:
>> +	mem_cgroup_put(memcg);
>> +
>> +	return folio;
>> +
>> +err_put_pages:
>> +	hugepage_subpool_put_pages(spool, 1);
>> +
>> +err_cancel_charge:
>> +	if (memcg_charge_was_prepared)
>> +		mem_cgroup_cancel_charge(memcg, pages_per_huge_page(h));
>> +
>> +err:
>> +	folio = ERR_PTR(-ENOMEM);
>> +	goto out;
>> +}

[1] https://lore.kernel.org/all/7348091f4c539ed207d9bb0f3744d0f0efb7f2b3.1726009989.git.ackerleytng@google.com/
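
As a minimal sketch of Option 2 (a userspace model, not kernel code), the
accounting can be written in plain C. All ex_* names are hypothetical
stand-ins for the subpool and hugetlb cgroup machinery, and every count is
in units of huge pages. The cgroup is charged once when the file is
created, fault time charges nothing, and the whole charge is returned when
the file is closed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a cgroup's hugetlb reservation counter. */
struct ex_cgroup {
	size_t rsvd_limit;	/* cap, in huge pages */
	size_t rsvd_usage;	/* currently charged, in huge pages */
};

static bool ex_try_charge(struct ex_cgroup *cg, size_t nr_pages)
{
	/* rsvd_usage never exceeds rsvd_limit, so this cannot underflow. */
	if (nr_pages > cg->rsvd_limit - cg->rsvd_usage)
		return false;
	cg->rsvd_usage += nr_pages;
	return true;
}

static void ex_uncharge(struct ex_cgroup *cg, size_t nr_pages)
{
	assert(nr_pages <= cg->rsvd_usage);
	cg->rsvd_usage -= nr_pages;
}

/* Hypothetical guest_memfd-like file backed by nr_pages huge pages. */
struct ex_gmem {
	struct ex_cgroup *cg;
	size_t nr_pages;
};

/* Charge everything up front; creation fails if the cap would be hit. */
static bool ex_gmem_create(struct ex_gmem *f, struct ex_cgroup *cg,
			   size_t nr_pages)
{
	if (!ex_try_charge(cg, nr_pages))
		return false;
	f->cg = cg;
	f->nr_pages = nr_pages;
	return true;
}

/* Nothing is charged at fault time, so the cap cannot be hit here. */
static bool ex_gmem_fault(const struct ex_gmem *f, size_t index)
{
	return index < f->nr_pages;
}

/* Uncharge the whole file on close, not per freed folio. */
static void ex_gmem_close(struct ex_gmem *f)
{
	ex_uncharge(f->cg, f->nr_pages);
	f->nr_pages = 0;
}
```

In this model a second file that would exceed the cap is rejected at
creation, which is the "fail upfront" property listed under the pros. The
trade-off listed under the cons also shows up: the cgroup passed to
ex_gmem_create() is the only one ever charged.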

* Re: [RFC PATCH 15/39] KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb
  2025-02-13  7:52     ` Ackerley Tng
@ 2025-02-13 16:48       ` Peter Xu
  0 siblings, 0 replies; 130+ messages in thread
From: Peter Xu @ 2025-02-13 16:48 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: tabba, quic_eberman, roypat, jgg, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest, linux-fsdevel

On Thu, Feb 13, 2025 at 07:52:43AM +0000, Ackerley Tng wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Tue, Sep 10, 2024 at 11:43:46PM +0000, Ackerley Tng wrote:
> >> +static struct folio *kvm_gmem_hugetlb_alloc_folio(struct hstate *h,
> >> +						  struct hugepage_subpool *spool)
> >> +{
> >> +	bool memcg_charge_was_prepared;
> >> +	struct mem_cgroup *memcg;
> >> +	struct mempolicy *mpol;
> >> +	nodemask_t *nodemask;
> >> +	struct folio *folio;
> >> +	gfp_t gfp_mask;
> >> +	int ret;
> >> +	int nid;
> >> +
> >> +	gfp_mask = htlb_alloc_mask(h);
> >> +
> >> +	memcg = get_mem_cgroup_from_current();
> >> +	ret = mem_cgroup_hugetlb_try_charge(memcg,
> >> +					    gfp_mask | __GFP_RETRY_MAYFAIL,
> >> +					    pages_per_huge_page(h));
> >> +	if (ret == -ENOMEM)
> >> +		goto err;
> >> +
> >> +	memcg_charge_was_prepared = ret != -EOPNOTSUPP;
> >> +
> >> +	/* Pages are only to be taken from guest_memfd subpool and nowhere else. */
> >> +	if (hugepage_subpool_get_pages(spool, 1))
> >> +		goto err_cancel_charge;
> >> +
> >> +	nid = kvm_gmem_get_mpol_node_nodemask(htlb_alloc_mask(h), &mpol,
> >> +					      &nodemask);
> >> +	/*
> >> +	 * charge_cgroup_reservation is false because we didn't make any cgroup
> >> +	 * reservations when creating the guest_memfd subpool.
> >
> > Hmm.. isn't this the exact reason to set charge_cgroup_reservation==true
> > instead?
> >
> > IIUC gmem hugetlb pages should participate in the hugetlb cgroup resv
> > charge as well.  It is already involved in the rest cgroup charges, and I
> > wonder whether it's intended that the patch treated the resv accounting
> > specially.
> >
> > Thanks,
> >
> 
> Thank you for your careful reviews!
> 
> I misunderstood charging a cgroup for hugetlb reservations when I was
> working on this patch.
> 
> Before this, I thought hugetlb_cgroup_charge_cgroup_rsvd() was only for
> resv_map reservations, so I set charge_cgroup_reservation to false since
> guest_memfd didn't use resv_map, but I understand better now. Please
> help me check my understanding:
> 
> + All reservations are made at the hstate
> + In addition, every reservation is associated with a subpool (through
>   spool->rsv_hpages) or recorded in a resv_map
>     + Reservations are either in a subpool or in a resv_map but not both
> + hugetlb_cgroup_charge_cgroup_rsvd() is for any reservation
> 
> Regarding the time that a cgroup is charged for reservations:
> 
> + If a reservation is made during subpool creation, the cgroup is not
>   charged during the reservation by the subpool, probably by design
>   since the process doing the mount may not be the process using the
>   pages

Exactly.

> + Charging a cgroup for the reservation happens in
>   hugetlb_reserve_pages(), which is called at mmap() time.

Yes, or if it's not charged in hugetlb_reserve_pages() it needs to be
charged at folio allocation as of now.

> 
> For guest_memfd, I see two options:
> 
> Option 1: Charge cgroup for reservations at fault time
> 
> Pros:
> 
> + Similar in behavior to a fd on a hugetlbfs mount, where the cgroup of
>   the process calling fallocate() is charged for the reservation.
> + Symmetric approach, since uncharging happens when the hugetlb folio is
>   freed.
> 
> Cons:
> 
> + Room for allocation failure after guest_memfd creation. Even though
>   this guest_memfd had been created with a subpool and pages have been
>   reserved, there is a chance of hitting the cgroup's hugetlb
>   reservation cap and failing to allocate a page.
> 
> Option 2 (preferred): Charge cgroup for reservations at guest_memfd
> creation time
> 
> Pros:
> 
> + Once guest_memfd file is created, a page is guaranteed at fault time.

This would definitely be nice: whatever can block the guest from using the
memory should fail upfront when the VM boots, if at all possible (e.g. this
is not an mmap() interface, so the user can't ask for NORESERVE yet).

It'll be slightly different from the spool use case of mount points, but I
think it's a new use case anyway, so IIUC we can define its behavior to
best suit the use case.

> + Simplifies/doesn't carry over the complexities of the hugetlb(fs)
>   reservation system
> 
> Cons:
> 
> + The cgroup being charged is the cgroup of the process creating
>   guest_memfd, which might be an issue if users expect the process
>   faulting the page to be charged.

Right, though I can't picture such a use case yet.

I'm guessing use of guest-memfd by multiple processes is still very far
away.  When it happens, I would expect these tasks to be put into the same
cgroup.  Maybe kubevirt already has some such use; I can go and check.

If they're not in the same cgroup, it's still more reasonable to always
charge the VM instance, rather than whatever other process may operate on
the guest memory.

So it could be that we don't see major cons in option 2.  In general, I
agree with your preference.

> 
> Implementation:
> 
> + At guest_memfd creation time, when creating the subpool, charge the
>   cgroups for everything:
>    + for hugetlb usage

I suppose here you meant the global reservation?  If so, I agree.

IIUC the new code shouldn't need to worry about this if the subpool is
created by the API, as that API does the global charging, like we discussed
elsewhere.

If you meant hugetlb_cgroup_commit_charge(), IMHO it should still be
deferred until allocation, which in the guest-memfd case is at fallocate()
time.  AFAICT, that's the only reason why we need two such charges anyway.

>    + hugetlb reservation usage and

Agree on this one.

>    + hugetlb usage by page count (as in mem_cgroup_charge_hugetlb(),
>      which is new since [1])

This one should, IMHO, also be done only during allocation.

Thanks,

> + Refactoring in [1] would be focused on just dequeueing a folio or
>   failing which, allocating a surplus folio.
>    + After allocation, don't set cgroup on the folio so that the freeing
>      process doesn't uncharge anything
> + Uncharge when the file is closed
> 
> Please let me know if anyone has any thoughts/suggestions!
> 
> >> +	 *
> >> +	 * use_hstate_resv is true because we reserved from global hstate when
> >> +	 * creating the guest_memfd subpool.
> >> +	 */
> >> +	folio = hugetlb_alloc_folio(h, mpol, nid, nodemask, false, true);
> >> +	mpol_cond_put(mpol);
> >> +
> >> +	if (!folio)
> >> +		goto err_put_pages;
> >> +
> >> +	hugetlb_set_folio_subpool(folio, spool);
> >> +
> >> +	if (memcg_charge_was_prepared)
> >> +		mem_cgroup_commit_charge(folio, memcg);
> >> +
> >> +out:
> >> +	mem_cgroup_put(memcg);
> >> +
> >> +	return folio;
> >> +
> >> +err_put_pages:
> >> +	hugepage_subpool_put_pages(spool, 1);
> >> +
> >> +err_cancel_charge:
> >> +	if (memcg_charge_was_prepared)
> >> +		mem_cgroup_cancel_charge(memcg, pages_per_huge_page(h));
> >> +
> >> +err:
> >> +	folio = ERR_PTR(-ENOMEM);
> >> +	goto out;
> >> +}
> 
> [1] https://lore.kernel.org/all/7348091f4c539ed207d9bb0f3744d0f0efb7f2b3.1726009989.git.ackerleytng@google.com/
> 

-- 
Peter Xu
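
A minimal userspace model of the split suggested above, assuming two
independent per-cgroup counters: the reservation counter is charged once at
guest_memfd creation, while the usage counter is charged per page at
allocation (fallocate()) time. The ex_* names are hypothetical and do not
correspond to kernel functions; counts are in units of huge pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical counter with a cap, in units of huge pages. */
struct ex_counter {
	size_t limit;
	size_t charged;
};

static bool ex_charge(struct ex_counter *c, size_t nr)
{
	/* charged never exceeds limit, so this cannot underflow. */
	if (nr > c->limit - c->charged)
		return false;
	c->charged += nr;
	return true;
}

static void ex_uncharge(struct ex_counter *c, size_t nr)
{
	assert(nr <= c->charged);
	c->charged -= nr;
}

/* Two separate charges per cgroup: reservations and actual usage. */
struct ex_cgroup {
	struct ex_counter rsvd;
	struct ex_counter usage;
};

struct ex_gmem {
	struct ex_cgroup *cg;
	size_t nr_reserved;
	size_t nr_allocated;
};

/* Creation time: charge only the reservation counter, for all pages. */
static bool ex_gmem_create(struct ex_gmem *f, struct ex_cgroup *cg,
			   size_t nr_pages)
{
	if (!ex_charge(&cg->rsvd, nr_pages))
		return false;
	f->cg = cg;
	f->nr_reserved = nr_pages;
	f->nr_allocated = 0;
	return true;
}

/* Allocation time: charge the usage counter, one page at a time. */
static bool ex_gmem_allocate_one(struct ex_gmem *f)
{
	if (f->nr_allocated == f->nr_reserved)
		return false;
	if (!ex_charge(&f->cg->usage, 1))
		return false;
	f->nr_allocated++;
	return true;
}

/* Close: return both charges. */
static void ex_gmem_close(struct ex_gmem *f)
{
	ex_uncharge(&f->cg->usage, f->nr_allocated);
	ex_uncharge(&f->cg->rsvd, f->nr_reserved);
	f->nr_allocated = 0;
	f->nr_reserved = 0;
}
```

One consequence visible in this model: with the usage charge deferred,
allocation can still fail on the usage cap even though the reservation
charge succeeded at creation.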


* [RFC PATCH 16/39] KVM: guest_memfd: Add page alignment check for hugetlb guest_memfd
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (14 preceding siblings ...)
  2024-09-10 23:43 ` [RFC PATCH 15/39] KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb Ackerley Tng
@ 2024-09-10 23:43 ` Ackerley Tng
  2024-09-10 23:43 ` [RFC PATCH 17/39] KVM: selftests: Add basic selftests for hugetlb-backed guest_memfd Ackerley Tng
                   ` (25 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:43 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

When a hugetlb guest_memfd is requested, the requested size should be
aligned to the size of the hugetlb page requested.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 virt/kvm/guest_memfd.c | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 2e6f12e2bac8..eacbfdb950d1 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -909,6 +909,13 @@ static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 	return err;
 }
 
+static inline bool kvm_gmem_hugetlb_page_aligned(u32 flags, u64 value)
+{
+	int page_size_log = (flags >> KVM_GUEST_MEMFD_HUGE_SHIFT) & KVM_GUEST_MEMFD_HUGE_MASK;
+	u64 page_size = 1ULL << page_size_log;
+	return IS_ALIGNED(value, page_size);
+}
+
 #define KVM_GUEST_MEMFD_ALL_FLAGS KVM_GUEST_MEMFD_HUGETLB
 
 int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
@@ -921,12 +928,18 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args)
 		if (flags & ~(KVM_GUEST_MEMFD_ALL_FLAGS |
 			      (KVM_GUEST_MEMFD_HUGE_MASK << KVM_GUEST_MEMFD_HUGE_SHIFT)))
 			return -EINVAL;
+
+		if (!kvm_gmem_hugetlb_page_aligned(flags, size))
+			return -EINVAL;
 	} else {
 		if (flags & ~KVM_GUEST_MEMFD_ALL_FLAGS)
 			return -EINVAL;
+
+		if (!PAGE_ALIGNED(size))
+			return -EINVAL;
 	}
 
-	if (size <= 0 || !PAGE_ALIGNED(size))
+	if (size <= 0)
 		return -EINVAL;
 
 	return __kvm_gmem_create(kvm, size, flags);
-- 
2.46.0.598.g6f2099f65c-goog
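
The alignment rule in this patch can be tried out in userspace with a small
sketch. It assumes the page size is encoded in the flags as log2(size) in a
6-bit field at bit 26, mirroring MAP_HUGE_SHIFT and MAP_HUGE_MASK; the
EX_* constants and ex_* helpers are hypothetical names, not the definitions
used by the series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed encoding, modeled on MAP_HUGE_SHIFT / MAP_HUGE_MASK. */
#define EX_HUGE_SHIFT	26
#define EX_HUGE_MASK	0x3fULL
#define EX_HUGE_2MB	(21ULL << EX_HUGE_SHIFT)
#define EX_HUGE_1GB	(30ULL << EX_HUGE_SHIFT)

/* Decode log2(page size) from the flags and turn it into a byte count. */
static uint64_t ex_huge_page_size(uint64_t flags)
{
	unsigned int page_size_log = (flags >> EX_HUGE_SHIFT) & EX_HUGE_MASK;

	return 1ULL << page_size_log;
}

/* A size is acceptable if it is non-zero and a multiple of the page size. */
static bool ex_size_is_aligned(uint64_t flags, uint64_t size)
{
	uint64_t page_size = ex_huge_page_size(flags);

	return size != 0 && (size & (page_size - 1)) == 0;
}
```

Note that when no size is encoded in the flags, the decoded log is 0 and
the page size comes out as 1 byte, so in this sketch every non-zero size
passes. Whether the default hugetlb page size needs a separate check is a
question for the real implementation.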


* [RFC PATCH 17/39] KVM: selftests: Add basic selftests for hugetlb-backed guest_memfd
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (15 preceding siblings ...)
  2024-09-10 23:43 ` [RFC PATCH 16/39] KVM: guest_memfd: Add page alignment check for hugetlb guest_memfd Ackerley Tng
@ 2024-09-10 23:43 ` Ackerley Tng
  2024-09-10 23:43 ` [RFC PATCH 18/39] KVM: selftests: Support various types of backing sources for private memory Ackerley Tng
                   ` (24 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:43 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

Add tests for 2MB and 1GB page sizes, and update the invalid flags
test for the new KVM_GUEST_MEMFD_HUGETLB flag.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>

---
 .../testing/selftests/kvm/guest_memfd_test.c  | 45 ++++++++++++++-----
 1 file changed, 35 insertions(+), 10 deletions(-)

diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index ba0c8e996035..3618ce06663e 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -13,6 +13,7 @@
 
 #include <linux/bitmap.h>
 #include <linux/falloc.h>
+#include <linux/kvm.h>
 #include <sys/mman.h>
 #include <sys/types.h>
 #include <sys/stat.h>
@@ -122,6 +123,7 @@ static void test_invalid_punch_hole(int fd, size_t page_size, size_t total_size)
 
 static void test_create_guest_memfd_invalid(struct kvm_vm *vm)
 {
+	uint64_t valid_flags = KVM_GUEST_MEMFD_HUGETLB;
 	size_t page_size = getpagesize();
 	uint64_t flag;
 	size_t size;
@@ -135,6 +137,9 @@ static void test_create_guest_memfd_invalid(struct kvm_vm *vm)
 	}
 
 	for (flag = 0; flag; flag <<= 1) {
+		if (flag & valid_flags)
+			continue;
+
 		fd = __vm_create_guest_memfd(vm, page_size, flag);
 		TEST_ASSERT(fd == -1 && errno == EINVAL,
 			    "guest_memfd() with flag '0x%lx' should fail with EINVAL",
@@ -170,24 +175,16 @@ static void test_create_guest_memfd_multiple(struct kvm_vm *vm)
 	close(fd1);
 }
 
-int main(int argc, char *argv[])
+static void test_guest_memfd(struct kvm_vm *vm, uint32_t flags, size_t page_size)
 {
-	size_t page_size;
 	size_t total_size;
 	int fd;
-	struct kvm_vm *vm;
 
 	TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD));
 
-	page_size = getpagesize();
 	total_size = page_size * 4;
 
-	vm = vm_create_barebones();
-
-	test_create_guest_memfd_invalid(vm);
-	test_create_guest_memfd_multiple(vm);
-
-	fd = vm_create_guest_memfd(vm, total_size, 0);
+	fd = vm_create_guest_memfd(vm, total_size, flags);
 
 	test_file_read_write(fd);
 	test_mmap(fd, page_size);
@@ -197,3 +194,31 @@ int main(int argc, char *argv[])
 
 	close(fd);
 }
+
+int main(int argc, char *argv[])
+{
+	struct kvm_vm *vm;
+
+	TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD));
+
+	vm = vm_create_barebones();
+
+	test_create_guest_memfd_invalid(vm);
+	test_create_guest_memfd_multiple(vm);
+
+	printf("Test guest_memfd with 4K pages\n");
+	test_guest_memfd(vm, 0, getpagesize());
+	printf("\tPASSED\n");
+
+	printf("Test guest_memfd with 2M pages\n");
+	test_guest_memfd(vm, KVM_GUEST_MEMFD_HUGETLB | KVM_GUEST_MEMFD_HUGE_2MB,
+			 2UL << 20);
+	printf("\tPASSED\n");
+
+	printf("Test guest_memfd with 1G pages\n");
+	test_guest_memfd(vm, KVM_GUEST_MEMFD_HUGETLB | KVM_GUEST_MEMFD_HUGE_1GB,
+			 1UL << 30);
+	printf("\tPASSED\n");
+
+	return 0;
+}
-- 
2.46.0.598.g6f2099f65c-goog
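
A note on the context line `for (flag = 0; flag; flag <<= 1)` in
test_create_guest_memfd_invalid(): with flag starting at 0 the loop
condition is false on entry, so the body, including the new valid_flags
check, never runs. The sketch below counts how many single-bit flags such a
loop would probe for a given starting value. The helper name and the
example valid-flag bit are hypothetical.

```c
#include <stdint.h>

/*
 * Walk single-bit flags starting at 'first', skipping bits that are set
 * in 'valid_flags', and return how many flags would be probed.
 */
static unsigned int ex_count_probed_flags(uint64_t first, uint64_t valid_flags)
{
	unsigned int probed = 0;
	uint64_t flag;

	/* Shifting an unsigned value past bit 63 yields 0 and ends the loop. */
	for (flag = first; flag; flag <<= 1) {
		if (flag & valid_flags)
			continue;
		probed++;
	}

	return probed;
}
```

Starting from 0 probes nothing, while starting from bit 0 with one valid
bit probes the remaining 63 bits.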


* [RFC PATCH 18/39] KVM: selftests: Support various types of backing sources for private memory
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (16 preceding siblings ...)
  2024-09-10 23:43 ` [RFC PATCH 17/39] KVM: selftests: Add basic selftests for hugetlb-backed guest_memfd Ackerley Tng
@ 2024-09-10 23:43 ` Ackerley Tng
  2024-09-10 23:43 ` [RFC PATCH 19/39] KVM: selftests: Update test for various private memory backing source types Ackerley Tng
                   ` (23 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:43 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

Add support for various types of backing sources for private
memory (in the sense of confidential computing), similar to the
backing sources available for shared memory.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>

---
 .../testing/selftests/kvm/include/test_util.h | 16 ++++
 tools/testing/selftests/kvm/lib/test_util.c   | 74 +++++++++++++++++++
 2 files changed, 90 insertions(+)

diff --git a/tools/testing/selftests/kvm/include/test_util.h b/tools/testing/selftests/kvm/include/test_util.h
index 3e473058849f..011e757d4e2c 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -142,6 +142,16 @@ struct vm_mem_backing_src_alias {
 	uint32_t flag;
 };
 
+enum vm_private_mem_backing_src_type {
+	VM_PRIVATE_MEM_SRC_GUEST_MEM,  /* Use default page size */
+	VM_PRIVATE_MEM_SRC_HUGETLB,    /* Use kernel default page size for hugetlb pages */
+	VM_PRIVATE_MEM_SRC_HUGETLB_2MB,
+	VM_PRIVATE_MEM_SRC_HUGETLB_1GB,
+	NUM_PRIVATE_MEM_SRC_TYPES,
+};
+
+#define DEFAULT_VM_PRIVATE_MEM_SRC VM_PRIVATE_MEM_SRC_GUEST_MEM
+
 #define MIN_RUN_DELAY_NS	200000UL
 
 bool thp_configured(void);
@@ -152,6 +162,12 @@ size_t get_backing_src_pagesz(uint32_t i);
 bool is_backing_src_hugetlb(uint32_t i);
 void backing_src_help(const char *flag);
 enum vm_mem_backing_src_type parse_backing_src_type(const char *type_name);
+
+void private_mem_backing_src_help(const char *flag);
+enum vm_private_mem_backing_src_type parse_private_mem_backing_src_type(const char *type_name);
+const struct vm_mem_backing_src_alias *vm_private_mem_backing_src_alias(uint32_t i);
+size_t get_private_mem_backing_src_pagesz(uint32_t i);
+
 long get_run_delay(void);
 
 /*
diff --git a/tools/testing/selftests/kvm/lib/test_util.c b/tools/testing/selftests/kvm/lib/test_util.c
index 8ed0b74ae837..d0a9b5ee0c01 100644
--- a/tools/testing/selftests/kvm/lib/test_util.c
+++ b/tools/testing/selftests/kvm/lib/test_util.c
@@ -15,6 +15,7 @@
 #include <sys/syscall.h>
 #include <linux/mman.h>
 #include "linux/kernel.h"
+#include <linux/kvm.h>
 
 #include "test_util.h"
 
@@ -288,6 +289,34 @@ const struct vm_mem_backing_src_alias *vm_mem_backing_src_alias(uint32_t i)
 	return &aliases[i];
 }
 
+const struct vm_mem_backing_src_alias *vm_private_mem_backing_src_alias(uint32_t i)
+{
+	static const struct vm_mem_backing_src_alias aliases[] = {
+		[VM_PRIVATE_MEM_SRC_GUEST_MEM] = {
+			.name = "private_mem_guest_mem",
+			.flag = 0,
+		},
+		[VM_PRIVATE_MEM_SRC_HUGETLB] = {
+			.name = "private_mem_hugetlb",
+			.flag = KVM_GUEST_MEMFD_HUGETLB,
+		},
+		[VM_PRIVATE_MEM_SRC_HUGETLB_2MB] = {
+			.name = "private_mem_hugetlb_2mb",
+			.flag = KVM_GUEST_MEMFD_HUGETLB | KVM_GUEST_MEMFD_HUGE_2MB,
+		},
+		[VM_PRIVATE_MEM_SRC_HUGETLB_1GB] = {
+			.name = "private_mem_hugetlb_1gb",
+			.flag = KVM_GUEST_MEMFD_HUGETLB | KVM_GUEST_MEMFD_HUGE_1GB,
+		},
+	};
+	_Static_assert(ARRAY_SIZE(aliases) == NUM_PRIVATE_MEM_SRC_TYPES,
+		       "Missing new backing private mem src types?");
+
+	TEST_ASSERT(i < NUM_PRIVATE_MEM_SRC_TYPES, "Private mem backing src type ID %d too big", i);
+
+	return &aliases[i];
+}
+
 #define MAP_HUGE_PAGE_SIZE(x) (1ULL << ((x >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK))
 
 size_t get_backing_src_pagesz(uint32_t i)
@@ -308,6 +337,20 @@ size_t get_backing_src_pagesz(uint32_t i)
 	}
 }
 
+size_t get_private_mem_backing_src_pagesz(uint32_t i)
+{
+	uint32_t flag = vm_private_mem_backing_src_alias(i)->flag;
+
+	switch (i) {
+	case VM_PRIVATE_MEM_SRC_GUEST_MEM:
+		return getpagesize();
+	case VM_PRIVATE_MEM_SRC_HUGETLB:
+		return get_def_hugetlb_pagesz();
+	default:
+		return MAP_HUGE_PAGE_SIZE(flag);
+	}
+}
+
 bool is_backing_src_hugetlb(uint32_t i)
 {
 	return !!(vm_mem_backing_src_alias(i)->flag & MAP_HUGETLB);
@@ -344,6 +387,37 @@ enum vm_mem_backing_src_type parse_backing_src_type(const char *type_name)
 	return -1;
 }
 
+static void print_available_private_mem_backing_src_types(const char *prefix)
+{
+	int i;
+
+	printf("%sAvailable private mem backing src types:\n", prefix);
+
+	for (i = 0; i < NUM_PRIVATE_MEM_SRC_TYPES; i++)
+		printf("%s    %s\n", prefix, vm_private_mem_backing_src_alias(i)->name);
+}
+
+void private_mem_backing_src_help(const char *flag)
+{
+	printf(" %s: specify the type of memory that should be used to\n"
+	       "     back guest private memory. (default: %s)\n",
+	       flag, vm_private_mem_backing_src_alias(DEFAULT_VM_PRIVATE_MEM_SRC)->name);
+	print_available_private_mem_backing_src_types("     ");
+}
+
+enum vm_private_mem_backing_src_type parse_private_mem_backing_src_type(const char *type_name)
+{
+	int i;
+
+	for (i = 0; i < NUM_PRIVATE_MEM_SRC_TYPES; i++)
+		if (!strcmp(type_name, vm_private_mem_backing_src_alias(i)->name))
+			return i;
+
+	print_available_private_mem_backing_src_types("");
+	TEST_FAIL("Unknown private mem backing src type: %s", type_name);
+	return -1;
+}
+
 long get_run_delay(void)
 {
 	char path[64];
-- 
2.46.0.598.g6f2099f65c-goog


* [RFC PATCH 19/39] KVM: selftests: Update test for various private memory backing source types
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (17 preceding siblings ...)
  2024-09-10 23:43 ` [RFC PATCH 18/39] KVM: selftests: Support various types of backing sources for private memory Ackerley Tng
@ 2024-09-10 23:43 ` Ackerley Tng
  2024-09-10 23:43 ` [RFC PATCH 20/39] KVM: selftests: Add private_mem_conversions_test.sh Ackerley Tng
                   ` (22 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:43 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

Update private_mem_conversions_test for various private memory backing
source types.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../kvm/x86_64/private_mem_conversions_test.c | 28 ++++++++++++++-----
 1 file changed, 21 insertions(+), 7 deletions(-)

diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
index 82a8d88b5338..71f480c19f92 100644
--- a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
+++ b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
@@ -366,14 +366,20 @@ static void *__test_mem_conversions(void *__vcpu)
 	}
 }
 
-static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t nr_vcpus,
-				 uint32_t nr_memslots)
+static void
+test_mem_conversions(enum vm_mem_backing_src_type src_type,
+		     enum vm_private_mem_backing_src_type private_mem_src_type,
+		     uint32_t nr_vcpus,
+		     uint32_t nr_memslots)
 {
 	/*
 	 * Allocate enough memory so that each vCPU's chunk of memory can be
 	 * naturally aligned with respect to the size of the backing store.
 	 */
-	const size_t alignment = max_t(size_t, SZ_2M, get_backing_src_pagesz(src_type));
+	const size_t alignment = max_t(size_t, SZ_2M,
+				       max_t(size_t,
+					     get_private_mem_backing_src_pagesz(private_mem_src_type),
+					     get_backing_src_pagesz(src_type)));
 	const size_t per_cpu_size = align_up(PER_CPU_DATA_SIZE, alignment);
 	const size_t memfd_size = per_cpu_size * nr_vcpus;
 	const size_t slot_size = memfd_size / nr_memslots;
@@ -394,7 +400,9 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t
 
 	vm_enable_cap(vm, KVM_CAP_EXIT_HYPERCALL, (1 << KVM_HC_MAP_GPA_RANGE));
 
-	memfd = vm_create_guest_memfd(vm, memfd_size, 0);
+	memfd = vm_create_guest_memfd(
+		vm, memfd_size,
+		vm_private_mem_backing_src_alias(private_mem_src_type)->flag);
 
 	for (i = 0; i < nr_memslots; i++)
 		vm_mem_add(vm, src_type, BASE_DATA_GPA + slot_size * i,
@@ -440,10 +448,12 @@ static void test_mem_conversions(enum vm_mem_backing_src_type src_type, uint32_t
 static void usage(const char *cmd)
 {
 	puts("");
-	printf("usage: %s [-h] [-m nr_memslots] [-s mem_type] [-n nr_vcpus]\n", cmd);
+	printf("usage: %s [-h] [-m nr_memslots] [-s mem_type] [-p private_mem_type] [-n nr_vcpus]\n", cmd);
 	puts("");
 	backing_src_help("-s");
 	puts("");
+	private_mem_backing_src_help("-p");
+	puts("");
 	puts(" -n: specify the number of vcpus (default: 1)");
 	puts("");
 	puts(" -m: specify the number of memslots (default: 1)");
@@ -453,17 +463,21 @@ static void usage(const char *cmd)
 int main(int argc, char *argv[])
 {
 	enum vm_mem_backing_src_type src_type = DEFAULT_VM_MEM_SRC;
+	enum vm_private_mem_backing_src_type private_mem_src_type = DEFAULT_VM_PRIVATE_MEM_SRC;
 	uint32_t nr_memslots = 1;
 	uint32_t nr_vcpus = 1;
 	int opt;
 
 	TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
 
-	while ((opt = getopt(argc, argv, "hm:s:n:")) != -1) {
+	while ((opt = getopt(argc, argv, "hm:s:p:n:")) != -1) {
 		switch (opt) {
 		case 's':
 			src_type = parse_backing_src_type(optarg);
 			break;
+		case 'p':
+			private_mem_src_type = parse_private_mem_backing_src_type(optarg);
+			break;
 		case 'n':
 			nr_vcpus = atoi_positive("nr_vcpus", optarg);
 			break;
@@ -477,7 +491,7 @@ int main(int argc, char *argv[])
 		}
 	}
 
-	test_mem_conversions(src_type, nr_vcpus, nr_memslots);
+	test_mem_conversions(src_type, private_mem_src_type, nr_vcpus, nr_memslots);
 
 	return 0;
 }
-- 
2.46.0.598.g6f2099f65c-goog
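
The size computation changed by this patch can be sketched as a standalone
function. It follows the expressions in test_mem_conversions(): the
alignment is the largest of 2M, the private backing page size and the
shared backing page size, and the guest_memfd size is the aligned per-vCPU
size times the number of vCPUs. The ex_* names are hypothetical, and the
per-vCPU data size is taken as a parameter because PER_CPU_DATA_SIZE is
defined outside this diff.

```c
#include <stdint.h>

#define EX_SZ_2M	(2ULL << 20)

static uint64_t ex_max(uint64_t a, uint64_t b)
{
	return a > b ? a : b;
}

/* Round x up to a multiple of align; align must be a power of two. */
static uint64_t ex_align_up(uint64_t x, uint64_t align)
{
	return (x + align - 1) & ~(align - 1);
}

static uint64_t ex_memfd_size(uint64_t per_cpu_data_size,
			      uint64_t private_pagesz,
			      uint64_t shared_pagesz,
			      uint32_t nr_vcpus)
{
	uint64_t alignment = ex_max(EX_SZ_2M,
				    ex_max(private_pagesz, shared_pagesz));
	uint64_t per_cpu_size = ex_align_up(per_cpu_data_size, alignment);

	return per_cpu_size * nr_vcpus;
}
```

For example, with a per-vCPU data size slightly above 2M, 1G private pages
and 4 vCPUs, each vCPU's chunk rounds up to 1G and the guest_memfd needs 4G,
i.e. four 1G huge pages.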


* [RFC PATCH 20/39] KVM: selftests: Add private_mem_conversions_test.sh
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (18 preceding siblings ...)
  2024-09-10 23:43 ` [RFC PATCH 19/39] KVM: selftests: Update test for various private memory backing source types Ackerley Tng
@ 2024-09-10 23:43 ` Ackerley Tng
  2024-09-10 23:43 ` [RFC PATCH 21/39] KVM: selftests: Test that guest_memfd usage is reported via hugetlb Ackerley Tng
                   ` (21 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:43 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

Add private_mem_conversions_test.sh to automate running
private_mem_conversions_test across different combinations of shared
and private memory backing source types.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>

---
 .../x86_64/private_mem_conversions_test.sh    | 88 +++++++++++++++++++
 1 file changed, 88 insertions(+)
 create mode 100755 tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.sh

diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.sh b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.sh
new file mode 100755
index 000000000000..fb6705fef466
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.sh
@@ -0,0 +1,88 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0-only
+#
+# Wrapper script which runs different test setups of
+# private_mem_conversions_test.
+#
+# tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.sh
+# Copyright (C) 2023, Google LLC.
+
+set -e
+
+num_vcpus_to_test=4
+num_memslots_to_test=$num_vcpus_to_test
+
+get_default_hugepage_size_in_kB() {
+  grep "Hugepagesize:" /proc/meminfo | grep -o '[[:digit:]]\+'
+}
+
+# Required pages are based on the test setup (see the computation of memfd_size
+# in test_mem_conversions() in private_mem_conversions_test.c).
+
+# These static requirements are set to the maximum required for
+# num_vcpus_to_test, over all the hugetlb-related tests
+required_num_2m_hugepages=$(( 1024 * num_vcpus_to_test ))
+required_num_1g_hugepages=$(( 2 * num_vcpus_to_test ))
+
+# The other hugetlb sizes are not supported on x86_64
+[ "$(cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages 2>/dev/null || echo 0)" -ge "$required_num_2m_hugepages" ] && hugepage_2mb_enabled=1
+[ "$(cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages 2>/dev/null || echo 0)" -ge "$required_num_1g_hugepages" ] && hugepage_1gb_enabled=1
+
+case $(get_default_hugepage_size_in_kB) in
+  2048)
+    hugepage_default_enabled=$hugepage_2mb_enabled
+    ;;
+  1048576)
+    hugepage_default_enabled=$hugepage_1gb_enabled
+    ;;
+  *)
+    hugepage_default_enabled=""
+    ;;
+esac
+
+backing_src_types=( anonymous )
+backing_src_types+=( anonymous_thp )
+[ -n "$hugepage_default_enabled" ] && \
+    backing_src_types+=( anonymous_hugetlb ) || echo "skipping anonymous_hugetlb backing source type"
+[ -n "$hugepage_2mb_enabled" ] && \
+    backing_src_types+=( anonymous_hugetlb_2mb ) || echo "skipping anonymous_hugetlb_2mb backing source type"
+[ -n "$hugepage_1gb_enabled" ] && \
+    backing_src_types+=( anonymous_hugetlb_1gb ) || echo "skipping anonymous_hugetlb_1gb backing source type"
+backing_src_types+=( shmem )
+[ -n "$hugepage_default_enabled" ] && \
+  backing_src_types+=( shared_hugetlb ) || echo "skipping shared_hugetlb backing source type"
+
+private_mem_backing_src_types=( private_mem_guest_mem )
+[ -n "$hugepage_default_enabled" ] && \
+    private_mem_backing_src_types+=( private_mem_hugetlb ) || echo "skipping private_mem_hugetlb backing source type"
+[ -n "$hugepage_2mb_enabled" ] && \
+    private_mem_backing_src_types+=( private_mem_hugetlb_2mb ) || echo "skipping private_mem_hugetlb_2mb backing source type"
+[ -n "$hugepage_1gb_enabled" ] && \
+    private_mem_backing_src_types+=( private_mem_hugetlb_1gb ) || echo "skipping private_mem_hugetlb_1gb backing source type"
+
+set +e
+
+TEST_EXECUTABLE="$(dirname "$0")/private_mem_conversions_test"
+
+(
+	set -e
+
+	for src_type in "${backing_src_types[@]}"; do
+
+		for private_mem_src_type in "${private_mem_backing_src_types[@]}"; do
+			set -x
+
+			$TEST_EXECUTABLE -s "$src_type" -p "$private_mem_src_type" -n $num_vcpus_to_test
+			$TEST_EXECUTABLE -s "$src_type" -p "$private_mem_src_type" -n $num_vcpus_to_test -m $num_memslots_to_test
+
+			{ set +x; } 2>/dev/null
+
+			echo
+
+		done
+
+	done
+)
+RET=$?
+
+exit $RET
-- 
2.46.0.598.g6f2099f65c-goog



* [RFC PATCH 21/39] KVM: selftests: Test that guest_memfd usage is reported via hugetlb
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (19 preceding siblings ...)
  2024-09-10 23:43 ` [RFC PATCH 20/39] KVM: selftests: Add private_mem_conversions_test.sh Ackerley Tng
@ 2024-09-10 23:43 ` Ackerley Tng
  2024-09-10 23:43 ` [RFC PATCH 22/39] mm: hugetlb: Expose vmemmap optimization functions Ackerley Tng
                   ` (20 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:43 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

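Using HugeTLB as the huge page allocator for guest_memfd allows reuse
of HugeTLB's reporting mechanism.

Add a selftest that checks that guest_memfd reservations and
allocations are reflected in the per-size HugeTLB statistics in sysfs,
using a HugeTLB-backed memfd as the reference behavior.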
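
For reviewers, the accounting that assert_stats() in this test expects
can be summarized by the small userspace sketch below. It is only a
model of the assertions in the patch, not kernel code: struct
hugetlb_counters and expected_counters() are hypothetical names
introduced for illustration.

```c
#include <assert.h>

/*
 * Hypothetical model of the per-size HugeTLB counters that the test
 * reads from sysfs. Only the counters that are expected to change
 * (or to stay fixed) in assert_stats() are modeled here.
 */
struct hugetlb_counters {
	int free_hugepages;
	int nr_hugepages;
	int resv_hugepages;
};

/*
 * Expected counters after num_reserved pages have been reserved and
 * num_faulted of those pages have been allocated, mirroring the
 * arithmetic in assert_stats():
 *
 *   free = baseline free - faulted
 *   resv = baseline resv + reserved - faulted
 *
 * nr_hugepages is expected to stay at its baseline value.
 */
static struct hugetlb_counters
expected_counters(struct hugetlb_counters baseline, int num_reserved,
		  int num_faulted)
{
	struct hugetlb_counters c = baseline;

	c.free_hugepages = baseline.free_hugepages - num_faulted;
	c.resv_hugepages = baseline.resv_hugepages + num_reserved - num_faulted;

	return c;
}
```

With a baseline of 8 free pages and no reservations, creating the file
corresponds to expected_counters(baseline, 1, 0) and allocating the
page corresponds to expected_counters(baseline, 1, 1).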

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../kvm/guest_memfd_hugetlb_reporting_test.c  | 222 ++++++++++++++++++
 2 files changed, 223 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_hugetlb_reporting_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 48d32c5aa3eb..b3b7e83f39fc 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -134,6 +134,7 @@ TEST_GEN_PROGS_x86_64 += demand_paging_test
 TEST_GEN_PROGS_x86_64 += dirty_log_test
 TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
 TEST_GEN_PROGS_x86_64 += guest_memfd_test
+TEST_GEN_PROGS_x86_64 += guest_memfd_hugetlb_reporting_test
 TEST_GEN_PROGS_x86_64 += guest_print_test
 TEST_GEN_PROGS_x86_64 += hardware_disable_test
 TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
diff --git a/tools/testing/selftests/kvm/guest_memfd_hugetlb_reporting_test.c b/tools/testing/selftests/kvm/guest_memfd_hugetlb_reporting_test.c
new file mode 100644
index 000000000000..cb9fdf0d4ec8
--- /dev/null
+++ b/tools/testing/selftests/kvm/guest_memfd_hugetlb_reporting_test.c
@@ -0,0 +1,222 @@
+#include <fcntl.h>
+#include <linux/falloc.h>
+#include <linux/kvm.h>
+#include <linux/limits.h>
+#include <linux/memfd.h>
+#include <string.h>
+#include <sys/mman.h>
+
+#include "kvm_util.h"
+#include "test_util.h"
+#include "processor.h"
+
+static int read_int(const char *file_name)
+{
+	FILE *fp;
+	int num;
+
+	fp = fopen(file_name, "r");
+	TEST_ASSERT(fp != NULL, "Error opening file %s!\n", file_name);
+
+	TEST_ASSERT_EQ(fscanf(fp, "%d", &num), 1);
+
+	fclose(fp);
+
+	return num;
+}
+
+enum hugetlb_statistic {
+	FREE_HUGEPAGES,
+	NR_HUGEPAGES,
+	NR_OVERCOMMIT_HUGEPAGES,
+	RESV_HUGEPAGES,
+	SURPLUS_HUGEPAGES,
+	NR_TESTED_HUGETLB_STATISTICS,
+};
+
+static const char *hugetlb_statistics[NR_TESTED_HUGETLB_STATISTICS] = {
+	[FREE_HUGEPAGES] = "free_hugepages",
+	[NR_HUGEPAGES] = "nr_hugepages",
+	[NR_OVERCOMMIT_HUGEPAGES] = "nr_overcommit_hugepages",
+	[RESV_HUGEPAGES] = "resv_hugepages",
+	[SURPLUS_HUGEPAGES] = "surplus_hugepages",
+};
+
+enum test_page_size {
+	TEST_SZ_2M,
+	TEST_SZ_1G,
+	NR_TEST_SIZES,
+};
+
+struct test_param {
+	size_t page_size;
+	int memfd_create_flags;
+	int guest_memfd_flags;
+	const char *path_suffix;
+};
+
+static const struct test_param *test_params(enum test_page_size size)
+{
+	static const struct test_param params[] = {
+		[TEST_SZ_2M] = {
+			.page_size = PG_SIZE_2M,
+			.memfd_create_flags = MFD_HUGETLB | MFD_HUGE_2MB,
+			.guest_memfd_flags = KVM_GUEST_MEMFD_HUGETLB | KVM_GUEST_MEMFD_HUGE_2MB,
+			.path_suffix = "2048kB",
+		},
+		[TEST_SZ_1G] = {
+			.page_size = PG_SIZE_1G,
+			.memfd_create_flags = MFD_HUGETLB | MFD_HUGE_1GB,
+			.guest_memfd_flags = KVM_GUEST_MEMFD_HUGETLB | KVM_GUEST_MEMFD_HUGE_1GB,
+			.path_suffix = "1048576kB",
+		},
+	};
+
+	return &params[size];
+}
+
+static int read_statistic(enum test_page_size size, enum hugetlb_statistic statistic)
+{
+	char path[PATH_MAX] = "/sys/kernel/mm/hugepages/hugepages-";
+
+	strcat(path, test_params(size)->path_suffix);
+	strcat(path, "/");
+	strcat(path, hugetlb_statistics[statistic]);
+
+	return read_int(path);
+}
+
+static int baseline[NR_TEST_SIZES][NR_TESTED_HUGETLB_STATISTICS];
+
+static void establish_baseline(void)
+{
+	int i, j;
+
+	for (i = 0; i < NR_TEST_SIZES; ++i)
+		for (j = 0; j < NR_TESTED_HUGETLB_STATISTICS; ++j)
+			baseline[i][j] = read_statistic(i, j);
+}
+
+static void assert_stats_at_baseline(void)
+{
+	TEST_ASSERT_EQ(read_statistic(TEST_SZ_2M, FREE_HUGEPAGES),
+		       baseline[TEST_SZ_2M][FREE_HUGEPAGES]);
+	TEST_ASSERT_EQ(read_statistic(TEST_SZ_2M, NR_HUGEPAGES),
+		       baseline[TEST_SZ_2M][NR_HUGEPAGES]);
+	TEST_ASSERT_EQ(read_statistic(TEST_SZ_2M, NR_OVERCOMMIT_HUGEPAGES),
+		       baseline[TEST_SZ_2M][NR_OVERCOMMIT_HUGEPAGES]);
+	TEST_ASSERT_EQ(read_statistic(TEST_SZ_2M, RESV_HUGEPAGES),
+		       baseline[TEST_SZ_2M][RESV_HUGEPAGES]);
+	TEST_ASSERT_EQ(read_statistic(TEST_SZ_2M, SURPLUS_HUGEPAGES),
+		       baseline[TEST_SZ_2M][SURPLUS_HUGEPAGES]);
+
+	TEST_ASSERT_EQ(read_statistic(TEST_SZ_1G, FREE_HUGEPAGES),
+		       baseline[TEST_SZ_1G][FREE_HUGEPAGES]);
+	TEST_ASSERT_EQ(read_statistic(TEST_SZ_1G, NR_HUGEPAGES),
+		       baseline[TEST_SZ_1G][NR_HUGEPAGES]);
+	TEST_ASSERT_EQ(read_statistic(TEST_SZ_1G, NR_OVERCOMMIT_HUGEPAGES),
+		       baseline[TEST_SZ_1G][NR_OVERCOMMIT_HUGEPAGES]);
+	TEST_ASSERT_EQ(read_statistic(TEST_SZ_1G, RESV_HUGEPAGES),
+		       baseline[TEST_SZ_1G][RESV_HUGEPAGES]);
+	TEST_ASSERT_EQ(read_statistic(TEST_SZ_1G, SURPLUS_HUGEPAGES),
+		       baseline[TEST_SZ_1G][SURPLUS_HUGEPAGES]);
+}
+
+static void assert_stats(enum test_page_size size, int num_reserved, int num_faulted)
+{
+	TEST_ASSERT_EQ(read_statistic(size, FREE_HUGEPAGES),
+		       baseline[size][FREE_HUGEPAGES] - num_faulted);
+	TEST_ASSERT_EQ(read_statistic(size, NR_HUGEPAGES),
+		       baseline[size][NR_HUGEPAGES]);
+	TEST_ASSERT_EQ(read_statistic(size, NR_OVERCOMMIT_HUGEPAGES),
+		       baseline[size][NR_OVERCOMMIT_HUGEPAGES]);
+	TEST_ASSERT_EQ(read_statistic(size, RESV_HUGEPAGES),
+		       baseline[size][RESV_HUGEPAGES] + num_reserved - num_faulted);
+	TEST_ASSERT_EQ(read_statistic(size, SURPLUS_HUGEPAGES),
+		       baseline[size][SURPLUS_HUGEPAGES]);
+}
+
+/* Use hugetlb behavior as a baseline. guest_memfd should have comparable behavior. */
+static void test_hugetlb_behavior(enum test_page_size test_size)
+{
+	const struct test_param *param;
+	char *mem;
+	int memfd;
+
+	param = test_params(test_size);
+
+	assert_stats_at_baseline();
+
+	memfd = memfd_create("guest_memfd_hugetlb_reporting_test",
+			     param->memfd_create_flags);
+
+	mem = mmap(NULL, param->page_size, PROT_READ | PROT_WRITE,
+		   MAP_SHARED | MAP_HUGETLB, memfd, 0);
+	TEST_ASSERT(mem != MAP_FAILED, "Couldn't mmap()");
+
+	assert_stats(test_size, 1, 0);
+
+	*mem = 'A';
+
+	assert_stats(test_size, 1, 1);
+
+	munmap(mem, param->page_size);
+
+	assert_stats(test_size, 1, 1);
+
+	madvise(mem, param->page_size, MADV_DONTNEED);
+
+	assert_stats(test_size, 1, 1);
+
+	madvise(mem, param->page_size, MADV_REMOVE);
+
+	assert_stats(test_size, 1, 1);
+
+	close(memfd);
+
+	assert_stats_at_baseline();
+}
+
+static void test_guest_memfd_behavior(enum test_page_size test_size)
+{
+	const struct test_param *param;
+	struct kvm_vm *vm;
+	int guest_memfd;
+
+	param = test_params(test_size);
+
+	assert_stats_at_baseline();
+
+	vm = vm_create_barebones_type(KVM_X86_SW_PROTECTED_VM);
+
+	guest_memfd = vm_create_guest_memfd(vm, param->page_size,
+					    param->guest_memfd_flags);
+
+	assert_stats(test_size, 1, 0);
+
+	fallocate(guest_memfd, FALLOC_FL_KEEP_SIZE, 0, param->page_size);
+
+	assert_stats(test_size, 1, 1);
+
+	fallocate(guest_memfd, FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE, 0,
+		  param->page_size);
+
+	assert_stats(test_size, 1, 0);
+
+	close(guest_memfd);
+
+	assert_stats_at_baseline();
+
+	kvm_vm_free(vm);
+}
+
+int main(int argc, char *argv[])
+{
+	establish_baseline();
+
+	test_hugetlb_behavior(TEST_SZ_2M);
+	test_hugetlb_behavior(TEST_SZ_1G);
+
+	test_guest_memfd_behavior(TEST_SZ_2M);
+	test_guest_memfd_behavior(TEST_SZ_1G);
+}
-- 
2.46.0.598.g6f2099f65c-goog



* [RFC PATCH 22/39] mm: hugetlb: Expose vmemmap optimization functions
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (20 preceding siblings ...)
  2024-09-10 23:43 ` [RFC PATCH 21/39] KVM: selftests: Test that guest_memfd usage is reported via hugetlb Ackerley Tng
@ 2024-09-10 23:43 ` Ackerley Tng
  2024-09-10 23:43 ` [RFC PATCH 23/39] mm: hugetlb: Expose HugeTLB functions for promoting/demoting pages Ackerley Tng
                   ` (19 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:43 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

These functions will need to be used by guest_memfd when
splitting/reconstructing HugeTLB pages.
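
As a rough illustration of why restoring and re-optimizing the vmemmap
matters for 1G pages, the sketch below mirrors the arithmetic of
hugetlb_vmemmap_size() and hugetlb_vmemmap_optimizable_size() in
userspace. The constants are assumptions (4 KiB base pages, a 64-byte
struct page, one base page of vmemmap kept per huge page), and the
sketch_* names are hypothetical.

```c
#include <assert.h>

#define SKETCH_PAGE_SIZE		4096UL	/* assumed base page size */
#define SKETCH_STRUCT_PAGE_SIZE		64UL	/* assumed sizeof(struct page) */
/* Assumed amount of vmemmap that stays mapped per huge page. */
#define SKETCH_VMEMMAP_RESERVE_SIZE	SKETCH_PAGE_SIZE

/* Bytes of struct page metadata describing one huge page. */
static unsigned long sketch_vmemmap_size(unsigned long huge_page_size)
{
	return huge_page_size / SKETCH_PAGE_SIZE * SKETCH_STRUCT_PAGE_SIZE;
}

/* Bytes of that metadata that could be freed by the optimization. */
static unsigned long sketch_vmemmap_optimizable_size(unsigned long huge_page_size)
{
	unsigned long size = sketch_vmemmap_size(huge_page_size);

	if (size <= SKETCH_VMEMMAP_RESERVE_SIZE)
		return 0;

	return size - SKETCH_VMEMMAP_RESERVE_SIZE;
}
```

Under these assumptions a 1G page carries 16 MiB of struct pages, of
which all but 4 KiB is optimizable. That metadata has to be restored
before the page is split, since each 4K page needs its own struct page.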

Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Vishal Annapurve <vannapurve@google.com>
---
 include/linux/hugetlb.h | 14 ++++++++++++++
 mm/hugetlb_vmemmap.h    | 11 -----------
 2 files changed, 14 insertions(+), 11 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 752062044b0b..7ba4ed9e0001 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -284,6 +284,20 @@ bool is_hugetlb_entry_migration(pte_t pte);
 bool is_hugetlb_entry_hwpoisoned(pte_t pte);
 void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
 
+#ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
+int hugetlb_vmemmap_restore_folio(const struct hstate *h, struct folio *folio);
+void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio);
+#else
+static inline int hugetlb_vmemmap_restore_folio(const struct hstate *h, struct folio *folio)
+{
+	return 0;
+}
+
+static inline void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio)
+{
+}
+#endif
+
 #else /* !CONFIG_HUGETLB_PAGE */
 
 static inline void hugetlb_dup_vma_private(struct vm_area_struct *vma)
diff --git a/mm/hugetlb_vmemmap.h b/mm/hugetlb_vmemmap.h
index 2fcae92d3359..e702ace3b42f 100644
--- a/mm/hugetlb_vmemmap.h
+++ b/mm/hugetlb_vmemmap.h
@@ -18,11 +18,9 @@
 #define HUGETLB_VMEMMAP_RESERVE_PAGES	(HUGETLB_VMEMMAP_RESERVE_SIZE / sizeof(struct page))
 
 #ifdef CONFIG_HUGETLB_PAGE_OPTIMIZE_VMEMMAP
-int hugetlb_vmemmap_restore_folio(const struct hstate *h, struct folio *folio);
 long hugetlb_vmemmap_restore_folios(const struct hstate *h,
 					struct list_head *folio_list,
 					struct list_head *non_hvo_folios);
-void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio);
 void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list);
 
 static inline unsigned int hugetlb_vmemmap_size(const struct hstate *h)
@@ -43,11 +41,6 @@ static inline unsigned int hugetlb_vmemmap_optimizable_size(const struct hstate
 	return size > 0 ? size : 0;
 }
 #else
-static inline int hugetlb_vmemmap_restore_folio(const struct hstate *h, struct folio *folio)
-{
-	return 0;
-}
-
 static long hugetlb_vmemmap_restore_folios(const struct hstate *h,
 					struct list_head *folio_list,
 					struct list_head *non_hvo_folios)
@@ -56,10 +49,6 @@ static long hugetlb_vmemmap_restore_folios(const struct hstate *h,
 	return 0;
 }
 
-static inline void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct folio *folio)
-{
-}
-
 static inline void hugetlb_vmemmap_optimize_folios(struct hstate *h, struct list_head *folio_list)
 {
 }
-- 
2.46.0.598.g6f2099f65c-goog



* [RFC PATCH 23/39] mm: hugetlb: Expose HugeTLB functions for promoting/demoting pages
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (21 preceding siblings ...)
  2024-09-10 23:43 ` [RFC PATCH 22/39] mm: hugetlb: Expose vmemmap optimization functions Ackerley Tng
@ 2024-09-10 23:43 ` Ackerley Tng
  2024-09-10 23:43 ` [RFC PATCH 24/39] mm: hugetlb: Add functions to add/move/remove from hugetlb lists Ackerley Tng
                   ` (18 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:43 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

These functions will be used by guest_memfd to split/reconstruct
HugeTLB pages.

Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Vishal Annapurve <vannapurve@google.com>
---
 include/linux/hugetlb.h | 15 +++++++++++++++
 mm/hugetlb.c            |  8 ++------
 2 files changed, 17 insertions(+), 6 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 7ba4ed9e0001..ac9d4ada52bd 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -298,6 +298,21 @@ static inline void hugetlb_vmemmap_optimize_folio(const struct hstate *h, struct
 }
 #endif
 
+#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
+bool prep_compound_gigantic_folio(struct folio *folio, unsigned int order);
+void destroy_compound_gigantic_folio(struct folio *folio, unsigned int order);
+#else
+static inline bool prep_compound_gigantic_folio(struct folio *folio, unsigned int order)
+{
+	return false;
+}
+
+static inline void destroy_compound_gigantic_folio(struct folio *folio,
+						   unsigned int order)
+{
+}
+#endif
+
 #else /* !CONFIG_HUGETLB_PAGE */
 
 static inline void hugetlb_dup_vma_private(struct vm_area_struct *vma)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 372d8294fb2f..8f2b7b411b60 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1533,8 +1533,7 @@ static void destroy_compound_hugetlb_folio_for_demote(struct folio *folio,
 }
 
 #ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
-static void destroy_compound_gigantic_folio(struct folio *folio,
-					unsigned int order)
+void destroy_compound_gigantic_folio(struct folio *folio, unsigned int order)
 {
 	__destroy_compound_gigantic_folio(folio, order, false);
 }
@@ -1609,8 +1608,6 @@ static struct folio *alloc_gigantic_folio(struct hstate *h, gfp_t gfp_mask,
 }
 static inline void free_gigantic_folio(struct folio *folio,
 						unsigned int order) { }
-static inline void destroy_compound_gigantic_folio(struct folio *folio,
-						unsigned int order) { }
 #endif
 
 /*
@@ -2120,8 +2117,7 @@ static bool __prep_compound_gigantic_folio(struct folio *folio,
 	return false;
 }
 
-static bool prep_compound_gigantic_folio(struct folio *folio,
-							unsigned int order)
+bool prep_compound_gigantic_folio(struct folio *folio, unsigned int order)
 {
 	return __prep_compound_gigantic_folio(folio, order, false);
 }
-- 
2.46.0.598.g6f2099f65c-goog



* [RFC PATCH 24/39] mm: hugetlb: Add functions to add/move/remove from hugetlb lists
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (22 preceding siblings ...)
  2024-09-10 23:43 ` [RFC PATCH 23/39] mm: hugetlb: Expose HugeTLB functions for promoting/demoting pages Ackerley Tng
@ 2024-09-10 23:43 ` Ackerley Tng
  2024-09-10 23:43 ` [RFC PATCH 25/39] KVM: guest_memfd: Split HugeTLB pages for guest_memfd use Ackerley Tng
                   ` (17 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:43 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

These functions are introduced in hugetlb.c so the private
hugetlb_lock can be accessed.

hugetlb_lock is reused for this PoC, but a separate lock should be
used in a future revision to avoid interference due to hash collisions
with HugeTLB's usage of this lock.
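
The pattern used by the three helpers (every list manipulation happens
with one lock held) can be sketched in userspace as below. This is an
analogy only: a pthread mutex stands in for hugetlb_lock, a minimal
circular list stands in for struct list_head, and all sketch_* names
are hypothetical. Unlike the kernel's list_del(), the delete helper
here re-initializes the entry instead of poisoning its pointers.

```c
#include <assert.h>
#include <pthread.h>

/* Minimal circular doubly linked list, standing in for struct list_head. */
struct sketch_list_head {
	struct sketch_list_head *next, *prev;
};

/* Stand-in for hugetlb_lock: one lock serializes all list updates. */
static pthread_mutex_t sketch_lock = PTHREAD_MUTEX_INITIALIZER;

static void sketch_list_init(struct sketch_list_head *head)
{
	head->next = head;
	head->prev = head;
}

/* Unlink @entry from whatever list it is on. Caller holds sketch_lock. */
static void sketch_unlink(struct sketch_list_head *entry)
{
	entry->prev->next = entry->next;
	entry->next->prev = entry->prev;
}

/* Insert @entry right after @head, like list_add(). */
static void sketch_list_add(struct sketch_list_head *entry,
			    struct sketch_list_head *head)
{
	pthread_mutex_lock(&sketch_lock);
	entry->next = head->next;
	entry->prev = head;
	head->next->prev = entry;
	head->next = entry;
	pthread_mutex_unlock(&sketch_lock);
}

/* Unlink @entry and append it to @head, like list_move_tail(). */
static void sketch_list_move(struct sketch_list_head *entry,
			     struct sketch_list_head *head)
{
	pthread_mutex_lock(&sketch_lock);
	sketch_unlink(entry);
	entry->prev = head->prev;
	entry->next = head;
	head->prev->next = entry;
	head->prev = entry;
	pthread_mutex_unlock(&sketch_lock);
}

/* Unlink @entry and leave it as an empty list of its own. */
static void sketch_list_del(struct sketch_list_head *entry)
{
	pthread_mutex_lock(&sketch_lock);
	sketch_unlink(entry);
	sketch_list_init(entry);
	pthread_mutex_unlock(&sketch_lock);
}
```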

Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Vishal Annapurve <vannapurve@google.com>

---
 include/linux/hugetlb.h |  3 +++
 mm/hugetlb.c            | 21 +++++++++++++++++++++
 2 files changed, 24 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ac9d4ada52bd..0f3f920ad608 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -164,6 +164,9 @@ bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
 						vm_flags_t vm_flags);
 long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
 						long freed);
+void hugetlb_folio_list_add(struct folio *folio, struct list_head *list);
+void hugetlb_folio_list_move(struct folio *folio, struct list_head *list);
+void hugetlb_folio_list_del(struct folio *folio);
 bool isolate_hugetlb(struct folio *folio, struct list_head *list);
 int get_hwpoison_hugetlb_folio(struct folio *folio, bool *hugetlb, bool unpoison);
 int get_huge_page_for_hwpoison(unsigned long pfn, int flags,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8f2b7b411b60..60e72214d5bf 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -7264,6 +7264,27 @@ long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
 	return 0;
 }
 
+void hugetlb_folio_list_add(struct folio *folio, struct list_head *list)
+{
+	spin_lock_irq(&hugetlb_lock);
+	list_add(&folio->lru, list);
+	spin_unlock_irq(&hugetlb_lock);
+}
+
+void hugetlb_folio_list_move(struct folio *folio, struct list_head *list)
+{
+	spin_lock_irq(&hugetlb_lock);
+	list_move_tail(&folio->lru, list);
+	spin_unlock_irq(&hugetlb_lock);
+}
+
+void hugetlb_folio_list_del(struct folio *folio)
+{
+	spin_lock_irq(&hugetlb_lock);
+	list_del(&folio->lru);
+	spin_unlock_irq(&hugetlb_lock);
+}
+
 #ifdef CONFIG_ARCH_WANT_HUGE_PMD_SHARE
 static unsigned long page_table_shareable(struct vm_area_struct *svma,
 				struct vm_area_struct *vma,
-- 
2.46.0.598.g6f2099f65c-goog



* [RFC PATCH 25/39] KVM: guest_memfd: Split HugeTLB pages for guest_memfd use
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (23 preceding siblings ...)
  2024-09-10 23:43 ` [RFC PATCH 24/39] mm: hugetlb: Add functions to add/move/remove from hugetlb lists Ackerley Tng
@ 2024-09-10 23:43 ` Ackerley Tng
  2024-09-10 23:43 ` [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private Ackerley Tng
                   ` (16 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:43 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

From: Vishal Annapurve <vannapurve@google.com>

Split newly allocated HugeTLB pages into 4K regular pages before
providing them to the requester (fallocate() or KVM).

The 4K pages are reconstructed/merged back into HugeTLB pages before
those pages are returned to HugeTLB.

This is an intermediate step to build page splitting/merging
functionality before allowing guest_memfd files to be mmap()ed.
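
Because every 4K page of a split HugeTLB page is added to the filemap
individually, the index handling is worth spelling out. The userspace
sketch below shows the index arithmetic, assuming indices are in units
of 4K pages and order is log2 of the number of 4K pages per huge page;
the sketch_* names are hypothetical and this is not the kernel
implementation.

```c
#include <assert.h>

/*
 * Index of the first 4K page of the huge page that contains @index.
 * The split pages then occupy [first, first + (1UL << order)).
 */
static unsigned long sketch_first_subpage_index(unsigned long index,
						unsigned int order)
{
	unsigned long pages_per_huge_page = 1UL << order;

	return index & ~(pages_per_huge_page - 1);
}

/*
 * Whether @index is aligned to a huge page boundary. The mask is
 * built from the number of pages per huge page, not from the order
 * itself.
 */
static int sketch_is_huge_page_aligned(unsigned long index,
				       unsigned int order)
{
	return (index & ((1UL << order) - 1)) == 0;
}
```

For a 2M page (order 9), a request for index 1000 maps to a first
subpage index of 512, and the split pages are inserted at indices 512
through 1023.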

Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Vishal Annapurve <vannapurve@google.com>

---
 virt/kvm/guest_memfd.c | 299 ++++++++++++++++++++++++++++++++++++++---
 1 file changed, 281 insertions(+), 18 deletions(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index eacbfdb950d1..8151df2c03e5 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -229,31 +229,206 @@ static int kvm_gmem_hugetlb_filemap_add_folio(struct address_space *mapping,
 	return 0;
 }
 
+struct kvm_gmem_split_stash {
+	struct {
+		unsigned long _flags_2;
+		unsigned long _head_2;
+
+		void *_hugetlb_subpool;
+		void *_hugetlb_cgroup;
+		void *_hugetlb_cgroup_rsvd;
+		void *_hugetlb_hwpoison;
+	};
+	void *hugetlb_private;
+};
+
+static int kvm_gmem_hugetlb_stash_metadata(struct folio *folio)
+{
+	struct kvm_gmem_split_stash *stash;
+
+	stash = kmalloc(sizeof(*stash), GFP_KERNEL);
+	if (!stash)
+		return -ENOMEM;
+
+	stash->_flags_2 = folio->_flags_2;
+	stash->_head_2 = folio->_head_2;
+	stash->_hugetlb_subpool = folio->_hugetlb_subpool;
+	stash->_hugetlb_cgroup = folio->_hugetlb_cgroup;
+	stash->_hugetlb_cgroup_rsvd = folio->_hugetlb_cgroup_rsvd;
+	stash->_hugetlb_hwpoison = folio->_hugetlb_hwpoison;
+	stash->hugetlb_private = folio_get_private(folio);
+
+	folio_change_private(folio, (void *)stash);
+
+	return 0;
+}
+
+static int kvm_gmem_hugetlb_unstash_metadata(struct folio *folio)
+{
+	struct kvm_gmem_split_stash *stash;
+
+	stash = folio_get_private(folio);
+
+	if (!stash)
+		return -EINVAL;
+
+	folio->_flags_2 = stash->_flags_2;
+	folio->_head_2 = stash->_head_2;
+	folio->_hugetlb_subpool = stash->_hugetlb_subpool;
+	folio->_hugetlb_cgroup = stash->_hugetlb_cgroup;
+	folio->_hugetlb_cgroup_rsvd = stash->_hugetlb_cgroup_rsvd;
+	folio->_hugetlb_hwpoison = stash->_hugetlb_hwpoison;
+	folio_change_private(folio, stash->hugetlb_private);
+
+	kfree(stash);
+
+	return 0;
+}
+
+/**
+ * Reconstruct a HugeTLB folio from a contiguous block of folios where the first
+ * of the contiguous folios is @folio.
+ *
+ * The size of the contiguous block is of huge_page_size(@h). All the folios in
+ * the block are checked to have a refcount of 1 before reconstruction. After
+ * reconstruction, the reconstructed folio has a refcount of 1.
+ *
+ * Return 0 on success and negative error otherwise.
+ */
+static int kvm_gmem_hugetlb_reconstruct_folio(struct hstate *h, struct folio *folio)
+{
+	int ret;
+
+	WARN_ON((folio->index & (pages_per_huge_page(h) - 1)) != 0);
+
+	ret = kvm_gmem_hugetlb_unstash_metadata(folio);
+	if (ret)
+		return ret;
+
+	if (!prep_compound_gigantic_folio(folio, huge_page_order(h))) {
+		kvm_gmem_hugetlb_stash_metadata(folio);
+		return -ENOMEM;
+	}
+
+	__folio_set_hugetlb(folio);
+
+	folio_set_count(folio, 1);
+
+	hugetlb_vmemmap_optimize_folio(h, folio);
+
+	return 0;
+}
+
+/* Basically folio_set_order(folio, order) without the checks. */
+static inline void kvm_gmem_folio_set_order(struct folio *folio, unsigned int order)
+{
+	folio->_flags_1 = (folio->_flags_1 & ~0xffUL) | order;
+#ifdef CONFIG_64BIT
+	folio->_folio_nr_pages = 1U << order;
+#endif
+}
+
+/**
+ * Split a HugeTLB @folio of size huge_page_size(@h).
+ *
+ * After splitting, each split folio has a refcount of 1. There are no checks on
+ * refcounts before splitting.
+ *
+ * Return 0 on success and negative error otherwise.
+ */
+static int kvm_gmem_hugetlb_split_folio(struct hstate *h, struct folio *folio)
+{
+	int ret;
+
+	ret = hugetlb_vmemmap_restore_folio(h, folio);
+	if (ret)
+		return ret;
+
+	ret = kvm_gmem_hugetlb_stash_metadata(folio);
+	if (ret) {
+		hugetlb_vmemmap_optimize_folio(h, folio);
+		return ret;
+	}
+
+	kvm_gmem_folio_set_order(folio, 0);
+
+	destroy_compound_gigantic_folio(folio, huge_page_order(h));
+	__folio_clear_hugetlb(folio);
+
+	/*
+	 * Remove the first folio from h->hugepage_activelist since it is no
+	 * longer a HugeTLB page. The other split pages should not be on any
+	 * lists.
+	 */
+	hugetlb_folio_list_del(folio);
+
+	return 0;
+}
+
 static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
 							    pgoff_t index)
 {
+	struct folio *allocated_hugetlb_folio;
+	pgoff_t hugetlb_first_subpage_index;
+	struct page *hugetlb_first_subpage;
 	struct kvm_gmem_hugetlb *hgmem;
-	struct folio *folio;
+	struct page *requested_page;
 	int ret;
+	int i;
 
 	hgmem = kvm_gmem_hgmem(inode);
-	folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
-	if (IS_ERR(folio))
-		return folio;
+	allocated_hugetlb_folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
+	if (IS_ERR(allocated_hugetlb_folio))
+		return allocated_hugetlb_folio;
+
+	requested_page = folio_file_page(allocated_hugetlb_folio, index);
+	hugetlb_first_subpage = folio_file_page(allocated_hugetlb_folio, 0);
+	hugetlb_first_subpage_index = index & (huge_page_mask(hgmem->h) >> PAGE_SHIFT);
 
-	/* TODO: Fix index here to be aligned to huge page size. */
-	ret = kvm_gmem_hugetlb_filemap_add_folio(
-		inode->i_mapping, folio, index, htlb_alloc_mask(hgmem->h));
+	ret = kvm_gmem_hugetlb_split_folio(hgmem->h, allocated_hugetlb_folio);
 	if (ret) {
-		folio_put(folio);
+		folio_put(allocated_hugetlb_folio);
 		return ERR_PTR(ret);
 	}
 
+	for (i = 0; i < pages_per_huge_page(hgmem->h); ++i) {
+		struct folio *folio = page_folio(nth_page(hugetlb_first_subpage, i));
+
+		ret = kvm_gmem_hugetlb_filemap_add_folio(inode->i_mapping,
+							 folio,
+							 hugetlb_first_subpage_index + i,
+							 htlb_alloc_mask(hgmem->h));
+		if (ret) {
+			/* TODO: handle cleanup properly. */
+			pr_err("Handle cleanup properly index=%lx, ret=%d\n",
+			       hugetlb_first_subpage_index + i, ret);
+			dump_page(nth_page(hugetlb_first_subpage, i), "check");
+			return ERR_PTR(ret);
+		}
+
+		/*
+		 * Skip unlocking for the requested index since
+		 * kvm_gmem_get_folio() returns a locked folio.
+		 *
+		 * Do folio_put() to drop the refcount that came with the folio
+		 * from splitting. Each split folio starts with a refcount of 1,
+		 * in line with hugetlb_alloc_folio(), which returns a folio
+		 * with refcount 1.
+		 *
+		 * Skip folio_put() for requested index since
+		 * kvm_gmem_get_folio() returns a folio with refcount 1.
+		 */
+		if (hugetlb_first_subpage_index + i != index) {
+			folio_unlock(folio);
+			folio_put(folio);
+		}
+	}
+
 	spin_lock(&inode->i_lock);
 	inode->i_blocks += blocks_per_huge_page(hgmem->h);
 	spin_unlock(&inode->i_lock);
 
-	return folio;
+	return page_folio(requested_page);
 }
 
 static struct folio *kvm_gmem_get_hugetlb_folio(struct inode *inode,
@@ -365,7 +540,9 @@ static inline void kvm_gmem_hugetlb_filemap_remove_folio(struct folio *folio)
 
 /**
  * Removes folios in range [@lstart, @lend) from page cache/filemap (@mapping),
- * returning the number of pages freed.
+ * returning the number of HugeTLB pages freed.
+ *
+ * @lend - @lstart must be a multiple of the HugeTLB page size.
  */
 static int kvm_gmem_hugetlb_filemap_remove_folios(struct address_space *mapping,
 						  struct hstate *h,
@@ -373,37 +550,69 @@ static int kvm_gmem_hugetlb_filemap_remove_folios(struct address_space *mapping,
 {
 	const pgoff_t end = lend >> PAGE_SHIFT;
 	pgoff_t next = lstart >> PAGE_SHIFT;
+	LIST_HEAD(folios_to_reconstruct);
 	struct folio_batch fbatch;
+	struct folio *folio, *tmp;
 	int num_freed = 0;
+	int i;
 
+	/*
+	 * TODO: Iterate over huge_page_size(h) blocks to avoid taking and
+	 * releasing hugetlb_fault_mutex_table[hash] lock so often. When
+	 * truncating, lstart and lend should be clipped to the size of this
+	 * guest_memfd file, otherwise there would be too many iterations.
+	 */
 	folio_batch_init(&fbatch);
 	while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
-		int i;
 		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
 			struct folio *folio;
 			pgoff_t hindex;
 			u32 hash;
 
 			folio = fbatch.folios[i];
+
 			hindex = folio->index >> huge_page_order(h);
 			hash = hugetlb_fault_mutex_hash(mapping, hindex);
-
 			mutex_lock(&hugetlb_fault_mutex_table[hash]);
+
+			/*
+			 * Collect first pages of HugeTLB folios for
+			 * reconstruction later.
+			 */
+			if ((folio->index & ~(huge_page_mask(h) >> PAGE_SHIFT)) == 0)
+				list_add(&folio->lru, &folios_to_reconstruct);
+
+			/*
+			 * Before removing from filemap, take a reference so
+			 * sub-folios don't get freed. Don't free the sub-folios
+			 * until after reconstruction.
+			 */
+			folio_get(folio);
+
 			kvm_gmem_hugetlb_filemap_remove_folio(folio);
-			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 
-			num_freed++;
+			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 		}
 		folio_batch_release(&fbatch);
 		cond_resched();
 	}
 
+	list_for_each_entry_safe(folio, tmp, &folios_to_reconstruct, lru) {
+		kvm_gmem_hugetlb_reconstruct_folio(h, folio);
+		hugetlb_folio_list_move(folio, &h->hugepage_activelist);
+
+		folio_put(folio);
+		num_freed++;
+	}
+
 	return num_freed;
 }
 
 /**
  * Removes folios in range [@lstart, @lend) from page cache of inode, updates
  * inode metadata and hugetlb reservations.
+ *
+ * @lend - @lstart must be a multiple of the HugeTLB page size.
  */
 static void kvm_gmem_hugetlb_truncate_folios_range(struct inode *inode,
 						   loff_t lstart, loff_t lend)
@@ -427,6 +636,56 @@ static void kvm_gmem_hugetlb_truncate_folios_range(struct inode *inode,
 	spin_unlock(&inode->i_lock);
 }
 
+/**
+ * Zeroes offsets [@start, @end) in a folio from @mapping.
+ *
+ * [@start, @end) must be within the same folio.
+ */
+static void kvm_gmem_zero_partial_page(
+	struct address_space *mapping, loff_t start, loff_t end)
+{
+	struct folio *folio;
+	pgoff_t idx = start >> PAGE_SHIFT;
+
+	folio = filemap_lock_folio(mapping, idx);
+	if (IS_ERR(folio))
+		return;
+
+	start = offset_in_folio(folio, start);
+	end = offset_in_folio(folio, end);
+	if (!end)
+		end = folio_size(folio);
+
+	folio_zero_segment(folio, (size_t)start, (size_t)end);
+	folio_unlock(folio);
+	folio_put(folio);
+}
+
+/**
+ * Zeroes all pages in range [@start, @end) in @mapping.
+ *
+ * hugetlb_zero_partial_page() would work if this had been a full page, but is
+ * not suitable since the pages have been split.
+ *
+ * truncate_inode_pages_range() isn't the right function because it removes
+ * pages from the page cache; this function only zeroes the pages.
+ */
+static void kvm_gmem_hugetlb_zero_split_pages(struct address_space *mapping,
+					      loff_t start, loff_t end)
+{
+	loff_t aligned_start;
+	loff_t index;
+
+	aligned_start = round_up(start, PAGE_SIZE);
+
+	kvm_gmem_zero_partial_page(mapping, start, min(aligned_start, end));
+
+	for (index = aligned_start; index < end; index += PAGE_SIZE) {
+		kvm_gmem_zero_partial_page(mapping, index,
+					   min((loff_t)(index + PAGE_SIZE), end));
+	}
+}
+
 static void kvm_gmem_hugetlb_truncate_range(struct inode *inode, loff_t lstart,
 					    loff_t lend)
 {
@@ -442,8 +701,8 @@ static void kvm_gmem_hugetlb_truncate_range(struct inode *inode, loff_t lstart,
 	full_hpage_end = round_down(lend, hsize);
 
 	if (lstart < full_hpage_start) {
-		hugetlb_zero_partial_page(h, inode->i_mapping, lstart,
-					  full_hpage_start);
+		kvm_gmem_hugetlb_zero_split_pages(inode->i_mapping, lstart,
+						  full_hpage_start);
 	}
 
 	if (full_hpage_end > full_hpage_start) {
@@ -452,8 +711,8 @@ static void kvm_gmem_hugetlb_truncate_range(struct inode *inode, loff_t lstart,
 	}
 
 	if (lend > full_hpage_end) {
-		hugetlb_zero_partial_page(h, inode->i_mapping, full_hpage_end,
-					  lend);
+		kvm_gmem_hugetlb_zero_split_pages(inode->i_mapping, full_hpage_end,
+						  lend);
 	}
 }
 
@@ -1060,6 +1319,10 @@ __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
 
 	if (folio_test_hwpoison(folio)) {
 		folio_unlock(folio);
+		/*
+		 * TODO: this folio may be part of a HugeTLB folio. Perhaps
+		 * reconstruct and then free page?
+		 */
 		folio_put(folio);
 		return ERR_PTR(-EHWPOISON);
 	}
-- 
2.46.0.598.g6f2099f65c-goog



* [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (24 preceding siblings ...)
  2024-09-10 23:43 ` [RFC PATCH 25/39] KVM: guest_memfd: Split HugeTLB pages for guest_memfd use Ackerley Tng
@ 2024-09-10 23:43 ` Ackerley Tng
  2024-10-10 16:06   ` Peter Xu
  2025-02-25 20:37   ` Peter Xu
  2024-09-10 23:43 ` [RFC PATCH 27/39] KVM: guest_memfd: Allow mmapping guest_memfd files Ackerley Tng
                   ` (15 subsequent siblings)
  41 siblings, 2 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:43 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

The faultability xarray is stored on the inode since faultability is a
property of the guest_memfd's memory contents.

In this RFC, presence of an entry in the xarray indicates faultable,
but this could be flipped so that presence indicates unfaultable. For
flexibility, a special value "FAULT" is used instead of a simple
boolean.

However, at some stages of a VM's lifecycle there could be more
private pages, and at other stages there could be more shared pages.

This is likely to be replaced by a better data structure in a future
revision to better support ranges.

Also store struct kvm_gmem_hugetlb as a pointer in struct
kvm_gmem_inode_private, which is now what
inode->i_mapping->i_private_data points to.
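
To illustrate the presence-means-faultable convention, here is a
minimal userspace sketch, not the kernel code. It assumes a fixed-size
bitmap as a stand-in for the xarray; TOY_NR_PAGES, struct
toy_faultability, toy_set_faultable() and toy_is_faultable() are
hypothetical names used only for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_PAGES 64  /* hypothetical file size, in pages */

/* Stand-in for the per-inode xarray: bit set == entry present. */
struct toy_faultability {
	unsigned char present[TOY_NR_PAGES / 8];
};

/* Update [start, end) one index at a time, like the xa_store() loop. */
static void toy_set_faultable(struct toy_faultability *f, size_t start,
			      size_t end, bool faultable)
{
	size_t i;

	assert(start <= end && end <= TOY_NR_PAGES);
	for (i = start; i < end; i++) {
		unsigned char bit = (unsigned char)(1u << (i % 8));

		if (faultable)
			f->present[i / 8] |= bit;
		else
			f->present[i / 8] &= (unsigned char)~bit;
	}
}

/* A page is faultable only if its entry is present. */
static bool toy_is_faultable(const struct toy_faultability *f, size_t index)
{
	assert(index < TOY_NR_PAGES);
	return (f->present[index / 8] & (1u << (index % 8))) != 0;
}
```

Because each index is updated individually, marking [5, 10) faultable
and then clearing [7, 12) leaves indices 5 and 6 faultable.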

Co-developed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Vishal Annapurve <vannapurve@google.com>

---
 virt/kvm/guest_memfd.c | 105 ++++++++++++++++++++++++++++++++++++-----
 1 file changed, 94 insertions(+), 11 deletions(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 8151df2c03e5..b603518f7b62 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -26,11 +26,21 @@ struct kvm_gmem_hugetlb {
 	struct hugepage_subpool *spool;
 };
 
-static struct kvm_gmem_hugetlb *kvm_gmem_hgmem(struct inode *inode)
+struct kvm_gmem_inode_private {
+	struct xarray faultability;
+	struct kvm_gmem_hugetlb *hgmem;
+};
+
+static struct kvm_gmem_inode_private *kvm_gmem_private(struct inode *inode)
 {
 	return inode->i_mapping->i_private_data;
 }
 
+static struct kvm_gmem_hugetlb *kvm_gmem_hgmem(struct inode *inode)
+{
+	return kvm_gmem_private(inode)->hgmem;
+}
+
 static bool is_kvm_gmem_hugetlb(struct inode *inode)
 {
 	u64 flags = (u64)inode->i_private;
@@ -38,6 +48,57 @@ static bool is_kvm_gmem_hugetlb(struct inode *inode)
 	return flags & KVM_GUEST_MEMFD_HUGETLB;
 }
 
+#define KVM_GMEM_FAULTABILITY_VALUE 0x4641554c54  /* FAULT */
+
+/**
+ * Set faultability of given range of inode indices [@start, @end) to
+ * @faultable. Return 0 if attributes were successfully updated or negative
+ * errno on error.
+ */
+static int kvm_gmem_set_faultable(struct inode *inode, pgoff_t start, pgoff_t end,
+				  bool faultable)
+{
+	struct xarray *faultability;
+	void *val;
+	pgoff_t i;
+
+	/*
+	 * The expectation is that fewer pages are faultable, hence to save
+	 * memory, entries are created for faultable pages as opposed to
+	 * creating entries for non-faultable pages.
+	 */
+	val = faultable ? xa_mk_value(KVM_GMEM_FAULTABILITY_VALUE) : NULL;
+	faultability = &kvm_gmem_private(inode)->faultability;
+
+	/*
+	 * TODO replace this with something else (maybe interval
+	 * tree?). store_range doesn't quite do what we expect if overlapping
+	 * ranges are specified: if we store_range(5, 10, val) and then
+	 * store_range(7, 12, NULL), the entire range [5, 12] will be NULL.  For
+	 * now, use the slower xa_store() to store individual entries on indices
+	 * to avoid this.
+	 */
+	for (i = start; i < end; i++) {
+		int r;
+
+		r = xa_err(xa_store(faultability, i, val, GFP_KERNEL_ACCOUNT));
+		if (r)
+			return r;
+	}
+
+	return 0;
+}
+
+/**
+ * Return true if the page at @index is allowed to be faulted in.
+ */
+static bool kvm_gmem_is_faultable(struct inode *inode, pgoff_t index)
+{
+	struct xarray *faultability = &kvm_gmem_private(inode)->faultability;
+
+	return xa_to_value(xa_load(faultability, index)) == KVM_GMEM_FAULTABILITY_VALUE;
+}
+
 /**
  * folio_file_pfn - like folio_file_page, but return a pfn.
  * @folio: The folio which contains this index.
@@ -895,11 +956,21 @@ static void kvm_gmem_hugetlb_teardown(struct inode *inode)
 
 static void kvm_gmem_evict_inode(struct inode *inode)
 {
+	struct kvm_gmem_inode_private *private = kvm_gmem_private(inode);
+
+	/*
+	 * .evict_inode can be called before faultability is set up if there are
+	 * issues during inode creation.
+	 */
+	if (private)
+		xa_destroy(&private->faultability);
+
 	if (is_kvm_gmem_hugetlb(inode))
 		kvm_gmem_hugetlb_teardown(inode);
 	else
 		truncate_inode_pages_final(inode->i_mapping);
 
+	kfree(private);
 	clear_inode(inode);
 }
 
@@ -1028,7 +1099,9 @@ static const struct inode_operations kvm_gmem_iops = {
 	.setattr	= kvm_gmem_setattr,
 };
 
-static int kvm_gmem_hugetlb_setup(struct inode *inode, loff_t size, u64 flags)
+static int kvm_gmem_hugetlb_setup(struct inode *inode,
+				  struct kvm_gmem_inode_private *private,
+				  loff_t size, u64 flags)
 {
 	struct kvm_gmem_hugetlb *hgmem;
 	struct hugepage_subpool *spool;
@@ -1036,6 +1109,10 @@ static int kvm_gmem_hugetlb_setup(struct inode *inode, loff_t size, u64 flags)
 	struct hstate *h;
 	long hpages;
 
+	hgmem = kzalloc(sizeof(*hgmem), GFP_KERNEL);
+	if (!hgmem)
+		return -ENOMEM;
+
 	page_size_log = (flags >> KVM_GUEST_MEMFD_HUGE_SHIFT) & KVM_GUEST_MEMFD_HUGE_MASK;
 	h = hstate_sizelog(page_size_log);
 
@@ -1046,21 +1123,16 @@ static int kvm_gmem_hugetlb_setup(struct inode *inode, loff_t size, u64 flags)
 	if (!spool)
 		goto err;
 
-	hgmem = kzalloc(sizeof(*hgmem), GFP_KERNEL);
-	if (!hgmem)
-		goto err_subpool;
-
 	inode->i_blkbits = huge_page_shift(h);
 
 	hgmem->h = h;
 	hgmem->spool = spool;
-	inode->i_mapping->i_private_data = hgmem;
 
+	private->hgmem = hgmem;
 	return 0;
 
-err_subpool:
-	kfree(spool);
 err:
+	kfree(hgmem);
 	return -ENOMEM;
 }
 
@@ -1068,6 +1140,7 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
 						      loff_t size, u64 flags)
 {
 	const struct qstr qname = QSTR_INIT(name, strlen(name));
+	struct kvm_gmem_inode_private *private;
 	struct inode *inode;
 	int err;
 
@@ -1079,12 +1152,20 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
 	if (err)
 		goto out;
 
+	err = -ENOMEM;
+	private = kzalloc(sizeof(*private), GFP_KERNEL);
+	if (!private)
+		goto out;
+
 	if (flags & KVM_GUEST_MEMFD_HUGETLB) {
-		err = kvm_gmem_hugetlb_setup(inode, size, flags);
+		err = kvm_gmem_hugetlb_setup(inode, private, size, flags);
 		if (err)
-			goto out;
+			goto free_private;
 	}
 
+	xa_init(&private->faultability);
+	inode->i_mapping->i_private_data = private;
+
 	inode->i_private = (void *)(unsigned long)flags;
 	inode->i_op = &kvm_gmem_iops;
 	inode->i_mapping->a_ops = &kvm_gmem_aops;
@@ -1097,6 +1178,8 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
 
 	return inode;
 
+free_private:
+	kfree(private);
 out:
 	iput(inode);
 
-- 
2.46.0.598.g6f2099f65c-goog



* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-09-10 23:43 ` [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private Ackerley Tng
@ 2024-10-10 16:06   ` Peter Xu
  2024-10-11 23:32     ` Ackerley Tng
  2025-02-25 20:37   ` Peter Xu
  1 sibling, 1 reply; 130+ messages in thread
From: Peter Xu @ 2024-10-10 16:06 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: tabba, quic_eberman, roypat, jgg, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest, linux-fsdevel

On Tue, Sep 10, 2024 at 11:43:57PM +0000, Ackerley Tng wrote:
> The faultability xarray is stored on the inode since faultability is a
> property of the guest_memfd's memory contents.
> 
> In this RFC, presence of an entry in the xarray indicates faultable,
> but this could be flipped so that presence indicates unfaultable. For
> flexibility, a special value "FAULT" is used instead of a simple
> boolean.
> 
> However, at some stages of a VM's lifecycle there could be more
> private pages, and at other stages there could be more shared pages.
> 
> This is likely to be replaced by a better data structure in a future
> revision to better support ranges.
> 
> Also store struct kvm_gmem_hugetlb in struct kvm_gmem_hugetlb as a
> pointer. inode->i_mapping->i_private_data.

Could you help explain the difference between faultability and the
existing KVM_MEMORY_ATTRIBUTE_PRIVATE?  Not sure if I'm the only one who's
confused; if not, it might be good to enrich the commit message.

The latter is per-slot, so one level higher, however I don't think it's a
common use case for mapping the same gmemfd in multiple slots anyway for
KVM (besides corner cases like live upgrade).  So perhaps this is not about
layering but something else?  For example, any use case where PRIVATE and
FAULTABLE can be reported with different values.

Another higher level question is, is there any plan to support non-CoCo
context for 1G?

I saw that you also mentioned you have working QEMU prototypes ready in
another email.  It'll be great if you can push your kernel/QEMU's latest
tree (including all dependency patches) somewhere so anyone can have a
closer look, or play with it.

Thanks,

-- 
Peter Xu


* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-10-10 16:06   ` Peter Xu
@ 2024-10-11 23:32     ` Ackerley Tng
  2024-10-15 21:34       ` Peter Xu
  0 siblings, 1 reply; 130+ messages in thread
From: Ackerley Tng @ 2024-10-11 23:32 UTC (permalink / raw)
  To: Peter Xu
  Cc: tabba, quic_eberman, roypat, jgg, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, erdemaktas, vannapurve, qperret,
	jhubbard, willy, shuah, brauner, bfoster, kent.overstreet, pvorel,
	rppt, richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest

Peter Xu <peterx@redhat.com> writes:

> On Tue, Sep 10, 2024 at 11:43:57PM +0000, Ackerley Tng wrote:
>> The faultability xarray is stored on the inode since faultability is a
>> property of the guest_memfd's memory contents.
>> 
>> In this RFC, presence of an entry in the xarray indicates faultable,
>> but this could be flipped so that presence indicates unfaultable. For
>> flexibility, a special value "FAULT" is used instead of a simple
>> boolean.
>> 
>> However, at some stages of a VM's lifecycle there could be more
>> private pages, and at other stages there could be more shared pages.
>> 
>> This is likely to be replaced by a better data structure in a future
>> revision to better support ranges.
>> 
>> Also store struct kvm_gmem_hugetlb in struct kvm_gmem_hugetlb as a
>> pointer. inode->i_mapping->i_private_data.
>
> Could you help explain the difference between faultability v.s. the
> existing KVM_MEMORY_ATTRIBUTE_PRIVATE?  Not sure if I'm the only one who's
> confused, otherwise might be good to enrich the commit message.

Thank you for this question, I'll add this to the commit message in the
next revision if Fuad's patch set [1] doesn't make it first.

Reason (a): To elaborate on the explanation in [1],
KVM_MEMORY_ATTRIBUTE_PRIVATE is whether userspace wants this page to be
private or shared, and faultability is whether the page is allowed to be
faulted in by userspace.

These two are similar but may not be the same thing. In pKVM, pKVM
cannot trust userspace's configuration of private/shared, and other
information will go into determining the private/shared setting in
faultability.

Perhaps Fuad can elaborate more here.

Reason (b): In this patch series (mostly focusing on x86 first), we're
using faultability to prevent any future faults before checking that
there are no mappings.

Having a different xarray from mem_attr_array allows us to disable
faulting before committing to changing mem_attr_array. Please see
`kvm_gmem_should_set_attributes_private()` in this patch [2].
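
The ordering described in reason (b) can be sketched in userspace as
follows. This is only an illustration under stated assumptions, not the
code in [2]: struct toy_page and toy_convert_to_private() are
hypothetical, and restoring faultability and returning -EBUSY on
failure are assumptions made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

struct toy_page {
	bool faultable;     /* stand-in for the faultability xarray entry */
	bool attr_private;  /* stand-in for the mem_attr_array entry */
	int mapcount;       /* number of existing userspace mappings */
};

static int toy_convert_to_private(struct toy_page *p)
{
	bool was_faultable;

	assert(p);
	was_faultable = p->faultable;

	/* Step 1: block new faults before looking at mappings. */
	p->faultable = false;

	/* Step 2: existing mappings mean the conversion cannot proceed. */
	if (p->mapcount > 0) {
		p->faultable = was_faultable;  /* assumed rollback */
		return -EBUSY;
	}

	/* Step 3: only now commit the attribute change. */
	p->attr_private = true;
	return 0;
}
```

Keeping the two pieces of state separate is what lets step 1 happen
without touching the attribute until step 3.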

We're not completely sure about the effectiveness of using faultability
to block off future faults here; in future revisions we may use a
different approach. The folio_lock() is probably important if we need to
check mapcount. Please let me know if you have any ideas!

The starting point of having a different xarray was pKVM's requirement
of having separate xarrays, and we later realized that the xarray could
be used for reason (b). For x86 we could perhaps eventually remove the
second xarray? Not sure as of now.

>
> The latter is per-slot, so one level higher, however I don't think it's a
> common use case for mapping the same gmemfd in multiple slots anyway for
> KVM (besides corner cases like live upgrade).  So perhaps this is not about
> layering but something else?  For example, any use case where PRIVATE and
> FAULTABLE can be reported with different values.
>
> Another higher level question is, is there any plan to support non-CoCo
> context for 1G?

I believe guest_memfd users are generally in favor of eventually using
guest_memfd for non-CoCo use cases, which means we do want 1G (shared,
in the case of CoCo) page support.

However, core-mm's fault path does not support mapping at anything
higher than the PMD level (other than hugetlb_fault(), which the
community wants to move away from), so core-mm wouldn't be able to map
1G pages taken from HugeTLB.

In this patch series, we always split pages before mapping them to
userspace and that's how this series still works with core-mm.
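
As a rough illustration of the index arithmetic behind splitting, the
sketch below rounds a file page index down to the first subpage of its
huge page. It is a userspace approximation that assumes 4K base pages;
TOY_PAGE_SHIFT, toy_pages_per_huge_page() and toy_first_subpage_index()
are hypothetical names, not helpers from the series.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_PAGE_SHIFT 12  /* assumes 4K base pages */

/* Number of base pages in one huge page of 2^huge_shift bytes. */
static uint64_t toy_pages_per_huge_page(unsigned int huge_shift)
{
	assert(huge_shift >= TOY_PAGE_SHIFT && huge_shift < 64);
	return UINT64_C(1) << (huge_shift - TOY_PAGE_SHIFT);
}

/* Round a file page index down to the first subpage of its huge page. */
static uint64_t toy_first_subpage_index(uint64_t index,
					unsigned int huge_shift)
{
	return index & ~(toy_pages_per_huge_page(huge_shift) - 1);
}
```

With a 1G huge page (huge_shift of 30), each huge page covers 262144
base pages, so the split folios occupy that many consecutive filemap
indices starting at the value returned above.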

Having 1G page support for shared memory or for non-CoCo use cases would
probably depend on better HugeTLB integration with core-mm, which you'd
be most familiar with.

Thank you for looking through our patches, we need your experience and
help! I've also just sent out the first 3 patches separately, which I
think is useful in improving understandability of the
resv_map/subpool/hstate reservation system in HugeTLB and can be
considered separately. Hope you can also review/comment on [4].

> I saw that you also mentioned you have working QEMU prototypes ready in
> another email.  It'll be great if you can push your kernel/QEMU's latest
> tree (including all dependency patches) somewhere so anyone can have a
> closer look, or play with it.

Vishal's reply [3] might have been a bit confusing. To clarify, my team
doesn't work with Qemu at all (we use a custom userspace VMM internally)
so the patches in this series are tested purely with selftests.

The selftests have fewer dependencies than full Qemu and I'd be happy to
help with running them or explain anything that I might have missed out.

We don't have any Qemu prototypes and are not likely to be building any
prototypes in the foreseeable future.

>
> Thanks,
>
> -- 
> Peter Xu

[1] https://lore.kernel.org/all/20241010085930.1546800-3-tabba@google.com/
[2] https://lore.kernel.org/all/f4ca1711a477a3b56406c05d125dce3d7403b936.1726009989.git.ackerleytng@google.com/
[3] https://lore.kernel.org/all/CAGtprH-GczOb64XrLpdW4ObRG7Gsv8tHWNhiW7=2dE=OAF7-Rw@mail.gmail.com/
[4] https://lore.kernel.org/all/cover.1728684491.git.ackerleytng@google.com/T/


* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-10-11 23:32     ` Ackerley Tng
@ 2024-10-15 21:34       ` Peter Xu
  2024-10-15 23:42         ` Ackerley Tng
  0 siblings, 1 reply; 130+ messages in thread
From: Peter Xu @ 2024-10-15 21:34 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: tabba, quic_eberman, roypat, jgg, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, erdemaktas, vannapurve, qperret,
	jhubbard, willy, shuah, brauner, bfoster, kent.overstreet, pvorel,
	rppt, richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest

On Fri, Oct 11, 2024 at 11:32:11PM +0000, Ackerley Tng wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
> > On Tue, Sep 10, 2024 at 11:43:57PM +0000, Ackerley Tng wrote:
> >> The faultability xarray is stored on the inode since faultability is a
> >> property of the guest_memfd's memory contents.
> >> 
> >> In this RFC, presence of an entry in the xarray indicates faultable,
> >> but this could be flipped so that presence indicates unfaultable. For
> >> flexibility, a special value "FAULT" is used instead of a simple
> >> boolean.
> >> 
> >> However, at some stages of a VM's lifecycle there could be more
> >> private pages, and at other stages there could be more shared pages.
> >> 
> >> This is likely to be replaced by a better data structure in a future
> >> revision to better support ranges.
> >> 
> >> Also store struct kvm_gmem_hugetlb in struct kvm_gmem_hugetlb as a
> >> pointer. inode->i_mapping->i_private_data.
> >
> > Could you help explain the difference between faultability v.s. the
> > existing KVM_MEMORY_ATTRIBUTE_PRIVATE?  Not sure if I'm the only one who's
> > confused, otherwise might be good to enrich the commit message.
> 
> Thank you for this question, I'll add this to the commit message to the
> next revision if Fuad's patch set [1] doesn't make it first.
> 
> Reason (a): To elaborate on the explanation in [1],
> KVM_MEMORY_ATTRIBUTE_PRIVATE is whether userspace wants this page to be
> private or shared, and faultability is whether the page is allowed to be
> faulted in by userspace.
> 
> These two are similar but may not be the same thing. In pKVM, pKVM
> cannot trust userspace's configuration of private/shared, and other
> information will go into determining the private/shared setting in
> faultability.

It makes sense to me that the kernel has the right to decide which page is
shared / private.  No matter if it's for pKVM or CoCo, I believe the normal
case is most / all pages are private, until some requests to share them for
special purposes (like DMA).  But that'll need to be initiated as a request
from the guest not the userspace hypervisor.

I must confess I totally have no idea how KVM_MEMORY_ATTRIBUTE_PRIVATE is
planned to be used in the future. Currently it's always set at least in
QEMU if gmemfd is enabled, so it doesn't yet tell me anything..

If it's driven by the userspace side of the hypervisor, I wonder when
the user app should request a value different from the current one, if
the kernel already has an answer in this case.  It made me even more
confused, as we have this in the API doc:

        Note, there is no "get" API.  Userspace is responsible for
        explicitly tracking the state of a gfn/page as needed.

And I do wonder whether we will still need some API just to query whether
the kernel allows the page to be mapped or not (aka, the "real" shared /
private status of a guest page).  I guess that's not directly relevant to
the faultability to be introduced here, but if you or anyone know please
kindly share, I'd love to learn about it.

> 
> Perhaps Fuad can elaborate more here.
> 
> Reason (b): In this patch series (mostly focus on x86 first), we're
> using faultability to prevent any future faults before checking that
> there are no mappings.
> 
> Having a different xarray from mem_attr_array allows us to disable
> faulting before committing to changing mem_attr_array. Please see
> `kvm_gmem_should_set_attributes_private()` in this patch [2].
> 
> We're not completely sure about the effectiveness of using faultability
> to block off future faults here, in future revisions we may be using a
> different approach. The folio_lock() is probably important if we need to
> check mapcount. Please let me know if you have any ideas!
> 
> The starting point of having a different xarray was pKVM's requirement
> of having separate xarrays, and we later realized that the xarray could
> be used for reason (b). For x86 we could perhaps eventually remove the
> second xarray? Not sure as of now.

Just had a quick look at patch 27:

https://lore.kernel.org/all/5a05eb947cf7aa21f00b94171ca818cc3d5bdfee.1726009989.git.ackerleytng@google.com/

I'm not yet sure what's protecting from faultability being modified against
a concurrent fault().

I wonder whether one can use the folio lock to serialize that, so that one
needs to take the folio lock to modify/lookup the folio's faultability,
then it may naturally match with the fault() handler design, where
kvm_gmem_get_folio() needs to lock the page first.

But then kvm_gmem_is_faultable() will need to also be called only after the
folio is locked to avoid races.
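
The serialization idea above could look roughly like the following
userspace sketch. It is an assumption-laden illustration only: a
pthread mutex stands in for the folio lock, struct toy_folio,
toy_folio_set_faultable() and toy_fault() are hypothetical names, and
returning -EFAULT for a non-faultable page is chosen just for the
example.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

struct toy_folio {
	pthread_mutex_t lock;  /* stand-in for the folio lock */
	bool faultable;
};

/* Writer side: change faultability only while holding the lock. */
static void toy_folio_set_faultable(struct toy_folio *folio, bool faultable)
{
	assert(folio);
	pthread_mutex_lock(&folio->lock);
	folio->faultable = faultable;
	pthread_mutex_unlock(&folio->lock);
}

/* Fault side: take the lock first, then look up faultability. */
static int toy_fault(struct toy_folio *folio)
{
	int ret;

	assert(folio);
	pthread_mutex_lock(&folio->lock);
	ret = folio->faultable ? 0 : -EFAULT;
	pthread_mutex_unlock(&folio->lock);
	return ret;
}
```

Since both paths take the same lock, a fault either sees the old value
or the new one, never a lookup racing with a concurrent update.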

> 
> >
> > The latter is per-slot, so one level higher, however I don't think it's a
> > common use case for mapping the same gmemfd in multiple slots anyway for
> > KVM (besides corner cases like live upgrade).  So perhaps this is not about
> > layering but something else?  For example, any use case where PRIVATE and
> > FAULTABLE can be reported with different values.
> >
> > Another higher level question is, is there any plan to support non-CoCo
> > context for 1G?
> 
> I believe guest_memfd users are generally in favor of eventually using
> guest_memfd for non-CoCo use cases, which means we do want 1G (shared,
> in the case of CoCo) page support.
> 
> However, core-mm's fault path does not support mapping at anything
> higher than the PMD level (other than hugetlb_fault(), which the
> community wants to move away from), so core-mm wouldn't be able to map
> 1G pages taken from HugeTLB.

Have you looked at vm_operations_struct.huge_fault()?  Or maybe you're
referring to some other challenges?

> 
> In this patch series, we always split pages before mapping them to
> userspace and that's how this series still works with core-mm.
> 
> Having 1G page support for shared memory or for non-CoCo use cases would
> probably depend on better HugeTLB integration with core-mm, which you'd
> be most familiar with.

My understanding is the mm community wants to avoid adding major new things
on top of current hugetlbfs alone, I'm not sure whether this will also be
accounted as part of that.  IMHO it could depend on how much this series
will reuse hugetlbfs.  If it's only about allocations it might be ok,
however I still feel risky having the name "hugetlbfs" here, the allocator
(if refactored out of hugetlb, but to contain more functions than CMA)
could be named in a more generic way.  No rush on changing anything, you
may always want to talk with more mm people on this I guess.

I also don't know how you treat things like folio_test_hugetlb() on
possible assumptions that the VMA must be a hugetlb vma.  I'd confess I
didn't check the rest of the patchset yet - reading a large series
without a git tree is sometimes challenging to me.

> 
> Thank you for looking through our patches, we need your experience and
> help! I've also just sent out the first 3 patches separately, which I
> think is useful in improving understandability of the
> resv_map/subpool/hstate reservation system in HugeTLB and can be
> considered separately. Hope you can also review/comment on [4].

I'll read and think about it.  Before that, I'll probably need to read more
backgrounds you need from hugetlb allocators (e.g. I remember you mentioned
pool management somewhere).  I tried to watch your LPC talk but the
recording has some issue on audio so I can mostly hear nothing in most of
the discussions..  I'll try to join the bi-weekly meeting two days later,
though.

> 
> > I saw that you also mentioned you have working QEMU prototypes ready in
> > another email.  It'll be great if you can push your kernel/QEMU's latest
> > tree (including all dependency patches) somewhere so anyone can have a
> > closer look, or play with it.
> 
> Vishal's reply [3] might have been a bit confusing. To clarify, my team
> doesn't work with Qemu at all (we use a custom userspace VMM internally)
> so the patches in this series are tested purely with selftests.
> 
> The selftests have fewer dependencies than full Qemu and I'd be happy to
> help with running them or explain anything that I might have missed out.
> 
> We don't have any Qemu prototypes and are not likely to be building any
> prototypes in the foreseeable future.

I see, that's totally not a problem.  If there can be, especially !CoCo
support at some point, we're happy to test it on QEMU side.  I'll see what
I can do to help !CoCo kernel side getting there.

Besides, it'll still be great if you can push a latest kernel tree
somewhere (or provide the base commit ID, but that needs to be on a public
tree I can fetch).

Thanks,

> 
> >
> > Thanks,
> >
> > -- 
> > Peter Xu
> 
> [1] https://lore.kernel.org/all/20241010085930.1546800-3-tabba@google.com/
> [2] https://lore.kernel.org/all/f4ca1711a477a3b56406c05d125dce3d7403b936.1726009989.git.ackerleytng@google.com/
> [3] https://lore.kernel.org/all/CAGtprH-GczOb64XrLpdW4ObRG7Gsv8tHWNhiW7=2dE=OAF7-Rw@mail.gmail.com/
> [4] https://lore.kernel.org/all/cover.1728684491.git.ackerleytng@google.com/T/
> 

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-10-15 21:34       ` Peter Xu
@ 2024-10-15 23:42         ` Ackerley Tng
  2024-10-16  8:45           ` David Hildenbrand
  2024-10-16  8:50           ` David Hildenbrand
  0 siblings, 2 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-10-15 23:42 UTC (permalink / raw)
  To: Peter Xu
  Cc: tabba, quic_eberman, roypat, jgg, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, erdemaktas, vannapurve, qperret,
	jhubbard, willy, shuah, brauner, bfoster, kent.overstreet, pvorel,
	rppt, richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest

Peter Xu <peterx@redhat.com> writes:

> On Fri, Oct 11, 2024 at 11:32:11PM +0000, Ackerley Tng wrote:
>> Peter Xu <peterx@redhat.com> writes:
>> 
>> > On Tue, Sep 10, 2024 at 11:43:57PM +0000, Ackerley Tng wrote:
>> >> The faultability xarray is stored on the inode since faultability is a
>> >> property of the guest_memfd's memory contents.
>> >> 
>> >> In this RFC, presence of an entry in the xarray indicates faultable,
>> >> but this could be flipped so that presence indicates unfaultable. For
>> >> flexibility, a special value "FAULT" is used instead of a simple
>> >> boolean.
>> >> 
>> >> However, at some stages of a VM's lifecycle there could be more
>> >> private pages, and at other stages there could be more shared pages.
>> >> 
>> >> This is likely to be replaced by a better data structure in a future
>> >> revision to better support ranges.
>> >> 
>> >> Also store struct kvm_gmem_hugetlb in struct kvm_gmem_hugetlb as a
>> >> pointer. inode->i_mapping->i_private_data.
>> >
>> > Could you help explain the difference between faultability v.s. the
>> > existing KVM_MEMORY_ATTRIBUTE_PRIVATE?  Not sure if I'm the only one who's
>> > confused, otherwise might be good to enrich the commit message.
>> 
>> Thank you for this question, I'll add this to the commit message to the
>> next revision if Fuad's patch set [1] doesn't make it first.
>> 
>> Reason (a): To elaborate on the explanation in [1],
>> KVM_MEMORY_ATTRIBUTE_PRIVATE is whether userspace wants this page to be
>> private or shared, and faultability is whether the page is allowed to be
>> faulted in by userspace.
>> 
>> These two are similar but may not be the same thing. In pKVM, pKVM
>> cannot trust userspace's configuration of private/shared, and other
>> information will go into determining the private/shared setting in
>> faultability.
>
> It makes sense to me that the kernel has the right to decide which page is
> shared / private.  No matter if it's for pKVM or CoCo, I believe the normal
> case is most / all pages are private, until some requests to share them for
> special purposes (like DMA).  But that'll need to be initiated as a request
> from the guest not the userspace hypervisor.

For TDX, the plan is that the guest will request the page to be remapped
as shared or private, and the handler for that request will exit to
the userspace VMM.

The userspace VMM will then do any necessary coordination (e.g. for a
shared to private conversion it may need to unpin pages from DMA), and
then use the KVM_SET_MEMORY_ATTRIBUTES ioctl to indicate agreement with
the guest's requested conversion. This is where
KVM_MEMORY_ATTRIBUTE_PRIVATE will be provided.

Patch 38 [1] updates
tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c to
demonstrate the usage flow for x86.

Fuad will be in a better position to explain the flow for pKVM.

> I must confess I totally have no idea how KVM_MEMORY_ATTRIBUTE_PRIVATE is
> planned to be used in the future. Currently it's always set at least in
> QEMU if gmemfd is enabled, so it doesn't yet tell me anything..
>
> If it's driven by the userspace side of the hypervisor, I wonder when
> should the user app request some different value it already was, if the
> kernel already has an answer in this case.  It made me even more confused,
> as we have this in the API doc:
>
>         Note, there is no "get" API.  Userspace is responsible for
>         explicitly tracking the state of a gfn/page as needed.
>
> And I do wonder whether we will still need some API just to query whether
> the kernel allows the page to be mapped or not (aka, the "real" shared /
> private status of a guest page).  I guess that's not directly relevant to
> the faultability to be introduced here, but if you or anyone know please
> kindly share, I'd love to learn about it.

The userspace VMM will track the initial shared/private state, in the
sense that when the VM is created, the mem_attr_array is initialized
such that the guest pages are all shared.

Then when the userspace VMM calls the KVM_SET_MEMORY_ATTRIBUTES ioctl,
it should record all changes so it knows what the state is in the
kernel.

Even if userspace VMM doesn't record the state properly, if the
KVM_SET_MEMORY_ATTRIBUTES ioctl is used to request no change
(e.g. setting an already private page to private), it will just be a
no-op in the kernel.
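
To make the tracking concrete, here is a minimal sketch of the kind of
shadow state a userspace VMM could keep, since there is no "get" API.
This is not taken from any real VMM: struct gfn_tracker and the
gfn_tracker_*() helpers are hypothetical names, and the actual
KVM_SET_MEMORY_ATTRIBUTES ioctl call is left out. All pages start out
shared, and a request that changes nothing reports zero changed pages.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical VMM-side shadow of the per-gfn shared/private state. */
struct gfn_tracker {
	uint64_t npages;
	uint8_t *is_private;	/* one byte per gfn, 0 = shared (initial state) */
};

/* All pages start out shared, matching the initial state described above. */
int gfn_tracker_init(struct gfn_tracker *t, uint64_t npages)
{
	t->is_private = calloc(npages, 1);
	if (!t->is_private)
		return -1;
	t->npages = npages;
	return 0;
}

void gfn_tracker_destroy(struct gfn_tracker *t)
{
	free(t->is_private);
	t->is_private = NULL;
	t->npages = 0;
}

/*
 * Record a conversion that the VMM is about to request (or has requested)
 * from the kernel. Returns how many pages actually changed state; 0 means
 * the request was redundant.
 */
uint64_t gfn_tracker_set(struct gfn_tracker *t, uint64_t start,
			 uint64_t count, bool make_private)
{
	uint64_t changed = 0;

	assert(start <= t->npages && count <= t->npages - start);

	for (uint64_t i = start; i < start + count; i++) {
		if (t->is_private[i] != make_private) {
			t->is_private[i] = make_private;
			changed++;
		}
	}
	return changed;
}

bool gfn_tracker_is_private(const struct gfn_tracker *t, uint64_t gfn)
{
	assert(gfn < t->npages);
	return t->is_private[gfn];
}
```

A VMM could call gfn_tracker_set() right after a successful ioctl, so
that its view only changes when the kernel's view did.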

>> 
>> Perhaps Fuad can elaborate more here.
>> 
>> Reason (b): In this patch series (mostly focus on x86 first), we're
>> using faultability to prevent any future faults before checking that
>> there are no mappings.
>> 
>> Having a different xarray from mem_attr_array allows us to disable
>> faulting before committing to changing mem_attr_array. Please see
>> `kvm_gmem_should_set_attributes_private()` in this patch [2].
>> 
>> We're not completely sure about the effectiveness of using faultability
>> to block off future faults here, in future revisions we may be using a
>> different approach. The folio_lock() is probably important if we need to
>> check mapcount. Please let me know if you have any ideas!
>> 
>> The starting point of having a different xarray was pKVM's requirement
>> of having separate xarrays, and we later realized that the xarray could
>> be used for reason (b). For x86 we could perhaps eventually remove the
>> second xarray? Not sure as of now.
>
> Just had a quick look at patch 27:
>
> https://lore.kernel.org/all/5a05eb947cf7aa21f00b94171ca818cc3d5bdfee.1726009989.git.ackerleytng@google.com/
>
> I'm not yet sure what's protecting from faultability being modified against
> a concurrent fault().
>
> I wonder whether one can use the folio lock to serialize that, so that one
> needs to take the folio lock to modify/lookup the folio's faultability,
> then it may naturally match with the fault() handler design, where
> kvm_gmem_get_folio() needs to lock the page first.
>
> But then kvm_gmem_is_faultable() will need to also be called only after the
> folio is locked to avoid races.

My bad. In our rush to get this series out before LPC, the patch series
was not organized very well. Patch 39 [2] adds the
lock. filemap_invalidate_lock_shared() should make sure that faulting
doesn't race with faultability updates.

>> > The latter is per-slot, so one level higher, however I don't think it's a
>> > common use case for mapping the same gmemfd in multiple slots anyway for
>> > KVM (besides corner cases like live upgrade).  So perhaps this is not about
>> > layering but something else?  For example, any use case where PRIVATE and
>> > FAULTABLE can be reported with different values.
>> >
>> > Another higher level question is, is there any plan to support non-CoCo
>> > context for 1G?
>> 
>> I believe guest_memfd users are generally in favor of eventually using
>> guest_memfd for non-CoCo use cases, which means we do want 1G (shared,
>> in the case of CoCo) page support.
>> 
>> However, core-mm's fault path does not support mapping at anything
>> higher than the PMD level (other than hugetlb_fault(), which the
>> community wants to move away from), so core-mm wouldn't be able to map
>> 1G pages taken from HugeTLB.
>
> Have you looked at vm_operations_struct.huge_fault()?  Or maybe you're
> referring to some other challenges?
>

IIUC vm_operations_struct.huge_fault() is used when creating a PMD, but
PUD mappings will be needed for 1G pages, so 1G pages can't be mapped by
core-mm using vm_operations_struct.huge_fault().
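
For reference, the sizes behind the PMD/PUD distinction can be worked
out as below. This assumes x86-64 with 4K base pages; the EX_* constants
are local copies for illustration rather than the kernel's own
definitions.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed x86-64 layout with 4K base pages. */
enum {
	EX_PAGE_SHIFT = 12,	/* 4K base page */
	EX_PMD_SHIFT  = 21,	/* 2M mapped by one PMD entry */
	EX_PUD_SHIFT  = 30,	/* 1G mapped by one PUD entry */
};

/* Each page table level resolves 9 address bits (512 entries per table). */
static_assert(EX_PMD_SHIFT - EX_PAGE_SHIFT == 9, "PMD covers 512 base pages");
static_assert(EX_PUD_SHIFT - EX_PMD_SHIFT == 9, "PUD covers 512 PMD ranges");

/* Bytes covered by a single entry at the level with the given shift. */
static inline uint64_t ex_level_size(unsigned int shift)
{
	return UINT64_C(1) << shift;
}

/* Number of 4K base pages covered by a single entry at that level. */
static inline uint64_t ex_base_pages(unsigned int shift)
{
	return UINT64_C(1) << (shift - EX_PAGE_SHIFT);
}
```

So a 1G page that is split down to base pages, as this series does
before mapping to userspace, turns into 262144 4K pieces.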

>> 
>> In this patch series, we always split pages before mapping them to
>> userspace and that's how this series still works with core-mm.
>> 
>> Having 1G page support for shared memory or for non-CoCo use cases would
>> probably depend on better HugeTLB integration with core-mm, which you'd
>> be most familiar with.
>
> My understanding is the mm community wants to avoid adding major new things
> on top of current hugetlbfs alone, I'm not sure whether this will also be
> accounted as part of that.  IMHO it could depend on how much this series
> will reuse hugetlbfs.  If it's only about allocations it might be ok,
> however I still feel risky having the name "hugetlbfs" here, the allocator
> (if refactored out of hugetlb, but to contain more functions than CMA)
> could be named in a more generic way.  No rush on changing anything, you
> may always want to talk with more mm people on this I guess.
>

Thanks for your feedback! We do intend to only use the allocator part of
HugeTLB for guest_memfd, which will need some refactoring on the HugeTLB
side. The refactoring is not expected to require any functional changes.

What do you think of refactoring out the allocator part of HugeTLB in
terms of whether it helps with HugeTLB unification?

If the refactoring out of the allocator part of HugeTLB needs a name
change, that could work too.

> I also don't know how you treat things like folio_test_hugetlb() on
> possible assumptions that the VMA must be a hugetlb vma.  I'd confess I
> didn't yet check the rest of the patchset yet - reading a large series
> without a git tree is sometimes challenging to me.
>

My plan is to basically never involve folio_test_hugetlb(), and the
VMAs used by guest_memfd will also never be HugeTLB VMAs. That's
because only the HugeTLB allocator is used, and by the time the folio is
mapped to userspace, it will already have been split. After the
page is split, the folio loses its HugeTLB status. guest_memfd folios
will never be mapped to userspace while they still have a HugeTLB
status.

(When 1G pages can be mapped to userspace, we will have to rethink the
above. But possibly by then HugeTLB would have been more unified with
core-mm and hence perhaps things will fall in place?)

The current uses of folio_test_hugetlb() in this patch series are

1. In alloc_migration_target_by_mpol(), which is okay because that's
   during allocation of the HugeTLB folio, before it gets split up and
   loses its status. When the folio is freed, before it is returned to
   HugeTLB, the HugeTLB status will be reinstated.

2. In kvm_gmem_prepare_folio(). If the folio hasn't been split yet, then
   we use folio_zero_user() to zero the folio, and if it has been split,
   then we use a more primitive loop to zero the folio. These two
   methods of zeroing are actually kind of the same thing and can be
   combined. This correctly uses folio_test_hugetlb().

3. In kvm_gmem_fault(), I check if folio_test_hugetlb() when doing the
   same zeroing described in (2), but this is not actually necessary and
   will be removed in a future revision, since HugeTLB folios should
   never get faulted to userspace.

>> 
>> Thank you for looking through our patches, we need your experience and
>> help! I've also just sent out the first 3 patches separately, which I
>> think is useful in improving understandability of the
>> resv_map/subpool/hstate reservation system in HugeTLB and can be
>> considered separately. Hope you can also review/comment on [4].
>
> I'll read and think about it.  Before that, I'll probably need to read more
> backgrounds you need from hugetlb allocators (e.g. I remember you mentioned
> pool management somewhere).  I tried to watch your LPC talk but the
> recording has some issue on audio so I can mostly hear nothing in most of
> the discussions..  I'll try to join the bi-weekly meeting two days later,
> though.
>

Thank you!

>> 
>> > I saw that you also mentioned you have working QEMU prototypes ready in
>> > another email.  It'll be great if you can push your kernel/QEMU's latest
>> > tree (including all dependency patches) somewhere so anyone can have a
>> > closer look, or play with it.
>> 
>> Vishal's reply [3] might have been a bit confusing. To clarify, my team
>> doesn't work with Qemu at all (we use a custom userspace VMM internally)
>> so the patches in this series are tested purely with selftests.
>> 
>> The selftests have fewer dependencies than full Qemu and I'd be happy to
>> help with running them or explain anything that I might have missed out.
>> 
>> We don't have any Qemu prototypes and are not likely to be building any
>> prototypes in the foreseeable future.
>
> I see, that's totally not a problem.  If there can be, especially !CoCo
> support at some point, we're happy to test it on QEMU side.  I'll see what
> I can do to help !CoCo kernel side getting there.
>
> Besides, it'll still be great if you can push a latest kernel tree
> somewhere (or provide the base commit ID, but that needs to be on a public
> tree I can fetch).

I should have added the base commit ID.

The base commit hash for this series is
1c4246294c9841c50805cec0627030c083e019c6.

>
> Thanks,
>
>> 
>> >
>> > Thanks,
>> >
>> > -- 
>> > Peter Xu
>> 
>> [1] https://lore.kernel.org/all/20241010085930.1546800-3-tabba@google.com/
>> [2] https://lore.kernel.org/all/f4ca1711a477a3b56406c05d125dce3d7403b936.1726009989.git.ackerleytng@google.com/
>> [3] https://lore.kernel.org/all/CAGtprH-GczOb64XrLpdW4ObRG7Gsv8tHWNhiW7=2dE=OAF7-Rw@mail.gmail.com/
>> [4] https://lore.kernel.org/all/cover.1728684491.git.ackerleytng@google.com/T/
>> 
>
> -- 
> Peter Xu

[1] https://lore.kernel.org/all/3ef4b32d32dca6e1b506e967c950dc2d4c3bc7ae.1726009989.git.ackerleytng@google.com/
[2] https://lore.kernel.org/all/38723c5d5e9b530e52f28b9f9f4a6d862ed69bcd.1726009989.git.ackerleytng@google.com/

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-10-15 23:42         ` Ackerley Tng
@ 2024-10-16  8:45           ` David Hildenbrand
  2024-10-16 20:16             ` Peter Xu
  2024-10-16  8:50           ` David Hildenbrand
  1 sibling, 1 reply; 130+ messages in thread
From: David Hildenbrand @ 2024-10-16  8:45 UTC (permalink / raw)
  To: Ackerley Tng, Peter Xu
  Cc: tabba, quic_eberman, roypat, jgg, rientjes, fvdl, jthoughton,
	seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao, isaku.yamahata,
	muchun.song, erdemaktas, vannapurve, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest

On 16.10.24 01:42, Ackerley Tng wrote:
> Peter Xu <peterx@redhat.com> writes:
> 
>> On Fri, Oct 11, 2024 at 11:32:11PM +0000, Ackerley Tng wrote:
>>> Peter Xu <peterx@redhat.com> writes:
>>>
>>>> On Tue, Sep 10, 2024 at 11:43:57PM +0000, Ackerley Tng wrote:
>>>>> The faultability xarray is stored on the inode since faultability is a
>>>>> property of the guest_memfd's memory contents.
>>>>>
>>>>> In this RFC, presence of an entry in the xarray indicates faultable,
>>>>> but this could be flipped so that presence indicates unfaultable. For
>>>>> flexibility, a special value "FAULT" is used instead of a simple
>>>>> boolean.
>>>>>
>>>>> However, at some stages of a VM's lifecycle there could be more
>>>>> private pages, and at other stages there could be more shared pages.
>>>>>
>>>>> This is likely to be replaced by a better data structure in a future
>>>>> revision to better support ranges.
>>>>>
>>>>> Also store struct kvm_gmem_hugetlb in struct kvm_gmem_hugetlb as a
>>>>> pointer. inode->i_mapping->i_private_data.
>>>>
>>>> Could you help explain the difference between faultability v.s. the
>>>> existing KVM_MEMORY_ATTRIBUTE_PRIVATE?  Not sure if I'm the only one who's
>>>> confused, otherwise might be good to enrich the commit message.
>>>
>>> Thank you for this question, I'll add this to the commit message to the
>>> next revision if Fuad's patch set [1] doesn't make it first.
>>>
>>> Reason (a): To elaborate on the explanation in [1],
>>> KVM_MEMORY_ATTRIBUTE_PRIVATE is whether userspace wants this page to be
>>> private or shared, and faultability is whether the page is allowed to be
>>> faulted in by userspace.
>>>
>>> These two are similar but may not be the same thing. In pKVM, pKVM
>>> cannot trust userspace's configuration of private/shared, and other
>>> information will go into determining the private/shared setting in
>>> faultability.
>>
>> It makes sense to me that the kernel has the right to decide which page is
>> shared / private.  No matter if it's for pKVM or CoCo, I believe the normal
>> case is most / all pages are private, until some requests to share them for
>> special purposes (like DMA).  But that'll need to be initiated as a request
>> from the guest not the userspace hypervisor.
> 
> For TDX, the plan is that the guest will request the page to be remapped
> as shared or private, and the handler for that request will exit to
> the userspace VMM.
> 
> The userspace VMM will then do any necessary coordination (e.g. for a
> shared to private conversion it may need to unpin pages from DMA), and
> then use the KVM_SET_MEMORY_ATTRIBUTES ioctl to indicate agreement with
> the guest's requested conversion. This is where
> KVM_MEMORY_ATTRIBUTE_PRIVATE will be provided.
> 
> Patch 38 [1] updates
> tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c to
> demonstrate the usage flow for x86.
> 
> Fuad will be in a better position to explain the flow for pKVM.
> 
>> I must confess I totally have no idea how KVM_MEMORY_ATTRIBUTE_PRIVATE is
>> planned to be used in the future. Currently it's always set at least in
>> QEMU if gmemfd is enabled, so it doesn't yet tell me anything..
>>
>> If it's driven by the userspace side of the hypervisor, I wonder when
>> should the user app request some different value it already was, if the
>> kernel already has an answer in this case.  It made me even more confused,
>> as we have this in the API doc:
>>
>>          Note, there is no "get" API.  Userspace is responsible for
>>          explicitly tracking the state of a gfn/page as needed.
>>
>> And I do wonder whether we will still need some API just to query whether
>> the kernel allows the page to be mapped or not (aka, the "real" shared /
>> private status of a guest page).  I guess that's not directly relevant to
>> the faultability to be introduced here, but if you or anyone know please
>> kindly share, I'd love to learn about it.
> 
> The userspace VMM will track the initial shared/private state, in the
> sense that when the VM is created, the mem_attr_array is initialized
> such that the guest pages are all shared.
> 
> Then when the userspace VMM calls the KVM_SET_MEMORY_ATTRIBUTES ioctl,
> it should record all changes so it knows what the state is in the
> kernel.
> 
> Even if userspace VMM doesn't record the state properly, if the
> KVM_SET_MEMORY_ATTRIBUTES ioctl is used to request no change
> (e.g. setting an already private page to private), it will just be a
> no-op in the kernel.
> 
>>>
>>> Perhaps Fuad can elaborate more here.
>>>
>>> Reason (b): In this patch series (mostly focus on x86 first), we're
>>> using faultability to prevent any future faults before checking that
>>> there are no mappings.
>>>
>>> Having a different xarray from mem_attr_array allows us to disable
>>> faulting before committing to changing mem_attr_array. Please see
>>> `kvm_gmem_should_set_attributes_private()` in this patch [2].
>>>
>>> We're not completely sure about the effectiveness of using faultability
>>> to block off future faults here, in future revisions we may be using a
>>> different approach. The folio_lock() is probably important if we need to
>>> check mapcount. Please let me know if you have any ideas!
>>>
>>> The starting point of having a different xarray was pKVM's requirement
>>> of having separate xarrays, and we later realized that the xarray could
>>> be used for reason (b). For x86 we could perhaps eventually remove the
>>> second xarray? Not sure as of now.
>>
>> Just had a quick look at patch 27:
>>
>> https://lore.kernel.org/all/5a05eb947cf7aa21f00b94171ca818cc3d5bdfee.1726009989.git.ackerleytng@google.com/
>>
>> I'm not yet sure what's protecting from faultability being modified against
>> a concurrent fault().
>>
>> I wonder whether one can use the folio lock to serialize that, so that one
>> needs to take the folio lock to modify/lookup the folio's faultability,
>> then it may naturally match with the fault() handler design, where
>> kvm_gmem_get_folio() needs to lock the page first.
>>
>> But then kvm_gmem_is_faultable() will need to also be called only after the
>> folio is locked to avoid races.
> 
> My bad. In our rush to get this series out before LPC, the patch series
> was not organized very well. Patch 39 [2] adds the
> lock. filemap_invalidate_lock_shared() should make sure that faulting
> doesn't race with faultability updates.
> 
>>>> The latter is per-slot, so one level higher, however I don't think it's a
>>>> common use case for mapping the same gmemfd in multiple slots anyway for
>>>> KVM (besides corner cases like live upgrade).  So perhaps this is not about
>>>> layering but something else?  For example, any use case where PRIVATE and
>>>> FAULTABLE can be reported with different values.
>>>>
>>>> Another higher level question is, is there any plan to support non-CoCo
>>>> context for 1G?
>>>
>>> I believe guest_memfd users are generally in favor of eventually using
>>> guest_memfd for non-CoCo use cases, which means we do want 1G (shared,
>>> in the case of CoCo) page support.
>>>
>>> However, core-mm's fault path does not support mapping at anything
>>> higher than the PMD level (other than hugetlb_fault(), which the
>>> community wants to move away from), so core-mm wouldn't be able to map
>>> 1G pages taken from HugeTLB.
>>
>> Have you looked at vm_operations_struct.huge_fault()?  Or maybe you're
>> referring to some other challenges?
>>
> 
> IIUC vm_operations_struct.huge_fault() is used when creating a PMD, but
> PUD mappings will be needed for 1G pages, so 1G pages can't be mapped by
> core-mm using vm_operations_struct.huge_fault().


Just to clarify a bit for Peter: as has been discussed previously, there
is a rather big difference between CoCo and non-CoCo VMs.

In CoCo VMs, the vast majority of pages are private, and they are
not mapped into user space. Only a handful of pages are commonly shared
and mapped into user space.

In non-CoCo VMs, all pages are shared and (for the time being) all pages 
are mapped into user space from where KVM will consume them.


Installing pmd/pud mappings into user space (recall: shared memory only) 
is currently not really a requirement for CoCo VMs, and therefore not 
the focus of this work.

Further, it's currently considered to be incompatible with getting
in-place private<->shared conversion on *page* granularity right, as we
will be exposing huge/gigantic folios via individual small folios to
core-MM. Mapping a PMD/PUD that is composed of multiple folios into
core-mm is not going to fly, unless using a PFNMAP, which has been
briefly discussed as well, but disregarded so far (no page pinning support).

So in the context of this work here, huge faults and PUD/PMD *user space 
page tables* do not apply.

For non-CoCo VMs there is no in-place conversion problem. One could use 
the same CoCo implementation, but without user space pud/pmd mappings. 
KVM and VFIO would have to consume this memory via the guest_memfd in 
memslots instead of via the user space mappings to more easily get 
PMD/PUD mappings into the secondary MMU. And the downsides would be 
sacrificing the vmemmap optimization and PMD/PUD user space mappings, 
while at the same time benefiting from being able to easily map only 
parts of a huge/gigantic page into user space.


So I consider pmd/pud user space mappings for non-CoCo an independent 
work item, not something that is part of the current effort of 
huge/gigantic pages with in-place conversion at page granularity for 
CoCo VMs.


More information is available in the bi-weekly upstream MM meeting (that 
was recorded) and the LPC talks, where most of that has been discussed.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-10-16  8:45           ` David Hildenbrand
@ 2024-10-16 20:16             ` Peter Xu
  2024-10-16 22:51               ` Jason Gunthorpe
  2024-10-17 15:02               ` David Hildenbrand
  0 siblings, 2 replies; 130+ messages in thread
From: Peter Xu @ 2024-10-16 20:16 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Ackerley Tng, tabba, quic_eberman, roypat, jgg, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, erdemaktas, vannapurve, qperret,
	jhubbard, willy, shuah, brauner, bfoster, kent.overstreet, pvorel,
	rppt, richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest

On Wed, Oct 16, 2024 at 10:45:43AM +0200, David Hildenbrand wrote:
> On 16.10.24 01:42, Ackerley Tng wrote:
> > Peter Xu <peterx@redhat.com> writes:
> > 
> > > On Fri, Oct 11, 2024 at 11:32:11PM +0000, Ackerley Tng wrote:
> > > > Peter Xu <peterx@redhat.com> writes:
> > > > 
> > > > > On Tue, Sep 10, 2024 at 11:43:57PM +0000, Ackerley Tng wrote:
> > > > > > The faultability xarray is stored on the inode since faultability is a
> > > > > > property of the guest_memfd's memory contents.
> > > > > > 
> > > > > > In this RFC, presence of an entry in the xarray indicates faultable,
> > > > > > but this could be flipped so that presence indicates unfaultable. For
> > > > > > flexibility, a special value "FAULT" is used instead of a simple
> > > > > > boolean.
> > > > > > 
> > > > > > However, at some stages of a VM's lifecycle there could be more
> > > > > > private pages, and at other stages there could be more shared pages.
> > > > > > 
> > > > > > This is likely to be replaced by a better data structure in a future
> > > > > > revision to better support ranges.
> > > > > > 
> > > > > > Also store struct kvm_gmem_hugetlb in struct kvm_gmem_hugetlb as a
> > > > > > pointer. inode->i_mapping->i_private_data.
> > > > > 
> > > > > Could you help explain the difference between faultability v.s. the
> > > > > existing KVM_MEMORY_ATTRIBUTE_PRIVATE?  Not sure if I'm the only one who's
> > > > > confused, otherwise might be good to enrich the commit message.
> > > > 
> > > > Thank you for this question, I'll add this to the commit message to the
> > > > next revision if Fuad's patch set [1] doesn't make it first.
> > > > 
> > > > Reason (a): To elaborate on the explanation in [1],
> > > > KVM_MEMORY_ATTRIBUTE_PRIVATE is whether userspace wants this page to be
> > > > private or shared, and faultability is whether the page is allowed to be
> > > > faulted in by userspace.
> > > > 
> > > > These two are similar but may not be the same thing. In pKVM, pKVM
> > > > cannot trust userspace's configuration of private/shared, and other
> > > > information will go into determining the private/shared setting in
> > > > faultability.
> > > 
> > > It makes sense to me that the kernel has the right to decide which page is
> > > shared / private.  No matter if it's for pKVM or CoCo, I believe the normal
> > > case is most / all pages are private, until some requests to share them for
> > > special purposes (like DMA).  But that'll need to be initiated as a request
> > > from the guest not the userspace hypervisor.
> > 
> > For TDX, the plan is that the guest will request the page to be remapped
> > as shared or private, and the handler for that request will exit to
> > the userspace VMM.
> > 
> > The userspace VMM will then do any necessary coordination (e.g. for a
> > shared to private conversion it may need to unpin pages from DMA), and
> > then use the KVM_SET_MEMORY_ATTRIBUTES ioctl to indicate agreement with
> > the guest's requested conversion. This is where
> > KVM_MEMORY_ATTRIBUTE_PRIVATE will be provided.
> > 
> > Patch 38 [1] updates
> > tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c to
> > demonstrate the usage flow for x86.
> > 
> > Fuad will be in a better position to explain the flow for pKVM.
> > 
> > > I must confess I totally have no idea how KVM_MEMORY_ATTRIBUTE_PRIVATE is
> > > planned to be used in the future. Currently it's always set at least in
> > > QEMU if gmemfd is enabled, so it doesn't yet tell me anything..
> > > 
> > > If it's driven by the userspace side of the hypervisor, I wonder when
> > > should the user app request some different value it already was, if the
> > > kernel already has an answer in this case.  It made me even more confused,
> > > as we have this in the API doc:
> > > 
> > >          Note, there is no "get" API.  Userspace is responsible for
> > >          explicitly tracking the state of a gfn/page as needed.
> > > 
> > > And I do wonder whether we will still need some API just to query whether
> > > the kernel allows the page to be mapped or not (aka, the "real" shared /
> > > private status of a guest page).  I guess that's not directly relevant to
> > > the faultability to be introduced here, but if you or anyone know please
> > > kindly share, I'd love to learn about it.
> > 
> > The userspace VMM will track the initial shared/private state, in the
> > sense that when the VM is created, the mem_attr_array is initialized
> > such that the guest pages are all shared.
> > 
> > Then when the userspace VMM calls the KVM_SET_MEMORY_ATTRIBUTES ioctl,
> > it should record all changes so it knows what the state is in the
> > kernel.
> > 
> > Even if userspace VMM doesn't record the state properly, if the
> > KVM_SET_MEMORY_ATTRIBUTES ioctl is used to request no change
> > (e.g. setting an already private page to private), it will just be a
> > no-op in the kernel.
> > 
> > > > 
> > > > Perhaps Fuad can elaborate more here.
> > > > 
> > > > Reason (b): In this patch series (mostly focus on x86 first), we're
> > > > using faultability to prevent any future faults before checking that
> > > > there are no mappings.
> > > > 
> > > > Having a different xarray from mem_attr_array allows us to disable
> > > > faulting before committing to changing mem_attr_array. Please see
> > > > `kvm_gmem_should_set_attributes_private()` in this patch [2].
> > > > 
> > > > We're not completely sure about the effectiveness of using faultability
> > > > to block off future faults here, in future revisions we may be using a
> > > > different approach. The folio_lock() is probably important if we need to
> > > > check mapcount. Please let me know if you have any ideas!
> > > > 
> > > > The starting point of having a different xarray was pKVM's requirement
> > > > of having separate xarrays, and we later realized that the xarray could
> > > > be used for reason (b). For x86 we could perhaps eventually remove the
> > > > second xarray? Not sure as of now.
> > > 
> > > Just had a quick look at patch 27:
> > > 
> > > https://lore.kernel.org/all/5a05eb947cf7aa21f00b94171ca818cc3d5bdfee.1726009989.git.ackerleytng@google.com/
> > > 
> > > I'm not yet sure what's protecting from faultability being modified against
> > > a concurrent fault().
> > > 
> > > I wonder whether one can use the folio lock to serialize that, so that one
> > > needs to take the folio lock to modify/lookup the folio's faultability,
> > > then it may naturally match with the fault() handler design, where
> > > kvm_gmem_get_folio() needs to lock the page first.
> > > 
> > > But then kvm_gmem_is_faultable() will need to also be called only after the
> > > folio is locked to avoid races.
> > 
> > My bad. In our rush to get this series out before LPC, the patch series
> > was not organized very well. Patch 39 [2] adds the
> > lock. filemap_invalidate_lock_shared() should make sure that faulting
> > doesn't race with faultability updates.
> > 
> > > > > The latter is per-slot, so one level higher, however I don't think it's a
> > > > > common use case for mapping the same gmemfd in multiple slots anyway for
> > > > > KVM (besides corner cases like live upgrade).  So perhaps this is not about
> > > > > layering but something else?  For example, any use case where PRIVATE and
> > > > > FAULTABLE can be reported with different values.
> > > > > 
> > > > > Another higher level question is, is there any plan to support non-CoCo
> > > > > context for 1G?
> > > > 
> > > > I believe guest_memfd users are generally in favor of eventually using
> > > > guest_memfd for non-CoCo use cases, which means we do want 1G (shared,
> > > > in the case of CoCo) page support.
> > > > 
> > > > However, core-mm's fault path does not support mapping at anything
> > > > higher than the PMD level (other than hugetlb_fault(), which the
> > > > community wants to move away from), so core-mm wouldn't be able to map
> > > > 1G pages taken from HugeTLB.
> > > 
> > > Have you looked at vm_operations_struct.huge_fault()?  Or maybe you're
> > > referring to some other challenges?
> > > 
> > 
> > IIUC vm_operations_struct.huge_fault() is used when creating a PMD, but
> > PUD mappings will be needed for 1G pages, so 1G pages can't be mapped by
> > core-mm using vm_operations_struct.huge_fault().
> 
> 
> Just to clarify a bit for Peter: as has been discussed previously, there are
> rather big differences between CoCo and non-CoCo VMs.
> 
> In CoCo VMs, the primary portion of all pages are private, and they are not
> mapped into user space. Only a handful of pages are commonly shared and
> mapped into user space.
> 
> In non-CoCo VMs, all pages are shared and (for the time being) all pages are
> mapped into user space from where KVM will consume them.
> 
> 
> Installing pmd/pud mappings into user space (recall: shared memory only) is
> currently not really a requirement for CoCo VMs, and therefore not the focus
> of this work.
> 
> Further, it's currently considered to be incompatible with getting in-place
> private<->shared conversion on *page* granularity right, as we will be
> exposing huge/gigantic folios via individual small folios to core-MM.
> Mapping a PMD/PUD into core-mm, that is composed of multiple folios is not
> going to fly, unless using a PFNMAP, which has been briefly discussed as
> well, but disregarded so far (no page pinning support).
> 
> So in the context of this work here, huge faults and PUD/PMD *user space
> page tables* do not apply.
> 
> For non-CoCo VMs there is no in-place conversion problem. One could use the
> same CoCo implementation, but without user space pud/pmd mappings. KVM and
> VFIO would have to consume this memory via the guest_memfd in memslots
> instead of via the user space mappings to more easily get PMD/PUD mappings
> into the secondary MMU. And the downsides would be sacrificing the vmemmap

Is there a chance that, once !CoCo is supported, external modules
(e.g. VFIO) can reuse the old user mappings, just like before gmemfd?

To support CoCo, I understand gmem+offset is required all over the place.
However, in a non-CoCo context, I wonder whether the other modules are
required to stick with gmem+offset, or whether they can reuse the old VA
ways, since it can fundamentally work the same as before, except that
the folios will now be managed by gmemfd.

I think the good thing about such an approach is that, when developing CoCo
support for all these modules, there are fewer constraints / concerns about
staying compatible with the non-CoCo use case.  It would also make gmemfd
easier to use in production before all CoCo facilities are ready, since most
of the infrastructure is already around and has been used for years,
provided VA can be mapped and GUPed like before.

Thanks,

> optimization and PMD/PUD user space mappings, while at the same time
> benefiting from being able to easily map only parts of a huge/gigantic page
> into user space.
> 
> 
> So I consider pmd/pud user space mappings for non-CoCo an independent work
> item, not something that is part of the current effort of huge/gigantic
> pages with in-place conversion at page granularity for CoCo VMs.
> 
> 
> More information is available in the bi-weekly upstream MM meeting (that was
> recorded) and the LPC talks, where most of that has been discussed.
> 
> -- 
> Cheers,
> 
> David / dhildenb
> 

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-10-16 20:16             ` Peter Xu
@ 2024-10-16 22:51               ` Jason Gunthorpe
  2024-10-16 23:49                 ` Peter Xu
  2024-10-17 15:02               ` David Hildenbrand
  1 sibling, 1 reply; 130+ messages in thread
From: Jason Gunthorpe @ 2024-10-16 22:51 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, Ackerley Tng, tabba, quic_eberman, roypat,
	rientjes, fvdl, jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du,
	jun.miao, isaku.yamahata, muchun.song, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest

On Wed, Oct 16, 2024 at 04:16:17PM -0400, Peter Xu wrote:
> 
> Is there chance that when !CoCo will be supported, then external modules
> (e.g. VFIO) can reuse the old user mappings, just like before gmemfd?
> 
> To support CoCo, I understand gmem+offset is required all over the places.
> However in a non-CoCo context, I wonder whether the other modules are
> required to stick with gmem+offset, or they can reuse the old VA ways,
> because how it works can fundamentally be the same as before, except that
> the folios now will be managed by gmemfd.

My intention with iommufd was to see fd + offset as the "new" way
to refer to all guest memory and discourage people from using VMA
handles.

Jason

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-10-16 22:51               ` Jason Gunthorpe
@ 2024-10-16 23:49                 ` Peter Xu
  2024-10-16 23:54                   ` Jason Gunthorpe
  2024-10-17 14:56                   ` David Hildenbrand
  0 siblings, 2 replies; 130+ messages in thread
From: Peter Xu @ 2024-10-16 23:49 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Hildenbrand, Ackerley Tng, tabba, quic_eberman, roypat,
	rientjes, fvdl, jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du,
	jun.miao, isaku.yamahata, muchun.song, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest

On Wed, Oct 16, 2024 at 07:51:57PM -0300, Jason Gunthorpe wrote:
> On Wed, Oct 16, 2024 at 04:16:17PM -0400, Peter Xu wrote:
> > 
> > Is there chance that when !CoCo will be supported, then external modules
> > (e.g. VFIO) can reuse the old user mappings, just like before gmemfd?
> > 
> > To support CoCo, I understand gmem+offset is required all over the places.
> > However in a non-CoCo context, I wonder whether the other modules are
> > required to stick with gmem+offset, or they can reuse the old VA ways,
> > because how it works can fundamentally be the same as before, except that
> > the folios now will be managed by gmemfd.
> 
> My intention with iommufd was to see fd + offest as the "new" way
> to refer to all guest memory and discourage people from using VMA
> handles.

Does it mean anonymous memory guests will not be supported at all for
iommufd?

Indeed they're very rare now and lose quite some flexibility (vs. fd-based
memory), and I can't think of many users besides some default configs or
KSM users (which I would expect to be rare).  Still, I wonder whether there
are other use cases where people need to stick with anon memory, so that no
fd is around.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-10-16 23:49                 ` Peter Xu
@ 2024-10-16 23:54                   ` Jason Gunthorpe
  2024-10-17 14:58                     ` Peter Xu
  2024-10-17 14:56                   ` David Hildenbrand
  1 sibling, 1 reply; 130+ messages in thread
From: Jason Gunthorpe @ 2024-10-16 23:54 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, Ackerley Tng, tabba, quic_eberman, roypat,
	rientjes, fvdl, jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du,
	jun.miao, isaku.yamahata, muchun.song, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest

On Wed, Oct 16, 2024 at 07:49:31PM -0400, Peter Xu wrote:
> On Wed, Oct 16, 2024 at 07:51:57PM -0300, Jason Gunthorpe wrote:
> > On Wed, Oct 16, 2024 at 04:16:17PM -0400, Peter Xu wrote:
> > > 
> > > Is there chance that when !CoCo will be supported, then external modules
> > > (e.g. VFIO) can reuse the old user mappings, just like before gmemfd?
> > > 
> > > To support CoCo, I understand gmem+offset is required all over the places.
> > > However in a non-CoCo context, I wonder whether the other modules are
> > > required to stick with gmem+offset, or they can reuse the old VA ways,
> > > because how it works can fundamentally be the same as before, except that
> > > the folios now will be managed by gmemfd.
> > 
> > My intention with iommufd was to see fd + offest as the "new" way
> > to refer to all guest memory and discourage people from using VMA
> > handles.
> 
> Does it mean anonymous memory guests will not be supported at all for
> iommufd?

No, they can use the "old" way with normal VMA's still, or they can
use an anonymous memfd with the new way..

I just don't expect to have new complex stuff built on the VMA
interface - I don't expect guestmemfd VMAs to work.

Jason

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-10-16 23:54                   ` Jason Gunthorpe
@ 2024-10-17 14:58                     ` Peter Xu
  2024-10-17 16:47                       ` Jason Gunthorpe
  0 siblings, 1 reply; 130+ messages in thread
From: Peter Xu @ 2024-10-17 14:58 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Hildenbrand, Ackerley Tng, tabba, quic_eberman, roypat,
	rientjes, fvdl, jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du,
	jun.miao, isaku.yamahata, muchun.song, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest

On Wed, Oct 16, 2024 at 08:54:24PM -0300, Jason Gunthorpe wrote:
> On Wed, Oct 16, 2024 at 07:49:31PM -0400, Peter Xu wrote:
> > On Wed, Oct 16, 2024 at 07:51:57PM -0300, Jason Gunthorpe wrote:
> > > On Wed, Oct 16, 2024 at 04:16:17PM -0400, Peter Xu wrote:
> > > > 
> > > > Is there chance that when !CoCo will be supported, then external modules
> > > > (e.g. VFIO) can reuse the old user mappings, just like before gmemfd?
> > > > 
> > > > To support CoCo, I understand gmem+offset is required all over the places.
> > > > However in a non-CoCo context, I wonder whether the other modules are
> > > > required to stick with gmem+offset, or they can reuse the old VA ways,
> > > > because how it works can fundamentally be the same as before, except that
> > > > the folios now will be managed by gmemfd.
> > > 
> > > My intention with iommufd was to see fd + offest as the "new" way
> > > to refer to all guest memory and discourage people from using VMA
> > > handles.
> > 
> > Does it mean anonymous memory guests will not be supported at all for
> > iommufd?
> 
> No, they can use the "old" way with normal VMA's still, or they can
> use an anonymous memfd with the new way..
> 
> I just don't expect to have new complex stuff built on the VMA
> interface - I don't expect guestmemfd VMAs to work.

Yes, if we are already on guestmemfd, we probably don't need to bother with
the VA interface.

Likewise, since guestmemfd already supports KVM_SET_USER_MEMORY_REGION2,
it's not a problem at all to use fd+offset for this KVM API.

My question was more towards whether gmemfd could still expose the
possibility to be used in VA forms to other modules that may not support
fd+offsets yet.  And I assume that by "VMA" you mean "VA ranges", while
"gmemfd VMA" on its own is probably OK?  That is what this series proposes
with the fault handler.

It may not be a problem to many cloud providers, but if QEMU is involved,
it's still pretty flexible and QEMU will need to add fd+offset support for
many of the existing interfaces that are mostly based on VA or VA ranges.  I
believe that includes QEMU itself, aka, the user hypervisor (which is about
how the user app should access shared pages for which KVM allows faults),
vhost-kernel (more GUP oriented), vhost-user (similar to userapp side),
etc.
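
To illustrate what "adding fd+offset support" to a VA-based interface would
roughly involve, here is a minimal user-space sketch, assuming the VMM keeps
a table of its mmap()ed guest memory regions.  The struct names and
va_to_fd_ref() are made up for this example and are not QEMU or kernel APIs.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of one mmap()ed guest memory region. */
struct guest_region {
	uintptr_t va_start; /* start of the user mapping */
	size_t len;         /* length of the mapping in bytes */
	int fd;             /* backing file descriptor */
	uint64_t fd_offset; /* file offset that va_start maps to */
};

/* Hypothetical (fd, offset) handle handed to an fd-based interface. */
struct fd_ref {
	int fd;
	uint64_t offset;
};

/*
 * Translate a user VA into the (fd, offset) pair backing it.
 * Returns false if the VA is not covered by any known region.
 */
bool va_to_fd_ref(const struct guest_region *regions, size_t nr,
		  uintptr_t va, struct fd_ref *out)
{
	assert(out != NULL);
	for (size_t i = 0; i < nr; i++) {
		const struct guest_region *r = &regions[i];

		/* va - va_start cannot underflow once va >= va_start. */
		if (va >= r->va_start && va - r->va_start < r->len) {
			out->fd = r->fd;
			out->offset = r->fd_offset + (va - r->va_start);
			return true;
		}
	}
	return false;
}
```

A caller that today passes a VA to some module would instead pass the
returned fd and offset; modules that never grow such an interface are the
ones that would still depend on VA ranges.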

I think as long as we can provide gmemfd VMAs like what this series
provides, it sounds possible to reuse the old VA interfaces before the CoCo
interfaces are ready, so that people can already start leveraging gmemfd
backing pages.

The idea in general looks nice to me - QEMU used to have a requirement where
we wanted strict vIOMMU semantics between QEMU and another process that
runs the device emulation (aka, vhost-user).  We didn't want to map all
guest RAM all the time, because to this day an OVS bug can corrupt QEMU
memory even if a vIOMMU is present (which should, only logically, be able to
prevent this).  We used to have the idea of sending one fd to the
vhost-user process, through which we control what is mapped and what can be
zapped.

gmemfd provides mostly what we used to pursue back then:

  - It allows mmap() of a guest memory region (without yet the capability
    to access all of it... otherwise it can bypass protection, no matter
    it's for CoCo or a vIOMMU in this case)

  - It allows the main process (in this case, it can be QEMU/KVM or
    anything/KVM) to control how to fault in the pages, in this case gmemfd
    lazily faults in the pages only if they're faultable / shared

  - It allows remote tearing down of pages that are no longer faultable /
    shared, which guarantees the safety measure that the other process
    cannot access any page that was not authorized

I wonder if it's good enough even for CoCo's use case, where if anyone
wants to illegally access some page, it'll simply crash.
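
The three properties above can be pictured with a toy user-space model.  This
is only a sketch under my own assumptions, not the code from this series:
the names are invented, a bool array stands in for the faultability xarray,
and clearing a flag stands in for zapping PTEs.  A zero-initialized model
starts with every page private, which is a choice made for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define GMEM_NR_PAGES 8

/* Hypothetical result codes, loosely mirroring fault handler outcomes. */
enum gmem_fault_result { GMEM_FAULT_MAPPED, GMEM_FAULT_SIGBUS };

/* Toy stand-in for a guest_memfd file: per-page state only. */
struct gmem_model {
	bool faultable[GMEM_NR_PAGES]; /* true: shared, may be mapped */
	bool mapped[GMEM_NR_PAGES];    /* true: a user mapping is present */
};

/* Fault path: install a mapping only while the page is faultable. */
enum gmem_fault_result gmem_model_fault(struct gmem_model *g, size_t idx)
{
	assert(idx < GMEM_NR_PAGES);
	if (!g->faultable[idx])
		return GMEM_FAULT_SIGBUS;
	g->mapped[idx] = true;
	return GMEM_FAULT_MAPPED;
}

/* shared -> private: block future faults first, then zap the mapping. */
void gmem_model_make_private(struct gmem_model *g, size_t idx)
{
	assert(idx < GMEM_NR_PAGES);
	g->faultable[idx] = false;
	g->mapped[idx] = false;
}

/* private -> shared: allow faults again; mapping happens lazily. */
void gmem_model_make_shared(struct gmem_model *g, size_t idx)
{
	assert(idx < GMEM_NR_PAGES);
	g->faultable[idx] = true;
}
```

The ordering in gmem_model_make_private() (disable faulting, then zap) is the
part that needs real serialization against concurrent faults in the kernel;
the model is single-threaded and ignores that.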

Besides that, we could definitely make good use of non-CoCo 1G pages for
either a postcopy solution (which James used to work on with HGM) or
hwpoisoning (where at least the latter is, I believe, still a common issue
for all of us: making hwpoison work for hugetlbfs at PAGE_SIZE granule
[1]).  The former will still be required at least for QEMU to leverage the
split-ability of gmemfd huge folios.

Then even if both KVM ioctls + iommufd ioctls will only support fd+offsets,
as long as it's allowed to be faultable and gupped on the shared portion of
the gmemfd folios, they can start to be considered as a replacement for
hugetlb to overcome those difficulties even before CoCo is supported all
over the place.  There's also a question on whether all the known modules
would finally support fd+offsets, which I'm not sure about.  If some module
won't support it, maybe it can still work with gmemfd in VA ranges so that
it can still benefit from what gmemfd can provide.

So in short, not sure if the use case can use a combination of (fd, offset)
interfacing on some modules like KVM/iommufd, but VA ranges like before on
some others.

Thanks,

[1] https://lore.kernel.org/all/20240924043924.3562257-1-jiaqiyan@google.com/

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-10-17 14:58                     ` Peter Xu
@ 2024-10-17 16:47                       ` Jason Gunthorpe
  2024-10-17 17:05                         ` Peter Xu
  2024-10-17 17:11                         ` David Hildenbrand
  0 siblings, 2 replies; 130+ messages in thread
From: Jason Gunthorpe @ 2024-10-17 16:47 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, Ackerley Tng, tabba, quic_eberman, roypat,
	rientjes, fvdl, jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du,
	jun.miao, isaku.yamahata, muchun.song, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest

On Thu, Oct 17, 2024 at 10:58:29AM -0400, Peter Xu wrote:

> My question was more torwards whether gmemfd could still expose the
> possibility to be used in VA forms to other modules that may not support
> fd+offsets yet.

I keep hearing they don't want to support page pinning on a guestmemfd
mapping, so VA based paths could not work.

> I think as long as we can provide gmemfd VMAs like what this series
> provides, it sounds possible to reuse the old VA interfaces before the CoCo
> interfaces are ready, so that people can already start leveraging gmemfd
> backing pages.

And you definitely can't get the private pages out of the VA interface
because all the VMA PTEs of private pages are non-present by definition.

Hence, you must use the FD for a lot of use cases here.

Jason

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-10-17 16:47                       ` Jason Gunthorpe
@ 2024-10-17 17:05                         ` Peter Xu
  2024-10-17 17:10                           ` Jason Gunthorpe
  2024-10-17 17:11                         ` David Hildenbrand
  1 sibling, 1 reply; 130+ messages in thread
From: Peter Xu @ 2024-10-17 17:05 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Hildenbrand, Ackerley Tng, tabba, quic_eberman, roypat,
	rientjes, fvdl, jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du,
	jun.miao, isaku.yamahata, muchun.song, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest

On Thu, Oct 17, 2024 at 01:47:13PM -0300, Jason Gunthorpe wrote:
> On Thu, Oct 17, 2024 at 10:58:29AM -0400, Peter Xu wrote:
> 
> > My question was more torwards whether gmemfd could still expose the
> > possibility to be used in VA forms to other modules that may not support
> > fd+offsets yet.
> 
> I keep hearing they don't want to support page pinning on a guestmemfd
> mapping, so VA based paths could not work.

Do you remember the reasoning behind it?  Is it because CoCo still needs to
have a bounded time window to convert from shared back to private?  If so,
maybe that's a non-issue for non-CoCo, where the VM object / gmemfd object
(when created) can have a flag marking that it's always shared and can
never be converted to private for any page within.
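
A minimal sketch of that idea, assuming a creation-time flag.  The flag name
and both helpers are purely hypothetical and only restate the argument; no
such guest_memfd flag is implied to exist.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical creation-time flag; not an existing guest_memfd flag. */
#define GMEM_MODEL_ALWAYS_SHARED (1u << 0)

/* Toy stand-in for a gmemfd object. */
struct gmem_obj {
	unsigned int flags;       /* fixed at creation */
	unsigned long nr_private; /* pages currently private */
};

/* Conversion is refused outright for an always-shared object. */
int gmem_obj_convert_to_private(struct gmem_obj *g, unsigned long nr_pages)
{
	assert(g != NULL);
	if (g->flags & GMEM_MODEL_ALWAYS_SHARED)
		return -EPERM;
	g->nr_private += nr_pages;
	return 0;
}

/*
 * Long-term pins are only unproblematic when no shared -> private
 * conversion can ever have to wait for them.
 */
bool gmem_obj_may_longterm_pin(const struct gmem_obj *g)
{
	assert(g != NULL);
	return (g->flags & GMEM_MODEL_ALWAYS_SHARED) != 0;
}
```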

So how would VFIO's DMA work even with iommufd if pages cannot be pinned?
Is some form of bounce buffering required, then?

If so, it sounds like a lot of use cases won't work with the current
infrastructure..

> 
> > I think as long as we can provide gmemfd VMAs like what this series
> > provides, it sounds possible to reuse the old VA interfaces before the CoCo
> > interfaces are ready, so that people can already start leveraging gmemfd
> > backing pages.
> 
> And you definitely can't get the private pages out of the VA interface
> because all the VMA PTEs of private pages are non-present by definition.

It's the same as "not present" if the fault() gets a SIGBUS always for
private pages, IIUC.

My prior references to "VA ranges" are mostly only for shared / faultable
pages. And they'll get zapped too when requested to be converted from
shared -> private, aka, always not present for private.

> 
> Hence, you must use the FD for a lot of use cases here.

Thanks,

-- 
Peter Xu

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-10-17 17:05                         ` Peter Xu
@ 2024-10-17 17:10                           ` Jason Gunthorpe
  2024-10-17 19:11                             ` Peter Xu
  0 siblings, 1 reply; 130+ messages in thread
From: Jason Gunthorpe @ 2024-10-17 17:10 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, Ackerley Tng, tabba, quic_eberman, roypat,
	rientjes, fvdl, jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du,
	jun.miao, isaku.yamahata, muchun.song, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest

On Thu, Oct 17, 2024 at 01:05:34PM -0400, Peter Xu wrote:
> On Thu, Oct 17, 2024 at 01:47:13PM -0300, Jason Gunthorpe wrote:
> > On Thu, Oct 17, 2024 at 10:58:29AM -0400, Peter Xu wrote:
> > 
> > > My question was more torwards whether gmemfd could still expose the
> > > possibility to be used in VA forms to other modules that may not support
> > > fd+offsets yet.
> > 
> > I keep hearing they don't want to support page pinning on a guestmemfd
> > mapping, so VA based paths could not work.
> 
> Do you remember the reasoning of it?  Is it because CoCo still needs to
> have a bounded time window to convert from shared back to private?  

I think so

> If so, maybe that's a non-issue for non-CoCo, where the VM object /
> gmemfd object (when created) can have a flag marking that it's
> always shared and can never be converted to private for any page
> within.

What is non-CoCo? Does it include the private/shared concept?

> So how would VFIO's DMA work even with iommufd if pages cannot be pinned?
> Is some form of bounce buffering required, then?

We can do some kind of atomic replace during a private/shared
exchange. In some HW cases the iommu table doesn't even need an
update.

It will be tricky stuff.

Jason

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-10-17 17:10                           ` Jason Gunthorpe
@ 2024-10-17 19:11                             ` Peter Xu
  2024-10-17 19:18                               ` Jason Gunthorpe
  0 siblings, 1 reply; 130+ messages in thread
From: Peter Xu @ 2024-10-17 19:11 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Hildenbrand, Ackerley Tng, tabba, quic_eberman, roypat,
	rientjes, fvdl, jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du,
	jun.miao, isaku.yamahata, muchun.song, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest

On Thu, Oct 17, 2024 at 02:10:10PM -0300, Jason Gunthorpe wrote:
> > If so, maybe that's a non-issue for non-CoCo, where the VM object /
> > gmemfd object (when created) can have a flag marking that it's
> > always shared and can never be converted to private for any page
> > within.
> 
> What is non-CoCo? Does it include the private/shared concept?

I used that to represent the possible gmemfd use cases outside confidential
computing.

So the private/shared things should still be around as a fundamental
property of gmemfd, but it should be always shared, with no conversion
needed for the whole lifecycle of the gmemfd, when marked !CoCo.

Basically, that's the KVM-only hugetlbfs v2.. especially if this series
will move on with hugetlb allocators, that's even closer.. which makes some
sense to me, at least for now, to avoid reinventing the wheel all over the
place for cgroup/pool/meminfo/etc.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-10-17 19:11                             ` Peter Xu
@ 2024-10-17 19:18                               ` Jason Gunthorpe
  2024-10-17 19:29                                 ` David Hildenbrand
  2024-10-18  7:15                                 ` Patrick Roy
  0 siblings, 2 replies; 130+ messages in thread
From: Jason Gunthorpe @ 2024-10-17 19:18 UTC (permalink / raw)
  To: Peter Xu
  Cc: David Hildenbrand, Ackerley Tng, tabba, quic_eberman, roypat,
	rientjes, fvdl, jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du,
	jun.miao, isaku.yamahata, muchun.song, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest

On Thu, Oct 17, 2024 at 03:11:10PM -0400, Peter Xu wrote:
> On Thu, Oct 17, 2024 at 02:10:10PM -0300, Jason Gunthorpe wrote:
> > > If so, maybe that's a non-issue for non-CoCo, where the VM object /
> > > gmemfd object (when created) can have a flag marking that it's
> > > always shared and can never be converted to private for any page
> > > within.
> > 
> > What is non-CoCo? Does it include the private/shared concept?
> 
> I used that to represent the possible gmemfd use cases outside confidential
> computing.
> 
> So the private/shared things should still be around as fundamental property
> of gmemfd, but it should be always shared and no convertion needed for the
> whole lifecycle of the gmemfd when marked !CoCo.

But what does private mean in this context?

Is it just like a bit of additional hypervisor security that the page
is not mapped anyplace except the KVM stage 2 and the hypervisor can
cause it to become mapped/shared at any time? But the guest has no
idea about this?

Jason

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-10-17 19:18                               ` Jason Gunthorpe
@ 2024-10-17 19:29                                 ` David Hildenbrand
  2024-10-18  7:15                                 ` Patrick Roy
  1 sibling, 0 replies; 130+ messages in thread
From: David Hildenbrand @ 2024-10-17 19:29 UTC (permalink / raw)
  To: Jason Gunthorpe, Peter Xu
  Cc: Ackerley Tng, tabba, quic_eberman, roypat, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, erdemaktas, vannapurve, qperret,
	jhubbard, willy, shuah, brauner, bfoster, kent.overstreet, pvorel,
	rppt, richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest

On 17.10.24 21:18, Jason Gunthorpe wrote:
> On Thu, Oct 17, 2024 at 03:11:10PM -0400, Peter Xu wrote:
>> On Thu, Oct 17, 2024 at 02:10:10PM -0300, Jason Gunthorpe wrote:
>>>> If so, maybe that's a non-issue for non-CoCo, where the VM object /
>>>> gmemfd object (when created) can have a flag marking that it's
>>>> always shared and can never be converted to private for any page
>>>> within.
>>>
>>> What is non-CoCo? Does it include the private/shared concept?
>>
>> I used that to represent the possible gmemfd use cases outside confidential
>> computing.
>>
>> So the private/shared things should still be around as fundamental property
>> of gmemfd, but it should be always shared and no convertion needed for the
>> whole lifecycle of the gmemfd when marked !CoCo.
> 
> But what does private mean in this context?
> 
> Is it just like a bit of additional hypervisor security that the page
> is not mapped anyplace except the KVM stage 2 and the hypervisor can
> cause it to become mapped/shared at any time? But the guest has no
> idea about this?

I think what Peter is trying to say is that it would all be shared. 
Private conversion is never triggered by the host or the guest.

No special security, nothing. Just like using hugetlb, but without the 
hugetlb.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-10-17 19:18                               ` Jason Gunthorpe
  2024-10-17 19:29                                 ` David Hildenbrand
@ 2024-10-18  7:15                                 ` Patrick Roy
  2024-10-18  7:50                                   ` David Hildenbrand
  1 sibling, 1 reply; 130+ messages in thread
From: Patrick Roy @ 2024-10-18  7:15 UTC (permalink / raw)
  To: Jason Gunthorpe, Peter Xu
  Cc: David Hildenbrand, Ackerley Tng, tabba, quic_eberman, rientjes,
	fvdl, jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, erdemaktas, vannapurve, qperret,
	jhubbard, willy, shuah, brauner, bfoster, kent.overstreet, pvorel,
	rppt, richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest



On Thu, 2024-10-17 at 20:18 +0100, Jason Gunthorpe wrote:
> On Thu, Oct 17, 2024 at 03:11:10PM -0400, Peter Xu wrote:
>> On Thu, Oct 17, 2024 at 02:10:10PM -0300, Jason Gunthorpe wrote:
>>>> If so, maybe that's a non-issue for non-CoCo, where the VM object /
>>>> gmemfd object (when created) can have a flag marking that it's
>>>> always shared and can never be converted to private for any page
>>>> within.
>>>
>>> What is non-CoCo? Does it include the private/shared concept?
>>
>> I used that to represent the possible gmemfd use cases outside confidential
>> computing.
>>
>> So the private/shared things should still be around as fundamental property
>> of gmemfd, but it should be always shared and no convertion needed for the
>> whole lifecycle of the gmemfd when marked !CoCo.
> 
> But what does private mean in this context?
> 
> Is it just like a bit of additional hypervisor security that the page
> is not mapped anyplace except the KVM stage 2 and the hypervisor can
> cause it to become mapped/shared at any time? But the guest has no
> idea about this?
> 
> Jason

Yes, this is pretty much exactly what I'm after when I say "non-CoCo".
No direct map entries to provide defense-in-depth for guests against
various speculative execution issues, but not a full confidential
computing setup (e.g. the guest should be completely oblivious to this,
and not require any modifications).

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-10-18  7:15                                 ` Patrick Roy
@ 2024-10-18  7:50                                   ` David Hildenbrand
  2024-10-18  9:34                                     ` Patrick Roy
  0 siblings, 1 reply; 130+ messages in thread
From: David Hildenbrand @ 2024-10-18  7:50 UTC (permalink / raw)
  To: Patrick Roy, Jason Gunthorpe, Peter Xu
  Cc: Ackerley Tng, tabba, quic_eberman, rientjes, fvdl, jthoughton,
	seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao, isaku.yamahata,
	muchun.song, erdemaktas, vannapurve, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest

On 18.10.24 09:15, Patrick Roy wrote:
> 
> 
> On Thu, 2024-10-17 at 20:18 +0100, Jason Gunthorpe wrote:
>> On Thu, Oct 17, 2024 at 03:11:10PM -0400, Peter Xu wrote:
>>> On Thu, Oct 17, 2024 at 02:10:10PM -0300, Jason Gunthorpe wrote:
>>>>> If so, maybe that's a non-issue for non-CoCo, where the VM object /
>>>>> gmemfd object (when created) can have a flag marking that it's
>>>>> always shared and can never be converted to private for any page
>>>>> within.
>>>>
>>>> What is non-CoCo? Does it include the private/shared concept?
>>>
>>> I used that to represent the possible gmemfd use cases outside confidential
>>> computing.
>>>
>>> So the private/shared things should still be around as fundamental property
>>> of gmemfd, but it should be always shared and no conversion needed for the
>>> whole lifecycle of the gmemfd when marked !CoCo.
>>
>> But what does private mean in this context?
>>
>> Is it just like a bit of additional hypervisor security that the page
>> is not mapped anyplace except the KVM stage 2 and the hypervisor can
>> cause it to become mapped/shared at any time? But the guest has no
>> idea about this?
>>
>> Jason
> 
> Yes, this is pretty much exactly what I'm after when I say "non-CoCo".

It's likely not what Peter meant, though.

I think there are three scenarios:

(a) Secure CoCo VMs: private is protected by HW
(b) Semi-secured non-CoCo VMs: private is removed from the directmap
(c) Non-CoCo VMs: only shared memory
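
As a rough illustration of how these three cases differ, here is a small
userspace C sketch. It is only a model under stated assumptions: the enum
and both helpers (gmem_vm_kind, gmem_kind_has_private,
gmem_kind_hw_protected) are made-up names and do not exist in KVM or
guest_memfd.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical names for illustration; nothing here is a kernel API. */
enum gmem_vm_kind {
	GMEM_VM_COCO,		/* (a) private memory is protected by HW */
	GMEM_VM_NO_DIRECTMAP,	/* (b) private memory is removed from the directmap */
	GMEM_VM_SHARED_ONLY,	/* (c) only shared memory */
};

/* Can a page of such a VM ever be in the private state? */
bool gmem_kind_has_private(enum gmem_vm_kind kind)
{
	return kind != GMEM_VM_SHARED_ONLY;
}

/* Is private memory protected by hardware rather than only by the host? */
bool gmem_kind_hw_protected(enum gmem_vm_kind kind)
{
	return kind == GMEM_VM_COCO;
}
```

In this model, (b) keeps the private/shared distinction but without any
hardware enforcement, while (c) never has a private page at all.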

Does that match what you have in mind? Are there other cases?

-- 
Cheers,

David / dhildenb



* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-10-18  7:50                                   ` David Hildenbrand
@ 2024-10-18  9:34                                     ` Patrick Roy
  0 siblings, 0 replies; 130+ messages in thread
From: Patrick Roy @ 2024-10-18  9:34 UTC (permalink / raw)
  To: David Hildenbrand, Jason Gunthorpe, Peter Xu
  Cc: Ackerley Tng, tabba, quic_eberman, rientjes, fvdl, jthoughton,
	seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao, isaku.yamahata,
	muchun.song, erdemaktas, vannapurve, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest



On Fri, 2024-10-18 at 08:50 +0100, David Hildenbrand wrote:
> On 18.10.24 09:15, Patrick Roy wrote:
>>
>>
>> On Thu, 2024-10-17 at 20:18 +0100, Jason Gunthorpe wrote:
>>> On Thu, Oct 17, 2024 at 03:11:10PM -0400, Peter Xu wrote:
>>>> On Thu, Oct 17, 2024 at 02:10:10PM -0300, Jason Gunthorpe wrote:
>>>>>> If so, maybe that's a non-issue for non-CoCo, where the VM object /
>>>>>> gmemfd object (when created) can have a flag marking that it's
>>>>>> always shared and can never be converted to private for any page
>>>>>> within.
>>>>>
>>>>> What is non-CoCo? Does it include the private/shared concept?
>>>>
>>>> I used that to represent the possible gmemfd use cases outside confidential
>>>> computing.
>>>>
>>>> So the private/shared things should still be around as fundamental property
>>>> of gmemfd, but it should be always shared and no conversion needed for the
>>>> whole lifecycle of the gmemfd when marked !CoCo.
>>>
>>> But what does private mean in this context?
>>>
>>> Is it just like a bit of additional hypervisor security that the page
>>> is not mapped anyplace except the KVM stage 2 and the hypervisor can
>>> cause it to become mapped/shared at any time? But the guest has no
>>> idea about this?
>>>
>>> Jason
>>
>> Yes, this is pretty much exactly what I'm after when I say "non-CoCo".
> 
> It's likely not what Peter meant, though.
> 
> I think there are three scenarios:
> 
> (a) Secure CoCo VMs: private is protected by HW
> (b) Semi-secured non-CoCo VMs: private is removed from the directmap
> (c) Non-CoCo VMs: only shared memory
> 
> Does that match what you have in mind? Are there other cases?

Yeah, I'm after your case (b). I suppose I will not call it just
"non-CoCo" anymore then :)

> -- 
> Cheers,
> 
> David / dhildenb
> 


* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-10-17 16:47                       ` Jason Gunthorpe
  2024-10-17 17:05                         ` Peter Xu
@ 2024-10-17 17:11                         ` David Hildenbrand
  2024-10-17 17:16                           ` Jason Gunthorpe
  1 sibling, 1 reply; 130+ messages in thread
From: David Hildenbrand @ 2024-10-17 17:11 UTC (permalink / raw)
  To: Jason Gunthorpe, Peter Xu
  Cc: Ackerley Tng, tabba, quic_eberman, roypat, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, erdemaktas, vannapurve, qperret,
	jhubbard, willy, shuah, brauner, bfoster, kent.overstreet, pvorel,
	rppt, richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest

On 17.10.24 18:47, Jason Gunthorpe wrote:
> On Thu, Oct 17, 2024 at 10:58:29AM -0400, Peter Xu wrote:
> 
>> My question was more towards whether gmemfd could still expose the
>> possibility to be used in VA forms to other modules that may not support
>> fd+offsets yet.
> 
> I keep hearing they don't want to support page pinning on a guestmemfd
> mapping, so VA based paths could not work.

For shared pages it absolutely must work. That's what I keep hearing :)

> 
>> I think as long as we can provide gmemfd VMAs like what this series
>> provides, it sounds possible to reuse the old VA interfaces before the CoCo
>> interfaces are ready, so that people can already start leveraging gmemfd
>> backing pages.
> 
> And you definitely can't get the private pages out of the VA interface
> because all the VMA PTEs of private pages are non-present by definition.

Agreed.

-- 
Cheers,

David / dhildenb



* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-10-17 17:11                         ` David Hildenbrand
@ 2024-10-17 17:16                           ` Jason Gunthorpe
  2024-10-17 17:55                             ` David Hildenbrand
  2024-10-17 18:26                             ` Vishal Annapurve
  0 siblings, 2 replies; 130+ messages in thread
From: Jason Gunthorpe @ 2024-10-17 17:16 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Peter Xu, Ackerley Tng, tabba, quic_eberman, roypat, rientjes,
	fvdl, jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, erdemaktas, vannapurve, qperret,
	jhubbard, willy, shuah, brauner, bfoster, kent.overstreet, pvorel,
	rppt, richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest

On Thu, Oct 17, 2024 at 07:11:46PM +0200, David Hildenbrand wrote:
> On 17.10.24 18:47, Jason Gunthorpe wrote:
> > On Thu, Oct 17, 2024 at 10:58:29AM -0400, Peter Xu wrote:
> > 
> > > My question was more towards whether gmemfd could still expose the
> > > possibility to be used in VA forms to other modules that may not support
> > > fd+offsets yet.
> > 
> > I keep hearing they don't want to support page pinning on a guestmemfd
> > mapping, so VA based paths could not work.
> 
> For shared pages it absolutely must work. That's what I keep hearing :)

Oh that's confusing. I assume non-longterm pins are desired on shared
pages though??

Jason


* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-10-17 17:16                           ` Jason Gunthorpe
@ 2024-10-17 17:55                             ` David Hildenbrand
  2024-10-17 18:26                             ` Vishal Annapurve
  1 sibling, 0 replies; 130+ messages in thread
From: David Hildenbrand @ 2024-10-17 17:55 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: Peter Xu, Ackerley Tng, tabba, quic_eberman, roypat, rientjes,
	fvdl, jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, erdemaktas, vannapurve, qperret,
	jhubbard, willy, shuah, brauner, bfoster, kent.overstreet, pvorel,
	rppt, richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest

On 17.10.24 19:16, Jason Gunthorpe wrote:
> On Thu, Oct 17, 2024 at 07:11:46PM +0200, David Hildenbrand wrote:
>> On 17.10.24 18:47, Jason Gunthorpe wrote:
>>> On Thu, Oct 17, 2024 at 10:58:29AM -0400, Peter Xu wrote:
>>>
>>>> My question was more towards whether gmemfd could still expose the
>>>> possibility to be used in VA forms to other modules that may not support
>>>> fd+offsets yet.
>>>
>>> I keep hearing they don't want to support page pinning on a guestmemfd
>>> mapping, so VA based paths could not work.
>>
>> For shared pages it absolutely must work. That's what I keep hearing :)
> 
> Oh that's confusing. I assume non longterm pins desired on shared
> pages though??

For user space to drive I/O to shared pages, GUP is often required
(e.g., O_DIRECT), as was raised at LPC in a session IIRC (someone 
brought up a use case that involved vhost-user and friends).

Of course, for the guest_memfd use cases where we want to remove also 
shared pages from the directmap, it's not possible, but let's put that 
aside (I recall there was a brief discussion at LPC about that: it's 
tricky for shared memory for exactly this reason -- I/O).

Longterm pins would have to be used with care, and it's under user-space
control, and user-space must be aware of the implications: for example, 
registering shared pages as fixed buffers for liburing is possible, but 
when a conversion to private is requested it must unregister these buffers.

(in VFIO terms, a prior unmap operation would be required)

Of course, a conversion to private will not work as long as the pages 
are pinned, and this is under user space control.

If the guest attempts to perform such a conversion while pages are
pinned, there will likely be a notification to user space (we touched on
that today in the upstream call) that something is blocking the 
conversion of that page, and user space has to fix that up and retry.

It's not expected to matter much in practice, but it can be triggered 
and there must be a way to handle it: if a guest triggers a 
shared->private conversion while there is still I/O going on the page, 
something is messed up, and the conversion will be delayed until the I/O 
is done and the page can be converted.

There are still quite some things to be clarified, but this is my 
understanding so far.
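
To illustrate the "fix that up and retry" flow, here is a minimal
userspace C model. It is a sketch under assumptions, not the proposed
implementation: struct page_model and the model_*() helpers are
hypothetical, and -EAGAIN merely stands in for whatever notification
user space would actually receive.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical model of one shared guest_memfd page; not a kernel struct. */
struct page_model {
	bool is_private;
	int pin_count;	/* e.g. pins from O_DIRECT or registered buffers */
};

/* Model of the kernel side: refuse the conversion while pins exist. */
int model_convert_to_private(struct page_model *page)
{
	if (page->pin_count > 0)
		return -EAGAIN;	/* assumed error code, for illustration only */
	page->is_private = true;
	return 0;
}

/* Model of user space dropping its pins (unregister buffers, VFIO unmap). */
void model_unpin_all(struct page_model *page)
{
	page->pin_count = 0;
}

/* Model of the VMM flow: try, fix up on failure, then retry once. */
int model_vmm_handle_conversion(struct page_model *page)
{
	int ret = model_convert_to_private(page);

	if (ret == -EAGAIN) {
		model_unpin_all(page);
		ret = model_convert_to_private(page);
	}
	return ret;
}
```

The point of the model is only that the conversion itself never forces
the pins away: user space has to release them before the retry succeeds.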

-- 
Cheers,

David / dhildenb


* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-10-17 17:16                           ` Jason Gunthorpe
  2024-10-17 17:55                             ` David Hildenbrand
@ 2024-10-17 18:26                             ` Vishal Annapurve
  1 sibling, 0 replies; 130+ messages in thread
From: Vishal Annapurve @ 2024-10-17 18:26 UTC (permalink / raw)
  To: Jason Gunthorpe
  Cc: David Hildenbrand, Peter Xu, Ackerley Tng, tabba, quic_eberman,
	roypat, rientjes, fvdl, jthoughton, seanjc, pbonzini, zhiquan1.li,
	fan.du, jun.miao, isaku.yamahata, muchun.song, erdemaktas,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest

On Thu, Oct 17, 2024 at 10:46 PM Jason Gunthorpe <jgg@nvidia.com> wrote:
>
> On Thu, Oct 17, 2024 at 07:11:46PM +0200, David Hildenbrand wrote:
> > On 17.10.24 18:47, Jason Gunthorpe wrote:
> > > On Thu, Oct 17, 2024 at 10:58:29AM -0400, Peter Xu wrote:
> > >
> > > > My question was more towards whether gmemfd could still expose the
> > > > possibility to be used in VA forms to other modules that may not support
> > > > fd+offsets yet.
> > >
> > > I keep hearing they don't want to support page pinning on a guestmemfd
> > > mapping, so VA based paths could not work.
> >
> > For shared pages it absolutely must work. That's what I keep hearing :)
>
> Oh that's confusing. I assume non longterm pins desired on shared
> pages though??
>
> Jason

For hugepage support to work, longterm pins on guest private pages
need to be avoided [1], in case this was somehow the cause of any
confusion here.

[1] https://lpc.events/event/18/contributions/1764/attachments/1409/3182/LPC%202024_%201G%20page%20support%20for%20guest_memfd.pdf
(slide 12)


* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-10-16 23:49                 ` Peter Xu
  2024-10-16 23:54                   ` Jason Gunthorpe
@ 2024-10-17 14:56                   ` David Hildenbrand
  1 sibling, 0 replies; 130+ messages in thread
From: David Hildenbrand @ 2024-10-17 14:56 UTC (permalink / raw)
  To: Peter Xu, Jason Gunthorpe
  Cc: Ackerley Tng, tabba, quic_eberman, roypat, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, erdemaktas, vannapurve, qperret,
	jhubbard, willy, shuah, brauner, bfoster, kent.overstreet, pvorel,
	rppt, richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest

On 17.10.24 01:49, Peter Xu wrote:
> On Wed, Oct 16, 2024 at 07:51:57PM -0300, Jason Gunthorpe wrote:
>> On Wed, Oct 16, 2024 at 04:16:17PM -0400, Peter Xu wrote:
>>>
>>> Is there chance that when !CoCo will be supported, then external modules
>>> (e.g. VFIO) can reuse the old user mappings, just like before gmemfd?
>>>
>>> To support CoCo, I understand gmem+offset is required all over the places.
>>> However in a non-CoCo context, I wonder whether the other modules are
>>> required to stick with gmem+offset, or they can reuse the old VA ways,
>>> because how it works can fundamentally be the same as before, except that
>>> the folios now will be managed by gmemfd.
>>
>> My intention with iommufd was to see fd + offset as the "new" way
>> to refer to all guest memory and discourage people from using VMA
>> handles.
> 
> Does it mean anonymous memory guests will not be supported at all for
> iommufd?
> 
> Indeed it's very rare now, lose quite some flexibility (v.s. fd based), and
> I can't think of a lot besides some default configs or KSM users (which I
> would expect rare), but still I wonder there're other use cases that people
> would still need to stick with anon, hence fd isn't around.

Not sure I completely understand the question, but for most VMs out 
there I expect anonymous memory to remain the default memory backing.

Regarding users of iommufd, I have absolutely no clue :)

-- 
Cheers,

David / dhildenb



* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-10-16 20:16             ` Peter Xu
  2024-10-16 22:51               ` Jason Gunthorpe
@ 2024-10-17 15:02               ` David Hildenbrand
  1 sibling, 0 replies; 130+ messages in thread
From: David Hildenbrand @ 2024-10-17 15:02 UTC (permalink / raw)
  To: Peter Xu
  Cc: Ackerley Tng, tabba, quic_eberman, roypat, jgg, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, erdemaktas, vannapurve, qperret,
	jhubbard, willy, shuah, brauner, bfoster, kent.overstreet, pvorel,
	rppt, richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest

On 16.10.24 22:16, Peter Xu wrote:
> On Wed, Oct 16, 2024 at 10:45:43AM +0200, David Hildenbrand wrote:
>> On 16.10.24 01:42, Ackerley Tng wrote:
>>> Peter Xu <peterx@redhat.com> writes:
>>>
>>>> On Fri, Oct 11, 2024 at 11:32:11PM +0000, Ackerley Tng wrote:
>>>>> Peter Xu <peterx@redhat.com> writes:
>>>>>
>>>>>> On Tue, Sep 10, 2024 at 11:43:57PM +0000, Ackerley Tng wrote:
>>>>>>> The faultability xarray is stored on the inode since faultability is a
>>>>>>> property of the guest_memfd's memory contents.
>>>>>>>
>>>>>>> In this RFC, presence of an entry in the xarray indicates faultable,
>>>>>>> but this could be flipped so that presence indicates unfaultable. For
>>>>>>> flexibility, a special value "FAULT" is used instead of a simple
>>>>>>> boolean.
>>>>>>>
>>>>>>> However, at some stages of a VM's lifecycle there could be more
>>>>>>> private pages, and at other stages there could be more shared pages.
>>>>>>>
>>>>>>> This is likely to be replaced by a better data structure in a future
>>>>>>> revision to better support ranges.
>>>>>>>
>>>>>>> Also store struct kvm_gmem_hugetlb in struct kvm_gmem_hugetlb as a
>>>>>>> pointer. inode->i_mapping->i_private_data.
>>>>>>
>>>>>> Could you help explain the difference between faultability v.s. the
>>>>>> existing KVM_MEMORY_ATTRIBUTE_PRIVATE?  Not sure if I'm the only one who's
>>>>>> confused, otherwise might be good to enrich the commit message.
>>>>>
>>>>> Thank you for this question, I'll add this to the commit message to the
>>>>> next revision if Fuad's patch set [1] doesn't make it first.
>>>>>
>>>>> Reason (a): To elaborate on the explanation in [1],
>>>>> KVM_MEMORY_ATTRIBUTE_PRIVATE is whether userspace wants this page to be
>>>>> private or shared, and faultability is whether the page is allowed to be
>>>>> faulted in by userspace.
>>>>>
>>>>> These two are similar but may not be the same thing. In pKVM, pKVM
>>>>> cannot trust userspace's configuration of private/shared, and other
>>>>> information will go into determining the private/shared setting in
>>>>> faultability.
>>>>
>>>> It makes sense to me that the kernel has the right to decide which page is
>>>> shared / private.  No matter if it's for pKVM or CoCo, I believe the normal
>>>> case is most / all pages are private, until some requests to share them for
>>>> special purposes (like DMA).  But that'll need to be initiated as a request
>>>> from the guest not the userspace hypervisor.
>>>
>>> For TDX, the plan is that the guest will request the page to be remapped
>>> as shared or private, and the handler for that request will exit to
>>> the userspace VMM.
>>>
>>> The userspace VMM will then do any necessary coordination (e.g. for a
>>> shared to private conversion it may need to unpin pages from DMA), and
>>> then use the KVM_SET_MEMORY_ATTRIBUTES ioctl to indicate agreement with
>>> the guest's requested conversion. This is where
>>> KVM_MEMORY_ATTRIBUTE_PRIVATE will be provided.
>>>
>>> Patch 38 [1] updates
>>> tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c to
>>> demonstrate the usage flow for x86.
>>>
>>> Fuad will be in a better position to explain the flow for pKVM.
>>>
>>>> I must confess I totally have no idea how KVM_MEMORY_ATTRIBUTE_PRIVATE is
>>>> planned to be used in the future. Currently it's always set at least in
>>>> QEMU if gmemfd is enabled, so it doesn't yet tell me anything..
>>>>
>>>> If it's driven by the userspace side of the hypervisor, I wonder when
>>>> should the user app request some different value it already was, if the
>>>> kernel already has an answer in this case.  It made me even more confused,
>>>> as we have this in the API doc:
>>>>
>>>>           Note, there is no "get" API.  Userspace is responsible for
>>>>           explicitly tracking the state of a gfn/page as needed.
>>>>
>>>> And I do wonder whether we will still need some API just to query whether
>>>> the kernel allows the page to be mapped or not (aka, the "real" shared /
>>>> private status of a guest page).  I guess that's not directly relevant to
>>>> the faultability to be introduced here, but if you or anyone know please
>>>> kindly share, I'd love to learn about it.
>>>
>>> The userspace VMM will track the initial shared/private state, in the
>>> sense that when the VM is created, the mem_attr_array is initialized
>>> such that the guest pages are all shared.
>>>
>>> Then when the userspace VMM calls the KVM_SET_MEMORY_ATTRIBUTES ioctl,
>>> it should record all changes so it knows what the state is in the
>>> kernel.
>>>
>>> Even if userspace VMM doesn't record the state properly, if the
>>> KVM_SET_MEMORY_ATTRIBUTES ioctl is used to request no change
>>> (e.g. setting an already private page to private), it will just be a
>>> no-op in the kernel.
>>>
>>>>>
>>>>> Perhaps Fuad can elaborate more here.
>>>>>
>>>>> Reason (b): In this patch series (mostly focus on x86 first), we're
>>>>> using faultability to prevent any future faults before checking that
>>>>> there are no mappings.
>>>>>
>>>>> Having a different xarray from mem_attr_array allows us to disable
>>>>> faulting before committing to changing mem_attr_array. Please see
>>>>> `kvm_gmem_should_set_attributes_private()` in this patch [2].
>>>>>
>>>>> We're not completely sure about the effectiveness of using faultability
>>>>> to block off future faults here, in future revisions we may be using a
>>>>> different approach. The folio_lock() is probably important if we need to
>>>>> check mapcount. Please let me know if you have any ideas!
>>>>>
>>>>> The starting point of having a different xarray was pKVM's requirement
>>>>> of having separate xarrays, and we later realized that the xarray could
>>>>> be used for reason (b). For x86 we could perhaps eventually remove the
>>>>> second xarray? Not sure as of now.
>>>>
>>>> Just had a quick look at patch 27:
>>>>
>>>> https://lore.kernel.org/all/5a05eb947cf7aa21f00b94171ca818cc3d5bdfee.1726009989.git.ackerleytng@google.com/
>>>>
>>>> I'm not yet sure what's protecting from faultability being modified against
>>>> a concurrent fault().
>>>>
>>>> I wonder whether one can use the folio lock to serialize that, so that one
>>>> needs to take the folio lock to modify/lookup the folio's faultability,
>>>> then it may naturally match with the fault() handler design, where
>>>> kvm_gmem_get_folio() needs to lock the page first.
>>>>
>>>> But then kvm_gmem_is_faultable() will need to also be called only after the
>>>> folio is locked to avoid races.
>>>
>>> My bad. In our rush to get this series out before LPC, the patch series
>>> was not organized very well. Patch 39 [2] adds the
>>> lock. filemap_invalidate_lock_shared() should make sure that faulting
>>> doesn't race with faultability updates.
>>>
>>>>>> The latter is per-slot, so one level higher, however I don't think it's a
>>>>>> common use case for mapping the same gmemfd in multiple slots anyway for
>>>>>> KVM (besides corner cases like live upgrade).  So perhaps this is not about
>>>>>> layering but something else?  For example, any use case where PRIVATE and
>>>>>> FAULTABLE can be reported with different values.
>>>>>>
>>>>>> Another higher level question is, is there any plan to support non-CoCo
>>>>>> context for 1G?
>>>>>
>>>>> I believe guest_memfd users are generally in favor of eventually using
>>>>> guest_memfd for non-CoCo use cases, which means we do want 1G (shared,
>>>>> in the case of CoCo) page support.
>>>>>
>>>>> However, core-mm's fault path does not support mapping at anything
>>>>> higher than the PMD level (other than hugetlb_fault(), which the
>>>>> community wants to move away from), so core-mm wouldn't be able to map
>>>>> 1G pages taken from HugeTLB.
>>>>
>>>> Have you looked at vm_operations_struct.huge_fault()?  Or maybe you're
>>>> referring to some other challenges?
>>>>
>>>
>>> IIUC vm_operations_struct.huge_fault() is used when creating a PMD, but
>>> PUD mappings will be needed for 1G pages, so 1G pages can't be mapped by
>>> core-mm using vm_operations_struct.huge_fault().
>>
>>
>> Just to clarify a bit for Peter: as has been discussed previously, there are
>> rather big differences between CoCo and non-CoCo VMs.
>>
>> In CoCo VMs, the primary portion of all pages are private, and they are not
>> mapped into user space. Only a handful of pages are commonly shared and
>> mapped into user space.
>>
>> In non-CoCo VMs, all pages are shared and (for the time being) all pages are
>> mapped into user space from where KVM will consume them.
>>
>>
>> Installing pmd/pud mappings into user space (recall: shared memory only) is
>> currently not really a requirement for CoCo VMs, and therefore not the focus
>> of this work.
>>
>> Further, it's currently considered to be incompatible with getting in-place
>> private<->shared conversion on *page* granularity right, as we will be
>> exposing huge/gigantic folios via individual small folios to core-MM.
>> Mapping a PMD/PUD into core-mm, that is composed of multiple folios is not
>> going to fly, unless using a PFNMAP, which has been briefly discussed as
>> well, but disregarded so far (no page pinning support).
>>
>> So in the context of this work here, huge faults and PUD/PMD *user space
>> page tables* do not apply.
>>
>> For non-CoCo VMs there is no in-place conversion problem. One could use the
>> same CoCo implementation, but without user space pud/pmd mappings. KVM and
>> VFIO would have to consume this memory via the guest_memfd in memslots
>> instead of via the user space mappings to more easily get PMD/PUD mappings
>> into the secondary MMU. And the downsides would be sacrificing the vmemmap
> 
> Is there chance that when !CoCo will be supported, then external modules
> (e.g. VFIO) can reuse the old user mappings, just like before gmemfd?

I expect this at least initially to be the case. At some point, we might 
see a transition to fd+offset for some interfaces.

I recall that there was a similar discussion when specifying "shared" 
memory in a KVM memory slot that will be backed by a guest_memfd: 
initially, this would be via VA and not via guest_memfd+offset. I recall 
Sean and James want it to stay that way (sorry if I am wrong!), and
James might require that to get the fancy uffd mechanism flying.

> 
> To support CoCo, I understand gmem+offset is required all over the places.
> However in a non-CoCo context, I wonder whether the other modules are
> required to stick with gmem+offset, or they can reuse the old VA ways,
> because how it works can fundamentally be the same as before, except that
> the folios now will be managed by gmemfd.
> 
> I think the good thing with such approach is when developing CoCo support
> for all these modules, there's less constraints / concerns to be compatible
> with non-CoCo use case, also it'll make it even easier to be used in
> production before all CoCo facilities ready, as most infrastructures are
> already around and being used for years if VA can be mapped and GUPed like
> before.

Right, but even if most interfaces support guest_memfd+offset, things 
like DIRECT_IO to shared guest memory will require VA+GUP (someone 
brought that up at LPC).

-- 
Cheers,

David / dhildenb



* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-10-15 23:42         ` Ackerley Tng
  2024-10-16  8:45           ` David Hildenbrand
@ 2024-10-16  8:50           ` David Hildenbrand
  2024-10-16 10:48             ` Vishal Annapurve
  1 sibling, 1 reply; 130+ messages in thread
From: David Hildenbrand @ 2024-10-16  8:50 UTC (permalink / raw)
  To: Ackerley Tng, Peter Xu
  Cc: tabba, quic_eberman, roypat, jgg, rientjes, fvdl, jthoughton,
	seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao, isaku.yamahata,
	muchun.song, erdemaktas, vannapurve, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest

>> I also don't know how you treat things like folio_test_hugetlb() on
>> possible assumptions that the VMA must be a hugetlb vma.  I'd confess I
>> didn't yet check the rest of the patchset yet - reading a large series
>> without a git tree is sometimes challenging to me.
>>
> 
> I'm thinking to basically never involve folio_test_hugetlb(), and the
> VMAs used by guest_memfd will also never be a HugeTLB VMA. That's
> because only the HugeTLB allocator is used, but by the time the folio is
> mapped to userspace, it would already have been split. After the
> page is split, the folio loses its HugeTLB status. guest_memfd folios
> will never be mapped to userspace while they still have a HugeTLB
> status.

We absolutely must convert these hugetlb folios to non-hugetlb folios.

That is one of the reasons why I raised at LPC that we should focus on 
leaving hugetlb out of the picture and rather have a global pool, and 
the option to move folios from the global pool back and forth to hugetlb 
or to guest_memfd.

What exactly that would look like is TBD.

For the time being, I think we could add a "hack" to take hugetlb folios 
from hugetlb for our purposes, but we would absolutely have to convert 
them to non-hugetlb folios, especially when we split them to small 
folios and start using the mapcount. But it doesn't feel quite clean.

Simply starting with a separate global pool (e.g., boot-time allocation 
similar to what hugetlb does, or CMA) might be cleaner, and a lot of
stuff could be factored out from hugetlb code to achieve that.

-- 
Cheers,

David / dhildenb


* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-10-16  8:50           ` David Hildenbrand
@ 2024-10-16 10:48             ` Vishal Annapurve
  2024-10-16 11:54               ` David Hildenbrand
  0 siblings, 1 reply; 130+ messages in thread
From: Vishal Annapurve @ 2024-10-16 10:48 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Ackerley Tng, Peter Xu, tabba, quic_eberman, roypat, jgg,
	rientjes, fvdl, jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du,
	jun.miao, isaku.yamahata, muchun.song, erdemaktas, qperret,
	jhubbard, willy, shuah, brauner, bfoster, kent.overstreet, pvorel,
	rppt, richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest

On Wed, Oct 16, 2024 at 2:20 PM David Hildenbrand <david@redhat.com> wrote:
>
> >> I also don't know how you treat things like folio_test_hugetlb() on
> >> possible assumptions that the VMA must be a hugetlb vma.  I'd confess I
> >> didn't yet check the rest of the patchset yet - reading a large series
> >> without a git tree is sometimes challenging to me.
> >>
> >
> > I'm thinking to basically never involve folio_test_hugetlb(), and the
> > VMAs used by guest_memfd will also never be a HugeTLB VMA. That's
> > because only the HugeTLB allocator is used, but by the time the folio is
> > mapped to userspace, it would already have been split. After the
> > page is split, the folio loses its HugeTLB status. guest_memfd folios
> > will never be mapped to userspace while they still have a HugeTLB
> > status.
>
> We absolutely must convert these hugetlb folios to non-hugetlb folios.
>
> That is one of the reasons why I raised at LPC that we should focus on
> leaving hugetlb out of the picture and rather have a global pool, and
> the option to move folios from the global pool back and forth to hugetlb
> or to guest_memfd.
>
> How exactly that would look like is TBD.
>
> For the time being, I think we could add a "hack" to take hugetlb folios
> from hugetlb for our purposes, but we would absolutely have to convert
> them to non-hugetlb folios, especially when we split them to small
> folios and start using the mapcount. But it doesn't feel quite clean.

As hugepage folios need to be split up in order to support backing
CoCo VMs with hugepages, I would assume any folio-based hugepage
memory allocation will need to go through split/merge cycles throughout
the guest_memfd lifetime.

The plan for the next RFC series is to abstract out the hugetlb folio
management within guest_memfd, so that any hugetlb-specific logic is
cleanly separated out and guest_memfd can allocate memory from other
hugepage allocators in the future.

>
> Simply starting with a separate global pool (e.g., boot-time allocation
> similar to as done by hugetlb, or CMA) might be cleaner, and a lot of
> stuff could be factored out from hugetlb code to achieve that.

I am not sure a separate global pool necessarily solves all the
issues here unless we come up with more concrete implementation
details. One of the concerns was the ability to implement/retain
HVO while transferring memory between the separate global pool and
the hugetlb pool, i.e. whether it can seamlessly serve all hugepage
users on the host. Another question is whether the separate
pool/allocator simplifies the split/merge operations at runtime.

>
> --
> Cheers,
>
> David / dhildenb
>

* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-10-16 10:48             ` Vishal Annapurve
@ 2024-10-16 11:54               ` David Hildenbrand
  2024-10-16 11:57                 ` Jason Gunthorpe
  0 siblings, 1 reply; 130+ messages in thread
From: David Hildenbrand @ 2024-10-16 11:54 UTC (permalink / raw)
  To: Vishal Annapurve
  Cc: Ackerley Tng, Peter Xu, tabba, quic_eberman, roypat, jgg,
	rientjes, fvdl, jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du,
	jun.miao, isaku.yamahata, muchun.song, erdemaktas, qperret,
	jhubbard, willy, shuah, brauner, bfoster, kent.overstreet, pvorel,
	rppt, richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest

On 16.10.24 12:48, Vishal Annapurve wrote:
> On Wed, Oct 16, 2024 at 2:20 PM David Hildenbrand <david@redhat.com> wrote:
>>
>>>> I also don't know how you treat things like folio_test_hugetlb() on
>>>> possible assumptions that the VMA must be a hugetlb vma.  I'd confess I
>>>> didn't yet check the rest of the patchset yet - reading a large series
>>>> without a git tree is sometimes challenging to me.
>>>>
>>>
>>> I'm thinking to basically never involve folio_test_hugetlb(), and the
>>> VMAs used by guest_memfd will also never be a HugeTLB VMA. That's
>>> because only the HugeTLB allocator is used, but by the time the folio is
>>> mapped to userspace, it would have already have been split. After the
>>> page is split, the folio loses its HugeTLB status. guest_memfd folios
>>> will never be mapped to userspace while they still have a HugeTLB
>>> status.
>>
>> We absolutely must convert these hugetlb folios to non-hugetlb folios.
>>
>> That is one of the reasons why I raised at LPC that we should focus on
>> leaving hugetlb out of the picture and rather have a global pool, and
>> the option to move folios from the global pool back and forth to hugetlb
>> or to guest_memfd.
>>
>> How exactly that would look like is TBD.
>>
>> For the time being, I think we could add a "hack" to take hugetlb folios
>> from hugetlb for our purposes, but we would absolutely have to convert
>> them to non-hugetlb folios, especially when we split them to small
>> folios and start using the mapcount. But it doesn't feel quite clean.
> 
> As hugepage folios need to be split up in order to support backing
> CoCo VMs with hugepages, I would assume any folio based hugepage
> memory allocation will need to go through split/merge cycles through
> the guest memfd lifetime.

Yes, that's my understanding as well.

> 
> Plan through next RFC series is to abstract out the hugetlb folio
> management within guest_memfd so that any hugetlb specific logic is
> cleanly separated out and allows guest memfd to allocate memory from
> other hugepage allocators in the future.

Yes, that must happen. As soon as a hugetlb folio would transition to 
guest_memfd, it must no longer be a hugetlb folio.

> 
>>
>> Simply starting with a separate global pool (e.g., boot-time allocation
>> similar to as done by hugetlb, or CMA) might be cleaner, and a lot of
>> stuff could be factored out from hugetlb code to achieve that.
> 
> I am not sure if a separate global pool necessarily solves all the
> issues here unless we come up with more concrete implementation
> details. One of the concerns was the ability of implementing/retaining
> HVO while transferring memory between the separate global pool and
> hugetlb pool i.e. whether it can seamlessly serve all hugepage users
> on the host.

Likely should be doable. All we need is the generalized concept of a 
folio with HVO, and a way to move these folios between owners (e.g., 
global<->hugetlb, global<->guest_memfd).

Factoring the HVO optimization out shouldn't be too crazy I believe. 
Famous last words :)

> Another question could be whether the separate
> pool/allocator simplifies the split/merge operations at runtime.

The fewer hugetlb hacks we have to add, the better :)

-- 
Cheers,

David / dhildenb


* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-10-16 11:54               ` David Hildenbrand
@ 2024-10-16 11:57                 ` Jason Gunthorpe
  0 siblings, 0 replies; 130+ messages in thread
From: Jason Gunthorpe @ 2024-10-16 11:57 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Vishal Annapurve, Ackerley Tng, Peter Xu, tabba, quic_eberman,
	roypat, rientjes, fvdl, jthoughton, seanjc, pbonzini, zhiquan1.li,
	fan.du, jun.miao, isaku.yamahata, muchun.song, erdemaktas,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest

On Wed, Oct 16, 2024 at 01:54:32PM +0200, David Hildenbrand wrote:

> Likely should be doable. All we need is the generalized concept of a folio
> with HVO, and a way to move these folios between owners (e.g.,
> global<->hugetlb, global<->guest_memfd).

+1

HVO seems to have become a sticking point in these discussions. Having
a way to make any big folio HVO optimized (and undo it), and then
putting hugetlb on top of that, would be a nice refactoring.

Jason

* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2024-09-10 23:43 ` [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private Ackerley Tng
  2024-10-10 16:06   ` Peter Xu
@ 2025-02-25 20:37   ` Peter Xu
  2025-04-23 22:07     ` Ackerley Tng
  1 sibling, 1 reply; 130+ messages in thread
From: Peter Xu @ 2025-02-25 20:37 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: tabba, quic_eberman, roypat, jgg, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest, linux-fsdevel

On Tue, Sep 10, 2024 at 11:43:57PM +0000, Ackerley Tng wrote:
> @@ -1079,12 +1152,20 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>  	if (err)
>  		goto out;
>  
> +	err = -ENOMEM;
> +	private = kzalloc(sizeof(*private), GFP_KERNEL);
> +	if (!private)
> +		goto out;
> +
>  	if (flags & KVM_GUEST_MEMFD_HUGETLB) {
> -		err = kvm_gmem_hugetlb_setup(inode, size, flags);
> +		err = kvm_gmem_hugetlb_setup(inode, private, size, flags);
>  		if (err)
> -			goto out;
> +			goto free_private;
>  	}
>  
> +	xa_init(&private->faultability);
> +	inode->i_mapping->i_private_data = private;
> +
>  	inode->i_private = (void *)(unsigned long)flags;

Looks like inode->i_private isn't used before this series; the flags were
always zero before anyway.  Maybe it could hold kvm_gmem_inode_private
instead? Then make the flags part of the struct.

That avoids having two separate places (inode->i_mapping->i_private_data,
inode->i_private) storing gmem private info.

>  	inode->i_op = &kvm_gmem_iops;
>  	inode->i_mapping->a_ops = &kvm_gmem_aops;
> @@ -1097,6 +1178,8 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>  
>  	return inode;
>  
> +free_private:
> +	kfree(private);
>  out:
>  	iput(inode);
>  
> -- 
> 2.46.0.598.g6f2099f65c-goog
> 

-- 
Peter Xu


* Re: [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
  2025-02-25 20:37   ` Peter Xu
@ 2025-04-23 22:07     ` Ackerley Tng
  0 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2025-04-23 22:07 UTC (permalink / raw)
  To: Peter Xu
  Cc: tabba, quic_eberman, roypat, jgg, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, erdemaktas, vannapurve, qperret,
	jhubbard, willy, shuah, brauner, bfoster, kent.overstreet, pvorel,
	rppt, richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest

Peter Xu <peterx@redhat.com> writes:

> On Tue, Sep 10, 2024 at 11:43:57PM +0000, Ackerley Tng wrote:
>> @@ -1079,12 +1152,20 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>>  	if (err)
>>  		goto out;
>>  
>> +	err = -ENOMEM;
>> +	private = kzalloc(sizeof(*private), GFP_KERNEL);
>> +	if (!private)
>> +		goto out;
>> +
>>  	if (flags & KVM_GUEST_MEMFD_HUGETLB) {
>> -		err = kvm_gmem_hugetlb_setup(inode, size, flags);
>> +		err = kvm_gmem_hugetlb_setup(inode, private, size, flags);
>>  		if (err)
>> -			goto out;
>> +			goto free_private;
>>  	}
>>  
>> +	xa_init(&private->faultability);
>> +	inode->i_mapping->i_private_data = private;
>> +
>>  	inode->i_private = (void *)(unsigned long)flags;
>
> Looks like inode->i_private isn't used before this series; the flags was
> always zero before anyway.  Maybe it could keep kvm_gmem_inode_private
> instead? Then make the flags be part of the struct.
>
> It avoids two separate places (inode->i_mapping->i_private_data,
> inode->i_private) to store gmem private info.
>

Weakly-held opinion: I think the advantage of re-using inode->i_private
to store flags is that in some cases, e.g. non-hugetlb, we might be able
to avoid an allocation (of kvm_gmem_inode_private).
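
As a rough illustration of the trade-off, here is a userspace sketch.
fake_inode, gmem_private and all four helpers are hypothetical stand-ins,
not the kernel structures. Packing the flags straight into the
pointer-sized field needs no allocation, while moving them into a
private struct costs one allocation per inode:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the pointer-sized inode->i_private field. */
struct fake_inode {
	void *i_private;
};

/* Hypothetical private struct that would carry the flags instead. */
struct gmem_private {
	uint64_t flags;
};

/* Option 1: encode the flags in the pointer value itself; no allocation. */
static void store_flags_inline(struct fake_inode *inode, uintptr_t flags)
{
	inode->i_private = (void *)flags;
}

static uintptr_t load_flags_inline(const struct fake_inode *inode)
{
	return (uintptr_t)inode->i_private;
}

/* Option 2: allocate a struct and keep the flags inside it. */
static int store_flags_struct(struct fake_inode *inode, uint64_t flags)
{
	struct gmem_private *priv = calloc(1, sizeof(*priv));

	if (!priv)
		return -1; /* kernel code would return -ENOMEM here */
	priv->flags = flags;
	inode->i_private = priv;
	return 0;
}

static uint64_t load_flags_struct(const struct fake_inode *inode)
{
	const struct gmem_private *priv = inode->i_private;

	return priv->flags;
}
```

The inline form only works while everything that has to be stored fits
in one pointer-sized value.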

Does anyone else have any thoughts on this?

>>  	inode->i_op = &kvm_gmem_iops;
>>  	inode->i_mapping->a_ops = &kvm_gmem_aops;
>> @@ -1097,6 +1178,8 @@ static struct inode *kvm_gmem_inode_make_secure_inode(const char *name,
>>  
>>  	return inode;
>>  
>> +free_private:
>> +	kfree(private);
>>  out:
>>  	iput(inode);
>>  
>> -- 
>> 2.46.0.598.g6f2099f65c-goog
>> 
>
> -- 
> Peter Xu

* [RFC PATCH 27/39] KVM: guest_memfd: Allow mmapping guest_memfd files
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (25 preceding siblings ...)
  2024-09-10 23:43 ` [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private Ackerley Tng
@ 2024-09-10 23:43 ` Ackerley Tng
  2025-01-20 22:42   ` Peter Xu
                     ` (2 more replies)
  2024-09-10 23:43 ` [RFC PATCH 28/39] KVM: guest_memfd: Use vm_type to determine default faultability Ackerley Tng
                   ` (14 subsequent siblings)
  41 siblings, 3 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:43 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

guest_memfd files can always be mmap()ed to userspace, but
faultability is controlled by an attribute on the inode.

Co-developed-by: Fuad Tabba <tabba@google.com>
Signed-off-by: Fuad Tabba <tabba@google.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>

---
 virt/kvm/guest_memfd.c | 46 ++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 44 insertions(+), 2 deletions(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index b603518f7b62..fc2483e35876 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -781,7 +781,8 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 {
 	struct list_head *gmem_list = &inode->i_mapping->i_private_list;
 	pgoff_t start = offset >> PAGE_SHIFT;
-	pgoff_t end = (offset + len) >> PAGE_SHIFT;
+	pgoff_t nr = len >> PAGE_SHIFT;
+	pgoff_t end = start + nr;
 	struct kvm_gmem *gmem;
 
 	/*
@@ -790,6 +791,9 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
 	 */
 	filemap_invalidate_lock(inode->i_mapping);
 
+	/* TODO: Check if even_cows should be 0 or 1 */
+	unmap_mapping_range(inode->i_mapping, start, len, 0);
+
 	list_for_each_entry(gmem, gmem_list, entry)
 		kvm_gmem_invalidate_begin(gmem, start, end);
 
@@ -946,6 +950,9 @@ static void kvm_gmem_hugetlb_teardown(struct inode *inode)
 {
 	struct kvm_gmem_hugetlb *hgmem;
 
+	/* TODO: Check if even_cows should be 0 or 1 */
+	unmap_mapping_range(inode->i_mapping, 0, LLONG_MAX, 0);
+
 	truncate_inode_pages_final_prepare(inode->i_mapping);
 	kvm_gmem_hugetlb_truncate_folios_range(inode, 0, LLONG_MAX);
 
@@ -1003,11 +1010,46 @@ static void kvm_gmem_init_mount(void)
 	kvm_gmem_mnt = kern_mount(&kvm_gmem_fs);
 	BUG_ON(IS_ERR(kvm_gmem_mnt));
 
-	/* For giggles. Userspace can never map this anyways. */
 	kvm_gmem_mnt->mnt_flags |= MNT_NOEXEC;
 }
 
+static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
+{
+	struct inode *inode;
+	struct folio *folio;
+
+	inode = file_inode(vmf->vma->vm_file);
+	if (!kvm_gmem_is_faultable(inode, vmf->pgoff))
+		return VM_FAULT_SIGBUS;
+
+	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
+	if (!folio)
+		return VM_FAULT_SIGBUS;
+
+	vmf->page = folio_file_page(folio, vmf->pgoff);
+	return VM_FAULT_LOCKED;
+}
+
+static const struct vm_operations_struct kvm_gmem_vm_ops = {
+	.fault = kvm_gmem_fault,
+};
+
+static int kvm_gmem_mmap(struct file *file, struct vm_area_struct *vma)
+{
+	if ((vma->vm_flags & (VM_SHARED | VM_MAYSHARE)) !=
+	    (VM_SHARED | VM_MAYSHARE)) {
+		return -EINVAL;
+	}
+
+	file_accessed(file);
+	vm_flags_set(vma, VM_DONTDUMP);
+	vma->vm_ops = &kvm_gmem_vm_ops;
+
+	return 0;
+}
+
 static struct file_operations kvm_gmem_fops = {
+	.mmap		= kvm_gmem_mmap,
 	.open		= generic_file_open,
 	.release	= kvm_gmem_release,
 	.fallocate	= kvm_gmem_fallocate,
-- 
2.46.0.598.g6f2099f65c-goog


* Re: [RFC PATCH 27/39] KVM: guest_memfd: Allow mmapping guest_memfd files
  2024-09-10 23:43 ` [RFC PATCH 27/39] KVM: guest_memfd: Allow mmapping guest_memfd files Ackerley Tng
@ 2025-01-20 22:42   ` Peter Xu
  2025-04-23 20:25     ` Ackerley Tng
  2025-03-04 23:24   ` Peter Xu
  2025-04-02  4:07   ` Yan Zhao
  2 siblings, 1 reply; 130+ messages in thread
From: Peter Xu @ 2025-01-20 22:42 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: tabba, quic_eberman, roypat, jgg, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest, linux-fsdevel

On Tue, Sep 10, 2024 at 11:43:58PM +0000, Ackerley Tng wrote:
> @@ -790,6 +791,9 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>  	 */
>  	filemap_invalidate_lock(inode->i_mapping);
>  
> +	/* TODO: Check if even_cows should be 0 or 1 */
> +	unmap_mapping_range(inode->i_mapping, start, len, 0);
> +
>  	list_for_each_entry(gmem, gmem_list, entry)
>  		kvm_gmem_invalidate_begin(gmem, start, end);
>  
> @@ -946,6 +950,9 @@ static void kvm_gmem_hugetlb_teardown(struct inode *inode)
>  {
>  	struct kvm_gmem_hugetlb *hgmem;
>  
> +	/* TODO: Check if even_cows should be 0 or 1 */
> +	unmap_mapping_range(inode->i_mapping, 0, LLONG_MAX, 0);

Setting to 0 is ok in both places: even_cows only applies to MAP_PRIVATE,
which gmemfd doesn't support.  So feel free to drop the two comment lines.
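
For reference, only shared mappings can exist because of the flag check
in kvm_gmem_mmap() from this patch. A userspace model of that predicate
follows; the SKETCH_* values are placeholders picked for illustration,
not the kernel's VM_* bit values, and sketch_gmem_mmap_check() is a
made-up name:

```c
#include <assert.h>
#include <errno.h>

/* Placeholder bit values for illustration; not the kernel's VM_* constants. */
#define SKETCH_VM_SHARED   0x1UL
#define SKETCH_VM_MAYSHARE 0x2UL

/* Mirrors the check in kvm_gmem_mmap(): both bits must be set. */
static int sketch_gmem_mmap_check(unsigned long vm_flags)
{
	unsigned long want = SKETCH_VM_SHARED | SKETCH_VM_MAYSHARE;

	if ((vm_flags & want) != want)
		return -EINVAL;
	return 0;
}
```

As far as I understand, a MAP_PRIVATE mapping carries neither bit, so it
is rejected at mmap() time and there are no COW pages for even_cows to
act on.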

Thanks,

-- 
Peter Xu


* Re: [RFC PATCH 27/39] KVM: guest_memfd: Allow mmapping guest_memfd files
  2025-01-20 22:42   ` Peter Xu
@ 2025-04-23 20:25     ` Ackerley Tng
  0 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2025-04-23 20:25 UTC (permalink / raw)
  To: Peter Xu
  Cc: tabba, quic_eberman, roypat, jgg, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest, linux-fsdevel

Peter Xu <peterx@redhat.com> writes:

> On Tue, Sep 10, 2024 at 11:43:58PM +0000, Ackerley Tng wrote:
>> @@ -790,6 +791,9 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>>  	 */
>>  	filemap_invalidate_lock(inode->i_mapping);
>>  
>> +	/* TODO: Check if even_cows should be 0 or 1 */
>> +	unmap_mapping_range(inode->i_mapping, start, len, 0);
>> +
>>  	list_for_each_entry(gmem, gmem_list, entry)
>>  		kvm_gmem_invalidate_begin(gmem, start, end);
>>  
>> @@ -946,6 +950,9 @@ static void kvm_gmem_hugetlb_teardown(struct inode *inode)
>>  {
>>  	struct kvm_gmem_hugetlb *hgmem;
>>  
>> +	/* TODO: Check if even_cows should be 0 or 1 */
>> +	unmap_mapping_range(inode->i_mapping, 0, LLONG_MAX, 0);
>
> Setting to 0 is ok in both places: even_cows only applies to MAP_PRIVATE,
> which gmemfd doesn't support.  So feel free to drop the two comment lines.
>
> Thanks,
>
> -- 
> Peter Xu

Thank you for reviewing and helping me check on this!

* Re: [RFC PATCH 27/39] KVM: guest_memfd: Allow mmapping guest_memfd files
  2024-09-10 23:43 ` [RFC PATCH 27/39] KVM: guest_memfd: Allow mmapping guest_memfd files Ackerley Tng
  2025-01-20 22:42   ` Peter Xu
@ 2025-03-04 23:24   ` Peter Xu
  2025-04-02  4:07   ` Yan Zhao
  2 siblings, 0 replies; 130+ messages in thread
From: Peter Xu @ 2025-03-04 23:24 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: tabba, quic_eberman, roypat, jgg, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest, linux-fsdevel

On Tue, Sep 10, 2024 at 11:43:58PM +0000, Ackerley Tng wrote:
> @@ -790,6 +791,9 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>  	 */
>  	filemap_invalidate_lock(inode->i_mapping);
>  
> +	/* TODO: Check if even_cows should be 0 or 1 */
> +	unmap_mapping_range(inode->i_mapping, start, len, 0);

This should be s/start/offset/: unmap_mapping_range() takes a byte
offset, not a page index.  Otherwise, expect some filemap assert to fire
on non-zero mapcounts (when it starts to matter).
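
To illustrate the units mismatch: unmap_mapping_range() takes its hole
start and length in bytes, whereas "start" in kvm_gmem_punch_hole() is a
page index. A minimal userspace sketch, assuming 4 KiB pages (the struct
and helper names are made up for illustration and are not kernel API),
shows how far off the unmapped range ends up:

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SHIFT 12 /* assumption: 4 KiB pages */

/* Hypothetical stand-in for the byte range given to unmap_mapping_range(). */
struct byte_range {
	uint64_t begin; /* first byte to unmap */
	uint64_t len;   /* number of bytes to unmap */
};

/* As posted: a page index is passed where a byte offset is expected. */
static struct byte_range range_as_posted(uint64_t offset, uint64_t len)
{
	uint64_t start = offset >> SKETCH_PAGE_SHIFT; /* page index */
	struct byte_range r = { .begin = start, .len = len };

	return r;
}

/* As suggested in review: pass the byte offset through unchanged. */
static struct byte_range range_as_suggested(uint64_t offset, uint64_t len)
{
	struct byte_range r = { .begin = offset, .len = len };

	return r;
}
```

For a hole punched at offset 0 both variants agree, which may be why the
mismatch is easy to miss in testing.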

Btw, it would be nice if the new version would allow kvm to be compiled as
a module.  Currently it uses a lot of mm functions that are not yet
exported, so AFAIU it will only build if kvm is builtin.

Thanks,

-- 
Peter Xu


* Re: [RFC PATCH 27/39] KVM: guest_memfd: Allow mmapping guest_memfd files
  2024-09-10 23:43 ` [RFC PATCH 27/39] KVM: guest_memfd: Allow mmapping guest_memfd files Ackerley Tng
  2025-01-20 22:42   ` Peter Xu
  2025-03-04 23:24   ` Peter Xu
@ 2025-04-02  4:07   ` Yan Zhao
  2025-04-23 20:28     ` Ackerley Tng
  2 siblings, 1 reply; 130+ messages in thread
From: Yan Zhao @ 2025-04-02  4:07 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest, linux-fsdevel,
	rick.p.edgecombe

On Tue, Sep 10, 2024 at 11:43:58PM +0000, Ackerley Tng wrote:
> guest_memfd files can always be mmap()ed to userspace, but
> faultability is controlled by an attribute on the inode.
> 
> Co-developed-by: Fuad Tabba <tabba@google.com>
> Signed-off-by: Fuad Tabba <tabba@google.com>
> Co-developed-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> 
> ---
>  virt/kvm/guest_memfd.c | 46 ++++++++++++++++++++++++++++++++++++++++--
>  1 file changed, 44 insertions(+), 2 deletions(-)
> 
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index b603518f7b62..fc2483e35876 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -781,7 +781,8 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>  {
Hi Ackerley,

If userspace mmaps a guest_memfd to a VA while a GFN range is shared, it
looks like even after the GFN range has been successfully converted to
private, userspace can still call madvise(mem, size, MADV_REMOVE) on the
userspace VA. This triggers kvm_gmem_punch_hole() and
kvm_gmem_invalidate_begin(), which can zap the private GFNs in the EPT.

Is this behavior intended for in-place conversion, and could it potentially lead
to private GFN ranges being accidentally zapped from the EPT?

Apologies if I missed any related discussions on this topic.

Thanks
Yan

>  	struct list_head *gmem_list = &inode->i_mapping->i_private_list;
>  	pgoff_t start = offset >> PAGE_SHIFT;
> -	pgoff_t end = (offset + len) >> PAGE_SHIFT;
> +	pgoff_t nr = len >> PAGE_SHIFT;
> +	pgoff_t end = start + nr;
>  	struct kvm_gmem *gmem;
>  
>  	/*
> @@ -790,6 +791,9 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>  	 */
>  	filemap_invalidate_lock(inode->i_mapping);
>  
> +	/* TODO: Check if even_cows should be 0 or 1 */
> +	unmap_mapping_range(inode->i_mapping, start, len, 0);
> +
>  	list_for_each_entry(gmem, gmem_list, entry)
>  		kvm_gmem_invalidate_begin(gmem, start, end);
>  

* Re: [RFC PATCH 27/39] KVM: guest_memfd: Allow mmapping guest_memfd files
  2025-04-02  4:07   ` Yan Zhao
@ 2025-04-23 20:28     ` Ackerley Tng
  0 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2025-04-23 20:28 UTC (permalink / raw)
  To: Yan Zhao
  Cc: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, erdemaktas, vannapurve, qperret,
	jhubbard, willy, shuah, brauner, bfoster, kent.overstreet, pvorel,
	rppt, richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, rick.p.edgecombe

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Tue, Sep 10, 2024 at 11:43:58PM +0000, Ackerley Tng wrote:
>> guest_memfd files can always be mmap()ed to userspace, but
>> faultability is controlled by an attribute on the inode.
>> 
>> Co-developed-by: Fuad Tabba <tabba@google.com>
>> Signed-off-by: Fuad Tabba <tabba@google.com>
>> Co-developed-by: Ackerley Tng <ackerleytng@google.com>
>> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
>> 
>> ---
>>  virt/kvm/guest_memfd.c | 46 ++++++++++++++++++++++++++++++++++++++++--
>>  1 file changed, 44 insertions(+), 2 deletions(-)
>> 
>> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> index b603518f7b62..fc2483e35876 100644
>> --- a/virt/kvm/guest_memfd.c
>> +++ b/virt/kvm/guest_memfd.c
>> @@ -781,7 +781,8 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>>  {
> Hi Ackerley,
>
> If userspace mmaps a guest_memfd to a VA when a GFN range is shared, it looks
> that even after the GFN range has been successfully converted to private,
> userspace can still call madvise(mem, size, MADV_REMOVE) on the userspace VA.
> This action triggers kvm_gmem_punch_hole() and kvm_gmem_invalidate_begin(),
> which can zap the private GFNs in the EPT.
>
> Is this behavior intended for in-place conversion, and could it potentially lead
> to private GFN ranges being accidentally zapped from the EPT?
>
> Apologies if I missed any related discussions on this topic.

No worries and thank you for your review! The next revision will not
require userspace to call madvise(MADV_REMOVE). Memory could be mapped
in multiple processes, so unmapping from the kernel saves userspace the
trouble of coordinating between them.

>
> Thanks
> Yan
>
>>  	struct list_head *gmem_list = &inode->i_mapping->i_private_list;
>>  	pgoff_t start = offset >> PAGE_SHIFT;
>> -	pgoff_t end = (offset + len) >> PAGE_SHIFT;
>> +	pgoff_t nr = len >> PAGE_SHIFT;
>> +	pgoff_t end = start + nr;
>>  	struct kvm_gmem *gmem;
>>  
>>  	/*
>> @@ -790,6 +791,9 @@ static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
>>  	 */
>>  	filemap_invalidate_lock(inode->i_mapping);
>>  
>> +	/* TODO: Check if even_cows should be 0 or 1 */
>> +	unmap_mapping_range(inode->i_mapping, start, len, 0);
>> +
>>  	list_for_each_entry(gmem, gmem_list, entry)
>>  		kvm_gmem_invalidate_begin(gmem, start, end);
>>  

* [RFC PATCH 28/39] KVM: guest_memfd: Use vm_type to determine default faultability
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (26 preceding siblings ...)
  2024-09-10 23:43 ` [RFC PATCH 27/39] KVM: guest_memfd: Allow mmapping guest_memfd files Ackerley Tng
@ 2024-09-10 23:43 ` Ackerley Tng
  2024-09-10 23:44 ` [RFC PATCH 29/39] KVM: Handle conversions in the SET_MEMORY_ATTRIBUTES ioctl Ackerley Tng
                   ` (13 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:43 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

Memory of a KVM_X86_SW_PROTECTED_VM defaults to faultable to align
with the default in kvm->mem_attr_array.

For this RFC, determine default faultability when associating a range
with a memslot.

Another option is to determine default faultability at guest_memfd
creation time. guest_memfd is created for a specific VM, hence we can
set default faultability based on the VM type.

In future, if different struct kvms are bound to the same guest_memfd
inode, all the struct kvms must be of the same vm_type.

TODO: Perhaps faultability should be based on kvm->mem_attr_array?

Signed-off-by: Ackerley Tng <ackerleytng@google.com>

---
 virt/kvm/guest_memfd.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index fc2483e35876..1d4dfe0660ad 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -1256,6 +1256,23 @@ static struct file *kvm_gmem_inode_create_getfile(void *priv, loff_t size,
 	return file;
 }
 
+static void kvm_gmem_set_default_faultability_by_vm_type(struct inode *inode,
+							 u8 vm_type,
+							 loff_t start, loff_t end)
+{
+	bool faultable;
+
+	switch (vm_type) {
+	case KVM_X86_SW_PROTECTED_VM:
+		faultable = true;
+		break;
+	default:
+		faultable = false;
+	}
+
+	WARN_ON(kvm_gmem_set_faultable(inode, start, end, faultable));
+}
+
 static int __kvm_gmem_create(struct kvm *kvm, loff_t size, u64 flags)
 {
 	struct kvm_gmem *gmem;
@@ -1378,6 +1395,11 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
 	slot->gmem.pgoff = start;
 
 	xa_store_range(&gmem->bindings, start, end - 1, slot, GFP_KERNEL);
+
+	kvm_gmem_set_default_faultability_by_vm_type(file_inode(file),
+						     kvm->arch.vm_type,
+						     start, end);
+
 	filemap_invalidate_unlock(inode->i_mapping);
 
 	/*
-- 
2.46.0.598.g6f2099f65c-goog


* [RFC PATCH 29/39] KVM: Handle conversions in the SET_MEMORY_ATTRIBUTES ioctl
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (27 preceding siblings ...)
  2024-09-10 23:43 ` [RFC PATCH 28/39] KVM: guest_memfd: Use vm_type to determine default faultability Ackerley Tng
@ 2024-09-10 23:44 ` Ackerley Tng
  2024-09-10 23:44 ` [RFC PATCH 30/39] KVM: guest_memfd: Handle folio preparation for guest_memfd mmap Ackerley Tng
                   ` (12 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:44 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

The key steps for a private to shared conversion are:

1. Unmap from guest page tables
2. Set pages associated with requested range in memslot to be
   faultable
3. Update kvm->mem_attr_array

The key steps for a shared to private conversion are:

1. Check and disallow set_memory_attributes if any page in the range
   is still mapped or pinned, by
   a. Updating guest_memfd's faultability to prevent future faulting
   b. Returning -EINVAL if any pages are still pinned.
2. Update kvm->mem_attr_array

Userspace VMM must ensure shared pages are not in use, since any
faults racing with this call will get a SIGBUS.
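
For illustration only (not part of this patch), the ordering of the steps
above can be modelled in plain userspace C. Everything here is hypothetical:
struct model_page, set_faultable(), convert_to_shared() and
convert_to_private() are made-up names, and a simple counter stands in for
host userspace mappings and pins.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-page state; this is not a kernel structure. */
struct model_page {
	bool faultable;     /* may host userspace fault this page in? */
	bool guest_mapped;  /* present in guest page tables? */
	unsigned int users; /* host userspace mappings or pins */
	bool is_private;    /* stand-in for the kvm->mem_attr_array entry */
};

static void set_faultable(struct model_page *p, size_t n, bool faultable)
{
	assert(p);
	for (size_t i = 0; i < n; i++)
		p[i].faultable = faultable;
}

/* private -> shared: unmap from guest, allow faults, update attributes. */
static int convert_to_shared(struct model_page *p, size_t n)
{
	for (size_t i = 0; i < n; i++)
		p[i].guest_mapped = false;
	set_faultable(p, n, true);
	for (size_t i = 0; i < n; i++)
		p[i].is_private = false;
	return 0;
}

/* shared -> private: block new faults first, then refuse if still in use. */
static int convert_to_private(struct model_page *p, size_t n)
{
	set_faultable(p, n, false);
	for (size_t i = 0; i < n; i++) {
		if (p[i].users) {
			/* Roll back so the range stays usable as shared. */
			set_faultable(p, n, true);
			return -EINVAL;
		}
	}
	for (size_t i = 0; i < n; i++)
		p[i].is_private = true;
	return 0;
}
```

Making the range unfaultable before checking for users is what closes the
window in which a new fault could slip in between the check and the
attribute update.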

Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Co-developed-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Vishal Annapurve <vannapurve@google.com>

---
 include/linux/kvm_host.h |   1 +
 virt/kvm/guest_memfd.c   | 207 +++++++++++++++++++++++++++++++++++++++
 virt/kvm/kvm_main.c      |  15 +++
 virt/kvm/kvm_mm.h        |   9 ++
 4 files changed, 232 insertions(+)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 79a6b1a63027..10993cd33e34 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2476,6 +2476,7 @@ typedef int (*kvm_gmem_populate_cb)(struct kvm *kvm, gfn_t gfn, kvm_pfn_t pfn,
 
 long kvm_gmem_populate(struct kvm *kvm, gfn_t gfn, void __user *src, long npages,
 		       kvm_gmem_populate_cb post_populate, void *opaque);
+
 #endif
 
 #ifdef CONFIG_HAVE_KVM_ARCH_GMEM_INVALIDATE
diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 1d4dfe0660ad..110c4bbb004b 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -1592,4 +1592,211 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
 	return ret && !i ? ret : i;
 }
 EXPORT_SYMBOL_GPL(kvm_gmem_populate);
+
+/**
+ * Returns true if pages in range [@start, @end) in inode @inode have no
+ * userspace mappings.
+ */
+static bool kvm_gmem_no_mappings_range(struct inode *inode, pgoff_t start, pgoff_t end)
+{
+	pgoff_t index;
+	bool checked_indices_unmapped;
+
+	filemap_invalidate_lock_shared(inode->i_mapping);
+
+	/* TODO: replace iteration with filemap_get_folios() for efficiency. */
+	checked_indices_unmapped = true;
+	for (index = start; checked_indices_unmapped && index < end;) {
+		struct folio *folio;
+
+		/* Don't use kvm_gmem_get_folio to avoid allocating */
+		folio = filemap_lock_folio(inode->i_mapping, index);
+		if (IS_ERR(folio)) {
+			++index;
+			continue;
+		}
+
+		if (folio_mapped(folio) || folio_maybe_dma_pinned(folio))
+			checked_indices_unmapped = false;
+		else
+			index = folio_next_index(folio);
+
+		folio_unlock(folio);
+		folio_put(folio);
+	}
+
+	filemap_invalidate_unlock_shared(inode->i_mapping);
+	return checked_indices_unmapped;
+}
+
+/**
+ * Returns true if pages in range [@start, @end) in memslot @slot have no
+ * userspace mappings.
+ */
+static bool kvm_gmem_no_mappings_slot(struct kvm_memory_slot *slot,
+				      gfn_t start, gfn_t end)
+{
+	pgoff_t offset_start;
+	pgoff_t offset_end;
+	struct file *file;
+	bool ret;
+
+	offset_start = start - slot->base_gfn + slot->gmem.pgoff;
+	offset_end = end - slot->base_gfn + slot->gmem.pgoff;
+
+	file = kvm_gmem_get_file(slot);
+	if (!file)
+		return false;
+
+	ret = kvm_gmem_no_mappings_range(file_inode(file), offset_start, offset_end);
+
+	fput(file);
+
+	return ret;
+}
+
+/**
+ * Returns true if pages in range [@start, @end) have no host userspace mappings.
+ */
+static bool kvm_gmem_no_mappings(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	int i;
+
+	lockdep_assert_held(&kvm->slots_lock);
+
+	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
+		struct kvm_memslot_iter iter;
+		struct kvm_memslots *slots;
+
+		slots = __kvm_memslots(kvm, i);
+		kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
+			struct kvm_memory_slot *slot;
+			gfn_t gfn_start;
+			gfn_t gfn_end;
+
+			slot = iter.slot;
+			gfn_start = max(start, slot->base_gfn);
+			gfn_end = min(end, slot->base_gfn + slot->npages);
+
+			if (iter.slot->flags & KVM_MEM_GUEST_MEMFD &&
+			    !kvm_gmem_no_mappings_slot(iter.slot, gfn_start, gfn_end))
+				return false;
+		}
+	}
+
+	return true;
+}
+
+/**
+ * Set faultability of given range of gfns [@start, @end) in memslot @slot to
+ * @faultable.
+ */
+static void kvm_gmem_set_faultable_slot(struct kvm_memory_slot *slot, gfn_t start,
+					gfn_t end, bool faultable)
+{
+	pgoff_t start_offset;
+	pgoff_t end_offset;
+	struct file *file;
+
+	file = kvm_gmem_get_file(slot);
+	if (!file)
+		return;
+
+	start_offset = start - slot->base_gfn + slot->gmem.pgoff;
+	end_offset = end - slot->base_gfn + slot->gmem.pgoff;
+
+	WARN_ON(kvm_gmem_set_faultable(file_inode(file), start_offset, end_offset,
+				       faultable));
+
+	fput(file);
+}
+
+/**
+ * Set faultability of given range of gfns [@start, @end), across all memslots
+ * in @kvm, to @faultable.
+ */
+static void kvm_gmem_set_faultable_vm(struct kvm *kvm, gfn_t start, gfn_t end,
+				      bool faultable)
+{
+	int i;
+
+	lockdep_assert_held(&kvm->slots_lock);
+
+	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
+		struct kvm_memslot_iter iter;
+		struct kvm_memslots *slots;
+
+		slots = __kvm_memslots(kvm, i);
+		kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
+			struct kvm_memory_slot *slot;
+			gfn_t gfn_start;
+			gfn_t gfn_end;
+
+			slot = iter.slot;
+			gfn_start = max(start, slot->base_gfn);
+			gfn_end = min(end, slot->base_gfn + slot->npages);
+
+			if (iter.slot->flags & KVM_MEM_GUEST_MEMFD) {
+				kvm_gmem_set_faultable_slot(slot, gfn_start,
+							    gfn_end, faultable);
+			}
+		}
+	}
+}
+
+/**
+ * Returns 0 if guest_memfd permits setting range [@start, @end) to PRIVATE, or
+ * -EINVAL otherwise.
+ *
+ * The range is made unfaultable first. If any page in the range is still mapped
+ * or pinned by host userspace, faultability is restored and the request fails.
+ */
+static int kvm_gmem_should_set_attributes_private(struct kvm *kvm, gfn_t start,
+						  gfn_t end)
+{
+	kvm_gmem_set_faultable_vm(kvm, start, end, false);
+
+	if (kvm_gmem_no_mappings(kvm, start, end))
+		return 0;
+
+	kvm_gmem_set_faultable_vm(kvm, start, end, true);
+	return -EINVAL;
+}
+
+/**
+ * Returns 0, since guest_memfd always permits setting [@start, @end) to SHARED.
+ *
+ * Because this allows pages to be faulted in to userspace, this must only be
+ * called after the pages have been invalidated from guest page tables.
+ */
+static int kvm_gmem_should_set_attributes_shared(struct kvm *kvm, gfn_t start,
+						 gfn_t end)
+{
+	/* Always okay to set shared, hence set range faultable here. */
+	kvm_gmem_set_faultable_vm(kvm, start, end, true);
+
+	return 0;
+}
+
+/**
+ * Returns 0 if guest_memfd permits setting attributes @attrs for range [@start,
+ * @end) or negative error otherwise.
+ *
+ * If memory is faulted in to host userspace and a request was made to set the
+ * memory to PRIVATE, the faulted in pages must not be pinned for the request to
+ * be permitted.
+ *
+ * Because this may allow pages to be faulted in to userspace when requested to
+ * set attributes to shared, this must only be called after the pages have been
+ * invalidated from guest page tables.
+ */
+int kvm_gmem_should_set_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
+				   unsigned long attrs)
+{
+	if (attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE)
+		return kvm_gmem_should_set_attributes_private(kvm, start, end);
+	else
+		return kvm_gmem_should_set_attributes_shared(kvm, start, end);
+}
+
 #endif
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 92901656a0d4..1a7bbcc31b7e 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2524,6 +2524,13 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 		.on_lock = kvm_mmu_invalidate_end,
 		.may_block = true,
 	};
+	struct kvm_mmu_notifier_range error_set_range = {
+		.start = start,
+		.end = end,
+		.handler = (void *)kvm_null_fn,
+		.on_lock = kvm_mmu_invalidate_end,
+		.may_block = true,
+	};
 	unsigned long i;
 	void *entry;
 	int r = 0;
@@ -2548,6 +2555,10 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 
 	kvm_handle_gfn_range(kvm, &pre_set_range);
 
+	r = kvm_gmem_should_set_attributes(kvm, start, end, attributes);
+	if (r)
+		goto err;
+
 	for (i = start; i < end; i++) {
 		r = xa_err(xa_store(&kvm->mem_attr_array, i, entry,
 				    GFP_KERNEL_ACCOUNT));
@@ -2560,6 +2571,10 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 	mutex_unlock(&kvm->slots_lock);
 
 	return r;
+
+err:
+	kvm_handle_gfn_range(kvm, &error_set_range);
+	goto out_unlock;
 }
 static int kvm_vm_ioctl_set_mem_attributes(struct kvm *kvm,
 					   struct kvm_memory_attributes *attrs)
diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
index 715f19669d01..d8ff2b380d0e 100644
--- a/virt/kvm/kvm_mm.h
+++ b/virt/kvm/kvm_mm.h
@@ -41,6 +41,8 @@ int kvm_gmem_create(struct kvm *kvm, struct kvm_create_guest_memfd *args);
 int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
 		  unsigned int fd, loff_t offset);
 void kvm_gmem_unbind(struct kvm_memory_slot *slot);
+int kvm_gmem_should_set_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
+				   unsigned long attrs);
 #else
 static inline void kvm_gmem_init(struct module *module)
 {
@@ -59,6 +61,13 @@ static inline void kvm_gmem_unbind(struct kvm_memory_slot *slot)
 {
 	WARN_ON_ONCE(1);
 }
+
+static inline int kvm_gmem_should_set_attributes(struct kvm *kvm, gfn_t start,
+						 gfn_t end, unsigned long attrs)
+{
+	return 0;
+}
+
 #endif /* CONFIG_KVM_PRIVATE_MEM */
 
 #endif /* __KVM_MM_H__ */
-- 
2.46.0.598.g6f2099f65c-goog


* [RFC PATCH 30/39] KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (28 preceding siblings ...)
  2024-09-10 23:44 ` [RFC PATCH 29/39] KVM: Handle conversions in the SET_MEMORY_ATTRIBUTES ioctl Ackerley Tng
@ 2024-09-10 23:44 ` Ackerley Tng
  2024-09-16 20:00   ` Elliot Berman
  2024-09-10 23:44 ` [RFC PATCH 31/39] KVM: selftests: Allow vm_set_memory_attributes to be used without asserting return value of 0 Ackerley Tng
                   ` (11 subsequent siblings)
  41 siblings, 1 reply; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:44 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

Since guest_memfd now supports mmap(), folios have to be prepared
before they are faulted into userspace.

When memory attributes are switched between shared and private, the
up-to-date flags will be cleared.

Use the folio's up-to-date flag to indicate that the folio is ready for
guest usage. The same flag marks readiness whether the folio is prepared
for shared OR private use.
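
A rough userspace model of the intended flag life cycle, for discussion
only. struct model_folio, model_fault() and model_convert() are names made
up for this sketch, and memset() stands in for the kernel's zeroing helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MODEL_FOLIO_SIZE 4096

/* Hypothetical stand-in for a folio; "uptodate" mirrors the prepared flag. */
struct model_folio {
	bool uptodate;
	unsigned char data[MODEL_FOLIO_SIZE];
};

/* Fault path: zero and mark prepared only if not already prepared. */
static void model_fault(struct model_folio *folio)
{
	assert(folio);
	if (!folio->uptodate) {
		memset(folio->data, 0, sizeof(folio->data));
		folio->uptodate = true;
	}
}

/* Conversion path: clearing the flag forces preparation on next use. */
static void model_convert(struct model_folio *folio)
{
	assert(folio);
	folio->uptodate = false;
}
```

One consequence visible in the model: contents written before a conversion
do not survive the next fault, because the folio is zeroed again.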

Signed-off-by: Ackerley Tng <ackerleytng@google.com>

---
 virt/kvm/guest_memfd.c | 131 ++++++++++++++++++++++++++++++++++++++++-
 virt/kvm/kvm_main.c    |   2 +
 virt/kvm/kvm_mm.h      |   7 +++
 3 files changed, 139 insertions(+), 1 deletion(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index 110c4bbb004b..fb292e542381 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -129,13 +129,29 @@ static int __kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slo
 }
 
 /**
- * Use the uptodate flag to indicate that the folio is prepared for KVM's usage.
+ * Use folio's up-to-date flag to indicate that this folio is prepared for usage
+ * by the guest.
+ *
+ * This flag can be used whether the folio is prepared for PRIVATE or SHARED
+ * usage.
  */
 static inline void kvm_gmem_mark_prepared(struct folio *folio)
 {
 	folio_mark_uptodate(folio);
 }
 
+/**
+ * Use folio's up-to-date flag to indicate that this folio is not yet prepared for
+ * usage by the guest.
+ *
+ * This flag can be used whether the folio is prepared for PRIVATE or SHARED
+ * usage.
+ */
+static inline void kvm_gmem_clear_prepared(struct folio *folio)
+{
+	folio_clear_uptodate(folio);
+}
+
 /*
  * Process @folio, which contains @gfn, so that the guest can use it.
  * The folio must be locked and the gfn must be contained in @slot.
@@ -148,6 +164,12 @@ static int kvm_gmem_prepare_folio(struct kvm *kvm, struct kvm_memory_slot *slot,
 	pgoff_t index;
 	int r;
 
+	/*
+	 * Defensively zero folio to avoid leaking kernel memory in
+	 * uninitialized pages. This is important since pages can now be mapped
+	 * into userspace, where hardware (e.g. TDX) won't be clearing those
+	 * pages.
+	 */
 	if (folio_test_hugetlb(folio)) {
 		folio_zero_user(folio, folio->index << PAGE_SHIFT);
 	} else {
@@ -1017,6 +1039,7 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
 {
 	struct inode *inode;
 	struct folio *folio;
+	bool is_prepared;
 
 	inode = file_inode(vmf->vma->vm_file);
 	if (!kvm_gmem_is_faultable(inode, vmf->pgoff))
@@ -1026,6 +1049,31 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
 	if (!folio)
 		return VM_FAULT_SIGBUS;
 
+	is_prepared = folio_test_uptodate(folio);
+	if (!is_prepared) {
+		unsigned long nr_pages;
+		unsigned long i;
+
+		if (folio_test_hugetlb(folio)) {
+			folio_zero_user(folio, folio->index << PAGE_SHIFT);
+		} else {
+			/*
+			 * Defensively zero folio to avoid leaking kernel memory in
+			 * uninitialized pages. This is important since pages can now be
+			 * mapped into userspace, where hardware (e.g. TDX) won't be
+			 * clearing those pages.
+			 *
+			 * Will probably need a version of kvm_gmem_prepare_folio() to
+			 * prepare the page for SHARED use.
+			 */
+			nr_pages = folio_nr_pages(folio);
+			for (i = 0; i < nr_pages; i++)
+				clear_highpage(folio_page(folio, i));
+		}
+
+		kvm_gmem_mark_prepared(folio);
+	}
+
 	vmf->page = folio_file_page(folio, vmf->pgoff);
 	return VM_FAULT_LOCKED;
 }
@@ -1593,6 +1641,87 @@ long kvm_gmem_populate(struct kvm *kvm, gfn_t start_gfn, void __user *src, long
 }
 EXPORT_SYMBOL_GPL(kvm_gmem_populate);
 
+static void kvm_gmem_clear_prepared_range(struct inode *inode, pgoff_t start,
+					  pgoff_t end)
+{
+	pgoff_t index;
+
+	filemap_invalidate_lock_shared(inode->i_mapping);
+
+	/* TODO: replace iteration with filemap_get_folios() for efficiency. */
+	for (index = start; index < end;) {
+		struct folio *folio;
+
+		/* Don't use kvm_gmem_get_folio to avoid allocating */
+		folio = filemap_lock_folio(inode->i_mapping, index);
+		if (IS_ERR(folio)) {
+			++index;
+			continue;
+		}
+
+		kvm_gmem_clear_prepared(folio);
+
+		index = folio_next_index(folio);
+		folio_unlock(folio);
+		folio_put(folio);
+	}
+
+	filemap_invalidate_unlock_shared(inode->i_mapping);
+}
+
+/**
+ * Clear the prepared flag for all folios in gfn range [@start, @end) in memslot
+ * @slot.
+ */
+static void kvm_gmem_clear_prepared_slot(struct kvm_memory_slot *slot, gfn_t start,
+					 gfn_t end)
+{
+	pgoff_t start_offset;
+	pgoff_t end_offset;
+	struct file *file;
+
+	file = kvm_gmem_get_file(slot);
+	if (!file)
+		return;
+
+	start_offset = start - slot->base_gfn + slot->gmem.pgoff;
+	end_offset = end - slot->base_gfn + slot->gmem.pgoff;
+
+	kvm_gmem_clear_prepared_range(file_inode(file), start_offset, end_offset);
+
+	fput(file);
+}
+
+/**
+ * Clear the prepared flag for all folios for any slot in gfn range
+ * [@start, @end) in @kvm.
+ */
+void kvm_gmem_clear_prepared_vm(struct kvm *kvm, gfn_t start, gfn_t end)
+{
+	int i;
+
+	lockdep_assert_held(&kvm->slots_lock);
+
+	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
+		struct kvm_memslot_iter iter;
+		struct kvm_memslots *slots;
+
+		slots = __kvm_memslots(kvm, i);
+		kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
+			struct kvm_memory_slot *slot;
+			gfn_t gfn_start;
+			gfn_t gfn_end;
+
+			slot = iter.slot;
+			gfn_start = max(start, slot->base_gfn);
+			gfn_end = min(end, slot->base_gfn + slot->npages);
+
+			if (iter.slot->flags & KVM_MEM_GUEST_MEMFD)
+				kvm_gmem_clear_prepared_slot(iter.slot, gfn_start, gfn_end);
+		}
+	}
+}
+
 /**
  * Returns true if pages in range [@start, @end) in inode @inode have no
  * userspace mappings.
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 1a7bbcc31b7e..255d27df7f5c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2565,6 +2565,8 @@ static int kvm_vm_set_mem_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 		KVM_BUG_ON(r, kvm);
 	}
 
+	kvm_gmem_clear_prepared_vm(kvm, start, end);
+
 	kvm_handle_gfn_range(kvm, &post_set_range);
 
 out_unlock:
diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
index d8ff2b380d0e..25fd0d9f66cc 100644
--- a/virt/kvm/kvm_mm.h
+++ b/virt/kvm/kvm_mm.h
@@ -43,6 +43,7 @@ int kvm_gmem_bind(struct kvm *kvm, struct kvm_memory_slot *slot,
 void kvm_gmem_unbind(struct kvm_memory_slot *slot);
 int kvm_gmem_should_set_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 				   unsigned long attrs);
+void kvm_gmem_clear_prepared_vm(struct kvm *kvm, gfn_t start, gfn_t end);
 #else
 static inline void kvm_gmem_init(struct module *module)
 {
@@ -68,6 +69,12 @@ static inline int kvm_gmem_should_set_attributes(struct kvm *kvm, gfn_t start,
 	return 0;
 }
 
+static inline void kvm_gmem_clear_prepared_vm(struct kvm *kvm,
+					      gfn_t start, gfn_t end)
+{
+	WARN_ON_ONCE(1);
+}
+
 #endif /* CONFIG_KVM_PRIVATE_MEM */
 
 #endif /* __KVM_MM_H__ */
-- 
2.46.0.598.g6f2099f65c-goog


* Re: [RFC PATCH 30/39] KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
  2024-09-10 23:44 ` [RFC PATCH 30/39] KVM: guest_memfd: Handle folio preparation for guest_memfd mmap Ackerley Tng
@ 2024-09-16 20:00   ` Elliot Berman
  2024-10-03 21:32     ` Ackerley Tng
  0 siblings, 1 reply; 130+ messages in thread
From: Elliot Berman @ 2024-09-16 20:00 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: tabba, roypat, jgg, peterx, david, rientjes, fvdl, jthoughton,
	seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao, isaku.yamahata,
	muchun.song, mike.kravetz, erdemaktas, vannapurve, qperret,
	jhubbard, willy, shuah, brauner, bfoster, kent.overstreet, pvorel,
	rppt, richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

On Tue, Sep 10, 2024 at 11:44:01PM +0000, Ackerley Tng wrote:
> Since guest_memfd now supports mmap(), folios have to be prepared
> before they are faulted into userspace.
> 
> When memory attributes are switched between shared and private, the
> up-to-date flags will be cleared.
> 
> Use the folio's up-to-date flag to indicate being ready for the guest
> usage and can be used to mark whether the folio is ready for shared OR
> private use.

Clearing the up-to-date flag also means that the page gets zero'd out
whenever it transitions between shared and private (either direction).
pKVM (Android) hypervisor policy can allow in-place conversion between
shared/private.

I believe the important thing is that sev_gmem_prepare() needs to be
called prior to giving page to guest. In my series, I had made a
->prepare_inaccessible() callback where KVM would only do this part.
When transitioning to inaccessible, only that callback would be made,
besides the bookkeeping. The folio zeroing happens once when allocating
the folio if the folio is initially accessible (faultable).

From an x86 CoCo perspective, I think it also makes sense to not zero
the folio when changing faultability from private to shared:
 - If guest is sharing some data with host, you've wiped the data and
   guest has to copy again.
 - Or, if SEV/TDX enforces that page is zero'd between transitions,
   Linux has duplicated the work that trusted entity has already done.

Fuad and I can help add some details for the conversion. Hopefully we
can figure out some of the plan at plumbers this week.

Thanks,
Elliot

>
> <snip>

* Re: [RFC PATCH 30/39] KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
  2024-09-16 20:00   ` Elliot Berman
@ 2024-10-03 21:32     ` Ackerley Tng
  2024-10-03 23:43       ` Ackerley Tng
  2024-10-07 15:56       ` Patrick Roy
  0 siblings, 2 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-10-03 21:32 UTC (permalink / raw)
  To: Elliot Berman
  Cc: tabba, roypat, jgg, peterx, david, rientjes, fvdl, jthoughton,
	seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao, isaku.yamahata,
	muchun.song, mike.kravetz, erdemaktas, vannapurve, qperret,
	jhubbard, willy, shuah, brauner, bfoster, kent.overstreet, pvorel,
	rppt, richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

Elliot Berman <quic_eberman@quicinc.com> writes:

> On Tue, Sep 10, 2024 at 11:44:01PM +0000, Ackerley Tng wrote:
>> Since guest_memfd now supports mmap(), folios have to be prepared
>> before they are faulted into userspace.
>>
>> When memory attributes are switched between shared and private, the
>> up-to-date flags will be cleared.
>>
>> Use the folio's up-to-date flag to indicate being ready for the guest
>> usage and can be used to mark whether the folio is ready for shared OR
>> private use.
>
> Clearing the up-to-date flag also means that the page gets zero'd out
> whenever it transitions between shared and private (either direction).
> pKVM (Android) hypervisor policy can allow in-place conversion between
> shared/private.
>
> I believe the important thing is that sev_gmem_prepare() needs to be
> called prior to giving page to guest. In my series, I had made a
> ->prepare_inaccessible() callback where KVM would only do this part.
> When transitioning to inaccessible, only that callback would be made,
> besides the bookkeeping. The folio zeroing happens once when allocating
> the folio if the folio is initially accessible (faultable).
>
> From x86 CoCo perspective, I think it also makes sense to not zero
> the folio when changing faultability from private to shared:
>  - If guest is sharing some data with host, you've wiped the data and
>    guest has to copy again.
>  - Or, if SEV/TDX enforces that page is zero'd between transitions,
>    Linux has duplicated the work that trusted entity has already done.
>
> Fuad and I can help add some details for the conversion. Hopefully we
> can figure out some of the plan at plumbers this week.

Zeroing the page prevents leaking host data (see function docstring for
kvm_gmem_prepare_folio() introduced in [1]), so we definitely don't want
to introduce a kernel data leak bug here.

In-place conversion does require preservation of data, so for
conversions, shall we zero depending on VM type?

+ Gunyah: don't zero since ->prepare_inaccessible() is a no-op
+ pKVM: don't zero
+ TDX: don't zero
+ SEV: AMD Architecture Programmers Manual 7.10.6 says there is no
  automatic encryption and implies no zeroing, hence perform zeroing
+ KVM_X86_SW_PROTECTED_VM: Doesn't have a formal definition so I guess
  we could require zeroing on transition?

This way, the uptodate flag means that it has been prepared (as in
sev_gmem_prepare()), and zeroed if required by VM type.
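
To make the proposal concrete, here is a small userspace sketch of such a
policy. The enum, model_zero_on_conversion() and
model_prepare_for_conversion() are hypothetical names; they only encode the
list above and are not existing KVM code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical VM types, one per bullet in the list above. */
enum model_vm_type {
	MODEL_VM_GUNYAH,
	MODEL_VM_PKVM,
	MODEL_VM_TDX,
	MODEL_VM_SEV,
	MODEL_VM_SW_PROTECTED,
};

/* Should the kernel zero the page when it is converted? */
static bool model_zero_on_conversion(enum model_vm_type type)
{
	switch (type) {
	case MODEL_VM_SEV:
	case MODEL_VM_SW_PROTECTED:
		return true;
	case MODEL_VM_GUNYAH:
	case MODEL_VM_PKVM:
	case MODEL_VM_TDX:
		return false;
	}
	/* Unknown type: zero, to err on the side of not leaking data. */
	return true;
}

/* "Prepared" then means: zeroed if, and only if, the VM type requires it. */
static void model_prepare_for_conversion(enum model_vm_type type,
					 unsigned char *buf, size_t len)
{
	assert(buf);
	if (model_zero_on_conversion(type))
		memset(buf, 0, len);
}
```

Defaulting to zeroing for an unrecognised type is a choice made for this
sketch, picked so that a new VM type cannot silently skip the zeroing.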

Regarding flushing the dcache/tlb in your other question [2], if we
don't use folio_zero_user(), can we rely on unmapping within core-mm
to flush after shared use, and unmapping within KVM to flush after
private use?

Or should flush_dcache_folio() be explicitly called in kvm_gmem_fault()?

clear_highpage(), used in the non-hugetlb (original) path, doesn't flush
the dcache. Was that intended?

> Thanks,
> Elliot
>
>>
>> <snip>

[1] https://lore.kernel.org/all/20240726185157.72821-8-pbonzini@redhat.com/
[2] https://lore.kernel.org/all/diqz34ldszp3.fsf@ackerleytng-ctop.c.googlers.com/

* Re: [RFC PATCH 30/39] KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
  2024-10-03 21:32     ` Ackerley Tng
@ 2024-10-03 23:43       ` Ackerley Tng
  2024-10-08 19:30         ` Sean Christopherson
  2024-10-07 15:56       ` Patrick Roy
  1 sibling, 1 reply; 130+ messages in thread
From: Ackerley Tng @ 2024-10-03 23:43 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: quic_eberman, tabba, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest, linux-fsdevel

Ackerley Tng <ackerleytng@google.com> writes:

> Elliot Berman <quic_eberman@quicinc.com> writes:
>
>> On Tue, Sep 10, 2024 at 11:44:01PM +0000, Ackerley Tng wrote:
>>> Since guest_memfd now supports mmap(), folios have to be prepared
>>> before they are faulted into userspace.
>>>
>>> When memory attributes are switched between shared and private, the
>>> up-to-date flags will be cleared.
>>>
>>> Use the folio's up-to-date flag to indicate being ready for the guest
>>> usage and can be used to mark whether the folio is ready for shared OR
>>> private use.
>>
>> Clearing the up-to-date flag also means that the page gets zero'd out
>> whenever it transitions between shared and private (either direction).
>> pKVM (Android) hypervisor policy can allow in-place conversion between
>> shared/private.
>>
>> I believe the important thing is that sev_gmem_prepare() needs to be
>> called prior to giving page to guest. In my series, I had made a
>> ->prepare_inaccessible() callback where KVM would only do this part.
>> When transitioning to inaccessible, only that callback would be made,
>> besides the bookkeeping. The folio zeroing happens once when allocating
>> the folio if the folio is initially accessible (faultable).
>>
>> From x86 CoCo perspective, I think it also makes sense to not zero
>> the folio when changing faultability from private to shared:
>>  - If guest is sharing some data with host, you've wiped the data and
>>    guest has to copy again.
>>  - Or, if SEV/TDX enforces that page is zero'd between transitions,
>>    Linux has duplicated the work that trusted entity has already done.
>>
>> Fuad and I can help add some details for the conversion. Hopefully we
>> can figure out some of the plan at plumbers this week.
>
> Zeroing the page prevents leaking host data (see function docstring for
> kvm_gmem_prepare_folio() introduced in [1]), so we definitely don't want
> to introduce a kernel data leak bug here.

Actually it seems like filemap_grab_folio() already gets a zeroed page.

filemap_grab_folio() eventually calls __alloc_pages_noprof()
-> get_page_from_freelist()
   -> prep_new_page()
      -> post_alloc_hook()

and post_alloc_hook() calls kernel_init_pages(), which zeroes the page,
depending on kernel config.

Paolo, was calling clear_highpage() in kvm_gmem_prepare_folio() zeroing an
already empty page returned from filemap_grab_folio()?

> In-place conversion does require preservation of data, so for
> conversions, shall we zero depending on VM type?
>
> + Gunyah: don't zero since ->prepare_inaccessible() is a no-op
> + pKVM: don't zero
> + TDX: don't zero
> + SEV: AMD Architecture Programmers Manual 7.10.6 says there is no
>   automatic encryption and implies no zeroing, hence perform zeroing
> + KVM_X86_SW_PROTECTED_VM: Doesn't have a formal definition so I guess
>   we could require zeroing on transition?
>
> This way, the uptodate flag means that it has been prepared (as in
> sev_gmem_prepare()), and zeroed if required by VM type.
>
> Regarding flushing the dcache/tlb in your other question [2], if we
> don't use folio_zero_user(), can we rely on unmapping within core-mm
> to flush after shared use, and unmapping within KVM to flush after
> private use?
>
> Or should flush_dcache_folio() be explicitly called on kvm_gmem_fault()?
>
> clear_highpage(), used in the non-hugetlb (original) path, doesn't flush
> the dcache. Was that intended?
>
>> Thanks,
>> Elliot
>>
>>>
>>> <snip>
>
> [1] https://lore.kernel.org/all/20240726185157.72821-8-pbonzini@redhat.com/
> [2] https://lore.kernel.org/all/diqz34ldszp3.fsf@ackerleytng-ctop.c.googlers.com/

* Re: [RFC PATCH 30/39] KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
  2024-10-03 23:43       ` Ackerley Tng
@ 2024-10-08 19:30         ` Sean Christopherson
  0 siblings, 0 replies; 130+ messages in thread
From: Sean Christopherson @ 2024-10-08 19:30 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: quic_eberman, tabba, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest, linux-fsdevel

On Thu, Oct 03, 2024, Ackerley Tng wrote:
> Ackerley Tng <ackerleytng@google.com> writes:
> 
> > Elliot Berman <quic_eberman@quicinc.com> writes:
> >> From x86 CoCo perspective, I think it also makes sense to not zero
> >> the folio when changing faultability from private to shared:
> >>  - If guest is sharing some data with host, you've wiped the data and
> >>    guest has to copy again.
> >>  - Or, if SEV/TDX enforces that page is zero'd between transitions,
> >>    Linux has duplicated the work that trusted entity has already done.
> >>
> >> Fuad and I can help add some details for the conversion. Hopefully we
> >> can figure out some of the plan at plumbers this week.
> >
> > Zeroing the page prevents leaking host data (see function docstring for
> > kvm_gmem_prepare_folio() introduced in [1]), so we definitely don't want
> > to introduce a kernel data leak bug here.
> 
> Actually it seems like filemap_grab_folio() already gets a zeroed page.
> 
> filemap_grab_folio() eventually calls __alloc_pages_noprof()
> -> get_page_from_freelist()
>    -> prep_new_page()
>       -> post_alloc_hook()
> 
> and post_alloc_hook() calls kernel_init_pages(), which zeroes the page,
> depending on kernel config.
> 
> Paolo, was calling clear_highpage() in kvm_gmem_prepare_folio() zeroing an
> already empty page returned from filemap_grab_folio()?

Yes and no.  CONFIG_INIT_ON_ALLOC_DEFAULT_ON and init_on_alloc are very much
hardening features, not functional behavior that other code _needs_ to be aware
of.  E.g. enabling init-on-alloc comes with a measurable performance cost.

Ignoring hardening, the guest_memfd mapping specifically sets the gfp_mask to
GFP_HIGHUSER, i.e. doesn't set __GFP_ZERO.

That said, I wouldn't be opposed to skipping the clear_highpage() call when
want_init_on_alloc() is true.

Also, the intended behavior (or at least, what I intended) of kvm_gmem_prepare_folio()
was that it would do clear_highpage() if and only if a trusted entity does NOT zero
the page.  Factoring that in is a bit harder, as it probably requires another
arch hook (or providing an out-param from kvm_arch_gmem_prepare()).  I.e. the
want_init_on_alloc() case isn't the only time KVM could shave cycles by not
redundantly zeroing memory.
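
As a rough illustration of the conditions described above, here is a small
userspace model of the decision (not kernel code). The struct, its fields and
the function name are hypothetical; in particular, trusted_entity_zeroes stands
in for the extra arch hook or out-param mentioned above, which does not exist
today.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical inputs to the decision; none of these names exist in KVM. */
struct gmem_zero_inputs {
	bool init_on_alloc;         /* models want_init_on_alloc() returning true */
	bool gfp_zero;              /* models __GFP_ZERO being set in the gfp mask */
	bool trusted_entity_zeroes; /* models an arch hook reporting "already zeroed" */
};

/* Returns true when KVM itself would still have to clear the page. */
bool gmem_needs_clear(const struct gmem_zero_inputs *in)
{
	assert(in);

	/* The page allocator already handed back zeroed memory. */
	if (in->init_on_alloc || in->gfp_zero)
		return false;

	/* A trusted entity zeroes the page, so clearing again is redundant. */
	if (in->trusted_entity_zeroes)
		return false;

	return true;
}
```

With GFP_HIGHUSER and hardening disabled, only the third input can avoid the
clear, which is why the extra hook (or out-param) would be needed.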

* Re: [RFC PATCH 30/39] KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
  2024-10-03 21:32     ` Ackerley Tng
  2024-10-03 23:43       ` Ackerley Tng
@ 2024-10-07 15:56       ` Patrick Roy
  2024-10-08 18:07         ` Ackerley Tng
  1 sibling, 1 reply; 130+ messages in thread
From: Patrick Roy @ 2024-10-07 15:56 UTC (permalink / raw)
  To: Ackerley Tng, Elliot Berman
  Cc: tabba, jgg, peterx, david, rientjes, fvdl, jthoughton, seanjc,
	pbonzini, zhiquan1.li, fan.du, jun.miao, isaku.yamahata,
	muchun.song, mike.kravetz, erdemaktas, vannapurve, qperret,
	jhubbard, willy, shuah, brauner, bfoster, kent.overstreet, pvorel,
	rppt, richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel, James Gowans,
	Kalyazin, Nikita, Manwaring, Derek

Hi Ackerley,

On Thu, 2024-10-03 at 22:32 +0100, Ackerley Tng wrote:
> Elliot Berman <quic_eberman@quicinc.com> writes:
> 
>> On Tue, Sep 10, 2024 at 11:44:01PM +0000, Ackerley Tng wrote:
>>> Since guest_memfd now supports mmap(), folios have to be prepared
>>> before they are faulted into userspace.
>>>
>>> When memory attributes are switched between shared and private, the
>>> up-to-date flags will be cleared.
>>>
>>> Use the folio's up-to-date flag to indicate being ready for the guest
>>> usage and can be used to mark whether the folio is ready for shared OR
>>> private use.
>>
>> Clearing the up-to-date flag also means that the page gets zero'd out
>> whenever it transitions between shared and private (either direction).
>> pKVM (Android) hypervisor policy can allow in-place conversion between
>> shared/private.
>>
>> I believe the important thing is that sev_gmem_prepare() needs to be
>> called prior to giving page to guest. In my series, I had made a
>> ->prepare_inaccessible() callback where KVM would only do this part.
>> When transitioning to inaccessible, only that callback would be made,
>> besides the bookkeeping. The folio zeroing happens once when allocating
>> the folio if the folio is initially accessible (faultable).
>>
>> From x86 CoCo perspective, I think it also makes sense to not zero
>> the folio when changing faultability from private to shared:
>>  - If guest is sharing some data with host, you've wiped the data and
>>    guest has to copy again.
>>  - Or, if SEV/TDX enforces that page is zero'd between transitions,
>>    Linux has duplicated the work that trusted entity has already done.
>>
>> Fuad and I can help add some details for the conversion. Hopefully we
>> can figure out some of the plan at plumbers this week.
> 
> Zeroing the page prevents leaking host data (see function docstring for
> kvm_gmem_prepare_folio() introduced in [1]), so we definitely don't want
> to introduce a kernel data leak bug here.
> 
> In-place conversion does require preservation of data, so for
> conversions, shall we zero depending on VM type?
> 
> + Gunyah: don't zero since ->prepare_inaccessible() is a no-op
> + pKVM: don't zero
> + TDX: don't zero
> + SEV: AMD Architecture Programmers Manual 7.10.6 says there is no
>   automatic encryption and implies no zeroing, hence perform zeroing
> + KVM_X86_SW_PROTECTED_VM: Doesn't have a formal definition so I guess
>   we could require zeroing on transition?

Maybe for KVM_X86_SW_PROTECTED_VM we could make zero-ing configurable
via some CREATE_GUEST_MEMFD flag, instead of forcing one specific
behavior. 

For the "non-CoCo with direct map entries removed" VMs that we at AWS
are going for, we'd like a VM type with host-controlled in-place
conversions which doesn't zero on transitions, so if
KVM_X86_SW_PROTECTED_VM ends up zeroing, we'd need to add another new VM
type for that.

Somewhat related sidenote: For VMs that allow inplace conversions and do
not zero, we do not need to zap the stage-2 mappings on memory attribute
changes, right?

> This way, the uptodate flag means that it has been prepared (as in
> sev_gmem_prepare()), and zeroed if required by VM type.
> 
> Regarding flushing the dcache/tlb in your other question [2], if we
> don't use folio_zero_user(), can we rely on unmapping within core-mm
> to flush after shared use, and unmapping within KVM to flush after
> private use?
> 
> Or should flush_dcache_folio() be explicitly called on kvm_gmem_fault()?
> 
> clear_highpage(), used in the non-hugetlb (original) path, doesn't flush
> the dcache. Was that intended?
> 
>> Thanks,
>> Elliot
>>
>>>
>>> <snip>
> 
> [1] https://lore.kernel.org/all/20240726185157.72821-8-pbonzini@redhat.com/
> [2] https://lore.kernel.org/all/diqz34ldszp3.fsf@ackerleytng-ctop.c.googlers.com/

Best, 
Patrick

* Re: [RFC PATCH 30/39] KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
  2024-10-07 15:56       ` Patrick Roy
@ 2024-10-08 18:07         ` Ackerley Tng
  2024-10-08 19:56           ` Sean Christopherson
  0 siblings, 1 reply; 130+ messages in thread
From: Ackerley Tng @ 2024-10-08 18:07 UTC (permalink / raw)
  To: Patrick Roy
  Cc: quic_eberman, tabba, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest, linux-fsdevel,
	jgowans, kalyazin, derekmn

Patrick Roy <roypat@amazon.co.uk> writes:

> Hi Ackerley,
>
> On Thu, 2024-10-03 at 22:32 +0100, Ackerley Tng wrote:
>> Elliot Berman <quic_eberman@quicinc.com> writes:
>>
>>> On Tue, Sep 10, 2024 at 11:44:01PM +0000, Ackerley Tng wrote:
>>>> Since guest_memfd now supports mmap(), folios have to be prepared
>>>> before they are faulted into userspace.
>>>>
>>>> When memory attributes are switched between shared and private, the
>>>> up-to-date flags will be cleared.
>>>>
>>>> Use the folio's up-to-date flag to indicate being ready for the guest
>>>> usage and can be used to mark whether the folio is ready for shared OR
>>>> private use.
>>>
>>> Clearing the up-to-date flag also means that the page gets zero'd out
>>> whenever it transitions between shared and private (either direction).
>>> pKVM (Android) hypervisor policy can allow in-place conversion between
>>> shared/private.
>>>
>>> I believe the important thing is that sev_gmem_prepare() needs to be
>>> called prior to giving page to guest. In my series, I had made a
>>> ->prepare_inaccessible() callback where KVM would only do this part.
>>> When transitioning to inaccessible, only that callback would be made,
>>> besides the bookkeeping. The folio zeroing happens once when allocating
>>> the folio if the folio is initially accessible (faultable).
>>>
>>> From x86 CoCo perspective, I think it also makes sense to not zero
>>> the folio when changing faultability from private to shared:
>>>  - If guest is sharing some data with host, you've wiped the data and
>>>    guest has to copy again.
>>>  - Or, if SEV/TDX enforces that page is zero'd between transitions,
>>>    Linux has duplicated the work that trusted entity has already done.
>>>
>>> Fuad and I can help add some details for the conversion. Hopefully we
>>> can figure out some of the plan at plumbers this week.
>>
>> Zeroing the page prevents leaking host data (see function docstring for
>> kvm_gmem_prepare_folio() introduced in [1]), so we definitely don't want
>> to introduce a kernel data leak bug here.
>>
>> In-place conversion does require preservation of data, so for
>> conversions, shall we zero depending on VM type?
>>
>> + Gunyah: don't zero since ->prepare_inaccessible() is a no-op
>> + pKVM: don't zero
>> + TDX: don't zero
>> + SEV: AMD Architecture Programmers Manual 7.10.6 says there is no
>>   automatic encryption and implies no zeroing, hence perform zeroing
>> + KVM_X86_SW_PROTECTED_VM: Doesn't have a formal definition so I guess
>>   we could require zeroing on transition?
>
> Maybe for KVM_X86_SW_PROTECTED_VM we could make zero-ing configurable
> via some CREATE_GUEST_MEMFD flag, instead of forcing one specific
> behavior.

Sounds good to me, I can set up a flag in the next revision.
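
A rough sketch of the per-VM-type policy listed above, with the proposed flag
folded in. The enum, the flag name GMEM_FLAG_ZERO_ON_CONVERT and the function
are all hypothetical and only model the policy discussed in this thread; they
are not existing KVM UAPI.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical VM kinds, mirroring the list in this thread. */
enum vm_kind { VM_GUNYAH, VM_PKVM, VM_TDX, VM_SEV, VM_SW_PROTECTED };

/* Hypothetical guest_memfd creation flag; no such flag exists today. */
#define GMEM_FLAG_ZERO_ON_CONVERT (1u << 0)

/* Should the folio be zeroed on a shared<->private conversion? */
bool gmem_zero_on_conversion(enum vm_kind kind, unsigned int gmem_flags)
{
	switch (kind) {
	case VM_GUNYAH:
	case VM_PKVM:
	case VM_TDX:
		return false;	/* in-place conversion must preserve data */
	case VM_SEV:
		return true;	/* no zeroing implied by hardware, so zero */
	case VM_SW_PROTECTED:
		/* Left to userspace via the (hypothetical) creation flag. */
		return (gmem_flags & GMEM_FLAG_ZERO_ON_CONVERT) != 0;
	}

	assert(0 && "unknown VM kind");
	return true;	/* conservative fallback: zero rather than leak */
}
```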

> For the "non-CoCo with direct map entries removed" VMs that we at AWS
> are going for, we'd like a VM type with host-controlled in-place
> conversions which doesn't zero on transitions, so if
> KVM_X86_SW_PROTECTED_VM ends up zeroing, we'd need to add another new VM
> type for that.
>
> Somewhat related sidenote: For VMs that allow inplace conversions and do
> not zero, we do not need to zap the stage-2 mappings on memory attribute
> changes, right?
>

Here are some reasons for zapping I can think of:

1. When private pages are split/merged, zapping the stage-2 mappings on
   memory attribute changes allows the private pages to be re-faulted by
   KVM at smaller/larger granularity.

2. The rationale described here
   https://elixir.bootlin.com/linux/v6.11.2/source/arch/x86/kvm/mmu/mmu.c#L7482
   ("Zapping SPTEs in this case ensures KVM will reassess whether or not
   a hugepage can be used for affected ranges.") probably refers to the
   existing implementation, when a different set of physical pages is
   used to back shared and private memory. When the same set of physical
   pages is used for both shared and private memory, then IIUC this
   rationale does not apply.

3. There's another rationale for zapping
   https://elixir.bootlin.com/linux/v6.11.2/source/virt/kvm/kvm_main.c#L2494
   to do with read vs write mappings here. I don't fully understand
   this, does this rationale still apply?

4. Is zapping required if the pages get removed/added to kernel direct
   map?

>> This way, the uptodate flag means that it has been prepared (as in
>> sev_gmem_prepare()), and zeroed if required by VM type.
>>
>> Regarding flushing the dcache/tlb in your other question [2], if we
>> don't use folio_zero_user(), can we rely on unmapping within core-mm
>> to flush after shared use, and unmapping within KVM to flush after
>> private use?
>>
>> Or should flush_dcache_folio() be explicitly called on kvm_gmem_fault()?
>>
>> clear_highpage(), used in the non-hugetlb (original) path, doesn't flush
>> the dcache. Was that intended?
>>
>>> Thanks,
>>> Elliot
>>>
>>>>
>>>> <snip>
>>
>> [1] https://lore.kernel.org/all/20240726185157.72821-8-pbonzini@redhat.com/
>> [2] https://lore.kernel.org/all/diqz34ldszp3.fsf@ackerleytng-ctop.c.googlers.com/
>
> Best,
> Patrick

* Re: [RFC PATCH 30/39] KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
  2024-10-08 18:07         ` Ackerley Tng
@ 2024-10-08 19:56           ` Sean Christopherson
  2024-10-09  3:51             ` Manwaring, Derek
  2024-10-10 16:21             ` Patrick Roy
  0 siblings, 2 replies; 130+ messages in thread
From: Sean Christopherson @ 2024-10-08 19:56 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Patrick Roy, quic_eberman, tabba, jgg, peterx, david, rientjes,
	fvdl, jthoughton, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest, linux-fsdevel,
	jgowans, kalyazin, derekmn

On Tue, Oct 08, 2024, Ackerley Tng wrote:
> Patrick Roy <roypat@amazon.co.uk> writes:
> > For the "non-CoCo with direct map entries removed" VMs that we at AWS
> > are going for, we'd like a VM type with host-controlled in-place
> > conversions which doesn't zero on transitions,

Hmm, your use case shouldn't need conversions _for KVM_, as there's no need for
KVM to care if userspace or the guest _wants_ a page to be shared vs. private.
Userspace is fully trusted to manage things; KVM simply reacts to the current
state of things.

And more importantly, whether or not the direct map is zapped needs to be a
property of the guest_memfd inode, i.e. can't be associated with a struct kvm.
I forget who got volunteered to do the work, but we're going to need similar
functionality for tracking the state of individual pages in a huge folio, as
folio_mark_uptodate() is too coarse-grained.  I.e. at some point, I expect that
guest_memfd will make it easy-ish to determine whether or not the direct map has
been obliterated.
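
To illustrate why a single per-folio flag is too coarse, here is a userspace
sketch of per-page state tracking inside one huge folio. Everything in it is
hypothetical (the names, the bitmap layout, and the 2M folio size picked to
keep the example small); it is not a proposal for how guest_memfd should
actually store this state.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SUBPAGES_PER_FOLIO 512	/* a 2M folio of 4K pages, for brevity */
#define BITS_PER_WORD (sizeof(unsigned long) * CHAR_BIT)
#define STATE_WORDS ((SUBPAGES_PER_FOLIO + BITS_PER_WORD - 1) / BITS_PER_WORD)

/* One bit per 4K page: set means "prepared" (or "direct map removed"). */
struct folio_page_state {
	unsigned long bits[STATE_WORDS];
};

void page_state_init(struct folio_page_state *s)
{
	memset(s, 0, sizeof(*s));
}

void page_state_mark(struct folio_page_state *s, size_t idx)
{
	assert(idx < SUBPAGES_PER_FOLIO);
	s->bits[idx / BITS_PER_WORD] |= 1UL << (idx % BITS_PER_WORD);
}

void page_state_clear(struct folio_page_state *s, size_t idx)
{
	assert(idx < SUBPAGES_PER_FOLIO);
	s->bits[idx / BITS_PER_WORD] &= ~(1UL << (idx % BITS_PER_WORD));
}

bool page_state_test(const struct folio_page_state *s, size_t idx)
{
	assert(idx < SUBPAGES_PER_FOLIO);
	return (s->bits[idx / BITS_PER_WORD] >> (idx % BITS_PER_WORD)) & 1UL;
}

/* The only thing a folio-wide flag can express: "every page is marked". */
bool page_state_all(const struct folio_page_state *s)
{
	for (size_t i = 0; i < SUBPAGES_PER_FOLIO; i++)
		if (!page_state_test(s, i))
			return false;
	return true;
}
```

A folio-wide flag corresponds to page_state_all(); a partially converted folio
is exactly the case it cannot describe.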

The shared vs. private attributes tracking in KVM is still needed (I think), as
it communicates what userspace _wants_, whereas the guest_memfd machinery will
track what the state _is_.

> > so if KVM_X86_SW_PROTECTED_VM ends up zeroing, we'd need to add another new
> > VM type for that.

Maybe we should sneak in a s/KVM_X86_SW_PROTECTED_VM/KVM_X86_SW_HARDENED_VM rename?
The original thought behind "software protected VM" was to do a slow build of
something akin to pKVM, but realistically I don't think that idea is going anywhere.

Alternatively, depending on how KVM accesses guest memory that's been removed from
the direct map, another solution would be to allow "regular" VMs to bind memslots
to guest_memfd, i.e. if the non-CoCo use case needs/wants to bind all memory to
guest_memfd, not just "private" mappings.

That's probably the biggest topic of discussion: how do we want to allow mapping
guest_memfd into the guest, without direct map entries, but while still allowing
KVM to access guest memory as needed, e.g. for shadow paging.  One approach is
your RFC, where KVM maps guest_memfd pfns on-demand.

Another (slightly crazy) approach would be to use protection keys to provide the
security properties that you want, while giving KVM (and userspace) a quick-and-easy
override to access guest memory.

 1. mmap() guest_memfd into userspace with RW protections
 2. Configure PKRU to make guest_memfd memory inaccessible by default
 3. Swizzle PKRU on-demand when intentionally accessing guest memory

It's essentially the same idea as SMAP+STAC/CLAC, just applied to guest memory
instead of to userspace memory.

The benefit of the PKRU approach is that there are no PTE modifications, and thus
no TLB flushes, and only the CPU that is accessing guest memory gains temporary
access.  The big downside is that it would be limited to modern hardware, but
that might be acceptable, especially if it simplifies KVM's implementation.
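
For reference, the "swizzle" in step 3 is just a two-bit update in a per-CPU
register. The sketch below models only the PKRU bit layout (bit 2*key is
access-disable, bit 2*key+1 is write-disable) on a plain integer; the helper
names are made up for illustration, and on real hardware the value would be
written with WRPKRU (or via glibc's pkey_set() in userspace). Whether
guest_memfd mappings can be tagged with a key this way is the open question
here, not something this sketch answers.

```c
#include <assert.h>
#include <stdint.h>

#define PKRU_AD      0x1u	/* access-disable bit for a key */
#define PKRU_WD      0x2u	/* write-disable bit for a key */
#define PKRU_NR_KEYS 16u	/* 16 keys, 2 bits each, in a 32-bit register */

/* Return a copy of @pkru with @key's two permission bits set to @rights. */
uint32_t pkru_set_rights(uint32_t pkru, unsigned int key, uint32_t rights)
{
	assert(key < PKRU_NR_KEYS);
	assert((rights & ~(PKRU_AD | PKRU_WD)) == 0);

	unsigned int shift = 2 * key;

	pkru &= ~(0x3u << shift);	/* drop the old rights for this key */
	pkru |= rights << shift;	/* install the new ones */
	return pkru;
}

/* Extract @key's two permission bits from @pkru. */
uint32_t pkru_get_rights(uint32_t pkru, unsigned int key)
{
	assert(key < PKRU_NR_KEYS);
	return (pkru >> (2 * key)) & 0x3u;
}
```

Step 2 then amounts to pkru_set_rights(pkru, key, PKRU_AD), and step 3 to
setting the rights to 0 around the intentional access and restoring PKRU_AD
afterwards. No page table entry changes in either direction.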

> > Somewhat related sidenote: For VMs that allow inplace conversions and do
> > not zero, we do not need to zap the stage-2 mappings on memory attribute
> > changes, right?

See above.  I don't think conversions by toggling the shared/private flag in
KVM's memory attributes is the right fit for your use case.

* Re: [RFC PATCH 30/39] KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
  2024-10-08 19:56           ` Sean Christopherson
@ 2024-10-09  3:51             ` Manwaring, Derek
  2024-10-09 13:52               ` Andrew Cooper
  2024-10-10 16:21             ` Patrick Roy
  1 sibling, 1 reply; 130+ messages in thread
From: Manwaring, Derek @ 2024-10-09  3:51 UTC (permalink / raw)
  To: seanjc, andrew.cooper3, dave.hansen
  Cc: ackerleytng, ajones, anup, bfoster, brauner, david, derekmn,
	erdemaktas, fan.du, fvdl, haibo1.xu, isaku.yamahata, jgg, jgowans,
	jhubbard, jthoughton, jun.miao, kalyazin, kent.overstreet, kvm,
	linux-fsdevel, linux-kernel, linux-kselftest, linux-mm,
	maciej.wieczor-retman, mike.kravetz, muchun.song, oliver.upton,
	pbonzini, peterx, pgonda, pvorel, qperret, quic_eberman,
	richard.weiyang, rientjes, roypat, rppt, shuah, tabba, vannapurve,
	vkuznets, willy, zhiquan1.li, graf, mlipp, canellac

On 2024-10-08 at 19:56+0000 Sean Christopherson wrote:
> Another (slightly crazy) approach would be to use protection keys to provide the
> security properties that you want, while giving KVM (and userspace) a quick-and-easy
> override to access guest memory.
>
>  1. mmap() guest_memfd into userspace with RW protections
>  2. Configure PKRU to make guest_memfd memory inaccessible by default
>  3. Swizzle PKRU on-demand when intentionally accessing guest memory
>
> It's essentially the same idea as SMAP+STAC/CLAC, just applied to guest memory
> instead of to userspace memory.
>
> The benefit of the PKRU approach is that there are no PTE modifications, and thus
> no TLB flushes, and only the CPU that is accessing guest memory gains temporary
> access.  The big downside is that it would be limited to modern hardware, but
> that might be acceptable, especially if it simplifies KVM's implementation.

Yeah this might be worth it if it simplifies significantly. Jenkins et
al. showed MPK worked for stopping in-process Spectre V1 [1]. While
future hardware bugs are always possible, the host kernel would still
offer better protection overall since discovery of additional Spectre
approaches and gadgets in the kernel is more likely (I think it's a
bigger surface area than hardware-specific MPK transient execution
issues).

Patrick, we talked about this a couple weeks ago and ended up focusing
on within-userspace protection, but I see keys can also be used to stop
kernel access like Andrew's project he mentioned during Dave's MPK
session at LPC [2]. Andrew, could you share that here?

It's not clear to me how reliably the kernel prevents its own access to
such pages. I see a few papers that warrant more investigation:

"we found multiple interfaces that Linux, by design, provides for
accessing process memory that ignore PKU domains on a page." [3]

"Though Connor et al. demonstrate that existing MPK protections can be
bypassed by using the kernel as a confused deputy, compelling recent
work indicates that MPK operations can be made secure." [4]

Dave and others, if you're aware of resources clarifying how strong the
boundaries are, that would be helpful.

Derek

[1] https://www.cs.dartmouth.edu/~sws/pubs/jas2020.pdf
[2] https://www.youtube.com/watch?v=gEUeMfrNH94&t=1028s
[3] https://www.usenix.org/system/files/sec20-connor.pdf
[4] https://ics.uci.edu/~dabrowsa/kirth-eurosys22-pkru.pdf

* Re: [RFC PATCH 30/39] KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
  2024-10-09  3:51             ` Manwaring, Derek
@ 2024-10-09 13:52               ` Andrew Cooper
  0 siblings, 0 replies; 130+ messages in thread
From: Andrew Cooper @ 2024-10-09 13:52 UTC (permalink / raw)
  To: Manwaring, Derek, seanjc, dave.hansen
  Cc: ackerleytng, ajones, anup, bfoster, brauner, david, erdemaktas,
	fan.du, fvdl, haibo1.xu, isaku.yamahata, jgg, jgowans, jhubbard,
	jthoughton, jun.miao, kalyazin, kent.overstreet, kvm,
	linux-fsdevel, linux-kernel, linux-kselftest, linux-mm,
	maciej.wieczor-retman, mike.kravetz, muchun.song, oliver.upton,
	pbonzini, peterx, pgonda, pvorel, qperret, quic_eberman,
	richard.weiyang, rientjes, roypat, rppt, shuah, tabba, vannapurve,
	vkuznets, willy, zhiquan1.li, graf, mlipp, canellac

On 09/10/2024 4:51 am, Manwaring, Derek wrote:
> On 2024-10-08 at 19:56+0000 Sean Christopherson wrote:
>> Another (slightly crazy) approach would be to use protection keys to provide the
>> security properties that you want, while giving KVM (and userspace) a quick-and-easy
>> override to access guest memory.
>>
>>   1. mmap() guest_memfd into userspace with RW protections
>>   2. Configure PKRU to make guest_memfd memory inaccessible by default
>>   3. Swizzle PKRU on-demand when intentionally accessing guest memory
>>
>> It's essentially the same idea as SMAP+STAC/CLAC, just applied to guest memory
>> instead of to userspace memory.
>>
>> The benefit of the PKRU approach is that there are no PTE modifications, and thus
>> no TLB flushes, and only the CPU that is accessing guest memory gains temporary
>> access.  The big downside is that it would be limited to modern hardware, but
>> that might be acceptable, especially if it simplifies KVM's implementation.
> Yeah this might be worth it if it simplifies significantly. Jenkins et
> al. showed MPK worked for stopping in-process Spectre V1 [1]. While
> future hardware bugs are always possible, the host kernel would still
> offer better protection overall since discovery of additional Spectre
> approaches and gadgets in the kernel is more likely (I think it's a
> bigger surface area than hardware-specific MPK transient execution
> issues).
>
> Patrick, we talked about this a couple weeks ago and ended up focusing
> on within-userspace protection, but I see keys can also be used to stop
> kernel access like Andrew's project he mentioned during Dave's MPK
> session at LPC [2]. Andrew, could you share that here?

This was in reference to PKS specifically (so Sapphire Rapids and
later), and also for Xen but the technique is general.

Allocate one supervisor key for the directmap (and other ranges wanting
protecting), and configure MSR_PKS[key]=AD by default.

Protection Keys were identified as being safe as a defence against
Meltdown.  At the time, only PKRU existed, and PKS was expected to have
been less overhead than KPTI on Skylake, which was even more frustrating
for those of us who'd begged for a supervisor form at the time.  What's
done is done.


The changes needed in main code would be accessors for directmap
pointers, because there needs to be a temporary AD-disable.  This would take
the form of 2x WRMSR, as opposed to a STAC/CLAC pair.

An area of concern is the overhead of the WRMSRs.  MSR_PKS is defined as
not-architecturally-serialising, but like STAC/CLAC probably comes with
model-dependent dispatch-serialising properties to prevent memory
accesses executing speculatively under the wrong protection key.

Also, for this strategy to be effective, you need to PKEY-tag all
aliases of the memory.
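
A minimal sketch of what such an accessor could look like, modelled in plain
userspace C. The MSR is simulated with a global variable and a counter, and
sim_wrmsr_pks(), DIRECTMAP_KEY and directmap_read() are invented names for
illustration; the only thing taken from the description above is the shape:
AD set by default, lifted and restored with one register write each around
the access.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simulated MSR_PKS plus a counter, so the 2x WRMSR cost is visible. */
uint32_t sim_msr_pks;
unsigned int sim_wrmsr_count;

void sim_wrmsr_pks(uint32_t val)
{
	sim_msr_pks = val;
	sim_wrmsr_count++;
}

#define PKS_AD(key)   (1u << (2 * (key)))	/* access-disable bit of a key */
#define DIRECTMAP_KEY 1u			/* hypothetical key for the directmap */

/* Copy from a directmap pointer with AD lifted only for the duration. */
void directmap_read(void *dst, const void *directmap_src, size_t len)
{
	assert(len == 0 || (dst && directmap_src));

	uint32_t saved = sim_msr_pks;

	sim_wrmsr_pks(saved & ~PKS_AD(DIRECTMAP_KEY));	/* write 1: lift AD */
	memcpy(dst, directmap_src, len);
	sim_wrmsr_pks(saved);				/* write 2: restore */
}
```

On real hardware the window between the two writes is where the
dispatch-serialising cost mentioned above would be paid.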

~Andrew

* Re: [RFC PATCH 30/39] KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
  2024-10-08 19:56           ` Sean Christopherson
  2024-10-09  3:51             ` Manwaring, Derek
@ 2024-10-10 16:21             ` Patrick Roy
  2024-10-10 19:27               ` Manwaring, Derek
  2024-10-17 23:16               ` Ackerley Tng
  1 sibling, 2 replies; 130+ messages in thread
From: Patrick Roy @ 2024-10-10 16:21 UTC (permalink / raw)
  To: Sean Christopherson, Ackerley Tng
  Cc: quic_eberman, tabba, jgg, peterx, david, rientjes, fvdl,
	jthoughton, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest, linux-fsdevel,
	jgowans, kalyazin, derekmn

On Tue, 2024-10-08 at 20:56 +0100, Sean Christopherson wrote:
> On Tue, Oct 08, 2024, Ackerley Tng wrote:
>> Patrick Roy <roypat@amazon.co.uk> writes:
>>> For the "non-CoCo with direct map entries removed" VMs that we at AWS
>>> are going for, we'd like a VM type with host-controlled in-place
>>> conversions which doesn't zero on transitions,
> 
> Hmm, your use case shouldn't need conversions _for KVM_, as there's no need for
> KVM to care if userspace or the guest _wants_ a page to be shared vs. private.
> Userspace is fully trusted to manage things; KVM simply reacts to the current
> state of things.
> 
> And more importantly, whether or not the direct map is zapped needs to be a
> property of the guest_memfd inode, i.e. can't be associated with a struct kvm.
> I forget who got volunteered to do the work,

I think me? At least we talked about it briefly

> but we're going to need similar
> functionality for tracking the state of individual pages in a huge folio, as
> folio_mark_uptodate() is too coarse-grained.  I.e. at some point, I expect that
> guest_memfd will make it easy-ish to determine whether or not the direct map has
> been obliterated.
> 
> The shared vs. private attributes tracking in KVM is still needed (I think), as
> it communicates what userspace _wants_, whereas the guest_memfd machinery will
> track what the state _is_.

If I'm understanding this patch series correctly, the approach taken
here is to force the KVM memory attributes and the internal guest_memfd
state to be in-sync, because the VMA from mmap()ing guest_memfd is
reflected back into the userspace_addr of the memslot. So, to me, in
this world, "direct map zapped iff
kvm_has_mem_attributes(KVM_MEMORY_ATTRIBUTES_PRIVATE)", with memory
attribute changes forcing the corresponding gmem state change. That's
why I was talking about conversions above.

I've played around with this locally, and since KVM seems to generally
use copy_from_user and friends to access the userspace_addr VMA, (aka
private mem that's reflected back into memslots here), with this things
like MMIO emulation can be oblivious to gmem's existence, since
copy_from_user and co don't require GUP or presence of direct map
entries (well, "oblivious" in the sense that things like kvm_read_guest
currently ignore memory attributes and unconditionally access
userspace_addr, which I suppose is not really wanted for VMs where
userspace_addr and guest_memfd aren't short-circuited like this). The
exception is kvm_clock, where the pv_time page would need to be
explicitly converted to shared to restore the direct map entry, although
I think we could just let userspace deal with making sure this page is
shared (and then, if gmem supports GUP on shared memory, even the
gfn_to_pfn_caches could work without gmem knowledge. Without GUP, we'd
still need a tiny hack in the uhva->pfn translation somewhere to handle
gmem vmas, but iirc you did mention that having kvm-clock be special
might be fine).

I guess it does come down to what you note below, answering the question
of "how does KVM internally access guest_memfd for non-CoCo VMs".  Is
there any way we can make uaccesses like above work? I've finally gotten
around to re-running some performance benchmarks of my on-demand
reinsertion patches with all the needed TLB flushes added, and my fio
benchmark on a virtio-blk device suffers a ~50% throughput regression,
which does not necessarily spark joy. And I think James H.  mentioned at
LPC that making the userfault stuff work with my patches would be quite
hard. All this in addition to you also not necessarily sounding too keen
on it either :D

>>> so if KVM_X86_SW_PROTECTED_VM ends up zeroing, we'd need to add another new
>>> VM type for that.
> 
> Maybe we should sneak in a s/KVM_X86_SW_PROTECTED_VM/KVM_X86_SW_HARDENED_VM rename?
> The original thought behind "software protected VM" was to do a slow build of
> something akin to pKVM, but realistically I don't think that idea is going anywhere.

Ah, admittedly I've thought of KVM_X86_SW_PROTECTED_VM as a bit of a
playground where various configurations other VM types enforce can be
mixed and matched (e.g. zero on conversions yes/no, direct map removal
yes/no) so more of a KVM_X86_GMEM_VM, but am happy to update my
understanding :) 

> Alternatively, depending on how KVM accesses guest memory that's been removed from
> the direct map, another solution would be to allow "regular" VMs to bind memslots
> to guest_memfd, i.e. if the non-CoCo use case needs/wants to bind all memory to
> guest_memfd, not just "private" mappings.
> 
> That's probably the biggest topic of discussion: how do we want to allow mapping
> guest_memfd into the guest, without direct map entries, but while still allowing
> KVM to access guest memory as needed, e.g. for shadow paging.  One approach is
> your RFC, where KVM maps guest_memfd pfns on-demand.
> 
> Another (slightly crazy) approach would be to use protection keys to provide the
> security properties that you want, while giving KVM (and userspace) a quick-and-easy
> override to access guest memory.
> 
>  1. mmap() guest_memfd into userspace with RW protections
>  2. Configure PKRU to make guest_memfd memory inaccessible by default
>  3. Swizzle PKRU on-demand when intentionally accessing guest memory
> 
> It's essentially the same idea as SMAP+STAC/CLAC, just applied to guest memory
> instead of to userspace memory.
> 
> The benefit of the PKRU approach is that there are no PTE modifications, and thus
> no TLB flushes, and only the CPU that is accessing guest memory gains temporary
> access.  The big downside is that it would be limited to modern hardware, but
> that might be acceptable, especially if it simplifies KVM's implementation.

Mh, but we only have 16 protection keys, so we cannot give each VM a
unique one. And if all guest memory shares the same protection key, then
during the on-demand swizzling the CPU would get access to _all_ guest
memory on the host, which "feels" scary. What do you think, @Derek?
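
To make the three steps concrete, here is a rough userspace sketch of the
swizzle using the glibc pkey wrappers. It is only an illustration under
assumptions: the mapping is plain anonymous memory rather than guest_memfd,
pkey_swizzle_demo() is a made-up name, and nothing here reflects how KVM
itself would toggle PKRU. It also shows the limitation above: whoever holds
the key open can reach every page tagged with it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 0 if the swizzle sequence ran, 1 if protection keys are not
 * usable here (no PKU on the CPU, old kernel, or a sandbox that blocks
 * the syscalls).
 */
int pkey_swizzle_demo(void)
{
	size_t len = (size_t)sysconf(_SC_PAGESIZE);
	int pkey = -1;
	int ret = 1;
	char *mem;

	/* Stand-in for step 1: an ordinary RW mapping, not guest_memfd. */
	mem = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (mem == MAP_FAILED)
		return 1;

	/* Step 2: allocate a key whose initial rights deny all data access. */
	pkey = pkey_alloc(0, PKEY_DISABLE_ACCESS);
	if (pkey < 0)
		goto out;
	if (pkey_mprotect(mem, len, PROT_READ | PROT_WRITE, pkey))
		goto out;

	/* Step 3: open the window on this thread only, touch, close again. */
	if (pkey_set(pkey, 0))
		goto out;
	mem[0] = 'A';
	assert(mem[0] == 'A');
	if (pkey_set(pkey, PKEY_DISABLE_ACCESS))
		goto out;

	ret = 0;
out:
	munmap(mem, len);
	if (pkey >= 0)
		pkey_free(pkey);
	return ret;
}
```

No page tables are modified between the two pkey_set() calls, which is where
the "no TLB flushes" benefit comes from.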

Does ARM have something equivalent, btw?

>>> Somewhat related sidenote: For VMs that allow inplace conversions and do
>>> not zero, we do not need to zap the stage-2 mappings on memory attribute
>>> changes, right?
> 
> See above.  I don't think conversions by toggling the shared/private flag in
> KVM's memory attributes is the right fit for your use case.

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [RFC PATCH 30/39] KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
  2024-10-10 16:21             ` Patrick Roy
@ 2024-10-10 19:27               ` Manwaring, Derek
  2024-10-17 23:16               ` Ackerley Tng
  1 sibling, 0 replies; 130+ messages in thread
From: Manwaring, Derek @ 2024-10-10 19:27 UTC (permalink / raw)
  To: roypat
  Cc: ackerleytng, ajones, anup, bfoster, brauner, david, derekmn,
	erdemaktas, fan.du, fvdl, haibo1.xu, isaku.yamahata, jgg, jgowans,
	jhubbard, jthoughton, jun.miao, kalyazin, kent.overstreet, kvm,
	linux-fsdevel, linux-kernel, linux-kselftest, linux-mm,
	maciej.wieczor-retman, mike.kravetz, muchun.song, oliver.upton,
	pbonzini, peterx, pgonda, pvorel, qperret, quic_eberman,
	richard.weiyang, rientjes, rppt, seanjc, shuah, tabba, vannapurve,
	vkuznets, willy, zhiquan1.li, mlipp, canellac, dave.hansen,
	andrew.cooper3

On 2024-10-10 at 16:21+0000 Patrick Roy wrote:
> On Tue, 2024-10-08 at 20:56 +0100, Sean Christopherson wrote:
> > Another (slightly crazy) approach would be to use protection keys to provide the
> > security properties that you want, while giving KVM (and userspace) a quick-and-easy
> > override to access guest memory.
> >
> >  1. mmap() guest_memfd into userspace with RW protections
> >  2. Configure PKRU to make guest_memfd memory inaccessible by default
> >  3. Swizzle PKRU on-demand when intentionally accessing guest memory
> >
> > It's essentially the same idea as SMAP+STAC/CLAC, just applied to guest memory
> > instead of to userspace memory.
> >
> > The benefit of the PKRU approach is that there are no PTE modifications, and thus
> > no TLB flushes, and only the CPU that is accessing guest memory gains temporary
> > access.  The big downside is that it would be limited to modern hardware, but
> > that might be acceptable, especially if it simplifies KVM's implementation.
>
> Mh, but we only have 16 protection keys, so we cannot give each VM a
> unique one. And if all guest memory shares the same protection key, then
> during the on-demand swizzling the CPU would get access to _all_ guest
> memory on the host, which "feels" scary. What do you think, @Derek?

Yes, I am concerned about this. I don't see a way to use protection keys
that would ensure the host kernel cannot be tricked by one guest into
speculatively accessing another guest's memory (unless we do a key per
VM, which like you say severely limits how many guests you can host).

> Does ARM have something equivalent, btw?

Yes - Permission Overlay Extension [1]. Although even the most recent
parts don't offer it. I don't see it in Neoverse V3 or Cortex-X4.

Derek


[1] https://lore.kernel.org/all/20240822151113.1479789-1-joey.gouly@arm.com/

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [RFC PATCH 30/39] KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
  2024-10-10 16:21             ` Patrick Roy
  2024-10-10 19:27               ` Manwaring, Derek
@ 2024-10-17 23:16               ` Ackerley Tng
  2024-10-18  7:10                 ` Patrick Roy
  1 sibling, 1 reply; 130+ messages in thread
From: Ackerley Tng @ 2024-10-17 23:16 UTC (permalink / raw)
  To: Patrick Roy
  Cc: seanjc, quic_eberman, tabba, jgg, peterx, david, rientjes, fvdl,
	jthoughton, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest, linux-fsdevel,
	jgowans, kalyazin, derekmn

Patrick Roy <roypat@amazon.co.uk> writes:

> On Tue, 2024-10-08 at 20:56 +0100, Sean Christopherson wrote:
>> On Tue, Oct 08, 2024, Ackerley Tng wrote:
>>> Patrick Roy <roypat@amazon.co.uk> writes:
>>>> For the "non-CoCo with direct map entries removed" VMs that we at AWS
>>>> are going for, we'd like a VM type with host-controlled in-place
>>>> conversions which doesn't zero on transitions,
>> 
>> Hmm, your use case shouldn't need conversions _for KVM_, as there's no need for
>> KVM to care if userspace or the guest _wants_ a page to be shared vs. private.
>> Userspace is fully trusted to manage things; KVM simply reacts to the current
>> state of things.
>> 
>> And more importantly, whether or not the direct map is zapped needs to be a
>> property of the guest_memfd inode, i.e. can't be associated with a struct kvm.
>> I forget who got volunteered to do the work,
>
> I think me? At least we talked about it briefly
>
>> but we're going to need similar
>> functionality for tracking the state of individual pages in a huge folio, as
>> folio_mark_uptodate() is too coarse-grained.  I.e. at some point, I expect that
>> guest_memfd will make it easy-ish to determine whether or not the direct map has
>> been obliterated.
>> 
>> The shared vs. private attributes tracking in KVM is still needed (I think), as
>> it communicates what userspace _wants_, whereas the guest_memfd machinery will
>> track what the state _is_.
>
> If I'm understanding this patch series correctly, the approach taken
> here is to force the KVM memory attributes and the internal guest_memfd
> state to be in-sync, because the VMA from mmap()ing guest_memfd is
> reflected back into the userspace_addr of the memslot.

In this patch series, we're also using guest_memfd state (faultability
xarray) to prevent any future faults before checking that there are no
mappings. Further explanation at [1].

Reason (a) at [1] is what Sean describes above to be what userspace
_wants_ vs what the state _is_.
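
A toy model of that ordering may help; it is not the code in this series.
All names below (struct toy_page, toy_set_private() and so on) are
hypothetical, and restoring faultability on failure is an assumption on my
part, chosen because the selftest in patch 32 can still fault the page after
a failed attribute change.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Toy model of one guest_memfd page; every name here is hypothetical. */
struct toy_page {
	bool attr_private;	/* what userspace asked for (mem_attrs)   */
	bool faultable;		/* what guest_memfd currently allows      */
	int  host_mappings;	/* outstanding userspace mappings or pins */
};

/* Userspace faults the page in through its mmap() of guest_memfd. */
int toy_fault(struct toy_page *p)
{
	if (!p->faultable)
		return -EFAULT;	/* the selftest observes this as SIGBUS */
	p->host_mappings++;
	return 0;
}

/* madvise(MADV_DONTNEED) analogue: drop one userspace mapping. */
void toy_unmap(struct toy_page *p)
{
	if (p->host_mappings > 0)
		p->host_mappings--;
}

int toy_set_private(struct toy_page *p)
{
	/* Block new faults first so the mapping check below cannot race. */
	p->faultable = false;

	if (p->host_mappings > 0) {
		/* Assumed rollback: a failed conversion leaves the page shared. */
		p->faultable = true;
		return -EINVAL;
	}

	/* Only now does the attribute catch up with the guest_memfd state. */
	p->attr_private = true;
	assert(!(p->attr_private && p->faultable));
	return 0;
}
```

Between the first statement of toy_set_private() and the attribute update,
the two pieces of state disagree on purpose: guest_memfd already refuses
faults while mem_attrs still says shared.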

> So, to me, in
> this world, "direct map zapped iff
> kvm_has_mem_attributes(KVM_MEMORY_ATTRIBUTE_PRIVATE)", with memory
> attribute changes forcing the corresponding gmem state change. That's
> why I was talking about conversions above.

I think if we do continue to have state in guest_memfd, then direct map
removal should be based on guest_memfd's state, rather than
KVM_MEMORY_ATTRIBUTE_PRIVATE in mem_attr_array.

> I've played around with this locally, and since KVM seems to generally
> use copy_from_user and friends to access the userspace_addr VMA, (aka
> private mem that's reflected back into memslots here), with this things
> like MMIO emulation can be oblivious to gmem's existence, since
> copy_from_user and co don't require GUP or presence of direct map
> entries (well, "oblivious" in the sense that things like kvm_read_guest
> currently ignore memory attributes and unconditionally access
> userspace_addr, which I suppose is not really wanted for VMs where
> userspace_addr and guest_memfd aren't short-circuited like this). The
> exception is kvm_clock, where the pv_time page would need to be
> explicitly converted to shared to restore the direct map entry, although
> I think we could just let userspace deal with making sure this page is
> shared (and then, if gmem supports GUP on shared memory, even the
> gfn_to_pfn_caches could work without gmem knowledge. Without GUP, we'd
> still need a tiny hack in the uhva->pfn translation somewhere to handle
> gmem vmas, but iirc you did mention that having kvm-clock be special
> might be fine).
>
> I guess it does come down to what you note below, answering the question
> of "how does KVM internally access guest_memfd for non-CoCo VMs".  Is
> there any way we can make uaccesses like above work? I've finally gotten
> around to re-running some performance benchmarks of my on-demand
> reinsertion patches with all the needed TLB flushes added, and my fio
> benchmark on a virtio-blk device suffers a ~50% throughput regression,
> which does not necessarily spark joy. And I think James H.  mentioned at
> LPC that making the userfault stuff work with my patches would be quite
> hard. All this in addition to you also not necessarily sounding too keen
> on it either :D
>
>>>> so if KVM_X86_SW_PROTECTED_VM ends up zeroing, we'd need to add another new
>>>> VM type for that.
>> 
>> Maybe we should sneak in a s/KVM_X86_SW_PROTECTED_VM/KVM_X86_SW_HARDENED_VM rename?
>> The original thought behind "software protected VM" was to do a slow build of
>> something akin to pKVM, but realistically I don't think that idea is going anywhere.
>
> Ah, admittedly I've thought of KVM_X86_SW_PROTECTED_VM as a bit of a
> playground where various configurations other VM types enforce can be
> mixed and matched (e.g. zero on conversions yes/no, direct map removal
> yes/no) so more of a KVM_X86_GMEM_VM, but am happy to update my
> understanding :) 
>

Given the different axes of possible configurations for guest_memfd
(zero on conversion, direct map removal), I think it's better to let
userspace choose, than to enumerate the combinations in VM types.

Independently of whether to use a flag or VM type to configure
guest_memfd, the "zeroed" state has to be stored somewhere.

For folios to at least be zeroed once, presence in the filemap could
indicate "zeroed".

Presence in the filemap may be awkward to use as an indication of
"zeroed" for the conversion case.

What else can we use to store "zeroed"? Suggestions: 

1. Since "prepared" already took the dirty bit on the folio, "zeroed"
   can use the checked bit on the folio. [2] indicates that it is for
   filesystems, which sounds like guest_memfd :)
2. folio->private (which we may already need to use)
3. Another xarray
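
To illustrate what any of these options has to provide, here is a small
userspace model of side-band "zeroed" tracking. It is a sketch under
assumptions, not a proposal for the actual data structure: a plain bitmap
stands in for the folio flag or xarray, and struct toy_gmem, toy_prepare()
and toy_convert() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TOY_NR_PAGES  4
#define TOY_PAGE_SIZE 64	/* tiny on purpose; real pages are 4K and up */

/* Hypothetical stand-in for a guest_memfd inode with side-band state. */
struct toy_gmem {
	uint8_t  data[TOY_NR_PAGES][TOY_PAGE_SIZE];
	uint32_t zeroed;		/* bit i set: page i needs no zeroing */
	bool     zero_on_conversion;	/* policy chosen by userspace         */
};

/* Zero the page only if the side-band bit says it has not been done. */
uint8_t *toy_prepare(struct toy_gmem *g, unsigned int idx)
{
	assert(idx < TOY_NR_PAGES);
	if (!(g->zeroed & (1u << idx))) {
		memset(g->data[idx], 0, TOY_PAGE_SIZE);
		g->zeroed |= 1u << idx;
	}
	return g->data[idx];
}

/* A conversion invalidates "zeroed" only when the policy asks for it. */
void toy_convert(struct toy_gmem *g, unsigned int idx)
{
	assert(idx < TOY_NR_PAGES);
	if (g->zero_on_conversion)
		g->zeroed &= ~(1u << idx);
}
```

The point of the model is that "zeroed" must be clearable independently of
whether the page is present, which is why presence in the filemap alone is
awkward for the conversion case.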

>> <snip>

[1] https://lore.kernel.org/all/diqz1q0qtqnd.fsf@ackerleytng-ctop.c.googlers.com/T/#ma6f828d7a50c4de8a2f829a16c9bb458b53d8f3f
[2] https://elixir.bootlin.com/linux/v6.11.4/source/include/linux/page-flags.h#L147

^ permalink raw reply	[flat|nested] 130+ messages in thread

* Re: [RFC PATCH 30/39] KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
  2024-10-17 23:16               ` Ackerley Tng
@ 2024-10-18  7:10                 ` Patrick Roy
  0 siblings, 0 replies; 130+ messages in thread
From: Patrick Roy @ 2024-10-18  7:10 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: seanjc, quic_eberman, tabba, jgg, peterx, david, rientjes, fvdl,
	jthoughton, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest, linux-fsdevel,
	jgowans, kalyazin, derekmn



On Fri, 2024-10-18 at 00:16 +0100, Ackerley Tng wrote:
> Patrick Roy <roypat@amazon.co.uk> writes:
> 
>> On Tue, 2024-10-08 at 20:56 +0100, Sean Christopherson wrote:
>>> On Tue, Oct 08, 2024, Ackerley Tng wrote:
>>>> Patrick Roy <roypat@amazon.co.uk> writes:
>>>>> For the "non-CoCo with direct map entries removed" VMs that we at AWS
>>>>> are going for, we'd like a VM type with host-controlled in-place
>>>>> conversions which doesn't zero on transitions,
>>>
>>> Hmm, your use case shouldn't need conversions _for KVM_, as there's no need for
>>> KVM to care if userspace or the guest _wants_ a page to be shared vs. private.
>>> Userspace is fully trusted to manage things; KVM simply reacts to the current
>>> state of things.
>>>
>>> And more importantly, whether or not the direct map is zapped needs to be a
>>> property of the guest_memfd inode, i.e. can't be associated with a struct kvm.
>>> I forget who got volunteered to do the work,
>>
>> I think me? At least we talked about it briefly
>>
>>> but we're going to need similar
>>> functionality for tracking the state of individual pages in a huge folio, as
>>> folio_mark_uptodate() is too coarse-grained.  I.e. at some point, I expect that
>>> guest_memfd will make it easy-ish to determine whether or not the direct map has
>>> been obliterated.
>>>
>>> The shared vs. private attributes tracking in KVM is still needed (I think), as
>>> it communicates what userspace _wants_, whereas the guest_memfd machinery will
>>> track what the state _is_.
>>
>> If I'm understanding this patch series correctly, the approach taken
>> here is to force the KVM memory attributes and the internal guest_memfd
>> state to be in-sync, because the VMA from mmap()ing guest_memfd is
>> reflected back into the userspace_addr of the memslot.
> 
> In this patch series, we're also using guest_memfd state (faultability
> xarray) to prevent any future faults before checking that there are no
> mappings. Further explanation at [1].
> 
> Reason (a) at [1] is what Sean describes above to be what userspace
> _wants_ vs what the state _is_.

Ah, I was missing that detail about faultability being disabled, yet
mem_attrs not being updated until all pins are actually gone. Thanks!

Mh, I'm probably not seeing it because of my lack of experience with CoCo
setups, but how would pKVM not trusting userspace about conversions cause
mem_attrs and faultability to go out of sync? Or generally, if the guest and
userspace have different ideas about what is shared, and userspace's
idea is stored in mem_attrs (or rather, the part where they can agree is
stored in mem_attrs?), where do we store the guest's view of it? Guest
page tables?

>> So, to me, in
>> this world, "direct map zapped iff
>> kvm_has_mem_attributes(KVM_MEMORY_ATTRIBUTE_PRIVATE)", with memory
>> attribute changes forcing the corresponding gmem state change. That's
>> why I was talking about conversions above.
> 
> I think if we do continue to have state in guest_memfd, then direct map
> removal should be based on guest_memfd's state, rather than
> KVM_MEMORY_ATTRIBUTE_PRIVATE in mem_attr_array.

I am not trying to argue against tracking it in guest_memfd, I'm just
wondering if mem attributes and direct map state would ever disagree.
But probably that's also just because of my confusion above :)

>> I've played around with this locally, and since KVM seems to generally
>> use copy_from_user and friends to access the userspace_addr VMA, (aka
>> private mem that's reflected back into memslots here), with this things
>> like MMIO emulation can be oblivious to gmem's existence, since
>> copy_from_user and co don't require GUP or presence of direct map
>> entries (well, "oblivious" in the sense that things like kvm_read_guest
>> currently ignore memory attributes and unconditionally access
>> userspace_addr, which I suppose is not really wanted for VMs where
>> userspace_addr and guest_memfd aren't short-circuited like this). The
>> exception is kvm_clock, where the pv_time page would need to be
>> explicitly converted to shared to restore the direct map entry, although
>> I think we could just let userspace deal with making sure this page is
>> shared (and then, if gmem supports GUP on shared memory, even the
>> gfn_to_pfn_caches could work without gmem knowledge. Without GUP, we'd
>> still need a tiny hack in the uhva->pfn translation somewhere to handle
>> gmem vmas, but iirc you did mention that having kvm-clock be special
>> might be fine).
>>
>> I guess it does come down to what you note below, answering the question
>> of "how does KVM internally access guest_memfd for non-CoCo VMs".  Is
>> there any way we can make uaccesses like above work? I've finally gotten
>> around to re-running some performance benchmarks of my on-demand
>> reinsertion patches with all the needed TLB flushes added, and my fio
>> benchmark on a virtio-blk device suffers a ~50% throughput regression,
>> which does not necessarily spark joy. And I think James H.  mentioned at
>> LPC that making the userfault stuff work with my patches would be quite
>> hard. All this in addition to you also not necessarily sounding too keen
>> on it either :D
>>
>>>>> so if KVM_X86_SW_PROTECTED_VM ends up zeroing, we'd need to add another new
>>>>> VM type for that.
>>>
>>> Maybe we should sneak in a s/KVM_X86_SW_PROTECTED_VM/KVM_X86_SW_HARDENED_VM rename?
>>> The original thought behind "software protected VM" was to do a slow build of
>>> something akin to pKVM, but realistically I don't think that idea is going anywhere.
>>
>> Ah, admittedly I've thought of KVM_X86_SW_PROTECTED_VM as a bit of a
>> playground where various configurations other VM types enforce can be
>> mixed and matched (e.g. zero on conversions yes/no, direct map removal
>> yes/no) so more of a KVM_X86_GMEM_VM, but am happy to update my
>> understanding :)
>>
> 
> Given the different axes of possible configurations for guest_memfd
> (zero on conversion, direct map removal), I think it's better to let
> userspace choose, than to enumerate the combinations in VM types.
> 
> Independently of whether to use a flag or VM type to configure
> guest_memfd, the "zeroed" state has to be stored somewhere.
> 
> For folios to at least be zeroed once, presence in the filemap could
> indicate "zeroed".
> 
> Presence in the filemap may be awkward to use as an indication of
> "zeroed" for the conversion case.
> 
> What else can we use to store "zeroed"? Suggestions:
> 
> 1. Since "prepared" already took the dirty bit on the folio, "zeroed"
>    can use the checked bit on the folio. [2] indicates that it is for
>    filesystems, which sounds like guest_memfd :)
> 2. folio->private (which we may already need to use)
> 3. Another xarray
> 
>>> <snip>
> 
> [1] https://lore.kernel.org/all/diqz1q0qtqnd.fsf@ackerleytng-ctop.c.googlers.com/T/#ma6f828d7a50c4de8a2f829a16c9bb458b53d8f3f
> [2] https://elixir.bootlin.com/linux/v6.11.4/source/include/linux/page-flags.h#L147

^ permalink raw reply	[flat|nested] 130+ messages in thread

* [RFC PATCH 31/39] KVM: selftests: Allow vm_set_memory_attributes to be used without asserting return value of 0
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (29 preceding siblings ...)
  2024-09-10 23:44 ` [RFC PATCH 30/39] KVM: guest_memfd: Handle folio preparation for guest_memfd mmap Ackerley Tng
@ 2024-09-10 23:44 ` Ackerley Tng
  2024-09-10 23:44 ` [RFC PATCH 32/39] KVM: selftests: Test using guest_memfd memory from userspace Ackerley Tng
                   ` (10 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:44 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

No functional change intended.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>

---
 tools/testing/selftests/kvm/include/kvm_util.h | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 63c2aaae51f3..d336cd0c8f19 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -374,8 +374,8 @@ static inline void vm_enable_cap(struct kvm_vm *vm, uint32_t cap, uint64_t arg0)
 	vm_ioctl(vm, KVM_ENABLE_CAP, &enable_cap);
 }
 
-static inline void vm_set_memory_attributes(struct kvm_vm *vm, uint64_t gpa,
-					    uint64_t size, uint64_t attributes)
+static inline int __vm_set_memory_attributes(struct kvm_vm *vm, uint64_t gpa,
+					     uint64_t size, uint64_t attributes)
 {
 	struct kvm_memory_attributes attr = {
 		.attributes = attributes,
@@ -391,7 +391,15 @@ static inline void vm_set_memory_attributes(struct kvm_vm *vm, uint64_t gpa,
 	TEST_ASSERT(!attributes || attributes == KVM_MEMORY_ATTRIBUTE_PRIVATE,
 		    "Update me to support multiple attributes!");
 
-	vm_ioctl(vm, KVM_SET_MEMORY_ATTRIBUTES, &attr);
+	return __vm_ioctl(vm, KVM_SET_MEMORY_ATTRIBUTES, &attr);
+}
+
+static inline void vm_set_memory_attributes(struct kvm_vm *vm, uint64_t gpa,
+					    uint64_t size, uint64_t attributes)
+{
+	int ret = __vm_set_memory_attributes(vm, gpa, size, attributes);
+
+	__TEST_ASSERT_VM_VCPU_IOCTL(!ret, "KVM_SET_MEMORY_ATTRIBUTES", ret, vm);
 }
 
 
-- 
2.46.0.598.g6f2099f65c-goog


^ permalink raw reply related	[flat|nested] 130+ messages in thread

* [RFC PATCH 32/39] KVM: selftests: Test using guest_memfd memory from userspace
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (30 preceding siblings ...)
  2024-09-10 23:44 ` [RFC PATCH 31/39] KVM: selftests: Allow vm_set_memory_attributes to be used without asserting return value of 0 Ackerley Tng
@ 2024-09-10 23:44 ` Ackerley Tng
  2024-09-10 23:44 ` [RFC PATCH 33/39] KVM: selftests: Test guest_memfd memory sharing between guest and host Ackerley Tng
                   ` (9 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:44 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

Test using guest_memfd from userspace, since guest_memfd now has
mmap() support.

Tests:

1. mmap() should now always return a valid address
2. Test that madvise() doesn't give any issues when pages are not
   faulted in.
3. Test that pages should not be faultable before association with a
   memslot, and that faults result in SIGBUS.
4. Test that pages can be faulted if marked faultable, and the flow of
   setting a memory range as private, which is:
   a. madvise(MADV_DONTNEED) to request kernel to unmap pages
   b. Set memory attributes of VM to private
   Also test that if pages are still mapped, setting memory attributes
   will fail.
5. Test that madvise(MADV_REMOVE) can be used to remove pages from
   guest_memfd, forcing zeroing of those pages before the next time
   the pages are faulted in.
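
For reference, the difference between the two madvise() calls used in tests
4 and 5 can be reproduced on an ordinary shmem-backed memfd. The sketch
below assumes memfd_create() as a stand-in for guest_memfd and uses a
made-up function name; it does not exercise guest_memfd itself.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Returns 0 if both behaviours were observed, 1 if the environment cannot
 * run the demo, 2 if the page contents did not match the expectation.
 */
int madvise_semantics_demo(void)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t len;
	int ret = 1;
	char *mem;
	int fd;

	assert(page > 0);
	len = (size_t)page;

	/* An ordinary shmem-backed memfd, used as a stand-in for guest_memfd. */
	fd = memfd_create("toy", 0);
	if (fd < 0)
		return 1;
	if (ftruncate(fd, (off_t)len))
		goto out_close;

	mem = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (mem == MAP_FAILED)
		goto out_close;

	mem[0] = 'A';

	/* MADV_DONTNEED drops the PTEs; the page cache still holds the data. */
	if (madvise(mem, len, MADV_DONTNEED))
		goto out_unmap;
	if (mem[0] != 'A') {
		ret = 2;
		goto out_unmap;
	}

	/* MADV_REMOVE frees the backing page; the next fault sees zeroes. */
	if (madvise(mem, len, MADV_REMOVE))
		goto out_unmap;
	ret = (mem[0] == 0) ? 0 : 2;

out_unmap:
	munmap(mem, len);
out_close:
	close(fd);
	return ret;
}
```

This is why the conversion flow in test 4 uses MADV_DONTNEED (unmap, keep
contents) while test 5 uses MADV_REMOVE (release, zero on next fault).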

Signed-off-by: Ackerley Tng <ackerleytng@google.com>

---
 .../testing/selftests/kvm/guest_memfd_test.c  | 195 +++++++++++++++++-
 1 file changed, 189 insertions(+), 6 deletions(-)

diff --git a/tools/testing/selftests/kvm/guest_memfd_test.c b/tools/testing/selftests/kvm/guest_memfd_test.c
index 3618ce06663e..b6f3c3e6d0dd 100644
--- a/tools/testing/selftests/kvm/guest_memfd_test.c
+++ b/tools/testing/selftests/kvm/guest_memfd_test.c
@@ -6,6 +6,7 @@
  */
 #include <stdlib.h>
 #include <string.h>
+#include <sys/wait.h>
 #include <unistd.h>
 #include <errno.h>
 #include <stdio.h>
@@ -35,12 +36,192 @@ static void test_file_read_write(int fd)
 		    "pwrite on a guest_mem fd should fail");
 }
 
-static void test_mmap(int fd, size_t page_size)
+static void test_mmap_should_map_pages_into_userspace(int fd, size_t page_size)
 {
 	char *mem;
 
 	mem = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
-	TEST_ASSERT_EQ(mem, MAP_FAILED);
+	TEST_ASSERT(mem != MAP_FAILED, "mmap should return valid address");
+
+	TEST_ASSERT_EQ(munmap(mem, page_size), 0);
+}
+
+static void test_madvise_no_error_when_pages_not_faulted(int fd, size_t page_size)
+{
+	char *mem;
+
+	mem = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	TEST_ASSERT(mem != MAP_FAILED, "mmap should return valid address");
+
+	TEST_ASSERT_EQ(madvise(mem, page_size, MADV_DONTNEED), 0);
+
+	TEST_ASSERT_EQ(munmap(mem, page_size), 0);
+}
+
+static void assert_not_faultable(char *address)
+{
+	pid_t child_pid;
+
+	child_pid = fork();
+	TEST_ASSERT(child_pid != -1, "fork failed");
+
+	if (child_pid == 0) {
+		*address = 'A';
+	} else {
+		int status;
+		waitpid(child_pid, &status, 0);
+
+		TEST_ASSERT(WIFSIGNALED(status),
+			    "Child should have exited with a signal");
+		TEST_ASSERT_EQ(WTERMSIG(status), SIGBUS);
+	}
+}
+
+/*
+ * Pages should not be faultable before association with memslot because pages
+ * (in a KVM_X86_SW_PROTECTED_VM) only default to faultable at memslot
+ * association time.
+ */
+static void test_pages_not_faultable_if_not_associated_with_memslot(int fd,
+								    size_t page_size)
+{
+	char *mem = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
+			 MAP_SHARED, fd, 0);
+	TEST_ASSERT(mem != MAP_FAILED, "mmap should return valid address");
+
+	assert_not_faultable(mem);
+
+	TEST_ASSERT_EQ(munmap(mem, page_size), 0);
+}
+
+static void test_pages_faultable_if_marked_faultable(struct kvm_vm *vm, int fd,
+						     size_t page_size)
+{
+	char *mem;
+	uint64_t gpa = 0;
+	uint64_t guest_memfd_offset = 0;
+
+	/*
+	 * This test requires KVM_X86_SW_PROTECTED_VM, which sets
+	 * arch.has_private_mem, to add a memslot with guest_memfd to a VM.
+	 */
+	if (!(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM))) {
+		printf("Faultability test skipped since KVM_X86_SW_PROTECTED_VM is not supported.\n");
+		return;
+	}
+
+	mem = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
+		   guest_memfd_offset);
+	TEST_ASSERT(mem != MAP_FAILED, "mmap should return valid address");
+
+	/*
+	 * Setting up this memslot with a KVM_X86_SW_PROTECTED_VM marks all
+	 * offsets in the file as shared, allowing pages to be faulted in.
+	 */
+	vm_set_user_memory_region2(vm, 0, KVM_MEM_GUEST_MEMFD, gpa, page_size,
+				   mem, fd, guest_memfd_offset);
+
+	*mem = 'A';
+	TEST_ASSERT_EQ(*mem, 'A');
+
+	/* Should fail since the page is still faulted in. */
+	TEST_ASSERT_EQ(__vm_set_memory_attributes(vm, gpa, page_size,
+						  KVM_MEMORY_ATTRIBUTE_PRIVATE),
+		       -1);
+	TEST_ASSERT_EQ(errno, EINVAL);
+
+	/*
+	 * Use madvise() to remove the pages from userspace page tables, then
+	 * test that the page is still faultable, and that page contents remain
+	 * the same.
+	 */
+	madvise(mem, page_size, MADV_DONTNEED);
+	TEST_ASSERT_EQ(*mem, 'A');
+
+	/* Tell kernel to unmap the page from userspace. */
+	madvise(mem, page_size, MADV_DONTNEED);
+
+	/* Now kernel can set this page to private. */
+	vm_mem_set_private(vm, gpa, page_size);
+	assert_not_faultable(mem);
+
+	/*
+	 * Should be able to fault again after setting this back to shared, and
+	 * memory contents should be cleared since pages must be re-prepared for
+	 * SHARED use.
+	 */
+	vm_mem_set_shared(vm, gpa, page_size);
+	TEST_ASSERT_EQ(*mem, 0);
+
+	/* Cleanup */
+	vm_set_user_memory_region2(vm, 0, KVM_MEM_GUEST_MEMFD, gpa, 0, mem, fd,
+				   guest_memfd_offset);
+
+	TEST_ASSERT_EQ(munmap(mem, page_size), 0);
+}
+
+static void test_madvise_remove_releases_pages(struct kvm_vm *vm, int fd,
+					       size_t page_size)
+{
+	char *mem;
+	uint64_t gpa = 0;
+	uint64_t guest_memfd_offset = 0;
+
+	/*
+	 * This test requires KVM_X86_SW_PROTECTED_VM, which sets
+	 * arch.has_private_mem, to add a memslot with guest_memfd to a VM.
+	 */
+	if (!(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM))) {
+		printf("madvise test skipped since KVM_X86_SW_PROTECTED_VM is not supported.\n");
+		return;
+	}
+
+	mem = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
+	TEST_ASSERT(mem != MAP_FAILED, "mmap should return valid address");
+
+	/*
+	 * Setting up this memslot with a KVM_X86_SW_PROTECTED_VM marks all
+	 * offsets in the file as shared, allowing pages to be faulted in.
+	 */
+	vm_set_user_memory_region2(vm, 0, KVM_MEM_GUEST_MEMFD, gpa, page_size,
+				   mem, fd, guest_memfd_offset);
+
+	*mem = 'A';
+	TEST_ASSERT_EQ(*mem, 'A');
+
+	/*
+	 * MADV_DONTNEED causes pages to be removed from userspace page tables
+	 * but should not release pages, hence page contents are kept.
+	 */
+	TEST_ASSERT_EQ(madvise(mem, page_size, MADV_DONTNEED), 0);
+	TEST_ASSERT_EQ(*mem, 'A');
+
+	/*
+	 * MADV_REMOVE causes pages to be released. Pages are then zeroed when
+	 * prepared for shared use, hence 0 is expected on next fault.
+	 */
+	TEST_ASSERT_EQ(madvise(mem, page_size, MADV_REMOVE), 0);
+	TEST_ASSERT_EQ(*mem, 0);
+
+	TEST_ASSERT_EQ(munmap(mem, page_size), 0);
+
+	/* Cleanup */
+	vm_set_user_memory_region2(vm, 0, KVM_MEM_GUEST_MEMFD, gpa, 0, mem, fd,
+				   guest_memfd_offset);
+}
+
+static void test_using_memory_directly_from_userspace(struct kvm_vm *vm,
+						      int fd, size_t page_size)
+{
+	test_mmap_should_map_pages_into_userspace(fd, page_size);
+
+	test_madvise_no_error_when_pages_not_faulted(fd, page_size);
+
+	test_pages_not_faultable_if_not_associated_with_memslot(fd, page_size);
+
+	test_pages_faultable_if_marked_faultable(vm, fd, page_size);
+
+	test_madvise_remove_releases_pages(vm, fd, page_size);
 }
 
 static void test_file_size(int fd, size_t page_size, size_t total_size)
@@ -180,18 +361,17 @@ static void test_guest_memfd(struct kvm_vm *vm, uint32_t flags, size_t page_size
 	size_t total_size;
 	int fd;
 
-	TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD));
-
 	total_size = page_size * 4;
 
 	fd = vm_create_guest_memfd(vm, total_size, flags);
 
 	test_file_read_write(fd);
-	test_mmap(fd, page_size);
 	test_file_size(fd, page_size, total_size);
 	test_fallocate(fd, page_size, total_size);
 	test_invalid_punch_hole(fd, page_size, total_size);
 
+	test_using_memory_directly_from_userspace(vm, fd, page_size);
+
 	close(fd);
 }
 
@@ -201,7 +381,10 @@ int main(int argc, char *argv[])
 
 	TEST_REQUIRE(kvm_has_cap(KVM_CAP_GUEST_MEMFD));
 
-	vm = vm_create_barebones();
+	if ((kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM)))
+		vm = vm_create_barebones_type(KVM_X86_SW_PROTECTED_VM);
+	else
+		vm = vm_create_barebones();
 
 	test_create_guest_memfd_invalid(vm);
 	test_create_guest_memfd_multiple(vm);
-- 
2.46.0.598.g6f2099f65c-goog



* [RFC PATCH 33/39] KVM: selftests: Test guest_memfd memory sharing between guest and host
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (31 preceding siblings ...)
  2024-09-10 23:44 ` [RFC PATCH 32/39] KVM: selftests: Test using guest_memfd memory from userspace Ackerley Tng
@ 2024-09-10 23:44 ` Ackerley Tng
  2024-09-10 23:44 ` [RFC PATCH 34/39] KVM: selftests: Add notes in private_mem_kvm_exits_test for mmap-able guest_memfd Ackerley Tng
                   ` (8 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:44 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

Minimal test for guest_memfd to test that when memory is marked shared
in a VM, the host can read and write to it via an mmap()ed address,
and the guest can also read and write to it.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>

---
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/guest_memfd_sharing_test.c  | 160 ++++++++++++++++++
 2 files changed, 161 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_sharing_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index b3b7e83f39fc..3c1f35456bfc 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -135,6 +135,7 @@ TEST_GEN_PROGS_x86_64 += dirty_log_test
 TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
 TEST_GEN_PROGS_x86_64 += guest_memfd_test
 TEST_GEN_PROGS_x86_64 += guest_memfd_hugetlb_reporting_test
+TEST_GEN_PROGS_x86_64 += guest_memfd_sharing_test
 TEST_GEN_PROGS_x86_64 += guest_print_test
 TEST_GEN_PROGS_x86_64 += hardware_disable_test
 TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
diff --git a/tools/testing/selftests/kvm/guest_memfd_sharing_test.c b/tools/testing/selftests/kvm/guest_memfd_sharing_test.c
new file mode 100644
index 000000000000..fef5a73e5053
--- /dev/null
+++ b/tools/testing/selftests/kvm/guest_memfd_sharing_test.c
@@ -0,0 +1,160 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Minimal test for guest_memfd to test that when memory is marked shared in a
+ * VM, the host can read and write to it via an mmap()ed address, and the guest
+ * can also read and write to it.
+ *
+ * Copyright (c) 2024, Google LLC.
+ */
+#include <string.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "ucall_common.h"
+
+#define GUEST_MEMFD_SHARING_TEST_SLOT 10
+#define GUEST_MEMFD_SHARING_TEST_GPA 0x50000000ULL
+#define GUEST_MEMFD_SHARING_TEST_GVA 0x90000000ULL
+#define GUEST_MEMFD_SHARING_TEST_OFFSET 0
+#define GUEST_MEMFD_SHARING_TEST_GUEST_TO_HOST_VALUE 0x11
+#define GUEST_MEMFD_SHARING_TEST_HOST_TO_GUEST_VALUE 0x22
+
+static void guest_code(int page_size)
+{
+	char *mem;
+	int i;
+
+	mem = (char *)GUEST_MEMFD_SHARING_TEST_GVA;
+
+	for (i = 0; i < page_size; ++i) {
+		GUEST_ASSERT_EQ(mem[i], GUEST_MEMFD_SHARING_TEST_HOST_TO_GUEST_VALUE);
+	}
+
+	memset(mem, GUEST_MEMFD_SHARING_TEST_GUEST_TO_HOST_VALUE, page_size);
+
+	GUEST_DONE();
+}
+
+int run_test(struct kvm_vcpu *vcpu, void *hva, int page_size)
+{
+	struct ucall uc;
+	uint64_t uc_cmd;
+
+	memset(hva, GUEST_MEMFD_SHARING_TEST_HOST_TO_GUEST_VALUE, page_size);
+	vcpu_args_set(vcpu, 1, page_size);
+
+	/* Reset vCPU to guest_code every time run_test is called. */
+	vcpu_arch_set_entry_point(vcpu, guest_code);
+
+	vcpu_run(vcpu);
+	uc_cmd = get_ucall(vcpu, &uc);
+
+	if (uc_cmd == UCALL_ABORT) {
+		REPORT_GUEST_ASSERT(uc);
+		return 1;
+	} else if (uc_cmd == UCALL_DONE) {
+		char *mem;
+		int i;
+
+		mem = hva;
+		for (i = 0; i < page_size; ++i)
+			TEST_ASSERT_EQ(mem[i], GUEST_MEMFD_SHARING_TEST_GUEST_TO_HOST_VALUE);
+
+		return 0;
+	} else {
+		TEST_FAIL("Unknown ucall 0x%lx.", uc.cmd);
+		return 1;
+	}
+}
+
+void *add_memslot(struct kvm_vm *vm, int guest_memfd, size_t page_size,
+		  bool back_shared_memory_with_guest_memfd)
+{
+	void *mem;
+
+	if (back_shared_memory_with_guest_memfd) {
+		mem = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED,
+			   guest_memfd, GUEST_MEMFD_SHARING_TEST_OFFSET);
+	} else {
+		mem = mmap(NULL, page_size, PROT_READ | PROT_WRITE,
+			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
+	}
+	TEST_ASSERT(mem != MAP_FAILED, "mmap should return valid address");
+
+	/*
+	 * Setting up this memslot with a KVM_X86_SW_PROTECTED_VM marks all
+	 * offsets in the file as shared.
+	 */
+	vm_set_user_memory_region2(vm, GUEST_MEMFD_SHARING_TEST_SLOT,
+				   KVM_MEM_GUEST_MEMFD,
+				   GUEST_MEMFD_SHARING_TEST_GPA, page_size, mem,
+				   guest_memfd, GUEST_MEMFD_SHARING_TEST_OFFSET);
+
+	return mem;
+}
+
+void test_sharing(bool back_shared_memory_with_guest_memfd)
+{
+	const struct vm_shape shape = {
+		.mode = VM_MODE_DEFAULT,
+		.type = KVM_X86_SW_PROTECTED_VM,
+	};
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	size_t page_size;
+	int guest_memfd;
+	void *mem;
+
+	TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
+
+	vm = vm_create_shape_with_one_vcpu(shape, &vcpu, &guest_code);
+
+	page_size = getpagesize();
+
+	guest_memfd = vm_create_guest_memfd(vm, page_size, 0);
+
+	mem = add_memslot(vm, guest_memfd, page_size, back_shared_memory_with_guest_memfd);
+
+	virt_map(vm, GUEST_MEMFD_SHARING_TEST_GVA, GUEST_MEMFD_SHARING_TEST_GPA, 1);
+
+	run_test(vcpu, mem, page_size);
+
+	/* Toggle private flag of memory attributes and run the test again. */
+	if (back_shared_memory_with_guest_memfd) {
+		/*
+		 * Use MADV_REMOVE to release the backing guest_memfd memory
+		 * back to the system before it is used again. Test that this is
+		 * only necessary when guest_memfd is used to back shared
+		 * memory.
+		 */
+		TEST_ASSERT_EQ(madvise(mem, page_size, MADV_REMOVE), 0);
+	}
+	vm_mem_set_private(vm, GUEST_MEMFD_SHARING_TEST_GPA, page_size);
+	vm_mem_set_shared(vm, GUEST_MEMFD_SHARING_TEST_GPA, page_size);
+
+	run_test(vcpu, mem, page_size);
+
+	kvm_vm_free(vm);
+	munmap(mem, page_size);
+	close(guest_memfd);
+}
+
+int main(int argc, char *argv[])
+{
+	/*
+	 * Confidence check that when guest_memfd is associated with a memslot
+	 * but only anonymous memory is used to back shared memory, sharing
+	 * memory between guest and host works as expected.
+	 */
+	test_sharing(false);
+
+	/*
+	 * Memory sharing should work as expected when shared memory is backed
+	 * with guest_memfd.
+	 */
+	test_sharing(true);
+
+	return 0;
+}
-- 
2.46.0.598.g6f2099f65c-goog



* [RFC PATCH 34/39] KVM: selftests: Add notes in private_mem_kvm_exits_test for mmap-able guest_memfd
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (32 preceding siblings ...)
  2024-09-10 23:44 ` [RFC PATCH 33/39] KVM: selftests: Test guest_memfd memory sharing between guest and host Ackerley Tng
@ 2024-09-10 23:44 ` Ackerley Tng
  2024-09-10 23:44 ` [RFC PATCH 35/39] KVM: selftests: Test that pinned pages block KVM from setting memory attributes to PRIVATE Ackerley Tng
                   ` (7 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:44 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

Note in comments why madvise() is not needed before setting memory to
private.
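
For readers unfamiliar with the madvise() behaviour these notes rely on,
here is a minimal host-only sketch. It assumes an ordinary memfd_create()
file rather than guest_memfd, so it illustrates generic shared-mapping
semantics on Linux, not the behaviour implemented by this series.
byte_after_madvise() is a hypothetical helper and not part of the
selftests.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/*
 * Hypothetical helper: write 'A' through a MAP_SHARED mapping of a fresh
 * memfd, apply @advice, and return the first byte seen on the next access.
 * Returns -1 if any setup step or madvise() itself fails.
 */
int byte_after_madvise(int advice)
{
	size_t page_size = (size_t)sysconf(_SC_PAGESIZE);
	int result = -1;
	char *mem;
	int fd;

	fd = memfd_create("madvise_sketch", MFD_CLOEXEC);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, page_size) != 0)
		goto out_close;

	mem = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (mem == MAP_FAILED)
		goto out_close;

	/* A page of a freshly created memfd reads as zero. */
	assert(mem[0] == 0);

	mem[0] = 'A';
	if (madvise(mem, page_size, advice) == 0)
		result = (unsigned char)mem[0];

	munmap(mem, page_size);
out_close:
	close(fd);
	return result;
}
```

On a mainline Linux kernel, byte_after_madvise(MADV_DONTNEED) should
return 'A', because the page stays in the file and is simply faulted back
in. byte_after_madvise(MADV_REMOVE) should return 0, because the backing
page is released. These match the expectations in
test_madvise_remove_releases_pages() in patch 32.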

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../selftests/kvm/x86_64/private_mem_kvm_exits_test.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c b/tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c
index 13e72fcec8dd..f8bcfc897f6a 100644
--- a/tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c
+++ b/tools/testing/selftests/kvm/x86_64/private_mem_kvm_exits_test.c
@@ -62,7 +62,11 @@ static void test_private_access_memslot_deleted(void)
 
 	virt_map(vm, EXITS_TEST_GVA, EXITS_TEST_GPA, EXITS_TEST_NPAGES);
 
-	/* Request to access page privately */
+	/*
+	 * Request to access page privately. madvise(MADV_DONTNEED) not required
+	 * since memory was never mmap()-ed from guest_memfd. Anonymous memory
+	 * was used instead for this memslot's userspace_addr.
+	 */
 	vm_mem_set_private(vm, EXITS_TEST_GPA, EXITS_TEST_SIZE);
 
 	pthread_create(&vm_thread, NULL,
@@ -98,7 +102,10 @@ static void test_private_access_memslot_not_private(void)
 
 	virt_map(vm, EXITS_TEST_GVA, EXITS_TEST_GPA, EXITS_TEST_NPAGES);
 
-	/* Request to access page privately */
+	/*
+	 * Request to access page privately. madvise(MADV_DONTNEED) not required
+	 * since the affected memslot doesn't use guest_memfd.
+	 */
 	vm_mem_set_private(vm, EXITS_TEST_GPA, EXITS_TEST_SIZE);
 
 	exit_reason = run_vcpu_get_exit_reason(vcpu);
-- 
2.46.0.598.g6f2099f65c-goog



* [RFC PATCH 35/39] KVM: selftests: Test that pinned pages block KVM from setting memory attributes to PRIVATE
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (33 preceding siblings ...)
  2024-09-10 23:44 ` [RFC PATCH 34/39] KVM: selftests: Add notes in private_mem_kvm_exits_test for mmap-able guest_memfd Ackerley Tng
@ 2024-09-10 23:44 ` Ackerley Tng
  2024-09-10 23:44 ` [RFC PATCH 36/39] KVM: selftests: Refactor vm_mem_add to be more flexible Ackerley Tng
                   ` (6 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:44 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

CONFIG_GUP_TEST provides userspace with an ioctl to invoke
pin_user_pages(), and this test uses the ioctl to pin pages, to check
that memory attributes cannot be set to private if shared pages are
pinned.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/guest_memfd_pin_test.c      | 104 ++++++++++++++++++
 2 files changed, 105 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/guest_memfd_pin_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 3c1f35456bfc..c5a1c8c7125a 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -136,6 +136,7 @@ TEST_GEN_PROGS_x86_64 += dirty_log_perf_test
 TEST_GEN_PROGS_x86_64 += guest_memfd_test
 TEST_GEN_PROGS_x86_64 += guest_memfd_hugetlb_reporting_test
 TEST_GEN_PROGS_x86_64 += guest_memfd_sharing_test
+TEST_GEN_PROGS_x86_64 += guest_memfd_pin_test
 TEST_GEN_PROGS_x86_64 += guest_print_test
 TEST_GEN_PROGS_x86_64 += hardware_disable_test
 TEST_GEN_PROGS_x86_64 += kvm_create_max_vcpus
diff --git a/tools/testing/selftests/kvm/guest_memfd_pin_test.c b/tools/testing/selftests/kvm/guest_memfd_pin_test.c
new file mode 100644
index 000000000000..b45fb8024970
--- /dev/null
+++ b/tools/testing/selftests/kvm/guest_memfd_pin_test.c
@@ -0,0 +1,104 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Test that pinned pages block KVM from setting memory attributes to PRIVATE.
+ *
+ * Copyright (c) 2024, Google LLC.
+ */
+#include <string.h>
+#include <sys/mman.h>
+#include <unistd.h>
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "../../../../mm/gup_test.h"
+
+#define GUEST_MEMFD_PIN_TEST_SLOT 10
+#define GUEST_MEMFD_PIN_TEST_GPA 0x50000000ULL
+#define GUEST_MEMFD_PIN_TEST_OFFSET 0
+
+static int gup_test_fd;
+
+void pin_pages(void *vaddr, uint64_t size)
+{
+	const struct pin_longterm_test args = {
+		.addr = (uint64_t)vaddr,
+		.size = size,
+		.flags = PIN_LONGTERM_TEST_FLAG_USE_WRITE,
+	};
+
+	TEST_ASSERT_EQ(ioctl(gup_test_fd, PIN_LONGTERM_TEST_START, &args), 0);
+}
+
+void unpin_pages(void)
+{
+	TEST_ASSERT_EQ(ioctl(gup_test_fd, PIN_LONGTERM_TEST_STOP), 0);
+}
+
+void run_test(void)
+{
+	struct kvm_vm *vm;
+	size_t page_size;
+	void *mem;
+	int fd;
+
+	vm = vm_create_barebones_type(KVM_X86_SW_PROTECTED_VM);
+
+	page_size = getpagesize();
+	fd = vm_create_guest_memfd(vm, page_size, 0);
+
+	mem = mmap(NULL, page_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd,
+		   GUEST_MEMFD_PIN_TEST_OFFSET);
+	TEST_ASSERT(mem != MAP_FAILED, "mmap should return valid address");
+
+	/*
+	 * Setting up this memslot with a KVM_X86_SW_PROTECTED_VM marks all
+	 * offsets in the file as shared.
+	 */
+	vm_set_user_memory_region2(vm, GUEST_MEMFD_PIN_TEST_SLOT,
+				   KVM_MEM_GUEST_MEMFD,
+				   GUEST_MEMFD_PIN_TEST_GPA, page_size, mem, fd,
+				   GUEST_MEMFD_PIN_TEST_OFFSET);
+
+	/* Before pinning pages, toggling memory attributes should be fine. */
+	vm_mem_set_private(vm, GUEST_MEMFD_PIN_TEST_GPA, page_size);
+	vm_mem_set_shared(vm, GUEST_MEMFD_PIN_TEST_GPA, page_size);
+
+	pin_pages(mem, page_size);
+
+	/*
+	 * Pinning also faults pages in, so remove these pages from userspace
+	 * page tables to properly test that pinning blocks setting memory
+	 * attributes to private.
+	 */
+	TEST_ASSERT_EQ(madvise(mem, page_size, MADV_DONTNEED), 0);
+
+	/* Should fail since the page is still pinned. */
+	TEST_ASSERT_EQ(__vm_set_memory_attributes(vm, GUEST_MEMFD_PIN_TEST_GPA,
+						  page_size,
+						  KVM_MEMORY_ATTRIBUTE_PRIVATE),
+		       -1);
+	TEST_ASSERT_EQ(errno, EINVAL);
+
+	unpin_pages();
+
+	/* With the pages unpinned, KVM can set this page to private. */
+	vm_mem_set_private(vm, GUEST_MEMFD_PIN_TEST_GPA, page_size);
+
+	kvm_vm_free(vm);
+	close(fd);
+}
+
+int main(int argc, char *argv[])
+{
+	gup_test_fd = open("/sys/kernel/debug/gup_test", O_RDWR);
+	/*
+	 * This test depends on CONFIG_GUP_TEST to provide a kernel module that
+	 * exposes pin_user_pages() to userspace.
+	 */
+	TEST_REQUIRE(gup_test_fd != -1);
+	TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
+
+	run_test();
+
+	return 0;
+}
-- 
2.46.0.598.g6f2099f65c-goog



* [RFC PATCH 36/39] KVM: selftests: Refactor vm_mem_add to be more flexible
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (34 preceding siblings ...)
  2024-09-10 23:44 ` [RFC PATCH 35/39] KVM: selftests: Test that pinned pages block KVM from setting memory attributes to PRIVATE Ackerley Tng
@ 2024-09-10 23:44 ` Ackerley Tng
  2024-09-10 23:44 ` [RFC PATCH 37/39] KVM: selftests: Add helper to perform madvise by memslots Ackerley Tng
                   ` (5 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:44 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

enum vm_mem_backing_src_type is encoding too many different
possibilities on different axes of (1) whether to mmap from an fd, (2)
granularity of mapping for THP, (3) size of hugetlb mapping, and has
yet to be extended to support guest_memfd.

When guest_memfd supports mmap() and we also want to support testing
with mmap()ing from guest_memfd, the number of combinations makes
enumeration in vm_mem_backing_src_type difficult.
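
As an illustration of deriving the mapping granularity from mmap() flags
instead of from an enum value, here is a minimal standalone sketch in the
spirit of compute_page_size() in this patch. hugetlb_size_from_flags() and
its default_hugetlb_size parameter are hypothetical names. The fallback
values for MAP_HUGE_SHIFT and MAP_HUGE_MASK are the Linux UAPI values and
are only used if the libc headers do not provide them.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <sys/mman.h>

#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_MASK
#define MAP_HUGE_MASK 0x3f
#endif

/*
 * Hypothetical helper: return the HugeTLB page size encoded in
 * @mmap_flags, @default_hugetlb_size if MAP_HUGETLB is set without an
 * explicit size, or 0 if the mapping is not a HugeTLB mapping at all.
 */
size_t hugetlb_size_from_flags(int mmap_flags, size_t default_hugetlb_size)
{
	unsigned int log2_size;

	if (!(mmap_flags & MAP_HUGETLB))
		return 0;

	/* The size is encoded as log2(bytes) in the bits above MAP_HUGE_SHIFT. */
	log2_size = ((unsigned int)mmap_flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK;
	if (!log2_size)
		return default_hugetlb_size;

	/* Guard against shifting past the width of size_t. */
	assert(log2_size < sizeof(size_t) * 8);

	return (size_t)1 << log2_size;
}
```

For example, MAP_HUGETLB | (30 << MAP_HUGE_SHIFT) decodes to 1G, and
MAP_HUGETLB alone falls back to whatever default the caller passes in.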

This refactor separates out vm_mem_backing_src_type from
userspace_mem_region. For now, vm_mem_backing_src_type remains a
possible way for tests to specify, on the command line, the
combination of backing memory to test.

vm_mem_add() is now the last place where vm_mem_backing_src_type is
interpreted, to

1. Check validity of requested guest_paddr
2. Align mmap_size appropriately based on the mapping's page_size and
   architecture
3. Install memory appropriately according to mapping's page size
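
Steps 2 and 3 can be sketched outside the selftest framework as follows.
This is a simplified illustration under stated assumptions, not the code
in this patch: struct aligned_mapping, align_ptr_up_sketch() and
map_aligned() are hypothetical names that only mirror the mmap_start,
mmap_size and host_mem fields of struct userspace_mem_region, and the
alignment is assumed to be a power of two.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Hypothetical stand-in for the relevant userspace_mem_region fields. */
struct aligned_mapping {
	void *mmap_start;	/* address returned by mmap() */
	size_t mmap_size;	/* length passed to mmap(), including padding */
	void *host_mem;		/* mmap_start rounded up to the alignment */
};

/* Round @ptr up to the next multiple of @alignment (a power of two). */
static void *align_ptr_up_sketch(void *ptr, size_t alignment)
{
	uintptr_t addr = (uintptr_t)ptr;

	return (void *)((addr + alignment - 1) & ~((uintptr_t)alignment - 1));
}

/*
 * Map enough anonymous memory that @memslot_size bytes are still available
 * after rounding the start address up to @alignment. Returns 0 on success,
 * -1 if mmap() fails.
 */
int map_aligned(struct aligned_mapping *m, size_t memslot_size,
		size_t alignment)
{
	assert(alignment && !(alignment & (alignment - 1)));

	/* Pad the mapping only when alignment can move the start address. */
	m->mmap_size = memslot_size + (alignment > 1 ? alignment : 0);
	m->mmap_start = mmap(NULL, m->mmap_size, PROT_READ | PROT_WRITE,
			     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (m->mmap_start == MAP_FAILED)
		return -1;

	m->host_mem = align_ptr_up_sketch(m->mmap_start, alignment);
	return 0;
}
```

With a 1M alignment (the s390x case) host_mem may lie up to 1M - 1 bytes
past mmap_start, which is why mmap_size can be larger than the memslot
size.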

mmap()ing an alias seems to be specific to userfaultfd tests and could
be refactored out of struct userspace_mem_region and localized in
userfaultfd tests in future.

This paves the way for replacing vm_mem_backing_src_type with multiple
command line flags that would specify backing memory more
flexibly. Future tests are expected to use vm_mem_region_alloc() to
allocate a struct userspace_mem_region, then use more fundamental
functions like vm_mem_region_mmap(), vm_mem_region_madvise_thp(),
kvm_create_memfd(), vm_create_guest_memfd(), and other functions in
vm_mem_add() to flexibly build up struct userspace_mem_region before
finally adding the region to the vm with vm_mem_region_add().

Signed-off-by: Ackerley Tng <ackerleytng@google.com>

---
 .../testing/selftests/kvm/include/kvm_util.h  |  29 +-
 .../testing/selftests/kvm/include/test_util.h |   2 +
 tools/testing/selftests/kvm/lib/kvm_util.c    | 413 +++++++++++-------
 tools/testing/selftests/kvm/lib/test_util.c   |  25 ++
 4 files changed, 319 insertions(+), 150 deletions(-)

diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index d336cd0c8f19..1576e7e4aefe 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -35,11 +35,26 @@ struct userspace_mem_region {
 	struct sparsebit *protected_phy_pages;
 	int fd;
 	off_t offset;
-	enum vm_mem_backing_src_type backing_src_type;
+	/*
+	 * host_mem is mmap_start aligned upwards to an address suitable for the
+	 * architecture. In most cases, host_mem and mmap_start are the same,
+	 * except for s390x, where the host address must be aligned to 1M (due
+	 * to PGSTEs).
+	 */
+#ifdef __s390x__
+#define S390X_HOST_ADDRESS_ALIGNMENT 0x100000
+#endif
 	void *host_mem;
+	/* host_alias is to mmap_alias as host_mem is to mmap_start */
 	void *host_alias;
 	void *mmap_start;
 	void *mmap_alias;
+	/*
+	 * mmap_size is possibly larger than region.memory_size because in some
+	 * cases, host_mem has to be adjusted upwards (see comment for host_mem
+	 * above). In those cases, mmap_size has to be adjusted upwards so that
+	 * enough memory is available in this memslot.
+	 */
 	size_t mmap_size;
 	struct rb_node gpa_node;
 	struct rb_node hva_node;
@@ -559,6 +574,18 @@ int __vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t flag
 				 uint64_t gpa, uint64_t size, void *hva,
 				 uint32_t guest_memfd, uint64_t guest_memfd_offset);
 
+struct userspace_mem_region *vm_mem_region_alloc(struct kvm_vm *vm);
+void *vm_mem_region_mmap(struct userspace_mem_region *region, size_t length,
+			 int flags, int fd, off_t offset);
+void vm_mem_region_install_memory(struct userspace_mem_region *region,
+				  size_t memslot_size, size_t alignment);
+void vm_mem_region_madvise_thp(struct userspace_mem_region *region, int advice);
+int vm_mem_region_install_guest_memfd(struct userspace_mem_region *region,
+				      int guest_memfd);
+void *vm_mem_region_mmap_alias(struct userspace_mem_region *region, int flags,
+			       size_t alignment);
+void vm_mem_region_add(struct kvm_vm *vm, struct userspace_mem_region *region);
+
 void vm_userspace_mem_region_add(struct kvm_vm *vm,
 	enum vm_mem_backing_src_type src_type,
 	uint64_t guest_paddr, uint32_t slot, uint64_t npages,
diff --git a/tools/testing/selftests/kvm/include/test_util.h b/tools/testing/selftests/kvm/include/test_util.h
index 011e757d4e2c..983adeb54c0e 100644
--- a/tools/testing/selftests/kvm/include/test_util.h
+++ b/tools/testing/selftests/kvm/include/test_util.h
@@ -159,6 +159,8 @@ size_t get_trans_hugepagesz(void);
 size_t get_def_hugetlb_pagesz(void);
 const struct vm_mem_backing_src_alias *vm_mem_backing_src_alias(uint32_t i);
 size_t get_backing_src_pagesz(uint32_t i);
+int backing_src_should_madvise(uint32_t i);
+int get_backing_src_madvise_advice(uint32_t i);
 bool is_backing_src_hugetlb(uint32_t i);
 void backing_src_help(const char *flag);
 enum vm_mem_backing_src_type parse_backing_src_type(const char *type_name);
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 56b170b725b3..9bdd03a5da90 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -774,15 +774,12 @@ void kvm_vm_free(struct kvm_vm *vmp)
 	free(vmp);
 }
 
-int kvm_memfd_alloc(size_t size, bool hugepages)
+int kvm_create_memfd(size_t size, unsigned int flags)
 {
-	int memfd_flags = MFD_CLOEXEC;
-	int fd, r;
-
-	if (hugepages)
-		memfd_flags |= MFD_HUGETLB;
+	int fd;
+	int r;
 
-	fd = memfd_create("kvm_selftest", memfd_flags);
+	fd = memfd_create("kvm_selftest", flags);
 	TEST_ASSERT(fd != -1, __KVM_SYSCALL_ERROR("memfd_create()", fd));
 
 	r = ftruncate(fd, size);
@@ -794,6 +791,16 @@ int kvm_memfd_alloc(size_t size, bool hugepages)
 	return fd;
 }
 
+int kvm_memfd_alloc(size_t size, bool hugepages)
+{
+	int memfd_flags = MFD_CLOEXEC;
+
+	if (hugepages)
+		memfd_flags |= MFD_HUGETLB;
+
+	return kvm_create_memfd(size, memfd_flags);
+}
+
 /*
  * Memory Compare, host virtual to guest virtual
  *
@@ -973,185 +980,293 @@ void vm_set_user_memory_region2(struct kvm_vm *vm, uint32_t slot, uint32_t flags
 		    errno, strerror(errno));
 }
 
+/**
+ * Allocates and returns a struct userspace_mem_region.
+ */
+struct userspace_mem_region *vm_mem_region_alloc(struct kvm_vm *vm)
+{
+	struct userspace_mem_region *region;
 
-/* FIXME: This thing needs to be ripped apart and rewritten. */
-void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
-		uint64_t guest_paddr, uint32_t slot, uint64_t npages,
-		uint32_t flags, int guest_memfd, uint64_t guest_memfd_offset)
+	/* Allocate and initialize new mem region structure. */
+	region = calloc(1, sizeof(*region));
+	TEST_ASSERT(region != NULL, "Insufficient Memory");
+
+	region->unused_phy_pages = sparsebit_alloc();
+	if (vm_arch_has_protected_memory(vm))
+		region->protected_phy_pages = sparsebit_alloc();
+
+	region->fd = -1;
+	region->region.guest_memfd = -1;
+
+	return region;
+}
+
+static size_t compute_page_size(int mmap_flags, int madvise_advice)
+{
+	if (mmap_flags & MAP_HUGETLB) {
+		int size_flags = (mmap_flags >> MAP_HUGE_SHIFT) & MAP_HUGE_MASK;
+		if (!size_flags)
+			return get_def_hugetlb_pagesz();
+
+		return 1ULL << size_flags;
+	}
+
+	return madvise_advice == MADV_HUGEPAGE ? get_trans_hugepagesz() : getpagesize();
+}
+
+/**
+ * Calls mmap() with @length, @flags, @fd, @offset for @region.
+ *
+ * Think of this as the struct userspace_mem_region wrapper for the mmap()
+ * syscall.
+ */
+void *vm_mem_region_mmap(struct userspace_mem_region *region, size_t length,
+			 int flags, int fd, off_t offset)
+{
+	void *mem;
+
+	if (flags & MAP_SHARED) {
+		TEST_ASSERT(fd != -1,
+			    "Ensure that fd is provided for shared mappings.");
+		TEST_ASSERT(
+			region->fd == fd || region->region.guest_memfd == fd,
+			"Ensure that fd is opened before mmap, and is either "
+			"set up in region->fd or region->region.guest_memfd.");
+	}
+
+	mem = mmap(NULL, length, PROT_READ | PROT_WRITE, flags, fd, offset);
+	TEST_ASSERT(mem != MAP_FAILED, "mmap() failed");
+
+	region->mmap_start = mem;
+	region->mmap_size = length;
+	region->offset = offset;
+
+	return mem;
+}
+
+/**
+ * Installs mmap()ed memory in @region->mmap_start as @region->host_mem,
+ * checking constraints.
+ */
+void vm_mem_region_install_memory(struct userspace_mem_region *region,
+				  size_t memslot_size, size_t alignment)
+{
+	TEST_ASSERT(region->mmap_size >= memslot_size,
+		    "mmap()ed memory insufficient for memslot");
+
+	region->host_mem = align_ptr_up(region->mmap_start, alignment);
+	region->region.userspace_addr = (uint64_t)region->host_mem;
+	region->region.memory_size = memslot_size;
+}
+
+
+/**
+ * Calls madvise with @advice for @region.
+ *
+ * Think of this as the struct userspace_mem_region wrapper for the madvise()
+ * syscall.
+ */
+void vm_mem_region_madvise_thp(struct userspace_mem_region *region, int advice)
 {
 	int ret;
-	struct userspace_mem_region *region;
-	size_t backing_src_pagesz = get_backing_src_pagesz(src_type);
-	size_t mem_size = npages * vm->page_size;
-	size_t alignment;
 
-	TEST_REQUIRE_SET_USER_MEMORY_REGION2();
+	TEST_ASSERT(
+		region->host_mem && region->mmap_size,
+		"vm_mem_region_madvise_thp() must be called after vm_mem_region_mmap()");
 
-	TEST_ASSERT(vm_adjust_num_guest_pages(vm->mode, npages) == npages,
-		"Number of guest pages is not compatible with the host. "
-		"Try npages=%d", vm_adjust_num_guest_pages(vm->mode, npages));
-
-	TEST_ASSERT((guest_paddr % vm->page_size) == 0, "Guest physical "
-		"address not on a page boundary.\n"
-		"  guest_paddr: 0x%lx vm->page_size: 0x%x",
-		guest_paddr, vm->page_size);
-	TEST_ASSERT((((guest_paddr >> vm->page_shift) + npages) - 1)
-		<= vm->max_gfn, "Physical range beyond maximum "
-		"supported physical address,\n"
-		"  guest_paddr: 0x%lx npages: 0x%lx\n"
-		"  vm->max_gfn: 0x%lx vm->page_size: 0x%x",
-		guest_paddr, npages, vm->max_gfn, vm->page_size);
+	ret = madvise(region->host_mem, region->mmap_size, advice);
+	TEST_ASSERT(ret == 0, "madvise failed, addr: %p length: 0x%lx",
+		    region->host_mem, region->mmap_size);
+}
+
+/**
+ * Installs guest_memfd by setting it up in @region.
+ *
+ * Returns the guest_memfd that was installed in the @region.
+ */
+int vm_mem_region_install_guest_memfd(struct userspace_mem_region *region,
+				      int guest_memfd)
+{
+	/*
+	 * Install a unique fd for each memslot so that the fd can be closed
+	 * when the region is deleted without needing to track if the fd is
+	 * owned by the framework or by the caller.
+	 */
+	guest_memfd = dup(guest_memfd);
+	TEST_ASSERT(guest_memfd >= 0, __KVM_SYSCALL_ERROR("dup()", guest_memfd));
+	region->region.guest_memfd = guest_memfd;
+
+	return guest_memfd;
+}
+
+/**
+ * Calls mmap() to create an alias for mmap()ed memory at region->host_mem,
+ * exactly the same size that was mmap()ed.
+ *
+ * This is used mainly for userfaultfd tests.
+ */
+void *vm_mem_region_mmap_alias(struct userspace_mem_region *region, int flags,
+			       size_t alignment)
+{
+	region->mmap_alias = mmap(NULL, region->mmap_size,
+				  PROT_READ | PROT_WRITE, flags, region->fd, 0);
+	TEST_ASSERT(region->mmap_alias != MAP_FAILED,
+		    __KVM_SYSCALL_ERROR("mmap()",  (int)(unsigned long)MAP_FAILED));
+
+	region->host_alias = align_ptr_up(region->mmap_alias, alignment);
+
+	return region->host_alias;
+}
+
+static void vm_mem_region_assert_no_duplicate(struct kvm_vm *vm, uint32_t slot,
+					      uint64_t gpa, size_t size)
+{
+	struct userspace_mem_region *region;
 
 	/*
 	 * Confirm a mem region with an overlapping address doesn't
 	 * already exist.
 	 */
-	region = (struct userspace_mem_region *) userspace_mem_region_find(
-		vm, guest_paddr, (guest_paddr + npages * vm->page_size) - 1);
-	if (region != NULL)
-		TEST_FAIL("overlapping userspace_mem_region already "
-			"exists\n"
-			"  requested guest_paddr: 0x%lx npages: 0x%lx "
-			"page_size: 0x%x\n"
-			"  existing guest_paddr: 0x%lx size: 0x%lx",
-			guest_paddr, npages, vm->page_size,
-			(uint64_t) region->region.guest_phys_addr,
-			(uint64_t) region->region.memory_size);
+	region = userspace_mem_region_find(vm, gpa, gpa + size - 1);
+	if (region != NULL) {
+		TEST_FAIL("overlapping userspace_mem_region already exists\n"
+			  "  requested gpa: 0x%lx size: 0x%lx\n"
+			  "  existing gpa: 0x%lx size: 0x%lx",
+			  gpa, size,
+			  (uint64_t) region->region.guest_phys_addr,
+			  (uint64_t) region->region.memory_size);
+	}
 
 	/* Confirm no region with the requested slot already exists. */
-	hash_for_each_possible(vm->regions.slot_hash, region, slot_node,
-			       slot) {
+	hash_for_each_possible(vm->regions.slot_hash, region, slot_node, slot) {
 		if (region->region.slot != slot)
 			continue;
 
-		TEST_FAIL("A mem region with the requested slot "
-			"already exists.\n"
-			"  requested slot: %u paddr: 0x%lx npages: 0x%lx\n"
-			"  existing slot: %u paddr: 0x%lx size: 0x%lx",
-			slot, guest_paddr, npages,
-			region->region.slot,
-			(uint64_t) region->region.guest_phys_addr,
-			(uint64_t) region->region.memory_size);
+		TEST_FAIL("A mem region with the requested slot already exists.\n"
+			  "  requested slot: %u paddr: 0x%lx size: 0x%lx\n"
+			  "  existing slot: %u paddr: 0x%lx size: 0x%lx",
+			  slot, gpa, size,
+			  region->region.slot,
+			  (uint64_t) region->region.guest_phys_addr,
+			  (uint64_t) region->region.memory_size);
 	}
+}
 
-	/* Allocate and initialize new mem region structure. */
-	region = calloc(1, sizeof(*region));
-	TEST_ASSERT(region != NULL, "Insufficient Memory");
-	region->mmap_size = mem_size;
+/**
+ * Add a @region to @vm. All necessary fields in region->region should already
+ * be populated.
+ *
+ * Think of this as the struct userspace_mem_region wrapper for the
+ * KVM_SET_USER_MEMORY_REGION2 ioctl.
+ */
+void vm_mem_region_add(struct kvm_vm *vm, struct userspace_mem_region *region)
+{
+	uint64_t npages;
+	uint64_t gpa;
+	int ret;
 
-#ifdef __s390x__
-	/* On s390x, the host address must be aligned to 1M (due to PGSTEs) */
-	alignment = 0x100000;
-#else
-	alignment = 1;
-#endif
+	TEST_REQUIRE_SET_USER_MEMORY_REGION2();
 
-	/*
-	 * When using THP mmap is not guaranteed to returned a hugepage aligned
-	 * address so we have to pad the mmap. Padding is not needed for HugeTLB
-	 * because mmap will always return an address aligned to the HugeTLB
-	 * page size.
-	 */
-	if (src_type == VM_MEM_SRC_ANONYMOUS_THP)
-		alignment = max(backing_src_pagesz, alignment);
+	npages = region->region.memory_size / vm->page_size;
+	TEST_ASSERT(vm_adjust_num_guest_pages(vm->mode, npages) == npages,
+		    "Number of guest pages is not compatible with the host. "
+		    "Try npages=%d", vm_adjust_num_guest_pages(vm->mode, npages));
+
+	gpa = region->region.guest_phys_addr;
+	TEST_ASSERT((gpa % vm->page_size) == 0,
+		    "Guest physical address not on a page boundary.\n"
+		    "  gpa: 0x%lx vm->page_size: 0x%x",
+		    gpa, vm->page_size);
+	TEST_ASSERT((((gpa >> vm->page_shift) + npages) - 1) <= vm->max_gfn,
+		    "Physical range beyond maximum supported physical address,\n"
+		    "  gpa: 0x%lx npages: 0x%lx\n"
+		    "  vm->max_gfn: 0x%lx vm->page_size: 0x%x",
+		    gpa, npages, vm->max_gfn, vm->page_size);
+
+	vm_mem_region_assert_no_duplicate(vm, region->region.slot, gpa,
+					  region->mmap_size);
 
-	TEST_ASSERT_EQ(guest_paddr, align_up(guest_paddr, backing_src_pagesz));
+	ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);
+	TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL failed,\n"
+		    "  rc: %i errno: %i\n"
+		    "  slot: %u flags: 0x%x\n"
+		    "  guest_phys_addr: 0x%lx size: 0x%llx guest_memfd: %d",
+		    ret, errno, region->region.slot, region->region.flags,
+		    gpa, region->region.memory_size,
+		    region->region.guest_memfd);
 
-	/* Add enough memory to align up if necessary */
-	if (alignment > 1)
-		region->mmap_size += alignment;
+	sparsebit_set_num(region->unused_phy_pages, gpa >> vm->page_shift, npages);
 
-	region->fd = -1;
-	if (backing_src_is_shared(src_type))
-		region->fd = kvm_memfd_alloc(region->mmap_size,
-					     src_type == VM_MEM_SRC_SHARED_HUGETLB);
-
-	region->mmap_start = mmap(NULL, region->mmap_size,
-				  PROT_READ | PROT_WRITE,
-				  vm_mem_backing_src_alias(src_type)->flag,
-				  region->fd, 0);
-	TEST_ASSERT(region->mmap_start != MAP_FAILED,
-		    __KVM_SYSCALL_ERROR("mmap()", (int)(unsigned long)MAP_FAILED));
+	/* Add to quick lookup data structures */
+	vm_userspace_mem_region_gpa_insert(&vm->regions.gpa_tree, region);
+	vm_userspace_mem_region_hva_insert(&vm->regions.hva_tree, region);
+	hash_add(vm->regions.slot_hash, &region->slot_node, region->region.slot);
+}
 
-	TEST_ASSERT(!is_backing_src_hugetlb(src_type) ||
-		    region->mmap_start == align_ptr_up(region->mmap_start, backing_src_pagesz),
-		    "mmap_start %p is not aligned to HugeTLB page size 0x%lx",
-		    region->mmap_start, backing_src_pagesz);
+void vm_mem_add(struct kvm_vm *vm, enum vm_mem_backing_src_type src_type,
+		uint64_t guest_paddr, uint32_t slot, uint64_t npages,
+		uint32_t flags, int guest_memfd, uint64_t guest_memfd_offset)
+{
+	struct userspace_mem_region *region;
+	size_t mapping_page_size;
+	size_t memslot_size;
+	int madvise_advice;
+	size_t mmap_size;
+	size_t alignment;
+	int mmap_flags;
+	int memfd;
 
-	/* Align host address */
-	region->host_mem = align_ptr_up(region->mmap_start, alignment);
+	memslot_size = npages * vm->page_size;
+
+	mmap_flags = vm_mem_backing_src_alias(src_type)->flag;
+	madvise_advice = get_backing_src_madvise_advice(src_type);
+	mapping_page_size = compute_page_size(mmap_flags, madvise_advice);
+
+	TEST_ASSERT_EQ(guest_paddr, align_up(guest_paddr, mapping_page_size));
+
+	alignment = mapping_page_size;
+#ifdef __s390x__
+	alignment = max(alignment, S390X_HOST_ADDRESS_ALIGNMENT);
+#endif
 
-	/* As needed perform madvise */
-	if ((src_type == VM_MEM_SRC_ANONYMOUS ||
-	     src_type == VM_MEM_SRC_ANONYMOUS_THP) && thp_configured()) {
-		ret = madvise(region->host_mem, mem_size,
-			      src_type == VM_MEM_SRC_ANONYMOUS ? MADV_NOHUGEPAGE : MADV_HUGEPAGE);
-		TEST_ASSERT(ret == 0, "madvise failed, addr: %p length: 0x%lx src_type: %s",
-			    region->host_mem, mem_size,
-			    vm_mem_backing_src_alias(src_type)->name);
+	region = vm_mem_region_alloc(vm);
+
+	memfd = -1;
+	if (backing_src_is_shared(src_type)) {
+		unsigned int memfd_flags = MFD_CLOEXEC;
+		if (src_type == VM_MEM_SRC_SHARED_HUGETLB)
+			memfd_flags |= MFD_HUGETLB;
+
+		memfd = kvm_create_memfd(memslot_size, memfd_flags);
 	}
+	region->fd = memfd;
+
+	mmap_size = align_up(memslot_size, alignment);
+	vm_mem_region_mmap(region, mmap_size, mmap_flags, memfd, 0);
+	vm_mem_region_install_memory(region, memslot_size, alignment);
 
-	region->backing_src_type = src_type;
+	if (backing_src_should_madvise(src_type))
+		vm_mem_region_madvise_thp(region, madvise_advice);
+
+	if (backing_src_is_shared(src_type))
+		vm_mem_region_mmap_alias(region, mmap_flags, alignment);
 
 	if (flags & KVM_MEM_GUEST_MEMFD) {
 		if (guest_memfd < 0) {
-			uint32_t guest_memfd_flags = 0;
-			TEST_ASSERT(!guest_memfd_offset,
-				    "Offset must be zero when creating new guest_memfd");
-			guest_memfd = vm_create_guest_memfd(vm, mem_size, guest_memfd_flags);
-		} else {
-			/*
-			 * Install a unique fd for each memslot so that the fd
-			 * can be closed when the region is deleted without
-			 * needing to track if the fd is owned by the framework
-			 * or by the caller.
-			 */
-			guest_memfd = dup(guest_memfd);
-			TEST_ASSERT(guest_memfd >= 0, __KVM_SYSCALL_ERROR("dup()", guest_memfd));
+			TEST_ASSERT(
+				guest_memfd_offset == 0,
+				"Offset must be zero when creating new guest_memfd");
+			guest_memfd = vm_create_guest_memfd(vm, memslot_size, 0);
 		}
 
-		region->region.guest_memfd = guest_memfd;
-		region->region.guest_memfd_offset = guest_memfd_offset;
-	} else {
-		region->region.guest_memfd = -1;
+		vm_mem_region_install_guest_memfd(region, guest_memfd);
 	}
 
-	region->unused_phy_pages = sparsebit_alloc();
-	if (vm_arch_has_protected_memory(vm))
-		region->protected_phy_pages = sparsebit_alloc();
-	sparsebit_set_num(region->unused_phy_pages,
-		guest_paddr >> vm->page_shift, npages);
 	region->region.slot = slot;
 	region->region.flags = flags;
 	region->region.guest_phys_addr = guest_paddr;
-	region->region.memory_size = npages * vm->page_size;
-	region->region.userspace_addr = (uintptr_t) region->host_mem;
-	ret = __vm_ioctl(vm, KVM_SET_USER_MEMORY_REGION2, &region->region);
-	TEST_ASSERT(ret == 0, "KVM_SET_USER_MEMORY_REGION2 IOCTL failed,\n"
-		"  rc: %i errno: %i\n"
-		"  slot: %u flags: 0x%x\n"
-		"  guest_phys_addr: 0x%lx size: 0x%lx guest_memfd: %d",
-		ret, errno, slot, flags,
-		guest_paddr, (uint64_t) region->region.memory_size,
-		region->region.guest_memfd);
-
-	/* Add to quick lookup data structures */
-	vm_userspace_mem_region_gpa_insert(&vm->regions.gpa_tree, region);
-	vm_userspace_mem_region_hva_insert(&vm->regions.hva_tree, region);
-	hash_add(vm->regions.slot_hash, &region->slot_node, slot);
-
-	/* If shared memory, create an alias. */
-	if (region->fd >= 0) {
-		region->mmap_alias = mmap(NULL, region->mmap_size,
-					  PROT_READ | PROT_WRITE,
-					  vm_mem_backing_src_alias(src_type)->flag,
-					  region->fd, 0);
-		TEST_ASSERT(region->mmap_alias != MAP_FAILED,
-			    __KVM_SYSCALL_ERROR("mmap()",  (int)(unsigned long)MAP_FAILED));
-
-		/* Align host alias address */
-		region->host_alias = align_ptr_up(region->mmap_alias, alignment);
-	}
+	region->region.guest_memfd_offset = guest_memfd_offset;
+	vm_mem_region_add(vm, region);
 }
 
 void vm_userspace_mem_region_add(struct kvm_vm *vm,
diff --git a/tools/testing/selftests/kvm/lib/test_util.c b/tools/testing/selftests/kvm/lib/test_util.c
index d0a9b5ee0c01..cbcc1e7ad578 100644
--- a/tools/testing/selftests/kvm/lib/test_util.c
+++ b/tools/testing/selftests/kvm/lib/test_util.c
@@ -351,6 +351,31 @@ size_t get_private_mem_backing_src_pagesz(uint32_t i)
 	}
 }
 
+int backing_src_should_madvise(uint32_t i)
+{
+	switch (i) {
+	case VM_MEM_SRC_ANONYMOUS:
+	case VM_MEM_SRC_SHMEM:
+	case VM_MEM_SRC_ANONYMOUS_THP:
+		return true;
+	default:
+		return false;
+	}
+}
+
+int get_backing_src_madvise_advice(uint32_t i)
+{
+	switch (i) {
+	case VM_MEM_SRC_ANONYMOUS:
+	case VM_MEM_SRC_SHMEM:
+		return MADV_NOHUGEPAGE;
+	case VM_MEM_SRC_ANONYMOUS_THP:
+		return MADV_HUGEPAGE;
+	default:
+		return 0;
+	}
+}
+
 bool is_backing_src_hugetlb(uint32_t i)
 {
 	return !!(vm_mem_backing_src_alias(i)->flag & MAP_HUGETLB);
-- 
2.46.0.598.g6f2099f65c-goog



* [RFC PATCH 37/39] KVM: selftests: Add helper to perform madvise by memslots
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (35 preceding siblings ...)
  2024-09-10 23:44 ` [RFC PATCH 36/39] KVM: selftests: Refactor vm_mem_add to be more flexible Ackerley Tng
@ 2024-09-10 23:44 ` Ackerley Tng
  2024-09-10 23:44 ` [RFC PATCH 38/39] KVM: selftests: Update private_mem_conversions_test for mmap()able guest_memfd Ackerley Tng
                   ` (4 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:44 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

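A contiguous GPA range may span multiple memslots, so it may not be
contiguous in HVA.

Add vm_guest_mem_madvise(), which takes a GPA range and calls madvise()
on it in blocks, one per memslot that the range covers.

While at it, parenthesize the arguments of max_t() and min_t() so that
the cast applies to the whole argument expression rather than only to
its first operand.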
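
To illustrate the chunking, here is a minimal standalone sketch. It is
not part of this patch and does not use the selftest framework: struct
example_slot, struct example_chunk and the example_*() functions are
hypothetical names, and the slot layout is a simplification of what
struct userspace_mem_region tracks. Each chunk it produces is a range
that could be handed to a single madvise() call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified memslot: one GPA range backed by one HVA range. */
struct example_slot {
	uint64_t gpa;	/* guest physical base address */
	uint64_t size;	/* size in bytes */
	uint64_t hva;	/* host virtual base address, kept as an integer */
};

/* One host-contiguous piece of the requested GPA range. */
struct example_chunk {
	uint64_t hva;
	uint64_t len;
};

/* Return the slot containing @gpa, or NULL if no slot covers it. */
static const struct example_slot *
example_find_slot(const struct example_slot *slots, size_t nr_slots,
		  uint64_t gpa)
{
	for (size_t i = 0; i < nr_slots; i++) {
		if (gpa >= slots[i].gpa && gpa - slots[i].gpa < slots[i].size)
			return &slots[i];
	}

	return NULL;
}

/*
 * Split [gpa_start, gpa_start + size) into chunks that never cross a slot
 * boundary. Returns the number of chunks written to @out, or -1 if part of
 * the range is not covered by any slot or @out is too small.
 */
static int example_split_by_slot(const struct example_slot *slots,
				 size_t nr_slots, uint64_t gpa_start,
				 uint64_t size, struct example_chunk *out,
				 size_t max_chunks)
{
	uint64_t gpa_end = gpa_start + size;
	uint64_t gpa = gpa_start;
	size_t n = 0;

	while (gpa < gpa_end) {
		const struct example_slot *slot;
		uint64_t offset, slot_remaining, range_remaining, len;

		slot = example_find_slot(slots, nr_slots, gpa);
		if (!slot || n == max_chunks)
			return -1;

		offset = gpa - slot->gpa;
		slot_remaining = slot->size - offset;
		range_remaining = gpa_end - gpa;
		len = slot_remaining < range_remaining ? slot_remaining :
							 range_remaining;
		assert(len > 0);	/* guarantees forward progress */

		out[n].hva = slot->hva + offset;
		out[n].len = len;
		n++;

		gpa += len;
	}

	return (int)n;
}
```

With two adjacent slots of 0x2000 bytes each whose HVAs are unrelated, a
request for 0x2000 bytes starting 0x1000 into the first slot yields two
chunks of 0x1000 bytes, one at the tail of the first slot's HVA range
and one at the head of the second's.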

Signed-off-by: Ackerley Tng <ackerleytng@google.com>

---
 tools/include/linux/kernel.h                  |  4 +--
 .../testing/selftests/kvm/include/kvm_util.h  |  2 ++
 tools/testing/selftests/kvm/lib/kvm_util.c    | 30 +++++++++++++++++++
 3 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/tools/include/linux/kernel.h b/tools/include/linux/kernel.h
index 07cfad817d53..5454cd3272ed 100644
--- a/tools/include/linux/kernel.h
+++ b/tools/include/linux/kernel.h
@@ -54,8 +54,8 @@
 	_min1 < _min2 ? _min1 : _min2; })
 #endif
 
-#define max_t(type, x, y)	max((type)x, (type)y)
-#define min_t(type, x, y)	min((type)x, (type)y)
+#define max_t(type, x, y)	max((type)(x), (type)(y))
+#define min_t(type, x, y)	min((type)(x), (type)(y))
 #define clamp(val, lo, hi)	min((typeof(val))max(val, lo), hi)
 
 #ifndef BUG_ON
diff --git a/tools/testing/selftests/kvm/include/kvm_util.h b/tools/testing/selftests/kvm/include/kvm_util.h
index 1576e7e4aefe..58b516c23574 100644
--- a/tools/testing/selftests/kvm/include/kvm_util.h
+++ b/tools/testing/selftests/kvm/include/kvm_util.h
@@ -433,6 +433,8 @@ static inline void vm_mem_set_shared(struct kvm_vm *vm, uint64_t gpa,
 void vm_guest_mem_fallocate(struct kvm_vm *vm, uint64_t gpa, uint64_t size,
 			    bool punch_hole);
 
+void vm_guest_mem_madvise(struct kvm_vm *vm, vm_paddr_t gpa_start, uint64_t size,
+			  int advice);
 static inline void vm_guest_mem_punch_hole(struct kvm_vm *vm, uint64_t gpa,
 					   uint64_t size)
 {
diff --git a/tools/testing/selftests/kvm/lib/kvm_util.c b/tools/testing/selftests/kvm/lib/kvm_util.c
index 9bdd03a5da90..21ea6616124c 100644
--- a/tools/testing/selftests/kvm/lib/kvm_util.c
+++ b/tools/testing/selftests/kvm/lib/kvm_util.c
@@ -1416,6 +1416,36 @@ void vm_guest_mem_fallocate(struct kvm_vm *vm, uint64_t base, uint64_t size,
 	}
 }
 
+void vm_guest_mem_madvise(struct kvm_vm *vm, vm_paddr_t gpa_start, uint64_t size,
+			  int advice)
+{
+	size_t madvise_len;
+	vm_paddr_t gpa_end;
+	vm_paddr_t gpa;
+
+	gpa_end = gpa_start + size;
+	for (gpa = gpa_start; gpa < gpa_end; gpa += madvise_len) {
+		struct userspace_mem_region *region;
+		void *hva_start;
+		uint64_t memslot_end;
+		int ret;
+
+		region = userspace_mem_region_find(vm, gpa, gpa);
+		TEST_ASSERT(region, "Memory region not found for GPA 0x%lx", gpa);
+
+		hva_start = addr_gpa2hva(vm, gpa);
+		memslot_end = region->region.userspace_addr +
+			      region->region.memory_size;
+		madvise_len = min_t(size_t, memslot_end - (uint64_t)hva_start,
+				    gpa_end - gpa);
+
+		ret = madvise(hva_start, madvise_len, advice);
+		TEST_ASSERT(!ret, "madvise(addr=%p, len=%lx, advice=%x) failed\n",
+			    hva_start, madvise_len, advice);
+	}
+}
+
+
 /* Returns the size of a vCPU's kvm_run structure. */
 static int vcpu_mmap_sz(void)
 {
-- 
2.46.0.598.g6f2099f65c-goog



* [RFC PATCH 38/39] KVM: selftests: Update private_mem_conversions_test for mmap()able guest_memfd
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (36 preceding siblings ...)
  2024-09-10 23:44 ` [RFC PATCH 37/39] KVM: selftests: Add helper to perform madvise by memslots Ackerley Tng
@ 2024-09-10 23:44 ` Ackerley Tng
  2024-09-10 23:44 ` [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page Ackerley Tng
                   ` (3 subsequent siblings)
  41 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:44 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 .../kvm/x86_64/private_mem_conversions_test.c | 146 +++++++++++++++---
 .../x86_64/private_mem_conversions_test.sh    |   3 +
 2 files changed, 124 insertions(+), 25 deletions(-)

diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
index 71f480c19f92..6524ef398584 100644
--- a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
+++ b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.c
@@ -11,6 +11,8 @@
 #include <stdlib.h>
 #include <string.h>
 #include <sys/ioctl.h>
+#include <sys/mman.h>
+#include <sys/wait.h>
 
 #include <linux/compiler.h>
 #include <linux/kernel.h>
@@ -202,15 +204,19 @@ static void guest_test_explicit_conversion(uint64_t base_gpa, bool do_fallocate)
 		guest_sync_shared(gpa, size, p3, p4);
 		memcmp_g(gpa, p4, size);
 
-		/* Reset the shared memory back to the initial pattern. */
-		memset((void *)gpa, init_p, size);
-
 		/*
 		 * Free (via PUNCH_HOLE) *all* private memory so that the next
 		 * iteration starts from a clean slate, e.g. with respect to
 		 * whether or not there are pages/folios in guest_mem.
 		 */
 		guest_map_shared(base_gpa, PER_CPU_DATA_SIZE, true);
+
+		/*
+		 * Reset the entire block back to the initial pattern. Do this
+		 * after fallocate(PUNCH_HOLE) because hole-punching zeroes
+		 * memory.
+		 */
+		memset((void *)base_gpa, init_p, PER_CPU_DATA_SIZE);
 	}
 }
 
@@ -286,7 +292,8 @@ static void guest_code(uint64_t base_gpa)
 	GUEST_DONE();
 }
 
-static void handle_exit_hypercall(struct kvm_vcpu *vcpu)
+static void handle_exit_hypercall(struct kvm_vcpu *vcpu,
+				  bool back_shared_memory_with_guest_memfd)
 {
 	struct kvm_run *run = vcpu->run;
 	uint64_t gpa = run->hypercall.args[0];
@@ -303,17 +310,46 @@ static void handle_exit_hypercall(struct kvm_vcpu *vcpu)
 	if (do_fallocate)
 		vm_guest_mem_fallocate(vm, gpa, size, map_shared);
 
-	if (set_attributes)
+	if (set_attributes) {
+		if (back_shared_memory_with_guest_memfd && !map_shared)
+			vm_guest_mem_madvise(vm, gpa, size, MADV_DONTNEED);
 		vm_set_memory_attributes(vm, gpa, size,
 					 map_shared ? 0 : KVM_MEMORY_ATTRIBUTE_PRIVATE);
+	}
 	run->hypercall.ret = 0;
 }
 
+static void assert_not_faultable(uint8_t *address)
+{
+	pid_t child_pid;
+
+	child_pid = fork();
+	TEST_ASSERT(child_pid != -1, "fork failed");
+
+	if (child_pid == 0) {
+		*address = 'A';
+	} else {
+		int status;
+		waitpid(child_pid, &status, 0);
+
+		TEST_ASSERT(WIFSIGNALED(status),
+			    "Child should have exited with a signal");
+		TEST_ASSERT_EQ(WTERMSIG(status), SIGBUS);
+	}
+}
+
 static bool run_vcpus;
 
-static void *__test_mem_conversions(void *__vcpu)
+struct test_thread_args
 {
-	struct kvm_vcpu *vcpu = __vcpu;
+	struct kvm_vcpu *vcpu;
+	bool back_shared_memory_with_guest_memfd;
+};
+
+static void *__test_mem_conversions(void *params)
+{
+	struct test_thread_args *args = params;
+	struct kvm_vcpu *vcpu = args->vcpu;
 	struct kvm_run *run = vcpu->run;
 	struct kvm_vm *vm = vcpu->vm;
 	struct ucall uc;
@@ -325,7 +361,8 @@ static void *__test_mem_conversions(void *__vcpu)
 		vcpu_run(vcpu);
 
 		if (run->exit_reason == KVM_EXIT_HYPERCALL) {
-			handle_exit_hypercall(vcpu);
+			handle_exit_hypercall(vcpu,
+					      args->back_shared_memory_with_guest_memfd);
 			continue;
 		}
 
@@ -349,8 +386,18 @@ static void *__test_mem_conversions(void *__vcpu)
 				size_t nr_bytes = min_t(size_t, vm->page_size, size - i);
 				uint8_t *hva = addr_gpa2hva(vm, gpa + i);
 
-				/* In all cases, the host should observe the shared data. */
-				memcmp_h(hva, gpa + i, uc.args[3], nr_bytes);
+				/* Check contents of memory */
+				if (args->back_shared_memory_with_guest_memfd &&
+				    uc.args[0] == SYNC_PRIVATE) {
+					assert_not_faultable(hva);
+				} else {
+					/*
+					 * If shared and private memory use
+					 * separate backing memory, the host
+					 * should always observe shared data.
+					 */
+					memcmp_h(hva, gpa + i, uc.args[3], nr_bytes);
+				}
 
 				/* For shared, write the new pattern to guest memory. */
 				if (uc.args[0] == SYNC_SHARED)
@@ -366,11 +413,41 @@ static void *__test_mem_conversions(void *__vcpu)
 	}
 }
 
-static void
-test_mem_conversions(enum vm_mem_backing_src_type src_type,
-		     enum vm_private_mem_backing_src_type private_mem_src_type,
-		     uint32_t nr_vcpus,
-		     uint32_t nr_memslots)
+static void add_memslot(struct kvm_vm *vm, uint64_t gpa, uint32_t slot,
+			uint64_t size, int guest_memfd,
+			uint64_t guest_memfd_offset,
+			enum vm_mem_backing_src_type src_type,
+			bool back_shared_memory_with_guest_memfd)
+{
+	struct userspace_mem_region *region;
+
+	if (!back_shared_memory_with_guest_memfd) {
+		vm_mem_add(vm, src_type, gpa, slot, size / vm->page_size,
+			   KVM_MEM_GUEST_MEMFD, guest_memfd,
+			   guest_memfd_offset);
+		return;
+	}
+
+	region = vm_mem_region_alloc(vm);
+
+	guest_memfd = vm_mem_region_install_guest_memfd(region, guest_memfd);
+
+	vm_mem_region_mmap(region, size, MAP_SHARED, guest_memfd, guest_memfd_offset);
+	vm_mem_region_install_memory(region, size, getpagesize());
+
+	region->region.slot = slot;
+	region->region.flags = KVM_MEM_GUEST_MEMFD;
+	region->region.guest_phys_addr = gpa;
+	region->region.guest_memfd_offset = guest_memfd_offset;
+
+	vm_mem_region_add(vm, region);
+}
+
+static void test_mem_conversions(enum vm_mem_backing_src_type src_type,
+				 enum vm_private_mem_backing_src_type private_mem_src_type,
+				 uint32_t nr_vcpus,
+				 uint32_t nr_memslots,
+				 bool back_shared_memory_with_guest_memfd)
 {
 	/*
 	 * Allocate enough memory so that each vCPU's chunk of memory can be
@@ -381,6 +458,7 @@ test_mem_conversions(enum vm_mem_backing_src_type src_type,
 					     get_private_mem_backing_src_pagesz(private_mem_src_type),
 					     get_backing_src_pagesz(src_type)));
 	const size_t per_cpu_size = align_up(PER_CPU_DATA_SIZE, alignment);
+	struct test_thread_args *thread_args[KVM_MAX_VCPUS];
 	const size_t memfd_size = per_cpu_size * nr_vcpus;
 	const size_t slot_size = memfd_size / nr_memslots;
 	struct kvm_vcpu *vcpus[KVM_MAX_VCPUS];
@@ -404,13 +482,14 @@ test_mem_conversions(enum vm_mem_backing_src_type src_type,
 		vm, memfd_size,
 		vm_private_mem_backing_src_alias(private_mem_src_type)->flag);
 
-	for (i = 0; i < nr_memslots; i++)
-		vm_mem_add(vm, src_type, BASE_DATA_GPA + slot_size * i,
-			   BASE_DATA_SLOT + i, slot_size / vm->page_size,
-			   KVM_MEM_GUEST_MEMFD, memfd, slot_size * i);
+	for (i = 0; i < nr_memslots; i++) {
+		add_memslot(vm, BASE_DATA_GPA + slot_size * i,
+			    BASE_DATA_SLOT + i, slot_size, memfd, slot_size * i,
+			    src_type, back_shared_memory_with_guest_memfd);
+	}
 
 	for (i = 0; i < nr_vcpus; i++) {
-		uint64_t gpa =  BASE_DATA_GPA + i * per_cpu_size;
+		uint64_t gpa = BASE_DATA_GPA + i * per_cpu_size;
 
 		vcpu_args_set(vcpus[i], 1, gpa);
 
@@ -420,13 +499,23 @@ test_mem_conversions(enum vm_mem_backing_src_type src_type,
 		 */
 		virt_map(vm, gpa, gpa, PER_CPU_DATA_SIZE / vm->page_size);
 
-		pthread_create(&threads[i], NULL, __test_mem_conversions, vcpus[i]);
+		thread_args[i] = malloc(sizeof(struct test_thread_args));
+		TEST_ASSERT(thread_args[i] != NULL,
+			    "Could not allocate memory for thread parameters");
+		thread_args[i]->vcpu = vcpus[i];
+		thread_args[i]->back_shared_memory_with_guest_memfd =
+			back_shared_memory_with_guest_memfd;
+
+		pthread_create(&threads[i], NULL, __test_mem_conversions,
+			       (void *)thread_args[i]);
 	}
 
 	WRITE_ONCE(run_vcpus, true);
 
-	for (i = 0; i < nr_vcpus; i++)
+	for (i = 0; i < nr_vcpus; i++) {
 		pthread_join(threads[i], NULL);
+		free(thread_args[i]);
+	}
 
 	kvm_vm_free(vm);
 
@@ -448,7 +537,7 @@ test_mem_conversions(enum vm_mem_backing_src_type src_type,
 static void usage(const char *cmd)
 {
 	puts("");
-	printf("usage: %s [-h] [-m nr_memslots] [-s mem_type] [-p private_mem_type] [-n nr_vcpus]\n", cmd);
+	printf("usage: %s [-h] [-m nr_memslots] [-s mem_type] [-p private_mem_type] [-n nr_vcpus] [-g]\n", cmd);
 	puts("");
 	backing_src_help("-s");
 	puts("");
@@ -458,19 +547,22 @@ static void usage(const char *cmd)
 	puts("");
 	puts(" -m: specify the number of memslots (default: 1)");
 	puts("");
+	puts(" -g: back shared memory with guest_memfd (default: false)");
+	puts("");
 }
 
 int main(int argc, char *argv[])
 {
 	enum vm_mem_backing_src_type src_type = DEFAULT_VM_MEM_SRC;
 	enum vm_private_mem_backing_src_type private_mem_src_type = DEFAULT_VM_PRIVATE_MEM_SRC;
+	bool back_shared_memory_with_guest_memfd = false;
 	uint32_t nr_memslots = 1;
 	uint32_t nr_vcpus = 1;
 	int opt;
 
 	TEST_REQUIRE(kvm_check_cap(KVM_CAP_VM_TYPES) & BIT(KVM_X86_SW_PROTECTED_VM));
 
-	while ((opt = getopt(argc, argv, "hm:s:p:n:")) != -1) {
+	while ((opt = getopt(argc, argv, "hgm:s:p:n:")) != -1) {
 		switch (opt) {
 		case 's':
 			src_type = parse_backing_src_type(optarg);
@@ -484,6 +576,9 @@ int main(int argc, char *argv[])
 		case 'm':
 			nr_memslots = atoi_positive("nr_memslots", optarg);
 			break;
+		case 'g':
+			back_shared_memory_with_guest_memfd = true;
+			break;
 		case 'h':
 		default:
 			usage(argv[0]);
@@ -491,7 +586,8 @@ int main(int argc, char *argv[])
 		}
 	}
 
-	test_mem_conversions(src_type, private_mem_src_type, nr_vcpus, nr_memslots);
+	test_mem_conversions(src_type, private_mem_src_type, nr_vcpus, nr_memslots,
+			     back_shared_memory_with_guest_memfd);
 
 	return 0;
 }
diff --git a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.sh b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.sh
index fb6705fef466..c7f3dfee0336 100755
--- a/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.sh
+++ b/tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.sh
@@ -75,6 +75,9 @@ TEST_EXECUTABLE="$(dirname "$0")/private_mem_conversions_test"
 			$TEST_EXECUTABLE -s "$src_type" -p "$private_mem_src_type" -n $num_vcpus_to_test
 			$TEST_EXECUTABLE -s "$src_type" -p "$private_mem_src_type" -n $num_vcpus_to_test -m $num_memslots_to_test
 
+			$TEST_EXECUTABLE -s "$src_type" -p "$private_mem_src_type" -n $num_vcpus_to_test -g
+			$TEST_EXECUTABLE -s "$src_type" -p "$private_mem_src_type" -n $num_vcpus_to_test -m $num_memslots_to_test -g
+
 			{ set +x; } 2>/dev/null
 
 			echo
-- 
2.46.0.598.g6f2099f65c-goog



* [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (37 preceding siblings ...)
  2024-09-10 23:44 ` [RFC PATCH 38/39] KVM: selftests: Update private_mem_conversions_test for mmap()able guest_memfd Ackerley Tng
@ 2024-09-10 23:44 ` Ackerley Tng
  2025-04-03 12:33   ` Yan Zhao
  2024-09-11  6:56 ` [RFC PATCH 00/39] 1G page support for guest_memfd Michal Hocko
                   ` (2 subsequent siblings)
  41 siblings, 1 reply; 130+ messages in thread
From: Ackerley Tng @ 2024-09-10 23:44 UTC (permalink / raw)
  To: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, ackerleytng, qperret, jhubbard, willy,
	shuah, brauner, bfoster, kent.overstreet, pvorel, rppt,
	richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest, linux-fsdevel

From: Vishal Annapurve <vannapurve@google.com>

The faultability of the pages in a HugeTLB folio determines whether the
folio is split or reconstructed.

If any page in a folio is faultable, split the folio. If no page in a
folio is faultable, reconstruct the folio.

On truncation, always reconstruct and free regardless of
faultability (as long as a HugeTLB page's worth of pages is
truncated).
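
As a rough illustration of the policy above, here is a standalone
sketch. It is not the kernel code: the example_*() names and the plain
bool array standing in for the faultability xarray are assumptions made
for this example. The truncation helper also assumes that "a HugeTLB
page's worth of pages" means a huge-page-aligned block lying entirely
inside the truncated range.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum example_action {
	EXAMPLE_SPLIT,		/* at least one page may be faulted in */
	EXAMPLE_RECONSTRUCT,	/* no page may be faulted in */
};

/* Return true if any of the @nr_pages flags starting at @index is set. */
static bool example_any_faultable(const bool *faultable, size_t index,
				  size_t nr_pages)
{
	for (size_t i = index; i < index + nr_pages; i++) {
		if (faultable[i])
			return true;
	}

	return false;
}

/* Decide what to do with the huge-page-sized block starting at @index. */
static enum example_action example_decide(const bool *faultable,
					   size_t index, size_t nr_pages)
{
	return example_any_faultable(faultable, index, nr_pages) ?
		EXAMPLE_SPLIT : EXAMPLE_RECONSTRUCT;
}

/*
 * Count the aligned blocks of @nr_pages pages that lie entirely within
 * [start, end). Under the assumption stated above, these are the blocks
 * that truncation reconstructs and frees regardless of faultability.
 */
static size_t example_full_blocks(size_t start, size_t end, size_t nr_pages)
{
	size_t aligned_start = (start + nr_pages - 1) / nr_pages * nr_pages;
	size_t aligned_end = end / nr_pages * nr_pages;

	if (aligned_end <= aligned_start)
		return 0;

	return (aligned_end - aligned_start) / nr_pages;
}
```

A single faultable page is enough to keep its whole block split, while a
neighbouring block with no faultable pages can be merged back
independently.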

Co-developed-by: Vishal Annapurve <vannapurve@google.com>
Signed-off-by: Vishal Annapurve <vannapurve@google.com>
Co-developed-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Ackerley Tng <ackerleytng@google.com>

---
 virt/kvm/guest_memfd.c | 678 +++++++++++++++++++++++++++--------------
 1 file changed, 456 insertions(+), 222 deletions(-)

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index fb292e542381..0afc111099c0 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -99,6 +99,23 @@ static bool kvm_gmem_is_faultable(struct inode *inode, pgoff_t index)
 	return xa_to_value(xa_load(faultability, index)) == KVM_GMEM_FAULTABILITY_VALUE;
 }
 
+/**
+ * Return true if any of the @nr_pages beginning at @index is allowed to be
+ * faulted in.
+ */
+static bool kvm_gmem_is_any_faultable(struct inode *inode, pgoff_t index,
+				      int nr_pages)
+{
+	pgoff_t i;
+
+	for (i = index; i < index + nr_pages; ++i) {
+		if (kvm_gmem_is_faultable(inode, i))
+			return true;
+	}
+
+	return false;
+}
+
 /**
  * folio_file_pfn - like folio_file_page, but return a pfn.
  * @folio: The folio which contains this index.
@@ -312,6 +329,40 @@ static int kvm_gmem_hugetlb_filemap_add_folio(struct address_space *mapping,
 	return 0;
 }
 
+static inline void kvm_gmem_hugetlb_filemap_remove_folio(struct folio *folio)
+{
+	folio_lock(folio);
+
+	folio_clear_dirty(folio);
+	folio_clear_uptodate(folio);
+	filemap_remove_folio(folio);
+
+	folio_unlock(folio);
+}
+
+/*
+ * Locks a block of nr_pages (1 << huge_page_order(h)) pages within @mapping
+ * beginning at @index. Take either this or filemap_invalidate_lock() whenever
+ * the filemap is accessed.
+ */
+static u32 hugetlb_fault_mutex_lock(struct address_space *mapping, pgoff_t index)
+{
+	pgoff_t hindex;
+	u32 hash;
+
+	hindex = index >> huge_page_order(kvm_gmem_hgmem(mapping->host)->h);
+	hash = hugetlb_fault_mutex_hash(mapping, hindex);
+
+	mutex_lock(&hugetlb_fault_mutex_table[hash]);
+
+	return hash;
+}
+
+static void hugetlb_fault_mutex_unlock(u32 hash)
+{
+	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+}
+
 struct kvm_gmem_split_stash {
 	struct {
 		unsigned long _flags_2;
@@ -394,15 +445,136 @@ static int kvm_gmem_hugetlb_reconstruct_folio(struct hstate *h, struct folio *fo
 	}
 
 	__folio_set_hugetlb(folio);
-
-	folio_set_count(folio, 1);
+	hugetlb_folio_list_add(folio, &h->hugepage_activelist);
 
 	hugetlb_vmemmap_optimize_folio(h, folio);
 
+	folio_set_count(folio, 1);
+
 	return 0;
 }
 
-/* Basically folio_set_order(folio, 1) without the checks. */
+/**
+ * Reconstruct a HugeTLB folio out of folio_nr_pages(@first_folio) pages. Will
+ * clean up subfolios from filemap and add back the reconstructed folio. Folios
+ * to be reconstructed must not be locked, and reconstructed folio will not be
+ * locked. Return 0 on success or negative error otherwise.
+ *
+ * hugetlb_fault_mutex_lock() has to be held when calling this function.
+ *
+ * Expects that before this call, the filemap's refcounts are the only refcounts
+ * for the folios in the filemap. After this function returns, the filemap's
+ * refcount will be the only refcount on the reconstructed folio.
+ */
+static int kvm_gmem_reconstruct_folio_in_filemap(struct hstate *h,
+						 struct folio *first_folio)
+{
+	struct address_space *mapping;
+	struct folio_batch fbatch;
+	unsigned long end;
+	pgoff_t index;
+	pgoff_t next;
+	int ret;
+	int i;
+
+	if (folio_order(first_folio) == huge_page_order(h))
+		return 0;
+
+	index = first_folio->index;
+	mapping = first_folio->mapping;
+
+	next = index;
+	end = index + (1UL << huge_page_order(h));
+	folio_batch_init(&fbatch);
+	while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
+		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
+			struct folio *folio;
+
+			folio = fbatch.folios[i];
+
+			/*
+			 * Before removing from filemap, take a reference so
+			 * sub-folios don't get freed when removing from
+			 * filemap.
+			 */
+			folio_get(folio);
+
+			kvm_gmem_hugetlb_filemap_remove_folio(folio);
+		}
+		folio_batch_release(&fbatch);
+	}
+
+	ret = kvm_gmem_hugetlb_reconstruct_folio(h, first_folio);
+	if (ret) {
+		/* TODO: handle cleanup properly. */
+		WARN_ON(ret);
+		return ret;
+	}
+
+	kvm_gmem_hugetlb_filemap_add_folio(mapping, first_folio, index,
+					   htlb_alloc_mask(h));
+
+	folio_unlock(first_folio);
+	folio_put(first_folio);
+
+	return ret;
+}
+
+/**
+ * Reconstruct any HugeTLB folios in range [@start, @end), if all the subfolios
+ * are not faultable. Return 0 on success or negative error otherwise.
+ *
+ * Will skip any folios that are already reconstructed.
+ */
+static int kvm_gmem_try_reconstruct_folios_range(struct inode *inode,
+						 pgoff_t start, pgoff_t end)
+{
+	unsigned int nr_pages;
+	pgoff_t aligned_start;
+	pgoff_t aligned_end;
+	struct hstate *h;
+	pgoff_t index;
+	int ret;
+
+	if (!is_kvm_gmem_hugetlb(inode))
+		return 0;
+
+	h = kvm_gmem_hgmem(inode)->h;
+	nr_pages = 1UL << huge_page_order(h);
+
+	aligned_start = round_up(start, nr_pages);
+	aligned_end = round_down(end, nr_pages);
+
+	ret = 0;
+	for (index = aligned_start; !ret && index < aligned_end; index += nr_pages) {
+		struct folio *folio;
+		u32 hash;
+
+		hash = hugetlb_fault_mutex_lock(inode->i_mapping, index);
+
+		folio = filemap_get_folio(inode->i_mapping, index);
+		if (!IS_ERR(folio)) {
+			/*
+			 * Drop refcount because reconstruction expects an equal number
+			 * of refcounts for all subfolios - just keep the refcount taken
+			 * by the filemap.
+			 */
+			folio_put(folio);
+
+			/* Merge only when the entire block of nr_pages is not faultable. */
+			if (!kvm_gmem_is_any_faultable(inode, index, nr_pages)) {
+				ret = kvm_gmem_reconstruct_folio_in_filemap(h, folio);
+				WARN_ON(ret);
+			}
+		}
+
+		hugetlb_fault_mutex_unlock(hash);
+	}
+
+	return ret;
+}
+
+/* Basically folio_set_order() without the checks. */
 static inline void kvm_gmem_folio_set_order(struct folio *folio, unsigned int order)
 {
 	folio->_flags_1 = (folio->_flags_1 & ~0xffUL) | order;
@@ -414,8 +586,8 @@ static inline void kvm_gmem_folio_set_order(struct folio *folio, unsigned int or
 /**
  * Split a HugeTLB @folio of size huge_page_size(@h).
  *
- * After splitting, each split folio has a refcount of 1. There are no checks on
- * refcounts before splitting.
+ * Folio must have refcount of 1 when this function is called. After splitting,
+ * each split folio has a refcount of 1.
  *
  * Return 0 on success and negative error otherwise.
  */
@@ -423,14 +595,18 @@ static int kvm_gmem_hugetlb_split_folio(struct hstate *h, struct folio *folio)
 {
 	int ret;
 
+	VM_WARN_ON_ONCE_FOLIO(folio_ref_count(folio) != 1, folio);
+
+	folio_set_count(folio, 0);
+
 	ret = hugetlb_vmemmap_restore_folio(h, folio);
 	if (ret)
-		return ret;
+		goto out;
 
 	ret = kvm_gmem_hugetlb_stash_metadata(folio);
 	if (ret) {
 		hugetlb_vmemmap_optimize_folio(h, folio);
-		return ret;
+		goto out;
 	}
 
 	kvm_gmem_folio_set_order(folio, 0);
@@ -439,109 +615,183 @@ static int kvm_gmem_hugetlb_split_folio(struct hstate *h, struct folio *folio)
 	__folio_clear_hugetlb(folio);
 
 	/*
-	 * Remove the first folio from h->hugepage_activelist since it is no
+	 * Remove the original folio from h->hugepage_activelist since it is no
 	 * longer a HugeTLB page. The other split pages should not be on any
 	 * lists.
 	 */
 	hugetlb_folio_list_del(folio);
 
-	return 0;
+	ret = 0;
+out:
+	folio_set_count(folio, 1);
+	return ret;
 }
 
-static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
-							    pgoff_t index)
+/**
+ * Split a HugeTLB folio into folio_nr_pages(@folio) pages. Will clean up folio
+ * from filemap and add back the split folios. @folio must not be locked, and
+ * all split folios will not be locked. Return 0 on success or negative error
+ * otherwise.
+ *
+ * hugetlb_fault_mutex_lock() has to be held when calling this function.
+ *
+ * Expects that before this call, the filemap's refcounts are the only refcounts
+ * for the folio. After this function returns, the filemap's refcounts will be
+ * the only refcounts on the split folios.
+ */
+static int kvm_gmem_split_folio_in_filemap(struct hstate *h, struct folio *folio)
 {
-	struct folio *allocated_hugetlb_folio;
-	pgoff_t hugetlb_first_subpage_index;
-	struct page *hugetlb_first_subpage;
-	struct kvm_gmem_hugetlb *hgmem;
-	struct page *requested_page;
+	struct address_space *mapping;
+	struct page *first_subpage;
+	pgoff_t index;
 	int ret;
 	int i;
 
-	hgmem = kvm_gmem_hgmem(inode);
-	allocated_hugetlb_folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
-	if (IS_ERR(allocated_hugetlb_folio))
-		return allocated_hugetlb_folio;
+	if (folio_order(folio) == 0)
+		return 0;
 
-	requested_page = folio_file_page(allocated_hugetlb_folio, index);
-	hugetlb_first_subpage = folio_file_page(allocated_hugetlb_folio, 0);
-	hugetlb_first_subpage_index = index & (huge_page_mask(hgmem->h) >> PAGE_SHIFT);
+	index = folio->index;
+	mapping = folio->mapping;
 
-	ret = kvm_gmem_hugetlb_split_folio(hgmem->h, allocated_hugetlb_folio);
+	first_subpage = folio_page(folio, 0);
+
+	/*
+	 * Take reference so that folio will not be released when removed from
+	 * filemap.
+	 */
+	folio_get(folio);
+
+	kvm_gmem_hugetlb_filemap_remove_folio(folio);
+
+	ret = kvm_gmem_hugetlb_split_folio(h, folio);
 	if (ret) {
-		folio_put(allocated_hugetlb_folio);
-		return ERR_PTR(ret);
+		WARN_ON(ret);
+		kvm_gmem_hugetlb_filemap_add_folio(mapping, folio, index,
+						   htlb_alloc_mask(h));
+		folio_put(folio);
+		return ret;
 	}
 
-	for (i = 0; i < pages_per_huge_page(hgmem->h); ++i) {
-		struct folio *folio = page_folio(nth_page(hugetlb_first_subpage, i));
+	for (i = 0; i < pages_per_huge_page(h); ++i) {
+		struct folio *folio = page_folio(nth_page(first_subpage, i));
 
-		ret = kvm_gmem_hugetlb_filemap_add_folio(inode->i_mapping,
-							 folio,
-							 hugetlb_first_subpage_index + i,
-							 htlb_alloc_mask(hgmem->h));
+		ret = kvm_gmem_hugetlb_filemap_add_folio(mapping, folio,
+							 index + i,
+							 htlb_alloc_mask(h));
 		if (ret) {
 			/* TODO: handle cleanup properly. */
-			pr_err("Handle cleanup properly index=%lx, ret=%d\n",
-			       hugetlb_first_subpage_index + i, ret);
-			dump_page(nth_page(hugetlb_first_subpage, i), "check");
-			return ERR_PTR(ret);
+			WARN_ON(ret);
+			return ret;
 		}
 
+		folio_unlock(folio);
+
 		/*
-		 * Skip unlocking for the requested index since
-		 * kvm_gmem_get_folio() returns a locked folio.
-		 *
-		 * Do folio_put() to drop the refcount that came with the folio,
-		 * from splitting the folio. Splitting the folio has a refcount
-		 * to be in line with hugetlb_alloc_folio(), which returns a
-		 * folio with refcount 1.
-		 *
-		 * Skip folio_put() for requested index since
-		 * kvm_gmem_get_folio() returns a folio with refcount 1.
+		 * Drop reference so that the only remaining reference is the
+		 * one held by the filemap.
 		 */
-		if (hugetlb_first_subpage_index + i != index) {
-			folio_unlock(folio);
-			folio_put(folio);
-		}
+		folio_put(folio);
 	}
 
+	return ret;
+}
+
+/*
+ * Allocates and then caches a folio in the filemap. Returns a folio with
+ * refcount of 2: 1 after allocation, and 1 taken by the filemap.
+ */
+static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
+							    pgoff_t index)
+{
+	struct kvm_gmem_hugetlb *hgmem;
+	pgoff_t aligned_index;
+	struct folio *folio;
+	int nr_pages;
+	int ret;
+
+	hgmem = kvm_gmem_hgmem(inode);
+	folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
+	if (IS_ERR(folio))
+		return folio;
+
+	nr_pages = 1UL << huge_page_order(hgmem->h);
+	aligned_index = round_down(index, nr_pages);
+
+	ret = kvm_gmem_hugetlb_filemap_add_folio(inode->i_mapping, folio,
+						 aligned_index,
+						 htlb_alloc_mask(hgmem->h));
+	WARN_ON(ret);
+
 	spin_lock(&inode->i_lock);
 	inode->i_blocks += blocks_per_huge_page(hgmem->h);
 	spin_unlock(&inode->i_lock);
 
-	return page_folio(requested_page);
+	return folio;
+}
+
+/**
+ * Split @folio if any of the subfolios are faultable. Returns the split
+ * (locked, refcount=2) folio at @index.
+ *
+ * Expects a locked folio with 1 refcount in addition to filemap's refcounts.
+ *
+ * After splitting, the subfolios in the filemap will be unlocked and have
+ * refcount 1 (other than the returned folio, which will be locked and have
+ * refcount 2).
+ */
+static struct folio *kvm_gmem_maybe_split_folio(struct folio *folio, pgoff_t index)
+{
+	pgoff_t aligned_index;
+	struct inode *inode;
+	struct hstate *h;
+	int nr_pages;
+	int ret;
+
+	inode = folio->mapping->host;
+	h = kvm_gmem_hgmem(inode)->h;
+	nr_pages = 1UL << huge_page_order(h);
+	aligned_index = round_down(index, nr_pages);
+
+	if (!kvm_gmem_is_any_faultable(inode, aligned_index, nr_pages))
+		return folio;
+
+	/* Drop lock and refcount in preparation for splitting. */
+	folio_unlock(folio);
+	folio_put(folio);
+
+	ret = kvm_gmem_split_folio_in_filemap(h, folio);
+	if (ret) {
+		kvm_gmem_hugetlb_filemap_remove_folio(folio);
+		return ERR_PTR(ret);
+	}
+
+	/*
+	 * At this point, the filemap has the only reference on the folio. Take
+	 * lock and refcount on folio to align with kvm_gmem_get_folio().
+	 */
+	return filemap_lock_folio(inode->i_mapping, index);
 }
 
 static struct folio *kvm_gmem_get_hugetlb_folio(struct inode *inode,
 						pgoff_t index)
 {
-	struct address_space *mapping;
 	struct folio *folio;
-	struct hstate *h;
-	pgoff_t hindex;
 	u32 hash;
 
-	h = kvm_gmem_hgmem(inode)->h;
-	hindex = index >> huge_page_order(h);
-	mapping = inode->i_mapping;
-
-	/* To lock, we calculate the hash using the hindex and not index. */
-	hash = hugetlb_fault_mutex_hash(mapping, hindex);
-	mutex_lock(&hugetlb_fault_mutex_table[hash]);
+	hash = hugetlb_fault_mutex_lock(inode->i_mapping, index);
 
 	/*
-	 * The filemap is indexed with index and not hindex. Taking lock on
-	 * folio to align with kvm_gmem_get_regular_folio()
+	 * The filemap is indexed with index and not hindex. Take lock on folio
+	 * to align with kvm_gmem_get_regular_folio()
 	 */
-	folio = filemap_lock_folio(mapping, index);
+	folio = filemap_lock_folio(inode->i_mapping, index);
+	if (IS_ERR(folio))
+		folio = kvm_gmem_hugetlb_alloc_and_cache_folio(inode, index);
+
 	if (!IS_ERR(folio))
-		goto out;
+		folio = kvm_gmem_maybe_split_folio(folio, index);
 
-	folio = kvm_gmem_hugetlb_alloc_and_cache_folio(inode, index);
-out:
-	mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+	hugetlb_fault_mutex_unlock(hash);
 
 	return folio;
 }
@@ -610,17 +860,6 @@ static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
 	}
 }
 
-static inline void kvm_gmem_hugetlb_filemap_remove_folio(struct folio *folio)
-{
-	folio_lock(folio);
-
-	folio_clear_dirty(folio);
-	folio_clear_uptodate(folio);
-	filemap_remove_folio(folio);
-
-	folio_unlock(folio);
-}
-
 /**
  * Removes folios in range [@lstart, @lend) from page cache/filemap (@mapping),
  * returning the number of HugeTLB pages freed.
@@ -631,61 +870,30 @@ static int kvm_gmem_hugetlb_filemap_remove_folios(struct address_space *mapping,
 						  struct hstate *h,
 						  loff_t lstart, loff_t lend)
 {
-	const pgoff_t end = lend >> PAGE_SHIFT;
-	pgoff_t next = lstart >> PAGE_SHIFT;
-	LIST_HEAD(folios_to_reconstruct);
-	struct folio_batch fbatch;
-	struct folio *folio, *tmp;
-	int num_freed = 0;
-	int i;
-
-	/*
-	 * TODO: Iterate over huge_page_size(h) blocks to avoid taking and
-	 * releasing hugetlb_fault_mutex_table[hash] lock so often. When
-	 * truncating, lstart and lend should be clipped to the size of this
-	 * guest_memfd file, otherwise there would be too many iterations.
-	 */
-	folio_batch_init(&fbatch);
-	while (filemap_get_folios(mapping, &next, end - 1, &fbatch)) {
-		for (i = 0; i < folio_batch_count(&fbatch); ++i) {
-			struct folio *folio;
-			pgoff_t hindex;
-			u32 hash;
-
-			folio = fbatch.folios[i];
+	loff_t offset;
+	int num_freed;
 
-			hindex = folio->index >> huge_page_order(h);
-			hash = hugetlb_fault_mutex_hash(mapping, hindex);
-			mutex_lock(&hugetlb_fault_mutex_table[hash]);
+	num_freed = 0;
+	for (offset = lstart; offset < lend; offset += huge_page_size(h)) {
+		struct folio *folio;
+		pgoff_t index;
+		u32 hash;
 
-			/*
-			 * Collect first pages of HugeTLB folios for
-			 * reconstruction later.
-			 */
-			if ((folio->index & ~(huge_page_mask(h) >> PAGE_SHIFT)) == 0)
-				list_add(&folio->lru, &folios_to_reconstruct);
+		index = offset >> PAGE_SHIFT;
+		hash = hugetlb_fault_mutex_lock(mapping, index);
 
-			/*
-			 * Before removing from filemap, take a reference so
-			 * sub-folios don't get freed. Don't free the sub-folios
-			 * until after reconstruction.
-			 */
-			folio_get(folio);
+		folio = filemap_get_folio(mapping, index);
+		if (!IS_ERR(folio)) {
+			/* Drop refcount so that filemap holds the only reference. */
+			folio_put(folio);
 
+			kvm_gmem_reconstruct_folio_in_filemap(h, folio);
 			kvm_gmem_hugetlb_filemap_remove_folio(folio);
 
-			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
+			num_freed++;
 		}
-		folio_batch_release(&fbatch);
-		cond_resched();
-	}
-
-	list_for_each_entry_safe(folio, tmp, &folios_to_reconstruct, lru) {
-		kvm_gmem_hugetlb_reconstruct_folio(h, folio);
-		hugetlb_folio_list_move(folio, &h->hugepage_activelist);
 
-		folio_put(folio);
-		num_freed++;
+		hugetlb_fault_mutex_unlock(hash);
 	}
 
 	return num_freed;
@@ -705,6 +913,10 @@ static void kvm_gmem_hugetlb_truncate_folios_range(struct inode *inode,
 	int gbl_reserve;
 	int num_freed;
 
+	/* No point truncating more than inode size. */
+	lstart = min(lstart, inode->i_size);
+	lend = min(lend, inode->i_size);
+
 	hgmem = kvm_gmem_hgmem(inode);
 	h = hgmem->h;
 
@@ -1042,13 +1254,27 @@ static vm_fault_t kvm_gmem_fault(struct vm_fault *vmf)
 	bool is_prepared;
 
 	inode = file_inode(vmf->vma->vm_file);
-	if (!kvm_gmem_is_faultable(inode, vmf->pgoff))
+
+	/*
+	 * Use filemap_invalidate_lock_shared() to make sure
+	 * kvm_gmem_get_folio() doesn't race with faultability updates.
+	 */
+	filemap_invalidate_lock_shared(inode->i_mapping);
+
+	if (!kvm_gmem_is_faultable(inode, vmf->pgoff)) {
+		filemap_invalidate_unlock_shared(inode->i_mapping);
 		return VM_FAULT_SIGBUS;
+	}
 
 	folio = kvm_gmem_get_folio(inode, vmf->pgoff);
+
+	filemap_invalidate_unlock_shared(inode->i_mapping);
+
 	if (!folio)
 		return VM_FAULT_SIGBUS;
 
+	WARN(folio_test_hugetlb(folio), "should not be faulting in hugetlb folio=%p\n", folio);
+
 	is_prepared = folio_test_uptodate(folio);
 	if (!is_prepared) {
 		unsigned long nr_pages;
@@ -1731,8 +1957,6 @@ static bool kvm_gmem_no_mappings_range(struct inode *inode, pgoff_t start, pgoff
 	pgoff_t index;
 	bool checked_indices_unmapped;
 
-	filemap_invalidate_lock_shared(inode->i_mapping);
-
 	/* TODO: replace iteration with filemap_get_folios() for efficiency. */
 	checked_indices_unmapped = true;
 	for (index = start; checked_indices_unmapped && index < end;) {
@@ -1754,98 +1978,130 @@ static bool kvm_gmem_no_mappings_range(struct inode *inode, pgoff_t start, pgoff
 		folio_put(folio);
 	}
 
-	filemap_invalidate_unlock_shared(inode->i_mapping);
 	return checked_indices_unmapped;
 }
 
 /**
- * Returns true if pages in range [@start, @end) in memslot @slot have no
- * userspace mappings.
+ * Split any HugeTLB folios in range [@start, @end), if any of the offsets in
+ * the folio are faultable. Return 0 on success or negative error otherwise.
+ *
+ * Will skip any folios that are already split.
  */
-static bool kvm_gmem_no_mappings_slot(struct kvm_memory_slot *slot,
-				      gfn_t start, gfn_t end)
+static int kvm_gmem_try_split_folios_range(struct inode *inode,
+					   pgoff_t start, pgoff_t end)
 {
-	pgoff_t offset_start;
-	pgoff_t offset_end;
-	struct file *file;
-	bool ret;
-
-	offset_start = start - slot->base_gfn + slot->gmem.pgoff;
-	offset_end = end - slot->base_gfn + slot->gmem.pgoff;
-
-	file = kvm_gmem_get_file(slot);
-	if (!file)
-		return false;
-
-	ret = kvm_gmem_no_mappings_range(file_inode(file), offset_start, offset_end);
+	unsigned int nr_pages;
+	pgoff_t aligned_start;
+	pgoff_t aligned_end;
+	struct hstate *h;
+	pgoff_t index;
+	int ret;
 
-	fput(file);
+	if (!is_kvm_gmem_hugetlb(inode))
+		return 0;
 
-	return ret;
-}
+	h = kvm_gmem_hgmem(inode)->h;
+	nr_pages = 1UL << huge_page_order(h);
 
-/**
- * Returns true if pages in range [@start, @end) have no host userspace mappings.
- */
-static bool kvm_gmem_no_mappings(struct kvm *kvm, gfn_t start, gfn_t end)
-{
-	int i;
+	aligned_start = round_down(start, nr_pages);
+	aligned_end = round_up(end, nr_pages);
 
-	lockdep_assert_held(&kvm->slots_lock);
+	ret = 0;
+	for (index = aligned_start; !ret && index < aligned_end; index += nr_pages) {
+		struct folio *folio;
+		u32 hash;
 
-	for (i = 0; i < kvm_arch_nr_memslot_as_ids(kvm); i++) {
-		struct kvm_memslot_iter iter;
-		struct kvm_memslots *slots;
+		hash = hugetlb_fault_mutex_lock(inode->i_mapping, index);
 
-		slots = __kvm_memslots(kvm, i);
-		kvm_for_each_memslot_in_gfn_range(&iter, slots, start, end) {
-			struct kvm_memory_slot *slot;
-			gfn_t gfn_start;
-			gfn_t gfn_end;
-
-			slot = iter.slot;
-			gfn_start = max(start, slot->base_gfn);
-			gfn_end = min(end, slot->base_gfn + slot->npages);
+		folio = filemap_get_folio(inode->i_mapping, index);
+		if (!IS_ERR(folio)) {
+			/*
+			 * Drop refcount so that the only references held are refcounts
+			 * from the filemap.
+			 */
+			folio_put(folio);
 
-			if (iter.slot->flags & KVM_MEM_GUEST_MEMFD &&
-			    !kvm_gmem_no_mappings_slot(iter.slot, gfn_start, gfn_end))
-				return false;
+			if (kvm_gmem_is_any_faultable(inode, index, nr_pages)) {
+				ret = kvm_gmem_split_folio_in_filemap(h, folio);
+				if (ret) {
+					/* TODO cleanup properly. */
+					WARN_ON(ret);
+				}
+			}
 		}
+
+		hugetlb_fault_mutex_unlock(hash);
 	}
 
-	return true;
+	return ret;
 }
 
 /**
- * Set faultability of given range of gfns [@start, @end) in memslot @slot to
- * @faultable.
+ * Returns 0 if guest_memfd permits setting range [@start, @end) with
+ * faultability @faultable within memslot @slot, or negative error otherwise.
+ *
+ * If a request was made to set the memory to PRIVATE (not faultable), the pages
+ * in the range must not be pinned or mapped for the request to be permitted.
+ *
+ * Because this may allow pages to be faulted in to userspace when requested to
+ * set attributes to shared, this must only be called after the pages have been
+ * invalidated from guest page tables.
  */
-static void kvm_gmem_set_faultable_slot(struct kvm_memory_slot *slot, gfn_t start,
-					gfn_t end, bool faultable)
+static int kvm_gmem_try_set_faultable_slot(struct kvm_memory_slot *slot,
+					   gfn_t start, gfn_t end,
+					   bool faultable)
 {
 	pgoff_t start_offset;
+	struct inode *inode;
 	pgoff_t end_offset;
 	struct file *file;
+	int ret;
 
 	file = kvm_gmem_get_file(slot);
 	if (!file)
-		return;
+		return 0;
 
 	start_offset = start - slot->base_gfn + slot->gmem.pgoff;
 	end_offset = end - slot->base_gfn + slot->gmem.pgoff;
 
-	WARN_ON(kvm_gmem_set_faultable(file_inode(file), start_offset, end_offset,
-				       faultable));
+	inode = file_inode(file);
+
+	/*
+	 * Take filemap_invalidate_lock() (exclusive) to make sure
+	 * splitting/reconstruction doesn't race with faultability updates.
+	 */
+	filemap_invalidate_lock(inode->i_mapping);
+
+	kvm_gmem_set_faultable(inode, start_offset, end_offset, faultable);
+
+	if (faultable) {
+		ret = kvm_gmem_try_split_folios_range(inode, start_offset,
+						      end_offset);
+	} else {
+		if (kvm_gmem_no_mappings_range(inode, start_offset, end_offset)) {
+			ret = kvm_gmem_try_reconstruct_folios_range(inode,
+								    start_offset,
+								    end_offset);
+		} else {
+			ret = -EINVAL;
+		}
+	}
+
+	filemap_invalidate_unlock(inode->i_mapping);
 
 	fput(file);
+
+	return ret;
 }
 
 /**
- * Set faultability of given range of gfns [@start, @end) in memslot @slot to
- * @faultable.
+ * Returns 0 if guest_memfd permits setting range [@start, @end) with
+ * faultability @faultable within VM @kvm, or negative error otherwise.
+ *
+ * See kvm_gmem_try_set_faultable_slot() for details.
  */
-static void kvm_gmem_set_faultable_vm(struct kvm *kvm, gfn_t start, gfn_t end,
-				      bool faultable)
+static int kvm_gmem_try_set_faultable_vm(struct kvm *kvm, gfn_t start, gfn_t end,
+					 bool faultable)
 {
 	int i;
 
@@ -1866,43 +2122,15 @@ static void kvm_gmem_set_faultable_vm(struct kvm *kvm, gfn_t start, gfn_t end,
 			gfn_end = min(end, slot->base_gfn + slot->npages);
 
 			if (iter.slot->flags & KVM_MEM_GUEST_MEMFD) {
-				kvm_gmem_set_faultable_slot(slot, gfn_start,
-							    gfn_end, faultable);
+				int ret;
+
+				ret = kvm_gmem_try_set_faultable_slot(slot, gfn_start,
+								      gfn_end, faultable);
+				if (ret)
+					return ret;
 			}
 		}
 	}
-}
-
-/**
- * Returns true if guest_memfd permits setting range [@start, @end) to PRIVATE.
- *
- * If memory is faulted in to host userspace and a request was made to set the
- * memory to PRIVATE, the faulted in pages must not be pinned for the request to
- * be permitted.
- */
-static int kvm_gmem_should_set_attributes_private(struct kvm *kvm, gfn_t start,
-						  gfn_t end)
-{
-	kvm_gmem_set_faultable_vm(kvm, start, end, false);
-
-	if (kvm_gmem_no_mappings(kvm, start, end))
-		return 0;
-
-	kvm_gmem_set_faultable_vm(kvm, start, end, true);
-	return -EINVAL;
-}
-
-/**
- * Returns true if guest_memfd permits setting range [@start, @end) to SHARED.
- *
- * Because this allows pages to be faulted in to userspace, this must only be
- * called after the pages have been invalidated from guest page tables.
- */
-static int kvm_gmem_should_set_attributes_shared(struct kvm *kvm, gfn_t start,
-						 gfn_t end)
-{
-	/* Always okay to set shared, hence set range faultable here. */
-	kvm_gmem_set_faultable_vm(kvm, start, end, true);
 
 	return 0;
 }
@@ -1922,10 +2150,16 @@ static int kvm_gmem_should_set_attributes_shared(struct kvm *kvm, gfn_t start,
 int kvm_gmem_should_set_attributes(struct kvm *kvm, gfn_t start, gfn_t end,
 				   unsigned long attrs)
 {
-	if (attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE)
-		return kvm_gmem_should_set_attributes_private(kvm, start, end);
-	else
-		return kvm_gmem_should_set_attributes_shared(kvm, start, end);
+	bool faultable;
+	int ret;
+
+	faultable = !(attrs & KVM_MEMORY_ATTRIBUTE_PRIVATE);
+
+	ret = kvm_gmem_try_set_faultable_vm(kvm, start, end, faultable);
+	if (ret)
+		WARN_ON(kvm_gmem_try_set_faultable_vm(kvm, start, end, !faultable));
+
+	return ret;
 }
 
 #endif
-- 
2.46.0.598.g6f2099f65c-goog



* Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
  2024-09-10 23:44 ` [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page Ackerley Tng
@ 2025-04-03 12:33   ` Yan Zhao
  2025-04-23 22:02     ` Ackerley Tng
  0 siblings, 1 reply; 130+ messages in thread
From: Yan Zhao @ 2025-04-03 12:33 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest, linux-fsdevel

On Tue, Sep 10, 2024 at 11:44:10PM +0000, Ackerley Tng wrote:
> +/*
> + * Allocates and then caches a folio in the filemap. Returns a folio with
> + * refcount of 2: 1 after allocation, and 1 taken by the filemap.
> + */
> +static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
> +							    pgoff_t index)
> +{
> +	struct kvm_gmem_hugetlb *hgmem;
> +	pgoff_t aligned_index;
> +	struct folio *folio;
> +	int nr_pages;
> +	int ret;
> +
> +	hgmem = kvm_gmem_hgmem(inode);
> +	folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
> +	if (IS_ERR(folio))
> +		return folio;
> +
> +	nr_pages = 1UL << huge_page_order(hgmem->h);
> +	aligned_index = round_down(index, nr_pages);
Maybe a gap here.

When a guest_memfd is bound to a slot where slot->base_gfn is not aligned to
2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, the
corresponding GFN is not 2M/1G aligned.

However, TDX requires that private huge pages be 2M aligned in GFN.
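
To make the gap concrete, here is a minimal userspace sketch (not kernel
code) of the GFN/index arithmetic. The offset formula follows what the patch
does in kvm_gmem_try_set_faultable_slot() (gfn - slot->base_gfn +
slot->gmem.pgoff). The helper names, the 4K base page size and the 512-page
(2M) huge page are assumptions made only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed for illustration: 4K base pages, so a 2M huge page is 512 pages. */
#define SKETCH_PAGES_PER_2M 512ULL

/* Hypothetical helper: guest_memfd index for @gfn (gfn - base_gfn + pgoff). */
static uint64_t sketch_gfn_to_index(uint64_t gfn, uint64_t base_gfn,
				    uint64_t pgoff)
{
	assert(gfn >= base_gfn);
	return gfn - base_gfn + pgoff;
}

/* Hypothetical helper: GFN backed by guest_memfd @index (the inverse). */
static uint64_t sketch_index_to_gfn(uint64_t index, uint64_t base_gfn,
				    uint64_t pgoff)
{
	assert(index >= pgoff);
	return index - pgoff + base_gfn;
}

/* True if @nr (a GFN or an index, in 4K pages) sits on a 2M boundary. */
static bool sketch_is_2m_aligned(uint64_t nr)
{
	return (nr & (SKETCH_PAGES_PER_2M - 1)) == 0;
}
```

With base_gfn = 2 (GPA 8K) and pgoff = 0, index 512 is 2M aligned in the
file, but sketch_index_to_gfn(512, 2, 0) is GFN 514, which is not 2M aligned.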

> +	ret = kvm_gmem_hugetlb_filemap_add_folio(inode->i_mapping, folio,
> +						 aligned_index,
> +						 htlb_alloc_mask(hgmem->h));
> +	WARN_ON(ret);
> +
>  	spin_lock(&inode->i_lock);
>  	inode->i_blocks += blocks_per_huge_page(hgmem->h);
>  	spin_unlock(&inode->i_lock);
>  
> -	return page_folio(requested_page);
> +	return folio;
> +}


* Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
  2025-04-03 12:33   ` Yan Zhao
@ 2025-04-23 22:02     ` Ackerley Tng
  2025-04-24  1:09       ` Yan Zhao
  0 siblings, 1 reply; 130+ messages in thread
From: Ackerley Tng @ 2025-04-23 22:02 UTC (permalink / raw)
  To: Yan Zhao
  Cc: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, erdemaktas, vannapurve, qperret,
	jhubbard, willy, shuah, brauner, bfoster, kent.overstreet, pvorel,
	rppt, richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Tue, Sep 10, 2024 at 11:44:10PM +0000, Ackerley Tng wrote:
>> +/*
>> + * Allocates and then caches a folio in the filemap. Returns a folio with
>> + * refcount of 2: 1 after allocation, and 1 taken by the filemap.
>> + */
>> +static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
>> +							    pgoff_t index)
>> +{
>> +	struct kvm_gmem_hugetlb *hgmem;
>> +	pgoff_t aligned_index;
>> +	struct folio *folio;
>> +	int nr_pages;
>> +	int ret;
>> +
>> +	hgmem = kvm_gmem_hgmem(inode);
>> +	folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
>> +	if (IS_ERR(folio))
>> +		return folio;
>> +
>> +	nr_pages = 1UL << huge_page_order(hgmem->h);
>> +	aligned_index = round_down(index, nr_pages);
> Maybe a gap here.
>
> When a guest_memfd is bound to a slot where slot->base_gfn is not aligned to
> 2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, the
> corresponding GFN is not 2M/1G aligned.

Thanks for looking into this.

In 1G page support for guest_memfd, the offset and size are always
hugepage aligned to the hugepage size requested at guest_memfd creation
time, and it is true that when binding to a memslot, slot->base_gfn and
slot->npages may not be hugepage aligned.

>
> However, TDX requires that private huge pages be 2M aligned in GFN.
>

IIUC other factors also contribute to determining the mapping level in
the guest page tables, like lpage_info and .private_max_mapping_level()
in kvm_x86_ops.

If slot->base_gfn and slot->npages are not hugepage aligned, lpage_info
will track that and not allow faulting into guest page tables at higher
granularity.

Hence I think it is okay to leave it to KVM to fault pages into the
guest correctly. guest_memfd will just maintain the invariant that
offset and size are hugepage aligned, but not require that
slot->base_gfn and slot->npages are hugepage aligned. This behavior will
be consistent with other backing memory for guests like regular shmem or
HugeTLB.
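
As a rough illustration of the lpage_info behavior I am referring to, here
is a simplified userspace model (not the kernel code) based on my reading of
how x86 KVM marks the first and last 2M blocks of a memslot when the slot
boundaries are not 2M aligned. The names are hypothetical, 4K base pages are
assumed, and other reasons KVM may disallow a huge mapping (for example
userspace address alignment or mixed attributes) are deliberately left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define SKETCH_2M_SHIFT 9 /* assumed: log2(512 pages per 2M) */
#define SKETCH_PAGES_PER_2M (1ULL << SKETCH_2M_SHIFT)

/* Index of the 2M-aligned GFN block containing @gfn, relative to the slot. */
static uint64_t sketch_lpage_index(uint64_t gfn, uint64_t base_gfn)
{
	assert(gfn >= base_gfn);
	return (gfn >> SKETCH_2M_SHIFT) - (base_gfn >> SKETCH_2M_SHIFT);
}

/*
 * Model of the head/tail rule only: a 2M mapping is disallowed for the first
 * block if base_gfn is unaligned, and for the last block if the end of the
 * slot is unaligned.
 */
static bool sketch_disallow_2m(uint64_t gfn, uint64_t base_gfn, uint64_t npages)
{
	uint64_t end_gfn = base_gfn + npages;
	uint64_t last, idx;

	assert(npages > 0 && gfn >= base_gfn && gfn < end_gfn);

	last = sketch_lpage_index(end_gfn - 1, base_gfn);
	idx = sketch_lpage_index(gfn, base_gfn);

	if (idx == 0 && (base_gfn & (SKETCH_PAGES_PER_2M - 1)))
		return true;
	if (idx == last && (end_gfn & (SKETCH_PAGES_PER_2M - 1)))
		return true;
	return false;
}
```

In this model, a slot with base_gfn = 2 and npages = 2048 disallows 2M
mappings only in its partial first and last blocks. Blocks fully inside the
slot remain eligible.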

>> +	ret = kvm_gmem_hugetlb_filemap_add_folio(inode->i_mapping, folio,
>> +						 aligned_index,
>> +						 htlb_alloc_mask(hgmem->h));
>> +	WARN_ON(ret);
>> +
>>  	spin_lock(&inode->i_lock);
>>  	inode->i_blocks += blocks_per_huge_page(hgmem->h);
>>  	spin_unlock(&inode->i_lock);
>>  
>> -	return page_folio(requested_page);
>> +	return folio;
>> +}


* Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
  2025-04-23 22:02     ` Ackerley Tng
@ 2025-04-24  1:09       ` Yan Zhao
  2025-04-24  4:25         ` Yan Zhao
  0 siblings, 1 reply; 130+ messages in thread
From: Yan Zhao @ 2025-04-24  1:09 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, erdemaktas, vannapurve, qperret,
	jhubbard, willy, shuah, brauner, bfoster, kent.overstreet, pvorel,
	rppt, richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest

On Wed, Apr 23, 2025 at 03:02:02PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > On Tue, Sep 10, 2024 at 11:44:10PM +0000, Ackerley Tng wrote:
> >> +/*
> >> + * Allocates and then caches a folio in the filemap. Returns a folio with
> >> + * refcount of 2: 1 after allocation, and 1 taken by the filemap.
> >> + */
> >> +static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
> >> +							    pgoff_t index)
> >> +{
> >> +	struct kvm_gmem_hugetlb *hgmem;
> >> +	pgoff_t aligned_index;
> >> +	struct folio *folio;
> >> +	int nr_pages;
> >> +	int ret;
> >> +
> >> +	hgmem = kvm_gmem_hgmem(inode);
> >> +	folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
> >> +	if (IS_ERR(folio))
> >> +		return folio;
> >> +
> >> +	nr_pages = 1UL << huge_page_order(hgmem->h);
> >> +	aligned_index = round_down(index, nr_pages);
> > Maybe a gap here.
> >
> > When a guest_memfd is bound to a slot where slot->base_gfn is not aligned to
> > 2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, the
> > corresponding GFN is not 2M/1G aligned.
> 
> Thanks for looking into this.
> 
> In 1G page support for guest_memfd, the offset and size are always
> hugepage aligned to the hugepage size requested at guest_memfd creation
> time, and it is true that when binding to a memslot, slot->base_gfn and
> slot->npages may not be hugepage aligned.
> 
> >
> > However, TDX requires that private huge pages be 2M aligned in GFN.
> >
> 
> IIUC other factors also contribute to determining the mapping level in
> the guest page tables, like lpage_info and .private_max_mapping_level()
> in kvm_x86_ops.
>
> If slot->base_gfn and slot->npages are not hugepage aligned, lpage_info
> will track that and not allow faulting into guest page tables at higher
> granularity.
 
lpage_info only checks the alignments of slot->base_gfn and
slot->base_gfn + npages. e.g.,

if slot->base_gfn is 8K, npages is 8M, then for this slot,
lpage_info[2M][0].disallow_lpage = 1, which is for GFN [8K, 2M+8K);
lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M+8K, 4M+8K);
lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M+8K, 6M+8K);
lpage_info[2M][3].disallow_lpage = 1, which is for GFN [6M+8K, 8M+8K);

  ---------------------------------------------------------
  |          |  |          |  |          |  |          |  |
  8K        2M 2M+8K      4M  4M+8K     6M  6M+8K     8M  8M+8K

For GFN 6M and GFN 6M+4K, as they both belong to lpage_info[2M][2], huge
page is allowed. Also, they have the same aligned_index 2 in guest_memfd.
So, guest_memfd allocates the same huge folio of 2M order for them.

However, for TDX, GFN 6M and GFN 6M+4K should not belong to the same folio.
It's also weird for a 2M mapping in KVM to straddle 2 huge folios.

> Hence I think it is okay to leave it to KVM to fault pages into the
> guest correctly. guest_memfd will just maintain the invariant that
> offset and size are hugepage aligned, but not require that
> slot->base_gfn and slot->npages are hugepage aligned. This behavior will
> be consistent with other backing memory for guests like regular shmem or
> HugeTLB.
> 
> >> +	ret = kvm_gmem_hugetlb_filemap_add_folio(inode->i_mapping, folio,
> >> +						 aligned_index,
> >> +						 htlb_alloc_mask(hgmem->h));
> >> +	WARN_ON(ret);
> >> +
> >>  	spin_lock(&inode->i_lock);
> >>  	inode->i_blocks += blocks_per_huge_page(hgmem->h);
> >>  	spin_unlock(&inode->i_lock);
> >>  
> >> -	return page_folio(requested_page);
> >> +	return folio;
> >> +}


* Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
  2025-04-24  1:09       ` Yan Zhao
@ 2025-04-24  4:25         ` Yan Zhao
  2025-04-24  5:55           ` Chenyi Qiang
  0 siblings, 1 reply; 130+ messages in thread
From: Yan Zhao @ 2025-04-24  4:25 UTC (permalink / raw)
  To: Ackerley Tng, tabba, quic_eberman, roypat, jgg, peterx, david,
	rientjes, fvdl, jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du,
	jun.miao, isaku.yamahata, muchun.song, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest

On Thu, Apr 24, 2025 at 09:09:22AM +0800, Yan Zhao wrote:
> On Wed, Apr 23, 2025 at 03:02:02PM -0700, Ackerley Tng wrote:
> > Yan Zhao <yan.y.zhao@intel.com> writes:
> > 
> > > On Tue, Sep 10, 2024 at 11:44:10PM +0000, Ackerley Tng wrote:
> > >> +/*
> > >> + * Allocates and then caches a folio in the filemap. Returns a folio with
> > >> + * refcount of 2: 1 after allocation, and 1 taken by the filemap.
> > >> + */
> > >> +static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
> > >> +							    pgoff_t index)
> > >> +{
> > >> +	struct kvm_gmem_hugetlb *hgmem;
> > >> +	pgoff_t aligned_index;
> > >> +	struct folio *folio;
> > >> +	int nr_pages;
> > >> +	int ret;
> > >> +
> > >> +	hgmem = kvm_gmem_hgmem(inode);
> > >> +	folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
> > >> +	if (IS_ERR(folio))
> > >> +		return folio;
> > >> +
> > >> +	nr_pages = 1UL << huge_page_order(hgmem->h);
> > >> +	aligned_index = round_down(index, nr_pages);
> > > Maybe a gap here.
> > >
> > > When a guest_memfd is bound to a slot where slot->base_gfn is not aligned to
> > > 2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, the
> > > corresponding GFN is not 2M/1G aligned.
> > 
> > Thanks for looking into this.
> > 
> > In 1G page support for guest_memfd, the offset and size are always
> > hugepage aligned to the hugepage size requested at guest_memfd creation
> > time, and it is true that when binding to a memslot, slot->base_gfn and
> > slot->npages may not be hugepage aligned.
> > 
> > >
> > > However, TDX requires that private huge pages be 2M aligned in GFN.
> > >
> > 
> > IIUC other factors also contribute to determining the mapping level in
> > the guest page tables, like lpage_info and .private_max_mapping_level()
> > in kvm_x86_ops.
> >
> > If slot->base_gfn and slot->npages are not hugepage aligned, lpage_info
> > will track that and not allow faulting into guest page tables at higher
> > granularity.
>  
> lpage_info only checks the alignments of slot->base_gfn and
> slot->base_gfn + npages. e.g.,
> 
> if slot->base_gfn is 8K, npages is 8M, then for this slot,
> lpage_info[2M][0].disallow_lpage = 1, which is for GFN [4K, 2M+8K);
> lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M+8K, 4M+8K);
> lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M+8K, 6M+8K);
> lpage_info[2M][3].disallow_lpage = 1, which is for GFN [6M+8K, 8M+8K);
> 
>   ---------------------------------------------------------
>   |          |  |          |  |          |  |          |  |
>   8K        2M 2M+8K      4M  4M+8K     6M  6M+8K     8M  8M+8K
> 
> For GFN 6M and GFN 6M+4K, as they both belong to lpage_info[2M][2], huge
> page is allowed. Also, they have the same aligned_index 2 in guest_memfd.
> So, guest_memfd allocates the same huge folio of 2M order for them.
Sorry, sent too fast this morning. The example is not right. The correct
one is:

For GFN 4M and GFN 4M+16K, lpage_info indicates that 2M is allowed. So,
KVM will create a 2M mapping for them.

However, in guest_memfd, GFN 4M and GFN 4M+16K do not correspond to the
same 2M folio and physical addresses may not be contiguous.


> However, for TDX, GFN 6M and GFN 6M+4K should not belong to the same folio.
> It's also weird for a 2M mapping in KVM to stride across 2 huge folios.
> 
> > Hence I think it is okay to leave it to KVM to fault pages into the
> > guest correctly. For guest_memfd will just maintain the invariant that
> > offset and size are hugepage aligned, but not require that
> > slot->base_gfn and slot->npages are hugepage aligned. This behavior will
> > be consistent with other backing memory for guests like regular shmem or
> > HugeTLB.
> > 
> > >> +	ret = kvm_gmem_hugetlb_filemap_add_folio(inode->i_mapping, folio,
> > >> +						 aligned_index,
> > >> +						 htlb_alloc_mask(hgmem->h));
> > >> +	WARN_ON(ret);
> > >> +
> > >>  	spin_lock(&inode->i_lock);
> > >>  	inode->i_blocks += blocks_per_huge_page(hgmem->h);
> > >>  	spin_unlock(&inode->i_lock);
> > >>  
> > >> -	return page_folio(requested_page);
> > >> +	return folio;
> > >> +}


* Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
  2025-04-24  4:25         ` Yan Zhao
@ 2025-04-24  5:55           ` Chenyi Qiang
  2025-04-24  8:13             ` Yan Zhao
  0 siblings, 1 reply; 130+ messages in thread
From: Chenyi Qiang @ 2025-04-24  5:55 UTC (permalink / raw)
  To: Yan Zhao, Ackerley Tng, tabba, quic_eberman, roypat, jgg, peterx,
	david, rientjes, fvdl, jthoughton, seanjc, pbonzini, zhiquan1.li,
	fan.du, jun.miao, isaku.yamahata, muchun.song, erdemaktas,
	vannapurve, qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest



On 4/24/2025 12:25 PM, Yan Zhao wrote:
> On Thu, Apr 24, 2025 at 09:09:22AM +0800, Yan Zhao wrote:
>> On Wed, Apr 23, 2025 at 03:02:02PM -0700, Ackerley Tng wrote:
>>> Yan Zhao <yan.y.zhao@intel.com> writes:
>>>
>>>> On Tue, Sep 10, 2024 at 11:44:10PM +0000, Ackerley Tng wrote:
>>>>> +/*
>>>>> + * Allocates and then caches a folio in the filemap. Returns a folio with
>>>>> + * refcount of 2: 1 after allocation, and 1 taken by the filemap.
>>>>> + */
>>>>> +static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
>>>>> +							    pgoff_t index)
>>>>> +{
>>>>> +	struct kvm_gmem_hugetlb *hgmem;
>>>>> +	pgoff_t aligned_index;
>>>>> +	struct folio *folio;
>>>>> +	int nr_pages;
>>>>> +	int ret;
>>>>> +
>>>>> +	hgmem = kvm_gmem_hgmem(inode);
>>>>> +	folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
>>>>> +	if (IS_ERR(folio))
>>>>> +		return folio;
>>>>> +
>>>>> +	nr_pages = 1UL << huge_page_order(hgmem->h);
>>>>> +	aligned_index = round_down(index, nr_pages);
>>>> Maybe a gap here.
>>>>
>>>> When a guest_memfd is bound to a slot where slot->base_gfn is not aligned to
>>>> 2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, the
>>>> corresponding GFN is not 2M/1G aligned.
>>>
>>> Thanks for looking into this.
>>>
>>> In 1G page support for guest_memfd, the offset and size are always
>>> hugepage aligned to the hugepage size requested at guest_memfd creation
>>> time, and it is true that when binding to a memslot, slot->base_gfn and
>>> slot->npages may not be hugepage aligned.
>>>
>>>>
>>>> However, TDX requires that private huge pages be 2M aligned in GFN.
>>>>
>>>
>>> IIUC other factors also contribute to determining the mapping level in
>>> the guest page tables, like lpage_info and .private_max_mapping_level()
>>> in kvm_x86_ops.
>>>
>>> If slot->base_gfn and slot->npages are not hugepage aligned, lpage_info
>>> will track that and not allow faulting into guest page tables at higher
>>> granularity.
>>  
>> lpage_info only checks the alignments of slot->base_gfn and
>> slot->base_gfn + npages. e.g.,
>>
>> if slot->base_gfn is 8K, npages is 8M, then for this slot,
>> lpage_info[2M][0].disallow_lpage = 1, which is for GFN [4K, 2M+8K);
>> lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M+8K, 4M+8K);
>> lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M+8K, 6M+8K);
>> lpage_info[2M][3].disallow_lpage = 1, which is for GFN [6M+8K, 8M+8K);

Should it be?
lpage_info[2M][0].disallow_lpage = 1, which is for GFN [8K, 2M);
lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M, 4M);
lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M, 6M);
lpage_info[2M][3].disallow_lpage = 0, which is for GFN [6M, 8M);
lpage_info[2M][4].disallow_lpage = 1, which is for GFN [8M, 8M+8K);

>>
>>   ---------------------------------------------------------
>>   |          |  |          |  |          |  |          |  |
>>   8K        2M 2M+8K      4M  4M+8K     6M  6M+8K     8M  8M+8K
>>
>> For GFN 6M and GFN 6M+4K, as they both belong to lpage_info[2M][2], huge
>> page is allowed. Also, they have the same aligned_index 2 in guest_memfd.
>> So, guest_memfd allocates the same huge folio of 2M order for them.
> Sorry, sent too fast this morning. The example is not right. The correct
> one is:
> 
> For GFN 4M and GFN 4M+16K, lpage_info indicates that 2M is allowed. So,
> KVM will create a 2M mapping for them.
> 
> However, in guest_memfd, GFN 4M and GFN 4M+16K do not correspond to the
> same 2M folio and physical addresses may not be contiguous.
> 
> 
>> However, for TDX, GFN 6M and GFN 6M+4K should not belong to the same folio.
>> It's also weird for a 2M mapping in KVM to stride across 2 huge folios.
>>
>>> Hence I think it is okay to leave it to KVM to fault pages into the
>>> guest correctly. For guest_memfd will just maintain the invariant that
>>> offset and size are hugepage aligned, but not require that
>>> slot->base_gfn and slot->npages are hugepage aligned. This behavior will
>>> be consistent with other backing memory for guests like regular shmem or
>>> HugeTLB.
>>>
>>>>> +	ret = kvm_gmem_hugetlb_filemap_add_folio(inode->i_mapping, folio,
>>>>> +						 aligned_index,
>>>>> +						 htlb_alloc_mask(hgmem->h));
>>>>> +	WARN_ON(ret);
>>>>> +
>>>>>  	spin_lock(&inode->i_lock);
>>>>>  	inode->i_blocks += blocks_per_huge_page(hgmem->h);
>>>>>  	spin_unlock(&inode->i_lock);
>>>>>  
>>>>> -	return page_folio(requested_page);
>>>>> +	return folio;
>>>>> +}
> 



* Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
  2025-04-24  5:55           ` Chenyi Qiang
@ 2025-04-24  8:13             ` Yan Zhao
  2025-04-24 14:10               ` Vishal Annapurve
  0 siblings, 1 reply; 130+ messages in thread
From: Yan Zhao @ 2025-04-24  8:13 UTC (permalink / raw)
  To: Chenyi Qiang
  Cc: Ackerley Tng, tabba, quic_eberman, roypat, jgg, peterx, david,
	rientjes, fvdl, jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du,
	jun.miao, isaku.yamahata, muchun.song, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest

On Thu, Apr 24, 2025 at 01:55:51PM +0800, Chenyi Qiang wrote:
> 
> 
> On 4/24/2025 12:25 PM, Yan Zhao wrote:
> > On Thu, Apr 24, 2025 at 09:09:22AM +0800, Yan Zhao wrote:
> >> On Wed, Apr 23, 2025 at 03:02:02PM -0700, Ackerley Tng wrote:
> >>> Yan Zhao <yan.y.zhao@intel.com> writes:
> >>>
> >>>> On Tue, Sep 10, 2024 at 11:44:10PM +0000, Ackerley Tng wrote:
> >>>>> +/*
> >>>>> + * Allocates and then caches a folio in the filemap. Returns a folio with
> >>>>> + * refcount of 2: 1 after allocation, and 1 taken by the filemap.
> >>>>> + */
> >>>>> +static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
> >>>>> +							    pgoff_t index)
> >>>>> +{
> >>>>> +	struct kvm_gmem_hugetlb *hgmem;
> >>>>> +	pgoff_t aligned_index;
> >>>>> +	struct folio *folio;
> >>>>> +	int nr_pages;
> >>>>> +	int ret;
> >>>>> +
> >>>>> +	hgmem = kvm_gmem_hgmem(inode);
> >>>>> +	folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
> >>>>> +	if (IS_ERR(folio))
> >>>>> +		return folio;
> >>>>> +
> >>>>> +	nr_pages = 1UL << huge_page_order(hgmem->h);
> >>>>> +	aligned_index = round_down(index, nr_pages);
> >>>> Maybe a gap here.
> >>>>
> >>>> When a guest_memfd is bound to a slot where slot->base_gfn is not aligned to
> >>>> 2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, the
> >>>> corresponding GFN is not 2M/1G aligned.
> >>>
> >>> Thanks for looking into this.
> >>>
> >>> In 1G page support for guest_memfd, the offset and size are always
> >>> hugepage aligned to the hugepage size requested at guest_memfd creation
> >>> time, and it is true that when binding to a memslot, slot->base_gfn and
> >>> slot->npages may not be hugepage aligned.
> >>>
> >>>>
> >>>> However, TDX requires that private huge pages be 2M aligned in GFN.
> >>>>
> >>>
> >>> IIUC other factors also contribute to determining the mapping level in
> >>> the guest page tables, like lpage_info and .private_max_mapping_level()
> >>> in kvm_x86_ops.
> >>>
> >>> If slot->base_gfn and slot->npages are not hugepage aligned, lpage_info
> >>> will track that and not allow faulting into guest page tables at higher
> >>> granularity.
> >>  
> >> lpage_info only checks the alignments of slot->base_gfn and
> >> slot->base_gfn + npages. e.g.,
> >>
> >> if slot->base_gfn is 8K, npages is 8M, then for this slot,
> >> lpage_info[2M][0].disallow_lpage = 1, which is for GFN [4K, 2M+8K);
> >> lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M+8K, 4M+8K);
> >> lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M+8K, 6M+8K);
> >> lpage_info[2M][3].disallow_lpage = 1, which is for GFN [6M+8K, 8M+8K);
> 
> Should it be?
> lpage_info[2M][0].disallow_lpage = 1, which is for GFN [8K, 2M);
> lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M, 4M);
> lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M, 6M);
> lpage_info[2M][3].disallow_lpage = 0, which is for GFN [6M, 8M);
> lpage_info[2M][4].disallow_lpage = 1, which is for GFN [8M, 8M+8K);
Right. Good catch. Thanks!

Let me update the example as below:
slot->base_gfn is 2 (for GPA 8KB), npages is 2048 (for an 8MB range)

lpage_info[2M][0].disallow_lpage = 1, which is for GPA [8KB, 2MB);
lpage_info[2M][1].disallow_lpage = 0, which is for GPA [2MB, 4MB);
lpage_info[2M][2].disallow_lpage = 0, which is for GPA [4MB, 6MB);
lpage_info[2M][3].disallow_lpage = 0, which is for GPA [6MB, 8MB);
lpage_info[2M][4].disallow_lpage = 1, which is for GPA [8MB, 8MB+8KB);

lpage_info indicates that a 2MB mapping is allowed to cover GPA 4MB and GPA
4MB+16KB. However, their aligned_index values lead guest_memfd to allocate two
2MB folios, whose physical addresses may not be contiguous.

Additionally, if the guest accesses two GPAs, e.g., GPA 2MB+8KB and GPA 4MB,
KVM could create two 2MB mappings to cover GPA ranges [2MB, 4MB), [4MB, 6MB).
However, guest_memfd just allocates the same 2MB folio for both faults.
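
To make the arithmetic above concrete, here is a minimal user-space sketch
(not kernel code). It assumes 4KB base pages, a 2MB hstate and
slot->gmem.pgoff == 0, and it models the guest_memfd index as
gfn - base_gfn + pgoff. The helper names are made up for illustration only.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT   12
#define PAGES_PER_2M 512ULL	/* 2MB / 4KB */

/* Hypothetical helper: guest_memfd index that backs a GPA inside the slot. */
static uint64_t gpa_to_gmem_index(uint64_t gpa, uint64_t base_gfn,
				  uint64_t pgoff)
{
	uint64_t gfn = gpa >> PAGE_SHIFT;

	assert(gfn >= base_gfn);
	return gfn - base_gfn + pgoff;
}

/* Mirrors round_down(index, nr_pages) from the patch, for a 2MB hstate. */
static uint64_t gmem_aligned_index(uint64_t index)
{
	return index & ~(PAGES_PER_2M - 1);
}

/* Number of the 2MB-aligned GPA region, i.e. what one 2MB mapping covers. */
static uint64_t gpa_to_2m_region(uint64_t gpa)
{
	return gpa >> 21;
}
```

With base_gfn = 2 and pgoff = 0, GPA 4MB and GPA 4MB+16KB share one 2MB GPA
region but get aligned_index 512 and 1024 (two folios), while GPA 2MB+8KB and
GPA 4MB sit in different 2MB GPA regions but both get aligned_index 512.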


> 
> >>
> >>   ---------------------------------------------------------
> >>   |          |  |          |  |          |  |          |  |
> >>   8K        2M 2M+8K      4M  4M+8K     6M  6M+8K     8M  8M+8K
> >>
> >> For GFN 6M and GFN 6M+4K, as they both belong to lpage_info[2M][2], huge
> >> page is allowed. Also, they have the same aligned_index 2 in guest_memfd.
> >> So, guest_memfd allocates the same huge folio of 2M order for them.
> > Sorry, sent too fast this morning. The example is not right. The correct
> > one is:
> > 
> > For GFN 4M and GFN 4M+16K, lpage_info indicates that 2M is allowed. So,
> > KVM will create a 2M mapping for them.
> > 
> > However, in guest_memfd, GFN 4M and GFN 4M+16K do not correspond to the
> > same 2M folio and physical addresses may not be contiguous.
> > 
> > 
> >> However, for TDX, GFN 6M and GFN 6M+4K should not belong to the same folio.
> >> It's also weird for a 2M mapping in KVM to stride across 2 huge folios.
> >>
> >>> Hence I think it is okay to leave it to KVM to fault pages into the
> >>> guest correctly. For guest_memfd will just maintain the invariant that
> >>> offset and size are hugepage aligned, but not require that
> >>> slot->base_gfn and slot->npages are hugepage aligned. This behavior will
> >>> be consistent with other backing memory for guests like regular shmem or
> >>> HugeTLB.
> >>>
> >>>>> +	ret = kvm_gmem_hugetlb_filemap_add_folio(inode->i_mapping, folio,
> >>>>> +						 aligned_index,
> >>>>> +						 htlb_alloc_mask(hgmem->h));
> >>>>> +	WARN_ON(ret);
> >>>>> +
> >>>>>  	spin_lock(&inode->i_lock);
> >>>>>  	inode->i_blocks += blocks_per_huge_page(hgmem->h);
> >>>>>  	spin_unlock(&inode->i_lock);
> >>>>>  
> >>>>> -	return page_folio(requested_page);
> >>>>> +	return folio;
> >>>>> +}
> > 
> 


* Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
  2025-04-24  8:13             ` Yan Zhao
@ 2025-04-24 14:10               ` Vishal Annapurve
  2025-04-24 18:15                 ` Ackerley Tng
  0 siblings, 1 reply; 130+ messages in thread
From: Vishal Annapurve @ 2025-04-24 14:10 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Chenyi Qiang, Ackerley Tng, tabba, quic_eberman, roypat, jgg,
	peterx, david, rientjes, fvdl, jthoughton, seanjc, pbonzini,
	zhiquan1.li, fan.du, jun.miao, isaku.yamahata, muchun.song,
	erdemaktas, qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest

On Thu, Apr 24, 2025 at 1:15 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Thu, Apr 24, 2025 at 01:55:51PM +0800, Chenyi Qiang wrote:
> >
> >
> > On 4/24/2025 12:25 PM, Yan Zhao wrote:
> > > On Thu, Apr 24, 2025 at 09:09:22AM +0800, Yan Zhao wrote:
> > >> On Wed, Apr 23, 2025 at 03:02:02PM -0700, Ackerley Tng wrote:
> > >>> Yan Zhao <yan.y.zhao@intel.com> writes:
> > >>>
> > >>>> On Tue, Sep 10, 2024 at 11:44:10PM +0000, Ackerley Tng wrote:
> > >>>>> +/*
> > >>>>> + * Allocates and then caches a folio in the filemap. Returns a folio with
> > >>>>> + * refcount of 2: 1 after allocation, and 1 taken by the filemap.
> > >>>>> + */
> > >>>>> +static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
> > >>>>> +                                                           pgoff_t index)
> > >>>>> +{
> > >>>>> +       struct kvm_gmem_hugetlb *hgmem;
> > >>>>> +       pgoff_t aligned_index;
> > >>>>> +       struct folio *folio;
> > >>>>> +       int nr_pages;
> > >>>>> +       int ret;
> > >>>>> +
> > >>>>> +       hgmem = kvm_gmem_hgmem(inode);
> > >>>>> +       folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
> > >>>>> +       if (IS_ERR(folio))
> > >>>>> +               return folio;
> > >>>>> +
> > >>>>> +       nr_pages = 1UL << huge_page_order(hgmem->h);
> > >>>>> +       aligned_index = round_down(index, nr_pages);
> > >>>> Maybe a gap here.
> > >>>>
> > >>>> When a guest_memfd is bound to a slot where slot->base_gfn is not aligned to
> > >>>> 2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, the
> > >>>> corresponding GFN is not 2M/1G aligned.
> > >>>
> > >>> Thanks for looking into this.
> > >>>
> > >>> In 1G page support for guest_memfd, the offset and size are always
> > >>> hugepage aligned to the hugepage size requested at guest_memfd creation
> > >>> time, and it is true that when binding to a memslot, slot->base_gfn and
> > >>> slot->npages may not be hugepage aligned.
> > >>>
> > >>>>
> > >>>> However, TDX requires that private huge pages be 2M aligned in GFN.
> > >>>>
> > >>>
> > >>> IIUC other factors also contribute to determining the mapping level in
> > >>> the guest page tables, like lpage_info and .private_max_mapping_level()
> > >>> in kvm_x86_ops.
> > >>>
> > >>> If slot->base_gfn and slot->npages are not hugepage aligned, lpage_info
> > >>> will track that and not allow faulting into guest page tables at higher
> > >>> granularity.
> > >>
> > >> lpage_info only checks the alignments of slot->base_gfn and
> > >> slot->base_gfn + npages. e.g.,
> > >>
> > >> if slot->base_gfn is 8K, npages is 8M, then for this slot,
> > >> lpage_info[2M][0].disallow_lpage = 1, which is for GFN [4K, 2M+8K);
> > >> lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M+8K, 4M+8K);
> > >> lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M+8K, 6M+8K);
> > >> lpage_info[2M][3].disallow_lpage = 1, which is for GFN [6M+8K, 8M+8K);
> >
> > Should it be?
> > lpage_info[2M][0].disallow_lpage = 1, which is for GFN [8K, 2M);
> > lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M, 4M);
> > lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M, 6M);
> > lpage_info[2M][3].disallow_lpage = 0, which is for GFN [6M, 8M);
> > lpage_info[2M][4].disallow_lpage = 1, which is for GFN [8M, 8M+8K);
> Right. Good catch. Thanks!
>
> Let me update the example as below:
> slot->base_gfn is 2 (for GPA 8KB), npages is 2048 (for an 8MB range)
>
> lpage_info[2M][0].disallow_lpage = 1, which is for GPA [8KB, 2MB);
> lpage_info[2M][1].disallow_lpage = 0, which is for GPA [2MB, 4MB);
> lpage_info[2M][2].disallow_lpage = 0, which is for GPA [4MB, 6MB);
> lpage_info[2M][3].disallow_lpage = 0, which is for GPA [6MB, 8MB);
> lpage_info[2M][4].disallow_lpage = 1, which is for GPA [8MB, 8MB+8KB);
>
> lpage_info indicates that a 2MB mapping is allowed to cover GPA 4MB and GPA
> 4MB+16KB. However, their aligned_index values lead guest_memfd to allocate two
> 2MB folios, whose physical addresses may not be contiguous.
>
> Additionally, if the guest accesses two GPAs, e.g., GPA 2MB+8KB and GPA 4MB,
> KVM could create two 2MB mappings to cover GPA ranges [2MB, 4MB), [4MB, 6MB).
> However, guest_memfd just allocates the same 2MB folio for both faults.
>
>
> >
> > >>
> > >>   ---------------------------------------------------------
> > >>   |          |  |          |  |          |  |          |  |
> > >>   8K        2M 2M+8K      4M  4M+8K     6M  6M+8K     8M  8M+8K
> > >>
> > >> For GFN 6M and GFN 6M+4K, as they both belong to lpage_info[2M][2], huge
> > >> page is allowed. Also, they have the same aligned_index 2 in guest_memfd.
> > >> So, guest_memfd allocates the same huge folio of 2M order for them.
> > > Sorry, sent too fast this morning. The example is not right. The correct
> > > one is:
> > >
> > > For GFN 4M and GFN 4M+16K, lpage_info indicates that 2M is allowed. So,
> > > KVM will create a 2M mapping for them.
> > >
> > > However, in guest_memfd, GFN 4M and GFN 4M+16K do not correspond to the
> > > same 2M folio and physical addresses may not be contiguous.

Then during binding, the guest_memfd offset's misalignment relative to the
hugepage size should be the same as the GFN's misalignment, i.e.:

(offset & ~huge_page_mask(h)) == ((slot->base_gfn << PAGE_SHIFT) &
~huge_page_mask(h));

For non guest_memfd backed scenarios, KVM allows slot gfn ranges that
are not hugepage aligned, so guest_memfd should also be able to
support non-hugepage aligned memslots.
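
As a rough illustration of the condition above, here is a stand-alone
user-space sketch. The function name and its parameters are hypothetical (no
such helper exists in the series), hpage_size stands in for
huge_page_size(h), and 4KB base pages are assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/*
 * Hypothetical binding-time check: the byte offset into guest_memfd and the
 * slot's base GPA must have the same offset within a huge page.
 * hpage_size is in bytes and must be a power of two.
 */
static bool gmem_binding_is_congruent(uint64_t offset, uint64_t base_gfn,
				      uint64_t hpage_size)
{
	uint64_t in_page_mask = hpage_size - 1;	/* ~huge_page_mask(h) */

	assert(hpage_size && !(hpage_size & in_page_mask));
	return (offset & in_page_mask) ==
	       ((base_gfn << PAGE_SHIFT) & in_page_mask);
}
```

For the example in this thread (base_gfn = 2, offset 0, 2MB pages) the check
fails, whereas an offset of 8KB, or 2MB+8KB, would pass.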

> > >
> > >
> > >> However, for TDX, GFN 6M and GFN 6M+4K should not belong to the same folio.
> > >> It's also weird for a 2M mapping in KVM to stride across 2 huge folios.
> > >>
> > >>> Hence I think it is okay to leave it to KVM to fault pages into the
> > >>> guest correctly. For guest_memfd will just maintain the invariant that
> > >>> offset and size are hugepage aligned, but not require that
> > >>> slot->base_gfn and slot->npages are hugepage aligned. This behavior will
> > >>> be consistent with other backing memory for guests like regular shmem or
> > >>> HugeTLB.
> > >>>
> > >>>>> +       ret = kvm_gmem_hugetlb_filemap_add_folio(inode->i_mapping, folio,
> > >>>>> +                                                aligned_index,
> > >>>>> +                                                htlb_alloc_mask(hgmem->h));
> > >>>>> +       WARN_ON(ret);
> > >>>>> +
> > >>>>>         spin_lock(&inode->i_lock);
> > >>>>>         inode->i_blocks += blocks_per_huge_page(hgmem->h);
> > >>>>>         spin_unlock(&inode->i_lock);
> > >>>>>
> > >>>>> -       return page_folio(requested_page);
> > >>>>> +       return folio;
> > >>>>> +}
> > >
> >


* Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
  2025-04-24 14:10               ` Vishal Annapurve
@ 2025-04-24 18:15                 ` Ackerley Tng
  2025-04-25  4:02                   ` Yan Zhao
  0 siblings, 1 reply; 130+ messages in thread
From: Ackerley Tng @ 2025-04-24 18:15 UTC (permalink / raw)
  To: Vishal Annapurve, Yan Zhao
  Cc: Chenyi Qiang, tabba, quic_eberman, roypat, jgg, peterx, david,
	rientjes, fvdl, jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du,
	jun.miao, isaku.yamahata, muchun.song, erdemaktas, qperret,
	jhubbard, willy, shuah, brauner, bfoster, kent.overstreet, pvorel,
	rppt, richard.weiyang, anup, haibo1.xu, ajones, vkuznets,
	maciej.wieczor-retman, pgonda, oliver.upton, linux-kernel,
	linux-mm, kvm, linux-kselftest

Vishal Annapurve <vannapurve@google.com> writes:

> On Thu, Apr 24, 2025 at 1:15 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>>
>> On Thu, Apr 24, 2025 at 01:55:51PM +0800, Chenyi Qiang wrote:
>> >
>> >
>> > On 4/24/2025 12:25 PM, Yan Zhao wrote:
>> > > On Thu, Apr 24, 2025 at 09:09:22AM +0800, Yan Zhao wrote:
>> > >> On Wed, Apr 23, 2025 at 03:02:02PM -0700, Ackerley Tng wrote:
>> > >>> Yan Zhao <yan.y.zhao@intel.com> writes:
>> > >>>
>> > >>>> On Tue, Sep 10, 2024 at 11:44:10PM +0000, Ackerley Tng wrote:
>> > >>>>> +/*
>> > >>>>> + * Allocates and then caches a folio in the filemap. Returns a folio with
>> > >>>>> + * refcount of 2: 1 after allocation, and 1 taken by the filemap.
>> > >>>>> + */
>> > >>>>> +static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
>> > >>>>> +                                                           pgoff_t index)
>> > >>>>> +{
>> > >>>>> +       struct kvm_gmem_hugetlb *hgmem;
>> > >>>>> +       pgoff_t aligned_index;
>> > >>>>> +       struct folio *folio;
>> > >>>>> +       int nr_pages;
>> > >>>>> +       int ret;
>> > >>>>> +
>> > >>>>> +       hgmem = kvm_gmem_hgmem(inode);
>> > >>>>> +       folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
>> > >>>>> +       if (IS_ERR(folio))
>> > >>>>> +               return folio;
>> > >>>>> +
>> > >>>>> +       nr_pages = 1UL << huge_page_order(hgmem->h);
>> > >>>>> +       aligned_index = round_down(index, nr_pages);
>> > >>>> Maybe a gap here.
>> > >>>>
>> > >>>> When a guest_memfd is bound to a slot where slot->base_gfn is not aligned to
>> > >>>> 2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, the
>> > >>>> corresponding GFN is not 2M/1G aligned.
>> > >>>
>> > >>> Thanks for looking into this.
>> > >>>
>> > >>> In 1G page support for guest_memfd, the offset and size are always
>> > >>> hugepage aligned to the hugepage size requested at guest_memfd creation
>> > >>> time, and it is true that when binding to a memslot, slot->base_gfn and
>> > >>> slot->npages may not be hugepage aligned.
>> > >>>
>> > >>>>
>> > >>>> However, TDX requires that private huge pages be 2M aligned in GFN.
>> > >>>>
>> > >>>
>> > >>> IIUC other factors also contribute to determining the mapping level in
>> > >>> the guest page tables, like lpage_info and .private_max_mapping_level()
>> > >>> in kvm_x86_ops.
>> > >>>
>> > >>> If slot->base_gfn and slot->npages are not hugepage aligned, lpage_info
>> > >>> will track that and not allow faulting into guest page tables at higher
>> > >>> granularity.
>> > >>
>> > >> lpage_info only checks the alignments of slot->base_gfn and
>> > >> slot->base_gfn + npages. e.g.,
>> > >>
>> > >> if slot->base_gfn is 8K, npages is 8M, then for this slot,
>> > >> lpage_info[2M][0].disallow_lpage = 1, which is for GFN [4K, 2M+8K);
>> > >> lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M+8K, 4M+8K);
>> > >> lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M+8K, 6M+8K);
>> > >> lpage_info[2M][3].disallow_lpage = 1, which is for GFN [6M+8K, 8M+8K);
>> >
>> > Should it be?
>> > lpage_info[2M][0].disallow_lpage = 1, which is for GFN [8K, 2M);
>> > lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M, 4M);
>> > lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M, 6M);
>> > lpage_info[2M][3].disallow_lpage = 0, which is for GFN [6M, 8M);
>> > lpage_info[2M][4].disallow_lpage = 1, which is for GFN [8M, 8M+8K);
>> Right. Good catch. Thanks!
>>
>> Let me update the example as below:
>> slot->base_gfn is 2 (for GPA 8KB), npages is 2048 (for an 8MB range)
>>
>> lpage_info[2M][0].disallow_lpage = 1, which is for GPA [8KB, 2MB);
>> lpage_info[2M][1].disallow_lpage = 0, which is for GPA [2MB, 4MB);
>> lpage_info[2M][2].disallow_lpage = 0, which is for GPA [4MB, 6MB);
>> lpage_info[2M][3].disallow_lpage = 0, which is for GPA [6MB, 8MB);
>> lpage_info[2M][4].disallow_lpage = 1, which is for GPA [8MB, 8MB+8KB);
>>
>> lpage_info indicates that a 2MB mapping is allowed to cover GPA 4MB and GPA
>> 4MB+16KB. However, their aligned_index values lead guest_memfd to allocate two
>> 2MB folios, whose physical addresses may not be contiguous.
>>
>> Additionally, if the guest accesses two GPAs, e.g., GPA 2MB+8KB and GPA 4MB,
>> KVM could create two 2MB mappings to cover GPA ranges [2MB, 4MB), [4MB, 6MB).
>> However, guest_memfd just allocates the same 2MB folio for both faults.
>>
>>
>> >
>> > >>
>> > >>   ---------------------------------------------------------
>> > >>   |          |  |          |  |          |  |          |  |
>> > >>   8K        2M 2M+8K      4M  4M+8K     6M  6M+8K     8M  8M+8K
>> > >>
>> > >> For GFN 6M and GFN 6M+4K, as they both belong to lpage_info[2M][2], huge
>> > >> page is allowed. Also, they have the same aligned_index 2 in guest_memfd.
>> > >> So, guest_memfd allocates the same huge folio of 2M order for them.
>> > > Sorry, sent too fast this morning. The example is not right. The correct
>> > > one is:
>> > >
>> > > For GFN 4M and GFN 4M+16K, lpage_info indicates that 2M is allowed. So,
>> > > KVM will create a 2M mapping for them.
>> > >
>> > > However, in guest_memfd, GFN 4M and GFN 4M+16K do not correspond to the
>> > > same 2M folio and physical addresses may not be contiguous.
>
> Then during binding, guest memfd offset misalignment with hugepage
> should be same as gfn misalignment. i.e.
>
> (offset & ~huge_page_mask(h)) == ((slot->base_gfn << PAGE_SHIFT) &
> ~huge_page_mask(h));
>
> For non guest_memfd backed scenarios, KVM allows slot gfn ranges that
> are not hugepage aligned, so guest_memfd should also be able to
> support non-hugepage aligned memslots.
>

I drew up a picture [1] which hopefully clarifies this.

Thanks for pointing this out, I understand better now and we will add an
extra constraint during memslot binding of guest_memfd to check that GFN
offsets within a hugepage match the guest_memfd offsets within a hugepage.

Adding checks at binding time will allow hugepage-unaligned offsets (to
be at parity with non-guest_memfd backing memory) but still fix this
issue.

lpage_info will make sure that ranges near the bounds will be
fragmented, but the hugepages in the middle will still be mappable as
hugepages.
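
A simplified user-space model of that boundary behaviour, as I understand it,
is sketched below. It only models the head/tail alignment checks at the 2MB
level with 4KB base pages; the function names are invented for illustration,
and other reasons KVM may disallow a large page are ignored.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PAGES_PER_2M 512ULL	/* 2MB / 4KB */

/* Number of 2MB-level lpage_info entries needed to cover the slot. */
static uint64_t lpage_count_2m(uint64_t base_gfn, uint64_t npages)
{
	assert(npages > 0);
	return ((base_gfn + npages - 1) / PAGES_PER_2M) -
	       (base_gfn / PAGES_PER_2M) + 1;
}

/* True if entry idx is disallowed only because a slot boundary is unaligned. */
static bool lpage_disallowed_2m(uint64_t base_gfn, uint64_t npages,
				uint64_t idx)
{
	uint64_t last = lpage_count_2m(base_gfn, npages) - 1;

	assert(idx <= last);
	if (idx == 0 && (base_gfn % PAGES_PER_2M))
		return true;	/* slot starts in the middle of a 2MB region */
	if (idx == last && ((base_gfn + npages) % PAGES_PER_2M))
		return true;	/* slot ends in the middle of a 2MB region */
	return false;
}
```

For base_gfn = 2 and npages = 2048 this gives five entries, with only the
first and the last disallowed, so the three 2MB regions in the middle remain
mappable as hugepages.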

[1] https://lpc.events/event/18/contributions/1764/attachments/1409/3706/binding-must-have-same-alignment.svg

>> > >
>> > >
>> > >> However, for TDX, GFN 6M and GFN 6M+4K should not belong to the same folio.
>> > >> It's also weird for a 2M mapping in KVM to stride across 2 huge folios.
>> > >>
>> > >>> Hence I think it is okay to leave it to KVM to fault pages into the
>> > >>> guest correctly. For guest_memfd will just maintain the invariant that
>> > >>> offset and size are hugepage aligned, but not require that
>> > >>> slot->base_gfn and slot->npages are hugepage aligned. This behavior will
>> > >>> be consistent with other backing memory for guests like regular shmem or
>> > >>> HugeTLB.
>> > >>>
>> > >>>>> +       ret = kvm_gmem_hugetlb_filemap_add_folio(inode->i_mapping, folio,
>> > >>>>> +                                                aligned_index,
>> > >>>>> +                                                htlb_alloc_mask(hgmem->h));
>> > >>>>> +       WARN_ON(ret);
>> > >>>>> +
>> > >>>>>         spin_lock(&inode->i_lock);
>> > >>>>>         inode->i_blocks += blocks_per_huge_page(hgmem->h);
>> > >>>>>         spin_unlock(&inode->i_lock);
>> > >>>>>
>> > >>>>> -       return page_folio(requested_page);
>> > >>>>> +       return folio;
>> > >>>>> +}
>> > >
>> >


* Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
  2025-04-24 18:15                 ` Ackerley Tng
@ 2025-04-25  4:02                   ` Yan Zhao
  2025-04-25 22:45                     ` Ackerley Tng
  0 siblings, 1 reply; 130+ messages in thread
From: Yan Zhao @ 2025-04-25  4:02 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Vishal Annapurve, Chenyi Qiang, tabba, quic_eberman, roypat, jgg,
	peterx, david, rientjes, fvdl, jthoughton, seanjc, pbonzini,
	zhiquan1.li, fan.du, jun.miao, isaku.yamahata, muchun.song,
	erdemaktas, qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest

On Thu, Apr 24, 2025 at 11:15:11AM -0700, Ackerley Tng wrote:
> Vishal Annapurve <vannapurve@google.com> writes:
> 
> > On Thu, Apr 24, 2025 at 1:15 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >>
> >> On Thu, Apr 24, 2025 at 01:55:51PM +0800, Chenyi Qiang wrote:
> >> >
> >> >
> >> > On 4/24/2025 12:25 PM, Yan Zhao wrote:
> >> > > On Thu, Apr 24, 2025 at 09:09:22AM +0800, Yan Zhao wrote:
> >> > >> On Wed, Apr 23, 2025 at 03:02:02PM -0700, Ackerley Tng wrote:
> >> > >>> Yan Zhao <yan.y.zhao@intel.com> writes:
> >> > >>>
> >> > >>>> On Tue, Sep 10, 2024 at 11:44:10PM +0000, Ackerley Tng wrote:
> >> > >>>>> +/*
> >> > >>>>> + * Allocates and then caches a folio in the filemap. Returns a folio with
> >> > >>>>> + * refcount of 2: 1 after allocation, and 1 taken by the filemap.
> >> > >>>>> + */
> >> > >>>>> +static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
> >> > >>>>> +                                                           pgoff_t index)
> >> > >>>>> +{
> >> > >>>>> +       struct kvm_gmem_hugetlb *hgmem;
> >> > >>>>> +       pgoff_t aligned_index;
> >> > >>>>> +       struct folio *folio;
> >> > >>>>> +       int nr_pages;
> >> > >>>>> +       int ret;
> >> > >>>>> +
> >> > >>>>> +       hgmem = kvm_gmem_hgmem(inode);
> >> > >>>>> +       folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
> >> > >>>>> +       if (IS_ERR(folio))
> >> > >>>>> +               return folio;
> >> > >>>>> +
> >> > >>>>> +       nr_pages = 1UL << huge_page_order(hgmem->h);
> >> > >>>>> +       aligned_index = round_down(index, nr_pages);
> >> > >>>> Maybe a gap here.
> >> > >>>>
> >> > >>>> When a guest_memfd is bound to a slot where slot->base_gfn is not aligned to
> >> > >>>> 2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, the
> >> > >>>> corresponding GFN is not 2M/1G aligned.
> >> > >>>
> >> > >>> Thanks for looking into this.
> >> > >>>
> >> > >>> In 1G page support for guest_memfd, the offset and size are always
> >> > >>> hugepage aligned to the hugepage size requested at guest_memfd creation
> >> > >>> time, and it is true that when binding to a memslot, slot->base_gfn and
> >> > >>> slot->npages may not be hugepage aligned.
> >> > >>>
> >> > >>>>
> >> > >>>> However, TDX requires that private huge pages be 2M aligned in GFN.
> >> > >>>>
> >> > >>>
> >> > >>> IIUC other factors also contribute to determining the mapping level in
> >> > >>> the guest page tables, like lpage_info and .private_max_mapping_level()
> >> > >>> in kvm_x86_ops.
> >> > >>>
> >> > >>> If slot->base_gfn and slot->npages are not hugepage aligned, lpage_info
> >> > >>> will track that and not allow faulting into guest page tables at higher
> >> > >>> granularity.
> >> > >>
> >> > >> lpage_info only checks the alignments of slot->base_gfn and
> >> > >> slot->base_gfn + npages. e.g.,
> >> > >>
> >> > >> if slot->base_gfn is 8K, npages is 8M, then for this slot,
> >> > >> lpage_info[2M][0].disallow_lpage = 1, which is for GFN [4K, 2M+8K);
> >> > >> lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M+8K, 4M+8K);
> >> > >> lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M+8K, 6M+8K);
> >> > >> lpage_info[2M][3].disallow_lpage = 1, which is for GFN [6M+8K, 8M+8K);
> >> >
> >> > Should it be?
> >> > lpage_info[2M][0].disallow_lpage = 1, which is for GFN [8K, 2M);
> >> > lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M, 4M);
> >> > lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M, 6M);
> >> > lpage_info[2M][3].disallow_lpage = 0, which is for GFN [6M, 8M);
> >> > lpage_info[2M][4].disallow_lpage = 1, which is for GFN [8M, 8M+8K);
> >> Right. Good catch. Thanks!
> >>
> >> Let me update the example as below:
> >> slot->base_gfn is 2 (for GPA 8KB), npages 2000 (for a 8MB range)
> >>
> >> lpage_info[2M][0].disallow_lpage = 1, which is for GPA [8KB, 2MB);
> >> lpage_info[2M][1].disallow_lpage = 0, which is for GPA [2MB, 4MB);
> >> lpage_info[2M][2].disallow_lpage = 0, which is for GPA [4MB, 6MB);
> >> lpage_info[2M][3].disallow_lpage = 0, which is for GPA [6MB, 8MB);
> >> lpage_info[2M][4].disallow_lpage = 1, which is for GPA [8MB, 8MB+8KB);
> >>
> >> lpage_info indicates that a 2MB mapping is allowed to cover GPA 4MB and GPA
> >> 4MB+16KB. However, their aligned_index values lead guest_memfd to allocate two
> >> 2MB folios, whose physical addresses may not be contiguous.
> >>
> >> Additionally, if the guest accesses two GPAs, e.g., GPA 2MB+8KB and GPA 4MB,
> >> KVM could create two 2MB mappings to cover GPA ranges [2MB, 4MB), [4MB, 6MB).
> >> However, guest_memfd just allocates the same 2MB folio for both faults.
> >>
> >>
> >> >
> >> > >>
> >> > >>   ---------------------------------------------------------
> >> > >>   |          |  |          |  |          |  |          |  |
> >> > >>   8K        2M 2M+8K      4M  4M+8K     6M  6M+8K     8M  8M+8K
> >> > >>
> >> > >> For GFN 6M and GFN 6M+4K, as they both belong to lpage_info[2M][2], huge
> >> > >> page is allowed. Also, they have the same aligned_index 2 in guest_memfd.
> >> > >> So, guest_memfd allocates the same huge folio of 2M order for them.
> >> > > Sorry, sent too fast this morning. The example is not right. The correct
> >> > > one is:
> >> > >
> >> > > For GFN 4M and GFN 4M+16K, lpage_info indicates that 2M is allowed. So,
> >> > > KVM will create a 2M mapping for them.
> >> > >
> >> > > However, in guest_memfd, GFN 4M and GFN 4M+16K do not correspond to the
> >> > > same 2M folio and physical addresses may not be contiguous.
> >
> > Then during binding, guest memfd offset misalignment with hugepage
> > should be same as gfn misalignment. i.e.
> >
> > (offset & ~huge_page_mask(h)) == ((slot->base_gfn << PAGE_SHIFT) &
> > ~huge_page_mask(h));
> >
> > For non guest_memfd backed scenarios, KVM allows slot gfn ranges that
> > are not hugepage aligned, so guest_memfd should also be able to
> > support non-hugepage aligned memslots.
> >
> 
> I drew up a picture [1] which hopefully clarifies this.
> 
> Thanks for pointing this out, I understand better now and we will add an
> extra constraint during memslot binding of guest_memfd to check that gfn
> offsets within a hugepage must be guest_memfd offsets.
I'm a bit confused.

As "index = gfn - slot->base_gfn + slot->gmem.pgoff", do you mean you are going
to force "slot->base_gfn == slot->gmem.pgoff" ?

For some memory region, e.g., "pc.ram", it's divided into 2 parts:
- one with offset 0, size 0x80000000(2G),
  positioned at GPA 0, which is below GPA 4G;
- one with offset 0x80000000(2G), size 0x80000000(2G),
  positioned at GPA 0x100000000(4G), which is above GPA 4G.

For the second part, its slot->base_gfn is 0x100000 (GPA 0x100000000), while
slot->gmem.pgoff is 0x80000 (offset 0x80000000).
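
To make the question concrete, here is a small standalone C sketch (userspace,
not kernel code) of the index computation above. The struct and helper names
are hypothetical stand-ins for slot->base_gfn and slot->gmem.pgoff, and 4KB
base pages are assumed.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two memslot fields discussed above. */
struct slot_sketch {
	uint64_t base_gfn;	/* first GFN of the slot, in 4KB pages */
	uint64_t pgoff;		/* guest_memfd offset of the slot, in 4KB pages */
};

/* index = gfn - slot->base_gfn + slot->gmem.pgoff */
static uint64_t gfn_to_index(const struct slot_sketch *s, uint64_t gfn)
{
	assert(gfn >= s->base_gfn);
	return gfn - s->base_gfn + s->pgoff;
}

/* Same rounding as round_down(index, nr_pages) for a power-of-two nr_pages. */
static uint64_t aligned_index(uint64_t index, uint64_t nr_pages)
{
	assert(nr_pages && (nr_pages & (nr_pages - 1)) == 0);
	return index & ~(nr_pages - 1);
}

/* True if two GFNs are backed by the same hugepage-sized folio. */
static bool same_folio(const struct slot_sketch *s, uint64_t gfn_a,
		       uint64_t gfn_b, uint64_t nr_pages)
{
	return aligned_index(gfn_to_index(s, gfn_a), nr_pages) ==
	       aligned_index(gfn_to_index(s, gfn_b), nr_pages);
}
```

With base_gfn = 2, pgoff = 0 and nr_pages = 512 (2MB), GFN 1024 (GPA 4MB) and
GFN 1028 (GPA 4MB+16KB) end up in different folios, while GFN 514 (GPA 2MB+8KB)
and GFN 1024 share one. For the second "pc.ram" part (GPA 4G is GFN 0x100000,
offset 2G is page offset 0x80000), every GFN in a 1GB-aligned range shares a
folio even though base_gfn != pgoff.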

> Adding checks at binding time will allow hugepage-unaligned offsets (to
> be at parity with non-guest_memfd backing memory) but still fix this
> issue.
> 
> lpage_info will make sure that ranges near the bounds will be
> fragmented, but the hugepages in the middle will still be mappable as
> hugepages.
> 
> [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3706/binding-must-have-same-alignment.svg




* Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
  2025-04-25  4:02                   ` Yan Zhao
@ 2025-04-25 22:45                     ` Ackerley Tng
  2025-04-28  1:05                       ` Yan Zhao
  0 siblings, 1 reply; 130+ messages in thread
From: Ackerley Tng @ 2025-04-25 22:45 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Vishal Annapurve, Chenyi Qiang, tabba, quic_eberman, roypat, jgg,
	peterx, david, rientjes, fvdl, jthoughton, seanjc, pbonzini,
	zhiquan1.li, fan.du, jun.miao, isaku.yamahata, muchun.song,
	erdemaktas, qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Thu, Apr 24, 2025 at 11:15:11AM -0700, Ackerley Tng wrote:
>> Vishal Annapurve <vannapurve@google.com> writes:
>> 
>> > On Thu, Apr 24, 2025 at 1:15 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>> >>
>> >> On Thu, Apr 24, 2025 at 01:55:51PM +0800, Chenyi Qiang wrote:
>> >> >
>> >> >
>> >> > On 4/24/2025 12:25 PM, Yan Zhao wrote:
>> >> > > On Thu, Apr 24, 2025 at 09:09:22AM +0800, Yan Zhao wrote:
>> >> > >> On Wed, Apr 23, 2025 at 03:02:02PM -0700, Ackerley Tng wrote:
>> >> > >>> Yan Zhao <yan.y.zhao@intel.com> writes:
>> >> > >>>
>> >> > >>>> On Tue, Sep 10, 2024 at 11:44:10PM +0000, Ackerley Tng wrote:
>> >> > >>>>> +/*
>> >> > >>>>> + * Allocates and then caches a folio in the filemap. Returns a folio with
>> >> > >>>>> + * refcount of 2: 1 after allocation, and 1 taken by the filemap.
>> >> > >>>>> + */
>> >> > >>>>> +static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
>> >> > >>>>> +                                                           pgoff_t index)
>> >> > >>>>> +{
>> >> > >>>>> +       struct kvm_gmem_hugetlb *hgmem;
>> >> > >>>>> +       pgoff_t aligned_index;
>> >> > >>>>> +       struct folio *folio;
>> >> > >>>>> +       int nr_pages;
>> >> > >>>>> +       int ret;
>> >> > >>>>> +
>> >> > >>>>> +       hgmem = kvm_gmem_hgmem(inode);
>> >> > >>>>> +       folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
>> >> > >>>>> +       if (IS_ERR(folio))
>> >> > >>>>> +               return folio;
>> >> > >>>>> +
>> >> > >>>>> +       nr_pages = 1UL << huge_page_order(hgmem->h);
>> >> > >>>>> +       aligned_index = round_down(index, nr_pages);
>> >> > >>>> Maybe a gap here.
>> >> > >>>>
>> >> > >>>> When a guest_memfd is bound to a slot where slot->base_gfn is not aligned to
>> >> > >>>> 2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, the
>> >> > >>>> corresponding GFN is not 2M/1G aligned.
>> >> > >>>
>> >> > >>> Thanks for looking into this.
>> >> > >>>
>> >> > >>> In 1G page support for guest_memfd, the offset and size are always
>> >> > >>> hugepage aligned to the hugepage size requested at guest_memfd creation
>> >> > >>> time, and it is true that when binding to a memslot, slot->base_gfn and
>> >> > >>> slot->npages may not be hugepage aligned.
>> >> > >>>
>> >> > >>>>
>> >> > >>>> However, TDX requires that private huge pages be 2M aligned in GFN.
>> >> > >>>>
>> >> > >>>
>> >> > >>> IIUC other factors also contribute to determining the mapping level in
>> >> > >>> the guest page tables, like lpage_info and .private_max_mapping_level()
>> >> > >>> in kvm_x86_ops.
>> >> > >>>
>> >> > >>> If slot->base_gfn and slot->npages are not hugepage aligned, lpage_info
>> >> > >>> will track that and not allow faulting into guest page tables at higher
>> >> > >>> granularity.
>> >> > >>
>> >> > >> lpage_info only checks the alignments of slot->base_gfn and
>> >> > >> slot->base_gfn + npages. e.g.,
>> >> > >>
>> >> > >> if slot->base_gfn is 8K, npages is 8M, then for this slot,
>> >> > >> lpage_info[2M][0].disallow_lpage = 1, which is for GFN [4K, 2M+8K);
>> >> > >> lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M+8K, 4M+8K);
>> >> > >> lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M+8K, 6M+8K);
>> >> > >> lpage_info[2M][3].disallow_lpage = 1, which is for GFN [6M+8K, 8M+8K);
>> >> >
>> >> > Should it be?
>> >> > lpage_info[2M][0].disallow_lpage = 1, which is for GFN [8K, 2M);
>> >> > lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M, 4M);
>> >> > lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M, 6M);
>> >> > lpage_info[2M][3].disallow_lpage = 0, which is for GFN [6M, 8M);
>> >> > lpage_info[2M][4].disallow_lpage = 1, which is for GFN [8M, 8M+8K);
>> >> Right. Good catch. Thanks!
>> >>
>> >> Let me update the example as below:
>> >> slot->base_gfn is 2 (for GPA 8KB), npages 2000 (for a 8MB range)
>> >>
>> >> lpage_info[2M][0].disallow_lpage = 1, which is for GPA [8KB, 2MB);
>> >> lpage_info[2M][1].disallow_lpage = 0, which is for GPA [2MB, 4MB);
>> >> lpage_info[2M][2].disallow_lpage = 0, which is for GPA [4MB, 6MB);
>> >> lpage_info[2M][3].disallow_lpage = 0, which is for GPA [6MB, 8MB);
>> >> lpage_info[2M][4].disallow_lpage = 1, which is for GPA [8MB, 8MB+8KB);
>> >>
>> >> lpage_info indicates that a 2MB mapping is allowed to cover GPA 4MB and GPA
>> >> 4MB+16KB. However, their aligned_index values lead guest_memfd to allocate two
>> >> 2MB folios, whose physical addresses may not be contiguous.
>> >>
>> >> Additionally, if the guest accesses two GPAs, e.g., GPA 2MB+8KB and GPA 4MB,
>> >> KVM could create two 2MB mappings to cover GPA ranges [2MB, 4MB), [4MB, 6MB).
>> >> However, guest_memfd just allocates the same 2MB folio for both faults.
>> >>
>> >>
>> >> >
>> >> > >>
>> >> > >>   ---------------------------------------------------------
>> >> > >>   |          |  |          |  |          |  |          |  |
>> >> > >>   8K        2M 2M+8K      4M  4M+8K     6M  6M+8K     8M  8M+8K
>> >> > >>
>> >> > >> For GFN 6M and GFN 6M+4K, as they both belong to lpage_info[2M][2], huge
>> >> > >> page is allowed. Also, they have the same aligned_index 2 in guest_memfd.
>> >> > >> So, guest_memfd allocates the same huge folio of 2M order for them.
>> >> > > Sorry, sent too fast this morning. The example is not right. The correct
>> >> > > one is:
>> >> > >
>> >> > > For GFN 4M and GFN 4M+16K, lpage_info indicates that 2M is allowed. So,
>> >> > > KVM will create a 2M mapping for them.
>> >> > >
>> >> > > However, in guest_memfd, GFN 4M and GFN 4M+16K do not correspond to the
>> >> > > same 2M folio and physical addresses may not be contiguous.
>> >
>> > Then during binding, guest memfd offset misalignment with hugepage
>> > should be same as gfn misalignment. i.e.
>> >
>> > (offset & ~huge_page_mask(h)) == ((slot->base_gfn << PAGE_SHIFT) &
>> > ~huge_page_mask(h));
>> >
>> > For non guest_memfd backed scenarios, KVM allows slot gfn ranges that
>> > are not hugepage aligned, so guest_memfd should also be able to
>> > support non-hugepage aligned memslots.
>> >
>> 
>> I drew up a picture [1] which hopefully clarifies this.
>> 
>> Thanks for pointing this out, I understand better now and we will add an
>> extra constraint during memslot binding of guest_memfd to check that gfn
>> offsets within a hugepage must be guest_memfd offsets.
> I'm a bit confused.
>
> As "index = gfn - slot->base_gfn + slot->gmem.pgoff", do you mean you are going
> to force "slot->base_gfn == slot->gmem.pgoff" ?
>
> For some memory region, e.g., "pc.ram", it's divided into 2 parts:
> - one with offset 0, size 0x80000000(2G),
>   positioned at GPA 0, which is below GPA 4G;
> - one with offset 0x80000000(2G), size 0x80000000(2G),
>   positioned at GPA 0x100000000(4G), which is above GPA 4G.
>
> For the second part, its slot->base_gfn is 0x100000000, while slot->gmem.pgoff
> is 0x80000000.
>

Nope, I don't mean to enforce that they are equal; we just need the
offsets within the hugepage to be equal.

I edited Vishal's code snippet, perhaps it would help explain better:

page_size is the size of the hugepage, so in our example,

  page_size = SZ_2M;
  page_mask = ~(page_size - 1);
  offset_within_page = slot->gmem.pgoff & page_mask;
  gfn_within_page = (slot->base_gfn << PAGE_SHIFT) & page_mask;

We will enforce that

  offset_within_page == gfn_within_page;
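
Roughly, as a standalone userspace sketch rather than the actual patch: this
assumes the mask is meant to keep only the low bits within a hugepage, and
that both values are compared in units of 4KB pages. The function name is made
up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical binding-time check: the slot's first GFN and its guest_memfd
 * page offset must sit at the same position within a hugepage.
 *
 * base_gfn and pgoff are in 4KB pages; nr_pages is the hugepage size in
 * 4KB pages (512 for 2MB, 262144 for 1GB) and must be a power of two.
 */
static bool binding_alignment_ok(uint64_t base_gfn, uint64_t pgoff,
				 uint64_t nr_pages)
{
	uint64_t within_page_mask;

	assert(nr_pages && (nr_pages & (nr_pages - 1)) == 0);
	within_page_mask = nr_pages - 1;

	return (base_gfn & within_page_mask) == (pgoff & within_page_mask);
}
```

Under these assumptions, the earlier example (base_gfn 2, pgoff 0, 2MB pages)
would be rejected, while a slot at GPA 4G backed by guest_memfd offset 2G would
be accepted for 1GB pages because both are 1GB aligned.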

>> Adding checks at binding time will allow hugepage-unaligned offsets (to
>> be at parity with non-guest_memfd backing memory) but still fix this
>> issue.
>> 
>> lpage_info will make sure that ranges near the bounds will be
>> fragmented, but the hugepages in the middle will still be mappable as
>> hugepages.
>> 
>> [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3706/binding-must-have-same-alignment.svg


* Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
  2025-04-25 22:45                     ` Ackerley Tng
@ 2025-04-28  1:05                       ` Yan Zhao
  2025-04-28 19:02                         ` Vishal Annapurve
  2025-04-30 20:09                         ` Ackerley Tng
  0 siblings, 2 replies; 130+ messages in thread
From: Yan Zhao @ 2025-04-28  1:05 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: Vishal Annapurve, Chenyi Qiang, tabba, quic_eberman, roypat, jgg,
	peterx, david, rientjes, fvdl, jthoughton, seanjc, pbonzini,
	zhiquan1.li, fan.du, jun.miao, isaku.yamahata, muchun.song,
	erdemaktas, qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest

On Fri, Apr 25, 2025 at 03:45:20PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > On Thu, Apr 24, 2025 at 11:15:11AM -0700, Ackerley Tng wrote:
> >> Vishal Annapurve <vannapurve@google.com> writes:
> >> 
> >> > On Thu, Apr 24, 2025 at 1:15 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >> >>
> >> >> On Thu, Apr 24, 2025 at 01:55:51PM +0800, Chenyi Qiang wrote:
> >> >> >
> >> >> >
> >> >> > On 4/24/2025 12:25 PM, Yan Zhao wrote:
> >> >> > > On Thu, Apr 24, 2025 at 09:09:22AM +0800, Yan Zhao wrote:
> >> >> > >> On Wed, Apr 23, 2025 at 03:02:02PM -0700, Ackerley Tng wrote:
> >> >> > >>> Yan Zhao <yan.y.zhao@intel.com> writes:
> >> >> > >>>
> >> >> > >>>> On Tue, Sep 10, 2024 at 11:44:10PM +0000, Ackerley Tng wrote:
> >> >> > >>>>> +/*
> >> >> > >>>>> + * Allocates and then caches a folio in the filemap. Returns a folio with
> >> >> > >>>>> + * refcount of 2: 1 after allocation, and 1 taken by the filemap.
> >> >> > >>>>> + */
> >> >> > >>>>> +static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
> >> >> > >>>>> +                                                           pgoff_t index)
> >> >> > >>>>> +{
> >> >> > >>>>> +       struct kvm_gmem_hugetlb *hgmem;
> >> >> > >>>>> +       pgoff_t aligned_index;
> >> >> > >>>>> +       struct folio *folio;
> >> >> > >>>>> +       int nr_pages;
> >> >> > >>>>> +       int ret;
> >> >> > >>>>> +
> >> >> > >>>>> +       hgmem = kvm_gmem_hgmem(inode);
> >> >> > >>>>> +       folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
> >> >> > >>>>> +       if (IS_ERR(folio))
> >> >> > >>>>> +               return folio;
> >> >> > >>>>> +
> >> >> > >>>>> +       nr_pages = 1UL << huge_page_order(hgmem->h);
> >> >> > >>>>> +       aligned_index = round_down(index, nr_pages);
> >> >> > >>>> Maybe a gap here.
> >> >> > >>>>
> >> >> > >>>> When a guest_memfd is bound to a slot where slot->base_gfn is not aligned to
> >> >> > >>>> 2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, the
> >> >> > >>>> corresponding GFN is not 2M/1G aligned.
> >> >> > >>>
> >> >> > >>> Thanks for looking into this.
> >> >> > >>>
> >> >> > >>> In 1G page support for guest_memfd, the offset and size are always
> >> >> > >>> hugepage aligned to the hugepage size requested at guest_memfd creation
> >> >> > >>> time, and it is true that when binding to a memslot, slot->base_gfn and
> >> >> > >>> slot->npages may not be hugepage aligned.
> >> >> > >>>
> >> >> > >>>>
> >> >> > >>>> However, TDX requires that private huge pages be 2M aligned in GFN.
> >> >> > >>>>
> >> >> > >>>
> >> >> > >>> IIUC other factors also contribute to determining the mapping level in
> >> >> > >>> the guest page tables, like lpage_info and .private_max_mapping_level()
> >> >> > >>> in kvm_x86_ops.
> >> >> > >>>
> >> >> > >>> If slot->base_gfn and slot->npages are not hugepage aligned, lpage_info
> >> >> > >>> will track that and not allow faulting into guest page tables at higher
> >> >> > >>> granularity.
> >> >> > >>
> >> >> > >> lpage_info only checks the alignments of slot->base_gfn and
> >> >> > >> slot->base_gfn + npages. e.g.,
> >> >> > >>
> >> >> > >> if slot->base_gfn is 8K, npages is 8M, then for this slot,
> >> >> > >> lpage_info[2M][0].disallow_lpage = 1, which is for GFN [4K, 2M+8K);
> >> >> > >> lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M+8K, 4M+8K);
> >> >> > >> lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M+8K, 6M+8K);
> >> >> > >> lpage_info[2M][3].disallow_lpage = 1, which is for GFN [6M+8K, 8M+8K);
> >> >> >
> >> >> > Should it be?
> >> >> > lpage_info[2M][0].disallow_lpage = 1, which is for GFN [8K, 2M);
> >> >> > lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M, 4M);
> >> >> > lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M, 6M);
> >> >> > lpage_info[2M][3].disallow_lpage = 0, which is for GFN [6M, 8M);
> >> >> > lpage_info[2M][4].disallow_lpage = 1, which is for GFN [8M, 8M+8K);
> >> >> Right. Good catch. Thanks!
> >> >>
> >> >> Let me update the example as below:
> >> >> slot->base_gfn is 2 (for GPA 8KB), npages 2000 (for a 8MB range)
> >> >>
> >> >> lpage_info[2M][0].disallow_lpage = 1, which is for GPA [8KB, 2MB);
> >> >> lpage_info[2M][1].disallow_lpage = 0, which is for GPA [2MB, 4MB);
> >> >> lpage_info[2M][2].disallow_lpage = 0, which is for GPA [4MB, 6MB);
> >> >> lpage_info[2M][3].disallow_lpage = 0, which is for GPA [6MB, 8MB);
> >> >> lpage_info[2M][4].disallow_lpage = 1, which is for GPA [8MB, 8MB+8KB);
> >> >>
> >> >> lpage_info indicates that a 2MB mapping is allowed to cover GPA 4MB and GPA
> >> >> 4MB+16KB. However, their aligned_index values lead guest_memfd to allocate two
> >> >> 2MB folios, whose physical addresses may not be contiguous.
> >> >>
> >> >> Additionally, if the guest accesses two GPAs, e.g., GPA 2MB+8KB and GPA 4MB,
> >> >> KVM could create two 2MB mappings to cover GPA ranges [2MB, 4MB), [4MB, 6MB).
> >> >> However, guest_memfd just allocates the same 2MB folio for both faults.
> >> >>
> >> >>
> >> >> >
> >> >> > >>
> >> >> > >>   ---------------------------------------------------------
> >> >> > >>   |          |  |          |  |          |  |          |  |
> >> >> > >>   8K        2M 2M+8K      4M  4M+8K     6M  6M+8K     8M  8M+8K
> >> >> > >>
> >> >> > >> For GFN 6M and GFN 6M+4K, as they both belong to lpage_info[2M][2], huge
> >> >> > >> page is allowed. Also, they have the same aligned_index 2 in guest_memfd.
> >> >> > >> So, guest_memfd allocates the same huge folio of 2M order for them.
> >> >> > > Sorry, sent too fast this morning. The example is not right. The correct
> >> >> > > one is:
> >> >> > >
> >> >> > > For GFN 4M and GFN 4M+16K, lpage_info indicates that 2M is allowed. So,
> >> >> > > KVM will create a 2M mapping for them.
> >> >> > >
> >> >> > > However, in guest_memfd, GFN 4M and GFN 4M+16K do not correspond to the
> >> >> > > same 2M folio and physical addresses may not be contiguous.
> >> >
> >> > Then during binding, guest memfd offset misalignment with hugepage
> >> > should be same as gfn misalignment. i.e.
> >> >
> >> > (offset & ~huge_page_mask(h)) == ((slot->base_gfn << PAGE_SHIFT) &
> >> > ~huge_page_mask(h));
> >> >
> >> > For non guest_memfd backed scenarios, KVM allows slot gfn ranges that
> >> > are not hugepage aligned, so guest_memfd should also be able to
> >> > support non-hugepage aligned memslots.
> >> >
> >> 
> >> I drew up a picture [1] which hopefully clarifies this.
> >> 
> >> Thanks for pointing this out, I understand better now and we will add an
> >> extra constraint during memslot binding of guest_memfd to check that gfn
> >> offsets within a hugepage must be guest_memfd offsets.
> > I'm a bit confused.
> >
> > As "index = gfn - slot->base_gfn + slot->gmem.pgoff", do you mean you are going
> > to force "slot->base_gfn == slot->gmem.pgoff" ?
> >
> > For some memory region, e.g., "pc.ram", it's divided into 2 parts:
> > - one with offset 0, size 0x80000000(2G),
> >   positioned at GPA 0, which is below GPA 4G;
> > - one with offset 0x80000000(2G), size 0x80000000(2G),
> >   positioned at GPA 0x100000000(4G), which is above GPA 4G.
> >
> > For the second part, its slot->base_gfn is 0x100000000, while slot->gmem.pgoff
> > is 0x80000000.
> >
> 
> Nope I don't mean to enforce that they are equal, we just need the
> offsets within the page to be equal.
> 
> I edited Vishal's code snippet, perhaps it would help explain better:
> 
> page_size is the size of the hugepage, so in our example,
> 
>   page_size = SZ_2M;
>   page_mask = ~(page_size - 1);
page_mask = page_size - 1  ?

>   offset_within_page = slot->gmem.pgoff & page_mask;
>   gfn_within_page = (slot->base_gfn << PAGE_SHIFT) & page_mask;
> 
> We will enforce that
> 
>   offset_within_page == gfn_within_page;
For "pc.ram", if it has 2.5G below 4G, it would be configured as follows:
- slot 1: slot->gmem.pgoff=0, base GPA 0, size=2.5G
- slot 2: slot->gmem.pgoff=2.5G, base GPA 4G, size=1.5G

When binding these two slots to the same guest_memfd created with flag
KVM_GUEST_MEMFD_HUGE_1GB: 
- binding the 1st slot will succeed;
- binding the 2nd slot will fail.

What options does userspace have in this scenario?
It can't reduce the flag to KVM_GUEST_MEMFD_HUGE_2MB. Adjusting the gmem.pgoff
isn't ideal either.

What about something like the following?

diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
index d2feacd14786..87c33704a748 100644
--- a/virt/kvm/guest_memfd.c
+++ b/virt/kvm/guest_memfd.c
@@ -1842,8 +1842,16 @@ __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
        }

        *pfn = folio_file_pfn(folio, index);
-       if (max_order)
-               *max_order = folio_order(folio);
+       if (max_order) {
+               int order;
+
+               order = folio_order(folio);
+
+               while (order > 0 && ((slot->base_gfn ^ slot->gmem.pgoff) & ((1 << order) - 1)))
+                       order--;
+
+               *max_order = order;
+       }

        *is_prepared = folio_test_uptodate(folio);
        return folio;
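
For illustration only, the clamping loop from the diff above can be tried out
in userspace with the sketch below. The function name is hypothetical, and the
arguments are plain integers standing in for slot->base_gfn, slot->gmem.pgoff
and folio_order(folio).

```c
#include <assert.h>
#include <stdint.h>

/*
 * Largest order <= folio_order at which base_gfn and pgoff have the same
 * offset within a (1 << order)-page block. Mirrors the loop in the diff
 * above, using a 64-bit shift for the mask.
 */
static int clamp_max_order(uint64_t base_gfn, uint64_t pgoff, int folio_order)
{
	int order = folio_order;

	assert(order >= 0 && order < 63);
	while (order > 0 &&
	       ((base_gfn ^ pgoff) & ((UINT64_C(1) << order) - 1)))
		order--;

	return order;
}
```

For slot 2 in the "pc.ram" example (GPA 4G is GFN 0x100000, offset 2.5G is page
offset 0xA0000) with a 1GB folio (order 18), this yields order 17, which should
still permit 2MB mappings. Slot 1 (both values 0) keeps order 18.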


> >> Adding checks at binding time will allow hugepage-unaligned offsets (to
> >> be at parity with non-guest_memfd backing memory) but still fix this
> >> issue.
> >> 
> >> lpage_info will make sure that ranges near the bounds will be
> >> fragmented, but the hugepages in the middle will still be mappable as
> >> hugepages.
> >> 
> >> [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3706/binding-must-have-same-alignment.svg


* Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
  2025-04-28  1:05                       ` Yan Zhao
@ 2025-04-28 19:02                         ` Vishal Annapurve
  2025-04-30 20:09                         ` Ackerley Tng
  1 sibling, 0 replies; 130+ messages in thread
From: Vishal Annapurve @ 2025-04-28 19:02 UTC (permalink / raw)
  To: Yan Zhao
  Cc: Ackerley Tng, Chenyi Qiang, tabba, quic_eberman, roypat, jgg,
	peterx, david, rientjes, fvdl, jthoughton, seanjc, pbonzini,
	zhiquan1.li, fan.du, jun.miao, isaku.yamahata, muchun.song,
	erdemaktas, qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest

On Sun, Apr 27, 2025 at 6:08 PM Yan Zhao <yan.y.zhao@intel.com> wrote:
>
> On Fri, Apr 25, 2025 at 03:45:20PM -0700, Ackerley Tng wrote:
> > Yan Zhao <yan.y.zhao@intel.com> writes:
> > ...
> > >
> > > For some memory region, e.g., "pc.ram", it's divided into 2 parts:
> > > - one with offset 0, size 0x80000000(2G),
> > >   positioned at GPA 0, which is below GPA 4G;
> > > - one with offset 0x80000000(2G), size 0x80000000(2G),
> > >   positioned at GPA 0x100000000(4G), which is above GPA 4G.
> > >
> > > For the second part, its slot->base_gfn is 0x100000000, while slot->gmem.pgoff
> > > is 0x80000000.
> > >
> >
> > Nope I don't mean to enforce that they are equal, we just need the
> > offsets within the page to be equal.
> >
> > I edited Vishal's code snippet, perhaps it would help explain better:
> >
> > page_size is the size of the hugepage, so in our example,
> >
> >   page_size = SZ_2M;
> >   page_mask = ~(page_size - 1);
> page_mask = page_size - 1  ?
>
> >   offset_within_page = slot->gmem.pgoff & page_mask;
> >   gfn_within_page = (slot->base_gfn << PAGE_SHIFT) & page_mask;
> >
> > We will enforce that
> >
> >   offset_within_page == gfn_within_page;
> For "pc.ram", if it has 2.5G below 4G, it would be configured as follows
> - slot 1: slot->gmem.pgoff=0, base GPA 0, size=2.5G
> - slot 2: slot->gmem.pgoff=2.5G, base GPA 4G, size=1.5G
>
> When binding these two slots to the same guest_memfd created with flag
> KVM_GUEST_MEMFD_HUGE_1GB:
> - binding the 1st slot will succeed;
> - binding the 2nd slot will fail.
>
> What options does userspace have in this scenario?

Userspace can create new gmem files that have aligned offsets. But I
see your point: enforcing alignment at binding time will lead to
wastage of memory, i.e. your example above could be reworked to have:
- slot 1: slot->gmem.pgoff=0, base GPA 0, size=2.5G, gmem_fd = x, gmem_size = 3G
- slot 2: slot->gmem.pgoff=0, base GPA 4G, size=1.5G, gmem_fd = y, gmem_size = 2G

This will waste 1G of memory as gmem files will have to be hugepage aligned.
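
The arithmetic behind that number, as a small standalone C sketch (helper and
macro names are made up for illustration, and it assumes each slot gets its
own gmem file sized up to the next 1G boundary):

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_SZ_1G (UINT64_C(1) << 30)

/* Round size up to a multiple of a power-of-two hugepage size. */
static uint64_t round_up_pow2(uint64_t size, uint64_t huge_size)
{
	assert(huge_size && (huge_size & (huge_size - 1)) == 0);
	return (size + huge_size - 1) & ~(huge_size - 1);
}

/* Bytes allocated beyond what the slot needs when it gets its own gmem file. */
static uint64_t padding_bytes(uint64_t slot_size, uint64_t huge_size)
{
	return round_up_pow2(slot_size, huge_size) - slot_size;
}
```

For the 2.5G and 1.5G slots the padding is 0.5G each, so the two files take 5G
in total for 4G of guest memory.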

> It can't reduce the flag to KVM_GUEST_MEMFD_HUGE_2MB. Adjusting the gmem.pgoff
> isn't ideal either.
>
> What about something similar as below?
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index d2feacd14786..87c33704a748 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -1842,8 +1842,16 @@ __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
>         }
>
>         *pfn = folio_file_pfn(folio, index);
> -       if (max_order)
> -               *max_order = folio_order(folio);
> +       if (max_order) {
> +               int order;
> +
> +               order = folio_order(folio);
> +
> +               while (order > 0 && ((slot->base_gfn ^ slot->gmem.pgoff) & ((1 << order) - 1)))

This sounds better. Userspace will need to avoid this in general or
keep such ranges short so that most of the guest memory ranges can be
mapped at hugepage granularity. So maybe a pr_warn could be spewed
during binding that the alignment is not optimal.

> +                       order--;
> +
> +               *max_order = order;
> +       }
>
>         *is_prepared = folio_test_uptodate(folio);
>         return folio;
>
>
> > >> Adding checks at binding time will allow hugepage-unaligned offsets (to
> > >> be at parity with non-guest_memfd backing memory) but still fix this
> > >> issue.
> > >>
> > >> lpage_info will make sure that ranges near the bounds will be
> > >> fragmented, but the hugepages in the middle will still be mappable as
> > >> hugepages.
> > >>
> > >> [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3706/binding-must-have-same-alignment.svg


* Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
  2025-04-28  1:05                       ` Yan Zhao
  2025-04-28 19:02                         ` Vishal Annapurve
@ 2025-04-30 20:09                         ` Ackerley Tng
  2025-05-06  1:23                           ` Yan Zhao
  1 sibling, 1 reply; 130+ messages in thread
From: Ackerley Tng @ 2025-04-30 20:09 UTC (permalink / raw)
  To: Yan Zhao
  Cc: vannapurve, chenyi.qiang, tabba, quic_eberman, roypat, jgg,
	peterx, david, rientjes, fvdl, jthoughton, seanjc, pbonzini,
	zhiquan1.li, fan.du, jun.miao, isaku.yamahata, muchun.song,
	erdemaktas, qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Fri, Apr 25, 2025 at 03:45:20PM -0700, Ackerley Tng wrote:
>> Yan Zhao <yan.y.zhao@intel.com> writes:
>> 
>> > On Thu, Apr 24, 2025 at 11:15:11AM -0700, Ackerley Tng wrote:
>> >> Vishal Annapurve <vannapurve@google.com> writes:
>> >> 
>> >> > On Thu, Apr 24, 2025 at 1:15 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
>> >> >>
>> >> >> On Thu, Apr 24, 2025 at 01:55:51PM +0800, Chenyi Qiang wrote:
>> >> >> >
>> >> >> >
>> >> >> > On 4/24/2025 12:25 PM, Yan Zhao wrote:
>> >> >> > > On Thu, Apr 24, 2025 at 09:09:22AM +0800, Yan Zhao wrote:
>> >> >> > >> On Wed, Apr 23, 2025 at 03:02:02PM -0700, Ackerley Tng wrote:
>> >> >> > >>> Yan Zhao <yan.y.zhao@intel.com> writes:
>> >> >> > >>>
>> >> >> > >>>> On Tue, Sep 10, 2024 at 11:44:10PM +0000, Ackerley Tng wrote:
>> >> >> > >>>>> +/*
>> >> >> > >>>>> + * Allocates and then caches a folio in the filemap. Returns a folio with
>> >> >> > >>>>> + * refcount of 2: 1 after allocation, and 1 taken by the filemap.
>> >> >> > >>>>> + */
>> >> >> > >>>>> +static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
>> >> >> > >>>>> +                                                           pgoff_t index)
>> >> >> > >>>>> +{
>> >> >> > >>>>> +       struct kvm_gmem_hugetlb *hgmem;
>> >> >> > >>>>> +       pgoff_t aligned_index;
>> >> >> > >>>>> +       struct folio *folio;
>> >> >> > >>>>> +       int nr_pages;
>> >> >> > >>>>> +       int ret;
>> >> >> > >>>>> +
>> >> >> > >>>>> +       hgmem = kvm_gmem_hgmem(inode);
>> >> >> > >>>>> +       folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
>> >> >> > >>>>> +       if (IS_ERR(folio))
>> >> >> > >>>>> +               return folio;
>> >> >> > >>>>> +
>> >> >> > >>>>> +       nr_pages = 1UL << huge_page_order(hgmem->h);
>> >> >> > >>>>> +       aligned_index = round_down(index, nr_pages);
>> >> >> > >>>> Maybe a gap here.
>> >> >> > >>>>
>> >> >> > >>>> When a guest_memfd is bound to a slot where slot->base_gfn is not aligned to
>> >> >> > >>>> 2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, the
>> >> >> > >>>> corresponding GFN is not 2M/1G aligned.
>> >> >> > >>>
>> >> >> > >>> Thanks for looking into this.
>> >> >> > >>>
>> >> >> > >>> In 1G page support for guest_memfd, the offset and size are always
>> >> >> > >>> hugepage aligned to the hugepage size requested at guest_memfd creation
>> >> >> > >>> time, and it is true that when binding to a memslot, slot->base_gfn and
>> >> >> > >>> slot->npages may not be hugepage aligned.
>> >> >> > >>>
>> >> >> > >>>>
>> >> >> > >>>> However, TDX requires that private huge pages be 2M aligned in GFN.
>> >> >> > >>>>
>> >> >> > >>>
>> >> >> > >>> IIUC other factors also contribute to determining the mapping level in
>> >> >> > >>> the guest page tables, like lpage_info and .private_max_mapping_level()
>> >> >> > >>> in kvm_x86_ops.
>> >> >> > >>>
>> >> >> > >>> If slot->base_gfn and slot->npages are not hugepage aligned, lpage_info
>> >> >> > >>> will track that and not allow faulting into guest page tables at higher
>> >> >> > >>> granularity.
>> >> >> > >>
>> >> >> > >> lpage_info only checks the alignments of slot->base_gfn and
>> >> >> > >> slot->base_gfn + npages. e.g.,
>> >> >> > >>
>> >> >> > >> if slot->base_gfn is 8K, npages is 8M, then for this slot,
>> >> >> > >> lpage_info[2M][0].disallow_lpage = 1, which is for GFN [4K, 2M+8K);
>> >> >> > >> lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M+8K, 4M+8K);
>> >> >> > >> lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M+8K, 6M+8K);
>> >> >> > >> lpage_info[2M][3].disallow_lpage = 1, which is for GFN [6M+8K, 8M+8K);
>> >> >> >
>> >> >> > Should it be?
>> >> >> > lpage_info[2M][0].disallow_lpage = 1, which is for GFN [8K, 2M);
>> >> >> > lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M, 4M);
>> >> >> > lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M, 6M);
>> >> >> > lpage_info[2M][3].disallow_lpage = 0, which is for GFN [6M, 8M);
>> >> >> > lpage_info[2M][4].disallow_lpage = 1, which is for GFN [8M, 8M+8K);
>> >> >> Right. Good catch. Thanks!
>> >> >>
>> >> >> Let me update the example as below:
>> >> >> slot->base_gfn is 2 (for GPA 8KB), npages 2000 (for a 8MB range)
>> >> >>
>> >> >> lpage_info[2M][0].disallow_lpage = 1, which is for GPA [8KB, 2MB);
>> >> >> lpage_info[2M][1].disallow_lpage = 0, which is for GPA [2MB, 4MB);
>> >> >> lpage_info[2M][2].disallow_lpage = 0, which is for GPA [4MB, 6MB);
>> >> >> lpage_info[2M][3].disallow_lpage = 0, which is for GPA [6MB, 8MB);
>> >> >> lpage_info[2M][4].disallow_lpage = 1, which is for GPA [8MB, 8MB+8KB);
>> >> >>
>> >> >> lpage_info indicates that a 2MB mapping is allowed to cover GPA 4MB and GPA
>> >> >> 4MB+16KB. However, their aligned_index values lead guest_memfd to allocate two
>> >> >> 2MB folios, whose physical addresses may not be contiguous.
>> >> >>
>> >> >> Additionally, if the guest accesses two GPAs, e.g., GPA 2MB+8KB and GPA 4MB,
>> >> >> KVM could create two 2MB mappings to cover GPA ranges [2MB, 4MB), [4MB, 6MB).
>> >> >> However, guest_memfd just allocates the same 2MB folio for both faults.
>> >> >>
>> >> >>
>> >> >> >
>> >> >> > >>
>> >> >> > >>   ---------------------------------------------------------
>> >> >> > >>   |          |  |          |  |          |  |          |  |
>> >> >> > >>   8K        2M 2M+8K      4M  4M+8K     6M  6M+8K     8M  8M+8K
>> >> >> > >>
>> >> >> > >> For GFN 6M and GFN 6M+4K, as they both belong to lpage_info[2M][2], huge
>> >> >> > >> page is allowed. Also, they have the same aligned_index 2 in guest_memfd.
>> >> >> > >> So, guest_memfd allocates the same huge folio of 2M order for them.
>> >> >> > > Sorry, sent too fast this morning. The example is not right. The correct
>> >> >> > > one is:
>> >> >> > >
>> >> >> > > For GFN 4M and GFN 4M+16K, lpage_info indicates that 2M is allowed. So,
>> >> >> > > KVM will create a 2M mapping for them.
>> >> >> > >
>> >> >> > > However, in guest_memfd, GFN 4M and GFN 4M+16K do not correspond to the
>> >> >> > > same 2M folio and physical addresses may not be contiguous.
>> >> >
>> >> > Then during binding, guest memfd offset misalignment with hugepage
>> >> > should be same as gfn misalignment. i.e.
>> >> >
>> >> > (offset & ~huge_page_mask(h)) == ((slot->base_gfn << PAGE_SHIFT) &
>> >> > ~huge_page_mask(h));
>> >> >
>> >> > For non guest_memfd backed scenarios, KVM allows slot gfn ranges that
>> >> > are not hugepage aligned, so guest_memfd should also be able to
>> >> > support non-hugepage aligned memslots.
>> >> >
>> >> 
>> >> I drew up a picture [1] which hopefully clarifies this.
>> >> 
>> >> Thanks for pointing this out. I understand better now, and we will add an
>> >> extra constraint during memslot binding of guest_memfd to check that gfn
>> >> offsets within a hugepage match the guest_memfd offsets.
>> > I'm a bit confused.
>> >
>> > As "index = gfn - slot->base_gfn + slot->gmem.pgoff", do you mean you are going
>> > to force "slot->base_gfn == slot->gmem.pgoff" ?
>> >
>> > For some memory region, e.g., "pc.ram", it's divided into 2 parts:
>> > - one with offset 0, size 0x80000000(2G),
>> >   positioned at GPA 0, which is below GPA 4G;
>> > - one with offset 0x80000000(2G), size 0x80000000(2G),
>> >   positioned at GPA 0x100000000(4G), which is above GPA 4G.
>> >
>> > For the second part, its slot->base_gfn is 0x100000000, while slot->gmem.pgoff
>> > is 0x80000000.
>> >
>> 
>> Nope I don't mean to enforce that they are equal, we just need the
>> offsets within the page to be equal.
>> 
>> I edited Vishal's code snippet, perhaps it would help explain better:
>> 
>> page_size is the size of the hugepage, so in our example,
>> 
>>   page_size = SZ_2M;
>>   page_mask = ~(page_size - 1);
> page_mask = page_size - 1  ?
>

Yes, thank you!

>>   offset_within_page = slot->gmem.pgoff & page_mask;
>>   gfn_within_page = (slot->base_gfn << PAGE_SHIFT) & page_mask;
>> 
>> We will enforce that
>> 
>>   offset_within_page == gfn_within_page;
> For "pc.ram", if it has 2.5G below 4G, it would be configured as follows
> - slot 1: slot->gmem.pgoff=0, base GPA 0, size=2.5G
> - slot 2: slot->gmem.pgoff=2.5G, base GPA 4G, size=1.5G
>
> When binding these two slots to the same guest_memfd created with flag
> KVM_GUEST_MEMFD_HUGE_1GB: 
> - binding the 1st slot will succeed;
> - binding the 2nd slot will fail.
>
> What options does userspace have in this scenario?
> It can't reduce the flag to KVM_GUEST_MEMFD_HUGE_2MB. Adjusting the gmem.pgoff
> isn't ideal either.
>
> What about something similar as below?
>
> diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> index d2feacd14786..87c33704a748 100644
> --- a/virt/kvm/guest_memfd.c
> +++ b/virt/kvm/guest_memfd.c
> @@ -1842,8 +1842,16 @@ __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
>         }
>
>         *pfn = folio_file_pfn(folio, index);
> -       if (max_order)
> -               *max_order = folio_order(folio);
> +       if (max_order) {
> +               int order;
> +
> +               order = folio_order(folio);
> +
> +               while (order > 0 && ((slot->base_gfn ^ slot->gmem.pgoff) & ((1 << order) - 1)))
> +                       order--;
> +
> +               *max_order = order;
> +       }
>
>         *is_prepared = folio_test_uptodate(folio);
>         return folio;
>

Vishal was wondering how this is working before guest_memfd was
introduced, for other backing memory like HugeTLB.

I then poked around and found this [1]. I will be adding a similar check
for any slot where kvm_slot_can_be_private(slot).

Yan, that should work, right?

[1] https://github.com/torvalds/linux/blob/b6ea1680d0ac0e45157a819c41b46565f4616186/arch/x86/kvm/x86.c#L12996
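
For reference, the folio-selection mismatch discussed above can be reproduced
with a small userspace sketch. This is not kernel code: struct slot_sketch and
the helper names are hypothetical, and it assumes 4KB base pages and 2MB
folios, using the slot from Yan's example (base_gfn 2, gmem.pgoff 0).

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT   12
#define PAGES_PER_2M 512UL	/* 2MB / 4KB */

/* Hypothetical stand-in for the relevant kvm_memory_slot fields. */
struct slot_sketch {
	uint64_t base_gfn;	/* first GFN covered by the slot */
	uint64_t pgoff;		/* slot->gmem.pgoff, in 4KB pages */
};

static uint64_t gpa_to_gfn(uint64_t gpa)
{
	return gpa >> PAGE_SHIFT;
}

/* index = gfn - slot->base_gfn + slot->gmem.pgoff, as quoted in the thread. */
static uint64_t gfn_to_index(const struct slot_sketch *s, uint64_t gfn)
{
	assert(gfn >= s->base_gfn);
	return gfn - s->base_gfn + s->pgoff;
}

/* Equivalent of round_down(index, nr_pages) for a power-of-two nr_pages. */
static uint64_t aligned_index(uint64_t index, uint64_t nr_pages)
{
	return index & ~(nr_pages - 1);
}
```

With that slot, GPA 4MB maps to index 1022 and GPA 4MB+16KB maps to index 1026,
so their aligned indices (512 and 1024) select different 2MB folios even though
lpage_info allows a single 2MB mapping over both. GPA 2MB+8KB maps to index
512, so it shares a folio with GPA 4MB.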

>> >> Adding checks at binding time will allow hugepage-unaligned offsets (to
>> >> be at parity with non-guest_memfd backing memory) but still fix this
>> >> issue.
>> >> 
>> >> lpage_info will make sure that ranges near the bounds will be
>> >> fragmented, but the hugepages in the middle will still be mappable as
>> >> hugepages.
>> >> 
>> >> [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3706/binding-must-have-same-alignment.svg


* Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
  2025-04-30 20:09                         ` Ackerley Tng
@ 2025-05-06  1:23                           ` Yan Zhao
  2025-05-06 19:22                             ` Ackerley Tng
  0 siblings, 1 reply; 130+ messages in thread
From: Yan Zhao @ 2025-05-06  1:23 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: vannapurve, chenyi.qiang, tabba, quic_eberman, roypat, jgg,
	peterx, david, rientjes, fvdl, jthoughton, seanjc, pbonzini,
	zhiquan1.li, fan.du, jun.miao, isaku.yamahata, muchun.song,
	erdemaktas, qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest

On Wed, Apr 30, 2025 at 01:09:33PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> > On Fri, Apr 25, 2025 at 03:45:20PM -0700, Ackerley Tng wrote:
> >> Yan Zhao <yan.y.zhao@intel.com> writes:
> >> 
> >> > On Thu, Apr 24, 2025 at 11:15:11AM -0700, Ackerley Tng wrote:
> >> >> Vishal Annapurve <vannapurve@google.com> writes:
> >> >> 
> >> >> > On Thu, Apr 24, 2025 at 1:15 AM Yan Zhao <yan.y.zhao@intel.com> wrote:
> >> >> >>
> >> >> >> On Thu, Apr 24, 2025 at 01:55:51PM +0800, Chenyi Qiang wrote:
> >> >> >> >
> >> >> >> >
> >> >> >> > On 4/24/2025 12:25 PM, Yan Zhao wrote:
> >> >> >> > > On Thu, Apr 24, 2025 at 09:09:22AM +0800, Yan Zhao wrote:
> >> >> >> > >> On Wed, Apr 23, 2025 at 03:02:02PM -0700, Ackerley Tng wrote:
> >> >> >> > >>> Yan Zhao <yan.y.zhao@intel.com> writes:
> >> >> >> > >>>
> >> >> >> > >>>> On Tue, Sep 10, 2024 at 11:44:10PM +0000, Ackerley Tng wrote:
> >> >> >> > >>>>> +/*
> >> >> >> > >>>>> + * Allocates and then caches a folio in the filemap. Returns a folio with
> >> >> >> > >>>>> + * refcount of 2: 1 after allocation, and 1 taken by the filemap.
> >> >> >> > >>>>> + */
> >> >> >> > >>>>> +static struct folio *kvm_gmem_hugetlb_alloc_and_cache_folio(struct inode *inode,
> >> >> >> > >>>>> +                                                           pgoff_t index)
> >> >> >> > >>>>> +{
> >> >> >> > >>>>> +       struct kvm_gmem_hugetlb *hgmem;
> >> >> >> > >>>>> +       pgoff_t aligned_index;
> >> >> >> > >>>>> +       struct folio *folio;
> >> >> >> > >>>>> +       int nr_pages;
> >> >> >> > >>>>> +       int ret;
> >> >> >> > >>>>> +
> >> >> >> > >>>>> +       hgmem = kvm_gmem_hgmem(inode);
> >> >> >> > >>>>> +       folio = kvm_gmem_hugetlb_alloc_folio(hgmem->h, hgmem->spool);
> >> >> >> > >>>>> +       if (IS_ERR(folio))
> >> >> >> > >>>>> +               return folio;
> >> >> >> > >>>>> +
> >> >> >> > >>>>> +       nr_pages = 1UL << huge_page_order(hgmem->h);
> >> >> >> > >>>>> +       aligned_index = round_down(index, nr_pages);
> >> >> >> > >>>> Maybe a gap here.
> >> >> >> > >>>>
> >> >> >> > >>>> When a guest_memfd is bound to a slot where slot->base_gfn is not aligned to
> >> >> >> > >>>> 2M/1G and slot->gmem.pgoff is 0, even if an index is 2M/1G aligned, the
> >> >> >> > >>>> corresponding GFN is not 2M/1G aligned.
> >> >> >> > >>>
> >> >> >> > >>> Thanks for looking into this.
> >> >> >> > >>>
> >> >> >> > >>> In 1G page support for guest_memfd, the offset and size are always
> >> >> >> > >>> hugepage aligned to the hugepage size requested at guest_memfd creation
> >> >> >> > >>> time, and it is true that when binding to a memslot, slot->base_gfn and
> >> >> >> > >>> slot->npages may not be hugepage aligned.
> >> >> >> > >>>
> >> >> >> > >>>>
> >> >> >> > >>>> However, TDX requires that private huge pages be 2M aligned in GFN.
> >> >> >> > >>>>
> >> >> >> > >>>
> >> >> >> > >>> IIUC other factors also contribute to determining the mapping level in
> >> >> >> > >>> the guest page tables, like lpage_info and .private_max_mapping_level()
> >> >> >> > >>> in kvm_x86_ops.
> >> >> >> > >>>
> >> >> >> > >>> If slot->base_gfn and slot->npages are not hugepage aligned, lpage_info
> >> >> >> > >>> will track that and not allow faulting into guest page tables at higher
> >> >> >> > >>> granularity.
> >> >> >> > >>
> >> >> >> > >> lpage_info only checks the alignments of slot->base_gfn and
> >> >> >> > >> slot->base_gfn + npages. e.g.,
> >> >> >> > >>
> >> >> >> > >> if slot->base_gfn is 8K, npages is 8M, then for this slot,
> >> >> >> > >> lpage_info[2M][0].disallow_lpage = 1, which is for GFN [4K, 2M+8K);
> >> >> >> > >> lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M+8K, 4M+8K);
> >> >> >> > >> lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M+8K, 6M+8K);
> >> >> >> > >> lpage_info[2M][3].disallow_lpage = 1, which is for GFN [6M+8K, 8M+8K);
> >> >> >> >
> >> >> >> > Should it be?
> >> >> >> > lpage_info[2M][0].disallow_lpage = 1, which is for GFN [8K, 2M);
> >> >> >> > lpage_info[2M][1].disallow_lpage = 0, which is for GFN [2M, 4M);
> >> >> >> > lpage_info[2M][2].disallow_lpage = 0, which is for GFN [4M, 6M);
> >> >> >> > lpage_info[2M][3].disallow_lpage = 0, which is for GFN [6M, 8M);
> >> >> >> > lpage_info[2M][4].disallow_lpage = 1, which is for GFN [8M, 8M+8K);
> >> >> >> Right. Good catch. Thanks!
> >> >> >>
> >> >> >> Let me update the example as below:
> >> >> >> slot->base_gfn is 2 (for GPA 8KB), npages 2000 (for a 8MB range)
> >> >> >>
> >> >> >> lpage_info[2M][0].disallow_lpage = 1, which is for GPA [8KB, 2MB);
> >> >> >> lpage_info[2M][1].disallow_lpage = 0, which is for GPA [2MB, 4MB);
> >> >> >> lpage_info[2M][2].disallow_lpage = 0, which is for GPA [4MB, 6MB);
> >> >> >> lpage_info[2M][3].disallow_lpage = 0, which is for GPA [6MB, 8MB);
> >> >> >> lpage_info[2M][4].disallow_lpage = 1, which is for GPA [8MB, 8MB+8KB);
> >> >> >>
> >> >> >> lpage_info indicates that a 2MB mapping is allowed to cover GPA 4MB and GPA
> >> >> >> 4MB+16KB. However, their aligned_index values lead guest_memfd to allocate two
> >> >> >> 2MB folios, whose physical addresses may not be contiguous.
> >> >> >>
> >> >> >> Additionally, if the guest accesses two GPAs, e.g., GPA 2MB+8KB and GPA 4MB,
> >> >> >> KVM could create two 2MB mappings to cover GPA ranges [2MB, 4MB), [4MB, 6MB).
> >> >> >> However, guest_memfd just allocates the same 2MB folio for both faults.
> >> >> >>
> >> >> >>
> >> >> >> >
> >> >> >> > >>
> >> >> >> > >>   ---------------------------------------------------------
> >> >> >> > >>   |          |  |          |  |          |  |          |  |
> >> >> >> > >>   8K        2M 2M+8K      4M  4M+8K     6M  6M+8K     8M  8M+8K
> >> >> >> > >>
> >> >> >> > >> For GFN 6M and GFN 6M+4K, as they both belong to lpage_info[2M][2], huge
> >> >> >> > >> page is allowed. Also, they have the same aligned_index 2 in guest_memfd.
> >> >> >> > >> So, guest_memfd allocates the same huge folio of 2M order for them.
> >> >> >> > > Sorry, sent too fast this morning. The example is not right. The correct
> >> >> >> > > one is:
> >> >> >> > >
> >> >> >> > > For GFN 4M and GFN 4M+16K, lpage_info indicates that 2M is allowed. So,
> >> >> >> > > KVM will create a 2M mapping for them.
> >> >> >> > >
> >> >> >> > > However, in guest_memfd, GFN 4M and GFN 4M+16K do not correspond to the
> >> >> >> > > same 2M folio and physical addresses may not be contiguous.
> >> >> >
> >> >> > Then during binding, guest memfd offset misalignment with hugepage
> >> >> > should be same as gfn misalignment. i.e.
> >> >> >
> >> >> > (offset & ~huge_page_mask(h)) == ((slot->base_gfn << PAGE_SHIFT) &
> >> >> > ~huge_page_mask(h));
> >> >> >
> >> >> > For non guest_memfd backed scenarios, KVM allows slot gfn ranges that
> >> >> > are not hugepage aligned, so guest_memfd should also be able to
> >> >> > support non-hugepage aligned memslots.
> >> >> >
> >> >> 
> >> >> I drew up a picture [1] which hopefully clarifies this.
> >> >> 
> >> >> Thanks for pointing this out. I understand better now, and we will add an
> >> >> extra constraint during memslot binding of guest_memfd to check that gfn
> >> >> offsets within a hugepage match the guest_memfd offsets.
> >> > I'm a bit confused.
> >> >
> >> > As "index = gfn - slot->base_gfn + slot->gmem.pgoff", do you mean you are going
> >> > to force "slot->base_gfn == slot->gmem.pgoff" ?
> >> >
> >> > For some memory region, e.g., "pc.ram", it's divided into 2 parts:
> >> > - one with offset 0, size 0x80000000(2G),
> >> >   positioned at GPA 0, which is below GPA 4G;
> >> > - one with offset 0x80000000(2G), size 0x80000000(2G),
> >> >   positioned at GPA 0x100000000(4G), which is above GPA 4G.
> >> >
> >> > For the second part, its slot->base_gfn is 0x100000000, while slot->gmem.pgoff
> >> > is 0x80000000.
> >> >
> >> 
> >> Nope I don't mean to enforce that they are equal, we just need the
> >> offsets within the page to be equal.
> >> 
> >> I edited Vishal's code snippet, perhaps it would help explain better:
> >> 
> >> page_size is the size of the hugepage, so in our example,
> >> 
> >>   page_size = SZ_2M;
> >>   page_mask = ~(page_size - 1);
> > page_mask = page_size - 1  ?
> >
> 
> Yes, thank you!
> 
> >>   offset_within_page = slot->gmem.pgoff & page_mask;
> >>   gfn_within_page = (slot->base_gfn << PAGE_SHIFT) & page_mask;
> >> 
> >> We will enforce that
> >> 
> >>   offset_within_page == gfn_within_page;
> > For "pc.ram", if it has 2.5G below 4G, it would be configured as follows
> > - slot 1: slot->gmem.pgoff=0, base GPA 0, size=2.5G
> > - slot 2: slot->gmem.pgoff=2.5G, base GPA 4G, size=1.5G
> >
> > When binding these two slots to the same guest_memfd created with flag
> > KVM_GUEST_MEMFD_HUGE_1GB: 
> > - binding the 1st slot will succeed;
> > - binding the 2nd slot will fail.
> >
> > What options does userspace have in this scenario?
> > It can't reduce the flag to KVM_GUEST_MEMFD_HUGE_2MB. Adjusting the gmem.pgoff
> > isn't ideal either.
> >
> > What about something similar as below?
> >
> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> > index d2feacd14786..87c33704a748 100644
> > --- a/virt/kvm/guest_memfd.c
> > +++ b/virt/kvm/guest_memfd.c
> > @@ -1842,8 +1842,16 @@ __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
> >         }
> >
> >         *pfn = folio_file_pfn(folio, index);
> > -       if (max_order)
> > -               *max_order = folio_order(folio);
> > +       if (max_order) {
> > +               int order;
> > +
> > +               order = folio_order(folio);
> > +
> > +               while (order > 0 && ((slot->base_gfn ^ slot->gmem.pgoff) & ((1 << order) - 1)))
> > +                       order--;
> > +
> > +               *max_order = order;
> > +       }
> >
> >         *is_prepared = folio_test_uptodate(folio);
> >         return folio;
> >
> 
> Vishal was wondering how this is working before guest_memfd was
> introduced, for other backing memory like HugeTLB.
> 
> I then poked around and found this [1]. I will be adding a similar check
> for any slot where kvm_slot_can_be_private(slot).
>
> Yan, that should work, right?
No, I don't think checking ugfn [1] will work.

1. Even for slots bound to an in-place-conversion guest_memfd (i.e. shared
memory is allocated from guest_memfd), slot->userspace_addr does not necessarily
have the same offset as slot->gmem.pgoff. Even if we audit the offset in
kvm_gmem_bind(), userspace could invoke munmap() and mmap() afterwards, causing
slot->userspace_addr to point to a different offset.

2. For slots bound to a guest_memfd that does not support in-place conversion,
shared memory is allocated from a different backend. Therefore, checking
"slot->base_gfn ^ slot->gmem.pgoff" is required for private memory. The check is
currently absent because guest_memfd supports 4K only.
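
As a rough illustration of the gfn/pgoff check in point 2, here is a userspace
sketch of the order clamp from the diff quoted above. clamp_order() is a
hypothetical name, and the sketch assumes base_gfn and pgoff are both expressed
in units of 4KB pages.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Lower 'order' until base_gfn and pgoff agree in their low 'order' bits,
 * i.e. until a folio of that order would sit at the same offset in both
 * GFN space and guest_memfd index space.
 */
static int clamp_order(uint64_t base_gfn, uint64_t pgoff, int order)
{
	assert(order >= 0 && order < 64);

	while (order > 0 &&
	       ((base_gfn ^ pgoff) & ((UINT64_C(1) << order) - 1)))
		order--;

	return order;
}
```

For the "pc.ram" example (second slot at GPA 4G, so base_gfn 0x100000, with
pgoff 0xA0000 for the 2.5G offset), a 1G folio (order 18) would be clamped to
order 17, so a 1G mapping is ruled out while 2M mappings remain possible.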

 
> [1] https://github.com/torvalds/linux/blob/b6ea1680d0ac0e45157a819c41b46565f4616186/arch/x86/kvm/x86.c#L12996
> 
> >> >> Adding checks at binding time will allow hugepage-unaligned offsets (to
> >> >> be at parity with non-guest_memfd backing memory) but still fix this
> >> >> issue.
> >> >> 
> >> >> lpage_info will make sure that ranges near the bounds will be
> >> >> fragmented, but the hugepages in the middle will still be mappable as
> >> >> hugepages.
> >> >> 
> >> >> [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3706/binding-must-have-same-alignment.svg


* Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
  2025-05-06  1:23                           ` Yan Zhao
@ 2025-05-06 19:22                             ` Ackerley Tng
  2025-05-07  3:15                               ` Yan Zhao
  0 siblings, 1 reply; 130+ messages in thread
From: Ackerley Tng @ 2025-05-06 19:22 UTC (permalink / raw)
  To: Yan Zhao
  Cc: vannapurve, chenyi.qiang, tabba, quic_eberman, roypat, jgg,
	peterx, david, rientjes, fvdl, jthoughton, seanjc, pbonzini,
	zhiquan1.li, fan.du, jun.miao, isaku.yamahata, muchun.song,
	erdemaktas, qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest

Yan Zhao <yan.y.zhao@intel.com> writes:

>> > <snip>
>> >
>> > What options does userspace have in this scenario?
>> > It can't reduce the flag to KVM_GUEST_MEMFD_HUGE_2MB. Adjusting the gmem.pgoff
>> > isn't ideal either.
>> >
>> > What about something similar as below?
>> >
>> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> > index d2feacd14786..87c33704a748 100644
>> > --- a/virt/kvm/guest_memfd.c
>> > +++ b/virt/kvm/guest_memfd.c
>> > @@ -1842,8 +1842,16 @@ __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
>> >         }
>> >
>> >         *pfn = folio_file_pfn(folio, index);
>> > -       if (max_order)
>> > -               *max_order = folio_order(folio);
>> > +       if (max_order) {
>> > +               int order;
>> > +
>> > +               order = folio_order(folio);
>> > +
>> > +               while (order > 0 && ((slot->base_gfn ^ slot->gmem.pgoff) & ((1 << order) - 1)))
>> > +                       order--;
>> > +
>> > +               *max_order = order;
>> > +       }
>> >
>> >         *is_prepared = folio_test_uptodate(folio);
>> >         return folio;
>> >
>> 
>> Vishal was wondering how this is working before guest_memfd was
>> introduced, for other backing memory like HugeTLB.
>> 
>> I then poked around and found this [1]. I will be adding a similar check
>> for any slot where kvm_slot_can_be_private(slot).
>>
>> Yan, that should work, right?
> No, I don't think checking ugfn [1] will work.
>
> 1. Even for slots bound to an in-place-conversion guest_memfd (i.e. shared
> memory is allocated from guest_memfd), slot->userspace_addr does not necessarily
> have the same offset as slot->gmem.pgoff. Even if we audit the offset in
> kvm_gmem_bind(), userspace could invoke munmap() and mmap() afterwards, causing
> slot->userspace_addr to point to a different offset.
>
> 2. For slots bound to a guest_memfd that does not support in-place conversion,
> shared memory is allocated from a different backend. Therefore, checking
> "slot->base_gfn ^ slot->gmem.pgoff" is required for private memory. The check is
> currently absent because guest_memfd supports 4K only.
>
>

Let me clarify, I meant these changes:

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4b64ab3..d0dccf1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -12938,6 +12938,11 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages)
        return 0;
 }
 
+static inline bool kvm_is_level_aligned(u64 value, int level)
+{
+       return IS_ALIGNED(value, KVM_PAGES_PER_HPAGE(level));
+}
+
 static int kvm_alloc_memslot_metadata(struct kvm *kvm,
                                      struct kvm_memory_slot *slot)
 {
@@ -12971,16 +12976,20 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
 
                slot->arch.lpage_info[i - 1] = linfo;
 
-               if (slot->base_gfn & (KVM_PAGES_PER_HPAGE(level) - 1))
+               if (!kvm_is_level_aligned(slot->base_gfn, level))
                        linfo[0].disallow_lpage = 1;
-               if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
+               if (!kvm_is_level_aligned(slot->base_gfn + npages, level))
                        linfo[lpages - 1].disallow_lpage = 1;
                ugfn = slot->userspace_addr >> PAGE_SHIFT;
                /*
-                * If the gfn and userspace address are not aligned wrt each
-                * other, disable large page support for this slot.
+                * If the gfn and userspace address are not aligned or if gfn
+                * and guest_memfd offset are not aligned wrt each other,
+                * disable large page support for this slot.
                 */
-               if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1)) {
+               if (!kvm_is_level_aligned(slot->base_gfn ^ ugfn, level) ||
+                   (kvm_slot_can_be_private(slot) &&
+                    !kvm_is_level_aligned(slot->base_gfn ^ slot->gmem.pgoff,
+                                          level))) {
                        unsigned long j;
 
                        for (j = 0; j < lpages; ++j)

This does not rely on the ugfn check, but adds a similar check for gmem.pgoff.
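
To illustrate what the added condition evaluates, here is a small userspace
sketch, not the kernel code itself. PAGES_PER_HPAGE() mirrors the x86 layout as
I understand it (level 1 = 4KB, 2 = 2MB, 3 = 1GB, 9 bits per level), and
gmem_allows_level() is a hypothetical helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Number of 4KB pages covered by one mapping at 'level' (x86 assumption). */
#define PAGES_PER_HPAGE(level) (UINT64_C(1) << (((level) - 1) * 9))

static bool is_level_aligned(uint64_t value, int level)
{
	assert(level >= 1 && level <= 3);
	return (value & (PAGES_PER_HPAGE(level) - 1)) == 0;
}

/*
 * True if base_gfn and the guest_memfd offset (both in 4KB pages) have the
 * same offset within a hugepage of 'level', which is the condition under
 * which large pages stay enabled for the slot at that level.
 */
static bool gmem_allows_level(uint64_t base_gfn, uint64_t pgoff, int level)
{
	return is_level_aligned(base_gfn ^ pgoff, level);
}
```

For Yan's "pc.ram" example (slot 2 at GPA 4G with gmem.pgoff 2.5G, so base_gfn
0x100000 and pgoff 0xA0000), this allows 2M mappings but disallows 1G mappings
for the slot, rather than failing the bind.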

I think this should take care of case 1, for guest_memfds that will be
used for both shared and private memory. Userspace can't update
slot->userspace_addr, since guest_memfd memslots cannot be updated and
can only be deleted.

If userspace re-uses slot->userspace_addr for some other memory address
without deleting and re-adding a memslot,

+ KVM's access to memory should still be fine, since after the recent
  discussion at the guest_memfd upstream call, KVM's guest faults will
  always go via fd+offset and KVM's access won't be disrupted
  there. Whatever checking is done at memslot binding time will still be
  valid.
+ Host's access and other accesses (e.g. instruction emulation, which
  uses slot->userspace_addr) to guest memory will be broken, but I think
  there's nothing protecting against that. The same breakage would
  happen for a non-guest_memfd memslot.

p.s. I will be adding the validation as you suggested [1], though that
shouldn't make a difference here, since the above check directly
validates against gmem.pgoff.

Regarding 2, the new check validates against gmem.pgoff directly, so it
should handle that case as well.

[1] https://lore.kernel.org/all/aBnMp26iWWhUrsVf@yzhao56-desk.sh.intel.com/

I prefer checking at binding time because it aligns with the ugfn check
that is already there, and avoids having to check at every fault.

>> [1] https://github.com/torvalds/linux/blob/b6ea1680d0ac0e45157a819c41b46565f4616186/arch/x86/kvm/x86.c#L12996
>> 
>> >> >> Adding checks at binding time will allow hugepage-unaligned offsets (to
>> >> >> be at parity with non-guest_memfd backing memory) but still fix this
>> >> >> issue.
>> >> >> 
>> >> >> lpage_info will make sure that ranges near the bounds will be
>> >> >> fragmented, but the hugepages in the middle will still be mappable as
>> >> >> hugepages.
>> >> >> 
>> >> >> [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3706/binding-must-have-same-alignment.svg


* Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
  2025-05-06 19:22                             ` Ackerley Tng
@ 2025-05-07  3:15                               ` Yan Zhao
  2025-05-13 17:33                                 ` Ackerley Tng
  0 siblings, 1 reply; 130+ messages in thread
From: Yan Zhao @ 2025-05-07  3:15 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: vannapurve, chenyi.qiang, tabba, quic_eberman, roypat, jgg,
	peterx, david, rientjes, fvdl, jthoughton, seanjc, pbonzini,
	zhiquan1.li, fan.du, jun.miao, isaku.yamahata, muchun.song,
	erdemaktas, qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest

On Tue, May 06, 2025 at 12:22:47PM -0700, Ackerley Tng wrote:
> Yan Zhao <yan.y.zhao@intel.com> writes:
> 
> >> > <snip>
> >> >
> >> > What options does userspace have in this scenario?
> >> > It can't reduce the flag to KVM_GUEST_MEMFD_HUGE_2MB. Adjusting the gmem.pgoff
> >> > isn't ideal either.
> >> >
> >> > What about something similar as below?
> >> >
> >> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
> >> > index d2feacd14786..87c33704a748 100644
> >> > --- a/virt/kvm/guest_memfd.c
> >> > +++ b/virt/kvm/guest_memfd.c
> >> > @@ -1842,8 +1842,16 @@ __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
> >> >         }
> >> >
> >> >         *pfn = folio_file_pfn(folio, index);
> >> > -       if (max_order)
> >> > -               *max_order = folio_order(folio);
> >> > +       if (max_order) {
> >> > +               int order;
> >> > +
> >> > +               order = folio_order(folio);
> >> > +
> >> > +               while (order > 0 && ((slot->base_gfn ^ slot->gmem.pgoff) & ((1 << order) - 1)))
> >> > +                       order--;
> >> > +
> >> > +               *max_order = order;
> >> > +       }
> >> >
> >> >         *is_prepared = folio_test_uptodate(folio);
> >> >         return folio;
> >> >
> >> 
> >> Vishal was wondering how this is working before guest_memfd was
> >> introduced, for other backing memory like HugeTLB.
> >> 
> >> I then poked around and found this [1]. I will be adding a similar check
> >> for any slot where kvm_slot_can_be_private(slot).
> >>
> >> Yan, that should work, right?
> > No, I don't think checking ugfn [1] will work.
> >
> > 1. Even for slots bound to an in-place-conversion guest_memfd (i.e. shared
> > memory is allocated from guest_memfd), slot->userspace_addr does not necessarily
> > have the same offset as slot->gmem.pgoff. Even if we audit the offset in
> > kvm_gmem_bind(), userspace could invoke munmap() and mmap() afterwards, causing
> > slot->userspace_addr to point to a different offset.
> >
> > 2. For slots bound to a guest_memfd that does not support in-place conversion,
> > shared memory is allocated from a different backend. Therefore, checking
> > "slot->base_gfn ^ slot->gmem.pgoff" is required for private memory. The check is
> > currently absent because guest_memfd supports 4K only.
> >
> >
> 
> Let me clarify, I meant these changes:
> 
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 4b64ab3..d0dccf1 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -12938,6 +12938,11 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages)
>         return 0;
>  }
>  
> +static inline bool kvm_is_level_aligned(u64 value, int level)
> +{
> +       return IS_ALIGNED(value, KVM_PAGES_PER_HPAGE(level));
> +}
> +
>  static int kvm_alloc_memslot_metadata(struct kvm *kvm,
>                                       struct kvm_memory_slot *slot)
>  {
> @@ -12971,16 +12976,20 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
>  
>                 slot->arch.lpage_info[i - 1] = linfo;
>  
> -               if (slot->base_gfn & (KVM_PAGES_PER_HPAGE(level) - 1))
> +               if (!kvm_is_level_aligned(slot->base_gfn, level))
>                         linfo[0].disallow_lpage = 1;
> -               if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
> +               if (!kvm_is_level_aligned(slot->base_gfn + npages, level))
>                         linfo[lpages - 1].disallow_lpage = 1;
>                 ugfn = slot->userspace_addr >> PAGE_SHIFT;
>                 /*
> -                * If the gfn and userspace address are not aligned wrt each
> -                * other, disable large page support for this slot.
> +                * If the gfn and userspace address are not aligned or if gfn
> +                * and guest_memfd offset are not aligned wrt each other,
> +                * disable large page support for this slot.
>                  */
> -               if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1)) {
> +               if (!kvm_is_level_aligned(slot->base_gfn ^ ugfn, level) ||
> +                   (kvm_slot_can_be_private(slot) &&
> +                    !kvm_is_level_aligned(slot->base_gfn ^ slot->gmem.pgoff,
> +                                          level))) {
>                         unsigned long j;
>  
>                         for (j = 0; j < lpages; ++j)
> 
> This does not rely on the ugfn check, but adds a similar check for gmem.pgoff.
In the case where shared memory is not allocated from guest_memfd (e.g. with the
current upstream code), the gmem.pgoff check here will disallow huge pages for
shared memory even if "slot->base_gfn ^ ugfn" is aligned.

> I think this should take care of case (1.), for guest_memfds going to be
> used for both shared and private memory. Userspace can't update
> slot->userspace_addr, since guest_memfd memslots cannot be updated and
> can only be deleted.
> 
> If userspace re-uses slot->userspace_addr for some other memory address
> without deleting and re-adding a memslot,
> 
> + KVM's access to memory should still be fine, since after the recent
>   discussion at guest_memfd upstream call, KVM's guest faults will
>   always go via fd+offset and KVM's access won't be disrupted
>   there. Whatever checking done at memslot binding time will still be
>   valid.
Could the offset of shared memory and the offset of private memory differ if
userspace re-uses slot->userspace_addr without deleting and re-adding a memslot?

If so, even though the two offsets are validated as equal in kvm_gmem_bind(),
they may differ later on.

> + Host's access and other accesses (e.g. instruction emulation, which
>   uses slot->userspace_addr) to guest memory will be broken, but I think
>   there's nothing protecting against that. The same breakage would
>   happen for non-guest_memfd memslot.
Why is host access broken in non-guest_memfd case?
The HVA is still a valid one in QEMU's mmap-ed address space.

> p.s. I will be adding the validation as you suggested [1], though that
> shouldn't make a difference here, since the above check directly
> validates against gmem.pgoff.
> 
> Regarding 2., checking this checks against gmem.pgoff and should handle
> that as well.
> 
> [1] https://lore.kernel.org/all/aBnMp26iWWhUrsVf@yzhao56-desk.sh.intel.com/
> 
> I prefer checking at binding time because it aligns with the ugfn check
> that is already there, and avoids having to check at every fault.
> 
> >> [1] https://github.com/torvalds/linux/blob/b6ea1680d0ac0e45157a819c41b46565f4616186/arch/x86/kvm/x86.c#L12996
> >> 
> >> >> >> Adding checks at binding time will allow hugepage-unaligned offsets (to
> >> >> >> be at parity with non-guest_memfd backing memory) but still fix this
> >> >> >> issue.
> >> >> >> 
> >> >> >> lpage_info will make sure that ranges near the bounds will be
> >> >> >> fragmented, but the hugepages in the middle will still be mappable as
> >> >> >> hugepages.
> >> >> >> 
> >> >> >> [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3706/binding-must-have-same-alignment.svg


* Re: [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
  2025-05-07  3:15                               ` Yan Zhao
@ 2025-05-13 17:33                                 ` Ackerley Tng
  0 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2025-05-13 17:33 UTC (permalink / raw)
  To: Yan Zhao
  Cc: vannapurve, chenyi.qiang, tabba, quic_eberman, roypat, jgg,
	peterx, david, rientjes, fvdl, jthoughton, seanjc, pbonzini,
	zhiquan1.li, fan.du, jun.miao, isaku.yamahata, muchun.song,
	erdemaktas, qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest

Yan Zhao <yan.y.zhao@intel.com> writes:

> On Tue, May 06, 2025 at 12:22:47PM -0700, Ackerley Tng wrote:
>> Yan Zhao <yan.y.zhao@intel.com> writes:
>> 
>> >> > <snip>
>> >> >
>> >> > What options does userspace have in this scenario?
>> >> > It can't reduce the flag to KVM_GUEST_MEMFD_HUGE_2MB. Adjusting the gmem.pgoff
>> >> > isn't ideal either.
>> >> >
>> >> > What about something similar as below?
>> >> >
>> >> > diff --git a/virt/kvm/guest_memfd.c b/virt/kvm/guest_memfd.c
>> >> > index d2feacd14786..87c33704a748 100644
>> >> > --- a/virt/kvm/guest_memfd.c
>> >> > +++ b/virt/kvm/guest_memfd.c
>> >> > @@ -1842,8 +1842,16 @@ __kvm_gmem_get_pfn(struct file *file, struct kvm_memory_slot *slot,
>> >> >         }
>> >> >
>> >> >         *pfn = folio_file_pfn(folio, index);
>> >> > -       if (max_order)
>> >> > -               *max_order = folio_order(folio);
>> >> > +       if (max_order) {
>> >> > +               int order;
>> >> > +
>> >> > +               order = folio_order(folio);
>> >> > +
>> >> > +               while (order > 0 && ((slot->base_gfn ^ slot->gmem.pgoff) & ((1 << order) - 1)))
>> >> > +                       order--;
>> >> > +
>> >> > +               *max_order = order;
>> >> > +       }
>> >> >
>> >> >         *is_prepared = folio_test_uptodate(folio);
>> >> >         return folio;
>> >> >
>> >> 
>> >> Vishal was wondering how this is working before guest_memfd was
>> >> introduced, for other backing memory like HugeTLB.
>> >> 
>> >> I then poked around and found this [1]. I will be adding a similar check
>> >> for any slot where kvm_slot_can_be_private(slot).
>> >>
>> >> Yan, that should work, right?
>> > No, I don't think the checking of ugfn [1] should work.
>> >
>> > 1. Even for slots bound to in-place-conversion guest_memfd (i.e. shared memory
>> > are allocated from guest_memfd), the slot->userspace_addr does not necessarily
>> > have the same offset as slot->gmem.pgoff. Even if we audit the offset in
>> > kvm_gmem_bind(), userspace could invoke munmap() and mmap() afterwards, causing
>> > slot->userspace_addr to point to a different offset.
>> >
>> > 2. for slots bound to guest_memfd that do not support in-place-conversion,
>> > shared memory is allocated from a different backend. Therefore, checking
>> > "slot->base_gfn ^ slot->gmem.pgoff" is required for private memory. The check is
>> > currently absent because guest_memfd supports 4K only.
>> >
>> >
>> 
>> Let me clarify, I meant these changes:
>> 
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 4b64ab3..d0dccf1 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -12938,6 +12938,11 @@ int memslot_rmap_alloc(struct kvm_memory_slot *slot, unsigned long npages)
>>         return 0;
>>  }
>>  
>> +static inline bool kvm_is_level_aligned(u64 value, int level)
>> +{
>> +       return IS_ALIGNED(value, KVM_PAGES_PER_HPAGE(level));
>> +}
>> +
>>  static int kvm_alloc_memslot_metadata(struct kvm *kvm,
>>                                       struct kvm_memory_slot *slot)
>>  {
>> @@ -12971,16 +12976,20 @@ static int kvm_alloc_memslot_metadata(struct kvm *kvm,
>>  
>>                 slot->arch.lpage_info[i - 1] = linfo;
>>  
>> -               if (slot->base_gfn & (KVM_PAGES_PER_HPAGE(level) - 1))
>> +               if (!kvm_is_level_aligned(slot->base_gfn, level))
>>                         linfo[0].disallow_lpage = 1;
>> -               if ((slot->base_gfn + npages) & (KVM_PAGES_PER_HPAGE(level) - 1))
>> +               if (!kvm_is_level_aligned(slot->base_gfn + npages, level))
>>                         linfo[lpages - 1].disallow_lpage = 1;
>>                 ugfn = slot->userspace_addr >> PAGE_SHIFT;
>>                 /*
>> -                * If the gfn and userspace address are not aligned wrt each
>> -                * other, disable large page support for this slot.
>> +                * If the gfn and userspace address are not aligned or if gfn
>> +                * and guest_memfd offset are not aligned wrt each other,
>> +                * disable large page support for this slot.
>>                  */
>> -               if ((slot->base_gfn ^ ugfn) & (KVM_PAGES_PER_HPAGE(level) - 1)) {
>> +               if (!kvm_is_level_aligned(slot->base_gfn ^ ugfn, level) ||
>> +                   (kvm_slot_can_be_private(slot) &&
>> +                    !kvm_is_level_aligned(slot->base_gfn ^ slot->gmem.pgoff,
>> +                                          level))) {
>>                         unsigned long j;
>>  
>>                         for (j = 0; j < lpages; ++j)
>> 
>> This does not rely on the ugfn check, but adds a similar check for gmem.pgoff.
> In the case of shared memory is not allocated from guest_memfd, (e.g. with the
> current upstream code), the checking of gmem.pgoff here will disallow huge page
> of shared memory even if "slot->base_gfn ^ ugfn" is aligned.
>

Thanks, I get it now. What you mean is that the memslot could have been
set up such that

+ slot->userspace_addr is aligned with slot->base_gfn, to be used for
  shared memory, and
+ slot->gmem.pgoff is not aligned with slot->base_gfn, to be used for
  private memory

and this check would disallow huge page mappings even though this
memslot was only going to be used for shared memory.

The only way to fix this would indeed be a runtime check, since the
shared/private status can change at runtime.

I think it is okay that this check is stricter than necessary, since it
just results in mapping without huge pages.
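
To make the difference concrete, here is a rough standalone sketch (plain
userspace C, not kernel code) comparing the bind-time check from the diff
above with a hypothetical per-fault check. struct toy_slot and the function
names are made up for illustration; only the XOR alignment test and the
9-bits-per-level page counts mirror what x86 KVM does.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Levels as on x86: 1 = 4K, 2 = 2M, 3 = 1G; 9 address bits per level. */
static uint64_t pages_per_hpage(int level)
{
	assert(level >= 1 && level <= 3);
	return 1ULL << ((level - 1) * 9);
}

static bool is_level_aligned(uint64_t value, int level)
{
	return (value & (pages_per_hpage(level) - 1)) == 0;
}

/* Hypothetical stand-in for the few memslot fields that matter here. */
struct toy_slot {
	uint64_t base_gfn;
	uint64_t ugfn;        /* userspace_addr >> PAGE_SHIFT */
	uint64_t gmem_pgoff;
	bool can_be_private;
};

/* Bind-time check: either misalignment disables the level for the slot. */
static bool bind_time_allows_level(const struct toy_slot *s, int level)
{
	if (!is_level_aligned(s->base_gfn ^ s->ugfn, level))
		return false;
	if (s->can_be_private &&
	    !is_level_aligned(s->base_gfn ^ s->gmem_pgoff, level))
		return false;
	return true;
}

/* Hypothetical per-fault check: only the backing actually used matters. */
static bool fault_time_allows_level(const struct toy_slot *s, bool is_private,
				    int level)
{
	uint64_t backing = is_private ? s->gmem_pgoff : s->ugfn;

	return is_level_aligned(s->base_gfn ^ backing, level);
}
```

With base_gfn = 0x200, ugfn = 0x400 and gmem_pgoff = 0x1, the bind-time
check rejects 2M for the whole slot, while the per-fault variant would still
allow 2M for shared faults and reject it only for private faults.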

What do you think?

>> I think this should take care of case (1.), for guest_memfds going to be
>> used for both shared and private memory. Userspace can't update
>> slot->userspace_addr, since guest_memfd memslots cannot be updated and
>> can only be deleted.
>> 
>> If userspace re-uses slot->userspace_addr for some other memory address
>> without deleting and re-adding a memslot,
>> 
>> + KVM's access to memory should still be fine, since after the recent
>>   discussion at guest_memfd upstream call, KVM's guest faults will
>>   always go via fd+offset and KVM's access won't be disrupted
>>   there. Whatever checking done at memslot binding time will still be
>>   valid.
> Could the offset of shared memory and offset of private memory be different if
> userspace re-uses slot->userspace_addr without deleting and re-adding a memslot?
>

They could be different, yes. I think what you mean is if userspace does
something like

addr = mmap(guest_memfd);
ioctl(KVM_SET_USER_MEMORY_REGION, addr, guest_memfd);
munmap(addr);
addr = mmap(addr, other_fd);
(with no second call to KVM_SET_USER_MEMORY_REGION)

Without guest_memfd, when munmap() happens, KVM should get a
notification via mmu_notifiers. That will unmap the pages from guest
page tables. At the next fault, host page tables will be consulted to
determine max_mapping_level, and at that time the mapping level would be
the new mapping level in host page tables.
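
The userspace half of that sequence can be tried without KVM. The sketch
below is only an illustration: anonymous memory stands in for both
guest_memfd and other_fd, there is no memslot, and remap_same_address() is a
made-up helper. It shows that the HVA stays the same while the backing behind
it changes, which is the situation the mmu_notifier invalidation covers.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Map a region, dirty it, unmap it, then map fresh memory at the same
 * address. Returns 0 if the address is unchanged and the new mapping shows
 * zeroed contents, -1 on any failure.
 */
static int remap_same_address(size_t len)
{
	unsigned char *addr, *again;
	int ok;

	assert(len > 0);

	/* First backing (stand-in for mmap(guest_memfd)). */
	addr = mmap(NULL, len, PROT_READ | PROT_WRITE,
		    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (addr == MAP_FAILED)
		return -1;
	memset(addr, 0xAA, len);

	if (munmap(addr, len))
		return -1;

	/* Second backing at the same HVA (stand-in for mmap(addr, other_fd)). */
	again = mmap(addr, len, PROT_READ | PROT_WRITE,
		     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	if (again == MAP_FAILED)
		return -1;

	/* Same address, but the old 0xAA contents are gone. */
	ok = (again == addr) && again[0] == 0 && again[len - 1] == 0;
	munmap(again, len);
	return ok ? 0 : -1;
}
```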

> Then though the two offsets are validated as equal in kvm_gmem_bind(), they may
> differ later on.
>

This is true.

Continuing from above, with guest_memfd, there are no issues if guest_memfd
is only used for private memory, since shared memory uses the same
mechanism as before guest_memfd.

If guest_memfd is used for both private and shared memory, on unmapping,
KVM will also get notified via mmu_notifiers. On the next fault, the
mapping level is determined as follows (I have a patch coming up that
will illustrate this better):

1. guest_memfd will return 4K since this is a shared folio and shared
   folios are always split to 4K. But suppose in future guest_memfd
   supports shared folios at higher levels, say 1G, we continue...
2. lpage info (not updated since userspace swapped out addr) will say
   map at 1G
3. Since this is a shared fault, we check host page tables, which would
   say 4K since there was a munmap() and mmap().
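
The three steps above amount to taking the minimum over several sources. A
toy model, with hypothetical names and levels numbered 1 = 4K, 2 = 2M,
3 = 1G (the real logic lives in KVM's fault path and will look different):

```c
#include <assert.h>
#include <stdbool.h>

enum toy_level { TOY_4K = 1, TOY_2M = 2, TOY_1G = 3 };

static int min_level(int a, int b)
{
	return a < b ? a : b;
}

/*
 * gmem_level:       largest level guest_memfd reports for the folio
 * lpage_info_level: largest level the memslot's lpage_info allows
 * host_pt_level:    level found in host page tables (shared faults only)
 */
static int toy_max_mapping_level(int gmem_level, int lpage_info_level,
				 bool is_shared, int host_pt_level)
{
	int level;

	assert(gmem_level >= TOY_4K && lpage_info_level >= TOY_4K);

	level = min_level(gmem_level, lpage_info_level);
	if (is_shared)
		level = min_level(level, host_pt_level);
	return level;
}
```

In the scenario above, toy_max_mapping_level(TOY_1G, TOY_1G, true, TOY_4K)
yields TOY_4K: the stale lpage_info says 1G, but the host page tables after
the munmap()/mmap() cap the result.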

I think it should still work as expected.

>> + Host's access and other accesses (e.g. instruction emulation, which
>>   uses slot->userspace_addr) to guest memory will be broken, but I think
>>   there's nothing protecting against that. The same breakage would
>>   happen for non-guest_memfd memslot.
> Why is host access broken in non-guest_memfd case?
> The HVA is still a valid one in QEMU's mmap-ed address space.
>

I was thinking that if a guest was executing code and the code gets
swapped out from under its feet by replacing the memory pointed to by
addr, the guest would be broken.

Now that I think about it again, it could be a valid use case. You're
right, thanks for pointing this out!

>> p.s. I will be adding the validation as you suggested [1], though that
>> shouldn't make a difference here, since the above check directly
>> validates against gmem.pgoff.
>> 
>> Regarding 2., checking this checks against gmem.pgoff and should handle
>> that as well.
>> 
>> [1] https://lore.kernel.org/all/aBnMp26iWWhUrsVf@yzhao56-desk.sh.intel.com/
>> 
>> I prefer checking at binding time because it aligns with the ugfn check
>> that is already there, and avoids having to check at every fault.
>> 
>> >> [1] https://github.com/torvalds/linux/blob/b6ea1680d0ac0e45157a819c41b46565f4616186/arch/x86/kvm/x86.c#L12996
>> >> 
>> >> >> >> Adding checks at binding time will allow hugepage-unaligned offsets (to
>> >> >> >> be at parity with non-guest_memfd backing memory) but still fix this
>> >> >> >> issue.
>> >> >> >> 
>> >> >> >> lpage_info will make sure that ranges near the bounds will be
>> >> >> >> fragmented, but the hugepages in the middle will still be mappable as
>> >> >> >> hugepages.
>> >> >> >> 
>> >> >> >> [1] https://lpc.events/event/18/contributions/1764/attachments/1409/3706/binding-must-have-same-alignment.svg


* Re: [RFC PATCH 00/39] 1G page support for guest_memfd
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (38 preceding siblings ...)
  2024-09-10 23:44 ` [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page Ackerley Tng
@ 2024-09-11  6:56 ` Michal Hocko
  2024-09-14  1:08 ` Du, Fan
  2025-01-28  9:42 ` Amit Shah
  41 siblings, 0 replies; 130+ messages in thread
From: Michal Hocko @ 2024-09-11  6:56 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest, linux-fsdevel,
	Oscar Salvador

Cc Oscar for awareness

On Tue 10-09-24 23:43:31, Ackerley Tng wrote:
> Hello,
> 
> This patchset is our exploration of how to support 1G pages in guest_memfd, and
> how the pages will be used in Confidential VMs.
> 
> The patchset covers:
> 
> + How to get 1G pages
> + Allowing mmap() of guest_memfd to userspace so that both private and shared
>   memory can use the same physical pages
> + Splitting and reconstructing pages to support conversions and mmap()
> + How the VM, userspace and guest_memfd interact to support conversions
> + Selftests to test all the above
>     + Selftests also demonstrate the conversion flow between VM, userspace and
>       guest_memfd.
> 
> Why 1G pages in guest memfd?
> 
> Bring guest_memfd to performance and memory savings parity with VMs that are
> backed by HugeTLBfs.
> 
> + Performance is improved with 1G pages by more TLB hits and faster page walks
>   on TLB misses.
> + Memory savings from 1G pages comes from HugeTLB Vmemmap Optimization (HVO).
> 
> Options for 1G page support:
> 
> 1. HugeTLB
> 2. Contiguous Memory Allocator (CMA)
> 3. Other suggestions are welcome!
> 
> Comparison between options:
> 
> 1. HugeTLB
>     + Refactor HugeTLB to separate allocator from the rest of HugeTLB
>     + Pro: Graceful transition for VMs backed with HugeTLB to guest_memfd
>         + Near term: Allows co-tenancy of HugeTLB and guest_memfd backed VMs
>     + Pro: Can provide iterative steps toward new future allocator
>         + Unexplored: Managing userspace-visible changes
>             + e.g. HugeTLB's free_hugepages will decrease if HugeTLB is used,
>               but not when future allocator is used
> 2. CMA
>     + Port some HugeTLB features to be applied on CMA
>     + Pro: Clean slate
> 
> What would refactoring HugeTLB involve?
> 
> (Some refactoring was done in this RFC, more can be done.)
> 
> 1. Broadly involves separating the HugeTLB allocator from the rest of HugeTLB
>     + Brings more modularity to HugeTLB
>     + No functionality change intended
>     + Likely step towards HugeTLB's integration into core-mm
> 2. guest_memfd will use just the allocator component of HugeTLB, not including
>    the complex parts of HugeTLB like
>     + Userspace reservations (resv_map)
>     + Shared PMD mappings
>     + Special page walkers
> 
> What features would need to be ported to CMA?
> 
> + Improved allocation guarantees
>     + Per NUMA node pool of huge pages
>     + Subpools per guest_memfd
> + Memory savings
>     + Something like HugeTLB Vmemmap Optimization
> + Configuration/reporting features
>     + Configuration of number of pages available (and per NUMA node) at and
>       after host boot
>     + Reporting of memory usage/availability statistics at runtime
> 
> HugeTLB was picked as the source of 1G pages for this RFC because it allows a
> graceful transition, and retains memory savings from HVO.
> 
> To illustrate this, if a host machine uses HugeTLBfs to back VMs, and a
> confidential VM were to be scheduled on that host, some HugeTLBfs pages would
> have to be given up and returned to CMA for guest_memfd pages to be rebuilt from
> that memory. This requires memory to be reserved for HVO to be removed and
> reapplied on the new guest_memfd memory. This not only slows down memory
> allocation but also trims the benefits of HVO. Memory would have to be reserved
> on the host to facilitate these transitions.
> 
> Improving how guest_memfd uses the allocator in a future revision of this RFC:
> 
> To provide an easier transition away from HugeTLB, guest_memfd's use of HugeTLB
> should be limited to these allocator functions:
> 
> + reserve(node, page_size, num_pages) => opaque handle
>     + Used when a guest_memfd inode is created to reserve memory from backend
>       allocator
> + allocate(handle, mempolicy, page_size) => folio
>     + To allocate a folio from guest_memfd's reservation
> + split(handle, folio, target_page_size) => void
>     + To take a huge folio, and split it to smaller folios, restore to filemap
> + reconstruct(handle, first_folio, nr_pages) => void
>     + To take a folio, and reconstruct a huge folio out of nr_pages from the
>       first_folio
> + free(handle, folio) => void
>     + To return folio to guest_memfd's reservation
> + error(handle, folio) => void
>     + To handle memory errors
> + unreserve(handle) => void
>     + To return guest_memfd's reservation to allocator backend
> 
> Userspace should only provide a page size when creating a guest_memfd and should
> not have to specify HugeTLB.
> 
> Overview of patches:
> 
> + Patches 01-12
>     + Many small changes to HugeTLB, mostly to separate HugeTLBfs concepts from
>       HugeTLB, and to expose HugeTLB functions.
> + Patches 13-16
>     + Letting guest_memfd use HugeTLB
>     + Creation of each guest_memfd reserves pages from HugeTLB's global hstate
>       and puts it into the guest_memfd inode's subpool
>     + Each folio allocation takes a page from the guest_memfd inode's subpool
> + Patches 17-21
>     + Selftests for new HugeTLB features in guest_memfd
> + Patches 22-24
>     + More small changes on the HugeTLB side to expose functions needed by
>       guest_memfd
> + Patch 25:
>     + Uses the newly available functions from patches 22-24 to split HugeTLB
>       pages. In this patch, HugeTLB folios are always split to 4K before any
>       usage, private or shared.
> + Patches 26-28
>     + Allow mmap() in guest_memfd and faulting in shared pages
> + Patch 29
>     + Enables conversion between private/shared pages
> + Patch 30
>     + Required to zero folios after conversions to avoid leaking initialized
>       kernel memory
> + Patch 31-38
>     + Add selftests to test mapping pages to userspace, guest/host memory
>       sharing and update conversions tests
>     + Patch 33 illustrates the conversion flow between VM/userspace/guest_memfd
> + Patch 39
>     + Dynamically split and reconstruct HugeTLB pages instead of always
>       splitting before use. All earlier selftests are expected to still pass.
> 
> TODOs:
> 
> + Add logic to wait for safe_refcount [1]
> + Look into lazy splitting/reconstruction of pages
>     + Currently, when the KVM_SET_MEMORY_ATTRIBUTES is invoked, not only is the
>       mem_attr_array and faultability updated, the pages in the requested range
>       are also split/reconstructed as necessary. We want to look into delaying
>       splitting/reconstruction to fault time.
> + Solve race between folios being faulted in and being truncated
>     + When running private_mem_conversions_test with more than 1 vCPU, a folio
>       getting truncated may get faulted in by another process, causing elevated
>       mapcounts when the folio is freed (VM_BUG_ON_FOLIO).
> + Add intermediate splits (1G should first split to 2M and not split directly to
>   4K)
> + Use guest's lock instead of hugetlb_lock
> + Use multi-index xarray/replace xarray with some other data struct for
>   faultability flag
> + Refactor HugeTLB better, present generic allocator interface
> 
> Please let us know your thoughts on:
> 
> + HugeTLB as the choice of transitional allocator backend
> + Refactoring HugeTLB to provide generic allocator interface
> + Shared/private conversion flow
>     + Requiring user to request kernel to unmap pages from userspace using
>       madvise(MADV_DONTNEED)
>     + Failing conversion on elevated mapcounts/pincounts/refcounts
> + Process of splitting/reconstructing page
> + Anything else!
> 
> [1] https://lore.kernel.org/all/20240829-guest-memfd-lib-v2-0-b9afc1ff3656@quicinc.com/T/
> 
> Ackerley Tng (37):
>   mm: hugetlb: Simplify logic in dequeue_hugetlb_folio_vma()
>   mm: hugetlb: Refactor vma_has_reserves() to should_use_hstate_resv()
>   mm: hugetlb: Remove unnecessary check for avoid_reserve
>   mm: mempolicy: Refactor out policy_node_nodemask()
>   mm: hugetlb: Refactor alloc_buddy_hugetlb_folio_with_mpol() to
>     interpret mempolicy instead of vma
>   mm: hugetlb: Refactor dequeue_hugetlb_folio_vma() to use mpol
>   mm: hugetlb: Refactor out hugetlb_alloc_folio
>   mm: truncate: Expose preparation steps for truncate_inode_pages_final
>   mm: hugetlb: Expose hugetlb_subpool_{get,put}_pages()
>   mm: hugetlb: Add option to create new subpool without using surplus
>   mm: hugetlb: Expose hugetlb_acct_memory()
>   mm: hugetlb: Move and expose hugetlb_zero_partial_page()
>   KVM: guest_memfd: Make guest mem use guest mem inodes instead of
>     anonymous inodes
>   KVM: guest_memfd: hugetlb: initialization and cleanup
>   KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb
>   KVM: guest_memfd: Add page alignment check for hugetlb guest_memfd
>   KVM: selftests: Add basic selftests for hugetlb-backed guest_memfd
>   KVM: selftests: Support various types of backing sources for private
>     memory
>   KVM: selftests: Update test for various private memory backing source
>     types
>   KVM: selftests: Add private_mem_conversions_test.sh
>   KVM: selftests: Test that guest_memfd usage is reported via hugetlb
>   mm: hugetlb: Expose vmemmap optimization functions
>   mm: hugetlb: Expose HugeTLB functions for promoting/demoting pages
>   mm: hugetlb: Add functions to add/move/remove from hugetlb lists
>   KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
>   KVM: guest_memfd: Allow mmapping guest_memfd files
>   KVM: guest_memfd: Use vm_type to determine default faultability
>   KVM: Handle conversions in the SET_MEMORY_ATTRIBUTES ioctl
>   KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
>   KVM: selftests: Allow vm_set_memory_attributes to be used without
>     asserting return value of 0
>   KVM: selftests: Test using guest_memfd memory from userspace
>   KVM: selftests: Test guest_memfd memory sharing between guest and host
>   KVM: selftests: Add notes in private_mem_kvm_exits_test for mmap-able
>     guest_memfd
>   KVM: selftests: Test that pinned pages block KVM from setting memory
>     attributes to PRIVATE
>   KVM: selftests: Refactor vm_mem_add to be more flexible
>   KVM: selftests: Add helper to perform madvise by memslots
>   KVM: selftests: Update private_mem_conversions_test for mmap()able
>     guest_memfd
> 
> Vishal Annapurve (2):
>   KVM: guest_memfd: Split HugeTLB pages for guest_memfd use
>   KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
> 
>  fs/hugetlbfs/inode.c                          |   35 +-
>  include/linux/hugetlb.h                       |   54 +-
>  include/linux/kvm_host.h                      |    1 +
>  include/linux/mempolicy.h                     |    2 +
>  include/linux/mm.h                            |    1 +
>  include/uapi/linux/kvm.h                      |   26 +
>  include/uapi/linux/magic.h                    |    1 +
>  mm/hugetlb.c                                  |  346 ++--
>  mm/hugetlb_vmemmap.h                          |   11 -
>  mm/mempolicy.c                                |   36 +-
>  mm/truncate.c                                 |   26 +-
>  tools/include/linux/kernel.h                  |    4 +-
>  tools/testing/selftests/kvm/Makefile          |    3 +
>  .../kvm/guest_memfd_hugetlb_reporting_test.c  |  222 +++
>  .../selftests/kvm/guest_memfd_pin_test.c      |  104 ++
>  .../selftests/kvm/guest_memfd_sharing_test.c  |  160 ++
>  .../testing/selftests/kvm/guest_memfd_test.c  |  238 ++-
>  .../testing/selftests/kvm/include/kvm_util.h  |   45 +-
>  .../testing/selftests/kvm/include/test_util.h |   18 +
>  tools/testing/selftests/kvm/lib/kvm_util.c    |  443 +++--
>  tools/testing/selftests/kvm/lib/test_util.c   |   99 ++
>  .../kvm/x86_64/private_mem_conversions_test.c |  158 +-
>  .../x86_64/private_mem_conversions_test.sh    |   91 +
>  .../kvm/x86_64/private_mem_kvm_exits_test.c   |   11 +-
>  virt/kvm/guest_memfd.c                        | 1563 ++++++++++++++++-
>  virt/kvm/kvm_main.c                           |   17 +
>  virt/kvm/kvm_mm.h                             |   16 +
>  27 files changed, 3288 insertions(+), 443 deletions(-)
>  create mode 100644 tools/testing/selftests/kvm/guest_memfd_hugetlb_reporting_test.c
>  create mode 100644 tools/testing/selftests/kvm/guest_memfd_pin_test.c
>  create mode 100644 tools/testing/selftests/kvm/guest_memfd_sharing_test.c
>  create mode 100755 tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.sh
> 
> --
> 2.46.0.598.g6f2099f65c-goog

-- 
Michal Hocko
SUSE Labs


* RE: [RFC PATCH 00/39] 1G page support for guest_memfd
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (39 preceding siblings ...)
  2024-09-11  6:56 ` [RFC PATCH 00/39] 1G page support for guest_memfd Michal Hocko
@ 2024-09-14  1:08 ` Du, Fan
  2024-09-14 13:34   ` Vishal Annapurve
  2025-01-28  9:42 ` Amit Shah
  41 siblings, 1 reply; 130+ messages in thread
From: Du, Fan @ 2024-09-14  1:08 UTC (permalink / raw)
  To: Ackerley Tng, tabba@google.com, quic_eberman@quicinc.com,
	roypat@amazon.co.uk, jgg@nvidia.com, peterx@redhat.com,
	david@redhat.com, rientjes@google.com, fvdl@google.com,
	jthoughton@google.com, seanjc@google.com, pbonzini@redhat.com,
	Li, Zhiquan1, Miao, Jun, Yamahata, Isaku, muchun.song@linux.dev,
	mike.kravetz@oracle.com
  Cc: Aktas, Erdem, Annapurve, Vishal, qperret@google.com,
	jhubbard@nvidia.com, willy@infradead.org, shuah@kernel.org,
	brauner@kernel.org, bfoster@redhat.com, kent.overstreet@linux.dev,
	pvorel@suse.cz, rppt@kernel.org, richard.weiyang@gmail.com,
	anup@brainfault.org, Xu, Haibo1, ajones@ventanamicro.com,
	vkuznets@redhat.com, Wieczor-Retman, Maciej, pgonda@google.com,
	oliver.upton@linux.dev, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, kvm@vger.kernel.org,
	linux-kselftest@vger.kernel.org, linux-fsdevel@kvack.org, Du, Fan



> -----Original Message-----
> From: Ackerley Tng <ackerleytng@google.com>
> Sent: Wednesday, September 11, 2024 7:44 AM
> To: tabba@google.com; quic_eberman@quicinc.com; roypat@amazon.co.uk;
> jgg@nvidia.com; peterx@redhat.com; david@redhat.com;
> rientjes@google.com; fvdl@google.com; jthoughton@google.com;
> seanjc@google.com; pbonzini@redhat.com; Li, Zhiquan1
> <zhiquan1.li@intel.com>; Du, Fan <fan.du@intel.com>; Miao, Jun
> <jun.miao@intel.com>; Yamahata, Isaku <isaku.yamahata@intel.com>;
> muchun.song@linux.dev; mike.kravetz@oracle.com
> Cc: Aktas, Erdem <erdemaktas@google.com>; Annapurve, Vishal
> <vannapurve@google.com>; ackerleytng@google.com; qperret@google.com;
> jhubbard@nvidia.com; willy@infradead.org; shuah@kernel.org;
> brauner@kernel.org; bfoster@redhat.com; kent.overstreet@linux.dev;
> pvorel@suse.cz; rppt@kernel.org; richard.weiyang@gmail.com;
> anup@brainfault.org; Xu, Haibo1 <haibo1.xu@intel.com>;
> ajones@ventanamicro.com; vkuznets@redhat.com; Wieczor-Retman, Maciej
> <maciej.wieczor-retman@intel.com>; pgonda@google.com;
> oliver.upton@linux.dev; linux-kernel@vger.kernel.org; linux-mm@kvack.org;
> kvm@vger.kernel.org; linux-kselftest@vger.kernel.org;
> linux-fsdevel@kvack.org
> Subject: [RFC PATCH 00/39] 1G page support for guest_memfd
> 
> Hello,
> 
> This patchset is our exploration of how to support 1G pages in guest_memfd, and
> how the pages will be used in Confidential VMs.
> 
> The patchset covers:
> 
> + How to get 1G pages
> + Allowing mmap() of guest_memfd to userspace so that both private and shared

Hi Ackerley,

Thanks for posting the new version :)

W.r.t. the description above and the snippet below about patches 26-29:
does this new design aim to back both shared and private GPAs with a single
HugeTLB subpool equal in size to the VM instance's total memory?

As I understand it, before these changes the shared memfd and the gmem fd each
had a dedicated HugeTLB pool, i.e. two separate HugeTLB reservations.

Does QEMU require changes as well? I'd like to test this series if you can
share a QEMU branch.

> + Patches 26-28
>     + Allow mmap() in guest_memfd and faulting in shared pages
> + Patch 29
>     + Enables conversion between private/shared pages

Thanks!

>   memory can use the same physical pages
> + Splitting and reconstructing pages to support conversions and mmap()
> + How the VM, userspace and guest_memfd interact to support conversions
> + Selftests to test all the above
>     + Selftests also demonstrate the conversion flow between VM, userspace
> and
>       guest_memfd.
> 
> Why 1G pages in guest memfd?
> 
> Bring guest_memfd to performance and memory savings parity with VMs that
> are
> backed by HugeTLBfs.
> 
> + Performance is improved with 1G pages by more TLB hits and faster page
> walks
>   on TLB misses.
> + Memory savings from 1G pages comes from HugeTLB Vmemmap
> Optimization (HVO).
> 
> Options for 1G page support:
> 
> 1. HugeTLB
> 2. Contiguous Memory Allocator (CMA)
> 3. Other suggestions are welcome!
> 
> Comparison between options:
> 
> 1. HugeTLB
>     + Refactor HugeTLB to separate allocator from the rest of HugeTLB
>     + Pro: Graceful transition for VMs backed with HugeTLB to guest_memfd
>         + Near term: Allows co-tenancy of HugeTLB and guest_memfd backed
> VMs
>     + Pro: Can provide iterative steps toward new future allocator
>         + Unexplored: Managing userspace-visible changes
>             + e.g. HugeTLB's free_hugepages will decrease if HugeTLB is used,
>               but not when future allocator is used
> 2. CMA
>     + Port some HugeTLB features to be applied on CMA
>     + Pro: Clean slate
> 
> What would refactoring HugeTLB involve?
> 
> (Some refactoring was done in this RFC, more can be done.)
> 
> 1. Broadly involves separating the HugeTLB allocator from the rest of HugeTLB
>     + Brings more modularity to HugeTLB
>     + No functionality change intended
>     + Likely step towards HugeTLB's integration into core-mm
> 2. guest_memfd will use just the allocator component of HugeTLB, not
> including
>    the complex parts of HugeTLB like
>     + Userspace reservations (resv_map)
>     + Shared PMD mappings
>     + Special page walkers
> 
> What features would need to be ported to CMA?
> 
> + Improved allocation guarantees
>     + Per NUMA node pool of huge pages
>     + Subpools per guest_memfd
> + Memory savings
>     + Something like HugeTLB Vmemmap Optimization
> + Configuration/reporting features
>     + Configuration of number of pages available (and per NUMA node) at and
>       after host boot
>     + Reporting of memory usage/availability statistics at runtime
> 
> HugeTLB was picked as the source of 1G pages for this RFC because it allows a
> graceful transition, and retains memory savings from HVO.
> 
> To illustrate this, if a host machine uses HugeTLBfs to back VMs, and a
> confidential VM were to be scheduled on that host, some HugeTLBfs pages
> would
> have to be given up and returned to CMA for guest_memfd pages to be
> rebuilt from
> that memory. This requires memory to be reserved for HVO to be removed
> and
> reapplied on the new guest_memfd memory. This not only slows down
> memory
> allocation but also trims the benefits of HVO. Memory would have to be
> reserved
> on the host to facilitate these transitions.
> 
> Improving how guest_memfd uses the allocator in a future revision of this
> RFC:
> 
> To provide an easier transition away from HugeTLB, guest_memfd's use of
> HugeTLB
> should be limited to these allocator functions:
> 
> + reserve(node, page_size, num_pages) => opaque handle
>     + Used when a guest_memfd inode is created to reserve memory from
> backend
>       allocator
> + allocate(handle, mempolicy, page_size) => folio
>     + To allocate a folio from guest_memfd's reservation
> + split(handle, folio, target_page_size) => void
>     + To take a huge folio, and split it to smaller folios, restore to filemap
> + reconstruct(handle, first_folio, nr_pages) => void
>     + To take a folio, and reconstruct a huge folio out of nr_pages from the
>       first_folio
> + free(handle, folio) => void
>     + To return folio to guest_memfd's reservation
> + error(handle, folio) => void
>     + To handle memory errors
> + unreserve(handle) => void
>     + To return guest_memfd's reservation to allocator backend
> 
> Userspace should only provide a page size when creating a guest_memfd and
> should
> not have to specify HugeTLB.
> 
> Overview of patches:
> 
> + Patches 01-12
>     + Many small changes to HugeTLB, mostly to separate HugeTLBfs concepts
> from
>       HugeTLB, and to expose HugeTLB functions.
> + Patches 13-16
>     + Letting guest_memfd use HugeTLB
>     + Creation of each guest_memfd reserves pages from HugeTLB's global
> hstate
>       and puts it into the guest_memfd inode's subpool
>     + Each folio allocation takes a page from the guest_memfd inode's subpool
> + Patches 17-21
>     + Selftests for new HugeTLB features in guest_memfd
> + Patches 22-24
>     + More small changes on the HugeTLB side to expose functions needed by
>       guest_memfd
> + Patch 25:
>     + Uses the newly available functions from patches 22-24 to split HugeTLB
>       pages. In this patch, HugeTLB folios are always split to 4K before any
>       usage, private or shared.
> + Patches 26-28
>     + Allow mmap() in guest_memfd and faulting in shared pages
> + Patch 29
>     + Enables conversion between private/shared pages
> + Patch 30
>     + Required to zero folios after conversions to avoid leaking initialized
>       kernel memory
> + Patch 31-38
>     + Add selftests to test mapping pages to userspace, guest/host memory
>       sharing and update conversions tests
>     + Patch 33 illustrates the conversion flow between
> VM/userspace/guest_memfd
> + Patch 39
>     + Dynamically split and reconstruct HugeTLB pages instead of always
>       splitting before use. All earlier selftests are expected to still pass.
> 
> TODOs:
> 
> + Add logic to wait for safe_refcount [1]
> + Look into lazy splitting/reconstruction of pages
>     + Currently, when the KVM_SET_MEMORY_ATTRIBUTES is invoked, not only
> is the
>       mem_attr_array and faultability updated, the pages in the requested
> range
>       are also split/reconstructed as necessary. We want to look into delaying
>       splitting/reconstruction to fault time.
> + Solve race between folios being faulted in and being truncated
>     + When running private_mem_conversions_test with more than 1 vCPU, a
> folio
>       getting truncated may get faulted in by another process, causing elevated
>       mapcounts when the folio is freed (VM_BUG_ON_FOLIO).
> + Add intermediate splits (1G should first split to 2M and not split directly to
>   4K)
> + Use guest's lock instead of hugetlb_lock
> + Use multi-index xarray/replace xarray with some other data struct for
>   faultability flag
> + Refactor HugeTLB better, present generic allocator interface
> 
> Please let us know your thoughts on:
> 
> + HugeTLB as the choice of transitional allocator backend
> + Refactoring HugeTLB to provide generic allocator interface
> + Shared/private conversion flow
>     + Requiring user to request kernel to unmap pages from userspace using
>       madvise(MADV_DONTNEED)
>     + Failing conversion on elevated mapcounts/pincounts/refcounts
> + Process of splitting/reconstructing page
> + Anything else!
> 
> [1] https://lore.kernel.org/all/20240829-guest-memfd-lib-v2-0-
> b9afc1ff3656@quicinc.com/T/
> 
> Ackerley Tng (37):
>   mm: hugetlb: Simplify logic in dequeue_hugetlb_folio_vma()
>   mm: hugetlb: Refactor vma_has_reserves() to should_use_hstate_resv()
>   mm: hugetlb: Remove unnecessary check for avoid_reserve
>   mm: mempolicy: Refactor out policy_node_nodemask()
>   mm: hugetlb: Refactor alloc_buddy_hugetlb_folio_with_mpol() to
>     interpret mempolicy instead of vma
>   mm: hugetlb: Refactor dequeue_hugetlb_folio_vma() to use mpol
>   mm: hugetlb: Refactor out hugetlb_alloc_folio
>   mm: truncate: Expose preparation steps for truncate_inode_pages_final
>   mm: hugetlb: Expose hugetlb_subpool_{get,put}_pages()
>   mm: hugetlb: Add option to create new subpool without using surplus
>   mm: hugetlb: Expose hugetlb_acct_memory()
>   mm: hugetlb: Move and expose hugetlb_zero_partial_page()
>   KVM: guest_memfd: Make guest mem use guest mem inodes instead of
>     anonymous inodes
>   KVM: guest_memfd: hugetlb: initialization and cleanup
>   KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb
>   KVM: guest_memfd: Add page alignment check for hugetlb guest_memfd
>   KVM: selftests: Add basic selftests for hugetlb-backed guest_memfd
>   KVM: selftests: Support various types of backing sources for private
>     memory
>   KVM: selftests: Update test for various private memory backing source
>     types
>   KVM: selftests: Add private_mem_conversions_test.sh
>   KVM: selftests: Test that guest_memfd usage is reported via hugetlb
>   mm: hugetlb: Expose vmemmap optimization functions
>   mm: hugetlb: Expose HugeTLB functions for promoting/demoting pages
>   mm: hugetlb: Add functions to add/move/remove from hugetlb lists
>   KVM: guest_memfd: Track faultability within a struct kvm_gmem_private
>   KVM: guest_memfd: Allow mmapping guest_memfd files
>   KVM: guest_memfd: Use vm_type to determine default faultability
>   KVM: Handle conversions in the SET_MEMORY_ATTRIBUTES ioctl
>   KVM: guest_memfd: Handle folio preparation for guest_memfd mmap
>   KVM: selftests: Allow vm_set_memory_attributes to be used without
>     asserting return value of 0
>   KVM: selftests: Test using guest_memfd memory from userspace
>   KVM: selftests: Test guest_memfd memory sharing between guest and host
>   KVM: selftests: Add notes in private_mem_kvm_exits_test for mmap-able
>     guest_memfd
>   KVM: selftests: Test that pinned pages block KVM from setting memory
>     attributes to PRIVATE
>   KVM: selftests: Refactor vm_mem_add to be more flexible
>   KVM: selftests: Add helper to perform madvise by memslots
>   KVM: selftests: Update private_mem_conversions_test for mmap()able
>     guest_memfd
> 
> Vishal Annapurve (2):
>   KVM: guest_memfd: Split HugeTLB pages for guest_memfd use
>   KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page
> 
>  fs/hugetlbfs/inode.c                          |   35 +-
>  include/linux/hugetlb.h                       |   54 +-
>  include/linux/kvm_host.h                      |    1 +
>  include/linux/mempolicy.h                     |    2 +
>  include/linux/mm.h                            |    1 +
>  include/uapi/linux/kvm.h                      |   26 +
>  include/uapi/linux/magic.h                    |    1 +
>  mm/hugetlb.c                                  |  346 ++--
>  mm/hugetlb_vmemmap.h                          |   11 -
>  mm/mempolicy.c                                |   36 +-
>  mm/truncate.c                                 |   26 +-
>  tools/include/linux/kernel.h                  |    4 +-
>  tools/testing/selftests/kvm/Makefile          |    3 +
>  .../kvm/guest_memfd_hugetlb_reporting_test.c  |  222 +++
>  .../selftests/kvm/guest_memfd_pin_test.c      |  104 ++
>  .../selftests/kvm/guest_memfd_sharing_test.c  |  160 ++
>  .../testing/selftests/kvm/guest_memfd_test.c  |  238 ++-
>  .../testing/selftests/kvm/include/kvm_util.h  |   45 +-
>  .../testing/selftests/kvm/include/test_util.h |   18 +
>  tools/testing/selftests/kvm/lib/kvm_util.c    |  443 +++--
>  tools/testing/selftests/kvm/lib/test_util.c   |   99 ++
>  .../kvm/x86_64/private_mem_conversions_test.c |  158 +-
>  .../x86_64/private_mem_conversions_test.sh    |   91 +
>  .../kvm/x86_64/private_mem_kvm_exits_test.c   |   11 +-
>  virt/kvm/guest_memfd.c                        | 1563 ++++++++++++++++-
>  virt/kvm/kvm_main.c                           |   17 +
>  virt/kvm/kvm_mm.h                             |   16 +
>  27 files changed, 3288 insertions(+), 443 deletions(-)
>  create mode 100644
> tools/testing/selftests/kvm/guest_memfd_hugetlb_reporting_test.c
>  create mode 100644 tools/testing/selftests/kvm/guest_memfd_pin_test.c
>  create mode 100644 tools/testing/selftests/kvm/guest_memfd_sharing_test.c
>  create mode 100755
> tools/testing/selftests/kvm/x86_64/private_mem_conversions_test.sh
> 
> --
> 2.46.0.598.g6f2099f65c-goog


* Re: [RFC PATCH 00/39] 1G page support for guest_memfd
  2024-09-14  1:08 ` Du, Fan
@ 2024-09-14 13:34   ` Vishal Annapurve
  0 siblings, 0 replies; 130+ messages in thread
From: Vishal Annapurve @ 2024-09-14 13:34 UTC (permalink / raw)
  To: Du, Fan
  Cc: Ackerley Tng, tabba@google.com, quic_eberman@quicinc.com,
	roypat@amazon.co.uk, jgg@nvidia.com, peterx@redhat.com,
	david@redhat.com, rientjes@google.com, fvdl@google.com,
	jthoughton@google.com, seanjc@google.com, pbonzini@redhat.com,
	Li, Zhiquan1, Miao, Jun, Yamahata, Isaku, muchun.song@linux.dev,
	mike.kravetz@oracle.com, Aktas, Erdem, qperret@google.com,
	jhubbard@nvidia.com, willy@infradead.org, shuah@kernel.org,
	brauner@kernel.org, bfoster@redhat.com, kent.overstreet@linux.dev,
	pvorel@suse.cz, rppt@kernel.org, richard.weiyang@gmail.com,
	anup@brainfault.org, Xu, Haibo1, ajones@ventanamicro.com,
	vkuznets@redhat.com, Wieczor-Retman, Maciej, pgonda@google.com,
	oliver.upton@linux.dev, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org, kvm@vger.kernel.org,
	linux-kselftest@vger.kernel.org, linux-fsdevel@kvack.org

On Fri, Sep 13, 2024 at 6:08 PM Du, Fan <fan.du@intel.com> wrote:
>
> ...
> >
> > Hello,
> >
> > This patchset is our exploration of how to support 1G pages in guest_memfd,
> > and
> > how the pages will be used in Confidential VMs.
> >
> > The patchset covers:
> >
> > + How to get 1G pages
> > + Allowing mmap() of guest_memfd to userspace so that both private and
> > shared
>
> Hi Ackerley
>
> Thanks for posting new version :)
>
> W.r.t above description and below patch snippet from Patch 26-29,
> Does this new design aim to backup shared and private GPA with a single
> Hugetlb spool which equal VM instance total memory?

Yes.
>
> By my understanding, before this new changes, shared memfd and gmem fd
> has dedicate hugetlb pool, that's two copy/reservation of hugetlb spool.

Selftests attached to this series use single gmem fd to back guest memory.

>
> Does Qemu require new changes as well? I'd like to have a test of this series
> if you can share Qemu branch?
>

We are going to discuss this RFC series and related issues at LPC.
Once the next steps are finalized, the plan will be to send out an
improved version. You can use/modify the selftests that are part of
this series to test this feature with software protected VMs for now.

Qemu will require changes for this feature on top of the already-posted
gmem integration series [1] that adds software protected VM support to
Qemu. If you are interested in testing this feature with TDX VMs, then
multiple series are needed to set up the right test environment
(including [2]). We haven't considered posting Qemu patches, and it
will be a while before we can get to them.

[1] https://patchew.org/QEMU/20230914035117.3285885-1-xiaoyao.li@intel.com/
[2] https://patchwork.kernel.org/project/kvm/cover/20231115071519.2864957-1-xiaoyao.li@intel.com/


* Re: [RFC PATCH 00/39] 1G page support for guest_memfd
  2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
                   ` (40 preceding siblings ...)
  2024-09-14  1:08 ` Du, Fan
@ 2025-01-28  9:42 ` Amit Shah
  2025-02-03  8:35   ` Ackerley Tng
  41 siblings, 1 reply; 130+ messages in thread
From: Amit Shah @ 2025-01-28  9:42 UTC (permalink / raw)
  To: Ackerley Tng, tabba, quic_eberman, roypat, jgg, peterx, david,
	rientjes, fvdl, jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du,
	jun.miao, isaku.yamahata, muchun.song, mike.kravetz
  Cc: erdemaktas, vannapurve, qperret, jhubbard, willy, shuah, brauner,
	bfoster, kent.overstreet, pvorel, rppt, richard.weiyang, anup,
	haibo1.xu, ajones, vkuznets, maciej.wieczor-retman, pgonda,
	oliver.upton, linux-kernel, linux-mm, kvm, linux-kselftest,
	linux-fsdevel

Hey Ackerley,

On Tue, 2024-09-10 at 23:43 +0000, Ackerley Tng wrote:
> Hello,
> 
> This patchset is our exploration of how to support 1G pages in
> guest_memfd, and
> how the pages will be used in Confidential VMs.

We've discussed this patchset at LPC and in the guest-memfd calls.  Can
you please summarise the discussions here as a follow-up, so we can
also continue discussing on-list, and not repeat things that are
already discussed?

Also - as mentioned in those meetings, we at AMD are interested in this
series along with SEV-SNP support - and I'm also interested in figuring
out how we collaborate on the evolution of this series.

Thanks,

		Amit


* Re: [RFC PATCH 00/39] 1G page support for guest_memfd
  2025-01-28  9:42 ` Amit Shah
@ 2025-02-03  8:35   ` Ackerley Tng
  2025-02-06 11:07     ` Amit Shah
  0 siblings, 1 reply; 130+ messages in thread
From: Ackerley Tng @ 2025-02-03  8:35 UTC (permalink / raw)
  To: Amit Shah
  Cc: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest, linux-fsdevel

Amit Shah <amit@infradead.org> writes:

> Hey Ackerley,

Hi Amit,

> On Tue, 2024-09-10 at 23:43 +0000, Ackerley Tng wrote:
>> Hello,
>> 
>> This patchset is our exploration of how to support 1G pages in
>> guest_memfd, and
>> how the pages will be used in Confidential VMs.
>
> We've discussed this patchset at LPC and in the guest-memfd calls.  Can
> you please summarise the discussions here as a follow-up, so we can
> also continue discussing on-list, and not repeat things that are
> already discussed?

Thanks for this question! Since LPC, Vishal and I have been tied up with
some Google internal work, which slowed down progress on 1G page support
for guest_memfd. We expect to make progress on it this quarter and over
the next few quarters.

The related updates are:

1. No objections to using hugetlb as the source of 1G pages.

2. Prerequisite hugetlb changes.

+ I've separated some of the prerequisite hugetlb changes into another
  patch series hoping to have them merged ahead of and separately from
  this patchset [1].
+ Peter Xu contributed a better patchset, including a bugfix [2].
+ I have an alternative [3].
+ The next revision of this series (1G page support for guest_memfd)
  will be based on alternative [3]. I think there should be no issues
  there.
+ I believe Peter is also waiting on the next revision before we make
  further progress/decide on [2] or [3].

3. No objections to allowing mmap()-ing of guest_memfd physical memory
   when memory is marked shared to avoid double-allocation.

4. No objections to splitting pages when marked shared.

5. folio_put() callback for guest_memfd folio cleanup/merging.

+ In Fuad's series [4], Fuad used the callback to reset the folio's
  mappability status.
+ The catch is that the callback is only invoked when folio->page_type
  == PGTY_guest_memfd, and folio->page_type is a union with folio's
  mapcount, so any folio with a non-zero mapcount cannot have a valid
  page_type.
+ I was concerned that we might not get a callback, and hence
  unintentionally skip merging pages and fail to restore hugetlb
  pages correctly.
+ This was discussed at the last guest_memfd upstream call (2025-01-23
  07:58 PST), and the conclusion is that using folio->page_type works,
  because:
    + We only merge folios in two cases: (1) when converting to private,
      and (2) when truncating folios (removing them from the filemap).
    + When converting to private, in (1), we can forcibly unmap all the
      converted pages or check whether the mapcount is 0, and once the
      mapcount is 0 we can install the callback by setting
      folio->page_type = PGTY_guest_memfd.
    + When truncating, in (2), we will be unmapping the folios anyway,
      so the mapcount is also 0 and we can install the callback.
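
To make the constraint in point 5 concrete, here is a minimal Python
sketch. It is a toy model, not kernel code: ToyFolio, install_callback(),
put_invokes_callback() and convert_to_private() are hypothetical names
invented for illustration, and the string constant only stands in for
however the kernel actually encodes PGTY_guest_memfd. It models the one
rule stated above: mapcount and page_type share storage, so the callback
can only be installed once mapcount is 0.

```python
from dataclasses import dataclass
from typing import Optional

# Symbolic stand-in for the page type; not the kernel's real encoding.
PGTY_GUEST_MEMFD = "PGTY_guest_memfd"


@dataclass
class ToyFolio:
    """Toy folio: mapcount and page_type share one slot, so only one is valid."""
    mapcount: int = 0
    page_type: Optional[str] = None

    def map(self) -> None:
        # Mapping would overwrite the shared slot, so a typed folio cannot be mapped.
        if self.page_type is not None:
            raise RuntimeError("folio has a page_type; it cannot be mapped")
        self.mapcount += 1

    def unmap(self) -> None:
        if self.mapcount == 0:
            raise RuntimeError("folio is not mapped")
        self.mapcount -= 1

    def install_callback(self) -> bool:
        """Set page_type only if no mappings remain; report whether it worked."""
        if self.mapcount != 0:
            return False
        self.page_type = PGTY_GUEST_MEMFD
        return True

    def put_invokes_callback(self) -> bool:
        """Would a final put reach the guest_memfd callback?"""
        return self.page_type == PGTY_GUEST_MEMFD


def convert_to_private(folio: ToyFolio) -> bool:
    """Case (1) above: forcibly unmap everything, then install the callback."""
    while folio.mapcount > 0:
        folio.unmap()
    return folio.install_callback()
```

A folio that is still mapped refuses install_callback(), while
convert_to_private() succeeds because it drops all mappings first. That
is the same argument made for both merge cases above.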

Hope that covers the points that you're referring to. If there are other
parts that you'd like to know the status on, please let me know which
aspects those are!

> Also - as mentioned in those meetings, we at AMD are interested in this
> series along with SEV-SNP support - and I'm also interested in figuring
> out how we collaborate on the evolution of this series.

Thanks for all your help and comments during the guest_memfd upstream
calls, and thanks for the help from AMD.

Extending Fuad's mmap() support with 1G page support introduces more
states, which makes things more complicated (at least for me).

I'm modeling the states in python so I can iterate more quickly. I also
have usage flows (e.g. allocate, guest_use, host_use,
transient_folio_get, close, transient_folio_put) as test cases.

I'm almost done with the model and my next steps are to write up a state
machine (like Fuad's [5]) and share that.

I'd be happy to share the python model too but I have to work through
some internal open-sourcing processes first, so if you think this will
be useful, let me know!
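
To give a flavour of what "usage flows as test cases" can look like,
here is a small Python sketch. It is NOT the model described above (that
one has not been shared yet). The class ToyGmemFolio, its states and
run_flow() are all hypothetical. Only the step names (allocate,
guest_use, host_use, transient_folio_get, close, transient_folio_put)
and the rule that a conversion to private fails while extra references
are held come from this thread. Starting in the private state after
allocate is an assumption of the sketch.

```python
class ToyGmemFolio:
    """Hypothetical single-folio state model driven by named usage steps."""

    def __init__(self):
        self.state = "unallocated"  # unallocated -> private/shared -> freed
        self.transient_refs = 0     # short-lived extra references
        self.closed = False         # has the guest_memfd file been closed?

    def _live(self):
        return self.state in ("private", "shared")

    def _maybe_free(self):
        # The folio is only released once the file is closed and no refs remain.
        if self.closed and self.transient_refs == 0 and self._live():
            self.state = "freed"

    def allocate(self):
        if self.state != "unallocated":
            return False
        self.state = "private"      # assumption: folios start out private
        return True

    def host_use(self):
        if not self._live():
            return False
        self.state = "shared"
        return True

    def guest_use(self):
        # Conversion to private fails while extra references are held.
        if not self._live() or self.transient_refs > 0:
            return False
        self.state = "private"
        return True

    def transient_folio_get(self):
        if not self._live():
            return False
        self.transient_refs += 1
        return True

    def transient_folio_put(self):
        if self.transient_refs == 0:
            return False
        self.transient_refs -= 1
        self._maybe_free()
        return True

    def close(self):
        self.closed = True
        self._maybe_free()
        return True


def run_flow(steps):
    """Apply each named step to a fresh folio; return the results and the folio."""
    folio = ToyGmemFolio()
    return [(step, getattr(folio, step)()) for step in steps], folio
```

Each test case is then just a list of step names plus the expected
outcomes, which keeps iteration on the state machine cheap.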

Then, I'll code it all up in a new revision of this series (target:
March 2025), which will be accompanied by source code on GitHub.

I'm happy to collaborate more closely, let me know if you have ideas for
collaboration!

> Thanks,
>
> 		Amit

[1] https://lore.kernel.org/all/cover.1728684491.git.ackerleytng@google.com/T/
[2] https://lore.kernel.org/all/20250107204002.2683356-1-peterx@redhat.com/T/
[3] https://lore.kernel.org/all/diqzjzayz5ho.fsf@ackerleytng-ctop.c.googlers.com/
[4] https://lore.kernel.org/all/20250117163001.2326672-1-tabba@google.com/T/
[5] https://lpc.events/event/18/contributions/1758/attachments/1457/3699/Guestmemfd%20folio%20state%20page_type.pdf


* Re: [RFC PATCH 00/39] 1G page support for guest_memfd
  2025-02-03  8:35   ` Ackerley Tng
@ 2025-02-06 11:07     ` Amit Shah
  2025-02-07  6:25       ` Ackerley Tng
  0 siblings, 1 reply; 130+ messages in thread
From: Amit Shah @ 2025-02-06 11:07 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest, linux-fsdevel

On Mon, 2025-02-03 at 08:35 +0000, Ackerley Tng wrote:
> Amit Shah <amit@infradead.org> writes:
> 
> > Hey Ackerley,
> 
> Hi Amit,
> 
> > On Tue, 2024-09-10 at 23:43 +0000, Ackerley Tng wrote:
> > > Hello,
> > > 
> > > This patchset is our exploration of how to support 1G pages in
> > > guest_memfd, and
> > > how the pages will be used in Confidential VMs.
> > 
> > We've discussed this patchset at LPC and in the guest-memfd calls. 
> > Can
> > you please summarise the discussions here as a follow-up, so we can
> > also continue discussing on-list, and not repeat things that are
> > already discussed?
> 
> Thanks for this question! Since LPC, Vishal and I have been tied up
> with
> some Google internal work, which slowed down progress on 1G page
> support
> for guest_memfd. We will have progress this quarter and the next few
> quarters on 1G page support for guest_memfd.
> 
> The related updates are
> 
> 1. No objections on using hugetlb as the source of 1G pages.
> 
> 2. Prerequisite hugetlb changes.
> 
> + I've separated some of the prerequisite hugetlb changes into
> another
>   patch series hoping to have them merged ahead of and separately
> from
>   this patchset [1].
> + Peter Xu contributed a better patchset, including a bugfix [2].
> + I have an alternative [3].
> + The next revision of this series (1G page support for guest_memfd)
>   will be based on alternative [3]. I think there should be no issues
>   there.
> + I believe Peter is also waiting on the next revision before we make
>   further progress/decide on [2] or [3].
> 
> 3. No objections for allowing mmap()-ing of guest_memfd physical
> memory
>    when memory is marked shared to avoid double-allocation.
> 
> 4. No objections for splitting pages when marked shared.
> 
> 5. folio_put() callback for guest_memfd folio cleanup/merging.
> 
> + In Fuad's series [4], Fuad used the callback to reset the folio's
>   mappability status.
> + The catch is that the callback is only invoked when folio-
> >page_type
>   == PGTY_guest_memfd, and folio->page_type is a union with folio's
>   mapcount, so any folio with a non-zero mapcount cannot have a valid
>   page_type.
> + I was concerned that we might not get a callback, and hence
>   unintentionally skip merging pages and not correctly restore
> hugetlb
>   pages
> + This was discussed at the last guest_memfd upstream call (2025-01-
> 23
>   07:58 PST), and the conclusion is that using folio->page_type
> works,
>   because
>     + We only merge folios in two cases: (1) when converting to
> private
>       (2) when truncating folios (removing from filemap).
>     + When converting to private, in (1), we can forcibly unmap all
> the
>       converted pages or check if the mapcount is 0, and once
> mapcount
>       is 0 we can install the callback by setting folio->page_type =
>       PGTY_guest_memfd
>     + When truncating, we will be unmapping the folios anyway, so
>       mapcount is also 0 and we can install the callback.
> 
> Hope that covers the points that you're referring to. If there are
> other
> parts that you'd like to know the status on, please let me know which
> aspects those are!

Thank you for the nice summary!

> > Also - as mentioned in those meetings, we at AMD are interested in
> > this
> > series along with SEV-SNP support - and I'm also interested in
> > figuring
> > out how we collaborate on the evolution of this series.
> 
> Thanks all your help and comments during the guest_memfd upstream
> calls,
> and thanks for the help from AMD.
> 
> Extending mmap() support from Fuad with 1G page support introduces
> more
> states that made it more complicated (at least for me).
> 
> I'm modeling the states in python so I can iterate more quickly. I
> also
> have usage flows (e.g. allocate, guest_use, host_use,
> transient_folio_get, close, transient_folio_put) as test cases.
> 
> I'm almost done with the model and my next steps are to write up a
> state
> machine (like Fuad's [5]) and share that.
> 
> I'd be happy to share the python model too but I have to work through
> some internal open-sourcing processes first, so if you think this
> will
> be useful, let me know!

No problem.  Yes, I'm interested in this - it'll be helpful!

The other thing of note is that while we have the kernel patches, a
userspace to drive them and exercise them is currently missing.

> Then, I'll code it all up in a new revision of this series (target:
> March 2025), which will be accompanied by source code on GitHub.
> 
> I'm happy to collaborate more closely, let me know if you have ideas
> for
> collaboration!

Thank you.  I think the bigger problem we currently have is allocation
of hugepages, which is also blocking a lot of the follow-on work.
Vishal briefly mentioned isolating pages from Linux entirely last time.
I'm interested in that too, to figure out whether we can bypass the
allocation problem completely by not allocating struct pages at all for
pages that are not used by the host.  The
guest_memfs/KHO/kexec/live-update patches also take this approach on AWS
(for how their VMs are launched).  If we work with those patches
together, allocation of 1G hugepages is simplified.  I'd like to discuss
these themes further to see whether this approach helps as well.
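
For a sense of scale, the struct page metadata that describes one 1G
page can be estimated with a few lines of Python. This is a rough sketch
under stated assumptions: 4 KiB base pages and a 64-byte struct page
(common on x86_64, but configuration dependent). The helper name
struct_page_overhead() is made up for this example.

```python
BASE_PAGE_BYTES = 4 * 1024   # assumed base page size (4 KiB)
HUGE_PAGE_BYTES = 1 << 30    # one 1 GiB huge page
STRUCT_PAGE_BYTES = 64       # assumed sizeof(struct page); config dependent


def struct_page_overhead(region_bytes,
                         base_page=BASE_PAGE_BYTES,
                         struct_page=STRUCT_PAGE_BYTES):
    """Bytes of struct page metadata needed to describe region_bytes."""
    return (region_bytes // base_page) * struct_page


per_huge_page = struct_page_overhead(HUGE_PAGE_BYTES)
print(per_huge_page // (1024 * 1024), "MiB of struct pages per 1 GiB page")
print(per_huge_page // BASE_PAGE_BYTES, "base pages of metadata per 1 GiB page")
```

Under those assumptions every 1 GiB of guest memory carries 16 MiB of
struct pages. That metadata is what HVO tries to shrink and what a
no-struct-page approach would avoid.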


		Amit


* Re: [RFC PATCH 00/39] 1G page support for guest_memfd
  2025-02-06 11:07     ` Amit Shah
@ 2025-02-07  6:25       ` Ackerley Tng
  0 siblings, 0 replies; 130+ messages in thread
From: Ackerley Tng @ 2025-02-07  6:25 UTC (permalink / raw)
  To: Amit Shah
  Cc: tabba, quic_eberman, roypat, jgg, peterx, david, rientjes, fvdl,
	jthoughton, seanjc, pbonzini, zhiquan1.li, fan.du, jun.miao,
	isaku.yamahata, muchun.song, mike.kravetz, erdemaktas, vannapurve,
	qperret, jhubbard, willy, shuah, brauner, bfoster,
	kent.overstreet, pvorel, rppt, richard.weiyang, anup, haibo1.xu,
	ajones, vkuznets, maciej.wieczor-retman, pgonda, oliver.upton,
	linux-kernel, linux-mm, kvm, linux-kselftest, linux-fsdevel

Amit Shah <amit@infradead.org> writes:

>> <snip>
>> 
>> Thanks all your help and comments during the guest_memfd upstream
>> calls,
>> and thanks for the help from AMD.
>> 
>> Extending mmap() support from Fuad with 1G page support introduces
>> more
>> states that made it more complicated (at least for me).
>> 
>> I'm modeling the states in python so I can iterate more quickly. I
>> also
>> have usage flows (e.g. allocate, guest_use, host_use,
>> transient_folio_get, close, transient_folio_put) as test cases.
>> 
>> I'm almost done with the model and my next steps are to write up a
>> state
>> machine (like Fuad's [5]) and share that.

Thanks everyone for all the comments at the 2025-02-06 guest_memfd
upstream call! Here are the links:

+ Slides: https://lpc.events/event/18/contributions/1764/attachments/1409/3704/guest-memfd-1g-page-support-2025-02-06.pdf
+ State diagram: https://lpc.events/event/18/contributions/1764/attachments/1409/3702/guest-memfd-state-diagram-split-merge-2025-02-06.drawio.svg
+ For those interested in editing the state diagram using draw.io:
  https://lpc.events/event/18/contributions/1764/attachments/1409/3703/guest-memfd-state-diagram-split-merge-2025-02-06.drawio.xml

>> 
>> I'd be happy to share the python model too but I have to work through
>> some internal open-sourcing processes first, so if you think this
>> will
>> be useful, let me know!
>
> No problem.  Yes, I'm interested in this - it'll be helpful!

I've started working through the internal processes and will update here
when I'm done!

>
> The other thing of note is that while we have the kernel patches, a
> userspace to drive them and exercise them is currently missing.

In this and future patch series, I'll have selftests that will exercise
any new functionality.

>
>> Then, I'll code it all up in a new revision of this series (target:
>> March 2025), which will be accompanied by source code on GitHub.
>> 
>> I'm happy to collaborate more closely, let me know if you have ideas
>> for
>> collaboration!
>
> Thank you.  I think currently the bigger problem we have is allocation
> of hugepages -- which is also blocking a lot of the follow-on work. 
> Vishal briefly mentioned isolating pages from Linux entirely last time
> - that's also what I'm interested in to figure out if we can completely
> bypass the allocation problem by not allocating struct pages for non-
> host use pages entirely.  The guest_memfs/KHO/kexec/live-update patches
> also take this approach on AWS (for how their VMs are launched).  If we
> work with those patches together, allocation of 1G hugepages is
> simplified.  I'd like to discuss more on these themes to see if this is
> an approach that helps as well.
>
>
> 		Amit

Vishal is still very interested in this and will probably be looking
into it while I push ahead assuming that KVM continues to use struct
pages. This was also brought up at the guest_memfd upstream call on
2025-02-06; people were interested and think that it will simplify
refcounting for merging and splitting.

I'll push ahead assuming that we use hugetlb as the source of 1G pages,
and assuming that KVM continues to use struct pages to describe guest
private memory.

The series will still be useful as an interim solution/prototype even if
other allocators are preferred and get merged. :)

Thread overview: 130+ messages
2024-09-10 23:43 [RFC PATCH 00/39] 1G page support for guest_memfd Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 01/39] mm: hugetlb: Simplify logic in dequeue_hugetlb_folio_vma() Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 02/39] mm: hugetlb: Refactor vma_has_reserves() to should_use_hstate_resv() Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 03/39] mm: hugetlb: Remove unnecessary check for avoid_reserve Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 04/39] mm: mempolicy: Refactor out policy_node_nodemask() Ackerley Tng
2024-09-11 16:46   ` Gregory Price
2024-09-10 23:43 ` [RFC PATCH 05/39] mm: hugetlb: Refactor alloc_buddy_hugetlb_folio_with_mpol() to interpret mempolicy instead of vma Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 06/39] mm: hugetlb: Refactor dequeue_hugetlb_folio_vma() to use mpol Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 07/39] mm: hugetlb: Refactor out hugetlb_alloc_folio Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 08/39] mm: truncate: Expose preparation steps for truncate_inode_pages_final Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 09/39] mm: hugetlb: Expose hugetlb_subpool_{get,put}_pages() Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 10/39] mm: hugetlb: Add option to create new subpool without using surplus Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 11/39] mm: hugetlb: Expose hugetlb_acct_memory() Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 12/39] mm: hugetlb: Move and expose hugetlb_zero_partial_page() Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 13/39] KVM: guest_memfd: Make guest mem use guest mem inodes instead of anonymous inodes Ackerley Tng
2025-04-02  4:01   ` Yan Zhao
2025-04-23 20:22     ` Ackerley Tng
2025-04-24  3:53       ` Yan Zhao
2024-09-10 23:43 ` [RFC PATCH 14/39] KVM: guest_memfd: hugetlb: initialization and cleanup Ackerley Tng
2024-09-20  9:17   ` Vishal Annapurve
2024-10-01 23:00     ` Ackerley Tng
2024-12-01 17:59   ` Peter Xu
2025-02-13  9:47     ` Ackerley Tng
2025-02-26 18:55       ` Ackerley Tng
2025-03-06 17:33   ` Peter Xu
2024-09-10 23:43 ` [RFC PATCH 15/39] KVM: guest_memfd: hugetlb: allocate and truncate from hugetlb Ackerley Tng
2024-09-13 22:26   ` Elliot Berman
2024-10-03 20:23     ` Ackerley Tng
2024-10-30  9:01   ` Jun Miao
2025-02-11  1:21     ` Ackerley Tng
2024-12-01 17:55   ` Peter Xu
2025-02-13  7:52     ` Ackerley Tng
2025-02-13 16:48       ` Peter Xu
2024-09-10 23:43 ` [RFC PATCH 16/39] KVM: guest_memfd: Add page alignment check for hugetlb guest_memfd Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 17/39] KVM: selftests: Add basic selftests for hugetlb-backed guest_memfd Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 18/39] KVM: selftests: Support various types of backing sources for private memory Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 19/39] KVM: selftests: Update test for various private memory backing source types Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 20/39] KVM: selftests: Add private_mem_conversions_test.sh Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 21/39] KVM: selftests: Test that guest_memfd usage is reported via hugetlb Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 22/39] mm: hugetlb: Expose vmemmap optimization functions Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 23/39] mm: hugetlb: Expose HugeTLB functions for promoting/demoting pages Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 24/39] mm: hugetlb: Add functions to add/move/remove from hugetlb lists Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 25/39] KVM: guest_memfd: Split HugeTLB pages for guest_memfd use Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 26/39] KVM: guest_memfd: Track faultability within a struct kvm_gmem_private Ackerley Tng
2024-10-10 16:06   ` Peter Xu
2024-10-11 23:32     ` Ackerley Tng
2024-10-15 21:34       ` Peter Xu
2024-10-15 23:42         ` Ackerley Tng
2024-10-16  8:45           ` David Hildenbrand
2024-10-16 20:16             ` Peter Xu
2024-10-16 22:51               ` Jason Gunthorpe
2024-10-16 23:49                 ` Peter Xu
2024-10-16 23:54                   ` Jason Gunthorpe
2024-10-17 14:58                     ` Peter Xu
2024-10-17 16:47                       ` Jason Gunthorpe
2024-10-17 17:05                         ` Peter Xu
2024-10-17 17:10                           ` Jason Gunthorpe
2024-10-17 19:11                             ` Peter Xu
2024-10-17 19:18                               ` Jason Gunthorpe
2024-10-17 19:29                                 ` David Hildenbrand
2024-10-18  7:15                                 ` Patrick Roy
2024-10-18  7:50                                   ` David Hildenbrand
2024-10-18  9:34                                     ` Patrick Roy
2024-10-17 17:11                         ` David Hildenbrand
2024-10-17 17:16                           ` Jason Gunthorpe
2024-10-17 17:55                             ` David Hildenbrand
2024-10-17 18:26                             ` Vishal Annapurve
2024-10-17 14:56                   ` David Hildenbrand
2024-10-17 15:02               ` David Hildenbrand
2024-10-16  8:50           ` David Hildenbrand
2024-10-16 10:48             ` Vishal Annapurve
2024-10-16 11:54               ` David Hildenbrand
2024-10-16 11:57                 ` Jason Gunthorpe
2025-02-25 20:37   ` Peter Xu
2025-04-23 22:07     ` Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 27/39] KVM: guest_memfd: Allow mmapping guest_memfd files Ackerley Tng
2025-01-20 22:42   ` Peter Xu
2025-04-23 20:25     ` Ackerley Tng
2025-03-04 23:24   ` Peter Xu
2025-04-02  4:07   ` Yan Zhao
2025-04-23 20:28     ` Ackerley Tng
2024-09-10 23:43 ` [RFC PATCH 28/39] KVM: guest_memfd: Use vm_type to determine default faultability Ackerley Tng
2024-09-10 23:44 ` [RFC PATCH 29/39] KVM: Handle conversions in the SET_MEMORY_ATTRIBUTES ioctl Ackerley Tng
2024-09-10 23:44 ` [RFC PATCH 30/39] KVM: guest_memfd: Handle folio preparation for guest_memfd mmap Ackerley Tng
2024-09-16 20:00   ` Elliot Berman
2024-10-03 21:32     ` Ackerley Tng
2024-10-03 23:43       ` Ackerley Tng
2024-10-08 19:30         ` Sean Christopherson
2024-10-07 15:56       ` Patrick Roy
2024-10-08 18:07         ` Ackerley Tng
2024-10-08 19:56           ` Sean Christopherson
2024-10-09  3:51             ` Manwaring, Derek
2024-10-09 13:52               ` Andrew Cooper
2024-10-10 16:21             ` Patrick Roy
2024-10-10 19:27               ` Manwaring, Derek
2024-10-17 23:16               ` Ackerley Tng
2024-10-18  7:10                 ` Patrick Roy
2024-09-10 23:44 ` [RFC PATCH 31/39] KVM: selftests: Allow vm_set_memory_attributes to be used without asserting return value of 0 Ackerley Tng
2024-09-10 23:44 ` [RFC PATCH 32/39] KVM: selftests: Test using guest_memfd memory from userspace Ackerley Tng
2024-09-10 23:44 ` [RFC PATCH 33/39] KVM: selftests: Test guest_memfd memory sharing between guest and host Ackerley Tng
2024-09-10 23:44 ` [RFC PATCH 34/39] KVM: selftests: Add notes in private_mem_kvm_exits_test for mmap-able guest_memfd Ackerley Tng
2024-09-10 23:44 ` [RFC PATCH 35/39] KVM: selftests: Test that pinned pages block KVM from setting memory attributes to PRIVATE Ackerley Tng
2024-09-10 23:44 ` [RFC PATCH 36/39] KVM: selftests: Refactor vm_mem_add to be more flexible Ackerley Tng
2024-09-10 23:44 ` [RFC PATCH 37/39] KVM: selftests: Add helper to perform madvise by memslots Ackerley Tng
2024-09-10 23:44 ` [RFC PATCH 38/39] KVM: selftests: Update private_mem_conversions_test for mmap()able guest_memfd Ackerley Tng
2024-09-10 23:44 ` [RFC PATCH 39/39] KVM: guest_memfd: Dynamically split/reconstruct HugeTLB page Ackerley Tng
2025-04-03 12:33   ` Yan Zhao
2025-04-23 22:02     ` Ackerley Tng
2025-04-24  1:09       ` Yan Zhao
2025-04-24  4:25         ` Yan Zhao
2025-04-24  5:55           ` Chenyi Qiang
2025-04-24  8:13             ` Yan Zhao
2025-04-24 14:10               ` Vishal Annapurve
2025-04-24 18:15                 ` Ackerley Tng
2025-04-25  4:02                   ` Yan Zhao
2025-04-25 22:45                     ` Ackerley Tng
2025-04-28  1:05                       ` Yan Zhao
2025-04-28 19:02                         ` Vishal Annapurve
2025-04-30 20:09                         ` Ackerley Tng
2025-05-06  1:23                           ` Yan Zhao
2025-05-06 19:22                             ` Ackerley Tng
2025-05-07  3:15                               ` Yan Zhao
2025-05-13 17:33                                 ` Ackerley Tng
2024-09-11  6:56 ` [RFC PATCH 00/39] 1G page support for guest_memfd Michal Hocko
2024-09-14  1:08 ` Du, Fan
2024-09-14 13:34   ` Vishal Annapurve
2025-01-28  9:42 ` Amit Shah
2025-02-03  8:35   ` Ackerley Tng
2025-02-06 11:07     ` Amit Shah
2025-02-07  6:25       ` Ackerley Tng
