* [PATCH v2 0/7] mm/hugetlb: Refactor hugetlb allocation resv accounting
@ 2025-01-07 20:39 Peter Xu
2025-01-07 20:39 ` [PATCH v2 1/7] mm/hugetlb: Fix avoid_reserve to allow taking folio from subpool Peter Xu
` (6 more replies)
0 siblings, 7 replies; 16+ messages in thread
From: Peter Xu @ 2025-01-07 20:39 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Breno Leitao, Rik van Riel, Muchun Song, Naoya Horiguchi,
Roman Gushchin, Ackerley Tng, Andrew Morton, peterx,
Oscar Salvador
[based on akpm/mm-unstable, latest ca95745c20ad, of Jan 7th 2025]
v2:
- Rebase to latest mm-unstable
- Added R-b and T-b for Ackerley on patch 1
(Ackerley: I was conservative with the tags here and only attached them to
patch 1, even though it seems you mentioned you agree with / tested the
whole series. Please feel free to provide your tags again on the cover
letter if you want, thanks)
This is a follow-up to, and a replacement of, Ackerley's series here:
https://lore.kernel.org/r/cover.1728684491.git.ackerleytng@google.com
The goal of this series is to clean up hugetlb resv accounting, especially
during folio allocation, and to decouple a few things:
- Hugetlb folios vs. hugetlbfs: IOW, the hope is that in the future hugetlb
  folios can be allocated completely without hugetlbfs.
- VMAs vs. hugetlb folio allocations: allocating a hugetlb folio should not
  always require a hugetlbfs VMA. For example, it may be allocated at the
  inode level (see hugetlbfs_fallocate(), which uses a pseudo VMA for the
  allocation), or it may be allocated by other kernel subsystems.
This paves the way for other users to allocate hugetlb folios out of either
the system reservations or subpools (instead of going through hugetlbfs as a
file system). Longer term, it prepares hugetlb as a concept separate from
hugetlbfs, so that hugetlb folios can be allocated not only by hugetlbfs but
also by other subsystems.
Tests I've done:
- Patch 1 contains a reproducer for the bug I found; the leak it
  demonstrates goes away once patch 1 (or the whole set) is applied.
- Hugetlb regression tests (on x86_64, 2MB pages), including:
  - All vmtests on hugetlbfs
  - libhugetlbfs test suite (it may fail some tests, but this series
    introduces no new failures; all such failures exist before this series,
    so they shouldn't be relevant)
Comments welcome, thanks.
Peter Xu (7):
mm/hugetlb: Fix avoid_reserve to allow taking folio from subpool
mm/hugetlb: Stop using avoid_reserve flag in fork()
mm/hugetlb: Rename avoid_reserve to cow_from_owner
mm/hugetlb: Clean up map/global resv accounting when allocate
mm/hugetlb: Simplify vma_has_reserves()
mm/hugetlb: Drop vma_has_reserves()
mm/hugetlb: Unify restore reserve accounting for new allocations
fs/hugetlbfs/inode.c | 2 +-
include/linux/hugetlb.h | 4 +-
mm/hugetlb.c | 237 ++++++++++++++++++----------------------
3 files changed, 107 insertions(+), 136 deletions(-)
--
2.47.0
* [PATCH v2 1/7] mm/hugetlb: Fix avoid_reserve to allow taking folio from subpool
2025-01-07 20:39 [PATCH v2 0/7] mm/hugetlb: Refactor hugetlb allocation resv accounting Peter Xu
@ 2025-01-07 20:39 ` Peter Xu
2025-01-13 11:02 ` Oscar Salvador
2025-01-07 20:39 ` [PATCH v2 2/7] mm/hugetlb: Stop using avoid_reserve flag in fork() Peter Xu
` (5 subsequent siblings)
6 siblings, 1 reply; 16+ messages in thread
From: Peter Xu @ 2025-01-07 20:39 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Breno Leitao, Rik van Riel, Muchun Song, Naoya Horiguchi,
Roman Gushchin, Ackerley Tng, Andrew Morton, peterx,
Oscar Salvador, linux-stable
Since commit 04f2cbe35699 ("hugetlb: guarantee that COW faults for a
process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed"),
avoid_reserve was introduced for a special case of CoW on hugetlb private
mappings, and only if the owner VMA is trying to allocate yet another
hugetlb folio that is not reserved within the private vma reserved map.
Later on, in commit d85f69b0b533 ("mm/hugetlb: alloc_huge_page handle areas
hole punched by fallocate"), alloc_huge_page() was changed to not consume any
global reservation as long as avoid_reserve=true. This doesn't look correct:
even though it forces the allocation to not use the global reservation at
all, it will still take one reservation from the spool (if a subpool exists).
Then, since the spool's reserved pages are taken from the global reservation,
it will also take one reservation globally.
Logically, this can cause the global reservation count to go wrong.
I wrote a reproducer (below) that triggers this special path; every run of
the program causes the global reservation count (visible as HugePages_Rsvd
in /proc/meminfo) to increment by one, until it hits the number of free
pages:
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <stdio.h>
#include <fcntl.h>
#include <errno.h>
#include <unistd.h>
#include <stdlib.h>
#include <sys/mman.h>

#define MSIZE  (2UL << 20)

int main(int argc, char *argv[])
{
        const char *path;
        int *buf;
        int fd, ret;
        pid_t child;

        if (argc < 2) {
                printf("usage: %s <hugetlb_file>\n", argv[0]);
                return -1;
        }

        path = argv[1];

        fd = open(path, O_RDWR | O_CREAT, 0666);
        if (fd < 0) {
                perror("open failed");
                return -1;
        }

        ret = fallocate(fd, 0, 0, MSIZE);
        if (ret != 0) {
                perror("fallocate");
                return -1;
        }

        buf = mmap(NULL, MSIZE, PROT_READ|PROT_WRITE,
                   MAP_PRIVATE, fd, 0);
        if (buf == MAP_FAILED) {
                perror("mmap() failed");
                return -1;
        }

        /* Allocate a page */
        *buf = 1;

        child = fork();
        if (child == 0) {
                /* child doesn't need to do anything */
                exit(0);
        }

        /* Trigger CoW from owner */
        *buf = 2;

        munmap(buf, MSIZE);
        close(fd);
        unlink(path);

        return 0;
}
It only reproduces with a sub-mount that has reserved pages on the spool,
e.g.:
# sysctl vm.nr_hugepages=128
# mkdir ./hugetlb-pool
# mount -t hugetlbfs -o min_size=8M,pagesize=2M none ./hugetlb-pool
Then run the reproducer on the mountpoint:
# ./reproducer ./hugetlb-pool/test
Fix it by taking the reservation from the spool if available. In general,
avoid_reserve is IMHO more about "avoid the vma resv map", not the spool's
reservation.
I copied stable; however, I have no intention of backporting if it's not a
clean cherry-pick, because a private hugetlb mapping with a fork() on top is
too rare to hit.
Cc: linux-stable <stable@vger.kernel.org>
Fixes: d85f69b0b533 ("mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate")
Reviewed-by: Ackerley Tng <ackerleytng@google.com>
Tested-by: Ackerley Tng <ackerleytng@google.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
mm/hugetlb.c | 22 +++-------------------
1 file changed, 3 insertions(+), 19 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 354eec6f7e84..2bf971f77553 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1394,8 +1394,7 @@ static unsigned long available_huge_pages(struct hstate *h)
static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
struct vm_area_struct *vma,
- unsigned long address, int avoid_reserve,
- long chg)
+ unsigned long address, long chg)
{
struct folio *folio = NULL;
struct mempolicy *mpol;
@@ -1411,10 +1410,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
if (!vma_has_reserves(vma, chg) && !available_huge_pages(h))
goto err;
- /* If reserves cannot be used, ensure enough pages are in the pool */
- if (avoid_reserve && !available_huge_pages(h))
- goto err;
-
gfp_mask = htlb_alloc_mask(h);
nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
@@ -1430,7 +1425,7 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask,
nid, nodemask);
- if (folio && !avoid_reserve && vma_has_reserves(vma, chg)) {
+ if (folio && vma_has_reserves(vma, chg)) {
folio_set_hugetlb_restore_reserve(folio);
h->resv_huge_pages--;
}
@@ -3047,17 +3042,6 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
gbl_chg = hugepage_subpool_get_pages(spool, 1);
if (gbl_chg < 0)
goto out_end_reservation;
-
- /*
- * Even though there was no reservation in the region/reserve
- * map, there could be reservations associated with the
- * subpool that can be used. This would be indicated if the
- * return value of hugepage_subpool_get_pages() is zero.
- * However, if avoid_reserve is specified we still avoid even
- * the subpool reservations.
- */
- if (avoid_reserve)
- gbl_chg = 1;
}
/* If this allocation is not consuming a reservation, charge it now.
@@ -3080,7 +3064,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
* from the global free pool (global change). gbl_chg == 0 indicates
* a reservation exists for the allocation.
*/
- folio = dequeue_hugetlb_folio_vma(h, vma, addr, avoid_reserve, gbl_chg);
+ folio = dequeue_hugetlb_folio_vma(h, vma, addr, gbl_chg);
if (!folio) {
spin_unlock_irq(&hugetlb_lock);
folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, addr);
--
2.47.0
* [PATCH v2 2/7] mm/hugetlb: Stop using avoid_reserve flag in fork()
2025-01-07 20:39 [PATCH v2 0/7] mm/hugetlb: Refactor hugetlb allocation resv accounting Peter Xu
2025-01-07 20:39 ` [PATCH v2 1/7] mm/hugetlb: Fix avoid_reserve to allow taking folio from subpool Peter Xu
@ 2025-01-07 20:39 ` Peter Xu
2025-01-13 11:05 ` Oscar Salvador
2025-01-07 20:39 ` [PATCH v2 3/7] mm/hugetlb: Rename avoid_reserve to cow_from_owner Peter Xu
` (4 subsequent siblings)
6 siblings, 1 reply; 16+ messages in thread
From: Peter Xu @ 2025-01-07 20:39 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Breno Leitao, Rik van Riel, Muchun Song, Naoya Horiguchi,
Roman Gushchin, Ackerley Tng, Andrew Morton, peterx,
Oscar Salvador
When fork() stumbles upon a dma-pinned hugetlb private page, CoW must happen
during fork() to guarantee dma coherency.
In this specific path, hugetlb pages need to be allocated for the child
process. Stop using the avoid_reserve=1 flag here: it's not required, as
dest_vma (which is destined to be a MAP_PRIVATE hugetlb vma) will have no
private vma resv map, which already guarantees it won't be able to use a
vma reservation later.
No functional change intended. That said, it's still worth doing, so as to
reduce the usage of avoid_reserve to its only remaining user, which is also
the reason this flag was introduced initially in commit 04f2cbe35699
("hugetlb: guarantee that COW faults for a process that called
mmap(MAP_PRIVATE) on hugetlbfs will succeed"). I don't see any other place
that should set it.
A further patch will clean up resv accounting based on this.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
mm/hugetlb.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2bf971f77553..7be8c35d2a83 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5369,7 +5369,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
spin_unlock(src_ptl);
spin_unlock(dst_ptl);
/* Do not use reserve as it's private owned */
- new_folio = alloc_hugetlb_folio(dst_vma, addr, 1);
+ new_folio = alloc_hugetlb_folio(dst_vma, addr, 0);
if (IS_ERR(new_folio)) {
folio_put(pte_folio);
ret = PTR_ERR(new_folio);
--
2.47.0
* [PATCH v2 3/7] mm/hugetlb: Rename avoid_reserve to cow_from_owner
2025-01-07 20:39 [PATCH v2 0/7] mm/hugetlb: Refactor hugetlb allocation resv accounting Peter Xu
2025-01-07 20:39 ` [PATCH v2 1/7] mm/hugetlb: Fix avoid_reserve to allow taking folio from subpool Peter Xu
2025-01-07 20:39 ` [PATCH v2 2/7] mm/hugetlb: Stop using avoid_reserve flag in fork() Peter Xu
@ 2025-01-07 20:39 ` Peter Xu
2025-01-13 11:20 ` Oscar Salvador
2025-01-07 20:39 ` [PATCH v2 4/7] mm/hugetlb: Clean up map/global resv accounting when allocate Peter Xu
` (3 subsequent siblings)
6 siblings, 1 reply; 16+ messages in thread
From: Peter Xu @ 2025-01-07 20:39 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Breno Leitao, Rik van Riel, Muchun Song, Naoya Horiguchi,
Roman Gushchin, Ackerley Tng, Andrew Morton, peterx,
Oscar Salvador
The old name "avoid_reserve" is too generic and can easily be used wrongly
by new call sites that want to allocate a hugetlb folio.
It's confusing in two ways: (1) whether one can opt in to avoid the global
reservation, and (2) whether it should take more than one count.
In reality, this flag is only used in one extremely hacky path, in an
extremely hacky way, in the hugetlb CoW path only, and always with the value
1, meaning "skip global reservation". Rename the flag to avoid future abuse,
and make it a boolean to reflect the fact that it's not a counter. To make
it even harder to abuse, add a comment above the function to explain it.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
fs/hugetlbfs/inode.c | 2 +-
include/linux/hugetlb.h | 4 ++--
mm/hugetlb.c | 33 ++++++++++++++++++++-------------
3 files changed, 23 insertions(+), 16 deletions(-)
diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 62fb0cbc93ab..0fc179a59830 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -814,7 +814,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
* folios in these areas, we need to consume the reserves
* to keep reservation accounting consistent.
*/
- folio = alloc_hugetlb_folio(&pseudo_vma, addr, 0);
+ folio = alloc_hugetlb_folio(&pseudo_vma, addr, false);
if (IS_ERR(folio)) {
mutex_unlock(&hugetlb_fault_mutex_table[hash]);
error = PTR_ERR(folio);
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 10faf42ca96a..49ec2362ce92 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -683,7 +683,7 @@ struct huge_bootmem_page {
int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list);
int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn);
struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
- unsigned long addr, int avoid_reserve);
+ unsigned long addr, bool cow_from_owner);
struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
nodemask_t *nmask, gfp_t gfp_mask,
bool allow_alloc_fallback);
@@ -1068,7 +1068,7 @@ static inline int replace_free_hugepage_folios(unsigned long start_pfn,
static inline struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
unsigned long addr,
- int avoid_reserve)
+ bool cow_from_owner)
{
return NULL;
}
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7be8c35d2a83..cdbc8914a9f7 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3008,8 +3008,15 @@ int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn)
return ret;
}
+/*
+ * NOTE! "cow_from_owner" represents a very hacky usage only used in CoW
+ * faults of hugetlb private mappings on top of a non-page-cache folio (in
+ * which case even if there's a private vma resv map it won't cover such
+ * allocation). New call sites should (probably) never set it to true!!
+ * When it's set, the allocation will bypass all vma level reservations.
+ */
struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
- unsigned long addr, int avoid_reserve)
+ unsigned long addr, bool cow_from_owner)
{
struct hugepage_subpool *spool = subpool_vma(vma);
struct hstate *h = hstate_vma(vma);
@@ -3038,7 +3045,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
* Allocations for MAP_NORESERVE mappings also need to be
* checked against any subpool limit.
*/
- if (map_chg || avoid_reserve) {
+ if (map_chg || cow_from_owner) {
gbl_chg = hugepage_subpool_get_pages(spool, 1);
if (gbl_chg < 0)
goto out_end_reservation;
@@ -3046,7 +3053,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
/* If this allocation is not consuming a reservation, charge it now.
*/
- deferred_reserve = map_chg || avoid_reserve;
+ deferred_reserve = map_chg || cow_from_owner;
if (deferred_reserve) {
ret = hugetlb_cgroup_charge_cgroup_rsvd(
idx, pages_per_huge_page(h), &h_cg);
@@ -3071,7 +3078,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
if (!folio)
goto out_uncharge_cgroup;
spin_lock_irq(&hugetlb_lock);
- if (!avoid_reserve && vma_has_reserves(vma, gbl_chg)) {
+ if (!cow_from_owner && vma_has_reserves(vma, gbl_chg)) {
folio_set_hugetlb_restore_reserve(folio);
h->resv_huge_pages--;
}
@@ -3138,7 +3145,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
hugetlb_cgroup_uncharge_cgroup_rsvd(idx, pages_per_huge_page(h),
h_cg);
out_subpool_put:
- if (map_chg || avoid_reserve)
+ if (map_chg || cow_from_owner)
hugepage_subpool_put_pages(spool, 1);
out_end_reservation:
vma_end_reservation(h, vma, addr);
@@ -5369,7 +5376,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
spin_unlock(src_ptl);
spin_unlock(dst_ptl);
/* Do not use reserve as it's private owned */
- new_folio = alloc_hugetlb_folio(dst_vma, addr, 0);
+ new_folio = alloc_hugetlb_folio(dst_vma, addr, false);
if (IS_ERR(new_folio)) {
folio_put(pte_folio);
ret = PTR_ERR(new_folio);
@@ -5823,7 +5830,7 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
struct hstate *h = hstate_vma(vma);
struct folio *old_folio;
struct folio *new_folio;
- int outside_reserve = 0;
+ bool cow_from_owner = false;
vm_fault_t ret = 0;
struct mmu_notifier_range range;
@@ -5886,7 +5893,7 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
*/
if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
old_folio != pagecache_folio)
- outside_reserve = 1;
+ cow_from_owner = true;
folio_get(old_folio);
@@ -5895,7 +5902,7 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
* be acquired again before returning to the caller, as expected.
*/
spin_unlock(vmf->ptl);
- new_folio = alloc_hugetlb_folio(vma, vmf->address, outside_reserve);
+ new_folio = alloc_hugetlb_folio(vma, vmf->address, cow_from_owner);
if (IS_ERR(new_folio)) {
/*
@@ -5905,7 +5912,7 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
* reliability, unmap the page from child processes. The child
* may get SIGKILLed if it later faults.
*/
- if (outside_reserve) {
+ if (cow_from_owner) {
struct address_space *mapping = vma->vm_file->f_mapping;
pgoff_t idx;
u32 hash;
@@ -6156,7 +6163,7 @@ static vm_fault_t hugetlb_no_page(struct address_space *mapping,
goto out;
}
- folio = alloc_hugetlb_folio(vma, vmf->address, 0);
+ folio = alloc_hugetlb_folio(vma, vmf->address, false);
if (IS_ERR(folio)) {
/*
* Returning error will result in faulting task being
@@ -6622,7 +6629,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
goto out;
}
- folio = alloc_hugetlb_folio(dst_vma, dst_addr, 0);
+ folio = alloc_hugetlb_folio(dst_vma, dst_addr, false);
if (IS_ERR(folio)) {
ret = -ENOMEM;
goto out;
@@ -6664,7 +6671,7 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
goto out;
}
- folio = alloc_hugetlb_folio(dst_vma, dst_addr, 0);
+ folio = alloc_hugetlb_folio(dst_vma, dst_addr, false);
if (IS_ERR(folio)) {
folio_put(*foliop);
ret = -ENOMEM;
--
2.47.0
* [PATCH v2 4/7] mm/hugetlb: Clean up map/global resv accounting when allocate
2025-01-07 20:39 [PATCH v2 0/7] mm/hugetlb: Refactor hugetlb allocation resv accounting Peter Xu
` (2 preceding siblings ...)
2025-01-07 20:39 ` [PATCH v2 3/7] mm/hugetlb: Rename avoid_reserve to cow_from_owner Peter Xu
@ 2025-01-07 20:39 ` Peter Xu
2025-01-13 22:57 ` Ackerley Tng
2025-01-07 20:40 ` [PATCH v2 5/7] mm/hugetlb: Simplify vma_has_reserves() Peter Xu
` (2 subsequent siblings)
6 siblings, 1 reply; 16+ messages in thread
From: Peter Xu @ 2025-01-07 20:39 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Breno Leitao, Rik van Riel, Muchun Song, Naoya Horiguchi,
Roman Gushchin, Ackerley Tng, Andrew Morton, peterx,
Oscar Salvador
alloc_hugetlb_folio() isn't an easy function to read, especially regarding
the reservation accounting, both per-VMA and global (which is mostly about
the spool).
The 1st complexity lies in the special private CoW path, aka the
cow_from_owner=true case.
The 2nd complexity is the confusing updates of gbl_chg after it is first
set, which make it look like the value can change at any time on the fly.
Logically, cow_from_owner is only about the vma reservation. We can decouple
the flag and consolidate it into the map charge flag very early, so that we
don't need to keep checking the CoW special flag every time.
This patch does that by making map_chg a tri-state flag. Needing a tri-state
is unfortunate; it's because vma_needs_reservation() currently has an
internal side effect, so it must be followed by either an end() or a
commit().
We keep one semantic the same as before: "if (map_chg)" means we need a
separate per-vma resv count. That keeps most of the old code untouched with
the new enum.
After this patch, we take these steps to decide these variables, hopefully
making them slightly easier to follow:
- First, decide map_chg. This takes cow_from_owner into account, once and
  for all. It's about whether we could take a resv count from the vma, no
  matter whether it's shared, private, etc.
- Then, decide gbl_chg. The only difference from map_chg is that it also
  considers the spool.
Now each flag is decided once and for all, instead of being flipped around,
which can be very hard to follow.
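In pseudo-C, the decision order becomes roughly the following (simplified
from the hunks below; error handling omitted):

	if (cow_from_owner)
		/* Never consume the vma resv map for this hacky CoW case */
		map_chg = MAP_CHG_ENFORCED;
	else
		/* 0 == resv exists in the vma resv map and can be reused */
		map_chg = vma_needs_reservation(h, vma, addr) ?
			  MAP_CHG_NEEDED : MAP_CHG_REUSE;

	if (map_chg)
		/* NEEDED or ENFORCED: ask the subpool (and maybe global pool) */
		gbl_chg = hugepage_subpool_get_pages(spool, 1);
	else
		/* A vma reservation implies a subpool/global reservation */
		gbl_chg = 0;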
With cow_from_owner merged into map_chg, we can remove quite a few such
checks all over the place. A side benefit is that we can also get rid of one
more confusing flag, deferred_reserve.
Clean up the comments a bit too. E.g., MAP_NORESERVE may not need to be
checked against the spool limit, AFAIU, if it's on a shared mapping and the
page cache folio has the inode's resv map available (in which case map_chg
would have been set to zero, hence the code should be correct; it's the
comment that isn't).
There's one trivial detail this patch touched that needs attention, which is
this check right after vma_commit_reservation():

	if (map_chg > map_commit)

It changes to:

	if (unlikely(map_chg == MAP_CHG_NEEDED && retval == 0))

It should behave the same as before, because previously the only way to make
"map_chg > map_commit" happen was map_chg=1 && map_commit=0, which is exactly
what the rewritten line checks. Meanwhile, either commit() or end() needs to
be skipped in the ENFORCED case, to keep the old behavior.
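Spelled out as a small truth table (only an illustration of the argument
above, not code from the patch):

	/*
	 * old (map_chg, map_commit) -> old "map_chg > map_commit":
	 *
	 *   (0, 0) -> false        (0, 1) -> false
	 *   (1, 0) -> true         (1, 1) -> false
	 *
	 * i.e. the old check only fires for map_chg == 1 && map_commit == 0,
	 * which the new code expresses as
	 * map_chg == MAP_CHG_NEEDED && retval == 0.
	 */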
Even though a lot appears to change, no functional change is expected.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
mm/hugetlb.c | 110 +++++++++++++++++++++++++++++++++++----------------
1 file changed, 77 insertions(+), 33 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index cdbc8914a9f7..b8a849fe1531 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2970,6 +2970,25 @@ int isolate_or_dissolve_huge_page(struct page *page, struct list_head *list)
return ret;
}
+typedef enum {
+ /*
+ * For either 0/1: we checked the per-vma resv map, and one resv
+ * count either can be reused (0), or an extra needed (1).
+ */
+ MAP_CHG_REUSE = 0,
+ MAP_CHG_NEEDED = 1,
+ /*
+ * The per-vma resv count cannot be used, hence a new resv
+ * count is enforced.
+ *
+ * NOTE: This is mostly identical to MAP_CHG_NEEDED, except
+ * that currently vma_needs_reservation() has an unwanted side
+ * effect to either use end() or commit() to complete the
+ * transaction. Hence it needs to differentiate from NEEDED.
+ */
+ MAP_CHG_ENFORCED = 2,
+} map_chg_state;
+
/*
* replace_free_hugepage_folios - Replace free hugepage folios in a given pfn
* range with new folios.
@@ -3021,40 +3040,59 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
struct hugepage_subpool *spool = subpool_vma(vma);
struct hstate *h = hstate_vma(vma);
struct folio *folio;
- long map_chg, map_commit;
- long gbl_chg;
+ long retval, gbl_chg;
+ map_chg_state map_chg;
int ret, idx;
struct hugetlb_cgroup *h_cg = NULL;
- bool deferred_reserve;
gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
idx = hstate_index(h);
- /*
- * Examine the region/reserve map to determine if the process
- * has a reservation for the page to be allocated. A return
- * code of zero indicates a reservation exists (no change).
- */
- map_chg = gbl_chg = vma_needs_reservation(h, vma, addr);
- if (map_chg < 0)
- return ERR_PTR(-ENOMEM);
+
+ /* Whether we need a separate per-vma reservation? */
+ if (cow_from_owner) {
+ /*
+ * Special case! Since it's a CoW on top of a reserved
+ * page, the private resv map doesn't count. So it cannot
+ * consume the per-vma resv map even if it's reserved.
+ */
+ map_chg = MAP_CHG_ENFORCED;
+ } else {
+ /*
+ * Examine the region/reserve map to determine if the process
+ * has a reservation for the page to be allocated. A return
+ * code of zero indicates a reservation exists (no change).
+ */
+ retval = vma_needs_reservation(h, vma, addr);
+ if (retval < 0)
+ return ERR_PTR(-ENOMEM);
+ map_chg = retval ? MAP_CHG_NEEDED : MAP_CHG_REUSE;
+ }
/*
+ * Whether we need a separate global reservation?
+ *
* Processes that did not create the mapping will have no
* reserves as indicated by the region/reserve map. Check
* that the allocation will not exceed the subpool limit.
- * Allocations for MAP_NORESERVE mappings also need to be
- * checked against any subpool limit.
+ * Or if it can get one from the pool reservation directly.
*/
- if (map_chg || cow_from_owner) {
+ if (map_chg) {
gbl_chg = hugepage_subpool_get_pages(spool, 1);
if (gbl_chg < 0)
goto out_end_reservation;
+ } else {
+ /*
+ * If we have the vma reservation ready, no need for extra
+ * global reservation.
+ */
+ gbl_chg = 0;
}
- /* If this allocation is not consuming a reservation, charge it now.
+ /*
+ * If this allocation is not consuming a per-vma reservation,
+ * charge the hugetlb cgroup now.
*/
- deferred_reserve = map_chg || cow_from_owner;
- if (deferred_reserve) {
+ if (map_chg) {
ret = hugetlb_cgroup_charge_cgroup_rsvd(
idx, pages_per_huge_page(h), &h_cg);
if (ret)
@@ -3078,7 +3116,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
if (!folio)
goto out_uncharge_cgroup;
spin_lock_irq(&hugetlb_lock);
- if (!cow_from_owner && vma_has_reserves(vma, gbl_chg)) {
+ if (vma_has_reserves(vma, gbl_chg)) {
folio_set_hugetlb_restore_reserve(folio);
h->resv_huge_pages--;
}
@@ -3091,7 +3129,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
/* If allocation is not consuming a reservation, also store the
* hugetlb_cgroup pointer on the page.
*/
- if (deferred_reserve) {
+ if (map_chg) {
hugetlb_cgroup_commit_charge_rsvd(idx, pages_per_huge_page(h),
h_cg, folio);
}
@@ -3100,26 +3138,31 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
hugetlb_set_folio_subpool(folio, spool);
- map_commit = vma_commit_reservation(h, vma, addr);
- if (unlikely(map_chg > map_commit)) {
+ if (map_chg != MAP_CHG_ENFORCED) {
+ /* commit() is only needed if the map_chg is not enforced */
+ retval = vma_commit_reservation(h, vma, addr);
/*
+ * Check for possible race conditions. When it happens..
* The page was added to the reservation map between
* vma_needs_reservation and vma_commit_reservation.
* This indicates a race with hugetlb_reserve_pages.
* Adjust for the subpool count incremented above AND
- * in hugetlb_reserve_pages for the same page. Also,
+ * in hugetlb_reserve_pages for the same page. Also,
* the reservation count added in hugetlb_reserve_pages
* no longer applies.
*/
- long rsv_adjust;
+ if (unlikely(map_chg == MAP_CHG_NEEDED && retval == 0)) {
+ long rsv_adjust;
- rsv_adjust = hugepage_subpool_put_pages(spool, 1);
- hugetlb_acct_memory(h, -rsv_adjust);
- if (deferred_reserve) {
- spin_lock_irq(&hugetlb_lock);
- hugetlb_cgroup_uncharge_folio_rsvd(hstate_index(h),
- pages_per_huge_page(h), folio);
- spin_unlock_irq(&hugetlb_lock);
+ rsv_adjust = hugepage_subpool_put_pages(spool, 1);
+ hugetlb_acct_memory(h, -rsv_adjust);
+ if (map_chg) {
+ spin_lock_irq(&hugetlb_lock);
+ hugetlb_cgroup_uncharge_folio_rsvd(
+ hstate_index(h), pages_per_huge_page(h),
+ folio);
+ spin_unlock_irq(&hugetlb_lock);
+ }
}
}
@@ -3141,14 +3184,15 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
out_uncharge_cgroup:
hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
out_uncharge_cgroup_reservation:
- if (deferred_reserve)
+ if (map_chg)
hugetlb_cgroup_uncharge_cgroup_rsvd(idx, pages_per_huge_page(h),
h_cg);
out_subpool_put:
- if (map_chg || cow_from_owner)
+ if (map_chg)
hugepage_subpool_put_pages(spool, 1);
out_end_reservation:
- vma_end_reservation(h, vma, addr);
+ if (map_chg != MAP_CHG_ENFORCED)
+ vma_end_reservation(h, vma, addr);
return ERR_PTR(-ENOSPC);
}
--
2.47.0
* [PATCH v2 5/7] mm/hugetlb: Simplify vma_has_reserves()
2025-01-07 20:39 [PATCH v2 0/7] mm/hugetlb: Refactor hugetlb allocation resv accounting Peter Xu
` (3 preceding siblings ...)
2025-01-07 20:39 ` [PATCH v2 4/7] mm/hugetlb: Clean up map/global resv accounting when allocate Peter Xu
@ 2025-01-07 20:40 ` Peter Xu
2025-01-07 20:40 ` [PATCH v2 6/7] mm/hugetlb: Drop vma_has_reserves() Peter Xu
2025-01-07 20:40 ` [PATCH v2 7/7] mm/hugetlb: Unify restore reserve accounting for new allocations Peter Xu
6 siblings, 0 replies; 16+ messages in thread
From: Peter Xu @ 2025-01-07 20:40 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Breno Leitao, Rik van Riel, Muchun Song, Naoya Horiguchi,
Roman Gushchin, Ackerley Tng, Andrew Morton, peterx,
Oscar Salvador
vma_has_reserves() is a helper that "tries" to know whether the vma should
consume one reservation when allocating the hugetlb folio.
However, it's not clear why we need such complexity, as the information is
already represented in the "chg" variable.
From alloc_hugetlb_folio() context, "chg" (or in the function's context,
"gbl_chg") is defined as:
- If gbl_chg=1, the allocation cannot reuse an existing reservation
- If gbl_chg=0, the allocation should reuse an existing reservation
Firstly, map_chg is defined as follows, to cover all hugetlb reservation
scenarios (mostly via vma_needs_reservation(), with cow_from_owner being the
outlier):
  CONDITION                                             HAS RESERVATION?
  =========                                             ================

  - SHARED: always check against the per-inode resv_map
    (ignore NORESERVE)
      - If resv exists                                  ==> YES [1]
      - If not                                          ==> NO  [2]

  - PRIVATE: complicated...
      - Request came from a CoW from the owner resv map ==> NO  [3]
        (when cow_from_owner==true)
      - If it does not own a resv_map at all..          ==> NO  [4]
        (examples: VM_NORESERVE, private fork())
      - If it owns a resv_map, but resv doesn't exist   ==> NO  [5]
      - If it owns a resv_map, and resv exists          ==> YES [6]
Further on, gbl_chg also takes the spool setup into account, so it is a
decision based on the full context.
If we look at vma_has_reserves(), it mostly repeats checks that have already
been covered by the map_chg accounting (I marked each return value with the
corresponding case above):
static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
{
        if (vma->vm_flags & VM_NORESERVE) {
                if (vma->vm_flags & VM_MAYSHARE && chg == 0)
                        return true;            ==> [1]
                else
                        return false;           ==> [2] or [4]
        }

        if (vma->vm_flags & VM_MAYSHARE) {
                if (chg)
                        return false;           ==> [2]
                else
                        return true;            ==> [1]
        }

        if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
                if (chg)
                        return false;           ==> [5]
                else
                        return true;            ==> [6]
        }

        return false;                           ==> [4]
}
It doesn't check [3], but case [3] is already covered by the "chg" /
"gbl_chg" / "map_chg" calculations.
In short, vma_has_reserves() doesn't provide anything more than "return
!chg", so just simplify all of it.
There are a lot of comments describing truncation races; IIUC there should
be no race as long as map_chg is done properly.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
mm/hugetlb.c | 67 ++++++----------------------------------------------
1 file changed, 7 insertions(+), 60 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b8a849fe1531..5ec079f32f44 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1247,66 +1247,13 @@ void clear_vma_resv_huge_pages(struct vm_area_struct *vma)
}
/* Returns true if the VMA has associated reserve pages */
-static bool vma_has_reserves(struct vm_area_struct *vma, long chg)
+static bool vma_has_reserves(long chg)
{
- if (vma->vm_flags & VM_NORESERVE) {
- /*
- * This address is already reserved by other process(chg == 0),
- * so, we should decrement reserved count. Without decrementing,
- * reserve count remains after releasing inode, because this
- * allocated page will go into page cache and is regarded as
- * coming from reserved pool in releasing step. Currently, we
- * don't have any other solution to deal with this situation
- * properly, so add work-around here.
- */
- if (vma->vm_flags & VM_MAYSHARE && chg == 0)
- return true;
- else
- return false;
- }
-
- /* Shared mappings always use reserves */
- if (vma->vm_flags & VM_MAYSHARE) {
- /*
- * We know VM_NORESERVE is not set. Therefore, there SHOULD
- * be a region map for all pages. The only situation where
- * there is no region map is if a hole was punched via
- * fallocate. In this case, there really are no reserves to
- * use. This situation is indicated if chg != 0.
- */
- if (chg)
- return false;
- else
- return true;
- }
-
/*
- * Only the process that called mmap() has reserves for
- * private mappings.
+ * Now "chg" has all the conditions considered for whether we
+ * should use an existing reservation.
*/
- if (is_vma_resv_set(vma, HPAGE_RESV_OWNER)) {
- /*
- * Like the shared case above, a hole punch or truncate
- * could have been performed on the private mapping.
- * Examine the value of chg to determine if reserves
- * actually exist or were previously consumed.
- * Very Subtle - The value of chg comes from a previous
- * call to vma_needs_reserves(). The reserve map for
- * private mappings has different (opposite) semantics
- * than that of shared mappings. vma_needs_reserves()
- * has already taken this difference in semantics into
- * account. Therefore, the meaning of chg is the same
- * as in the shared case above. Code could easily be
- * combined, but keeping it separate draws attention to
- * subtle differences.
- */
- if (chg)
- return false;
- else
- return true;
- }
-
- return false;
+ return chg == 0;
}
static void enqueue_hugetlb_folio(struct hstate *h, struct folio *folio)
@@ -1407,7 +1354,7 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
* have no page reserves. This check ensures that reservations are
* not "stolen". The child may still get SIGKILLed
*/
- if (!vma_has_reserves(vma, chg) && !available_huge_pages(h))
+ if (!vma_has_reserves(chg) && !available_huge_pages(h))
goto err;
gfp_mask = htlb_alloc_mask(h);
@@ -1425,7 +1372,7 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask,
nid, nodemask);
- if (folio && vma_has_reserves(vma, chg)) {
+ if (folio && vma_has_reserves(chg)) {
folio_set_hugetlb_restore_reserve(folio);
h->resv_huge_pages--;
}
@@ -3116,7 +3063,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
if (!folio)
goto out_uncharge_cgroup;
spin_lock_irq(&hugetlb_lock);
- if (vma_has_reserves(vma, gbl_chg)) {
+ if (vma_has_reserves(gbl_chg)) {
folio_set_hugetlb_restore_reserve(folio);
h->resv_huge_pages--;
}
--
2.47.0
* [PATCH v2 6/7] mm/hugetlb: Drop vma_has_reserves()
2025-01-07 20:39 [PATCH v2 0/7] mm/hugetlb: Refactor hugetlb allocation resv accounting Peter Xu
` (4 preceding siblings ...)
2025-01-07 20:40 ` [PATCH v2 5/7] mm/hugetlb: Simplify vma_has_reserves() Peter Xu
@ 2025-01-07 20:40 ` Peter Xu
2025-01-07 20:40 ` [PATCH v2 7/7] mm/hugetlb: Unify restore reserve accounting for new allocations Peter Xu
6 siblings, 0 replies; 16+ messages in thread
From: Peter Xu @ 2025-01-07 20:40 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Breno Leitao, Rik van Riel, Muchun Song, Naoya Horiguchi,
Roman Gushchin, Ackerley Tng, Andrew Morton, peterx,
Oscar Salvador
After the previous cleanup, vma_has_reserves() is mostly an empty helper;
all it still expresses is that "use a reserve count" is the inverted meaning
of "needs a global reserve count", which is still true.
To avoid the confusion of having two inverted ways to ask the same question,
always use gbl_chg everywhere, and drop the function.
While at it, rename "chg" to "gbl_chg" in dequeue_hugetlb_folio_vma(). It
may help readers to see that the "chg" here is the global reserve count, not
the vma resv count.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
mm/hugetlb.c | 23 ++++++-----------------
1 file changed, 6 insertions(+), 17 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 5ec079f32f44..922d57e2413a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1246,16 +1246,6 @@ void clear_vma_resv_huge_pages(struct vm_area_struct *vma)
hugetlb_dup_vma_private(vma);
}
-/* Returns true if the VMA has associated reserve pages */
-static bool vma_has_reserves(long chg)
-{
- /*
- * Now "chg" has all the conditions considered for whether we
- * should use an existing reservation.
- */
- return chg == 0;
-}
-
static void enqueue_hugetlb_folio(struct hstate *h, struct folio *folio)
{
int nid = folio_nid(folio);
@@ -1341,7 +1331,7 @@ static unsigned long available_huge_pages(struct hstate *h)
static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
struct vm_area_struct *vma,
- unsigned long address, long chg)
+ unsigned long address, long gbl_chg)
{
struct folio *folio = NULL;
struct mempolicy *mpol;
@@ -1350,11 +1340,10 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
int nid;
/*
- * A child process with MAP_PRIVATE mappings created by their parent
- * have no page reserves. This check ensures that reservations are
- * not "stolen". The child may still get SIGKILLed
+ * gbl_chg==1 means the allocation requires a new page that was not
+ * reserved before. Making sure there's at least one free page.
*/
- if (!vma_has_reserves(chg) && !available_huge_pages(h))
+ if (gbl_chg && !available_huge_pages(h))
goto err;
gfp_mask = htlb_alloc_mask(h);
@@ -1372,7 +1361,7 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask,
nid, nodemask);
- if (folio && vma_has_reserves(chg)) {
+ if (folio && !gbl_chg) {
folio_set_hugetlb_restore_reserve(folio);
h->resv_huge_pages--;
}
@@ -3063,7 +3052,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
if (!folio)
goto out_uncharge_cgroup;
spin_lock_irq(&hugetlb_lock);
- if (vma_has_reserves(gbl_chg)) {
+ if (!gbl_chg) {
folio_set_hugetlb_restore_reserve(folio);
h->resv_huge_pages--;
}
--
2.47.0
* [PATCH v2 7/7] mm/hugetlb: Unify restore reserve accounting for new allocations
2025-01-07 20:39 [PATCH v2 0/7] mm/hugetlb: Refactor hugetlb allocation resv accounting Peter Xu
` (5 preceding siblings ...)
2025-01-07 20:40 ` [PATCH v2 6/7] mm/hugetlb: Drop vma_has_reserves() Peter Xu
@ 2025-01-07 20:40 ` Peter Xu
6 siblings, 0 replies; 16+ messages in thread
From: Peter Xu @ 2025-01-07 20:40 UTC (permalink / raw)
To: linux-mm, linux-kernel
Cc: Breno Leitao, Rik van Riel, Muchun Song, Naoya Horiguchi,
Roman Gushchin, Ackerley Tng, Andrew Morton, peterx,
Oscar Salvador
Hugetlb pages, whether dequeued from the hstate or newly allocated from the
buddy allocator, require restore-reserve accounting to be managed properly.
Merge the two paths for it. Add a small comment to make it slightly nicer.
Signed-off-by: Peter Xu <peterx@redhat.com>
---
mm/hugetlb.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 922d57e2413a..3b27840de36f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1361,11 +1361,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask,
nid, nodemask);
- if (folio && !gbl_chg) {
- folio_set_hugetlb_restore_reserve(folio);
- h->resv_huge_pages--;
- }
-
mpol_cond_put(mpol);
return folio;
@@ -3052,15 +3047,20 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
if (!folio)
goto out_uncharge_cgroup;
spin_lock_irq(&hugetlb_lock);
- if (!gbl_chg) {
- folio_set_hugetlb_restore_reserve(folio);
- h->resv_huge_pages--;
- }
list_add(&folio->lru, &h->hugepage_activelist);
folio_ref_unfreeze(folio, 1);
/* Fall through */
}
+ /*
+ * Either a dequeued or a buddy-allocated folio needs the special
+ * restore-reserve mark when it consumes a global reservation.
+ */
+ if (!gbl_chg) {
+ folio_set_hugetlb_restore_reserve(folio);
+ h->resv_huge_pages--;
+ }
+
hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, folio);
/* If allocation is not consuming a reservation, also store the
* hugetlb_cgroup pointer on the page.
--
2.47.0
* Re: [PATCH v2 1/7] mm/hugetlb: Fix avoid_reserve to allow taking folio from subpool
2025-01-07 20:39 ` [PATCH v2 1/7] mm/hugetlb: Fix avoid_reserve to allow taking folio from subpool Peter Xu
@ 2025-01-13 11:02 ` Oscar Salvador
0 siblings, 0 replies; 16+ messages in thread
From: Oscar Salvador @ 2025-01-13 11:02 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, linux-kernel, Breno Leitao, Rik van Riel, Muchun Song,
Naoya Horiguchi, Roman Gushchin, Ackerley Tng, Andrew Morton,
linux-stable
On Tue, Jan 07, 2025 at 03:39:56PM -0500, Peter Xu wrote:
> Since commit 04f2cbe35699 ("hugetlb: guarantee that COW faults for a
> process that called mmap(MAP_PRIVATE) on hugetlbfs will succeed"),
> avoid_reserve was introduced for a special case of CoW on hugetlb private
> mappings, and only if the owner VMA is trying to allocate yet another
> hugetlb folio that is not reserved within the private vma reserved map.
>
> Later on, in commit d85f69b0b533 ("mm/hugetlb: alloc_huge_page handle areas
> hole punched by fallocate"), alloc_huge_page() enforced to not consume any
> global reservation as long as avoid_reserve=true. This operation doesn't
> look correct, because even if it will enforce the allocation to not use
> global reservation at all, it will still try to take one reservation from
> the spool (if the subpool existed). Then since the spool reserved pages
> take from global reservation, it'll also take one reservation globally.
>
> Logically it can cause global reservation to go wrong.
>
> I wrote a reproducer below, trigger this special path, and every run of
> such program will cause global reservation count to increment by one, until
> it hits the number of free pages:
>
> #define _GNU_SOURCE /* See feature_test_macros(7) */
> #include <stdio.h>
> #include <fcntl.h>
> #include <errno.h>
> #include <unistd.h>
> #include <stdlib.h>
> #include <sys/mman.h>
>
> #define MSIZE (2UL << 20)
>
> int main(int argc, char *argv[])
> {
> const char *path;
> int *buf;
> int fd, ret;
> pid_t child;
>
> if (argc < 2) {
> printf("usage: %s <hugetlb_file>\n", argv[0]);
> return -1;
> }
>
> path = argv[1];
>
> fd = open(path, O_RDWR | O_CREAT, 0666);
> if (fd < 0) {
> perror("open failed");
> return -1;
> }
>
> ret = fallocate(fd, 0, 0, MSIZE);
> if (ret != 0) {
> perror("fallocate");
> return -1;
> }
>
> buf = mmap(NULL, MSIZE, PROT_READ|PROT_WRITE,
> MAP_PRIVATE, fd, 0);
>
> if (buf == MAP_FAILED) {
> perror("mmap() failed");
> return -1;
> }
>
> /* Allocate a page */
> *buf = 1;
>
> child = fork();
> if (child == 0) {
> /* child doesn't need to do anything */
> exit(0);
> }
>
> /* Trigger CoW from owner */
> *buf = 2;
>
> munmap(buf, MSIZE);
> close(fd);
> unlink(path);
>
> return 0;
> }
>
> It can only reproduce with a sub-mount when there're reserved pages on the
> spool, like:
>
> # sysctl vm.nr_hugepages=128
> # mkdir ./hugetlb-pool
> # mount -t hugetlbfs -o min_size=8M,pagesize=2M none ./hugetlb-pool
>
> Then run the reproducer on the mountpoint:
>
> # ./reproducer ./hugetlb-pool/test
>
> Fix it by taking the reservation from spool if available. In general,
> avoid_reserve is IMHO more about "avoid vma resv map", not spool's.
>
> I copied stable, however I have no intention for backporting if it's not a
> clean cherry-pick, because private hugetlb mapping, and then fork() on top
> is too rare to hit.
>
> Cc: linux-stable <stable@vger.kernel.org>
> Fixes: d85f69b0b533 ("mm/hugetlb: alloc_huge_page handle areas hole punched by fallocate")
> Reviewed-by: Ackerley Tng <ackerleytng@google.com>
> Tested-by: Ackerley Tng <ackerleytng@google.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
--
Oscar Salvador
SUSE Labs
* Re: [PATCH v2 2/7] mm/hugetlb: Stop using avoid_reserve flag in fork()
2025-01-07 20:39 ` [PATCH v2 2/7] mm/hugetlb: Stop using avoid_reserve flag in fork() Peter Xu
@ 2025-01-13 11:05 ` Oscar Salvador
0 siblings, 0 replies; 16+ messages in thread
From: Oscar Salvador @ 2025-01-13 11:05 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, linux-kernel, Breno Leitao, Rik van Riel, Muchun Song,
Naoya Horiguchi, Roman Gushchin, Ackerley Tng, Andrew Morton
On Tue, Jan 07, 2025 at 03:39:57PM -0500, Peter Xu wrote:
> When fork() and stumble on top of a dma-pinned hugetlb private page, CoW
> must happen during fork() to guarantee dma coherency.
>
> In this specific path, hugetlb pages need to be allocated for the child
> process. Stop using avoid_reserve=1 flag here: it's not required to be
> used here, as dest_vma (which is destined to be a MAP_PRIVATE hugetlb vma)
> will have no private vma resv map, and that will make sure it won't be able
> to use a vma reservation later.
>
> No functional change intended with this change. Said that, it's still
> wanted to do this, so as to reduce the usage of avoid_reserve to the only
> one user, which is also why this flag was introduced initially in commit
> 04f2cbe35699 ("hugetlb: guarantee that COW faults for a process that called
> mmap(MAP_PRIVATE) on hugetlbfs will succeed"). I don't see whoever else
> should set it at all.
>
> Further patch will clean up resv accounting based on this.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
--
Oscar Salvador
SUSE Labs
* Re: [PATCH v2 3/7] mm/hugetlb: Rename avoid_reserve to cow_from_owner
2025-01-07 20:39 ` [PATCH v2 3/7] mm/hugetlb: Rename avoid_reserve to cow_from_owner Peter Xu
@ 2025-01-13 11:20 ` Oscar Salvador
2025-01-13 16:19 ` Peter Xu
0 siblings, 1 reply; 16+ messages in thread
From: Oscar Salvador @ 2025-01-13 11:20 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, linux-kernel, Breno Leitao, Rik van Riel, Muchun Song,
Naoya Horiguchi, Roman Gushchin, Ackerley Tng, Andrew Morton
On Tue, Jan 07, 2025 at 03:39:58PM -0500, Peter Xu wrote:
> The old name "avoid_reserve" can be too generic and can be used wrongly in
> the new call sites that want to allocate a hugetlb folio.
>
> It's confusing on two things: (1) whether one can opt-in to avoid global
> reservation, and (2) whether it should take more than one count.
>
> In reality, this flag is only used in an extremely hacky path, in an
> extremely hacky way in hugetlb CoW path only, and always use with 1 saying
> "skip global reservation". Rename the flag to avoid future abuse of this
> flag, making it a boolean so as to reflect its true representation that
> it's not a counter. To make it even harder to abuse, add a comment above
> the function to explain it.
>
> Signed-off-by: Peter Xu <peterx@redhat.com>
I agree that the current name is quite misleading, and this patch
improves the situation substantially.
The only thing I am missing here is that the comment you added could be
more explanatory as to why new call sites do not want to make use of the
flag.
IIRC, not using so, will bypass all vma level reservations as you
mentioned, which means that the child can get killed if the parent
makes use of the page, as it is the parent the only one that made a
reservation.
So maybe dropping a hint would be nice.
Reviewed-by: Oscar Salvador <osalvador@suse.de>
--
Oscar Salvador
SUSE Labs
* Re: [PATCH v2 3/7] mm/hugetlb: Rename avoid_reserve to cow_from_owner
2025-01-13 11:20 ` Oscar Salvador
@ 2025-01-13 16:19 ` Peter Xu
2025-01-16 10:15 ` Oscar Salvador
0 siblings, 1 reply; 16+ messages in thread
From: Peter Xu @ 2025-01-13 16:19 UTC (permalink / raw)
To: Oscar Salvador
Cc: linux-mm, linux-kernel, Breno Leitao, Rik van Riel, Muchun Song,
Naoya Horiguchi, Roman Gushchin, Ackerley Tng, Andrew Morton
Oscar,
On Mon, Jan 13, 2025 at 12:20:34PM +0100, Oscar Salvador wrote:
> On Tue, Jan 07, 2025 at 03:39:58PM -0500, Peter Xu wrote:
> > The old name "avoid_reserve" can be too generic and can be used wrongly in
> > the new call sites that want to allocate a hugetlb folio.
> >
> > It's confusing on two things: (1) whether one can opt-in to avoid global
> > reservation, and (2) whether it should take more than one count.
> >
> > In reality, this flag is only used in an extremely hacky path, in an
> > extremely hacky way in hugetlb CoW path only, and always use with 1 saying
> > "skip global reservation". Rename the flag to avoid future abuse of this
> > flag, making it a boolean so as to reflect its true representation that
> > it's not a counter. To make it even harder to abuse, add a comment above
> > the function to explain it.
> >
> > Signed-off-by: Peter Xu <peterx@redhat.com>
>
> I agree that the current name is quite misleading, and this patch
> improves the situation substantially.
> The only thing I am missing here is that the comment you added could be
> more explanatory as to why new call sites do not want to make use of the
> flag.
>
> IIRC, not using so, will bypass all vma level reservations as you
s/not using/using/? Only using the flag (setting it to true) will bypass the
vma reservation, but I could have misunderstood this line..
> mentioned, which means that the child can get killed if the parent
> makes use of the page, as it is the parent the only one that made a
> reservation.
The paragraph I added on top of alloc_hugetlb_folio() is trying to suggest
nobody should set this to true in any new paths.
So far, the reservation path should have nothing to do with page stealing on
its own (if that is what you meant above..) - page stealing for private
hugetlb is done separately within the unmap_ref_private() helper. Here the
parent needs to bypass the vma reservation because it must have already
consumed it for the folio installed in the pgtable (which is
write-protected). That may or may not be relevant to page stealing, e.g. if
the global pool still has free pages, it doesn't need to prevent the child
from using hugetlb pages.
However, again, I could have totally misunderstood your comment..
>
> So maybe dropping a hint would be nice.
>
> Reviewed-by: Oscar Salvador <osalvador@suse.de>
Thanks for taking a look!
--
Peter Xu
* Re: [PATCH v2 4/7] mm/hugetlb: Clean up map/global resv accounting when allocate
2025-01-07 20:39 ` [PATCH v2 4/7] mm/hugetlb: Clean up map/global resv accounting when allocate Peter Xu
@ 2025-01-13 22:57 ` Ackerley Tng
2025-01-14 18:25 ` Peter Xu
0 siblings, 1 reply; 16+ messages in thread
From: Ackerley Tng @ 2025-01-13 22:57 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, linux-kernel, leitao, riel, muchun.song, nao.horiguchi,
roman.gushchin, akpm, peterx, osalvador
Hi Peter,
I have an alternative that is based off your patches.
I like the overall refactoring but was a little uncomfortable with
having to add a custom enum map_chg_state. The custom enum centralizes
the possible states, but the states still need to be interpreted again
throughout the function to take action (e.g. having checks like if
(map_chg != MAP_CHG_ENFORCED) and if (map_chg == MAP_CHG_NEEDED)), and I
feel that shifts the problem later to understanding the interpretation
of the states.
I switched back to something close to avoid_reserve, but improved on
that name with your comments, by calling it bypass_vma_reservations. I
feel that calling it cow_from_owner kind of lets the implementation
detail of the CoW use-case bleed into this function.
"bypass_vma_reservations" is named so that it requests this function,
alloc_hugetlb_folio(), to bypass the resv_map. All parts of
alloc_hugetlb_folio() that update the resv_map are guarded by the
bypass_vma_reservations flag.
This alternative proposes to use the following booleans local to
alloc_hugetlb_folio().
1. vma_reservation_exists
vma_reservation_exists represents whether a reservation exists in the
resv_map and is used. vma_reservation_exists defaults to false, and when
bypass_vma_reservations is not requested, the resv_map will be consulted
to see if vma_reservation_exists.
vma_reservation_exists is also used to store the result of the initial
resv_map check, to correctly fix up a race later on.
2. debit_subpool
If a vma_reservation_exists, then this allocation has been reserved in
both the resv_map and subpool, so skip debiting the subpool.
If alloc_hugetlb_folio() was requested to bypass_vma_reservations, the
subpool should still be charged, so debit_subpool is set.
And if debit_subpool, then proceed to do hugepage_subpool_get_pages()
and set up subpool_reservation_exists.
Later on, debit_subpool is used for the matching cleanup in the error
case.
3. subpool_reservation_exists
subpool_reservation_exists represents whether a reservation exists in
the resv_map and is used, analogous to vma_reservation_exists, and the
subpool is only checked if debit_subpool is set.
If debit_subpool is not set, vma_reservation_exists determines whether a
subpool_reservation_exists.
subpool_reservation_exists is then used to guard decrementing
h->resv_huge_pages, which fixes the bug you found.
4. charge_cgroup_rsvd
This has the same condition as debit_subpool, but has a separate
variable for readability.
Later on, charge_cgroup_rsvd is used to determine whether to commit the
charge, or whether to fix up the charge in error cases.
I also refactored dequeue_hugetlb_folio_vma() to
dequeue_hugetlb_folio_with_mpol() to align with
alloc_buddy_hugetlb_folio_with_mpol() to avoid passing gbl_chg into the
function.
gbl_chg is interpreted in alloc_hugetlb_folio() instead. If
subpool_reservation_exists, then try to get a folio by dequeueing it. If
a subpool reservation does not exist, make sure there are available
pages before dequeueing.
If there was no folio from dequeueing for whatever reason, allocate a
new folio.
This could probably be a separate patch but I'd like to hear your
thoughts before doing integration/cleaning up.
These changes have been tested with your reproducer, and here's the test
output from libhugetlbfs test cases:
********** TEST SUMMARY
*                      2M
*                      32-bit 64-bit
*     Total testcases:    82     85
*             Skipped:     9      9
*                PASS:    72     69
*                FAIL:     0      0
*    Killed by signal:     1      7
*   Bad configuration:     0      0
*       Expected FAIL:     0      0
*     Unexpected PASS:     0      0
*    Test not present:     0      0
* Strange test result:     0      0
**********
Ackerley
---
mm/hugetlb.c | 186 ++++++++++++++++++++++++++-------------------------
1 file changed, 94 insertions(+), 92 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6a0ea28f5bac..2cd588d35984 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1333,9 +1333,9 @@ static unsigned long available_huge_pages(struct hstate *h)
return h->free_huge_pages - h->resv_huge_pages;
}
-static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
- struct vm_area_struct *vma,
- unsigned long address, long gbl_chg)
+static struct folio *dequeue_hugetlb_folio_with_mpol(struct hstate *h,
+ struct vm_area_struct *vma,
+ unsigned long address)
{
struct folio *folio = NULL;
struct mempolicy *mpol;
@@ -1343,13 +1343,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
nodemask_t *nodemask;
int nid;
- /*
- * gbl_chg==1 means the allocation requires a new page that was not
- * reserved before. Making sure there's at least one free page.
- */
- if (gbl_chg && !available_huge_pages(h))
- goto err;
-
gfp_mask = htlb_alloc_mask(h);
nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
@@ -1367,9 +1360,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
mpol_cond_put(mpol);
return folio;
-
-err:
- return NULL;
}
/*
@@ -2943,91 +2933,83 @@ int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn)
return ret;
}
-typedef enum {
- /*
- * For either 0/1: we checked the per-vma resv map, and one resv
- * count either can be reused (0), or an extra needed (1).
- */
- MAP_CHG_REUSE = 0,
- MAP_CHG_NEEDED = 1,
- /*
- * Cannot use per-vma resv count can be used, hence a new resv
- * count is enforced.
- *
- * NOTE: This is mostly identical to MAP_CHG_NEEDED, except
- * that currently vma_needs_reservation() has an unwanted side
- * effect to either use end() or commit() to complete the
- * transaction. Hence it needs to differenciate from NEEDED.
- */
- MAP_CHG_ENFORCED = 2,
-} map_chg_state;
-
/*
- * NOTE! "cow_from_owner" represents a very hacky usage only used in CoW
- * faults of hugetlb private mappings on top of a non-page-cache folio (in
- * which case even if there's a private vma resv map it won't cover such
- * allocation). New call sites should (probably) never set it to true!!
- * When it's set, the allocation will bypass all vma level reservations.
+ * NOTE! "bypass_vma_reservations" represents a very niche usage, when CoW
+ * faults of hugetlb private mappings need to allocate a new page on top of a
+ * non-page-cache folio. In this situation, even if there's a private vma resv
+ * map, the resv map must be bypassed. New call sites should (probably) never
+ * set bypass_vma_reservations to true!!
*/
struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
- unsigned long addr, bool cow_from_owner)
+ unsigned long addr, bool bypass_vma_reservations)
{
struct hugepage_subpool *spool = subpool_vma(vma);
struct hstate *h = hstate_vma(vma);
struct folio *folio;
- long retval, gbl_chg;
- map_chg_state map_chg;
int ret, idx;
struct hugetlb_cgroup *h_cg = NULL;
gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
+ bool vma_reservation_exists = false;
+ bool subpool_reservation_exists;
+ bool debit_subpool;
+ bool charge_cgroup_rsvd;
idx = hstate_index(h);
- /* Whether we need a separate per-vma reservation? */
- if (cow_from_owner) {
- /*
- * Special case! Since it's a CoW on top of a reserved
- * page, the private resv map doesn't count. So it cannot
- * consume the per-vma resv map even if it's reserved.
- */
- map_chg = MAP_CHG_ENFORCED;
- } else {
+ if (!bypass_vma_reservations) {
/*
* Examine the region/reserve map to determine if the process
- * has a reservation for the page to be allocated. A return
- * code of zero indicates a reservation exists (no change).
- */
- retval = vma_needs_reservation(h, vma, addr);
- if (retval < 0)
+ * has a reservation for the page to be allocated and debit the
+ * reservation. If npages_req == 0, a reservation exists and is
+ * used. If npages_req > 0, a reservation has to be taken either
+ * from the subpool or global pool.
+ */
+ int npages_req = vma_needs_reservation(h, vma, addr);
+ if (npages_req < 0)
return ERR_PTR(-ENOMEM);
- map_chg = retval ? MAP_CHG_NEEDED : MAP_CHG_REUSE;
+
+ vma_reservation_exists = npages_req == 0;
}
/*
- * Whether we need a separate global reservation?
+ * If no vma reservation exists, debit the subpool.
*
+ * Even if we were requested to bypass_vma_reservations, debit the
+ * subpool - the subpool still has to be charged for this allocation.
+ */
+ debit_subpool = !vma_reservation_exists || bypass_vma_reservations;
+
+ /*
* Processes that did not create the mapping will have no
* reserves as indicated by the region/reserve map. Check
* that the allocation will not exceed the subpool limit.
* Or if it can get one from the pool reservation directly.
*/
- if (map_chg) {
- gbl_chg = hugepage_subpool_get_pages(spool, 1);
- if (gbl_chg < 0)
+ if (debit_subpool) {
+ int npages_req = hugepage_subpool_get_pages(spool, 1);
+ if (npages_req < 0)
goto out_end_reservation;
- } else {
+
/*
- * If we have the vma reservation ready, no need for extra
- * global reservation.
- */
- gbl_chg = 0;
+ * npages_req == 0 indicates a reservation exists for the
+ * allocation in the subpool and can be used. npages_req > 0
+ * indicates that a reservation must be taken from the global
+ * pool.
+ */
+ subpool_reservation_exists = npages_req == 0;
+ } else {
+ /* A vma reservation implies having a subpool reservation. */
+ subpool_reservation_exists = vma_reservation_exists;
}
/*
- * If this allocation is not consuming a per-vma reservation,
- * charge the hugetlb cgroup now.
+ * If no vma reservation exists, charge the cgroup's reserved quota.
+ *
+ * Even if we were requested to bypass_vma_reservations, the cgroup
+ * still has to be charged for this allocation.
*/
- if (map_chg) {
+ charge_cgroup_rsvd = !vma_reservation_exists || bypass_vma_reservations;
+ if (charge_cgroup_rsvd) {
ret = hugetlb_cgroup_charge_cgroup_rsvd(
idx, pages_per_huge_page(h), &h_cg);
if (ret)
@@ -3039,12 +3021,23 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
goto out_uncharge_cgroup_reservation;
spin_lock_irq(&hugetlb_lock);
- /*
- * glb_chg is passed to indicate whether or not a page must be taken
- * from the global free pool (global change). gbl_chg == 0 indicates
- * a reservation exists for the allocation.
- */
- folio = dequeue_hugetlb_folio_vma(h, vma, addr, gbl_chg);
+
+ if (subpool_reservation_exists) {
+ folio = dequeue_hugetlb_folio_with_mpol(h, vma, addr);
+ } else {
+ /*
+ * Since no subpool_reservation_exists, the allocation requires
+ * a new page that was not reserved before. Only dequeue if
+ * there are available pages.
+ */
+ if (available_huge_pages(h)) {
+ folio = dequeue_hugetlb_folio_with_mpol(h, vma, addr);
+ } else {
+ folio = NULL;
+ /* Fallthrough to allocate a new page. */
+ }
+ }
+
if (!folio) {
spin_unlock_irq(&hugetlb_lock);
folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, addr);
@@ -3057,19 +3050,17 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
}
/*
- * Either dequeued or buddy-allocated folio needs to add special
- * mark to the folio when it consumes a global reservation.
+ * If subpool_reservation_exists (and is used for this allocation),
+ * decrement resv_huge_pages to indicate that a reservation was used.
*/
- if (!gbl_chg) {
+ if (subpool_reservation_exists) {
folio_set_hugetlb_restore_reserve(folio);
h->resv_huge_pages--;
}
hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, folio);
- /* If allocation is not consuming a reservation, also store the
- * hugetlb_cgroup pointer on the page.
- */
- if (map_chg) {
+
+ if (charge_cgroup_rsvd) {
hugetlb_cgroup_commit_charge_rsvd(idx, pages_per_huge_page(h),
h_cg, folio);
}
@@ -3078,25 +3069,30 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
hugetlb_set_folio_subpool(folio, spool);
- if (map_chg != MAP_CHG_ENFORCED) {
- /* commit() is only needed if the map_chg is not enforced */
- retval = vma_commit_reservation(h, vma, addr);
+ if (!bypass_vma_reservations) {
+ /*
+ * As long as vma reservations were not bypassed, we need to
+ * commit() to clear up any adds_in_progress in resv_map.
+ */
+ int ret = vma_commit_reservation(h, vma, addr);
/*
- * Check for possible race conditions. When it happens..
- * The page was added to the reservation map between
- * vma_needs_reservation and vma_commit_reservation.
- * This indicates a race with hugetlb_reserve_pages.
+ * If there is a discrepancy in reservation status between the
+ * time of vma_needs_reservation() and vma_commit_reservation(),
+ * then the page must have been added to the reservation
+ * map between vma_needs_reservation() and
+ * vma_commit_reservation().
+ *
* Adjust for the subpool count incremented above AND
* in hugetlb_reserve_pages for the same page. Also,
* the reservation count added in hugetlb_reserve_pages
* no longer applies.
*/
- if (unlikely(map_chg == MAP_CHG_NEEDED && retval == 0)) {
+ if (unlikely(!vma_reservation_exists && ret == 0)) {
long rsv_adjust;
rsv_adjust = hugepage_subpool_put_pages(spool, 1);
hugetlb_acct_memory(h, -rsv_adjust);
- if (map_chg) {
+ if (charge_cgroup_rsvd) {
spin_lock_irq(&hugetlb_lock);
hugetlb_cgroup_uncharge_folio_rsvd(
hstate_index(h), pages_per_huge_page(h),
@@ -3124,14 +3120,14 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
out_uncharge_cgroup:
hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
out_uncharge_cgroup_reservation:
- if (map_chg)
+ if (charge_cgroup_rsvd)
hugetlb_cgroup_uncharge_cgroup_rsvd(idx, pages_per_huge_page(h),
h_cg);
out_subpool_put:
- if (map_chg)
+ if (debit_subpool)
hugepage_subpool_put_pages(spool, 1);
out_end_reservation:
- if (map_chg != MAP_CHG_ENFORCED)
+ if (!bypass_vma_reservations)
vma_end_reservation(h, vma, addr);
return ERR_PTR(-ENOSPC);
}
@@ -5900,6 +5896,12 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
* be acquired again before returning to the caller, as expected.
*/
spin_unlock(vmf->ptl);
+
+ /*
+ * If this is a CoW from the owner of this page, we
+ * bypass_vma_reservations, since the reservation was already consumed
+ * when the hugetlb folio was first allocated before the fork happened.
+ */
new_folio = alloc_hugetlb_folio(vma, vmf->address, cow_from_owner);
if (IS_ERR(new_folio)) {
--
2.47.1.688.g23fc6f90ad-goog
^ permalink raw reply related [flat|nested] 16+ messages in thread
* Re: [PATCH v2 4/7] mm/hugetlb: Clean up map/global resv accounting when allocate
2025-01-13 22:57 ` Ackerley Tng
@ 2025-01-14 18:25 ` Peter Xu
0 siblings, 0 replies; 16+ messages in thread
From: Peter Xu @ 2025-01-14 18:25 UTC (permalink / raw)
To: Ackerley Tng
Cc: linux-mm, linux-kernel, leitao, riel, muchun.song, nao.horiguchi,
roman.gushchin, akpm, osalvador
On Mon, Jan 13, 2025 at 10:57:55PM +0000, Ackerley Tng wrote:
>
> Hi Peter,
Hi, Ackerley!
>
> I have an alternative that is based off your patches.
I'll try to comment mostly inline in your patch, that's easier for me.
Side note: needing such a detailed explanation before a patch is pasted
is not a good sign for the readability of the code. :) But still, thanks
for all these detailed explanations! It's definitely not you nor your patch
to blame at all. Frankly, I'm also not 100% satisfied with what I got in my
current patch, but I must confess it's almost the best I can get so far..
The goal for me is to pave the way for your gmem work, so I'm ok as long as
we remove as many vma references as possible from this allocation path,
which matters more to me, and when working on this patch I was trying to
make my next patch easier (which hopefully it does). More on this at the
end.
>
> I like the overall refactoring but was a little uncomfortable with
> having to add a custom enum map_chg_state. The custom enum centralizes
> the possible states, but the states still need to be interpreted again
> throughout the function to take action (e.g. having checks like if
> (map_chg != MAP_CHG_ENFORCED) and if (map_chg == MAP_CHG_NEEDED)), and I
> feel that shifts the problem later to understanding the interpretation
> of the states.
>
> I switched back to something close to avoid_reserve, but improved on
> that name with your comments, by calling it bypass_vma_reservations. I
> feel that calling it cow_from_owner kind of lets the implementation
> detail of the CoW use-case bleed into this function.
>
> "bypass_vma_reservations" is named so that it requests this function,
> alloc_hugetlb_folio(), to bypass the resv_map. All parts of
> alloc_hugetlb_folio() that update the resv_map are guarded by the
> bypass_vma_reservations flag.
>
> This alternative proposes to use the following booleans local to
> alloc_hugetlb_folio().
>
> 1. vma_reservation_exists
>
> vma_reservation_exists represents whether a reservation exists in the
> resv_map and is used. vma_reservation_exists defaults to false, and when
> bypass_vma_reservations is not requested, the resv_map will be consulted
> to see if vma_reservation_exists.
>
> vma_reservation_exists is also used to store the result of the initial
> resv_map check, to correctly fix up a race later on.
>
> 2. debit_subpool
>
> If a vma_reservation_exists, then this allocation has been reserved in
> both the resv_map and subpool, so skip debiting the subpool.
>
> If alloc_hugetlb_folio() was requested to bypass_vma_reservations, the
> subpool should still be charged, so debit_subpool is set.
>
> And if debit_subpool, then proceed to do hugepage_subpool_get_pages()
> and set up subpool_reservation_exists.
>
> Later on, debit_subpool is used for the matching cleanup in the error
> case.
>
> 3. subpool_reservation_exists
>
> subpool_reservation_exists represents whether a reservation exists in
> the subpool and is used, analogous to vma_reservation_exists, and the
> subpool is only checked if debit_subpool is set.
>
> If debit_subpool is not set, vma_reservation_exists determines whether a
> subpool_reservation_exists.
>
> subpool_reservation_exists is then used to guard decrementing
> h->resv_huge_pages, which fixes the bug you found.
>
> 4. charge_cgroup_rsvd
>
> This has the same condition as debit_subpool, but has a separate
> variable for readability.
>
> Later on, charge_cgroup_rsvd is used to determine whether to commit the
> charge, or whether to fix up the charge in error cases.
>
>
> I also refactored dequeue_hugetlb_folio_vma() to
> dequeue_hugetlb_folio_with_mpol() to align with
> alloc_buddy_hugetlb_folio_with_mpol() to avoid passing gbl_chg into the
> function.
>
> gbl_chg is interpreted in alloc_hugetlb_folio() instead. If
> subpool_reservation_exists, then try to get a folio by dequeueing it. If
> a subpool reservation does not exist, make sure there are available
> pages before dequeueing.
>
> If there was no folio from dequeueing for whatever reason, allocate a
> new folio.
>
> This could probably be a separate patch but I'd like to hear your
> thoughts before doing integration/cleaning up.
>
> These changes have been tested with your reproducer, and here's the test
> output from libhugetlbfs test cases:
>
> ********** TEST SUMMARY
> * 2M
> * 32-bit 64-bit
> * Total testcases: 82 85
> * Skipped: 9 9
> * PASS: 72 69
> * FAIL: 0 0
> * Killed by signal: 1 7
> * Bad configuration: 0 0
> * Expected FAIL: 0 0
> * Unexpected PASS: 0 0
> * Test not present: 0 0
> * Strange test result: 0 0
> **********
>
>
> Ackerley
>
>
> ---
> mm/hugetlb.c | 186 ++++++++++++++++++++++++++-------------------------
> 1 file changed, 94 insertions(+), 92 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 6a0ea28f5bac..2cd588d35984 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1333,9 +1333,9 @@ static unsigned long available_huge_pages(struct hstate *h)
> return h->free_huge_pages - h->resv_huge_pages;
> }
>
> -static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
> - struct vm_area_struct *vma,
> - unsigned long address, long gbl_chg)
> +static struct folio *dequeue_hugetlb_folio_with_mpol(struct hstate *h,
Personally, the _with_mpol suffix implies a mempolicy* to be passed in. In
this case I slightly prefer keeping the old name, which is shorter, and the
vma parameter is hint enough to me that the mempolicy will be respected.
That said, I did notice we also have alloc_buddy_hugetlb_folio_with_mpol(),
so IMHO we could remove that _with_mpol instead, but I don't think that's a
huge deal as of now.
> + struct vm_area_struct *vma,
> + unsigned long address)
> {
> struct folio *folio = NULL;
> struct mempolicy *mpol;
> @@ -1343,13 +1343,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
> nodemask_t *nodemask;
> int nid;
>
> - /*
> - * gbl_chg==1 means the allocation requires a new page that was not
> - * reserved before. Making sure there's at least one free page.
> - */
> - if (gbl_chg && !available_huge_pages(h))
> - goto err;
Note: I remember that when I was working on this series I also thought
about something like this, but then I found it easier to leave it as is for
now. More below..
> -
> gfp_mask = htlb_alloc_mask(h);
> nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
>
> @@ -1367,9 +1360,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
>
> mpol_cond_put(mpol);
> return folio;
> -
> -err:
> - return NULL;
> }
>
> /*
> @@ -2943,91 +2933,83 @@ int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn)
> return ret;
> }
>
> -typedef enum {
> - /*
> - * For either 0/1: we checked the per-vma resv map, and one resv
> - * count either can be reused (0), or an extra needed (1).
> - */
> - MAP_CHG_REUSE = 0,
> - MAP_CHG_NEEDED = 1,
> - /*
> - * Cannot use per-vma resv count can be used, hence a new resv
> - * count is enforced.
> - *
> - * NOTE: This is mostly identical to MAP_CHG_NEEDED, except
> - * that currently vma_needs_reservation() has an unwanted side
> - * effect to either use end() or commit() to complete the
> - * transaction. Hence it needs to differenciate from NEEDED.
> - */
> - MAP_CHG_ENFORCED = 2,
> -} map_chg_state;
> -
> /*
> - * NOTE! "cow_from_owner" represents a very hacky usage only used in CoW
> - * faults of hugetlb private mappings on top of a non-page-cache folio (in
> - * which case even if there's a private vma resv map it won't cover such
> - * allocation). New call sites should (probably) never set it to true!!
> - * When it's set, the allocation will bypass all vma level reservations.
> + * NOTE! "bypass_vma_reservations" represents a very niche usage, when CoW
> + * faults of hugetlb private mappings need to allocate a new page on top of a
> + * non-page-cache folio. In this situation, even if there's a private vma resv
> + * map, the resv map must be bypassed. New call sites should (probably) never
> + * set bypass_vma_reservations to true!!
> */
> struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> - unsigned long addr, bool cow_from_owner)
> + unsigned long addr, bool bypass_vma_reservations)
One thing to mention is that when I renamed it to cow_from_owner, I was
explicitly trying to "leak" an implementation detail into this function.
The reason is I wanted to leak it in a way that people will never use it! :)
It's the same as using some weird variable name like:
variable_you_never_want_to_set_unless_you_know_whats_happening
Otherwise it's not clear: if a new caller (consider the fork() early CoW
user) knows some allocation will definitely bypass the vma reservation,
should it set bypass_vma_reservations=true? The way I named it suggests
people never set it, which makes the flag easier to comprehend and also
keeps that special case contained (then one day maybe we can drop it.. if
e.g. we want to drop the parent-steal-child-page behavior somehow, which is
tricky).
PS: Miaohe already cleaned up one such use of it in 88ce3fef47f3
("hugetlbfs: remove meaningless variable avoid_reserve"); basically, if the
name is too general, people may try to fill something random into it.. and
that can make the special case too hard to follow.
> {
> struct hugepage_subpool *spool = subpool_vma(vma);
> struct hstate *h = hstate_vma(vma);
> struct folio *folio;
> - long retval, gbl_chg;
> - map_chg_state map_chg;
> int ret, idx;
> struct hugetlb_cgroup *h_cg = NULL;
> gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
> + bool vma_reservation_exists = false;
> + bool subpool_reservation_exists;
> + bool debit_subpool;
> + bool charge_cgroup_rsvd;
>
> idx = hstate_index(h);
>
> - /* Whether we need a separate per-vma reservation? */
> - if (cow_from_owner) {
> - /*
> - * Special case! Since it's a CoW on top of a reserved
> - * page, the private resv map doesn't count. So it cannot
> - * consume the per-vma resv map even if it's reserved.
> - */
> - map_chg = MAP_CHG_ENFORCED;
> - } else {
> + if (!bypass_vma_reservations) {
> /*
> * Examine the region/reserve map to determine if the process
> - * has a reservation for the page to be allocated. A return
> - * code of zero indicates a reservation exists (no change).
> - */
> - retval = vma_needs_reservation(h, vma, addr);
> - if (retval < 0)
> + * has a reservation for the page to be allocated and debit the
> + * reservation. If npages_req == 0, a reservation exists and is
> + * used. If npages_req > 0, a reservation has to be taken either
> + * from the subpool or global pool.
> + */
> + int npages_req = vma_needs_reservation(h, vma, addr);
> + if (npages_req < 0)
> return ERR_PTR(-ENOMEM);
> - map_chg = retval ? MAP_CHG_NEEDED : MAP_CHG_REUSE;
> +
> + vma_reservation_exists = npages_req == 0;
> }
>
> /*
> - * Whether we need a separate global reservation?
> + * If no vma reservation exists, debit the subpool.
> *
> + * Even if we were requested to bypass_vma_reservations, debit the
> + * subpool - the subpool still has to be charged for this allocation.
> + */
> + debit_subpool = !vma_reservation_exists || bypass_vma_reservations;
Things like this are what I wanted to avoid when cleaning this up; you'll
see that after my patch such "!AAA || BBB" calculations are all gone. To me
it's much easier to read with a tri-state compared to keeping such
calculations around. But now that I've read your patch, it also proves this
can be a very subjective comment.. so I can understand that we may prefer
different ways to code this for readability.
Personally, I slightly prefer not having special states for the spools,
simply because IMHO we shouldn't treat the spool and the global pool
separately: they should almost represent the same thing. It's just that
spools can exist for some inodes, so we try to fetch folios from there
first - in which case it can either have max_pages throttling per-mount
allocations, or min_pages which means we could have pre-reserved folios
already, without asking for one in the global pool.
But I don't really have a strong opinion on this. Again, pretty much
personal preference.. and I respect your preference too.
> +
> + /*
> * Processes that did not create the mapping will have no
> * reserves as indicated by the region/reserve map. Check
> * that the allocation will not exceed the subpool limit.
> * Or if it can get one from the pool reservation directly.
> */
> - if (map_chg) {
> - gbl_chg = hugepage_subpool_get_pages(spool, 1);
> - if (gbl_chg < 0)
> + if (debit_subpool) {
> + int npages_req = hugepage_subpool_get_pages(spool, 1);
> + if (npages_req < 0)
> goto out_end_reservation;
> - } else {
> +
> /*
> - * If we have the vma reservation ready, no need for extra
> - * global reservation.
> - */
> - gbl_chg = 0;
> + * npages_req == 0 indicates a reservation exists for the
> + * allocation in the subpool and can be used. npages_req > 0
> + * indicates that a reservation must be taken from the global
> + * pool.
> + */
> + subpool_reservation_exists = npages_req == 0;
> + } else {
> + /* A vma reservation implies having a subpool reservation. */
> + subpool_reservation_exists = vma_reservation_exists;
> }
>
> /*
> - * If this allocation is not consuming a per-vma reservation,
> - * charge the hugetlb cgroup now.
> + * If no vma reservation exists, charge the cgroup's reserved quota.
> + *
> + * Even if we were requested to bypass_vma_reservations, the cgroup
> + * still has to be charged for this allocation.
> */
> - if (map_chg) {
> + charge_cgroup_rsvd = !vma_reservation_exists || bypass_vma_reservations;
> + if (charge_cgroup_rsvd) {
> ret = hugetlb_cgroup_charge_cgroup_rsvd(
> idx, pages_per_huge_page(h), &h_cg);
> if (ret)
> @@ -3039,12 +3021,23 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> goto out_uncharge_cgroup_reservation;
>
> spin_lock_irq(&hugetlb_lock);
> - /*
> - * glb_chg is passed to indicate whether or not a page must be taken
> - * from the global free pool (global change). gbl_chg == 0 indicates
> - * a reservation exists for the allocation.
> - */
> - folio = dequeue_hugetlb_folio_vma(h, vma, addr, gbl_chg);
> +
> + if (subpool_reservation_exists) {
> + folio = dequeue_hugetlb_folio_with_mpol(h, vma, addr);
> + } else {
> + /*
> + * Since no subpool_reservation_exists, the allocation requires
> + * a new page that was not reserved before. Only dequeue if
> + * there are available pages.
> + */
> + if (available_huge_pages(h)) {
> + folio = dequeue_hugetlb_folio_with_mpol(h, vma, addr);
[1]
> + } else {
> + folio = NULL;
> + /* Fallthrough to allocate a new page. */
> + }
> + }
So this could be why I haven't yet moved gbl_chg out of
dequeue_hugetlb_folio(): the condition check becomes less obvious, like
here.
Meanwhile, I think it's always better to sanity check folio != NULL at [1],
but then, since we need to handle folio == NULL anyway, maybe it's easier
to do what the old code did and always be prepared for folio == NULL for
whatever reason (e.g. dequeue_hugetlb_folio can change in the future so
that it returns NULL for more reasons).
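Roughly the shape I'd expect then (just a sketch reusing the names from
your patch, not a proposed hunk):

	folio = NULL;
	if (subpool_reservation_exists || available_huge_pages(h))
		folio = dequeue_hugetlb_folio_with_mpol(h, vma, addr);

	/* Always be prepared for folio == NULL, whatever the reason. */
	if (!folio) {
		spin_unlock_irq(&hugetlb_lock);
		folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, addr);
	}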
> +
> if (!folio) {
> spin_unlock_irq(&hugetlb_lock);
> folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, addr);
> @@ -3057,19 +3050,17 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> }
>
> /*
> - * Either dequeued or buddy-allocated folio needs to add special
> - * mark to the folio when it consumes a global reservation.
> + * If subpool_reservation_exists (and is used for this allocation),
> + * decrement resv_huge_pages to indicate that a reservation was used.
> */
> - if (!gbl_chg) {
> + if (subpool_reservation_exists) {
> folio_set_hugetlb_restore_reserve(folio);
> h->resv_huge_pages--;
> }
>
> hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, folio);
> - /* If allocation is not consuming a reservation, also store the
> - * hugetlb_cgroup pointer on the page.
> - */
> - if (map_chg) {
> +
> + if (charge_cgroup_rsvd) {
> hugetlb_cgroup_commit_charge_rsvd(idx, pages_per_huge_page(h),
> h_cg, folio);
> }
> @@ -3078,25 +3069,30 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>
> hugetlb_set_folio_subpool(folio, spool);
>
> - if (map_chg != MAP_CHG_ENFORCED) {
> - /* commit() is only needed if the map_chg is not enforced */
> - retval = vma_commit_reservation(h, vma, addr);
> + if (!bypass_vma_reservations) {
> + /*
> + * As long as vma reservations were not bypassed, we need to
> + * commit() to clear up any adds_in_progress in resv_map.
> + */
> + int ret = vma_commit_reservation(h, vma, addr);
> /*
> - * Check for possible race conditions. When it happens..
> - * The page was added to the reservation map between
> - * vma_needs_reservation and vma_commit_reservation.
> - * This indicates a race with hugetlb_reserve_pages.
> + * If there is a discrepancy in reservation status between the
> + * time of vma_needs_reservation() and vma_commit_reservation(),
> + * then the page must have been added to the reservation
> + * map between vma_needs_reservation() and
> + * vma_commit_reservation().
> + *
> * Adjust for the subpool count incremented above AND
> * in hugetlb_reserve_pages for the same page. Also,
> * the reservation count added in hugetlb_reserve_pages
> * no longer applies.
> */
> - if (unlikely(map_chg == MAP_CHG_NEEDED && retval == 0)) {
> + if (unlikely(!vma_reservation_exists && ret == 0)) {
> long rsv_adjust;
>
> rsv_adjust = hugepage_subpool_put_pages(spool, 1);
> hugetlb_acct_memory(h, -rsv_adjust);
> - if (map_chg) {
> + if (charge_cgroup_rsvd) {
> spin_lock_irq(&hugetlb_lock);
> hugetlb_cgroup_uncharge_folio_rsvd(
> hstate_index(h), pages_per_huge_page(h),
> @@ -3124,14 +3120,14 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
> out_uncharge_cgroup:
> hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
> out_uncharge_cgroup_reservation:
> - if (map_chg)
> + if (charge_cgroup_rsvd)
> hugetlb_cgroup_uncharge_cgroup_rsvd(idx, pages_per_huge_page(h),
> h_cg);
> out_subpool_put:
> - if (map_chg)
> + if (debit_subpool)
> hugepage_subpool_put_pages(spool, 1);
> out_end_reservation:
> - if (map_chg != MAP_CHG_ENFORCED)
> + if (!bypass_vma_reservations)
> vma_end_reservation(h, vma, addr);
> return ERR_PTR(-ENOSPC);
> }
> @@ -5900,6 +5896,12 @@ static vm_fault_t hugetlb_wp(struct folio *pagecache_folio,
> * be acquired again before returning to the caller, as expected.
> */
> spin_unlock(vmf->ptl);
> +
> + /*
> + * If this is a CoW from the owner of this page, we
> + * bypass_vma_reservations, since the reservation was already consumed
> + * when the hugetlb folio was first allocated before the fork happened.
> + */
> new_folio = alloc_hugetlb_folio(vma, vmf->address, cow_from_owner);
>
> if (IS_ERR(new_folio)) {
Other than that, I see there are quite a few name changes. I'm ok with the
changes that you feel are strongly helpful; in that case feel free to split
them into smaller pieces, either within your gmem series or posted
separately (which suggests a broader review).
What I'm more interested in, though, is how you'd move forward from the
current status quo and fit it with your gmem series. The hope is that this
series of mine can at least make moving gmem forward easier (by addressing
some of the major sources of vma references..) so as to drop the vma
implications from the allocations.
I remember that in your series you had a separate spool in gmem, and you
layered it in a way that bypasses quite a few things that
alloc_hugetlb_folio() would do upfront right now. I don't remember many
details beyond that - I still remember I left some cgroup questions for
you, but those were trivial details, and I still agree with your layering
of how hugetlb is allocated in general. So besides how you would like to
further clean this chunk up, the other part - linking this to your gmem
work - would be more important to me, along with whether it's acceptable to
upstream that gmem can work with this model: basically split hugetlbfs into
two chunks, and allow other modules to allocate from hugetlb pools on their
own.
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 3/7] mm/hugetlb: Rename avoid_reserve to cow_from_owner
2025-01-13 16:19 ` Peter Xu
@ 2025-01-16 10:15 ` Oscar Salvador
2025-01-16 14:26 ` Peter Xu
0 siblings, 1 reply; 16+ messages in thread
From: Oscar Salvador @ 2025-01-16 10:15 UTC (permalink / raw)
To: Peter Xu
Cc: linux-mm, linux-kernel, Breno Leitao, Rik van Riel, Muchun Song,
Naoya Horiguchi, Roman Gushchin, Ackerley Tng, Andrew Morton
On Mon, Jan 13, 2025 at 11:19:27AM -0500, Peter Xu wrote:
> Oscar,
>
> On Mon, Jan 13, 2025 at 12:20:34PM +0100, Oscar Salvador wrote:
> > On Tue, Jan 07, 2025 at 03:39:58PM -0500, Peter Xu wrote:
> > > The old name "avoid_reserve" can be too generic and can be used wrongly in
> > > the new call sites that want to allocate a hugetlb folio.
> > >
> > > It's confusing on two things: (1) whether one can opt-in to avoid global
> > > reservation, and (2) whether it should take more than one count.
> > >
> > > In reality, this flag is only used in an extremely hacky path, in an
> > > extremely hacky way in hugetlb CoW path only, and always use with 1 saying
> > > "skip global reservation". Rename the flag to avoid future abuse of this
> > > flag, making it a boolean so as to reflect its true representation that
> > > it's not a counter. To make it even harder to abuse, add a comment above
> > > the function to explain it.
> > >
> > > Signed-off-by: Peter Xu <peterx@redhat.com>
> >
> > I agree that the current name is quite misleading, and this patch
> > improves the situation substantially.
> > The only thing I am missing here is that the comment you added could be
> > more explanatory as to why new call sites do not want to make use of the
> > flag.
> >
> > IIRC, not using so, will bypass all vma level reservations as you
>
> s/not using/using/? Only using the flag (setting to true) will bypass vma,
> but I could have misunderstood this line..
Yes, sorry.
> > mentioned, which means that the child can get killed if the parent
> > makes use of the page, as it is the parent the only one that made a
> > reservation.
>
> The paragraph I added on top of alloc_hugetlb_folio() is trying to suggest
> nobody should set this to true in any new paths.
Yes, you are right.
I thought we could be more categorical, but the current comment seems fine.
> So far, the reservation path should have nothing relevant to stealing page
> on its own (if that is what you meant above..) - page stealing in hugetlb
> private is done separately within the unmap_ref_private() helper. Here the
> parent needs to bypass vma reservation because it must have consumed it
> with the folio installed in the pgtable (which is write-protected). That
> may or may not be relevant to page stealing, e.g. if global pool still has
> free page it doesn't need to affect child from using its hugetlb pages.
No, I kind of misinterpreted this.
Now, let me see if I get this straight:
1) parent maps a hugetlb page, but does not fault it in yet.
2) forks, child faults in the page
3) child doesn't have any reservation; when 'cow_from_owner' is set to true
we check whether we have a spare hugetlb page to satisfy that
4) parent faults in the page
5) we do not have spare hugetlb pages, so we 'steal' it from the child
with unmap_ref_private.
Is my understanding correct?
--
Oscar Salvador
SUSE Labs
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 3/7] mm/hugetlb: Rename avoid_reserve to cow_from_owner
2025-01-16 10:15 ` Oscar Salvador
@ 2025-01-16 14:26 ` Peter Xu
0 siblings, 0 replies; 16+ messages in thread
From: Peter Xu @ 2025-01-16 14:26 UTC (permalink / raw)
To: Oscar Salvador
Cc: linux-mm, linux-kernel, Breno Leitao, Rik van Riel, Muchun Song,
Naoya Horiguchi, Roman Gushchin, Ackerley Tng, Andrew Morton
Hi, Oscar,
On Thu, Jan 16, 2025 at 11:15:09AM +0100, Oscar Salvador wrote:
> Now, let me see if I get this straight:
>
> 1) parent maps a hugetlb page, but not yet fault in.
> 2) forks, child faults-in the page
Agreed until here.
> 3) child doesn't have any reservation, when 'cow_from_owner' set to true
> we check whether we have a spare hugetlb page to satisfy that
When the child faults in, it should trigger the hugetlb CoW fault, in which
case 'cow_from_owner' should always be set to false (rather than true),
because it's not a CoW from the owner (the child is not an owner). See the
check in the fault path:
if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
old_folio != pagecache_folio)
cow_from_owner = true;
Here I would expect that when the child faults, the first OWNER check fails.
> 4) parent faults in the page
> 5) we do not have spare hugetlb pages, so we 'steal' it from the child
> with unmap_ref_private.
Agreed on 4/5.
At step 5, the check above becomes true, hence this is the place where the
allocation will have cow_from_owner set to true.
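To spell it out against the check above (the annotations are mine; the
check itself is what the fault path already does):

	/*
	 * Child fault (step 2): HPAGE_RESV_OWNER is only set on the vma
	 * that owns the reservation, so the first condition fails and
	 * cow_from_owner stays false.
	 *
	 * Parent fault (step 5): the vma is the owner and old_folio is a
	 * private (non-page-cache) folio, so both conditions hold and
	 * cow_from_owner becomes true, bypassing the vma reservation.
	 */
	if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) &&
	    old_folio != pagecache_folio)
		cow_from_owner = true;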
The difference between steps 3 and 5 is also exactly why I renamed the
variable for the whole stack: it represents this special condition from the
top layer (the fault) down to the allocation layer, saying explicitly when
it should be set true (only "CoW", and only from the "owner", not the
child), rather than a very blurred idea of someone trying to avoid_reserve
for whatever reason. The hope is that it makes that niche path very clear
in the allocation code, and that it discourages any other user from using
this flag, which could be abused (and make the allocation path harder to
follow in general).
Thanks,
--
Peter Xu
^ permalink raw reply [flat|nested] 16+ messages in thread
Thread overview: 16+ messages
2025-01-07 20:39 [PATCH v2 0/7] mm/hugetlb: Refactor hugetlb allocation resv accounting Peter Xu
2025-01-07 20:39 ` [PATCH v2 1/7] mm/hugetlb: Fix avoid_reserve to allow taking folio from subpool Peter Xu
2025-01-13 11:02 ` Oscar Salvador
2025-01-07 20:39 ` [PATCH v2 2/7] mm/hugetlb: Stop using avoid_reserve flag in fork() Peter Xu
2025-01-13 11:05 ` Oscar Salvador
2025-01-07 20:39 ` [PATCH v2 3/7] mm/hugetlb: Rename avoid_reserve to cow_from_owner Peter Xu
2025-01-13 11:20 ` Oscar Salvador
2025-01-13 16:19 ` Peter Xu
2025-01-16 10:15 ` Oscar Salvador
2025-01-16 14:26 ` Peter Xu
2025-01-07 20:39 ` [PATCH v2 4/7] mm/hugetlb: Clean up map/global resv accounting when allocate Peter Xu
2025-01-13 22:57 ` Ackerley Tng
2025-01-14 18:25 ` Peter Xu
2025-01-07 20:40 ` [PATCH v2 5/7] mm/hugetlb: Simplify vma_has_reserves() Peter Xu
2025-01-07 20:40 ` [PATCH v2 6/7] mm/hugetlb: Drop vma_has_reserves() Peter Xu
2025-01-07 20:40 ` [PATCH v2 7/7] mm/hugetlb: Unify restore reserve accounting for new allocations Peter Xu