Linux-mm Archive on lore.kernel.org
* [PATCH v2 0/6] Open HugeTLB allocation routine for more generic use
@ 2026-05-06 15:54 Ackerley Tng via B4 Relay
  2026-05-06 15:54 ` [PATCH v2 1/6] mm: hugetlb: Consolidate interpretation of gbl_chg within alloc_hugetlb_folio() Ackerley Tng via B4 Relay
                   ` (6 more replies)
  0 siblings, 7 replies; 12+ messages in thread
From: Ackerley Tng via B4 Relay @ 2026-05-06 15:54 UTC (permalink / raw)
  To: Muchun Song, Oscar Salvador, David Hildenbrand, Andrew Morton,
	fvdl, jiaqiyan, joshua.hahnjy, jthoughton, mhocko, michael.roth,
	pasha.tatashin, pbonzini, peterx, pratyush, rick.p.edgecombe,
	rientjes, roman.gushchin, seanjc, shakeel.butt, shivankg,
	vannapurve, yan.y.zhao, Dan Williams, Jason Gunthorpe
  Cc: linux-mm, linux-kernel, Ackerley Tng

Hi,

The motivation for this patch series is guest_memfd, which would like
to use HugeTLB as a generic source of huge pages but not adopt
HugeTLB's reservation at mmap() time.

By refactoring alloc_hugetlb_folio() and some dependent functions,
there is now an option to allocate HugeTLB folios without providing a
VMA. Specifically, HugeTLB allocation used to be dependent on the VMA
to

1. Look up reservations in the resv_map
2. Get mpol, stored at vma->vm_policy

This refactoring provides hugetlb_alloc_folio(), which focuses on just
the allocation itself, and associated memory and HugeTLB charging
(cgroups). alloc_hugetlb_folio() still handles reservations in the
resv_map and subpools.
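
Purely for illustration (not code from this series or the guest_memfd
series): a sketch of how a VMA-less caller might use the new entry point,
based on the hugetlb_alloc_folio() signature introduced in patch 6. The
wrapper below and the way mpol/nid/nodemask are obtained are hypothetical
and up to the caller.

	/*
	 * Hypothetical VMA-less caller (illustration only): the caller
	 * resolves mpol/nid/nodemask by its own means and holds a
	 * reference on mpol; no resv_map reservation is involved.
	 */
	static struct folio *example_alloc_without_vma(struct hstate *h,
			struct hugepage_subpool *spool, struct mempolicy *mpol,
			int nid, nodemask_t *nodemask)
	{
		/*
		 * No per-vma reservation exists, so charge the hugetlb
		 * reservation cgroup and do not consume a global reservation.
		 * Returns the folio or ERR_PTR(-ENOSPC)/ERR_PTR(-ENOMEM).
		 */
		return hugetlb_alloc_folio(h, spool, mpol, nid, nodemask,
					   /* charge_hugetlb_cgroup_rsvd */ true,
					   /* use_global_reservation */ false);
	}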

Regarding naming, I'm definitely open to alternative names :) I chose
hugetlb_alloc_folio() because I'm seeing this function as a general
allocation function that is provided by the HugeTLB subsystem (hence
the hugetlb_ prefix). I'm intending for alloc_hugetlb_folio() to be
later refactored as a static function for use just by HugeTLB, and
HugeTLBfs should probably use hugetlb_alloc_folio() directly.

To see how hugetlb_alloc_folio() is used by guest_memfd, the most
recent patch series that uses this more generic HugeTLB allocation
routine is at [1], and a newer revision of that patch series is at
[2].

Independently of guest_memfd, I believe this change is useful in
simplifying alloc_hugetlb_folio(). alloc_hugetlb_folio() was so
coupled to a VMA that even HugeTLBfs allocates HugeTLB folios using a
pseudo-VMA.

Testing:

+ libhugetlbfs tests pass
+ ./tools/testing/selftests/mm/ksft_hugetlb.sh passes

Changes in this revision:

+ No longer reintroduces try-commit-cancel protocol for hugetlb's memcg charging.
    + "mm: memcontrol: eliminate the problem of dying memory cgroup
      for LRU folios" was merged, and memcg seems to be moving away
      from the try-commit-cancel protocol, with try_charge() no longer
      having any users [3].

[1] https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com/T/
[2] https://github.com/googleprodkernel/linux-cc/tree/wip-gmem-conversions-hugetlb-restructuring-12-08-25
[3] https://lore.kernel.org/all/bb35a69a-5be9-45f5-a557-1902487a1bc2@linux.dev/

---
Ackerley Tng (6):
      mm: hugetlb: Consolidate interpretation of gbl_chg within alloc_hugetlb_folio()
      mm: hugetlb: Move mpol interpretation out of alloc_buddy_hugetlb_folio_with_mpol()
      mm: hugetlb: Move mpol interpretation out of dequeue_hugetlb_folio_vma()
      mm: hugetlb: Use error variable in alloc_hugetlb_folio
      mm: hugetlb: Move mem_cgroup_charge_hugetlb() earlier in allocation
      mm: hugetlb: Refactor out hugetlb_alloc_folio()

 include/linux/hugetlb.h |   3 +
 mm/hugetlb.c            | 209 ++++++++++++++++++++++++++----------------------
 2 files changed, 117 insertions(+), 95 deletions(-)
---
base-commit: adc1e5c6203cf13fe05a1ead08edcb3d3a3baae8
change-id: 20260504-hugetlb-open-up-eaba80571b09

Best regards,
--
Ackerley Tng <ackerleytng@google.com>





* [PATCH v2 1/6] mm: hugetlb: Consolidate interpretation of gbl_chg within alloc_hugetlb_folio()
  2026-05-06 15:54 [PATCH v2 0/6] Open HugeTLB allocation routine for more generic use Ackerley Tng via B4 Relay
@ 2026-05-06 15:54 ` Ackerley Tng via B4 Relay
  2026-05-12  9:00   ` Oscar Salvador
  2026-05-06 15:54 ` [PATCH v2 2/6] mm: hugetlb: Move mpol interpretation out of alloc_buddy_hugetlb_folio_with_mpol() Ackerley Tng via B4 Relay
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 12+ messages in thread
From: Ackerley Tng via B4 Relay @ 2026-05-06 15:54 UTC (permalink / raw)
  To: Muchun Song, Oscar Salvador, David Hildenbrand, Andrew Morton,
	fvdl, jiaqiyan, joshua.hahnjy, jthoughton, mhocko, michael.roth,
	pasha.tatashin, pbonzini, peterx, pratyush, rick.p.edgecombe,
	rientjes, roman.gushchin, seanjc, shakeel.butt, shivankg,
	vannapurve, yan.y.zhao, Dan Williams, Jason Gunthorpe
  Cc: linux-mm, linux-kernel, Ackerley Tng

From: Ackerley Tng <ackerleytng@google.com>

The dequeue_hugetlb_folio_vma() function currently handles the gbl_chg
parameter to determine if a folio can be dequeued based on global page
availability. This leaks reservation-specific logic into the dequeueing
path.

Relocate this logic to alloc_hugetlb_folio() so that
dequeue_hugetlb_folio_vma() focuses solely on selecting and dequeuing a
folio. In alloc_hugetlb_folio(), only attempt to dequeue a folio if a
reservation exists (gbl_chg == 0) or if there are available huge pages in
the global pool.

No functional change intended.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: James Houghton <jthoughton@google.com>
---
 mm/hugetlb.c | 24 +++++++++---------------
 1 file changed, 9 insertions(+), 15 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f24bf49be047e..8be246b4e6134 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1336,7 +1336,7 @@ static unsigned long available_huge_pages(struct hstate *h)
 
 static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
 				struct vm_area_struct *vma,
-				unsigned long address, long gbl_chg)
+				unsigned long address)
 {
 	struct folio *folio = NULL;
 	struct mempolicy *mpol;
@@ -1344,13 +1344,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
 	nodemask_t *nodemask;
 	int nid;
 
-	/*
-	 * gbl_chg==1 means the allocation requires a new page that was not
-	 * reserved before.  Making sure there's at least one free page.
-	 */
-	if (gbl_chg && !available_huge_pages(h))
-		goto err;
-
 	gfp_mask = htlb_alloc_mask(h);
 	nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
 
@@ -1368,9 +1361,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
 
 	mpol_cond_put(mpol);
 	return folio;
-
-err:
-	return NULL;
 }
 
 #if defined(CONFIG_ARCH_HAS_GIGANTIC_PAGE) && defined(CONFIG_CONTIG_ALLOC)
@@ -2939,12 +2929,16 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 		goto out_uncharge_cgroup_reservation;
 
 	spin_lock_irq(&hugetlb_lock);
+
 	/*
-	 * glb_chg is passed to indicate whether or not a page must be taken
-	 * from the global free pool (global change).  gbl_chg == 0 indicates
-	 * a reservation exists for the allocation.
+	 * gbl_chg == 0 indicates a reservation exists for the allocation - so
+	 * try dequeuing a page. If there are available_huge_pages(), try using
+	 * them!
 	 */
-	folio = dequeue_hugetlb_folio_vma(h, vma, addr, gbl_chg);
+	folio = NULL;
+	if (!gbl_chg || available_huge_pages(h))
+		folio = dequeue_hugetlb_folio_vma(h, vma, addr);
+
 	if (!folio) {
 		spin_unlock_irq(&hugetlb_lock);
 		folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, addr);

-- 
2.54.0.545.g6539524ca2-goog





* [PATCH v2 2/6] mm: hugetlb: Move mpol interpretation out of alloc_buddy_hugetlb_folio_with_mpol()
  2026-05-06 15:54 [PATCH v2 0/6] Open HugeTLB allocation routine for more generic use Ackerley Tng via B4 Relay
  2026-05-06 15:54 ` [PATCH v2 1/6] mm: hugetlb: Consolidate interpretation of gbl_chg within alloc_hugetlb_folio() Ackerley Tng via B4 Relay
@ 2026-05-06 15:54 ` Ackerley Tng via B4 Relay
  2026-05-12 12:51   ` Oscar Salvador
  2026-05-06 15:54 ` [PATCH v2 3/6] mm: hugetlb: Move mpol interpretation out of dequeue_hugetlb_folio_vma() Ackerley Tng via B4 Relay
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 12+ messages in thread
From: Ackerley Tng via B4 Relay @ 2026-05-06 15:54 UTC (permalink / raw)
  To: Muchun Song, Oscar Salvador, David Hildenbrand, Andrew Morton,
	fvdl, jiaqiyan, joshua.hahnjy, jthoughton, mhocko, michael.roth,
	pasha.tatashin, pbonzini, peterx, pratyush, rick.p.edgecombe,
	rientjes, roman.gushchin, seanjc, shakeel.butt, shivankg,
	vannapurve, yan.y.zhao, Dan Williams, Jason Gunthorpe
  Cc: linux-mm, linux-kernel, Ackerley Tng

From: Ackerley Tng <ackerleytng@google.com>

Move memory policy interpretation out of
alloc_buddy_hugetlb_folio_with_mpol() and into alloc_hugetlb_folio() to
separate reading and interpretation of memory policy from actual
allocation.

This will later allow memory policy to be interpreted outside of the
process of allocating a hugetlb folio entirely. This opens doors for other
callers of the HugeTLB folio allocation function, such as guest_memfd,
where memory may not always be mapped and hence may not have an associated
vma.

No functional change intended.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: James Houghton <jthoughton@google.com>
---
 mm/hugetlb.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 8be246b4e6134..ea3bc405b3162 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2160,15 +2160,11 @@ static struct folio *alloc_migrate_hugetlb_folio(struct hstate *h, gfp_t gfp_mas
  */
 static
 struct folio *alloc_buddy_hugetlb_folio_with_mpol(struct hstate *h,
-		struct vm_area_struct *vma, unsigned long addr)
+		struct mempolicy *mpol, int nid, nodemask_t *nodemask)
 {
 	struct folio *folio = NULL;
-	struct mempolicy *mpol;
 	gfp_t gfp_mask = htlb_alloc_mask(h);
-	int nid;
-	nodemask_t *nodemask;
 
-	nid = huge_node(vma, addr, gfp_mask, &mpol, &nodemask);
 	if (mpol_is_preferred_many(mpol)) {
 		gfp_t gfp = gfp_mask & ~(__GFP_DIRECT_RECLAIM | __GFP_NOFAIL);
 
@@ -2180,7 +2176,7 @@ struct folio *alloc_buddy_hugetlb_folio_with_mpol(struct hstate *h,
 
 	if (!folio)
 		folio = alloc_surplus_hugetlb_folio(h, gfp_mask, nid, nodemask);
-	mpol_cond_put(mpol);
+
 	return folio;
 }
 
@@ -2869,7 +2865,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	map_chg_state map_chg;
 	int ret, idx;
 	struct hugetlb_cgroup *h_cg = NULL;
-	gfp_t gfp = htlb_alloc_mask(h) | __GFP_RETRY_MAYFAIL;
+	gfp_t gfp = htlb_alloc_mask(h);
 
 	idx = hstate_index(h);
 
@@ -2940,8 +2936,14 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 		folio = dequeue_hugetlb_folio_vma(h, vma, addr);
 
 	if (!folio) {
+		struct mempolicy *mpol;
+		nodemask_t *nodemask;
+		int nid;
+
 		spin_unlock_irq(&hugetlb_lock);
-		folio = alloc_buddy_hugetlb_folio_with_mpol(h, vma, addr);
+		nid = huge_node(vma, addr, gfp, &mpol, &nodemask);
+		folio = alloc_buddy_hugetlb_folio_with_mpol(h, mpol, nid, nodemask);
+		mpol_cond_put(mpol);
 		if (!folio)
 			goto out_uncharge_cgroup;
 		spin_lock_irq(&hugetlb_lock);
@@ -2997,7 +2999,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 		}
 	}
 
-	ret = mem_cgroup_charge_hugetlb(folio, gfp);
+	ret = mem_cgroup_charge_hugetlb(folio, gfp | __GFP_RETRY_MAYFAIL);
 	/*
 	 * Unconditionally increment NR_HUGETLB here. If it turns out that
 	 * mem_cgroup_charge_hugetlb failed, then immediately free the page and

-- 
2.54.0.545.g6539524ca2-goog





* [PATCH v2 3/6] mm: hugetlb: Move mpol interpretation out of dequeue_hugetlb_folio_vma()
  2026-05-06 15:54 [PATCH v2 0/6] Open HugeTLB allocation routine for more generic use Ackerley Tng via B4 Relay
  2026-05-06 15:54 ` [PATCH v2 1/6] mm: hugetlb: Consolidate interpretation of gbl_chg within alloc_hugetlb_folio() Ackerley Tng via B4 Relay
  2026-05-06 15:54 ` [PATCH v2 2/6] mm: hugetlb: Move mpol interpretation out of alloc_buddy_hugetlb_folio_with_mpol() Ackerley Tng via B4 Relay
@ 2026-05-06 15:54 ` Ackerley Tng via B4 Relay
  2026-05-12 12:56   ` Oscar Salvador
  2026-05-06 15:54 ` [PATCH v2 4/6] mm: hugetlb: Use error variable in alloc_hugetlb_folio Ackerley Tng via B4 Relay
                   ` (3 subsequent siblings)
  6 siblings, 1 reply; 12+ messages in thread
From: Ackerley Tng via B4 Relay @ 2026-05-06 15:54 UTC (permalink / raw)
  To: Muchun Song, Oscar Salvador, David Hildenbrand, Andrew Morton,
	fvdl, jiaqiyan, joshua.hahnjy, jthoughton, mhocko, michael.roth,
	pasha.tatashin, pbonzini, peterx, pratyush, rick.p.edgecombe,
	rientjes, roman.gushchin, seanjc, shakeel.butt, shivankg,
	vannapurve, yan.y.zhao, Dan Williams, Jason Gunthorpe
  Cc: linux-mm, linux-kernel, Ackerley Tng

From: Ackerley Tng <ackerleytng@google.com>

Move memory policy interpretation out of dequeue_hugetlb_folio_vma() and
into alloc_hugetlb_folio() to separate reading and interpretation of memory
policy from actual allocation.

Also rename dequeue_hugetlb_folio_vma() to
dequeue_hugetlb_folio_with_mpol() to remove association with vma and to
align with alloc_buddy_hugetlb_folio_with_mpol().

This will later allow memory policy to be interpreted outside of the
process of allocating a hugetlb folio entirely. This opens doors for other
callers of the HugeTLB folio allocation function, such as guest_memfd,
where memory may not always be mapped and hence may not have an associated
vma.

No functional change intended.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
Reviewed-by: James Houghton <jthoughton@google.com>
---
 mm/hugetlb.c | 34 +++++++++++++++-------------------
 1 file changed, 15 insertions(+), 19 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ea3bc405b3162..3395de4d0999a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1334,18 +1334,11 @@ static unsigned long available_huge_pages(struct hstate *h)
 	return h->free_huge_pages - h->resv_huge_pages;
 }
 
-static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
-				struct vm_area_struct *vma,
-				unsigned long address)
+static struct folio *dequeue_hugetlb_folio_with_mpol(struct hstate *h,
+		struct mempolicy *mpol, int nid, nodemask_t *nodemask)
 {
 	struct folio *folio = NULL;
-	struct mempolicy *mpol;
-	gfp_t gfp_mask;
-	nodemask_t *nodemask;
-	int nid;
-
-	gfp_mask = htlb_alloc_mask(h);
-	nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
+	gfp_t gfp_mask = htlb_alloc_mask(h);
 
 	if (mpol_is_preferred_many(mpol)) {
 		folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask,
@@ -1359,7 +1352,6 @@ static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
 		folio = dequeue_hugetlb_folio_nodemask(h, gfp_mask,
 							nid, nodemask);
 
-	mpol_cond_put(mpol);
 	return folio;
 }
 
@@ -2866,6 +2858,9 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	int ret, idx;
 	struct hugetlb_cgroup *h_cg = NULL;
 	gfp_t gfp = htlb_alloc_mask(h);
+	struct mempolicy *mpol;
+	nodemask_t *nodemask;
+	int nid;
 
 	idx = hstate_index(h);
 
@@ -2926,6 +2921,9 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 
 	spin_lock_irq(&hugetlb_lock);
 
+	/* Takes reference on mpol. */
+	nid = huge_node(vma, addr, gfp, &mpol, &nodemask);
+
 	/*
 	 * gbl_chg == 0 indicates a reservation exists for the allocation - so
 	 * try dequeuing a page. If there are available_huge_pages(), try using
@@ -2933,25 +2931,23 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	 */
 	folio = NULL;
 	if (!gbl_chg || available_huge_pages(h))
-		folio = dequeue_hugetlb_folio_vma(h, vma, addr);
+		folio = dequeue_hugetlb_folio_with_mpol(h, mpol, nid, nodemask);
 
 	if (!folio) {
-		struct mempolicy *mpol;
-		nodemask_t *nodemask;
-		int nid;
-
 		spin_unlock_irq(&hugetlb_lock);
-		nid = huge_node(vma, addr, gfp, &mpol, &nodemask);
 		folio = alloc_buddy_hugetlb_folio_with_mpol(h, mpol, nid, nodemask);
-		mpol_cond_put(mpol);
-		if (!folio)
+		if (!folio) {
+			mpol_cond_put(mpol);
 			goto out_uncharge_cgroup;
+		}
 		spin_lock_irq(&hugetlb_lock);
 		list_add(&folio->lru, &h->hugepage_activelist);
 		folio_ref_unfreeze(folio, 1);
 		/* Fall through */
 	}
 
+	mpol_cond_put(mpol);
+
 	/*
 	 * Either dequeued or buddy-allocated folio needs to add special
 	 * mark to the folio when it consumes a global reservation.

-- 
2.54.0.545.g6539524ca2-goog





* [PATCH v2 4/6] mm: hugetlb: Use error variable in alloc_hugetlb_folio
  2026-05-06 15:54 [PATCH v2 0/6] Open HugeTLB allocation routine for more generic use Ackerley Tng via B4 Relay
                   ` (2 preceding siblings ...)
  2026-05-06 15:54 ` [PATCH v2 3/6] mm: hugetlb: Move mpol interpretation out of dequeue_hugetlb_folio_vma() Ackerley Tng via B4 Relay
@ 2026-05-06 15:54 ` Ackerley Tng via B4 Relay
  2026-05-06 15:54 ` [PATCH v2 5/6] mm: hugetlb: Move mem_cgroup_charge_hugetlb() earlier in allocation Ackerley Tng via B4 Relay
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 12+ messages in thread
From: Ackerley Tng via B4 Relay @ 2026-05-06 15:54 UTC (permalink / raw)
  To: Muchun Song, Oscar Salvador, David Hildenbrand, Andrew Morton,
	fvdl, jiaqiyan, joshua.hahnjy, jthoughton, mhocko, michael.roth,
	pasha.tatashin, pbonzini, peterx, pratyush, rick.p.edgecombe,
	rientjes, roman.gushchin, seanjc, shakeel.butt, shivankg,
	vannapurve, yan.y.zhao, Dan Williams, Jason Gunthorpe
  Cc: linux-mm, linux-kernel, Ackerley Tng

From: Ackerley Tng <ackerleytng@google.com>

Refactor alloc_hugetlb_folio to use a local variable for returning error
codes. Instead of returning ERR_PTR(-ENOSPC) at the end of the error
path, assign -ENOSPC to a return variable at each failure point and
return that variable at the end.

This allows the cleanup goto targets to be used with other errors in a
later patch.

No functional change intended.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 mm/hugetlb.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3395de4d0999a..68c21305fc86a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2894,8 +2894,10 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	 */
 	if (map_chg) {
 		gbl_chg = hugepage_subpool_get_pages(spool, 1);
-		if (gbl_chg < 0)
+		if (gbl_chg < 0) {
+			ret = -ENOSPC;
 			goto out_end_reservation;
+		}
 	} else {
 		/*
 		 * If we have the vma reservation ready, no need for extra
@@ -2911,13 +2913,17 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	if (map_chg) {
 		ret = hugetlb_cgroup_charge_cgroup_rsvd(
 			idx, pages_per_huge_page(h), &h_cg);
-		if (ret)
+		if (ret) {
+			ret = -ENOSPC;
 			goto out_subpool_put;
+		}
 	}
 
 	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
-	if (ret)
+	if (ret) {
+		ret = -ENOSPC;
 		goto out_uncharge_cgroup_reservation;
+	}
 
 	spin_lock_irq(&hugetlb_lock);
 
@@ -2938,6 +2944,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 		folio = alloc_buddy_hugetlb_folio_with_mpol(h, mpol, nid, nodemask);
 		if (!folio) {
 			mpol_cond_put(mpol);
+			ret = -ENOSPC;
 			goto out_uncharge_cgroup;
 		}
 		spin_lock_irq(&hugetlb_lock);
@@ -3030,7 +3037,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 out_end_reservation:
 	if (map_chg != MAP_CHG_ENFORCED)
 		vma_end_reservation(h, vma, addr);
-	return ERR_PTR(-ENOSPC);
+	return ERR_PTR(ret);
 }
 
 static __init void *alloc_bootmem(struct hstate *h, int nid, bool node_exact)

-- 
2.54.0.545.g6539524ca2-goog





* [PATCH v2 5/6] mm: hugetlb: Move mem_cgroup_charge_hugetlb() earlier in allocation
  2026-05-06 15:54 [PATCH v2 0/6] Open HugeTLB allocation routine for more generic use Ackerley Tng via B4 Relay
                   ` (3 preceding siblings ...)
  2026-05-06 15:54 ` [PATCH v2 4/6] mm: hugetlb: Use error variable in alloc_hugetlb_folio Ackerley Tng via B4 Relay
@ 2026-05-06 15:54 ` Ackerley Tng via B4 Relay
  2026-05-06 15:54 ` [PATCH v2 6/6] mm: hugetlb: Refactor out hugetlb_alloc_folio() Ackerley Tng via B4 Relay
  2026-05-12 13:17 ` [PATCH v2 0/6] Open HugeTLB allocation routine for more generic use Oscar Salvador
  6 siblings, 0 replies; 12+ messages in thread
From: Ackerley Tng via B4 Relay @ 2026-05-06 15:54 UTC (permalink / raw)
  To: Muchun Song, Oscar Salvador, David Hildenbrand, Andrew Morton,
	fvdl, jiaqiyan, joshua.hahnjy, jthoughton, mhocko, michael.roth,
	pasha.tatashin, pbonzini, peterx, pratyush, rick.p.edgecombe,
	rientjes, roman.gushchin, seanjc, shakeel.butt, shivankg,
	vannapurve, yan.y.zhao, Dan Williams, Jason Gunthorpe
  Cc: linux-mm, linux-kernel, Ackerley Tng

From: Ackerley Tng <ackerleytng@google.com>

Move mem_cgroup_charge_hugetlb() earlier in the folio allocation
process. This change draws a cleaner line between memcg charging and the
subsequent hugetlb-specific reservation logic for VMAs and subpools.

While it would be ideal to make all accounting and reservations perfectly
symmetric, mem_cgroup_charge_hugetlb() is a complex operation that cannot
be performed under the hugetlb_lock. Moving the charge to this earlier
point ensures that memcg charging is handled before the code begins
manipulating subpool and VMA-specific state. These two types of accounting
will be separated in a future patch.

If mem_cgroup_charge_hugetlb() fails, the code now branches to
out_subpool_put to ensure the folio is freed and the subpool references are
handled correctly.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 mm/hugetlb.c | 31 ++++++++++++++++++-------------
 1 file changed, 18 insertions(+), 13 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 68c21305fc86a..4159b3565a9be 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2975,6 +2975,24 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 
 	spin_unlock_irq(&hugetlb_lock);
 
+	ret = mem_cgroup_charge_hugetlb(folio, gfp | __GFP_RETRY_MAYFAIL);
+	/*
+	 * Unconditionally increment NR_HUGETLB here. If it turns out that
+	 * mem_cgroup_charge_hugetlb failed, then immediately free the page and
+	 * decrement NR_HUGETLB.
+	 */
+	lruvec_stat_mod_folio(folio, NR_HUGETLB, pages_per_huge_page(h));
+
+	if (ret == -ENOMEM) {
+		free_huge_folio(folio);
+		/*
+		 * Skip uncharging hugetlb_cgroup since the charges
+		 * were committed to the folio and freeing the folio
+		 * would have cleared those up.
+		 */
+		goto out_subpool_put;
+	}
+
 	hugetlb_set_folio_subpool(folio, spool);
 
 	if (map_chg != MAP_CHG_ENFORCED) {
@@ -3002,19 +3020,6 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 		}
 	}
 
-	ret = mem_cgroup_charge_hugetlb(folio, gfp | __GFP_RETRY_MAYFAIL);
-	/*
-	 * Unconditionally increment NR_HUGETLB here. If it turns out that
-	 * mem_cgroup_charge_hugetlb failed, then immediately free the page and
-	 * decrement NR_HUGETLB.
-	 */
-	lruvec_stat_mod_folio(folio, NR_HUGETLB, pages_per_huge_page(h));
-
-	if (ret == -ENOMEM) {
-		free_huge_folio(folio);
-		return ERR_PTR(-ENOMEM);
-	}
-
 	return folio;
 
 out_uncharge_cgroup:

-- 
2.54.0.545.g6539524ca2-goog





* [PATCH v2 6/6] mm: hugetlb: Refactor out hugetlb_alloc_folio()
  2026-05-06 15:54 [PATCH v2 0/6] Open HugeTLB allocation routine for more generic use Ackerley Tng via B4 Relay
                   ` (4 preceding siblings ...)
  2026-05-06 15:54 ` [PATCH v2 5/6] mm: hugetlb: Move mem_cgroup_charge_hugetlb() earlier in allocation Ackerley Tng via B4 Relay
@ 2026-05-06 15:54 ` Ackerley Tng via B4 Relay
  2026-05-12 13:25   ` Oscar Salvador
  2026-05-12 13:17 ` [PATCH v2 0/6] Open HugeTLB allocation routine for more generic use Oscar Salvador
  6 siblings, 1 reply; 12+ messages in thread
From: Ackerley Tng via B4 Relay @ 2026-05-06 15:54 UTC (permalink / raw)
  To: Muchun Song, Oscar Salvador, David Hildenbrand, Andrew Morton,
	fvdl, jiaqiyan, joshua.hahnjy, jthoughton, mhocko, michael.roth,
	pasha.tatashin, pbonzini, peterx, pratyush, rick.p.edgecombe,
	rientjes, roman.gushchin, seanjc, shakeel.butt, shivankg,
	vannapurve, yan.y.zhao, Dan Williams, Jason Gunthorpe
  Cc: linux-mm, linux-kernel, Ackerley Tng

From: Ackerley Tng <ackerleytng@google.com>

Refactor out hugetlb_alloc_folio() from alloc_hugetlb_folio(), which
handles allocation of a folio and memory and HugeTLB charging to cgroups.

This refactoring decouples HugeTLB page allocation from VMAs. Previously,
allocation was coupled to VMAs in that:

1. Reservations (as in resv_map) are stored in the vma
2. mpol is stored at vma->vm_policy
3. A vma must be used for allocation even if the pages are not meant to be
   used by the host process.

Without this coupling, VMAs are no longer a requirement for
allocation. This opens up the allocation routine for usage without VMAs,
which will allow guest_memfd to use HugeTLB as a more generic allocator of
huge pages, since guest_memfd memory may not have any associated VMAs by
design. In addition, direct allocations from HugeTLB could possibly be
refactored to avoid the use of a pseudo-VMA.

Also, this decouples HugeTLB page allocation from HugeTLBfs, where the
subpool is stored at the fs mount. This is also a requirement for
guest_memfd, where the plan is to have a subpool created per-fd and stored
on the inode.

No functional change intended.

Signed-off-by: Ackerley Tng <ackerleytng@google.com>
---
 include/linux/hugetlb.h |   3 +
 mm/hugetlb.c            | 179 ++++++++++++++++++++++++++----------------------
 2 files changed, 100 insertions(+), 82 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 93418625d3c5f..ec205d8580885 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -705,6 +705,9 @@ bool hugetlb_bootmem_page_zones_valid(int nid, struct huge_bootmem_page *m);
 int isolate_or_dissolve_huge_folio(struct folio *folio, struct list_head *list);
 int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn);
 void wait_for_freed_hugetlb_folios(void);
+struct folio *hugetlb_alloc_folio(struct hstate *h, struct hugepage_subpool *spool,
+		struct mempolicy *mpol, int nid, nodemask_t *nodemask,
+		bool charge_hugetlb_cgroup_rsvd, bool use_global_reservation);
 struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 				unsigned long addr, bool cow_from_owner);
 struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 4159b3565a9be..a1c5b94e52e0a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2821,6 +2821,88 @@ void wait_for_freed_hugetlb_folios(void)
 	flush_work(&free_hpage_work);
 }
 
+struct folio *hugetlb_alloc_folio(struct hstate *h, struct hugepage_subpool *spool,
+		struct mempolicy *mpol, int nid, nodemask_t *nodemask,
+		bool charge_hugetlb_cgroup_rsvd, bool use_global_reservation)
+{
+	size_t nr_pages = pages_per_huge_page(h);
+	struct hugetlb_cgroup *h_cg = NULL;
+	gfp_t gfp = htlb_alloc_mask(h);
+	int idx = hstate_index(h);
+	struct folio *folio;
+	int ret;
+
+	if (charge_hugetlb_cgroup_rsvd &&
+	    hugetlb_cgroup_charge_cgroup_rsvd(idx, nr_pages, &h_cg))
+		return ERR_PTR(-ENOSPC);
+
+	if (hugetlb_cgroup_charge_cgroup(idx, nr_pages, &h_cg)) {
+		ret = -ENOSPC;
+		goto err_uncharge_hugetlb_cgroup_rsvd;
+	}
+
+	spin_lock_irq(&hugetlb_lock);
+
+	folio = NULL;
+	if (use_global_reservation || available_huge_pages(h))
+		folio = dequeue_hugetlb_folio_with_mpol(h, mpol, nid, nodemask);
+
+	if (!folio) {
+		spin_unlock_irq(&hugetlb_lock);
+		folio = alloc_buddy_hugetlb_folio_with_mpol(h, mpol, nid, nodemask);
+		if (!folio) {
+			ret = -ENOSPC;
+			goto err_uncharge_hugetlb_cgroup;
+		}
+		spin_lock_irq(&hugetlb_lock);
+		list_add(&folio->lru, &h->hugepage_activelist);
+		folio_ref_unfreeze(folio, 1);
+		/* Fall through */
+	}
+
+	if (use_global_reservation) {
+		folio_set_hugetlb_restore_reserve(folio);
+		h->resv_huge_pages--;
+	}
+
+	hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, folio);
+
+	if (charge_hugetlb_cgroup_rsvd) {
+		hugetlb_cgroup_commit_charge_rsvd(idx, pages_per_huge_page(h),
+						  h_cg, folio);
+	}
+
+	spin_unlock_irq(&hugetlb_lock);
+
+	ret = mem_cgroup_charge_hugetlb(folio, gfp | __GFP_RETRY_MAYFAIL);
+	/*
+	 * Unconditionally increment NR_HUGETLB here because if
+	 * mem_cgroup_charge_hugetlb failed, freeing the page will
+	 * decrement NR_HUGETLB.
+	 */
+	lruvec_stat_mod_folio(folio, NR_HUGETLB, pages_per_huge_page(h));
+
+	if (ret == -ENOMEM) {
+		free_huge_folio(folio);
+		/*
+		 * Skip uncharging hugetlb_cgroup since the charges
+		 * were committed to the folio and freeing the folio
+		 * would have cleared those up.
+		 */
+		return ERR_PTR(ret);
+	}
+
+	return folio;
+
+ err_uncharge_hugetlb_cgroup:
+	hugetlb_cgroup_uncharge_cgroup(idx, nr_pages, h_cg);
+ err_uncharge_hugetlb_cgroup_rsvd:
+	if (charge_hugetlb_cgroup_rsvd)
+		hugetlb_cgroup_uncharge_cgroup_rsvd(idx, nr_pages, h_cg);
+
+	return ERR_PTR(ret);
+}
+
 typedef enum {
 	/*
 	 * For either 0/1: we checked the per-vma resv map, and one resv
@@ -2856,11 +2938,12 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	long retval, gbl_chg, gbl_reserve;
 	map_chg_state map_chg;
 	int ret, idx;
-	struct hugetlb_cgroup *h_cg = NULL;
 	gfp_t gfp = htlb_alloc_mask(h);
 	struct mempolicy *mpol;
 	nodemask_t *nodemask;
 	int nid;
+	bool charge_hugetlb_cgroup_rsvd;
+	bool global_reservation_exists;
 
 	idx = hstate_index(h);
 
@@ -2907,89 +2990,28 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	}
 
 	/*
-	 * If this allocation is not consuming a per-vma reservation,
-	 * charge the hugetlb cgroup now.
+	 * If allocation doesn't reuse a reservation in the resv_map,
+	 * charge for the reservation.
 	 */
-	if (map_chg) {
-		ret = hugetlb_cgroup_charge_cgroup_rsvd(
-			idx, pages_per_huge_page(h), &h_cg);
-		if (ret) {
-			ret = -ENOSPC;
-			goto out_subpool_put;
-		}
-	}
+	charge_hugetlb_cgroup_rsvd = map_chg != MAP_CHG_REUSE;
 
-	ret = hugetlb_cgroup_charge_cgroup(idx, pages_per_huge_page(h), &h_cg);
-	if (ret) {
-		ret = -ENOSPC;
-		goto out_uncharge_cgroup_reservation;
-	}
-
-	spin_lock_irq(&hugetlb_lock);
+	/*
+	 * gbl_chg == 0 indicates a reservation exists for this
+	 * allocation, so try to use it.
+	 */
+	global_reservation_exists = gbl_chg == 0;
 
 	/* Takes reference on mpol. */
 	nid = huge_node(vma, addr, gfp, &mpol, &nodemask);
 
-	/*
-	 * gbl_chg == 0 indicates a reservation exists for the allocation - so
-	 * try dequeuing a page. If there are available_huge_pages(), try using
-	 * them!
-	 */
-	folio = NULL;
-	if (!gbl_chg || available_huge_pages(h))
-		folio = dequeue_hugetlb_folio_with_mpol(h, mpol, nid, nodemask);
-
-	if (!folio) {
-		spin_unlock_irq(&hugetlb_lock);
-		folio = alloc_buddy_hugetlb_folio_with_mpol(h, mpol, nid, nodemask);
-		if (!folio) {
-			mpol_cond_put(mpol);
-			ret = -ENOSPC;
-			goto out_uncharge_cgroup;
-		}
-		spin_lock_irq(&hugetlb_lock);
-		list_add(&folio->lru, &h->hugepage_activelist);
-		folio_ref_unfreeze(folio, 1);
-		/* Fall through */
-	}
+	folio = hugetlb_alloc_folio(h, spool, mpol, nid, nodemask,
+				    charge_hugetlb_cgroup_rsvd,
+				    global_reservation_exists);
 
 	mpol_cond_put(mpol);
 
-	/*
-	 * Either dequeued or buddy-allocated folio needs to add special
-	 * mark to the folio when it consumes a global reservation.
-	 */
-	if (!gbl_chg) {
-		folio_set_hugetlb_restore_reserve(folio);
-		h->resv_huge_pages--;
-	}
-
-	hugetlb_cgroup_commit_charge(idx, pages_per_huge_page(h), h_cg, folio);
-	/* If allocation is not consuming a reservation, also store the
-	 * hugetlb_cgroup pointer on the page.
-	 */
-	if (map_chg) {
-		hugetlb_cgroup_commit_charge_rsvd(idx, pages_per_huge_page(h),
-						  h_cg, folio);
-	}
-
-	spin_unlock_irq(&hugetlb_lock);
-
-	ret = mem_cgroup_charge_hugetlb(folio, gfp | __GFP_RETRY_MAYFAIL);
-	/*
-	 * Unconditionally increment NR_HUGETLB here. If it turns out that
-	 * mem_cgroup_charge_hugetlb failed, then immediately free the page and
-	 * decrement NR_HUGETLB.
-	 */
-	lruvec_stat_mod_folio(folio, NR_HUGETLB, pages_per_huge_page(h));
-
-	if (ret == -ENOMEM) {
-		free_huge_folio(folio);
-		/*
-		 * Skip uncharging hugetlb_cgroup since the charges
-		 * were committed to the folio and freeing the folio
-		 * would have cleared those up.
-		 */
+	if (IS_ERR(folio)) {
+		ret = PTR_ERR(folio);
 		goto out_subpool_put;
 	}
 
@@ -3022,12 +3044,6 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 
 	return folio;
 
-out_uncharge_cgroup:
-	hugetlb_cgroup_uncharge_cgroup(idx, pages_per_huge_page(h), h_cg);
-out_uncharge_cgroup_reservation:
-	if (map_chg)
-		hugetlb_cgroup_uncharge_cgroup_rsvd(idx, pages_per_huge_page(h),
-						    h_cg);
 out_subpool_put:
 	/*
 	 * put page to subpool iff the quota of subpool's rsv_hpages is used
@@ -3038,7 +3054,6 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 		hugetlb_acct_memory(h, -gbl_reserve);
 	}
 
-
 out_end_reservation:
 	if (map_chg != MAP_CHG_ENFORCED)
 		vma_end_reservation(h, vma, addr);

-- 
2.54.0.545.g6539524ca2-goog





* Re: [PATCH v2 1/6] mm: hugetlb: Consolidate interpretation of gbl_chg within alloc_hugetlb_folio()
  2026-05-06 15:54 ` [PATCH v2 1/6] mm: hugetlb: Consolidate interpretation of gbl_chg within alloc_hugetlb_folio() Ackerley Tng via B4 Relay
@ 2026-05-12  9:00   ` Oscar Salvador
  0 siblings, 0 replies; 12+ messages in thread
From: Oscar Salvador @ 2026-05-12  9:00 UTC (permalink / raw)
  To: ackerleytng
  Cc: Muchun Song, David Hildenbrand, Andrew Morton, fvdl, jiaqiyan,
	joshua.hahnjy, jthoughton, mhocko, michael.roth, pasha.tatashin,
	pbonzini, peterx, pratyush, rick.p.edgecombe, rientjes,
	roman.gushchin, seanjc, shakeel.butt, shivankg, vannapurve,
	yan.y.zhao, Dan Williams, Jason Gunthorpe, linux-mm, linux-kernel

On Wed, May 06, 2026 at 08:54:37AM -0700, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
> 
> The dequeue_hugetlb_folio_vma() function currently handles the gbl_chg
> parameter to determine if a folio can be dequeued based on global page
> availability. This leaks reservation-specific logic into the dequeueing
> path.
> 
> Relocate this logic to alloc_hugetlb_folio() so that
> dequeue_hugetlb_folio_vma() focuses solely on selecting and dequeuing a
> folio. In alloc_hugetlb_folio(), only attempt to dequeue a folio if a
> reservation exists (gbl_chg == 0) or if there are available huge pages in
> the global pool.
> 
> No functional change intended.
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Reviewed-by: James Houghton <jthoughton@google.com>

Acked-by: Oscar Salvador <osalvador@suse.de>

I am ok with this but:

> ---
>  mm/hugetlb.c | 24 +++++++++---------------
>  1 file changed, 9 insertions(+), 15 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f24bf49be047e..8be246b4e6134 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
...
> @@ -2939,12 +2929,16 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  		goto out_uncharge_cgroup_reservation;
>  
>  	spin_lock_irq(&hugetlb_lock);
> +
>  	/*
> -	 * glb_chg is passed to indicate whether or not a page must be taken
> -	 * from the global free pool (global change).  gbl_chg == 0 indicates
> -	 * a reservation exists for the allocation.
> +	 * gbl_chg == 0 indicates a reservation exists for the allocation - so
> +	 * try dequeuing a page. If there are available_huge_pages(), try using
> +	 * them!

This comment is a bit obfuscated.

I'm missing something along the lines of "in case there is no
reservation, check whether we have available pages in the pool to
satisfy the request".

 

-- 
Oscar Salvador
SUSE Labs



* Re: [PATCH v2 2/6] mm: hugetlb: Move mpol interpretation out of alloc_buddy_hugetlb_folio_with_mpol()
  2026-05-06 15:54 ` [PATCH v2 2/6] mm: hugetlb: Move mpol interpretation out of alloc_buddy_hugetlb_folio_with_mpol() Ackerley Tng via B4 Relay
@ 2026-05-12 12:51   ` Oscar Salvador
  0 siblings, 0 replies; 12+ messages in thread
From: Oscar Salvador @ 2026-05-12 12:51 UTC (permalink / raw)
  To: ackerleytng
  Cc: Muchun Song, David Hildenbrand, Andrew Morton, fvdl, jiaqiyan,
	joshua.hahnjy, jthoughton, mhocko, michael.roth, pasha.tatashin,
	pbonzini, peterx, pratyush, rick.p.edgecombe, rientjes,
	roman.gushchin, seanjc, shakeel.butt, shivankg, vannapurve,
	yan.y.zhao, Dan Williams, Jason Gunthorpe, linux-mm, linux-kernel

On Wed, May 06, 2026 at 08:54:38AM -0700, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
> 
> Move memory policy interpretation out of
> alloc_buddy_hugetlb_folio_with_mpol() and into alloc_hugetlb_folio() to
> separate reading and interpretation of memory policy from actual
> allocation.
> 
> This will later allow memory policy to be interpreted outside of the
> process of allocating a hugetlb folio entirely. This opens doors for other
> callers of the HugeTLB folio allocation function, such as guest_memfd,
> where memory may not always be mapped and hence may not have an associated
> vma.
> 
> No functional change intended.
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Reviewed-by: James Houghton <jthoughton@google.com>

Acked-by: Oscar Salvador <osalvador@suse.de>

> ---
>  mm/hugetlb.c | 20 +++++++++++---------
>  1 file changed, 11 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 8be246b4e6134..ea3bc405b3162 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2160,15 +2160,11 @@ static struct folio *alloc_migrate_hugetlb_folio(struct hstate *h, gfp_t gfp_mas
>   */
>  static
>  struct folio *alloc_buddy_hugetlb_folio_with_mpol(struct hstate *h,
> -		struct vm_area_struct *vma, unsigned long addr)
> +		struct mempolicy *mpol, int nid, nodemask_t *nodemask)
>  {
>  	struct folio *folio = NULL;
> -	struct mempolicy *mpol;
>  	gfp_t gfp_mask = htlb_alloc_mask(h);

You already have gfp_mask in alloc_hugetlb_folio(), so maybe just
pass it in here, so it is clearer to the reader that these are the same
mask.
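
E.g. (just a sketch):

	static
	struct folio *alloc_buddy_hugetlb_folio_with_mpol(struct hstate *h,
			gfp_t gfp_mask, struct mempolicy *mpol,
			int nid, nodemask_t *nodemask);

	/* and at the call site in alloc_hugetlb_folio(): */
	folio = alloc_buddy_hugetlb_folio_with_mpol(h, gfp, mpol, nid, nodemask);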

 

-- 
Oscar Salvador
SUSE Labs



* Re: [PATCH v2 3/6] mm: hugetlb: Move mpol interpretation out of dequeue_hugetlb_folio_vma()
  2026-05-06 15:54 ` [PATCH v2 3/6] mm: hugetlb: Move mpol interpretation out of dequeue_hugetlb_folio_vma() Ackerley Tng via B4 Relay
@ 2026-05-12 12:56   ` Oscar Salvador
  0 siblings, 0 replies; 12+ messages in thread
From: Oscar Salvador @ 2026-05-12 12:56 UTC (permalink / raw)
  To: ackerleytng
  Cc: Muchun Song, David Hildenbrand, Andrew Morton, fvdl, jiaqiyan,
	joshua.hahnjy, jthoughton, mhocko, michael.roth, pasha.tatashin,
	pbonzini, peterx, pratyush, rick.p.edgecombe, rientjes,
	roman.gushchin, seanjc, shakeel.butt, shivankg, vannapurve,
	yan.y.zhao, Dan Williams, Jason Gunthorpe, linux-mm, linux-kernel

On Wed, May 06, 2026 at 08:54:39AM -0700, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
> 
> Move memory policy interpretation out of dequeue_hugetlb_folio_vma() and
> into alloc_hugetlb_folio() to separate reading and interpretation of memory
> policy from actual allocation.
> 
> Also rename dequeue_hugetlb_folio_vma() to
> dequeue_hugetlb_folio_with_mpol() to remove association with vma and to
> align with alloc_buddy_hugetlb_folio_with_mpol().
> 
> This will later allow memory policy to be interpreted outside of the
> process of allocating a hugetlb folio entirely. This opens doors for other
> callers of the HugeTLB folio allocation function, such as guest_memfd,
> where memory may not always be mapped and hence may not have an associated
> vma.
> 
> No functional change intended.
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>
> Reviewed-by: James Houghton <jthoughton@google.com>
> ---
>  mm/hugetlb.c | 34 +++++++++++++++-------------------
>  1 file changed, 15 insertions(+), 19 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index ea3bc405b3162..3395de4d0999a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1334,18 +1334,11 @@ static unsigned long available_huge_pages(struct hstate *h)
>  	return h->free_huge_pages - h->resv_huge_pages;
>  }
>  
> -static struct folio *dequeue_hugetlb_folio_vma(struct hstate *h,
> -				struct vm_area_struct *vma,
> -				unsigned long address)
> +static struct folio *dequeue_hugetlb_folio_with_mpol(struct hstate *h,
> +		struct mempolicy *mpol, int nid, nodemask_t *nodemask)
>  {
>  	struct folio *folio = NULL;
> -	struct mempolicy *mpol;
> -	gfp_t gfp_mask;
> -	nodemask_t *nodemask;
> -	int nid;
> -
> -	gfp_mask = htlb_alloc_mask(h);
> -	nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
> +	gfp_t gfp_mask = htlb_alloc_mask(h);

Same thing here, you already have the mask from the caller.

> @@ -2866,6 +2858,9 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  	int ret, idx;
>  	struct hugetlb_cgroup *h_cg = NULL;
>  	gfp_t gfp = htlb_alloc_mask(h);
> +	struct mempolicy *mpol;
> +	nodemask_t *nodemask;
> +	int nid;
>  
>  	idx = hstate_index(h);
>  
> @@ -2926,6 +2921,9 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  
>  	spin_lock_irq(&hugetlb_lock);
>  
> +	/* Takes reference on mpol. */
> +	nid = huge_node(vma, addr, gfp, &mpol, &nodemask);

I know that before the refactoring we called huge_node() with the lock
taken, but I think that was just because dequeue_hugetlb_folio_vma()
needed that.
Now, I think we can just have it out of the lock.
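
Something like this is what I have in mind (just a sketch, assuming
huge_node() really has no dependency on hugetlb_lock):

	/* Take the mpol reference without hugetlb_lock held. */
	nid = huge_node(vma, addr, gfp, &mpol, &nodemask);

	spin_lock_irq(&hugetlb_lock);

	folio = NULL;
	if (!gbl_chg || available_huge_pages(h))
		folio = dequeue_hugetlb_folio_with_mpol(h, mpol, nid, nodemask);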

Bottom line is we should really make crystal clear what needs to go with
or without the lock, because we have some history in hugetlb land of not
fully knowing what protects what and why.

If you think we still need to call it under the lock, I would state why,
but I do not think we do.

 

-- 
Oscar Salvador
SUSE Labs



* Re: [PATCH v2 0/6] Open HugeTLB allocation routine for more generic use
  2026-05-06 15:54 [PATCH v2 0/6] Open HugeTLB allocation routine for more generic use Ackerley Tng via B4 Relay
                   ` (5 preceding siblings ...)
  2026-05-06 15:54 ` [PATCH v2 6/6] mm: hugetlb: Refactor out hugetlb_alloc_folio() Ackerley Tng via B4 Relay
@ 2026-05-12 13:17 ` Oscar Salvador
  6 siblings, 0 replies; 12+ messages in thread
From: Oscar Salvador @ 2026-05-12 13:17 UTC (permalink / raw)
  To: ackerleytng
  Cc: Muchun Song, David Hildenbrand, Andrew Morton, fvdl, jiaqiyan,
	joshua.hahnjy, jthoughton, mhocko, michael.roth, pasha.tatashin,
	pbonzini, peterx, pratyush, rick.p.edgecombe, rientjes,
	roman.gushchin, seanjc, shakeel.butt, shivankg, vannapurve,
	yan.y.zhao, Dan Williams, Jason Gunthorpe, linux-mm, linux-kernel

On Wed, May 06, 2026 at 08:54:36AM -0700, Ackerley Tng via B4 Relay wrote:
> Hi,
> 
> The motivation for this patch series is guest_memfd, which would like
> to use HugeTLB as a generic source of huge pages but not adopt
> HugeTLB's reservation at mmap() time.
> 
> By refactoring alloc_hugetlb_folio() and some dependent functions,
> there is now an option to allocate HugeTLB folios without providing a
> VMA. Specifically, HugeTLB allocation used to be dependent on the VMA
> to
> 
> 1. Look up reservations in the resv_map
> 2. Get mpol, stored at vma->vm_policy
> 
> This refactoring provides hugetlb_alloc_folio(), which focuses on just
> the allocation itself, and associated memory and HugeTLB charging
> (cgroups). alloc_hugetlb_folio() still handles reservations in the
> resv_map and subpools.
> 
> Regarding naming, I'm definitely open to alternative names :) I chose
> hugetlb_alloc_folio() because I'm seeing this function as a general
> allocation function that is provided by the HugeTLB subsystem (hence
> the hugetlb_ prefix). I'm intending for alloc_hugetlb_folio() to be
> later refactored as a static function for use just by HugeTLB, and
> HugeTLBfs should probably use hugetlb_alloc_folio() directly.
> 
> To see how hugetlb_alloc_folio() is used by guest_memfd, the most
> recent patch series that uses this more generic HugeTLB allocation
> routine is at [1], and a newer revision of that patch series is at
> [2].

Would that be

https://lore.kernel.org/all/cover.1747264138.git.ackerleytng@google.com/T/#me2152fa2cc79d651ecea7a2bce8b57725fb57465

?


-- 
Oscar Salvador
SUSE Labs



* Re: [PATCH v2 6/6] mm: hugetlb: Refactor out hugetlb_alloc_folio()
  2026-05-06 15:54 ` [PATCH v2 6/6] mm: hugetlb: Refactor out hugetlb_alloc_folio() Ackerley Tng via B4 Relay
@ 2026-05-12 13:25   ` Oscar Salvador
  0 siblings, 0 replies; 12+ messages in thread
From: Oscar Salvador @ 2026-05-12 13:25 UTC (permalink / raw)
  To: ackerleytng
  Cc: Muchun Song, David Hildenbrand, Andrew Morton, fvdl, jiaqiyan,
	joshua.hahnjy, jthoughton, mhocko, michael.roth, pasha.tatashin,
	pbonzini, peterx, pratyush, rick.p.edgecombe, rientjes,
	roman.gushchin, seanjc, shakeel.butt, shivankg, vannapurve,
	yan.y.zhao, Dan Williams, Jason Gunthorpe, linux-mm, linux-kernel

On Wed, May 06, 2026 at 08:54:42AM -0700, Ackerley Tng via B4 Relay wrote:
> From: Ackerley Tng <ackerleytng@google.com>
> 
> Refactor out hugetlb_alloc_folio() from alloc_hugetlb_folio(), which
> handles allocation of a folio and memory and HugeTLB charging to cgroups.
> 
> This refactoring decouples HugeTLB page allocation from VMAs. Previously,
> allocation was coupled to VMAs in that:
> 
> 1. Reservations (as in resv_map) are stored in the vma
> 2. mpol is stored at vma->vm_policy
> 3. A vma must be used for allocation even if the pages are not meant to be
>    used by the host process.
> 
> Without this coupling, VMAs are no longer a requirement for
> allocation. This opens up the allocation routine for usage without VMAs,
> which will allow guest_memfd to use HugeTLB as a more generic allocator of
> huge pages, since guest_memfd memory may not have any associated VMAs by
> design. In addition, direct allocations from HugeTLB could possibly be
> refactored to avoid the use of a pseudo-VMA.
> 
> Also, this decouples HugeTLB page allocation from HugeTLBfs, where the
> subpool is stored at the fs mount. This is also a requirement for
> guest_memfd, where the plan is to have a subpool created per-fd and stored
> on the inode.
> 
> No functional change intended.
> 
> Signed-off-by: Ackerley Tng <ackerleytng@google.com>

I have yet to review this more thoroughly, but I have a comment below:

> ---
>  include/linux/hugetlb.h |   3 +
>  mm/hugetlb.c            | 179 ++++++++++++++++++++++++++----------------------
>  2 files changed, 100 insertions(+), 82 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 93418625d3c5f..ec205d8580885 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -705,6 +705,9 @@ bool hugetlb_bootmem_page_zones_valid(int nid, struct huge_bootmem_page *m);
>  int isolate_or_dissolve_huge_folio(struct folio *folio, struct list_head *list);
>  int replace_free_hugepage_folios(unsigned long start_pfn, unsigned long end_pfn);
>  void wait_for_freed_hugetlb_folios(void);
> +struct folio *hugetlb_alloc_folio(struct hstate *h, struct hugepage_subpool *spool,
> +		struct mempolicy *mpol, int nid, nodemask_t *nodemask,
> +		bool charge_hugetlb_cgroup_rsvd, bool use_global_reservation);
>  struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
>  				unsigned long addr, bool cow_from_owner);
>  struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 4159b3565a9be..a1c5b94e52e0a 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2821,6 +2821,88 @@ void wait_for_freed_hugetlb_folios(void)
>  	flush_work(&free_hpage_work);
>  }
>  
> +struct folio *hugetlb_alloc_folio(struct hstate *h, struct hugepage_subpool *spool,
> +		struct mempolicy *mpol, int nid, nodemask_t *nodemask,
> +		bool charge_hugetlb_cgroup_rsvd, bool use_global_reservation)

I think I would put that information into a context struct that we can
pass to hugetlb_alloc_folio(), otherwise this seems too overloaded, and
we may need to add more params in the future to tweak the allocation
even further. E.g.:

 struct hugetlb_alloc_ctxt {
   struct hstate *h;
   struct hugepage_subpool *spool;
   gfp_t gfp_mask;
   ...
 };

Maybe we can go even further and convert those booleans into action flags.

I have the feeling that, as is, this is quite ad-hoc code, and the thing is
that if we want to open hugetlb allocations up to the rest of the world, we
should make the API as generic as possible, so that we do not have to change
it whenever a new user pops up.
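
To make that concrete, a purely hypothetical shape (names and flags made
up, not a proposal):

	struct hugetlb_alloc_ctxt {
		struct hstate *h;
		struct hugepage_subpool *spool;
		gfp_t gfp_mask;
		struct mempolicy *mpol;
		nodemask_t *nodemask;
		int nid;
		/* e.g. HUGETLB_ALLOC_CHARGE_RSVD, HUGETLB_ALLOC_USE_GLOBAL_RESV */
		unsigned long flags;
	};

	struct folio *hugetlb_alloc_folio(const struct hugetlb_alloc_ctxt *ctxt);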

 

-- 
Oscar Salvador
SUSE Labs


