linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] hugetlb: force allocating surplus hugepages on mempolicy allowed nodes
@ 2024-06-21 19:00 Aristeu Rozanski
  2024-06-22  0:56 ` Andrew Morton
  0 siblings, 1 reply; 8+ messages in thread
From: Aristeu Rozanski @ 2024-06-21 19:00 UTC (permalink / raw)
  To: linux-mm; +Cc: Muchun Song, Andrew Morton

allowed_mems_nr() only accounts for the number of free hugepages in the nodes
the current process belongs to. In case one or more of the requested surplus
hugepages are allocated in a different node, the whole allocation will fail due
to allowed_mems_nr() returning a lower value.

So allocate surplus hugepages in one of the nodes the current process belongs to.

Easy way to reproduce this issue is to use a 2+ NUMA nodes system:

		# echo 0 >/proc/sys/vm/nr_hugepages
		# echo 1 >/proc/sys/vm/nr_overcommit_hugepages
		# numactl -m0 ./tools/testing/selftests/mm/map_hugetlb 2

will eventually fail when the hugepage ends up allocated in a different node.

Cc: Muchun Song <muchun.song@linux.dev>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Aristeu Rozanski <aris@redhat.com>

---
 mm/hugetlb.c |   46 +++++++++++++++++++++++++++-------------------
 1 file changed, 27 insertions(+), 19 deletions(-)

--- upstream.orig/mm/hugetlb.c	2024-06-20 13:42:25.699568114 -0400
+++ upstream/mm/hugetlb.c	2024-06-21 14:52:49.970798299 -0400
@@ -2618,6 +2618,23 @@ struct folio *alloc_hugetlb_folio_nodema
 	return alloc_migrate_hugetlb_folio(h, gfp_mask, preferred_nid, nmask);
 }
 
+static nodemask_t *policy_mbind_nodemask(gfp_t gfp)
+{
+#ifdef CONFIG_NUMA
+	struct mempolicy *mpol = get_task_policy(current);
+
+	/*
+	 * Only enforce MPOL_BIND policy which overlaps with cpuset policy
+	 * (from policy_nodemask) specifically for hugetlb case
+	 */
+	if (mpol->mode == MPOL_BIND &&
+		(apply_policy_zone(mpol, gfp_zone(gfp)) &&
+		 cpuset_nodemask_valid_mems_allowed(&mpol->nodes)))
+		return &mpol->nodes;
+#endif
+	return NULL;
+}
+
 /*
  * Increase the hugetlb pool such that it can accommodate a reservation
  * of size 'delta'.
@@ -2631,6 +2648,8 @@ static int gather_surplus_pages(struct h
 	long i;
 	long needed, allocated;
 	bool alloc_ok = true;
+	int node;
+	nodemask_t *mbind_nodemask = policy_mbind_nodemask(htlb_alloc_mask(h));
 
 	lockdep_assert_held(&hugetlb_lock);
 	needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
@@ -2645,8 +2664,14 @@ allocated = 0;
 retry:
 	spin_unlock_irq(&hugetlb_lock);
 	for (i = 0; i < needed; i++) {
-		folio = alloc_surplus_hugetlb_folio(h, htlb_alloc_mask(h),
-				NUMA_NO_NODE, NULL);
+		for_each_node_mask(node, cpuset_current_mems_allowed) {
+			if (!mbind_nodemask || node_isset(node, *mbind_nodemask)) {
+				folio = alloc_surplus_hugetlb_folio(h, htlb_alloc_mask(h),
+						node, NULL);
+				if (folio)
+					break;
+			}
+		}
 		if (!folio) {
 			alloc_ok = false;
 			break;
@@ -4876,23 +4901,6 @@ default_hstate_max_huge_pages = 0;
 }
 __setup("default_hugepagesz=", default_hugepagesz_setup);
 
-static nodemask_t *policy_mbind_nodemask(gfp_t gfp)
-{
-#ifdef CONFIG_NUMA
-	struct mempolicy *mpol = get_task_policy(current);
-
-	/*
-	 * Only enforce MPOL_BIND policy which overlaps with cpuset policy
-	 * (from policy_nodemask) specifically for hugetlb case
-	 */
-	if (mpol->mode == MPOL_BIND &&
-		(apply_policy_zone(mpol, gfp_zone(gfp)) &&
-		 cpuset_nodemask_valid_mems_allowed(&mpol->nodes)))
-		return &mpol->nodes;
-#endif
-	return NULL;
-}
-
 static unsigned int allowed_mems_nr(struct hstate *h)
 {
 	int node;



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] hugetlb: force allocating surplus hugepages on mempolicy allowed nodes
  2024-06-21 19:00 [PATCH] hugetlb: force allocating surplus hugepages on mempolicy allowed nodes Aristeu Rozanski
@ 2024-06-22  0:56 ` Andrew Morton
  2024-06-25 22:54   ` Andrew Morton
  2024-07-01 19:11   ` [PATCH] " Aristeu Rozanski
  0 siblings, 2 replies; 8+ messages in thread
From: Andrew Morton @ 2024-06-22  0:56 UTC (permalink / raw)
  To: Aristeu Rozanski; +Cc: linux-mm, Muchun Song

On Fri, 21 Jun 2024 15:00:50 -0400 Aristeu Rozanski <aris@redhat.com> wrote:

> allowed_mems_nr() only accounts for the number of free hugepages in the nodes
> the current process belongs to. In case one or more of the requested surplus
> hugepages are allocated in a different node, the whole allocation will fail due
> allowed_mems_nr() returning a lower value.
> 
> So allocate surplus hugepages in one of the nodes the current process belongs to.
> 
> Easy way to reproduce this issue is to use a 2+ NUMA nodes system:
> 
> 		# echo 0 >/proc/sys/vm/nr_hugepages
> 		# echo 1 >/proc/sys/vm/nr_overcommit_hugepages
> 		# numactl -m0 ./tools/testing/selftests/mm/map_hugetlb 2
> 
> will eventually fail when the hugepage ends up allocated in a different node.
> 

Should we backport this into -stable kernels?


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] hugetlb: force allocating surplus hugepages on mempolicy allowed nodes
  2024-06-22  0:56 ` Andrew Morton
@ 2024-06-25 22:54   ` Andrew Morton
  2024-07-01 19:12     ` Aristeu Rozanski
  2024-07-01 19:11   ` [PATCH] " Aristeu Rozanski
  1 sibling, 1 reply; 8+ messages in thread
From: Andrew Morton @ 2024-06-25 22:54 UTC (permalink / raw)
  To: Aristeu Rozanski, linux-mm, Muchun Song

On Fri, 21 Jun 2024 17:56:09 -0700 Andrew Morton <akpm@linux-foundation.org> wrote:

> On Fri, 21 Jun 2024 15:00:50 -0400 Aristeu Rozanski <aris@redhat.com> wrote:
> 
> > allowed_mems_nr() only accounts for the number of free hugepages in the nodes
> > the current process belongs to. In case one or more of the requested surplus
> > hugepages are allocated in a different node, the whole allocation will fail due
> > allowed_mems_nr() returning a lower value.
> > 
> > So allocate surplus hugepages in one of the nodes the current process belongs to.
> > 
> > Easy way to reproduce this issue is to use a 2+ NUMA nodes system:
> > 
> > 		# echo 0 >/proc/sys/vm/nr_hugepages
> > 		# echo 1 >/proc/sys/vm/nr_overcommit_hugepages
> > 		# numactl -m0 ./tools/testing/selftests/mm/map_hugetlb 2
> > 
> > will eventually fail when the hugepage ends up allocated in a different node.
> > 
> 
> Should we backport this into -stable kernels?

Please?

Aristeu, a description of the userspace-visible effects of the bug will
really help things along here.


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] hugetlb: force allocating surplus hugepages on mempolicy allowed nodes
  2024-06-22  0:56 ` Andrew Morton
  2024-06-25 22:54   ` Andrew Morton
@ 2024-07-01 19:11   ` Aristeu Rozanski
  1 sibling, 0 replies; 8+ messages in thread
From: Aristeu Rozanski @ 2024-07-01 19:11 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Aristeu Rozanski, linux-mm, Muchun Song

Hi Andrew,

On Fri, Jun 21, 2024 at 05:56:09PM -0700, Andrew Morton wrote:
> Should we backport this into -stable kernels?

sorry, stupid email server.

Yes, that would qualify for -stable kernels.

-- 
Aristeu



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] hugetlb: force allocating surplus hugepages on mempolicy allowed nodes
  2024-06-25 22:54   ` Andrew Morton
@ 2024-07-01 19:12     ` Aristeu Rozanski
  2024-07-01 19:23       ` Vishal Moola
  0 siblings, 1 reply; 8+ messages in thread
From: Aristeu Rozanski @ 2024-07-01 19:12 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Aristeu Rozanski, linux-mm, Muchun Song

On Tue, Jun 25, 2024 at 03:54:38PM -0700, Andrew Morton wrote:
> Aristeu, a description of the userspace-visible effects of the bug will
> really help things along here.

Will update with the reproducer and resubmit.

-- 
Aristeu



^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] hugetlb: force allocating surplus hugepages on mempolicy allowed nodes
  2024-07-01 19:12     ` Aristeu Rozanski
@ 2024-07-01 19:23       ` Vishal Moola
  2024-07-01 19:29         ` Aristeu Rozanski
  2024-07-01 21:23         ` [PATCH v2] " Aristeu Rozanski
  0 siblings, 2 replies; 8+ messages in thread
From: Vishal Moola @ 2024-07-01 19:23 UTC (permalink / raw)
  To: Aristeu Rozanski; +Cc: Andrew Morton, Aristeu Rozanski, linux-mm, Muchun Song

On Mon, Jul 01, 2024 at 03:12:07PM -0400, Aristeu Rozanski wrote:
> On Tue, Jun 25, 2024 at 03:54:38PM -0700, Andrew Morton wrote:
> > Aristeu, a description of the userspace-visible effects of the bug will
> > really help things along here.
> 
> Will update with the reproducer and resubmit.

Could you look into this warning[1] and include the fix (if necessary)
when you resubmit?

[1] https://lore.kernel.org/linux-mm/668300a6.170a0220.cdc45.7372@mx.google.com/T/#t
> 
> 


^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [PATCH] hugetlb: force allocating surplus hugepages on mempolicy allowed nodes
  2024-07-01 19:23       ` Vishal Moola
@ 2024-07-01 19:29         ` Aristeu Rozanski
  2024-07-01 21:23         ` [PATCH v2] " Aristeu Rozanski
  1 sibling, 0 replies; 8+ messages in thread
From: Aristeu Rozanski @ 2024-07-01 19:29 UTC (permalink / raw)
  To: Vishal Moola; +Cc: Andrew Morton, Aristeu Rozanski, linux-mm, Muchun Song

On Mon, Jul 01, 2024 at 12:23:52PM -0700, Vishal Moola wrote:
> Could you look into this warning[1] and include the fix (if necessary)
> when you resubmit?
> 
> [1] https://lore.kernel.org/linux-mm/668300a6.170a0220.cdc45.7372@mx.google.com/T/#t

Yep, just figured out where all the replies went and noticed that one too.
Will get everything fixed on the next submission.

Again, my apologies.

-- 
Aristeu



^ permalink raw reply	[flat|nested] 8+ messages in thread

* [PATCH v2] hugetlb: force allocating surplus hugepages on mempolicy allowed nodes
  2024-07-01 19:23       ` Vishal Moola
  2024-07-01 19:29         ` Aristeu Rozanski
@ 2024-07-01 21:23         ` Aristeu Rozanski
  1 sibling, 0 replies; 8+ messages in thread
From: Aristeu Rozanski @ 2024-07-01 21:23 UTC (permalink / raw)
  To: linux-mm; +Cc: Andrew Morton, Vishal Moola, Aristeu Rozanski, Muchun Song,
	stable

When trying to allocate a hugepage with no reserved ones free, it may
be allowed in case a number of overcommit hugepages was configured (using
/proc/sys/vm/nr_overcommit_hugepages) and that number wasn't reached. This
allows for a behavior of having extra hugepages allocated dynamically, if
there are resources for it. Some sysadmins even prefer not reserving any
hugepages and setting a big number of overcommit hugepages.

But while attempting to allocate overcommit hugepages in a multi node
system (either NUMA or mempolicy/cpuset) said allocations might randomly
fail even when there are resources available for the allocation.

This happens because allowed_mems_nr() only accounts for the number of free
hugepages in the nodes the current process belongs to, while the surplus
hugepage allocation is done so it can be satisfied from any node. In case one
or more of the requested surplus hugepages are allocated in a different node,
the whole allocation will fail due to allowed_mems_nr() returning a lower value.

So allocate surplus hugepages in one of the nodes the current process belongs to.

Easy way to reproduce this issue is to use a 2+ NUMA nodes system:

	# echo 0 >/proc/sys/vm/nr_hugepages
	# echo 1 >/proc/sys/vm/nr_overcommit_hugepages
	# numactl -m0 ./tools/testing/selftests/mm/map_hugetlb 2

Repeating the execution of map_hugetlb test application will eventually fail
when the hugepage ends up allocated in a different node.

v2: - attempt to make the description more clear
    - prevent uninitialized usage of folio in case current process isn't part of any
      nodes with memory

Cc: Vishal Moola <vishal.moola@gmail.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Aristeu Rozanski <aris@redhat.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: stable@vger.kernel.org
Signed-off-by: Aristeu Rozanski <aris@ruivo.org>

---
 mm/hugetlb.c |   47 ++++++++++++++++++++++++++++-------------------
 1 file changed, 28 insertions(+), 19 deletions(-)

--- upstream.orig/mm/hugetlb.c	2024-06-20 13:42:25.699568114 -0400
+++ upstream/mm/hugetlb.c	2024-07-01 16:48:53.693298053 -0400
@@ -2618,6 +2618,23 @@ struct folio *alloc_hugetlb_folio_nodema
 	return alloc_migrate_hugetlb_folio(h, gfp_mask, preferred_nid, nmask);
 }
 
+static nodemask_t *policy_mbind_nodemask(gfp_t gfp)
+{
+#ifdef CONFIG_NUMA
+	struct mempolicy *mpol = get_task_policy(current);
+
+	/*
+	 * Only enforce MPOL_BIND policy which overlaps with cpuset policy
+	 * (from policy_nodemask) specifically for hugetlb case
+	 */
+	if (mpol->mode == MPOL_BIND &&
+		(apply_policy_zone(mpol, gfp_zone(gfp)) &&
+		 cpuset_nodemask_valid_mems_allowed(&mpol->nodes)))
+		return &mpol->nodes;
+#endif
+	return NULL;
+}
+
 /*
  * Increase the hugetlb pool such that it can accommodate a reservation
  * of size 'delta'.
@@ -2631,6 +2648,8 @@ static int gather_surplus_pages(struct h
 	long i;
 	long needed, allocated;
 	bool alloc_ok = true;
+	int node;
+	nodemask_t *mbind_nodemask = policy_mbind_nodemask(htlb_alloc_mask(h));
 
 	lockdep_assert_held(&hugetlb_lock);
 	needed = (h->resv_huge_pages + delta) - h->free_huge_pages;
@@ -2645,8 +2664,15 @@ allocated = 0;
 retry:
 	spin_unlock_irq(&hugetlb_lock);
 	for (i = 0; i < needed; i++) {
-		folio = alloc_surplus_hugetlb_folio(h, htlb_alloc_mask(h),
-				NUMA_NO_NODE, NULL);
+		folio = NULL;
+		for_each_node_mask(node, cpuset_current_mems_allowed) {
+			if (!mbind_nodemask || node_isset(node, *mbind_nodemask)) {
+				folio = alloc_surplus_hugetlb_folio(h, htlb_alloc_mask(h),
+						node, NULL);
+				if (folio)
+					break;
+			}
+		}
 		if (!folio) {
 			alloc_ok = false;
 			break;
@@ -4876,23 +4902,6 @@ default_hstate_max_huge_pages = 0;
 }
 __setup("default_hugepagesz=", default_hugepagesz_setup);
 
-static nodemask_t *policy_mbind_nodemask(gfp_t gfp)
-{
-#ifdef CONFIG_NUMA
-	struct mempolicy *mpol = get_task_policy(current);
-
-	/*
-	 * Only enforce MPOL_BIND policy which overlaps with cpuset policy
-	 * (from policy_nodemask) specifically for hugetlb case
-	 */
-	if (mpol->mode == MPOL_BIND &&
-		(apply_policy_zone(mpol, gfp_zone(gfp)) &&
-		 cpuset_nodemask_valid_mems_allowed(&mpol->nodes)))
-		return &mpol->nodes;
-#endif
-	return NULL;
-}
-
 static unsigned int allowed_mems_nr(struct hstate *h)
 {
 	int node;


^ permalink raw reply	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2024-07-01 21:23 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-21 19:00 [PATCH] hugetlb: force allocating surplus hugepages on mempolicy allowed nodes Aristeu Rozanski
2024-06-22  0:56 ` Andrew Morton
2024-06-25 22:54   ` Andrew Morton
2024-07-01 19:12     ` Aristeu Rozanski
2024-07-01 19:23       ` Vishal Moola
2024-07-01 19:29         ` Aristeu Rozanski
2024-07-01 21:23         ` [PATCH v2] " Aristeu Rozanski
2024-07-01 19:11   ` [PATCH] " Aristeu Rozanski

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).