From: Andrey Alekhin <andrei.aleohin@gmail.com>
To: muchun.song@linux.dev
Cc: osalvador@suse.de, linux-mm@kvack.org,
Andrey Alekhin <andrei.aleohin@gmail.com>
Subject: [PATCH] mm: free surplus huge pages properly on NUMA systems
Date: Thu, 15 May 2025 22:13:27 +0300
Message-ID: <20250515191327.41089-1-andrei.aleohin@gmail.com>
== History ==
Wrong values of the huge page counters were detected on Red Hat 9.0 (Linux
5.14) when running the LTP test hugemmap10. Inspection of the Linux source
code showed that the problem is still present in Linux 6.14.
== Problem ==
The free_huge_folio() function does not properly free surplus huge pages on
NUMA systems. free_huge_folio() checks the surplus huge page counter only on
the current node (the node where the folio is allocated), but
gather_surplus_pages() can allocate surplus huge pages on any node.
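For reference, the surplus handling in free_huge_folio() currently looks
roughly like this (simplified sketch, not verbatim kernel code; nid is the
node the freed folio resides on):

	int nid = folio_nid(folio);
	...
	} else if (h->surplus_huge_pages_node[nid]) {
		/* surplus accounted on this node: destroy the page */
		remove_hugetlb_folio(h, folio, true);
		spin_unlock_irqrestore(&hugetlb_lock, flags);
		update_and_free_hugetlb_folio(h, folio, true);
	} else {
		/* no surplus accounted on this node: back to the free pool */
		arch_clear_hugetlb_flags(folio);
		enqueue_hugetlb_folio(h, folio);
		spin_unlock_irqrestore(&hugetlb_lock, flags);
	}

If the surplus page was accounted on a node other than the one the freed
folio sits on, the first branch is never taken and the page goes back to the
free pool, leaving the surplus counters inconsistent.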
The following sequence is possible on a NUMA system:

n - overall number of huge pages
f - number of free huge pages
s - number of surplus huge pages
r - number of reserved huge pages (global)

For each step, the per-node huge page counters are shown as:
   [before]
      |
   [after]

The process runs on node #1.

      node0           node1
1) addr1 = mmap(MAP_SHARED, ...) // 1 huge page is mmapped (cur_nid=1)
[n=2 f=2 s=0] [n=1 f=1 s=0] r=0
|
[n=2 f=2 s=0] [n=1 f=1 s=0] r=1
2) echo 1 > /proc/sys/vm/nr_hugepages (cur_nid=1)
[n=2 f=2 s=0] [n=1 f=1 s=0] r=1
|
[n=0 f=0 s=0] [n=1 f=1 s=0] r=1
3) addr2 = mmap(MAP_SHARED, ...) // 1 huge page is mmapped (cur_nid=1)
[n=0 f=0 s=0] [n=1 f=1 s=0] r=1
|
[n=1 f=1 s=1] [n=1 f=1 s=0] r=2
The new surplus huge page is allocated on node0, not on node1. In Linux 6.14
this is unlikely, but it is possible and legal.
4) write to the second page (touch)
[n=1 f=1 s=1] [n=1 f=1 s=0] r=2
|
[n=1 f=1 s=1] [n=1 f=0 s=0] r=1
The reserved page is mapped on node1.
5) munmap(addr2) // 1 huge page is unmapped
[n=1 f=1 s=1] [n=1 f=0 s=0] r=1
|
[n=1 f=1 s=1] [n=1 f=1 s=0] r=1
The huge page is freed, but it is not freed as a surplus page. The global
huge page counters in the system are now [nr_hugepages=2 free_huge_pages=2
surplus_hugepages=1], but they must be [nr_hugepages=1 free_huge_pages=1
surplus_hugepages=0].
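A rough userspace sketch of this sequence (hypothetical and simplified, not
the LTP hugemmap10 test itself; error handling is omitted, a 2 MiB huge page
size is assumed, the persistent pool is assumed to be pre-populated across
nodes as in the walkthrough, and the process is assumed to run on node #1,
e.g. under numactl):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	#define HPAGE_SIZE (2UL * 1024 * 1024)	/* assumes 2 MiB huge pages */

	static void set_nr_hugepages(const char *val)
	{
		int fd = open("/proc/sys/vm/nr_hugepages", O_WRONLY);

		if (fd >= 0) {
			write(fd, val, strlen(val));
			close(fd);
		}
	}

	int main(void)
	{
		/* 1) map one huge page; this only takes a reservation (r: 0 -> 1) */
		char *addr1 = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
				   MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

		/* 2) shrink the persistent pool below what is reserved */
		set_nr_hugepages("1");

		/*
		 * 3) the second mapping forces a surplus huge page to be
		 *    allocated, possibly on a different node than cur_nid
		 */
		char *addr2 = mmap(NULL, HPAGE_SIZE, PROT_READ | PROT_WRITE,
				   MAP_SHARED | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

		/* 4) fault the second mapping in (consumes a reservation) */
		addr2[0] = 1;

		/* 5) unmap it; the page should be freed as a surplus page */
		munmap(addr2, HPAGE_SIZE);

		/* compare nr_hugepages/free_hugepages/surplus counters here */
		munmap(addr1, HPAGE_SIZE);
		return 0;
	}

After step 5 the global counters (HugePages_Total, HugePages_Free and
HugePages_Surp in /proc/meminfo) show the mismatch described above.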
== Solution ==
Check the surplus huge page counters on all allowed nodes when a page is
freed in free_huge_folio(). This check guarantees that surplus huge pages
are always freed correctly if any are present in the system.
Signed-off-by: Andrey Alekhin <andrei.aleohin@gmail.com>
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6ea1be71aa42..2d38d12f4943 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1822,6 +1822,23 @@ struct hstate *size_to_hstate(unsigned long size)
return NULL;
}
+static nodemask_t *policy_mbind_nodemask(gfp_t gfp)
+{
+#ifdef CONFIG_NUMA
+ struct mempolicy *mpol = get_task_policy(current);
+
+ /*
+ * Only enforce MPOL_BIND policy which overlaps with cpuset policy
+ * (from policy_nodemask) specifically for hugetlb case
+ */
+ if (mpol->mode == MPOL_BIND &&
+ (apply_policy_zone(mpol, gfp_zone(gfp)) &&
+ cpuset_nodemask_valid_mems_allowed(&mpol->nodes)))
+ return &mpol->nodes;
+#endif
+ return NULL;
+}
+
void free_huge_folio(struct folio *folio)
{
/*
@@ -1833,6 +1850,8 @@ void free_huge_folio(struct folio *folio)
struct hugepage_subpool *spool = hugetlb_folio_subpool(folio);
bool restore_reserve;
unsigned long flags;
+ int node;
+ nodemask_t *mbind_nodemask, alloc_nodemask;
VM_BUG_ON_FOLIO(folio_ref_count(folio), folio);
VM_BUG_ON_FOLIO(folio_mapcount(folio), folio);
@@ -1883,6 +1902,25 @@ void free_huge_folio(struct folio *folio)
remove_hugetlb_folio(h, folio, true);
spin_unlock_irqrestore(&hugetlb_lock, flags);
update_and_free_hugetlb_folio(h, folio, true);
+ } else if (h->surplus_huge_pages) {
+ mbind_nodemask = policy_mbind_nodemask(htlb_alloc_mask(h));
+ if (mbind_nodemask)
+ nodes_and(alloc_nodemask, *mbind_nodemask,
+ cpuset_current_mems_allowed);
+ else
+ alloc_nodemask = cpuset_current_mems_allowed;
+
+ for_each_node_mask(node, alloc_nodemask) {
+ if (h->surplus_huge_pages_node[node]) {
+ h->surplus_huge_pages_node[node]--;
+ h->surplus_huge_pages--;
+ break;
+ }
+ }
+
+ remove_hugetlb_folio(h, folio, false);
+ spin_unlock_irqrestore(&hugetlb_lock, flags);
+ update_and_free_hugetlb_folio(h, folio, true);
} else {
arch_clear_hugetlb_flags(folio);
enqueue_hugetlb_folio(h, folio);
@@ -2389,23 +2427,6 @@ struct folio *alloc_hugetlb_folio_nodemask(struct hstate *h, int preferred_nid,
return alloc_migrate_hugetlb_folio(h, gfp_mask, preferred_nid, nmask);
}
-static nodemask_t *policy_mbind_nodemask(gfp_t gfp)
-{
-#ifdef CONFIG_NUMA
- struct mempolicy *mpol = get_task_policy(current);
-
- /*
- * Only enforce MPOL_BIND policy which overlaps with cpuset policy
- * (from policy_nodemask) specifically for hugetlb case
- */
- if (mpol->mode == MPOL_BIND &&
- (apply_policy_zone(mpol, gfp_zone(gfp)) &&
- cpuset_nodemask_valid_mems_allowed(&mpol->nodes)))
- return &mpol->nodes;
-#endif
- return NULL;
-}
-
/*
* Increase the hugetlb pool such that it can accommodate a reservation
* of size 'delta'.
--
2.43.0