* [RFC PATCH] mm: hugetlb: Fix incorrect fallback for subpool
@ 2025-03-25  6:16 Wupeng Ma
  2025-03-31 21:23 ` Joshua Hahn
  2025-04-09 13:58 ` Joshua Hahn
  0 siblings, 2 replies; 6+ messages in thread
From: Wupeng Ma @ 2025-03-25  6:16 UTC (permalink / raw)
  To: akpm, mike.kravetz, david; +Cc: muchun.song, mawupeng1, linux-mm, linux-kernel

During our testing with hugetlb subpool enabled, we observe that
hstate->resv_huge_pages may underflow into negative values. Root cause
analysis reveals a race condition in subpool reservation fallback handling
as follows:

hugetlb_reserve_pages()
    /* Attempt subpool reservation */
    gbl_reserve = hugepage_subpool_get_pages(spool, chg);

    /* Global reservation may fail after subpool allocation */
    if (hugetlb_acct_memory(h, gbl_reserve) < 0)
        goto out_put_pages;

out_put_pages:
    /* This incorrectly restores reservation to subpool */
    hugepage_subpool_put_pages(spool, chg);

When hugetlb_acct_memory() fails after subpool allocation, the current
implementation over-commits subpool reservations by returning the full
'chg' value instead of the actual allocated 'gbl_reserve' amount. This
discrepancy propagates to global reservations during subsequent releases,
eventually causing resv_huge_pages underflow.
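
As a concrete (hypothetical) example: with chg = 2 and a subpool whose
min-size reserve still holds 1 page, hugepage_subpool_get_pages() consumes
that reserved page and returns gbl_reserve = 1. If hugetlb_acct_memory(h, 1)
then fails, only chg - gbl_reserve = 1 page should be handed back to the
subpool; handing back the full chg = 2 credits the subpool with a page it
never provided, and that excess is later released against the global
reserve, eventually driving hstate->resv_huge_pages negative.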

This problem can be triggered easily with the following steps:
1. reserve hugepages for hugetlb allocation
2. mount hugetlbfs with min_size to enable the hugetlb subpool
3. allocate hugepages with two tasks (make sure the second fails due to
   an insufficient number of hugepages)
4. wait for a few seconds and repeat step 3, which will make
   hstate->resv_huge_pages go below zero.

To fix this problem, return the correct number of pages to the subpool
during the fallback after hugepage_subpool_get_pages() is called.

Fixes: 1c5ecae3a93f ("hugetlbfs: add minimum size accounting to subpools")
Signed-off-by: Wupeng Ma <mawupeng1@huawei.com>
---
 mm/hugetlb.c | 27 +++++++++++++++++++++------
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 318624c96584..dc4449592d00 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2987,7 +2987,7 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 	struct hugepage_subpool *spool = subpool_vma(vma);
 	struct hstate *h = hstate_vma(vma);
 	struct folio *folio;
-	long retval, gbl_chg;
+	long retval, gbl_chg, gbl_reserve;
 	map_chg_state map_chg;
 	int ret, idx;
 	struct hugetlb_cgroup *h_cg = NULL;
@@ -3140,8 +3140,15 @@ struct folio *alloc_hugetlb_folio(struct vm_area_struct *vma,
 		hugetlb_cgroup_uncharge_cgroup_rsvd(idx, pages_per_huge_page(h),
 						    h_cg);
 out_subpool_put:
-	if (map_chg)
-		hugepage_subpool_put_pages(spool, 1);
+	/*
+	 * Put the page back to the subpool iff the subpool's rsv_hpages
+	 * quota was consumed during hugepage_subpool_get_pages().
+	 */
+	if (map_chg && !gbl_chg) {
+		gbl_reserve = hugepage_subpool_put_pages(spool, 1);
+		hugetlb_acct_memory(h, -gbl_reserve);
+	}
+
 out_end_reservation:
 	if (map_chg != MAP_CHG_ENFORCED)
 		vma_end_reservation(h, vma, addr);
@@ -6949,7 +6956,7 @@ bool hugetlb_reserve_pages(struct inode *inode,
 					struct vm_area_struct *vma,
 					vm_flags_t vm_flags)
 {
-	long chg = -1, add = -1;
+	long chg = -1, add = -1, spool_resv, gbl_resv;
 	struct hstate *h = hstate_inode(inode);
 	struct hugepage_subpool *spool = subpool_inode(inode);
 	struct resv_map *resv_map;
@@ -7084,8 +7091,16 @@ bool hugetlb_reserve_pages(struct inode *inode,
 	return true;
 
 out_put_pages:
-	/* put back original number of pages, chg */
-	(void)hugepage_subpool_put_pages(spool, chg);
+	spool_resv = chg - gbl_reserve;
+	if (spool_resv) {
+		/* put the subpool's reservation back, chg - gbl_reserve */
+		gbl_resv = hugepage_subpool_put_pages(spool, spool_resv);
+		/*
+		 * The subpool's reserved pages cannot be put back due to
+		 * a race; return them to the hstate.
+		 */
+		hugetlb_acct_memory(h, -gbl_resv);
+	}
 out_uncharge_cgroup:
 	hugetlb_cgroup_uncharge_cgroup_rsvd(hstate_index(h),
 					    chg * pages_per_huge_page(h), h_cg);
-- 
2.43.0




* Re: [RFC PATCH] mm: hugetlb: Fix incorrect fallback for subpool
  2025-03-25  6:16 [RFC PATCH] mm: hugetlb: Fix incorrect fallback for subpool Wupeng Ma
@ 2025-03-31 21:23 ` Joshua Hahn
  2025-06-11 15:55   ` Ackerley Tng
  2025-04-09 13:58 ` Joshua Hahn
  1 sibling, 1 reply; 6+ messages in thread
From: Joshua Hahn @ 2025-03-31 21:23 UTC (permalink / raw)
  To: Wupeng Ma
  Cc: akpm, mike.kravetz, david, muchun.song, linux-mm, linux-kernel,
	kernel-team

On Tue, 25 Mar 2025 14:16:34 +0800 Wupeng Ma <mawupeng1@huawei.com> wrote:

> During our testing with hugetlb subpool enabled, we observe that
> hstate->resv_huge_pages may underflow into negative values. Root cause
> analysis reveals a race condition in subpool reservation fallback handling
> as follows:
> 
> hugetlb_reserve_pages()
>     /* Attempt subpool reservation */
>     gbl_reserve = hugepage_subpool_get_pages(spool, chg);
> 
>     /* Global reservation may fail after subpool allocation */
>     if (hugetlb_acct_memory(h, gbl_reserve) < 0)
>         goto out_put_pages;
> 
> out_put_pages:
>     /* This incorrectly restores reservation to subpool */
>     hugepage_subpool_put_pages(spool, chg);
> 
> When hugetlb_acct_memory() fails after subpool allocation, the current
> implementation over-commits subpool reservations by returning the full
> 'chg' value instead of the actual allocated 'gbl_reserve' amount. This
> discrepancy propagates to global reservations during subsequent releases,
> eventually causing resv_huge_pages underflow.
> 
> This problem can be triggered easily with the following steps:
> 1. reserve hugepages for hugetlb allocation
> 2. mount hugetlbfs with min_size to enable the hugetlb subpool
> 3. allocate hugepages with two tasks (make sure the second fails due to
>    an insufficient number of hugepages)
> 4. wait for a few seconds and repeat step 3, which will make
>    hstate->resv_huge_pages go below zero.
>
> To fix this problem, return the correct number of pages to the subpool
> during the fallback after hugepage_subpool_get_pages() is called.
> 
> Fixes: 1c5ecae3a93f ("hugetlbfs: add minimum size accounting to subpools")
> Signed-off-by: Wupeng Ma <mawupeng1@huawei.com>

Hi Wupeng,
Thank you for the fix! This is a problem that we've also seen happen in
our fleet at Meta. I was able to recreate the issue that you mentioned -- to
explicitly lay down the steps I used:

1. echo 1 > /proc/sys/vm/nr_hugepages
2. mkdir /mnt/hugetlb-pool
3. mount -t hugetlbfs -o min_size=2M none /mnt/hugetlb-pool
4. (./get_hugepage &) && (./get_hugepage &)
    # get_hugepage just opens a file in /mnt/hugetlb-pool and mmaps 2M into it.
5. sleep 3
6. (./get_hugepage &) && (./get_hugepage &)
7. cat /proc/meminfo | grep HugePages_Rsvd

... and (7) shows that HugePages_Rsvd has indeed underflowed to U64_MAX!

I've also verified that applying your fix and then re-running the reproducer
shows no underflow.

Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com>
Tested-by: Joshua Hahn <joshua.hahnjy@gmail.com>

Sent using hkml (https://github.com/sjp38/hackermail)




* Re: [RFC PATCH] mm: hugetlb: Fix incorrect fallback for subpool
  2025-03-25  6:16 [RFC PATCH] mm: hugetlb: Fix incorrect fallback for subpool Wupeng Ma
  2025-03-31 21:23 ` Joshua Hahn
@ 2025-04-09 13:58 ` Joshua Hahn
  2025-04-10  0:51   ` mawupeng
  1 sibling, 1 reply; 6+ messages in thread
From: Joshua Hahn @ 2025-04-09 13:58 UTC (permalink / raw)
  To: Wupeng Ma; +Cc: akpm, mike.kravetz, david, muchun.song, linux-mm, linux-kernel

On Tue, 25 Mar 2025 14:16:34 +0800 Wupeng Ma <mawupeng1@huawei.com> wrote:

> During our testing with hugetlb subpool enabled, we observe that
> hstate->resv_huge_pages may underflow into negative values. Root cause
> analysis reveals a race condition in subpool reservation fallback handling
> as follows:
> 
> hugetlb_reserve_pages()
>     /* Attempt subpool reservation */
>     gbl_reserve = hugepage_subpool_get_pages(spool, chg);
> 
>     /* Global reservation may fail after subpool allocation */
>     if (hugetlb_acct_memory(h, gbl_reserve) < 0)
>         goto out_put_pages;
> 
> out_put_pages:
>     /* This incorrectly restores reservation to subpool */
>     hugepage_subpool_put_pages(spool, chg);
> 
> When hugetlb_acct_memory() fails after subpool allocation, the current
> implementation over-commits subpool reservations by returning the full
> 'chg' value instead of the actual allocated 'gbl_reserve' amount. This
> discrepancy propagates to global reservations during subsequent releases,
> eventually causing resv_huge_pages underflow.
> 
> This problem can be triggered easily with the following steps:
> 1. reserve hugepages for hugetlb allocation
> 2. mount hugetlbfs with min_size to enable the hugetlb subpool
> 3. allocate hugepages with two tasks (make sure the second fails due to
>    an insufficient number of hugepages)
> 4. wait for a few seconds and repeat step 3, which will make
>    hstate->resv_huge_pages go below zero.
>
> To fix this problem, return the correct number of pages to the subpool
> during the fallback after hugepage_subpool_get_pages() is called.
> 
> Fixes: 1c5ecae3a93f ("hugetlbfs: add minimum size accounting to subpools")
> Signed-off-by: Wupeng Ma <mawupeng1@huawei.com>
> ---

Hi Wupeng,

I hope you are doing well! It's been a while since this fix was sent in, and
there haven't been any comments since my testing either. Would you consider
sending this patch again without the RFC tag? I think that might help move
this patch forward :-)

Thank you again for this fix! Have a great day!
Joshua



* Re: [RFC PATCH] mm: hugetlb: Fix incorrect fallback for subpool
  2025-04-09 13:58 ` Joshua Hahn
@ 2025-04-10  0:51   ` mawupeng
  0 siblings, 0 replies; 6+ messages in thread
From: mawupeng @ 2025-04-10  0:51 UTC (permalink / raw)
  To: joshua.hahnjy
  Cc: mawupeng1, akpm, mike.kravetz, david, muchun.song, linux-mm,
	linux-kernel



On 2025/4/9 21:58, Joshua Hahn wrote:
> On Tue, 25 Mar 2025 14:16:34 +0800 Wupeng Ma <mawupeng1@huawei.com> wrote:
> 
>> During our testing with hugetlb subpool enabled, we observe that
>> hstate->resv_huge_pages may underflow into negative values. Root cause
>> analysis reveals a race condition in subpool reservation fallback handling
>> as follows:
>>
>> hugetlb_reserve_pages()
>>     /* Attempt subpool reservation */
>>     gbl_reserve = hugepage_subpool_get_pages(spool, chg);
>>
>>     /* Global reservation may fail after subpool allocation */
>>     if (hugetlb_acct_memory(h, gbl_reserve) < 0)
>>         goto out_put_pages;
>>
>> out_put_pages:
>>     /* This incorrectly restores reservation to subpool */
>>     hugepage_subpool_put_pages(spool, chg);
>>
>> When hugetlb_acct_memory() fails after subpool allocation, the current
>> implementation over-commits subpool reservations by returning the full
>> 'chg' value instead of the actual allocated 'gbl_reserve' amount. This
>> discrepancy propagates to global reservations during subsequent releases,
>> eventually causing resv_huge_pages underflow.
>>
>> This problem can be triggered easily with the following steps:
>> 1. reserve hugepages for hugetlb allocation
>> 2. mount hugetlbfs with min_size to enable the hugetlb subpool
>> 3. allocate hugepages with two tasks (make sure the second fails due to
>>    an insufficient number of hugepages)
>> 4. wait for a few seconds and repeat step 3, which will make
>>    hstate->resv_huge_pages go below zero.
>>
>> To fix this problem, return the correct number of pages to the subpool
>> during the fallback after hugepage_subpool_get_pages() is called.
>>
>> Fixes: 1c5ecae3a93f ("hugetlbfs: add minimum size accounting to subpools")
>> Signed-off-by: Wupeng Ma <mawupeng1@huawei.com>
>> ---
> 
> Hi Wupeng,
> 
> I hope you are doing well! It's been a while since this fix was sent in, and
> there haven't been any comments since my testing either. Would you consider
> sending this patch again without the RFC tag? I think that might help move
> this patch forward :-)

Ok, I'll resend it without the RFC tag.

> 
> Thank you again for this fix! Have a great day!
> Joshua




* Re: [RFC PATCH] mm: hugetlb: Fix incorrect fallback for subpool
  2025-03-31 21:23 ` Joshua Hahn
@ 2025-06-11 15:55   ` Ackerley Tng
  2025-06-12  0:54     ` Joshua Hahn
  0 siblings, 1 reply; 6+ messages in thread
From: Ackerley Tng @ 2025-06-11 15:55 UTC (permalink / raw)
  To: Joshua Hahn
  Cc: mawupeng1, akpm, mike.kravetz, david, muchun.song, linux-mm,
	linux-kernel, kernel-team

Joshua Hahn <joshua.hahnjy@gmail.com> writes:

> On Tue, 25 Mar 2025 14:16:34 +0800 Wupeng Ma <mawupeng1@huawei.com> wrote:
>
>> During our testing with hugetlb subpool enabled, we observe that
>> hstate->resv_huge_pages may underflow into negative values. Root cause
>> analysis reveals a race condition in subpool reservation fallback handling
>> as follows:
>> 
>> hugetlb_reserve_pages()
>>     /* Attempt subpool reservation */
>>     gbl_reserve = hugepage_subpool_get_pages(spool, chg);
>> 
>>     /* Global reservation may fail after subpool allocation */
>>     if (hugetlb_acct_memory(h, gbl_reserve) < 0)
>>         goto out_put_pages;
>> 
>> out_put_pages:
>>     /* This incorrectly restores reservation to subpool */
>>     hugepage_subpool_put_pages(spool, chg);
>> 
>> When hugetlb_acct_memory() fails after subpool allocation, the current
>> implementation over-commits subpool reservations by returning the full
>> 'chg' value instead of the actual allocated 'gbl_reserve' amount. This
>> discrepancy propagates to global reservations during subsequent releases,
>> eventually causing resv_huge_pages underflow.
>> 
>> This problem can be triggered easily with the following steps:
>> 1. reserve hugepages for hugetlb allocation
>> 2. mount hugetlbfs with min_size to enable the hugetlb subpool
>> 3. allocate hugepages with two tasks (make sure the second fails due to
>>    an insufficient number of hugepages)
>> 4. wait for a few seconds and repeat step 3, which will make
>>    hstate->resv_huge_pages go below zero.
>>
>> To fix this problem, return the correct number of pages to the subpool
>> during the fallback after hugepage_subpool_get_pages() is called.
>> 
>> Fixes: 1c5ecae3a93f ("hugetlbfs: add minimum size accounting to subpools")
>> Signed-off-by: Wupeng Ma <mawupeng1@huawei.com>
>
> Hi Wupeng,
> Thank you for the fix! This is a problem that we've also seen happen in
> our fleet at Meta. I was able to recreate the issue that you mentioned -- to
> explicitly lay down the steps I used:
>
> 1. echo 1 > /proc/sys/vm/nr_hugepages
> 2. mkdir /mnt/hugetlb-pool
> 3. mount -t hugetlbfs -o min_size=2M none /mnt/hugetlb-pool
> 4. (./get_hugepage &) && (./get_hugepage &)
>     # get_hugepage just opens a file in /mnt/hugetlb-pool and mmaps 2M into it.

Hi Joshua,

Would you be able to share the source for ./get_hugepage? I'm trying to
reproduce this too.

Does ./get_hugepage just mmap and then spin in an infinite loop?

Do you have to somehow limit allocation of surplus HugeTLB pages from
the buddy allocator?

Thanks!

> 5. sleep 3
> 6. (./get_hugepage &) && (./get_hugepage &)
> 7. cat /proc/meminfo | grep HugePages_Rsvd
>
> ... and (7) shows that HugePages_Rsvd has indeed underflowed to U64_MAX!
>
> I've also verified that applying your fix and then re-running the reproducer
> shows no underflow.
>
> Reviewed-by: Joshua Hahn <joshua.hahnjy@gmail.com>
> Tested-by: Joshua Hahn <joshua.hahnjy@gmail.com>
>
> Sent using hkml (https://github.com/sjp38/hackermail)



* Re: [RFC PATCH] mm: hugetlb: Fix incorrect fallback for subpool
  2025-06-11 15:55   ` Ackerley Tng
@ 2025-06-12  0:54     ` Joshua Hahn
  0 siblings, 0 replies; 6+ messages in thread
From: Joshua Hahn @ 2025-06-12  0:54 UTC (permalink / raw)
  To: Ackerley Tng
  Cc: mawupeng1, akpm, mike.kravetz, david, muchun.song, linux-mm,
	linux-kernel, kernel-team

On Wed, 11 Jun 2025 08:55:30 -0700 Ackerley Tng <ackerleytng@google.com> wrote:

> Joshua Hahn <joshua.hahnjy@gmail.com> writes:
> 
> > On Tue, 25 Mar 2025 14:16:34 +0800 Wupeng Ma <mawupeng1@huawei.com> wrote:
> >
> >> During our testing with hugetlb subpool enabled, we observe that
> >> hstate->resv_huge_pages may underflow into negative values. Root cause
> >> analysis reveals a race condition in subpool reservation fallback handling
> >> as follows:
> >> 
> >> hugetlb_reserve_pages()
> >>     /* Attempt subpool reservation */
> >>     gbl_reserve = hugepage_subpool_get_pages(spool, chg);
> >> 
> >>     /* Global reservation may fail after subpool allocation */
> >>     if (hugetlb_acct_memory(h, gbl_reserve) < 0)
> >>         goto out_put_pages;
> >> 
> >> out_put_pages:
> >>     /* This incorrectly restores reservation to subpool */
> >>     hugepage_subpool_put_pages(spool, chg);
> >> 
> >> When hugetlb_acct_memory() fails after subpool allocation, the current
> >> implementation over-commits subpool reservations by returning the full
> >> 'chg' value instead of the actual allocated 'gbl_reserve' amount. This
> >> discrepancy propagates to global reservations during subsequent releases,
> >> eventually causing resv_huge_pages underflow.
> >> 
> >> This problem can be triggered easily with the following steps:
> >> 1. reserve hugepages for hugetlb allocation
> >> 2. mount hugetlbfs with min_size to enable the hugetlb subpool
> >> 3. allocate hugepages with two tasks (make sure the second fails due to
> >>    an insufficient number of hugepages)
> >> 4. wait for a few seconds and repeat step 3, which will make
> >>    hstate->resv_huge_pages go below zero.
> >>
> >> To fix this problem, return the correct number of pages to the subpool
> >> during the fallback after hugepage_subpool_get_pages() is called.
> >> 
> >> Fixes: 1c5ecae3a93f ("hugetlbfs: add minimum size accounting to subpools")
> >> Signed-off-by: Wupeng Ma <mawupeng1@huawei.com>
> >
> > Hi Wupeng,
> > Thank you for the fix! This is a problem that we've also seen happen in
> > our fleet at Meta. I was able to recreate the issue that you mentioned -- to
> > explicitly lay down the steps I used:
> >
> > 1. echo 1 > /proc/sys/vm/nr_hugepages
> > 2. mkdir /mnt/hugetlb-pool
> > 3. mount -t hugetlbfs -o min_size=2M none /mnt/hugetlb-pool
> > 4. (./get_hugepage &) && (./get_hugepage &)
> >     # get_hugepage just opens a file in /mnt/hugetlb-pool and mmaps 2M into it.
> 
> Hi Joshua,
> 
> Would you be able to share the source for ./get_hugepage? I'm trying to
> reproduce this too.
> 
> Does ./get_hugepage just mmap and then spin in an infinite loop?
> 
> Do you have to somehow limit allocation of surplus HugeTLB pages from
> the buddy allocator?
> 
> Thanks!

Hi Ackerley,

The program I used for get_hugepage is very simple :-) No need to even spin
infinitely! I just open a file, ftruncate it to 2M, and mmap 2M into it. For
good measure I set addr[0] = '.', sleep for 1 second, and then munmap the
area afterwards.

Here is a simplified version of the program (no error handling):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/mnt/hugetlb-pool/hugetlb_file", O_RDWR | O_CREAT, 0666);
	ftruncate(fd, 2*1024*1024);
	char *addr = mmap(NULL, 2*1024*1024, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	addr[0] = '.';	/* touch the mapping to fault the hugepage in */
	sleep(1);
	munmap(addr, 2*1024*1024);
	close(fd);
}
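
Assuming the above is saved as get_hugepage.c (the file and mount paths are
just the ones I used), a plain "gcc -o get_hugepage get_hugepage.c" should
be enough to build it; only standard POSIX headers are needed.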

Hope this helps! Please let me know if it doesn't work; I would be happy
to investigate this with you. Have a great day!
Joshua

Sent using hkml (https://github.com/sjp38/hackermail)


