* [RFC] mm/hugetlb: min_hpages unwind corrupts reservation accounting
@ 2026-04-28 13:55 Zhao Li
From: Zhao Li @ 2026-04-28 13:55 UTC (permalink / raw)
To: linux-mm
Cc: Zhao Li, Andrew Morton, Mike Kravetz, Muchun Song, Oscar Salvador,
David Hildenbrand, linux-kernel
Hi,
While narrowing a separately-posted v3 patch ("mm/hugetlb: restore
subpool used_hpages on alloc_hugetlb_folio() cgroup-charge failure",
hereafter "the v3 patch"), I traced a broader accounting issue on
subpools that have both max_hpages and min_hpages set. The v3 patch
intentionally avoids that quadrant.
Problem
-------
For min_hpages subpools, HugeTLB reservation state is split across:
- subpool->used_hpages / subpool->rsv_hpages, under spool->lock
- h->resv_huge_pages, under hugetlb_lock
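For orientation, the subpool fields involved (condensed from `struct
hugepage_subpool` in include/linux/hugetlb.h of recent mainline; other
members elided, comments paraphrased, details may differ in the tree
under discussion):

    struct hugepage_subpool {
            spinlock_t lock;
            long count;
            long max_hpages;        /* max pages, or -1 if no maximum */
            long used_hpages;       /* charged against max_hpages, both
                                       allocated and reserved pages */
            struct hstate *hstate;
            long min_hpages;        /* min pages, or -1 if no minimum */
            long rsv_hpages;        /* global reserves still held to
                                       back the unused minimum */
    };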
Some callers first do a speculative hugepage_subpool_get_pages() and
only later know whether the operation will commit. If the operation
fails, they undo only the speculative used_hpages bump.
That is fine in isolation, but it composes badly with a racing
hugepage_subpool_put_pages() on the same min_hpages subpool.
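The branch that matters is the minimum-size accounting in
hugepage_subpool_put_pages(). A condensed paraphrase of the mainline
code in mm/hugetlb.c (details may differ slightly in the tree under
discussion):

    static long hugepage_subpool_put_pages(struct hugepage_subpool *spool,
                                           long delta)
    {
            long ret = delta;
            ...
            if (spool->max_hpages != -1)    /* maximum size accounting */
                    spool->used_hpages -= delta;

            /*
             * Minimum size accounting: rsv_hpages is only refilled when
             * the (already decremented) used_hpages fell below min_hpages.
             */
            if (spool->min_hpages != -1 &&
                spool->used_hpages < spool->min_hpages) {
                    if (spool->rsv_hpages + delta <= spool->min_hpages)
                            ret = 0;
                    else
                            ret = spool->rsv_hpages + delta - spool->min_hpages;

                    spool->rsv_hpages += delta;
                    if (spool->rsv_hpages > spool->min_hpages)
                            spool->rsv_hpages = spool->min_hpages;
            }
            ...
            return ret;     /* global reservations the caller may drop */
    }

While a speculative get holds used_hpages above min_hpages, a racing
put skips the rsv_hpages refill and returns delta, telling its caller
to drop a global reservation that was actually backing the minimum.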
One concrete sequence is:
1. Subpool state starts at:

     max_hpages = 2, min_hpages = 1
     used_hpages = 1, rsv_hpages = 0

   h->resv_huge_pages still carries the subpool's min_hpages backing.
2. A speculative caller does hugepage_subpool_get_pages(spool, 1) on the
   above-min path:

     used_hpages: 1 -> 2
     rsv_hpages:  0 (unchanged)
     h->resv_huge_pages: unchanged
3. Before that speculative slot is unwound or committed, a racing
hugepage_subpool_put_pages(spool, 1) from unreserve/free sees
used_hpages == 2, drops it to 1, and does not restore rsv_hpages
because used_hpages is not below min_hpages.
4. The caller of hugepage_subpool_put_pages() then drops one global
reservation via hugetlb_acct_memory(h, -1).
At that point the subpool's permanent min_hpages backing has effectively
been consumed by a transient speculative used_hpages slot.
If the speculative path later undoes only used_hpages, the state can
become:

     used_hpages = 0
     rsv_hpages  = 0

with the subpool minimum no longer backed globally.
Later, when the subpool is released and subpool_is_free() becomes true,
unlock_or_release_subpool() drops min_hpages from h->resv_huge_pages
again. That second drop can wrap the unsigned reservation counter.
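The second drop happens here (again a condensed paraphrase of recent
mainline; the predicate details may differ in the tree under
discussion):

    static void unlock_or_release_subpool(struct hugepage_subpool *spool,
                                          unsigned long irq_flags)
    {
            bool free = subpool_is_free(spool);

            spin_unlock_irqrestore(&spool->lock, irq_flags);

            if (free) {
                    /* unaccounts the full minimum, unconditionally */
                    if (spool->min_hpages != -1)
                            hugetlb_acct_memory(spool->hstate,
                                                -spool->min_hpages);
                    kfree(spool);
            }
    }

The negative-delta path of hugetlb_acct_memory() ends in an unchecked
decrement of h->resv_huge_pages, an unsigned long, which is where the
wrap materializes.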
Why this is separate from the v3 patch
--------------------------------------
The v3 patch only decrements used_hpages directly for max-only
subpools, where min_hpages == -1 and hugepage_subpool_put_pages()
cannot restore rsv_hpages. It intentionally leaves min_hpages subpools
unchanged.
The reason is that the broader min_hpages issue already exists in the
older hugetlb_reserve_pages() failure cleanup, so I did not want to
extend the same pattern into alloc_hugetlb_folio().
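For reference, the failure cleanup in question has this shape (heavily
condensed from hugetlb_reserve_pages() in recent mainline; cgroup and
resv_map handling elided, intermediate logic dropped):

    /* speculative: charges the subpool before the commit decision */
    gbl_reserve = hugepage_subpool_get_pages(spool, chg);
    if (gbl_reserve < 0)
            goto out_uncharge_cgroup;

    /* the commit can still fail while the subpool charge is held */
    if (hugetlb_acct_memory(h, gbl_reserve) < 0)
            goto out_put_pages;
    ...
    out_put_pages:
            /* unwind returns the speculative charge through
             * hugepage_subpool_put_pages(), with the interaction above */
            (void)hugepage_subpool_put_pages(spool, chg);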
Reproducer
----------
I first isolated the race with a debug-only `msleep(1000)` inserted
after `hugepage_subpool_get_pages()` on the above-min path to widen the
window. More importantly, I then reproduced it under QEMU on a *clean*
Linux v7.1-rc1 tree (`254f49634ee16a731174d2ae34bc50bd5f45e731`) with a
userspace-only stress harness and no kernel instrumentation.
Setup:
- `mount -t hugetlbfs -o pagesize=2M,size=4M,min_size=2M nodev /mnt/htlb`
(`max_hpages = 2`, `min_hpages = 1`)
- Mapping A pre-creates one file-backed reservation on that subpool,
  bringing the live state to:

    spool->used_hpages = 1
    spool->rsv_hpages  = 0
    h->resv_huge_pages = 1
- A separate anonymous `MAP_HUGETLB` fault consumes one real hugepage.
- `/proc/sys/vm/nr_hugepages` is then shrunk from 2 to 1 so mapping B's
hugetlbfs `mmap()` will fail with `-ENOMEM` after taking the
speculative subpool slot.
- The userspace harness polls hugetlbfs `statfs().f_bfree` and uses
`f_bfree == 0` as the synchronization point between B's failed
reserve path and A's release on the same subpool. No kernel
modification is needed for that alignment.
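To make the synchronization point concrete, a minimal sketch of the
polling helper (hypothetical name; the real harness does the same
statfs() loop):

    #include <sys/vfs.h>
    #include <sched.h>

    /*
     * Spin until thread B's speculative used_hpages bump is visible at
     * the mount level: hugetlbfs reports f_bfree as
     * max_hpages - used_hpages, so it reads 0 once B's speculative
     * slot joins mapping A's slot (2 - 2 = 0).
     */
    static void wait_for_speculative_slot(const char *mnt)
    {
            struct statfs sfs;

            while (statfs(mnt, &sfs) == 0 && sfs.f_bfree != 0)
                    sched_yield();
    }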
Race:
1. Thread B enters `hugetlb_reserve_pages(chg=1)` and takes the
above-min speculative slot.
2. Userspace polls hugetlbfs `statfs().f_bfree` until that speculative
slot is visible at the mount level (`f_bfree == 0`), then unmaps
mapping A on the same subpool.
3. Mapping A's close/unreserve path drops one global reservation while
B still owns only a speculative `used_hpages` slot.
4. Thread B then unwinds only its speculative slot via the existing
`out_put_pages` cleanup.
5. `umount /mnt/htlb` releases the subpool, and
`unlock_or_release_subpool()` subtracts `min_hpages` from
`h->resv_huge_pages` again.
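For reference, the counter evolution across those steps (step 2 is the
userspace synchronization and changes no counters; end states per the
concrete sequence in the Problem section above):

                                        used_hpages  rsv_hpages  h->resv
    after setup                              1           0          1
    1. B takes speculative slot              2           0          1
    3. A's unreserve drops one global        1           0          0
    4. B unwinds its speculative slot        0           0          0
    5. umount drops min_hpages again         0           0        wraps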
Observed clean-kernel hits:
- run 1: `HIT iter=1026 resv_after=0 resv_umount=18446744073709551615`
- run 2: `HIT iter=22 resv_after=0 resv_umount=18446744073709551615`
Here `resv_after=0` is already the wrong live state before `umount`:
the subpool baseline is still `min_hpages = 1`, so
`/sys/kernel/mm/hugepages/hugepages-2048kB/resv_hugepages` should still
reflect one reserved hugepage at that point. The wrapped value was then
visible by reading the same sysfs file after the umount.
A follow-up probe variant adds a pre-umount snapshot of every
externally-visible counter on hit. Three back-to-back debug-widened
runs all observed identical pre-umount state:
    resv_hugepages (sysfs)         = 0   (baseline = 1 expected)
    free_hugepages (sysfs)         = 0
    HugePages_Rsvd (/proc/meminfo) = 0
    statfs(mnt).f_bfree            = 2
Note that `statfs` reports the subpool's view (max_hpages - used_hpages
= 2 - 0 = 2 free at subpool layer), while sysfs reports the global
hstate view (h->free_huge_pages = 0). Readers of these layers see
counter values that disagree with each other and with the actual
reservation state. Post-umount, the `resv_hugepages` value wraps to
ULONG_MAX (`18446744073709551615`). That wrapped value reaches the
per-hstate sysfs `resv_hugepages` file for this hugepage size class.
On configurations where this hstate is the default hstate, the same
value also reaches `/proc/meminfo`'s `HugePages_Rsvd`.
I can post the userspace-only harness, the pre-umount probe variant,
and the earlier debug-trace patch as follow-up material if that would
help review.
So this is no longer just a theoretical concern raised during review of
alloc_hugetlb_folio(). The broader issue already exists today on the
older hugetlb_reserve_pages() path.
Downstream sinks (static analysis, kept to the minimum needed for review)
--------------------------------------------------------------------------
`h->resv_huge_pages` is per-`struct hstate`, shared across mounts and
subpools using the same hugepage size. Once it is corrupted, two
downstream consumers matter immediately:
- `available_huge_pages(h) = free - resv` (mm/hugetlb.c:1334) is a raw
  unsigned subtraction, so it wraps to a huge value once `resv > free`.
  The `if (gbl_chg && !available_huge_pages(h))` gate at
  mm/hugetlb.c:1351 in dequeue_hugetlb_folio_vma() and the identical
  predicate at mm/hugetlb.c:1997 in dissolve_free_hugetlb_folio() would
  then both fall through as if pages were available, bypassing
  reservation accounting on the `gbl_chg > 0` allocation path and on
  the dissolve path (see the worked arithmetic after this list).
- /sys/kernel/mm/hugepages/hugepages-NkB/resv_hugepages
(mm/hugetlb_sysfs.c:156) exports the raw per-hstate value directly.
If this hstate is the default hstate, `/proc/meminfo`
`HugePages_Rsvd` (mm/hugetlb.c:4566) exports the same raw value.
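To make the first sink concrete, with the post-umount state observed
above:

    free = 0
    resv = ULONG_MAX (18446744073709551615)

    available_huge_pages(h) = free - resv
                            = 0 - ULONG_MAX
                            = 1              (mod 2^64)

so `!available_huge_pages(h)` is false and both gates fall through as
if a page were available.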
I have not yet empirically demonstrated cross-mount reservation theft,
gate bypass on a second mount, or a non-admin trigger path. The sink
analysis above is static only and should be read that way.
What I am not claiming here
---------------------------
- I am not claiming the v3 patch introduces this broader issue.
- I am not claiming a final fix direction yet.
Ask
---
Does the above race description and reproduced state sequence look
correct?
If so, I will keep this separate from the v3 thread and package a
reproducer plus a broader min_hpages fix discussion around the existing
hugetlb_reserve_pages() path first.
Thanks,
Zhao