public inbox for linux-kernel@vger.kernel.org
* [RFC] mm/hugetlb: min_hpages unwind corrupts reservation accounting
@ 2026-04-28 13:55 Zhao Li
From: Zhao Li @ 2026-04-28 13:55 UTC (permalink / raw)
  To: linux-mm
  Cc: Zhao Li, Andrew Morton, Mike Kravetz, Muchun Song, Oscar Salvador,
	David Hildenbrand, linux-kernel

Hi,

While narrowing the scope of a separately-posted v3 patch
("mm/hugetlb: restore subpool used_hpages on alloc_hugetlb_folio()
cgroup-charge failure", hereafter "the v3 patch"), I traced a broader
accounting issue on subpools that have both max_hpages and min_hpages
set.  The v3 patch intentionally avoids that quadrant.

Problem
-------

For min_hpages subpools, HugeTLB reservation state is split across:

- subpool->used_hpages / subpool->rsv_hpages, under spool->lock
- h->resv_huge_pages, under hugetlb_lock

Some callers first do a speculative hugepage_subpool_get_pages() and
only later know whether the operation will commit.  If the operation
fails, they undo only the speculative used_hpages bump.

That is fine in isolation, but it composes badly with a racing
hugepage_subpool_put_pages() on the same min_hpages subpool.

One concrete sequence is:

1. Subpool state starts at:
      max_hpages = 2, min_hpages = 1
      used_hpages = 1, rsv_hpages = 0
      h->resv_huge_pages still carries the subpool's min_hpages backing

2. A speculative caller does hugepage_subpool_get_pages(spool, 1) on the
   above-min path:
      used_hpages: 1 -> 2
      rsv_hpages: 0
      no change to h->resv_huge_pages

3. Before that speculative slot is unwound or committed, a racing
   hugepage_subpool_put_pages(spool, 1) from unreserve/free sees
   used_hpages == 2, drops it to 1, and does not restore rsv_hpages
   because used_hpages is not below min_hpages.

4. The caller of hugepage_subpool_put_pages() then drops one global
   reservation via hugetlb_acct_memory(h, -1).

At that point the subpool's permanent min_hpages backing has effectively
been consumed by a transient speculative used_hpages slot.

If the speculative path later undoes only used_hpages, the state can
become:

      used_hpages = 0
      rsv_hpages = 0

with the subpool minimum no longer backed globally.

Later, when the subpool is released and subpool_is_free() becomes true,
unlock_or_release_subpool() drops min_hpages from h->resv_huge_pages
again.  That second drop can wrap the unsigned reservation counter.

Why this is separate from the v3 patch
--------------------------------------

The v3 patch only decrements used_hpages directly for max-only
subpools, where min_hpages == -1 and hugepage_subpool_put_pages()
cannot restore rsv_hpages.  It intentionally leaves min_hpages subpools
unchanged.

The reason is that the broader min_hpages issue already exists in the
older hugetlb_reserve_pages() failure cleanup, so I did not want to
extend the same pattern into alloc_hugetlb_folio().

Reproducer
----------

I first isolated the race with a debug-only `msleep(1000)` inserted
after `hugepage_subpool_get_pages()` on the above-min path to widen the
race window.  More importantly, I then reproduced it under QEMU on a
clean Linux v7.1-rc1 tree
(`254f49634ee16a731174d2ae34bc50bd5f45e731`) with a userspace-only
stress harness and no kernel instrumentation.

Setup:

- `mount -t hugetlbfs -o pagesize=2M,size=4M,min_size=2M nodev /mnt/htlb`
  (`max_hpages = 2`, `min_hpages = 1`)
- Mapping A pre-creates one file-backed reservation on that subpool,
  bringing the live state to:
      spool->used_hpages = 1
      spool->rsv_hpages = 0
      h->resv_huge_pages = 1
- A separate anonymous `MAP_HUGETLB` fault consumes one real hugepage.
- `/proc/sys/vm/nr_hugepages` is then shrunk from 2 to 1 so mapping B's
  hugetlbfs `mmap()` will fail with `-ENOMEM` after taking the
  speculative subpool slot.
- The userspace harness polls hugetlbfs `statfs().f_bfree` and uses
  `f_bfree == 0` as the synchronization point between B's failed
  reserve path and A's release on the same subpool.  No kernel
  modification is needed for that alignment.
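For reference, the setup above condenses to roughly the following
(a sketch that needs root and a 2M-hugepage-capable kernel; the mmap()
steps are performed by the harness and only noted as comments here):

```shell
set -e
echo 2 > /proc/sys/vm/nr_hugepages     # global pool: 2 x 2M pages
mkdir -p /mnt/htlb
mount -t hugetlbfs -o pagesize=2M,size=4M,min_size=2M nodev /mnt/htlb
# -> spool: max_hpages = 2, min_hpages = 1
# mapping A (harness): mmap() one page of a file on /mnt/htlb, giving
#   used_hpages = 1, rsv_hpages = 0, h->resv_huge_pages = 1
# harness also faults one anonymous MAP_HUGETLB page
echo 1 > /proc/sys/vm/nr_hugepages     # shrink so B's mmap() hits -ENOMEM
```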

Race:

1. Thread B enters `hugetlb_reserve_pages(chg=1)` and takes the
   above-min speculative slot.
2. Userspace polls hugetlbfs `statfs().f_bfree` until that speculative
   slot is visible at the mount level (`f_bfree == 0`), then unmaps
   mapping A on the same subpool.
3. Mapping A's close/unreserve path drops one global reservation while
   B still owns only a speculative `used_hpages` slot.
4. Thread B then unwinds only its speculative slot via the existing
   `out_put_pages` cleanup.
5. `umount /mnt/htlb` releases the subpool, and
   `unlock_or_release_subpool()` subtracts `min_hpages` from
   `h->resv_huge_pages` again.

Observed clean-kernel hits:

- run 1: `HIT iter=1026 resv_after=0 resv_umount=18446744073709551615`
- run 2: `HIT iter=22   resv_after=0 resv_umount=18446744073709551615`

Here `resv_after=0` is already the wrong live state before `umount`:
the subpool baseline is still `min_hpages = 1`, so
`/sys/kernel/mm/hugepages/hugepages-2048kB/resv_hugepages` should still
reflect one reserved hugepage at that point.  The wrapped value was then
visible by reading the same sysfs file after the umount.

A follow-up probe variant adds a pre-umount snapshot of every
externally-visible counter on hit.  Three back-to-back debug-widened
runs all observed identical pre-umount state:

  resv_hugepages   (sysfs)         = 0   (baseline=1 expected)
  free_hugepages   (sysfs)         = 0
  HugePages_Rsvd   (/proc/meminfo) = 0
  statfs(mnt).f_bfree              = 2

Note that `statfs` reports the subpool's view (max_hpages - used_hpages
= 2 - 0 = 2 free at subpool layer), while sysfs reports the global
hstate view (h->free_huge_pages = 0).  Readers of these layers see
counter values that disagree with each other and with the actual
reservation state.  Post-umount, the `resv_hugepages` value wraps to
ULONG_MAX (`18446744073709551615`).  That wrapped value reaches the
per-hstate sysfs `resv_hugepages` file for this hugepage size class.
On configurations where this hstate is the default hstate, the same
value also reaches `/proc/meminfo`'s `HugePages_Rsvd`.

I can post the userspace-only harness, the pre-umount probe variant,
and the earlier debug-trace patch as follow-up material if that would
help review.

So this is no longer just a theoretical concern in alloc_hugetlb_folio()
review.  The broader issue already exists today on the older
hugetlb_reserve_pages() path.

Downstream sinks (static analysis only, kept to the minimum needed for review)
------------------------------------------------------------------------------

`h->resv_huge_pages` is per-`struct hstate`, shared across mounts and
subpools using the same hugepage size.  Once it is corrupted, two
downstream consumers matter immediately:

- `available_huge_pages(h) = free - resv` (mm/hugetlb.c:1334) is a raw
  unsigned subtraction.  When `resv > free` it wraps to a huge nonzero
  value, so the `if (gbl_chg && !available_huge_pages(h))` gate at
  mm/hugetlb.c:1351 in dequeue_hugetlb_folio_vma() and the identical
  predicate at mm/hugetlb.c:1997 in dissolve_free_hugetlb_folio()
  would both stop rejecting.  That would bypass reservation accounting
  on the `gbl_chg > 0` allocation path and on the dissolve path.
- /sys/kernel/mm/hugepages/hugepages-NkB/resv_hugepages
  (mm/hugetlb_sysfs.c:156) exports the raw per-hstate value directly.
  If this hstate is the default hstate, `/proc/meminfo`
  `HugePages_Rsvd` (mm/hugetlb.c:4566) exports the same raw value.

I have not yet empirically demonstrated cross-mount reservation theft,
gate bypass on a second mount, or a non-admin trigger path.  The sink
analysis above is static only and should be read that way.

What I am not claiming here
---------------------------

- I am not claiming the v3 patch introduces this broader issue.
- I am not claiming a final fix direction yet.

Ask
---

Does the above race description and reproduced state sequence look
correct?

If so, I will keep this separate from the v3 thread and package a
reproducer plus a broader min_hpages fix discussion around the existing
hugetlb_reserve_pages() path first.

Thanks,
Zhao
