From: Zhao Li
To: linux-mm@kvack.org
Cc: Zhao Li, Andrew Morton, Mike Kravetz, Muchun Song, Oscar Salvador,
 David Hildenbrand, linux-kernel@vger.kernel.org
Subject: [RFC] mm/hugetlb: min_hpages unwind corrupts reservation accounting
Date: Tue, 28 Apr 2026 21:55:41 +0800
Message-ID: <20260428135540.31428-2-enderaoelyther@gmail.com>
X-Mailer: git-send-email 2.50.1

Hi,

While narrowing a separately-posted v3 patch ("mm/hugetlb: restore
subpool used_hpages on alloc_hugetlb_folio() cgroup-charge failure",
hereafter "the v3 patch"), I traced a broader accounting issue on
subpools that have both max_hpages and min_hpages set. The v3 patch
intentionally avoids that quadrant.

Problem
-------

For min_hpages subpools, HugeTLB reservation state is split across:

- subpool->used_hpages / subpool->rsv_hpages, under spool->lock
- h->resv_huge_pages, under hugetlb_lock

Some callers first do a speculative hugepage_subpool_get_pages() and
only later learn whether the operation will commit. If the operation
fails, they undo only the speculative used_hpages bump. That is fine
in isolation, but it composes badly with a racing
hugepage_subpool_put_pages() on the same min_hpages subpool. One
concrete sequence is:

1. Subpool state starts at:

       max_hpages = 2, min_hpages = 1
       used_hpages = 1, rsv_hpages = 0

   h->resv_huge_pages still carries the subpool's min_hpages backing.

2.
   A speculative caller does hugepage_subpool_get_pages(spool, 1) on
   the above-min path:

       used_hpages: 1 -> 2
       rsv_hpages:  0
       no change to h->resv_huge_pages

3. Before that speculative slot is unwound or committed, a racing
   hugepage_subpool_put_pages(spool, 1) from unreserve/free sees
   used_hpages == 2, drops it to 1, and does not restore rsv_hpages
   because used_hpages is not below min_hpages.

4. The caller of hugepage_subpool_put_pages() then drops one global
   reservation via hugetlb_acct_memory(h, -1).

At that point the subpool's permanent min_hpages backing has
effectively been consumed by a transient speculative used_hpages
slot. If the speculative path later undoes only used_hpages, the
state can become:

       used_hpages = 0
       rsv_hpages  = 0

with the subpool minimum no longer backed globally. Later, when the
subpool is released and subpool_is_free() becomes true,
unlock_or_release_subpool() drops min_hpages from h->resv_huge_pages
again. That second drop can wrap the unsigned reservation counter.

Why this is separate from the v3 patch
--------------------------------------

The v3 patch only decrements used_hpages directly for max-only
subpools, where min_hpages == -1 and hugepage_subpool_put_pages()
cannot restore rsv_hpages. It intentionally leaves min_hpages
subpools unchanged, because the broader min_hpages issue already
exists in the older hugetlb_reserve_pages() failure cleanup and I did
not want to extend the same pattern into alloc_hugetlb_folio().

Reproducer
----------

I first isolated the race with a debug-only `msleep(1000)` after
`hugepage_subpool_get_pages()` on the above-min path to widen the
race window. More importantly, I then reproduced it under QEMU on a
**clean** Linux v7.1-rc1 tree
(`254f49634ee16a731174d2ae34bc50bd5f45e731`) with a userspace-only
stress harness and no kernel instrumentation.
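As a sanity check of the sequence above, the bookkeeping can be condensed into a plain userspace C model. The struct and model_* names below are illustrative stand-ins that mirror mm/hugetlb.c; this is a sketch of the described accounting only, not kernel code, and locking is omitted.

```c
#include <assert.h>

/* Stand-in for the subpool fields involved in the race. */
struct spool_model {
	unsigned long max_hpages;
	unsigned long min_hpages;
	unsigned long used_hpages;
	unsigned long rsv_hpages;
};

/* Stand-in for h->resv_huge_pages: unsigned, so it can wrap. */
struct hstate_model {
	unsigned long resv_huge_pages;
};

/* Speculative get on the above-min path: only used_hpages moves. */
static void model_subpool_get(struct spool_model *s)
{
	s->used_hpages++;
}

/*
 * Put from unreserve/free: re-arm rsv_hpages only when dropping below
 * min_hpages; otherwise the caller drops one global reservation
 * (folded in here for brevity, as hugetlb_acct_memory(h, -1) would).
 */
static void model_subpool_put(struct spool_model *s, struct hstate_model *h)
{
	s->used_hpages--;
	if (s->used_hpages < s->min_hpages)
		s->rsv_hpages++;
	else
		h->resv_huge_pages--;
}

/* The problematic unwind: only the speculative used_hpages bump is undone. */
static void model_speculative_unwind(struct spool_model *s)
{
	s->used_hpages--;
}

/* unlock_or_release_subpool(): drops min_hpages from the global counter. */
static void model_subpool_release(struct spool_model *s, struct hstate_model *h)
{
	h->resv_huge_pages -= s->min_hpages;
}
```

Replaying get, racing put, unwind, release from the starting state above (max=2, min=1, used=1, rsv=0, global resv=1) leaves resv_huge_pages at (unsigned long)-1 in this model, matching the wrapped value in the observed hits.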
Setup:

- `mount -t hugetlbfs -o pagesize=2M,size=4M,min_size=2M nodev /mnt/htlb`
  (`max_hpages = 2`, `min_hpages = 1`)
- Mapping A pre-creates one file-backed reservation on that subpool,
  bringing the live state to:

      spool->used_hpages = 1
      spool->rsv_hpages  = 0
      h->resv_huge_pages = 1

- A separate anonymous `MAP_HUGETLB` fault consumes one real hugepage.
- `/proc/sys/vm/nr_hugepages` is then shrunk from 2 to 1 so that
  mapping B's hugetlbfs `mmap()` will fail with `-ENOMEM` after taking
  the speculative subpool slot.
- The userspace harness polls hugetlbfs `statfs().f_bfree` and uses
  `f_bfree == 0` as the synchronization point between B's failed
  reserve path and A's release on the same subpool. No kernel
  modification is needed for that alignment.

Race:

1. Thread B enters `hugetlb_reserve_pages(chg=1)` and takes the
   above-min speculative slot.
2. Userspace polls hugetlbfs `statfs().f_bfree` until that speculative
   slot is visible at the mount level (`f_bfree == 0`), then unmaps
   mapping A on the same subpool.
3. Mapping A's close/unreserve path drops one global reservation while
   B still owns only a speculative `used_hpages` slot.
4. Thread B then unwinds only its speculative slot via the existing
   `out_put_pages` cleanup.
5. `umount /mnt/htlb` releases the subpool, and
   `unlock_or_release_subpool()` subtracts `min_hpages` from
   `h->resv_huge_pages` again.

Observed clean-kernel hits:

- run 1: `HIT iter=1026 resv_after=0 resv_umount=18446744073709551615`
- run 2: `HIT iter=22 resv_after=0 resv_umount=18446744073709551615`

Here `resv_after=0` is already the wrong live state before `umount`:
the subpool baseline is still `min_hpages = 1`, so
`/sys/kernel/mm/hugepages/hugepages-2048kB/resv_hugepages` should
still reflect one reserved hugepage at that point. The wrapped value
was then visible by reading the same sysfs file after the umount.

A follow-up probe variant adds a pre-umount snapshot of every
externally-visible counter on hit.
Three back-to-back debug-widened runs all observed identical
pre-umount state:

    resv_hugepages (sysfs)         = 0   (baseline = 1 expected)
    free_hugepages (sysfs)         = 0
    HugePages_Rsvd (/proc/meminfo) = 0
    statfs(mnt).f_bfree            = 2

Note that `statfs` reports the subpool's view (max_hpages -
used_hpages = 2 - 0 = 2 free at the subpool layer), while sysfs
reports the global hstate view (h->free_huge_pages = 0). Readers of
these layers see counter values that disagree with each other and
with the actual reservation state.

Post-umount, the `resv_hugepages` value wraps to ULONG_MAX
(`18446744073709551615`). That wrapped value reaches the per-hstate
sysfs `resv_hugepages` file for this hugepage size class. On
configurations where this hstate is the default hstate, the same
value also reaches `/proc/meminfo`'s `HugePages_Rsvd`.

I can post the userspace-only harness, the pre-umount probe variant,
and the earlier debug-trace patch as follow-up material if that would
help review. So this is no longer just a theoretical concern in
alloc_hugetlb_folio() review: the broader issue already exists today
on the older hugetlb_reserve_pages() path.

Downstream sinks (static analysis, kept to the minimum needed for review)
-------------------------------------------------------------------------

`h->resv_huge_pages` is per-`struct hstate`, shared across mounts and
subpools using the same hugepage size. Once it is corrupted, two
downstream consumers matter immediately:

- `available_huge_pages(h) = free - resv` (mm/hugetlb.c:1334) is a
  raw unsigned subtraction. When `resv > free`, the difference wraps
  to a large nonzero value, so the
  `if (gbl_chg && !available_huge_pages(h))` gate at mm/hugetlb.c:1351
  in dequeue_hugetlb_folio_vma() and the identical predicate at
  mm/hugetlb.c:1997 in dissolve_free_hugetlb_folio() both fail to bail
  out. That bypasses reservation accounting on the `gbl_chg > 0`
  allocation path and on the dissolve path.
- /sys/kernel/mm/hugepages/hugepages-NkB/resv_hugepages
  (mm/hugetlb_sysfs.c:156) exports the raw per-hstate value directly.
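Both sinks reduce to the same raw unsigned arithmetic. A few lines of userspace C make the gate behavior concrete; the model_* names are illustrative stand-ins for the kernel helpers, not the kernel functions themselves.

```c
#include <assert.h>
#include <limits.h>

/* Stand-in for available_huge_pages(): raw unsigned subtraction. */
static unsigned long model_available(unsigned long free_hpages,
				     unsigned long resv_hpages)
{
	return free_hpages - resv_hpages;	/* wraps when resv > free */
}

/* The gate shape: bail out only when the difference is exactly zero. */
static int model_gate_bails(unsigned long gbl_chg,
			    unsigned long free_hpages,
			    unsigned long resv_hpages)
{
	return gbl_chg && !model_available(free_hpages, resv_hpages);
}
```

With free = 0 and a wrapped resv = ULONG_MAX, the difference is 1, so the gate reports pages available and never bails, even though nothing is actually free.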
If this hstate is the default hstate, `/proc/meminfo`'s
`HugePages_Rsvd` (mm/hugetlb.c:4566) exports the same raw value.

I have not yet empirically demonstrated cross-mount reservation
theft, gate bypass on a second mount, or a non-admin trigger path.
The sink analysis above is static only and should be read that way.

What I am not claiming here
---------------------------

- I am not claiming the v3 patch introduces this broader issue.
- I am not claiming a final fix direction yet.

Ask
---

Does the above race description and reproduced state sequence look
correct? If so, I will keep this separate from the v3 thread and
package a reproducer plus a broader min_hpages fix discussion around
the existing hugetlb_reserve_pages() path first.

Thanks,
Zhao