From: Zhao Li
To: linux-mm@kvack.org
Cc: Zhao Li, Andrew Morton, Mike Kravetz, Muchun Song, Oscar Salvador,
 David Hildenbrand, linux-kernel@vger.kernel.org
Subject: [RFC] mm/hugetlb: min_hpages unwind corrupts reservation accounting
Date: Tue, 28 Apr 2026 21:55:41 +0800
Message-ID: <20260428135540.31428-2-enderaoelyther@gmail.com>
X-Mailer: git-send-email 2.50.1

Hi,

While narrowing a separately-posted v3 patch ("mm/hugetlb: restore
subpool used_hpages on alloc_hugetlb_folio() cgroup-charge failure",
hereafter "the v3 patch"), I traced a broader accounting issue on
subpools that have both max_hpages and min_hpages set. The v3 patch
intentionally avoids that quadrant.

Problem
-------

For min_hpages subpools, HugeTLB reservation state is split across:

- subpool->used_hpages / subpool->rsv_hpages, under spool->lock
- h->resv_huge_pages, under hugetlb_lock

Some callers first do a speculative hugepage_subpool_get_pages() and
only later learn whether the operation will commit. If the operation
fails, they undo only the speculative used_hpages bump. That is fine
in isolation, but it composes badly with a racing
hugepage_subpool_put_pages() on the same min_hpages subpool. One
concrete sequence is:

1. Subpool state starts at:

       max_hpages = 2, min_hpages = 1
       used_hpages = 1, rsv_hpages = 0

   h->resv_huge_pages still carries the subpool's min_hpages backing.

2.
   A speculative caller does hugepage_subpool_get_pages(spool, 1) on
   the above-min path:

       used_hpages: 1 -> 2
       rsv_hpages:  0
       no change to h->resv_huge_pages

3. Before that speculative slot is unwound or committed, a racing
   hugepage_subpool_put_pages(spool, 1) from unreserve/free sees
   used_hpages == 2, drops it to 1, and does not restore rsv_hpages
   because used_hpages is not below min_hpages.

4. The caller of hugepage_subpool_put_pages() then drops one global
   reservation via hugetlb_acct_memory(h, -1).

At that point the subpool's permanent min_hpages backing has
effectively been consumed by a transient speculative used_hpages
slot. If the speculative path later undoes only used_hpages, the
state can become:

       used_hpages = 0
       rsv_hpages  = 0

with the subpool minimum no longer backed globally. Later, when the
subpool is released and subpool_is_free() becomes true,
unlock_or_release_subpool() drops min_hpages from h->resv_huge_pages
again. That second drop can wrap the unsigned reservation counter.

Why this is separate from the v3 patch
--------------------------------------

The v3 patch only decrements used_hpages directly for max-only
subpools, where min_hpages == -1 and hugepage_subpool_put_pages()
cannot restore rsv_hpages. It intentionally leaves min_hpages
subpools unchanged, because the broader min_hpages issue already
exists in the older hugetlb_reserve_pages() failure cleanup and I did
not want to extend the same pattern into alloc_hugetlb_folio().

Reproducer
----------

I first isolated the race with a debug-only `msleep(1000)` after
`hugepage_subpool_get_pages()` on the above-min path to widen the
race window. More importantly, I then reproduced it under QEMU on a
**clean** Linux v7.1-rc1 tree
(`254f49634ee16a731174d2ae34bc50bd5f45e731`) with a userspace-only
stress harness and no kernel instrumentation.
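As a sanity check of the sequence above, the bookkeeping can be condensed into a plain userspace C model. The struct and model_* names below are illustrative stand-ins that mirror mm/hugetlb.c; this is a sketch of the described accounting only, not kernel code, and locking is omitted.

```c
#include <assert.h>

/* Stand-in for the subpool fields involved in the race. */
struct spool_model {
	unsigned long max_hpages;
	unsigned long min_hpages;
	unsigned long used_hpages;
	unsigned long rsv_hpages;
};

/* Stand-in for h->resv_huge_pages: unsigned, so it can wrap. */
struct hstate_model {
	unsigned long resv_huge_pages;
};

/* Speculative get on the above-min path: only used_hpages moves. */
static void model_subpool_get(struct spool_model *s)
{
	s->used_hpages++;
}

/*
 * Put from unreserve/free: re-arm rsv_hpages only when dropping below
 * min_hpages; otherwise the caller drops one global reservation
 * (folded in here for brevity, as hugetlb_acct_memory(h, -1) would).
 */
static void model_subpool_put(struct spool_model *s, struct hstate_model *h)
{
	s->used_hpages--;
	if (s->used_hpages < s->min_hpages)
		s->rsv_hpages++;
	else
		h->resv_huge_pages--;
}

/* The problematic unwind: only the speculative used_hpages bump is undone. */
static void model_speculative_unwind(struct spool_model *s)
{
	s->used_hpages--;
}

/* unlock_or_release_subpool(): drops min_hpages from the global counter. */
static void model_subpool_release(struct spool_model *s, struct hstate_model *h)
{
	h->resv_huge_pages -= s->min_hpages;
}
```

Replaying get, racing put, unwind, release from the starting state above (max=2, min=1, used=1, rsv=0, global resv=1) leaves resv_huge_pages at (unsigned long)-1 in this model, matching the wrapped value in the observed hits.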
Setup:

- `mount -t hugetlbfs -o pagesize=2M,size=4M,min_size=2M nodev /mnt/htlb`
  (`max_hpages = 2`, `min_hpages = 1`)
- Mapping A pre-creates one file-backed reservation on that subpool,
  bringing the live state to:

      spool->used_hpages = 1
      spool->rsv_hpages  = 0
      h->resv_huge_pages = 1

- A separate anonymous `MAP_HUGETLB` fault consumes one real hugepage.
- `/proc/sys/vm/nr_hugepages` is then shrunk from 2 to 1 so that
  mapping B's hugetlbfs `mmap()` will fail with `-ENOMEM` after taking
  the speculative subpool slot.
- The userspace harness polls hugetlbfs `statfs().f_bfree` and uses
  `f_bfree == 0` as the synchronization point between B's failed
  reserve path and A's release on the same subpool. No kernel
  modification is needed for that alignment.

Race:

1. Thread B enters `hugetlb_reserve_pages(chg=1)` and takes the
   above-min speculative slot.
2. Userspace polls hugetlbfs `statfs().f_bfree` until that speculative
   slot is visible at the mount level (`f_bfree == 0`), then unmaps
   mapping A on the same subpool.
3. Mapping A's close/unreserve path drops one global reservation while
   B still owns only a speculative `used_hpages` slot.
4. Thread B then unwinds only its speculative slot via the existing
   `out_put_pages` cleanup.
5. `umount /mnt/htlb` releases the subpool, and
   `unlock_or_release_subpool()` subtracts `min_hpages` from
   `h->resv_huge_pages` again.

Observed clean-kernel hits:

- run 1: `HIT iter=1026 resv_after=0 resv_umount=18446744073709551615`
- run 2: `HIT iter=22 resv_after=0 resv_umount=18446744073709551615`

Here `resv_after=0` is already the wrong live state before `umount`:
the subpool baseline is still `min_hpages = 1`, so
`/sys/kernel/mm/hugepages/hugepages-2048kB/resv_hugepages` should
still reflect one reserved hugepage at that point. The wrapped value
was then visible by reading the same sysfs file after the umount.

A follow-up probe variant adds a pre-umount snapshot of every
externally-visible counter on hit.
Three back-to-back debug-widened runs all observed identical
pre-umount state:

    resv_hugepages (sysfs)         = 0   (baseline = 1 expected)
    free_hugepages (sysfs)         = 0
    HugePages_Rsvd (/proc/meminfo) = 0
    statfs(mnt).f_bfree            = 2

Note that `statfs` reports the subpool's view (max_hpages -
used_hpages = 2 - 0 = 2 free at the subpool layer), while sysfs
reports the global hstate view (h->free_huge_pages = 0). Readers of
these layers see counter values that disagree with each other and
with the actual reservation state.

Post-umount, the `resv_hugepages` value wraps to ULONG_MAX
(`18446744073709551615`). That wrapped value reaches the per-hstate
sysfs `resv_hugepages` file for this hugepage size class. On
configurations where this hstate is the default hstate, the same
value also reaches `/proc/meminfo`'s `HugePages_Rsvd`.

I can post the userspace-only harness, the pre-umount probe variant,
and the earlier debug-trace patch as follow-up material if that would
help review. So this is no longer just a theoretical concern in
alloc_hugetlb_folio() review: the broader issue already exists today
on the older hugetlb_reserve_pages() path.

Downstream sinks (static analysis, kept to the minimum needed for review)
-------------------------------------------------------------------------

`h->resv_huge_pages` is per-`struct hstate`, shared across mounts and
subpools using the same hugepage size. Once it is corrupted, two
downstream consumers matter immediately:

- `available_huge_pages(h) = free - resv` (mm/hugetlb.c:1334) is a
  raw unsigned subtraction. When `resv > free`, the difference wraps
  to a large nonzero value, so the
  `if (gbl_chg && !available_huge_pages(h))` gate at mm/hugetlb.c:1351
  in dequeue_hugetlb_folio_vma() and the identical predicate at
  mm/hugetlb.c:1997 in dissolve_free_hugetlb_folio() both fail to bail
  out. That bypasses reservation accounting on the `gbl_chg > 0`
  allocation path and on the dissolve path.
- /sys/kernel/mm/hugepages/hugepages-NkB/resv_hugepages
  (mm/hugetlb_sysfs.c:156) exports the raw per-hstate value directly.
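Both sinks reduce to the same raw unsigned arithmetic. A few lines of userspace C make the gate behavior concrete; the model_* names are illustrative stand-ins for the kernel helpers, not the kernel functions themselves.

```c
#include <assert.h>
#include <limits.h>

/* Stand-in for available_huge_pages(): raw unsigned subtraction. */
static unsigned long model_available(unsigned long free_hpages,
				     unsigned long resv_hpages)
{
	return free_hpages - resv_hpages;	/* wraps when resv > free */
}

/* The gate shape: bail out only when the difference is exactly zero. */
static int model_gate_bails(unsigned long gbl_chg,
			    unsigned long free_hpages,
			    unsigned long resv_hpages)
{
	return gbl_chg && !model_available(free_hpages, resv_hpages);
}
```

With free = 0 and a wrapped resv = ULONG_MAX, the difference is 1, so the gate reports pages available and never bails, even though nothing is actually free.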
If this hstate is the default hstate, `/proc/meminfo`'s
`HugePages_Rsvd` (mm/hugetlb.c:4566) exports the same raw value.

I have not yet empirically demonstrated cross-mount reservation
theft, gate bypass on a second mount, or a non-admin trigger path.
The sink analysis above is static only and should be read that way.

What I am not claiming here
---------------------------

- I am not claiming the v3 patch introduces this broader issue.
- I am not claiming a final fix direction yet.

Ask
---

Does the above race description and reproduced state sequence look
correct? If so, I will keep this separate from the v3 thread and
package a reproducer plus a broader min_hpages fix discussion around
the existing hugetlb_reserve_pages() path first.

Thanks,
Zhao