From: "David Hildenbrand (Arm)" <david@kernel.org>
To: Lance Yang <lance.yang@linux.dev>, hughd@google.com
Cc: akpm@linux-foundation.org, baolin.wang@linux.alibaba.com,
baohua@kernel.org, dev.jain@arm.com, liam.howlett@oracle.com,
ljs@kernel.org, mhocko@suse.com, rppt@kernel.org,
npache@redhat.com, zhengqi.arch@bytedance.com,
ryan.roberts@arm.com, surenb@google.com, ziy@nvidia.com,
linux-mm@kvack.org
Subject: Re: [PATCH hotfix] mm: fix pmd_special() fallback to observe huge_zero
Date: Thu, 30 Apr 2026 07:53:05 +0200 [thread overview]
Message-ID: <4d950326-6944-409b-b108-a4e67256857f@kernel.org> (raw)
In-Reply-To: <aa3937bb-7f06-4136-ba36-ce4a06acf80a@linux.dev>
On 4/29/26 09:33, Lance Yang wrote:
>
>
> On 2026/4/29 15:14, David Hildenbrand (Arm) wrote:
>> On 4/29/26 08:57, Lance Yang wrote:
>>>
>>>
>>> Right.
>>>
>>>
>>> The history seems to be:
>>>
>>> 2025-09-13 d82d09e48219 ("mm/huge_memory: mark PMD mappings of the huge
>>> zero folio special")
>>> 2025-09-13 af38538801c6 ("mm/memory: factor out common code from
>>> vm_normal_page_*()")
>>>
>>> After d82d09e48219, vm_normal_page_pmd() still had the explicit huge
>>> zero check before returning the page:
>>>
>>> --8<---
>>> struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
>>> pmd_t pmd)
>>> {
>>> unsigned long pfn = pmd_pfn(pmd);
>>>
>>> if (unlikely(pmd_special(pmd)))
>>> return NULL;
>>>
>>> if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
>>> if (vma->vm_flags & VM_MIXEDMAP) {
>>> if (!pfn_valid(pfn))
>>> return NULL;
>>> goto out;
>>> } else {
>>> unsigned long off;
>>> off = (addr - vma->vm_start) >> PAGE_SHIFT;
>>> if (pfn == vma->vm_pgoff + off)
>>> return NULL;
>>> if (!is_cow_mapping(vma->vm_flags))
>>> return NULL;
>>> }
>>> }
>>>
>>> if (is_huge_zero_pfn(pfn))
>>> return NULL;
>>> if (unlikely(pfn > highest_memmap_pfn))
>>> return NULL;
>>>
>>> /*
>>> * NOTE! We still have PageReserved() pages in the page tables.
>>> * eg. VDSO mappings can cause them to exist.
>>> */
>>> out:
>>> return pfn_to_page(pfn);
>>> }
>>> ---
>>>
>>> So even if pmd_mkspecial() was a no-op and pmd_special() stayed false,
>>> we would still return NULL there.
>>>
>>> Then af38538801c6 moved the PMD path into __vm_normal_page():
>>>
>>> ---8<---
>>> struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
>>> pmd_t pmd)
>>> {
>>> return __vm_normal_page(vma, addr, pmd_pfn(pmd), pmd_special(pmd),
>>> pmd_val(pmd), PGTABLE_LEVEL_PMD);
>>> }
>>> ---
>>>
>>> For CONFIG_ARCH_HAS_PTE_SPECIAL=y, __vm_normal_page() only returns NULL
>>> for the huge zero PFN if special == true. On x86 32-bit, pmd_special()
>>> stays false, so this can now fall through to VM_WARN_ON_ONCE():
>>>
>>> ---8<---
>>> if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
>>> if (unlikely(special)) {
>>> if (is_zero_pfn(pfn) || is_huge_zero_pfn(pfn))
>>> return NULL;
>>> ...
>>> }
>>> ...
>>> } else {
>>> ...
>>> if (is_zero_pfn(pfn) || is_huge_zero_pfn(pfn))
>>> return NULL;
>>> }
>>>
>>> ...
>>> VM_WARN_ON_ONCE(is_zero_pfn(pfn) || is_huge_zero_pfn(pfn));
>>> ...
>>> ---
>>>
>>> So my guess is that the warning above became possible after
>>> af38538801c6, IIUC.
>>
>> Yes, I think you are right about af38538801c6.
>>
>> What about the following then as a temporary solution:
>
> Nice, that works for me :)
Okay, I'd say we do the following:
From fd9ead548f102f7c257980ccc7b96cce7e42a570 Mon Sep 17 00:00:00 2001
From: Hugh Dickins <hughd@google.com>
Date: Thu, 30 Apr 2026 07:35:31 +0200
Subject: [PATCH] mm: fix pmd_special() fallback to observe huge_zero
On x86 32-bit with THP enabled, zap_huge_pmd() is seen to generate a
"WARNING: mm/memory.c:735 at __vm_normal_page+0x6a/0x7d", from the
VM_WARN_ON_ONCE(is_zero_pfn(pfn) || is_huge_zero_pfn(pfn)); followed
by "BUG: Bad rss-counter state"s, then later "BUG: Bad page state"s
when reclaim gets to call shrink_huge_zero_folio_scan().
It's as if the _PAGE_SPECIAL bit never got set in the huge_zero pmd:
and indeed, whereas pte_special() and pte_mkspecial() are subject to a
dedicated CONFIG_ARCH_HAS_PTE_SPECIAL, pmd_special() and pmd_mkspecial()
are subject to CONFIG_ARCH_SUPPORTS_PMD_PFNMAP, which is never enabled
on any 32-bit architecture.
While the problem was exposed through d80a9cb1a64a ("mm/huge_memory: add
and use normal_or_softleaf_folio_pmd()"), it was an oversight in
af38538801c6 ("mm/memory: factor out common code from vm_normal_page_*()")
and would result in other problems:
* huge zero folio accounted in smaps, pagemap (PAGE_IS_FILE) and
numamaps as file-backed THP
* folio_walk_start() returning the folio even without FW_ZEROPAGE set.
Callers seem to tolerate that, though.
... and triggering the VM_WARN_ON_ONE(), although never reported so far.
To fix it, teach vm_normal_page_pmd()/vm_normal_page_pud() whether
pmd_special/pud_special is actually implemented.
Fixes: af38538801c6 ("mm/memory: factor out common code from vm_normal_page_*()")
Signed-off-by: Hugh Dickins <hughd@google.com>>
Co-developed-by: David Hildenbrand (Arm) <david@kernel.org>
Signed-off-by: David Hildenbrand (Arm) <david@kernel.org>
---
mm/memory.c | 17 ++++++++++++++++-
1 file changed, 16 insertions(+), 1 deletion(-)
diff --git a/mm/memory.c b/mm/memory.c
index 7322a40e73b9..a60bc07b48b2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -612,6 +612,21 @@ static void print_bad_page_map(struct vm_area_struct *vma,
dump_stack();
add_taint(TAINT_BAD_PAGE, LOCKDEP_NOW_UNRELIABLE);
}
+
+static inline bool pgtable_level_has_pxx_special(enum pgtable_level level)
+{
+ switch (level) {
+ case PGTABLE_LEVEL_PTE:
+ return IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL);
+ case PGTABLE_LEVEL_PMD:
+ return IS_ENABLED(CONFIG_ARCH_SUPPORTS_PMD_PFNMAP);
+ case PGTABLE_LEVEL_PUD:
+ return IS_ENABLED(CONFIG_ARCH_SUPPORTS_PUD_PFNMAP);
+ default:
+ return false;
+ }
+}
+
#define print_bad_pte(vma, addr, pte, page) \
print_bad_page_map(vma, addr, pte_val(pte), page, PGTABLE_LEVEL_PTE)
@@ -684,7 +699,7 @@ static inline struct page *__vm_normal_page(struct vm_area_struct *vma,
unsigned long addr, unsigned long pfn, bool special,
unsigned long long entry, enum pgtable_level level)
{
- if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
+ if (pgtable_level_has_pxx_special(level)) {
if (unlikely(special)) {
#ifdef CONFIG_FIND_NORMAL_PAGE
if (vma->vm_ops && vma->vm_ops->find_normal_page)
--
2.43.0
--
Cheers,
David
next prev parent reply other threads:[~2026-04-30 5:53 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-04-29 5:08 [PATCH hotfix] mm: fix pmd_special() fallback to observe huge_zero Hugh Dickins
2026-04-29 5:54 ` Lance Yang
2026-04-29 6:12 ` David Hildenbrand (Arm)
2026-04-29 6:57 ` Lance Yang
2026-04-29 7:14 ` David Hildenbrand (Arm)
2026-04-29 7:33 ` Lance Yang
2026-04-30 5:53 ` David Hildenbrand (Arm) [this message]
2026-04-30 6:46 ` Lance Yang
2026-04-30 8:30 ` Lance Yang
2026-04-30 8:48 ` Hugh Dickins
2026-04-30 8:54 ` David Hildenbrand (Arm)
2026-04-30 9:10 ` Lance Yang
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4d950326-6944-409b-b108-a4e67256857f@kernel.org \
--to=david@kernel.org \
--cc=akpm@linux-foundation.org \
--cc=baohua@kernel.org \
--cc=baolin.wang@linux.alibaba.com \
--cc=dev.jain@arm.com \
--cc=hughd@google.com \
--cc=lance.yang@linux.dev \
--cc=liam.howlett@oracle.com \
--cc=linux-mm@kvack.org \
--cc=ljs@kernel.org \
--cc=mhocko@suse.com \
--cc=npache@redhat.com \
--cc=rppt@kernel.org \
--cc=ryan.roberts@arm.com \
--cc=surenb@google.com \
--cc=zhengqi.arch@bytedance.com \
--cc=ziy@nvidia.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.