From: Lance Yang <lance.yang@linux.dev>
To: david@kernel.org, hughd@google.com
Cc: akpm@linux-foundation.org, baolin.wang@linux.alibaba.com,
baohua@kernel.org, dev.jain@arm.com, liam.howlett@oracle.com,
ljs@kernel.org, mhocko@suse.com, rppt@kernel.org,
npache@redhat.com, zhengqi.arch@bytedance.com,
ryan.roberts@arm.com, surenb@google.com, ziy@nvidia.com,
linux-mm@kvack.org, Lance Yang <lance.yang@linux.dev>
Subject: Re: [PATCH hotfix] mm: fix pmd_special() fallback to observe huge_zero
Date: Wed, 29 Apr 2026 14:57:42 +0800
Message-ID: <20260429065743.67054-1-lance.yang@linux.dev>
In-Reply-To: <e7cbc92b-ce9d-432a-ae5b-c8715dcd922f@kernel.org>
On Wed, Apr 29, 2026 at 08:12:55AM +0200, David Hildenbrand (Arm) wrote:
>On 4/29/26 07:54, Lance Yang wrote:
>>
>> On Tue, Apr 28, 2026 at 10:08:37PM -0700, Hugh Dickins wrote:
>>> On x86 32-bit with THP enabled, zap_huge_pmd() is seen to generate a
>>> "WARNING: mm/memory.c:735 at __vm_normal_page+0x6a/0x7d", from the
>>> VM_WARN_ON_ONCE(is_zero_pfn(pfn) || is_huge_zero_pfn(pfn)); followed
>>> by "BUG: Bad rss-counter state"s, then later "BUG: Bad page state"s
>>> when reclaim gets to call shrink_huge_zero_folio_scan().
>>
>> Good catch!
>>
>>> It's as if the _PAGE_SPECIAL bit never got set in the huge_zero pmd:
>>> and indeed, whereas pte_special() and pte_mkspecial() are subject to a
>>> dedicated CONFIG_ARCH_HAS_PTE_SPECIAL, pmd_special() and pmd_mkspecial()
>>> are subject to CONFIG_ARCH_SUPPORTS_PMD_PFNMAP, which is never enabled
>>> on any 32-bit architecture.
>>>
>>> Add CONFIG_ARCH_HAS_PMD_SPECIAL? Perhaps; but I think it's better just
>>> to observe the huge_zero pmd in the fallback version of pmd_special().
>>>
>>> Fixes: d80a9cb1a64a ("mm/huge_memory: add and use normal_or_softleaf_folio_pmd()")
>
>Likely it should be
>
> Fixes: d82d09e48219 ("mm/huge_memory: mark PMD mappings of the huge zero folio special")
>
>Because vm_normal_page_pmd() would return the wrong thing.
Right.
>But I am surprised that we didn't run into the
>
> VM_WARN_ON_ONCE(is_zero_pfn(pfn) || is_huge_zero_pfn(pfn));
>
>earlier.
The history seems to be:
2025-09-13 d82d09e48219 ("mm/huge_memory: mark PMD mappings of the huge zero folio special")
2025-09-13 af38538801c6 ("mm/memory: factor out common code from vm_normal_page_*()")
After d82d09e48219, vm_normal_page_pmd() still had the explicit huge
zero check before returning the page:
---8<---
struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
				pmd_t pmd)
{
	unsigned long pfn = pmd_pfn(pmd);

	if (unlikely(pmd_special(pmd)))
		return NULL;

	if (unlikely(vma->vm_flags & (VM_PFNMAP|VM_MIXEDMAP))) {
		if (vma->vm_flags & VM_MIXEDMAP) {
			if (!pfn_valid(pfn))
				return NULL;
			goto out;
		} else {
			unsigned long off;
			off = (addr - vma->vm_start) >> PAGE_SHIFT;
			if (pfn == vma->vm_pgoff + off)
				return NULL;
			if (!is_cow_mapping(vma->vm_flags))
				return NULL;
		}
	}

	if (is_huge_zero_pfn(pfn))
		return NULL;
	if (unlikely(pfn > highest_memmap_pfn))
		return NULL;

	/*
	 * NOTE! We still have PageReserved() pages in the page tables.
	 * eg. VDSO mappings can cause them to exist.
	 */
out:
	return pfn_to_page(pfn);
}
---
So even if pmd_mkspecial() was a no-op and pmd_special() stayed false,
the explicit is_huge_zero_pfn() check would still return NULL there.
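To make that concrete, here is a minimal userspace model of the old
check order. It is only a sketch, not kernel code: HUGE_ZERO_PFN,
HIGHEST_MEMMAP_PFN, old_pmd_returns_page() and the PFN values are
hypothetical, and the VM_PFNMAP/VM_MIXEDMAP branch is dropped by assuming
an ordinary VMA. With special == false, the huge zero PFN is still
filtered out:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical values, used only by this model. */
#define HUGE_ZERO_PFN      0x1000UL
#define HIGHEST_MEMMAP_PFN 0x100000UL

static bool is_huge_zero_pfn(unsigned long pfn)
{
	return pfn == HUGE_ZERO_PFN;
}

/*
 * Models the pre-af38538801c6 vm_normal_page_pmd() for an ordinary
 * (non-PFNMAP, non-MIXEDMAP) VMA: true means "would return a page".
 */
static bool old_pmd_returns_page(unsigned long pfn, bool special)
{
	if (special)
		return false;
	/* Explicit huge zero check, independent of pmd_special(). */
	if (is_huge_zero_pfn(pfn))
		return false;
	if (pfn > HIGHEST_MEMMAP_PFN)
		return false;
	return true;
}
```

On x86 32-bit, where special is always false for PMDs,
old_pmd_returns_page(HUGE_ZERO_PFN, false) is still false, so a no-op
pmd_mkspecial() was harmless back then.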
Then af38538801c6 moved the PMD path into __vm_normal_page():
---8<---
struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
				pmd_t pmd)
{
	return __vm_normal_page(vma, addr, pmd_pfn(pmd), pmd_special(pmd),
				pmd_val(pmd), PGTABLE_LEVEL_PMD);
}
---
For CONFIG_ARCH_HAS_PTE_SPECIAL=y, __vm_normal_page() only returns NULL
for the huge zero PFN if special == true. On x86 32-bit, pmd_special()
stays false, so this can now fall through to VM_WARN_ON_ONCE():
---8<---
	if (IS_ENABLED(CONFIG_ARCH_HAS_PTE_SPECIAL)) {
		if (unlikely(special)) {
			if (is_zero_pfn(pfn) || is_huge_zero_pfn(pfn))
				return NULL;
			...
		}
		...
	} else {
		...
		if (is_zero_pfn(pfn) || is_huge_zero_pfn(pfn))
			return NULL;
	}
	...
	VM_WARN_ON_ONCE(is_zero_pfn(pfn) || is_huge_zero_pfn(pfn));
	...
---
So IIUC, the warning above only became possible after
af38538801c6.
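A similarly simplified model of the new path shows the difference, and
also what Hugh's fallback changes. Again this is a hypothetical sketch:
model_pmd_t, model_vm_normal_page(), warn_count and the PFN values are
made up; the CONFIG_ARCH_HAS_PTE_SPECIAL=y special branch is collapsed to
"special -> NULL", and the highest_memmap_pfn check is omitted.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical value, used only by this model. */
#define HUGE_ZERO_PFN 0x1000UL

/* Counts every time the modeled VM_WARN_ON_ONCE() condition is hit. */
static int warn_count;

static bool is_huge_zero_pfn(unsigned long pfn)
{
	return pfn == HUGE_ZERO_PFN;
}

/* A PMD reduced to its PFN; the real pmd_t also carries flag bits. */
typedef struct {
	unsigned long pfn;
} model_pmd_t;

/* Fallback pmd_special() before the fix: no PMD special bit here. */
static bool pmd_special_old(model_pmd_t pmd)
{
	(void)pmd;
	return false;
}

/* Fallback after the fix: report the huge zero PMD as special. */
static bool pmd_special_new(model_pmd_t pmd)
{
	return is_huge_zero_pfn(pmd.pfn);
}

/*
 * Simplified CONFIG_ARCH_HAS_PTE_SPECIAL=y flow of __vm_normal_page():
 * true means "would return a page".
 */
static bool model_vm_normal_page(unsigned long pfn, bool special)
{
	if (special)
		return false;	/* huge zero (or other special): NULL */
	if (is_huge_zero_pfn(pfn))
		warn_count++;	/* VM_WARN_ON_ONCE(... is_huge_zero_pfn(pfn)) */
	return true;		/* pfn_to_page(pfn) is returned regardless */
}
```

With pmd_special_old(), the huge zero PFN reaches the warning and a page
is still handed back, which would fit the "Bad rss-counter state" and
"Bad page state" reports; with pmd_special_new(), the same PMD takes the
special branch and gets NULL.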
>
>>> Signed-off-by: Hugh Dickins <hughd@google.com>
>>> ---
>>> include/linux/mm.h | 2 +-
>>> 1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/include/linux/mm.h b/include/linux/mm.h
>>> index 0b776907152e..3b02ac43bcb7 100644
>>> --- a/include/linux/mm.h
>>> +++ b/include/linux/mm.h
>>> @@ -3422,7 +3422,7 @@ static inline pte_t pte_mkspecial(pte_t pte)
>>> #ifndef CONFIG_ARCH_SUPPORTS_PMD_PFNMAP
>>> static inline bool pmd_special(pmd_t pmd)
>>> {
>>> - return false;
>>> + return is_huge_zero_pmd(pmd);
>>> }
>>
>> Emm ... feels a bit odd to me ...
>
>Agreed. But it could be a temporary fix until we fixed up relevant architectures.
Ah, got it :D
>>
>> On these configs pmd_mkspecial() is still a no-op, so pmd_special()
>> would no longer really mean that the PMD was made special :)
>>
>> Could we handle the huge zero PMD in vm_normal_page_pmd() instead?
>
>That adds unnecessary checks for architectures that properly implement pmd_special.
>
>pmd_special() should be fixed longterm on architectures that support THP
>and CONFIG_ARCH_HAS_PTE_SPECIAL. It should not be glued to CONFIG_ARCH_SUPPORTS_PMD_PFNMAP.
>
>
>arch/arc/Kconfig: select HAVE_ARCH_TRANSPARENT_HUGEPAGE if ARC_MMU_V4
>arch/arm/Kconfig: select HAVE_ARCH_TRANSPARENT_HUGEPAGE if ARM_LPAE
>arch/arm64/Kconfig: select HAVE_ARCH_TRANSPARENT_HUGEPAGE
>arch/loongarch/Kconfig: select HAVE_ARCH_TRANSPARENT_HUGEPAGE
>arch/mips/Kconfig: select HAVE_ARCH_TRANSPARENT_HUGEPAGE if CPU_SUPPORTS_HUGEPAGES
>arch/powerpc/platforms/Kconfig.cputype: select HAVE_ARCH_TRANSPARENT_HUGEPAGE
>arch/riscv/Kconfig: select HAVE_ARCH_TRANSPARENT_HUGEPAGE if 64BIT && MMU
>arch/s390/Kconfig: select HAVE_ARCH_TRANSPARENT_HUGEPAGE
>arch/sparc/Kconfig: select HAVE_ARCH_TRANSPARENT_HUGEPAGE
>arch/x86/Kconfig: select HAVE_ARCH_TRANSPARENT_HUGEPAGE
>
>arch/arc/Kconfig: select ARCH_HAS_PTE_SPECIAL
>arch/arm/Kconfig: select ARCH_HAS_PTE_SPECIAL if ARM_LPAE
>arch/arm64/Kconfig: select ARCH_HAS_PTE_SPECIAL
>arch/loongarch/Kconfig: select ARCH_HAS_PTE_SPECIAL
>arch/mips/Kconfig: select ARCH_HAS_PTE_SPECIAL if !(32BIT && CPU_HAS_RIXI)
>arch/parisc/Kconfig: select ARCH_HAS_PTE_SPECIAL
>arch/powerpc/Kconfig: select ARCH_HAS_PTE_SPECIAL
>arch/riscv/Kconfig: select ARCH_HAS_PTE_SPECIAL
>arch/s390/Kconfig: select ARCH_HAS_PTE_SPECIAL
>arch/sh/Kconfig: select ARCH_HAS_PTE_SPECIAL
>arch/sparc/Kconfig: select ARCH_HAS_PTE_SPECIAL
>arch/x86/Kconfig: select ARCH_HAS_PTE_SPECIAL
>
>That's a bit of work given that only arm64, powerpc (64), riscv and x86 (64)
>properly implement pmd_special().
>
>
>So I think Hugh's patch here makes sense for now.
Lesson learned :D thanks!
Lance
Thread overview: 12+ messages
2026-04-29 5:08 [PATCH hotfix] mm: fix pmd_special() fallback to observe huge_zero Hugh Dickins
2026-04-29 5:54 ` Lance Yang
2026-04-29 6:12 ` David Hildenbrand (Arm)
2026-04-29 6:57 ` Lance Yang [this message]
2026-04-29 7:14 ` David Hildenbrand (Arm)
2026-04-29 7:33 ` Lance Yang
2026-04-30 5:53 ` David Hildenbrand (Arm)
2026-04-30 6:46 ` Lance Yang
2026-04-30 8:30 ` Lance Yang
2026-04-30 8:48 ` Hugh Dickins
2026-04-30 8:54 ` David Hildenbrand (Arm)
2026-04-30 9:10 ` Lance Yang