* [PATCH] mm/page_table_check: do not track special (PFN-mapped) PTEs
From: Andrey Smirnov @ 2026-06-08 15:57 UTC (permalink / raw)
To: pasha.tatashin, akpm
Cc: linux-mm, linux-kernel, linux-riscv, pjw, palmer, aou, alex,
syzbot+2b5fe617654be3d8848b, Andrey Smirnov, Thomas Gleixner,
Thomas Weißschuh, Andrei Vagin, Andy Lutomirski,
Vincenzo Frascino, stable

The vDSO data store ("[vvar]") special mapping is created as a VM_PFNMAP
mapping and its pages are installed into userspace with vmf_insert_pfn(),
which produces special PTEs (pte_special()). On x86 and arm64 (and riscv)
pte_user_accessible_page() only tests the PRESENT/USER bits and does not
exclude special PTEs, so page_table_check accounts these PFN mappings in
the per-page anon/file map counters even though they are not rmap-managed
pages (vm_normal_page() returns NULL for them).
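
To make the mismatch concrete, here is a minimal userspace model of the
predicate described above. It is only a sketch: the bit values and every
model_* name are hypothetical and do not match any real architecture's PTE
layout. It shows that a special PTE passes a PRESENT/USER-only check even
though it is not an rmap-managed page.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical bit layout, for illustration only. Real layouts are per-arch. */
#define MODEL_PTE_PRESENT (1u << 0)
#define MODEL_PTE_USER    (1u << 1)
#define MODEL_PTE_SPECIAL (1u << 2)

typedef uint32_t model_pte_t;

/* Models a predicate that tests only the PRESENT and USER bits. */
static bool model_pte_user_accessible_page(model_pte_t pte)
{
	return (pte & MODEL_PTE_PRESENT) && (pte & MODEL_PTE_USER);
}

/* Models pte_special(): true for PFN-mapped (special) PTEs. */
static bool model_pte_special(model_pte_t pte)
{
	return (pte & MODEL_PTE_SPECIAL) != 0;
}

/*
 * Models the vm_normal_page() outcome stated above: a special PTE has no
 * rmap-managed page behind it.
 */
static bool model_is_rmap_managed(model_pte_t pte)
{
	return (pte & MODEL_PTE_PRESENT) && !model_pte_special(pte);
}
```

With a PTE built as MODEL_PTE_PRESENT | MODEL_PTE_USER | MODEL_PTE_SPECIAL,
model_pte_user_accessible_page() returns true while model_is_rmap_managed()
returns false, which is the combination that leads to the stray accounting.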

Most of these data pages live in the kernel image and are never freed, so
the stray accounting is invisible. The time-namespace VVAR page is the
exception: it is a real alloc_page() page that is released with
__free_page() in free_time_ns() when the last task of a time namespace
exits. Across the map / unmap / vdso_join_timens() zap transitions the
special-PTE accounting is not balanced for this page, so a non-zero
file_map_count survives to the free path and trips:

  kernel BUG at mm/page_table_check.c:143!
  __page_table_check_zero+0xfb/0x130
  __free_frozen_pages+0x52f/0x650
  free_time_ns+0x85/0xc0
  free_nsproxy+0x7f/0x130
  do_exit+0x313/0xa60
  do_group_exit+0x77/0x90

This is reliably reproducible on x86_64 and arm64 under heavy container/CI
churn that rapidly creates and destroys time namespaces (CLONE_NEWTIME via
runc / docker-init / tini), and was independently reported by syzbot on
riscv. It only manifests when CONFIG_PAGE_TABLE_CHECK is active.

Special PTEs have no struct-page rmap semantics and must never have been
tracked by page table check. Skip them in both the set and clear paths so
the counters stay balanced (always zero) for PFN-mapped pages, regardless
of how the architecture defines pte_user_accessible_page(). pte_special()
is available generically (it is a no-op returning false on architectures
without ARCH_HAS_PTE_SPECIAL), so this is a single, arch-independent fix.
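
The effect of skipping special PTEs can be sketched with a small userspace
model of the counter. This is illustrative only: all model_* names are
hypothetical, and the "two sets, one clear" sequence used in the usage note
is an arbitrary stand-in for the unbalanced transitions, whose exact order is
not spelled out above.

```c
#include <assert.h>
#include <stdbool.h>

struct model_page {
	int file_map_count;	/* stands in for the per-page map counter */
};

struct model_pte {
	bool user_accessible;	/* result of the arch predicate */
	bool special;		/* result of pte_special() */
};

/* Tracking decision: skip_special selects the patched behaviour. */
static bool model_should_track(struct model_pte pte, bool skip_special)
{
	return pte.user_accessible && !(skip_special && pte.special);
}

static void model_pte_set(struct model_page *page, struct model_pte pte,
			  bool skip_special)
{
	if (model_should_track(pte, skip_special))
		page->file_map_count++;
}

static void model_pte_clear(struct model_page *page, struct model_pte pte,
			    bool skip_special)
{
	if (model_should_track(pte, skip_special))
		page->file_map_count--;
}

/*
 * Runs `sets` set operations and `clears` clear operations on one special
 * PTE, then reports whether the page could be freed with a zero counter.
 */
static bool model_free_is_clean(int sets, int clears, bool skip_special)
{
	struct model_page page = { 0 };
	struct model_pte vvar = { .user_accessible = true, .special = true };

	for (int i = 0; i < sets; i++)
		model_pte_set(&page, vvar, skip_special);
	for (int i = 0; i < clears; i++)
		model_pte_clear(&page, vvar, skip_special);
	return page.file_map_count == 0;
}
```

In this model, model_free_is_clean(2, 1, false) is false because one count
survives to the free path, while model_free_is_clean(2, 1, true) is true
because the special PTE never touches the counter in either direction.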

Note that the v7.0 generic vDSO datastore rework in commit 05988dba1179
("vdso/datastore: Allocate data pages dynamically") incidentally avoids
the problem by switching the mapping to VM_MIXEDMAP + vmf_insert_page()
with balanced struct-page accounting. This patch fixes the still-affected
VM_PFNMAP path used by 6.18.y and earlier, and additionally makes
page_table_check robust against any future PFN-mapped user pages.

Fixes: df4e817b7108 ("mm: page table check")
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Thomas Weißschuh <thomas.weissschuh@linutronix.de>
Cc: Andrei Vagin <avagin@gmail.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Vincenzo Frascino <vincenzo.frascino@arm.com>
Reported-by: syzbot+2b5fe617654be3d8848b@syzkaller.appspotmail.com
Closes: https://github.com/siderolabs/talos/issues/13496
Cc: stable@vger.kernel.org
Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
---
 mm/page_table_check.c | 13 ++++++++++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/mm/page_table_check.c b/mm/page_table_check.c
index 4eeca782b888..ee492d5389b9 100644
--- a/mm/page_table_check.c
+++ b/mm/page_table_check.c
@@ -150,9 +150,16 @@ void __page_table_check_pte_clear(struct mm_struct *mm, pte_t pte)
 	if (&init_mm == mm)
 		return;
 
-	if (pte_user_accessible_page(pte)) {
+	/*
+	 * PFN-mapped (special) PTEs - e.g. the vDSO/time-namespace "[vvar]"
+	 * mapping installed via vmf_insert_pfn() - are not rmap-managed and
+	 * must not be tracked here. Tracking them can leave a non-zero map
+	 * count on a struct page that is later freed (the time namespace VVAR
+	 * page in free_time_ns()), tripping the BUG_ON() in
+	 * __page_table_check_zero().
+	 */
+	if (pte_user_accessible_page(pte) && !pte_special(pte))
 		page_table_check_clear(pte_pfn(pte), PAGE_SIZE >> PAGE_SHIFT);
-	}
 }
 EXPORT_SYMBOL(__page_table_check_pte_clear);
 
@@ -205,7 +212,7 @@ void __page_table_check_ptes_set(struct mm_struct *mm, pte_t *ptep, pte_t pte,
 
 	for (i = 0; i < nr; i++)
 		__page_table_check_pte_clear(mm, ptep_get(ptep + i));
-	if (pte_user_accessible_page(pte))
+	if (pte_user_accessible_page(pte) && !pte_special(pte))
 		page_table_check_set(pte_pfn(pte), nr, pte_write(pte));
 }
 EXPORT_SYMBOL(__page_table_check_ptes_set);
--
2.53.0

* Re: [PATCH] mm/page_table_check: do not track special (PFN-mapped) PTEs
From: Andrew Morton @ 2026-06-08 21:22 UTC (permalink / raw)
To: Andrey Smirnov
Cc: pasha.tatashin, linux-mm, linux-kernel, linux-riscv, pjw, palmer,
aou, alex, syzbot+2b5fe617654be3d8848b, Thomas Gleixner,
Thomas Weißschuh, Andrei Vagin, Andy Lutomirski,
Vincenzo Frascino, stable

On Mon, 8 Jun 2026 19:57:58 +0400 Andrey Smirnov <andrey.smirnov@siderolabs.com> wrote:

> The vDSO data store ("[vvar]") special mapping is created as a VM_PFNMAP
> mapping and its pages are installed into userspace with vmf_insert_pfn(),
> which produces special PTEs (pte_special()). On x86 and arm64 (and riscv)
> pte_user_accessible_page() only tests the PRESENT/USER bits and does not
> exclude special PTEs, so page_table_check accounts these PFN mappings in
> the per-page anon/file map counters even though they are not rmap-managed
> pages (vm_normal_page() returns NULL for them).
>
> Most of these data pages live in the kernel image and are never freed, so
> the stray accounting is invisible. The time-namespace VVAR page is the
> exception: it is a real alloc_page() page that is released with
> __free_page() in free_time_ns() when the last task of a time namespace
> exits. Across the map / unmap / vdso_join_timens() zap transitions the
> special-PTE accounting is not balanced for this page, so a non-zero
> file_map_count survives to the free path and trips:
>
> kernel BUG at mm/page_table_check.c:143!
> __page_table_check_zero+0xfb/0x130
> __free_frozen_pages+0x52f/0x650
> free_time_ns+0x85/0xc0
> free_nsproxy+0x7f/0x130
> do_exit+0x313/0xa60
> do_group_exit+0x77/0x90
>
> This is reliably reproducible on x86_64 and arm64 under heavy container/CI
> churn that rapidly creates and destroys time namespaces (CLONE_NEWTIME via
> runc / docker-init / tini), and was independently reported by syzbot on
> riscv. It only manifests when CONFIG_PAGE_TABLE_CHECK is active.
>
> Special PTEs have no struct-page rmap semantics and must never have been
> tracked by page table check. Skip them in both the set and clear paths so
> the counters stay balanced (always zero) for PFN-mapped pages, regardless
> of how the architecture defines pte_user_accessible_page(). pte_special()
> is available generically (it is a no-op returning false on architectures
> without ARCH_HAS_PTE_SPECIAL), so this is a single, arch-independent fix.
>
> Note that the v7.0 generic vDSO datastore rework in commit 05988dba1179
> ("vdso/datastore: Allocate data pages dynamically") incidentally avoids
> the problem by switching the mapping to VM_MIXEDMAP + vmf_insert_page()
> with balanced struct-page accounting. This patch fixes the still-affected
> VM_PFNMAP path used by 6.18.y and earlier, and additionally makes
> page_table_check robust against any future PFN-mapped user pages.

Thanks.

The patch isn't applicable to current -linus mainline. I reworked it
as below, then deleted it. It would be better if this rework came from
yourself (tested), please. And a patch which applies will get checked
by Sashiko AI review.

--- a/mm/page_table_check.c~mm-page_table_check-do-not-track-special-pfn-mapped-ptes
+++ a/mm/page_table_check.c
@@ -151,7 +151,15 @@ void __page_table_check_pte_clear(struct
 	if (&init_mm == mm)
 		return;
 
-	if (pte_user_accessible_page(mm, addr, pte))
+	/*
+	 * PFN-mapped (special) PTEs - e.g. the vDSO/time-namespace "[vvar]"
+	 * mapping installed via vmf_insert_pfn() - are not rmap-managed and
+	 * must not be tracked here. Tracking them can leave a non-zero map
+	 * count on a struct page that is later freed (the time namespace VVAR
+	 * page in free_time_ns()), tripping the BUG_ON() in
+	 * __page_table_check_zero().
+	 */
+	if (pte_user_accessible_page(mm, addr, pte) && !pte_special(pte))
 		page_table_check_clear(pte_pfn(pte), PAGE_SIZE >> PAGE_SHIFT);
 }
 EXPORT_SYMBOL(__page_table_check_pte_clear);
@@ -208,7 +216,7 @@ void __page_table_check_ptes_set(struct
 
 	for (i = 0; i < nr; i++)
 		__page_table_check_pte_clear(mm, addr + PAGE_SIZE * i, ptep_get(ptep + i));
-	if (pte_user_accessible_page(mm, addr, pte))
+	if (pte_user_accessible_page(mm, addr, pte) && !pte_special(pte))
 		page_table_check_set(pte_pfn(pte), nr, pte_write(pte));
 }
 EXPORT_SYMBOL(__page_table_check_ptes_set);
_

* Re: [PATCH] mm/page_table_check: do not track special (PFN-mapped) PTEs
From: Pasha Tatashin @ 2026-06-09 2:23 UTC (permalink / raw)
To: Andrew Morton
Cc: Andrey Smirnov, pasha.tatashin, linux-mm, linux-kernel,
linux-riscv, pjw, palmer, aou, alex, syzbot+2b5fe617654be3d8848b,
Thomas Gleixner, Thomas Weißschuh, Andrei Vagin,
Andy Lutomirski, Vincenzo Frascino, stable

On 06-08 14:22, Andrew Morton wrote:
> On Mon, 8 Jun 2026 19:57:58 +0400 Andrey Smirnov <andrey.smirnov@siderolabs.com> wrote:
>
> > The vDSO data store ("[vvar]") special mapping is created as a VM_PFNMAP
> > mapping and its pages are installed into userspace with vmf_insert_pfn(),
> > which produces special PTEs (pte_special()). On x86 and arm64 (and riscv)
> > pte_user_accessible_page() only tests the PRESENT/USER bits and does not
> > exclude special PTEs, so page_table_check accounts these PFN mappings in
> > the per-page anon/file map counters even though they are not rmap-managed
> > pages (vm_normal_page() returns NULL for them).
> >
> > Most of these data pages live in the kernel image and are never freed, so
> > the stray accounting is invisible. The time-namespace VVAR page is the
> > exception: it is a real alloc_page() page that is released with
> > __free_page() in free_time_ns() when the last task of a time namespace
> > exits. Across the map / unmap / vdso_join_timens() zap transitions the
> > special-PTE accounting is not balanced for this page, so a non-zero
> > file_map_count survives to the free path and trips:
> >
> > kernel BUG at mm/page_table_check.c:143!
> > __page_table_check_zero+0xfb/0x130
> > __free_frozen_pages+0x52f/0x650
> > free_time_ns+0x85/0xc0
> > free_nsproxy+0x7f/0x130
> > do_exit+0x313/0xa60
> > do_group_exit+0x77/0x90
> >
> > This is reliably reproducible on x86_64 and arm64 under heavy container/CI
> > churn that rapidly creates and destroys time namespaces (CLONE_NEWTIME via
> > runc / docker-init / tini), and was independently reported by syzbot on
> > riscv. It only manifests when CONFIG_PAGE_TABLE_CHECK is active.
> >
> > Special PTEs have no struct-page rmap semantics and must never have been
> > tracked by page table check. Skip them in both the set and clear paths so
> > the counters stay balanced (always zero) for PFN-mapped pages, regardless
> > of how the architecture defines pte_user_accessible_page(). pte_special()
> > is available generically (it is a no-op returning false on architectures
> > without ARCH_HAS_PTE_SPECIAL), so this is a single, arch-independent fix.
> >
> > Note that the v7.0 generic vDSO datastore rework in commit 05988dba1179
> > ("vdso/datastore: Allocate data pages dynamically") incidentally avoids
> > the problem by switching the mapping to VM_MIXEDMAP + vmf_insert_page()
> > with balanced struct-page accounting. This patch fixes the still-affected
> > VM_PFNMAP path used by 6.18.y and earlier, and additionally makes
> > page_table_check robust against any future PFN-mapped user pages.

Thank you for the detailed explanation of the bug; it makes sense to me.

> Thanks.
>
> The patch isn't applicable to current -linus mainline. I reworked it
> as below, then deleted it. It would be better if this rework came from
> yourself (tested), please. And a patch which applies will get checked
> by Sashiko AI review.

+1.

Pasha

> --- a/mm/page_table_check.c~mm-page_table_check-do-not-track-special-pfn-mapped-ptes
> +++ a/mm/page_table_check.c
> @@ -151,7 +151,15 @@ void __page_table_check_pte_clear(struct
>  	if (&init_mm == mm)
>  		return;
>
> -	if (pte_user_accessible_page(mm, addr, pte))
> +	/*
> +	 * PFN-mapped (special) PTEs - e.g. the vDSO/time-namespace "[vvar]"
> +	 * mapping installed via vmf_insert_pfn() - are not rmap-managed and
> +	 * must not be tracked here. Tracking them can leave a non-zero map
> +	 * count on a struct page that is later freed (the time namespace VVAR
> +	 * page in free_time_ns()), tripping the BUG_ON() in
> +	 * __page_table_check_zero().
> +	 */
> +	if (pte_user_accessible_page(mm, addr, pte) && !pte_special(pte))
>  		page_table_check_clear(pte_pfn(pte), PAGE_SIZE >> PAGE_SHIFT);
>  }
>  EXPORT_SYMBOL(__page_table_check_pte_clear);
> @@ -208,7 +216,7 @@ void __page_table_check_ptes_set(struct
>
>  	for (i = 0; i < nr; i++)
>  		__page_table_check_pte_clear(mm, addr + PAGE_SIZE * i, ptep_get(ptep + i));
> -	if (pte_user_accessible_page(mm, addr, pte))
> +	if (pte_user_accessible_page(mm, addr, pte) && !pte_special(pte))
>  		page_table_check_set(pte_pfn(pte), nr, pte_write(pte));
>  }
>  EXPORT_SYMBOL(__page_table_check_ptes_set);
> _
>