From: Pasha Tatashin <pasha.tatashin@soleen.com>
To: Andrew Morton <akpm@linux-foundation.org>
Cc: "Andrey Smirnov" <andrey.smirnov@siderolabs.com>,
pasha.tatashin@soleen.com, linux-mm@kvack.org,
linux-kernel@vger.kernel.org, linux-riscv@lists.infradead.org,
pjw@kernel.org, palmer@dabbelt.com, aou@eecs.berkeley.edu,
alex@ghiti.fr,
syzbot+2b5fe617654be3d8848b@syzkaller.appspotmail.com,
"Thomas Gleixner" <tglx@linutronix.de>,
"Thomas Weißschuh" <thomas.weissschuh@linutronix.de>,
"Andrei Vagin" <avagin@gmail.com>,
"Andy Lutomirski" <luto@kernel.org>,
"Vincenzo Frascino" <vincenzo.frascino@arm.com>,
stable@vger.kernel.org
Subject: Re: [PATCH] mm/page_table_check: do not track special (PFN-mapped) PTEs
Date: Tue, 9 Jun 2026 02:23:28 +0000
Message-ID: <aid4yw9WRvZEm2BV@plex>
In-Reply-To: <20260608142258.5028187b1d245b46554eb2dc@linux-foundation.org>
On 06-08 14:22, Andrew Morton wrote:
> On Mon, 8 Jun 2026 19:57:58 +0400 Andrey Smirnov <andrey.smirnov@siderolabs.com> wrote:
>
> > The vDSO data store ("[vvar]") special mapping is created as a VM_PFNMAP
> > mapping and its pages are installed into userspace with vmf_insert_pfn(),
> > which produces special PTEs (pte_special()). On x86 and arm64 (and riscv)
> > pte_user_accessible_page() only tests the PRESENT/USER bits and does not
> > exclude special PTEs, so page_table_check accounts these PFN mappings in
> > the per-page anon/file map counters even though they are not rmap-managed
> > pages (vm_normal_page() returns NULL for them).
> >
> > Most of these data pages live in the kernel image and are never freed, so
> > the stray accounting is invisible. The time-namespace VVAR page is the
> > exception: it is a real alloc_page() page that is released with
> > __free_page() in free_time_ns() when the last task of a time namespace
> > exits. Across the map / unmap / vdso_join_timens() zap transitions the
> > special-PTE accounting is not balanced for this page, so a non-zero
> > file_map_count survives to the free path and trips:
> >
> > kernel BUG at mm/page_table_check.c:143!
> > __page_table_check_zero+0xfb/0x130
> > __free_frozen_pages+0x52f/0x650
> > free_time_ns+0x85/0xc0
> > free_nsproxy+0x7f/0x130
> > do_exit+0x313/0xa60
> > do_group_exit+0x77/0x90
> >
> > This is reliably reproducible on x86_64 and arm64 under heavy container/CI
> > churn that rapidly creates and destroys time namespaces (CLONE_NEWTIME via
> > runc / docker-init / tini), and was independently reported by syzbot on
> > riscv. It only manifests when CONFIG_PAGE_TABLE_CHECK is active.
> >
> > Special PTEs have no struct-page rmap semantics and must never have been
> > tracked by page table check. Skip them in both the set and clear paths so
> > the counters stay balanced (always zero) for PFN-mapped pages, regardless
> > of how the architecture defines pte_user_accessible_page(). pte_special()
> > is available generically (it is a no-op returning false on architectures
> > without ARCH_HAS_PTE_SPECIAL), so this is a single, arch-independent fix.
> >
> > Note that the v7.0 generic vDSO datastore rework in commit 05988dba1179
> > ("vdso/datastore: Allocate data pages dynamically") incidentally avoids
> > the problem by switching the mapping to VM_MIXEDMAP + vmf_insert_page()
> > with balanced struct-page accounting. This patch fixes the still-affected
> > VM_PFNMAP path used by 6.18.y and earlier, and additionally makes
> > page_table_check robust against any future PFN-mapped user pages.
Thank you for the detailed explanation of the bug; it makes sense to me.
> Thanks.
>
> The patch isn't applicable to current -linus mainline. I reworked it
> as below, then deleted it. It would be better if this rework came from
> yourself (tested), please. And a patch which applies will get checked
> by Sashiko AI review.
+1.
Pasha
> --- a/mm/page_table_check.c~mm-page_table_check-do-not-track-special-pfn-mapped-ptes
> +++ a/mm/page_table_check.c
> @@ -151,7 +151,15 @@ void __page_table_check_pte_clear(struct
>  	if (&init_mm == mm)
>  		return;
>
> -	if (pte_user_accessible_page(mm, addr, pte))
> +	/*
> +	 * PFN-mapped (special) PTEs - e.g. the vDSO/time-namespace "[vvar]"
> +	 * mapping installed via vmf_insert_pfn() - are not rmap-managed and
> +	 * must not be tracked here. Tracking them can leave a non-zero map
> +	 * count on a struct page that is later freed (the time namespace VVAR
> +	 * page in free_time_ns()), tripping the BUG_ON() in
> +	 * __page_table_check_zero().
> +	 */
> +	if (pte_user_accessible_page(mm, addr, pte) && !pte_special(pte))
>  		page_table_check_clear(pte_pfn(pte), PAGE_SIZE >> PAGE_SHIFT);
>  }
>  EXPORT_SYMBOL(__page_table_check_pte_clear);
> @@ -208,7 +216,7 @@ void __page_table_check_ptes_set(struct
>
>  	for (i = 0; i < nr; i++)
>  		__page_table_check_pte_clear(mm, addr + PAGE_SIZE * i, ptep_get(ptep + i));
> -	if (pte_user_accessible_page(mm, addr, pte))
> +	if (pte_user_accessible_page(mm, addr, pte) && !pte_special(pte))
>  		page_table_check_set(pte_pfn(pte), nr, pte_write(pte));
>  }
>  EXPORT_SYMBOL(__page_table_check_ptes_set);
> _
>
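
For readers who want the accounting argument in concrete form, here is a
minimal userspace sketch in C. It is not kernel code and is not taken from
the patch: struct model_pte, model_pte_set(), model_pte_clear(),
model_leftover_count() and model_normal_roundtrip() are hypothetical
stand-ins invented for illustration. The "set without a matching clear"
sequence is also an assumption; the thread only says the map / unmap / zap
transitions are unbalanced, not which step loses the clear.

```c
#include <stdbool.h>

/* Hypothetical model of a PTE; not the kernel's pte_t. */
struct model_pte {
	bool present;
	bool user;
	bool special;		/* models pte_special(), as set by a PFN insert */
	unsigned long pfn;
};

#define MODEL_NR_PAGES 4

/* Models the per-page file map counter kept by page table check. */
static int model_file_map_count[MODEL_NR_PAGES];

/* Models an arch helper that only tests the PRESENT/USER bits. */
static bool model_user_accessible(struct model_pte pte)
{
	return pte.present && pte.user;
}

/* With skip_special set, special PTEs are never accounted. */
static bool model_should_track(struct model_pte pte, bool skip_special)
{
	return model_user_accessible(pte) && !(skip_special && pte.special);
}

static void model_pte_set(struct model_pte pte, bool skip_special)
{
	if (model_should_track(pte, skip_special))
		model_file_map_count[pte.pfn]++;
}

static void model_pte_clear(struct model_pte pte, bool skip_special)
{
	if (model_should_track(pte, skip_special))
		model_file_map_count[pte.pfn]--;
}

static void model_reset(void)
{
	for (int i = 0; i < MODEL_NR_PAGES; i++)
		model_file_map_count[i] = 0;
}

/*
 * ASSUMPTION: one set of a special PTE is never paired with a clear.
 * Returns the count that would still be present when the page is freed.
 */
static int model_leftover_count(bool skip_special)
{
	struct model_pte vvar = {
		.present = true, .user = true, .special = true, .pfn = 1,
	};

	model_reset();
	model_pte_set(vvar, skip_special);
	/* The matching model_pte_clear() is deliberately omitted here. */
	return model_file_map_count[vvar.pfn];
}

/* A normal (non-special) page with a balanced set/clear pair. */
static int model_normal_roundtrip(bool skip_special)
{
	struct model_pte normal = {
		.present = true, .user = true, .special = false, .pfn = 2,
	};

	model_reset();
	model_pte_set(normal, skip_special);
	model_pte_clear(normal, skip_special);
	return model_file_map_count[normal.pfn];
}
```

Under these assumptions, model_leftover_count(false) leaves a count of 1 on
the page, which is the kind of non-zero value that trips the check at free
time, while model_leftover_count(true) leaves 0 because the special PTE is
skipped on both paths. model_normal_roundtrip() returns 0 in both modes, so
ordinary pages are still tracked and balanced as before.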