From: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
To: David Hildenbrand <david@redhat.com>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	xen-devel@lists.xenproject.org, linux-fsdevel@vger.kernel.org,
	nvdimm@lists.linux.dev, Andrew Morton <akpm@linux-foundation.org>,
	Juergen Gross <jgross@suse.com>,
	Stefano Stabellini <sstabellini@kernel.org>,
	Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Matthew Wilcox <willy@infradead.org>, Jan Kara <jack@suse.cz>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Christian Brauner <brauner@kernel.org>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@suse.cz>, Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>, Zi Yan <ziy@nvidia.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Nico Pache <npache@redhat.com>,
	Ryan Roberts <ryan.roberts@arm.com>, Dev Jain <dev.jain@arm.com>,
	Barry Song <baohua@kernel.org>, Jann Horn <jannh@google.com>,
	Pedro Falcato <pfalcato@suse.de>, Hugh Dickins <hughd@google.com>,
	Oscar Salvador <osalvador@suse.de>,
	Lance Yang <lance.yang@linux.dev>
Subject: Re: [PATCH v2 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map()
Date: Fri, 18 Jul 2025 11:15:47 +0100
Message-ID: <200da552-4fc7-44d8-bbea-1669b4b45cf5@lucifer.local>
In-Reply-To: <c93f873c-813e-42c9-ba01-93c2fa22fb8d@redhat.com>

On Thu, Jul 17, 2025 at 10:03:31PM +0200, David Hildenbrand wrote:
> > > The report will now look something like (dumping pgd to pmd values):
> > >
> > > [   77.943408] BUG: Bad page map in process XXX  entry:80000001233f5867
> > > [   77.944077] addr:00007fd84bb1c000 vm_flags:08100071 anon_vma: ...
> > > [   77.945186] pgd:10a89f067 p4d:10a89f067 pud:10e5a2067 pmd:105327067
> > >
> > > Not using pgdp_get(), because that does not work properly on some arm
> > > configs where pgd_t is an array. Note that we are dumping all levels
> > > even when levels are folded for simplicity.
> >
> > Oh god. I reviewed this below. BUT OH GOD. What. Why???
>
> Exactly.
>
> I wish this patch wouldn't exist, but Hugh convinced me that apparently it
> can be useful in the real world.

Yeah... I mean, out of scope here, but that sounds dubious. On the other hand,
we use typedefs for page table values etc., so we do make this possible.

>
> >
> > >
> > > Signed-off-by: David Hildenbrand <david@redhat.com>
> > > ---
> > >   mm/memory.c | 120 ++++++++++++++++++++++++++++++++++++++++------------
> > >   1 file changed, 94 insertions(+), 26 deletions(-)
> > >
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 173eb6267e0ac..08d16ed7b4cc7 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -473,22 +473,8 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
> > >   			add_mm_counter(mm, i, rss[i]);
> > >   }
> > >
> > > -/*
> > > - * This function is called to print an error when a bad pte
> > > - * is found. For example, we might have a PFN-mapped pte in
> > > - * a region that doesn't allow it.
> > > - *
> > > - * The calling function must still handle the error.
> > > - */
> > > -static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> > > -			  pte_t pte, struct page *page)
> > > +static bool is_bad_page_map_ratelimited(void)
> > >   {
> > > -	pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> > > -	p4d_t *p4d = p4d_offset(pgd, addr);
> > > -	pud_t *pud = pud_offset(p4d, addr);
> > > -	pmd_t *pmd = pmd_offset(pud, addr);
> > > -	struct address_space *mapping;
> > > -	pgoff_t index;
> > >   	static unsigned long resume;
> > >   	static unsigned long nr_shown;
> > >   	static unsigned long nr_unshown;
> > > @@ -500,7 +486,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> > >   	if (nr_shown == 60) {
> > >   		if (time_before(jiffies, resume)) {
> > >   			nr_unshown++;
> > > -			return;
> > > +			return true;
> > >   		}
> > >   		if (nr_unshown) {
> > >   			pr_alert("BUG: Bad page map: %lu messages suppressed\n",
> > > @@ -511,15 +497,87 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> > >   	}
> > >   	if (nr_shown++ == 0)
> > >   		resume = jiffies + 60 * HZ;
> > > +	return false;
> > > +}
> > > +
> > > +static void __dump_bad_page_map_pgtable(struct mm_struct *mm, unsigned long addr)
> > > +{
> > > +	unsigned long long pgdv, p4dv, pudv, pmdv;
> >
> > > +	p4d_t p4d, *p4dp;
> > > +	pud_t pud, *pudp;
> > > +	pmd_t pmd, *pmdp;
> > > +	pgd_t *pgdp;
> > > +
> > > +	/*
> > > +	 * This looks like a fully lockless walk, however, the caller is
> > > +	 * expected to hold the leaf page table lock in addition to other
> > > +	 * rmap/mm/vma locks. So this is just a re-walk to dump page table
> > > +	 * content while any concurrent modifications should be completely
> > > +	 * prevented.
> > > +	 */
> >
> > Hmmm :)
> >
> > Why aren't we trying to lock at leaf level?
>
> Ehm, did you read the:
>
> "the caller is expected to hold the leaf page table lock"
>
> :)

Yeah sorry I was in 'what locks do we need' mode and hadn't shifted back here,
but I guess the intent is that the caller _must_ hold this lock.

I know it's nitty and annoying (sorry!) but as asserting seems to not be a
possibility here, could we spell these out as a series of points like:

/*
 * The caller MUST hold the following locks:
 *
 * - Leaf page table lock
 * - Appropriate VMA lock to keep VMA stable
 */

I don't _actually_ think you need the rmap lock then, as none of the page tables
you access would be impacted by any rmap action afaict, with these locks held.
>
>
> >
> > We need to:
> >
> > - Keep VMA stable which prevents unmap page table teardown and khugepaged
> >    collapse.
> > - (not relevant as we don't traverse PTE table but) RCU lock for PTE
> >    entries to avoid MADV_DONTNEED page table withdrawal.
> >
> > Buuut if we're not locking at leaf level, we leave ourselves open to racing
> > faults, zaps, etc. etc.
>
> Yes. I can clarify in the comment of print_bad_page_map(), that it is
> expected to be called with the PTL (unwritten rule, not changing that).

Right, see above, just needs more clarity then.

>
> >
> > So perhaps this why you require such strict conditions...
> >
> > But can you truly be sure of these existing? And we should then assert them
> > here no? For rmap though we'd need the folio/vma.
>
> I hope you realize that this nastiness of a code is called in case our
> system is already running into something extremely unexpected and will
> probably be dead soon.
>
> So I am not too interested in adding anything more here. If you run into this
> code you're in big trouble already.

Yes am aware :) my concern is NULL ptr deref or UAF, but with the locks
held as stated those won't occur.

But if it's not sensible to do it then we don't have to :) I am a reasonable
man, or like to think I am ;)

But I think we need clarity as per the above.

>
> >
> > > +	pgdp = pgd_offset(mm, addr);
> > > +	pgdv = pgd_val(*pgdp);
> >
> > Before I went and looked again at the commit msg I said:
> >
> > 	"Shouldn't we strictly speaking use pgdp_get()? I see you use this
> > 	 helper for other levels."
> >
> > But obviously yeah. You explained the insane reason why not.
>
> Had to find out the hard way ... :)

Pain.

>
> [...]
>
> > > +/*
> > > + * This function is called to print an error when a bad page table entry (e.g.,
> > > + * corrupted page table entry) is found. For example, we might have a
> > > + * PFN-mapped pte in a region that doesn't allow it.
> > > + *
> > > + * The calling function must still handle the error.
> > > + */
> >
> > We have extremely strict locking conditions for the page table traversal... but
> > no mention of them here?
>
> Yeah, I can add that.

Thanks!

>
> >
> > > +static void print_bad_page_map(struct vm_area_struct *vma,
> > > +		unsigned long addr, unsigned long long entry, struct page *page)
> > > +{
> > > +	struct address_space *mapping;
> > > +	pgoff_t index;
> > > +
> > > +	if (is_bad_page_map_ratelimited())
> > > +		return;
> > >
> > >   	mapping = vma->vm_file ? vma->vm_file->f_mapping : NULL;
> > >   	index = linear_page_index(vma, addr);
> > >
> > > -	pr_alert("BUG: Bad page map in process %s  pte:%08llx pmd:%08llx\n",
> > > -		 current->comm,
> > > -		 (long long)pte_val(pte), (long long)pmd_val(*pmd));
> > > +	pr_alert("BUG: Bad page map in process %s  entry:%08llx", current->comm, entry);
> >
> > Sort of wonder if this is even useful if you don't know what the 'entry'
> > is? But I guess the dump below will tell you.
>
> You probably missed in the patch description:
>
> "Whether it is a PTE or something else will usually become obvious from the
> page table dump or from the dumped stack. If ever required in the future, we
> could pass the entry level type similar to "enum rmap_level". For now, let's
> keep it simple."

Yeah sorry I glossed over the commit msg, and now I pay for it ;) OK this
is fine then.

>
> >
> > Though maybe actually useful to see flags etc. in case some horrid
> > corruption happened and maybe dump isn't valid? But then the dump assumes
> > strict conditions to work so... can that happen?
>
> Not sure what you mean, can you elaborate?
>
> If your system is falling apart completely, I'm afraid this report here will
> not save you.
>
> You'll probably get a flood of 60 ... if your system doesn't collapse before
> that.

I was musing wistfully about the value of outputting the entries. But I
guess _any_ information before a crash is useful, and your dump output will
be a lot more useful :)

>
> >
> > > +	__dump_bad_page_map_pgtable(vma->vm_mm, addr);
> > >   	if (page)
> > > -		dump_page(page, "bad pte");
> > > +		dump_page(page, "bad page map");
> > >   	pr_alert("addr:%px vm_flags:%08lx anon_vma:%px mapping:%px index:%lx\n",
> > >   		 (void *)addr, vma->vm_flags, vma->anon_vma, mapping, index);
> > >   	pr_alert("file:%pD fault:%ps mmap:%ps mmap_prepare: %ps read_folio:%ps\n",
> > > @@ -597,7 +655,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> > >   		if (is_zero_pfn(pfn))
> > >   			return NULL;
> > >
> > > -		print_bad_pte(vma, addr, pte, NULL);
> > > +		print_bad_page_map(vma, addr, pte_val(pte), NULL);
> > >   		return NULL;
> > >   	}
> > >
> > > @@ -625,7 +683,7 @@ struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
> > >
> > >   check_pfn:
> > >   	if (unlikely(pfn > highest_memmap_pfn)) {
> > > -		print_bad_pte(vma, addr, pte, NULL);
> > > +		print_bad_page_map(vma, addr, pte_val(pte), NULL);
> >
> > This is unrelated to your series, but I guess this is for cases where
> > you're e.g. iomapping or such? So it's not something in the memmap but it's
> > a PFN that might reference io memory or such?
>
> No, it's just an easy check for catching corruptions. In a later patch I add
> a comment.

Ohhh right. I was overthinking this haha

>
> >
> > >   		return NULL;
> > >   	}
> > >
> > > @@ -654,8 +712,15 @@ struct page *vm_normal_page_pmd(struct vm_area_struct *vma, unsigned long addr,
> > >   {
> > >   	unsigned long pfn = pmd_pfn(pmd);
> > >
> > > -	if (unlikely(pmd_special(pmd)))
> > > +	if (unlikely(pmd_special(pmd))) {
> > > +		if (vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))
> > > +			return NULL;
> >
> > I guess we'll bring this altogether in a later patch with vm_normal_page()
> > as getting a little duplicative :P
>
> Not that part, but the other part :)

Yes :)

>
> >
> > Makes me think that VM_SPECIAL is kind of badly named (other than fact
> > 'special' is nebulous and overloaded in general) in that it contains stuff
> > that is -VMA-special but only VM_PFNMAP | VM_MIXEDMAP really indicates
> > specialness wrt to underlying folio.
>
> It is.

Yeah I think we in mm periodically moan about this whole thing...

>
> >
> > Then we have VM_IO, which strictly must not have an associated page right?
>
> VM_IO just means read/write side-effects, I think you could have ones with
> a memmap easily ... e.g., memory section (128MiB) spanning both memory and
> MMIO regions.

Hmm, but why not have two separate VMAs? I guess I need to look more into
what this flag actually affects.

But in terms of VMA manipulation it really is no different from
e.g. VM_PFNMAP in what it affects. Well, I say that... except of course
this whole vm_normal_page() thing!

But I've not seen VM_IO set alone anywhere. On the other hand, I haven't
really dug deep on this...

>
> Thanks!
>
> --
> Cheers,
>
> David / dhildenb
>

Cheers, Lorenzo

