linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
From: David Hildenbrand <david@redhat.com>
To: linux-kernel@vger.kernel.org
Cc: linux-mm@kvack.org, xen-devel@lists.xenproject.org,
	linux-fsdevel@vger.kernel.org, nvdimm@lists.linux.dev,
	Andrew Morton <akpm@linux-foundation.org>,
	Juergen Gross <jgross@suse.com>,
	Stefano Stabellini <sstabellini@kernel.org>,
	Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com>,
	Dan Williams <dan.j.williams@intel.com>,
	Matthew Wilcox <willy@infradead.org>, Jan Kara <jack@suse.cz>,
	Alexander Viro <viro@zeniv.linux.org.uk>,
	Christian Brauner <brauner@kernel.org>,
	Lorenzo Stoakes <lorenzo.stoakes@oracle.com>,
	"Liam R. Howlett" <Liam.Howlett@oracle.com>,
	Vlastimil Babka <vbabka@suse.cz>, Mike Rapoport <rppt@kernel.org>,
	Suren Baghdasaryan <surenb@google.com>,
	Michal Hocko <mhocko@suse.com>, Zi Yan <ziy@nvidia.com>,
	Baolin Wang <baolin.wang@linux.alibaba.com>,
	Nico Pache <npache@redhat.com>,
	Ryan Roberts <ryan.roberts@arm.com>, Dev Jain <dev.jain@arm.com>,
	Barry Song <baohua@kernel.org>, Jann Horn <jannh@google.com>,
	Pedro Falcato <pfalcato@suse.de>, Hugh Dickins <hughd@google.com>,
	Oscar Salvador <osalvador@suse.de>,
	Lance Yang <lance.yang@linux.dev>,
	Russell King <linux@armlinux.org.uk>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v1 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map()
Date: Wed, 16 Jul 2025 10:40:04 +0200	[thread overview]
Message-ID: <e92b75d4-999b-4b0f-b62f-9edca09834c4@redhat.com> (raw)
In-Reply-To: <20250715132350.2448901-7-david@redhat.com>

On 15.07.25 15:23, David Hildenbrand wrote:
> print_bad_pte() looks like something that should actually be a WARN
> or similar, but historically it apparently has proven to be useful to
> detect corruption of page tables even on production systems -- report
> the issue and keep the system running to make it easier to actually detect
> what is going wrong (e.g., multiple such messages might shed a light).
> 
> As we want to unify vm_normal_page_*() handling for PTE/PMD/PUD, we'll have
> to take care of print_bad_pte() as well.
> 
> Let's prepare for using print_bad_pte() also for non-PTEs by adjusting the
> implementation and renaming the function -- we'll rename it to what
> we actually print: bad (page) mappings. Maybe it should be called
> "print_bad_table_entry()"? We'll just call it "print_bad_page_map()"
> because the assumption is that we are dealing with some (previously)
> present page table entry that got corrupted in weird ways.
> 
> Whether it is a PTE or something else will usually become obvious from the
> page table dump or from the dumped stack. If ever required in the future,
> we could pass the entry level type similar to "enum rmap_level". For now,
> let's keep it simple.
> 
> To make the function a bit more readable, factor out the ratelimit check
> into is_bad_page_map_ratelimited() and place the dumping of page
> table content into __dump_bad_page_map_pgtable(). We'll now dump
> information from each level in a single line, and just stop the table
> walk once we hit something that is not a present page table.
> 
> Use print_bad_page_map() in vm_normal_page_pmd() similar to how we do it
> for vm_normal_page(), now that we have a function that can handle it.
> 
> The report will now look something like (dumping pgd to pmd values):
> 
> [   77.943408] BUG: Bad page map in process XXX  entry:80000001233f5867
> [   77.944077] addr:00007fd84bb1c000 vm_flags:08100071 anon_vma: ...
> [   77.945186] pgd:10a89f067 p4d:10a89f067 pud:10e5a2067 pmd:105327067
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>   mm/memory.c | 120 ++++++++++++++++++++++++++++++++++++++++------------
>   1 file changed, 94 insertions(+), 26 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index a4f62923b961c..00ee0df020503 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -479,22 +479,8 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
>   			add_mm_counter(mm, i, rss[i]);
>   }
>   
> -/*
> - * This function is called to print an error when a bad pte
> - * is found. For example, we might have a PFN-mapped pte in
> - * a region that doesn't allow it.
> - *
> - * The calling function must still handle the error.
> - */
> -static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> -			  pte_t pte, struct page *page)
> +static bool is_bad_page_map_ratelimited(void)
>   {
> -	pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
> -	p4d_t *p4d = p4d_offset(pgd, addr);
> -	pud_t *pud = pud_offset(p4d, addr);
> -	pmd_t *pmd = pmd_offset(pud, addr);
> -	struct address_space *mapping;
> -	pgoff_t index;
>   	static unsigned long resume;
>   	static unsigned long nr_shown;
>   	static unsigned long nr_unshown;
> @@ -506,7 +492,7 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>   	if (nr_shown == 60) {
>   		if (time_before(jiffies, resume)) {
>   			nr_unshown++;
> -			return;
> +			return true;
>   		}
>   		if (nr_unshown) {
>   			pr_alert("BUG: Bad page map: %lu messages suppressed\n",
> @@ -517,15 +503,87 @@ static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>   	}
>   	if (nr_shown++ == 0)
>   		resume = jiffies + 60 * HZ;
> +	return false;
> +}
> +
> +static void __dump_bad_page_map_pgtable(struct mm_struct *mm, unsigned long addr)
> +{
> +	unsigned long long pgdv, p4dv, pudv, pmdv;
> +	pgd_t pgd, *pgdp;
> +	p4d_t p4d, *p4dp;
> +	pud_t pud, *pudp;
> +	pmd_t *pmdp;
> +
> +	/*
> +	 * This looks like a fully lockless walk, however, the caller is
> +	 * expected to hold the leaf page table lock in addition to other
> +	 * rmap/mm/vma locks. So this is just a re-walk to dump page table
> +	 * content while any concurrent modifications should be completely
> +	 * prevented.
> +	 */
> +	pgdp = pgd_offset(mm, addr);
> +	pgd = pgdp_get(pgdp);
> +	pgdv = pgd_val(pgd);

Apparently there is something weird here on arm-bcm2835_defconfig:

All errors (new ones prefixed by >>):

 >> mm/memory.c:525:6: error: array type 'pgd_t' (aka 'unsigned int[2]') 
is not assignable
      525 |         pgd = pgdp_get(pgdp);
          |         ~~~ ^
    1 error generated.

... apparently because we have this questionable ...

arch/arm/include/asm/pgtable-2level-types.h:typedef pmdval_t pgd_t[2];

I mean, the whole concept of pgdp_get() doesn't make too much sense if 
it wants to return an array.

I don't quite understand the "#undef STRICT_MM_TYPECHECKS #ifdef 
STRICT_MM_TYPECHECKS" stuff.

Why do we want to make it easier on the compiler while doing something 
fairly weird?

CCing arm maintainers: what's going on here? :)

An easy fix would be to not dump the pgd value, but having a 
non-functional pgdp_get() really is weird.

-- 
Cheers,

David / dhildenb



  reply	other threads:[~2025-07-16  8:40 UTC|newest]

Thread overview: 20+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-07-15 13:23 [PATCH v1 0/9] mm: vm_normal_page*() improvements David Hildenbrand
2025-07-15 13:23 ` [PATCH v1 1/9] mm/huge_memory: move more common code into insert_pmd() David Hildenbrand
2025-07-15 13:23 ` [PATCH v1 2/9] mm/huge_memory: move more common code into insert_pud() David Hildenbrand
2025-07-15 13:23 ` [PATCH v1 3/9] mm/huge_memory: support huge zero folio in vmf_insert_folio_pmd() David Hildenbrand
2025-07-15 13:23 ` [PATCH v1 4/9] fs/dax: use vmf_insert_folio_pmd() to insert the huge zero folio David Hildenbrand
2025-07-17  8:38   ` Alistair Popple
2025-07-17  8:39     ` David Hildenbrand
2025-07-15 13:23 ` [PATCH v1 5/9] mm/huge_memory: mark PMD mappings of the huge zero folio special David Hildenbrand
2025-07-15 13:23 ` [PATCH v1 6/9] mm/memory: convert print_bad_pte() to print_bad_page_map() David Hildenbrand
2025-07-16  8:40   ` David Hildenbrand [this message]
2025-07-15 13:23 ` [PATCH v1 7/9] mm/memory: factor out common code from vm_normal_page_*() David Hildenbrand
2025-07-16  8:15   ` Oscar Salvador
2025-07-15 13:23 ` [PATCH v1 8/9] mm: introduce and use vm_normal_page_pud() David Hildenbrand
2025-07-16  8:20   ` Oscar Salvador
2025-07-15 13:23 ` [PATCH v1 9/9] mm: rename vm_ops->find_special_page() to vm_ops->find_normal_page() David Hildenbrand
2025-07-16  8:22   ` Oscar Salvador
2025-07-15 23:31 ` [PATCH v1 0/9] mm: vm_normal_page*() improvements Andrew Morton
2025-07-16  8:47   ` David Hildenbrand
2025-07-16 22:27     ` Andrew Morton
2025-07-17  7:35       ` David Hildenbrand

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=e92b75d4-999b-4b0f-b62f-9edca09834c4@redhat.com \
    --to=david@redhat.com \
    --cc=Liam.Howlett@oracle.com \
    --cc=akpm@linux-foundation.org \
    --cc=baohua@kernel.org \
    --cc=baolin.wang@linux.alibaba.com \
    --cc=brauner@kernel.org \
    --cc=dan.j.williams@intel.com \
    --cc=dev.jain@arm.com \
    --cc=hughd@google.com \
    --cc=jack@suse.cz \
    --cc=jannh@google.com \
    --cc=jgross@suse.com \
    --cc=lance.yang@linux.dev \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux@armlinux.org.uk \
    --cc=lorenzo.stoakes@oracle.com \
    --cc=mhocko@suse.com \
    --cc=npache@redhat.com \
    --cc=nvdimm@lists.linux.dev \
    --cc=oleksandr_tyshchenko@epam.com \
    --cc=osalvador@suse.de \
    --cc=pfalcato@suse.de \
    --cc=rppt@kernel.org \
    --cc=ryan.roberts@arm.com \
    --cc=sstabellini@kernel.org \
    --cc=surenb@google.com \
    --cc=vbabka@suse.cz \
    --cc=viro@zeniv.linux.org.uk \
    --cc=willy@infradead.org \
    --cc=xen-devel@lists.xenproject.org \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).