[PATCH] mm: implement "memory.oops_if_bad

linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] mm: implement "memory.oops_if_bad_pte=1" boot option
@ 2025-07-09 18:10 Alexey Dobriyan
  2025-07-09 22:37 ` Andrew Morton
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Alexey Dobriyan @ 2025-07-09 18:10 UTC (permalink / raw)
  To: akpm
  Cc: linux-kernel, linux-mm, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko

Implement

	memory.oops_if_bad_pte=1

boot option which oopses the machine instead of dreadful

	BUG: Bad page map in process

message.

This is intended
for people who want to panic at the slightest provocation and
for people who ruled out hardware problems which in turn means that
delaying vmcore collection is counter-productive.

Linux doesn't (never?) panicked on PTE corruption and even implemented
ratelimited version of the message meaning it can go for minutes and
even hours without anyone noticing which is exactly the opposite of what
should be done to facilitate debugging.

Not enabled by default.

Not advertised.

Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
---

 mm/memory.c                                 | 15 +++++++++++++++
 1 file changed, 15 insertions(+)

diff --git a/mm/memory.c b/mm/memory.c
index b0cda5aab398..90b92b312802 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -42,6 +42,7 @@
 #include <linux/kernel_stat.h>
 #include <linux/mm.h>
 #include <linux/mm_inline.h>
+#include <linux/moduleparam.h>
 #include <linux/sched/mm.h>
 #include <linux/sched/numa_balancing.h>
 #include <linux/sched/task.h>
@@ -480,6 +481,13 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
 			add_mm_counter(mm, i, rss[i]);
 }
 
+/*
+ * Oops instead of printing "Bad page map in process" message and
+ * trying to continue.
+ */
+static bool oops_if_bad_pte __ro_after_init = false;
+module_param(oops_if_bad_pte, bool, 0444);
+
 /*
  * This function is called to print an error when a bad pte
  * is found. For example, we might have a PFN-mapped pte in
@@ -490,6 +498,13 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
 static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
 			  pte_t pte, struct page *page)
 {
+	/*
+	 * This line is a formality to collect vmcore ASAP. Real bug
+	 * (hardware or software) happened earlier, current registers and
+	 * backtrace aren't interesting.
+	 */
+	BUG_ON(oops_if_bad_pte);
+
 	pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
 	p4d_t *p4d = p4d_offset(pgd, addr);
 	pud_t *pud = pud_offset(p4d, addr);
-- 
2.49.0



^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: implement "memory.oops_if_bad_pte=1" boot option
  2025-07-09 18:10 [PATCH] mm: implement "memory.oops_if_bad_pte=1" boot option Alexey Dobriyan
@ 2025-07-09 22:37 ` Andrew Morton
  2025-07-10 16:46   ` Alexey Dobriyan
  2025-07-10  7:35 ` Vlastimil Babka
  2025-07-10 16:16 ` Lorenzo Stoakes
  2 siblings, 1 reply; 9+ messages in thread
From: Andrew Morton @ 2025-07-09 22:37 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: linux-kernel, linux-mm, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko

On Wed, 9 Jul 2025 21:10:59 +0300 Alexey Dobriyan <adobriyan@gmail.com> wrote:

> Implement
> 
> 	memory.oops_if_bad_pte=1
> 
> boot option which oopses the machine instead of dreadful
> 
> 	BUG: Bad page map in process
> 
> message.
> 
> This is intended
> for people who want to panic at the slightest provocation and
> for people who ruled out hardware problems which in turn means that
> delaying vmcore collection is counter-productive.
> 
> Linux doesn't (never?) panicked on PTE corruption and even implemented
> ratelimited version of the message meaning it can go for minutes and
> even hours without anyone noticing which is exactly the opposite of what
> should be done to facilitate debugging.
> 
> Not enabled by default.
> 
> Not advertised.
> 
> @@ -490,6 +498,13 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
>  static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>  			  pte_t pte, struct page *page)
>  {
> +	/*
> +	 * This line is a formality to collect vmcore ASAP. Real bug
> +	 * (hardware or software) happened earlier, current registers and
> +	 * backtrace aren't interesting.
> +	 */
> +	BUG_ON(oops_if_bad_pte);
> +

Oh.  A pretty simple thing to do with bpf?  A script to tell the kernel
"dump vmcore if you get here" would have applications in places other
than print_bad_pte()?

That's what bpf_panic() was for (https://lwn.net/Articles/901284/) but
it apparently didn't get merged for <reasons>.



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: implement "memory.oops_if_bad_pte=1" boot option
  2025-07-09 18:10 [PATCH] mm: implement "memory.oops_if_bad_pte=1" boot option Alexey Dobriyan
  2025-07-09 22:37 ` Andrew Morton
@ 2025-07-10  7:35 ` Vlastimil Babka
  2025-07-10 16:47   ` Alexey Dobriyan
  2025-07-10 16:16 ` Lorenzo Stoakes
  2 siblings, 1 reply; 9+ messages in thread
From: Vlastimil Babka @ 2025-07-10  7:35 UTC (permalink / raw)
  To: Alexey Dobriyan, akpm
  Cc: linux-kernel, linux-mm, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko

On 7/9/25 20:10, Alexey Dobriyan wrote:
> Implement
> 
> 	memory.oops_if_bad_pte=1
> 
> boot option which oopses the machine instead of dreadful
> 
> 	BUG: Bad page map in process
> 
> message.
> 
> This is intended
> for people who want to panic at the slightest provocation and
> for people who ruled out hardware problems which in turn means that
> delaying vmcore collection is counter-productive.
> 
> Linux doesn't (never?) panicked on PTE corruption and even implemented
> ratelimited version of the message meaning it can go for minutes and
> even hours without anyone noticing which is exactly the opposite of what
> should be done to facilitate debugging.
> 
> Not enabled by default.
> 
> Not advertised.
> 
> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>

Could we just reuse the existing panic_on_oops? Would anyone want to panic
in this particular without the others, or vice versa?

> ---
> 
>  mm/memory.c                                 | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index b0cda5aab398..90b92b312802 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -42,6 +42,7 @@
>  #include <linux/kernel_stat.h>
>  #include <linux/mm.h>
>  #include <linux/mm_inline.h>
> +#include <linux/moduleparam.h>
>  #include <linux/sched/mm.h>
>  #include <linux/sched/numa_balancing.h>
>  #include <linux/sched/task.h>
> @@ -480,6 +481,13 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
>  			add_mm_counter(mm, i, rss[i]);
>  }
>  
> +/*
> + * Oops instead of printing "Bad page map in process" message and
> + * trying to continue.
> + */
> +static bool oops_if_bad_pte __ro_after_init = false;
> +module_param(oops_if_bad_pte, bool, 0444);
> +
>  /*
>   * This function is called to print an error when a bad pte
>   * is found. For example, we might have a PFN-mapped pte in
> @@ -490,6 +498,13 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
>  static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>  			  pte_t pte, struct page *page)
>  {
> +	/*
> +	 * This line is a formality to collect vmcore ASAP. Real bug
> +	 * (hardware or software) happened earlier, current registers and
> +	 * backtrace aren't interesting.
> +	 */
> +	BUG_ON(oops_if_bad_pte);
> +
>  	pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
>  	p4d_t *p4d = p4d_offset(pgd, addr);
>  	pud_t *pud = pud_offset(p4d, addr);



^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: implement "memory.oops_if_bad_pte=1" boot option
  2025-07-09 18:10 [PATCH] mm: implement "memory.oops_if_bad_pte=1" boot option Alexey Dobriyan
  2025-07-09 22:37 ` Andrew Morton
  2025-07-10  7:35 ` Vlastimil Babka
@ 2025-07-10 16:16 ` Lorenzo Stoakes
  2025-07-10 16:57   ` Alexey Dobriyan
  2 siblings, 1 reply; 9+ messages in thread
From: Lorenzo Stoakes @ 2025-07-10 16:16 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, linux-kernel, linux-mm, David Hildenbrand, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko

Sorry but no - this seems to me to just be a hack. And it also appears to
violate the rules on BUG_ON() (see [0]) so this is just a no.

[0]:https://lore.kernel.org/linux-mm/CAHk-=wjO1xL_ZRKUG_SJuh6sPTQ-6Lem3a3pGoo26CXEsx_w0g@mail.gmail.com/

On Wed, Jul 09, 2025 at 09:10:59PM +0300, Alexey Dobriyan wrote:
> Implement
>
> 	memory.oops_if_bad_pte=1

This is a totally new paradigm afaict - introducing an oops based on user
input, I really don't think that's sensible.

Unless kernel.panic_on_oops is set this won't necessarily cause anything to
halt. Really you want a panic_on_bad_pte here, but that would be way way
too specific.

So it seems like a hack just so you can get a vmcore?

You seem to be using BUG_ON() to _maybe_ cause a panic, maybe not, but by
doing this you're inferring that there's unrecoverable system instability,
which isf clearly not the case.

All of the bad PTE handling seems to be intended to be recoverable and
handled by calling code.

Additionally we have uses like zap_present_folio_ptes() which use it to
output PTE state in the instance of an invalid mapcount value - I don't
think oopsing there would really be what you wanted right?

>
> boot option which oopses the machine instead of dreadful
>
> 	BUG: Bad page map in process
>
> message.

I'm not sure what's so dreadful about it? And why an oops is better?

>
> This is intended
> for people who want to panic at the slightest provocation and
> for people who ruled out hardware problems which in turn means that
> delaying vmcore collection is counter-productive.

Seems to be a specific edge case.

>
> Linux doesn't (never?) panicked on PTE corruption and even implemented
> ratelimited version of the message meaning it can go for minutes and
> even hours without anyone noticing which is exactly the opposite of what
> should be done to facilitate debugging.

But are there real situations you can cite where this has been problematic?

>
> Not enabled by default.

Yeah, obviously.

>
> Not advertised.

Umm why? Seems like you just want to add this for your own very specific
purpose?

>
> Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
> ---
>
>  mm/memory.c                                 | 15 +++++++++++++++
>  1 file changed, 15 insertions(+)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index b0cda5aab398..90b92b312802 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -42,6 +42,7 @@
>  #include <linux/kernel_stat.h>
>  #include <linux/mm.h>
>  #include <linux/mm_inline.h>
> +#include <linux/moduleparam.h>
>  #include <linux/sched/mm.h>
>  #include <linux/sched/numa_balancing.h>
>  #include <linux/sched/task.h>
> @@ -480,6 +481,13 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
>  			add_mm_counter(mm, i, rss[i]);
>  }
>
> +/*
> + * Oops instead of printing "Bad page map in process" message and
> + * trying to continue.
> + */
> +static bool oops_if_bad_pte __ro_after_init = false;
> +module_param(oops_if_bad_pte, bool, 0444);
> +
>  /*
>   * This function is called to print an error when a bad pte
>   * is found. For example, we might have a PFN-mapped pte in
> @@ -490,6 +498,13 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
>  static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
>  			  pte_t pte, struct page *page)
>  {
> +	/*
> +	 * This line is a formality to collect vmcore ASAP. Real bug
> +	 * (hardware or software) happened earlier, current registers and
> +	 * backtrace aren't interesting.
> +	 */
> +	BUG_ON(oops_if_bad_pte);

Except that it won't without panic_on_oops?

I mean we can't just go around putting arbitrary BUG_ON()'s like this for
cases we want data on.

And far worse here - this is a print_xxx() function, and you're making it
oops? That's really bad.

> +
>  	pgd_t *pgd = pgd_offset(vma->vm_mm, addr);
>  	p4d_t *p4d = p4d_offset(pgd, addr);
>  	pud_t *pud = pud_offset(p4d, addr);
> --
> 2.49.0
>

Note that other page table levels can be 'bad' as well, see pgd_bad() et
al. - none of these will be caught.

Overall I suspect there's one single case you're worried about, that really
you want to put a WARN_ON_ONCE() against - then you can panic_on_warn and
get what you want.

If you can make an argument in favour of this that's convincing then that
would be a potentially upstreamable patch, but this one isn't, in my view.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: implement "memory.oops_if_bad_pte=1" boot option
  2025-07-09 22:37 ` Andrew Morton
@ 2025-07-10 16:46   ` Alexey Dobriyan
  0 siblings, 0 replies; 9+ messages in thread
From: Alexey Dobriyan @ 2025-07-10 16:46 UTC (permalink / raw)
  To: Andrew Morton
  Cc: linux-kernel, linux-mm, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan, Michal Hocko

On Wed, Jul 09, 2025 at 03:37:51PM -0700, Andrew Morton wrote:
> On Wed, 9 Jul 2025 21:10:59 +0300 Alexey Dobriyan <adobriyan@gmail.com> wrote:
> 
> > Implement
> > 
> > 	memory.oops_if_bad_pte=1
> > 
> > boot option which oopses the machine instead of dreadful
> > 
> > 	BUG: Bad page map in process
> > 
> > message.
> > 
> > This is intended
> > for people who want to panic at the slightest provocation and
> > for people who ruled out hardware problems which in turn means that
> > delaying vmcore collection is counter-productive.
> > 
> > Linux doesn't (never?) panicked on PTE corruption and even implemented
> > ratelimited version of the message meaning it can go for minutes and
> > even hours without anyone noticing which is exactly the opposite of what
> > should be done to facilitate debugging.
> > 
> > Not enabled by default.
> > 
> > Not advertised.
> > 
> > @@ -490,6 +498,13 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
> >  static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> >  			  pte_t pte, struct page *page)
> >  {
> > +	/*
> > +	 * This line is a formality to collect vmcore ASAP. Real bug
> > +	 * (hardware or software) happened earlier, current registers and
> > +	 * backtrace aren't interesting.
> > +	 */
> > +	BUG_ON(oops_if_bad_pte);
> > +
> 
> Oh.  A pretty simple thing to do with bpf?

Unless you need to explain how to set it up on QA machines :^)
The good thing about boot options -- they are very easy to use.

> A script to tell the kernel
> "dump vmcore if you get here" would have applications in places other
> than print_bad_pte()?

Sure!

> That's what bpf_panic() was for (https://lwn.net/Articles/901284/) but
> it apparently didn't get merged for <reasons>.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: implement "memory.oops_if_bad_pte=1" boot option
  2025-07-10  7:35 ` Vlastimil Babka
@ 2025-07-10 16:47   ` Alexey Dobriyan
  0 siblings, 0 replies; 9+ messages in thread
From: Alexey Dobriyan @ 2025-07-10 16:47 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: akpm, linux-kernel, linux-mm, David Hildenbrand, Lorenzo Stoakes,
	Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko

On Thu, Jul 10, 2025 at 09:35:41AM +0200, Vlastimil Babka wrote:
> On 7/9/25 20:10, Alexey Dobriyan wrote:
> > Implement
> > 
> > 	memory.oops_if_bad_pte=1
> > 
> > boot option which oopses the machine instead of dreadful
> > 
> > 	BUG: Bad page map in process
> > 
> > message.
> > 
> > This is intended
> > for people who want to panic at the slightest provocation and
> > for people who ruled out hardware problems which in turn means that
> > delaying vmcore collection is counter-productive.
> > 
> > Linux doesn't (never?) panicked on PTE corruption and even implemented
> > ratelimited version of the message meaning it can go for minutes and
> > even hours without anyone noticing which is exactly the opposite of what
> > should be done to facilitate debugging.
> > 
> > Not enabled by default.
> > 
> > Not advertised.
> > 
> > Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
> 
> Could we just reuse the existing panic_on_oops? Would anyone want to panic
> in this particular without the others, or vice versa?

Yes, it is supposed to be used with panic_on_oops=1, otherwise lot of
innocent processes might die.

I'll rerhrase the comment.

> > +/*
> > + * Oops instead of printing "Bad page map in process" message and
> > + * trying to continue.
> > + */
> > +static bool oops_if_bad_pte __ro_after_init = false;
> > +module_param(oops_if_bad_pte, bool, 0444);
> > +
> >  /*
> >   * This function is called to print an error when a bad pte
> >   * is found. For example, we might have a PFN-mapped pte in
> > @@ -490,6 +498,13 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
> >  static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> >  			  pte_t pte, struct page *page)
> >  {
> > +	/*
> > +	 * This line is a formality to collect vmcore ASAP. Real bug
> > +	 * (hardware or software) happened earlier, current registers and
> > +	 * backtrace aren't interesting.
> > +	 */
> > +	BUG_ON(oops_if_bad_pte);


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: implement "memory.oops_if_bad_pte=1" boot option
  2025-07-10 16:16 ` Lorenzo Stoakes
@ 2025-07-10 16:57   ` Alexey Dobriyan
  2025-07-10 17:02     ` Lorenzo Stoakes
  0 siblings, 1 reply; 9+ messages in thread
From: Alexey Dobriyan @ 2025-07-10 16:57 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: akpm, linux-kernel, linux-mm, David Hildenbrand, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko

On Thu, Jul 10, 2025 at 05:16:52PM +0100, Lorenzo Stoakes wrote:
> Sorry but no - this seems to me to just be a hack. And it also appears to
> violate the rules on BUG_ON() (see [0]) so this is just a no.
> 
> [0]:https://lore.kernel.org/linux-mm/CAHk-=wjO1xL_ZRKUG_SJuh6sPTQ-6Lem3a3pGoo26CXEsx_w0g@mail.gmail.com/
> 
> On Wed, Jul 09, 2025 at 09:10:59PM +0300, Alexey Dobriyan wrote:
> > Implement
> >
> > 	memory.oops_if_bad_pte=1
> 
> This is a totally new paradigm afaict - introducing an oops based on user
> input, I really don't think that's sensible.
> 
> Unless kernel.panic_on_oops is set this won't necessarily cause anything to
> halt. Really you want a panic_on_bad_pte here, but that would be way way
> too specific.
> 
> So it seems like a hack just so you can get a vmcore?
> 
> You seem to be using BUG_ON() to _maybe_ cause a panic, maybe not, but by
> doing this you're inferring that there's unrecoverable system instability,
> which isf clearly not the case.
> 
> All of the bad PTE handling seems to be intended to be recoverable and
> handled by calling code.
> 
> Additionally we have uses like zap_present_folio_ptes() which use it to
> output PTE state in the instance of an invalid mapcount value - I don't
> think oopsing there would really be what you wanted right?
> 
> >
> > boot option which oopses the machine instead of dreadful
> >
> > 	BUG: Bad page map in process
> >
> > message.
> 
> I'm not sure what's so dreadful about it?

Because the root cause is unknown, happened at unknown time, dmesg
rotated away and nobody bothered to coredump the machine because it
didn't oops!

> And why an oops is better?

I apologize for stating the obvious but the less time between the bug
and coredump collection the better.

> > This is intended
> > for people who want to panic at the slightest provocation and
> > for people who ruled out hardware problems which in turn means that
> > delaying vmcore collection is counter-productive.
> 
> Seems to be a specific edge case.

Yes, but the option is not enabled by default and costs 2 instructions
on the coldest code path, so...

> > Linux doesn't (never?) panicked on PTE corruption and even implemented
> > ratelimited version of the message meaning it can go for minutes and
> > even hours without anyone noticing which is exactly the opposite of what
> > should be done to facilitate debugging.
> 
> But are there real situations you can cite where this has been problematic?
> 
> >
> > Not enabled by default.
> 
> Yeah, obviously.
> 
> >
> > Not advertised.
> 
> Umm why? Seems like you just want to add this for your own very specific
> purpose?

Sort of, I don't want to patch and unpatch things every time.

> > +/*
> > + * Oops instead of printing "Bad page map in process" message and
> > + * trying to continue.
> > + */
> > +static bool oops_if_bad_pte __ro_after_init = false;
> > +module_param(oops_if_bad_pte, bool, 0444);
> > +
> >  /*
> >   * This function is called to print an error when a bad pte
> >   * is found. For example, we might have a PFN-mapped pte in
> > @@ -490,6 +498,13 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
> >  static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> >  			  pte_t pte, struct page *page)
> >  {
> > +	/*
> > +	 * This line is a formality to collect vmcore ASAP. Real bug
> > +	 * (hardware or software) happened earlier, current registers and
> > +	 * backtrace aren't interesting.
> > +	 */
> > +	BUG_ON(oops_if_bad_pte);
> 
> Except that it won't without panic_on_oops?

Yes, I'll update the comment. it is supposed to be used with
panic_on_oops=1 for maximum effect.

> I mean we can't just go around putting arbitrary BUG_ON()'s like this for
> cases we want data on.

Yes, we can!

> And far worse here - this is a print_xxx() function, and you're making it
> oops? That's really bad.

It's fine because, it is conditional BUG_ON.

> Note that other page table levels can be 'bad' as well, see pgd_bad() et
> al. - none of these will be caught.

Sure, I didn't think much about spreading this option to other places.
It can be spread independently.

> Overall I suspect there's one single case you're worried about, that really
> you want to put a WARN_ON_ONCE() against - then you can panic_on_warn and
> get what you want.

Ehh, no. WARN is for home users who can maybe photo the oops and fish it
out of dmesg and make bug report -- so that system survives until their
data are flushed to disk.

I suspect users are very bifurcated: some want to panic always, some
want to panic during QA but not in the field, and then there are users
whose only hope is cellphone camera.

> If you can make an argument in favour of this that's convincing then that
> would be a potentially upstreamable patch, but this one isn't, in my view.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: implement "memory.oops_if_bad_pte=1" boot option
  2025-07-10 16:57   ` Alexey Dobriyan
@ 2025-07-10 17:02     ` Lorenzo Stoakes
  2025-07-10 18:29       ` Michal Hocko
  0 siblings, 1 reply; 9+ messages in thread
From: Lorenzo Stoakes @ 2025-07-10 17:02 UTC (permalink / raw)
  To: Alexey Dobriyan
  Cc: akpm, linux-kernel, linux-mm, David Hildenbrand, Liam R. Howlett,
	Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko

OK I wasn't clear enough I guess - NAK.

This is not upstreamable, nor anything like it.

On Thu, Jul 10, 2025 at 07:57:00PM +0300, Alexey Dobriyan wrote:
> On Thu, Jul 10, 2025 at 05:16:52PM +0100, Lorenzo Stoakes wrote:
> > Sorry but no - this seems to me to just be a hack. And it also appears to
> > violate the rules on BUG_ON() (see [0]) so this is just a no.
> >
> > [0]:https://lore.kernel.org/linux-mm/CAHk-=wjO1xL_ZRKUG_SJuh6sPTQ-6Lem3a3pGoo26CXEsx_w0g@mail.gmail.com/
> >
> > On Wed, Jul 09, 2025 at 09:10:59PM +0300, Alexey Dobriyan wrote:
> > > Implement
> > >
> > > 	memory.oops_if_bad_pte=1
> >
> > This is a totally new paradigm afaict - introducing an oops based on user
> > input, I really don't think that's sensible.
> >
> > Unless kernel.panic_on_oops is set this won't necessarily cause anything to
> > halt. Really you want a panic_on_bad_pte here, but that would be way way
> > too specific.
> >
> > So it seems like a hack just so you can get a vmcore?
> >
> > You seem to be using BUG_ON() to _maybe_ cause a panic, maybe not, but by
> > doing this you're inferring that there's unrecoverable system instability,
> > which isf clearly not the case.
> >
> > All of the bad PTE handling seems to be intended to be recoverable and
> > handled by calling code.
> >
> > Additionally we have uses like zap_present_folio_ptes() which use it to
> > output PTE state in the instance of an invalid mapcount value - I don't
> > think oopsing there would really be what you wanted right?
> >
> > >
> > > boot option which oopses the machine instead of dreadful
> > >
> > > 	BUG: Bad page map in process
> > >
> > > message.
> >
> > I'm not sure what's so dreadful about it?
>
> Because the root cause is unknown, happened at unknown time, dmesg
> rotated away and nobody bothered to coredump the machine because it
> didn't oops!
>
> > And why an oops is better?
>
> I apologize for stating the obvious but the less time between the bug
> and coredump collection the better.
>
> > > This is intended
> > > for people who want to panic at the slightest provocation and
> > > for people who ruled out hardware problems which in turn means that
> > > delaying vmcore collection is counter-productive.
> >
> > Seems to be a specific edge case.
>
> Yes, but the option is not enabled by default and costs 2 instructions
> on the coldest code path, so...
>
> > > Linux doesn't (never?) panicked on PTE corruption and even implemented
> > > ratelimited version of the message meaning it can go for minutes and
> > > even hours without anyone noticing which is exactly the opposite of what
> > > should be done to facilitate debugging.
> >
> > But are there real situations you can cite where this has been problematic?
> >
> > >
> > > Not enabled by default.
> >
> > Yeah, obviously.
> >
> > >
> > > Not advertised.
> >
> > Umm why? Seems like you just want to add this for your own very specific
> > purpose?
>
> Sort of, I don't want to patch and unpatch things every time.
>
> > > +/*
> > > + * Oops instead of printing "Bad page map in process" message and
> > > + * trying to continue.
> > > + */
> > > +static bool oops_if_bad_pte __ro_after_init = false;
> > > +module_param(oops_if_bad_pte, bool, 0444);
> > > +
> > >  /*
> > >   * This function is called to print an error when a bad pte
> > >   * is found. For example, we might have a PFN-mapped pte in
> > > @@ -490,6 +498,13 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
> > >  static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr,
> > >  			  pte_t pte, struct page *page)
> > >  {
> > > +	/*
> > > +	 * This line is a formality to collect vmcore ASAP. Real bug
> > > +	 * (hardware or software) happened earlier, current registers and
> > > +	 * backtrace aren't interesting.
> > > +	 */
> > > +	BUG_ON(oops_if_bad_pte);
> >
> > Except that it won't without panic_on_oops?
>
> Yes, I'll update the comment. it is supposed to be used with
> panic_on_oops=1 for maximum effect.
>
> > I mean we can't just go around putting arbitrary BUG_ON()'s like this for
> > cases we want data on.
>
> Yes, we can!
>
> > And far worse here - this is a print_xxx() function, and you're making it
> > oops? That's really bad.
>
> It's fine because, it is conditional BUG_ON.
>
> > Note that other page table levels can be 'bad' as well, see pgd_bad() et
> > al. - none of these will be caught.
>
> Sure, I didn't think much about spreading this option to other places.
> It can be spread independently.
>
> > Overall I suspect there's one single case you're worried about, that really
> > you want to put a WARN_ON_ONCE() against - then you can panic_on_warn and
> > get what you want.
>
> Ehh, no. WARN is for home users who can maybe photo the oops and fish it
> out of dmesg and make bug report -- so that system survives until their
> data are flushed to disk.
>
> I suspect users are very bifurcated: some want to panic always, some
> want to panic during QA but not in the field, and then there are users
> whose only hope is cellphone camera.
>
> > If you can make an argument in favour of this that's convincing then that
> > would be a potentially upstreamable patch, but this one isn't, in my view.


^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] mm: implement "memory.oops_if_bad_pte=1" boot option
  2025-07-10 17:02     ` Lorenzo Stoakes
@ 2025-07-10 18:29       ` Michal Hocko
  0 siblings, 0 replies; 9+ messages in thread
From: Michal Hocko @ 2025-07-10 18:29 UTC (permalink / raw)
  To: Lorenzo Stoakes
  Cc: Alexey Dobriyan, akpm, linux-kernel, linux-mm, David Hildenbrand,
	Liam R. Howlett, Vlastimil Babka, Mike Rapoport,
	Suren Baghdasaryan

On Thu 10-07-25 18:02:07, Lorenzo Stoakes wrote:
> OK I wasn't clear enough I guess - NAK.

Completely Agreed!
Because....

> This is not upstreamable, nor anything like it.
> 
> On Thu, Jul 10, 2025 at 07:57:00PM +0300, Alexey Dobriyan wrote:
> > On Thu, Jul 10, 2025 at 05:16:52PM +0100, Lorenzo Stoakes wrote:
[...]
> > > You seem to be using BUG_ON() to _maybe_ cause a panic, maybe not, but by
> > > doing this you're inferring that there's unrecoverable system instability,
> > > which isf clearly not the case.

.... of exactly this. I believe we have already/finally established that
BUG_ON (not even VM_BUG_ON) is a sensible debugging tool. In this
particular case it is not even clear when the page table got corrupted
and it could have happened loong before we notice that so crash dumping
right away doesn't really guarantee anything.

So if this really helped in some specific situations and there is hope
it might help in the future then I believe
[...]
> > > Overall I suspect there's one single case you're worried about, that really
> > > you want to put a WARN_ON_ONCE() against - then you can panic_on_warn and
> > > get what you want.

is exactly what you should do. But even then dump_stack should be
dropped to not duplicate the information printed etc.

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2025-07-10 18:29 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-07-09 18:10 [PATCH] mm: implement "memory.oops_if_bad_pte=1" boot option Alexey Dobriyan
2025-07-09 22:37 ` Andrew Morton
2025-07-10 16:46   ` Alexey Dobriyan
2025-07-10  7:35 ` Vlastimil Babka
2025-07-10 16:47   ` Alexey Dobriyan
2025-07-10 16:16 ` Lorenzo Stoakes
2025-07-10 16:57   ` Alexey Dobriyan
2025-07-10 17:02     ` Lorenzo Stoakes
2025-07-10 18:29       ` Michal Hocko

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).