* [PATCH] mm: implement "memory.oops_if_bad_pte=1" boot option @ 2025-07-09 18:10 Alexey Dobriyan 2025-07-09 22:37 ` Andrew Morton ` (2 more replies) 0 siblings, 3 replies; 9+ messages in thread From: Alexey Dobriyan @ 2025-07-09 18:10 UTC (permalink / raw) To: akpm Cc: linux-kernel, linux-mm, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko Implement memory.oops_if_bad_pte=1 boot option which oopses the machine instead of dreadful BUG: Bad page map in process message. This is intended for people who want to panic at the slightest provocation and for people who ruled out hardware problems which in turn means that delaying vmcore collection is counter-productive. Linux doesn't (never?) panicked on PTE corruption and even implemented ratelimited version of the message meaning it can go for minutes and even hours without anyone noticing which is exactly the opposite of what should be done to facilitate debugging. Not enabled by default. Not advertised. Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> --- mm/memory.c | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/mm/memory.c b/mm/memory.c index b0cda5aab398..90b92b312802 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -42,6 +42,7 @@ #include <linux/kernel_stat.h> #include <linux/mm.h> #include <linux/mm_inline.h> +#include <linux/moduleparam.h> #include <linux/sched/mm.h> #include <linux/sched/numa_balancing.h> #include <linux/sched/task.h> @@ -480,6 +481,13 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss) add_mm_counter(mm, i, rss[i]); } +/* + * Oops instead of printing "Bad page map in process" message and + * trying to continue. + */ +static bool oops_if_bad_pte __ro_after_init = false; +module_param(oops_if_bad_pte, bool, 0444); + /* * This function is called to print an error when a bad pte * is found. For example, we might have a PFN-mapped pte in @@ -490,6 +498,13 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss) static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr, pte_t pte, struct page *page) { + /* + * This line is a formality to collect vmcore ASAP. Real bug + * (hardware or software) happened earlier, current registers and + * backtrace aren't interesting. + */ + BUG_ON(oops_if_bad_pte); + pgd_t *pgd = pgd_offset(vma->vm_mm, addr); p4d_t *p4d = p4d_offset(pgd, addr); pud_t *pud = pud_offset(p4d, addr); -- 2.49.0 ^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH] mm: implement "memory.oops_if_bad_pte=1" boot option 2025-07-09 18:10 [PATCH] mm: implement "memory.oops_if_bad_pte=1" boot option Alexey Dobriyan @ 2025-07-09 22:37 ` Andrew Morton 2025-07-10 16:46 ` Alexey Dobriyan 2025-07-10 7:35 ` Vlastimil Babka 2025-07-10 16:16 ` Lorenzo Stoakes 2 siblings, 1 reply; 9+ messages in thread From: Andrew Morton @ 2025-07-09 22:37 UTC (permalink / raw) To: Alexey Dobriyan Cc: linux-kernel, linux-mm, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko On Wed, 9 Jul 2025 21:10:59 +0300 Alexey Dobriyan <adobriyan@gmail.com> wrote: > Implement > > memory.oops_if_bad_pte=1 > > boot option which oopses the machine instead of dreadful > > BUG: Bad page map in process > > message. > > This is intended > for people who want to panic at the slightest provocation and > for people who ruled out hardware problems which in turn means that > delaying vmcore collection is counter-productive. > > Linux doesn't (never?) panicked on PTE corruption and even implemented > ratelimited version of the message meaning it can go for minutes and > even hours without anyone noticing which is exactly the opposite of what > should be done to facilitate debugging. > > Not enabled by default. > > Not advertised. > > @@ -490,6 +498,13 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss) > static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr, > pte_t pte, struct page *page) > { > + /* > + * This line is a formality to collect vmcore ASAP. Real bug > + * (hardware or software) happened earlier, current registers and > + * backtrace aren't interesting. > + */ > + BUG_ON(oops_if_bad_pte); > + Oh. A pretty simple thing to do with bpf? A script to tell the kernel "dump vmcore if you get here" would have applications in places other than print_bad_pte()? That's what bpf_panic() was for (https://lwn.net/Articles/901284/) but it apparently didn't get merged for <reasons>. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] mm: implement "memory.oops_if_bad_pte=1" boot option 2025-07-09 22:37 ` Andrew Morton @ 2025-07-10 16:46 ` Alexey Dobriyan 0 siblings, 0 replies; 9+ messages in thread From: Alexey Dobriyan @ 2025-07-10 16:46 UTC (permalink / raw) To: Andrew Morton Cc: linux-kernel, linux-mm, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko On Wed, Jul 09, 2025 at 03:37:51PM -0700, Andrew Morton wrote: > On Wed, 9 Jul 2025 21:10:59 +0300 Alexey Dobriyan <adobriyan@gmail.com> wrote: > > > Implement > > > > memory.oops_if_bad_pte=1 > > > > boot option which oopses the machine instead of dreadful > > > > BUG: Bad page map in process > > > > message. > > > > This is intended > > for people who want to panic at the slightest provocation and > > for people who ruled out hardware problems which in turn means that > > delaying vmcore collection is counter-productive. > > > > Linux doesn't (never?) panicked on PTE corruption and even implemented > > ratelimited version of the message meaning it can go for minutes and > > even hours without anyone noticing which is exactly the opposite of what > > should be done to facilitate debugging. > > > > Not enabled by default. > > > > Not advertised. > > > > @@ -490,6 +498,13 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss) > > static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr, > > pte_t pte, struct page *page) > > { > > + /* > > + * This line is a formality to collect vmcore ASAP. Real bug > > + * (hardware or software) happened earlier, current registers and > > + * backtrace aren't interesting. > > + */ > > + BUG_ON(oops_if_bad_pte); > > + > > Oh. A pretty simple thing to do with bpf? Unless you need to explain how to set it up on QA machines :^) The good thing about boot options -- they are very easy to use. > A script to tell the kernel > "dump vmcore if you get here" would have applications in places other > than print_bad_pte()? Sure! > That's what bpf_panic() was for (https://lwn.net/Articles/901284/) but > it apparently didn't get merged for <reasons>. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] mm: implement "memory.oops_if_bad_pte=1" boot option 2025-07-09 18:10 [PATCH] mm: implement "memory.oops_if_bad_pte=1" boot option Alexey Dobriyan 2025-07-09 22:37 ` Andrew Morton @ 2025-07-10 7:35 ` Vlastimil Babka 2025-07-10 16:47 ` Alexey Dobriyan 2025-07-10 16:16 ` Lorenzo Stoakes 2 siblings, 1 reply; 9+ messages in thread From: Vlastimil Babka @ 2025-07-10 7:35 UTC (permalink / raw) To: Alexey Dobriyan, akpm Cc: linux-kernel, linux-mm, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko On 7/9/25 20:10, Alexey Dobriyan wrote: > Implement > > memory.oops_if_bad_pte=1 > > boot option which oopses the machine instead of dreadful > > BUG: Bad page map in process > > message. > > This is intended > for people who want to panic at the slightest provocation and > for people who ruled out hardware problems which in turn means that > delaying vmcore collection is counter-productive. > > Linux doesn't (never?) panicked on PTE corruption and even implemented > ratelimited version of the message meaning it can go for minutes and > even hours without anyone noticing which is exactly the opposite of what > should be done to facilitate debugging. > > Not enabled by default. > > Not advertised. > > Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Could we just reuse the existing panic_on_oops? Would anyone want to panic in this particular without the others, or vice versa? > --- > > mm/memory.c | 15 +++++++++++++++ > 1 file changed, 15 insertions(+) > > diff --git a/mm/memory.c b/mm/memory.c > index b0cda5aab398..90b92b312802 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -42,6 +42,7 @@ > #include <linux/kernel_stat.h> > #include <linux/mm.h> > #include <linux/mm_inline.h> > +#include <linux/moduleparam.h> > #include <linux/sched/mm.h> > #include <linux/sched/numa_balancing.h> > #include <linux/sched/task.h> > @@ -480,6 +481,13 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss) > add_mm_counter(mm, i, rss[i]); > } > > +/* > + * Oops instead of printing "Bad page map in process" message and > + * trying to continue. > + */ > +static bool oops_if_bad_pte __ro_after_init = false; > +module_param(oops_if_bad_pte, bool, 0444); > + > /* > * This function is called to print an error when a bad pte > * is found. For example, we might have a PFN-mapped pte in > @@ -490,6 +498,13 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss) > static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr, > pte_t pte, struct page *page) > { > + /* > + * This line is a formality to collect vmcore ASAP. Real bug > + * (hardware or software) happened earlier, current registers and > + * backtrace aren't interesting. > + */ > + BUG_ON(oops_if_bad_pte); > + > pgd_t *pgd = pgd_offset(vma->vm_mm, addr); > p4d_t *p4d = p4d_offset(pgd, addr); > pud_t *pud = pud_offset(p4d, addr); ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] mm: implement "memory.oops_if_bad_pte=1" boot option 2025-07-10 7:35 ` Vlastimil Babka @ 2025-07-10 16:47 ` Alexey Dobriyan 0 siblings, 0 replies; 9+ messages in thread From: Alexey Dobriyan @ 2025-07-10 16:47 UTC (permalink / raw) To: Vlastimil Babka Cc: akpm, linux-kernel, linux-mm, David Hildenbrand, Lorenzo Stoakes, Liam R. Howlett, Mike Rapoport, Suren Baghdasaryan, Michal Hocko On Thu, Jul 10, 2025 at 09:35:41AM +0200, Vlastimil Babka wrote: > On 7/9/25 20:10, Alexey Dobriyan wrote: > > Implement > > > > memory.oops_if_bad_pte=1 > > > > boot option which oopses the machine instead of dreadful > > > > BUG: Bad page map in process > > > > message. > > > > This is intended > > for people who want to panic at the slightest provocation and > > for people who ruled out hardware problems which in turn means that > > delaying vmcore collection is counter-productive. > > > > Linux doesn't (never?) panicked on PTE corruption and even implemented > > ratelimited version of the message meaning it can go for minutes and > > even hours without anyone noticing which is exactly the opposite of what > > should be done to facilitate debugging. > > > > Not enabled by default. > > > > Not advertised. > > > > Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> > > Could we just reuse the existing panic_on_oops? Would anyone want to panic > in this particular without the others, or vice versa? Yes, it is supposed to be used with panic_on_oops=1, otherwise lot of innocent processes might die. I'll rerhrase the comment. > > +/* > > + * Oops instead of printing "Bad page map in process" message and > > + * trying to continue. > > + */ > > +static bool oops_if_bad_pte __ro_after_init = false; > > +module_param(oops_if_bad_pte, bool, 0444); > > + > > /* > > * This function is called to print an error when a bad pte > > * is found. For example, we might have a PFN-mapped pte in > > @@ -490,6 +498,13 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss) > > static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr, > > pte_t pte, struct page *page) > > { > > + /* > > + * This line is a formality to collect vmcore ASAP. Real bug > > + * (hardware or software) happened earlier, current registers and > > + * backtrace aren't interesting. > > + */ > > + BUG_ON(oops_if_bad_pte); ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] mm: implement "memory.oops_if_bad_pte=1" boot option 2025-07-09 18:10 [PATCH] mm: implement "memory.oops_if_bad_pte=1" boot option Alexey Dobriyan 2025-07-09 22:37 ` Andrew Morton 2025-07-10 7:35 ` Vlastimil Babka @ 2025-07-10 16:16 ` Lorenzo Stoakes 2025-07-10 16:57 ` Alexey Dobriyan 2 siblings, 1 reply; 9+ messages in thread From: Lorenzo Stoakes @ 2025-07-10 16:16 UTC (permalink / raw) To: Alexey Dobriyan Cc: akpm, linux-kernel, linux-mm, David Hildenbrand, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko Sorry but no - this seems to me to just be a hack. And it also appears to violate the rules on BUG_ON() (see [0]) so this is just a no. [0]:https://lore.kernel.org/linux-mm/CAHk-=wjO1xL_ZRKUG_SJuh6sPTQ-6Lem3a3pGoo26CXEsx_w0g@mail.gmail.com/ On Wed, Jul 09, 2025 at 09:10:59PM +0300, Alexey Dobriyan wrote: > Implement > > memory.oops_if_bad_pte=1 This is a totally new paradigm afaict - introducing an oops based on user input, I really don't think that's sensible. Unless kernel.panic_on_oops is set this won't necessarily cause anything to halt. Really you want a panic_on_bad_pte here, but that would be way way too specific. So it seems like a hack just so you can get a vmcore? You seem to be using BUG_ON() to _maybe_ cause a panic, maybe not, but by doing this you're inferring that there's unrecoverable system instability, which isf clearly not the case. All of the bad PTE handling seems to be intended to be recoverable and handled by calling code. Additionally we have uses like zap_present_folio_ptes() which use it to output PTE state in the instance of an invalid mapcount value - I don't think oopsing there would really be what you wanted right? > > boot option which oopses the machine instead of dreadful > > BUG: Bad page map in process > > message. I'm not sure what's so dreadful about it? And why an oops is better? > > This is intended > for people who want to panic at the slightest provocation and > for people who ruled out hardware problems which in turn means that > delaying vmcore collection is counter-productive. Seems to be a specific edge case. > > Linux doesn't (never?) panicked on PTE corruption and even implemented > ratelimited version of the message meaning it can go for minutes and > even hours without anyone noticing which is exactly the opposite of what > should be done to facilitate debugging. But are there real situations you can cite where this has been problematic? > > Not enabled by default. Yeah, obviously. > > Not advertised. Umm why? Seems like you just want to add this for your own very specific purpose? > > Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> > --- > > mm/memory.c | 15 +++++++++++++++ > 1 file changed, 15 insertions(+) > > diff --git a/mm/memory.c b/mm/memory.c > index b0cda5aab398..90b92b312802 100644 > --- a/mm/memory.c > +++ b/mm/memory.c > @@ -42,6 +42,7 @@ > #include <linux/kernel_stat.h> > #include <linux/mm.h> > #include <linux/mm_inline.h> > +#include <linux/moduleparam.h> > #include <linux/sched/mm.h> > #include <linux/sched/numa_balancing.h> > #include <linux/sched/task.h> > @@ -480,6 +481,13 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss) > add_mm_counter(mm, i, rss[i]); > } > > +/* > + * Oops instead of printing "Bad page map in process" message and > + * trying to continue. > + */ > +static bool oops_if_bad_pte __ro_after_init = false; > +module_param(oops_if_bad_pte, bool, 0444); > + > /* > * This function is called to print an error when a bad pte > * is found. For example, we might have a PFN-mapped pte in > @@ -490,6 +498,13 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss) > static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr, > pte_t pte, struct page *page) > { > + /* > + * This line is a formality to collect vmcore ASAP. Real bug > + * (hardware or software) happened earlier, current registers and > + * backtrace aren't interesting. > + */ > + BUG_ON(oops_if_bad_pte); Except that it won't without panic_on_oops? I mean we can't just go around putting arbitrary BUG_ON()'s like this for cases we want data on. And far worse here - this is a print_xxx() function, and you're making it oops? That's really bad. > + > pgd_t *pgd = pgd_offset(vma->vm_mm, addr); > p4d_t *p4d = p4d_offset(pgd, addr); > pud_t *pud = pud_offset(p4d, addr); > -- > 2.49.0 > Note that other page table levels can be 'bad' as well, see pgd_bad() et al. - none of these will be caught. Overall I suspect there's one single case you're worried about, that really you want to put a WARN_ON_ONCE() against - then you can panic_on_warn and get what you want. If you can make an argument in favour of this that's convincing then that would be a potentially upstreamable patch, but this one isn't, in my view. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] mm: implement "memory.oops_if_bad_pte=1" boot option 2025-07-10 16:16 ` Lorenzo Stoakes @ 2025-07-10 16:57 ` Alexey Dobriyan 2025-07-10 17:02 ` Lorenzo Stoakes 0 siblings, 1 reply; 9+ messages in thread From: Alexey Dobriyan @ 2025-07-10 16:57 UTC (permalink / raw) To: Lorenzo Stoakes Cc: akpm, linux-kernel, linux-mm, David Hildenbrand, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko On Thu, Jul 10, 2025 at 05:16:52PM +0100, Lorenzo Stoakes wrote: > Sorry but no - this seems to me to just be a hack. And it also appears to > violate the rules on BUG_ON() (see [0]) so this is just a no. > > [0]:https://lore.kernel.org/linux-mm/CAHk-=wjO1xL_ZRKUG_SJuh6sPTQ-6Lem3a3pGoo26CXEsx_w0g@mail.gmail.com/ > > On Wed, Jul 09, 2025 at 09:10:59PM +0300, Alexey Dobriyan wrote: > > Implement > > > > memory.oops_if_bad_pte=1 > > This is a totally new paradigm afaict - introducing an oops based on user > input, I really don't think that's sensible. > > Unless kernel.panic_on_oops is set this won't necessarily cause anything to > halt. Really you want a panic_on_bad_pte here, but that would be way way > too specific. > > So it seems like a hack just so you can get a vmcore? > > You seem to be using BUG_ON() to _maybe_ cause a panic, maybe not, but by > doing this you're inferring that there's unrecoverable system instability, > which isf clearly not the case. > > All of the bad PTE handling seems to be intended to be recoverable and > handled by calling code. > > Additionally we have uses like zap_present_folio_ptes() which use it to > output PTE state in the instance of an invalid mapcount value - I don't > think oopsing there would really be what you wanted right? > > > > > boot option which oopses the machine instead of dreadful > > > > BUG: Bad page map in process > > > > message. > > I'm not sure what's so dreadful about it? Because the root cause is unknown, happened at unknown time, dmesg rotated away and nobody bothered to coredump the machine because it didn't oops! > And why an oops is better? I apologize for stating the obvious but the less time between the bug and coredump collection the better. > > This is intended > > for people who want to panic at the slightest provocation and > > for people who ruled out hardware problems which in turn means that > > delaying vmcore collection is counter-productive. > > Seems to be a specific edge case. Yes, but the option is not enabled by default and costs 2 instructions on the coldest code path, so... > > Linux doesn't (never?) panicked on PTE corruption and even implemented > > ratelimited version of the message meaning it can go for minutes and > > even hours without anyone noticing which is exactly the opposite of what > > should be done to facilitate debugging. > > But are there real situations you can cite where this has been problematic? > > > > > Not enabled by default. > > Yeah, obviously. > > > > > Not advertised. > > Umm why? Seems like you just want to add this for your own very specific > purpose? Sort of, I don't want to patch and unpatch things every time. > > +/* > > + * Oops instead of printing "Bad page map in process" message and > > + * trying to continue. > > + */ > > +static bool oops_if_bad_pte __ro_after_init = false; > > +module_param(oops_if_bad_pte, bool, 0444); > > + > > /* > > * This function is called to print an error when a bad pte > > * is found. For example, we might have a PFN-mapped pte in > > @@ -490,6 +498,13 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss) > > static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr, > > pte_t pte, struct page *page) > > { > > + /* > > + * This line is a formality to collect vmcore ASAP. Real bug > > + * (hardware or software) happened earlier, current registers and > > + * backtrace aren't interesting. > > + */ > > + BUG_ON(oops_if_bad_pte); > > Except that it won't without panic_on_oops? Yes, I'll update the comment. it is supposed to be used with panic_on_oops=1 for maximum effect. > I mean we can't just go around putting arbitrary BUG_ON()'s like this for > cases we want data on. Yes, we can! > And far worse here - this is a print_xxx() function, and you're making it > oops? That's really bad. It's fine because, it is conditional BUG_ON. > Note that other page table levels can be 'bad' as well, see pgd_bad() et > al. - none of these will be caught. Sure, I didn't think much about spreading this option to other places. It can be spread independently. > Overall I suspect there's one single case you're worried about, that really > you want to put a WARN_ON_ONCE() against - then you can panic_on_warn and > get what you want. Ehh, no. WARN is for home users who can maybe photo the oops and fish it out of dmesg and make bug report -- so that system survives until their data are flushed to disk. I suspect users are very bifurcated: some want to panic always, some want to panic during QA but not in the field, and then there are users whose only hope is cellphone camera. > If you can make an argument in favour of this that's convincing then that > would be a potentially upstreamable patch, but this one isn't, in my view. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] mm: implement "memory.oops_if_bad_pte=1" boot option 2025-07-10 16:57 ` Alexey Dobriyan @ 2025-07-10 17:02 ` Lorenzo Stoakes 2025-07-10 18:29 ` Michal Hocko 0 siblings, 1 reply; 9+ messages in thread From: Lorenzo Stoakes @ 2025-07-10 17:02 UTC (permalink / raw) To: Alexey Dobriyan Cc: akpm, linux-kernel, linux-mm, David Hildenbrand, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan, Michal Hocko OK I wasn't clear enough I guess - NAK. This is not upstreamable, nor anything like it. On Thu, Jul 10, 2025 at 07:57:00PM +0300, Alexey Dobriyan wrote: > On Thu, Jul 10, 2025 at 05:16:52PM +0100, Lorenzo Stoakes wrote: > > Sorry but no - this seems to me to just be a hack. And it also appears to > > violate the rules on BUG_ON() (see [0]) so this is just a no. > > > > [0]:https://lore.kernel.org/linux-mm/CAHk-=wjO1xL_ZRKUG_SJuh6sPTQ-6Lem3a3pGoo26CXEsx_w0g@mail.gmail.com/ > > > > On Wed, Jul 09, 2025 at 09:10:59PM +0300, Alexey Dobriyan wrote: > > > Implement > > > > > > memory.oops_if_bad_pte=1 > > > > This is a totally new paradigm afaict - introducing an oops based on user > > input, I really don't think that's sensible. > > > > Unless kernel.panic_on_oops is set this won't necessarily cause anything to > > halt. Really you want a panic_on_bad_pte here, but that would be way way > > too specific. > > > > So it seems like a hack just so you can get a vmcore? > > > > You seem to be using BUG_ON() to _maybe_ cause a panic, maybe not, but by > > doing this you're inferring that there's unrecoverable system instability, > > which isf clearly not the case. > > > > All of the bad PTE handling seems to be intended to be recoverable and > > handled by calling code. > > > > Additionally we have uses like zap_present_folio_ptes() which use it to > > output PTE state in the instance of an invalid mapcount value - I don't > > think oopsing there would really be what you wanted right? > > > > > > > > boot option which oopses the machine instead of dreadful > > > > > > BUG: Bad page map in process > > > > > > message. > > > > I'm not sure what's so dreadful about it? > > Because the root cause is unknown, happened at unknown time, dmesg > rotated away and nobody bothered to coredump the machine because it > didn't oops! > > > And why an oops is better? > > I apologize for stating the obvious but the less time between the bug > and coredump collection the better. > > > > This is intended > > > for people who want to panic at the slightest provocation and > > > for people who ruled out hardware problems which in turn means that > > > delaying vmcore collection is counter-productive. > > > > Seems to be a specific edge case. > > Yes, but the option is not enabled by default and costs 2 instructions > on the coldest code path, so... > > > > Linux doesn't (never?) panicked on PTE corruption and even implemented > > > ratelimited version of the message meaning it can go for minutes and > > > even hours without anyone noticing which is exactly the opposite of what > > > should be done to facilitate debugging. > > > > But are there real situations you can cite where this has been problematic? > > > > > > > > Not enabled by default. > > > > Yeah, obviously. > > > > > > > > Not advertised. > > > > Umm why? Seems like you just want to add this for your own very specific > > purpose? > > Sort of, I don't want to patch and unpatch things every time. > > > > +/* > > > + * Oops instead of printing "Bad page map in process" message and > > > + * trying to continue. > > > + */ > > > +static bool oops_if_bad_pte __ro_after_init = false; > > > +module_param(oops_if_bad_pte, bool, 0444); > > > + > > > /* > > > * This function is called to print an error when a bad pte > > > * is found. For example, we might have a PFN-mapped pte in > > > @@ -490,6 +498,13 @@ static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss) > > > static void print_bad_pte(struct vm_area_struct *vma, unsigned long addr, > > > pte_t pte, struct page *page) > > > { > > > + /* > > > + * This line is a formality to collect vmcore ASAP. Real bug > > > + * (hardware or software) happened earlier, current registers and > > > + * backtrace aren't interesting. > > > + */ > > > + BUG_ON(oops_if_bad_pte); > > > > Except that it won't without panic_on_oops? > > Yes, I'll update the comment. it is supposed to be used with > panic_on_oops=1 for maximum effect. > > > I mean we can't just go around putting arbitrary BUG_ON()'s like this for > > cases we want data on. > > Yes, we can! > > > And far worse here - this is a print_xxx() function, and you're making it > > oops? That's really bad. > > It's fine because, it is conditional BUG_ON. > > > Note that other page table levels can be 'bad' as well, see pgd_bad() et > > al. - none of these will be caught. > > Sure, I didn't think much about spreading this option to other places. > It can be spread independently. > > > Overall I suspect there's one single case you're worried about, that really > > you want to put a WARN_ON_ONCE() against - then you can panic_on_warn and > > get what you want. > > Ehh, no. WARN is for home users who can maybe photo the oops and fish it > out of dmesg and make bug report -- so that system survives until their > data are flushed to disk. > > I suspect users are very bifurcated: some want to panic always, some > want to panic during QA but not in the field, and then there are users > whose only hope is cellphone camera. > > > If you can make an argument in favour of this that's convincing then that > > would be a potentially upstreamable patch, but this one isn't, in my view. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] mm: implement "memory.oops_if_bad_pte=1" boot option 2025-07-10 17:02 ` Lorenzo Stoakes @ 2025-07-10 18:29 ` Michal Hocko 0 siblings, 0 replies; 9+ messages in thread From: Michal Hocko @ 2025-07-10 18:29 UTC (permalink / raw) To: Lorenzo Stoakes Cc: Alexey Dobriyan, akpm, linux-kernel, linux-mm, David Hildenbrand, Liam R. Howlett, Vlastimil Babka, Mike Rapoport, Suren Baghdasaryan On Thu 10-07-25 18:02:07, Lorenzo Stoakes wrote: > OK I wasn't clear enough I guess - NAK. Completely Agreed! Because.... > This is not upstreamable, nor anything like it. > > On Thu, Jul 10, 2025 at 07:57:00PM +0300, Alexey Dobriyan wrote: > > On Thu, Jul 10, 2025 at 05:16:52PM +0100, Lorenzo Stoakes wrote: [...] > > > You seem to be using BUG_ON() to _maybe_ cause a panic, maybe not, but by > > > doing this you're inferring that there's unrecoverable system instability, > > > which isf clearly not the case. .... of exactly this. I believe we have already/finally established that BUG_ON (not even VM_BUG_ON) is a sensible debugging tool. In this particular case it is not even clear when the page table got corrupted and it could have happened loong before we notice that so crash dumping right away doesn't really guarantee anything. So if this really helped in some specific situations and there is hope it might help in the future then I believe [...] > > > Overall I suspect there's one single case you're worried about, that really > > > you want to put a WARN_ON_ONCE() against - then you can panic_on_warn and > > > get what you want. is exactly what you should do. But even then dump_stack should be dropped to not duplicate the information printed etc. -- Michal Hocko SUSE Labs ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2025-07-10 18:29 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-07-09 18:10 [PATCH] mm: implement "memory.oops_if_bad_pte=1" boot option Alexey Dobriyan 2025-07-09 22:37 ` Andrew Morton 2025-07-10 16:46 ` Alexey Dobriyan 2025-07-10 7:35 ` Vlastimil Babka 2025-07-10 16:47 ` Alexey Dobriyan 2025-07-10 16:16 ` Lorenzo Stoakes 2025-07-10 16:57 ` Alexey Dobriyan 2025-07-10 17:02 ` Lorenzo Stoakes 2025-07-10 18:29 ` Michal Hocko
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).