All of lore.kernel.org
 help / color / mirror / Atom feed
From: Ingo Molnar <mingo@kernel.org>
To: Rik van Riel <riel@surriel.com>
Cc: Dave Hansen <dave.hansen@intel.com>,
	x86@kernel.org, linux-kernel@vger.kernel.org, kernel-team@fb.com,
	Thomas Gleixner <tglx@linutronix.de>, Dave Jones <dsj@fb.com>,
	Andy Lutomirski <luto@kernel.org>
Subject: Re: [PATCH v2]  x86,mm: print likely CPU at segfault time
Date: Thu, 4 Aug 2022 22:17:04 +0200	[thread overview]
Message-ID: <YuwpQEYCwTl+m6j5@gmail.com> (raw)
In-Reply-To: <20220804155450.08c5b87e@imladris.surriel.com>


* Rik van Riel <riel@surriel.com> wrote:

> In a large enough fleet of computers, it is common to have a few bad CPUs.
> Those can often be identified by seeing that some commonly run kernel code,
> which runs fine everywhere else, keeps crashing on the same CPU core on one
> particular bad system.
> 
> However, the failure modes in CPUs that have gone bad over the years are
> often oddly specific, and the only bad behavior seen might be segfaults
> in programs like bash, python, or various system daemons that run fine
> everywhere else.
> 
> Add a printk() to show_signal_msg() to print the CPU, core, and socket
> at segfault time. This is not perfect, since the task might get rescheduled
> on another CPU between when the fault hit, and when the message is printed,
> but in practice this has been good enough to help us identify several bad
> CPU cores.
> 
> segfault[1349]: segfault at 0 ip 000000000040113a sp 00007ffc6d32e360 error 4 in segfault[401000+1000] on CPU 0 (core 0, socket 0)
> 
> Signed-off-by: Rik van Riel <riel@surriel.com>
> CC: Dave Jones <dsj@fb.com>
> ---
>  arch/x86/mm/fault.c | 10 ++++++++++
>  1 file changed, 10 insertions(+)
> 
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index fad8faa29d04..a9b93a7816f9 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -769,6 +769,8 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
>  		unsigned long address, struct task_struct *tsk)
>  {
>  	const char *loglvl = task_pid_nr(tsk) > 1 ? KERN_INFO : KERN_EMERG;
> +	/* This is a racy snapshot, but it's better than nothing. */
> +	int cpu = READ_ONCE(raw_smp_processor_id());
>  
>  	if (!unhandled_signal(tsk, SIGSEGV))
>  		return;
> @@ -782,6 +784,14 @@ show_signal_msg(struct pt_regs *regs, unsigned long error_code,
>  
>  	print_vma_addr(KERN_CONT " in ", regs->ip);
>  
> +	/*
> +	 * Dump the likely CPU where the fatal segfault happened.
> +	 * This can help identify faulty hardware.
> +	 */
> +	printk(KERN_CONT " on CPU %d (core %d, socket %d)", cpu,
> +	       topology_core_id(cpu), topology_physical_package_id(cpu));

LGTM, applying this to tip:x86/mm unless someone objects.

I've added the tidbit to the changelog that this only gets printed if 
show_unhandled_signals (/proc/sys/kernel/print-fatal-signals) is enabled - 
which is off by default. So your patch expands upon a default-off debug 
printout in essence - where utility maximization is OK.

Thanks,

	Ingo

  reply	other threads:[~2022-08-04 20:17 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-08-04 19:54 [PATCH v2] x86,mm: print likely CPU at segfault time Rik van Riel
2022-08-04 20:17 ` Ingo Molnar [this message]
2022-08-05  3:01   ` Rik van Riel
2022-08-05  7:59     ` Ingo Molnar
2022-08-05 12:54       ` Rik van Riel
2022-08-05 10:08 ` Borislav Petkov
2022-08-05 12:53   ` Rik van Riel
2022-08-05 13:25     ` Borislav Petkov

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YuwpQEYCwTl+m6j5@gmail.com \
    --to=mingo@kernel.org \
    --cc=dave.hansen@intel.com \
    --cc=dsj@fb.com \
    --cc=kernel-team@fb.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=luto@kernel.org \
    --cc=riel@surriel.com \
    --cc=tglx@linutronix.de \
    --cc=x86@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.