Re: [PATCH v4 09/17] watchdog/hardlockup: Add a "cpu" param to watchdog_hardlockup_check()

linux-perf-users.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Petr Mladek <pmladek@suse.com>
To: Douglas Anderson <dianders@chromium.org>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Sumit Garg <sumit.garg@linaro.org>,
	Mark Rutland <mark.rutland@arm.com>,
	Matthias Kaehlcke <mka@chromium.org>,
	Stephane Eranian <eranian@google.com>,
	Stephen Boyd <swboyd@chromium.org>,
	ricardo.neri@intel.com, Tzung-Bi Shih <tzungbi@chromium.org>,
	Lecopzer Chen <lecopzer.chen@mediatek.com>,
	kgdb-bugreport@lists.sourceforge.net,
	Masayoshi Mizuma <msys.mizuma@gmail.com>,
	Guenter Roeck <groeck@chromium.org>,
	Pingfan Liu <kernelfans@gmail.com>,
	Andi Kleen <ak@linux.intel.com>, Ian Rogers <irogers@google.com>,
	linux-arm-kernel@lists.infradead.org,
	linux-perf-users@vger.kernel.org, ito-yuichi@fujitsu.com,
	Randy Dunlap <rdunlap@infradead.org>,
	Chen-Yu Tsai <wens@csie.org>,
	christophe.leroy@csgroup.eu, davem@davemloft.net,
	sparclinux@vger.kernel.org, mpe@ellerman.id.au,
	Will Deacon <will@kernel.org>,
	ravi.v.shankar@intel.com, npiggin@gmail.com,
	linuxppc-dev@lists.ozlabs.org, Marc Zyngier <maz@kernel.org>,
	Catalin Marinas <catalin.marinas@arm.com>,
	Daniel Thompson <daniel.thompson@linaro.org>
Subject: Re: [PATCH v4 09/17] watchdog/hardlockup: Add a "cpu" param to watchdog_hardlockup_check()
Date: Thu, 11 May 2023 16:14:37 +0200	[thread overview]
Message-ID: <ZFz4TVOyEU51b898@alley> (raw)
In-Reply-To: <20230504151100.v4.9.I3a7d4dd8c23ac30ee0b607d77feb6646b64825c0@changeid>

On Thu 2023-05-04 15:13:41, Douglas Anderson wrote:
> In preparation for the buddy hardlockup detector where the CPU
> checking for lockup might not be the currently running CPU, add a
> "cpu" parameter to watchdog_hardlockup_check().
> 
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -92,14 +92,14 @@ static DEFINE_PER_CPU(unsigned long, hrtimer_interrupts_saved);
>  static DEFINE_PER_CPU(bool, watchdog_hardlockup_processed);
>  static unsigned long watchdog_hardlockup_dumped_stacks;
>  
> -static bool watchdog_hardlockup_is_lockedup(void)
> +static bool watchdog_hardlockup_is_lockedup(unsigned int cpu)
>  {
> -	unsigned long hrint = __this_cpu_read(hrtimer_interrupts);
> +	unsigned long hrint = per_cpu(hrtimer_interrupts, cpu);

My radar tells me that this should be
READ_ONCE(per_cpu(hrtimer_interrupts, cpu)) when the value might
be modified on another CPU. Otherwise, the compiler is allowed
to split the read into more instructions.

It will be needed for the buddy detector. And it will require
also incrementing the value in watchdog_hardlockup_interrupt_count()
an atomic way.

Note that __this_cpu_inc_return() does not guarantee atomicity
according to my understanding. In theory, the following should
work because counter will never be incremented in parallel:

static unsigned long watchdog_hardlockup_interrupt_count(void)
{
	unsigned long count;

	count = __this_cpu_read(hrtimer_interrupts);
	count++;
	WRITE_ONCE(*raw_cpu_ptr(hrtimer_interrupts), count);
}

but it is nasty. A more elegant solution might be using atomic_t
for hrtimer_interrupts counter.

> -	if (__this_cpu_read(hrtimer_interrupts_saved) == hrint)
> +	if (per_cpu(hrtimer_interrupts_saved, cpu) == hrint)
>  		return true;
>  
> -	__this_cpu_write(hrtimer_interrupts_saved, hrint);
> +	per_cpu(hrtimer_interrupts_saved, cpu) = hrint;

IMHO, hrtimer_interrupts_saved might be handled this way.
The value is read/written only by this function.

The buddy watchdog should see consistent values even when
the buddy CPU goes offline. This check should never race
because this CPU should get touched when another buddy
gets assigned.

Well, it would deserve a comment.

>  
>  	return false;
>  }
> @@ -117,35 +117,50 @@ void watchdog_hardlockup_check(struct pt_regs *regs)
>  	 * fired multiple times before we overflow'd. If it hasn't
>  	 * then this is a good indication the cpu is stuck
>  	 */
> -	if (watchdog_hardlockup_is_lockedup()) {
> +	if (watchdog_hardlockup_is_lockedup(cpu)) {
>  		unsigned int this_cpu = smp_processor_id();
> +		struct cpumask backtrace_mask = *cpu_online_mask;
>  
>  		/* Only handle hardlockups once. */
> -		if (__this_cpu_read(watchdog_hardlockup_processed))
> +		if (per_cpu(watchdog_hardlockup_processed, cpu))

This should not need READ_ONCE()/WRITE_ONCE() because it is just bool.
Also it is read/modified only in this function on the same CPU.

>  			return;
>  
> -		pr_emerg("Watchdog detected hard LOCKUP on cpu %d\n", this_cpu);
> +		pr_emerg("Watchdog detected hard LOCKUP on cpu %d\n", cpu);
>  		print_modules();
>  		print_irqtrace_events(current);
> -		if (regs)
> +		if (regs) {
>  			show_regs(regs);
> -		else
> -			dump_stack();
> +			cpumask_clear_cpu(cpu, &backtrace_mask);
> +		} else {
> +			/*
> +			 * If the locked up CPU is different than the CPU we're
> +			 * running on then we'll try to backtrace the CPU that
> +			 * locked up and then exclude it from later backtraces.
> +			 * If that fails or if we're running on the locked up
> +			 * CPU, just do a normal backtrace.
> +			 */
> +			if (cpu != this_cpu && trigger_single_cpu_backtrace(cpu)) {
> +				cpumask_clear_cpu(cpu, &backtrace_mask);
> +			} else {
> +				dump_stack();
> +				cpumask_clear_cpu(this_cpu, &backtrace_mask);

This will dump the stack on the current CPU when
trigger_single_cpu_backtrace(cpu) is not supported.
It would be confusing because the buddy watchdog
could be here only when this_cpu is not hardlocked.

It should be:

	if (cpu == this_cpu) {
		if (regs)
			show_regs(regs);
		else
			dump_stack();
		cpumask_clear_cpu(cpu, &backtrace_mask);
	} else {
		if (trigger_single_cpu_backtrace(cpu)
			cpumask_clear_cpu(cpu, &backtrace_mask);
	}

> +			}
> +		}
>  
>  		/*
> -		 * Perform all-CPU dump only once to avoid multiple hardlockups
> -		 * generating interleaving traces
> +		 * Perform multi-CPU dump only once to avoid multiple
> +		 * hardlockups generating interleaving traces
>  		 */
>  		if (sysctl_hardlockup_all_cpu_backtrace &&
>  		    !test_and_set_bit(0, &watchdog_hardlockup_dumped_stacks))
> -			trigger_allbutself_cpu_backtrace();
> +			trigger_cpumask_backtrace(&backtrace_mask);
>  
>  		if (hardlockup_panic)
>  			nmi_panic(regs, "Hard LOCKUP");
>  
> -		__this_cpu_write(watchdog_hardlockup_processed, true);
> +		per_cpu(watchdog_hardlockup_processed, cpu) = true;
>  	} else {
> -		__this_cpu_write(watchdog_hardlockup_processed, false);
> +		per_cpu(watchdog_hardlockup_processed, cpu) = false;
>  	}
>  }
>  

Best Regards,
Petr

next prev parent reply	other threads:[~2023-05-11 14:15 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-04 22:13 [PATCH v4 00/17] watchdog/hardlockup: Add the buddy hardlockup detector Douglas Anderson
2023-05-04 22:13 ` [PATCH v4 01/17] watchdog/perf: Define dummy watchdog_update_hrtimer_threshold() on correct config Douglas Anderson
2023-05-05  2:43   ` Nicholas Piggin
2023-05-11  8:39     ` Petr Mladek
2023-05-04 22:13 ` [PATCH v4 02/17] watchdog: remove WATCHDOG_DEFAULT Douglas Anderson
2023-05-04 22:13 ` [PATCH v4 03/17] watchdog/hardlockup: change watchdog_nmi_enable() to void Douglas Anderson
2023-05-05  2:45   ` Nicholas Piggin
2023-05-04 22:13 ` [PATCH v4 04/17] watchdog/perf: Ensure CPU-bound context when creating hardlockup detector event Douglas Anderson
2023-05-04 22:13 ` [PATCH v4 05/17] watchdog/hardlockup: Rename touch_nmi_watchdog() to touch_hardlockup_watchdog() Douglas Anderson
2023-05-05  2:51   ` Nicholas Piggin
2023-05-05 16:37     ` Doug Anderson
2023-05-08  1:34       ` Nicholas Piggin
2023-05-08 15:56         ` Doug Anderson
2023-05-11  9:24       ` Petr Mladek
2023-05-04 22:13 ` [PATCH v4 06/17] watchdog/perf: Rename watchdog_hld.c to watchdog_perf.c Douglas Anderson
2023-05-05  2:53   ` Nicholas Piggin
2023-05-11 10:09   ` Petr Mladek
2023-05-04 22:13 ` [PATCH v4 07/17] watchdog/hardlockup: Move perf hardlockup checking/panic to common watchdog.c Douglas Anderson
2023-05-05  2:58   ` Nicholas Piggin
2023-05-05 16:37     ` Doug Anderson
2023-05-11 12:03       ` Petr Mladek
2023-05-04 22:13 ` [PATCH v4 08/17] watchdog/hardlockup: Style changes to watchdog_hardlockup_check() / ..._is_lockedup() Douglas Anderson
2023-05-05  3:01   ` Nicholas Piggin
2023-05-05 16:38     ` Doug Anderson
2023-05-11 12:45       ` Petr Mladek
2023-05-04 22:13 ` [PATCH v4 09/17] watchdog/hardlockup: Add a "cpu" param to watchdog_hardlockup_check() Douglas Anderson
2023-05-11 14:14   ` Petr Mladek [this message]
2023-05-19 17:21     ` Doug Anderson
2023-05-04 22:13 ` [PATCH v4 10/17] watchdog/hardlockup: Move perf hardlockup watchdog petting to watchdog.c Douglas Anderson
2023-05-11 15:46   ` Petr Mladek
2023-05-19 17:22     ` Doug Anderson
2023-05-04 22:13 ` [PATCH v4 11/17] watchdog/hardlockup: Rename some "NMI watchdog" constants/function Douglas Anderson
2023-05-05  3:06   ` Nicholas Piggin
2023-05-05 16:38     ` Doug Anderson
2023-05-12 11:21     ` Petr Mladek
2023-05-04 22:13 ` [PATCH v4 12/17] watchdog/hardlockup: Have the perf hardlockup use __weak functions more cleanly Douglas Anderson
2023-05-12 11:55   ` Petr Mladek
2023-05-04 22:13 ` [PATCH v4 13/17] watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUs Douglas Anderson
2023-05-05  2:35   ` Nicholas Piggin
2023-05-05 16:35     ` Doug Anderson
2023-05-08  1:04       ` Nicholas Piggin
2023-05-08 15:52         ` Doug Anderson
2023-05-19 17:23           ` Doug Anderson
2023-05-04 22:13 ` [PATCH v4 14/17] watchdog/perf: Add a weak function for an arch to detect if perf can use NMIs Douglas Anderson
2023-05-04 22:13 ` [PATCH v4 15/17] watchdog/perf: Adapt the watchdog_perf interface for async model Douglas Anderson
2023-05-04 22:13 ` [PATCH v4 16/17] arm64: add hw_nmi_get_sample_period for preparation of lockup detector Douglas Anderson
2023-05-04 22:13 ` [PATCH v4 17/17] arm64: Enable perf events based hard " Douglas Anderson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZFz4TVOyEU51b898@alley \
    --to=pmladek@suse.com \
    --cc=ak@linux.intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=catalin.marinas@arm.com \
    --cc=christophe.leroy@csgroup.eu \
    --cc=daniel.thompson@linaro.org \
    --cc=davem@davemloft.net \
    --cc=dianders@chromium.org \
    --cc=eranian@google.com \
    --cc=groeck@chromium.org \
    --cc=irogers@google.com \
    --cc=ito-yuichi@fujitsu.com \
    --cc=kernelfans@gmail.com \
    --cc=kgdb-bugreport@lists.sourceforge.net \
    --cc=lecopzer.chen@mediatek.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-perf-users@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mark.rutland@arm.com \
    --cc=maz@kernel.org \
    --cc=mka@chromium.org \
    --cc=mpe@ellerman.id.au \
    --cc=msys.mizuma@gmail.com \
    --cc=npiggin@gmail.com \
    --cc=ravi.v.shankar@intel.com \
    --cc=rdunlap@infradead.org \
    --cc=ricardo.neri@intel.com \
    --cc=sparclinux@vger.kernel.org \
    --cc=sumit.garg@linaro.org \
    --cc=swboyd@chromium.org \
    --cc=tzungbi@chromium.org \
    --cc=wens@csie.org \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).