Re: [PATCH v4 13/17] watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUs

linux-perf-users.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: "Nicholas Piggin" <npiggin@gmail.com>
To: "Doug Anderson" <dianders@chromium.org>
Cc: "Petr Mladek" <pmladek@suse.com>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"Sumit Garg" <sumit.garg@linaro.org>,
	"Mark Rutland" <mark.rutland@arm.com>,
	"Matthias Kaehlcke" <mka@chromium.org>,
	"Stephane Eranian" <eranian@google.com>,
	"Stephen Boyd" <swboyd@chromium.org>, <ricardo.neri@intel.com>,
	"Tzung-Bi Shih" <tzungbi@chromium.org>,
	"Lecopzer Chen" <lecopzer.chen@mediatek.com>,
	<kgdb-bugreport@lists.sourceforge.net>,
	"Masayoshi Mizuma" <msys.mizuma@gmail.com>,
	"Guenter Roeck" <groeck@chromium.org>,
	"Pingfan Liu" <kernelfans@gmail.com>,
	"Andi Kleen" <ak@linux.intel.com>,
	"Ian Rogers" <irogers@google.com>,
	<linux-arm-kernel@lists.infradead.org>,
	<linux-perf-users@vger.kernel.org>, <ito-yuichi@fujitsu.com>,
	"Randy Dunlap" <rdunlap@infradead.org>,
	"Chen-Yu Tsai" <wens@csie.org>, <christophe.leroy@csgroup.eu>,
	<davem@davemloft.net>, <sparclinux@vger.kernel.org>,
	<mpe@ellerman.id.au>, "Will Deacon" <will@kernel.org>,
	<ravi.v.shankar@intel.com>, <linuxppc-dev@lists.ozlabs.org>,
	"Marc Zyngier" <maz@kernel.org>,
	"Catalin Marinas" <catalin.marinas@arm.com>,
	"Daniel Thompson" <daniel.thompson@linaro.org>,
	"Colin Cross" <ccross@android.com>
Subject: Re: [PATCH v4 13/17] watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUs
Date: Mon, 08 May 2023 11:04:40 +1000	[thread overview]
Message-ID: <CSGHQJAJHWVS.1UAJOF8P5UXSK@wheely> (raw)
In-Reply-To: <CAD=FV=XDfbx3UaP7DV63tASE5Md7siS-EnORD_3T-4yYaEQ7ww@mail.gmail.com>

On Sat May 6, 2023 at 2:35 AM AEST, Doug Anderson wrote:
> Hi,
>
> On Thu, May 4, 2023 at 7:36 PM Nicholas Piggin <npiggin@gmail.com> wrote:
> >
> > On Fri May 5, 2023 at 8:13 AM AEST, Douglas Anderson wrote:
> > > From: Colin Cross <ccross@android.com>
> > >
> > > Implement a hardlockup detector that doesn't doesn't need any extra
> > > arch-specific support code to detect lockups. Instead of using
> > > something arch-specific we will use the buddy system, where each CPU
> > > watches out for another one. Specifically, each CPU will use its
> > > softlockup hrtimer to check that the next CPU is processing hrtimer
> > > interrupts by verifying that a counter is increasing.
> >
> > Powerpc's watchdog has an SMP checker, did you see it?
>
> No, I wasn't aware of it. Interesting, it seems to basically enable
> both types of hardlockup detectors together. If that really catches
> more lockups, it seems like we could do the same thing for the buddy
> system.

It doesn't catch more lockups. On powerpc we don't have a reliable
periodic NMI hence the SMP checker. But it is preferable that a CPU
detects its own lockup because NMI IPIs can result in crashes if
they are taken in certain critical sections.

> If people want, I don't think it would be very hard to make
> the buddy system _not_ exclusive of the perf system. Instead of having
> the buddy system implement the "weak" functions I could just call the
> buddy functions in the right places directly and leave the "weak"
> functions for a more traditional hardlockup detector to implement.
> Opinions?
>
> Maybe after all this lands, the powerpc watchdog could move to use the
> common code? As evidenced by this patch series, there's not really a
> reason for the SMP detection to be platform specific.

The powerpc SMP checker could certainly move to common code if
others wanted to use it.

> > It's all to
> > all rather than buddy which makes it more complicated but arguably
> > bit better functionality.
>
> Can you come up with an example crash where the "all to all" would
> work better than the simple buddy system provided by this patch?

CPU2                     CPU3
spin_lock_irqsave(A)     spin_lock_irqsave(B)
spin_lock_irqsave(B)     spin_lock_irqsave(A)

CPU1 will detect the lockup on CPU2, but CPU3's lockup won't be
detected so we don't get the trace that can diagnose the bug.

Another thing I actually found it useful for is you can easily
see if a core (i.e., all threads in the core) or a chip has
died. Maybe more useful when doing presilicon and bring up work
or firmware hacking, but still useful.

Thanks,
Nick

> It
> seems like they would be equivalent, but I could be missing something.
> Specifically they both need at least one non-locked-up CPU to detect a
> problem. If one or more CPUs is locked up then we'll always detect it.
> I suppose maybe you could provide a better error message at lockup
> time saying that several CPUs were locked up and that could be
> helpful. For now, I'd keep the current buddy system the way it is and
> if you want to provide a patch improving things to be "all-to-all" in
> the future that would be interesting to review.

next prev parent reply	other threads:[~2023-05-08  1:05 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-04 22:13 [PATCH v4 00/17] watchdog/hardlockup: Add the buddy hardlockup detector Douglas Anderson
2023-05-04 22:13 ` [PATCH v4 01/17] watchdog/perf: Define dummy watchdog_update_hrtimer_threshold() on correct config Douglas Anderson
2023-05-05  2:43   ` Nicholas Piggin
2023-05-11  8:39     ` Petr Mladek
2023-05-04 22:13 ` [PATCH v4 02/17] watchdog: remove WATCHDOG_DEFAULT Douglas Anderson
2023-05-04 22:13 ` [PATCH v4 03/17] watchdog/hardlockup: change watchdog_nmi_enable() to void Douglas Anderson
2023-05-05  2:45   ` Nicholas Piggin
2023-05-04 22:13 ` [PATCH v4 04/17] watchdog/perf: Ensure CPU-bound context when creating hardlockup detector event Douglas Anderson
2023-05-04 22:13 ` [PATCH v4 05/17] watchdog/hardlockup: Rename touch_nmi_watchdog() to touch_hardlockup_watchdog() Douglas Anderson
2023-05-05  2:51   ` Nicholas Piggin
2023-05-05 16:37     ` Doug Anderson
2023-05-08  1:34       ` Nicholas Piggin
2023-05-08 15:56         ` Doug Anderson
2023-05-11  9:24       ` Petr Mladek
2023-05-04 22:13 ` [PATCH v4 06/17] watchdog/perf: Rename watchdog_hld.c to watchdog_perf.c Douglas Anderson
2023-05-05  2:53   ` Nicholas Piggin
2023-05-11 10:09   ` Petr Mladek
2023-05-04 22:13 ` [PATCH v4 07/17] watchdog/hardlockup: Move perf hardlockup checking/panic to common watchdog.c Douglas Anderson
2023-05-05  2:58   ` Nicholas Piggin
2023-05-05 16:37     ` Doug Anderson
2023-05-11 12:03       ` Petr Mladek
2023-05-04 22:13 ` [PATCH v4 08/17] watchdog/hardlockup: Style changes to watchdog_hardlockup_check() / ..._is_lockedup() Douglas Anderson
2023-05-05  3:01   ` Nicholas Piggin
2023-05-05 16:38     ` Doug Anderson
2023-05-11 12:45       ` Petr Mladek
2023-05-04 22:13 ` [PATCH v4 09/17] watchdog/hardlockup: Add a "cpu" param to watchdog_hardlockup_check() Douglas Anderson
2023-05-11 14:14   ` Petr Mladek
2023-05-19 17:21     ` Doug Anderson
2023-05-04 22:13 ` [PATCH v4 10/17] watchdog/hardlockup: Move perf hardlockup watchdog petting to watchdog.c Douglas Anderson
2023-05-11 15:46   ` Petr Mladek
2023-05-19 17:22     ` Doug Anderson
2023-05-04 22:13 ` [PATCH v4 11/17] watchdog/hardlockup: Rename some "NMI watchdog" constants/function Douglas Anderson
2023-05-05  3:06   ` Nicholas Piggin
2023-05-05 16:38     ` Doug Anderson
2023-05-12 11:21     ` Petr Mladek
2023-05-04 22:13 ` [PATCH v4 12/17] watchdog/hardlockup: Have the perf hardlockup use __weak functions more cleanly Douglas Anderson
2023-05-12 11:55   ` Petr Mladek
2023-05-04 22:13 ` [PATCH v4 13/17] watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUs Douglas Anderson
2023-05-05  2:35   ` Nicholas Piggin
2023-05-05 16:35     ` Doug Anderson
2023-05-08  1:04       ` Nicholas Piggin [this message]
2023-05-08 15:52         ` Doug Anderson
2023-05-19 17:23           ` Doug Anderson
2023-05-04 22:13 ` [PATCH v4 14/17] watchdog/perf: Add a weak function for an arch to detect if perf can use NMIs Douglas Anderson
2023-05-04 22:13 ` [PATCH v4 15/17] watchdog/perf: Adapt the watchdog_perf interface for async model Douglas Anderson
2023-05-04 22:13 ` [PATCH v4 16/17] arm64: add hw_nmi_get_sample_period for preparation of lockup detector Douglas Anderson
2023-05-04 22:13 ` [PATCH v4 17/17] arm64: Enable perf events based hard " Douglas Anderson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CSGHQJAJHWVS.1UAJOF8P5UXSK@wheely \
    --to=npiggin@gmail.com \
    --cc=ak@linux.intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=catalin.marinas@arm.com \
    --cc=ccross@android.com \
    --cc=christophe.leroy@csgroup.eu \
    --cc=daniel.thompson@linaro.org \
    --cc=davem@davemloft.net \
    --cc=dianders@chromium.org \
    --cc=eranian@google.com \
    --cc=groeck@chromium.org \
    --cc=irogers@google.com \
    --cc=ito-yuichi@fujitsu.com \
    --cc=kernelfans@gmail.com \
    --cc=kgdb-bugreport@lists.sourceforge.net \
    --cc=lecopzer.chen@mediatek.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-perf-users@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mark.rutland@arm.com \
    --cc=maz@kernel.org \
    --cc=mka@chromium.org \
    --cc=mpe@ellerman.id.au \
    --cc=msys.mizuma@gmail.com \
    --cc=pmladek@suse.com \
    --cc=ravi.v.shankar@intel.com \
    --cc=rdunlap@infradead.org \
    --cc=ricardo.neri@intel.com \
    --cc=sparclinux@vger.kernel.org \
    --cc=sumit.garg@linaro.org \
    --cc=swboyd@chromium.org \
    --cc=tzungbi@chromium.org \
    --cc=wens@csie.org \
    --cc=will@kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).