From: "Nicholas Piggin" <npiggin@gmail.com>
To: "Doug Anderson" <dianders@chromium.org>
Cc: "Petr Mladek" <pmladek@suse.com>,
"Andrew Morton" <akpm@linux-foundation.org>,
"Sumit Garg" <sumit.garg@linaro.org>,
"Mark Rutland" <mark.rutland@arm.com>,
"Matthias Kaehlcke" <mka@chromium.org>,
"Stephane Eranian" <eranian@google.com>,
"Stephen Boyd" <swboyd@chromium.org>, <ricardo.neri@intel.com>,
"Tzung-Bi Shih" <tzungbi@chromium.org>,
"Lecopzer Chen" <lecopzer.chen@mediatek.com>,
<kgdb-bugreport@lists.sourceforge.net>,
"Masayoshi Mizuma" <msys.mizuma@gmail.com>,
"Guenter Roeck" <groeck@chromium.org>,
"Pingfan Liu" <kernelfans@gmail.com>,
"Andi Kleen" <ak@linux.intel.com>,
"Ian Rogers" <irogers@google.com>,
<linux-arm-kernel@lists.infradead.org>,
<linux-perf-users@vger.kernel.org>, <ito-yuichi@fujitsu.com>,
"Randy Dunlap" <rdunlap@infradead.org>,
"Chen-Yu Tsai" <wens@csie.org>, <christophe.leroy@csgroup.eu>,
<davem@davemloft.net>, <sparclinux@vger.kernel.org>,
<mpe@ellerman.id.au>, "Will Deacon" <will@kernel.org>,
<ravi.v.shankar@intel.com>, <linuxppc-dev@lists.ozlabs.org>,
"Marc Zyngier" <maz@kernel.org>,
"Catalin Marinas" <catalin.marinas@arm.com>,
"Daniel Thompson" <daniel.thompson@linaro.org>,
"Colin Cross" <ccross@android.com>
Subject: Re: [PATCH v4 13/17] watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUs
Date: Mon, 08 May 2023 11:04:40 +1000
Message-ID: <CSGHQJAJHWVS.1UAJOF8P5UXSK@wheely>
In-Reply-To: <CAD=FV=XDfbx3UaP7DV63tASE5Md7siS-EnORD_3T-4yYaEQ7ww@mail.gmail.com>
On Sat May 6, 2023 at 2:35 AM AEST, Doug Anderson wrote:
> Hi,
>
> On Thu, May 4, 2023 at 7:36 PM Nicholas Piggin <npiggin@gmail.com> wrote:
> >
> > On Fri May 5, 2023 at 8:13 AM AEST, Douglas Anderson wrote:
> > > From: Colin Cross <ccross@android.com>
> > >
> > > Implement a hardlockup detector that doesn't need any extra
> > > arch-specific support code to detect lockups. Instead of using
> > > something arch-specific we will use the buddy system, where each CPU
> > > watches out for another one. Specifically, each CPU will use its
> > > softlockup hrtimer to check that the next CPU is processing hrtimer
> > > interrupts by verifying that a counter is increasing.
> >
> > Powerpc's watchdog has an SMP checker, did you see it?
>
> No, I wasn't aware of it. Interesting, it seems to basically enable
> both types of hardlockup detectors together. If that really catches
> more lockups, it seems like we could do the same thing for the buddy
> system.

It doesn't catch more lockups. On powerpc we don't have a reliable
periodic NMI, hence the SMP checker. But it is preferable that a CPU
detects its own lockup, because NMI IPIs can result in crashes if
they are taken in certain critical sections.

> If people want, I don't think it would be very hard to make
> the buddy system _not_ exclusive of the perf system. Instead of having
> the buddy system implement the "weak" functions I could just call the
> buddy functions in the right places directly and leave the "weak"
> functions for a more traditional hardlockup detector to implement.
> Opinions?
>
> Maybe after all this lands, the powerpc watchdog could move to use the
> common code? As evidenced by this patch series, there's not really a
> reason for the SMP detection to be platform specific.

The powerpc SMP checker could certainly move to common code if
others wanted to use it.

> > It's all-to-all rather than buddy, which makes it more complicated
> > but arguably gives a bit better functionality.
>
> Can you come up with an example crash where the "all to all" would
> work better than the simple buddy system provided by this patch?

CPU2                     CPU3
spin_lock_irqsave(A)     spin_lock_irqsave(B)
spin_lock_irqsave(B)     spin_lock_irqsave(A)

CPU1 will detect the lockup on CPU2, but CPU3's lockup won't be
detected so we don't get the trace that can diagnose the bug.
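
To make the scenario concrete, here is a minimal sketch of the two
detection schemes. It is not code from the patch series or from the
powerpc watchdog. It assumes a 4-CPU system numbered 0-3, models a
locked-up CPU simply as a member of a "stuck" set (standing in for an
hrtimer interrupt counter that has stopped increasing), and uses the
hypothetical helper names buddy_detect() and all_to_all_detect().

```python
from typing import Dict, List, Set


def buddy_detect(num_cpus: int, stuck: Set[int]) -> Dict[int, int]:
    """Map each detected locked-up CPU to the CPU that detected it.

    Buddy model: every CPU that is still running checks only the next
    CPU (wrapping around). A locked-up CPU cannot run its own check.
    """
    detected: Dict[int, int] = {}
    for cpu in range(num_cpus):
        if cpu in stuck:
            continue  # a locked-up CPU never gets to check its buddy
        buddy = (cpu + 1) % num_cpus
        if buddy in stuck:
            detected[buddy] = cpu
    return detected


def all_to_all_detect(num_cpus: int, stuck: Set[int]) -> Dict[int, List[int]]:
    """Map each detected locked-up CPU to the list of CPUs that saw it.

    All-to-all model: every CPU that is still running checks every
    other CPU.
    """
    detected: Dict[int, List[int]] = {}
    for cpu in range(num_cpus):
        if cpu in stuck:
            continue
        for other in range(num_cpus):
            if other != cpu and other in stuck:
                detected.setdefault(other, []).append(cpu)
    return detected


if __name__ == "__main__":
    # CPU2 and CPU3 deadlock on locks A and B with interrupts disabled.
    stuck = {2, 3}
    print("buddy:     ", buddy_detect(4, stuck))
    print("all-to-all:", all_to_all_detect(4, stuck))
```

In this model the buddy scheme reports only CPU2 (seen by CPU1),
because CPU3's watcher is CPU2, which is itself stuck. The all-to-all
scheme reports both CPU2 and CPU3.
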
Another thing I have actually found it useful for is that you can
easily see whether a core (i.e., all threads in the core) or a chip
has died. That is probably more useful during pre-silicon and bring-up
work or firmware hacking, but it is still useful.

Thanks,
Nick
> It
> seems like they would be equivalent, but I could be missing something.
> Specifically they both need at least one non-locked-up CPU to detect a
> problem. If one or more CPUs is locked up then we'll always detect it.
> I suppose maybe you could provide a better error message at lockup
> time saying that several CPUs were locked up and that could be
> helpful. For now, I'd keep the current buddy system the way it is and
> if you want to provide a patch improving things to be "all-to-all" in
> the future that would be interesting to review.
Thread overview: 47+ messages
2023-05-04 22:13 [PATCH v4 00/17] watchdog/hardlockup: Add the buddy hardlockup detector Douglas Anderson
2023-05-04 22:13 ` [PATCH v4 01/17] watchdog/perf: Define dummy watchdog_update_hrtimer_threshold() on correct config Douglas Anderson
2023-05-05 2:43 ` Nicholas Piggin
2023-05-11 8:39 ` Petr Mladek
2023-05-04 22:13 ` [PATCH v4 02/17] watchdog: remove WATCHDOG_DEFAULT Douglas Anderson
2023-05-04 22:13 ` [PATCH v4 03/17] watchdog/hardlockup: change watchdog_nmi_enable() to void Douglas Anderson
2023-05-05 2:45 ` Nicholas Piggin
2023-05-04 22:13 ` [PATCH v4 04/17] watchdog/perf: Ensure CPU-bound context when creating hardlockup detector event Douglas Anderson
2023-05-04 22:13 ` [PATCH v4 05/17] watchdog/hardlockup: Rename touch_nmi_watchdog() to touch_hardlockup_watchdog() Douglas Anderson
2023-05-05 2:51 ` Nicholas Piggin
2023-05-05 16:37 ` Doug Anderson
2023-05-08 1:34 ` Nicholas Piggin
2023-05-08 15:56 ` Doug Anderson
2023-05-11 9:24 ` Petr Mladek
2023-05-04 22:13 ` [PATCH v4 06/17] watchdog/perf: Rename watchdog_hld.c to watchdog_perf.c Douglas Anderson
2023-05-05 2:53 ` Nicholas Piggin
2023-05-11 10:09 ` Petr Mladek
2023-05-04 22:13 ` [PATCH v4 07/17] watchdog/hardlockup: Move perf hardlockup checking/panic to common watchdog.c Douglas Anderson
2023-05-05 2:58 ` Nicholas Piggin
2023-05-05 16:37 ` Doug Anderson
2023-05-11 12:03 ` Petr Mladek
2023-05-04 22:13 ` [PATCH v4 08/17] watchdog/hardlockup: Style changes to watchdog_hardlockup_check() / ..._is_lockedup() Douglas Anderson
2023-05-05 3:01 ` Nicholas Piggin
2023-05-05 16:38 ` Doug Anderson
2023-05-11 12:45 ` Petr Mladek
2023-05-04 22:13 ` [PATCH v4 09/17] watchdog/hardlockup: Add a "cpu" param to watchdog_hardlockup_check() Douglas Anderson
2023-05-11 14:14 ` Petr Mladek
2023-05-19 17:21 ` Doug Anderson
2023-05-04 22:13 ` [PATCH v4 10/17] watchdog/hardlockup: Move perf hardlockup watchdog petting to watchdog.c Douglas Anderson
2023-05-11 15:46 ` Petr Mladek
2023-05-19 17:22 ` Doug Anderson
2023-05-04 22:13 ` [PATCH v4 11/17] watchdog/hardlockup: Rename some "NMI watchdog" constants/function Douglas Anderson
2023-05-05 3:06 ` Nicholas Piggin
2023-05-05 16:38 ` Doug Anderson
2023-05-12 11:21 ` Petr Mladek
2023-05-04 22:13 ` [PATCH v4 12/17] watchdog/hardlockup: Have the perf hardlockup use __weak functions more cleanly Douglas Anderson
2023-05-12 11:55 ` Petr Mladek
2023-05-04 22:13 ` [PATCH v4 13/17] watchdog/hardlockup: detect hard lockups using secondary (buddy) CPUs Douglas Anderson
2023-05-05 2:35 ` Nicholas Piggin
2023-05-05 16:35 ` Doug Anderson
2023-05-08 1:04 ` Nicholas Piggin [this message]
2023-05-08 15:52 ` Doug Anderson
2023-05-19 17:23 ` Doug Anderson
2023-05-04 22:13 ` [PATCH v4 14/17] watchdog/perf: Add a weak function for an arch to detect if perf can use NMIs Douglas Anderson
2023-05-04 22:13 ` [PATCH v4 15/17] watchdog/perf: Adapt the watchdog_perf interface for async model Douglas Anderson
2023-05-04 22:13 ` [PATCH v4 16/17] arm64: add hw_nmi_get_sample_period for preparation of lockup detector Douglas Anderson
2023-05-04 22:13 ` [PATCH v4 17/17] arm64: Enable perf events based hard " Douglas Anderson