From: Petr Mladek <pmladek@suse.com>
To: mrungta@google.com
Cc: Jonathan Corbet <corbet@lwn.net>,
Jinchao Wang <wangjinchao600@gmail.com>,
Yunhui Cui <cuiyunhui@bytedance.com>,
Stephane Eranian <eranian@google.com>,
Ian Rogers <irogers@google.com>, Li Huafei <lihuafei1@huawei.com>,
Feng Tang <feng.tang@linux.alibaba.com>,
Max Kellermann <max.kellermann@ionos.com>,
Douglas Anderson <dianders@chromium.org>,
Andrew Morton <akpm@linux-foundation.org>,
linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org
Subject: Re: [PATCH 3/4] watchdog/hardlockup: improve buddy system detection timeliness
Date: Thu, 5 Mar 2026 14:46:56 +0100
Message-ID: <aamJUImqf4WfTu3d@pathway.suse.cz>
In-Reply-To: <20260212-hardlockup-watchdog-fixes-v1-3-745f1dce04c3@google.com>
On Thu 2026-02-12 14:12:12, Mayank Rungta via B4 Relay wrote:
> From: Mayank Rungta <mrungta@google.com>
>
> Currently, the buddy system only performs checks every 3rd sample, with
> a 4-second sampling interval. If a check window is missed, the next check occurs
> 12 seconds later, potentially delaying hard lockup detection for up to
> 24 seconds.
>
> Modify the buddy system to perform checks at every interval (4s).
> Introduce a missed-interrupt threshold to maintain the existing grace
> period while reducing the detection window to 8-12 seconds.
>
> Best and worst case detection scenarios:
>
> Before (12s check window):
> - Best case: Lockup occurs after first check but just before heartbeat
> interval. Detected in ~8s (8s till next check).
> - Worst case: Lockup occurs just after a check.
> Detected in ~24s (missed check + 12s till next check + 12s logic).
>
> After (4s check window with threshold of 3):
> - Best case: Lockup occurs just before a check.
> Detected in ~8s (0s till 1st check + 4s till 2nd + 4s till 3rd).
> - Worst case: Lockup occurs just after a check.
> Detected in ~12s (4s till 1st check + 4s till 2nd + 4s till 3rd).

One might argue that the current buddy detector's <8s,24s> window is not
much worse than the <6s,20s> window achieved by the perf detector.

But I personally like that the spread of <8s,12s> is smaller, so the
result is more predictable. And the change is relatively cheap.

People might have a different opinion, but I am fine with this change.

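For reference, the arithmetic behind these windows, assuming the default watchdog_thresh of 10 seconds (the 4s sample period comes from the "watchdog_thresh * 2 / 5" formula in the comment that this patch removes from watchdog_buddy.c):

```latex
\begin{align*}
T_{\mathrm{sample}} &= \frac{2 \cdot \mathtt{watchdog\_thresh}}{5} = \frac{2 \cdot 10\,\mathrm{s}}{5} = 4\,\mathrm{s}\\
3\,T_{\mathrm{sample}} &= 12\,\mathrm{s} = 1.2 \cdot \mathtt{watchdog\_thresh}\\
\Delta_{\mathrm{buddy,\,before}} &= 24\,\mathrm{s} - 8\,\mathrm{s} = 16\,\mathrm{s}\\
\Delta_{\mathrm{perf}} &= 20\,\mathrm{s} - 6\,\mathrm{s} = 14\,\mathrm{s}\\
\Delta_{\mathrm{buddy,\,after}} &= 12\,\mathrm{s} - 8\,\mathrm{s} = 4\,\mathrm{s}
\end{align*}
```

Here each \Delta is the width of a detection window; the narrower it is, the more predictable the moment of the report.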
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -163,8 +171,13 @@ static bool is_hardlockup(unsigned int cpu)
>  {
>  	int hrint = atomic_read(&per_cpu(hrtimer_interrupts, cpu));
>
> -	if (per_cpu(hrtimer_interrupts_saved, cpu) == hrint)
> -		return true;
> +	if (per_cpu(hrtimer_interrupts_saved, cpu) == hrint) {
> +		per_cpu(hrtimer_interrupts_missed, cpu)++;
> +		if (per_cpu(hrtimer_interrupts_missed, cpu) >= watchdog_hardlockup_miss_thresh)

This would return true for every check once missed >= 3. As a result,
the same hard lockup would be reported every 4s.

I would keep the 12s cadence and change this to:

		if (per_cpu(hrtimer_interrupts_missed, cpu) % watchdog_hardlockup_miss_thresh == 0)

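A minimal user-space model of the difference. The names struct buddy_state, check_once() and stuck_cpu_reports() are hypothetical illustrations, not kernel code, and resetting the miss counter once the interrupt count moves again is an assumption about the part of is_hardlockup() not quoted here. The model records which consecutive checks of a stuck CPU return true under each test:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-CPU counters used in kernel/watchdog.c. */
struct buddy_state {
	unsigned int interrupts; /* models hrtimer_interrupts */
	unsigned int saved;      /* models hrtimer_interrupts_saved */
	unsigned int missed;     /* models hrtimer_interrupts_missed */
};

/*
 * One check of the target CPU. use_modulo == false mirrors the ">= thresh"
 * test from the patch; use_modulo == true mirrors the "% thresh == 0" test
 * suggested in the review. Returns true when the check flags a lockup.
 */
static bool check_once(struct buddy_state *s, unsigned int thresh,
		       bool use_modulo)
{
	assert(thresh > 0); /* the modulo form must never divide by zero */

	if (s->saved == s->interrupts) {
		s->missed++;
		if (use_modulo)
			return s->missed % thresh == 0;
		return s->missed >= thresh;
	}

	/* Assumed: progress resets the miss counter. */
	s->saved = s->interrupts;
	s->missed = 0;
	return false;
}

/*
 * Run `checks` consecutive checks against a CPU whose interrupt count never
 * moves again ("saved" already matches, i.e. it locked up right after the
 * previous check). Stores the 1-based index of every check that returns
 * true in out[] and returns how many there were.
 */
static unsigned int stuck_cpu_reports(unsigned int thresh, bool use_modulo,
				      unsigned int checks, unsigned int *out)
{
	struct buddy_state s = { .interrupts = 5, .saved = 5, .missed = 0 };
	unsigned int n = 0;

	for (unsigned int i = 1; i <= checks; i++) {
		if (check_once(&s, thresh, use_modulo))
			out[n++] = i;
	}
	return n;
}
```

With a threshold of 3 and nine checks of a stuck CPU, the ">=" test returns true on checks 3 through 9, while the modulo test returns true only on checks 3, 6 and 9. At one check per 4s sample, that is the every-4s versus every-12s cadence described above.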
> +			return true;
> +
> +		return false;
> +	}
>
>  	/*
>  	 * NOTE: we don't need any fancy atomic_t or READ_ONCE/WRITE_ONCE
> --- a/kernel/watchdog_buddy.c
> +++ b/kernel/watchdog_buddy.c
> @@ -86,14 +87,6 @@ void watchdog_buddy_check_hardlockup(int hrtimer_interrupts)
>  {
>  	unsigned int next_cpu;
>
> -	/*
> -	 * Test for hardlockups every 3 samples. The sample period is
> -	 * watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly over
> -	 * watchdog_thresh (over by 20%).
> -	 */
> -	if (hrtimer_interrupts % 3 != 0)
> -		return;

The "% watchdog_hardlockup_miss_thresh" check suggested above would be
symmetric with the "% 3" check removed here.

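To illustrate the symmetry, a hypothetical discrete model: target_count(), old_reports() and new_reports() are made-up names, and one step per 4s sample, a target CPU whose count stops moving after sample `lock`, and a miss counter reset on progress are all assumptions. It records the samples at which each scheme would flag the lockup:

```c
#include <assert.h>

/*
 * Interrupt count of the target CPU at a given sample: it grows by one
 * per sample until sample `lock`, then stays frozen.
 */
static unsigned int target_count(unsigned int sample, unsigned int lock)
{
	return sample < lock ? sample : lock;
}

/*
 * Old buddy scheme: only every 3rd sample performs a check, and a check
 * flags the lockup when the count has not moved since the previous check.
 * Stores flagged sample numbers in out[] and returns how many.
 */
static unsigned int old_reports(unsigned int lock, unsigned int samples,
				unsigned int *out)
{
	unsigned int saved = 0, n = 0;

	for (unsigned int s = 1; s <= samples; s++) {
		if (s % 3 != 0)
			continue;
		if (target_count(s, lock) == saved)
			out[n++] = s;
		saved = target_count(s, lock);
	}
	return n;
}

/*
 * Suggested scheme: check every sample, count consecutive misses and
 * flag the lockup whenever missed % thresh == 0.
 */
static unsigned int new_reports(unsigned int lock, unsigned int samples,
				unsigned int thresh, unsigned int *out)
{
	unsigned int saved = 0, missed = 0, n = 0;

	assert(thresh > 0);
	for (unsigned int s = 1; s <= samples; s++) {
		unsigned int cur = target_count(s, lock);

		if (cur == saved) {
			if (++missed % thresh == 0)
				out[n++] = s;
		} else {
			saved = cur;
			missed = 0;
		}
	}
	return n;
}
```

For a lockup after sample 4 over 12 samples, the old "% 3" gating flags samples 9 and 12, and the per-sample modulo check flags samples 7 and 10. Consecutive flags are 3 samples (12s) apart in both schemes; in this particular phase alignment the per-sample check also flags the first one earlier.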
> -
>  	/* check for a hardlockup on the next CPU */
>  	next_cpu = watchdog_next_cpu(smp_processor_id());
>  	if (next_cpu >= nr_cpu_ids)
Best Regards,
Petr