From: Petr Mladek <pmladek@suse.com>
To: mrungta@google.com
Cc: Jonathan Corbet <corbet@lwn.net>,
Jinchao Wang <wangjinchao600@gmail.com>,
Yunhui Cui <cuiyunhui@bytedance.com>,
Stephane Eranian <eranian@google.com>,
Ian Rogers <irogers@google.com>, Li Huafei <lihuafei1@huawei.com>,
Feng Tang <feng.tang@linux.alibaba.com>,
Max Kellermann <max.kellermann@ionos.com>,
Douglas Anderson <dianders@chromium.org>,
Andrew Morton <akpm@linux-foundation.org>,
linux-kernel@vger.kernel.org, linux-doc@vger.kernel.org
Subject: Re: [PATCH 3/4] watchdog/hardlockup: improve buddy system detection timeliness
Date: Thu, 5 Mar 2026 14:46:56 +0100
Message-ID: <aamJUImqf4WfTu3d@pathway.suse.cz>
In-Reply-To: <20260212-hardlockup-watchdog-fixes-v1-3-745f1dce04c3@google.com>
On Thu 2026-02-12 14:12:12, Mayank Rungta via B4 Relay wrote:
> From: Mayank Rungta <mrungta@google.com>
>
> Currently, the buddy system only performs checks every 3rd sample. With
> a 4-second interval. If a check window is missed, the next check occurs
> 12 seconds later, potentially delaying hard lockup detection for up to
> 24 seconds.
>
> Modify the buddy system to perform checks at every interval (4s).
> Introduce a missed-interrupt threshold to maintain the existing grace
> period while reducing the detection window to 8-12 seconds.
>
> Best and worst case detection scenarios:
>
> Before (12s check window):
> - Best case: Lockup occurs after first check but just before heartbeat
> interval. Detected in ~8s (8s till next check).
> - Worst case: Lockup occurs just after a check.
> Detected in ~24s (missed check + 12s till next check + 12s logic).
>
> After (4s check window with threshold of 3):
> - Best case: Lockup occurs just before a check.
> Detected in ~8s (0s till 1st check + 4s till 2nd + 4s till 3rd).
> - Worst case: Lockup occurs just after a check.
> Detected in ~12s (4s till 1st check + 4s till 2nd + 4s till 3rd).
One might argue that the interval <8s,24s> is not much worse than
<6s,20s> achieved by the perf detector.
But I personally like that the spread of <8s,12s> is narrower, so
the result is more predictable. And it is relatively cheap.
People might have a different opinion, but I am fine with this change.
> --- a/kernel/watchdog.c
> +++ b/kernel/watchdog.c
> @@ -163,8 +171,13 @@ static bool is_hardlockup(unsigned int cpu)
> {
> int hrint = atomic_read(&per_cpu(hrtimer_interrupts, cpu));
>
> - if (per_cpu(hrtimer_interrupts_saved, cpu) == hrint)
> - return true;
> + if (per_cpu(hrtimer_interrupts_saved, cpu) == hrint) {
> + per_cpu(hrtimer_interrupts_missed, cpu)++;
> + if (per_cpu(hrtimer_interrupts_missed, cpu) >= watchdog_hardlockup_miss_thresh)
This would return true on every check once missed >= 3.
As a result, the hardlockup would be reported every 4s.
I would keep the 12s cadence and change this to:

	if (per_cpu(hrtimer_interrupts_missed, cpu) % watchdog_hardlockup_miss_thresh == 0)
> + return true;
> +
> + return false;
> + }
>
> /*
> * NOTE: we don't need any fancy atomic_t or READ_ONCE/WRITE_ONCE
> --- a/kernel/watchdog_buddy.c
> +++ b/kernel/watchdog_buddy.c
> @@ -86,14 +87,6 @@ void watchdog_buddy_check_hardlockup(int hrtimer_interrupts)
> {
> unsigned int next_cpu;
>
> - /*
> - * Test for hardlockups every 3 samples. The sample period is
> - * watchdog_thresh * 2 / 5, so 3 samples gets us back to slightly over
> - * watchdog_thresh (over by 20%).
> - */
> - if (hrtimer_interrupts % 3 != 0)
> - return;
The "% watchdog_hardlockup_miss_thresh" check suggested for
is_hardlockup() would be symmetric with the "% 3" removed here.
> -
> /* check for a hardlockup on the next CPU */
> next_cpu = watchdog_next_cpu(smp_processor_id());
> if (next_cpu >= nr_cpu_ids)
Best Regards,
Petr