From: Feng Tang <feng.tang@intel.com>
To: Jiri Wiesner <jwiesner@suse.de>
Cc: <linux-kernel@vger.kernel.org>, John Stultz <jstultz@google.com>,
"Thomas Gleixner" <tglx@linutronix.de>,
Stephen Boyd <sboyd@kernel.org>,
"Paul E. McKenney" <paulmck@kernel.org>
Subject: Re: [PATCH] clocksource: Skip watchdog check for large watchdog intervals
Date: Thu, 4 Jan 2024 13:55:21 +0800 [thread overview]
Message-ID: <ZZZISRg9YGyT2eQ+@feng-clx> (raw)
In-Reply-To: <20240103112113.GA6108@incl>
On Wed, Jan 03, 2024 at 12:21:13PM +0100, Jiri Wiesner wrote:
> There have been reports of the watchdog marking clocksources unstable on
> machines with 8 NUMA nodes:
> > clocksource: timekeeping watchdog on CPU373: Marking clocksource 'tsc' as unstable because the skew is too large:
> > clocksource: 'hpet' wd_nsec: 14523447520 wd_now: 5a749706 wd_last: 45adf1e0 mask: ffffffff
> > clocksource: 'tsc' cs_nsec: 14524115132 cs_now: 515ce2c5a96caa cs_last: 515cd9a9d83918 mask: ffffffffffffffff
> > clocksource: 'tsc' is current clocksource.
> > tsc: Marking TSC unstable due to clocksource watchdog
> > TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> > sched_clock: Marking unstable (1950347883333462, 79649632569)<-(1950428279338308, -745776594)
> > clocksource: Checking clocksource tsc synchronization from CPU 400 to CPUs 0,46,52,54,138,208,392,397.
> > clocksource: Switched to clocksource hpet
>
> The measured clocksource skew - the absolute difference between cs_nsec
> and wd_nsec - was 668 microseconds:
> > cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612
>
> The kernel (based on 5.14.21) used 200 microseconds for the
> uncertainty_margin of both the clocksource and watchdog, resulting in a
> threshold of 400 microseconds. The discrepancy is that the measured
> clocksource skew was evaluated against a threshold suited for watchdog
> intervals of roughly WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5 second.
> Both the cs_nsec and the wd_nsec value indicate that the actual watchdog
> interval was circa 14.5 seconds. Since the watchdog is executed in softirq
> context the expiration of the watchdog timer can get severely delayed on
> account of a ksoftirqd thread not getting to run in a timely manner.
> Surely, a system with such belated softirq execution is not working well
> and the scheduling issue should be looked into but the clocksource
> watchdog should be able to deal with it accordingly.
>
> The solution in this patch skips the current watchdog check if the
> watchdog interval exceeds 1.5 * WATCHDOG_INTERVAL. Considering the maximum
> watchdog interval of 1.5 * WATCHDOG_INTERVAL, the current default
> uncertainty margin (of the TSC and HPET) corresponds to a limit on
> clocksource skew of 333 ppm (microseconds of skew per second). To keep the
> limit imposed by NTP (500 microseconds of skew per second) for all
> possible watchdog intervals, the margins would have to be scaled so that
> the threshold value is proportional to the length of the actual watchdog
> interval.
>
> Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold")
IIUC, the issue could happen even before this commit? Though I do think
this is good stuff for stable kernel.
> Suggested-by: Feng Tang <feng.tang@intel.com>
> Signed-off-by: Jiri Wiesner <jwiesner@suse.de>
0Day robot reported some compiling issue. Other than that, it looks
good to me, thanks!
Reviewed-by: Feng Tang <feng.tang@intel.com>
>
> ---
> kernel/time/clocksource.c | 28 ++++++++++++++++++++++++++--
> 1 file changed, 26 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index c108ed8a9804..ac5cb0ff278b 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -98,7 +98,9 @@ static u64 suspend_start;
> /*
> * Interval: 0.5sec.
> */
> -#define WATCHDOG_INTERVAL (HZ >> 1)
> +#define WATCHDOG_INTERVAL (HZ >> 1)
> +#define WATCHDOG_INTR_MAX_NS ((WATCHDOG_INTERVAL + (WATCHDOG_INTERVAL >> 1))\
> + * NSEC_PER_SEC / HZ)
>
> /*
> * Threshold: 0.0312s, when doubled: 0.0625s.
> @@ -134,6 +136,7 @@ static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
> static DEFINE_SPINLOCK(watchdog_lock);
> static int watchdog_running;
> static atomic_t watchdog_reset_pending;
> +static int64_t watchdog_max_intr;
>
> static inline void clocksource_watchdog_lock(unsigned long *flags)
> {
> @@ -400,7 +403,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> {
> u64 csnow, wdnow, cslast, wdlast, delta;
> int next_cpu, reset_pending;
> - int64_t wd_nsec, cs_nsec;
> + int64_t wd_nsec, cs_nsec, interval;
> struct clocksource *cs;
> enum wd_read_status read_ret;
> unsigned long extra_wait = 0;
> @@ -470,6 +473,27 @@ static void clocksource_watchdog(struct timer_list *unused)
> if (atomic_read(&watchdog_reset_pending))
> continue;
>
> + /*
> + * The processing of timer softirqs can get delayed (usually
> + * on account of ksoftirqd not getting to run in a timely
> + * manner), which causes the watchdog interval to stretch.
> + * Some clocksources, e.g. acpi_pm, cannot tolerate
> + * watchdog intervals longer than a few seconds.
> + * Skew detection may fail for longer watchdog intervals
> + * on account of fixed margins being used.
> + */
> + interval = max(cs_nsec, wd_nsec);
> + if (unlikely(interval > WATCHDOG_INTR_MAX_NS)) {
> + if (system_state > SYSTEM_SCHEDULING &&
> + interval > 2 * watchdog_max_intr) {
> + watchdog_max_intr = interval;
> + pr_warn("Skipping watchdog check: cs_nsec: %lld wd_nsec: %lld\n",
> + cs_nsec, wd_nsec);
> + }
> + watchdog_timer.expires = jiffies;
> + continue;
> + }
> +
> /* Check the deviation from the watchdog clocksource. */
> md = cs->uncertainty_margin + watchdog->uncertainty_margin;
> if (abs(cs_nsec - wd_nsec) > md) {
> --
> 2.35.3
>
>
> --
> Jiri Wiesner
> SUSE Labs
next prev parent reply other threads:[~2024-01-04 6:04 UTC|newest]
Thread overview: 12+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-01-03 11:21 [PATCH] clocksource: Skip watchdog check for large watchdog intervals Jiri Wiesner
2024-01-03 22:08 ` Paul E. McKenney
2024-01-04 16:30 ` Jiri Wiesner
2024-01-04 19:19 ` Paul E. McKenney
2024-01-06 2:55 ` Feng Tang
2024-01-06 12:04 ` Paul E. McKenney
2024-01-04 0:46 ` kernel test robot
2024-01-04 0:57 ` kernel test robot
2024-01-04 5:55 ` Feng Tang [this message]
2024-01-04 16:48 ` Jiri Wiesner
2024-01-08 13:44 ` kernel test robot
2024-01-10 18:36 ` Jiri Wiesner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZZZISRg9YGyT2eQ+@feng-clx \
--to=feng.tang@intel.com \
--cc=jstultz@google.com \
--cc=jwiesner@suse.de \
--cc=linux-kernel@vger.kernel.org \
--cc=paulmck@kernel.org \
--cc=sboyd@kernel.org \
--cc=tglx@linutronix.de \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox