All of lore.kernel.org
 help / color / mirror / Atom feed
From: Feng Tang <feng.tang@intel.com>
To: Jiri Wiesner <jwiesner@suse.de>
Cc: <linux-kernel@vger.kernel.org>, John Stultz <jstultz@google.com>,
	"Thomas Gleixner" <tglx@linutronix.de>,
	Stephen Boyd <sboyd@kernel.org>,
	"Paul E. McKenney" <paulmck@kernel.org>
Subject: Re: [PATCH] clocksource: Skip watchdog check for large watchdog intervals
Date: Thu, 4 Jan 2024 13:55:21 +0800	[thread overview]
Message-ID: <ZZZISRg9YGyT2eQ+@feng-clx> (raw)
In-Reply-To: <20240103112113.GA6108@incl>

On Wed, Jan 03, 2024 at 12:21:13PM +0100, Jiri Wiesner wrote:
> There have been reports of the watchdog marking clocksources unstable on
> machines with 8 NUMA nodes:
> > clocksource: timekeeping watchdog on CPU373: Marking clocksource 'tsc' as unstable because the skew is too large:
> > clocksource:   'hpet' wd_nsec: 14523447520 wd_now: 5a749706 wd_last: 45adf1e0 mask: ffffffff
> > clocksource:   'tsc' cs_nsec: 14524115132 cs_now: 515ce2c5a96caa cs_last: 515cd9a9d83918 mask: ffffffffffffffff
> > clocksource:   'tsc' is current clocksource.
> > tsc: Marking TSC unstable due to clocksource watchdog
> > TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> > sched_clock: Marking unstable (1950347883333462, 79649632569)<-(1950428279338308, -745776594)
> > clocksource: Checking clocksource tsc synchronization from CPU 400 to CPUs 0,46,52,54,138,208,392,397.
> > clocksource: Switched to clocksource hpet
> 
> The measured clocksource skew - the absolute difference between cs_nsec
> and wd_nsec - was 668 microseconds:
> > cs_nsec - wd_nsec = 14524115132 - 14523447520 = 667612
> 
> The kernel (based on 5.14.21) used 200 microseconds for the
> uncertainty_margin of both the clocksource and watchdog, resulting in a
> threshold of 400 microseconds.  The discrepancy is that the measured
> clocksource skew was evaluated against a threshold suited for watchdog
> intervals of roughly WATCHDOG_INTERVAL, i.e. HZ >> 1, which is 0.5 second.
> Both the cs_nsec and the wd_nsec value indicate that the actual watchdog
> interval was circa 14.5 seconds. Since the watchdog is executed in softirq
> context the expiration of the watchdog timer can get severely delayed on
> account of a ksoftirqd thread not getting to run in a timely manner.
> Surely, a system with such belated softirq execution is not working well
> and the scheduling issue should be looked into but the clocksource
> watchdog should be able to deal with it accordingly.
> 
> The solution in this patch skips the current watchdog check if the
> watchdog interval exceeds 1.5 * WATCHDOG_INTERVAL. Considering the maximum
> watchdog interval of 1.5 * WATCHDOG_INTERVAL, the current default
> uncertainty margin (of the TSC and HPET) corresponds to a limit on
> clocksource skew of 333 ppm (microseconds of skew per second). To keep the
> limit imposed by NTP (500 microseconds of skew per second) for all
> possible watchdog intervals, the margins would have to be scaled so that
> the threshold value is proportional to the length of the actual watchdog
> interval.
> 
> Fixes: 2e27e793e280 ("clocksource: Reduce clocksource-skew threshold")

IIUC, the issue could happen even before this commit? Though I do think 
this is good stuff for stable kernel.

> Suggested-by: Feng Tang <feng.tang@intel.com>
> Signed-off-by: Jiri Wiesner <jwiesner@suse.de>

0Day robot reported some compiling issue. Other than that, it looks
good to me, thanks!

Reviewed-by: Feng Tang <feng.tang@intel.com>

>
> ---
>  kernel/time/clocksource.c | 28 ++++++++++++++++++++++++++--
>  1 file changed, 26 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index c108ed8a9804..ac5cb0ff278b 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -98,7 +98,9 @@ static u64 suspend_start;
>  /*
>   * Interval: 0.5sec.
>   */
> -#define WATCHDOG_INTERVAL (HZ >> 1)
> +#define WATCHDOG_INTERVAL	(HZ >> 1)
> +#define WATCHDOG_INTR_MAX_NS	((WATCHDOG_INTERVAL + (WATCHDOG_INTERVAL >> 1))\
> +				 * NSEC_PER_SEC / HZ)
>  
>  /*
>   * Threshold: 0.0312s, when doubled: 0.0625s.
> @@ -134,6 +136,7 @@ static DECLARE_WORK(watchdog_work, clocksource_watchdog_work);
>  static DEFINE_SPINLOCK(watchdog_lock);
>  static int watchdog_running;
>  static atomic_t watchdog_reset_pending;
> +static int64_t watchdog_max_intr;
>  
>  static inline void clocksource_watchdog_lock(unsigned long *flags)
>  {
> @@ -400,7 +403,7 @@ static void clocksource_watchdog(struct timer_list *unused)
>  {
>  	u64 csnow, wdnow, cslast, wdlast, delta;
>  	int next_cpu, reset_pending;
> -	int64_t wd_nsec, cs_nsec;
> +	int64_t wd_nsec, cs_nsec, interval;
>  	struct clocksource *cs;
>  	enum wd_read_status read_ret;
>  	unsigned long extra_wait = 0;
> @@ -470,6 +473,27 @@ static void clocksource_watchdog(struct timer_list *unused)
>  		if (atomic_read(&watchdog_reset_pending))
>  			continue;
>  
> +		/*
> +		 * The processing of timer softirqs can get delayed (usually
> +		 * on account of ksoftirqd not getting to run in a timely
> +		 * manner), which causes the watchdog interval to stretch.
> +		 * Some clocksources, e.g. acpi_pm, cannot tolerate
> +		 * watchdog intervals longer than a few seconds.
> +		 * Skew detection may fail for longer watchdog intervals
> +		 * on account of fixed margins being used.
> +		 */
> +		interval = max(cs_nsec, wd_nsec);
> +		if (unlikely(interval > WATCHDOG_INTR_MAX_NS)) {
> +			if (system_state > SYSTEM_SCHEDULING &&
> +			    interval > 2 * watchdog_max_intr) {
> +				watchdog_max_intr = interval;
> +				pr_warn("Skipping watchdog check: cs_nsec: %lld wd_nsec: %lld\n",
> +					cs_nsec, wd_nsec);
> +			}
> +			watchdog_timer.expires = jiffies;
> +			continue;
> +		}
> +
>  		/* Check the deviation from the watchdog clocksource. */
>  		md = cs->uncertainty_margin + watchdog->uncertainty_margin;
>  		if (abs(cs_nsec - wd_nsec) > md) {
> -- 
> 2.35.3
> 
> 
> -- 
> Jiri Wiesner
> SUSE Labs

  parent reply	other threads:[~2024-01-04  6:04 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-03 11:21 [PATCH] clocksource: Skip watchdog check for large watchdog intervals Jiri Wiesner
2024-01-03 22:08 ` Paul E. McKenney
2024-01-04 16:30   ` Jiri Wiesner
2024-01-04 19:19     ` Paul E. McKenney
2024-01-06  2:55       ` Feng Tang
2024-01-06 12:04         ` Paul E. McKenney
2024-01-04  0:46 ` kernel test robot
2024-01-04  0:57 ` kernel test robot
2024-01-04  5:55 ` Feng Tang [this message]
2024-01-04 16:48   ` Jiri Wiesner
2024-01-08 13:44 ` kernel test robot
2024-01-10 18:36   ` Jiri Wiesner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZZZISRg9YGyT2eQ+@feng-clx \
    --to=feng.tang@intel.com \
    --cc=jstultz@google.com \
    --cc=jwiesner@suse.de \
    --cc=linux-kernel@vger.kernel.org \
    --cc=paulmck@kernel.org \
    --cc=sboyd@kernel.org \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.