public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: linux-kernel@vger.kernel.org, stable@vger.kernel.org
Cc: Feng Tang <feng.tang@intel.com>, Waiman Long <longman@redhat.com>,
	John Stultz <jstultz@google.com>,
	Thomas Gleixner <tglx@linutronix.de>,
	Stephen Boyd <sboyd@kernel.org>,
	"Paul E . McKenney" <paulmck@kernel.org>,
	Sasha Levin <sashal@kernel.org>
Subject: [PATCH AUTOSEL 6.2 24/53] clocksource: Suspend the watchdog temporarily when high read latency detected
Date: Sun, 26 Feb 2023 09:44:16 -0500	[thread overview]
Message-ID: <20230226144446.824580-24-sashal@kernel.org> (raw)
In-Reply-To: <20230226144446.824580-1-sashal@kernel.org>

From: Feng Tang <feng.tang@intel.com>

[ Upstream commit b7082cdfc464bf9231300605d03eebf943dda307 ]

Bugs have been reported on 8 sockets x86 machines in which the TSC was
wrongly disabled when the system is under heavy workload.

 [ 818.380354] clocksource: timekeeping watchdog on CPU336: hpet wd-wd read-back delay of 1203520ns
 [ 818.436160] clocksource: wd-tsc-wd read-back delay of 181880ns, clock-skew test skipped!
 [ 819.402962] clocksource: timekeeping watchdog on CPU338: hpet wd-wd read-back delay of 324000ns
 [ 819.448036] clocksource: wd-tsc-wd read-back delay of 337240ns, clock-skew test skipped!
 [ 819.880863] clocksource: timekeeping watchdog on CPU339: hpet read-back delay of 150280ns, attempt 3, marking unstable
 [ 819.936243] tsc: Marking TSC unstable due to clocksource watchdog
 [ 820.068173] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
 [ 820.092382] sched_clock: Marking unstable (818769414384, 1195404998)
 [ 820.643627] clocksource: Checking clocksource tsc synchronization from CPU 267 to CPUs 0,4,25,70,126,430,557,564.
 [ 821.067990] clocksource: Switched to clocksource hpet

This can be reproduced by running memory intensive 'stream' tests,
or some of the stress-ng subcases such as 'ioport'.

The reason for these issues is the when system is under heavy load, the
read latency of the clocksources can be very high.  Even lightweight TSC
reads can show high latencies, and latencies are much worse for external
clocksources such as HPET or the APIC PM timer.  These latencies can
result in false-positive clocksource-unstable determinations.

These issues were initially reported by a customer running on a production
system, and this problem was reproduced on several generations of Xeon
servers, especially when running the stress-ng test.  These Xeon servers
were not production systems, but they did have the latest steppings
and firmware.

Given that the clocksource watchdog is a continual diagnostic check with
frequency of twice a second, there is no need to rush it when the system
is under heavy load.  Therefore, when high clocksource read latencies
are detected, suspend the watchdog timer for 5 minutes.

Signed-off-by: Feng Tang <feng.tang@intel.com>
Acked-by: Waiman Long <longman@redhat.com>
Cc: John Stultz <jstultz@google.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Stephen Boyd <sboyd@kernel.org>
Cc: Feng Tang <feng.tang@intel.com>
Signed-off-by: Paul E. McKenney <paulmck@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
 kernel/time/clocksource.c | 45 ++++++++++++++++++++++++++++-----------
 1 file changed, 32 insertions(+), 13 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 9cf32ccda715d..8cd74b89d5776 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -384,6 +384,15 @@ void clocksource_verify_percpu(struct clocksource *cs)
 }
 EXPORT_SYMBOL_GPL(clocksource_verify_percpu);
 
+static inline void clocksource_reset_watchdog(void)
+{
+	struct clocksource *cs;
+
+	list_for_each_entry(cs, &watchdog_list, wd_list)
+		cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
+}
+
+
 static void clocksource_watchdog(struct timer_list *unused)
 {
 	u64 csnow, wdnow, cslast, wdlast, delta;
@@ -391,6 +400,7 @@ static void clocksource_watchdog(struct timer_list *unused)
 	int64_t wd_nsec, cs_nsec;
 	struct clocksource *cs;
 	enum wd_read_status read_ret;
+	unsigned long extra_wait = 0;
 	u32 md;
 
 	spin_lock(&watchdog_lock);
@@ -410,13 +420,30 @@ static void clocksource_watchdog(struct timer_list *unused)
 
 		read_ret = cs_watchdog_read(cs, &csnow, &wdnow);
 
-		if (read_ret != WD_READ_SUCCESS) {
-			if (read_ret == WD_READ_UNSTABLE)
-				/* Clock readout unreliable, so give it up. */
-				__clocksource_unstable(cs);
+		if (read_ret == WD_READ_UNSTABLE) {
+			/* Clock readout unreliable, so give it up. */
+			__clocksource_unstable(cs);
 			continue;
 		}
 
+		/*
+		 * When WD_READ_SKIP is returned, it means the system is likely
+		 * under very heavy load, where the latency of reading
+		 * watchdog/clocksource is very big, and affect the accuracy of
+		 * watchdog check. So give system some space and suspend the
+		 * watchdog check for 5 minutes.
+		 */
+		if (read_ret == WD_READ_SKIP) {
+			/*
+			 * As the watchdog timer will be suspended, and
+			 * cs->last could keep unchanged for 5 minutes, reset
+			 * the counters.
+			 */
+			clocksource_reset_watchdog();
+			extra_wait = HZ * 300;
+			break;
+		}
+
 		/* Clocksource initialized ? */
 		if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
 		    atomic_read(&watchdog_reset_pending)) {
@@ -512,7 +539,7 @@ static void clocksource_watchdog(struct timer_list *unused)
 	 * pair clocksource_stop_watchdog() clocksource_start_watchdog().
 	 */
 	if (!timer_pending(&watchdog_timer)) {
-		watchdog_timer.expires += WATCHDOG_INTERVAL;
+		watchdog_timer.expires += WATCHDOG_INTERVAL + extra_wait;
 		add_timer_on(&watchdog_timer, next_cpu);
 	}
 out:
@@ -537,14 +564,6 @@ static inline void clocksource_stop_watchdog(void)
 	watchdog_running = 0;
 }
 
-static inline void clocksource_reset_watchdog(void)
-{
-	struct clocksource *cs;
-
-	list_for_each_entry(cs, &watchdog_list, wd_list)
-		cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
-}
-
 static void clocksource_resume_watchdog(void)
 {
 	atomic_inc(&watchdog_reset_pending);
-- 
2.39.0


  parent reply	other threads:[~2023-02-26 14:47 UTC|newest]

Thread overview: 59+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-02-26 14:43 [PATCH AUTOSEL 6.2 01/53] wifi: ath9k: Fix use-after-free in ath9k_hif_usb_disconnect() Sasha Levin
2023-02-26 14:43 ` [PATCH AUTOSEL 6.2 02/53] wifi: ath11k: fix monitor mode bringup crash Sasha Levin
2023-02-26 14:43 ` [PATCH AUTOSEL 6.2 03/53] wifi: brcmfmac: Fix potential stack-out-of-bounds in brcmf_c_preinit_dcmds() Sasha Levin
2023-02-26 14:43 ` [PATCH AUTOSEL 6.2 04/53] rcu: Make RCU_LOCKDEP_WARN() avoid early lockdep checks Sasha Levin
2023-02-26 14:43 ` [PATCH AUTOSEL 6.2 05/53] rcu: Suppress smp_processor_id() complaint in synchronize_rcu_expedited_wait() Sasha Levin
2023-02-26 14:43 ` [PATCH AUTOSEL 6.2 06/53] srcu: Delegate work to the boot cpu if using SRCU_SIZE_SMALL Sasha Levin
2023-02-26 14:43 ` [PATCH AUTOSEL 6.2 07/53] rcu-tasks: Make rude RCU-Tasks work well with CPU hotplug Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 08/53] rcu-tasks: Handle queue-shrink/callback-enqueue race condition Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 09/53] wifi: ath11k: debugfs: fix to work with multiple PCI devices Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 10/53] thermal: intel: Fix unsigned comparison with less than zero Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 11/53] timers: Prevent union confusion from unexpected restart_syscall() Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 12/53] x86/bugs: Reset speculation control settings on init Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 13/53] bpftool: Always disable stack protection for BPF objects Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 14/53] wifi: brcmfmac: ensure CLM version is null-terminated to prevent stack-out-of-bounds Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 15/53] wifi: rtw89: fix assignation of TX BD RAM table Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 16/53] wifi: mt7601u: fix an integer underflow Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 17/53] inet: fix fast path in __inet_hash_connect() Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 18/53] ice: restrict PTP HW clock freq adjustments to 100, 000, 000 PPB Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 19/53] ice: add missing checks for PF vsi type Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 20/53] Compiler attributes: GCC cold function alignment workarounds Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 21/53] ACPI: Don't build ACPICA with '-Os' Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 22/53] bpf, docs: Fix modulo zero, division by zero, overflow, and underflow Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 23/53] thermal: intel: intel_pch: Add support for Wellsburg PCH Sasha Levin
2023-02-26 14:44 ` Sasha Levin [this message]
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 25/53] crypto: hisilicon: Wipe entire pool on error Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 26/53] netpoll: Remove 4s sleep during carrier detection Sasha Levin
2023-02-27 18:15   ` Jakub Kicinski
2023-03-01  2:10     ` Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 27/53] net: bcmgenet: Add a check for oversized packets Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 28/53] m68k: Check syscall_trace_enter() return code Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 29/53] s390/mm,ptdump: avoid Kasan vs Memcpy Real markers swapping Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 30/53] netfilter: nf_tables: NULL pointer dereference in nf_tables_updobj() Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 31/53] neighbor: fix proxy_delay usage when it is zero Sasha Levin
2023-02-27 18:15   ` Jakub Kicinski
2023-03-01 14:13     ` Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 32/53] can: isotp: check CAN address family in isotp_bind() Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 33/53] gcc-plugins: drop -std=gnu++11 to fix GCC 13 build Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 34/53] tools/power/x86/intel-speed-select: Add Emerald Rapid quirk Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 35/53] platform/x86: dell-ddv: Add support for interface version 3 Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 36/53] wifi: mt76: dma: free rx_head in mt76_dma_rx_cleanup Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 37/53] ACPI: video: Fix Lenovo Ideapad Z570 DMI match Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 38/53] net/mlx5: fw_tracer: Fix debug print Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 39/53] coda: Avoid partial allocation of sig_inputArgs Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 40/53] uaccess: Add minimum bounds check on kernel buffer size Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 41/53] s390/idle: mark arch_cpu_idle() noinstr Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 42/53] time/debug: Fix memory leak with using debugfs_lookup() Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 43/53] PM: domains: fix " Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 44/53] PM: EM: " Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 45/53] Bluetooth: Fix issue with Actions Semi ATS2851 based devices Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 46/53] Bluetooth: btusb: Add new PID/VID 0489:e0f2 for MT7921 Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 47/53] Bluetooth: btusb: Add VID:PID 13d3:3529 for Realtek RTL8821CE Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 48/53] wifi: rtw89: debug: avoid invalid access on RTW89_DBG_SEL_MAC_30 Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 49/53] hv_netvsc: Check status in SEND_RNDIS_PKT completion message Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 50/53] s390/kfence: fix page fault reporting Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 51/53] devlink: health: Fix nla_nest_end in error flow Sasha Levin
2023-02-27 18:13   ` Jakub Kicinski
2023-03-01 14:13     ` Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 52/53] devlink: Fix TP_STRUCT_entry in trace of devlink health report Sasha Levin
2023-02-26 14:44 ` [PATCH AUTOSEL 6.2 53/53] scm: add user copy checks to put_cmsg() Sasha Levin

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230226144446.824580-24-sashal@kernel.org \
    --to=sashal@kernel.org \
    --cc=feng.tang@intel.com \
    --cc=jstultz@google.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=longman@redhat.com \
    --cc=paulmck@kernel.org \
    --cc=sboyd@kernel.org \
    --cc=stable@vger.kernel.org \
    --cc=tglx@linutronix.de \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox