From: Feng Tang <feng.tang@intel.com>
To: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Waiman Long <longman@redhat.com>,
John Stultz <jstultz@google.com>,
"Thomas Gleixner" <tglx@linutronix.de>,
Stephen Boyd <sboyd@kernel.org>, <x86@kernel.org>,
Peter Zijlstra <peterz@infradead.org>,
<linux-kernel@vger.kernel.org>, Tim Chen <tim.c.chen@intel.com>
Subject: Re: [RFC PATCH] clocksource: Suspend the watchdog temporarily when high read lantency detected
Date: Wed, 21 Dec 2022 09:01:46 +0800 [thread overview]
Message-ID: <Y6Ja+kYQAi4pppV6@feng-clx> (raw)
In-Reply-To: <20221220183400.GY4001@paulmck-ThinkPad-P17-Gen-1>
Using correct email address of John Stultz.
On Tue, Dec 20, 2022 at 10:34:00AM -0800, Paul E. McKenney wrote:
> On Tue, Dec 20, 2022 at 11:11:08AM -0500, Waiman Long wrote:
> > On 12/20/22 03:25, Feng Tang wrote:
> > > There were bug reported on 8 sockets x86 machines that TSC was wrongly
> > > disabled when system is under heavy workload.
> > >
> > > [ 818.380354] clocksource: timekeeping watchdog on CPU336: hpet wd-wd read-back delay of 1203520ns
> > > [ 818.436160] clocksource: wd-tsc-wd read-back delay of 181880ns, clock-skew test skipped!
> > > [ 819.402962] clocksource: timekeeping watchdog on CPU338: hpet wd-wd read-back delay of 324000ns
> > > [ 819.448036] clocksource: wd-tsc-wd read-back delay of 337240ns, clock-skew test skipped!
> > > [ 819.880863] clocksource: timekeeping watchdog on CPU339: hpet read-back delay of 150280ns, attempt 3, marking unstable
> > > [ 819.936243] tsc: Marking TSC unstable due to clocksource watchdog
> > > [ 820.068173] TSC found unstable after boot, most likely due to broken BIOS. Use 'tsc=unstable'.
> > > [ 820.092382] sched_clock: Marking unstable (818769414384, 1195404998)
> > > [ 820.643627] clocksource: Checking clocksource tsc synchronization from CPU 267 to CPUs 0,4,25,70,126,430,557,564.
> > > [ 821.067990] clocksource: Switched to clocksource hpet
> > >
> > > This can be reproduced when system is running memory intensive 'stream'
> > > test, or some stress-ng subcases like 'ioport'.
> > >
> > > The reason is when system is under heavy load, the read latency of
> > > clocksource can be very high, it can be seen even with lightweight
> > > TSC read, and is much worse on MMIO or IO port read based external
> > > clocksource. Causing the watchdog check to be inaccurate.
> > >
> > > As the clocksource watchdog is a lifetime check with frequency of
> > > twice a second, there is no need to rush doing it when the system
> > > is under heavy load and the clocksource read latency is very high,
> > > suspend the watchdog timer for 5 minutes.
> > >
> > > Signed-off-by: Feng Tang <feng.tang@intel.com>
> > > ---
> > > kernel/time/clocksource.c | 45 ++++++++++++++++++++++++++++-----------
> > > 1 file changed, 32 insertions(+), 13 deletions(-)
> > >
> > > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > > index 9cf32ccda715..8cd74b89d577 100644
> > > --- a/kernel/time/clocksource.c
> > > +++ b/kernel/time/clocksource.c
> > > @@ -384,6 +384,15 @@ void clocksource_verify_percpu(struct clocksource *cs)
> > > }
> > > EXPORT_SYMBOL_GPL(clocksource_verify_percpu);
> > > +static inline void clocksource_reset_watchdog(void)
> > > +{
> > > + struct clocksource *cs;
> > > +
> > > + list_for_each_entry(cs, &watchdog_list, wd_list)
> > > + cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
> > > +}
> > > +
> > > +
> > > static void clocksource_watchdog(struct timer_list *unused)
> > > {
> > > u64 csnow, wdnow, cslast, wdlast, delta;
> > > @@ -391,6 +400,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > int64_t wd_nsec, cs_nsec;
> > > struct clocksource *cs;
> > > enum wd_read_status read_ret;
> > > + unsigned long extra_wait = 0;
> > > u32 md;
> > > spin_lock(&watchdog_lock);
> > > @@ -410,13 +420,30 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > read_ret = cs_watchdog_read(cs, &csnow, &wdnow);
> > > - if (read_ret != WD_READ_SUCCESS) {
> > > - if (read_ret == WD_READ_UNSTABLE)
> > > - /* Clock readout unreliable, so give it up. */
> > > - __clocksource_unstable(cs);
> > > + if (read_ret == WD_READ_UNSTABLE) {
> > > + /* Clock readout unreliable, so give it up. */
> > > + __clocksource_unstable(cs);
> > > continue;
> > > }
> > > + /*
> > > + * When WD_READ_SKIP is returned, it means the system is likely
> > > + * under very heavy load, where the latency of reading
> > > + * watchdog/clocksource is very big, and affect the accuracy of
> > > + * watchdog check. So give system some space and suspend the
> > > + * watchdog check for 5 minutes.
> > > + */
> > > + if (read_ret == WD_READ_SKIP) {
> > > + /*
> > > + * As the watchdog timer will be suspended, and
> > > + * cs->last could keep unchanged for 5 minutes, reset
> > > + * the counters.
> > > + */
> > > + clocksource_reset_watchdog();
> > > + extra_wait = HZ * 300;
> > > + break;
> > > + }
> > > +
> > > /* Clocksource initialized ? */
> > > if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
> > > atomic_read(&watchdog_reset_pending)) {
> > > @@ -512,7 +539,7 @@ static void clocksource_watchdog(struct timer_list *unused)
> > > * pair clocksource_stop_watchdog() clocksource_start_watchdog().
> > > */
> > > if (!timer_pending(&watchdog_timer)) {
> > > - watchdog_timer.expires += WATCHDOG_INTERVAL;
> > > + watchdog_timer.expires += WATCHDOG_INTERVAL + extra_wait;
> > > add_timer_on(&watchdog_timer, next_cpu);
> > > }
> > > out:
> > > @@ -537,14 +564,6 @@ static inline void clocksource_stop_watchdog(void)
> > > watchdog_running = 0;
> > > }
> > > -static inline void clocksource_reset_watchdog(void)
> > > -{
> > > - struct clocksource *cs;
> > > -
> > > - list_for_each_entry(cs, &watchdog_list, wd_list)
> > > - cs->flags &= ~CLOCK_SOURCE_WATCHDOG;
> > > -}
> > > -
> > > static void clocksource_resume_watchdog(void)
> > > {
> > > atomic_inc(&watchdog_reset_pending);
> >
> > It looks reasonable to me. Thanks for the patch.
> >
> > Acked-by: Waiman Long <longman@redhat.com>
>
> Queued, thank you both!
Thanks for reviewing and queueing!
> If you would like this to go in some other way:
>
> Acked-by: Paul E. McKenney <paulmck@kernel.org>
>
> And while I am remembering it... Any objections to reversing the role of
> TSC and the other timers on systems where TSC is believed to be accurate?
> So that if there is clocksource skew, HPET is marked unstable rather than
> TSC?
For the bug in commit log, I think it's the 8 sockets system with
hundreds of CPUs causing the big latency, while the HPET itself may
not be broken, and if we switched to ACPI PM_TIMER as watchdog, we
could see similar big latency.
I used to only see this issue with stress tool like stress-ng, but
seems with larger and larger system, even the momory intensive load
can easily trigger this.
> This would preserve the diagnostics without hammering performance
> when skew is detected. (Switching from TSC to HPET hammers performance
> enough that our automation usually notices and reboots the system.)
Yes, switching to HPET is a disaster for performance, we've seen
from 30% to 90% drop in different benchmarks.
Thanks,
Feng
> Thanx, Paul
next prev parent reply other threads:[~2022-12-21 1:05 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2022-12-20 8:25 [RFC PATCH] clocksource: Suspend the watchdog temporarily when high read lantency detected Feng Tang
2022-12-20 16:11 ` Waiman Long
2022-12-20 18:34 ` Paul E. McKenney
2022-12-21 1:01 ` Feng Tang [this message]
2022-12-21 3:26 ` Waiman Long
2022-12-22 0:40 ` Paul E. McKenney
2022-12-22 3:39 ` Waiman Long
2022-12-22 5:55 ` Paul E. McKenney
2022-12-22 6:00 ` Feng Tang
2022-12-22 6:14 ` Paul E. McKenney
2022-12-22 6:37 ` Feng Tang
2022-12-22 18:24 ` Paul E. McKenney
2022-12-22 21:42 ` Paul E. McKenney
2022-12-22 23:28 ` Paul E. McKenney
2022-12-23 2:09 ` Feng Tang
2022-12-23 3:37 ` Paul E. McKenney
[not found] ` <ad71008d-4acc-d211-dc19-c33bb25ff42c@redhat.com>
2022-12-23 4:14 ` Feng Tang
2022-12-27 18:38 ` Paul E. McKenney
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=Y6Ja+kYQAi4pppV6@feng-clx \
--to=feng.tang@intel.com \
--cc=jstultz@google.com \
--cc=linux-kernel@vger.kernel.org \
--cc=longman@redhat.com \
--cc=paulmck@kernel.org \
--cc=peterz@infradead.org \
--cc=sboyd@kernel.org \
--cc=tglx@linutronix.de \
--cc=tim.c.chen@intel.com \
--cc=x86@kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.