public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed
* [RFC PATCH v3] Fix: clocksource watchdog marks TSC unstable on guest VM
@ 2015-09-08 14:19 Mathieu Desnoyers
  2015-09-08 15:08 ` Thomas Gleixner
  0 siblings, 1 reply; 5+ messages in thread
From: Mathieu Desnoyers @ 2015-09-08 14:19 UTC (permalink / raw)
  To: Thomas Gleixner; +Cc: linux-kernel, Mathieu Desnoyers, Daniel Lezcano

I have been getting those warnings across a range of guest
kernels in my development virtual machines. The host is a
3.13 Ubuntu kernel. The latest guest on which I reproduced
this is a 4.2 kernel (akpm's tree).

[  126.902240] clocksource: timekeeping watchdog: Marking clocksource 'tsc' as unstable because the skew is too large:
[  126.902240] clocksource:                       'hpet' wd_now: f948c267 wd_last: f5edb97c mask: ffffffff
[  126.967563] clocksource:                       'tsc' cs_now: 48f04ee060 cs_last: 48a8c73ed0 mask: ffffffffffffffff

Relevant info from the guest kernel dmesg:

[    2.197070] HPET: 3 timers in total, 0 timers will be used for per-cpu timer
[    2.198021] hpet0: at MMIO 0xfed00000, IRQs 2, 8, 0
[    2.199557] hpet0: 3 comparators, 64-bit 100.000000 MHz counter

[    3.393185] tsc: Refined TSC clocksource calibration: 2400.037 MHz
[    3.396552] clocksource: tsc: mask: 0xffffffffffffffff max_cycles: 0x22985a87e67, max_idle_ns: 440795315743 ns

It appears that the clocksource watchdog does:

        local_irq_disable();
        csnow = cs->read(cs);
        wdnow = watchdog->read(watchdog);
        local_irq_enable();

If the host kernel preempts the guest between reading the
TSC and the watchdog, the resulting delta may go well over
the watchdog threshold, which is currently 62ms.

Correct this issue by reading "wdnow" before and after reading
TSC, and retry if the delta between the two reads exceeds
WATCHDOG_THRESHOLD.

Introduce WATCHDOG_RETRY to bound the number of retry (in the
unlikely event of a bogus clock source for wdnow). If the
number of retry has been reached, disable the watchdog timer.

Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
CC: Daniel Lezcano <daniel.lezcano@linaro.org>
CC: Thomas Gleixner <tglx@linutronix.de>

--

Changes since v1:
- Add missing \n at the end of pr_warn().
Changes since v2:
- Disable watchdog timer if WATCHDOG_RETRY is reached.
---
 kernel/time/clocksource.c | 192 ++++++++++++++++++++++++++++------------------
 1 file changed, 117 insertions(+), 75 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 174c594..a9d4428 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -126,6 +126,12 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating);
  */
 #define WATCHDOG_INTERVAL (HZ >> 1)
 #define WATCHDOG_THRESHOLD (NSEC_PER_SEC >> 4)
+/*
+ * When two back-to-back reads of the watchdog clock are too far
+ * apart, limit the number of retry allowed before we disable the
+ * watchdog timer.
+ */
+#define WATCHDOG_RETRY 20
 
 static void clocksource_watchdog_work(struct work_struct *work)
 {
@@ -166,95 +172,131 @@ void clocksource_mark_unstable(struct clocksource *cs)
 	spin_unlock_irqrestore(&watchdog_lock, flags);
 }
 
-static void clocksource_watchdog(unsigned long data)
+static int check_cs_wd(struct clocksource *cs)
 {
-	struct clocksource *cs;
-	cycle_t csnow, wdnow, cslast, wdlast, delta;
-	int64_t wd_nsec, cs_nsec;
-	int next_cpu, reset_pending;
-
-	spin_lock(&watchdog_lock);
-	if (!watchdog_running)
-		goto out;
-
-	reset_pending = atomic_read(&watchdog_reset_pending);
+	cycle_t csnow, wdnow[2], cslast, wdlast, delta;
+	int64_t wd_nsec[2], cs_nsec;
+	int retry_count = WATCHDOG_RETRY;
+
+restart:
+	/* Clocksource already marked unstable? */
+	if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
+		if (finished_booting)
+			schedule_work(&watchdog_work);
+		return 0;
+	}
 
-	list_for_each_entry(cs, &watchdog_list, wd_list) {
+	local_irq_disable();
+	wdnow[0] = watchdog->read(watchdog);
+	csnow = cs->read(cs);
+	wdnow[1] = watchdog->read(watchdog);
+	local_irq_enable();
+
+	/* Clocksource initialized ? */
+	if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
+	    atomic_read(&watchdog_reset_pending)) {
+		cs->flags |= CLOCK_SOURCE_WATCHDOG;
+		cs->wd_last = wdnow[1];
+		cs->cs_last = csnow;
+		return 0;
+	}
 
-		/* Clocksource already marked unstable? */
-		if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
-			if (finished_booting)
-				schedule_work(&watchdog_work);
-			continue;
-		}
+	delta = clocksource_delta(wdnow[0], cs->wd_last, watchdog->mask);
+	wd_nsec[0] = clocksource_cyc2ns(delta, watchdog->mult,
+				     watchdog->shift);
+	delta = clocksource_delta(wdnow[1], cs->wd_last, watchdog->mask);
+	wd_nsec[1] = clocksource_cyc2ns(delta, watchdog->mult,
+				     watchdog->shift);
 
-		local_irq_disable();
-		csnow = cs->read(cs);
-		wdnow = watchdog->read(watchdog);
-		local_irq_enable();
-
-		/* Clocksource initialized ? */
-		if (!(cs->flags & CLOCK_SOURCE_WATCHDOG) ||
-		    atomic_read(&watchdog_reset_pending)) {
-			cs->flags |= CLOCK_SOURCE_WATCHDOG;
-			cs->wd_last = wdnow;
-			cs->cs_last = csnow;
-			continue;
+	if ((abs(wd_nsec[1] - wd_nsec[0]) > WATCHDOG_THRESHOLD)) {
+		/*
+		 * Got bogus reads from the watchdog clocksource.
+		 * This can be caused by SMI, MCE, NMI, and preemption of guest
+		 * kernel by host.
+		 */
+		pr_warn("Bogus reads from watchdog clocksource: wdnow[0]: %llx wdnow[1]: %llx\n",
+			wdnow[0], wdnow[1]);
+		pr_warn("              delta: %llu nsec\n",
+			(unsigned long long) abs(wd_nsec[1] - wd_nsec[0]));
+		if (retry_count-- > 0) {
+			goto restart;
+		} else {
+			pr_warn("Disabling watchdog due to bogus clocksource\n");
+			return -1;
 		}
+	}
+	delta = clocksource_delta(csnow, cs->cs_last, cs->mask);
+	cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
+	wdlast = cs->wd_last; /* save these in case we print them */
+	cslast = cs->cs_last;
+	cs->cs_last = csnow;
+	cs->wd_last = wdnow[1];
+
+	if (atomic_read(&watchdog_reset_pending))
+		return 0;
+
+	/* Check the deviation from the watchdog clocksource. */
+	if ((abs(cs_nsec - wd_nsec[1]) > WATCHDOG_THRESHOLD)) {
+		pr_warn("timekeeping watchdog: Marking clocksource '%s' as unstable because the skew is too large:\n",
+			cs->name);
+		pr_warn("                      '%s' wd_now: %llx wd_last: %llx mask: %llx\n",
+			watchdog->name, wdnow[1], wdlast, watchdog->mask);
+		pr_warn("                      '%s' cs_now: %llx cs_last: %llx mask: %llx\n",
+			cs->name, csnow, cslast, cs->mask);
+		__clocksource_unstable(cs);
+		return 0;
+	}
 
-		delta = clocksource_delta(wdnow, cs->wd_last, watchdog->mask);
-		wd_nsec = clocksource_cyc2ns(delta, watchdog->mult,
-					     watchdog->shift);
-
-		delta = clocksource_delta(csnow, cs->cs_last, cs->mask);
-		cs_nsec = clocksource_cyc2ns(delta, cs->mult, cs->shift);
-		wdlast = cs->wd_last; /* save these in case we print them */
-		cslast = cs->cs_last;
-		cs->cs_last = csnow;
-		cs->wd_last = wdnow;
+	if (!(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES) &&
+	    (cs->flags & CLOCK_SOURCE_IS_CONTINUOUS) &&
+	    (watchdog->flags & CLOCK_SOURCE_IS_CONTINUOUS)) {
+		/* Mark it valid for high-res. */
+		cs->flags |= CLOCK_SOURCE_VALID_FOR_HRES;
 
-		if (atomic_read(&watchdog_reset_pending))
-			continue;
+		/*
+		 * clocksource_done_booting() will sort it if
+		 * finished_booting is not set yet.
+		 */
+		if (!finished_booting)
+			return 0;
 
-		/* Check the deviation from the watchdog clocksource. */
-		if ((abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD)) {
-			pr_warn("timekeeping watchdog: Marking clocksource '%s' as unstable because the skew is too large:\n",
-				cs->name);
-			pr_warn("                      '%s' wd_now: %llx wd_last: %llx mask: %llx\n",
-				watchdog->name, wdnow, wdlast, watchdog->mask);
-			pr_warn("                      '%s' cs_now: %llx cs_last: %llx mask: %llx\n",
-				cs->name, csnow, cslast, cs->mask);
-			__clocksource_unstable(cs);
-			continue;
+		/*
+		 * If this is not the current clocksource let
+		 * the watchdog thread reselect it. Due to the
+		 * change to high res this clocksource might
+		 * be preferred now. If it is the current
+		 * clocksource let the tick code know about
+		 * that change.
+		 */
+		if (cs != curr_clocksource) {
+			cs->flags |= CLOCK_SOURCE_RESELECT;
+			schedule_work(&watchdog_work);
+		} else {
+			tick_clock_notify();
 		}
+	}
+	return 0;
+}
 
-		if (!(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES) &&
-		    (cs->flags & CLOCK_SOURCE_IS_CONTINUOUS) &&
-		    (watchdog->flags & CLOCK_SOURCE_IS_CONTINUOUS)) {
-			/* Mark it valid for high-res. */
-			cs->flags |= CLOCK_SOURCE_VALID_FOR_HRES;
+static void clocksource_watchdog(unsigned long data)
+{
+	struct clocksource *cs;
+	int next_cpu, reset_pending;
 
-			/*
-			 * clocksource_done_booting() will sort it if
-			 * finished_booting is not set yet.
-			 */
-			if (!finished_booting)
-				continue;
+	spin_lock(&watchdog_lock);
+	if (!watchdog_running)
+		goto out;
 
+	reset_pending = atomic_read(&watchdog_reset_pending);
+
+	list_for_each_entry(cs, &watchdog_list, wd_list) {
+		if (check_cs_wd(cs) < 0) {
 			/*
-			 * If this is not the current clocksource let
-			 * the watchdog thread reselect it. Due to the
-			 * change to high res this clocksource might
-			 * be preferred now. If it is the current
-			 * clocksource let the tick code know about
-			 * that change.
+			 * Disable watchdog because its reference
+			 * clocksource is bogus.
 			 */
-			if (cs != curr_clocksource) {
-				cs->flags |= CLOCK_SOURCE_RESELECT;
-				schedule_work(&watchdog_work);
-			} else {
-				tick_clock_notify();
-			}
+			watchdog_running = 0;
+			goto out;
 		}
 	}
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH v3] Fix: clocksource watchdog marks TSC unstable on guest VM
  2015-09-08 14:19 [RFC PATCH v3] Fix: clocksource watchdog marks TSC unstable on guest VM Mathieu Desnoyers
@ 2015-09-08 15:08 ` Thomas Gleixner
  2015-09-09  1:03   ` Shaohua Li
  0 siblings, 1 reply; 5+ messages in thread
From: Thomas Gleixner @ 2015-09-08 15:08 UTC (permalink / raw)
  To: Mathieu Desnoyers
  Cc: LKML, Daniel Lezcano, John Stultz, Peter Zijlstra, Ingo Molnar,
	Gleb Natapov, Paolo Bonzini, Shaohua Li

On Tue, 8 Sep 2015, Mathieu Desnoyers wrote:
> Introduce WATCHDOG_RETRY to bound the number of retry (in the
> unlikely event of a bogus clock source for wdnow). If the
> number of retry has been reached, disable the watchdog timer.

This does not make any sense at all. Why would the clocksource be
bogus? I rather say, that the whole idea of trying to watchdog the TSC
in a VM is bogus.

There is no guarantee, that the readout of the TSC and the watchdog is
not disturbed by VM scheduling. Aside of that, the HPET emulation goes
all the way back into qemu user land and the implementation itself
does not make me more confident. Be happy that we don't support 64bit
HPET in the kernel as that emulation code is completely broken.

I really have to ask the question WHY we actually do this. There is
absolutely no point at all.

The TSC watchdog is there to catch a few issues with the TSC

   1) Frequency changing behind the kernels back

   2) SMM driven power safe state 'features' which cause the TSC to
      stop

   3) SMM fiddling with the TSC

   4) TSC drifting apart on multi socket systems

#1    Is completely irrelevant for KVM as all machines which have
      hardware virtualization have a frequency constant TSC

#2    Is irrelevant for KVM as well, because the machine does not go
      into deep idle states while the guest is running.

#3/#4 That are the only relevant issues, but there is absolutely no
      need to do this detection in the guest.

We already have a TSC sanity check on the host. So instead of adding
horrible hackery and magic detection, shutoff, retry mechanisms, we
can simply let the guest know, that the TSC has been buggered.

On paravirt kernels we can do that today and AFAICT the
pvclock/kvmclock code has enough magic to deal with all the oddities
already.

For non paravirt kernels which can read the TSC directly, we'd need a
way to transport that information. A simple mechanism would be to
query an emulated MSR from the watchdog which tells the guest the
state of affairs on the host side. That would be a sensible and
minimal invasive change on both host and guests.

Thoughts?

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH v3] Fix: clocksource watchdog marks TSC unstable on guest VM
  2015-09-08 15:08 ` Thomas Gleixner
@ 2015-09-09  1:03   ` Shaohua Li
  2015-09-09  9:51     ` Thomas Gleixner
  0 siblings, 1 reply; 5+ messages in thread
From: Shaohua Li @ 2015-09-09  1:03 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mathieu Desnoyers, LKML, Daniel Lezcano, John Stultz,
	Peter Zijlstra, Ingo Molnar, Gleb Natapov, Paolo Bonzini

On Tue, Sep 08, 2015 at 05:08:03PM +0200, Thomas Gleixner wrote:
> On Tue, 8 Sep 2015, Mathieu Desnoyers wrote:
> > Introduce WATCHDOG_RETRY to bound the number of retry (in the
> > unlikely event of a bogus clock source for wdnow). If the
> > number of retry has been reached, disable the watchdog timer.
> 
> This does not make any sense at all. Why would the clocksource be
> bogus? I rather say, that the whole idea of trying to watchdog the TSC
> in a VM is bogus.
> 
> There is no guarantee, that the readout of the TSC and the watchdog is
> not disturbed by VM scheduling. Aside of that, the HPET emulation goes
> all the way back into qemu user land and the implementation itself
> does not make me more confident. Be happy that we don't support 64bit
> HPET in the kernel as that emulation code is completely broken.
> 
> I really have to ask the question WHY we actually do this. There is
> absolutely no point at all.
> 
> The TSC watchdog is there to catch a few issues with the TSC
> 
>    1) Frequency changing behind the kernels back
> 
>    2) SMM driven power safe state 'features' which cause the TSC to
>       stop
> 
>    3) SMM fiddling with the TSC
> 
>    4) TSC drifting apart on multi socket systems
> 
> #1    Is completely irrelevant for KVM as all machines which have
>       hardware virtualization have a frequency constant TSC
> 
> #2    Is irrelevant for KVM as well, because the machine does not go
>       into deep idle states while the guest is running.
> 
> #3/#4 That are the only relevant issues, but there is absolutely no
>       need to do this detection in the guest.
> 
> We already have a TSC sanity check on the host. So instead of adding
> horrible hackery and magic detection, shutoff, retry mechanisms, we
> can simply let the guest know, that the TSC has been buggered.
> 
> On paravirt kernels we can do that today and AFAICT the
> pvclock/kvmclock code has enough magic to deal with all the oddities
> already.
> 
> For non paravirt kernels which can read the TSC directly, we'd need a
> way to transport that information. A simple mechanism would be to
> query an emulated MSR from the watchdog which tells the guest the
> state of affairs on the host side. That would be a sensible and
> minimal invasive change on both host and guests.

This will require every hypervisor supports the MSR, so not a solution
we can expect immediately.

I'm wondering why we can't just make the watchdog better to detect this
watchdog wrap. It can happen in physical machine as I said before, but I
can't find a simple way to trigger it, so it's not very convincing. But
the watchdog doesn't work for specific environment (for exmaple, a bogus
hardware doesn't responsond for some time) for sure, we shouldn't assume
the world is perfect.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH v3] Fix: clocksource watchdog marks TSC unstable on guest VM
  2015-09-09  1:03   ` Shaohua Li
@ 2015-09-09  9:51     ` Thomas Gleixner
  2015-09-09 15:43       ` Shaohua Li
  0 siblings, 1 reply; 5+ messages in thread
From: Thomas Gleixner @ 2015-09-09  9:51 UTC (permalink / raw)
  To: Shaohua Li
  Cc: Mathieu Desnoyers, LKML, Daniel Lezcano, John Stultz,
	Peter Zijlstra, Ingo Molnar, Gleb Natapov, Paolo Bonzini

On Tue, 8 Sep 2015, Shaohua Li wrote:
> On Tue, Sep 08, 2015 at 05:08:03PM +0200, Thomas Gleixner wrote:
> > For non paravirt kernels which can read the TSC directly, we'd need a
> > way to transport that information. A simple mechanism would be to
> > query an emulated MSR from the watchdog which tells the guest the
> > state of affairs on the host side. That would be a sensible and
> > minimal invasive change on both host and guests.
> 
> This will require every hypervisor supports the MSR, so not a solution
> we can expect immediately.

I know.
 
> I'm wondering why we can't just make the watchdog better to detect this
> watchdog wrap.

Again, I'm not opposed to make it better. I'm just trying to prevent
making the watchdog a total mess for no reason.

> It can happen in physical machine as I said before, but I
> can't find a simple way to trigger it, so it's not very convincing. But
> the watchdog doesn't work for specific environment (for exmaple, a bogus
> hardware doesn't responsond for some time) for sure, we shouldn't assume
> the world is perfect.

Sigh. If the damned hardware blocks long enough to wreckage the
watchdog then we have more serious problems than that.

Can you please stop this handwaving and provide some proper proof for
your arguments? I'm really tired of this.

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [RFC PATCH v3] Fix: clocksource watchdog marks TSC unstable on guest VM
  2015-09-09  9:51     ` Thomas Gleixner
@ 2015-09-09 15:43       ` Shaohua Li
  0 siblings, 0 replies; 5+ messages in thread
From: Shaohua Li @ 2015-09-09 15:43 UTC (permalink / raw)
  To: Thomas Gleixner
  Cc: Mathieu Desnoyers, LKML, Daniel Lezcano, John Stultz,
	Peter Zijlstra, Ingo Molnar, Gleb Natapov, Paolo Bonzini

On Wed, Sep 09, 2015 at 11:51:43AM +0200, Thomas Gleixner wrote:
> On Tue, 8 Sep 2015, Shaohua Li wrote:
> > On Tue, Sep 08, 2015 at 05:08:03PM +0200, Thomas Gleixner wrote:
> > > For non paravirt kernels which can read the TSC directly, we'd need a
> > > way to transport that information. A simple mechanism would be to
> > > query an emulated MSR from the watchdog which tells the guest the
> > > state of affairs on the host side. That would be a sensible and
> > > minimal invasive change on both host and guests.
> > 
> > This will require every hypervisor supports the MSR, so not a solution
> > we can expect immediately.
> 
> I know.
>  
> > I'm wondering why we can't just make the watchdog better to detect this
> > watchdog wrap.
> 
> Again, I'm not opposed to make it better. I'm just trying to prevent
> making the watchdog a total mess for no reason.
> 
> > It can happen in physical machine as I said before, but I
> > can't find a simple way to trigger it, so it's not very convincing. But
> > the watchdog doesn't work for specific environment (for exmaple, a bogus
> > hardware doesn't responsond for some time) for sure, we shouldn't assume
> > the world is perfect.
> 
> Sigh. If the damned hardware blocks long enough to wreckage the
> watchdog then we have more serious problems than that.

There is difference. If hardware blocks, we can choose reset the
hardware or we can just ignore it if it's a serial console or netconsole
(these are what happend in our side) for example.  These impact the
system very little.  But if HPET is the clocksource, the performance of
the system will be quite poor and makes the whole system useless. There
is no method to reset the clocksource to TSC. If there is a reset
mechanism, it's fine too.
 
> Can you please stop this handwaving and provide some proper proof for
> your arguments? I'm really tired of this.

I'm sorry I can't provide a simple way to trigger it in real hardware,
but it's not hard to trigger this issue in kvm. Just make your host busy
and keep rebooting your virtual machine, you will find it.

Thanks,
Shaohua

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2015-09-09 15:43 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2015-09-08 14:19 [RFC PATCH v3] Fix: clocksource watchdog marks TSC unstable on guest VM Mathieu Desnoyers
2015-09-08 15:08 ` Thomas Gleixner
2015-09-09  1:03   ` Shaohua Li
2015-09-09  9:51     ` Thomas Gleixner
2015-09-09 15:43       ` Shaohua Li

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox