From: Thomas Gleixner <tglx@linutronix.de>
To: lkp@lists.01.org
Subject: Re: [clocksource] 8c30ace35d: WARNING:at_kernel/time/clocksource.c:#clocksource_watchdog
Date: Thu, 29 Apr 2021 19:30:20 +0200 [thread overview]
Message-ID: <87lf91f177.ffs@nanos.tec.linutronix.de> (raw)
In-Reply-To: <20210429142641.GU975577@paulmck-ThinkPad-P17-Gen-1>
[-- Attachment #1: Type: text/plain, Size: 2744 bytes --]
Paul,
On Thu, Apr 29 2021 at 07:26, Paul E. McKenney wrote:
> On Thu, Apr 29, 2021 at 10:27:09AM +0200, Thomas Gleixner wrote:
>> > Or are you saying that the checks should be in the host OS rather than
>> > in the guests?
>>
>> Yes. That's where it belongs. The host has to make sure that TSC is usable
>> otherwise it should tell the guest not to use it. Anything else is
>> wishful thinking and never reliable.
>
> Thank you for the confirmation. I will look into this.
So the guest might need at least some basic sanity checking unless we
declare that hypervisors are always working correctly :)
Which is admittedly more likely than making the same assumption about
BIOS and hardware.
>> > In addition, breakage due to age and environmentals is possible, and if
>> > you have enough hardware, probable. In which case it would be good to
>> > get a notification so that the system in question can be dealt with.
>>
>> Are you trying to find a problem for a solution again?
>
> We really do see this thing trigger. > I am trying to get rid of one
> class of false positives that might be afflicting us. Along the way,
> I am thinking aloud about what might be the cause of any remaining
> skew reports that might trigger in the future.
Fair enough. Admittedly this has at least entertainment value :)
>> Well, you might then also build safety nets for interrupts, exceptions
>> and if you go fully paranoid for every single CPU instruction. :)
>
> Fair, and I doubt that looking at failure data across a large fleet of
> systems has done anything to reduce my level of paranoia. ;-)
You should have known better than opening Pandoras box.
>> Sure. If you have a seperate module then you can add module params to it
>> obviously. But you don't need any of the muck in the actual watchdog
>> code because the watchdog::read() function in that module will simply
>> handle the delay injection. Hmm?
>
> OK, first let me make sure I understand what you are suggesting.
>
> The idea is to leave the watchdog code in kernel/time/clocksource.c,
> but to move the fault injection into kernel/time/clocksourcefault.c or
> some such. In this new file, new clocksource structures are created that
> use some existing timebase/clocksource under the covers. These then
> inject delays based on module parameters (one senstive to CPU number,
> the other unconditional). They register these clocksources using the
> normal interfaces, and verify that they are eventually marked unstable
> when the fault-injection parameters warrant it. This is combined with
> the usual checking of the console log.
>
> Or am I missing your point?
That's what I meant.
Thanks,
tglx
WARNING: multiple messages have this Message-ID (diff)
From: Thomas Gleixner <tglx@linutronix.de>
To: paulmck@kernel.org
Cc: Feng Tang <feng.tang@intel.com>,
kernel test robot <oliver.sang@intel.com>,
0day robot <lkp@intel.com>, John Stultz <john.stultz@linaro.org>,
Stephen Boyd <sboyd@kernel.org>, Jonathan Corbet <corbet@lwn.net>,
Mark Rutland <Mark.Rutland@arm.com>,
Marc Zyngier <maz@kernel.org>, Andi Kleen <ak@linux.intel.com>,
Xing Zhengjun <zhengjun.xing@linux.intel.com>,
LKML <linux-kernel@vger.kernel.org>,
lkp@lists.01.org, kernel-team@fb.com, neeraju@codeaurora.org,
zhengjun.xing@intel.com, x86@kernel.org,
Paolo Bonzini <pbonzini@redhat.com>
Subject: Re: [clocksource] 8c30ace35d: WARNING:at_kernel/time/clocksource.c:#clocksource_watchdog
Date: Thu, 29 Apr 2021 19:30:20 +0200 [thread overview]
Message-ID: <87lf91f177.ffs@nanos.tec.linutronix.de> (raw)
In-Reply-To: <20210429142641.GU975577@paulmck-ThinkPad-P17-Gen-1>
Paul,
On Thu, Apr 29 2021 at 07:26, Paul E. McKenney wrote:
> On Thu, Apr 29, 2021 at 10:27:09AM +0200, Thomas Gleixner wrote:
>> > Or are you saying that the checks should be in the host OS rather than
>> > in the guests?
>>
>> Yes. That's where it belongs. The host has to make sure that TSC is usable
>> otherwise it should tell the guest not to use it. Anything else is
>> wishful thinking and never reliable.
>
> Thank you for the confirmation. I will look into this.
So the guest might need at least some basic sanity checking unless we
declare that hypervisors are always working correctly :)
Which is admittedly more likely than making the same assumption about
BIOS and hardware.
>> > In addition, breakage due to age and environmentals is possible, and if
>> > you have enough hardware, probable. In which case it would be good to
>> > get a notification so that the system in question can be dealt with.
>>
>> Are you trying to find a problem for a solution again?
>
> We really do see this thing trigger. > I am trying to get rid of one
> class of false positives that might be afflicting us. Along the way,
> I am thinking aloud about what might be the cause of any remaining
> skew reports that might trigger in the future.
Fair enough. Admittedly this has at least entertainment value :)
>> Well, you might then also build safety nets for interrupts, exceptions
>> and if you go fully paranoid for every single CPU instruction. :)
>
> Fair, and I doubt that looking at failure data across a large fleet of
> systems has done anything to reduce my level of paranoia. ;-)
You should have known better than opening Pandoras box.
>> Sure. If you have a seperate module then you can add module params to it
>> obviously. But you don't need any of the muck in the actual watchdog
>> code because the watchdog::read() function in that module will simply
>> handle the delay injection. Hmm?
>
> OK, first let me make sure I understand what you are suggesting.
>
> The idea is to leave the watchdog code in kernel/time/clocksource.c,
> but to move the fault injection into kernel/time/clocksourcefault.c or
> some such. In this new file, new clocksource structures are created that
> use some existing timebase/clocksource under the covers. These then
> inject delays based on module parameters (one senstive to CPU number,
> the other unconditional). They register these clocksources using the
> normal interfaces, and verify that they are eventually marked unstable
> when the fault-injection parameters warrant it. This is combined with
> the usual checking of the console log.
>
> Or am I missing your point?
That's what I meant.
Thanks,
tglx
next prev parent reply other threads:[~2021-04-29 17:30 UTC|newest]
Thread overview: 72+ messages / expand[flat|nested] mbox.gz Atom feed top
2021-04-25 22:45 [PATCH v10 clocksource 0/7] Do not mark clocks unstable due to delays for v5.13 Paul E. McKenney
2021-04-25 22:47 ` [PATCH v10 clocksource 1/7] clocksource: Provide module parameters to inject delays in watchdog Paul E. McKenney
2021-04-26 4:07 ` Andi Kleen
2021-04-26 7:13 ` Thomas Gleixner
2021-04-26 15:28 ` Paul E. McKenney
2021-04-26 16:00 ` Andi Kleen
2021-04-26 16:14 ` Paul E. McKenney
2021-04-26 17:56 ` Andi Kleen
2021-04-26 18:24 ` Paul E. McKenney
2021-04-28 4:49 ` Luming Yu
2021-04-28 13:57 ` Paul E. McKenney
2021-04-28 14:24 ` Luming Yu
2021-04-28 14:37 ` Thomas Gleixner
2021-04-25 22:47 ` [PATCH v10 clocksource 2/7] clocksource: Retry clock read if long delays detected Paul E. McKenney
2021-04-27 1:44 ` Feng Tang
2021-04-25 22:47 ` [PATCH v10 clocksource 3/7] clocksource: Check per-CPU clock synchronization when marked unstable Paul E. McKenney
2021-04-26 4:12 ` Andi Kleen
2021-04-26 7:16 ` Thomas Gleixner
2021-04-25 22:47 ` [PATCH v10 clocksource 4/7] clocksource: Provide a module parameter to fuzz per-CPU clock checking Paul E. McKenney
2021-04-25 22:47 ` [PATCH v10 clocksource 5/7] clocksource: Limit number of CPUs checked for clock synchronization Paul E. McKenney
2021-04-25 22:47 ` [PATCH v10 clocksource 6/7] clocksource: Forgive tsc_early pre-calibration drift Paul E. McKenney
2021-04-26 15:01 ` Feng Tang
2021-04-26 15:25 ` Paul E. McKenney
2021-04-26 15:36 ` Feng Tang
2021-04-26 18:26 ` Paul E. McKenney
2021-04-27 1:13 ` Feng Tang
2021-04-27 3:46 ` Paul E. McKenney
2021-04-27 4:16 ` Feng Tang
2021-04-26 15:28 ` Thomas Gleixner
2021-04-27 21:03 ` Thomas Gleixner
2021-04-27 7:27 ` [clocksource] 8c30ace35d: WARNING:at_kernel/time/clocksource.c:#clocksource_watchdog kernel test robot
2021-04-27 7:27 ` kernel test robot
2021-04-27 8:45 ` Feng Tang
2021-04-27 8:45 ` Feng Tang
2021-04-27 13:37 ` Paul E. McKenney
2021-04-27 13:37 ` Paul E. McKenney
2021-04-27 17:50 ` Paul E. McKenney
2021-04-27 17:50 ` Paul E. McKenney
2021-04-27 21:09 ` Thomas Gleixner
2021-04-27 21:09 ` Thomas Gleixner
2021-04-28 1:48 ` Paul E. McKenney
2021-04-28 1:48 ` Paul E. McKenney
2021-04-28 10:14 ` Thomas Gleixner
2021-04-28 10:14 ` Thomas Gleixner
2021-04-28 18:31 ` Paul E. McKenney
2021-04-28 18:31 ` Paul E. McKenney
2021-04-28 13:34 ` Thomas Gleixner
2021-04-28 13:34 ` Thomas Gleixner
2021-04-28 15:39 ` Peter Zijlstra
2021-04-28 15:39 ` Peter Zijlstra
2021-04-28 17:00 ` Thomas Gleixner
2021-04-28 17:00 ` Thomas Gleixner
2021-04-29 7:38 ` Feng Tang
2021-04-29 7:38 ` Feng Tang
2021-04-28 18:31 ` Paul E. McKenney
2021-04-28 18:31 ` Paul E. McKenney
2021-04-29 8:27 ` Thomas Gleixner
2021-04-29 8:27 ` Thomas Gleixner
2021-04-29 14:26 ` Paul E. McKenney
2021-04-29 14:26 ` Paul E. McKenney
2021-04-29 17:30 ` Thomas Gleixner [this message]
2021-04-29 17:30 ` Thomas Gleixner
2021-04-29 23:04 ` Andi Kleen
2021-04-29 23:04 ` Andi Kleen
2021-04-30 0:24 ` Paul E. McKenney
2021-04-30 0:24 ` Paul E. McKenney
2021-04-30 0:59 ` Paul E. McKenney
2021-04-30 0:59 ` Paul E. McKenney
2021-04-30 5:08 ` Paul E. McKenney
2021-04-30 5:08 ` Paul E. McKenney
2021-04-25 22:47 ` [PATCH v9 clocksource 6/6] clocksource: Reduce WATCHDOG_THRESHOLD Paul E. McKenney
2021-04-25 22:47 ` [PATCH v10 clocksource 7/7] " Paul E. McKenney
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=87lf91f177.ffs@nanos.tec.linutronix.de \
--to=tglx@linutronix.de \
--cc=lkp@lists.01.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.