* [PATCH] x86/NMI: refine "watchdog stuck" log message
@ 2024-01-24 15:21 Jan Beulich
2024-01-24 15:52 ` Roger Pau Monné
2024-01-24 15:55 ` Andrew Cooper
0 siblings, 2 replies; 5+ messages in thread
From: Jan Beulich @ 2024-01-24 15:21 UTC (permalink / raw)
To: xen-devel@lists.xenproject.org
Cc: Andrew Cooper, Wei Liu, Roger Pau Monné
Observing
"Testing NMI watchdog on all CPUs: 0 stuck"
it felt like it's not quite right, but I still read it as "no CPU stuck;
all good", when really the system suffered from what 6bdb965178bb
("x86/intel: ensure Global Performance Counter Control is setup
correctly") works around. Convert this to
"Testing NMI watchdog on all CPUs: {0} stuck"
or, with multiple CPUs having an issue, e.g.
"Testing NMI watchdog on all CPUs: {0,40} stuck"
to make more obvious that a lone number is not a count of CPUs.
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
In principle "sep" could also fulfill the job of "ok"; it felt to me as
if this may not be liked very much, though.
--- a/xen/arch/x86/nmi.c
+++ b/xen/arch/x86/nmi.c
@@ -167,13 +167,14 @@ static void __init cf_check wait_for_nmi
void __init check_nmi_watchdog(void)
{
static unsigned int __initdata prev_nmi_count[NR_CPUS];
- int cpu;
+ unsigned int cpu;
+ char sep = '{';
bool ok = true;
if ( nmi_watchdog == NMI_NONE )
return;
- printk("Testing NMI watchdog on all CPUs:");
+ printk("Testing NMI watchdog on all CPUs: ");
for_each_online_cpu ( cpu )
prev_nmi_count[cpu] = per_cpu(nmi_count, cpu);
@@ -189,12 +190,13 @@ void __init check_nmi_watchdog(void)
{
if ( per_cpu(nmi_count, cpu) - prev_nmi_count[cpu] < 2 )
{
- printk(" %d", cpu);
+ printk("%c%u", sep, cpu);
+ sep = ',';
ok = false;
}
}
- printk(" %s\n", ok ? "ok" : "stuck");
+ printk("%s\n", ok ? "ok" : "} stuck");
/*
* Now that we know it works we can reduce NMI frequency to
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] x86/NMI: refine "watchdog stuck" log message
2024-01-24 15:21 [PATCH] x86/NMI: refine "watchdog stuck" log message Jan Beulich
@ 2024-01-24 15:52 ` Roger Pau Monné
2024-01-24 15:55 ` Andrew Cooper
1 sibling, 0 replies; 5+ messages in thread
From: Roger Pau Monné @ 2024-01-24 15:52 UTC (permalink / raw)
To: Jan Beulich; +Cc: xen-devel@lists.xenproject.org, Andrew Cooper, Wei Liu
On Wed, Jan 24, 2024 at 04:21:24PM +0100, Jan Beulich wrote:
> Observing
>
> "Testing NMI watchdog on all CPUs: 0 stuck"
>
> it felt like it's not quite right, but I still read it as "no CPU stuck;
> all good", when really the system suffered from what 6bdb965178bb
> ("x86/intel: ensure Global Performance Counter Control is setup
> correctly") works around. Convert this to
>
> "Testing NMI watchdog on all CPUs: {0} stuck"
To make this even more obvious, maybe it could be prefixed with "error":
"Testing NMI watchdog on all CPUs: error {0} stuck"
Hm, albeit I don't like it that much
>
> or, with multiple CPUs having an issue, e.g.
>
> "Testing NMI watchdog on all CPUs: {0,40} stuck"
>
> to make more obvious that a lone number is not a count of CPUs.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Roger Pau Monné <roger.pau@citrix.com>
> ---
> In principle "sep" could also fulfill the job of "ok"; it felt to me as
> if this may not be liked very much, though.
I think I prefer it the current way.
Thanks, Roger.
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] x86/NMI: refine "watchdog stuck" log message
2024-01-24 15:21 [PATCH] x86/NMI: refine "watchdog stuck" log message Jan Beulich
2024-01-24 15:52 ` Roger Pau Monné
@ 2024-01-24 15:55 ` Andrew Cooper
2024-01-25 8:05 ` Jan Beulich
1 sibling, 1 reply; 5+ messages in thread
From: Andrew Cooper @ 2024-01-24 15:55 UTC (permalink / raw)
To: Jan Beulich, xen-devel@lists.xenproject.org; +Cc: Wei Liu, Roger Pau Monné
On 24/01/2024 3:21 pm, Jan Beulich wrote:
> Observing
>
> "Testing NMI watchdog on all CPUs: 0 stuck"
>
> it felt like it's not quite right, but I still read it as "no CPU stuck;
> all good", when really the system suffered from what 6bdb965178bb
> ("x86/intel: ensure Global Performance Counter Control is setup
> correctly") works around. Convert this to
>
> "Testing NMI watchdog on all CPUs: {0} stuck"
>
> or, with multiple CPUs having an issue, e.g.
>
> "Testing NMI watchdog on all CPUs: {0,40} stuck"
>
> to make more obvious that a lone number is not a count of CPUs.
>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>
I'd forgotten it was still opencoded like this.
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>, but if you felt
turning this into using a scratch cpumask and %*pb then I think that
would be nicer still.
~Andrew
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] x86/NMI: refine "watchdog stuck" log message
2024-01-24 15:55 ` Andrew Cooper
@ 2024-01-25 8:05 ` Jan Beulich
2024-01-25 9:10 ` Roger Pau Monné
0 siblings, 1 reply; 5+ messages in thread
From: Jan Beulich @ 2024-01-25 8:05 UTC (permalink / raw)
To: Andrew Cooper
Cc: Wei Liu, Roger Pau Monné, xen-devel@lists.xenproject.org
On 24.01.2024 16:55, Andrew Cooper wrote:
> On 24/01/2024 3:21 pm, Jan Beulich wrote:
>> Observing
>>
>> "Testing NMI watchdog on all CPUs: 0 stuck"
>>
>> it felt like it's not quite right, but I still read it as "no CPU stuck;
>> all good", when really the system suffered from what 6bdb965178bb
>> ("x86/intel: ensure Global Performance Counter Control is setup
>> correctly") works around. Convert this to
>>
>> "Testing NMI watchdog on all CPUs: {0} stuck"
>>
>> or, with multiple CPUs having an issue, e.g.
>>
>> "Testing NMI watchdog on all CPUs: {0,40} stuck"
>>
>> to make more obvious that a lone number is not a count of CPUs.
>>
>> Signed-off-by: Jan Beulich <jbeulich@suse.com>
>
> I'd forgotten it was still opencoded like this.
>
> Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>,
Thanks (Roger, also to you).
> but if you felt
> turning this into using a scratch cpumask and %*pb then I think that
> would be nicer still.
I expected you might say that, and indeed I considered it. But then
I wanted to keep code churn limited. The way it's done it's imo an
almost purely mechanical change. Much like with Roger's consideration
of further refining the message text in the error case.
Jan
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH] x86/NMI: refine "watchdog stuck" log message
2024-01-25 8:05 ` Jan Beulich
@ 2024-01-25 9:10 ` Roger Pau Monné
0 siblings, 0 replies; 5+ messages in thread
From: Roger Pau Monné @ 2024-01-25 9:10 UTC (permalink / raw)
To: Jan Beulich; +Cc: Andrew Cooper, Wei Liu, xen-devel@lists.xenproject.org
On Thu, Jan 25, 2024 at 09:05:13AM +0100, Jan Beulich wrote:
> On 24.01.2024 16:55, Andrew Cooper wrote:
> > On 24/01/2024 3:21 pm, Jan Beulich wrote:
> > but if you felt
> > turning this into using a scratch cpumask and %*pb then I think that
> > would be nicer still.
>
> I expected you might say that, and indeed I considered it. But then
> I wanted to keep code churn limited. The way it's done it's imo an
> almost purely mechanical change. Much like with Roger's consideration
> of further refining the message text in the error case.
One suggestion I had in mind (which is not explicitly related to the
fix here) would be to taint the hypervisor when we hit such selftests
errors, as that would make it more obvious to go search the dmesg for
reported failures. I think it's still fairly easy to miss noticing the
watchdog selftest failed on some CPUs.
Regards, Roger.
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2024-01-25 9:10 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-01-24 15:21 [PATCH] x86/NMI: refine "watchdog stuck" log message Jan Beulich
2024-01-24 15:52 ` Roger Pau Monné
2024-01-24 15:55 ` Andrew Cooper
2024-01-25 8:05 ` Jan Beulich
2024-01-25 9:10 ` Roger Pau Monné
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.