All of lore.kernel.org
 help / color / mirror / Atom feed
* mce.c related WARNING: at kernel/timer.c:983 del_timer_sync
@ 2011-03-08  1:31 Venkatesh Pallipadi
  2011-03-08  9:06 ` Yong Zhang
  0 siblings, 1 reply; 4+ messages in thread
From: Venkatesh Pallipadi @ 2011-03-08  1:31 UTC (permalink / raw)
  To: Andi Kleen, Yong Zhang
  Cc: Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin

With latest git kernel, I see the below WARN_ON at boot, once for each CPU.

[   26.806429] ------------[ cut here ]------------
[   26.806434] WARNING: at kernel/timer.c:983 del_timer_sync+0x39/0x4d()
[   26.806437] Hardware name: MCP55
[   26.806438] Modules linked in: tg3 forcedeth sata_mv powernow_k8
freq_table processor mperf msr cpuid ipv6 genrtc
[   26.806447] Pid: 0, comm: swapper Tainted: G        W
2.6.38-smp-linus.22280 #23
[   26.806449] Call Trace:
[   26.806450]  <IRQ>  [<ffffffff8106e8ea>] ? warn_slowpath_common+0x85/0x9d
[   26.806456]  [<ffffffff8106e91c>] ? warn_slowpath_null+0x1a/0x1c
[   26.806459]  [<ffffffff8107ac48>] ? del_timer_sync+0x39/0x4d
[   26.806462]  [<ffffffff81047a44>] ? mce_cpu_restart+0x1e/0x54
[   26.806466]  [<ffffffff81099b22>] ?
generic_smp_call_function_single_interrupt+0xd1/0xf3
[   26.806469]  [<ffffffff8104c188>] ?
smp_call_function_single_interrupt+0x18/0x27
[   26.806473]  [<ffffffff810375f3>] ? call_function_single_interrupt+0x13/0x20
[   26.806475]  <EOI>  [<ffffffff8103da93>] ? default_idle+0x4d/0x7f
[   26.806479]  [<ffffffff8103da73>] ? default_idle+0x2d/0x7f
[   26.806482]  [<ffffffff8103dbad>] ? c1e_idle+0xe8/0xef
[   26.806485]  [<ffffffff8145bc62>] ? atomic_notifier_call_chain+0x18/0x1a
[   26.806488]  [<ffffffff81035cc7>] ? cpu_idle+0x5f/0x96
[   26.806491]  [<ffffffff8144af7e>] ? rest_init+0x72/0x74
[   26.806495]  [<ffffffff818edcfe>] ? start_kernel+0x37b/0x386
[   26.806498]  [<ffffffff818ed2a4>] ? x86_64_start_reservations+0xb4/0xb8
[   26.806502]  [<ffffffff818ed39a>] ? x86_64_start_kernel+0xf2/0xf9
[   26.806504] ---[ end trace 69a4de56993e518a ]---

Looks like WARN_ON was after this change
commit 466bd3030973910118ca601da8072be97a1e2209
Author: Yong Zhang <yong.zhang@windriver.com>
Date:   Wed Oct 20 15:57:33 2010 -0700

    timer: Warn when del_timer_sync() is called in hardirq context

But, the actual reason is likely some MCE parameter change at boot causing
mce_restart() which in turn calls on_each_cpu mce_cpu_restart() which calls
del_timer_sync().

Thanks,
Venki

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: mce.c related WARNING: at kernel/timer.c:983 del_timer_sync
  2011-03-08  1:31 mce.c related WARNING: at kernel/timer.c:983 del_timer_sync Venkatesh Pallipadi
@ 2011-03-08  9:06 ` Yong Zhang
  2011-03-08 18:50   ` Andi Kleen
  0 siblings, 1 reply; 4+ messages in thread
From: Yong Zhang @ 2011-03-08  9:06 UTC (permalink / raw)
  To: Venkatesh Pallipadi
  Cc: Andi Kleen, Yong Zhang, Linux Kernel Mailing List,
	Thomas Gleixner, Ingo Molnar, H. Peter Anvin

On Tue, Mar 8, 2011 at 9:31 AM, Venkatesh Pallipadi <venki@google.com> wrote:
> With latest git kernel, I see the below WARN_ON at boot, once for each CPU.
>
> [   26.806429] ------------[ cut here ]------------
> [   26.806434] WARNING: at kernel/timer.c:983 del_timer_sync+0x39/0x4d()
> [   26.806437] Hardware name: MCP55
> [   26.806438] Modules linked in: tg3 forcedeth sata_mv powernow_k8
> freq_table processor mperf msr cpuid ipv6 genrtc
> [   26.806447] Pid: 0, comm: swapper Tainted: G        W
> 2.6.38-smp-linus.22280 #23
> [   26.806449] Call Trace:
> [   26.806450]  <IRQ>  [<ffffffff8106e8ea>] ? warn_slowpath_common+0x85/0x9d
> [   26.806456]  [<ffffffff8106e91c>] ? warn_slowpath_null+0x1a/0x1c
> [   26.806459]  [<ffffffff8107ac48>] ? del_timer_sync+0x39/0x4d
> [   26.806462]  [<ffffffff81047a44>] ? mce_cpu_restart+0x1e/0x54
> [   26.806466]  [<ffffffff81099b22>] ?
> generic_smp_call_function_single_interrupt+0xd1/0xf3
> [   26.806469]  [<ffffffff8104c188>] ?
> smp_call_function_single_interrupt+0x18/0x27
> [   26.806473]  [<ffffffff810375f3>] ? call_function_single_interrupt+0x13/0x20
> [   26.806475]  <EOI>  [<ffffffff8103da93>] ? default_idle+0x4d/0x7f
> [   26.806479]  [<ffffffff8103da73>] ? default_idle+0x2d/0x7f
> [   26.806482]  [<ffffffff8103dbad>] ? c1e_idle+0xe8/0xef
> [   26.806485]  [<ffffffff8145bc62>] ? atomic_notifier_call_chain+0x18/0x1a
> [   26.806488]  [<ffffffff81035cc7>] ? cpu_idle+0x5f/0x96
> [   26.806491]  [<ffffffff8144af7e>] ? rest_init+0x72/0x74
> [   26.806495]  [<ffffffff818edcfe>] ? start_kernel+0x37b/0x386
> [   26.806498]  [<ffffffff818ed2a4>] ? x86_64_start_reservations+0xb4/0xb8
> [   26.806502]  [<ffffffff818ed39a>] ? x86_64_start_kernel+0xf2/0xf9
> [   26.806504] ---[ end trace 69a4de56993e518a ]---
>
> Looks like WARN_ON was after this change
> commit 466bd3030973910118ca601da8072be97a1e2209
> Author: Yong Zhang <yong.zhang@windriver.com>
> Date:   Wed Oct 20 15:57:33 2010 -0700
>
>    timer: Warn when del_timer_sync() is called in hardirq context

Even if without the WARN_ON(), lockdep will shout on it :)

>
> But, the actual reason is likely some MCE parameter change at boot causing
> mce_restart() which in turn calls on_each_cpu mce_cpu_restart() which calls
> del_timer_sync().

Seems we found a real bug.

And usage of del_timer_sync() in arch/x86/kernel/cpu/mcheck/mce.c
break two restriction:
1) del_timer_sync() must not be used in interrupt context;
2) The timer's handler must not call add_timer_on();

Thanks,
Yong

-- 
Only stand for myself

^ permalink raw reply	[flat|nested] 4+ messages in thread

* Re: mce.c related WARNING: at kernel/timer.c:983 del_timer_sync
  2011-03-08  9:06 ` Yong Zhang
@ 2011-03-08 18:50   ` Andi Kleen
  2011-03-08 19:36     ` Thomas Gleixner
  0 siblings, 1 reply; 4+ messages in thread
From: Andi Kleen @ 2011-03-08 18:50 UTC (permalink / raw)
  To: Yong Zhang
  Cc: Venkatesh Pallipadi, Andi Kleen, Yong Zhang,
	Linux Kernel Mailing List, Thomas Gleixner, Ingo Molnar,
	H. Peter Anvin

> >
> > But, the actual reason is likely some MCE parameter change at boot causing
> > mce_restart() which in turn calls on_each_cpu mce_cpu_restart() which calls
> > del_timer_sync().
> 
> Seems we found a real bug.

I don't think it's a real bug actually because the timer cannot run at
the same time in this state. It's an interrupt which runs with irq disabled 
Really the only case where it could lead to deadlock is when the timer
runs with irqs on and the other interrupt with the del_timer_sync
interrupts it. So most likely your new WARN_ON() is catching
lots of innocent code.

That said I don't think we need the del_timer_sync in mce.c either
for the same reason.  The timer is always on the
same CPU, so it cannot run in parallel.

Remove del_timer_sync()s in mce.c

All the del_timers happen on the same CPUs as the actual timers, so
the timer handlers cannot run at the same time. Replace them
with plain del_timer()s.

Signed-off-by: Andi Kleen <ak@linux.intel.com>

diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index d916183..ba7058a 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -1774,7 +1774,7 @@ static int mce_resume(struct sys_device *dev)
 
 static void mce_cpu_restart(void *data)
 {
-	del_timer_sync(&__get_cpu_var(mce_timer));
+	del_timer(&__get_cpu_var(mce_timer));
 	if (!mce_available(__this_cpu_ptr(&cpu_info)))
 		return;
 	__mcheck_cpu_init_generic();
@@ -1793,7 +1793,7 @@ static void mce_disable_ce(void *all)
 	if (!mce_available(__this_cpu_ptr(&cpu_info)))
 		return;
 	if (all)
-		del_timer_sync(&__get_cpu_var(mce_timer));
+		del_timer(&__get_cpu_var(mce_timer));
 	cmci_clear();
 }
 
@@ -2075,7 +2075,7 @@ mce_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
 		break;
 	case CPU_DOWN_PREPARE:
 	case CPU_DOWN_PREPARE_FROZEN:
-		del_timer_sync(t);
+		del_timer(t);
 		smp_call_function_single(cpu, mce_disable_cpu, &action, 1);
 		break;
 	case CPU_DOWN_FAILED:


^ permalink raw reply related	[flat|nested] 4+ messages in thread

* Re: mce.c related WARNING: at kernel/timer.c:983 del_timer_sync
  2011-03-08 18:50   ` Andi Kleen
@ 2011-03-08 19:36     ` Thomas Gleixner
  0 siblings, 0 replies; 4+ messages in thread
From: Thomas Gleixner @ 2011-03-08 19:36 UTC (permalink / raw)
  To: Andi Kleen
  Cc: Yong Zhang, Venkatesh Pallipadi, Yong Zhang,
	Linux Kernel Mailing List, Ingo Molnar, H. Peter Anvin

B1;2401;0cOn Tue, 8 Mar 2011, Andi Kleen wrote:
> > >
> > > But, the actual reason is likely some MCE parameter change at boot causing
> > > mce_restart() which in turn calls on_each_cpu mce_cpu_restart() which calls
> > > del_timer_sync().
> > 
> > Seems we found a real bug.
> 
> I don't think it's a real bug actually because the timer cannot run at
> the same time in this state. It's an interrupt which runs with irq disabled 
> Really the only case where it could lead to deadlock is when the timer
> runs with irqs on and the other interrupt with the del_timer_sync
> interrupts it. So most likely your new WARN_ON() is catching
> lots of innocent code.

Nonsense. All timer callbacks run with interrupts enabled. See
timer.c:__run_timers()

	spin_unlock_irq(&base->lock);
	call_timer_fn(timer, fn, data);
        spin_lock_irq(&base->lock);

So it can happen and that's a plain bug, not innocent code.

> That said I don't think we need the del_timer_sync in mce.c either
> for the same reason.  The timer is always on the
> same CPU, so it cannot run in parallel.

But a running timer callback can be interrupted by the smp function
call and if that happens you run into a different problem

     softirq()
	mce_start_timer()
		--> SMP function call interrupt
		    __mcheck_cpu_init_timer()
			del_timer()
			add_timer_on()
                <--
	   add_timer_on()
		--> BUG_ON(timer_pending())

> @@ -2075,7 +2075,7 @@ mce_cpu_callback(struct notifier_block *nfb, unsigned long action, void *hcpu)
>  		break;
>  	case CPU_DOWN_PREPARE:
>  	case CPU_DOWN_PREPARE_FROZEN:
> -		del_timer_sync(t);
> +		del_timer(t);

Sigh. This is not called on the CPU which is taken down and you really
want to call del_timer_sync() here.

>  		smp_call_function_single(cpu, mce_disable_cpu, &action, 1);
>  		break;
>  	case CPU_DOWN_FAILED:

Thanks,

	tglx

^ permalink raw reply	[flat|nested] 4+ messages in thread

end of thread, other threads:[~2011-03-08 19:36 UTC | newest]

Thread overview: 4+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2011-03-08  1:31 mce.c related WARNING: at kernel/timer.c:983 del_timer_sync Venkatesh Pallipadi
2011-03-08  9:06 ` Yong Zhang
2011-03-08 18:50   ` Andi Kleen
2011-03-08 19:36     ` Thomas Gleixner

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.