Generic callfunction IPI problems

* Generic callfunction IPI problems
@ 2008-07-06 14:50 Jeremy Fitzhardinge
  2008-07-06 16:03 ` [PATCH] generic ipi function calls: wait on alloc failure fallback Jeremy Fitzhardinge
  0 siblings, 1 reply; 8+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-06 14:50 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Ingo Molnar, Linux Kernel Mailing List

Hi Jens,

I'm seeing these oopses when running under Xen:

BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
IP: [<ffffffff8105de9a>] generic_smp_call_function_interrupt+0xfb/0x118
PGD 0 
Oops: 0000 [1] SMP 
CPU 15 
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.26-rc8-tip #306
RIP: e030:[<ffffffff8105de9a>]  [<ffffffff8105de9a>] generic_smp_call_function_interrupt+0xfb/0x118
RSP: e02b:ffff88007f653e98  EFLAGS: 00010046
RAX: ffffffff815fe6e0 RBX: ffff88007e523cc8 RCX: 0000000000000001
RDX: ffffc10000200200 RSI: 0000000000000001 RDI: ffffffff81693240
RBP: ffff88007f653eb8 R08: ffff88007f653ec8 R09: 0002db11ddd83820
R10: ffff880000000001 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 000000000000000f R15: 0000000000000040
FS:  00007f1dadd907a0(0000) GS:ffff88007ff30080(0000) knlGS:0000000000000000
CS:  e033 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000001001000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000000
Process swapper (pid: 0, threadinfo ffff88007ff90000, task ffff88007ff57100)
Stack:  ffff88007ff2f4c0 0000000000000000 0000000000000000 000000000000004d
 ffff88007f653ec8 ffffffff8100dea7 ffff88007f653ef8 ffffffff810747c5
 ffffffff816959c0 000000000000004d ffff88007ff2f4c0 ffffffff81695a10
Call Trace:
 <IRQ>  [<ffffffff8100dea7>] xen_call_function_interrupt+0xe/0x167
 [<ffffffff810747c5>] handle_IRQ_event+0x2e/0x65
 [<ffffffff81075e5b>] handle_level_irq+0xb5/0x116
 [<ffffffff81013f34>] do_IRQ+0xf7/0x177
 [<ffffffff811ba227>] xen_evtchn_do_upcall+0xb3/0x136
 [<ffffffff8141558e>] xen_do_hypervisor_callback+0x1e/0x30
 <EOI>  [<ffffffff810093aa>] ? _stext+0x3aa/0x1000
 [<ffffffff810093aa>] ? _stext+0x3aa/0x1000
 [<ffffffff8100a42e>] ? xen_safe_halt+0x10/0x1a
 [<ffffffff8100ba26>] ? xen_idle+0x46/0x5c
 [<ffffffff8100eb60>] ? cpu_idle+0xca/0x101
 [<ffffffff8140bc7d>] ? cpu_bringup_and_idle+0x8a/0x8f


Code: e8 fc 96 fc ff 90 41 f6 44 24 20 01 74 08 41 83 64 24 20 fe eb 11 49 8d 7c 24 38 48 c7 c6 90 dd 05 81 e8 6f 92 01 00 4d 8b 24 24 <49> 8b 04 24 49 81 fc e0 e6 5f 81 0f 18 08 0f 85 11 ff ff ff 5b 
RIP  [<ffffffff8105de9a>] generic_smp_call_function_interrupt+0xfb/0x118
 RSP <ffff88007f653e98>
CR2: 0000000000000000
Kernel panic - not syncing: Fatal exception in interrupt


They're pretty rare - this system did a kernbench run on a 16 vcpu 
system with no problems, then oopsed this way when I left it idle overnight.

One interesting data point is that I've been experimenting with more 
virtualization-friendly spinlock algorithms.  If I replace ticket locks 
with the old lock-byte algorithm, I see this much more frequently (and a 
spin-and-block algorithm generally doesn't get through boot).  I wonder 
if there's a race which is masked by ticket locks' strict FIFO 
algorithm?  (But this particular oops was with completely standard 
ticketlocks in place.)
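
For reference, the difference between the two lock schemes mentioned above can be sketched in userspace with C11 atomics. This is only an illustration and not the kernel's x86 implementation; the type and function names (`ticket_lock`, `byte_lock` and the `*_demo` helpers) are hypothetical. A ticket lock serves waiters strictly in arrival order, while a lock-byte lock lets whichever CPU wins the test-and-set take the lock, so the ordering of acquisitions (and therefore the timing of any latent race) can differ a lot between the two.

```c
#include <assert.h>
#include <stdatomic.h>

/* Ticket lock: waiters are served strictly in the order they arrived. */
struct ticket_lock {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint owner;	/* ticket currently being served */
};

static void ticket_lock_acquire(struct ticket_lock *l)
{
	unsigned int me = atomic_fetch_add(&l->next, 1);

	while (atomic_load(&l->owner) != me)
		;	/* spin until our ticket comes up */
}

static void ticket_lock_release(struct ticket_lock *l)
{
	/* Releasing a lock nobody holds would corrupt the queue. */
	assert(atomic_load(&l->owner) != atomic_load(&l->next));
	atomic_fetch_add(&l->owner, 1);
}

/* Lock-byte: whoever wins the test-and-set gets the lock, no ordering. */
struct byte_lock {
	atomic_flag locked;
};

static void byte_lock_acquire(struct byte_lock *l)
{
	while (atomic_flag_test_and_set(&l->locked))
		;	/* spin until the flag was observed clear */
}

static void byte_lock_release(struct byte_lock *l)
{
	atomic_flag_clear(&l->locked);
}

/* Single-threaded smoke check: two lock/unlock cycles consume two tickets. */
static unsigned int ticket_lock_demo(void)
{
	struct ticket_lock l;

	atomic_init(&l.next, 0);
	atomic_init(&l.owner, 0);
	ticket_lock_acquire(&l);	/* takes ticket 0 */
	ticket_lock_release(&l);
	ticket_lock_acquire(&l);	/* takes ticket 1 */
	ticket_lock_release(&l);
	return atomic_load(&l.owner);
}

/* Single-threaded smoke check: the flag is set while held, clear after. */
static int byte_lock_demo(void)
{
	struct byte_lock b = { ATOMIC_FLAG_INIT };
	int was_held, now_free;

	byte_lock_acquire(&b);
	was_held = atomic_flag_test_and_set(&b.locked);
	byte_lock_release(&b);
	now_free = !atomic_flag_test_and_set(&b.locked);
	return was_held && now_free;
}
```

The demo helpers only exercise the uncontended path; showing the fairness difference itself would need several threads contending for the lock.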

I've been running your old generic IPI patches for a while with no 
problems; this seems to be specific to the version in tip.git.  I 
haven't looked to see what differences there are yet.

I've also only observed problems under Xen, but I haven't done much 
testing on real hardware.

Thanks,
    J


* [PATCH] generic ipi function calls: wait on alloc failure fallback
  2008-07-06 14:50 Generic callfunction IPI problems Jeremy Fitzhardinge
@ 2008-07-06 16:03 ` Jeremy Fitzhardinge
  2008-07-06 17:21   ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 8+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-06 16:03 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Ingo Molnar, Linux Kernel Mailing List

When a GFP_ATOMIC allocation fails, smp_call_function_mask falls back
to allocating the data on the stack and converting it to a waiting call.

Make sure we actually wait in this case.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 kernel/smp.c |    1 +
 1 file changed, 1 insertion(+)

===================================================================
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -312,6 +312,7 @@ int smp_call_function_mask(cpumask_t mas
 	if (!data) {
 		data = &d;
 		data->csd.flags = CSD_FLAG_WAIT;
+		wait = 1;
 	}
 
 	spin_lock_init(&data->lock);
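
The fallback this hunk touches can be modelled in userspace roughly as follows. This is a simplified sketch, not the kernel source: `struct call_data`, `pick_call_data()` and the two check helpers are hypothetical names, `malloc()` stands in for the GFP_ATOMIC allocation, and the flag values are assumptions. The point it illustrates is that once the data lives on the caller's stack, the caller has to block until every target CPU is done with it, so `wait` must be forced on regardless of what the caller asked for.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Flag names follow the patch context; the values are assumed here. */
enum { CSD_FLAG_WAIT = 0x01, CSD_FLAG_ALLOC = 0x02 };

struct call_data {
	unsigned int flags;
};

/*
 * Choose storage for a cross-CPU call.  'heap' is the result of the
 * (possibly failed) atomic allocation, 'on_stack' is the caller's local
 * fallback.  '*wait' is updated so the caller knows whether it must block
 * before its stack frame goes away.
 */
static struct call_data *pick_call_data(struct call_data *heap,
					struct call_data *on_stack,
					bool *wait)
{
	if (heap) {
		heap->flags = CSD_FLAG_ALLOC;
		return heap;
	}

	assert(on_stack != NULL);
	on_stack->flags = CSD_FLAG_WAIT;
	*wait = true;	/* the one-line fix: stack data implies waiting */
	return on_stack;
}

/* Allocation failed: a caller that asked not to wait is forced to wait. */
static bool fallback_forces_wait(void)
{
	struct call_data d;
	bool wait = false;
	struct call_data *data = pick_call_data(NULL, &d, &wait);

	return data == &d && data->flags == CSD_FLAG_WAIT && wait;
}

/* Allocation succeeded: the caller's no-wait request is left alone. */
static bool heap_path_keeps_nowait(void)
{
	struct call_data *heap = malloc(sizeof(*heap));
	bool wait = false;
	bool ok;

	if (!heap)
		return false;
	ok = pick_call_data(heap, NULL, &wait) == heap &&
	     heap->flags == CSD_FLAG_ALLOC && !wait;
	free(heap);
	return ok;
}
```

Without the forced `wait`, the function could return while other CPUs still hold a pointer into a stack frame that is about to be reused.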





* Re: [PATCH] generic ipi function calls: wait on alloc failure fallback
  2008-07-06 16:03 ` [PATCH] generic ipi function calls: wait on alloc failure fallback Jeremy Fitzhardinge
@ 2008-07-06 17:21   ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 8+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-06 17:21 UTC (permalink / raw)
  To: Jens Axboe; +Cc: Ingo Molnar, Linux Kernel Mailing List

Jeremy Fitzhardinge wrote:
> When a GFP_ATOMIC allocation fails, smp_call_function_mask falls back
> to allocating the data on the stack and converting it to a waiting call.
>
> Make sure we actually wait in this case. 

Unfortunately this doesn't solve my crash, though it may account for 
some of them.

The oops I'm looking at at the moment is a NULL pointer on ->next of the 
rcu list in generic_smp_call_function_interrupt()...
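
The shape of that failure can be sketched in userspace. The code below is only a model with hypothetical names (`struct node`, `walk()` and the two helpers); it is not the kernel's list implementation and not a diagnosis of the bug. It walks a circular singly linked list until it gets back to the head, which is how such a traversal normally terminates. If an entry is cleared or reused while still linked, its `next` becomes NULL and the walk never reaches the head again. The kernel loop has no NULL check, so it faults on the next load instead, which is consistent with R12 and CR2 both being zero in the oops, and the bytes around the faulting instruction appear to decode as a load of `(%r12)` into `%r12` followed by a second load through `%r12`.

```c
#include <assert.h>
#include <stddef.h>

struct node {
	struct node *next;
	int payload;
};

/*
 * Walk a circular list starting after 'head'.  Returns the number of
 * entries, or -1 if a NULL ->next is found before getting back to head.
 */
static int walk(const struct node *head)
{
	const struct node *pos;
	int n = 0;

	assert(head != NULL);
	for (pos = head->next; pos != head; pos = pos->next) {
		if (pos == NULL)
			return -1;	/* a loop without this check faults here */
		n++;
	}
	return n;
}

/* An intact list head -> a -> b -> head has two entries. */
static int walk_intact(void)
{
	struct node head, a, b;

	head.next = &a;
	a.next = &b;
	b.next = &head;
	head.payload = 0;
	a.payload = 1;
	b.payload = 2;
	return walk(&head);
}

/* Clearing b.next while b is still linked breaks the traversal. */
static int walk_after_clobber(void)
{
	struct node head, a, b;

	head.next = &a;
	a.next = &b;
	b.next = NULL;	/* stands in for the entry's memory being reused */
	head.payload = 0;
	a.payload = 1;
	b.payload = 2;
	return walk(&head);
}
```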

    J


* [PATCH] generic ipi function calls: wait on alloc failure fallback
@ 2008-07-15 20:22 Jeremy Fitzhardinge
  2008-07-15 21:48 ` Ingo Molnar
  0 siblings, 1 reply; 8+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-15 20:22 UTC (permalink / raw)
  To: Jens Axboe, Ingo Molnar; +Cc: Linux Kernel Mailing List, Linus Torvalds

When a GFP_ATOMIC allocation fails, it falls back to allocating the
data on the stack and converting it to a waiting call.

Make sure we actually wait in this case.

Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
 kernel/smp.c |    1 +
 1 file changed, 1 insertion(+)

===================================================================
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -312,6 +312,7 @@ int smp_call_function_mask(cpumask_t mas
 	if (!data) {
 		data = &d;
 		data->csd.flags = CSD_FLAG_WAIT;
+		wait = 1;
 	}
 
 	spin_lock_init(&data->lock);





* Re: [PATCH] generic ipi function calls: wait on alloc failure fallback
  2008-07-15 20:22 Jeremy Fitzhardinge
@ 2008-07-15 21:48 ` Ingo Molnar
  2008-07-15 22:01   ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 8+ messages in thread
From: Ingo Molnar @ 2008-07-15 21:48 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Jens Axboe, Linux Kernel Mailing List, Linus Torvalds


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

> When a GFP_ATOMIC allocation fails, it falls back to allocating the 
> data on the stack and converting it to a waiting call.
>
> Make sure we actually wait in this case.

cool, thanks!

does this explain the xen64 weirdnesses you've been seeing?

	Ingo


* Re: [PATCH] generic ipi function calls: wait on alloc failure fallback
  2008-07-15 21:48 ` Ingo Molnar
@ 2008-07-15 22:01   ` Jeremy Fitzhardinge
  2008-07-18 22:19     ` Ingo Molnar
  0 siblings, 1 reply; 8+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-15 22:01 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Jens Axboe, Linux Kernel Mailing List, Linus Torvalds

Ingo Molnar wrote:
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>   
>> When a GFP_ATOMIC allocation fails, it falls back to allocating the 
>> data on the stack and converting it to a waiting call.
>>
>> Make sure we actually wait in this case.
>>     
>
> cool, thanks!
>
> does this explain the xen64 weirdnesses you've been seeing?
>   

No, but I haven't seen it lately.  I think the other RCU fixes may have 
helped.  But it's all a bit of a worry: I didn't have a good theory 
about what was going wrong, and the RCU patches didn't look like they'd 
fix the symptoms I was seeing.

I've seen it with 32 and 64-bit Xen, but there's nothing about the 
problem which makes me think it's really Xen specific.  If it were, I'd 
expect to see failures all over the place, rather than just in this 
one specific place.

I'm concerned there's a lurking bug, particularly if it's a generic race 
or something that happens to be triggered when running under Xen because 
of the timing changes.  I've tried reproducing it in a hvm Xen domain 
(so it's running the normal x86 kernel fully virtualized, but with the 
Xen scheduler, etc).  I didn't see a problem, but it isn't a very 
convincing test one way or the other.

    J


* Re: [PATCH] generic ipi function calls: wait on alloc failure fallback
  2008-07-15 22:01   ` Jeremy Fitzhardinge
@ 2008-07-18 22:19     ` Ingo Molnar
  2008-07-18 22:42       ` Jeremy Fitzhardinge
  0 siblings, 1 reply; 8+ messages in thread
From: Ingo Molnar @ 2008-07-18 22:19 UTC (permalink / raw)
  To: Jeremy Fitzhardinge; +Cc: Jens Axboe, Linux Kernel Mailing List, Linus Torvalds


* Jeremy Fitzhardinge <jeremy@goop.org> wrote:

>> does this explain the xen64 weirdnesses you've been seeing?
>>   
>
> No, but I haven't seen it lately.  I think the other RCU fixes may 
> have helped.  But it's all a bit of a worry: I didn't have a good 
> theory about what was going wrong, and the RCU patches didn't look 
> like they'd fix the symptoms I was seeing.
>
> I've seen it with 32 and 64-bit Xen, but there's nothing about the 
> problem which makes me think it's really Xen specific.  If it were, 
> I'd expect to see failures all over the place, rather than just in 
> this one specific place.
>
> I'm concerned there's a lurking bug, particularly if it's a generic 
> race or something that happens to be triggered when running under Xen 
> because of the timing changes.  I've tried reproducing it in a hvm Xen 
> domain (so it's running the normal x86 kernel fully virtualized, but 
> with the Xen scheduler, etc).  I didn't see a problem, but it isn't a 
> very convincing test one way or the other.

ok. I doubt there's much we can do at this stage - the code looks fine. 
If it's some recently added core kernel problem sooner or later some 
workload or hw will come about that shows it in a more debuggable 
manner.

	Ingo


* Re: [PATCH] generic ipi function calls: wait on alloc failure fallback
  2008-07-18 22:19     ` Ingo Molnar
@ 2008-07-18 22:42       ` Jeremy Fitzhardinge
  0 siblings, 0 replies; 8+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-18 22:42 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: Jens Axboe, Linux Kernel Mailing List, Linus Torvalds

Ingo Molnar wrote:
> ok. I doubt there's much we can do at this stage - the code looks fine. 
> If it's some recently added core kernel problem sooner or later some 
> workload or hw will come about that shows it in a more debuggable 
> manner.

Yep.  It has been rock solid for me lately, even doing traditionally 
bug-inducing things like save/restore.

    J

