* Generic callfunction IPI problems
@ 2008-07-06 14:50 Jeremy Fitzhardinge
2008-07-06 16:03 ` [PATCH] generic ipi function calls: wait on alloc failure fallback Jeremy Fitzhardinge
0 siblings, 1 reply; 8+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-06 14:50 UTC (permalink / raw)
To: Jens Axboe; +Cc: Ingo Molnar, Linux Kernel Mailing List
Hi Jens,
I'm seeing these oopses when running under Xen:
BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
IP: [<ffffffff8105de9a>] generic_smp_call_function_interrupt+0xfb/0x118
PGD 0
Oops: 0000 [1] SMP
CPU 15
Modules linked in:
Pid: 0, comm: swapper Not tainted 2.6.26-rc8-tip #306
RIP: e030:[<ffffffff8105de9a>] [<ffffffff8105de9a>] generic_smp_call_function_interrupt+0xfb/0x118
RSP: e02b:ffff88007f653e98 EFLAGS: 00010046
RAX: ffffffff815fe6e0 RBX: ffff88007e523cc8 RCX: 0000000000000001
RDX: ffffc10000200200 RSI: 0000000000000001 RDI: ffffffff81693240
RBP: ffff88007f653eb8 R08: ffff88007f653ec8 R09: 0002db11ddd83820
R10: ffff880000000001 R11: 0000000000000246 R12: 0000000000000000
R13: 0000000000000000 R14: 000000000000000f R15: 0000000000000040
FS: 00007f1dadd907a0(0000) GS:ffff88007ff30080(0000) knlGS:0000000000000000
CS: e033 DS: 002b ES: 002b CR0: 000000008005003b
CR2: 0000000000000000 CR3: 0000000001001000 CR4: 0000000000002660
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000000
Process swapper (pid: 0, threadinfo ffff88007ff90000, task ffff88007ff57100)
Stack: ffff88007ff2f4c0 0000000000000000 0000000000000000 000000000000004d
ffff88007f653ec8 ffffffff8100dea7 ffff88007f653ef8 ffffffff810747c5
ffffffff816959c0 000000000000004d ffff88007ff2f4c0 ffffffff81695a10
Call Trace:
<IRQ> [<ffffffff8100dea7>] xen_call_function_interrupt+0xe/0x167
[<ffffffff810747c5>] handle_IRQ_event+0x2e/0x65
[<ffffffff81075e5b>] handle_level_irq+0xb5/0x116
[<ffffffff81013f34>] do_IRQ+0xf7/0x177
[<ffffffff811ba227>] xen_evtchn_do_upcall+0xb3/0x136
[<ffffffff8141558e>] xen_do_hypervisor_callback+0x1e/0x30
<EOI> [<ffffffff810093aa>] ? _stext+0x3aa/0x1000
[<ffffffff810093aa>] ? _stext+0x3aa/0x1000
[<ffffffff8100a42e>] ? xen_safe_halt+0x10/0x1a
[<ffffffff8100ba26>] ? xen_idle+0x46/0x5c
[<ffffffff8100eb60>] ? cpu_idle+0xca/0x101
[<ffffffff8140bc7d>] ? cpu_bringup_and_idle+0x8a/0x8f
Code: e8 fc 96 fc ff 90 41 f6 44 24 20 01 74 08 41 83 64 24 20 fe eb 11 49 8d 7c 24 38 48 c7 c6 90 dd 05 81 e8 6f 92 01 00 4d 8b 24 24 <49> 8b 04 24 49 81 fc e0 e6 5f 81 0f 18 08 0f 85 11 ff ff ff 5b
RIP [<ffffffff8105de9a>] generic_smp_call_function_interrupt+0xfb/0x118
RSP <ffff88007f653e98>
CR2: 0000000000000000
Kernel panic - not syncing: Fatal exception in interrupt
They're pretty rare - this system did a kernbench run on a 16 vcpu
system with no problems, then oopsed this way when I left it idle overnight.
One interesting data point is that I've been experimenting with more
virtualization-friendly spinlock algorithms. If I replace ticket locks
with the old lock-byte algorithm, I see this much more frequently (and a
spin-and-block algorithm generally doesn't get through boot). I wonder
if there's a race which is masked by ticket locks' strict FIFO
algorithm? (But this particular oops was with completely standard
ticketlocks in place.)
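For readers unfamiliar with the two lock styles mentioned above, here is a minimal user-space sketch in C11 atomics. It is not the kernel's implementation: the type and function names (`byte_lock`, `ticket_lock`, and their acquire/release helpers) are invented for illustration. It only shows why a ticket lock hands the lock over in strict arrival order while a lock-byte lock lets any spinning CPU win.

```c
#include <stdatomic.h>

/*
 * Lock-byte style (illustrative): whichever waiter's exchange lands first
 * after a release takes the lock, so there is no ordering among waiters.
 */
struct byte_lock {
	atomic_uchar locked;
};

static void byte_lock_acquire(struct byte_lock *l)
{
	while (atomic_exchange_explicit(&l->locked, 1, memory_order_acquire))
		; /* spin until the previous value was 0 */
}

static void byte_lock_release(struct byte_lock *l)
{
	atomic_store_explicit(&l->locked, 0, memory_order_release);
}

/*
 * Ticket style (illustrative): each waiter takes a ticket and spins until
 * "owner" reaches it, so waiters are served in strict FIFO order.
 */
struct ticket_lock {
	atomic_uint next;	/* next ticket to hand out */
	atomic_uint owner;	/* ticket currently allowed to hold the lock */
};

static void ticket_lock_acquire(struct ticket_lock *l)
{
	unsigned int me = atomic_fetch_add_explicit(&l->next, 1,
						    memory_order_relaxed);

	while (atomic_load_explicit(&l->owner, memory_order_acquire) != me)
		; /* spin until it is our turn */
}

static void ticket_lock_release(struct ticket_lock *l)
{
	atomic_fetch_add_explicit(&l->owner, 1, memory_order_release);
}
```

With the byte lock, the order in which contending CPUs get in depends purely on timing, which is one way a latent race could surface more often than under FIFO hand-off.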
I've been running your old generic IPI patches for a while with no
problems; this seems to be specific to the version in tip.git. I
haven't looked to see what differences there are yet.
I've also only observed problems under Xen, but I haven't done much
testing on real hardware.
Thanks,
J
* [PATCH] generic ipi function calls: wait on alloc failure fallback
2008-07-06 14:50 Generic callfunction IPI problems Jeremy Fitzhardinge
@ 2008-07-06 16:03 ` Jeremy Fitzhardinge
2008-07-06 17:21 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 8+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-06 16:03 UTC (permalink / raw)
To: Jens Axboe; +Cc: Ingo Molnar, Linux Kernel Mailing List
When a GFP_ATOMIC allocation fails, smp_call_function_mask falls back
to allocating the data on the stack and converting it to a waiting call.
Make sure we actually wait in this case.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
kernel/smp.c | 1 +
1 file changed, 1 insertion(+)
===================================================================
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -312,6 +312,7 @@ int smp_call_function_mask(cpumask_t mas
 	if (!data) {
 		data = &d;
 		data->csd.flags = CSD_FLAG_WAIT;
+		wait = 1;
 	}
 
 	spin_lock_init(&data->lock);
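The effect of the one-line change can be modelled outside the kernel. The sketch below is a simplified stand-in, not the code in kernel/smp.c: `struct call_data`, `choose_call_data()`, and the flag values are assumptions made for illustration, and the `from_heap` argument simulates the result of the GFP_ATOMIC allocation (NULL meaning it failed). It shows that whenever the on-stack fallback is chosen, the caller's `wait` must be forced on, because the stack slot is only valid until the caller returns.

```c
#include <stddef.h>

/* Illustrative flag values; only their meaning matters here. */
#define CSD_FLAG_WAIT	0x01
#define CSD_FLAG_ALLOC	0x02

/* Simplified stand-in for the per-call data. */
struct call_data {
	unsigned int flags;
};

/*
 * Pick the storage for a cross-CPU call.
 *   on_stack:  slot in the caller's stack frame (always available)
 *   from_heap: simulated allocation result, NULL if the allocation failed
 *   wait:      in/out, whether the caller must block until all CPUs finish
 */
static struct call_data *choose_call_data(struct call_data *on_stack,
					  struct call_data *from_heap,
					  int *wait)
{
	struct call_data *data = NULL;

	if (!*wait && from_heap) {
		data = from_heap;
		data->flags = CSD_FLAG_ALLOC;
	}

	if (!data) {
		/* Fallback: the data lives in the caller's stack frame... */
		data = on_stack;
		data->flags = CSD_FLAG_WAIT;
		/* ...so the caller must not return before the call is done. */
		*wait = 1;
	}

	return data;
}
```

Without the `*wait = 1` assignment, a caller that asked for a non-waiting call could return while other CPUs were still reading the stack slot.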
* Re: [PATCH] generic ipi function calls: wait on alloc failure fallback
2008-07-06 16:03 ` [PATCH] generic ipi function calls: wait on alloc failure fallback Jeremy Fitzhardinge
@ 2008-07-06 17:21 ` Jeremy Fitzhardinge
0 siblings, 0 replies; 8+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-06 17:21 UTC (permalink / raw)
To: Jens Axboe; +Cc: Ingo Molnar, Linux Kernel Mailing List
Jeremy Fitzhardinge wrote:
> When a GFP_ATOMIC allocation fails, smp_call_function_mask falls back
> to allocating the data on the stack and converting it to a waiting call.
>
> Make sure we actually wait in this case.
Unfortunately this doesn't solve my crash, though it may account for
some of them.
The oops I'm looking at at the moment is a NULL pointer on ->next of the
rcu list in generic_smp_call_function_interrupt()...
J
* [PATCH] generic ipi function calls: wait on alloc failure fallback
@ 2008-07-15 20:22 Jeremy Fitzhardinge
2008-07-15 21:48 ` Ingo Molnar
0 siblings, 1 reply; 8+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-15 20:22 UTC (permalink / raw)
To: Jens Axboe, Ingo Molnar; +Cc: Linux Kernel Mailing List, Linus Torvalds
When a GFP_ATOMIC allocation fails, it falls back to allocating the
data on the stack and converting it to a waiting call.
Make sure we actually wait in this case.
Signed-off-by: Jeremy Fitzhardinge <jeremy.fitzhardinge@citrix.com>
---
kernel/smp.c | 1 +
1 file changed, 1 insertion(+)
===================================================================
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -312,6 +312,7 @@ int smp_call_function_mask(cpumask_t mas
 	if (!data) {
 		data = &d;
 		data->csd.flags = CSD_FLAG_WAIT;
+		wait = 1;
 	}
 
 	spin_lock_init(&data->lock);
* Re: [PATCH] generic ipi function calls: wait on alloc failure fallback
2008-07-15 20:22 Jeremy Fitzhardinge
@ 2008-07-15 21:48 ` Ingo Molnar
2008-07-15 22:01 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 8+ messages in thread
From: Ingo Molnar @ 2008-07-15 21:48 UTC (permalink / raw)
To: Jeremy Fitzhardinge; +Cc: Jens Axboe, Linux Kernel Mailing List, Linus Torvalds
* Jeremy Fitzhardinge <jeremy@goop.org> wrote:
> When a GFP_ATOMIC allocation fails, it falls back to allocating the
> data on the stack and converting it to a waiting call.
>
> Make sure we actually wait in this case.
cool, thanks!
does this explain the xen64 weirdnesses you've been seeing?
Ingo
* Re: [PATCH] generic ipi function calls: wait on alloc failure fallback
2008-07-15 21:48 ` Ingo Molnar
@ 2008-07-15 22:01 ` Jeremy Fitzhardinge
2008-07-18 22:19 ` Ingo Molnar
0 siblings, 1 reply; 8+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-15 22:01 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Jens Axboe, Linux Kernel Mailing List, Linus Torvalds
Ingo Molnar wrote:
> * Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>
>
>> When a GFP_ATOMIC allocation fails, it falls back to allocating the
>> data on the stack and converting it to a waiting call.
>>
>> Make sure we actually wait in this case.
>>
>
> cool, thanks!
>
> does this explain the xen64 weirdnesses you've been seeing?
>
No, but I haven't seen it lately. I think the other RCU fixes may have
helped. But it's all a bit of a worry: I didn't have a good theory
about what was going wrong, the RCU patches didn't look like they'd fix
the symptoms I was seeing.
I've seen it with 32 and 64-bit Xen, but there's nothing about the
problem which makes me think it's really Xen specific. If it were, I'd
expect to see failures all over the place, rather than just in this
one specific place.
I'm concerned there's a lurking bug, particularly if it's a generic race
or something that happens to be triggered when running under Xen because
of the timing changes. I've tried reproducing it in a hvm Xen domain
(so it's running the normal x86 kernel fully virtualized, but with the
Xen scheduler, etc). I didn't see a problem, but it isn't a very
convincing test one way or the other.
J
* Re: [PATCH] generic ipi function calls: wait on alloc failure fallback
2008-07-15 22:01 ` Jeremy Fitzhardinge
@ 2008-07-18 22:19 ` Ingo Molnar
2008-07-18 22:42 ` Jeremy Fitzhardinge
0 siblings, 1 reply; 8+ messages in thread
From: Ingo Molnar @ 2008-07-18 22:19 UTC (permalink / raw)
To: Jeremy Fitzhardinge; +Cc: Jens Axboe, Linux Kernel Mailing List, Linus Torvalds
* Jeremy Fitzhardinge <jeremy@goop.org> wrote:
>> does this explain the xen64 weirdnesses you've been seeing?
>>
>
> No, but I haven't seen it lately. I think the other RCU fixes may
> have helped. But it's all a bit of a worry: I didn't have a good
> theory about what was going wrong, the RCU patches didn't look like
> they'd fix the symptoms I was seeing.
>
> I've seen it with 32 and 64-bit Xen, but there's nothing about the
> problem which makes me think it's really Xen specific. If it were,
> I'd expect to see failures all over the place, rather than just in
> this one specific place.
>
> I'm concerned there's a lurking bug, particularly if it's a generic
> race or something that happens to be triggered when running under Xen
> because of the timing changes. I've tried reproducing it in a hvm Xen
> domain (so it's running the normal x86 kernel fully virtualized, but
> with the Xen scheduler, etc). I didn't see a problem, but it isn't a
> very convincing test one way or the other.
ok. I doubt there's much we can do at this stage - the code looks fine.
If it's some recently added core kernel problem sooner or later some
workload or hw will come about that shows it in a more debuggable
manner.
Ingo
* Re: [PATCH] generic ipi function calls: wait on alloc failure fallback
2008-07-18 22:19 ` Ingo Molnar
@ 2008-07-18 22:42 ` Jeremy Fitzhardinge
0 siblings, 0 replies; 8+ messages in thread
From: Jeremy Fitzhardinge @ 2008-07-18 22:42 UTC (permalink / raw)
To: Ingo Molnar; +Cc: Jens Axboe, Linux Kernel Mailing List, Linus Torvalds
Ingo Molnar wrote:
> ok. I doubt there's much we can do at this stage - the code looks fine.
> If it's some recently added core kernel problem sooner or later some
> workload or hw will come about that shows it in a more debuggable
> manner.
Yep. It has been rock solid for me lately, even doing traditionally
bug-inducing things like save/restore.
J