xen-devel.lists.xenproject.org archive mirror
 help / color / mirror / Atom feed
* BUG: unable to handle kernel NULL pointer dereference - xen_spin_lock_flags
@ 2012-02-13 22:41 Christopher S. Aker
  2012-02-14  0:54 ` Christopher S. Aker
  0 siblings, 1 reply; 7+ messages in thread
From: Christopher S. Aker @ 2012-02-13 22:41 UTC (permalink / raw)
  To: xen devel

Xen: xen-4.1-testing @ cs 23224 64 bit
Dom0: 3.2.5 PAE *and* 3.0.4 PAE
Hardware: both IGB and e1000e NICed machines

Network stress testing (iperf, UDP, bidirectional) the above stack 
reliably BUGs on both 3.0.4 and 3.2.5, and on both IGB or e1000e NICs 
with the following:

BUG: unable to handle kernel NULL pointer dereference at 00000474
IP: [<c100e967>] xen_spin_lock_flags+0x27/0x70
*pdpt = 00000000009f7027 *pde = 0000000000000000
Oops: 0002 [#1] SMP
Modules linked in: ebt_comment ebt_arp ebt_set ebt_limit ebt_ip6 ebt_ip 
ip_set_hash_net ip_set ebtable_nat xen_gntdev e1000e
Pid: 0, comm: swapper/1 Not tainted 3.2.5-2 #3 Supermicro X8DT6/X8DT6
EIP: 0061:[<c100e967>] EFLAGS: 00010006 CPU: 1
EIP is at xen_spin_lock_flags+0x27/0x70
EAX: 00000400 EBX: 00000474 ECX: 00000001 EDX: 00000001
ESI: 00000200 EDI: 00000400 EBP: e84aff08 ESP: e84afef8
DS: 007b ES: 007b FS: 00d8 GS: 0000 SS: 0069
Process swapper/1 (pid: 0, ti=e84ae000 task=e84589c0 task.ti=e8470000)
Stack:
00000001 00000274 e6058400 e605848c e84aff1c c16688ba 00000000 e6058400
e605848c e84aff40 c149f1a0 c100828b c1668904 c184f880 00000474 00010f66
f38b4000 e6058454 e84aff50 c149f23d 000005e8 ffffffff e84aff60 c149f2c5
Call Trace:
[<c16688ba>] _raw_spin_lock_irqsave+0x2a/0x40
[<c149f1a0>] xen_netbk_schedule_xenvif+0x80/0xf0
[<c100828b>] ? xen_restore_fl_direct_reloc+0x4/0x4
[<c1668904>] ? _raw_spin_unlock_irqrestore+0x14/0x20
[<c149f23d>] xen_netbk_check_rx_xenvif+0x2d/0x70
[<c149f2c5>] tx_credit_callback+0x45/0x50
[<c1059f92>] run_timer_softirq+0x112/0x2d0
[<c100828b>] ? xen_restore_fl_direct_reloc+0x4/0x4
[<c10a072d>] ? check_for_new_grace_period+0x9d/0xc0
[<c10a062e>] ? rcu_process_gp_end+0x3e/0xa0
[<c149f280>] ? xen_netbk_check_rx_xenvif+0x70/0x70
[<c1052586>] __do_softirq+0x96/0x1b0
[<c10524f0>] ? irq_enter+0x70/0x70
<IRQ>
[<c10523e5>] ? irq_exit+0x95/0xb0
[<c13b2070>] ? xen_evtchn_do_upcall+0x20/0x30
[<c166f647>] ? xen_do_upcall+0x7/0xc
[<c106007b>] ? do_prlimit+0x5b/0x1b0
[<c10013a7>] ? hypercall_page+0x3a7/0x1000
[<c1007a42>] ? xen_safe_halt+0x12/0x20
[<c1016fac>] ? default_idle+0x6c/0x170
[<c100ef4e>] ? cpu_idle+0x5e/0x90
[<c165db1f>] ? cpu_bringup_and_idle+0xd/0xf
Code: 00 00 00 00 55 b9 01 00 00 00 89 e5 83 ec 10 89 75 f8 89 d6 89 5d 
f4 81 e6 00 02 00 00 89 c3 89 7d fc bf 00 04 00 00 89 f8 89 ca <86> 13 
84 d2 74 0a f3 90 80 3b 00 74 f3 48 75 f6 84 d2 75 0d 8b
EIP: [<c100e967>] xen_spin_lock_flags+0x27/0x70 SS:ESP 0069:e84afef8
CR2: 0000000000000474
---[ end trace 76e317879747e85d ]---
Kernel panic - not syncing: Fatal exception in interrupt
Pid: 0, comm: swapper/1 Tainted: G D 3.2.5-2 #3
Call Trace:
[<c166624a>] panic+0x57/0x15f
[<c16699b5>] oops_end+0xc5/0xd0
[<c1032dbe>] no_context+0xbe/0x190
[<c1032f28>] __bad_area_nosemaphore+0x98/0x140
[<c1621cd0>] ? br_flood_deliver+0x20/0x20
[<f2dcd0b7>] ? ebt_nat_out+0x27/0x34 [ebtable_nat]
[<c1621cd0>] ? br_flood_deliver+0x20/0x20
[<c1032fe2>] bad_area_nosemaphore+0x12/0x20
[<c166bad2>] do_page_fault+0x2d2/0x3f0
[<c15896ca>] ? nf_hook_slow+0x5a/0x110
[<c1621d21>] ? br_dev_queue_push_xmit+0x51/0x70
[<c1621f77>] ? br_forward_finish+0x57/0x60
[<c1621cd0>] ? br_flood_deliver+0x20/0x20
[<c1621df3>] ? __br_forward+0xb3/0xd0
[<c166b800>] ? spurious_fault+0x130/0x130
[<c16691ee>] error_code+0x5a/0x60
[<c166b800>] ? spurious_fault+0x130/0x130
[<c100e967>] ? xen_spin_lock_flags+0x27/0x70
[<c16688ba>] _raw_spin_lock_irqsave+0x2a/0x40
[<c149f1a0>] xen_netbk_schedule_xenvif+0x80/0xf0
[<c100828b>] ? xen_restore_fl_direct_reloc+0x4/0x4
[<c1668904>] ? _raw_spin_unlock_irqrestore+0x14/0x20
[<c149f23d>] xen_netbk_check_rx_xenvif+0x2d/0x70
[<c149f2c5>] tx_credit_callback+0x45/0x50
[<c1059f92>] run_timer_softirq+0x112/0x2d0
[<c100828b>] ? xen_restore_fl_direct_reloc+0x4/0x4
[<c10a072d>] ? check_for_new_grace_period+0x9d/0xc0
[<c10a062e>] ? rcu_process_gp_end+0x3e/0xa0
[<c149f280>] ? xen_netbk_check_rx_xenvif+0x70/0x70
[<c1052586>] __do_softirq+0x96/0x1b0
[<c10524f0>] ? irq_enter+0x70/0x70
<IRQ> [<c10523e5>] ? irq_exit+0x95/0xb0
[<c13b2070>] ? xen_evtchn_do_upcall+0x20/0x30
[<c166f647>] ? xen_do_upcall+0x7/0xc
[<c106007b>] ? do_prlimit+0x5b/0x1b0
[<c10013a7>] ? hypercall_page+0x3a7/0x1000
[<c1007a42>] ? xen_safe_halt+0x12/0x20
[<c1016fac>] ? default_idle+0x6c/0x170
[<c100ef4e>] ? cpu_idle+0x5e/0x90
[<c165db1f>] ? cpu_bringup_and_idle+0xd/0xf
(XEN) Domain 0 crashed: rebooting machine in 5 seconds.

Should we be pointing the finger at the Hypervisor, or a dom0 issue?

Thanks,
-Chris

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: BUG: unable to handle kernel NULL pointer dereference - xen_spin_lock_flags
  2012-02-13 22:41 BUG: unable to handle kernel NULL pointer dereference - xen_spin_lock_flags Christopher S. Aker
@ 2012-02-14  0:54 ` Christopher S. Aker
  2012-02-14 10:09   ` Ian Campbell
  0 siblings, 1 reply; 7+ messages in thread
From: Christopher S. Aker @ 2012-02-14  0:54 UTC (permalink / raw)
  To: xen devel

On Feb 13, 2012, at 5:41 PM, Christopher S. Aker wrote:
> Network stress testing (iperf, UDP, bidirectional) the above stack reliably BUGs on both 3.0.4 and 3.2.5, and on both IGB or e1000e NICs with the following:
> 
> BUG: unable to handle kernel NULL pointer dereference at 00000474
> IP: [<c100e967>] xen_spin_lock_flags+0x27/0x70

This happens regardless of CONFIG_PARAVIRT_SPINLOCKS enabled or disabled.

-Chris

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: BUG: unable to handle kernel NULL pointer dereference - xen_spin_lock_flags
  2012-02-14  0:54 ` Christopher S. Aker
@ 2012-02-14 10:09   ` Ian Campbell
  2012-02-14 13:58     ` Andrew Cooper
  2012-02-14 15:28     ` Christopher S. Aker
  0 siblings, 2 replies; 7+ messages in thread
From: Ian Campbell @ 2012-02-14 10:09 UTC (permalink / raw)
  To: Christopher S. Aker; +Cc: xen devel

1On Tue, 2012-02-14 at 00:54 +0000, Christopher S. Aker wrote:
> On Feb 13, 2012, at 5:41 PM, Christopher S. Aker wrote:
> > Network stress testing (iperf, UDP, bidirectional) the above stack reliably BUGs on both 3.0.4 and 3.2.5, and on both IGB or e1000e NICs with the following:
> > 
> > BUG: unable to handle kernel NULL pointer dereference at 00000474
> > IP: [<c100e967>] xen_spin_lock_flags+0x27/0x70
> 
> This happens regardless of CONFIG_PARAVIRT_SPINLOCKS enabled or disabled.

I think that rules out the recent pv spinlock bug (fixed by
7a7546b377bdaa25ac77f33d9433c59f259b9688, in various stable trees
AFAIK).

What line of code does that IP correspond to within xen_spin_lock_flags?
Likewise the one in xen_netbk_schedule_xenvif from the stack.

I suspect this must be &netbk->net_schedule_list_lock but I don't see
how that can ever be NULL nor does the offset appear to be 0x474, at
least in my tree -- although that may depend on debug options.

Are you rebooting guests or plug/unplugging vifs while this is going on?
What about hotplugging CPUs (dom0 in particular)?

Does this happen as soon as the test starts or does it work for a bit
before failing?

Ian.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: BUG: unable to handle kernel NULL pointer dereference - xen_spin_lock_flags
  2012-02-14 10:09   ` Ian Campbell
@ 2012-02-14 13:58     ` Andrew Cooper
  2012-02-14 15:28     ` Christopher S. Aker
  1 sibling, 0 replies; 7+ messages in thread
From: Andrew Cooper @ 2012-02-14 13:58 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen devel

On 14/02/12 10:09, Ian Campbell wrote:
> 1On Tue, 2012-02-14 at 00:54 +0000, Christopher S. Aker wrote:
>> On Feb 13, 2012, at 5:41 PM, Christopher S. Aker wrote:
>>> Network stress testing (iperf, UDP, bidirectional) the above stack reliably BUGs on both 3.0.4 and 3.2.5, and on both IGB or e1000e NICs with the following:
>>>
>>> BUG: unable to handle kernel NULL pointer dereference at 00000474
>>> IP: [<c100e967>] xen_spin_lock_flags+0x27/0x70
>> This happens regardless of CONFIG_PARAVIRT_SPINLOCKS enabled or disabled.
> I think that rules out the recent pv spinlock bug (fixed by
> 7a7546b377bdaa25ac77f33d9433c59f259b9688, in various stable trees
> AFAIK).
>
> What line of code does that IP correspond to within xen_spin_lock_flags?
> Likewise the one in xen_netbk_schedule_xenvif from the stack.
>
> I suspect this must be &netbk->net_schedule_list_lock but I don't see
> how that can ever be NULL nor does the offset appear to be 0x474, at
> least in my tree -- although that may depend on debug options.
>
> Are you rebooting guests or plug/unplugging vifs while this is going on?
> What about hotplugging CPUs (dom0 in particular)?
>
> Does this happen as soon as the test starts or does it work for a bit
> before failing?
>
> Ian.

I dont know if this is related, but it looks very similar to a bug a
friend of mine encountered.  I tried to investigate but got nowhere.

Panic can be found: http://pastebin.com/ExCwhzpy

The panic looks as if it is on the same logical instruction.  (There is
a 32/64bit difference which would likely explain the out-by-one byte
reference for the dereference.)

The difference here is this bug is from the ext4 path, indicating that
it might be a spinlock problem rather than a network problem (of course,
assuming that this is infact the same bug)

-- 
Andrew Cooper - Dom0 Kernel Engineer, Citrix XenServer
T: +44 (0)1223 225 900, http://www.citrix.com

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: BUG: unable to handle kernel NULL pointer dereference - xen_spin_lock_flags
  2012-02-14 10:09   ` Ian Campbell
  2012-02-14 13:58     ` Andrew Cooper
@ 2012-02-14 15:28     ` Christopher S. Aker
  2012-02-14 15:37       ` Ian Campbell
  1 sibling, 1 reply; 7+ messages in thread
From: Christopher S. Aker @ 2012-02-14 15:28 UTC (permalink / raw)
  To: Ian Campbell; +Cc: xen devel

On 2/14/12 5:09 AM, Ian Campbell wrote:
> I suspect this must be&netbk->net_schedule_list_lock but I don't see
> how that can ever be NULL nor does the offset appear to be 0x474, at
> least in my tree -- although that may depend on debug options.
>
> Are you rebooting guests or plug/unplugging vifs while this is going on?

Sorry - I'm relaying some of this on behalf of another team member.

The dom0 insta-panics when running iperf and then bringing a vif down 
(for a network rules rebuild).  We've only been able to trigger it when 
the dom0 is dealing with a boatload of packets.  So, it smells like a 
race between the vif being marked down, and something thinking it can 
still deliver packets to it...

root@dom0# ip link set vif2.0 down (return = BOOM)
br0: port 4(vif2.0) entering forwarding state
BUG: unable to handle kernel NULL pointer dereference at 00000474
IP: [<c100e967>] xen_spin_lock_flags+0x27/0x70
...

-Chris

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: BUG: unable to handle kernel NULL pointer dereference - xen_spin_lock_flags
  2012-02-14 15:28     ` Christopher S. Aker
@ 2012-02-14 15:37       ` Ian Campbell
  2012-02-14 15:52         ` Jan Beulich
  0 siblings, 1 reply; 7+ messages in thread
From: Ian Campbell @ 2012-02-14 15:37 UTC (permalink / raw)
  To: Christopher S. Aker, Jan Beulich; +Cc: xen devel

On Tue, 2012-02-14 at 15:28 +0000, Christopher S. Aker wrote:
> On 2/14/12 5:09 AM, Ian Campbell wrote:
> > I suspect this must be&netbk->net_schedule_list_lock but I don't see
> > how that can ever be NULL nor does the offset appear to be 0x474, at
> > least in my tree -- although that may depend on debug options.
> >
> > Are you rebooting guests or plug/unplugging vifs while this is going on?
> 
> Sorry - I'm relaying some of this on behalf of another team member.
> 
> The dom0 insta-panics when running iperf and then bringing a vif down 
> (for a network rules rebuild).  We've only been able to trigger it when 
> the dom0 is dealing with a boatload of packets.  So, it smells like a 
> race between the vif being marked down, and something thinking it can 
> still deliver packets to it...
> 
> root@dom0# ip link set vif2.0 down (return = BOOM)
> br0: port 4(vif2.0) entering forwarding state
> BUG: unable to handle kernel NULL pointer dereference at 00000474
> IP: [<c100e967>] xen_spin_lock_flags+0x27/0x70
> ...

I wonder if this is related to the discussion in
http://lists.xen.org/archives/html/xen-devel/2011-09/msg00981.html
http://lists.xen.org/archives/html/xen-devel/2011-09/msg00985.html

Jan, do you recall what you did about that issue?

Is it as simple as moving the call to netif_stop_queue(dev) in
xenvif_close before the carrier check?

Ian.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: BUG: unable to handle kernel NULL pointer dereference - xen_spin_lock_flags
  2012-02-14 15:37       ` Ian Campbell
@ 2012-02-14 15:52         ` Jan Beulich
  0 siblings, 0 replies; 7+ messages in thread
From: Jan Beulich @ 2012-02-14 15:52 UTC (permalink / raw)
  To: Ian Campbell, Christopher S. Aker; +Cc: xen devel

>>> On 14.02.12 at 16:37, Ian Campbell <Ian.Campbell@citrix.com> wrote:
> On Tue, 2012-02-14 at 15:28 +0000, Christopher S. Aker wrote:
>> On 2/14/12 5:09 AM, Ian Campbell wrote:
>> > I suspect this must be&netbk->net_schedule_list_lock but I don't see
>> > how that can ever be NULL nor does the offset appear to be 0x474, at
>> > least in my tree -- although that may depend on debug options.
>> >
>> > Are you rebooting guests or plug/unplugging vifs while this is going on?
>> 
>> Sorry - I'm relaying some of this on behalf of another team member.
>> 
>> The dom0 insta-panics when running iperf and then bringing a vif down 
>> (for a network rules rebuild).  We've only been able to trigger it when 
>> the dom0 is dealing with a boatload of packets.  So, it smells like a 
>> race between the vif being marked down, and something thinking it can 
>> still deliver packets to it...
>> 
>> root@dom0# ip link set vif2.0 down (return = BOOM)
>> br0: port 4(vif2.0) entering forwarding state
>> BUG: unable to handle kernel NULL pointer dereference at 00000474
>> IP: [<c100e967>] xen_spin_lock_flags+0x27/0x70
>> ...
> 
> I wonder if this is related to the discussion in
> http://lists.xen.org/archives/html/xen-devel/2011-09/msg00981.html 
> http://lists.xen.org/archives/html/xen-devel/2011-09/msg00985.html 
> 
> Jan, do you recall what you did about that issue?

I did what I indicated we were asked at that time - add a safety
check to start_xmit() (which xen-netback appears to have too,
albeit in a slightly different form). Also, start_xmit() doesn't
appear on the call stack Christopher sent.

> Is it as simple as moving the call to netif_stop_queue(dev) in
> xenvif_close before the carrier check?

As said back then, it would appear to be more consistent, but
you're much more of an expert here than I am. We continue
to have netif_stop_queue() after the carrier check in our trees,
without anyone having reported a problem like this since we
did the above adjustment.

Jan

^ permalink raw reply	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2012-02-14 15:52 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2012-02-13 22:41 BUG: unable to handle kernel NULL pointer dereference - xen_spin_lock_flags Christopher S. Aker
2012-02-14  0:54 ` Christopher S. Aker
2012-02-14 10:09   ` Ian Campbell
2012-02-14 13:58     ` Andrew Cooper
2012-02-14 15:28     ` Christopher S. Aker
2012-02-14 15:37       ` Ian Campbell
2012-02-14 15:52         ` Jan Beulich

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).