netdev.vger.kernel.org archive mirror
* Re: Kernel Panic with bonding + IPoIB on 3.2.9
       [not found] ` <CAOzFzEiufg40gKBH6D7zeB47SebfPvgzqOLxhF5eQqpYd-r4zQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2012-03-18 20:21   ` Joseph Glanville
       [not found]     ` <CAOzFzEi=UOnwiV+qVks7+RnYU3PFbaQ+3OaEE3YFG2HHuD5ydQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  0 siblings, 1 reply; 5+ messages in thread
From: Joseph Glanville @ 2012-03-18 20:21 UTC (permalink / raw)
  To: linux-rdma, linux-kernel
  Cc: netdev

On 19 March 2012 06:41, Joseph Glanville
<joseph.glanville@orionvm.com.au> wrote:
> Hi guys,
>
> I am getting an annoying kernel panic on 3.2.9 that seems to be
> related to bonding (I can't reproduce the crash without it).
> I believe it might be related to LRO/GRO, but there isn't a param to
> disable it anymore that I could see in /ulp/ipoib/.
> Let me know if there is anything further I can do to help debug.
>
> Useful information:
>
> Hardware:
> Dell C2100 - Intel Xeon dual socket with 144GB RAM
> Mellanox ConnectX DDR using the in-kernel mlx4 driver
> Machine is also a Xen dom0
>
> ibstat
> CA 'mlx4_0'
>        CA type: MT26418
>        Number of ports: 2
>        Firmware version: 2.9.1000
>        Hardware version: a0
>        Node GUID: 0x0002c9030008d7be
>        System image GUID: 0x0002c9030008d7c1
>        Port 1:
>                State: Active
>                Physical state: LinkUp
>                Rate: 20
>                Base lid: 6
>                LMC: 0
>                SM lid: 1
>                Capability mask: 0x02590868
>                Port GUID: 0x0002c9030008d7bf
>                Link layer: InfiniBand
>        Port 2:
>                State: Active
>                Physical state: LinkUp
>                Rate: 20
>                Base lid: 9
>                LMC: 0
>                SM lid: 1
>                Capability mask: 0x02590868
>                Port GUID: 0x0002c9030008d7c0
>                Link layer: InfiniBand
>
>
> ip link show
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
>    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> 2: ib0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 65220 qdisc
> pfifo_fast master bond0 state UP qlen 256
>    link/infiniband
> 80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:08:d7:bf brd
> 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
> 3: ib1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 65220 qdisc
> pfifo_fast master bond0 state UP qlen 256
>    link/infiniband
> 80:00:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:08:d7:c0 brd
> 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
> 4: gre0: <NOARP> mtu 1476 qdisc noop state DOWN
>    link/gre 0.0.0.0 brd 0.0.0.0
> 5: sit0: <NOARP> mtu 1480 qdisc noop state DOWN
>    link/sit 0.0.0.0 brd 0.0.0.0
> 6: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 65220 qdisc
> noqueue state UP
>    link/infiniband
> 80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:08:d7:bf brd
> 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:f
>
> The KP itself:
> [  422.046837] ------------[ cut here ]------------
> [  422.047024] kernel BUG at net/core/dev.c:1896!
> [  422.047126] invalid opcode: 0000 [#1] SMP
> [  422.047289] CPU 1
> [  422.047328] Modules linked in: ib_srpt(O) scst_vdisk(O) scst(O)
> bonding raid1 raid0 md_mod dm_multipath
> [  422.047869]
> [  422.047962] Pid: 3352, comm: sshd Tainted: G           O
> 3.2.1-orion #4 Dell                   PowerEdge C2100       /0P19C9
> [  422.048237] RIP: e030:[<ffffffff81559b92>]  [<ffffffff81559b92>]
> skb_checksum_help+0x142/0x150
> [  422.048450] RSP: e02b:ffff88006cb11758  EFLAGS: 00010282
> [  422.048556] RAX: 0000000000000108 RBX: ffff880072f7f4e8 RCX: 0000000060004420
> [  422.048668] RDX: 0000000000000108 RSI: 0000000000000000 RDI: ffff880072f7f4e8
> [  422.048780] RBP: ffff88006cb11778 R08: ffff88000e53529c R09: 0000000000000104
> [  422.048892] R10: ffffffff8151a7d0 R11: 0000000000000000 R12: 00000000ffff0018
> [  422.049005] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [  422.049119] FS:  00007fea22aa8700(0000) GS:ffff8800bf435000(0000)
> knlGS:0000000000000000
> [  422.049288] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> [  422.049395] CR2: 00007fff14d07ed8 CR3: 00000000085dc000 CR4: 0000000000002660
> [  422.049506] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [  422.049618] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [  422.049742] Process sshd (pid: 3352, threadinfo ffff88006cb10000,
> task ffff88000cea4920)
> [  422.049909] Stack:
> [  422.050002]  ffff880072f7f4e8 ffff88000ce64000 0000000000000000
> 0000000000000000
> [  422.050290]  ffff88006cb117e8 ffffffff8155f15e ffff88006cb11858
> ffffffff8187ba80
> [  422.050580]  ffff88000ce5bc80 0000000000000000 0000000000000006
> 0000000000000000
> [  422.050869] Call Trace:
> [  422.050967]  [<ffffffff8155f15e>] dev_hard_start_xmit+0x36e/0x6c0
> [  422.051084]  [<ffffffff8157a19b>] sch_direct_xmit+0xdb/0x1e0
> [  422.051191]  [<ffffffff8155f638>] dev_queue_xmit+0x188/0x620
> [  422.051301]  [<ffffffffa003b297>] bond_dev_queue_xmit+0x27/0x70 [bonding]
> [  422.051413]  [<ffffffffa003b5e4>] bond_start_xmit+0x304/0x4e0 [bonding]
> [  422.051524]  [<ffffffff8155f099>] dev_hard_start_xmit+0x2a9/0x6c0
> [  422.051633]  [<ffffffff8155f895>] dev_queue_xmit+0x3e5/0x620
> [  422.051742]  [<ffffffff81567cbd>] neigh_connected_output+0xbd/0xf0
> [  422.051853]  [<ffffffff815a7120>] ? ip_fragment+0x850/0x850
> [  422.051960]  [<ffffffff815a72ae>] ip_finish_output+0x18e/0x300
> [  422.052068]  [<ffffffff815a7dd8>] ip_output+0x98/0xa0
> [  422.052172]  [<ffffffff815a74be>] ? __ip_local_out+0x9e/0xa0
> [  422.052279]  [<ffffffff815a74e4>] ip_local_out+0x24/0x30
> [  422.052385]  [<ffffffff815a764a>] ip_queue_xmit+0x15a/0x400
> [  422.052510]  [<ffffffff815bdade>] tcp_transmit_skb+0x3de/0x8f0
> [  422.052617]  [<ffffffff815be702>] tcp_write_xmit+0x1d2/0x9c0
> [  422.052725]  [<ffffffff81129057>] ? ksize+0x17/0xc0
> [  422.052829]  [<ffffffff815bef41>] __tcp_push_pending_frames+0x21/0x90
> [  422.052939]  [<ffffffff815b09ae>] tcp_sendmsg+0x75e/0xd80
> [  422.053047]  [<ffffffff815d4c0f>] inet_sendmsg+0x5f/0xb0
> [  422.053155]  [<ffffffff81009f3f>] ? xen_restore_fl_direct_reloc+0x4/0x4
> [  422.053267]  [<ffffffff8126734e>] ? selinux_socket_sendmsg+0x1e/0x20
> [  422.053377]  [<ffffffff8154732a>] sock_aio_write+0x15a/0x170
> [  422.053486]  [<ffffffff812652d1>] ? inode_has_perm.clone.15+0x21/0x30
> [  422.053597]  [<ffffffff8113133a>] do_sync_write+0xda/0x120
> [  422.053704]  [<ffffffff81268003>] ? selinux_file_permission+0xb3/0x140
> [  422.053821]  [<ffffffff812e8efa>] ? put_ldisc+0x5a/0xc0
> [  422.053937]  [<ffffffff81262237>] ? security_file_permission+0x27/0xb0
> [  422.054048]  [<ffffffff81131ca9>] vfs_write+0x169/0x180
> [  422.054153]  [<ffffffff81131f1c>] sys_write+0x4c/0x90
> [  422.054260]  [<ffffffff816891d2>] system_call_fastpath+0x16/0x1b
> [  422.054367] Code: 65 86 ff ff 85 c0 0f 84 75 ff ff ff eb a6 41 29
> d4 48 8b 83 d8 00 00 00 0f b7 53 72 45 8d 64 04 02 41 39 d4 77 cd e9
> 5d ff ff ff <0f> 0b 0f 0b 66 2e 0f 1f 84 00 00 00 00 00 55 b8 ea ff ff
> ff 48
> [  422.056691] RIP  [<ffffffff81559b92>] skb_checksum_help+0x142/0x150
> [  422.056831]  RSP <ffff88006cb11758>
> [  422.056930] ---[ end trace 751906f8ee2b0c91 ]---
> [  422.057032] Kernel panic - not syncing: Fatal exception in interrupt
> [  422.057141] Pid: 3352, comm: sshd Tainted: G      D    O 3.2.1-orion #4
> [  422.057250] Call Trace:
> [  422.057348]  [<ffffffff8167e944>] panic+0x8c/0x1a0
> [  422.057451]  [<ffffffff816825fa>] oops_end+0xea/0xf0
> [  422.057557]  [<ffffffff81016636>] die+0x56/0x90
> [  422.057660]  [<ffffffff81681f64>] do_trap+0xc4/0x170
> [  422.057764]  [<ffffffff81013e50>] do_invalid_op+0x90/0xb0
> [  422.057870]  [<ffffffff81559b92>] ? skb_checksum_help+0x142/0x150
> [  422.057989]  [<ffffffff8168b1ab>] invalid_op+0x1b/0x20
> [  422.058101]  [<ffffffff8151a7d0>] ? ipoib_setup+0x330/0x330
> [  422.058207]  [<ffffffff81559b92>] ? skb_checksum_help+0x142/0x150
> [  422.058316]  [<ffffffff8155f15e>] dev_hard_start_xmit+0x36e/0x6c0
> [  422.058425]  [<ffffffff8157a19b>] sch_direct_xmit+0xdb/0x1e0
> [  422.058533]  [<ffffffff8155f638>] dev_queue_xmit+0x188/0x620
> [  422.058641]  [<ffffffffa003b297>] bond_dev_queue_xmit+0x27/0x70 [bonding]
> [  422.058753]  [<ffffffffa003b5e4>] bond_start_xmit+0x304/0x4e0 [bonding]
> [  422.058864]  [<ffffffff8155f099>] dev_hard_start_xmit+0x2a9/0x6c0
> [  422.058973]  [<ffffffff8155f895>] dev_queue_xmit+0x3e5/0x620
> [  422.059080]  [<ffffffff81567cbd>] neigh_connected_output+0xbd/0xf0
> [  422.059190]  [<ffffffff815a7120>] ? ip_fragment+0x850/0x850
> [  422.059296]  [<ffffffff815a72ae>] ip_finish_output+0x18e/0x300
> [  422.059412]  [<ffffffff815a7dd8>] ip_output+0x98/0xa0
> [  422.059517]  [<ffffffff815a74be>] ? __ip_local_out+0x9e/0xa0
> [  422.059624]  [<ffffffff815a74e4>] ip_local_out+0x24/0x30
> [  422.059730]  [<ffffffff815a764a>] ip_queue_xmit+0x15a/0x400
> [  422.059836]  [<ffffffff815bdade>] tcp_transmit_skb+0x3de/0x8f0
> [  422.059944]  [<ffffffff815be702>] tcp_write_xmit+0x1d2/0x9c0
>
> --
> Founder | Director | VP Research
> Orion Virtualisation Solutions | www.orionvm.com.au | Phone: 1300 56
> 99 52 | Mobile: 0428 754 846

CC'ing netdev as that is probably the most appropriate now that I
think about it.

-- 
Founder | Director | VP Research
Orion Virtualisation Solutions | www.orionvm.com.au | Phone: 1300 56
99 52 | Mobile: 0428 754 846
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


* Re: Kernel Panic with bonding + IPoIB on 3.2.9
       [not found]     ` <CAOzFzEi=UOnwiV+qVks7+RnYU3PFbaQ+3OaEE3YFG2HHuD5ydQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
@ 2012-03-18 21:20       ` Joseph Glanville
  2012-03-19 19:05       ` Roland Dreier
  1 sibling, 0 replies; 5+ messages in thread
From: Joseph Glanville @ 2012-03-18 21:20 UTC (permalink / raw)
  To: linux-rdma, linux-kernel
  Cc: netdev

I have narrowed it down to the MTU: an MTU of up to 50k seems to work just
fine, but the max MTU of 65520 KPs the machine almost instantly.
Time to dig into bonding.c I guess. If any of the bonding devs could
shed any light on this, that would be awesome.

Joseph.



* Re: Kernel Panic with bonding + IPoIB on 3.2.9
       [not found]     ` <CAOzFzEi=UOnwiV+qVks7+RnYU3PFbaQ+3OaEE3YFG2HHuD5ydQ-JsoAwUIsXosN+BqQ9rBEUg@public.gmane.org>
  2012-03-18 21:20       ` Joseph Glanville
@ 2012-03-19 19:05       ` Roland Dreier
  2012-03-20  3:33         ` Joseph Glanville
  1 sibling, 1 reply; 5+ messages in thread
From: Roland Dreier @ 2012-03-19 19:05 UTC (permalink / raw)
  To: Joseph Glanville
  Cc: linux-rdma, linux-kernel, netdev

On Sun, Mar 18, 2012 at 1:21 PM, Joseph Glanville
<joseph.glanville@orionvm.com.au> wrote:
> [  422.047024] kernel BUG at net/core/dev.c:1896!

So this line is

        BUG_ON(offset >= skb_headlen(skb));

right?  No particular idea how we hit this, though...
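
For reference, a minimal userspace sketch of what that check compares,
assuming the 3.2-era definitions where the checksum start offset is
csum_start minus the headroom and skb_headlen() is len minus data_len.
The struct and the model_* names are hypothetical stand-ins, not kernel
code, and this only illustrates the condition; it does not explain how
the skb got into that state.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for the sk_buff fields involved (not the kernel struct). */
struct model_skb {
	size_t headroom;   /* bytes between buffer start and packet data */
	size_t len;        /* total packet length, linear plus paged */
	size_t data_len;   /* bytes held in paged fragments */
	size_t csum_start; /* checksum start, measured from buffer start */
};

/* Bytes in the linear area; models skb_headlen(). */
size_t model_headlen(const struct model_skb *skb)
{
	assert(skb->data_len <= skb->len);
	return skb->len - skb->data_len;
}

/* Checksum start relative to packet data; models skb_checksum_start_offset(). */
size_t model_csum_start_offset(const struct model_skb *skb)
{
	assert(skb->csum_start >= skb->headroom);
	return skb->csum_start - skb->headroom;
}

/* True when BUG_ON(offset >= skb_headlen(skb)) would fire. */
bool model_would_bug(const struct model_skb *skb)
{
	return model_csum_start_offset(skb) >= model_headlen(skb);
}
```

With 24 bytes of headers ahead of the checksum start, a packet whose
linear area holds only 10 bytes would trip the check, while a fully
linear 1500-byte packet would not.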


* Re: Kernel Panic with bonding + IPoIB on 3.2.9
  2012-03-19 19:05       ` Roland Dreier
@ 2012-03-20  3:33         ` Joseph Glanville
  2012-03-20  4:30           ` Jay Vosburgh
  0 siblings, 1 reply; 5+ messages in thread
From: Joseph Glanville @ 2012-03-20  3:33 UTC (permalink / raw)
  To: Roland Dreier; +Cc: linux-rdma, linux-kernel, netdev

On 20 March 2012 06:05, Roland Dreier <roland@purestorage.com> wrote:
> On Sun, Mar 18, 2012 at 1:21 PM, Joseph Glanville
> <joseph.glanville@orionvm.com.au> wrote:
>> [  422.047024] kernel BUG at net/core/dev.c:1896!
>
> So this line is
>
>        BUG_ON(offset >= skb_headlen(skb));
>
> right?  No particular idea how we hit this, though...

Yep... I have looked through most of drivers/net/bonding and I can't
really see why it should be blowing up there. If the MTU alone were the
cause, plain IPoIB (without bonding) should hit the same BUG_ON, yet I
have not experienced this.
The bonding code doesn't seem to do anything special with the MTU
other than propagating changes to the slaves.

-- 
Founder | Director | VP Research
Orion Virtualisation Solutions | www.orionvm.com.au | Phone: 1300 56
99 52 | Mobile: 0428 754 846


* Re: Kernel Panic with bonding + IPoIB on 3.2.9
  2012-03-20  3:33         ` Joseph Glanville
@ 2012-03-20  4:30           ` Jay Vosburgh
  0 siblings, 0 replies; 5+ messages in thread
From: Jay Vosburgh @ 2012-03-20  4:30 UTC (permalink / raw)
  To: Joseph Glanville; +Cc: Roland Dreier, linux-rdma, linux-kernel, netdev

Joseph Glanville <joseph.glanville@orionvm.com.au> wrote:

>On 20 March 2012 06:05, Roland Dreier <roland@purestorage.com> wrote:
>> On Sun, Mar 18, 2012 at 1:21 PM, Joseph Glanville
>> <joseph.glanville@orionvm.com.au> wrote:
>>> [  422.047024] kernel BUG at net/core/dev.c:1896!
>>
>> So this line is
>>
>>        BUG_ON(offset >= skb_headlen(skb));
>>
>> right?  No particular idea how we hit this, though...
>
>Yep... I have looked through most of drivers/net/bonding and I can't
>really see why it should be blowing up there. If the MTU alone were the
>cause, plain IPoIB (without bonding) should hit the same BUG_ON, yet I
>have not experienced this.
>The bonding code doesn't seem to do anything special with the MTU
>other than propagating changes to the slaves.

	For IPoIB, though, there is some extra initialization stuff in
bond_setup_by_slave(), and the hard_header_len will end up being set to
something different from the usual Ethernet value.

	In looking at ipoib_setup, I see that hard_header_len appears to
be set to 4 (IPOIB_ENCAP_LEN).  My recollection was that the IPoIB
hard_header_len was quite a bit larger than that; it looks like it
changed very recently from IPOIB_ENCAP_LEN + INFINIBAND_ALEN to what it
is now:

commit afd87adacb5de00768b2e54f0bd851278f2e6179
Author: Roland Dreier <roland@purestorage.com>
Date:   Tue Feb 7 14:51:21 2012 +0000

    IPoIB: Stop lying about hard_header_len and use skb->cb to stash LL addresses
    
    [ Upstream commit 936d7de3d736e0737542641269436f4b5968e9ef ]
    
    Commit a0417fa3a18a ("net: Make qdisc_skb_cb upper size bound
    explicit.") made it possible for a netdev driver to use skb->cb
    between its header_ops.create method and its .ndo_start_xmit
    method.  Use this in ipoib_hard_header() to stash away the LL address
    (GID + QPN), instead of the "ipoib_pseudoheader" hack.  This allows
    IPoIB to stop lying about its hard_header_len, which will let us fix
    the L2 check for GRO.


	I don't know if this change could be causing the problem (it
appears to be new in 3.2.9), but the hard_header_len is one of the few
areas in the TX path of bonding that IPoIB ends up being different from
regular Ethernet.
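
	To make the sizes concrete, a small sketch of the two IPoIB
hard_header_len values discussed above next to the Ethernet one.
IPOIB_ENCAP_LEN is 4 as noted; INFINIBAND_ALEN is assumed to be 20 and
ETH_HLEN 14, which should match the kernel headers.  The helper name and
its flag are hypothetical, for illustration only:

```c
#include <assert.h>

#define ETH_HLEN        14  /* usual Ethernet header length */
#define IPOIB_ENCAP_LEN  4  /* IPoIB encapsulation header */
#define INFINIBAND_ALEN 20  /* IPoIB link-layer address length (assumed) */

/* The value IPoIB advertised before the commit quoted above. */
static_assert(IPOIB_ENCAP_LEN + INFINIBAND_ALEN == 24, "old IPoIB value");

/*
 * Hypothetical helper: returns the hard_header_len IPoIB would advertise,
 * with or without the old pseudoheader that carried the LL address.
 */
unsigned int model_ipoib_hard_header_len(int with_pseudoheader)
{
	if (with_pseudoheader)
		return IPOIB_ENCAP_LEN + INFINIBAND_ALEN;
	return IPOIB_ENCAP_LEN;
}
```

	So a bond that picks up its header length from an IPoIB slave
would see 24 before the change and 4 after it, versus 14 for Ethernet.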

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
