* Re: Kernel Panic with bonding + IPoIB on 3.2.9
From: Joseph Glanville @ 2012-03-18 20:21 UTC
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA

On 19 March 2012 06:41, Joseph Glanville
<joseph.glanville-2MxvZkOi9dvvnOemgxGiVw@public.gmane.org> wrote:
> Hi guys,
>
> I am getting an annoying kernel panic on 3.2.9 that seems to be
> related to bonding (as I can't reproduce the crash without it)
> I believe it might be related to LRO/GRO but there isn't a param to
> disable it anymore that I could see in /ulp/ipoib/
> Let me know if there is anything further I can do to help debug.
>
> Useful information:
>
> Hardware:
> Dell C2100 - Intel Xeon dual socket with 144GB RAM
> Mellanox Connect-X DDR using in kernel mlx4 driver
> Machine is also a Xen dom0
>
> ibstat
> CA 'mlx4_0'
>     CA type: MT26418
>     Number of ports: 2
>     Firmware version: 2.9.1000
>     Hardware version: a0
>     Node GUID: 0x0002c9030008d7be
>     System image GUID: 0x0002c9030008d7c1
>     Port 1:
>         State: Active
>         Physical state: LinkUp
>         Rate: 20
>         Base lid: 6
>         LMC: 0
>         SM lid: 1
>         Capability mask: 0x02590868
>         Port GUID: 0x0002c9030008d7bf
>         Link layer: InfiniBand
>     Port 2:
>         State: Active
>         Physical state: LinkUp
>         Rate: 20
>         Base lid: 9
>         LMC: 0
>         SM lid: 1
>         Capability mask: 0x02590868
>         Port GUID: 0x0002c9030008d7c0
>         Link layer: InfiniBand
>
>
> ip link show
> 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
>     link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
> 2: ib0: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 65220 qdisc pfifo_fast master bond0 state UP qlen 256
>     link/infiniband 80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:08:d7:bf brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
> 3: ib1: <BROADCAST,MULTICAST,SLAVE,UP,LOWER_UP> mtu 65220 qdisc pfifo_fast master bond0 state UP qlen 256
>     link/infiniband 80:00:00:49:fe:80:00:00:00:00:00:00:00:02:c9:03:00:08:d7:c0 brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:ff
> 4: gre0: <NOARP> mtu 1476 qdisc noop state DOWN
>     link/gre 0.0.0.0 brd 0.0.0.0
> 5: sit0: <NOARP> mtu 1480 qdisc noop state DOWN
>     link/sit 0.0.0.0 brd 0.0.0.0
> 6: bond0: <BROADCAST,MULTICAST,MASTER,UP,LOWER_UP> mtu 65220 qdisc noqueue state UP
>     link/infiniband 80:00:00:48:fe:80:00:00:00:00:00:00:00:02:c9:03:00:08:d7:bf brd 00:ff:ff:ff:ff:12:40:1b:ff:ff:00:00:00:00:00:00:ff:ff:ff:f
>
> The KP itself:
> [ 422.046837] ------------[ cut here ]------------
> [ 422.047024] kernel BUG at net/core/dev.c:1896!
> [ 422.047126] invalid opcode: 0000 [#1] SMP
> [ 422.047289] CPU 1
> [ 422.047328] Modules linked in: ib_srpt(O) scst_vdisk(O) scst(O) bonding raid1 raid0 md_mod dm_multipath
> [ 422.047869]
> [ 422.047962] Pid: 3352, comm: sshd Tainted: G O 3.2.1-orion #4 Dell PowerEdge C2100 /0P19C9
> [ 422.048237] RIP: e030:[<ffffffff81559b92>] [<ffffffff81559b92>] skb_checksum_help+0x142/0x150
> [ 422.048450] RSP: e02b:ffff88006cb11758 EFLAGS: 00010282
> [ 422.048556] RAX: 0000000000000108 RBX: ffff880072f7f4e8 RCX: 0000000060004420
> [ 422.048668] RDX: 0000000000000108 RSI: 0000000000000000 RDI: ffff880072f7f4e8
> [ 422.048780] RBP: ffff88006cb11778 R08: ffff88000e53529c R09: 0000000000000104
> [ 422.048892] R10: ffffffff8151a7d0 R11: 0000000000000000 R12: 00000000ffff0018
> [ 422.049005] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
> [ 422.049119] FS: 00007fea22aa8700(0000) GS:ffff8800bf435000(0000) knlGS:0000000000000000
> [ 422.049288] CS: e033 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 422.049395] CR2: 00007fff14d07ed8 CR3: 00000000085dc000 CR4: 0000000000002660
> [ 422.049506] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 422.049618] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 422.049742] Process sshd (pid: 3352, threadinfo ffff88006cb10000, task ffff88000cea4920)
> [ 422.049909] Stack:
> [ 422.050002] ffff880072f7f4e8 ffff88000ce64000 0000000000000000 0000000000000000
> [ 422.050290] ffff88006cb117e8 ffffffff8155f15e ffff88006cb11858 ffffffff8187ba80
> [ 422.050580] ffff88000ce5bc80 0000000000000000 0000000000000006 0000000000000000
> [ 422.050869] Call Trace:
> [ 422.050967] [<ffffffff8155f15e>] dev_hard_start_xmit+0x36e/0x6c0
> [ 422.051084] [<ffffffff8157a19b>] sch_direct_xmit+0xdb/0x1e0
> [ 422.051191] [<ffffffff8155f638>] dev_queue_xmit+0x188/0x620
> [ 422.051301] [<ffffffffa003b297>] bond_dev_queue_xmit+0x27/0x70 [bonding]
> [ 422.051413] [<ffffffffa003b5e4>] bond_start_xmit+0x304/0x4e0 [bonding]
> [ 422.051524] [<ffffffff8155f099>] dev_hard_start_xmit+0x2a9/0x6c0
> [ 422.051633] [<ffffffff8155f895>] dev_queue_xmit+0x3e5/0x620
> [ 422.051742] [<ffffffff81567cbd>] neigh_connected_output+0xbd/0xf0
> [ 422.051853] [<ffffffff815a7120>] ? ip_fragment+0x850/0x850
> [ 422.051960] [<ffffffff815a72ae>] ip_finish_output+0x18e/0x300
> [ 422.052068] [<ffffffff815a7dd8>] ip_output+0x98/0xa0
> [ 422.052172] [<ffffffff815a74be>] ? __ip_local_out+0x9e/0xa0
> [ 422.052279] [<ffffffff815a74e4>] ip_local_out+0x24/0x30
> [ 422.052385] [<ffffffff815a764a>] ip_queue_xmit+0x15a/0x400
> [ 422.052510] [<ffffffff815bdade>] tcp_transmit_skb+0x3de/0x8f0
> [ 422.052617] [<ffffffff815be702>] tcp_write_xmit+0x1d2/0x9c0
> [ 422.052725] [<ffffffff81129057>] ? ksize+0x17/0xc0
> [ 422.052829] [<ffffffff815bef41>] __tcp_push_pending_frames+0x21/0x90
> [ 422.052939] [<ffffffff815b09ae>] tcp_sendmsg+0x75e/0xd80
> [ 422.053047] [<ffffffff815d4c0f>] inet_sendmsg+0x5f/0xb0
> [ 422.053155] [<ffffffff81009f3f>] ? xen_restore_fl_direct_reloc+0x4/0x4
> [ 422.053267] [<ffffffff8126734e>] ? selinux_socket_sendmsg+0x1e/0x20
> [ 422.053377] [<ffffffff8154732a>] sock_aio_write+0x15a/0x170
> [ 422.053486] [<ffffffff812652d1>] ? inode_has_perm.clone.15+0x21/0x30
> [ 422.053597] [<ffffffff8113133a>] do_sync_write+0xda/0x120
> [ 422.053704] [<ffffffff81268003>] ? selinux_file_permission+0xb3/0x140
> [ 422.053821] [<ffffffff812e8efa>] ? put_ldisc+0x5a/0xc0
> [ 422.053937] [<ffffffff81262237>] ? security_file_permission+0x27/0xb0
> [ 422.054048] [<ffffffff81131ca9>] vfs_write+0x169/0x180
> [ 422.054153] [<ffffffff81131f1c>] sys_write+0x4c/0x90
> [ 422.054260] [<ffffffff816891d2>] system_call_fastpath+0x16/0x1b
> [ 422.054367] Code: 65 86 ff ff 85 c0 0f 84 75 ff ff ff eb a6 41 29 d4 48 8b 83 d8 00 00 00 0f b7 53 72 45 8d 64 04 02 41 39 d4 77 cd e9 5d ff ff ff <0f> 0b 0f 0b 66 2e 0f 1f 84 00 00 00 00 00 55 b8 ea ff ff ff 48
> [ 422.056691] RIP [<ffffffff81559b92>] skb_checksum_help+0x142/0x150
> [ 422.056831] RSP <ffff88006cb11758>
> [ 422.056930] ---[ end trace 751906f8ee2b0c91 ]---
> [ 422.057032] Kernel panic - not syncing: Fatal exception in interrupt
> [ 422.057141] Pid: 3352, comm: sshd Tainted: G D O 3.2.1-orion #4
> [ 422.057250] Call Trace:
> [ 422.057348] [<ffffffff8167e944>] panic+0x8c/0x1a0
> [ 422.057451] [<ffffffff816825fa>] oops_end+0xea/0xf0
> [ 422.057557] [<ffffffff81016636>] die+0x56/0x90
> [ 422.057660] [<ffffffff81681f64>] do_trap+0xc4/0x170
> [ 422.057764] [<ffffffff81013e50>] do_invalid_op+0x90/0xb0
> [ 422.057870] [<ffffffff81559b92>] ? skb_checksum_help+0x142/0x150
> [ 422.057989] [<ffffffff8168b1ab>] invalid_op+0x1b/0x20
> [ 422.058101] [<ffffffff8151a7d0>] ? ipoib_setup+0x330/0x330
> [ 422.058207] [<ffffffff81559b92>] ? skb_checksum_help+0x142/0x150
> [ 422.058316] [<ffffffff8155f15e>] dev_hard_start_xmit+0x36e/0x6c0
> [ 422.058425] [<ffffffff8157a19b>] sch_direct_xmit+0xdb/0x1e0
> [ 422.058533] [<ffffffff8155f638>] dev_queue_xmit+0x188/0x620
> [ 422.058641] [<ffffffffa003b297>] bond_dev_queue_xmit+0x27/0x70 [bonding]
> [ 422.058753] [<ffffffffa003b5e4>] bond_start_xmit+0x304/0x4e0 [bonding]
> [ 422.058864] [<ffffffff8155f099>] dev_hard_start_xmit+0x2a9/0x6c0
> [ 422.058973] [<ffffffff8155f895>] dev_queue_xmit+0x3e5/0x620
> [ 422.059080] [<ffffffff81567cbd>] neigh_connected_output+0xbd/0xf0
> [ 422.059190] [<ffffffff815a7120>] ? ip_fragment+0x850/0x850
> [ 422.059296] [<ffffffff815a72ae>] ip_finish_output+0x18e/0x300
> [ 422.059412] [<ffffffff815a7dd8>] ip_output+0x98/0xa0
> [ 422.059517] [<ffffffff815a74be>] ? __ip_local_out+0x9e/0xa0
> [ 422.059624] [<ffffffff815a74e4>] ip_local_out+0x24/0x30
> [ 422.059730] [<ffffffff815a764a>] ip_queue_xmit+0x15a/0x400
> [ 422.059836] [<ffffffff815bdade>] tcp_transmit_skb+0x3de/0x8f0
> [ 422.059944] [<ffffffff815be702>] tcp_write_xmit+0x1d2/0x9c0
>
> --
> Founder | Director | VP Research
> Orion Virtualisation Solutions | www.orionvm.com.au | Phone: 1300 56
> 99 52 | Mobile: 0428 754 846

CC'ing netdev as that is probably the most appropriate now that I think about it.

--
Founder | Director | VP Research
Orion Virtualisation Solutions | www.orionvm.com.au | Phone: 1300 56 99 52 | Mobile: 0428 754 846
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majordomo-u79uwXL29TY76Z2rM5mHXA@public.gmane.org
More majordomo info at http://vger.kernel.org/majordomo-info.html

* Re: Kernel Panic with bonding + IPoIB on 3.2.9
From: Joseph Glanville @ 2012-03-18 21:20 UTC
  To: linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA
  Cc: netdev-u79uwXL29TY76Z2rM5mHXA

On 19 March 2012 07:21, Joseph Glanville
<joseph.glanville-2MxvZkOi9dvvnOemgxGiVw@public.gmane.org> wrote:
> CC'ing netdev as that is probably the most appropriate now that I
> think about it.

I have narrowed it down to the MTU: an MTU up to 50k seems to work just
fine, but the max MTU of 65520 basically instantly KPs the machine.

Time to dig into bonding.c I guess; if any of the bonding devs could
shed any light on this, that would be awesome.

Joseph.

--
Founder | Director | VP Research
Orion Virtualisation Solutions | www.orionvm.com.au | Phone: 1300 56 99 52 | Mobile: 0428 754 846

* Re: Kernel Panic with bonding + IPoIB on 3.2.9
From: Roland Dreier @ 2012-03-19 19:05 UTC
  To: Joseph Glanville
  Cc: linux-rdma-u79uwXL29TY76Z2rM5mHXA, linux-kernel-u79uwXL29TY76Z2rM5mHXA, netdev-u79uwXL29TY76Z2rM5mHXA

On Sun, Mar 18, 2012 at 1:21 PM, Joseph Glanville
<joseph.glanville-2MxvZkOi9dvvnOemgxGiVw@public.gmane.org> wrote:
> [ 422.047024] kernel BUG at net/core/dev.c:1896!

So this line is

        BUG_ON(offset >= skb_headlen(skb));

right? No particular idea how we hit this, though...
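
For readers following the BUG_ON that Roland points at, here is a minimal user-space sketch of what that condition compares. It is not the kernel code: `struct mock_skb` and the `mock_*` helpers are hypothetical stand-ins invented for illustration, and it assumes that `offset` is the checksum start measured from the packet data and that `skb_headlen()` is the total length minus the paged length. Under those assumptions, the check fires whenever the checksum start falls outside the linear part of the buffer.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical, heavily reduced stand-in for struct sk_buff. Only the
 * fields that feed the modelled check are kept; this is not the kernel
 * layout. */
struct mock_skb {
    unsigned int len;        /* total bytes: linear area plus paged data */
    unsigned int data_len;   /* bytes held in paged fragments only */
    unsigned int headroom;   /* bytes between buffer start and packet data */
    unsigned int csum_start; /* checksum start, measured from buffer start */
};

/* Models skb_headlen(): bytes available in the linear area. */
static unsigned int mock_headlen(const struct mock_skb *skb)
{
    assert(skb->data_len <= skb->len);
    return skb->len - skb->data_len;
}

/* Models the checksum start offset relative to the packet data. */
static int mock_csum_start_offset(const struct mock_skb *skb)
{
    return (int)skb->csum_start - (int)skb->headroom;
}

/* True when the modelled BUG_ON(offset >= skb_headlen(skb)) would fire.
 * The cast reproduces C's int-versus-unsigned comparison: a negative
 * offset wraps to a huge unsigned value and also trips the check. */
static bool would_hit_bug(const struct mock_skb *skb)
{
    int offset = mock_csum_start_offset(skb);
    return (unsigned int)offset >= mock_headlen(skb);
}
```

In this model, a fully linear 1500-byte packet with the checksum starting 20 bytes in passes, while a packet whose linear area is shorter than the checksum start offset trips the check. Nothing here says which of those cases the crashing skb was in.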

* Re: Kernel Panic with bonding + IPoIB on 3.2.9
From: Joseph Glanville @ 2012-03-20 3:33 UTC
  To: Roland Dreier; +Cc: linux-rdma, linux-kernel, netdev

On 20 March 2012 06:05, Roland Dreier <roland@purestorage.com> wrote:
> On Sun, Mar 18, 2012 at 1:21 PM, Joseph Glanville
> <joseph.glanville@orionvm.com.au> wrote:
>> [ 422.047024] kernel BUG at net/core/dev.c:1896!
>
> So this line is
>
>         BUG_ON(offset >= skb_headlen(skb));
>
> right? No particular idea how we hit this, though...

Yep... I have looked through most of /drivers/net/bonding and I can't
really see why it should be blowing up there.. it really should cause
the BUG_ON under normal IPoIB if the MTU was the cause - yet I have
not experienced this.
The bonding code doesn't seem to do anything special with the MTU
other than propagating changes to the slaves.

--
Founder | Director | VP Research
Orion Virtualisation Solutions | www.orionvm.com.au | Phone: 1300 56 99 52 | Mobile: 0428 754 846

* Re: Kernel Panic with bonding + IPoIB on 3.2.9
From: Jay Vosburgh @ 2012-03-20 4:30 UTC
  To: Joseph Glanville; +Cc: Roland Dreier, linux-rdma, linux-kernel, netdev

Joseph Glanville <joseph.glanville@orionvm.com.au> wrote:

>On 20 March 2012 06:05, Roland Dreier <roland@purestorage.com> wrote:
>> On Sun, Mar 18, 2012 at 1:21 PM, Joseph Glanville
>> <joseph.glanville@orionvm.com.au> wrote:
>>> [ 422.047024] kernel BUG at net/core/dev.c:1896!
>>
>> So this line is
>>
>>         BUG_ON(offset >= skb_headlen(skb));
>>
>> right? No particular idea how we hit this, though...
>
>Yep... I have looked through most of /drivers/net/bonding and I can't
>really see why it should be blowing up there.. it really should cause
>the BUG_ON under normal IPoIB if the MTU was the cause - yet I have
>not experienced this.
>The bonding code doesn't seem to do anything special with the MTU
>other than propagating changes to the slaves.

For IPoIB, though, there is some extra initialization stuff in
bond_setup_by_slave(), and the hard_header_len will end up being set
to something different from the usual Ethernet value.

In looking at ipoib_setup, I see that hard_header_len appears to be
set to 4 (IPOIB_ENCAP_LEN). My recollection was that the IPoIB
hard_header_len was quite a bit larger than that; it looks like it
changed very recently from IPOIB_ENCAP_LEN + INFINIBAND_ALEN to what
it is now:

commit afd87adacb5de00768b2e54f0bd851278f2e6179
Author: Roland Dreier <roland@purestorage.com>
Date:   Tue Feb 7 14:51:21 2012 +0000

    IPoIB: Stop lying about hard_header_len and use skb->cb to stash LL addresses

    [ Upstream commit 936d7de3d736e0737542641269436f4b5968e9ef ]

    Commit a0417fa3a18a ("net: Make qdisc_skb_cb upper size bound
    explicit.") made it possible for a netdev driver to use skb->cb
    between its header_ops.create method and its .ndo_start_xmit
    method. Use this in ipoib_hard_header() to stash away the LL address
    (GID + QPN), instead of the "ipoib_pseudoheader" hack. This allows
    IPoIB to stop lying about its hard_header_len, which will let us fix
    the L2 check for GRO.

I don't know if this change could be causing the problem (it appears
to be new in 3.2.9), but the hard_header_len is one of the few areas
in the TX path of bonding that IPoIB ends up being different from
regular Ethernet.

	-J

---
	-Jay Vosburgh, IBM Linux Technology Center, fubar@us.ibm.com
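
To put numbers on the hard_header_len change Jay describes, here is a small user-space sketch. It is only an illustration under stated assumptions: `struct mock_netdev` and `mock_setup_by_slave()` are hypothetical names, the value 20 for INFINIBAND_ALEN and 14 for ETH_HLEN are taken from the kernel headers as best recalled rather than from this thread, and the sketch assumes that bond_setup_by_slave() simply copies the slave's header length and address length onto the bond device. It does not claim that this difference is the cause of the panic.

```c
#include <assert.h>
#include <string.h>

#define ETH_HLEN         14  /* usual Ethernet header length */
#define IPOIB_ENCAP_LEN   4  /* IPoIB encapsulation header, per the thread */
#define INFINIBAND_ALEN  20  /* IPoIB link-layer address length (assumed) */
#define MOCK_MAX_ADDR    32  /* hypothetical bound for the address buffers */

/* Hypothetical reduced stand-in for struct net_device. */
struct mock_netdev {
    unsigned short hard_header_len;
    unsigned char addr_len;
    unsigned char broadcast[MOCK_MAX_ADDR];
};

/* Header length IPoIB advertised before the quoted commit. */
static unsigned short ipoib_hard_header_len_old(void)
{
    return IPOIB_ENCAP_LEN + INFINIBAND_ALEN;
}

/* Header length IPoIB advertises after the quoted commit. */
static unsigned short ipoib_hard_header_len_new(void)
{
    return IPOIB_ENCAP_LEN;
}

/* Assumed behaviour of bond_setup_by_slave(): the bond inherits the
 * slave's link-layer parameters instead of keeping Ethernet defaults. */
static void mock_setup_by_slave(struct mock_netdev *bond,
                                const struct mock_netdev *slave)
{
    assert(slave->addr_len <= MOCK_MAX_ADDR);
    bond->hard_header_len = slave->hard_header_len;
    bond->addr_len = slave->addr_len;
    memcpy(bond->broadcast, slave->broadcast, slave->addr_len);
}
```

Under these assumptions the advertised header length drops from 24 bytes to 4 bytes across the commit, and a bond enslaving an IPoIB device would carry 4 rather than the Ethernet value of 14.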