* Kernel Panic every 2 weeks on ISP server (NULL pointer dereference)
@ 2011-10-23 1:18 Luciano Ruete
  2011-10-23 5:16 ` Eric Dumazet
  0 siblings, 1 reply; 5+ messages in thread
From: Luciano Ruete @ 2011-10-23 1:18 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: Text/Plain, Size: 1072 bytes --]

Hi,

I'm the sysadmin at a 3500-customer ISP, which runs an iptables+tc solution
for load balancing and QoS.

Every 2 or 3 weeks the server panics with a "NULL pointer dereference" and
with IP at "dev_queue_xmit".

It is curious that if I disable MSI on the network card driver these panics
seem to disappear. Does this ring a bell?

The server is an IBM, previously with Broadcom NetXtreme II BCM5709 NICs and
now with Intel 82576. I changed the NICs thinking that maybe the bug was in
the Broadcom driver, but it seems to affect MSI in general.

The tc+iptables rules are auto-generated with sequreisp[1], an ISP solution
that I wrote and that is open-sourced under AGPLv3.

Tell me if you need any further information, and please CC me because I'm not
subscribed.

root@server:~# uname -a
Linux server 2.6.35-30-server #60~lucid1-Ubuntu SMP Tue Sep 20 22:28:40 UTC 2011 x86_64 GNU/Linux

[1] https://github.com/sequre/sequreisp

-- 
Luciano Ruete
Sequre - Sys Admin
Mitre 617, piso 7, of.
1 +54 261 4254894 Mendoza - Argentina http://www.sequre.com.ar/ http://www.sequreisp.com/ [-- Attachment #2: kern.log.txt --] [-- Type: text/plain, Size: 12769 bytes --] BUG: unable to handle kernel NULL pointer dereference at (null) [694244.692704] IP: [<ffffffff814b48ea>] dev_queue_xmit+0xaa/0x5b0 [694244.763424] PGD 16f369067 PUD 16f368067 PMD 0 [694244.817577] Oops: 0000 [#1] SMP [694244.857160] last sysfs file: /sys/devices/system/cpu/cpu7/cache/index2/shared_cpu_map [694244.951740] CPU 3 [694244.974623] Modules linked in: xt_mac ppp_deflate zlib_deflate bsd_comp ppp_async crc_ccitt nf_conntrack_netlink nfnetlink xt_owner ipt_REJECT ipt_REDIRECT ipt_MASQUERADE xt_helper xt_length xt_TCPMSS xt_mark xt_connmark xt_state xt_tcpudp xt_multiport iptable_mangle iptable_nat iptable_filter ip_tables x_tables sch_sfq act_mirred cls_u32 sch_prio cls_fw sch_htb ifb dummy 8021q garp stp nf_nat_irc nf_conntrack_irc nf_nat_sip nf_conntrack_sip nf_nat_pptp nf_conntrack_pptp nf_conntrack_proto_gre nf_nat_proto_gre nf_nat_amanda ts_kmp nf_conntrack_amanda nf_nat_ftp nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_ftp nf_conntrack cdc_ether i7core_edac usbnet serio_raw edac_core tpm_tis tpm ioatdma tpm_bios lp shpchp parport mii raid10 raid456 async_pq async_xor xor async_memcpy async_r aid6_recov megaraid_sas raid6_pq async_tx raid1 raid0 multipath igb dca usbhid hid linear [694245.905128] [694245.923881] Pid: 30, comm: events/3 Not tainted 2.6.35-30-server #60~lucid1-Ubuntu 69Y5698 /System x3650 M3 -[7945AC1]- [694246.057920] RIP: 0010:[<ffffffff814b48ea>] [<ffffffff814b48ea>] dev_queue_xmit+0xaa/0x5b0 [694246.157723] RSP: 0018:ffff880001e63960 EFLAGS: 00010202 [694246.222176] RAX: 0000000000002000 RBX: ffff880145b6f400 RCX: 000000009fe9dec3 [694246.308451] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffff88017bd47130 [694246.394725] RBP: ffff880001e639a0 R08: ffff880145b6f400 R09: ffff88017bd47130 [694246.480998] R10: 0000000000000000 R11: 0000000000000003 R12: 
0000000000000000 [694246.567265] R13: ffff880118128000 R14: ffff88015c39d300 R15: ffff880001e63b00 [694246.653534] FS: 0000000000000000(0000) GS:ffff880001e60000(0000) knlGS:0000000000000000 [694246.751226] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [694246.820861] CR2: 0000000000000000 CR3: 0000000250400000 CR4: 00000000000006e0 [694246.907128] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [694246.993394] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [694247.079668] Process events/3 (pid: 30, threadinfo ffff880276efe000, task ffff880276ed44d0) [694247.179433] Stack: [694247.204410] 0000000000000000 ffff880118128000 ffff880001e639e0 ffff880145b6f400 [694247.291793] <0> 0000000000000000 ffff88015ce2f780 ffff880145b6f400 ffff880001e63b00 [694247.384466] <0> ffff880001e639e0 ffffffff814e9347 ffff880001e63a20 ffff880145b6f400 [694247.479257] Call Trace: [694247.509424] <IRQ> [694247.535481] [<ffffffff814e9347>] ip_finish_output+0x237/0x310 [694247.606165] [<ffffffff814e9738>] ip_output+0xb8/0xc0 [694247.667503] [<ffffffff814e75d3>] ? __ip_local_out+0xa3/0xb0 [694247.736114] [<ffffffff814e84c9>] ip_local_out+0x29/0x30 [694247.800566] [<ffffffff814e8ce1>] ip_queue_xmit+0x191/0x3f0 [694247.868129] [<ffffffff814fe484>] tcp_transmit_skb+0x3f4/0x700 [694247.938811] [<ffffffff815002fd>] tcp_send_ack+0xdd/0x130 [694248.004302] [<ffffffff814fc823>] tcp_rcv_synsent_state_process+0x5a3/0x5b0 [694248.088493] [<ffffffff815046bf>] ? tcp_v4_inbound_md5_hash+0x7f/0x210 [694248.167486] [<ffffffff814fcf8d>] tcp_rcv_state_process+0x7d/0x4e0 [694248.242332] [<ffffffff815048f3>] tcp_v4_do_rcv+0xa3/0x1c0 [694248.308864] [<ffffffff81505ab9>] tcp_v4_rcv+0x5a9/0x830 [694248.373314] [<ffffffff814e36a0>] ? ip_local_deliver_finish+0x0/0x290 [694248.451265] [<ffffffff814db384>] ? nf_hook_slow+0x74/0x100 [694248.518830] [<ffffffff814e36a0>] ? 
ip_local_deliver_finish+0x0/0x290 [694248.596781] [<ffffffff814e377d>] ip_local_deliver_finish+0xdd/0x290 [694248.673696] [<ffffffff814e39b0>] ip_local_deliver+0x80/0x90 [694248.742300] [<ffffffff814e2f29>] ip_rcv_finish+0x119/0x410 [694248.809870] [<ffffffff814e35cd>] ip_rcv+0x23d/0x310 [694248.870167] [<ffffffff814af233>] __netif_receive_skb+0x383/0x5c0 [694248.943960] [<ffffffff814af57b>] process_backlog+0x10b/0x210 [694249.013603] [<ffffffff814b04af>] net_rx_action+0x10f/0x2a0 [694249.081175] [<ffffffff8106862d>] __do_softirq+0xbd/0x200 [694249.146672] [<ffffffff810ca950>] ? handle_IRQ_event+0x50/0x160 [694249.218394] [<ffffffff81068695>] ? __do_softirq+0x125/0x200 [694249.287006] [<ffffffff8100afdc>] call_softirq+0x1c/0x30 [694249.351458] [<ffffffff8100cab5>] do_softirq+0x65/0xa0 [694249.413833] [<ffffffff810684e5>] irq_exit+0x85/0x90 [694249.474131] [<ffffffff815aac85>] do_IRQ+0x75/0xf0 [694249.532352] [<ffffffff815a3853>] ret_from_intr+0x0/0x11 [694249.596797] <EOI> [694249.622854] [<ffffffff815a3319>] ? _raw_spin_unlock_irqrestore+0x19/0x30 [694249.704961] [<ffffffffa0107826>] ppp_asynctty_receive+0x86/0x100 [ppp_async] [694249.791233] [<ffffffff81360816>] flush_to_ldisc+0x1a6/0x1e0 [694249.859834] [<ffffffff81360670>] ? flush_to_ldisc+0x0/0x1e0 [694249.928442] [<ffffffff8107b2a5>] run_workqueue+0xc5/0x1a0 [694249.994969] [<ffffffff8107b423>] worker_thread+0xa3/0x110 [694250.061499] [<ffffffff810800d0>] ? autoremove_wake_function+0x0/0x40 [694250.139451] [<ffffffff8107b380>] ? worker_thread+0x0/0x110 [694250.207014] [<ffffffff8107fb56>] kthread+0x96/0xa0 [694250.266274] [<ffffffff8100aee4>] kernel_thread_helper+0x4/0x10 [694250.338002] [<ffffffff8107fac0>] ? kthread+0x0/0xa0 [694250.398296] [<ffffffff8100aee0>] ? 
kernel_thread_helper+0x0/0x10 [694250.472081] Code: f6 49 c1 e6 07 66 89 93 ac 00 00 00 4d 03 b5 40 03 00 00 0f b7 83 a6 00 00 00 4d 8b 66 08 80 e4 cf 80 cc 20 66 89 83 a6 00 00 00 <49> 83 3c 24 00 0f 84 3b 02 00 00 49 8d 84 24 9c 00 00 00 48 89 [694250.700622] RIP [<ffffffff814b48ea>] dev_queue_xmit+0xaa/0x5b0 [694250.772367] RSP <ffff880001e63960> [694250.814999] CR2: 0000000000000000 [694250.855923] ---[ end trace 0c85e47af955446e ]--- [694250.912113] Kernel panic - not syncing: Fatal exception in interrupt [694250.989074] Pid: 30, comm: events/3 Tainted: G D 2.6.35-30-server #60~lucid1-Ubuntu [694251.090974] Call Trace: [694251.121208] <IRQ> [<ffffffff815a0597>] panic+0x90/0x113 [694251.154109] ------------[ cut here ]------------ [694251.154118] WARNING: at /build/buildd/linux-lts-backport-maverick-2.6.35/net/sched/sch_generic.c:258 dev_watchdog+0x25f/0x270() [694251.154121] Hardware name: System x3650 M3 -[7945AC1]- [694251.154123] NETDEV WATCHDOG: eth0 (igb): transmit queue 0 timed out [694251.154124] Modules linked in: xt_mac ppp_deflate zlib_deflate bsd_comp ppp_async crc_ccitt nf_conntrack_netlink nfnetlink xt_owner ipt_REJECT ipt_REDIRECT ipt_MASQUERADE xt_helper xt_length xt_TCPMSS xt_mark xt_connmark xt_state xt_tcpudp xt_multiport iptable_mangle iptable_nat iptable_filter ip_tables x_tables sch_sfq act_mirred cls_u32 sch_prio cls_fw sch_htb ifb dummy 8021q garp stp nf_nat_irc nf_conntrack_irc nf_nat_sip nf_conntrack_sip nf_nat_pptp nf_conntrack_pptp nf_conntrack_proto_gre nf_nat_proto_gre nf_nat_amanda ts_kmp nf_conntrack_amanda nf_nat_ftp nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_ftp nf_conntrack cdc_ether i7core_edac usbnet serio_raw edac_core tpm_tis tpm ioatdma tpm_bios lp shpchp parport mii raid10 raid456 async_pq async_xor xor async_memcpy async_r aid6_recov megaraid_sas raid6_pq async_tx raid1 raid0 multipath igb dca usbhid hid linear [694251.154174] Pid: 0, comm: swapper Tainted: G D 2.6.35-30-server #60~lucid1-Ubuntu 
[694251.154176] Call Trace: [694251.154178] <IRQ> [<ffffffff8106159f>] warn_slowpath_common+0x7f/0xc0 [694251.154187] [<ffffffff81061696>] warn_slowpath_fmt+0x46/0x50 [694251.154190] [<ffffffff814cd81f>] dev_watchdog+0x25f/0x270 [694251.154200] [<ffffffffa01adc49>] ? destroy_conntrack+0xa9/0xe0 [nf_conntrack] [694251.154204] [<ffffffff814db1a7>] ? nf_conntrack_destroy+0x17/0x30 [694251.154211] [<ffffffffa01ad264>] ? death_by_timeout+0xd4/0x140 [nf_conntrack] [694251.154214] [<ffffffff814cd5c0>] ? dev_watchdog+0x0/0x270 [694251.154217] [<ffffffff814cd5c0>] ? dev_watchdog+0x0/0x270 [694251.154221] [<ffffffff81070172>] call_timer_fn+0x42/0x120 [694251.154226] [<ffffffff8105553b>] ? scheduler_tick+0x1db/0x300 [694251.154229] [<ffffffff814cd5c0>] ? dev_watchdog+0x0/0x270 [694251.154232] [<ffffffff81071734>] run_timer_softirq+0x154/0x270 [694251.154236] [<ffffffff8108a683>] ? ktime_get+0x63/0xe0 [694251.154239] [<ffffffff8106862d>] __do_softirq+0xbd/0x200 [694251.154243] [<ffffffff8108fa1a>] ? tick_program_event+0x2a/0x30 [694251.154247] [<ffffffff8100afdc>] call_softirq+0x1c/0x30 [694251.154250] [<ffffffff8100cab5>] do_softirq+0x65/0xa0 [694251.154253] [<ffffffff810684e5>] irq_exit+0x85/0x90 [694251.154258] [<ffffffff815aad70>] smp_apic_timer_interrupt+0x70/0x9b [694251.154261] [<ffffffff8100aa93>] apic_timer_interrupt+0x13/0x20 [694251.154263] <EOI> [<ffffffff8130ad54>] ? intel_idle+0xe4/0x180 [694251.154271] [<ffffffff8130ad37>] ? intel_idle+0xc7/0x180 [694251.154277] [<ffffffff81488062>] cpuidle_idle_call+0x92/0x140 [694251.154281] [<ffffffff81008d93>] cpu_idle+0xb3/0x110 [694251.154285] [<ffffffff8159b226>] start_secondary+0x100/0x102 [694251.154288] ---[ end trace 0c85e47af955446f ]--- [694251.154389] igb 0000:17:00.0: eth0: Reset adapter [694251.273081] igb 0000:18:00.0: eth2: Reset adapter [694254.499174] [<ffffffff815a485a>] oops_end+0xea/0xf0 [694254.559522] [<ffffffff8103e45c>] no_context+0xfc/0x190 [694254.622984] [<ffffffffa0070155>] ? 
nfnetlink_has_listeners+0x15/0x20 [nfnetlink] [694254.713470] [<ffffffff8103e615>] __bad_area_nosemaphore+0x125/0x1e0 [694254.790433] [<ffffffff8103e6e3>] bad_area_nosemaphore+0x13/0x20 [694254.863244] [<ffffffff815a711f>] do_page_fault+0x28f/0x350 [694254.930864] [<ffffffff815a3b35>] page_fault+0x25/0x30 [694254.993287] [<ffffffff814b48ea>] ? dev_queue_xmit+0xaa/0x5b0 [694255.062987] [<ffffffff814e9347>] ip_finish_output+0x237/0x310 [694255.133728] [<ffffffff814e9738>] ip_output+0xb8/0xc0 [694255.195123] [<ffffffff814e75d3>] ? __ip_local_out+0xa3/0xb0 [694255.263784] [<ffffffff814e84c9>] ip_local_out+0x29/0x30 [694255.328283] [<ffffffff814e8ce1>] ip_queue_xmit+0x191/0x3f0 [694255.395910] [<ffffffff814fe484>] tcp_transmit_skb+0x3f4/0x700 [694255.466647] [<ffffffff815002fd>] tcp_send_ack+0xdd/0x130 [694255.532185] [<ffffffff814fc823>] tcp_rcv_synsent_state_process+0x5a3/0x5b0 [694255.616423] [<ffffffff815046bf>] ? tcp_v4_inbound_md5_hash+0x7f/0x210 [694255.695489] [<ffffffff814fcf8d>] tcp_rcv_state_process+0x7d/0x4e0 [694255.770377] [<ffffffff815048f3>] tcp_v4_do_rcv+0xa3/0x1c0 [694255.838650] [<ffffffff81505ab9>] tcp_v4_rcv+0x5a9/0x830 [694255.903158] [<ffffffff814e36a0>] ? ip_local_deliver_finish+0x0/0x290 [694255.981164] [<ffffffff814db384>] ? nf_hook_slow+0x74/0x100 [694256.048778] [<ffffffff814e36a0>] ? ip_local_deliver_finish+0x0/0x290 [694256.126783] [<ffffffff814e377d>] ip_local_deliver_finish+0xdd/0x290 [694256.203750] [<ffffffff814e39b0>] ip_local_deliver+0x80/0x90 [694256.272413] [<ffffffff814e2f29>] ip_rcv_finish+0x119/0x410 [694256.340028] [<ffffffff814e35cd>] ip_rcv+0x23d/0x310 [694256.400385] [<ffffffff814af233>] __netif_receive_skb+0x383/0x5c0 [694256.474233] [<ffffffff814af57b>] process_backlog+0x10b/0x210 [694256.543933] [<ffffffff814b04af>] net_rx_action+0x10f/0x2a0 [694256.611549] [<ffffffff8106862d>] __do_softirq+0xbd/0x200 [694256.677096] [<ffffffff810ca950>] ? handle_IRQ_event+0x50/0x160 [694256.748871] [<ffffffff81068695>] ? 
__do_softirq+0x125/0x200 [694256.817527] [<ffffffff8100afdc>] call_softirq+0x1c/0x30 [694256.882030] [<ffffffff8100cab5>] do_softirq+0x65/0xa0 [694256.944458] [<ffffffff810684e5>] irq_exit+0x85/0x90 [694257.004807] [<ffffffff815aac85>] do_IRQ+0x75/0xf0 [694257.063079] [<ffffffff815a3853>] ret_from_intr+0x0/0x11 [694257.127578] <EOI> [<ffffffff815a3319>] ? _raw_spin_unlock_irqrestore+0x19/0x30 [694257.217120] [<ffffffffa0107826>] ppp_asynctty_receive+0x86/0x100 [ppp_async] [694257.303447] [<ffffffff81360816>] flush_to_ldisc+0x1a6/0x1e0 [694257.372104] [<ffffffff81360670>] ? flush_to_ldisc+0x0/0x1e0 [694257.440768] [<ffffffff8107b2a5>] run_workqueue+0xc5/0x1a0 [694257.507355] [<ffffffff8107b423>] worker_thread+0xa3/0x110 [694257.573940] [<ffffffff810800d0>] ? autoremove_wake_function+0x0/0x40 [694257.651961] [<ffffffff8107b380>] ? worker_thread+0x0/0x110 [694257.719582] [<ffffffff8107fb56>] kthread+0x96/0xa0 [694257.778897] [<ffffffff8100aee4>] kernel_thread_helper+0x4/0x10 [694257.850666] [<ffffffff8107fac0>] ? kthread+0x0/0xa0 [694257.911018] [<ffffffff8100aee0>] ? kernel_thread_helper+0x0/0x10 [694257.984951] Rebooting in 1 seconds..[ 0.000000] Initializing cgroup subsys cpuset ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Kernel Panic every 2 weeks on ISP server (NULL pointer dereference)
  2011-10-23 1:18 Kernel Panic every 2 weeks on ISP server (NULL pointer dereference) Luciano Ruete
@ 2011-10-23 5:16 ` Eric Dumazet
  2011-10-24 18:09   ` Luciano Ruete
  0 siblings, 1 reply; 5+ messages in thread
From: Eric Dumazet @ 2011-10-23 5:16 UTC (permalink / raw)
  To: Luciano Ruete; +Cc: netdev

On Saturday, 22 October 2011 at 22:18 -0300, Luciano Ruete wrote:
> Hi,
>
> I'm the sysadmin at a 3500-customer ISP, which runs an iptables+tc solution
> for load balancing and QoS.
>
> Every 2 or 3 weeks the server panics with a "NULL pointer dereference" and
> with IP at "dev_queue_xmit".
>
> It is curious that if I disable MSI on the network card driver these panics
> seem to disappear. Does this ring a bell?
>
> The server is an IBM, previously with Broadcom NetXtreme II BCM5709 NICs and
> now with Intel 82576. I changed the NICs thinking that maybe the bug was in
> the Broadcom driver, but it seems to affect MSI in general.
>
> The tc+iptables rules are auto-generated with sequreisp[1], an ISP solution
> that I wrote and that is open-sourced under AGPLv3.
>
> Tell me if you need any further information, and please CC me because I'm not
> subscribed.
>
> root@server:~# uname -a
> Linux server 2.6.35-30-server #60~lucid1-Ubuntu SMP Tue Sep 20 22:28:40 UTC
> 2011 x86_64 GNU/Linux
>
> [1] https://github.com/sequre/sequreisp
>

Hi Luciano

[694250.472081] Code: f6
49 c1 e6 07              shl    $0x7,%r14
66 89 93 ac 00 00 00     mov    %dx,0xac(%rbx)
4d 03 b5 40 03 00 00     add    0x340(%r13),%r14
        txq = dev_pick_tx(dev, skb);
0f b7 83 a6 00 00 00     movzwl 0xa6(%rbx),%eax
4d 8b 66 08              mov    0x8(%r14),%r12
        q = rcu_dereference_bh(txq->qdisc);
80 e4 cf                 and    $0xcf,%ah
80 cc 20                 or     $0x20,%ah
66 89 83 a6 00 00 00     mov    %ax,0xa6(%rbx)
        skb->tc_verd = SET_TC_AT(skb->tc_verd, AT_EGRESS);
<49> 83 3c 24 00         cmpq   $0x0,(%r12)
        if (q->enqueue)      CRASH because q is NULL.
0f 84 3b 02 00 00        je     ...
        rc = __dev_xmit_skb(skb, q, dev, txq);
49 8d 84 24 9c 00 00 00  lea    0x9c(%r12),%rax
48 89

This looks like a dev_pick_tx() bug, using an out-of-bounds
queue_index number and returning a txq pointing past
the device's allocated array.

With recent kernels, this cannot happen anymore because
we added fixes in this area.

You could try the Ubuntu 11.10 kernel (based on Linux 3.0)
on your server, or apply the following patch:

commit df32cc193ad88f7b1326b90af799c927b27f7654
Author: Tom Herbert <therbert@google.com>
Date:   Mon Nov 1 12:55:52 2010 -0700

    net: check queue_index from sock is valid for device

    In dev_pick_tx recompute the queue index if the value stored in the
    socket is greater than or equal to the number of real queues for the
    device.  The saved index in the sock structure is not guaranteed to
    be appropriate for the egress device (this could happen on a route
    change or in presence of tunnelling).  The result of the queue index
    being bad would be to return a bogus queue (crash could presumably
    follow).

    Signed-off-by: Tom Herbert <therbert@google.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

diff --git a/net/core/dev.c b/net/core/dev.c
index 35dfb83..0dd54a6 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -2131,7 +2131,7 @@ static struct netdev_queue *dev_pick_tx(struct net_device *dev,
 	} else {
 		struct sock *sk = skb->sk;
 		queue_index = sk_tx_queue_get(sk);
-		if (queue_index < 0) {
+		if (queue_index < 0 || queue_index >= dev->real_num_tx_queues) {

 			queue_index = 0;
 			if (dev->real_num_tx_queues > 1)

^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: Kernel Panic every 2 weeks on ISP server (NULL pointer dereference)
  2011-10-23 5:16 ` Eric Dumazet
@ 2011-10-24 18:09   ` Luciano Ruete
  2011-10-24 18:21     ` Eric Dumazet
  2011-11-07 13:11     ` Luciano Ruete
  0 siblings, 2 replies; 5+ messages in thread
From: Luciano Ruete @ 2011-10-24 18:09 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Sunday, October 23, 2011 02:16:29 am Eric Dumazet wrote:
> On Saturday, 22 October 2011 at 22:18 -0300, Luciano Ruete wrote:
> > Hi,
> >
> > I'm the sysadmin at a 3500-customer ISP, which runs an iptables+tc
> > solution for load balancing and QoS.
> >
> > Every 2 or 3 weeks the server panics with a "NULL pointer dereference"
> > and with IP at "dev_queue_xmit".
> >
> > It is curious that if I disable MSI on the network card driver these
> > panics seem to disappear. Does this ring a bell?
> >
> > The server is an IBM, previously with Broadcom NetXtreme II BCM5709 NICs
> > and now with Intel 82576. I changed the NICs thinking that maybe the bug
> > was in the Broadcom driver, but it seems to affect MSI in general.
> >
> > The tc+iptables rules are auto-generated with sequreisp[1], an ISP
> > solution that I wrote and that is open-sourced under AGPLv3.
> >
> > Tell me if you need any further information, and please CC me because
> > I'm not subscribed.
> >
> > root@server:~# uname -a
> > Linux server 2.6.35-30-server #60~lucid1-Ubuntu SMP Tue Sep 20 22:28:40
> > UTC 2011 x86_64 GNU/Linux
> >
> > [1] https://github.com/sequre/sequreisp
>
> Hi Luciano

Hi Eric!

Thanks for your answer...

> [694250.472081] Code: f6
> 49 c1 e6 07              shl    $0x7,%r14
> 66 89 93 ac 00 00 00     mov    %dx,0xac(%rbx)
> [...]
> This looks like a dev_pick_tx() bug, using an out-of-bounds
> queue_index number and returning a txq pointing past
> the device's allocated array.

Clear explanation. Is there a tool to map the trace to kernel code, or did you
do this by hand?

> With recent kernels, this cannot happen anymore because
> we added fixes in this area.
>
> You could try the Ubuntu 11.10 kernel (based on Linux 3.0)
> on your server, or apply the following patch:
>
> commit df32cc193ad88f7b1326b90af799c927b27f7654
> Author: Tom Herbert <therbert@google.com>
> Date:   Mon Nov 1 12:55:52 2010 -0700
>
>     net: check queue_index from sock is valid for device
>
>     In dev_pick_tx recompute the queue index if the value stored in the
>     socket is greater than or equal to the number of real queues for the
>     device.  The saved index in the sock structure is not guaranteed to
>     be appropriate for the egress device (this could happen on a route
>     change or in presence of tunnelling).  The result of the queue index
>     being bad would be to return a bogus queue (crash could presumably
>     follow).

There are a lot of route changes on this server: it has 30 upstream providers
(15 are dynamic-IP ADSL lines) load balanced using VLANs and a VLAN switch.

Thanks again, I will try the kernel upgrade and post the results in this thread.

Regards!

-- 
Luciano Ruete
Sequre - Sys Admin
Mitre 617, piso 7, of. 1
+54 261 4254894
Mendoza - Argentina
http://www.sequreisp.com/
http://www.sequre.com.ar/

^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Kernel Panic every 2 weeks on ISP server (NULL pointer dereference)
  2011-10-24 18:09   ` Luciano Ruete
@ 2011-10-24 18:21     ` Eric Dumazet
  2011-11-07 13:11     ` Luciano Ruete
  1 sibling, 0 replies; 5+ messages in thread
From: Eric Dumazet @ 2011-10-24 18:21 UTC (permalink / raw)
  To: Luciano Ruete; +Cc: netdev

On Monday, 24 October 2011 at 15:09 -0300, Luciano Ruete wrote:
> Hi Eric!
>
> Thanks for your answer...
>
> > [694250.472081] Code: f6
> > 49 c1 e6 07              shl    $0x7,%r14
> > 66 89 93 ac 00 00 00     mov    %dx,0xac(%rbx)
> > [...]
> > This looks like a dev_pick_tx() bug, using an out-of-bounds
> > queue_index number and returning a txq pointing past
> > the device's allocated array.
>
> Clear explanation. Is there a tool to map the trace to kernel code, or did
> you do this by hand?
>

In the kernel source, you can find scripts/decodecode

# cat CRASH | scripts/decodecode
[694250.472081] Code: f6 49 c1 e6 07 66 89 93 ac 00 00 00 4d 03 b5 40 03 00 00 0f b7 83 a6 00 00 00 4d 8b 66 08 80 e4 cf 80 cc 20 66 89 83 a6 00 00 00 <49> 83 3c 24 00 0f 84 3b 02 00 00 49 8d 84 24 9c 00 00 00 48 89
All code
========
   0:   f6                      (bad)
   1:   49 c1 e6 07             shl    $0x7,%r14
   5:   66 89 93 ac 00 00 00    mov    %dx,0xac(%rbx)
   c:   4d 03 b5 40 03 00 00    add    0x340(%r13),%r14
  13:   0f b7 83 a6 00 00 00    movzwl 0xa6(%rbx),%eax
  1a:   4d 8b 66 08             mov    0x8(%r14),%r12
  1e:   80 e4 cf                and    $0xcf,%ah
  21:   80 cc 20                or     $0x20,%ah
  24:   66 89 83 a6 00 00 00    mov    %ax,0xa6(%rbx)
  2b:*  49 83 3c 24 00          cmpq   $0x0,(%r12)     <-- trapping instruction
  30:   0f 84 3b 02 00 00       je     0x271
  36:   49 8d 84 24 9c 00 00    lea    0x9c(%r12),%rax
  3d:   00
  3e:   48                      rex.W
  3f:   89                      .byte 0x89

Code starting with the faulting instruction
===========================================
   0:   49 83 3c 24 00          cmpq   $0x0,(%r12)
   5:   0f 84 3b 02 00 00       je     0x246
   b:   49 8d 84 24 9c 00 00    lea    0x9c(%r12),%rax
  12:   00
  13:   48                      rex.W
  14:   89                      .byte 0x89

^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: Kernel Panic every 2 weeks on ISP server (NULL pointer dereference)
  2011-10-24 18:09   ` Luciano Ruete
  2011-10-24 18:21     ` Eric Dumazet
@ 2011-11-07 13:11     ` Luciano Ruete
  1 sibling, 0 replies; 5+ messages in thread
From: Luciano Ruete @ 2011-11-07 13:11 UTC (permalink / raw)
  To: netdev; +Cc: Eric Dumazet

[-- Attachment #1: Type: Text/Plain, Size: 1059 bytes --]

On Monday, October 24, 2011 03:09:13 pm Luciano Ruete wrote:
> On Sunday, October 23, 2011 02:16:29 am Eric Dumazet wrote:
> > On Saturday, 22 October 2011 at 22:18 -0300, Luciano Ruete wrote:
> [...]
> Thanks again, I will try the kernel upgrade and post the results in this thread.

OK, the server is now running Linux kernel 3.0.0 (Ubuntu 11.10)[0].

After 3 days of uptime, I've had a new kind of crash (panic), this time in the
nf_conntrack_sip flush_expectations function.

The trace and the decoded trace are attached. I still do not know how to read
them in order to follow the execution and blame a particular line of kernel
code. I guess compiling C into assembler on the fly is not in my skills bag.

I just want to check whether this is a kernel bug, or whether there may be
something wrong somewhere else in my setup...

[0]
server:~# uname -a
Linux server 3.0.0-12-server #20-Ubuntu SMP Fri Oct 7 16:36:30 UTC 2011 x86_64 GNU/Linux

-- 
Luciano Ruete
Sequre - Sys Admin
Mitre 617, piso 7, of.
1 +54 261 4254894 Mendoza - Argentina http://www.sequre.com.ar/ [-- Attachment #2: decoded_trace.txt --] [-- Type: text/plain, Size: 1695 bytes --] [328686.010062] Code: 84 d2 75 7f 48 c7 c7 e8 19 12 a0 45 0f b6 ee e8 47 2d 48 e1 48 8b 5b 28 48 85 db 75 0e eb 4c 0f 1f 40 00 4d 85 e4 74 43 4c 89 e3 <8b> b3 d0 00 00 00 31 c0 4c 8b 23 85 f6 0f 95 c0 41 39 c5 75 e3 All code ======== 0: 84 d2 test %dl,%dl 2: 75 7f jne 0x83 4: 48 c7 c7 e8 19 12 a0 mov $0xffffffffa01219e8,%rdi b: 45 0f b6 ee movzbl %r14b,%r13d f: e8 47 2d 48 e1 callq 0xffffffffe1482d5b 14: 48 8b 5b 28 mov 0x28(%rbx),%rbx 18: 48 85 db test %rbx,%rbx 1b: 75 0e jne 0x2b 1d: eb 4c jmp 0x6b 1f: 0f 1f 40 00 nopl 0x0(%rax) 23: 4d 85 e4 test %r12,%r12 26: 74 43 je 0x6b 28: 4c 89 e3 mov %r12,%rbx 2b:* 8b b3 d0 00 00 00 mov 0xd0(%rbx),%esi <-- trapping instruction 31: 31 c0 xor %eax,%eax 33: 4c 8b 23 mov (%rbx),%r12 36: 85 f6 test %esi,%esi 38: 0f 95 c0 setne %al 3b: 41 39 c5 cmp %eax,%r13d 3e: 75 e3 jne 0x23 Code starting with the faulting instruction =========================================== 0: 8b b3 d0 00 00 00 mov 0xd0(%rbx),%esi 6: 31 c0 xor %eax,%eax 8: 4c 8b 23 mov (%rbx),%r12 b: 85 f6 test %esi,%esi d: 0f 95 c0 setne %al 10: 41 39 c5 cmp %eax,%r13d 13: 75 e3 jne 0xfffffffffffffff8 [-- Attachment #3: kern.log.txt --] [-- Type: text/plain, Size: 8709 bytes --] [328680.672986] general protection fault: 0000 [#1] SMP [328680.733325] CPU 1 [328680.756199] Modules linked in: ppp_deflate zlib_deflate bsd_comp ppp_async crc_ccitt nf_conntrack_netlink nfnetlink xt_owner ipt_REJECT ipt_REDIRECT ipt_MASQUERADE xt_iprange xt_helper xt_length xt_TCPMSS xt_connmark xt_mark xt_state xt_tcpudp xt_multiport iptable_mangle iptable_nat iptable_filter ip_tables x_tables sch_sfq act_mirred cls_u32 sch_prio cls_fw sch_htb ifb dummy 8021q garp stp nf_nat_irc nf_conntrack_irc nf_nat_sip nf_conntrack_sip nf_nat_pptp nf_conntrack_pptp nf_conntrack_proto_gre nf_nat_proto_gre nf_nat_amanda ts_kmp nf_conntrack_amanda nf_nat_ftp 
nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_conntrack_ftp nf_conntrack cdc_ether usbnet i7core_edac ioatdma tpm_tis serio_raw lp shpchp parport edac_core raid10 raid456 async_pq async_xor xor async_memcpy async_raid6_recov ra id6_pq async_tx raid1 igb usbhid raid0 hid megaraid_sas dca multipath linear [328681.672554] [328681.691295] Pid: 0, comm: kworker/0:0 Not tainted 3.0.0-12-server #20-Ubuntu IBM System x3650 M3 -[7945AC1]-/69Y5698 [328681.823341] RIP: 0010:[<ffffffffa017bc70>] [<ffffffffa017bc70>] flush_expectations+0x50/0xc0 [nf_conntrack_sip] [328681.945982] RSP: 0018:ffff88027f223890 EFLAGS: 00010286 [328682.010416] RAX: 0000000000000000 RBX: dead000000100100 RCX: ffff88026fb00480 [328682.096674] RDX: ffff88027ec1c420 RSI: 0000000000000001 RDI: ffff88026aa8df68 [328682.182933] RBP: ffff88027f2238b0 R08: ffff88027ec1c000 R09: dead000000200200 [328682.269198] R10: dead000000200200 R11: dead000000200200 R12: dead000000100100 [328682.355459] R13: 0000000000000001 R14: 0000000000000001 R15: 000000000000012e [328682.441720] FS: 0000000000000000(0000) GS:ffff88027f220000(0000) knlGS:0000000000000000 [328682.539407] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b [328682.609036] CR2: 00007f07284b6500 CR3: 0000000001c03000 CR4: 00000000000006e0 [328682.695299] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [328682.781560] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 [328682.867822] Process kworker/0:0 (pid: 0, threadinfo ffff880272a2e000, task ffff880272a30000) [328682.969665] Stack: [328682.994635] 0000000000000030 ffff88027f2239cc ffff88027f2239c0 ffff88013dfe2100 [328683.084115] ffff88027f2238e0 ffffffffa017cec0 ffff88027f2239cc ffff880200000001 [328683.173599] ffff88027f2238e0 0000000000000000 ffff88027f223970 ffffffffa017b9eb [328683.263086] Call Trace: [328683.293249] <IRQ> [328683.319288] [<ffffffffa017cec0>] process_invite_response+0x80/0x90 [nf_conntrack_sip] [328683.414895] [<ffffffffa017b9eb>] 
process_sip_response+0x15b/0x170 [nf_conntrack_sip] [328683.509468] [<ffffffffa017d10d>] process_sip_msg.isra.8+0x7d/0xb0 [nf_conntrack_sip] [328683.604039] [<ffffffffa017d1dd>] sip_help_udp+0x9d/0xd0 [nf_conntrack_sip] [328683.688209] [<ffffffffa012fdcf>] ipv4_confirm+0xbf/0x200 [nf_conntrack_ipv4] [328683.774479] [<ffffffff81516075>] nf_iterate+0x85/0xc0 [328683.836838] [<ffffffff815223e0>] ? ip_fragment+0x950/0x950 [328683.904391] [<ffffffff81516126>] nf_hook_slow+0x76/0x130 [328683.969865] [<ffffffff815223e0>] ? ip_fragment+0x950/0x950 [328684.037414] [<ffffffff8152310c>] ip_output+0x9c/0xc0 [328684.098737] [<ffffffff8151f094>] ip_forward_finish+0x44/0x60 [328684.168372] [<ffffffff8151f355>] ip_forward+0x2a5/0x440 [328684.232807] [<ffffffff8151d641>] ip_rcv_finish+0x131/0x370 [328684.300363] [<ffffffff8151deee>] ip_rcv+0x21e/0x2f0 [328684.360647] [<ffffffff814e9972>] __netif_receive_skb+0x4a2/0x540 [328684.434433] [<ffffffff814dc63f>] ? __alloc_skb+0x4f/0x230 [328684.500945] [<ffffffff814e99d9>] __netif_receive_skb+0x509/0x540 [328684.574730] [<ffffffff814ea530>] netif_receive_skb+0x80/0x90 [328684.644361] [<ffffffff814ea928>] ? dev_gro_receive+0x1b8/0x2c0 [328684.716067] [<ffffffff814ea670>] napi_skb_finish+0x50/0x70 [328684.783620] [<ffffffff814eaba5>] napi_gro_receive+0xb5/0xc0 [328684.852215] [<ffffffff815ba50b>] vlan_gro_receive+0x1b/0x20 [328684.920811] [<ffffffffa00c4be8>] igb_clean_rx_irq_adv+0x2a8/0x630 [igb] [328685.001873] [<ffffffffa00c4fde>] igb_poll+0x6e/0x140 [igb] [328685.069425] [<ffffffff814eadb4>] net_rx_action+0x134/0x290 [328685.136984] [<ffffffffa00bf796>] ? igb_msix_ring+0x36/0x50 [igb] [328685.210772] [<ffffffff81065e38>] __do_softirq+0xa8/0x210 [328685.276249] [<ffffffff815fe82e>] ? 
_raw_spin_lock+0xe/0x20 [328685.343804] [<ffffffff81607e1c>] call_softirq+0x1c/0x30 [328685.408240] [<ffffffff8100c295>] do_softirq+0x65/0xa0 [328685.470602] [<ffffffff8106621e>] irq_exit+0x8e/0xb0 [328685.530885] [<ffffffff81608673>] do_IRQ+0x63/0xe0 [328685.589092] [<ffffffff815fed53>] common_interrupt+0x13/0x13 [328685.657682] <EOI> [328685.683719] [<ffffffff814bc8ba>] ? poll_idle+0x3a/0x80 [328685.747119] [<ffffffff814bc893>] ? poll_idle+0x13/0x80 [328685.810519] [<ffffffff814bcba2>] cpuidle_idle_call+0xa2/0x1d0 [328685.881188] [<ffffffff8100920b>] cpu_idle+0xab/0x100 [328685.942510] [<ffffffff815de7ec>] start_secondary+0xd9/0xdb [328686.010062] Code: 84 d2 75 7f 48 c7 c7 e8 19 12 a0 45 0f b6 ee e8 47 2d 48 e1 48 8b 5b 28 48 85 db 75 0e eb 4c 0f 1f 40 00 4d 85 e4 74 43 4c 89 e3 <8b> b3 d0 00 00 00 31 c0 4c 8b 23 85 f6 0f 95 c0 41 39 c5 75 e3 [328686.238203] RIP [<ffffffffa017bc70>] flush_expectations+0x50/0xc0 [nf_conntrack_sip] [328686.332795] RSP <ffff88027f223890> [328686.375794] ---[ end trace 806ab2e6e0730fa6 ]--- [328686.431970] Kernel panic - not syncing: Fatal exception in interrupt [328686.508915] Pid: 0, comm: kworker/0:0 Tainted: G D 3.0.0-12-server #20-Ubuntu [328686.604571] Call Trace: [328686.634779] <IRQ> [<ffffffff815e8184>] panic+0x91/0x194 [328686.700387] [<ffffffff815ffd0a>] oops_end+0xea/0xf0 [328686.760720] [<ffffffff8100d8c8>] die+0x58/0x90 [328686.815859] [<ffffffff815ff7c2>] do_general_protection+0x162/0x170 [328686.891771] [<ffffffff8150a4eb>] ? qdisc_watchdog_schedule+0x3b/0x40 [328686.969755] [<ffffffff815fefe5>] general_protection+0x25/0x30 [328687.040477] [<ffffffffa017bc70>] ? 
flush_expectations+0x50/0xc0 [nf_conntrack_sip] [328687.133024] [<ffffffffa017cec0>] process_invite_response+0x80/0x90 [nf_conntrack_sip] [328687.228685] [<ffffffffa017b9eb>] process_sip_response+0x15b/0x170 [nf_conntrack_sip] [328687.323308] [<ffffffffa017d10d>] process_sip_msg.isra.8+0x7d/0xb0 [nf_conntrack_sip] [328687.417932] [<ffffffffa017d1dd>] sip_help_udp+0x9d/0xd0 [nf_conntrack_sip] [328687.502154] [<ffffffffa012fdcf>] ipv4_confirm+0xbf/0x200 [nf_conntrack_ipv4] [328687.588464] [<ffffffff81516075>] nf_iterate+0x85/0xc0 [328687.650873] [<ffffffff815223e0>] ? ip_fragment+0x950/0x950 [328687.718473] [<ffffffff81516126>] nf_hook_slow+0x76/0x130 [328687.783991] [<ffffffff815223e0>] ? ip_fragment+0x950/0x950 [328687.851594] [<ffffffff8152310c>] ip_output+0x9c/0xc0 [328687.912962] [<ffffffff8151f094>] ip_forward_finish+0x44/0x60 [328687.982636] [<ffffffff8151f355>] ip_forward+0x2a5/0x440 [328688.047122] [<ffffffff8151d641>] ip_rcv_finish+0x131/0x370 [328688.114722] [<ffffffff8151deee>] ip_rcv+0x21e/0x2f0 [328688.175049] [<ffffffff814e9972>] __netif_receive_skb+0x4a2/0x540 [328688.248881] [<ffffffff814dc63f>] ? __alloc_skb+0x4f/0x230 [328688.315440] [<ffffffff814e99d9>] __netif_receive_skb+0x509/0x540 [328688.389273] [<ffffffff814ea530>] netif_receive_skb+0x80/0x90 [328688.458955] [<ffffffff814ea928>] ? dev_gro_receive+0x1b8/0x2c0 [328688.530709] [<ffffffff814ea670>] napi_skb_finish+0x50/0x70 [328688.598311] [<ffffffff814eaba5>] napi_gro_receive+0xb5/0xc0 [328688.666951] [<ffffffff815ba50b>] vlan_gro_receive+0x1b/0x20 [328688.735594] [<ffffffffa00c4be8>] igb_clean_rx_irq_adv+0x2a8/0x630 [igb] [328688.816700] [<ffffffffa00c4fde>] igb_poll+0x6e/0x140 [igb] [328688.884300] [<ffffffff814eadb4>] net_rx_action+0x134/0x290 [328688.951901] [<ffffffffa00bf796>] ? igb_msix_ring+0x36/0x50 [igb] [328689.025732] [<ffffffff81065e38>] __do_softirq+0xa8/0x210 [328689.091255] [<ffffffff815fe82e>] ? 
_raw_spin_lock+0xe/0x20 [328689.158854] [<ffffffff81607e1c>] call_softirq+0x1c/0x30 [328689.223337] [<ffffffff8100c295>] do_softirq+0x65/0xa0 [328689.285746] [<ffffffff8106621e>] irq_exit+0x8e/0xb0 [328689.346074] [<ffffffff81608673>] do_IRQ+0x63/0xe0 [328689.404330] [<ffffffff815fed53>] common_interrupt+0x13/0x13 [328689.472964] <EOI> [<ffffffff814bc8ba>] ? poll_idle+0x3a/0x80 [328689.543765] [<ffffffff814bc893>] ? poll_idle+0x13/0x80 [328689.607210] [<ffffffff814bcba2>] cpuidle_idle_call+0xa2/0x1d0 [328689.677935] [<ffffffff8100920b>] cpu_idle+0xab/0x100 [328689.739309] [<ffffffff815de7ec>] start_secondary+0xd9/0xdb [328689.806913] Rebooting in 1 seconds.. ^ permalink raw reply [flat|nested] 5+ messages in thread