From mboxrd@z Thu Jan 1 00:00:00 1970 From: Stephen Hemminger Subject: Re: sky2 driver fails to handle "rx length error: status 0x5d60100 length 2982" gracefully Date: Wed, 11 Aug 2010 21:59:32 -0400 Message-ID: <20100811215932.26414efe@s6510> References: Mime-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: QUOTED-PRINTABLE Cc: Stephen Hemminger , Linux NetDev To: Maciej =?UTF-8?B?xbtlbmN6eWtvd3NraQ==?= Return-path: Received: from mail.vyatta.com ([76.74.103.46]:44073 "EHLO mail.vyatta.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S933028Ab0HLB7i convert rfc822-to-8bit (ORCPT ); Wed, 11 Aug 2010 21:59:38 -0400 In-Reply-To: Sender: netdev-owner@vger.kernel.org List-ID: On Wed, 11 Aug 2010 17:48:59 -0700 Maciej =C5=BBenczykowski wrote: > [See https://bugzilla.redhat.com/show_bug.cgi?id=3D592398 ] >=20 > Latest tested kernel (from koji for Fedora 13): >=20 > 2.6.34.3-35.rc1.fc13.x86_64 >=20 > Basically occasionally, but possibly more and more often with recent > kernels (I think .33 and .34 are worse then .32) the sky2 driver lock= s > up. >=20 > During this time the nic functions like a DSL line with a 95% drop > rate. ie. sometimes something does get through, but mostly it's dead= =2E > "ip link set eth0 down && ip link set eth0 up" is enough to fix it. >=20 > Here's the initial occurrence of this problem on the above kernel. >=20 > Aug 11 16:21:19 nike kernel: sky2 0000:0c:00.0: eth0: rx length error= : > status 0x5d60100 length 2982 > Aug 11 16:21:27 nike kernel: eth0: hw csum failure. > Aug 11 16:21:27 nike kernel: Pid: 0, comm: swapper Not tainted > 2.6.34.3-35.rc1.fc13.x86_64 #1 > Aug 11 16:21:27 nike kernel: Call Trace: > Aug 11 16:21:27 nike kernel: [] > netdev_rx_csum_fault+0x3b/0x3f > Aug 11 16:21:27 nike kernel: [] > __skb_checksum_complete_head+0x51/0x65 > Aug 11 16:21:27 nike kernel: [] > __skb_checksum_complete+0x11/0x13 > Aug 11 16:21:27 nike kernel: [] nf_ip_checksum+0xdd= /0xe3 > Aug 11 16:21:27 nike kernel: [] udp_error+0x130/0x1= 8a > Aug 11 16:21:27 nike kernel: [] ? enqueue_task+0x5f= /0x6a > Aug 11 16:21:27 nike kernel: [] ? activate_task+0x2= f/0x37 > Aug 11 16:21:27 nike kernel: [] nf_conntrack_in+0x1= 80/0x90e > Aug 11 16:21:27 nike kernel: [] ? enqueue_task_fair= +0x44/0x87 > Aug 11 16:21:27 nike kernel: [] ? enqueue_task+0x5f= /0x6a > Aug 11 16:21:27 nike kernel: [] ipv4_conntrack_in+0= x21/0x23 > Aug 11 16:21:27 nike kernel: [] nf_iterate+0x46/0x8= 9 > Aug 11 16:21:27 nike kernel: [] ? ip_rcv_finish+0x0= /0x362 > Aug 11 16:21:27 nike kernel: [] nf_hook_slow+0x6a/0= xcb > Aug 11 16:21:27 nike kernel: [] ? ip_rcv_finish+0x0= /0x362 > Aug 11 16:21:27 nike kernel: [] ? ip_rcv_finish+0x0= /0x362 > Aug 11 16:21:27 nike kernel: [] NF_HOOK.clone.1+0x4= 6/0x58 > Aug 11 16:21:27 nike kernel: [] ? getnstimeofday+0x= 63/0xb9 > Aug 11 16:21:27 nike kernel: [] ip_rcv+0x256/0x283 > Aug 11 16:21:27 nike kernel: [] netif_receive_skb+0= x493/0x4b9 > Aug 11 16:21:27 nike kernel: [] napi_skb_finish+0x2= 9/0x40 > Aug 11 16:21:27 nike kernel: [] napi_gro_receive+0x= 2f/0x34 > Aug 11 16:21:27 nike kernel: [] sky2_poll+0x9c5/0xc= 58 [sky2] > Aug 11 16:21:27 nike kernel: [] net_rx_action+0xaf/= 0x1ca > Aug 11 16:21:27 nike kernel: [] __do_softirq+0xe5/0= x1a6 > Aug 11 16:21:27 nike kernel: [] ? handle_IRQ_event+= 0x60/0x121 > Aug 11 16:21:27 nike kernel: [] call_softirq+0x1c/0= x30 > Aug 11 16:21:27 nike kernel: [] do_softirq+0x46/0x8= 3 > Aug 11 16:21:27 nike kernel: [] irq_exit+0x3b/0x7d > Aug 11 16:21:27 nike kernel: [] do_IRQ+0xac/0xc3 > Aug 11 16:21:27 nike kernel: [] ret_from_intr+0x0/0= x11 > Aug 11 16:21:27 nike kernel: [] ? > acpi_idle_enter_bm+0x288/0x2bc > Aug 11 16:21:27 nike kernel: [] ? > acpi_idle_enter_bm+0x281/0x2bc > Aug 11 16:21:27 nike kernel: [] cpuidle_idle_call+0= x99/0xf1 > Aug 11 16:21:27 nike kernel: [] cpu_idle+0xaa/0xe4 > Aug 11 16:21:27 nike kernel: [] start_secondary+0x2= 53/0x294 > Aug 11 16:21:34 nike kernel: eth0: hw csum failure. > Aug 11 16:21:34 nike kernel: Pid: 0, comm: swapper Not tainted > 2.6.34.3-35.rc1.fc13.x86_64 #1 > Aug 11 16:21:34 nike kernel: Call Trace: > Aug 11 16:21:34 nike kernel: [] > netdev_rx_csum_fault+0x3b/0x3f > Aug 11 16:21:34 nike kernel: [] > __skb_checksum_complete_head+0x51/0x65 > Aug 11 16:21:34 nike kernel: [] __skb_checksum_comp= lete+0x11/0 > ... > etc, 700 messages over the course of the next hour (until I came back > and ip link down/up fixed it). >=20 > # cat /var/log/messages | egrep 'rx len' > Aug 11 16:21:19 nike kernel: sky2 0000:0c:00.0: eth0: rx length error= : > status 0x5d60100 length 2982 >=20 > (also seen on an older kernel [ 2.6.33.5-112.fc13.x86_64 ]: > Jul 17 12:43:10 nike kernel: sky2 eth0: rx length error: status > 0x5ea0100 length 3018 > Jul 28 02:34:46 nike kernel: sky2 eth0: rx length error: status > 0x5ea0100 length 1642 > Jul 30 09:49:16 nike kernel: sky2 eth0: rx length error: status > 0x5ea0100 length 3018 > Jul 31 00:20:26 nike kernel: sky2 eth0: rx length error: status > 0x5ea0100 length 3018 > and kernels before that, including 2.6.32.12-115.fc12.x86_64, but I > think I might have seen the problem even further back than 2.6.32). >=20 > # cat /var/log/messages | egrep 'eth0: hw csum failure\.$' | wc -l > 694 >=20 > The call stacks differ, here's the most common symbols with the numbe= r > of times they occur > (although this probably isn't particularly useful): >=20 > # cat /var/log/messages | egrep ffffffff | sed -rn 's@^^Aug .. > ..:..:.. nike kernel: @@p' | sort | uniq -c | egrep -v '^ [ > 1-9][0-9] ' > 602 [] ? acpi_idle_enter_bm+0x288/0x2bc > 630 [] cpu_idle+0xaa/0xe4 > 694 [] call_softirq+0x1c/0x30 > 693 [] do_softirq+0x46/0x83 > 273 [] ? sched_clock+0x9/0xd > 105 [] ? native_sched_clock+0x2d/0x5f > 254 [] ? lapic_next_event+0x1d/0x21 > 190 [] ? enqueue_task+0x5f/0x6a > 285 [] ? activate_task+0x2f/0x37 > 144 [] ? enqueue_task_fair+0x44/0x87 > 693 [] irq_exit+0x3b/0x7d > 694 [] __do_softirq+0xe5/0x1a6 > 103 [] ? sched_clock_local+0x1c/0x82 > 693 [] ? getnstimeofday+0x63/0xb9 > 202 [] ? clockevents_program_event+0x7a/0x83 > 255 [] ? tick_dev_program_event+0x3c/0xfc > 703 [] ? handle_IRQ_event+0x60/0x121 > 348 [] ? virt_to_head_page+0xe/0x2f > 528 [] ? __bitmap_weight+0x40/0x8f > 602 [] ? acpi_idle_enter_bm+0x281/0x2bc > 629 [] cpuidle_idle_call+0x99/0xf1 > 115 [] ? __kfree_skb+0x7d/0x81 > 694 [] __skb_checksum_complete_head+0x51/0x65 > 694 [] __skb_checksum_complete+0x11/0x13 > 694 [] netif_receive_skb+0x493/0x4b9 > 694 [] net_rx_action+0xaf/0x1ca > 694 [] napi_skb_finish+0x29/0x40 > 694 [] napi_gro_receive+0x2f/0x34 > 695 [] nf_iterate+0x46/0x89 > 695 [] nf_hook_slow+0x6a/0xcb > 145 [] ? nf_hook_slow+0x87/0xcb > 694 [] nf_conntrack_in+0x180/0x90e > 690 [] udp_error+0x130/0x18a > 2083 [] ? ip_rcv_finish+0x0/0x362 > 163 [] ? ip_local_deliver_finish+0x0/0x1b3 > 694 [] NF_HOOK.clone.1+0x46/0x58 > 694 [] ip_rcv+0x256/0x283 > 694 [] nf_ip_checksum+0xdd/0xe3 > 694 [] ipv4_conntrack_in+0x21/0x23 > 338 [] rest_init+0x7e/0x80 > 295 [] start_secondary+0x253/0x294 > 151 [] ? _raw_spin_unlock_bh+0x15/0x17 > 687 [] ret_from_intr+0x0/0x11 > 687 [] do_IRQ+0xac/0xc3 > 338 [] x86_64_start_reservations+0xb3/0xb7 > 338 [] x86_64_start_kernel+0xf8/0x107 > 338 [] start_kernel+0x413/0x41e > 694 [] sky2_poll+0x9c5/0xc58 [sky2] > 150 [] ? nf_nat_cleanup_conntrack+0x69/0x6d [nf= _nat] > 694 [] netdev_rx_csum_fault+0x3b/0x3f What is the dmesg and lspci info. Looks like a timing issue which is unique to your machine/bus hardware combination. could you just turn off hardware rx checksum (with ethtool).