* Re: 2.6.20-rc1 sky2 problems (regression?) [not found] <87psammchi.fsf@sycorax.lbl.gov> @ 2006-12-14 21:30 ` Stephen Hemminger 2006-12-14 22:00 ` Alex Romosan 2006-12-14 22:25 ` Alex Romosan 0 siblings, 2 replies; 14+ messages in thread From: Stephen Hemminger @ 2006-12-14 21:30 UTC (permalink / raw) To: Alex Romosan; +Cc: netdev On Thu, 14 Dec 2006 12:47:05 -0800 Alex Romosan <romosan@sycorax.lbl.gov> wrote: > under heavy network load the sky2 driver (compiled in the kernel) > locks up and the only way i can get the network back is to reboot the > machine (bringing the network down and back up again doesn't help). > this happens on an amd64 machine (athlon 3500+ processor) and the card > in question is a Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit > Ethernet Controller (rev 15) (from lspci). this is what i see in the > syslog: > > kernel: sky2 eth0: rx error, status 0x414a414a length 0 > kernel: eth0: hw csum failure. > kernel: > kernel: Call Trace: > kernel: <IRQ> [<ffffffff8044681c>] __skb_checksum_complete+0x4d/0x66 > kernel: [<ffffffff80477bc5>] tcp_v4_rcv+0x147/0x8ea > kernel: [<ffffffff80479ef2>] raw_rcv_skb+0x9/0x20 > kernel: [<ffffffff8047a2ff>] raw_rcv+0xbe/0xc4 > kernel: [<ffffffff8045ea9d>] ip_local_deliver+0x170/0x21b > kernel: [<ffffffff8045e8fa>] ip_rcv+0x478/0x4ab > kernel: [<ffffffff8044905d>] netif_receive_skb+0x184/0x20e > kernel: [<ffffffff803de8e5>] sky2_poll+0x68f/0x93c > kernel: [<ffffffff802219ce>] scheduler_tick+0x23/0x2f9 > kernel: [<ffffffff8044a796>] net_rx_action+0x61/0xf0 > kernel: [<ffffffff8022a35f>] __do_softirq+0x40/0x8a > kernel: [<ffffffff8020a3cc>] call_softirq+0x1c/0x28 > kernel: [<ffffffff8020bbf0>] do_softirq+0x2c/0x7d > kernel: [<ffffffff8022a313>] irq_exit+0x36/0x42 > kernel: [<ffffffff8020bebe>] do_IRQ+0x8c/0x9e > kernel: [<ffffffff80208710>] default_idle+0x0/0x3a > kernel: [<ffffffff80209bf1>] ret_from_intr+0x0/0xa > kernel: <EOI> [<ffffffff80208736>] default_idle+0x26/0x3a > kernel: [<ffffffff8020878c>] 
cpu_idle+0x42/0x75 > kernel: [<ffffffff805df675>] start_kernel+0x1ce/0x1d3 > kernel: [<ffffffff805df140>] _sinittext+0x140/0x144 > kernel: > kernel: eth0: hw csum failure. > kernel: > kernel: Call Trace: > kernel: <IRQ> [<ffffffff8044681c>] __skb_checksum_complete+0x4d/0x66 > kernel: [<ffffffff80477bc5>] tcp_v4_rcv+0x147/0x8ea > kernel: [<ffffffff80479ef2>] raw_rcv_skb+0x9/0x20 > kernel: [<ffffffff8047a2ff>] raw_rcv+0xbe/0xc4 > kernel: [<ffffffff8045ea9d>] ip_local_deliver+0x170/0x21b > kernel: [<ffffffff8045e8fa>] ip_rcv+0x478/0x4ab > kernel: [<ffffffff8044905d>] netif_receive_skb+0x184/0x20e > kernel: [<ffffffff803de8e5>] sky2_poll+0x68f/0x93c > kernel: [<ffffffff80474647>] tcp_delack_timer+0x0/0x1b5 > kernel: [<ffffffff8044a796>] net_rx_action+0x61/0xf0 > kernel: [<ffffffff8022a35f>] __do_softirq+0x40/0x8a > kernel: [<ffffffff8020a3cc>] call_softirq+0x1c/0x28 > kernel: [<ffffffff8020bbf0>] do_softirq+0x2c/0x7d > kernel: [<ffffffff8022a313>] irq_exit+0x36/0x42 > kernel: [<ffffffff8020bebe>] do_IRQ+0x8c/0x9e > kernel: [<ffffffff80209bf1>] ret_from_intr+0x0/0xa > kernel: <EOI> [<ffffffff802a8402>] inode2sd+0x104/0x117 > kernel: [<ffffffff802b8cfa>] search_by_key+0xa08/0xbfe > kernel: [<ffffffff802b8475>] search_by_key+0x183/0xbfe > kernel: [<ffffffff80284778>] ll_rw_block+0x89/0x9e > kernel: [<ffffffff802b8475>] search_by_key+0x183/0xbfe > kernel: [<ffffffff80283cf5>] __find_get_block_slow+0x101/0x10d > kernel: [<ffffffff80284053>] __find_get_block+0x197/0x1a5 > kernel: [<ffffffff8026800c>] inode_get_bytes+0x2a/0x52 > kernel: [<ffffffff802a89f1>] reiserfs_update_sd_size+0x7e/0x284 > kernel: [<ffffffff80237700>] kthread+0xed/0xfd > kernel: [<ffffffff802be990>] do_journal_end+0x34b/0xbdd > kernel: [<ffffffff802b1729>] reiserfs_dirty_inode+0x56/0x76 > kernel: [<ffffffff80284c19>] block_prepare_write+0x1a/0x24 > kernel: [<ffffffff802809b1>] __mark_inode_dirty+0x29/0x197 > kernel: [<ffffffff802a8d04>] reiserfs_commit_write+0x10d/0x19f > kernel: [<ffffffff80284c19>] 
block_prepare_write+0x1a/0x24 > kernel: [<ffffffff802484fc>] generic_file_buffered_write+0x4ad/0x6c4 > kernel: [<ffffffff80271b3c>] __pollwait+0x0/0xe0 > kernel: [<ffffffff8022a006>] current_fs_time+0x35/0x3b > kernel: [<ffffffff80248a8c>] __generic_file_aio_write_nolock+0x379/0x3ec > kernel: [<ffffffff8049baca>] unix_dgram_recvmsg+0x1be/0x1d9 > kernel: [<ffffffff804b6516>] __mutex_lock_slowpath+0x205/0x210 > kernel: [<ffffffff80248b60>] generic_file_aio_write+0x61/0xc1 > kernel: [<ffffffff80248aff>] generic_file_aio_write+0x0/0xc1 > kernel: [<ffffffff80264e57>] do_sync_readv_writev+0xc0/0x107 > kernel: [<ffffffff802377f7>] autoremove_wake_function+0x0/0x2e > kernel: [<ffffffff80229d16>] getnstimeofday+0x10/0x28 > kernel: [<ffffffff80264ced>] rw_copy_check_uvector+0x6c/0xdc > kernel: [<ffffffff802654f7>] do_readv_writev+0xb2/0x18b > kernel: [<ffffffff80265a2c>] sys_writev+0x45/0x93 > kernel: [<ffffffff802096de>] system_call+0x7e/0x83 > > and so on. some times i don't get this trace but instead i get: > > kernel: sky2 eth0: tx timeout > kernel: sky2 eth0: transmit ring 140 .. 99 report=181 done=181 > kernel: sky2 status report lost? > kernel: NETDEV WATCHDOG: eth0: transmit timed out > kernel: sky2 eth0: tx timeout > kernel: sky2 eth0: transmit ring 181 .. 140 report=181 done=181 > kernel: sky2 hardware hung? flushing > > but the end result is the same, the network card stops responding and > i have to reboot the machine. i can reproduce this on a consistent > basis so if there are any patches, i can try them out and see if they > fix the problem. > > this is probably not a regression per se as i saw it as well with > 2.6.19 and 2.6.19-rc6. i am not sure if it was there previous to > 2.6.19-rc6. suggestions, patches welcome. thanks. Pleas report these problems to netdev@vger.kernel.org, I rarely go looking in LKML. These are the things you need to debug a sky2 related problem. 1) What is exact kernel version in use? 
This is important because problems get fixed but it can be a long while
until the fix bubbles down to the vendor kernels.

2) What is the chip version? The driver prints this out on boot up in
the console log (dmesg | grep sky2).
This matters because each chip version has different
bugs to deal with.

3) Does it work with the vendor driver?
The vendor driver does a number of things differently than the sky2 driver
and can mask problems, but if it doesn't work either, that is a useful
data point. If you want to know why the sky2 driver was written instead
of just using the vendor driver, look at the code. The sk98lin driver
is huge, includes features that are unsupportable and broken, and has
locking mistakes. But sk98lin also has a watchdog that masks bugs and
may provide useful insight.

4) What is the IRQ routing?
There are two issues here. First, the driver will never work with
edge-triggered IRQs. Second, some motherboards have busted BIOSes and
chipsets that don't do MSI properly. A couple of module parameters are
available to help:
	disable_msi=1		avoids using MSI
	idle_timeout=10		polls for lost IRQs every N ms (10)

5) What are the messages in the console log when the problem happens?

6) Are you running any of the following: bonding, vlans, bridging,
netfilter, traffic control?

7) Please get a current version of ethtool from:
	git://git.kernel.org/pub/scm/network/ethtool/ethtool.git
and run an ethtool register dump after a problem occurs:
	ethtool -d eth0

8) Are you using a dual port board? There were issues on the PCI-X
version that required hacks; the PCI-express version may have the
same problem. Basically, checksum offload wouldn't work and receive
DMAs would arrive out of order.

-- 
Stephen Hemminger <shemminger@osdl.org>

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: 2.6.20-rc1 sky2 problems (regression?) 2006-12-14 21:30 ` 2.6.20-rc1 sky2 problems (regression?) Stephen Hemminger @ 2006-12-14 22:00 ` Alex Romosan 2006-12-14 22:25 ` Alex Romosan 1 sibling, 0 replies; 14+ messages in thread From: Alex Romosan @ 2006-12-14 22:00 UTC (permalink / raw) To: Stephen Hemminger; +Cc: netdev Stephen Hemminger <shemminger@osdl.org> writes: > On Thu, 14 Dec 2006 12:47:05 -0800 > Alex Romosan <romosan@sycorax.lbl.gov> wrote: > >> under heavy network load the sky2 driver (compiled in the kernel) >> locks up and the only way i can get the network back is to reboot the >> machine (bringing the network down and back up again doesn't help). >> this happens on an amd64 machine (athlon 3500+ processor) and the card >> in question is a Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit >> Ethernet Controller (rev 15) (from lspci). this is what i see in the >> syslog: >> >> kernel: sky2 eth0: rx error, status 0x414a414a length 0 >> kernel: eth0: hw csum failure. >> kernel: >> kernel: Call Trace: >> kernel: <IRQ> [<ffffffff8044681c>] __skb_checksum_complete+0x4d/0x66 >> kernel: [<ffffffff80477bc5>] tcp_v4_rcv+0x147/0x8ea >> kernel: [<ffffffff80479ef2>] raw_rcv_skb+0x9/0x20 >> kernel: [<ffffffff8047a2ff>] raw_rcv+0xbe/0xc4 >> kernel: [<ffffffff8045ea9d>] ip_local_deliver+0x170/0x21b >> kernel: [<ffffffff8045e8fa>] ip_rcv+0x478/0x4ab >> kernel: [<ffffffff8044905d>] netif_receive_skb+0x184/0x20e >> kernel: [<ffffffff803de8e5>] sky2_poll+0x68f/0x93c >> kernel: [<ffffffff802219ce>] scheduler_tick+0x23/0x2f9 >> kernel: [<ffffffff8044a796>] net_rx_action+0x61/0xf0 >> kernel: [<ffffffff8022a35f>] __do_softirq+0x40/0x8a >> kernel: [<ffffffff8020a3cc>] call_softirq+0x1c/0x28 >> kernel: [<ffffffff8020bbf0>] do_softirq+0x2c/0x7d >> kernel: [<ffffffff8022a313>] irq_exit+0x36/0x42 >> kernel: [<ffffffff8020bebe>] do_IRQ+0x8c/0x9e >> kernel: [<ffffffff80208710>] default_idle+0x0/0x3a >> kernel: [<ffffffff80209bf1>] ret_from_intr+0x0/0xa >> kernel: <EOI> 
[<ffffffff80208736>] default_idle+0x26/0x3a >> kernel: [<ffffffff8020878c>] cpu_idle+0x42/0x75 >> kernel: [<ffffffff805df675>] start_kernel+0x1ce/0x1d3 >> kernel: [<ffffffff805df140>] _sinittext+0x140/0x144 >> kernel: >> kernel: eth0: hw csum failure. >> kernel: >> kernel: Call Trace: >> kernel: <IRQ> [<ffffffff8044681c>] __skb_checksum_complete+0x4d/0x66 >> kernel: [<ffffffff80477bc5>] tcp_v4_rcv+0x147/0x8ea >> kernel: [<ffffffff80479ef2>] raw_rcv_skb+0x9/0x20 >> kernel: [<ffffffff8047a2ff>] raw_rcv+0xbe/0xc4 >> kernel: [<ffffffff8045ea9d>] ip_local_deliver+0x170/0x21b >> kernel: [<ffffffff8045e8fa>] ip_rcv+0x478/0x4ab >> kernel: [<ffffffff8044905d>] netif_receive_skb+0x184/0x20e >> kernel: [<ffffffff803de8e5>] sky2_poll+0x68f/0x93c >> kernel: [<ffffffff80474647>] tcp_delack_timer+0x0/0x1b5 >> kernel: [<ffffffff8044a796>] net_rx_action+0x61/0xf0 >> kernel: [<ffffffff8022a35f>] __do_softirq+0x40/0x8a >> kernel: [<ffffffff8020a3cc>] call_softirq+0x1c/0x28 >> kernel: [<ffffffff8020bbf0>] do_softirq+0x2c/0x7d >> kernel: [<ffffffff8022a313>] irq_exit+0x36/0x42 >> kernel: [<ffffffff8020bebe>] do_IRQ+0x8c/0x9e >> kernel: [<ffffffff80209bf1>] ret_from_intr+0x0/0xa >> kernel: <EOI> [<ffffffff802a8402>] inode2sd+0x104/0x117 >> kernel: [<ffffffff802b8cfa>] search_by_key+0xa08/0xbfe >> kernel: [<ffffffff802b8475>] search_by_key+0x183/0xbfe >> kernel: [<ffffffff80284778>] ll_rw_block+0x89/0x9e >> kernel: [<ffffffff802b8475>] search_by_key+0x183/0xbfe >> kernel: [<ffffffff80283cf5>] __find_get_block_slow+0x101/0x10d >> kernel: [<ffffffff80284053>] __find_get_block+0x197/0x1a5 >> kernel: [<ffffffff8026800c>] inode_get_bytes+0x2a/0x52 >> kernel: [<ffffffff802a89f1>] reiserfs_update_sd_size+0x7e/0x284 >> kernel: [<ffffffff80237700>] kthread+0xed/0xfd >> kernel: [<ffffffff802be990>] do_journal_end+0x34b/0xbdd >> kernel: [<ffffffff802b1729>] reiserfs_dirty_inode+0x56/0x76 >> kernel: [<ffffffff80284c19>] block_prepare_write+0x1a/0x24 >> kernel: [<ffffffff802809b1>] 
__mark_inode_dirty+0x29/0x197 >> kernel: [<ffffffff802a8d04>] reiserfs_commit_write+0x10d/0x19f >> kernel: [<ffffffff80284c19>] block_prepare_write+0x1a/0x24 >> kernel: [<ffffffff802484fc>] generic_file_buffered_write+0x4ad/0x6c4 >> kernel: [<ffffffff80271b3c>] __pollwait+0x0/0xe0 >> kernel: [<ffffffff8022a006>] current_fs_time+0x35/0x3b >> kernel: [<ffffffff80248a8c>] __generic_file_aio_write_nolock+0x379/0x3ec >> kernel: [<ffffffff8049baca>] unix_dgram_recvmsg+0x1be/0x1d9 >> kernel: [<ffffffff804b6516>] __mutex_lock_slowpath+0x205/0x210 >> kernel: [<ffffffff80248b60>] generic_file_aio_write+0x61/0xc1 >> kernel: [<ffffffff80248aff>] generic_file_aio_write+0x0/0xc1 >> kernel: [<ffffffff80264e57>] do_sync_readv_writev+0xc0/0x107 >> kernel: [<ffffffff802377f7>] autoremove_wake_function+0x0/0x2e >> kernel: [<ffffffff80229d16>] getnstimeofday+0x10/0x28 >> kernel: [<ffffffff80264ced>] rw_copy_check_uvector+0x6c/0xdc >> kernel: [<ffffffff802654f7>] do_readv_writev+0xb2/0x18b >> kernel: [<ffffffff80265a2c>] sys_writev+0x45/0x93 >> kernel: [<ffffffff802096de>] system_call+0x7e/0x83 >> >> and so on. some times i don't get this trace but instead i get: >> >> kernel: sky2 eth0: tx timeout >> kernel: sky2 eth0: transmit ring 140 .. 99 report=181 done=181 >> kernel: sky2 status report lost? >> kernel: NETDEV WATCHDOG: eth0: transmit timed out >> kernel: sky2 eth0: tx timeout >> kernel: sky2 eth0: transmit ring 181 .. 140 report=181 done=181 >> kernel: sky2 hardware hung? flushing >> > Pleas report these problems to netdev@vger.kernel.org, I rarely go > looking in LKML. > > These are the things you need to debug a sky2 related problem. > > 1) What is exact kernel version in use? This is important because > problems get fixed but it can be a long while until the fix bubbles down > to the vendor kernels. this is stock kernel.org kernel version 2.6.20-rc1 i downloaded this morning. 2.6.19 and 2.6.19-rc6 i referred to in my original message were also donloaded from kernel.org. 

> 2) What is the chip version? The driver prints this out on boot up in
> the console log. (dmesg | grep sky2)
> This matters because each chip version has different
> bugs to deal with.

sky2 v1.10 addr 0xfddfc000 irq 17 Yukon-EC (0xb6) rev 1
sky2 eth0: addr 00:11:09:da:39:a3
sky2 eth0: enabling interface
sky2 eth0: ram buffer 48K
sky2 eth0: Link is up at 100 Mbps, full duplex, flow control both

> 3) Does it work with the vendor driver?
> The vendor driver does a number of things differently than the sky2 driver
> and can mask problems, but if it doesn't work as well that is a useful
> data point. If you want to know why the sky2 driver was written instead
> of just using the vendor driver, look at the code. The sk98lin driver
> is huge, includes features that are unsupportable and broken, and locking
> mistakes. But the sk98lin also has a watchdog that masks off bugs and
> may provide useful insight.

i haven't tried the vendor driver yet, but i guess i will, and let you
know what happens.

> 4) What is the IRQ routing?
> There are two issues here, first the driver will never work with edge
> trigger IRQ's, some motherboards also have busted BIOS and chipsets
> that don't do MSI properly. A couple of module parameters are available
> to help:
> disable_msi=1 avoids using MSI
> idle_timeout=10 polls for lost IRQ's every N ms (10)

hmm, i have MSI interrupts enabled in the config and cat
/proc/interrupts gives me:

283: 1474208 PCI-MSI-edge eth0

so you say i should disable msi?

> 5) What are the messages in the console log when problem happens?

see my original message i kept above.

> 6) Are you running any of the following: bonding, vlans, bridging,
> netfilter, traffic control?

no.

> 7) Please get a current version of ethtool from:
> git://git.kernel.org/pub/scm/network/ethtool/ethtool.git
> and run ethtool register dump after a problem occurs:
> ethtool -d eth0

i've downloaded it and i'll run it next time the machine locks up.

> 8) Are you using a dual port board. There were issues on the PCI-X
> version that required hacks, the PCI-express version may have the
> same problem. Basically, checksum offload wouldn't work and receive
> DMA's would arrive out of order.

it is a dual port board but i am using only one port.

--alex--

-- 
| I believe the moment is at hand when, by a paranoiac and active |
| advance of the mind, it will be possible (simultaneously with   |
| automatism and other passive states) to systematize confusion   |
| and thus to help to discredit completely the world of reality.  |

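
For readers following the IRQ routing question, here is a minimal Python sketch of how the /proc/interrupts lines quoted in this thread ("PCI-MSI-edge eth0" and "IO-APIC-fasteoi eth0, CMI8738") could be checked mechanically. The function name `irq_info` is hypothetical, and the parser assumes the 2.6.20-era layout shown in the thread (IRQ number, per-CPU counters, one chip/type field, then a comma-separated device list); other kernels format this file differently.

```python
def irq_info(proc_interrupts, ifname="eth0"):
    """Return IRQ details for ifname, or None if it is not listed.

    Assumes lines shaped like the ones quoted in this thread:
        283: 1474208 PCI-MSI-edge eth0
         17: 203 IO-APIC-fasteoi eth0, CMI8738
    """
    for line in proc_interrupts.splitlines():
        head, sep, rest = line.partition(":")
        if not sep or not head.strip().isdigit():
            continue  # skip the CPU header and named rows such as NMI
        fields = rest.split()
        i = 0
        while i < len(fields) and fields[i].isdigit():
            i += 1  # skip the per-CPU interrupt counters
        if i == len(fields):
            continue
        chip = fields[i]
        devices = [d.strip() for d in " ".join(fields[i + 1:]).split(",")]
        if ifname in devices:
            return {
                "irq": int(head),
                "chip": chip,
                "msi": "MSI" in chip.upper(),
                "shared": len(devices) > 1,
            }
    return None


if __name__ == "__main__":
    with open("/proc/interrupts") as f:
        print(irq_info(f.read(), "eth0"))
```

On the two lines from this thread it reports MSI on IRQ 283 for the first, and a shared, non-MSI IRQ 17 for the second.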
* Re: 2.6.20-rc1 sky2 problems (regression?) 2006-12-14 21:30 ` 2.6.20-rc1 sky2 problems (regression?) Stephen Hemminger 2006-12-14 22:00 ` Alex Romosan @ 2006-12-14 22:25 ` Alex Romosan 2006-12-14 22:47 ` Stephen Hemminger 1 sibling, 1 reply; 14+ messages in thread From: Alex Romosan @ 2006-12-14 22:25 UTC (permalink / raw) To: Stephen Hemminger; +Cc: netdev Stephen Hemminger <shemminger@osdl.org> writes: > 4) What is the IRQ routing? > There are two issues here, first the driver will never work with edge > trigger IRQ's, some motherboards also have busted BIOS and chipsets > that don't do MSI properly. A couple of module parameters are available > to help: > disable_msi=1 avoids using MSI > idle_timeout=10 polls for lost IRQ's every N ms (10) i didn't take long to lock up the machine again. i've rebooted back into stock 2.6.20-rc1 and added the two module parameters above. cat /proc/interrupts now gives me: 17: 203 IO-APIC-fasteoi eth0, CMI8738 so i guess the MSI interrupts are disabled. we'll see how this works. > 5) What are the messages in the console log when problem happens? kernel: NETDEV WATCHDOG: eth0: transmit timed out kernel: sky2 eth0: tx timeout kernel: sky2 eth0: transmit ring 402 .. 361 report=406 done=406 kernel: sky2 status report lost? kernel: NETDEV WATCHDOG: eth0: transmit timed out kernel: sky2 eth0: tx timeout kernel: sky2 eth0: transmit ring 406 .. 361 report=406 done=406 kernel: sky2 hardware hung? flushing kernel: NETDEV WATCHDOG: eth0: transmit timed out kernel: sky2 eth0: tx timeout kernel: sky2 eth0: transmit ring 361 .. 321 report=406 done=406 kernel: sky2 status report lost? kernel: NETDEV WATCHDOG: eth0: transmit timed out kernel: sky2 eth0: tx timeout kernel: sky2 eth0: transmit ring 406 .. 366 report=406 done=406 kernel: sky2 hardware hung? 
flushing > 7) Please get a current version of ethtool from: > git://git.kernel.org/pub/scm/network/ethtool/ethtool.git > and run ethtool register dump after a problem occurs: > ethtool -d eth0 this is the output after it stopped working: PCI config ---------- 00: ab 11 62 43 07 04 18 00 15 00 00 02 08 00 00 00 10: 04 c0 df fd 00 00 00 00 01 ce 00 00 00 00 00 00 20: 00 00 00 00 00 00 00 00 00 00 00 00 62 14 8c 05 30: 00 00 00 00 48 00 00 00 00 00 00 00 03 01 00 00 40: 00 00 f0 01 00 80 a0 01 01 50 02 fe 00 20 00 14 50: 03 5c 00 80 00 00 00 01 00 00 00 01 05 e0 83 00 60: 0c 10 e0 fe 00 00 00 00 61 41 00 00 00 00 00 00 70: 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Control Registers ----------------- Register Access Port 0x00 LED Control/Status 0xA603164A Interrupt Source 0x40000000 Interrupt Mask 0xC000001D Interrupt Hardware Error Source 0x00000000 Interrupt Hardware Error Mask 0x2E003F3F Bus Management Unit ------------------- CSR Receive Queue 1 0x00010000 CSR Sync Queue 1 0xFFFFFFFF CSR Async Queue 1 0x00000000 MAC Addresses --------------- Addr 1 00 11 09 DA 39 A3 Addr 2 00 11 09 DA 39 A3 Addr 3 00 00 00 00 00 00 Connector type 0x4A (J) PMD type 0x54 (T) PHY type 0x80 Chip Id 0xB6 Yukon-2 EC (rev 0) Ram Buffer 0x0C Status BMU: ----------- Control 0x0002220A Last Index 0x07FF Put Index 0x0601 List Address 0x000000007FBF8000 Transmit 1 done index 0x0196 Transmit index threshold 0x000A Status FIFO Write Pointer 0x16 Read Pointer 0x16 Level 0x00 Watermark 0x10 ISR Watermark 0x10 Status level Init 0x000030D4 Value 0x00000D00 Test 0x04 Control 0x02 TX status Init 0x0001E848 Value 0x0001E848 Test 0x04 Control 0x02 ISR Init 0x000009C4 Value 0x000009C4 Test 0x04 Control 0x02 GMAC control 0x005A GPHY control 0x2002 LINK control 0x02 GMAC 1 Status 0xD000 Control 0x1800 Transmit 0x1000 Receive 0xE000 Transmit flow control 0xFFFF Transmit parameter 0xD7C4 Serial mode 0x221E Source address: 00 11 09 DA 39 A3 Physical address: 00 11 09 DA 39 A3 Rx GMAC 1 End Address 
0x0000007F Almost Full Thresh 0x00000070 Control/Test 0x0900228A FIFO Flush Mask 0x000018FB FIFO Flush Threshold 0x0000000B Truncation Threshold 0x0000017C Upper Pause Threshold 0x00000000 Lower Pause Threshold 0x00000081 VLAN Tag 0x00000074 FIFO Write Pointer 0x00000000 FIFO Write Level 0x0000007B FIFO Read Pointer 0x00000000 FIFO Read Level 0x00000079 Tx GMAC 1 End Address 0x0000007F Almost Full Thresh 0x00000010 Control/Test 0x0102220A FIFO Flush Mask 0x00000000 FIFO Flush Threshold 0x00000000 Truncation Threshold 0x00000000 Upper Pause Threshold 0x00000000 Lower Pause Threshold 0x00000081 VLAN Tag 0x0000002A FIFO Write Pointer 0x0000002A FIFO Write Level 0x00000000 FIFO Read Pointer 0x00000000 FIFO Read Level 0x0000002A Receive Queue 1 --------------- Buffer control 0x05F8 Byte Counter 49408 Descriptor Address 0x0000000076F4F810 Status 0x05EA0100 Timestamp 0x00000000 BMU Control/Status 0x000061AA Done 0x0000 Request 0x0000000076F4F810 Csum1 Offset 52057 Piston 14 Csum2 Offset 52057 Positing 14 Sync Transmit Queue 1 --------------- Descriptor Address 0x0000000000000000 Address Counter 0x0000000000000000 Current Byte Counter 0 BMU Control/Status 0x00000000 Flag & FIFO Address 0x00000000 Control 0x00000000 Next 0x00000000 Data 0x0000000000000000 Status 0x00000000 Timestamp 0x00000000 Csum Start 0x0000 Pos 0 Write 0 Async Transmit Queue 1 --------------- Buffer control 0x053D Byte Counter 49950 Descriptor Address 0x0000000047237000 Status 0x000005EA Timestamp 0x00010000 BMU Control/Status 0x800011AA Done 0x0000 Request 0x000000004723753D Csum Start 0x0032 Pos 0 Write 0 Receive RAMbuffer 1 --------------- Start Address 0x00000000 End Address 0x00000E7F Write Pointer 0x00000079 Read Pointer 0x0000007E Upper Threshold/Pause Packets 0x00000D80 Lower Threshold/Pause Packets 0x000003A0 Upper Threshold/High Priority 0x00000AE0 Lower Threshold/High Priority 0x00000740 Packet Counter 0x00000029 Level 0x00000E7B Test 0x0002221A Sync Transmit RAMbuffer 1 --------------- Start 
Address 0x00000000 End Address 0x00000000 Write Pointer 0x00000000 Read Pointer 0x00000000 Packet Counter 0x00000000 Level 0x00000000 Test 0x00000000 Async Transmit RAMbuffer 1 --------------- Start Address 0x00000E80 End Address 0x000017FF Write Pointer 0x0000132A Read Pointer 0x0000132A Packet Counter 0x00000000 Level 0x00000000 Test 0x0002222A i don't know if it helps but i am also including the output of ethtool while the card was still working: PCI config ---------- 00: ab 11 62 43 07 04 10 00 15 00 00 02 08 00 00 00 10: 04 c0 df fd 00 00 00 00 01 ce 00 00 00 00 00 00 20: 00 00 00 00 00 00 00 00 00 00 00 00 62 14 8c 05 30: 00 00 00 00 48 00 00 00 00 00 00 00 03 01 00 00 40: 00 00 f0 01 00 80 a0 01 01 50 02 fe 00 20 00 14 50: 03 5c 00 80 00 00 00 01 00 00 00 01 05 e0 83 00 60: 0c 10 e0 fe 00 00 00 00 61 41 00 00 00 00 00 00 70: 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 Control Registers ----------------- Register Access Port 0x00 LED Control/Status 0xA603164A Interrupt Source 0x00000000 Interrupt Mask 0xC000001D Interrupt Hardware Error Source 0x00000000 Interrupt Hardware Error Mask 0x2E003F3F Bus Management Unit ------------------- CSR Receive Queue 1 0x00010000 CSR Sync Queue 1 0xFFFFFFFF CSR Async Queue 1 0x00000000 MAC Addresses --------------- Addr 1 00 11 09 DA 39 A3 Addr 2 00 11 09 DA 39 A3 Addr 3 00 00 00 00 00 00 Connector type 0x4A (J) PMD type 0x54 (T) PHY type 0x80 Chip Id 0xB6 Yukon-2 EC (rev 0) Ram Buffer 0x0C Status BMU: ----------- Control 0x0002220A Last Index 0x07FF Put Index 0x00B8 List Address 0x000000007FBF8000 Transmit 1 done index 0x0057 Transmit index threshold 0x000A Status FIFO Write Pointer 0x08 Read Pointer 0x08 Level 0x00 Watermark 0x10 ISR Watermark 0x10 Status level Init 0x000030D4 Value 0x000030D4 Test 0x04 Control 0x02 TX status Init 0x0001E848 Value 0x0001E848 Test 0x04 Control 0x02 ISR Init 0x000009C4 Value 0x000009C4 Test 0x04 Control 0x02 GMAC control 0x005A GPHY control 0x2002 LINK control 0x02 GMAC 1 Status 0xD000 
Control 0x1800 Transmit 0x1000 Receive 0xE000 Transmit flow control 0xFFFF Transmit parameter 0xD7C4 Serial mode 0x221E Source address: 00 11 09 DA 39 A3 Physical address: 00 11 09 DA 39 A3 Rx GMAC 1 End Address 0x0000007F Almost Full Thresh 0x00000070 Control/Test 0x0900228A FIFO Flush Mask 0x000018FB FIFO Flush Threshold 0x0000000B Truncation Threshold 0x0000017C Upper Pause Threshold 0x00000000 Lower Pause Threshold 0x00000081 VLAN Tag 0x00000027 FIFO Write Pointer 0x00000000 FIFO Write Level 0x00000000 FIFO Read Pointer 0x00000000 FIFO Read Level 0x00000027 Tx GMAC 1 End Address 0x0000007F Almost Full Thresh 0x00000010 Control/Test 0x0102220A FIFO Flush Mask 0x00000000 FIFO Flush Threshold 0x00000000 Truncation Threshold 0x00000000 Upper Pause Threshold 0x00000000 Lower Pause Threshold 0x00000081 VLAN Tag 0x00000032 FIFO Write Pointer 0x00000032 FIFO Write Level 0x00000000 FIFO Read Pointer 0x00000000 FIFO Read Level 0x00000032 Receive Queue 1 --------------- Buffer control 0x05F8 Byte Counter 49408 Descriptor Address 0x000000001727E010 Status 0x003C0100 Timestamp 0x00000000 BMU Control/Status 0x000061AA Done 0x0000 Request 0x000000001727E010 Csum1 Offset 12632 Piston 14 Csum2 Offset 12632 Positing 14 Sync Transmit Queue 1 --------------- Descriptor Address 0x0000000000000000 Address Counter 0x0000000000000000 Current Byte Counter 0 BMU Control/Status 0x00000000 Flag & FIFO Address 0x00000000 Control 0x00000000 Next 0x00000000 Data 0x0000000000000000 Status 0x00000000 Timestamp 0x00000000 Csum Start 0x0000 Pos 0 Write 0 Async Transmit Queue 1 --------------- Buffer control 0x06CC Byte Counter 49950 Descriptor Address 0x0000000046AD23C6 Status 0x000005EA Timestamp 0x00010000 BMU Control/Status 0x800011AA Done 0x0000 Request 0x0000000046AD2A92 Csum Start 0x0032 Pos 0 Write 0 Receive RAMbuffer 1 --------------- Start Address 0x00000000 End Address 0x00000E7F Write Pointer 0x00000427 Read Pointer 0x00000427 Upper Threshold/Pause Packets 0x00000D80 Lower 
Threshold/Pause Packets 0x000003A0 Upper Threshold/High Priority 0x00000AE0 Lower Threshold/High Priority 0x00000740 Packet Counter 0x00000000 Level 0x00000000 Test 0x0002221A Sync Transmit RAMbuffer 1 --------------- Start Address 0x00000000 End Address 0x00000000 Write Pointer 0x00000000 Read Pointer 0x00000000 Packet Counter 0x00000000 Level 0x00000000 Test 0x00000000 Async Transmit RAMbuffer 1 --------------- Start Address 0x00000E80 End Address 0x000017FF Write Pointer 0x000017B2 Read Pointer 0x000017B2 Packet Counter 0x00000000 Level 0x00000000 Test 0x0002222A i'll try to lock up the networking again and if it still happens i'll swith to the vendor driver and see what that has to say. --alex-- -- | I believe the moment is at hand when, by a paranoiac and active | | advance of the mind, it will be possible (simultaneously with | | automatism and other passive states) to systematize confusion | | and thus to help to discredit completely the world of reality. | ^ permalink raw reply [flat|nested] 14+ messages in thread
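
Comparing the hung and working register dumps by eye is error-prone, so here is a small Python sketch that diffs two saved `ethtool -d eth0` outputs. It is only an illustration: the helper names `parse_dump` and `diff_dumps` are hypothetical, and the parser assumes the layout seen in the dumps above (section titles underlined with dashes, then "label 0xVALUE" pairs, sometimes two per line). Decimal-only fields and the raw PCI config bytes are ignored.

```python
import re
from collections import defaultdict

# A label followed by a hex value, e.g. "Interrupt Source 0x40000000".
PAIR = re.compile(r"([A-Za-z][A-Za-z0-9 /&]*?)\s+(0x[0-9A-Fa-f]+)")
DASHES = re.compile(r"^\s*-{3,}\s*$")


def parse_dump(text):
    """Map (section, label, occurrence) -> integer register value."""
    lines = text.splitlines()
    regs = {}
    section = ""
    seen = defaultdict(int)
    for i, line in enumerate(lines):
        if DASHES.match(line):
            continue
        nxt = lines[i + 1] if i + 1 < len(lines) else ""
        if DASHES.match(nxt):
            # A title underlined with dashes starts a new section.
            section = line.strip().rstrip(":")
            continue
        for label, value in PAIR.findall(line):
            label = label.strip()
            n = seen[(section, label)]  # labels such as "Init" repeat
            seen[(section, label)] += 1
            regs[(section, label, n)] = int(value, 16)
    return regs


def diff_dumps(working, hung):
    """Return (key, working_value, hung_value) for every register that differs."""
    a, b = parse_dump(working), parse_dump(hung)
    return [(k, a.get(k), b.get(k))
            for k in sorted(set(a) | set(b)) if a.get(k) != b.get(k)]


if __name__ == "__main__":
    import sys
    with open(sys.argv[1]) as f1, open(sys.argv[2]) as f2:
        for (sec, label, n), old, new in diff_dumps(f1.read(), f2.read()):
            fmt = lambda v: "missing" if v is None else hex(v)
            print(f"{sec} / {label}[{n}]: {fmt(old)} -> {fmt(new)}")
```

Run it as `python diff_dumps.py working.txt hung.txt` (hypothetical file names) after saving each dump to a file. On the two dumps above it would surface, among others, the Interrupt Source, Put Index and Status level Value differences.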
* Re: 2.6.20-rc1 sky2 problems (regression?)
  2006-12-14 22:25 ` Alex Romosan
@ 2006-12-14 22:47 ` Stephen Hemminger
  2006-12-14 22:57 ` Alex Romosan
  ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Stephen Hemminger @ 2006-12-14 22:47 UTC
To: Alex Romosan; +Cc: netdev

On Thu, 14 Dec 2006 14:25:06 -0800
Alex Romosan <romosan@sycorax.lbl.gov> wrote:

> Stephen Hemminger <shemminger@osdl.org> writes:
>
> > 4) What is the IRQ routing?
> > There are two issues here, first the driver will never work with edge
> > trigger IRQ's, some motherboards also have busted BIOS and chipsets
> > that don't do MSI properly. A couple of module parameters are available
> > to help:
> > disable_msi=1 avoids using MSI
> > idle_timeout=10 polls for lost IRQ's every N ms (10)
>
> i didn't take long to lock up the machine again. i've rebooted back
> into stock 2.6.20-rc1 and added the two module parameters above. cat
> /proc/interrupts now gives me:
>
> 17: 203 IO-APIC-fasteoi eth0, CMI8738
>
> so i guess the MSI interrupts are disabled. we'll see how this works.

Probably won't do much, but now the IRQ ends up shared.

> > 5) What are the messages in the console log when problem happens?
>
> kernel: NETDEV WATCHDOG: eth0: transmit timed out
> kernel: sky2 eth0: tx timeout
> kernel: sky2 eth0: transmit ring 402 .. 361 report=406 done=406
> kernel: sky2 status report lost?

The transmit timeout code tries to be smart, but doesn't really recover
properly if the hardware is stuck.

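
To make the watchdog messages easier to scan in a long syslog, here is a rough Python sketch that pulls the numbers out of the "transmit ring" lines. The classification rule is only inferred from the log pairs quoted in this thread (when the first ring index equals `report`, the next line reads "hardware hung? flushing"; otherwise it reads "status report lost?"). It is not taken from the driver source, and the function name `classify_tx_timeout` is hypothetical.

```python
import re

RING = re.compile(r"transmit ring (\d+) \.\. (\d+) report=(\d+) done=(\d+)")


def classify_tx_timeout(line):
    """Guess which follow-up message matches a 'transmit ring' log line.

    Returns None when the line is not a transmit ring report.
    The rule is an assumption drawn from the logs in this thread.
    """
    m = RING.search(line)
    if m is None:
        return None
    first, _second, report, done = map(int, m.groups())
    if report != done:
        # No example of this case appears in the thread.
        return "report and done differ"
    if report != first:
        return "status report lost?"
    return "hardware hung? flushing"


if __name__ == "__main__":
    import sys
    for log_line in sys.stdin:
        verdict = classify_tx_timeout(log_line)
        if verdict:
            print(verdict, "<-", log_line.strip())
```

Feeding it a syslog on standard input prints one verdict per transmit ring line, which makes the alternating lost/hung pattern reported above easy to spot.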
> > 7) Please get a current version of ethtool from: > > git://git.kernel.org/pub/scm/network/ethtool/ethtool.git > > and run ethtool register dump after a problem occurs: > > ethtool -d eth0 > > this is the output after it stopped working: > > > PCI config > ---------- > 00: ab 11 62 43 07 04 18 00 15 00 00 02 08 00 00 00 > 10: 04 c0 df fd 00 00 00 00 01 ce 00 00 00 00 00 00 > 20: 00 00 00 00 00 00 00 00 00 00 00 00 62 14 8c 05 > 30: 00 00 00 00 48 00 00 00 00 00 00 00 03 01 00 00 > 40: 00 00 f0 01 00 80 a0 01 01 50 02 fe 00 20 00 14 > 50: 03 5c 00 80 00 00 00 01 00 00 00 01 05 e0 83 00 > 60: 0c 10 e0 fe 00 00 00 00 61 41 00 00 00 00 00 00 > 70: 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > Control Registers > ----------------- > Register Access Port 0x00 > LED Control/Status 0xA603164A > Interrupt Source 0x40000000 > Interrupt Mask 0xC000001D > Interrupt Hardware Error Source 0x00000000 > Interrupt Hardware Error Mask 0x2E003F3F > > Bus Management Unit > ------------------- > CSR Receive Queue 1 0x00010000 > CSR Sync Queue 1 0xFFFFFFFF > CSR Async Queue 1 0x00000000 > > MAC Addresses > --------------- > Addr 1 00 11 09 DA 39 A3 > Addr 2 00 11 09 DA 39 A3 > Addr 3 00 00 00 00 00 00 > > Connector type 0x4A (J) > PMD type 0x54 (T) > PHY type 0x80 > Chip Id 0xB6 Yukon-2 EC > (rev 0) > Ram Buffer 0x0C > > Status BMU: > ----------- > Control 0x0002220A > Last Index 0x07FF > Put Index 0x0601 > List Address 0x000000007FBF8000 > Transmit 1 done index 0x0196 > Transmit index threshold 0x000A > > Status FIFO > Write Pointer 0x16 > Read Pointer 0x16 > Level 0x00 > Watermark 0x10 > ISR Watermark 0x10 > Status level > Init 0x000030D4 Value 0x00000D00 > Test 0x04 Control 0x02 > TX status > Init 0x0001E848 Value 0x0001E848 > Test 0x04 Control 0x02 > ISR > Init 0x000009C4 Value 0x000009C4 > Test 0x04 Control 0x02 > > GMAC control 0x005A > GPHY control 0x2002 > LINK control 0x02 > > GMAC 1 > Status 0xD000 > Control 0x1800 > Transmit 0x1000 > Receive 0xE000 > Transmit flow 
control 0xFFFF > Transmit parameter 0xD7C4 > Serial mode 0x221E > Source address: 00 11 09 DA 39 A3 > Physical address: 00 11 09 DA 39 A3 > > Rx GMAC 1 > End Address 0x0000007F > Almost Full Thresh 0x00000070 > Control/Test 0x0900228A > FIFO Flush Mask 0x000018FB > FIFO Flush Threshold 0x0000000B > Truncation Threshold 0x0000017C > Upper Pause Threshold 0x00000000 > Lower Pause Threshold 0x00000081 > VLAN Tag 0x00000074 > FIFO Write Pointer 0x00000000 > FIFO Write Level 0x0000007B > FIFO Read Pointer 0x00000000 > FIFO Read Level 0x00000079 > > Tx GMAC 1 > End Address 0x0000007F > Almost Full Thresh 0x00000010 > Control/Test 0x0102220A > FIFO Flush Mask 0x00000000 > FIFO Flush Threshold 0x00000000 > Truncation Threshold 0x00000000 > Upper Pause Threshold 0x00000000 > Lower Pause Threshold 0x00000081 > VLAN Tag 0x0000002A > FIFO Write Pointer 0x0000002A > FIFO Write Level 0x00000000 > FIFO Read Pointer 0x00000000 > FIFO Read Level 0x0000002A > > Receive Queue 1 > --------------- > Buffer control 0x05F8 > Byte Counter 49408 > Descriptor Address 0x0000000076F4F810 > Status 0x05EA0100 > Timestamp 0x00000000 > BMU Control/Status 0x000061AA > Done 0x0000 > Request 0x0000000076F4F810 > Csum1 Offset 52057 Piston 14 > Csum2 Offset 52057 Positing 14 > > Sync Transmit Queue 1 > --------------- > Descriptor Address 0x0000000000000000 > Address Counter 0x0000000000000000 > Current Byte Counter 0 > BMU Control/Status 0x00000000 > Flag & FIFO Address 0x00000000 > > Control 0x00000000 > Next 0x00000000 > Data 0x0000000000000000 > Status 0x00000000 > Timestamp 0x00000000 > Csum Start 0x0000 Pos 0 Write 0 > > Async Transmit Queue 1 > --------------- > Buffer control 0x053D > Byte Counter 49950 > Descriptor Address 0x0000000047237000 > Status 0x000005EA > Timestamp 0x00010000 > BMU Control/Status 0x800011AA > Done 0x0000 > Request 0x000000004723753D > Csum Start 0x0032 Pos 0 Write 0 > > Receive RAMbuffer 1 > --------------- > Start Address 0x00000000 > End Address 0x00000E7F > Write 
Pointer 0x00000079 > Read Pointer 0x0000007E > Upper Threshold/Pause Packets 0x00000D80 > Lower Threshold/Pause Packets 0x000003A0 > Upper Threshold/High Priority 0x00000AE0 > Lower Threshold/High Priority 0x00000740 > Packet Counter 0x00000029 > Level 0x00000E7B > Test 0x0002221A > > Sync Transmit RAMbuffer 1 > --------------- > Start Address 0x00000000 > End Address 0x00000000 > Write Pointer 0x00000000 > Read Pointer 0x00000000 > Packet Counter 0x00000000 > Level 0x00000000 > Test 0x00000000 > > Async Transmit RAMbuffer 1 > --------------- > Start Address 0x00000E80 > End Address 0x000017FF > Write Pointer 0x0000132A > Read Pointer 0x0000132A > Packet Counter 0x00000000 > Level 0x00000000 > Test 0x0002222A > > i don't know if it helps but i am also including the output of ethtool > while the card was still working: > > > PCI config > ---------- > 00: ab 11 62 43 07 04 10 00 15 00 00 02 08 00 00 00 > 10: 04 c0 df fd 00 00 00 00 01 ce 00 00 00 00 00 00 > 20: 00 00 00 00 00 00 00 00 00 00 00 00 62 14 8c 05 > 30: 00 00 00 00 48 00 00 00 00 00 00 00 03 01 00 00 > 40: 00 00 f0 01 00 80 a0 01 01 50 02 fe 00 20 00 14 > 50: 03 5c 00 80 00 00 00 01 00 00 00 01 05 e0 83 00 > 60: 0c 10 e0 fe 00 00 00 00 61 41 00 00 00 00 00 00 > 70: 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 > > Control Registers > ----------------- > Register Access Port 0x00 > LED Control/Status 0xA603164A > Interrupt Source 0x00000000 > Interrupt Mask 0xC000001D > Interrupt Hardware Error Source 0x00000000 > Interrupt Hardware Error Mask 0x2E003F3F > > Bus Management Unit > ------------------- > CSR Receive Queue 1 0x00010000 > CSR Sync Queue 1 0xFFFFFFFF > CSR Async Queue 1 0x00000000 > > MAC Addresses > --------------- > Addr 1 00 11 09 DA 39 A3 > Addr 2 00 11 09 DA 39 A3 > Addr 3 00 00 00 00 00 00 > > Connector type 0x4A (J) > PMD type 0x54 (T) > PHY type 0x80 > Chip Id 0xB6 Yukon-2 EC > (rev 0) > Ram Buffer 0x0C > > Status BMU: > ----------- > Control 0x0002220A > Last Index 0x07FF > Put Index 
0x00B8 > List Address 0x000000007FBF8000 > Transmit 1 done index 0x0057 > Transmit index threshold 0x000A > > Status FIFO > Write Pointer 0x08 > Read Pointer 0x08 > Level 0x00 > Watermark 0x10 > ISR Watermark 0x10 > Status level > Init 0x000030D4 Value 0x000030D4 > Test 0x04 Control 0x02 > TX status > Init 0x0001E848 Value 0x0001E848 > Test 0x04 Control 0x02 > ISR > Init 0x000009C4 Value 0x000009C4 > Test 0x04 Control 0x02 > > GMAC control 0x005A > GPHY control 0x2002 > LINK control 0x02 > > GMAC 1 > Status 0xD000 > Control 0x1800 > Transmit 0x1000 > Receive 0xE000 > Transmit flow control 0xFFFF > Transmit parameter 0xD7C4 > Serial mode 0x221E > Source address: 00 11 09 DA 39 A3 > Physical address: 00 11 09 DA 39 A3 > > Rx GMAC 1 > End Address 0x0000007F > Almost Full Thresh 0x00000070 > Control/Test 0x0900228A > FIFO Flush Mask 0x000018FB > FIFO Flush Threshold 0x0000000B > Truncation Threshold 0x0000017C > Upper Pause Threshold 0x00000000 > Lower Pause Threshold 0x00000081 > VLAN Tag 0x00000027 > FIFO Write Pointer 0x00000000 > FIFO Write Level 0x00000000 > FIFO Read Pointer 0x00000000 > FIFO Read Level 0x00000027 > > Tx GMAC 1 > End Address 0x0000007F > Almost Full Thresh 0x00000010 > Control/Test 0x0102220A > FIFO Flush Mask 0x00000000 > FIFO Flush Threshold 0x00000000 > Truncation Threshold 0x00000000 > Upper Pause Threshold 0x00000000 > Lower Pause Threshold 0x00000081 > VLAN Tag 0x00000032 > FIFO Write Pointer 0x00000032 > FIFO Write Level 0x00000000 > FIFO Read Pointer 0x00000000 > FIFO Read Level 0x00000032 > > Receive Queue 1 > --------------- > Buffer control 0x05F8 > Byte Counter 49408 > Descriptor Address 0x000000001727E010 > Status 0x003C0100 > Timestamp 0x00000000 > BMU Control/Status 0x000061AA > Done 0x0000 > Request 0x000000001727E010 > Csum1 Offset 12632 Piston 14 > Csum2 Offset 12632 Positing 14 > > Sync Transmit Queue 1 > --------------- > Descriptor Address 0x0000000000000000 > Address Counter 0x0000000000000000 > Current Byte Counter 0 > BMU 
Control/Status 0x00000000 > Flag & FIFO Address 0x00000000 > > Control 0x00000000 > Next 0x00000000 > Data 0x0000000000000000 > Status 0x00000000 > Timestamp 0x00000000 > Csum Start 0x0000 Pos 0 Write 0 > > Async Transmit Queue 1 > --------------- > Buffer control 0x06CC > Byte Counter 49950 > Descriptor Address 0x0000000046AD23C6 > Status 0x000005EA > Timestamp 0x00010000 > BMU Control/Status 0x800011AA > Done 0x0000 > Request 0x0000000046AD2A92 > Csum Start 0x0032 Pos 0 Write 0 > > Receive RAMbuffer 1 > --------------- > Start Address 0x00000000 > End Address 0x00000E7F > Write Pointer 0x00000427 > Read Pointer 0x00000427 > Upper Threshold/Pause Packets 0x00000D80 > Lower Threshold/Pause Packets 0x000003A0 > Upper Threshold/High Priority 0x00000AE0 > Lower Threshold/High Priority 0x00000740 > Packet Counter 0x00000000 > Level 0x00000000 > Test 0x0002221A > > Sync Transmit RAMbuffer 1 > --------------- > Start Address 0x00000000 > End Address 0x00000000 > Write Pointer 0x00000000 > Read Pointer 0x00000000 > Packet Counter 0x00000000 > Level 0x00000000 > Test 0x00000000 > > Async Transmit RAMbuffer 1 > --------------- > Start Address 0x00000E80 > End Address 0x000017FF > Write Pointer 0x000017B2 > Read Pointer 0x000017B2 > Packet Counter 0x00000000 > Level 0x00000000 > Test 0x0002222A
>
> i'll try to lock up the networking again and if it still happens i'll
> switch to the vendor driver and see what that has to say.
>

Another useful bit of information is the statistics (ethtool -S eth0).
When there were flow control bugs, they would show up as count of 1.

Are you doing jumbo frames (MTU > 1500)?

-- 
Stephen Hemminger <shemminger@osdl.org>
* Re: 2.6.20-rc1 sky2 problems (regression?)
  2006-12-14 22:47 ` Stephen Hemminger
@ 2006-12-14 22:57   ` Alex Romosan
  2006-12-14 23:08   ` Alex Romosan
  2006-12-14 23:21   ` Alex Romosan
  2 siblings, 0 replies; 14+ messages in thread
From: Alex Romosan @ 2006-12-14 22:57 UTC
To: Stephen Hemminger; +Cc: netdev

Stephen Hemminger <shemminger@osdl.org> writes:

> Another useful bit of information is the statistics (ethtool -S eth0).
> When there were flow control bugs, they would show up as count of 1.

we'll see if the machine locks up again.

> Are you doing jumbo frames (MTU > 1500)?

no (or at least i don't think so). how can i tell?

assuming the machine doesn't lock up with msi interrupts disabled, do
you want me to do anything to debug why the driver locks up when the
msi interrupts are enabled?

--alex--

-- 
| I believe the moment is at hand when, by a paranoiac and active |
| advance of the mind, it will be possible (simultaneously with   |
| automatism and other passive states) to systematize confusion   |
| and thus to help to discredit completely the world of reality.  |
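One way to answer the "how can i tell?" question above, as a minimal sketch: on Linux the configured MTU of an interface is exposed in sysfs as /sys/class/net/<iface>/mtu. The helper names read_mtu and uses_jumbo_frames are made up for this illustration and are not part of any tool mentioned in the thread; the 1500 threshold is the one from the question.

```python
from pathlib import Path

# Threshold taken from the question in the thread (MTU > 1500).
STANDARD_MTU = 1500


def read_mtu(iface: str, sysfs_root: str = "/sys/class/net") -> int:
    """Return the MTU the kernel reports for `iface` through sysfs.

    `sysfs_root` is a parameter only so the function can be pointed at a
    scratch directory for testing; on a real system the default is used.
    """
    return int(Path(sysfs_root, iface, "mtu").read_text().strip())


def uses_jumbo_frames(mtu: int) -> bool:
    """True when the MTU is above the standard Ethernet payload size."""
    return mtu > STANDARD_MTU


# Example call on a machine that has an eth0 interface:
#   uses_jumbo_frames(read_mtu("eth0"))
```

The same value is shown in the mtu field printed by `ifconfig eth0` or `ip link show eth0`.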
* Re: 2.6.20-rc1 sky2 problems (regression?)
  2006-12-14 22:47 ` Stephen Hemminger
  2006-12-14 22:57   ` Alex Romosan
@ 2006-12-14 23:08   ` Alex Romosan
  2006-12-14 23:21   ` Alex Romosan
  2 siblings, 0 replies; 14+ messages in thread
From: Alex Romosan @ 2006-12-14 23:08 UTC
To: Stephen Hemminger; +Cc: netdev

Stephen Hemminger <shemminger@osdl.org> writes:

> Another useful bit of information is the statistics (ethtool -S
> eth0). When there were flow control bugs, they would show up as
> count of 1.
>
> Are you doing jumbo frames (MTU > 1500)?

i just did 'ethtool -S eth0' (the card is still working fine) and i
don't think there are any jumbo frames. anyway, this is the output:

NIC statistics:
     tx_bytes: 2697577533
     rx_bytes: 503104106
     tx_broadcast: 18
     rx_broadcast: 4068
     tx_multicast: 0
     rx_multicast: 416
     tx_unicast: 2276028
     rx_unicast: 1359009
     tx_mac_pause: 0
     rx_mac_pause: 0
     collisions: 0
     late_collision: 0
     aborted: 0
     single_collisions: 0
     multi_collisions: 0
     rx_short: 0
     rx_runt: 0
     rx_64_byte_packets: 713826
     rx_65_to_127_byte_packets: 271861
     rx_128_to_255_byte_packets: 57307
     rx_256_to_511_byte_packets: 25689
     rx_512_to_1023_byte_packets: 28502
     rx_1024_to_1518_byte_packets: 266308
     rx_1518_to_max_byte_packets: 0
     rx_too_long: 0
     rx_fifo_overflow: 0
     rx_jabber: 0
     rx_fcs_error: 0
     tx_64_byte_packets: 174188
     tx_65_to_127_byte_packets: 225242
     tx_128_to_255_byte_packets: 44294
     tx_256_to_511_byte_packets: 24475
     tx_512_to_1023_byte_packets: 80147
     tx_1024_to_1518_byte_packets: 1727700
     tx_1519_to_max_byte_packets: 0
     tx_fifo_underrun: 0

--alex--
* Re: 2.6.20-rc1 sky2 problems (regression?)
  2006-12-14 22:47 ` Stephen Hemminger
  2006-12-14 22:57   ` Alex Romosan
  2006-12-14 23:08   ` Alex Romosan
@ 2006-12-14 23:21   ` Alex Romosan
  2006-12-14 23:31     ` Stephen Hemminger
  2 siblings, 1 reply; 14+ messages in thread
From: Alex Romosan @ 2006-12-14 23:21 UTC
To: Stephen Hemminger; +Cc: netdev

Stephen Hemminger <shemminger@osdl.org> writes:

> Another useful bit of information is the statistics (ethtool -S eth0).
> When there were flow control bugs, they would show up as count of 1.

the driver locked up again, even with msi interrupts disabled and
idle_timeout=10. the console message was pretty much as before:

kernel: NETDEV WATCHDOG: eth0: transmit timed out
kernel: sky2 eth0: tx timeout
kernel: sky2 eth0: transmit ring 336 .. 296 report=336 done=336
kernel: sky2 hardware hung? flushing
kernel: NETDEV WATCHDOG: eth0: transmit timed out
kernel: sky2 eth0: tx timeout
kernel: sky2 eth0: transmit ring 296 .. 255 report=336 done=336
kernel: sky2 status report lost?

and this is the output from ethtool -S:

NIC statistics:
     tx_bytes: 3092123897
     rx_bytes: 546577898
     tx_broadcast: 20
     rx_broadcast: 4376
     tx_multicast: 0
     rx_multicast: 459
     tx_unicast: 2585993
     rx_unicast: 1550758
     tx_mac_pause: 1
     rx_mac_pause: 0
     collisions: 0
     late_collision: 0
     aborted: 0
     single_collisions: 0
     multi_collisions: 0
     rx_short: 0
     rx_runt: 0
     rx_64_byte_packets: 850693
     rx_65_to_127_byte_packets: 297029
     rx_128_to_255_byte_packets: 62116
     rx_256_to_511_byte_packets: 28795
     rx_512_to_1023_byte_packets: 31357
     rx_1024_to_1518_byte_packets: 285603
     rx_1518_to_max_byte_packets: 0
     rx_too_long: 0
     rx_fifo_overflow: 0
     rx_jabber: 0
     rx_fcs_error: 0
     tx_64_byte_packets: 194159
     tx_65_to_127_byte_packets: 239961
     tx_128_to_255_byte_packets: 48148
     tx_256_to_511_byte_packets: 27635
     tx_512_to_1023_byte_packets: 95557
     tx_1024_to_1518_byte_packets: 1980554
     tx_1519_to_max_byte_packets: 0
     tx_fifo_underrun: 0

time to try the vendor driver and see if that provides any clues.

--alex--
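The two "transmit ring" lines in the watchdog output above end in different diagnoses even though report and done are 336 in both. The sketch below parses such a line and applies one decision rule that reproduces both messages. The rule, and the reading of the first two numbers as consumer and producer ring indices, are assumptions inferred only from these two log excerpts; this is not a quote of the sky2 source, and the function names are hypothetical.

```python
import re

# Matches the "transmit ring A .. B report=R done=D" line from the log.
# Treating A as the consumer index and B as the producer index is an
# assumption made for this sketch.
RING_RE = re.compile(
    r"transmit ring (?P<cons>\d+) \.\. (?P<prod>\d+) "
    r"report=(?P<report>\d+) done=(?P<done>\d+)"
)


def parse_ring_line(line: str) -> dict:
    """Extract the four counters from a watchdog 'transmit ring' line."""
    match = RING_RE.search(line)
    if match is None:
        raise ValueError("not a transmit ring line: " + line)
    return {name: int(value) for name, value in match.groupdict().items()}


def diagnose(ring: dict) -> str:
    """Guess which follow-up message goes with the parsed counters.

    Assumed rule, chosen because it matches both excerpts in the thread:
    - report and done agree but differ from cons -> "status report lost?"
    - report, done and cons all agree            -> "hardware hung? flushing"
    The case where report and done disagree does not occur in the thread,
    so it only gets a generic label here.
    """
    if ring["report"] != ring["done"]:
        return "report and done disagree"
    if ring["report"] != ring["cons"]:
        return "status report lost?"
    return "hardware hung? flushing"
```

Read this way, the first timeout found nothing outstanding that the hardware had reported, while by the second timeout the ring had moved on to 296 with the report index still stuck at 336.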
* Re: 2.6.20-rc1 sky2 problems (regression?)
  2006-12-14 23:21 ` Alex Romosan
@ 2006-12-14 23:31   ` Stephen Hemminger
  2006-12-15 1:04     ` Alex Romosan
  0 siblings, 1 reply; 14+ messages in thread
From: Stephen Hemminger @ 2006-12-14 23:31 UTC
To: Alex Romosan; +Cc: netdev

On Thu, 14 Dec 2006 15:21:00 -0800
Alex Romosan <romosan@sycorax.lbl.gov> wrote:

> Stephen Hemminger <shemminger@osdl.org> writes:
>
> > Another useful bit of information is the statistics (ethtool -S eth0).
> > When there were flow control bugs, they would show up as count of 1.
>
> the driver locked up again, even with msi interrupts disabled and
> idle_timeout=10. the console message was pretty much as before:
>
> kernel: NETDEV WATCHDOG: eth0: transmit timed out
> kernel: sky2 eth0: tx timeout
> kernel: sky2 eth0: transmit ring 336 .. 296 report=336 done=336
> kernel: sky2 hardware hung? flushing
> kernel: NETDEV WATCHDOG: eth0: transmit timed out
> kernel: sky2 eth0: tx timeout
> kernel: sky2 eth0: transmit ring 296 .. 255 report=336 done=336
> kernel: sky2 status report lost?
>
> and this is the output from ethtool -S:
>
> NIC statistics:
>      tx_bytes: 3092123897
>      rx_bytes: 546577898
>      tx_broadcast: 20
>      rx_broadcast: 4376
>      tx_multicast: 0
>      rx_multicast: 459
>      tx_unicast: 2585993
>      rx_unicast: 1550758
>      tx_mac_pause: 1

If this is repeatable... and mac_pause is always one then the problem
is hardware flow control. I saw bugs before in the bus interface where
it would not resume on unaligned buffer, but that was on receive.
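To check whether the pause counter really is the only thing that changes between a healthy run and a hang, the two `ethtool -S` dumps can be compared mechanically. The following is a rough sketch, not a tool from the thread: parse_stats and newly_nonzero are hypothetical helper names, and the sample strings contain only a handful of counters copied from the two dumps posted earlier.

```python
import re

# "name: value" pairs as printed by `ethtool -S`; the "NIC statistics:"
# heading is skipped because it is not followed by a number.
STAT_RE = re.compile(r"(\w+):\s+(\d+)")


def parse_stats(text: str) -> dict:
    """Turn `ethtool -S` output into a {counter_name: value} mapping."""
    return {name: int(value) for name, value in STAT_RE.findall(text)}


def newly_nonzero(before: dict, after: dict) -> list:
    """Names of counters that were 0 (or absent) before and are > 0 after."""
    return sorted(
        name for name, value in after.items()
        if value > 0 and before.get(name, 0) == 0
    )


# Excerpts from the two dumps in this thread (card working vs. locked up).
WORKING = """NIC statistics:
     tx_unicast: 2276028
     tx_mac_pause: 0
     rx_mac_pause: 0
     rx_fifo_overflow: 0
     tx_fifo_underrun: 0
"""
LOCKED_UP = """NIC statistics:
     tx_unicast: 2585993
     tx_mac_pause: 1
     rx_mac_pause: 0
     rx_fifo_overflow: 0
     tx_fifo_underrun: 0
"""

suspects = newly_nonzero(parse_stats(WORKING), parse_stats(LOCKED_UP))
```

Filtering on "was zero, is now non-zero" hides the traffic counters that grow all the time and leaves only counters such as tx_mac_pause, which is the pattern described in the reply above.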
* Re: 2.6.20-rc1 sky2 problems (regression?)
  2006-12-14 23:31 ` Stephen Hemminger
@ 2006-12-15 1:04   ` Alex Romosan
  2006-12-15 2:24     ` Herbert Xu
  0 siblings, 1 reply; 14+ messages in thread
From: Alex Romosan @ 2006-12-15 1:04 UTC
To: Stephen Hemminger; +Cc: netdev

Stephen Hemminger <shemminger@osdl.org> writes:

> If this is repeatable... and mac_pause is always one then the
> problem is hardware flow control. I saw bugs before in the bus
> interface where it would not resume on unaligned buffer, but
> that was on receive.

i tried to switch over to the latest vendor driver but unfortunately
it doesn't work with kernel 2.6.19+. it still uses CHECKSUM_HW, which
looks like it was replaced by CHECKSUM_PARTIAL, and CHECKSUM_COMPLETE
was also added. i think i can replace CHECKSUM_HW in the marvell
driver with CHECKSUM_PARTIAL, except for a couple of places where i
am not sure what i am supposed to do. the first instance says (i am
kind of paraphrasing here since i am copying from the screen and not
cutting and pasting):

  /** does the HW need to evaluate checksum for TCP or UDP packets?
  if (pMessage->ip_summed == CHECKSUM_HW)

maybe this needs to be replaced with CHECKSUM_PARTIAL. the second one

  /** TCP checksum offload
  if ((pSKPacket->pMbuf->ip_summed == CHECKSUM_HW) &&
      (SetOpcodePacketFlag == SK_TRUE)

i wonder if this is supposed to be CHECKSUM_COMPLETE

if you have any suggestions, i'll appreciate it.

--alex--
* Re: 2.6.20-rc1 sky2 problems (regression?)
  2006-12-15 1:04 ` Alex Romosan
@ 2006-12-15 2:24   ` Herbert Xu
  2006-12-15 3:22     ` Stephen Hemminger
  0 siblings, 1 reply; 14+ messages in thread
From: Herbert Xu @ 2006-12-15 2:24 UTC
To: Alex Romosan; +Cc: shemminger, netdev

Alex Romosan <romosan@sycorax.lbl.gov> wrote:
> /** does the HW need to evaluate checksum for TCP or UDP packets?
> if (pMessage->ip_summed == CHECKSUM_HW)
>
> maybe this needs to be replaced with CHECKSUM_PARTIAL. the second one
>
> /** TCP checksum offload
> if ((pSKPacket->pMbuf->ip_summed == CHECKSUM_HW) &&
>     (SetOpcodePacketFlag == SK_TRUE)
>
> i wonder if this is supposed to be CHECKSUM_COMPLETE

The rule of thumb is that it's COMPLETE for RX, and PARTIAL for TX.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
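The rule of thumb above can be applied mechanically when porting an out-of-tree driver, provided a person first decides whether each function sits on the receive or the transmit path. A small sketch under that assumption; port_checksum_hw is a hypothetical helper, and the receive-path input in the usage note is an invented line, not code from the vendor driver.

```python
import re

# Rule of thumb from the reply above: COMPLETE for RX, PARTIAL for TX.
REPLACEMENT = {
    "rx": "CHECKSUM_COMPLETE",
    "tx": "CHECKSUM_PARTIAL",
}

# Whole-identifier match, so a longer name that merely starts with
# CHECKSUM_HW is left untouched.
CHECKSUM_HW_RE = re.compile(r"\bCHECKSUM_HW\b")


def port_checksum_hw(source: str, direction: str) -> str:
    """Replace CHECKSUM_HW in `source` according to the data path.

    `direction` must be "rx" or "tx"; deciding which one applies to a
    given function is left to the person doing the port.
    """
    if direction not in REPLACEMENT:
        raise ValueError("direction must be 'rx' or 'tx'")
    return CHECKSUM_HW_RE.sub(REPLACEMENT[direction], source)
```

For the first transmit-side check quoted above, `port_checksum_hw("if (pMessage->ip_summed == CHECKSUM_HW)", "tx")` yields the same line with CHECKSUM_PARTIAL; a receive-side assignment would get CHECKSUM_COMPLETE instead.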
* Re: 2.6.20-rc1 sky2 problems (regression?)
  2006-12-15 2:24 ` Herbert Xu
@ 2006-12-15 3:22   ` Stephen Hemminger
  2006-12-15 3:53     ` Alex Romosan
  0 siblings, 1 reply; 14+ messages in thread
From: Stephen Hemminger @ 2006-12-15 3:22 UTC
To: Herbert Xu; +Cc: Alex Romosan, netdev

On Fri, 15 Dec 2006 13:24:32 +1100
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> Alex Romosan <romosan@sycorax.lbl.gov> wrote:
> > /** does the HW need to evaluate checksum for TCP or UDP packets?
> > if (pMessage->ip_summed == CHECKSUM_HW)
> >
> > maybe this needs to be replaced with CHECKSUM_PARTIAL. the second one
> >
> > /** TCP checksum offload
> > if ((pSKPacket->pMbuf->ip_summed == CHECKSUM_HW) &&
> >     (SetOpcodePacketFlag == SK_TRUE)
> >
> > i wonder if this is supposed to be CHECKSUM_COMPLETE
>
> The rule of thumb is that it's COMPLETE for RX, and PARTIAL for TX.
>
> Cheers,

I have a fixed up version of the vendor driver, I'll repackage it tomorrow.
* Re: 2.6.20-rc1 sky2 problems (regression?)
  2006-12-15 3:22 ` Stephen Hemminger
@ 2006-12-15 3:53   ` Alex Romosan
  2006-12-16 1:25     ` Stephen Hemminger
  0 siblings, 1 reply; 14+ messages in thread
From: Alex Romosan @ 2006-12-15 3:53 UTC
To: Stephen Hemminger; +Cc: Herbert Xu, netdev

Stephen Hemminger <shemminger@osdl.org> writes:

> I have a fixed up version of the vendor driver, I'll repackage it tomorrow.

as per the include file, i ended up replacing all the CHECKSUM_HW with
CHECKSUM_PARTIAL since the functions in question had to do with
transmit. seems to be working so far without any lockups. we'll see
how long this lasts.

--alex--
* Re: 2.6.20-rc1 sky2 problems (regression?)
  2006-12-15 3:53 ` Alex Romosan
@ 2006-12-16 1:25   ` Stephen Hemminger
  2006-12-16 1:53     ` Alex Romosan
  0 siblings, 1 reply; 14+ messages in thread
From: Stephen Hemminger @ 2006-12-16 1:25 UTC
To: Alex Romosan; +Cc: Herbert Xu, netdev

On Thu, 14 Dec 2006 19:53:45 -0800
Alex Romosan <romosan@sycorax.lbl.gov> wrote:

> Stephen Hemminger <shemminger@osdl.org> writes:
>
> > I have a fixed up version of the vendor driver, I'll repackage it tomorrow.
>
> as per the include file, i ended up replacing all the CHECKSUM_HW with
> CHECKSUM_PARTIAL since the functions in question had to do with
> transmit. seems to be working so far without any lockups. we'll see
> how long this lasts.
>
> --alex--
>

I fixed a bunch of stuff (see ChangeLog) and made a 2.6.19 or later
version see:
  http://developer.osdl.org/shemminger/prototypes/sk98lin-8.41.tar.gz

It is too noisy in the console log, because it shows how many times
the driver dope slaps itself senseless... Basically every 250ms when
it is idle it resets, sorry it's the kind of code you write to "make it work"
and ship it which is why vendor drivers suck.

-- 
Stephen Hemminger <shemminger@osdl.org>
* Re: 2.6.20-rc1 sky2 problems (regression?)
  2006-12-16 1:25 ` Stephen Hemminger
@ 2006-12-16 1:53   ` Alex Romosan
  0 siblings, 0 replies; 14+ messages in thread
From: Alex Romosan @ 2006-12-16 1:53 UTC
To: Stephen Hemminger; +Cc: Herbert Xu, netdev

Stephen Hemminger <shemminger@osdl.org> writes:

> I fixed a bunch of stuff (see ChangeLog) and made a 2.6.19 or later
> version see:
>   http://developer.osdl.org/shemminger/prototypes/sk98lin-8.41.tar.gz
>
> It is too noisy in the console log, because it shows how many times
> the driver dope slaps itself senseless... Basically every 250ms when
> it is idle it resets, sorry it's the kind of code you write to "make it work"
> and ship it which is why vendor drivers suck.

i'll give it a try on monday when i go back to work. in the meantime
i've been running with my "fixed" version of the vendor driver and so
far it's been working without any problems (i've been transferring
lots of data in and out of the computer the whole day). if there is
anything i can do to help debug the kernel sky2 driver let me know.

--alex--
Thread overview: 14+ messages
[not found] <87psammchi.fsf@sycorax.lbl.gov>
2006-12-14 21:30 ` 2.6.20-rc1 sky2 problems (regression?) Stephen Hemminger
2006-12-14 22:00 ` Alex Romosan
2006-12-14 22:25 ` Alex Romosan
2006-12-14 22:47 ` Stephen Hemminger
2006-12-14 22:57 ` Alex Romosan
2006-12-14 23:08 ` Alex Romosan
2006-12-14 23:21 ` Alex Romosan
2006-12-14 23:31 ` Stephen Hemminger
2006-12-15 1:04 ` Alex Romosan
2006-12-15 2:24 ` Herbert Xu
2006-12-15 3:22 ` Stephen Hemminger
2006-12-15 3:53 ` Alex Romosan
2006-12-16 1:25 ` Stephen Hemminger
2006-12-16 1:53 ` Alex Romosan