netdev.vger.kernel.org archive mirror
* Re: 2.6.20-rc1 sky2 problems (regression?)
       [not found] <87psammchi.fsf@sycorax.lbl.gov>
@ 2006-12-14 21:30 ` Stephen Hemminger
  2006-12-14 22:00   ` Alex Romosan
  2006-12-14 22:25   ` Alex Romosan
  0 siblings, 2 replies; 14+ messages in thread
From: Stephen Hemminger @ 2006-12-14 21:30 UTC
  To: Alex Romosan; +Cc: netdev

On Thu, 14 Dec 2006 12:47:05 -0800
Alex Romosan <romosan@sycorax.lbl.gov> wrote:

> under heavy network load the sky2 driver (compiled in the kernel)
> locks up and the only way i can get the network back is to reboot the
> machine (bringing the network down and back up again doesn't help).
> this happens on an amd64 machine (athlon 3500+ processor) and the card
> in question is a Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit
> Ethernet Controller (rev 15) (from lspci). this is what i see in the
> syslog:
> 
> kernel: sky2 eth0: rx error, status 0x414a414a length 0
> kernel: eth0: hw csum failure.
> kernel: 
> kernel: Call Trace:
> kernel:  <IRQ>  [<ffffffff8044681c>] __skb_checksum_complete+0x4d/0x66
> kernel:  [<ffffffff80477bc5>] tcp_v4_rcv+0x147/0x8ea
> kernel:  [<ffffffff80479ef2>] raw_rcv_skb+0x9/0x20
> kernel:  [<ffffffff8047a2ff>] raw_rcv+0xbe/0xc4
> kernel:  [<ffffffff8045ea9d>] ip_local_deliver+0x170/0x21b
> kernel:  [<ffffffff8045e8fa>] ip_rcv+0x478/0x4ab
> kernel:  [<ffffffff8044905d>] netif_receive_skb+0x184/0x20e
> kernel:  [<ffffffff803de8e5>] sky2_poll+0x68f/0x93c
> kernel:  [<ffffffff802219ce>] scheduler_tick+0x23/0x2f9
> kernel:  [<ffffffff8044a796>] net_rx_action+0x61/0xf0
> kernel:  [<ffffffff8022a35f>] __do_softirq+0x40/0x8a
> kernel:  [<ffffffff8020a3cc>] call_softirq+0x1c/0x28
> kernel:  [<ffffffff8020bbf0>] do_softirq+0x2c/0x7d
> kernel:  [<ffffffff8022a313>] irq_exit+0x36/0x42
> kernel:  [<ffffffff8020bebe>] do_IRQ+0x8c/0x9e
> kernel:  [<ffffffff80208710>] default_idle+0x0/0x3a
> kernel:  [<ffffffff80209bf1>] ret_from_intr+0x0/0xa
> kernel:  <EOI>  [<ffffffff80208736>] default_idle+0x26/0x3a
> kernel:  [<ffffffff8020878c>] cpu_idle+0x42/0x75
> kernel:  [<ffffffff805df675>] start_kernel+0x1ce/0x1d3
> kernel:  [<ffffffff805df140>] _sinittext+0x140/0x144
> kernel: 
> kernel: eth0: hw csum failure.
> kernel: 
> kernel: Call Trace:
> kernel:  <IRQ>  [<ffffffff8044681c>] __skb_checksum_complete+0x4d/0x66
> kernel:  [<ffffffff80477bc5>] tcp_v4_rcv+0x147/0x8ea
> kernel:  [<ffffffff80479ef2>] raw_rcv_skb+0x9/0x20
> kernel:  [<ffffffff8047a2ff>] raw_rcv+0xbe/0xc4
> kernel:  [<ffffffff8045ea9d>] ip_local_deliver+0x170/0x21b
> kernel:  [<ffffffff8045e8fa>] ip_rcv+0x478/0x4ab
> kernel:  [<ffffffff8044905d>] netif_receive_skb+0x184/0x20e
> kernel:  [<ffffffff803de8e5>] sky2_poll+0x68f/0x93c
> kernel:  [<ffffffff80474647>] tcp_delack_timer+0x0/0x1b5
> kernel:  [<ffffffff8044a796>] net_rx_action+0x61/0xf0
> kernel:  [<ffffffff8022a35f>] __do_softirq+0x40/0x8a
> kernel:  [<ffffffff8020a3cc>] call_softirq+0x1c/0x28
> kernel:  [<ffffffff8020bbf0>] do_softirq+0x2c/0x7d
> kernel:  [<ffffffff8022a313>] irq_exit+0x36/0x42
> kernel:  [<ffffffff8020bebe>] do_IRQ+0x8c/0x9e
> kernel:  [<ffffffff80209bf1>] ret_from_intr+0x0/0xa
> kernel:  <EOI>  [<ffffffff802a8402>] inode2sd+0x104/0x117
> kernel:  [<ffffffff802b8cfa>] search_by_key+0xa08/0xbfe
> kernel:  [<ffffffff802b8475>] search_by_key+0x183/0xbfe
> kernel:  [<ffffffff80284778>] ll_rw_block+0x89/0x9e
> kernel:  [<ffffffff802b8475>] search_by_key+0x183/0xbfe
> kernel:  [<ffffffff80283cf5>] __find_get_block_slow+0x101/0x10d
> kernel:  [<ffffffff80284053>] __find_get_block+0x197/0x1a5
> kernel:  [<ffffffff8026800c>] inode_get_bytes+0x2a/0x52
> kernel:  [<ffffffff802a89f1>] reiserfs_update_sd_size+0x7e/0x284
> kernel:  [<ffffffff80237700>] kthread+0xed/0xfd
> kernel:  [<ffffffff802be990>] do_journal_end+0x34b/0xbdd
> kernel:  [<ffffffff802b1729>] reiserfs_dirty_inode+0x56/0x76
> kernel:  [<ffffffff80284c19>] block_prepare_write+0x1a/0x24
> kernel:  [<ffffffff802809b1>] __mark_inode_dirty+0x29/0x197
> kernel:  [<ffffffff802a8d04>] reiserfs_commit_write+0x10d/0x19f
> kernel:  [<ffffffff80284c19>] block_prepare_write+0x1a/0x24
> kernel:  [<ffffffff802484fc>] generic_file_buffered_write+0x4ad/0x6c4
> kernel:  [<ffffffff80271b3c>] __pollwait+0x0/0xe0
> kernel:  [<ffffffff8022a006>] current_fs_time+0x35/0x3b
> kernel:  [<ffffffff80248a8c>] __generic_file_aio_write_nolock+0x379/0x3ec
> kernel:  [<ffffffff8049baca>] unix_dgram_recvmsg+0x1be/0x1d9
> kernel:  [<ffffffff804b6516>] __mutex_lock_slowpath+0x205/0x210
> kernel:  [<ffffffff80248b60>] generic_file_aio_write+0x61/0xc1
> kernel:  [<ffffffff80248aff>] generic_file_aio_write+0x0/0xc1
> kernel:  [<ffffffff80264e57>] do_sync_readv_writev+0xc0/0x107
> kernel:  [<ffffffff802377f7>] autoremove_wake_function+0x0/0x2e
> kernel:  [<ffffffff80229d16>] getnstimeofday+0x10/0x28
> kernel:  [<ffffffff80264ced>] rw_copy_check_uvector+0x6c/0xdc
> kernel:  [<ffffffff802654f7>] do_readv_writev+0xb2/0x18b
> kernel:  [<ffffffff80265a2c>] sys_writev+0x45/0x93
> kernel:  [<ffffffff802096de>] system_call+0x7e/0x83
> 
> and so on. sometimes i don't get this trace but instead i get:
> 
> kernel: sky2 eth0: tx timeout
> kernel: sky2 eth0: transmit ring 140 .. 99 report=181 done=181
> kernel: sky2 status report lost?
> kernel: NETDEV WATCHDOG: eth0: transmit timed out
> kernel: sky2 eth0: tx timeout
> kernel: sky2 eth0: transmit ring 181 .. 140 report=181 done=181
> kernel: sky2 hardware hung? flushing
> 
> but the end result is the same, the network card stops responding and
> i have to reboot the machine. i can reproduce this on a consistent
> basis so if there are any patches, i can try them out and see if they
> fix the problem.
> 
> this is probably not a regression per se as i saw it as well with
> 2.6.19 and 2.6.19-rc6. i am not sure if it was there previous to
> 2.6.19-rc6. suggestions, patches welcome. thanks.

Please report these problems to netdev@vger.kernel.org; I rarely go
looking in LKML.

These are the things you need to debug a sky2 related problem.

1) What is the exact kernel version in use?  This is important because
   problems get fixed, but it can be a long while until the fix trickles
   down to the vendor kernels.

2) What is the chip version?  The driver prints this in the console log
   at boot (dmesg | grep sky2).  This matters because each chip version
   has different bugs to deal with.

3) Does it work with the vendor driver?
   The vendor driver does a number of things differently than the sky2 driver
   and can mask problems, but if it doesn't work either, that is a useful
   data point.  If you want to know why the sky2 driver was written instead
   of just using the vendor driver, look at the code. The sk98lin driver
   is huge, includes features that are unsupportable and broken, and has
   locking mistakes.  But sk98lin also has a watchdog that masks bugs, so
   it may provide useful insight.

4) What is the IRQ routing?
   There are two issues here: first, the driver will never work with
   edge-triggered IRQs; second, some motherboards have busted BIOSes and
   chipsets that don't do MSI properly. A couple of module parameters are
   available to help:
      disable_msi=1   		avoids using MSI
      idle_timeout=10		polls for lost IRQs every N ms (here N=10)

5) What are the messages in the console log when the problem happens?

6) Are you running any of the following: bonding, vlans, bridging,
   netfilter, traffic control?

7) Please get a current version of ethtool from:
   git://git.kernel.org/pub/scm/network/ethtool/ethtool.git
   and run an ethtool register dump after a problem occurs:
      ethtool -d eth0

8) Are you using a dual-port board?  There were issues on the PCI-X
   version that required hacks, and the PCI Express version may have the
   same problem.  Basically, checksum offload wouldn't work and receive
   DMAs would arrive out of order.
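
For item 2, the probe line can be picked out of saved dmesg output with a few lines of Python. This is a minimal sketch, not part of any existing tool: the function name parse_sky2_probe is hypothetical, and the regular expression is inferred from a single sample line of the form "sky2 v1.10 addr 0xfddfc000 irq 17 Yukon-EC (0xb6) rev 1", so other driver versions may print something different.

```python
import re

# Assumed format of the sky2 probe line, inferred from one sample:
#   "sky2 v1.10 addr 0xfddfc000 irq 17 Yukon-EC (0xb6) rev 1"
# Other driver versions may word this line differently.
PROBE_RE = re.compile(
    r"sky2 v(?P<version>\S+) addr (?P<addr>0x[0-9a-fA-F]+) "
    r"irq (?P<irq>\d+) (?P<chip>\S+) \((?P<chip_id>0x[0-9a-fA-F]+)\) "
    r"rev (?P<rev>\d+)"
)


def parse_sky2_probe(log_text):
    """Return one dict per sky2 probe line found in dmesg-style text.

    Hypothetical helper for illustration. search() is used instead of
    match() so prefixes such as "kernel: " or timestamps are tolerated.
    """
    found = []
    for line in log_text.splitlines():
        m = PROBE_RE.search(line)
        if m is None:
            continue
        found.append({
            "version": m.group("version"),
            "addr": int(m.group("addr"), 16),
            "irq": int(m.group("irq")),
            "chip": m.group("chip"),
            "chip_id": int(m.group("chip_id"), 16),
            "rev": int(m.group("rev")),
        })
    return found
```

Usage, with a placeholder file name: save the output of dmesg to dmesg.txt and call parse_sky2_probe(open("dmesg.txt").read()).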

-- 
Stephen Hemminger <shemminger@osdl.org>


* Re: 2.6.20-rc1 sky2 problems (regression?)
  2006-12-14 21:30 ` 2.6.20-rc1 sky2 problems (regression?) Stephen Hemminger
@ 2006-12-14 22:00   ` Alex Romosan
  2006-12-14 22:25   ` Alex Romosan
  1 sibling, 0 replies; 14+ messages in thread
From: Alex Romosan @ 2006-12-14 22:00 UTC
  To: Stephen Hemminger; +Cc: netdev

Stephen Hemminger <shemminger@osdl.org> writes:

> On Thu, 14 Dec 2006 12:47:05 -0800
> Alex Romosan <romosan@sycorax.lbl.gov> wrote:
>
>> under heavy network load the sky2 driver (compiled in the kernel)
>> locks up and the only way i can get the network back is to reboot the
>> machine (bringing the network down and back up again doesn't help).
>> this happens on an amd64 machine (athlon 3500+ processor) and the card
>> in question is a Marvell Technology Group Ltd. 88E8053 PCI-E Gigabit
>> Ethernet Controller (rev 15) (from lspci). this is what i see in the
>> syslog:
>> 
>> kernel: sky2 eth0: rx error, status 0x414a414a length 0
>> kernel: eth0: hw csum failure.
>> kernel: 
>> kernel: Call Trace:
>> kernel:  <IRQ>  [<ffffffff8044681c>] __skb_checksum_complete+0x4d/0x66
>> kernel:  [<ffffffff80477bc5>] tcp_v4_rcv+0x147/0x8ea
>> kernel:  [<ffffffff80479ef2>] raw_rcv_skb+0x9/0x20
>> kernel:  [<ffffffff8047a2ff>] raw_rcv+0xbe/0xc4
>> kernel:  [<ffffffff8045ea9d>] ip_local_deliver+0x170/0x21b
>> kernel:  [<ffffffff8045e8fa>] ip_rcv+0x478/0x4ab
>> kernel:  [<ffffffff8044905d>] netif_receive_skb+0x184/0x20e
>> kernel:  [<ffffffff803de8e5>] sky2_poll+0x68f/0x93c
>> kernel:  [<ffffffff802219ce>] scheduler_tick+0x23/0x2f9
>> kernel:  [<ffffffff8044a796>] net_rx_action+0x61/0xf0
>> kernel:  [<ffffffff8022a35f>] __do_softirq+0x40/0x8a
>> kernel:  [<ffffffff8020a3cc>] call_softirq+0x1c/0x28
>> kernel:  [<ffffffff8020bbf0>] do_softirq+0x2c/0x7d
>> kernel:  [<ffffffff8022a313>] irq_exit+0x36/0x42
>> kernel:  [<ffffffff8020bebe>] do_IRQ+0x8c/0x9e
>> kernel:  [<ffffffff80208710>] default_idle+0x0/0x3a
>> kernel:  [<ffffffff80209bf1>] ret_from_intr+0x0/0xa
>> kernel:  <EOI>  [<ffffffff80208736>] default_idle+0x26/0x3a
>> kernel:  [<ffffffff8020878c>] cpu_idle+0x42/0x75
>> kernel:  [<ffffffff805df675>] start_kernel+0x1ce/0x1d3
>> kernel:  [<ffffffff805df140>] _sinittext+0x140/0x144
>> kernel: 
>> kernel: eth0: hw csum failure.
>> kernel: 
>> kernel: Call Trace:
>> kernel:  <IRQ>  [<ffffffff8044681c>] __skb_checksum_complete+0x4d/0x66
>> kernel:  [<ffffffff80477bc5>] tcp_v4_rcv+0x147/0x8ea
>> kernel:  [<ffffffff80479ef2>] raw_rcv_skb+0x9/0x20
>> kernel:  [<ffffffff8047a2ff>] raw_rcv+0xbe/0xc4
>> kernel:  [<ffffffff8045ea9d>] ip_local_deliver+0x170/0x21b
>> kernel:  [<ffffffff8045e8fa>] ip_rcv+0x478/0x4ab
>> kernel:  [<ffffffff8044905d>] netif_receive_skb+0x184/0x20e
>> kernel:  [<ffffffff803de8e5>] sky2_poll+0x68f/0x93c
>> kernel:  [<ffffffff80474647>] tcp_delack_timer+0x0/0x1b5
>> kernel:  [<ffffffff8044a796>] net_rx_action+0x61/0xf0
>> kernel:  [<ffffffff8022a35f>] __do_softirq+0x40/0x8a
>> kernel:  [<ffffffff8020a3cc>] call_softirq+0x1c/0x28
>> kernel:  [<ffffffff8020bbf0>] do_softirq+0x2c/0x7d
>> kernel:  [<ffffffff8022a313>] irq_exit+0x36/0x42
>> kernel:  [<ffffffff8020bebe>] do_IRQ+0x8c/0x9e
>> kernel:  [<ffffffff80209bf1>] ret_from_intr+0x0/0xa
>> kernel:  <EOI>  [<ffffffff802a8402>] inode2sd+0x104/0x117
>> kernel:  [<ffffffff802b8cfa>] search_by_key+0xa08/0xbfe
>> kernel:  [<ffffffff802b8475>] search_by_key+0x183/0xbfe
>> kernel:  [<ffffffff80284778>] ll_rw_block+0x89/0x9e
>> kernel:  [<ffffffff802b8475>] search_by_key+0x183/0xbfe
>> kernel:  [<ffffffff80283cf5>] __find_get_block_slow+0x101/0x10d
>> kernel:  [<ffffffff80284053>] __find_get_block+0x197/0x1a5
>> kernel:  [<ffffffff8026800c>] inode_get_bytes+0x2a/0x52
>> kernel:  [<ffffffff802a89f1>] reiserfs_update_sd_size+0x7e/0x284
>> kernel:  [<ffffffff80237700>] kthread+0xed/0xfd
>> kernel:  [<ffffffff802be990>] do_journal_end+0x34b/0xbdd
>> kernel:  [<ffffffff802b1729>] reiserfs_dirty_inode+0x56/0x76
>> kernel:  [<ffffffff80284c19>] block_prepare_write+0x1a/0x24
>> kernel:  [<ffffffff802809b1>] __mark_inode_dirty+0x29/0x197
>> kernel:  [<ffffffff802a8d04>] reiserfs_commit_write+0x10d/0x19f
>> kernel:  [<ffffffff80284c19>] block_prepare_write+0x1a/0x24
>> kernel:  [<ffffffff802484fc>] generic_file_buffered_write+0x4ad/0x6c4
>> kernel:  [<ffffffff80271b3c>] __pollwait+0x0/0xe0
>> kernel:  [<ffffffff8022a006>] current_fs_time+0x35/0x3b
>> kernel:  [<ffffffff80248a8c>] __generic_file_aio_write_nolock+0x379/0x3ec
>> kernel:  [<ffffffff8049baca>] unix_dgram_recvmsg+0x1be/0x1d9
>> kernel:  [<ffffffff804b6516>] __mutex_lock_slowpath+0x205/0x210
>> kernel:  [<ffffffff80248b60>] generic_file_aio_write+0x61/0xc1
>> kernel:  [<ffffffff80248aff>] generic_file_aio_write+0x0/0xc1
>> kernel:  [<ffffffff80264e57>] do_sync_readv_writev+0xc0/0x107
>> kernel:  [<ffffffff802377f7>] autoremove_wake_function+0x0/0x2e
>> kernel:  [<ffffffff80229d16>] getnstimeofday+0x10/0x28
>> kernel:  [<ffffffff80264ced>] rw_copy_check_uvector+0x6c/0xdc
>> kernel:  [<ffffffff802654f7>] do_readv_writev+0xb2/0x18b
>> kernel:  [<ffffffff80265a2c>] sys_writev+0x45/0x93
>> kernel:  [<ffffffff802096de>] system_call+0x7e/0x83
>> 
>> and so on. sometimes i don't get this trace but instead i get:
>> 
>> kernel: sky2 eth0: tx timeout
>> kernel: sky2 eth0: transmit ring 140 .. 99 report=181 done=181
>> kernel: sky2 status report lost?
>> kernel: NETDEV WATCHDOG: eth0: transmit timed out
>> kernel: sky2 eth0: tx timeout
>> kernel: sky2 eth0: transmit ring 181 .. 140 report=181 done=181
>> kernel: sky2 hardware hung? flushing
>> 
> Please report these problems to netdev@vger.kernel.org; I rarely go
> looking in LKML.
>
> These are the things you need to debug a sky2 related problem.
>
> 1) What is the exact kernel version in use?  This is important because
>    problems get fixed, but it can be a long while until the fix trickles
>    down to the vendor kernels.

this is the stock kernel.org 2.6.20-rc1 kernel, which i downloaded this
morning. the 2.6.19 and 2.6.19-rc6 kernels i referred to in my original
message were also downloaded from kernel.org.

> 2) What is the chip version?  The driver prints this in the console log
>    at boot (dmesg | grep sky2).  This matters because each chip version
>    has different bugs to deal with.

sky2 v1.10 addr 0xfddfc000 irq 17 Yukon-EC (0xb6) rev 1
sky2 eth0: addr 00:11:09:da:39:a3
sky2 eth0: enabling interface
sky2 eth0: ram buffer 48K
sky2 eth0: Link is up at 100 Mbps, full duplex, flow control both


> 3) Does it work with the vendor driver?
>    The vendor driver does a number of things differently than the sky2 driver
>    and can mask problems, but if it doesn't work either, that is a useful
>    data point.  If you want to know why the sky2 driver was written instead
>    of just using the vendor driver, look at the code. The sk98lin driver
>    is huge, includes features that are unsupportable and broken, and has
>    locking mistakes.  But sk98lin also has a watchdog that masks bugs, so
>    it may provide useful insight.

i haven't tried the vendor driver yet, but i guess i will, and let you
know what happens.

> 4) What is the IRQ routing?
>    There are two issues here: first, the driver will never work with
>    edge-triggered IRQs; second, some motherboards have busted BIOSes and
>    chipsets that don't do MSI properly. A couple of module parameters are
>    available to help:
>       disable_msi=1   		avoids using MSI
>       idle_timeout=10		polls for lost IRQs every N ms (here N=10)

hmm, i have MSI interrupts enabled in the config, and cat
/proc/interrupts gives me:

283:    1474208   PCI-MSI-edge      eth0

so are you saying i should disable MSI?

> 5) What are the messages in the console log when the problem happens?

see my original message i kept above.

> 6) Are you running any of the following: bonding, vlans, bridging,
>    netfilter, traffic control?

no.

> 7) Please get a current version of ethtool from:
>    git://git.kernel.org/pub/scm/network/ethtool/ethtool.git
>    and run an ethtool register dump after a problem occurs:
>       ethtool -d eth0

i've downloaded it and i'll run it next time the machine locks up.

> 8) Are you using a dual-port board?  There were issues on the PCI-X
>    version that required hacks, and the PCI Express version may have the
>    same problem.  Basically, checksum offload wouldn't work and receive
>    DMAs would arrive out of order.

it is a dual port board but i am using only one port.

--alex--

-- 
| I believe the moment is at hand when, by a paranoiac and active |
|  advance of the mind, it will be possible (simultaneously with  |
|  automatism and other passive states) to systematize confusion  |
|  and thus to help to discredit completely the world of reality. |


* Re: 2.6.20-rc1 sky2 problems (regression?)
  2006-12-14 21:30 ` 2.6.20-rc1 sky2 problems (regression?) Stephen Hemminger
  2006-12-14 22:00   ` Alex Romosan
@ 2006-12-14 22:25   ` Alex Romosan
  2006-12-14 22:47     ` Stephen Hemminger
  1 sibling, 1 reply; 14+ messages in thread
From: Alex Romosan @ 2006-12-14 22:25 UTC
  To: Stephen Hemminger; +Cc: netdev

Stephen Hemminger <shemminger@osdl.org> writes:

> 4) What is the IRQ routing?
>    There are two issues here: first, the driver will never work with
>    edge-triggered IRQs; second, some motherboards have busted BIOSes and
>    chipsets that don't do MSI properly. A couple of module parameters are
>    available to help:
>       disable_msi=1   		avoids using MSI
>       idle_timeout=10		polls for lost IRQs every N ms (here N=10)

it didn't take long to lock up the machine again. i've rebooted back
into stock 2.6.20-rc1 and added the two module parameters above. cat
/proc/interrupts now gives me:

 17:        203   IO-APIC-fasteoi   eth0, CMI8738

so i guess the MSI interrupts are disabled. we'll see how this works.

> 5) What are the messages in the console log when the problem happens?

kernel: NETDEV WATCHDOG: eth0: transmit timed out
kernel: sky2 eth0: tx timeout
kernel: sky2 eth0: transmit ring 402 .. 361 report=406 done=406
kernel: sky2 status report lost?
kernel: NETDEV WATCHDOG: eth0: transmit timed out
kernel: sky2 eth0: tx timeout
kernel: sky2 eth0: transmit ring 406 .. 361 report=406 done=406
kernel: sky2 hardware hung? flushing
kernel: NETDEV WATCHDOG: eth0: transmit timed out
kernel: sky2 eth0: tx timeout
kernel: sky2 eth0: transmit ring 361 .. 321 report=406 done=406
kernel: sky2 status report lost?
kernel: NETDEV WATCHDOG: eth0: transmit timed out
kernel: sky2 eth0: tx timeout
kernel: sky2 eth0: transmit ring 406 .. 366 report=406 done=406
kernel: sky2 hardware hung? flushing

> 7) Please get a current version of ethtool from:
>    git://git.kernel.org/pub/scm/network/ethtool/ethtool.git
>    and run an ethtool register dump after a problem occurs:
>       ethtool -d eth0

this is the output after it stopped working:


PCI config
----------
00: ab 11 62 43 07 04 18 00 15 00 00 02 08 00 00 00
10: 04 c0 df fd 00 00 00 00 01 ce 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 62 14 8c 05
30: 00 00 00 00 48 00 00 00 00 00 00 00 03 01 00 00
40: 00 00 f0 01 00 80 a0 01 01 50 02 fe 00 20 00 14
50: 03 5c 00 80 00 00 00 01 00 00 00 01 05 e0 83 00
60: 0c 10 e0 fe 00 00 00 00 61 41 00 00 00 00 00 00
70: 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Control Registers
-----------------
Register Access Port             0x00
LED Control/Status               0xA603164A
Interrupt Source                 0x40000000
Interrupt Mask                   0xC000001D
Interrupt Hardware Error Source  0x00000000
Interrupt Hardware Error Mask    0x2E003F3F

Bus Management Unit
-------------------
CSR Receive Queue 1              0x00010000
CSR Sync Queue 1                 0xFFFFFFFF
CSR Async Queue 1                0x00000000

MAC Addresses
---------------
Addr 1            00 11 09 DA 39 A3
Addr 2            00 11 09 DA 39 A3
Addr 3            00 00 00 00 00 00

Connector type               0x4A (J)
PMD type                     0x54 (T)
PHY type                     0x80
Chip Id                      0xB6 Yukon-2 EC
 (rev 0)
Ram Buffer                   0x0C

Status BMU:
-----------
Control                                0x0002220A
Last Index                             0x07FF
Put Index                              0x0601
List Address                           0x000000007FBF8000
Transmit 1 done index                  0x0196
Transmit index threshold               0x000A

Status FIFO
	Write Pointer            0x16
	Read Pointer             0x16
	Level                    0x00
	Watermark                0x10
	ISR Watermark            0x10
Status level
	Init 0x000030D4 Value 0x00000D00
	Test 0x04       Control 0x02
TX status
	Init 0x0001E848 Value 0x0001E848
	Test 0x04       Control 0x02
ISR
	Init 0x000009C4 Value 0x000009C4
	Test 0x04       Control 0x02

GMAC control             0x005A
GPHY control             0x2002
LINK control             0x02

GMAC 1
Status                       0xD000
Control                      0x1800
Transmit                     0x1000
Receive                      0xE000
Transmit flow control        0xFFFF
Transmit parameter           0xD7C4
Serial mode                  0x221E
      Source address:  00 11 09 DA 39 A3
    Physical address:  00 11 09 DA 39 A3

Rx GMAC 1
End Address                      0x0000007F
Almost Full Thresh               0x00000070
Control/Test                     0x0900228A
FIFO Flush Mask                  0x000018FB
FIFO Flush Threshold             0x0000000B
Truncation Threshold             0x0000017C
Upper Pause Threshold            0x00000000
Lower Pause Threshold            0x00000081
VLAN Tag                         0x00000074
FIFO Write Pointer               0x00000000
FIFO Write Level                 0x0000007B
FIFO Read Pointer                0x00000000
FIFO Read Level                  0x00000079

Tx GMAC 1
End Address                      0x0000007F
Almost Full Thresh               0x00000010
Control/Test                     0x0102220A
FIFO Flush Mask                  0x00000000
FIFO Flush Threshold             0x00000000
Truncation Threshold             0x00000000
Upper Pause Threshold            0x00000000
Lower Pause Threshold            0x00000081
VLAN Tag                         0x0000002A
FIFO Write Pointer               0x0000002A
FIFO Write Level                 0x00000000
FIFO Read Pointer                0x00000000
FIFO Read Level                  0x0000002A

Receive Queue 1
---------------
Buffer control                   0x05F8
Byte Counter                     49408
Descriptor Address               0x0000000076F4F810
Status                           0x05EA0100
Timestamp                        0x00000000
BMU Control/Status               0x000061AA
Done                             0x0000
Request                          0x0000000076F4F810
Csum1      Offset 52057 Position   14
Csum2      Offset 52057 Position   14

Sync Transmit Queue 1
---------------
Descriptor Address       0x0000000000000000
Address Counter          0x0000000000000000
Current Byte Counter             0
BMU Control/Status               0x00000000
Flag & FIFO Address              0x00000000

Control                          0x00000000
Next                             0x00000000
Data                     0x0000000000000000
Status                           0x00000000
Timestamp                        0x00000000
Csum Start 0x0000 Pos    0 Write 0

Async Transmit Queue 1
---------------
Buffer control                   0x053D
Byte Counter                     49950
Descriptor Address               0x0000000047237000
Status                           0x000005EA
Timestamp                        0x00010000
BMU Control/Status               0x800011AA
Done                             0x0000
Request                          0x000000004723753D
Csum Start 0x0032 Pos    0 Write 0

Receive RAMbuffer 1
---------------
Start Address                    0x00000000
End Address                      0x00000E7F
Write Pointer                    0x00000079
Read Pointer                     0x0000007E
Upper Threshold/Pause Packets    0x00000D80
Lower Threshold/Pause Packets    0x000003A0
Upper Threshold/High Priority    0x00000AE0
Lower Threshold/High Priority    0x00000740
Packet Counter                   0x00000029
Level                            0x00000E7B
Test                             0x0002221A

Sync Transmit RAMbuffer 1
---------------
Start Address                    0x00000000
End Address                      0x00000000
Write Pointer                    0x00000000
Read Pointer                     0x00000000
Packet Counter                   0x00000000
Level                            0x00000000
Test                             0x00000000

Async Transmit RAMbuffer 1
---------------
Start Address                    0x00000E80
End Address                      0x000017FF
Write Pointer                    0x0000132A
Read Pointer                     0x0000132A
Packet Counter                   0x00000000
Level                            0x00000000
Test                             0x0002222A

i don't know if it helps but i am also including the output of ethtool
while the card was still working:


PCI config
----------
00: ab 11 62 43 07 04 10 00 15 00 00 02 08 00 00 00
10: 04 c0 df fd 00 00 00 00 01 ce 00 00 00 00 00 00
20: 00 00 00 00 00 00 00 00 00 00 00 00 62 14 8c 05
30: 00 00 00 00 48 00 00 00 00 00 00 00 03 01 00 00
40: 00 00 f0 01 00 80 a0 01 01 50 02 fe 00 20 00 14
50: 03 5c 00 80 00 00 00 01 00 00 00 01 05 e0 83 00
60: 0c 10 e0 fe 00 00 00 00 61 41 00 00 00 00 00 00
70: 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Control Registers
-----------------
Register Access Port             0x00
LED Control/Status               0xA603164A
Interrupt Source                 0x00000000
Interrupt Mask                   0xC000001D
Interrupt Hardware Error Source  0x00000000
Interrupt Hardware Error Mask    0x2E003F3F

Bus Management Unit
-------------------
CSR Receive Queue 1              0x00010000
CSR Sync Queue 1                 0xFFFFFFFF
CSR Async Queue 1                0x00000000

MAC Addresses
---------------
Addr 1            00 11 09 DA 39 A3
Addr 2            00 11 09 DA 39 A3
Addr 3            00 00 00 00 00 00

Connector type               0x4A (J)
PMD type                     0x54 (T)
PHY type                     0x80
Chip Id                      0xB6 Yukon-2 EC
 (rev 0)
Ram Buffer                   0x0C

Status BMU:
-----------
Control                                0x0002220A
Last Index                             0x07FF
Put Index                              0x00B8
List Address                           0x000000007FBF8000
Transmit 1 done index                  0x0057
Transmit index threshold               0x000A

Status FIFO
	Write Pointer            0x08
	Read Pointer             0x08
	Level                    0x00
	Watermark                0x10
	ISR Watermark            0x10
Status level
	Init 0x000030D4 Value 0x000030D4
	Test 0x04       Control 0x02
TX status
	Init 0x0001E848 Value 0x0001E848
	Test 0x04       Control 0x02
ISR
	Init 0x000009C4 Value 0x000009C4
	Test 0x04       Control 0x02

GMAC control             0x005A
GPHY control             0x2002
LINK control             0x02

GMAC 1
Status                       0xD000
Control                      0x1800
Transmit                     0x1000
Receive                      0xE000
Transmit flow control        0xFFFF
Transmit parameter           0xD7C4
Serial mode                  0x221E
      Source address:  00 11 09 DA 39 A3
    Physical address:  00 11 09 DA 39 A3

Rx GMAC 1
End Address                      0x0000007F
Almost Full Thresh               0x00000070
Control/Test                     0x0900228A
FIFO Flush Mask                  0x000018FB
FIFO Flush Threshold             0x0000000B
Truncation Threshold             0x0000017C
Upper Pause Threshold            0x00000000
Lower Pause Threshold            0x00000081
VLAN Tag                         0x00000027
FIFO Write Pointer               0x00000000
FIFO Write Level                 0x00000000
FIFO Read Pointer                0x00000000
FIFO Read Level                  0x00000027

Tx GMAC 1
End Address                      0x0000007F
Almost Full Thresh               0x00000010
Control/Test                     0x0102220A
FIFO Flush Mask                  0x00000000
FIFO Flush Threshold             0x00000000
Truncation Threshold             0x00000000
Upper Pause Threshold            0x00000000
Lower Pause Threshold            0x00000081
VLAN Tag                         0x00000032
FIFO Write Pointer               0x00000032
FIFO Write Level                 0x00000000
FIFO Read Pointer                0x00000000
FIFO Read Level                  0x00000032

Receive Queue 1
---------------
Buffer control                   0x05F8
Byte Counter                     49408
Descriptor Address               0x000000001727E010
Status                           0x003C0100
Timestamp                        0x00000000
BMU Control/Status               0x000061AA
Done                             0x0000
Request                          0x000000001727E010
Csum1      Offset 12632 Position   14
Csum2      Offset 12632 Position   14

Sync Transmit Queue 1
---------------
Descriptor Address       0x0000000000000000
Address Counter          0x0000000000000000
Current Byte Counter             0
BMU Control/Status               0x00000000
Flag & FIFO Address              0x00000000

Control                          0x00000000
Next                             0x00000000
Data                     0x0000000000000000
Status                           0x00000000
Timestamp                        0x00000000
Csum Start 0x0000 Pos    0 Write 0

Async Transmit Queue 1
---------------
Buffer control                   0x06CC
Byte Counter                     49950
Descriptor Address               0x0000000046AD23C6
Status                           0x000005EA
Timestamp                        0x00010000
BMU Control/Status               0x800011AA
Done                             0x0000
Request                          0x0000000046AD2A92
Csum Start 0x0032 Pos    0 Write 0

Receive RAMbuffer 1
---------------
Start Address                    0x00000000
End Address                      0x00000E7F
Write Pointer                    0x00000427
Read Pointer                     0x00000427
Upper Threshold/Pause Packets    0x00000D80
Lower Threshold/Pause Packets    0x000003A0
Upper Threshold/High Priority    0x00000AE0
Lower Threshold/High Priority    0x00000740
Packet Counter                   0x00000000
Level                            0x00000000
Test                             0x0002221A

Sync Transmit RAMbuffer 1
---------------
Start Address                    0x00000000
End Address                      0x00000000
Write Pointer                    0x00000000
Read Pointer                     0x00000000
Packet Counter                   0x00000000
Level                            0x00000000
Test                             0x00000000

Async Transmit RAMbuffer 1
---------------
Start Address                    0x00000E80
End Address                      0x000017FF
Write Pointer                    0x000017B2
Read Pointer                     0x000017B2
Packet Counter                   0x00000000
Level                            0x00000000
Test                             0x0002222A
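
The two dumps above share the same layout, so lining them up line by line is a quick way to see which registers changed between the working and the hung state (for example, the Interrupt Source line reads 0x00000000 in the working dump and 0x40000000 in the hung one). This is a minimal sketch, assuming both dumps come from the same ethtool build on the same chip, so that line N always describes the same register; diff_register_dumps is a hypothetical name.

```python
def diff_register_dumps(before, after):
    """Compare two `ethtool -d` dumps line by line.

    Assumes both dumps have identical layout (same ethtool build, same
    chip), so line N of one dump describes the same register as line N
    of the other. Returns (section, old_line, new_line) for each line
    that differs, where section is the nearest heading above it.
    """
    old_lines = before.strip("\n").splitlines()
    new_lines = after.strip("\n").splitlines()
    if len(old_lines) != len(new_lines):
        raise ValueError("dumps have different layouts; cannot align by line")
    changes = []
    section = ""
    for i, (old, new) in enumerate(zip(old_lines, new_lines)):
        # A heading is a line underlined by a row of dashes on the next line.
        nxt = old_lines[i + 1].strip() if i + 1 < len(old_lines) else ""
        if nxt and set(nxt) == {"-"}:
            section = old.strip()
        if old.strip() != new.strip():
            changes.append((section, old.strip(), new.strip()))
    return changes
```

Usage, with placeholder file names: for section, old, new in diff_register_dumps(open("working.txt").read(), open("hung.txt").read()): print(section, "|", old, "->", new). Aligning by position keeps the sketch short; it gives up as soon as the two dumps differ in length instead of guessing at a match.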

i'll try to lock up the networking again, and if it still happens i'll
switch to the vendor driver and see what that has to say.

--alex--



* Re: 2.6.20-rc1 sky2 problems (regression?)
  2006-12-14 22:25   ` Alex Romosan
@ 2006-12-14 22:47     ` Stephen Hemminger
  2006-12-14 22:57       ` Alex Romosan
                         ` (2 more replies)
  0 siblings, 3 replies; 14+ messages in thread
From: Stephen Hemminger @ 2006-12-14 22:47 UTC (permalink / raw)
  To: Alex Romosan; +Cc: netdev

On Thu, 14 Dec 2006 14:25:06 -0800
Alex Romosan <romosan@sycorax.lbl.gov> wrote:

> Stephen Hemminger <shemminger@osdl.org> writes:
> 
> > 4) What is the IRQ routing?
> >    There are two issues here: first, the driver will never work with
> >    edge-triggered IRQs; second, some motherboards have busted BIOSes and
> >    chipsets that don't do MSI properly. A couple of module parameters are
> >    available to help:
> >       disable_msi=1   		avoids using MSI
> >       idle_timeout=10		polls for lost IRQ's every N ms (10)
> 
> it didn't take long to lock up the machine again. i've rebooted back
> into stock 2.6.20-rc1 and added the two module parameters above. cat
> /proc/interrupts now gives me:
> 
>  17:        203   IO-APIC-fasteoi   eth0, CMI8738
> 
> so i guess the MSI interrupts are disabled. we'll see how this works.

probably won't do much but now the IRQ ends up shared.

> > 5) What are the messages in the console log when problem happens?
> 
> kernel: NETDEV WATCHDOG: eth0: transmit timed out
> kernel: sky2 eth0: tx timeout
> kernel: sky2 eth0: transmit ring 402 .. 361 report=406 done=406
> kernel: sky2 status report lost?

The transmit timeout code tries to be smart, but doesn't really
recover properly if the hardware is stuck.


> > 7) Please get a current version of ethtool from:
> >    git://git.kernel.org/pub/scm/network/ethtool/ethtool.git
> >    and run ethtool register dump after a problem occurs:
> >       ethtool -d eth0
> 
> this is the output after it stopped working:
> 
> 
> PCI config
> ----------
> 00: ab 11 62 43 07 04 18 00 15 00 00 02 08 00 00 00
> 10: 04 c0 df fd 00 00 00 00 01 ce 00 00 00 00 00 00
> 20: 00 00 00 00 00 00 00 00 00 00 00 00 62 14 8c 05
> 30: 00 00 00 00 48 00 00 00 00 00 00 00 03 01 00 00
> 40: 00 00 f0 01 00 80 a0 01 01 50 02 fe 00 20 00 14
> 50: 03 5c 00 80 00 00 00 01 00 00 00 01 05 e0 83 00
> 60: 0c 10 e0 fe 00 00 00 00 61 41 00 00 00 00 00 00
> 70: 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 
> Control Registers
> -----------------
> Register Access Port             0x00
> LED Control/Status               0xA603164A
> Interrupt Source                 0x40000000
> Interrupt Mask                   0xC000001D
> Interrupt Hardware Error Source  0x00000000
> Interrupt Hardware Error Mask    0x2E003F3F
> 
> Bus Management Unit
> -------------------
> CSR Receive Queue 1              0x00010000
> CSR Sync Queue 1                 0xFFFFFFFF
> CSR Async Queue 1                0x00000000
> 
> MAC Addresses
> ---------------
> Addr 1            00 11 09 DA 39 A3
> Addr 2            00 11 09 DA 39 A3
> Addr 3            00 00 00 00 00 00
> 
> Connector type               0x4A (J)
> PMD type                     0x54 (T)
> PHY type                     0x80
> Chip Id                      0xB6 Yukon-2 EC
>  (rev 0)
> Ram Buffer                   0x0C
> 
> Status BMU:
> -----------
> Control                                0x0002220A
> Last Index                             0x07FF
> Put Index                              0x0601
> List Address                           0x000000007FBF8000
> Transmit 1 done index                  0x0196
> Transmit index threshold               0x000A
> 
> Status FIFO
> 	Write Pointer            0x16
> 	Read Pointer             0x16
> 	Level                    0x00
> 	Watermark                0x10
> 	ISR Watermark            0x10
> Status level
> 	Init 0x000030D4 Value 0x00000D00
> 	Test 0x04       Control 0x02
> TX status
> 	Init 0x0001E848 Value 0x0001E848
> 	Test 0x04       Control 0x02
> ISR
> 	Init 0x000009C4 Value 0x000009C4
> 	Test 0x04       Control 0x02
> 
> GMAC control             0x005A
> GPHY control             0x2002
> LINK control             0x02
> 
> GMAC 1
> Status                       0xD000
> Control                      0x1800
> Transmit                     0x1000
> Receive                      0xE000
> Transmit flow control        0xFFFF
> Transmit parameter           0xD7C4
> Serial mode                  0x221E
>       Source address:  00 11 09 DA 39 A3
>     Physical address:  00 11 09 DA 39 A3
> 
> Rx GMAC 1
> End Address                      0x0000007F
> Almost Full Thresh               0x00000070
> Control/Test                     0x0900228A
> FIFO Flush Mask                  0x000018FB
> FIFO Flush Threshold             0x0000000B
> Truncation Threshold             0x0000017C
> Upper Pause Threshold            0x00000000
> Lower Pause Threshold            0x00000081
> VLAN Tag                         0x00000074
> FIFO Write Pointer               0x00000000
> FIFO Write Level                 0x0000007B
> FIFO Read Pointer                0x00000000
> FIFO Read Level                  0x00000079
> 
> Tx GMAC 1
> End Address                      0x0000007F
> Almost Full Thresh               0x00000010
> Control/Test                     0x0102220A
> FIFO Flush Mask                  0x00000000
> FIFO Flush Threshold             0x00000000
> Truncation Threshold             0x00000000
> Upper Pause Threshold            0x00000000
> Lower Pause Threshold            0x00000081
> VLAN Tag                         0x0000002A
> FIFO Write Pointer               0x0000002A
> FIFO Write Level                 0x00000000
> FIFO Read Pointer                0x00000000
> FIFO Read Level                  0x0000002A
> 
> Receive Queue 1
> ---------------
> Buffer control                   0x05F8
> Byte Counter                     49408
> Descriptor Address               0x0000000076F4F810
> Status                           0x05EA0100
> Timestamp                        0x00000000
> BMU Control/Status               0x000061AA
> Done                             0x0000
> Request                          0x0000000076F4F810
> Csum1      Offset 52057 Piston   14
> Csum2      Offset 52057 Positing   14
> 
> Sync Transmit Queue 1
> ---------------
> Descriptor Address       0x0000000000000000
> Address Counter          0x0000000000000000
> Current Byte Counter             0
> BMU Control/Status               0x00000000
> Flag & FIFO Address              0x00000000
> 
> Control                          0x00000000
> Next                             0x00000000
> Data                     0x0000000000000000
> Status                           0x00000000
> Timestamp                        0x00000000
> Csum Start 0x0000 Pos    0 Write 0
> 
> Async Transmit Queue 1
> ---------------
> Buffer control                   0x053D
> Byte Counter                     49950
> Descriptor Address               0x0000000047237000
> Status                           0x000005EA
> Timestamp                        0x00010000
> BMU Control/Status               0x800011AA
> Done                             0x0000
> Request                          0x000000004723753D
> Csum Start 0x0032 Pos    0 Write 0
> 
> Receive RAMbuffer 1
> ---------------
> Start Address                    0x00000000
> End Address                      0x00000E7F
> Write Pointer                    0x00000079
> Read Pointer                     0x0000007E
> Upper Threshold/Pause Packets    0x00000D80
> Lower Threshold/Pause Packets    0x000003A0
> Upper Threshold/High Priority    0x00000AE0
> Lower Threshold/High Priority    0x00000740
> Packet Counter                   0x00000029
> Level                            0x00000E7B
> Test                             0x0002221A
> 
> Sync Transmit RAMbuffer 1
> ---------------
> Start Address                    0x00000000
> End Address                      0x00000000
> Write Pointer                    0x00000000
> Read Pointer                     0x00000000
> Packet Counter                   0x00000000
> Level                            0x00000000
> Test                             0x00000000
> 
> Async Transmit RAMbuffer 1
> ---------------
> Start Address                    0x00000E80
> End Address                      0x000017FF
> Write Pointer                    0x0000132A
> Read Pointer                     0x0000132A
> Packet Counter                   0x00000000
> Level                            0x00000000
> Test                             0x0002222A
> 
> i don't know if it helps but i am also including the output of ethtool
> while the card was still working:
> 
> 
> PCI config
> ----------
> 00: ab 11 62 43 07 04 10 00 15 00 00 02 08 00 00 00
> 10: 04 c0 df fd 00 00 00 00 01 ce 00 00 00 00 00 00
> 20: 00 00 00 00 00 00 00 00 00 00 00 00 62 14 8c 05
> 30: 00 00 00 00 48 00 00 00 00 00 00 00 03 01 00 00
> 40: 00 00 f0 01 00 80 a0 01 01 50 02 fe 00 20 00 14
> 50: 03 5c 00 80 00 00 00 01 00 00 00 01 05 e0 83 00
> 60: 0c 10 e0 fe 00 00 00 00 61 41 00 00 00 00 00 00
> 70: 00 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> 
> Control Registers
> -----------------
> Register Access Port             0x00
> LED Control/Status               0xA603164A
> Interrupt Source                 0x00000000
> Interrupt Mask                   0xC000001D
> Interrupt Hardware Error Source  0x00000000
> Interrupt Hardware Error Mask    0x2E003F3F
> 
> Bus Management Unit
> -------------------
> CSR Receive Queue 1              0x00010000
> CSR Sync Queue 1                 0xFFFFFFFF
> CSR Async Queue 1                0x00000000
> 
> MAC Addresses
> ---------------
> Addr 1            00 11 09 DA 39 A3
> Addr 2            00 11 09 DA 39 A3
> Addr 3            00 00 00 00 00 00
> 
> Connector type               0x4A (J)
> PMD type                     0x54 (T)
> PHY type                     0x80
> Chip Id                      0xB6 Yukon-2 EC
>  (rev 0)
> Ram Buffer                   0x0C
> 
> Status BMU:
> -----------
> Control                                0x0002220A
> Last Index                             0x07FF
> Put Index                              0x00B8
> List Address                           0x000000007FBF8000
> Transmit 1 done index                  0x0057
> Transmit index threshold               0x000A
> 
> Status FIFO
> 	Write Pointer            0x08
> 	Read Pointer             0x08
> 	Level                    0x00
> 	Watermark                0x10
> 	ISR Watermark            0x10
> Status level
> 	Init 0x000030D4 Value 0x000030D4
> 	Test 0x04       Control 0x02
> TX status
> 	Init 0x0001E848 Value 0x0001E848
> 	Test 0x04       Control 0x02
> ISR
> 	Init 0x000009C4 Value 0x000009C4
> 	Test 0x04       Control 0x02
> 
> GMAC control             0x005A
> GPHY control             0x2002
> LINK control             0x02
> 
> GMAC 1
> Status                       0xD000
> Control                      0x1800
> Transmit                     0x1000
> Receive                      0xE000
> Transmit flow control        0xFFFF
> Transmit parameter           0xD7C4
> Serial mode                  0x221E
>       Source address:  00 11 09 DA 39 A3
>     Physical address:  00 11 09 DA 39 A3
> 
> Rx GMAC 1
> End Address                      0x0000007F
> Almost Full Thresh               0x00000070
> Control/Test                     0x0900228A
> FIFO Flush Mask                  0x000018FB
> FIFO Flush Threshold             0x0000000B
> Truncation Threshold             0x0000017C
> Upper Pause Threshold            0x00000000
> Lower Pause Threshold            0x00000081
> VLAN Tag                         0x00000027
> FIFO Write Pointer               0x00000000
> FIFO Write Level                 0x00000000
> FIFO Read Pointer                0x00000000
> FIFO Read Level                  0x00000027
> 
> Tx GMAC 1
> End Address                      0x0000007F
> Almost Full Thresh               0x00000010
> Control/Test                     0x0102220A
> FIFO Flush Mask                  0x00000000
> FIFO Flush Threshold             0x00000000
> Truncation Threshold             0x00000000
> Upper Pause Threshold            0x00000000
> Lower Pause Threshold            0x00000081
> VLAN Tag                         0x00000032
> FIFO Write Pointer               0x00000032
> FIFO Write Level                 0x00000000
> FIFO Read Pointer                0x00000000
> FIFO Read Level                  0x00000032
> 
> Receive Queue 1
> ---------------
> Buffer control                   0x05F8
> Byte Counter                     49408
> Descriptor Address               0x000000001727E010
> Status                           0x003C0100
> Timestamp                        0x00000000
> BMU Control/Status               0x000061AA
> Done                             0x0000
> Request                          0x000000001727E010
> Csum1      Offset 12632 Piston   14
> Csum2      Offset 12632 Positing   14
> 
> Sync Transmit Queue 1
> ---------------
> Descriptor Address       0x0000000000000000
> Address Counter          0x0000000000000000
> Current Byte Counter             0
> BMU Control/Status               0x00000000
> Flag & FIFO Address              0x00000000
> 
> Control                          0x00000000
> Next                             0x00000000
> Data                     0x0000000000000000
> Status                           0x00000000
> Timestamp                        0x00000000
> Csum Start 0x0000 Pos    0 Write 0
> 
> Async Transmit Queue 1
> ---------------
> Buffer control                   0x06CC
> Byte Counter                     49950
> Descriptor Address               0x0000000046AD23C6
> Status                           0x000005EA
> Timestamp                        0x00010000
> BMU Control/Status               0x800011AA
> Done                             0x0000
> Request                          0x0000000046AD2A92
> Csum Start 0x0032 Pos    0 Write 0
> 
> Receive RAMbuffer 1
> ---------------
> Start Address                    0x00000000
> End Address                      0x00000E7F
> Write Pointer                    0x00000427
> Read Pointer                     0x00000427
> Upper Threshold/Pause Packets    0x00000D80
> Lower Threshold/Pause Packets    0x000003A0
> Upper Threshold/High Priority    0x00000AE0
> Lower Threshold/High Priority    0x00000740
> Packet Counter                   0x00000000
> Level                            0x00000000
> Test                             0x0002221A
> 
> Sync Transmit RAMbuffer 1
> ---------------
> Start Address                    0x00000000
> End Address                      0x00000000
> Write Pointer                    0x00000000
> Read Pointer                     0x00000000
> Packet Counter                   0x00000000
> Level                            0x00000000
> Test                             0x00000000
> 
> Async Transmit RAMbuffer 1
> ---------------
> Start Address                    0x00000E80
> End Address                      0x000017FF
> Write Pointer                    0x000017B2
> Read Pointer                     0x000017B2
> Packet Counter                   0x00000000
> Level                            0x00000000
> Test                             0x0002222A
> 
> i'll try to lock up the networking again and if it still happens i'll
> switch to the vendor driver and see what that has to say.
> 

Another useful bit of information is the statistics (ethtool -S eth0).
When there were flow control bugs, they would show up as a count of 1.

Are you doing jumbo frames (MTU > 1500)?

-- 
Stephen Hemminger <shemminger@osdl.org>

* Re: 2.6.20-rc1 sky2 problems (regression?)
  2006-12-14 22:47     ` Stephen Hemminger
@ 2006-12-14 22:57       ` Alex Romosan
  2006-12-14 23:08       ` Alex Romosan
  2006-12-14 23:21       ` Alex Romosan
  2 siblings, 0 replies; 14+ messages in thread
From: Alex Romosan @ 2006-12-14 22:57 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev

Stephen Hemminger <shemminger@osdl.org> writes:

> Another useful bit of information is the statistics (ethtool -S eth0).
> When there were flow control bugs, they would show up as a count of 1.

we'll see if the machine locks up again.

> Are you doing jumbo frames (MTU > 1500)?

no (or at least i don't think so). how can i tell?

assuming the machine doesn't lock up with msi interrupts disabled, do
you want me to do anything to debug why the driver locks up when the
msi interrupts are enabled?

--alex--

* Re: 2.6.20-rc1 sky2 problems (regression?)
  2006-12-14 22:47     ` Stephen Hemminger
  2006-12-14 22:57       ` Alex Romosan
@ 2006-12-14 23:08       ` Alex Romosan
  2006-12-14 23:21       ` Alex Romosan
  2 siblings, 0 replies; 14+ messages in thread
From: Alex Romosan @ 2006-12-14 23:08 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev

Stephen Hemminger <shemminger@osdl.org> writes:

> Another useful bit of information is the statistics (ethtool -S
> eth0). When there were flow control bugs, they would show up as
> a count of 1.
>
> Are you doing jumbo frames (MTU > 1500)?

i just did 'ethtool -S eth0' (the card is still working fine) and i
don't think there are any jumbo frames. anyway, this is the output:

NIC statistics:
     tx_bytes: 2697577533
     rx_bytes: 503104106
     tx_broadcast: 18
     rx_broadcast: 4068
     tx_multicast: 0
     rx_multicast: 416
     tx_unicast: 2276028
     rx_unicast: 1359009
     tx_mac_pause: 0
     rx_mac_pause: 0
     collisions: 0
     late_collision: 0
     aborted: 0
     single_collisions: 0
     multi_collisions: 0
     rx_short: 0
     rx_runt: 0
     rx_64_byte_packets: 713826
     rx_65_to_127_byte_packets: 271861
     rx_128_to_255_byte_packets: 57307
     rx_256_to_511_byte_packets: 25689
     rx_512_to_1023_byte_packets: 28502
     rx_1024_to_1518_byte_packets: 266308
     rx_1518_to_max_byte_packets: 0
     rx_too_long: 0
     rx_fifo_overflow: 0
     rx_jabber: 0
     rx_fcs_error: 0
     tx_64_byte_packets: 174188
     tx_65_to_127_byte_packets: 225242
     tx_128_to_255_byte_packets: 44294
     tx_256_to_511_byte_packets: 24475
     tx_512_to_1023_byte_packets: 80147
     tx_1024_to_1518_byte_packets: 1727700
     tx_1519_to_max_byte_packets: 0
     tx_fifo_underrun: 0

--alex--

* Re: 2.6.20-rc1 sky2 problems (regression?)
  2006-12-14 22:47     ` Stephen Hemminger
  2006-12-14 22:57       ` Alex Romosan
  2006-12-14 23:08       ` Alex Romosan
@ 2006-12-14 23:21       ` Alex Romosan
  2006-12-14 23:31         ` Stephen Hemminger
  2 siblings, 1 reply; 14+ messages in thread
From: Alex Romosan @ 2006-12-14 23:21 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev

Stephen Hemminger <shemminger@osdl.org> writes:

> Another useful bit of information is the statistics (ethtool -S eth0).
> When there were flow control bugs, they would show up as a count of 1.

the driver locked up again, even with msi interrupts disabled and
idle_timeout=10. the console message was pretty much as before:

kernel: NETDEV WATCHDOG: eth0: transmit timed out
kernel: sky2 eth0: tx timeout
kernel: sky2 eth0: transmit ring 336 .. 296 report=336 done=336
kernel: sky2 hardware hung? flushing
kernel: NETDEV WATCHDOG: eth0: transmit timed out
kernel: sky2 eth0: tx timeout
kernel: sky2 eth0: transmit ring 296 .. 255 report=336 done=336
kernel: sky2 status report lost?

and this is the output from ethtool -S:

NIC statistics:
     tx_bytes: 3092123897
     rx_bytes: 546577898
     tx_broadcast: 20
     rx_broadcast: 4376
     tx_multicast: 0
     rx_multicast: 459
     tx_unicast: 2585993
     rx_unicast: 1550758
     tx_mac_pause: 1
     rx_mac_pause: 0
     collisions: 0
     late_collision: 0
     aborted: 0
     single_collisions: 0
     multi_collisions: 0
     rx_short: 0
     rx_runt: 0
     rx_64_byte_packets: 850693
     rx_65_to_127_byte_packets: 297029
     rx_128_to_255_byte_packets: 62116
     rx_256_to_511_byte_packets: 28795
     rx_512_to_1023_byte_packets: 31357
     rx_1024_to_1518_byte_packets: 285603
     rx_1518_to_max_byte_packets: 0
     rx_too_long: 0
     rx_fifo_overflow: 0
     rx_jabber: 0
     rx_fcs_error: 0
     tx_64_byte_packets: 194159
     tx_65_to_127_byte_packets: 239961
     tx_128_to_255_byte_packets: 48148
     tx_256_to_511_byte_packets: 27635
     tx_512_to_1023_byte_packets: 95557
     tx_1024_to_1518_byte_packets: 1980554
     tx_1519_to_max_byte_packets: 0
     tx_fifo_underrun: 0

time to try the vendor driver and see if that provides any clues.

--alex--

* Re: 2.6.20-rc1 sky2 problems (regression?)
  2006-12-14 23:21       ` Alex Romosan
@ 2006-12-14 23:31         ` Stephen Hemminger
  2006-12-15  1:04           ` Alex Romosan
  0 siblings, 1 reply; 14+ messages in thread
From: Stephen Hemminger @ 2006-12-14 23:31 UTC (permalink / raw)
  To: Alex Romosan; +Cc: netdev

On Thu, 14 Dec 2006 15:21:00 -0800
Alex Romosan <romosan@sycorax.lbl.gov> wrote:

> Stephen Hemminger <shemminger@osdl.org> writes:
> 
> > Another useful bit of information is the statistics (ethtool -S eth0).
> > When there were flow control bugs, they would show up as a count of 1.
> 
> the driver locked up again, even with msi interrupts disabled and
> idle_timeout=10. the console message was pretty much as before:
> 
> kernel: NETDEV WATCHDOG: eth0: transmit timed out
> kernel: sky2 eth0: tx timeout
> kernel: sky2 eth0: transmit ring 336 .. 296 report=336 done=336
> kernel: sky2 hardware hung? flushing
> kernel: NETDEV WATCHDOG: eth0: transmit timed out
> kernel: sky2 eth0: tx timeout
> kernel: sky2 eth0: transmit ring 296 .. 255 report=336 done=336
> kernel: sky2 status report lost?
> 
> and this is the output from ethtool -S:
> 
> NIC statistics:
>      tx_bytes: 3092123897
>      rx_bytes: 546577898
>      tx_broadcast: 20
>      rx_broadcast: 4376
>      tx_multicast: 0
>      rx_multicast: 459
>      tx_unicast: 2585993
>      rx_unicast: 1550758
>      tx_mac_pause: 1

If this is repeatable and mac_pause is always one, then the
problem is hardware flow control.  I saw bugs before in the bus
interface where it would not resume on an unaligned buffer, but
that was on receive.

* Re: 2.6.20-rc1 sky2 problems (regression?)
  2006-12-14 23:31         ` Stephen Hemminger
@ 2006-12-15  1:04           ` Alex Romosan
  2006-12-15  2:24             ` Herbert Xu
  0 siblings, 1 reply; 14+ messages in thread
From: Alex Romosan @ 2006-12-15  1:04 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev

Stephen Hemminger <shemminger@osdl.org> writes:

> If this is repeatable and mac_pause is always one, then the
> problem is hardware flow control.  I saw bugs before in the bus
> interface where it would not resume on an unaligned buffer, but
> that was on receive.

i tried to switch over to the latest vendor driver, but unfortunately
it doesn't work with kernel 2.6.19+. it still uses CHECKSUM_HW, which
looks like it was split into CHECKSUM_PARTIAL and CHECKSUM_COMPLETE.
i think i can replace CHECKSUM_HW in the marvell driver with
CHECKSUM_PARTIAL, except for a couple of places where i am not sure
what i am supposed to do. the first instance says (i am kind of
paraphrasing here since i am copying from the screen and not cutting
and pasting):

/** does the HW need to evaluate checksum for TCP or UDP packets? */
if (pMessage->ip_summed == CHECKSUM_HW)

maybe this needs to be replaced with CHECKSUM_PARTIAL. the second one:

/** TCP checksum offload */
if ((pSKPacket->pMbuf->ip_summed == CHECKSUM_HW) &&
(SetOpcodePacketFlag == SK_TRUE))

i wonder if this is supposed to be CHECKSUM_COMPLETE

if you have any suggestions, i'll appreciate it.

--alex--

* Re: 2.6.20-rc1 sky2 problems (regression?)
  2006-12-15  1:04           ` Alex Romosan
@ 2006-12-15  2:24             ` Herbert Xu
  2006-12-15  3:22               ` Stephen Hemminger
  0 siblings, 1 reply; 14+ messages in thread
From: Herbert Xu @ 2006-12-15  2:24 UTC (permalink / raw)
  To: Alex Romosan; +Cc: shemminger, netdev

Alex Romosan <romosan@sycorax.lbl.gov> wrote:
> /** does the HW need to evaluate checksum for TCP or UDP packets?
> if (pMessage->ip_summed == CHECKSUM_HW)
> 
> maybe this needs to be replaced with CHECKSUM_PARTIAL. the second one
> 
> /** TCP checksum offload
> if ((pSKPacket->pMbuf->ip_summed == CHECKSUM_HW) &&
> (SetOpcodePacketFlag == SK_TRUE)
> 
> i wonder if this is supposed to be CHECKSUM_COMPLETE

The rule of thumb is that it's COMPLETE for RX, and PARTIAL for TX.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

* Re: 2.6.20-rc1 sky2 problems (regression?)
  2006-12-15  2:24             ` Herbert Xu
@ 2006-12-15  3:22               ` Stephen Hemminger
  2006-12-15  3:53                 ` Alex Romosan
  0 siblings, 1 reply; 14+ messages in thread
From: Stephen Hemminger @ 2006-12-15  3:22 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Alex Romosan, netdev

On Fri, 15 Dec 2006 13:24:32 +1100
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> Alex Romosan <romosan@sycorax.lbl.gov> wrote:
> > /** does the HW need to evaluate checksum for TCP or UDP packets?
> > if (pMessage->ip_summed == CHECKSUM_HW)
> > 
> > maybe this needs to be replaced with CHECKSUM_PARTIAL. the second one
> > 
> > /** TCP checksum offload
> > if ((pSKPacket->pMbuf->ip_summed == CHECKSUM_HW) &&
> > (SetOpcodePacketFlag == SK_TRUE)
> > 
> > i wonder if this is supposed to be CHECKSUM_COMPLETE
> 
> The rule of thumb is that it's COMPLETE for RX, and PARTIAL for TX.
> 
> Cheers,

I have a fixed up version of the vendor driver, I'll repackage it tomorrow.

* Re: 2.6.20-rc1 sky2 problems (regression?)
  2006-12-15  3:22               ` Stephen Hemminger
@ 2006-12-15  3:53                 ` Alex Romosan
  2006-12-16  1:25                   ` Stephen Hemminger
  0 siblings, 1 reply; 14+ messages in thread
From: Alex Romosan @ 2006-12-15  3:53 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Herbert Xu, netdev

Stephen Hemminger <shemminger@osdl.org> writes:

> I have a fixed up version of the vendor driver, I'll repackage it tomorrow.

as per the include file, i ended up replacing every CHECKSUM_HW with
CHECKSUM_PARTIAL since the functions in question had to do with
transmit. seems to be working so far without any lockups. we'll see
how long this lasts.

--alex--

* Re: 2.6.20-rc1 sky2 problems (regression?)
  2006-12-15  3:53                 ` Alex Romosan
@ 2006-12-16  1:25                   ` Stephen Hemminger
  2006-12-16  1:53                     ` Alex Romosan
  0 siblings, 1 reply; 14+ messages in thread
From: Stephen Hemminger @ 2006-12-16  1:25 UTC (permalink / raw)
  To: Alex Romosan; +Cc: Herbert Xu, netdev

On Thu, 14 Dec 2006 19:53:45 -0800
Alex Romosan <romosan@sycorax.lbl.gov> wrote:

> Stephen Hemminger <shemminger@osdl.org> writes:
> 
> > I have a fixed up version of the vendor driver, I'll repackage it tomorrow.
> 
> as per the include file, i ended up replacing every CHECKSUM_HW with
> CHECKSUM_PARTIAL since the functions in question had to do with
> transmit. seems to be working so far without any lockups. we'll see
> how long this lasts.
> 
> --alex--
> 

I fixed a bunch of stuff (see ChangeLog) and made a version for 2.6.19
or later; see:
	http://developer.osdl.org/shemminger/prototypes/sk98lin-8.41.tar.gz

It is too noisy in the console log because it shows how many times
the driver dope slaps itself senseless...  Basically, every 250ms when
it is idle it resets. Sorry, it's the kind of code you write to "make
it work" and ship, which is why vendor drivers suck.

-- 
Stephen Hemminger <shemminger@osdl.org>

* Re: 2.6.20-rc1 sky2 problems (regression?)
  2006-12-16  1:25                   ` Stephen Hemminger
@ 2006-12-16  1:53                     ` Alex Romosan
  0 siblings, 0 replies; 14+ messages in thread
From: Alex Romosan @ 2006-12-16  1:53 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Herbert Xu, netdev

Stephen Hemminger <shemminger@osdl.org> writes:

> I fixed a bunch of stuff (see ChangeLog) and made a version for 2.6.19
> or later; see:
> 	http://developer.osdl.org/shemminger/prototypes/sk98lin-8.41.tar.gz
>
> It is too noisy in the console log because it shows how many times
> the driver dope slaps itself senseless...  Basically, every 250ms when
> it is idle it resets. Sorry, it's the kind of code you write to "make
> it work" and ship, which is why vendor drivers suck.

i'll give it a try on monday when i go back to work. in the meantime
i've been running with my "fixed" version of the vendor driver and so
far it's been working without any problems (i've been transferring
lots of data in and out of the computer the whole day). if there is
anything i can do to help debug the kernel sky2 driver let me know.

--alex--

Thread overview: 14+ messages
     [not found] <87psammchi.fsf@sycorax.lbl.gov>
2006-12-14 21:30 ` 2.6.20-rc1 sky2 problems (regression?) Stephen Hemminger
2006-12-14 22:00   ` Alex Romosan
2006-12-14 22:25   ` Alex Romosan
2006-12-14 22:47     ` Stephen Hemminger
2006-12-14 22:57       ` Alex Romosan
2006-12-14 23:08       ` Alex Romosan
2006-12-14 23:21       ` Alex Romosan
2006-12-14 23:31         ` Stephen Hemminger
2006-12-15  1:04           ` Alex Romosan
2006-12-15  2:24             ` Herbert Xu
2006-12-15  3:22               ` Stephen Hemminger
2006-12-15  3:53                 ` Alex Romosan
2006-12-16  1:25                   ` Stephen Hemminger
2006-12-16  1:53                     ` Alex Romosan
