Re: sky2 hw csum failure [was Re: sky2 large MTU problems]

netdev.vger.kernel.org archive mirror

* Re: sky2 hw csum failure [was Re: sky2 large MTU problems]
       [not found] <6278d2220605240228v576dd66atdad4855b308e64bf@mail.gmail.com>
@ 2006-05-24 17:38 ` Stephen Hemminger
  2006-05-25  9:17   ` Daniel J Blueman
  2006-05-25 10:35   ` Patrick McHardy
  0 siblings, 2 replies; 6+ messages in thread
From: Stephen Hemminger @ 2006-05-24 17:38 UTC
  To: Daniel J Blueman; +Cc: Netfilter Developer, netdev

On Wed, 24 May 2006 10:28:52 +0100
"Daniel J Blueman" <daniel.blueman@gmail.com> wrote:

> Having done some more stress testing with sky2 1.4 (in 2.6.17-rc4) and
> the latest patch, I have found problems when streaming lots of data
> out of the sky2 interface (eg via samba serving a large file to GigE
> client). Ultimately, the interface will stop sending.
> 
> Before this happens, I see lots of:
> 
> kernel: lan0: hw csum failure.
> kernel:  [__skb_checksum_complete+86/96] __skb_checksum_complete+0x56/0x60
> kernel:  [tcp_error+300/512] tcp_error+0x12c/0x200
> kernel:  [poison_obj+41/96] poison_obj+0x29/0x60
> kernel:  [tcp_error+0/512] tcp_error+0x0/0x200
> kernel:  [ip_conntrack_in+157/1072] ip_conntrack_in+0x9d/0x430
> kernel:  [kfree_skbmem+8/128] kfree_skbmem+0x8/0x80
> kernel:  [arp_process+102/1408] arp_process+0x66/0x580
> kernel:  [check_poison_obj+36/416] check_poison_obj+0x24/0x1a0
> kernel:  [arp_process+102/1408] arp_process+0x66/0x580
> kernel:  [nf_iterate+99/144] nf_iterate+0x63/0x90
> kernel:  [ip_rcv_finish+0/608] ip_rcv_finish+0x0/0x260
> kernel:  [nf_hook_slow+89/240] nf_hook_slow+0x59/0xf0
> kernel:  [ip_rcv_finish+0/608] ip_rcv_finish+0x0/0x260
> kernel:  [ip_rcv+386/1104] ip_rcv+0x182/0x450
> kernel:  [ip_rcv_finish+0/608] ip_rcv_finish+0x0/0x260
> kernel:  [packet_rcv_spkt+216/320] packet_rcv_spkt+0xd8/0x140
> kernel:  [netif_receive_skb+476/784] netif_receive_skb+0x1dc/0x310
> kernel:  [sky2_poll+879/2096] sky2_poll+0x36f/0x830
> kernel:  [_spin_lock_irqsave+9/16] _spin_lock_irqsave+0x9/0x10
> kernel:  [run_timer_softirq+290/416] run_timer_softirq+0x122/0x1a0
> kernel:  [net_rx_action+108/256] net_rx_action+0x6c/0x100
> kernel:  [__do_softirq+66/160] __do_softirq+0x42/0xa0
> kernel:  [do_softirq+78/96] do_softirq+0x4e/0x60
> kernel:  =======================
> kernel:  [do_IRQ+90/160] do_IRQ+0x5a/0xa0
> kernel:  [remove_vma+69/80] remove_vma+0x45/0x50
> kernel:  [common_interrupt+26/32] common_interrupt+0x1a/0x20
> kernel:  [get_offset_pmtmr+151/3584] get_offset_pmtmr+0x97/0xe00
> kernel:  [do_gettimeofday+26/208] do_gettimeofday+0x1a/0xd0
> kernel:  [sys_gettimeofday+26/144] sys_gettimeofday+0x1a/0x90
> kernel:  [syscall_call+7/11] syscall_call+0x7/0xb


Whatever the netfilter chain is, it is trimming or altering the packet
without clearing or updating the hardware checksum. It is not a driver
problem; we have already seen these with VLANs and ebtables.
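
To make that failure mode concrete, here is a minimal Python sketch. It
assumes a toy model in which the NIC stores a 16-bit ones' complement sum
over everything it received and a later layer trims trailing bytes. The
function names and sample bytes are made up for illustration; this is not
the kernel's code.

```python
def fold16(total: int) -> int:
    """Fold carries back into the low 16 bits (end-around carry)."""
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def csum(data: bytes) -> int:
    """16-bit ones' complement sum over data, without the final inversion."""
    if len(data) % 2:
        data += b"\x00"  # pad odd-length input with a zero byte
    words = ((data[i] << 8) | data[i + 1] for i in range(0, len(data), 2))
    return fold16(sum(words))


def csum_sub(total: int, part: int) -> int:
    """Remove 'part' from 'total' in ones' complement arithmetic.

    Only valid when the removed bytes start at an even offset, as they
    do in the sample below.
    """
    return fold16(total + (~part & 0xFFFF))


# Hypothetical sample data: 40 payload bytes plus 6 trailing bytes.
payload = bytes(range(1, 41))
trailer = b"\xde\xad\xbe\xef\x00\x01"

hw_sum = csum(payload + trailer)  # what the toy "hardware" stored
after_trim = csum(payload)        # what software sees once the tail is trimmed

# Stored sum left untouched after the trim: no longer matches.
stale_ok = hw_sum == after_trim
# Stored sum adjusted by the trimmed bytes: matches again.
fixed_ok = csum_sub(hw_sum, csum(trailer)) == after_trim
print(stale_ok, fixed_ok)
```

In this model, any code that shortens the packet has to either adjust the
stored sum or discard it; otherwise the later verification step sees a
mismatch.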


> One of these was preceded by:
> 
> kernel: sky2 lan0: rx error, status 0x977d977d length 0

The receive FIFO got overrun. You must not be running hardware flow
control.

> 
> This was happening with the default MTU of 1500, not just at MTU size
> 9000 (but it was changed down from 9000). Hardware is Yukon-EC (0xb6)
> rev 1.
> 
> I'll do some more stress testing tonight without the MTU patch and
> without the MTU being raised to 9000 initially and see what happens.
> 
> Thanks for all your great work so far!



* Re: sky2 hw csum failure [was Re: sky2 large MTU problems]
  2006-05-24 17:38 ` sky2 hw csum failure [was Re: sky2 large MTU problems] Stephen Hemminger
@ 2006-05-25  9:17   ` Daniel J Blueman
  2006-05-25 10:35   ` Patrick McHardy
  1 sibling, 0 replies; 6+ messages in thread
From: Daniel J Blueman @ 2006-05-25  9:17 UTC
  To: Stephen Hemminger; +Cc: Netfilter Developer, netdev

Hi Stephen,

Thanks for your feedback.

On 24/05/06, Stephen Hemminger <shemminger@osdl.org> wrote:
> "Daniel J Blueman" <daniel.blueman@gmail.com> wrote:
> > Having done some more stress testing with sky2 1.4 (in 2.6.17-rc4) and
> > the latest patch, I have found problems when streaming lots of data
> > out of the sky2 interface (eg via samba serving a large file to GigE
> > client). Ultimately, the interface will stop sending.
> >
> > Before this happens, I see lots of:
> >
> > kernel: lan0: hw csum failure.
> > [stack trace snipped; identical to the one in the first message above]
>
> Whatever the netfilter chain is, it is trimming or altering the packet
> without clearing or updating the hardware checksum. It is not a driver
> problem; we have already seen these with VLANs and ebtables.

No ebtables or VLAN used; the relevant part of iptables I have:

iptables -t filter -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -t filter -A INPUT -p tcp -m tcp --dport 445 --syn -j ACCEPT # SMB
iptables -t filter -A INPUT -j DROP

This may be linked to the large MTU (7500 or 9000) on the sky2 Linux
box, while the client was transmitting back to the sky2 with an MTU
of 1500.

> > One of these was preceded by:
> >
> > kernel: sky2 lan0: rx error, status 0x977d977d length 0
>
> The receive FIFO got overrun. You must not be running hardware flow
> control.

This 'status 0x977d977d' message appears before the above problem
occurs, and I couldn't reproduce the 'hw csum failure' last night. The
client is a Broadcom NetXtreme PCI-E card, purportedly with flow
control on. I have got the reproducer down to:

1. use 2.6.17-rc4 w/ sky2 MTU patch
2. increase MTU to >= 7500
3. decrease MTU to 1500
4. send ~1-2GB out of sky2 NIC
5. "rx error, status 0x977d977d length 0" messages received
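
For step 5, a small helper can count the two messages in a saved kernel
log. This is only a sketch: the function name and regular expressions are
hypothetical, and the line formats are assumed from the messages quoted in
this thread, so they may need adjusting for other kernel versions.

```python
import re
from collections import Counter

# Patterns assumed from the log lines quoted in this thread.
RX_ERROR = re.compile(r"sky2 (\S+): rx error, status (0x[0-9a-fA-F]+) length (\d+)")
CSUM_FAIL = re.compile(r"(\S+): hw csum failure")


def scan_kernel_log(lines):
    """Count rx-error and hw-csum-failure messages per interface."""
    counts = Counter()
    for line in lines:
        m = RX_ERROR.search(line)
        if m:
            # Key on interface and status word so distinct errors stay apart.
            counts[("rx error", m.group(1), m.group(2))] += 1
            continue
        m = CSUM_FAIL.search(line)
        if m:
            counts[("hw csum failure", m.group(1))] += 1
    return counts
```

It accepts any iterable of lines, so an open log file can be passed in
directly and the resulting counter printed after each test run.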

I have found that without raising the MTU initially to 7500/9000, this
problem does not occur. Perhaps chip tx buffers aren't shrunk when the
MTU is dropped?

Is there a tunable low-watermark for starting the DMA transfer from
the chip on rx? The client isn't sending back that much (TCP acks
every segment, SMB protocol acks every 64KB), but I guess fewer rx
buffers are available, as larger tx buffers are used on the sky2 chip
for the large tx packets.

> > This was happening with the default MTU of 1500, not just at MTU size
> > 9000 (but it was changed down from 9000). Hardware is Yukon-EC (0xb6)
> > rev 1.
> >
> > I'll do some more stress testing tonight without the MTU patch and
> > without the MTU being raised to 9000 initially and see what happens.
> >
> > Thanks for all your great work so far!

Let me know if this is a scenario that isn't expected to work, or if
there is anything else I can look at or try.

Thanks again!
-- 
Daniel J Blueman


* Re: sky2 hw csum failure [was Re: sky2 large MTU problems]
  2006-05-24 17:38 ` sky2 hw csum failure [was Re: sky2 large MTU problems] Stephen Hemminger
  2006-05-25  9:17   ` Daniel J Blueman
@ 2006-05-25 10:35   ` Patrick McHardy
  2006-05-25 10:55     ` Daniel J Blueman
  1 sibling, 1 reply; 6+ messages in thread
From: Patrick McHardy @ 2006-05-25 10:35 UTC
  To: Stephen Hemminger; +Cc: Daniel J Blueman, netdev, Netfilter Developer

Stephen Hemminger wrote:
> On Wed, 24 May 2006 10:28:52 +0100
> "Daniel J Blueman" <daniel.blueman@gmail.com> wrote:
> 
> 
>>Having done some more stress testing with sky2 1.4 (in 2.6.17-rc4) and
>>the latest patch, I have found problems when streaming lots of data
>>out of the sky2 interface (eg via samba serving a large file to GigE
>>client). Ultimately, the interface will stop sending.
>>
>>Before this happens, I see lots of:
>>
>>kernel: lan0: hw csum failure.
>>[stack trace snipped; identical to the one in the first message above]
> 
> 
> 
> Whatever the netfilter chain is, it is trimming or altering the packet
> without clearing or updating the hardware checksum. It is not a driver
> problem; we have already seen these with VLANs and ebtables.


The call chain looks pretty messed up, but the point where an
invalid HW checksum is detected is in TCP connection tracking,
which is basically the first thing netfilter does, unless
you use the raw table. There are no packet modifications done
by conntrack, so I doubt that netfilter is the culprit here.
Of course we had some big checksumming cleanups, so there is
a possibility of bugs there, but I did test them with sky2 and
HW checksumming, so I don't think that's the case.

Daniel, is there an easy way to reproduce the checksum failure?


* Re: sky2 hw csum failure [was Re: sky2 large MTU problems]
  2006-05-25 10:35   ` Patrick McHardy
@ 2006-05-25 10:55     ` Daniel J Blueman
  2006-05-25 11:15       ` Patrick McHardy
  0 siblings, 1 reply; 6+ messages in thread
From: Daniel J Blueman @ 2006-05-25 10:55 UTC
  To: Patrick McHardy; +Cc: Stephen Hemminger, netdev, Netfilter Developer

On 25/05/06, Patrick McHardy <kaber@trash.net> wrote:
> Stephen Hemminger wrote:
> > On Wed, 24 May 2006 10:28:52 +0100
> > "Daniel J Blueman" <daniel.blueman@gmail.com> wrote:
> >
> >>Having done some more stress testing with sky2 1.4 (in 2.6.17-rc4) and
> >>the latest patch, I have found problems when streaming lots of data
> >>out of the sky2 interface (eg via samba serving a large file to GigE
> >>client). Ultimately, the interface will stop sending.
> >>
> >>Before this happens, I see lots of:
> >>
> >>kernel: lan0: hw csum failure.
> >>[stack trace snipped; identical to the one in the first message above]
> >
> > Whatever the netfilter chain is, it is trimming or altering the packet
> > without clearing or updating the hardware checksum. It is not a driver
> > problem; we have already seen these with VLANs and ebtables.
>
> The call chain looks pretty messed up, but the point where an
> invalid HW checksum is detected is in TCP connection tracking,
> which is basically the first thing netfilter does, unless
> you use the raw table. There are no packet modifications done
> by conntrack, so I doubt that netfilter is the culprit here.
> Of course we had some big checksumming cleanups, so there is
> a possibility of bugs there, but I did test them with sky2 and
> HW checksumming, so I don't think that's the case.
>
> Daniel, is there an easy way to reproduce the checksum failure?

In short, no. This was seen when packets may have been truncated by
large MTU (eg 9000) problems in the sky2 driver transmit path.

There is a small chance that this could relate to transmitting with an
MTU of 9000 (possibly with receiving with an MTU of 1500 too).

On that interface, the only rules that were being exercised were:

iptables -t filter -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
iptables -t filter -A INPUT -p tcp -m tcp --dport 445 --syn -j ACCEPT # SMB
iptables -t filter -A INPUT -j DROP

HTB and SFQ are active on other interfaces.
-- 
Daniel J Blueman


* Re: sky2 hw csum failure [was Re: sky2 large MTU problems]
  2006-05-25 10:55     ` Daniel J Blueman
@ 2006-05-25 11:15       ` Patrick McHardy
  2006-05-30  9:10         ` Daniel J Blueman
  0 siblings, 1 reply; 6+ messages in thread
From: Patrick McHardy @ 2006-05-25 11:15 UTC
  To: Daniel J Blueman; +Cc: Stephen Hemminger, netdev, Netfilter Developer

Daniel J Blueman wrote:
> On 25/05/06, Patrick McHardy <kaber@trash.net> wrote:
> 
>> Daniel, is there an easy way to reproduce the checksum failure?
> 
> 
> In short, no. This was seen when packets may have been truncated by
> large MTU (eg 9000) problems in the sky2 driver transmit path.
> 
> There is a small chance that this could relate to transmitting with an
> MTU of 9000 (possibly with receiving with an MTU of 1500 too)

Unfortunately I can't test this myself because my other NICs don't
support MTUs > 1500.

> On that interface, the only rules that were being exercised were:
> 
> iptables -t filter -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
> iptables -t filter -A INPUT -p tcp -m tcp --dport 445 --syn -j ACCEPT # SMB
> iptables -t filter -A INPUT -j DROP

That shouldn't cause any packet modifications. Can you trigger the
checksum failures without netfilter?



* Re: sky2 hw csum failure [was Re: sky2 large MTU problems]
  2006-05-25 11:15       ` Patrick McHardy
@ 2006-05-30  9:10         ` Daniel J Blueman
  0 siblings, 0 replies; 6+ messages in thread
From: Daniel J Blueman @ 2006-05-30  9:10 UTC
  To: Patrick McHardy; +Cc: Stephen Hemminger, netdev, Netfilter Developer

On 25/05/06, Patrick McHardy <kaber@trash.net> wrote:
> Daniel J Blueman wrote:
> > On 25/05/06, Patrick McHardy <kaber@trash.net> wrote:
> >
> >> Daniel, is there an easy way to reproduce the checksum failure?
> >
> > In short, no. This was seen when packets may have been truncated by
> > large MTU (eg 9000) problems in the sky2 driver transmit path.
> >
> > There is a small chance that this could relate to transmitting with an
> > MTU of 9000 (possibly with receiving with an MTU of 1500 too)
>
> Unfortunately I can't test this myself because my other NICs don't
> support MTUs > 1500.
>
> > On that interface, the only rules that were being exercised were:
> >
> > iptables -t filter -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
> > iptables -t filter -A INPUT -p tcp -m tcp --dport 445 --syn -j ACCEPT # SMB
> > iptables -t filter -A INPUT -j DROP
>
> That shouldn't cause any packet modifications. Can you trigger the
> checksum failures without netfilter?

When testing, I always run into the "kernel: sky2 lan0: rx error,
status 0x977d977d length 0" problem before anything else.

I need to eliminate the sky2 driver from the equation before I can
prove whether there is a problem elsewhere. I did have some e1000
NICs, but no longer do, so it will have to wait until I can find a
stable scenario for my sky2 NIC...
-- 
Daniel J Blueman


