2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out


* 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out
@ 2004-06-14 16:47 David Greaves
       [not found] ` <20040615155111.26d6b809@dell_ss3.pdx.osdl.net>
  0 siblings, 1 reply; 14+ messages in thread
From: David Greaves @ 2004-06-14 16:47 UTC
  To: shemminger, scott.feldman; +Cc: netdev

Hi

I have 2 machines with Intel PRO/1000 MT cards.

One machine seems to work fine (AFAIK), the other has major problems.
I've swapped the cards and the problem stays on the machine.

I'm using version 5.2.39-k2 from the stock 2.6.6 kernel on both machines.

Any sustained traffic causes repeated:
Jun 14 16:29:14 ash kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jun 14 16:29:17 ash kernel: e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex
Jun 14 16:29:17 ash kernel: nfs: server cu OK

I had a pair of Realtek r8169s that worked fine but only gave me 10Mb/s, 
so I exchanged them for the Intel PRO/1000 cards in the hope of something 
better. Now, even with scp's rate limiter set as low as 10kb/s, the 
timeout still occurs.

I have played with all the module parameters and not found anything that 
affects it at 1Gbps.

Even dropping to 100Mbps:
Jun 14 17:33:03 ash kernel: e1000: eth0 NIC Link is Up 100 Mbps Full Duplex
Jun 14 17:33:33 ash kernel: NETDEV WATCHDOG: eth0: transmit timed out

It can do 10Mbps:
scp reports a throughput of 1.0MB/s (... less than thrilling).
However, scp now transfers a few MB and then says:
Disconnecting: Corrupted MAC on input.

I found this mail:
  http://oss.sgi.com/projects/netdev/archive/2004-06/msg00256.html
from Stephen

which appears to reverse this mail:
  http://marc.theaimsgroup.com/?l=linux-kernel&m=107516205706542&w=2
from Scott
which I gather was supposed to correct this problem :)

I have seen no suggestions about other subsystems (eg ACPI etc) that 
could also be tried.

David


[parent not found: <20040615155111.26d6b809@dell_ss3.pdx.osdl.net>]

* Re: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out
       [not found] ` <20040615155111.26d6b809@dell_ss3.pdx.osdl.net>
@ 2004-06-16 10:59   ` David Greaves
  2004-06-18  8:04     ` Jens Laas
  0 siblings, 1 reply; 14+ messages in thread
From: David Greaves @ 2004-06-16 10:59 UTC
  To: Stephen Hemminger; +Cc: netdev

Stephen Hemminger wrote:

>How big is the transmit ring. Setting a bigger transmit ring fixed my problem
>	modprobe e1000 TxDescriptors=1024
>
>Also, there are lots of flavors of this chipset and board.  One machine
>I was using had the IBM rebranded version and it would only do PCI33 not PCI66.
>
>  
>
Thanks for replying Stephen - it's really frustrating :)

I did try TxDescriptors and various (most) other parameters (below are 
the actual parameter variations I tried - just cut from a 'history' for 
info).

After each one I downed the link and ran modprobe -r on the driver.
I then ran a large-file scp (quicker to identify and recover from than NFS 
hanging when the link died).

I invariably got an eth0 timeout after a few seconds, with some variation 
but IIRC no more than 20%, i.e. 8-10MB of a 1GB file before it failed.

root@ash:~ # ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             256
RX Mini:        0
RX Jumbo:       0
TX:             1024
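
One way to confirm that a TxDescriptors or RxDescriptors change actually took effect after reloading the driver is to read the values back from the ethtool -g output above. A minimal sketch, assuming the output keeps the layout shown here; parse_ring_params and SAMPLE are illustrative names, not part of ethtool or the driver:

```python
# Map the section headings printed by `ethtool -g` to short keys.
SECTION_NAMES = {
    "Pre-set maximums:": "max",
    "Current hardware settings:": "current",
}

def parse_ring_params(text):
    """Return {"max": {...}, "current": {...}} from `ethtool -g` style text."""
    result = {"max": {}, "current": {}}
    section = None
    for raw in text.splitlines():
        line = raw.strip()
        if line in SECTION_NAMES:
            section = SECTION_NAMES[line]
            continue
        key, sep, value = line.partition(":")
        value = value.strip()
        # Skip the title line and any non-numeric values.
        if section and sep and value.isdigit():
            result[section][key.strip()] = int(value)
    return result

# Sample text copied from the output quoted in this message.
SAMPLE = """\
Ring parameters for eth0:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             256
RX Mini:        0
RX Jumbo:       0
TX:             1024
"""

rings = parse_ring_params(SAMPLE)
print("TX ring:", rings["current"]["TX"], "of", rings["max"]["TX"])
```

Run against the sample, this reports a TX ring of 1024 out of a maximum of 4096, i.e. the larger transmit ring was in effect for that test.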

I've pulled all the cards and looked - they are all genuine Intel 
C39226-003 (Pro/1000 MT)
This page http://support.intel.com/support/network/sb/cs-005980-prd38.htm
says: 82541 Gigabit Small Form 32/66
My system has PCI33 BTW.

I have also tried 2.6.7 this morning and have the same problem.

David

Module parameters:
modprobe e1000 FlowControl=2
modprobe e1000 FlowControl=1
modprobe e1000 FlowControl=3
modprobe e1000 FlowControl=0
modprobe e1000 FlowControl=0 InterruptThrottleRate=100
modprobe e1000 FlowControl=0 InterruptThrottleRate=1 RxDescriptors=1024 ; ifup eth0
modprobe e1000 FlowControl=0 InterruptThrottleRate=1 RxDescriptors=4096 ; ifup eth0
modprobe e1000 FlowControl=0 InterruptThrottleRate=1 RxDescriptors=256 RxIntDelay=1 ; ifup eth0
modprobe e1000 FlowControl=0 InterruptThrottleRate=1 RxDescriptors=256 RxIntDelay=10 ; ifup eth0
modprobe e1000 FlowControl=0 InterruptThrottleRate=1 RxDescriptors=256 RxIntDelay=1000 ; ifup eth0
modprobe e1000 FlowControl=0 InterruptThrottleRate=1 RxDescriptors=256 RxIntDelay=0 RxAbsIntDelay=0 ; ifup eth0
modprobe e1000 FlowControl=0 InterruptThrottleRate=1 RxDescriptors=256 RxIntDelay=0 RxAbsIntDelay=0 TxIntDelay=0 ; ifup eth0
modprobe e1000 FlowControl=0 InterruptThrottleRate=1 RxDescriptors=256 RxIntDelay=0 RxAbsIntDelay=1024 TxIntDelay=53 ; ifup eth0
modprobe e1000 FlowControl=0 InterruptThrottleRate=1 RxDescriptors=256 RxIntDelay=0 RxAbsIntDelay=1024 TxIntDelay=64 ; ifup eth0
modprobe e1000 FlowControl=0 InterruptThrottleRate=1 RxDescriptors=256 RxIntDelay=0 RxAbsIntDelay=65535 TxIntDelay=64 ; ifup eth0
modprobe e1000 FlowControl=0 InterruptThrottleRate=1 RxDescriptors=256 RxIntDelay=0 RxAbsIntDelay=128 TxIntDelay=0 ; ifup eth0
modprobe e1000 FlowControl=0 InterruptThrottleRate=1 RxDescriptors=256 RxIntDelay=0 RxAbsIntDelay=128 TxIntDelay=32000 ; ifup eth0
modprobe e1000 FlowControl=0 InterruptThrottleRate=1 RxDescriptors=256 RxIntDelay=0 RxAbsIntDelay=128 TxIntDelay=32000 ; ifup eth0
modprobe e1000 FlowControl=0 InterruptThrottleRate=1 RxDescriptors=256 RxIntDelay=0 RxAbsIntDelay=128 TxIntDelay=64 XsumRX=1 ; ifup eth0
modprobe e1000 Speed=100
modprobe e1000 Speed=10


* Re: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out
  2004-06-16 10:59   ` David Greaves
@ 2004-06-18  8:04     ` Jens Laas
  2004-06-18  9:08       ` 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler David Greaves
  2004-06-18 18:11       ` 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out Stephen Hemminger
  0 siblings, 2 replies; 14+ messages in thread
From: Jens Laas @ 2004-06-18  8:04 UTC
  To: David Greaves; +Cc: Stephen Hemminger, netdev


(04.06.16 at 11:59) David Greaves wrote the following to Stephen Hemminger:

We have seen the same symptoms. (2.6.x + e1000)

Our system is an SMP system. That might be what's triggering the problem.
Is your system UP or SMP?
(Next reboot we will test running on only one CPU).

We have tried with and without NAPI, both exhibit the same problem.
We have tried different versions of e1000 without luck.
We have tried with 100Mb and gigabit switches.

Make sure that flow control is disabled on your switch (if the switch 
implements it).

> Stephen Hemminger wrote:
>
>> How big is the transmit ring. Setting a bigger transmit ring fixed my 
>> problem
>> 	modprobe e1000 TxDescriptors=1024

I wouldn't call that a fix, more like a workaround. It should work
regardless of ring size.


>> 
>> Also, there are lots of flavors of this chipset and board.  One machine
>> I was using had the IBM rebranded version and it would only do PCI33 not 
>> PCI66.
>> 
>> 
> Thanks for replying Stephen - it's really frustrating :)
>
> I did try TxDescriptors and various (most) other parameters (below are the 
> actual parameter variations I tried - just cut from a 'history' for info).
>
> After each one i downed the link and modprobe -r the driver.
> I then ran a large file scp (quicker id+recovery than nfs hanging when the 
> link died)
>
> I invariably got an eth0 timed out after a few seconds - some variation but 
> IIRC no more than 20% - ie 8-10Mb of a 1G file before it failed.
>
> root@ash:~ # ethtool -g eth0
> Ring parameters for eth0:
> Pre-set maximums:
> RX:             4096
> RX Mini:        0
> RX Jumbo:       0
> TX:             4096
> Current hardware settings:
> RX:             256
> RX Mini:        0
> RX Jumbo:       0
> TX:             1024
>
> I've pulled all the cards and looked - they are all genuine Intel C39226-003 
> (Pro/1000 MT)
> This page http://support.intel.com/support/network/sb/cs-005980-prd38.htm
> says: 82541 Gigabit Small Form 32/66
> My system has PCI33 BTW.
>
> I have also tried 2.6.7 this morning and have the same problem.
>
> David
>
>
> module parameters.

I believe the following is recommended by the driver developers:
TxDescriptors=256 RxDescriptors=256 FlowControl=0 XsumRX=0

Cheers,
Jens Låås

-----------------------------------------------------------------------
     'This mail automatically becomes portable when carried.'
-----------------------------------------------------------------------
     Jens Låås                              Email: jens.laas@data.slu.se
     Department of Computer Services, SLU   Phone: +46 18 67 35 15
     Vindbrovägen 1
     P.O. Box 7079
     S-750 07 Uppsala
     SWEDEN 
-----------------------------------------------------------------------


* Re: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler
  2004-06-18  8:04     ` Jens Laas
@ 2004-06-18  9:08       ` David Greaves
  2004-06-18 10:27         ` Jens Laas
  2004-06-21 16:42         ` Thayne Harbaugh
  2004-06-18 18:11       ` 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out Stephen Hemminger
  1 sibling, 2 replies; 14+ messages in thread
From: David Greaves @ 2004-06-18  9:08 UTC
  To: Jens Laas; +Cc: Stephen Hemminger, netdev, ganesh.venkatesan

Stephen, I applied your delay scheduler patch and some results appear below.

Jens Laas wrote:

> (04.06.16 at 11:59) David Greaves wrote the following to Stephen Hemminger:
>
> We have seen the same symptoms. (2.6.x + e1000)
>
> Our system is an SMP system. That might be whats triggering the problem.
> Is your system UP or SMP ?

UP

> (Next reboot we will test running on only one CPU).
>
> We have tried with and without NAPI, both exhibit the same problem.

Me too

> We have tried different versions of e1000 without luck.

Me too, 3 cards.
(Did I mention I have 2 machines with very similar specs (AMD/VIA KT600)? 
The other one works - actually, to be accurate, it hasn't yet failed, 
but it hasn't yet run at full speed either - and it has a higher CPU speed.)

> We have tried with 100Mb and gigabit switches.

I'm now running two e1000's back to back over a piece of cat5...

>
> Make sure that flowcontrol is disabled on your switch (if it has it 
> implemented).

...so it's not that smart anymore ;)

>>
>> module parameters.
>
>
> I believe following is recommended by driver developers:
> TxDescriptors=256 RxDescriptors=256 FlowControl=0 XsumRX=0

Yes, I'm running with module defaults unless otherwise stated but I've 
tried that combo (to no effect)

I'm speaking with Ganesh Venkatesan at Intel about it. Ganesh, you went 
off-list - do you want to include Jens or maybe go back on-list?

A simple failure case for me is 'ping -s 1500'.
This doesn't cause the timeout but doesn't succeed either.

ping -f with the standard packet size succeeds (at a slow rate though) and 
doesn't time out.

Using an 8139 100Mbps card:
272384 packets transmitted, 272383 packets received, 0% packet loss
round-trip min/avg/max = 0.1/0.1/4.0 ms
real    0m32.179s

Using Pro/1000:
60992 packets transmitted, 60991 packets received, 0% packet loss
round-trip min/avg/max = 0.0/0.5/8.4 ms
real    0m38.257s
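
The two flood-ping runs above can be reduced to packets per second to make the gap concrete. A rough sketch using only the packet counts and wall-clock times reported above; packets_per_second is an illustrative helper:

```python
def packets_per_second(transmitted, wall_seconds):
    """Average send rate over the whole run."""
    return transmitted / wall_seconds

# Figures copied from the two runs quoted in this message.
rtl8139_pps = packets_per_second(272384, 32.179)
pro1000_pps = packets_per_second(60992, 38.257)

print(f"8139:     {rtl8139_pps:.0f} packets/s")
print(f"Pro/1000: {pro1000_pps:.0f} packets/s")
print(f"ratio:    {rtl8139_pps / pro1000_pps:.1f}x")
```

This works out to roughly 8,500 packets/s for the 8139 against roughly 1,600 packets/s for the Pro/1000, a factor of about five. Since ping -f normally sends the next packet as soon as a reply comes back, that factor is in line with the average round-trip times reported (0.1 ms against 0.5 ms).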

Any ping with -s >1500 results in 100% packet loss.

============
From here on down it's 2.6.7 with Stephen's recent delay scheduler patch.

This changed the behaviour.

Now ping -s 1500 works,
but above that size it gets lossy:
root@ash:~ # ping -s3000 10.0.1.1
PING 10.0.1.1 (10.0.1.1): 3000 data bytes
3008 bytes from 10.0.1.1: icmp_seq=1 ttl=64 time=0.5 ms
3008 bytes from 10.0.1.1: icmp_seq=11 ttl=64 time=0.5 ms
3008 bytes from 10.0.1.1: icmp_seq=12 ttl=64 time=0.4 ms
3008 bytes from 10.0.1.1: icmp_seq=13 ttl=64 time=0.9 ms
3008 bytes from 10.0.1.1: icmp_seq=15 ttl=64 time=0.4 ms
3008 bytes from 10.0.1.1: icmp_seq=16 ttl=64 time=0.3 ms

and now I'm seeing ping generate:
Jun 18 09:41:57 ash kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jun 18 09:41:59 ash kernel: e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex

ping -f now works for packet sizes up to -s 2952 (2 packets at mtu 1500)

ping -f -s 2953 results in:
PING 10.0.1.1 (10.0.1.1): 2953 data bytes
..............................ping: sendto: No buffer space available
ping: wrote 10.0.1.1 2961 chars, ret=-1
.ping: sendto: No buffer space available
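
The 2952/2953 boundary matches IPv4 fragmentation arithmetic: ping adds an 8-byte ICMP header to the -s payload (hence the "2961 chars" above for -s 2953), and each fragment at MTU 1500 carries at most 1480 bytes after a 20-byte IP header. A small sketch, assuming IPv4 headers without options; fragments_needed is an illustrative name:

```python
IP_HEADER = 20    # bytes, assuming IPv4 with no options
ICMP_HEADER = 8   # bytes, ICMP echo header added by ping

def fragments_needed(ping_size, mtu=1500):
    """Number of IP fragments for a ping with the given -s payload size."""
    datagram_payload = ping_size + ICMP_HEADER
    # Fragment data must be a multiple of 8 bytes (except in the last one).
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    # Integer ceiling division.
    return -(-datagram_payload // per_fragment)

for size in (1472, 1473, 1500, 2952, 2953, 3000):
    print(f"-s {size}: {fragments_needed(size)} fragment(s)")
```

So -s 2952 fills exactly two fragments and -s 2953 spills into a third. Anything above -s 1472 already needs fragmentation, which may be relevant to the earlier observation that the larger pings fail.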

NB: with the patch, between the same machines via an alternate pair of NICs:
root@ash:~ # ping -f -s29550 haze
PING haze.dgreaves.com (10.0.0.88): 29550 data bytes
.
--- haze.dgreaves.com ping statistics ---
10592 packets transmitted, 10591 packets received, 0% packet loss
round-trip min/avg/max = 5.4/5.5/83.5 ms

Increasing the transmit descriptors to 4096 avoids the "No buffer space 
available" error with packet sizes up to -s65468 (still 100% failure though).

I'm not sure that adds much now so I'll leave it until I get some more 
suggestions.

HTH

David


* Re: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler
  2004-06-18  9:08       ` 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler David Greaves
@ 2004-06-18 10:27         ` Jens Laas
  2004-06-18 12:51           ` David Greaves
  2004-06-21 16:42         ` Thayne Harbaugh
  1 sibling, 1 reply; 14+ messages in thread
From: Jens Laas @ 2004-06-18 10:27 UTC
  To: David Greaves; +Cc: Jens Laas, Stephen Hemminger, netdev, ganesh.venkatesan


(04.06.18 at 10:08) David Greaves wrote the following to Jens Laas:

> Stephen, I applied your delay scheduler patch and some results appear below.
>
> Jens Laas wrote:
>
>> (04.06.16 at 11:59) David Greaves wrote the following to Stephen Hemminger:
>> 
>> We have seen the same symptoms. (2.6.x + e1000)
>> 
>> Our system is an SMP system. That might be whats triggering the problem.
>> Is your system UP or SMP ?
>
> UP

Ok. This keeps getting stranger..


>
>> (Next reboot we will test running on only one CPU).
>> 
>> We have tried with and without NAPI, both exhibit the same problem.
>
> Me too
>
>> We have tried different versions of e1000 without luck.
>
...
>> Make sure that flowcontrol is disabled on your switch (if it has it 
>> implemented).
>
> ...so it's not that smart anymore ;)
>
>>> 
>>> module parameters.
>> 
>> 
>> I believe following is recommended by driver developers:
>> TxDescriptors=256 RxDescriptors=256 FlowControl=0 XsumRX=0
>
> Yes, I'm running with module defaults unless otherwise stated but I've tried 
> that combo (to no effect)

No effect here either. FlowControl and XsumRX are known troublemakers.

>
> I'm speaking with Ganesh Venkatesan at intel about it. Ganesh you went off 
> list - do you want to include Jens or maybe go back on-list?

If others run into this problem, I'm sure they'll appreciate it if it's 
on-list.
Since we have no idea what causes this (AFAIK), it may be a more general 
problem than the device driver.

>
> A simple failure case for me is : 'ping -s 1500 '
> This doesn't cause the timout but doesn't succeed either.
>
> ping -f with standard packet size succeeds (slow rate though) and doesn't 
> timeout.

I don't see the ping problems at all. Unless you try to ping when the 
interface has "hung"?


>
> Using 8139 100Mbs card:
> 272384 packets transmitted, 272383 packets received, 0% packet loss
> round-trip min/avg/max = 0.1/0.1/4.0 ms
> real    0m32.179s
>
> Using Pro/1000:
> 60992 packets transmitted, 60991 packets received, 0% packet loss
> round-trip min/avg/max = 0.0/0.5/8.4 ms
> real    0m38.257s
>
> any ping with -s >1500 results in 100% packet loss.
>
> ============
> From hereon down it's 2.6.7 with Stephen's recent delay scheduler patch
>
> This changed the behaviour.

This is strange unless you are actually using the delay scheduler?
The default is sch_generic (that is, pfifo), which does not exhibit the 
problems corrected by the patch.


> 10592 packets transmitted, 10591 packets received, 0% packet loss
> round-trip min/avg/max = 5.4/5.5/83.5 ms
>
> Increasing Transmit Descriptors to 4096 avoids the No buffer space available 
> with packet sizes up to -s65468 (still 100% failure though)

Increasing the number of buffers is not a way to fix the problem.

I had hoped to hear something about this from Scott..

Cheers,
Jens

>
> I'm not sure that adds much now so I'll leave it until I get some more 
> suggestions.
>
> HTH
>
> David
>



* Re: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler
  2004-06-18 10:27         ` Jens Laas
@ 2004-06-18 12:51           ` David Greaves
  0 siblings, 0 replies; 14+ messages in thread
From: David Greaves @ 2004-06-18 12:51 UTC
  To: Jens Laas; +Cc: Stephen Hemminger, netdev, ganesh.venkatesan

New info:
I booted into XP and the card works there, so it doesn't look like a 
simple hardware incompatibility.
[I've got no real way to test the performance, but Cygwin's wget against 
Apache 1.3 on the Linux box returns about 25M/s initially and then 15M/s 
sustained for 500Mb]

Jens Laas wrote:

>>
>> I'm speaking with Ganesh Venkatesan at intel about it. Ganesh you 
>> went off list - do you want to include Jens or maybe go back on-list?
>
>
> If others run into this problem I'm sure they'll appreciate if its on 
> list.
> Since we have no idea what causes this (AFAIK) it may be a more 
> general problem than the device driver.

I tend to agree - but I wasn't sure if this was the place and I'll do as 
I'm told ;)

>> A simple failure case for me is : 'ping -s 1500 '
>> This doesn't cause the timout but doesn't succeed either.
>>
>> ping -f with standard packet size succeeds (slow rate though) and 
>> doesn't timeout.
>
>
>
> I dont see the ping problems at all. Unless you try to ping when the 
> interface has "hanged" ?

<sigh> thought that might be helpful.
Ping with -s and -f seems to allow me to trigger errors, and it seems a 
lot more debuggable than scp or NFS :)
No, all tests are run when the interface is reset and 'clean'.

>> ============
>> From hereon down it's 2.6.7 with Stephen's recent delay scheduler patch
>>
>> This changed the behaviour.
>
>
>
> This is strange unless you are actually using the delay scheduler ?
> Default is sch_generic (that is pfifo) that does not exhibit the 
> problems correct by the patch.

I'll go back and double check in case I cocked up...
(I noticed the e1000 module rebuild, but you're right, that's incidental.)

I've rebuilt the kernel and modules with and without the patch and rebooted 
a few times, and I can't reproduce that effect - sorry for the red herring.
After I reverted Stephen's patch, the results I reported are still 
reproducible without it.

>> 10592 packets transmitted, 10591 packets received, 0% packet loss
>> round-trip min/avg/max = 5.4/5.5/83.5 ms
>>
>> Increasing Transmit Descriptors to 4096 avoids the No buffer space 
>> available with packet sizes up to -s65468 (still 100% failure though)
>
>
> Increasing nr of buffers is not a way to fix the problem.

agreed - however in my ignorance of the deep behaviour I'm reporting 
things that affect behaviour in ways I don't expect.
I expected it to take longer to run out of buffers - that didn't happen :)

(Anyway, on retesting I find that this was wrong - I suspect the 
interface was down and I didn't notice)

>
> I had hoped to hear something about this from Scott..

I'm happy to hear from anyone - I don't have *that* long until my RMA 
option expires and I don't fancy keeping them as ornaments!

David


* Re: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler
  2004-06-18  9:08       ` 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler David Greaves
  2004-06-18 10:27         ` Jens Laas
@ 2004-06-21 16:42         ` Thayne Harbaugh
  2004-06-21 17:29           ` David Greaves
  1 sibling, 1 reply; 14+ messages in thread
From: Thayne Harbaugh @ 2004-06-21 16:42 UTC
  To: David Greaves; +Cc: Jens Laas, Stephen Hemminger, netdev, ganesh.venkatesan

On Fri, 2004-06-18 at 03:08, David Greaves wrote:

> Jens Laas wrote:
> > We have tried different versions of e1000 without luck.
> 
> Me too, 3 cards.
> (did I mention I have 2 machines with very similar specs (AMD/VIAKT600) 
> and the other one works - actually, to be accurate, hasn't yet failed 
> but hasn't yet run at full speed - and it has a higher CPU speed)

What do you mean by, ". . . hasn't yet run at full speed - and it has a
higher CPU speed . . ." ?  Does this mean that you can't get the card to
have a reasonable throughput (~900Mbps)?

-- 
Thayne Harbaugh
Linux Networx


* Re: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler
  2004-06-21 16:42         ` Thayne Harbaugh
@ 2004-06-21 17:29           ` David Greaves
  2004-06-21 17:43             ` ganesh.venkatesan
  0 siblings, 1 reply; 14+ messages in thread
From: David Greaves @ 2004-06-21 17:29 UTC
  To: tharbaugh; +Cc: Jens Laas, Stephen Hemminger, netdev, ganesh.venkatesan

Thayne Harbaugh wrote:

>On Fri, 2004-06-18 at 03:08, David Greaves wrote:
>
>  
>
>>Jens Laas wrote:
>>    
>>
>>>We have tried different versions of e1000 without luck.
>>>      
>>>
>>Me too, 3 cards.
>>(did I mention I have 2 machines with very similar specs (AMD/VIAKT600) 
>>and the other one works - actually, to be accurate, hasn't yet failed 
>>but hasn't yet run at full speed - and it has a higher CPU speed)
>>    
>>
>
>What do you mean by, ". . . hasn't yet run at full speed - and it has a
>higher CPU speed . . ." ?  Does this mean that you can't get the card to
>have a reasonable throughput (~900Mbps)?
>
>  
>

It sounded reasonable when I wrote it :)

I have 2 machines I can easily test with (wired back to back)
Machine 1 has an AMD 3000+ CPU, machine 2 has an AMD 3200+ CPU (maybe not 
relevant - maybe important if it's timing related?)

Machine 1 stalls within a few kb.
Machine 2 has shown no signs of failure yet.

However the other machine has not been stressed at all so it has 'not 
yet run at full speed' - not surprising since it has no friends with 
working gigabit cards :)

David
PS
I tried some experiments this weekend with a third machine but I got 
nasty kernel oopses on the second (supposedly good) whenever I did 
ifconfig eth1 mtu 9000 and I've not had time to get any proper results 
or a minimal failure yet.

simply issuing
ifconfig eth1 mtu 9000
on the second machine gave me this:

Jun 18 16:33:08 haze kernel: printk: 1 messages suppressed.
Jun 18 16:33:08 haze kernel: ifconfig: page allocation failure. order:3, mode:0x20
Jun 18 16:33:08 haze kernel:  [__alloc_pages+728/848] __alloc_pages+0x2d8/0x350
Jun 18 16:33:08 haze kernel:  [__get_free_pages+37/64] __get_free_pages+0x25/0x40
Jun 18 16:33:08 haze kernel:  [kmem_getpages+32/176] kmem_getpages+0x20/0xb0
Jun 18 16:33:08 haze kernel:  [cache_grow+166/512] cache_grow+0xa6/0x200
Jun 18 16:33:08 haze kernel:  [cache_alloc_refill+342/544] cache_alloc_refill+0x156/0x220
Jun 18 16:33:08 haze kernel:  [__kmalloc+116/128] __kmalloc+0x74/0x80
...

I'll report more fully when I can produce something consistent.


* Re: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler
  2004-06-21 17:29           ` David Greaves
@ 2004-06-21 17:43             ` ganesh.venkatesan
  2004-06-21 18:34               ` David Greaves
  0 siblings, 1 reply; 14+ messages in thread
From: ganesh.venkatesan @ 2004-06-21 17:43 UTC
  To: David Greaves
  Cc: tharbaugh, Jens Laas, Stephen Hemminger, netdev,
	Venkatesan, Ganesh

David:

Could you try the following patch to work around the memory allocation 
issue you are reporting?

---------------------
--- e1000_main.c	2004-06-21 10:37:29.496090824 -0700
+++ e1000_main.c-patched	2004-06-21 10:37:06.920522832 -0700
@@ -796,7 +796,7 @@ e1000_setup_tx_resources(struct e1000_ad
 	int size;
 
 	size = sizeof(struct e1000_buffer) * txdr->count;
-	txdr->buffer_info = kmalloc(size, GFP_KERNEL);
+	txdr->buffer_info = vmalloc(size);
 	if(!txdr->buffer_info) {
 		return -ENOMEM;
 	}
@@ -809,7 +809,7 @@ e1000_setup_tx_resources(struct e1000_ad
 
 	txdr->desc = pci_alloc_consistent(pdev, txdr->size, &txdr->dma);
 	if(!txdr->desc) {
-		kfree(txdr->buffer_info);
+		vfree(txdr->buffer_info);
 		return -ENOMEM;
 	}
 	memset(txdr->desc, 0, txdr->size);
@@ -913,7 +913,7 @@ e1000_setup_rx_resources(struct e1000_ad
 	int size;
 
 	size = sizeof(struct e1000_buffer) * rxdr->count;
-	rxdr->buffer_info = kmalloc(size, GFP_KERNEL);
+	rxdr->buffer_info = vmalloc(size);
 	if(!rxdr->buffer_info) {
 		return -ENOMEM;
 	}
@@ -927,7 +927,7 @@ e1000_setup_rx_resources(struct e1000_ad
 	rxdr->desc = pci_alloc_consistent(pdev, rxdr->size, &rxdr->dma);
 
 	if(!rxdr->desc) {
-		kfree(rxdr->buffer_info);
+		vfree(rxdr->buffer_info);
 		return -ENOMEM;
 	}
 	memset(rxdr->desc, 0, rxdr->size);
@@ -1051,7 +1051,7 @@ e1000_free_tx_resources(struct e1000_ada
 
 	e1000_clean_tx_ring(adapter);
 
-	kfree(adapter->tx_ring.buffer_info);
+	vfree(adapter->tx_ring.buffer_info);
 	adapter->tx_ring.buffer_info = NULL;
 
 	pci_free_consistent(pdev, adapter->tx_ring.size,
@@ -1120,7 +1120,7 @@ e1000_free_rx_resources(struct e1000_ada
 
 	e1000_clean_rx_ring(adapter);
 
-	kfree(rx_ring->buffer_info);
+	vfree(rx_ring->buffer_info);
 	rx_ring->buffer_info = NULL;
 
 	pci_free_consistent(pdev, rx_ring->size, rx_ring->desc, rx_ring->dma);
--- e1000.h	2004-06-21 10:37:29.523086720 -0700
+++ e1000.h-patched	2004-06-21 10:37:15.506217608 -0700
@@ -49,6 +49,7 @@
 #include <linux/delay.h>
 #include <linux/timer.h>
 #include <linux/slab.h>
+#include <linux/vmalloc.h>
 #include <linux/interrupt.h>
 #include <linux/string.h>
 #include <linux/pagemap.h>
@@ -159,9 +160,9 @@ struct e1000_adapter;
 struct e1000_buffer {
 	struct sk_buff *skb;
 	uint64_t dma;
-	unsigned long length;
 	unsigned long time_stamp;
-	unsigned int next_to_watch;
+	uint16_t next_to_watch;
+	uint16_t length;
 };
 
 struct e1000_desc_ring {
----------------------
ganesh.
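
For context on why the patch switches to vmalloc and shrinks struct e1000_buffer: the buffer_info array is a single allocation of sizeof(struct e1000_buffer) * count, and kmalloc has to find that much physically contiguous memory, whereas vmalloc does not. A back-of-envelope sketch, assuming a 32-bit x86 build (4-byte pointers and longs, no padding) and 4KiB pages; the field sizes below are those assumptions, not values measured from the driver:

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages

# Assumed field sizes in bytes on 32-bit x86, in declaration order.
OLD_E1000_BUFFER = {"skb": 4, "dma": 8, "length": 4,
                    "time_stamp": 4, "next_to_watch": 4}
NEW_E1000_BUFFER = {"skb": 4, "dma": 8, "time_stamp": 4,
                    "next_to_watch": 2, "length": 2}

def buffer_info_bytes(fields, descriptors):
    """Bytes requested for the buffer_info array (struct size * count)."""
    return sum(fields.values()) * descriptors

def pages_spanned(nbytes, page_size=PAGE_SIZE):
    """Number of pages the request covers, rounded up."""
    return -(-nbytes // page_size)

for count in (256, 1024, 4096):
    old = buffer_info_bytes(OLD_E1000_BUFFER, count)
    new = buffer_info_bytes(NEW_E1000_BUFFER, count)
    print(f"{count:5d} descriptors: {old} -> {new} bytes, "
          f"{pages_spanned(old)} -> {pages_spanned(new)} pages")
```

Under these assumptions a 4096-entry ring needs about 96KiB before the patch and 80KiB after it, which is a large contiguous request for kmalloc on a fragmented system.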

On Mon, 21 Jun 2004, David Greaves wrote:

> 
> Thayne Harbaugh wrote:
> 
> >On Fri, 2004-06-18 at 03:08, David Greaves wrote:
> >
> > 
> >
> >>Jens Laas wrote:
> >>   
> >>
> >>>We have tried different versions of e1000 without luck.
> >>>     
> >>>
> >>Me too, 3 cards.
> >>(did I mention I have 2 machines with very similar specs (AMD/VIAKT600)
> >>and the other one works - actually, to be accurate, hasn't yet failed
> >>but hasn't yet run at full speed - and it has a higher CPU speed)
> >>   
> >>
> >
> >What do you mean by, ". . . hasn't yet run at full speed - and it has a
> >higher CPU speed . . ." ?  Does this mean that you can't get the card to
> >have a reasonable throughput (~900Mbps)?
> >
> > 
> >
> 
> It sounded reasonable when I wrote it :)
> 
> I have 2 machines I can easily test with (wired back to back)
> Machine 1 has an AMD3000+ CPU, machine 2 has an AMD3200+ cpu (maybe not
> relevant - maybe important if it's timing related?)
> 
> Machine one  stalls within a few kb.
> Machine two has shown no signs of failure yet.
> 
> However the other machine has not been stressed at all so it has 'not
> yet run at full speed' - not surprising since it has no friends with
> working gigabit cards :)
> 
> David
> PS
> I tried some experiments this weekend with a third machine but I got
> nasty kernel oopses on the second (supposedly good) whenever I did
> ifconfig eth1 mtu 9000 and I've not had time to get any proper results
> or a minimal failure yet.
> 
> simply issuing
> ifconfig eth1 mtu 9000
> on the second machine gave me this:
> 
> Jun 18 16:33:08 haze kernel: printk: 1 messages suppressed.
> Jun 18 16:33:08 haze kernel: ifconfig: page allocation failure. order:3,
> mode:0x20
> Jun 18 16:33:08 haze kernel:  [__alloc_pages+728/848]
> __alloc_pages+0x2d8/0x350
> Jun 18 16:33:08 haze kernel:  [__get_free_pages+37/64]
> __get_free_pages+0x25/0x40
> Jun 18 16:33:08 haze kernel:  [kmem_getpages+32/176] kmem_getpages+0x20/0xb0
> Jun 18 16:33:08 haze kernel:  [cache_grow+166/512] cache_grow+0xa6/0x200
> Jun 18 16:33:08 haze kernel:  [cache_alloc_refill+342/544]
> cache_alloc_refill+0x156/0x220
> Jun 18 16:33:08 haze kernel:  [__kmalloc+116/128] __kmalloc+0x74/0x80
> ...
> 
> I'll report more fully when I can produce something consistent.
> 
> 
> 


* Re: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler
  2004-06-21 17:43             ` ganesh.venkatesan
@ 2004-06-21 18:34               ` David Greaves
  0 siblings, 0 replies; 14+ messages in thread
From: David Greaves @ 2004-06-21 18:34 UTC
  To: ganesh.venkatesan; +Cc: tharbaugh, Jens Laas, Stephen Hemminger, netdev

OK, applied the patch.

ifdown eth1; modprobe -r e1000; modprobe e1000; ifup eth1; ifconfig eth1 mtu 9000
(so no reboot)

dmesg:
e1000: Ignoring new-style parameters in presence of obsolete ones
Intel(R) PRO/1000 Network Driver - version 5.2.52-k4
Copyright (c) 1999-2004 Intel Corporation.
e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection
e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
ifconfig: page allocation failure. order:3, mode:0x20
 [<c01310a8>] __alloc_pages+0x2d8/0x350
 [<c0131145>] __get_free_pages+0x25/0x40
 [<c0134620>] kmem_getpages+0x20/0xb0
 [<c0135186>] cache_grow+0xa6/0x200
 [<c0135436>] cache_alloc_refill+0x156/0x220
 [<c01359f4>] __kmalloc+0x74/0x80
 [<c02a3427>] alloc_skb+0x47/0xe0
 [<f89e45a2>] e1000_alloc_rx_buffers+0x62/0x100 [e1000]
 [<f89e1045>] e1000_up+0x45/0xb0 [e1000]
 [<f89e363c>] e1000_change_mtu+0x7c/0x110 [e1000]
 [<c02a8ea9>] dev_set_mtu+0x79/0x90
 [<c02a94a5>] dev_ioctl+0x1f5/0x280
 [<c02e271e>] inet_ioctl+0x8e/0xa0
 [<c02a0039>] sock_ioctl+0xe9/0x290
 [<c015c50f>] sys_ioctl+0xef/0x260
 [<c0110570>] do_page_fault+0x0/0x4da
 [<c0103fb7>] syscall_call+0x7/0xb

ifconfig: page allocation failure. order:3, mode:0x20
 [<c01310a8>] __alloc_pages+0x2d8/0x350
 [<c0131145>] __get_free_pages+0x25/0x40
 [<c0134620>] kmem_getpages+0x20/0xb0
 [<c0135186>] cache_grow+0xa6/0x200
 [<c0135436>] cache_alloc_refill+0x156/0x220
 [<c0111a1a>] wake_up_state+0x1a/0x20
 [<c01359f4>] __kmalloc+0x74/0x80
 [<c02a3427>] alloc_skb+0x47/0xe0
 [<f89e45a2>] e1000_alloc_rx_buffers+0x62/0x100 [e1000]
 [<f89e41e7>] e1000_clean_rx_irq+0xf7/0x450 [e1000]
 [<c011175f>] recalc_task_prio+0x8f/0x190
 [<f89e3e73>] e1000_clean+0x43/0xc0 [e1000]
 [<c02a861a>] net_rx_action+0x6a/0xf0
 [<c01190bd>] __do_softirq+0x7d/0x80
 [<c01190e6>] do_softirq+0x26/0x30
 [<c0105ded>] do_IRQ+0xfd/0x130
 [<c0104124>] common_interrupt+0x18/0x20
 [<f89e3d37>] e1000_irq_enable+0x27/0x30 [e1000]
 [<f89e109d>] e1000_up+0x9d/0xb0 [e1000]
 [<f89e363c>] e1000_change_mtu+0x7c/0x110 [e1000]
 [<c02a8ea9>] dev_set_mtu+0x79/0x90
 [<c02a94a5>] dev_ioctl+0x1f5/0x280
 [<c02e271e>] inet_ioctl+0x8e/0xa0
 [<c02a0039>] sock_ioctl+0xe9/0x290
 [<c015c50f>] sys_ioctl+0xef/0x260
 [<c0110570>] do_page_fault+0x0/0x4da
 [<c0103fb7>] syscall_call+0x7/0xb

kdeinit: page allocation failure. order:3, mode:0x20
 [<c01310a8>] __alloc_pages+0x2d8/0x350
 [<c0131145>] __get_free_pages+0x25/0x40
 [<c0134620>] kmem_getpages+0x20/0xb0
 [<c0135186>] cache_grow+0xa6/0x200
 [<c0135436>] cache_alloc_refill+0x156/0x220
 [<c01359f4>] __kmalloc+0x74/0x80
 [<c02a3427>] alloc_skb+0x47/0xe0
 [<f89e45a2>] e1000_alloc_rx_buffers+0x62/0x100 [e1000]
 [<f89e41e7>] e1000_clean_rx_irq+0xf7/0x450 [e1000]
 [<f89e3e73>] e1000_clean+0x43/0xc0 [e1000]
 [<c02a861a>] net_rx_action+0x6a/0xf0
 [<c01190bd>] __do_softirq+0x7d/0x80
 [<c01190e6>] do_softirq+0x26/0x30
 [<c0105ded>] do_IRQ+0xfd/0x130
 [<c0104124>] common_interrupt+0x18/0x20
...
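
For anyone reading these traces later: order:3 means a request for 2^3 physically contiguous pages. The sketch below is only an illustration, not driver code. It assumes 4 KiB pages (usual on i386) and assumes the skb data area comes from power-of-two sized kmalloc caches, so that a buffer even slightly over 16 KiB is rounded up to 32 KiB. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Bytes covered by a page allocation of the given order (2^order pages). */
static size_t order_to_bytes(unsigned int order, size_t page_size)
{
	return page_size << order;
}

/* Smallest order whose block can hold `size` bytes. */
static unsigned int size_to_order(size_t size, size_t page_size)
{
	unsigned int order = 0;

	assert(size > 0 && page_size > 0);
	while (order_to_bytes(order, page_size) < size)
		order++;
	return order;
}
```

With 4 KiB pages, order 3 is a 32 KiB block. If the driver really does pick a 16 KiB receive buffer for a 9000-byte MTU (an assumption here), any per-skb overhead on top of that pushes the request from order 2 to order 3, which is hard to satisfy once memory is fragmented.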

David

ganesh.venkatesan@intel.com wrote:

>David:
>
>Could you try the following patch to work around the memory allocation
>issue you are reporting?
>
>---------------------
>--- e1000_main.c	2004-06-21 10:37:29.496090824 -0700
>+++ e1000_main.c-patched	2004-06-21 10:37:06.920522832 -0700
>@@ -796,7 +796,7 @@ e1000_setup_tx_resources(struct e1000_ad
> 	int size;
> 
> 	size = sizeof(struct e1000_buffer) * txdr->count;
>-	txdr->buffer_info = kmalloc(size, GFP_KERNEL);
>+	txdr->buffer_info = vmalloc(size);
> 	if(!txdr->buffer_info) {
> 		return -ENOMEM;
> 	}
>@@ -809,7 +809,7 @@ e1000_setup_tx_resources(struct e1000_ad
> 
> 	txdr->desc = pci_alloc_consistent(pdev, txdr->size, &txdr->dma);
> 	if(!txdr->desc) {
>-		kfree(txdr->buffer_info);
>+		vfree(txdr->buffer_info);
> 		return -ENOMEM;
> 	}
> 	memset(txdr->desc, 0, txdr->size);
>@@ -913,7 +913,7 @@ e1000_setup_rx_resources(struct e1000_ad
> 	int size;
> 
> 	size = sizeof(struct e1000_buffer) * rxdr->count;
>-	rxdr->buffer_info = kmalloc(size, GFP_KERNEL);
>+	rxdr->buffer_info = vmalloc(size);
> 	if(!rxdr->buffer_info) {
> 		return -ENOMEM;
> 	}
>@@ -927,7 +927,7 @@ e1000_setup_rx_resources(struct e1000_ad
> 	rxdr->desc = pci_alloc_consistent(pdev, rxdr->size, &rxdr->dma);
> 
> 	if(!rxdr->desc) {
>-		kfree(rxdr->buffer_info);
>+		vfree(rxdr->buffer_info);
> 		return -ENOMEM;
> 	}
> 	memset(rxdr->desc, 0, rxdr->size);
>@@ -1051,7 +1051,7 @@ e1000_free_tx_resources(struct e1000_ada
> 
> 	e1000_clean_tx_ring(adapter);
> 
>-	kfree(adapter->tx_ring.buffer_info);
>+	vfree(adapter->tx_ring.buffer_info);
> 	adapter->tx_ring.buffer_info = NULL;
> 
> 	pci_free_consistent(pdev, adapter->tx_ring.size,
>@@ -1120,7 +1120,7 @@ e1000_free_rx_resources(struct e1000_ada
> 
> 	e1000_clean_rx_ring(adapter);
> 
>-	kfree(rx_ring->buffer_info);
>+	vfree(rx_ring->buffer_info);
> 	rx_ring->buffer_info = NULL;
> 
> 	pci_free_consistent(pdev, rx_ring->size, rx_ring->desc, rx_ring->dma);
>--- e1000.h	2004-06-21 10:37:29.523086720 -0700
>+++ e1000.h-patched	2004-06-21 10:37:15.506217608 -0700
>@@ -49,6 +49,7 @@
> #include <linux/delay.h>
> #include <linux/timer.h>
> #include <linux/slab.h>
>+#include <linux/vmalloc.h>
> #include <linux/interrupt.h>
> #include <linux/string.h>
> #include <linux/pagemap.h>
>@@ -159,9 +160,9 @@ struct e1000_adapter;
> struct e1000_buffer {
> 	struct sk_buff *skb;
> 	uint64_t dma;
>-	unsigned long length;
> 	unsigned long time_stamp;
>-	unsigned int next_to_watch;
>+	uint16_t next_to_watch;
>+	uint16_t length;
> };
> 
> struct e1000_desc_ring {
>----------------------
>ganesh.
>
>On Mon, 21 Jun 2004, David Greaves wrote:
>
>  
>
>>Thayne Harbaugh wrote:
>>
>>    
>>
>>>On Fri, 2004-06-18 at 03:08, David Greaves wrote:
>>>
>>> 
>>>
>>>      
>>>
>>>>Jens Laas wrote:
>>>>   
>>>>
>>>>        
>>>>
>>>>>We have tried different versions of e1000 without luck.
>>>>>     
>>>>>
>>>>>          
>>>>>
>>>>Me too, 3 cards.
>>>>(did I mention I have 2 machines with very similar specs (AMD/VIAKT600)
>>>>and the other one works - actually, to be accurate, hasn't yet failed
>>>>but hasn't yet run at full speed - and it has a higher CPU speed)
>>>>   
>>>>
>>>>        
>>>>
>>>What do you mean by, ". . . hasn't yet run at full speed - and it has a
>>>higher CPU speed . . ." ?  Does this mean that you can't get the card to
>>>have a reasonable throughput (~900Mbps)?
>>>
>>> 
>>>
>>>      
>>>
>>It sounded reasonable when I wrote it :)
>>
>>I have 2 machines I can easily test with (wired back to back)
>>Machine 1 has an AMD3000+ CPU, machine 2 has an AMD3200+ cpu (maybe not
>>relevant - maybe important if it's timing related?)
>>
>>Machine one  stalls within a few kb.
>>Machine two has shown no signs of failure yet.
>>
>>However the other machine has not been stressed at all so it has 'not
>>yet run at full speed' - not surprising since it has no friends with
>>working gigabit cards :)
>>
>>David
>>PS
>>I tried some experiments this weekend with a third machine but I got
>>nasty kernel oopses on the second (supposedly good) whenever I did
>>ifconfig eth1 mtu 9000 and I've not had time to get any proper results
>>or a minimal failure yet.
>>
>>simply issuing
>>ifconfig eth1 mtu 9000
>>on the second machine gave me this:
>>
>>Jun 18 16:33:08 haze kernel: printk: 1 messages suppressed.
>>Jun 18 16:33:08 haze kernel: ifconfig: page allocation failure. order:3, mode:0x20
>>Jun 18 16:33:08 haze kernel:  [__alloc_pages+728/848] __alloc_pages+0x2d8/0x350
>>Jun 18 16:33:08 haze kernel:  [__get_free_pages+37/64] __get_free_pages+0x25/0x40
>>Jun 18 16:33:08 haze kernel:  [kmem_getpages+32/176] kmem_getpages+0x20/0xb0
>>Jun 18 16:33:08 haze kernel:  [cache_grow+166/512] cache_grow+0xa6/0x200
>>Jun 18 16:33:08 haze kernel:  [cache_alloc_refill+342/544] cache_alloc_refill+0x156/0x220
>>Jun 18 16:33:08 haze kernel:  [__kmalloc+116/128] __kmalloc+0x74/0x80
>>...
>>
>>I'll report more fully when I can produce something consistent.
>>
>>
>>
>>    
>>
>
>
>  
>
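
As a side note on the e1000.h hunk in the patch quoted above: besides switching to vmalloc, it shrinks struct e1000_buffer by turning length and next_to_watch into 16-bit fields. The sketch below copies the two layouts from the diff so the sizes can be compared on whatever machine compiles it; the _old/_new suffixes and the buffer_info_bytes helper are hypothetical names, and the exact sizes depend on the ABI.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct sk_buff;	/* opaque here; only a pointer to it is stored */

/* Layout before the patch. */
struct e1000_buffer_old {
	struct sk_buff *skb;
	uint64_t dma;
	unsigned long length;
	unsigned long time_stamp;
	unsigned int next_to_watch;
};

/* Layout after the patch: the two small fields become 16-bit. */
struct e1000_buffer_new {
	struct sk_buff *skb;
	uint64_t dma;
	unsigned long time_stamp;
	uint16_t next_to_watch;
	uint16_t length;
};

/* Size of the buffer_info array for a ring with `count` descriptors. */
static size_t buffer_info_bytes(size_t entry_size, size_t count)
{
	assert(count > 0);
	return entry_size * count;
}
```

Comparing buffer_info_bytes(sizeof(struct e1000_buffer_new), count) against the old layout shows the saving per ring. Note that uint16_t caps both the ring index and the buffer length at 65535.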

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out
  2004-06-18  8:04     ` Jens Laas
  2004-06-18  9:08       ` 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler David Greaves
@ 2004-06-18 18:11       ` Stephen Hemminger
  2004-06-18 18:44         ` David Greaves
  1 sibling, 1 reply; 14+ messages in thread
From: Stephen Hemminger @ 2004-06-18 18:11 UTC (permalink / raw)
  To: Jens Laas; +Cc: David Greaves, netdev

To get to the root of these problems, could you:

* Give full lspci -v output for the boards in question.

* Are you using any special queuing or shaping (output of "tc qdisc ls")?

* You could try the following, which dumps out the state of the transmit ring
  in case of error, and tries to see if it is one of the other watchdog hooks in
  this driver.

------
diff -Nru a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
--- a/drivers/net/e1000/e1000_main.c	2004-06-18 11:09:36 -07:00
+++ b/drivers/net/e1000/e1000_main.c	2004-06-18 11:09:36 -07:00
@@ -1426,6 +1426,7 @@
 			 * but we've got queued Tx work that's never going
 			 * to get done, so reset controller to flush Tx.
 			 * (Do the reset outside of interrupt context). */
+			printk("%s: link lost but ring is full\n", netdev->name);
 			schedule_work(&adapter->tx_timeout_task);
 		}
 	}
@@ -1450,8 +1451,12 @@
 	i = txdr->next_to_clean;
 	if(txdr->buffer_info[i].dma &&
 	   time_after(jiffies, txdr->buffer_info[i].time_stamp + HZ) &&
-	   !(E1000_READ_REG(&adapter->hw, STATUS) & E1000_STATUS_TXOFF))
+	   !(E1000_READ_REG(&adapter->hw, STATUS) & E1000_STATUS_TXOFF)) {
+		printk("%s: may be hung last tx was %ld ticks\n",
+		       netdev->name, 
+		       (long)(jiffies - txdr->buffer_info[i].time_stamp));
 		netif_stop_queue(netdev);
+	}
 
 	/* Reset the timer */
 	mod_timer(&adapter->watchdog_timer, jiffies + 2 * HZ);
@@ -1826,6 +1831,7 @@
 {
 	struct e1000_adapter *adapter = netdev->priv;
 
+	printk("%s: transmit timeout from queuing\n", netdev->name);
 	/* Do the reset outside of interrupt context */
 	schedule_work(&adapter->tx_timeout_task);
 }
@@ -1834,6 +1840,21 @@
 e1000_tx_timeout_task(struct net_device *netdev)
 {
 	struct e1000_adapter *adapter = netdev->priv;
+	unsigned long now = jiffies;
+	int i;
+
+	printk("%s: state=0x%lx transmit ring size=%u count=%u to_use=%u to_clean=%u\n",
+	       netdev->name, netdev->state,
+	       adapter->tx_ring.size, adapter->tx_ring.count,
+	       adapter->tx_ring.next_to_use, adapter->tx_ring.next_to_clean);
+	
+	for (i = 0; i < adapter->tx_ring.count; ++i) {
+		struct e1000_buffer *b = &adapter->tx_ring.buffer_info[i];
+		printk(" %d: skb=%p dma=%llu length=%lu time=+%ld watch=%u\n",
+		       i, b->skb, b->dma, b->length, 
+		       (long) (now - b->time_stamp), b->next_to_watch);
+	}
+	
 
 	netif_device_detach(netdev);
 	e1000_down(adapter);
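
The "may be hung" message added by the patch above hangs off the age test time_after(jiffies, time_stamp + HZ). A userspace sketch of that comparison follows, in case the wraparound behaviour is not obvious. It is not the kernel macro itself: ticks_after and tx_looks_hung are hypothetical names, and only the age part of the check is modelled (the real condition also looks at the dma field and the TXOFF status bit).

```c
#include <assert.h>
#include <limits.h>

/*
 * Wraparound-safe "a is later than b" for a free-running tick counter.
 * Relies on the usual two's-complement conversion to signed long.
 */
static int ticks_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/* Has the oldest pending buffer been waiting for more than hz ticks? */
static int tx_looks_hung(unsigned long now, unsigned long time_stamp,
			 unsigned long hz)
{
	assert(hz > 0);
	return ticks_after(now, time_stamp + hz);
}
```

Because the difference is taken in unsigned arithmetic and then interpreted as signed, the test still gives the right answer when the counter wraps between the time stamp and now.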

^ permalink raw reply	[flat|nested] 14+ messages in thread

* Re: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out
  2004-06-18 18:11       ` 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out Stephen Hemminger
@ 2004-06-18 18:44         ` David Greaves
       [not found]           ` <20040618141629.0edd9766@dell_ss3.pdx.osdl.net>
  0 siblings, 1 reply; 14+ messages in thread
From: David Greaves @ 2004-06-18 18:44 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: Jens Laas, netdev

Stephen Hemminger wrote:

>To get to the root of these problems, could you:
>
>* Give full lspci -v output for the boards in question.
>  
>
ash:
00:07.0 Ethernet controller: Intel Corp.: Unknown device 1076
        Subsystem: Intel Corp.: Unknown device 1176
        Flags: bus master, 66Mhz, medium devsel, latency 32, IRQ 11
        Memory at e3020000 (32-bit, non-prefetchable) [size=128K]
        Memory at e3000000 (32-bit, non-prefetchable) [size=128K]
        I/O ports at b400 [size=64]
        Expansion ROM at <unassigned> [disabled] [size=128K]
        Capabilities: [dc] Power Management version 2
        Capabilities: [e4] PCI-X non-bridge device.
        Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-

>* Are you using any special queuing or shaping (output of "tc qdisc ls")
>  
>
no
root@ash:~ # tc qdisc ls
RTNETLINK answers: Invalid argument
Dump terminated

>* You could try the following, which dumps out the state of the transmit ring
>  in case of error, and tries to see if it is one of the other watchdog hooks in
>  this driver.
>  
>
patched :)

Test

root@ash:/usr/src/linux # ifdown eth0 ; modprobe -r e1000;modprobe e1000; ifup eth0
ifdown: interface eth0 not configured
root@ash:/usr/src/linux # ping 10.0.1.1
PING 10.0.1.1 (10.0.1.1): 56 data bytes
64 bytes from 10.0.1.1: icmp_seq=0 ttl=64 time=0.3 ms
64 bytes from 10.0.1.1: icmp_seq=1 ttl=64 time=0.1 ms
64 bytes from 10.0.1.1: icmp_seq=2 ttl=64 time=0.1 ms
64 bytes from 10.0.1.1: icmp_seq=3 ttl=64 time=0.2 ms

--- 10.0.1.1 ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 0.1/0.1/0.3 ms
root@ash:/usr/src/linux # ping -s 1500 10.0.1.1
PING 10.0.1.1 (10.0.1.1): 1500 data bytes
1508 bytes from 10.0.1.1: icmp_seq=0 ttl=64 time=0.3 ms
1508 bytes from 10.0.1.1: icmp_seq=1 ttl=64 time=0.4 ms
1508 bytes from 10.0.1.1: icmp_seq=2 ttl=64 time=0.3 ms

--- 10.0.1.1 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.3/0.3/0.4 ms
root@ash:/usr/src/linux # ping -s 3000 10.0.1.1
PING 10.0.1.1 (10.0.1.1): 3000 data bytes
3008 bytes from 10.0.1.1: icmp_seq=0 ttl=64 time=0.4 ms
3008 bytes from 10.0.1.1: icmp_seq=3 ttl=64 time=0.3 ms

--- 10.0.1.1 ping statistics ---
7 packets transmitted, 2 packets received, 71% packet loss
round-trip min/avg/max = 0.3/0.3/0.4 ms

messages: (the 'after 5000 jiffies' is mine)
Jun 18 19:37:43 ash kernel: Copyright (c) 1999-2004 Intel Corporation.
Jun 18 19:37:44 ash kernel: e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
Jun 18 19:37:46 ash kernel: e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex

Jun 18 19:38:18 ash kernel: eth0: may be hung last tx was 2457 ticks

Jun 18 19:38:20 ash kernel: eth0: may be hung last tx was 4457 ticks
Jun 18 19:38:22 ash kernel: eth0: may be hung last tx was 6457 ticks
Jun 18 19:38:24 ash kernel: eth0: may be hung last tx was 8457 ticks
Jun 18 19:38:26 ash kernel: NETDEV WATCHDOG: eth0: transmit timed out after 5000 jiffies
Jun 18 19:38:26 ash kernel: eth0: transmit timeout from queuing
Jun 18 19:38:26 ash kernel: eth0: may be hung last tx was 10457 ticks
Jun 18 19:38:26 ash kernel: eth0: state=0x7 transmit ring size=4096 count=256 to_use=66 to_clean=59
Jun 18 19:38:26 ash kernel:  0: skb=00000000 dma=0 length=42 time=+29527 watch=0
Jun 18 19:38:26 ash kernel:  1: skb=00000000 dma=0 length=98 time=+29527 watch=1
Jun 18 19:38:26 ash kernel:  2: skb=00000000 dma=0 length=98 time=+28526 watch=2
Jun 18 19:38:26 ash kernel:  3: skb=00000000 dma=0 length=98 time=+27525 watch=3
Jun 18 19:38:26 ash kernel:  4: skb=00000000 dma=0 length=98 time=+26524 watch=4
Jun 18 19:38:26 ash kernel:  5: skb=00000000 dma=0 length=42 time=+24528 watch=5
Jun 18 19:38:26 ash kernel:  6: skb=00000000 dma=0 length=0 time=+20324251 watch=7
Jun 18 19:38:26 ash kernel:  7: skb=00000000 dma=0 length=110 time=+24510 watch=0
Jun 18 19:38:26 ash kernel:  8: skb=00000000 dma=0 length=0 time=+20324251 watch=9
Jun 18 19:38:26 ash kernel:  9: skb=00000000 dma=0 length=110 time=+24510 watch=0
Jun 18 19:38:26 ash kernel:  10: skb=00000000 dma=0 length=0 time=+20324251 watch=11
Jun 18 19:38:26 ash kernel:  11: skb=00000000 dma=0 length=110 time=+24510 watch=0
Jun 18 19:38:26 ash kernel:  12: skb=00000000 dma=0 length=0 time=+20324251 watch=13
Jun 18 19:38:26 ash kernel:  13: skb=00000000 dma=0 length=110 time=+24510 watch=0
Jun 18 19:38:26 ash kernel:  14: skb=00000000 dma=0 length=0 time=+20324251 watch=15
Jun 18 19:38:26 ash kernel:  15: skb=00000000 dma=0 length=110 time=+24510 watch=0
Jun 18 19:38:26 ash kernel:  16: skb=00000000 dma=0 length=0 time=+20324251 watch=17
Jun 18 19:38:26 ash kernel:  17: skb=00000000 dma=0 length=257 time=+24510 watch=0
Jun 18 19:38:26 ash kernel:  18: skb=00000000 dma=0 length=0 time=+20324251 watch=19
Jun 18 19:38:26 ash kernel:  19: skb=00000000 dma=0 length=110 time=+22510 watch=0
Jun 18 19:38:26 ash kernel:  20: skb=00000000 dma=0 length=0 time=+20324251 watch=21
Jun 18 19:38:26 ash kernel:  21: skb=00000000 dma=0 length=110 time=+22510 watch=0
Jun 18 19:38:26 ash kernel:  22: skb=00000000 dma=0 length=0 time=+20324251 watch=23
Jun 18 19:38:26 ash kernel:  23: skb=00000000 dma=0 length=110 time=+22510 watch=0
Jun 18 19:38:26 ash kernel:  24: skb=00000000 dma=0 length=0 time=+20324251 watch=25
Jun 18 19:38:26 ash kernel:  25: skb=00000000 dma=0 length=110 time=+22510 watch=0
Jun 18 19:38:26 ash kernel:  26: skb=00000000 dma=0 length=0 time=+20324251 watch=27
Jun 18 19:38:26 ash kernel:  27: skb=00000000 dma=0 length=110 time=+22510 watch=0
Jun 18 19:38:26 ash kernel:  28: skb=00000000 dma=0 length=0 time=+20324251 watch=29
Jun 18 19:38:26 ash kernel:  29: skb=00000000 dma=0 length=110 time=+22510 watch=0
Jun 18 19:38:26 ash kernel:  30: skb=00000000 dma=0 length=0 time=+20324251 watch=31
Jun 18 19:38:26 ash kernel:  31: skb=00000000 dma=0 length=110 time=+22510 watch=0
Jun 18 19:38:26 ash kernel:  32: skb=00000000 dma=0 length=0 time=+20324251 watch=33
Jun 18 19:38:26 ash kernel:  33: skb=00000000 dma=0 length=110 time=+22510 watch=0
Jun 18 19:38:26 ash kernel:  34: skb=00000000 dma=0 length=0 time=+20324251 watch=35
Jun 18 19:38:26 ash kernel:  35: skb=00000000 dma=0 length=110 time=+22510 watch=0
Jun 18 19:38:26 ash kernel:  36: skb=00000000 dma=0 length=0 time=+20324251 watch=37
Jun 18 19:38:26 ash kernel:  37: skb=00000000 dma=0 length=110 time=+22510 watch=0
Jun 18 19:38:26 ash kernel:  38: skb=00000000 dma=0 length=1514 time=+21082 watch=38
Jun 18 19:38:26 ash kernel:  39: skb=00000000 dma=0 length=62 time=+21082 watch=39
Jun 18 19:38:26 ash kernel:  40: skb=00000000 dma=0 length=0 time=+20324251 watch=41
Jun 18 19:38:26 ash kernel:  41: skb=00000000 dma=0 length=110 time=+20510 watch=0
Jun 18 19:38:26 ash kernel:  42: skb=00000000 dma=0 length=0 time=+20324251 watch=43
Jun 18 19:38:26 ash kernel:  43: skb=00000000 dma=0 length=110 time=+20510 watch=0
Jun 18 19:38:26 ash kernel:  44: skb=00000000 dma=0 length=0 time=+20324251 watch=45
Jun 18 19:38:26 ash kernel:  45: skb=00000000 dma=0 length=110 time=+20510 watch=0
Jun 18 19:38:26 ash kernel:  46: skb=00000000 dma=0 length=0 time=+20324251 watch=47
Jun 18 19:38:26 ash kernel:  47: skb=00000000 dma=0 length=110 time=+20510 watch=0
Jun 18 19:38:26 ash kernel:  48: skb=00000000 dma=0 length=0 time=+20324251 watch=49
Jun 18 19:38:26 ash kernel:  49: skb=00000000 dma=0 length=110 time=+20510 watch=0
Jun 18 19:38:26 ash kernel:  50: skb=00000000 dma=0 length=1514 time=+20081 watch=50
Jun 18 19:38:26 ash kernel:  51: skb=00000000 dma=0 length=62 time=+20081 watch=51
Jun 18 19:38:26 ash kernel:  52: skb=00000000 dma=0 length=1514 time=+19080 watch=52
Jun 18 19:38:26 ash kernel:  53: skb=00000000 dma=0 length=62 time=+19080 watch=53
Jun 18 19:38:26 ash kernel:  54: skb=00000000 dma=0 length=1514 time=+11459 watch=54
Jun 18 19:38:26 ash kernel:  55: skb=00000000 dma=0 length=1514 time=+11458 watch=55
Jun 18 19:38:26 ash kernel:  56: skb=00000000 dma=0 length=82 time=+11458 watch=56
Jun 18 19:38:26 ash kernel:  57: skb=00000000 dma=0 length=1514 time=+10457 watch=57
Jun 18 19:38:26 ash kernel:  58: skb=00000000 dma=0 length=1514 time=+10457 watch=58
Jun 18 19:38:26 ash kernel:  59: skb=f0740420 dma=934467074 length=82 time=+10457 watch=59
Jun 18 19:38:26 ash kernel:  60: skb=d6e91420 dma=397015042 length=1514 time=+9456 watch=60
Jun 18 19:38:26 ash kernel:  61: skb=f07406a0 dma=935571458 length=1514 time=+9456 watch=61
Jun 18 19:38:26 ash kernel:  62: skb=f3fcde20 dma=26358274 length=82 time=+9456 watch=62
Jun 18 19:38:26 ash kernel:  63: skb=f0740ba0 dma=397012994 length=1514 time=+8455 watch=63
Jun 18 19:38:26 ash kernel:  64: skb=d6e914c0 dma=935573506 length=1514 time=+8455 watch=64
Jun 18 19:38:26 ash kernel:  65: skb=f0740600 dma=937204738 length=82 time=+8455 watch=65
Jun 18 19:38:26 ash kernel:  66: skb=00000000 dma=0 length=0 time=+20324251 watch=0
<snip many duplicate lines>
Jun 18 19:38:26 ash kernel: eth0: link lost but ring is full
Jun 18 19:38:26 ash kernel: eth0: state=0x16 transmit ring size=4096 count=256 to_use=9 to_clean=2
Jun 18 19:38:26 ash kernel:  0: skb=00000000 dma=0 length=1514 time=+1 watch=0
Jun 18 19:38:26 ash kernel:  1: skb=00000000 dma=0 length=1514 time=+1 watch=1
Jun 18 19:38:26 ash kernel:  2: skb=f0740060 dma=26400258 length=82 time=+1 watch=2
Jun 18 19:38:26 ash kernel:  3: skb=f0740ec0 dma=594843650 length=1514 time=+1 watch=3
Jun 18 19:38:26 ash kernel:  4: skb=d6e91a60 dma=594841602 length=1514 time=+1 watch=4
Jun 18 19:38:26 ash kernel:  5: skb=f0740560 dma=937203714 length=82 time=+1 watch=5
Jun 18 19:38:26 ash kernel:  6: skb=d6e919c0 dma=426745858 length=1514 time=+1 watch=6
Jun 18 19:38:26 ash kernel:  7: skb=d6e91880 dma=426747906 length=1514 time=+1 watch=7
Jun 18 19:38:26 ash kernel:  8: skb=f65ca920 dma=934469122 length=82 time=+1 watch=8
Jun 18 19:38:26 ash kernel:  9: skb=00000000 dma=0 length=0 time=+20324352 watch=0
Jun 18 19:38:26 ash kernel:  10: skb=00000000 dma=0 length=0 time=+20324352 watch=0
<snip many many lines>
=0
Jun 18 19:38:26 ash kernel:  255: skb=00000000 dma=0 length=0 time=+20324352 watch
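
One way to read the to_use/to_clean numbers in the two dumps above is as the bounds of the still-pending part of the ring. The sketch below assumes next_to_use is the next free slot and next_to_clean is the oldest descriptor not yet reclaimed; ring_pending is a hypothetical helper, not a function from the driver.

```c
#include <assert.h>

/* Descriptors queued but not yet cleaned, for a ring of `count` entries. */
static unsigned int ring_pending(unsigned int next_to_use,
				 unsigned int next_to_clean,
				 unsigned int count)
{
	assert(count > 0 && next_to_use < count && next_to_clean < count);
	return (next_to_use + count - next_to_clean) % count;
}
```

Under that reading both dumps have 7 pending descriptors (59 to 65 in the first, 2 to 8 in the second), which lines up with the entries that still show a non-NULL skb and a non-zero dma value.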


David

^ permalink raw reply	[flat|nested] 14+ messages in thread

[parent not found: <20040618141629.0edd9766@dell_ss3.pdx.osdl.net>]

* Re: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out
       [not found]           ` <20040618141629.0edd9766@dell_ss3.pdx.osdl.net>
@ 2004-06-18 21:28             ` David Greaves
  0 siblings, 0 replies; 14+ messages in thread
From: David Greaves @ 2004-06-18 21:28 UTC (permalink / raw)
  To: Venkatesan, Ganesh; +Cc: Jens Laas, Glick, Kevin, netdev

OK
Thanks for the pointers and time Stephen, much appreciated :)

Ganesh and Jens - you said you'd like to keep this on-list so, Stephen,
let's ensure your reply is archived...


David


Stephen Hemminger wrote:

>It will be up to Intel (Ganesh et al) to look at this.
>
>
>On Fri, 18 Jun 2004 19:44:10 +0100
>David Greaves <david@dgreaves.com> wrote:
>
>  
>
>>Stephen Hemminger wrote:
>>
>>    
>>
>>>To get to the root of these problems, could you:
>>>
>>>* Give full lspci -v output for the boards in question.
>>> 
>>>
>>>      
>>>
>>ash:
>>00:07.0 Ethernet controller: Intel Corp.: Unknown device 1076
>>        Subsystem: Intel Corp.: Unknown device 1176
>>        Flags: bus master, 66Mhz, medium devsel, latency 32, IRQ 11
>>        Memory at e3020000 (32-bit, non-prefetchable) [size=128K]
>>        Memory at e3000000 (32-bit, non-prefetchable) [size=128K]
>>        I/O ports at b400 [size=64]
>>        Expansion ROM at <unassigned> [disabled] [size=128K]
>>        Capabilities: [dc] Power Management version 2
>>        Capabilities: [e4] PCI-X non-bridge device.
>>        Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
>>
>>    
>>
>
>  
>
>>Jun 18 19:38:18 ash kernel: eth0: may be hung last tx was 2457 ticks
>>
>>    
>>
>
>
>This means the code in the e1000 watchdog is seeing the stuck board.
>The driver then calls netif_stop_queue, which seems odd.
>
>  
>
>>Jun 18 19:38:20 ash kernel: eth0: may be hung last tx was 4457 ticks
>>Jun 18 19:38:22 ash kernel: eth0: may be hung last tx was 6457 ticks
>>Jun 18 19:38:24 ash kernel: eth0: may be hung last tx was 8457 ticks
>>Jun 18 19:38:26 ash kernel: NETDEV WATCHDOG: eth0: transmit timed out after 5000 jiffies
>>Jun 18 19:38:26 ash kernel: eth0: transmit timeout from queuing
>>Jun 18 19:38:26 ash kernel: eth0: may be hung last tx was 10457 ticks
>>Jun 18 19:38:26 ash kernel: eth0: state=0x7 transmit ring size=4096 count=256 to_use=66 to_clean=59
>>    
>>
>
>The state bits show:
>	XOFF - stopped (but that was done in e1000_watchdog)
>	START - board is running
>	PRESENT - board is present.
>That looks okay, but what was the state in the e1000 watchdog??
>
>  
>
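
The state value in those dumps is a bitmask, and the reading quoted above (XOFF, START, PRESENT for 0x7) can be reproduced with a small decoder. The bit positions below are an assumption about the 2.6-era netdev state enum (XOFF=0, START=1, PRESENT=2, SCHED=3, NOCARRIER=4) and should be checked against netdevice.h for the kernel in use; decode_state is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Assumed bit order of the link-state flags; verify against netdevice.h. */
static const char *const state_names[] = {
	"XOFF", "START", "PRESENT", "SCHED", "NOCARRIER",
};

/* Writes the names of the set bits, space-separated, into buf. */
static void decode_state(unsigned long state, char *buf, size_t len)
{
	size_t i, used;

	assert(buf != NULL && len > 0);
	buf[0] = '\0';
	for (i = 0; i < sizeof(state_names) / sizeof(state_names[0]); i++) {
		if (!(state & (1UL << i)))
			continue;
		used = strlen(buf);
		/* snprintf truncates safely if buf is too small */
		snprintf(buf + used, len - used, "%s%s",
			 used ? " " : "", state_names[i]);
	}
}
```

If that bit order is right, the later state=0x16 dump in this thread decodes as START, PRESENT and NOCARRIER, which would fit the "link lost but ring is full" message printed just before it.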

^ permalink raw reply	[flat|nested] 14+ messages in thread

* RE: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler
@ 2004-06-18 14:40 Venkatesan, Ganesh
  0 siblings, 0 replies; 14+ messages in thread
From: Venkatesan, Ganesh @ 2004-06-18 14:40 UTC (permalink / raw)
  To: David Greaves, Jens Laas; +Cc: Stephen Hemminger, netdev

Jens/David:

Did not mean to get off the list. For some reason, my subscription to
netdev is not working (even after re-subscribing). So, I grabbed your
message off of the archive.

I am trying to recreate your failure scenario in our lab. In the
meantime, please send me any new information you have on this issue.

Thanks,
ganesh 

-------------------------------------------------
Ganesh Venkatesan
Network/Storage Division, Hillsboro, OR

-----Original Message-----
From: David Greaves [mailto:david@dgreaves.com] 
Sent: Friday, June 18, 2004 5:52 AM
To: Jens Laas
Cc: Stephen Hemminger; netdev@oss.sgi.com; Venkatesan, Ganesh
Subject: Re: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+
delay scheduler

New info:
I booted into XP and the card works there - so it doesn't look like a 
simple hardware incompatibility.
[I've got no real way to test the performance but cygwin's wget against 
apache1.3 on the linux box returns about 25M/s initially and then 15M/s 
sustained for 500Mb]

Jens Laas wrote:

>>
>> I'm speaking with Ganesh Venkatesan at intel about it. Ganesh you 
>> went off list - do you want to include Jens or maybe go back on-list?
>
>
> If others run into this problem I'm sure they'll appreciate it if it's on
> list.
> Since we have no idea what causes this (AFAIK) it may be a more 
> general problem than the device driver.

I tend to agree - but I wasn't sure if this was the place and I'll do as
I'm told ;)

>> A simple failure case for me is: 'ping -s 1500 '
>> This doesn't cause the timeout but doesn't succeed either.
>>
>> ping -f with standard packet size succeeds (slow rate though) and 
>> doesn't timeout.
>
>
>
> I don't see the ping problems at all. Unless you try to ping when the
> interface has "hung"?

<sigh> thought that might be helpful.
Ping with -s and -f seems to allow me to trigger errors and it seems a 
lot more debug-able than scp or nfs :)
No, all tests are run when it's reset and 'clean'.

>> ============
>> From here on down it's 2.6.7 with Stephen's recent delay scheduler patch
>>
>> This changed the behaviour.
>
>
>
> This is strange unless you are actually using the delay scheduler?
> Default is sch_generic (that is pfifo) that does not exhibit the
> problems corrected by the patch.

I'll go back and double check in case I cocked up...
(I noticed the e1000 module rebuild but you're right that's incidental)

I've rebuilt the kernel and modules with and w/o patch and rebooted a 
few times and I can't reproduce that effect - sorry for the red herring.
So after I reverted Stephen's patch the results I reported are still
reproducible w/o the patch.

>> 10592 packets transmitted, 10591 packets received, 0% packet loss
>> round-trip min/avg/max = 5.4/5.5/83.5 ms
>>
>> Increasing Transmit Descriptors to 4096 avoids the No buffer space 
>> available with packet sizes up to -s65468 (still 100% failure though)
>
>
> Increasing nr of buffers is not a way to fix the problem.

agreed - however in my ignorance of the deep behaviour I'm reporting 
things that affect behaviour in ways I don't expect.
I expected it to take longer to run out of buffers - that didn't happen :)

(Anyway, on retesting I find that this was wrong - I suspect the 
interface was down and I didn't notice)

>
> I had hoped to hear something about this from Scott..

I'm happy to hear from anyone - I don't have *that* long until my RMA 
option expires and I don't fancy keeping them as ornaments!

David

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2004-06-21 18:34 UTC | newest]

Thread overview: 14+ messages
2004-06-14 16:47 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out David Greaves
     [not found] ` <20040615155111.26d6b809@dell_ss3.pdx.osdl.net>
2004-06-16 10:59   ` David Greaves
2004-06-18  8:04     ` Jens Laas
2004-06-18  9:08       ` 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler David Greaves
2004-06-18 10:27         ` Jens Laas
2004-06-18 12:51           ` David Greaves
2004-06-21 16:42         ` Thayne Harbaugh
2004-06-21 17:29           ` David Greaves
2004-06-21 17:43             ` ganesh.venkatesan
2004-06-21 18:34               ` David Greaves
2004-06-18 18:11       ` 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out Stephen Hemminger
2004-06-18 18:44         ` David Greaves
     [not found]           ` <20040618141629.0edd9766@dell_ss3.pdx.osdl.net>
2004-06-18 21:28             ` David Greaves
  -- strict thread matches above, loose matches on Subject: below --
2004-06-18 14:40 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler Venkatesan, Ganesh
