* 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out
@ 2004-06-14 16:47 David Greaves
[not found] ` <20040615155111.26d6b809@dell_ss3.pdx.osdl.net>
0 siblings, 1 reply; 14+ messages in thread
From: David Greaves @ 2004-06-14 16:47 UTC (permalink / raw)
To: shemminger, scott.feldman; +Cc: netdev
Hi
I have 2 machines with Intel/Pro 1000MT cards.
One machine seems to work fine (AFAIK), the other has major problems.
I've swapped the cards and the problem stays on the machine.
I'm using version 5.2.39-k2 from the stock 2.6.6 kernel on both machines.
Any sustained traffic causes repeated:
Jun 14 16:29:14 ash kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jun 14 16:29:17 ash kernel: e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex
Jun 14 16:29:17 ash kernel: nfs: server cu OK
I had a pair of Realtek r8169s that worked fine but only gave me 10Mb/s
so I exchanged them for the Intel/Pro cards in the hope of something
better - now, even with scp's rate limiter as low as 10kb/s, it
still occurs.
I have played with all the module parameters and not found anything that
affects it at 1Gbps
Even dropping to 100Mbps:
Jun 14 17:33:03 ash kernel: e1000: eth0 NIC Link is Up 100 Mbps Full Duplex
Jun 14 17:33:33 ash kernel: NETDEV WATCHDOG: eth0: transmit timed out
it can do 10Mbs:
scp reports a throughput of 1.0Mb/s (... less than thrilling)
however scp now transfers a few Mb and says:
Disconnecting: Corrupted MAC on input.
I found this mail:
http://oss.sgi.com/projects/netdev/archive/2004-06/msg00256.html
from Stephen
which appears to reverse this mail:
http://marc.theaimsgroup.com/?l=linux-kernel&m=107516205706542&w=2
from Scott
which I gather was supposed to correct this problem :)
I have seen no suggestions about other subsystems (eg ACPI etc) that
could also be tried.
David
^ permalink raw reply [flat|nested] 14+ messages in thread[parent not found: <20040615155111.26d6b809@dell_ss3.pdx.osdl.net>]
* Re: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out
[not found] ` <20040615155111.26d6b809@dell_ss3.pdx.osdl.net>
@ 2004-06-16 10:59 ` David Greaves
2004-06-18 8:04 ` Jens Laas
0 siblings, 1 reply; 14+ messages in thread
From: David Greaves @ 2004-06-16 10:59 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: netdev

Stephen Hemminger wrote:

>How big is the transmit ring. Setting a bigger transmit ring fixed my problem
> modprobe e1000 TxDescriptors=1024
>
>Also, there are lots of flavors of this chipset and board. One machine
>I was using had the IBM rebranded version and it would only do PCI33 not PCI66.

Thanks for replying Stephen - it's really frustrating :)

I did try TxDescriptors and various (most) other parameters (below are the
actual parameter variations I tried - just cut from a 'history' for info).

After each one I downed the link and ran modprobe -r on the driver.
I then ran a large file scp (quicker id+recovery than nfs hanging when the
link died)

I invariably got an eth0 timed out after a few seconds - some variation but
IIRC no more than 20% - ie 8-10Mb of a 1G file before it failed.

root@ash:~ # ethtool -g eth0
Ring parameters for eth0:
Pre-set maximums:
RX:             4096
RX Mini:        0
RX Jumbo:       0
TX:             4096
Current hardware settings:
RX:             256
RX Mini:        0
RX Jumbo:       0
TX:             1024

I've pulled all the cards and looked - they are all genuine Intel C39226-003
(Pro/1000 MT)
This page http://support.intel.com/support/network/sb/cs-005980-prd38.htm
says: 82541 Gigabit Small Form 32/66
My system has PCI33 BTW.

I have also tried 2.6.7 this morning and have the same problem.

David

module parameters.
modprobe e1000 FlowControl=2
modprobe e1000 FlowControl=1
modprobe e1000 FlowControl=3
modprobe e1000 FlowControl=0
modprobe e1000 FlowControl=0 InterruptThrottleRate=100
modprobe e1000 FlowControl=0 InterruptThrottleRate=1 RxDescriptors=1024 ; ifup eth0
modprobe e1000 FlowControl=0 InterruptThrottleRate=1 RxDescriptors=4096 ; ifup eth0
modprobe e1000 FlowControl=0 InterruptThrottleRate=1 RxDescriptors=256 RxIntDelay=1 ; ifup eth0
modprobe e1000 FlowControl=0 InterruptThrottleRate=1 RxDescriptors=256 RxIntDelay=10 ; ifup eth0
modprobe e1000 FlowControl=0 InterruptThrottleRate=1 RxDescriptors=256 RxIntDelay=1000 ; ifup eth0
modprobe e1000 FlowControl=0 InterruptThrottleRate=1 RxDescriptors=256 RxIntDelay=0 RxAbsIntDelay=0 ; ifup eth0
modprobe e1000 FlowControl=0 InterruptThrottleRate=1 RxDescriptors=256 RxIntDelay=0 RxAbsIntDelay=0 TxIntDelay=0 ; ifup eth0
modprobe e1000 FlowControl=0 InterruptThrottleRate=1 RxDescriptors=256 RxIntDelay=0 RxAbsIntDelay=1024 TxIntDelay=53 ; ifup eth0
modprobe e1000 FlowControl=0 InterruptThrottleRate=1 RxDescriptors=256 RxIntDelay=0 RxAbsIntDelay=1024 TxIntDelay=64 ; ifup eth0
modprobe e1000 FlowControl=0 InterruptThrottleRate=1 RxDescriptors=256 RxIntDelay=0 RxAbsIntDelay=65535 TxIntDelay=64 ; ifup eth0
modprobe e1000 FlowControl=0 InterruptThrottleRate=1 RxDescriptors=256 RxIntDelay=0 RxAbsIntDelay=128 TxIntDelay=0 ; ifup eth0
modprobe e1000 FlowControl=0 InterruptThrottleRate=1 RxDescriptors=256 RxIntDelay=0 RxAbsIntDelay=128 TxIntDelay=32000 ; ifup eth0
modprobe e1000 FlowControl=0 InterruptThrottleRate=1 RxDescriptors=256 RxIntDelay=0 RxAbsIntDelay=128 TxIntDelay=32000 ; ifup eth0
modprobe e1000 FlowControl=0 InterruptThrottleRate=1 RxDescriptors=256 RxIntDelay=0 RxAbsIntDelay=128 TxIntDelay=64 XsumRX=1 ; ifup eth0
modprobe e1000 Speed=100
modprobe e1000 Speed=10
* Re: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out
2004-06-16 10:59 ` David Greaves
@ 2004-06-18 8:04 ` Jens Laas
2004-06-18 9:08 ` 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler David Greaves
2004-06-18 18:11 ` 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out Stephen Hemminger
0 siblings, 2 replies; 14+ messages in thread
From: Jens Laas @ 2004-06-18 8:04 UTC (permalink / raw)
To: David Greaves; +Cc: Stephen Hemminger, netdev

[-- Attachment #1: Type: TEXT/PLAIN, Size: 2809 bytes --]

(04.06.16 at 11:59) David Greaves wrote the following to Stephen Hemminger:

We have seen the same symptoms. (2.6.x + e1000)

Our system is an SMP system. That might be what's triggering the problem.
Is your system UP or SMP?
(Next reboot we will test running on only one CPU).

We have tried with and without NAPI, both exhibit the same problem.
We have tried different versions of e1000 without luck.
We have tried with 100Mb and gigabit switches.

Make sure that flow control is disabled on your switch (if it has it
implemented).

> Stephen Hemminger wrote:
>
>> How big is the transmit ring. Setting a bigger transmit ring fixed my
>> problem
>> modprobe e1000 TxDescriptors=1024

I wouldn't call that a fix, more like a workaround. It should work regardless
of ring size.

>>
>> Also, there are lots of flavors of this chipset and board. One machine
>> I was using had the IBM rebranded version and it would only do PCI33 not
>> PCI66.
>>
>>
> Thanks for replying Stephen - it's really frustrating :)
>
> I did try TxDescriptors and various (most) other parameters (below are the
> actual parameter variations I tried - just cut from a 'history' for info).
>
> After each one i downed the link and modprobe -r the driver.
> I then ran a large file scp (quicker id+recovery than nfs hanging when the
> link died)
>
> I invariably got an eth0 timed out after a few seconds - some variation but
> IIRC no more than 20% - ie 8-10Mb of a 1G file before it failed.
>
> root@ash:~ # ethtool -g eth0
> Ring parameters for eth0:
> Pre-set maximums:
> RX:             4096
> RX Mini:        0
> RX Jumbo:       0
> TX:             4096
> Current hardware settings:
> RX:             256
> RX Mini:        0
> RX Jumbo:       0
> TX:             1024
>
> I've pulled all the cards and looked - they are all genuine Intel C39226-003
> (Pro/1000 MT)
> This page http://support.intel.com/support/network/sb/cs-005980-prd38.htm
> says: 82541 Gigabit Small Form 32/66
> My system has PCI33 BTW.
>
> I have also tried 2.6.7 this morning and have the same problem.
>
> David
>
>
> module parameters.

I believe the following is recommended by the driver developers:
TxDescriptors=256 RxDescriptors=256 FlowControl=0 XsumRX=0

Cheers,
Jens Låås

-----------------------------------------------------------------------
    'This mail automatically becomes portable when carried.'
-----------------------------------------------------------------------
    Jens Låås                              Email: jens.laas@data.slu.se
    Department of Computer Services, SLU   Phone: +46 18 67 35 15
    Vindbrovägen 1
    P.O. Box 7079
    S-750 07 Uppsala
    SWEDEN
-----------------------------------------------------------------------
* Re: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler
2004-06-18 8:04 ` Jens Laas
@ 2004-06-18 9:08 ` David Greaves
2004-06-18 10:27 ` Jens Laas
2004-06-21 16:42 ` Thayne Harbaugh
2004-06-18 18:11 ` 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out Stephen Hemminger
1 sibling, 2 replies; 14+ messages in thread
From: David Greaves @ 2004-06-18 9:08 UTC (permalink / raw)
To: Jens Laas; +Cc: Stephen Hemminger, netdev, ganesh.venkatesan

Stephen, I applied your delay scheduler patch and some results appear below.

Jens Laas wrote:

> (04.06.16 kl.11:59) David Greaves skrev följande till Stephen Hemminger:
>
> We have seen the same symptoms. (2.6.x + e1000)
>
> Our system is an SMP system. That might be whats triggering the problem.
> Is your system UP or SMP ?

UP

> (Next reboot we will test running on only one CPU).
>
> We have tried with and without NAPI, both exhibit the same problem.

Me too

> We have tried different versions of e1000 without luck.

Me too, 3 cards.
(did I mention I have 2 machines with very similar specs (AMD/VIAKT600)
and the other one works - actually, to be accurate, hasn't yet failed
but hasn't yet run at full speed - and it has a higher CPU speed)

> We have tried with 100Mb and gigabit switches.

I'm now running two e1000's back to back over a piece of cat5...

>
> Make sure that flowcontrol is disabled on your switch (if it has it
> implemented).

...so it's not that smart anymore ;)

>>
>> module parameters.
>
>
> I believe following is recommended by driver developers:
> TxDescriptors=256 RxDescriptors=256 FlowControl=0 XsumRX=0

Yes, I'm running with module defaults unless otherwise stated but I've tried
that combo (to no effect)

I'm speaking with Ganesh Venkatesan at Intel about it. Ganesh, you went off
list - do you want to include Jens or maybe go back on-list?

A simple failure case for me is : 'ping -s 1500 '
This doesn't cause the timeout but doesn't succeed either.

ping -f with standard packet size succeeds (slow rate though) and doesn't
time out.

Using 8139 100Mbs card:
272384 packets transmitted, 272383 packets received, 0% packet loss
round-trip min/avg/max = 0.1/0.1/4.0 ms
real 0m32.179s

Using Pro/1000:
60992 packets transmitted, 60991 packets received, 0% packet loss
round-trip min/avg/max = 0.0/0.5/8.4 ms
real 0m38.257s

any ping with -s >1500 results in 100% packet loss.

============
From here on down it's 2.6.7 with Stephen's recent delay scheduler patch

This changed the behaviour.
Now ping -s 1500 works but after that it gets lossy

root@ash:~ # ping -s3000 10.0.1.1
PING 10.0.1.1 (10.0.1.1): 3000 data bytes
3008 bytes from 10.0.1.1: icmp_seq=1 ttl=64 time=0.5 ms
3008 bytes from 10.0.1.1: icmp_seq=11 ttl=64 time=0.5 ms
3008 bytes from 10.0.1.1: icmp_seq=12 ttl=64 time=0.4 ms
3008 bytes from 10.0.1.1: icmp_seq=13 ttl=64 time=0.9 ms
3008 bytes from 10.0.1.1: icmp_seq=15 ttl=64 time=0.4 ms
3008 bytes from 10.0.1.1: icmp_seq=16 ttl=64 time=0.3 ms

and now I'm seeing ping generate:
Jun 18 09:41:57 ash kernel: NETDEV WATCHDOG: eth0: transmit timed out
Jun 18 09:41:59 ash kernel: e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex

ping -f now works for packet sizes up to -s 2952 (2 packets at mtu 1500)
ping -f -s 2953 results in:
PING 10.0.1.1 (10.0.1.1): 2953 data bytes
..............................ping: sendto: No buffer space available
ping: wrote 10.0.1.1 2961 chars, ret=-1
.ping: sendto: No buffer space available

nb. with the patch, between the same machines via an alternate pair of nics:
root@ash:~ # ping -f -s29550 haze
PING haze.dgreaves.com (10.0.0.88): 29550 data bytes
.
--- haze.dgreaves.com ping statistics ---
10592 packets transmitted, 10591 packets received, 0% packet loss
round-trip min/avg/max = 5.4/5.5/83.5 ms

Increasing Transmit Descriptors to 4096 avoids the No buffer space available
with packet sizes up to -s65468 (still 100% failure though)

I'm not sure that adds much now so I'll leave it until I get some more
suggestions.

HTH

David
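
The -s 2952 boundary David reports is what plain IPv4 fragmentation arithmetic predicts for an MTU of 1500. The sketch below is only an illustration of that arithmetic. It assumes a 20-byte IPv4 header without options and the 8-byte ICMP echo header (neither figure is stated in the thread), and `fragments_needed` is a made-up helper name.

```python
import math

IP_HEADER = 20    # assumption: IPv4 header without options
ICMP_HEADER = 8   # ICMP echo request header


def fragments_needed(ping_size: int, mtu: int = 1500) -> int:
    """IP fragments needed to carry `ping -s ping_size` at the given MTU."""
    # Every fragment except the last carries a multiple of 8 payload bytes.
    per_fragment = (mtu - IP_HEADER) // 8 * 8
    return math.ceil((ping_size + ICMP_HEADER) / per_fragment)


for size in (1472, 1500, 2952, 2953, 3000):
    print(size, fragments_needed(size))
```

Under these assumptions -s 2952 is the largest payload that still fits in two fragments and -s 2953 is the first that needs three. The "3008 bytes" and "2961 chars" figures in the ping output are the payload size plus the 8-byte ICMP header.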
* Re: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler
2004-06-18 9:08 ` 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler David Greaves
@ 2004-06-18 10:27 ` Jens Laas
2004-06-18 12:51 ` David Greaves
2004-06-21 16:42 ` Thayne Harbaugh
1 sibling, 1 reply; 14+ messages in thread
From: Jens Laas @ 2004-06-18 10:27 UTC (permalink / raw)
To: David Greaves; +Cc: Jens Laas, Stephen Hemminger, netdev, ganesh.venkatesan

[-- Attachment #1: Type: TEXT/PLAIN, Size: 3528 bytes --]

(04.06.18 at 10:08) David Greaves wrote the following to Jens Laas:

> Stephen, I applied your delay scheduler patch and some results appear below.
>
> Jens Laas wrote:
>
>> (04.06.16 at 11:59) David Greaves wrote the following to Stephen Hemminger:
>>
>> We have seen the same symptoms. (2.6.x + e1000)
>>
>> Our system is an SMP system. That might be whats triggering the problem.
>> Is your system UP or SMP ?
>
> UP

Ok. This keeps getting stranger..

>
>> (Next reboot we will test running on only one CPU).
>>
>> We have tried with and without NAPI, both exhibit the same problem.
>
> Me too
>
>> We have tried different versions of e1000 without luck.
> ...
>> Make sure that flowcontrol is disabled on your switch (if it has it
>> implemented).
>
> ...so it's not that smart anymore ;)
>
>>>
>>> module parameters.
>>
>>
>> I believe following is recommended by driver developers:
>> TxDescriptors=256 RxDescriptors=256 FlowControl=0 XsumRX=0
>
> Yes, I'm running with module defaults unless otherwise stated but I've tried
> that combo (to no effect)

No effect here either. FlowControl and XsumRX are known troublemakers.

>
> I'm speaking with Ganesh Venkatesan at intel about it. Ganesh you went off
> list - do you want to include Jens or maybe go back on-list?

If others run into this problem I'm sure they'll appreciate it if it's on
list.
Since we have no idea what causes this (AFAIK) it may be a more general
problem than the device driver.

>
> A simple failure case for me is : 'ping -s 1500 '
> This doesn't cause the timout but doesn't succeed either.
>
> ping -f with standard packet size succeeds (slow rate though) and doesn't
> timeout.

I don't see the ping problems at all. Unless you try to ping when the
interface has "hanged"?

>
> Using 8139 100Mbs card:
> 272384 packets transmitted, 272383 packets received, 0% packet loss
> round-trip min/avg/max = 0.1/0.1/4.0 ms
> real 0m32.179s
>
> Using Pro/1000:
> 60992 packets transmitted, 60991 packets received, 0% packet loss
> round-trip min/avg/max = 0.0/0.5/8.4 ms
> real 0m38.257s
>
> any ping with -s >1500 results in 100% packet loss.
>
> ============
> From hereon down it's 2.6.7 with Stephen's recent delay scheduler patch
>
> This changed the behaviour.

This is strange unless you are actually using the delay scheduler?
The default is sch_generic (that is, pfifo), which does not exhibit the
problems corrected by the patch.

> 10592 packets transmitted, 10591 packets received, 0% packet loss
> round-trip min/avg/max = 5.4/5.5/83.5 ms
>
> Increasing Transmit Descriptors to 4096 avoids the No buffer space available
> with packet sizes up to -s65468 (still 100% failure though)

Increasing the number of buffers is not a way to fix the problem.

I had hoped to hear something about this from Scott..

Cheers,
Jens

>
> I'm not sure that adds much now so I'll leave it until I get some more
> suggestions.
>
> HTH
>
> David
>
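
The flood-ping figures quoted above can be turned into packet rates to put a number on "slow rate though". This is a rough sketch only. It assumes the `real` values are the wall-clock duration of each whole `ping -f` run as reported by the shell's `time` (the thread does not say so explicitly), and `packets_per_second` is a made-up helper.

```python
def packets_per_second(packets_sent: int, wall_seconds: float) -> float:
    """Average transmit rate over a whole flood-ping run."""
    return packets_sent / wall_seconds


# Figures copied from the two runs quoted in the message.
rtl8139 = packets_per_second(272384, 32.179)
pro1000 = packets_per_second(60992, 38.257)

print(f"8139:     {rtl8139:.0f} packets/s")
print(f"Pro/1000: {pro1000:.0f} packets/s")
print(f"ratio:    {rtl8139 / pro1000:.1f}x")
```

On those assumptions the 100Mbps card sustained more than five times the packet rate of the gigabit card, even though neither run lost packets.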
* Re: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler
2004-06-18 10:27 ` Jens Laas
@ 2004-06-18 12:51 ` David Greaves
0 siblings, 0 replies; 14+ messages in thread
From: David Greaves @ 2004-06-18 12:51 UTC (permalink / raw)
To: Jens Laas; +Cc: Stephen Hemminger, netdev, ganesh.venkatesan

New info:
I booted into XP and the card works there - so it doesn't look like a simple
hardware incompatibility.
[I've got no real way to test the performance but cygwin's wget against
apache1.3 on the linux box returns about 25M/s initially and then 15M/s
sustained for 500Mb]

Jens Laas wrote:

>>
>> I'm speaking with Ganesh Venkatesan at intel about it. Ganesh you
>> went off list - do you want to include Jens or maybe go back on-list?
>
>
> If others run into this problem I'm sure they'll appreciate if its on
> list.
> Since we have no idea what causes this (AFAIK) it may be a more
> general problem than the device driver.

I tend to agree - but I wasn't sure if this was the place and I'll do as I'm
told ;)

>> A simple failure case for me is : 'ping -s 1500 '
>> This doesn't cause the timout but doesn't succeed either.
>>
>> ping -f with standard packet size succeeds (slow rate though) and
>> doesn't timeout.
>
>
>
> I dont see the ping problems at all. Unless you try to ping when the
> interface has "hanged" ?

<sigh> thought that might be helpful.
Ping with -s and -f seems to allow me to trigger errors and it seems a lot
more debug-able than scp or nfs :)

No, all tests are run when it's reset and 'clean'

>> ============
>> From hereon down it's 2.6.7 with Stephen's recent delay scheduler patch
>>
>> This changed the behaviour.
>
>
>
> This is strange unless you are actually using the delay scheduler ?
> Default is sch_generic (that is pfifo) that does not exhibit the
> problems correct by the patch.

I'll go back and double check in case I cocked up...
(I noticed the e1000 module rebuild but you're right that's incidental)

I've rebuilt the kernel and modules with and w/o patch and rebooted a few
times and I can't reproduce that effect - sorry for the red herring.

So after I reverted Stephen's patch the results I reported are still
reproducible w/o the patch.

>> 10592 packets transmitted, 10591 packets received, 0% packet loss
>> round-trip min/avg/max = 5.4/5.5/83.5 ms
>>
>> Increasing Transmit Descriptors to 4096 avoids the No buffer space
>> available with packet sizes up to -s65468 (still 100% failure though)
>
>
> Increasing nr of buffers is not a way to fix the problem.

agreed - however in my ignorance of the deep behaviour I'm reporting things
that affect behaviour in ways I don't expect.
I expected it to take longer to run out of buffers - that didn't happen :)
(Anyway, on retesting I find that this was wrong - I suspect the interface
was down and I didn't notice)

>
> I had hoped to hear something about this from Scott..

I'm happy to hear from anyone - I don't have *that* long until my RMA option
expires and I don't fancy keeping them as ornaments!

David
* Re: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler
2004-06-18 9:08 ` 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler David Greaves
2004-06-18 10:27 ` Jens Laas
@ 2004-06-21 16:42 ` Thayne Harbaugh
2004-06-21 17:29 ` David Greaves
1 sibling, 1 reply; 14+ messages in thread
From: Thayne Harbaugh @ 2004-06-21 16:42 UTC (permalink / raw)
To: David Greaves; +Cc: Jens Laas, Stephen Hemminger, netdev, ganesh.venkatesan

On Fri, 2004-06-18 at 03:08, David Greaves wrote:
> Jens Laas wrote:
> > We have tried different versions of e1000 without luck.
>
> Me too, 3 cards.
> (did I mention I have 2 machines with very similar specs (AMD/VIAKT600)
> and the other one works - actually, to be accurate, hasn't yet failed
> but hasn't yet run at full speed - and it has a higher CPU speed)

What do you mean by, ". . . hasn't yet run at full speed - and it has a
higher CPU speed . . ." ? Does this mean that you can't get the card to
have a reasonable throughput (~900Mbps)?

--
Thayne Harbaugh
Linux Networx
* Re: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler
2004-06-21 16:42 ` Thayne Harbaugh
@ 2004-06-21 17:29 ` David Greaves
2004-06-21 17:43 ` ganesh.venkatesan
0 siblings, 1 reply; 14+ messages in thread
From: David Greaves @ 2004-06-21 17:29 UTC (permalink / raw)
To: tharbaugh; +Cc: Jens Laas, Stephen Hemminger, netdev, ganesh.venkatesan

Thayne Harbaugh wrote:

>On Fri, 2004-06-18 at 03:08, David Greaves wrote:
>
>>Jens Laas wrote:
>>
>>>We have tried different versions of e1000 without luck.
>>>
>>Me too, 3 cards.
>>(did I mention I have 2 machines with very similar specs (AMD/VIAKT600)
>>and the other one works - actually, to be accurate, hasn't yet failed
>>but hasn't yet run at full speed - and it has a higher CPU speed)
>>
>
>What do you mean by, ". . . hasn't yet run at full speed - and it has a
>higher CPU speed . . ." ? Does this mean that you can't get the card to
>have a reasonable throughput (~900Mbps)?
>

It sounded reasonable when I wrote it :)

I have 2 machines I can easily test with (wired back to back)
Machine 1 has an AMD3000+ CPU, machine 2 has an AMD3200+ cpu (maybe not
relevant - maybe important if it's timing related?)

Machine one stalls within a few kb.
Machine two has shown no signs of failure yet.

However the other machine has not been stressed at all so it has 'not
yet run at full speed' - not surprising since it has no friends with
working gigabit cards :)

David
PS
I tried some experiments this weekend with a third machine but I got
nasty kernel oopses on the second (supposedly good) whenever I did
ifconfig eth1 mtu 9000 and I've not had time to get any proper results
or a minimal failure yet.

simply issuing
ifconfig eth1 mtu 9000
on the second machine gave me this:

Jun 18 16:33:08 haze kernel: printk: 1 messages suppressed.
Jun 18 16:33:08 haze kernel: ifconfig: page allocation failure. order:3, mode:0x20
Jun 18 16:33:08 haze kernel: [__alloc_pages+728/848] __alloc_pages+0x2d8/0x350
Jun 18 16:33:08 haze kernel: [__get_free_pages+37/64] __get_free_pages+0x25/0x40
Jun 18 16:33:08 haze kernel: [kmem_getpages+32/176] kmem_getpages+0x20/0xb0
Jun 18 16:33:08 haze kernel: [cache_grow+166/512] cache_grow+0xa6/0x200
Jun 18 16:33:08 haze kernel: [cache_alloc_refill+342/544] cache_alloc_refill+0x156/0x220
Jun 18 16:33:08 haze kernel: [__kmalloc+116/128] __kmalloc+0x74/0x80
...

I'll report more fully when I can produce something consistent.
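
The "order:3" in the log above means the page allocator was asked for 2^3 physically contiguous pages. The sketch below only illustrates that arithmetic. It assumes 4 KiB pages (the usual i386 size, not stated in the thread), and `alloc_order` and `order_bytes` are hypothetical helpers, not kernel functions.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages


def alloc_order(nbytes: int, page_size: int = PAGE_SIZE) -> int:
    """Smallest order such that 2**order contiguous pages hold nbytes."""
    pages = -(-nbytes // page_size)  # ceiling division
    order = 0
    while (1 << order) < pages:
        order += 1
    return order


def order_bytes(order: int, page_size: int = PAGE_SIZE) -> int:
    """Size in bytes of one allocation of the given order."""
    return page_size << order


for nbytes in (1500, 9000, 16384, 16385, 32768):
    order = alloc_order(nbytes)
    print(nbytes, "bytes -> order", order, "=", order_bytes(order), "bytes")
```

With 4 KiB pages an order-3 request is 32 KiB of contiguous memory, so the failing allocation was for something larger than 16 KiB. Requests of that size are more likely to fail on a fragmented system than single-page ones.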
* Re: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler
2004-06-21 17:29 ` David Greaves
@ 2004-06-21 17:43 ` ganesh.venkatesan
2004-06-21 18:34 ` David Greaves
0 siblings, 1 reply; 14+ messages in thread
From: ganesh.venkatesan @ 2004-06-21 17:43 UTC (permalink / raw)
To: David Greaves
Cc: tharbaugh, Jens Laas, Stephen Hemminger, netdev, Venkatesan, Ganesh

David:

Could you try the following patch to work around the memory allocation
issue you are reporting?

---------------------
--- e1000_main.c	2004-06-21 10:37:29.496090824 -0700
+++ e1000_main.c-patched	2004-06-21 10:37:06.920522832 -0700
@@ -796,7 +796,7 @@ e1000_setup_tx_resources(struct e1000_ad
 	int size;
 
 	size = sizeof(struct e1000_buffer) * txdr->count;
-	txdr->buffer_info = kmalloc(size, GFP_KERNEL);
+	txdr->buffer_info = vmalloc(size);
 	if(!txdr->buffer_info) {
 		return -ENOMEM;
 	}
@@ -809,7 +809,7 @@ e1000_setup_tx_resources(struct e1000_ad
 
 	txdr->desc = pci_alloc_consistent(pdev, txdr->size, &txdr->dma);
 	if(!txdr->desc) {
-		kfree(txdr->buffer_info);
+		vfree(txdr->buffer_info);
 		return -ENOMEM;
 	}
 	memset(txdr->desc, 0, txdr->size);
@@ -913,7 +913,7 @@ e1000_setup_rx_resources(struct e1000_ad
 	int size;
 
 	size = sizeof(struct e1000_buffer) * rxdr->count;
-	rxdr->buffer_info = kmalloc(size, GFP_KERNEL);
+	rxdr->buffer_info = vmalloc(size);
 	if(!rxdr->buffer_info) {
 		return -ENOMEM;
 	}
@@ -927,7 +927,7 @@ e1000_setup_rx_resources(struct e1000_ad
 	rxdr->desc = pci_alloc_consistent(pdev, rxdr->size, &rxdr->dma);
 
 	if(!rxdr->desc) {
-		kfree(rxdr->buffer_info);
+		vfree(rxdr->buffer_info);
 		return -ENOMEM;
 	}
 	memset(rxdr->desc, 0, rxdr->size);
@@ -1051,7 +1051,7 @@ e1000_free_tx_resources(struct e1000_ada
 
 	e1000_clean_tx_ring(adapter);
 
-	kfree(adapter->tx_ring.buffer_info);
+	vfree(adapter->tx_ring.buffer_info);
 	adapter->tx_ring.buffer_info = NULL;
 
 	pci_free_consistent(pdev, adapter->tx_ring.size,
@@ -1120,7 +1120,7 @@ e1000_free_rx_resources(struct e1000_ada
 
 	e1000_clean_rx_ring(adapter);
 
-	kfree(rx_ring->buffer_info);
+	vfree(rx_ring->buffer_info);
 	rx_ring->buffer_info = NULL;
 
 	pci_free_consistent(pdev, rx_ring->size, rx_ring->desc, rx_ring->dma);
--- e1000.h	2004-06-21 10:37:29.523086720 -0700
+++ e1000.h-patched	2004-06-21 10:37:15.506217608 -0700
@@ -49,6 +49,7 @@
 #include <linux/delay.h>
 #include <linux/timer.h>
 #include <linux/slab.h>
+#include <linux/vmalloc.h>
 #include <linux/interrupt.h>
 #include <linux/string.h>
 #include <linux/pagemap.h>
@@ -159,9 +160,9 @@ struct e1000_adapter;
 struct e1000_buffer {
 	struct sk_buff *skb;
 	uint64_t dma;
-	unsigned long length;
 	unsigned long time_stamp;
-	unsigned int next_to_watch;
+	uint16_t next_to_watch;
+	uint16_t length;
 };
 
 struct e1000_desc_ring {
----------------------

ganesh.
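
The patch above does two things to the per-descriptor bookkeeping array: it shrinks struct e1000_buffer, and it allocates the array with vmalloc, which does not need physically contiguous pages. To get a feel for the sizes involved, the sketch below models the old and new field lists from the diff with Python's ctypes. It is only an approximation: the class names are made up, and the sizes it prints reflect the machine running the script, which may differ from the kernel David was running.

```python
import ctypes


class OldE1000Buffer(ctypes.Structure):
    """Field list of struct e1000_buffer before the patch (from the diff)."""
    _fields_ = [
        ("skb", ctypes.c_void_p),          # struct sk_buff *
        ("dma", ctypes.c_uint64),
        ("length", ctypes.c_ulong),
        ("time_stamp", ctypes.c_ulong),
        ("next_to_watch", ctypes.c_uint),
    ]


class NewE1000Buffer(ctypes.Structure):
    """Field list of struct e1000_buffer after the patch (from the diff)."""
    _fields_ = [
        ("skb", ctypes.c_void_p),          # struct sk_buff *
        ("dma", ctypes.c_uint64),
        ("time_stamp", ctypes.c_ulong),
        ("next_to_watch", ctypes.c_uint16),
        ("length", ctypes.c_uint16),
    ]


def buffer_info_bytes(struct_type, descriptors: int) -> int:
    """Mirrors `size = sizeof(struct e1000_buffer) * count` from the driver."""
    return ctypes.sizeof(struct_type) * descriptors


# 256 is the current RX ring size and 4096 the maximum, per the ethtool output.
for count in (256, 1024, 4096):
    print(count,
          buffer_info_bytes(OldE1000Buffer, count),
          buffer_info_bytes(NewE1000Buffer, count))
```

Whatever the exact numbers on a given platform, the reordered struct is smaller per descriptor, and the array for a 4096-entry ring spans many pages, which is the case where vmalloc is less demanding than kmalloc.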
* Re: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler
2004-06-21 17:43 ` ganesh.venkatesan
@ 2004-06-21 18:34 ` David Greaves
0 siblings, 0 replies; 14+ messages in thread
From: David Greaves @ 2004-06-21 18:34 UTC (permalink / raw)
To: ganesh.venkatesan; +Cc: tharbaugh, Jens Laas, Stephen Hemminger, netdev

OK
applied patch
ifdown eth1; modprobe -r e1000;modprobe e1000;ifup eth1; ifconfig eth1 mtu 9000
(so no reboot)

dmesg:
e1000: Ignoring new-style parameters in presence of obsolete ones
Intel(R) PRO/1000 Network Driver - version 5.2.52-k4
Copyright (c) 1999-2004 Intel Corporation.
e1000: eth1: e1000_probe: Intel(R) PRO/1000 Network Connection
e1000: eth1: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
ifconfig: page allocation failure. order:3, mode:0x20
 [<c01310a8>] __alloc_pages+0x2d8/0x350
 [<c0131145>] __get_free_pages+0x25/0x40
 [<c0134620>] kmem_getpages+0x20/0xb0
 [<c0135186>] cache_grow+0xa6/0x200
 [<c0135436>] cache_alloc_refill+0x156/0x220
 [<c01359f4>] __kmalloc+0x74/0x80
 [<c02a3427>] alloc_skb+0x47/0xe0
 [<f89e45a2>] e1000_alloc_rx_buffers+0x62/0x100 [e1000]
 [<f89e1045>] e1000_up+0x45/0xb0 [e1000]
 [<f89e363c>] e1000_change_mtu+0x7c/0x110 [e1000]
 [<c02a8ea9>] dev_set_mtu+0x79/0x90
 [<c02a94a5>] dev_ioctl+0x1f5/0x280
 [<c02e271e>] inet_ioctl+0x8e/0xa0
 [<c02a0039>] sock_ioctl+0xe9/0x290
 [<c015c50f>] sys_ioctl+0xef/0x260
 [<c0110570>] do_page_fault+0x0/0x4da
 [<c0103fb7>] syscall_call+0x7/0xb

ifconfig: page allocation failure. order:3, mode:0x20
 [<c01310a8>] __alloc_pages+0x2d8/0x350
 [<c0131145>] __get_free_pages+0x25/0x40
 [<c0134620>] kmem_getpages+0x20/0xb0
 [<c0135186>] cache_grow+0xa6/0x200
 [<c0135436>] cache_alloc_refill+0x156/0x220
 [<c0111a1a>] wake_up_state+0x1a/0x20
 [<c01359f4>] __kmalloc+0x74/0x80
 [<c02a3427>] alloc_skb+0x47/0xe0
 [<f89e45a2>] e1000_alloc_rx_buffers+0x62/0x100 [e1000]
 [<f89e41e7>] e1000_clean_rx_irq+0xf7/0x450 [e1000]
 [<c011175f>] recalc_task_prio+0x8f/0x190
 [<f89e3e73>] e1000_clean+0x43/0xc0 [e1000]
 [<c02a861a>] net_rx_action+0x6a/0xf0
 [<c01190bd>] __do_softirq+0x7d/0x80
 [<c01190e6>] do_softirq+0x26/0x30
 [<c0105ded>] do_IRQ+0xfd/0x130
 [<c0104124>] common_interrupt+0x18/0x20
 [<f89e3d37>] e1000_irq_enable+0x27/0x30 [e1000]
 [<f89e109d>] e1000_up+0x9d/0xb0 [e1000]
 [<f89e363c>] e1000_change_mtu+0x7c/0x110 [e1000]
 [<c02a8ea9>] dev_set_mtu+0x79/0x90
 [<c02a94a5>] dev_ioctl+0x1f5/0x280
 [<c02e271e>] inet_ioctl+0x8e/0xa0
 [<c02a0039>] sock_ioctl+0xe9/0x290
 [<c015c50f>] sys_ioctl+0xef/0x260
 [<c0110570>] do_page_fault+0x0/0x4da
 [<c0103fb7>] syscall_call+0x7/0xb

kdeinit: page allocation failure. order:3, mode:0x20
 [<c01310a8>] __alloc_pages+0x2d8/0x350
 [<c0131145>] __get_free_pages+0x25/0x40
 [<c0134620>] kmem_getpages+0x20/0xb0
 [<c0135186>] cache_grow+0xa6/0x200
 [<c0135436>] cache_alloc_refill+0x156/0x220
 [<c01359f4>] __kmalloc+0x74/0x80
 [<c02a3427>] alloc_skb+0x47/0xe0
 [<f89e45a2>] e1000_alloc_rx_buffers+0x62/0x100 [e1000]
 [<f89e41e7>] e1000_clean_rx_irq+0xf7/0x450 [e1000]
 [<f89e3e73>] e1000_clean+0x43/0xc0 [e1000]
 [<c02a861a>] net_rx_action+0x6a/0xf0
 [<c01190bd>] __do_softirq+0x7d/0x80
 [<c01190e6>] do_softirq+0x26/0x30
 [<c0105ded>] do_IRQ+0xfd/0x130
 [<c0104124>] common_interrupt+0x18/0x20
...

David
* Re: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out
2004-06-18 8:04 ` Jens Laas
2004-06-18 9:08 ` 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler David Greaves
@ 2004-06-18 18:11 ` Stephen Hemminger
2004-06-18 18:44 ` David Greaves
1 sibling, 1 reply; 14+ messages in thread
From: Stephen Hemminger @ 2004-06-18 18:11 UTC (permalink / raw)
To: Jens Laas; +Cc: David Greaves, netdev

To get to the root of these problems, could you:

* Give full lspci -v output for the boards in question.
* Are you using any special queuing or shaping (output of "tc qdisc ls")
* You could try the following, which dumps out the state of the transmit ring
  in case of error. and tries to see if it is one of the other watchdog hooks in
  this driver.
------
diff -Nru a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
--- a/drivers/net/e1000/e1000_main.c	2004-06-18 11:09:36 -07:00
+++ b/drivers/net/e1000/e1000_main.c	2004-06-18 11:09:36 -07:00
@@ -1426,6 +1426,7 @@
 			 * but we've got queued Tx work that's never going
 			 * to get done, so reset controller to flush Tx.
 			 * (Do the reset outside of interrupt context). */
+			printk("%s: link lost but ring is full\n", netdev->name);
 			schedule_work(&adapter->tx_timeout_task);
 		}
 	}
@@ -1450,8 +1451,12 @@
 	i = txdr->next_to_clean;
 	if(txdr->buffer_info[i].dma &&
 	   time_after(jiffies, txdr->buffer_info[i].time_stamp + HZ) &&
-	   !(E1000_READ_REG(&adapter->hw, STATUS) & E1000_STATUS_TXOFF))
+	   !(E1000_READ_REG(&adapter->hw, STATUS) & E1000_STATUS_TXOFF)) {
+		printk("%s: may be hung last tx was %ld ticks\n",
+		       netdev->name,
+		       (long)(jiffies - txdr->buffer_info[i].time_stamp));
 		netif_stop_queue(netdev);
+	}

 	/* Reset the timer */
 	mod_timer(&adapter->watchdog_timer, jiffies + 2 * HZ);
@@ -1826,6 +1831,7 @@
 {
 	struct e1000_adapter *adapter = netdev->priv;

+	printk("%s: transmit timeout from queuing\n", netdev->name);
 	/* Do the reset outside of interrupt context */
 	schedule_work(&adapter->tx_timeout_task);
 }
@@ -1834,6 +1840,21 @@
 e1000_tx_timeout_task(struct net_device *netdev)
 {
 	struct e1000_adapter *adapter = netdev->priv;
+	unsigned long now = jiffies;
+	int i;
+
+	printk("%s: state=0x%lx transmit ring size=%u count=%u to_use=%u to_clean=%u\n",
+	       netdev->name, netdev->state,
+	       adapter->tx_ring.size, adapter->tx_ring.count,
+	       adapter->tx_ring.next_to_use, adapter->tx_ring.next_to_clean);
+
+	for (i = 0; i < adapter->tx_ring.count; ++i) {
+		struct e1000_buffer *b = &adapter->tx_ring.buffer_info[i];
+		printk(" %d: skb=%p dma=%llu length=%lu time=+%ld watch=%u\n",
+		       i, b->skb, b->dma, b->length,
+		       (long) (now - b->time_stamp), b->next_to_watch);
+	}
+

 	netif_device_detach(netdev);
 	e1000_down(adapter);

^ permalink raw reply [flat|nested] 14+ messages in thread
* Re: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out
2004-06-18 18:11 ` 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out Stephen Hemminger
@ 2004-06-18 18:44 ` David Greaves
[not found] ` <20040618141629.0edd9766@dell_ss3.pdx.osdl.net>
0 siblings, 1 reply; 14+ messages in thread
From: David Greaves @ 2004-06-18 18:44 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: Jens Laas, netdev

Stephen Hemminger wrote:

>To get to the root of these problems, could you:
>
>* Give full lspci -v output for the boards in question.
>
ash:
00:07.0 Ethernet controller: Intel Corp.: Unknown device 1076
        Subsystem: Intel Corp.: Unknown device 1176
        Flags: bus master, 66Mhz, medium devsel, latency 32, IRQ 11
        Memory at e3020000 (32-bit, non-prefetchable) [size=128K]
        Memory at e3000000 (32-bit, non-prefetchable) [size=128K]
        I/O ports at b400 [size=64]
        Expansion ROM at <unassigned> [disabled] [size=128K]
        Capabilities: [dc] Power Management version 2
        Capabilities: [e4] PCI-X non-bridge device.
        Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-

>* Are you using any special queuing or shaping (output of "tc qdisc ls")
>
no
root@ash:~ # tc qdisc ls
RTNETLINK answers: Invalid argument
Dump terminated

>* You could try the following, which dumps out the state of the transmit ring
>  in case of error. and tries to see if it is one of the other watchdog hooks in
>  this driver.
>
patched :)

Test
root@ash:/usr/src/linux # ifdown eth0 ; modprobe -r e1000;modprobe e1000; ifup eth0
ifdown: interface eth0 not configured
root@ash:/usr/src/linux # ping 10.0.1.1
PING 10.0.1.1 (10.0.1.1): 56 data bytes
64 bytes from 10.0.1.1: icmp_seq=0 ttl=64 time=0.3 ms
64 bytes from 10.0.1.1: icmp_seq=1 ttl=64 time=0.1 ms
64 bytes from 10.0.1.1: icmp_seq=2 ttl=64 time=0.1 ms
64 bytes from 10.0.1.1: icmp_seq=3 ttl=64 time=0.2 ms
--- 10.0.1.1 ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 0.1/0.1/0.3 ms
root@ash:/usr/src/linux # ping -s 1500 10.0.1.1
PING 10.0.1.1 (10.0.1.1): 1500 data bytes
1508 bytes from 10.0.1.1: icmp_seq=0 ttl=64 time=0.3 ms
1508 bytes from 10.0.1.1: icmp_seq=1 ttl=64 time=0.4 ms
1508 bytes from 10.0.1.1: icmp_seq=2 ttl=64 time=0.3 ms
--- 10.0.1.1 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 0.3/0.3/0.4 ms
root@ash:/usr/src/linux # ping -s 3000 10.0.1.1
PING 10.0.1.1 (10.0.1.1): 3000 data bytes
3008 bytes from 10.0.1.1: icmp_seq=0 ttl=64 time=0.4 ms
3008 bytes from 10.0.1.1: icmp_seq=3 ttl=64 time=0.3 ms
--- 10.0.1.1 ping statistics ---
7 packets transmitted, 2 packets received, 71% packet loss
round-trip min/avg/max = 0.3/0.3/0.4 ms

messages: (the 'after 5000 jiffies' is mine)
Jun 18 19:37:43 ash kernel: Copyright (c) 1999-2004 Intel Corporation.
Jun 18 19:37:44 ash kernel: e1000: eth0: e1000_probe: Intel(R) PRO/1000 Network Connection
Jun 18 19:37:46 ash kernel: e1000: eth0: e1000_watchdog: NIC Link is Up 1000 Mbps Full Duplex
Jun 18 19:38:18 ash kernel: eth0: may be hung last tx was 2457 ticks
Jun 18 19:38:20 ash kernel: eth0: may be hung last tx was 4457 ticks
Jun 18 19:38:22 ash kernel: eth0: may be hung last tx was 6457 ticks
Jun 18 19:38:24 ash kernel: eth0: may be hung last tx was 8457 ticks
Jun 18 19:38:26 ash kernel: NETDEV WATCHDOG: eth0: transmit timed out after 5000 jiffies
Jun 18 19:38:26 ash kernel: eth0: transmit timeout from queuing
Jun 18 19:38:26 ash kernel: eth0: may be hung last tx was 10457 ticks
Jun 18 19:38:26 ash kernel: eth0: state=0x7 transmit ring size=4096 count=256 to_use=66 to_clean=59
Jun 18 19:38:26 ash kernel: 0: skb=00000000 dma=0 length=42 time=+29527 watch=0
Jun 18 19:38:26 ash kernel: 1: skb=00000000 dma=0 length=98 time=+29527 watch=1
Jun 18 19:38:26 ash kernel: 2: skb=00000000 dma=0 length=98 time=+28526 watch=2
Jun 18 19:38:26 ash kernel: 3: skb=00000000 dma=0 length=98 time=+27525 watch=3
Jun 18 19:38:26 ash kernel: 4: skb=00000000 dma=0 length=98 time=+26524 watch=4
Jun 18 19:38:26 ash kernel: 5: skb=00000000 dma=0 length=42 time=+24528 watch=5
Jun 18 19:38:26 ash kernel: 6: skb=00000000 dma=0 length=0 time=+20324251 watch=7
Jun 18 19:38:26 ash kernel: 7: skb=00000000 dma=0 length=110 time=+24510 watch=0
Jun 18 19:38:26 ash kernel: 8: skb=00000000 dma=0 length=0 time=+20324251 watch=9
Jun 18 19:38:26 ash kernel: 9: skb=00000000 dma=0 length=110 time=+24510 watch=0
Jun 18 19:38:26 ash kernel: 10: skb=00000000 dma=0 length=0 time=+20324251 watch=11
Jun 18 19:38:26 ash kernel: 11: skb=00000000 dma=0 length=110 time=+24510 watch=0
Jun 18 19:38:26 ash kernel: 12: skb=00000000 dma=0 length=0 time=+20324251 watch=13
Jun 18 19:38:26 ash kernel: 13: skb=00000000 dma=0 length=110 time=+24510 watch=0
Jun 18 19:38:26 ash kernel: 14: skb=00000000 dma=0 length=0 time=+20324251 watch=15
Jun 18 19:38:26 ash kernel: 15: skb=00000000 dma=0 length=110 time=+24510 watch=0
Jun 18 19:38:26 ash kernel: 16: skb=00000000 dma=0 length=0 time=+20324251 watch=17
Jun 18 19:38:26 ash kernel: 17: skb=00000000 dma=0 length=257 time=+24510 watch=0
Jun 18 19:38:26 ash kernel: 18: skb=00000000 dma=0 length=0 time=+20324251 watch=19
Jun 18 19:38:26 ash kernel: 19: skb=00000000 dma=0 length=110 time=+22510 watch=0
Jun 18 19:38:26 ash kernel: 20: skb=00000000 dma=0 length=0 time=+20324251 watch=21
Jun 18 19:38:26 ash kernel: 21: skb=00000000 dma=0 length=110 time=+22510 watch=0
Jun 18 19:38:26 ash kernel: 22: skb=00000000 dma=0 length=0 time=+20324251 watch=23
Jun 18 19:38:26 ash kernel: 23: skb=00000000 dma=0 length=110 time=+22510 watch=0
Jun 18 19:38:26 ash kernel: 24: skb=00000000 dma=0 length=0 time=+20324251 watch=25
Jun 18 19:38:26 ash kernel: 25: skb=00000000 dma=0 length=110 time=+22510 watch=0
Jun 18 19:38:26 ash kernel: 26: skb=00000000 dma=0 length=0 time=+20324251 watch=27
Jun 18 19:38:26 ash kernel: 27: skb=00000000 dma=0 length=110 time=+22510 watch=0
Jun 18 19:38:26 ash kernel: 28: skb=00000000 dma=0 length=0 time=+20324251 watch=29
Jun 18 19:38:26 ash kernel: 29: skb=00000000 dma=0 length=110 time=+22510 watch=0
Jun 18 19:38:26 ash kernel: 30: skb=00000000 dma=0 length=0 time=+20324251 watch=31
Jun 18 19:38:26 ash kernel: 31: skb=00000000 dma=0 length=110 time=+22510 watch=0
Jun 18 19:38:26 ash kernel: 32: skb=00000000 dma=0 length=0 time=+20324251 watch=33
Jun 18 19:38:26 ash kernel: 33: skb=00000000 dma=0 length=110 time=+22510 watch=0
Jun 18 19:38:26 ash kernel: 34: skb=00000000 dma=0 length=0 time=+20324251 watch=35
Jun 18 19:38:26 ash kernel: 35: skb=00000000 dma=0 length=110 time=+22510 watch=0
Jun 18 19:38:26 ash kernel: 36: skb=00000000 dma=0 length=0 time=+20324251 watch=37
Jun 18 19:38:26 ash kernel: 37: skb=00000000 dma=0 length=110 time=+22510 watch=0
Jun 18 19:38:26 ash kernel: 38: skb=00000000 dma=0 length=1514 time=+21082 watch=38
Jun 18 19:38:26 ash kernel: 39: skb=00000000 dma=0 length=62 time=+21082 watch=39
Jun 18 19:38:26 ash kernel: 40: skb=00000000 dma=0 length=0 time=+20324251 watch=41
Jun 18 19:38:26 ash kernel: 41: skb=00000000 dma=0 length=110 time=+20510 watch=0
Jun 18 19:38:26 ash kernel: 42: skb=00000000 dma=0 length=0 time=+20324251 watch=43
Jun 18 19:38:26 ash kernel: 43: skb=00000000 dma=0 length=110 time=+20510 watch=0
Jun 18 19:38:26 ash kernel: 44: skb=00000000 dma=0 length=0 time=+20324251 watch=45
Jun 18 19:38:26 ash kernel: 45: skb=00000000 dma=0 length=110 time=+20510 watch=0
Jun 18 19:38:26 ash kernel: 46: skb=00000000 dma=0 length=0 time=+20324251 watch=47
Jun 18 19:38:26 ash kernel: 47: skb=00000000 dma=0 length=110 time=+20510 watch=0
Jun 18 19:38:26 ash kernel: 48: skb=00000000 dma=0 length=0 time=+20324251 watch=49
Jun 18 19:38:26 ash kernel: 49: skb=00000000 dma=0 length=110 time=+20510 watch=0
Jun 18 19:38:26 ash kernel: 50: skb=00000000 dma=0 length=1514 time=+20081 watch=50
Jun 18 19:38:26 ash kernel: 51: skb=00000000 dma=0 length=62 time=+20081 watch=51
Jun 18 19:38:26 ash kernel: 52: skb=00000000 dma=0 length=1514 time=+19080 watch=52
Jun 18 19:38:26 ash kernel: 53: skb=00000000 dma=0 length=62 time=+19080 watch=53
Jun 18 19:38:26 ash kernel: 54: skb=00000000 dma=0 length=1514 time=+11459 watch=54
Jun 18 19:38:26 ash kernel: 55: skb=00000000 dma=0 length=1514 time=+11458 watch=55
Jun 18 19:38:26 ash kernel: 56: skb=00000000 dma=0 length=82 time=+11458 watch=56
Jun 18 19:38:26 ash kernel: 57: skb=00000000 dma=0 length=1514 time=+10457 watch=57
Jun 18 19:38:26 ash kernel: 58: skb=00000000 dma=0 length=1514 time=+10457 watch=58
Jun 18 19:38:26 ash kernel: 59: skb=f0740420 dma=934467074 length=82 time=+10457 watch=59
Jun 18 19:38:26 ash kernel: 60: skb=d6e91420 dma=397015042 length=1514 time=+9456 watch=60
Jun 18 19:38:26 ash kernel: 61: skb=f07406a0 dma=935571458 length=1514 time=+9456 watch=61
Jun 18 19:38:26 ash kernel: 62: skb=f3fcde20 dma=26358274 length=82 time=+9456 watch=62
Jun 18 19:38:26 ash kernel: 63: skb=f0740ba0 dma=397012994 length=1514 time=+8455 watch=63
Jun 18 19:38:26 ash kernel: 64: skb=d6e914c0 dma=935573506 length=1514 time=+8455 watch=64
Jun 18 19:38:26 ash kernel: 65: skb=f0740600 dma=937204738 length=82 time=+8455 watch=65
Jun 18 19:38:26 ash kernel: 66: skb=00000000 dma=0 length=0 time=+20324251 watch=0
<snip many duplicate lines>
Jun 18 19:38:26 ash kernel: eth0: link lost but ring is full
Jun 18 19:38:26 ash kernel: eth0: state=0x16 transmit ring size=4096 count=256 to_use=9 to_clean=2
Jun 18 19:38:26 ash kernel: 0: skb=00000000 dma=0 length=1514 time=+1 watch=0
Jun 18 19:38:26 ash kernel: 1: skb=00000000 dma=0 length=1514 time=+1 watch=1
Jun 18 19:38:26 ash kernel: 2: skb=f0740060 dma=26400258 length=82 time=+1 watch=2
Jun 18 19:38:26 ash kernel: 3: skb=f0740ec0 dma=594843650 length=1514 time=+1 watch=3
Jun 18 19:38:26 ash kernel: 4: skb=d6e91a60 dma=594841602 length=1514 time=+1 watch=4
Jun 18 19:38:26 ash kernel: 5: skb=f0740560 dma=937203714 length=82 time=+1 watch=5
Jun 18 19:38:26 ash kernel: 6: skb=d6e919c0 dma=426745858 length=1514 time=+1 watch=6
Jun 18 19:38:26 ash kernel: 7: skb=d6e91880 dma=426747906 length=1514 time=+1 watch=7
Jun 18 19:38:26 ash kernel: 8: skb=f65ca920 dma=934469122 length=82 time=+1 watch=8
Jun 18 19:38:26 ash kernel: 9: skb=00000000 dma=0 length=0 time=+20324352 watch=0
Jun 18 19:38:26 ash kernel: 10: skb=00000000 dma=0 length=0 time=+20324352 watch=0
<snip many many lines>
Jun 18 19:38:26 ash kernel: 255: skb=00000000 dma=0 length=0 time=+20324352 watch=0

David

^ permalink raw reply [flat|nested] 14+ messages in thread
[parent not found: <20040618141629.0edd9766@dell_ss3.pdx.osdl.net>]
* Re: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out
[not found] ` <20040618141629.0edd9766@dell_ss3.pdx.osdl.net>
@ 2004-06-18 21:28 ` David Greaves
0 siblings, 0 replies; 14+ messages in thread
From: David Greaves @ 2004-06-18 21:28 UTC (permalink / raw)
To: Venkatesan, Ganesh; +Cc: Jens Laas, Glick, Kevin, netdev

OK
Thanks for the pointers and time Stephen, much appreciated :)

Ganesh and Jens - you said you'd like to keep this on-list so Stephen
let's ensure your reply is archived...

David

Stephen Hemminger wrote:

>It will be up to Intel (Ganesh et al) to look at this.
>
>On Fri, 18 Jun 2004 19:44:10 +0100
>David Greaves <david@dgreaves.com> wrote:
>
>>Stephen Hemminger wrote:
>>
>>>To get to the root of these problems, could you:
>>>
>>>* Give full lspci -v output for the boards in question.
>>>
>>ash:
>>00:07.0 Ethernet controller: Intel Corp.: Unknown device 1076
>>        Subsystem: Intel Corp.: Unknown device 1176
>>        Flags: bus master, 66Mhz, medium devsel, latency 32, IRQ 11
>>        Memory at e3020000 (32-bit, non-prefetchable) [size=128K]
>>        Memory at e3000000 (32-bit, non-prefetchable) [size=128K]
>>        I/O ports at b400 [size=64]
>>        Expansion ROM at <unassigned> [disabled] [size=128K]
>>        Capabilities: [dc] Power Management version 2
>>        Capabilities: [e4] PCI-X non-bridge device.
>>        Capabilities: [f0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable-
>>
>>Jun 18 19:38:18 ash kernel: eth0: may be hung last tx was 2457 ticks
>>
>This means the code in the e1000 watchdog is seeing the stuck board.
>The driver then calls netif_stop_queue which seems odd.
>
>>Jun 18 19:38:20 ash kernel: eth0: may be hung last tx was 4457 ticks
>>Jun 18 19:38:22 ash kernel: eth0: may be hung last tx was 6457 ticks
>>Jun 18 19:38:24 ash kernel: eth0: may be hung last tx was 8457 ticks
>>Jun 18 19:38:26 ash kernel: NETDEV WATCHDOG: eth0: transmit timed out after 5000 jiffies
>>Jun 18 19:38:26 ash kernel: eth0: transmit timeout from queuing
>>Jun 18 19:38:26 ash kernel: eth0: may be hung last tx was 10457 ticks
>>Jun 18 19:38:26 ash kernel: eth0: state=0x7 transmit ring size=4096 count=256 to_use=66 to_clean=59
>>
>The state bits show:
>  XOFF - stopped (but that was done in e1000_watchdog)
>  START - board is running
>  PRESENT - board is present.
>That looks okay, but what was the state in the e1000 watchdog??
>

^ permalink raw reply [flat|nested] 14+ messages in thread
* RE: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler
@ 2004-06-18 14:40 Venkatesan, Ganesh
0 siblings, 0 replies; 14+ messages in thread
From: Venkatesan, Ganesh @ 2004-06-18 14:40 UTC (permalink / raw)
To: David Greaves, Jens Laas; +Cc: Stephen Hemminger, netdev

Jens/David:

Did not mean to get off the list. For some reason, my subscription to
netdev is not working (even after re-subscribing). So, I grabbed your
message off of the archive.

I am trying to recreate your failure scenario in our lab. In the
meantime, please send me any new information you have on this issue.

Thanks,
ganesh
-------------------------------------------------
Ganesh Venkatesan
Network/Storage Division, Hillsboro, OR

-----Original Message-----
From: David Greaves [mailto:david@dgreaves.com]
Sent: Friday, June 18, 2004 5:52 AM
To: Jens Laas
Cc: Stephen Hemminger; netdev@oss.sgi.com; Venkatesan, Ganesh
Subject: Re: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler

New info:
I booted into XP and the card works there - so it doesn't look like a
simple hardware incompatibility.
[I've got no real way to test the performance but cygwin's wget against
apache1.3 on the linux box returns about 25M/s initially and then 15M/s
sustained for 500Mb]

Jens Laas wrote:

>> I'm speaking with Ganesh Venkatesan at intel about it. Ganesh you
>> went off list - do you want to include Jens or maybe go back on-list?
>
> If others run into this problem I'm sure they'll appreciate if it's on
> list.
> Since we have no idea what causes this (AFAIK) it may be a more
> general problem than the device driver.

I tend to agree - but I wasn't sure if this was the place and I'll do as
I'm told ;)

>> A simple failure case for me is : 'ping -s 1500 '
>> This doesn't cause the timeout but doesn't succeed either.
>>
>> ping -f with standard packet size succeeds (slow rate though) and
>> doesn't timeout.
>
> I don't see the ping problems at all. Unless you try to ping when the
> interface has "hanged" ?

<sigh> thought that might be helpful.
Ping with -s and -f seems to allow me to trigger errors and it seems a
lot more debug-able than scp or nfs :)
No, all tests are when it's reset and 'clean'

>> ============
>> From hereon down it's 2.6.7 with Stephen's recent delay scheduler patch
>>
>> This changed the behaviour.
>
> This is strange unless you are actually using the delay scheduler ?
> Default is sch_generic (that is pfifo) that does not exhibit the
> problems corrected by the patch.

I'll go back and double check in case I cocked up...
(I noticed the e1000 module rebuild but you're right that's incidental)

I've rebuilt the kernel and modules with and w/o patch and rebooted a
few times and I can't reproduce that effect - sorry for the red herring.

So after I reverted Stephen's patch the results I reported are still
reproducible w/o the patch.

>> 10592 packets transmitted, 10591 packets received, 0% packet loss
>> round-trip min/avg/max = 5.4/5.5/83.5 ms
>>
>> Increasing Transmit Descriptors to 4096 avoids the No buffer space
>> available with packet sizes up to -s65468 (still 100% failure though)
>
> Increasing nr of buffers is not a way to fix the problem.

agreed - however in my ignorance of the deep behaviour I'm reporting
things that affect behaviour in ways I don't expect. I expected it to
take longer to run out of buffers - that didn't happen :)
(Anyway, on retesting I find that this was wrong - I suspect the
interface was down and I didn't notice)

> I had hoped to hear something about this from Scott..

I'm happy to hear from anyone - I don't have *that* long until my RMA
option expires and I don't fancy keeping them as ornaments!

David

^ permalink raw reply [flat|nested] 14+ messages in thread
Thread overview: 14+ messages
2004-06-14 16:47 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out David Greaves
[not found] ` <20040615155111.26d6b809@dell_ss3.pdx.osdl.net>
2004-06-16 10:59 ` David Greaves
2004-06-18 8:04 ` Jens Laas
2004-06-18 9:08 ` 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler David Greaves
2004-06-18 10:27 ` Jens Laas
2004-06-18 12:51 ` David Greaves
2004-06-21 16:42 ` Thayne Harbaugh
2004-06-21 17:29 ` David Greaves
2004-06-21 17:43 ` ganesh.venkatesan
2004-06-21 18:34 ` David Greaves
2004-06-18 18:11 ` 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out Stephen Hemminger
2004-06-18 18:44 ` David Greaves
[not found] ` <20040618141629.0edd9766@dell_ss3.pdx.osdl.net>
2004-06-18 21:28 ` David Greaves
-- strict thread matches above, loose matches on Subject: below --
2004-06-18 14:40 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler Venkatesan, Ganesh