RE: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler

netdev.vger.kernel.org archive mirror

* RE: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler
@ 2004-06-18 14:40 Venkatesan, Ganesh
  2004-06-18 16:59 ` 2.6.6 e1000 ifconfig: page allocation failure David Greaves
  0 siblings, 1 reply; 2+ messages in thread
From: Venkatesan, Ganesh @ 2004-06-18 14:40 UTC (permalink / raw)
  To: David Greaves, Jens Laas; +Cc: Stephen Hemminger, netdev

Jens/David:

Did not mean to get off the list. For some reason, my subscription to
netdev is not working (even after re-subscribing). So, I grabbed your
message off of the archive.

I am trying to recreate your failure scenario in our lab. In the
meantime, please send me any new information you have on this issue.

Thanks,
ganesh 

-------------------------------------------------
Ganesh Venkatesan
Network/Storage Division, Hillsboro, OR

-----Original Message-----
From: David Greaves [mailto:david@dgreaves.com] 
Sent: Friday, June 18, 2004 5:52 AM
To: Jens Laas
Cc: Stephen Hemminger; netdev@oss.sgi.com; Venkatesan, Ganesh
Subject: Re: 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler

New info:
I booted into XP and the card works there - so it doesn't look like a 
simple hardware incompatibility.
[I've got no real way to test the performance but cygwin's wget against 
apache1.3 on the linux box returns about 25M/s initially and then 15M/s 
sustained for 500Mb]

Jens Laas wrote:

>>
>> I'm speaking with Ganesh Venkatesan at intel about it. Ganesh you 
>> went off list - do you want to include Jens or maybe go back on-list?
>
>
> If others run into this problem I'm sure they'll appreciate it if it's 
> on the list.
> Since we have no idea what causes this (AFAIK) it may be a more 
> general problem than the device driver.

I tend to agree - but I wasn't sure if this was the place and I'll do as 
I'm told ;)

>> A simple failure case for me is: 'ping -s 1500 '
>> This doesn't cause the timeout but doesn't succeed either.
>>
>> ping -f with standard packet size succeeds (slow rate though) and 
>> doesn't time out.
>
>
>
> I don't see the ping problems at all, unless you try to ping when the 
> interface has "hung"?

<sigh> thought that might be helpful.
Ping with -s and -f seems to allow me to trigger errors and it seems a 
lot more debug-able than scp or nfs :)
No, all tests are run when it's reset and 'clean'.
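
For anyone repeating the ping test: -s sets the ICMP payload size, not the packet size, so 'ping -s 1500' on a 1500-byte MTU has to be split into IP fragments, while a default-sized ping does not. Whether fragmentation is what upsets the card here is not established in this thread. A minimal Python sketch of the arithmetic, assuming a 20-byte IPv4 header with no options and an 8-byte ICMP echo header (the function name is made up for illustration):

```python
import math

IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo request header


def fragments_for_ping(payload, mtu=1500):
    """Return how many IPv4 fragments one echo request of `payload` bytes needs."""
    ip_payload = payload + ICMP_HEADER
    if ip_payload + IPV4_HEADER <= mtu:
        return 1  # fits in a single packet
    # Every fragment except the last must carry a multiple of 8 bytes.
    per_fragment = (mtu - IPV4_HEADER) // 8 * 8
    return math.ceil(ip_payload / per_fragment)


if __name__ == "__main__":
    for size in (56, 1472, 1500, 65468):
        print(size, fragments_for_ping(size))
```

Under these assumptions the largest payload that still fits in one packet at MTU 1500 is 1472 bytes, so -s 1472 and -s 1473 make a useful pair for separating "large packet" failures from "fragmented packet" failures.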

>> ============
>> From here on down it's 2.6.7 with Stephen's recent delay scheduler patch
>>
>> This changed the behaviour.
>
>
>
> This is strange unless you are actually using the delay scheduler ?
> Default is sch_generic (that is pfifo) that does not exhibit the 
> problems corrected by the patch.

I'll go back and double check in case I cocked up...
(I noticed the e1000 module rebuild but you're right that's incidental)

I've rebuilt the kernel and modules with and w/o patch and rebooted a 
few times and I can't reproduce that effect - sorry for the red herring.
So after I reverted Stephen's patch, the results I reported are still 
reproducible w/o the patch.

>> 10592 packets transmitted, 10591 packets received, 0% packet loss
>> round-trip min/avg/max = 5.4/5.5/83.5 ms
>>
>> Increasing Transmit Descriptors to 4096 avoids the No buffer space 
>> available with packet sizes up to -s65468 (still 100% failure though)
>
>
> Increasing nr of buffers is not a way to fix the problem.

agreed - however in my ignorance of the deep behaviour I'm reporting 
things that affect behaviour in ways I don't expect.
I expected it to take longer to run out of buffers - that didn't happen :)

(Anyway, on retesting I find that this was wrong - I suspect the 
interface was down and I didn't notice)

>
> I had hoped to hear something about this from Scott..

I'm happy to hear from anyone - I don't have *that* long until my RMA 
option expires and I don't fancy keeping them as ornaments!

David


* Re: 2.6.6 e1000 ifconfig: page allocation failure
  2004-06-18 14:40 2.6.6 e1000 NETDEV WATCHDOG: eth0: transmit timed out+ delay scheduler Venkatesan, Ganesh
@ 2004-06-18 16:59 ` David Greaves
  0 siblings, 0 replies; 2+ messages in thread
From: David Greaves @ 2004-06-18 16:59 UTC (permalink / raw)
  To: Venkatesan, Ganesh; +Cc: Jens Laas, Stephen Hemminger, netdev

On the 2.6.6 server machine:
  ifconfig eth0 mtu 9000
gives an oops in the usb?

Unable to handle kernel paging request at virtual address 92a8292a
 printing eip:
d1163305
*pde = 00000000
Oops: 0000 [#1]
CPU:    0
EIP:    0060:[<d1163305>]    Not tainted
EFLAGS: 00010286   (2.6.6)
EIP is at usb_buffer_free+0x15/0x50 [usbcore]
eax: cea2ec00   ebx: c13665e8   ecx: 00000001   edx: 92a8290a
esi: c13665ec   edi: cf0439dc   ebp: cf58eef4   esp: c3535f44
ds: 007b   es: 007b   ss: 0068
Process usb (pid: 2744, threadinfo=c3534000 task=cf245370)
Stack: cba80d00 c13665e8 c13665ec cf0439dc d106e3a6 cea2ec00 00002000 cf636000
       0f636000 c13665e8 d106e4a9 c13665e8 cf122980 cffe0280 c01470d3 cf0439dc
       cf122980 cf122980 00000000 cf27f200 c3534000 c0145a19 cf122980 cf27f200
Call Trace:
 [<d106e3a6>] usblp_cleanup+0x46/0xb0 [usblp]
 [<d106e4a9>] usblp_release+0x59/0x60 [usblp]
 [<c01470d3>] __fput+0xe3/0x100
 [<c0145a19>] filp_close+0x59/0x90
 [<c0145aa0>] sys_close+0x50/0x60
 [<c0103f0b>] syscall_call+0x7/0xb

Code: 8b 4a 20 85 c9 74 07 8b 41 18 85 c0 75 04 83 c4 10 c3 8b 44
 <6>usb 1-1: new full speed USB device using address 3
drivers/usb/class/usblp.c: usblp0: USB Bidirectional printer dev 3 if 0 alt 0 proto 2 vid 0x04B8 pid 0x0005
ifconfig: page allocation failure. order:3, mode:0x20
Call Trace:
 [<c013136f>] __alloc_pages+0x2af/0x2f0
 [<c01313d5>] __get_free_pages+0x25/0x40
 [<c01342e7>] cache_grow+0x87/0x230
 [<c01345c9>] cache_alloc_refill+0x139/0x200
 [<c0134960>] __kmalloc+0x70/0x80
 [<c02c1869>] alloc_skb+0x49/0xe0
 [<d110f262>] e1000_alloc_rx_buffers+0x62/0x100 [e1000]
 [<d110c045>] e1000_up+0x45/0xb0 [e1000]
 [<d110e4fc>] e1000_change_mtu+0x7c/0xd0 [e1000]
 [<c02c6e49>] dev_set_mtu+0x79/0x90
 [<c02c7429>] dev_ioctl+0x1e9/0x270
 [<c030032e>] inet_ioctl+0x8e/0xa0
 [<c02be895>] sock_ioctl+0xb5/0x250
 [<c015655d>] sys_ioctl+0xad/0x210
 [<c01129d0>] do_page_fault+0x0/0x4ff
 [<c0103f0b>] syscall_call+0x7/0xb

MemTotal:       256440 kB
MemFree:          2576 kB
Buffers:         18276 kB
Cached:         202048 kB
SwapCached:          0 kB
Active:         112492 kB
Inactive:       115324 kB
HighTotal:           0 kB
HighFree:            0 kB
LowTotal:       256440 kB
LowFree:          2576 kB
SwapTotal:      522100 kB
SwapFree:       522100 kB
Dirty:               8 kB
Writeback:           0 kB
Mapped:          14856 kB
Slab:            16920 kB
Committed_AS:    20272 kB
PageTables:        368 kB
VmallocTotal:   770040 kB
VmallocUsed:     10656 kB
VmallocChunk:   759264 kB
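
A note on reading "order:3, mode:0x20" in the failure above: the order is the base-2 logarithm of the number of physically contiguous pages requested, so order 3 means 8 pages in one piece. The mode value is the allocation flags mask; as far as I know 0x20 corresponds to an atomic (non-sleeping) request in 2.6-era kernels, but treat that as an assumption. The Python sketch below only does the size arithmetic, assuming the usual 4096-byte i386 page size; the helper names are hypothetical.

```python
PAGE_SIZE = 4096  # bytes; assumed i386 page size, not stated in the thread


def order_to_bytes(order, page_size=PAGE_SIZE):
    """Size in bytes of one contiguous block of the given order."""
    return page_size << order


def order_for_bytes(nbytes, page_size=PAGE_SIZE):
    """Smallest order whose block is at least `nbytes` long."""
    order = 0
    while (page_size << order) < nbytes:
        order += 1
    return order


if __name__ == "__main__":
    print(order_to_bytes(3))           # size of an order-3 block
    print(order_for_bytes(16384))      # exactly 16 KiB
    print(order_for_bytes(16384 + 1))  # anything just over 16 KiB
```

The last line is the interesting case: any request even slightly larger than 16 KiB needs an order-3 block when sizes round up to a power of two. A jumbo-frame receive buffer plus per-buffer bookkeeping could plausibly land there, although the thread does not give the driver's actual buffer size.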



I have had similar on the stable box when it's been used for a while.
I did:
ifconfig eth1 mtu 9000
on the good machine and it gave me this:

Jun 18 16:33:08 haze kernel: printk: 1 messages suppressed.
Jun 18 16:33:08 haze kernel: ifconfig: page allocation failure. order:3, mode:0x20
Jun 18 16:33:08 haze kernel:  [__alloc_pages+728/848] __alloc_pages+0x2d8/0x350
Jun 18 16:33:08 haze kernel:  [__get_free_pages+37/64] __get_free_pages+0x25/0x40
Jun 18 16:33:08 haze kernel:  [kmem_getpages+32/176] kmem_getpages+0x20/0xb0
Jun 18 16:33:08 haze kernel:  [cache_grow+166/512] cache_grow+0xa6/0x200
Jun 18 16:33:08 haze kernel:  [cache_alloc_refill+342/544] cache_alloc_refill+0x156/0x220
Jun 18 16:33:08 haze kernel:  [__kmalloc+116/128] __kmalloc+0x74/0x80
Jun 18 16:33:08 haze kernel:  [alloc_skb+71/224] alloc_skb+0x47/0xe0
Jun 18 16:33:08 haze kernel:  [pg0+945227150/1069572096] e1000_alloc_rx_buffers+0x5e/0x100 [e1000]
Jun 18 16:33:08 haze kernel:  [pg0+945213509/1069572096] e1000_up+0x45/0xb0 [e1000]
Jun 18 16:33:08 haze kernel:  [pg0+945223248/1069572096] e1000_change_mtu+0x80/0x110 [e1000]
Jun 18 16:33:08 haze kernel:  [dev_set_mtu+121/144] dev_set_mtu+0x79/0x90
Jun 18 16:33:08 haze kernel:  [dev_ioctl+501/640] dev_ioctl+0x1f5/0x280
Jun 18 16:33:08 haze kernel:  [inet_ioctl+142/160] inet_ioctl+0x8e/0xa0
Jun 18 16:33:08 haze kernel:  [sock_ioctl+233/656] sock_ioctl+0xe9/0x290
Jun 18 16:33:08 haze kernel:  [sys_ioctl+239/608] sys_ioctl+0xef/0x260
Jun 18 16:33:08 haze kernel:  [do_page_fault+0/1242] do_page_fault+0x0/0x4da
Jun 18 16:33:08 haze kernel:  [syscall_call+7/11] syscall_call+0x7/0xb

it had
root@haze:~ # cat /proc/meminfo
MemTotal:      1036868 kB
MemFree:          7564 kB
Buffers:         30720 kB
Cached:         756496 kB
SwapCached:          0 kB
Active:         553348 kB
Inactive:       362700 kB
HighTotal:      131056 kB
HighFree:          252 kB
LowTotal:       905812 kB
LowFree:          7312 kB
SwapTotal:           0 kB
SwapFree:            0 kB
Dirty:               0 kB
Writeback:           0 kB
Mapped:         179532 kB
Slab:           105264 kB
Committed_AS:   298092 kB
PageTables:       1504 kB
VmallocTotal:   114680 kB
VmallocUsed:      2112 kB
VmallocChunk:   112376 kB
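
Both meminfo dumps show the same pattern: very little in MemFree and most of RAM in Cached. A rough Python sketch for pulling those figures out of /proc/meminfo-style text follows; the function names are made up, and the sample values are copied from the two dumps in this message. Note that a low MemFree figure alone does not say how fragmented the free memory is, which is what matters for a multi-page contiguous request.

```python
def parse_meminfo(text):
    """Parse 'Name:   12345 kB' lines into a dict of name -> kB."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a 'Name: value' line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields


def percent_of_total(fields, key):
    """Return fields[key] as a percentage of MemTotal."""
    return 100.0 * fields[key] / fields["MemTotal"]


if __name__ == "__main__":
    # Values taken from the two meminfo dumps quoted in this message.
    server = parse_meminfo(
        "MemTotal: 256440 kB\nMemFree: 2576 kB\nCached: 202048 kB\n")
    haze = parse_meminfo(
        "MemTotal: 1036868 kB\nMemFree: 7564 kB\nCached: 756496 kB\n")
    for name, box in (("server", server), ("haze", haze)):
        print(name,
              round(percent_of_total(box, "MemFree"), 1),
              round(percent_of_total(box, "Cached"), 1))
```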

I could repeat this by switching between mtu 1500 and mtu 9000.
Somehow the distro hadn't mkswap'ed the swap so I added swap and the 
problem went away.
If I swapoff then every time I set the mtu to 9000 I get the page 
allocation failure.

I don't think this should happen but I'm not sure if I *must* have swap?
Also I did this whilst the interface was up (it let me).

David


