netdev.vger.kernel.org archive mirror
* r8169 locks up in 2.6.16.5
From: Thomas A. Oehser @ 2006-04-15 20:47 UTC (permalink / raw)
  To: netdev


Hi,

The r8169 driver seems not to be able to handle high loads?

On a dual Athlon 2600+ MP with 2GB RAM, and nothing else running.

I'm doing:

	nc -l -p 12345|buffer|cpio -iuvmdB

On the other machine, also with a gigabit card, I'm doing:

	find .|cpio -o -Hnewc -B|nc 192.168.111.1 12345

It works fine- for about 10 minutes- then, suddenly one of 3 things happens-
even if I stop the high-throughput transfer and quiesce everything.....

- Pings to the router take 9000ms instead of <1 ms (ifconfig down/up fixes it)
- or, it completely stops working, but ifconfig down/up makes it work again
- or, if iptables is natting, sometimes it completely crashes with an oops

None of these problems occurs using the other NIC, an eepro100 or something.

Should I just give these cards away and forget I ever heard of RealTeK?

Is it known that this driver, or this card, or the combination, is unstable?

Note, I have another machine using r8169, with some other OS- it doesn't fail.


-Tom

-- 
May 4, 1970: Alison Krause, Jeffrey Miller, Sandra Scheuer, William Schroeder.

* Re: r8169 locks up in 2.6.16.5
From: Francois Romieu @ 2006-04-16  0:58 UTC (permalink / raw)
  To: Thomas A. Oehser; +Cc: netdev

Thomas A. Oehser <tom@toms.net> :
[...]
> On a dual Athlon 2600+ MP with 2GB RAM, and nothing else running.
       ^^^^^^^^^^^^      ^^
Which motherboard and filesystem do you use ?

[...]
> Should I just give these cards away and forget I ever heard of RealTeK?

I don't think so.

> Is it known that this driver, or this card, or the combination, is unstable?

Some r8169 PRs are still open (mac address change, acpi suspend/resume
and "special" motherboard) but there is no clear pattern related to
stability under load. 

Can you send before and after the ping takes 9000 ms:
- ifconfig output
- registers dump via ethtool 

9000ms seems quite close to the watchdog timeout (6 s) + ping
interval. Complete dmesg and .config will be welcome.

-- 
Ueimor

* Re: r8169 locks up in 2.6.16.5
From: Thomas A. Oehser @ 2006-04-16 13:42 UTC (permalink / raw)
  To: Francois Romieu, netdev


> Which motherboard and filesystem do you use ?

It's an MSI using EXT3.

> Can you send before and after the ping takes 9000 ms:
> - ifconfig output
> - registers dump via ethtool 
> 
> 9000ms seems quite close to the watchdog timeout (6 s) + ping
> interval. Complete dmesg and .config will be welcome.

I tested with everything turned off (no SMP, no swap, no
USB, no firewire, no iptables, no lmsensors, flat 1G memory,
etc. etc. etc.) except the bare minimum (LSI new Megaraid for the
SCSI and SATA raid arrays, both of which use the same driver, EXT3,
VGA console, keyboard, mouse).  Also, with everything static and
no modules.

Changing _nothing_ other than replacing EEPRO100 with R8169 and
vice versa causes it to work perfectly or fail abominably... the
machine always had both nics physically connected, so it was just
the software change of one driver or the other as eth0.

There was no dmesg output.

Note, one thing I have noticed- it always failed when the load was
coming from the same machine.  Perhaps the packets from that machine
have something about them that particularly spasms the driver?
I'll look at what nic and driver that machine is using.

Note, it will be a day or few before I can retest with ethtool etc.,
I have to do my taxes today.

-Thanks, -Tom


* Re: r8169 locks up in 2.6.16.5
From: Francois Romieu @ 2006-04-16 14:43 UTC (permalink / raw)
  To: Thomas A. Oehser; +Cc: netdev

Thomas A. Oehser <tom@toms.net> :
[...]
> I tested with everything turned off (no SMP, no swap, no
> USB, no firewire, no iptables, no lmsensors, flat 1G memory,
> etc. etc. etc.) except the bare minimum (LSI new Megaraid for the
> SCSI and SATA raid arrays, both of which use the same driver, EXT3,
> VGA console, keyboard, mouse).  Also, with everything static and
> no modules.

Ok, it really looks like a genuine bug.

[...]
> Changing _nothing_ other than replacing EEPRO100 with R8169 and
> vice versa causes it to work perfectly or fail abominably... the
> machine always had both nics physically connected, so it was just
> the software change of one driver or the other as eth0.
> 
> There was no dmesg output.

Then the Rx part of the driver is more screwed than the Tx one.

I'll welcome a complete dmesg after you have recovered through
ifconfig down/up though.

[...]
> Note, it will be a day or few before I can retest with ethtool etc.,
> I have to do my taxes today.

No problem. One more question: have you enabled NAPI ? If not, you
should.

-- 
Ueimor

* Re: r8169 locks up in 2.6.16.5
From: Thomas A. Oehser @ 2006-04-16 17:26 UTC (permalink / raw)
  To: Francois Romieu; +Cc: Thomas A. Oehser, netdev


> I'll welcome a complete dmesg after you have recovered through
> ifconfig down/up though.

_Nothing_ on dmesg.

> No problem. One more question: have you enabled NAPI ? If not, you
> should.

Doesn't seem to make a difference.  Here, with NAPI on, after it failed:

  --- 192.168.99.100 ping statistics ---
  1362 packets transmitted, 903 packets received, 33% packet loss
  round-trip min/avg/max = 0.1/35803.4/65264.8 ms

Worse than I thought, to a machine on the same gigabit copper switch...

ifconfig eth1:

eth1      Link encap:Ethernet  HWaddr 00:E0:4C:13:A9:56  
          inet addr:192.168.99.99  Bcast:192.168.99.255  Mask:255.255.255.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:528416 errors:252 dropped:620 overruns:0 frame:966
          TX packets:243040 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000 
          Interrupt:161 Base address:0xe000 

ethtool eth1:

Settings for eth1:
	Supported ports: [ TP ]
	Supported link modes:   10baseT/Half 10baseT/Full 
	                        100baseT/Half 100baseT/Full 
	                        1000baseT/Full 
	Supports auto-negotiation: Yes
	Advertised link modes:  10baseT/Half 10baseT/Full 
	                        100baseT/Half 100baseT/Full 
	                        1000baseT/Full 
	Advertised auto-negotiation: Yes
	Speed: 1000Mb/s
	Duplex: Full
	Port: Twisted Pair
	PHYAD: 0
	Transceiver: internal
	Auto-negotiation: on
	Supports Wake-on: pumbg
	Wake-on: g
	Current message level: 0x00000033 (51)
	Link detected: yes

Note, without having thrown traffic through it, I have:

PING 192.168.99.100 (192.168.99.100): 56 data bytes
64 bytes from 192.168.99.100: icmp_seq=0 ttl=128 time=0.1 ms
64 bytes from 192.168.99.100: icmp_seq=1 ttl=128 time=0.1 ms
64 bytes from 192.168.99.100: icmp_seq=2 ttl=128 time=0.1 ms
64 bytes from 192.168.99.100: icmp_seq=3 ttl=128 time=0.1 ms
--- 192.168.99.100 ping statistics ---
4 packets transmitted, 4 packets received, 0% packet loss
round-trip min/avg/max = 0.1/0.1/0.1 ms

What next?

-Tom

* Re: r8169 locks up in 2.6.16.5
From: Francois Romieu @ 2006-04-16 19:53 UTC (permalink / raw)
  To: Thomas A. Oehser; +Cc: netdev

Thomas A. Oehser <tom@toms.net> :
[...]
> > I'll welcome a complete dmesg after you have recovered through
> > ifconfig down/up though.
> 
> _Nothing_ on dmesg.

Huh... Nothing appears when you issue 'dmesg' ?

[...]
> eth1      Link encap:Ethernet  HWaddr 00:E0:4C:13:A9:56  
>           inet addr:192.168.99.99  Bcast:192.168.99.255  Mask:255.255.255.0
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:528416 errors:252 dropped:620 overruns:0 frame:966
>           TX packets:243040 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000 
>           Interrupt:161 Base address:0xe000 

How long was the card under load ? A few seconds ?
The same 1500 bytes mtu is used everywhere and the issue can not be
reproduced with a simple ping -f -q -l 64 -s what_you_want aimed at
the 8169, right ?

[...]
> What next?

ethtool -d/-S before and after failure, .config and /proc/interrupts.

-- 
Ueimor

* Re: r8169 locks up in 2.6.16.5
From: Thomas A. Oehser @ 2006-04-16 22:58 UTC (permalink / raw)
  To: Francois Romieu; +Cc: Thomas A. Oehser, netdev

> How long was the card under load ? A few seconds ?
> The same 1500 bytes mtu is used everywhere and the issue can not be
> reproduced with a simple ping -f -q -l 64 -s what_you_want aimed at
> the 8169, right ?

Correct.

> > What next?
> 
> ethtool -d/-S before and after failure, .config and /proc/interrupts.

Ok, here it all is, the attached archive has:

- .before is right after the interface came up
- .mid is after flooding it with ping -f commands from 2 machines at once
- .bad is after doing the cpio over nc that killed it after 170MB
- .back is after ifconfig down / ifconfig up that restored it

Note, the kernel config has SMP etc turned back on, as it doesn't
seem to affect things, and the eepro100 back on eth0, I'm testing
this against eth1 so that the machine can be normally useful as well...

r8169-bad/ifconfig.before
r8169-bad/dmesg.before
r8169-bad/arp-n.before
r8169-bad/ethtool-d.before
r8169-bad/ethtool-S.before
r8169-bad/config.gz
r8169-bad/proc-interrupts.before
r8169-bad/ping-during
r8169-bad/dmesg.mid
r8169-bad/arp-n.mid
r8169-bad/ifconfig.mid
r8169-bad/ethtool-d.mid
r8169-bad/ethtool-S.mid
r8169-bad/proc-interrupts.mid
r8169-bad/ifconfig.bad
r8169-bad/dmesg.bad
r8169-bad/ethtool-d.bad
r8169-bad/ethtool-S.bad
r8169-bad/proc-interrupts.bad
r8169-bad/ifconfig.back
r8169-bad/dmesg.back
r8169-bad/ethtool-d.back
r8169-bad/ethtool-S.back
r8169-bad/proc-interrupts.back

-Thanks, -Tom

[-- Attachment #2: r8169-bad.tar.bz2 --]
[-- Type: application/octet-stream, Size: 18535 bytes --]

* Re: r8169 locks up in 2.6.16.5
From: Francois Romieu @ 2006-04-17 23:09 UTC (permalink / raw)
  To: Thomas A. Oehser; +Cc: netdev

Thomas A. Oehser <tom@toms.net> :
[...]
> Ok, here it all is, the attached archive has:
> 
> - .before is right after the interface came up
> - .mid is after flooding it with ping -f commands from 2 machines at once
> - .bad is after doing the cpio over nc that killed it after 170MB
> - .back is after ifconfig down / ifconfig up that restored it

Thanks for the report. It is quite clear. The device is (almost surely)
killed by a rx fifo overflow. Expect a patch shortly.

I wonder why it overflows in the first place though. How much time does
the 170 MB transfer need ?

-- 
Ueimor

* Re: r8169 locks up in 2.6.16.5
From: Thomas A. Oehser @ 2006-04-18  0:43 UTC (permalink / raw)
  To: Francois Romieu; +Cc: Thomas A. Oehser, netdev


> Thanks for the report. It is quite clear. The device is (almost surely)
> killed by a rx fifo overflow. Expect a patch shortly.
 
> I wonder why it overflows in the first place though. How much time does
> the 170 MB transfer need ?

It is actually about a 30GB transfer, it was just after the first
170MB that it failed.  The command in question is just a simple
"nc -l -p 12345|buffer|cpio -iumdB", and the sender may well be
able to generate the data faster than the receiving disk can save
it, as the sender is a raid-1 mirror and the receiver is a raid-5
array, I would expect the raid-5 write penalty and the raid-1 read
speed to make it have to block for most of the transfer.  It didn't
take long to get that far- um, I think only 2 or 3 minutes before
it locked up.

-Tom

-- 
May 4, 1970: Alison Krause, Jeffrey Miller, Sandra Scheuer, William Schroeder.

* [patch] Re: r8169 locks up in 2.6.16.5
From: Francois Romieu @ 2006-04-18 23:04 UTC (permalink / raw)
  To: Thomas A. Oehser; +Cc: netdev

Thomas A. Oehser <tom@toms.net> :
[...]
> It is actually about a 30GB transfer, it was just after the first
> 170MB that it failed.  The command in question is just a simple
> "nc -l -p 12345|buffer|cpio -iumdB", and the sender may well be
> able to generate the data faster than the receiving disk can save
> it, as the sender is a raid-1 mirror and the receiver is a raid-5
> array, I would expect the raid-5 write penalty and the raid-1 read
> speed to make it have to block for most of the transfer.  It didn't
> take long to get that far- um, I think only 2 or 3 minutes before
> it locked up.

The r8169 offers 48 KB of Rx fifo. 170 MB in 2~3 minutes is under 2 MB/s.
Even if the traffic is very bursty, something seems to stall the PCI bus.
I'm a bit surprised.
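
As a rough check of these figures, here is a small C sketch. It assumes
the transfer size is 170 megabytes (the size given in the report), that
the Rx fifo holds 48 KB with 1 KB = 1024 bytes, and that the link runs at
1000 Mbit/s. The function names are made up for illustration and are not
part of the driver.

```c
#include <stdio.h>

/* Average rate in MB/s for `megabytes` MB moved in `seconds` seconds. */
double avg_rate_mb_per_s(double megabytes, double seconds)
{
	return megabytes / seconds;
}

/*
 * Time in microseconds to fill a fifo of `fifo_kb` KB (1 KB = 1024 bytes)
 * at `line_rate_mbit` Mbit/s when nothing drains it.
 * 1 Mbit/s is exactly 1 bit per microsecond.
 */
double fifo_fill_time_us(double fifo_kb, double line_rate_mbit)
{
	double fifo_bits = fifo_kb * 1024.0 * 8.0;

	return fifo_bits / line_rate_mbit;
}

/* Print the estimates for the numbers quoted in this thread. */
void print_estimates(void)
{
	printf("170 MB in 2 min: %.2f MB/s\n", avg_rate_mb_per_s(170.0, 120.0));
	printf("170 MB in 3 min: %.2f MB/s\n", avg_rate_mb_per_s(170.0, 180.0));
	printf("48 KB fifo at 1000 Mbit/s fills in %.0f us\n",
	       fifo_fill_time_us(48.0, 1000.0));
}
```

Under these assumptions the average rate is roughly 1.4 MB/s for a
2 minute transfer and 0.9 MB/s for a 3 minute one, while an undrained
48 KB fifo fills in under 0.4 ms at line rate. That is consistent with
the remark that the average load alone does not explain the overflow.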

Anyway, can you give the patch below a try and tell if it changes
something ? If so, an updated output of 'ethtool -S' would be welcome.

diff --git a/drivers/net/r8169.c b/drivers/net/r8169.c
index 0ad3310..f9da390 100644
--- a/drivers/net/r8169.c
+++ b/drivers/net/r8169.c
@@ -256,10 +256,11 @@ enum RTL8169_register_content {
 	RxOK = 0x01,
 
 	/* RxStatusDesc */
-	RxRES = 0x00200000,
-	RxCRC = 0x00080000,
-	RxRUNT = 0x00100000,
-	RxRWT = 0x00400000,
+	RxFOVF	= (1 << 23),
+	RxRWT	= (1 << 22),
+	RxRES	= (1 << 21),
+	RxRUNT	= (1 << 20),
+	RxCRC	= (1 << 19),
 
 	/* ChipCmdBits */
 	CmdReset = 0x10,
@@ -2435,6 +2436,10 @@ rtl8169_rx_interrupt(struct net_device *
 				tp->stats.rx_length_errors++;
 			if (status & RxCRC)
 				tp->stats.rx_crc_errors++;
+			if (status & RxFOVF) {
+				rtl8169_schedule_work(dev, rtl8169_reset_task);
+				tp->stats.rx_fifo_errors++;
+			}
 			rtl8169_mark_to_asic(desc, tp->rx_buf_sz);
 		} else {
 			struct sk_buff *skb = tp->Rx_skbuff[entry];
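
For readers who want to see the status handling outside the kernel, here
is a minimal userspace sketch of what the second hunk does with the Rx
descriptor status word. The bit positions are copied from the patch; the
struct, the function name and the reset_requested flag are hypothetical
stand-ins for tp->stats and rtl8169_schedule_work(). The condition that
guards the length error counter is not visible in the hunk, so testing
RxRWT | RxRUNT there is an assumption.

```c
#include <stdint.h>

/* RxStatusDesc bits, as defined by the patch. */
enum rx_status_bits {
	RxFOVF	= (1 << 23),	/* Rx fifo overflow (new in the patch) */
	RxRWT	= (1 << 22),
	RxRES	= (1 << 21),
	RxRUNT	= (1 << 20),
	RxCRC	= (1 << 19),
};

/* Hypothetical stand-in for the counters kept in tp->stats. */
struct rx_error_counts {
	unsigned long length_errors;
	unsigned long crc_errors;
	unsigned long fifo_errors;
	int reset_requested;	/* stands in for scheduling the reset task */
};

/* Account one descriptor status word the way the patched error path does. */
void account_rx_status(uint32_t status, struct rx_error_counts *c)
{
	if (status & (RxRWT | RxRUNT))	/* assumed guard, not shown in the hunk */
		c->length_errors++;
	if (status & RxCRC)
		c->crc_errors++;
	if (status & RxFOVF) {
		c->reset_requested = 1;
		c->fifo_errors++;
	}
}
```

Note that the first hunk only rewrites the existing constants as shifts:
(1 << 21), (1 << 19), (1 << 20) and (1 << 22) are the same values as the
hex literals 0x00200000, 0x00080000, 0x00100000 and 0x00400000 they
replace, so RxFOVF is the only new bit.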
