public inbox for linux-kernel@vger.kernel.org
* [2.6.x] e1000: NETDEV WATCHDOG: eth0: transmit timed out
@ 2004-01-23 21:43 Petr Sebor
  0 siblings, 0 replies; 6+ messages in thread
From: Petr Sebor @ 2004-01-23 21:43 UTC (permalink / raw)
  To: linux-kernel

Hello,

Since we upgraded the cabling on our network and transfer speeds
increased a little, we very often run into situations where the Intel
PRO/1000 NICs just stop responding and the network dies for a while. The
local console works, and there are no error messages other than these
(logged when eth0 comes back to life):

NETDEV WATCHDOG: eth0: transmit timed out
e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex

Then it works OK again, until the next watchdog message.

It happened even before the cabling upgrade; it was just very rare.

I have tried kernels 2.6.0, 2.6.1, and 2.6.2-bk1, and it happens with all
of them. It just _seems_ to happen more often with 2.6.2-bk.

The machine in question is an Opteron 244 based server (though the kernel
is compiled for 32 bits/Athlon). An SMP kernel makes no difference; it
eventually happens there as well. The server is not heavily loaded; just
a few users are enough to trigger the issue. The board is MSI KT800 based.

I have tried switching NICs, but it makes no difference. The onboard
integrated TG3 gigabit network controller suffers from the '100% CPU
usage' issue when utilized, so it is unfortunately not an option at the
moment.

Does anyone have a clue what might be wrong here?

Have a nice weekend,
Petr



* RE: [2.6.x] e1000: NETDEV WATCHDOG: eth0: transmit timed out
@ 2004-01-23 22:28 Feldman, Scott
  2004-01-24 19:36 ` Sergey S. Kostyliov
  2004-01-26 10:33 ` Petr Sebor
  0 siblings, 2 replies; 6+ messages in thread
From: Feldman, Scott @ 2004-01-23 22:28 UTC (permalink / raw)
  To: Petr Sebor, linux-kernel

> since we have upgraded cabling on our network and transfer 
> speeds increased a little bit, we are experiencing very often 
> situations where the Intel PRO/1000 nics just stop responding 
> and network dies for a while. Local console works, there are 
> no more error messages other than (when the eth0 comes to a 
> life again):
> 
> NETDEV WATCHDOG: eth0: transmit timed out
> e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex

Petr, I need you to try something.  Get ethtool 1.8
(sf.net/projects/gkernel) and turn off TSO:

  # ethtool -K eth0 tso off

If you no longer see NETDEV WATCHDOG's, I have a next step.  More on
that later.

-scott

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [2.6.x] e1000: NETDEV WATCHDOG: eth0: transmit timed out
  2004-01-23 22:28 [2.6.x] e1000: NETDEV WATCHDOG: eth0: transmit timed out Feldman, Scott
@ 2004-01-24 19:36 ` Sergey S. Kostyliov
  2004-01-26 10:33 ` Petr Sebor
  1 sibling, 0 replies; 6+ messages in thread
From: Sergey S. Kostyliov @ 2004-01-24 19:36 UTC (permalink / raw)
  To: Feldman, Scott, Petr Sebor, linux-kernel

Hello Scott, Petr,

On Saturday 24 January 2004 01:28, Feldman, Scott wrote:
> > since we have upgraded cabling on our network and transfer
> > speeds increased a little bit, we are experiencing very often
> > situations where the Intel PRO/1000 nics just stop responding
> > and network dies for a while. Local console works, there are
> > no more error messages other than (when the eth0 comes to a
> > life again):
> >
> > NETDEV WATCHDOG: eth0: transmit timed out
> > e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex
>
> Petr, I need you to try something.  Get ethtool 1.8
> (sf.net/projects/gkernel) and turn off TSO:
>
>   # ethtool -K eth0 tso off
>
> If you no longer see NETDEV WATCHDOG's, I have a next step.  More on
> that later.
I have had exactly the same problem with 2.6.{0,1} kernels:
"NETDEV WATCHDOG: eth0: transmit timed out"
where eth0 is:
"03:07.0 Ethernet controller: Intel Corp. 82546EB Gigabit Ethernet Controller (Copper) (rev 01)".
The only difference is that my eth0 is at 100 Mbps Full Duplex.
And yes, in my case this problem was solved by `ethtool -K eth0 tso off`.

>
> -scott

-- 
                   Best regards,
                   Sergey S. Kostyliov <rathamahata@php4.ru>
                   Public PGP key: http://sysadminday.org.ru/rathamahata.asc



* Re: [2.6.x] e1000: NETDEV WATCHDOG: eth0: transmit timed out
  2004-01-23 22:28 [2.6.x] e1000: NETDEV WATCHDOG: eth0: transmit timed out Feldman, Scott
  2004-01-24 19:36 ` Sergey S. Kostyliov
@ 2004-01-26 10:33 ` Petr Sebor
  1 sibling, 0 replies; 6+ messages in thread
From: Petr Sebor @ 2004-01-26 10:33 UTC (permalink / raw)
  To: Feldman, Scott; +Cc: linux-kernel

Feldman, Scott wrote:

>>since we have upgraded cabling on our network and transfer 
>>speeds increased a little bit, we are experiencing very often 
>>situations where the Intel PRO/1000 nics just stop responding 
>>and network dies for a while. Local console works, there are 
>>no more error messages other than (when the eth0 comes to a 
>>life again):
>>
>>NETDEV WATCHDOG: eth0: transmit timed out
>>e1000: eth0 NIC Link is Up 1000 Mbps Full Duplex
>>    
>>
>
>Petr, I need you to try something.  Get ethtool 1.8
>(sf.net/projects/gkernel) and turn off TSO:
>
>  # ethtool -K eth0 tso off
>
>If you no longer see NETDEV WATCHDOG's, I have a next step.  More on
>that later.
>
>-scott
>  
>
Scott,

After a weekend and half a working day (with extra torturing of the
network card), the NETDEV WATCHDOGs are no longer barking with TSO
disabled.

Do you want me to do more testing, or will you tell me what _the_ next
step is? :-)

Regards,
Petr



* Re: [2.6.x] e1000: NETDEV WATCHDOG: eth0: transmit timed out
       [not found] <C6F5CF431189FA4CBAEC9E7DD5441E01036EA9DA@orsmsx402.jf.intel.com>
@ 2004-01-27  0:37 ` Feldman, Scott
  2004-01-27 13:59   ` Petr Sebor
  0 siblings, 1 reply; 6+ messages in thread
From: Feldman, Scott @ 2004-01-27  0:37 UTC (permalink / raw)
  To: Petr Sebor; +Cc: linux-kernel

On Mon, 26 Jan 2004, Petr Sebor wrote:

> after a weekend and half of working day (with extra torturing of the
> network card)
> the NETDEV WATCHDOG's are not barking anymore with the tso's disabled.
> 
> Do you want me to do more testing or will you tell me what _the_ next
> step is ? :-)

Petr, sorry for the suspense.  Here's a patch against 2.6.2-rc2 that fixes 
a race in the Tx path of e1000 that you may be exposing with TSO on.  The 
race is:

Tx queue		Tx clean (interrupt context)

...
if(h/w Q full)         | clean h/w Q
        ...        <---| if(s/w Q stopped) 
        stop s/w Q     |       wake s/w Q


So let's try this patch with TSO back on.

--- linux-2.6.2-rc2/drivers/net/e1000/e1000.h.orig	2004-01-26 15:38:40.000000000 -0800
+++ linux-2.6.2-rc2/drivers/net/e1000/e1000.h	2004-01-26 15:35:37.000000000 -0800
@@ -192,6 +192,7 @@
 
 	/* TX */
 	struct e1000_desc_ring tx_ring;
+	spinlock_t tx_lock;
 	uint32_t txd_cmd;
 	uint32_t tx_int_delay;
 	uint32_t tx_abs_int_delay;
--- linux-2.6.2-rc2/drivers/net/e1000/e1000_main.c.orig	2004-01-26 15:38:33.000000000 -0800
+++ linux-2.6.2-rc2/drivers/net/e1000/e1000_main.c	2004-01-26 15:33:25.000000000 -0800
@@ -669,6 +669,7 @@
 
 	atomic_set(&adapter->irq_sem, 1);
 	spin_lock_init(&adapter->stats_lock);
+	spin_lock_init(&adapter->tx_lock);
 
 	return 0;
 }
@@ -1783,6 +1784,7 @@
 	struct e1000_adapter *adapter = netdev->priv;
 	unsigned int first;
 	unsigned int tx_flags = 0;
+	unsigned long flags;
 	int count;
 
 	if(skb->len <= 0) {
@@ -1790,10 +1792,13 @@
 		return 0;
 	}
 
+	spin_lock_irqsave(&adapter->tx_lock, flags);
+
 	if(adapter->hw.mac_type == e1000_82547) {
 		if(e1000_82547_fifo_workaround(adapter, skb)) {
 			netif_stop_queue(netdev);
 			mod_timer(&adapter->tx_fifo_stall_timer, jiffies);
+			spin_unlock_irqrestore(&adapter->tx_lock, flags);
 			return 1;
 		}
 	}
@@ -1814,11 +1819,14 @@
 		e1000_tx_queue(adapter, count, tx_flags);
 	else {
 		netif_stop_queue(netdev);
+		spin_unlock_irqrestore(&adapter->tx_lock, flags);
 		return 1;
 	}
 
 	netdev->trans_start = jiffies;
 
+	spin_unlock_irqrestore(&adapter->tx_lock, flags);
+	
 	return 0;
 }
 
@@ -2171,6 +2179,8 @@
 	unsigned int i, eop;
 	boolean_t cleaned = FALSE;
 
+	spin_lock(&adapter->tx_lock);
+
 	i = tx_ring->next_to_clean;
 	eop = tx_ring->buffer_info[i].next_to_watch;
 	eop_desc = E1000_TX_DESC(*tx_ring, eop);
@@ -2215,6 +2225,8 @@
 	if(cleaned && netif_queue_stopped(netdev) && netif_carrier_ok(netdev))
 		netif_wake_queue(netdev);
 
+	spin_unlock(&adapter->tx_lock);
+
 	return cleaned;
 }
 



* Re: [2.6.x] e1000: NETDEV WATCHDOG: eth0: transmit timed out
  2004-01-27  0:37 ` Feldman, Scott
@ 2004-01-27 13:59   ` Petr Sebor
  0 siblings, 0 replies; 6+ messages in thread
From: Petr Sebor @ 2004-01-27 13:59 UTC (permalink / raw)
  To: Feldman, Scott; +Cc: linux-kernel

Feldman, Scott wrote:

>Petr, sorry for the suspense.  Here's a patch against 2.6.2-rc2 that fixes 
>a race in the Tx path of e1000 that you may be exposing with TSO on.  The 
>race is:
>
>Tx queue		Tx clean (interrupt context)
>
>...
>if(h/w Q full)         | clean h/w Q
>        ...        <---| if(s/w Q stopped) 
>        stop s/w Q     |       wake s/w Q
>
>
>So let's try this patch with TSO back on.
>  
>
Scott,

Thanks for the patch. Again, 3/4 of a working day with moderate server
load resulted in no WATCHDOG barking with the patched kernel and TSO
turned on. I dare say that this is it! :-)
(A little more testing here won't harm, though.)

If nothing else, the stability of the e1000 has vastly improved.

Thanks a lot!

Regards,
Petr

