[PATCH] [e1000]: Remove unnecessary tx

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] [e1000]: Remove unnecessary tx_lock
@ 2006-08-03 13:15 jamal
  2006-08-03 14:02 ` Herbert Xu
  0 siblings, 1 reply; 85+ messages in thread
From: jamal @ 2006-08-03 13:15 UTC (permalink / raw)
  To: jesse.brandeburg, Auke Kok; +Cc: netdev

[-- Attachment #1: Type: text/plain, Size: 493 bytes --]


There is no need for tx_locking if you are already netif stopped
(transmit path will never be entered).
With this change under high speed forwarding i see anywhere
between 2-4Kpps improvement on a 2 CPU environment with twoo e1000s tied
to different CPUs forwarding between each other. Actually the
performance improvement should be attributed to the use of
TX_WAKE_THRESHOLD - more drivers should use that technique.

cheers,
jamal

signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>

-----


[-- Attachment #2: e1000_wq --]
[-- Type: text/x-patch, Size: 631 bytes --]

diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index da62db8..559e334 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -3517,11 +3517,8 @@ #endif
 #define TX_WAKE_THRESHOLD 32
 	if (unlikely(cleaned && netif_queue_stopped(netdev) &&
 	             netif_carrier_ok(netdev))) {
-		spin_lock(&tx_ring->tx_lock);
-		if (netif_queue_stopped(netdev) &&
-		    (E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD))
+		if (E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD)
 			netif_wake_queue(netdev);
-		spin_unlock(&tx_ring->tx_lock);
 	}
 
 	if (adapter->detect_tx_hung) {

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-03 13:15 [PATCH] [e1000]: Remove unnecessary tx_lock jamal
@ 2006-08-03 14:02 ` Herbert Xu
  2006-08-03 14:24   ` jamal
  2006-08-03 16:36   ` Brandeburg, Jesse
  0 siblings, 2 replies; 85+ messages in thread
From: Herbert Xu @ 2006-08-03 14:02 UTC (permalink / raw)
  To: hadi; +Cc: jesse.brandeburg, auke-jan.h.kok, netdev

jamal <hadi@cyberus.ca> wrote:
> 
> There is no need for tx_locking if you are already netif stopped
> (transmit path will never be entered).
> With this change under high speed forwarding i see anywhere
> between 2-4Kpps improvement on a 2 CPU environment with twoo e1000s tied
> to different CPUs forwarding between each other. Actually the
> performance improvement should be attributed to the use of
> TX_WAKE_THRESHOLD - more drivers should use that technique.

Looks good to me.

Even if we get it wrong and wake up something that we shouldn't have,
the xmit function will simply bail out and stop the queue for us.

Since this is exceedingly unlikely we should drop the locks rather
than bother about it.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-03 14:02 ` Herbert Xu
@ 2006-08-03 14:24   ` jamal
  2006-08-03 16:36   ` Brandeburg, Jesse
  1 sibling, 0 replies; 85+ messages in thread
From: jamal @ 2006-08-03 14:24 UTC (permalink / raw)
  To: Herbert Xu; +Cc: jesse.brandeburg, auke-jan.h.kok, netdev

On Fri, 2006-04-08 at 00:02 +1000, Herbert Xu wrote:

> Even if we get it wrong and wake up something that we shouldn't have,
> the xmit function will simply bail out and stop the queue for us.
> 

Infact thats close to what i was seeing. setup is:
eth0 tied to cpu 0
eth1 tied to cpu1

--> eth0 -- forward --> eth1
--> eth1 -- forward --> eth0

Take the example of eth1:
* It is actually being fed packets by cpu0 but is tied to cpu1.

The stop happens in cpu0 (99.99999% of the time) and the wake happens in
cpu1 (100% of the time)[1]. The wakes on most occasions dont even get to
run tx softirq on cpu1 since there is constant activity already
happening on cpu0. So that lock being grabbed was causing a collision on
the tx path which caused a requeue which caused a reschedule on cpu0 -
all this unnecessarily.

> Since this is exceedingly unlikely we should drop the locks rather
> than bother about it.

I should say that the fact that e1000 does its pruning of descriptors
only on the tx path makes the locking unnecessary when stopped.

cheers,
jamal

[1] As you may know this comes from testing your qdisc_is_running
change. I was hoping to induce a DOS attack such that a cpuX is busy
processing packets from another CPU (eg eth1 on cpu1) so that it forgets
to process its own receive path ;-> This is a theoretical possibility
with that change ;-> Clearly i failed given the % above ;->

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-03 14:02 ` Herbert Xu
  2006-08-03 14:24   ` jamal
@ 2006-08-03 16:36   ` Brandeburg, Jesse
  2006-08-03 18:05     ` Michael Chan
  2006-08-03 22:06     ` jamal
  1 sibling, 2 replies; 85+ messages in thread
From: Brandeburg, Jesse @ 2006-08-03 16:36 UTC (permalink / raw)
  To: Herbert Xu; +Cc: hadi, jesse.brandeburg, auke-jan.h.kok, netdev

On Fri, 4 Aug 2006, Herbert Xu wrote:
> jamal <hadi@cyberus.ca> wrote:
> > 
> > There is no need for tx_locking if you are already netif stopped
> > (transmit path will never be entered).
> > With this change under high speed forwarding i see anywhere
> > between 2-4Kpps improvement on a 2 CPU environment with twoo e1000s tied
> > to different CPUs forwarding between each other. Actually the
> > performance improvement should be attributed to the use of
> > TX_WAKE_THRESHOLD - more drivers should use that technique.
> > 
> > diff --git a/drivers/net/e1000/e1000_main.c 
> > b/drivers/net/e1000/e1000_main.c
> > index da62db8..559e334 100644
> > --- a/drivers/net/e1000/e1000_main.c
> > +++ b/drivers/net/e1000/e1000_main.c
> > @@ -3517,11 +3517,8 @@ #endif
> >  #define TX_WAKE_THRESHOLD 32
> >  	if (unlikely(cleaned && netif_queue_stopped(netdev) &&
> >  	             netif_carrier_ok(netdev))) {
> > -		spin_lock(&tx_ring->tx_lock);
> > -		if (netif_queue_stopped(netdev) &&
> > -		    (E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD))
> > +		if (E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD)
> >  			netif_wake_queue(netdev);
> > -		spin_unlock(&tx_ring->tx_lock);
> >  	}
> >  
> >  	if (adapter->detect_tx_hung) {
> 
> Looks good to me.

jamal, thanks for the investigation, but this doesn't look so good to me.

> Even if we get it wrong and wake up something that we shouldn't have,
> the xmit function will simply bail out and stop the queue for us.

I followed the example of tg3 when attempting to optimize this code.  For 
the normal case we remove a lock acquisition.  Jamals case is not normal. 
:-)

we specifically added this lock originally to fix a problem we saw where 
the netif_stop and netif_start would race, and we would end up with a 
queue that was stopped, and no way to restart it because we didn't have 
any more TX packets to clean up (even if we DID get an interrupt from a 
receive)

> Since this is exceedingly unlikely we should drop the locks rather
> than bother about it.

I think that would be nice, but I still think the current driver solution 
is a good stable solution that has made it through several rounds of 
testing here, not to mention is in the tg3.c code.  Unless someone can 
come up with a way to avoid the race between start and stop *inside* the 
start and stop code (which would be ideal) then I think we have to have a 
solution like this in the driver.

Jesse

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-03 16:36   ` Brandeburg, Jesse
@ 2006-08-03 18:05     ` Michael Chan
  2006-08-03 22:08       ` jamal
  2006-08-03 22:06     ` jamal
  1 sibling, 1 reply; 85+ messages in thread
From: Michael Chan @ 2006-08-03 18:05 UTC (permalink / raw)
  To: Brandeburg, Jesse; +Cc: Herbert Xu, hadi, auke-jan.h.kok, netdev

On Thu, 2006-08-03 at 09:36 -0700, Brandeburg, Jesse wrote:

> I followed the example of tg3 when attempting to optimize this code.  For 
> the normal case we remove a lock acquisition.  Jamals case is not normal. 
> :-)
> 
> we specifically added this lock originally to fix a problem we saw where 
> the netif_stop and netif_start would race, and we would end up with a 
> queue that was stopped, and no way to restart it because we didn't have 
> any more TX packets to clean up (even if we DID get an interrupt from a 
> receive)
> 
Yep, I agree that the lock is necessary.  The reason is that
hard_start_xmit and xmit_completion can be running concurrently and they
can miss each other and cause the tx ring to be stopped forever.

In the case of tg3's hard_start_xmit, after stopping the queue, we need
to check if we should wake the queue right away in case xmit_completion
just finished cleaning the tx ring and just missed the queue_stopped
condition.  Because the netif_wake_queue can be called in 2 places, you
need the lock.  Without the lock, the queue can be waken up at the wrong
time and may cause hard_start_xmit to be called with an empty tx ring.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-03 16:36   ` Brandeburg, Jesse
  2006-08-03 18:05     ` Michael Chan
@ 2006-08-03 22:06     ` jamal
  1 sibling, 0 replies; 85+ messages in thread
From: jamal @ 2006-08-03 22:06 UTC (permalink / raw)
  To: Brandeburg, Jesse; +Cc: Herbert Xu, auke-jan.h.kok, netdev

Jesse,

On Thu, 2006-03-08 at 09:36 -0700, Brandeburg, Jesse wrote:
> On Fri, 4 Aug 2006, Herbert Xu wrote:
> > jamal <hadi@cyberus.ca> wrote:
> > > 
> > > There is no need for tx_locking if you are already netif stopped
> > > (transmit path will never be entered).
> > > With this change under high speed forwarding i see anywhere

> > Even if we get it wrong and wake up something that we shouldn't have,
> > the xmit function will simply bail out and stop the queue for us.
> 
> I followed the example of tg3 when attempting to optimize this code.

Well, the tg3 does try to netif_wake on the tx path as well; it would
need to hold the lock;-> The e1000 doesnt. 

>   For 
> the normal case we remove a lock acquisition.  Jamals case is not normal. 
> :-)

I didnt see the need to hold the lock but i may be missing the "normal"
case you refer to: I did the worst possible scenario and had transmit
path running concurently on one CPU while receive was running on another
(refer to the earlier explanation i posted).

The logic is you stop, the tx path will never be entered. If it is never
entered, you will never have to protect on the rx path. 

> we specifically added this lock originally to fix a problem we saw where 
> the netif_stop and netif_start would race, and we would end up with a 
> queue that was stopped, and no way to restart it because we didn't have 
> any more TX packets to clean up (even if we DID get an interrupt from a 
> receive)
> 

Of course if you call start after you stopped and there are still
packets sitting on the tx ring or qdisc, they will stay forever until
the next packet arrival. But if you do that you have a buggy driver.
But i dont see where you call start in the e1000 - was this some old
code?

> > Since this is exceedingly unlikely we should drop the locks rather
> > than bother about it.
> 
> I think that would be nice, but I still think the current driver solution 
> is a good stable solution that has made it through several rounds of 
> testing here, not to mention is in the tg3.c code.  Unless someone can 
> come up with a way to avoid the race between start and stop *inside* the 
> start and stop code (which would be ideal) then I think we have to have a 
> solution like this in the driver.

You should not call start if you are stopped. Is there some peripheral
code that does start?

cheers,
jamal

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-03 18:05     ` Michael Chan
@ 2006-08-03 22:08       ` jamal
  2006-08-04  0:09         ` Michael Chan
  0 siblings, 1 reply; 85+ messages in thread
From: jamal @ 2006-08-03 22:08 UTC (permalink / raw)
  To: Michael Chan; +Cc: Brandeburg, Jesse, Herbert Xu, auke-jan.h.kok, netdev

On Thu, 2006-03-08 at 11:05 -0700, Michael Chan wrote:
> On Thu, 2006-08-03 at 09:36 -0700, Brandeburg, Jesse wrote:
> 
> > I followed the example of tg3 when attempting to optimize this code.  For 
> > the normal case we remove a lock acquisition.  Jamals case is not normal. 
> > :-)
> > 
> > we specifically added this lock originally to fix a problem we saw where 
> > the netif_stop and netif_start would race, and we would end up with a 
> > queue that was stopped, and no way to restart it because we didn't have 
> > any more TX packets to clean up (even if we DID get an interrupt from a 
> > receive)
> > 
> Yep, I agree that the lock is necessary.  The reason is that
> hard_start_xmit and xmit_completion can be running concurrently and they
> can miss each other and cause the tx ring to be stopped forever.
> 

Logic is you are stopped, the tx path can _never_ be entered by the
core. If it is you have a bug somewhere else.
Tx and rx do run concurently but not when you are stopped.

> In the case of tg3's hard_start_xmit, after stopping the queue, we need
> to check if we should wake the queue right away in case xmit_completion
> just finished cleaning the tx ring and just missed the queue_stopped
> condition. 

Is this to protect against some hardware bug?

>  Because the netif_wake_queue can be called in 2 places, you
> need the lock.  Without the lock, the queue can be waken up at the wrong
> time and may cause hard_start_xmit to be called with an empty tx ring.
> 

Yes in your case you need to hold the lock but not in the e1000 case.
I dont understand though why you need to check for wake in the tx path.

cheers,
jamal


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-03 22:08       ` jamal
@ 2006-08-04  0:09         ` Michael Chan
  2006-08-04  1:10           ` Herbert Xu
  2006-08-04  1:16           ` jamal
  0 siblings, 2 replies; 85+ messages in thread
From: Michael Chan @ 2006-08-04  0:09 UTC (permalink / raw)
  To: hadi; +Cc: Brandeburg, Jesse, Herbert Xu, auke-jan.h.kok, netdev

On Thu, 2006-08-03 at 18:08 -0400, jamal wrote:

> 
> Yes in your case you need to hold the lock but not in the e1000 case.
> I dont understand though why you need to check for wake in the tx path.
> 

tg3_start_xmit()

	if (tx_ring_empty)
				<-  tg3_tx()
		netif_stop_queue()


At the arrow, tg3_tx() can come in and clean up the entire ring but will
not see that the queue is stopped and therefore will not call
netif_wake_queue().  As a result, the tx_queue is stopped forever.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-04  0:09         ` Michael Chan
@ 2006-08-04  1:10           ` Herbert Xu
  2006-08-04  8:37             ` Herbert Xu
  2006-08-04  1:16           ` jamal
  1 sibling, 1 reply; 85+ messages in thread
From: Herbert Xu @ 2006-08-04  1:10 UTC (permalink / raw)
  To: Michael Chan; +Cc: hadi, Brandeburg, Jesse, auke-jan.h.kok, netdev

On Thu, Aug 03, 2006 at 05:09:57PM -0700, Michael Chan wrote:
> 
> tg3_start_xmit()
> 
> 	if (tx_ring_empty)
> 				<-  tg3_tx()
> 		netif_stop_queue()

Good point.  I missed that.  However, this only goes to show that
even the current code (e1000 that is, I haven't checked tg3) is
also unsafe.

Why? Because the tx_lock is only taken in e1000_clean_tx_irq if
the queue is stopped.  As you can see in your diagram above, the
queue could well be awake at the point where e1000_clean_tx_irq
runs.

So we need to fix e1000 anyway.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-04  0:09         ` Michael Chan
  2006-08-04  1:10           ` Herbert Xu
@ 2006-08-04  1:16           ` jamal
  2006-08-04  1:18             ` Herbert Xu
  2006-08-04  4:06             ` Michael Chan
  1 sibling, 2 replies; 85+ messages in thread
From: jamal @ 2006-08-04  1:16 UTC (permalink / raw)
  To: Michael Chan; +Cc: Brandeburg, Jesse, Herbert Xu, auke-jan.h.kok, netdev

On Thu, 2006-03-08 at 17:09 -0700, Michael Chan wrote:
> On Thu, 2006-08-03 at 18:08 -0400, jamal wrote:
> 
> > 
> > Yes in your case you need to hold the lock but not in the e1000 case.
> > I dont understand though why you need to check for wake in the tx path.
> > 
> 
> tg3_start_xmit()
> 
> 	if (tx_ring_empty)
> 				<-  tg3_tx()
> 		netif_stop_queue()
> 
> 
> At the arrow, tg3_tx() can come in and clean up the entire ring but will
> not see that the queue is stopped and therefore will not call
> netif_wake_queue().  As a result, the tx_queue is stopped forever.
> 

Indeed - if you do it that way. 
What about (actually e1000 is not exactly like this but closer than tg3
is):

    tg3_start_xmit():
    
      prepare skb                       
      grab ring tx lock irq safe
      put skb on ring
      if (full-ring)
          stop
      release ring tx lock and restore flags blah 

and in tg3_tx():

     if likely(stopped) {  
       // essentially no lock needed here ....
       prune tx ring
       if prunned > threshold
           wake
     } else if not stopped { 
          // need a lock here ...
         grab ring tx lock irq safe
         prune tx ring
         release ring tx lock
     }

The above is safe to run tg3_start_xmit on one CPU and tg3_tx an
another.

And you can then take advantage of asynchronous wakes and remove the
paranoia checks you have on tx.

cheers,
jamal


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-04  1:16           ` jamal
@ 2006-08-04  1:18             ` Herbert Xu
  2006-08-04  1:25               ` jamal
  2006-08-04  4:06             ` Michael Chan
  1 sibling, 1 reply; 85+ messages in thread
From: Herbert Xu @ 2006-08-04  1:18 UTC (permalink / raw)
  To: jamal; +Cc: Michael Chan, Brandeburg, Jesse, auke-jan.h.kok, netdev

On Thu, Aug 03, 2006 at 09:16:04PM -0400, jamal wrote:
> 
> and in tg3_tx():
> 
>      if likely(stopped) {  
>        // essentially no lock needed here ....
>        prune tx ring
>        if prunned > threshold
>            wake
>      } else if not stopped { 
>           // need a lock here ...
>          grab ring tx lock irq safe
>          prune tx ring
>          release ring tx lock
>      }

The problem is that the !stopped case may be the common path for a
lot of workloads so it'd be good to avoid taking the lock there if
we can.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-04  1:18             ` Herbert Xu
@ 2006-08-04  1:25               ` jamal
  0 siblings, 0 replies; 85+ messages in thread
From: jamal @ 2006-08-04  1:25 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Michael Chan, Brandeburg, Jesse, auke-jan.h.kok, netdev

On Fri, 2006-04-08 at 11:18 +1000, Herbert Xu wrote:
> On Thu, Aug 03, 2006 at 09:16:04PM -0400, jamal wrote:
> > 
> > and in tg3_tx():
> > 
> >      if likely(stopped) {  
> >        // essentially no lock needed here ....
> >        prune tx ring
> >        if prunned > threshold
> >            wake
> >      } else if not stopped { 
> >           // need a lock here ...
> >          grab ring tx lock irq safe
> >          prune tx ring
> >          release ring tx lock
> >      }
> 
> The problem is that the !stopped case may be the common path for a
> lot of workloads so it'd be good to avoid taking the lock there if
> we can.

True.

Although i cant think of a clever way to do the inverse....
The heuristic i was using to avoid the lock on tx at pruning time was
knowledge that tx was not being entered at all on stop.

cheers,
jamal


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-04  1:16           ` jamal
  2006-08-04  1:18             ` Herbert Xu
@ 2006-08-04  4:06             ` Michael Chan
  1 sibling, 0 replies; 85+ messages in thread
From: Michael Chan @ 2006-08-04  4:06 UTC (permalink / raw)
  To: hadi; +Cc: Brandeburg, Jesse, Herbert Xu, auke-jan.h.kok, netdev

Jamal wrote:

>     tg3_start_xmit():
>     
>       prepare skb                       
>       grab ring tx lock irq safe
>       put skb on ring
>       if (full-ring)
>           stop
>       release ring tx lock and restore flags blah 

The whole point of doing things the way it is done in tg3
is to avoid taking the lock unless there is a need to stop
the queue or wake the queue.

With a 511 entry send ring in tg3, filling the entire tx
ring should not happen all the time.  Even if we fill the
tx ring constantly, the hysteresis effect of waking the
queue only after the tx ring is 3/4 empty will make it
uncommon to do stop_queue and wake_queue.  So it is the
right thing to do to optimize for the case where the
ring is not full.  The penalty is that we need some extra
checking after stopping the queue which should be the
uncommon case.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-04  1:10           ` Herbert Xu
@ 2006-08-04  8:37             ` Herbert Xu
  2006-08-04 10:10               ` Herbert Xu
  0 siblings, 1 reply; 85+ messages in thread
From: Herbert Xu @ 2006-08-04  8:37 UTC (permalink / raw)
  To: Michael Chan; +Cc: hadi, Brandeburg, Jesse, auke-jan.h.kok, netdev

On Fri, Aug 04, 2006 at 11:10:27AM +1000, herbert wrote:
> 
> So we need to fix e1000 anyway.

Here is a fix that also gets rid of the tx_lock on the wake up side.
Please scrutinise this as much as you can.  Jamal, please run this
through your performance test to see if it has a similar effect as
your original patch.

A similar patch is needed for tg3.

[NETDRV] e1000: Fix potential tx wake up race

Jamal posted a patch which improved e1000 TX performance.  That in
turn spawned a discussion on potential tx wake up races caused by
the change.  During that discussion, Michael Chan posted the following
scenario (redrawn by me for e1000):

CPU0					CPU1
e1000_xmit_frame
	tx_lock
	if (tx_ring_empty)
					e1000_clean_tx_irq
						if (netif_queue_stopped)
							tx_lock
							netif_wake_queue
							tx_unlock
		netif_stop_queue
	tx_unlock

As you can see, even with the locks around netif_wake_queue this still
allows the race to occur because the netif_wake_queue code path is
conditional on netif_queue_stopped.

So here is a patch that uses the memory barriers rather than spin locks
to solve the problem.

First of all I've made netif_wake_queue independent of netif_queue_stopped.
This means that we operate on the stopped bit only once which closes a race
window.

Furthermore, netif_wake_queue itself is already a memory barrier because
it does a test_and_clear_bit.  This means that if the other side can
observe the netif_wake_queue then it can observe the extra room we've
made on the
queue.

On the e1000_xmit_frame side, I've changed netif_stop_queue into a new
function netif_test_and_stop_queue which does test_and_set_bit.  This
means that it also is a full memory barrier.

In order to close the race then, we simply recheck the tx queue size
if netif_test_and_stop_queue returns 0 indicating that the queue was
previously started.  There are two possibilities:

1) netif_wake_queue is about to occur.  In this case it doesn't really
matter what we do as netif_wake_queue will always wake the queue for us.
In the worst case, it'll wake up the queue with too little room but that's
really uncommon and we can cope with that anyway.

2) netif_wake_queue has just occured.  Because it is a memory barrier,
when we recheck the queue size we're guaranteed to see the new indices
which means that we'll see the extra room it has made for us.  As a
result we'll simply restart the queue.

The one wart in the patch is that we now have an unconditional atomic
operation in e1000_clean_tx_irq (netif_wake_queue).  If anyone can
think of a way to get rid of it without opening up the race, please
let us know.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 627f224..a22c01f 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -2861,6 +2861,34 @@ e1000_transfer_dhcp_info(struct e1000_ad
 	return 0;
 }

+static int __e1000_maybe_stop_tx(struct net_device *netdev, int size)
+{
+	struct e1000_adapter *adapter = netdev_priv(netdev);
+	struct e1000_tx_ring *tx_ring = adapter->tx_ring;
+
+	if (unlikely(netif_test_and_stop_queue(netdev)))
+		return -EBUSY;
+
+	/* netif_test_and_stop_queue acts as a memory barrier.
+	 * We need to check again in a case another CPU has just
+	 * made room available.
+	 */
+	if (likely(E1000_DESC_UNUSED(tx_ring) < size))
+		return -EBUSY;
+
+	/* A reprieve! */
+	netif_start_queue(netdev);
+	return 0;
+}
+
+static inline int e1000_maybe_stop_tx(struct net_device *netdev,
+				      struct e1000_tx_ring *tx_ring, int size)
+{
+	if (likely(E1000_DESC_UNUSED(tx_ring) >= size))
+		return 0;
+	return __e1000_maybe_stop_tx(netdev, size);
+}
+
 #define TXD_USE_COUNT(S, X) (((S) >> (X)) + 1 )
 static int
 e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
@@ -2974,8 +3002,7 @@ #endif

 	/* need: count + 2 desc gap to keep tail from touching
 	 * head, otherwise try next time */
-	if (unlikely(E1000_DESC_UNUSED(tx_ring) < count + 2)) {
-		netif_stop_queue(netdev);
+	if (unlikely(e1000_maybe_stop_tx(netdev, tx_ring, count + 2))) {
 		spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
 		return NETDEV_TX_BUSY;
 	}
@@ -3022,8 +3049,7 @@ #endif
 	netdev->trans_start = jiffies;

 	/* Make sure there is space in the ring for the next send. */
-	if (unlikely(E1000_DESC_UNUSED(tx_ring) < MAX_SKB_FRAGS + 2))
-		netif_stop_queue(netdev);
+	e1000_maybe_stop_tx(netdev, tx_ring, MAX_SKB_FRAGS + 2);

 	spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
 	return NETDEV_TX_OK;
@@ -3515,14 +3541,9 @@ #endif
 	tx_ring->next_to_clean = i;

 #define TX_WAKE_THRESHOLD 32
-	if (unlikely(cleaned && netif_queue_stopped(netdev) &&
-	             netif_carrier_ok(netdev))) {
-		spin_lock(&tx_ring->tx_lock);
-		if (netif_queue_stopped(netdev) &&
-		    (E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD))
-			netif_wake_queue(netdev);
-		spin_unlock(&tx_ring->tx_lock);
-	}
+	if (unlikely(cleaned && netif_carrier_ok(netdev) &&
+		     E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD))
+		netif_wake_queue(netdev);

 	if (adapter->detect_tx_hung) {
 		/* Detect a transmit hang in hardware, this serializes the
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 75f02d8..b8714b4 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -652,6 +652,15 @@ #endif
 	set_bit(__LINK_STATE_XOFF, &dev->state);
 }

+static inline int netif_test_and_stop_queue(struct net_device *dev)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+	if (netpoll_trap())
+		return;
+#endif
+	return test_and_set_bit(__LINK_STATE_XOFF, &dev->state);
+}
+
 static inline int netif_queue_stopped(const struct net_device *dev)
 {
 	return test_bit(__LINK_STATE_XOFF, &dev->state);

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-04  8:37             ` Herbert Xu
@ 2006-08-04 10:10               ` Herbert Xu
  2006-08-04 10:16                 ` jamal
                                   ` (2 more replies)
  0 siblings, 3 replies; 85+ messages in thread
From: Herbert Xu @ 2006-08-04 10:10 UTC (permalink / raw)
  To: Michael Chan
  Cc: hadi, Brandeburg, Jesse, auke-jan.h.kok, netdev,
	Stephen Hemminger

On Fri, Aug 04, 2006 at 06:37:34PM +1000, herbert wrote:
> 
> [NETDRV] e1000: Fix potential tx wake up race

Oops, I completely looked through netpoll_trap.  Hmm, someone should
really move the ifdef into netpoll_trap.  Stephen?

BTW, who uses netpoll_set_trap anyway? There seems to be no user in
the kernel tree.  Where's Adrian Bunk :)

[NETDRV] e1000: Fix potential tx wake up race

Jamal posted a patch which improved e1000 TX performance.  That in
turn spawned a discussion on potential tx wake up races caused by
the change.  During that discussion, Michael Chan posted the following
scenario (redrawn by me for e1000):

CPU0					CPU1
e1000_xmit_frame
	tx_lock
	if (tx_ring_empty)
					e1000_clean_tx_irq
						if (netif_queue_stopped)
							tx_lock
							netif_wake_queue
							tx_unlock
		netif_stop_queue
	tx_unlock

As you can see, even with the locks around netif_wake_queue this still
allows the race to occur because the netif_wake_queue code path is
conditional on netif_queue_stopped.

So here is a patch that uses the memory barriers rather than spin locks
to solve the problem.

First of all I've made netif_wake_queue independent of netif_queue_stopped.
This means that we operate on the stopped bit only once which closes a race
window.

Furthermore, netif_wake_queue itself is already a memory barrier because
it does a test_and_clear_bit.  This means that if the other side can
observe the netif_wake_queue then it can observe the extra room we've
made on the
queue.

On the e1000_xmit_frame side, I've changed netif_stop_queue into a new
function netif_test_and_stop_queue which does test_and_set_bit.  This
means that it also is a full memory barrier.

In order to close the race then, we simply recheck the tx queue size
if netif_test_and_stop_queue returns 0 indicating that the queue was
previously started.  There are two possibilities:

1) netif_wake_queue is about to occur.  In this case it doesn't really
matter what we do as netif_wake_queue will always wake the queue for us.
In the worst case, it'll wake up the queue with too little room but that's
really uncommon and we can cope with that anyway.

2) netif_wake_queue has just occured.  Because it is a memory barrier,
when we recheck the queue size we're guaranteed to see the new indices
which means that we'll see the extra room it has made for us.  As a
result we'll simply restart the queue.

The one wart in the patch is that we now have an unconditional atomic
operation in e1000_clean_tx_irq (netif_wake_queue).  If anyone can
think of a way to get rid of it without opening up the race, please
let us know.

Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 627f224..a22c01f 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -2861,6 +2861,34 @@ e1000_transfer_dhcp_info(struct e1000_ad
 	return 0;
 }

+static int __e1000_maybe_stop_tx(struct net_device *netdev, int size)
+{
+	struct e1000_adapter *adapter = netdev_priv(netdev);
+	struct e1000_tx_ring *tx_ring = adapter->tx_ring;
+
+	if (unlikely(netif_test_and_stop_queue(netdev)))
+		return -EBUSY;
+
+	/* netif_test_and_stop_queue acts as a memory barrier.
+	 * We need to check again in a case another CPU has just
+	 * made room available.
+	 */
+	if (likely(E1000_DESC_UNUSED(tx_ring) < size))
+		return -EBUSY;
+
+	/* A reprieve! */
+	netif_start_queue(netdev);
+	return 0;
+}
+
+static inline int e1000_maybe_stop_tx(struct net_device *netdev,
+				      struct e1000_tx_ring *tx_ring, int size)
+{
+	if (likely(E1000_DESC_UNUSED(tx_ring) >= size))
+		return 0;
+	return __e1000_maybe_stop_tx(netdev, size);
+}
+
 #define TXD_USE_COUNT(S, X) (((S) >> (X)) + 1 )
 static int
 e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
@@ -2974,8 +3002,7 @@ #endif

 	/* need: count + 2 desc gap to keep tail from touching
 	 * head, otherwise try next time */
-	if (unlikely(E1000_DESC_UNUSED(tx_ring) < count + 2)) {
-		netif_stop_queue(netdev);
+	if (unlikely(e1000_maybe_stop_tx(netdev, tx_ring, count + 2))) {
 		spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
 		return NETDEV_TX_BUSY;
 	}
@@ -3022,8 +3049,7 @@ #endif
 	netdev->trans_start = jiffies;

 	/* Make sure there is space in the ring for the next send. */
-	if (unlikely(E1000_DESC_UNUSED(tx_ring) < MAX_SKB_FRAGS + 2))
-		netif_stop_queue(netdev);
+	e1000_maybe_stop_tx(netdev, tx_ring, MAX_SKB_FRAGS + 2);

 	spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
 	return NETDEV_TX_OK;
@@ -3515,14 +3541,9 @@ #endif
 	tx_ring->next_to_clean = i;

 #define TX_WAKE_THRESHOLD 32
-	if (unlikely(cleaned && netif_queue_stopped(netdev) &&
-	             netif_carrier_ok(netdev))) {
-		spin_lock(&tx_ring->tx_lock);
-		if (netif_queue_stopped(netdev) &&
-		    (E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD))
-			netif_wake_queue(netdev);
-		spin_unlock(&tx_ring->tx_lock);
-	}
+	if (unlikely(cleaned && netif_carrier_ok(netdev) &&
+		     E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD))
+		netif_wake_queue(netdev);

 	if (adapter->detect_tx_hung) {
 		/* Detect a transmit hang in hardware, this serializes the
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 75f02d8..c4b0de0 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -657,6 +657,15 @@ static inline int netif_queue_stopped(co
 	return test_bit(__LINK_STATE_XOFF, &dev->state);
 }

+static inline int netif_test_and_stop_queue(struct net_device *dev)
+{
+#ifdef CONFIG_NETPOLL_TRAP
+	if (netpoll_trap())
+		return netif_queue_stopped(dev);
+#endif
+	return test_and_set_bit(__LINK_STATE_XOFF, &dev->state);
+}
+
 static inline int netif_running(const struct net_device *dev)
 {
 	return test_bit(__LINK_STATE_START, &dev->state);

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-04 10:10               ` Herbert Xu
@ 2006-08-04 10:16                 ` jamal
  2006-08-04 10:25                   ` Herbert Xu
  2006-08-05 16:45                   ` jamal
  2006-08-04 17:12                 ` Stephen Hemminger
  2006-08-04 17:28                 ` Michael Chan
  2 siblings, 2 replies; 85+ messages in thread
From: jamal @ 2006-08-04 10:16 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Michael Chan, Brandeburg, Jesse, auke-jan.h.kok, netdev,
	Stephen Hemminger

On Fri, 2006-04-08 at 20:10 +1000, Herbert Xu wrote:
> On Fri, Aug 04, 2006 at 06:37:34PM +1000, herbert wrote:
> > 
> > [NETDRV] e1000: Fix potential tx wake up race
> 
> Oops, I completely looked through netpoll_trap.  Hmm, someone should
> really move the ifdef into netpoll_trap.  Stephen?
> 
> BTW, who uses netpoll_set_trap anyway? There seems to be no user in
> the kernel tree.  Where's Adrian Bunk :)

You sure you wanna wake him up? ;->

I still have my setup intact, so i will give this a whirl in about 3-4
hours. It does look promising - an atomic should be a lot cheaper than a
lock. Thanks Herbert.

cheers,
jamal



^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-04 10:16                 ` jamal
@ 2006-08-04 10:25                   ` Herbert Xu
  2006-08-04 10:45                     ` jamal
  2006-08-05 23:04                     ` jamal
  2006-08-05 16:45                   ` jamal
  1 sibling, 2 replies; 85+ messages in thread
From: Herbert Xu @ 2006-08-04 10:25 UTC (permalink / raw)
  To: jamal
  Cc: Michael Chan, Brandeburg, Jesse, auke-jan.h.kok, netdev,
	Stephen Hemminger

On Fri, Aug 04, 2006 at 06:16:03AM -0400, jamal wrote:
> 
> I still have my setup intact, so i will give this a whirl in about 3-4
> hours. It does look promising - an atomic should be a lot cheaper than a
> lock. Thanks Herbert.

Well it's not that good since the current code only takes the
lock in exceptional circumstances while my patch does the atomic
op unconditionally.

But then again the current code is wrong :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-04 10:25                   ` Herbert Xu
@ 2006-08-04 10:45                     ` jamal
  2006-08-05 23:04                     ` jamal
  1 sibling, 0 replies; 85+ messages in thread
From: jamal @ 2006-08-04 10:45 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Michael Chan, Brandeburg, Jesse, auke-jan.h.kok, netdev,
	Stephen Hemminger

On Fri, 2006-04-08 at 20:25 +1000, Herbert Xu wrote:

> Well it's not that good since the current code only takes the
> lock in exceptional circumstances while my patch does the atomic
> op unconditionally.

yes, it would be nice to fix that. However, from a selfish angle, my
worries are more about the overload case ;-> logically it is more likely
you have a lot less CPU resources when overloaded than in the common
case; actually now i have said "common" 100 times i understand Jesse
better ;->. 
 
> But then again the current code is wrong :)

indeed.

cheers,
jamal


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-04 10:10               ` Herbert Xu
  2006-08-04 10:16                 ` jamal
@ 2006-08-04 17:12                 ` Stephen Hemminger
  2006-08-04 17:28                 ` Michael Chan
  2 siblings, 0 replies; 85+ messages in thread
From: Stephen Hemminger @ 2006-08-04 17:12 UTC (permalink / raw)
  To: Herbert Xu; +Cc: Michael Chan, hadi, Brandeburg, Jesse, auke-jan.h.kok, netdev

On Fri, 4 Aug 2006 20:10:17 +1000
Herbert Xu <herbert@gondor.apana.org.au> wrote:

> On Fri, Aug 04, 2006 at 06:37:34PM +1000, herbert wrote:
> > 
> > [NETDRV] e1000: Fix potential tx wake up race
> 
> Oops, I completely looked through netpoll_trap.  Hmm, someone should
> really move the ifdef into netpoll_trap.  Stephen?
> 
> BTW, who uses netpoll_set_trap anyway? There seems to be no user in
> the kernel tree.  Where's Adrian Bunk :)

I don't use netconsole. Jeff Moyer seems to be the main instigator of this.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-04 10:10               ` Herbert Xu
  2006-08-04 10:16                 ` jamal
  2006-08-04 17:12                 ` Stephen Hemminger
@ 2006-08-04 17:28                 ` Michael Chan
  2006-08-04 18:08                   ` Stephen Hemminger
  2 siblings, 1 reply; 85+ messages in thread
From: Michael Chan @ 2006-08-04 17:28 UTC (permalink / raw)
  To: Herbert Xu, davem
  Cc: hadi, Brandeburg, Jesse, auke-jan.h.kok, netdev,
	Stephen Hemminger

On Fri, 2006-08-04 at 20:10 +1000, Herbert Xu wrote:

> 1) netif_wake_queue is about to occur.  In this case it doesn't really
> matter what we do as netif_wake_queue will always wake the queue for us.
> In the worst case, it'll wake up the queue with too little room but that's
> really uncommon and we can cope with that anyway.
> 
In a lot of other drivers including tg3, waking the tx queue when there
is not enough room in the tx ring is considered a bug and will trigger a
BUG message:

BUG! Tx Ring full when queue awake!

May be David can explain why it is undesirable to wake up the queue when
it is full?


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-04 17:28                 ` Michael Chan
@ 2006-08-04 18:08                   ` Stephen Hemminger
  2006-08-04 23:31                     ` David Miller
  0 siblings, 1 reply; 85+ messages in thread
From: Stephen Hemminger @ 2006-08-04 18:08 UTC (permalink / raw)
  To: Michael Chan
  Cc: Herbert Xu, davem, hadi, Brandeburg, Jesse, auke-jan.h.kok,
	netdev

On Fri, 04 Aug 2006 10:28:52 -0700
"Michael Chan" <mchan@broadcom.com> wrote:

> On Fri, 2006-08-04 at 20:10 +1000, Herbert Xu wrote:
> 
> > 1) netif_wake_queue is about to occur.  In this case it doesn't really
> > matter what we do as netif_wake_queue will always wake the queue for us.
> > In the worst case, it'll wake up the queue with too little room but that's
> > really uncommon and we can cope with that anyway.
> > 
> In a lot of other drivers including tg3, waking the tx queue when there
> is not enough room in the tx ring is considered a bug and will trigger a
> BUG message:
> 
> BUG! Tx Ring full when queue awake!
> 
> May be David can explain why it is undesirable to wake up the queue when
> it is full?

Because it means that there was a race hole in the normal sleep/wakeup code.
Some drivers just spin (see acenic.c) but that's ugly.

The real risk is that the transmit routine will decide it's busy, but
at the same time an IRQ on another CPU will clean up the whole ring
and make the queue available again. Then the transmit routine exits and
says its busy, and the higher level will decide it is permanently stopped.

-- 
Stephen Hemminger <shemminger@osdl.org>
"And in the Packet there writ down that doome"

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-04 18:08                   ` Stephen Hemminger
@ 2006-08-04 23:31                     ` David Miller
  2006-08-05 16:56                       ` jamal
  2006-08-08 17:04                       ` Benjamin LaHaise
  0 siblings, 2 replies; 85+ messages in thread
From: David Miller @ 2006-08-04 23:31 UTC (permalink / raw)
  To: shemminger; +Cc: mchan, herbert, hadi, jesse.brandeburg, auke-jan.h.kok, netdev

From: Stephen Hemminger <shemminger@osdl.org>
Date: Fri, 4 Aug 2006 11:08:29 -0700

> On Fri, 04 Aug 2006 10:28:52 -0700
> "Michael Chan" <mchan@broadcom.com> wrote:
> 
> > May be David can explain why it is undesirable to wake up the queue when
> > it is full?

Why does Michael ask all the difficult questions? :-)

> Because it means that there was a race hole in the normal sleep/wakeup code.
> Some drivers just spin (see acenic.c) but that's ugly.
> 
> The real risk is that the transmit routine will decide it's busy, but
> at the same time an IRQ on another CPU will clean up the whole ring
> and make the queue available again. Then the transmit routine exits and
> says its busy, and the higher level will decide it is permanently stopped.

Yes, it's meant to catch unintented races.

The queueing layer that calls ->hard_start_xmit() technically has no
need to support NETDEV_TX_BUSY as a return value, since the device
is able to prevent this.

If we could avoid NETDEV_TX_BUSY et al. from happening, the idea is
that we could eliminate all of the requeueing logic in the packet
scheduler layer.  It is a pipe dream from many years ago.

But this may never be realized because now we have things like LLTX
which can return NETDEV_TX_LOCKED which also requires a requeue.
Also, several specialized devices can return NETDEV_TX_BUSY in ways
which are hard to prevent.  I'll reproduce Alexey's comments from
qdisc_restart():

		/* Device kicked us out :(
		   This is possible in three cases:

		   0. driver is locked
		   1. fastroute is enabled
		   2. device cannot determine busy state
		      before start of transmission (f.e. dialout)
		   3. device is buggy (ppp)
		 */

Hmmm, another remnant reference to fastroute which we should delete :-)

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-04 10:16                 ` jamal
  2006-08-04 10:25                   ` Herbert Xu
@ 2006-08-05 16:45                   ` jamal
  1 sibling, 0 replies; 85+ messages in thread
From: jamal @ 2006-08-05 16:45 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Michael Chan, Brandeburg, Jesse, auke-jan.h.kok, netdev,
	Stephen Hemminger

On Fri, 2006-04-08 at 06:16 -0400, jamal wrote:

> I still have my setup intact, so i will give this a whirl in about 3-4
> hours. 

Sorry, I had a really shitty work day i didnt get around to it. I have 3
days off so will get it done either later today or tomorrow.


cheers,
jamal


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-04 23:31                     ` David Miller
@ 2006-08-05 16:56                       ` jamal
  2006-08-05 23:05                         ` Herbert Xu
  2006-08-08 17:04                       ` Benjamin LaHaise
  1 sibling, 1 reply; 85+ messages in thread
From: jamal @ 2006-08-05 16:56 UTC (permalink / raw)
  To: David Miller
  Cc: shemminger, mchan, herbert, jesse.brandeburg, auke-jan.h.kok,
	netdev

On Fri, 2006-04-08 at 16:31 -0700, David Miller wrote:
> From: Stephen Hemminger <shemminger@osdl.org>

>  
> > The real risk is that the transmit routine will decide it's busy, but
> > at the same time an IRQ on another CPU will clean up the whole ring
> > and make the queue available again. Then the transmit routine exits and
> > says its busy, and the higher level will decide it is permanently stopped.
> 

Doesnt this mean we need to hold the txlock longer to avoid the race?
It will also mean we need to have a consistent programming model of the
drivers (as an example: When is it not advisable to prune tx ring on the
tx path? My gut feeling at the moment is i would rather have receive
path do a lot of dirty work so as to prioritize transmit)

> Yes, it's meant to catch unintented races.
> 

This section of the tx path is quiet hairy. Would a patch to make
qdisc_restart more readable be a sensible thing to submit?

> The queueing layer that calls ->hard_start_xmit() technically has no
> need to support NETDEV_TX_BUSY as a return value, since the device
> is able to prevent this.
> 

I noticed that the e1000 in some cases returns NETDEV_TX_OK even when it
has stopped the queue. This results in wasted calls into the qdisc.
It seems the tg3 is doing as well. I put a little counter and the amount
of those useless calls that happen in incredible.
I was going to experiment with infact returning an additional code
NETDEV_TX_BUSY_QUEUED which signals the core to just go away.

> If we could avoid NETDEV_TX_BUSY et al. from happening, the idea is
> that we could eliminate all of the requeueing logic in the packet
> scheduler layer.  It is a pipe dream from many years ago.
> 

Still doable IMO - maybe a motivation for something to experiment with
this long weekend. 

cheers,
jamal

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-04 10:25                   ` Herbert Xu
  2006-08-04 10:45                     ` jamal
@ 2006-08-05 23:04                     ` jamal
  2006-08-05 23:06                       ` Herbert Xu
  1 sibling, 1 reply; 85+ messages in thread
From: jamal @ 2006-08-05 23:04 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Michael Chan, Brandeburg, Jesse, auke-jan.h.kok, netdev,
	Stephen Hemminger

On Fri, 2006-04-08 at 20:25 +1000, Herbert Xu wrote:
> On Fri, Aug 04, 2006 at 06:16:03AM -0400, jamal wrote:
> > 
> > I still have my setup intact, so i will give this a whirl in about 3-4
> > hours. It does look promising - an atomic should be a lot cheaper than a
> > lock. Thanks Herbert.
> 
> Well it's not that good since the current code only takes the
> lock in exceptional circumstances while my patch does the atomic
> op unconditionally.

I run a variety of tests and it looks good i.e performance wise,
although lower than what i had before, certainly is an improvement over
what was there before (by about 1kpps in forwarding). It would probably
chew more CPU than the original driver (given the atomic) but as i said
before i have no philosophical qualms with that since it performs better
under stress ;->

Hardware is:
Dual via C3 1Ghz with a dual port e1000. The bottleneck is really
the 32 bit PCI bus in most cases. I tried running single port, dual;
low traffic, high traffic etc.

Interesting thing i noticed:
Under high speed forwarding test bidirectional (eth0<--forwarding-->eth1)

whenever __e1000_maybe_stop_tx() --> __netif_test_and_stop_queue() was 
called it was _guaranteed to find XOFF not set_. i.e 100% of the time. 

This seems to actually happen on eth0 only (i think the two ports share
the same PCI bridge and probably eth0 is biased?).

If you want me to try something else, send an incremental patch and i
can do it tomorrow.

cheers,
jamal 

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-05 16:56                       ` jamal
@ 2006-08-05 23:05                         ` Herbert Xu
  2006-08-05 23:17                           ` jamal
  2006-08-06 17:20                           ` Michael Chan
  0 siblings, 2 replies; 85+ messages in thread
From: Herbert Xu @ 2006-08-05 23:05 UTC (permalink / raw)
  To: jamal
  Cc: David Miller, shemminger, mchan, jesse.brandeburg, auke-jan.h.kok,
	netdev

On Sat, Aug 05, 2006 at 12:56:42PM -0400, jamal wrote:
>
> > > The real risk is that the transmit routine will decide it's busy, but
> > > at the same time an IRQ on another CPU will clean up the whole ring
> > > and make the queue available again. Then the transmit routine exits and
> > > says its busy, and the higher level will decide it is permanently stopped.

This is exactly the scenario where the current code can fail.  It is
also exactly the one that my patch addresses.
 
> I noticed that the e1000 in some cases returns NETDEV_TX_OK even when it
> has stopped the queue. This results in wasted calls into the qdisc.

Actually it's OK.  qdisc_restart will return after each NETDEV_TX_OK and
its caller will check netif_queue_stopped before calling it again.
 
> > If we could avoid NETDEV_TX_BUSY et al. from happening, the idea is
> > that we could eliminate all of the requeueing logic in the packet
> > scheduler layer.  It is a pipe dream from many years ago.
> 
> Still doable IMO - maybe a motivation for something to experiment with
> this long weekend. 

A precondition to eliminating it altogether is to eliminate lockless
drivers which would be a very good thing :)

In fact that reminds me, because tg3 is no longer lockless, we can do
a nicer bug fix for it than e1000 (it has the same bug).

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-05 23:04                     ` jamal
@ 2006-08-05 23:06                       ` Herbert Xu
  2006-08-05 23:21                         ` jamal
  0 siblings, 1 reply; 85+ messages in thread
From: Herbert Xu @ 2006-08-05 23:06 UTC (permalink / raw)
  To: jamal
  Cc: Michael Chan, Brandeburg, Jesse, auke-jan.h.kok, netdev,
	Stephen Hemminger

On Sat, Aug 05, 2006 at 07:04:46PM -0400, jamal wrote:
> 
> whenever __e1000_maybe_stop_tx() --> __netif_test_and_stop_queue() was 
> called it was _guaranteed to find XOFF not set_. i.e 100% of the time. 

Hardly surprising since you only have one CPU doing the actual sending.

The only time when this can happen is if we more than one CPU transmitted
at the exatly same time.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-05 23:05                         ` Herbert Xu
@ 2006-08-05 23:17                           ` jamal
  2006-08-05 23:19                             ` Herbert Xu
  2006-08-06 17:20                           ` Michael Chan
  1 sibling, 1 reply; 85+ messages in thread
From: jamal @ 2006-08-05 23:17 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, shemminger, mchan, jesse.brandeburg, auke-jan.h.kok,
	netdev

On Sun, 2006-06-08 at 09:05 +1000, Herbert Xu wrote:
> On Sat, Aug 05, 2006 at 12:56:42PM -0400, jamal wrote:
> >
> > > > The real risk is that the transmit routine will decide it's busy, but
> > > > at the same time an IRQ on another CPU will clean up the whole ring
> > > > and make the queue available again. Then the transmit routine exits and
> > > > says its busy, and the higher level will decide it is permanently stopped.
> 
> This is exactly the scenario where the current code can fail.  It is
> also exactly the one that my patch addresses.
>  

So would the holding of the tx lock at the top layer/qdisc_restart level
not help?

> > I noticed that the e1000 in some cases returns NETDEV_TX_OK even when it
> > has stopped the queue. This results in wasted calls into the qdisc.
> 
> Actually it's OK.  qdisc_restart will return after each NETDEV_TX_OK and
> its caller will check netif_queue_stopped before calling it again.
>  

yes, of course ;->

Actually the stat i captured was not related to this, rather it was the
number of times we entered qdisc_is_running and were unable to get a
packet. I am using the simple pfifo so it cant be the scheduler refusing
to give us a packet. I have a feeling it is just enthusiastic wake up of
the tx softirq. Thoughts on this? I will add checks to be sure. 

> > > If we could avoid NETDEV_TX_BUSY et al. from happening, the idea is
> > > that we could eliminate all of the requeueing logic in the packet
> > > scheduler layer.  It is a pipe dream from many years ago.
> > 
> > Still doable IMO - maybe a motivation for something to experiment with
> > this long weekend. 
> 
> A precondition to eliminating it altogether is to eliminate lockless
> drivers which would be a very good thing :)
> 

They have certainly complicated things. 

cheers,
jamal


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-05 23:17                           ` jamal
@ 2006-08-05 23:19                             ` Herbert Xu
  2006-08-05 23:36                               ` jamal
  0 siblings, 1 reply; 85+ messages in thread
From: Herbert Xu @ 2006-08-05 23:19 UTC (permalink / raw)
  To: jamal
  Cc: David Miller, shemminger, mchan, jesse.brandeburg, auke-jan.h.kok,
	netdev

On Sat, Aug 05, 2006 at 07:17:48PM -0400, jamal wrote:
>
> So would the holding of the tx lock at the top layer/qdisc_restart level
> not help?

Yes it should so the corresponding fix for tg3 would be simpler.

> Actually the stat i captured was not related to this, rather it was the
> number of times we entered qdisc_is_running and were unable to get a
> packet. I am using the simple pfifo so it cant be the scheduler refusing
> to give us a packet. I have a feeling it is just enthusiastic wake up of
> the tx softirq. Thoughts on this? I will add checks to be sure. 

Interesting.  I can't really think of why it'd do that so some extra
checks would really help.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-05 23:06                       ` Herbert Xu
@ 2006-08-05 23:21                         ` jamal
  2006-08-05 23:30                           ` Herbert Xu
  0 siblings, 1 reply; 85+ messages in thread
From: jamal @ 2006-08-05 23:21 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Michael Chan, Brandeburg, Jesse, auke-jan.h.kok, netdev,
	Stephen Hemminger

On Sun, 2006-06-08 at 09:06 +1000, Herbert Xu wrote:
> On Sat, Aug 05, 2006 at 07:04:46PM -0400, jamal wrote:
> > 
> > whenever __e1000_maybe_stop_tx() --> __netif_test_and_stop_queue() was 
> > called it was _guaranteed to find XOFF not set_. i.e 100% of the time. 
> 
> Hardly surprising since you only have one CPU doing the actual sending.

Would it make a difference if we had more than one? qdisc_is_running
ensures only one entering that region.

> The only time when this can happen is if we more than one CPU transmitted
> at the exatly same time.
> 

Did you mean to the same eth device? 
Note: It is very highly likely - given the way i generate traffic - that
two packets hit two different CPUs at the same time. But they will be to
different ethernet devices.

cheers,
jamal


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-05 23:21                         ` jamal
@ 2006-08-05 23:30                           ` Herbert Xu
  0 siblings, 0 replies; 85+ messages in thread
From: Herbert Xu @ 2006-08-05 23:30 UTC (permalink / raw)
  To: jamal
  Cc: Michael Chan, Brandeburg, Jesse, auke-jan.h.kok, netdev,
	Stephen Hemminger

On Sat, Aug 05, 2006 at 07:21:13PM -0400, jamal wrote:
>
> > Hardly surprising since you only have one CPU doing the actual sending.
> 
> Would it make a difference if we had more than one? qdisc_is_running
> ensures only one entering that region.

Good point.  I can probably simplify the e1000 patch then.

> > The only time when this can happen is if we more than one CPU transmitted
> > at the exatly same time.
> 
> Did you mean to the same eth device? 

Yes.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-05 23:19                             ` Herbert Xu
@ 2006-08-05 23:36                               ` jamal
  2006-08-06  2:51                                 ` Herbert Xu
  0 siblings, 1 reply; 85+ messages in thread
From: jamal @ 2006-08-05 23:36 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, shemminger, mchan, jesse.brandeburg, auke-jan.h.kok,
	netdev

On Sun, 2006-06-08 at 09:19 +1000, Herbert Xu wrote:

> Interesting.  I can't really think of why it'd do that so some extra
> checks would really help.
> 

I have to run for now but will do this tomorrow. Heres an example of
what is saw (eth devices bound to different CPUs):


on entering eth1 qdisc_is_running:

14651764 times a packet was successfully dequeued and transmitted
9379699 times we got no packet at all and exited.

no packet implies the failure of the check:
if (((skb = dev->gso_skb)) || ((skb = q->dequeue(q)))) {

I know the qlen is >= 0 otherwise i will hit the BUG().
A qlen of 0 will be interesting to find as well.

cheers,
jamal




^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-05 23:36                               ` jamal
@ 2006-08-06  2:51                                 ` Herbert Xu
  2006-08-06  7:14                                   ` Edgar E. Iglesias
                                                     ` (2 more replies)
  0 siblings, 3 replies; 85+ messages in thread
From: Herbert Xu @ 2006-08-06  2:51 UTC (permalink / raw)
  To: jamal
  Cc: David Miller, shemminger, mchan, jesse.brandeburg, auke-jan.h.kok,
	netdev

On Sat, Aug 05, 2006 at 07:36:50PM -0400, jamal wrote:
> 
> I know the qlen is >= 0 otherwise i will hit the BUG().
> A qlen of 0 will be interesting to find as well.

When the queue is woken it will always to qdisc_run regardless of
whether there are packets queued so this might explain what you're
seeing.

Anyway, I've changed the unconditional atomic op into a memory
barrier instead.  Let me know if this changes things for you.

I wonder if we could do the TX clean up within the transmission
routine most of the time.  Has anyone tried this before?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
--
diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index 627f224..e7e1306 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
@@ -2861,6 +2861,33 @@ e1000_transfer_dhcp_info(struct e1000_ad
 	return 0;
 }
 
+static int __e1000_maybe_stop_tx(struct net_device *netdev, int size)
+{
+	struct e1000_adapter *adapter = netdev_priv(netdev);
+	struct e1000_tx_ring *tx_ring = adapter->tx_ring;
+
+	netif_stop_queue(netdev);
+	smp_mb__after_netif_stop_queue();
+
+	/* We need to check again in a case another CPU has just
+	 * made room available.
+	 */
+	if (likely(E1000_DESC_UNUSED(tx_ring) < size))
+		return -EBUSY;
+
+	/* A reprieve! */
+	netif_start_queue(netdev);
+	return 0;
+}
+
+static inline int e1000_maybe_stop_tx(struct net_device *netdev,
+				      struct e1000_tx_ring *tx_ring, int size)
+{
+	if (likely(E1000_DESC_UNUSED(tx_ring) >= size))
+		return 0;
+	return __e1000_maybe_stop_tx(netdev, size);
+}
+
 #define TXD_USE_COUNT(S, X) (((S) >> (X)) + 1 )
 static int
 e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
@@ -2974,8 +3001,7 @@ #endif
 
 	/* need: count + 2 desc gap to keep tail from touching
 	 * head, otherwise try next time */
-	if (unlikely(E1000_DESC_UNUSED(tx_ring) < count + 2)) {
-		netif_stop_queue(netdev);
+	if (unlikely(e1000_maybe_stop_tx(netdev, tx_ring, count + 2))) {
 		spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
 		return NETDEV_TX_BUSY;
 	}
@@ -3022,8 +3048,7 @@ #endif
 	netdev->trans_start = jiffies;
 
 	/* Make sure there is space in the ring for the next send. */
-	if (unlikely(E1000_DESC_UNUSED(tx_ring) < MAX_SKB_FRAGS + 2))
-		netif_stop_queue(netdev);
+	e1000_maybe_stop_tx(netdev, tx_ring, MAX_SKB_FRAGS + 2);
 
 	spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
 	return NETDEV_TX_OK;
@@ -3515,13 +3540,14 @@ #endif
 	tx_ring->next_to_clean = i;
 
 #define TX_WAKE_THRESHOLD 32
-	if (unlikely(cleaned && netif_queue_stopped(netdev) &&
-	             netif_carrier_ok(netdev))) {
-		spin_lock(&tx_ring->tx_lock);
-		if (netif_queue_stopped(netdev) &&
-		    (E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD))
+	if (unlikely(cleaned && netif_carrier_ok(netdev) &&
+		     E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD)) {
+		/* Make sure that anybody stopping the queue after this
+		 * sees the new next_to_clean.
+		 */
+		smp_mb();
+		if (netif_queue_stopped(netdev))
 			netif_wake_queue(netdev);
-		spin_unlock(&tx_ring->tx_lock);
 	}
 
 	if (adapter->detect_tx_hung) {
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 75f02d8..7daaf71 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -33,6 +33,7 @@ #ifdef __KERNEL__
 #include <asm/atomic.h>
 #include <asm/cache.h>
 #include <asm/byteorder.h>
+#include <asm/system.h>
 
 #include <linux/device.h>
 #include <linux/percpu.h>
@@ -652,6 +653,12 @@ #endif
 	set_bit(__LINK_STATE_XOFF, &dev->state);
 }
 
+static inline void smp_mb__after_netif_stop_queue(void)
+{
+	/* Need to add smp_mb__after_set_bit(). */
+	smp_mb();
+}
+
 static inline int netif_queue_stopped(const struct net_device *dev)
 {
 	return test_bit(__LINK_STATE_XOFF, &dev->state);

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-06  2:51                                 ` Herbert Xu
@ 2006-08-06  7:14                                   ` Edgar E. Iglesias
  2006-08-06  7:24                                     ` Herbert Xu
  2006-08-06  7:26                                     ` David Miller
  2006-08-06 12:24                                   ` jamal
  2006-08-08 12:21                                   ` jamal
  2 siblings, 2 replies; 85+ messages in thread
From: Edgar E. Iglesias @ 2006-08-06  7:14 UTC (permalink / raw)
  To: Herbert Xu
  Cc: jamal, David Miller, shemminger, mchan, jesse.brandeburg,
	auke-jan.h.kok, netdev

On Sun, Aug 06, 2006 at 12:51:23PM +1000, Herbert Xu wrote:
>
> I wonder if we could do the TX clean up within the transmission
> routine most of the time.  Has anyone tried this before?
> 

Hi Herbert,

Not sure what you mean with "most of the time" but a while a go I played 
around with a something like this for another driver. The tx path was cleaning 
up the tx ring at certain intervals while the tx irq-handler was not touching 
the ring unless it had become full. Actually the tx irq was disabled most of 
the time and ring manipulation became simpler. 

Later I ended up throwing this away as it was causing some TCP connections to
deadlock. If you let skb's sit in the tx-ring indefintely, the 
sk->sk_wmem_queued accounting will be inacurate and this was causing the TCP 
retransmit logic to deadlock. 

Again, I'm not sure what you mean with "most of the time" but if you make sure 
the tx-ring is continously beeing cleaned-up, even tough no more packets are 
beeing generated, maybe it could work. 

I hope this wasn't completely irrelevant..

Best regards,
-- 
        Programmer
        Edgar E. Iglesias <edgar.iglesias@axis.com> 46.46.272.1946

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-06  7:14                                   ` Edgar E. Iglesias
@ 2006-08-06  7:24                                     ` Herbert Xu
  2006-08-06  7:30                                       ` Edgar E. Iglesias
  2006-08-06  7:26                                     ` David Miller
  1 sibling, 1 reply; 85+ messages in thread
From: Herbert Xu @ 2006-08-06  7:24 UTC (permalink / raw)
  To: Edgar E. Iglesias
  Cc: herbert, hadi, davem, shemminger, mchan, jesse.brandeburg,
	auke-jan.h.kok, netdev

Edgar E. Iglesias <edgar.iglesias@axis.com> wrote:
> 
> Not sure what you mean with "most of the time" but a while a go I played 
> around with a something like this for another driver. The tx path was cleaning 
> up the tx ring at certain intervals while the tx irq-handler was not touching 
> the ring unless it had become full. Actually the tx irq was disabled most of 
> the time and ring manipulation became simpler. 

Yes that's exactly what I mean :)

> Later I ended up throwing this away as it was causing some TCP connections to
> deadlock. If you let skb's sit in the tx-ring indefintely, the 
> sk->sk_wmem_queued accounting will be inacurate and this was causing the TCP 
> retransmit logic to deadlock. 

Yes you need to clean the ring if the TX queue becomes empty or you'll
risk dead locks.

What driver was this and did you do any performance measurements?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-06  7:14                                   ` Edgar E. Iglesias
  2006-08-06  7:24                                     ` Herbert Xu
@ 2006-08-06  7:26                                     ` David Miller
  2006-08-06  7:36                                       ` Herbert Xu
  1 sibling, 1 reply; 85+ messages in thread
From: David Miller @ 2006-08-06  7:26 UTC (permalink / raw)
  To: edgar.iglesias
  Cc: herbert, hadi, shemminger, mchan, jesse.brandeburg,
	auke-jan.h.kok, netdev

From: "Edgar E. Iglesias" <edgar.iglesias@axis.com>
Date: Sun, 6 Aug 2006 09:14:18 +0200

> Later I ended up throwing this away as it was causing some TCP
> connections to deadlock. If you let skb's sit in the tx-ring
> indefintely, the sk->sk_wmem_queued accounting will be inacurate and
> this was causing the TCP retransmit logic to deadlock.

That's right, it's important to release TX SKBs in a timely.  Doing it
purely from ->hard_start_xmit() causes big problems, as you saw.

Increasingly I find that orphan'ing an SKB once it hits the real
device for transmit is something to seriously consider.


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-06  7:24                                     ` Herbert Xu
@ 2006-08-06  7:30                                       ` Edgar E. Iglesias
  0 siblings, 0 replies; 85+ messages in thread
From: Edgar E. Iglesias @ 2006-08-06  7:30 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Edgar E. Iglesias, hadi, davem, shemminger, mchan,
	jesse.brandeburg, auke-jan.h.kok, netdev

On Sun, Aug 06, 2006 at 05:24:41PM +1000, Herbert Xu wrote:
> Edgar E. Iglesias <edgar.iglesias@axis.com> wrote:
> > 
> > Not sure what you mean with "most of the time" but a while a go I played 
> > around with a something like this for another driver. The tx path was cleaning 
> > up the tx ring at certain intervals while the tx irq-handler was not touching 
> > the ring unless it had become full. Actually the tx irq was disabled most of 
> > the time and ring manipulation became simpler. 
> 
> Yes that's exactly what I mean :)
> 
> > Later I ended up throwing this away as it was causing some TCP connections to
> > deadlock. If you let skb's sit in the tx-ring indefintely, the 
> > sk->sk_wmem_queued accounting will be inacurate and this was causing the TCP 
> > retransmit logic to deadlock. 
> 
> Yes you need to clean the ring if the TX queue becomes empty or you'll
> risk dead locks.
> 
> What driver was this and did you do any performance measurements?
> 

It was for the cris/eth_v10.c driver. Performance bumped with 20% for UDP if I remember correctly, TCP was much less. I'll try to dig up the numbers.

On UP non-preemptive kenrnels the tx-path used no interrupts and had no locks to manipulate the ring. But I am guessing that part of the performance win also came by freeing the skb's in chunks.

Best regards
-- 
        Programmer
        Edgar E. Iglesias <edgar.iglesias@axis.com> 46.46.272.1946

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-06  7:26                                     ` David Miller
@ 2006-08-06  7:36                                       ` Herbert Xu
  2006-08-06  8:06                                         ` Edgar E. Iglesias
  2006-08-06  8:35                                         ` David Miller
  0 siblings, 2 replies; 85+ messages in thread
From: Herbert Xu @ 2006-08-06  7:36 UTC (permalink / raw)
  To: David Miller
  Cc: edgar.iglesias, hadi, shemminger, mchan, jesse.brandeburg,
	auke-jan.h.kok, netdev

On Sun, Aug 06, 2006 at 12:26:44AM -0700, David Miller wrote:
> 
> Increasingly I find that orphan'ing an SKB once it hits the real
> device for transmit is something to seriously consider.

Yes that certainly makes a lot of sense.

Actually, why do we charge the memory to the socket at all between
the packet leaving the TCP stack and it entering the device? Most
packets spend a tiny amount of time on this journey so it seems
silly to do atomic operations as they leave the stack only to undo
it when they enter the device.

The biggest reason packets spend a non-trivial amount of time on
this journey is traffic shaping.  In that case the memory is already
accounted to the device anyway (in terms of queue length) so it would
seem unfair to charge the socket as well.

So how about simply charging the socket for the retransmit queue
as we do now and skip the caharge for packets leaving the stack?

For UDP it's different since it doesn't have a retransmit queue.
However, I'm unsure how much of a positive effect the socket charge
actually has for UDP.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-06  7:36                                       ` Herbert Xu
@ 2006-08-06  8:06                                         ` Edgar E. Iglesias
  2006-08-06  8:27                                           ` Herbert Xu
  2006-08-06  8:35                                         ` David Miller
  1 sibling, 1 reply; 85+ messages in thread
From: Edgar E. Iglesias @ 2006-08-06  8:06 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, edgar.iglesias, hadi, shemminger, mchan,
	jesse.brandeburg, auke-jan.h.kok, netdev

On Sun, Aug 06, 2006 at 05:36:55PM +1000, Herbert Xu wrote:
> On Sun, Aug 06, 2006 at 12:26:44AM -0700, David Miller wrote:
> > 
> > Increasingly I find that orphan'ing an SKB once it hits the real
> > device for transmit is something to seriously consider.
> 
> Yes that certainly makes a lot of sense.
> 
> Actually, why do we charge the memory to the socket at all between
> the packet leaving the TCP stack and it entering the device? Most
> packets spend a tiny amount of time on this journey so it seems
> silly to do atomic operations as they leave the stack only to undo
> it when they enter the device.
> 
> The biggest reason packets spend a non-trivial amount of time on
> this journey is traffic shaping.  In that case the memory is already
> accounted to the device anyway (in terms of queue length) so it would
> seem unfair to charge the socket as well.
> 
> So how about simply charging the socket for the retransmit queue
> as we do now and skip the caharge for packets leaving the stack?
> 
> For UDP it's different since it doesn't have a retransmit queue.
> However, I'm unsure how much of a positive effect the socket charge
> actually has for UDP.

It makes sense, but I see a problem with this. It will remove some of the
control applications have on how much of their data gets queued on local
queues. Small SNDBUF's are sometimes used to somewhat avoid/control latency.

For now, I think it's better to orphan the skb when it hits a real device for
transmit, as David mentioned.

Best regards
-- 
        Programmer
        Edgar E. Iglesias <edgar.iglesias@axis.com> 46.46.272.1946

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-06  8:06                                         ` Edgar E. Iglesias
@ 2006-08-06  8:27                                           ` Herbert Xu
  2006-08-06  9:03                                             ` Edgar E. Iglesias
  0 siblings, 1 reply; 85+ messages in thread
From: Herbert Xu @ 2006-08-06  8:27 UTC (permalink / raw)
  To: Edgar E. Iglesias
  Cc: herbert, davem, edgar.iglesias, hadi, shemminger, mchan,
	jesse.brandeburg, auke-jan.h.kok, netdev

Edgar E. Iglesias <edgar.iglesias@axis.com> wrote:
> 
> It makes sense, but I see a problem with this. It will remove some of the
> control applications have on how much of their data gets queued on local
> queues. Small SNDBUF's are sometimes used to somewhat avoid/control latency.

For UDP yes but does this really make sense for TCP?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-06  7:36                                       ` Herbert Xu
  2006-08-06  8:06                                         ` Edgar E. Iglesias
@ 2006-08-06  8:35                                         ` David Miller
  1 sibling, 0 replies; 85+ messages in thread
From: David Miller @ 2006-08-06  8:35 UTC (permalink / raw)
  To: herbert
  Cc: edgar.iglesias, hadi, shemminger, mchan, jesse.brandeburg,
	auke-jan.h.kok, netdev

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Sun, 6 Aug 2006 17:36:55 +1000

> Actually, why do we charge the memory to the socket at all between
> the packet leaving the TCP stack and it entering the device?

Because it's easy to verify that it prevents all the possible
DoS scenerios a user could play out.

> Most packets spend a tiny amount of time on this journey so it seems
> silly to do atomic operations as they leave the stack only to undo
> it when they enter the device.

I agree.

> The biggest reason packets spend a non-trivial amount of time on
> this journey is traffic shaping.  In that case the memory is already
> accounted to the device anyway (in terms of queue length) so it would
> seem unfair to charge the socket as well.

Again, total agreement.

> So how about simply charging the socket for the retransmit queue
> as we do now and skip the caharge for packets leaving the stack?
> 
> For UDP it's different since it doesn't have a retransmit queue.
> However, I'm unsure how much of a positive effect the socket charge
> actually has for UDP.

UDP and other datagram protocols are a thorny case aren't they? :)

This has actually been a source of problems in the past, come to think
of it.  We used to have this bug in the output path where UDP could
build a packet of size X that fit in the socket send buffer, but if we
had to fragment this packet because "mtu < X" not all of the fragments
would fit in the socket send buffer and we'd drop the entire packet :)

What happens now is we just try to strictly charge the first fragment
to the socket's send buffer on fragmentation and then we only enfore
lighter rules (allowing up to "2 * sk_sndbuf" to be used) as we charge
for the rest of the frags.

You can see this logic in ip_append_data():

		if (transhdrlen) {
			skb = sock_alloc_send_skb(sk, 
					alloclen + hh_len + 15,
					(flags & MSG_DONTWAIT), &err);
		} else {
			skb = NULL;
			if (atomic_read(&sk->sk_wmem_alloc) <=
			    2 * sk->sk_sndbuf)
				skb = sock_wmalloc(sk, 
						   alloclen + hh_len + 15, 1,
						   sk->sk_allocation);
			if (unlikely(skb == NULL))
				err = -ENOBUFS;
		}

As you say these things are about to go out to the device, so it's
kind of silly from that perspective, BUT...  it does exist to a
certain extent for application flow control, especially for these
datagram apps.

I tried to see if there was any wisdom on this matter in TCP/IP
Illustrated Volume 2, but BSD does the same thing we do.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-06  8:27                                           ` Herbert Xu
@ 2006-08-06  9:03                                             ` Edgar E. Iglesias
  2006-08-06  9:10                                               ` Herbert Xu
  0 siblings, 1 reply; 85+ messages in thread
From: Edgar E. Iglesias @ 2006-08-06  9:03 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Edgar E. Iglesias, davem, hadi, shemminger, mchan,
	jesse.brandeburg, auke-jan.h.kok, netdev

On Sun, Aug 06, 2006 at 06:27:35PM +1000, Herbert Xu wrote:
> Edgar E. Iglesias <edgar.iglesias@axis.com> wrote:
> > 
> > It makes sense, but I see a problem with this. It will remove some of the
> > control applications have on how much of their data gets queued on local
> > queues. Small SNDBUF's are sometimes used to somewhat avoid/control latency.
> 
> For UDP yes but does this really make sense for TCP?
> 

I don't know :)

The flow may end up with SNDBUF worth of skbs on the socket
transmit_queue + cwnd worth of skbs on the qdisc queue + full tx ring. In slow
start we'll increase cwnd and won't se packet loss until the qdisc is full
(which might be at 1000 pkts).

If the flow is locally limited, won't we end up with a very latent flow? where
moste packets will be on the qdisc.

I agree that in general this is an effect of users having to large qdisc 
queues, but today maybe it is beeing somewhat avoided by apps setting small 
SNDBUF's. So I kind of agree with what you suggest but there might be risks.

Best regards
-- 
        Programmer
        Edgar E. Iglesias <edgar.iglesias@axis.com> 46.46.272.1946

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-06  9:03                                             ` Edgar E. Iglesias
@ 2006-08-06  9:10                                               ` Herbert Xu
  2006-08-06  9:18                                                 ` Edgar E. Iglesias
  0 siblings, 1 reply; 85+ messages in thread
From: Herbert Xu @ 2006-08-06  9:10 UTC (permalink / raw)
  To: Edgar E. Iglesias
  Cc: davem, hadi, shemminger, mchan, jesse.brandeburg, auke-jan.h.kok,
	netdev

On Sun, Aug 06, 2006 at 11:03:54AM +0200, Edgar E. Iglesias wrote:
>
> > For UDP yes but does this really make sense for TCP?
> 
> I don't know :)
> 
> The flow may end up with SNDBUF worth of skbs on the socket
> transmit_queue + cwnd worth of skbs on the qdisc queue + full tx ring. In slow
> start we'll increase cwnd and won't se packet loss until the qdisc is full
> (which might be at 1000 pkts).

For TCP, the packets stay on the transmit queue until they're acked.
So all those packets on the qdisc queue (and tx ring) are still charged
to the socket.  The only difference is that right now we're double-
charging the ones on the qdisc queue..

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-06  9:10                                               ` Herbert Xu
@ 2006-08-06  9:18                                                 ` Edgar E. Iglesias
  0 siblings, 0 replies; 85+ messages in thread
From: Edgar E. Iglesias @ 2006-08-06  9:18 UTC (permalink / raw)
  To: Herbert Xu
  Cc: Edgar E. Iglesias, davem, hadi, shemminger, mchan,
	jesse.brandeburg, auke-jan.h.kok, netdev

On Sun, Aug 06, 2006 at 07:10:20PM +1000, Herbert Xu wrote:
> On Sun, Aug 06, 2006 at 11:03:54AM +0200, Edgar E. Iglesias wrote:
> >
> > > For UDP yes but does this really make sense for TCP?
> > 
> > I don't know :)
> > 
> > The flow may end up with SNDBUF worth of skbs on the socket
> > transmit_queue + cwnd worth of skbs on the qdisc queue + full tx ring. In slow
> > start we'll increase cwnd and won't se packet loss until the qdisc is full
> > (which might be at 1000 pkts).
> 
> For TCP, the packets stay on the transmit queue until they're acked.
> So all those packets on the qdisc queue (and tx ring) are still charged
> to the socket.  The only difference is that right now we're double-
> charging the ones on the qdisc queue..

I see, then I guess this won't be a problem.

Thanks,
-- 
        Programmer
        Edgar E. Iglesias <edgar.iglesias@axis.com> 46.46.272.1946

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-06  2:51                                 ` Herbert Xu
  2006-08-06  7:14                                   ` Edgar E. Iglesias
@ 2006-08-06 12:24                                   ` jamal
  2006-08-06 12:33                                     ` jamal
                                                       ` (2 more replies)
  2006-08-08 12:21                                   ` jamal
  2 siblings, 3 replies; 85+ messages in thread
From: jamal @ 2006-08-06 12:24 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, shemminger, mchan, jesse.brandeburg, auke-jan.h.kok,
	Edgar E. Iglesias, netdev

[-- Attachment #1: Type: text/plain, Size: 2546 bytes --]

On Sun, 2006-06-08 at 12:51 +1000, Herbert Xu wrote:
> On Sat, Aug 05, 2006 at 07:36:50PM -0400, jamal wrote:
> > 
> > I know the qlen is >= 0 otherwise i will hit the BUG().
> > A qlen of 0 will be interesting to find as well.
> 
> When the queue is woken it will always to qdisc_run regardless of
> whether there are packets queued so this might explain what you're
> seeing.
> 

That aligns with what i was thinking as well - i.e what i referred to as
possibly enthusiasm on part of the tx softirq.
i.e if there was nothing queued, why is the softirq even woken up?
Note, I observed this behavior to be a lot worse on systems that had
very little traffic going out. Anyways, I hope to find out.

> Anyway, I've changed the unconditional atomic op into a memory
> barrier instead.  Let me know if this changes things for you.
> 

Will try it before end of day. Machine is about 20 minutes drive from
where i am.

> I wonder if we could do the TX clean up within the transmission
> routine most of the time.  Has anyone tried this before?

I actually have - as recently as last week when i was looking at the
qdisc_is_running thing. Patch for e1000 attached. 

One of the experiments i attempted was similar to the one described by
edgar.iglesias@axis.com 
I would only clean up on the rx path if netif_stopped and the rest
happened in tx path. I tried variations of the above (ex: cleanup on tx
path only if we exceed a threshold etc).

I did not see _any_ improvement for forwarding tests ;-> 
Infact at one point i think i saw a decrease in throughput, but it may
have been within tolerance of experimental errors.

In the old days, both Robert and I did this with the tulip and it used
to be an improvement. But now that i think about it, probably the reason
was by doing so we reduced the EOT interrupts. Note, the NAPI 
approach on tulip allows that tx pruning to be done via interrupts.
I liked that approach; but most people didnt and the e1000 and tg3 is
what you see now.

My gut feeling is to let the cleaning happen in the receive path. The
reason is that this leaves the tx path doing a lot less work. In a way
this gives a pseudo higher-priority to the tx path at least for the case
of forwarding where it is important for me to make sure the packet that
i have sweated doing all the forwarding work on is going to go out. IOW,
if the rx path gets very busy packets will end up being dropped on the
receive path by virtue of DMA rings being filled up while the driver is
busy - and that is an acceptable compromise.

cheers,
jamal

[-- Attachment #2: e1000-p1 --]
[-- Type: text/x-patch, Size: 7527 bytes --]

diff --git a/drivers/net/e1000/e1000.h b/drivers/net/e1000/e1000.h
index d304297..270ec35 100644
--- a/drivers/net/e1000/e1000.h
+++ b/drivers/net/e1000/e1000.h
@@ -183,6 +183,10 @@ struct e1000_tx_ring {
 	unsigned int size;
 	/* number of descriptors in the ring */
 	unsigned int count;
+	/* number of descriptors needed to wake up netdev */
+	unsigned int waket;
+	/* number of descriptors we allow to prune at a time */
+	unsigned int prunet;
 	/* next descriptor to associate a buffer with */
 	unsigned int next_to_use;
 	/* next descriptor to check for DD status bit */
diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
index da62db8..c90d5c3 100644
--- a/drivers/net/e1000/e1000_main.c
+++ b/drivers/net/e1000/e1000_main.c
 err_up:
@@ -1371,6 +1372,11 @@ setup_tx_desc_die:
 
 	txdr->next_to_use = 0;
 	txdr->next_to_clean = 0;
+	/* hard-coding for now, but we should be able to
+	 * use ethtool or module params in the future
+	*/
+	txdr->prunet = txdr->count>>2;
+	txdr->waket = txdr->count>>3;
 	spin_lock_init(&txdr->tx_lock);
 
 	return 0;
@@ -2861,6 +2867,10 @@ e1000_transfer_dhcp_info(struct e1000_ad
 	return 0;
 }
 
+static unsigned int
+e1000_prune_tx_ring(struct e1000_adapter *adapter,
+                   struct e1000_tx_ring *tx_ring);
+
 #define TXD_USE_COUNT(S, X) (((S) >> (X)) + 1 )
 static int
 e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
@@ -2980,6 +2990,14 @@ #endif
 		return NETDEV_TX_BUSY;
 	}
 
+#ifdef jamal
+	{ 
+		int fdesc = E1000_DESC_UNUSED(tx_ring);
+		if (unlikely(fdesc < tx_ring->waket)) 
+			e1000_prune_tx_ring(adapter,tx_ring);
+	}
+#endif
+
 	if (unlikely(adapter->hw.mac_type == e1000_82547)) {
 		if (unlikely(e1000_82547_fifo_workaround(adapter, skb))) {
 			netif_stop_queue(netdev);
@@ -3468,24 +3496,30 @@ quit_polling:
 }
 
 #endif
-/**
- * e1000_clean_tx_irq - Reclaim resources after transmit completes
- * @adapter: board private structure
- **/
+static inline void
+e1000_schedule_tx(struct net_device *netdev,
+		  struct e1000_tx_ring *tx_ring)
+{
+	if (netif_carrier_ok(netdev)) {
+		if (E1000_DESC_UNUSED(tx_ring) >= tx_ring->waket)
+			netif_wake_queue(netdev);
+	}
+}
 
-static boolean_t
-e1000_clean_tx_irq(struct e1000_adapter *adapter,
+/* 
+ * Must hold tx lock when invoking this ..
+*/
+static unsigned int
+e1000_prune_tx_ring(struct e1000_adapter *adapter,
                    struct e1000_tx_ring *tx_ring)
 {
-	struct net_device *netdev = adapter->netdev;
 	struct e1000_tx_desc *tx_desc, *eop_desc;
 	struct e1000_buffer *buffer_info;
 	unsigned int i, eop;
-#ifdef CONFIG_E1000_NAPI
-	unsigned int count = 0;
-#endif
+	unsigned int count = 0, cleaned_count = 0;
 	boolean_t cleaned = FALSE;
 
+
 	i = tx_ring->next_to_clean;
 	eop = tx_ring->buffer_info[i].next_to_watch;
 	eop_desc = E1000_TX_DESC(*tx_ring, eop);
@@ -3498,6 +3532,7 @@ #endif
 
 			e1000_unmap_and_free_tx_resource(adapter, buffer_info);
 			memset(tx_desc, 0, sizeof(struct e1000_tx_desc));
+			cleaned_count++;
 
 			if (unlikely(++i == tx_ring->count)) i = 0;
 		}
@@ -3506,59 +3541,94 @@ #endif
 		eop = tx_ring->buffer_info[i].next_to_watch;
 		eop_desc = E1000_TX_DESC(*tx_ring, eop);
 #ifdef CONFIG_E1000_NAPI
-#define E1000_TX_WEIGHT 64
-		/* weight of a sort for tx, to avoid endless transmit cleanup */
-		if (count++ == E1000_TX_WEIGHT) break;
+		/* avoid endless transmit cleanup */
+		if (count++ == tx_ring->prunet) break;
 #endif
 	}
 
 	tx_ring->next_to_clean = i;
 
-#define TX_WAKE_THRESHOLD 32
-	if (unlikely(cleaned && netif_queue_stopped(netdev) &&
-	             netif_carrier_ok(netdev))) {
-		spin_lock(&tx_ring->tx_lock);
-		if (netif_queue_stopped(netdev) &&
-		    (E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD))
-			netif_wake_queue(netdev);
-		spin_unlock(&tx_ring->tx_lock);
-	}
-
-	if (adapter->detect_tx_hung) {
-		/* Detect a transmit hang in hardware, this serializes the
-		 * check with the clearing of time_stamp and movement of i */
-		adapter->detect_tx_hung = FALSE;
-		if (tx_ring->buffer_info[eop].dma &&
-		    time_after(jiffies, tx_ring->buffer_info[eop].time_stamp +
-		               (adapter->tx_timeout_factor * HZ))
-		    && !(E1000_READ_REG(&adapter->hw, STATUS) &
-		         E1000_STATUS_TXOFF)) {
-
-			/* detected Tx unit hang */
-			DPRINTK(DRV, ERR, "Detected Tx Unit Hang\n"
-					"  Tx Queue             <%lu>\n"
-					"  TDH                  <%x>\n"
-					"  TDT                  <%x>\n"
-					"  next_to_use          <%x>\n"
-					"  next_to_clean        <%x>\n"
-					"buffer_info[next_to_clean]\n"
-					"  time_stamp           <%lx>\n"
-					"  next_to_watch        <%x>\n"
-					"  jiffies              <%lx>\n"
-					"  next_to_watch.status <%x>\n",
-				(unsigned long)((tx_ring - adapter->tx_ring) /
-					sizeof(struct e1000_tx_ring)),
-				readl(adapter->hw.hw_addr + tx_ring->tdh),
-				readl(adapter->hw.hw_addr + tx_ring->tdt),
-				tx_ring->next_to_use,
-				tx_ring->next_to_clean,
-				tx_ring->buffer_info[eop].time_stamp,
-				eop,
-				jiffies,
-				eop_desc->upper.fields.status);
-			netif_stop_queue(netdev);
+	return cleaned_count;
+}
+
+static inline void
+e1000_detect_tx_hung(struct e1000_adapter *adapter,
+                   struct e1000_tx_ring *tx_ring)
+{
+	struct e1000_tx_desc *eop_desc;
+	struct net_device *netdev = adapter->netdev;
+	unsigned int i = tx_ring->next_to_clean;
+	unsigned int eop = tx_ring->buffer_info[i].next_to_watch;
+	eop_desc = E1000_TX_DESC(*tx_ring, eop);
+	/* Detect a transmit hang in hardware, this serializes the
+	 * check with the clearing of time_stamp and movement of i */
+	adapter->detect_tx_hung = FALSE;
+	if (tx_ring->buffer_info[eop].dma &&
+	    time_after(jiffies, tx_ring->buffer_info[eop].time_stamp +
+		       (adapter->tx_timeout_factor * HZ))
+	    && !(E1000_READ_REG(&adapter->hw, STATUS) &
+		 E1000_STATUS_TXOFF)) {
+
+		/* detected Tx unit hang */
+		DPRINTK(DRV, ERR, "Detected Tx Unit Hang\n"
+				"  Tx Queue             <%lu>\n"
+				"  TDH                  <%x>\n"
+				"  TDT                  <%x>\n"
+				"  next_to_use          <%x>\n"
+				"  next_to_clean        <%x>\n"
+				"buffer_info[next_to_clean]\n"
+				"  time_stamp           <%lx>\n"
+				"  next_to_watch        <%x>\n"
+				"  jiffies              <%lx>\n"
+				"  next_to_watch.status <%x>\n",
+			(unsigned long)((tx_ring - adapter->tx_ring) /
+				sizeof(struct e1000_tx_ring)),
+			readl(adapter->hw.hw_addr + tx_ring->tdh),
+			readl(adapter->hw.hw_addr + tx_ring->tdt),
+			tx_ring->next_to_use,
+			tx_ring->next_to_clean,
+			tx_ring->buffer_info[eop].time_stamp,
+			eop,
+			jiffies,
+			eop_desc->upper.fields.status);
+
+		netif_stop_queue(netdev);
+	}
+}
+
+/**
+ * e1000_clean_tx_irq - Reclaim resources after transmit completes
+ * @adapter: board private structure
+ **/
+
+static boolean_t
+e1000_clean_tx_irq(struct e1000_adapter *adapter,
+                   struct e1000_tx_ring *tx_ring)
+{
+	struct net_device *netdev = adapter->netdev;
+	boolean_t cleaned = FALSE;
+
+#ifdef jamal
+	spin_lock(&tx_ring->tx_lock);
+	{
+		int fdesc = E1000_DESC_UNUSED(tx_ring);
+		if (fdesc < tx_ring->prunet) {
+			if (e1000_prune_tx_ring(adapter,tx_ring))
+				cleaned = TRUE;
 		}
 	}
+	spin_unlock(&tx_ring->tx_lock);
+#else
+	if (e1000_prune_tx_ring(adapter,tx_ring))
+		cleaned = TRUE;
+#endif
+
+	if (unlikely(netif_queue_stopped(netdev) && cleaned))
+		e1000_schedule_tx(netdev, tx_ring);
+
+	if (unlikely(adapter->detect_tx_hung)) 
+		e1000_detect_tx_hung(adapter,tx_ring);
+
 	return cleaned;
 }

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-06 12:24                                   ` jamal
@ 2006-08-06 12:33                                     ` jamal
  2006-08-06 23:16                                       ` Jesse Brandeburg
  2006-08-06 19:22                                     ` jamal
  2006-08-08  1:19                                     ` jamal
  2 siblings, 1 reply; 85+ messages in thread
From: jamal @ 2006-08-06 12:33 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, shemminger, mchan, jesse.brandeburg, auke-jan.h.kok,
	Edgar E. Iglesias, netdev

BTW, if you notice in the attached patch i have:
+       /* number of descriptors needed to wake up netdev */
+       unsigned int waket;
+       /* number of descriptors we allow to prune at a time */
+       unsigned int prunet;

and:

+       /* hard-coding for now, but we should be able to
+        * use ethtool or module params in the future
+       */
+       txdr->prunet = txdr->count>>2;
+       txdr->waket = txdr->count>>3;

i.e the two thresholds are based on the tx ring sizes.

I didnt like TX_WAKE_THRESHOLD and  E1000_TX_WEIGHT hardcoding.
if i make the ring smaller then these values could be invalid.
I also experimented with varying the two thresholds with varying traffic
throughput - lets say results are still magical and i am still looking
at it.

Q: Do you know the historical context of why TX_WAKE_THRESHOLD and
E1000_TX_WEIGHT are hard-coded (and not for example related to the size
of the ring)? Jesse?
Otherwise i would like to submit a patch for the above. eventually
probably even allow the setting via ethtool.

cheers,
jamal

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-05 23:05                         ` Herbert Xu
  2006-08-05 23:17                           ` jamal
@ 2006-08-06 17:20                           ` Michael Chan
  2006-08-06 23:04                             ` Herbert Xu
  1 sibling, 1 reply; 85+ messages in thread
From: Michael Chan @ 2006-08-06 17:20 UTC (permalink / raw)
  To: Herbert Xu, jamal
  Cc: David Miller, shemminger, jesse.brandeburg, auke-jan.h.kok,
	netdev

Herbert Xu wrote:

> In fact that reminds me, because tg3 is no longer lockless, we can do
> a nicer bug fix for it than e1000 (it has the same bug).
> 

Do you mean the same race condition bug between tg3_start_xmit()
and tg3_tx() that I brought up?  I don't think tg3 has that problem.
Did I miss something?


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-06 12:24                                   ` jamal
  2006-08-06 12:33                                     ` jamal
@ 2006-08-06 19:22                                     ` jamal
  2006-08-08  1:19                                     ` jamal
  2 siblings, 0 replies; 85+ messages in thread
From: jamal @ 2006-08-06 19:22 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, shemminger, mchan, jesse.brandeburg, auke-jan.h.kok,
	Edgar E. Iglesias, netdev

On Sun, 2006-06-08 at 08:24 -0400, jamal wrote:

> Will try it before end of day. 

Looks good performance-wise; almost as good as the version i had.
Certainly better than the original e1000 (of course i am talking about
the overload case ;->).

Interesting thing to note:
only 0.001% of the time and only on one CPU was the
 __e1000_maybe_stop_tx::netif_start_queue() ever hit.
I guess that is sufficient to convince someone of the possibility of a
race ;-> Note, these C3s are damn slow CPUs by todays standards so i am
suspecting that % will be higher with faster CPUs.

cheers,
jamal

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-06 17:20                           ` Michael Chan
@ 2006-08-06 23:04                             ` Herbert Xu
  2006-08-07  3:56                               ` Michael Chan
  0 siblings, 1 reply; 85+ messages in thread
From: Herbert Xu @ 2006-08-06 23:04 UTC (permalink / raw)
  To: Michael Chan
  Cc: jamal, David Miller, shemminger, jesse.brandeburg, auke-jan.h.kok,
	netdev

On Sun, Aug 06, 2006 at 10:20:12AM -0700, Michael Chan wrote:
> 
> Do you mean the same race condition bug between tg3_start_xmit()
> and tg3_tx() that I brought up?  I don't think tg3 has that problem.
> Did I miss something?

Yes it has the problem in two places.  First of all the first
netif_stop_queue in tg3_start_xmit doesn't take the tx_lock at
all so it's obviously vulnerable.

The last netif_stop_queue is susceptible to a more subtle problem:

CPU0					CPU1
tg3_start_xmit
	TX_BUFF_AVAIL <= threshold
		tx_lock
					tg3_tx
						!netif_queue_stopped
		netif_stop_queue
		BUFF_AVAIL <= threshold
						tx_cons = sw_idx
		tx_unlock

Because netif_queue_stopped is *not* a memory barrier, it can float
above the assignment to tx_cons.  So even though netif_stop_queue *is*
a memory barrier on x86, the second BUFF_AVAIL test can still fail to
see the updated index.

The fix here is obviously to add a memory barrier in tg3_tx.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-06 12:33                                     ` jamal
@ 2006-08-06 23:16                                       ` Jesse Brandeburg
  2006-08-07 12:50                                         ` jamal
  0 siblings, 1 reply; 85+ messages in thread
From: Jesse Brandeburg @ 2006-08-06 23:16 UTC (permalink / raw)
  To: hadi
  Cc: Herbert Xu, David Miller, shemminger, mchan, jesse.brandeburg,
	auke-jan.h.kok, Edgar E. Iglesias, netdev

On 8/6/06, jamal <hadi@cyberus.ca> wrote:
> Q: Do you know the historical context of why TX_WAKE_THRESHOLD and
> E1000_TX_WEIGHT are hard-coded (and not for example related to the size
> of the ring)? Jesse?
> Otherwise i would like to submit a patch for the above. eventually
> probably even allow the setting via ethtool.

Yeah those are both created by me, interestingly up until a couple of
releases ago we had neither.

To answer your question of context, I noticed very specific problems
and fixed them with the simplest method that worked.  I figure we can
address it as time goes on (like we are now) with improvements.

As for specifics, for TX_WAKE_THRESHOLD, i noticed that we were
starting the queue after every packet was cleaned, so when the ring
went full there was a lot of queue thrash.  tg3 seemed to fix it in a
smart way and so I did a similar fix.  Note we should have at least
MAX_SKB_FRAGS (usually 32) + a few descriptors free before we should
start the tx again, otherwise we run the risk of a maximum fragmented
packet being unable to fit in the tx ring.

now, for E1000_TX_WEIGHT, that was more of an experiment as i noticed
we could stay in transmit clean up forEVER (okay not literally) which
would really violate our NAPI timeslice.  I messed with some values
and 64 didn't really seem like too bad a compromise (it does slow
things down just a tad in the general case) while really helping a
couple of tests where there were lots of outstanding transmits
happening at the same time as lots of receives.

This need for a tx weight is yet another global (design) problem with
NAPI enabled drivers, but someday I'll try to document some of the
issues I've seen.

Jesse

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-06 23:04                             ` Herbert Xu
@ 2006-08-07  3:56                               ` Michael Chan
  2006-08-07  4:21                                 ` Herbert Xu
  0 siblings, 1 reply; 85+ messages in thread
From: Michael Chan @ 2006-08-07  3:56 UTC (permalink / raw)
  To: Herbert Xu
  Cc: jamal, David Miller, shemminger, jesse.brandeburg, auke-jan.h.kok,
	netdev

Herbert Xu wrote:

> Yes it has the problem in two places.  First of all the first
> netif_stop_queue in tg3_start_xmit doesn't take the tx_lock at
> all so it's obviously vulnerable.

Oh, that first one near the top of tg3_start_xmit()?  That
condition is not supposed to happen and is a bug condition.  But
we can certainly add a spinlock to make it more correct.

> 
> The last netif_stop_queue is susceptible to a more subtle problem:
> 
> CPU0					CPU1
> tg3_start_xmit
> 	TX_BUFF_AVAIL <= threshold
> 		tx_lock
> 					tg3_tx
> 						!netif_queue_stopped
> 		netif_stop_queue
> 		BUFF_AVAIL <= threshold
> 						tx_cons = sw_idx
> 		tx_unlock
> 
> Because netif_queue_stopped is *not* a memory barrier, it can float
> above the assignment to tx_cons.  So even though netif_stop_queue *is*
> a memory barrier on x86, the second BUFF_AVAIL test can still fail to
> see the updated index.
> 
> The fix here is obviously to add a memory barrier in tg3_tx.

You are absolutely right and I missed it.  We should add a wmb()
after the tx_cons update in tg3_tx().  Thanks.



^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-07  3:56                               ` Michael Chan
@ 2006-08-07  4:21                                 ` Herbert Xu
  0 siblings, 0 replies; 85+ messages in thread
From: Herbert Xu @ 2006-08-07  4:21 UTC (permalink / raw)
  To: Michael Chan
  Cc: jamal, David Miller, shemminger, jesse.brandeburg, auke-jan.h.kok,
	netdev

On Sun, Aug 06, 2006 at 08:56:31PM -0700, Michael Chan wrote:
>
> > Because netif_queue_stopped is *not* a memory barrier, it can float
> > above the assignment to tx_cons.  So even though netif_stop_queue *is*
> > a memory barrier on x86, the second BUFF_AVAIL test can still fail to
> > see the updated index.
> > 
> > The fix here is obviously to add a memory barrier in tg3_tx.
> 
> You are absolutely right and I missed it.  We should add a wmb()
> after the tx_cons update in tg3_tx().  Thanks.

A wmb() only prevents stores from passing stores.  We need to prevent
a load from passing a store so you need a full memory barrier (smp_mb).

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-06 23:16                                       ` Jesse Brandeburg
@ 2006-08-07 12:50                                         ` jamal
  2006-08-07 15:21                                           ` Edgar E. Iglesias
  0 siblings, 1 reply; 85+ messages in thread
From: jamal @ 2006-08-07 12:50 UTC (permalink / raw)
  To: Jesse Brandeburg
  Cc: Herbert Xu, David Miller, shemminger, mchan, jesse.brandeburg,
	auke-jan.h.kok, Edgar E. Iglesias, netdev

On Sun, 2006-06-08 at 16:16 -0700, Jesse Brandeburg wrote:
[..]
> 
> As for specifics, for TX_WAKE_THRESHOLD, i noticed that we were
> starting the queue after every packet was cleaned, so when the ring
> went full there was a lot of queue thrash.

indeed this is what used to happen and was bad.... So this is a huge
improvement.
What happens now under steady state at high traffic transmits is,
instead of 1, you see E1000_TX_WEIGHT in between queue sleep/wakes. I
assume this is a given since E1000_TX_WEIGHT is higher than
TX_WAKE_THRESHOLD.  I am not sure if i can vouch for even more
improvement by mucking around with values of E1000_TX_WEIGHT.

Can you please take a look at the patch i posted? I would like to submit
that for inclusion. It does two things
a) make pruning available to be invoked from elsewhere (I tried to do it
from the tx path but it gave me non-good results)
b) makes E1000_TX_WEIGHT and TX_WAKE_THRESHOLD relative to the size
of the transmit ring. I think this is a sane thing to do.

You could either extract the bits or i could resend to you as two
different patches. I have tested it and it works.

>   tg3 seemed to fix it in a
> smart way and so I did a similar fix.  Note we should have at least
> MAX_SKB_FRAGS (usually 32) + a few descriptors free before we should
> start the tx again, otherwise we run the risk of a maximum fragmented
> packet being unable to fit in the tx ring.

I noticed you check for that in the tx path.

> now, for E1000_TX_WEIGHT, that was more of an experiment as i noticed
> we could stay in transmit clean up forEVER (okay not literally) which
> would really violate our NAPI timeslice.  

Interesting. The only time i have seen the NAPI time slice kick in is in
slow hardware or emulators (like UML). 
I wonder if the pruning path could be made faster? What is the most
consuming item? I realize there will be a substantial amount of cache
misses. Maybe in addition to prunning E1000_TX_WEIGHT descriptors also
fire a timer to clean up the rest (to avoid it being accounted for in
the napi timeslice;->). Essentially i think you have some thing in the
pruning path that needs to be optimized. Profiling and improving that
would help.

> I messed with some values
> and 64 didn't really seem like too bad a compromise (it does slow
> things down just a tad in the general case) while really helping a
> couple of tests where there were lots of outstanding transmits
> happening at the same time as lots of receives.
> 

The later are the kind of tests i am running. If you are a router or a
busy server they apply. In slow machines a ping flood also applies etc. 

> This need for a tx weight is yet another global (design) problem with
> NAPI enabled drivers, 

oh yes, the Intel cabal - blame NAPI first;-> 
IMO, the problem is you are consuming too many cycles in the receive
path. NAPI has to be fair to all netdevices and cant hog all the CPU
because a certain netdevice uses too many cycles to process a packet. 

> but someday I'll try to document some of the issues I've seen.

I think it would be invaluable. Just dont jump to blame Canada^WNAPI
conclusion because it distracts; 

cheers,
jamal

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-07 12:50                                         ` jamal
@ 2006-08-07 15:21                                           ` Edgar E. Iglesias
  2006-08-07 15:40                                             ` jamal
  0 siblings, 1 reply; 85+ messages in thread
From: Edgar E. Iglesias @ 2006-08-07 15:21 UTC (permalink / raw)
  To: jamal
  Cc: Jesse Brandeburg, Herbert Xu, David Miller, shemminger, mchan,
	jesse.brandeburg, auke-jan.h.kok, Edgar E. Iglesias, netdev

On Mon, Aug 07, 2006 at 08:50:36AM -0400, jamal wrote:
> On Sun, 2006-06-08 at 16:16 -0700, Jesse Brandeburg wrote:
> [..]
> > 
> > As for specifics, for TX_WAKE_THRESHOLD, i noticed that we were
> > starting the queue after every packet was cleaned, so when the ring
> > went full there was a lot of queue thrash.
> 
> indeed this is what used to happen and was bad.... So this is a huge
> improvement.
> What happens now under steady state at high traffic transmits is,
> instead of 1, you see E1000_TX_WEIGHT in between queue sleep/wakes. I
> assume this is a given since E1000_TX_WEIGHT is higher than
> TX_WAKE_THRESHOLD.  I am not sure if i can vouch for even more
> improvement by mucking around with values of E1000_TX_WEIGHT.
> 
> Can you please take a look at the patch i posted? I would like to submit
> that for inclusion. It does two things
> a) make pruning available to be invoked from elsewhere (I tried to do it
> from the tx path but it gave me non-good results)
> b) makes E1000_TX_WEIGHT and TX_WAKE_THRESHOLD relative to the size
> of the transmit ring. I think this is a sane thing to do.
> 

Hi Jamal,

I have a question regarding your patch. In clean_tx_irq, it seems you dont
clean the ring unless fdesc < tx_ring->prunet. Won't this cause deadlocks for
local TCP connections if transmit goes quiet?

It seems to me as if this patch depends on the skb orphaning previously 
suggested on this thread. Please correct me if I'm wrong.

Best regards
-- 
        Programmer
        Edgar E. Iglesias <edgar.iglesias@axis.com> 46.46.272.1946

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-07 15:21                                           ` Edgar E. Iglesias
@ 2006-08-07 15:40                                             ` jamal
  2006-08-07 15:59                                               ` Edgar E. Iglesias
  2006-08-07 16:29                                               ` Edgar E. Iglesias
  0 siblings, 2 replies; 85+ messages in thread
From: jamal @ 2006-08-07 15:40 UTC (permalink / raw)
  To: Edgar E. Iglesias
  Cc: Jesse Brandeburg, Herbert Xu, David Miller, shemminger, mchan,
	jesse.brandeburg, auke-jan.h.kok, netdev

On Mon, 2006-07-08 at 17:21 +0200, Edgar E. Iglesias wrote:
[..]
> I have a question regarding your patch. In clean_tx_irq, it seems you dont
> clean the ring unless fdesc < tx_ring->prunet. Won't this cause deadlocks for
> local TCP connections if transmit goes quiet?
> 

I have not tested the TCP case; however, note that the specific part you
reference is commented out. There are no deadlock issues in the case of
forwarding (as in my testcases). 

I did not quiet follow the ensuing discussion after your post:
These descriptors being pruned in the tx path happen only after the
packets have been sent out on the wire. Why would this contribute to a
deadlock but not when it happens on the receive path? It is true that
tcp retransmit queue will still be referencing the skbs, but why is it
any different because in one case it happens in the tx and in the other
on the receive? Is there dependency on waking up the queue?

> It seems to me as if this patch depends on the skb orphaning previously 
> suggested on this thread. Please correct me if I'm wrong.
> 

I didnt quiet follow that discussion I will go back and read it; you
could also answer my questions above to make me understand better.

cheers,
jamal

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-07 15:40                                             ` jamal
@ 2006-08-07 15:59                                               ` Edgar E. Iglesias
  2006-08-07 16:31                                                 ` Jamal Hadi Salim
  2006-08-07 16:29                                               ` Edgar E. Iglesias
  1 sibling, 1 reply; 85+ messages in thread
From: Edgar E. Iglesias @ 2006-08-07 15:59 UTC (permalink / raw)
  To: jamal
  Cc: Jesse Brandeburg, Herbert Xu, David Miller, shemminger, mchan,
	jesse.brandeburg, auke-jan.h.kok, netdev

On Mon, Aug 07, 2006 at 11:40:49AM -0400, jamal wrote:
> 
> On Mon, 2006-07-08 at 17:21 +0200, Edgar E. Iglesias wrote:
> [..]
> > I have a question regarding your patch. In clean_tx_irq, it seems you dont
> > clean the ring unless fdesc < tx_ring->prunet. Won't this cause deadlocks for
> > local TCP connections if transmit goes quiet?
> > 
> 
> I have not tested the TCP case; however, note that the specific part you
> reference is commented out. There are no deadlock issues in the case of
> forwarding (as in my testcases). 

Ok, I thought you wanted the code inside the ifdefs to be considered. If not,
I guess there is no problem. Yes, the forwarding case does not suffer from
any deadlocks issues that I am aware of.

> 
> I did not quiet follow the ensuing discussion after your post:
> These descriptors being pruned in the tx path happen only after the
> packets have been sent out on the wire. Why would this contribute to a
> deadlock but not when it happens on the receive path? It is true that
> tcp retransmit queue will still be referencing the skbs, but why is it
> any different because in one case it happens in the tx and in the other
> on the receive? Is there dependency on waking up the queue?

No, the deadlock happens only if you don't prune the descriptors. If the host
sends some data and then goes quite, fdesc < tx_ring->prunet might not be
true for a long time and skbs will end up sitting in the tx ring indefinitely,
charging the socket's sndbuf.

Best regards
-- 
        Programmer
        Edgar E. Iglesias <edgar.iglesias@axis.com> 46.46.272.1946

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-07 15:40                                             ` jamal
  2006-08-07 15:59                                               ` Edgar E. Iglesias
@ 2006-08-07 16:29                                               ` Edgar E. Iglesias
  2006-08-07 16:36                                                 ` jamal
  1 sibling, 1 reply; 85+ messages in thread
From: Edgar E. Iglesias @ 2006-08-07 16:29 UTC (permalink / raw)
  To: jamal
  Cc: Jesse Brandeburg, Herbert Xu, David Miller, shemminger, mchan,
	jesse.brandeburg, auke-jan.h.kok, netdev

On Mon, Aug 07, 2006 at 11:40:49AM -0400, jamal wrote:
> 
> On Mon, 2006-07-08 at 17:21 +0200, Edgar E. Iglesias wrote:
> [..]
> > I have a question regarding your patch. In clean_tx_irq, it seems you dont
> > clean the ring unless fdesc < tx_ring->prunet. Won't this cause deadlocks for
> > local TCP connections if transmit goes quiet?
> > 
> 
> I have not tested the TCP case; however, note that the specific part you
> reference is commented out. There are no deadlock issues in the case of
> forwarding (as in my testcases). 
> 
> I did not quiet follow the ensuing discussion after your post:
> These descriptors being pruned in the tx path happen only after the
> packets have been sent out on the wire. Why would this contribute to a
> deadlock but not when it happens on the receive path? It is true that
> tcp retransmit queue will still be referencing the skbs, but why is it
> any different because in one case it happens in the tx and in the other
> on the receive? Is there dependency on waking up the queue?
> 

Hi again Jamal,

Not sure if it is doable, but to I'll post the thoughts anyway.

Assuming you would get the code inside the jamal ifdefs working without
deadlocks, you now have a tx_irq function which if fdesc >= tx_ring->prunet
essentially just checks for hw lockups. Let's speculate and further assume you
could do the detect_tx_hung from some other context (timer or whatever) then
you end up having a tx_irq function which most of the time does nothing.

The next step could be to move the fdesc >= tx_ring->prunet logic into the
transmit path and completely disable the tx_irq when the condition is not met.

Now you end up not taking the irq at all as long as fdesc >= tx_ring->prunet.

This was the logic I tried on the cris driver but ended up with deadlocks :)

Best regards
-- 
        Programmer
        Edgar E. Iglesias <edgar.iglesias@axis.com> 46.46.272.1946

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-07 15:59                                               ` Edgar E. Iglesias
@ 2006-08-07 16:31                                                 ` Jamal Hadi Salim
  2006-08-07 17:04                                                   ` Edgar E. Iglesias
  0 siblings, 1 reply; 85+ messages in thread
From: Jamal Hadi Salim @ 2006-08-07 16:31 UTC (permalink / raw)
  To: Edgar E. Iglesias
  Cc: netdev, auke-jan.h.kok, jesse.brandeburg, mchan, shemminger,
	David Miller, Herbert Xu, Jesse Brandeburg

On Mon, 2006-07-08 at 17:59 +0200, Edgar E. Iglesias wrote:

> Ok, I thought you wanted the code inside the ifdefs to be considered. If not,
> I guess there is no problem. Yes, the forwarding case does not suffer from
> any deadlocks issues that I am aware of.
> 

>From my tests:
It does _not_ provide any performance improvements and at some point i decided
i didnt want to add more variables to analyze, so i got rid of it; I would have 
had to hand edit the patch to totally remove it; so that why you still see the 
ifdefed out variant.

> No, the deadlock happens only if you don't prune the descriptors. If the host
> sends some data and then goes quite, fdesc < tx_ring->prunet might not be
> true for a long time and skbs will end up sitting in the tx ring indefinitely,
> charging the socket's sndbuf.
> 

Note: I didnt get rid of the rx path pruning. i.e that is still on. It
just prunes lesser descriptors with that change on the tx. So not very
different from before.

I think i may be getting a gist now of the discussion after a re-read; 
while packets are still charged to TCP may have been transmitted they may sit
on the tx ring forever. They will only be pruned if we had netif_stopped
(and even that is not good enough with Jesse's threshold check) or if a
new packet comes in destined for us. 
Did i understand correctly? If yes, i didnt introduce this challenge it
has always been there. I think i understand the suggestion now from
Dave/Herbert to orphan those skbs... 

cheers,
jamal

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-07 16:29                                               ` Edgar E. Iglesias
@ 2006-08-07 16:36                                                 ` jamal
  0 siblings, 0 replies; 85+ messages in thread
From: jamal @ 2006-08-07 16:36 UTC (permalink / raw)
  To: Edgar E. Iglesias
  Cc: Jesse Brandeburg, Herbert Xu, David Miller, shemminger, mchan,
	jesse.brandeburg, auke-jan.h.kok, netdev

On Mon, 2006-07-08 at 18:29 +0200, Edgar E. Iglesias wrote:

> Assuming you would get the code inside the jamal ifdefs working without
> deadlocks, you now have a tx_irq function which if fdesc >= tx_ring->prunet
> essentially just checks for hw lockups. Let's speculate and further assume you
> could do the detect_tx_hung from some other context (timer or whatever) then
> you end up having a tx_irq function which most of the time does nothing.
> 
> The next step could be to move the fdesc >= tx_ring->prunet logic into the
> transmit path and completely disable the tx_irq when the condition is not met.
> 
> Now you end up not taking the irq at all as long as fdesc >= tx_ring->prunet.
> 
> This was the logic I tried on the cris driver but ended up with deadlocks :)
> 

Like i said in one of my earlier postings (first email i CCed you on),
this specific test i assumed was as close to what you did. But if i
understand what you describe as "deadlock" then we have a slightly
different problem that can only be solved by orphaning the skbs that are
determined to have been put on the DMA ring ....

cheers,
jamal


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-07 16:31                                                 ` Jamal Hadi Salim
@ 2006-08-07 17:04                                                   ` Edgar E. Iglesias
  2006-08-07 18:00                                                     ` jamal
  0 siblings, 1 reply; 85+ messages in thread
From: Edgar E. Iglesias @ 2006-08-07 17:04 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: netdev, auke-jan.h.kok, jesse.brandeburg, mchan, shemminger,
	David Miller, Herbert Xu, Jesse Brandeburg

On Mon, Aug 07, 2006 at 12:31:59PM -0400, Jamal Hadi Salim wrote:
> On Mon, 2006-07-08 at 17:59 +0200, Edgar E. Iglesias wrote:
> 
> > Ok, I thought you wanted the code inside the ifdefs to be considered. If not,
> > I guess there is no problem. Yes, the forwarding case does not suffer from
> > any deadlocks issues that I am aware of.
> > 
> 
> >From my tests:
> It does _not_ provide any performance improvements and at some point i decided
> i didnt want to add more variables to analyze, so i got rid of it; I would have 
> had to hand edit the patch to totally remove it; so that why you still see the 
> ifdefed out variant.
> 
> > No, the deadlock happens only if you don't prune the descriptors. If the host
> > sends some data and then goes quite, fdesc < tx_ring->prunet might not be
> > true for a long time and skbs will end up sitting in the tx ring indefinitely,
> > charging the socket's sndbuf.
> > 
> 
> Note: I didnt get rid of the rx path pruning. i.e that is still on. It
> just prunes lesser descriptors with that change on the tx. So not very
> different from before.
> 
> I think i may be getting a gist now of the discussion after a re-read; 
> while packets are still charged to TCP may have been transmitted they may sit
> on the tx ring forever. They will only be pruned if we had netif_stopped
> (and even that is not good enough with Jesse's threshold check) or if a
> new packet comes in destined for us. 
> Did i understand correctly? If yes, i didnt introduce this challenge it
> has always been there. I think i understand the suggestion now from
> Dave/Herbert to orphan those skbs... 

I'll give you an example.

A TCP flow sends X data and later waits for a response, host is now quietly
waiting. Assume fdesc >= tx_ring->prunet, so we dont free any skbs, right?

Now assume that some part of X data gets lost, our retransmit timer hits and
we want to retransmit but our socket is charged with too much data sitting on
the nics tx-ring, so we don't send anything. By orphaning, those skbs won't
charge the socket and the flow can retransmit.

Best regards
-- 
        Programmer
        Edgar E. Iglesias <edgar.iglesias@axis.com> 46.46.272.1946

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-07 17:04                                                   ` Edgar E. Iglesias
@ 2006-08-07 18:00                                                     ` jamal
  2006-08-07 18:47                                                       ` Edgar E. Iglesias
  0 siblings, 1 reply; 85+ messages in thread
From: jamal @ 2006-08-07 18:00 UTC (permalink / raw)
  To: Edgar E. Iglesias
  Cc: Jesse Brandeburg, Herbert Xu, David Miller, shemminger, mchan,
	jesse.brandeburg, auke-jan.h.kok, netdev

On Mon, 2006-07-08 at 19:04 +0200, Edgar E. Iglesias wrote:

> 
> I'll give you an example.

Thanks - that matches my understanding.

> A TCP flow sends X data and later waits for a response, host is now quietly
> waiting. Assume fdesc >= tx_ring->prunet, so we dont free any skbs, right?
> 

I am hoping they will be freed by a tx interrupt that will force poll to
happen. Or a new packet arrival etc. Just like before. Why do you see
the two as different? (the tx path pruning is still going on as i noted
before). If all you are looking for is a scheme to quickly free the skbs
so that TCP doesnt get charged, I am not sure if this is the right one.

> Now assume that some part of X data gets lost, our retransmit timer hits and
> we want to retransmit but our socket is charged with too much data sitting on
> the nics tx-ring, so we don't send anything. By orphaning, those skbs won't
> charge the socket and the flow can retransmit.

I understand that as well as the dilemma that TCP not being charged for
skbs (if you decide to orphan) it holds in its retransmit queue ;->
Which is not a problem unless that queueu grows to be a huge one ;->

cheers,
jamal


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-07 18:00                                                     ` jamal
@ 2006-08-07 18:47                                                       ` Edgar E. Iglesias
  2006-08-07 19:03                                                         ` jamal
  0 siblings, 1 reply; 85+ messages in thread
From: Edgar E. Iglesias @ 2006-08-07 18:47 UTC (permalink / raw)
  To: jamal
  Cc: Jesse Brandeburg, Herbert Xu, David Miller, shemminger, mchan,
	jesse.brandeburg, auke-jan.h.kok, netdev

On Mon, Aug 07, 2006 at 02:00:24PM -0400, jamal wrote:
> On Mon, 2006-07-08 at 19:04 +0200, Edgar E. Iglesias wrote:
> 
> > 
> > I'll give you an example.
> 
> Thanks - that matches my understanding.
> 
> > A TCP flow sends X data and later waits for a response, host is now quietly
> > waiting. Assume fdesc >= tx_ring->prunet, so we dont free any skbs, right?
> > 
> 
> I am hoping they will be freed by a tx interrupt that will force poll to
> happen. Or a new packet arrival etc. Just like before. Why do you see
> the two as different? (the tx path pruning is still going on as i noted
> before). If all you are looking for is a scheme to quickly free the skbs
> so that TCP doesnt get charged, I am not sure if this is the right one.
> 

I think we are out of sync :) My, fault I haven't been clear enough.

First of all, I don't think the patch with jamal undefined has any problems. I 
assumed wrongly from the start that you somehow wanted that part to go in 
aswell, sorry about that. As you say, the flow goes just as before.

Now, with jamal defined, I only see e1000_prune_tx_ring beeing called if
fdesc < tx_ring->prunet or fdesc < tx_ring->waket. In other words, the freing
of skbs is dependant on external events that might not become true if the
host is quiet. Skb's could end up sitting on the ring indefinitely.

Sorry for the confusion.
-- 
        Programmer
        Edgar E. Iglesias <edgar.iglesias@axis.com> 46.46.272.1946

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-07 18:47                                                       ` Edgar E. Iglesias
@ 2006-08-07 19:03                                                         ` jamal
  2006-08-07 19:14                                                           ` Edgar E. Iglesias
                                                                             ` (2 more replies)
  0 siblings, 3 replies; 85+ messages in thread
From: jamal @ 2006-08-07 19:03 UTC (permalink / raw)
  To: Edgar E. Iglesias
  Cc: Jesse Brandeburg, Herbert Xu, David Miller, shemminger, mchan,
	jesse.brandeburg, auke-jan.h.kok, netdev

On Mon, 2006-07-08 at 20:47 +0200, Edgar E. Iglesias wrote:

> I think we are out of sync :) 

Imagine that, eh? ;->

> My, fault I haven't been clear enough.
> 

Not just your transmit but also my receive is at fault ;-> (aka, I may
not be listening as well as i should). Now two machines or CPUs you
would think wont have this problem since they dont possess minds;->

> First of all, I don't think the patch with jamal undefined has any problems. I 
> assumed wrongly from the start that you somehow wanted that part to go in 
> aswell, sorry about that. As you say, the flow goes just as before.
> 
> Now, with jamal defined, I only see e1000_prune_tx_ring beeing called if
> fdesc < tx_ring->prunet or fdesc < tx_ring->waket. 

Ok, thats the code that has been commented out, no? i.e there is no
fdesc otherwise.

> In other words, the freing
> of skbs is dependant on external events that might not become true if the
> host is quiet. Skb's could end up sitting on the ring indefinitely.
> 

Yes, this has _always_ been true. In the patch i posted it merely
converted things, example:

-#define E1000_TX_WEIGHT 64
-               /* weight of a sort for tx, to avoid endless transmit
cleanup */
-               if (count++ == E1000_TX_WEIGHT) break;
+               /* avoid endless transmit cleanup */
+               if (count++ == tx_ring->prunet) break;

As you can see E1000_TX_WEIGHT threshold exists today and you are right
if no TX interupts, packet arrivals or scheduled wakes happen the that
descriptor that was not pruned will sit there forever (which is a bad
thing for TCP). Are we in sync?
If yes, what is the likelihood they will sit there forever? I think
perhaps some TX interupts will happen, no?

cheers,
jamal

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-07 19:03                                                         ` jamal
@ 2006-08-07 19:14                                                           ` Edgar E. Iglesias
  2006-08-07 19:34                                                             ` jamal
  2006-08-07 20:53                                                           ` Brandeburg, Jesse
  2006-08-07 23:23                                                           ` Herbert Xu
  2 siblings, 1 reply; 85+ messages in thread
From: Edgar E. Iglesias @ 2006-08-07 19:14 UTC (permalink / raw)
  To: jamal
  Cc: Jesse Brandeburg, Herbert Xu, David Miller, shemminger, mchan,
	jesse.brandeburg, auke-jan.h.kok, netdev

On Mon, Aug 07, 2006 at 03:03:33PM -0400, jamal wrote:
> On Mon, 2006-07-08 at 20:47 +0200, Edgar E. Iglesias wrote:
> 
> > I think we are out of sync :) 
> 
> Imagine that, eh? ;->
> 
> > My, fault I haven't been clear enough.
> > 
> 
> Not just your transmit but also my receive is at fault ;-> (aka, I may
> not be listening as well as i should). Now two machines or CPUs you
> would think wont have this problem since they dont possess minds;->
> 
> > First of all, I don't think the patch with jamal undefined has any problems. I 
> > assumed wrongly from the start that you somehow wanted that part to go in 
> > aswell, sorry about that. As you say, the flow goes just as before.
> > 
> > Now, with jamal defined, I only see e1000_prune_tx_ring beeing called if
> > fdesc < tx_ring->prunet or fdesc < tx_ring->waket. 
> 
> Ok, thats the code that has been commented out, no? i.e there is no
> fdesc otherwise.

Exactly.

> 
> > In other words, the freing
> > of skbs is dependant on external events that might not become true if the
> > host is quiet. Skb's could end up sitting on the ring indefinitely.
> > 
> 
> Yes, this has _always_ been true. In the patch i posted it merely
> converted things, example:
> 
> -#define E1000_TX_WEIGHT 64
> -               /* weight of a sort for tx, to avoid endless transmit
> cleanup */
> -               if (count++ == E1000_TX_WEIGHT) break;
> +               /* avoid endless transmit cleanup */
> +               if (count++ == tx_ring->prunet) break;
> 
> As you can see E1000_TX_WEIGHT threshold exists today and you are right
> if no TX interupts, packet arrivals or scheduled wakes happen the that
> descriptor that was not pruned will sit there forever (which is a bad
> thing for TCP). Are we in sync?

Yep :)

> If yes, what is the likelihood they will sit there forever? I think
> perhaps some TX interupts will happen, no?

with jamal undefined, absolutely. With jamal defined, TX interrupts will come
but I couldnt find a way into e1000_prune_tx_ring unless fdesc met the 
conditions. Correct? 

Best regards
-- 
        Programmer
        Edgar E. Iglesias <edgar.iglesias@axis.com> 46.46.272.1946

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-07 19:14                                                           ` Edgar E. Iglesias
@ 2006-08-07 19:34                                                             ` jamal
  2006-08-07 20:28                                                               ` Edgar E. Iglesias
  0 siblings, 1 reply; 85+ messages in thread
From: jamal @ 2006-08-07 19:34 UTC (permalink / raw)
  To: Edgar E. Iglesias
  Cc: Jesse Brandeburg, Herbert Xu, David Miller, shemminger, mchan,
	jesse.brandeburg, auke-jan.h.kok, netdev

On Mon, 2006-07-08 at 21:14 +0200, Edgar E. Iglesias wrote:

> > If yes, what is the likelihood they will sit there forever? I think
> > perhaps some TX interupts will happen, no?
> 
> with jamal undefined, absolutely. With jamal defined, TX interrupts will come
> but I couldnt find a way into e1000_prune_tx_ring unless fdesc met the 
> conditions. Correct? 
> 

Forgive me since i am still missing something ..

Observe that the same threshold used in two different ways:

1) in tx path tx_ring->prunet is to check on when we should _start_ to
prune.
2) on rx path tx_ring->prunet is to check when to _stop_ pruning.

i.e #1 is a preemptive action.

You seem to suggest doing it the way i was it made things worse?

Note that TX interrupts will happen as long as you dont prune some
descriptors (I am assuming this, I havent checked the settings).

cheers,
jamal

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-07 19:34                                                             ` jamal
@ 2006-08-07 20:28                                                               ` Edgar E. Iglesias
  2006-08-08  0:52                                                                 ` jamal
  0 siblings, 1 reply; 85+ messages in thread
From: Edgar E. Iglesias @ 2006-08-07 20:28 UTC (permalink / raw)
  To: jamal
  Cc: Jesse Brandeburg, Herbert Xu, David Miller, shemminger, mchan,
	jesse.brandeburg, auke-jan.h.kok, netdev

On Mon, Aug 07, 2006 at 03:34:30PM -0400, jamal wrote:
> On Mon, 2006-07-08 at 21:14 +0200, Edgar E. Iglesias wrote:
> 
> > > If yes, what is the likelihood they will sit there forever? I think
> > > perhaps some TX interupts will happen, no?
> > 
> > with jamal undefined, absolutely. With jamal defined, TX interrupts will come
> > but I couldnt find a way into e1000_prune_tx_ring unless fdesc met the 
> > conditions. Correct? 
> > 
> 
> Forgive me since i am still missing something ..
> 
> Observe that the same threshold used in two different ways:
> 
> 1) in tx path tx_ring->prunet is to check on when we should _start_ to
> prune.
> 2) on rx path tx_ring->prunet is to check when to _stop_ pruning.
> 

I can see two calls to e1000_prune_tx_ring with jamal _defined_.

1. tx path
+#ifdef jamal
+       {
+               int fdesc = E1000_DESC_UNUSED(tx_ring);
+               if (unlikely(fdesc < tx_ring->waket))
+                       e1000_prune_tx_ring(adapter,tx_ring);
+       }
+#endif

2. tx and rx path
+#ifdef jamal
+       spin_lock(&tx_ring->tx_lock);
+       {
+               int fdesc = E1000_DESC_UNUSED(tx_ring);
+               if (fdesc < tx_ring->prunet) {
+                       if (e1000_prune_tx_ring(adapter,tx_ring))
+                               cleaned = TRUE;
                }
        }
+       spin_unlock(&tx_ring->tx_lock);
+#else
+       if (e1000_prune_tx_ring(adapter,tx_ring))
+               cleaned = TRUE;
+#endif

Assume a ring of 64 entries, prunet of 16, waket of 8. Now host sends 40 skbs
and stops. tx-ring holds 40 skbs, has 24 free. TX interrupts hit you, you may
even be receiveing packets but I don't see how you enter prune_tx_ring without
more packets going out via hard_start_xmit? skb's will sit on the ring until 
more packets are sent from the quiet host.

As you can see, with jamal _undefined_ e1000_prune_tx_ring is called
unconditionally and I beleive things will work ok.

I am not familiar with this code nor the hw so I'm probably missing something
fundamental. 

Best regards
-- 
        Programmer
        Edgar E. Iglesias <edgar.iglesias@axis.com> 46.46.272.1946

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-07 19:03                                                         ` jamal
  2006-08-07 19:14                                                           ` Edgar E. Iglesias
@ 2006-08-07 20:53                                                           ` Brandeburg, Jesse
  2006-08-08  1:07                                                             ` jamal
  2006-08-07 23:23                                                           ` Herbert Xu
  2 siblings, 1 reply; 85+ messages in thread
From: Brandeburg, Jesse @ 2006-08-07 20:53 UTC (permalink / raw)
  To: jamal
  Cc: Edgar E. Iglesias, Jesse Brandeburg, Herbert Xu, David Miller,
	shemminger, mchan, jesse.brandeburg, auke-jan.h.kok, netdev

On Mon, 7 Aug 2006, jamal wrote:
> -#define E1000_TX_WEIGHT 64
> -               /* weight of a sort for tx, to avoid endless transmit
> cleanup */
> -               if (count++ == E1000_TX_WEIGHT) break;
> +               /* avoid endless transmit cleanup */
> +               if (count++ == tx_ring->prunet) break;
> 
> As you can see E1000_TX_WEIGHT threshold exists today and you are right
> if no TX interupts, packet arrivals or scheduled wakes happen the that
> descriptor that was not pruned will sit there forever (which is a bad
> thing for TCP). Are we in sync?
> If yes, what is the likelihood they will sit there forever? I think
> perhaps some TX interupts will happen, no?

we don't enable it right now, but you could use the TXQE (tx queue empty) 
interrupt to avoid the starvation case.  I think it might flood you with 
TXQE interrupts however, so we'd probably have to figure out some way to 
turn it on occasionally.

Jesse

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-07 19:03                                                         ` jamal
  2006-08-07 19:14                                                           ` Edgar E. Iglesias
  2006-08-07 20:53                                                           ` Brandeburg, Jesse
@ 2006-08-07 23:23                                                           ` Herbert Xu
  2006-08-07 23:35                                                             ` Brandeburg, Jesse
  2 siblings, 1 reply; 85+ messages in thread
From: Herbert Xu @ 2006-08-07 23:23 UTC (permalink / raw)
  To: jamal
  Cc: Edgar E. Iglesias, Jesse Brandeburg, David Miller, shemminger,
	mchan, jesse.brandeburg, auke-jan.h.kok, netdev

On Mon, Aug 07, 2006 at 03:03:33PM -0400, jamal wrote:
> 
> -#define E1000_TX_WEIGHT 64
> -               /* weight of a sort for tx, to avoid endless transmit
> cleanup */
> -               if (count++ == E1000_TX_WEIGHT) break;
> +               /* avoid endless transmit cleanup */
> +               if (count++ == tx_ring->prunet) break;
> 
> As you can see E1000_TX_WEIGHT threshold exists today and you are right
> if no TX interupts, packet arrivals or scheduled wakes happen the that
> descriptor that was not pruned will sit there forever (which is a bad
> thing for TCP). Are we in sync?
> If yes, what is the likelihood they will sit there forever? I think
> perhaps some TX interupts will happen, no?

I thought this code is only used for NAPI so as long as work was done
it'll keep calling this.

One thing I'm not sure about though is the time between it decides that
there is no work and the point where the interrupts are reenabled.

What if work arrives in that time and no work ever arrives after the
interrupts are turned on again? Does that mean the work will sit there
forever?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-07 23:23                                                           ` Herbert Xu
@ 2006-08-07 23:35                                                             ` Brandeburg, Jesse
  2006-08-07 23:40                                                               ` Herbert Xu
  0 siblings, 1 reply; 85+ messages in thread
From: Brandeburg, Jesse @ 2006-08-07 23:35 UTC (permalink / raw)
  To: Herbert Xu
  Cc: jamal, Edgar E. Iglesias, Jesse Brandeburg, David Miller,
	shemminger, mchan, jesse.brandeburg, auke-jan.h.kok, netdev

On Tue, 8 Aug 2006, Herbert Xu wrote:
> > -#define E1000_TX_WEIGHT 64
> > -               /* weight of a sort for tx, to avoid endless transmit
> > cleanup */
> > -               if (count++ == E1000_TX_WEIGHT) break;
> > +               /* avoid endless transmit cleanup */
> > +               if (count++ == tx_ring->prunet) break;
> > 
> > As you can see E1000_TX_WEIGHT threshold exists today and you are right
> > if no TX interupts, packet arrivals or scheduled wakes happen the that
> > descriptor that was not pruned will sit there forever (which is a bad
> > thing for TCP). Are we in sync?
> > If yes, what is the likelihood they will sit there forever? I think
> > perhaps some TX interupts will happen, no?
> 
> I thought this code is only used for NAPI so as long as work was done
> it'll keep calling this.

yes, you're correct.
 
> One thing I'm not sure about though is the time between it decides that
> there is no work and the point where the interrupts are reenabled.

e1000 only clears the interrupts when it reads ICR in e1000_intr (before 
scheduling napi poll) so any interrupts that occur while polling (and 
interrupts are disabled) will cause a new assertion once interrupts are 
re-enabled.  Sometimes a little bit inefficient due to extra trips through 
poll, but guarantees never to miss an int.  I'm open to creative ways to 
avoid this, but adding an I/O read in e1000_clean would be pretty yucky.

> What if work arrives in that time and no work ever arrives after the
> interrupts are turned on again? Does that mean the work will sit there
> forever?

nope, see above.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-07 23:35                                                             ` Brandeburg, Jesse
@ 2006-08-07 23:40                                                               ` Herbert Xu
  0 siblings, 0 replies; 85+ messages in thread
From: Herbert Xu @ 2006-08-07 23:40 UTC (permalink / raw)
  To: Brandeburg, Jesse
  Cc: jamal, Edgar E. Iglesias, Jesse Brandeburg, David Miller,
	shemminger, mchan, auke-jan.h.kok, netdev

On Mon, Aug 07, 2006 at 04:35:36PM -0700, Brandeburg, Jesse wrote:
> 
> e1000 only clears the interrupts when it reads ICR in e1000_intr (before 
> scheduling napi poll) so any interrupts that occur while polling (and 
> interrupts are disabled) will cause a new assertion once interrupts are 
> re-enabled.  Sometimes a little bit inefficient due to extra trips through 
> poll, but guarantees never to miss an int.  I'm open to creative ways to 
> avoid this, but adding an I/O read in e1000_clean would be pretty yucky.

The standard solution in Linux is to clear and recheck.  So just before
you reenable the interrupts you'd clear pending interrupts again and
check for rx/tx work, if there is work then you just go back to polling.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-07 20:28                                                               ` Edgar E. Iglesias
@ 2006-08-08  0:52                                                                 ` jamal
  0 siblings, 0 replies; 85+ messages in thread
From: jamal @ 2006-08-08  0:52 UTC (permalink / raw)
  To: Edgar E. Iglesias
  Cc: Jesse Brandeburg, Herbert Xu, David Miller, shemminger, mchan,
	jesse.brandeburg, auke-jan.h.kok, netdev

On Mon, 2006-07-08 at 22:28 +0200, Edgar E. Iglesias wrote:

> 
> I can see two calls to e1000_prune_tx_ring with jamal _defined_.
> 

Ok, never mind - I think thats the source of our confusion. Just ignore
those entries since they are ifdefed out (The second one i dont know how
it made it in since there are no traces of it in my tree).

cheers,
jamal


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-07 20:53                                                           ` Brandeburg, Jesse
@ 2006-08-08  1:07                                                             ` jamal
  0 siblings, 0 replies; 85+ messages in thread
From: jamal @ 2006-08-08  1:07 UTC (permalink / raw)
  To: Brandeburg, Jesse
  Cc: Edgar E. Iglesias, Jesse Brandeburg, Herbert Xu, David Miller,
	shemminger, mchan, auke-jan.h.kok, netdev

On Mon, 2006-07-08 at 13:53 -0700, Brandeburg, Jesse wrote:

> we don't enable it right now, but you could use the TXQE (tx queue empty) 
> interrupt to avoid the starvation case.  I think it might flood you with 
> TXQE interrupts however, so we'd probably have to figure out some way to 
> turn it on occasionally.
> 

I think it may make sense to set on netif_stop and reset on wake. I take
it "empty" means all the descriptors for sending are gone?

cheers,
jamal


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-06 12:24                                   ` jamal
  2006-08-06 12:33                                     ` jamal
  2006-08-06 19:22                                     ` jamal
@ 2006-08-08  1:19                                     ` jamal
  2006-08-08  1:22                                       ` Herbert Xu
  2 siblings, 1 reply; 85+ messages in thread
From: jamal @ 2006-08-08  1:19 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, shemminger, mchan, jesse.brandeburg, auke-jan.h.kok,
	Edgar E. Iglesias, netdev

[-- Attachment #1: Type: text/plain, Size: 1304 bytes --]

On Sun, 2006-06-08 at 08:24 -0400, jamal wrote:
> On Sun, 2006-06-08 at 12:51 +1000, Herbert Xu wrote:
> > On Sat, Aug 05, 2006 at 07:36:50PM -0400, jamal wrote:
> > > 
> > > I know the qlen is >= 0 otherwise i will hit the BUG().
> > > A qlen of 0 will be interesting to find as well.
> > 
> > When the queue is woken it will always to qdisc_run regardless of
> > whether there are packets queued so this might explain what you're
> > seeing.
> > 
> 
> That aligns with what i was thinking as well - i.e what i referred to as
> possibly enthusiasm on part of the tx softirq.
> i.e if there was nothing queued, why is the softirq even woken up?
>
> Note, I observed this behavior to be a lot worse on systems that had
> very little traffic going out. Anyways, I hope to find out.
> 

The reason for this is _mostly_ not the tx softirq as i found out;->
It is actually a result of dev_queue_xmit.
The only way to tell if you can depart qdisc_is_running is by asking the
qdisc for a packet. So if you have slow traffic, its guaranteed that you
enter once to process a packet and the second time to find out that you
have to go away ;-> 

In any case, heres a patch that i believe saves some cycles in case of
a tx scheduling while some CPU is already processing the qdisc. Minimal
testing.

cheers,
jamal


[-- Attachment #2: qir-p --]
[-- Type: text/x-patch, Size: 466 bytes --]

diff --git a/net/core/dev.c b/net/core/dev.c
index d95e262..977e77e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1092,7 +1092,8 @@ static void dev_queue_xmit_nit(struct sk
 
 void __netif_schedule(struct net_device *dev)
 {
-	if (!test_and_set_bit(__LINK_STATE_SCHED, &dev->state)) {
+	if (!test_bit(__LINK_STATE_QDISC_RUNNING, &dev->state) &&
+	    !test_and_set_bit(__LINK_STATE_SCHED, &dev->state)) {
 		unsigned long flags;
 		struct softnet_data *sd;
 

^ permalink raw reply related	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-08  1:19                                     ` jamal
@ 2006-08-08  1:22                                       ` Herbert Xu
  2006-08-08  1:33                                         ` jamal
  0 siblings, 1 reply; 85+ messages in thread
From: Herbert Xu @ 2006-08-08  1:22 UTC (permalink / raw)
  To: jamal
  Cc: David Miller, shemminger, mchan, jesse.brandeburg, auke-jan.h.kok,
	Edgar E. Iglesias, netdev

On Mon, Aug 07, 2006 at 09:19:31PM -0400, jamal wrote:
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index d95e262..977e77e 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1092,7 +1092,8 @@ static void dev_queue_xmit_nit(struct sk
>  
>  void __netif_schedule(struct net_device *dev)
>  {
> -	if (!test_and_set_bit(__LINK_STATE_SCHED, &dev->state)) {
> +	if (!test_bit(__LINK_STATE_QDISC_RUNNING, &dev->state) &&
> +	    !test_and_set_bit(__LINK_STATE_SCHED, &dev->state)) {
>  		unsigned long flags;
>  		struct softnet_data *sd;

I'm not sure if this is safe.  What if the other side clears QDISC_RUNNING
right after you test it here? Wouldn't you miss the schedule altogether?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-08  1:22                                       ` Herbert Xu
@ 2006-08-08  1:33                                         ` jamal
  2006-08-08  2:17                                           ` Herbert Xu
  0 siblings, 1 reply; 85+ messages in thread
From: jamal @ 2006-08-08  1:33 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, shemminger, mchan, jesse.brandeburg, auke-jan.h.kok,
	Edgar E. Iglesias, netdev

On Tue, 2006-08-08 at 11:22 +1000, Herbert Xu wrote:

> I'm not sure if this is safe.  What if the other side clears QDISC_RUNNING
> right after you test it here? Wouldn't you miss the schedule altogether?
> 

Is there a need for a schedule if that happens? i.e assuming that the
clearing happens because there are no more packets to process. 
OTOH, if the other side sets the QDISC_RUNNING right after the test,
then qdisc is running will be entered.

cheers,
jamal


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-08  1:33                                         ` jamal
@ 2006-08-08  2:17                                           ` Herbert Xu
  2006-08-08  3:10                                             ` jamal
  0 siblings, 1 reply; 85+ messages in thread
From: Herbert Xu @ 2006-08-08  2:17 UTC (permalink / raw)
  To: jamal
  Cc: David Miller, shemminger, mchan, jesse.brandeburg, auke-jan.h.kok,
	Edgar E. Iglesias, netdev

On Mon, Aug 07, 2006 at 09:33:52PM -0400, jamal wrote:
> 
> Is there a need for a schedule if that happens? i.e assuming that the
> clearing happens because there are no more packets to process. 
> OTOH, if the other side sets the QDISC_RUNNING right after the test,
> then qdisc is running will be entered.

CPU0				CPU1
__qdisc_run
	netif_queue_stopped
				netif_wake_queue
					__netif_schedule
						test_bit
	clear_bit

So you may end up losing that wake event.  The QDISC_RUNNING bit was
originally designed to be used under the qdisc lock, so using it
in a different contexts will require new synchronisation mechanisms.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-08  2:17                                           ` Herbert Xu
@ 2006-08-08  3:10                                             ` jamal
  0 siblings, 0 replies; 85+ messages in thread
From: jamal @ 2006-08-08  3:10 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, shemminger, mchan, jesse.brandeburg, auke-jan.h.kok,
	Edgar E. Iglesias, netdev

On Tue, 2006-08-08 at 12:17 +1000, Herbert Xu wrote:

> CPU0				CPU1
> __qdisc_run
> 	netif_queue_stopped
> 				netif_wake_queue
> 					__netif_schedule
> 						test_bit
> 	clear_bit
> 
> So you may end up losing that wake event.  

Indeed you may.
I think what i was seeing is (motivating the patch to begin with) is:
if you got out of qdisc restart because driver returned TX_BUSY
leading to a netif_queue_stopped then you will have rescheduled on CPU0.
The only exception is the one example i mentioned in earlier email where
a TX_OK is returned by e1000. 

> The QDISC_RUNNING bit was
> originally designed to be used under the qdisc lock, so using it
> in a different contexts will require new synchronisation mechanisms.

Agreed.

cheers,
jamal



^ permalink raw reply	[flat|nested] 85+ messages in thread

* RE: [PATCH] [e1000]: Remove unnecessary tx_lock
@ 2006-08-08  5:43 Brandeburg, Jesse
  0 siblings, 0 replies; 85+ messages in thread
From: Brandeburg, Jesse @ 2006-08-08  5:43 UTC (permalink / raw)
  To: hadi
  Cc: Edgar E. Iglesias, Jesse Brandeburg, Herbert Xu, David Miller,
	shemminger, mchan, Kok, Auke-jan H, netdev

> -----Original Message-----
> From: jamal [mailto:hadi@cyberus.ca]
> On Mon, 2006-07-08 at 13:53 -0700, Brandeburg, Jesse wrote:
> 
> > we don't enable it right now, but you could use the TXQE (tx queue
> empty)
> > interrupt to avoid the starvation case.  I think it might flood you
with
> > TXQE interrupts however, so we'd probably have to figure out some
way to
> > turn it on occasionally.
> >
> 
> I think it may make sense to set on netif_stop and reset on wake. I
take
> it "empty" means all the descriptors for sending are gone?

We can try it.  Empty means that the adapter has completed all transmit
work and has no more to do (register TDH == TDT)

Jesse

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-06  2:51                                 ` Herbert Xu
  2006-08-06  7:14                                   ` Edgar E. Iglesias
  2006-08-06 12:24                                   ` jamal
@ 2006-08-08 12:21                                   ` jamal
  2006-08-08 12:39                                     ` Herbert Xu
  2 siblings, 1 reply; 85+ messages in thread
From: jamal @ 2006-08-08 12:21 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, shemminger, mchan, jesse.brandeburg, auke-jan.h.kok,
	netdev


Is the patch below making it in? As i pointed out in another email
tests reveal it is good.

cheers,
jamal

On Sun, 2006-06-08 at 12:51 +1000, Herbert Xu wrote:
> On Sat, Aug 05, 2006 at 07:36:50PM -0400, jamal wrote:
> > 
> > I know the qlen is >= 0 otherwise i will hit the BUG().
> > A qlen of 0 will be interesting to find as well.
> 
> When the queue is woken it will always to qdisc_run regardless of
> whether there are packets queued so this might explain what you're
> seeing.
> 
> Anyway, I've changed the unconditional atomic op into a memory
> barrier instead.  Let me know if this changes things for you.
> 
> I wonder if we could do the TX clean up within the transmission
> routine most of the time.  Has anyone tried this before?
> 
> Cheers,
> -- 
> Visit Openswan at http://www.openswan.org/
> Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
> Home Page: http://gondor.apana.org.au/~herbert/
> PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
> --
> diff --git a/drivers/net/e1000/e1000_main.c b/drivers/net/e1000/e1000_main.c
> index 627f224..e7e1306 100644
> --- a/drivers/net/e1000/e1000_main.c
> +++ b/drivers/net/e1000/e1000_main.c
> @@ -2861,6 +2861,33 @@ e1000_transfer_dhcp_info(struct e1000_ad
>  	return 0;
>  }
>  
> +static int __e1000_maybe_stop_tx(struct net_device *netdev, int size)
> +{
> +	struct e1000_adapter *adapter = netdev_priv(netdev);
> +	struct e1000_tx_ring *tx_ring = adapter->tx_ring;
> +
> +	netif_stop_queue(netdev);
> +	smp_mb__after_netif_stop_queue();
> +
> +	/* We need to check again in a case another CPU has just
> +	 * made room available.
> +	 */
> +	if (likely(E1000_DESC_UNUSED(tx_ring) < size))
> +		return -EBUSY;
> +
> +	/* A reprieve! */
> +	netif_start_queue(netdev);
> +	return 0;
> +}
> +
> +static inline int e1000_maybe_stop_tx(struct net_device *netdev,
> +				      struct e1000_tx_ring *tx_ring, int size)
> +{
> +	if (likely(E1000_DESC_UNUSED(tx_ring) >= size))
> +		return 0;
> +	return __e1000_maybe_stop_tx(netdev, size);
> +}
> +
>  #define TXD_USE_COUNT(S, X) (((S) >> (X)) + 1 )
>  static int
>  e1000_xmit_frame(struct sk_buff *skb, struct net_device *netdev)
> @@ -2974,8 +3001,7 @@ #endif
>  
>  	/* need: count + 2 desc gap to keep tail from touching
>  	 * head, otherwise try next time */
> -	if (unlikely(E1000_DESC_UNUSED(tx_ring) < count + 2)) {
> -		netif_stop_queue(netdev);
> +	if (unlikely(e1000_maybe_stop_tx(netdev, tx_ring, count + 2))) {
>  		spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
>  		return NETDEV_TX_BUSY;
>  	}
> @@ -3022,8 +3048,7 @@ #endif
>  	netdev->trans_start = jiffies;
>  
>  	/* Make sure there is space in the ring for the next send. */
> -	if (unlikely(E1000_DESC_UNUSED(tx_ring) < MAX_SKB_FRAGS + 2))
> -		netif_stop_queue(netdev);
> +	e1000_maybe_stop_tx(netdev, tx_ring, MAX_SKB_FRAGS + 2);
>  
>  	spin_unlock_irqrestore(&tx_ring->tx_lock, flags);
>  	return NETDEV_TX_OK;
> @@ -3515,13 +3540,14 @@ #endif
>  	tx_ring->next_to_clean = i;
>  
>  #define TX_WAKE_THRESHOLD 32
> -	if (unlikely(cleaned && netif_queue_stopped(netdev) &&
> -	             netif_carrier_ok(netdev))) {
> -		spin_lock(&tx_ring->tx_lock);
> -		if (netif_queue_stopped(netdev) &&
> -		    (E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD))
> +	if (unlikely(cleaned && netif_carrier_ok(netdev) &&
> +		     E1000_DESC_UNUSED(tx_ring) >= TX_WAKE_THRESHOLD)) {
> +		/* Make sure that anybody stopping the queue after this
> +		 * sees the new next_to_clean.
> +		 */
> +		smp_mb();
> +		if (netif_queue_stopped(netdev))
>  			netif_wake_queue(netdev);
> -		spin_unlock(&tx_ring->tx_lock);
>  	}
>  
>  	if (adapter->detect_tx_hung) {
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 75f02d8..7daaf71 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -33,6 +33,7 @@ #ifdef __KERNEL__
>  #include <asm/atomic.h>
>  #include <asm/cache.h>
>  #include <asm/byteorder.h>
> +#include <asm/system.h>
>  
>  #include <linux/device.h>
>  #include <linux/percpu.h>
> @@ -652,6 +653,12 @@ #endif
>  	set_bit(__LINK_STATE_XOFF, &dev->state);
>  }
>  
> +static inline void smp_mb__after_netif_stop_queue(void)
> +{
> +	/* Need to add smp_mb__after_set_bit(). */
> +	smp_mb();
> +}
> +
>  static inline int netif_queue_stopped(const struct net_device *dev)
>  {
>  	return test_bit(__LINK_STATE_XOFF, &dev->state);
> -
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> 


^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-08 12:21                                   ` jamal
@ 2006-08-08 12:39                                     ` Herbert Xu
  0 siblings, 0 replies; 85+ messages in thread
From: Herbert Xu @ 2006-08-08 12:39 UTC (permalink / raw)
  To: jamal
  Cc: David Miller, shemminger, mchan, jesse.brandeburg, auke-jan.h.kok,
	netdev

On Tue, Aug 08, 2006 at 08:21:58AM -0400, jamal wrote:
> 
> Is the patch below making it in? As i pointed out in another email
> tests reveal it is good.

I'll write up a proper description and submit it.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-04 23:31                     ` David Miller
  2006-08-05 16:56                       ` jamal
@ 2006-08-08 17:04                       ` Benjamin LaHaise
  2006-08-08 22:06                         ` David Miller
  1 sibling, 1 reply; 85+ messages in thread
From: Benjamin LaHaise @ 2006-08-08 17:04 UTC (permalink / raw)
  To: David Miller
  Cc: shemminger, mchan, herbert, hadi, jesse.brandeburg,
	auke-jan.h.kok, netdev

On Fri, Aug 04, 2006 at 04:31:11PM -0700, David Miller wrote:
> Yes, it's meant to catch unintented races.
> 
> The queueing layer that calls ->hard_start_xmit() technically has no
> need to support NETDEV_TX_BUSY as a return value, since the device
> is able to prevent this.
> 
> If we could avoid NETDEV_TX_BUSY et al. from happening, the idea is
> that we could eliminate all of the requeueing logic in the packet
> scheduler layer.  It is a pipe dream from many years ago.

Maybe the way NETDEV_TX_BUSY is implemented is wrong -- that is, it should 
be possible to return a result saying "we sent the packet, but the queue is 
now full".  That would eliminate the atomic op that netif_queue_stop() 
performs and allow higher layers to merge the necessary barrier on the full 
case with the op on the tx lock.  That way we could have lockless queue 
handling within the driver.  This might need start/stop sequence counters, 
though.

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-08 17:04                       ` Benjamin LaHaise
@ 2006-08-08 22:06                         ` David Miller
  2006-08-08 23:21                           ` Benjamin LaHaise
  0 siblings, 1 reply; 85+ messages in thread
From: David Miller @ 2006-08-08 22:06 UTC (permalink / raw)
  To: bcrl
  Cc: shemminger, mchan, herbert, hadi, jesse.brandeburg,
	auke-jan.h.kok, netdev

From: Benjamin LaHaise <bcrl@kvack.org>
Date: Tue, 8 Aug 2006 13:04:32 -0400

> Maybe the way NETDEV_TX_BUSY is implemented is wrong -- that is, it should 
> be possible to return a result saying "we sent the packet, but the queue is 
> now full".  That would eliminate the atomic op that netif_queue_stop() 
> performs and allow higher layers to merge the necessary barrier on the full 
> case with the op on the tx lock.  That way we could have lockless queue 
> handling within the driver.  This might need start/stop sequence counters, 
> though.

The driver ->hard_start_xmit() method is invoked with the queue
unlocked, so this kind of scheme would not be workable.

While the queue is unlocked, another cpu could empty entries in the
TX queue making it not-full, which invalidates this "queue is now
full" return value the driver just gave.

We don't do things the way we do it now for fun, it's just the most
reasonable scheme we've come up with given the locking constraints.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-08 22:06                         ` David Miller
@ 2006-08-08 23:21                           ` Benjamin LaHaise
  2006-08-09  0:25                             ` Herbert Xu
  0 siblings, 1 reply; 85+ messages in thread
From: Benjamin LaHaise @ 2006-08-08 23:21 UTC (permalink / raw)
  To: David Miller
  Cc: shemminger, mchan, herbert, hadi, jesse.brandeburg,
	auke-jan.h.kok, netdev

On Tue, Aug 08, 2006 at 03:06:07PM -0700, David Miller wrote:
> The driver ->hard_start_xmit() method is invoked with the queue
> unlocked, so this kind of scheme would not be workable.

Looking at e1000, NETIF_F_LLTX is a waste -- the driver takes a spinlock 
almost immediately after entering.  Taking the spinlock higher up in the 
network stack, preferably for multiple packets, would at least let the 
CPU deal with things sooner.  tg3 doesn't use LLTX and thus holds the lock 
over hard_start_xmit().  bonding is atrocious and uses a read lock and unlock 
in many of the xmit routines.  The only true lockless tx seems to be 
loopback.

> While the queue is unlocked, another cpu could empty entries in the
> TX queue making it not-full, which invalidates this "queue is now
> full" return value the driver just gave.
> 
> We don't do things the way we do it now for fun, it's just the most
> reasonable scheme we've come up with given the locking constraints.

Given that the vast majority of drivers are locked during their xmit 
routine, it is probably worth implementing at some point.

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-08 23:21                           ` Benjamin LaHaise
@ 2006-08-09  0:25                             ` Herbert Xu
  2006-08-09  1:25                               ` Benjamin LaHaise
  0 siblings, 1 reply; 85+ messages in thread
From: Herbert Xu @ 2006-08-09  0:25 UTC (permalink / raw)
  To: Benjamin LaHaise
  Cc: David Miller, shemminger, mchan, hadi, jesse.brandeburg,
	auke-jan.h.kok, netdev

On Tue, Aug 08, 2006 at 07:21:08PM -0400, Benjamin LaHaise wrote:
> 
> Looking at e1000, NETIF_F_LLTX is a waste -- the driver takes a spinlock 
> almost immediately after entering.  Taking the spinlock higher up in the 
> network stack, preferably for multiple packets, would at least let the 
> CPU deal with things sooner.  tg3 doesn't use LLTX and thus holds the lock 
> over hard_start_xmit().  bonding is atrocious and uses a read lock and unlock 
> in many of the xmit routines.  The only true lockless tx seems to be 
> loopback.

Your assessment is correct.  However, that's not the issue here.

The problem here is that the TX clean function does not take the lock
(nor do we want it to).  It can thus come in while you're transmitting
and empty the queue.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 85+ messages in thread

* Re: [PATCH] [e1000]: Remove unnecessary tx_lock
  2006-08-09  0:25                             ` Herbert Xu
@ 2006-08-09  1:25                               ` Benjamin LaHaise
  0 siblings, 0 replies; 85+ messages in thread
From: Benjamin LaHaise @ 2006-08-09  1:25 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, shemminger, mchan, hadi, jesse.brandeburg,
	auke-jan.h.kok, netdev

On Wed, Aug 09, 2006 at 10:25:30AM +1000, Herbert Xu wrote:
> The problem here is that the TX clean function does not take the lock
> (nor do we want it to).  It can thus come in while you're transmitting
> and empty the queue.

That can be solved with sequence numbers -- ie, we keep track of the number 
of packets queued to the hardware, as well as tracking the number of tx 
completions that have been reaped.  Because those numbers are flushed to 
memory by other locks (barriers) taken during processing, the code in 
hard_start_xmit() and the actual tx completion reaping could be done without 
the atomic ops.  The key thing is that the higher level code in the network 
stack calling dev->hard_start_xmit() would know that it needs to retry if 
the tx complete sequence changes.  I'll ponder this a bit more and try to 
come up with some sample code.

		-ben
-- 
"Time is of no importance, Mr. President, only life is important."
Don't Email: <dont@kvack.org>.

^ permalink raw reply	[flat|nested] 85+ messages in thread

end of thread, other threads:[~2006-08-09  1:25 UTC | newest]

Thread overview: 85+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-08-03 13:15 [PATCH] [e1000]: Remove unnecessary tx_lock jamal
2006-08-03 14:02 ` Herbert Xu
2006-08-03 14:24   ` jamal
2006-08-03 16:36   ` Brandeburg, Jesse
2006-08-03 18:05     ` Michael Chan
2006-08-03 22:08       ` jamal
2006-08-04  0:09         ` Michael Chan
2006-08-04  1:10           ` Herbert Xu
2006-08-04  8:37             ` Herbert Xu
2006-08-04 10:10               ` Herbert Xu
2006-08-04 10:16                 ` jamal
2006-08-04 10:25                   ` Herbert Xu
2006-08-04 10:45                     ` jamal
2006-08-05 23:04                     ` jamal
2006-08-05 23:06                       ` Herbert Xu
2006-08-05 23:21                         ` jamal
2006-08-05 23:30                           ` Herbert Xu
2006-08-05 16:45                   ` jamal
2006-08-04 17:12                 ` Stephen Hemminger
2006-08-04 17:28                 ` Michael Chan
2006-08-04 18:08                   ` Stephen Hemminger
2006-08-04 23:31                     ` David Miller
2006-08-05 16:56                       ` jamal
2006-08-05 23:05                         ` Herbert Xu
2006-08-05 23:17                           ` jamal
2006-08-05 23:19                             ` Herbert Xu
2006-08-05 23:36                               ` jamal
2006-08-06  2:51                                 ` Herbert Xu
2006-08-06  7:14                                   ` Edgar E. Iglesias
2006-08-06  7:24                                     ` Herbert Xu
2006-08-06  7:30                                       ` Edgar E. Iglesias
2006-08-06  7:26                                     ` David Miller
2006-08-06  7:36                                       ` Herbert Xu
2006-08-06  8:06                                         ` Edgar E. Iglesias
2006-08-06  8:27                                           ` Herbert Xu
2006-08-06  9:03                                             ` Edgar E. Iglesias
2006-08-06  9:10                                               ` Herbert Xu
2006-08-06  9:18                                                 ` Edgar E. Iglesias
2006-08-06  8:35                                         ` David Miller
2006-08-06 12:24                                   ` jamal
2006-08-06 12:33                                     ` jamal
2006-08-06 23:16                                       ` Jesse Brandeburg
2006-08-07 12:50                                         ` jamal
2006-08-07 15:21                                           ` Edgar E. Iglesias
2006-08-07 15:40                                             ` jamal
2006-08-07 15:59                                               ` Edgar E. Iglesias
2006-08-07 16:31                                                 ` Jamal Hadi Salim
2006-08-07 17:04                                                   ` Edgar E. Iglesias
2006-08-07 18:00                                                     ` jamal
2006-08-07 18:47                                                       ` Edgar E. Iglesias
2006-08-07 19:03                                                         ` jamal
2006-08-07 19:14                                                           ` Edgar E. Iglesias
2006-08-07 19:34                                                             ` jamal
2006-08-07 20:28                                                               ` Edgar E. Iglesias
2006-08-08  0:52                                                                 ` jamal
2006-08-07 20:53                                                           ` Brandeburg, Jesse
2006-08-08  1:07                                                             ` jamal
2006-08-07 23:23                                                           ` Herbert Xu
2006-08-07 23:35                                                             ` Brandeburg, Jesse
2006-08-07 23:40                                                               ` Herbert Xu
2006-08-07 16:29                                               ` Edgar E. Iglesias
2006-08-07 16:36                                                 ` jamal
2006-08-06 19:22                                     ` jamal
2006-08-08  1:19                                     ` jamal
2006-08-08  1:22                                       ` Herbert Xu
2006-08-08  1:33                                         ` jamal
2006-08-08  2:17                                           ` Herbert Xu
2006-08-08  3:10                                             ` jamal
2006-08-08 12:21                                   ` jamal
2006-08-08 12:39                                     ` Herbert Xu
2006-08-06 17:20                           ` Michael Chan
2006-08-06 23:04                             ` Herbert Xu
2006-08-07  3:56                               ` Michael Chan
2006-08-07  4:21                                 ` Herbert Xu
2006-08-08 17:04                       ` Benjamin LaHaise
2006-08-08 22:06                         ` David Miller
2006-08-08 23:21                           ` Benjamin LaHaise
2006-08-09  0:25                             ` Herbert Xu
2006-08-09  1:25                               ` Benjamin LaHaise
2006-08-04  1:16           ` jamal
2006-08-04  1:18             ` Herbert Xu
2006-08-04  1:25               ` jamal
2006-08-04  4:06             ` Michael Chan
2006-08-03 22:06     ` jamal
  -- strict thread matches above, loose matches on Subject: below --
2006-08-08  5:43 Brandeburg, Jesse

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).