* [RFC][NET_SCHED] explict hold dev tx lock
@ 2007-09-16 16:14 jamal
  2007-09-16 19:31 ` David Miller
  0 siblings, 1 reply; 18+ messages in thread
From: jamal @ 2007-09-16 16:14 UTC (permalink / raw)
  To: Herbert Xu
  Cc: David Miller, netdev, Patrick McHardy, Eric Dumazet,
	Evgeniy Polyakov

[-- Attachment #1: Type: text/plain, Size: 1445 bytes --]


While trying to port my batching changes to this morning's net-2.6.24,
I realized this is something I had wanted to probe people on...

Challenge:
For N CPUs, with full-throttle traffic on all N CPUs funneling traffic
to the same ethernet device, the device's queue lock is contended by all
N CPUs constantly, while the TX lock is contended by at most 2 CPUs.
In the current mode of operation, after all the work of entering the
dequeue region, we may end up aborting the path if we are unable to get
the TX lock, and go back to contend for the queue lock. As N goes up,
this gets worse.
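
In other words, the idea is to trade the trylock-and-requeue dance for an
unconditional grab of the TX lock once we own the skb - roughly like this
(a simplified sketch, not the literal tree code):

	/* today (simplified): trylock, bail out on contention */
	if (!lockless && !netif_tx_trylock(dev))
		/* lost the race: requeue the skb and go back to fighting
		 * the other N-1 CPUs for dev->queue_lock */
		return handle_dev_cpu_collision(skb, dev, q);
	spin_unlock(&dev->queue_lock);
	ret = dev_hard_start_xmit(skb, dev);

	/* proposed (simplified): drop the queue lock first, then take
	 * the TX lock unconditionally - at most one other CPU contends */
	spin_unlock(&dev->queue_lock);
	if (!lockless)
		netif_tx_lock_irq(dev);
	ret = dev_hard_start_xmit(skb, dev);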

Testing:
I did some testing on a 4-CPU machine (2x dual core) with no IRQ binding.
I ran about 10 runs of 30M packets each from the stack with a UDP app I
wrote that is intended to keep all 4 CPUs busy - and to my surprise I
found that we bail out less than 0.1% of the time. I may need a better
test case.

Changes:
I made changes to the code path as shown in the included patch and
noticed a slight increase (2-3%) in performance with both e1000 and
tg3; which was a relief, because I thought the spin_lock_irq (which is
needed because some drivers grab the TX lock in interrupts) might have
negative effects. The fact it didn't reduce performance was a good thing.
Note: this is the highest-end machine I've ever laid hands on, so this
may be misleading.
 
So - what side effects do people see in doing this? If none, I will
clean it up and submit.

cheers,
jamal

[-- Attachment #2: nsqr1 --]
[-- Type: text/x-patch, Size: 2161 bytes --]

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index dc5e35f..ab9966f 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1271,6 +1271,12 @@ static inline void netif_tx_lock(struct net_device *dev)
 	dev->xmit_lock_owner = smp_processor_id();
 }
 
+static inline void netif_tx_lock_irq(struct net_device *dev)
+{
+	spin_lock_irq(&dev->_xmit_lock);
+	dev->xmit_lock_owner = smp_processor_id();
+}
+
 static inline void netif_tx_lock_bh(struct net_device *dev)
 {
 	spin_lock_bh(&dev->_xmit_lock);
@@ -1291,6 +1297,12 @@ static inline void netif_tx_unlock(struct net_device *dev)
 	spin_unlock(&dev->_xmit_lock);
 }
 
+static inline void netif_tx_unlock_irq(struct net_device *dev)
+{
+	dev->xmit_lock_owner = -1;
+	spin_unlock_irq(&dev->_xmit_lock);
+}
+
 static inline void netif_tx_unlock_bh(struct net_device *dev)
 {
 	dev->xmit_lock_owner = -1;
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index e970e8e..f75a924 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -134,34 +134,23 @@ static inline int qdisc_restart(struct net_device *dev)
 {
 	struct Qdisc *q = dev->qdisc;
 	struct sk_buff *skb;
-	unsigned lockless;
+	unsigned lockless = (dev->features & NETIF_F_LLTX);
 	int ret;
 
 	/* Dequeue packet */
 	if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL))
 		return 0;
 
-	/*
-	 * When the driver has LLTX set, it does its own locking in
-	 * start_xmit. These checks are worth it because even uncongested
-	 * locks can be quite expensive. The driver can do a trylock, as
-	 * is being done here; in case of lock contention it should return
-	 * NETDEV_TX_LOCKED and the packet will be requeued.
-	 */
-	lockless = (dev->features & NETIF_F_LLTX);
-
-	if (!lockless && !netif_tx_trylock(dev)) {
-		/* Another CPU grabbed the driver tx lock */
-		return handle_dev_cpu_collision(skb, dev, q);
-	}
-
 	/* And release queue */
 	spin_unlock(&dev->queue_lock);
 
+	if (!lockless)
+		netif_tx_lock_irq(dev);
+
 	ret = dev_hard_start_xmit(skb, dev);
 
 	if (!lockless)
-		netif_tx_unlock(dev);
+		netif_tx_unlock_irq(dev);
 
 	spin_lock(&dev->queue_lock);
 	q = dev->qdisc;


* Re: [RFC][NET_SCHED] explict hold dev tx lock
  2007-09-16 16:14 [RFC][NET_SCHED] explict hold dev tx lock jamal
@ 2007-09-16 19:31 ` David Miller
  2007-09-16 20:41   ` jamal
  0 siblings, 1 reply; 18+ messages in thread
From: David Miller @ 2007-09-16 19:31 UTC (permalink / raw)
  To: hadi; +Cc: herbert, netdev, kaber, dada1, johnpol

From: jamal <hadi@cyberus.ca>
Date: Sun, 16 Sep 2007 12:14:34 -0400

> Changes:
> I made changes to the code path as shown in the included patch and
> noticed a slight increase (2-3%) in performance with both e1000 and
> tg3; which was a relief, because I thought the spin_lock_irq (which is
> needed because some drivers grab the TX lock in interrupts) might have
> negative effects. The fact it didn't reduce performance was a good thing.
> Note: this is the highest-end machine I've ever laid hands on, so this
> may be misleading.
> 
> So - what side effects do people see in doing this? If none, I will
> clean it up and submit.

I tried this 4 years ago, it doesn't work.  :-)

Many drivers, particularly very old ones that PIO packets into
a device (which can take a long time), absolutely depend upon
interrupts being fully enabled during ->hard_start_xmit(),
so that other high-priority devices (such as simpler serial
controllers) can have their interrupts serviced during this
slow operation.

I don't think we want to do it anyway; whatever performance
we gain from it is offset by the badness of disabling interrupts
during this reasonably lengthy stretch of code.

The -rt folks as a result would notice this too and spank us :-)


* Re: [RFC][NET_SCHED] explict hold dev tx lock
  2007-09-16 19:31 ` David Miller
@ 2007-09-16 20:41   ` jamal
  2007-09-16 20:52     ` jamal
  2007-09-18  2:01     ` David Miller
  0 siblings, 2 replies; 18+ messages in thread
From: jamal @ 2007-09-16 20:41 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, netdev, kaber, dada1, johnpol

On Sun, 2007-16-09 at 12:31 -0700, David Miller wrote:
> From: jamal <hadi@cyberus.ca>
> Date: Sun, 16 Sep 2007 12:14:34 -0400

> > So - what side effects do people see in doing this? If none, I will
> > clean it up and submit.
> 
> I tried this 4 years ago, it doesn't work.  :-)
> 

;->

[good reasons removed here]

> I don't think we want to do it anyway; whatever performance
> we gain from it is offset by the badness of disabling interrupts
> during this reasonably lengthy stretch of code.
> 
> The -rt folks as a result would notice this too and spank us :-)

Indeed.
OK, maybe I am thinking too hard with that patch, so help me out :->
When I looked at that code path as it is today, I felt the softirq could
be interrupted on the same CPU it is running on while it already holds
the TX lock (if the trylock succeeds), and that the hardirq code, when
attempting to grab the lock, would deadlock.
Did I misread that?
When I experimented with tg3 and e1000 I did not see any such problems
with the non-IRQ version of the lock.
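
To spell out the scenario I was worried about (hypothetical, just for
illustration):

	/*
	 * softirq, qdisc_restart():         hardirq on the same CPU:
	 *
	 *   netif_tx_trylock(dev) succeeds
	 *   dev_hard_start_xmit(skb, dev)
	 *                                    driver's irq handler tries
	 *                                    netif_tx_lock(dev) and spins
	 *                                    forever - the holder cannot
	 *                                    run again until the irq returns
	 */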

cheers,
jamal



* Re: [RFC][NET_SCHED] explict hold dev tx lock
  2007-09-16 20:41   ` jamal
@ 2007-09-16 20:52     ` jamal
  2007-09-16 21:10       ` jamal
  2007-09-18  2:01     ` David Miller
  1 sibling, 1 reply; 18+ messages in thread
From: jamal @ 2007-09-16 20:52 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, netdev, kaber, dada1, johnpol

On Sun, 2007-16-09 at 16:41 -0400, jamal wrote:

> Indeed.
> OK, maybe I am thinking too hard with that patch, so help me out :->

OK, that was probably too much of an explanation. What I should say is:
if I grabbed the lock explicitly, without disabling IRQs, it won't be much
different from what is done today and should always work.
No?

cheers,
jamal



* Re: [RFC][NET_SCHED] explict hold dev tx lock
  2007-09-16 20:52     ` jamal
@ 2007-09-16 21:10       ` jamal
  2007-09-17 10:27         ` Evgeniy Polyakov
  0 siblings, 1 reply; 18+ messages in thread
From: jamal @ 2007-09-16 21:10 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, netdev, kaber, dada1, johnpol

[-- Attachment #1: Type: text/plain, Size: 331 bytes --]

On Sun, 2007-16-09 at 16:52 -0400, jamal wrote:

> What I should say is:
> if I grabbed the lock explicitly, without disabling IRQs, it won't be much
> different from what is done today and should always work.
> No?

And to be more explicit, here's a patch using the macros from the
previous patch. So far tested on 3 NICs.
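
For anyone not looking at the tree: the HARD_TX_LOCK()/HARD_TX_UNLOCK()
helpers only take the TX lock for drivers without NETIF_F_LLTX, roughly
along these lines (an approximation from memory - check the tree for the
exact definitions):

#define HARD_TX_LOCK(dev, cpu) {				\
	if (((dev)->features & NETIF_F_LLTX) == 0) {		\
		spin_lock(&(dev)->_xmit_lock);			\
		(dev)->xmit_lock_owner = (cpu);			\
	}							\
}

#define HARD_TX_UNLOCK(dev) {					\
	if (((dev)->features & NETIF_F_LLTX) == 0) {		\
		(dev)->xmit_lock_owner = -1;			\
		spin_unlock(&(dev)->_xmit_lock);		\
	}							\
}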

cheers,
jamal


[-- Attachment #2: nsqr2 --]
[-- Type: text/x-patch, Size: 1204 bytes --]

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index e970e8e..1ae905e 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -134,34 +134,18 @@ static inline int qdisc_restart(struct net_device *dev)
 {
 	struct Qdisc *q = dev->qdisc;
 	struct sk_buff *skb;
-	unsigned lockless;
 	int ret;
 
 	/* Dequeue packet */
 	if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL))
 		return 0;
 
-	/*
-	 * When the driver has LLTX set, it does its own locking in
-	 * start_xmit. These checks are worth it because even uncongested
-	 * locks can be quite expensive. The driver can do a trylock, as
-	 * is being done here; in case of lock contention it should return
-	 * NETDEV_TX_LOCKED and the packet will be requeued.
-	 */
-	lockless = (dev->features & NETIF_F_LLTX);
-
-	if (!lockless && !netif_tx_trylock(dev)) {
-		/* Another CPU grabbed the driver tx lock */
-		return handle_dev_cpu_collision(skb, dev, q);
-	}
-
 	/* And release queue */
 	spin_unlock(&dev->queue_lock);
 
+	HARD_TX_LOCK(dev, smp_processor_id());
 	ret = dev_hard_start_xmit(skb, dev);
-
-	if (!lockless)
-		netif_tx_unlock(dev);
+	HARD_TX_UNLOCK(dev);
 
 	spin_lock(&dev->queue_lock);
 	q = dev->qdisc;


* Re: [RFC][NET_SCHED] explict hold dev tx lock
  2007-09-16 21:10       ` jamal
@ 2007-09-17 10:27         ` Evgeniy Polyakov
  2007-09-17 13:03           ` jamal
  0 siblings, 1 reply; 18+ messages in thread
From: Evgeniy Polyakov @ 2007-09-17 10:27 UTC (permalink / raw)
  To: jamal; +Cc: David Miller, herbert, netdev, kaber, dada1

On Sun, Sep 16, 2007 at 05:10:00PM -0400, jamal (hadi@cyberus.ca) wrote:
> On Sun, 2007-16-09 at 16:52 -0400, jamal wrote:
> 
> > What I should say is:
> > if I grabbed the lock explicitly, without disabling IRQs, it won't be much
> > different from what is done today and should always work.
> > No?
> 
> And to be more explicit, here's a patch using the macros from the
> previous patch. So far tested on 3 NICs.

How many CPU collisions are you seeing?
Did I understand you right that you replaced the trylock with a lock, and
thus removed the collision handling, and got better results?


-- 
	Evgeniy Polyakov


* Re: [RFC][NET_SCHED] explict hold dev tx lock
  2007-09-17 10:27         ` Evgeniy Polyakov
@ 2007-09-17 13:03           ` jamal
  2007-09-17 13:58             ` Evgeniy Polyakov
  0 siblings, 1 reply; 18+ messages in thread
From: jamal @ 2007-09-17 13:03 UTC (permalink / raw)
  To: Evgeniy Polyakov; +Cc: David Miller, herbert, netdev, kaber, dada1

On Mon, 2007-17-09 at 14:27 +0400, Evgeniy Polyakov wrote:

> 
> How many CPU collisions are you seeing?

On 4 CPUs which were always transmitting, very few - contention was
in the range of 100 per million attempts.
Note: it doesn't matter that 4 CPUs were busy; this lock is contended at
most (for all the NAPI drivers I poked into) by two CPUs.

> Did I understand you right that you replaced the trylock with a lock, and
> thus removed the collision handling, and got better results?

Yes, a small one with the 4 CPUs and no IRQ binding. Note that in the
test cases I ran, contention for the queue lock was high (since all CPUs
were busy processing traffic).
I think as the number of CPUs goes up, this will become more
prominent. The choice is between contending for the queue lock or for this
lock. One lock is contended by at most two CPUs, the other by N CPUs. As N
goes up, you want to have more mercy on the one that is contended by N CPUs.
Did that make sense?

cheers,
jamal



* Re: [RFC][NET_SCHED] explict hold dev tx lock
  2007-09-17 13:03           ` jamal
@ 2007-09-17 13:58             ` Evgeniy Polyakov
  0 siblings, 0 replies; 18+ messages in thread
From: Evgeniy Polyakov @ 2007-09-17 13:58 UTC (permalink / raw)
  To: jamal; +Cc: David Miller, herbert, netdev, kaber, dada1

On Mon, Sep 17, 2007 at 09:03:58AM -0400, jamal (hadi@cyberus.ca) wrote:
> > Did I understand you right that you replaced the trylock with a lock, and
> > thus removed the collision handling, and got better results?
> 
> Yes, a small one with the 4 CPUs and no IRQ binding. Note that in the
> test cases I ran, contention for the queue lock was high (since all CPUs
> were busy processing traffic).
> I think as the number of CPUs goes up, this will become more
> prominent. The choice is between contending for the queue lock or for this
> lock. One lock is contended by at most two CPUs, the other by N CPUs. As N
> goes up, you want to have more mercy on the one that is contended by N CPUs.
> Did that make sense?

I think if the number of CPUs grows and there is no interrupt binding, the
system will not scale very well anyway, but your description makes sense,
thanks.

-- 
	Evgeniy Polyakov


* Re: [RFC][NET_SCHED] explict hold dev tx lock
  2007-09-16 20:41   ` jamal
  2007-09-16 20:52     ` jamal
@ 2007-09-18  2:01     ` David Miller
  2007-09-18  2:48       ` jamal
  1 sibling, 1 reply; 18+ messages in thread
From: David Miller @ 2007-09-18  2:01 UTC (permalink / raw)
  To: hadi; +Cc: herbert, netdev, kaber, dada1, johnpol

From: jamal <hadi@cyberus.ca>
Date: Sun, 16 Sep 2007 16:41:24 -0400

> OK, maybe I am thinking too hard with that patch, so help me out :->
> When I looked at that code path as it is today, I felt the softirq could
> be interrupted on the same CPU it is running on while it already holds
> the TX lock (if the trylock succeeds), and that the hardirq code, when
> attempting to grab the lock, would deadlock.

Hardirq should never try to grab the netif_tx_lock(); it is
only for base and softirq context.

Any hardirq context code taking that lock needs to be fixed.
We could assert this if we don't already.

It's the only way we can invoke ->hard_start_xmit()
with interrupts fully enabled.

I notice that your patch bypasses the LLTX logic (I think), and this
isn't kosher; it might introduce deadlocks or similar, since when we
are doing LLTX the driver determines the locking and IRQ context
semantics.


* Re: [RFC][NET_SCHED] explict hold dev tx lock
  2007-09-18  2:01     ` David Miller
@ 2007-09-18  2:48       ` jamal
  2007-09-19 13:33         ` jamal
  0 siblings, 1 reply; 18+ messages in thread
From: jamal @ 2007-09-18  2:48 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, netdev, kaber, dada1, johnpol

On Mon, 2007-17-09 at 19:01 -0700, David Miller wrote:

> Hardirq should never try to grab the netif_tx_lock(); it is
> only for base and softirq context.
> 
> Any hardirq context code taking that lock needs to be fixed.
> We could assert this if we don't already.

I snooped around and it looks pretty clean. An assertion won't hurt,
but people will find out _very quickly_ that it hurts when testing a
driver if they did it wrong ;->
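
Something along these lines in netif_tx_lock() would catch offenders
right away (untested sketch, just to illustrate):

static inline void netif_tx_lock(struct net_device *dev)
{
	/* The TX lock must never be taken from hardirq context:
	 * ->hard_start_xmit() runs with interrupts fully enabled. */
	WARN_ON_ONCE(in_irq());

	spin_lock(&dev->_xmit_lock);
	dev->xmit_lock_owner = smp_processor_id();
}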

> It's the only way we can invoke ->hard_start_xmit()
> with interrupts fully enabled.

> I notice that your patch bypasses the LLTX logic (I think), and this
> isn't kosher; it might introduce deadlocks or similar, since when we
> are doing LLTX the driver determines the locking and IRQ context
> semantics.

Nothing much has changed from what it was before.
The only difference is that we let go of the queue lock before grabbing
the TX lock, which never mattered for LLTX.
Once we grab the TX lock it is the same logic, and so far it is working
well on both tg3 and e1000 (which is LLTX).
I will continue to retest with net-2.6.24 once you complete rebasing,
and look around to see if anyone may be affected.

cheers,
jamal



* Re: [RFC][NET_SCHED] explict hold dev tx lock
  2007-09-18  2:48       ` jamal
@ 2007-09-19 13:33         ` jamal
  2007-09-19 16:09           ` David Miller
  0 siblings, 1 reply; 18+ messages in thread
From: jamal @ 2007-09-19 13:33 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, netdev, kaber, dada1, johnpol

On Mon, 2007-17-09 at 22:48 -0400, jamal wrote:

> Nothing much has changed from what it was before.
> The only difference is that we let go of the queue lock before grabbing
> the TX lock, which never mattered for LLTX.
> Once we grab the TX lock it is the same logic, and so far it is working
> well on both tg3 and e1000 (which is LLTX).
> I will continue to retest with net-2.6.24 once you complete rebasing,
> and look around to see if anyone may be affected.

OK, this is looking solid with this morning's tree. Tested on a dual-core
Xeon with e1000 (LLTX) and a dual-core Opteron with tg3 (non-LLTX).
About 100 million packets of full-throttle UDP on all 4 CPUs; I tried
pulling cables etc. while doing this to generate extraneous interrupts
and didn't see any issues.

Shall I submit the patch?

cheers,
jamal

 



* Re: [RFC][NET_SCHED] explict hold dev tx lock
  2007-09-19 13:33         ` jamal
@ 2007-09-19 16:09           ` David Miller
  2007-09-20  2:33             ` jamal
  0 siblings, 1 reply; 18+ messages in thread
From: David Miller @ 2007-09-19 16:09 UTC (permalink / raw)
  To: hadi; +Cc: herbert, netdev, kaber, dada1, johnpol

From: jamal <hadi@cyberus.ca>
Date: Wed, 19 Sep 2007 09:33:52 -0400

> On Mon, 2007-17-09 at 22:48 -0400, jamal wrote:
> 
> > Nothing much has changed from what it was before.
> > The only difference is that we let go of the queue lock before grabbing
> > the TX lock, which never mattered for LLTX.
> > Once we grab the TX lock it is the same logic, and so far it is working
> > well on both tg3 and e1000 (which is LLTX).
> > I will continue to retest with net-2.6.24 once you complete rebasing,
> > and look around to see if anyone may be affected.
> 
> OK, this is looking solid with this morning's tree. Tested on a dual-core
> Xeon with e1000 (LLTX) and a dual-core Opteron with tg3 (non-LLTX).
> About 100 million packets of full-throttle UDP on all 4 CPUs; I tried
> pulling cables etc. while doing this to generate extraneous interrupts
> and didn't see any issues.
> 
> Shall I submit the patch?

Sure, along with a description as to why you want to make this
change.

I still don't understand the impetus. :)


* Re: [RFC][NET_SCHED] explict hold dev tx lock
  2007-09-19 16:09           ` David Miller
@ 2007-09-20  2:33             ` jamal
  2007-09-20  2:43               ` jamal
  0 siblings, 1 reply; 18+ messages in thread
From: jamal @ 2007-09-20  2:33 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, netdev, kaber, dada1, johnpol

On Wed, 2007-19-09 at 09:09 -0700, David Miller wrote:

> Sure, along with a description as to why you want to make this
> change.

Will do. And if you feel that I should sit on it a little more, I can do
that too. The good news is it doesn't make things any worse than they
already are, and in fact shows a slight improvement on a 4-CPU machine.

> I still don't understand the impetus. :)

I would like to use the effect later on to improve batching.

cheers,
jamal



* Re: [RFC][NET_SCHED] explict hold dev tx lock
  2007-09-20  2:33             ` jamal
@ 2007-09-20  2:43               ` jamal
  2007-09-26  2:28                 ` David Miller
  2007-10-09  4:00                 ` Herbert Xu
  0 siblings, 2 replies; 18+ messages in thread
From: jamal @ 2007-09-20  2:43 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, netdev, kaber, dada1, johnpol

[-- Attachment #1: Type: text/plain, Size: 73 bytes --]


OK, this is against the net-2.6.24 tree from about an hour ago.

cheers,
jamal



[-- Attachment #2: nsqr3 --]
[-- Type: text/plain, Size: 2295 bytes --]

[NET_SCHED] explict hold dev tx lock

For N CPUs, with full-throttle traffic on all N CPUs funneling traffic
to the same ethernet device, the device's queue lock is contended by all
N CPUs constantly, while the TX lock is contended by at most 2 CPUs.
In the current mode of operation, after all the work of entering the
dequeue region, we may end up aborting the path if we are unable to get
the TX lock, and go back to contend for the queue lock. As N goes up,
this gets worse.

The changes in this patch result in a small increase in performance
on a 4-CPU (2x dual-core) machine with no IRQ binding. Both e1000 and
tg3 showed similar behavior.

Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>

---
commit 5973a9734a93903e7c3547dbe19addc42600240f
tree a57bd488e200fdaa3dfef7f27de52cbe02238dd3
parent 31b9dd879cd9fd254484de2950341acdcb62680f
author Jamal Hadi Salim <hadi@cyberus.ca> Wed, 19 Sep 2007 22:29:16 -0400
committer Jamal Hadi Salim <hadi@cyberus.ca> Wed, 19 Sep 2007 22:29:16 -0400

 net/sched/sch_generic.c |   19 ++-----------------
 1 files changed, 2 insertions(+), 17 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index e970e8e..95ae119 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -134,34 +134,19 @@ static inline int qdisc_restart(struct net_device *dev)
 {
 	struct Qdisc *q = dev->qdisc;
 	struct sk_buff *skb;
-	unsigned lockless;
 	int ret;
 
 	/* Dequeue packet */
 	if (unlikely((skb = dev_dequeue_skb(dev, q)) == NULL))
 		return 0;
 
-	/*
-	 * When the driver has LLTX set, it does its own locking in
-	 * start_xmit. These checks are worth it because even uncongested
-	 * locks can be quite expensive. The driver can do a trylock, as
-	 * is being done here; in case of lock contention it should return
-	 * NETDEV_TX_LOCKED and the packet will be requeued.
-	 */
-	lockless = (dev->features & NETIF_F_LLTX);
-
-	if (!lockless && !netif_tx_trylock(dev)) {
-		/* Another CPU grabbed the driver tx lock */
-		return handle_dev_cpu_collision(skb, dev, q);
-	}
 
 	/* And release queue */
 	spin_unlock(&dev->queue_lock);
 
+	HARD_TX_LOCK(dev, smp_processor_id());
 	ret = dev_hard_start_xmit(skb, dev);
-
-	if (!lockless)
-		netif_tx_unlock(dev);
+	HARD_TX_UNLOCK(dev);
 
 	spin_lock(&dev->queue_lock);
 	q = dev->qdisc;


* Re: [RFC][NET_SCHED] explict hold dev tx lock
  2007-09-20  2:43               ` jamal
@ 2007-09-26  2:28                 ` David Miller
  2007-09-26 13:11                   ` jamal
  2007-10-09  4:00                 ` Herbert Xu
  1 sibling, 1 reply; 18+ messages in thread
From: David Miller @ 2007-09-26  2:28 UTC (permalink / raw)
  To: hadi; +Cc: herbert, netdev, kaber, dada1, johnpol

From: jamal <hadi@cyberus.ca>
Date: Wed, 19 Sep 2007 22:43:03 -0400

> [NET_SCHED] explict hold dev tx lock
> 
> For N CPUs, with full-throttle traffic on all N CPUs funneling traffic
> to the same ethernet device, the device's queue lock is contended by all
> N CPUs constantly, while the TX lock is contended by at most 2 CPUs.
> In the current mode of operation, after all the work of entering the
> dequeue region, we may end up aborting the path if we are unable to get
> the TX lock, and go back to contend for the queue lock. As N goes up,
> this gets worse.
> 
> The changes in this patch result in a small increase in performance
> on a 4-CPU (2x dual-core) machine with no IRQ binding. Both e1000 and
> tg3 showed similar behavior.
> 
> Signed-off-by: Jamal Hadi Salim <hadi@cyberus.ca>

I've applied this to net-2.6.24, although I want to study more deeply
the implications of this change myself at some point :)


* Re: [RFC][NET_SCHED] explict hold dev tx lock
  2007-09-26  2:28                 ` David Miller
@ 2007-09-26 13:11                   ` jamal
  0 siblings, 0 replies; 18+ messages in thread
From: jamal @ 2007-09-26 13:11 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, netdev, kaber, dada1, johnpol

On Tue, 2007-25-09 at 19:28 -0700, David Miller wrote:

> I've applied this to net-2.6.24, although I want to study more deeply
> the implications of this change myself at some point :)

Sounds reasonable. I've done a lot of testing with my 2-3 NIC variants;
I've CCed whoever I thought was a stakeholder and they ALL seemed
silently-happy-joyous (wink ;->), so letting more people go at it is the
best thing to do. I will be watching that space.

cheers,
jamal



* Re: [RFC][NET_SCHED] explict hold dev tx lock
  2007-09-20  2:43               ` jamal
  2007-09-26  2:28                 ` David Miller
@ 2007-10-09  4:00                 ` Herbert Xu
  2007-10-09 13:43                   ` jamal
  1 sibling, 1 reply; 18+ messages in thread
From: Herbert Xu @ 2007-10-09  4:00 UTC (permalink / raw)
  To: jamal; +Cc: David Miller, netdev, kaber, dada1, johnpol

On Wed, Sep 19, 2007 at 10:43:03PM -0400, jamal wrote:
>
> [NET_SCHED] explict hold dev tx lock
> 
> For N CPUs, with full-throttle traffic on all N CPUs funneling traffic
> to the same ethernet device, the device's queue lock is contended by all
> N CPUs constantly, while the TX lock is contended by at most 2 CPUs.
> In the current mode of operation, after all the work of entering the
> dequeue region, we may end up aborting the path if we are unable to get
> the TX lock, and go back to contend for the queue lock. As N goes up,
> this gets worse.
> 
> The changes in this patch result in a small increase in performance
> on a 4-CPU (2x dual-core) machine with no IRQ binding. Both e1000 and
> tg3 showed similar behavior.

OK, after waking up a bit more I now have another question :)

Both of the drivers you've tested here are special.  Firstly
e1000 is lockless so there is no contention here at all.  On
the other hand tg3 doesn't take the TX lock on the clean-up
path unless the queue has been stopped.

In other words both drivers only take the TX lock on xmit so
this patch makes very little difference to them.

What I'm worried about is: would we see worse behaviour with
drivers that do all their TX clean-up with the TX lock held
(which would cause qdisc_restart to spin while this is happening)?

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [RFC][NET_SCHED] explict hold dev tx lock
  2007-10-09  4:00                 ` Herbert Xu
@ 2007-10-09 13:43                   ` jamal
  0 siblings, 0 replies; 18+ messages in thread
From: jamal @ 2007-10-09 13:43 UTC (permalink / raw)
  To: Herbert Xu; +Cc: David Miller, netdev, kaber, dada1, johnpol

On Tue, 2007-09-10 at 12:00 +0800, Herbert Xu wrote:

> 
> OK, after waking up a bit more 

Me too ;->

> What I'm worried about is: would we see worse behaviour with
> drivers that do all their TX clean-up with the TX lock held

Good point, Herbert.
When I looked around I only found one driver that behaved like that:
some IBM mainframe one, from the looks of it. That driver did a lot of
other obscure things (I think they maintain their own NAPI calls etc.),
so it didn't worry me very much.
IIRC, it could be fixed to do what tg3 and relatives like bnx do
(I really like the approach) - I just didn't have the time to chase it.
There are a _lot_ more drivers that have no respect for netif_tx_lock
and implement their own locking down in the driver. Those already suffer
from the phenomenon you describe whether the TX lock is held or not.

> (which would cause qdisc_restart to spin while this is happening)?

Yes, with such a driver we spin in the worst case. But that provides
opportunities for optimizing a driver behaving that way; two
approaches off the top of my head:
a) we could prune the TX descriptors on the TX path, since it is safe
to do so, and reduce the amount of time spent doing such work in the
NAPI poll;
or
b) have the NAPI side do a trylock (sort of what the e1000 attempts to
do) and reschedule the poll to retry (rough sketch below).

#b is fair because the cost of the queue lock goes up as the number of
CPUs goes up (which is the case I was optimizing for). The cost of the
TX lock is contention by two CPUs in the worst case. Thoughts?
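
A rough sketch of #b for an old-style NAPI poll routine - the mydrv_*()
names and details are made up, not taken from any real driver; it is just
to illustrate the trylock-and-retry idea:

static int mydrv_poll(struct net_device *dev, int *budget)
{
	struct mydrv_priv *priv = netdev_priv(dev);

	/* RX work as usual ... (omitted) */

	if (netif_tx_trylock(dev)) {
		mydrv_clean_tx_ring(priv);	/* prune TX descriptors */
		netif_tx_unlock(dev);
	} else {
		/* The xmit path owns the TX lock: don't spin on it here,
		 * just ask to be polled again and retry the cleanup then. */
		return 1;
	}

	/* ... netif_rx_complete() etc. on the normal completion path */
	return 0;
}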

cheers,
jamal


