netdev.vger.kernel.org archive mirror
* [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
@ 2008-02-21 18:51 Eric Dumazet
  2008-02-21 20:14 ` Daniel Lezcano
  2008-02-27  2:21 ` David Miller
  0 siblings, 2 replies; 35+ messages in thread
From: Eric Dumazet @ 2008-02-21 18:51 UTC (permalink / raw)
  To: David S. Miller; +Cc: netdev

[-- Attachment #1: Type: text/plain, Size: 832 bytes --]

Hi David

This is an RFC, based on net-2.6 for convenience only.

Thank you

[RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()

Loopback transmit function loopback_xmit() actually calls netif_rx() to 
queue a skb to the softnet queue, and arms a softirq so that this skb can 
be handled later.

This has a cost on SMP, because we need to hold a reference on the 
device, and free this reference when the softirq dequeues the packet.

The following patch directly calls netif_receive_skb() and avoids a lot of 
atomic operations
(atomic_inc(&dev->refcnt), test_and_set_bit(NAPI_STATE_SCHED, &n->state), ...
 atomic_dec(&dev->refcnt)...), cache line ping-pongs on the device refcnt, 
and also softirq overhead.

This gives a nice boost on tbench for example (5% on my machine)
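
For reference, a simplified sketch of the two entry points (2.6.24-era
core; queue-full handling and most details elided, so treat this as an
illustration rather than the exact code):

	/* netif_rx(): defer the skb to the per-cpu softnet queue */
	int netif_rx(struct sk_buff *skb)
	{
		struct softnet_data *queue = &__get_cpu_var(softnet_data);

		dev_hold(skb->dev);		/* atomic_inc(&dev->refcnt) */
		__skb_queue_tail(&queue->input_pkt_queue, skb);
		napi_schedule(&queue->backlog);	/* test_and_set_bit() + NET_RX_SOFTIRQ */
		return NET_RX_SUCCESS;
	}

	/* later, from the NET_RX softirq (process_backlog(), abbreviated): */
	skb = __skb_dequeue(&queue->input_pkt_queue);
	dev = skb->dev;
	netif_receive_skb(skb);		/* synchronous delivery to ip_rcv() etc. */
	dev_put(dev);			/* atomic_dec(&dev->refcnt) */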

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>



[-- Attachment #2: loopback_xmit.patch --]
[-- Type: text/plain, Size: 355 bytes --]

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index f2a6e71..9bed7ed 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -158,7 +158,7 @@ static int loopback_xmit(struct sk_buff *skb, struct net_device *dev)
 	lb_stats->bytes += skb->len;
 	lb_stats->packets++;
 
-	netif_rx(skb);
+	netif_receive_skb(skb);
 
 	return 0;
 }

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
  2008-02-21 18:51 [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx() Eric Dumazet
@ 2008-02-21 20:14 ` Daniel Lezcano
  2008-02-21 23:19   ` Eric Dumazet
  2008-02-27  2:21 ` David Miller
  1 sibling, 1 reply; 35+ messages in thread
From: Daniel Lezcano @ 2008-02-21 20:14 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S. Miller, netdev

Eric Dumazet wrote:
> Hi David
> 
> This is an RFC, based on net-2.6 for convenience only.
> 
> Thank you
> 
> [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
> 
> Loopback transmit function loopback_xmit() actually calls netif_rx() to 
> queue a skb to the softnet queue, and arms a softirq so that this skb can 
> be handled later.
> 
> This has a cost on SMP, because we need to hold a reference on the 
> device, and free this reference when the softirq dequeues the packet.
> 
> The following patch directly calls netif_receive_skb() and avoids a lot of 
> atomic operations
> (atomic_inc(&dev->refcnt), test_and_set_bit(NAPI_STATE_SCHED, &n->state), ...
>  atomic_dec(&dev->refcnt)...), cache line ping-pongs on the device refcnt, 
> and also softirq overhead.
> 
> This gives a nice boost on tbench for example (5% on my machine)

I understand this is interesting for the loopback when there is only a 
single instance of it and it can't be unregistered. But now with the 
network namespaces, we can have multiple instances of the loopback and 
it can be unregistered. Shouldn't we still use netif_rx()?
Perhaps we can do something like:

	if (dev->nd_net == &init_net)
		netif_receive_skb(skb);
	else
		netif_rx(skb);

Or we create:
	init_loopback_xmit() calling netif_receive_skb(skb);
	and setup this function when creating the loopback for init_net,
	otherwise we setup the usual loopback_xmit.

We are still safe for multiple network namespaces and we have the 
improvement for init_net loopback.
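
A rough sketch of that second alternative (hypothetical function name,
stats update elided; dev->nd_net and hard_start_xmit as in this era's
struct net_device):

	/* hypothetical fast variant, used only for the init_net loopback */
	static int init_loopback_xmit(struct sk_buff *skb, struct net_device *dev)
	{
		skb_orphan(skb);
		skb->protocol = eth_type_trans(skb, dev);
		/* ... per-cpu stats update, as in loopback_xmit() ... */
		netif_receive_skb(skb);	/* direct delivery, no softirq */
		return 0;
	}

	/* at loopback creation time, pick the xmit handler per namespace */
	dev->hard_start_xmit = (dev->nd_net == &init_net) ?
			       init_loopback_xmit : loopback_xmit;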





^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
  2008-02-21 20:14 ` Daniel Lezcano
@ 2008-02-21 23:19   ` Eric Dumazet
  2008-02-22 10:19     ` Daniel Lezcano
  0 siblings, 1 reply; 35+ messages in thread
From: Eric Dumazet @ 2008-02-21 23:19 UTC (permalink / raw)
  To: Daniel Lezcano; +Cc: David S. Miller, netdev

Daniel Lezcano wrote:
> Eric Dumazet wrote:
>> Hi David
>>
>> This is an RFC, based on net-2.6 for convenience only.
>>
>> Thank you
>>
>> [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
>>
>> Loopback transmit function loopback_xmit() actually calls netif_rx() to 
>> queue a skb to the softnet queue, and arms a softirq so that this skb can 
>> be handled later.
>>
>> This has a cost on SMP, because we need to hold a reference on the 
>> device, and free this reference when the softirq dequeues the packet.
>>
>> The following patch directly calls netif_receive_skb() and avoids a lot of 
>> atomic operations
>> (atomic_inc(&dev->refcnt), test_and_set_bit(NAPI_STATE_SCHED, &n->state), ...
>>  atomic_dec(&dev->refcnt)...), cache line ping-pongs on the device refcnt, 
>> and also softirq overhead.
>>
>> This gives a nice boost on tbench for example (5% on my machine)
> 
> I understand this is interesting for the loopback when there is only a 
> single instance of it and it can't be unregistered. But now with the 
> network namespaces, we can have multiple instances of the loopback and 
> it can be unregistered. Shouldn't we still use netif_rx()?
> Perhaps we can do something like:
> 
>     if (dev->nd_net == &init_net)
>         netif_receive_skb(skb);
>     else
>         netif_rx(skb);

or

#ifdef CONFIG_NET_NS
	if (dev->nd_net != &init_net)
		netif_rx(skb);
	else
#endif
		netif_receive_skb(skb);

> 
> Or we create:
>     init_loopback_xmit() calling netif_receive_skb(skb);
>     and setup this function when creating the loopback for init_net,
>     otherwise we setup the usual loopback_xmit.
> 
> We are still safe for multiple network namespaces and we have the 
> improvement for init_net loopback.
> 

I don't understand how my patch could degrade the loopback device unregister 
logic. It should only help it, by avoiding a queue of 'pending packets' per cpu.

When we want to unregister a network device, the stack makes sure that no more 
calls to dev->hard_start_xmit() can occur.

If no more loopback_xmit() calls are done on this device, it doesn't matter if 
it internally uses netif_rx() or netif_receive_skb(skb).

The loopback device has no queue, so it's really unfortunate to use the 
'softirq' internal queue.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
  2008-02-21 23:19   ` Eric Dumazet
@ 2008-02-22 10:19     ` Daniel Lezcano
  0 siblings, 0 replies; 35+ messages in thread
From: Daniel Lezcano @ 2008-02-22 10:19 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David S. Miller, netdev

Eric Dumazet wrote:
> Daniel Lezcano wrote:
>> Eric Dumazet wrote:
>>> Hi David
>>>
>>> This is an RFC, based on net-2.6 for convenience only.
>>>
>>> Thank you
>>>
>>> [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
>>>
>>> Loopback transmit function loopback_xmit() actually calls netif_rx() to 
>>> queue a skb to the softnet queue, and arms a softirq so that this skb 
>>> can be handled later.
>>>
>>> This has a cost on SMP, because we need to hold a reference on the 
>>> device, and free this reference when the softirq dequeues the packet.
>>>
>>> The following patch directly calls netif_receive_skb() and avoids a lot of 
>>> atomic operations
>>> (atomic_inc(&dev->refcnt), test_and_set_bit(NAPI_STATE_SCHED, &n->state), ...
>>>  atomic_dec(&dev->refcnt)...), cache line ping-pongs on the device 
>>> refcnt, and also softirq overhead.
>>>
>>> This gives a nice boost on tbench for example (5% on my machine)
>>
>> I understand this is interesting for the loopback when there is only a 
>> single instance of it and it can't be unregistered. But now with 
>> the network namespaces, we can have multiple instances of the loopback 
>> and it can be unregistered. Shouldn't we still use netif_rx()?
>> Perhaps we can do something like:
>>
>>     if (dev->nd_net == &init_net)
>>         netif_receive_skb(skb);
>>     else
>>         netif_rx(skb);
> 
> or
> 
> #ifdef CONFIG_NET_NS
>     if (dev->nd_net != &init_net)
>         netif_rx(skb);
>     else
> #endif
>         netif_receive_skb(skb);
> 
>>
>> Or we create:
>>     init_loopback_xmit() calling netif_receive_skb(skb);
>>     and setup this function when creating the loopback for init_net,
>>     otherwise we setup the usual loopback_xmit.
>>
>> We are still safe for multiple network namespaces and we have the 
>> improvement for init_net loopback.
>>
> 
> I don't understand how my patch could degrade the loopback device 
> unregister logic. It should only help it, by avoiding a queue of 
> 'pending packets' per cpu.
> 
> When we want to unregister a network device, the stack makes sure that 
> no more calls to dev->hard_start_xmit() can occur.
> 
> If no more loopback_xmit() calls are done on this device, it doesn't 
> matter if it internally uses netif_rx() or netif_receive_skb(skb).
> 
> The loopback device has no queue, so it's really unfortunate to use the 
> 'softirq' internal queue.

Fair enough :)


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
  2008-02-21 18:51 [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx() Eric Dumazet
  2008-02-21 20:14 ` Daniel Lezcano
@ 2008-02-27  2:21 ` David Miller
  2008-02-27  7:20   ` Jarek Poplawski
  2008-03-01 10:26   ` Eric Dumazet
  1 sibling, 2 replies; 35+ messages in thread
From: David Miller @ 2008-02-27  2:21 UTC (permalink / raw)
  To: dada1; +Cc: netdev

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Thu, 21 Feb 2008 19:51:52 +0100

> The following patch directly calls netif_receive_skb() and avoids a lot of 
> atomic operations
> (atomic_inc(&dev->refcnt), test_and_set_bit(NAPI_STATE_SCHED, &n->state), ...
>   atomic_dec(&dev->refcnt)...), cache line ping-pongs on the device refcnt, 
> and also softirq overhead.
> 
> This gives a nice boost on tbench for example (5% on my machine)

My only concern is stack usage.

Note that packet reception can elicit a response and go all the way
back into this driver and all the way down into netif_receive_skb()
again.  And so on and so forth.

If there is some bug in the stack (ACK'ing ACKs, stuff like that) we
could get into a loop and overrun the kernel stack in no time at all.

So, if anything, this change could make inconvenient errors become
catastrophic and hard to diagnose.
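
Concretely, one such runaway chain would look roughly like this
(heavily simplified, one possible path):

	loopback_xmit(skb)
	  netif_receive_skb(skb)
	    ip_rcv() -> tcp_v4_rcv()
	      tcp_send_ack()		/* reception elicits a response */
	        ip_queue_xmit() -> dev_queue_xmit()
	          loopback_xmit(ack_skb)
	            netif_receive_skb(ack_skb)
	              ...		/* and so on, eating stack each round */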

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
  2008-02-27  2:21 ` David Miller
@ 2008-02-27  7:20   ` Jarek Poplawski
  2008-02-27  7:23     ` David Miller
  2008-03-01 10:26   ` Eric Dumazet
  1 sibling, 1 reply; 35+ messages in thread
From: Jarek Poplawski @ 2008-02-27  7:20 UTC (permalink / raw)
  To: David Miller; +Cc: dada1, netdev

On 27-02-2008 03:21, David Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Thu, 21 Feb 2008 19:51:52 +0100
> 
>> The following patch directly calls netif_receive_skb() and avoids a lot of 
>> atomic operations
>> (atomic_inc(&dev->refcnt), test_and_set_bit(NAPI_STATE_SCHED, &n->state), ...
>>   atomic_dec(&dev->refcnt)...), cache line ping-pongs on the device refcnt, 
>> and also softirq overhead.
>>
>> This gives a nice boost on tbench for example (5% on my machine)
> 
> My only concern is stack usage.
...

I wonder why overloading with net processing is of no concern here?
There would be no NAPI control around this netif_receive_skb().

Another concern might be code which depends on softirq context
here (unless that was checked already).

Regards,
Jarek P.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
  2008-02-27  7:20   ` Jarek Poplawski
@ 2008-02-27  7:23     ` David Miller
  2008-02-27  7:34       ` Jarek Poplawski
  0 siblings, 1 reply; 35+ messages in thread
From: David Miller @ 2008-02-27  7:23 UTC (permalink / raw)
  To: jarkao2; +Cc: dada1, netdev

From: Jarek Poplawski <jarkao2@gmail.com>
Date: Wed, 27 Feb 2008 07:20:41 +0000

> I wonder why overloading with net processing is of no concern here?
> There would be no NAPI control around this netif_receive_skb().

Good point, but we're talking about loopback wherein only
the local system can overload itself.

> Another concern might be code which depends on softirq context
> here (unless that was checked already).

Hmmm.... yes this could be a problem.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
  2008-02-27  7:23     ` David Miller
@ 2008-02-27  7:34       ` Jarek Poplawski
  0 siblings, 0 replies; 35+ messages in thread
From: Jarek Poplawski @ 2008-02-27  7:34 UTC (permalink / raw)
  To: David Miller; +Cc: dada1, netdev

On Tue, Feb 26, 2008 at 11:23:14PM -0800, David Miller wrote:
> From: Jarek Poplawski <jarkao2@gmail.com>
> Date: Wed, 27 Feb 2008 07:20:41 +0000
> 
> > I wonder why overloading with net processing is of no concern here?
> > There would be no NAPI control around this netif_receive_skb().
> 
> Good point, but we're talking about loopback wherein only
> the local system can overload itself.

Then e.g. the lack of responsiveness should concern us, I guess.

Jarek P.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
  2008-02-27  2:21 ` David Miller
  2008-02-27  7:20   ` Jarek Poplawski
@ 2008-03-01 10:26   ` Eric Dumazet
  2008-03-04  4:55     ` David Miller
  2008-03-23 10:29     ` David Miller
  1 sibling, 2 replies; 35+ messages in thread
From: Eric Dumazet @ 2008-03-01 10:26 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

[-- Attachment #1: Type: text/plain, Size: 2073 bytes --]

David Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Thu, 21 Feb 2008 19:51:52 +0100
> 
>> The following patch directly calls netif_receive_skb() and avoids a lot of 
>> atomic operations
>> (atomic_inc(&dev->refcnt), test_and_set_bit(NAPI_STATE_SCHED, &n->state), ...
>>   atomic_dec(&dev->refcnt)...), cache line ping-pongs on the device refcnt, 
>> and also softirq overhead.
>>
>> This gives a nice boost on tbench for example (5% on my machine)
> 
> My only concern is stack usage.
> 
> Note that packet reception can elicit a response and go all the way
> back into this driver and all the way down into netif_receive_skb()
> again.  And so on and so forth.
> 
> If there is some bug in the stack (ACK'ing ACKs, stuff like that) we
> could get into a loop and overrun the kernel stack in no time at all.
> 
> So, if anything, this change could make inconvenient errors become
> catastrophic and hard to diagnose.

You are absolutely right. We should guard against recursion, using a new field 
in "pcpu_lstats" (cheap access in a hot cache line, as we have to update stats 
anyway).

Thank you

[PATCH] loopback: calls netif_receive_skb() instead of netif_rx()

Loopback transmit function loopback_xmit() actually calls netif_rx() to queue 
a skb to the softnet queue, and arms a softirq so that this skb can be handled 
later.

This has a cost on SMP, because we need to hold a reference on the device, and 
free this reference when the softirq dequeues the packet.

The following patch directly calls netif_receive_skb() and avoids a lot of 
atomic operations

(atomic_inc(&dev->refcnt), test_and_set_bit(NAPI_STATE_SCHED, &n->state), ...
atomic_dec(&dev->refcnt)...), cache line ping-pongs on the device refcnt, and 
also softirq overhead.

This gives a nice boost on tbench for example (5% on my machine)

We want to limit recursion, in case the network stack wants to re-enter 
loopback_xmit(). We use a per-cpu depth field, so that we avoid stack 
overflow, queueing the packet instead of trying to handle it directly.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>


[-- Attachment #2: loopback.patch --]
[-- Type: text/plain, Size: 760 bytes --]

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index f2a6e71..c1d0956 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -62,6 +62,7 @@
 struct pcpu_lstats {
 	unsigned long packets;
 	unsigned long bytes;
+	int	      depth;
 };
 
 /* KISS: just allocate small chunks and copy bits.
@@ -158,8 +159,16 @@ static int loopback_xmit(struct sk_buff *skb, struct net_device *dev)
 	lb_stats->bytes += skb->len;
 	lb_stats->packets++;
 
-	netif_rx(skb);
-
+	/*
+	 * We can call netif_receive_skb() instead of netif_rx()
+	 * to speed up processing, but not in case of recursion,
+	 * or we risk stack overflow.
+	 */
+	if (lb_stats->depth++ == 0)
+		netif_receive_skb(skb);
+	else
+		netif_rx(skb);
+	lb_stats->depth--;
 	return 0;
 }
 

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
  2008-03-01 10:26   ` Eric Dumazet
@ 2008-03-04  4:55     ` David Miller
  2008-03-04  5:15       ` Stephen Hemminger
  2008-03-04  6:27       ` Eric Dumazet
  2008-03-23 10:29     ` David Miller
  1 sibling, 2 replies; 35+ messages in thread
From: David Miller @ 2008-03-04  4:55 UTC (permalink / raw)
  To: dada1; +Cc: netdev

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Sat, 01 Mar 2008 11:26:17 +0100

> You are absolutely right. We should guard against recursion, using a new field 
> in "pcpu_lstats" (cheap access in a hot cache line, as we have to update stats 
> anyway).
 ...
> [PATCH] loopback: calls netif_receive_skb() instead of netif_rx()

I'm willing to seriously entertain this change and stick it
into net-2.6.26 if you will perform a reasonable deep stack
test.

For example, create an XFS filesystem, and mount it via NFS over
loopback.  Then stress it like crazy.

See if this generates stack overflows or weird crashes.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
  2008-03-04  4:55     ` David Miller
@ 2008-03-04  5:15       ` Stephen Hemminger
  2008-03-04  6:27       ` Eric Dumazet
  1 sibling, 0 replies; 35+ messages in thread
From: Stephen Hemminger @ 2008-03-04  5:15 UTC (permalink / raw)
  To: David Miller; +Cc: dada1, netdev

On Mon, 03 Mar 2008 20:55:58 -0800 (PST)
David Miller <davem@davemloft.net> wrote:

> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Sat, 01 Mar 2008 11:26:17 +0100
> 
> > You are absolutely right. We should guard against recursion, using a new field 
> > in "pcpu_lstats" (cheap access in a hot cache line, as we have to update stats 
> > anyway).
>  ...
> > [PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
> 
> I'm willing to seriously entertain this change and stick it
> into net-2.6.26 if you will perform a reasonable deep stack
> test.
> 
> For example, create an XFS filesystem, and mount it via NFS over
> loopback.  Then stress it like crazy.
> 
> See if this generates stack overflows or weird crashes.

Also (unrealistic) benchmarks often test loopback performance, so you
should also check for performance gains/losses in things like
netbench, netperf, tbench, etc.
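
For example (assuming stock netperf/netserver and tbench/tbench_srv
binaries; invocations illustrative):

	netserver                            # netperf's daemon
	netperf -H 127.0.0.1 -t TCP_STREAM   # bulk throughput over loopback
	netperf -H 127.0.0.1 -t TCP_RR       # request/response latency
	tbench_srv &
	tbench 8 127.0.0.1                   # 8 client processes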

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
  2008-03-04  4:55     ` David Miller
  2008-03-04  5:15       ` Stephen Hemminger
@ 2008-03-04  6:27       ` Eric Dumazet
  1 sibling, 0 replies; 35+ messages in thread
From: Eric Dumazet @ 2008-03-04  6:27 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

David Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Sat, 01 Mar 2008 11:26:17 +0100
> 
>> You are absolutely right. We should guard against recursion, using a new field 
>> in "pcpu_lstats" (cheap access in a hot cache line, as we have to update stats 
>> anyway).
>  ...
>> [PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
> 
> I'm willing to seriously entertain this change and stick it
> into net-2.6.26 if you will perform a reasonable deep stack
> test.
> 
> For example, create an XFS filesystem, and mount it via NFS over
> loopback.  Then stress it like crazy.
> 
> See if this generates stack overflows or weird crashes.
> 
> 

Fair enough :)

I'll do my best to stress it in various situations, with 4K stacks on i386.

Thank you

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
  2008-03-01 10:26   ` Eric Dumazet
  2008-03-04  4:55     ` David Miller
@ 2008-03-23 10:29     ` David Miller
  2008-03-23 18:48       ` Eric Dumazet
  1 sibling, 1 reply; 35+ messages in thread
From: David Miller @ 2008-03-23 10:29 UTC (permalink / raw)
  To: dada1; +Cc: netdev

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Sat, 01 Mar 2008 11:26:17 +0100

> [PATCH] loopback: calls netif_receive_skb() instead of netif_rx()

Eric, did you get a chance to stress this thing out for kernel
stack usage like I asked?

Thanks.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
  2008-03-23 10:29     ` David Miller
@ 2008-03-23 18:48       ` Eric Dumazet
  2008-03-23 19:15         ` Andi Kleen
                           ` (2 more replies)
  0 siblings, 3 replies; 35+ messages in thread
From: Eric Dumazet @ 2008-03-23 18:48 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

[-- Attachment #1: Type: text/plain, Size: 2029 bytes --]

David Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Sat, 01 Mar 2008 11:26:17 +0100
> 
>> [PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
> 
> Eric, did you get a chance to kernel stack usage stress this
> thing out like I asked?
> 

I noticed some paths in the kernel are very stack aggressive, and on i386 with 
CONFIG_4KSTACKS we were really in a dangerous land, even without my patch.

What we call 4K stacks is in fact 4K - sizeof(struct task_struct), so a little 
bit more than 2K. (not counting some insane configurations where struct 
task_struct takes 3.5 KB, see CONFIG_LATENCYTOP for an example)

So I cooked a different patch that explicitly tests available stack space 
instead of counting a depth value.

The problem is that this patch depends on CONFIG_STACK_GROWSUP and I had an 
issue with it
(see http://kerneltrap.org/mailarchive/linux-kernel/2008/3/5/1079774)
and I had no answer from Kyle or others on this subject.

So I had to disable the optimisation for the HPPA arch (it seems to be the 
only arch that has CONFIG_STACK_GROWSUP)

[PATCH] loopback: calls netif_receive_skb() instead of netif_rx()

Loopback transmit function loopback_xmit() actually calls netif_rx() to queue 
a skb to the softnet queue, and arms a softirq so that this skb can be handled 
later.

This has a cost on SMP, because we need to hold a reference on the device, and 
free this reference when the softirq dequeues the packet.

The following patch directly calls netif_receive_skb() and avoids a lot of 
atomic operations

(atomic_inc(&dev->refcnt), test_and_set_bit(NAPI_STATE_SCHED, &n->state), ...
atomic_dec(&dev->refcnt)...), cache line ping-pongs on the device refcnt, and 
also softirq overhead.

This gives a nice boost on tbench for example (5% on my machine)

We check available free stack space to decide whether to directly call 
netif_receive_skb() or to queue the packet for further softirq handling, when 
stack space is back to an acceptable level.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>


[-- Attachment #2: loopback.patch --]
[-- Type: text/plain, Size: 991 bytes --]

diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c
index f2a6e71..656e29c 100644
--- a/drivers/net/loopback.c
+++ b/drivers/net/loopback.c
@@ -126,6 +126,17 @@ static void emulate_large_send_offload(struct sk_buff *skb)
 }
 #endif /* LOOPBACK_TSO */
 
+static int enough_stack_space(void)
+{
+#ifdef CONFIG_STACK_GROWSUP
+	return 0;
+#else
+	unsigned long free = (unsigned long)&free -
+			     (unsigned long)end_of_stack(current);
+	return free >= THREAD_SIZE/3;
+#endif
+}
+
 /*
  * The higher levels take care of making this non-reentrant (it's
  * called with bh's disabled).
@@ -158,7 +169,14 @@ static int loopback_xmit(struct sk_buff *skb, struct net_device *dev)
 	lb_stats->bytes += skb->len;
 	lb_stats->packets++;
 
-	netif_rx(skb);
+	/*
+	 * We can call netif_receive_skb() instead of netif_rx()
+	 * to speed up processing, but only if we have enough stack space.
+	 */
+	if (enough_stack_space())
+		netif_receive_skb(skb);
+	else
+		netif_rx(skb);
 
 	return 0;
 }

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
  2008-03-23 18:48       ` Eric Dumazet
@ 2008-03-23 19:15         ` Andi Kleen
  2008-03-29  1:36         ` David Miller
  2008-03-31  9:48         ` Ingo Molnar
  2 siblings, 0 replies; 35+ messages in thread
From: Andi Kleen @ 2008-03-23 19:15 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev

Eric Dumazet <dada1@cosmosbay.com> writes:
> 
> What we call 4K stacks is in fact 4K - sizeof(struct task_struct), so

Actually it is 4K - sizeof(struct thread_info). 

> a little bit more than 2K. (not counting some insane configurations where
> struct task_struct takes 3.5 KB, see CONFIG_LATENCYTOP for an example)

thread_info is significantly smaller than task_struct.

That said, I agree that a 4K stack is too tight for many things
and in general dangerous.
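
For reference, the definition in question (as in 2.6-era
include/linux/sched.h):

	union thread_union {
		struct thread_info thread_info;	/* small, at the bottom */
		unsigned long stack[THREAD_SIZE/sizeof(long)];
	};	/* the stack grows down from the top toward thread_info */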

-Andi

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
  2008-03-23 18:48       ` Eric Dumazet
  2008-03-23 19:15         ` Andi Kleen
@ 2008-03-29  1:36         ` David Miller
  2008-03-29  8:18           ` Eric Dumazet
  2008-03-31  9:48         ` Ingo Molnar
  2 siblings, 1 reply; 35+ messages in thread
From: David Miller @ 2008-03-29  1:36 UTC (permalink / raw)
  To: dada1; +Cc: netdev

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Sun, 23 Mar 2008 19:48:29 +0100

> [PATCH] loopback: calls netif_receive_skb() instead of netif_rx()

Hmmm...

+static int enough_stack_space(void)
+{
+#ifdef CONFIG_STACK_GROWSUP
+	return 0;
+#else
+	unsigned long free = (unsigned long)&free -
+			     (unsigned long)end_of_stack(current);
+	return free >= THREAD_SIZE/3;
+#endif
+}
+

This will always fail when we are on an interrupt stack,
I think you'd want it to succeed in such a case.

Can you agree that, at least to a point, this is getting a bit
convoluted and perhaps adding more complexity than this optimization
deserves? :-)


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
  2008-03-29  1:36         ` David Miller
@ 2008-03-29  8:18           ` Eric Dumazet
  2008-03-29 23:54             ` David Miller
  0 siblings, 1 reply; 35+ messages in thread
From: Eric Dumazet @ 2008-03-29  8:18 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

David Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Sun, 23 Mar 2008 19:48:29 +0100
> 
>> [PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
> 
> Hmmm...
> 
> +static int enough_stack_space(void)
> +{
> +#ifdef CONFIG_STACK_GROWSUP
> +	return 0;
> +#else
> +	unsigned long free = (unsigned long)&free -
> +			     (unsigned long)end_of_stack(current);
> +	return free >= THREAD_SIZE/3;
> +#endif
> +}
> +
> 
> This will always fail when we are on an interrupt stack,
> I think you'd want it to succeed in such a case.
> 
> Can you agree that, at least to a point, this is getting a bit
> convoluted and perhaps adding more complexity than this optimization
> deserves? :-)
> 
> 

Yes, I do agree, mixing 'network' and 'mainline' in the same patch is guaranteed 
to be problematic.

We shall wait for 32-cpu machines before thinking about that :)

BTW, can loopback_xmit() be called on an interrupt stack?


^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
  2008-03-29  8:18           ` Eric Dumazet
@ 2008-03-29 23:54             ` David Miller
  2008-03-31  6:38               ` Eric Dumazet
  0 siblings, 1 reply; 35+ messages in thread
From: David Miller @ 2008-03-29 23:54 UTC (permalink / raw)
  To: dada1; +Cc: netdev

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Sat, 29 Mar 2008 09:18:03 +0100

> BTW, can loopback_xmit() be called on an interrupt stack?

Absolutely, softirq processes TCP data, ACK goes out in
softirq context.

Softirqs run on interrupt stacks just as hardirqs do.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
  2008-03-29 23:54             ` David Miller
@ 2008-03-31  6:38               ` Eric Dumazet
  0 siblings, 0 replies; 35+ messages in thread
From: Eric Dumazet @ 2008-03-31  6:38 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

David Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Sat, 29 Mar 2008 09:18:03 +0100
> 
>> BTW, can loopback_xmit() be called on an interrupt stack?
> 
> Absolutely, softirq processes TCP data, ACK goes out in
> softirq context.

yes, ICMP messages (if any are sent) too.

> 
> Softirqs run on interrupt stacks just as hardirqs do.
> 

Well, it depends.

Not on x86_32 with 8K stacks, nor on some other arches:
do_softirq() can use the underlying stack too.

Thank you

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
  2008-03-23 18:48       ` Eric Dumazet
  2008-03-23 19:15         ` Andi Kleen
  2008-03-29  1:36         ` David Miller
@ 2008-03-31  9:48         ` Ingo Molnar
  2008-03-31 10:01           ` Eric Dumazet
  2008-03-31 10:08           ` David Miller
  2 siblings, 2 replies; 35+ messages in thread
From: Ingo Molnar @ 2008-03-31  9:48 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, linux-kernel, Peter Zijlstra


* Eric Dumazet <dada1@cosmosbay.com> wrote:

> I noticed some paths in the kernel are very stack aggressive, and on i386 
> with CONFIG_4KSTACKS we were really in a dangerous land, even without 
> my patch.
>
> What we call 4K stacks is in fact 4K - sizeof(struct task_struct), so 
> a little bit more than 2K. [...]

that's just wrong - 4K stacks on x86 are 4K-sizeof(thread_info) - the 
task struct is allocated elsewhere. The patch below runs just fine on 
4K-stack x86.

	Ingo

------------->
Subject: net: loopback speedup
From: Ingo Molnar <mingo@elte.hu>
Date: Mon Mar 31 11:23:21 CEST 2008

Signed-off-by: Ingo Molnar <mingo@elte.hu>
---
 drivers/net/loopback.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux/drivers/net/loopback.c
===================================================================
--- linux.orig/drivers/net/loopback.c
+++ linux/drivers/net/loopback.c
@@ -158,7 +158,7 @@ static int loopback_xmit(struct sk_buff 
 	lb_stats->bytes += skb->len;
 	lb_stats->packets++;
 
-	netif_rx(skb);
+	netif_receive_skb(skb);
 
 	return 0;
 }

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
  2008-03-31  9:48         ` Ingo Molnar
@ 2008-03-31 10:01           ` Eric Dumazet
  2008-03-31 10:12             ` Ingo Molnar
  2008-03-31 10:08           ` David Miller
  1 sibling, 1 reply; 35+ messages in thread
From: Eric Dumazet @ 2008-03-31 10:01 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: David Miller, netdev, linux-kernel, Peter Zijlstra

Ingo Molnar wrote:
> * Eric Dumazet <dada1@cosmosbay.com> wrote:
>
>   
>> I noticed some paths in the kernel are very stack aggressive, and on i386 
>> with CONFIG_4KSTACKS we were really in a dangerous land, even without 
>> my patch.
>>
>> What we call 4K stacks is in fact 4K - sizeof(struct task_struct), so 
>> a little bit more than 2K. [...]
>>     
>
> that's just wrong - 4K stacks on x86 are 4K-sizeof(thread_info) - the 
> task struct is allocated elsewhere. The patch below runs just fine on 
> 4K-stack x86.
>
>   
Yes, this error was corrected by Andi already :)

Thank you Ingo, but this patch was already suggested by me previously ( 
http://marc.info/?l=linux-netdev&m=120361996713007&w=2 ) and was 
rejected, since we can very easily consume all stack space, especially 
with 4K stacks.
(try with NFS mounts and XFS for example)

The only safe way is to check available free stack space, since we can 
nest loopback_xmit() several times.
In case of protocol errors (like in TCP, if we answered an ACK with 
another ACK, or ICMP loops), we would exhaust the stack instead of 
delaying packets to the next softirq run.

The problem is checking available space:

It depends on the stack growing UP or DOWN, and on the caller running on 
the process stack, a softirq stack, or even a hardirq stack.

> 	Ingo
>
> ------------->
> Subject: net: loopback speedup
> From: Ingo Molnar <mingo@elte.hu>
> Date: Mon Mar 31 11:23:21 CEST 2008
>
> Signed-off-by: Ingo Molnar <mingo@elte.hu>
> ---
>  drivers/net/loopback.c |    2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> Index: linux/drivers/net/loopback.c
> ===================================================================
> --- linux.orig/drivers/net/loopback.c
> +++ linux/drivers/net/loopback.c
> @@ -158,7 +158,7 @@ static int loopback_xmit(struct sk_buff 
>  	lb_stats->bytes += skb->len;
>  	lb_stats->packets++;
>  
> -	netif_rx(skb);
> +	netif_receive_skb(skb);
>  
>  	return 0;
>  }

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
  2008-03-31  9:48         ` Ingo Molnar
  2008-03-31 10:01           ` Eric Dumazet
@ 2008-03-31 10:08           ` David Miller
  2008-03-31 10:44             ` Ingo Molnar
  1 sibling, 1 reply; 35+ messages in thread
From: David Miller @ 2008-03-31 10:08 UTC (permalink / raw)
  To: mingo; +Cc: dada1, netdev, linux-kernel, a.p.zijlstra

From: Ingo Molnar <mingo@elte.hu>
Date: Mon, 31 Mar 2008 11:48:23 +0200

> 
> * Eric Dumazet <dada1@cosmosbay.com> wrote:
> 
> > I noticed some paths in the kernel are very stack aggressive, and on i386 
> > with CONFIG_4KSTACKS we were really in a dangerous land, even without 
> > my patch.
> >
> > What we call 4K stacks is in fact 4K - sizeof(struct task_struct), so 
> > a little bit more than 2K. [...]
> 
> that's just wrong - 4K stacks on x86 are 4K-sizeof(thread_info) - the 
> task struct is allocated elsewhere. The patch below runs just fine on 
> 4K-stack x86.

I don't think it's safe.

Every packet you receive can result in a sent packet, which
in turn can result in a full packet receive path being
taken, and yet again another sent packet.

And so on and so forth.

Some cases like this would be stack bugs, but wouldn't
you like that bug to be a very busy cpu instead of a
crash from overrunning the current stack?

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
  2008-03-31 10:01           ` Eric Dumazet
@ 2008-03-31 10:12             ` Ingo Molnar
  2008-04-01  9:19               ` Eric Dumazet
  0 siblings, 1 reply; 35+ messages in thread
From: Ingo Molnar @ 2008-03-31 10:12 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, netdev, linux-kernel, Peter Zijlstra


* Eric Dumazet <dada1@cosmosbay.com> wrote:

> The problem is checking available space:
>
> It depends on the stack growing UP or DOWN, and on the caller running 
> on the process stack, a softirq stack, or even a hardirq stack.

ok - i wish such threads were on lkml so that everyone, not just the 
netdev cabal, can read it. It's quite ugly, but if we want to check stack 
free space i'd suggest for you to put a stack_can_recurse() call into 
arch/x86/kernel/process.c and offer a default __weak implementation in 
kernel/fork.c that always returns 0.

the rule on x86 should be something like this: on 4K stacks and 64-bit 
[which have irqstacks] free stack space can go as low as 25%. On 8K 
stacks [which don't have irqstacks but nest irqs] it should not go 
below 50% before falling back to the explicitly queued packet branch.

this way other pieces of kernel code can choose between on-stack 
fast recursion and explicit iterators. Although i'm not sure i like the 
whole concept to begin with ...

	Ingo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
  2008-03-31 10:08           ` David Miller
@ 2008-03-31 10:44             ` Ingo Molnar
  2008-03-31 11:02               ` David Miller
  0 siblings, 1 reply; 35+ messages in thread
From: Ingo Molnar @ 2008-03-31 10:44 UTC (permalink / raw)
  To: David Miller; +Cc: dada1, netdev, linux-kernel, a.p.zijlstra


* David Miller <davem@davemloft.net> wrote:

> I don't think it's safe.
> 
> Every packet you receive can result in a sent packet, which in turn 
> can result in a full packet receive path being taken, and yet again 
> another sent packet.
> 
> And so on and so forth.
> 
> Some cases like this would be stack bugs, but wouldn't you like that 
> bug to be a very busy cpu instead of a crash from overrunning the 
> current stack?

sure.

But the core problem remains: our loopback networking scalability is 
poor. For plain localhost<->localhost connected sockets we hit the 
loopback device lock for every packet, and this very much shows up on 
real workloads on a quad already: the lock instruction in netif_rx is 
the most expensive instruction in a sysbench DB workload.

and it's not just about scalability, the plain algorithmic overhead is 
way too high as well:

 $ taskset 1 ./bw_tcp -s
 $ taskset 1 ./bw_tcp localhost
 Socket bandwidth using localhost: 2607.09 MB/sec
 $ taskset 1 ./bw_pipe
 Pipe bandwidth: 3680.44 MB/sec

i don't think this is acceptable. Either we should fix loopback TCP 
performance or we should transparently switch to VFS pipes as a 
transport method when an app establishes a plain loopback connection (as 
long as there are no frills like a content-modifying component in the 
delivery path of packets after a connection has been established - which 
covers 99.9% of the real-life loopback cases).

I'm not suggesting we shouldn't use TCP for connection establishment - but 
if the TCP loopback packet transport is too slow we should use the VFS 
transport, which is both more scalable, less cache-intense and has lower 
straight overhead as well.

	Ingo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
  2008-03-31 10:44             ` Ingo Molnar
@ 2008-03-31 11:02               ` David Miller
  2008-03-31 11:36                 ` poor network loopback performance and scalability (was: Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()) Ingo Molnar
  0 siblings, 1 reply; 35+ messages in thread
From: David Miller @ 2008-03-31 11:02 UTC (permalink / raw)
  To: mingo; +Cc: dada1, netdev, linux-kernel, a.p.zijlstra

From: Ingo Molnar <mingo@elte.hu>
Date: Mon, 31 Mar 2008 12:44:03 +0200

> and it's not just about scalability, the plain algorithmic overhead is 
> way too high as well:
> 
>  $ taskset 1 ./bw_tcp -s
>  $ taskset 1 ./bw_tcp localhost
>  Socket bandwidth using localhost: 2607.09 MB/sec
>  $ taskset 1 ./bw_pipe
>  Pipe bandwidth: 3680.44 MB/sec

Set your loopback MTU to some larger value if this result and
the locking overhead upset you.

Also, woe be to the application that wants fast local interprocess
communication and doesn't use IPC_SHM, MAP_SHARED, pipes, or AF_UNIX
sockets.  (there's not just one better facility, there are _four_!)

From this perspective, people way-overemphasize loopback performance,
and 999 times out of 1000 they prove their points using synthetic
benchmarks.

And don't give me this garbage about the application wanting to be
generic and therefore use IP sockets for everything.  Either they want
to be generic, or they want the absolute best performance.  Trying
to get an "or" and have both at the same time will result in
ludicrous hacks ending up in the kernel.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* poor network loopback performance and scalability (was: Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx())
  2008-03-31 11:02               ` David Miller
@ 2008-03-31 11:36                 ` Ingo Molnar
  2008-04-21  3:24                   ` Herbert Xu
  0 siblings, 1 reply; 35+ messages in thread
From: Ingo Molnar @ 2008-03-31 11:36 UTC (permalink / raw)
  To: David Miller; +Cc: dada1, netdev, linux-kernel, a.p.zijlstra


* David Miller <davem@davemloft.net> wrote:

> From: Ingo Molnar <mingo@elte.hu>
> Date: Mon, 31 Mar 2008 12:44:03 +0200
> 
> > and it's not just about scalability, the plain algorithmic overhead is 
> > way too high as well:
> > 
> >  $ taskset 1 ./bw_tcp -s
> >  $ taskset 1 ./bw_tcp localhost
> >  Socket bandwidth using localhost: 2607.09 MB/sec
> >  $ taskset 1 ./bw_pipe
> >  Pipe bandwidth: 3680.44 MB/sec
> 
> Set your loopback MTU to some larger value if this result and the 
> locking overhead upset you.

yes, of course it "upsets me" - it shows up in macrobenchmarks as well 
(not just lmbench) - wouldn't (and shouldn't) that upset you?

And even with a ridiculously high MTU of 1048576 there's only a 13% 
improvement:

   # ifconfig lo mtu 1048576
   # taskset 1 ./bw_tcp -s
   # taskset 1 ./bw_tcp localhost
   Socket bandwidth using localhost: 2951.51 MB/sec

pipes are still another ~25% faster:

   # taskset 1 ./bw_pipe
   Pipe bandwidth: 3657.40 MB/sec

> > i don't think this is acceptable. Either we should fix loopback TCP 
> > performance or we should transparently switch to VFS pipes as a 
> > transport method when an app establishes a plain loopback connection 
> > (as long as there are no frills like a content-modifying component in 
> > the delivery path of packets after a connection has been established 
> > - which covers 99.9% of the real-life loopback cases).
> >
> > I'm not suggesting we shouldn't use TCP for connection establishment - 
> > but if the TCP loopback packet transport is too slow we should use 
> > the VFS transport, which is both more scalable, less cache-intense 
> > and has lower straight overhead as well.
[...]

> Also, woe be to the application that wants fast local interprocess 
> communication and doesn't use IPC_SHM, MAP_SHARED, pipes, or AF_UNIX 
> sockets.  (there's not just one better facility, there are _four_!)
> 
> From this perspective, people way-overemphasize loopback performance, 
> and 999 times out of 1000 they prove their points using synthetic 
> benchmarks.
> 
> And don't give me this garbage about the application wanting to be 
> generic and therefore use IP sockets for everything.  Either they want 
> to be generic, or they want the absolute best performance.  Trying to 
> get an "or" and have both at the same time will result in ludicrous 
> hacks ending up in the kernel.

i talked about the localhost data transport only (in the portion you 
dropped from your quotes), not about the connection API or the overall 
management of such sockets. There's absolutely no good technical reason 
i can see why plain loopback sockets should be forced to go over a 
global lock, or why apps should be forced to change to another API when 
the real problem is that kernel developers are too lazy or incompetent 
to fix their code.

And i'm still trying to establish whether we have common ground for 
discussion: do you accept my numbers that the TCP loopback transport 
performs badly when compared to pipes? (i think you accepted that 
implicitly, but i don't want to put words in your mouth.)

Having agreed on that, do you share my view that it should be and could 
be fixed? Or do you claim that it cannot be fixed and won't ever be 
fixed?

	Ingo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
  2008-03-31 10:12             ` Ingo Molnar
@ 2008-04-01  9:19               ` Eric Dumazet
  2008-04-03 14:06                 ` Pavel Machek
  0 siblings, 1 reply; 35+ messages in thread
From: Eric Dumazet @ 2008-04-01  9:19 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: David Miller, netdev, linux-kernel, Peter Zijlstra

[-- Attachment #1: Type: text/plain, Size: 2186 bytes --]

Ingo Molnar wrote:
> * Eric Dumazet <dada1@cosmosbay.com> wrote:
>
>   
>> The problem is checking available space:
>>
>> It depends on the stack growing UP or DOWN, and on the caller running 
>> on the process stack, a softirq stack, or even a hardirq stack.
>>     
>
> ok - i wish such threads were on lkml so that everyone, not just the 
> netdev cabal, can read it. It's quite ugly, but if we want to check stack 
> free space i'd suggest for you to put a stack_can_recurse() call into 
> arch/x86/kernel/process.c and offer a default __weak implementation in 
> kernel/fork.c that always returns 0.
>
> the rule on x86 should be something like this: on 4K stacks and 64-bit 
> [which have irqstacks] free stack space can go as low as 25%. On 8K 
> stacks [which don't have irqstacks but nest irqs] it should not go 
> below 50% before falling back to the explicitly queued packet branch.
>
> this way other pieces of kernel code can choose between on-stack 
> fast recursion and explicit iterators. Although i'm not sure i like the 
> whole concept to begin with ...
>
>   

Hi Ingo

I took the time to prepare a patch to implement arch_stack_can_recurse() 
as you suggested.

Thank you

[PATCH] x86 : arch_stack_can_recurse() introduction

Some paths in the kernel would like to choose between on-stack fast 
recursion and explicit iterators.

One identified spot is in the net loopback driver, where we can avoid 
netif_rx() and its slowdown if sufficient stack space is available.

We introduce a generic arch_stack_can_recurse(), which defaults to a weak 
function returning 0.

On x86, we implement the following logic:

   32 bits and 4K stacks (separate irq stacks) : can use up to 25% of stack
   64 bits, 8K stacks (separate irq stacks)    : can use up to 25% of stack
   32 bits and 8K stacks (no irq stacks)       : can use up to 50% of stack

Example of use in drivers/net/loopback.c, function loopback_xmit():

if (arch_stack_can_recurse())
    netif_receive_skb(skb); /* immediate delivery to stack */
else
    netif_rx(skb); /* defer to softirq handling */

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>



[-- Attachment #2: can_recurse.patch --]
[-- Type: text/plain, Size: 2282 bytes --]

diff --git a/arch/x86/kernel/process.c b/arch/x86/kernel/process.c
index 0e613e7..6edc1d3 100644
--- a/arch/x86/kernel/process.c
+++ b/arch/x86/kernel/process.c
@@ -43,3 +43,25 @@ void arch_task_cache_init(void)
 				  __alignof__(union thread_xstate),
 				  SLAB_PANIC, NULL);
 }
+
+
+/*
+ * Used to check if we can recurse without risking stack overflow
+ * Rules are :
+ *   32 bits and 4K stacks (separate irq stacks) : can use up to 25% of stack
+ *   64 bits, 8K stacks (separate irq stacks)    : can use up to 25% of stack
+ *   32 bits and 8K stacks (no irq stacks)       : can use up to 50% of stack
+ */
+#if defined(CONFIG_4KSTACKS) || defined(CONFIG_X86_64)
+# define STACK_RECURSE_LIMIT (THREAD_SIZE/4)
+#else
+# define STACK_RECURSE_LIMIT (THREAD_SIZE/2)
+#endif
+
+int arch_stack_can_recurse(void)
+{
+	unsigned long offset_stack = current_stack_pointer & (THREAD_SIZE - 1);
+	unsigned long avail_stack = offset_stack - sizeof(struct thread_info);
+
+	return avail_stack >= STACK_RECURSE_LIMIT;
+}
diff --git a/include/asm-x86/thread_info_64.h b/include/asm-x86/thread_info_64.h
index f23fefc..9a913c4 100644
--- a/include/asm-x86/thread_info_64.h
+++ b/include/asm-x86/thread_info_64.h
@@ -60,6 +60,9 @@ struct thread_info {
 #define init_thread_info	(init_thread_union.thread_info)
 #define init_stack		(init_thread_union.stack)
 
+/* how to get the current stack pointer from C */
+register unsigned long current_stack_pointer asm("rsp") __used;
+
 static inline struct thread_info *current_thread_info(void)
 {
 	struct thread_info *ti;
diff --git a/include/linux/sched.h b/include/linux/sched.h
index ca720f0..445b8da 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1600,6 +1600,8 @@ union thread_union {
 	unsigned long stack[THREAD_SIZE/sizeof(long)];
 };
 
+extern int arch_stack_can_recurse(void);
+
 #ifndef __HAVE_ARCH_KSTACK_END
 static inline int kstack_end(void *addr)
 {
diff --git a/kernel/fork.c b/kernel/fork.c
index a19df75..cd5d1e1 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -136,6 +136,11 @@ void __attribute__((weak)) arch_task_cache_init(void)
 {
 }
 
+int __attribute__((weak)) arch_stack_can_recurse(void)
+{
+	return 0;
+}
+
 void __init fork_init(unsigned long mempages)
 {
 #ifndef __HAVE_ARCH_TASK_STRUCT_ALLOCATOR

^ permalink raw reply related	[flat|nested] 35+ messages in thread

* Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
  2008-04-01  9:19               ` Eric Dumazet
@ 2008-04-03 14:06                 ` Pavel Machek
  2008-04-03 16:19                   ` Eric Dumazet
  0 siblings, 1 reply; 35+ messages in thread
From: Pavel Machek @ 2008-04-03 14:06 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Ingo Molnar, David Miller, netdev, linux-kernel, Peter Zijlstra

Hi!

> >the rule on x86 should be something like this: on 4K 
> >stacks and 64-bit [which have irqstacks] free stack 
> >space can go as low as 25%. On 8K stacks [which don't 
...
>   32 bits and 4K stacks (separate irq stacks) : can use 
>   up to 25% of stack

I think Ingo meant 'up to 75% used'.


-- 
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()
  2008-04-03 14:06                 ` Pavel Machek
@ 2008-04-03 16:19                   ` Eric Dumazet
  0 siblings, 0 replies; 35+ messages in thread
From: Eric Dumazet @ 2008-04-03 16:19 UTC (permalink / raw)
  To: Pavel Machek
  Cc: Ingo Molnar, David Miller, netdev, linux-kernel, Peter Zijlstra

Pavel Machek wrote:
> Hi!
>
>   
>>> the rule on x86 should be something like this: on 4K 
>>> stacks and 64-bit [which have irqstacks] free stack 
> >> space can go as low as 25%. On 8K stacks [which don't 
>>>       
> ...
>   
>>   32 bits and 4K stacks (separate irq stacks) : can use 
>>   up to 25% of stack
>>     
>
> I think ingo meant 'up to 75% used'.
>
>
>   
The patch is OK, my English might be a little bit unusual :)





^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: poor network loopback performance and scalability (was: Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx())
  2008-03-31 11:36                 ` poor network loopback performance and scalability (was: Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()) Ingo Molnar
@ 2008-04-21  3:24                   ` Herbert Xu
  2008-04-21  3:38                     ` poor network loopback performance and scalability David Miller
  0 siblings, 1 reply; 35+ messages in thread
From: Herbert Xu @ 2008-04-21  3:24 UTC (permalink / raw)
  To: Ingo Molnar; +Cc: davem, dada1, netdev, linux-kernel, a.p.zijlstra

Ingo Molnar <mingo@elte.hu> wrote:
> 
>   # ifconfig lo mtu 1048576

Chiming in late here, but 1048576 can't possibly work with IP,
which uses a 16-bit quantity as the length header.  In fact a
quick test seems to indicate that a 1048576 MTU doesn't generate
anything bigger than the default 16K MTU.
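
For reference, the field in question (a partial excerpt of struct iphdr
from include/linux/ip.h; unrelated fields elided):

	struct iphdr {
		/* ... version/ihl/tos ... */
		__be16	tot_len;	/* total length: caps an IPv4 datagram at 65535 bytes */
		/* ... id, frag_off, ttl, protocol, check, saddr, daddr ... */
	};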

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: poor network loopback performance and scalability
  2008-04-21  3:24                   ` Herbert Xu
@ 2008-04-21  3:38                     ` David Miller
  2008-04-21  8:11                       ` Ingo Molnar
  0 siblings, 1 reply; 35+ messages in thread
From: David Miller @ 2008-04-21  3:38 UTC (permalink / raw)
  To: herbert; +Cc: mingo, dada1, netdev, linux-kernel, a.p.zijlstra

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 21 Apr 2008 11:24:04 +0800

> Ingo Molnar <mingo@elte.hu> wrote:
> > 
> >   # ifconfig lo mtu 1048576
> 
> Chiming in late here, but 1048576 can't possibly work with IP,
> which uses a 16-bit quantity as the length header.  In fact a
> quick test seems to indicate that a 1048576 MTU doesn't generate
> anything bigger than the default 16K MTU.

Right.

To move things forward, we should look into doing something
similar to what Al Viro suggested, which would be to return
an SKB pointer from the transmit path and call back into
netif_receive_skb() using that.

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: poor network loopback performance and scalability
  2008-04-21  3:38                     ` poor network loopback performance and scalability David Miller
@ 2008-04-21  8:11                       ` Ingo Molnar
  2008-04-21  8:16                         ` David Miller
  0 siblings, 1 reply; 35+ messages in thread
From: Ingo Molnar @ 2008-04-21  8:11 UTC (permalink / raw)
  To: David Miller; +Cc: herbert, dada1, netdev, linux-kernel, a.p.zijlstra


* David Miller <davem@davemloft.net> wrote:

> From: Herbert Xu <herbert@gondor.apana.org.au>
> Date: Mon, 21 Apr 2008 11:24:04 +0800
> 
> > Ingo Molnar <mingo@elte.hu> wrote:
> > > 
> > >   # ifconfig lo mtu 1048576
> > 
> > Chiming in late here, but 1048576 can't possibly work with IP, which 
> > uses a 16-bit quantity as the length header.  In fact a quick test 
> > seems to indicate that a 1048576 MTU doesn't generate anything 
> > bigger than the default 16K MTU.
> 
> Right.
> 
> To move things forward, we should look into doing something similar to 
> what Al Viro suggested, which would be to return an SKB pointer from 
> the transmit path and call back into netif_receive_skb() using that.

yep, basically the sk_peer trick that AF_UNIX is already using.

it just seems rather more tricky in the 'real skb' localhost case 
because there's no real established trust path we can pass this coupling 
of the two sockets over. Netfilter might affect it and deny a localhost 
connection. Lifetime rules seem rather tricky as well: either end of the 
localhost connection can go away independently, so a refcount to the 
socket has to be kept. skb->sk might be something to use, but it looks 
like a dangerous complication and it would burden the fastpath with an 
extra sk reference inc/dec.

... so i'm not implying that any of this is an easy topic to solve (to 
me at least :). But the fact is that database connections over localhost 
are very common in web apps, and very convenient as well. I use it 
myself - an AF_UNIX transport is often non-existent in apps and libraries 
or is often just an afterthought with limitations - apps tend to 
gravitate towards a single API. So i don't think "use AF_UNIX" is an 
acceptable answer in this case. I believe we should try to make the 
localhost transport comparably fast to AF_UNIX.
	Ingo

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: poor network loopback performance and scalability
  2008-04-21  8:11                       ` Ingo Molnar
@ 2008-04-21  8:16                         ` David Miller
  2008-04-21 10:19                           ` Herbert Xu
  0 siblings, 1 reply; 35+ messages in thread
From: David Miller @ 2008-04-21  8:16 UTC (permalink / raw)
  To: mingo; +Cc: herbert, dada1, netdev, linux-kernel, a.p.zijlstra

From: Ingo Molnar <mingo@elte.hu>
Date: Mon, 21 Apr 2008 10:11:03 +0200

> 
> * David Miller <davem@davemloft.net> wrote:
> 
> > To move things forward, we should look into doing something similar to 
> > what Al Viro suggested, which would be to return an SKB pointer from 
> > the transmit path and call back into netif_receive_skb() using that.
> 
> yep, basically the sk_peer trick that AF_UNIX is already using.

Please read again, that isn't the suggestion being discussed.

What's being discussed is having the top of the transmit call path
getting a socket "buffer" pointer, that it can feed back into the
packet input path directly.  Loopback would return buffer pointers
from ->hard_start_xmit() instead of passing them netif_rx().  The top
of the transmit call path, upon getting a non-NULL buffer returned,
would pass it to netif_receive_skb().
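
A hypothetical sketch of that flow (illustrative names only, not code
from any tree; note the changed return type of the xmit handler):

	/* loopback's xmit hands the skb back instead of queueing it */
	static struct sk_buff *loopback_xmit(struct sk_buff *skb,
					     struct net_device *dev)
	{
		/* ... stats update ... */
		return skb;	/* "caller, please receive this yourself" */
	}

	/* at the top of the transmit call path: */
	skb = dev->hard_start_xmit(skb, dev);
	if (skb)
		netif_receive_skb(skb);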

We're not talking about sockets, although that is another idea (which
I'm working on a patch for, and I have a mechanism for what you refer
to as "path validation").

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: poor network loopback performance and scalability
  2008-04-21  8:16                         ` David Miller
@ 2008-04-21 10:19                           ` Herbert Xu
  2008-04-21 10:22                             ` David Miller
  0 siblings, 1 reply; 35+ messages in thread
From: Herbert Xu @ 2008-04-21 10:19 UTC (permalink / raw)
  To: David Miller; +Cc: mingo, dada1, netdev, linux-kernel, a.p.zijlstra

On Mon, Apr 21, 2008 at 01:16:23AM -0700, David Miller wrote:
>
> What's being discussed is having the top of the transmit call path
> getting a socket "buffer" pointer, that it can feed back into the
> packet input path directly.  Loopback would return buffer pointers
> from ->hard_start_xmit() instead of passing them netif_rx().  The top
> of the transmit call path, upon getting a non-NULL buffer returned,
> would pass it to netif_receive_skb().

Yes this will definitely reduce the per-packet cost.  The other
low-hanging fruit is to raise the loopback MTU to just below 64K.
I believe the current value is a legacy from the days when we didn't
support skb page frags so everything had to be physically contiguous.

Longer term we could look at generating packets > 64K on lo, for
IPv6 anyway.

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt

^ permalink raw reply	[flat|nested] 35+ messages in thread

* Re: poor network loopback performance and scalability
  2008-04-21 10:19                           ` Herbert Xu
@ 2008-04-21 10:22                             ` David Miller
  0 siblings, 0 replies; 35+ messages in thread
From: David Miller @ 2008-04-21 10:22 UTC (permalink / raw)
  To: herbert; +Cc: mingo, dada1, netdev, linux-kernel, a.p.zijlstra

From: Herbert Xu <herbert@gondor.apana.org.au>
Date: Mon, 21 Apr 2008 18:19:08 +0800

> I believe the current value is a legacy from the days when we didn't
> support skb page frags so everything had to be physically contiguous.

It's a legacy from when my top-of-the-line UltraSPARC-I 130MHz CPUs
got the best loopback results using that value :-)

^ permalink raw reply	[flat|nested] 35+ messages in thread

end of thread

Thread overview: 35+ messages
2008-02-21 18:51 [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx() Eric Dumazet
2008-02-21 20:14 ` Daniel Lezcano
2008-02-21 23:19   ` Eric Dumazet
2008-02-22 10:19     ` Daniel Lezcano
2008-02-27  2:21 ` David Miller
2008-02-27  7:20   ` Jarek Poplawski
2008-02-27  7:23     ` David Miller
2008-02-27  7:34       ` Jarek Poplawski
2008-03-01 10:26   ` Eric Dumazet
2008-03-04  4:55     ` David Miller
2008-03-04  5:15       ` Stephen Hemminger
2008-03-04  6:27       ` Eric Dumazet
2008-03-23 10:29     ` David Miller
2008-03-23 18:48       ` Eric Dumazet
2008-03-23 19:15         ` Andi Kleen
2008-03-29  1:36         ` David Miller
2008-03-29  8:18           ` Eric Dumazet
2008-03-29 23:54             ` David Miller
2008-03-31  6:38               ` Eric Dumazet
2008-03-31  9:48         ` Ingo Molnar
2008-03-31 10:01           ` Eric Dumazet
2008-03-31 10:12             ` Ingo Molnar
2008-04-01  9:19               ` Eric Dumazet
2008-04-03 14:06                 ` Pavel Machek
2008-04-03 16:19                   ` Eric Dumazet
2008-03-31 10:08           ` David Miller
2008-03-31 10:44             ` Ingo Molnar
2008-03-31 11:02               ` David Miller
2008-03-31 11:36                 ` poor network loopback performance and scalability (was: Re: [RFC,PATCH] loopback: calls netif_receive_skb() instead of netif_rx()) Ingo Molnar
2008-04-21  3:24                   ` Herbert Xu
2008-04-21  3:38                     ` poor network loopback performance and scalability David Miller
2008-04-21  8:11                       ` Ingo Molnar
2008-04-21  8:16                         ` David Miller
2008-04-21 10:19                           ` Herbert Xu
2008-04-21 10:22                             ` David Miller
