netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [RFC] about "net: orphan frags on receive" insanity
@ 2013-06-26  8:23 Eric Dumazet
  2013-06-26  9:56 ` Michael S. Tsirkin
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2013-06-26  8:23 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev

Michael

Following commit added 400 bytes into __netif_receive_skb_core()

Could we revert it, and fix the problem into the callers instead ?

Thanks



commit 1080e512d44d4f67b8beb8edf25a1bbcb1066dc7
Author: Michael S. Tsirkin <mst@redhat.com>
Date:   Fri Jul 20 09:23:17 2012 +0000

    net: orphan frags on receive
    
    zero copy packets are normally sent to the outside
    network, but bridging, tun etc might loop them
    back to host networking stack. If this happens
    destructors will never be called, so orphan
    the frags immediately on receive.
    
    Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] about "net: orphan frags on receive" insanity
  2013-06-26  8:23 [RFC] about "net: orphan frags on receive" insanity Eric Dumazet
@ 2013-06-26  9:56 ` Michael S. Tsirkin
  2013-06-26 10:12   ` Eric Dumazet
  0 siblings, 1 reply; 8+ messages in thread
From: Michael S. Tsirkin @ 2013-06-26  9:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Wed, Jun 26, 2013 at 01:23:59AM -0700, Eric Dumazet wrote:
> Michael
> 
> Following commit added 400 bytes into __netif_receive_skb_core()
> 
> Could we revert it, and fix the problem into the callers instead ?
> 
> Thanks
> 
> 
> 
> commit 1080e512d44d4f67b8beb8edf25a1bbcb1066dc7
> Author: Michael S. Tsirkin <mst@redhat.com>
> Date:   Fri Jul 20 09:23:17 2012 +0000
> 
>     net: orphan frags on receive
>     
>     zero copy packets are normally sent to the outside
>     network, but bridging, tun etc might loop them
>     back to host networking stack. If this happens
>     destructors will never be called, so orphan
>     the frags immediately on receive.
>     
>     Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
>     Signed-off-by: David S. Miller <davem@davemloft.net>

Please don't just revert it.

Packets can cross e.g. the bridge and end up on the external
interface, in which case that NIC guarantees that
they will get transmitted eventually, so it's safe
to transmit from guest memory, or they can end up
in the host stack where they can stay indefinitely,
in this case we need to orphan frags so guest networking
can keep going.

What do you suggest exactly?

Also, I'll have to look at the generated binary - when I
coded this up, compiler seemed to only put a
conditional branch on the fast path, this shouldn't
be all that expensive.

-- 
MST

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] about "net: orphan frags on receive" insanity
  2013-06-26  9:56 ` Michael S. Tsirkin
@ 2013-06-26 10:12   ` Eric Dumazet
  2013-06-26 18:56     ` Michael S. Tsirkin
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2013-06-26 10:12 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev

On Wed, 2013-06-26 at 12:56 +0300, Michael S. Tsirkin wrote:

> Please don't just revert it.

I said "revert and fix the callers"

> 
> Packets can cross e.g. the bridge and end up on the external
> interface, in which case that NIC guarantees that
> they will get transmitted eventually, so it's safe
> to transmit from guest memory, or they can end up
> in the host stack where they can stay indefinitely,
> in this case we need to orphan frags so guest networking
> can keep going.
> 
> What do you suggest exactly?
> 
> Also, I'll have to look at the generated binary - when I
> coded this up, compiler seemed to only put a
> conditional branch on the fast path, this shouldn't
> be all that expensive.

Then look again. Its 400 bytes right now.

I suggested to call this only from the impacted callers.

VM are fine, but if we slow down bare metal to close the gap, or make
appealing kernel bypass, that's not very fair.

Why normal frames received from real NIC should pay this price, I don't
know.

The conditional branch might sound not expensive, but the cache line
miss on skb_shinfo(skb)->tx_flags is expensive.

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] about "net: orphan frags on receive" insanity
  2013-06-26 10:12   ` Eric Dumazet
@ 2013-06-26 18:56     ` Michael S. Tsirkin
  2013-06-26 19:09       ` Eric Dumazet
  0 siblings, 1 reply; 8+ messages in thread
From: Michael S. Tsirkin @ 2013-06-26 18:56 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Wed, Jun 26, 2013 at 03:12:23AM -0700, Eric Dumazet wrote:
> On Wed, 2013-06-26 at 12:56 +0300, Michael S. Tsirkin wrote:
> 
> > Please don't just revert it.
> 
> I said "revert and fix the callers"
> 
> > 
> > Packets can cross e.g. the bridge and end up on the external
> > interface, in which case that NIC guarantees that
> > they will get transmitted eventually, so it's safe
> > to transmit from guest memory, or they can end up
> > in the host stack where they can stay indefinitely,
> > in this case we need to orphan frags so guest networking
> > can keep going.
> > 
> > What do you suggest exactly?
> > 
> > Also, I'll have to look at the generated binary - when I
> > coded this up, compiler seemed to only put a
> > conditional branch on the fast path, this shouldn't
> > be all that expensive.
> 
> Then look again. Its 400 bytes right now.
> 
> I suggested to call this only from the impacted callers.

Well we don't want to duplicate the whole RX path
to special-case that, right?

> VM are fine, but if we slow down bare metal to close the gap, or make
> appealing kernel bypass, that's not very fair.
> 
> Why normal frames received from real NIC should pay this price, I don't
> know.
> 
> The conditional branch might sound not expensive, but the cache line
> miss on skb_shinfo(skb)->tx_flags is expensive.
> 

Okay, for the RX path, I think we can set a
separate flag in the skb itself, and branch
based on that.

Would this address the comment?

-- 
MST

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] about "net: orphan frags on receive" insanity
  2013-06-26 18:56     ` Michael S. Tsirkin
@ 2013-06-26 19:09       ` Eric Dumazet
  2013-06-26 19:22         ` Michael S. Tsirkin
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2013-06-26 19:09 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev

On Wed, 2013-06-26 at 21:56 +0300, Michael S. Tsirkin wrote:

> Well we don't want to duplicate the whole RX path
> to special-case that, right?


Whats wrong using a helper ?

I_Should_cleanup_things(skb)
{
	... // cleanup for these special users trying to reenter stack

	netif_rx(skb);
}

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] about "net: orphan frags on receive" insanity
  2013-06-26 19:09       ` Eric Dumazet
@ 2013-06-26 19:22         ` Michael S. Tsirkin
  2013-06-26 19:44           ` Eric Dumazet
  0 siblings, 1 reply; 8+ messages in thread
From: Michael S. Tsirkin @ 2013-06-26 19:22 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Wed, Jun 26, 2013 at 12:09:24PM -0700, Eric Dumazet wrote:
> On Wed, 2013-06-26 at 21:56 +0300, Michael S. Tsirkin wrote:
> 
> > Well we don't want to duplicate the whole RX path
> > to special-case that, right?
> 
> 
> Whats wrong using a helper ?
> 
> I_Should_cleanup_things(skb)
> {
> 	... // cleanup for these special users trying to reenter stack
> 
> 	netif_rx(skb);
> }
> 

The point is we don't know the final destination of the packet
until it's going through the stack.

We don't want to trigger a copy for all data we get from tun:
we only want to do this if the data has a chance to get
queued somewhere indefinitely.


-- 
MST

^ permalink raw reply	[flat|nested] 8+ messages in thread

* Re: [RFC] about "net: orphan frags on receive" insanity
  2013-06-26 19:22         ` Michael S. Tsirkin
@ 2013-06-26 19:44           ` Eric Dumazet
  2013-06-27  7:09             ` Michael S. Tsirkin
  0 siblings, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2013-06-26 19:44 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev

On Wed, 2013-06-26 at 22:22 +0300, Michael S. Tsirkin wrote:

> The point is we don't know the final destination of the packet
> until it's going through the stack.
> 
> We don't want to trigger a copy for all data we get from tun:
> we only want to do this if the data has a chance to get
> queued somewhere indefinitely.

I think you missed my point.

I am pretty sure it should be done from netif_rx(), not from
__netif_receive_skb_core()
so that modern NIC devices do not have to pay this extra cost.

# size net/core/dev_*.o
   text	   data	    bss	    dec	    hex	filename
  41928	    963	    752	  43643	   aa7b	net/core/dev_before.o
  41579	    963	    752	  43294	   a91e	net/core/dev_after.o

Untested patch :

diff --git a/net/core/dev.c b/net/core/dev.c
index fc1e289..3730318 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1646,8 +1646,6 @@ static inline int deliver_skb(struct sk_buff *skb,
 			      struct packet_type *pt_prev,
 			      struct net_device *orig_dev)
 {
-	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
-		return -ENOMEM;
 	atomic_inc(&skb->users);
 	return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
 }
@@ -3133,6 +3131,9 @@ int netif_rx(struct sk_buff *skb)
 	if (netpoll_rx(skb))
 		return NET_RX_DROP;
 
+	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
+		return NET_RX_DROP;
+
 	net_timestamp_check(netdev_tstamp_prequeue, skb);
 
 	trace_netif_rx(skb);
@@ -3498,10 +3499,7 @@ ncls:
 	}
 
 	if (pt_prev) {
-		if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
-			goto drop;
-		else
-			ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
+		ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
 	} else {
 drop:
 		atomic_long_inc(&skb->dev->rx_dropped);

^ permalink raw reply related	[flat|nested] 8+ messages in thread

* Re: [RFC] about "net: orphan frags on receive" insanity
  2013-06-26 19:44           ` Eric Dumazet
@ 2013-06-27  7:09             ` Michael S. Tsirkin
  0 siblings, 0 replies; 8+ messages in thread
From: Michael S. Tsirkin @ 2013-06-27  7:09 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev

On Wed, Jun 26, 2013 at 12:44:34PM -0700, Eric Dumazet wrote:
> On Wed, 2013-06-26 at 22:22 +0300, Michael S. Tsirkin wrote:
> 
> > The point is we don't know the final destination of the packet
> > until it's going through the stack.
> > 
> > We don't want to trigger a copy for all data we get from tun:
> > we only want to do this if the data has a chance to get
> > queued somewhere indefinitely.
> 
> I think you missed my point.
> 
> I am pretty sure it should be done from netif_rx(), not from
> __netif_receive_skb_core()
> so that modern NIC devices do not have to pay this extra cost.

Yres, this is exactly what I thought you meant, but as I said,
this approach disables tun zero copy.
The point of the current code is to copy only if there
are any host protocols that consume the skb.

> # size net/core/dev_*.o
>    text	   data	    bss	    dec	    hex	filename
>   41928	    963	    752	  43643	   aa7b	net/core/dev_before.o
>   41579	    963	    752	  43294	   a91e	net/core/dev_after.o

So we have a lot of code like this:
                if (pt_prev) {
                        ret = deliver_skb(skb, pt_prev, orig_dev);
                }
                pt_prev = NULL;
and this is what created all this.

But in practice pt_prev is set when we have multiple
consumers for a packet - this is the standard path:
        if (pt_prev) {
                if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
                        goto drop;
                else
                        ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
        }
So I think calls to deliver_skb
are not all that common - we should not make gcc inline
them so aggressively, let it make its own decisions.

Here's an alternative patch:


diff --git a/net/core/dev.c b/net/core/dev.c
index fc1e289..03cb51c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1642,9 +1642,9 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 }
 EXPORT_SYMBOL_GPL(dev_forward_skb);
 
-static inline int deliver_skb(struct sk_buff *skb,
-			      struct packet_type *pt_prev,
-			      struct net_device *orig_dev)
+static int deliver_skb(struct sk_buff *skb,
+		       struct packet_type *pt_prev,
+		       struct net_device *orig_dev)
 {
 	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
 		return -ENOMEM;


This gives us more than half the gain of your patch
without breaking tun zero copy.

[mst@robin linux]$ size net/core/dev_orig.o 
   text    data     bss     dec     hex filename
  47357    1270     720   49347    c0c3 net/core/dev_orig.o
[mst@robin linux]$ size net/core/dev.o
   text    data     bss     dec     hex filename
  47105    1270     720   49095    bfc7 net/core/dev.o

Maybe we should tag calls to if (pt_prev) before deliver_skb
as unlikely, this will move this code further from the hot path -
this needs some testing.



> Untested patch :
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index fc1e289..3730318 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1646,8 +1646,6 @@ static inline int deliver_skb(struct sk_buff *skb,
>  			      struct packet_type *pt_prev,
>  			      struct net_device *orig_dev)
>  {
> -	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
> -		return -ENOMEM;
>  	atomic_inc(&skb->users);
>  	return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
>  }
> @@ -3133,6 +3131,9 @@ int netif_rx(struct sk_buff *skb)

this is basically assuming tun will use netif_rx forever,
but I think it's quite possible that we'll
switch it to napi down the road.

>  	if (netpoll_rx(skb))
>  		return NET_RX_DROP;
>  
> +	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
> +		return NET_RX_DROP;
> +
>  	net_timestamp_check(netdev_tstamp_prequeue, skb);
>  
>  	trace_netif_rx(skb);

And this chunk means all tun data is immediately copied before
it's passed to net core, in effect, no zero copy transmit.

> @@ -3498,10 +3499,7 @@ ncls:
>  	}
>  
>  	if (pt_prev) {
> -		if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
> -			goto drop;
> -		else
> -			ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
> +		ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
>  	} else {
>  drop:
>  		atomic_long_inc(&skb->dev->rx_dropped);
>

^ permalink raw reply related	[flat|nested] 8+ messages in thread

end of thread, other threads:[~2013-06-27  7:08 UTC | newest]

Thread overview: 8+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2013-06-26  8:23 [RFC] about "net: orphan frags on receive" insanity Eric Dumazet
2013-06-26  9:56 ` Michael S. Tsirkin
2013-06-26 10:12   ` Eric Dumazet
2013-06-26 18:56     ` Michael S. Tsirkin
2013-06-26 19:09       ` Eric Dumazet
2013-06-26 19:22         ` Michael S. Tsirkin
2013-06-26 19:44           ` Eric Dumazet
2013-06-27  7:09             ` Michael S. Tsirkin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).