All of lore.kernel.org
 help / color / mirror / Atom feed
From: "Michael S. Tsirkin" <mst@redhat.com>
To: Eric Dumazet <eric.dumazet@gmail.com>
Cc: netdev <netdev@vger.kernel.org>
Subject: Re: [RFC] about "net: orphan frags on receive" insanity
Date: Thu, 27 Jun 2013 10:09:10 +0300	[thread overview]
Message-ID: <20130627070910.GA5489@redhat.com> (raw)
In-Reply-To: <1372275874.3301.206.camel@edumazet-glaptop>

On Wed, Jun 26, 2013 at 12:44:34PM -0700, Eric Dumazet wrote:
> On Wed, 2013-06-26 at 22:22 +0300, Michael S. Tsirkin wrote:
> 
> > The point is we don't know the final destination of the packet
> > until it's going through the stack.
> > 
> > We don't want to trigger a copy for all data we get from tun:
> > we only want to do this if the data has a chance to get
> > queued somewhere indefinitely.
> 
> I think you missed my point.
> 
> I am pretty sure it should be done from netif_rx(), not from
> __netif_receive_skb_core()
> so that modern NIC devices do not have to pay this extra cost.

Yres, this is exactly what I thought you meant, but as I said,
this approach disables tun zero copy.
The point of the current code is to copy only if there
are any host protocols that consume the skb.

> # size net/core/dev_*.o
>    text	   data	    bss	    dec	    hex	filename
>   41928	    963	    752	  43643	   aa7b	net/core/dev_before.o
>   41579	    963	    752	  43294	   a91e	net/core/dev_after.o

So we have a lot of code like this:
                if (pt_prev) {
                        ret = deliver_skb(skb, pt_prev, orig_dev);
                }
                pt_prev = NULL;
and this is what created all this.

But in practice pt_prev is set when we have multiple
consumers for a packet - this is the standard path:
        if (pt_prev) {
                if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
                        goto drop;
                else
                        ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
        }
So I think calls to deliver_skb
are not all that common - we should not make gcc inline
them so aggressively, let it make its own decisions.

Here's an alternative patch:


diff --git a/net/core/dev.c b/net/core/dev.c
index fc1e289..03cb51c 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1642,9 +1642,9 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb)
 }
 EXPORT_SYMBOL_GPL(dev_forward_skb);
 
-static inline int deliver_skb(struct sk_buff *skb,
-			      struct packet_type *pt_prev,
-			      struct net_device *orig_dev)
+static int deliver_skb(struct sk_buff *skb,
+		       struct packet_type *pt_prev,
+		       struct net_device *orig_dev)
 {
 	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
 		return -ENOMEM;


This gives us more than half the gain of your patch
without breaking tun zero copy.

[mst@robin linux]$ size net/core/dev_orig.o 
   text    data     bss     dec     hex filename
  47357    1270     720   49347    c0c3 net/core/dev_orig.o
[mst@robin linux]$ size net/core/dev.o
   text    data     bss     dec     hex filename
  47105    1270     720   49095    bfc7 net/core/dev.o

Maybe we should tag calls to if (pt_prev) before deliver_skb
as unlikely, this will move this code further from the hot path -
this needs some testing.



> Untested patch :
> 
> diff --git a/net/core/dev.c b/net/core/dev.c
> index fc1e289..3730318 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -1646,8 +1646,6 @@ static inline int deliver_skb(struct sk_buff *skb,
>  			      struct packet_type *pt_prev,
>  			      struct net_device *orig_dev)
>  {
> -	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
> -		return -ENOMEM;
>  	atomic_inc(&skb->users);
>  	return pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
>  }
> @@ -3133,6 +3131,9 @@ int netif_rx(struct sk_buff *skb)

this is basically assuming tun will use netif_rx forever,
but I think it's quite possible that we'll
switch it to napi down the road.

>  	if (netpoll_rx(skb))
>  		return NET_RX_DROP;
>  
> +	if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
> +		return NET_RX_DROP;
> +
>  	net_timestamp_check(netdev_tstamp_prequeue, skb);
>  
>  	trace_netif_rx(skb);

And this chunk means all tun data is immediately copied before
it's passed to net core, in effect, no zero copy transmit.

> @@ -3498,10 +3499,7 @@ ncls:
>  	}
>  
>  	if (pt_prev) {
> -		if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC)))
> -			goto drop;
> -		else
> -			ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
> +		ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev);
>  	} else {
>  drop:
>  		atomic_long_inc(&skb->dev->rx_dropped);
>

      reply	other threads:[~2013-06-27  7:08 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2013-06-26  8:23 [RFC] about "net: orphan frags on receive" insanity Eric Dumazet
2013-06-26  9:56 ` Michael S. Tsirkin
2013-06-26 10:12   ` Eric Dumazet
2013-06-26 18:56     ` Michael S. Tsirkin
2013-06-26 19:09       ` Eric Dumazet
2013-06-26 19:22         ` Michael S. Tsirkin
2013-06-26 19:44           ` Eric Dumazet
2013-06-27  7:09             ` Michael S. Tsirkin [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20130627070910.GA5489@redhat.com \
    --to=mst@redhat.com \
    --cc=eric.dumazet@gmail.com \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.