* [RFC] about "net: orphan frags on receive" insanity
@ 2013-06-26 8:23 Eric Dumazet
2013-06-26 9:56 ` Michael S. Tsirkin
0 siblings, 1 reply; 8+ messages in thread
From: Eric Dumazet @ 2013-06-26 8:23 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: netdev
Michael
Following commit added 400 bytes into __netif_receive_skb_core()
Could we revert it, and fix the problem into the callers instead ?
Thanks
commit 1080e512d44d4f67b8beb8edf25a1bbcb1066dc7
Author: Michael S. Tsirkin <mst@redhat.com>
Date: Fri Jul 20 09:23:17 2012 +0000
net: orphan frags on receive
zero copy packets are normally sent to the outside
network, but bridging, tun etc might loop them
back to host networking stack. If this happens
destructors will never be called, so orphan
the frags immediately on receive.
Signed-off-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
^ permalink raw reply [flat|nested] 8+ messages in thread* Re: [RFC] about "net: orphan frags on receive" insanity 2013-06-26 8:23 [RFC] about "net: orphan frags on receive" insanity Eric Dumazet @ 2013-06-26 9:56 ` Michael S. Tsirkin 2013-06-26 10:12 ` Eric Dumazet 0 siblings, 1 reply; 8+ messages in thread From: Michael S. Tsirkin @ 2013-06-26 9:56 UTC (permalink / raw) To: Eric Dumazet; +Cc: netdev On Wed, Jun 26, 2013 at 01:23:59AM -0700, Eric Dumazet wrote: > Michael > > Following commit added 400 bytes into __netif_receive_skb_core() > > Could we revert it, and fix the problem into the callers instead ? > > Thanks > > > > commit 1080e512d44d4f67b8beb8edf25a1bbcb1066dc7 > Author: Michael S. Tsirkin <mst@redhat.com> > Date: Fri Jul 20 09:23:17 2012 +0000 > > net: orphan frags on receive > > zero copy packets are normally sent to the outside > network, but bridging, tun etc might loop them > back to host networking stack. If this happens > destructors will never be called, so orphan > the frags immediately on receive. > > Signed-off-by: Michael S. Tsirkin <mst@redhat.com> > Signed-off-by: David S. Miller <davem@davemloft.net> Please don't just revert it. Packets can cross e.g. the bridge and end up on the external interface, in which case that NIC guarantees that they will get transmitted eventually, so it's safe to transmit from guest memory, or they can end up in the host stack where they can stay indefinitely, in this case we need to orphan frags so guest networking can keep going. What do you suggest exactly? Also, I'll have to look at the generated binary - when I coded this up, compiler seemed to only put a conditional branch on the fast path, this shouldn't be all that expensive. -- MST ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC] about "net: orphan frags on receive" insanity 2013-06-26 9:56 ` Michael S. Tsirkin @ 2013-06-26 10:12 ` Eric Dumazet 2013-06-26 18:56 ` Michael S. Tsirkin 0 siblings, 1 reply; 8+ messages in thread From: Eric Dumazet @ 2013-06-26 10:12 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: netdev On Wed, 2013-06-26 at 12:56 +0300, Michael S. Tsirkin wrote: > Please don't just revert it. I said "revert and fix the callers" > > Packets can cross e.g. the bridge and end up on the external > interface, in which case that NIC guarantees that > they will get transmitted eventually, so it's safe > to transmit from guest memory, or they can end up > in the host stack where they can stay indefinitely, > in this case we need to orphan frags so guest networking > can keep going. > > What do you suggest exactly? > > Also, I'll have to look at the generated binary - when I > coded this up, compiler seemed to only put a > conditional branch on the fast path, this shouldn't > be all that expensive. Then look again. Its 400 bytes right now. I suggested to call this only from the impacted callers. VM are fine, but if we slow down bare metal to close the gap, or make appealing kernel bypass, that's not very fair. Why normal frames received from real NIC should pay this price, I don't know. The conditional branch might sound not expensive, but the cache line miss on skb_shinfo(skb)->tx_flags is expensive. ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC] about "net: orphan frags on receive" insanity 2013-06-26 10:12 ` Eric Dumazet @ 2013-06-26 18:56 ` Michael S. Tsirkin 2013-06-26 19:09 ` Eric Dumazet 0 siblings, 1 reply; 8+ messages in thread From: Michael S. Tsirkin @ 2013-06-26 18:56 UTC (permalink / raw) To: Eric Dumazet; +Cc: netdev On Wed, Jun 26, 2013 at 03:12:23AM -0700, Eric Dumazet wrote: > On Wed, 2013-06-26 at 12:56 +0300, Michael S. Tsirkin wrote: > > > Please don't just revert it. > > I said "revert and fix the callers" > > > > > Packets can cross e.g. the bridge and end up on the external > > interface, in which case that NIC guarantees that > > they will get transmitted eventually, so it's safe > > to transmit from guest memory, or they can end up > > in the host stack where they can stay indefinitely, > > in this case we need to orphan frags so guest networking > > can keep going. > > > > What do you suggest exactly? > > > > Also, I'll have to look at the generated binary - when I > > coded this up, compiler seemed to only put a > > conditional branch on the fast path, this shouldn't > > be all that expensive. > > Then look again. Its 400 bytes right now. > > I suggested to call this only from the impacted callers. Well we don't want to duplicate the whole RX path to special-case that, right? > VM are fine, but if we slow down bare metal to close the gap, or make > appealing kernel bypass, that's not very fair. > > Why normal frames received from real NIC should pay this price, I don't > know. > > The conditional branch might sound not expensive, but the cache line > miss on skb_shinfo(skb)->tx_flags is expensive. > Okay, for the RX path, I think we can set a separate flag in the skb itself, and branch based on that. Would this address the comment? -- MST ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC] about "net: orphan frags on receive" insanity 2013-06-26 18:56 ` Michael S. Tsirkin @ 2013-06-26 19:09 ` Eric Dumazet 2013-06-26 19:22 ` Michael S. Tsirkin 0 siblings, 1 reply; 8+ messages in thread From: Eric Dumazet @ 2013-06-26 19:09 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: netdev On Wed, 2013-06-26 at 21:56 +0300, Michael S. Tsirkin wrote: > Well we don't want to duplicate the whole RX path > to special-case that, right? Whats wrong using a helper ? I_Should_cleanup_things(skb) { ... // cleanup for these special users trying to reenter stack netif_rx(skb); } ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC] about "net: orphan frags on receive" insanity 2013-06-26 19:09 ` Eric Dumazet @ 2013-06-26 19:22 ` Michael S. Tsirkin 2013-06-26 19:44 ` Eric Dumazet 0 siblings, 1 reply; 8+ messages in thread From: Michael S. Tsirkin @ 2013-06-26 19:22 UTC (permalink / raw) To: Eric Dumazet; +Cc: netdev On Wed, Jun 26, 2013 at 12:09:24PM -0700, Eric Dumazet wrote: > On Wed, 2013-06-26 at 21:56 +0300, Michael S. Tsirkin wrote: > > > Well we don't want to duplicate the whole RX path > > to special-case that, right? > > > Whats wrong using a helper ? > > I_Should_cleanup_things(skb) > { > ... // cleanup for these special users trying to reenter stack > > netif_rx(skb); > } > The point is we don't know the final destination of the packet until it's going through the stack. We don't want to trigger a copy for all data we get from tun: we only want to do this if the data has a chance to get queued somewhere indefinitely. -- MST ^ permalink raw reply [flat|nested] 8+ messages in thread
* Re: [RFC] about "net: orphan frags on receive" insanity 2013-06-26 19:22 ` Michael S. Tsirkin @ 2013-06-26 19:44 ` Eric Dumazet 2013-06-27 7:09 ` Michael S. Tsirkin 0 siblings, 1 reply; 8+ messages in thread From: Eric Dumazet @ 2013-06-26 19:44 UTC (permalink / raw) To: Michael S. Tsirkin; +Cc: netdev On Wed, 2013-06-26 at 22:22 +0300, Michael S. Tsirkin wrote: > The point is we don't know the final destination of the packet > until it's going through the stack. > > We don't want to trigger a copy for all data we get from tun: > we only want to do this if the data has a chance to get > queued somewhere indefinitely. I think you missed my point. I am pretty sure it should be done from netif_rx(), not from __netif_receive_skb_core() so that modern NIC devices do not have to pay this extra cost. # size net/core/dev_*.o text data bss dec hex filename 41928 963 752 43643 aa7b net/core/dev_before.o 41579 963 752 43294 a91e net/core/dev_after.o Untested patch : diff --git a/net/core/dev.c b/net/core/dev.c index fc1e289..3730318 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1646,8 +1646,6 @@ static inline int deliver_skb(struct sk_buff *skb, struct packet_type *pt_prev, struct net_device *orig_dev) { - if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) - return -ENOMEM; atomic_inc(&skb->users); return pt_prev->func(skb, skb->dev, pt_prev, orig_dev); } @@ -3133,6 +3131,9 @@ int netif_rx(struct sk_buff *skb) if (netpoll_rx(skb)) return NET_RX_DROP; + if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) + return NET_RX_DROP; + net_timestamp_check(netdev_tstamp_prequeue, skb); trace_netif_rx(skb); @@ -3498,10 +3499,7 @@ ncls: } if (pt_prev) { - if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) - goto drop; - else - ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev); + ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev); } else { drop: atomic_long_inc(&skb->dev->rx_dropped); ^ permalink raw reply related [flat|nested] 8+ messages in thread
* Re: [RFC] about "net: orphan frags on receive" insanity 2013-06-26 19:44 ` Eric Dumazet @ 2013-06-27 7:09 ` Michael S. Tsirkin 0 siblings, 0 replies; 8+ messages in thread From: Michael S. Tsirkin @ 2013-06-27 7:09 UTC (permalink / raw) To: Eric Dumazet; +Cc: netdev On Wed, Jun 26, 2013 at 12:44:34PM -0700, Eric Dumazet wrote: > On Wed, 2013-06-26 at 22:22 +0300, Michael S. Tsirkin wrote: > > > The point is we don't know the final destination of the packet > > until it's going through the stack. > > > > We don't want to trigger a copy for all data we get from tun: > > we only want to do this if the data has a chance to get > > queued somewhere indefinitely. > > I think you missed my point. > > I am pretty sure it should be done from netif_rx(), not from > __netif_receive_skb_core() > so that modern NIC devices do not have to pay this extra cost. Yres, this is exactly what I thought you meant, but as I said, this approach disables tun zero copy. The point of the current code is to copy only if there are any host protocols that consume the skb. > # size net/core/dev_*.o > text data bss dec hex filename > 41928 963 752 43643 aa7b net/core/dev_before.o > 41579 963 752 43294 a91e net/core/dev_after.o So we have a lot of code like this: if (pt_prev) { ret = deliver_skb(skb, pt_prev, orig_dev); } pt_prev = NULL; and this is what created all this. But in practice pt_prev is set when we have multiple consumers for a packet - this is the standard path: if (pt_prev) { if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) goto drop; else ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev); } So I think calls to deliver_skb are not all that common - we should not make gcc inline them so aggressively, let it make its own decisions. Here's an alternative patch: diff --git a/net/core/dev.c b/net/core/dev.c index fc1e289..03cb51c 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1642,9 +1642,9 @@ int dev_forward_skb(struct net_device *dev, struct sk_buff *skb) } EXPORT_SYMBOL_GPL(dev_forward_skb); -static inline int deliver_skb(struct sk_buff *skb, - struct packet_type *pt_prev, - struct net_device *orig_dev) +static int deliver_skb(struct sk_buff *skb, + struct packet_type *pt_prev, + struct net_device *orig_dev) { if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) return -ENOMEM; This gives us more than half the gain of your patch without breaking tun zero copy. [mst@robin linux]$ size net/core/dev_orig.o text data bss dec hex filename 47357 1270 720 49347 c0c3 net/core/dev_orig.o [mst@robin linux]$ size net/core/dev.o text data bss dec hex filename 47105 1270 720 49095 bfc7 net/core/dev.o Maybe we should tag calls to if (pt_prev) before deliver_skb as unlikely, this will move this code further from the hot path - this needs some testing. > Untested patch : > > diff --git a/net/core/dev.c b/net/core/dev.c > index fc1e289..3730318 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -1646,8 +1646,6 @@ static inline int deliver_skb(struct sk_buff *skb, > struct packet_type *pt_prev, > struct net_device *orig_dev) > { > - if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) > - return -ENOMEM; > atomic_inc(&skb->users); > return pt_prev->func(skb, skb->dev, pt_prev, orig_dev); > } > @@ -3133,6 +3131,9 @@ int netif_rx(struct sk_buff *skb) this is basically assuming tun will use netif_rx forever, but I think it's quite possible that we'll switch it to napi down the road. > if (netpoll_rx(skb)) > return NET_RX_DROP; > > + if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) > + return NET_RX_DROP; > + > net_timestamp_check(netdev_tstamp_prequeue, skb); > > trace_netif_rx(skb); And this chunk means all tun data is immediately copied before it's passed to net core, in effect, no zero copy transmit. > @@ -3498,10 +3499,7 @@ ncls: > } > > if (pt_prev) { > - if (unlikely(skb_orphan_frags(skb, GFP_ATOMIC))) > - goto drop; > - else > - ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev); > + ret = pt_prev->func(skb, skb->dev, pt_prev, orig_dev); > } else { > drop: > atomic_long_inc(&skb->dev->rx_dropped); > ^ permalink raw reply related [flat|nested] 8+ messages in thread
end of thread, other threads:[~2013-06-27 7:08 UTC | newest] Thread overview: 8+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2013-06-26 8:23 [RFC] about "net: orphan frags on receive" insanity Eric Dumazet 2013-06-26 9:56 ` Michael S. Tsirkin 2013-06-26 10:12 ` Eric Dumazet 2013-06-26 18:56 ` Michael S. Tsirkin 2013-06-26 19:09 ` Eric Dumazet 2013-06-26 19:22 ` Michael S. Tsirkin 2013-06-26 19:44 ` Eric Dumazet 2013-06-27 7:09 ` Michael S. Tsirkin
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).