* [RFC] net: release dst entry in dev_queue_xmit()
@ 2009-03-20 11:40 Eric Dumazet
2009-03-20 14:10 ` Neil Horman
` (2 more replies)
0 siblings, 3 replies; 28+ messages in thread
From: Eric Dumazet @ 2009-03-20 11:40 UTC (permalink / raw)
To: David S. Miller; +Cc: Linux Netdev List
One point of contention in high network loads is the dst_release() performed
when a transmited skb is freed. This is because NIC tx completion calls skb free
long after original call to dev_queue_xmit(skb).
CPU cache is cold and the atomic op in dst_release() stalls. On SMP, this is
quite visible if one CPU is 100% handling softirqs for a network device,
since dst_clone() is done by other cpus, involving cache line ping pongs.
I believe we can release dst in dev_queue_xmit(), while cpu cache is hot, since
caller of dev_queue_xmit() had to hold a reference on dst right before. This reduce
work to be done by softirq handler, and decrease cache misses.
I also believe only pktgen can call dev_queue_xmit() with skb which have
a skb->users != 1. But pkthen skbs have a NULL dst entry.
I added a WARN_ON_ONCE() to catch other cases, and not release skb->dst
if skb->users != 1
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
diff --git a/net/core/dev.c b/net/core/dev.c
index f112970..9e0fd01 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -1852,6 +1852,20 @@ gso:
if (q->enqueue) {
spinlock_t *root_lock = qdisc_lock(q);
+ /*
+ * Release dst while its refcnt is hot in CPU cache, instead
+ * of waiting NIC tx completion
+ */
+ if (likely(skb->dst)) {
+ if (!WARN_ON_ONCE(atomic_read(&skb->users) != 1)) {
+ int newrefcnt;
+
+ smp_mb__before_atomic_dec();
+ newrefcnt = atomic_dec_return(&skb->dst->__refcnt);
+ WARN_ON(newrefcnt < 0);
+ skb->dst = NULL;
+ }
+ }
spin_lock(root_lock);
if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
^ permalink raw reply related [flat|nested] 28+ messages in thread* Re: [RFC] net: release dst entry in dev_queue_xmit() 2009-03-20 11:40 [RFC] net: release dst entry in dev_queue_xmit() Eric Dumazet @ 2009-03-20 14:10 ` Neil Horman 2009-03-25 6:43 ` David Miller 2009-05-12 8:12 ` [PATCH] net: release dst entry in dev_hard_start_xmit() Eric Dumazet 2 siblings, 0 replies; 28+ messages in thread From: Neil Horman @ 2009-03-20 14:10 UTC (permalink / raw) To: Eric Dumazet; +Cc: David S. Miller, Linux Netdev List On Fri, Mar 20, 2009 at 12:40:22PM +0100, Eric Dumazet wrote: > One point of contention in high network loads is the dst_release() performed > when a transmited skb is freed. This is because NIC tx completion calls skb free > long after original call to dev_queue_xmit(skb). > > CPU cache is cold and the atomic op in dst_release() stalls. On SMP, this is > quite visible if one CPU is 100% handling softirqs for a network device, > since dst_clone() is done by other cpus, involving cache line ping pongs. > > I believe we can release dst in dev_queue_xmit(), while cpu cache is hot, since > caller of dev_queue_xmit() had to hold a reference on dst right before. This reduce > work to be done by softirq handler, and decrease cache misses. > > I also believe only pktgen can call dev_queue_xmit() with skb which have > a skb->users != 1. But pkthen skbs have a NULL dst entry. > > I added a WARN_ON_ONCE() to catch other cases, and not release skb->dst > if skb->users != 1 > > > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > > > diff --git a/net/core/dev.c b/net/core/dev.c > index f112970..9e0fd01 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -1852,6 +1852,20 @@ gso: > if (q->enqueue) { > spinlock_t *root_lock = qdisc_lock(q); > > + /* > + * Release dst while its refcnt is hot in CPU cache, instead > + * of waiting NIC tx completion > + */ > + if (likely(skb->dst)) { > + if (!WARN_ON_ONCE(atomic_read(&skb->users) != 1)) { > + int newrefcnt; > + > + smp_mb__before_atomic_dec(); > + newrefcnt = atomic_dec_return(&skb->dst->__refcnt); > + WARN_ON(newrefcnt < 0); > + skb->dst = NULL; > + } > + } > spin_lock(root_lock); > > if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) { > -- > To unsubscribe from this list: send the line "unsubscribe netdev" in > the body of a message to majordomo@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html > I think this seems like a pretty good idea. I thought for a moment that some stacked interfaces (bonds, vlan devices), might have a problem with this, since they tend to pass through dev_queue_xmit twice, but I can't see a problem with either one of those cases either, since neither of thier xmit routines makes any use of the dst pointer. I'd say include it Acked-by: Neil Horman <nhorman@tuxdriver.com> ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] net: release dst entry in dev_queue_xmit() 2009-03-20 11:40 [RFC] net: release dst entry in dev_queue_xmit() Eric Dumazet 2009-03-20 14:10 ` Neil Horman @ 2009-03-25 6:43 ` David Miller 2009-03-25 7:13 ` Eric Dumazet 2009-05-12 8:12 ` [PATCH] net: release dst entry in dev_hard_start_xmit() Eric Dumazet 2 siblings, 1 reply; 28+ messages in thread From: David Miller @ 2009-03-25 6:43 UTC (permalink / raw) To: dada1; +Cc: netdev From: Eric Dumazet <dada1@cosmosbay.com> Date: Fri, 20 Mar 2009 12:40:22 +0100 > I believe we can release dst in dev_queue_xmit(), while cpu cache is > hot, since caller of dev_queue_xmit() had to hold a reference on dst > right before. This reduce work to be done by softirq handler, and > decrease cache misses. > This will break various packet schedulers and classifiers. Heck sch_sfq.c even uses skb->dst as part of it's flow hash function :-) ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] net: release dst entry in dev_queue_xmit() 2009-03-25 6:43 ` David Miller @ 2009-03-25 7:13 ` Eric Dumazet 2009-03-25 7:17 ` David Miller 0 siblings, 1 reply; 28+ messages in thread From: Eric Dumazet @ 2009-03-25 7:13 UTC (permalink / raw) To: David Miller; +Cc: netdev David Miller a écrit : > From: Eric Dumazet <dada1@cosmosbay.com> > Date: Fri, 20 Mar 2009 12:40:22 +0100 > >> I believe we can release dst in dev_queue_xmit(), while cpu cache is >> hot, since caller of dev_queue_xmit() had to hold a reference on dst >> right before. This reduce work to be done by softirq handler, and >> decrease cache misses. >> > > This will break various packet schedulers and classifiers. > > Heck sch_sfq.c even uses skb->dst as part of it's flow hash > function :-) Well, as one of the hash perturbator, for other protocols than IPV4 & IPV6... default: h = (unsigned long)skb->dst ^ skb->protocol; h2 = (unsigned long)skb->sk; } return sfq_fold_hash(q, h, h2); But teql indeed mandates dst in __teql_resolve() Darn... This dst freeing should be done very late then, in the NIC driver itself, just before giving skb to hardware, or right before in dev_hard_start_xmit() If done in dev_hard_start_xmit(), skb could be requeued (because of NETDEV_TX_BUSY). Then if requeued, maybe at this time, dst being NULL is not a problem ? Thanks a lot David ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] net: release dst entry in dev_queue_xmit() 2009-03-25 7:13 ` Eric Dumazet @ 2009-03-25 7:17 ` David Miller 2009-03-25 18:22 ` Jarek Poplawski 0 siblings, 1 reply; 28+ messages in thread From: David Miller @ 2009-03-25 7:17 UTC (permalink / raw) To: dada1; +Cc: netdev From: Eric Dumazet <dada1@cosmosbay.com> Date: Wed, 25 Mar 2009 08:13:30 +0100 > If done in dev_hard_start_xmit(), skb could be requeued (because of > NETDEV_TX_BUSY). Then if requeued, maybe at this time, dst being > NULL is not a problem ? Usually it should be OK because the packet schedulers have a sort-of one-behind state that allows them to reinsert the SKB into their queue datastructures without reclassifying. But I'm not %100 sure there isn't some case that might still need skb->dst there. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] net: release dst entry in dev_queue_xmit() 2009-03-25 7:17 ` David Miller @ 2009-03-25 18:22 ` Jarek Poplawski 2009-03-25 18:41 ` Eric Dumazet 0 siblings, 1 reply; 28+ messages in thread From: Jarek Poplawski @ 2009-03-25 18:22 UTC (permalink / raw) To: David Miller; +Cc: dada1, netdev David Miller wrote, On 03/25/2009 08:17 AM: > From: Eric Dumazet <dada1@cosmosbay.com> > Date: Wed, 25 Mar 2009 08:13:30 +0100 > >> If done in dev_hard_start_xmit(), skb could be requeued (because of >> NETDEV_TX_BUSY). Then if requeued, maybe at this time, dst being >> NULL is not a problem ? > > Usually it should be OK because the packet schedulers have > a sort-of one-behind state that allows them to reinsert > the SKB into their queue datastructures without reclassifying. Actually, since David has dumped requeuing there is no reinserting; this last one "requeued" skb is buffered at the top in q->gso_skb and waiting for better times. Jarek P. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] net: release dst entry in dev_queue_xmit() 2009-03-25 18:22 ` Jarek Poplawski @ 2009-03-25 18:41 ` Eric Dumazet 2009-03-25 19:18 ` Jarek Poplawski 0 siblings, 1 reply; 28+ messages in thread From: Eric Dumazet @ 2009-03-25 18:41 UTC (permalink / raw) To: Jarek Poplawski; +Cc: David Miller, netdev Jarek Poplawski a écrit : > David Miller wrote, On 03/25/2009 08:17 AM: > >> From: Eric Dumazet <dada1@cosmosbay.com> >> Date: Wed, 25 Mar 2009 08:13:30 +0100 >> >>> If done in dev_hard_start_xmit(), skb could be requeued (because of >>> NETDEV_TX_BUSY). Then if requeued, maybe at this time, dst being >>> NULL is not a problem ? >> Usually it should be OK because the packet schedulers have >> a sort-of one-behind state that allows them to reinsert >> the SKB into their queue datastructures without reclassifying. > > > Actually, since David has dumped requeuing there is no reinserting; > this last one "requeued" skb is buffered at the top in q->gso_skb > and waiting for better times. Yes indeed, this is what I thought too, thanks Jarek. I tested following patch today on my machine, but obviously could not try all possible quirks :) [PATCH] net: release dst entry in dev_hard_start_xmit() One point of contention in high network loads is the dst_release() performed when a transmited skb is freed. This is because NIC tx completion calls skb free long after original call to dev_queue_xmit(skb). CPU cache is cold and the atomic op in dst_release() stalls. On SMP, this is quite visible if one CPU is 100% handling softirqs for a network device, since dst_clone() is done by other cpus, involving cache line ping pongs. I believe we can release dst in dev_hard_start_xmit(), while cpu cache is hot, since caller of dev_queue_xmit() had to hold a reference on dst right before. This reduce work to be done by softirq handler, and decrease cache misses. I also believe only pktgen can call dev_queue_xmit() with skb which have a skb->users != 1. But pkthen skbs have a NULL dst entry. I added a WARN_ON_ONCE() to catch other cases, and not release skb->dst if skb->users != 1 Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> diff --git a/net/core/dev.c b/net/core/dev.c index e3fe5c7..a622db6 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1664,6 +1664,26 @@ static int dev_gso_segment(struct sk_buff *skb) return 0; } + +/* + * Release dst while its refcnt is likely hot in CPU cache, instead + * of waiting NIC tx completion. + * We inline dst_release() code for performance reason + */ +static void release_skb_dst(struct sk_buff *skb) +{ + if (likely(skb->dst)) { + if (!WARN_ON_ONCE(atomic_read(&skb->users) != 1)) { + int newrefcnt; + + smp_mb__before_atomic_dec(); + newrefcnt = atomic_dec_return(&skb->dst->__refcnt); + WARN_ON(newrefcnt < 0); + skb->dst = NULL; + } + } +} + int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, struct netdev_queue *txq) { @@ -1681,6 +1701,7 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, goto gso; } + release_skb_dst(skb); return ops->ndo_start_xmit(skb, dev); } @@ -1691,6 +1712,7 @@ gso: skb->next = nskb->next; nskb->next = NULL; + release_skb_dst(nskb); rc = ops->ndo_start_xmit(nskb, dev); if (unlikely(rc)) { nskb->next = skb->next; ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [RFC] net: release dst entry in dev_queue_xmit() 2009-03-25 18:41 ` Eric Dumazet @ 2009-03-25 19:18 ` Jarek Poplawski 2009-03-25 19:40 ` Eric Dumazet 0 siblings, 1 reply; 28+ messages in thread From: Jarek Poplawski @ 2009-03-25 19:18 UTC (permalink / raw) To: Eric Dumazet; +Cc: David Miller, netdev On Wed, Mar 25, 2009 at 07:41:27PM +0100, Eric Dumazet wrote: > Jarek Poplawski a écrit : > > David Miller wrote, On 03/25/2009 08:17 AM: > > > >> From: Eric Dumazet <dada1@cosmosbay.com> > >> Date: Wed, 25 Mar 2009 08:13:30 +0100 > >> > >>> If done in dev_hard_start_xmit(), skb could be requeued (because of > >>> NETDEV_TX_BUSY). Then if requeued, maybe at this time, dst being > >>> NULL is not a problem ? > >> Usually it should be OK because the packet schedulers have > >> a sort-of one-behind state that allows them to reinsert > >> the SKB into their queue datastructures without reclassifying. > > > > > > Actually, since David has dumped requeuing there is no reinserting; > > this last one "requeued" skb is buffered at the top in q->gso_skb > > and waiting for better times. > > Yes indeed, this is what I thought too, thanks Jarek. Alas I'm a bit concerned with virtual devs, e.g. now I'm looking at xmits in macvlan and pppoe. Maybe this patch should exclude them? Jarek P. > > I tested following patch today on my machine, but obviously could not try > all possible quirks :) > > [PATCH] net: release dst entry in dev_hard_start_xmit() > > One point of contention in high network loads is the dst_release() performed > when a transmited skb is freed. This is because NIC tx completion calls skb free > long after original call to dev_queue_xmit(skb). > > CPU cache is cold and the atomic op in dst_release() stalls. On SMP, this is > quite visible if one CPU is 100% handling softirqs for a network device, > since dst_clone() is done by other cpus, involving cache line ping pongs. > > I believe we can release dst in dev_hard_start_xmit(), while cpu cache is hot, since > caller of dev_queue_xmit() had to hold a reference on dst right before. This reduce > work to be done by softirq handler, and decrease cache misses. > > I also believe only pktgen can call dev_queue_xmit() with skb which have > a skb->users != 1. But pkthen skbs have a NULL dst entry. > > I added a WARN_ON_ONCE() to catch other cases, and not release skb->dst > if skb->users != 1 > > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > > diff --git a/net/core/dev.c b/net/core/dev.c > index e3fe5c7..a622db6 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -1664,6 +1664,26 @@ static int dev_gso_segment(struct sk_buff *skb) > return 0; > } > > + > +/* > + * Release dst while its refcnt is likely hot in CPU cache, instead > + * of waiting NIC tx completion. > + * We inline dst_release() code for performance reason > + */ > +static void release_skb_dst(struct sk_buff *skb) > +{ > + if (likely(skb->dst)) { > + if (!WARN_ON_ONCE(atomic_read(&skb->users) != 1)) { > + int newrefcnt; > + > + smp_mb__before_atomic_dec(); > + newrefcnt = atomic_dec_return(&skb->dst->__refcnt); > + WARN_ON(newrefcnt < 0); > + skb->dst = NULL; > + } > + } > +} > + > int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, > struct netdev_queue *txq) > { > @@ -1681,6 +1701,7 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, > goto gso; > } > > + release_skb_dst(skb); > return ops->ndo_start_xmit(skb, dev); > } > > @@ -1691,6 +1712,7 @@ gso: > > skb->next = nskb->next; > nskb->next = NULL; > + release_skb_dst(nskb); > rc = ops->ndo_start_xmit(nskb, dev); > if (unlikely(rc)) { > nskb->next = skb->next; > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] net: release dst entry in dev_queue_xmit() 2009-03-25 19:18 ` Jarek Poplawski @ 2009-03-25 19:40 ` Eric Dumazet 2009-03-25 19:54 ` Jarek Poplawski 0 siblings, 1 reply; 28+ messages in thread From: Eric Dumazet @ 2009-03-25 19:40 UTC (permalink / raw) To: Jarek Poplawski; +Cc: David Miller, netdev Jarek Poplawski a écrit : > On Wed, Mar 25, 2009 at 07:41:27PM +0100, Eric Dumazet wrote: >> Jarek Poplawski a écrit : >>> David Miller wrote, On 03/25/2009 08:17 AM: >>> >>>> From: Eric Dumazet <dada1@cosmosbay.com> >>>> Date: Wed, 25 Mar 2009 08:13:30 +0100 >>>> >>>>> If done in dev_hard_start_xmit(), skb could be requeued (because of >>>>> NETDEV_TX_BUSY). Then if requeued, maybe at this time, dst being >>>>> NULL is not a problem ? >>>> Usually it should be OK because the packet schedulers have >>>> a sort-of one-behind state that allows them to reinsert >>>> the SKB into their queue datastructures without reclassifying. >>> >>> Actually, since David has dumped requeuing there is no reinserting; >>> this last one "requeued" skb is buffered at the top in q->gso_skb >>> and waiting for better times. >> Yes indeed, this is what I thought too, thanks Jarek. > > Alas I'm a bit concerned with virtual devs, e.g. now I'm looking at > xmits in macvlan and pppoe. Maybe this patch should exclude them? > Yes, MACVLAN :) its macvlan_start_xmit() function calls dev_queue_xmit(skb) again, so we go back to packet schedulers and classifiers, they might need dst again :( Only other potential problem I found was in drivers/net/appletalk/ipddp.c static int ipddp_xmit(struct sk_buff *skb, struct net_device *dev) { __be32 paddr = ((struct rtable*)skb->dst)->rt_gateway; Not sure this driver is still supported, or if this paddr can be found elsewhere... __sk_dst_get(skb->sk) ??? ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] net: release dst entry in dev_queue_xmit() 2009-03-25 19:40 ` Eric Dumazet @ 2009-03-25 19:54 ` Jarek Poplawski 2009-03-25 20:28 ` Eric Dumazet 0 siblings, 1 reply; 28+ messages in thread From: Jarek Poplawski @ 2009-03-25 19:54 UTC (permalink / raw) To: Eric Dumazet; +Cc: David Miller, netdev On Wed, Mar 25, 2009 at 08:40:05PM +0100, Eric Dumazet wrote: > Jarek Poplawski a écrit : ... > > Alas I'm a bit concerned with virtual devs, e.g. now I'm looking at > > xmits in macvlan and pppoe. Maybe this patch should exclude them? > > > > Yes, MACVLAN :) its macvlan_start_xmit() function calls > dev_queue_xmit(skb) again, so we go back to packet schedulers and > classifiers, they might need dst again :( > > Only other potential problem I found was in > drivers/net/appletalk/ipddp.c > > static int ipddp_xmit(struct sk_buff *skb, struct net_device *dev) > { > __be32 paddr = ((struct rtable*)skb->dst)->rt_gateway; > > Not sure this driver is still supported, or if this paddr can be found elsewhere... > __sk_dst_get(skb->sk) ??? > I guess you've considered the loopback too... Jarek P. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] net: release dst entry in dev_queue_xmit() 2009-03-25 19:54 ` Jarek Poplawski @ 2009-03-25 20:28 ` Eric Dumazet 2009-03-25 21:12 ` Jarek Poplawski 0 siblings, 1 reply; 28+ messages in thread From: Eric Dumazet @ 2009-03-25 20:28 UTC (permalink / raw) To: Jarek Poplawski; +Cc: David Miller, netdev Jarek Poplawski a écrit : > I guess you've considered the loopback too... > I had for the first patch (carefuly avoiding loopback being not responsive) : I was releasing dst only if enqueue() was called. You are right, loopback doesnt work anymore with last patch, and I have no idea why... Do you know why ? Thank you ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] net: release dst entry in dev_queue_xmit() 2009-03-25 20:28 ` Eric Dumazet @ 2009-03-25 21:12 ` Jarek Poplawski 2009-03-25 21:20 ` Patrick McHardy 0 siblings, 1 reply; 28+ messages in thread From: Jarek Poplawski @ 2009-03-25 21:12 UTC (permalink / raw) To: Eric Dumazet; +Cc: David Miller, netdev On Wed, Mar 25, 2009 at 09:28:37PM +0100, Eric Dumazet wrote: > Jarek Poplawski a écrit : > > > I guess you've considered the loopback too... > > > > I had for the first patch (carefuly avoiding loopback being not responsive) : > I was releasing dst only if enqueue() was called. > > You are right, loopback doesnt work anymore with last patch, > and I have no idea why... > > Do you know why ? Hmm.. isn't this test for dst == NULL in ip_rcv_finish expected to be negative for loopback source? Jarek P. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [RFC] net: release dst entry in dev_queue_xmit() 2009-03-25 21:12 ` Jarek Poplawski @ 2009-03-25 21:20 ` Patrick McHardy 0 siblings, 0 replies; 28+ messages in thread From: Patrick McHardy @ 2009-03-25 21:20 UTC (permalink / raw) To: Jarek Poplawski; +Cc: Eric Dumazet, David Miller, netdev Jarek Poplawski wrote: > On Wed, Mar 25, 2009 at 09:28:37PM +0100, Eric Dumazet wrote: > >> Jarek Poplawski a écrit : >> >> >>> I guess you've considered the loopback too... >>> >>> >> I had for the first patch (carefuly avoiding loopback being not responsive) : >> I was releasing dst only if enqueue() was called. >> >> You are right, loopback doesnt work anymore with last patch, >> and I have no idea why... >> >> Do you know why ? >> > > Hmm.. isn't this test for dst == NULL in ip_rcv_finish expected to be > negative for loopback source? ip_route_input() doesn't accept loopback addresses, so loopback packets already need to have a dst_entry attached. ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH] net: release dst entry in dev_hard_start_xmit() 2009-03-20 11:40 [RFC] net: release dst entry in dev_queue_xmit() Eric Dumazet 2009-03-20 14:10 ` Neil Horman 2009-03-25 6:43 ` David Miller @ 2009-05-12 8:12 ` Eric Dumazet 2009-05-12 8:21 ` Eric Dumazet 2 siblings, 1 reply; 28+ messages in thread From: Eric Dumazet @ 2009-05-12 8:12 UTC (permalink / raw) To: David S. Miller; +Cc: Linux Netdev List, Jarek Poplawski, Patrick McHardy One point of contention in high network loads is the dst_release() performed when a transmited skb is freed. This is because NIC tx completion calls dev_kree_skb() long after original call to dev_queue_xmit(skb). CPU cache is cold and the atomic op in dst_release() stalls. On SMP, this is quite visible if one CPU is 100% handling softirqs for a network device, since dst_clone() is done by other cpus, involving cache line ping pongs. It seems right place to release dst is in dev_hard_start_xmit(), for most devices but ones that are virtual, and some exceptions. David Miller suggested to define a new device flag, set in alloc_netdev_mq() (so that most devices set it at init time), and carefuly unset in devices which dont want a NULL skb->dst in their ndo_start_xmit(). List of devices that must clear this flag is : - loopback device, because it calls netif_rx() and quoting Patrick : "ip_route_input() doesn't accept loopback addresses, so loopback packets already need to have a dst_entry attached." - appletalk/ipddp.c : needs skb->dst in its xmit function - And all devices that call again dev_queue_xmit() from their xmit function (as some classifiers need skb->dst) : bonding, macvlan, eql, ifb, hdlc_fr Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- drivers/net/appletalk/ipddp.c | 1 + drivers/net/bonding/bond_main.c | 1 + drivers/net/eql.c | 1 + drivers/net/ifb.c | 1 + drivers/net/loopback.c | 1 + drivers/net/macvlan.c | 1 + drivers/net/wan/hdlc_fr.c | 1 + include/linux/if.h | 3 +++ net/core/dev.c | 9 +++++++++ 9 files changed, 19 insertions(+) diff --git a/drivers/net/appletalk/ipddp.c b/drivers/net/appletalk/ipddp.c index da64ba8..0268561 100644 --- a/drivers/net/appletalk/ipddp.c +++ b/drivers/net/appletalk/ipddp.c @@ -71,6 +71,7 @@ static struct net_device * __init ipddp_init(void) if (!dev) return ERR_PTR(-ENOMEM); + dev->priv_flags &= IFF_XMIT_DST_RELEASE; strcpy(dev->name, "ipddp%d"); if (version_printed++ == 0) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 815191d..a29f421 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -5148,6 +5148,7 @@ int bond_create(char *name, struct bond_params *params) goto out_rtnl; } + bond_dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; if (!name) { res = dev_alloc_name(bond_dev, "bond%d"); if (res < 0) diff --git a/drivers/net/eql.c b/drivers/net/eql.c index 5210bb1..19b7dd9 100644 --- a/drivers/net/eql.c +++ b/drivers/net/eql.c @@ -194,6 +194,7 @@ static void __init eql_setup(struct net_device *dev) dev->type = ARPHRD_SLIP; dev->tx_queue_len = 5; /* Hands them off fast */ + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; } static int eql_open(struct net_device *dev) diff --git a/drivers/net/ifb.c b/drivers/net/ifb.c index 60a2630..96713ef 100644 --- a/drivers/net/ifb.c +++ b/drivers/net/ifb.c @@ -156,6 +156,7 @@ static void ifb_setup(struct net_device *dev) dev->flags |= IFF_NOARP; dev->flags &= ~IFF_MULTICAST; + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; random_ether_addr(dev->dev_addr); } diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c index 6f71157..da472c6 100644 --- a/drivers/net/loopback.c +++ b/drivers/net/loopback.c @@ -170,6 +170,7 @@ static void loopback_setup(struct net_device *dev) dev->tx_queue_len = 0; dev->type = ARPHRD_LOOPBACK; /* 0x0001*/ dev->flags = IFF_LOOPBACK; + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; dev->features = NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_TSO | NETIF_F_NO_CSUM diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c index 329cd50..d5334b4 100644 --- a/drivers/net/macvlan.c +++ b/drivers/net/macvlan.c @@ -414,6 +414,7 @@ static void macvlan_setup(struct net_device *dev) { ether_setup(dev); + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; dev->netdev_ops = &macvlan_netdev_ops; dev->destructor = free_netdev; dev->header_ops = &macvlan_hard_header_ops, diff --git a/drivers/net/wan/hdlc_fr.c b/drivers/net/wan/hdlc_fr.c index 8005301..bfa0161 100644 --- a/drivers/net/wan/hdlc_fr.c +++ b/drivers/net/wan/hdlc_fr.c @@ -1054,6 +1054,7 @@ static void pvc_setup(struct net_device *dev) dev->flags = IFF_POINTOPOINT; dev->hard_header_len = 10; dev->addr_len = 2; + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; } static const struct net_device_ops pvc_ops = { diff --git a/include/linux/if.h b/include/linux/if.h index 1108f3e..b9a6229 100644 --- a/include/linux/if.h +++ b/include/linux/if.h @@ -67,6 +67,9 @@ #define IFF_ISATAP 0x80 /* ISATAP interface (RFC4214) */ #define IFF_MASTER_ARPMON 0x100 /* bonding master, ARP mon in use */ #define IFF_WAN_HDLC 0x200 /* WAN HDLC device */ +#define IFF_XMIT_DST_RELEASE 0x400 /* dev_hard_start_xmit() is allowed to + * release skb->dst + */ #define IF_GET_IFACE 0x0001 /* for querying only */ #define IF_GET_PROTO 0x0002 diff --git a/net/core/dev.c b/net/core/dev.c index 14dd725..b9aef53 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1688,6 +1688,14 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, goto gso; } + /* + * If device doesnt need skb->dst, release it right now while + * its hot in this cpu cache + */ + if ((dev->priv_flags & IFF_XMIT_DST_RELEASE) && skb->dst) { + dst_release(skb->dst); + skb->dst = NULL; + } rc = ops->ndo_start_xmit(skb, dev); /* * TODO: if skb_orphan() was called by @@ -5028,6 +5036,7 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name, netdev_init_queues(dev); INIT_LIST_HEAD(&dev->napi_list); + dev->priv_flags = IFF_XMIT_DST_RELEASE; setup(dev); strcpy(dev->name, name); return dev; ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH] net: release dst entry in dev_hard_start_xmit() 2009-05-12 8:12 ` [PATCH] net: release dst entry in dev_hard_start_xmit() Eric Dumazet @ 2009-05-12 8:21 ` Eric Dumazet 2009-05-12 19:26 ` [PATCH, v2] " Eric Dumazet 2009-05-12 19:27 ` [PATCH] " Jarek Poplawski 0 siblings, 2 replies; 28+ messages in thread From: Eric Dumazet @ 2009-05-12 8:21 UTC (permalink / raw) To: David S. Miller; +Cc: Linux Netdev List, Jarek Poplawski, Patrick McHardy Eric Dumazet a écrit : > One point of contention in high network loads is the dst_release() performed > when a transmited skb is freed. This is because NIC tx completion calls > dev_kree_skb() long after original call to dev_queue_xmit(skb). > > CPU cache is cold and the atomic op in dst_release() stalls. On SMP, this is > quite visible if one CPU is 100% handling softirqs for a network device, > since dst_clone() is done by other cpus, involving cache line ping pongs. > > It seems right place to release dst is in dev_hard_start_xmit(), for most > devices but ones that are virtual, and some exceptions. > > David Miller suggested to define a new device flag, set in alloc_netdev_mq() > (so that most devices set it at init time), and carefuly unset in devices > which dont want a NULL skb->dst in their ndo_start_xmit(). > > List of devices that must clear this flag is : > > - loopback device, because it calls netif_rx() and quoting Patrick : > "ip_route_input() doesn't accept loopback addresses, so loopback packets > already need to have a dst_entry attached." > - appletalk/ipddp.c : needs skb->dst in its xmit function > > - And all devices that call again dev_queue_xmit() from their xmit function > (as some classifiers need skb->dst) : bonding, macvlan, eql, ifb, hdlc_fr > > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > --- > drivers/net/appletalk/ipddp.c | 1 + > drivers/net/bonding/bond_main.c | 1 + > drivers/net/eql.c | 1 + > drivers/net/ifb.c | 1 + > drivers/net/loopback.c | 1 + > drivers/net/macvlan.c | 1 + > drivers/net/wan/hdlc_fr.c | 1 + > include/linux/if.h | 3 +++ > net/core/dev.c | 9 +++++++++ > 9 files changed, 19 insertions(+) > > diff --git a/drivers/net/appletalk/ipddp.c b/drivers/net/appletalk/ipddp.c > index da64ba8..0268561 100644 > --- a/drivers/net/appletalk/ipddp.c > +++ b/drivers/net/appletalk/ipddp.c > @@ -71,6 +71,7 @@ static struct net_device * __init ipddp_init(void) > if (!dev) > return ERR_PTR(-ENOMEM); > > + dev->priv_flags &= IFF_XMIT_DST_RELEASE; Oops, I forgot the ~ here, sorry > strcpy(dev->name, "ipddp%d"); > > if (version_printed++ == 0) > diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c > index 815191d..a29f421 100644 > --- a/drivers/net/bonding/bond_main.c > +++ b/drivers/net/bonding/bond_main.c > @@ -5148,6 +5148,7 @@ int bond_create(char *name, struct bond_params *params) > goto out_rtnl; > } > > + bond_dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; > if (!name) { > res = dev_alloc_name(bond_dev, "bond%d"); > if (res < 0) > diff --git a/drivers/net/eql.c b/drivers/net/eql.c > index 5210bb1..19b7dd9 100644 > --- a/drivers/net/eql.c > +++ b/drivers/net/eql.c > @@ -194,6 +194,7 @@ static void __init eql_setup(struct net_device *dev) > > dev->type = ARPHRD_SLIP; > dev->tx_queue_len = 5; /* Hands them off fast */ > + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; > } > > static int eql_open(struct net_device *dev) > diff --git a/drivers/net/ifb.c b/drivers/net/ifb.c > index 60a2630..96713ef 100644 > --- a/drivers/net/ifb.c > +++ b/drivers/net/ifb.c > @@ -156,6 +156,7 @@ static void ifb_setup(struct net_device *dev) > > dev->flags |= IFF_NOARP; > dev->flags &= ~IFF_MULTICAST; > + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; > random_ether_addr(dev->dev_addr); > } > > diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c > index 6f71157..da472c6 100644 > --- a/drivers/net/loopback.c > +++ b/drivers/net/loopback.c > @@ -170,6 +170,7 @@ static void loopback_setup(struct net_device *dev) > dev->tx_queue_len = 0; > dev->type = ARPHRD_LOOPBACK; /* 0x0001*/ > dev->flags = IFF_LOOPBACK; > + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; > dev->features = NETIF_F_SG | NETIF_F_FRAGLIST > | NETIF_F_TSO > | NETIF_F_NO_CSUM > diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c > index 329cd50..d5334b4 100644 > --- a/drivers/net/macvlan.c > +++ b/drivers/net/macvlan.c > @@ -414,6 +414,7 @@ static void macvlan_setup(struct net_device *dev) > { > ether_setup(dev); > > + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; > dev->netdev_ops = &macvlan_netdev_ops; > dev->destructor = free_netdev; > dev->header_ops = &macvlan_hard_header_ops, > diff --git a/drivers/net/wan/hdlc_fr.c b/drivers/net/wan/hdlc_fr.c > index 8005301..bfa0161 100644 > --- a/drivers/net/wan/hdlc_fr.c > +++ b/drivers/net/wan/hdlc_fr.c > @@ -1054,6 +1054,7 @@ static void pvc_setup(struct net_device *dev) > dev->flags = IFF_POINTOPOINT; > dev->hard_header_len = 10; > dev->addr_len = 2; > + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; > } > > static const struct net_device_ops pvc_ops = { > diff --git a/include/linux/if.h b/include/linux/if.h > index 1108f3e..b9a6229 100644 > --- a/include/linux/if.h > +++ b/include/linux/if.h > @@ -67,6 +67,9 @@ > #define IFF_ISATAP 0x80 /* ISATAP interface (RFC4214) */ > #define IFF_MASTER_ARPMON 0x100 /* bonding master, ARP mon in use */ > #define IFF_WAN_HDLC 0x200 /* WAN HDLC device */ > +#define IFF_XMIT_DST_RELEASE 0x400 /* dev_hard_start_xmit() is allowed to > + * release skb->dst > + */ > > #define IF_GET_IFACE 0x0001 /* for querying only */ > #define IF_GET_PROTO 0x0002 > diff --git a/net/core/dev.c b/net/core/dev.c > index 14dd725..b9aef53 100644 > --- a/net/core/dev.c > +++ b/net/core/dev.c > @@ -1688,6 +1688,14 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, > goto gso; > } > > + /* > + * If device doesnt need skb->dst, release it right now while > + * its hot in this cpu cache > + */ > + if ((dev->priv_flags & IFF_XMIT_DST_RELEASE) && skb->dst) { > + dst_release(skb->dst); > + skb->dst = NULL; > + } > rc = ops->ndo_start_xmit(skb, dev); > /* > * TODO: if skb_orphan() was called by > @@ -5028,6 +5036,7 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name, > netdev_init_queues(dev); > > INIT_LIST_HEAD(&dev->napi_list); > + dev->priv_flags = IFF_XMIT_DST_RELEASE; > setup(dev); > strcpy(dev->name, name); > return dev; > ^ permalink raw reply [flat|nested] 28+ messages in thread
* [PATCH, v2] net: release dst entry in dev_hard_start_xmit() 2009-05-12 8:21 ` Eric Dumazet @ 2009-05-12 19:26 ` Eric Dumazet 2009-05-19 5:19 ` David Miller 2009-05-12 19:27 ` [PATCH] " Jarek Poplawski 1 sibling, 1 reply; 28+ messages in thread From: Eric Dumazet @ 2009-05-12 19:26 UTC (permalink / raw) To: David S. Miller; +Cc: Linux Netdev List, Jarek Poplawski, Patrick McHardy Eric Dumazet a écrit : > Eric Dumazet a écrit : >> One point of contention in high network loads is the dst_release() performed >> when a transmited skb is freed. This is because NIC tx completion calls >> dev_kree_skb() long after original call to dev_queue_xmit(skb). >> >> CPU cache is cold and the atomic op in dst_release() stalls. On SMP, this is >> quite visible if one CPU is 100% handling softirqs for a network device, >> since dst_clone() is done by other cpus, involving cache line ping pongs. >> >> It seems right place to release dst is in dev_hard_start_xmit(), for most >> devices but ones that are virtual, and some exceptions. >> >> David Miller suggested to define a new device flag, set in alloc_netdev_mq() >> (so that most devices set it at init time), and carefuly unset in devices >> which dont want a NULL skb->dst in their ndo_start_xmit(). >> >> List of devices that must clear this flag is : >> >> - loopback device, because it calls netif_rx() and quoting Patrick : >> "ip_route_input() doesn't accept loopback addresses, so loopback packets >> already need to have a dst_entry attached." >> - appletalk/ipddp.c : needs skb->dst in its xmit function >> >> - And all devices that call again dev_queue_xmit() from their xmit function >> (as some classifiers need skb->dst) : bonding, macvlan, eql, ifb, hdlc_fr >> >> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> >> --- >> drivers/net/appletalk/ipddp.c | 1 + >> drivers/net/bonding/bond_main.c | 1 + >> drivers/net/eql.c | 1 + >> drivers/net/ifb.c | 1 + >> drivers/net/loopback.c | 1 + >> drivers/net/macvlan.c | 1 + >> drivers/net/wan/hdlc_fr.c | 1 + >> include/linux/if.h | 3 +++ >> net/core/dev.c | 9 +++++++++ >> 9 files changed, 19 insertions(+) >> >> diff --git a/drivers/net/appletalk/ipddp.c b/drivers/net/appletalk/ipddp.c >> index da64ba8..0268561 100644 >> --- a/drivers/net/appletalk/ipddp.c >> +++ b/drivers/net/appletalk/ipddp.c >> @@ -71,6 +71,7 @@ static struct net_device * __init ipddp_init(void) >> if (!dev) >> return ERR_PTR(-ENOMEM); >> >> + dev->priv_flags &= IFF_XMIT_DST_RELEASE; > > > Oops, I forgot the ~ here, sorry Here is a second version, including change to vlan code as well. Thank you [PATCH] net: release dst entry in dev_hard_start_xmit() One point of contention in high network loads is the dst_release() performed when a transmited skb is freed. This is because NIC tx completion calls dev_kree_skb() long after original call to dev_queue_xmit(skb). CPU cache is cold and the atomic op in dst_release() stalls. On SMP, this is quite visible if one CPU is 100% handling softirqs for a network device, since dst_clone() is done by other cpus, involving cache line ping pongs. It seems right place to release dst is in dev_hard_start_xmit(), for most devices but ones that are virtual, and some exceptions. David Miller suggested to define a new device flag, set in alloc_netdev_mq() (so that most devices set it at init time), and carefuly unset in devices which dont want a NULL skb->dst in their ndo_start_xmit(). List of devices that must clear this flag is : - loopback device, because it calls netif_rx() and quoting Patrick : "ip_route_input() doesn't accept loopback addresses, so loopback packets already need to have a dst_entry attached." - appletalk/ipddp.c : needs skb->dst in its xmit function - And all devices that call again dev_queue_xmit() from their xmit function (as some classifiers need skb->dst) : bonding, vlan, macvlan, eql, ifb, hdlc_fr Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- drivers/net/appletalk/ipddp.c | 1 + drivers/net/bonding/bond_main.c | 1 + drivers/net/eql.c | 1 + drivers/net/ifb.c | 1 + drivers/net/loopback.c | 1 + drivers/net/macvlan.c | 1 + drivers/net/wan/hdlc_fr.c | 1 + include/linux/if.h | 3 +++ net/8021q/vlan_dev.c | 1 + net/core/dev.c | 9 +++++++++ 10 files changed, 20 insertions(+) diff --git a/drivers/net/appletalk/ipddp.c b/drivers/net/appletalk/ipddp.c index da64ba8..f939e92 100644 --- a/drivers/net/appletalk/ipddp.c +++ b/drivers/net/appletalk/ipddp.c @@ -71,6 +71,7 @@ static struct net_device * __init ipddp_init(void) if (!dev) return ERR_PTR(-ENOMEM); + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; strcpy(dev->name, "ipddp%d"); if (version_printed++ == 0) diff --git a/drivers/net/bonding/bond_main.c b/drivers/net/bonding/bond_main.c index 815191d..a29f421 100644 --- a/drivers/net/bonding/bond_main.c +++ b/drivers/net/bonding/bond_main.c @@ -5148,6 +5148,7 @@ int bond_create(char *name, struct bond_params *params) goto out_rtnl; } + bond_dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; if (!name) { res = dev_alloc_name(bond_dev, "bond%d"); if (res < 0) diff --git a/drivers/net/eql.c b/drivers/net/eql.c index 5210bb1..19b7dd9 100644 --- a/drivers/net/eql.c +++ b/drivers/net/eql.c @@ -194,6 +194,7 @@ static void __init eql_setup(struct net_device *dev) dev->type = ARPHRD_SLIP; dev->tx_queue_len = 5; /* Hands them off fast */ + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; } static int eql_open(struct net_device *dev) diff --git a/drivers/net/ifb.c b/drivers/net/ifb.c index 60a2630..96713ef 100644 --- a/drivers/net/ifb.c +++ b/drivers/net/ifb.c @@ -156,6 +156,7 @@ static void ifb_setup(struct net_device *dev) dev->flags |= IFF_NOARP; dev->flags &= ~IFF_MULTICAST; + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; random_ether_addr(dev->dev_addr); } diff --git a/drivers/net/loopback.c b/drivers/net/loopback.c index 6f71157..da472c6 100644 --- a/drivers/net/loopback.c +++ b/drivers/net/loopback.c @@ -170,6 +170,7 @@ static void loopback_setup(struct net_device *dev) dev->tx_queue_len = 0; dev->type = ARPHRD_LOOPBACK; /* 0x0001*/ dev->flags = IFF_LOOPBACK; + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; dev->features = NETIF_F_SG | NETIF_F_FRAGLIST | NETIF_F_TSO | NETIF_F_NO_CSUM diff --git a/drivers/net/macvlan.c b/drivers/net/macvlan.c index 329cd50..d5334b4 100644 --- a/drivers/net/macvlan.c +++ b/drivers/net/macvlan.c @@ -414,6 +414,7 @@ static void macvlan_setup(struct net_device *dev) { ether_setup(dev); + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; dev->netdev_ops = &macvlan_netdev_ops; dev->destructor = free_netdev; dev->header_ops = &macvlan_hard_header_ops, diff --git a/drivers/net/wan/hdlc_fr.c b/drivers/net/wan/hdlc_fr.c index 8005301..bfa0161 100644 --- a/drivers/net/wan/hdlc_fr.c +++ b/drivers/net/wan/hdlc_fr.c @@ -1054,6 +1054,7 @@ static void pvc_setup(struct net_device *dev) dev->flags = IFF_POINTOPOINT; dev->hard_header_len = 10; dev->addr_len = 2; + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; } static const struct net_device_ops pvc_ops = { diff --git a/include/linux/if.h b/include/linux/if.h index 1108f3e..b9a6229 100644 --- a/include/linux/if.h +++ b/include/linux/if.h @@ -67,6 +67,9 @@ #define IFF_ISATAP 0x80 /* ISATAP interface (RFC4214) */ #define IFF_MASTER_ARPMON 0x100 /* bonding master, ARP mon in use */ #define IFF_WAN_HDLC 0x200 /* WAN HDLC device */ +#define IFF_XMIT_DST_RELEASE 0x400 /* dev_hard_start_xmit() is allowed to + * release skb->dst + */ #define IF_GET_IFACE 0x0001 /* for querying only */ #define IF_GET_PROTO 0x0002 diff --git a/net/8021q/vlan_dev.c b/net/8021q/vlan_dev.c index 25ba41e..faa535f 100644 --- a/net/8021q/vlan_dev.c +++ b/net/8021q/vlan_dev.c @@ -739,6 +739,7 @@ void vlan_setup(struct net_device *dev) ether_setup(dev); dev->priv_flags |= IFF_802_1Q_VLAN; + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; dev->tx_queue_len = 0; dev->netdev_ops = &vlan_netdev_ops; diff --git a/net/core/dev.c b/net/core/dev.c index 14dd725..b9aef53 100644 --- a/net/core/dev.c +++ b/net/core/dev.c @@ -1688,6 +1688,14 @@ int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev, goto gso; } + /* + * If device doesnt need skb->dst, release it right now while + * its hot in this cpu cache + */ + if ((dev->priv_flags & IFF_XMIT_DST_RELEASE) && skb->dst) { + dst_release(skb->dst); + skb->dst = NULL; + } rc = ops->ndo_start_xmit(skb, dev); /* * TODO: if skb_orphan() was called by @@ -5028,6 +5036,7 @@ struct net_device *alloc_netdev_mq(int sizeof_priv, const char *name, netdev_init_queues(dev); INIT_LIST_HEAD(&dev->napi_list); + dev->priv_flags = IFF_XMIT_DST_RELEASE; setup(dev); strcpy(dev->name, name); return dev; ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH, v2] net: release dst entry in dev_hard_start_xmit() 2009-05-12 19:26 ` [PATCH, v2] " Eric Dumazet @ 2009-05-19 5:19 ` David Miller 2009-05-19 5:44 ` Eric Dumazet 2009-05-19 19:44 ` Eric Dumazet 0 siblings, 2 replies; 28+ messages in thread From: David Miller @ 2009-05-19 5:19 UTC (permalink / raw) To: dada1; +Cc: netdev, jarkao2, kaber From: Eric Dumazet <dada1@cosmosbay.com> Date: Tue, 12 May 2009 21:26:35 +0200 > [PATCH] net: release dst entry in dev_hard_start_xmit() ... > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Applied, thanks Eric. Eric, please followup and double-check the pppoe paths that Jarek mentioned. I never saw that fully resolved. Thanks! ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH, v2] net: release dst entry in dev_hard_start_xmit() 2009-05-19 5:19 ` David Miller @ 2009-05-19 5:44 ` Eric Dumazet 2009-05-19 19:44 ` Eric Dumazet 1 sibling, 0 replies; 28+ messages in thread From: Eric Dumazet @ 2009-05-19 5:44 UTC (permalink / raw) To: David Miller; +Cc: netdev, jarkao2, kaber David Miller a écrit : > From: Eric Dumazet <dada1@cosmosbay.com> > Date: Tue, 12 May 2009 21:26:35 +0200 > >> [PATCH] net: release dst entry in dev_hard_start_xmit() > ... >> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > > Applied, thanks Eric. > > Eric, please followup and double-check the pppoe paths > that Jarek mentioned. I never saw that fully resolved. > Ah OK, I 'll do this David Thanks ! ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH, v2] net: release dst entry in dev_hard_start_xmit() 2009-05-19 5:19 ` David Miller 2009-05-19 5:44 ` Eric Dumazet @ 2009-05-19 19:44 ` Eric Dumazet 2009-05-19 21:09 ` Jarek Poplawski 1 sibling, 1 reply; 28+ messages in thread From: Eric Dumazet @ 2009-05-19 19:44 UTC (permalink / raw) To: David Miller; +Cc: netdev, jarkao2, kaber David Miller a écrit : > From: Eric Dumazet <dada1@cosmosbay.com> > Date: Tue, 12 May 2009 21:26:35 +0200 > >> [PATCH] net: release dst entry in dev_hard_start_xmit() > ... >> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > > Applied, thanks Eric. > > Eric, please followup and double-check the pppoe paths > that Jarek mentioned. I never saw that fully resolved. > [PATCH] ppp: unset IFF_XMIT_DST_RELEASE in ppp_setup() Jarek pointed pppoe can call back dev_queue_xmit(), and might need skb->dst, so its safer to unset IFF_XMIT_DST_RELEASE on ppp devices. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> --- diff --git a/drivers/net/ppp_generic.c b/drivers/net/ppp_generic.c index 8ee9142..639d11b 100644 --- a/drivers/net/ppp_generic.c +++ b/drivers/net/ppp_generic.c @@ -1054,6 +1054,7 @@ static void ppp_setup(struct net_device *dev) dev->type = ARPHRD_PPP; dev->flags = IFF_POINTOPOINT | IFF_NOARP | IFF_MULTICAST; dev->features |= NETIF_F_NETNS_LOCAL; + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; } /* ^ permalink raw reply related [flat|nested] 28+ messages in thread
* Re: [PATCH, v2] net: release dst entry in dev_hard_start_xmit() 2009-05-19 19:44 ` Eric Dumazet @ 2009-05-19 21:09 ` Jarek Poplawski 2009-05-19 21:21 ` Eric Dumazet 2009-05-19 21:24 ` David Miller 0 siblings, 2 replies; 28+ messages in thread From: Jarek Poplawski @ 2009-05-19 21:09 UTC (permalink / raw) To: Eric Dumazet; +Cc: David Miller, netdev, kaber On Tue, May 19, 2009 at 09:44:14PM +0200, Eric Dumazet wrote: > David Miller a écrit : > > From: Eric Dumazet <dada1@cosmosbay.com> > > Date: Tue, 12 May 2009 21:26:35 +0200 > > > >> [PATCH] net: release dst entry in dev_hard_start_xmit() > > ... > >> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > > > > Applied, thanks Eric. > > > > Eric, please followup and double-check the pppoe paths > > that Jarek mentioned. I never saw that fully resolved. > > > > [PATCH] ppp: unset IFF_XMIT_DST_RELEASE in ppp_setup() > > Jarek pointed pppoe can call back dev_queue_xmit(), and might need > skb->dst, so its safer to unset IFF_XMIT_DST_RELEASE on ppp devices. Hmm... Of course, this patch looks OK to me, but actually my main concern was more general. We avoid adding such flags for each "real" dev, but if so IMHO it would be safer to generally add them to all "virtual" devs - needed or not. You prefer to do this only where necessary, but it's not always clear if it's omitted on purpose or by chance. So, now I'm wondering about xen-netfront - needlessly I hope ;-) Jarek P. > > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > --- > diff --git a/drivers/net/ppp_generic.c b/drivers/net/ppp_generic.c > index 8ee9142..639d11b 100644 > --- a/drivers/net/ppp_generic.c > +++ b/drivers/net/ppp_generic.c > @@ -1054,6 +1054,7 @@ static void ppp_setup(struct net_device *dev) > dev->type = ARPHRD_PPP; > dev->flags = IFF_POINTOPOINT | IFF_NOARP | IFF_MULTICAST; > dev->features |= NETIF_F_NETNS_LOCAL; > + dev->priv_flags &= ~IFF_XMIT_DST_RELEASE; > } > > /* > ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH, v2] net: release dst entry in dev_hard_start_xmit() 2009-05-19 21:09 ` Jarek Poplawski @ 2009-05-19 21:21 ` Eric Dumazet 2009-05-19 21:24 ` David Miller 1 sibling, 0 replies; 28+ messages in thread From: Eric Dumazet @ 2009-05-19 21:21 UTC (permalink / raw) To: Jarek Poplawski; +Cc: David Miller, netdev, kaber Jarek Poplawski a écrit : > On Tue, May 19, 2009 at 09:44:14PM +0200, Eric Dumazet wrote: >> David Miller a écrit : >>> From: Eric Dumazet <dada1@cosmosbay.com> >>> Date: Tue, 12 May 2009 21:26:35 +0200 >>> >>>> [PATCH] net: release dst entry in dev_hard_start_xmit() >>> ... >>>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> >>> Applied, thanks Eric. >>> >>> Eric, please followup and double-check the pppoe paths >>> that Jarek mentioned. I never saw that fully resolved. >>> >> [PATCH] ppp: unset IFF_XMIT_DST_RELEASE in ppp_setup() >> >> Jarek pointed pppoe can call back dev_queue_xmit(), and might need >> skb->dst, so its safer to unset IFF_XMIT_DST_RELEASE on ppp devices. > > Hmm... Of course, this patch looks OK to me, but actually my main > concern was more general. We avoid adding such flags for each "real" > dev, but if so IMHO it would be safer to generally add them to all > "virtual" devs - needed or not. You prefer to do this only where > necessary, but it's not always clear if it's omitted on purpose or > by chance. So, now I'm wondering about xen-netfront - needlessly I > hope ;-) > This is the deal in fact, tracking all valid uses, and I'll check this. Another path would have to set the flag only for fast devices (Gb and 10Gb) ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH, v2] net: release dst entry in dev_hard_start_xmit() 2009-05-19 21:09 ` Jarek Poplawski 2009-05-19 21:21 ` Eric Dumazet @ 2009-05-19 21:24 ` David Miller 1 sibling, 0 replies; 28+ messages in thread From: David Miller @ 2009-05-19 21:24 UTC (permalink / raw) To: jarkao2; +Cc: dada1, netdev, kaber From: Jarek Poplawski <jarkao2@gmail.com> Date: Tue, 19 May 2009 23:09:37 +0200 > Hmm... Of course, this patch looks OK to me, but actually my main > concern was more general. We avoid adding such flags for each "real" > dev, but if so IMHO it would be safer to generally add them to all > "virtual" devs - needed or not. You prefer to do this only where > necessary, but it's not always clear if it's omitted on purpose or > by chance. So, now I'm wondering about xen-netfront - needlessly I > hope ;-) It is an issue that surely needs to be fleshed out, to make sure we use this where possible without breaking odd cases too easily. But for now I'm applying Eric's patch because it does fix something until we have these issues sorted more generally. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] net: release dst entry in dev_hard_start_xmit() 2009-05-12 8:21 ` Eric Dumazet 2009-05-12 19:26 ` [PATCH, v2] " Eric Dumazet @ 2009-05-12 19:27 ` Jarek Poplawski 2009-05-12 19:44 ` Eric Dumazet 1 sibling, 1 reply; 28+ messages in thread From: Jarek Poplawski @ 2009-05-12 19:27 UTC (permalink / raw) To: Eric Dumazet; +Cc: David S. Miller, Linux Netdev List, Patrick McHardy On Tue, May 12, 2009 at 10:21:12AM +0200, Eric Dumazet wrote: > Eric Dumazet a écrit : ... > > List of devices that must clear this flag is : > > > > - loopback device, because it calls netif_rx() and quoting Patrick : > > "ip_route_input() doesn't accept loopback addresses, so loopback packets > > already need to have a dst_entry attached." > > - appletalk/ipddp.c : needs skb->dst in its xmit function > > > > - And all devices that call again dev_queue_xmit() from their xmit function > > (as some classifiers need skb->dst) : bonding, macvlan, eql, ifb, hdlc_fr Why not vlan (and/or maybe others in net/ using dev_queue_xmit)? Jarek P. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] net: release dst entry in dev_hard_start_xmit() 2009-05-12 19:27 ` [PATCH] " Jarek Poplawski @ 2009-05-12 19:44 ` Eric Dumazet 2009-05-12 20:05 ` Jarek Poplawski 0 siblings, 1 reply; 28+ messages in thread From: Eric Dumazet @ 2009-05-12 19:44 UTC (permalink / raw) To: Jarek Poplawski; +Cc: David S. Miller, Linux Netdev List, Patrick McHardy Jarek Poplawski a écrit : > On Tue, May 12, 2009 at 10:21:12AM +0200, Eric Dumazet wrote: >> Eric Dumazet a écrit : > ... >>> List of devices that must clear this flag is : >>> >>> - loopback device, because it calls netif_rx() and quoting Patrick : >>> "ip_route_input() doesn't accept loopback addresses, so loopback packets >>> already need to have a dst_entry attached." >>> - appletalk/ipddp.c : needs skb->dst in its xmit function >>> >>> - And all devices that call again dev_queue_xmit() from their xmit function >>> (as some classifiers need skb->dst) : bonding, macvlan, eql, ifb, hdlc_fr > > Why not vlan (and/or maybe others in net/ using dev_queue_xmit)? > Yes I spoted vlan earlier this afternoon. For other net/*, I didnt not find candidates yet. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] net: release dst entry in dev_hard_start_xmit() 2009-05-12 19:44 ` Eric Dumazet @ 2009-05-12 20:05 ` Jarek Poplawski 2009-05-12 20:24 ` Jarek Poplawski 2009-05-12 20:52 ` Eric Dumazet 0 siblings, 2 replies; 28+ messages in thread From: Jarek Poplawski @ 2009-05-12 20:05 UTC (permalink / raw) To: Eric Dumazet; +Cc: David S. Miller, Linux Netdev List, Patrick McHardy On Tue, May 12, 2009 at 09:44:52PM +0200, Eric Dumazet wrote: > Jarek Poplawski a écrit : > > On Tue, May 12, 2009 at 10:21:12AM +0200, Eric Dumazet wrote: > >> Eric Dumazet a écrit : > > ... > >>> List of devices that must clear this flag is : > >>> > >>> - loopback device, because it calls netif_rx() and quoting Patrick : > >>> "ip_route_input() doesn't accept loopback addresses, so loopback packets > >>> already need to have a dst_entry attached." > >>> - appletalk/ipddp.c : needs skb->dst in its xmit function > >>> > >>> - And all devices that call again dev_queue_xmit() from their xmit function > >>> (as some classifiers need skb->dst) : bonding, macvlan, eql, ifb, hdlc_fr > > > > Why not vlan (and/or maybe others in net/ using dev_queue_xmit)? > > > > Yes I spoted vlan earlier this afternoon. > > For other net/*, I didnt not find candidates yet. Hmm..., if vlan, then why not pppoe (and/or maybe others in drivers/net/ using dev_queue_xmit)? Jarek P. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] net: release dst entry in dev_hard_start_xmit() 2009-05-12 20:05 ` Jarek Poplawski @ 2009-05-12 20:24 ` Jarek Poplawski 2009-05-12 20:52 ` Eric Dumazet 1 sibling, 0 replies; 28+ messages in thread From: Jarek Poplawski @ 2009-05-12 20:24 UTC (permalink / raw) To: Eric Dumazet; +Cc: David S. Miller, Linux Netdev List, Patrick McHardy On Tue, May 12, 2009 at 10:05:15PM +0200, Jarek Poplawski wrote: ... > Hmm..., if vlan, then why not pppoe (and/or maybe others in drivers/net/ > using dev_queue_xmit)? I wonder if we can't simplify this e.g. by checking indirectly for some hardware flag/capability like checksums etc. Some older hardware could be omitted, by is it really so big deal? Jarek P. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] net: release dst entry in dev_hard_start_xmit() 2009-05-12 20:05 ` Jarek Poplawski 2009-05-12 20:24 ` Jarek Poplawski @ 2009-05-12 20:52 ` Eric Dumazet 2009-05-12 20:59 ` Jarek Poplawski 1 sibling, 1 reply; 28+ messages in thread From: Eric Dumazet @ 2009-05-12 20:52 UTC (permalink / raw) To: Jarek Poplawski; +Cc: David S. Miller, Linux Netdev List, Patrick McHardy Jarek Poplawski a écrit : > On Tue, May 12, 2009 at 09:44:52PM +0200, Eric Dumazet wrote: >> Jarek Poplawski a écrit : >>> On Tue, May 12, 2009 at 10:21:12AM +0200, Eric Dumazet wrote: >>>> Eric Dumazet a écrit : >>> ... >>>>> List of devices that must clear this flag is : >>>>> >>>>> - loopback device, because it calls netif_rx() and quoting Patrick : >>>>> "ip_route_input() doesn't accept loopback addresses, so loopback packets >>>>> already need to have a dst_entry attached." >>>>> - appletalk/ipddp.c : needs skb->dst in its xmit function >>>>> >>>>> - And all devices that call again dev_queue_xmit() from their xmit function >>>>> (as some classifiers need skb->dst) : bonding, macvlan, eql, ifb, hdlc_fr >>> Why not vlan (and/or maybe others in net/ using dev_queue_xmit)? >>> >> Yes I spoted vlan earlier this afternoon. >> >> For other net/*, I didnt not find candidates yet. > > Hmm..., if vlan, then why not pppoe (and/or maybe others in drivers/net/ > using dev_queue_xmit)? > As I said, I didnt found other relevant uses, but I am probably wrong :) About ppoe for example, two calls to dev_queue_xmit() are not relevant. One is from static int pppoe_sendmsg(struct kiocb *iocb, struct socket *sock, struct msghdr *m, size_t total_len) This is a normal direct call, not a call from its ndo_start_xmit() Second one is from static int __pppoe_xmit(struct sock *sk, struct sk_buff *skb) Same observation, there is no impact for this one as well. So we have to care on device drivers that have a ndo_start_xmit() call that could re-enter dev_queue_xmit(). Not care about drivers that call dev_queue_xmit() from other paths. ^ permalink raw reply [flat|nested] 28+ messages in thread
* Re: [PATCH] net: release dst entry in dev_hard_start_xmit() 2009-05-12 20:52 ` Eric Dumazet @ 2009-05-12 20:59 ` Jarek Poplawski 0 siblings, 0 replies; 28+ messages in thread From: Jarek Poplawski @ 2009-05-12 20:59 UTC (permalink / raw) To: Eric Dumazet; +Cc: David S. Miller, Linux Netdev List, Patrick McHardy On Tue, May 12, 2009 at 10:52:27PM +0200, Eric Dumazet wrote: > Jarek Poplawski a écrit : > > On Tue, May 12, 2009 at 09:44:52PM +0200, Eric Dumazet wrote: > >> Jarek Poplawski a écrit : > >>> On Tue, May 12, 2009 at 10:21:12AM +0200, Eric Dumazet wrote: > >>>> Eric Dumazet a écrit : > >>> ... > >>>>> List of devices that must clear this flag is : > >>>>> > >>>>> - loopback device, because it calls netif_rx() and quoting Patrick : > >>>>> "ip_route_input() doesn't accept loopback addresses, so loopback packets > >>>>> already need to have a dst_entry attached." > >>>>> - appletalk/ipddp.c : needs skb->dst in its xmit function > >>>>> > >>>>> - And all devices that call again dev_queue_xmit() from their xmit function > >>>>> (as some classifiers need skb->dst) : bonding, macvlan, eql, ifb, hdlc_fr > >>> Why not vlan (and/or maybe others in net/ using dev_queue_xmit)? > >>> > >> Yes I spoted vlan earlier this afternoon. > >> > >> For other net/*, I didnt not find candidates yet. > > > > Hmm..., if vlan, then why not pppoe (and/or maybe others in drivers/net/ > > using dev_queue_xmit)? > > > > As I said, I didnt found other relevant uses, but I am probably wrong :) > > About ppoe for example, two calls to dev_queue_xmit() are not relevant. > > One is from > > static int pppoe_sendmsg(struct kiocb *iocb, struct socket *sock, > struct msghdr *m, size_t total_len) > > This is a normal direct call, not a call from its ndo_start_xmit() > > Second one is from > > static int __pppoe_xmit(struct sock *sk, struct sk_buff *skb) > > Same observation, there is no impact for this one as well. > > > > So we have to care on device drivers that have a ndo_start_xmit() call that could > re-enter dev_queue_xmit(). Not care about drivers that call dev_queue_xmit() from > other paths. > Isn't it called through ppp_generic's ndo_start_xmit()? Jarek P. ^ permalink raw reply [flat|nested] 28+ messages in thread
end of thread, other threads:[~2009-05-19 21:24 UTC | newest] Thread overview: 28+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2009-03-20 11:40 [RFC] net: release dst entry in dev_queue_xmit() Eric Dumazet 2009-03-20 14:10 ` Neil Horman 2009-03-25 6:43 ` David Miller 2009-03-25 7:13 ` Eric Dumazet 2009-03-25 7:17 ` David Miller 2009-03-25 18:22 ` Jarek Poplawski 2009-03-25 18:41 ` Eric Dumazet 2009-03-25 19:18 ` Jarek Poplawski 2009-03-25 19:40 ` Eric Dumazet 2009-03-25 19:54 ` Jarek Poplawski 2009-03-25 20:28 ` Eric Dumazet 2009-03-25 21:12 ` Jarek Poplawski 2009-03-25 21:20 ` Patrick McHardy 2009-05-12 8:12 ` [PATCH] net: release dst entry in dev_hard_start_xmit() Eric Dumazet 2009-05-12 8:21 ` Eric Dumazet 2009-05-12 19:26 ` [PATCH, v2] " Eric Dumazet 2009-05-19 5:19 ` David Miller 2009-05-19 5:44 ` Eric Dumazet 2009-05-19 19:44 ` Eric Dumazet 2009-05-19 21:09 ` Jarek Poplawski 2009-05-19 21:21 ` Eric Dumazet 2009-05-19 21:24 ` David Miller 2009-05-12 19:27 ` [PATCH] " Jarek Poplawski 2009-05-12 19:44 ` Eric Dumazet 2009-05-12 20:05 ` Jarek Poplawski 2009-05-12 20:24 ` Jarek Poplawski 2009-05-12 20:52 ` Eric Dumazet 2009-05-12 20:59 ` Jarek Poplawski
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).