[RFC] Could we avoid touching dst->refcount in some cases ?

netdev.vger.kernel.org archive mirror

* [RFC] Could we avoid touching dst->refcount in some cases ?
@ 2008-11-24  8:57 Eric Dumazet
  2008-11-24  9:42 ` Andi Kleen
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2008-11-24  8:57 UTC (permalink / raw)
  To: Linux Netdev List

[-- Attachment #1: Type: text/plain, Size: 484 bytes --]

tbench has a hard time incrementing and decrementing the route cache
refcount shared by all communications on localhost.

In the real world, we also have this problem on RTP servers sending many UDP
frames to media gateways, especially big ones handling thousands of streams.

Given that route entries are using RCU, we can probably avoid incrementing
their refcount in the case of connected sockets?

Here is an (untested and probably not working at all) patch for the UDP
part to illustrate the idea:


[-- Attachment #2: avoid_touching_refcount.patch --]
[-- Type: text/plain, Size: 1271 bytes --]

diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index da869ce..c385f13 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -553,6 +553,7 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 	int ulen = len;
 	struct ipcm_cookie ipc;
 	struct rtable *rt = NULL;
+	int rt_release = 0;
 	int free = 0;
 	int connected = 0;
 	__be32 daddr, faddr, saddr;
@@ -656,8 +657,9 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		connected = 0;
 	}
 
+	rcu_read_lock();
 	if (connected)
-		rt = (struct rtable*)sk_dst_check(sk, 0);
+		rt = (struct rtable *)__sk_dst_check(sk, 0);
 
 	if (rt == NULL) {
 		struct flowi fl = { .oif = ipc.oif,
@@ -681,11 +683,14 @@ int udp_sendmsg(struct kiocb *iocb, struct sock *sk, struct msghdr *msg,
 		}
 
 		err = -EACCES;
+		rt_release = 1;
 		if ((rt->rt_flags & RTCF_BROADCAST) &&
 		    !sock_flag(sk, SOCK_BROADCAST))
 			goto out;
-		if (connected)
-			sk_dst_set(sk, dst_clone(&rt->u.dst));
+		if (connected) {
+			sk_dst_set(sk, &rt->u.dst);
+			rt_release = 0;
+		}
 	}
 
 	if (msg->msg_flags&MSG_CONFIRM)
@@ -730,7 +735,9 @@ do_append_data:
 	release_sock(sk);
 
 out:
-	ip_rt_put(rt);
+	if (rt_release)
+		ip_rt_put(rt);
+	rcu_read_unlock();
 	if (free)
 		kfree(ipc.opt);
 	if (!err)


* Re: [RFC] Could we avoid touching dst->refcount in some cases ?
  2008-11-24  8:57 [RFC] Could we avoid touching dst->refcount in some cases ? Eric Dumazet
@ 2008-11-24  9:42 ` Andi Kleen
  2008-11-24 10:14   ` Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2008-11-24  9:42 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Linux Netdev List

Eric Dumazet <dada1@cosmosbay.com> writes:

> tbench has hard time incrementing decrementing the route cache refcount
> shared by all communications on localhost.

IIRC there was a patch some time ago to use per-CPU loopback devices to
avoid this, but it was considered too much of a benchmark hack.
As core counts increase, it might stop being that, though.

>
> On real world, we also have this problem on RTP servers sending many UDP
> frames to mediagateways, especially big ones handling thousand of streams.
>
> Given that route entries are using RCU, we probably can avoid incrementing
> their refcount in case of connected sockets ?

Normally they can be held across sleeps or queuing of skbs too, and RCU
doesn't handle that. To make it handle that you would need to define a
custom RCU period designed for this case, but this would probably be
tricky and fragile. In particular, even if you had an "any packet
queued" RCU method, I'm not sure it would be guaranteed to always
finish, because there is no fixed upper bound on the lifetime of a packet.

The other issue is that on preemptible kernels you would need to
disable preemption for as long as such a routing entry is held, which
could potentially be quite long.

-Andi

-- 
ak@linux.intel.com


* Re: [RFC] Could we avoid touching dst->refcount in some cases ?
  2008-11-24  9:42 ` Andi Kleen
@ 2008-11-24 10:14   ` Eric Dumazet
  2008-11-24 11:24     ` [PATCH] net: avoid a pair of dst_hold()/dst_release() in ip_append_data() Eric Dumazet
                       ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Eric Dumazet @ 2008-11-24 10:14 UTC (permalink / raw)
  To: Andi Kleen; +Cc: Linux Netdev List

Andi Kleen wrote:
> Eric Dumazet <dada1@cosmosbay.com> writes:
> 
>> tbench has hard time incrementing decrementing the route cache refcount
>> shared by all communications on localhost.
> 
> iirc there was a patch some time ago to use per CPU loopback devices to 
> avoid this, but it was considered too much a benchmark hack.
> As core counts increase it might stop being that though.

Well, you are probably referring to Stephen's patch to avoid dirtying
other contended cache lines (one napi structure per cpu).

Having multiple loopback devices would really be a hack, I agree.

> 
>> On real world, we also have this problem on RTP servers sending many UDP
>> frames to mediagateways, especially big ones handling thousand of streams.
>>
>> Given that route entries are using RCU, we probably can avoid incrementing
>> their refcount in case of connected sockets ?
> 
> Normally they can be hold over sleeps or queuing of skbs too, and RCU
> doesn't handle that. To make it handle that you would need to define a
> custom RCU period designed for this case, but this would be probably
> tricky and fragile: especially I'm not sure even if you had a "any
> packet queued" RCU method it be guaranteed to always finish 
> because there is no fixed upper livetime of a packet.
> 
> The other issue is that on preemptible kernels you would need to 
> disable preemption all the time such a routing entry is hold, which
> could be potentially quite long.
> 

Well, in the case of UDP, we call ip_push_pending_frames() and this one
increments the refcount (again). I was not considering avoiding the
refcount hold we do when queuing an skb in the transmit queue, only
the one held for a short period of time. Oh well, ip_append_data()
might sleep, so this cannot work...

I agree avoiding one refcount increment/decrement is probably
not a huge gain, considering we *have* to do the increment,
but when many cpus are using UDP send/receive in parallel, this might
show a gain somehow.

So maybe we could make ip_append_data() (or its callers) a
little bit smarter, avoiding the increment/decrement where possible.



* [PATCH] net: avoid a pair of dst_hold()/dst_release() in ip_append_data()
  2008-11-24 10:14   ` Eric Dumazet
@ 2008-11-24 11:24     ` Eric Dumazet
  2008-11-24 13:59       ` [PATCH] net: avoid a pair of dst_hold()/dst_release() in ip_push_pending_frames() Eric Dumazet
  2008-11-24 23:55       ` [PATCH] net: avoid a pair of dst_hold()/dst_release() in ip_append_data() David Miller
  2008-11-24 11:27     ` [RFC] Could we avoid touching dst->refcount in some cases ? Andi Kleen
  2008-11-24 23:39     ` David Miller
  2 siblings, 2 replies; 22+ messages in thread
From: Eric Dumazet @ 2008-11-24 11:24 UTC (permalink / raw)
  To: David S. Miller
  Cc: Andi Kleen, Linux Netdev List, Corey Minyard, Christian Bell

[-- Attachment #1: Type: text/plain, Size: 3199 bytes --]

Eric Dumazet wrote:
> Andi Kleen wrote:
>> Eric Dumazet <dada1@cosmosbay.com> writes:
>>
>>> tbench has hard time incrementing decrementing the route cache refcount
>>> shared by all communications on localhost.
>>
>> iirc there was a patch some time ago to use per CPU loopback devices 
>> to avoid this, but it was considered too much a benchmark hack.
>> As core counts increase it might stop being that though.
> 
> Well, you probably mention Stephen patch to avoid dirtying other contended
> cache lines (one napi structure per cpu)
> 
> Having multiple loopback dev would really be a hack I agree.
> 
>>
>>> On real world, we also have this problem on RTP servers sending many UDP
>>> frames to mediagateways, especially big ones handling thousand of 
>>> streams.
>>>
>>> Given that route entries are using RCU, we probably can avoid 
>>> incrementing
>>> their refcount in case of connected sockets ?
>>
>> Normally they can be hold over sleeps or queuing of skbs too, and RCU
>> doesn't handle that. To make it handle that you would need to define a
>> custom RCU period designed for this case, but this would be probably
>> tricky and fragile: especially I'm not sure even if you had a "any
>> packet queued" RCU method it be guaranteed to always finish because 
>> there is no fixed upper livetime of a packet.
>>
>> The other issue is that on preemptible kernels you would need to 
>> disable preemption all the time such a routing entry is hold, which
>> could be potentially quite long.
>>
> 
> Well, in case of UDP, we call ip_push_pending_frames() and this one
> does the increment of refcount (again). I was not considering
> avoiding the refcount hold we do when queing a skb in transmit
> queue, only during a short period of time. Oh well, ip_append_data()
> might sleep, so this cannot work...
> 
> I agree avoiding one refcount increment/decrement is probably
> not a huge gain, considering we *have* to do the increment,
> but when many cpus are using UDP send/receive in //, this might
> show a gain somehow.
> 
> So maybe we could make ip_append_data() (or its callers) a
> litle bit smarter, avoiding increment/decrement if possible.

Here is a patch to remove one dst_hold()/dst_release() pair
in the UDP/RAW transmit path.

[PATCH] net: avoid a pair of dst_hold()/dst_release() in ip_append_data()

We can reduce the pressure on the dst entry refcount that slows down the
UDP transmit path on SMP machines. This pressure is visible on RTP servers
when delivering content to media gateways, especially big ones handling
thousands of streams. Several cpus send UDP frames to the same
destination, hence use the same dst entry.

This patch makes ip_append_data() possibly steal the refcount its
callers had to take on the dst entry.

This doesn't avoid all refcounting, but still gives speedups on SMP,
on the UDP/RAW transmit path.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/net/ip.h     |    2 +-
 net/ipv4/icmp.c      |    8 ++++----
 net/ipv4/ip_output.c |   11 ++++++++---
 net/ipv4/raw.c       |    2 +-
 net/ipv4/udp.c       |    2 +-
 5 files changed, 15 insertions(+), 10 deletions(-)

[-- Attachment #2: ip_append_data.patch --]
[-- Type: text/plain, Size: 4256 bytes --]

diff --git a/include/net/ip.h b/include/net/ip.h
index bc026ec..ddef10c 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -110,7 +110,7 @@ extern int		ip_append_data(struct sock *sk,
 						   int odd, struct sk_buff *skb),
 				void *from, int len, int protolen,
 				struct ipcm_cookie *ipc,
-				struct rtable *rt,
+				struct rtable **rt,
 				unsigned int flags);
 extern int		ip_generic_getfrag(void *from, char *to, int offset, int len, int odd, struct sk_buff *skb);
 extern ssize_t		ip_append_page(struct sock *sk, struct page *page,
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 21e497e..7b88be9 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -321,12 +321,12 @@ static int icmp_glue_bits(void *from, char *to, int offset, int len, int odd,
 }
 
 static void icmp_push_reply(struct icmp_bxm *icmp_param,
-			    struct ipcm_cookie *ipc, struct rtable *rt)
+			    struct ipcm_cookie *ipc, struct rtable **rt)
 {
 	struct sock *sk;
 	struct sk_buff *skb;
 
-	sk = icmp_sk(dev_net(rt->u.dst.dev));
+	sk = icmp_sk(dev_net((*rt)->u.dst.dev));
 	if (ip_append_data(sk, icmp_glue_bits, icmp_param,
 			   icmp_param->data_len+icmp_param->head_len,
 			   icmp_param->head_len,
@@ -392,7 +392,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
 	}
 	if (icmpv4_xrlim_allow(net, rt, icmp_param->data.icmph.type,
 			       icmp_param->data.icmph.code))
-		icmp_push_reply(icmp_param, &ipc, rt);
+		icmp_push_reply(icmp_param, &ipc, &rt);
 	ip_rt_put(rt);
 out_unlock:
 	icmp_xmit_unlock(sk);
@@ -635,7 +635,7 @@ route_done:
 		icmp_param.data_len = room;
 	icmp_param.head_len = sizeof(struct icmphdr);
 
-	icmp_push_reply(&icmp_param, &ipc, rt);
+	icmp_push_reply(&icmp_param, &ipc, &rt);
 ende:
 	ip_rt_put(rt);
 out_unlock:
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 46d7be2..da9b819 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -778,7 +778,7 @@ int ip_append_data(struct sock *sk,
 		   int getfrag(void *from, char *to, int offset, int len,
 			       int odd, struct sk_buff *skb),
 		   void *from, int length, int transhdrlen,
-		   struct ipcm_cookie *ipc, struct rtable *rt,
+		   struct ipcm_cookie *ipc, struct rtable **rtp,
 		   unsigned int flags)
 {
 	struct inet_sock *inet = inet_sk(sk);
@@ -793,6 +793,7 @@ int ip_append_data(struct sock *sk,
 	int offset = 0;
 	unsigned int maxfraglen, fragheaderlen;
 	int csummode = CHECKSUM_NONE;
+	struct rtable *rt;
 
 	if (flags&MSG_PROBE)
 		return 0;
@@ -812,7 +813,11 @@ int ip_append_data(struct sock *sk,
 			inet->cork.flags |= IPCORK_OPT;
 			inet->cork.addr = ipc->addr;
 		}
-		dst_hold(&rt->u.dst);
+		rt = *rtp;
+		/*
+		 * We steal reference to this route, caller should not release it
+		 */
+		*rtp = NULL;
 		inet->cork.fragsize = mtu = inet->pmtudisc == IP_PMTUDISC_PROBE ?
 					    rt->u.dst.dev->mtu :
 					    dst_mtu(rt->u.dst.path);
@@ -1391,7 +1396,7 @@ void ip_send_reply(struct sock *sk, struct sk_buff *skb, struct ip_reply_arg *ar
 	sk->sk_protocol = ip_hdr(skb)->protocol;
 	sk->sk_bound_dev_if = arg->bound_dev_if;
 	ip_append_data(sk, ip_reply_glue_bits, arg->iov->iov_base, len, 0,
-		       &ipc, rt, MSG_DONTWAIT);
+		       &ipc, &rt, MSG_DONTWAIT);
 	if ((skb = skb_peek(&sk->sk_write_queue)) != NULL) {
 		if (arg->csumoffset >= 0)
 			*((__sum16 *)skb_transport_header(skb) +
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 998fcff..dff8bc4 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -572,7 +572,7 @@ back_from_confirm:
 			ipc.addr = rt->rt_dst;
 		lock_sock(sk);
 		err = ip_append_data(sk, ip_generic_getfrag, msg->msg_iov, len, 0,
-					&ipc, rt, msg->msg_flags);
+					&ipc, &rt, msg->msg_flags);
 		if (err)
 			ip_flush_pending_frames(sk);
 		else if (!(msg->msg_flags & MSG_MORE))
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index da869ce..5491144 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -719,7 +719,7 @@ do_append_data:
 	up->len += ulen;
 	getfrag  =  is_udplite ?  udplite_getfrag : ip_generic_getfrag;
 	err = ip_append_data(sk, getfrag, msg->msg_iov, ulen,
-			sizeof(struct udphdr), &ipc, rt,
+			sizeof(struct udphdr), &ipc, &rt,
 			corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);
 	if (err)
 		udp_flush_pending_frames(sk);


* Re: [RFC] Could we avoid touching dst->refcount in some cases ?
  2008-11-24 10:14   ` Eric Dumazet
  2008-11-24 11:24     ` [PATCH] net: avoid a pair of dst_hold()/dst_release() in ip_append_data() Eric Dumazet
@ 2008-11-24 11:27     ` Andi Kleen
  2008-11-24 23:36       ` David Miller
  2008-11-24 23:39     ` David Miller
  2 siblings, 1 reply; 22+ messages in thread
From: Andi Kleen @ 2008-11-24 11:27 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: Andi Kleen, Linux Netdev List

On Mon, Nov 24, 2008 at 11:14:29AM +0100, Eric Dumazet wrote:
> Andi Kleen wrote:
> >Eric Dumazet <dada1@cosmosbay.com> writes:
> >
> >>tbench has hard time incrementing decrementing the route cache refcount
> >>shared by all communications on localhost.
> >
> >iirc there was a patch some time ago to use per CPU loopback devices to 
> >avoid this, but it was considered too much a benchmark hack.
> >As core counts increase it might stop being that though.
> 
> Well, you probably mention Stephen patch to avoid dirtying other contended
> cache lines (one napi structure per cpu)

No, that patch wasn't from Stephen. IIRC it was from someone at SGI.

-Andi

-- 
ak@linux.intel.com


* [PATCH] net: avoid a pair of dst_hold()/dst_release() in ip_push_pending_frames()
  2008-11-24 11:24     ` [PATCH] net: avoid a pair of dst_hold()/dst_release() in ip_append_data() Eric Dumazet
@ 2008-11-24 13:59       ` Eric Dumazet
  2008-11-25  0:07         ` David Miller
  2008-11-24 23:55       ` [PATCH] net: avoid a pair of dst_hold()/dst_release() in ip_append_data() David Miller
  1 sibling, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2008-11-24 13:59 UTC (permalink / raw)
  To: David S. Miller
  Cc: Andi Kleen, Linux Netdev List, Corey Minyard, Christian Bell

[-- Attachment #1: Type: text/plain, Size: 662 bytes --]

We can reduce the pressure on the dst entry refcount that slows down the
UDP transmit path on SMP machines. This pressure is visible on RTP servers
when delivering content to media gateways, especially big ones handling
thousands of streams. Several cpus send UDP frames to the same
destination, hence use the same dst entry.

This patch makes ip_push_pending_frames() steal the refcount its
callers had to take when filling inet->cork.dst.

This doesn't avoid all refcounting, but still gives speedups on SMP,
on the UDP/RAW transmit path.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 net/ipv4/ip_output.c |    7 ++++++-
 1 files changed, 6 insertions(+), 1 deletion(-)

[-- Attachment #2: ip_push_pending_frames.patch --]
[-- Type: text/plain, Size: 541 bytes --]

diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 46d7be2..89bc1b9 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -1279,7 +1279,12 @@ int ip_push_pending_frames(struct sock *sk)

 	skb->priority = sk->sk_priority;
 	skb->mark = sk->sk_mark;
-	skb->dst = dst_clone(&rt->u.dst);
+	/*
+	 * Steal rt from cork.dst to avoid a pair of atomic_inc/atomic_dec
+	 * on dst refcount
+	 */
+	inet->cork.dst = NULL;
+	skb->dst = &rt->u.dst;

 	if (iph->protocol == IPPROTO_ICMP)
 		icmp_out_count(net, ((struct icmphdr *)
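
To make the saving concrete, here is a small user-space sketch that
counts the atomic read-modify-write operations in each variant. It
assumes C11 atomics and toy types; toy_dst, toy_cork, toy_skb,
push_clone, push_steal and the toy_atomic_ops counter are all
hypothetical names for illustration, and the "clone then release"
variant only approximates what the cork cleanup did before the patch.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct toy_dst {
	atomic_int refcnt;
};

/* Illustration only: counts atomic read-modify-write operations. */
static int toy_atomic_ops;

static struct toy_dst *toy_dst_clone(struct toy_dst *d)
{
	atomic_fetch_add(&d->refcnt, 1);
	toy_atomic_ops++;
	return d;
}

static void toy_dst_release(struct toy_dst *d)
{
	if (d) {
		atomic_fetch_sub(&d->refcnt, 1);
		toy_atomic_ops++;
	}
}

struct toy_cork { struct toy_dst *dst; };
struct toy_skb  { struct toy_dst *dst; };

/* Before: the skb gets a new reference, then the cork drops its own. */
static void push_clone(struct toy_cork *cork, struct toy_skb *skb)
{
	skb->dst = toy_dst_clone(cork->dst);
	toy_dst_release(cork->dst);
	cork->dst = NULL;
}

/* After: the cork's reference simply moves to the skb. */
static void push_steal(struct toy_cork *cork, struct toy_skb *skb)
{
	skb->dst = cork->dst;
	cork->dst = NULL;
}
```

Both variants leave the counter at the same value; the second one just
never writes to the shared cache line holding it.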


* Re: [RFC] Could we avoid touching dst->refcount in some cases ?
  2008-11-24 11:27     ` [RFC] Could we avoid touching dst->refcount in some cases ? Andi Kleen
@ 2008-11-24 23:36       ` David Miller
  0 siblings, 0 replies; 22+ messages in thread
From: David Miller @ 2008-11-24 23:36 UTC (permalink / raw)
  To: andi; +Cc: dada1, netdev

From: Andi Kleen <andi@firstfloor.org>
Date: Mon, 24 Nov 2008 12:27:09 +0100

> On Mon, Nov 24, 2008 at 11:14:29AM +0100, Eric Dumazet wrote:
> > Andi Kleen wrote:
> > >Eric Dumazet <dada1@cosmosbay.com> writes:
> > >
> > >>tbench has hard time incrementing decrementing the route cache refcount
> > >>shared by all communications on localhost.
> > >
> > >iirc there was a patch some time ago to use per CPU loopback devices to 
> > >avoid this, but it was considered too much a benchmark hack.
> > >As core counts increase it might stop being that though.
> > 
> > Well, you probably mention Stephen patch to avoid dirtying other contended
> > cache lines (one napi structure per cpu)
> 
> No that patch wasn't from Stephen. iirc it was from someone at SGI.

That's how I remember it too.


* Re: [RFC] Could we avoid touching dst->refcount in some cases ?
  2008-11-24 10:14   ` Eric Dumazet
  2008-11-24 11:24     ` [PATCH] net: avoid a pair of dst_hold()/dst_release() in ip_append_data() Eric Dumazet
  2008-11-24 11:27     ` [RFC] Could we avoid touching dst->refcount in some cases ? Andi Kleen
@ 2008-11-24 23:39     ` David Miller
  2008-11-25  4:43       ` Eric Dumazet
  2 siblings, 1 reply; 22+ messages in thread
From: David Miller @ 2008-11-24 23:39 UTC (permalink / raw)
  To: dada1; +Cc: andi, netdev

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Mon, 24 Nov 2008 11:14:29 +0100

> So maybe we could make ip_append_data() (or its callers) a
> litle bit smarter, avoiding increment/decrement if possible.

These ideas are interesting but hard to make work.

I think the receive path has more chance of getting gains
from this, to be honest.

One third (effectively) of TCP stream packets are ACKs and
freed immediately.  This means that the looked up route does
not escape the packet receive path.  So we could elide the
counter increment in that case.

In fact, once we queue even TCP data, there is no need for
that cached skb->dst route any longer.

So pretty much all TCP packets could avoid the dst refcounting
on receive.


* Re: [PATCH] net: avoid a pair of dst_hold()/dst_release() in ip_append_data()
  2008-11-24 11:24     ` [PATCH] net: avoid a pair of dst_hold()/dst_release() in ip_append_data() Eric Dumazet
  2008-11-24 13:59       ` [PATCH] net: avoid a pair of dst_hold()/dst_release() in ip_push_pending_frames() Eric Dumazet
@ 2008-11-24 23:55       ` David Miller
  2008-11-25  2:22         ` Andi Kleen
  1 sibling, 1 reply; 22+ messages in thread
From: David Miller @ 2008-11-24 23:55 UTC (permalink / raw)
  To: dada1; +Cc: andi, netdev, minyard, christian

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Mon, 24 Nov 2008 12:24:53 +0100

> [PATCH] net: avoid a pair of dst_hold()/dst_release() in ip_append_data()
> 
> We can reduce pressure on dst entry refcount that slowdown UDP transmit
> path on SMP machines. This pressure is visible on RTP servers when
> delivering content to mediagateways, especially big ones, handling
> thousand of streams. Several cpus send UDP frames to the same
> destination, hence use the same dst entry.
> 
> This patch makes ip_append_data() eventually steal the refcount its
> callers had to take on the dst entry.
> 
> This doesnt avoid all refcounting, but still gives speedups on SMP,
> on UDP/RAW transmit path
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

Ok, this looks fine to me, thanks Eric.  Although as you know
I'm not a big fan of pass by reference arguments :-)

Thinking more, I believe we can do similar tricks for all TCP
transmit traffic.

Packets bound to sockets never outlive those sockets (and thus
their cached routes) unless we skb_orphan().

The only case not covered is where the socket cached route
is reset or changed.  We could defer the dst put until the
transmit queue reaches a certain point, kind of like a retransmit
queue RCU :-)

Just some ideas...


* Re: [PATCH] net: avoid a pair of dst_hold()/dst_release() in ip_push_pending_frames()
  2008-11-24 13:59       ` [PATCH] net: avoid a pair of dst_hold()/dst_release() in ip_push_pending_frames() Eric Dumazet
@ 2008-11-25  0:07         ` David Miller
  0 siblings, 0 replies; 22+ messages in thread
From: David Miller @ 2008-11-25  0:07 UTC (permalink / raw)
  To: dada1; +Cc: andi, netdev, minyard, christian

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Mon, 24 Nov 2008 14:59:51 +0100

> We can reduce pressure on dst entry refcount that slowdown UDP transmit
> path on SMP machines. This pressure is visible on RTP servers when
> delivering content to mediagateways, especially big ones, handling
> thousand of streams. Several cpus send UDP frames to the same
> destination, hence use the same dst entry.
> 
> This patch makes ip_push_pending_frames() steal the refcount its
> callers had to take when filling inet->cork.dst.
> 
> This doesnt avoid all refcounting, but still gives speedups on SMP,
> on UDP/RAW transmit path.
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

Applied.


* Re: [PATCH] net: avoid a pair of dst_hold()/dst_release() in ip_append_data()
  2008-11-24 23:55       ` [PATCH] net: avoid a pair of dst_hold()/dst_release() in ip_append_data() David Miller
@ 2008-11-25  2:22         ` Andi Kleen
  0 siblings, 0 replies; 22+ messages in thread
From: Andi Kleen @ 2008-11-25  2:22 UTC (permalink / raw)
  To: David Miller; +Cc: dada1, andi, netdev, minyard, christian

> Thinking more I believe we can do similar tricks for all TCP
> transmit traffic.

Sounds reasonable.

> 
> Packets bound to sockets never outlive those sockets (and thus
> their cached routes) unless we skb_orphan().
> 
> The only not covered case is where the socket cached route
> is reset or changed.  We could defer the dst put until the
> transmit queue reaches a certain point, kind of like a retransmit
> queue RCU :-)
> 
> Just some ideas...

Netfilter makes it somewhat tricky: for compatibility you would
need to reclone the route on the fly.

-Andi

-- 
ak@linux.intel.com


* Re: [RFC] Could we avoid touching dst->refcount in some cases ?
  2008-11-24 23:39     ` David Miller
@ 2008-11-25  4:43       ` Eric Dumazet
  2008-11-25  5:00         ` David Miller
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2008-11-25  4:43 UTC (permalink / raw)
  To: David Miller; +Cc: andi, netdev

David Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Mon, 24 Nov 2008 11:14:29 +0100
> 
>> So maybe we could make ip_append_data() (or its callers) a
>> litle bit smarter, avoiding increment/decrement if possible.
> 
> These ideas are interesting but hard to make work.
> 
> I think the receive path has more chance of getting gains
> from this, to be honest.
> 
> One third (effectively) of TCP stream packets are ACKs and
> freed immediately.  This means that the looked up route does
> not escape the packet receive path.  So we could elide the
> counter increment in that case.
> 
> In fact, once we queue even TCP data, there is no need for
> that cached skb->dst route any longer.
> 
> So pretty much all TCP packets could avoid the dst refcounting
> on receive.

Very interesting. So we could try the following path:

1) First try to release the dst when queuing an skb to the various queues
(UDP, TCP, ...) while it's hot. The reader won't have to release it
while it's cold.

2) Check whether we can handle the input path without any refcount
   dirtying.

To make the transition easy, we could use a bit on the skb to mark the
dst as not refcounted (i.e. no dst_release() should be done on it).




* Re: [RFC] Could we avoid touching dst->refcount in some cases ?
  2008-11-25  4:43       ` Eric Dumazet
@ 2008-11-25  5:00         ` David Miller
  2008-11-26  0:00           ` [PATCH] net: release skb->dst in sock_queue_rcv_skb() Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread
From: David Miller @ 2008-11-25  5:00 UTC (permalink / raw)
  To: dada1; +Cc: andi, netdev

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Tue, 25 Nov 2008 05:43:32 +0100

> Very interesting. So we could try the following path :
> 
> 1) First try to release dst when queueing skb to various queues
> (UDP, TCP, ...) while its hot. Reader wont have to release it
> while its cold.
> 
> 2) Check if we can handle the input path without any refcount
>    dirtying ?
> 
> To make the transition easy, we could use a bit on skb to mark
> dst being not refcounted (ie no dst_release() should be done on it)

It is possible to make this self-auditing.  For example, by
using the usual trick where we encode a pointer in an
unsigned long and use the low bits for states.

In the first step, make each skb->dst access go through some
accessor inline function.

Next, audit the paths where skb->dst's can "escape" the pure
packet input path.  Add annotations, in the form of an
inline function call, for these locations.

Also, audit the other locations where we enqueue into a socket
queue and no longer care about the skb->dst, and annotate
those with another inline function.

Finally, the initial skb->dst assignment in the input path doesn't
grab a reference, but sets the low bit ("refcount pending") in
the encoded skb->dst pointer.  The skb->dst "escape" inline
function performs the deferred refcount grab.  And kfree_skb()
is taught to not dst_release() on skb->dst's which have the
low bit set.

Anyways, something like that.
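
The scheme outlined above can be sketched in user space as follows. This
is only an illustration under stated assumptions: toy_dst, toy_skb, the
TOY_DST_NOREF flag and every function name are hypothetical, uintptr_t
stands in for the unsigned long mentioned above, and plain int counters
replace atomics. It relies on toy_dst being at least 2-byte aligned so
that bit 0 of its address is always free.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct toy_dst {
	int refcnt;
};

/* Hypothetical "refcount pending" flag stored in bit 0 of the pointer. */
#define TOY_DST_NOREF ((uintptr_t)1)

struct toy_skb {
	uintptr_t dst_enc;	/* encoded pointer: address | flag */
};

/* Accessor: every reader goes through this and never sees the flag. */
static struct toy_dst *toy_skb_dst(const struct toy_skb *skb)
{
	return (struct toy_dst *)(skb->dst_enc & ~TOY_DST_NOREF);
}

/* Input path: attach the dst without touching its refcount. */
static void toy_skb_dst_set_noref(struct toy_skb *skb, struct toy_dst *dst)
{
	skb->dst_enc = (uintptr_t)dst | TOY_DST_NOREF;
}

/* "Escape" annotation: perform the deferred refcount grab, once. */
static void toy_skb_dst_escape(struct toy_skb *skb)
{
	if (skb->dst_enc & TOY_DST_NOREF) {
		toy_skb_dst(skb)->refcnt++;
		skb->dst_enc &= ~TOY_DST_NOREF;
	}
}

/* Free path: release only if a reference was actually taken. */
static void toy_skb_dst_drop(struct toy_skb *skb)
{
	struct toy_dst *dst = toy_skb_dst(skb);

	if (dst && !(skb->dst_enc & TOY_DST_NOREF))
		dst->refcnt--;
	skb->dst_enc = 0;
}
```

A packet that is freed without ever escaping never writes to the dst; one
that escapes pays for exactly one increment and one decrement.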


* [PATCH] net: release skb->dst in sock_queue_rcv_skb()
  2008-11-25  5:00         ` David Miller
@ 2008-11-26  0:00           ` Eric Dumazet
  2008-11-26  0:23             ` David Miller
                               ` (2 more replies)
  0 siblings, 3 replies; 22+ messages in thread
From: Eric Dumazet @ 2008-11-26  0:00 UTC (permalink / raw)
  To: David Miller; +Cc: andi, netdev

[-- Attachment #1: Type: text/plain, Size: 2867 bytes --]

David Miller wrote:
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Tue, 25 Nov 2008 05:43:32 +0100
> 
>> Very interesting. So we could try the following path :
>>
>> 1) First try to release dst when queueing skb to various queues
>> (UDP, TCP, ...) while its hot. Reader wont have to release it
>> while its cold.
>>
>> 2) Check if we can handle the input path without any refcount
>>    dirtying ?
>>
>> To make the transition easy, we could use a bit on skb to mark
>> dst being not refcounted (ie no dst_release() should be done on it)
> 
> It is possible to make this self-auditing.  For example, by
> using the usual trick where we encode a pointer in an
> unsigned long and use the low bits for states.
> 
> In the first step, make each skb->dst access go through some
> accessor inline function.
> 
> Next, audit the paths where skb->dst's can "escape" the pure
> packet input path.  Add annotations, in the form of a
> inline function call, for these locations.
> 
> Also, audit the other locations where we enqueue into a socket
> queue and no longer care about the skb->dst, and annotate
> those with another inline function.
> 
> Finally, the initial skb->dst assignment in the input path doesn't
> grab a reference, but sets the low bit ("refcount pending") in
> the encoded skb->dst pointer.  The skb->dst "escape" inline
> function performs the deferred refcount grab.  And kfree_skb()
> is taught to not dst_release() on skb->dst's which have the
> low bit set.
> 
> Anyways, something like that.

I looked at this and found it would be difficult not to grab a
reference (and, more importantly, not to write to the dst) in the input path.

ip_rcv_finish() calls ip_route_input()
and ip_route_input() calls dst_use(&rth->u.dst, jiffies);

static inline void dst_use(struct dst_entry *dst, unsigned long time)
{
        dst_hold(dst);
        dst->__use++;
        dst->lastuse = time;
}

Even if we avoid the refcount increment, I guess we need the lastuse
assignment in order to keep the dst in the cache. Not sure about the role
of the __use field. Hum... for a tcp connection, the dst refcount should
already be pinned by sk->sk_dst_cache. Maybe test the refcount value, and
if this value is > 1, don't take a reference (given rcu_read_lock() is
done before calling ip_rcv_finish()).

In the meantime, what do you think of the following patch?

[PATCH] net: release skb->dst in sock_queue_rcv_skb()

When queuing an skb to sk->sk_receive_queue, we can release its dst, which
is no longer needed.
Since the current cpu did the dst_hold(), the refcount is probably still
hot in this cpu's caches.

This saves readers from having to access the original dst to decrement
its refcount, possibly a long time after packet reception. This should
speed up the UDP and RAW receive paths.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

[-- Attachment #2: sock_queue_rcv_skb.patch --]
[-- Type: text/plain, Size: 534 bytes --]

diff --git a/net/core/sock.c b/net/core/sock.c
index a4e840e..b287645 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -289,7 +289,11 @@ int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
 
 	skb->dev = NULL;
 	skb_set_owner_r(skb, sk);
-
+	/*
+	 * release dst right now while its hot
+	 */
+	dst_release(skb->dst);
+	skb->dst = NULL;
 	/* Cache the SKB length before we tack it onto the receive
 	 * queue.  Once it is added it no longer belongs to us and
 	 * may be freed by other threads of control pulling packets
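
The effect of the patch on the reference count can be illustrated with a
small userspace model. This is only a sketch under stated assumptions: the
model_* names are hypothetical stand-ins, not kernel API, and C11 atomics
replace the kernel's atomic_t. It shows that once the dst is released at
queue time, the later free path has nothing left to decrement.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical userspace stand-ins for dst_entry and sk_buff. */
struct model_dst {
	atomic_int refcnt;
};

struct model_skb {
	struct model_dst *dst;
};

static void model_dst_hold(struct model_dst *dst)
{
	atomic_fetch_add(&dst->refcnt, 1);
}

/* Like dst_release(), a NULL dst is accepted and ignored. */
static void model_dst_release(struct model_dst *dst)
{
	if (dst)
		atomic_fetch_sub(&dst->refcnt, 1);
}

/* Input path: attach the route to the packet and take a reference. */
static void model_route_input(struct model_skb *skb, struct model_dst *dst)
{
	model_dst_hold(dst);
	skb->dst = dst;
}

/* Queueing with the patch applied: drop the reference immediately,
 * on the cpu that took it.
 */
static void model_queue_rcv_skb(struct model_skb *skb)
{
	model_dst_release(skb->dst);
	skb->dst = NULL;
}

/* Free path: only touches the dst if one is still attached. */
static void model_kfree_skb(struct model_skb *skb)
{
	model_dst_release(skb->dst);
	skb->dst = NULL;
}

static int model_refcnt(struct model_dst *dst)
{
	return atomic_load(&dst->refcnt);
}
```

In this model the reader that eventually frees the skb never writes to the
dst, which is the cache line traffic the patch tries to avoid.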

^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: [PATCH] net: release skb->dst in sock_queue_rcv_skb()
  2008-11-26  0:00           ` [PATCH] net: release skb->dst in sock_queue_rcv_skb() Eric Dumazet
@ 2008-11-26  0:23             ` David Miller
  2008-11-26  2:04             ` David Miller
  2008-12-17 11:25             ` net-next: broken IP_PKTINFO [was Re: [PATCH] net: release skb->dst in sock_queue_rcv_skb()] Mark McLoughlin
  2 siblings, 0 replies; 22+ messages in thread
From: David Miller @ 2008-11-26  0:23 UTC (permalink / raw)
  To: dada1; +Cc: andi, netdev

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Wed, 26 Nov 2008 01:00:30 +0100

> Hum... for a tcp connection, dst refcount should already be pinned
> by a sk->sk_dst_cache. Maybe test refcount value, and if this value
> is > 1, don't take a reference. (given rcu_read_lock() is done before
> calling ip_rcv_finish())

Input route is different from output route.  sk->sk_dst_cache holds
the output route.  So I can't see how this can help here.



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] net: release skb->dst in sock_queue_rcv_skb()
  2008-11-26  0:00           ` [PATCH] net: release skb->dst in sock_queue_rcv_skb() Eric Dumazet
  2008-11-26  0:23             ` David Miller
@ 2008-11-26  2:04             ` David Miller
  2008-11-26  7:39               ` Eric Dumazet
  2008-12-17 11:25             ` net-next: broken IP_PKTINFO [was Re: [PATCH] net: release skb->dst in sock_queue_rcv_skb()] Mark McLoughlin
  2 siblings, 1 reply; 22+ messages in thread
From: David Miller @ 2008-11-26  2:04 UTC (permalink / raw)
  To: dada1; +Cc: andi, netdev

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Wed, 26 Nov 2008 01:00:30 +0100

> In the meantime, what do you think of the following patch ?
> 
> [PATCH] net: release skb->dst in sock_queue_rcv_skb()
> 
> When queuing a skb to sk->sk_receive_queue, we can release its dst, which
> is no longer needed.
> Since the current cpu did the dst_hold(), the refcount is probably still hot
> in this cpu's caches.
> 
> This saves readers from having to access the original dst to decrement its
> refcount, possibly a long time after packet reception. This should speed up
> the UDP and RAW receive paths.
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>

I guess the idea is that if we release quickly we'll not have
to reget the cacheline in owned state.

I wonder if this might actually slightly hurt loads like tbench where
we are banging on the refcnt constantly on every cpu anyways.

Can you do a quick check?

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] net: release skb->dst in sock_queue_rcv_skb()
  2008-11-26  2:04             ` David Miller
@ 2008-11-26  7:39               ` Eric Dumazet
  2008-11-26  9:08                 ` David Miller
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2008-11-26  7:39 UTC (permalink / raw)
  To: David Miller; +Cc: andi, netdev

David Miller a écrit :
> From: Eric Dumazet <dada1@cosmosbay.com>
> Date: Wed, 26 Nov 2008 01:00:30 +0100
> 
>> In the meantime, what do you think of the following patch ?
>>
>> [PATCH] net: release skb->dst in sock_queue_rcv_skb()
>>
>> When queuing a skb to sk->sk_receive_queue, we can release its dst, which
>> is no longer needed.
>> Since the current cpu did the dst_hold(), the refcount is probably still hot
>> in this cpu's caches.
>>
>> This saves readers from having to access the original dst to decrement its
>> refcount, possibly a long time after packet reception. This should speed up
>> the UDP and RAW receive paths.
>>
>> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> 
> I guess the idea is that if we release quickly we'll not have
> to reget the cacheline in owned state.

Yes, this is the idea.

> 
> I wonder if this might actually slightly hurt loads like tbench where
> we are banging on the refcnt constantly on every cpu anyways.

Yes. Moving the increments/decrements around is not what really reduces the
load in this case: it could help on some machines and hurt on others.
In the long term, only reducing the number of cache line dirtyings can help
the average.

> 
> Can you do a quick check?
> 
> 

Sure I can do a check :)

No impact at all on tbench, unless this bench can run in UDP mode :)


CPU: Core 2, speed 3000.1 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit
mask of 0x00 (Unhalted core cycles) count 100000
samples  cum. samples  %        cum. %     symbol name
599983   599983        11.2948  11.2948    copy_from_user
549349   1149332       10.3416  21.6364    ipt_do_table
245697   1395029        4.6253  26.2617    copy_to_user
221663   1616692        4.1729  30.4346    schedule
144888   1761580        2.7275  33.1621    tcp_sendmsg
136967   1898547        2.5784  35.7406    tcp_ack
115513   2014060        2.1746  37.9151    tcp_transmit_skb
99517    2113577        1.8734  39.7885    sysenter_past_esp
99438    2213015        1.8719  41.6605    ip_queue_xmit
98171    2311186        1.8481  43.5086    tcp_recvmsg
80763    2391949        1.5204  45.0290    __switch_to
79621    2471570        1.4989  46.5278    tcp_v4_rcv
77210    2548780        1.4535  47.9813    dst_release
69387    2618167        1.3062  49.2876    tcp_rcv_established
64709    2682876        1.2182  50.5057    __tcp_push_pending_frames
55223    2738099        1.0396  51.5453    lock_sock_nested
53754    2791853        1.0119  52.5572    sys_socketcall
50092    2841945        0.9430  53.5002    netif_receive_skb
49499    2891444        0.9318  54.4321    release_sock
47796    2939240        0.8998  55.3318    __inet_lookup_established
45162    2984402        0.8502  56.1820    update_curr
44895    3029297        0.8452  57.0272    ip_rcv
42945    3072242        0.8084  57.8356    dev_queue_xmit
42892    3115134        0.8075  58.6431    tcp_event_data_recv
42768    3157902        0.8051  59.4482    local_bh_enable
41555    3199457        0.7823  60.2305    netif_rx
38613    3238070        0.7269  60.9574    __alloc_skb
38016    3276086        0.7157  61.6730    ip_finish_output
36867    3312953        0.6940  62.3671    tcp_current_mss
36759    3349712        0.6920  63.0591    skb_release_data
35560    3385272        0.6694  63.7285    local_bh_enable_ip
34100    3419372        0.6419  64.3704    sock_recvmsg
33829    3453201        0.6368  65.0073    __kfree_skb
32949    3486150        0.6203  65.6275    sched_clock_cpu


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH] net: release skb->dst in sock_queue_rcv_skb()
  2008-11-26  7:39               ` Eric Dumazet
@ 2008-11-26  9:08                 ` David Miller
  0 siblings, 0 replies; 22+ messages in thread
From: David Miller @ 2008-11-26  9:08 UTC (permalink / raw)
  To: dada1; +Cc: andi, netdev

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Wed, 26 Nov 2008 08:39:30 +0100

> David Miller a écrit :
> > From: Eric Dumazet <dada1@cosmosbay.com>
> > Date: Wed, 26 Nov 2008 01:00:30 +0100
> > 
> > Can you do a quick check?
> > 
> 
> Sure I can do a check :)
> 
> No impact at all on tbench, unless this bench can run in UDP mode :)

Durrr, of course!

Patch applied, thanks Eric.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* net-next: broken IP_PKTINFO [was Re: [PATCH] net: release skb->dst in sock_queue_rcv_skb()]
  2008-11-26  0:00           ` [PATCH] net: release skb->dst in sock_queue_rcv_skb() Eric Dumazet
  2008-11-26  0:23             ` David Miller
  2008-11-26  2:04             ` David Miller
@ 2008-12-17 11:25             ` Mark McLoughlin
  2008-12-18  3:34               ` net-next: broken IP_PKTINFO David Miller
  2 siblings, 1 reply; 22+ messages in thread
From: Mark McLoughlin @ 2008-12-17 11:25 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, andi, netdev

Hi,

On Wed, 2008-11-26 at 01:00 +0100, Eric Dumazet wrote:

> [PATCH] net: release skb->dst in sock_queue_rcv_skb()
> 
> When queuing a skb to sk->sk_receive_queue, we can release its dst, which
> is no longer needed.
> Since the current cpu did the dst_hold(), the refcount is probably still hot
> in this cpu's caches.
> 
> This saves readers from having to access the original dst to decrement its
> refcount, possibly a long time after packet reception. This should speed up
> the UDP and RAW receive paths.
> 
> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
> plain text document attachment (sock_queue_rcv_skb.patch)
> diff --git a/net/core/sock.c b/net/core/sock.c
> index a4e840e..b287645 100644
> --- a/net/core/sock.c
> +++ b/net/core/sock.c
> @@ -289,7 +289,11 @@ int sock_queue_rcv_skb(struct sock *sk, struct sk_buff *skb)
>  
>  	skb->dev = NULL;
>  	skb_set_owner_r(skb, sk);
> -
> +	/*
> +	 * release dst right now while its hot
> +	 */
> +	dst_release(skb->dst);
> +	skb->dst = NULL;

IP_PKTINFO cmsg data is one post-queueing user:

  static void ip_cmsg_recv_pktinfo(struct msghdr *msg, struct sk_buff *skb)
  {
          struct in_pktinfo info;
          struct rtable *rt = skb->rtable;

          info.ipi_addr.s_addr = ip_hdr(skb)->daddr;
          if (rt) {
                  info.ipi_ifindex = rt->rt_iif;
                  info.ipi_spec_dst.s_addr = rt->rt_spec_dst;
          } else {
                  info.ipi_ifindex = 0;
                  info.ipi_spec_dst.s_addr = 0;
          }

          put_cmsg(msg, SOL_IP, IP_PKTINFO, sizeof(info), &info);
  }

(i.e. skb->rtable is NULL at this point)
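
The failure mode can be reproduced in a small userspace model. This is a
sketch only: the pi_* types and functions are hypothetical stand-ins that
mirror the shape of ip_cmsg_recv_pktinfo() above, with plain integers in
place of the kernel's address types. Once the route has been dropped at
queue time, the interface index and specific destination both read back as
zero, while the destination address taken from the IP header survives.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for rtable, sk_buff and in_pktinfo. */
struct pi_rtable {
	int rt_iif;
	uint32_t rt_spec_dst;
};

struct pi_skb {
	struct pi_rtable *rtable;
	uint32_t daddr;		/* models ip_hdr(skb)->daddr */
};

struct pi_info {
	int ipi_ifindex;
	uint32_t ipi_spec_dst;
	uint32_t ipi_addr;
};

/* Same branch structure as ip_cmsg_recv_pktinfo(): the routing
 * attributes are only available while a route is attached.
 */
static struct pi_info pi_recv_pktinfo(const struct pi_skb *skb)
{
	struct pi_info info;

	info.ipi_addr = skb->daddr;
	if (skb->rtable) {
		info.ipi_ifindex = skb->rtable->rt_iif;
		info.ipi_spec_dst = skb->rtable->rt_spec_dst;
	} else {
		info.ipi_ifindex = 0;
		info.ipi_spec_dst = 0;
	}
	return info;
}

/* Models the queueing step with the patch applied: the route
 * pointer is cleared before the receiver looks at the packet.
 */
static void pi_queue_with_early_release(struct pi_skb *skb)
{
	skb->rtable = NULL;
}
```

A receiver that relies on ipi_ifindex to tell interfaces apart would see
index 0 for every packet in this model.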

I'm seeing dnsmasq not working on net-next-2.6 because of this and
reverting commit 7035560 makes things work as expected again.

Cheers,
Mark.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: net-next: broken IP_PKTINFO
  2008-12-17 11:25             ` net-next: broken IP_PKTINFO [was Re: [PATCH] net: release skb->dst in sock_queue_rcv_skb()] Mark McLoughlin
@ 2008-12-18  3:34               ` David Miller
  2008-12-18  5:59                 ` Eric Dumazet
  0 siblings, 1 reply; 22+ messages in thread
From: David Miller @ 2008-12-18  3:34 UTC (permalink / raw)
  To: markmc; +Cc: dada1, andi, netdev

From: Mark McLoughlin <markmc@redhat.com>
Date: Wed, 17 Dec 2008 11:25:01 +0000

> On Wed, 2008-11-26 at 01:00 +0100, Eric Dumazet wrote:
> 
> > [PATCH] net: release skb->dst in sock_queue_rcv_skb()
 ...
> IP_PKTINFO cmsg data is one post-queueing user:

Eric, we'll need to revert this change I think.

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: net-next: broken IP_PKTINFO
  2008-12-18  3:34               ` net-next: broken IP_PKTINFO David Miller
@ 2008-12-18  5:59                 ` Eric Dumazet
  2008-12-18  6:17                   ` David Miller
  0 siblings, 1 reply; 22+ messages in thread
From: Eric Dumazet @ 2008-12-18  5:59 UTC (permalink / raw)
  To: David Miller; +Cc: markmc, andi, netdev

David Miller a écrit :
> From: Mark McLoughlin <markmc@redhat.com>
> Date: Wed, 17 Dec 2008 11:25:01 +0000
> 
>> On Wed, 2008-11-26 at 01:00 +0100, Eric Dumazet wrote:
>>
>>> [PATCH] net: release skb->dst in sock_queue_rcv_skb()
>  ...
>> IP_PKTINFO cmsg data is one post-queueing user:
> 
> Eric, we'll need to revert this change I think.

I am afraid we have to revert it, yes.

META_COLLECTOR(int_rtiif) & META_COLLECTOR(int_rtclassid) 
in net/sched/em_meta.c also need rtable, I am not sure how it is used.

About ip_cmsg_recv_pktinfo() :

iif can be found in skb->iif instead of rt->rt_iif, but I am not sure
about rt_spec_dst : Shouldn't we find it in ip_hdr(skb)->saddr ?
 
Do you know if we really need rtable in ip_cmsg_recv_pktinfo() ?

diff --git a/net/ipv4/ip_sockglue.c b/net/ipv4/ip_sockglue.c
index 43c0585..e854893 100644
--- a/net/ipv4/ip_sockglue.c
+++ b/net/ipv4/ip_sockglue.c
@@ -64,8 +64,8 @@ static void ip_cmsg_recv_pktinfo(struct msghdr *msg, struct sk_buff *skb)
 		info.ipi_ifindex = rt->rt_iif;
 		info.ipi_spec_dst.s_addr = rt->rt_spec_dst;
 	} else {
-		info.ipi_ifindex = 0;
-		info.ipi_spec_dst.s_addr = 0;
+		info.ipi_ifindex = skb->iif;
+		info.ipi_spec_dst.s_addr = ip_hdr(skb)->saddr;
 	}
 
 	put_cmsg(msg, SOL_IP, IP_PKTINFO, sizeof(info), &info);


^ permalink raw reply related	[flat|nested] 22+ messages in thread

* Re: net-next: broken IP_PKTINFO
  2008-12-18  5:59                 ` Eric Dumazet
@ 2008-12-18  6:17                   ` David Miller
  0 siblings, 0 replies; 22+ messages in thread
From: David Miller @ 2008-12-18  6:17 UTC (permalink / raw)
  To: dada1; +Cc: markmc, andi, netdev

From: Eric Dumazet <dada1@cosmosbay.com>
Date: Thu, 18 Dec 2008 06:59:26 +0100

> David Miller a écrit :
> > Eric, we'll need to revert this change I think.
> 
> I am afraid we have to revert it, yes.

Done.

> About ip_cmsg_recv_pktinfo() :
> 
> iif can be found in skb->iif instead of rt->rt_iif, but I am not sure
> about rt_spec_dst : Shouldn't we find it in ip_hdr(skb)->saddr ?
>  
> Do you know if we really need rtable in ip_cmsg_recv_pktinfo() ?

I think we might, as these are routing attributes, which do not
necessarily match up with the values found in the SKB.

In fact, I do remember that "specific destination" has a very
exact definition in the routing RFCs and it has to do with the
matched route.

Conversely, in what situations (other than with the reverted patch
applied, hehe) would the route not be attached here?

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2008-12-18  6:17 UTC | newest]

Thread overview: 22+ messages
2008-11-24  8:57 [RFC] Could we avoid touching dst->refcount in some cases ? Eric Dumazet
2008-11-24  9:42 ` Andi Kleen
2008-11-24 10:14   ` Eric Dumazet
2008-11-24 11:24     ` [PATCH] net: avoid a pair of dst_hold()/dst_release() in ip_append_data() Eric Dumazet
2008-11-24 13:59       ` [PATCH] net: avoid a pair of dst_hold()/dst_release() in ip_push_pending_frames() Eric Dumazet
2008-11-25  0:07         ` David Miller
2008-11-24 23:55       ` [PATCH] net: avoid a pair of dst_hold()/dst_release() in ip_append_data() David Miller
2008-11-25  2:22         ` Andi Kleen
2008-11-24 11:27     ` [RFC] Could we avoid touching dst->refcount in some cases ? Andi Kleen
2008-11-24 23:36       ` David Miller
2008-11-24 23:39     ` David Miller
2008-11-25  4:43       ` Eric Dumazet
2008-11-25  5:00         ` David Miller
2008-11-26  0:00           ` [PATCH] net: release skb->dst in sock_queue_rcv_skb() Eric Dumazet
2008-11-26  0:23             ` David Miller
2008-11-26  2:04             ` David Miller
2008-11-26  7:39               ` Eric Dumazet
2008-11-26  9:08                 ` David Miller
2008-12-17 11:25             ` net-next: broken IP_PKTINFO [was Re: [PATCH] net: release skb->dst in sock_queue_rcv_skb()] Mark McLoughlin
2008-12-18  3:34               ` net-next: broken IP_PKTINFO David Miller
2008-12-18  5:59                 ` Eric Dumazet
2008-12-18  6:17                   ` David Miller
