[PATCH] net: avoid a pair of dst_hold()/dst_release() in ip_append_data()

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Eric Dumazet <dada1@cosmosbay.com>
To: "David S. Miller" <davem@davemloft.net>
Cc: Andi Kleen <andi@firstfloor.org>,
	Linux Netdev List <netdev@vger.kernel.org>,
	Corey Minyard <minyard@acm.org>,
	Christian Bell <christian@myri.com>
Subject: [PATCH] net: avoid a pair of dst_hold()/dst_release() in ip_append_data()
Date: Mon, 24 Nov 2008 12:24:53 +0100	[thread overview]
Message-ID: <492A8F05.3080509@cosmosbay.com> (raw)
In-Reply-To: <492A7E85.3060502@cosmosbay.com>

[-- Attachment #1: Type: text/plain, Size: 3199 bytes --]

Eric Dumazet a écrit :
> Andi Kleen a écrit :
>> Eric Dumazet <dada1@cosmosbay.com> writes:
>>
>>> tbench has hard time incrementing decrementing the route cache refcount
>>> shared by all communications on localhost.
>>
>> iirc there was a patch some time ago to use per CPU loopback devices 
>> to avoid this, but it was considered too much a benchmark hack.
>> As core counts increase it might stop being that though.
> 
> Well, you probably mention Stephen patch to avoid dirtying other contended
> cache lines (one napi structure per cpu)
> 
> Having multiple loopback dev would really be a hack I agree.
> 
>>
>>> On real world, we also have this problem on RTP servers sending many UDP
>>> frames to mediagateways, especially big ones handling thousand of 
>>> streams.
>>>
>>> Given that route entries are using RCU, we probably can avoid 
>>> incrementing
>>> their refcount in case of connected sockets ?
>>
>> Normally they can be hold over sleeps or queuing of skbs too, and RCU
>> doesn't handle that. To make it handle that you would need to define a
>> custom RCU period designed for this case, but this would be probably
>> tricky and fragile: especially I'm not sure even if you had a "any
>> packet queued" RCU method it be guaranteed to always finish because 
>> there is no fixed upper livetime of a packet.
>>
>> The other issue is that on preemptible kernels you would need to 
>> disable preemption all the time such a routing entry is hold, which
>> could be potentially quite long.
>>
> 
> Well, in case of UDP, we call ip_push_pending_frames() and this one
> does the increment of refcount (again). I was not considering
> avoiding the refcount hold we do when queing a skb in transmit
> queue, only during a short period of time. Oh well, ip_append_data()
> might sleep, so this cannot work...
> 
> I agree avoiding one refcount increment/decrement is probably
> not a huge gain, considering we *have* to do the increment,
> but when many cpus are using UDP send/receive in //, this might
> show a gain somehow.
> 
> So maybe we could make ip_append_data() (or its callers) a
> litle bit smarter, avoiding increment/decrement if possible.

Here is a patch to remove one dst_hold()/dst_release() pair
in UDP/RAW transmit path.

[PATCH] net: avoid a pair of dst_hold()/dst_release() in ip_append_data()

We can reduce pressure on dst entry refcount that slowdown UDP transmit
path on SMP machines. This pressure is visible on RTP servers when
delivering content to mediagateways, especially big ones, handling
thousand of streams. Several cpus send UDP frames to the same
destination, hence use the same dst entry.

This patch makes ip_append_data() eventually steal the refcount its
callers had to take on the dst entry.

This doesnt avoid all refcounting, but still gives speedups on SMP,
on UDP/RAW transmit path

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/net/ip.h     |    2 +-
 net/ipv4/icmp.c      |    8 ++++----
 net/ipv4/ip_output.c |   11 ++++++++---
 net/ipv4/raw.c       |    2 +-
 net/ipv4/udp.c       |    2 +-
 5 files changed, 15 insertions(+), 10 deletions(-)

[-- Attachment #2: ip_append_data.patch --]
[-- Type: text/plain, Size: 4256 bytes --]

diff --git a/include/net/ip.h b/include/net/ip.h
index bc026ec..ddef10c 100644
--- a/include/net/ip.h
+++ b/include/net/ip.h
@@ -110,7 +110,7 @@ extern int		ip_append_data(struct sock *sk,
 						   int odd, struct sk_buff *skb),
 				void *from, int len, int protolen,
 				struct ipcm_cookie *ipc,
-				struct rtable *rt,
+				struct rtable **rt,
 				unsigned int flags);
 extern int		ip_generic_getfrag(void *from, char *to, int offset, int len, int odd, struct sk_buff *skb);
 extern ssize_t		ip_append_page(struct sock *sk, struct page *page,
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index 21e497e..7b88be9 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -321,12 +321,12 @@ static int icmp_glue_bits(void *from, char *to, int offset, int len, int odd,
 }
 
 static void icmp_push_reply(struct icmp_bxm *icmp_param,
-			    struct ipcm_cookie *ipc, struct rtable *rt)
+			    struct ipcm_cookie *ipc, struct rtable **rt)
 {
 	struct sock *sk;
 	struct sk_buff *skb;
 
-	sk = icmp_sk(dev_net(rt->u.dst.dev));
+	sk = icmp_sk(dev_net((*rt)->u.dst.dev));
 	if (ip_append_data(sk, icmp_glue_bits, icmp_param,
 			   icmp_param->data_len+icmp_param->head_len,
 			   icmp_param->head_len,
@@ -392,7 +392,7 @@ static void icmp_reply(struct icmp_bxm *icmp_param, struct sk_buff *skb)
 	}
 	if (icmpv4_xrlim_allow(net, rt, icmp_param->data.icmph.type,
 			       icmp_param->data.icmph.code))
-		icmp_push_reply(icmp_param, &ipc, rt);
+		icmp_push_reply(icmp_param, &ipc, &rt);
 	ip_rt_put(rt);
 out_unlock:
 	icmp_xmit_unlock(sk);
@@ -635,7 +635,7 @@ route_done:
 		icmp_param.data_len = room;
 	icmp_param.head_len = sizeof(struct icmphdr);
 
-	icmp_push_reply(&icmp_param, &ipc, rt);
+	icmp_push_reply(&icmp_param, &ipc, &rt);
 ende:
 	ip_rt_put(rt);
 out_unlock:
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 46d7be2..da9b819 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -778,7 +778,7 @@ int ip_append_data(struct sock *sk,
 		   int getfrag(void *from, char *to, int offset, int len,
 			       int odd, struct sk_buff *skb),
 		   void *from, int length, int transhdrlen,
-		   struct ipcm_cookie *ipc, struct rtable *rt,
+		   struct ipcm_cookie *ipc, struct rtable **rtp,
 		   unsigned int flags)
 {
 	struct inet_sock *inet = inet_sk(sk);
@@ -793,6 +793,7 @@ int ip_append_data(struct sock *sk,
 	int offset = 0;
 	unsigned int maxfraglen, fragheaderlen;
 	int csummode = CHECKSUM_NONE;
+	struct rtable *rt;
 
 	if (flags&MSG_PROBE)
 		return 0;
@@ -812,7 +813,11 @@ int ip_append_data(struct sock *sk,
 			inet->cork.flags |= IPCORK_OPT;
 			inet->cork.addr = ipc->addr;
 		}
-		dst_hold(&rt->u.dst);
+		rt = *rtp;
+		/*
+		 * We steal reference to this route, caller should not release it
+		 */
+		*rtp = NULL;
 		inet->cork.fragsize = mtu = inet->pmtudisc == IP_PMTUDISC_PROBE ?
 					    rt->u.dst.dev->mtu :
 					    dst_mtu(rt->u.dst.path);
@@ -1391,7 +1396,7 @@ void ip_send_reply(struct sock *sk, struct sk_buff *skb, struct ip_reply_arg *ar
 	sk->sk_protocol = ip_hdr(skb)->protocol;
 	sk->sk_bound_dev_if = arg->bound_dev_if;
 	ip_append_data(sk, ip_reply_glue_bits, arg->iov->iov_base, len, 0,
-		       &ipc, rt, MSG_DONTWAIT);
+		       &ipc, &rt, MSG_DONTWAIT);
 	if ((skb = skb_peek(&sk->sk_write_queue)) != NULL) {
 		if (arg->csumoffset >= 0)
 			*((__sum16 *)skb_transport_header(skb) +
diff --git a/net/ipv4/raw.c b/net/ipv4/raw.c
index 998fcff..dff8bc4 100644
--- a/net/ipv4/raw.c
+++ b/net/ipv4/raw.c
@@ -572,7 +572,7 @@ back_from_confirm:
 			ipc.addr = rt->rt_dst;
 		lock_sock(sk);
 		err = ip_append_data(sk, ip_generic_getfrag, msg->msg_iov, len, 0,
-					&ipc, rt, msg->msg_flags);
+					&ipc, &rt, msg->msg_flags);
 		if (err)
 			ip_flush_pending_frames(sk);
 		else if (!(msg->msg_flags & MSG_MORE))
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index da869ce..5491144 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -719,7 +719,7 @@ do_append_data:
 	up->len += ulen;
 	getfrag  =  is_udplite ?  udplite_getfrag : ip_generic_getfrag;
 	err = ip_append_data(sk, getfrag, msg->msg_iov, ulen,
-			sizeof(struct udphdr), &ipc, rt,
+			sizeof(struct udphdr), &ipc, &rt,
 			corkreq ? msg->msg_flags|MSG_MORE : msg->msg_flags);
 	if (err)
 		udp_flush_pending_frames(sk);

next prev parent reply	other threads:[~2008-11-24 11:25 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2008-11-24  8:57 [RFC] Could we avoid touching dst->refcount in some cases ? Eric Dumazet
2008-11-24  9:42 ` Andi Kleen
2008-11-24 10:14   ` Eric Dumazet
2008-11-24 11:24     ` Eric Dumazet [this message]
2008-11-24 13:59       ` [PATCH] net: avoid a pair of dst_hold()/dst_release() in ip_push_pending_frames() Eric Dumazet
2008-11-25  0:07         ` David Miller
2008-11-24 23:55       ` [PATCH] net: avoid a pair of dst_hold()/dst_release() in ip_append_data() David Miller
2008-11-25  2:22         ` Andi Kleen
2008-11-24 11:27     ` [RFC] Could we avoid touching dst->refcount in some cases ? Andi Kleen
2008-11-24 23:36       ` David Miller
2008-11-24 23:39     ` David Miller
2008-11-25  4:43       ` Eric Dumazet
2008-11-25  5:00         ` David Miller
2008-11-26  0:00           ` [PATCH] net: release skb->dst in sock_queue_rcv_skb() Eric Dumazet
2008-11-26  0:23             ` David Miller
2008-11-26  2:04             ` David Miller
2008-11-26  7:39               ` Eric Dumazet
2008-11-26  9:08                 ` David Miller
2008-12-17 11:25             ` net-next: broken IP_PKTINFO [was Re: [PATCH] net: release skb->dst in sock_queue_rcv_skb()] Mark McLoughlin
2008-12-18  3:34               ` net-next: broken IP_PKTINFO David Miller
2008-12-18  5:59                 ` Eric Dumazet
2008-12-18  6:17                   ` David Miller

find likely ancestor, descendant, or conflicting patches for this message:
( dfblob:bc026ec dfblob:ddef10c dfblob:21e497e dfblob:7b88be9
dfblob:46d7be2 dfblob:da9b819 dfblob:998fcff dfblob:dff8bc4
dfblob:da869ce dfblob:5491144 )
 OR (
bs:"[PATCH] net: avoid a pair of dst_hold()/dst_release() in ip_append_data()" )
	(help)

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=492A8F05.3080509@cosmosbay.com \
    --to=dada1@cosmosbay.com \
    --cc=andi@firstfloor.org \
    --cc=christian@myri.com \
    --cc=davem@davemloft.net \
    --cc=minyard@acm.org \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).