From: Eric Dumazet
Subject: [PATCH] net: No more expensive sock_hold()/sock_put() on each tx
Date: Thu, 04 Jun 2009 11:18:35 +0200
Message-ID: <4A27916B.7030607@gmail.com>
References: <200906041324.59118.rusty@rustcorp.com.au> <20090603.210054.18839960.davem@davemloft.net> <4A275380.1050601@gmail.com> <20090603.215621.136203134.davem@davemloft.net>
Mime-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Cc: rusty@rustcorp.com.au, netdev@vger.kernel.org
To: David Miller
Return-path: <netdev-owner@vger.kernel.org>
Received: from gw1.cosmosbay.com ([212.99.114.194]:59632 "EHLO gw1.cosmosbay.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751512AbZFDJSp convert rfc822-to-8bit (ORCPT); Thu, 4 Jun 2009 05:18:45 -0400
In-Reply-To: <20090603.215621.136203134.davem@davemloft.net>
Sender: netdev-owner@vger.kernel.org
List-ID:

David Miller a écrit :
> From: Eric Dumazet
> Date: Thu, 04 Jun 2009 06:54:24 +0200
>
>> We also can avoid the sock_put()/sock_hold() pair for each tx packet,
>> and only touch sk_wmem_alloc (with an appropriate atomic_sub_return() in
>> sock_wfree() and atomic_dec_and_test() in sk_free()).
>>
>> We could initialize sk->sk_wmem_alloc to one instead of 0, so that
>> sock_wfree() could just synchronize itself with sk_free().
>
> Excellent idea Eric. Thanks!
>
>> Patch will follow after some testing
>
> I look forward to it :-)

Here it is, based on net-next-2.6.

CC list trimmed down...

Next step would be to get rid of sk_callback_lock. I don't remember
if we already discussed that rwlock, after RCUification of sockets...

Thanks

[PATCH] net: No more expensive sock_hold()/sock_put() on each tx

One of the problems with sock memory accounting is that it uses a pair
of sock_hold()/sock_put() calls for each transmitted packet.
This slows down bidirectional flows because the receive path also needs
to take a refcount on the socket, and might run on a different cpu than
the transmit path or the transmit completion path, so these two atomic
operations also trigger cache line bounces. We can see this in tx or
tx/rx workloads (media gateways for example), where sock_wfree() can be
among the top five functions in profiles.

We use this sock_hold()/sock_put() pair so that sock freeing is delayed
until all tx packets are completed.

As we also update sk_wmem_alloc, we can instead offset sk_wmem_alloc by
one unit at init time, until sk_free() is called. Once sk_free() is
called, we atomic_dec_and_test(sk_wmem_alloc) to drop the initial offset
and atomically check whether any packets are still in flight.

skb_set_owner_w() doesn't call sock_hold() anymore.

sock_wfree() doesn't call sock_put() anymore, but checks whether
sk_wmem_alloc reached 0 to perform the final freeing.

Drawback is that a skb->truesize error could lead to unfreeable sockets,
or even worse, to prematurely calling __sk_free() on a live socket.

Nice speedups on SMP. tbench, for example, goes from 2691 MB/s to
2711 MB/s on my 8 cpu dev machine, even though tbench was not really
hitting the sk_refcnt contention point. 5% speedup on a UDP transmit
workload (depends on number of flows), lowering TX completion cpu usage.
Signed-off-by: Eric Dumazet
---
 include/net/sock.h    |    6 +++++-
 net/core/sock.c       |   29 +++++++++++++++++++++++++----
 net/ipv4/ip_output.c  |    1 -
 net/ipv6/ip6_output.c |    1 -
 4 files changed, 30 insertions(+), 7 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 4bb1ff9..010e14a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1217,9 +1217,13 @@ static inline int skb_copy_to_page(struct sock *sk, char __user *from,
 
 static inline void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
 {
-	sock_hold(sk);
 	skb->sk = sk;
 	skb->destructor = sock_wfree;
+	/*
+	 * We used to take a refcount on sk, but the following operation
+	 * is enough to guarantee sk_free() won't free this sock until
+	 * all in-flight packets are completed
+	 */
 	atomic_add(skb->truesize, &sk->sk_wmem_alloc);
 }
 
diff --git a/net/core/sock.c b/net/core/sock.c
index 58dec9d..ce0159a 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1005,7 +1005,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
 }
 EXPORT_SYMBOL(sk_alloc);
 
-void sk_free(struct sock *sk)
+static void __sk_free(struct sock *sk)
 {
 	struct sk_filter *filter;
 
@@ -1028,6 +1028,17 @@ void sk_free(struct sock *sk)
 	put_net(sock_net(sk));
 	sk_prot_free(sk->sk_prot_creator, sk);
 }
+
+void sk_free(struct sock *sk)
+{
+	/*
+	 * We subtract one from sk_wmem_alloc and can know if
+	 * some packets are still in some tx queue.
+	 * If not zero, sock_wfree() will call __sk_free(sk) later
+	 */
+	if (atomic_dec_and_test(&sk->sk_wmem_alloc))
+		__sk_free(sk);
+}
 EXPORT_SYMBOL(sk_free);
 
 /*
@@ -1068,7 +1079,10 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
 	newsk->sk_backlog.head	= newsk->sk_backlog.tail = NULL;
 
 	atomic_set(&newsk->sk_rmem_alloc, 0);
-	atomic_set(&newsk->sk_wmem_alloc, 0);
+	/*
+	 * sk_wmem_alloc set to one (see sk_free() and sock_wfree())
+	 */
+	atomic_set(&newsk->sk_wmem_alloc, 1);
 	atomic_set(&newsk->sk_omem_alloc, 0);
 	skb_queue_head_init(&newsk->sk_receive_queue);
 	skb_queue_head_init(&newsk->sk_write_queue);
@@ -1172,12 +1186,18 @@ void __init sk_init(void)
 void sock_wfree(struct sk_buff *skb)
 {
 	struct sock *sk = skb->sk;
+	int res;
 
 	/* In case it might be waiting for more memory. */
-	atomic_sub(skb->truesize, &sk->sk_wmem_alloc);
+	res = atomic_sub_return(skb->truesize, &sk->sk_wmem_alloc);
 	if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE))
 		sk->sk_write_space(sk);
+	/*
+	 * if sk_wmem_alloc reached 0, we are the last user and should
+	 * free this sock, as sk_free() could not do it.
+	 */
+	if (res == 0)
+		__sk_free(sk);
 }
 EXPORT_SYMBOL(sock_wfree);
 
@@ -1816,6 +1836,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
 	sk->sk_stamp = ktime_set(-1L, 0);
 
 	atomic_set(&sk->sk_refcnt, 1);
+	atomic_set(&sk->sk_wmem_alloc, 1);
 	atomic_set(&sk->sk_drops, 0);
 }
 EXPORT_SYMBOL(sock_init_data);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 3d6167f..badbfde 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -498,7 +498,6 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 
 		BUG_ON(frag->sk);
 		if (skb->sk) {
-			sock_hold(skb->sk);
 			frag->sk = skb->sk;
 			frag->destructor = sock_wfree;
 			truesizes += frag->truesize;
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index c8dc8e5..18b9630 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -680,7 +680,6 @@ static int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 
 		BUG_ON(frag->sk);
 		if (skb->sk) {
-			sock_hold(skb->sk);
 			frag->sk = skb->sk;
 			frag->destructor = sock_wfree;
 			truesizes += frag->truesize;