From: Eric Dumazet <eric.dumazet@gmail.com>
To: David Miller <davem@davemloft.net>
Cc: rusty@rustcorp.com.au, netdev@vger.kernel.org
Subject: [PATCH] net: No more expensive sock_hold()/sock_put() on each tx
Date: Thu, 04 Jun 2009 11:18:35 +0200 [thread overview]
Message-ID: <4A27916B.7030607@gmail.com> (raw)
In-Reply-To: <20090603.215621.136203134.davem@davemloft.net>
David Miller a écrit :
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Thu, 04 Jun 2009 06:54:24 +0200
>
>> We also can avoid the sock_put()/sock_hold() pair for each tx packet,
>> to only touch sk_wmem_alloc (with appropriate atomic_sub_return() in sock_wfree()
>> and atomic_dec_test in sk_free
>>
>> We could initialize sk->sk_wmem_alloc to one instead of 0, so that
>> sock_wfree() could just synchronize itself with sk_free()
>
> Excellent idea Eric.
Thanks !
>
>> Patch will follow after some testing
>
> I look forward to it :-)
Here it is, based on net-next-2.6
CC list trimmed down...
Next step would be to get rid of sk_callback_lock, I dont remember
if we already discussed about that rwlock, after RCUification of sockets...
Thanks
[PATCH] net: No more expensive sock_hold()/sock_put() on each tx
One of the problem with sock memory accounting is it uses
a pair of sock_hold()/sock_put() for each transmitted packet.
This slows down bidirectional flows because the receive path
also needs to take a refcount on socket and might use a different
cpu than transmit path or transmit completion path. So these
two atomic operations also trigger cache line bounces.
We can see this in tx or tx/rx workloads (media gateways for example),
where sock_wfree() can be in top five functions in profiles.
We use this sock_hold()/sock_put() so that sock freeing
is delayed until all tx packets are completed.
As we also update sk_wmem_alloc, we could offset sk_wmem_alloc
by one unit at init time, until sk_free() is called.
Once sk_free() is called, we atomic_dec_and_test(sk_wmem_alloc)
to decrement initial offset and atomicaly check if any packets
are in flight.
skb_set_owner_w() doesnt call sock_hold() anymore
sock_wfree() doesnt call sock_put() anymore, but check if sk_wmem_alloc
reached 0 to perform the final freeing.
Drawback is that a skb->truesize error could lead to unfreeable sockets, or
even worse, prematurely calling __sk_free() on a live socket.
Nice speedups on SMP. tbench for example, going from 2691 MB/s to 2711 MB/s
on my 8 cpu dev machine, even if tbench was not really hitting sk_refcnt
contention point. 5 % speedup on a UDP transmit workload (depends
on number of flows), lowering TX completion cpu usage.
Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
---
include/net/sock.h | 6 +++++-
net/core/sock.c | 29 +++++++++++++++++++++++++----
net/ipv4/ip_output.c | 1 -
net/ipv6/ip6_output.c | 1 -
4 files changed, 30 insertions(+), 7 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 4bb1ff9..010e14a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -1217,9 +1217,13 @@ static inline int skb_copy_to_page(struct sock *sk, char __user *from,
static inline void skb_set_owner_w(struct sk_buff *skb, struct sock *sk)
{
- sock_hold(sk);
skb->sk = sk;
skb->destructor = sock_wfree;
+ /*
+ * We used to take a refcount on sk, but following operation
+ * is enough to guarantee sk_free() wont free this sock until
+ * all in-flight packets are completed
+ */
atomic_add(skb->truesize, &sk->sk_wmem_alloc);
}
diff --git a/net/core/sock.c b/net/core/sock.c
index 58dec9d..ce0159a 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1005,7 +1005,7 @@ struct sock *sk_alloc(struct net *net, int family, gfp_t priority,
}
EXPORT_SYMBOL(sk_alloc);
-void sk_free(struct sock *sk)
+static void __sk_free(struct sock *sk)
{
struct sk_filter *filter;
@@ -1028,6 +1028,17 @@ void sk_free(struct sock *sk)
put_net(sock_net(sk));
sk_prot_free(sk->sk_prot_creator, sk);
}
+
+void sk_free(struct sock *sk)
+{
+ /*
+ * We substract one from sk_wmem_alloc and can know if
+ * some packets are still in some tx queue.
+ * If not null, sock_wfree() will call __sk_free(sk) later
+ */
+ if (atomic_dec_and_test(&sk->sk_wmem_alloc))
+ __sk_free(sk);
+}
EXPORT_SYMBOL(sk_free);
/*
@@ -1068,7 +1079,10 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
newsk->sk_backlog.head = newsk->sk_backlog.tail = NULL;
atomic_set(&newsk->sk_rmem_alloc, 0);
- atomic_set(&newsk->sk_wmem_alloc, 0);
+ /*
+ * sk_wmem_alloc set to one (see sk_free() and sock_wfree())
+ */
+ atomic_set(&newsk->sk_wmem_alloc, 1);
atomic_set(&newsk->sk_omem_alloc, 0);
skb_queue_head_init(&newsk->sk_receive_queue);
skb_queue_head_init(&newsk->sk_write_queue);
@@ -1172,12 +1186,18 @@ void __init sk_init(void)
void sock_wfree(struct sk_buff *skb)
{
struct sock *sk = skb->sk;
+ int res;
/* In case it might be waiting for more memory. */
- atomic_sub(skb->truesize, &sk->sk_wmem_alloc);
+ res = atomic_sub_return(skb->truesize, &sk->sk_wmem_alloc);
if (!sock_flag(sk, SOCK_USE_WRITE_QUEUE))
sk->sk_write_space(sk);
- sock_put(sk);
+ /*
+ * if sk_wmem_alloc reached 0, we are last user and should
+ * free this sock, as sk_free() call could not do it.
+ */
+ if (res == 0)
+ __sk_free(sk);
}
EXPORT_SYMBOL(sock_wfree);
@@ -1816,6 +1836,7 @@ void sock_init_data(struct socket *sock, struct sock *sk)
sk->sk_stamp = ktime_set(-1L, 0);
atomic_set(&sk->sk_refcnt, 1);
+ atomic_set(&sk->sk_wmem_alloc, 1);
atomic_set(&sk->sk_drops, 0);
}
EXPORT_SYMBOL(sock_init_data);
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index 3d6167f..badbfde 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -498,7 +498,6 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
BUG_ON(frag->sk);
if (skb->sk) {
- sock_hold(skb->sk);
frag->sk = skb->sk;
frag->destructor = sock_wfree;
truesizes += frag->truesize;
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index c8dc8e5..18b9630 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -680,7 +680,6 @@ static int ip6_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
BUG_ON(frag->sk);
if (skb->sk) {
- sock_hold(skb->sk);
frag->sk = skb->sk;
frag->destructor = sock_wfree;
truesizes += frag->truesize;
next prev parent reply other threads:[~2009-06-04 9:18 UTC|newest]
Thread overview: 28+ messages / expand[flat|nested] mbox.gz Atom feed top
2009-05-29 14:14 [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit Rusty Russell
2009-05-29 15:11 ` Eric Dumazet
2009-06-01 12:27 ` Rusty Russell
2009-06-03 21:02 ` Eric Dumazet
2009-06-04 3:54 ` Rusty Russell
2009-06-04 4:00 ` David Miller
2009-06-04 4:54 ` Eric Dumazet
2009-06-04 4:56 ` David Miller
2009-06-04 9:18 ` Eric Dumazet [this message]
2009-06-04 9:26 ` [PATCH] net: No more expensive sock_hold()/sock_put() on each tx David Miller
2009-06-10 8:17 ` David Miller
2009-06-10 8:30 ` Eric Dumazet
2009-06-11 9:56 ` David Miller
2009-06-01 19:47 ` [PATCH 1/4] net: skb_orphan on dev_hard_start_xmit Patrick Ohly
2009-06-02 7:25 ` David Miller
2009-06-02 14:08 ` Rusty Russell
2009-06-03 0:14 ` David Miller
2009-07-03 7:55 ` Herbert Xu
2009-07-04 3:02 ` David Miller
2009-07-04 3:08 ` Herbert Xu
2009-07-04 3:13 ` David Miller
2009-07-04 7:42 ` Herbert Xu
2009-07-04 9:09 ` Herbert Xu
2009-07-05 3:26 ` Herbert Xu
2009-07-05 3:34 ` Herbert Xu
2009-08-18 1:47 ` David Miller
2009-08-19 3:19 ` Herbert Xu
2009-08-19 3:34 ` David Miller
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=4A27916B.7030607@gmail.com \
--to=eric.dumazet@gmail.com \
--cc=davem@davemloft.net \
--cc=netdev@vger.kernel.org \
--cc=rusty@rustcorp.com.au \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).