From: Eric Dumazet <dada1@cosmosbay.com>
To: Kenny Chang <kchang@athenacr.com>
Cc: netdev@vger.kernel.org, "David S. Miller" <davem@davemloft.net>,
Christoph Lameter <cl@linux-foundation.org>
Subject: Re: Multicast packet loss
Date: Sat, 28 Feb 2009 09:51:11 +0100 [thread overview]
Message-ID: <49A8FAFF.7060104@cosmosbay.com> (raw)
In-Reply-To: <49A6CE39.5050200@athenacr.com>
Kenny Chang wrote:
> It's been a while since I updated this thread. We've been running
> through the different suggestions and tabulating their effects, as well
> as trying out an Intel card. The short story is that setting affinity
> and MSI works to some extent, and the Intel card doesn't seem to change
> things significantly. The results don't seem consistent enough for us
> to be able to point to a smoking gun.
>
> It does look like the 2.6.29-rc4 kernel performs okay with the Intel
> card, but this is not a real-time build and it's not likely to be in a
> supported Ubuntu distribution any time soon. We've reached the point where
> we'd like to find an expert dedicated to working on this problem for a
> period of time, the final result being a solution that produces a
> realtime configuration with a reasonably "aged" kernel (.24~.28) whose
> multicast performance is greater than or equal to that of 2.6.15.
>
> If anybody is interested in devoting some compensated time to this
> issue, we're offering up a bounty:
> http://www.athenacr.com/bounties/multicast-performance/
>
> For completeness, here's the table of our experiment results:
>
> ==================== ================ ===== ======== ============== ============== ============== ==============
> Kernel               flavor           IRQ   affinity *4x mcasttest* *5x mcasttest* *6x mcasttest* *Mtools2* [4]_
> ==================== ================ ===== ======== ============== ============== ============== ==============
> *Intel e1000e*
> 2.6.24.19            rt                     any      OK             Maybe          X
> 2.6.24.19            rt                     CPU0     OK             OK             X
> 2.6.24.19            generic                any      X
> 2.6.24.19            generic                CPU0     OK
> 2.6.29-rc3           vanilla-server         any      X
> 2.6.29-rc3           vanilla-server         CPU0     OK
> 2.6.29-rc4           vanilla-generic        any      X              OK
> 2.6.29-rc4           vanilla-generic        CPU0     OK             OK             OK [5]_        OK
> *Broadcom BNX2*
> 2.6.24-19            rt               MSI   any      OK             OK             X
> 2.6.24-19            rt               MSI   CPU0     OK             Maybe          X
> 2.6.24-19            rt               APIC  any      OK             OK             X
> 2.6.24-19            rt               APIC  CPU0     OK             Maybe          X
> 2.6.24-19-bnx-latest rt               APIC  CPU0     OK             X
> 2.6.24-19            server           MSI   any      X
> 2.6.24-19            server           MSI   CPU0     OK
> 2.6.24-19            generic          APIC  any      X
> 2.6.24-19            generic          APIC  CPU0     OK
> 2.6.27-11            generic          APIC  any      X
> 2.6.27-11            generic          APIC  CPU0     OK             10% drop
> 2.6.28-8             generic          APIC  any      OK             X
> 2.6.28-8             generic          APIC  CPU0     OK             OK             0.5% drop
> 2.6.29-rc3           vanilla-server   MSI   any      X
> 2.6.29-rc3           vanilla-server   MSI   CPU0     X
> 2.6.29-rc3           vanilla-server   APIC  any      OK             X
> 2.6.29-rc3           vanilla-server   APIC  CPU0     OK             OK
> 2.6.29-rc4           vanilla-generic  APIC  any      X
> 2.6.29-rc4           vanilla-generic  APIC  CPU0     OK             3% drop        10% drop       X
> ==================== ================ ===== ======== ============== ============== ============== ==============
>
> * [4] MTools2 is a test from 29West: http://www.29west.com/docs/TestNet/
> * [5] In 5 trials, 1 trial dropped 2%; the other 4 dropped nothing.
>
> Kenny
>
Hi Kenny

I am investigating how to reduce contention (and schedule() calls) on this workload.

The following patch already gives me fewer packet drops (though not yet *perfect*):
10% packet loss instead of 30%, with 8 receivers on my 8-CPU machine.

David, this is preliminary work, not meant for inclusion as is;
comments are welcome.

Thank you
[PATCH] net: sk_forward_alloc becomes an atomic_t
Commit 95766fff6b9a78d11fc2d3812dd035381690b55d
(UDP: Add memory accounting) introduced a regression for high-rate UDP flows,
because of the extra lock_sock() in udp_recvmsg().

In order to reduce the need for lock_sock() in the UDP receive path, we might need
to declare sk_forward_alloc as an atomic_t.

udp_recvmsg() can then avoid a lock_sock()/release_sock() pair.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
include/net/sock.h | 14 +++++++-------
net/core/sock.c | 31 +++++++++++++++++++------------
net/core/stream.c | 2 +-
net/ipv4/af_inet.c | 2 +-
net/ipv4/inet_diag.c | 2 +-
net/ipv4/tcp_input.c | 2 +-
net/ipv4/udp.c | 2 --
net/ipv6/udp.c | 2 --
net/sched/em_meta.c | 2 +-
9 files changed, 31 insertions(+), 28 deletions(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 4bb1ff9..c4befb9 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -250,7 +250,7 @@ struct sock {
struct sk_buff_head sk_async_wait_queue;
#endif
int sk_wmem_queued;
- int sk_forward_alloc;
+ atomic_t sk_forward_alloc;
gfp_t sk_allocation;
int sk_route_caps;
int sk_gso_type;
@@ -823,7 +823,7 @@ static inline int sk_wmem_schedule(struct sock *sk, int size)
{
if (!sk_has_account(sk))
return 1;
- return size <= sk->sk_forward_alloc ||
+ return size <= atomic_read(&sk->sk_forward_alloc) ||
__sk_mem_schedule(sk, size, SK_MEM_SEND);
}
@@ -831,7 +831,7 @@ static inline int sk_rmem_schedule(struct sock *sk, int size)
{
if (!sk_has_account(sk))
return 1;
- return size <= sk->sk_forward_alloc ||
+ return size <= atomic_read(&sk->sk_forward_alloc) ||
__sk_mem_schedule(sk, size, SK_MEM_RECV);
}
@@ -839,7 +839,7 @@ static inline void sk_mem_reclaim(struct sock *sk)
{
if (!sk_has_account(sk))
return;
- if (sk->sk_forward_alloc >= SK_MEM_QUANTUM)
+ if (atomic_read(&sk->sk_forward_alloc) >= SK_MEM_QUANTUM)
__sk_mem_reclaim(sk);
}
@@ -847,7 +847,7 @@ static inline void sk_mem_reclaim_partial(struct sock *sk)
{
if (!sk_has_account(sk))
return;
- if (sk->sk_forward_alloc > SK_MEM_QUANTUM)
+ if (atomic_read(&sk->sk_forward_alloc) > SK_MEM_QUANTUM)
__sk_mem_reclaim(sk);
}
@@ -855,14 +855,14 @@ static inline void sk_mem_charge(struct sock *sk, int size)
{
if (!sk_has_account(sk))
return;
- sk->sk_forward_alloc -= size;
+ atomic_sub(size, &sk->sk_forward_alloc);
}
static inline void sk_mem_uncharge(struct sock *sk, int size)
{
if (!sk_has_account(sk))
return;
- sk->sk_forward_alloc += size;
+ atomic_add(size, &sk->sk_forward_alloc);
}
static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
diff --git a/net/core/sock.c b/net/core/sock.c
index 0620046..8489105 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1081,7 +1081,7 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
newsk->sk_dst_cache = NULL;
newsk->sk_wmem_queued = 0;
- newsk->sk_forward_alloc = 0;
+ atomic_set(&newsk->sk_forward_alloc, 0);
newsk->sk_send_head = NULL;
newsk->sk_userlocks = sk->sk_userlocks & ~SOCK_BINDPORT_LOCK;
@@ -1479,7 +1479,7 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
int amt = sk_mem_pages(size);
int allocated;
- sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
+ atomic_add(amt * SK_MEM_QUANTUM, &sk->sk_forward_alloc);
allocated = atomic_add_return(amt, prot->memory_allocated);
/* Under limit. */
@@ -1520,7 +1520,7 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
if (prot->sysctl_mem[2] > alloc *
sk_mem_pages(sk->sk_wmem_queued +
atomic_read(&sk->sk_rmem_alloc) +
- sk->sk_forward_alloc))
+ atomic_read(&sk->sk_forward_alloc)))
return 1;
}
@@ -1537,7 +1537,7 @@ suppress_allocation:
}
/* Alas. Undo changes. */
- sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM;
+ atomic_sub(amt * SK_MEM_QUANTUM, &sk->sk_forward_alloc);
atomic_sub(amt, prot->memory_allocated);
return 0;
}
@@ -1551,14 +1551,21 @@ EXPORT_SYMBOL(__sk_mem_schedule);
void __sk_mem_reclaim(struct sock *sk)
{
struct proto *prot = sk->sk_prot;
-
- atomic_sub(sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT,
- prot->memory_allocated);
- sk->sk_forward_alloc &= SK_MEM_QUANTUM - 1;
-
- if (prot->memory_pressure && *prot->memory_pressure &&
- (atomic_read(prot->memory_allocated) < prot->sysctl_mem[0]))
- *prot->memory_pressure = 0;
+ int val = atomic_read(&sk->sk_forward_alloc);
+
+begin:
+ val = atomic_read(&sk->sk_forward_alloc);
+ if (val >= SK_MEM_QUANTUM) {
+ if (atomic_cmpxchg(&sk->sk_forward_alloc, val,
+ val & (SK_MEM_QUANTUM - 1)) != val)
+ goto begin;
+ atomic_sub(val >> SK_MEM_QUANTUM_SHIFT,
+ prot->memory_allocated);
+
+ if (prot->memory_pressure && *prot->memory_pressure &&
+ (atomic_read(prot->memory_allocated) < prot->sysctl_mem[0]))
+ *prot->memory_pressure = 0;
+ }
}
EXPORT_SYMBOL(__sk_mem_reclaim);
diff --git a/net/core/stream.c b/net/core/stream.c
index 8727cea..4d04d28 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -198,7 +198,7 @@ void sk_stream_kill_queues(struct sock *sk)
sk_mem_reclaim(sk);
WARN_ON(sk->sk_wmem_queued);
- WARN_ON(sk->sk_forward_alloc);
+ WARN_ON(atomic_read(&sk->sk_forward_alloc));
/* It is _impossible_ for the backlog to contain anything
* when we get here. All user references to this socket
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 627be4d..7a1475c 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -152,7 +152,7 @@ void inet_sock_destruct(struct sock *sk)
WARN_ON(atomic_read(&sk->sk_rmem_alloc));
WARN_ON(atomic_read(&sk->sk_wmem_alloc));
WARN_ON(sk->sk_wmem_queued);
- WARN_ON(sk->sk_forward_alloc);
+ WARN_ON(atomic_read(&sk->sk_forward_alloc));
kfree(inet->opt);
dst_release(sk->sk_dst_cache);
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 588a779..903ad66 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -158,7 +158,7 @@ static int inet_csk_diag_fill(struct sock *sk,
if (minfo) {
minfo->idiag_rmem = atomic_read(&sk->sk_rmem_alloc);
minfo->idiag_wmem = sk->sk_wmem_queued;
- minfo->idiag_fmem = sk->sk_forward_alloc;
+ minfo->idiag_fmem = atomic_read(&sk->sk_forward_alloc);
minfo->idiag_tmem = atomic_read(&sk->sk_wmem_alloc);
}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a6961d7..5e08f37 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5258,7 +5258,7 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
tcp_rcv_rtt_measure_ts(sk, skb);
- if ((int)skb->truesize > sk->sk_forward_alloc)
+ if ((int)skb->truesize > atomic_read(&sk->sk_forward_alloc))
goto step5;
NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITS);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 4bd178a..dcc246a 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -955,9 +955,7 @@ try_again:
err = ulen;
out_free:
- lock_sock(sk);
skb_free_datagram(sk, skb);
- release_sock(sk);
out:
return err;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 84b1a29..582b80a 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -257,9 +257,7 @@ try_again:
err = ulen;
out_free:
- lock_sock(sk);
skb_free_datagram(sk, skb);
- release_sock(sk);
out:
return err;
diff --git a/net/sched/em_meta.c b/net/sched/em_meta.c
index 72cf86e..94d90b6 100644
--- a/net/sched/em_meta.c
+++ b/net/sched/em_meta.c
@@ -383,7 +383,7 @@ META_COLLECTOR(int_sk_wmem_queued)
META_COLLECTOR(int_sk_fwd_alloc)
{
SKIP_NONLOCAL(skb);
- dst->value = skb->sk->sk_forward_alloc;
+ dst->value = atomic_read(&skb->sk->sk_forward_alloc);
}
META_COLLECTOR(int_sk_sndbuf)
Thread overview: 70+ messages
2009-01-30 17:49 Multicast packet loss Kenny Chang
2009-01-30 19:04 ` Eric Dumazet
2009-01-30 19:17 ` Denys Fedoryschenko
2009-01-30 20:03 ` Neil Horman
2009-01-30 22:29 ` Kenny Chang
2009-01-30 22:41 ` Eric Dumazet
2009-01-31 16:03 ` Neil Horman
2009-02-02 16:13 ` Kenny Chang
2009-02-02 16:48 ` Kenny Chang
2009-02-03 11:55 ` Neil Horman
2009-02-03 15:20 ` Kenny Chang
2009-02-04 1:15 ` Neil Horman
2009-02-04 16:07 ` Kenny Chang
2009-02-04 16:46 ` Wesley Chow
2009-02-04 18:11 ` Eric Dumazet
2009-02-05 13:33 ` Neil Horman
2009-02-05 13:46 ` Wesley Chow
2009-02-05 13:29 ` Neil Horman
2009-02-01 12:40 ` Eric Dumazet
2009-02-02 13:45 ` Neil Horman
2009-02-02 16:57 ` Eric Dumazet
2009-02-02 18:22 ` Neil Horman
2009-02-02 19:51 ` Wes Chow
2009-02-02 20:29 ` Eric Dumazet
2009-02-02 21:09 ` Wes Chow
2009-02-02 21:31 ` Eric Dumazet
2009-02-03 17:34 ` Kenny Chang
2009-02-04 1:21 ` Neil Horman
2009-02-26 17:15 ` Kenny Chang
2009-02-28 8:51 ` Eric Dumazet [this message]
2009-03-01 17:03 ` Eric Dumazet
2009-03-04 8:16 ` David Miller
2009-03-04 8:36 ` Eric Dumazet
2009-03-07 7:46 ` Eric Dumazet
2009-03-08 16:46 ` Eric Dumazet
2009-03-09 2:49 ` David Miller
2009-03-09 6:36 ` Eric Dumazet
2009-03-13 21:51 ` David Miller
2009-03-13 22:30 ` Eric Dumazet
2009-03-13 22:38 ` David Miller
2009-03-13 22:45 ` Eric Dumazet
2009-03-14 9:03 ` [PATCH] net: reorder fields of struct socket Eric Dumazet
2009-03-16 2:59 ` David Miller
2009-03-16 22:22 ` Multicast packet loss Eric Dumazet
2009-03-17 10:11 ` Peter Zijlstra
2009-03-17 11:08 ` Eric Dumazet
2009-03-17 11:57 ` Peter Zijlstra
2009-03-17 15:00 ` Brian Bloniarz
2009-03-17 15:16 ` Eric Dumazet
2009-03-17 19:39 ` David Stevens
2009-03-17 21:19 ` Eric Dumazet
2009-04-03 19:28 ` Brian Bloniarz
2009-04-05 13:49 ` Eric Dumazet
2009-04-06 21:53 ` Brian Bloniarz
2009-04-06 22:12 ` Brian Bloniarz
2009-04-07 20:08 ` Brian Bloniarz
2009-04-08 8:12 ` Eric Dumazet
2009-03-09 22:56 ` Brian Bloniarz
2009-03-10 5:28 ` Eric Dumazet
2009-03-10 23:22 ` Brian Bloniarz
2009-03-11 3:00 ` Eric Dumazet
2009-03-12 15:47 ` Brian Bloniarz
2009-03-12 16:34 ` Eric Dumazet
2009-02-27 18:40 ` Christoph Lameter
2009-02-27 18:56 ` Eric Dumazet
2009-02-27 19:45 ` Christoph Lameter
2009-02-27 20:12 ` Eric Dumazet
2009-02-27 21:36 ` Eric Dumazet
2009-02-02 13:53 ` Eric Dumazet
-- strict thread matches above, loose matches on Subject: below --
2009-04-05 14:42 bmb