From: Eric Dumazet <dada1@cosmosbay.com>
To: Kenny Chang <kchang@athenacr.com>
Cc: netdev@vger.kernel.org, "David S. Miller" <davem@davemloft.net>,
Christoph Lameter <cl@linux-foundation.org>
Subject: Re: Multicast packet loss
Date: Sat, 28 Feb 2009 09:51:11 +0100
Message-ID: <49A8FAFF.7060104@cosmosbay.com>
In-Reply-To: <49A6CE39.5050200@athenacr.com>
Kenny Chang wrote:
> It's been a while since I updated this thread. We've been running
> through the different suggestions and tabulating their effects, as well
> as trying out an Intel card. The short story is that setting affinity
> and MSI works to some extent, and the Intel card doesn't seem to change
> things significantly. The results don't seem consistent enough for us
> to be able to point to a smoking gun.
>
> It does look like the 2.6.29-rc4 kernel performs okay with the Intel
> card, but this is not a real-time build and it's not likely to be in a
> supported Ubuntu distribution real soon. We've reached the point where
> we'd like to look for an expert dedicated to work on this problem for a
> period of time. The final result being some sort of solution to produce
> a realtime configuration with a reasonably "aged" kernel (.24~.28) that
> has multicast performance greater than or equal to that of 2.6.15.
>
> If anybody is interested in devoting some compensated time to this
> issue, we're offering up a bounty:
> http://www.athenacr.com/bounties/multicast-performance/
>
> For completeness, here's the table of our experiment results:
>
> Kernel flavor               | IRQ affinity | Non-empty results, in order, from the columns
>                             |              | *4x mcasttest*, *5x mcasttest*,
>                             |              | *6x mcasttest*, *Mtools2* [4]_
> ============================+==============+==============================================
> Intel e1000e
> ----------------------------+--------------+----------------------------------------------
> 2.6.24.19 rt                | any          | OK, Maybe, X
> 2.6.24.19 rt                | CPU0         | OK, OK, X
> 2.6.24.19 generic           | any          | X
> 2.6.24.19 generic           | CPU0         | OK
> 2.6.29-rc3 vanilla-server   | any          | X
> 2.6.29-rc3 vanilla-server   | CPU0         | OK
> 2.6.29-rc4 vanilla-generic  | any          | X, OK
> 2.6.29-rc4 vanilla-generic  | CPU0         | OK, OK, OK [5]_, OK
> ----------------------------+--------------+----------------------------------------------
> Broadcom BNX2
> ----------------------------+--------------+----------------------------------------------
> 2.6.24-19 rt                | MSI any      | OK, OK, X
> 2.6.24-19 rt                | MSI CPU0     | OK, Maybe, X
> 2.6.24-19 rt                | APIC any     | OK, OK, X
> 2.6.24-19 rt                | APIC CPU0    | OK, Maybe, X
> 2.6.24-19-bnx-latest rt     | APIC CPU0    | OK, X
> 2.6.24-19 server            | MSI any      | X
> 2.6.24-19 server            | MSI CPU0     | OK
> 2.6.24-19 generic           | APIC any     | X
> 2.6.24-19 generic           | APIC CPU0    | OK
> 2.6.27-11 generic           | APIC any     | X
> 2.6.27-11 generic           | APIC CPU0    | OK, 10% drop
> 2.6.28-8 generic            | APIC any     | OK, X
> 2.6.28-8 generic            | APIC CPU0    | OK, OK, 0.5% drop
> 2.6.29-rc3 vanilla-server   | MSI any      | X
> 2.6.29-rc3 vanilla-server   | MSI CPU0     | X
> 2.6.29-rc3 vanilla-server   | APIC any     | OK, X
> 2.6.29-rc3 vanilla-server   | APIC CPU0    | OK, OK
> 2.6.29-rc4 vanilla-generic  | APIC any     | X
> 2.6.29-rc4 vanilla-generic  | APIC CPU0    | OK, 3% drop, 10% drop, X
> ============================+==============+==============================================
>
> * [4] MTools2 is a test from 29West: http://www.29west.com/docs/TestNet/
> * [5] In 5 trials, 1 of the trials dropped 2%, 4 of the trials dropped
> nothing.
>
> Kenny
>
Hi Kenny,
I am investigating how to reduce contention (and schedule() calls) on this workload.
The following patch already gave me fewer packet drops, though the result is not yet *perfect*:
10% packet loss instead of 30% with 8 receivers on my 8-CPU machine.
David, this is preliminary work, not meant for inclusion as is;
comments are welcome.
Thank you
[PATCH] net: sk_forward_alloc becomes an atomic_t
Commit 95766fff6b9a78d11fc2d3812dd035381690b55d
(UDP: Add memory accounting) introduced a regression for high-rate UDP flows
because of the extra lock_sock() in udp_recvmsg().
To reduce the need for lock_sock() in the UDP receive path, we might need
to declare sk_forward_alloc as an atomic_t.
udp_recvmsg() can then avoid a lock_sock()/release_sock() pair.
Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
include/net/sock.h | 14 +++++++-------
net/core/sock.c | 31 +++++++++++++++++++------------
net/core/stream.c | 2 +-
net/ipv4/af_inet.c | 2 +-
net/ipv4/inet_diag.c | 2 +-
net/ipv4/tcp_input.c | 2 +-
net/ipv4/udp.c | 2 --
net/ipv6/udp.c | 2 --
net/sched/em_meta.c | 2 +-
9 files changed, 31 insertions(+), 28 deletions(-)
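
For readers who want to try the idea outside the kernel, here is a minimal
userspace sketch of the accounting that the commit message describes. It is
not kernel code: C11 <stdatomic.h> stands in for atomic_t, struct fake_sock
and the mem_*() helpers are hypothetical names, the quantum is assumed to be
4096 bytes (the kernel uses SK_MEM_QUANTUM), and all memory-pressure limit
checks are left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Assumed value for illustration only. */
#define QUANTUM_SHIFT 12
#define QUANTUM (1 << QUANTUM_SHIFT)

/* Hypothetical stand-in for the one struct sock field the patch converts. */
struct fake_sock {
	atomic_int forward_alloc;	/* bytes reserved for this socket */
};

/* Hypothetical protocol-wide total, counted in quanta. */
static atomic_int memory_allocated;

/* Round a byte count up to whole quanta. */
static int mem_pages(int size)
{
	return (size + QUANTUM - 1) >> QUANTUM_SHIFT;
}

/*
 * Fast path: the reserve already covers the request.
 * Slow path: reserve whole quanta (limit checks omitted in this sketch).
 */
static bool mem_schedule(struct fake_sock *sk, int size)
{
	assert(size >= 0);
	if (size <= atomic_load(&sk->forward_alloc))
		return true;

	int amt = mem_pages(size);
	atomic_fetch_add(&sk->forward_alloc, amt * QUANTUM);
	atomic_fetch_add(&memory_allocated, amt);
	return true;
}

/* Consume part of the reserve when a buffer is queued. */
static void mem_charge(struct fake_sock *sk, int size)
{
	atomic_fetch_sub(&sk->forward_alloc, size);
}

/* Give the bytes back when the buffer is freed; no socket lock is taken. */
static void mem_uncharge(struct fake_sock *sk, int size)
{
	atomic_fetch_add(&sk->forward_alloc, size);
}
```

Each operation is atomic on its own, but the check followed by the add in
mem_schedule() is not one atomic step, so two concurrent callers could both
take the slow path. Whether that matters for the real code is one of the
open questions for a preliminary patch like this one.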
diff --git a/include/net/sock.h b/include/net/sock.h
index 4bb1ff9..c4befb9 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -250,7 +250,7 @@ struct sock {
struct sk_buff_head sk_async_wait_queue;
#endif
int sk_wmem_queued;
- int sk_forward_alloc;
+ atomic_t sk_forward_alloc;
gfp_t sk_allocation;
int sk_route_caps;
int sk_gso_type;
@@ -823,7 +823,7 @@ static inline int sk_wmem_schedule(struct sock *sk, int size)
{
if (!sk_has_account(sk))
return 1;
- return size <= sk->sk_forward_alloc ||
+ return size <= atomic_read(&sk->sk_forward_alloc) ||
__sk_mem_schedule(sk, size, SK_MEM_SEND);
}
@@ -831,7 +831,7 @@ static inline int sk_rmem_schedule(struct sock *sk, int size)
{
if (!sk_has_account(sk))
return 1;
- return size <= sk->sk_forward_alloc ||
+ return size <= atomic_read(&sk->sk_forward_alloc) ||
__sk_mem_schedule(sk, size, SK_MEM_RECV);
}
@@ -839,7 +839,7 @@ static inline void sk_mem_reclaim(struct sock *sk)
{
if (!sk_has_account(sk))
return;
- if (sk->sk_forward_alloc >= SK_MEM_QUANTUM)
+ if (atomic_read(&sk->sk_forward_alloc) >= SK_MEM_QUANTUM)
__sk_mem_reclaim(sk);
}
@@ -847,7 +847,7 @@ static inline void sk_mem_reclaim_partial(struct sock *sk)
{
if (!sk_has_account(sk))
return;
- if (sk->sk_forward_alloc > SK_MEM_QUANTUM)
+ if (atomic_read(&sk->sk_forward_alloc) > SK_MEM_QUANTUM)
__sk_mem_reclaim(sk);
}
@@ -855,14 +855,14 @@ static inline void sk_mem_charge(struct sock *sk, int size)
{
if (!sk_has_account(sk))
return;
- sk->sk_forward_alloc -= size;
+ atomic_sub(size, &sk->sk_forward_alloc);
}
static inline void sk_mem_uncharge(struct sock *sk, int size)
{
if (!sk_has_account(sk))
return;
- sk->sk_forward_alloc += size;
+ atomic_add(size, &sk->sk_forward_alloc);
}
static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
diff --git a/net/core/sock.c b/net/core/sock.c
index 0620046..8489105 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1081,7 +1081,7 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
newsk->sk_dst_cache = NULL;
newsk->sk_wmem_queued = 0;
- newsk->sk_forward_alloc = 0;
+ atomic_set(&newsk->sk_forward_alloc, 0);
newsk->sk_send_head = NULL;
newsk->sk_userlocks = sk->sk_userlocks & ~SOCK_BINDPORT_LOCK;
@@ -1479,7 +1479,7 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
int amt = sk_mem_pages(size);
int allocated;
- sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
+ atomic_add(amt * SK_MEM_QUANTUM, &sk->sk_forward_alloc);
allocated = atomic_add_return(amt, prot->memory_allocated);
/* Under limit. */
@@ -1520,7 +1520,7 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
if (prot->sysctl_mem[2] > alloc *
sk_mem_pages(sk->sk_wmem_queued +
atomic_read(&sk->sk_rmem_alloc) +
- sk->sk_forward_alloc))
+ atomic_read(&sk->sk_forward_alloc)))
return 1;
}
@@ -1537,7 +1537,7 @@ suppress_allocation:
}
/* Alas. Undo changes. */
- sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM;
+ atomic_sub(amt * SK_MEM_QUANTUM, &sk->sk_forward_alloc);
atomic_sub(amt, prot->memory_allocated);
return 0;
}
@@ -1551,14 +1551,21 @@ EXPORT_SYMBOL(__sk_mem_schedule);
void __sk_mem_reclaim(struct sock *sk)
{
struct proto *prot = sk->sk_prot;
-
- atomic_sub(sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT,
- prot->memory_allocated);
- sk->sk_forward_alloc &= SK_MEM_QUANTUM - 1;
-
- if (prot->memory_pressure && *prot->memory_pressure &&
- (atomic_read(prot->memory_allocated) < prot->sysctl_mem[0]))
- *prot->memory_pressure = 0;
+ int val = atomic_read(&sk->sk_forward_alloc);
+
+begin:
+ val = atomic_read(&sk->sk_forward_alloc);
+ if (val >= SK_MEM_QUANTUM) {
+ if (atomic_cmpxchg(&sk->sk_forward_alloc, val,
+ val & (SK_MEM_QUANTUM - 1)) != val)
+ goto begin;
+ atomic_sub(val >> SK_MEM_QUANTUM_SHIFT,
+ prot->memory_allocated);
+
+ if (prot->memory_pressure && *prot->memory_pressure &&
+ (atomic_read(prot->memory_allocated) < prot->sysctl_mem[0]))
+ *prot->memory_pressure = 0;
+ }
}
EXPORT_SYMBOL(__sk_mem_reclaim);
diff --git a/net/core/stream.c b/net/core/stream.c
index 8727cea..4d04d28 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -198,7 +198,7 @@ void sk_stream_kill_queues(struct sock *sk)
sk_mem_reclaim(sk);
WARN_ON(sk->sk_wmem_queued);
- WARN_ON(sk->sk_forward_alloc);
+ WARN_ON(atomic_read(&sk->sk_forward_alloc));
/* It is _impossible_ for the backlog to contain anything
* when we get here. All user references to this socket
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 627be4d..7a1475c 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -152,7 +152,7 @@ void inet_sock_destruct(struct sock *sk)
WARN_ON(atomic_read(&sk->sk_rmem_alloc));
WARN_ON(atomic_read(&sk->sk_wmem_alloc));
WARN_ON(sk->sk_wmem_queued);
- WARN_ON(sk->sk_forward_alloc);
+ WARN_ON(atomic_read(&sk->sk_forward_alloc));
kfree(inet->opt);
dst_release(sk->sk_dst_cache);
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 588a779..903ad66 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -158,7 +158,7 @@ static int inet_csk_diag_fill(struct sock *sk,
if (minfo) {
minfo->idiag_rmem = atomic_read(&sk->sk_rmem_alloc);
minfo->idiag_wmem = sk->sk_wmem_queued;
- minfo->idiag_fmem = sk->sk_forward_alloc;
+ minfo->idiag_fmem = atomic_read(&sk->sk_forward_alloc);
minfo->idiag_tmem = atomic_read(&sk->sk_wmem_alloc);
}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a6961d7..5e08f37 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5258,7 +5258,7 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
tcp_rcv_rtt_measure_ts(sk, skb);
- if ((int)skb->truesize > sk->sk_forward_alloc)
+ if ((int)skb->truesize > atomic_read(&sk->sk_forward_alloc))
goto step5;
NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITS);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 4bd178a..dcc246a 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -955,9 +955,7 @@ try_again:
err = ulen;
out_free:
- lock_sock(sk);
skb_free_datagram(sk, skb);
- release_sock(sk);
out:
return err;
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 84b1a29..582b80a 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -257,9 +257,7 @@ try_again:
err = ulen;
out_free:
- lock_sock(sk);
skb_free_datagram(sk, skb);
- release_sock(sk);
out:
return err;
diff --git a/net/sched/em_meta.c b/net/sched/em_meta.c
index 72cf86e..94d90b6 100644
--- a/net/sched/em_meta.c
+++ b/net/sched/em_meta.c
@@ -383,7 +383,7 @@ META_COLLECTOR(int_sk_wmem_queued)
META_COLLECTOR(int_sk_fwd_alloc)
{
SKIP_NONLOCAL(skb);
- dst->value = skb->sk->sk_forward_alloc;
+ dst->value = atomic_read(&skb->sk->sk_forward_alloc);
}
META_COLLECTOR(int_sk_sndbuf)
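
The compare-and-exchange retry loop that the patch introduces in
__sk_mem_reclaim() can be sketched in userspace as follows. This is an
illustration under assumptions, not the kernel implementation: it uses C11
atomics instead of atomic_cmpxchg(), reclaim() is a hypothetical name, the
quantum is assumed to be 4096 bytes, and the memory_pressure handling is
omitted.

```c
#include <assert.h>
#include <stdatomic.h>

/* Assumed value for illustration only. */
#define QUANTUM_SHIFT 12
#define QUANTUM (1 << QUANTUM_SHIFT)

/*
 * Hand every whole quantum held in *forward_alloc back to the shared
 * *memory_allocated counter, keeping only the sub-quantum remainder.
 * Returns the number of quanta given back.
 */
static int reclaim(atomic_int *forward_alloc, atomic_int *memory_allocated)
{
	assert(forward_alloc && memory_allocated);

	int val = atomic_load(forward_alloc);

	while (val >= QUANTUM) {
		/*
		 * Succeeds only if nobody changed the counter since we read
		 * it. On failure, val is refreshed with the current value
		 * and the loop condition is checked again.
		 */
		if (atomic_compare_exchange_weak(forward_alloc, &val,
						 val & (QUANTUM - 1))) {
			int pages = val >> QUANTUM_SHIFT;

			atomic_fetch_sub(memory_allocated, pages);
			return pages;
		}
	}
	return 0;
}
```

The point of the loop is that a concurrent charge or uncharge between the
read and the update is never lost: the exchange fails, the new value is
picked up, and the split into whole quanta and remainder is recomputed.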