From: Eric Dumazet <dada1@cosmosbay.com>
To: Kenny Chang <kchang@athenacr.com>
Cc: netdev@vger.kernel.org, "David S. Miller" <davem@davemloft.net>,
	Christoph Lameter <cl@linux-foundation.org>
Subject: Re: Multicast packet loss
Date: Sat, 28 Feb 2009 09:51:11 +0100
Message-ID: <49A8FAFF.7060104@cosmosbay.com>
In-Reply-To: <49A6CE39.5050200@athenacr.com>

Kenny Chang wrote:
> It's been a while since I updated this thread.  We've been running
> through the different suggestions and tabulating their effects, as well
> as trying out an Intel card.  The short story is that setting affinity
> and MSI works to some extent, and the Intel card doesn't seem to change
> things significantly.  The results don't seem consistent enough for us
> to be able to point to a smoking gun.
> 
> It does look like the 2.6.29-rc4 kernel performs okay with the Intel
> card, but this is not a real-time build and it's not likely to be in a
> supported Ubuntu distribution real soon.  We've reached the point where
> we'd like to look for an expert dedicated to work on this problem for a
> period of time.  The final result being some sort of solution to produce
> a realtime configuration with a reasonably "aged" kernel (.24~.28) that
> has multicast performance greater than or equal to that of 2.6.15.
> 
> If anybody is interested in devoting some compensated time to this
> issue, we're offering up a bounty:
> http://www.athenacr.com/bounties/multicast-performance/
> 
> For completeness, here's the table of our experiment results:
> 
> =============================================================================================================
> Kernel                 Flavor           IRQ   Affinity  4x mcasttest  5x mcasttest  6x mcasttest  Mtools2 [4]
> =============================================================================================================
> Intel e1000e
> -------------------------------------------------------------------------------------------------------------
> 2.6.24.19              rt                     any       OK            Maybe         X
> 2.6.24.19              rt                     CPU0      OK            OK            X
> 2.6.24.19              generic                any       X
> 2.6.24.19              generic                CPU0      OK
> 2.6.29-rc3             vanilla-server         any       X
> 2.6.29-rc3             vanilla-server         CPU0      OK
> 2.6.29-rc4             vanilla-generic        any       X                                         OK
> 2.6.29-rc4             vanilla-generic        CPU0      OK            OK            OK [5]        OK
> -------------------------------------------------------------------------------------------------------------
> Broadcom BNX2
> -------------------------------------------------------------------------------------------------------------
> 2.6.24-19              rt               MSI   any       OK            OK            X
> 2.6.24-19              rt               MSI   CPU0      OK            Maybe         X
> 2.6.24-19              rt               APIC  any       OK            OK            X
> 2.6.24-19              rt               APIC  CPU0      OK            Maybe         X
> 2.6.24-19-bnx-latest   rt               APIC  CPU0      OK            X
> 2.6.24-19              server           MSI   any       X
> 2.6.24-19              server           MSI   CPU0      OK
> 2.6.24-19              generic          APIC  any       X
> 2.6.24-19              generic          APIC  CPU0      OK
> 2.6.27-11              generic          APIC  any       X
> 2.6.27-11              generic          APIC  CPU0      OK            10% drop
> 2.6.28-8               generic          APIC  any       OK            X
> 2.6.28-8               generic          APIC  CPU0      OK            OK            0.5% drop
> 2.6.29-rc3             vanilla-server   MSI   any       X
> 2.6.29-rc3             vanilla-server   MSI   CPU0      X
> 2.6.29-rc3             vanilla-server   APIC  any       OK            X
> 2.6.29-rc3             vanilla-server   APIC  CPU0      OK            OK
> 2.6.29-rc4             vanilla-generic  APIC  any       X
> 2.6.29-rc4             vanilla-generic  APIC  CPU0      OK            3% drop       10% drop      X
> =============================================================================================================
> 
> * [4] MTools2 is a test from 29West: http://www.29west.com/docs/TestNet/
> * [5] In 5 trials, 1 of the trials dropped 2%, 4 of the trials dropped
> nothing.
> 
> Kenny
> 

Hi Kenny

I am investigating how to reduce contention (and schedule() calls) on this workload.

The following patch already reduces packet drops for me, although the
result is not yet *perfect*: 10% packet loss instead of 30%, with
8 receivers on my 8-CPU machine.

David, this is preliminary work, not meant for inclusion as is;
comments are welcome.

Thank you

[PATCH] net: sk_forward_alloc becomes an atomic_t

Commit 95766fff6b9a78d11fc2d3812dd035381690b55d
(UDP: Add memory accounting) introduced a regression for high-rate UDP
flows, because of the extra lock_sock() in udp_recvmsg().

To reduce the need for lock_sock() in the UDP receive path, we might
need to declare sk_forward_alloc as an atomic_t.

udp_recvmsg() can then avoid a lock_sock()/release_sock() pair.

Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
---
 include/net/sock.h   |   14 +++++++-------
 net/core/sock.c      |   31 +++++++++++++++++++------------
 net/core/stream.c    |    2 +-
 net/ipv4/af_inet.c   |    2 +-
 net/ipv4/inet_diag.c |    2 +-
 net/ipv4/tcp_input.c |    2 +-
 net/ipv4/udp.c       |    2 --
 net/ipv6/udp.c       |    2 --
 net/sched/em_meta.c  |    2 +-
 9 files changed, 31 insertions(+), 28 deletions(-)

diff --git a/include/net/sock.h b/include/net/sock.h
index 4bb1ff9..c4befb9 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -250,7 +250,7 @@ struct sock {
 	struct sk_buff_head	sk_async_wait_queue;
 #endif
 	int			sk_wmem_queued;
-	int			sk_forward_alloc;
+	atomic_t		sk_forward_alloc;
 	gfp_t			sk_allocation;
 	int			sk_route_caps;
 	int			sk_gso_type;
@@ -823,7 +823,7 @@ static inline int sk_wmem_schedule(struct sock *sk, int size)
 {
 	if (!sk_has_account(sk))
 		return 1;
-	return size <= sk->sk_forward_alloc ||
+	return size <= atomic_read(&sk->sk_forward_alloc) ||
 		__sk_mem_schedule(sk, size, SK_MEM_SEND);
 }
 
@@ -831,7 +831,7 @@ static inline int sk_rmem_schedule(struct sock *sk, int size)
 {
 	if (!sk_has_account(sk))
 		return 1;
-	return size <= sk->sk_forward_alloc ||
+	return size <= atomic_read(&sk->sk_forward_alloc) ||
 		__sk_mem_schedule(sk, size, SK_MEM_RECV);
 }
 
@@ -839,7 +839,7 @@ static inline void sk_mem_reclaim(struct sock *sk)
 {
 	if (!sk_has_account(sk))
 		return;
-	if (sk->sk_forward_alloc >= SK_MEM_QUANTUM)
+	if (atomic_read(&sk->sk_forward_alloc) >= SK_MEM_QUANTUM)
 		__sk_mem_reclaim(sk);
 }
 
@@ -847,7 +847,7 @@ static inline void sk_mem_reclaim_partial(struct sock *sk)
 {
 	if (!sk_has_account(sk))
 		return;
-	if (sk->sk_forward_alloc > SK_MEM_QUANTUM)
+	if (atomic_read(&sk->sk_forward_alloc) > SK_MEM_QUANTUM)
 		__sk_mem_reclaim(sk);
 }
 
@@ -855,14 +855,14 @@ static inline void sk_mem_charge(struct sock *sk, int size)
 {
 	if (!sk_has_account(sk))
 		return;
-	sk->sk_forward_alloc -= size;
+	atomic_sub(size, &sk->sk_forward_alloc);
 }
 
 static inline void sk_mem_uncharge(struct sock *sk, int size)
 {
 	if (!sk_has_account(sk))
 		return;
-	sk->sk_forward_alloc += size;
+	atomic_add(size, &sk->sk_forward_alloc);
 }
 
 static inline void sk_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
diff --git a/net/core/sock.c b/net/core/sock.c
index 0620046..8489105 100644
--- a/net/core/sock.c
+++ b/net/core/sock.c
@@ -1081,7 +1081,7 @@ struct sock *sk_clone(const struct sock *sk, const gfp_t priority)
 
 		newsk->sk_dst_cache	= NULL;
 		newsk->sk_wmem_queued	= 0;
-		newsk->sk_forward_alloc = 0;
+		atomic_set(&newsk->sk_forward_alloc, 0);
 		newsk->sk_send_head	= NULL;
 		newsk->sk_userlocks	= sk->sk_userlocks & ~SOCK_BINDPORT_LOCK;
 
@@ -1479,7 +1479,7 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
 	int amt = sk_mem_pages(size);
 	int allocated;
 
-	sk->sk_forward_alloc += amt * SK_MEM_QUANTUM;
+	atomic_add(amt * SK_MEM_QUANTUM, &sk->sk_forward_alloc);
 	allocated = atomic_add_return(amt, prot->memory_allocated);
 
 	/* Under limit. */
@@ -1520,7 +1520,7 @@ int __sk_mem_schedule(struct sock *sk, int size, int kind)
 		if (prot->sysctl_mem[2] > alloc *
 		    sk_mem_pages(sk->sk_wmem_queued +
 				 atomic_read(&sk->sk_rmem_alloc) +
-				 sk->sk_forward_alloc))
+				 atomic_read(&sk->sk_forward_alloc)))
 			return 1;
 	}
 
@@ -1537,7 +1537,7 @@ suppress_allocation:
 	}
 
 	/* Alas. Undo changes. */
-	sk->sk_forward_alloc -= amt * SK_MEM_QUANTUM;
+	atomic_sub(amt * SK_MEM_QUANTUM, &sk->sk_forward_alloc);
 	atomic_sub(amt, prot->memory_allocated);
 	return 0;
 }
@@ -1551,14 +1551,21 @@ EXPORT_SYMBOL(__sk_mem_schedule);
 void __sk_mem_reclaim(struct sock *sk)
 {
 	struct proto *prot = sk->sk_prot;
-
-	atomic_sub(sk->sk_forward_alloc >> SK_MEM_QUANTUM_SHIFT,
-		   prot->memory_allocated);
-	sk->sk_forward_alloc &= SK_MEM_QUANTUM - 1;
-
-	if (prot->memory_pressure && *prot->memory_pressure &&
-	    (atomic_read(prot->memory_allocated) < prot->sysctl_mem[0]))
-		*prot->memory_pressure = 0;
+	int val = atomic_read(&sk->sk_forward_alloc);
+
+begin:
+	val = atomic_read(&sk->sk_forward_alloc);
+	if (val >= SK_MEM_QUANTUM) {
+		if (atomic_cmpxchg(&sk->sk_forward_alloc, val,
+				   val & (SK_MEM_QUANTUM - 1)) != val)
+			goto begin;
+		atomic_sub(val >> SK_MEM_QUANTUM_SHIFT,
+			   prot->memory_allocated);
+
+		if (prot->memory_pressure && *prot->memory_pressure &&
+		    (atomic_read(prot->memory_allocated) < prot->sysctl_mem[0]))
+			*prot->memory_pressure = 0;
+	}
 }
 
 EXPORT_SYMBOL(__sk_mem_reclaim);
diff --git a/net/core/stream.c b/net/core/stream.c
index 8727cea..4d04d28 100644
--- a/net/core/stream.c
+++ b/net/core/stream.c
@@ -198,7 +198,7 @@ void sk_stream_kill_queues(struct sock *sk)
 	sk_mem_reclaim(sk);
 
 	WARN_ON(sk->sk_wmem_queued);
-	WARN_ON(sk->sk_forward_alloc);
+	WARN_ON(atomic_read(&sk->sk_forward_alloc));
 
 	/* It is _impossible_ for the backlog to contain anything
 	 * when we get here.  All user references to this socket
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 627be4d..7a1475c 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -152,7 +152,7 @@ void inet_sock_destruct(struct sock *sk)
 	WARN_ON(atomic_read(&sk->sk_rmem_alloc));
 	WARN_ON(atomic_read(&sk->sk_wmem_alloc));
 	WARN_ON(sk->sk_wmem_queued);
-	WARN_ON(sk->sk_forward_alloc);
+	WARN_ON(atomic_read(&sk->sk_forward_alloc));
 
 	kfree(inet->opt);
 	dst_release(sk->sk_dst_cache);
diff --git a/net/ipv4/inet_diag.c b/net/ipv4/inet_diag.c
index 588a779..903ad66 100644
--- a/net/ipv4/inet_diag.c
+++ b/net/ipv4/inet_diag.c
@@ -158,7 +158,7 @@ static int inet_csk_diag_fill(struct sock *sk,
 	if (minfo) {
 		minfo->idiag_rmem = atomic_read(&sk->sk_rmem_alloc);
 		minfo->idiag_wmem = sk->sk_wmem_queued;
-		minfo->idiag_fmem = sk->sk_forward_alloc;
+		minfo->idiag_fmem = atomic_read(&sk->sk_forward_alloc);
 		minfo->idiag_tmem = atomic_read(&sk->sk_wmem_alloc);
 	}
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index a6961d7..5e08f37 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5258,7 +5258,7 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
 
 				tcp_rcv_rtt_measure_ts(sk, skb);
 
-				if ((int)skb->truesize > sk->sk_forward_alloc)
+				if ((int)skb->truesize > atomic_read(&sk->sk_forward_alloc))
 					goto step5;
 
 				NET_INC_STATS_BH(sock_net(sk), LINUX_MIB_TCPHPHITS);
diff --git a/net/ipv4/udp.c b/net/ipv4/udp.c
index 4bd178a..dcc246a 100644
--- a/net/ipv4/udp.c
+++ b/net/ipv4/udp.c
@@ -955,9 +955,7 @@ try_again:
 		err = ulen;
 
 out_free:
-	lock_sock(sk);
 	skb_free_datagram(sk, skb);
-	release_sock(sk);
 out:
 	return err;
 
diff --git a/net/ipv6/udp.c b/net/ipv6/udp.c
index 84b1a29..582b80a 100644
--- a/net/ipv6/udp.c
+++ b/net/ipv6/udp.c
@@ -257,9 +257,7 @@ try_again:
 		err = ulen;
 
 out_free:
-	lock_sock(sk);
 	skb_free_datagram(sk, skb);
-	release_sock(sk);
 out:
 	return err;
 
diff --git a/net/sched/em_meta.c b/net/sched/em_meta.c
index 72cf86e..94d90b6 100644
--- a/net/sched/em_meta.c
+++ b/net/sched/em_meta.c
@@ -383,7 +383,7 @@ META_COLLECTOR(int_sk_wmem_queued)
 META_COLLECTOR(int_sk_fwd_alloc)
 {
 	SKIP_NONLOCAL(skb);
-	dst->value = skb->sk->sk_forward_alloc;
+	dst->value = atomic_read(&skb->sk->sk_forward_alloc);
 }
 
 META_COLLECTOR(int_sk_sndbuf)
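
For readers who want to experiment with the lockless reclaim step outside
the kernel, here is a minimal userspace sketch of the compare-and-swap loop
used in __sk_mem_reclaim() above. It is an illustration only, not the patch
itself: it uses C11 <stdatomic.h> rather than the kernel's atomic_t API, and
the names forward_alloc, memory_allocated, mem_reclaim, QUANTUM and
QUANTUM_SHIFT are hypothetical stand-ins with an assumed quantum of 4096.

```c
#include <stdatomic.h>

/* Assumed value for illustration; the kernel defines SK_MEM_QUANTUM itself. */
#define QUANTUM_SHIFT 12
#define QUANTUM (1 << QUANTUM_SHIFT)

/* Hypothetical stand-ins for sk->sk_forward_alloc and *prot->memory_allocated. */
atomic_int forward_alloc;
atomic_int memory_allocated;

/*
 * Give whole quanta back to the global pool without holding a lock.
 * The compare-and-swap only succeeds if nobody changed forward_alloc
 * since it was read; otherwise the loop retries with the fresh value.
 */
void mem_reclaim(void)
{
	int val = atomic_load(&forward_alloc);

	while (val >= QUANTUM) {
		/* Keep only the sub-quantum remainder in the per-socket counter. */
		if (atomic_compare_exchange_weak(&forward_alloc, &val,
						 val & (QUANTUM - 1))) {
			/* On success val still holds the old value. */
			atomic_fetch_sub(&memory_allocated,
					 val >> QUANTUM_SHIFT);
			break;
		}
		/* On failure val was updated to the current value; loop again. */
	}
}
```

One difference to keep in mind: the kernel's atomic_cmpxchg() returns the old
value, so the patch compares that result with val and jumps back to a label,
whereas the C11 call returns a boolean and refreshes val on failure.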

