* [patch] do not readlock all buckets in /proc/net/tcp
@ 2004-07-05 11:09 Marcus Meissner
  2004-07-05 11:27 ` Herbert Xu
  0 siblings, 1 reply; 7+ messages in thread
From: Marcus Meissner @ 2004-07-05 11:09 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 466 bytes --]

Hi,

This patch stops reads of /proc/net/tcp and /proc/net/tcp6 from acquiring
the readlock on every hash bucket.

On ppc64 and ia64 the readlocks are so expensive that reading /proc/net/tcp
takes 0.25 seconds on a typical p670 LPAR.

The walk also locks all 65536 buckets, even though just 20 chains are in use
in a normal non-netserver setup.

Ciao, Marcus

Changelog:
	Readlock only non-empty hash chains to avoid 65536 readlocks.

	Signed-Off-By: Marcus Meissner <meissner@suse.de>

[-- Attachment #2: tcp-proc-walk --]
[-- Type: text/plain, Size: 1255 bytes --]

--- linux-2.6.5/net/ipv4/tcp_ipv4.c.xx	2004-07-04 13:39:51.000000000 +0200
+++ linux-2.6.5/net/ipv4/tcp_ipv4.c	2004-07-04 13:51:57.000000000 +0200
@@ -2255,6 +2255,12 @@
 		struct hlist_node *node;
 		struct tcp_tw_bucket *tw;
 	       
+		/* Avoid taking the readlock cost if we know the chain is empty,
+		 * we have a lot of buckets.
+		 */
+		if (hlist_empty(&tcp_ehash[st->bucket].chain) &&
+		    hlist_empty(&tcp_ehash[st->bucket+tcp_ehash_size].chain))
+			continue;
 		read_lock(&tcp_ehash[st->bucket].lock);
 		sk_for_each(sk, node, &tcp_ehash[st->bucket].chain) {
 			if (sk->sk_family != st->family) {
@@ -2301,13 +2307,17 @@
 		}
 		read_unlock(&tcp_ehash[st->bucket].lock);
 		st->state = TCP_SEQ_STATE_ESTABLISHED;
-		if (++st->bucket < tcp_ehash_size) {
-			read_lock(&tcp_ehash[st->bucket].lock);
-			sk = sk_head(&tcp_ehash[st->bucket].chain);
-		} else {
+
+		while ((++st->bucket < tcp_ehash_size) &&
+		       hlist_empty(&tcp_ehash[st->bucket].chain) &&
+		       hlist_empty(&tcp_ehash[st->bucket+tcp_ehash_size].chain))
+			/*empty*/;
+		if (st->bucket >= tcp_ehash_size) {
 			cur = NULL;
 			goto out;
 		}
+		read_lock(&tcp_ehash[st->bucket].lock);
+		sk = sk_head(&tcp_ehash[st->bucket].chain);
 	} else
 		sk = sk_next(sk);
 


* Re: [patch] do not readlock all buckets in /proc/net/tcp
  2004-07-05 11:09 [patch] do not readlock all buckets in /proc/net/tcp Marcus Meissner
@ 2004-07-05 11:27 ` Herbert Xu
  2004-07-05 11:35   ` Marcus Meissner
  0 siblings, 1 reply; 7+ messages in thread
From: Herbert Xu @ 2004-07-05 11:27 UTC (permalink / raw)
  To: Marcus Meissner; +Cc: netdev

Marcus Meissner <meissner@suse.de> wrote:
> 
> This patch makes the files /proc/net/tcp and /proc/net/tcp6 not acquire
> the readlock for every bucket.
> 
> On ppc64 and ia64 the readlocks are so expensive, that reading /proc/net/tcp
> takes 0.25 seconds on a usual p670 LPAR.
> 
> And it locks 65536 buckets where just 20 chains are used at all in a normal
> non-netserver setup.

Why not use NETLINK+TCP_DIAG instead? It's much faster.
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [patch] do not readlock all buckets in /proc/net/tcp
  2004-07-05 11:27 ` Herbert Xu
@ 2004-07-05 11:35   ` Marcus Meissner
  2004-07-05 12:06     ` Herbert Xu
  0 siblings, 1 reply; 7+ messages in thread
From: Marcus Meissner @ 2004-07-05 11:35 UTC (permalink / raw)
  To: Herbert Xu; +Cc: netdev

On Mon, Jul 05, 2004 at 09:27:54PM +1000, Herbert Xu wrote:
> Marcus Meissner <meissner@suse.de> wrote:
> > 
> > This patch makes the files /proc/net/tcp and /proc/net/tcp6 not acquire
> > the readlock for every bucket.
> > 
> > On ppc64 and ia64 the readlocks are so expensive, that reading /proc/net/tcp
> > takes 0.25 seconds on a usual p670 LPAR.
> > 
> > And it locks 65536 buckets where just 20 chains are used at all in a normal
> > non-netserver setup.
> 
> Why not use NETLINK+TCP_DIAG instead? It's much faster.

I'm not sure you want to, or can, fix all the proprietary software that
reads /proc/net/tcp.

Oh, and NETLINK+TCP_DIAG seems to have the same readlock contention problem,
see tcpdiag_dump().

Ciao, Marcus


* Re: [patch] do not readlock all buckets in /proc/net/tcp
  2004-07-05 11:35   ` Marcus Meissner
@ 2004-07-05 12:06     ` Herbert Xu
  2004-07-05 12:25       ` YOSHIFUJI Hideaki / 吉藤英明
  0 siblings, 1 reply; 7+ messages in thread
From: Herbert Xu @ 2004-07-05 12:06 UTC (permalink / raw)
  To: Marcus Meissner; +Cc: netdev

On Mon, Jul 05, 2004 at 01:35:55PM +0200, Marcus Meissner wrote:
> 
> Oh, and NETLINK+TCP_DIAG seems to have the same readlock contention problem,
> see tcpdiag_dump().

Then you wouldn't mind adding this optimisation for tcp_diag as well,
right? :)

Cheers,
-- 
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu 许志壬 <herbert@gondor.apana.org.au>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


* Re: [patch] do not readlock all buckets in /proc/net/tcp
  2004-07-05 12:06     ` Herbert Xu
@ 2004-07-05 12:25       ` YOSHIFUJI Hideaki / 吉藤英明
  2004-07-05 12:45         ` Marcus Meissner
  0 siblings, 1 reply; 7+ messages in thread
From: YOSHIFUJI Hideaki / 吉藤英明 @ 2004-07-05 12:25 UTC (permalink / raw)
  To: herbert, meissner; +Cc: netdev, yoshfuji

In article <20040705120610.GA5728@gondor.apana.org.au> (at Mon, 5 Jul 2004 22:06:10 +1000), Herbert Xu <herbert@gondor.apana.org.au> says:

> On Mon, Jul 05, 2004 at 01:35:55PM +0200, Marcus Meissner wrote:
> > 
> > Oh, and NETLINK+TCP_DIAG seems to have the same readlock contention problem,
> > see tcpdiag_dump().
> 
> Then you wouldn't mind adding this optimisation for tcp_diag as well,
> right? :)

Here it is. :-)

Signed-off-by: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>

===== net/ipv4/tcp_diag.c 1.15 vs edited =====
--- 1.15/net/ipv4/tcp_diag.c	2004-06-08 07:27:58 +09:00
+++ edited/net/ipv4/tcp_diag.c	2004-07-05 21:18:17 +09:00
@@ -522,9 +522,13 @@
 		if (i > s_i)
 			s_num = 0;
 
-		read_lock_bh(&head->lock);
-
 		num = 0;
+
+		if (hlist_empty(&head->chain) &&
+		    (!(r->tcpdiag_states&TCPF_TIME_WAIT) || hlist_empty(&head->chain)))
+			continue;
+
+		read_lock_bh(&head->lock);
 		sk_for_each(sk, node, &head->chain) {
 			struct inet_opt *inet = inet_sk(sk);
 

-- 
Hideaki YOSHIFUJI @ USAGI Project <yoshfuji@linux-ipv6.org>
GPG FP: 9022 65EB 1ECF 3AD1 0BDF  80D8 4807 F894 E062 0EEA


* Re: [patch] do not readlock all buckets in /proc/net/tcp
  2004-07-05 12:25       ` YOSHIFUJI Hideaki / 吉藤英明
@ 2004-07-05 12:45         ` Marcus Meissner
  2004-07-05 13:25           ` YOSHIFUJI Hideaki / 吉藤英明
  0 siblings, 1 reply; 7+ messages in thread
From: Marcus Meissner @ 2004-07-05 12:45 UTC (permalink / raw)
  To: YOSHIFUJI Hideaki / 吉藤英明; +Cc: netdev

On Mon, Jul 05, 2004 at 09:25:22PM +0900, YOSHIFUJI Hideaki / 吉藤英明 wrote:
> In article <20040705120610.GA5728@gondor.apana.org.au> (at Mon, 5 Jul 2004 22:06:10 +1000), Herbert Xu <herbert@gondor.apana.org.au> says:
> 
> > On Mon, Jul 05, 2004 at 01:35:55PM +0200, Marcus Meissner wrote:
> > > 
> > > Oh, and NETLINK+TCP_DIAG seems to have the same readlock contention problem,
> > > see tcpdiag_dump().
> > 
> > Then you wouldn't mind adding this optimisation for tcp_diag as well,
> > right? :)
> 
> here it is. :-)
> 
> Signed-off-by: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>
> 
> ===== net/ipv4/tcp_diag.c 1.15 vs edited =====
> --- 1.15/net/ipv4/tcp_diag.c	2004-06-08 07:27:58 +09:00
> +++ edited/net/ipv4/tcp_diag.c	2004-07-05 21:18:17 +09:00
> @@ -522,9 +522,13 @@
>  		if (i > s_i)
>  			s_num = 0;
>  
> -		read_lock_bh(&head->lock);
> -
>  		num = 0;
> +
> +		if (hlist_empty(&head->chain) &&
> +		    (!(r->tcpdiag_states&TCPF_TIME_WAIT) || hlist_empty(&head->chain)))
> +			continue;

The second hlist_empty() check is wrong: it tests the same chain a second
time. You should be checking &tcp_ehash[i + tcp_ehash_size].chain, i.e.
(head + tcp_ehash_size), I think.


Ciao, Marcus
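
To make the point above concrete, here is a small userspace sketch, not the kernel code. It assumes the layout implied by the patches in this thread: one table of 2 * tcp_ehash_size buckets, where bucket i holds the ordinary chain and bucket i + tcp_ehash_size holds the matching TIME_WAIT chain. The type `struct ehash_bucket` and both function names are hypothetical, and `want_time_wait` stands in for the `r->tcpdiag_states & TCPF_TIME_WAIT` test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a hash bucket: only emptiness matters here. */
struct ehash_bucket {
	int nr_socks;	/* 0 means the chain is empty */
};

static bool chain_empty(const struct ehash_bucket *b)
{
	assert(b != NULL);
	return b->nr_socks == 0;
}

/*
 * Condition as first posted: the second test repeats the first chain, so
 * the whole expression reduces to chain_empty(head). A bucket whose only
 * sockets are in TIME_WAIT is skipped even when they were requested.
 */
static bool can_skip_first_version(const struct ehash_bucket *head,
				   size_t ehash_size, bool want_time_wait)
{
	(void)ehash_size;	/* unused: this is exactly the bug */
	return chain_empty(head) &&
	       (!want_time_wait || chain_empty(head));
}

/*
 * Corrected condition: skip only if the first chain is empty and the
 * TIME_WAIT chain is either not requested or empty as well.
 */
static bool can_skip_corrected(const struct ehash_bucket *head,
			       size_t ehash_size, bool want_time_wait)
{
	return chain_empty(head) &&
	       (!want_time_wait || chain_empty(head + ehash_size));
}
```

With a bucket whose first chain is empty and whose TIME_WAIT chain is not, the first version reports the bucket as skippable regardless of `want_time_wait`, while the corrected one skips it only when TIME_WAIT sockets were not requested.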


* Re: [patch] do not readlock all buckets in /proc/net/tcp
  2004-07-05 12:45         ` Marcus Meissner
@ 2004-07-05 13:25           ` YOSHIFUJI Hideaki / 吉藤英明
  0 siblings, 0 replies; 7+ messages in thread
From: YOSHIFUJI Hideaki / 吉藤英明 @ 2004-07-05 13:25 UTC (permalink / raw)
  To: meissner; +Cc: netdev, yoshfuji

In article <20040705124511.GA17193@suse.de> (at Mon, 5 Jul 2004 14:45:11 +0200), Marcus Meissner <meissner@suse.de> says:

> The second hlist_empty is bad, you should be checking &tcp_ehash[i +
> tcp_ehash_size].chain ((head+tcp_ehash_size) I think).

Oops...

Signed-off-by: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org>

===== net/ipv4/tcp_diag.c 1.15 vs edited =====
--- 1.15/net/ipv4/tcp_diag.c	2004-06-08 07:27:58 +09:00
+++ edited/net/ipv4/tcp_diag.c	2004-07-05 22:21:06 +09:00
@@ -522,9 +522,14 @@
 		if (i > s_i)
 			s_num = 0;
 
-		read_lock_bh(&head->lock);
-
 		num = 0;
+
+		if (hlist_empty(&head->chain) &&
+		    (!(r->tcpdiag_states&TCPF_TIME_WAIT) || 
+		       hlist_empty(&(head + tcp_ehash_size)->chain)))
+			continue;
+
+		read_lock_bh(&head->lock);
 		sk_for_each(sk, node, &head->chain) {
 			struct inet_opt *inet = inet_sk(sk);
 

-- 
Hideaki YOSHIFUJI @ USAGI Project <yoshfuji@linux-ipv6.org>
GPG FP: 9022 65EB 1ECF 3AD1 0BDF  80D8 4807 F894 E062 0EEA
