* [PATCH] Bound TSO defer time (resend) @ 2006-10-17 0:53 John Heffner 2006-10-17 3:20 ` Stephen Hemminger 0 siblings, 1 reply; 32+ messages in thread From: John Heffner @ 2006-10-17 0:53 UTC (permalink / raw) To: netdev The original message didn't show up on the list. I'm assuming it's because the filters didn't like the attached postscript. I posted PDFs of the figures on the web: http://www.psc.edu/~jheffner/tmp/a.pdf http://www.psc.edu/~jheffner/tmp/b.pdf http://www.psc.edu/~jheffner/tmp/c.pdf -John ---------- Forwarded message ---------- Date: Mon, 16 Oct 2006 15:55:53 -0400 (EDT) From: John Heffner <jheffner@psc.edu> To: David Miller <davem@davemloft.net> Cc: netdev <netdev@vger.kernel.org> Subject: [PATCH] Bound TSO defer time This patch limits the amount of time you will defer sending a TSO segment to less than two clock ticks, or the time between two acks, whichever is longer. On slow links, deferring causes significant bursts. See attached plots, which show RTT through a 1 Mbps link with a 100 ms RTT and ~100 ms queue for (a) non-TSO, (b) currnet TSO, and (c) patched TSO. This burstiness causes significant jitter, tends to overflow queues early (bad for short queues), and makes delay-based congestion control more difficult. Deferring by a couple clock ticks I believe will have a relatively small impact on performance. Signed-off-by: John Heffner <jheffner@psc.edu> diff --git a/include/linux/tcp.h b/include/linux/tcp.h index 0e058a2..27ae4b2 100644 --- a/include/linux/tcp.h +++ b/include/linux/tcp.h @@ -341,7 +341,9 @@ #endif int linger2; unsigned long last_synq_overflow; - + + __u32 tso_deferred; + /* Receiver side RTT estimation */ struct { __u32 rtt; diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c index 9a253fa..3ea8973 100644 --- a/net/ipv4/tcp_output.c +++ b/net/ipv4/tcp_output.c @@ -1087,11 +1087,15 @@ static int tcp_tso_should_defer(struct s u32 send_win, cong_win, limit, in_flight; if (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN) - return 0; + goto send_now; if (icsk->icsk_ca_state != TCP_CA_Open) - return 0; + goto send_now; + /* Defer for less than two clock ticks. */ + if (!tp->tso_deferred && ((jiffies<<1)>>1) - (tp->tso_deferred>>1) > 1) + goto send_now; + in_flight = tcp_packets_in_flight(tp); BUG_ON(tcp_skb_pcount(skb) <= 1 || @@ -1106,8 +1110,8 @@ static int tcp_tso_should_defer(struct s /* If a full-sized TSO skb can be sent, do it. */ if (limit >= 65536) - return 0; - + goto send_now; + if (sysctl_tcp_tso_win_divisor) { u32 chunk = min(tp->snd_wnd, tp->snd_cwnd * tp->mss_cache); @@ -1116,7 +1120,7 @@ static int tcp_tso_should_defer(struct s */ chunk /= sysctl_tcp_tso_win_divisor; if (limit >= chunk) - return 0; + goto send_now; } else { /* Different approach, try not to defer past a single * ACK. Receiver should ACK every other full sized @@ -1124,11 +1128,17 @@ static int tcp_tso_should_defer(struct s * then send now. */ if (limit > tcp_max_burst(tp) * tp->mss_cache) - return 0; + goto send_now; } - + /* Ok, it looks like it is advisable to defer. */ + tp->tso_deferred = 1 | (jiffies<<1); + return 1; + +send_now: + tp->tso_deferred = 0; + return 0; } /* Create a new MTU probe if we are ready. ^ permalink raw reply related [flat|nested] 32+ messages in thread
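The trickiest part of the patch above is how tso_deferred packs two things into one 32-bit word: the low bit records that a deferral is in progress, and the upper 31 bits record jiffies at the moment deferral started, which is why both sides of the comparison are shifted left and then right by one. Below is a minimal user-space sketch of that encoding and of the "give up once more than one tick has elapsed" test the changelog describes; the helper names are illustrative, not from the kernel tree.

#include <stdint.h>
#include <stdbool.h>

/* Start a deferral: low bit = "deferring", upper 31 bits = jiffies now. */
static uint32_t defer_start(uint32_t jiffies_now)
{
        return 1u | (jiffies_now << 1);
}

/* True once a recorded deferral has outlived the two-tick budget from the
 * changelog, i.e. more than one full jiffy has elapsed since it started.
 * Both values are truncated to 31 bits, mirroring the shifts in the patch. */
static bool defer_expired(uint32_t deferred, uint32_t jiffies_now)
{
        if (!deferred)
                return false;   /* no deferral recorded yet */
        return ((jiffies_now << 1) >> 1) - (deferred >> 1) > 1;
}

In the patch itself the check sits at the top of tcp_tso_should_defer(), and the send_now path clears tso_deferred so the next deferral starts a fresh budget.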
* Re: [PATCH] Bound TSO defer time (resend) 2006-10-17 0:53 [PATCH] Bound TSO defer time (resend) John Heffner @ 2006-10-17 3:20 ` Stephen Hemminger 2006-10-17 4:18 ` John Heffner 0 siblings, 1 reply; 32+ messages in thread From: Stephen Hemminger @ 2006-10-17 3:20 UTC (permalink / raw) To: John Heffner; +Cc: netdev On Mon, 16 Oct 2006 20:53:20 -0400 (EDT) John Heffner <jheffner@psc.edu> wrote: > The original message didn't show up on the list. I'm assuming it's > because the filters didn't like the attached postscript. I posted PDFs of > the figures on the web: > > http://www.psc.edu/~jheffner/tmp/a.pdf > http://www.psc.edu/~jheffner/tmp/b.pdf > http://www.psc.edu/~jheffner/tmp/c.pdf > > -John > > > ---------- Forwarded message ---------- > Date: Mon, 16 Oct 2006 15:55:53 -0400 (EDT) > From: John Heffner <jheffner@psc.edu> > To: David Miller <davem@davemloft.net> > Cc: netdev <netdev@vger.kernel.org> > Subject: [PATCH] Bound TSO defer time > > This patch limits the amount of time you will defer sending a TSO segment > to less than two clock ticks, or the time between two acks, whichever is > longer. > > On slow links, deferring causes significant bursts. See attached plots, > which show RTT through a 1 Mbps link with a 100 ms RTT and ~100 ms queue > for (a) non-TSO, (b) currnet TSO, and (c) patched TSO. This burstiness > causes significant jitter, tends to overflow queues early (bad for short > queues), and makes delay-based congestion control more difficult. > > Deferring by a couple clock ticks I believe will have a relatively small > impact on performance. > > > Signed-off-by: John Heffner <jheffner@psc.edu> Okay, but doing any timing on clock ticks makes the behavior dependent on the value of HZ which doesn't seem desirable. Should this be based on RTT or a real-time values? ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] Bound TSO defer time (resend) 2006-10-17 3:20 ` Stephen Hemminger @ 2006-10-17 4:18 ` John Heffner 2006-10-17 5:35 ` David Miller 2006-10-18 15:37 ` [PATCH] Bound TSO defer time (resend) Andi Kleen 0 siblings, 2 replies; 32+ messages in thread From: John Heffner @ 2006-10-17 4:18 UTC (permalink / raw) To: Stephen Hemminger; +Cc: netdev Stephen Hemminger wrote: > On Mon, 16 Oct 2006 20:53:20 -0400 (EDT) > John Heffner <jheffner@psc.edu> wrote: >> This patch limits the amount of time you will defer sending a TSO segment >> to less than two clock ticks, or the time between two acks, whichever is >> longer. > > Okay, but doing any timing on clock ticks makes the behavior dependent > on the value of HZ which doesn't seem desirable. Should this be based > on RTT or a real-time values? It would be nice to use a high res clock so you don't depend on HZ, but this is still expensive on most SMP arch's as I understand it. -John ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] Bound TSO defer time (resend) 2006-10-17 4:18 ` John Heffner @ 2006-10-17 5:35 ` David Miller 2006-10-17 12:22 ` John Heffner 2006-10-17 12:58 ` [PATCH] [NET] Size listen hash tables using backlog hint Eric Dumazet Hi 2006-10-18 15:37 ` [PATCH] Bound TSO defer time (resend) Andi Kleen 1 sibling, 2 replies; 32+ messages in thread From: David Miller @ 2006-10-17 5:35 UTC (permalink / raw) To: jheffner; +Cc: shemminger, netdev From: John Heffner <jheffner@psc.edu> Date: Tue, 17 Oct 2006 00:18:33 -0400 > Stephen Hemminger wrote: > > On Mon, 16 Oct 2006 20:53:20 -0400 (EDT) > > John Heffner <jheffner@psc.edu> wrote: > > >> This patch limits the amount of time you will defer sending a TSO segment > >> to less than two clock ticks, or the time between two acks, whichever is > >> longer. > > > > > Okay, but doing any timing on clock ticks makes the behavior dependent > > on the value of HZ which doesn't seem desirable. Should this be based > > on RTT or a real-time values? > > It would be nice to use a high res clock so you don't depend on HZ, but > this is still expensive on most SMP arch's as I understand it. Right so we do need to use a jiffies based solution. Since HZ is variable, I have a feeling that the thing to do here is pick some timeout in msec. Then replace the "2 clock ticks" with some msec_to_jiffies() calls, bottoming out at 1 jiffie. How does that sound? ^ permalink raw reply [flat|nested] 32+ messages in thread
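A minimal sketch of that suggestion, assuming a hypothetical TSO_DEFER_MS constant (the thread never settles on a number) and the kernel's msecs_to_jiffies() helper; clamping to one jiffy keeps the bound meaningful at low HZ:

/* Hypothetical, HZ-independent variant of the two-tick budget. */
#define TSO_DEFER_MS    2       /* illustrative value only */

static unsigned long tso_defer_budget_jiffies(void)
{
        unsigned long j = msecs_to_jiffies(TSO_DEFER_MS);

        return j ? j : 1;       /* bottom out at one jiffy, as suggested */
}

The deferral check would then compare the stored jiffies stamp against this budget instead of the hard-coded "> 1".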
* Re: [PATCH] Bound TSO defer time (resend) 2006-10-17 5:35 ` David Miller @ 2006-10-17 12:22 ` John Heffner 2006-10-19 3:39 ` David Miller 2006-10-17 12:58 ` [PATCH] [NET] Size listen hash tables using backlog hint Eric Dumazet Hi 1 sibling, 1 reply; 32+ messages in thread From: John Heffner @ 2006-10-17 12:22 UTC (permalink / raw) To: David Miller; +Cc: shemminger, netdev David Miller wrote: > From: John Heffner <jheffner@psc.edu> > Date: Tue, 17 Oct 2006 00:18:33 -0400 > >> Stephen Hemminger wrote: >>> On Mon, 16 Oct 2006 20:53:20 -0400 (EDT) >>> John Heffner <jheffner@psc.edu> wrote: >>>> This patch limits the amount of time you will defer sending a TSO segment >>>> to less than two clock ticks, or the time between two acks, whichever is >>>> longer. >>> Okay, but doing any timing on clock ticks makes the behavior dependent >>> on the value of HZ which doesn't seem desirable. Should this be based >>> on RTT or a real-time values? >> It would be nice to use a high res clock so you don't depend on HZ, but >> this is still expensive on most SMP arch's as I understand it. > > Right so we do need to use a jiffies based solution. > > Since HZ is variable, I have a feeling that the thing to do here > is pick some timeout in msec. Then replace the "2 clock ticks" > with some msec_to_jiffies() calls, bottoming out at 1 jiffie. > > How does that sound? That's actually how I originally coded it. :) But then it occurred to me that if you've already been waiting for a full clock tick, the marginal CPU savings of waiting longer will not be great. Which is why I chose the value of 2 ticks so you're guaranteed to have waited at least one full tick. -John ^ permalink raw reply [flat|nested] 32+ messages in thread
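To make the two-tick argument concrete: jiffies only identifies which tick interval you are in, so a deferral recorded at jiffies == N may actually have begun almost a full tick period after tick N fired. Insisting that the jiffies difference exceed 1 means deferral stops being allowed only once jiffies reaches N + 2, by which point at least one complete tick period, and at most two, has really elapsed; that is the guarantee John describes.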
* Re: [PATCH] Bound TSO defer time (resend) 2006-10-17 12:22 ` John Heffner @ 2006-10-19 3:39 ` David Miller 0 siblings, 0 replies; 32+ messages in thread From: David Miller @ 2006-10-19 3:39 UTC (permalink / raw) To: jheffner; +Cc: shemminger, netdev From: John Heffner <jheffner@psc.edu> Date: Tue, 17 Oct 2006 08:22:11 -0400 > That's actually how I originally coded it. :) But then it occurred to > me that if you've already been waiting for a full clock tick, the > marginal CPU savings of waiting longer will not be great. Which is why > I chose the value of 2 ticks so you're guaranteed to have waited at > least one full tick. Fair enough, patch applied, thanks. BTW, like some others using Thunderbird to send patches, lines with nothing but spaces are being corrupted into fully empty lines. In fact, in your patch some trailing whitespace of existing code lines was also eliminated by Thunderbird, further corrupting the patch. I fixed this all up by hand, but please try to get this fixed up for future submissions. Thanks. ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH] [NET] Size listen hash tables using backlog hint 2006-10-17 5:35 ` David Miller 2006-10-17 12:22 ` John Heffner @ 2006-10-17 12:58 ` Eric Dumazet Hi 2006-10-18 7:38 ` [PATCH] [NET] inet_peer : group together avl_left, avl_right, v4daddr to speedup lookups on some CPUS Eric Dumazet ` (2 more replies) 1 sibling, 3 replies; 32+ messages in thread From: Eric Dumazet Hi @ 2006-10-17 12:58 UTC (permalink / raw) To: David Miller; +Cc: netdev [-- Attachment #1: Type: text/plain, Size: 1891 bytes --] Hi David We currently allocate a fixed size 512 (TCP_SYNQ_HSIZE) slots hash table for each LISTEN socket, regardless of various parameters (listen backlog for example) On x86_64, this means order-1 allocations (might fail), even for 'small' sockets, expecting few connections. On the contrary, a huge server wanting a backlog of 50000 is slowed down a bit because of this fixed limit. This patch makes the sizing of listen hash table a dynamic parameter, depending of : - net.core.somaxconn tunable (/proc/sys/net/core/somaxconn , default is 128) - net.ipv4.tcp_max_syn_backlog tunable (default : 256, 1024 or 128) - backlog value given by user application (2nd parameter of listen()) - and available LOWMEM ram reqsk_queue_alloc() goal is to use a power of two size for the whole listen_sock structure, to avoid wasting memory for large backlogs, meaning the hash table nr_table_entries is not anymore a power of two. (Hence one AND (nr_table_entries - 1) must be replaced by MODULO nr_table_entries) We still limit memory allocation with the two existing tunables (somaxconn & tcp_max_syn_backlog). In case memory allocation has problems, reqsk_queue_alloc() reduces the size of the hash table to allow a successfull listen() call, without giving feedback to user application, as this 'backlog' was advisory. Thank you include/net/request_sock.h | 8 ++++---- include/net/tcp.h | 1 - net/core/request_sock.c | 39 +++++++++++++++++++++++++++++---------- net/dccp/ipv4.c | 2 +- net/dccp/proto.c | 6 +++--- net/ipv4/af_inet.c | 2 +- net/ipv4/inet_connection_sock.c | 8 +++++--- net/ipv4/tcp_ipv4.c | 6 +++--- net/ipv6/tcp_ipv6.c | 2 +- 9 files changed, 47 insertions(+), 27 deletions(-) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> [-- Attachment #2: size_listen_hash_table.patch --] [-- Type: text/plain, Size: 7676 bytes --] --- linux-2.6.19-rc2/net/core/request_sock.c 2006-10-13 18:25:04.000000000 +0200 +++ linux-2.6.19-rc2-ed/net/core/request_sock.c 2006-10-17 14:47:48.000000000 +0200 @@ -29,29 +29,48 @@ * it is absolutely not enough even at 100conn/sec. 256 cures most * of problems. This value is adjusted to 128 for very small machines * (<=32Mb of memory) and to 1024 on normal or better ones (>=256Mb). - * Further increasing requires to change hash table size. 
*/ int sysctl_max_syn_backlog = 256; int reqsk_queue_alloc(struct request_sock_queue *queue, - const int nr_table_entries) + u32 nr_entries) { - const int lopt_size = sizeof(struct listen_sock) + - nr_table_entries * sizeof(struct request_sock *); - struct listen_sock *lopt = kzalloc(lopt_size, GFP_KERNEL); + struct listen_sock *lopt; + size_t size = sizeof(struct listen_sock); - if (lopt == NULL) - return -ENOMEM; + nr_entries = min_t(u32, nr_entries, sysctl_max_syn_backlog); + nr_entries = max_t(u32, nr_entries, 8); + size += nr_entries*sizeof(struct request_sock *); + size = roundup_pow_of_two(size); + while (1) { + lopt = kzalloc(size, GFP_KERNEL); + if (lopt != NULL) + break; + size >>= 1; + if (size < sizeof(struct listen_sock) + + 8 * sizeof(struct request_sock *)) + return -ENOMEM; + } + lopt->nr_table_entries = (size - sizeof(struct listen_sock)) / + sizeof(struct request_sock *); - for (lopt->max_qlen_log = 6; - (1 << lopt->max_qlen_log) < sysctl_max_syn_backlog; + /* + * max_qlen_log computation is based on the backlog (nr_entries), + * not on actual hash size (lopt->nr_table_entries). + */ + for (lopt->max_qlen_log = 3; + (1 << lopt->max_qlen_log) < nr_entries; lopt->max_qlen_log++); get_random_bytes(&lopt->hash_rnd, sizeof(lopt->hash_rnd)); rwlock_init(&queue->syn_wait_lock); queue->rskq_accept_head = NULL; - lopt->nr_table_entries = nr_table_entries; + /* + * This write_lock_bh()/write_unlock_bh() pair forces this CPU to commit + * its memory changes and let readers (which acquire syn_wait_lock in + * reader mode) operate without seeing random content. + */ write_lock_bh(&queue->syn_wait_lock); queue->listen_opt = lopt; write_unlock_bh(&queue->syn_wait_lock); --- linux-2.6.19-rc2/net/ipv4/af_inet.c 2006-10-13 18:25:04.000000000 +0200 +++ linux-2.6.19-rc2-ed/net/ipv4/af_inet.c 2006-10-17 10:32:22.000000000 +0200 @@ -204,7 +204,7 @@ * we can only allow the backlog to be adjusted. 
*/ if (old_state != TCP_LISTEN) { - err = inet_csk_listen_start(sk, TCP_SYNQ_HSIZE); + err = inet_csk_listen_start(sk, backlog); if (err) goto out; } --- linux-2.6.19-rc2/net/ipv4/tcp_ipv4.c 2006-10-13 18:25:04.000000000 +0200 +++ linux-2.6.19-rc2-ed/net/ipv4/tcp_ipv4.c 2006-10-17 12:19:38.000000000 +0200 @@ -715,7 +715,7 @@ return dopt; } -struct request_sock_ops tcp_request_sock_ops = { +struct request_sock_ops tcp_request_sock_ops __read_mostly = { .family = PF_INET, .obj_size = sizeof(struct tcp_request_sock), .rtx_syn_ack = tcp_v4_send_synack, @@ -1385,7 +1385,7 @@ if (st->state == TCP_SEQ_STATE_OPENREQ) { struct request_sock *req = cur; - icsk = inet_csk(st->syn_wait_sk); + icsk = inet_csk(st->syn_wait_sk); req = req->dl_next; while (1) { while (req) { @@ -1395,7 +1395,7 @@ } req = req->dl_next; } - if (++st->sbucket >= TCP_SYNQ_HSIZE) + if (++st->sbucket >= icsk->icsk_accept_queue.listen_opt->nr_table_entries) break; get_req: req = icsk->icsk_accept_queue.listen_opt->syn_table[st->sbucket]; --- linux-2.6.19-rc2/net/dccp/proto.c 2006-10-13 18:25:04.000000000 +0200 +++ linux-2.6.19-rc2-ed/net/dccp/proto.c 2006-10-17 10:32:22.000000000 +0200 @@ -262,12 +262,12 @@ EXPORT_SYMBOL_GPL(dccp_destroy_sock); -static inline int dccp_listen_start(struct sock *sk) +static inline int dccp_listen_start(struct sock *sk, int backlog) { struct dccp_sock *dp = dccp_sk(sk); dp->dccps_role = DCCP_ROLE_LISTEN; - return inet_csk_listen_start(sk, TCP_SYNQ_HSIZE); + return inet_csk_listen_start(sk, backlog); } int dccp_disconnect(struct sock *sk, int flags) @@ -788,7 +788,7 @@ * FIXME: here it probably should be sk->sk_prot->listen_start * see tcp_listen_start */ - err = dccp_listen_start(sk); + err = dccp_listen_start(sk, backlog); if (err) goto out; } --- linux-2.6.19-rc2/net/dccp/ipv4.c 2006-10-13 18:25:04.000000000 +0200 +++ linux-2.6.19-rc2-ed/net/dccp/ipv4.c 2006-10-17 10:44:21.000000000 +0200 @@ -1020,7 +1020,7 @@ kfree(inet_rsk(req)->opt); } -static struct request_sock_ops dccp_request_sock_ops = { +static struct request_sock_ops dccp_request_sock_ops _read_mostly = { .family = PF_INET, .obj_size = sizeof(struct dccp_request_sock), .rtx_syn_ack = dccp_v4_send_response, --- linux-2.6.19-rc2/net/ipv6/tcp_ipv6.c 2006-10-13 18:25:04.000000000 +0200 +++ linux-2.6.19-rc2-ed/net/ipv6/tcp_ipv6.c 2006-10-17 10:44:21.000000000 +0200 @@ -526,7 +526,7 @@ kfree_skb(inet6_rsk(req)->pktopts); } -static struct request_sock_ops tcp6_request_sock_ops = { +static struct request_sock_ops tcp6_request_sock_ops _read_mostly = { .family = AF_INET6, .obj_size = sizeof(struct tcp6_request_sock), .rtx_syn_ack = tcp_v6_send_synack, --- linux-2.6.19-rc2/net/ipv4/inet_connection_sock.c 2006-10-13 18:25:04.000000000 +0200 +++ linux-2.6.19-rc2-ed/net/ipv4/inet_connection_sock.c 2006-10-17 10:32:22.000000000 +0200 @@ -343,9 +343,9 @@ EXPORT_SYMBOL_GPL(inet_csk_route_req); static inline u32 inet_synq_hash(const __be32 raddr, const __be16 rport, - const u32 rnd, const u16 synq_hsize) + const u32 rnd, const u32 synq_hsize) { - return jhash_2words((__force u32)raddr, (__force u32)rport, rnd) & (synq_hsize - 1); + return jhash_2words((__force u32)raddr, (__force u32)rport, rnd) % synq_hsize; } #if defined(CONFIG_IPV6) || defined(CONFIG_IPV6_MODULE) @@ -478,7 +478,9 @@ reqp = &req->dl_next; } - i = (i + 1) & (lopt->nr_table_entries - 1); + i++; + if (i == lopt->nr_table_entries) + i = 0; } while (--budget > 0); --- linux-2.6.19-rc2/include/net/tcp.h 2006-10-13 18:25:04.000000000 +0200 +++ linux-2.6.19-rc2-ed/include/net/tcp.h 2006-10-17 
10:51:51.000000000 +0200 @@ -138,7 +138,6 @@ #define MAX_TCP_SYNCNT 127 #define TCP_SYNQ_INTERVAL (HZ/5) /* Period of SYNACK timer */ -#define TCP_SYNQ_HSIZE 512 /* Size of SYNACK hash table */ #define TCP_PAWS_24DAYS (60 * 60 * 24 * 24) #define TCP_PAWS_MSL 60 /* Per-host timestamps are invalidated --- linux-2.6.19-rc2/include/net/request_sock.h 2006-10-13 18:25:04.000000000 +0200 +++ linux-2.6.19-rc2-ed/include/net/request_sock.h 2006-10-17 12:33:18.000000000 +0200 @@ -28,8 +28,8 @@ struct request_sock_ops { int family; - kmem_cache_t *slab; int obj_size; + kmem_cache_t *slab; int (*rtx_syn_ack)(struct sock *sk, struct request_sock *req, struct dst_entry *dst); @@ -51,12 +51,12 @@ u32 rcv_wnd; /* rcv_wnd offered first time */ u32 ts_recent; unsigned long expires; - struct request_sock_ops *rsk_ops; + const struct request_sock_ops *rsk_ops; struct sock *sk; u32 secid; }; -static inline struct request_sock *reqsk_alloc(struct request_sock_ops *ops) +static inline struct request_sock *reqsk_alloc(const struct request_sock_ops *ops) { struct request_sock *req = kmem_cache_alloc(ops->slab, SLAB_ATOMIC); @@ -120,7 +120,7 @@ }; extern int reqsk_queue_alloc(struct request_sock_queue *queue, - const int nr_table_entries); + unsigned int nr_table_entries); static inline struct listen_sock *reqsk_queue_yank_listen_sk(struct request_sock_queue *queue) { ^ permalink raw reply [flat|nested] 32+ messages in thread
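The sizing logic the changelog describes, modelled in stand-alone C so the "not a power of two any more" point is visible. The 8-entry floor and the tcp_max_syn_backlog cap come from the patch; the names below are illustrative, and the kernel patch additionally retries with a smaller table when kzalloc() fails, which this sketch omits.

#include <stddef.h>
#include <stdint.h>

static size_t round_up_pow2(size_t x)
{
        size_t p = 1;

        while (p < x)
                p <<= 1;
        return p;
}

/* How many request_sock pointers fit once the whole listen_sock allocation
 * (header + N slots, allocated as one block) is rounded to a power of two. */
static uint32_t synq_slots(uint32_t backlog, uint32_t max_syn_backlog,
                           size_t header, size_t slot)
{
        uint32_t n = backlog;
        size_t size;

        if (n > max_syn_backlog)
                n = max_syn_backlog;    /* tunable cap */
        if (n < 8)
                n = 8;                  /* floor from the patch */
        size = round_up_pow2(header + n * slot);
        /* The table gets whatever room is left after the header, so the
         * slot count is generally *not* a power of two -- hence the switch
         * from "hash & (entries - 1)" to "hash % entries" in the patch. */
        return (size - header) / slot;
}

For example, assuming a 32-byte header and 8-byte pointers, a backlog of 256 gives 32 + 256*8 = 2080 bytes, rounded up to 4096, leaving room for 508 slots: not a power of two, which is what triggers the modulus discussion below.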
* [PATCH] [NET] inet_peer : group together avl_left, avl_right, v4daddr to speedup lookups on some CPUS 2006-10-17 12:58 ` [PATCH] [NET] Size listen hash tables using backlog hint Eric Dumazet Hi @ 2006-10-18 7:38 ` Eric Dumazet 2006-10-18 16:35 ` [PATCH] [NET] reduce per cpu ram used for loopback stats Eric Dumazet ` (2 more replies) 2006-10-19 3:31 ` [PATCH] [NET] Size listen hash tables using backlog hint David Miller 2006-10-19 9:27 ` Eric Dumazet 2 siblings, 3 replies; 32+ messages in thread From: Eric Dumazet @ 2006-10-18 7:38 UTC (permalink / raw) To: David Miller; +Cc: netdev [-- Attachment #1: Type: text/plain, Size: 261 bytes --] Hi David Lot of routers still use CPUS with 32 bytes cache lines. (Intel PIII) It make sense to make sure fields used at lookup time are in the same cache line, to reduce cache footprint and speedup lookups. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> [-- Attachment #2: inetpeer_speedup.patch --] [-- Type: text/plain, Size: 773 bytes --] --- linux/include/net/inetpeer.h 2006-10-18 09:30:12.000000000 +0200 +++ linux-ed/include/net/inetpeer.h 2006-10-18 09:32:17.000000000 +0200 @@ -17,14 +17,15 @@ struct inet_peer { + /* group together avl_left,avl_right,v4daddr to speedup lookups */ struct inet_peer *avl_left, *avl_right; + __u32 v4daddr; /* peer's address */ + __u16 avl_height; + __u16 ip_id_count; /* IP ID for the next packet */ struct inet_peer *unused_next, **unused_prevp; __u32 dtime; /* the time of last use of not * referenced entries */ atomic_t refcnt; - __u32 v4daddr; /* peer's address */ - __u16 avl_height; - __u16 ip_id_count; /* IP ID for the next packet */ atomic_t rid; /* Frag reception counter */ __u32 tcp_ts; unsigned long tcp_ts_stamp; ^ permalink raw reply [flat|nested] 32+ messages in thread
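A quick way to sanity-check the point of the reordering: on both 32-bit and 64-bit layouts the lookup fields (plus the two small fields that ride along) now end within the first 32 bytes of the object. A user-space approximation, assuming a straight mapping of the struct above:

#include <stddef.h>
#include <stdint.h>
#include <assert.h>

struct inet_peer_layout {               /* illustrative copy, not the kernel struct */
        struct inet_peer_layout *avl_left, *avl_right;
        uint32_t v4daddr;
        uint16_t avl_height;
        uint16_t ip_id_count;
        /* colder fields (unused list, dtime, refcnt, ...) follow */
};

/* 2 pointers + 4 + 2 + 2 bytes: 16 on 32-bit, 24 on 64-bit -- both fit in a
 * 32-byte cache line, which is the whole point of the patch. */
static_assert(offsetof(struct inet_peer_layout, ip_id_count) +
              sizeof(uint16_t) <= 32,
              "lookup fields spill out of a 32-byte line");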
* [PATCH] [NET] reduce per cpu ram used for loopback stats 2006-10-18 7:38 ` [PATCH] [NET] inet_peer : group together avl_left, avl_right, v4daddr to speedup lookups on some CPUS Eric Dumazet @ 2006-10-18 16:35 ` Eric Dumazet 2006-10-18 17:00 ` [PATCH, resent] " Eric Dumazet 2006-10-19 3:53 ` [PATCH] " David Miller 2006-10-19 3:44 ` [PATCH] [NET] inet_peer : group together avl_left, avl_right, v4daddr to speedup lookups on some CPUS David Miller 2006-10-19 10:57 ` Eric Dumazet 2 siblings, 2 replies; 32+ messages in thread From: Eric Dumazet @ 2006-10-18 16:35 UTC (permalink / raw) To: David Miller; +Cc: netdev [-- Attachment #1: Type: text/plain, Size: 276 bytes --] We dont need a full struct net_device_stats (currently 23 long : 184 bytes on x86_64) per possible CPU, but only two counters : bytes and packets We save few CPU cycles too in loopback_xmit() not updating 4 fields, but 2. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> [-- Attachment #2: loopback.patch --] [-- Type: text/plain, Size: 1893 bytes --] --- linux/drivers/net/loopback.c 2006-10-18 17:28:20.000000000 +0200 +++ linux-eddrivers/net/loopback.c 2006-10-18 18:26:41.000000000 +0200 @@ -58,7 +58,11 @@ #include <linux/tcp.h> #include <linux/percpu.h> -static DEFINE_PER_CPU(struct net_device_stats, loopback_stats); +struct pcpu_lstats { + unsigned long packets; + unsigned long bytes; +}; +static DEFINE_PER_CPU(struct pcpu_lstats, pcpu_lstats); #define LOOPBACK_OVERHEAD (128 + MAX_HEADER + 16 + 16) @@ -128,7 +132,7 @@ */ static int loopback_xmit(struct sk_buff *skb, struct net_device *dev) { - struct net_device_stats *lb_stats; + struct pcpu_lstats *lb_stats; skb_orphan(skb); @@ -149,11 +153,9 @@ #endif dev->last_rx = jiffies; - lb_stats = &per_cpu(loopback_stats, get_cpu()); - lb_stats->rx_bytes += skb->len; - lb_stats->tx_bytes = lb_stats->rx_bytes; - lb_stats->rx_packets++; - lb_stats->tx_packets = lb_stats->rx_packets; + lb_stats = &per_cpu(pcpu_lstats, get_cpu()); + lb_stats->bytes += skb->len; + lb_stats->packets++; put_cpu(); netif_rx(skb); @@ -166,20 +168,21 @@ static struct net_device_stats *get_stats(struct net_device *dev) { struct net_device_stats *stats = &loopback_stats; + unsigned long bytes = 0; + unsigned long packets = 0; int i; - memset(stats, 0, sizeof(struct net_device_stats)); - for_each_possible_cpu(i) { - struct net_device_stats *lb_stats; + const struct pcpu_lstats *lb_stats; - lb_stats = &per_cpu(loopback_stats, i); - stats->rx_bytes += lb_stats->rx_bytes; - stats->tx_bytes += lb_stats->tx_bytes; - stats->rx_packets += lb_stats->rx_packets; - stats->tx_packets += lb_stats->tx_packets; + lb_stats = &per_cpu(pcpu_lstats, i); + bytes += lb_stats->bytes; + packets += lb_stats->packets; } - + stats->rx_packets = packets; + stats->tx_packets = packets; + stats->rx_bytes = bytes; + stats->tx_bytes = bytes; return stats; } ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH, resent] [NET] reduce per cpu ram used for loopback stats 2006-10-18 16:35 ` [PATCH] [NET] reduce per cpu ram used for loopback stats Eric Dumazet @ 2006-10-18 17:00 ` Eric Dumazet 2006-10-19 3:53 ` David Miller 2006-10-19 3:53 ` [PATCH] " David Miller 1 sibling, 1 reply; 32+ messages in thread From: Eric Dumazet @ 2006-10-18 17:00 UTC (permalink / raw) To: David Miller; +Cc: netdev [-- Attachment #1: Type: text/plain, Size: 403 bytes --] Sorry David, the previous attachment had a missing / in one filename [NET] reduce per cpu ram used for loopback device stats We dont need a full struct net_device_stats (currently 23 long : 184 bytes on x86_64) per possible CPU, but only two counters : bytes and packets We save few CPU cycles too in loopback_xmit() not updating 4 fields, but 2. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> [-- Attachment #2: loopback.patch --] [-- Type: text/plain, Size: 1894 bytes --] --- linux/drivers/net/loopback.c 2006-10-18 17:28:20.000000000 +0200 +++ linux-ed/drivers/net/loopback.c 2006-10-18 18:26:41.000000000 +0200 @@ -58,7 +58,11 @@ #include <linux/tcp.h> #include <linux/percpu.h> -static DEFINE_PER_CPU(struct net_device_stats, loopback_stats); +struct pcpu_lstats { + unsigned long packets; + unsigned long bytes; +}; +static DEFINE_PER_CPU(struct pcpu_lstats, pcpu_lstats); #define LOOPBACK_OVERHEAD (128 + MAX_HEADER + 16 + 16) @@ -128,7 +132,7 @@ */ static int loopback_xmit(struct sk_buff *skb, struct net_device *dev) { - struct net_device_stats *lb_stats; + struct pcpu_lstats *lb_stats; skb_orphan(skb); @@ -149,11 +153,9 @@ #endif dev->last_rx = jiffies; - lb_stats = &per_cpu(loopback_stats, get_cpu()); - lb_stats->rx_bytes += skb->len; - lb_stats->tx_bytes = lb_stats->rx_bytes; - lb_stats->rx_packets++; - lb_stats->tx_packets = lb_stats->rx_packets; + lb_stats = &per_cpu(pcpu_lstats, get_cpu()); + lb_stats->bytes += skb->len; + lb_stats->packets++; put_cpu(); netif_rx(skb); @@ -166,20 +168,21 @@ static struct net_device_stats *get_stats(struct net_device *dev) { struct net_device_stats *stats = &loopback_stats; + unsigned long bytes = 0; + unsigned long packets = 0; int i; - memset(stats, 0, sizeof(struct net_device_stats)); - for_each_possible_cpu(i) { - struct net_device_stats *lb_stats; + const struct pcpu_lstats *lb_stats; - lb_stats = &per_cpu(loopback_stats, i); - stats->rx_bytes += lb_stats->rx_bytes; - stats->tx_bytes += lb_stats->tx_bytes; - stats->rx_packets += lb_stats->rx_packets; - stats->tx_packets += lb_stats->tx_packets; + lb_stats = &per_cpu(pcpu_lstats, i); + bytes += lb_stats->bytes; + packets += lb_stats->packets; } - + stats->rx_packets = packets; + stats->tx_packets = packets; + stats->rx_bytes = bytes; + stats->tx_bytes = bytes; return stats; } ^ permalink raw reply [flat|nested] 32+ messages in thread
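The shape of the change, reduced to a stand-alone model: each CPU bumps its own two counters locklessly on transmit, and the full net_device_stats is only synthesized when somebody reads the stats. Names and the fixed CPU count below are illustrative, not kernel code.

#define NR_CPUS_DEMO 4

struct pcpu_lstats_demo { unsigned long packets, bytes; };
static struct pcpu_lstats_demo demo_stats[NR_CPUS_DEMO];

/* hot path: the transmitting CPU touches only its own pair of counters */
static void demo_xmit(int cpu, unsigned long len)
{
        demo_stats[cpu].bytes   += len;
        demo_stats[cpu].packets += 1;
}

/* slow path: fold the per-CPU pairs into totals only when stats are read;
 * rx and tx are identical for loopback, so one pair per CPU is enough */
static void demo_fold(unsigned long *packets, unsigned long *bytes)
{
        *packets = *bytes = 0;
        for (int i = 0; i < NR_CPUS_DEMO; i++) {
                *packets += demo_stats[i].packets;
                *bytes   += demo_stats[i].bytes;
        }
}

On x86_64 this shrinks the per-CPU footprint from the 184-byte net_device_stats quoted in the changelog to 16 bytes, and the transmit path updates two fields instead of four.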
* Re: [PATCH, resent] [NET] reduce per cpu ram used for loopback stats 2006-10-18 17:00 ` [PATCH, resent] " Eric Dumazet @ 2006-10-19 3:53 ` David Miller 0 siblings, 0 replies; 32+ messages in thread From: David Miller @ 2006-10-19 3:53 UTC (permalink / raw) To: dada1; +Cc: netdev From: Eric Dumazet <dada1@cosmosbay.com> Date: Wed, 18 Oct 2006 19:00:03 +0200 > Sorry David, the previous attachment had a missing / in one filename Hehe, and I read this after replying to you about that :-) ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] [NET] reduce per cpu ram used for loopback stats 2006-10-18 16:35 ` [PATCH] [NET] reduce per cpu ram used for loopback stats Eric Dumazet 2006-10-18 17:00 ` [PATCH, resent] " Eric Dumazet @ 2006-10-19 3:53 ` David Miller 1 sibling, 0 replies; 32+ messages in thread From: David Miller @ 2006-10-19 3:53 UTC (permalink / raw) To: dada1; +Cc: netdev From: Eric Dumazet <dada1@cosmosbay.com> Date: Wed, 18 Oct 2006 18:35:48 +0200 Applied Eric, but the file paths in your patch were bogus and needed to be fixed up: > --- linux/drivers/net/loopback.c 2006-10-18 17:28:20.000000000 +0200 > +++ linux-eddrivers/net/loopback.c 2006-10-18 18:26:41.000000000 +0200 This would never apply, since "-p1" patch treatment would use "net/loopback.c" as the patch which is obviously wrong and should be "drivers/net/loopback.c" I don't know what you use to generate patches, but you might stand to gain from using some automated tools for this :-) ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] [NET] inet_peer : group together avl_left, avl_right, v4daddr to speedup lookups on some CPUS 2006-10-18 7:38 ` [PATCH] [NET] inet_peer : group together avl_left, avl_right, v4daddr to speedup lookups on some CPUS Eric Dumazet 2006-10-18 16:35 ` [PATCH] [NET] reduce per cpu ram used for loopback stats Eric Dumazet @ 2006-10-19 3:44 ` David Miller 2006-10-19 10:57 ` Eric Dumazet 2 siblings, 0 replies; 32+ messages in thread From: David Miller @ 2006-10-19 3:44 UTC (permalink / raw) To: dada1; +Cc: netdev From: Eric Dumazet <dada1@cosmosbay.com> Date: Wed, 18 Oct 2006 09:38:38 +0200 > Lot of routers still use CPUS with 32 bytes cache lines. (Intel PIII) > It make sense to make sure fields used at lookup time are in the same cache > line, to reduce cache footprint and speedup lookups. > > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Looks fine but patch doesn't apply: - __u32 v4daddr; /* peer's address */ in my tree v4daddr is a __be32, not a __u32. ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH] [NET] inet_peer : group together avl_left, avl_right, v4daddr to speedup lookups on some CPUS 2006-10-18 7:38 ` [PATCH] [NET] inet_peer : group together avl_left, avl_right, v4daddr to speedup lookups on some CPUS Eric Dumazet 2006-10-18 16:35 ` [PATCH] [NET] reduce per cpu ram used for loopback stats Eric Dumazet 2006-10-19 3:44 ` [PATCH] [NET] inet_peer : group together avl_left, avl_right, v4daddr to speedup lookups on some CPUS David Miller @ 2006-10-19 10:57 ` Eric Dumazet 2006-10-19 15:45 ` [PATCH] [NET] One NET_INC_STATS() could be NET_INC_STATS_BH in tcp_v4_err() Eric Dumazet 2006-10-20 7:28 ` [PATCH] [NET] inet_peer : group together avl_left, avl_right, v4daddr to speedup lookups on some CPUS David Miller 2 siblings, 2 replies; 32+ messages in thread From: Eric Dumazet @ 2006-10-19 10:57 UTC (permalink / raw) To: David Miller; +Cc: netdev [-- Attachment #1: Type: text/plain, Size: 306 bytes --] Hi David Lot of routers/embedded devices still use CPUS with 16/32 bytes cache lines. (486, Pentium, ... PIII) It makes sense to group together fields used at lookup time so they fit in one cache line. This reduce cache footprint and speedup lookups. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> [-- Attachment #2: inetpeer_speedup.patch --] [-- Type: text/plain, Size: 781 bytes --] --- net-2.6/include/net/inetpeer.h 2006-10-19 12:50:29.000000000 +0200 +++ net-2.6-ed/include/net/inetpeer.h 2006-10-19 12:52:08.000000000 +0200 @@ -17,14 +17,15 @@ struct inet_peer { + /* group together avl_left,avl_right,v4daddr to speedup lookups */ struct inet_peer *avl_left, *avl_right; + __be32 v4daddr; /* peer's address */ + __u16 avl_height; + __u16 ip_id_count; /* IP ID for the next packet */ struct inet_peer *unused_next, **unused_prevp; __u32 dtime; /* the time of last use of not * referenced entries */ atomic_t refcnt; - __be32 v4daddr; /* peer's address */ - __u16 avl_height; - __u16 ip_id_count; /* IP ID for the next packet */ atomic_t rid; /* Frag reception counter */ __u32 tcp_ts; unsigned long tcp_ts_stamp; ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH] [NET] One NET_INC_STATS() could be NET_INC_STATS_BH in tcp_v4_err() 2006-10-19 10:57 ` Eric Dumazet @ 2006-10-19 15:45 ` Eric Dumazet 2006-10-20 7:22 ` David Miller 2006-10-20 7:28 ` [PATCH] [NET] inet_peer : group together avl_left, avl_right, v4daddr to speedup lookups on some CPUS David Miller 1 sibling, 1 reply; 32+ messages in thread From: Eric Dumazet @ 2006-10-19 15:45 UTC (permalink / raw) To: David Miller; +Cc: netdev [-- Attachment #1: Type: text/plain, Size: 150 bytes --] I believe this NET_INC_STATS() call can be replaced by NET_INC_STATS_BH(), a little bit cheaper. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> [-- Attachment #2: OUTOFWINDOWICMPS.patch --] [-- Type: text/plain, Size: 383 bytes --] --- linux/net/ipv4/tcp_ipv4.c.orig 2006-10-19 17:37:22.000000000 +0200 +++ linux-ed/net/ipv4/tcp_ipv4.c 2006-10-19 17:37:43.000000000 +0200 @@ -373,7 +373,7 @@ seq = ntohl(th->seq); if (sk->sk_state != TCP_LISTEN && !between(seq, tp->snd_una, tp->snd_nxt)) { - NET_INC_STATS(LINUX_MIB_OUTOFWINDOWICMPS); + NET_INC_STATS_BH(LINUX_MIB_OUTOFWINDOWICMPS); goto out; } ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] [NET] One NET_INC_STATS() could be NET_INC_STATS_BH in tcp_v4_err() 2006-10-19 15:45 ` [PATCH] [NET] One NET_INC_STATS() could be NET_INC_STATS_BH in tcp_v4_err() Eric Dumazet @ 2006-10-20 7:22 ` David Miller 2006-10-20 14:21 ` Arnaldo Carvalho de Melo 0 siblings, 1 reply; 32+ messages in thread From: David Miller @ 2006-10-20 7:22 UTC (permalink / raw) To: dada1; +Cc: netdev From: Eric Dumazet <dada1@cosmosbay.com> Date: Thu, 19 Oct 2006 17:45:26 +0200 > I believe this NET_INC_STATS() call can be replaced by NET_INC_STATS_BH(), a > little bit cheaper. > > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Applied, although I hope tcp_v4_err() never becomes a fast path :-) ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] [NET] One NET_INC_STATS() could be NET_INC_STATS_BH in tcp_v4_err() 2006-10-20 7:22 ` David Miller @ 2006-10-20 14:21 ` Arnaldo Carvalho de Melo 0 siblings, 0 replies; 32+ messages in thread From: Arnaldo Carvalho de Melo @ 2006-10-20 14:21 UTC (permalink / raw) To: David Miller; +Cc: dada1, netdev On 10/20/06, David Miller <davem@davemloft.net> wrote: > From: Eric Dumazet <dada1@cosmosbay.com> > Date: Thu, 19 Oct 2006 17:45:26 +0200 > > > I believe this NET_INC_STATS() call can be replaced by NET_INC_STATS_BH(), a > > little bit cheaper. > > > > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> > > Applied, although I hope tcp_v4_err() never becomes a fast path :-) I'll queue a cset in my net-2.6 tree to do the equivalent for dccp_v4_err() :-) - Arnaldo ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] [NET] inet_peer : group together avl_left, avl_right, v4daddr to speedup lookups on some CPUS 2006-10-19 10:57 ` Eric Dumazet 2006-10-19 15:45 ` [PATCH] [NET] One NET_INC_STATS() could be NET_INC_STATS_BH in tcp_v4_err() Eric Dumazet @ 2006-10-20 7:28 ` David Miller 1 sibling, 0 replies; 32+ messages in thread From: David Miller @ 2006-10-20 7:28 UTC (permalink / raw) To: dada1; +Cc: netdev From: Eric Dumazet <dada1@cosmosbay.com> Date: Thu, 19 Oct 2006 12:57:42 +0200 > Lot of routers/embedded devices still use CPUS with 16/32 bytes cache lines. > (486, Pentium, ... PIII) > It makes sense to group together fields used at lookup time so they fit in one > cache line. > This reduce cache footprint and speedup lookups. > > Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Applied, thanks Eric. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] [NET] Size listen hash tables using backlog hint 2006-10-17 12:58 ` [PATCH] [NET] Size listen hash tables using backlog hint Eric Dumazet Hi 2006-10-18 7:38 ` [PATCH] [NET] inet_peer : group together avl_left, avl_right, v4daddr to speedup lookups on some CPUS Eric Dumazet @ 2006-10-19 3:31 ` David Miller 2006-10-19 4:54 ` Stephen Hemminger 2006-10-19 5:12 ` Eric Dumazet 2006-10-19 9:27 ` Eric Dumazet 2 siblings, 2 replies; 32+ messages in thread From: David Miller @ 2006-10-19 3:31 UTC (permalink / raw) To: dada1; +Cc: netdev From: Eric Dumazet Hi <dada1@cosmosbay.com> Date: Tue, 17 Oct 2006 14:58:37 +0200 > reqsk_queue_alloc() goal is to use a power of two size for the whole > listen_sock structure, to avoid wasting memory for large backlogs, > meaning the hash table nr_table_entries is not anymore a power of > two. (Hence one AND (nr_table_entries - 1) must be replaced by > MODULO nr_table_entries) Modulus can be very expensive for some small/slow cpus. Please round down to a power-of-2 instead of up if you think the wastage really matters. Thanks. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] [NET] Size listen hash tables using backlog hint 2006-10-19 3:31 ` [PATCH] [NET] Size listen hash tables using backlog hint David Miller @ 2006-10-19 4:54 ` Stephen Hemminger 2006-10-19 5:08 ` David Miller 2006-10-19 5:12 ` Eric Dumazet 1 sibling, 1 reply; 32+ messages in thread From: Stephen Hemminger @ 2006-10-19 4:54 UTC (permalink / raw) To: David Miller; +Cc: dada1, netdev David Miller wrote: > From: Eric Dumazet Hi <dada1@cosmosbay.com> > Date: Tue, 17 Oct 2006 14:58:37 +0200 > > >> reqsk_queue_alloc() goal is to use a power of two size for the whole >> listen_sock structure, to avoid wasting memory for large backlogs, >> meaning the hash table nr_table_entries is not anymore a power of >> two. (Hence one AND (nr_table_entries - 1) must be replaced by >> MODULO nr_table_entries) >> > > Modulus can be very expensive for some small/slow cpus. Please round > down to a power-of-2 instead of up if you think the wastage really > matters. > Reminds me, anyone know why GCC is too stupid to convert modulus of a constant power of 2 into a mask operation? ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] [NET] Size listen hash tables using backlog hint 2006-10-19 4:54 ` Stephen Hemminger @ 2006-10-19 5:08 ` David Miller 0 siblings, 0 replies; 32+ messages in thread From: David Miller @ 2006-10-19 5:08 UTC (permalink / raw) To: shemminger; +Cc: dada1, netdev From: Stephen Hemminger <shemminger@osdl.org> Date: Wed, 18 Oct 2006 21:54:08 -0700 > Reminds me, anyone know why GCC is too stupid to convert modulus of a > constant power of 2 into a mask operation? If the computation ends up being signed it can't perform this optimization. ^ permalink raw reply [flat|nested] 32+ messages in thread
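Two one-line functions make the difference visible. With an unsigned operand the compiler can strength-reduce the modulo by a constant power of two into a mask; with a signed operand the C remainder must carry the sign of the dividend (e.g. -1 % 512 is -1, not 511), so a plain mask would be wrong and extra fixup code, or a real division, gets emitted:

unsigned int umod(unsigned int h) { return h % 512; }  /* typically compiles to h & 511 */
int smod(int h)                   { return h % 512; }  /* must preserve sign, no plain mask */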
* Re: [PATCH] [NET] Size listen hash tables using backlog hint 2006-10-19 3:31 ` [PATCH] [NET] Size listen hash tables using backlog hint David Miller 2006-10-19 4:54 ` Stephen Hemminger @ 2006-10-19 5:12 ` Eric Dumazet 2006-10-19 6:12 ` David Miller 1 sibling, 1 reply; 32+ messages in thread From: Eric Dumazet @ 2006-10-19 5:12 UTC (permalink / raw) To: David Miller; +Cc: netdev David Miller a écrit : > From: Eric Dumazet Hi <dada1@cosmosbay.com> > Date: Tue, 17 Oct 2006 14:58:37 +0200 > >> reqsk_queue_alloc() goal is to use a power of two size for the whole >> listen_sock structure, to avoid wasting memory for large backlogs, >> meaning the hash table nr_table_entries is not anymore a power of >> two. (Hence one AND (nr_table_entries - 1) must be replaced by >> MODULO nr_table_entries) > > Modulus can be very expensive for some small/slow cpus. Please round > down to a power-of-2 instead of up if you think the wastage really > matters. > > Thanks. I am not sure I understand your points. Rounding up or down still need the modulus. Only the size changes by a two factor. I feel you want me to remove the modulus, thats unrelated to rounding. A 66 MHz 486 can perform 1.000.000 divisions per second. Is it a 'slow' cpu ? If we stay with a power-of-two, say 2^X hash slots, using (2^X)*sizeof(void*), the extra bits added by struct listen_sock will *need* the same amount of memory, because of kmalloc() alignment to next power-of-two. That basically wastes half of the ram taken by struct listen_sock allocation, unless we add yet another pointer to hash table and do two kmallocs(), one for pure power-of-two hash table, one for struct listen_sock. If we keep current scheme, the current max kmalloc size of 131072 bytes would limit us to 65536 bytes for the hash table itself, so 8192 slots on 64bits platforms. I was expecting to use a 16380 slots hash size instead. The modulus is done on two places : inet_csk_search_req() : called from tcp_v4_err()/dccp_v4_err() only after checks. Frequency of such events is rather low. tcp_v4_hnd_req() : called from tcp_v4_do_rcv() for TCP_LISTEN state. Frequency of such events is rather low, especially on machines driven by small/slow cpus... inet_csk_reqsk_queue_hash_add()called from tcp_v4_conn_request() when a new connection attempt is stored in hash table. Thats in normal conditions two modulus done per new tcp/dccp sessions establishments. In DOS situation, I doubt the extra cycles will do any difference. So... what do you prefer : 1) Keep the modulus 2) allocate two blocks of ram (powser-of -two hash size, but one extra indirection) 3) waste near half of ram because one block allocated, and power-of-two hash size. Thank you Eric ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] [NET] Size listen hash tables using backlog hint 2006-10-19 5:12 ` Eric Dumazet @ 2006-10-19 6:12 ` David Miller 2006-10-19 6:34 ` Eric Dumazet 0 siblings, 1 reply; 32+ messages in thread From: David Miller @ 2006-10-19 6:12 UTC (permalink / raw) To: dada1; +Cc: netdev From: Eric Dumazet <dada1@cosmosbay.com> Date: Thu, 19 Oct 2006 07:12:58 +0200 > A 66 MHz 486 can perform 1.000.000 divisions per second. Is it a 'slow' cpu ? Sparc and some other embedded chips have no division/modulus integer instruction and do it in software. > So... what do you prefer : > > 1) Keep the modulus > 2) allocate two blocks of ram (powser-of -two hash size, but one extra > indirection) > 3) waste near half of ram because one block allocated, and power-of-two hash size. I thought the problem was that you use a modulus and non-power-of-2 hash table size because rounding up to the next power of 2 wastes a lot of space? Given that, my suggestion is simply to not round up to the next power-of-2, or only do so when we are very very close to that next power-of-2. ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] [NET] Size listen hash tables using backlog hint 2006-10-19 6:12 ` David Miller @ 2006-10-19 6:34 ` Eric Dumazet 2006-10-19 6:57 ` David Miller 0 siblings, 1 reply; 32+ messages in thread From: Eric Dumazet @ 2006-10-19 6:34 UTC (permalink / raw) To: David Miller; +Cc: netdev David Miller a écrit : > From: Eric Dumazet <dada1@cosmosbay.com> > Date: Thu, 19 Oct 2006 07:12:58 +0200 > >> A 66 MHz 486 can perform 1.000.000 divisions per second. Is it a 'slow' cpu ? > > Sparc and some other embedded chips have no division/modulus integer > instruction and do it in software. How many times this division will be done ? As I said, tcp session establishment. Are you aware a division is done in slab code when you kfree() one network frames ? That is much more problematic than SYN packets. > >> So... what do you prefer : >> >> 1) Keep the modulus >> 2) allocate two blocks of ram (powser-of -two hash size, but one extra >> indirection) >> 3) waste near half of ram because one block allocated, and power-of-two hash size. > > I thought the problem was that you use a modulus and non-power-of-2 > hash table size because rounding up to the next power of 2 wastes > a lot of space? Given that, my suggestion is simply to not round > up to the next power-of-2, or only do so when we are very very close > to that next power-of-2. My main problem is being able to use a large hash table on big servers. With power-of two constraint, plus kmalloc max size constraint, we can use half the size we could. Are you suggesting something like : Allocation time: ---------------- if (cpu is very_very_slow or hash size small_enough) { ptr->size = power_of_too; ptr->size_mask = (power_of_two - 1); } else { ptr->size = somevalue; ptr->size_mask = ~0; } Lookup time : --------------- if (ptr->size_mask != ~0) slot = hash & ptr->size_mask; else slot = hash % ptr->size; The extra conditional branch may be more expensive than just doing division on 99% of cpus... ^ permalink raw reply [flat|nested] 32+ messages in thread
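Eric's pseudocode above, written out compilably for clarity (this is the alternative he is arguing against, not code from any tree): remember a mask when the table size is a power of two, and only fall back to a real division for odd sizes.

struct synq_demo {
        unsigned int size;      /* number of hash slots */
        unsigned int mask;      /* size - 1 if size is a power of two, else ~0u */
};

static inline unsigned int synq_slot(const struct synq_demo *q, unsigned int hash)
{
        if (q->mask != ~0u)
                return hash & q->mask;  /* power-of-two table: cheap */
        return hash % q->size;          /* odd-sized table: division */
}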
* Re: [PATCH] [NET] Size listen hash tables using backlog hint 2006-10-19 6:34 ` Eric Dumazet @ 2006-10-19 6:57 ` David Miller 2006-10-19 8:29 ` Eric Dumazet 0 siblings, 1 reply; 32+ messages in thread From: David Miller @ 2006-10-19 6:57 UTC (permalink / raw) To: dada1; +Cc: netdev From: Eric Dumazet <dada1@cosmosbay.com> Date: Thu, 19 Oct 2006 08:34:53 +0200 > My main problem is being able to use a large hash table on big servers. > > With power-of two constraint, plus kmalloc max size constraint, we can use > half the size we could. Switch to vmalloc() at the kmalloc() cut-off point, just like I did for the other hashes in the tree. > Are you suggesting something like : Not at all. BTW, this all reminds me that we need to be careful that this isn't allowing arbitrary users to eat up a ton of unswappable ram. It's pretty easy to open up a lot of listening sockets :) ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] [NET] Size listen hash tables using backlog hint 2006-10-19 6:57 ` David Miller @ 2006-10-19 8:29 ` Eric Dumazet 2006-10-19 8:41 ` David Miller 0 siblings, 1 reply; 32+ messages in thread From: Eric Dumazet @ 2006-10-19 8:29 UTC (permalink / raw) To: David Miller; +Cc: netdev On Thursday 19 October 2006 08:57, David Miller wrote: > Switch to vmalloc() at the kmalloc() cut-off point, just like > I did for the other hashes in the tree. Yes, so you basically want option 4) :) 4) Use vmalloc() if size_lopt > PAGE_SIZE keep a power_of two : nr_table_entries = 2 ^ X; size_lopt = sizeof(listen_sock) + nr_table_entries*sizeof(void*); if (size > PAGE_SIZE) ptr = vmalloc(size_lopt); else ptr = kmalloc(size_lopt); Pros : Only under one page is wasted (ie allocated but not used) vmalloc() is nicer for NUMA, so I am pleased :) vmalloc() has more chances to succeed when memory is fragmented keep a power-of-two hash table size Cons : TLB cost // for reference struct listen_sock { u8 max_qlen_log; /* 3 bytes hole, try to use */ int qlen; int qlen_young; int clock_hand; u32 hash_rnd; u32 nr_table_entries; struct request_sock *syn_table[0]; /* hash table follow this header */ }; > BTW, this all reminds me that we need to be careful that this > isn't allowing arbitrary users to eat up a ton of unswappable > ram. It's pretty easy to open up a lot of listening sockets :) With actual somaxconn=128 limit, my patch ends in allocating less ram (half of a page) than current x86_64 kernel (2 pages) Thank you ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] [NET] Size listen hash tables using backlog hint 2006-10-19 8:29 ` Eric Dumazet @ 2006-10-19 8:41 ` David Miller 2006-10-19 9:11 ` Eric Dumazet 0 siblings, 1 reply; 32+ messages in thread From: David Miller @ 2006-10-19 8:41 UTC (permalink / raw) To: dada1; +Cc: netdev From: Eric Dumazet <dada1@cosmosbay.com> Date: Thu, 19 Oct 2006 10:29:00 +0200 > Cons : > TLB cost For those hot x86 and x86_64 cpus you tend to be using, this particular cost is relatively small. :-) It's effectively like another memory reference in the worst case, in the best case it's "free". > With actual somaxconn=128 limit, my patch ends in allocating less > ram (half of a page) than current x86_64 kernel (2 pages) Understood. But the issue is that there are greater security implications than before when increasing this sysctl. To be honest, it's probably water under the bridge, because if you can stuff up SOMAXCONN number of sockets into the system per listening socket which is a lot more than the hash table eats up. :-) ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] [NET] Size listen hash tables using backlog hint 2006-10-19 8:41 ` David Miller @ 2006-10-19 9:11 ` Eric Dumazet 0 siblings, 0 replies; 32+ messages in thread From: Eric Dumazet @ 2006-10-19 9:11 UTC (permalink / raw) To: David Miller; +Cc: netdev On Thursday 19 October 2006 10:41, David Miller wrote: > From: Eric Dumazet <dada1@cosmosbay.com> > Date: Thu, 19 Oct 2006 10:29:00 +0200 > > > Cons : > > TLB cost > > For those hot x86 and x86_64 cpus you tend to be using, this > particular cost is relatively small. :-) It's effectively like > another memory reference in the worst case, in the best case > it's "free". Well, it was a private joke with you, as you *use* machines that take a fault on a TLB miss :) BTW I do care of old machines too... ^ permalink raw reply [flat|nested] 32+ messages in thread
* [PATCH] [NET] Size listen hash tables using backlog hint 2006-10-17 12:58 ` [PATCH] [NET] Size listen hash tables using backlog hint Eric Dumazet Hi 2006-10-18 7:38 ` [PATCH] [NET] inet_peer : group together avl_left, avl_right, v4daddr to speedup lookups on some CPUS Eric Dumazet 2006-10-19 3:31 ` [PATCH] [NET] Size listen hash tables using backlog hint David Miller @ 2006-10-19 9:27 ` Eric Dumazet 2006-10-20 7:27 ` David Miller 2 siblings, 1 reply; 32+ messages in thread From: Eric Dumazet @ 2006-10-19 9:27 UTC (permalink / raw) To: David Miller; +Cc: netdev [-- Attachment #1: Type: text/plain, Size: 1562 bytes --] Hi David Here is the second try for this patch. Many thanks for your feedback. [PATCH] [NET] Size listen hash tables using backlog hint We currently allocate a fixed size 512 (TCP_SYNQ_HSIZE) slots hash table for each LISTEN socket, regardless of various parameters (listen backlog for example) On x86_64, this means order-1 allocations (might fail), even for 'small' sockets, expecting few connections. On the contrary, a huge server wanting a backlog of 50000 is slowed down a bit because of this fixed limit. This patch makes the sizing of listen hash table a dynamic parameter, depending of : - net.core.somaxconn tunable (default is 128) - net.ipv4.tcp_max_syn_backlog tunable (default : 256, 1024 or 128) - backlog value given by user application (2nd parameter of listen()) For large allocations (bigger than PAGE_SIZE), we use vmalloc() instead of kmalloc(). We still limit memory allocation with the two existing tunables (somaxconn & tcp_max_syn_backlog). include/net/request_sock.h | 8 ++++---- include/net/tcp.h | 1 - net/core/request_sock.c | 38 +++++++++++++++++++++++++++++--------- net/dccp/ipv4.c | 2 +- net/dccp/proto.c | 6 +++--- net/ipv4/af_inet.c | 2 +- net/ipv4/inet_connection_sock.c | 2 +- net/ipv4/tcp_ipv4.c | 6 +++--- net/ipv6/tcp_ipv6.c | 2 +- 9 files changed, 43 insertions(+), 24 deletions(-) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> [-- Attachment #2: size_listen_hash_table.patch --] [-- Type: text/plain, Size: 7933 bytes --] --- linux-2.6.19-rc2/net/core/request_sock.c 2006-10-13 18:25:04.000000000 +0200 +++ linux-2.6.19-rc2-ed/net/core/request_sock.c 2006-10-19 11:05:56.000000000 +0200 @@ -15,6 +15,7 @@ #include <linux/random.h> #include <linux/slab.h> #include <linux/string.h> +#include <linux/vmalloc.h> #include <net/request_sock.h> @@ -29,22 +30,31 @@ * it is absolutely not enough even at 100conn/sec. 256 cures most * of problems. This value is adjusted to 128 for very small machines * (<=32Mb of memory) and to 1024 on normal or better ones (>=256Mb). - * Further increasing requires to change hash table size. + * Note : Dont forget somaxconn that may limit backlog too. 
*/ int sysctl_max_syn_backlog = 256; int reqsk_queue_alloc(struct request_sock_queue *queue, - const int nr_table_entries) + unsigned int nr_table_entries) { - const int lopt_size = sizeof(struct listen_sock) + - nr_table_entries * sizeof(struct request_sock *); - struct listen_sock *lopt = kzalloc(lopt_size, GFP_KERNEL); + size_t lopt_size = sizeof(struct listen_sock); + struct listen_sock *lopt; + nr_table_entries = min_t(u32, nr_table_entries, sysctl_max_syn_backlog); + nr_table_entries = max_t(u32, nr_table_entries, 8); + nr_table_entries = roundup_pow_of_two(nr_table_entries + 1); + lopt_size += nr_table_entries * sizeof(struct request_sock *); + if (lopt_size > PAGE_SIZE) + lopt = __vmalloc(lopt_size, + GFP_KERNEL | __GFP_HIGHMEM | __GFP_ZERO, + PAGE_KERNEL); + else + lopt = kzalloc(lopt_size, GFP_KERNEL); if (lopt == NULL) return -ENOMEM; - for (lopt->max_qlen_log = 6; - (1 << lopt->max_qlen_log) < sysctl_max_syn_backlog; + for (lopt->max_qlen_log = 3; + (1 << lopt->max_qlen_log) < nr_table_entries; lopt->max_qlen_log++); get_random_bytes(&lopt->hash_rnd, sizeof(lopt->hash_rnd)); @@ -52,6 +62,11 @@ queue->rskq_accept_head = NULL; lopt->nr_table_entries = nr_table_entries; + /* + * This write_lock_bh()/write_unlock_bh() pair forces this CPU to commit + * its memory changes and let readers (which acquire syn_wait_lock in + * reader mode) operate without seeing random content. + */ write_lock_bh(&queue->syn_wait_lock); queue->listen_opt = lopt; write_unlock_bh(&queue->syn_wait_lock); @@ -65,9 +80,11 @@ { /* make all the listen_opt local to us */ struct listen_sock *lopt = reqsk_queue_yank_listen_sk(queue); + size_t lopt_size = sizeof(struct listen_sock) + + lopt->nr_table_entries * sizeof(struct request_sock *); if (lopt->qlen != 0) { - int i; + unsigned int i; for (i = 0; i < lopt->nr_table_entries; i++) { struct request_sock *req; @@ -81,7 +98,10 @@ } BUG_TRAP(lopt->qlen == 0); - kfree(lopt); + if (lopt_size > PAGE_SIZE) + vfree(lopt); + else + kfree(lopt); } EXPORT_SYMBOL(reqsk_queue_destroy); --- linux-2.6.19-rc2/net/ipv4/af_inet.c 2006-10-13 18:25:04.000000000 +0200 +++ linux-2.6.19-rc2-ed/net/ipv4/af_inet.c 2006-10-17 10:32:22.000000000 +0200 @@ -204,7 +204,7 @@ * we can only allow the backlog to be adjusted. 
 	 */
 	if (old_state != TCP_LISTEN) {
-		err = inet_csk_listen_start(sk, TCP_SYNQ_HSIZE);
+		err = inet_csk_listen_start(sk, backlog);
 		if (err)
 			goto out;
 	}
--- linux-2.6.19-rc2/net/ipv4/tcp_ipv4.c	2006-10-13 18:25:04.000000000 +0200
+++ linux-2.6.19-rc2-ed/net/ipv4/tcp_ipv4.c	2006-10-17 12:19:38.000000000 +0200
@@ -715,7 +715,7 @@
 	return dopt;
 }

-struct request_sock_ops tcp_request_sock_ops = {
+struct request_sock_ops tcp_request_sock_ops __read_mostly = {
 	.family		= PF_INET,
 	.obj_size	= sizeof(struct tcp_request_sock),
 	.rtx_syn_ack	= tcp_v4_send_synack,
@@ -1385,7 +1385,7 @@
 	if (st->state == TCP_SEQ_STATE_OPENREQ) {
 		struct request_sock *req = cur;

-		icsk = inet_csk(st->syn_wait_sk);
+		icsk = inet_csk(st->syn_wait_sk);
 		req = req->dl_next;
 		while (1) {
 			while (req) {
@@ -1395,7 +1395,7 @@
 				}
 				req = req->dl_next;
 			}
-			if (++st->sbucket >= TCP_SYNQ_HSIZE)
+			if (++st->sbucket >= icsk->icsk_accept_queue.listen_opt->nr_table_entries)
 				break;
get_req:
 			req = icsk->icsk_accept_queue.listen_opt->syn_table[st->sbucket];
--- linux-2.6.19-rc2/net/dccp/proto.c	2006-10-13 18:25:04.000000000 +0200
+++ linux-2.6.19-rc2-ed/net/dccp/proto.c	2006-10-17 10:32:22.000000000 +0200
@@ -262,12 +262,12 @@

 EXPORT_SYMBOL_GPL(dccp_destroy_sock);

-static inline int dccp_listen_start(struct sock *sk)
+static inline int dccp_listen_start(struct sock *sk, int backlog)
 {
 	struct dccp_sock *dp = dccp_sk(sk);

 	dp->dccps_role = DCCP_ROLE_LISTEN;
-	return inet_csk_listen_start(sk, TCP_SYNQ_HSIZE);
+	return inet_csk_listen_start(sk, backlog);
 }

 int dccp_disconnect(struct sock *sk, int flags)
@@ -788,7 +788,7 @@
 	 * FIXME: here it probably should be sk->sk_prot->listen_start
 	 * see tcp_listen_start
 	 */
-		err = dccp_listen_start(sk);
+		err = dccp_listen_start(sk, backlog);
 		if (err)
 			goto out;
 	}
--- linux-2.6.19-rc2/net/dccp/ipv4.c	2006-10-13 18:25:04.000000000 +0200
+++ linux-2.6.19-rc2-ed/net/dccp/ipv4.c	2006-10-17 10:44:21.000000000 +0200
@@ -1020,7 +1020,7 @@
 	kfree(inet_rsk(req)->opt);
 }

-static struct request_sock_ops dccp_request_sock_ops = {
+static struct request_sock_ops dccp_request_sock_ops __read_mostly = {
 	.family		= PF_INET,
 	.obj_size	= sizeof(struct dccp_request_sock),
 	.rtx_syn_ack	= dccp_v4_send_response,
--- linux-2.6.19-rc2/net/ipv6/tcp_ipv6.c	2006-10-13 18:25:04.000000000 +0200
+++ linux-2.6.19-rc2-ed/net/ipv6/tcp_ipv6.c	2006-10-17 10:44:21.000000000 +0200
@@ -526,7 +526,7 @@
 	kfree_skb(inet6_rsk(req)->pktopts);
 }

-static struct request_sock_ops tcp6_request_sock_ops = {
+static struct request_sock_ops tcp6_request_sock_ops __read_mostly = {
 	.family		= AF_INET6,
 	.obj_size	= sizeof(struct tcp6_request_sock),
 	.rtx_syn_ack	= tcp_v6_send_synack,
--- linux-2.6.19-rc2/net/ipv4/inet_connection_sock.c	2006-10-13 18:25:04.000000000 +0200
+++ linux-2.6.19-rc2-ed/net/ipv4/inet_connection_sock.c	2006-10-19 10:51:26.000000000 +0200
@@ -343,7 +343,7 @@
 EXPORT_SYMBOL_GPL(inet_csk_route_req);

 static inline u32 inet_synq_hash(const __be32 raddr, const __be16 rport,
-				 const u32 rnd, const u16 synq_hsize)
+				 const u32 rnd, const u32 synq_hsize)
 {
 	return jhash_2words((__force u32)raddr, (__force u32)rport, rnd) & (synq_hsize - 1);
 }
--- linux-2.6.19-rc2/include/net/tcp.h	2006-10-13 18:25:04.000000000 +0200
+++ linux-2.6.19-rc2-ed/include/net/tcp.h	2006-10-17 10:51:51.000000000 +0200
@@ -138,7 +138,6 @@
 #define MAX_TCP_SYNCNT		127

 #define TCP_SYNQ_INTERVAL	(HZ/5)	/* Period of SYNACK timer */
-#define TCP_SYNQ_HSIZE		512	/* Size of SYNACK hash table */

 #define TCP_PAWS_24DAYS	(60 * 60 * 24 * 24)
 #define TCP_PAWS_MSL	60		/* Per-host timestamps are invalidated
--- linux-2.6.19-rc2/include/net/request_sock.h	2006-10-13 18:25:04.000000000 +0200
+++ linux-2.6.19-rc2-ed/include/net/request_sock.h	2006-10-17 12:33:18.000000000 +0200
@@ -28,8 +28,8 @@

 struct request_sock_ops {
 	int		family;
-	kmem_cache_t	*slab;
 	int		obj_size;
+	kmem_cache_t	*slab;
 	int		(*rtx_syn_ack)(struct sock *sk,
 				       struct request_sock *req,
 				       struct dst_entry *dst);
@@ -51,12 +51,12 @@
 	u32				rcv_wnd;	/* rcv_wnd offered first time */
 	u32				ts_recent;
 	unsigned long			expires;
-	struct request_sock_ops		*rsk_ops;
+	const struct request_sock_ops	*rsk_ops;
 	struct sock			*sk;
 	u32				secid;
 };

-static inline struct request_sock *reqsk_alloc(struct request_sock_ops *ops)
+static inline struct request_sock *reqsk_alloc(const struct request_sock_ops *ops)
 {
 	struct request_sock *req = kmem_cache_alloc(ops->slab, SLAB_ATOMIC);
@@ -120,7 +120,7 @@
 };

 extern int reqsk_queue_alloc(struct request_sock_queue *queue,
-			     const int nr_table_entries);
+			     unsigned int nr_table_entries);

 static inline struct listen_sock *reqsk_queue_yank_listen_sk(struct request_sock_queue *queue)
 {

^ permalink raw reply	[flat|nested] 32+ messages in thread
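The hunks above retire the fixed TCP_SYNQ_HSIZE and let the listen() backlog drive the SYN-queue hash size; inet_synq_hash() still reduces the hash with "& (synq_hsize - 1)", so whatever reqsk_queue_alloc() picks must be a power of two. The sizing code itself is not quoted in these hunks, so the standalone C sketch below only illustrates, under assumed lower/upper clamps, how a backlog hint could be rounded up to such a size; it is not the patch's actual reqsk_queue_alloc() logic.

/*
 * Hypothetical userspace sketch: derive a power-of-two SYN-queue hash
 * size from a listen() backlog hint.  The clamp values (8 and 1024)
 * are assumptions for illustration, not taken from the patch.
 */
#include <stdio.h>

static unsigned int roundup_pow_of_two(unsigned int x)
{
	unsigned int r = 1;

	while (r < x)
		r <<= 1;
	return r;
}

/* Turn the backlog hint into a power-of-two table size. */
static unsigned int synq_size_from_backlog(unsigned int backlog)
{
	unsigned int n = backlog;

	if (n > 1024)		/* assumed upper clamp (e.g. a sysctl limit) */
		n = 1024;
	if (n < 8)		/* assumed lower clamp */
		n = 8;
	return roundup_pow_of_two(n + 1);
}

int main(void)
{
	unsigned int backlogs[] = { 5, 128, 511, 1000, 20000 };
	unsigned int i;

	for (i = 0; i < sizeof(backlogs) / sizeof(backlogs[0]); i++)
		printf("backlog %5u -> %4u buckets (mask 0x%x)\n",
		       backlogs[i], synq_size_from_backlog(backlogs[i]),
		       synq_size_from_backlog(backlogs[i]) - 1);
	return 0;
}

With numbers like these, a small daemon listening with backlog 5 would get a 16-bucket table instead of the old unconditional 512, while a busy server asking for thousands of connections gets a table sized to its (clamped) backlog.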
* Re: [PATCH] [NET] Size listen hash tables using backlog hint 2006-10-19 9:27 ` Eric Dumazet @ 2006-10-20 7:27 ` David Miller 0 siblings, 0 replies; 32+ messages in thread From: David Miller @ 2006-10-20 7:27 UTC (permalink / raw) To: dada1; +Cc: netdev From: Eric Dumazet <dada1@cosmosbay.com> Date: Thu, 19 Oct 2006 11:27:50 +0200 > Here is the second try for this patch. Many thanks for your feedback. > > [PATCH] [NET] Size listen hash tables using backlog hint This version looks very good. It's not a major bug fix (obviously) so we'll have to defer it to 2.6.20, so please resubmit once I open up the net-2.6.20 tree. Thanks a lot! ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] Bound TSO defer time (resend) 2006-10-17 4:18 ` John Heffner 2006-10-17 5:35 ` David Miller @ 2006-10-18 15:37 ` Andi Kleen 2006-10-18 16:40 ` Stephen Hemminger 1 sibling, 1 reply; 32+ messages in thread From: Andi Kleen @ 2006-10-18 15:37 UTC (permalink / raw) To: John Heffner; +Cc: Stephen Hemminger, netdev On Tuesday 17 October 2006 06:18, John Heffner wrote: > Stephen Hemminger wrote: > > On Mon, 16 Oct 2006 20:53:20 -0400 (EDT) > > John Heffner <jheffner@psc.edu> wrote: > > >> This patch limits the amount of time you will defer sending a TSO segment > >> to less than two clock ticks, or the time between two acks, whichever is > >> longer. > > > > > Okay, but doing any timing on clock ticks makes the behavior dependent > > on the value of HZ which doesn't seem desirable. Should this be based > > on RTT or a real-time values? > > It would be nice to use a high res clock so you don't depend on HZ, but > this is still expensive on most SMP arch's as I understand it. You can always use xtime. It doesn't have better resolution than jiffies though, but it gives you real time. Drawback is that there is some work towards tickless kernels and with that xtime will be more expensive again. But hopefully not by that much. -Andi ^ permalink raw reply [flat|nested] 32+ messages in thread
* Re: [PATCH] Bound TSO defer time (resend) 2006-10-18 15:37 ` [PATCH] Bound TSO defer time (resend) Andi Kleen @ 2006-10-18 16:40 ` Stephen Hemminger 0 siblings, 0 replies; 32+ messages in thread From: Stephen Hemminger @ 2006-10-18 16:40 UTC (permalink / raw) To: Andi Kleen; +Cc: John Heffner, netdev On Wed, 18 Oct 2006 17:37:36 +0200 Andi Kleen <ak@suse.de> wrote: > On Tuesday 17 October 2006 06:18, John Heffner wrote: > > Stephen Hemminger wrote: > > > On Mon, 16 Oct 2006 20:53:20 -0400 (EDT) > > > John Heffner <jheffner@psc.edu> wrote: > > > > >> This patch limits the amount of time you will defer sending a TSO segment > > >> to less than two clock ticks, or the time between two acks, whichever is > > >> longer. > > > > > > > > Okay, but doing any timing on clock ticks makes the behavior dependent > > > on the value of HZ which doesn't seem desirable. Should this be based > > > on RTT or a real-time values? > > > > It would be nice to use a high res clock so you don't depend on HZ, but > > this is still expensive on most SMP arch's as I understand it. > > You can always use xtime. It doesn't have better resolution than jiffies > though, but it gives you real time. > > Drawback is that there is some work towards tickless kernels and with > that xtime will be more expensive again. But hopefully not by that much. > > -Andi Actually the thing to use now is ktime. It would then be compatible with hrtimers. But it seems a bit of overkill in this case. -- Stephen Hemminger <shemminger@osdl.org> ^ permalink raw reply [flat|nested] 32+ messages in thread
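To put numbers on the HZ dependence discussed above: a two-tick bound allows up to 20 ms of deferral at HZ=100 but only 2 ms at HZ=1000, whereas a deadline taken from a high-resolution clock (ktime/hrtimers, as suggested) would be the same on every configuration. The small userspace sketch below just prints that arithmetic; the 2 ms real-time figure is an assumed example, not a value from the patch.

/*
 * Userspace sketch of the HZ-dependence issue: the same tick-based
 * bound translates to very different wall-clock limits depending on
 * HZ, while a real-time bound does not.  Arithmetic only; this is not
 * kernel code.
 */
#include <stdio.h>

int main(void)
{
	const unsigned int hz_values[] = { 100, 250, 1000 };
	const unsigned int defer_ticks = 2;	/* bound used by the patch */
	const long ktime_bound_ns = 2000000;	/* assumed 2 ms real-time bound */
	unsigned int i;

	for (i = 0; i < sizeof(hz_values) / sizeof(hz_values[0]); i++)
		printf("HZ=%4u: %u ticks = %6.1f ms of allowed deferral\n",
		       hz_values[i], defer_ticks,
		       defer_ticks * 1000.0 / hz_values[i]);

	printf("fixed real-time bound: %.1f ms regardless of HZ\n",
	       ktime_bound_ns / 1e6);
	return 0;
}

At HZ=100 the tick-based limit already permits an order of magnitude more burst accumulation than at HZ=1000, which is exactly the configuration dependence a ktime-style deadline would remove.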
end of thread, other threads:[~2006-10-20 14:21 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2006-10-17  0:53 [PATCH] Bound TSO defer time (resend) John Heffner
2006-10-17  3:20 ` Stephen Hemminger
2006-10-17  4:18 ` John Heffner
2006-10-17  5:35 ` David Miller
2006-10-17 12:22 ` John Heffner
2006-10-19  3:39 ` David Miller
2006-10-17 12:58 ` [PATCH] [NET] Size listen hash tables using backlog hint Eric Dumazet
2006-10-18  7:38 ` [PATCH] [NET] inet_peer : group together avl_left, avl_right, v4daddr to speedup lookups on some CPUS Eric Dumazet
2006-10-18 16:35 ` [PATCH] [NET] reduce per cpu ram used for loopback stats Eric Dumazet
2006-10-18 17:00 ` [PATCH, resent] " Eric Dumazet
2006-10-19  3:53 ` David Miller
2006-10-19  3:53 ` [PATCH] " David Miller
2006-10-19  3:44 ` [PATCH] [NET] inet_peer : group together avl_left, avl_right, v4daddr to speedup lookups on some CPUS David Miller
2006-10-19 10:57 ` Eric Dumazet
2006-10-19 15:45 ` [PATCH] [NET] One NET_INC_STATS() could be NET_INC_STATS_BH in tcp_v4_err() Eric Dumazet
2006-10-20  7:22 ` David Miller
2006-10-20 14:21 ` Arnaldo Carvalho de Melo
2006-10-20  7:28 ` [PATCH] [NET] inet_peer : group together avl_left, avl_right, v4daddr to speedup lookups on some CPUS David Miller
2006-10-19  3:31 ` [PATCH] [NET] Size listen hash tables using backlog hint David Miller
2006-10-19  4:54 ` Stephen Hemminger
2006-10-19  5:08 ` David Miller
2006-10-19  5:12 ` Eric Dumazet
2006-10-19  6:12 ` David Miller
2006-10-19  6:34 ` Eric Dumazet
2006-10-19  6:57 ` David Miller
2006-10-19  8:29 ` Eric Dumazet
2006-10-19  8:41 ` David Miller
2006-10-19  9:11 ` Eric Dumazet
2006-10-19  9:27 ` Eric Dumazet
2006-10-20  7:27 ` David Miller
2006-10-18 15:37 ` [PATCH] Bound TSO defer time (resend) Andi Kleen
2006-10-18 16:40 ` Stephen Hemminger