netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] Customizable TCP backoff patch
@ 2006-09-27 18:52 Ben Woodard
  2006-09-27 23:16 ` David Miller
  0 siblings, 1 reply; 17+ messages in thread
From: Ben Woodard @ 2006-09-27 18:52 UTC (permalink / raw)
  To: netdev; +Cc: Mark A. Grondona, Brian Behlendorf

[-- Attachment #1: Type: text/plain, Size: 2800 bytes --]

Here at LLNL we have a rather challenging network environment on our 
clusters. We basically have 1000's of gigE links attached to an 
oversubscribed federated network. Most of the time this network is idle 
but the expected workload is for regular spikes extremely heavy activity 
lasting a few minutes. All end-points in a highly coordinated manor, 
typically after exiting an MPI barrier, start pushing as much data as 
possible through the oversubscribed core.  The result is a wave of TCP 
back-offs where all the TCP streams back-off in lock step. The network 
oscillates from highly congested for brief moments to largely idle. 
Given enough time TCP will settle down in to something mostly reasonable 
but even then it causes us a few problems:

1) It takes a long time for the network to settle in to a steady state 
and while it does network utilization is very poor.

2) Many of the sockets will rapidly back off to the maximum value.  This 
can lead to application level timeouts being triggered because we also 
initially calibrated the timeouts with the notion that 2 minute 
back-offs would be the exception and not the norm.

3) Once we reach steady state there's no guarantee of fairness between 
TCP streams.  For our workload this is particularly undesirable since 
the parallel job which kicked off all this activity must wait until the 
slowest transaction completes.  This translates in to 1000's of nodes 
sitting idle.

By knowing this is the expected workload on this dedicated network we
can safely make the back-offs more aggressive to mitigate most of these
issues.

We also played around with using a random seed when selecting the 
back-off interval to avoid all the sockets backing-off in lock step. 
That worked reasonably well but was more invasive then simply adding a 
few more tunable.

Because these are general utility clusters we run many different 
programs and so trying to fix this problem in the application is not 
possible since there are literally hundreds if not thousands of them.

We're more than willing to consider other approaches to handling this
particular workload better.  We've even considered that TCP isn't at all 
the right protocol but this affects several protocols including NFS and 
the benefits of running NFS over TCP are too great.

The original patch was prepared by Brian Behlendorf. He asked me to 
adapt it for current kernels keep it up to date and send upstream.

This may also help people like Andrew Athan which reported a similar 
problem a couple of days ago on the linux-net mailing list: 
http://www.uwsg.iu.edu/hypermail/linux/net/0609.3/0005.html I suspect 
that it is more common a case than is widely recognized.

Signed-off-by: Ben Woodard <woodard@redhat.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

[-- Attachment #2: tcp-backoff-18.patch --]
[-- Type: text/x-patch, Size: 6566 bytes --]

diff -ru linux-2.6.18/include/linux/sysctl.h linux-2.6.18.new/include/linux/sysctl.h
--- linux-2.6.18/include/linux/sysctl.h	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/include/linux/sysctl.h	2006-09-26 17:10:36.000000000 -0700
@@ -411,6 +411,8 @@
 	NET_IPV4_TCP_WORKAROUND_SIGNED_WINDOWS=115,
 	NET_TCP_DMA_COPYBREAK=116,
 	NET_TCP_SLOW_START_AFTER_IDLE=117,
+	NET_TCP_RTO_MAX=118,
+	NET_TCP_RTO_INIT=119,
 };
 
 enum {
Only in linux-2.6.18.new/include/linux: sysctl.h.orig
Only in linux-2.6.18.new/include/linux: sysctl.h.rej
diff -ru linux-2.6.18/include/linux/tcp.h linux-2.6.18.new/include/linux/tcp.h
--- linux-2.6.18/include/linux/tcp.h	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/include/linux/tcp.h	2006-09-26 17:08:33.000000000 -0700
@@ -94,6 +94,8 @@
 #define TCP_INFO		11	/* Information about this connection. */
 #define TCP_QUICKACK		12	/* Block/reenable quick acks */
 #define TCP_CONGESTION		13	/* Congestion control algorithm */
+#define TCP_BACKOFF_MAX         14      /* Maximum backoff value */
+#define TCP_BACKOFF_INIT        15      /* Initial backoff value */
 
 #define TCPI_OPT_TIMESTAMPS	1
 #define TCPI_OPT_SACK		2
@@ -257,6 +259,8 @@
 	__u8	frto_counter;	/* Number of new acks after RTO */
 	__u8	nonagle;	/* Disable Nagle algorithm?             */
 	__u8	keepalive_probes; /* num of allowed keep alive probes	*/
+        __u32   rto_max;        /* Maximum backoff value                */
+        __u32   rto_init;       /* Initial backoff value                */
 
 /* RTT measurement */
 	__u32	srtt;		/* smoothed round trip time << 3	*/
Only in linux-2.6.18.new/include/linux: tcp.h.orig
diff -ru linux-2.6.18/include/net/tcp.h linux-2.6.18.new/include/net/tcp.h
--- linux-2.6.18/include/net/tcp.h	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/include/net/tcp.h	2006-09-26 17:12:04.000000000 -0700
@@ -227,6 +227,8 @@
 extern int sysctl_tcp_base_mss;
 extern int sysctl_tcp_workaround_signed_windows;
 extern int sysctl_tcp_slow_start_after_idle;
+extern int sysctl_tcp_rto_max;
+extern int sysctl_tcp_rto_init;
 
 extern atomic_t tcp_memory_allocated;
 extern atomic_t tcp_sockets_allocated;
Only in linux-2.6.18.new/include/net: tcp.h.orig
Only in linux-2.6.18.new/include/net: tcp.h.rej
diff -ru linux-2.6.18/net/ipv4/sysctl_net_ipv4.c linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c
--- linux-2.6.18/net/ipv4/sysctl_net_ipv4.c	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c	2006-09-26 17:08:33.000000000 -0700
@@ -697,6 +697,22 @@
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec
 	},
+	{
+	        .ctl_name       = NET_TCP_RTO_MAX,
+		.procname       = "tcp_rto_max",
+		.data           = &sysctl_tcp_rto_max,
+		.maxlen         = sizeof(unsigned),
+		.mode           = 0644,
+		.proc_handler   = &proc_dointvec
+	},
+	{
+	        .ctl_name       = NET_TCP_RTO_INIT,
+		.procname       = "tcp_rto_init",
+		.data           = &sysctl_tcp_rto_init,
+		.maxlen         = sizeof(unsigned),
+		.mode           = 0644,
+		.proc_handler   = &proc_dointvec
+	},
 	{ .ctl_name = 0 }
 };
 
Only in linux-2.6.18.new/net/ipv4: sysctl_net_ipv4.c.orig
diff -ru linux-2.6.18/net/ipv4/tcp.c linux-2.6.18.new/net/ipv4/tcp.c
--- linux-2.6.18/net/ipv4/tcp.c	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/net/ipv4/tcp.c	2006-09-26 17:08:33.000000000 -0700
@@ -1939,6 +1939,21 @@
 		}
 		break;
 
+        case TCP_BACKOFF_MAX:
+                if (val < 1)
+                        err = -EINVAL;
+                else
+                        tp->rto_max = val * HZ;
+                break;
+ 
+        case TCP_BACKOFF_INIT:
+                if (val < 1)
+                        err = -EINVAL;
+                else
+                        tp->rto_init = val * HZ;
+                break;
+ 
+ 
 	default:
 		err = -ENOPROTOOPT;
 		break;
@@ -2110,6 +2125,12 @@
 		if (copy_to_user(optval, icsk->icsk_ca_ops->name, len))
 			return -EFAULT;
 		return 0;
+        case TCP_BACKOFF_MAX:
+                val = (tp->rto_max ? : sysctl_tcp_rto_max) / HZ;
+                break;
+        case TCP_BACKOFF_INIT:
+                val = (tp->rto_init ? : sysctl_tcp_rto_init) / HZ;
+                break;
 	default:
 		return -ENOPROTOOPT;
 	};
Only in linux-2.6.18.new/net/ipv4: tcp.c.orig
diff -ru linux-2.6.18/net/ipv4/tcp_timer.c linux-2.6.18.new/net/ipv4/tcp_timer.c
--- linux-2.6.18/net/ipv4/tcp_timer.c	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/net/ipv4/tcp_timer.c	2006-09-26 17:08:33.000000000 -0700
@@ -31,6 +31,8 @@
 int sysctl_tcp_retries1 = TCP_RETR1;
 int sysctl_tcp_retries2 = TCP_RETR2;
 int sysctl_tcp_orphan_retries;
+int sysctl_tcp_rto_max  = TCP_RTO_MAX;
+int sysctl_tcp_rto_init = TCP_TIMEOUT_INIT;
 
 static void tcp_write_timer(unsigned long);
 static void tcp_delack_timer(unsigned long);
@@ -71,7 +73,8 @@
 
 	/* If peer does not open window for long time, or did not transmit 
 	 * anything for long time, penalize it. */
-	if ((s32)(tcp_time_stamp - tp->lsndtime) > 2*TCP_RTO_MAX || !do_reset)
+	if ((s32)(tcp_time_stamp - tp->lsndtime) >
+	    2 * (tp->rto_max ? : sysctl_tcp_rto_max) || !do_reset)
 		orphans <<= 1;
 
 	/* If some dubious ICMP arrived, penalize even more. */
@@ -256,8 +259,8 @@
 	max_probes = sysctl_tcp_retries2;
 
 	if (sock_flag(sk, SOCK_DEAD)) {
-		const int alive = ((icsk->icsk_rto << icsk->icsk_backoff) < TCP_RTO_MAX);
- 
+		const int alive = ((icsk->icsk_rto << icsk->icsk_backoff) < 
+				   (tp->rto_max ? : sysctl_tcp_rto_max));
 		max_probes = tcp_orphan_retries(sk, alive);
 
 		if (tcp_out_of_resources(sk, alive || icsk->icsk_probes_out <= max_probes))
@@ -301,7 +304,8 @@
 			       inet->num, tp->snd_una, tp->snd_nxt);
 		}
 #endif
-		if (tcp_time_stamp - tp->rcv_tstamp > TCP_RTO_MAX) {
+		if (tcp_time_stamp - tp->rcv_tstamp > 
+		    (tp->rto_max ? : sysctl_tcp_rto_max)) {
 			tcp_write_err(sk);
 			goto out;
 		}
@@ -373,7 +377,8 @@
 
 out_reset_timer:
 	icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
-	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
+	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, 
+				  (tp->rto_max ? : sysctl_tcp_rto_max));
 	if (icsk->icsk_retransmits > sysctl_tcp_retries1)
 		__sk_dst_reset(sk);
 
@@ -428,7 +433,8 @@
 static void tcp_synack_timer(struct sock *sk)
 {
 	inet_csk_reqsk_queue_prune(sk, TCP_SYNQ_INTERVAL,
-				   TCP_TIMEOUT_INIT, TCP_RTO_MAX);
+				   TCP_TIMEOUT_INIT, 
+				   ((tp->rto_init ? : sysctl_tcp_rto_init)));
 }
 
 void tcp_set_keepalive(struct sock *sk, int val)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Customizable TCP backoff patch
  2006-09-27 23:16 ` David Miller
@ 2006-09-27 23:00   ` Stephen Hemminger
  2006-09-28  2:06     ` David Miller
  2006-10-03 18:14   ` Ben Woodard
  1 sibling, 1 reply; 17+ messages in thread
From: Stephen Hemminger @ 2006-09-27 23:00 UTC (permalink / raw)
  To: David Miller; +Cc: woodard, netdev, mgrondona, behlendorf1

On Wed, 27 Sep 2006 16:16:38 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> From: Ben Woodard <woodard@redhat.com>
> Date: Wed, 27 Sep 2006 11:52:57 -0700
> 
> > Because these are general utility clusters we run many different 
> > programs and so trying to fix this problem in the application is not 
> > possible since there are literally hundreds if not thousands of them.
> 
> Then why add a socket option setting as your patch does? :-)
> 
> I also object to the socket option setting being allowed for
> any user because this can have awful effects if allowed by
> arbitrary users on arbitrary networks.

Setting a cwnd limit would do the same thing.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Customizable TCP backoff patch
  2006-09-27 18:52 [PATCH] Customizable TCP backoff patch Ben Woodard
@ 2006-09-27 23:16 ` David Miller
  2006-09-27 23:00   ` Stephen Hemminger
  2006-10-03 18:14   ` Ben Woodard
  0 siblings, 2 replies; 17+ messages in thread
From: David Miller @ 2006-09-27 23:16 UTC (permalink / raw)
  To: woodard; +Cc: netdev, mgrondona, behlendorf1

From: Ben Woodard <woodard@redhat.com>
Date: Wed, 27 Sep 2006 11:52:57 -0700

> Because these are general utility clusters we run many different 
> programs and so trying to fix this problem in the application is not 
> possible since there are literally hundreds if not thousands of them.

Then why add a socket option setting as your patch does? :-)

I also object to the socket option setting being allowed for
any user because this can have awful effects if allowed by
arbitrary users on arbitrary networks.

> We're more than willing to consider other approaches to handling this
> particular workload better.  We've even considered that TCP isn't at all 
> the right protocol but this affects several protocols including NFS and 
> the benefits of running NFS over TCP are too great.
> 
> The original patch was prepared by Brian Behlendorf. He asked me to 
> adapt it for current kernels keep it up to date and send upstream.
> 
> This may also help people like Andrew Athan which reported a similar 
> problem a couple of days ago on the linux-net mailing list: 
> http://www.uwsg.iu.edu/hypermail/linux/net/0609.3/0005.html I suspect 
> that it is more common a case than is widely recognized.
> 
> Signed-off-by: Ben Woodard <woodard@redhat.com>
> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>

Other issues:

1) 2 "u32" in the tcp_sock is a lot of space to devote to this
   new state.  If it can fit in 2 "u16"'s or even less space,
   please use that.

2) the expression "(tp->foo ? : sysctl_foo)" is repeated many times
   in the patch, please encapsulate it into an inline function
   or similar

I'm still torn on the fundamental issues of this patch.  I think
random backoff is a better generic solution to this kind of problem.
If it works for ethernet, it might just work for TCP too :-)

Thanks.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Customizable TCP backoff patch
  2006-09-27 23:00   ` Stephen Hemminger
@ 2006-09-28  2:06     ` David Miller
  0 siblings, 0 replies; 17+ messages in thread
From: David Miller @ 2006-09-28  2:06 UTC (permalink / raw)
  To: shemminger; +Cc: woodard, netdev, mgrondona, behlendorf1

From: Stephen Hemminger <shemminger@osdl.org>
Date: Wed, 27 Sep 2006 16:00:44 -0700

> On Wed, 27 Sep 2006 16:16:38 -0700 (PDT)
> David Miller <davem@davemloft.net> wrote:
> 
> > From: Ben Woodard <woodard@redhat.com>
> > Date: Wed, 27 Sep 2006 11:52:57 -0700
> > 
> > > Because these are general utility clusters we run many different 
> > > programs and so trying to fix this problem in the application is not 
> > > possible since there are literally hundreds if not thousands of them.
> > 
> > Then why add a socket option setting as your patch does? :-)
> > 
> > I also object to the socket option setting being allowed for
> > any user because this can have awful effects if allowed by
> > arbitrary users on arbitrary networks.
> 
> Setting a cwnd limit would do the same thing.

Not really :-)

It would not influence the TCP retransmit timeouts and
backoff, although it would influence the congestion window
handling during recovery after such timeout based losses.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Customizable TCP backoff patch
  2006-09-27 23:16 ` David Miller
  2006-09-27 23:00   ` Stephen Hemminger
@ 2006-10-03 18:14   ` Ben Woodard
  2006-10-04  7:07     ` David Miller
  1 sibling, 1 reply; 17+ messages in thread
From: Ben Woodard @ 2006-10-03 18:14 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, mgrondona, behlendorf1

[-- Attachment #1: Type: text/plain, Size: 2240 bytes --]

David Miller wrote:
> From: Ben Woodard <woodard@redhat.com>
> Date: Wed, 27 Sep 2006 11:52:57 -0700
> 
>> Because these are general utility clusters we run many different 
>> programs and so trying to fix this problem in the application is not 
>> possible since there are literally hundreds if not thousands of them.
> 
> Then why add a socket option setting as your patch does? :-)
> 
> I also object to the socket option setting being allowed for
> any user because this can have awful effects if allowed by
> arbitrary users on arbitrary networks.
> 
>> We're more than willing to consider other approaches to handling this
>> particular workload better.  We've even considered that TCP isn't at all 
>> the right protocol but this affects several protocols including NFS and 
>> the benefits of running NFS over TCP are too great.
>>
>> The original patch was prepared by Brian Behlendorf. He asked me to 
>> adapt it for current kernels keep it up to date and send upstream.
>>
>> This may also help people like Andrew Athan which reported a similar 
>> problem a couple of days ago on the linux-net mailing list: 
>> http://www.uwsg.iu.edu/hypermail/linux/net/0609.3/0005.html I suspect 
>> that it is more common a case than is widely recognized.
>>
>> Signed-off-by: Ben Woodard <woodard@redhat.com>
>> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
> 
> Other issues:
> 
> 1) 2 "u32" in the tcp_sock is a lot of space to devote to this
>    new state.  If it can fit in 2 "u16"'s or even less space,
>    please use that.
> 
> 2) the expression "(tp->foo ? : sysctl_foo)" is repeated many times
>    in the patch, please encapsulate it into an inline function
>    or similar
> 

How does this look to you for answering your two complaints above.

> I'm still torn on the fundamental issues of this patch.  I think
> random backoff is a better generic solution to this kind of problem.
> If it works for ethernet, it might just work for TCP too :-)

I haven't taken this on in this patch. I'd have to think more about how 
to do that and I'm not sure that introducing randomness here will allow 
us to settle into a steady state faster than configuring a shorter 
timeout in the environments that need it.

> 
> Thanks.


[-- Attachment #2: tcp-backoff-18-2.diff --]
[-- Type: text/x-patch, Size: 7043 bytes --]

diff -ru linux-2.6.18/include/linux/sysctl.h linux-2.6.18.new/include/linux/sysctl.h
--- linux-2.6.18/include/linux/sysctl.h	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/include/linux/sysctl.h	2006-09-26 17:10:36.000000000 -0700
@@ -411,6 +411,8 @@
 	NET_IPV4_TCP_WORKAROUND_SIGNED_WINDOWS=115,
 	NET_TCP_DMA_COPYBREAK=116,
 	NET_TCP_SLOW_START_AFTER_IDLE=117,
+	NET_TCP_RTO_MAX=118,
+	NET_TCP_RTO_INIT=119,
 };
 
 enum {
Only in linux-2.6.18.new/include/linux: sysctl.h.orig
Only in linux-2.6.18.new/include/linux: sysctl.h.rej
diff -ru linux-2.6.18/include/linux/tcp.h linux-2.6.18.new/include/linux/tcp.h
--- linux-2.6.18/include/linux/tcp.h	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/include/linux/tcp.h	2006-09-28 13:18:12.000000000 -0700
@@ -94,6 +94,8 @@
 #define TCP_INFO		11	/* Information about this connection. */
 #define TCP_QUICKACK		12	/* Block/reenable quick acks */
 #define TCP_CONGESTION		13	/* Congestion control algorithm */
+#define TCP_BACKOFF_MAX         14      /* Maximum backoff value */
+#define TCP_BACKOFF_INIT        15      /* Initial backoff value */
 
 #define TCPI_OPT_TIMESTAMPS	1
 #define TCPI_OPT_SACK		2
@@ -257,6 +259,8 @@
 	__u8	frto_counter;	/* Number of new acks after RTO */
 	__u8	nonagle;	/* Disable Nagle algorithm?             */
 	__u8	keepalive_probes; /* num of allowed keep alive probes	*/
+        __u16   rto_max;        /* Maximum backoff value                */
+        __u16   rto_init;       /* Initial backoff value                */
 
 /* RTT measurement */
 	__u32	srtt;		/* smoothed round trip time << 3	*/
Only in linux-2.6.18.new/include/linux: tcp.h.orig
diff -ru linux-2.6.18/include/net/tcp.h linux-2.6.18.new/include/net/tcp.h
--- linux-2.6.18/include/net/tcp.h	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/include/net/tcp.h	2006-09-26 17:12:04.000000000 -0700
@@ -227,6 +227,8 @@
 extern int sysctl_tcp_base_mss;
 extern int sysctl_tcp_workaround_signed_windows;
 extern int sysctl_tcp_slow_start_after_idle;
+extern int sysctl_tcp_rto_max;
+extern int sysctl_tcp_rto_init;
 
 extern atomic_t tcp_memory_allocated;
 extern atomic_t tcp_sockets_allocated;
Only in linux-2.6.18.new/include/net: tcp.h.orig
Only in linux-2.6.18.new/include/net: tcp.h.rej
diff -ru linux-2.6.18/net/ipv4/sysctl_net_ipv4.c linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c
--- linux-2.6.18/net/ipv4/sysctl_net_ipv4.c	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c	2006-09-26 17:08:33.000000000 -0700
@@ -697,6 +697,22 @@
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec
 	},
+	{
+	        .ctl_name       = NET_TCP_RTO_MAX,
+		.procname       = "tcp_rto_max",
+		.data           = &sysctl_tcp_rto_max,
+		.maxlen         = sizeof(unsigned),
+		.mode           = 0644,
+		.proc_handler   = &proc_dointvec
+	},
+	{
+	        .ctl_name       = NET_TCP_RTO_INIT,
+		.procname       = "tcp_rto_init",
+		.data           = &sysctl_tcp_rto_init,
+		.maxlen         = sizeof(unsigned),
+		.mode           = 0644,
+		.proc_handler   = &proc_dointvec
+	},
 	{ .ctl_name = 0 }
 };
 
Only in linux-2.6.18.new/net/ipv4: sysctl_net_ipv4.c.orig
diff -ru linux-2.6.18/net/ipv4/tcp.c linux-2.6.18.new/net/ipv4/tcp.c
--- linux-2.6.18/net/ipv4/tcp.c	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/net/ipv4/tcp.c	2006-09-28 13:56:24.000000000 -0700
@@ -1764,6 +1764,8 @@
 	return err;
 }
 
+#define TCP_BACKOFF_MAXVAL 65535
+
 /*
  *	Socket option code for TCP.
  */
@@ -1939,6 +1941,21 @@
 		}
 		break;
 
+        case TCP_BACKOFF_MAX:
+                if (val < 1 || val > TCP_BACKOFF_MAXVAL)
+                        err = -EINVAL;
+                else
+                        tp->rto_max = val * HZ;
+                break;
+ 
+        case TCP_BACKOFF_INIT:
+                if (val < 1 || val > TCP_BACKOFF_MAXVAL)
+                        err = -EINVAL;
+                else
+                        tp->rto_init = val * HZ;
+                break;
+ 
+ 
 	default:
 		err = -ENOPROTOOPT;
 		break;
@@ -2110,6 +2127,12 @@
 		if (copy_to_user(optval, icsk->icsk_ca_ops->name, len))
 			return -EFAULT;
 		return 0;
+        case TCP_BACKOFF_MAX:
+                val = (tp->rto_max ? : sysctl_tcp_rto_max) / HZ;
+                break;
+        case TCP_BACKOFF_INIT:
+                val = (tp->rto_init ? : sysctl_tcp_rto_init) / HZ;
+                break;
 	default:
 		return -ENOPROTOOPT;
 	};
Only in linux-2.6.18.new/net/ipv4: tcp.c.orig
diff -ru linux-2.6.18/net/ipv4/tcp_timer.c linux-2.6.18.new/net/ipv4/tcp_timer.c
--- linux-2.6.18/net/ipv4/tcp_timer.c	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/net/ipv4/tcp_timer.c	2006-09-28 16:41:39.000000000 -0700
@@ -31,11 +31,21 @@
 int sysctl_tcp_retries1 = TCP_RETR1;
 int sysctl_tcp_retries2 = TCP_RETR2;
 int sysctl_tcp_orphan_retries;
+int sysctl_tcp_rto_max  = TCP_RTO_MAX;
+int sysctl_tcp_rto_init = TCP_TIMEOUT_INIT;
 
 static void tcp_write_timer(unsigned long);
 static void tcp_delack_timer(unsigned long);
 static void tcp_keepalive_timer (unsigned long data);
 
+static inline __u16 rto_max(struct tcp_sock *tp){
+        return tp->rto_max ? : sysctl_tcp_rto_max;
+}
+
+static inline __u16 rto_init(struct tcp_sock *tp){
+        return tp->rto_init ? : sysctl_tcp_rto_init;
+}
+
 void tcp_init_xmit_timers(struct sock *sk)
 {
 	inet_csk_init_xmit_timers(sk, &tcp_write_timer, &tcp_delack_timer,
@@ -71,7 +81,7 @@
 
 	/* If peer does not open window for long time, or did not transmit 
 	 * anything for long time, penalize it. */
-	if ((s32)(tcp_time_stamp - tp->lsndtime) > 2*TCP_RTO_MAX || !do_reset)
+	if ((s32)(tcp_time_stamp - tp->lsndtime) > 2*rto_max(tp) || !do_reset)
 		orphans <<= 1;
 
 	/* If some dubious ICMP arrived, penalize even more. */
@@ -256,8 +266,8 @@
 	max_probes = sysctl_tcp_retries2;
 
 	if (sock_flag(sk, SOCK_DEAD)) {
-		const int alive = ((icsk->icsk_rto << icsk->icsk_backoff) < TCP_RTO_MAX);
- 
+		const int alive = ((icsk->icsk_rto << icsk->icsk_backoff) < 
+				   rto_max(tp));
 		max_probes = tcp_orphan_retries(sk, alive);
 
 		if (tcp_out_of_resources(sk, alive || icsk->icsk_probes_out <= max_probes))
@@ -301,7 +311,7 @@
 			       inet->num, tp->snd_una, tp->snd_nxt);
 		}
 #endif
-		if (tcp_time_stamp - tp->rcv_tstamp > TCP_RTO_MAX) {
+		if (tcp_time_stamp - tp->rcv_tstamp > rto_max(tp)) {
 			tcp_write_err(sk);
 			goto out;
 		}
@@ -373,7 +383,8 @@
 
 out_reset_timer:
 	icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
-	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
+	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, 
+				  rto_max(tp));
 	if (icsk->icsk_retransmits > sysctl_tcp_retries1)
 		__sk_dst_reset(sk);
 
@@ -427,8 +438,8 @@
 
 static void tcp_synack_timer(struct sock *sk)
 {
-	inet_csk_reqsk_queue_prune(sk, TCP_SYNQ_INTERVAL,
-				   TCP_TIMEOUT_INIT, TCP_RTO_MAX);
+	inet_csk_reqsk_queue_prune(sk, TCP_SYNQ_INTERVAL, 
+				   TCP_TIMEOUT_INIT, rto_max(tcp_sk(sk)));
 }
 
 void tcp_set_keepalive(struct sock *sk, int val)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Customizable TCP backoff patch
  2006-10-03 18:14   ` Ben Woodard
@ 2006-10-04  7:07     ` David Miller
  2006-10-04  8:56       ` Ingo Oeser
                         ` (2 more replies)
  0 siblings, 3 replies; 17+ messages in thread
From: David Miller @ 2006-10-04  7:07 UTC (permalink / raw)
  To: woodard; +Cc: netdev, mgrondona, behlendorf1

From: Ben Woodard <woodard@redhat.com>
Date: Tue, 03 Oct 2006 11:14:38 -0700

> > Other issues:
> > 
> > 1) 2 "u32" in the tcp_sock is a lot of space to devote to this
> >    new state.  If it can fit in 2 "u16"'s or even less space,
> >    please use that.
> > 
> > 2) the expression "(tp->foo ? : sysctl_foo)" is repeated many times
> >    in the patch, please encapsulate it into an inline function
> >    or similar
> > 
> 
> How does this look to you for answering your two complaints above.

It looks a lot better.  I'd like you to take #2 further,
put the inline function into net/tcp.h and use it in the
tcp.c instances too.

Also, please don't format code like this:

+static inline __u16 rto_max(struct tcp_sock *tp){
+        return tp->rto_max ? : sysctl_tcp_rto_max;
+}
+
+static inline __u16 rto_init(struct tcp_sock *tp){
+        return tp->rto_init ? : sysctl_tcp_rto_init;
+}

The openning brace of each functions belongs on a line by itself,
thanks.

Finally, I'm not so sure "seconds" is the best unit for the
socket option.  In fact you use "seconds" as the unit for
the socket option and the sysctl is raw jiffies.  That's
inconsistency is really problematic.

At the very least, seconds might not be fine enough granularity
for some circumstances.  Heck, the default RTO_MIN is 1/5 of a
second. :-)

I also understand that going to milliseconds or microseconds would
make the size of the in-socket struct members an issue again.  These
things are never easy are they? :-/

Any better ideas?

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Customizable TCP backoff patch
  2006-10-04  7:07     ` David Miller
@ 2006-10-04  8:56       ` Ingo Oeser
  2006-10-04  9:08         ` David Miller
  2006-10-04 17:17       ` Stephen Hemminger
  2006-10-11  1:46       ` Ben Woodard
  2 siblings, 1 reply; 17+ messages in thread
From: Ingo Oeser @ 2006-10-04  8:56 UTC (permalink / raw)
  To: David Miller; +Cc: woodard, netdev, mgrondona, behlendorf1

David Miller wrote:
> At the very least, seconds might not be fine enough granularity
> for some circumstances.  Heck, the default RTO_MIN is 1/5 of a
> second. :-)
> 
> I also understand that going to milliseconds or microseconds would
> make the size of the in-socket struct members an issue again.  These
> things are never easy are they? :-/

Would be, if floating point values would be allowed. Single precision would 
be enough in this case and its just 32 bits IIRC.

I mean floating point values just for the user->kernel ABI and 
NOT for the internal timer representation.


Regards

Ingo Oeser

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Customizable TCP backoff patch
  2006-10-04  8:56       ` Ingo Oeser
@ 2006-10-04  9:08         ` David Miller
  0 siblings, 0 replies; 17+ messages in thread
From: David Miller @ 2006-10-04  9:08 UTC (permalink / raw)
  To: netdev; +Cc: woodard, netdev, mgrondona, behlendorf1

From: Ingo Oeser <netdev@axxeo.de>
Date: Wed, 4 Oct 2006 10:56:12 +0200

> David Miller wrote:
> > At the very least, seconds might not be fine enough granularity
> > for some circumstances.  Heck, the default RTO_MIN is 1/5 of a
> > second. :-)
> > 
> > I also understand that going to milliseconds or microseconds would
> > make the size of the in-socket struct members an issue again.  These
> > things are never easy are they? :-/
> 
> Would be, if floating point values would be allowed. Single precision would 
> be enough in this case and its just 32 bits IIRC.
> 
> I mean floating point values just for the user->kernel ABI and 
> NOT for the internal timer representation.

We got the struct members down to 16-bits each, the problem is in
fact that we'd need to make them 32-bits again.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Customizable TCP backoff patch
  2006-10-04  7:07     ` David Miller
  2006-10-04  8:56       ` Ingo Oeser
@ 2006-10-04 17:17       ` Stephen Hemminger
  2006-10-11  1:46       ` Ben Woodard
  2 siblings, 0 replies; 17+ messages in thread
From: Stephen Hemminger @ 2006-10-04 17:17 UTC (permalink / raw)
  To: David Miller; +Cc: woodard, netdev, mgrondona, behlendorf1

On Wed, 04 Oct 2006 00:07:22 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:

> From: Ben Woodard <woodard@redhat.com>
> Date: Tue, 03 Oct 2006 11:14:38 -0700
> 
> > > Other issues:
> > > 
> > > 1) 2 "u32" in the tcp_sock is a lot of space to devote to this
> > >    new state.  If it can fit in 2 "u16"'s or even less space,
> > >    please use that.
> > > 
> > > 2) the expression "(tp->foo ? : sysctl_foo)" is repeated many times
> > >    in the patch, please encapsulate it into an inline function
> > >    or similar
> > > 
> > 
> > How does this look to you for answering your two complaints above.
> 
> It looks a lot better.  I'd like you to take #2 further,
> put the inline function into net/tcp.h and use it in the
> tcp.c instances too.
> 
> Also, please don't format code like this:
> 
> +static inline __u16 rto_max(struct tcp_sock *tp){
> +        return tp->rto_max ? : sysctl_tcp_rto_max;
> +}
> +
> +static inline __u16 rto_init(struct tcp_sock *tp){
> +        return tp->rto_init ? : sysctl_tcp_rto_init;
> +}
> 
> The openning brace of each functions belongs on a line by itself,
> thanks.
> 
> Finally, I'm not so sure "seconds" is the best unit for the
> socket option.  In fact you use "seconds" as the unit for
> the socket option and the sysctl is raw jiffies.  That's
> inconsistency is really problematic.
> 
> At the very least, seconds might not be fine enough granularity
> for some circumstances.  Heck, the default RTO_MIN is 1/5 of a
> second. :-)
> 
> I also understand that going to milliseconds or microseconds would
> make the size of the in-socket struct members an issue again.  These
> things are never easy are they? :-/
> 
> Any better ideas?

If you are willing to do a little more work, the sysctl value can be
in milliseconds, and the internal socket member in some other unit
like jiffies.  It does require writing your own in/out translation
instead of using proc_dointvec

-- 
Stephen Hemminger <shemminger@osdl.org>

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Customizable TCP backoff patch
  2006-10-04  7:07     ` David Miller
  2006-10-04  8:56       ` Ingo Oeser
  2006-10-04 17:17       ` Stephen Hemminger
@ 2006-10-11  1:46       ` Ben Woodard
  2006-10-11  1:59         ` YOSHIFUJI Hideaki / 吉藤英明
                           ` (3 more replies)
  2 siblings, 4 replies; 17+ messages in thread
From: Ben Woodard @ 2006-10-11  1:46 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, mgrondona, behlendorf1

[-- Attachment #1: Type: text/plain, Size: 2001 bytes --]

David Miller wrote:
> From: Ben Woodard <woodard@redhat.com>
> Date: Tue, 03 Oct 2006 11:14:38 -0700
> 
>>> Other issues:
>>>
>>> 1) 2 "u32" in the tcp_sock is a lot of space to devote to this
>>>    new state.  If it can fit in 2 "u16"'s or even less space,
>>>    please use that.
>>>
>>> 2) the expression "(tp->foo ? : sysctl_foo)" is repeated many times
>>>    in the patch, please encapsulate it into an inline function
>>>    or similar
>>>
>> How does this look to you for answering your two complaints above.
> 
> It looks a lot better.  I'd like you to take #2 further,
> put the inline function into net/tcp.h and use it in the
> tcp.c instances too.

Done. I orginally didn't want to grow the header files with two more 
functions that weren't use many places. Anyway this is fixed.

> 
> Also, please don't format code like this:
> 
> +static inline __u16 rto_max(struct tcp_sock *tp){
> +        return tp->rto_max ? : sysctl_tcp_rto_max;
> +}
> +
> +static inline __u16 rto_init(struct tcp_sock *tp){
> +        return tp->rto_init ? : sysctl_tcp_rto_init;
> +}
> 
> The opening brace of each functions belongs on a line by itself,
> thanks.

Fixed.

> 
> Finally, I'm not so sure "seconds" is the best unit for the
> socket option.  In fact you use "seconds" as the unit for
> the socket option and the sysctl is raw jiffies.  That's
> inconsistency is really problematic.
> 
> At the very least, seconds might not be fine enough granularity
> for some circumstances.  Heck, the default RTO_MIN is 1/5 of a
> second. :-)
> 
> I also understand that going to milliseconds or microseconds would
> make the size of the in-socket struct members an issue again.  These
> things are never easy are they? :-/
> 
> Any better ideas?

Here is what I've done. I store the values in the socket structure as ms 
so that they will fit into 16 bits and as jiffies in the global 
variables. Then I populate the global variables using 
proc_doulongvec_ms_jiffies()

How does that look to you?

-ben



[-- Attachment #2: tcp-backoff-18-4.diff --]
[-- Type: text/x-patch, Size: 7119 bytes --]

diff -ru linux-2.6.18/include/linux/sysctl.h linux-2.6.18.new/include/linux/sysctl.h
--- linux-2.6.18/include/linux/sysctl.h	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/include/linux/sysctl.h	2006-09-26 17:10:36.000000000 -0700
@@ -411,6 +411,8 @@
 	NET_IPV4_TCP_WORKAROUND_SIGNED_WINDOWS=115,
 	NET_TCP_DMA_COPYBREAK=116,
 	NET_TCP_SLOW_START_AFTER_IDLE=117,
+	NET_TCP_RTO_MAX=118,
+	NET_TCP_RTO_INIT=119,
 };
 
 enum {
diff -ru linux-2.6.18/include/linux/tcp.h linux-2.6.18.new/include/linux/tcp.h
--- linux-2.6.18/include/linux/tcp.h	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/include/linux/tcp.h	2006-10-10 17:01:25.000000000 -0700
@@ -94,6 +94,8 @@
 #define TCP_INFO		11	/* Information about this connection. */
 #define TCP_QUICKACK		12	/* Block/reenable quick acks */
 #define TCP_CONGESTION		13	/* Congestion control algorithm */
+#define TCP_BACKOFF_MAX         14      /* Maximum backoff value */
+#define TCP_BACKOFF_INIT        15      /* Initial backoff value */
 
 #define TCPI_OPT_TIMESTAMPS	1
 #define TCPI_OPT_SACK		2
@@ -257,6 +259,8 @@
 	__u8	frto_counter;	/* Number of new acks after RTO */
 	__u8	nonagle;	/* Disable Nagle algorithm?             */
 	__u8	keepalive_probes; /* num of allowed keep alive probes	*/
+        __u16   rto_max;        /* Maximum backoff value in ms          */
+        __u16   rto_init;       /* Initial backoff value in ms          */
 
 /* RTT measurement */
 	__u32	srtt;		/* smoothed round trip time << 3	*/
diff -ru linux-2.6.18/include/net/tcp.h linux-2.6.18.new/include/net/tcp.h
--- linux-2.6.18/include/net/tcp.h	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/include/net/tcp.h	2006-10-10 18:42:00.000000000 -0700
@@ -227,11 +227,23 @@
 extern int sysctl_tcp_base_mss;
 extern int sysctl_tcp_workaround_signed_windows;
 extern int sysctl_tcp_slow_start_after_idle;
+extern unsigned long sysctl_tcp_rto_max;
+extern unsigned long sysctl_tcp_rto_init;
 
 extern atomic_t tcp_memory_allocated;
 extern atomic_t tcp_sockets_allocated;
 extern int tcp_memory_pressure;
 
+static inline unsigned long tcp_rto_max(struct tcp_sock *tp)
+{
+        return tp->rto_max ? tp->rto_max*HZ/1000 : sysctl_tcp_rto_max;
+}
+
+static inline unsigned long tcp_rto_init(struct tcp_sock *tp)
+{
+        return tp->rto_init ? tp->rto_init*HZ/1000 : sysctl_tcp_rto_init;
+}
+
 /*
  * The next routines deal with comparing 32 bit unsigned ints
  * and worry about wraparound (automatic with unsigned arithmetic).
diff -ru linux-2.6.18/net/ipv4/sysctl_net_ipv4.c linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c
--- linux-2.6.18/net/ipv4/sysctl_net_ipv4.c	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c	2006-10-10 16:32:08.000000000 -0700
@@ -128,6 +128,8 @@
 	return ret;
 }
 
+static unsigned long tcp_rto_min=0;
+static unsigned long tcp_rto_max=65535;
 
 ctl_table ipv4_table[] = {
         {
@@ -697,6 +699,26 @@
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec
 	},
+	{
+	        .ctl_name       = NET_TCP_RTO_MAX,
+		.procname       = "tcp_rto_max",
+		.data           = &sysctl_tcp_rto_max,
+		.maxlen         = sizeof(unsigned),
+		.mode           = 0644,
+		.proc_handler   = &proc_doulongvec_ms_jiffies,
+		.extra1         = &tcp_rto_min_constant,
+		.extra2         = &tcp_rto_max_constant,
+	},
+	{
+	        .ctl_name       = NET_TCP_RTO_INIT,
+		.procname       = "tcp_rto_init",
+		.data           = &sysctl_tcp_rto_init,
+		.maxlen         = sizeof(unsigned),
+		.mode           = 0644,
+		.proc_handler   = &proc_doulongvec_ms_jiffies,
+		.extra1         = &tcp_rto_min_constant,
+		.extra2         = &tcp_rto_max_constant,
+	},
 	{ .ctl_name = 0 }
 };
 
diff -ru linux-2.6.18/net/ipv4/tcp.c linux-2.6.18.new/net/ipv4/tcp.c
--- linux-2.6.18/net/ipv4/tcp.c	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/net/ipv4/tcp.c	2006-10-10 18:37:40.000000000 -0700
@@ -1764,6 +1764,8 @@
 	return err;
 }
 
+#define TCP_BACKOFF_MAXVAL 65535
+
 /*
  *	Socket option code for TCP.
  */
@@ -1939,6 +1941,21 @@
 		}
 		break;
 
+        case TCP_BACKOFF_MAX:
+                if (val < 1 || val > TCP_BACKOFF_MAXVAL)
+                        err = -EINVAL;
+                else
+                        tp->rto_max = val;
+                break;
+ 
+        case TCP_BACKOFF_INIT:
+                if (val < 1 || val > TCP_BACKOFF_MAXVAL)
+                        err = -EINVAL;
+                else
+                        tp->rto_init = val;
+                break;
+ 
+ 
 	default:
 		err = -ENOPROTOOPT;
 		break;
@@ -2110,6 +2127,12 @@
 		if (copy_to_user(optval, icsk->icsk_ca_ops->name, len))
 			return -EFAULT;
 		return 0;
+        case TCP_BACKOFF_MAX:
+	        val = tcp_rto_max(tp)*1000/HZ;
+                break;
+        case TCP_BACKOFF_INIT:
+	        val = tcp_rto_init(tp)*1000/HZ;
+                break;
 	default:
 		return -ENOPROTOOPT;
 	};
diff -ru linux-2.6.18/net/ipv4/tcp_timer.c linux-2.6.18.new/net/ipv4/tcp_timer.c
--- linux-2.6.18/net/ipv4/tcp_timer.c	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/net/ipv4/tcp_timer.c	2006-10-10 16:25:10.000000000 -0700
@@ -31,6 +31,8 @@
 int sysctl_tcp_retries1 = TCP_RETR1;
 int sysctl_tcp_retries2 = TCP_RETR2;
 int sysctl_tcp_orphan_retries;
+unsigned long sysctl_tcp_rto_max  = TCP_RTO_MAX;
+unsigned long sysctl_tcp_rto_init = TCP_TIMEOUT_INIT;
 
 static void tcp_write_timer(unsigned long);
 static void tcp_delack_timer(unsigned long);
@@ -71,7 +73,8 @@
 
 	/* If peer does not open window for long time, or did not transmit 
 	 * anything for long time, penalize it. */
-	if ((s32)(tcp_time_stamp - tp->lsndtime) > 2*TCP_RTO_MAX || !do_reset)
+	if ((s32)(tcp_time_stamp - tp->lsndtime) > 2*tcp_rto_max(tp) || 
+	    !do_reset)
 		orphans <<= 1;
 
 	/* If some dubious ICMP arrived, penalize even more. */
@@ -256,8 +259,8 @@
 	max_probes = sysctl_tcp_retries2;
 
 	if (sock_flag(sk, SOCK_DEAD)) {
-		const int alive = ((icsk->icsk_rto << icsk->icsk_backoff) < TCP_RTO_MAX);
- 
+		const int alive = ((icsk->icsk_rto << icsk->icsk_backoff) < 
+				   tcp_rto_max(tp));
 		max_probes = tcp_orphan_retries(sk, alive);
 
 		if (tcp_out_of_resources(sk, alive || icsk->icsk_probes_out <= max_probes))
@@ -301,7 +304,7 @@
 			       inet->num, tp->snd_una, tp->snd_nxt);
 		}
 #endif
-		if (tcp_time_stamp - tp->rcv_tstamp > TCP_RTO_MAX) {
+		if (tcp_time_stamp - tp->rcv_tstamp > tcp_rto_max(tp)) {
 			tcp_write_err(sk);
 			goto out;
 		}
@@ -373,7 +376,8 @@
 
 out_reset_timer:
 	icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
-	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
+	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, 
+				  tcp_rto_max(tp));
 	if (icsk->icsk_retransmits > sysctl_tcp_retries1)
 		__sk_dst_reset(sk);
 
@@ -427,8 +431,8 @@
 
 static void tcp_synack_timer(struct sock *sk)
 {
-	inet_csk_reqsk_queue_prune(sk, TCP_SYNQ_INTERVAL,
-				   TCP_TIMEOUT_INIT, TCP_RTO_MAX);
+	inet_csk_reqsk_queue_prune(sk, TCP_SYNQ_INTERVAL, TCP_TIMEOUT_INIT, 
+				   tcp_rto_max(tcp_sk(sk)));
 }
 
 void tcp_set_keepalive(struct sock *sk, int val)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Customizable TCP backoff patch
  2006-10-11  1:46       ` Ben Woodard
@ 2006-10-11  1:59         ` YOSHIFUJI Hideaki / 吉藤英明
  2006-10-11  2:53         ` David Miller
                           ` (2 subsequent siblings)
  3 siblings, 0 replies; 17+ messages in thread
From: YOSHIFUJI Hideaki / 吉藤英明 @ 2006-10-11  1:59 UTC (permalink / raw)
  To: woodard; +Cc: davem, netdev, mgrondona, behlendorf1, yoshfuji

In article <452C4D12.9060102@redhat.com> (at Tue, 10 Oct 2006 18:46:58 -0700), Ben Woodard <woodard@redhat.com> says:

> diff -ru linux-2.6.18/net/ipv4/tcp.c linux-2.6.18.new/net/ipv4/tcp.c
> --- linux-2.6.18/net/ipv4/tcp.c	2006-09-19 20:42:06.000000000 -0700
> +++ linux-2.6.18.new/net/ipv4/tcp.c	2006-10-10 18:37:40.000000000 -0700
> @@ -1939,6 +1941,21 @@
>  		}
>  		break;
>  
> +        case TCP_BACKOFF_MAX:
> +                if (val < 1 || val > TCP_BACKOFF_MAXVAL)
> +                        err = -EINVAL;
> +                else
> +                        tp->rto_max = val;
> +                break;
> + 
> +        case TCP_BACKOFF_INIT:
> +                if (val < 1 || val > TCP_BACKOFF_MAXVAL)
> +                        err = -EINVAL;
> +                else
> +                        tp->rto_init = val;
> +                break;
> + 
> + 
>  	default:
>  		err = -ENOPROTOOPT;
>  		break;
> @@ -2110,6 +2127,12 @@
>  		if (copy_to_user(optval, icsk->icsk_ca_ops->name, len))
>  			return -EFAULT;
>  		return 0;
> +        case TCP_BACKOFF_MAX:
> +	        val = tcp_rto_max(tp)*1000/HZ;
> +                break;
> +        case TCP_BACKOFF_INIT:
> +	        val = tcp_rto_init(tp)*1000/HZ;
> +                break;
>  	default:
>  		return -ENOPROTOOPT;
>  	};

They look very inconsistent.  Please use tp->rto_max / tp->rto_init.

--yoshfuji

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Customizable TCP backoff patch
  2006-10-11  1:46       ` Ben Woodard
  2006-10-11  1:59         ` YOSHIFUJI Hideaki / 吉藤英明
@ 2006-10-11  2:53         ` David Miller
  2006-10-11  2:54         ` David Miller
  2006-10-11 14:01         ` Vlad Yasevich
  3 siblings, 0 replies; 17+ messages in thread
From: David Miller @ 2006-10-11  2:53 UTC (permalink / raw)
  To: woodard; +Cc: netdev, mgrondona, behlendorf1

From: Ben Woodard <woodard@redhat.com>
Date: Tue, 10 Oct 2006 18:46:58 -0700

> @@ -257,6 +259,8 @@
>  	__u8	frto_counter;	/* Number of new acks after RTO */
>  	__u8	nonagle;	/* Disable Nagle algorithm?             */
>  	__u8	keepalive_probes; /* num of allowed keep alive probes	*/
> +        __u16   rto_max;        /* Maximum backoff value in ms          */
> +        __u16   rto_init;       /* Initial backoff value in ms          */

The existing struct members are properly indented using tabs,
your patch adds members which are indented using spaces.  Please
fix this, thanks a lot.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Customizable TCP backoff patch
  2006-10-11  1:46       ` Ben Woodard
  2006-10-11  1:59         ` YOSHIFUJI Hideaki / 吉藤英明
  2006-10-11  2:53         ` David Miller
@ 2006-10-11  2:54         ` David Miller
  2006-10-11 14:01         ` Vlad Yasevich
  3 siblings, 0 replies; 17+ messages in thread
From: David Miller @ 2006-10-11  2:54 UTC (permalink / raw)
  To: woodard; +Cc: netdev, mgrondona, behlendorf1


Actually, your entire patch uses spaces for indentation
of every new line added, not just those two new struct members
I pointed out.  Please use tabs for all of those cases.

Thanks a lot.

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Customizable TCP backoff patch
  2006-10-11  1:46       ` Ben Woodard
                           ` (2 preceding siblings ...)
  2006-10-11  2:54         ` David Miller
@ 2006-10-11 14:01         ` Vlad Yasevich
  2006-10-12  0:51           ` Ben Woodard
  3 siblings, 1 reply; 17+ messages in thread
From: Vlad Yasevich @ 2006-10-11 14:01 UTC (permalink / raw)
  To: Ben Woodard; +Cc: David Miller, netdev, mgrondona, behlendorf1

Ben Woodard wrote:
 ------------------------------------------------------------------------
> diff -ru linux-2.6.18/include/net/tcp.h linux-2.6.18.new/include/net/tcp.h
> --- linux-2.6.18/include/net/tcp.h	2006-09-19 20:42:06.000000000 -0700
> +++ linux-2.6.18.new/include/net/tcp.h	2006-10-10 18:42:00.000000000 -0700
> @@ -227,11 +227,23 @@
>  extern int sysctl_tcp_base_mss;
>  extern int sysctl_tcp_workaround_signed_windows;
>  extern int sysctl_tcp_slow_start_after_idle;
> +extern unsigned long sysctl_tcp_rto_max;
> +extern unsigned long sysctl_tcp_rto_init;
>  
>  extern atomic_t tcp_memory_allocated;
>  extern atomic_t tcp_sockets_allocated;
>  extern int tcp_memory_pressure;
>  
> +static inline unsigned long tcp_rto_max(struct tcp_sock *tp)
> +{
> +        return tp->rto_max ? tp->rto_max*HZ/1000 : sysctl_tcp_rto_max;
                                ^^^^^^^^^^^^^^^^^^^
That should probably be msecs_to_jiffies(tp->rto_max)

> +}
> +
> +static inline unsigned long tcp_rto_init(struct tcp_sock *tp)
> +{
> +        return tp->rto_init ? tp->rto_init*HZ/1000 : sysctl_tcp_rto_init;

Ditto.

> +}
> +
>  /*
>   * The next routines deal with comparing 32 bit unsigned ints
>   * and worry about wraparound (automatic with unsigned arithmetic).
> diff -ru linux-2.6.18/net/ipv4/sysctl_net_ipv4.c linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c
> --- linux-2.6.18/net/ipv4/sysctl_net_ipv4.c	2006-09-19 20:42:06.000000000 -0700
> +++ linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c	2006-10-10 16:32:08.000000000 -0700
> @@ -128,6 +128,8 @@
>  	return ret;
>  }
>  
> +static unsigned long tcp_rto_min=0;
> +static unsigned long tcp_rto_max=65535;
>  
>  ctl_table ipv4_table[] = {
>          {
> @@ -697,6 +699,26 @@
>  		.mode		= 0644,
>  		.proc_handler	= &proc_dointvec
>  	},
> +	{
> +	        .ctl_name       = NET_TCP_RTO_MAX,
> +		.procname       = "tcp_rto_max",
> +		.data           = &sysctl_tcp_rto_max,
> +		.maxlen         = sizeof(unsigned),
> +		.mode           = 0644,
> +		.proc_handler   = &proc_doulongvec_ms_jiffies,
> +		.extra1         = &tcp_rto_min_constant,
> +		.extra2         = &tcp_rto_max_constant,
> +	},
> +	{
> +	        .ctl_name       = NET_TCP_RTO_INIT,
> +		.procname       = "tcp_rto_init",
> +		.data           = &sysctl_tcp_rto_init,
> +		.maxlen         = sizeof(unsigned),
> +		.mode           = 0644,
> +		.proc_handler   = &proc_doulongvec_ms_jiffies,
> +		.extra1         = &tcp_rto_min_constant,
> +		.extra2         = &tcp_rto_max_constant,
> +	},
>  	{ .ctl_name = 0 }
>  };

Try as I might, I can't find proc_doulongvec_ms_jiffies() anywhere.  I think
you meant proc_doulongvec_ms_jiffies_minmax()?

Also, this function has a bug in that it doesn't do corrections for potentially different
HZ values.  Look at the msecs_to_jiffies...  When I've used proc_doulongvec_ms_jiffies_minmax()
before, I would seen rather interesting truncation errors such that sysctl input didn't match
sysctl output.

>  
> diff -ru linux-2.6.18/net/ipv4/tcp.c linux-2.6.18.new/net/ipv4/tcp.c
> --- linux-2.6.18/net/ipv4/tcp.c	2006-09-19 20:42:06.000000000 -0700
> +++ linux-2.6.18.new/net/ipv4/tcp.c	2006-10-10 18:37:40.000000000 -0700
> @@ -1764,6 +1764,8 @@
>  	return err;
>  }
>  
> +#define TCP_BACKOFF_MAXVAL 65535
> +
>  /*
>   *	Socket option code for TCP.
>   */
> @@ -1939,6 +1941,21 @@
>  		}
>  		break;
>  
> +        case TCP_BACKOFF_MAX:
> +                if (val < 1 || val > TCP_BACKOFF_MAXVAL)
> +                        err = -EINVAL;
> +                else
> +                        tp->rto_max = val;
> +                break;
> + 
> +        case TCP_BACKOFF_INIT:
> +                if (val < 1 || val > TCP_BACKOFF_MAXVAL)
> +                        err = -EINVAL;
> +                else
> +                        tp->rto_init = val;
> +                break;
> + 
> + 
>  	default:
>  		err = -ENOPROTOOPT;
>  		break;
> @@ -2110,6 +2127,12 @@
>  		if (copy_to_user(optval, icsk->icsk_ca_ops->name, len))
>  			return -EFAULT;
>  		return 0;
> +        case TCP_BACKOFF_MAX:
> +	        val = tcp_rto_max(tp)*1000/HZ;
> +                break;
> +        case TCP_BACKOFF_INIT:
> +	        val = tcp_rto_init(tp)*1000/HZ;

The two divisions above introduce truncation errors such the input doesn't
match the output.  Better to use jiffies_to_msecs().

Regards
-vlad

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Customizable TCP backoff patch
  2006-10-11 14:01         ` Vlad Yasevich
@ 2006-10-12  0:51           ` Ben Woodard
  2006-10-12  1:06             ` YOSHIFUJI Hideaki / 吉藤英明
  0 siblings, 1 reply; 17+ messages in thread
From: Ben Woodard @ 2006-10-12  0:51 UTC (permalink / raw)
  To: Vlad Yasevich; +Cc: David Miller, netdev, mgrondona, behlendorf1

[-- Attachment #1: Type: text/plain, Size: 4968 bytes --]

Here we go again. The changes are as follows:
1) the spaces are gone and now it is tabs.
2) used msecs_to_jiffies() and jiffies_to_msecs(). I'm much happier with 
  that. I didn't know those functions existed.
3) changed over to proc_doulongvec_ms_jiffies_minmax(). Last night's 
test compile failed because I was out of disk. I figured that the code I 
changed had compiled. I guess not.

I really didn't understand Yoshifuji Hideaki's comment. (sorry)

-ben


Vlad Yasevich wrote:
> Ben Woodard wrote:
>  ------------------------------------------------------------------------
>> diff -ru linux-2.6.18/include/net/tcp.h linux-2.6.18.new/include/net/tcp.h
>> --- linux-2.6.18/include/net/tcp.h	2006-09-19 20:42:06.000000000 -0700
>> +++ linux-2.6.18.new/include/net/tcp.h	2006-10-10 18:42:00.000000000 -0700
>> @@ -227,11 +227,23 @@
>>  extern int sysctl_tcp_base_mss;
>>  extern int sysctl_tcp_workaround_signed_windows;
>>  extern int sysctl_tcp_slow_start_after_idle;
>> +extern unsigned long sysctl_tcp_rto_max;
>> +extern unsigned long sysctl_tcp_rto_init;
>>  
>>  extern atomic_t tcp_memory_allocated;
>>  extern atomic_t tcp_sockets_allocated;
>>  extern int tcp_memory_pressure;
>>  
>> +static inline unsigned long tcp_rto_max(struct tcp_sock *tp)
>> +{
>> +        return tp->rto_max ? tp->rto_max*HZ/1000 : sysctl_tcp_rto_max;
>                                 ^^^^^^^^^^^^^^^^^^^
> That should probably be msecs_to_jiffies(tp->rto_max)
> 
>> +}
>> +
>> +static inline unsigned long tcp_rto_init(struct tcp_sock *tp)
>> +{
>> +        return tp->rto_init ? tp->rto_init*HZ/1000 : sysctl_tcp_rto_init;
> 
> Ditto.
> 
>> +}
>> +
>>  /*
>>   * The next routines deal with comparing 32 bit unsigned ints
>>   * and worry about wraparound (automatic with unsigned arithmetic).
>> diff -ru linux-2.6.18/net/ipv4/sysctl_net_ipv4.c linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c
>> --- linux-2.6.18/net/ipv4/sysctl_net_ipv4.c	2006-09-19 20:42:06.000000000 -0700
>> +++ linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c	2006-10-10 16:32:08.000000000 -0700
>> @@ -128,6 +128,8 @@
>>  	return ret;
>>  }
>>  
>> +static unsigned long tcp_rto_min=0;
>> +static unsigned long tcp_rto_max=65535;
>>  
>>  ctl_table ipv4_table[] = {
>>          {
>> @@ -697,6 +699,26 @@
>>  		.mode		= 0644,
>>  		.proc_handler	= &proc_dointvec
>>  	},
>> +	{
>> +	        .ctl_name       = NET_TCP_RTO_MAX,
>> +		.procname       = "tcp_rto_max",
>> +		.data           = &sysctl_tcp_rto_max,
>> +		.maxlen         = sizeof(unsigned),
>> +		.mode           = 0644,
>> +		.proc_handler   = &proc_doulongvec_ms_jiffies,
>> +		.extra1         = &tcp_rto_min_constant,
>> +		.extra2         = &tcp_rto_max_constant,
>> +	},
>> +	{
>> +	        .ctl_name       = NET_TCP_RTO_INIT,
>> +		.procname       = "tcp_rto_init",
>> +		.data           = &sysctl_tcp_rto_init,
>> +		.maxlen         = sizeof(unsigned),
>> +		.mode           = 0644,
>> +		.proc_handler   = &proc_doulongvec_ms_jiffies,
>> +		.extra1         = &tcp_rto_min_constant,
>> +		.extra2         = &tcp_rto_max_constant,
>> +	},
>>  	{ .ctl_name = 0 }
>>  };
> 
> Try as I might, I can't find proc_doulongvec_ms_jiffies() anywhere.  I think
> you meant proc_doulongvec_ms_jiffies_minmax()?
> 
> Also, this function has a bug in that it doesn't do corrections for potentially different
> HZ values.  Look at the msecs_to_jiffies...  When I've used proc_doulongvec_ms_jiffies_minmax()
> before, I would seen rather interesting truncation errors such that sysctl input didn't match
> sysctl output.
> 
>>  
>> diff -ru linux-2.6.18/net/ipv4/tcp.c linux-2.6.18.new/net/ipv4/tcp.c
>> --- linux-2.6.18/net/ipv4/tcp.c	2006-09-19 20:42:06.000000000 -0700
>> +++ linux-2.6.18.new/net/ipv4/tcp.c	2006-10-10 18:37:40.000000000 -0700
>> @@ -1764,6 +1764,8 @@
>>  	return err;
>>  }
>>  
>> +#define TCP_BACKOFF_MAXVAL 65535
>> +
>>  /*
>>   *	Socket option code for TCP.
>>   */
>> @@ -1939,6 +1941,21 @@
>>  		}
>>  		break;
>>  
>> +        case TCP_BACKOFF_MAX:
>> +                if (val < 1 || val > TCP_BACKOFF_MAXVAL)
>> +                        err = -EINVAL;
>> +                else
>> +                        tp->rto_max = val;
>> +                break;
>> + 
>> +        case TCP_BACKOFF_INIT:
>> +                if (val < 1 || val > TCP_BACKOFF_MAXVAL)
>> +                        err = -EINVAL;
>> +                else
>> +                        tp->rto_init = val;
>> +                break;
>> + 
>> + 
>>  	default:
>>  		err = -ENOPROTOOPT;
>>  		break;
>> @@ -2110,6 +2127,12 @@
>>  		if (copy_to_user(optval, icsk->icsk_ca_ops->name, len))
>>  			return -EFAULT;
>>  		return 0;
>> +        case TCP_BACKOFF_MAX:
>> +	        val = tcp_rto_max(tp)*1000/HZ;
>> +                break;
>> +        case TCP_BACKOFF_INIT:
>> +	        val = tcp_rto_init(tp)*1000/HZ;
> 
> The two divisions above introduce truncation errors such the input doesn't
> match the output.  Better to use jiffies_to_msecs().
> 
> Regards
> -vlad


[-- Attachment #2: tcp-backoff-18-5.diff --]
[-- Type: text/x-patch, Size: 6796 bytes --]

diff -ru linux-2.6.18/include/linux/sysctl.h linux-2.6.18.new/include/linux/sysctl.h
--- linux-2.6.18/include/linux/sysctl.h	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/include/linux/sysctl.h	2006-10-11 10:27:52.000000000 -0700
@@ -411,6 +411,8 @@
 	NET_IPV4_TCP_WORKAROUND_SIGNED_WINDOWS=115,
 	NET_TCP_DMA_COPYBREAK=116,
 	NET_TCP_SLOW_START_AFTER_IDLE=117,
+	NET_TCP_RTO_MAX=118,
+	NET_TCP_RTO_INIT=119,
 };
 
 enum {
diff -ru linux-2.6.18/include/linux/tcp.h linux-2.6.18.new/include/linux/tcp.h
--- linux-2.6.18/include/linux/tcp.h	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/include/linux/tcp.h	2006-10-11 10:41:40.000000000 -0700
@@ -94,6 +94,8 @@
 #define TCP_INFO		11	/* Information about this connection. */
 #define TCP_QUICKACK		12	/* Block/reenable quick acks */
 #define TCP_CONGESTION		13	/* Congestion control algorithm */
+#define TCP_BACKOFF_MAX         14      /* Maximum backoff value */
+#define TCP_BACKOFF_INIT        15      /* Initial backoff value */
 
 #define TCPI_OPT_TIMESTAMPS	1
 #define TCPI_OPT_SACK		2
@@ -257,6 +259,8 @@
 	__u8	frto_counter;	/* Number of new acks after RTO */
 	__u8	nonagle;	/* Disable Nagle algorithm?             */
 	__u8	keepalive_probes; /* num of allowed keep alive probes	*/
+	__u16	rto_max;	/* Maximum backoff value in ms		*/
+	__u16	rto_init;	/* Initial backoff value in ms		*/
 
 /* RTT measurement */
 	__u32	srtt;		/* smoothed round trip time << 3	*/
diff -ru linux-2.6.18/include/net/tcp.h linux-2.6.18.new/include/net/tcp.h
--- linux-2.6.18/include/net/tcp.h	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/include/net/tcp.h	2006-10-11 17:43:23.091431000 -0700
@@ -227,11 +227,23 @@
 extern int sysctl_tcp_base_mss;
 extern int sysctl_tcp_workaround_signed_windows;
 extern int sysctl_tcp_slow_start_after_idle;
+extern unsigned long sysctl_tcp_rto_max;
+extern unsigned long sysctl_tcp_rto_init;
 
 extern atomic_t tcp_memory_allocated;
 extern atomic_t tcp_sockets_allocated;
 extern int tcp_memory_pressure;
 
+static inline unsigned long tcp_rto_max(struct tcp_sock *tp)
+{
+	return tp->rto_max ? msecs_to_jiffies(tp->rto_max) : sysctl_tcp_rto_max;
+}
+
+static inline unsigned long tcp_rto_init(struct tcp_sock *tp)
+{
+	return tp->rto_init ? msecs_to_jiffies(tp->rto_init) : sysctl_tcp_rto_init;
+}
+
 /*
  * The next routines deal with comparing 32 bit unsigned ints
  * and worry about wraparound (automatic with unsigned arithmetic).
Only in linux-2.6.18.new/include/net: tcp.h~
diff -ru linux-2.6.18/net/ipv4/sysctl_net_ipv4.c linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c
--- linux-2.6.18/net/ipv4/sysctl_net_ipv4.c	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c	2006-10-11 16:00:37.000000000 -0700
@@ -128,6 +128,8 @@
 	return ret;
 }
 
+static unsigned long tcp_rto_min=0;
+static unsigned long tcp_rto_max=65535;
 
 ctl_table ipv4_table[] = {
         {
@@ -697,6 +699,26 @@
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec
 	},
+	{
+		.ctl_name	= NET_TCP_RTO_MAX,
+		.procname	= "tcp_rto_max",
+		.data		= &sysctl_tcp_rto_max,
+		.maxlen		= sizeof(unsigned),
+		.mode		= 0644,
+		.proc_handler	= &proc_doulongvec_ms_jiffies_minmax,
+		.extra1		= &tcp_rto_min_constant,
+		.extra2		= &tcp_rto_max_constant,
+	},
+	{
+		.ctl_name	= NET_TCP_RTO_INIT,
+		.procname	= "tcp_rto_init",
+		.data		= &sysctl_tcp_rto_init,
+		.maxlen		= sizeof(unsigned),
+		.mode		= 0644,
+		.proc_handler	= &proc_doulongvec_ms_jiffies_minmax,
+		.extra1		= &tcp_rto_min_constant,
+		.extra2		= &tcp_rto_max_constant,
+	},
 	{ .ctl_name = 0 }
 };
 
diff -ru linux-2.6.18/net/ipv4/tcp.c linux-2.6.18.new/net/ipv4/tcp.c
--- linux-2.6.18/net/ipv4/tcp.c	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/net/ipv4/tcp.c	2006-10-11 16:00:37.000000000 -0700
@@ -1764,6 +1764,8 @@
 	return err;
 }
 
+#define TCP_BACKOFF_MAXVAL 65535
+
 /*
  *	Socket option code for TCP.
  */
@@ -1939,6 +1941,20 @@
 		}
 		break;
 
+	case TCP_BACKOFF_MAX:
+		if (val < 1 || val > TCP_BACKOFF_MAXVAL)
+			err = -EINVAL;
+		else
+			tp->rto_max = val;
+		break;
+ 
+	case TCP_BACKOFF_INIT:
+		if (val < 1 || val > TCP_BACKOFF_MAXVAL)
+			err = -EINVAL;
+		else
+			tp->rto_init = val;
+		break;
+ 
 	default:
 		err = -ENOPROTOOPT;
 		break;
@@ -2110,6 +2126,12 @@
 		if (copy_to_user(optval, icsk->icsk_ca_ops->name, len))
 			return -EFAULT;
 		return 0;
+	case TCP_BACKOFF_MAX:
+		val = jiffies_to_msecs(tcp_rto_max(tp));
+		break;
+	case TCP_BACKOFF_INIT:
+		val = jiffies_to_msecs(tcp_rto_init(tp));
+		break;
 	default:
 		return -ENOPROTOOPT;
 	};
diff -ru linux-2.6.18/net/ipv4/tcp_timer.c linux-2.6.18.new/net/ipv4/tcp_timer.c
--- linux-2.6.18/net/ipv4/tcp_timer.c	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/net/ipv4/tcp_timer.c	2006-10-11 10:41:39.000000000 -0700
@@ -31,6 +31,8 @@
 int sysctl_tcp_retries1 = TCP_RETR1;
 int sysctl_tcp_retries2 = TCP_RETR2;
 int sysctl_tcp_orphan_retries;
+unsigned long sysctl_tcp_rto_max  = TCP_RTO_MAX;
+unsigned long sysctl_tcp_rto_init = TCP_TIMEOUT_INIT;
 
 static void tcp_write_timer(unsigned long);
 static void tcp_delack_timer(unsigned long);
@@ -71,7 +73,8 @@
 
 	/* If peer does not open window for long time, or did not transmit 
 	 * anything for long time, penalize it. */
-	if ((s32)(tcp_time_stamp - tp->lsndtime) > 2*TCP_RTO_MAX || !do_reset)
+	if ((s32)(tcp_time_stamp - tp->lsndtime) > 2*tcp_rto_max(tp) || 
+	    !do_reset)
 		orphans <<= 1;
 
 	/* If some dubious ICMP arrived, penalize even more. */
@@ -256,8 +259,8 @@
 	max_probes = sysctl_tcp_retries2;
 
 	if (sock_flag(sk, SOCK_DEAD)) {
-		const int alive = ((icsk->icsk_rto << icsk->icsk_backoff) < TCP_RTO_MAX);
- 
+		const int alive = ((icsk->icsk_rto << icsk->icsk_backoff) < 
+				   tcp_rto_max(tp));
 		max_probes = tcp_orphan_retries(sk, alive);
 
 		if (tcp_out_of_resources(sk, alive || icsk->icsk_probes_out <= max_probes))
@@ -301,7 +304,7 @@
 			       inet->num, tp->snd_una, tp->snd_nxt);
 		}
 #endif
-		if (tcp_time_stamp - tp->rcv_tstamp > TCP_RTO_MAX) {
+		if (tcp_time_stamp - tp->rcv_tstamp > tcp_rto_max(tp)) {
 			tcp_write_err(sk);
 			goto out;
 		}
@@ -373,7 +376,8 @@
 
 out_reset_timer:
 	icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
-	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
+	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, 
+				  tcp_rto_max(tp));
 	if (icsk->icsk_retransmits > sysctl_tcp_retries1)
 		__sk_dst_reset(sk);
 
@@ -427,8 +431,8 @@
 
 static void tcp_synack_timer(struct sock *sk)
 {
-	inet_csk_reqsk_queue_prune(sk, TCP_SYNQ_INTERVAL,
-				   TCP_TIMEOUT_INIT, TCP_RTO_MAX);
+	inet_csk_reqsk_queue_prune(sk, TCP_SYNQ_INTERVAL, TCP_TIMEOUT_INIT, 
+				   tcp_rto_max(tcp_sk(sk)));
 }
 
 void tcp_set_keepalive(struct sock *sk, int val)

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Customizable TCP backoff patch
  2006-10-12  0:51           ` Ben Woodard
@ 2006-10-12  1:06             ` YOSHIFUJI Hideaki / 吉藤英明
  2006-10-12 15:49               ` Ben Woodard
  0 siblings, 1 reply; 17+ messages in thread
From: YOSHIFUJI Hideaki / 吉藤英明 @ 2006-10-12  1:06 UTC (permalink / raw)
  To: woodard; +Cc: vladislav.yasevich, davem, netdev, mgrondona, behlendorf1,
	yoshfuji

In article <452D918C.6050300@redhat.com> (at Wed, 11 Oct 2006 17:51:24 -0700), Ben Woodard <woodard@redhat.com> says:

> diff -ru linux-2.6.18/net/ipv4/sysctl_net_ipv4.c linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c
> --- linux-2.6.18/net/ipv4/sysctl_net_ipv4.c	2006-09-19 20:42:06.000000000 -0700
> +++ linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c	2006-10-11 16:00:37.000000000 -0700
> @@ -128,6 +128,8 @@
>  	return ret;
>  }
>  
> +static unsigned long tcp_rto_min=0;
> +static unsigned long tcp_rto_max=65535;
>  
>  ctl_table ipv4_table[] = {
>          {
> @@ -697,6 +699,26 @@
>  		.mode		= 0644,
>  		.proc_handler	= &proc_dointvec
>  	},
> +	{
> +		.ctl_name	= NET_TCP_RTO_MAX,
> +		.procname	= "tcp_rto_max",
> +		.data		= &sysctl_tcp_rto_max,
> +		.maxlen		= sizeof(unsigned),

sizeof(unsigned long)

> +		.mode		= 0644,
> +		.proc_handler	= &proc_doulongvec_ms_jiffies_minmax,
> +		.extra1		= &tcp_rto_min_constant,
> +		.extra2		= &tcp_rto_max_constant,
> +	},
> +	{
> +		.ctl_name	= NET_TCP_RTO_INIT,
> +		.procname	= "tcp_rto_init",
> +		.data		= &sysctl_tcp_rto_init,
> +		.maxlen		= sizeof(unsigned),

sizeof(unsigned long)

> +		.mode		= 0644,
> +		.proc_handler	= &proc_doulongvec_ms_jiffies_minmax,
> +		.extra1		= &tcp_rto_min_constant,
> +		.extra2		= &tcp_rto_max_constant,
> +	},
>  	{ .ctl_name = 0 }
>  };
>  
> diff -ru linux-2.6.18/net/ipv4/tcp.c linux-2.6.18.new/net/ipv4/tcp.c
> --- linux-2.6.18/net/ipv4/tcp.c	2006-09-19 20:42:06.000000000 -0700
> +++ linux-2.6.18.new/net/ipv4/tcp.c	2006-10-11 16:00:37.000000000 -0700
> @@ -1764,6 +1764,8 @@
>  	return err;
>  }
>  
> +#define TCP_BACKOFF_MAXVAL 65535
> +
>  /*
>   *	Socket option code for TCP.
>   */
> @@ -1939,6 +1941,20 @@
>  		}
>  		break;
>  
> +	case TCP_BACKOFF_MAX:
> +		if (val < 1 || val > TCP_BACKOFF_MAXVAL)
> +			err = -EINVAL;
> +		else
> +			tp->rto_max = val;
> +		break;
> + 
> +	case TCP_BACKOFF_INIT:
> +		if (val < 1 || val > TCP_BACKOFF_MAXVAL)
> +			err = -EINVAL;
> +		else
> +			tp->rto_init = val;
> +		break;
> + 
>  	default:
>  		err = -ENOPROTOOPT;
>  		break;
> @@ -2110,6 +2126,12 @@
>  		if (copy_to_user(optval, icsk->icsk_ca_ops->name, len))
>  			return -EFAULT;
>  		return 0;
> +	case TCP_BACKOFF_MAX:
> +		val = jiffies_to_msecs(tcp_rto_max(tp));
> +		break;

tp->rto_max

> +	case TCP_BACKOFF_INIT:
> +		val = jiffies_to_msecs(tcp_rto_init(tp));
> +		break;
>  	default:

tp->rto_init

--yoshfuji

^ permalink raw reply	[flat|nested] 17+ messages in thread

* Re: [PATCH] Customizable TCP backoff patch
  2006-10-12  1:06             ` YOSHIFUJI Hideaki / 吉藤英明
@ 2006-10-12 15:49               ` Ben Woodard
  0 siblings, 0 replies; 17+ messages in thread
From: Ben Woodard @ 2006-10-12 15:49 UTC (permalink / raw)
  To: YOSHIFUJI Hideaki / 吉藤英明
  Cc: vladislav.yasevich, davem, netdev, mgrondona, behlendorf1

[-- Attachment #1: Type: text/plain, Size: 933 bytes --]

YOSHIFUJI Hideaki / 吉藤英明 wrote:
>> +		.data		= &sysctl_tcp_rto_max,
>> +		.maxlen		= sizeof(unsigned),
> 
> sizeof(unsigned long)

Good catch. That would have corrupted things badly on some 64b 
platforms. With all the flux in the area I forgot to change the size of 
that but would have been OK on the ia32 boxes.

>> diff -ru linux-2.6.18/net/ipv4/tcp.c linux-2.6.18.new/net/ipv4/tcp.c
>> --- linux-2.6.18/net/ipv4/tcp.c	2006-09-19 20:42:06.000000000 -0700
>> +++ linux-2.6.18.new/net/ipv4/tcp.c	2006-10-11 16:00:37.000000000 -0700
>> @@ -2110,6 +2126,12 @@
>>  		if (copy_to_user(optval, icsk->icsk_ca_ops->name, len))
>>  			return -EFAULT;
>>  		return 0;
>> +	case TCP_BACKOFF_MAX:
>> +		val = jiffies_to_msecs(tcp_rto_max(tp));
>> +		break;
> 
> tp->rto_max
> 
>> +	case TCP_BACKOFF_INIT:
>> +		val = jiffies_to_msecs(tcp_rto_init(tp));
>> +		break;
>>  	default:
> 
> tp->rto_init

OK I get it now.

> 
> --yoshfuji


[-- Attachment #2: tcp-backoff-18-6.diff --]
[-- Type: text/x-patch, Size: 6881 bytes --]

diff -ru linux-2.6.18/include/linux/sysctl.h linux-2.6.18.new/include/linux/sysctl.h
--- linux-2.6.18/include/linux/sysctl.h	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/include/linux/sysctl.h	2006-10-11 10:27:52.000000000 -0700
@@ -411,6 +411,8 @@
 	NET_IPV4_TCP_WORKAROUND_SIGNED_WINDOWS=115,
 	NET_TCP_DMA_COPYBREAK=116,
 	NET_TCP_SLOW_START_AFTER_IDLE=117,
+	NET_TCP_RTO_MAX=118,
+	NET_TCP_RTO_INIT=119,
 };
 
 enum {
diff -ru linux-2.6.18/include/linux/tcp.h linux-2.6.18.new/include/linux/tcp.h
--- linux-2.6.18/include/linux/tcp.h	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/include/linux/tcp.h	2006-10-12 08:28:52.645411000 -0700
@@ -94,6 +94,8 @@
 #define TCP_INFO		11	/* Information about this connection. */
 #define TCP_QUICKACK		12	/* Block/reenable quick acks */
 #define TCP_CONGESTION		13	/* Congestion control algorithm */
+#define TCP_BACKOFF_MAX		14	/* Maximum backoff value */
+#define TCP_BACKOFF_INIT	15	/* Initial backoff value */
 
 #define TCPI_OPT_TIMESTAMPS	1
 #define TCPI_OPT_SACK		2
@@ -257,6 +259,8 @@
 	__u8	frto_counter;	/* Number of new acks after RTO */
 	__u8	nonagle;	/* Disable Nagle algorithm?             */
 	__u8	keepalive_probes; /* num of allowed keep alive probes	*/
+	__u16	rto_max;	/* Maximum backoff value in ms		*/
+	__u16	rto_init;	/* Initial backoff value in ms		*/
 
 /* RTT measurement */
 	__u32	srtt;		/* smoothed round trip time << 3	*/
Only in linux-2.6.18.new/include/linux: tcp.h~
diff -ru linux-2.6.18/include/net/tcp.h linux-2.6.18.new/include/net/tcp.h
--- linux-2.6.18/include/net/tcp.h	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/include/net/tcp.h	2006-10-11 17:43:23.091431000 -0700
@@ -227,11 +227,23 @@
 extern int sysctl_tcp_base_mss;
 extern int sysctl_tcp_workaround_signed_windows;
 extern int sysctl_tcp_slow_start_after_idle;
+extern unsigned long sysctl_tcp_rto_max;
+extern unsigned long sysctl_tcp_rto_init;
 
 extern atomic_t tcp_memory_allocated;
 extern atomic_t tcp_sockets_allocated;
 extern int tcp_memory_pressure;
 
+static inline unsigned long tcp_rto_max(struct tcp_sock *tp)
+{
+	return tp->rto_max ? msecs_to_jiffies(tp->rto_max) : sysctl_tcp_rto_max;
+}
+
+static inline unsigned long tcp_rto_init(struct tcp_sock *tp)
+{
+	return tp->rto_init ? msecs_to_jiffies(tp->rto_init) : sysctl_tcp_rto_init;
+}
+
 /*
  * The next routines deal with comparing 32 bit unsigned ints
  * and worry about wraparound (automatic with unsigned arithmetic).
Only in linux-2.6.18.new/include/net: tcp.h~
diff -ru linux-2.6.18/net/ipv4/sysctl_net_ipv4.c linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c
--- linux-2.6.18/net/ipv4/sysctl_net_ipv4.c	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c	2006-10-12 07:14:41.871910000 -0700
@@ -128,6 +128,8 @@
 	return ret;
 }
 
+static unsigned long tcp_rto_min=0;
+static unsigned long tcp_rto_max=65535;
 
 ctl_table ipv4_table[] = {
         {
@@ -697,6 +699,26 @@
 		.mode		= 0644,
 		.proc_handler	= &proc_dointvec
 	},
+	{
+		.ctl_name	= NET_TCP_RTO_MAX,
+		.procname	= "tcp_rto_max",
+		.data		= &sysctl_tcp_rto_max,
+		.maxlen		= sizeof(unsigned long),
+		.mode		= 0644,
+		.proc_handler	= &proc_doulongvec_ms_jiffies_minmax,
+		.extra1		= &tcp_rto_min_constant,
+		.extra2		= &tcp_rto_max_constant,
+	},
+	{
+		.ctl_name	= NET_TCP_RTO_INIT,
+		.procname	= "tcp_rto_init",
+		.data		= &sysctl_tcp_rto_init,
+		.maxlen		= sizeof(unsigned long),
+		.mode		= 0644,
+		.proc_handler	= &proc_doulongvec_ms_jiffies_minmax,
+		.extra1		= &tcp_rto_min_constant,
+		.extra2		= &tcp_rto_max_constant,
+	},
 	{ .ctl_name = 0 }
 };
 
Only in linux-2.6.18.new/net/ipv4: sysctl_net_ipv4.c~
diff -ru linux-2.6.18/net/ipv4/tcp.c linux-2.6.18.new/net/ipv4/tcp.c
--- linux-2.6.18/net/ipv4/tcp.c	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/net/ipv4/tcp.c	2006-10-12 07:18:01.193083000 -0700
@@ -1764,6 +1764,8 @@
 	return err;
 }
 
+#define TCP_BACKOFF_MAXVAL 65535
+
 /*
  *	Socket option code for TCP.
  */
@@ -1939,6 +1941,20 @@
 		}
 		break;
 
+	case TCP_BACKOFF_MAX:
+		if (val < 1 || val > TCP_BACKOFF_MAXVAL)
+			err = -EINVAL;
+		else
+			tp->rto_max = val;
+		break;
+ 
+	case TCP_BACKOFF_INIT:
+		if (val < 1 || val > TCP_BACKOFF_MAXVAL)
+			err = -EINVAL;
+		else
+			tp->rto_init = val;
+		break;
+ 
 	default:
 		err = -ENOPROTOOPT;
 		break;
@@ -2110,6 +2126,12 @@
 		if (copy_to_user(optval, icsk->icsk_ca_ops->name, len))
 			return -EFAULT;
 		return 0;
+	case TCP_BACKOFF_MAX:
+		val = tp->rto_max;
+		break;
+	case TCP_BACKOFF_INIT:
+		val = tp->rto_init;
+		break;
 	default:
 		return -ENOPROTOOPT;
 	};
Only in linux-2.6.18.new/net/ipv4: tcp.c~
diff -ru linux-2.6.18/net/ipv4/tcp_timer.c linux-2.6.18.new/net/ipv4/tcp_timer.c
--- linux-2.6.18/net/ipv4/tcp_timer.c	2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/net/ipv4/tcp_timer.c	2006-10-11 10:41:39.000000000 -0700
@@ -31,6 +31,8 @@
 int sysctl_tcp_retries1 = TCP_RETR1;
 int sysctl_tcp_retries2 = TCP_RETR2;
 int sysctl_tcp_orphan_retries;
+unsigned long sysctl_tcp_rto_max  = TCP_RTO_MAX;
+unsigned long sysctl_tcp_rto_init = TCP_TIMEOUT_INIT;
 
 static void tcp_write_timer(unsigned long);
 static void tcp_delack_timer(unsigned long);
@@ -71,7 +73,8 @@
 
 	/* If peer does not open window for long time, or did not transmit 
 	 * anything for long time, penalize it. */
-	if ((s32)(tcp_time_stamp - tp->lsndtime) > 2*TCP_RTO_MAX || !do_reset)
+	if ((s32)(tcp_time_stamp - tp->lsndtime) > 2*tcp_rto_max(tp) || 
+	    !do_reset)
 		orphans <<= 1;
 
 	/* If some dubious ICMP arrived, penalize even more. */
@@ -256,8 +259,8 @@
 	max_probes = sysctl_tcp_retries2;
 
 	if (sock_flag(sk, SOCK_DEAD)) {
-		const int alive = ((icsk->icsk_rto << icsk->icsk_backoff) < TCP_RTO_MAX);
- 
+		const int alive = ((icsk->icsk_rto << icsk->icsk_backoff) < 
+				   tcp_rto_max(tp));
 		max_probes = tcp_orphan_retries(sk, alive);
 
 		if (tcp_out_of_resources(sk, alive || icsk->icsk_probes_out <= max_probes))
@@ -301,7 +304,7 @@
 			       inet->num, tp->snd_una, tp->snd_nxt);
 		}
 #endif
-		if (tcp_time_stamp - tp->rcv_tstamp > TCP_RTO_MAX) {
+		if (tcp_time_stamp - tp->rcv_tstamp > tcp_rto_max(tp)) {
 			tcp_write_err(sk);
 			goto out;
 		}
@@ -373,7 +376,8 @@
 
 out_reset_timer:
 	icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
-	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
+	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, 
+				  tcp_rto_max(tp));
 	if (icsk->icsk_retransmits > sysctl_tcp_retries1)
 		__sk_dst_reset(sk);
 
@@ -427,8 +431,8 @@
 
 static void tcp_synack_timer(struct sock *sk)
 {
-	inet_csk_reqsk_queue_prune(sk, TCP_SYNQ_INTERVAL,
-				   TCP_TIMEOUT_INIT, TCP_RTO_MAX);
+	inet_csk_reqsk_queue_prune(sk, TCP_SYNQ_INTERVAL, TCP_TIMEOUT_INIT, 
+				   tcp_rto_max(tcp_sk(sk)));
 }
 
 void tcp_set_keepalive(struct sock *sk, int val)

^ permalink raw reply	[flat|nested] 17+ messages in thread

end of thread, other threads:[~2006-10-12 15:49 UTC | newest]

Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2006-09-27 18:52 [PATCH] Customizable TCP backoff patch Ben Woodard
2006-09-27 23:16 ` David Miller
2006-09-27 23:00   ` Stephen Hemminger
2006-09-28  2:06     ` David Miller
2006-10-03 18:14   ` Ben Woodard
2006-10-04  7:07     ` David Miller
2006-10-04  8:56       ` Ingo Oeser
2006-10-04  9:08         ` David Miller
2006-10-04 17:17       ` Stephen Hemminger
2006-10-11  1:46       ` Ben Woodard
2006-10-11  1:59         ` YOSHIFUJI Hideaki / 吉藤英明
2006-10-11  2:53         ` David Miller
2006-10-11  2:54         ` David Miller
2006-10-11 14:01         ` Vlad Yasevich
2006-10-12  0:51           ` Ben Woodard
2006-10-12  1:06             ` YOSHIFUJI Hideaki / 吉藤英明
2006-10-12 15:49               ` Ben Woodard

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).