netdev.vger.kernel.org archive mirror
* [PATCH 2.6.22-rc5] TCP: Make TCP_RTO_MAX a variable
@ 2007-06-25 13:09 OBATA Noboru
  2007-06-25 13:15 ` Patrick McHardy
                   ` (3 more replies)
  0 siblings, 4 replies; 14+ messages in thread
From: OBATA Noboru @ 2007-06-25 13:09 UTC (permalink / raw)
  To: David Miller; +Cc: Stephen Hemminger, netdev

From: OBATA Noboru <noboru.obata.ar@hitachi.com>

Make TCP_RTO_MAX a variable, and allow a user to change it via a
new sysctl entry /proc/sys/net/ipv4/tcp_rto_max.  A user can
then bound the TCP retransmission interval, say, to at least one
retransmission per 10 seconds, by setting it to 10.  This is
quite helpful on failover-capable network devices, such as an
active-backup bonding device.  On such devices, it is desirable
that TCP retransmits a packet shortly after the failover, which
is what I would like to achieve with this patch.  Please see
Background and Problem below for the detailed rationale.

Reading from /proc/sys/net/ipv4/tcp_rto_max shows the current
TCP_RTO_MAX in seconds.  The actual value of TCP_RTO_MAX is
stored in sysctl_tcp_rto_max in jiffies.

Writing to /proc/sys/net/ipv4/tcp_rto_max updates TCP_RTO_MAX
only if the new value is not smaller than TCP_RTO_MIN, which is
currently 0.2[sec].  Since tcp_rto_max is an integer number of
seconds, the minimum value of /proc/sys/net/ipv4/tcp_rto_max is
effectively 1.  The RtoMax entry in /proc/net/snmp is updated as
well.
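
The write rule described above can be modelled in userspace roughly
as follows.  This is only a sketch, not the handler from the patch:
rto_max_write_ok(), MODEL_HZ and MODEL_RTO_MIN are names made up for
the illustration, and HZ is assumed to be 1000 (in the kernel it is a
build-time setting).

```c
#include <limits.h>
#include <stdbool.h>

/* Assumed tick rate for this illustration; the kernel's HZ is configurable. */
#define MODEL_HZ	1000
/* Mirrors TCP_RTO_MIN = HZ/5, i.e. 0.2 sec. */
#define MODEL_RTO_MIN	(MODEL_HZ / 5)

/*
 * Hypothetical helper: convert a value written in whole seconds to
 * jiffies and accept it only if it is not below the minimum RTO.
 * On success the jiffies value is stored in *jiffies_out.
 */
static bool rto_max_write_ok(int seconds, int *jiffies_out)
{
	long long j;

	if (seconds < 0)
		return false;		/* negative timeouts make no sense */

	j = (long long)seconds * MODEL_HZ;
	if (j > INT_MAX)
		return false;		/* would not fit the int variable */
	if (j < MODEL_RTO_MIN)
		return false;		/* 0 sec is below 0.2 sec */

	*jiffies_out = (int)j;
	return true;
}
```

With these assumptions a write of 0 is rejected, while 1 is the
smallest value that is accepted, which is why the minimum is
effectively 1.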

Please note that this is effective in IPv6 as well.


Background and Problem
======================

When designing a TCP/IP based network system on failover-capable
network devices, people want to set timeouts hierarchically in
three layers: the network device layer, the TCP layer, and the
application layer (bottom-up order), such that:

1. Network device layer detects a failure first and switches to a
   backup device (say, in 20sec).

2. TCP layer timeout & retransmission comes next, _hopefully_
   before the application layer timeout.

3. Application layer detects a network failure last (by, say,
   30sec timeout) and may trigger a system-level failover.

   * Note 1.  The timeouts for #1 and #2 are handled
     independently and there is no relationship between them.

   * Note 2.  The actual timeout settings (20sec or 30sec in
     this example) are often determined by system requirements,
     so setting them to certain "safe values" (if any) is
     usually not possible.

If TCP retransmission misses the time frame between events #1
and #3 above (between 20 and 30sec after the network failure),
the failure causes a system-level failover where a
network-device-level failover should have been enough.

The problem in this hierarchical timeout scheme is that the TCP
layer does not guarantee that the next retransmission occurs
within a certain period of time.  In the above example, people
expect TCP to retransmit a packet between 20 and 30sec after the
network failure, but it may not happen.

Starting from RTO=0.5sec for example, retransmissions will occur
at times 0.5, 1.5, 3.5, 7.5, 15.5, and 31.5 as indicated by 'o'
in the following diagram, and so miss the time frame between
time 20 and 30.

       time: 0         10        20        30sec
             |         |         |         |
  App. layer |---------+---------+---------X  ==> system failover
   TCP layer oo-o---o--+----o----+---------+o <== expects retrans. b/w 20~30
Netdev layer |---------+---------X            ==> network failover
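
The schedule in the diagram can be reproduced with a small userspace
model.  The sketch below is not kernel code: retrans_in_window() is a
hypothetical helper, times are in milliseconds, and it assumes the RTO
simply doubles after every timeout until it reaches the cap.

```c
#include <stdbool.h>

/*
 * Hypothetical helper: with the RTO doubling after every timeout and
 * capped at rto_max_ms, report whether any retransmission fires inside
 * the window [win_start_ms, win_end_ms], counted from the failure.
 */
static bool retrans_in_window(unsigned rto_ms, unsigned rto_max_ms,
			      unsigned win_start_ms, unsigned win_end_ms)
{
	unsigned long long t = 0;

	if (rto_ms > rto_max_ms)
		rto_ms = rto_max_ms;
	if (rto_ms == 0)
		return false;		/* nothing sensible to simulate */

	for (;;) {
		t += rto_ms;		/* time of the next retransmission */
		if (t > win_end_ms)
			return false;	/* jumped past the window */
		if (t >= win_start_ms)
			return true;

		/* Exponential backoff, bounded by the maximum RTO. */
		if (rto_ms > rto_max_ms / 2)
			rto_ms = rto_max_ms;
		else
			rto_ms *= 2;
	}
}
```

With an initial RTO of 500 and a window of 20000..30000, a cap of
120000 gives no retransmission in the window (the schedule jumps from
15.5 to 31.5 sec), whereas a cap of 10000 gives one at 25.5 sec.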


Signed-off-by: OBATA Noboru <noboru.obata.ar@hitachi.com>
---

 Documentation/networking/ip-sysctl.txt |    6 +
 include/linux/sysctl.h                 |    1
 include/net/tcp.h                      |    5 +
 net/ipv4/sysctl_net_ipv4.c             |   77 +++++++++++++++++++++++++
 net/ipv4/tcp_timer.c                   |    3
 5 files changed, 91 insertions(+), 1 deletion(-)

diff -uprN -X a/Documentation/dontdiff linux-2.6.22-rc5-orig/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
--- a/Documentation/networking/ip-sysctl.txt	2007-06-22 21:34:18.000000000 +0900
+++ b/Documentation/networking/ip-sysctl.txt	2007-06-25 16:07:21.000000000 +0900
@@ -340,6 +340,12 @@ tcp_rmem - vector of 3 INTEGERs: min, de
 	net.core.rmem_max, "static" selection via SO_RCVBUF does not use this.
 	Default: 87380*2 bytes.
 
+tcp_rto_max - INTEGER
+	Maximum time in seconds to which RTO can grow.  Exponential
+	backoff of RTO is bounded by this value.  The value must not be
+	smaller than 1.  Note this parameter is also effective for IPv6.
+	Default: 120
+
 tcp_sack - BOOLEAN
 	Enable select acknowledgments (SACKS).
 
diff -uprN -X a/Documentation/dontdiff linux-2.6.22-rc5-orig/include/linux/sysctl.h b/include/linux/sysctl.h
--- a/include/linux/sysctl.h	2007-06-22 21:34:33.000000000 +0900
+++ b/include/linux/sysctl.h	2007-06-25 16:27:29.000000000 +0900
@@ -441,6 +441,7 @@ enum
 	NET_TCP_ALLOWED_CONG_CONTROL=123,
 	NET_TCP_MAX_SSTHRESH=124,
 	NET_TCP_FRTO_RESPONSE=125,
+	NET_TCP_RTO_MAX=126,
 };
 
 enum {
diff -uprN -X a/Documentation/dontdiff linux-2.6.22-rc5-orig/include/net/tcp.h b/include/net/tcp.h
--- a/include/net/tcp.h	2007-06-22 21:34:33.000000000 +0900
+++ b/include/net/tcp.h	2007-06-22 21:40:05.000000000 +0900
@@ -121,7 +121,9 @@ extern void tcp_time_wait(struct sock *s
 #define TCP_DELACK_MIN	4U
 #define TCP_ATO_MIN	4U
 #endif
-#define TCP_RTO_MAX	((unsigned)(120*HZ))
+extern int sysctl_tcp_rto_max;
+#define TCP_RTO_MAX	((unsigned)(sysctl_tcp_rto_max))
+#define TCP_RTO_MAX_DEFAULT	((unsigned)(120*HZ))
 #define TCP_RTO_MIN	((unsigned)(HZ/5))
 #define TCP_TIMEOUT_INIT ((unsigned)(3*HZ))	/* RFC 1122 initial RTO value	*/
 
@@ -203,6 +205,7 @@ extern int sysctl_tcp_synack_retries;
 extern int sysctl_tcp_retries1;
 extern int sysctl_tcp_retries2;
 extern int sysctl_tcp_orphan_retries;
+extern int sysctl_tcp_rto_max;
 extern int sysctl_tcp_syncookies;
 extern int sysctl_tcp_retrans_collapse;
 extern int sysctl_tcp_stdurg;
diff -uprN -X a/Documentation/dontdiff linux-2.6.22-rc5-orig/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
--- a/net/ipv4/sysctl_net_ipv4.c	2007-06-22 21:34:33.000000000 +0900
+++ b/net/ipv4/sysctl_net_ipv4.c	2007-06-25 16:27:53.000000000 +0900
@@ -186,6 +186,74 @@ static int strategy_allowed_congestion_c
 
 }
 
+static int proc_tcp_rto_max(ctl_table *ctl, int write, struct file *filp,
+			    void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+	int val = *(int *)ctl->data;
+	int ret;
+
+	ret = proc_dointvec_jiffies(ctl, write, filp, buffer, lenp, ppos);
+	if (ret)
+		return ret;
+
+	if (write && *(int *)ctl->data != val) {
+		if (*(int *)ctl->data < TCP_RTO_MIN) {
+			*(int *)ctl->data = val;
+			return -EINVAL;
+		}
+		TCP_ADD_STATS_USER(TCP_MIB_RTOMAX,
+				   (*(int *)ctl->data - val) * 1000 / HZ);
+	}
+
+	return 0;
+}
+
+static int strategy_tcp_rto_max(ctl_table *table, int __user *name,
+				int nlen, void __user *oldval,
+				size_t __user *oldlenp,
+				void __user *newval, size_t newlen)
+{
+	int *valp = table->data;
+	int new;
+
+	if (!newval || !newlen)
+		return 0;
+
+	if (newlen != sizeof(int))
+		return -EINVAL;
+
+	if (get_user(new, (int __user *)newval))
+		return -EFAULT;
+
+	if (new * HZ == *valp)
+		return 0;
+
+	if (new * HZ < TCP_RTO_MIN)
+		return -EINVAL;
+
+	if (oldval && oldlenp) {
+		size_t len;
+
+		if (get_user(len, oldlenp))
+			return -EFAULT;
+
+		if (len) {
+			if (len > table->maxlen)
+				len = table->maxlen;
+			if (put_user(*valp / HZ, (int __user *)oldval))
+				return -EFAULT;
+			if (put_user(len, oldlenp))
+				return -EFAULT;
+		}
+	}
+
+	TCP_ADD_STATS_USER(TCP_MIB_RTOMAX, (new * HZ - *valp) * 1000 / HZ);
+
+	*valp = new * HZ;
+
+	return 1;
+}
+
 ctl_table ipv4_table[] = {
 	{
 		.ctl_name	= NET_IPV4_TCP_TIMESTAMPS,
@@ -363,6 +431,15 @@ ctl_table ipv4_table[] = {
 		.proc_handler	= &proc_dointvec
 	},
 	{
+		.ctl_name	= NET_TCP_RTO_MAX,
+		.procname	= "tcp_rto_max",
+		.data		= &sysctl_tcp_rto_max,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= &proc_tcp_rto_max,
+		.strategy	= &strategy_tcp_rto_max
+	},
+	{
 		.ctl_name	= NET_IPV4_TCP_FIN_TIMEOUT,
 		.procname	= "tcp_fin_timeout",
 		.data		= &sysctl_tcp_fin_timeout,
diff -uprN -X a/Documentation/dontdiff linux-2.6.22-rc5-orig/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
--- a/net/ipv4/tcp_timer.c	2007-06-22 21:34:33.000000000 +0900
+++ b/net/ipv4/tcp_timer.c	2007-06-22 21:39:35.000000000 +0900
@@ -31,6 +31,9 @@ int sysctl_tcp_keepalive_intvl __read_mo
 int sysctl_tcp_retries1 __read_mostly = TCP_RETR1;
 int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
 int sysctl_tcp_orphan_retries __read_mostly;
+int sysctl_tcp_rto_max __read_mostly = TCP_RTO_MAX_DEFAULT;
+
+EXPORT_SYMBOL(sysctl_tcp_rto_max);
 
 static void tcp_write_timer(unsigned long);
 static void tcp_delack_timer(unsigned long);


* Re: [PATCH 2.6.22-rc5] TCP: Make TCP_RTO_MAX a variable
  2007-06-25 13:09 [PATCH 2.6.22-rc5] TCP: Make TCP_RTO_MAX a variable OBATA Noboru
@ 2007-06-25 13:15 ` Patrick McHardy
  2007-06-25 14:45   ` Siim Põder
                     ` (2 more replies)
  2007-06-25 16:07 ` Stephen Hemminger
                   ` (2 subsequent siblings)
  3 siblings, 3 replies; 14+ messages in thread
From: Patrick McHardy @ 2007-06-25 13:15 UTC (permalink / raw)
  To: OBATA Noboru; +Cc: David Miller, Stephen Hemminger, netdev

OBATA Noboru wrote:
> From: OBATA Noboru <noboru.obata.ar@hitachi.com>
> 
> Make TCP_RTO_MAX a variable, and allow a user to change it via a
> new sysctl entry /proc/sys/net/ipv4/tcp_rto_max.  A user can
> then guarantee TCP retransmission to be more controllable, say,
> at least once per 10 seconds, by setting it to 10.  This is
> quite helpful on failover-capable network devices, such as an
> active-backup bonding device.  On such devices, it is desirable
> that TCP retransmits a packet shortly after the failover, which
> is what I would like to do with this patch.  Please see
> Background and Problem below for rationale in detail.


Would it make sense to do this per route?



* Re: [PATCH 2.6.22-rc5] TCP: Make TCP_RTO_MAX a variable
  2007-06-25 13:15 ` Patrick McHardy
@ 2007-06-25 14:45   ` Siim Põder
  2007-06-25 16:08   ` Stephen Hemminger
  2007-06-27 21:57   ` [MaybeSpam] " noboru.obata.ar
  2 siblings, 0 replies; 14+ messages in thread
From: Siim Põder @ 2007-06-25 14:45 UTC (permalink / raw)
  To: netdev

Yo!

Patrick McHardy wrote:
> OBATA Noboru wrote:
>> Make TCP_RTO_MAX a variable, and allow a user to change it via a
>> new sysctl entry /proc/sys/net/ipv4/tcp_rto_max.  A user can
>> then guarantee TCP retransmission to be more controllable, say,
>> at least once per 10 seconds, by setting it to 10.  This is
>> quite helpful on failover-capable network devices, such as an
>> active-backup bonding device.  On such devices, it is desirable
>> that TCP retransmits a packet shortly after the failover, which
>> is what I would like to do with this patch.  Please see
>> Background and Problem below for rationale in detail.
> 
> Would it make sense to do this per route?

Doing it only per route would reduce its usefulness with dynamic
routing, as routing daemons would probably have trouble supporting it.

Siim Põder


* Re: [PATCH 2.6.22-rc5] TCP: Make TCP_RTO_MAX a variable
  2007-06-25 13:09 [PATCH 2.6.22-rc5] TCP: Make TCP_RTO_MAX a variable OBATA Noboru
  2007-06-25 13:15 ` Patrick McHardy
@ 2007-06-25 16:07 ` Stephen Hemminger
  2007-07-12  6:45   ` OBATA Noboru
  2007-06-25 22:18 ` Ian McDonald
  2007-06-28  1:00 ` YOSHIFUJI Hideaki / 吉藤英明
  3 siblings, 1 reply; 14+ messages in thread
From: Stephen Hemminger @ 2007-06-25 16:07 UTC (permalink / raw)
  To: OBATA Noboru; +Cc: David Miller, netdev

On Mon, 25 Jun 2007 22:09:39 +0900 (JST)
OBATA Noboru <noboru.obata.ar@hitachi.com> wrote:

> From: OBATA Noboru <noboru.obata.ar@hitachi.com>
> 
> Make TCP_RTO_MAX a variable, and allow a user to change it via a
> new sysctl entry /proc/sys/net/ipv4/tcp_rto_max.  A user can
> then guarantee TCP retransmission to be more controllable, say,
> at least once per 10 seconds, by setting it to 10.  This is
> quite helpful on failover-capable network devices, such as an
> active-backup bonding device.  On such devices, it is desirable
> that TCP retransmits a packet shortly after the failover, which
> is what I would like to do with this patch.  Please see
> Background and Problem below for rationale in detail.
> 
> Reading from /proc/sys/net/ipv4/tcp_rto_max shows the current
> TCP_RTO_MAX in seconds.  The actual value of TCP_RTO_MAX is
> stored in sysctl_tcp_rto_max in jiffies.
> 
> Writing to /proc/sys/net/ipv4/tcp_rto_max updates the
> TCP_RTO_MAX, only if the new value is not smaller than
> TCP_RTO_MIN, which is currently 0.2[sec].  Since tcp_rto_max is
> an integer, the minimum value of /proc/sys/net/ipv4/tcp_rto_max
> is 1, in substance.  Also the RtoMax entry in /proc/net/snmp is
> updated.
> 
> Please note that this is effective in IPv6 as well.
> 
> 
> Background and Problem
> ======================
> 
> When designing a TCP/IP based network system on failover-capable
> network devices, people want to set timeouts hierarchically in
> three layers, network device layer, TCP layer, and application
> layer (bottom-up order), such that:
> 
> 1. Network device layer detects a failure first and switch to a
>    backup device (say, in 20sec).
> 
> 2. TCP layer timeout & retransmission comes next, _hopefully_
>    before the application layer timeout.
> 
> 3. Application layer detects a network failure last (by, say,
>    30sec timeout) and may trigger a system-level failover.
> 
>    * Note 1.  The timeouts for #1 and #2 are handled
>      independently and there is no relationship between them.
> 
>    * Note 2.  The actual timeout settings (20sec or 30sec in
>      this example) are often determined by systems requirement
>      and so setting them to certain "safe values" (if any) are
>      usually not possible.
> 
> If TCP retransmission misses the time frame between event #1
> and #3 in Background above (between 20 and 30sec since network
> failure), a failure causes the system-level failover where the
> network-device-level failover should be enough.
> 
> The problem in this hierarchical timeout scheme is that TCP
> layer does not guarantee the next retransmission to occur in
> certain period of time.  In the above example, people expect TCP
> to retransmit a packet between 20 and 30sec since network
> failure, but it may not happen.
> 
> Starting from RTO=0.5sec for example, retransmission will occur
> at time 0.5, 1.5, 3.5, 7.5, 15.5, and 31.5 as indicated by 'o'
> in the following diagram, but miss the time frame between time
> 20 and 30.
> 
>        time: 0         10        20        30sec
>              |         |         |         |
>   App. layer |---------+---------+---------X  ==> system failover
>    TCP layer oo-o---o--+----o----+---------+o <== expects retrans. b/w 20~30
> Netdev layer |---------+---------X            ==> network failover
> 
> 
> Signed-off-by: OBATA Noboru <noboru.obata.ar@hitachi.com>
> ---
> 
>  Documentation/networking/ip-sysctl.txt |    6 +
>  include/linux/sysctl.h                 |    1
>  include/net/tcp.h                      |    5 +
>  net/ipv4/sysctl_net_ipv4.c             |   77 +++++++++++++++++++++++++
>  net/ipv4/tcp_timer.c                   |    3
>  5 files changed, 91 insertions(+), 1 deletion(-)
> 
> diff -uprN -X a/Documentation/dontdiff linux-2.6.22-rc5-orig/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> --- a/Documentation/networking/ip-sysctl.txt	2007-06-22 21:34:18.000000000 +0900
> +++ b/Documentation/networking/ip-sysctl.txt	2007-06-25 16:07:21.000000000 +0900
> @@ -340,6 +340,12 @@ tcp_rmem - vector of 3 INTEGERs: min, de
>  	net.core.rmem_max, "static" selection via SO_RCVBUF does not use this.
>  	Default: 87380*2 bytes.
>  
> +tcp_rto_max - INTEGER
> +	Maximum time in seconds to which RTO can grow.  Exponential
> +	backoff of RTO is bounded by this value.  The value must not be
> +	smaller than 1.  Note this parameter is also effective for IPv6.
> +	Default: 120
> +
>  tcp_sack - BOOLEAN
>  	Enable select acknowledgments (SACKS).
>  
> diff -uprN -X a/Documentation/dontdiff linux-2.6.22-rc5-orig/include/linux/sysctl.h b/include/linux/sysctl.h
> --- a/include/linux/sysctl.h	2007-06-22 21:34:33.000000000 +0900
> +++ b/include/linux/sysctl.h	2007-06-25 16:27:29.000000000 +0900
> @@ -441,6 +441,7 @@ enum
>  	NET_TCP_ALLOWED_CONG_CONTROL=123,
>  	NET_TCP_MAX_SSTHRESH=124,
>  	NET_TCP_FRTO_RESPONSE=125,
> +	NET_TCP_RTO_MAX=126,
>  };
>  

Rather than assigning another numeric sysctl value, you can use
CTL_UNNUMBERED.  The use of numeric sysctls is being phased out; at one
point they were even going to be deprecated.


>  enum {
> diff -uprN -X a/Documentation/dontdiff linux-2.6.22-rc5-orig/include/net/tcp.h b/include/net/tcp.h
> --- a/include/net/tcp.h	2007-06-22 21:34:33.000000000 +0900
> +++ b/include/net/tcp.h	2007-06-22 21:40:05.000000000 +0900
> @@ -121,7 +121,9 @@ extern void tcp_time_wait(struct sock *s
>  #define TCP_DELACK_MIN	4U
>  #define TCP_ATO_MIN	4U
>  #endif
> -#define TCP_RTO_MAX	((unsigned)(120*HZ))
> +extern int sysctl_tcp_rto_max;
> +#define TCP_RTO_MAX	((unsigned)(sysctl_tcp_rto_max))
> +#define TCP_RTO_MAX_DEFAULT	((unsigned)(120*HZ))
>  #define TCP_RTO_MIN	((unsigned)(HZ/5))
>  #define TCP_TIMEOUT_INIT ((unsigned)(3*HZ))	/* RFC 1122 initial RTO value	*/

Rather than causing macro TCP_RTO_MAX to reference sysctl_rto_max directly.

> @@ -203,6 +205,7 @@ extern int sysctl_tcp_synack_retries;
>  extern int sysctl_tcp_retries1;
>  extern int sysctl_tcp_retries2;
>  extern int sysctl_tcp_orphan_retries;
> +extern int sysctl_tcp_rto_max;
>  extern int sysctl_tcp_syncookies;
>  extern int sysctl_tcp_retrans_collapse;
>  extern int sysctl_tcp_stdurg;
> diff -uprN -X a/Documentation/dontdiff linux-2.6.22-rc5-orig/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> --- a/net/ipv4/sysctl_net_ipv4.c	2007-06-22 21:34:33.000000000 +0900
> +++ b/net/ipv4/sysctl_net_ipv4.c	2007-06-25 16:27:53.000000000 +0900
> @@ -186,6 +186,74 @@ static int strategy_allowed_congestion_c
>  
>  }
>  
> +static int proc_tcp_rto_max(ctl_table *ctl, int write, struct file *filp,
> +			    void __user *buffer, size_t *lenp, loff_t *ppos)
> +{
> +	int val = *(int *)ctl->data;
> +	int ret;
> +
> +	ret = proc_dointvec_jiffies(ctl, write, filp, buffer, lenp, ppos);
> +	if (ret)
> +		return ret;
> +
> +	if (write && *(int *)ctl->data != val) {
> +		if (*(int *)ctl->data < TCP_RTO_MIN) {
> +			*(int *)ctl->data = val;
> +			return -EINVAL;
> +		}
> +		TCP_ADD_STATS_USER(TCP_MIB_RTOMAX,
> +				   (*(int *)ctl->data - val) * 1000 / HZ);
> +	}
> +
> +	return 0;
> +}
> +
> +static int strategy_tcp_rto_max(ctl_table *table, int __user *name,
> +				int nlen, void __user *oldval,
> +				size_t __user *oldlenp,
> +				void __user *newval, size_t newlen)
> +{
> +	int *valp = table->data;
> +	int new;
> +
> +	if (!newval || !newlen)
> +		return 0;
> +
> +	if (newlen != sizeof(int))
> +		return -EINVAL;
> +
> +	if (get_user(new, (int __user *)newval))
> +		return -EFAULT;
> +
> +	if (new * HZ == *valp)
> +		return 0;
> +
> +	if (new * HZ < TCP_RTO_MIN)
> +		return -EINVAL;
> +
> +	if (oldval && oldlenp) {
> +		size_t len;
> +
> +		if (get_user(len, oldlenp))
> +			return -EFAULT;
> +
> +		if (len) {
> +			if (len > table->maxlen)
> +				len = table->maxlen;
> +			if (put_user(*valp / HZ, (int __user *)oldval))
> +				return -EFAULT;
> +			if (put_user(len, oldlenp))
> +				return -EFAULT;
> +		}
> +	}
> +
> +	TCP_ADD_STATS_USER(TCP_MIB_RTOMAX, (new * HZ - *valp) * 1000 / HZ);
> +
> +	*valp = new * HZ;
> +
> +	return 1;
> +}

Could sysctl_tcp_rto_max be unsigned instead of int, to avoid possible
sign-wrap issues and having to cast it on each use?
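
One way such a sign issue can show up is a comparison between an int
value and an unsigned bound such as TCP_RTO_MIN.  The following is a
userspace sketch of that effect only; below_min_mixed() and
below_min_checked() are hypothetical names, and the cast in the first
function spells out the conversion C applies implicitly when an int
is compared with an unsigned int.

```c
#include <stdbool.h>

/*
 * Mixed-sign comparison: the int operand is converted to unsigned
 * before the comparison, so a negative value looks like a huge
 * positive one and is never "below the minimum".
 */
static bool below_min_mixed(int value_jiffies, unsigned min_jiffies)
{
	return (unsigned)value_jiffies < min_jiffies;
}

/* Same check with the sign handled explicitly first. */
static bool below_min_checked(int value_jiffies, unsigned min_jiffies)
{
	return value_jiffies < 0 || (unsigned)value_jiffies < min_jiffies;
}
```

For a value of -1000 and a minimum of 200, the first check reports
"not below the minimum" while the second rejects it.  Keeping the
variable unsigned from the start removes the need for either cast.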

>  ctl_table ipv4_table[] = {
>  	{
>  		.ctl_name	= NET_IPV4_TCP_TIMESTAMPS,
> @@ -363,6 +431,15 @@ ctl_table ipv4_table[] = {
>  		.proc_handler	= &proc_dointvec
>  	},
>  	{
> +		.ctl_name	= NET_TCP_RTO_MAX,
> +		.procname	= "tcp_rto_max",
> +		.data		= &sysctl_tcp_rto_max,
> +		.maxlen		= sizeof(int),
> +		.mode		= 0644,
> +		.proc_handler	= &proc_tcp_rto_max,
> +		.strategy	= &strategy_tcp_rto_max
> +	},
> +	{
>  		.ctl_name	= NET_IPV4_TCP_FIN_TIMEOUT,
>  		.procname	= "tcp_fin_timeout",
>  		.data		= &sysctl_tcp_fin_timeout,
> diff -uprN -X a/Documentation/dontdiff linux-2.6.22-rc5-orig/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> --- a/net/ipv4/tcp_timer.c	2007-06-22 21:34:33.000000000 +0900
> +++ b/net/ipv4/tcp_timer.c	2007-06-22 21:39:35.000000000 +0900
> @@ -31,6 +31,9 @@ int sysctl_tcp_keepalive_intvl __read_mo
>  int sysctl_tcp_retries1 __read_mostly = TCP_RETR1;
>  int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
>  int sysctl_tcp_orphan_retries __read_mostly;
> +int sysctl_tcp_rto_max __read_mostly = TCP_RTO_MAX_DEFAULT;
> +
> +EXPORT_SYMBOL(sysctl_tcp_rto_max);
>  
>  static void tcp_write_timer(unsigned long);
>  static void tcp_delack_timer(unsigned long);


-- 
Stephen Hemminger <shemminger@linux-foundation.org>


* Re: [PATCH 2.6.22-rc5] TCP: Make TCP_RTO_MAX a variable
  2007-06-25 13:15 ` Patrick McHardy
  2007-06-25 14:45   ` Siim Põder
@ 2007-06-25 16:08   ` Stephen Hemminger
  2007-06-27 21:57   ` [MaybeSpam] " noboru.obata.ar
  2 siblings, 0 replies; 14+ messages in thread
From: Stephen Hemminger @ 2007-06-25 16:08 UTC (permalink / raw)
  To: Patrick McHardy; +Cc: OBATA Noboru, David Miller, netdev

On Mon, 25 Jun 2007 15:15:14 +0200
Patrick McHardy <kaber@trash.net> wrote:

> OBATA Noboru wrote:
> > From: OBATA Noboru <noboru.obata.ar@hitachi.com>
> > 
> > Make TCP_RTO_MAX a variable, and allow a user to change it via a
> > new sysctl entry /proc/sys/net/ipv4/tcp_rto_max.  A user can
> > then guarantee TCP retransmission to be more controllable, say,
> > at least once per 10 seconds, by setting it to 10.  This is
> > quite helpful on failover-capable network devices, such as an
> > active-backup bonding device.  On such devices, it is desirable
> > that TCP retransmits a packet shortly after the failover, which
> > is what I would like to do with this patch.  Please see
> > Background and Problem below for rationale in detail.
> 
> 
> Would it make sense to do this per route?
> 
Both global sysctl and per route would be useful additions.

-- 
Stephen Hemminger <shemminger@linux-foundation.org>


* Re: [PATCH 2.6.22-rc5] TCP: Make TCP_RTO_MAX a variable
  2007-06-25 13:09 [PATCH 2.6.22-rc5] TCP: Make TCP_RTO_MAX a variable OBATA Noboru
  2007-06-25 13:15 ` Patrick McHardy
  2007-06-25 16:07 ` Stephen Hemminger
@ 2007-06-25 22:18 ` Ian McDonald
  2007-06-25 22:28   ` Stephen Hemminger
                     ` (2 more replies)
  2007-06-28  1:00 ` YOSHIFUJI Hideaki / 吉藤英明
  3 siblings, 3 replies; 14+ messages in thread
From: Ian McDonald @ 2007-06-25 22:18 UTC (permalink / raw)
  To: OBATA Noboru; +Cc: David Miller, Stephen Hemminger, netdev

On 6/26/07, OBATA Noboru <noboru.obata.ar@hitachi.com> wrote:
> From: OBATA Noboru <noboru.obata.ar@hitachi.com>
>
> Make TCP_RTO_MAX a variable, and allow a user to change it via a
> new sysctl entry /proc/sys/net/ipv4/tcp_rto_max.  A user can
> then guarantee TCP retransmission to be more controllable, say,
> at least once per 10 seconds, by setting it to 10.  This is
> quite helpful on failover-capable network devices, such as an
> active-backup bonding device.  On such devices, it is desirable
> that TCP retransmits a packet shortly after the failover, which
> is what I would like to do with this patch.  Please see
> Background and Problem below for rationale in detail.
>
RFC2988 says this:
   (2.4) Whenever RTO is computed, if it is less than 1 second then the
         RTO SHOULD be rounded up to 1 second.

         Traditionally, TCP implementations use coarse grain clocks to
         measure the RTT and trigger the RTO, which imposes a large
         minimum value on the RTO.  Research suggests that a large
         minimum RTO is needed to keep TCP conservative and avoid
         spurious retransmissions [AP99].  Therefore, this
         specification requires a large minimum RTO as a conservative
         approach, while at the same time acknowledging that at some
         future point, research may show that a smaller minimum RTO is
         acceptable or superior.

   (2.5) A maximum value MAY be placed on RTO provided it is at least 60
         seconds.

Your code doesn't seem to meet requirements of section 2.5 as your
minimum is 1 second.

I think if you're trying to solve the bonding issue then you should
solve that issue, not hack the TCP implementation as that opens it up
to abuse in other ways.

Ian
-- 
Web: http://wand.net.nz/~iam4/
Blog: http://iansblog.jandi.co.nz
WAND Network Research Group


* Re: [PATCH 2.6.22-rc5] TCP: Make TCP_RTO_MAX a variable
  2007-06-25 22:18 ` Ian McDonald
@ 2007-06-25 22:28   ` Stephen Hemminger
  2007-06-25 22:29   ` Rick Jones
  2007-07-12  6:56   ` OBATA Noboru
  2 siblings, 0 replies; 14+ messages in thread
From: Stephen Hemminger @ 2007-06-25 22:28 UTC (permalink / raw)
  To: Ian McDonald; +Cc: OBATA Noboru, David Miller, netdev

On Tue, 26 Jun 2007 10:18:46 +1200
"Ian McDonald" <ian.mcdonald@jandi.co.nz> wrote:

> On 6/26/07, OBATA Noboru <noboru.obata.ar@hitachi.com> wrote:
> > From: OBATA Noboru <noboru.obata.ar@hitachi.com>
> >
> > Make TCP_RTO_MAX a variable, and allow a user to change it via a
> > new sysctl entry /proc/sys/net/ipv4/tcp_rto_max.  A user can
> > then guarantee TCP retransmission to be more controllable, say,
> > at least once per 10 seconds, by setting it to 10.  This is
> > quite helpful on failover-capable network devices, such as an
> > active-backup bonding device.  On such devices, it is desirable
> > that TCP retransmits a packet shortly after the failover, which
> > is what I would like to do with this patch.  Please see
> > Background and Problem below for rationale in detail.
> >
> RFC2988 says this:
>    (2.4) Whenever RTO is computed, if it is less than 1 second then the
>          RTO SHOULD be rounded up to 1 second.
> 
>          Traditionally, TCP implementations use coarse grain clocks to
>          measure the RTT and trigger the RTO, which imposes a large
>          minimum value on the RTO.  Research suggests that a large
>          minimum RTO is needed to keep TCP conservative and avoid
>          spurious retransmissions [AP99].  Therefore, this
>          specification requires a large minimum RTO as a conservative
>          approach, while at the same time acknowledging that at some
>          future point, research may show that a smaller minimum RTO is
>          acceptable or superior.
> 
>    (2.5) A maximum value MAY be placed on RTO provided it is at least 60
>          seconds.
> 
> Your code doesn't seem to meet requirements of section 2.5 as your
> minimum is 1 second.
> 
> I think if you're trying to solve the bonding issue then you should
> solve that issue, not hack the TCP implementation as that opens it up
> to abuse in other ways.
> 
> Ian

Another alternative is to provide a way to force all connections to retransmit
"right away" by adding a notifier mechanism.

-- 
Stephen Hemminger <shemminger@linux-foundation.org>


* Re: [PATCH 2.6.22-rc5] TCP: Make TCP_RTO_MAX a variable
  2007-06-25 22:18 ` Ian McDonald
  2007-06-25 22:28   ` Stephen Hemminger
@ 2007-06-25 22:29   ` Rick Jones
  2007-07-12  6:53     ` OBATA Noboru
  2007-07-12  6:56   ` OBATA Noboru
  2 siblings, 1 reply; 14+ messages in thread
From: Rick Jones @ 2007-06-25 22:29 UTC (permalink / raw)
  To: Ian McDonald; +Cc: OBATA Noboru, David Miller, Stephen Hemminger, netdev

Ian McDonald wrote:
> On 6/26/07, OBATA Noboru <noboru.obata.ar@hitachi.com> wrote:
> 
>> From: OBATA Noboru <noboru.obata.ar@hitachi.com>
>>
>> Make TCP_RTO_MAX a variable, and allow a user to change it via a
>> new sysctl entry /proc/sys/net/ipv4/tcp_rto_max.  A user can
>> then guarantee TCP retransmission to be more controllable, say,
>> at least once per 10 seconds, by setting it to 10.  This is
>> quite helpful on failover-capable network devices, such as an
>> active-backup bonding device.  On such devices, it is desirable
>> that TCP retransmits a packet shortly after the failover, which
>> is what I would like to do with this patch.  Please see
>> Background and Problem below for rationale in detail.
>>
> RFC2988 says this:
>   (2.4) Whenever RTO is computed, if it is less than 1 second then the
>         RTO SHOULD be rounded up to 1 second.
> 
>         Traditionally, TCP implementations use coarse grain clocks to
>         measure the RTT and trigger the RTO, which imposes a large
>         minimum value on the RTO.  Research suggests that a large
>         minimum RTO is needed to keep TCP conservative and avoid
>         spurious retransmissions [AP99].  Therefore, this
>         specification requires a large minimum RTO as a conservative
>         approach, while at the same time acknowledging that at some
>         future point, research may show that a smaller minimum RTO is
>         acceptable or superior.
> 
>   (2.5) A maximum value MAY be placed on RTO provided it is at least 60
>         seconds.
> 
> Your code doesn't seem to meet requirements of section 2.5 as your
> minimum is 1 second.

(At the risk of having another Emily Litella moment entering a 
discussion late...)

I thought that those sorts of things were generally referring to the 
_default_ setting?

> I think if you're trying to solve the bonding issue then you should
> solve that issue, not hack the TCP implementation as that opens it up
> to abuse in other ways.

FWIW, other stacks have a "tcp_rexmit_interval_max" without too much 
trouble:

$ ndd -h tcp_rexmit_interval_max

tcp_rexmit_interval_max:

     Upper limit for computed round trip time-out. [1,7200000]
     Default: 60000 (1 minute)

[Interesting to me that the default happens to be the aforementioned 60 
seconds :) ]

In the abstract, if we wanted a quick recovery in TCP from a link 
failover, I suppose it could be possible for a machine-local link 
failover if the link-failover code could then call back up into TCP to 
say "Yo, TCP, any connections you had going over this link/path/route 
should probably go ahead and try retransmitting now rather than later."

Of course, that does seem rather more complicated than having the 
administrator set an upper bound on the RTO, and wouldn't deal with 
non-machine-local link failover.
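
To make the callback idea a bit more concrete, here is a toy userspace
model.  Nothing in it exists in the kernel: struct model_conn and
link_failover_notify() are invented for the illustration, and a real
implementation would have to find the affected sockets and deal with
locking and timers.

```c
#include <stddef.h>

/* Toy stand-in for a TCP connection; all fields are hypothetical. */
struct model_conn {
	int link_id;			/* link the connection currently uses */
	unsigned rto_ms;		/* current (backed-off) RTO */
	unsigned retrans_due_ms;	/* time left until the next retransmission */
};

/*
 * Hypothetical callback invoked by the link-failover code: every
 * connection using the failed-over link is told to retransmit now
 * instead of waiting for its backed-off timer.  Returns the number
 * of connections that were kicked.
 */
static size_t link_failover_notify(struct model_conn *conns, size_t n,
				   int link_id)
{
	size_t kicked = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		if (conns[i].link_id != link_id)
			continue;	/* unaffected by this failover */
		conns[i].retrans_due_ms = 0;
		kicked++;
	}
	return kicked;
}
```

Connections on other links keep their timers, which is the part a
global upper bound on the RTO cannot express.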

rick jones


* Re: [MaybeSpam] Re: [PATCH 2.6.22-rc5] TCP: Make TCP_RTO_MAX a variable
  2007-06-25 13:15 ` Patrick McHardy
  2007-06-25 14:45   ` Siim Põder
  2007-06-25 16:08   ` Stephen Hemminger
@ 2007-06-27 21:57   ` noboru.obata.ar
  2 siblings, 0 replies; 14+ messages in thread
From: noboru.obata.ar @ 2007-06-27 21:57 UTC (permalink / raw)
  To: kaber; +Cc: davem, shemminger, netdev, noboru.obata.ar

Patrick McHardy wrote:
> OBATA Noboru wrote:
> > From: OBATA Noboru <noboru.obata.ar@hitachi.com>
> > 
> > Make TCP_RTO_MAX a variable, and allow a user to change it via a
> > new sysctl entry /proc/sys/net/ipv4/tcp_rto_max.  A user can
> > then guarantee TCP retransmission to be more controllable, say,
> > at least once per 10 seconds, by setting it to 10.  This is
> > quite helpful on failover-capable network devices, such as an
> > active-backup bonding device.  On such devices, it is desirable
> > that TCP retransmits a packet shortly after the failover, which
> > is what I would like to do with this patch.  Please see
> > Background and Problem below for rationale in detail.
> 
> 
> Would it make sense to do this per route?

Well, for a certain case, maybe yes.

For example,

(1) You have both a fast route (link) and a slow route,
(2) You want to use a short RTO for the fast route and not for
    the slow route, and
(3) Routes are static, as mentioned by Siim.

In such a case, having only a global tcp_rto_max, set to a very
small value, may overload the slow link with retransmission
packets.  Then, as Stephen mentioned, people will find it useful
to have a per-route tcp_rto_max.

But let me give you some numbers.

The number of retransmission packets sent in the first 60[s],
starting from RTO = 0.2[s] and doubling on each timeout, is:
	10 with tcp_rto_max = 10[s], versus
	8 with tcp_rto_max = 120[s] (original).

Only 2 extra packets per minute per socket should be acceptable
in most cases.

So if you choose a moderate tcp_rto_max, you may not necessarily
need a per-route tcp_rto_max.
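
The counts above can be reproduced with a small user-space
sketch.  It assumes the simplest possible model: the RTO starts
at 0.2[s], doubles after every timeout, and is capped at
tcp_rto_max.  The function name count_retransmits and the use of
milliseconds are assumptions made for illustration; this is not
kernel code.

```c
#include <assert.h>

/*
 * Count the retransmissions sent within window_ms, assuming the RTO
 * starts at rto_init_ms and doubles after every timeout until it
 * reaches rto_max_ms.  All times are in milliseconds.
 * (Hypothetical helper, user-space model only.)
 */
unsigned int count_retransmits(unsigned int rto_init_ms,
                               unsigned int rto_max_ms,
                               unsigned int window_ms)
{
	unsigned int rto = rto_init_ms;
	unsigned int now = 0;
	unsigned int count = 0;

	assert(rto_init_ms > 0 && rto_init_ms <= rto_max_ms);

	for (;;) {
		now += rto;		/* the retransmission timer fires */
		if (now > window_ms)
			break;
		count++;
		rto *= 2;		/* exponential backoff */
		if (rto > rto_max_ms)
			rto = rto_max_ms;	/* capped by tcp_rto_max */
	}
	return count;
}
```

Calling count_retransmits(200, 10000, 60000) models
tcp_rto_max = 10[s], and count_retransmits(200, 120000, 60000)
models the original 120[s] cap.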

Regards,

-- 
OBATA Noboru (noboru.obata.ar@hitachi.com)



* Re: [PATCH 2.6.22-rc5] TCP: Make TCP_RTO_MAX a variable
  2007-06-25 13:09 [PATCH 2.6.22-rc5] TCP: Make TCP_RTO_MAX a variable OBATA Noboru
                   ` (2 preceding siblings ...)
  2007-06-25 22:18 ` Ian McDonald
@ 2007-06-28  1:00 ` YOSHIFUJI Hideaki / 吉藤英明
  3 siblings, 0 replies; 14+ messages in thread
From: YOSHIFUJI Hideaki / 吉藤英明 @ 2007-06-28  1:00 UTC (permalink / raw)
  To: noboru.obata.ar; +Cc: davem, shemminger, netdev, yoshfuji

In article <20070625.220939.132853560.noboru.obata.ar@hitachi.com> (at Mon, 25 Jun 2007 22:09:39 +0900 (JST)), OBATA Noboru <noboru.obata.ar@hitachi.com> says:

> Please note that this is effective in IPv6 as well.

Of course, I'm happy with this.

--yoshfuji


* Re: [PATCH 2.6.22-rc5] TCP: Make TCP_RTO_MAX a variable
  2007-06-25 16:07 ` Stephen Hemminger
@ 2007-07-12  6:45   ` OBATA Noboru
  0 siblings, 0 replies; 14+ messages in thread
From: OBATA Noboru @ 2007-07-12  6:45 UTC (permalink / raw)
  To: shemminger; +Cc: davem, yoshfuji, netdev

Hi, Stephen.

Thank you for your comments.  I will fix them and re-send the
patch for 2.6.22.

From: Stephen Hemminger <shemminger@linux-foundation.org>
Subject: Re: [PATCH 2.6.22-rc5] TCP: Make TCP_RTO_MAX a variable
Date: Mon, 25 Jun 2007 09:07:48 -0700

> > diff -uprN -X a/Documentation/dontdiff linux-2.6.22-rc5-orig/include/linux/sysctl.h b/include/linux/sysctl.h
> > --- a/include/linux/sysctl.h	2007-06-22 21:34:33.000000000 +0900
> > +++ b/include/linux/sysctl.h	2007-06-25 16:27:29.000000000 +0900
> > @@ -441,6 +441,7 @@ enum
> >  	NET_TCP_ALLOWED_CONG_CONTROL=123,
> >  	NET_TCP_MAX_SSTHRESH=124,
> >  	NET_TCP_FRTO_RESPONSE=125,
> > +	NET_TCP_RTO_MAX=126,
> >  };
> >  
> 
> Rather than assigning another numeric sysctl value, you can use
> CTL_UNNUMBERED.  The use of numeric sysctl's is being phased down, at one
> point they were even going to be deprecated.

Understood.


> > diff -uprN -X a/Documentation/dontdiff linux-2.6.22-rc5-orig/include/net/tcp.h b/include/net/tcp.h
> > --- a/include/net/tcp.h	2007-06-22 21:34:33.000000000 +0900
> > +++ b/include/net/tcp.h	2007-06-22 21:40:05.000000000 +0900
> > @@ -121,7 +121,9 @@ extern void tcp_time_wait(struct sock *s
> >  #define TCP_DELACK_MIN	4U
> >  #define TCP_ATO_MIN	4U
> >  #endif
> > -#define TCP_RTO_MAX	((unsigned)(120*HZ))
> > +extern int sysctl_tcp_rto_max;
> > +#define TCP_RTO_MAX	((unsigned)(sysctl_tcp_rto_max))
> > +#define TCP_RTO_MAX_DEFAULT	((unsigned)(120*HZ))
> >  #define TCP_RTO_MIN	((unsigned)(HZ/5))
> >  #define TCP_TIMEOUT_INIT ((unsigned)(3*HZ))	/* RFC 1122 initial RTO value	*/
> 
> Rather than causing macro TCP_RTO_MAX to reference sysctl_rto_max directly.

Okay.  I will replace all occurrences of TCP_RTO_MAX with
sysctl_tcp_rto_max.


> > @@ -203,6 +205,7 @@ extern int sysctl_tcp_synack_retries;
> >  extern int sysctl_tcp_retries1;
> >  extern int sysctl_tcp_retries2;
> >  extern int sysctl_tcp_orphan_retries;
> > +extern int sysctl_tcp_rto_max;
> >  extern int sysctl_tcp_syncookies;
> >  extern int sysctl_tcp_retrans_collapse;
> >  extern int sysctl_tcp_stdurg;

> Could sysctl_rto_max be unsigned instead of int to avoid possible sign wrap issues and
> having to cast it on each use?

Yes.  As sysctl_tcp_rto_max is going to replace TCP_RTO_MAX,
which is unsigned, making sysctl_tcp_rto_max unsigned seems
reasonable.
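
To illustrate the sign-wrap concern, here is a minimal
user-space sketch.  It assumes the cap is applied as "double the
RTO, then take the smaller of that and the limit"; the helper
names backoff_capped and backoff_capped_signed are hypothetical
and exist only for this example.

```c
/*
 * Back off and cap using an unsigned limit.  No cast is needed and a
 * negative limit cannot be represented at all.
 * (Hypothetical helper, not taken from the patch.)
 */
unsigned int backoff_capped(unsigned int rto, unsigned int rto_max)
{
	unsigned int next = rto << 1;

	return next < rto_max ? next : rto_max;
}

/*
 * Same logic with a signed limit that is cast at the point of use.
 * A negative value such as -1 converts to UINT_MAX, so the cap
 * silently stops applying.
 */
unsigned int backoff_capped_signed(unsigned int rto, int rto_max)
{
	unsigned int next = rto << 1;
	unsigned int limit = (unsigned int)rto_max;

	return next < limit ? next : limit;
}
```

With the signed variant, a stray -1 in the variable leaves the
RTO effectively uncapped, which is the kind of surprise an
unsigned sysctl_tcp_rto_max avoids.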


> >  ctl_table ipv4_table[] = {
> >  	{
> >  		.ctl_name	= NET_IPV4_TCP_TIMESTAMPS,
> > @@ -363,6 +431,15 @@ ctl_table ipv4_table[] = {
> >  		.proc_handler	= &proc_dointvec
> >  	},
> >  	{
> > +		.ctl_name	= NET_TCP_RTO_MAX,
> > +		.procname	= "tcp_rto_max",
> > +		.data		= &sysctl_tcp_rto_max,
> > +		.maxlen		= sizeof(int),
> > +		.mode		= 0644,
> > +		.proc_handler	= &proc_tcp_rto_max,
> > +		.strategy	= &strategy_tcp_rto_max
> > +	},
> > +	{

I will remove .strategy and strategy_tcp_rto_max from my patch
because I'm not going to support the numeric sysctl.

Regards,

-- 
OBATA Noboru (noboru.obata.ar@hitachi.com)


* Re: [PATCH 2.6.22-rc5] TCP: Make TCP_RTO_MAX a variable
  2007-06-25 22:29   ` Rick Jones
@ 2007-07-12  6:53     ` OBATA Noboru
  2007-07-12  9:54       ` Ian McDonald
  0 siblings, 1 reply; 14+ messages in thread
From: OBATA Noboru @ 2007-07-12  6:53 UTC (permalink / raw)
  To: rick.jones2; +Cc: ian.mcdonald, davem, shemminger, yoshfuji, netdev

From: Rick Jones <rick.jones2@hp.com>
Subject: Re: [PATCH 2.6.22-rc5] TCP: Make TCP_RTO_MAX a variable
Date: Mon, 25 Jun 2007 15:29:26 -0700

> Ian McDonald wrote:
> > On 6/26/07, OBATA Noboru <noboru.obata.ar@hitachi.com> wrote:
> > 
> >> From: OBATA Noboru <noboru.obata.ar@hitachi.com>
> >>
> >> Make TCP_RTO_MAX a variable, and allow a user to change it via a
> >> new sysctl entry /proc/sys/net/ipv4/tcp_rto_max.  A user can
> >> then guarantee TCP retransmission to be more controllable, say,
> >> at least once per 10 seconds, by setting it to 10.  This is
> >> quite helpful on failover-capable network devices, such as an
> >> active-backup bonding device.  On such devices, it is desirable
> >> that TCP retransmits a packet shortly after the failover, which
> >> is what I would like to do with this patch.  Please see
> >> Background and Problem below for rationale in detail.
> >>
> > RFC2988 says this:
> >   (2.4) Whenever RTO is computed, if it is less than 1 second then the
> >         RTO SHOULD be rounded up to 1 second.
> > 
> >         Traditionally, TCP implementations use coarse grain clocks to
> >         measure the RTT and trigger the RTO, which imposes a large
> >         minimum value on the RTO.  Research suggests that a large
> >         minimum RTO is needed to keep TCP conservative and avoid
> >         spurious retransmissions [AP99].  Therefore, this
> >         specification requires a large minimum RTO as a conservative
> >         approach, while at the same time acknowledging that at some
> >         future point, research may show that a smaller minimum RTO is
> >         acceptable or superior.
> > 
> >   (2.5) A maximum value MAY be placed on RTO provided it is at least 60
> >         seconds.
> > 
> > Your code doesn't seem to meet requirements of section 2.5 as your
> > minimum is 1 second.
> 
> (At the risk of having another Emily Litella moment entering a 
> discussion late...)
> 
> I thought that those sorts of things were generally referring to the 
> _default_ setting?

I believe so.  And the requirement of section 2.5 is rather weak
(it says "MAY").

Any comments from others?

-- 
OBATA Noboru (noboru.obata.ar@hitachi.com)


* Re: [PATCH 2.6.22-rc5] TCP: Make TCP_RTO_MAX a variable
  2007-06-25 22:18 ` Ian McDonald
  2007-06-25 22:28   ` Stephen Hemminger
  2007-06-25 22:29   ` Rick Jones
@ 2007-07-12  6:56   ` OBATA Noboru
  2 siblings, 0 replies; 14+ messages in thread
From: OBATA Noboru @ 2007-07-12  6:56 UTC (permalink / raw)
  To: ian.mcdonald; +Cc: davem, shemminger, yoshfuji, netdev

From: "Ian McDonald" <ian.mcdonald@jandi.co.nz>
Subject: [MaybeSpam] Re: [PATCH 2.6.22-rc5] TCP: Make TCP_RTO_MAX a variable
Date: Tue, 26 Jun 2007 10:18:46 +1200

> On 6/26/07, OBATA Noboru <noboru.obata.ar@hitachi.com> wrote:
> > From: OBATA Noboru <noboru.obata.ar@hitachi.com>
> >
> > Make TCP_RTO_MAX a variable, and allow a user to change it via a
> > new sysctl entry /proc/sys/net/ipv4/tcp_rto_max.  A user can
> > then guarantee TCP retransmission to be more controllable, say,
> > at least once per 10 seconds, by setting it to 10.  This is
> > quite helpful on failover-capable network devices, such as an
> > active-backup bonding device.  On such devices, it is desirable
> > that TCP retransmits a packet shortly after the failover, which
> > is what I would like to do with this patch.  Please see
> > Background and Problem below for rationale in detail.
> >
> RFC2988 says this:
>    (2.4) Whenever RTO is computed, if it is less than 1 second then the
>          RTO SHOULD be rounded up to 1 second.
> 
>          Traditionally, TCP implementations use coarse grain clocks to
>          measure the RTT and trigger the RTO, which imposes a large
>          minimum value on the RTO.  Research suggests that a large
>          minimum RTO is needed to keep TCP conservative and avoid
>          spurious retransmissions [AP99].  Therefore, this
>          specification requires a large minimum RTO as a conservative
>          approach, while at the same time acknowledging that at some
>          future point, research may show that a smaller minimum RTO is
>          acceptable or superior.
> 
>    (2.5) A maximum value MAY be placed on RTO provided it is at least 60
>          seconds.
> 
> Your code doesn't seem to meet requirements of section 2.5 as your
> minimum is 1 second.
> 
> I think if you're trying to solve the bonding issue then you should
> solve that issue, not hack the TCP implementation as that opens it up
> to abuse in other ways.

I think this is rather a new problem, or requirement, in the
combined case "TCP on a failover-capable network device," and
not easily solved only by bonding.

A notification mechanism from bonding to TCP has been suggested,
but I think it is really hard to do in a virtualized environment
like Xen.  The hypervisor (Dom-0) takes care of physical devices,
including bonding, and the guests (Dom-U) handle TCP.  Notifying
TCP in Dom-U from bonding in Dom-0 is really a challenge.

My problem (TCP retransmission may not be done in the expected
time frame, e.g., 10 seconds after a bonding failover) still
occurs in such an environment, and my code (capping TCP_RTO_MAX)
still works in a VM environment.

So solving this in the TCP layer makes sense to me.

Regards,

-- 
OBATA Noboru (noboru.obata.ar@hitachi.com)


* Re: [PATCH 2.6.22-rc5] TCP: Make TCP_RTO_MAX a variable
  2007-07-12  6:53     ` OBATA Noboru
@ 2007-07-12  9:54       ` Ian McDonald
  0 siblings, 0 replies; 14+ messages in thread
From: Ian McDonald @ 2007-07-12  9:54 UTC (permalink / raw)
  To: OBATA Noboru; +Cc: rick.jones2, davem, shemminger, yoshfuji, netdev

On 7/12/07, OBATA Noboru <noboru.obata.ar@hitachi.com> wrote:
> > Ian McDonald wrote:
> > > On 6/26/07, OBATA Noboru <noboru.obata.ar@hitachi.com> wrote:
> > >
> > >> From: OBATA Noboru <noboru.obata.ar@hitachi.com>
> > >>
> > >> Make TCP_RTO_MAX a variable, and allow a user to change it via a
> > >> new sysctl entry /proc/sys/net/ipv4/tcp_rto_max.  A user can
> > >> then guarantee TCP retransmission to be more controllable, say,
> > >> at least once per 10 seconds, by setting it to 10.  This is
> > >> quite helpful on failover-capable network devices, such as an
> > >> active-backup bonding device.  On such devices, it is desirable
> > >> that TCP retransmits a packet shortly after the failover, which
> > >> is what I would like to do with this patch.  Please see
> > >> Background and Problem below for rationale in detail.
> > >>
> > > RFC2988 says this:
> > >   (2.4) Whenever RTO is computed, if it is less than 1 second then the
> > >         RTO SHOULD be rounded up to 1 second.
> > >
> > >         Traditionally, TCP implementations use coarse grain clocks to
> > >         measure the RTT and trigger the RTO, which imposes a large
> > >         minimum value on the RTO.  Research suggests that a large
> > >         minimum RTO is needed to keep TCP conservative and avoid
> > >         spurious retransmissions [AP99].  Therefore, this
> > >         specification requires a large minimum RTO as a conservative
> > >         approach, while at the same time acknowledging that at some
> > >         future point, research may show that a smaller minimum RTO is
> > >         acceptable or superior.
> > >
> > >   (2.5) A maximum value MAY be placed on RTO provided it is at least 60
> > >         seconds.
> > >
> > > Your code doesn't seem to meet requirements of section 2.5 as your
> > > minimum is 1 second.
> >
> > (At the risk of having another Emily Litella moment entering a
> > discussion late...)
> >
> > I thought that those sorts of things were generally referring to the
> > _default_ setting?
>
> I believe so.  And the requirement of section 2.5 is rather weak
> (it says "MAY").
>
It is weak in saying you don't have to have a maximum, but if you do
have one IT IS AT LEAST 60 seconds (emphasis mine). So the time period
is a strong requirement if you decide to implement - which is a weak
requirement.
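
As a rough sketch of that reading, the two RFC 2988 bounds could
be expressed as below.  The macro and function names are
hypothetical, times are in milliseconds, and this only
illustrates the interpretation above; it is not what the patch
does.

```c
#include <stdbool.h>

#define RFC2988_RTO_FLOOR_MS		1000U	/* (2.4): round RTO up to 1 second */
#define RFC2988_RTO_MAX_FLOOR_MS	60000U	/* (2.5): a maximum must be >= 60 seconds */

/* True if a configured maximum RTO satisfies the (2.5) condition. */
bool rto_max_is_rfc2988_compliant(unsigned int rto_max_ms)
{
	return rto_max_ms >= RFC2988_RTO_MAX_FLOOR_MS;
}

/* Apply the (2.4) floor and then an (optional, per 2.5) maximum. */
unsigned int rfc2988_bound_rto(unsigned int rto_ms, unsigned int rto_max_ms)
{
	if (rto_ms < RFC2988_RTO_FLOOR_MS)
		rto_ms = RFC2988_RTO_FLOOR_MS;
	if (rto_ms > rto_max_ms)
		rto_ms = rto_max_ms;
	return rto_ms;
}
```

Under this reading a 10 second maximum fails the check, while
the existing 120 second value passes.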

Ian
-- 
Web: http://wand.net.nz/~iam4/
Blog: http://iansblog.jandi.co.nz
WAND Network Research Group


Thread overview: 14+ messages
2007-06-25 13:09 [PATCH 2.6.22-rc5] TCP: Make TCP_RTO_MAX a variable OBATA Noboru
2007-06-25 13:15 ` Patrick McHardy
2007-06-25 14:45   ` Siim Põder
2007-06-25 16:08   ` Stephen Hemminger
2007-06-27 21:57   ` [MaybeSpam] " noboru.obata.ar
2007-06-25 16:07 ` Stephen Hemminger
2007-07-12  6:45   ` OBATA Noboru
2007-06-25 22:18 ` Ian McDonald
2007-06-25 22:28   ` Stephen Hemminger
2007-06-25 22:29   ` Rick Jones
2007-07-12  6:53     ` OBATA Noboru
2007-07-12  9:54       ` Ian McDonald
2007-07-12  6:56   ` OBATA Noboru
2007-06-28  1:00 ` YOSHIFUJI Hideaki / 吉藤英明
