* [PATCH 2.6.22] TCP: Make TCP_RTO_MAX a variable (take 2)
@ 2007-07-12 7:15 OBATA Noboru
2007-07-12 9:37 ` David Miller
0 siblings, 1 reply; 19+ messages in thread
From: OBATA Noboru @ 2007-07-12 7:15 UTC (permalink / raw)
To: davem; +Cc: shemminger, yoshfuji, netdev
Hi David,
Here is the patch (take 2) for making TCP_RTO_MAX a variable.
Stephen's suggestions on the first version have been incorporated.
Any comments are appreciated.
From: OBATA Noboru <noboru.obata.ar@hitachi.com>
Make TCP_RTO_MAX a variable, and allow a user to change it via a
new sysctl entry /proc/sys/net/ipv4/tcp_rto_max. A user can then
make TCP retransmission more controllable, guaranteeing, say, at
least one retransmission per 10 seconds by setting it to 10. This
is quite helpful on failover-capable network devices, such as an
active-backup bonding device. On such devices, it is desirable
that TCP retransmits a packet shortly after the failover, which
is what I would like to achieve with this patch. Please see
Background and Problem below for the detailed rationale.
Reading from /proc/sys/net/ipv4/tcp_rto_max shows the current
TCP_RTO_MAX in seconds. The actual value of TCP_RTO_MAX is
stored in sysctl_tcp_rto_max in jiffies.
Writing to /proc/sys/net/ipv4/tcp_rto_max updates TCP_RTO_MAX,
but only if the new value is not smaller than TCP_RTO_MIN, which
is currently 0.2sec. Since tcp_rto_max is an integer number of
seconds, the effective minimum of /proc/sys/net/ipv4/tcp_rto_max
is 1. The RtoMax entry in /proc/net/snmp is updated as well.
Please note that this is effective in IPv6 as well.
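As an illustration, the new limit can be set from user space by
writing to the proc entry. The following is a minimal sketch only
(it assumes this patch is applied and root privileges; the value
10 is just the 10-second example above):

#include <stdio.h>
#include <stdlib.h>

/* Sketch: cap TCP's retransmission timeout at 10 seconds by
 * writing the proc entry added by this patch. */
int main(void)
{
	FILE *f = fopen("/proc/sys/net/ipv4/tcp_rto_max", "w");

	if (!f) {
		perror("fopen");
		return EXIT_FAILURE;
	}
	if (fprintf(f, "10\n") < 0 || fclose(f) == EOF) {
		perror("write");
		return EXIT_FAILURE;
	}
	return EXIT_SUCCESS;
}

Reading the file back should then show 10, while the kernel keeps
the equivalent value in jiffies in sysctl_tcp_rto_max.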
Background and Problem
======================
When designing a TCP/IP-based network system on failover-capable
network devices, people want to set timeouts hierarchically in
three layers: the network device layer, the TCP layer, and the
application layer (bottom-up order), such that:
1. Network device layer detects a failure first and switches to a
backup device (say, in 20sec).
2. TCP layer timeout & retransmission comes next, _hopefully_
before the application layer timeout.
3. Application layer detects a network failure last (by, say,
30sec timeout) and may trigger a system-level failover.
* Note 1. The timeouts for #1 and #2 are handled
independently and there is no relationship between them.
* Note 2. The actual timeout settings (20sec or 30sec in
this example) are often determined by system requirements,
and so setting them to certain "safe values" (if any) is
usually not possible.
If TCP retransmission misses the time frame between events #1
and #3 in Background above (between 20 and 30sec after the
network failure), a failure causes a system-level failover where a
network-device-level failover should be enough.
The problem with this hierarchical timeout scheme is that the TCP
layer does not guarantee that the next retransmission occurs
within a certain period of time. In the above example, people
expect TCP to retransmit a packet between 20 and 30sec after the
network failure, but it may not happen.
Starting from RTO=0.5sec, for example, retransmissions will occur
at times 0.5, 1.5, 3.5, 7.5, 15.5, and 31.5, as indicated by 'o'
in the following diagram, and so miss the time frame between
times 20 and 30 (the short simulation after the diagram
reproduces this schedule).
time:        0         10        20        30sec
             |         |         |         |
App. layer   |---------+---------+---------X    ==> system failover
TCP layer    oo-o---o--+----o----+---------+o   <== expects retrans. b/w 20~30
Netdev layer |---------+---------X              ==> network failover
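The schedule above can be reproduced with a short simulation. This
is a sketch only: it assumes a fixed initial RTO of 0.5sec and pure
doubling, ignoring RTT re-estimation:

#include <stdio.h>

/* Sketch: print retransmission times under exponential backoff
 * starting from RTO = 0.5 sec, capped at rto_max, and flag the
 * attempts that land inside the 20-30 sec window of the example. */
int main(void)
{
	double rto = 0.5, t = 0.0, rto_max = 120.0;
	int i;

	for (i = 1; i <= 8; i++) {
		t += rto;
		printf("retransmit #%d at %6.1f sec%s\n", i, t,
		       (t >= 20.0 && t <= 30.0) ? "  <-- in window" : "");
		rto *= 2.0;
		if (rto > rto_max)
			rto = rto_max;
	}
	return 0;
}

With rto_max = 120.0 (the current default) no retransmission falls
between 20 and 30, whereas with rto_max = 10.0 the sixth attempt
lands at 25.5sec, inside the window.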
Signed-off-by: OBATA Noboru <noboru.obata.ar@hitachi.com>
---
Documentation/networking/ip-sysctl.txt | 6 ++++
include/net/tcp.h | 11 ++++----
net/ipv4/sysctl_net_ipv4.c | 32 +++++++++++++++++++++++++
net/ipv4/tcp_input.c | 14 +++++-----
net/ipv4/tcp_output.c | 14 +++++-----
net/ipv4/tcp_timer.c | 19 ++++++++------
6 files changed, 69 insertions(+), 27 deletions(-)
diff -uprN -X a/Documentation/dontdiff a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
--- a/Documentation/networking/ip-sysctl.txt 2007-07-07 14:36:14.000000000 +0900
+++ b/Documentation/networking/ip-sysctl.txt 2007-07-07 18:38:59.000000000 +0900
@@ -340,6 +340,12 @@ tcp_rmem - vector of 3 INTEGERs: min, de
net.core.rmem_max, "static" selection via SO_RCVBUF does not use this.
Default: 87380*2 bytes.
+tcp_rto_max - INTEGER
+ Maximum time in seconds to which RTO can grow. Exponential
+ backoff of RTO is bounded by this value. The value must not be
+ smaller than 1. Note this parameter is also effective for IPv6.
+ Default: 120
+
tcp_sack - BOOLEAN
Enable select acknowledgments (SACKS).
diff -uprN -X a/Documentation/dontdiff a/include/net/tcp.h b/include/net/tcp.h
--- a/include/net/tcp.h 2007-07-07 14:36:24.000000000 +0900
+++ b/include/net/tcp.h 2007-07-11 18:36:49.000000000 +0900
@@ -121,7 +121,7 @@ extern void tcp_time_wait(struct sock *s
#define TCP_DELACK_MIN 4U
#define TCP_ATO_MIN 4U
#endif
-#define TCP_RTO_MAX ((unsigned)(120*HZ))
+#define TCP_RTO_MAX_DEFAULT ((unsigned)(120*HZ))
#define TCP_RTO_MIN ((unsigned)(HZ/5))
#define TCP_TIMEOUT_INIT ((unsigned)(3*HZ)) /* RFC 1122 initial RTO value */
@@ -203,6 +203,7 @@ extern int sysctl_tcp_synack_retries;
extern int sysctl_tcp_retries1;
extern int sysctl_tcp_retries2;
extern int sysctl_tcp_orphan_retries;
+extern unsigned int sysctl_tcp_rto_max;
extern int sysctl_tcp_syncookies;
extern int sysctl_tcp_retrans_collapse;
extern int sysctl_tcp_stdurg;
@@ -608,7 +609,7 @@ static inline void tcp_packets_out_inc(s
tp->packets_out += tcp_skb_pcount(skb);
if (!orig)
inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
- inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
+ inet_csk(sk)->icsk_rto, sysctl_tcp_rto_max);
}
static inline void tcp_packets_out_dec(struct tcp_sock *tp,
@@ -793,7 +794,7 @@ static inline void tcp_check_probe_timer
if (!tp->packets_out && !icsk->icsk_pending)
inet_csk_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
- icsk->icsk_rto, TCP_RTO_MAX);
+ icsk->icsk_rto, sysctl_tcp_rto_max);
}
static inline void tcp_push_pending_frames(struct sock *sk)
@@ -880,7 +881,7 @@ static inline int tcp_prequeue(struct so
if (!inet_csk_ack_scheduled(sk))
inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
(3 * TCP_RTO_MIN) / 4,
- TCP_RTO_MAX);
+ sysctl_tcp_rto_max);
}
return 1;
}
@@ -1038,7 +1039,7 @@ static inline void tcp_mib_init(void)
/* See RFC 2012 */
TCP_ADD_STATS_USER(TCP_MIB_RTOALGORITHM, 1);
TCP_ADD_STATS_USER(TCP_MIB_RTOMIN, TCP_RTO_MIN*1000/HZ);
- TCP_ADD_STATS_USER(TCP_MIB_RTOMAX, TCP_RTO_MAX*1000/HZ);
+ TCP_ADD_STATS_USER(TCP_MIB_RTOMAX, sysctl_tcp_rto_max*1000/HZ);
TCP_ADD_STATS_USER(TCP_MIB_MAXCONN, -1);
}
diff -uprN -X a/Documentation/dontdiff a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
--- a/net/ipv4/sysctl_net_ipv4.c 2007-07-07 14:36:24.000000000 +0900
+++ b/net/ipv4/sysctl_net_ipv4.c 2007-07-11 19:55:02.000000000 +0900
@@ -186,6 +186,30 @@ static int strategy_allowed_congestion_c
}
+static int proc_tcp_rto_max(ctl_table *ctl, int write, struct file *filp,
+ void __user *buffer, size_t *lenp, loff_t *ppos)
+{
+ int *valp = ctl->data;
+ int oldval = *valp;
+ int ret;
+
+ /* Using dointvec conversion for an unsigned variable. */
+ ret = proc_dointvec_jiffies(ctl, write, filp, buffer, lenp, ppos);
+ if (ret)
+ return ret;
+
+ if (write && *valp != oldval) {
+ if (*valp < (int)TCP_RTO_MIN) {
+ *valp = oldval;
+ return -EINVAL;
+ }
+ TCP_ADD_STATS_USER(TCP_MIB_RTOMAX,
+ (*valp - oldval) * 1000 / HZ);
+ }
+
+ return 0;
+}
+
ctl_table ipv4_table[] = {
{
.ctl_name = NET_IPV4_TCP_TIMESTAMPS,
@@ -363,6 +387,14 @@ ctl_table ipv4_table[] = {
.proc_handler = &proc_dointvec
},
{
+ .ctl_name = CTL_UNNUMBERED,
+ .procname = "tcp_rto_max",
+ .data = &sysctl_tcp_rto_max,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = &proc_tcp_rto_max
+ },
+ {
.ctl_name = NET_IPV4_TCP_FIN_TIMEOUT,
.procname = "tcp_fin_timeout",
.data = &sysctl_tcp_fin_timeout,
diff -uprN -X a/Documentation/dontdiff a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
--- a/net/ipv4/tcp_input.c 2007-07-07 14:36:24.000000000 +0900
+++ b/net/ipv4/tcp_input.c 2007-07-07 18:39:00.000000000 +0900
@@ -654,8 +654,8 @@ static inline void tcp_set_rto(struct so
*/
static inline void tcp_bound_rto(struct sock *sk)
{
- if (inet_csk(sk)->icsk_rto > TCP_RTO_MAX)
- inet_csk(sk)->icsk_rto = TCP_RTO_MAX;
+ if (inet_csk(sk)->icsk_rto > sysctl_tcp_rto_max)
+ inet_csk(sk)->icsk_rto = sysctl_tcp_rto_max;
}
/* Save metrics learned by this TCP session.
@@ -1527,7 +1527,7 @@ static int tcp_check_sack_reneging(struc
icsk->icsk_retransmits++;
tcp_retransmit_skb(sk, tcp_write_queue_head(sk));
inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
- icsk->icsk_rto, TCP_RTO_MAX);
+ icsk->icsk_rto, sysctl_tcp_rto_max);
return 1;
}
return 0;
@@ -2340,7 +2340,7 @@ static void tcp_ack_packets_out(struct s
if (!tp->packets_out) {
inet_csk_clear_xmit_timer(sk, ICSK_TIME_RETRANS);
} else {
- inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
+ inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, inet_csk(sk)->icsk_rto, sysctl_tcp_rto_max);
}
}
@@ -2539,8 +2539,8 @@ static void tcp_ack_probe(struct sock *s
*/
} else {
inet_csk_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
- min(icsk->icsk_rto << icsk->icsk_backoff, TCP_RTO_MAX),
- TCP_RTO_MAX);
+ min(icsk->icsk_rto << icsk->icsk_backoff, sysctl_tcp_rto_max),
+ sysctl_tcp_rto_max);
}
}
@@ -4552,7 +4552,7 @@ static int tcp_rcv_synsent_state_process
tcp_incr_quickack(sk);
tcp_enter_quickack_mode(sk);
inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
- TCP_DELACK_MAX, TCP_RTO_MAX);
+ TCP_DELACK_MAX, sysctl_tcp_rto_max);
discard:
__kfree_skb(skb);
diff -uprN -X a/Documentation/dontdiff a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
--- a/net/ipv4/tcp_output.c 2007-07-07 14:36:24.000000000 +0900
+++ b/net/ipv4/tcp_output.c 2007-07-11 18:39:53.000000000 +0900
@@ -1913,7 +1913,7 @@ void tcp_xmit_retransmit_queue(struct so
if (skb == tcp_write_queue_head(sk))
inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
inet_csk(sk)->icsk_rto,
- TCP_RTO_MAX);
+ sysctl_tcp_rto_max);
}
packet_cnt += tcp_skb_pcount(skb);
@@ -1981,7 +1981,7 @@ void tcp_xmit_retransmit_queue(struct so
if (skb == tcp_write_queue_head(sk))
inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
inet_csk(sk)->icsk_rto,
- TCP_RTO_MAX);
+ sysctl_tcp_rto_max);
NET_INC_STATS_BH(LINUX_MIB_TCPFORWARDRETRANS);
}
@@ -2305,7 +2305,7 @@ int tcp_connect(struct sock *sk)
/* Timer for repeating the SYN until an answer. */
inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
- inet_csk(sk)->icsk_rto, TCP_RTO_MAX);
+ inet_csk(sk)->icsk_rto, sysctl_tcp_rto_max);
return 0;
}
@@ -2380,7 +2380,7 @@ void tcp_send_ack(struct sock *sk)
inet_csk_schedule_ack(sk);
inet_csk(sk)->icsk_ack.ato = TCP_ATO_MIN;
inet_csk_reset_xmit_timer(sk, ICSK_TIME_DACK,
- TCP_DELACK_MAX, TCP_RTO_MAX);
+ TCP_DELACK_MAX, sysctl_tcp_rto_max);
return;
}
@@ -2508,8 +2508,8 @@ void tcp_send_probe0(struct sock *sk)
icsk->icsk_backoff++;
icsk->icsk_probes_out++;
inet_csk_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
- min(icsk->icsk_rto << icsk->icsk_backoff, TCP_RTO_MAX),
- TCP_RTO_MAX);
+ min(icsk->icsk_rto << icsk->icsk_backoff, sysctl_tcp_rto_max),
+ sysctl_tcp_rto_max);
} else {
/* If packet was not sent due to local congestion,
* do not backoff and do not remember icsk_probes_out.
@@ -2522,7 +2522,7 @@ void tcp_send_probe0(struct sock *sk)
inet_csk_reset_xmit_timer(sk, ICSK_TIME_PROBE0,
min(icsk->icsk_rto << icsk->icsk_backoff,
TCP_RESOURCE_PROBE_INTERVAL),
- TCP_RTO_MAX);
+ sysctl_tcp_rto_max);
}
}
diff -uprN -X a/Documentation/dontdiff a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
--- a/net/ipv4/tcp_timer.c 2007-07-07 14:36:24.000000000 +0900
+++ b/net/ipv4/tcp_timer.c 2007-07-11 18:46:12.000000000 +0900
@@ -31,6 +31,9 @@ int sysctl_tcp_keepalive_intvl __read_mo
int sysctl_tcp_retries1 __read_mostly = TCP_RETR1;
int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
int sysctl_tcp_orphan_retries __read_mostly;
+unsigned int sysctl_tcp_rto_max __read_mostly = TCP_RTO_MAX_DEFAULT;
+
+EXPORT_SYMBOL(sysctl_tcp_rto_max);
static void tcp_write_timer(unsigned long);
static void tcp_delack_timer(unsigned long);
@@ -71,7 +74,7 @@ static int tcp_out_of_resources(struct s
/* If peer does not open window for long time, or did not transmit
* anything for long time, penalize it. */
- if ((s32)(tcp_time_stamp - tp->lsndtime) > 2*TCP_RTO_MAX || !do_reset)
+ if ((s32)(tcp_time_stamp - tp->lsndtime) > 2*sysctl_tcp_rto_max || !do_reset)
orphans <<= 1;
/* If some dubious ICMP arrived, penalize even more. */
@@ -147,7 +150,7 @@ static int tcp_write_timeout(struct sock
retry_until = sysctl_tcp_retries2;
if (sock_flag(sk, SOCK_DEAD)) {
- const int alive = (icsk->icsk_rto < TCP_RTO_MAX);
+ const int alive = (icsk->icsk_rto < sysctl_tcp_rto_max);
retry_until = tcp_orphan_retries(sk, alive);
@@ -254,7 +257,7 @@ static void tcp_probe_timer(struct sock
max_probes = sysctl_tcp_retries2;
if (sock_flag(sk, SOCK_DEAD)) {
- const int alive = ((icsk->icsk_rto << icsk->icsk_backoff) < TCP_RTO_MAX);
+ const int alive = ((icsk->icsk_rto << icsk->icsk_backoff) < sysctl_tcp_rto_max);
max_probes = tcp_orphan_retries(sk, alive);
@@ -299,7 +302,7 @@ static void tcp_retransmit_timer(struct
inet->num, tp->snd_una, tp->snd_nxt);
}
#endif
- if (tcp_time_stamp - tp->rcv_tstamp > TCP_RTO_MAX) {
+ if (tcp_time_stamp - tp->rcv_tstamp > sysctl_tcp_rto_max) {
tcp_write_err(sk);
goto out;
}
@@ -347,7 +350,7 @@ static void tcp_retransmit_timer(struct
icsk->icsk_retransmits = 1;
inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
min(icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL),
- TCP_RTO_MAX);
+ sysctl_tcp_rto_max);
goto out;
}
@@ -370,8 +373,8 @@ static void tcp_retransmit_timer(struct
icsk->icsk_retransmits++;
out_reset_timer:
- icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
- inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
+ icsk->icsk_rto = min(icsk->icsk_rto << 1, sysctl_tcp_rto_max);
+ inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, sysctl_tcp_rto_max);
if (icsk->icsk_retransmits > sysctl_tcp_retries1)
__sk_dst_reset(sk);
@@ -426,7 +429,7 @@ out_unlock:
static void tcp_synack_timer(struct sock *sk)
{
inet_csk_reqsk_queue_prune(sk, TCP_SYNQ_INTERVAL,
- TCP_TIMEOUT_INIT, TCP_RTO_MAX);
+ TCP_TIMEOUT_INIT, sysctl_tcp_rto_max);
}
void tcp_set_keepalive(struct sock *sk, int val)
* Re: [PATCH 2.6.22] TCP: Make TCP_RTO_MAX a variable (take 2)
2007-07-12 7:15 [PATCH 2.6.22] TCP: Make TCP_RTO_MAX a variable (take 2) OBATA Noboru
@ 2007-07-12 9:37 ` David Miller
2007-07-12 13:59 ` OBATA Noboru
2007-07-12 20:51 ` Rick Jones
0 siblings, 2 replies; 19+ messages in thread
From: David Miller @ 2007-07-12 9:37 UTC (permalink / raw)
To: noboru.obata.ar; +Cc: shemminger, yoshfuji, netdev
From: OBATA Noboru <noboru.obata.ar@hitachi.com>
Date: Thu, 12 Jul 2007 16:15:10 +0900 (JST)
> 1. Network device layer detects a failure first and switches to a
> backup device (say, in 20sec).
>
> 2. TCP layer timeout & retransmission comes next, _hopefully_
> before the application layer timeout.
>
> 3. Application layer detects a network failure last (by, say,
> 30sec timeout) and may trigger a system-level failover.
>
> * Note 1. The timeouts for #1 and #2 are handled
> independently and there is no relationship between them.
>
> * Note 2. The actual timeout settings (20sec or 30sec in
> this example) are often determined by system requirements,
> and so setting them to certain "safe values" (if any) is
> usually not possible.
>
> If TCP retransmission misses the time frame between events #1
> and #3 in Background above (between 20 and 30sec after the
> network failure), a failure causes a system-level failover where a
> network-device-level failover should be enough.
I'm still totally unconvinced; this seems pointless.
TCP's timeouts are perfectly fine, and the only thing you
might be showing above is that the application timeouts
are too short or that TCP needs notifications.
I am totally unconvinced about your dom0 vs. domU notification
arguments as well.
* Re: [PATCH 2.6.22] TCP: Make TCP_RTO_MAX a variable (take 2)
2007-07-12 9:37 ` David Miller
@ 2007-07-12 13:59 ` OBATA Noboru
2007-07-12 20:24 ` David Miller
2007-07-12 20:51 ` Rick Jones
1 sibling, 1 reply; 19+ messages in thread
From: OBATA Noboru @ 2007-07-12 13:59 UTC (permalink / raw)
To: davem; +Cc: shemminger, yoshfuji, netdev
From: David Miller <davem@davemloft.net>
Subject: Re: [PATCH 2.6.22] TCP: Make TCP_RTO_MAX a variable (take 2)
Date: Thu, 12 Jul 2007 02:37:10 -0700 (PDT)
> Subject: Re: [PATCH 2.6.22] TCP: Make TCP_RTO_MAX a variable (take 2)
> From: David Miller <davem@davemloft.net>
> To: noboru.obata.ar@hitachi.com
> Cc: shemminger@linux-foundation.org, yoshfuji@linux-ipv6.org,
> netdev@vger.kernel.org
> Date: Thu, 12 Jul 2007 02:37:10 -0700 (PDT)
> X-Mailer: Mew version 5.1.52 on Emacs 21.4 / Mule 5.0 (SAKAKI)
>
> From: OBATA Noboru <noboru.obata.ar@hitachi.com>
> Date: Thu, 12 Jul 2007 16:15:10 +0900 (JST)
>
> > 1. Network device layer detects a failure first and switches to a
> > backup device (say, in 20sec).
> >
> > 2. TCP layer timeout & retransmission comes next, _hopefully_
> > before the application layer timeout.
> >
> > 3. Application layer detects a network failure last (by, say,
> > 30sec timeout) and may trigger a system-level failover.
> >
> > * Note 1. The timeouts for #1 and #2 are handled
> > independently and there is no relationship between them.
> >
> > * Note 2. The actual timeout settings (20sec or 30sec in
> > this example) are often determined by system requirements,
> > and so setting them to certain "safe values" (if any) is
> > usually not possible.
> >
> > If TCP retransmission misses the time frame between events #1
> > and #3 in Background above (between 20 and 30sec after the
> > network failure), a failure causes a system-level failover where a
> > network-device-level failover should be enough.
>
> I'm still totally unconvinced; this seems pointless.
>
> TCP's timeouts are perfectly fine, and the only thing you
> might be showing above is that the application timeouts
> are too short or that TCP needs notifications.
I take your comment seriously, David.
And I agree with you that TCP's timeouts are fine on a network
where congestion is the primary cause of packet loss.
But in a high-speed LAN today, for example, congestion is
effectively diminished by network capacity design, and physical
failure of devices and cables is now a major concern, which is
addressed by redundant devices and failover. TCP's timeouts
(RTT/RTO estimation and exponential backoff) work fine as well
on failover-capable networks, but I think a smaller TCP_RTO_MAX
is desirable because failover can take place in a matter of
seconds. This would surely increase the usefulness of TCP on
such networks.
How do you think TCP timeouts in Linux can adapt to such changes
in the network environment?
> I am totally unconvinced about your dom0 vs. domU notification
> arguments as well.
Well, I'd appreciate it if you could tell me in a bit more detail
why my argument does not make sense to you.
In a virtualized environment, a failure is detected in Dom-0,
while the TCP stack to be notified sits in Dom-U. I think
notifications from Dom-0 to the Dom-U TCP stack are not easy.
Best regards,
--
OBATA Noboru (noboru.obata.ar@hitachi.com)
* Re: [PATCH 2.6.22] TCP: Make TCP_RTO_MAX a variable (take 2)
2007-07-12 13:59 ` OBATA Noboru
@ 2007-07-12 20:24 ` David Miller
2007-07-12 21:12 ` Stephen Hemminger
[not found] ` <20070828.220447.01366772.noboru.obata.ar@hitachi.com>
0 siblings, 2 replies; 19+ messages in thread
From: David Miller @ 2007-07-12 20:24 UTC (permalink / raw)
To: noboru.obata.ar; +Cc: shemminger, yoshfuji, netdev
From: OBATA Noboru <noboru.obata.ar@hitachi.com>
Date: Thu, 12 Jul 2007 22:59:50 +0900 (JST)
> How do you think TCP timeouts in Linux can adapt to such changes
> in the network environment?
I'm honestly not interested in discussing this any more,
and Ian has even shown that the RFCs state that if we have
a maximum it must be at least 60 seconds.
So really, there is no chance of merging a TCP_RTO_MAX
decreasing patch, sorry.
* Re: [PATCH 2.6.22] TCP: Make TCP_RTO_MAX a variable (take 2)
2007-07-12 20:24 ` David Miller
@ 2007-07-12 21:12 ` Stephen Hemminger
2007-07-12 21:27 ` Rick Jones
[not found] ` <20070828.220447.01366772.noboru.obata.ar@hitachi.com>
1 sibling, 1 reply; 19+ messages in thread
From: Stephen Hemminger @ 2007-07-12 21:12 UTC (permalink / raw)
To: noboru.obata.ar; +Cc: David Miller, yoshfuji, netdev
On Thu, 12 Jul 2007 13:24:48 -0700 (PDT)
David Miller <davem@davemloft.net> wrote:
> From: OBATA Noboru <noboru.obata.ar@hitachi.com>
> Date: Thu, 12 Jul 2007 22:59:50 +0900 (JST)
>
> > How do you think TCP timeouts in Linux can adapt to such changes
> > in the network environment?
>
> I'm honestly not interested in discussing this any more,
> and Ian has even shown that the RFCs state that if we have
> a maximum it must be at least 60 seconds.
>
> So really, there is no chance of merging a TCP_RTO_MAX
> decreasing patch, sorry.
One question is why the RTO gets so large that it limits failover?
If Linux TCP is working correctly, RTO should be srtt + 2*rttvar
So either there is a huge srtt or variance, or something is going
wrong with RTT estimation. Given some reasonable maximums of
Srtt = 500ms and rttvar = 250ms, that would cause RTO to be 1 second.
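To make the arithmetic concrete, here is a toy version of such an
estimator. It is only a sketch of the formula quoted above, using
the classic 1/8 and 1/4 smoothing gains for illustration; it is
not the kernel's actual implementation:

#include <math.h>
#include <stdio.h>

/* Toy RTT estimator: smooth the RTT samples, track the mean
 * deviation, and derive RTO = srtt + 2*rttvar. */
int main(void)
{
	double samples[] = { 400.0, 500.0, 450.0, 700.0, 480.0 };	/* ms */
	double srtt = samples[0], rttvar = samples[0] / 2.0;
	size_t i;

	for (i = 1; i < sizeof(samples) / sizeof(samples[0]); i++) {
		double err = samples[i] - srtt;

		rttvar += 0.25 * (fabs(err) - rttvar);
		srtt += 0.125 * err;
		printf("sample %5.1f ms -> srtt %6.1f, rttvar %6.1f, "
		       "rto %6.1f ms\n",
		       samples[i], srtt, rttvar, srtt + 2.0 * rttvar);
	}
	return 0;
}

Even with these fairly noisy (made-up) samples, the computed RTO
stays well under one second; multi-second timeouts come from the
backoff, not from the estimator.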
--
Stephen Hemminger <shemminger@linux-foundation.org>
* Re: [PATCH 2.6.22] TCP: Make TCP_RTO_MAX a variable (take 2)
2007-07-12 21:12 ` Stephen Hemminger
@ 2007-07-12 21:27 ` Rick Jones
2007-07-12 22:02 ` Stephen Hemminger
2007-07-13 4:29 ` Ilpo Järvinen
0 siblings, 2 replies; 19+ messages in thread
From: Rick Jones @ 2007-07-12 21:27 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: noboru.obata.ar, David Miller, yoshfuji, netdev
> One question is why the RTO gets so large that it limits failover?
>
> If Linux TCP is working correctly, RTO should be srtt + 2*rttvar
>
> So either there is a huge srtt or variance, or something is going
> wrong with RTT estimation. Given some reasonable maximums of
> Srtt = 500ms and rttvar = 250ms, that would cause RTO to be 1 second.
I suspect that what is happening here is that a link goes down in a
trunk somewhere for some number of seconds, resulting in a given TCP
segment being retransmitted several times, with the doubling of the RTO
each time.
rick jones
* Re: [PATCH 2.6.22] TCP: Make TCP_RTO_MAX a variable (take 2)
2007-07-12 21:27 ` Rick Jones
@ 2007-07-12 22:02 ` Stephen Hemminger
2007-07-12 22:27 ` Rick Jones
2007-07-13 4:29 ` Ilpo Järvinen
1 sibling, 1 reply; 19+ messages in thread
From: Stephen Hemminger @ 2007-07-12 22:02 UTC (permalink / raw)
To: Rick Jones; +Cc: noboru.obata.ar, David Miller, yoshfuji, netdev
On Thu, 12 Jul 2007 14:27:05 -0700
Rick Jones <rick.jones2@hp.com> wrote:
> > One question is why the RTO gets so large that it limits failover?
> >
> > If Linux TCP is working correctly, RTO should be srtt + 2*rttvar
> >
> > So either there is a huge srtt or variance, or something is going
> > wrong with RTT estimation. Given some reasonable maximums of
> > Srtt = 500ms and rttvar = 250ms, that would cause RTO to be 1 second.
>
> I suspect that what is happening here is that a link goes down in a
> trunk somewhere for some number of seconds, resulting in a given TCP
> segment being retransmitted several times, with the doubling of the RTO
> each time.
>
> rick jones
So the problem is that the RTO can grow to twice the failover
detection time. Back to the original mail: the scenario has a
switch with failover detection of 20 seconds, so worst case the
TCP RTO could grow to 40 seconds.
Going back in the archive to the original mail:
> Background
> ==========
>
> When designing a TCP/IP-based network system on failover-capable
> network devices, people want to set timeouts hierarchically in
> three layers: the network device layer, the TCP layer, and the
> application layer (bottom-up order), such that:
>
> 1. Network device layer detects a failure first and switches to a
> backup device (say, in 20sec).
>
> 2. TCP layer timeout & retransmission comes next, _hopefully_
> before the application layer timeout.
>
> 3. Application layer detects a network failure last (by, say,
> 30sec timeout) and may trigger a system-level failover.
Sounds like the solution is to make the switch failover detection
faster. If you get switch failover down to 5sec, then the TCP RTO
shouldn't grow bigger than 10sec, and the application will survive.
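A quick sketch of that arithmetic (assuming pure doubling from a
0.5sec initial RTO; retransmissions attempted during the outage
are simply lost):

#include <stdio.h>

/* Sketch: for a given failover detection time, find when the first
 * retransmission after recovery happens under RTO doubling from an
 * initial 0.5 sec; it stays below roughly twice the detection time. */
int main(void)
{
	double failover[] = { 5.0, 10.0, 20.0 };
	size_t i;

	for (i = 0; i < sizeof(failover) / sizeof(failover[0]); i++) {
		double rto = 0.5, t = 0.0;

		/* Attempts made before the failover completes are lost. */
		while (t < failover[i]) {
			t += rto;
			rto *= 2.0;
		}
		printf("failover %4.1f sec -> first retransmit after "
		       "recovery at %4.1f sec\n", failover[i], t);
	}
	return 0;
}

This prints 7.5, 15.5, and 31.5sec respectively, within the
40-second worst case above for 20-second failover detection.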
--
Stephen Hemminger <shemminger@linux-foundation.org>
* Re: [PATCH 2.6.22] TCP: Make TCP_RTO_MAX a variable (take 2)
2007-07-12 22:02 ` Stephen Hemminger
@ 2007-07-12 22:27 ` Rick Jones
2007-07-24 13:30 ` OBATA Noboru
0 siblings, 1 reply; 19+ messages in thread
From: Rick Jones @ 2007-07-12 22:27 UTC (permalink / raw)
To: Stephen Hemminger; +Cc: noboru.obata.ar, David Miller, yoshfuji, netdev
> So the problem is that the RTO can grow to twice the failover
> detection time. Back to the original mail: the scenario has a
> switch with failover detection of 20 seconds, so worst case the
> TCP RTO could grow to 40 seconds.
>
> Going back in archive to original mail:
>
>
>>Background
>>==========
>>
>>When designing a TCP/IP-based network system on failover-capable
>>network devices, people want to set timeouts hierarchically in
>>three layers: the network device layer, the TCP layer, and the
>>application layer (bottom-up order), such that:
>>
>>1. Network device layer detects a failure first and switches to a
>> backup device (say, in 20sec).
>>
>>2. TCP layer timeout & retransmission comes next, _hopefully_
>> before the application layer timeout.
>>
>>3. Application layer detects a network failure last (by, say,
>> 30sec timeout) and may trigger a system-level failover.
>
>
> Sounds like the solution is to make the switch failover detection
> faster. If you get switch failover down to 5sec, then the TCP RTO
> shouldn't grow bigger than 10sec, and the application will survive.
That may indeed be the best solution; we'll have to wait to hear if
there is any freedom there. When this sort of thing has crossed my
path in other contexts, the general answer is that the device
failover time is fixed, and the application layer time is similarly
constrained by end-user expectation/requirement. Often as not,
layer 8 and 9 issues tend to dominate, and can be expected to trump
the layer 4 issues in this case.
rick jones
* Re: [PATCH 2.6.22] TCP: Make TCP_RTO_MAX a variable (take 2)
2007-07-12 22:27 ` Rick Jones
@ 2007-07-24 13:30 ` OBATA Noboru
0 siblings, 0 replies; 19+ messages in thread
From: OBATA Noboru @ 2007-07-24 13:30 UTC (permalink / raw)
To: rick.jones2; +Cc: shemminger, davem, yoshfuji, netdev
From: Rick Jones <rick.jones2@hp.com>
Subject: Re: [PATCH 2.6.22] TCP: Make TCP_RTO_MAX a variable (take 2)
Date: Thu, 12 Jul 2007 15:27:30 -0700
> > So the problem is that the RTO can grow to twice the failover
> > detection time. Back to the original mail: the scenario has a
> > switch with failover detection of 20 seconds, so worst case the
> > TCP RTO could grow to 40 seconds.
> >
> > Going back in archive to original mail:
> >
> >
> >>Background
> >>==========
> >>
> >>When designing a TCP/IP-based network system on failover-capable
> >>network devices, people want to set timeouts hierarchically in
> >>three layers: the network device layer, the TCP layer, and the
> >>application layer (bottom-up order), such that:
> >>
> >>1. Network device layer detects a failure first and switches to a
> >> backup device (say, in 20sec).
> >>
> >>2. TCP layer timeout & retransmission comes next, _hopefully_
> >> before the application layer timeout.
> >>
> >>3. Application layer detects a network failure last (by, say,
> >> 30sec timeout) and may trigger a system-level failover.
> >
> >
> > Sounds like the solution is to make the switch failover detection
> > faster. If you get switch failover down to 5sec, then the TCP RTO
> > shouldn't grow bigger than 10sec, and the application will survive.
>
> That may indeed be the best solution; we'll have to wait to hear if
> there is any freedom there. When this sort of thing has crossed my
> path in other contexts, the general answer is that the device
> failover time is fixed, and the application layer time is similarly
> constrained by end-user expectation/requirement. Often as not,
> layer 8 and 9 issues tend to dominate, and can be expected to trump
> the layer 4 issues in this case.
I agree that the application will survive if a user makes the
application timeout twice the failover timeout. But I'm afraid
there is no such freedom.
Basically, to minimize downtime, shorter timeouts are preferred
as long as the probability of mis-detection is kept at an
acceptably low level.
In practice, failover timeouts for bonding, switches, or routers
are determined by heuristics. Users know what timeout values and
retry counts of probe packets are suitable for detecting failure
of a certain combination of network equipment (e.g., a 5sec
timeout with 4 retries). Shorter is better.
And the application timeout (or system timeout) is given as an
end-user requirement. There is little chance of negotiation,
really. And again, shorter (than the requirement, if possible) is
better.
Regards,
--
OBATA Noboru (noboru.obata.ar@hitachi.com)
* Re: [PATCH 2.6.22] TCP: Make TCP_RTO_MAX a variable (take 2)
2007-07-12 21:27 ` Rick Jones
2007-07-12 22:02 ` Stephen Hemminger
@ 2007-07-13 4:29 ` Ilpo Järvinen
2007-07-13 16:55 ` Rick Jones
1 sibling, 1 reply; 19+ messages in thread
From: Ilpo Järvinen @ 2007-07-13 4:29 UTC (permalink / raw)
To: Rick Jones
Cc: Stephen Hemminger, noboru.obata.ar, David Miller, yoshfuji,
Netdev
On Thu, 12 Jul 2007, Rick Jones wrote:
> > One question is why the RTO gets so large that it limits failover?
> >
> > If Linux TCP is working correctly, RTO should be srtt + 2*rttvar
> >
> > So either there is a huge srtt or variance, or something is going
> > wrong with RTT estimation. Given some reasonable maximums of
> > Srtt = 500ms and rttvar = 250ms, that would cause RTO to be 1 second.
>
> I suspect that what is happening here is that a link goes down in a trunk
> somewhere for some number of seconds, resulting in a given TCP segment being
> retransmitted several times, with the doubling of the RTO each time.
But that's a back-off for the retransmissions; the doubling is
temporary... Once you return to normal conditions, the accumulated
backoff multiplier will be immediately cut back to normal. So you
should then be back to 1 second (like in the example or whatever)
again...
--
i.
* Re: [PATCH 2.6.22] TCP: Make TCP_RTO_MAX a variable (take 2)
2007-07-13 4:29 ` Ilpo Järvinen
@ 2007-07-13 16:55 ` Rick Jones
2007-07-14 6:19 ` David Miller
0 siblings, 1 reply; 19+ messages in thread
From: Rick Jones @ 2007-07-13 16:55 UTC (permalink / raw)
To: Ilpo Järvinen
Cc: Stephen Hemminger, noboru.obata.ar, David Miller, yoshfuji,
Netdev
Ilpo Järvinen wrote:
> On Thu, 12 Jul 2007, Rick Jones wrote:
>
>
>>>One question is why the RTO gets so large that it limits failover?
>>>
>>>If Linux TCP is working correctly, RTO should be srtt + 2*rttvar
>>>
>>>So either there is a huge srtt or variance, or something is going
>>>wrong with RTT estimation. Given some reasonable maximums of
>>>Srtt = 500ms and rttvar = 250ms, that would cause RTO to be 1 second.
>>
>>I suspect that what is happening here is that a link goes down in a trunk
>>somewhere for some number of seconds, resulting in a given TCP segment being
>>retransmitted several times, with the doubling of the RTO each time.
>
>
> But that's a back-off for the retransmissions; the doubling is
> temporary... Once you return to normal conditions, the accumulated
> backoff multiplier will be immediately cut back to normal. So you
> should then be back to 1 second (like in the example or whatever)
> again...
Fine, but so? I suspect the point of the patch is to provide a lower cap on the
accumulated backoff so data starts flowing over the connection within that lower
cap once the link is restored/failed-over.
rick jones
* Re: [PATCH 2.6.22] TCP: Make TCP_RTO_MAX a variable (take 2)
2007-07-13 16:55 ` Rick Jones
@ 2007-07-14 6:19 ` David Miller
2007-07-23 18:40 ` Rick Jones
0 siblings, 1 reply; 19+ messages in thread
From: David Miller @ 2007-07-14 6:19 UTC (permalink / raw)
To: rick.jones2; +Cc: ilpo.jarvinen, shemminger, noboru.obata.ar, yoshfuji, netdev
From: Rick Jones <rick.jones2@hp.com>
Date: Fri, 13 Jul 2007 09:55:10 -0700
> Fine, but so? I suspect the point of the patch is to provide a
> lower cap on the accumulated backoff so data starts flowing over the
> connection within that lower cap once the link is
> restored/failed-over.
The backoff is there for a reason.
* Re: [PATCH 2.6.22] TCP: Make TCP_RTO_MAX a variable (take 2)
2007-07-14 6:19 ` David Miller
@ 2007-07-23 18:40 ` Rick Jones
0 siblings, 0 replies; 19+ messages in thread
From: Rick Jones @ 2007-07-23 18:40 UTC (permalink / raw)
To: David Miller; +Cc: ilpo.jarvinen, shemminger, noboru.obata.ar, yoshfuji, netdev
David Miller wrote:
> From: Rick Jones <rick.jones2@hp.com>
> Date: Fri, 13 Jul 2007 09:55:10 -0700
>
>
>>Fine, but so? I suspect the point of the patch is to provide a
>>lower cap on the accumulated backoff so data starts flowing over the
>>connection within that lower cap once the link is
>>restored/failed-over.
>
>
> The backoff is there for a reason.
I'm not disputing the general value of the backoff, nor the value
of an initial cap of 60 seconds. In terms of avoiding congestive
collapse one does indeed want the exponential backoff. I'm just in
agreement with the person from Hitachi that allowing someone to
tweak the backoff cap has a certain value.
60 seconds is already a trade-off between a pure (uncapped)
exponential backoff and capping the value.
rick
[parent not found: <20070828.220447.01366772.noboru.obata.ar@hitachi.com>]
* Re: [PATCH 2.6.22] TCP: Make TCP_RTO_MAX a variable (take 2)
2007-07-12 9:37 ` David Miller
2007-07-12 13:59 ` OBATA Noboru
@ 2007-07-12 20:51 ` Rick Jones
2007-07-24 13:35 ` OBATA Noboru
1 sibling, 1 reply; 19+ messages in thread
From: Rick Jones @ 2007-07-12 20:51 UTC (permalink / raw)
To: David Miller; +Cc: noboru.obata.ar, shemminger, yoshfuji, netdev
>
> TCP's timeouts are perfectly fine, and the only thing you
> might be showing above is that the application timeouts
> are too short or that TCP needs notifications.
The application timeouts are probably being driven by external desires
for a given recovery time.
TCP notifications don't solve anything unless the links in question are
local to the machine on which the TCP endpoint resides.
So, it seems that what this is really saying is that in the context of
Linux at least, TCP is not a suitable protocol to be used in situations
where fast detection/recovery is desired.
Does that pretty much sum it up?
rick jones
* Re: [PATCH 2.6.22] TCP: Make TCP_RTO_MAX a variable (take 2)
2007-07-12 20:51 ` Rick Jones
@ 2007-07-24 13:35 ` OBATA Noboru
0 siblings, 0 replies; 19+ messages in thread
From: OBATA Noboru @ 2007-07-24 13:35 UTC (permalink / raw)
To: rick.jones2; +Cc: davem, shemminger, yoshfuji, netdev
From: Rick Jones <rick.jones2@hp.com>
Subject: Re: [PATCH 2.6.22] TCP: Make TCP_RTO_MAX a variable (take 2)
Date: Thu, 12 Jul 2007 13:51:44 -0700
> > TCP's timeouts are perfectly fine, and the only thing you
> > might be showing above is that the application timeouts
> > are too short or that TCP needs notifications.
>
> The application timeouts are probably being driven by external desires
> for a given recovery time.
Agreed.
> TCP notifications don't solve anything unless the links in question are
> local to the machine on which the TCP endpoint resides.
Agreed. Thank you for a good explanation.
My original discussion using Dom-0 and Dom-U might have been
misleading, but I was trying to say:
* Network failure and recovery (failover) are not necessarily
visible locally.
** The Dom-0 vs. Dom-U discussion is just an example of the case
where a network failure is not visible locally.
** For another example, network switches or routers sitting
somewhere in the middle of the route are often duplicated in an
active-standby setup today.
* A quick response (retransmission) from TCP upon recovery of
such invisible devices is desired as well.
* If the failure and recovery are not visible locally, TCP
notifications do not help.
Regards,
--
OBATA Noboru (noboru.obata.ar@hitachi.com)
Thread overview: 19+ messages
2007-07-12 7:15 [PATCH 2.6.22] TCP: Make TCP_RTO_MAX a variable (take 2) OBATA Noboru
2007-07-12 9:37 ` David Miller
2007-07-12 13:59 ` OBATA Noboru
2007-07-12 20:24 ` David Miller
2007-07-12 21:12 ` Stephen Hemminger
2007-07-12 21:27 ` Rick Jones
2007-07-12 22:02 ` Stephen Hemminger
2007-07-12 22:27 ` Rick Jones
2007-07-24 13:30 ` OBATA Noboru
2007-07-13 4:29 ` Ilpo Järvinen
2007-07-13 16:55 ` Rick Jones
2007-07-14 6:19 ` David Miller
2007-07-23 18:40 ` Rick Jones
[not found] ` <20070828.220447.01366772.noboru.obata.ar@hitachi.com>
[not found] ` <20070828.133057.107937654.davem@davemloft.net>
2007-08-29 12:26 ` OBATA Noboru
2007-08-29 16:16 ` Rick Jones
2007-08-30 12:24 ` OBATA Noboru
2007-08-29 18:15 ` David Miller
2007-07-12 20:51 ` Rick Jones
2007-07-24 13:35 ` OBATA Noboru