* [PATCH] revert TCP retransmission backoff on ICMP destination unreachable
@ 2009-08-11 11:27 Damian Lukowski
2009-08-13 23:08 ` David Miller
0 siblings, 1 reply; 9+ messages in thread
From: Damian Lukowski @ 2009-08-11 11:27 UTC (permalink / raw)
To: netdev
[-- Attachment #1: Type: text/plain, Size: 1276 bytes --]
This patch is a proof-of-concept related to the draft "Make TCP more
Robust to Long Connectivity Disruptions"
(http://tools.ietf.org/html/draft-zimmermann-tcp-lcd-01).
Exponential backoff is TCP's standard behaviour during long connectivity
disruptions, which is a countermeasure against network congestion.
If congestion can be excluded as the reason for RTO retransmission loss,
backoff is not desirable, as it yields longer TCP recovery times, when
the communication path is repaired shortly after an unsuccessful
retransmission probe.
Here, an ICMP host/network unreachable message, whose payload fits to
TCPs SND.UNA, is taken as an indication that the RTO retransmission has
not been lost due to congestion, but because of a route failure
somewhere along the path. On receipt of such a message, the timeout is
halved (in order to revert to doubling after the retransmission).
With true congestion, a router won't trigger such a message and the
patched TCP will operate as standard TCP.
Comments are appreciated, especially regarding eventual side effects
with the zero window probing mechanism.
In Linux, the RTO retransmisson mechanism is coupled to the zero window
probing mechanism, but is not intended to be changed. I hope I have not
missed something, here.
---
[-- Attachment #2: TCP_ICMP_2.6.30.4.patch --]
[-- Type: text/plain, Size: 4322 bytes --]
diff -Naur linux-2.6.30.4-vanilla/include/net/tcp.h linux-2.6.30.4-tcp-icmp/include/net/tcp.h
--- linux-2.6.30.4-vanilla/include/net/tcp.h 2009-07-31 00:34:47.000000000 +0200
+++ linux-2.6.30.4-tcp-icmp/include/net/tcp.h 2009-08-11 11:28:23.162009835 +0200
@@ -1220,6 +1220,8 @@
#define tcp_for_write_queue_from_safe(skb, tmp, sk) \
skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
+#define retrans_overstepped(sk, boundary) (inet_csk(sk)->icsk_retransmits && (tcp_time_stamp - tcp_sk(sk)->retrans_stamp) >= TCP_RTO_MIN*(2 << boundary))
+
static inline struct sk_buff *tcp_send_head(struct sock *sk)
{
return sk->sk_send_head;
diff -Naur linux-2.6.30.4-vanilla/net/ipv4/tcp_ipv4.c linux-2.6.30.4-tcp-icmp/net/ipv4/tcp_ipv4.c
--- linux-2.6.30.4-vanilla/net/ipv4/tcp_ipv4.c 2009-07-31 00:34:47.000000000 +0200
+++ linux-2.6.30.4-tcp-icmp/net/ipv4/tcp_ipv4.c 2009-08-11 11:28:33.572009505 +0200
@@ -332,11 +332,14 @@
{
struct iphdr *iph = (struct iphdr *)skb->data;
struct tcphdr *th = (struct tcphdr *)(skb->data + (iph->ihl << 2));
+
+ struct inet_connection_sock *icsk;
struct tcp_sock *tp;
struct inet_sock *inet;
const int type = icmp_hdr(skb)->type;
const int code = icmp_hdr(skb)->code;
struct sock *sk;
+ struct sk_buff *skb_r;
__u32 seq;
int err;
struct net *net = dev_net(skb->dev);
@@ -367,6 +370,7 @@
if (sk->sk_state == TCP_CLOSE)
goto out;
+ icsk = inet_csk(sk);
tp = tcp_sk(sk);
seq = ntohl(th->seq);
if (sk->sk_state != TCP_LISTEN &&
@@ -393,6 +397,34 @@
}
err = icmp_err_convert[code].errno;
+
+ /* check if ICMP unreachable messages allow revert of back-off */
+ if ((code != ICMP_NET_UNREACH && code != ICMP_HOST_UNREACH) || seq != tp->snd_una
+ || !icsk->icsk_retransmits || !icsk->icsk_backoff) break;
+
+ icsk->icsk_backoff--;
+ icsk->icsk_rto >>= 1;
+
+ skb_r = skb_peek(&sk->sk_write_queue);
+ BUG_ON(!skb_r);
+
+ if (sock_owned_by_user(sk)) {
+ // Deferring retransmission clocked by ICMP due to locked socket.
+ inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
+ min(icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL),
+ TCP_RTO_MAX);
+ }
+
+ if (tcp_time_stamp - TCP_SKB_CB(skb_r)->when > inet_csk(sk)->icsk_rto) {
+ // RTO revert clocked out retransmission.
+ tcp_retransmit_skb(sk, skb_r);
+ inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
+ } else {
+ //RTO revert shortened timer.
+ inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
+ icsk->icsk_rto - (tcp_time_stamp - TCP_SKB_CB(skb_r)->when), TCP_RTO_MAX);
+ }
+
break;
case ICMP_TIME_EXCEEDED:
err = EHOSTUNREACH;
diff -Naur linux-2.6.30.4-vanilla/net/ipv4/tcp_timer.c linux-2.6.30.4-tcp-icmp/net/ipv4/tcp_timer.c
--- linux-2.6.30.4-vanilla/net/ipv4/tcp_timer.c 2009-07-31 00:34:47.000000000 +0200
+++ linux-2.6.30.4-tcp-icmp/net/ipv4/tcp_timer.c 2009-08-11 11:28:33.557009931 +0200
@@ -143,7 +143,7 @@
dst_negative_advice(&sk->sk_dst_cache);
retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
} else {
- if (icsk->icsk_retransmits >= sysctl_tcp_retries1) {
+ if (retrans_overstepped(sk, sysctl_tcp_retries1)) {
/* Black hole detection */
tcp_mtu_probing(icsk, sk);
@@ -156,12 +156,12 @@
retry_until = tcp_orphan_retries(sk, alive);
- if (tcp_out_of_resources(sk, alive || icsk->icsk_retransmits < retry_until))
+ if (tcp_out_of_resources(sk, alive || !retrans_overstepped(sk, retry_until)))
return 1;
}
}
- if (icsk->icsk_retransmits >= retry_until) {
+ if (retrans_overstepped(sk, retry_until)) {
/* Has it gone just too far? */
tcp_write_err(sk);
return 1;
@@ -385,7 +385,7 @@
out_reset_timer:
icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
- if (icsk->icsk_retransmits > sysctl_tcp_retries1)
+ if (retrans_overstepped(sk, sysctl_tcp_retries1))
__sk_dst_reset(sk);
out:;
^ permalink raw reply [flat|nested] 9+ messages in thread* Re: [PATCH] revert TCP retransmission backoff on ICMP destination unreachable 2009-08-11 11:27 [PATCH] revert TCP retransmission backoff on ICMP destination unreachable Damian Lukowski @ 2009-08-13 23:08 ` David Miller 2009-08-14 12:08 ` Damian Lukowski 0 siblings, 1 reply; 9+ messages in thread From: David Miller @ 2009-08-13 23:08 UTC (permalink / raw) To: damian; +Cc: netdev From: Damian Lukowski <damian@tvk.rwth-aachen.de> Date: Tue, 11 Aug 2009 13:27:42 +0200 > @@ -1220,6 +1220,8 @@ > #define tcp_for_write_queue_from_safe(skb, tmp, sk) \ > skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp) > > +#define retrans_overstepped(sk, boundary) (inet_csk(sk)->icsk_retransmits && (tcp_time_stamp - tcp_sk(sk)->retrans_stamp) >= TCP_RTO_MIN*(2 << boundary)) > + Longer than 80 columns, and use an inline function instead of a macro in order to get proper type checking. > @@ -332,11 +332,14 @@ > { > struct iphdr *iph = (struct iphdr *)skb->data; > struct tcphdr *th = (struct tcphdr *)(skb->data + (iph->ihl << 2)); > + > + struct inet_connection_sock *icsk; > struct tcp_sock *tp; > struct inet_sock *inet; Do not break up the function local variables with spurious new lines like this, please. > const int type = icmp_hdr(skb)->type; > const int code = icmp_hdr(skb)->code; > struct sock *sk; > + struct sk_buff *skb_r; > __u32 seq; > int err; > struct net *net = dev_net(skb->dev); > @@ -367,6 +370,7 @@ > if (sk->sk_state == TCP_CLOSE) > goto out; > > + icsk = inet_csk(sk); > tp = tcp_sk(sk); > seq = ntohl(th->seq); > if (sk->sk_state != TCP_LISTEN && > @@ -393,6 +397,34 @@ > } > > err = icmp_err_convert[code].errno; > + > + /* check if ICMP unreachable messages allow revert of back-off */ > + if ((code != ICMP_NET_UNREACH && code != ICMP_HOST_UNREACH) || seq != tp->snd_una > + || !icsk->icsk_retransmits || !icsk->icsk_backoff) break; > + > + icsk->icsk_backoff--; > + icsk->icsk_rto >>= 1; > + > + skb_r = skb_peek(&sk->sk_write_queue); > + BUG_ON(!skb_r); > + The indentation and tabbing is messed up in all of the code you are adding, please fix it up to be consistent with the surrounding code and the rest of the TCP stack. Do not use C++ style // comments. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] revert TCP retransmission backoff on ICMP destination unreachable 2009-08-13 23:08 ` David Miller @ 2009-08-14 12:08 ` Damian Lukowski 2009-08-18 13:45 ` Ilpo Järvinen 0 siblings, 1 reply; 9+ messages in thread From: Damian Lukowski @ 2009-08-14 12:08 UTC (permalink / raw) To: David Miller; +Cc: netdev [-- Attachment #1: Type: text/plain, Size: 445 bytes --] > Longer than 80 columns, and use an inline function instead > of a macro in order to get proper type checking. > [...] > Do not break up the function local variables with spurious new lines > like this, please. > [...] > The indentation and tabbing is messed up in all of the code you are > adding, please fix it up to be consistent with the surrounding code > and the rest of the TCP stack. > > Do not use C++ style // comments. Better? -- [-- Attachment #2: TCP_ICMP_2.6.30.4.patch --] [-- Type: text/plain, Size: 4105 bytes --] Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de> diff -Naur linux-2.6.30.4/include/net/tcp.h linux-2.6.30.4-tcp-icmp/include/net/tcp.h --- linux-2.6.30.4/include/net/tcp.h 2009-07-31 00:34:47.000000000 +0200 +++ linux-2.6.30.4-tcp-icmp/include/net/tcp.h 2009-08-14 12:18:30.846060685 +0200 @@ -1220,6 +1220,14 @@ #define tcp_for_write_queue_from_safe(skb, tmp, sk) \ skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp) +static inline bool retrans_overstepped(const struct sock *sk, + unsigned int boundary) +{ + return inet_csk(sk)->icsk_retransmits && + (tcp_time_stamp - tcp_sk(sk)->retrans_stamp) >= + TCP_RTO_MIN*(2 << boundary); +} + static inline struct sk_buff *tcp_send_head(struct sock *sk) { return sk->sk_send_head; diff -Naur linux-2.6.30.4/net/ipv4/tcp_ipv4.c linux-2.6.30.4-tcp-icmp/net/ipv4/tcp_ipv4.c --- linux-2.6.30.4/net/ipv4/tcp_ipv4.c 2009-07-31 00:34:47.000000000 +0200 +++ linux-2.6.30.4-tcp-icmp/net/ipv4/tcp_ipv4.c 2009-08-14 13:19:48.841598908 +0200 @@ -332,11 +332,13 @@ { struct iphdr *iph = (struct iphdr *)skb->data; struct tcphdr *th = (struct tcphdr *)(skb->data + (iph->ihl << 2)); + struct inet_connection_sock *icsk; struct tcp_sock *tp; struct inet_sock *inet; const int type = icmp_hdr(skb)->type; const int code = icmp_hdr(skb)->code; struct sock *sk; + struct sk_buff *skb_r; __u32 seq; int err; struct net *net = dev_net(skb->dev); @@ -367,6 +369,7 @@ if (sk->sk_state == TCP_CLOSE) goto out; + icsk = inet_csk(sk); tp = tcp_sk(sk); seq = ntohl(th->seq); if (sk->sk_state != TCP_LISTEN && @@ -393,6 +396,41 @@ } err = icmp_err_convert[code].errno; + /* check if ICMP unreachable messages allow revert of backoff */ + if ((code != ICMP_NET_UNREACH && code != ICMP_HOST_UNREACH) || + seq != tp->snd_una || !icsk->icsk_retransmits || + !icsk->icsk_backoff) + break; + + icsk->icsk_backoff--; + icsk->icsk_rto >>= 1; + + skb_r = skb_peek(&sk->sk_write_queue); + BUG_ON(!skb_r); + + if (sock_owned_by_user(sk)) { + /* Deferring retransmission clocked by ICMP + * due to locked socket. */ + inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, + min(icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL), + TCP_RTO_MAX); + } + + if (tcp_time_stamp - TCP_SKB_CB(skb_r)->when > + inet_csk(sk)->icsk_rto) { + /* RTO revert clocked out retransmission. */ + tcp_retransmit_skb(sk, skb_r); + inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, + icsk->icsk_rto, TCP_RTO_MAX); + } else { + /* RTO revert shortened timer. */ + inet_csk_reset_xmit_timer( + sk, ICSK_TIME_RETRANS, + icsk->icsk_rto- + (tcp_time_stamp-TCP_SKB_CB(skb_r)->when), + TCP_RTO_MAX); + } + break; case ICMP_TIME_EXCEEDED: err = EHOSTUNREACH; diff -Naur linux-2.6.30.4/net/ipv4/tcp_timer.c linux-2.6.30.4-tcp-icmp/net/ipv4/tcp_timer.c --- linux-2.6.30.4/net/ipv4/tcp_timer.c 2009-07-31 00:34:47.000000000 +0200 +++ linux-2.6.30.4-tcp-icmp/net/ipv4/tcp_timer.c 2009-08-14 13:22:18.068666329 +0200 @@ -143,7 +143,7 @@ dst_negative_advice(&sk->sk_dst_cache); retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries; } else { - if (icsk->icsk_retransmits >= sysctl_tcp_retries1) { + if (retrans_overstepped(sk, sysctl_tcp_retries1)) { /* Black hole detection */ tcp_mtu_probing(icsk, sk); @@ -156,12 +156,14 @@ retry_until = tcp_orphan_retries(sk, alive); - if (tcp_out_of_resources(sk, alive || icsk->icsk_retransmits < retry_until)) + if (tcp_out_of_resources( + sk, alive || + !retrans_overstepped(sk, retry_until))) return 1; } } - if (icsk->icsk_retransmits >= retry_until) { + if (retrans_overstepped(sk, retry_until)) { /* Has it gone just too far? */ tcp_write_err(sk); return 1; @@ -385,7 +387,7 @@ out_reset_timer: icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX); inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX); - if (icsk->icsk_retransmits > sysctl_tcp_retries1) + if (retrans_overstepped(sk, sysctl_tcp_retries1)) __sk_dst_reset(sk); out:; ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] revert TCP retransmission backoff on ICMP destination unreachable 2009-08-14 12:08 ` Damian Lukowski @ 2009-08-18 13:45 ` Ilpo Järvinen 2009-08-18 17:40 ` Damian Lukowski 0 siblings, 1 reply; 9+ messages in thread From: Ilpo Järvinen @ 2009-08-18 13:45 UTC (permalink / raw) To: Damian Lukowski; +Cc: David Miller, Netdev On Fri, 14 Aug 2009, Damian Lukowski wrote: >> Longer than 80 columns, and use an inline function instead >> of a macro in order to get proper type checking. >> [...] >> Do not break up the function local variables with spurious new lines >> like this, please. >> [...] >> The indentation and tabbing is messed up in all of the code you are >> adding, please fix it up to be consistent with the surrounding code >> and the rest of the TCP stack. >> >> Do not use C++ style // comments. > > Better? Please, include the changelog message on resubmits too next time. > Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de> > diff -Naur linux-2.6.30.4/include/net/tcp.h linux-2.6.30.4-tcp-icmp/include/net/tcp.h > --- linux-2.6.30.4/include/net/tcp.h 2009-07-31 00:34:47.000000000 +0200 > +++ linux-2.6.30.4-tcp-icmp/include/net/tcp.h 2009-08-14 12:18:30.846060685 +0200 > @@ -1220,6 +1220,14 @@ > #define tcp_for_write_queue_from_safe(skb, tmp, sk) \ > skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp) > > +static inline bool retrans_overstepped(const struct sock *sk, > + unsigned int boundary) > +{ > + return inet_csk(sk)->icsk_retransmits && > + (tcp_time_stamp - tcp_sk(sk)->retrans_stamp) >= > + TCP_RTO_MIN*(2 << boundary); > +} > + > static inline struct sk_buff *tcp_send_head(struct sock *sk) > { > return sk->sk_send_head; > diff -Naur linux-2.6.30.4/net/ipv4/tcp_ipv4.c linux-2.6.30.4-tcp-icmp/net/ipv4/tcp_ipv4.c > --- linux-2.6.30.4/net/ipv4/tcp_ipv4.c 2009-07-31 00:34:47.000000000 +0200 > +++ linux-2.6.30.4-tcp-icmp/net/ipv4/tcp_ipv4.c 2009-08-14 13:19:48.841598908 +0200 > @@ -332,11 +332,13 @@ > { > struct iphdr *iph = (struct iphdr *)skb->data; > struct tcphdr *th = (struct tcphdr *)(skb->data + (iph->ihl << 2)); > + struct inet_connection_sock *icsk; > struct tcp_sock *tp; > struct inet_sock *inet; > const int type = icmp_hdr(skb)->type; > const int code = icmp_hdr(skb)->code; > struct sock *sk; > + struct sk_buff *skb_r; I'd make this called "skb", and change the old skb to icmp_skb. > __u32 seq; > int err; > struct net *net = dev_net(skb->dev); > @@ -367,6 +369,7 @@ > if (sk->sk_state == TCP_CLOSE) > goto out; > > + icsk = inet_csk(sk); > tp = tcp_sk(sk); > seq = ntohl(th->seq); > if (sk->sk_state != TCP_LISTEN && > @@ -393,6 +396,41 @@ > } > > err = icmp_err_convert[code].errno; > + /* check if ICMP unreachable messages allow revert of backoff */ > + if ((code != ICMP_NET_UNREACH && code != ICMP_HOST_UNREACH) || > + seq != tp->snd_una || !icsk->icsk_retransmits || > + !icsk->icsk_backoff) I'd recommend you break this into two if()\nbreak's, one for filtering other icmps and other for filtering mismatching against the socket's state. > + break; > + > + icsk->icsk_backoff--; > + icsk->icsk_rto >>= 1; Are you absolute sure that we don't go to zero here when phase of the moon is just right? I'd put a max(..., 1) guard there. > + > + skb_r = skb_peek(&sk->sk_write_queue); tcp_write_queue_head(sk) > + BUG_ON(!skb_r); > + > + if (sock_owned_by_user(sk)) { > + /* Deferring retransmission clocked by ICMP > + * due to locked socket. */ > + inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, > + min(icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL), ??? (besides having indent wrong). What makes you think HZ/2 is right value if icsk_rto is large? > + TCP_RTO_MAX); Perhaps you lack here something to exit? > + } > + > + if (tcp_time_stamp - TCP_SKB_CB(skb_r)->when > > + inet_csk(sk)->icsk_rto) { icsk->icsk_rto !?! > + /* RTO revert clocked out retransmission. */ > + tcp_retransmit_skb(sk, skb_r); This is plain wrong, tcp_sock counters will get messed up if TCPCB_SACKED_RETRANS is already set while calling tcp_retransmit_skb. As a sidenote, it would be probably useful to move the check + clear of that bit before doing the retransmission to some common place one day. > + inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, > + icsk->icsk_rto, TCP_RTO_MAX); RFC 2988, step 5.5 missing? Have you verified this patch for real? > + } else { > + /* RTO revert shortened timer. */ > + inet_csk_reset_xmit_timer( > + sk, ICSK_TIME_RETRANS, > + icsk->icsk_rto- > + (tcp_time_stamp-TCP_SKB_CB(skb_r)->when), Spacing. > + TCP_RTO_MAX); > + } > + How about: u32 remaining; remaining = icsk->icsk_rto - min(icsk->icsk_rto, tcp_time_stamp - TCP_SKB_CB(skb_r)->when)); if (!remaining) { tcp_retransmit_skb(...); remaining = icsk->icsk_rto; } inet_csk_reset_xmit_timer(..., remaining); But check my note about step 5.5 (see above) to get remaining set right in the if branch. > break; > case ICMP_TIME_EXCEEDED: > err = EHOSTUNREACH; > diff -Naur linux-2.6.30.4/net/ipv4/tcp_timer.c linux-2.6.30.4-tcp-icmp/net/ipv4/tcp_timer.c > --- linux-2.6.30.4/net/ipv4/tcp_timer.c 2009-07-31 00:34:47.000000000 +0200 > +++ linux-2.6.30.4-tcp-icmp/net/ipv4/tcp_timer.c 2009-08-14 13:22:18.068666329 +0200 > @@ -143,7 +143,7 @@ > dst_negative_advice(&sk->sk_dst_cache); > retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries; > } else { > - if (icsk->icsk_retransmits >= sysctl_tcp_retries1) { > + if (retrans_overstepped(sk, sysctl_tcp_retries1)) { ??? Could you justify these changes and explain their relation to the draft and icmp messages? If not, please drop them and try with proper description and description for them in a separate patch. I fail to see what you're trying to achieve here. You just ended up redefining the sysctl meaning too I think... > /* Black hole detection */ > tcp_mtu_probing(icsk, sk); > > @@ -156,12 +156,14 @@ > > retry_until = tcp_orphan_retries(sk, alive); > > - if (tcp_out_of_resources(sk, alive || icsk->icsk_retransmits < retry_until)) > + if (tcp_out_of_resources( > + sk, alive || > + !retrans_overstepped(sk, retry_until))) ??? > return 1; > } > } > > - if (icsk->icsk_retransmits >= retry_until) { > + if (retrans_overstepped(sk, retry_until)) { ... > /* Has it gone just too far? */ > tcp_write_err(sk); > return 1; > @@ -385,7 +387,7 @@ > out_reset_timer: > icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX); > inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX); > - if (icsk->icsk_retransmits > sysctl_tcp_retries1) > + if (retrans_overstepped(sk, sysctl_tcp_retries1)) ...I guess none of these retrans_overstepped are related to the icmp stuff at all? > __sk_dst_reset(sk); > > out:; -- i. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] revert TCP retransmission backoff on ICMP destination unreachable 2009-08-18 13:45 ` Ilpo Järvinen @ 2009-08-18 17:40 ` Damian Lukowski 2009-08-18 22:07 ` Ilpo Järvinen 2009-08-19 6:29 ` David Miller 0 siblings, 2 replies; 9+ messages in thread From: Damian Lukowski @ 2009-08-18 17:40 UTC (permalink / raw) To: Ilpo Järvinen; +Cc: David Miller, Netdev > On Fri, 14 Aug 2009, Damian Lukowski wrote: > >>> Longer than 80 columns, and use an inline function instead >>> of a macro in order to get proper type checking. >>> [...] >>> Do not break up the function local variables with spurious new lines >>> like this, please. >>> [...] >>> The indentation and tabbing is messed up in all of the code you are >>> adding, please fix it up to be consistent with the surrounding code >>> and the rest of the TCP stack. >>> >>> Do not use C++ style // comments. >> >> Better? > > Please, include the changelog message on resubmits too next time. I'm sorry, which message do you mean? I used plain diff without GIT or anything. >> + if ((code != ICMP_NET_UNREACH && code != ICMP_HOST_UNREACH) || >> + seq != tp->snd_una || !icsk->icsk_retransmits || >> + !icsk->icsk_backoff) > > I'd recommend you break this into two if()\nbreak's, one for filtering > other icmps and other for filtering mismatching against the socket's state. Ok. >> + icsk->icsk_backoff--; >> + icsk->icsk_rto >>= 1; > > Are you absolute sure that we don't go to zero here when phase of the > moon is just right? I'd put a max(..., 1) guard there. Well, the retransmission timer doubles icsk_rto whenever it increases icsk_backoff, so we should reach the original icsk_rto, when icsk_backoff becomes zero. >> + skb_r = skb_peek(&sk->sk_write_queue); > > tcp_write_queue_head(sk) Ok. > >> + BUG_ON(!skb_r); >> + >> + if (sock_owned_by_user(sk)) { >> + /* Deferring retransmission clocked by ICMP >> + * due to locked socket. */ >> + inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, >> + min(icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL), > > ??? (besides having indent wrong). What makes you think HZ/2 is right > value if icsk_rto is large? To be honest, I have taken over the code from my predecessor and didn't question this part. When thinking about it, I would remove this block completely. Will bad things happen, if I mess with the timer, when the socket is locked? >> + TCP_RTO_MAX); > > Perhaps you lack here something to exit? > >> + } >> + >> + if (tcp_time_stamp - TCP_SKB_CB(skb_r)->when > >> + inet_csk(sk)->icsk_rto) { > > icsk->icsk_rto !?! > >> + /* RTO revert clocked out retransmission. */ >> + tcp_retransmit_skb(sk, skb_r); > > This is plain wrong, tcp_sock counters will get messed up if > TCPCB_SACKED_RETRANS is already set while calling tcp_retransmit_skb. As > a sidenote, it would be probably useful to move the check + clear of > that bit before doing the retransmission to some common place one day. What, if I call tcp_retransmit_timer() manually instead? Will there still be problems with SACK? >> + inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, >> + icsk->icsk_rto, TCP_RTO_MAX); > > RFC 2988, step 5.5 missing? Have you verified this patch for real? Thats the whole point of the draft. If consecutive retransmissions trigger fitting ICMPs, the RTO value is not doubled, but remains constant. So you can have a retransmission schedule of t, t+r, t+2r, t+3r, t+4r, t+5r ... instead of t, t+r, t+3r, t+7r, t+15r, t+31r ... I can show you trace plots of patched an unpatched TCP, if you like. Should I attach files in the mails, or better post hyperlinks to them? >> + } else { >> + /* RTO revert shortened timer. */ >> + inet_csk_reset_xmit_timer( >> + sk, ICSK_TIME_RETRANS, >> + icsk->icsk_rto- >> + (tcp_time_stamp-TCP_SKB_CB(skb_r)->when), > > Spacing. > >> + TCP_RTO_MAX); >> + } >> + > > How about: > > u32 remaining; > > remaining = icsk->icsk_rto - min(icsk->icsk_rto, > tcp_time_stamp - TCP_SKB_CB(skb_r)->when)); > if (!remaining) { > tcp_retransmit_skb(...); > remaining = icsk->icsk_rto; > } > inet_csk_reset_xmit_timer(..., remaining); Seems to be ok, but isn't tcp_retransmit_timer(...) better? >> - if (icsk->icsk_retransmits >= sysctl_tcp_retries1) { >> + if (retrans_overstepped(sk, sysctl_tcp_retries1)) { > > ??? Could you justify these changes and explain their relation to the > draft and icmp messages? If not, please drop them and try with proper > description and description for them in a separate patch. I fail to see > what you're trying to achieve here. You just ended up redefining the > sysctl meaning too I think... Well, you're quite right. RFCs specify timeouts in dimensions of time (seconds, minutes, etc.), but Linux specifies timeouts in form of numbers of retransmissions. One can do this, as long as the retransmission schedule (exponential backoff in RFC case) is known. But with this patch, the schedule can be completely different. In the "worst" case, the time intervals between consecutive retransmission probes are constant, leading to a TCP timeout after few seconds, instead of several minutes. What I did here, is to calculate the TCP timeout as if the schedule were still an exponential backoff, given the allowed number of retransmissions. It is possible, that TCP will now retransmit many more times than tcp_retries indicates, but the duration until the TCP connection times out, is more or less equivalent to unpatched TCP with same tcp_retries values. So, whenever some control block uses count values, but means time values, I have to replace the control block appropriately; and that's what retrans_overstepped() is supposed to do. Thanks for your help, so far Damian ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] revert TCP retransmission backoff on ICMP destination unreachable 2009-08-18 17:40 ` Damian Lukowski @ 2009-08-18 22:07 ` Ilpo Järvinen 2009-08-18 23:56 ` Damian Lukowski 2009-08-19 6:29 ` David Miller 1 sibling, 1 reply; 9+ messages in thread From: Ilpo Järvinen @ 2009-08-18 22:07 UTC (permalink / raw) To: Damian Lukowski; +Cc: David Miller, Netdev On Tue, 18 Aug 2009, Damian Lukowski wrote: > > On Fri, 14 Aug 2009, Damian Lukowski wrote: > > > >>> Longer than 80 columns, and use an inline function instead > >>> of a macro in order to get proper type checking. > >>> [...] > >>> Do not break up the function local variables with spurious new lines > >>> like this, please. > >>> [...] > >>> The indentation and tabbing is messed up in all of the code you are > >>> adding, please fix it up to be consistent with the surrounding code > >>> and the rest of the TCP stack. > >>> > >>> Do not use C++ style // comments. > >> > >> Better? > > > > Please, include the changelog message on resubmits too next time. > > I'm sorry, which message do you mean? I used plain diff without GIT or > anything. In the original mail you had some description about the patch, I was thinking of quoting something out of it but it was missing so I didn't bother copying. But more importantly, would Dave Miller want to apply the patch of yours, he would need that description and it saves work for him to have the description in every resubmit "duplicated" (one usually just gets a boomerang mail asking for the description which is similarly wasting his time). > >> + icsk->icsk_backoff--; > >> + icsk->icsk_rto >>= 1; > > > > Are you absolute sure that we don't go to zero here when phase of the > > moon is just right? I'd put a max(..., 1) guard there. > > Well, the retransmission timer doubles icsk_rto whenever it increases > icsk_backoff, so we should reach the original icsk_rto, when > icsk_backoff becomes zero. This is a false assumption. Take a look into the formula, there's a cap how far icsk_rto goes but icsk_backoff keeps increasing. In fact, there would probably be a need to set some other lower bound than 1, TCP_RTO_MIN would probably be more appropriate for this. > >> + BUG_ON(!skb_r); > >> + > >> + if (sock_owned_by_user(sk)) { > >> + /* Deferring retransmission clocked by ICMP > >> + * due to locked socket. */ > >> + inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, > >> + min(icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL), > > > > ??? (besides having indent wrong). What makes you think HZ/2 is right > > value if icsk_rto is large? > > To be honest, I have taken over the code from my predecessor and didn't > question this part. But you did Signoff that regardless, so you should know ;-). > When thinking about it, I would remove this block > completely. Will bad things happen, if I mess with the timer, when the > socket is locked? I've no idea without looking into all the details but I think this needs to be moved slightly later anyway into the imminent expiry branch and the timeout value needs to be copied from tcp_write_timer instead of using that bogus formula. > >> + TCP_RTO_MAX); > > > > Perhaps you lack here something to exit? > > > >> + } > >> + > >> + if (tcp_time_stamp - TCP_SKB_CB(skb_r)->when > > >> + inet_csk(sk)->icsk_rto) { > > > > icsk->icsk_rto !?! > > > >> + /* RTO revert clocked out retransmission. */ > >> + tcp_retransmit_skb(sk, skb_r); > > > > This is plain wrong, tcp_sock counters will get messed up if > > TCPCB_SACKED_RETRANS is already set while calling tcp_retransmit_skb. As > > a sidenote, it would be probably useful to move the check + clear of > > that bit before doing the retransmission to some common place one day. > > What, if I call tcp_retransmit_timer() manually instead? Will there > still be problems with SACK? SACK is fine then, and I took a quick look into the other stuff which seemed to be quite ok too but it was just a quick glance, but then you certainly must delay the RTO if sock_owned_by_user(sk) is true like tcp_write_timer does. Tcp_retransmit_timer would also deal with the missing step 5.5 btw. I'm not sure if you like all the things that will happen in the tcp_enter_loss though, especially the undo_marker clearing might not please you, or is at best, sub-optimal in a certain scenario. > >> + inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, > >> + icsk->icsk_rto, TCP_RTO_MAX); > > > > RFC 2988, step 5.5 missing? Have you verified this patch for real? > > Thats the whole point of the draft. If consecutive retransmissions > trigger fitting ICMPs, the RTO value is not doubled, but remains > constant. No, either the draft is wrong or the implementation. 5.5 step should be applied. Now you end up decreasing RTO per icmp which causes imminent retransmission instead of keeping it "constant". Effectively, step 4, 5, 6 (halves like you do), 7 -> R, R is 5.4-5.6 and 5.5 increases RTO again because the timer _expired_ here even though we did short-circuit it instead of setting a zero timer and using the normal procedure (which in itself is ok approach, no need to force timer to expire). > So you can have a retransmission schedule of > t, t+r, t+2r, t+3r, t+4r, t+5r ... > instead of > t, t+r, t+3r, t+7r, t+15r, t+31r ... Sure, but that's not what happens, instead you will get this series: t, t+r, (t+r)/2, (t+r)/4, ... ...disclaimer: it might be off-by-something somewhere but I hope you get my point anyway. > I can show you trace plots of patched an unpatched TCP, if you like. > Should I attach files in the mails, or better post hyperlinks to them? Sure, but they must not be from a trivial case but the rto must "flunctuate" and the expirys must be using the "imminent" branch for it to count for me as a proof that you did it right. > > How about: > > > > u32 remaining; > > > > remaining = icsk->icsk_rto - min(icsk->icsk_rto, > > tcp_time_stamp - TCP_SKB_CB(skb_r)->when)); > > if (!remaining) { > > tcp_retransmit_skb(...); > > remaining = icsk->icsk_rto; > > } > > inet_csk_reset_xmit_timer(..., remaining); > > Seems to be ok, but isn't tcp_retransmit_timer(...) better? Yeah, that's mostly fine I guess (at least in meaning of "safe"). > >> - if (icsk->icsk_retransmits >= sysctl_tcp_retries1) { > >> + if (retrans_overstepped(sk, sysctl_tcp_retries1)) { > > > > ??? Could you justify these changes and explain their relation to the > > draft and icmp messages? If not, please drop them and try with proper > > description and description for them in a separate patch. I fail to see > > what you're trying to achieve here. You just ended up redefining the > > sysctl meaning too I think... > > Well, you're quite right. RFCs specify timeouts in dimensions of time > (seconds, minutes, etc.), but Linux specifies timeouts in form of > numbers of retransmissions. One can do this, as long as the > retransmission schedule (exponential backoff in RFC case) is known. But > with this patch, the schedule can be completely different. In the > "worst" case, the time intervals between consecutive retransmission > probes are constant, leading to a TCP timeout after few seconds, instead > of several minutes. > What I did here, is to calculate the TCP timeout as if the schedule were > still an exponential backoff, given the allowed number of > retransmissions. It is possible, that TCP will now retransmit many more > times than tcp_retries indicates, but the duration until the TCP > connection times out, is more or less equivalent to unpatched TCP with > same tcp_retries values. > So, whenever some control block uses count values, but means time > values, I have to replace the control block appropriately; and that's > what retrans_overstepped() is supposed to do. I can understand your point, however, this is independent thing which should not be mixed with the other part of the change. So in future provide them as a two patch series. ...And expect some to not like this kind of change. -- i. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] revert TCP retransmission backoff on ICMP destination unreachable 2009-08-18 22:07 ` Ilpo Järvinen @ 2009-08-18 23:56 ` Damian Lukowski 2009-08-19 10:55 ` Ilpo Järvinen 0 siblings, 1 reply; 9+ messages in thread From: Damian Lukowski @ 2009-08-18 23:56 UTC (permalink / raw) To: Ilpo Järvinen; +Cc: Netdev, David Miller > On Tue, 18 Aug 2009, Damian Lukowski wrote: > >>> On Fri, 14 Aug 2009, Damian Lukowski wrote: >>> >>>>> Longer than 80 columns, and use an inline function instead >>>>> of a macro in order to get proper type checking. >>>>> [...] >>>>> Do not break up the function local variables with spurious new lines >>>>> like this, please. >>>>> [...] >>>>> The indentation and tabbing is messed up in all of the code you are >>>>> adding, please fix it up to be consistent with the surrounding code >>>>> and the rest of the TCP stack. >>>>> >>>>> Do not use C++ style // comments. >>>> Better? >>> Please, include the changelog message on resubmits too next time. >> I'm sorry, which message do you mean? I used plain diff without GIT or >> anything. > > In the original mail you had some description about the patch, I was > thinking of quoting something out of it but it was missing so I didn't > bother copying. But more importantly, would Dave Miller want to apply the > patch of yours, he would need that description and it saves work for him > to have the description in every resubmit "duplicated" (one usually just > gets a boomerang mail asking for the description which is similarly > wasting his time). Ok, I will fully quote previous messages now. Just wanted to avoid redundancy. Anyway, as long as the draft is not accepted as RFC, the patch will violate current standards, so I doubt, Dave would like to apply it soon. :) > >>>> + icsk->icsk_backoff--; >>>> + icsk->icsk_rto >>= 1; >>> Are you absolute sure that we don't go to zero here when phase of the >>> moon is just right? I'd put a max(..., 1) guard there. >> Well, the retransmission timer doubles icsk_rto whenever it increases >> icsk_backoff, so we should reach the original icsk_rto, when >> icsk_backoff becomes zero. > > This is a false assumption. Take a look into the formula, there's a > cap how far icsk_rto goes but icsk_backoff keeps increasing. In fact, > there would probably be a need to set some other lower bound than 1, > TCP_RTO_MIN would probably be more appropriate for this. *sigh* I should have seen that. I would like to preserve the initial rto value and not work with RTO_MIN. First, I have thought about storing the initial rto in another state variable. But what, if we simply recalculate the value?: ### icsk->icsk_backoff--; - icsk->icsk_rto >>= 1; + tcp_set_rto(sk); + icsk->icsk_rto <<= icsk->icsk_backoff; + tcp_bound_rto(sk); ### >>>> + BUG_ON(!skb_r); >>>> + >>>> + if (sock_owned_by_user(sk)) { >>>> + /* Deferring retransmission clocked by ICMP >>>> + * due to locked socket. */ >>>> + inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, >>>> + min(icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL), >>> ??? (besides having indent wrong). What makes you think HZ/2 is right >>> value if icsk_rto is large? >> To be honest, I have taken over the code from my predecessor and didn't >> question this part. > > But you did Signoff that regardless, so you should know ;-). Only because checkpatch.pl didn't like it otherwise. :) > >> When thinking about it, I would remove this block >> completely. Will bad things happen, if I mess with the timer, when the >> socket is locked? > > I've no idea without looking into all the details but I think this needs > to be moved slightly later anyway into the imminent expiry branch and the > timeout value needs to be copied from tcp_write_timer instead of using > that bogus formula. > >>>> + TCP_RTO_MAX); >>> Perhaps you lack here something to exit? >>> >>>> + } >>>> + >>>> + if (tcp_time_stamp - TCP_SKB_CB(skb_r)->when > >>>> + inet_csk(sk)->icsk_rto) { >>> icsk->icsk_rto !?! >>> >>>> + /* RTO revert clocked out retransmission. */ >>>> + tcp_retransmit_skb(sk, skb_r); >>> This is plain wrong, tcp_sock counters will get messed up if >>> TCPCB_SACKED_RETRANS is already set while calling tcp_retransmit_skb. As >>> a sidenote, it would be probably useful to move the check + clear of >>> that bit before doing the retransmission to some common place one day. >> What, if I call tcp_retransmit_timer() manually instead? Will there >> still be problems with SACK? > > SACK is fine then, and I took a quick look into the other stuff which > seemed to be quite ok too but it was just a quick glance, but then you > certainly must delay the RTO if sock_owned_by_user(sk) is true like > tcp_write_timer does. Tcp_retransmit_timer would also deal with the > missing step 5.5 btw. I'm not sure if you like all the things that will > happen in the tcp_enter_loss though, especially the undo_marker clearing > might not please you, or is at best, sub-optimal in a certain scenario. > >>>> + inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, >>>> + icsk->icsk_rto, TCP_RTO_MAX); >>> RFC 2988, step 5.5 missing? Have you verified this patch for real? >> Thats the whole point of the draft. If consecutive retransmissions >> trigger fitting ICMPs, the RTO value is not doubled, but remains >> constant. > > No, either the draft is wrong or the implementation. 5.5 step should be > applied. Now you end up decreasing RTO per icmp which causes imminent > retransmission instead of keeping it "constant". Effectively, step 4, 5, 6 > (halves like you do), 7 -> R, R is 5.4-5.6 and 5.5 increases RTO again > because the timer _expired_ here even though we did short-circuit it > instead of setting a zero timer and using the normal procedure (which in > itself is ok approach, no need to force timer to expire). > >> So you can have a retransmission schedule of >> t, t+r, t+2r, t+3r, t+4r, t+5r ... >> instead of >> t, t+r, t+3r, t+7r, t+15r, t+31r ... > > Sure, but that's not what happens, instead you will get this series: > > t, t+r, (t+r)/2, (t+r)/4, ... > > ...disclaimer: it might be off-by-something somewhere but I hope you get > my point anyway. It seems to me, that we focus on different things. From the TCP code's point of view, RFC2988 step 5.5 is of course executed, because timeouts still trigger tcp_retransmit_timer(), which doubles the RTO at the end. But from the network's point of view, there is no doubling, but constant intervals between retransmissions. The following happens with patched TCP (not considering RTO_MAX): t: Connectivity breaks, rto is r t+r: First RTO fires, SND.UNA is retransmitted by tcp_retransmit_timer() and rto := 2rto Now it is rto == 2r t+r+epsilon: ICMP unreach arrives from a router. The patch sets rto := rto/2 Now it is rto == r t+2r: Second RTO fires, SND.UNA is retransmitted by tcp_retransmit_timer() and rto := 2rto Now it is rto == 2r Assume, that this retransmission did not trigger any ICMP message. t+4r: Third RTO fires, SND.UNA is retransmitted by tcp_retransmit_timer() and rto := 2rto Now it is rto == 4r Assume, that this retransmission did not trigger any ICMP message. t+8r: Fourth RTO fires, SND.UNA is retransmitted by tcp_retransmit_timer() and rto := 2rto Now it is rto == 8r t+8r+epsilon: ICMP unreach arrives from a router. The patch sets rto := rto/2 Now it is rto == 4r t+12r: Fifth RTO fires ... >> I can show you trace plots of patched an unpatched TCP, if you like. >> Should I attach files in the mails, or better post hyperlinks to them? > > Sure, but they must not be from a trivial case but the rto must > "flunctuate" How should they fluctuate? RTO is doubled by tcp_retransmit_timer() no matter what happens, and it is halved by the the patch, if an appropriate ICMP packet has been received. > and the expirys must be using the "imminent" branch for it > to count for me as a proof that you did it right. I'm not sure what you mean. Still, hyperlinks or attachments? > >>> How about: >>> >>> u32 remaining; >>> >>> remaining = icsk->icsk_rto - min(icsk->icsk_rto, >>> tcp_time_stamp - TCP_SKB_CB(skb_r)->when)); >>> if (!remaining) { >>> tcp_retransmit_skb(...); >>> remaining = icsk->icsk_rto; >>> } >>> inet_csk_reset_xmit_timer(..., remaining); >> Seems to be ok, but isn't tcp_retransmit_timer(...) better? > > Yeah, that's mostly fine I guess (at least in meaning of "safe"). > >>>> - if (icsk->icsk_retransmits >= sysctl_tcp_retries1) { >>>> + if (retrans_overstepped(sk, sysctl_tcp_retries1)) { >>> ??? Could you justify these changes and explain their relation to the >>> draft and icmp messages? If not, please drop them and try with proper >>> description and description for them in a separate patch. I fail to see >>> what you're trying to achieve here. You just ended up redefining the >>> sysctl meaning too I think... >> Well, you're quite right. RFCs specify timeouts in dimensions of time >> (seconds, minutes, etc.), but Linux specifies timeouts in form of >> numbers of retransmissions. One can do this, as long as the >> retransmission schedule (exponential backoff in RFC case) is known. But >> with this patch, the schedule can be completely different. In the >> "worst" case, the time intervals between consecutive retransmission >> probes are constant, leading to a TCP timeout after few seconds, instead >> of several minutes. >> What I did here, is to calculate the TCP timeout as if the schedule were >> still an exponential backoff, given the allowed number of >> retransmissions. It is possible, that TCP will now retransmit many more >> times than tcp_retries indicates, but the duration until the TCP >> connection times out, is more or less equivalent to unpatched TCP with >> same tcp_retries values. >> So, whenever some control block uses count values, but means time >> values, I have to replace the control block appropriately; and that's >> what retrans_overstepped() is supposed to do. > > I can understand your point, however, this is independent thing which > should not be mixed with the other part of the change. So in future > provide them as a two patch series. ...And expect some to not like this > kind of change. Ok. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] revert TCP retransmission backoff on ICMP destination unreachable 2009-08-18 23:56 ` Damian Lukowski @ 2009-08-19 10:55 ` Ilpo Järvinen 0 siblings, 0 replies; 9+ messages in thread From: Ilpo Järvinen @ 2009-08-19 10:55 UTC (permalink / raw) To: Damian Lukowski; +Cc: Netdev, David Miller On Wed, 19 Aug 2009, Damian Lukowski wrote: > Ok, I will fully quote previous messages now. Just wanted to avoid > redundancy. Full quoting is not what I meant, Dave explained what I meant in his mail. > Anyway, as long as the draft is not accepted as RFC, the > patch will violate current standards, so I doubt, Dave would like to > apply it soon. :) ...Don't be too sure on that ;-). > >>>> + icsk->icsk_backoff--; > >>>> + icsk->icsk_rto >>= 1; > >>> Are you absolute sure that we don't go to zero here when phase of the > >>> moon is just right? I'd put a max(..., 1) guard there. > >> Well, the retransmission timer doubles icsk_rto whenever it increases > >> icsk_backoff, so we should reach the original icsk_rto, when > >> icsk_backoff becomes zero. > > > > This is a false assumption. Take a look into the formula, there's a > > cap how far icsk_rto goes but icsk_backoff keeps increasing. In fact, > > there would probably be a need to set some other lower bound than 1, > > TCP_RTO_MIN would probably be more appropriate for this. > > *sigh* I should have seen that. I would like to preserve the initial rto > value and not work with RTO_MIN. First, I have thought about storing the > initial rto in another state variable. But what, if we simply > recalculate the value?: > ### > icsk->icsk_backoff--; > - icsk->icsk_rto >>= 1; > + tcp_set_rto(sk); > + icsk->icsk_rto <<= icsk->icsk_backoff; > + tcp_bound_rto(sk); > ### Yes, this is sort of a problem. I'd recommend that you read through all icsk_rto readers and writers and see if recalculation can be done without getting into some problems. > >>>> + BUG_ON(!skb_r); > >>>> + > >>>> + if (sock_owned_by_user(sk)) { > >>>> + /* Deferring retransmission clocked by ICMP > >>>> + * due to locked socket. */ > >>>> + inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, > >>>> + min(icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL), > >>> ??? (besides having indent wrong). What makes you think HZ/2 is right > >>> value if icsk_rto is large? > >> To be honest, I have taken over the code from my predecessor and didn't > >> question this part. > > > > But you did Signoff that regardless, so you should know ;-). > > Only because checkpatch.pl didn't like it otherwise. :) ...And you dare to admit :-) ...It certainly has some other meaning beyond checkpatch complain, for future, please check and understand what it means that you add it there. > >>>> + inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, > >>>> + icsk->icsk_rto, TCP_RTO_MAX); > >>> RFC 2988, step 5.5 missing? Have you verified this patch for real? > >> Thats the whole point of the draft. If consecutive retransmissions > >> trigger fitting ICMPs, the RTO value is not doubled, but remains > >> constant. > > > > No, either the draft is wrong or the implementation. 5.5 step should be > > applied. Now you end up decreasing RTO per icmp which causes imminent > > retransmission instead of keeping it "constant". Effectively, step 4, 5, 6 > > (halves like you do), 7 -> R, R is 5.4-5.6 and 5.5 increases RTO again > > because the timer _expired_ here even though we did short-circuit it > > instead of setting a zero timer and using the normal procedure (which in > > itself is ok approach, no need to force timer to expire). > > > >> So you can have a retransmission schedule of > >> t, t+r, t+2r, t+3r, t+4r, t+5r ... > >> instead of > >> t, t+r, t+3r, t+7r, t+15r, t+31r ... > > > > Sure, but that's not what happens, instead you will get this series: > > > > t, t+r, (t+r)/2, (t+r)/4, ... > > > > ...disclaimer: it might be off-by-something somewhere but I hope you get > > my point anyway. > > It seems to me, that we focus on different things. From the TCP code's > point of view, RFC2988 step 5.5 is of course executed, because timeouts > still trigger tcp_retransmit_timer(), which doubles the RTO at the end. No and yes. For no: you artificially made tcp_retransmit_timer to not be called whenever the ICMP arrives at a point where RTO/2 step causes the timer to expire right away. But yes: if you change it to call tcp_retransmit_timer as we have discussed, 5.5 should be there (but that's not in original patch which I was commenting). > But from the network's point of view, there is no doubling, but constant > intervals between retransmissions. Plain wrong, you would only end up halving as per icmp when the timing is right to produce the scenario I'm talking. Please show from the original patch > The following happens with patched TCP (not considering RTO_MAX): > t: Connectivity breaks, rto is r > t+r: First RTO fires, SND.UNA is retransmitted by > tcp_retransmit_timer() and rto := 2rto > Now it is rto == 2r > t+r+epsilon: ICMP unreach arrives from a router. > The patch sets rto := rto/2 > Now it is rto == r > t+2r: Second RTO fires, SND.UNA is retransmitted by > tcp_retransmit_timer() and rto := 2rto > Now it is rto == 2r > Assume, that this retransmission did not trigger > any ICMP message. > t+4r: Third RTO fires, SND.UNA is retransmitted by > tcp_retransmit_timer() and rto := 2rto > Now it is rto == 4r > Assume, that this retransmission did not trigger > any ICMP message. > t+8r: Fourth RTO fires, SND.UNA is retransmitted by > tcp_retransmit_timer() and rto := 2rto > Now it is rto == 8r > t+8r+epsilon: ICMP unreach arrives from a router. > The patch sets rto := rto/2 > Now it is rto == 4r > t+12r: Fifth RTO fires ... No, that was not the scenario I'm talking about. Let me try to explain: > t: Connectivity breaks, rto is r > t+r: First RTO fires, SND.UNA is retransmitted by > tcp_retransmit_timer() and rto := 2rto > Now it is rto == 2r t+2r+epsilon: ICMP unreach arrives from a router. > The patch sets rto := rto/2 t+2r+epsilon > t+r+rto => call for retransmit in the icmp code set rto = r t+5/2r+epsilon ICMP unreach arrives from a router. The patch sets rto := rto/2 t+5/2r+epsilon > t+something => call for retransmit set rto = r/2 ... See, it's a downward spiral because nothing in the original patch takes care of the rto*2 (ie., step 5.5) when RTO is artificially _caused by_ the ICMP (ie., the halving does make expiry to happen right away, see draft step 7). > >> I can show you trace plots of patched an unpatched TCP, if you like. > >> Should I attach files in the mails, or better post hyperlinks to them? > > > > Sure, but they must not be from a trivial case but the rto must > > "flunctuate" > > How should they fluctuate? RTO is doubled by tcp_retransmit_timer() no > matter what happens, and it is halved by the the patch, if an > appropriate ICMP packet has been received. The original patch doesn't agree. If you'd have used tcp_retransmit_timer in the original I'd agree but you called for tcp_retransmit_skb(skb_r) and set the next rto yourself which wouldn't let tcp_retransmit_timer to run at all in the scenario I was talking about. But lets see the next version of the patch first and only then continue the discussion on this particular point about 5.5 (probably not then necessary anymore because of a change we've discussed). > > and the expirys must be using the "imminent" branch for it > > to count for me as a proof that you did it right. > > I'm not sure what you mean. The case I'm talking about is the branch where we notice that after doing rto/2 the RTO is to be scheduled right away (see step 7 in the draft). > Still, hyperlinks or attachments? Either is fine, but large stuff as links. However, honestly I'm not that interest in them as I know that if you have the traces from the particular case I'm talking about, they would show the misbehavior (other cases work but they do not interest me at all). It would have been mostly just to convince you that this misbehavior is there, since my words didn't seem to make it :-). I propose we put this "proving" aside too for now since the changes we've already discussed should make this a non-issue anyway, and therefore no need to waste time on that. -- i. ^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH] revert TCP retransmission backoff on ICMP destination unreachable 2009-08-18 17:40 ` Damian Lukowski 2009-08-18 22:07 ` Ilpo Järvinen @ 2009-08-19 6:29 ` David Miller 1 sibling, 0 replies; 9+ messages in thread From: David Miller @ 2009-08-19 6:29 UTC (permalink / raw) To: damian; +Cc: ilpo.jarvinen, netdev From: Damian Lukowski <damian@tvk.rwth-aachen.de> Date: Tue, 18 Aug 2009 19:40:01 +0200 >> On Fri, 14 Aug 2009, Damian Lukowski wrote: >> >>>> Longer than 80 columns, and use an inline function instead >>>> of a macro in order to get proper type checking. >>>> [...] >>>> Do not break up the function local variables with spurious new lines >>>> like this, please. >>>> [...] >>>> The indentation and tabbing is messed up in all of the code you are >>>> adding, please fix it up to be consistent with the surrounding code >>>> and the rest of the TCP stack. >>>> >>>> Do not use C++ style // comments. >>> >>> Better? >> >> Please, include the changelog message on resubmits too next time. > > I'm sorry, which message do you mean? I used plain diff without GIT or > anything. He means the commit message. When you fix up patches, don't just reply and say "here's the fixed up patch". That doesn't work. We need the full context, the full commit message, and your signoffs, when you submit any patch you intend us to consider applying. ^ permalink raw reply [flat|nested] 9+ messages in thread
end of thread, other threads:[~2009-08-19 10:55 UTC | newest] Thread overview: 9+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2009-08-11 11:27 [PATCH] revert TCP retransmission backoff on ICMP destination unreachable Damian Lukowski 2009-08-13 23:08 ` David Miller 2009-08-14 12:08 ` Damian Lukowski 2009-08-18 13:45 ` Ilpo Järvinen 2009-08-18 17:40 ` Damian Lukowski 2009-08-18 22:07 ` Ilpo Järvinen 2009-08-18 23:56 ` Damian Lukowski 2009-08-19 10:55 ` Ilpo Järvinen 2009-08-19 6:29 ` David Miller
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).