[PATCH] revert TCP retransmission backoff on ICMP destination unreachable

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH] revert TCP retransmission backoff on ICMP destination unreachable
@ 2009-08-11 11:27 Damian Lukowski
  2009-08-13 23:08 ` David Miller
  0 siblings, 1 reply; 9+ messages in thread
From: Damian Lukowski @ 2009-08-11 11:27 UTC (permalink / raw)
  To: netdev

[-- Attachment #1: Type: text/plain, Size: 1276 bytes --]

This patch is a proof-of-concept related to the draft "Make TCP more
Robust to Long Connectivity Disruptions"
(http://tools.ietf.org/html/draft-zimmermann-tcp-lcd-01).

Exponential backoff is TCP's standard behaviour during long connectivity
disruptions, which is a countermeasure against network congestion.
If congestion can be excluded as the reason for RTO retransmission loss,
backoff is not desirable, as it yields longer TCP recovery times, when
the communication path is repaired shortly after an unsuccessful
retransmission probe.
Here, an ICMP host/network unreachable message, whose payload fits to
TCPs SND.UNA, is taken as an indication that the RTO retransmission has
not been lost due to congestion, but because of a route failure
somewhere along the path. On receipt of such a message, the timeout is
halved (in order to revert to doubling after the retransmission).
With true congestion, a router won't trigger such a message and the
patched TCP will operate as standard TCP.

Comments are appreciated, especially regarding eventual side effects
with the zero window probing mechanism.
In Linux, the RTO retransmisson mechanism is coupled to the zero window
probing mechanism, but is not intended to be changed. I hope I have not
missed something, here.

---

[-- Attachment #2: TCP_ICMP_2.6.30.4.patch --]
[-- Type: text/plain, Size: 4322 bytes --]

diff -Naur linux-2.6.30.4-vanilla/include/net/tcp.h linux-2.6.30.4-tcp-icmp/include/net/tcp.h
--- linux-2.6.30.4-vanilla/include/net/tcp.h	2009-07-31 00:34:47.000000000 +0200
+++ linux-2.6.30.4-tcp-icmp/include/net/tcp.h	2009-08-11 11:28:23.162009835 +0200
@@ -1220,6 +1220,8 @@
 #define tcp_for_write_queue_from_safe(skb, tmp, sk)			\
 	skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
 
+#define retrans_overstepped(sk, boundary) (inet_csk(sk)->icsk_retransmits && (tcp_time_stamp - tcp_sk(sk)->retrans_stamp) >= TCP_RTO_MIN*(2 << boundary))
+
 static inline struct sk_buff *tcp_send_head(struct sock *sk)
 {
 	return sk->sk_send_head;
diff -Naur linux-2.6.30.4-vanilla/net/ipv4/tcp_ipv4.c linux-2.6.30.4-tcp-icmp/net/ipv4/tcp_ipv4.c
--- linux-2.6.30.4-vanilla/net/ipv4/tcp_ipv4.c	2009-07-31 00:34:47.000000000 +0200
+++ linux-2.6.30.4-tcp-icmp/net/ipv4/tcp_ipv4.c	2009-08-11 11:28:33.572009505 +0200
@@ -332,11 +332,14 @@
 {
 	struct iphdr *iph = (struct iphdr *)skb->data;
 	struct tcphdr *th = (struct tcphdr *)(skb->data + (iph->ihl << 2));
+
+	struct inet_connection_sock *icsk;
 	struct tcp_sock *tp;
 	struct inet_sock *inet;
 	const int type = icmp_hdr(skb)->type;
 	const int code = icmp_hdr(skb)->code;
 	struct sock *sk;
+	struct sk_buff *skb_r;
 	__u32 seq;
 	int err;
 	struct net *net = dev_net(skb->dev);
@@ -367,6 +370,7 @@
 	if (sk->sk_state == TCP_CLOSE)
 		goto out;
 
+	icsk = inet_csk(sk);
 	tp = tcp_sk(sk);
 	seq = ntohl(th->seq);
 	if (sk->sk_state != TCP_LISTEN &&
@@ -393,6 +397,34 @@
 		}
 
 		err = icmp_err_convert[code].errno;
+
+		/* check if ICMP unreachable messages allow revert of back-off */
+                if ((code != ICMP_NET_UNREACH && code != ICMP_HOST_UNREACH) || seq != tp->snd_una 
+		|| !icsk->icsk_retransmits || !icsk->icsk_backoff) break;
+
+                icsk->icsk_backoff--;
+                icsk->icsk_rto >>= 1;
+
+                skb_r = skb_peek(&sk->sk_write_queue);
+                BUG_ON(!skb_r);
+
+		if (sock_owned_by_user(sk)) {
+                        // Deferring retransmission clocked by ICMP due to locked socket.
+                        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
+                                        min(icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL),
+                                        TCP_RTO_MAX);
+                }
+
+                if (tcp_time_stamp - TCP_SKB_CB(skb_r)->when > inet_csk(sk)->icsk_rto) {
+                        // RTO revert clocked out retransmission.
+                        tcp_retransmit_skb(sk, skb_r);
+                        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
+                } else {
+                        //RTO revert shortened timer.
+                        inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
+                            icsk->icsk_rto - (tcp_time_stamp - TCP_SKB_CB(skb_r)->when), TCP_RTO_MAX);
+                }
+
 		break;
 	case ICMP_TIME_EXCEEDED:
 		err = EHOSTUNREACH;
diff -Naur linux-2.6.30.4-vanilla/net/ipv4/tcp_timer.c linux-2.6.30.4-tcp-icmp/net/ipv4/tcp_timer.c
--- linux-2.6.30.4-vanilla/net/ipv4/tcp_timer.c	2009-07-31 00:34:47.000000000 +0200
+++ linux-2.6.30.4-tcp-icmp/net/ipv4/tcp_timer.c	2009-08-11 11:28:33.557009931 +0200
@@ -143,7 +143,7 @@
 			dst_negative_advice(&sk->sk_dst_cache);
 		retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
 	} else {
-		if (icsk->icsk_retransmits >= sysctl_tcp_retries1) {
+		if (retrans_overstepped(sk, sysctl_tcp_retries1)) {
 			/* Black hole detection */
 			tcp_mtu_probing(icsk, sk);
 
@@ -156,12 +156,12 @@
 
 			retry_until = tcp_orphan_retries(sk, alive);
 
-			if (tcp_out_of_resources(sk, alive || icsk->icsk_retransmits < retry_until))
+			if (tcp_out_of_resources(sk, alive || !retrans_overstepped(sk, retry_until)))
 				return 1;
 		}
 	}
 
-	if (icsk->icsk_retransmits >= retry_until) {
+	if (retrans_overstepped(sk, retry_until)) {
 		/* Has it gone just too far? */
 		tcp_write_err(sk);
 		return 1;
@@ -385,7 +385,7 @@
 out_reset_timer:
 	icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
 	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
-	if (icsk->icsk_retransmits > sysctl_tcp_retries1)
+	if (retrans_overstepped(sk, sysctl_tcp_retries1))
 		__sk_dst_reset(sk);
 
 out:;

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] revert TCP retransmission backoff on ICMP destination unreachable
  2009-08-11 11:27 [PATCH] revert TCP retransmission backoff on ICMP destination unreachable Damian Lukowski
@ 2009-08-13 23:08 ` David Miller
  2009-08-14 12:08   ` Damian Lukowski
  0 siblings, 1 reply; 9+ messages in thread
From: David Miller @ 2009-08-13 23:08 UTC (permalink / raw)
  To: damian; +Cc: netdev

From: Damian Lukowski <damian@tvk.rwth-aachen.de>
Date: Tue, 11 Aug 2009 13:27:42 +0200

> @@ -1220,6 +1220,8 @@
>  #define tcp_for_write_queue_from_safe(skb, tmp, sk)			\
>  	skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
>  
> +#define retrans_overstepped(sk, boundary) (inet_csk(sk)->icsk_retransmits && (tcp_time_stamp - tcp_sk(sk)->retrans_stamp) >= TCP_RTO_MIN*(2 << boundary))
> +

Longer than 80 columns, and use an inline function instead
of a macro in order to get proper type checking.

> @@ -332,11 +332,14 @@
>  {
>  	struct iphdr *iph = (struct iphdr *)skb->data;
>  	struct tcphdr *th = (struct tcphdr *)(skb->data + (iph->ihl << 2));
> +
> +	struct inet_connection_sock *icsk;
>  	struct tcp_sock *tp;
>  	struct inet_sock *inet;

Do not break up the function local variables with spurious new lines
like this, please.

>  	const int type = icmp_hdr(skb)->type;
>  	const int code = icmp_hdr(skb)->code;
>  	struct sock *sk;
> +	struct sk_buff *skb_r;
>  	__u32 seq;
>  	int err;
>  	struct net *net = dev_net(skb->dev);
> @@ -367,6 +370,7 @@
>  	if (sk->sk_state == TCP_CLOSE)
>  		goto out;
>  
> +	icsk = inet_csk(sk);
>  	tp = tcp_sk(sk);
>  	seq = ntohl(th->seq);
>  	if (sk->sk_state != TCP_LISTEN &&
> @@ -393,6 +397,34 @@
>  		}
>  
>  		err = icmp_err_convert[code].errno;
> +
> +		/* check if ICMP unreachable messages allow revert of back-off */
> +                if ((code != ICMP_NET_UNREACH && code != ICMP_HOST_UNREACH) || seq != tp->snd_una 
> +		|| !icsk->icsk_retransmits || !icsk->icsk_backoff) break;
> +
> +                icsk->icsk_backoff--;
> +                icsk->icsk_rto >>= 1;
> +
> +                skb_r = skb_peek(&sk->sk_write_queue);
> +                BUG_ON(!skb_r);
> +

The indentation and tabbing is messed up in all of the code you are
adding, please fix it up to be consistent with the surrounding code
and the rest of the TCP stack.

Do not use C++ style // comments.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] revert TCP retransmission backoff on ICMP destination unreachable
  2009-08-13 23:08 ` David Miller
@ 2009-08-14 12:08   ` Damian Lukowski
  2009-08-18 13:45     ` Ilpo Järvinen
  0 siblings, 1 reply; 9+ messages in thread
From: Damian Lukowski @ 2009-08-14 12:08 UTC (permalink / raw)
  To: David Miller; +Cc: netdev

[-- Attachment #1: Type: text/plain, Size: 445 bytes --]

> Longer than 80 columns, and use an inline function instead
> of a macro in order to get proper type checking.
> [...]
> Do not break up the function local variables with spurious new lines
> like this, please.
> [...]
> The indentation and tabbing is messed up in all of the code you are
> adding, please fix it up to be consistent with the surrounding code
> and the rest of the TCP stack.
> 
> Do not use C++ style // comments.

Better?

--

[-- Attachment #2: TCP_ICMP_2.6.30.4.patch --]
[-- Type: text/plain, Size: 4105 bytes --]

Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>
diff -Naur linux-2.6.30.4/include/net/tcp.h linux-2.6.30.4-tcp-icmp/include/net/tcp.h
--- linux-2.6.30.4/include/net/tcp.h	2009-07-31 00:34:47.000000000 +0200
+++ linux-2.6.30.4-tcp-icmp/include/net/tcp.h	2009-08-14 12:18:30.846060685 +0200
@@ -1220,6 +1220,14 @@
 #define tcp_for_write_queue_from_safe(skb, tmp, sk)			\
 	skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
 
+static inline bool retrans_overstepped(const struct sock *sk,
+					unsigned int boundary)
+{
+	return 	inet_csk(sk)->icsk_retransmits &&
+		(tcp_time_stamp - tcp_sk(sk)->retrans_stamp) >=
+			TCP_RTO_MIN*(2 << boundary);
+}
+
 static inline struct sk_buff *tcp_send_head(struct sock *sk)
 {
 	return sk->sk_send_head;
diff -Naur linux-2.6.30.4/net/ipv4/tcp_ipv4.c linux-2.6.30.4-tcp-icmp/net/ipv4/tcp_ipv4.c
--- linux-2.6.30.4/net/ipv4/tcp_ipv4.c	2009-07-31 00:34:47.000000000 +0200
+++ linux-2.6.30.4-tcp-icmp/net/ipv4/tcp_ipv4.c	2009-08-14 13:19:48.841598908 +0200
@@ -332,11 +332,13 @@
 {
 	struct iphdr *iph = (struct iphdr *)skb->data;
 	struct tcphdr *th = (struct tcphdr *)(skb->data + (iph->ihl << 2));
+	struct inet_connection_sock *icsk;
 	struct tcp_sock *tp;
 	struct inet_sock *inet;
 	const int type = icmp_hdr(skb)->type;
 	const int code = icmp_hdr(skb)->code;
 	struct sock *sk;
+	struct sk_buff *skb_r;
 	__u32 seq;
 	int err;
 	struct net *net = dev_net(skb->dev);
@@ -367,6 +369,7 @@
 	if (sk->sk_state == TCP_CLOSE)
 		goto out;
 
+	icsk = inet_csk(sk);
 	tp = tcp_sk(sk);
 	seq = ntohl(th->seq);
 	if (sk->sk_state != TCP_LISTEN &&
@@ -393,6 +396,41 @@
 		}
 
 		err = icmp_err_convert[code].errno;
+		/* check if ICMP unreachable messages allow revert of backoff */
+		if ((code != ICMP_NET_UNREACH && code != ICMP_HOST_UNREACH) ||
+		    seq != tp->snd_una  || !icsk->icsk_retransmits ||
+		    !icsk->icsk_backoff)
+			break;
+
+		icsk->icsk_backoff--;
+		icsk->icsk_rto >>= 1;
+
+		skb_r = skb_peek(&sk->sk_write_queue);
+		BUG_ON(!skb_r);
+
+		if (sock_owned_by_user(sk)) {
+			/* Deferring retransmission clocked by ICMP
+			 * due to locked socket. */
+			inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
+			min(icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL),
+			TCP_RTO_MAX);
+		}
+
+		if (tcp_time_stamp - TCP_SKB_CB(skb_r)->when >
+		    inet_csk(sk)->icsk_rto) {
+			/* RTO revert clocked out retransmission. */
+			tcp_retransmit_skb(sk, skb_r);
+			inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
+						  icsk->icsk_rto, TCP_RTO_MAX);
+		} else {
+			/* RTO revert shortened timer. */
+			inet_csk_reset_xmit_timer(
+				sk, ICSK_TIME_RETRANS,
+				icsk->icsk_rto-
+				(tcp_time_stamp-TCP_SKB_CB(skb_r)->when),
+				TCP_RTO_MAX);
+		}
+
 		break;
 	case ICMP_TIME_EXCEEDED:
 		err = EHOSTUNREACH;
diff -Naur linux-2.6.30.4/net/ipv4/tcp_timer.c linux-2.6.30.4-tcp-icmp/net/ipv4/tcp_timer.c
--- linux-2.6.30.4/net/ipv4/tcp_timer.c	2009-07-31 00:34:47.000000000 +0200
+++ linux-2.6.30.4-tcp-icmp/net/ipv4/tcp_timer.c	2009-08-14 13:22:18.068666329 +0200
@@ -143,7 +143,7 @@
 			dst_negative_advice(&sk->sk_dst_cache);
 		retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
 	} else {
-		if (icsk->icsk_retransmits >= sysctl_tcp_retries1) {
+		if (retrans_overstepped(sk, sysctl_tcp_retries1)) {
 			/* Black hole detection */
 			tcp_mtu_probing(icsk, sk);
 
@@ -156,12 +156,14 @@
 
 			retry_until = tcp_orphan_retries(sk, alive);
 
-			if (tcp_out_of_resources(sk, alive || icsk->icsk_retransmits < retry_until))
+			if (tcp_out_of_resources(
+				sk, alive ||
+				    !retrans_overstepped(sk, retry_until)))
 				return 1;
 		}
 	}
 
-	if (icsk->icsk_retransmits >= retry_until) {
+	if (retrans_overstepped(sk, retry_until)) {
 		/* Has it gone just too far? */
 		tcp_write_err(sk);
 		return 1;
@@ -385,7 +387,7 @@
 out_reset_timer:
 	icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
 	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
-	if (icsk->icsk_retransmits > sysctl_tcp_retries1)
+	if (retrans_overstepped(sk, sysctl_tcp_retries1))
 		__sk_dst_reset(sk);
 
 out:;

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] revert TCP retransmission backoff on ICMP destination unreachable
  2009-08-14 12:08   ` Damian Lukowski
@ 2009-08-18 13:45     ` Ilpo Järvinen
  2009-08-18 17:40       ` Damian Lukowski
  0 siblings, 1 reply; 9+ messages in thread
From: Ilpo Järvinen @ 2009-08-18 13:45 UTC (permalink / raw)
  To: Damian Lukowski; +Cc: David Miller, Netdev

On Fri, 14 Aug 2009, Damian Lukowski wrote:

>> Longer than 80 columns, and use an inline function instead
>> of a macro in order to get proper type checking.
>> [...]
>> Do not break up the function local variables with spurious new lines
>> like this, please.
>> [...]
>> The indentation and tabbing is messed up in all of the code you are
>> adding, please fix it up to be consistent with the surrounding code
>> and the rest of the TCP stack.
>>
>> Do not use C++ style // comments.
>
> Better?

Please, include the changelog message on resubmits too next time.

> Signed-off-by: Damian Lukowski <damian@tvk.rwth-aachen.de>


> diff -Naur linux-2.6.30.4/include/net/tcp.h linux-2.6.30.4-tcp-icmp/include/net/tcp.h
> --- linux-2.6.30.4/include/net/tcp.h	2009-07-31 00:34:47.000000000 +0200
> +++ linux-2.6.30.4-tcp-icmp/include/net/tcp.h	2009-08-14 12:18:30.846060685 +0200
> @@ -1220,6 +1220,14 @@
>  #define tcp_for_write_queue_from_safe(skb, tmp, sk)			\
>  	skb_queue_walk_from_safe(&(sk)->sk_write_queue, skb, tmp)
> 
> +static inline bool retrans_overstepped(const struct sock *sk,
> +					unsigned int boundary)
> +{
> +	return 	inet_csk(sk)->icsk_retransmits &&
> +		(tcp_time_stamp - tcp_sk(sk)->retrans_stamp) >=
> +			TCP_RTO_MIN*(2 << boundary);
> +}
> +
>  static inline struct sk_buff *tcp_send_head(struct sock *sk)
>  {
>  	return sk->sk_send_head;
> diff -Naur linux-2.6.30.4/net/ipv4/tcp_ipv4.c linux-2.6.30.4-tcp-icmp/net/ipv4/tcp_ipv4.c
> --- linux-2.6.30.4/net/ipv4/tcp_ipv4.c	2009-07-31 00:34:47.000000000 +0200
> +++ linux-2.6.30.4-tcp-icmp/net/ipv4/tcp_ipv4.c	2009-08-14 13:19:48.841598908 +0200
> @@ -332,11 +332,13 @@
>  {
>  	struct iphdr *iph = (struct iphdr *)skb->data;
>  	struct tcphdr *th = (struct tcphdr *)(skb->data + (iph->ihl << 2));
> +	struct inet_connection_sock *icsk;
>  	struct tcp_sock *tp;
>  	struct inet_sock *inet;
>  	const int type = icmp_hdr(skb)->type;
>  	const int code = icmp_hdr(skb)->code;
>  	struct sock *sk;
> +	struct sk_buff *skb_r;

I'd make this called "skb", and change the old skb to icmp_skb.

>  	__u32 seq;
>  	int err;
>  	struct net *net = dev_net(skb->dev);
> @@ -367,6 +369,7 @@
>  	if (sk->sk_state == TCP_CLOSE)
>  		goto out;
> 
> +	icsk = inet_csk(sk);
>  	tp = tcp_sk(sk);
>  	seq = ntohl(th->seq);
>  	if (sk->sk_state != TCP_LISTEN &&
> @@ -393,6 +396,41 @@
>  		}
>
>  		err = icmp_err_convert[code].errno;
> +		/* check if ICMP unreachable messages allow revert of backoff */
> +		if ((code != ICMP_NET_UNREACH && code != ICMP_HOST_UNREACH) ||
> +		    seq != tp->snd_una  || !icsk->icsk_retransmits ||
> +		    !icsk->icsk_backoff)

I'd recommend you break this into two if()\nbreak's, one for filtering
other icmps and other for filtering mismatching against the socket's 
state.

> +			break;
> +
> +		icsk->icsk_backoff--;
> +		icsk->icsk_rto >>= 1;

Are you absolute sure that we don't go to zero here when phase of the 
moon is just right? I'd put a max(..., 1) guard there.

> +
> +		skb_r = skb_peek(&sk->sk_write_queue);

tcp_write_queue_head(sk)

> +		BUG_ON(!skb_r);
> +
> +		if (sock_owned_by_user(sk)) {
> +			/* Deferring retransmission clocked by ICMP
> +			 * due to locked socket. */
> +			inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
> +			min(icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL),

??? (besides having indent wrong). What makes you think HZ/2 is right 
value if icsk_rto is large?

> +			TCP_RTO_MAX);

Perhaps you lack here something to exit?

> +		}
> +
> +		if (tcp_time_stamp - TCP_SKB_CB(skb_r)->when >
> +		    inet_csk(sk)->icsk_rto) {

icsk->icsk_rto !?!

> +			/* RTO revert clocked out retransmission. */
> +			tcp_retransmit_skb(sk, skb_r);

This is plain wrong, tcp_sock counters will get messed up if 
TCPCB_SACKED_RETRANS is already set while calling tcp_retransmit_skb. As a 
sidenote, it would be probably useful to move the check + clear of that 
bit before doing the retransmission to some common place one day.

> +			inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
> +						  icsk->icsk_rto, TCP_RTO_MAX);

RFC 2988, step 5.5 missing? Have you verified this patch for real?

> +		} else {
> +			/* RTO revert shortened timer. */
> +			inet_csk_reset_xmit_timer(
> +				sk, ICSK_TIME_RETRANS,
> +				icsk->icsk_rto-
> +				(tcp_time_stamp-TCP_SKB_CB(skb_r)->when),

Spacing.

> +				TCP_RTO_MAX);
> +		}
> +

How about:

 		u32 remaining;

 		remaining = icsk->icsk_rto - min(icsk->icsk_rto,
 						 tcp_time_stamp - TCP_SKB_CB(skb_r)->when));
 		if (!remaining) {
 			tcp_retransmit_skb(...);
 			remaining = icsk->icsk_rto;
 		}
 		inet_csk_reset_xmit_timer(..., remaining);

But check my note about step 5.5 (see above) to get remaining set right in 
the if branch.

>  		break;
>  	case ICMP_TIME_EXCEEDED:
>  		err = EHOSTUNREACH;
> diff -Naur linux-2.6.30.4/net/ipv4/tcp_timer.c linux-2.6.30.4-tcp-icmp/net/ipv4/tcp_timer.c
> --- linux-2.6.30.4/net/ipv4/tcp_timer.c	2009-07-31 00:34:47.000000000 +0200
> +++ linux-2.6.30.4-tcp-icmp/net/ipv4/tcp_timer.c	2009-08-14 13:22:18.068666329 +0200
> @@ -143,7 +143,7 @@
>  			dst_negative_advice(&sk->sk_dst_cache);
>  		retry_until = icsk->icsk_syn_retries ? : sysctl_tcp_syn_retries;
>  	} else {
> -		if (icsk->icsk_retransmits >= sysctl_tcp_retries1) {
> +		if (retrans_overstepped(sk, sysctl_tcp_retries1)) {

??? Could you justify these changes and explain their relation to the 
draft and icmp messages? If not, please drop them and try with proper 
description and description for them in a separate patch. I fail to see 
what you're trying to achieve here. You just ended up redefining the 
sysctl meaning too I think...

>  			/* Black hole detection */
>  			tcp_mtu_probing(icsk, sk);
> 
> @@ -156,12 +156,14 @@
>
>  			retry_until = tcp_orphan_retries(sk, alive);
> 
> -			if (tcp_out_of_resources(sk, alive || icsk->icsk_retransmits < retry_until))
> +			if (tcp_out_of_resources(
> +				sk, alive ||
> +				    !retrans_overstepped(sk, retry_until)))

???

>  				return 1;
>  		}
>  	}
> 
> -	if (icsk->icsk_retransmits >= retry_until) {
> +	if (retrans_overstepped(sk, retry_until)) {

...

>  		/* Has it gone just too far? */
>  		tcp_write_err(sk);
>  		return 1;
> @@ -385,7 +387,7 @@
>  out_reset_timer:
>  	icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
>  	inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
> -	if (icsk->icsk_retransmits > sysctl_tcp_retries1)
> +	if (retrans_overstepped(sk, sysctl_tcp_retries1))

...I guess none of these retrans_overstepped are related to the icmp stuff 
at all?

>  		__sk_dst_reset(sk);
>
>  out:;

-- 
  i.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] revert TCP retransmission backoff on ICMP destination unreachable
  2009-08-18 13:45     ` Ilpo Järvinen
@ 2009-08-18 17:40       ` Damian Lukowski
  2009-08-18 22:07         ` Ilpo Järvinen
  2009-08-19  6:29         ` David Miller
  0 siblings, 2 replies; 9+ messages in thread
From: Damian Lukowski @ 2009-08-18 17:40 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: David Miller, Netdev

> On Fri, 14 Aug 2009, Damian Lukowski wrote:
> 
>>> Longer than 80 columns, and use an inline function instead
>>> of a macro in order to get proper type checking.
>>> [...]
>>> Do not break up the function local variables with spurious new lines
>>> like this, please.
>>> [...]
>>> The indentation and tabbing is messed up in all of the code you are
>>> adding, please fix it up to be consistent with the surrounding code
>>> and the rest of the TCP stack.
>>>
>>> Do not use C++ style // comments.
>>
>> Better?
> 
> Please, include the changelog message on resubmits too next time.

I'm sorry, which message do you mean? I used plain diff without GIT or
anything.

>> +        if ((code != ICMP_NET_UNREACH && code != ICMP_HOST_UNREACH) ||
>> +            seq != tp->snd_una  || !icsk->icsk_retransmits ||
>> +            !icsk->icsk_backoff)
> 
> I'd recommend you break this into two if()\nbreak's, one for filtering
> other icmps and other for filtering mismatching against the socket's state.

Ok.

>> +        icsk->icsk_backoff--;
>> +        icsk->icsk_rto >>= 1;
> 
> Are you absolute sure that we don't go to zero here when phase of the
> moon is just right? I'd put a max(..., 1) guard there.

Well, the retransmission timer doubles icsk_rto whenever it increases
icsk_backoff, so we should reach the original icsk_rto, when
icsk_backoff becomes zero.

>> +        skb_r = skb_peek(&sk->sk_write_queue);
> 
> tcp_write_queue_head(sk)

Ok.
> 
>> +        BUG_ON(!skb_r);
>> +
>> +        if (sock_owned_by_user(sk)) {
>> +            /* Deferring retransmission clocked by ICMP
>> +             * due to locked socket. */
>> +            inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
>> +            min(icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL),
> 
> ??? (besides having indent wrong). What makes you think HZ/2 is right
> value if icsk_rto is large?

To be honest, I have taken over the code from my predecessor and didn't
question this part. When thinking about it, I would remove this block
completely. Will bad things happen, if I mess with the timer, when the
socket is locked?

>> +            TCP_RTO_MAX);
> 
> Perhaps you lack here something to exit?
> 
>> +        }
>> +
>> +        if (tcp_time_stamp - TCP_SKB_CB(skb_r)->when >
>> +            inet_csk(sk)->icsk_rto) {
> 
> icsk->icsk_rto !?!
> 
>> +            /* RTO revert clocked out retransmission. */
>> +            tcp_retransmit_skb(sk, skb_r);
> 
> This is plain wrong, tcp_sock counters will get messed up if
> TCPCB_SACKED_RETRANS is already set while calling tcp_retransmit_skb. As
> a sidenote, it would be probably useful to move the check + clear of
> that bit before doing the retransmission to some common place one day.

What, if I call tcp_retransmit_timer() manually instead? Will there
still be problems with SACK?

>> +            inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
>> +                          icsk->icsk_rto, TCP_RTO_MAX);
> 
> RFC 2988, step 5.5 missing? Have you verified this patch for real?

Thats the whole point of the draft. If consecutive retransmissions
trigger fitting ICMPs, the RTO value is not doubled, but remains
constant. So you can have a retransmission schedule of
t, t+r, t+2r, t+3r, t+4r,  t+5r  ...
instead of
t, t+r, t+3r, t+7r, t+15r, t+31r ...

I can show you trace plots of patched an unpatched TCP, if you like.
Should I attach files in the mails, or better post hyperlinks to them?

>> +        } else {
>> +            /* RTO revert shortened timer. */
>> +            inet_csk_reset_xmit_timer(
>> +                sk, ICSK_TIME_RETRANS,
>> +                icsk->icsk_rto-
>> +                (tcp_time_stamp-TCP_SKB_CB(skb_r)->when),
> 
> Spacing.
> 
>> +                TCP_RTO_MAX);
>> +        }
>> +
> 
> How about:
> 
>         u32 remaining;
> 
>         remaining = icsk->icsk_rto - min(icsk->icsk_rto,
>                          tcp_time_stamp - TCP_SKB_CB(skb_r)->when));
>         if (!remaining) {
>             tcp_retransmit_skb(...);
>             remaining = icsk->icsk_rto;
>         }
>         inet_csk_reset_xmit_timer(..., remaining);

Seems to be ok, but isn't tcp_retransmit_timer(...) better?

>> -        if (icsk->icsk_retransmits >= sysctl_tcp_retries1) {
>> +        if (retrans_overstepped(sk, sysctl_tcp_retries1)) {
> 
> ??? Could you justify these changes and explain their relation to the
> draft and icmp messages? If not, please drop them and try with proper
> description and description for them in a separate patch. I fail to see
> what you're trying to achieve here. You just ended up redefining the
> sysctl meaning too I think...

Well, you're quite right. RFCs specify timeouts in dimensions of time
(seconds, minutes, etc.), but Linux specifies timeouts in form of
numbers of retransmissions. One can do this, as long as the
retransmission schedule (exponential backoff in RFC case) is known. But
with this patch, the schedule can be completely different. In the
"worst" case, the time intervals between consecutive retransmission
probes are constant, leading to a TCP timeout after few seconds, instead
of several minutes.
What I did here, is to calculate the TCP timeout as if the schedule were
still an exponential backoff, given the allowed number of
retransmissions. It is possible, that TCP will now retransmit many more
times than tcp_retries indicates, but the duration until the TCP
connection times out, is more or less equivalent to unpatched TCP with
same tcp_retries values.
So, whenever some control block uses count values, but means time
values, I have to replace the control block appropriately; and that's
what retrans_overstepped() is supposed to do.

Thanks for your help, so far
 Damian

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] revert TCP retransmission backoff on ICMP destination unreachable
  2009-08-18 17:40       ` Damian Lukowski
@ 2009-08-18 22:07         ` Ilpo Järvinen
  2009-08-18 23:56           ` Damian Lukowski
  2009-08-19  6:29         ` David Miller
  1 sibling, 1 reply; 9+ messages in thread
From: Ilpo Järvinen @ 2009-08-18 22:07 UTC (permalink / raw)
  To: Damian Lukowski; +Cc: David Miller, Netdev

On Tue, 18 Aug 2009, Damian Lukowski wrote:

> > On Fri, 14 Aug 2009, Damian Lukowski wrote:
> > 
> >>> Longer than 80 columns, and use an inline function instead
> >>> of a macro in order to get proper type checking.
> >>> [...]
> >>> Do not break up the function local variables with spurious new lines
> >>> like this, please.
> >>> [...]
> >>> The indentation and tabbing is messed up in all of the code you are
> >>> adding, please fix it up to be consistent with the surrounding code
> >>> and the rest of the TCP stack.
> >>>
> >>> Do not use C++ style // comments.
> >>
> >> Better?
> > 
> > Please, include the changelog message on resubmits too next time.
> 
> I'm sorry, which message do you mean? I used plain diff without GIT or
> anything.

In the original mail you had some description about the patch, I was 
thinking of quoting something out of it but it was missing so I didn't 
bother copying. But more importantly, would Dave Miller want to apply the 
patch of yours, he would need that description and it saves work for him 
to have the description in every resubmit "duplicated" (one usually just 
gets a boomerang mail asking for the description which is similarly 
wasting his time).

> >> +        icsk->icsk_backoff--;
> >> +        icsk->icsk_rto >>= 1;
> > 
> > Are you absolute sure that we don't go to zero here when phase of the
> > moon is just right? I'd put a max(..., 1) guard there.
> 
> Well, the retransmission timer doubles icsk_rto whenever it increases
> icsk_backoff, so we should reach the original icsk_rto, when
> icsk_backoff becomes zero.

This is a false assumption. Take a look into the formula, there's a 
cap how far icsk_rto goes but icsk_backoff keeps increasing. In fact, 
there would probably be a need to set some other lower bound than 1, 
TCP_RTO_MIN would probably be more appropriate for this.

> >> +        BUG_ON(!skb_r);
> >> +
> >> +        if (sock_owned_by_user(sk)) {
> >> +            /* Deferring retransmission clocked by ICMP
> >> +             * due to locked socket. */
> >> +            inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
> >> +            min(icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL),
> > 
> > ??? (besides having indent wrong). What makes you think HZ/2 is right
> > value if icsk_rto is large?
> 
> To be honest, I have taken over the code from my predecessor and didn't
> question this part.

But you did Signoff that regardless, so you should know ;-).

> When thinking about it, I would remove this block
> completely. Will bad things happen, if I mess with the timer, when the
> socket is locked?

I've no idea without looking into all the details but I think this needs 
to be moved slightly later anyway into the imminent expiry branch and the 
timeout value needs to be copied from tcp_write_timer instead of using 
that bogus formula.

> >> +            TCP_RTO_MAX);
> > 
> > Perhaps you lack here something to exit?
> > 
> >> +        }
> >> +
> >> +        if (tcp_time_stamp - TCP_SKB_CB(skb_r)->when >
> >> +            inet_csk(sk)->icsk_rto) {
> > 
> > icsk->icsk_rto !?!
> > 
> >> +            /* RTO revert clocked out retransmission. */
> >> +            tcp_retransmit_skb(sk, skb_r);
> > 
> > This is plain wrong, tcp_sock counters will get messed up if
> > TCPCB_SACKED_RETRANS is already set while calling tcp_retransmit_skb. As
> > a sidenote, it would be probably useful to move the check + clear of
> > that bit before doing the retransmission to some common place one day.
> 
> What, if I call tcp_retransmit_timer() manually instead? Will there
> still be problems with SACK?

SACK is fine then, and I took a quick look into the other stuff which 
seemed to be quite ok too but it was just a quick glance, but then you 
certainly must delay the RTO if sock_owned_by_user(sk) is true like 
tcp_write_timer does. Tcp_retransmit_timer would also deal with the 
missing step 5.5 btw. I'm not sure if you like all the things that will 
happen in the tcp_enter_loss though, especially the undo_marker clearing 
might not please you, or is at best, sub-optimal in a certain scenario.

> >> +            inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
> >> +                          icsk->icsk_rto, TCP_RTO_MAX);
> > 
> > RFC 2988, step 5.5 missing? Have you verified this patch for real?
> 
> Thats the whole point of the draft. If consecutive retransmissions
> trigger fitting ICMPs, the RTO value is not doubled, but remains
> constant.

No, either the draft is wrong or the implementation. 5.5 step should be 
applied. Now you end up decreasing RTO per icmp which causes imminent 
retransmission instead of keeping it "constant". Effectively, step 4, 5, 6 
(halves like you do), 7 -> R, R is 5.4-5.6 and 5.5 increases RTO again 
because the timer _expired_ here even though we did short-circuit it 
instead of setting a zero timer and using the normal procedure (which in 
itself is ok approach, no need to force timer to expire).

> So you can have a retransmission schedule of
> t, t+r, t+2r, t+3r, t+4r,  t+5r  ...
> instead of
> t, t+r, t+3r, t+7r, t+15r, t+31r ...

Sure, but that's not what happens, instead you will get this series:

t, t+r, (t+r)/2, (t+r)/4, ...

...disclaimer: it might be off-by-something somewhere but I hope you get 
my point anyway.

> I can show you trace plots of patched an unpatched TCP, if you like.
> Should I attach files in the mails, or better post hyperlinks to them?

Sure, but they must not be from a trivial case but the rto must 
"flunctuate" and the expirys must be using the "imminent" branch for it
to count for me as a proof that you did it right.

> > How about:
> > 
> >         u32 remaining;
> > 
> >         remaining = icsk->icsk_rto - min(icsk->icsk_rto,
> >                          tcp_time_stamp - TCP_SKB_CB(skb_r)->when));
> >         if (!remaining) {
> >             tcp_retransmit_skb(...);
> >             remaining = icsk->icsk_rto;
> >         }
> >         inet_csk_reset_xmit_timer(..., remaining);
> 
> Seems to be ok, but isn't tcp_retransmit_timer(...) better?

Yeah, that's mostly fine I guess (at least in meaning of "safe").

> >> -        if (icsk->icsk_retransmits >= sysctl_tcp_retries1) {
> >> +        if (retrans_overstepped(sk, sysctl_tcp_retries1)) {
> > 
> > ??? Could you justify these changes and explain their relation to the
> > draft and icmp messages? If not, please drop them and try with proper
> > description and description for them in a separate patch. I fail to see
> > what you're trying to achieve here. You just ended up redefining the
> > sysctl meaning too I think...
> 
> Well, you're quite right. RFCs specify timeouts in dimensions of time
> (seconds, minutes, etc.), but Linux specifies timeouts in form of
> numbers of retransmissions. One can do this, as long as the
> retransmission schedule (exponential backoff in RFC case) is known. But
> with this patch, the schedule can be completely different. In the
> "worst" case, the time intervals between consecutive retransmission
> probes are constant, leading to a TCP timeout after few seconds, instead
> of several minutes.
> What I did here, is to calculate the TCP timeout as if the schedule were
> still an exponential backoff, given the allowed number of
> retransmissions. It is possible, that TCP will now retransmit many more
> times than tcp_retries indicates, but the duration until the TCP
> connection times out, is more or less equivalent to unpatched TCP with
> same tcp_retries values.
> So, whenever some control block uses count values, but means time
> values, I have to replace the control block appropriately; and that's
> what retrans_overstepped() is supposed to do.

I can understand your point, however, this is independent thing which 
should not be mixed with the other part of the change. So in future 
provide them as a two patch series. ...And expect some to not like this 
kind of change.

-- 
 i.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] revert TCP retransmission backoff on ICMP destination unreachable
  2009-08-18 22:07         ` Ilpo Järvinen
@ 2009-08-18 23:56           ` Damian Lukowski
  2009-08-19 10:55             ` Ilpo Järvinen
  0 siblings, 1 reply; 9+ messages in thread
From: Damian Lukowski @ 2009-08-18 23:56 UTC (permalink / raw)
  To: Ilpo Järvinen; +Cc: Netdev, David Miller

> On Tue, 18 Aug 2009, Damian Lukowski wrote:
> 
>>> On Fri, 14 Aug 2009, Damian Lukowski wrote:
>>>
>>>>> Longer than 80 columns, and use an inline function instead
>>>>> of a macro in order to get proper type checking.
>>>>> [...]
>>>>> Do not break up the function local variables with spurious new lines
>>>>> like this, please.
>>>>> [...]
>>>>> The indentation and tabbing is messed up in all of the code you are
>>>>> adding, please fix it up to be consistent with the surrounding code
>>>>> and the rest of the TCP stack.
>>>>>
>>>>> Do not use C++ style // comments.
>>>> Better?
>>> Please, include the changelog message on resubmits too next time.
>> I'm sorry, which message do you mean? I used plain diff without GIT or
>> anything.
> 
> In the original mail you had some description about the patch, I was 
> thinking of quoting something out of it but it was missing so I didn't 
> bother copying. But more importantly, would Dave Miller want to apply the 
> patch of yours, he would need that description and it saves work for him 
> to have the description in every resubmit "duplicated" (one usually just 
> gets a boomerang mail asking for the description which is similarly 
> wasting his time).

Ok, I will fully quote previous messages now. Just wanted to avoid
redundancy. Anyway, as long as the draft is not accepted as RFC, the
patch will violate current standards, so I doubt, Dave would like to
apply it soon. :)

> 
>>>> +        icsk->icsk_backoff--;
>>>> +        icsk->icsk_rto >>= 1;
>>> Are you absolute sure that we don't go to zero here when phase of the
>>> moon is just right? I'd put a max(..., 1) guard there.
>> Well, the retransmission timer doubles icsk_rto whenever it increases
>> icsk_backoff, so we should reach the original icsk_rto, when
>> icsk_backoff becomes zero.
> 
> This is a false assumption. Take a look into the formula, there's a 
> cap how far icsk_rto goes but icsk_backoff keeps increasing. In fact, 
> there would probably be a need to set some other lower bound than 1, 
> TCP_RTO_MIN would probably be more appropriate for this.

*sigh* I should have seen that. I would like to preserve the initial rto
value and not work with RTO_MIN. First, I have thought about storing the
initial rto in another state variable. But what, if we simply
recalculate the value?:
###
	icsk->icsk_backoff--;
-	icsk->icsk_rto >>= 1;
+	tcp_set_rto(sk);
+	icsk->icsk_rto <<= icsk->icsk_backoff;
+	tcp_bound_rto(sk);
###

>>>> +        BUG_ON(!skb_r);
>>>> +
>>>> +        if (sock_owned_by_user(sk)) {
>>>> +            /* Deferring retransmission clocked by ICMP
>>>> +             * due to locked socket. */
>>>> +            inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
>>>> +            min(icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL),
>>> ??? (besides having indent wrong). What makes you think HZ/2 is right
>>> value if icsk_rto is large?
>> To be honest, I have taken over the code from my predecessor and didn't
>> question this part.
> 
> But you did Signoff that regardless, so you should know ;-).

Only because checkpatch.pl didn't like it otherwise. :)

> 
>> When thinking about it, I would remove this block
>> completely. Will bad things happen, if I mess with the timer, when the
>> socket is locked?
> 
> I've no idea without looking into all the details but I think this needs 
> to be moved slightly later anyway into the imminent expiry branch and the 
> timeout value needs to be copied from tcp_write_timer instead of using 
> that bogus formula.
> 
>>>> +            TCP_RTO_MAX);
>>> Perhaps you lack here something to exit?
>>>
>>>> +        }
>>>> +
>>>> +        if (tcp_time_stamp - TCP_SKB_CB(skb_r)->when >
>>>> +            inet_csk(sk)->icsk_rto) {
>>> icsk->icsk_rto !?!
>>>
>>>> +            /* RTO revert clocked out retransmission. */
>>>> +            tcp_retransmit_skb(sk, skb_r);
>>> This is plain wrong, tcp_sock counters will get messed up if
>>> TCPCB_SACKED_RETRANS is already set while calling tcp_retransmit_skb. As
>>> a sidenote, it would be probably useful to move the check + clear of
>>> that bit before doing the retransmission to some common place one day.
>> What, if I call tcp_retransmit_timer() manually instead? Will there
>> still be problems with SACK?
> 
> SACK is fine then, and I took a quick look into the other stuff which 
> seemed to be quite ok too but it was just a quick glance, but then you 
> certainly must delay the RTO if sock_owned_by_user(sk) is true like 
> tcp_write_timer does. Tcp_retransmit_timer would also deal with the 
> missing step 5.5 btw. I'm not sure if you like all the things that will 
> happen in the tcp_enter_loss though, especially the undo_marker clearing 
> might not please you, or is at best, sub-optimal in a certain scenario.
> 
>>>> +            inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
>>>> +                          icsk->icsk_rto, TCP_RTO_MAX);
>>> RFC 2988, step 5.5 missing? Have you verified this patch for real?
>> Thats the whole point of the draft. If consecutive retransmissions
>> trigger fitting ICMPs, the RTO value is not doubled, but remains
>> constant.
> 
> No, either the draft is wrong or the implementation. 5.5 step should be 
> applied. Now you end up decreasing RTO per icmp which causes imminent 
> retransmission instead of keeping it "constant". Effectively, step 4, 5, 6 
> (halves like you do), 7 -> R, R is 5.4-5.6 and 5.5 increases RTO again 
> because the timer _expired_ here even though we did short-circuit it 
> instead of setting a zero timer and using the normal procedure (which in 
> itself is ok approach, no need to force timer to expire).
> 
>> So you can have a retransmission schedule of
>> t, t+r, t+2r, t+3r, t+4r,  t+5r  ...
>> instead of
>> t, t+r, t+3r, t+7r, t+15r, t+31r ...
> 
> Sure, but that's not what happens, instead you will get this series:
> 
> t, t+r, (t+r)/2, (t+r)/4, ...
> 
> ...disclaimer: it might be off-by-something somewhere but I hope you get 
> my point anyway.

It seems to me, that we focus on different things. From the TCP code's
point of view, RFC2988 step 5.5 is of course executed, because timeouts
still trigger tcp_retransmit_timer(), which doubles the RTO at the end.
But from the network's point of view, there is no doubling, but constant
intervals between retransmissions.
The following happens with patched TCP (not considering RTO_MAX):
t:		Connectivity breaks, rto is r
t+r:		First RTO fires, SND.UNA is retransmitted by
		tcp_retransmit_timer() and rto := 2rto
		Now it is rto == 2r
t+r+epsilon:	ICMP unreach arrives from a router.
		The patch sets rto := rto/2
		Now it is rto == r
t+2r:		Second RTO fires, SND.UNA is retransmitted by
		tcp_retransmit_timer() and rto := 2rto
		Now it is rto == 2r
		Assume, that this retransmission did not trigger
		any ICMP message.
t+4r:		Third RTO fires, SND.UNA is retransmitted by
		tcp_retransmit_timer() and rto := 2rto
		Now it is rto == 4r
		Assume, that this retransmission did not trigger
		any ICMP message.
t+8r:		Fourth RTO fires, SND.UNA is retransmitted by
		tcp_retransmit_timer() and rto := 2rto
		Now it is rto == 8r
t+8r+epsilon:	ICMP unreach arrives from a router.
		The patch sets rto := rto/2
		Now it is rto == 4r
t+12r:		Fifth RTO fires ...


>> I can show you trace plots of patched an unpatched TCP, if you like.
>> Should I attach files in the mails, or better post hyperlinks to them?
> 
> Sure, but they must not be from a trivial case but the rto must 
> "flunctuate" 

How should they fluctuate? RTO is doubled by tcp_retransmit_timer() no
matter what happens, and it is halved by the the patch, if an
appropriate ICMP packet has been received.

> and the expirys must be using the "imminent" branch for it
> to count for me as a proof that you did it right.

I'm not sure what you mean.

Still, hyperlinks or attachments?

> 
>>> How about:
>>>
>>>         u32 remaining;
>>>
>>>         remaining = icsk->icsk_rto - min(icsk->icsk_rto,
>>>                          tcp_time_stamp - TCP_SKB_CB(skb_r)->when));
>>>         if (!remaining) {
>>>             tcp_retransmit_skb(...);
>>>             remaining = icsk->icsk_rto;
>>>         }
>>>         inet_csk_reset_xmit_timer(..., remaining);
>> Seems to be ok, but isn't tcp_retransmit_timer(...) better?
> 
> Yeah, that's mostly fine I guess (at least in meaning of "safe").
> 
>>>> -        if (icsk->icsk_retransmits >= sysctl_tcp_retries1) {
>>>> +        if (retrans_overstepped(sk, sysctl_tcp_retries1)) {
>>> ??? Could you justify these changes and explain their relation to the
>>> draft and icmp messages? If not, please drop them and try with proper
>>> description and description for them in a separate patch. I fail to see
>>> what you're trying to achieve here. You just ended up redefining the
>>> sysctl meaning too I think...
>> Well, you're quite right. RFCs specify timeouts in dimensions of time
>> (seconds, minutes, etc.), but Linux specifies timeouts in form of
>> numbers of retransmissions. One can do this, as long as the
>> retransmission schedule (exponential backoff in RFC case) is known. But
>> with this patch, the schedule can be completely different. In the
>> "worst" case, the time intervals between consecutive retransmission
>> probes are constant, leading to a TCP timeout after few seconds, instead
>> of several minutes.
>> What I did here, is to calculate the TCP timeout as if the schedule were
>> still an exponential backoff, given the allowed number of
>> retransmissions. It is possible, that TCP will now retransmit many more
>> times than tcp_retries indicates, but the duration until the TCP
>> connection times out, is more or less equivalent to unpatched TCP with
>> same tcp_retries values.
>> So, whenever some control block uses count values, but means time
>> values, I have to replace the control block appropriately; and that's
>> what retrans_overstepped() is supposed to do.
> 
> I can understand your point, however, this is independent thing which 
> should not be mixed with the other part of the change. So in future 
> provide them as a two patch series. ...And expect some to not like this 
> kind of change.

Ok.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] revert TCP retransmission backoff on ICMP destination unreachable
  2009-08-18 23:56           ` Damian Lukowski
@ 2009-08-19 10:55             ` Ilpo Järvinen
  0 siblings, 0 replies; 9+ messages in thread
From: Ilpo Järvinen @ 2009-08-19 10:55 UTC (permalink / raw)
  To: Damian Lukowski; +Cc: Netdev, David Miller

On Wed, 19 Aug 2009, Damian Lukowski wrote:

> Ok, I will fully quote previous messages now. Just wanted to avoid
> redundancy.

Full quoting is not what I meant, Dave explained what I meant in his mail.

> Anyway, as long as the draft is not accepted as RFC, the
> patch will violate current standards, so I doubt, Dave would like to
> apply it soon. :)

...Don't be too sure on that ;-).

> >>>> +        icsk->icsk_backoff--;
> >>>> +        icsk->icsk_rto >>= 1;
> >>> Are you absolute sure that we don't go to zero here when phase of the
> >>> moon is just right? I'd put a max(..., 1) guard there.
> >> Well, the retransmission timer doubles icsk_rto whenever it increases
> >> icsk_backoff, so we should reach the original icsk_rto, when
> >> icsk_backoff becomes zero.
> > 
> > This is a false assumption. Take a look into the formula, there's a 
> > cap how far icsk_rto goes but icsk_backoff keeps increasing. In fact, 
> > there would probably be a need to set some other lower bound than 1, 
> > TCP_RTO_MIN would probably be more appropriate for this.
> 
> *sigh* I should have seen that. I would like to preserve the initial rto
> value and not work with RTO_MIN. First, I have thought about storing the
> initial rto in another state variable. But what, if we simply
> recalculate the value?:
> ###
> 	icsk->icsk_backoff--;
> -	icsk->icsk_rto >>= 1;
> +	tcp_set_rto(sk);
> +	icsk->icsk_rto <<= icsk->icsk_backoff;
> +	tcp_bound_rto(sk);
> ###

Yes, this is sort of a problem. I'd recommend that you read through all 
icsk_rto readers and writers and see if recalculation can be done without 
getting into some problems.

> >>>> +        BUG_ON(!skb_r);
> >>>> +
> >>>> +        if (sock_owned_by_user(sk)) {
> >>>> +            /* Deferring retransmission clocked by ICMP
> >>>> +             * due to locked socket. */
> >>>> +            inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
> >>>> +            min(icsk->icsk_rto, TCP_RESOURCE_PROBE_INTERVAL),
> >>> ??? (besides having indent wrong). What makes you think HZ/2 is right
> >>> value if icsk_rto is large?
> >> To be honest, I have taken over the code from my predecessor and didn't
> >> question this part.
> > 
> > But you did Signoff that regardless, so you should know ;-).
> 
> Only because checkpatch.pl didn't like it otherwise. :)

...And you dare to admit :-) ...It certainly has some other meaning beyond 
checkpatch complain, for future, please check and understand what it 
means that you add it there.

> >>>> +            inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
> >>>> +                          icsk->icsk_rto, TCP_RTO_MAX);
> >>> RFC 2988, step 5.5 missing? Have you verified this patch for real?
> >> Thats the whole point of the draft. If consecutive retransmissions
> >> trigger fitting ICMPs, the RTO value is not doubled, but remains
> >> constant.
> > 
> > No, either the draft is wrong or the implementation. 5.5 step should be 
> > applied. Now you end up decreasing RTO per icmp which causes imminent 
> > retransmission instead of keeping it "constant". Effectively, step 4, 5, 6 
> > (halves like you do), 7 -> R, R is 5.4-5.6 and 5.5 increases RTO again 
> > because the timer _expired_ here even though we did short-circuit it 
> > instead of setting a zero timer and using the normal procedure (which in 
> > itself is ok approach, no need to force timer to expire).
> > 
> >> So you can have a retransmission schedule of
> >> t, t+r, t+2r, t+3r, t+4r,  t+5r  ...
> >> instead of
> >> t, t+r, t+3r, t+7r, t+15r, t+31r ...
> > 
> > Sure, but that's not what happens, instead you will get this series:
> > 
> > t, t+r, (t+r)/2, (t+r)/4, ...
> > 
> > ...disclaimer: it might be off-by-something somewhere but I hope you get 
> > my point anyway.
> 
> It seems to me, that we focus on different things. From the TCP code's
> point of view, RFC2988 step 5.5 is of course executed, because timeouts
> still trigger tcp_retransmit_timer(), which doubles the RTO at the end.

No and yes. For no: you artificially made tcp_retransmit_timer to not be 
called whenever the ICMP arrives at a point where RTO/2 step causes the 
timer to expire right away. But yes: if you change it to call 
tcp_retransmit_timer as we have discussed, 5.5 should be there (but that's 
not in original patch which I was commenting).

> But from the network's point of view, there is no doubling, but constant
> intervals between retransmissions.

Plain wrong, you would only end up halving as per icmp when the timing is 
right to produce the scenario I'm talking. Please show from the original 
patch

> The following happens with patched TCP (not considering RTO_MAX):
> t:		Connectivity breaks, rto is r
> t+r:		First RTO fires, SND.UNA is retransmitted by
> 		tcp_retransmit_timer() and rto := 2rto
> 		Now it is rto == 2r
> t+r+epsilon:	ICMP unreach arrives from a router.
> 		The patch sets rto := rto/2
> 		Now it is rto == r
> t+2r:		Second RTO fires, SND.UNA is retransmitted by
> 		tcp_retransmit_timer() and rto := 2rto
> 		Now it is rto == 2r
> 		Assume, that this retransmission did not trigger
> 		any ICMP message.
> t+4r:		Third RTO fires, SND.UNA is retransmitted by
> 		tcp_retransmit_timer() and rto := 2rto
> 		Now it is rto == 4r
> 		Assume, that this retransmission did not trigger
> 		any ICMP message.
> t+8r:		Fourth RTO fires, SND.UNA is retransmitted by
> 		tcp_retransmit_timer() and rto := 2rto
> 		Now it is rto == 8r
> t+8r+epsilon:	ICMP unreach arrives from a router.
> 		The patch sets rto := rto/2
> 		Now it is rto == 4r
> t+12r:		Fifth RTO fires ...

No, that was not the scenario I'm talking about. Let me try to explain:

> t:		Connectivity breaks, rto is r
> t+r:		First RTO fires, SND.UNA is retransmitted by
> 		tcp_retransmit_timer() and rto := 2rto
> 		Now it is rto == 2r
t+2r+epsilon:	ICMP unreach arrives from a router.
> 		The patch sets rto := rto/2
 		t+2r+epsilon > t+r+rto => call for retransmit in the icmp code
 		set rto = r
t+5/2r+epsilon ICMP unreach arrives from a router.
 		The patch sets rto := rto/2
 		t+5/2r+epsilon > t+something => call for retransmit
 		set rto = r/2
...

See, it's a downward spiral because nothing in the original patch takes 
care of the rto*2 (ie., step 5.5) when RTO is artificially _caused by_ the 
ICMP (ie., the halving does make expiry to happen right away, see draft 
step 7).

> >> I can show you trace plots of patched an unpatched TCP, if you like.
> >> Should I attach files in the mails, or better post hyperlinks to them?
> > 
> > Sure, but they must not be from a trivial case but the rto must 
> > "flunctuate" 
> 
> How should they fluctuate? RTO is doubled by tcp_retransmit_timer() no
> matter what happens, and it is halved by the the patch, if an
> appropriate ICMP packet has been received.

The original patch doesn't agree. If you'd have used tcp_retransmit_timer 
in the original I'd agree but you called for tcp_retransmit_skb(skb_r) 
and set the next rto yourself which wouldn't let tcp_retransmit_timer to 
run at all in the scenario I was talking about. But lets see the next 
version of the patch first and only then continue the discussion on this 
particular point about 5.5 (probably not then necessary anymore because
of a change we've discussed).

> > and the expirys must be using the "imminent" branch for it
> > to count for me as a proof that you did it right.
> 
> I'm not sure what you mean.

The case I'm talking about is the branch where we notice that after doing 
rto/2 the RTO is to be scheduled right away (see step 7 in the draft).

> Still, hyperlinks or attachments?

Either is fine, but large stuff as links. However, honestly I'm not that 
interest in them as I know that if you have the traces from the particular 
case I'm talking about, they would show the misbehavior (other cases 
work but they do not interest me at all). It would have been mostly just 
to convince you that this misbehavior is there, since my words didn't seem 
to make it :-). I propose we put this "proving" aside too for now since 
the changes we've already discussed should make this a non-issue anyway,
and therefore no need to waste time on that.


-- 
  i.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] revert TCP retransmission backoff on ICMP destination unreachable
  2009-08-18 17:40       ` Damian Lukowski
  2009-08-18 22:07         ` Ilpo Järvinen
@ 2009-08-19  6:29         ` David Miller
  1 sibling, 0 replies; 9+ messages in thread
From: David Miller @ 2009-08-19  6:29 UTC (permalink / raw)
  To: damian; +Cc: ilpo.jarvinen, netdev

From: Damian Lukowski <damian@tvk.rwth-aachen.de>
Date: Tue, 18 Aug 2009 19:40:01 +0200

>> On Fri, 14 Aug 2009, Damian Lukowski wrote:
>> 
>>>> Longer than 80 columns, and use an inline function instead
>>>> of a macro in order to get proper type checking.
>>>> [...]
>>>> Do not break up the function local variables with spurious new lines
>>>> like this, please.
>>>> [...]
>>>> The indentation and tabbing is messed up in all of the code you are
>>>> adding, please fix it up to be consistent with the surrounding code
>>>> and the rest of the TCP stack.
>>>>
>>>> Do not use C++ style // comments.
>>>
>>> Better?
>> 
>> Please, include the changelog message on resubmits too next time.
> 
> I'm sorry, which message do you mean? I used plain diff without GIT or
> anything.

He means the commit message.

When you fix up patches, don't just reply and say "here's the fixed
up patch".  That doesn't work.

We need the full context, the full commit message, and your signoffs,
when you submit any patch you intend us to consider applying.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2009-08-19 10:55 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2009-08-11 11:27 [PATCH] revert TCP retransmission backoff on ICMP destination unreachable Damian Lukowski
2009-08-13 23:08 ` David Miller
2009-08-14 12:08   ` Damian Lukowski
2009-08-18 13:45     ` Ilpo Järvinen
2009-08-18 17:40       ` Damian Lukowski
2009-08-18 22:07         ` Ilpo Järvinen
2009-08-18 23:56           ` Damian Lukowski
2009-08-19 10:55             ` Ilpo Järvinen
2009-08-19  6:29         ` David Miller

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).