* [PATCH] Customizable TCP backoff patch
@ 2006-09-27 18:52 Ben Woodard
2006-09-27 23:16 ` David Miller
0 siblings, 1 reply; 17+ messages in thread
From: Ben Woodard @ 2006-09-27 18:52 UTC (permalink / raw)
To: netdev; +Cc: Mark A. Grondona, Brian Behlendorf
[-- Attachment #1: Type: text/plain, Size: 2800 bytes --]
Here at LLNL we have a rather challenging network environment on our
clusters. We basically have 1000's of gigE links attached to an
oversubscribed federated network. Most of the time this network is idle
but the expected workload is for regular spikes extremely heavy activity
lasting a few minutes. All end-points in a highly coordinated manor,
typically after exiting an MPI barrier, start pushing as much data as
possible through the oversubscribed core. The result is a wave of TCP
back-offs where all the TCP streams back-off in lock step. The network
oscillates from highly congested for brief moments to largely idle.
Given enough time TCP will settle down in to something mostly reasonable
but even then it causes us a few problems:
1) It takes a long time for the network to settle in to a steady state
and while it does network utilization is very poor.
2) Many of the sockets will rapidly back off to the maximum value. This
can lead to application level timeouts being triggered because we also
initially calibrated the timeouts with the notion that 2 minute
back-offs would be the exception and not the norm.
3) Once we reach steady state there's no guarantee of fairness between
TCP streams. For our workload this is particularly undesirable since
the parallel job which kicked off all this activity must wait until the
slowest transaction completes. This translates in to 1000's of nodes
sitting idle.
By knowing this is the expected workload on this dedicated network we
can safely make the back-offs more aggressive to mitigate most of these
issues.
We also played around with using a random seed when selecting the
back-off interval to avoid all the sockets backing-off in lock step.
That worked reasonably well but was more invasive then simply adding a
few more tunable.
Because these are general utility clusters we run many different
programs and so trying to fix this problem in the application is not
possible since there are literally hundreds if not thousands of them.
We're more than willing to consider other approaches to handling this
particular workload better. We've even considered that TCP isn't at all
the right protocol but this affects several protocols including NFS and
the benefits of running NFS over TCP are too great.
The original patch was prepared by Brian Behlendorf. He asked me to
adapt it for current kernels keep it up to date and send upstream.
This may also help people like Andrew Athan which reported a similar
problem a couple of days ago on the linux-net mailing list:
http://www.uwsg.iu.edu/hypermail/linux/net/0609.3/0005.html I suspect
that it is more common a case than is widely recognized.
Signed-off-by: Ben Woodard <woodard@redhat.com>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
[-- Attachment #2: tcp-backoff-18.patch --]
[-- Type: text/x-patch, Size: 6566 bytes --]
diff -ru linux-2.6.18/include/linux/sysctl.h linux-2.6.18.new/include/linux/sysctl.h
--- linux-2.6.18/include/linux/sysctl.h 2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/include/linux/sysctl.h 2006-09-26 17:10:36.000000000 -0700
@@ -411,6 +411,8 @@
NET_IPV4_TCP_WORKAROUND_SIGNED_WINDOWS=115,
NET_TCP_DMA_COPYBREAK=116,
NET_TCP_SLOW_START_AFTER_IDLE=117,
+ NET_TCP_RTO_MAX=118,
+ NET_TCP_RTO_INIT=119,
};
enum {
Only in linux-2.6.18.new/include/linux: sysctl.h.orig
Only in linux-2.6.18.new/include/linux: sysctl.h.rej
diff -ru linux-2.6.18/include/linux/tcp.h linux-2.6.18.new/include/linux/tcp.h
--- linux-2.6.18/include/linux/tcp.h 2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/include/linux/tcp.h 2006-09-26 17:08:33.000000000 -0700
@@ -94,6 +94,8 @@
#define TCP_INFO 11 /* Information about this connection. */
#define TCP_QUICKACK 12 /* Block/reenable quick acks */
#define TCP_CONGESTION 13 /* Congestion control algorithm */
+#define TCP_BACKOFF_MAX 14 /* Maximum backoff value */
+#define TCP_BACKOFF_INIT 15 /* Initial backoff value */
#define TCPI_OPT_TIMESTAMPS 1
#define TCPI_OPT_SACK 2
@@ -257,6 +259,8 @@
__u8 frto_counter; /* Number of new acks after RTO */
__u8 nonagle; /* Disable Nagle algorithm? */
__u8 keepalive_probes; /* num of allowed keep alive probes */
+ __u32 rto_max; /* Maximum backoff value */
+ __u32 rto_init; /* Initial backoff value */
/* RTT measurement */
__u32 srtt; /* smoothed round trip time << 3 */
Only in linux-2.6.18.new/include/linux: tcp.h.orig
diff -ru linux-2.6.18/include/net/tcp.h linux-2.6.18.new/include/net/tcp.h
--- linux-2.6.18/include/net/tcp.h 2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/include/net/tcp.h 2006-09-26 17:12:04.000000000 -0700
@@ -227,6 +227,8 @@
extern int sysctl_tcp_base_mss;
extern int sysctl_tcp_workaround_signed_windows;
extern int sysctl_tcp_slow_start_after_idle;
+extern int sysctl_tcp_rto_max;
+extern int sysctl_tcp_rto_init;
extern atomic_t tcp_memory_allocated;
extern atomic_t tcp_sockets_allocated;
Only in linux-2.6.18.new/include/net: tcp.h.orig
Only in linux-2.6.18.new/include/net: tcp.h.rej
diff -ru linux-2.6.18/net/ipv4/sysctl_net_ipv4.c linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c
--- linux-2.6.18/net/ipv4/sysctl_net_ipv4.c 2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c 2006-09-26 17:08:33.000000000 -0700
@@ -697,6 +697,22 @@
.mode = 0644,
.proc_handler = &proc_dointvec
},
+ {
+ .ctl_name = NET_TCP_RTO_MAX,
+ .procname = "tcp_rto_max",
+ .data = &sysctl_tcp_rto_max,
+ .maxlen = sizeof(unsigned),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec
+ },
+ {
+ .ctl_name = NET_TCP_RTO_INIT,
+ .procname = "tcp_rto_init",
+ .data = &sysctl_tcp_rto_init,
+ .maxlen = sizeof(unsigned),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec
+ },
{ .ctl_name = 0 }
};
Only in linux-2.6.18.new/net/ipv4: sysctl_net_ipv4.c.orig
diff -ru linux-2.6.18/net/ipv4/tcp.c linux-2.6.18.new/net/ipv4/tcp.c
--- linux-2.6.18/net/ipv4/tcp.c 2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/net/ipv4/tcp.c 2006-09-26 17:08:33.000000000 -0700
@@ -1939,6 +1939,21 @@
}
break;
+ case TCP_BACKOFF_MAX:
+ if (val < 1)
+ err = -EINVAL;
+ else
+ tp->rto_max = val * HZ;
+ break;
+
+ case TCP_BACKOFF_INIT:
+ if (val < 1)
+ err = -EINVAL;
+ else
+ tp->rto_init = val * HZ;
+ break;
+
+
default:
err = -ENOPROTOOPT;
break;
@@ -2110,6 +2125,12 @@
if (copy_to_user(optval, icsk->icsk_ca_ops->name, len))
return -EFAULT;
return 0;
+ case TCP_BACKOFF_MAX:
+ val = (tp->rto_max ? : sysctl_tcp_rto_max) / HZ;
+ break;
+ case TCP_BACKOFF_INIT:
+ val = (tp->rto_init ? : sysctl_tcp_rto_init) / HZ;
+ break;
default:
return -ENOPROTOOPT;
};
Only in linux-2.6.18.new/net/ipv4: tcp.c.orig
diff -ru linux-2.6.18/net/ipv4/tcp_timer.c linux-2.6.18.new/net/ipv4/tcp_timer.c
--- linux-2.6.18/net/ipv4/tcp_timer.c 2006-09-19 20:42:06.000000000 -0700
+++ linux-2.6.18.new/net/ipv4/tcp_timer.c 2006-09-26 17:08:33.000000000 -0700
@@ -31,6 +31,8 @@
int sysctl_tcp_retries1 = TCP_RETR1;
int sysctl_tcp_retries2 = TCP_RETR2;
int sysctl_tcp_orphan_retries;
+int sysctl_tcp_rto_max = TCP_RTO_MAX;
+int sysctl_tcp_rto_init = TCP_TIMEOUT_INIT;
static void tcp_write_timer(unsigned long);
static void tcp_delack_timer(unsigned long);
@@ -71,7 +73,8 @@
/* If peer does not open window for long time, or did not transmit
* anything for long time, penalize it. */
- if ((s32)(tcp_time_stamp - tp->lsndtime) > 2*TCP_RTO_MAX || !do_reset)
+ if ((s32)(tcp_time_stamp - tp->lsndtime) >
+ 2 * (tp->rto_max ? : sysctl_tcp_rto_max) || !do_reset)
orphans <<= 1;
/* If some dubious ICMP arrived, penalize even more. */
@@ -256,8 +259,8 @@
max_probes = sysctl_tcp_retries2;
if (sock_flag(sk, SOCK_DEAD)) {
- const int alive = ((icsk->icsk_rto << icsk->icsk_backoff) < TCP_RTO_MAX);
-
+ const int alive = ((icsk->icsk_rto << icsk->icsk_backoff) <
+ (tp->rto_max ? : sysctl_tcp_rto_max));
max_probes = tcp_orphan_retries(sk, alive);
if (tcp_out_of_resources(sk, alive || icsk->icsk_probes_out <= max_probes))
@@ -301,7 +304,8 @@
inet->num, tp->snd_una, tp->snd_nxt);
}
#endif
- if (tcp_time_stamp - tp->rcv_tstamp > TCP_RTO_MAX) {
+ if (tcp_time_stamp - tp->rcv_tstamp >
+ (tp->rto_max ? : sysctl_tcp_rto_max)) {
tcp_write_err(sk);
goto out;
}
@@ -373,7 +377,8 @@
out_reset_timer:
icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
- inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
+ inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto,
+ (tp->rto_max ? : sysctl_tcp_rto_max));
if (icsk->icsk_retransmits > sysctl_tcp_retries1)
__sk_dst_reset(sk);
@@ -428,7 +433,8 @@
static void tcp_synack_timer(struct sock *sk)
{
inet_csk_reqsk_queue_prune(sk, TCP_SYNQ_INTERVAL,
- TCP_TIMEOUT_INIT, TCP_RTO_MAX);
+ TCP_TIMEOUT_INIT,
+ ((tp->rto_init ? : sysctl_tcp_rto_init)));
}
void tcp_set_keepalive(struct sock *sk, int val)
^ permalink raw reply [flat|nested] 17+ messages in thread* Re: [PATCH] Customizable TCP backoff patch 2006-09-27 18:52 [PATCH] Customizable TCP backoff patch Ben Woodard @ 2006-09-27 23:16 ` David Miller 2006-09-27 23:00 ` Stephen Hemminger 2006-10-03 18:14 ` Ben Woodard 0 siblings, 2 replies; 17+ messages in thread From: David Miller @ 2006-09-27 23:16 UTC (permalink / raw) To: woodard; +Cc: netdev, mgrondona, behlendorf1 From: Ben Woodard <woodard@redhat.com> Date: Wed, 27 Sep 2006 11:52:57 -0700 > Because these are general utility clusters we run many different > programs and so trying to fix this problem in the application is not > possible since there are literally hundreds if not thousands of them. Then why add a socket option setting as your patch does? :-) I also object to the socket option setting being allowed for any user because this can have awful effects if allowed by arbitrary users on arbitrary networks. > We're more than willing to consider other approaches to handling this > particular workload better. We've even considered that TCP isn't at all > the right protocol but this affects several protocols including NFS and > the benefits of running NFS over TCP are too great. > > The original patch was prepared by Brian Behlendorf. He asked me to > adapt it for current kernels keep it up to date and send upstream. > > This may also help people like Andrew Athan which reported a similar > problem a couple of days ago on the linux-net mailing list: > http://www.uwsg.iu.edu/hypermail/linux/net/0609.3/0005.html I suspect > that it is more common a case than is widely recognized. > > Signed-off-by: Ben Woodard <woodard@redhat.com> > Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> Other issues: 1) 2 "u32" in the tcp_sock is a lot of space to devote to this new state. If it can fit in 2 "u16"'s or even less space, please use that. 2) the expression "(tp->foo ? : sysctl_foo)" is repeated many times in the patch, please encapsulate it into an inline function or similar I'm still torn on the fundamental issues of this patch. I think random backoff is a better generic solution to this kind of problem. If it works for ethernet, it might just work for TCP too :-) Thanks. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] Customizable TCP backoff patch 2006-09-27 23:16 ` David Miller @ 2006-09-27 23:00 ` Stephen Hemminger 2006-09-28 2:06 ` David Miller 2006-10-03 18:14 ` Ben Woodard 1 sibling, 1 reply; 17+ messages in thread From: Stephen Hemminger @ 2006-09-27 23:00 UTC (permalink / raw) To: David Miller; +Cc: woodard, netdev, mgrondona, behlendorf1 On Wed, 27 Sep 2006 16:16:38 -0700 (PDT) David Miller <davem@davemloft.net> wrote: > From: Ben Woodard <woodard@redhat.com> > Date: Wed, 27 Sep 2006 11:52:57 -0700 > > > Because these are general utility clusters we run many different > > programs and so trying to fix this problem in the application is not > > possible since there are literally hundreds if not thousands of them. > > Then why add a socket option setting as your patch does? :-) > > I also object to the socket option setting being allowed for > any user because this can have awful effects if allowed by > arbitrary users on arbitrary networks. Setting a cwnd limit would do the same thing. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] Customizable TCP backoff patch 2006-09-27 23:00 ` Stephen Hemminger @ 2006-09-28 2:06 ` David Miller 0 siblings, 0 replies; 17+ messages in thread From: David Miller @ 2006-09-28 2:06 UTC (permalink / raw) To: shemminger; +Cc: woodard, netdev, mgrondona, behlendorf1 From: Stephen Hemminger <shemminger@osdl.org> Date: Wed, 27 Sep 2006 16:00:44 -0700 > On Wed, 27 Sep 2006 16:16:38 -0700 (PDT) > David Miller <davem@davemloft.net> wrote: > > > From: Ben Woodard <woodard@redhat.com> > > Date: Wed, 27 Sep 2006 11:52:57 -0700 > > > > > Because these are general utility clusters we run many different > > > programs and so trying to fix this problem in the application is not > > > possible since there are literally hundreds if not thousands of them. > > > > Then why add a socket option setting as your patch does? :-) > > > > I also object to the socket option setting being allowed for > > any user because this can have awful effects if allowed by > > arbitrary users on arbitrary networks. > > Setting a cwnd limit would do the same thing. Not really :-) It would not influence the TCP retransmit timeouts and backoff, although it would influence the congestion window handling during recovery after such timeout based losses. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] Customizable TCP backoff patch 2006-09-27 23:16 ` David Miller 2006-09-27 23:00 ` Stephen Hemminger @ 2006-10-03 18:14 ` Ben Woodard 2006-10-04 7:07 ` David Miller 1 sibling, 1 reply; 17+ messages in thread From: Ben Woodard @ 2006-10-03 18:14 UTC (permalink / raw) To: David Miller; +Cc: netdev, mgrondona, behlendorf1 [-- Attachment #1: Type: text/plain, Size: 2240 bytes --] David Miller wrote: > From: Ben Woodard <woodard@redhat.com> > Date: Wed, 27 Sep 2006 11:52:57 -0700 > >> Because these are general utility clusters we run many different >> programs and so trying to fix this problem in the application is not >> possible since there are literally hundreds if not thousands of them. > > Then why add a socket option setting as your patch does? :-) > > I also object to the socket option setting being allowed for > any user because this can have awful effects if allowed by > arbitrary users on arbitrary networks. > >> We're more than willing to consider other approaches to handling this >> particular workload better. We've even considered that TCP isn't at all >> the right protocol but this affects several protocols including NFS and >> the benefits of running NFS over TCP are too great. >> >> The original patch was prepared by Brian Behlendorf. He asked me to >> adapt it for current kernels keep it up to date and send upstream. >> >> This may also help people like Andrew Athan which reported a similar >> problem a couple of days ago on the linux-net mailing list: >> http://www.uwsg.iu.edu/hypermail/linux/net/0609.3/0005.html I suspect >> that it is more common a case than is widely recognized. >> >> Signed-off-by: Ben Woodard <woodard@redhat.com> >> Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov> > > Other issues: > > 1) 2 "u32" in the tcp_sock is a lot of space to devote to this > new state. If it can fit in 2 "u16"'s or even less space, > please use that. > > 2) the expression "(tp->foo ? : sysctl_foo)" is repeated many times > in the patch, please encapsulate it into an inline function > or similar > How does this look to you for answering your two complaints above. > I'm still torn on the fundamental issues of this patch. I think > random backoff is a better generic solution to this kind of problem. > If it works for ethernet, it might just work for TCP too :-) I haven't taken this on in this patch. I'd have to think more about how to do that and I'm not sure that introducing randomness here will allow us to settle into a steady state faster than configuring a shorter timeout in the environments that need it. > > Thanks. [-- Attachment #2: tcp-backoff-18-2.diff --] [-- Type: text/x-patch, Size: 7043 bytes --] diff -ru linux-2.6.18/include/linux/sysctl.h linux-2.6.18.new/include/linux/sysctl.h --- linux-2.6.18/include/linux/sysctl.h 2006-09-19 20:42:06.000000000 -0700 +++ linux-2.6.18.new/include/linux/sysctl.h 2006-09-26 17:10:36.000000000 -0700 @@ -411,6 +411,8 @@ NET_IPV4_TCP_WORKAROUND_SIGNED_WINDOWS=115, NET_TCP_DMA_COPYBREAK=116, NET_TCP_SLOW_START_AFTER_IDLE=117, + NET_TCP_RTO_MAX=118, + NET_TCP_RTO_INIT=119, }; enum { Only in linux-2.6.18.new/include/linux: sysctl.h.orig Only in linux-2.6.18.new/include/linux: sysctl.h.rej diff -ru linux-2.6.18/include/linux/tcp.h linux-2.6.18.new/include/linux/tcp.h --- linux-2.6.18/include/linux/tcp.h 2006-09-19 20:42:06.000000000 -0700 +++ linux-2.6.18.new/include/linux/tcp.h 2006-09-28 13:18:12.000000000 -0700 @@ -94,6 +94,8 @@ #define TCP_INFO 11 /* Information about this connection. */ #define TCP_QUICKACK 12 /* Block/reenable quick acks */ #define TCP_CONGESTION 13 /* Congestion control algorithm */ +#define TCP_BACKOFF_MAX 14 /* Maximum backoff value */ +#define TCP_BACKOFF_INIT 15 /* Initial backoff value */ #define TCPI_OPT_TIMESTAMPS 1 #define TCPI_OPT_SACK 2 @@ -257,6 +259,8 @@ __u8 frto_counter; /* Number of new acks after RTO */ __u8 nonagle; /* Disable Nagle algorithm? */ __u8 keepalive_probes; /* num of allowed keep alive probes */ + __u16 rto_max; /* Maximum backoff value */ + __u16 rto_init; /* Initial backoff value */ /* RTT measurement */ __u32 srtt; /* smoothed round trip time << 3 */ Only in linux-2.6.18.new/include/linux: tcp.h.orig diff -ru linux-2.6.18/include/net/tcp.h linux-2.6.18.new/include/net/tcp.h --- linux-2.6.18/include/net/tcp.h 2006-09-19 20:42:06.000000000 -0700 +++ linux-2.6.18.new/include/net/tcp.h 2006-09-26 17:12:04.000000000 -0700 @@ -227,6 +227,8 @@ extern int sysctl_tcp_base_mss; extern int sysctl_tcp_workaround_signed_windows; extern int sysctl_tcp_slow_start_after_idle; +extern int sysctl_tcp_rto_max; +extern int sysctl_tcp_rto_init; extern atomic_t tcp_memory_allocated; extern atomic_t tcp_sockets_allocated; Only in linux-2.6.18.new/include/net: tcp.h.orig Only in linux-2.6.18.new/include/net: tcp.h.rej diff -ru linux-2.6.18/net/ipv4/sysctl_net_ipv4.c linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c --- linux-2.6.18/net/ipv4/sysctl_net_ipv4.c 2006-09-19 20:42:06.000000000 -0700 +++ linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c 2006-09-26 17:08:33.000000000 -0700 @@ -697,6 +697,22 @@ .mode = 0644, .proc_handler = &proc_dointvec }, + { + .ctl_name = NET_TCP_RTO_MAX, + .procname = "tcp_rto_max", + .data = &sysctl_tcp_rto_max, + .maxlen = sizeof(unsigned), + .mode = 0644, + .proc_handler = &proc_dointvec + }, + { + .ctl_name = NET_TCP_RTO_INIT, + .procname = "tcp_rto_init", + .data = &sysctl_tcp_rto_init, + .maxlen = sizeof(unsigned), + .mode = 0644, + .proc_handler = &proc_dointvec + }, { .ctl_name = 0 } }; Only in linux-2.6.18.new/net/ipv4: sysctl_net_ipv4.c.orig diff -ru linux-2.6.18/net/ipv4/tcp.c linux-2.6.18.new/net/ipv4/tcp.c --- linux-2.6.18/net/ipv4/tcp.c 2006-09-19 20:42:06.000000000 -0700 +++ linux-2.6.18.new/net/ipv4/tcp.c 2006-09-28 13:56:24.000000000 -0700 @@ -1764,6 +1764,8 @@ return err; } +#define TCP_BACKOFF_MAXVAL 65535 + /* * Socket option code for TCP. */ @@ -1939,6 +1941,21 @@ } break; + case TCP_BACKOFF_MAX: + if (val < 1 || val > TCP_BACKOFF_MAXVAL) + err = -EINVAL; + else + tp->rto_max = val * HZ; + break; + + case TCP_BACKOFF_INIT: + if (val < 1 || val > TCP_BACKOFF_MAXVAL) + err = -EINVAL; + else + tp->rto_init = val * HZ; + break; + + default: err = -ENOPROTOOPT; break; @@ -2110,6 +2127,12 @@ if (copy_to_user(optval, icsk->icsk_ca_ops->name, len)) return -EFAULT; return 0; + case TCP_BACKOFF_MAX: + val = (tp->rto_max ? : sysctl_tcp_rto_max) / HZ; + break; + case TCP_BACKOFF_INIT: + val = (tp->rto_init ? : sysctl_tcp_rto_init) / HZ; + break; default: return -ENOPROTOOPT; }; Only in linux-2.6.18.new/net/ipv4: tcp.c.orig diff -ru linux-2.6.18/net/ipv4/tcp_timer.c linux-2.6.18.new/net/ipv4/tcp_timer.c --- linux-2.6.18/net/ipv4/tcp_timer.c 2006-09-19 20:42:06.000000000 -0700 +++ linux-2.6.18.new/net/ipv4/tcp_timer.c 2006-09-28 16:41:39.000000000 -0700 @@ -31,11 +31,21 @@ int sysctl_tcp_retries1 = TCP_RETR1; int sysctl_tcp_retries2 = TCP_RETR2; int sysctl_tcp_orphan_retries; +int sysctl_tcp_rto_max = TCP_RTO_MAX; +int sysctl_tcp_rto_init = TCP_TIMEOUT_INIT; static void tcp_write_timer(unsigned long); static void tcp_delack_timer(unsigned long); static void tcp_keepalive_timer (unsigned long data); +static inline __u16 rto_max(struct tcp_sock *tp){ + return tp->rto_max ? : sysctl_tcp_rto_max; +} + +static inline __u16 rto_init(struct tcp_sock *tp){ + return tp->rto_init ? : sysctl_tcp_rto_init; +} + void tcp_init_xmit_timers(struct sock *sk) { inet_csk_init_xmit_timers(sk, &tcp_write_timer, &tcp_delack_timer, @@ -71,7 +81,7 @@ /* If peer does not open window for long time, or did not transmit * anything for long time, penalize it. */ - if ((s32)(tcp_time_stamp - tp->lsndtime) > 2*TCP_RTO_MAX || !do_reset) + if ((s32)(tcp_time_stamp - tp->lsndtime) > 2*rto_max(tp) || !do_reset) orphans <<= 1; /* If some dubious ICMP arrived, penalize even more. */ @@ -256,8 +266,8 @@ max_probes = sysctl_tcp_retries2; if (sock_flag(sk, SOCK_DEAD)) { - const int alive = ((icsk->icsk_rto << icsk->icsk_backoff) < TCP_RTO_MAX); - + const int alive = ((icsk->icsk_rto << icsk->icsk_backoff) < + rto_max(tp)); max_probes = tcp_orphan_retries(sk, alive); if (tcp_out_of_resources(sk, alive || icsk->icsk_probes_out <= max_probes)) @@ -301,7 +311,7 @@ inet->num, tp->snd_una, tp->snd_nxt); } #endif - if (tcp_time_stamp - tp->rcv_tstamp > TCP_RTO_MAX) { + if (tcp_time_stamp - tp->rcv_tstamp > rto_max(tp)) { tcp_write_err(sk); goto out; } @@ -373,7 +383,8 @@ out_reset_timer: icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX); - inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX); + inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, + rto_max(tp)); if (icsk->icsk_retransmits > sysctl_tcp_retries1) __sk_dst_reset(sk); @@ -427,8 +438,8 @@ static void tcp_synack_timer(struct sock *sk) { - inet_csk_reqsk_queue_prune(sk, TCP_SYNQ_INTERVAL, - TCP_TIMEOUT_INIT, TCP_RTO_MAX); + inet_csk_reqsk_queue_prune(sk, TCP_SYNQ_INTERVAL, + TCP_TIMEOUT_INIT, rto_max(tcp_sk(sk))); } void tcp_set_keepalive(struct sock *sk, int val) ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] Customizable TCP backoff patch 2006-10-03 18:14 ` Ben Woodard @ 2006-10-04 7:07 ` David Miller 2006-10-04 8:56 ` Ingo Oeser ` (2 more replies) 0 siblings, 3 replies; 17+ messages in thread From: David Miller @ 2006-10-04 7:07 UTC (permalink / raw) To: woodard; +Cc: netdev, mgrondona, behlendorf1 From: Ben Woodard <woodard@redhat.com> Date: Tue, 03 Oct 2006 11:14:38 -0700 > > Other issues: > > > > 1) 2 "u32" in the tcp_sock is a lot of space to devote to this > > new state. If it can fit in 2 "u16"'s or even less space, > > please use that. > > > > 2) the expression "(tp->foo ? : sysctl_foo)" is repeated many times > > in the patch, please encapsulate it into an inline function > > or similar > > > > How does this look to you for answering your two complaints above. It looks a lot better. I'd like you to take #2 further, put the inline function into net/tcp.h and use it in the tcp.c instances too. Also, please don't format code like this: +static inline __u16 rto_max(struct tcp_sock *tp){ + return tp->rto_max ? : sysctl_tcp_rto_max; +} + +static inline __u16 rto_init(struct tcp_sock *tp){ + return tp->rto_init ? : sysctl_tcp_rto_init; +} The openning brace of each functions belongs on a line by itself, thanks. Finally, I'm not so sure "seconds" is the best unit for the socket option. In fact you use "seconds" as the unit for the socket option and the sysctl is raw jiffies. That's inconsistency is really problematic. At the very least, seconds might not be fine enough granularity for some circumstances. Heck, the default RTO_MIN is 1/5 of a second. :-) I also understand that going to milliseconds or microseconds would make the size of the in-socket struct members an issue again. These things are never easy are they? :-/ Any better ideas? ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] Customizable TCP backoff patch 2006-10-04 7:07 ` David Miller @ 2006-10-04 8:56 ` Ingo Oeser 2006-10-04 9:08 ` David Miller 2006-10-04 17:17 ` Stephen Hemminger 2006-10-11 1:46 ` Ben Woodard 2 siblings, 1 reply; 17+ messages in thread From: Ingo Oeser @ 2006-10-04 8:56 UTC (permalink / raw) To: David Miller; +Cc: woodard, netdev, mgrondona, behlendorf1 David Miller wrote: > At the very least, seconds might not be fine enough granularity > for some circumstances. Heck, the default RTO_MIN is 1/5 of a > second. :-) > > I also understand that going to milliseconds or microseconds would > make the size of the in-socket struct members an issue again. These > things are never easy are they? :-/ Would be, if floating point values would be allowed. Single precision would be enough in this case and its just 32 bits IIRC. I mean floating point values just for the user->kernel ABI and NOT for the internal timer representation. Regards Ingo Oeser ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] Customizable TCP backoff patch 2006-10-04 8:56 ` Ingo Oeser @ 2006-10-04 9:08 ` David Miller 0 siblings, 0 replies; 17+ messages in thread From: David Miller @ 2006-10-04 9:08 UTC (permalink / raw) To: netdev; +Cc: woodard, netdev, mgrondona, behlendorf1 From: Ingo Oeser <netdev@axxeo.de> Date: Wed, 4 Oct 2006 10:56:12 +0200 > David Miller wrote: > > At the very least, seconds might not be fine enough granularity > > for some circumstances. Heck, the default RTO_MIN is 1/5 of a > > second. :-) > > > > I also understand that going to milliseconds or microseconds would > > make the size of the in-socket struct members an issue again. These > > things are never easy are they? :-/ > > Would be, if floating point values would be allowed. Single precision would > be enough in this case and its just 32 bits IIRC. > > I mean floating point values just for the user->kernel ABI and > NOT for the internal timer representation. We got the struct members down to 16-bits each, the problem is in fact that we'd need to make them 32-bits again. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] Customizable TCP backoff patch 2006-10-04 7:07 ` David Miller 2006-10-04 8:56 ` Ingo Oeser @ 2006-10-04 17:17 ` Stephen Hemminger 2006-10-11 1:46 ` Ben Woodard 2 siblings, 0 replies; 17+ messages in thread From: Stephen Hemminger @ 2006-10-04 17:17 UTC (permalink / raw) To: David Miller; +Cc: woodard, netdev, mgrondona, behlendorf1 On Wed, 04 Oct 2006 00:07:22 -0700 (PDT) David Miller <davem@davemloft.net> wrote: > From: Ben Woodard <woodard@redhat.com> > Date: Tue, 03 Oct 2006 11:14:38 -0700 > > > > Other issues: > > > > > > 1) 2 "u32" in the tcp_sock is a lot of space to devote to this > > > new state. If it can fit in 2 "u16"'s or even less space, > > > please use that. > > > > > > 2) the expression "(tp->foo ? : sysctl_foo)" is repeated many times > > > in the patch, please encapsulate it into an inline function > > > or similar > > > > > > > How does this look to you for answering your two complaints above. > > It looks a lot better. I'd like you to take #2 further, > put the inline function into net/tcp.h and use it in the > tcp.c instances too. > > Also, please don't format code like this: > > +static inline __u16 rto_max(struct tcp_sock *tp){ > + return tp->rto_max ? : sysctl_tcp_rto_max; > +} > + > +static inline __u16 rto_init(struct tcp_sock *tp){ > + return tp->rto_init ? : sysctl_tcp_rto_init; > +} > > The openning brace of each functions belongs on a line by itself, > thanks. > > Finally, I'm not so sure "seconds" is the best unit for the > socket option. In fact you use "seconds" as the unit for > the socket option and the sysctl is raw jiffies. That's > inconsistency is really problematic. > > At the very least, seconds might not be fine enough granularity > for some circumstances. Heck, the default RTO_MIN is 1/5 of a > second. :-) > > I also understand that going to milliseconds or microseconds would > make the size of the in-socket struct members an issue again. These > things are never easy are they? :-/ > > Any better ideas? If you are willing to do a little more work, the sysctl value can be in milliseconds, and the internal socket member in some other unit like jiffies. It does require writing your own in/out translation instead of using proc_dointvec -- Stephen Hemminger <shemminger@osdl.org> ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] Customizable TCP backoff patch 2006-10-04 7:07 ` David Miller 2006-10-04 8:56 ` Ingo Oeser 2006-10-04 17:17 ` Stephen Hemminger @ 2006-10-11 1:46 ` Ben Woodard 2006-10-11 1:59 ` YOSHIFUJI Hideaki / 吉藤英明 ` (3 more replies) 2 siblings, 4 replies; 17+ messages in thread From: Ben Woodard @ 2006-10-11 1:46 UTC (permalink / raw) To: David Miller; +Cc: netdev, mgrondona, behlendorf1 [-- Attachment #1: Type: text/plain, Size: 2001 bytes --] David Miller wrote: > From: Ben Woodard <woodard@redhat.com> > Date: Tue, 03 Oct 2006 11:14:38 -0700 > >>> Other issues: >>> >>> 1) 2 "u32" in the tcp_sock is a lot of space to devote to this >>> new state. If it can fit in 2 "u16"'s or even less space, >>> please use that. >>> >>> 2) the expression "(tp->foo ? : sysctl_foo)" is repeated many times >>> in the patch, please encapsulate it into an inline function >>> or similar >>> >> How does this look to you for answering your two complaints above. > > It looks a lot better. I'd like you to take #2 further, > put the inline function into net/tcp.h and use it in the > tcp.c instances too. Done. I orginally didn't want to grow the header files with two more functions that weren't use many places. Anyway this is fixed. > > Also, please don't format code like this: > > +static inline __u16 rto_max(struct tcp_sock *tp){ > + return tp->rto_max ? : sysctl_tcp_rto_max; > +} > + > +static inline __u16 rto_init(struct tcp_sock *tp){ > + return tp->rto_init ? : sysctl_tcp_rto_init; > +} > > The opening brace of each functions belongs on a line by itself, > thanks. Fixed. > > Finally, I'm not so sure "seconds" is the best unit for the > socket option. In fact you use "seconds" as the unit for > the socket option and the sysctl is raw jiffies. That's > inconsistency is really problematic. > > At the very least, seconds might not be fine enough granularity > for some circumstances. Heck, the default RTO_MIN is 1/5 of a > second. :-) > > I also understand that going to milliseconds or microseconds would > make the size of the in-socket struct members an issue again. These > things are never easy are they? :-/ > > Any better ideas? Here is what I've done. I store the values in the socket structure as ms so that they will fit into 16 bits and as jiffies in the global variables. Then I populate the global variables using proc_doulongvec_ms_jiffies() How does that look to you? -ben [-- Attachment #2: tcp-backoff-18-4.diff --] [-- Type: text/x-patch, Size: 7119 bytes --] diff -ru linux-2.6.18/include/linux/sysctl.h linux-2.6.18.new/include/linux/sysctl.h --- linux-2.6.18/include/linux/sysctl.h 2006-09-19 20:42:06.000000000 -0700 +++ linux-2.6.18.new/include/linux/sysctl.h 2006-09-26 17:10:36.000000000 -0700 @@ -411,6 +411,8 @@ NET_IPV4_TCP_WORKAROUND_SIGNED_WINDOWS=115, NET_TCP_DMA_COPYBREAK=116, NET_TCP_SLOW_START_AFTER_IDLE=117, + NET_TCP_RTO_MAX=118, + NET_TCP_RTO_INIT=119, }; enum { diff -ru linux-2.6.18/include/linux/tcp.h linux-2.6.18.new/include/linux/tcp.h --- linux-2.6.18/include/linux/tcp.h 2006-09-19 20:42:06.000000000 -0700 +++ linux-2.6.18.new/include/linux/tcp.h 2006-10-10 17:01:25.000000000 -0700 @@ -94,6 +94,8 @@ #define TCP_INFO 11 /* Information about this connection. */ #define TCP_QUICKACK 12 /* Block/reenable quick acks */ #define TCP_CONGESTION 13 /* Congestion control algorithm */ +#define TCP_BACKOFF_MAX 14 /* Maximum backoff value */ +#define TCP_BACKOFF_INIT 15 /* Initial backoff value */ #define TCPI_OPT_TIMESTAMPS 1 #define TCPI_OPT_SACK 2 @@ -257,6 +259,8 @@ __u8 frto_counter; /* Number of new acks after RTO */ __u8 nonagle; /* Disable Nagle algorithm? */ __u8 keepalive_probes; /* num of allowed keep alive probes */ + __u16 rto_max; /* Maximum backoff value in ms */ + __u16 rto_init; /* Initial backoff value in ms */ /* RTT measurement */ __u32 srtt; /* smoothed round trip time << 3 */ diff -ru linux-2.6.18/include/net/tcp.h linux-2.6.18.new/include/net/tcp.h --- linux-2.6.18/include/net/tcp.h 2006-09-19 20:42:06.000000000 -0700 +++ linux-2.6.18.new/include/net/tcp.h 2006-10-10 18:42:00.000000000 -0700 @@ -227,11 +227,23 @@ extern int sysctl_tcp_base_mss; extern int sysctl_tcp_workaround_signed_windows; extern int sysctl_tcp_slow_start_after_idle; +extern unsigned long sysctl_tcp_rto_max; +extern unsigned long sysctl_tcp_rto_init; extern atomic_t tcp_memory_allocated; extern atomic_t tcp_sockets_allocated; extern int tcp_memory_pressure; +static inline unsigned long tcp_rto_max(struct tcp_sock *tp) +{ + return tp->rto_max ? tp->rto_max*HZ/1000 : sysctl_tcp_rto_max; +} + +static inline unsigned long tcp_rto_init(struct tcp_sock *tp) +{ + return tp->rto_init ? tp->rto_init*HZ/1000 : sysctl_tcp_rto_init; +} + /* * The next routines deal with comparing 32 bit unsigned ints * and worry about wraparound (automatic with unsigned arithmetic). diff -ru linux-2.6.18/net/ipv4/sysctl_net_ipv4.c linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c --- linux-2.6.18/net/ipv4/sysctl_net_ipv4.c 2006-09-19 20:42:06.000000000 -0700 +++ linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c 2006-10-10 16:32:08.000000000 -0700 @@ -128,6 +128,8 @@ return ret; } +static unsigned long tcp_rto_min=0; +static unsigned long tcp_rto_max=65535; ctl_table ipv4_table[] = { { @@ -697,6 +699,26 @@ .mode = 0644, .proc_handler = &proc_dointvec }, + { + .ctl_name = NET_TCP_RTO_MAX, + .procname = "tcp_rto_max", + .data = &sysctl_tcp_rto_max, + .maxlen = sizeof(unsigned), + .mode = 0644, + .proc_handler = &proc_doulongvec_ms_jiffies, + .extra1 = &tcp_rto_min_constant, + .extra2 = &tcp_rto_max_constant, + }, + { + .ctl_name = NET_TCP_RTO_INIT, + .procname = "tcp_rto_init", + .data = &sysctl_tcp_rto_init, + .maxlen = sizeof(unsigned), + .mode = 0644, + .proc_handler = &proc_doulongvec_ms_jiffies, + .extra1 = &tcp_rto_min_constant, + .extra2 = &tcp_rto_max_constant, + }, { .ctl_name = 0 } }; diff -ru linux-2.6.18/net/ipv4/tcp.c linux-2.6.18.new/net/ipv4/tcp.c --- linux-2.6.18/net/ipv4/tcp.c 2006-09-19 20:42:06.000000000 -0700 +++ linux-2.6.18.new/net/ipv4/tcp.c 2006-10-10 18:37:40.000000000 -0700 @@ -1764,6 +1764,8 @@ return err; } +#define TCP_BACKOFF_MAXVAL 65535 + /* * Socket option code for TCP. */ @@ -1939,6 +1941,21 @@ } break; + case TCP_BACKOFF_MAX: + if (val < 1 || val > TCP_BACKOFF_MAXVAL) + err = -EINVAL; + else + tp->rto_max = val; + break; + + case TCP_BACKOFF_INIT: + if (val < 1 || val > TCP_BACKOFF_MAXVAL) + err = -EINVAL; + else + tp->rto_init = val; + break; + + default: err = -ENOPROTOOPT; break; @@ -2110,6 +2127,12 @@ if (copy_to_user(optval, icsk->icsk_ca_ops->name, len)) return -EFAULT; return 0; + case TCP_BACKOFF_MAX: + val = tcp_rto_max(tp)*1000/HZ; + break; + case TCP_BACKOFF_INIT: + val = tcp_rto_init(tp)*1000/HZ; + break; default: return -ENOPROTOOPT; }; diff -ru linux-2.6.18/net/ipv4/tcp_timer.c linux-2.6.18.new/net/ipv4/tcp_timer.c --- linux-2.6.18/net/ipv4/tcp_timer.c 2006-09-19 20:42:06.000000000 -0700 +++ linux-2.6.18.new/net/ipv4/tcp_timer.c 2006-10-10 16:25:10.000000000 -0700 @@ -31,6 +31,8 @@ int sysctl_tcp_retries1 = TCP_RETR1; int sysctl_tcp_retries2 = TCP_RETR2; int sysctl_tcp_orphan_retries; +unsigned long sysctl_tcp_rto_max = TCP_RTO_MAX; +unsigned long sysctl_tcp_rto_init = TCP_TIMEOUT_INIT; static void tcp_write_timer(unsigned long); static void tcp_delack_timer(unsigned long); @@ -71,7 +73,8 @@ /* If peer does not open window for long time, or did not transmit * anything for long time, penalize it. */ - if ((s32)(tcp_time_stamp - tp->lsndtime) > 2*TCP_RTO_MAX || !do_reset) + if ((s32)(tcp_time_stamp - tp->lsndtime) > 2*tcp_rto_max(tp) || + !do_reset) orphans <<= 1; /* If some dubious ICMP arrived, penalize even more. */ @@ -256,8 +259,8 @@ max_probes = sysctl_tcp_retries2; if (sock_flag(sk, SOCK_DEAD)) { - const int alive = ((icsk->icsk_rto << icsk->icsk_backoff) < TCP_RTO_MAX); - + const int alive = ((icsk->icsk_rto << icsk->icsk_backoff) < + tcp_rto_max(tp)); max_probes = tcp_orphan_retries(sk, alive); if (tcp_out_of_resources(sk, alive || icsk->icsk_probes_out <= max_probes)) @@ -301,7 +304,7 @@ inet->num, tp->snd_una, tp->snd_nxt); } #endif - if (tcp_time_stamp - tp->rcv_tstamp > TCP_RTO_MAX) { + if (tcp_time_stamp - tp->rcv_tstamp > tcp_rto_max(tp)) { tcp_write_err(sk); goto out; } @@ -373,7 +376,8 @@ out_reset_timer: icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX); - inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX); + inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, + tcp_rto_max(tp)); if (icsk->icsk_retransmits > sysctl_tcp_retries1) __sk_dst_reset(sk); @@ -427,8 +431,8 @@ static void tcp_synack_timer(struct sock *sk) { - inet_csk_reqsk_queue_prune(sk, TCP_SYNQ_INTERVAL, - TCP_TIMEOUT_INIT, TCP_RTO_MAX); + inet_csk_reqsk_queue_prune(sk, TCP_SYNQ_INTERVAL, TCP_TIMEOUT_INIT, + tcp_rto_max(tcp_sk(sk))); } void tcp_set_keepalive(struct sock *sk, int val) ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] Customizable TCP backoff patch 2006-10-11 1:46 ` Ben Woodard @ 2006-10-11 1:59 ` YOSHIFUJI Hideaki / 吉藤英明 2006-10-11 2:53 ` David Miller ` (2 subsequent siblings) 3 siblings, 0 replies; 17+ messages in thread From: YOSHIFUJI Hideaki / 吉藤英明 @ 2006-10-11 1:59 UTC (permalink / raw) To: woodard; +Cc: davem, netdev, mgrondona, behlendorf1, yoshfuji In article <452C4D12.9060102@redhat.com> (at Tue, 10 Oct 2006 18:46:58 -0700), Ben Woodard <woodard@redhat.com> says: > diff -ru linux-2.6.18/net/ipv4/tcp.c linux-2.6.18.new/net/ipv4/tcp.c > --- linux-2.6.18/net/ipv4/tcp.c 2006-09-19 20:42:06.000000000 -0700 > +++ linux-2.6.18.new/net/ipv4/tcp.c 2006-10-10 18:37:40.000000000 -0700 > @@ -1939,6 +1941,21 @@ > } > break; > > + case TCP_BACKOFF_MAX: > + if (val < 1 || val > TCP_BACKOFF_MAXVAL) > + err = -EINVAL; > + else > + tp->rto_max = val; > + break; > + > + case TCP_BACKOFF_INIT: > + if (val < 1 || val > TCP_BACKOFF_MAXVAL) > + err = -EINVAL; > + else > + tp->rto_init = val; > + break; > + > + > default: > err = -ENOPROTOOPT; > break; > @@ -2110,6 +2127,12 @@ > if (copy_to_user(optval, icsk->icsk_ca_ops->name, len)) > return -EFAULT; > return 0; > + case TCP_BACKOFF_MAX: > + val = tcp_rto_max(tp)*1000/HZ; > + break; > + case TCP_BACKOFF_INIT: > + val = tcp_rto_init(tp)*1000/HZ; > + break; > default: > return -ENOPROTOOPT; > }; They look very inconsistent. Please use tp->rto_max / tp->rto_init. --yoshfuji ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] Customizable TCP backoff patch 2006-10-11 1:46 ` Ben Woodard 2006-10-11 1:59 ` YOSHIFUJI Hideaki / 吉藤英明 @ 2006-10-11 2:53 ` David Miller 2006-10-11 2:54 ` David Miller 2006-10-11 14:01 ` Vlad Yasevich 3 siblings, 0 replies; 17+ messages in thread From: David Miller @ 2006-10-11 2:53 UTC (permalink / raw) To: woodard; +Cc: netdev, mgrondona, behlendorf1 From: Ben Woodard <woodard@redhat.com> Date: Tue, 10 Oct 2006 18:46:58 -0700 > @@ -257,6 +259,8 @@ > __u8 frto_counter; /* Number of new acks after RTO */ > __u8 nonagle; /* Disable Nagle algorithm? */ > __u8 keepalive_probes; /* num of allowed keep alive probes */ > + __u16 rto_max; /* Maximum backoff value in ms */ > + __u16 rto_init; /* Initial backoff value in ms */ The existing struct members are properly indented using tabs, your patch adds members which are indented using spaces. Please fix this, thanks a lot. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] Customizable TCP backoff patch 2006-10-11 1:46 ` Ben Woodard 2006-10-11 1:59 ` YOSHIFUJI Hideaki / 吉藤英明 2006-10-11 2:53 ` David Miller @ 2006-10-11 2:54 ` David Miller 2006-10-11 14:01 ` Vlad Yasevich 3 siblings, 0 replies; 17+ messages in thread From: David Miller @ 2006-10-11 2:54 UTC (permalink / raw) To: woodard; +Cc: netdev, mgrondona, behlendorf1 Actually, your entire patch uses spaces for indentation of every new line added, not just those two new struct members I pointed out. Please use tabs for all of those cases. Thanks a lot. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] Customizable TCP backoff patch 2006-10-11 1:46 ` Ben Woodard ` (2 preceding siblings ...) 2006-10-11 2:54 ` David Miller @ 2006-10-11 14:01 ` Vlad Yasevich 2006-10-12 0:51 ` Ben Woodard 3 siblings, 1 reply; 17+ messages in thread From: Vlad Yasevich @ 2006-10-11 14:01 UTC (permalink / raw) To: Ben Woodard; +Cc: David Miller, netdev, mgrondona, behlendorf1 Ben Woodard wrote: ------------------------------------------------------------------------ > diff -ru linux-2.6.18/include/net/tcp.h linux-2.6.18.new/include/net/tcp.h > --- linux-2.6.18/include/net/tcp.h 2006-09-19 20:42:06.000000000 -0700 > +++ linux-2.6.18.new/include/net/tcp.h 2006-10-10 18:42:00.000000000 -0700 > @@ -227,11 +227,23 @@ > extern int sysctl_tcp_base_mss; > extern int sysctl_tcp_workaround_signed_windows; > extern int sysctl_tcp_slow_start_after_idle; > +extern unsigned long sysctl_tcp_rto_max; > +extern unsigned long sysctl_tcp_rto_init; > > extern atomic_t tcp_memory_allocated; > extern atomic_t tcp_sockets_allocated; > extern int tcp_memory_pressure; > > +static inline unsigned long tcp_rto_max(struct tcp_sock *tp) > +{ > + return tp->rto_max ? tp->rto_max*HZ/1000 : sysctl_tcp_rto_max; ^^^^^^^^^^^^^^^^^^^ That should probably be msecs_to_jiffies(tp->rto_max) > +} > + > +static inline unsigned long tcp_rto_init(struct tcp_sock *tp) > +{ > + return tp->rto_init ? tp->rto_init*HZ/1000 : sysctl_tcp_rto_init; Ditto. > +} > + > /* > * The next routines deal with comparing 32 bit unsigned ints > * and worry about wraparound (automatic with unsigned arithmetic). > diff -ru linux-2.6.18/net/ipv4/sysctl_net_ipv4.c linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c > --- linux-2.6.18/net/ipv4/sysctl_net_ipv4.c 2006-09-19 20:42:06.000000000 -0700 > +++ linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c 2006-10-10 16:32:08.000000000 -0700 > @@ -128,6 +128,8 @@ > return ret; > } > > +static unsigned long tcp_rto_min=0; > +static unsigned long tcp_rto_max=65535; > > ctl_table ipv4_table[] = { > { > @@ -697,6 +699,26 @@ > .mode = 0644, > .proc_handler = &proc_dointvec > }, > + { > + .ctl_name = NET_TCP_RTO_MAX, > + .procname = "tcp_rto_max", > + .data = &sysctl_tcp_rto_max, > + .maxlen = sizeof(unsigned), > + .mode = 0644, > + .proc_handler = &proc_doulongvec_ms_jiffies, > + .extra1 = &tcp_rto_min_constant, > + .extra2 = &tcp_rto_max_constant, > + }, > + { > + .ctl_name = NET_TCP_RTO_INIT, > + .procname = "tcp_rto_init", > + .data = &sysctl_tcp_rto_init, > + .maxlen = sizeof(unsigned), > + .mode = 0644, > + .proc_handler = &proc_doulongvec_ms_jiffies, > + .extra1 = &tcp_rto_min_constant, > + .extra2 = &tcp_rto_max_constant, > + }, > { .ctl_name = 0 } > }; Try as I might, I can't find proc_doulongvec_ms_jiffies() anywhere. I think you meant proc_doulongvec_ms_jiffies_minmax()? Also, this function has a bug in that it doesn't do corrections for potentially different HZ values. Look at the msecs_to_jiffies... When I've used proc_doulongvec_ms_jiffies_minmax() before, I would seen rather interesting truncation errors such that sysctl input didn't match sysctl output. > > diff -ru linux-2.6.18/net/ipv4/tcp.c linux-2.6.18.new/net/ipv4/tcp.c > --- linux-2.6.18/net/ipv4/tcp.c 2006-09-19 20:42:06.000000000 -0700 > +++ linux-2.6.18.new/net/ipv4/tcp.c 2006-10-10 18:37:40.000000000 -0700 > @@ -1764,6 +1764,8 @@ > return err; > } > > +#define TCP_BACKOFF_MAXVAL 65535 > + > /* > * Socket option code for TCP. > */ > @@ -1939,6 +1941,21 @@ > } > break; > > + case TCP_BACKOFF_MAX: > + if (val < 1 || val > TCP_BACKOFF_MAXVAL) > + err = -EINVAL; > + else > + tp->rto_max = val; > + break; > + > + case TCP_BACKOFF_INIT: > + if (val < 1 || val > TCP_BACKOFF_MAXVAL) > + err = -EINVAL; > + else > + tp->rto_init = val; > + break; > + > + > default: > err = -ENOPROTOOPT; > break; > @@ -2110,6 +2127,12 @@ > if (copy_to_user(optval, icsk->icsk_ca_ops->name, len)) > return -EFAULT; > return 0; > + case TCP_BACKOFF_MAX: > + val = tcp_rto_max(tp)*1000/HZ; > + break; > + case TCP_BACKOFF_INIT: > + val = tcp_rto_init(tp)*1000/HZ; The two divisions above introduce truncation errors such the input doesn't match the output. Better to use jiffies_to_msecs(). Regards -vlad ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] Customizable TCP backoff patch 2006-10-11 14:01 ` Vlad Yasevich @ 2006-10-12 0:51 ` Ben Woodard 2006-10-12 1:06 ` YOSHIFUJI Hideaki / 吉藤英明 0 siblings, 1 reply; 17+ messages in thread From: Ben Woodard @ 2006-10-12 0:51 UTC (permalink / raw) To: Vlad Yasevich; +Cc: David Miller, netdev, mgrondona, behlendorf1 [-- Attachment #1: Type: text/plain, Size: 4968 bytes --] Here we go again. The changes are as follows: 1) the spaces are gone and now it is tabs. 2) used msecs_to_jiffies() and jiffies_to_msecs(). I'm much happier with that. I didn't know those functions existed. 3) changed over to proc_doulongvec_ms_jiffies_minmax(). Last night's test compile failed because I was out of disk. I figured that the code I changed had compiled. I guess not. I really didn't understand Yoshifuji Hideaki's comment. (sorry) -ben Vlad Yasevich wrote: > Ben Woodard wrote: > ------------------------------------------------------------------------ >> diff -ru linux-2.6.18/include/net/tcp.h linux-2.6.18.new/include/net/tcp.h >> --- linux-2.6.18/include/net/tcp.h 2006-09-19 20:42:06.000000000 -0700 >> +++ linux-2.6.18.new/include/net/tcp.h 2006-10-10 18:42:00.000000000 -0700 >> @@ -227,11 +227,23 @@ >> extern int sysctl_tcp_base_mss; >> extern int sysctl_tcp_workaround_signed_windows; >> extern int sysctl_tcp_slow_start_after_idle; >> +extern unsigned long sysctl_tcp_rto_max; >> +extern unsigned long sysctl_tcp_rto_init; >> >> extern atomic_t tcp_memory_allocated; >> extern atomic_t tcp_sockets_allocated; >> extern int tcp_memory_pressure; >> >> +static inline unsigned long tcp_rto_max(struct tcp_sock *tp) >> +{ >> + return tp->rto_max ? tp->rto_max*HZ/1000 : sysctl_tcp_rto_max; > ^^^^^^^^^^^^^^^^^^^ > That should probably be msecs_to_jiffies(tp->rto_max) > >> +} >> + >> +static inline unsigned long tcp_rto_init(struct tcp_sock *tp) >> +{ >> + return tp->rto_init ? tp->rto_init*HZ/1000 : sysctl_tcp_rto_init; > > Ditto. > >> +} >> + >> /* >> * The next routines deal with comparing 32 bit unsigned ints >> * and worry about wraparound (automatic with unsigned arithmetic). >> diff -ru linux-2.6.18/net/ipv4/sysctl_net_ipv4.c linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c >> --- linux-2.6.18/net/ipv4/sysctl_net_ipv4.c 2006-09-19 20:42:06.000000000 -0700 >> +++ linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c 2006-10-10 16:32:08.000000000 -0700 >> @@ -128,6 +128,8 @@ >> return ret; >> } >> >> +static unsigned long tcp_rto_min=0; >> +static unsigned long tcp_rto_max=65535; >> >> ctl_table ipv4_table[] = { >> { >> @@ -697,6 +699,26 @@ >> .mode = 0644, >> .proc_handler = &proc_dointvec >> }, >> + { >> + .ctl_name = NET_TCP_RTO_MAX, >> + .procname = "tcp_rto_max", >> + .data = &sysctl_tcp_rto_max, >> + .maxlen = sizeof(unsigned), >> + .mode = 0644, >> + .proc_handler = &proc_doulongvec_ms_jiffies, >> + .extra1 = &tcp_rto_min_constant, >> + .extra2 = &tcp_rto_max_constant, >> + }, >> + { >> + .ctl_name = NET_TCP_RTO_INIT, >> + .procname = "tcp_rto_init", >> + .data = &sysctl_tcp_rto_init, >> + .maxlen = sizeof(unsigned), >> + .mode = 0644, >> + .proc_handler = &proc_doulongvec_ms_jiffies, >> + .extra1 = &tcp_rto_min_constant, >> + .extra2 = &tcp_rto_max_constant, >> + }, >> { .ctl_name = 0 } >> }; > > Try as I might, I can't find proc_doulongvec_ms_jiffies() anywhere. I think > you meant proc_doulongvec_ms_jiffies_minmax()? > > Also, this function has a bug in that it doesn't do corrections for potentially different > HZ values. Look at the msecs_to_jiffies... When I've used proc_doulongvec_ms_jiffies_minmax() > before, I would seen rather interesting truncation errors such that sysctl input didn't match > sysctl output. > >> >> diff -ru linux-2.6.18/net/ipv4/tcp.c linux-2.6.18.new/net/ipv4/tcp.c >> --- linux-2.6.18/net/ipv4/tcp.c 2006-09-19 20:42:06.000000000 -0700 >> +++ linux-2.6.18.new/net/ipv4/tcp.c 2006-10-10 18:37:40.000000000 -0700 >> @@ -1764,6 +1764,8 @@ >> return err; >> } >> >> +#define TCP_BACKOFF_MAXVAL 65535 >> + >> /* >> * Socket option code for TCP. >> */ >> @@ -1939,6 +1941,21 @@ >> } >> break; >> >> + case TCP_BACKOFF_MAX: >> + if (val < 1 || val > TCP_BACKOFF_MAXVAL) >> + err = -EINVAL; >> + else >> + tp->rto_max = val; >> + break; >> + >> + case TCP_BACKOFF_INIT: >> + if (val < 1 || val > TCP_BACKOFF_MAXVAL) >> + err = -EINVAL; >> + else >> + tp->rto_init = val; >> + break; >> + >> + >> default: >> err = -ENOPROTOOPT; >> break; >> @@ -2110,6 +2127,12 @@ >> if (copy_to_user(optval, icsk->icsk_ca_ops->name, len)) >> return -EFAULT; >> return 0; >> + case TCP_BACKOFF_MAX: >> + val = tcp_rto_max(tp)*1000/HZ; >> + break; >> + case TCP_BACKOFF_INIT: >> + val = tcp_rto_init(tp)*1000/HZ; > > The two divisions above introduce truncation errors such the input doesn't > match the output. Better to use jiffies_to_msecs(). > > Regards > -vlad [-- Attachment #2: tcp-backoff-18-5.diff --] [-- Type: text/x-patch, Size: 6796 bytes --] diff -ru linux-2.6.18/include/linux/sysctl.h linux-2.6.18.new/include/linux/sysctl.h --- linux-2.6.18/include/linux/sysctl.h 2006-09-19 20:42:06.000000000 -0700 +++ linux-2.6.18.new/include/linux/sysctl.h 2006-10-11 10:27:52.000000000 -0700 @@ -411,6 +411,8 @@ NET_IPV4_TCP_WORKAROUND_SIGNED_WINDOWS=115, NET_TCP_DMA_COPYBREAK=116, NET_TCP_SLOW_START_AFTER_IDLE=117, + NET_TCP_RTO_MAX=118, + NET_TCP_RTO_INIT=119, }; enum { diff -ru linux-2.6.18/include/linux/tcp.h linux-2.6.18.new/include/linux/tcp.h --- linux-2.6.18/include/linux/tcp.h 2006-09-19 20:42:06.000000000 -0700 +++ linux-2.6.18.new/include/linux/tcp.h 2006-10-11 10:41:40.000000000 -0700 @@ -94,6 +94,8 @@ #define TCP_INFO 11 /* Information about this connection. */ #define TCP_QUICKACK 12 /* Block/reenable quick acks */ #define TCP_CONGESTION 13 /* Congestion control algorithm */ +#define TCP_BACKOFF_MAX 14 /* Maximum backoff value */ +#define TCP_BACKOFF_INIT 15 /* Initial backoff value */ #define TCPI_OPT_TIMESTAMPS 1 #define TCPI_OPT_SACK 2 @@ -257,6 +259,8 @@ __u8 frto_counter; /* Number of new acks after RTO */ __u8 nonagle; /* Disable Nagle algorithm? */ __u8 keepalive_probes; /* num of allowed keep alive probes */ + __u16 rto_max; /* Maximum backoff value in ms */ + __u16 rto_init; /* Initial backoff value in ms */ /* RTT measurement */ __u32 srtt; /* smoothed round trip time << 3 */ diff -ru linux-2.6.18/include/net/tcp.h linux-2.6.18.new/include/net/tcp.h --- linux-2.6.18/include/net/tcp.h 2006-09-19 20:42:06.000000000 -0700 +++ linux-2.6.18.new/include/net/tcp.h 2006-10-11 17:43:23.091431000 -0700 @@ -227,11 +227,23 @@ extern int sysctl_tcp_base_mss; extern int sysctl_tcp_workaround_signed_windows; extern int sysctl_tcp_slow_start_after_idle; +extern unsigned long sysctl_tcp_rto_max; +extern unsigned long sysctl_tcp_rto_init; extern atomic_t tcp_memory_allocated; extern atomic_t tcp_sockets_allocated; extern int tcp_memory_pressure; +static inline unsigned long tcp_rto_max(struct tcp_sock *tp) +{ + return tp->rto_max ? msecs_to_jiffies(tp->rto_max) : sysctl_tcp_rto_max; +} + +static inline unsigned long tcp_rto_init(struct tcp_sock *tp) +{ + return tp->rto_init ? msecs_to_jiffies(tp->rto_init) : sysctl_tcp_rto_init; +} + /* * The next routines deal with comparing 32 bit unsigned ints * and worry about wraparound (automatic with unsigned arithmetic). Only in linux-2.6.18.new/include/net: tcp.h~ diff -ru linux-2.6.18/net/ipv4/sysctl_net_ipv4.c linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c --- linux-2.6.18/net/ipv4/sysctl_net_ipv4.c 2006-09-19 20:42:06.000000000 -0700 +++ linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c 2006-10-11 16:00:37.000000000 -0700 @@ -128,6 +128,8 @@ return ret; } +static unsigned long tcp_rto_min=0; +static unsigned long tcp_rto_max=65535; ctl_table ipv4_table[] = { { @@ -697,6 +699,26 @@ .mode = 0644, .proc_handler = &proc_dointvec }, + { + .ctl_name = NET_TCP_RTO_MAX, + .procname = "tcp_rto_max", + .data = &sysctl_tcp_rto_max, + .maxlen = sizeof(unsigned), + .mode = 0644, + .proc_handler = &proc_doulongvec_ms_jiffies_minmax, + .extra1 = &tcp_rto_min_constant, + .extra2 = &tcp_rto_max_constant, + }, + { + .ctl_name = NET_TCP_RTO_INIT, + .procname = "tcp_rto_init", + .data = &sysctl_tcp_rto_init, + .maxlen = sizeof(unsigned), + .mode = 0644, + .proc_handler = &proc_doulongvec_ms_jiffies_minmax, + .extra1 = &tcp_rto_min_constant, + .extra2 = &tcp_rto_max_constant, + }, { .ctl_name = 0 } }; diff -ru linux-2.6.18/net/ipv4/tcp.c linux-2.6.18.new/net/ipv4/tcp.c --- linux-2.6.18/net/ipv4/tcp.c 2006-09-19 20:42:06.000000000 -0700 +++ linux-2.6.18.new/net/ipv4/tcp.c 2006-10-11 16:00:37.000000000 -0700 @@ -1764,6 +1764,8 @@ return err; } +#define TCP_BACKOFF_MAXVAL 65535 + /* * Socket option code for TCP. */ @@ -1939,6 +1941,20 @@ } break; + case TCP_BACKOFF_MAX: + if (val < 1 || val > TCP_BACKOFF_MAXVAL) + err = -EINVAL; + else + tp->rto_max = val; + break; + + case TCP_BACKOFF_INIT: + if (val < 1 || val > TCP_BACKOFF_MAXVAL) + err = -EINVAL; + else + tp->rto_init = val; + break; + default: err = -ENOPROTOOPT; break; @@ -2110,6 +2126,12 @@ if (copy_to_user(optval, icsk->icsk_ca_ops->name, len)) return -EFAULT; return 0; + case TCP_BACKOFF_MAX: + val = jiffies_to_msecs(tcp_rto_max(tp)); + break; + case TCP_BACKOFF_INIT: + val = jiffies_to_msecs(tcp_rto_init(tp)); + break; default: return -ENOPROTOOPT; }; diff -ru linux-2.6.18/net/ipv4/tcp_timer.c linux-2.6.18.new/net/ipv4/tcp_timer.c --- linux-2.6.18/net/ipv4/tcp_timer.c 2006-09-19 20:42:06.000000000 -0700 +++ linux-2.6.18.new/net/ipv4/tcp_timer.c 2006-10-11 10:41:39.000000000 -0700 @@ -31,6 +31,8 @@ int sysctl_tcp_retries1 = TCP_RETR1; int sysctl_tcp_retries2 = TCP_RETR2; int sysctl_tcp_orphan_retries; +unsigned long sysctl_tcp_rto_max = TCP_RTO_MAX; +unsigned long sysctl_tcp_rto_init = TCP_TIMEOUT_INIT; static void tcp_write_timer(unsigned long); static void tcp_delack_timer(unsigned long); @@ -71,7 +73,8 @@ /* If peer does not open window for long time, or did not transmit * anything for long time, penalize it. */ - if ((s32)(tcp_time_stamp - tp->lsndtime) > 2*TCP_RTO_MAX || !do_reset) + if ((s32)(tcp_time_stamp - tp->lsndtime) > 2*tcp_rto_max(tp) || + !do_reset) orphans <<= 1; /* If some dubious ICMP arrived, penalize even more. */ @@ -256,8 +259,8 @@ max_probes = sysctl_tcp_retries2; if (sock_flag(sk, SOCK_DEAD)) { - const int alive = ((icsk->icsk_rto << icsk->icsk_backoff) < TCP_RTO_MAX); - + const int alive = ((icsk->icsk_rto << icsk->icsk_backoff) < + tcp_rto_max(tp)); max_probes = tcp_orphan_retries(sk, alive); if (tcp_out_of_resources(sk, alive || icsk->icsk_probes_out <= max_probes)) @@ -301,7 +304,7 @@ inet->num, tp->snd_una, tp->snd_nxt); } #endif - if (tcp_time_stamp - tp->rcv_tstamp > TCP_RTO_MAX) { + if (tcp_time_stamp - tp->rcv_tstamp > tcp_rto_max(tp)) { tcp_write_err(sk); goto out; } @@ -373,7 +376,8 @@ out_reset_timer: icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX); - inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX); + inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, + tcp_rto_max(tp)); if (icsk->icsk_retransmits > sysctl_tcp_retries1) __sk_dst_reset(sk); @@ -427,8 +431,8 @@ static void tcp_synack_timer(struct sock *sk) { - inet_csk_reqsk_queue_prune(sk, TCP_SYNQ_INTERVAL, - TCP_TIMEOUT_INIT, TCP_RTO_MAX); + inet_csk_reqsk_queue_prune(sk, TCP_SYNQ_INTERVAL, TCP_TIMEOUT_INIT, + tcp_rto_max(tcp_sk(sk))); } void tcp_set_keepalive(struct sock *sk, int val) ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] Customizable TCP backoff patch 2006-10-12 0:51 ` Ben Woodard @ 2006-10-12 1:06 ` YOSHIFUJI Hideaki / 吉藤英明 2006-10-12 15:49 ` Ben Woodard 0 siblings, 1 reply; 17+ messages in thread From: YOSHIFUJI Hideaki / 吉藤英明 @ 2006-10-12 1:06 UTC (permalink / raw) To: woodard; +Cc: vladislav.yasevich, davem, netdev, mgrondona, behlendorf1, yoshfuji In article <452D918C.6050300@redhat.com> (at Wed, 11 Oct 2006 17:51:24 -0700), Ben Woodard <woodard@redhat.com> says: > diff -ru linux-2.6.18/net/ipv4/sysctl_net_ipv4.c linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c > --- linux-2.6.18/net/ipv4/sysctl_net_ipv4.c 2006-09-19 20:42:06.000000000 -0700 > +++ linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c 2006-10-11 16:00:37.000000000 -0700 > @@ -128,6 +128,8 @@ > return ret; > } > > +static unsigned long tcp_rto_min=0; > +static unsigned long tcp_rto_max=65535; > > ctl_table ipv4_table[] = { > { > @@ -697,6 +699,26 @@ > .mode = 0644, > .proc_handler = &proc_dointvec > }, > + { > + .ctl_name = NET_TCP_RTO_MAX, > + .procname = "tcp_rto_max", > + .data = &sysctl_tcp_rto_max, > + .maxlen = sizeof(unsigned), sizeof(unsigned long) > + .mode = 0644, > + .proc_handler = &proc_doulongvec_ms_jiffies_minmax, > + .extra1 = &tcp_rto_min_constant, > + .extra2 = &tcp_rto_max_constant, > + }, > + { > + .ctl_name = NET_TCP_RTO_INIT, > + .procname = "tcp_rto_init", > + .data = &sysctl_tcp_rto_init, > + .maxlen = sizeof(unsigned), sizeof(unsigned long) > + .mode = 0644, > + .proc_handler = &proc_doulongvec_ms_jiffies_minmax, > + .extra1 = &tcp_rto_min_constant, > + .extra2 = &tcp_rto_max_constant, > + }, > { .ctl_name = 0 } > }; > > diff -ru linux-2.6.18/net/ipv4/tcp.c linux-2.6.18.new/net/ipv4/tcp.c > --- linux-2.6.18/net/ipv4/tcp.c 2006-09-19 20:42:06.000000000 -0700 > +++ linux-2.6.18.new/net/ipv4/tcp.c 2006-10-11 16:00:37.000000000 -0700 > @@ -1764,6 +1764,8 @@ > return err; > } > > +#define TCP_BACKOFF_MAXVAL 65535 > + > /* > * Socket option code for TCP. > */ > @@ -1939,6 +1941,20 @@ > } > break; > > + case TCP_BACKOFF_MAX: > + if (val < 1 || val > TCP_BACKOFF_MAXVAL) > + err = -EINVAL; > + else > + tp->rto_max = val; > + break; > + > + case TCP_BACKOFF_INIT: > + if (val < 1 || val > TCP_BACKOFF_MAXVAL) > + err = -EINVAL; > + else > + tp->rto_init = val; > + break; > + > default: > err = -ENOPROTOOPT; > break; > @@ -2110,6 +2126,12 @@ > if (copy_to_user(optval, icsk->icsk_ca_ops->name, len)) > return -EFAULT; > return 0; > + case TCP_BACKOFF_MAX: > + val = jiffies_to_msecs(tcp_rto_max(tp)); > + break; tp->rto_max > + case TCP_BACKOFF_INIT: > + val = jiffies_to_msecs(tcp_rto_init(tp)); > + break; > default: tp->rto_init --yoshfuji ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] Customizable TCP backoff patch 2006-10-12 1:06 ` YOSHIFUJI Hideaki / 吉藤英明 @ 2006-10-12 15:49 ` Ben Woodard 0 siblings, 0 replies; 17+ messages in thread From: Ben Woodard @ 2006-10-12 15:49 UTC (permalink / raw) To: YOSHIFUJI Hideaki / 吉藤英明 Cc: vladislav.yasevich, davem, netdev, mgrondona, behlendorf1 [-- Attachment #1: Type: text/plain, Size: 933 bytes --] YOSHIFUJI Hideaki / 吉藤英明 wrote: >> + .data = &sysctl_tcp_rto_max, >> + .maxlen = sizeof(unsigned), > > sizeof(unsigned long) Good catch. That would have corrupted things badly on some 64b platforms. With all the flux in the area I forgot to change the size of that but would have been OK on the ia32 boxes. >> diff -ru linux-2.6.18/net/ipv4/tcp.c linux-2.6.18.new/net/ipv4/tcp.c >> --- linux-2.6.18/net/ipv4/tcp.c 2006-09-19 20:42:06.000000000 -0700 >> +++ linux-2.6.18.new/net/ipv4/tcp.c 2006-10-11 16:00:37.000000000 -0700 >> @@ -2110,6 +2126,12 @@ >> if (copy_to_user(optval, icsk->icsk_ca_ops->name, len)) >> return -EFAULT; >> return 0; >> + case TCP_BACKOFF_MAX: >> + val = jiffies_to_msecs(tcp_rto_max(tp)); >> + break; > > tp->rto_max > >> + case TCP_BACKOFF_INIT: >> + val = jiffies_to_msecs(tcp_rto_init(tp)); >> + break; >> default: > > tp->rto_init OK I get it now. > > --yoshfuji [-- Attachment #2: tcp-backoff-18-6.diff --] [-- Type: text/x-patch, Size: 6881 bytes --] diff -ru linux-2.6.18/include/linux/sysctl.h linux-2.6.18.new/include/linux/sysctl.h --- linux-2.6.18/include/linux/sysctl.h 2006-09-19 20:42:06.000000000 -0700 +++ linux-2.6.18.new/include/linux/sysctl.h 2006-10-11 10:27:52.000000000 -0700 @@ -411,6 +411,8 @@ NET_IPV4_TCP_WORKAROUND_SIGNED_WINDOWS=115, NET_TCP_DMA_COPYBREAK=116, NET_TCP_SLOW_START_AFTER_IDLE=117, + NET_TCP_RTO_MAX=118, + NET_TCP_RTO_INIT=119, }; enum { diff -ru linux-2.6.18/include/linux/tcp.h linux-2.6.18.new/include/linux/tcp.h --- linux-2.6.18/include/linux/tcp.h 2006-09-19 20:42:06.000000000 -0700 +++ linux-2.6.18.new/include/linux/tcp.h 2006-10-12 08:28:52.645411000 -0700 @@ -94,6 +94,8 @@ #define TCP_INFO 11 /* Information about this connection. */ #define TCP_QUICKACK 12 /* Block/reenable quick acks */ #define TCP_CONGESTION 13 /* Congestion control algorithm */ +#define TCP_BACKOFF_MAX 14 /* Maximum backoff value */ +#define TCP_BACKOFF_INIT 15 /* Initial backoff value */ #define TCPI_OPT_TIMESTAMPS 1 #define TCPI_OPT_SACK 2 @@ -257,6 +259,8 @@ __u8 frto_counter; /* Number of new acks after RTO */ __u8 nonagle; /* Disable Nagle algorithm? */ __u8 keepalive_probes; /* num of allowed keep alive probes */ + __u16 rto_max; /* Maximum backoff value in ms */ + __u16 rto_init; /* Initial backoff value in ms */ /* RTT measurement */ __u32 srtt; /* smoothed round trip time << 3 */ Only in linux-2.6.18.new/include/linux: tcp.h~ diff -ru linux-2.6.18/include/net/tcp.h linux-2.6.18.new/include/net/tcp.h --- linux-2.6.18/include/net/tcp.h 2006-09-19 20:42:06.000000000 -0700 +++ linux-2.6.18.new/include/net/tcp.h 2006-10-11 17:43:23.091431000 -0700 @@ -227,11 +227,23 @@ extern int sysctl_tcp_base_mss; extern int sysctl_tcp_workaround_signed_windows; extern int sysctl_tcp_slow_start_after_idle; +extern unsigned long sysctl_tcp_rto_max; +extern unsigned long sysctl_tcp_rto_init; extern atomic_t tcp_memory_allocated; extern atomic_t tcp_sockets_allocated; extern int tcp_memory_pressure; +static inline unsigned long tcp_rto_max(struct tcp_sock *tp) +{ + return tp->rto_max ? msecs_to_jiffies(tp->rto_max) : sysctl_tcp_rto_max; +} + +static inline unsigned long tcp_rto_init(struct tcp_sock *tp) +{ + return tp->rto_init ? msecs_to_jiffies(tp->rto_init) : sysctl_tcp_rto_init; +} + /* * The next routines deal with comparing 32 bit unsigned ints * and worry about wraparound (automatic with unsigned arithmetic). Only in linux-2.6.18.new/include/net: tcp.h~ diff -ru linux-2.6.18/net/ipv4/sysctl_net_ipv4.c linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c --- linux-2.6.18/net/ipv4/sysctl_net_ipv4.c 2006-09-19 20:42:06.000000000 -0700 +++ linux-2.6.18.new/net/ipv4/sysctl_net_ipv4.c 2006-10-12 07:14:41.871910000 -0700 @@ -128,6 +128,8 @@ return ret; } +static unsigned long tcp_rto_min=0; +static unsigned long tcp_rto_max=65535; ctl_table ipv4_table[] = { { @@ -697,6 +699,26 @@ .mode = 0644, .proc_handler = &proc_dointvec }, + { + .ctl_name = NET_TCP_RTO_MAX, + .procname = "tcp_rto_max", + .data = &sysctl_tcp_rto_max, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = &proc_doulongvec_ms_jiffies_minmax, + .extra1 = &tcp_rto_min_constant, + .extra2 = &tcp_rto_max_constant, + }, + { + .ctl_name = NET_TCP_RTO_INIT, + .procname = "tcp_rto_init", + .data = &sysctl_tcp_rto_init, + .maxlen = sizeof(unsigned long), + .mode = 0644, + .proc_handler = &proc_doulongvec_ms_jiffies_minmax, + .extra1 = &tcp_rto_min_constant, + .extra2 = &tcp_rto_max_constant, + }, { .ctl_name = 0 } }; Only in linux-2.6.18.new/net/ipv4: sysctl_net_ipv4.c~ diff -ru linux-2.6.18/net/ipv4/tcp.c linux-2.6.18.new/net/ipv4/tcp.c --- linux-2.6.18/net/ipv4/tcp.c 2006-09-19 20:42:06.000000000 -0700 +++ linux-2.6.18.new/net/ipv4/tcp.c 2006-10-12 07:18:01.193083000 -0700 @@ -1764,6 +1764,8 @@ return err; } +#define TCP_BACKOFF_MAXVAL 65535 + /* * Socket option code for TCP. */ @@ -1939,6 +1941,20 @@ } break; + case TCP_BACKOFF_MAX: + if (val < 1 || val > TCP_BACKOFF_MAXVAL) + err = -EINVAL; + else + tp->rto_max = val; + break; + + case TCP_BACKOFF_INIT: + if (val < 1 || val > TCP_BACKOFF_MAXVAL) + err = -EINVAL; + else + tp->rto_init = val; + break; + default: err = -ENOPROTOOPT; break; @@ -2110,6 +2126,12 @@ if (copy_to_user(optval, icsk->icsk_ca_ops->name, len)) return -EFAULT; return 0; + case TCP_BACKOFF_MAX: + val = tp->rto_max; + break; + case TCP_BACKOFF_INIT: + val = tp->rto_init; + break; default: return -ENOPROTOOPT; }; Only in linux-2.6.18.new/net/ipv4: tcp.c~ diff -ru linux-2.6.18/net/ipv4/tcp_timer.c linux-2.6.18.new/net/ipv4/tcp_timer.c --- linux-2.6.18/net/ipv4/tcp_timer.c 2006-09-19 20:42:06.000000000 -0700 +++ linux-2.6.18.new/net/ipv4/tcp_timer.c 2006-10-11 10:41:39.000000000 -0700 @@ -31,6 +31,8 @@ int sysctl_tcp_retries1 = TCP_RETR1; int sysctl_tcp_retries2 = TCP_RETR2; int sysctl_tcp_orphan_retries; +unsigned long sysctl_tcp_rto_max = TCP_RTO_MAX; +unsigned long sysctl_tcp_rto_init = TCP_TIMEOUT_INIT; static void tcp_write_timer(unsigned long); static void tcp_delack_timer(unsigned long); @@ -71,7 +73,8 @@ /* If peer does not open window for long time, or did not transmit * anything for long time, penalize it. */ - if ((s32)(tcp_time_stamp - tp->lsndtime) > 2*TCP_RTO_MAX || !do_reset) + if ((s32)(tcp_time_stamp - tp->lsndtime) > 2*tcp_rto_max(tp) || + !do_reset) orphans <<= 1; /* If some dubious ICMP arrived, penalize even more. */ @@ -256,8 +259,8 @@ max_probes = sysctl_tcp_retries2; if (sock_flag(sk, SOCK_DEAD)) { - const int alive = ((icsk->icsk_rto << icsk->icsk_backoff) < TCP_RTO_MAX); - + const int alive = ((icsk->icsk_rto << icsk->icsk_backoff) < + tcp_rto_max(tp)); max_probes = tcp_orphan_retries(sk, alive); if (tcp_out_of_resources(sk, alive || icsk->icsk_probes_out <= max_probes)) @@ -301,7 +304,7 @@ inet->num, tp->snd_una, tp->snd_nxt); } #endif - if (tcp_time_stamp - tp->rcv_tstamp > TCP_RTO_MAX) { + if (tcp_time_stamp - tp->rcv_tstamp > tcp_rto_max(tp)) { tcp_write_err(sk); goto out; } @@ -373,7 +376,8 @@ out_reset_timer: icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX); - inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX); + inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, + tcp_rto_max(tp)); if (icsk->icsk_retransmits > sysctl_tcp_retries1) __sk_dst_reset(sk); @@ -427,8 +431,8 @@ static void tcp_synack_timer(struct sock *sk) { - inet_csk_reqsk_queue_prune(sk, TCP_SYNQ_INTERVAL, - TCP_TIMEOUT_INIT, TCP_RTO_MAX); + inet_csk_reqsk_queue_prune(sk, TCP_SYNQ_INTERVAL, TCP_TIMEOUT_INIT, + tcp_rto_max(tcp_sk(sk))); } void tcp_set_keepalive(struct sock *sk, int val) ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2006-10-12 15:49 UTC | newest] Thread overview: 17+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2006-09-27 18:52 [PATCH] Customizable TCP backoff patch Ben Woodard 2006-09-27 23:16 ` David Miller 2006-09-27 23:00 ` Stephen Hemminger 2006-09-28 2:06 ` David Miller 2006-10-03 18:14 ` Ben Woodard 2006-10-04 7:07 ` David Miller 2006-10-04 8:56 ` Ingo Oeser 2006-10-04 9:08 ` David Miller 2006-10-04 17:17 ` Stephen Hemminger 2006-10-11 1:46 ` Ben Woodard 2006-10-11 1:59 ` YOSHIFUJI Hideaki / 吉藤英明 2006-10-11 2:53 ` David Miller 2006-10-11 2:54 ` David Miller 2006-10-11 14:01 ` Vlad Yasevich 2006-10-12 0:51 ` Ben Woodard 2006-10-12 1:06 ` YOSHIFUJI Hideaki / 吉藤英明 2006-10-12 15:49 ` Ben Woodard
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox; as well as URLs for NNTP newsgroup(s).