netdev.vger.kernel.org archive mirror
* [PATCH] tcp: sysctl for initial receive window
@ 2012-09-21  8:55 Jesper Dangaard Brouer
  2012-09-21 15:25 ` Eric Dumazet
  2012-09-25  5:29 ` Jan Engelhardt
  0 siblings, 2 replies; 9+ messages in thread
From: Jesper Dangaard Brouer @ 2012-09-21  8:55 UTC (permalink / raw)
  To: netdev; +Cc: Jesper Dangaard Brouer, Nandita Dukkipati, Eric Dumazet

Make it possible to adjust the TCP default initial advertised receive
window, via sysctl /proc/sys/net/ipv4/tcp_init_recv_window.

The window size is this value multiplied by the MSS of the connection.
The default value is (still) 10, as described in commit 356f039822b
(TCP: increase default initial receive window.)

Allow minimum value of 1, but recommend against setting value below 2
in the documentation.

It's possible to control/override this value per route table entry via
the iproute2 option initrwnd.  Having the global default exported via
sysctl helps determine the default setting, and makes it easier to
adjust.

Cc: Nandita Dukkipati <nanditad@google.com>
Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
---

 Documentation/networking/ip-sysctl.txt |   12 ++++++++++++
 include/net/tcp.h                      |    1 +
 net/ipv4/sysctl_net_ipv4.c             |    9 +++++++++
 net/ipv4/tcp_input.c                   |    6 +++---
 net/ipv4/tcp_output.c                  |    8 +++++---
 5 files changed, 30 insertions(+), 6 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index c7fc107..684131c 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -257,6 +257,18 @@ tcp_frto_response - INTEGER
 		  to the values prior timeout
 	Default: 0 (rate halving based)
 
+tcp_init_recv_window - INTEGER
+	Default initial advertised receive window.  Actual window size
+	is this value multiplied by the MSS of the connection.  Its
+	possible to control/override this value per route table entry
+	via the iproute2 option initrwnd.
+	Minimum value is 1, but 2 is the recommended minimum.
+	The effective max value, is limited by the sockets receive
+	buffer size (default tcp_rmem[1], and possibly scaled by
+	tcp_adv_win_scale), and can further be limited by window
+	clamp.
+	Default: 10
+
 tcp_keepalive_time - INTEGER
 	How often TCP sends out keepalive messages when keepalive is enabled.
 	Default: 2hours.
diff --git a/include/net/tcp.h b/include/net/tcp.h
index a8cb00c..3334852 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -292,6 +292,7 @@ extern int sysctl_tcp_thin_dupack;
 extern int sysctl_tcp_early_retrans;
 extern int sysctl_tcp_limit_output_bytes;
 extern int sysctl_tcp_challenge_ack_limit;
+extern u32 sysctl_tcp_init_recv_window;
 
 extern atomic_long_t tcp_memory_allocated;
 extern struct percpu_counter tcp_sockets_allocated;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 9205e49..9bb6608 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -27,6 +27,7 @@
 #include <net/tcp_memcontrol.h>
 
 static int zero;
+static int one = 1;
 static int two = 2;
 static int tcp_retr1_max = 255;
 static int ip_local_port_range_min[] = { 1, 1 };
@@ -794,6 +795,14 @@ static struct ctl_table ipv4_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= &zero
 	},
+	{
+		.procname	= "tcp_init_recv_window",
+		.data		= &sysctl_tcp_init_recv_window,
+		.maxlen		= sizeof(sysctl_tcp_init_recv_window),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &one
+	},
 	{ }
 };
 
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index e2bec81..bbf7a33 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -356,14 +356,14 @@ static void tcp_grow_window(struct sock *sk, const struct sk_buff *skb)
 static void tcp_fixup_rcvbuf(struct sock *sk)
 {
 	u32 mss = tcp_sk(sk)->advmss;
-	u32 icwnd = TCP_DEFAULT_INIT_RCVWND;
+	u32 icwnd = sysctl_tcp_init_recv_window;
 	int rcvmem;
 
-	/* Limit to 10 segments if mss <= 1460,
+	/* Limit to default 10 segments if mss <= 1460,
 	 * or 14600/mss segments, with a minimum of two segments.
 	 */
 	if (mss > 1460)
-		icwnd = max_t(u32, (1460 * TCP_DEFAULT_INIT_RCVWND) / mss, 2);
+		icwnd = max_t(u32, (1460 * icwnd) / mss, 2);
 
 	rcvmem = SKB_TRUESIZE(mss + MAX_TCP_HEADER);
 	while (tcp_win_from_space(rcvmem) < mss)
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index cfe6ffe..5f3b26d 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -59,6 +59,8 @@ int sysctl_tcp_limit_output_bytes __read_mostly = 131072;
  */
 int sysctl_tcp_tso_win_divisor __read_mostly = 3;
 
+u32 sysctl_tcp_init_recv_window __read_mostly = TCP_DEFAULT_INIT_RCVWND;
+
 int sysctl_tcp_mtu_probing __read_mostly = 0;
 int sysctl_tcp_base_mss __read_mostly = TCP_BASE_MSS;
 
@@ -235,14 +237,14 @@ void tcp_select_initial_window(int __space, __u32 mss,
 	}
 
 	/* Set initial window to a value enough for senders starting with
-	 * initial congestion window of TCP_DEFAULT_INIT_RCVWND. Place
+	 * initial congestion window of sysctl_tcp_init_recv_window. Place
 	 * a limit on the initial window when mss is larger than 1460.
 	 */
 	if (mss > (1 << *rcv_wscale)) {
-		int init_cwnd = TCP_DEFAULT_INIT_RCVWND;
+		int init_cwnd = sysctl_tcp_init_recv_window;
 		if (mss > 1460)
 			init_cwnd =
-			max_t(u32, (1460 * TCP_DEFAULT_INIT_RCVWND) / mss, 2);
+			max_t(u32, (1460 * init_cwnd) / mss, 2);
 		/* when initializing use the value from init_rcv_wnd
 		 * rather than the default from above
 		 */
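
The segment-count logic touched by the two hunks above can be modelled in
userspace roughly as follows. This is only a sketch: the function names are
hypothetical, init_rcvwnd_segs stands in for sysctl_tcp_init_recv_window, and
the later clamping against the receive buffer is left out.

```c
#include <stdint.h>

/* Userspace model of the patched logic; not kernel code.
 * init_rcvwnd_segs plays the role of sysctl_tcp_init_recv_window. */
static uint32_t init_window_segments(uint32_t init_rcvwnd_segs, uint32_t mss)
{
	uint32_t segs = init_rcvwnd_segs;

	/* For MSS above 1460, scale the segment count down so the window
	 * stays near 1460 * init_rcvwnd_segs bytes, but never below 2. */
	if (mss > 1460) {
		segs = (1460 * init_rcvwnd_segs) / mss;
		if (segs < 2)
			segs = 2;
	}
	return segs;
}

/* Initial window in bytes, before any receive-buffer clamping. */
static uint32_t init_window_bytes(uint32_t init_rcvwnd_segs, uint32_t mss)
{
	return init_window_segments(init_rcvwnd_segs, mss) * mss;
}
```

With the default of 10 and an MSS of 1460 this gives 14600 bytes. Note that in
this model a sysctl value of 1 only takes effect when the MSS is at most 1460;
for larger MSS values the floor of two segments applies.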


* Re: [PATCH] tcp: sysctl for initial receive window
  2012-09-21  8:55 [PATCH] tcp: sysctl for initial receive window Jesper Dangaard Brouer
@ 2012-09-21 15:25 ` Eric Dumazet
  2012-09-21 17:34   ` Jesper Dangaard Brouer
  2012-09-21 17:56   ` David Miller
  2012-09-25  5:29 ` Jan Engelhardt
  1 sibling, 2 replies; 9+ messages in thread
From: Eric Dumazet @ 2012-09-21 15:25 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: netdev, Nandita Dukkipati

On Fri, 2012-09-21 at 10:55 +0200, Jesper Dangaard Brouer wrote:
> Make it possible to adjust the TCP default initial advertised receive
> window, via sysctl /proc/sys/net/ipv4/tcp_init_recv_window.
> 
> The window size is this value multiplied by the MSS of the connection.
> The default value is (still) 10, as descibed in commit 356f039822b
> (TCP: increase default initial receive window.)
> 
> Allow minimum value of 1, but recommend against setting value below 2
> in the documentation.
> 
> Its possible to control/override this value per route table entry via
> the iproute2 option initrwnd.  Having the global default exported via
> sysctl, helps determine the default setting, and make is easier to
> adjust.

I was wondering why it's not symmetric:

If we add a sysctl for the initial receive window, do we need another one
for the initial send window?

Thanks


* Re: [PATCH] tcp: sysctl for initial receive window
  2012-09-21 15:25 ` Eric Dumazet
@ 2012-09-21 17:34   ` Jesper Dangaard Brouer
  2012-09-21 17:56   ` David Miller
  1 sibling, 0 replies; 9+ messages in thread
From: Jesper Dangaard Brouer @ 2012-09-21 17:34 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: netdev, Nandita Dukkipati

On Fri, 2012-09-21 at 17:25 +0200, Eric Dumazet wrote:
> On Fri, 2012-09-21 at 10:55 +0200, Jesper Dangaard Brouer wrote:
> > Make it possible to adjust the TCP default initial advertised receive
> > window, via sysctl /proc/sys/net/ipv4/tcp_init_recv_window.
> > 
> > The window size is this value multiplied by the MSS of the connection.
> > The default value is (still) 10, as descibed in commit 356f039822b
> > (TCP: increase default initial receive window.)
> > 
> > Allow minimum value of 1, but recommend against setting value below 2
> > in the documentation.
> > 
> > Its possible to control/override this value per route table entry via
> > the iproute2 option initrwnd.  Having the global default exported via
> > sysctl, helps determine the default setting, and make is easier to
> > adjust.
> 
> I was wondering why its not symmetric :
> 
> If we add a sysctl for initial receive window, we need another one for
> initial send window ?

Yes, that was also part of my plan (I just didn't have time to complete
it).  I'll implement the sysctl for the initial congestion window next
week.

I just wanted some initial feedback on whether this sysctl approach is
acceptable.

-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH] tcp: sysctl for initial receive window
  2012-09-21 15:25 ` Eric Dumazet
  2012-09-21 17:34   ` Jesper Dangaard Brouer
@ 2012-09-21 17:56   ` David Miller
  2012-09-21 18:32     ` Jesper Dangaard Brouer
  1 sibling, 1 reply; 9+ messages in thread
From: David Miller @ 2012-09-21 17:56 UTC (permalink / raw)
  To: eric.dumazet; +Cc: brouer, netdev, nanditad

From: Eric Dumazet <eric.dumazet@gmail.com>
Date: Fri, 21 Sep 2012 17:25:11 +0200

> On Fri, 2012-09-21 at 10:55 +0200, Jesper Dangaard Brouer wrote:
>> Make it possible to adjust the TCP default initial advertised receive
>> window, via sysctl /proc/sys/net/ipv4/tcp_init_recv_window.
>> 
>> The window size is this value multiplied by the MSS of the connection.
>> The default value is (still) 10, as descibed in commit 356f039822b
>> (TCP: increase default initial receive window.)
>> 
>> Allow minimum value of 1, but recommend against setting value below 2
>> in the documentation.
>> 
>> Its possible to control/override this value per route table entry via
>> the iproute2 option initrwnd.  Having the global default exported via
>> sysctl, helps determine the default setting, and make is easier to
>> adjust.
> 
> I was wondering why its not symmetric :
> 
> If we add a sysctl for initial receive window, we need another one for
> initial send window ?

Unlike the routing configuration, this is susceptible to serious abuse.

All it takes is for one jackass vendor to say that this should be set
to 1,000 in sysctl.conf when using their product.

Whereas setting it on a per-route basis forces the person doing it
to actually consider that there might be ramifications that have to
do with the paths on which you are making this adjustment.

I would only let this in if you hard limited the setting to its
current setting, 10.  So people could decrease it.


* Re: [PATCH] tcp: sysctl for initial receive window
  2012-09-21 17:56   ` David Miller
@ 2012-09-21 18:32     ` Jesper Dangaard Brouer
  2012-09-21 18:48       ` David Miller
  0 siblings, 1 reply; 9+ messages in thread
From: Jesper Dangaard Brouer @ 2012-09-21 18:32 UTC (permalink / raw)
  To: David Miller; +Cc: eric.dumazet, netdev, nanditad

On Fri, 2012-09-21 at 13:56 -0400, David Miller wrote:
> From: Eric Dumazet <eric.dumazet@gmail.com>
> Date: Fri, 21 Sep 2012 17:25:11 +0200
> 
> > On Fri, 2012-09-21 at 10:55 +0200, Jesper Dangaard Brouer wrote:
> >> Make it possible to adjust the TCP default initial advertised receive
> >> window, via sysctl /proc/sys/net/ipv4/tcp_init_recv_window.
> >> 
> >> The window size is this value multiplied by the MSS of the connection.
> >> The default value is (still) 10, as descibed in commit 356f039822b
> >> (TCP: increase default initial receive window.)
> >> 
> >> Allow minimum value of 1, but recommend against setting value below 2
> >> in the documentation.
> >> 
> >> Its possible to control/override this value per route table entry via
> >> the iproute2 option initrwnd.  Having the global default exported via
> >> sysctl, helps determine the default setting, and make is easier to
> >> adjust.
> > 
> > I was wondering why its not symmetric :
> > 
> > If we add a sysctl for initial receive window, we need another one for
> > initial send window ?
> 
> Unlike the routing configuration, this is susceptible to serious abuse.

Are you talking about this patch, i.e. the "tcp_init_recv_window" initial
advertised receive window?


> All it takes is for one jackass vendor to say that this should be set
> to 1,000 in in sysctl.conf when using their product.

I do see your point with jackass vendors.

But (for tcp_init_recv_window) it's not a problem, because this is being
limited by tcp_rmem[1] (divided by 2 by default, due to tcp_adv_win_scale),
and can be further limited by window clamping (and we also cut it if
tcp_adv_win_scale > 14).
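
A rough userspace model of that limit, as a sketch only: win_from_space()
mirrors the kernel's tcp_win_from_space() helper as I understand it from this
era, while clamp_initial_window() and its parameters are hypothetical names.
MSS rounding, window scaling and the per-socket window clamp are ignored.

```c
/* Model of tcp_win_from_space(): the share of the receive buffer
 * that may be advertised as window, given tcp_adv_win_scale. */
static int win_from_space(int space, int adv_win_scale)
{
	return adv_win_scale <= 0 ?
		(space >> (-adv_win_scale)) :
		space - (space >> adv_win_scale);
}

/* Initial window in bytes: the requested segs * mss, capped by the
 * window that the receive buffer allows. */
static int clamp_initial_window(int rcvbuf_bytes, int adv_win_scale,
				int init_segs, int mss)
{
	int space = win_from_space(rcvbuf_bytes, adv_win_scale);
	long wanted = (long)init_segs * mss;

	return wanted < space ? (int)wanted : space;
}
```

Assuming tcp_rmem[1] = 87380 and tcp_adv_win_scale = 1, a setting of 1000
segments at MSS 1460 would ask for 1460000 bytes but be capped at 43690 bytes
in this model, while the default of 10 passes through as 14600 bytes.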


> Whereas setting it on a per-route basis forces the person doing it
> to actually consider that there might be ramifications that have to
> do with the paths on which you are making this adjustment.

As I mentioned above, this also requires some extra work and
consideration to make it go out of bounds.

> I would only let this in if you hard limited the setting to it's
> current setting, 10.  So people could decrease it.

That would defeat the purpose of the patch.  Perhaps we could allow a
sensible max... (but this max is already being controlled as described).


-- 
Best regards,
  Jesper Dangaard Brouer
  MSc.CS, Sr. Network Kernel Developer at Red Hat
  Author of http://www.iptv-analyzer.org
  LinkedIn: http://www.linkedin.com/in/brouer


* Re: [PATCH] tcp: sysctl for initial receive window
  2012-09-21 18:32     ` Jesper Dangaard Brouer
@ 2012-09-21 18:48       ` David Miller
  2012-09-26 11:53         ` Eric Dumazet
  0 siblings, 1 reply; 9+ messages in thread
From: David Miller @ 2012-09-21 18:48 UTC (permalink / raw)
  To: brouer; +Cc: eric.dumazet, netdev, nanditad

From: Jesper Dangaard Brouer <brouer@redhat.com>
Date: Fri, 21 Sep 2012 20:32:06 +0200

> On Fri, 2012-09-21 at 13:56 -0400, David Miller wrote:
>> From: Eric Dumazet <eric.dumazet@gmail.com>
>> Date: Fri, 21 Sep 2012 17:25:11 +0200
>> 
>> I would only let this in if you hard limited the setting to it's
>> current setting, 10.  So people could decrease it.
> 
> The would defeat the purpose of the patch.  Perhaps we could, allow a
> sensible max... (but this max is already being controlled as described).

Any new max which is truly sensible, could be the new default, and we
would apply the same amount of vetting for such a thing.


* Re: [PATCH] tcp: sysctl for initial receive window
  2012-09-21  8:55 [PATCH] tcp: sysctl for initial receive window Jesper Dangaard Brouer
  2012-09-21 15:25 ` Eric Dumazet
@ 2012-09-25  5:29 ` Jan Engelhardt
  1 sibling, 0 replies; 9+ messages in thread
From: Jan Engelhardt @ 2012-09-25  5:29 UTC (permalink / raw)
  To: Jesper Dangaard Brouer; +Cc: netdev, Nandita Dukkipati, Eric Dumazet



On Friday 2012-09-21 10:55, Jesper Dangaard Brouer wrote:
>diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
>index c7fc107..684131c 100644
>--- a/Documentation/networking/ip-sysctl.txt
>+++ b/Documentation/networking/ip-sysctl.txt
>@@ -257,6 +257,18 @@ tcp_frto_response - INTEGER
> 		  to the values prior timeout
> 	Default: 0 (rate halving based)
> 
>+tcp_init_recv_window - INTEGER
>+	Default initial advertised receive window.  Actual window size
>+	is this value multiplied by the MSS of the connection.  Its

	is this value multiplied by the MSS of the connection.  It is

>+	possible to control/override this value per route table entry
>+	via the iproute2 option initrwnd.
>+	Minimum value is 1, but 2 is the recommended minimum.
>+	The effective max value, is limited by the sockets receive

	The effective max value is limited by the sockets receive

>+	buffer size (default tcp_rmem[1], and possibly scaled by
>+	tcp_adv_win_scale), and can further be limited by window

	tcp_adv_win_scale) and can further be limited by window

>+	clamp.

	clamping.

>+	Default: 10
>+
> tcp_keepalive_time - INTEGER
> 	How often TCP sends out keepalive messages when keepalive is enabled.
> 	Default: 2hours.

The "recommended minimum" is somewhat strange from a language POV,
since the recommendation is actually to _not touch_ the option at all
(because the default works and there is potential abuse as Dave
mentions).


* Re: [PATCH] tcp: sysctl for initial receive window
  2012-09-21 18:48       ` David Miller
@ 2012-09-26 11:53         ` Eric Dumazet
  2012-10-01 22:36           ` Yuchung Cheng
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2012-09-26 11:53 UTC (permalink / raw)
  To: David Miller; +Cc: brouer, netdev, nanditad

On Fri, 2012-09-21 at 14:48 -0400, David Miller wrote: 
> From: Jesper Dangaard Brouer <brouer@redhat.com>
> Date: Fri, 21 Sep 2012 20:32:06 +0200
> > The would defeat the purpose of the patch.  Perhaps we could, allow a
> > sensible max... (but this max is already being controlled as described).
> 
> Any new max which is truly sensible, could be the new default, and we
> would apply the same amount of vetting for such a thing.


We have in Linux a very conservative and complex rwin control at the
beginning of a TCP session, which only matters for the very first packets
if applications are reasonably fast at draining their receive queue.
(They mostly are.)

Last time I had to take a look (after the truesize changes), I was kind of
worried not to find a good reason why we were doing this.

We now have :

- rcvbuf autotuning, letting rwin grow up to 3MB or so
- Better truesize tracking
- global/cgroup tcp mem accounting/pressure
- TCP coalescing to minimize the effect of bad citizen packets
    (very low len/truesize ratio)
- People tracking TCP stack inefficiencies and working on new CCs...
   (An example is Joe Touch's I-D
   http://tools.ietf.org/html/draft-touch-tcpm-automatic-iw-03, which
   proposes increasing IW over a longer period of time, as opposed to
   revisiting constants every few years.)
- ...

TCP congestion control is controlled by the sender, driven by the ACKs
coming back from the receiver, and the initial rwin should not change CC at
all, unless we deliberately constrain rwin to a too small value.
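
To illustrate the point, here is a minimal sketch of the standard sender-side
rule that the amount of unacknowledged data is bounded by the smaller of the
congestion window and the advertised receive window. The function name and
parameters are hypothetical and this is not how the kernel structures it.

```c
#include <stdint.h>

/* Bytes the sender may still put on the wire: bounded by
 * min(cwnd, rwnd) minus what is already in flight. */
static uint32_t sendable_bytes(uint32_t cwnd_segs, uint32_t mss,
			       uint32_t rwnd_bytes, uint32_t in_flight)
{
	uint64_t cwnd_bytes = (uint64_t)cwnd_segs * mss;
	uint64_t limit = cwnd_bytes < rwnd_bytes ? cwnd_bytes : rwnd_bytes;

	return in_flight >= limit ? 0 : (uint32_t)(limit - in_flight);
}
```

With cwnd = 10 segments and MSS 1460, an advertised window of 14600 bytes or
more leaves the congestion window as the only limit, whereas a 3-segment rwin
(4380 bytes) becomes the bottleneck regardless of what the sender's CC allows.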

We did the 3 -> 10 change only two years ago.
And 3 was really too small even 5 years ago.

Browsers had to open simultaneous sessions to the same server only to
work around this limit, and they still do.

I would just remove the 10 'hard constant' (but not so hard, since it
was 3 only 2 years ago), and let tcp_rmem[1]/SO_RCVBUF decide the
initial receive window.
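
For reference, this is roughly how an application sets SO_RCVBUF today, as a
sketch using only the standard sockets API. The helper name is hypothetical.
Per socket(7), Linux doubles the requested value for bookkeeping overhead and
caps it by net.core.rmem_max, so the value read back usually differs from the
request; per tcp(7), the option should be set before connect() or listen().

```c
#include <sys/socket.h>
#include <netinet/in.h>
#include <unistd.h>

/* Create a TCP socket and request a receive buffer size.
 * Stores the size the kernel reports back in *effective.
 * Returns the socket fd, or -1 on error. */
static int tcp_socket_with_rcvbuf(int requested, int *effective)
{
	socklen_t len = sizeof(*effective);
	int fd = socket(AF_INET, SOCK_STREAM, 0);

	if (fd < 0)
		return -1;

	if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF,
		       &requested, sizeof(requested)) < 0 ||
	    getsockopt(fd, SOL_SOCKET, SO_RCVBUF, effective, &len) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```

How much of that buffer ends up advertised as the initial window would then
depend on the kernel-side policy being discussed in this thread.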


* Re: [PATCH] tcp: sysctl for initial receive window
  2012-09-26 11:53         ` Eric Dumazet
@ 2012-10-01 22:36           ` Yuchung Cheng
  0 siblings, 0 replies; 9+ messages in thread
From: Yuchung Cheng @ 2012-10-01 22:36 UTC (permalink / raw)
  To: Eric Dumazet; +Cc: David Miller, brouer, netdev, nanditad

On Wed, Sep 26, 2012 at 4:53 AM, Eric Dumazet <eric.dumazet@gmail.com> wrote:
> On Fri, 2012-09-21 at 14:48 -0400, David Miller wrote:
>> From: Jesper Dangaard Brouer <brouer@redhat.com>
>> Date: Fri, 21 Sep 2012 20:32:06 +0200
>> > The would defeat the purpose of the patch.  Perhaps we could, allow a
>> > sensible max... (but this max is already being controlled as described).
>>
>> Any new max which is truly sensible, could be the new default, and we
>> would apply the same amount of vetting for such a thing.
>
>
> We have in linux a very conservative and complex rwin control at the
> beginning of a TCP session, only for the very first packets,
> if applications are reasonably fast at draining their receive queue.
> (They mostly are)
>
> Last time I had to take a look (after truesize changes), I was kind of
> worried to not find a good reason why we were doing this.
>
> We now have :
>
> - rcvbuf autotuning, letting rwin growing up to 3MB or so
> - Better truesize tracking
> - global/cgroup tcp mem accounting/pressure
> - TCP coalescing to minimize the effect of bad citizen packets
>     (very low len/truesize ratio)
> - People tracking TCP stack inefficiencies and working on new CCs...
>    (An example is Joe Touch I-D
> http://tools.ietf.org/html/draft-touch-tcpm-automatic-iw-03 that
> proposes increasing IW over a longer period of time (as opposed to
> revisiting constants every few years).
> - ...
>
> TCP congestion control is controlled by the sender, driven by the ACK
> coming back from receiver, and initial rwin should not change CC at all,
> unless we deliberately constrain rwin to a too small value.
>
> We did the 3 -> 10 change only two years ago.
> And 3 was really too small even 5 years ago.
>
> Browsers had to open simultaneous sessions to the same server only to
> workaround this limit, and they still do.
>
> I would just remove the 10 'hard constant', (but not so hard, since it
> was 3 only 2 years ago), and let tcp_rmem[1]/SO_RCVBUF decide of the
> initial receive window.
I like this idea a lot. Got a patch for us to try?


