netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH] Add a sysctl to allow TCP window shrinking in order to honor memory limits
@ 2023-06-05 20:38 Mike Freemon
  2023-06-05 22:42 ` Stephen Hemminger
  0 siblings, 1 reply; 9+ messages in thread
From: Mike Freemon @ 2023-06-05 20:38 UTC (permalink / raw)
  To: netdev; +Cc: kernel-team, mfreemon

From: "mfreemon@cloudflare.com" <mfreemon@cloudflare.com>

Under certain circumstances, the tcp receive buffer memory limit
set by autotuning is ignored, and the receive buffer can grow
unrestrained until it reaches tcp_rmem[2].

To reproduce:  Connect a TCP session with the receiver doing
nothing and the sender sending small packets (an infinite loop
of socket send() with 4 bytes of payload with a sleep of 1 ms
in between each send()).  This will fill the tcp receive buffer
all the way to tcp_rmem[2], ignoring the autotuning limit
(sk_rcvbuf).

As a result, a host can have individual tcp sessions with receive
buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
limits, causing the host to go into tcp memory pressure mode.

The fundamental issue is the relationship between the granularity
of the window scaling factor and the number of byte ACKed back
to the sender.  This problem has previously been identified in
RFC 7323, appendix F [1].

The Linux kernel currently adheres to never shrinking the window.

In addition to the overallocation of memory mentioned above, this
is also functionally incorrect, because once tcp_rmem[2] is
reached, the receiver will drop in-window packets resulting in
retransmissions and an eventual timeout of the tcp session.  A
receive buffer full condition should instead result in a zero
window and an indefinite wait.

In practice, this problem is largely hidden for most flows.  It
is not applicable to mice flows.  Elephant flows can send data
fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
triggering a zero window.

But this problem does show up for other types of flows.  A good
example are websockets and other type of flows that send small
amounts of data spaced apart slightly in time.  In these cases,
we directly encounter the problem described in [1].

RFC 7323, section 2.4 [2], says there are instances when a retracted
window can be offered, and that TCP implementations MUST ensure
that they handle a shrinking window, as specified in RFC 1122,
section 4.2.2.16 [3].  All prior RFCs on the topic of tcp window
management have made clear that sender must accept a shrunk window
from the receiver, including RFC 793 [4] and RFC 1323 [5].

This patch implements the functionality to shrink the tcp window
when necessary to keep the right edge within the memory limit by
autotuning (sk_rcvbuf).  This new functionality is enabled with
the following sysctl:

sysctl: net.ipv4.tcp_shrink_window

This sysctl changes how the TCP window is calculated.

If sysctl tcp_shrink_window is zero (the default value), then the
window is never shrunk.

If sysctl tcp_shrink_window is non-zero, then the memory limit
set by autotuning is honored.  This requires that the TCP window
be shrunk ("retracted") as described in RFC 1122.

[1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
[2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
[3] https://www.rfc-editor.org/rfc/rfc1122#page-91
[4] https://www.rfc-editor.org/rfc/rfc793
[5] https://www.rfc-editor.org/rfc/rfc1323

Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>
---
 Documentation/networking/ip-sysctl.rst | 14 ++++++
 include/net/netns/ipv4.h               |  1 +
 net/ipv4/sysctl_net_ipv4.c             |  9 ++++
 net/ipv4/tcp_ipv4.c                    |  2 +
 net/ipv4/tcp_output.c                  | 59 +++++++++++++++++++-------
 5 files changed, 70 insertions(+), 15 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index 6ec06a33688a..6c713d3a97e5 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -970,6 +970,20 @@ tcp_tw_reuse - INTEGER
 tcp_window_scaling - BOOLEAN
 	Enable window scaling as defined in RFC1323.
 
+tcp_shrink_window - BOOLEAN
+	This changes how the TCP receive window is calculated when window
+	scaling is in effect.
+
+	RFC 7323, section 2.4, says there are instances when a retracted
+	window can be offered, and that TCP implementations MUST ensure
+	that they handle a shrinking window, as specified in RFC 1122.
+
+	- 0 - Disabled.	The window is never shrunk.
+	- 1 - Enabled.	The window is shrunk when necessary to remain within
+			the memory limit set by autotuning (sk_rcvbuf).
+
+	Default: 0
+
 tcp_wmem - vector of 3 INTEGERs: min, default, max
 	min: Amount of memory reserved for send buffers for TCP sockets.
 	Each TCP socket has rights to use it due to fact of its birth.
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index db762e35aca9..e30442bdc9c0 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -65,6 +65,7 @@ struct netns_ipv4 {
 #endif
 	bool			fib_has_custom_local_routes;
 	bool			fib_offload_disabled;
+	u8			sysctl_tcp_shrink_window;
 #ifdef CONFIG_IP_ROUTE_CLASSID
 	atomic_t		fib_num_tclassid_users;
 #endif
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 40fe70fc2015..12470660744d 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -1470,6 +1470,15 @@ static struct ctl_table ipv4_net_table[] = {
 		.extra1         = SYSCTL_ZERO,
 		.extra2         = &tcp_plb_max_cong_thresh,
 	},
+	{
+		.procname	= "tcp_shrink_window",
+		.data		= &init_net.ipv4.sysctl_tcp_shrink_window,
+		.maxlen		= sizeof(u8),
+		.mode		= 0644,
+		.proc_handler	= proc_dou8vec_minmax,
+		.extra1		= SYSCTL_ZERO,
+		.extra2		= SYSCTL_ONE,
+	},
 	{ }
 };
 
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 06d2573685ca..ed0ce4b4b96d 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -3276,6 +3276,8 @@ static int __net_init tcp_sk_init(struct net *net)
 	else
 		net->ipv4.tcp_congestion_control = &tcp_reno;
 
+	net->ipv4.sysctl_tcp_shrink_window = 0;
+
 	return 0;
 }
 
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index cfe128b81a01..6bdd59716028 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -260,8 +260,8 @@ static u16 tcp_select_window(struct sock *sk)
 	u32 old_win = tp->rcv_wnd;
 	u32 cur_win = tcp_receive_window(tp);
 	u32 new_win = __tcp_select_window(sk);
+	struct net *net = sock_net(sk);
 
-	/* Never shrink the offered window */
 	if (new_win < cur_win) {
 		/* Danger Will Robinson!
 		 * Don't update rcv_wup/rcv_wnd here or else
@@ -270,11 +270,15 @@ static u16 tcp_select_window(struct sock *sk)
 		 *
 		 * Relax Will Robinson.
 		 */
-		if (new_win == 0)
-			NET_INC_STATS(sock_net(sk),
-				      LINUX_MIB_TCPWANTZEROWINDOWADV);
-		new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
+		if (!net->ipv4.sysctl_tcp_shrink_window) {
+			/* Never shrink the offered window */
+			if (new_win == 0)
+				NET_INC_STATS(sock_net(sk),
+					      LINUX_MIB_TCPWANTZEROWINDOWADV);
+			new_win = ALIGN(cur_win, 1 << tp->rx_opt.rcv_wscale);
+		}
 	}
+
 	tp->rcv_wnd = new_win;
 	tp->rcv_wup = tp->rcv_nxt;
 
@@ -2947,6 +2951,7 @@ u32 __tcp_select_window(struct sock *sk)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct tcp_sock *tp = tcp_sk(sk);
+	struct net *net = sock_net(sk);
 	/* MSS for the peer's data.  Previous versions used mss_clamp
 	 * here.  I don't know if the value based on our guesses
 	 * of peer's MSS is better for the performance.  It's more correct
@@ -2968,16 +2973,24 @@ u32 __tcp_select_window(struct sock *sk)
 		if (mss <= 0)
 			return 0;
 	}
+
+	if (net->ipv4.sysctl_tcp_shrink_window) {
+		/* new window should always be an exact multiple of scaling factor */
+		free_space = round_down(free_space, 1 << tp->rx_opt.rcv_wscale);
+	}
+
 	if (free_space < (full_space >> 1)) {
 		icsk->icsk_ack.quick = 0;
 
 		if (tcp_under_memory_pressure(sk))
 			tcp_adjust_rcv_ssthresh(sk);
 
-		/* free_space might become our new window, make sure we don't
-		 * increase it due to wscale.
-		 */
-		free_space = round_down(free_space, 1 << tp->rx_opt.rcv_wscale);
+		if (!net->ipv4.sysctl_tcp_shrink_window) {
+			/* free_space might become our new window, make sure we don't
+			 * increase it due to wscale.
+			 */
+			free_space = round_down(free_space, 1 << tp->rx_opt.rcv_wscale);
+		}
 
 		/* if free space is less than mss estimate, or is below 1/16th
 		 * of the maximum allowed, try to move to zero-window, else
@@ -2988,10 +3001,24 @@ u32 __tcp_select_window(struct sock *sk)
 		 */
 		if (free_space < (allowed_space >> 4) || free_space < mss)
 			return 0;
+
+		if (net->ipv4.sysctl_tcp_shrink_window && free_space < (1 << tp->rx_opt.rcv_wscale))
+			return 0;
 	}
 
-	if (free_space > tp->rcv_ssthresh)
+	if (free_space > tp->rcv_ssthresh) {
 		free_space = tp->rcv_ssthresh;
+		if (net->ipv4.sysctl_tcp_shrink_window) {
+			/* new window should always be an exact multiple of scaling factor
+			 *
+			 * For this case, we ALIGN "up" (increase free_space) because
+			 * we know free_space is not zero here, it has been reduced from
+			 * the memory-based limit, and rcv_ssthresh is not a hard limit
+			 * (unlike sk_rcvbuf).
+			 */
+			free_space = ALIGN(free_space, (1 << tp->rx_opt.rcv_wscale));
+		}
+	}
 
 	/* Don't do rounding if we are using window scaling, since the
 	 * scaled window will not line up with the MSS boundary anyway.
@@ -2999,11 +3026,13 @@ u32 __tcp_select_window(struct sock *sk)
 	if (tp->rx_opt.rcv_wscale) {
 		window = free_space;
 
-		/* Advertise enough space so that it won't get scaled away.
-		 * Import case: prevent zero window announcement if
-		 * 1<<rcv_wscale > mss.
-		 */
-		window = ALIGN(window, (1 << tp->rx_opt.rcv_wscale));
+		if (!net->ipv4.sysctl_tcp_shrink_window) {
+			/* Advertise enough space so that it won't get scaled away.
+			 * Import case: prevent zero window announcement if
+			 * 1<<rcv_wscale > mss.
+			 */
+			window = ALIGN(window, (1 << tp->rx_opt.rcv_wscale));
+		}
 	} else {
 		window = tp->rcv_wnd;
 		/* Get the largest window that is a nice multiple of mss.
-- 
2.40.0


^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH] Add a sysctl to allow TCP window shrinking in order to honor memory limits
  2023-06-05 20:38 [PATCH] Add a sysctl to allow TCP window shrinking in order to honor memory limits Mike Freemon
@ 2023-06-05 22:42 ` Stephen Hemminger
  2023-06-05 22:44   ` Stephen Hemminger
  2023-06-06 14:54   ` Mike Freemon
  0 siblings, 2 replies; 9+ messages in thread
From: Stephen Hemminger @ 2023-06-05 22:42 UTC (permalink / raw)
  To: Mike Freemon; +Cc: netdev, kernel-team

On Mon,  5 Jun 2023 15:38:57 -0500
Mike Freemon <mfreemon@cloudflare.com> wrote:

> From: "mfreemon@cloudflare.com" <mfreemon@cloudflare.com>
> 
> Under certain circumstances, the tcp receive buffer memory limit
> set by autotuning is ignored, and the receive buffer can grow
> unrestrained until it reaches tcp_rmem[2].
> 
> To reproduce:  Connect a TCP session with the receiver doing
> nothing and the sender sending small packets (an infinite loop
> of socket send() with 4 bytes of payload with a sleep of 1 ms
> in between each send()).  This will fill the tcp receive buffer
> all the way to tcp_rmem[2], ignoring the autotuning limit
> (sk_rcvbuf).
> 
> As a result, a host can have individual tcp sessions with receive
> buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
> limits, causing the host to go into tcp memory pressure mode.
> 
> The fundamental issue is the relationship between the granularity
> of the window scaling factor and the number of byte ACKed back
> to the sender.  This problem has previously been identified in
> RFC 7323, appendix F [1].
> 
> The Linux kernel currently adheres to never shrinking the window.
> 
> In addition to the overallocation of memory mentioned above, this
> is also functionally incorrect, because once tcp_rmem[2] is
> reached, the receiver will drop in-window packets resulting in
> retransmissions and an eventual timeout of the tcp session.  A
> receive buffer full condition should instead result in a zero
> window and an indefinite wait.
> 
> In practice, this problem is largely hidden for most flows.  It
> is not applicable to mice flows.  Elephant flows can send data
> fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
> triggering a zero window.
> 
> But this problem does show up for other types of flows.  A good
> example are websockets and other type of flows that send small
> amounts of data spaced apart slightly in time.  In these cases,
> we directly encounter the problem described in [1].
> 
> RFC 7323, section 2.4 [2], says there are instances when a retracted
> window can be offered, and that TCP implementations MUST ensure
> that they handle a shrinking window, as specified in RFC 1122,
> section 4.2.2.16 [3].  All prior RFCs on the topic of tcp window
> management have made clear that sender must accept a shrunk window
> from the receiver, including RFC 793 [4] and RFC 1323 [5].
> 
> This patch implements the functionality to shrink the tcp window
> when necessary to keep the right edge within the memory limit by
> autotuning (sk_rcvbuf).  This new functionality is enabled with
> the following sysctl:
> 
> sysctl: net.ipv4.tcp_shrink_window
> 
> This sysctl changes how the TCP window is calculated.
> 
> If sysctl tcp_shrink_window is zero (the default value), then the
> window is never shrunk.
> 
> If sysctl tcp_shrink_window is non-zero, then the memory limit
> set by autotuning is honored.  This requires that the TCP window
> be shrunk ("retracted") as described in RFC 1122.
> 
> [1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
> [2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
> [3] https://www.rfc-editor.org/rfc/rfc1122#page-91
> [4] https://www.rfc-editor.org/rfc/rfc793
> [5] https://www.rfc-editor.org/rfc/rfc1323
> 
> Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>

Does Linux TCP really need another tuning parameter?
Will tests get run with both feature on and off?
What default will distributions ship with?

Sounds like unbounded receive window growth is always a bad
idea and a latent bug.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] Add a sysctl to allow TCP window shrinking in order to honor memory limits
  2023-06-05 22:42 ` Stephen Hemminger
@ 2023-06-05 22:44   ` Stephen Hemminger
  2023-06-06  2:09     ` Jason Xing
  2023-06-06 14:54   ` Mike Freemon
  1 sibling, 1 reply; 9+ messages in thread
From: Stephen Hemminger @ 2023-06-05 22:44 UTC (permalink / raw)
  To: Mike Freemon; +Cc: netdev, kernel-team

On Mon, 5 Jun 2023 15:42:29 -0700
Stephen Hemminger <stephen@networkplumber.org> wrote:

> > sysctl: net.ipv4.tcp_shrink_window
> > 
> > This sysctl changes how the TCP window is calculated.
> > 
> > If sysctl tcp_shrink_window is zero (the default value), then the
> > window is never shrunk.
> > 
> > If sysctl tcp_shrink_window is non-zero, then the memory limit
> > set by autotuning is honored.  This requires that the TCP window
> > be shrunk ("retracted") as described in RFC 1122.
> > 
> > [1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
> > [2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
> > [3] https://www.rfc-editor.org/rfc/rfc1122#page-91
> > [4] https://www.rfc-editor.org/rfc/rfc793
> > [5] https://www.rfc-editor.org/rfc/rfc1323
> > 
> > Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>  
> 
> Does Linux TCP really need another tuning parameter?
> Will tests get run with both feature on and off?
> What default will distributions ship with?
> 
> Sounds like unbounded receive window growth is always a bad
> idea and a latent bug.

FYI - I worked in an environment where every bug fix had to have
a tuning parameter to turn it off. It was a bad idea, driven by
management problems with updating. The number of knobs lead
to confusion and geometric growth in possible code paths.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] Add a sysctl to allow TCP window shrinking in order to honor memory limits
  2023-06-05 22:44   ` Stephen Hemminger
@ 2023-06-06  2:09     ` Jason Xing
  2023-06-06 15:17       ` Mike Freemon
  0 siblings, 1 reply; 9+ messages in thread
From: Jason Xing @ 2023-06-06  2:09 UTC (permalink / raw)
  To: Stephen Hemminger, Eric Dumazet, Neal Cardwell
  Cc: Mike Freemon, netdev, kernel-team

On Tue, Jun 6, 2023 at 6:44 AM Stephen Hemminger
<stephen@networkplumber.org> wrote:
>
> On Mon, 5 Jun 2023 15:42:29 -0700
> Stephen Hemminger <stephen@networkplumber.org> wrote:
>
> > > sysctl: net.ipv4.tcp_shrink_window
> > >
> > > This sysctl changes how the TCP window is calculated.
> > >
> > > If sysctl tcp_shrink_window is zero (the default value), then the
> > > window is never shrunk.
> > >
> > > If sysctl tcp_shrink_window is non-zero, then the memory limit
> > > set by autotuning is honored.  This requires that the TCP window
> > > be shrunk ("retracted") as described in RFC 1122.
> > >
> > > [1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
> > > [2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
> > > [3] https://www.rfc-editor.org/rfc/rfc1122#page-91
> > > [4] https://www.rfc-editor.org/rfc/rfc793
> > > [5] https://www.rfc-editor.org/rfc/rfc1323
> > >
> > > Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>
> >
> > Does Linux TCP really need another tuning parameter?
> > Will tests get run with both feature on and off?
> > What default will distributions ship with?
> >
> > Sounds like unbounded receive window growth is always a bad
> > idea and a latent bug.
>
> FYI - I worked in an environment where every bug fix had to have
> a tuning parameter to turn it off. It was a bad idea, driven by
> management problems with updating. The number of knobs lead
> to confusion and geometric growth in possible code paths.
>

I agree. More than this, shrinking window prohibited in those classic
RFCs could cause unexpected/unwanted behaviour.

CC: Eric and Neal

Thanks,
Jason

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] Add a sysctl to allow TCP window shrinking in order to honor memory limits
  2023-06-05 22:42 ` Stephen Hemminger
  2023-06-05 22:44   ` Stephen Hemminger
@ 2023-06-06 14:54   ` Mike Freemon
  1 sibling, 0 replies; 9+ messages in thread
From: Mike Freemon @ 2023-06-06 14:54 UTC (permalink / raw)
  To: Stephen Hemminger; +Cc: netdev, kernel-team


On 6/5/23 17:42, Stephen Hemminger wrote:
> On Mon,  5 Jun 2023 15:38:57 -0500
> Mike Freemon <mfreemon@cloudflare.com> wrote:
> 
>> From: "mfreemon@cloudflare.com" <mfreemon@cloudflare.com>
>>
>> Under certain circumstances, the tcp receive buffer memory limit
>> set by autotuning is ignored, and the receive buffer can grow
>> unrestrained until it reaches tcp_rmem[2].
>>
>> To reproduce:  Connect a TCP session with the receiver doing
>> nothing and the sender sending small packets (an infinite loop
>> of socket send() with 4 bytes of payload with a sleep of 1 ms
>> in between each send()).  This will fill the tcp receive buffer
>> all the way to tcp_rmem[2], ignoring the autotuning limit
>> (sk_rcvbuf).
>>
>> As a result, a host can have individual tcp sessions with receive
>> buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
>> limits, causing the host to go into tcp memory pressure mode.
>>
>> The fundamental issue is the relationship between the granularity
>> of the window scaling factor and the number of byte ACKed back
>> to the sender.  This problem has previously been identified in
>> RFC 7323, appendix F [1].
>>
>> The Linux kernel currently adheres to never shrinking the window.
>>
>> In addition to the overallocation of memory mentioned above, this
>> is also functionally incorrect, because once tcp_rmem[2] is
>> reached, the receiver will drop in-window packets resulting in
>> retransmissions and an eventual timeout of the tcp session.  A
>> receive buffer full condition should instead result in a zero
>> window and an indefinite wait.
>>
>> In practice, this problem is largely hidden for most flows.  It
>> is not applicable to mice flows.  Elephant flows can send data
>> fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
>> triggering a zero window.
>>
>> But this problem does show up for other types of flows.  A good
>> example are websockets and other type of flows that send small
>> amounts of data spaced apart slightly in time.  In these cases,
>> we directly encounter the problem described in [1].
>>
>> RFC 7323, section 2.4 [2], says there are instances when a retracted
>> window can be offered, and that TCP implementations MUST ensure
>> that they handle a shrinking window, as specified in RFC 1122,
>> section 4.2.2.16 [3].  All prior RFCs on the topic of tcp window
>> management have made clear that sender must accept a shrunk window
>> from the receiver, including RFC 793 [4] and RFC 1323 [5].
>>
>> This patch implements the functionality to shrink the tcp window
>> when necessary to keep the right edge within the memory limit by
>> autotuning (sk_rcvbuf).  This new functionality is enabled with
>> the following sysctl:
>>
>> sysctl: net.ipv4.tcp_shrink_window
>>
>> This sysctl changes how the TCP window is calculated.
>>
>> If sysctl tcp_shrink_window is zero (the default value), then the
>> window is never shrunk.
>>
>> If sysctl tcp_shrink_window is non-zero, then the memory limit
>> set by autotuning is honored.  This requires that the TCP window
>> be shrunk ("retracted") as described in RFC 1122.
>>
>> [1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
>> [2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
>> [3] https://www.rfc-editor.org/rfc/rfc1122#page-91
>> [4] https://www.rfc-editor.org/rfc/rfc793
>> [5] https://www.rfc-editor.org/rfc/rfc1323
>>
>> Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>
> 
> Does Linux TCP really need another tuning parameter?

It's useful to make testing faster, i.e. comparing enabled vs disabled.
It could also be useful as a quick diagnostic test, i.e. someone is
having a problem and they want to quickly eliminate this patch as a
cause.

But I left it in mainly as a risk response.  This patch requires that
the receiving TCP implementation handle the shrinking window correctly.  
This patch has been deployed at Cloudflare and we have not discovered
any cases where the peer TCP fails to be RFC compliant.  But we cannot
rule out the possibility completely.  The concern is what if someone is
running some old software on a non-public network and their software
does not handle a shrinking window.  Simply disabling this feature via
a sysctl parameter seems like a good solution for that situation.

If the consensus is to not have a sysctl parameter, I am happy to
remove it.

A related question:  If we leave it in, what do we think the default
value should be?  It's disabled by default right now, but that is just
me being conservative.  If we are comfortable enabling this by default,
I'm happy to do that too.

> Will tests get run with both feature on and off?

More background and details about the patch is here, including the test
results you're looking for:
https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/

> What default will distributions ship with?

I'm not sure how to answer this.  Isn't that up to the distributions to decide?

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] Add a sysctl to allow TCP window shrinking in order to honor memory limits
  2023-06-06  2:09     ` Jason Xing
@ 2023-06-06 15:17       ` Mike Freemon
  2023-06-06 15:33         ` Eric Dumazet
  0 siblings, 1 reply; 9+ messages in thread
From: Mike Freemon @ 2023-06-06 15:17 UTC (permalink / raw)
  To: Jason Xing, Stephen Hemminger, Eric Dumazet, Neal Cardwell
  Cc: netdev, kernel-team


On 6/5/23 21:09, Jason Xing wrote:
> On Tue, Jun 6, 2023 at 6:44 AM Stephen Hemminger
> <stephen@networkplumber.org> wrote:
>>
>> On Mon, 5 Jun 2023 15:42:29 -0700
>> Stephen Hemminger <stephen@networkplumber.org> wrote:
>>
>>>> sysctl: net.ipv4.tcp_shrink_window
>>>>
>>>> This sysctl changes how the TCP window is calculated.
>>>>
>>>> If sysctl tcp_shrink_window is zero (the default value), then the
>>>> window is never shrunk.
>>>>
>>>> If sysctl tcp_shrink_window is non-zero, then the memory limit
>>>> set by autotuning is honored.  This requires that the TCP window
>>>> be shrunk ("retracted") as described in RFC 1122.
>>>>
>>>> [1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
>>>> [2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
>>>> [3] https://www.rfc-editor.org/rfc/rfc1122#page-91
>>>> [4] https://www.rfc-editor.org/rfc/rfc793
>>>> [5] https://www.rfc-editor.org/rfc/rfc1323
>>>>
>>>> Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>
>>>
>>> Does Linux TCP really need another tuning parameter?
>>> Will tests get run with both feature on and off?
>>> What default will distributions ship with?
>>>
>>> Sounds like unbounded receive window growth is always a bad
>>> idea and a latent bug.
>>
>> FYI - I worked in an environment where every bug fix had to have
>> a tuning parameter to turn it off. It was a bad idea, driven by
>> management problems with updating. The number of knobs lead
>> to confusion and geometric growth in possible code paths.
>>
> 
> I agree. More than this, shrinking window prohibited in those classic
> RFCs could cause unexpected/unwanted behaviour.

I discuss the RFCs in more detail in my blog post here:
https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] Add a sysctl to allow TCP window shrinking in order to honor memory limits
  2023-06-06 15:17       ` Mike Freemon
@ 2023-06-06 15:33         ` Eric Dumazet
  2023-06-06 15:35           ` Neal Cardwell
  0 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2023-06-06 15:33 UTC (permalink / raw)
  To: Mike Freemon
  Cc: Jason Xing, Stephen Hemminger, Neal Cardwell, netdev, kernel-team

On Tue, Jun 6, 2023 at 5:17 PM Mike Freemon <mfreemon@cloudflare.com> wrote:
>
>
> On 6/5/23 21:09, Jason Xing wrote:
> > On Tue, Jun 6, 2023 at 6:44 AM Stephen Hemminger
> > <stephen@networkplumber.org> wrote:
> >>
> >> On Mon, 5 Jun 2023 15:42:29 -0700
> >> Stephen Hemminger <stephen@networkplumber.org> wrote:
> >>
> >>>> sysctl: net.ipv4.tcp_shrink_window
> >>>>
> >>>> This sysctl changes how the TCP window is calculated.
> >>>>
> >>>> If sysctl tcp_shrink_window is zero (the default value), then the
> >>>> window is never shrunk.
> >>>>
> >>>> If sysctl tcp_shrink_window is non-zero, then the memory limit
> >>>> set by autotuning is honored.  This requires that the TCP window
> >>>> be shrunk ("retracted") as described in RFC 1122.
> >>>>
> >>>> [1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
> >>>> [2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
> >>>> [3] https://www.rfc-editor.org/rfc/rfc1122#page-91
> >>>> [4] https://www.rfc-editor.org/rfc/rfc793
> >>>> [5] https://www.rfc-editor.org/rfc/rfc1323
> >>>>
> >>>> Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>
> >>>
> >>> Does Linux TCP really need another tuning parameter?
> >>> Will tests get run with both feature on and off?
> >>> What default will distributions ship with?
> >>>
> >>> Sounds like unbounded receive window growth is always a bad
> >>> idea and a latent bug.
> >>
> >> FYI - I worked in an environment where every bug fix had to have
> >> a tuning parameter to turn it off. It was a bad idea, driven by
> >> management problems with updating. The number of knobs lead
> >> to confusion and geometric growth in possible code paths.
> >>
> >
> > I agree. More than this, shrinking window prohibited in those classic
> > RFCs could cause unexpected/unwanted behaviour.
>
> I discuss the RFCs in more detail in my blog post here:
> https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/

Mike, the usual process to push linux patches is to make them self contained.

changelog should be enough, no need to read a lengthy blog post.

Also, I did not receive a copy of the patch, which is unfortunate,
given I am the linux TCP maintainer.

Next time you post it, make sure to CC me.

Thanks.

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] Add a sysctl to allow TCP window shrinking in order to honor memory limits
  2023-06-06 15:33         ` Eric Dumazet
@ 2023-06-06 15:35           ` Neal Cardwell
  2023-06-06 17:00             ` Mike Freemon
  0 siblings, 1 reply; 9+ messages in thread
From: Neal Cardwell @ 2023-06-06 15:35 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Mike Freemon, Jason Xing, Stephen Hemminger, netdev, kernel-team

On Tue, Jun 6, 2023 at 11:33 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Tue, Jun 6, 2023 at 5:17 PM Mike Freemon <mfreemon@cloudflare.com> wrote:
...
> > I discuss the RFCs in more detail in my blog post here:
> > https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/
>
> Mike, the usual process to push linux patches is to make them self contained.
>
> changelog should be enough, no need to read a lengthy blog post.
>
> Also, I did not receive a copy of the patch, which is unfortunate,
> given I am the linux TCP maintainer.
>
> Next time you post it, make sure to CC me.
>
> Thanks.

Also, before the repost, can you please rebase  on to the latest
net-next branch? The patch does not currently apply to the latest
net-next tree. Thanks!

neal

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH] Add a sysctl to allow TCP window shrinking in order to honor memory limits
  2023-06-06 15:35           ` Neal Cardwell
@ 2023-06-06 17:00             ` Mike Freemon
  0 siblings, 0 replies; 9+ messages in thread
From: Mike Freemon @ 2023-06-06 17:00 UTC (permalink / raw)
  To: Neal Cardwell, Eric Dumazet
  Cc: Jason Xing, Stephen Hemminger, netdev, kernel-team



On 6/6/23 10:35, Neal Cardwell wrote:
> On Tue, Jun 6, 2023 at 11:33 AM Eric Dumazet <edumazet@google.com> wrote:
>>
>> On Tue, Jun 6, 2023 at 5:17 PM Mike Freemon <mfreemon@cloudflare.com> wrote:
> ...
>>> I discuss the RFCs in more detail in my blog post here:
>>> https://blog.cloudflare.com/unbounded-memory-usage-by-tcp-for-receive-buffers-and-how-we-fixed-it/
>>
>> Mike, the usual process to push linux patches is to make them self contained.
>>
>> changelog should be enough, no need to read a lengthy blog post.
>>
>> Also, I did not receive a copy of the patch, which is unfortunate,
>> given I am the linux TCP maintainer.
>>
>> Next time you post it, make sure to CC me.
>>
>> Thanks.
> 
> Also, before the repost, can you please rebase  on to the latest
> net-next branch? The patch does not currently apply to the latest
> net-next tree. Thanks!

I just posted "v2", which applies successfully on net-next.

^ permalink raw reply	[flat|nested] 9+ messages in thread

end of thread, other threads:[~2023-06-06 17:00 UTC | newest]

Thread overview: 9+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-06-05 20:38 [PATCH] Add a sysctl to allow TCP window shrinking in order to honor memory limits Mike Freemon
2023-06-05 22:42 ` Stephen Hemminger
2023-06-05 22:44   ` Stephen Hemminger
2023-06-06  2:09     ` Jason Xing
2023-06-06 15:17       ` Mike Freemon
2023-06-06 15:33         ` Eric Dumazet
2023-06-06 15:35           ` Neal Cardwell
2023-06-06 17:00             ` Mike Freemon
2023-06-06 14:54   ` Mike Freemon

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).