netdev.vger.kernel.org archive mirror
From: Stephen Hemminger <stephen@networkplumber.org>
To: Mike Freemon <mfreemon@cloudflare.com>
Cc: netdev@vger.kernel.org, kernel-team@cloudflare.com
Subject: Re: [PATCH] Add a sysctl to allow TCP window shrinking in order to honor memory limits
Date: Mon, 5 Jun 2023 15:42:29 -0700
Message-ID: <20230605154229.6077983e@hermes.local>
In-Reply-To: <20230605203857.1672816-1-mfreemon@cloudflare.com>

On Mon,  5 Jun 2023 15:38:57 -0500
Mike Freemon <mfreemon@cloudflare.com> wrote:

> From: "mfreemon@cloudflare.com" <mfreemon@cloudflare.com>
> 
> Under certain circumstances, the tcp receive buffer memory limit
> set by autotuning is ignored, and the receive buffer can grow
> unrestrained until it reaches tcp_rmem[2].
> 
> To reproduce:  Connect a TCP session with the receiver doing
> nothing and the sender sending small packets (an infinite loop
> of socket send() with 4 bytes of payload with a sleep of 1 ms
> in between each send()).  This will fill the tcp receive buffer
> all the way to tcp_rmem[2], ignoring the autotuning limit
> (sk_rcvbuf).
> 
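The sender side of the reproduction described above could look roughly like the sketch below. It is a minimal sketch, not the script used for the report: the function names, the bounded `count` parameter (the description uses an infinite loop), and the use of TCP_NODELAY to keep each write in its own segment are all assumptions made for illustration.

```python
import socket
import time


def send_small_packets(sock, count, payload=b"abcd", interval_s=0.001):
    """Send `count` tiny payloads, pausing `interval_s` between sends.

    Returns the total number of bytes handed to the kernel.
    """
    total = 0
    for _ in range(count):
        sock.sendall(payload)   # 4 bytes of payload per send
        total += len(payload)
        time.sleep(interval_s)  # 1 ms gap, as in the description
    return total


def run_sender(host, port, count):
    """Connect to a receiver that never reads, then trickle data at it."""
    with socket.create_connection((host, port)) as sock:
        # Assumption: disable Nagle so small writes are not coalesced.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
        return send_small_packets(sock, count)
```

The receiver only needs to accept the connection and never call recv(). Once the peer stops opening its window, sendall() will block, so `count` should be chosen with that in mind.
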
> As a result, a host can have individual tcp sessions with receive
> buffers of size tcp_rmem[2], and the host itself can reach tcp_mem
> limits, causing the host to go into tcp memory pressure mode.
> 
> The fundamental issue is the relationship between the granularity
> of the window scaling factor and the number of bytes ACKed back
> to the sender.  This problem has previously been identified in
> RFC 7323, appendix F [1].
> 
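The interaction between window-scale granularity and a never-shrink rule can be illustrated with a toy model. This is a simplified sketch, not the kernel's tcp_select_window(): the helper names are hypothetical, and free buffer space is passed in as a plain number rather than derived from real socket accounting.

```python
def offered_window(free_space, wscale):
    """Window a receiver can express: free space rounded down to a
    multiple of 2**wscale, capped by the 16-bit window field."""
    return min(free_space >> wscale, 0xFFFF) << wscale


def right_edge(rcv_nxt, free_space, wscale, prev_edge, allow_shrink):
    """Right edge of the advertised window after an ACK.

    With allow_shrink=False the edge never moves left, even when the
    free space no longer covers it.
    """
    edge = rcv_nxt + offered_window(free_space, wscale)
    if not allow_shrink and edge < prev_edge:
        return prev_edge
    return edge
```

With made-up numbers (wscale 7, rcv_nxt 1000, 10000 bytes free), the first right edge is 10984. If 4 payload bytes then arrive and are assumed to cost 516 bytes of buffer memory, the space-derived edge is 10476: a receiver that may shrink advertises 10476, while one that never shrinks stays at 10984 and keeps promising space it no longer has.
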
> The Linux kernel currently adheres to never shrinking the window.
> 
> In addition to the overallocation of memory mentioned above, this
> is also functionally incorrect, because once tcp_rmem[2] is
> reached, the receiver will drop in-window packets resulting in
> retransmissions and an eventual timeout of the tcp session.  A
> receive buffer full condition should instead result in a zero
> window and an indefinite wait.
> 
> In practice, this problem is largely hidden for most flows.  It
> is not applicable to mice flows.  Elephant flows can send data
> fast enough to "overrun" the sk_rcvbuf limit (in a single ACK),
> triggering a zero window.
> 
> But this problem does show up for other types of flows.  Good
> examples are websockets and other types of flows that send small
> amounts of data spaced apart slightly in time.  In these cases,
> we directly encounter the problem described in [1].
> 
> RFC 7323, section 2.4 [2], says there are instances when a retracted
> window can be offered, and that TCP implementations MUST ensure
> that they handle a shrinking window, as specified in RFC 1122,
> section 4.2.2.16 [3].  All prior RFCs on the topic of tcp window
> management have made clear that the sender must accept a shrunk window
> from the receiver, including RFC 793 [4] and RFC 1323 [5].
> 
> This patch implements the functionality to shrink the tcp window
> when necessary to keep the right edge within the memory limit set
> by autotuning (sk_rcvbuf).  This new functionality is enabled with
> the following sysctl:
> 
> sysctl: net.ipv4.tcp_shrink_window
> 
> This sysctl changes how the TCP window is calculated.
> 
> If sysctl tcp_shrink_window is zero (the default value), then the
> window is never shrunk.
> 
> If sysctl tcp_shrink_window is non-zero, then the memory limit
> set by autotuning is honored.  This requires that the TCP window
> be shrunk ("retracted") as described in RFC 1122.
> 
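For checking whether a given host exposes the proposed knob, a small helper can map the sysctl name onto the usual /proc/sys layout. This is a sketch under the assumption that the kernel carries this patch; on other kernels the file is simply absent. The helper names and the `root` parameter are hypothetical, added so the code can be exercised without touching /proc.

```python
from pathlib import Path


def sysctl_path(name, root="/proc/sys"):
    """Map a dotted sysctl name to its file under `root`."""
    return Path(root, *name.split("."))


def read_sysctl(name, root="/proc/sys"):
    """Return the integer value of a sysctl, or None if it does not exist."""
    try:
        return int(sysctl_path(name, root).read_text().strip())
    except FileNotFoundError:
        return None
```

Calling read_sysctl("net.ipv4.tcp_shrink_window") would return None where the sysctl is absent, 0 for the default never-shrink behavior, and a non-zero value when shrinking is enabled.
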
> [1] https://www.rfc-editor.org/rfc/rfc7323#appendix-F
> [2] https://www.rfc-editor.org/rfc/rfc7323#section-2.4
> [3] https://www.rfc-editor.org/rfc/rfc1122#page-91
> [4] https://www.rfc-editor.org/rfc/rfc793
> [5] https://www.rfc-editor.org/rfc/rfc1323
> 
> Signed-off-by: Mike Freemon <mfreemon@cloudflare.com>

Does Linux TCP really need another tuning parameter?
Will tests get run with the feature both on and off?
What default will distributions ship with?

Sounds like unbounded receive window growth is always a bad
idea and a latent bug.

Thread overview: 9+ messages
2023-06-05 20:38 [PATCH] Add a sysctl to allow TCP window shrinking in order to honor memory limits Mike Freemon
2023-06-05 22:42 ` Stephen Hemminger [this message]
2023-06-05 22:44   ` Stephen Hemminger
2023-06-06  2:09     ` Jason Xing
2023-06-06 15:17       ` Mike Freemon
2023-06-06 15:33         ` Eric Dumazet
2023-06-06 15:35           ` Neal Cardwell
2023-06-06 17:00             ` Mike Freemon
2023-06-06 14:54   ` Mike Freemon