* [PATCH net-next v2 0/2] Make TIME-WAIT reuse delay deterministic and configurable
@ 2024-12-09 19:38 Jakub Sitnicki
2024-12-09 19:38 ` [PATCH net-next v2 1/2] tcp: Measure TIME-WAIT reuse delay with millisecond precision Jakub Sitnicki
` (3 more replies)
0 siblings, 4 replies; 9+ messages in thread
From: Jakub Sitnicki @ 2024-12-09 19:38 UTC (permalink / raw)
To: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Jason Xing, Adrien Vasseur, Lee Valentine, kernel-team
This patch set is an effort to enable faster reuse of TIME-WAIT sockets.
We have recently talked about the motivation and the idea at Plumbers [1].
Experiment in production
------------------------
We are restarting our experiment on a small set of production nodes as the
code has slightly changed since v1 [2], and there are still a few weeks of
development window to soak the changes. We will report back if we observe
any regressions.
Packetdrill tests
-----------------
The packetdrill tests for TIME-WAIT reuse [3] have not changed since v1.
Although we are not touching PAWS code any more, I would still like to add
tests to cover PAWS reject after TW reuse. This, however, requires patching
packetdrill as I mentioned in the last cover letter [2].
Thanks,
-jkbs
[1] https://lpc.events/event/18/contributions/1962/
[2] https://lore.kernel.org/r/20241113-jakub-krn-909-poc-msec-tw-tstamp-v2-0-b0a335247304@cloudflare.com
[3] https://github.com/google/packetdrill/pull/90
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
Changes in v2:
- Pivot to a dedicated msec timestamp for entering TIME-WAIT state (Eric)
- Link to v1: https://lore.kernel.org/r/20241204-jakub-krn-909-poc-msec-tw-tstamp-v1-0-8b54467a0f34@cloudflare.com
Changes in v1:
- packetdrill: Adjust TS val for reused connection so values keep increasing
- Link to RFCv2: https://lore.kernel.org/r/20241113-jakub-krn-909-poc-msec-tw-tstamp-v2-0-b0a335247304@cloudflare.com
Changes in RFCv2:
- Make TIME-WAIT reuse configurable through a per-netns sysctl.
- Account for timestamp rounding so delay is not shorter than set value.
- Use tcp_mstamp when we know it is fresh due to receiving a segment.
- Link to RFCv1: https://lore.kernel.org/r/20240819-jakub-krn-909-poc-msec-tw-tstamp-v1-1-6567b5006fbe@cloudflare.com
---
Jakub Sitnicki (2):
tcp: Measure TIME-WAIT reuse delay with millisecond precision
tcp: Add sysctl to configure TIME-WAIT reuse delay
Documentation/networking/ip-sysctl.rst | 14 ++++++++++++++
.../networking/net_cachelines/netns_ipv4_sysctl.rst | 1 +
include/net/inet_timewait_sock.h | 4 ++++
include/net/netns/ipv4.h | 1 +
net/ipv4/sysctl_net_ipv4.c | 10 ++++++++++
net/ipv4/tcp_ipv4.c | 7 +++++--
net/ipv4/tcp_minisocks.c | 7 ++++++-
7 files changed, 41 insertions(+), 3 deletions(-)
^ permalink raw reply [flat|nested] 9+ messages in thread
* [PATCH net-next v2 1/2] tcp: Measure TIME-WAIT reuse delay with millisecond precision
2024-12-09 19:38 [PATCH net-next v2 0/2] Make TIME-WAIT reuse delay deterministic and configurable Jakub Sitnicki
@ 2024-12-09 19:38 ` Jakub Sitnicki
2024-12-10 8:11 ` Eric Dumazet
2024-12-12 1:00 ` Jason Xing
2024-12-09 19:38 ` [PATCH net-next v2 2/2] tcp: Add sysctl to configure TIME-WAIT reuse delay Jakub Sitnicki
` (2 subsequent siblings)
3 siblings, 2 replies; 9+ messages in thread
From: Jakub Sitnicki @ 2024-12-09 19:38 UTC (permalink / raw)
To: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Jason Xing, Adrien Vasseur, Lee Valentine, kernel-team
Prepare ground for TIME-WAIT socket reuse with subsecond delay.
Today the last TS.Recent update timestamp, recorded in seconds and stored
in the tp->ts_recent_stamp and tw->tw_ts_recent_stamp fields, has two
purposes.
Firstly, it is used to track the age of the last recorded TS.Recent value
to detect when that value becomes outdated due to potential wrap-around of
the other TCP timestamp clock (RFC 7323, section 5.5).
For this purpose a second-based timestamp is completely sufficient as even
in the worst case scenario of a peer using a high resolution microsecond
timestamp, the wrap-around interval is ~36 minutes long.
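
The ~36 minute figure can be checked with a few lines of arithmetic. The helper below is only an illustrative userspace sketch, not part of the patch; the name ts_wrap_seconds() is made up here, and it assumes the wrap-around that matters for PAWS is 2^31 ticks of the peer's 32-bit timestamp clock (RFC 7323, section 5.5).

```c
#include <stdint.h>

/* Hypothetical helper, not kernel code: seconds until a 32-bit TCP
 * timestamp clock ticking at hz ticks per second advances by 2^31
 * ticks, i.e. the point where signed 32-bit comparison of two
 * timestamp values stops being able to order them.
 */
static inline double ts_wrap_seconds(double hz)
{
	return 2147483648.0 / hz;	/* 2^31 ticks divided by tick rate */
}
```

For a 1 MHz (microsecond) clock this gives roughly 2147 seconds, or about 35.8 minutes, so a stamp kept in whole seconds loses nothing for this purpose.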
Secondly, it serves as a threshold value for allowing TIME-WAIT socket
reuse. A TIME-WAIT socket can be reused only once the virtual 1 Hz clock,
ktime_get_seconds, is past the TS.Recent update timestamp.
The purpose behind delaying the TIME-WAIT socket reuse is to wait for the
other TCP timestamp clock to tick at least once before reusing the
connection. It is only then that the PAWS mechanism for the reopened
connection can detect old duplicate segments from the previous connection
incarnation (RFC 7323, appendix B.2).
In this case using a timestamp with second resolution not only blocks the
way toward allowing faster TIME-WAIT reuse after a shorter subsecond delay,
but also makes it impossible to reliably delay TW reuse by one second.
As Eric Dumazet has pointed out [1], due to timestamp rounding, the TW
reuse delay will actually be between (0, 1] seconds, and 0.5 seconds on
average. We delay TW reuse for one full second only when last TS.Recent
update coincides with our virtual 1 Hz clock tick.
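
To make the rounding effect concrete, here is a small userspace sketch (an illustration only; old_reuse_delay_ms() is a hypothetical name, and the model assumes the seconds stamp is simply the millisecond time truncated to whole seconds). It computes how long the pre-patch check, time_after32(ktime_get_seconds(), ts_recent_stamp), would hold off reuse for a TS.Recent update that happens at a given millisecond.

```c
#include <stdint.h>

/* Hypothetical model of the pre-patch behaviour, not kernel code.
 * The stamp is recorded in whole seconds, and reuse is allowed once
 * the seconds clock is strictly greater than the stamp, which first
 * happens at the next second boundary.
 */
static inline uint32_t old_reuse_delay_ms(uint64_t update_ms)
{
	uint64_t stamp_sec = update_ms / 1000;		/* truncated stamp */
	uint64_t first_ok_ms = (stamp_sec + 1) * 1000;	/* next 1 Hz tick */

	return (uint32_t)(first_ok_ms - update_ms);
}
```

An update exactly on a second boundary waits the full 1000 ms, while one that lands 999 ms into the second waits only 1 ms, which is the (0, 1] second spread described above.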
Considering the above, introduce a dedicated field to store a millisecond
timestamp of transition into the TIME-WAIT state. Place it in an existing
4-byte hole inside inet_timewait_sock structure to avoid an additional
memory cost.
Use the new timestamp to (i) reliably delay TIME-WAIT reuse by one second,
and (ii) prepare for configurable subsecond reuse delay in the subsequent
change.
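
The millisecond-based check can be modelled in userspace as follows. This is a sketch under assumptions, not the kernel implementation: after32() mirrors what the kernel's time_after32() macro does, and tw_reuse_allowed() is a hypothetical wrapper around the reuse_thresh computation from tcp_twsk_unique() in this patch.

```c
#include <stdbool.h>
#include <stdint.h>

/* Userspace stand-in for time_after32(a, b): true when a is after b,
 * using signed 32-bit arithmetic so the result survives u32 wrap.
 */
static inline bool after32(uint32_t a, uint32_t b)
{
	return (int32_t)(b - a) < 0;
}

/* Hypothetical wrapper: may the TIME-WAIT socket entered at entry_ms
 * be reused at now_ms, given a reuse delay of delay_ms?
 */
static inline bool tw_reuse_allowed(uint32_t now_ms, uint32_t entry_ms,
				    uint32_t delay_ms)
{
	uint32_t reuse_thresh = entry_ms + delay_ms;	/* may wrap, by design */

	return after32(now_ms, reuse_thresh);
}
```

Because the comparison is strict, reuse is refused at exactly entry + delay and allowed one millisecond later, so the delay is never shorter than requested. The same holds when the 32-bit millisecond counter wraps between entry and reuse.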
We assume here that a full one second delay was the original intention in
[2] because it accounts for the worst case scenario of the other TCP using
the slowest recommended 1 Hz timestamp clock.
A more involved alternative would be to change the resolution of the last
TS.Recent update timestamp, tw->tw_ts_recent_stamp, to milliseconds.
[1] https://lore.kernel.org/netdev/CANn89iKB4GFd8sVzCbRttqw_96o3i2wDhX-3DraQtsceNGYwug@mail.gmail.com/
[2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b8439924316d5bcb266d165b93d632a4b4b859af
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
include/net/inet_timewait_sock.h | 4 ++++
net/ipv4/tcp_ipv4.c | 5 +++--
net/ipv4/tcp_minisocks.c | 7 ++++++-
3 files changed, 13 insertions(+), 3 deletions(-)
diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index 62c0a7e65d6bdf4c71a8ea90586b985f9fd30229..67a313575780992a1b55aa26aaa2055111eb7e8d 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -74,6 +74,10 @@ struct inet_timewait_sock {
tw_tos : 8;
u32 tw_txhash;
u32 tw_priority;
+ /**
+ * @tw_entry_stamp: Time of entry into %TCP_TIME_WAIT state in msec.
+ */
+ u32 tw_entry_stamp;
struct timer_list tw_timer;
struct inet_bind_bucket *tw_tb;
struct inet_bind2_bucket *tw_tb2;
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index a38c8b1f44dbd95fcea08bd81e0ceaa70177ac8a..3b6ba1d16921e079d5ba08c3c0b98dccace8c370 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -120,6 +120,7 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
const struct tcp_timewait_sock *tcptw = tcp_twsk(sktw);
struct tcp_sock *tp = tcp_sk(sk);
int ts_recent_stamp;
+ u32 reuse_thresh;
if (READ_ONCE(tw->tw_substate) == TCP_FIN_WAIT2)
reuse = 0;
@@ -162,9 +163,9 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
and use initial timestamp retrieved from peer table.
*/
ts_recent_stamp = READ_ONCE(tcptw->tw_ts_recent_stamp);
+ reuse_thresh = READ_ONCE(tw->tw_entry_stamp) + MSEC_PER_SEC;
if (ts_recent_stamp &&
- (!twp || (reuse && time_after32(ktime_get_seconds(),
- ts_recent_stamp)))) {
+ (!twp || (reuse && time_after32(tcp_clock_ms(), reuse_thresh)))) {
/* inet_twsk_hashdance_schedule() sets sk_refcnt after putting twsk
* and releasing the bucket lock.
*/
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 7121d8573928cbf6840b3361b62f4812d365a30b..b089b08e9617862cd73b47ac06b5ac6c1e843ec6 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -157,8 +157,11 @@ tcp_timewait_state_process(struct inet_timewait_sock *tw, struct sk_buff *skb,
rcv_nxt);
if (tmp_opt.saw_tstamp) {
+ u64 ts = tcp_clock_ms();
+
+ WRITE_ONCE(tw->tw_entry_stamp, ts);
WRITE_ONCE(tcptw->tw_ts_recent_stamp,
- ktime_get_seconds());
+ div_u64(ts, MSEC_PER_SEC));
WRITE_ONCE(tcptw->tw_ts_recent,
tmp_opt.rcv_tsval);
}
@@ -316,6 +319,8 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
tw->tw_mark = sk->sk_mark;
tw->tw_priority = READ_ONCE(sk->sk_priority);
tw->tw_rcv_wscale = tp->rx_opt.rcv_wscale;
+ /* refreshed when we enter true TIME-WAIT state */
+ tw->tw_entry_stamp = tcp_time_stamp_ms(tp);
tcptw->tw_rcv_nxt = tp->rcv_nxt;
tcptw->tw_snd_nxt = tp->snd_nxt;
tcptw->tw_rcv_wnd = tcp_receive_window(tp);
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* [PATCH net-next v2 2/2] tcp: Add sysctl to configure TIME-WAIT reuse delay
2024-12-09 19:38 [PATCH net-next v2 0/2] Make TIME-WAIT reuse delay deterministic and configurable Jakub Sitnicki
2024-12-09 19:38 ` [PATCH net-next v2 1/2] tcp: Measure TIME-WAIT reuse delay with millisecond precision Jakub Sitnicki
@ 2024-12-09 19:38 ` Jakub Sitnicki
2024-12-10 8:21 ` Eric Dumazet
2024-12-12 1:08 ` Jason Xing
2024-12-12 4:30 ` [PATCH net-next v2 0/2] Make TIME-WAIT reuse delay deterministic and configurable patchwork-bot+netdevbpf
2024-12-13 13:06 ` Jakub Sitnicki
3 siblings, 2 replies; 9+ messages in thread
From: Jakub Sitnicki @ 2024-12-09 19:38 UTC (permalink / raw)
To: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni
Cc: Jason Xing, Adrien Vasseur, Lee Valentine, kernel-team
Today we have a hardcoded delay of 1 sec before a TIME-WAIT socket can be
reused by reopening a connection. This is a safe choice based on an
assumption that the other TCP timestamp clock frequency, which is unknown
to us, may be as low as 1 Hz (RFC 7323, section 5.4).
However, this means that in the presence of short-lived connections with an
RTT of a couple of milliseconds, the time during which a 4-tuple is blocked
from reuse can be orders of magnitude longer than the connection lifetime.
Combined with a reduced pool of ephemeral ports, when using
IP_LOCAL_PORT_RANGE to share an egress IP address between hosts [1], the
long TIME-WAIT reuse delay can lead to port exhaustion, where all available
4-tuples are tied up in TIME-WAIT state.
Turn the reuse delay into a per-netns setting so that sysadmins can make
more aggressive assumptions about remote TCP timestamp clock frequency and
shorten the delay in order to allow connections to reincarnate faster.
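
As a rough guide for picking a value, the sketch below derives the smallest delay that still covers one tick of the peer's timestamp clock. It is only an illustration: min_safe_reuse_delay_ms() and the two limit macros are names invented here, the peer clock frequency is something the admin has to know or assume, and the 60000 ms upper bound assumes TCP_PAWS_MSL is 60 seconds.

```c
#include <stdint.h>

#define TW_REUSE_DELAY_MIN_MS	1u		/* sysctl lower bound */
#define TW_REUSE_DELAY_MAX_MS	(60u * 1000u)	/* assumes TCP_PAWS_MSL == 60 s */

/* Hypothetical helper, not kernel code: smallest reuse delay in ms
 * that is not shorter than one tick of a peer timestamp clock running
 * at peer_ts_hz, clamped to the range the sysctl accepts.
 */
static inline uint32_t min_safe_reuse_delay_ms(uint32_t peer_ts_hz)
{
	uint64_t tick_ms;

	if (peer_ts_hz == 0)		/* unknown clock: be conservative */
		return TW_REUSE_DELAY_MAX_MS;

	/* Round up so the delay never undershoots the tick interval. */
	tick_ms = (1000u + (uint64_t)peer_ts_hz - 1) / peer_ts_hz;

	if (tick_ms < TW_REUSE_DELAY_MIN_MS)
		tick_ms = TW_REUSE_DELAY_MIN_MS;
	if (tick_ms > TW_REUSE_DELAY_MAX_MS)
		tick_ms = TW_REUSE_DELAY_MAX_MS;
	return (uint32_t)tick_ms;
}
```

With a 1 Hz peer clock this returns the current default of 1000 ms; with a 1 kHz peer clock it returns 1 ms, the lowest value the sysctl allows.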
Note that applications can completely bypass the TIME-WAIT delay protection
already today by locking the local port with bind() before connecting. Such
immediate connection reuse may result in PAWS failing to detect old
duplicate segments, leaving us with just the sequence number check as a
safety net.
This new configurable setting offers a trade-off where the sysadmin can
balance the risk of PAWS detection failing to act against exhausting ports
by having sockets tied up in TIME-WAIT state for too long.
[1] https://lpc.events/event/16/contributions/1349/
Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
---
Documentation/networking/ip-sysctl.rst | 14 ++++++++++++++
.../networking/net_cachelines/netns_ipv4_sysctl.rst | 1 +
include/net/netns/ipv4.h | 1 +
net/ipv4/sysctl_net_ipv4.c | 10 ++++++++++
net/ipv4/tcp_ipv4.c | 4 +++-
5 files changed, 29 insertions(+), 1 deletion(-)
diff --git a/Documentation/networking/ip-sysctl.rst b/Documentation/networking/ip-sysctl.rst
index eacf8983e2307476895a8def7363375f2af36d9d..2f2b00295836be80e1da11370022ca083d7d1eb2 100644
--- a/Documentation/networking/ip-sysctl.rst
+++ b/Documentation/networking/ip-sysctl.rst
@@ -1000,6 +1000,20 @@ tcp_tw_reuse - INTEGER
Default: 2
+tcp_tw_reuse_delay - UNSIGNED INTEGER
+ The delay in milliseconds before a TIME-WAIT socket can be reused by a
+ new connection, if TIME-WAIT socket reuse is enabled. The actual reuse
+ threshold is within [N, N+1] range, where N is the requested delay in
+ milliseconds, to ensure the delay interval is never shorter than the
+ configured value.
+
+ This setting contains an assumption about the other TCP timestamp clock
+ tick interval. It should not be set to a value lower than the peer's
+ clock tick for the PAWS (Protection Against Wrapped Sequence numbers)
+ mechanism to work correctly for the reused connection.
+
+ Default: 1000 (milliseconds)
+
tcp_window_scaling - BOOLEAN
Enable window scaling as defined in RFC1323.
diff --git a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
index 629da6dc6d746ce8058cfbe2215d33d55ca4c19d..de0263302f16dd815593671c4f75a93ed6f7cac4 100644
--- a/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
+++ b/Documentation/networking/net_cachelines/netns_ipv4_sysctl.rst
@@ -79,6 +79,7 @@ u8 sysctl_tcp_retries1
u8 sysctl_tcp_retries2
u8 sysctl_tcp_orphan_retries
u8 sysctl_tcp_tw_reuse timewait_sock_ops
+unsigned_int sysctl_tcp_tw_reuse_delay timewait_sock_ops
int sysctl_tcp_fin_timeout TCP_LAST_ACK/tcp_rcv_state_process
unsigned_int sysctl_tcp_notsent_lowat read_mostly tcp_notsent_lowat/tcp_stream_memory_free
u8 sysctl_tcp_sack tcp_syn_options
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 3c014170e0012818db36d4a7a327025e3fa00dd1..46452da352061007d19d00fdacddd25bbe56444d 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -175,6 +175,7 @@ struct netns_ipv4 {
u8 sysctl_tcp_retries2;
u8 sysctl_tcp_orphan_retries;
u8 sysctl_tcp_tw_reuse;
+ unsigned int sysctl_tcp_tw_reuse_delay;
int sysctl_tcp_fin_timeout;
u8 sysctl_tcp_sack;
u8 sysctl_tcp_window_scaling;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index a79b2a52ce01e6c1a1257ba31c17ac2f51ba19ec..42cb5dc9cb245c26f9a38f8c8c4b26b1adddca39 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -45,6 +45,7 @@ static unsigned int tcp_child_ehash_entries_max = 16 * 1024 * 1024;
static unsigned int udp_child_hash_entries_max = UDP_HTABLE_SIZE_MAX;
static int tcp_plb_max_rounds = 31;
static int tcp_plb_max_cong_thresh = 256;
+static unsigned int tcp_tw_reuse_delay_max = TCP_PAWS_MSL * MSEC_PER_SEC;
/* obsolete */
static int sysctl_tcp_low_latency __read_mostly;
@@ -1065,6 +1066,15 @@ static struct ctl_table ipv4_net_table[] = {
.extra1 = SYSCTL_ZERO,
.extra2 = SYSCTL_TWO,
},
+ {
+ .procname = "tcp_tw_reuse_delay",
+ .data = &init_net.ipv4.sysctl_tcp_tw_reuse_delay,
+ .maxlen = sizeof(unsigned int),
+ .mode = 0644,
+ .proc_handler = proc_douintvec_minmax,
+ .extra1 = SYSCTL_ONE,
+ .extra2 = &tcp_tw_reuse_delay_max,
+ },
{
.procname = "tcp_max_syn_backlog",
.data = &init_net.ipv4.sysctl_max_syn_backlog,
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 3b6ba1d16921e079d5ba08c3c0b98dccace8c370..e45222d5fc2e2a3409e2a93c78588ab6a352f2f9 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -163,7 +163,8 @@ int tcp_twsk_unique(struct sock *sk, struct sock *sktw, void *twp)
and use initial timestamp retrieved from peer table.
*/
ts_recent_stamp = READ_ONCE(tcptw->tw_ts_recent_stamp);
- reuse_thresh = READ_ONCE(tw->tw_entry_stamp) + MSEC_PER_SEC;
+ reuse_thresh = READ_ONCE(tw->tw_entry_stamp) +
+ READ_ONCE(sock_net(sk)->ipv4.sysctl_tcp_tw_reuse_delay);
if (ts_recent_stamp &&
(!twp || (reuse && time_after32(tcp_clock_ms(), reuse_thresh)))) {
/* inet_twsk_hashdance_schedule() sets sk_refcnt after putting twsk
@@ -3458,6 +3459,7 @@ static int __net_init tcp_sk_init(struct net *net)
net->ipv4.sysctl_tcp_fin_timeout = TCP_FIN_TIMEOUT;
net->ipv4.sysctl_tcp_notsent_lowat = UINT_MAX;
net->ipv4.sysctl_tcp_tw_reuse = 2;
+ net->ipv4.sysctl_tcp_tw_reuse_delay = 1 * MSEC_PER_SEC;
net->ipv4.sysctl_tcp_no_ssthresh_metrics_save = 1;
refcount_set(&net->ipv4.tcp_death_row.tw_refcount, 1);
--
2.43.0
^ permalink raw reply related [flat|nested] 9+ messages in thread
* Re: [PATCH net-next v2 1/2] tcp: Measure TIME-WAIT reuse delay with millisecond precision
2024-12-09 19:38 ` [PATCH net-next v2 1/2] tcp: Measure TIME-WAIT reuse delay with millisecond precision Jakub Sitnicki
@ 2024-12-10 8:11 ` Eric Dumazet
2024-12-12 1:00 ` Jason Xing
1 sibling, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2024-12-10 8:11 UTC (permalink / raw)
To: Jakub Sitnicki
Cc: netdev, David S. Miller, Jakub Kicinski, Paolo Abeni, Jason Xing,
Adrien Vasseur, Lee Valentine, kernel-team
On Mon, Dec 9, 2024 at 8:38 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> Prepare ground for TIME-WAIT socket reuse with subsecond delay.
>
> Today the last TS.Recent update timestamp, recorded in seconds and stored
> in the tp->ts_recent_stamp and tw->tw_ts_recent_stamp fields, has two
> purposes.
>
> Firstly, it is used to track the age of the last recorded TS.Recent value
> to detect when that value becomes outdated due to potential wrap-around of
> the other TCP timestamp clock (RFC 7323, section 5.5).
>
> For this purpose a second-based timestamp is completely sufficient as even
> in the worst case scenario of a peer using a high resolution microsecond
> timestamp, the wrap-around interval is ~36 minutes long.
>
> Secondly, it serves as a threshold value for allowing TIME-WAIT socket
> reuse. A TIME-WAIT socket can be reused only once the virtual 1 Hz clock,
> ktime_get_seconds, is past the TS.Recent update timestamp.
>
> The purpose behind delaying the TIME-WAIT socket reuse is to wait for the
> other TCP timestamp clock to tick at least once before reusing the
> connection. It is only then that the PAWS mechanism for the reopened
> connection can detect old duplicate segments from the previous connection
> incarnation (RFC 7323, appendix B.2).
>
> In this case using a timestamp with second resolution not only blocks the
> way toward allowing faster TIME-WAIT reuse after shorter subsecond delay,
> but also makes it impossible to reliably delay TW reuse by one second.
>
> As Eric Dumazet has pointed out [1], due to timestamp rounding, the TW
> reuse delay will actually be between (0, 1] seconds, and 0.5 seconds on
> average. We delay TW reuse for one full second only when last TS.Recent
> update coincides with our virtual 1 Hz clock tick.
>
> Considering the above, introduce a dedicated field to store a millisecond
> timestamp of transition into the TIME-WAIT state. Place it in an existing
> 4-byte hole inside inet_timewait_sock structure to avoid an additional
> memory cost.
>
> Use the new timestamp to (i) reliably delay TIME-WAIT reuse by one second,
> and (ii) prepare for configurable subsecond reuse delay in the subsequent
> change.
>
> We assume here that a full one second delay was the original intention in
> [2] because it accounts for the worst case scenario of the other TCP using
> the slowest recommended 1 Hz timestamp clock.
>
> A more involved alternative would be to change the resolution of the last
> TS.Recent update timestamp, tw->tw_ts_recent_stamp, to milliseconds.
>
> [1] https://lore.kernel.org/netdev/CANn89iKB4GFd8sVzCbRttqw_96o3i2wDhX-3DraQtsceNGYwug@mail.gmail.com/
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b8439924316d5bcb266d165b93d632a4b4b859af
>
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH net-next v2 2/2] tcp: Add sysctl to configure TIME-WAIT reuse delay
2024-12-09 19:38 ` [PATCH net-next v2 2/2] tcp: Add sysctl to configure TIME-WAIT reuse delay Jakub Sitnicki
@ 2024-12-10 8:21 ` Eric Dumazet
2024-12-12 1:08 ` Jason Xing
1 sibling, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2024-12-10 8:21 UTC (permalink / raw)
To: Jakub Sitnicki
Cc: netdev, David S. Miller, Jakub Kicinski, Paolo Abeni, Jason Xing,
Adrien Vasseur, Lee Valentine, kernel-team
On Mon, Dec 9, 2024 at 8:38 PM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> Today we have a hardcoded delay of 1 sec before a TIME-WAIT socket can be
> reused by reopening a connection. This is a safe choice based on an
> assumption that the other TCP timestamp clock frequency, which is unknown
> to us, may be as low as 1 Hz (RFC 7323, section 5.4).
>
> However, this means that in the presence of short-lived connections with an
> RTT of a couple of milliseconds, the time during which a 4-tuple is blocked
> from reuse can be orders of magnitude longer than the connection lifetime.
> Combined with a reduced pool of ephemeral ports, when using
> IP_LOCAL_PORT_RANGE to share an egress IP address between hosts [1], the
> long TIME-WAIT reuse delay can lead to port exhaustion, where all available
> 4-tuples are tied up in TIME-WAIT state.
>
> Turn the reuse delay into a per-netns setting so that sysadmins can make
> more aggressive assumptions about remote TCP timestamp clock frequency and
> shorten the delay in order to allow connections to reincarnate faster.
>
> Note that applications can completely bypass the TIME-WAIT delay protection
> already today by locking the local port with bind() before connecting. Such
> immediate connection reuse may result in PAWS failing to detect old
> duplicate segments, leaving us with just the sequence number check as a
> safety net.
>
> This new configurable offers a trade off where the sysadmin can balance
> between the risk of PAWS detection failing to act versus exhausting ports
> by having sockets tied up in TIME-WAIT state for too long.
>
> [1] https://lpc.events/event/16/contributions/1349/
>
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Thanks !
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH net-next v2 1/2] tcp: Measure TIME-WAIT reuse delay with millisecond precision
2024-12-09 19:38 ` [PATCH net-next v2 1/2] tcp: Measure TIME-WAIT reuse delay with millisecond precision Jakub Sitnicki
2024-12-10 8:11 ` Eric Dumazet
@ 2024-12-12 1:00 ` Jason Xing
1 sibling, 0 replies; 9+ messages in thread
From: Jason Xing @ 2024-12-12 1:00 UTC (permalink / raw)
To: Jakub Sitnicki
Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Adrien Vasseur, Lee Valentine, kernel-team
On Tue, Dec 10, 2024 at 3:38 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> Prepare ground for TIME-WAIT socket reuse with subsecond delay.
>
> Today the last TS.Recent update timestamp, recorded in seconds and stored
> in the tp->ts_recent_stamp and tw->tw_ts_recent_stamp fields, has two
> purposes.
>
> Firstly, it is used to track the age of the last recorded TS.Recent value
> to detect when that value becomes outdated due to potential wrap-around of
> the other TCP timestamp clock (RFC 7323, section 5.5).
>
> For this purpose a second-based timestamp is completely sufficient as even
> in the worst case scenario of a peer using a high resolution microsecond
> timestamp, the wrap-around interval is ~36 minutes long.
>
> Secondly, it serves as a threshold value for allowing TIME-WAIT socket
> reuse. A TIME-WAIT socket can be reused only once the virtual 1 Hz clock,
> ktime_get_seconds, is past the TS.Recent update timestamp.
>
> The purpose behind delaying the TIME-WAIT socket reuse is to wait for the
> other TCP timestamp clock to tick at least once before reusing the
> connection. It is only then that the PAWS mechanism for the reopened
> connection can detect old duplicate segments from the previous connection
> incarnation (RFC 7323, appendix B.2).
>
> In this case using a timestamp with second resolution not only blocks the
> way toward allowing faster TIME-WAIT reuse after shorter subsecond delay,
> but also makes it impossible to reliably delay TW reuse by one second.
>
> As Eric Dumazet has pointed out [1], due to timestamp rounding, the TW
> reuse delay will actually be between (0, 1] seconds, and 0.5 seconds on
> average. We delay TW reuse for one full second only when last TS.Recent
> update coincides with our virtual 1 Hz clock tick.
>
> Considering the above, introduce a dedicated field to store a millisecond
> timestamp of transition into the TIME-WAIT state. Place it in an existing
> 4-byte hole inside inet_timewait_sock structure to avoid an additional
> memory cost.
>
> Use the new timestamp to (i) reliably delay TIME-WAIT reuse by one second,
> and (ii) prepare for configurable subsecond reuse delay in the subsequent
> change.
>
> We assume here that a full one second delay was the original intention in
> [2] because it accounts for the worst case scenario of the other TCP using
> the slowest recommended 1 Hz timestamp clock.
>
> A more involved alternative would be to change the resolution of the last
> TS.Recent update timestamp, tw->tw_ts_recent_stamp, to milliseconds.
>
> [1] https://lore.kernel.org/netdev/CANn89iKB4GFd8sVzCbRttqw_96o3i2wDhX-3DraQtsceNGYwug@mail.gmail.com/
> [2] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=b8439924316d5bcb266d165b93d632a4b4b859af
>
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Thanks for your effort!
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH net-next v2 2/2] tcp: Add sysctl to configure TIME-WAIT reuse delay
2024-12-09 19:38 ` [PATCH net-next v2 2/2] tcp: Add sysctl to configure TIME-WAIT reuse delay Jakub Sitnicki
2024-12-10 8:21 ` Eric Dumazet
@ 2024-12-12 1:08 ` Jason Xing
1 sibling, 0 replies; 9+ messages in thread
From: Jason Xing @ 2024-12-12 1:08 UTC (permalink / raw)
To: Jakub Sitnicki
Cc: netdev, David S. Miller, Eric Dumazet, Jakub Kicinski,
Paolo Abeni, Adrien Vasseur, Lee Valentine, kernel-team
On Tue, Dec 10, 2024 at 3:38 AM Jakub Sitnicki <jakub@cloudflare.com> wrote:
>
> Today we have a hardcoded delay of 1 sec before a TIME-WAIT socket can be
> reused by reopening a connection. This is a safe choice based on an
> assumption that the other TCP timestamp clock frequency, which is unknown
> to us, may be as low as 1 Hz (RFC 7323, section 5.4).
>
> However, this means that in the presence of short-lived connections with an
> RTT of a couple of milliseconds, the time during which a 4-tuple is blocked
> from reuse can be orders of magnitude longer than the connection lifetime.
> Combined with a reduced pool of ephemeral ports, when using
> IP_LOCAL_PORT_RANGE to share an egress IP address between hosts [1], the
> long TIME-WAIT reuse delay can lead to port exhaustion, where all available
> 4-tuples are tied up in TIME-WAIT state.
>
> Turn the reuse delay into a per-netns setting so that sysadmins can make
> more aggressive assumptions about remote TCP timestamp clock frequency and
> shorten the delay in order to allow connections to reincarnate faster.
>
> Note that applications can completely bypass the TIME-WAIT delay protection
> already today by locking the local port with bind() before connecting. Such
> immediate connection reuse may result in PAWS failing to detect old
> duplicate segments, leaving us with just the sequence number check as a
> safety net.
>
> This new configurable offers a trade off where the sysadmin can balance
> between the risk of PAWS detection failing to act versus exhausting ports
> by having sockets tied up in TIME-WAIT state for too long.
>
> [1] https://lpc.events/event/16/contributions/1349/
>
> Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Thanks. I feel this will benefit a certain group of people soon :)
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH net-next v2 0/2] Make TIME-WAIT reuse delay deterministic and configurable
2024-12-09 19:38 [PATCH net-next v2 0/2] Make TIME-WAIT reuse delay deterministic and configurable Jakub Sitnicki
2024-12-09 19:38 ` [PATCH net-next v2 1/2] tcp: Measure TIME-WAIT reuse delay with millisecond precision Jakub Sitnicki
2024-12-09 19:38 ` [PATCH net-next v2 2/2] tcp: Add sysctl to configure TIME-WAIT reuse delay Jakub Sitnicki
@ 2024-12-12 4:30 ` patchwork-bot+netdevbpf
2024-12-13 13:06 ` Jakub Sitnicki
3 siblings, 0 replies; 9+ messages in thread
From: patchwork-bot+netdevbpf @ 2024-12-12 4:30 UTC (permalink / raw)
To: Jakub Sitnicki
Cc: netdev, davem, edumazet, kuba, pabeni, kerneljasonxing, avasseur,
lvalentine, kernel-team
Hello:
This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:
On Mon, 09 Dec 2024 20:38:02 +0100 you wrote:
> This patch set is an effort to enable faster reuse of TIME-WAIT sockets.
> We have recently talked about the motivation and the idea at Plumbers [1].
>
> Experiment in production
> ------------------------
>
> We are restarting our experiment on a small set of production nodes as the
> code has slightly changed since v1 [2], and there are still a few weeks of
> development window to soak the changes. We will report back if we observe
> any regressions.
>
> [...]
Here is the summary with links:
- [net-next,v2,1/2] tcp: Measure TIME-WAIT reuse delay with millisecond precision
https://git.kernel.org/netdev/net-next/c/19ce8cd30465
- [net-next,v2,2/2] tcp: Add sysctl to configure TIME-WAIT reuse delay
https://git.kernel.org/netdev/net-next/c/ca6a6f93867a
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply [flat|nested] 9+ messages in thread
* Re: [PATCH net-next v2 0/2] Make TIME-WAIT reuse delay deterministic and configurable
2024-12-09 19:38 [PATCH net-next v2 0/2] Make TIME-WAIT reuse delay deterministic and configurable Jakub Sitnicki
` (2 preceding siblings ...)
2024-12-12 4:30 ` [PATCH net-next v2 0/2] Make TIME-WAIT reuse delay deterministic and configurable patchwork-bot+netdevbpf
@ 2024-12-13 13:06 ` Jakub Sitnicki
3 siblings, 0 replies; 9+ messages in thread
From: Jakub Sitnicki @ 2024-12-13 13:06 UTC (permalink / raw)
To: netdev
Cc: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
Jason Xing, Adrien Vasseur, Lee Valentine, kernel-team
On Mon, Dec 09, 2024 at 08:38 PM +01, Jakub Sitnicki wrote:
> Packetdrill tests
> -----------------
>
> The packetdrill tests for TIME-WAIT reuse [3] have not changed since v1.
> Although we are not touching PAWS code any more, I would still like to add
> tests to cover PAWS reject after TW reuse. This, however, requires patching
> packetdrill as I mentioned in the last cover letter [2].
Thank you for the prompt reviews. Happy to hear there are other users
looking to adopt these.
Since patches are now in net-next, I have moved the accompanying
packetdrill PR from Draft to Open, if you want to follow that work:
https://github.com/google/packetdrill/pull/90
^ permalink raw reply [flat|nested] 9+ messages in thread