* [PATCH net v2 0/2] tcp: restrict crossed SYN specific actions to SYN-ACK
@ 2024-07-18 10:33 Matthieu Baerts (NGI0)
2024-07-18 10:33 ` [PATCH net v2 1/2] tcp: process the 3rd ACK with sk_socket for TFO/MPTCP Matthieu Baerts (NGI0)
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Matthieu Baerts (NGI0) @ 2024-07-18 10:33 UTC (permalink / raw)
To: Eric Dumazet, David S. Miller, David Ahern, Jakub Kicinski,
Paolo Abeni, Kuniyuki Iwashima, Jerry Chu
Cc: netdev, mptcp, linux-kernel, Matthieu Baerts (NGI0)
A recent commit in TCP affects TFO and MPTCP in some cases. The first
patch fixes that.
The second one applies the same fix to another crossed SYN specific
action just before.
These two fixes simply restrict what should be done only for crossed SYN
cases to packets with SYN-ACK flags.
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
Changes in v2:
- Patch 1/2 has a simpler and generic check (Kuniyuki), and an updated
comment.
- New patch 2/2: a related fix
- Link to v1: https://lore.kernel.org/r/20240716-upstream-net-next-20240716-tcp-3rd-ack-consume-sk_socket-v1-1-4e61d0b79233@kernel.org
---
Matthieu Baerts (NGI0) (2):
tcp: process the 3rd ACK with sk_socket for TFO/MPTCP
tcp: limit wake-up for crossed SYN cases to SYN-ACK
net/ipv4/tcp_input.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
---
base-commit: 120f1c857a73e52132e473dee89b340440cb692b
change-id: 20240716-upstream-net-next-20240716-tcp-3rd-ack-consume-sk_socket-0cbd159fc5b0
Best regards,
--
Matthieu Baerts (NGI0) <matttbe@kernel.org>
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH net v2 1/2] tcp: process the 3rd ACK with sk_socket for TFO/MPTCP
2024-07-18 10:33 [PATCH net v2 0/2] tcp: restrict crossed SYN specific actions to SYN-ACK Matthieu Baerts (NGI0)
@ 2024-07-18 10:33 ` Matthieu Baerts (NGI0)
2024-07-23 14:37 ` Eric Dumazet
2024-07-18 10:33 ` [PATCH net v2 2/2] tcp: limit wake-up for crossed SYN cases to SYN-ACK Matthieu Baerts (NGI0)
2024-07-23 8:10 ` [PATCH net v2 0/2] tcp: restrict crossed SYN specific actions " Paolo Abeni
2 siblings, 1 reply; 17+ messages in thread
From: Matthieu Baerts (NGI0) @ 2024-07-18 10:33 UTC (permalink / raw)
To: Eric Dumazet, David S. Miller, David Ahern, Jakub Kicinski,
Paolo Abeni, Kuniyuki Iwashima, Jerry Chu
Cc: netdev, mptcp, linux-kernel, Matthieu Baerts (NGI0)
The 'Fixes' commit recently changed the behaviour of TCP by skipping the
processing of the 3rd ACK when a sk->sk_socket is set. The goal was to
skip tcp_ack_snd_check() in tcp_rcv_state_process() not to send an
unnecessary ACK in case of simultaneous connect(). Unfortunately, that
had an impact on TFO and MPTCP.
I started to look at the impact on MPTCP, because the MPTCP CI found
some issues with the MPTCP Packetdrill tests [1]. Then Paolo suggested
me to look at the impact on TFO with "plain" TCP.
For MPTCP, when receiving the 3rd ACK of a request adding a new path
(MP_JOIN), sk->sk_socket will be set, and point to the MPTCP sock that
has been created when the MPTCP connection got established before with
the first path. The newly added 'goto' will then skip the processing of
the segment text (step 7) and not go through tcp_data_queue() where the
MPTCP options are validated, and some actions are triggered, e.g.
sending the MPJ 4th ACK [2] as demonstrated by the new errors when
running a packetdrill test [3] establishing a second subflow.
This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be
delayed. Still, we don't want to have this behaviour as it delays the
switch to the fully established mode, and invalid MPTCP options in this
3rd ACK will not be caught any more. This modification also affects the
MPTCP + TFO feature as well, and being the reason why the selftests
started to be unstable the last few days [4].
For TFO, the existing 'basic-cookie-not-reqd' test [5] was no longer
passing: if the 3rd ACK contains data, and the connection is accept()ed
before receiving them, these data would no longer be processed, and thus
not ACKed.
One last thing about MPTCP, in case of simultaneous connect(), a
fallback to TCP will be done, which seems fine:
`../common/defaults.sh`
0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_MPTCP) = 3
+0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
+0 > S 0:0(0) <mss 1460, sackOK, TS val 100 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
+0 < S 0:0(0) win 1000 <mss 1460, sackOK, TS val 407 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
+0 > S. 0:0(0) ack 1 <mss 1460, sackOK, TS val 330 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
+0 < S. 0:0(0) ack 1 win 65535 <mss 1460, sackOK, TS val 700 ecr 100, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey=2]>
+0 write(3, ..., 100) = 100
+0 > . 1:1(0) ack 1 <nop, nop, TS val 845707014 ecr 700, nop, nop, sack 0:1>
+0 > P. 1:101(100) ack 1 <nop, nop, TS val 845958933 ecr 700>
Simultaneous SYN-data crossing is also not supported by TFO, see [6].
Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/9936227696 [1]
Link: https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens [2]
Link: https://github.com/multipath-tcp/packetdrill/blob/mptcp-net-next/gtests/net/mptcp/syscalls/accept.pkt#L28 [3]
Link: https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh [4]
Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/server/basic-cookie-not-reqd.pkt#L21 [5]
Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/client/simultaneous-fast-open.pkt [6]
Fixes: 23e89e8ee7be ("tcp: Don't drop SYN+ACK for simultaneous connect().")
Suggested-by: Paolo Abeni <pabeni@redhat.com>
Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
Notes:
- We could also drop this 'goto consume', and send the unnecessary ACK
in this simultaneous connect case, which doesn't seem to be a "real"
case, more something for fuzzers. But that's not what the RFC 9293
recommends to do.
- v2:
- Check if the SYN bit is set instead of looking for TFO and MPTCP
specific attributes, as suggested by Kuniyuki.
- Updated the comment above
- Please note that the v2 has been sent mainly to satisfy the CI (to
be able to catch new bugs with MPTCP), and because the suggestion
from Kuniyuki looks better. It has not been sent to urge TCP
maintainers to review it quicker than it should, please take your
time and enjoy netdev.conf :)
---
net/ipv4/tcp_input.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ff9ab3d01ced..bfe1bc69dc3e 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6820,7 +6820,12 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
if (sk->sk_shutdown & SEND_SHUTDOWN)
tcp_shutdown(sk, SEND_SHUTDOWN);
- if (sk->sk_socket)
+ /* For crossed SYN cases, not to send an unnecessary ACK.
+ * Note that sk->sk_socket can be assigned in other cases, e.g.
+ * with TFO (if accept()'ed before the 3rd ACK) and MPTCP (MPJ:
+ * sk_socket is the parent MPTCP sock).
+ */
+ if (sk->sk_socket && th->syn)
goto consume;
break;
--
2.45.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH net v2 2/2] tcp: limit wake-up for crossed SYN cases to SYN-ACK
2024-07-18 10:33 [PATCH net v2 0/2] tcp: restrict crossed SYN specific actions to SYN-ACK Matthieu Baerts (NGI0)
2024-07-18 10:33 ` [PATCH net v2 1/2] tcp: process the 3rd ACK with sk_socket for TFO/MPTCP Matthieu Baerts (NGI0)
@ 2024-07-18 10:33 ` Matthieu Baerts (NGI0)
2024-07-23 14:32 ` Eric Dumazet
2024-07-23 8:10 ` [PATCH net v2 0/2] tcp: restrict crossed SYN specific actions " Paolo Abeni
2 siblings, 1 reply; 17+ messages in thread
From: Matthieu Baerts (NGI0) @ 2024-07-18 10:33 UTC (permalink / raw)
To: Eric Dumazet, David S. Miller, David Ahern, Jakub Kicinski,
Paolo Abeni, Kuniyuki Iwashima, Jerry Chu
Cc: netdev, mptcp, linux-kernel, Matthieu Baerts (NGI0)
sk->sk_socket will be assigned in case of marginal crossed SYN, but also
in other cases, e.g.
- With TCP Fast Open, if the connection got accept()'ed before
receiving the 3rd ACK ;
- With MPTCP, when accepting additional subflows to an existing MPTCP
connection.
In these cases, the switch to TCP_ESTABLISHED is done when receiving the
3rd ACK, without the SYN flag then.
To properly restrict the wake-up to crossed SYN cases, it is then
required to also limit the check to packets containing the SYN-ACK
flags.
While at it, also update the attached comment: sk->sk_sleep has been
removed in 2010, and replaced by sk->sk_wq in commit 43815482370c ("net:
sock_def_readable() and friends RCU conversion").
Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
---
Notes:
- The above 'Fixes' tag should correspond to the commit introducing the
possibility to have sk->sk_socket being set there in other cases than
the crossed SYN one. But I might have missed other cases. Maybe
1da177e4c3f4 ("Linux-2.6.12-rc2") might be safer? On the other hand,
I don't think this wake-up was causing any visible issue, apart from
not being needed.
---
net/ipv4/tcp_input.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index bfe1bc69dc3e..5cebb389bf71 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6797,9 +6797,9 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
/* Note, that this wakeup is only for marginal crossed SYN case.
* Passively open sockets are not waked up, because
- * sk->sk_sleep == NULL and sk->sk_socket == NULL.
+ * sk->sk_wq == NULL and sk->sk_socket == NULL.
*/
- if (sk->sk_socket)
+ if (sk->sk_socket && th->syn)
sk_wake_async(sk, SOCK_WAKE_IO, POLL_OUT);
tp->snd_una = TCP_SKB_CB(skb)->ack_seq;
--
2.45.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
* Re: [PATCH net v2 0/2] tcp: restrict crossed SYN specific actions to SYN-ACK
2024-07-18 10:33 [PATCH net v2 0/2] tcp: restrict crossed SYN specific actions to SYN-ACK Matthieu Baerts (NGI0)
2024-07-18 10:33 ` [PATCH net v2 1/2] tcp: process the 3rd ACK with sk_socket for TFO/MPTCP Matthieu Baerts (NGI0)
2024-07-18 10:33 ` [PATCH net v2 2/2] tcp: limit wake-up for crossed SYN cases to SYN-ACK Matthieu Baerts (NGI0)
@ 2024-07-23 8:10 ` Paolo Abeni
2024-07-23 8:22 ` Eric Dumazet
2 siblings, 1 reply; 17+ messages in thread
From: Paolo Abeni @ 2024-07-23 8:10 UTC (permalink / raw)
To: Matthieu Baerts (NGI0), Neal Cardwell
Cc: netdev, mptcp, linux-kernel, Jerry Chu, Jakub Kicinski,
David Ahern, Eric Dumazet, David S. Miller, Kuniyuki Iwashima
On 7/18/24 12:33, Matthieu Baerts (NGI0) wrote:
> A recent commit in TCP affects TFO and MPTCP in some cases. The first
> patch fixes that.
>
> The second one applies the same fix to another crossed SYN specific
> action just before.
>
> These two fixes simply restrict what should be done only for crossed SYN
> cases to packets with SYN-ACK flags.
>
> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
> ---
> Changes in v2:
> - Patch 1/2 has a simpler and generic check (Kuniyuki), and an updated
> comment.
> - New patch 2/2: a related fix
> - Link to v1: https://lore.kernel.org/r/20240716-upstream-net-next-20240716-tcp-3rd-ack-consume-sk_socket-v1-1-4e61d0b79233@kernel.org
Re-adding Neal for awareness. It would be great if this could go through
some packetdrill testing,
Thanks!
Paolo
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net v2 0/2] tcp: restrict crossed SYN specific actions to SYN-ACK
2024-07-23 8:10 ` [PATCH net v2 0/2] tcp: restrict crossed SYN specific actions " Paolo Abeni
@ 2024-07-23 8:22 ` Eric Dumazet
0 siblings, 0 replies; 17+ messages in thread
From: Eric Dumazet @ 2024-07-23 8:22 UTC (permalink / raw)
To: Paolo Abeni
Cc: Matthieu Baerts (NGI0), Neal Cardwell, netdev, mptcp,
linux-kernel, Jerry Chu, Jakub Kicinski, David Ahern,
David S. Miller, Kuniyuki Iwashima
On Tue, Jul 23, 2024 at 1:10 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 7/18/24 12:33, Matthieu Baerts (NGI0) wrote:
> > A recent commit in TCP affects TFO and MPTCP in some cases. The first
> > patch fixes that.
> >
> > The second one applies the same fix to another crossed SYN specific
> > action just before.
> >
> > These two fixes simply restrict what should be done only for crossed SYN
> > cases to packets with SYN-ACK flags.
> >
> > Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
> > ---
> > Changes in v2:
> > - Patch 1/2 has a simpler and generic check (Kuniyuki), and an updated
> > comment.
> > - New patch 2/2: a related fix
> > - Link to v1: https://lore.kernel.org/r/20240716-upstream-net-next-20240716-tcp-3rd-ack-consume-sk_socket-v1-1-4e61d0b79233@kernel.org
>
> Re-adding Neal for awareness. It would be great if this could go through
> some packetdrill testing,
>
I am back in France, let me see what I can do.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net v2 2/2] tcp: limit wake-up for crossed SYN cases to SYN-ACK
2024-07-18 10:33 ` [PATCH net v2 2/2] tcp: limit wake-up for crossed SYN cases to SYN-ACK Matthieu Baerts (NGI0)
@ 2024-07-23 14:32 ` Eric Dumazet
2024-07-23 14:40 ` Matthieu Baerts
0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2024-07-23 14:32 UTC (permalink / raw)
To: Matthieu Baerts (NGI0)
Cc: David S. Miller, David Ahern, Jakub Kicinski, Paolo Abeni,
Kuniyuki Iwashima, Jerry Chu, netdev, mptcp, linux-kernel
On Thu, Jul 18, 2024 at 12:34 PM Matthieu Baerts (NGI0)
<matttbe@kernel.org> wrote:
>
> sk->sk_socket will be assigned in case of marginal crossed SYN, but also
> in other cases, e.g.
>
> - With TCP Fast Open, if the connection got accept()'ed before
> receiving the 3rd ACK ;
>
> - With MPTCP, when accepting additional subflows to an existing MPTCP
> connection.
>
> In these cases, the switch to TCP_ESTABLISHED is done when receiving the
> 3rd ACK, without the SYN flag then.
>
> To properly restrict the wake-up to crossed SYN cases, it is then
> required to also limit the check to packets containing the SYN-ACK
> flags.
>
> While at it, also update the attached comment: sk->sk_sleep has been
> removed in 2010, and replaced by sk->sk_wq in commit 43815482370c ("net:
> sock_def_readable() and friends RCU conversion").
>
> Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
> Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
> ---
> Notes:
> - The above 'Fixes' tag should correspond to the commit introducing the
> possibility to have sk->sk_socket being set there in other cases than
> the crossed SYN one. But I might have missed other cases. Maybe
> 1da177e4c3f4 ("Linux-2.6.12-rc2") might be safer? On the other hand,
> I don't think this wake-up was causing any visible issue, apart from
> not being needed.
This seems a net-next candidate to me ?
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net v2 1/2] tcp: process the 3rd ACK with sk_socket for TFO/MPTCP
2024-07-18 10:33 ` [PATCH net v2 1/2] tcp: process the 3rd ACK with sk_socket for TFO/MPTCP Matthieu Baerts (NGI0)
@ 2024-07-23 14:37 ` Eric Dumazet
2024-07-23 14:58 ` Matthieu Baerts
0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2024-07-23 14:37 UTC (permalink / raw)
To: Matthieu Baerts (NGI0)
Cc: David S. Miller, David Ahern, Jakub Kicinski, Paolo Abeni,
Kuniyuki Iwashima, Jerry Chu, netdev, mptcp, linux-kernel
On Thu, Jul 18, 2024 at 12:34 PM Matthieu Baerts (NGI0)
<matttbe@kernel.org> wrote:
>
> The 'Fixes' commit recently changed the behaviour of TCP by skipping the
> processing of the 3rd ACK when a sk->sk_socket is set. The goal was to
> skip tcp_ack_snd_check() in tcp_rcv_state_process() not to send an
> unnecessary ACK in case of simultaneous connect(). Unfortunately, that
> had an impact on TFO and MPTCP.
>
> I started to look at the impact on MPTCP, because the MPTCP CI found
> some issues with the MPTCP Packetdrill tests [1]. Then Paolo suggested
> me to look at the impact on TFO with "plain" TCP.
>
> For MPTCP, when receiving the 3rd ACK of a request adding a new path
> (MP_JOIN), sk->sk_socket will be set, and point to the MPTCP sock that
> has been created when the MPTCP connection got established before with
> the first path. The newly added 'goto' will then skip the processing of
> the segment text (step 7) and not go through tcp_data_queue() where the
> MPTCP options are validated, and some actions are triggered, e.g.
> sending the MPJ 4th ACK [2] as demonstrated by the new errors when
> running a packetdrill test [3] establishing a second subflow.
>
> This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be
> delayed. Still, we don't want to have this behaviour as it delays the
> switch to the fully established mode, and invalid MPTCP options in this
> 3rd ACK will not be caught any more. This modification also affects the
> MPTCP + TFO feature as well, and being the reason why the selftests
> started to be unstable the last few days [4].
>
> For TFO, the existing 'basic-cookie-not-reqd' test [5] was no longer
> passing: if the 3rd ACK contains data, and the connection is accept()ed
> before receiving them, these data would no longer be processed, and thus
> not ACKed.
>
> One last thing about MPTCP, in case of simultaneous connect(), a
> fallback to TCP will be done, which seems fine:
>
> `../common/defaults.sh`
>
> 0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_MPTCP) = 3
> +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
>
> +0 > S 0:0(0) <mss 1460, sackOK, TS val 100 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
> +0 < S 0:0(0) win 1000 <mss 1460, sackOK, TS val 407 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
> +0 > S. 0:0(0) ack 1 <mss 1460, sackOK, TS val 330 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
> +0 < S. 0:0(0) ack 1 win 65535 <mss 1460, sackOK, TS val 700 ecr 100, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey=2]>
>
> +0 write(3, ..., 100) = 100
> +0 > . 1:1(0) ack 1 <nop, nop, TS val 845707014 ecr 700, nop, nop, sack 0:1>
> +0 > P. 1:101(100) ack 1 <nop, nop, TS val 845958933 ecr 700>
>
> Simultaneous SYN-data crossing is also not supported by TFO, see [6].
>
> Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/9936227696 [1]
> Link: https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens [2]
> Link: https://github.com/multipath-tcp/packetdrill/blob/mptcp-net-next/gtests/net/mptcp/syscalls/accept.pkt#L28 [3]
> Link: https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh [4]
> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/server/basic-cookie-not-reqd.pkt#L21 [5]
> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/client/simultaneous-fast-open.pkt [6]
> Fixes: 23e89e8ee7be ("tcp: Don't drop SYN+ACK for simultaneous connect().")
> Suggested-by: Paolo Abeni <pabeni@redhat.com>
> Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
> ---
> Notes:
> - We could also drop this 'goto consume', and send the unnecessary ACK
> in this simultaneous connect case, which doesn't seem to be a "real"
> case, more something for fuzzers. But that's not what the RFC 9293
> recommends to do.
> - v2:
> - Check if the SYN bit is set instead of looking for TFO and MPTCP
> specific attributes, as suggested by Kuniyuki.
> - Updated the comment above
> - Please note that the v2 has been sent mainly to satisfy the CI (to
> be able to catch new bugs with MPTCP), and because the suggestion
> from Kuniyuki looks better. It has not been sent to urge TCP
> maintainers to review it quicker than it should, please take your
> time and enjoy netdev.conf :)
> ---
> net/ipv4/tcp_input.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index ff9ab3d01ced..bfe1bc69dc3e 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -6820,7 +6820,12 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
> if (sk->sk_shutdown & SEND_SHUTDOWN)
> tcp_shutdown(sk, SEND_SHUTDOWN);
>
> - if (sk->sk_socket)
> + /* For crossed SYN cases, not to send an unnecessary ACK.
> + * Note that sk->sk_socket can be assigned in other cases, e.g.
> + * with TFO (if accept()'ed before the 3rd ACK) and MPTCP (MPJ:
> + * sk_socket is the parent MPTCP sock).
> + */
> + if (sk->sk_socket && th->syn)
> goto consume;
I think we should simply remove this part completely, because we
should send an ack anyway.
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index ff9ab3d01ced89570903d3a9f649a637c5e07a90..91357d4713182078debd746a224046cba80ea3ce
100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -6820,8 +6820,6 @@ tcp_rcv_state_process(struct sock *sk, struct
sk_buff *skb)
if (sk->sk_shutdown & SEND_SHUTDOWN)
tcp_shutdown(sk, SEND_SHUTDOWN);
- if (sk->sk_socket)
- goto consume;
break;
case TCP_FIN_WAIT1: {
I have a failing packetdrill test after Kuniyuki patch :
//
// Test the simultaneous open scenario that both end sends
// SYN/data. Although we don't support that the connection should
// still be established.
//
`../../common/defaults.sh
../../common/set_sysctls.py /proc/sys/net/ipv4/tcp_timestamps=0`
// Cache warmup: send a Fast Open cookie request
0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
+0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
+0 sendto(3, ..., 0, MSG_FASTOPEN, ..., ...) = -1 EINPROGRESS
(Operation is now in progress)
+0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop>
+.01 < S. 123:123(0) ack 1 win 14600 <mss
1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop>
+0 > . 1:1(0) ack 1
+.01 close(3) = 0
+0 > F. 1:1(0) ack 1
+.01 < F. 1:1(0) ack 2 win 92
+0 > . 2:2(0) ack 2
//
// Test: simulatenous fast open
//
+.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
+0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
+0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
+0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO
abcd1234,nop,nop>
// Simul. SYN-data crossing: we don't support that yet so ack only remote ISN
+.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale
6,FO 87654321,nop,nop>
+0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8>
// SYN data is never retried.
+.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
+0 > . 1001:1001(0) ack 1
// The other end retries
+.1 < P. 1:501(500) ack 1000 win 257
+0 > . 1001:1001(0) ack 501
+0 read(4, ..., 4096) = 500
+0 close(4) = 0
+0 > F. 1001:1001(0) ack 501
+.05 < F. 501:501(0) ack 1002 win 257
+0 > . 1002:1002(0) ack 502
`/tmp/sysctl_restore_${PPID}.sh`
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net v2 2/2] tcp: limit wake-up for crossed SYN cases to SYN-ACK
2024-07-23 14:32 ` Eric Dumazet
@ 2024-07-23 14:40 ` Matthieu Baerts
0 siblings, 0 replies; 17+ messages in thread
From: Matthieu Baerts @ 2024-07-23 14:40 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S. Miller, David Ahern, Jakub Kicinski, Paolo Abeni,
Kuniyuki Iwashima, Jerry Chu, netdev, mptcp, linux-kernel
Hi Eric,
Thank you for the review!
On 23/07/2024 16:32, Eric Dumazet wrote:
> On Thu, Jul 18, 2024 at 12:34 PM Matthieu Baerts (NGI0)
> <matttbe@kernel.org> wrote:
>>
>> sk->sk_socket will be assigned in case of marginal crossed SYN, but also
>> in other cases, e.g.
>>
>> - With TCP Fast Open, if the connection got accept()'ed before
>> receiving the 3rd ACK ;
>>
>> - With MPTCP, when accepting additional subflows to an existing MPTCP
>> connection.
>>
>> In these cases, the switch to TCP_ESTABLISHED is done when receiving the
>> 3rd ACK, without the SYN flag then.
>>
>> To properly restrict the wake-up to crossed SYN cases, it is then
>> required to also limit the check to packets containing the SYN-ACK
>> flags.
>>
>> While at it, also update the attached comment: sk->sk_sleep has been
>> removed in 2010, and replaced by sk->sk_wq in commit 43815482370c ("net:
>> sock_def_readable() and friends RCU conversion").
>>
>> Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
>> Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
>> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
>> ---
>> Notes:
>> - The above 'Fixes' tag should correspond to the commit introducing the
>> possibility to have sk->sk_socket being set there in other cases than
>> the crossed SYN one. But I might have missed other cases. Maybe
>> 1da177e4c3f4 ("Linux-2.6.12-rc2") might be safer? On the other hand,
>> I don't think this wake-up was causing any visible issue, apart from
>> not being needed.
>
> This seems a net-next candidate to me ?
Fine by me!
I modified this line mainly because Kuniyuki mentioned that it was the
same check as the new one, modified in patch 1/2. I didn't find any
visible issue with the wakeup, so I guess it can go to net-next.
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net v2 1/2] tcp: process the 3rd ACK with sk_socket for TFO/MPTCP
2024-07-23 14:37 ` Eric Dumazet
@ 2024-07-23 14:58 ` Matthieu Baerts
2024-07-23 15:38 ` Eric Dumazet
0 siblings, 1 reply; 17+ messages in thread
From: Matthieu Baerts @ 2024-07-23 14:58 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S. Miller, David Ahern, Jakub Kicinski, Paolo Abeni,
Kuniyuki Iwashima, netdev, mptcp, linux-kernel, Neal Cardwell
Hi Eric,
+cc Neal
-cc Jerry (NoSuchUser)
On 23/07/2024 16:37, Eric Dumazet wrote:
> On Thu, Jul 18, 2024 at 12:34 PM Matthieu Baerts (NGI0)
> <matttbe@kernel.org> wrote:
>>
>> The 'Fixes' commit recently changed the behaviour of TCP by skipping the
>> processing of the 3rd ACK when a sk->sk_socket is set. The goal was to
>> skip tcp_ack_snd_check() in tcp_rcv_state_process() not to send an
>> unnecessary ACK in case of simultaneous connect(). Unfortunately, that
>> had an impact on TFO and MPTCP.
>>
>> I started to look at the impact on MPTCP, because the MPTCP CI found
>> some issues with the MPTCP Packetdrill tests [1]. Then Paolo suggested
>> me to look at the impact on TFO with "plain" TCP.
>>
>> For MPTCP, when receiving the 3rd ACK of a request adding a new path
>> (MP_JOIN), sk->sk_socket will be set, and point to the MPTCP sock that
>> has been created when the MPTCP connection got established before with
>> the first path. The newly added 'goto' will then skip the processing of
>> the segment text (step 7) and not go through tcp_data_queue() where the
>> MPTCP options are validated, and some actions are triggered, e.g.
>> sending the MPJ 4th ACK [2] as demonstrated by the new errors when
>> running a packetdrill test [3] establishing a second subflow.
>>
>> This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be
>> delayed. Still, we don't want to have this behaviour as it delays the
>> switch to the fully established mode, and invalid MPTCP options in this
>> 3rd ACK will not be caught any more. This modification also affects the
>> MPTCP + TFO feature as well, and being the reason why the selftests
>> started to be unstable the last few days [4].
>>
>> For TFO, the existing 'basic-cookie-not-reqd' test [5] was no longer
>> passing: if the 3rd ACK contains data, and the connection is accept()ed
>> before receiving them, these data would no longer be processed, and thus
>> not ACKed.
>>
>> One last thing about MPTCP, in case of simultaneous connect(), a
>> fallback to TCP will be done, which seems fine:
>>
>> `../common/defaults.sh`
>>
>> 0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_MPTCP) = 3
>> +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
>>
>> +0 > S 0:0(0) <mss 1460, sackOK, TS val 100 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
>> +0 < S 0:0(0) win 1000 <mss 1460, sackOK, TS val 407 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
>> +0 > S. 0:0(0) ack 1 <mss 1460, sackOK, TS val 330 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
>> +0 < S. 0:0(0) ack 1 win 65535 <mss 1460, sackOK, TS val 700 ecr 100, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey=2]>
>>
>> +0 write(3, ..., 100) = 100
>> +0 > . 1:1(0) ack 1 <nop, nop, TS val 845707014 ecr 700, nop, nop, sack 0:1>
>> +0 > P. 1:101(100) ack 1 <nop, nop, TS val 845958933 ecr 700>
>>
>> Simultaneous SYN-data crossing is also not supported by TFO, see [6].
>>
>> Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/9936227696 [1]
>> Link: https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens [2]
>> Link: https://github.com/multipath-tcp/packetdrill/blob/mptcp-net-next/gtests/net/mptcp/syscalls/accept.pkt#L28 [3]
>> Link: https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh [4]
>> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/server/basic-cookie-not-reqd.pkt#L21 [5]
>> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/client/simultaneous-fast-open.pkt [6]
>> Fixes: 23e89e8ee7be ("tcp: Don't drop SYN+ACK for simultaneous connect().")
>> Suggested-by: Paolo Abeni <pabeni@redhat.com>
>> Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
>> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
>> ---
>> Notes:
>> - We could also drop this 'goto consume', and send the unnecessary ACK
>> in this simultaneous connect case, which doesn't seem to be a "real"
>> case, more something for fuzzers. But that's not what the RFC 9293
>> recommends to do.
>> - v2:
>> - Check if the SYN bit is set instead of looking for TFO and MPTCP
>> specific attributes, as suggested by Kuniyuki.
>> - Updated the comment above
>> - Please note that the v2 has been sent mainly to satisfy the CI (to
>> be able to catch new bugs with MPTCP), and because the suggestion
>> from Kuniyuki looks better. It has not been sent to urge TCP
>> maintainers to review it quicker than it should, please take your
>> time and enjoy netdev.conf :)
>> ---
>> net/ipv4/tcp_input.c | 7 ++++++-
>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>> index ff9ab3d01ced..bfe1bc69dc3e 100644
>> --- a/net/ipv4/tcp_input.c
>> +++ b/net/ipv4/tcp_input.c
>> @@ -6820,7 +6820,12 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
>> if (sk->sk_shutdown & SEND_SHUTDOWN)
>> tcp_shutdown(sk, SEND_SHUTDOWN);
>>
>> - if (sk->sk_socket)
>> + /* For crossed SYN cases, not to send an unnecessary ACK.
>> + * Note that sk->sk_socket can be assigned in other cases, e.g.
>> + * with TFO (if accept()'ed before the 3rd ACK) and MPTCP (MPJ:
>> + * sk_socket is the parent MPTCP sock).
>> + */
>> + if (sk->sk_socket && th->syn)
>> goto consume;
>
> I think we should simply remove this part completely, because we
> should send an ack anyway.
Thank you for having looked, and ran the full packetdrill test suite!
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index ff9ab3d01ced89570903d3a9f649a637c5e07a90..91357d4713182078debd746a224046cba80ea3ce
> 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -6820,8 +6820,6 @@ tcp_rcv_state_process(struct sock *sk, struct
> sk_buff *skb)
> if (sk->sk_shutdown & SEND_SHUTDOWN)
> tcp_shutdown(sk, SEND_SHUTDOWN);
>
> - if (sk->sk_socket)
> - goto consume;
> break;
>
> case TCP_FIN_WAIT1: {
>
>
> I have a failing packetdrill test after Kuniyuki patch :
>
>
>
> //
> // Test the simultaneous open scenario that both end sends
> // SYN/data. Although we don't support that the connection should
> // still be established.
> //
> `../../common/defaults.sh
> ../../common/set_sysctls.py /proc/sys/net/ipv4/tcp_timestamps=0`
>
> // Cache warmup: send a Fast Open cookie request
> 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> +0 sendto(3, ..., 0, MSG_FASTOPEN, ..., ...) = -1 EINPROGRESS
> (Operation is now in progress)
> +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop>
> +.01 < S. 123:123(0) ack 1 win 14600 <mss
> 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop>
> +0 > . 1:1(0) ack 1
> +.01 close(3) = 0
> +0 > F. 1:1(0) ack 1
> +.01 < F. 1:1(0) ack 2 win 92
> +0 > . 2:2(0) ack 2
>
>
> //
> // Test: simulatenous fast open
> //
> +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
> +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
> +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO
> abcd1234,nop,nop>
> // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN
> +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale
> 6,FO 87654321,nop,nop>
> +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8>
>
> // SYN data is never retried.
> +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
> 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
> +0 > . 1001:1001(0) ack 1
I recently sent a PR -- already applied -- to Neal to remove this line:
https://github.com/google/packetdrill/pull/86
I thought it was the intension of Kuniyuki's patch not to send this ACK
in this case to follow the RFC 9293's recommendation. This TFO test
looks a bit similar to the example from Kuniyuki's patch:
--------------- 8< ---------------
0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3
+0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
+0 > S 0:0(0) <mss 1460,sackOK,TS val 1000 ecr 0,nop,wscale 8>
+0 < S 0:0(0) win 1000 <mss 1000>
+0 > S. 0:0(0) ack 1 <mss 1460,sackOK,TS val 3308134035 ecr 0,nop,wscale 8>
+0 < S. 0:0(0) ack 1 win 1000
/* No ACK here */
+0 write(3, ..., 100) = 100
+0 > P. 1:101(100) ack 1
--------------- 8< ---------------
But maybe here that should be different for TFO?
For my case with MPTCP (and TFO), it is fine to drop this 'goto consume'
but I don't know how "strict" we want to be regarding the RFC and this
marginal case.
> // The other end retries
> +.1 < P. 1:501(500) ack 1000 win 257
> +0 > . 1001:1001(0) ack 501
> +0 read(4, ..., 4096) = 500
> +0 close(4) = 0
> +0 > F. 1001:1001(0) ack 501
> +.05 < F. 501:501(0) ack 1002 win 257
> +0 > . 1002:1002(0) ack 502
>
> `/tmp/sysctl_restore_${PPID}.sh`
>
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net v2 1/2] tcp: process the 3rd ACK with sk_socket for TFO/MPTCP
2024-07-23 14:58 ` Matthieu Baerts
@ 2024-07-23 15:38 ` Eric Dumazet
2024-07-23 16:08 ` Matthieu Baerts
2024-07-23 21:27 ` Kuniyuki Iwashima
0 siblings, 2 replies; 17+ messages in thread
From: Eric Dumazet @ 2024-07-23 15:38 UTC (permalink / raw)
To: Matthieu Baerts
Cc: David S. Miller, David Ahern, Jakub Kicinski, Paolo Abeni,
Kuniyuki Iwashima, netdev, mptcp, linux-kernel, Neal Cardwell
On Tue, Jul 23, 2024 at 4:58 PM Matthieu Baerts <matttbe@kernel.org> wrote:
>
> Hi Eric,
>
> +cc Neal
> -cc Jerry (NoSuchUser)
>
> On 23/07/2024 16:37, Eric Dumazet wrote:
> > On Thu, Jul 18, 2024 at 12:34 PM Matthieu Baerts (NGI0)
> > <matttbe@kernel.org> wrote:
> >>
> >> The 'Fixes' commit recently changed the behaviour of TCP by skipping the
> >> processing of the 3rd ACK when a sk->sk_socket is set. The goal was to
> >> skip tcp_ack_snd_check() in tcp_rcv_state_process() not to send an
> >> unnecessary ACK in case of simultaneous connect(). Unfortunately, that
> >> had an impact on TFO and MPTCP.
> >>
> >> I started to look at the impact on MPTCP, because the MPTCP CI found
> >> some issues with the MPTCP Packetdrill tests [1]. Then Paolo suggested
> >> me to look at the impact on TFO with "plain" TCP.
> >>
> >> For MPTCP, when receiving the 3rd ACK of a request adding a new path
> >> (MP_JOIN), sk->sk_socket will be set, and point to the MPTCP sock that
> >> has been created when the MPTCP connection got established before with
> >> the first path. The newly added 'goto' will then skip the processing of
> >> the segment text (step 7) and not go through tcp_data_queue() where the
> >> MPTCP options are validated, and some actions are triggered, e.g.
> >> sending the MPJ 4th ACK [2] as demonstrated by the new errors when
> >> running a packetdrill test [3] establishing a second subflow.
> >>
> >> This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be
> >> delayed. Still, we don't want to have this behaviour as it delays the
> >> switch to the fully established mode, and invalid MPTCP options in this
> >> 3rd ACK will not be caught any more. This modification also affects the
> >> MPTCP + TFO feature as well, and being the reason why the selftests
> >> started to be unstable the last few days [4].
> >>
> >> For TFO, the existing 'basic-cookie-not-reqd' test [5] was no longer
> >> passing: if the 3rd ACK contains data, and the connection is accept()ed
> >> before receiving them, these data would no longer be processed, and thus
> >> not ACKed.
> >>
> >> One last thing about MPTCP, in case of simultaneous connect(), a
> >> fallback to TCP will be done, which seems fine:
> >>
> >> `../common/defaults.sh`
> >>
> >> 0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_MPTCP) = 3
> >> +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
> >>
> >> +0 > S 0:0(0) <mss 1460, sackOK, TS val 100 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
> >> +0 < S 0:0(0) win 1000 <mss 1460, sackOK, TS val 407 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
> >> +0 > S. 0:0(0) ack 1 <mss 1460, sackOK, TS val 330 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
> >> +0 < S. 0:0(0) ack 1 win 65535 <mss 1460, sackOK, TS val 700 ecr 100, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey=2]>
> >>
> >> +0 write(3, ..., 100) = 100
> >> +0 > . 1:1(0) ack 1 <nop, nop, TS val 845707014 ecr 700, nop, nop, sack 0:1>
> >> +0 > P. 1:101(100) ack 1 <nop, nop, TS val 845958933 ecr 700>
> >>
> >> Simultaneous SYN-data crossing is also not supported by TFO, see [6].
> >>
> >> Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/9936227696 [1]
> >> Link: https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens [2]
> >> Link: https://github.com/multipath-tcp/packetdrill/blob/mptcp-net-next/gtests/net/mptcp/syscalls/accept.pkt#L28 [3]
> >> Link: https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh [4]
> >> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/server/basic-cookie-not-reqd.pkt#L21 [5]
> >> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/client/simultaneous-fast-open.pkt [6]
> >> Fixes: 23e89e8ee7be ("tcp: Don't drop SYN+ACK for simultaneous connect().")
> >> Suggested-by: Paolo Abeni <pabeni@redhat.com>
> >> Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> >> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
> >> ---
> >> Notes:
> >> - We could also drop this 'goto consume', and send the unnecessary ACK
> >> in this simultaneous connect case, which doesn't seem to be a "real"
> >> case, more something for fuzzers. But that's not what the RFC 9293
> >> recommends to do.
> >> - v2:
> >> - Check if the SYN bit is set instead of looking for TFO and MPTCP
> >> specific attributes, as suggested by Kuniyuki.
> >> - Updated the comment above
> >> - Please note that the v2 has been sent mainly to satisfy the CI (to
> >> be able to catch new bugs with MPTCP), and because the suggestion
> >> from Kuniyuki looks better. It has not been sent to urge TCP
> >> maintainers to review it quicker than it should, please take your
> >> time and enjoy netdev.conf :)
> >> ---
> >> net/ipv4/tcp_input.c | 7 ++++++-
> >> 1 file changed, 6 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> >> index ff9ab3d01ced..bfe1bc69dc3e 100644
> >> --- a/net/ipv4/tcp_input.c
> >> +++ b/net/ipv4/tcp_input.c
> >> @@ -6820,7 +6820,12 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
> >> if (sk->sk_shutdown & SEND_SHUTDOWN)
> >> tcp_shutdown(sk, SEND_SHUTDOWN);
> >>
> >> - if (sk->sk_socket)
> >> + /* For crossed SYN cases, not to send an unnecessary ACK.
> >> + * Note that sk->sk_socket can be assigned in other cases, e.g.
> >> + * with TFO (if accept()'ed before the 3rd ACK) and MPTCP (MPJ:
> >> + * sk_socket is the parent MPTCP sock).
> >> + */
> >> + if (sk->sk_socket && th->syn)
> >> goto consume;
> >
> > I think we should simply remove this part completely, because we
> > should send an ack anyway.
>
> Thank you for having looked, and ran the full packetdrill test suite!
>
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index ff9ab3d01ced89570903d3a9f649a637c5e07a90..91357d4713182078debd746a224046cba80ea3ce
> > 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -6820,8 +6820,6 @@ tcp_rcv_state_process(struct sock *sk, struct
> > sk_buff *skb)
> > if (sk->sk_shutdown & SEND_SHUTDOWN)
> > tcp_shutdown(sk, SEND_SHUTDOWN);
> >
> > - if (sk->sk_socket)
> > - goto consume;
> > break;
> >
> > case TCP_FIN_WAIT1: {
> >
> >
> > I have a failing packetdrill test after Kuniyuki patch :
> >
> >
> >
> > //
> > // Test the simultaneous open scenario that both end sends
> > // SYN/data. Although we don't support that the connection should
> > // still be established.
> > //
> > `../../common/defaults.sh
> > ../../common/set_sysctls.py /proc/sys/net/ipv4/tcp_timestamps=0`
> >
> > // Cache warmup: send a Fast Open cookie request
> > 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> > +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> > +0 sendto(3, ..., 0, MSG_FASTOPEN, ..., ...) = -1 EINPROGRESS
> > (Operation is now in progress)
> > +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop>
> > +.01 < S. 123:123(0) ack 1 win 14600 <mss
> > 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop>
> > +0 > . 1:1(0) ack 1
> > +.01 close(3) = 0
> > +0 > F. 1:1(0) ack 1
> > +.01 < F. 1:1(0) ack 2 win 92
> > +0 > . 2:2(0) ack 2
> >
> >
> > //
> > // Test: simulatenous fast open
> > //
> > +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
> > +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> > +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
> > +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO
> > abcd1234,nop,nop>
> > // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN
> > +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale
> > 6,FO 87654321,nop,nop>
> > +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8>
> >
> > // SYN data is never retried.
> > +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
> > 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
> > +0 > . 1001:1001(0) ack 1
>
> I recently sent a PR -- already applied -- to Neal to remove this line:
>
> https://github.com/google/packetdrill/pull/86
>
> I thought it was the intension of Kuniyuki's patch not to send this ACK
> in this case to follow the RFC 9293's recommendation. This TFO test
> looks a bit similar to the example from Kuniyuki's patch:
>
>
> --------------- 8< ---------------
> 0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3
> +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
>
> +0 > S 0:0(0) <mss 1460,sackOK,TS val 1000 ecr 0,nop,wscale 8>
> +0 < S 0:0(0) win 1000 <mss 1000>
> +0 > S. 0:0(0) ack 1 <mss 1460,sackOK,TS val 3308134035 ecr 0,nop,wscale 8>
> +0 < S. 0:0(0) ack 1 win 1000
>
> /* No ACK here */
>
> +0 write(3, ..., 100) = 100
> +0 > P. 1:101(100) ack 1
> --------------- 8< ---------------
>
>
>
> But maybe here that should be different for TFO?
>
> For my case with MPTCP (and TFO), it is fine to drop this 'goto consume'
> but I don't know how "strict" we want to be regarding the RFC and this
> marginal case.
Problem of this 'goto consume' is that we are not properly sending a
DUPACK in this case.
+.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
+0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
+0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
+0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO
abcd1234,nop,nop>
// Simul. SYN-data crossing: we don't support that yet so ack only remote ISN
+.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale
6,FO 87654321,nop,nop>
+0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8>
+.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
+0 > . 1001:1001(0) ack 1 <nop,nop,sack 0:1> // See here
Not sending a dupack seems wrong and could hurt.
I had reservations about this part, if you look back to the discussion.
This is why Kuniyuki added in his changelog :
Note that tcp_ack_snd_check() in tcp_rcv_state_process() is skipped not to
send an unnecessary ACK, but this could be a bit risky for net.git, so this
targets for net-next.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net v2 1/2] tcp: process the 3rd ACK with sk_socket for TFO/MPTCP
2024-07-23 15:38 ` Eric Dumazet
@ 2024-07-23 16:08 ` Matthieu Baerts
2024-07-23 16:42 ` Eric Dumazet
2024-07-23 21:27 ` Kuniyuki Iwashima
1 sibling, 1 reply; 17+ messages in thread
From: Matthieu Baerts @ 2024-07-23 16:08 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S. Miller, David Ahern, Jakub Kicinski, Paolo Abeni,
Kuniyuki Iwashima, netdev, mptcp, linux-kernel, Neal Cardwell
Hi Eric,
On 23/07/2024 17:38, Eric Dumazet wrote:
> On Tue, Jul 23, 2024 at 4:58 PM Matthieu Baerts <matttbe@kernel.org> wrote:
>>
>> Hi Eric,
>>
>> +cc Neal
>> -cc Jerry (NoSuchUser)
>>
>> On 23/07/2024 16:37, Eric Dumazet wrote:
>>> On Thu, Jul 18, 2024 at 12:34 PM Matthieu Baerts (NGI0)
>>> <matttbe@kernel.org> wrote:
>>>>
>>>> The 'Fixes' commit recently changed the behaviour of TCP by skipping the
>>>> processing of the 3rd ACK when a sk->sk_socket is set. The goal was to
>>>> skip tcp_ack_snd_check() in tcp_rcv_state_process() not to send an
>>>> unnecessary ACK in case of simultaneous connect(). Unfortunately, that
>>>> had an impact on TFO and MPTCP.
>>>>
>>>> I started to look at the impact on MPTCP, because the MPTCP CI found
>>>> some issues with the MPTCP Packetdrill tests [1]. Then Paolo suggested
>>>> me to look at the impact on TFO with "plain" TCP.
>>>>
>>>> For MPTCP, when receiving the 3rd ACK of a request adding a new path
>>>> (MP_JOIN), sk->sk_socket will be set, and point to the MPTCP sock that
>>>> has been created when the MPTCP connection got established before with
>>>> the first path. The newly added 'goto' will then skip the processing of
>>>> the segment text (step 7) and not go through tcp_data_queue() where the
>>>> MPTCP options are validated, and some actions are triggered, e.g.
>>>> sending the MPJ 4th ACK [2] as demonstrated by the new errors when
>>>> running a packetdrill test [3] establishing a second subflow.
>>>>
>>>> This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be
>>>> delayed. Still, we don't want to have this behaviour as it delays the
>>>> switch to the fully established mode, and invalid MPTCP options in this
>>>> 3rd ACK will not be caught any more. This modification also affects the
>>>> MPTCP + TFO feature as well, and being the reason why the selftests
>>>> started to be unstable the last few days [4].
>>>>
>>>> For TFO, the existing 'basic-cookie-not-reqd' test [5] was no longer
>>>> passing: if the 3rd ACK contains data, and the connection is accept()ed
>>>> before receiving them, these data would no longer be processed, and thus
>>>> not ACKed.
>>>>
>>>> One last thing about MPTCP, in case of simultaneous connect(), a
>>>> fallback to TCP will be done, which seems fine:
>>>>
>>>> `../common/defaults.sh`
>>>>
>>>> 0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_MPTCP) = 3
>>>> +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
>>>>
>>>> +0 > S 0:0(0) <mss 1460, sackOK, TS val 100 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
>>>> +0 < S 0:0(0) win 1000 <mss 1460, sackOK, TS val 407 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
>>>> +0 > S. 0:0(0) ack 1 <mss 1460, sackOK, TS val 330 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
>>>> +0 < S. 0:0(0) ack 1 win 65535 <mss 1460, sackOK, TS val 700 ecr 100, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey=2]>
>>>>
>>>> +0 write(3, ..., 100) = 100
>>>> +0 > . 1:1(0) ack 1 <nop, nop, TS val 845707014 ecr 700, nop, nop, sack 0:1>
>>>> +0 > P. 1:101(100) ack 1 <nop, nop, TS val 845958933 ecr 700>
>>>>
>>>> Simultaneous SYN-data crossing is also not supported by TFO, see [6].
>>>>
>>>> Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/9936227696 [1]
>>>> Link: https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens [2]
>>>> Link: https://github.com/multipath-tcp/packetdrill/blob/mptcp-net-next/gtests/net/mptcp/syscalls/accept.pkt#L28 [3]
>>>> Link: https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh [4]
>>>> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/server/basic-cookie-not-reqd.pkt#L21 [5]
>>>> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/client/simultaneous-fast-open.pkt [6]
>>>> Fixes: 23e89e8ee7be ("tcp: Don't drop SYN+ACK for simultaneous connect().")
>>>> Suggested-by: Paolo Abeni <pabeni@redhat.com>
>>>> Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
>>>> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
>>>> ---
>>>> Notes:
>>>> - We could also drop this 'goto consume', and send the unnecessary ACK
>>>> in this simultaneous connect case, which doesn't seem to be a "real"
>>>> case, more something for fuzzers. But that's not what the RFC 9293
>>>> recommends to do.
>>>> - v2:
>>>> - Check if the SYN bit is set instead of looking for TFO and MPTCP
>>>> specific attributes, as suggested by Kuniyuki.
>>>> - Updated the comment above
>>>> - Please note that the v2 has been sent mainly to satisfy the CI (to
>>>> be able to catch new bugs with MPTCP), and because the suggestion
>>>> from Kuniyuki looks better. It has not been sent to urge TCP
>>>> maintainers to review it quicker than it should, please take your
>>>> time and enjoy netdev.conf :)
>>>> ---
>>>> net/ipv4/tcp_input.c | 7 ++++++-
>>>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>>>> index ff9ab3d01ced..bfe1bc69dc3e 100644
>>>> --- a/net/ipv4/tcp_input.c
>>>> +++ b/net/ipv4/tcp_input.c
>>>> @@ -6820,7 +6820,12 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
>>>> if (sk->sk_shutdown & SEND_SHUTDOWN)
>>>> tcp_shutdown(sk, SEND_SHUTDOWN);
>>>>
>>>> - if (sk->sk_socket)
>>>> + /* For crossed SYN cases, not to send an unnecessary ACK.
>>>> + * Note that sk->sk_socket can be assigned in other cases, e.g.
>>>> + * with TFO (if accept()'ed before the 3rd ACK) and MPTCP (MPJ:
>>>> + * sk_socket is the parent MPTCP sock).
>>>> + */
>>>> + if (sk->sk_socket && th->syn)
>>>> goto consume;
>>>
>>> I think we should simply remove this part completely, because we
>>> should send an ack anyway.
>>
>> Thank you for having looked, and ran the full packetdrill test suite!
>>
>>>
>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>>> index ff9ab3d01ced89570903d3a9f649a637c5e07a90..91357d4713182078debd746a224046cba80ea3ce
>>> 100644
>>> --- a/net/ipv4/tcp_input.c
>>> +++ b/net/ipv4/tcp_input.c
>>> @@ -6820,8 +6820,6 @@ tcp_rcv_state_process(struct sock *sk, struct
>>> sk_buff *skb)
>>> if (sk->sk_shutdown & SEND_SHUTDOWN)
>>> tcp_shutdown(sk, SEND_SHUTDOWN);
>>>
>>> - if (sk->sk_socket)
>>> - goto consume;
>>> break;
>>>
>>> case TCP_FIN_WAIT1: {
>>>
>>>
>>> I have a failing packetdrill test after Kuniyuki patch :
>>>
>>>
>>>
>>> //
>>> // Test the simultaneous open scenario that both end sends
>>> // SYN/data. Although we don't support that the connection should
>>> // still be established.
>>> //
>>> `../../common/defaults.sh
>>> ../../common/set_sysctls.py /proc/sys/net/ipv4/tcp_timestamps=0`
>>>
>>> // Cache warmup: send a Fast Open cookie request
>>> 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
>>> +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
>>> +0 sendto(3, ..., 0, MSG_FASTOPEN, ..., ...) = -1 EINPROGRESS
>>> (Operation is now in progress)
>>> +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop>
>>> +.01 < S. 123:123(0) ack 1 win 14600 <mss
>>> 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop>
>>> +0 > . 1:1(0) ack 1
>>> +.01 close(3) = 0
>>> +0 > F. 1:1(0) ack 1
>>> +.01 < F. 1:1(0) ack 2 win 92
>>> +0 > . 2:2(0) ack 2
>>>
>>>
>>> //
>>> // Test: simulatenous fast open
>>> //
>>> +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
>>> +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
>>> +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
>>> +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO
>>> abcd1234,nop,nop>
>>> // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN
>>> +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale
>>> 6,FO 87654321,nop,nop>
>>> +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8>
>>>
>>> // SYN data is never retried.
>>> +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
>>> 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
>>> +0 > . 1001:1001(0) ack 1
>>
>> I recently sent a PR -- already applied -- to Neal to remove this line:
>>
>> https://github.com/google/packetdrill/pull/86
>>
>> I thought it was the intension of Kuniyuki's patch not to send this ACK
>> in this case to follow the RFC 9293's recommendation. This TFO test
>> looks a bit similar to the example from Kuniyuki's patch:
>>
>>
>> --------------- 8< ---------------
>> 0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3
>> +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
>>
>> +0 > S 0:0(0) <mss 1460,sackOK,TS val 1000 ecr 0,nop,wscale 8>
>> +0 < S 0:0(0) win 1000 <mss 1000>
>> +0 > S. 0:0(0) ack 1 <mss 1460,sackOK,TS val 3308134035 ecr 0,nop,wscale 8>
>> +0 < S. 0:0(0) ack 1 win 1000
>>
>> /* No ACK here */
>>
>> +0 write(3, ..., 100) = 100
>> +0 > P. 1:101(100) ack 1
>> --------------- 8< ---------------
>>
>>
>>
>> But maybe here that should be different for TFO?
>>
>> For my case with MPTCP (and TFO), it is fine to drop this 'goto consume'
>> but I don't know how "strict" we want to be regarding the RFC and this
>> marginal case.
>
> Problem of this 'goto consume' is that we are not properly sending a
> DUPACK in this case.
>
> +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
> +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
> +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO
> abcd1234,nop,nop>
> // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN
> +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale
> 6,FO 87654321,nop,nop>
> +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8>
>
> +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
> 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
> +0 > . 1001:1001(0) ack 1 <nop,nop,sack 0:1> // See here
I'm sorry, but is it normal to have 'ack 1' with 'sack 0:1' here?
> Not sending a dupack seems wrong and could hurt.
Indeed, I thought the RFC 9293 was not allowing that, but I didn't see
anything forbidding this DUPACK:
https://www.rfc-editor.org/rfc/rfc9293.html#section-3.5-7
> I had reservations about this part, if you look back to the discussion.
>
> This is why Kuniyuki added in his changelog :
>
> Note that tcp_ack_snd_check() in tcp_rcv_state_process() is skipped not to
> send an unnecessary ACK, but this could be a bit risky for net.git, so this
> targets for net-next.
I understand, I can send a v3 dropping this part, and not including
patch 2/2 for -net. I can also send a PR to Neal re-adding the ACK with
'sack' (if it is normal).
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net v2 1/2] tcp: process the 3rd ACK with sk_socket for TFO/MPTCP
2024-07-23 16:08 ` Matthieu Baerts
@ 2024-07-23 16:42 ` Eric Dumazet
2024-07-23 19:09 ` Matthieu Baerts
0 siblings, 1 reply; 17+ messages in thread
From: Eric Dumazet @ 2024-07-23 16:42 UTC (permalink / raw)
To: Matthieu Baerts
Cc: David S. Miller, David Ahern, Jakub Kicinski, Paolo Abeni,
Kuniyuki Iwashima, netdev, mptcp, linux-kernel, Neal Cardwell
On Tue, Jul 23, 2024 at 6:08 PM Matthieu Baerts <matttbe@kernel.org> wrote:
>
> Hi Eric,
>
> On 23/07/2024 17:38, Eric Dumazet wrote:
> > On Tue, Jul 23, 2024 at 4:58 PM Matthieu Baerts <matttbe@kernel.org> wrote:
> >>
> >> Hi Eric,
> >>
> >> +cc Neal
> >> -cc Jerry (NoSuchUser)
> >>
> >> On 23/07/2024 16:37, Eric Dumazet wrote:
> >>> On Thu, Jul 18, 2024 at 12:34 PM Matthieu Baerts (NGI0)
> >>> <matttbe@kernel.org> wrote:
> >>>>
> >>>> The 'Fixes' commit recently changed the behaviour of TCP by skipping the
> >>>> processing of the 3rd ACK when a sk->sk_socket is set. The goal was to
> >>>> skip tcp_ack_snd_check() in tcp_rcv_state_process() not to send an
> >>>> unnecessary ACK in case of simultaneous connect(). Unfortunately, that
> >>>> had an impact on TFO and MPTCP.
> >>>>
> >>>> I started to look at the impact on MPTCP, because the MPTCP CI found
> >>>> some issues with the MPTCP Packetdrill tests [1]. Then Paolo suggested
> >>>> me to look at the impact on TFO with "plain" TCP.
> >>>>
> >>>> For MPTCP, when receiving the 3rd ACK of a request adding a new path
> >>>> (MP_JOIN), sk->sk_socket will be set, and point to the MPTCP sock that
> >>>> has been created when the MPTCP connection got established before with
> >>>> the first path. The newly added 'goto' will then skip the processing of
> >>>> the segment text (step 7) and not go through tcp_data_queue() where the
> >>>> MPTCP options are validated, and some actions are triggered, e.g.
> >>>> sending the MPJ 4th ACK [2] as demonstrated by the new errors when
> >>>> running a packetdrill test [3] establishing a second subflow.
> >>>>
> >>>> This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be
> >>>> delayed. Still, we don't want to have this behaviour as it delays the
> >>>> switch to the fully established mode, and invalid MPTCP options in this
> >>>> 3rd ACK will not be caught any more. This modification also affects the
> >>>> MPTCP + TFO feature as well, and being the reason why the selftests
> >>>> started to be unstable the last few days [4].
> >>>>
> >>>> For TFO, the existing 'basic-cookie-not-reqd' test [5] was no longer
> >>>> passing: if the 3rd ACK contains data, and the connection is accept()ed
> >>>> before receiving them, these data would no longer be processed, and thus
> >>>> not ACKed.
> >>>>
> >>>> One last thing about MPTCP, in case of simultaneous connect(), a
> >>>> fallback to TCP will be done, which seems fine:
> >>>>
> >>>> `../common/defaults.sh`
> >>>>
> >>>> 0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_MPTCP) = 3
> >>>> +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
> >>>>
> >>>> +0 > S 0:0(0) <mss 1460, sackOK, TS val 100 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
> >>>> +0 < S 0:0(0) win 1000 <mss 1460, sackOK, TS val 407 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
> >>>> +0 > S. 0:0(0) ack 1 <mss 1460, sackOK, TS val 330 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
> >>>> +0 < S. 0:0(0) ack 1 win 65535 <mss 1460, sackOK, TS val 700 ecr 100, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey=2]>
> >>>>
> >>>> +0 write(3, ..., 100) = 100
> >>>> +0 > . 1:1(0) ack 1 <nop, nop, TS val 845707014 ecr 700, nop, nop, sack 0:1>
> >>>> +0 > P. 1:101(100) ack 1 <nop, nop, TS val 845958933 ecr 700>
> >>>>
> >>>> Simultaneous SYN-data crossing is also not supported by TFO, see [6].
> >>>>
> >>>> Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/9936227696 [1]
> >>>> Link: https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens [2]
> >>>> Link: https://github.com/multipath-tcp/packetdrill/blob/mptcp-net-next/gtests/net/mptcp/syscalls/accept.pkt#L28 [3]
> >>>> Link: https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh [4]
> >>>> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/server/basic-cookie-not-reqd.pkt#L21 [5]
> >>>> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/client/simultaneous-fast-open.pkt [6]
> >>>> Fixes: 23e89e8ee7be ("tcp: Don't drop SYN+ACK for simultaneous connect().")
> >>>> Suggested-by: Paolo Abeni <pabeni@redhat.com>
> >>>> Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> >>>> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
> >>>> ---
> >>>> Notes:
> >>>> - We could also drop this 'goto consume', and send the unnecessary ACK
> >>>> in this simultaneous connect case, which doesn't seem to be a "real"
> >>>> case, more something for fuzzers. But that's not what the RFC 9293
> >>>> recommends to do.
> >>>> - v2:
> >>>> - Check if the SYN bit is set instead of looking for TFO and MPTCP
> >>>> specific attributes, as suggested by Kuniyuki.
> >>>> - Updated the comment above
> >>>> - Please note that the v2 has been sent mainly to satisfy the CI (to
> >>>> be able to catch new bugs with MPTCP), and because the suggestion
> >>>> from Kuniyuki looks better. It has not been sent to urge TCP
> >>>> maintainers to review it quicker than it should, please take your
> >>>> time and enjoy netdev.conf :)
> >>>> ---
> >>>> net/ipv4/tcp_input.c | 7 ++++++-
> >>>> 1 file changed, 6 insertions(+), 1 deletion(-)
> >>>>
> >>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> >>>> index ff9ab3d01ced..bfe1bc69dc3e 100644
> >>>> --- a/net/ipv4/tcp_input.c
> >>>> +++ b/net/ipv4/tcp_input.c
> >>>> @@ -6820,7 +6820,12 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
> >>>> if (sk->sk_shutdown & SEND_SHUTDOWN)
> >>>> tcp_shutdown(sk, SEND_SHUTDOWN);
> >>>>
> >>>> - if (sk->sk_socket)
> >>>> + /* For crossed SYN cases, not to send an unnecessary ACK.
> >>>> + * Note that sk->sk_socket can be assigned in other cases, e.g.
> >>>> + * with TFO (if accept()'ed before the 3rd ACK) and MPTCP (MPJ:
> >>>> + * sk_socket is the parent MPTCP sock).
> >>>> + */
> >>>> + if (sk->sk_socket && th->syn)
> >>>> goto consume;
> >>>
> >>> I think we should simply remove this part completely, because we
> >>> should send an ack anyway.
> >>
> >> Thank you for having looked, and ran the full packetdrill test suite!
> >>
> >>>
> >>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> >>> index ff9ab3d01ced89570903d3a9f649a637c5e07a90..91357d4713182078debd746a224046cba80ea3ce
> >>> 100644
> >>> --- a/net/ipv4/tcp_input.c
> >>> +++ b/net/ipv4/tcp_input.c
> >>> @@ -6820,8 +6820,6 @@ tcp_rcv_state_process(struct sock *sk, struct
> >>> sk_buff *skb)
> >>> if (sk->sk_shutdown & SEND_SHUTDOWN)
> >>> tcp_shutdown(sk, SEND_SHUTDOWN);
> >>>
> >>> - if (sk->sk_socket)
> >>> - goto consume;
> >>> break;
> >>>
> >>> case TCP_FIN_WAIT1: {
> >>>
> >>>
> >>> I have a failing packetdrill test after Kuniyuki patch :
> >>>
> >>>
> >>>
> >>> //
> >>> // Test the simultaneous open scenario that both end sends
> >>> // SYN/data. Although we don't support that the connection should
> >>> // still be established.
> >>> //
> >>> `../../common/defaults.sh
> >>> ../../common/set_sysctls.py /proc/sys/net/ipv4/tcp_timestamps=0`
> >>>
> >>> // Cache warmup: send a Fast Open cookie request
> >>> 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> >>> +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> >>> +0 sendto(3, ..., 0, MSG_FASTOPEN, ..., ...) = -1 EINPROGRESS
> >>> (Operation is now in progress)
> >>> +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop>
> >>> +.01 < S. 123:123(0) ack 1 win 14600 <mss
> >>> 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop>
> >>> +0 > . 1:1(0) ack 1
> >>> +.01 close(3) = 0
> >>> +0 > F. 1:1(0) ack 1
> >>> +.01 < F. 1:1(0) ack 2 win 92
> >>> +0 > . 2:2(0) ack 2
> >>>
> >>>
> >>> //
> >>> // Test: simulatenous fast open
> >>> //
> >>> +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
> >>> +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> >>> +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
> >>> +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO
> >>> abcd1234,nop,nop>
> >>> // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN
> >>> +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale
> >>> 6,FO 87654321,nop,nop>
> >>> +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8>
> >>>
> >>> // SYN data is never retried.
> >>> +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
> >>> 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
> >>> +0 > . 1001:1001(0) ack 1
> >>
> >> I recently sent a PR -- already applied -- to Neal to remove this line:
> >>
> >> https://github.com/google/packetdrill/pull/86
> >>
> >> I thought it was the intension of Kuniyuki's patch not to send this ACK
> >> in this case to follow the RFC 9293's recommendation. This TFO test
> >> looks a bit similar to the example from Kuniyuki's patch:
> >>
> >>
> >> --------------- 8< ---------------
> >> 0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3
> >> +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
> >>
> >> +0 > S 0:0(0) <mss 1460,sackOK,TS val 1000 ecr 0,nop,wscale 8>
> >> +0 < S 0:0(0) win 1000 <mss 1000>
> >> +0 > S. 0:0(0) ack 1 <mss 1460,sackOK,TS val 3308134035 ecr 0,nop,wscale 8>
> >> +0 < S. 0:0(0) ack 1 win 1000
> >>
> >> /* No ACK here */
> >>
> >> +0 write(3, ..., 100) = 100
> >> +0 > P. 1:101(100) ack 1
> >> --------------- 8< ---------------
> >>
> >>
> >>
> >> But maybe here that should be different for TFO?
> >>
> >> For my case with MPTCP (and TFO), it is fine to drop this 'goto consume'
> >> but I don't know how "strict" we want to be regarding the RFC and this
> >> marginal case.
> >
> > Problem of this 'goto consume' is that we are not properly sending a
> > DUPACK in this case.
> >
> > +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
> > +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> > +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
> > +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO
> > abcd1234,nop,nop>
> > // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN
> > +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale
> > 6,FO 87654321,nop,nop>
> > +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8>
> >
> > +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
> > 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
> > +0 > . 1001:1001(0) ack 1 <nop,nop,sack 0:1> // See here
>
> I'm sorry, but is it normal to have 'ack 1' with 'sack 0:1' here?
It is normal, because the SYN was already received/processed.
sack 0:1 represents this SYN sequence.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net v2 1/2] tcp: process the 3rd ACK with sk_socket for TFO/MPTCP
2024-07-23 16:42 ` Eric Dumazet
@ 2024-07-23 19:09 ` Matthieu Baerts
2024-07-23 21:41 ` Kuniyuki Iwashima
2024-07-23 22:01 ` Neal Cardwell
0 siblings, 2 replies; 17+ messages in thread
From: Matthieu Baerts @ 2024-07-23 19:09 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S. Miller, David Ahern, Jakub Kicinski, Paolo Abeni,
Kuniyuki Iwashima, netdev, mptcp, linux-kernel, Neal Cardwell
Hi Eric,
On 23/07/2024 18:42, Eric Dumazet wrote:
> On Tue, Jul 23, 2024 at 6:08 PM Matthieu Baerts <matttbe@kernel.org> wrote:
>>
>> Hi Eric,
>>
>> On 23/07/2024 17:38, Eric Dumazet wrote:
>>> On Tue, Jul 23, 2024 at 4:58 PM Matthieu Baerts <matttbe@kernel.org> wrote:
>>>>
>>>> Hi Eric,
>>>>
>>>> +cc Neal
>>>> -cc Jerry (NoSuchUser)
>>>>
>>>> On 23/07/2024 16:37, Eric Dumazet wrote:
>>>>> On Thu, Jul 18, 2024 at 12:34 PM Matthieu Baerts (NGI0)
>>>>> <matttbe@kernel.org> wrote:
>>>>>>
>>>>>> The 'Fixes' commit recently changed the behaviour of TCP by skipping the
>>>>>> processing of the 3rd ACK when a sk->sk_socket is set. The goal was to
>>>>>> skip tcp_ack_snd_check() in tcp_rcv_state_process() not to send an
>>>>>> unnecessary ACK in case of simultaneous connect(). Unfortunately, that
>>>>>> had an impact on TFO and MPTCP.
>>>>>>
>>>>>> I started to look at the impact on MPTCP, because the MPTCP CI found
>>>>>> some issues with the MPTCP Packetdrill tests [1]. Then Paolo suggested
>>>>>> me to look at the impact on TFO with "plain" TCP.
>>>>>>
>>>>>> For MPTCP, when receiving the 3rd ACK of a request adding a new path
>>>>>> (MP_JOIN), sk->sk_socket will be set, and point to the MPTCP sock that
>>>>>> has been created when the MPTCP connection got established before with
>>>>>> the first path. The newly added 'goto' will then skip the processing of
>>>>>> the segment text (step 7) and not go through tcp_data_queue() where the
>>>>>> MPTCP options are validated, and some actions are triggered, e.g.
>>>>>> sending the MPJ 4th ACK [2] as demonstrated by the new errors when
>>>>>> running a packetdrill test [3] establishing a second subflow.
>>>>>>
>>>>>> This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be
>>>>>> delayed. Still, we don't want to have this behaviour as it delays the
>>>>>> switch to the fully established mode, and invalid MPTCP options in this
>>>>>> 3rd ACK will not be caught any more. This modification also affects the
>>>>>> MPTCP + TFO feature as well, and being the reason why the selftests
>>>>>> started to be unstable the last few days [4].
>>>>>>
>>>>>> For TFO, the existing 'basic-cookie-not-reqd' test [5] was no longer
>>>>>> passing: if the 3rd ACK contains data, and the connection is accept()ed
>>>>>> before receiving them, these data would no longer be processed, and thus
>>>>>> not ACKed.
>>>>>>
>>>>>> One last thing about MPTCP, in case of simultaneous connect(), a
>>>>>> fallback to TCP will be done, which seems fine:
>>>>>>
>>>>>> `../common/defaults.sh`
>>>>>>
>>>>>> 0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_MPTCP) = 3
>>>>>> +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
>>>>>>
>>>>>> +0 > S 0:0(0) <mss 1460, sackOK, TS val 100 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
>>>>>> +0 < S 0:0(0) win 1000 <mss 1460, sackOK, TS val 407 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
>>>>>> +0 > S. 0:0(0) ack 1 <mss 1460, sackOK, TS val 330 ecr 0, nop, wscale 8, mpcapable v1 flags[flag_h] nokey>
>>>>>> +0 < S. 0:0(0) ack 1 win 65535 <mss 1460, sackOK, TS val 700 ecr 100, nop, wscale 8, mpcapable v1 flags[flag_h] key[skey=2]>
>>>>>>
>>>>>> +0 write(3, ..., 100) = 100
>>>>>> +0 > . 1:1(0) ack 1 <nop, nop, TS val 845707014 ecr 700, nop, nop, sack 0:1>
>>>>>> +0 > P. 1:101(100) ack 1 <nop, nop, TS val 845958933 ecr 700>
>>>>>>
>>>>>> Simultaneous SYN-data crossing is also not supported by TFO, see [6].
>>>>>>
>>>>>> Link: https://github.com/multipath-tcp/mptcp_net-next/actions/runs/9936227696 [1]
>>>>>> Link: https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens [2]
>>>>>> Link: https://github.com/multipath-tcp/packetdrill/blob/mptcp-net-next/gtests/net/mptcp/syscalls/accept.pkt#L28 [3]
>>>>>> Link: https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh [4]
>>>>>> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/server/basic-cookie-not-reqd.pkt#L21 [5]
>>>>>> Link: https://github.com/google/packetdrill/blob/master/gtests/net/tcp/fastopen/client/simultaneous-fast-open.pkt [6]
>>>>>> Fixes: 23e89e8ee7be ("tcp: Don't drop SYN+ACK for simultaneous connect().")
>>>>>> Suggested-by: Paolo Abeni <pabeni@redhat.com>
>>>>>> Suggested-by: Kuniyuki Iwashima <kuniyu@amazon.com>
>>>>>> Signed-off-by: Matthieu Baerts (NGI0) <matttbe@kernel.org>
>>>>>> ---
>>>>>> Notes:
>>>>>> - We could also drop this 'goto consume', and send the unnecessary ACK
>>>>>> in this simultaneous connect case, which doesn't seem to be a "real"
>>>>>> case, more something for fuzzers. But that's not what the RFC 9293
>>>>>> recommends to do.
>>>>>> - v2:
>>>>>> - Check if the SYN bit is set instead of looking for TFO and MPTCP
>>>>>> specific attributes, as suggested by Kuniyuki.
>>>>>> - Updated the comment above
>>>>>> - Please note that the v2 has been sent mainly to satisfy the CI (to
>>>>>> be able to catch new bugs with MPTCP), and because the suggestion
>>>>>> from Kuniyuki looks better. It has not been sent to urge TCP
>>>>>> maintainers to review it quicker than it should, please take your
>>>>>> time and enjoy netdev.conf :)
>>>>>> ---
>>>>>> net/ipv4/tcp_input.c | 7 ++++++-
>>>>>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>>>>>> index ff9ab3d01ced..bfe1bc69dc3e 100644
>>>>>> --- a/net/ipv4/tcp_input.c
>>>>>> +++ b/net/ipv4/tcp_input.c
>>>>>> @@ -6820,7 +6820,12 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
>>>>>> if (sk->sk_shutdown & SEND_SHUTDOWN)
>>>>>> tcp_shutdown(sk, SEND_SHUTDOWN);
>>>>>>
>>>>>> - if (sk->sk_socket)
>>>>>> + /* For crossed SYN cases, not to send an unnecessary ACK.
>>>>>> + * Note that sk->sk_socket can be assigned in other cases, e.g.
>>>>>> + * with TFO (if accept()'ed before the 3rd ACK) and MPTCP (MPJ:
>>>>>> + * sk_socket is the parent MPTCP sock).
>>>>>> + */
>>>>>> + if (sk->sk_socket && th->syn)
>>>>>> goto consume;
>>>>>
>>>>> I think we should simply remove this part completely, because we
>>>>> should send an ack anyway.
>>>>
>>>> Thank you for having looked, and ran the full packetdrill test suite!
>>>>
>>>>>
>>>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>>>>> index ff9ab3d01ced89570903d3a9f649a637c5e07a90..91357d4713182078debd746a224046cba80ea3ce
>>>>> 100644
>>>>> --- a/net/ipv4/tcp_input.c
>>>>> +++ b/net/ipv4/tcp_input.c
>>>>> @@ -6820,8 +6820,6 @@ tcp_rcv_state_process(struct sock *sk, struct
>>>>> sk_buff *skb)
>>>>> if (sk->sk_shutdown & SEND_SHUTDOWN)
>>>>> tcp_shutdown(sk, SEND_SHUTDOWN);
>>>>>
>>>>> - if (sk->sk_socket)
>>>>> - goto consume;
>>>>> break;
>>>>>
>>>>> case TCP_FIN_WAIT1: {
>>>>>
>>>>>
>>>>> I have a failing packetdrill test after Kuniyuki patch :
>>>>>
>>>>>
>>>>>
>>>>> //
>>>>> // Test the simultaneous open scenario that both end sends
>>>>> // SYN/data. Although we don't support that the connection should
>>>>> // still be established.
>>>>> //
>>>>> `../../common/defaults.sh
>>>>> ../../common/set_sysctls.py /proc/sys/net/ipv4/tcp_timestamps=0`
>>>>>
>>>>> // Cache warmup: send a Fast Open cookie request
>>>>> 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
>>>>> +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
>>>>> +0 sendto(3, ..., 0, MSG_FASTOPEN, ..., ...) = -1 EINPROGRESS
>>>>> (Operation is now in progress)
>>>>> +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop>
>>>>> +.01 < S. 123:123(0) ack 1 win 14600 <mss
>>>>> 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop>
>>>>> +0 > . 1:1(0) ack 1
>>>>> +.01 close(3) = 0
>>>>> +0 > F. 1:1(0) ack 1
>>>>> +.01 < F. 1:1(0) ack 2 win 92
>>>>> +0 > . 2:2(0) ack 2
>>>>>
>>>>>
>>>>> //
>>>>> // Test: simulatenous fast open
>>>>> //
>>>>> +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
>>>>> +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
>>>>> +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
>>>>> +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO
>>>>> abcd1234,nop,nop>
>>>>> // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN
>>>>> +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale
>>>>> 6,FO 87654321,nop,nop>
>>>>> +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8>
>>>>>
>>>>> // SYN data is never retried.
>>>>> +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
>>>>> 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
>>>>> +0 > . 1001:1001(0) ack 1
>>>>
>>>> I recently sent a PR -- already applied -- to Neal to remove this line:
>>>>
>>>> https://github.com/google/packetdrill/pull/86
>>>>
>>>> I thought it was the intension of Kuniyuki's patch not to send this ACK
>>>> in this case to follow the RFC 9293's recommendation. This TFO test
>>>> looks a bit similar to the example from Kuniyuki's patch:
>>>>
>>>>
>>>> --------------- 8< ---------------
>>>> 0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3
>>>> +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
>>>>
>>>> +0 > S 0:0(0) <mss 1460,sackOK,TS val 1000 ecr 0,nop,wscale 8>
>>>> +0 < S 0:0(0) win 1000 <mss 1000>
>>>> +0 > S. 0:0(0) ack 1 <mss 1460,sackOK,TS val 3308134035 ecr 0,nop,wscale 8>
>>>> +0 < S. 0:0(0) ack 1 win 1000
>>>>
>>>> /* No ACK here */
>>>>
>>>> +0 write(3, ..., 100) = 100
>>>> +0 > P. 1:101(100) ack 1
>>>> --------------- 8< ---------------
>>>>
>>>>
>>>>
>>>> But maybe here that should be different for TFO?
>>>>
>>>> For my case with MPTCP (and TFO), it is fine to drop this 'goto consume'
>>>> but I don't know how "strict" we want to be regarding the RFC and this
>>>> marginal case.
>>>
>>> Problem of this 'goto consume' is that we are not properly sending a
>>> DUPACK in this case.
>>>
>>> +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
>>> +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
>>> +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
>>> +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO
>>> abcd1234,nop,nop>
>>> // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN
>>> +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale
>>> 6,FO 87654321,nop,nop>
>>> +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8>
>>>
>>> +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
>>> 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
>>> +0 > . 1001:1001(0) ack 1 <nop,nop,sack 0:1> // See here
>>
>> I'm sorry, but is it normal to have 'ack 1' with 'sack 0:1' here?
>
> It is normal, because the SYN was already received/processed.
>
> sack 0:1 represents this SYN sequence.
Thank you for your reply!
Maybe it is just me, but does it not look strange to have the SACK
covering a segment (0:1) that is before the ACK (1)?
'ack 1' and 'sack 0:1' seem to cover the same block, no?
Before Kuniyuki's patch, this 'sack 0:1' was not present.
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net v2 1/2] tcp: process the 3rd ACK with sk_socket for TFO/MPTCP
2024-07-23 15:38 ` Eric Dumazet
2024-07-23 16:08 ` Matthieu Baerts
@ 2024-07-23 21:27 ` Kuniyuki Iwashima
1 sibling, 0 replies; 17+ messages in thread
From: Kuniyuki Iwashima @ 2024-07-23 21:27 UTC (permalink / raw)
To: edumazet
Cc: davem, dsahern, kuba, kuniyu, linux-kernel, matttbe, mptcp,
ncardwell, netdev, pabeni
From: Eric Dumazet <edumazet@google.com>
Date: Tue, 23 Jul 2024 17:38:27 +0200
[...]
> > > //
> > > // Test the simultaneous open scenario that both end sends
> > > // SYN/data. Although we don't support that the connection should
> > > // still be established.
> > > //
> > > `../../common/defaults.sh
> > > ../../common/set_sysctls.py /proc/sys/net/ipv4/tcp_timestamps=0`
> > >
> > > // Cache warmup: send a Fast Open cookie request
> > > 0 socket(..., SOCK_STREAM, IPPROTO_TCP) = 3
> > > +0 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> > > +0 sendto(3, ..., 0, MSG_FASTOPEN, ..., ...) = -1 EINPROGRESS
> > > (Operation is now in progress)
> > > +0 > S 0:0(0) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO,nop,nop>
> > > +.01 < S. 123:123(0) ack 1 win 14600 <mss
> > > 1460,nop,nop,sackOK,nop,wscale 6,FO abcd1234,nop,nop>
> > > +0 > . 1:1(0) ack 1
> > > +.01 close(3) = 0
> > > +0 > F. 1:1(0) ack 1
> > > +.01 < F. 1:1(0) ack 2 win 92
> > > +0 > . 2:2(0) ack 2
> > >
> > >
> > > //
> > > // Test: simulatenous fast open
> > > //
> > > +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
> > > +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> > > +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
> > > +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO
> > > abcd1234,nop,nop>
> > > // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN
> > > +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale
> > > 6,FO 87654321,nop,nop>
> > > +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8>
> > >
> > > // SYN data is never retried.
> > > +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
> > > 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
> > > +0 > . 1001:1001(0) ack 1
> >
> > I recently sent a PR -- already applied -- to Neal to remove this line:
> >
> > https://github.com/google/packetdrill/pull/86
> >
> > I thought it was the intension of Kuniyuki's patch not to send this ACK
> > in this case to follow the RFC 9293's recommendation. This TFO test
> > looks a bit similar to the example from Kuniyuki's patch:
> >
> >
> > --------------- 8< ---------------
> > 0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3
> > +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
> >
> > +0 > S 0:0(0) <mss 1460,sackOK,TS val 1000 ecr 0,nop,wscale 8>
> > +0 < S 0:0(0) win 1000 <mss 1000>
> > +0 > S. 0:0(0) ack 1 <mss 1460,sackOK,TS val 3308134035 ecr 0,nop,wscale 8>
> > +0 < S. 0:0(0) ack 1 win 1000
> >
> > /* No ACK here */
> >
> > +0 write(3, ..., 100) = 100
> > +0 > P. 1:101(100) ack 1
> > --------------- 8< ---------------
> >
> >
> >
> > But maybe here that should be different for TFO?
> >
> > For my case with MPTCP (and TFO), it is fine to drop this 'goto consume'
> > but I don't know how "strict" we want to be regarding the RFC and this
> > marginal case.
>
> Problem of this 'goto consume' is that we are not properly sending a
> DUPACK in this case.
>
> +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
> +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
> +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO
> abcd1234,nop,nop>
> // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN
> +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale
> 6,FO 87654321,nop,nop>
> +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8>
>
> +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
> 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
> +0 > . 1001:1001(0) ack 1 <nop,nop,sack 0:1> // See here
>
> Not sending a dupack seems wrong and could hurt.
I think the situation where we should send ACK after simultaneous
SYN+ACK would be:
---8<---
+.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
+0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
+0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
+0 > S 0:1000(1000) <mss 1440,nop,nop,sackOK,nop,wscale 8,FO abcd1234,nop,nop>
// Simul. SYN-data crossing: we don't support that yet so ack only remote ISN
+.005 < S 1234:1734(500) win 14400 <mss 1040,nop,nop,sackOK,nop,wscale 6,FO 87654321,nop,nop>
+0 > S. 0:0(0) ack 1235 <mss 1440,nop,nop,sackOK,nop,wscale 8>
// SYN data is not ACKed too.
+.045 < S. 1234:1234(0) ack 1 win 14400 <mss 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
+0 > . 1:1001(1000) ack 1
---8<---
When the first data in SYN is not ACKed, it must be retransmitted,
and it can be done just after SYN+ACK is received, which is skipped
by 'goto consume'.
Retransmitting data in SYN is not supported though.
---8<---
sendto syscall: 1721769223.194675
outbound sniffed packet: 0.040288 S 833802090:833803090(1000) win 65535 <mss 1440,nop,nop,sackOK,nop,wscale 8,FO abcd1234,nop,nop>
inbound injected packet: 0.045323 S 1234:1734(500) win 14400 <mss 1040,nop,nop,sackOK,nop,wscale 6,FO 87654321,nop,nop>
outbound sniffed packet: 0.045355 S. 833802090:833802090(0) ack 1235 win 65535 <mss 1440,nop,nop,sackOK,nop,wscale 8>
inbound injected packet: 0.090429 S. 1234:1234(0) ack 833802091 win 14400 <mss 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
outbound sniffed packet: 0.090460 . 833803091:833803091(0) ack 1235 win 1052
outbound sniffed packet: 1.051776 S. 833802090:833802090(0) ack 1235 win 65535 <mss 1440,nop,nop,sackOK,nop,wscale 8>
simul2.pkt:38: error handling packet: live packet field tcp_data_offset: expected: 5 (0x5) vs actual: 8 (0x8)
script packet: 0.090462 . 1:1001(1000) ack 1
actual #0 packet: 0.090460 . 1001:1001(0) ack 1 win 1052
actual #1 packet: 1.051776 S. 0:0(0) ack 1235 win 65535 <mss 1440,nop,nop,sackOK,nop,wscale 8>
---8<---
Except the case, I think the dupack is not needed in theory.
But I understand the dupack could help the other quickly retransmit the
not-yet-ACKed data in SYN instead of waiting for a timer as expected in
the comment.
---8<---
// The other end retries
+.1 < P. 1:501(500) ack 1000 win 257
---8<---
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net v2 1/2] tcp: process the 3rd ACK with sk_socket for TFO/MPTCP
2024-07-23 19:09 ` Matthieu Baerts
@ 2024-07-23 21:41 ` Kuniyuki Iwashima
2024-07-23 22:01 ` Neal Cardwell
1 sibling, 0 replies; 17+ messages in thread
From: Kuniyuki Iwashima @ 2024-07-23 21:41 UTC (permalink / raw)
To: matttbe
Cc: davem, dsahern, edumazet, kuba, kuniyu, linux-kernel, mptcp,
ncardwell, netdev, pabeni
From: Matthieu Baerts <matttbe@kernel.org>
Date: Tue, 23 Jul 2024 21:09:40 +0200
[...]
> >>> Problem of this 'goto consume' is that we are not properly sending a
> >>> DUPACK in this case.
> >>>
> >>> +.01 socket(..., SOCK_STREAM, IPPROTO_TCP) = 4
> >>> +0 fcntl(4, F_SETFL, O_RDWR|O_NONBLOCK) = 0
> >>> +0 sendto(4, ..., 1000, MSG_FASTOPEN, ..., ...) = 1000
> >>> +0 > S 0:1000(1000) <mss 1460,nop,nop,sackOK,nop,wscale 8,FO
> >>> abcd1234,nop,nop>
> >>> // Simul. SYN-data crossing: we don't support that yet so ack only remote ISN
> >>> +.005 < S 1234:1734(500) win 14600 <mss 1040,nop,nop,sackOK,nop,wscale
> >>> 6,FO 87654321,nop,nop>
> >>> +0 > S. 0:0(0) ack 1235 <mss 1460,nop,nop,sackOK,nop,wscale 8>
> >>>
> >>> +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
> >>> 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
> >>> +0 > . 1001:1001(0) ack 1 <nop,nop,sack 0:1> // See here
> >>
> >> I'm sorry, but is it normal to have 'ack 1' with 'sack 0:1' here?
> >
> > It is normal, because the SYN was already received/processed.
> >
> > sack 0:1 represents this SYN sequence.
>
> Thank you for your reply!
>
> Maybe it is just me, but does it not look strange to have the SACK
> covering a segment (0:1) that is before the ACK (1)?
>
> 'ack 1' and 'sack 0:1' seem to cover the same block, no?
> Before Kuniyuki's patch, this 'sack 0:1' was not present.
This looks a bit strange to me too, but I guess this is also not
forbidden ?
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net v2 1/2] tcp: process the 3rd ACK with sk_socket for TFO/MPTCP
2024-07-23 19:09 ` Matthieu Baerts
2024-07-23 21:41 ` Kuniyuki Iwashima
@ 2024-07-23 22:01 ` Neal Cardwell
2024-07-24 8:15 ` Matthieu Baerts
1 sibling, 1 reply; 17+ messages in thread
From: Neal Cardwell @ 2024-07-23 22:01 UTC (permalink / raw)
To: Matthieu Baerts
Cc: Eric Dumazet, David S. Miller, David Ahern, Jakub Kicinski,
Paolo Abeni, Kuniyuki Iwashima, netdev, mptcp, linux-kernel
On Tue, Jul 23, 2024 at 3:09 PM Matthieu Baerts <matttbe@kernel.org> wrote:
>
> Hi Eric,
>
> On 23/07/2024 18:42, Eric Dumazet wrote:
> > On Tue, Jul 23, 2024 at 6:08 PM Matthieu Baerts <matttbe@kernel.org> wrote:
> >>
> >> Hi Eric,
> >>
> >> On 23/07/2024 17:38, Eric Dumazet wrote:
> >>> On Tue, Jul 23, 2024 at 4:58 PM Matthieu Baerts <matttbe@kernel.org> wrote:
> >>>>
> >>>> Hi Eric,
> >>>>
> >>>> +cc Neal
> >>>> -cc Jerry (NoSuchUser)
> >>>>
> >>>> On 23/07/2024 16:37, Eric Dumazet wrote:
> >>>>> On Thu, Jul 18, 2024 at 12:34 PM Matthieu Baerts (NGI0)
> >>>>> <matttbe@kernel.org> wrote:
> >>>>>>
...
> >>> +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
> >>> 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
> >>> +0 > . 1001:1001(0) ack 1 <nop,nop,sack 0:1> // See here
> >>
> >> I'm sorry, but is it normal to have 'ack 1' with 'sack 0:1' here?
> >
> > It is normal, because the SYN was already received/processed.
> >
> > sack 0:1 represents this SYN sequence.
>
> Thank you for your reply!
>
> Maybe it is just me, but does it not look strange to have the SACK
> covering a segment (0:1) that is before the ACK (1)?
>
> 'ack 1' and 'sack 0:1' seem to cover the same block, no?
> Before Kuniyuki's patch, this 'sack 0:1' was not present.
A SACK that covers a sequence range that was already cumulatively or
selectively acknowledged is legal and useful, and is called a
Duplicate Selective Acknowledgement (DSACK or D-SACK).
A DSACK indicates that a receiver received duplicate data. That can be
very useful in allowing a data sender to determine that a
retransmission was not needed (spurious). If a data sender receives
DSACKs for all retransmitted data in a loss detection episode then it
knows all retransmissions were spurious, and thus it can "undo" its
congestion control reaction to that estimated loss, since the
congestion control algorithm was responding to an incorrect loss
signal. This can be very helpful for performance in the presence of
varying delays or reordering, which can cause spurious loss detection
episodes..
See:
https://datatracker.ietf.org/doc/html/rfc2883
An Extension to the Selective Acknowledgement (SACK) Option for TCP
https://www.rfc-editor.org/rfc/rfc3708.html
"Using TCP Duplicate Selective Acknowledgement (DSACKs) and Stream
Control Transmission Protocol (SCTP) Duplicate Transmission Sequence
Numbers (TSNs) to Detect Spurious Retransmissions"
neal
^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH net v2 1/2] tcp: process the 3rd ACK with sk_socket for TFO/MPTCP
2024-07-23 22:01 ` Neal Cardwell
@ 2024-07-24 8:15 ` Matthieu Baerts
0 siblings, 0 replies; 17+ messages in thread
From: Matthieu Baerts @ 2024-07-24 8:15 UTC (permalink / raw)
To: Neal Cardwell
Cc: Eric Dumazet, David S. Miller, David Ahern, Jakub Kicinski,
Paolo Abeni, Kuniyuki Iwashima, netdev, mptcp, linux-kernel
Hi Neal,
On 24/07/2024 00:01, Neal Cardwell wrote:
> On Tue, Jul 23, 2024 at 3:09 PM Matthieu Baerts <matttbe@kernel.org> wrote:
>>
>> Hi Eric,
>>
>> On 23/07/2024 18:42, Eric Dumazet wrote:
>>> On Tue, Jul 23, 2024 at 6:08 PM Matthieu Baerts <matttbe@kernel.org> wrote:
>>>>
>>>> Hi Eric,
>>>>
>>>> On 23/07/2024 17:38, Eric Dumazet wrote:
>>>>> On Tue, Jul 23, 2024 at 4:58 PM Matthieu Baerts <matttbe@kernel.org> wrote:
>>>>>>
>>>>>> Hi Eric,
>>>>>>
>>>>>> +cc Neal
>>>>>> -cc Jerry (NoSuchUser)
>>>>>>
>>>>>> On 23/07/2024 16:37, Eric Dumazet wrote:
>>>>>>> On Thu, Jul 18, 2024 at 12:34 PM Matthieu Baerts (NGI0)
>>>>>>> <matttbe@kernel.org> wrote:
>>>>>>>>
> ...
>>>>> +.045 < S. 1234:1234(0) ack 1001 win 14600 <mss
>>>>> 940,nop,nop,sackOK,nop,wscale 6,FO 12345678,nop,nop>
>>>>> +0 > . 1001:1001(0) ack 1 <nop,nop,sack 0:1> // See here
>>>>
>>>> I'm sorry, but is it normal to have 'ack 1' with 'sack 0:1' here?
>>>
>>> It is normal, because the SYN was already received/processed.
>>>
>>> sack 0:1 represents this SYN sequence.
>>
>> Thank you for your reply!
>>
>> Maybe it is just me, but does it not look strange to have the SACK
>> covering a segment (0:1) that is before the ACK (1)?
>>
>> 'ack 1' and 'sack 0:1' seem to cover the same block, no?
>> Before Kuniyuki's patch, this 'sack 0:1' was not present.
>
> A SACK that covers a sequence range that was already cumulatively or
> selectively acknowledged is legal and useful, and is called a
> Duplicate Selective Acknowledgement (DSACK or D-SACK).
>
> A DSACK indicates that a receiver received duplicate data. That can be
> very useful in allowing a data sender to determine that a
> retransmission was not needed (spurious). If a data sender receives
> DSACKs for all retransmitted data in a loss detection episode then it
> knows all retransmissions were spurious, and thus it can "undo" its
> congestion control reaction to that estimated loss, since the
> congestion control algorithm was responding to an incorrect loss
> signal. This can be very helpful for performance in the presence of
> varying delays or reordering, which can cause spurious loss detection
> episodes..
>
> See:
>
> https://datatracker.ietf.org/doc/html/rfc2883
> An Extension to the Selective Acknowledgement (SACK) Option for TCP
>
> https://www.rfc-editor.org/rfc/rfc3708.html
> "Using TCP Duplicate Selective Acknowledgement (DSACKs) and Stream
> Control Transmission Protocol (SCTP) Duplicate Transmission Sequence
> Numbers (TSNs) to Detect Spurious Retransmissions"
Thank you for the great explanations!
I was not expecting this in a 3rd ACK :)
I don't know if it will help the congestion control algorithm in this
case, but I guess we don't need to worry about this marginal case.
I will then send the v3, and the Packetdrill modification.
Cheers,
Matt
--
Sponsored by the NGI0 Core fund.
^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2024-07-24 8:16 UTC | newest]
Thread overview: 17+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-07-18 10:33 [PATCH net v2 0/2] tcp: restrict crossed SYN specific actions to SYN-ACK Matthieu Baerts (NGI0)
2024-07-18 10:33 ` [PATCH net v2 1/2] tcp: process the 3rd ACK with sk_socket for TFO/MPTCP Matthieu Baerts (NGI0)
2024-07-23 14:37 ` Eric Dumazet
2024-07-23 14:58 ` Matthieu Baerts
2024-07-23 15:38 ` Eric Dumazet
2024-07-23 16:08 ` Matthieu Baerts
2024-07-23 16:42 ` Eric Dumazet
2024-07-23 19:09 ` Matthieu Baerts
2024-07-23 21:41 ` Kuniyuki Iwashima
2024-07-23 22:01 ` Neal Cardwell
2024-07-24 8:15 ` Matthieu Baerts
2024-07-23 21:27 ` Kuniyuki Iwashima
2024-07-18 10:33 ` [PATCH net v2 2/2] tcp: limit wake-up for crossed SYN cases to SYN-ACK Matthieu Baerts (NGI0)
2024-07-23 14:32 ` Eric Dumazet
2024-07-23 14:40 ` Matthieu Baerts
2024-07-23 8:10 ` [PATCH net v2 0/2] tcp: restrict crossed SYN specific actions " Paolo Abeni
2024-07-23 8:22 ` Eric Dumazet
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).