[PATCH v3 net-next 0/2] tcp: Make simultaneous connect() RFC-compliant.


* [PATCH v3 net-next 0/2] tcp: Make simultaneous connect() RFC-compliant.
@ 2024-07-10 17:12 Kuniyuki Iwashima
  2024-07-10 17:12 ` [PATCH v3 net-next 1/2] tcp: Don't drop SYN+ACK for simultaneous connect() Kuniyuki Iwashima
                   ` (2 more replies)
  0 siblings, 3 replies; 9+ messages in thread
From: Kuniyuki Iwashima @ 2024-07-10 17:12 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	David Ahern
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

Patch 1 fixes an issue where the BPF TCP option parser is triggered for the
ACK instead of the SYN+ACK in the case of simultaneous connect().

Patch 2 removes a wrong assumption in the tcp_ao/self-connect tests.

v3:
  * Use (sk->sk_state == TCP_SYN_RECV && sk->sk_socket) to detect cross SYN case

v2: https://lore.kernel.org/netdev/20240708180852.92919-1-kuniyu@amazon.com/
  * Target net-next and remove Fixes: tag
  * Don't skip bpf_skops_parse_hdr() to centralise sk_state check
  * Remove unnecessary ACK after SYN+ACK
  * Add patch 2

v1: https://lore.kernel.org/netdev/20240704035703.95065-1-kuniyu@amazon.com/


Kuniyuki Iwashima (2):
  tcp: Don't drop SYN+ACK for simultaneous connect().
  selftests: tcp: Remove broken SNMP assumptions for TCP AO self-connect
    tests.

 net/ipv4/tcp_input.c                           |  9 +++++++++
 .../selftests/net/tcp_ao/self-connect.c        | 18 ------------------
 2 files changed, 9 insertions(+), 18 deletions(-)

-- 
2.30.2



* [PATCH v3 net-next 1/2] tcp: Don't drop SYN+ACK for simultaneous connect().
  2024-07-10 17:12 [PATCH v3 net-next 0/2] tcp: Make simultaneous connect() RFC-compliant Kuniyuki Iwashima
@ 2024-07-10 17:12 ` Kuniyuki Iwashima
  2024-07-11 15:34   ` Eric Dumazet
  2024-07-15 15:58   ` Matthieu Baerts
  2024-07-10 17:12 ` [PATCH v3 net-next 2/2] selftests: tcp: Remove broken SNMP assumptions for TCP AO self-connect tests Kuniyuki Iwashima
  2024-07-13 22:30 ` [PATCH v3 net-next 0/2] tcp: Make simultaneous connect() RFC-compliant patchwork-bot+netdevbpf
  2 siblings, 2 replies; 9+ messages in thread
From: Kuniyuki Iwashima @ 2024-07-10 17:12 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	David Ahern
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev

RFC 9293 states that in the case of simultaneous connect(), the connection
gets established when SYN+ACK is received. [0]

      TCP Peer A                                       TCP Peer B

  1.  CLOSED                                           CLOSED
  2.  SYN-SENT     --> <SEQ=100><CTL=SYN>              ...
  3.  SYN-RECEIVED <-- <SEQ=300><CTL=SYN>              <-- SYN-SENT
  4.               ... <SEQ=100><CTL=SYN>              --> SYN-RECEIVED
  5.  SYN-RECEIVED --> <SEQ=100><ACK=301><CTL=SYN,ACK> ...
  6.  ESTABLISHED  <-- <SEQ=300><ACK=101><CTL=SYN,ACK> <-- SYN-RECEIVED
  7.               ... <SEQ=100><ACK=301><CTL=SYN,ACK> --> ESTABLISHED

However, since commit 0c24604b68fc ("tcp: implement RFC 5961 4.2"), such a
SYN+ACK is dropped in tcp_validate_incoming() and answered with a Challenge
ACK.
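
The state transitions in the diagram above can be sketched as a small
user-space model. This is only an illustration of the RFC 9293 sequence, not
kernel code: the enum values and the functions peer_connect() and
peer_receive() are hypothetical names, and sequence/acknowledgment numbers
are assumed to be valid.

```c
#include <stdbool.h>

/* Hypothetical user-space model; these are not kernel identifiers. */
enum peer_state {
	PEER_CLOSED,
	PEER_SYN_SENT,
	PEER_SYN_RECEIVED,
	PEER_ESTABLISHED,
};

/* Active open from CLOSED: a SYN is sent and the peer waits in SYN-SENT. */
static enum peer_state peer_connect(enum peer_state s)
{
	return s == PEER_CLOSED ? PEER_SYN_SENT : s;
}

/*
 * Advance one peer on an incoming segment, described only by its SYN and
 * ACK flags.  Sequence and acknowledgment numbers are assumed to be valid.
 */
static enum peer_state peer_receive(enum peer_state s, bool syn, bool ack)
{
	switch (s) {
	case PEER_SYN_SENT:
		if (syn && ack)
			return PEER_ESTABLISHED;	/* ordinary client handshake */
		if (syn)
			return PEER_SYN_RECEIVED;	/* crossed SYN, diagram lines 3 and 4 */
		break;
	case PEER_SYN_RECEIVED:
		if (ack)
			return PEER_ESTABLISHED;	/* SYN+ACK or ACK, diagram lines 6 and 7 */
		break;
	default:
		break;
	}
	return s;
}
```

In this model, a peer that has called peer_connect() and then receives a bare
SYN followed by a SYN+ACK ends up in PEER_ESTABLISHED, which is the behaviour
the RFC describes and the behaviour that the Challenge ACK path prevents.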

For example, the write() syscall in the following packetdrill script fails
with -EAGAIN, and wrong SNMP stats get incremented.

   0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3
  +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)

  +0 > S  0:0(0) <mss 1460,sackOK,TS val 1000 ecr 0,nop,wscale 8>
  +0 < S  0:0(0) win 1000 <mss 1000>
  +0 > S. 0:0(0) ack 1 <mss 1460,sackOK,TS val 3308134035 ecr 0,nop,wscale 8>
  +0 < S. 0:0(0) ack 1 win 1000

  +0 write(3, ..., 100) = 100
  +0 > P. 1:101(100) ack 1

  --

  # packetdrill cross-synack.pkt
  cross-synack.pkt:13: runtime error in write call: Expected result 100 but got -1 with errno 11 (Resource temporarily unavailable)
  # nstat
  ...
  TcpExtTCPChallengeACK           1                  0.0
  TcpExtTCPSYNChallenge           1                  0.0

The problem is that bpf_skops_established() is triggered by the Challenge
ACK instead of SYN+ACK.  This causes the bpf prog to miss the chance to
check if the peer supports a TCP option that is expected to be exchanged
in SYN and SYN+ACK.

Let's accept a bare SYN+ACK for active-open TCP_SYN_RECV sockets to avoid
such a situation.

Note that tcp_ack_snd_check() in tcp_rcv_state_process() is skipped so as
not to send an unnecessary ACK.  This could be a bit risky for net.git, so
the series targets net-next.
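
As a rough illustration of what "a bare SYN+ACK for active-open TCP_SYN_RECV
sockets" means, the acceptance test added to tcp_validate_incoming() in the
diff below can be modelled in user space.  The structs, the MODEL_SYN_RECV
constant, and cross_synack_acceptable() are hypothetical stand-ins for the
kernel fields named in the comments; this is a sketch, not the kernel
implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the kernel state read by the check. */
enum { MODEL_SYN_RECV = 1 };

struct model_sock {
	int state;		/* models sk->sk_state */
	bool has_socket;	/* models sk->sk_socket != NULL */
	uint32_t rcv_nxt;	/* models tp->rcv_nxt */
	uint32_t snd_nxt;	/* models tp->snd_nxt */
};

struct model_seg {
	bool syn;		/* models th->syn */
	bool ack;		/* models th->ack */
	uint32_t seq;		/* models TCP_SKB_CB(skb)->seq */
	uint32_t end_seq;	/* models TCP_SKB_CB(skb)->end_seq */
	uint32_t ack_seq;	/* models TCP_SKB_CB(skb)->ack_seq */
};

/* True when the segment is a bare SYN+ACK completing a crossed SYN. */
static bool cross_synack_acceptable(const struct model_sock *sk,
				    const struct model_seg *seg)
{
	return seg->syn && seg->ack &&
	       sk->state == MODEL_SYN_RECV && sk->has_socket &&
	       /* bare: the SYN takes one sequence number, no payload */
	       (uint32_t)(seg->seq + 1) == seg->end_seq &&
	       /* same initial sequence number as the SYN already received */
	       (uint32_t)(seg->seq + 1) == sk->rcv_nxt &&
	       /* acknowledges exactly what we have sent so far */
	       seg->ack_seq == sk->snd_nxt;
}
```

With the relative numbers from the packetdrill script above (seq 0, end_seq 1,
ack 1, and rcv_nxt == snd_nxt == 1 after the crossed SYNs), the predicate
holds.  It fails if the segment carries payload, acknowledges a different
sequence number, or arrives on a socket without sk_socket set.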

Link: https://www.rfc-editor.org/rfc/rfc9293.html#section-3.5-7 [0]
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
---
 net/ipv4/tcp_input.c | 9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 47dacb575f74..1eddb6b9fb2a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -5989,6 +5989,11 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 	 * RFC 5961 4.2 : Send a challenge ack
 	 */
 	if (th->syn) {
+		if (sk->sk_state == TCP_SYN_RECV && sk->sk_socket && th->ack &&
+		    TCP_SKB_CB(skb)->seq + 1 == TCP_SKB_CB(skb)->end_seq &&
+		    TCP_SKB_CB(skb)->seq + 1 == tp->rcv_nxt &&
+		    TCP_SKB_CB(skb)->ack_seq == tp->snd_nxt)
+			goto pass;
 syn_challenge:
 		if (syn_inerr)
 			TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
@@ -5998,6 +6003,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
 		goto discard;
 	}
 
+pass:
 	bpf_skops_parse_hdr(sk, skb);
 
 	return true;
@@ -6804,6 +6810,9 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
 		tcp_fast_path_on(tp);
 		if (sk->sk_shutdown & SEND_SHUTDOWN)
 			tcp_shutdown(sk, SEND_SHUTDOWN);
+
+		if (sk->sk_socket)
+			goto consume;
 		break;
 
 	case TCP_FIN_WAIT1: {
-- 
2.30.2



* [PATCH v3 net-next 2/2] selftests: tcp: Remove broken SNMP assumptions for TCP AO self-connect tests.
  2024-07-10 17:12 [PATCH v3 net-next 0/2] tcp: Make simultaneous connect() RFC-compliant Kuniyuki Iwashima
  2024-07-10 17:12 ` [PATCH v3 net-next 1/2] tcp: Don't drop SYN+ACK for simultaneous connect() Kuniyuki Iwashima
@ 2024-07-10 17:12 ` Kuniyuki Iwashima
  2024-07-13 22:30 ` [PATCH v3 net-next 0/2] tcp: Make simultaneous connect() RFC-compliant patchwork-bot+netdevbpf
  2 siblings, 0 replies; 9+ messages in thread
From: Kuniyuki Iwashima @ 2024-07-10 17:12 UTC (permalink / raw)
  To: David S. Miller, Eric Dumazet, Jakub Kicinski, Paolo Abeni,
	David Ahern
  Cc: Kuniyuki Iwashima, Kuniyuki Iwashima, netdev, Dmitry Safonov

tcp_ao/self-connect.c checked the following SNMP stats before/after
connect() to confirm that the test exercises the simultaneous connect()
path.

  * TCPChallengeACK
  * TCPSYNChallenge

But the stats should not be counted for self-connect in the first place,
and the assumption is no longer true.

Let's remove the check.

Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Dmitry Safonov <dima@arista.com>
---
 .../selftests/net/tcp_ao/self-connect.c        | 18 ------------------
 1 file changed, 18 deletions(-)

diff --git a/tools/testing/selftests/net/tcp_ao/self-connect.c b/tools/testing/selftests/net/tcp_ao/self-connect.c
index e154d9e198a9..a5698b0a3718 100644
--- a/tools/testing/selftests/net/tcp_ao/self-connect.c
+++ b/tools/testing/selftests/net/tcp_ao/self-connect.c
@@ -30,8 +30,6 @@ static void setup_lo_intf(const char *lo_intf)
 static void tcp_self_connect(const char *tst, unsigned int port,
 			     bool different_keyids, bool check_restore)
 {
-	uint64_t before_challenge_ack, after_challenge_ack;
-	uint64_t before_syn_challenge, after_syn_challenge;
 	struct tcp_ao_counters before_ao, after_ao;
 	uint64_t before_aogood, after_aogood;
 	struct netstat *ns_before, *ns_after;
@@ -62,8 +60,6 @@ static void tcp_self_connect(const char *tst, unsigned int port,
 
 	ns_before = netstat_read();
 	before_aogood = netstat_get(ns_before, "TCPAOGood", NULL);
-	before_challenge_ack = netstat_get(ns_before, "TCPChallengeACK", NULL);
-	before_syn_challenge = netstat_get(ns_before, "TCPSYNChallenge", NULL);
 	if (test_get_tcp_ao_counters(sk, &before_ao))
 		test_error("test_get_tcp_ao_counters()");
 
@@ -82,8 +78,6 @@ static void tcp_self_connect(const char *tst, unsigned int port,
 
 	ns_after = netstat_read();
 	after_aogood = netstat_get(ns_after, "TCPAOGood", NULL);
-	after_challenge_ack = netstat_get(ns_after, "TCPChallengeACK", NULL);
-	after_syn_challenge = netstat_get(ns_after, "TCPSYNChallenge", NULL);
 	if (test_get_tcp_ao_counters(sk, &after_ao))
 		test_error("test_get_tcp_ao_counters()");
 	if (!check_restore) {
@@ -98,18 +92,6 @@ static void tcp_self_connect(const char *tst, unsigned int port,
 		close(sk);
 		return;
 	}
-	if (after_challenge_ack <= before_challenge_ack ||
-	    after_syn_challenge <= before_syn_challenge) {
-		/*
-		 * It's also meant to test simultaneous open, so check
-		 * these counters as well.
-		 */
-		test_fail("%s: Didn't challenge SYN or ACK: %zu <= %zu OR %zu <= %zu",
-			  tst, after_challenge_ack, before_challenge_ack,
-			  after_syn_challenge, before_syn_challenge);
-		close(sk);
-		return;
-	}
 
 	if (test_tcp_ao_counters_cmp(tst, &before_ao, &after_ao, TEST_CNT_GOOD)) {
 		close(sk);
-- 
2.30.2



* Re: [PATCH v3 net-next 1/2] tcp: Don't drop SYN+ACK for simultaneous connect().
  2024-07-10 17:12 ` [PATCH v3 net-next 1/2] tcp: Don't drop SYN+ACK for simultaneous connect() Kuniyuki Iwashima
@ 2024-07-11 15:34   ` Eric Dumazet
  2024-07-15 15:58   ` Matthieu Baerts
  1 sibling, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2024-07-11 15:34 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: David S. Miller, Jakub Kicinski, Paolo Abeni, David Ahern,
	Kuniyuki Iwashima, netdev

On Wed, Jul 10, 2024 at 10:13 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> [...]

Reviewed-by: Eric Dumazet <edumazet@google.com>


* Re: [PATCH v3 net-next 0/2] tcp: Make simultaneous connect() RFC-compliant.
  2024-07-10 17:12 [PATCH v3 net-next 0/2] tcp: Make simultaneous connect() RFC-compliant Kuniyuki Iwashima
  2024-07-10 17:12 ` [PATCH v3 net-next 1/2] tcp: Don't drop SYN+ACK for simultaneous connect() Kuniyuki Iwashima
  2024-07-10 17:12 ` [PATCH v3 net-next 2/2] selftests: tcp: Remove broken SNMP assumptions for TCP AO self-connect tests Kuniyuki Iwashima
@ 2024-07-13 22:30 ` patchwork-bot+netdevbpf
  2 siblings, 0 replies; 9+ messages in thread
From: patchwork-bot+netdevbpf @ 2024-07-13 22:30 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: davem, edumazet, kuba, pabeni, dsahern, kuni1840, netdev

Hello:

This series was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 10 Jul 2024 10:12:44 -0700 you wrote:
> Patch 1 fixes an issue where the BPF TCP option parser is triggered for the
> ACK instead of the SYN+ACK in the case of simultaneous connect().
> 
> Patch 2 removes a wrong assumption in the tcp_ao/self-connect tests.
> 
> v3:
>   * Use (sk->sk_state == TCP_SYN_RECV && sk->sk_socket) to detect cross SYN case
> 
> [...]

Here is the summary with links:
  - [v3,net-next,1/2] tcp: Don't drop SYN+ACK for simultaneous connect().
    https://git.kernel.org/netdev/net-next/c/23e89e8ee7be
  - [v3,net-next,2/2] selftests: tcp: Remove broken SNMP assumptions for TCP AO self-connect tests.
    https://git.kernel.org/netdev/net-next/c/b3bb4d23a41b

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH v3 net-next 1/2] tcp: Don't drop SYN+ACK for simultaneous connect().
  2024-07-10 17:12 ` [PATCH v3 net-next 1/2] tcp: Don't drop SYN+ACK for simultaneous connect() Kuniyuki Iwashima
  2024-07-11 15:34   ` Eric Dumazet
@ 2024-07-15 15:58   ` Matthieu Baerts
  2024-07-16 19:23     ` Kuniyuki Iwashima
  1 sibling, 1 reply; 9+ messages in thread
From: Matthieu Baerts @ 2024-07-15 15:58 UTC (permalink / raw)
  To: Kuniyuki Iwashima, David S. Miller, Eric Dumazet, Jakub Kicinski,
	Paolo Abeni, David Ahern
  Cc: Kuniyuki Iwashima, netdev

Hi Kuniyuki,

On 10/07/2024 19:12, Kuniyuki Iwashima wrote:
> [...]

Thank you for having worked on this patch!

> ---
>  net/ipv4/tcp_input.c | 9 +++++++++
>  1 file changed, 9 insertions(+)
> 
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 47dacb575f74..1eddb6b9fb2a 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -5989,6 +5989,11 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
>  	 * RFC 5961 4.2 : Send a challenge ack
>  	 */
>  	if (th->syn) {
> +		if (sk->sk_state == TCP_SYN_RECV && sk->sk_socket && th->ack &&
> +		    TCP_SKB_CB(skb)->seq + 1 == TCP_SKB_CB(skb)->end_seq &&
> +		    TCP_SKB_CB(skb)->seq + 1 == tp->rcv_nxt &&
> +		    TCP_SKB_CB(skb)->ack_seq == tp->snd_nxt)
> +			goto pass;
>  syn_challenge:
>  		if (syn_inerr)
>  			TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
> @@ -5998,6 +6003,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
>  		goto discard;
>  	}
>  
> +pass:
>  	bpf_skops_parse_hdr(sk, skb);
>  
>  	return true;
> @@ -6804,6 +6810,9 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
>  		tcp_fast_path_on(tp);
>  		if (sk->sk_shutdown & SEND_SHUTDOWN)
>  			tcp_shutdown(sk, SEND_SHUTDOWN);
> +
> +		if (sk->sk_socket)
> +			goto consume;

It looks like this modification changes the behaviour for MPTCP Join
requests on listening sockets: when receiving the 3rd ACK of a request
adding a new path (MP_JOIN), sk->sk_socket will be set, and will point to
the MPTCP sock that was created earlier, when the MPTCP connection was
established with the first path. This new 'goto' will then skip the
processing of the segment text (step 7) and not go through
tcp_data_queue(), where the MPTCP options are validated and some actions
are triggered, e.g. sending the MPJ 4th ACK [1].

This doesn't fully break MPTCP: mainly, the 4th MPJ ACK will be delayed.
But it looks like it affects the MPTFO feature as well (probably in case
of retransmissions, I suppose), and is the reason why the selftests have
been unstable for the last few days [2].

[1] https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens
[2]
https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh


Looking at what this patch here is trying to fix, I wonder if it would
not be enough to apply this patch:

> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index ff9ab3d01ced..ff981d7776c3 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -6820,7 +6820,7 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
>                 if (sk->sk_shutdown & SEND_SHUTDOWN)
>                         tcp_shutdown(sk, SEND_SHUTDOWN);
>  
> -               if (sk->sk_socket)
> +               if (sk->sk_socket && !sk_is_mptcp(sk))
>                         goto consume;
>                 break;
>  

But I still need to investigate how the issue that is being addressed by
your patch can be translated to the MPTCP case. I guess we could add
additional checks for MPTCP: new connection or additional path? etc. Or
maybe that's not needed.
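
To make the difference between the two conditions concrete, here is a
minimal user-space sketch. It assumes only what is described above: a
simultaneous-connect TCP socket has sk_socket set and is not MPTCP, while an
MP_JOIN subflow has sk_socket set and is MPTCP. The struct and function
names are hypothetical, and whether an MPTCP simultaneous connect needs its
own handling is left open, as noted above.

```c
#include <stdbool.h>

/* Hypothetical model of the two facts the conditions look at. */
struct model_sk {
	bool has_socket;	/* models sk->sk_socket != NULL */
	bool is_mptcp;		/* models sk_is_mptcp(sk) */
};

/* Condition from the v3 patch: true means "goto consume", skipping step 7. */
static bool consume_v3(const struct model_sk *sk)
{
	return sk->has_socket;
}

/* Condition from the diff proposed above: MPTCP subflows are left alone. */
static bool consume_proposed(const struct model_sk *sk)
{
	return sk->has_socket && !sk->is_mptcp;
}
```

Under these assumptions, both conditions consume the crossed SYN+ACK on a
plain TCP socket, but only the v3 condition also consumes the 3rd ACK of an
MP_JOIN, which is the case that skips tcp_data_queue().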

>  		break;
>  
>  	case TCP_FIN_WAIT1: {

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.



* Re: [PATCH v3 net-next 1/2] tcp: Don't drop SYN+ACK for simultaneous connect().
  2024-07-15 15:58   ` Matthieu Baerts
@ 2024-07-16 19:23     ` Kuniyuki Iwashima
  2024-07-16 20:04       ` Matthieu Baerts
  0 siblings, 1 reply; 9+ messages in thread
From: Kuniyuki Iwashima @ 2024-07-16 19:23 UTC (permalink / raw)
  To: matttbe; +Cc: davem, dsahern, edumazet, kuba, kuni1840, kuniyu, netdev, pabeni

Hi Matthieu,

From: Matthieu Baerts <matttbe@kernel.org>
Date: Mon, 15 Jul 2024 17:58:49 +0200
> Hi Kuniyuki,
> 
> On 10/07/2024 19:12, Kuniyuki Iwashima wrote:
> > [...]
> 
> Thank you for having worked on this patch!
> 
> > ---
> >  net/ipv4/tcp_input.c | 9 +++++++++
> >  1 file changed, 9 insertions(+)
> > 
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 47dacb575f74..1eddb6b9fb2a 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -5989,6 +5989,11 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
> >  	 * RFC 5961 4.2 : Send a challenge ack
> >  	 */
> >  	if (th->syn) {
> > +		if (sk->sk_state == TCP_SYN_RECV && sk->sk_socket && th->ack &&
> > +		    TCP_SKB_CB(skb)->seq + 1 == TCP_SKB_CB(skb)->end_seq &&
> > +		    TCP_SKB_CB(skb)->seq + 1 == tp->rcv_nxt &&
> > +		    TCP_SKB_CB(skb)->ack_seq == tp->snd_nxt)
> > +			goto pass;
> >  syn_challenge:
> >  		if (syn_inerr)
> >  			TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
> > @@ -5998,6 +6003,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
> >  		goto discard;
> >  	}
> >  
> > +pass:
> >  	bpf_skops_parse_hdr(sk, skb);
> >  
> >  	return true;
> > @@ -6804,6 +6810,9 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
> >  		tcp_fast_path_on(tp);
> >  		if (sk->sk_shutdown & SEND_SHUTDOWN)
> >  			tcp_shutdown(sk, SEND_SHUTDOWN);
> > +
> > +		if (sk->sk_socket)
> > +			goto consume;
>
> It looks like this modification changes the behaviour for MPTCP Join
> requests for listening sockets: when receiving the 3rd ACK of a request
> adding a new path (MP_JOIN), sk->sk_socket will be set, and point to the
> MPTCP sock that has been created when the MPTCP connection got created
> before with the first path.

Thanks for catching this!

I completely missed how MPTCP sets sk->sk_socket before the 3rd ACK is
processed.  I debugged a bit and confirmed mptcp_stream_accept() sets
the inflight subflow's sk->sk_socket with newsk->sk_socket.


> This new 'goto' here will then skip the
> process of the segment text (step 7) and not go through tcp_data_queue()
> where the MPTCP options are validated, and some actions are triggered,
> e.g. sending the MPJ 4th ACK [1].
> 
> This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be
> delayed,

Yes, the test failure depends on timing.  I only reproduced it by running
the test many times on non-kvm qemu.


> but it looks like it affects the MPTFO feature as well --
> probably in case of retransmissions I suppose -- and being the reason
> why the selftests started to be unstable the last few days [2].
> 
> [1] https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens
> [2]
> https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh
> 
> 
> Looking at what this patch here is trying to fix, I wonder if it would
> not be enough to apply this patch:
> 
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index ff9ab3d01ced..ff981d7776c3 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -6820,7 +6820,7 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
> >                 if (sk->sk_shutdown & SEND_SHUTDOWN)
> >                         tcp_shutdown(sk, SEND_SHUTDOWN);
> >  
> > -               if (sk->sk_socket)
> > +               if (sk->sk_socket && !sk_is_mptcp(sk))
> >                         goto consume;
> >                 break;
> >  
> 
> But I still need to investigate how the issue that is being addressed by
> your patch can be translated to the MPTCP case. I guess we could add
> additional checks for MPTCP: new connection or additional path? etc. Or
> maybe that's not needed.

My original intention was only to stop dropping the SYN+ACK in
tcp_validate_incoming().  The goto was added in v2 to be more compliant
with the RFC by not sending an unnecessary ACK for simultaneous connect().

So, we could rewrite the condition as this,

  if (sk->sk_socket && !th->syn)

but I think your patch is better to give a hint that MPTCP has a
different logic.

Also, a similar check is done before the goto; could this be improved
as well?

  if (sk->sk_socket)
    sk_wake_async(sk, SOCK_WAKE_IO, POLL_OUT);


Thanks!






* Re: [PATCH v3 net-next 1/2] tcp: Don't drop SYN+ACK for simultaneous connect().
  2024-07-16 19:23     ` Kuniyuki Iwashima
@ 2024-07-16 20:04       ` Matthieu Baerts
  2024-07-16 20:28         ` Kuniyuki Iwashima
  0 siblings, 1 reply; 9+ messages in thread
From: Matthieu Baerts @ 2024-07-16 20:04 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: davem, dsahern, edumazet, kuba, kuni1840, netdev, pabeni

Hi Kuniyuki,

Thank you for your reply!

On 16/07/2024 21:23, Kuniyuki Iwashima wrote:
> Hi Matthieu,
> 
> From: Matthieu Baerts <matttbe@kernel.org>
> Date: Mon, 15 Jul 2024 17:58:49 +0200
>> Hi Kuniyuki,
>>
>> On 10/07/2024 19:12, Kuniyuki Iwashima wrote:
>>> [...]
>>
>> Thank you for having worked on this patch!
>>
>>> ---
>>>  net/ipv4/tcp_input.c | 9 +++++++++
>>>  1 file changed, 9 insertions(+)
>>>
>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>>> index 47dacb575f74..1eddb6b9fb2a 100644
>>> --- a/net/ipv4/tcp_input.c
>>> +++ b/net/ipv4/tcp_input.c
>>> @@ -5989,6 +5989,11 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
>>>  	 * RFC 5961 4.2 : Send a challenge ack
>>>  	 */
>>>  	if (th->syn) {
>>> +		if (sk->sk_state == TCP_SYN_RECV && sk->sk_socket && th->ack &&
>>> +		    TCP_SKB_CB(skb)->seq + 1 == TCP_SKB_CB(skb)->end_seq &&
>>> +		    TCP_SKB_CB(skb)->seq + 1 == tp->rcv_nxt &&
>>> +		    TCP_SKB_CB(skb)->ack_seq == tp->snd_nxt)
>>> +			goto pass;
>>>  syn_challenge:
>>>  		if (syn_inerr)
>>>  			TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
>>> @@ -5998,6 +6003,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
>>>  		goto discard;
>>>  	}
>>>  
>>> +pass:
>>>  	bpf_skops_parse_hdr(sk, skb);
>>>  
>>>  	return true;
>>> @@ -6804,6 +6810,9 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
>>>  		tcp_fast_path_on(tp);
>>>  		if (sk->sk_shutdown & SEND_SHUTDOWN)
>>>  			tcp_shutdown(sk, SEND_SHUTDOWN);
>>> +
>>> +		if (sk->sk_socket)
>>> +			goto consume;
>>
>> It looks like this modification changes the behaviour for MPTCP Join
>> requests on listening sockets: when receiving the 3rd ACK of a request
>> adding a new path (MP_JOIN), sk->sk_socket will already be set and point
>> to the MPTCP sock that was created earlier, when the MPTCP connection
>> was established with the first path.
> 
> Thanks for catching this!
> 
> I completely missed how MPTCP sets sk->sk_socket before the 3rd ACK is
> processed.

No problem. It's a shame there was no clear error in the selftests :)

> I debugged a bit and confirmed mptcp_stream_accept() sets
> the inflight subflow's sk->sk_socket with newsk->sk_socket.

Yes, that's correct.

>> This new 'goto' here will then skip the
>> processing of the segment text (step 7) and not go through
>> tcp_data_queue(), where the MPTCP options are validated and some actions
>> are triggered, e.g. sending the MPJ 4th ACK [1].
>>
>> This doesn't fully break MPTCP: mainly, the 4th MPJ ACK will be
>> delayed,
> 
> Yes, the test failure depends on timing.  I only reproduced it by running
> the test many times on non-kvm qemu.

Thank you for having checked!

>> but it looks like it affects the MPTFO feature as well --
>> probably in case of retransmissions -- and is the reason why the
>> selftests have been unstable for the last few days [2].
>>
>> [1] https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens
>> [2]
>> https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh
>>
>>
>> Looking at what this patch here is trying to fix, I wonder if it would
>> not be enough to apply this patch:
>>
>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>>> index ff9ab3d01ced..ff981d7776c3 100644
>>> --- a/net/ipv4/tcp_input.c
>>> +++ b/net/ipv4/tcp_input.c
>>> @@ -6820,7 +6820,7 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
>>>                 if (sk->sk_shutdown & SEND_SHUTDOWN)
>>>                         tcp_shutdown(sk, SEND_SHUTDOWN);
>>>  
>>> -               if (sk->sk_socket)
>>> +               if (sk->sk_socket && !sk_is_mptcp(sk))
>>>                         goto consume;
>>>                 break;
>>>  
>>
>> But I still need to investigate how the issue that is being addressed by
>> your patch can be translated to the MPTCP case. I guess we could add
>> additional checks for MPTCP: new connection or additional path? etc. Or
>> maybe that's not needed.
> 
> My original intention was only to stop dropping SYN+ACK in
> tcp_validate_incoming(); the goto was added in v2 to be more compliant
> with the RFC by not sending an unnecessary ACK for simultaneous connect().
> 
> So, we could rewrite the condition like this:
> 
>   if (sk->sk_socket && !th->syn)

(Just to be sure, do you mean the opposite with th->syn?)

  if (sk->sk_socket && th->syn)
      goto consume;

That's a good idea!

I sent my patch a couple of minutes ago [1], then I saw your suggestion
here. It looks like it should work for the TFO case as well. Maybe your
suggestion is more generic and will cover more cases?

[1]
https://lore.kernel.org/all/20240716-upstream-net-next-20240716-tcp-3rd-ack-consume-sk_socket-v1-1-4e61d0b79233@kernel.org/

> but I think your patch is better to give a hint that MPTCP has a
> different logic.

Because TFO also has different logic, it might be good to have a clear
comment about that.

> Also, a similar check is done before the goto; could this be
> improved?
> 
>   if (sk->sk_socket)
>     sk_wake_async(sk, SOCK_WAKE_IO, POLL_OUT);

It is a bit late for me, but I think it can be kept as it is: for MPTCP,
it will not wake up userspace, as the subflow is managed by the
kernel. I would need to check whether we could avoid that. Also, would
this wakeup not be useful for TFO?

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.



* Re: [PATCH v3 net-next 1/2] tcp: Don't drop SYN+ACK for simultaneous connect().
  2024-07-16 20:04       ` Matthieu Baerts
@ 2024-07-16 20:28         ` Kuniyuki Iwashima
  0 siblings, 0 replies; 9+ messages in thread
From: Kuniyuki Iwashima @ 2024-07-16 20:28 UTC (permalink / raw)
  To: matttbe; +Cc: davem, dsahern, edumazet, kuba, kuni1840, kuniyu, netdev, pabeni

From: Matthieu Baerts <matttbe@kernel.org>
Date: Tue, 16 Jul 2024 22:04:25 +0200
> Hi Kuniyuki,
> 
> Thank you for your reply!
> 
> On 16/07/2024 21:23, Kuniyuki Iwashima wrote:
> > Hi Matthieu,
> > 
> > From: Matthieu Baerts <matttbe@kernel.org>
> > Date: Mon, 15 Jul 2024 17:58:49 +0200
> >> Hi Kuniyuki,
> >>
> >> On 10/07/2024 19:12, Kuniyuki Iwashima wrote:
> >>> RFC 9293 states that in the case of simultaneous connect(), the connection
> >>> gets established when SYN+ACK is received. [0]
> >>>
> >>>       TCP Peer A                                       TCP Peer B
> >>>
> >>>   1.  CLOSED                                           CLOSED
> >>>   2.  SYN-SENT     --> <SEQ=100><CTL=SYN>              ...
> >>>   3.  SYN-RECEIVED <-- <SEQ=300><CTL=SYN>              <-- SYN-SENT
> >>>   4.               ... <SEQ=100><CTL=SYN>              --> SYN-RECEIVED
> >>>   5.  SYN-RECEIVED --> <SEQ=100><ACK=301><CTL=SYN,ACK> ...
> >>>   6.  ESTABLISHED  <-- <SEQ=300><ACK=101><CTL=SYN,ACK> <-- SYN-RECEIVED
> >>>   7.               ... <SEQ=100><ACK=301><CTL=SYN,ACK> --> ESTABLISHED
> >>>
> >>> However, since commit 0c24604b68fc ("tcp: implement RFC 5961 4.2"), such a
> >>> SYN+ACK is dropped in tcp_validate_incoming() and responded with Challenge
> >>> ACK.
> >>>
> >>> For example, the write() syscall in the following packetdrill script fails
> >>> with -EAGAIN, and wrong SNMP stats get incremented.
> >>>
> >>>    0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3
> >>>   +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
> >>>
> >>>   +0 > S  0:0(0) <mss 1460,sackOK,TS val 1000 ecr 0,nop,wscale 8>
> >>>   +0 < S  0:0(0) win 1000 <mss 1000>
> >>>   +0 > S. 0:0(0) ack 1 <mss 1460,sackOK,TS val 3308134035 ecr 0,nop,wscale 8>
> >>>   +0 < S. 0:0(0) ack 1 win 1000
> >>>
> >>>   +0 write(3, ..., 100) = 100
> >>>   +0 > P. 1:101(100) ack 1
> >>>
> >>>   --
> >>>
> >>>   # packetdrill cross-synack.pkt
> >>>   cross-synack.pkt:13: runtime error in write call: Expected result 100 but got -1 with errno 11 (Resource temporarily unavailable)
> >>>   # nstat
> >>>   ...
> >>>   TcpExtTCPChallengeACK           1                  0.0
> >>>   TcpExtTCPSYNChallenge           1                  0.0
> >>>
> >>> The problem is that bpf_skops_established() is triggered by the Challenge
> >>> ACK instead of SYN+ACK.  This causes the bpf prog to miss the chance to
> >>> check if the peer supports a TCP option that is expected to be exchanged
> >>> in SYN and SYN+ACK.
> >>>
> >>> Let's accept a bare SYN+ACK for active-open TCP_SYN_RECV sockets to avoid
> >>> such a situation.
> >>>
> >>> Note that tcp_ack_snd_check() in tcp_rcv_state_process() is skipped so as
> >>> not to send an unnecessary ACK.  This could be a bit risky for net.git, so
> >>> this targets net-next.
> >>>
> >>> Link: https://www.rfc-editor.org/rfc/rfc9293.html#section-3.5-7 [0]
> >>> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
> >>
> >> Thank you for having worked on this patch!
> >>
> >>> ---
> >>>  net/ipv4/tcp_input.c | 9 +++++++++
> >>>  1 file changed, 9 insertions(+)
> >>>
> >>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> >>> index 47dacb575f74..1eddb6b9fb2a 100644
> >>> --- a/net/ipv4/tcp_input.c
> >>> +++ b/net/ipv4/tcp_input.c
> >>> @@ -5989,6 +5989,11 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
> >>>  	 * RFC 5961 4.2 : Send a challenge ack
> >>>  	 */
> >>>  	if (th->syn) {
> >>> +		if (sk->sk_state == TCP_SYN_RECV && sk->sk_socket && th->ack &&
> >>> +		    TCP_SKB_CB(skb)->seq + 1 == TCP_SKB_CB(skb)->end_seq &&
> >>> +		    TCP_SKB_CB(skb)->seq + 1 == tp->rcv_nxt &&
> >>> +		    TCP_SKB_CB(skb)->ack_seq == tp->snd_nxt)
> >>> +			goto pass;
> >>>  syn_challenge:
> >>>  		if (syn_inerr)
> >>>  			TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
> >>> @@ -5998,6 +6003,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
> >>>  		goto discard;
> >>>  	}
> >>>  
> >>> +pass:
> >>>  	bpf_skops_parse_hdr(sk, skb);
> >>>  
> >>>  	return true;
> >>> @@ -6804,6 +6810,9 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
> >>>  		tcp_fast_path_on(tp);
> >>>  		if (sk->sk_shutdown & SEND_SHUTDOWN)
> >>>  			tcp_shutdown(sk, SEND_SHUTDOWN);
> >>> +
> >>> +		if (sk->sk_socket)
> >>> +			goto consume;
> >>
> >> It looks like this modification changes the behaviour for MPTCP Join
> >> requests on listening sockets: when receiving the 3rd ACK of a request
> >> adding a new path (MP_JOIN), sk->sk_socket will already be set and point
> >> to the MPTCP sock that was created earlier, when the MPTCP connection
> >> was established with the first path.
> > 
> > Thanks for catching this!
> > 
> > I completely missed how MPTCP sets sk->sk_socket before the 3rd ACK is
> > processed.
> 
> No problem. It's a shame there was no clear error in the selftests :)
> 
> > I debugged a bit and confirmed mptcp_stream_accept() sets
> > the inflight subflow's sk->sk_socket with newsk->sk_socket.
> 
> Yes, that's correct.
> 
> >> This new 'goto' here will then skip the
> >> processing of the segment text (step 7) and not go through
> >> tcp_data_queue(), where the MPTCP options are validated and some actions
> >> are triggered, e.g. sending the MPJ 4th ACK [1].
> >>
> >> This doesn't fully break MPTCP: mainly, the 4th MPJ ACK will be
> >> delayed,
> > 
> > Yes, the test failure depends on timing.  I only reproduced it by running
> > the test many times on non-kvm qemu.
> 
> Thank you for having checked!
> 
> >> but it looks like it affects the MPTFO feature as well --
> >> probably in case of retransmissions -- and is the reason why the
> >> selftests have been unstable for the last few days [2].
> >>
> >> [1] https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens
> >> [2]
> >> https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh
> >>
> >>
> >> Looking at what this patch here is trying to fix, I wonder if it would
> >> not be enough to apply this patch:
> >>
> >>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> >>> index ff9ab3d01ced..ff981d7776c3 100644
> >>> --- a/net/ipv4/tcp_input.c
> >>> +++ b/net/ipv4/tcp_input.c
> >>> @@ -6820,7 +6820,7 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
> >>>                 if (sk->sk_shutdown & SEND_SHUTDOWN)
> >>>                         tcp_shutdown(sk, SEND_SHUTDOWN);
> >>>  
> >>> -               if (sk->sk_socket)
> >>> +               if (sk->sk_socket && !sk_is_mptcp(sk))
> >>>                         goto consume;
> >>>                 break;
> >>>  
> >>
> >> But I still need to investigate how the issue that is being addressed by
> >> your patch can be translated to the MPTCP case. I guess we could add
> >> additional checks for MPTCP: new connection or additional path? etc. Or
> >> maybe that's not needed.
> > 
> > My original intention was only to stop dropping SYN+ACK in
> > tcp_validate_incoming(); the goto was added in v2 to be more compliant
> > with the RFC by not sending an unnecessary ACK for simultaneous connect().
> > 
> > So, we could rewrite the condition like this:
> > 
> >   if (sk->sk_socket && !th->syn)
> 
> (Just to be sure, do you mean the opposite with th->syn?)

Ah, yes :)


> 
>   if (sk->sk_socket && th->syn)
>       goto consume;
> 
> That's a good idea!
> 
> I sent my patch a couple of minutes ago [1], then I saw your suggestion
> here. It looks like it should work for the TFO case as well. Maybe your
> suggestion is more generic and will cover more cases?

Likely. I was just trying to avoid unnecessary changes and keep the effect
minimal.


> 
> [1]
> https://lore.kernel.org/all/20240716-upstream-net-next-20240716-tcp-3rd-ack-consume-sk_socket-v1-1-4e61d0b79233@kernel.org/
> 
> > but I think your patch is better to give a hint that MPTCP has a
> > different logic.
> 
> Because TFO also has different logic, it might be good to have a clear
> comment about that.

Exactly. A TFO socket could be accept()ed before the ACK is received, and
then we must not drop an ACK carrying data.
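
To make the comparison concrete, here is a minimal userspace sketch of the
three candidate conditions for the 'goto consume' discussed above.  It is
only a model of the reasoning in this thread, not kernel code: the struct,
its fields and the function names are hypothetical stand-ins for
sk->sk_socket, sk_is_mptcp(sk) and th->syn, and the scenario values are
assumptions taken from the descriptions above.

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model only: these fields stand in for kernel state and are
 * not real kernel structures.
 */
struct final_segment {
	bool has_sk_socket;	/* models sk->sk_socket != NULL */
	bool is_mptcp;		/* models sk_is_mptcp(sk) */
	bool syn;		/* models th->syn on the received segment */
};

/* v3 as posted: consume whenever a socket is attached. */
static bool consume_v3(const struct final_segment *s)
{
	return s->has_sk_socket;
}

/* Variant that additionally excludes MPTCP subflows. */
static bool consume_unless_mptcp(const struct final_segment *s)
{
	return s->has_sk_socket && !s->is_mptcp;
}

/* Variant that consumes only a segment carrying SYN. */
static bool consume_if_syn(const struct final_segment *s)
{
	return s->has_sk_socket && s->syn;
}

/* Scenarios as described in this thread (assumed flag values). */
static const struct final_segment cross_synack = { true,  false, true  };
static const struct final_segment passive_ack  = { false, false, false };
static const struct final_segment mp_join_ack  = { true,  true,  false };
static const struct final_segment tfo_ack_data = { true,  false, false };

void check_variants(void)
{
	/* Every variant consumes the crossed SYN+ACK ... */
	assert(consume_v3(&cross_synack));
	assert(consume_unless_mptcp(&cross_synack));
	assert(consume_if_syn(&cross_synack));

	/* ... and none touches a regular passive-open 3rd ACK. */
	assert(!consume_v3(&passive_ack));
	assert(!consume_unless_mptcp(&passive_ack));
	assert(!consume_if_syn(&passive_ack));

	/* v3 swallows the MP_JOIN 3rd ACK; the other two let it through. */
	assert(consume_v3(&mp_join_ack));
	assert(!consume_unless_mptcp(&mp_join_ack));
	assert(!consume_if_syn(&mp_join_ack));

	/* Only the th->syn variant lets an ACK with data reach an already
	 * accept()ed TFO socket.
	 */
	assert(consume_v3(&tfo_ack_data));
	assert(consume_unless_mptcp(&tfo_ack_data));
	assert(!consume_if_syn(&tfo_ack_data));
}
```

Calling check_variants() from a small main() runs the whole table.  In this
model the th->syn variant is the only one that leaves both the MP_JOIN and
the TFO segments alone, which matches the "more generic" remark above.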


> 
> > Also, a similar check done before the goto, and this could be
> > improved ?
> > 
> >   if (sk->sk_socket)
> >     sk_wake_async(sk, SOCK_WAKE_IO, POLL_OUT);
> 
> It is a bit late for me, but I think it can be kept as it is: for MPTCP,
> it will not wake up the userspace as the subflow is managed by the
> kernel. I would need to check if we could avoid that. Also, will this
> wakeup not be useful for TFO?

Yes, I think it's not necessary for TFO.
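
For reference, the sequence-number test from the tcp_validate_incoming()
hunk quoted above can be modelled in isolation.  The sketch below is a
userspace approximation under stated assumptions: 'struct seg' and
'struct conn' are hypothetical simplifications of TCP_SKB_CB(skb) and
struct tcp_sock, and the numbers in the check come from the RFC 9293
diagram in the commit message (Peer A, ISN 100, receiving SEQ=300 ACK=101).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified view of the segment fields used by the check. */
struct seg {
	uint32_t seq;		/* models TCP_SKB_CB(skb)->seq */
	uint32_t end_seq;	/* models TCP_SKB_CB(skb)->end_seq */
	uint32_t ack_seq;	/* models TCP_SKB_CB(skb)->ack_seq */
	bool syn;
	bool ack;
};

/* Hypothetical, simplified view of the receiving socket. */
struct conn {
	bool syn_recv;		/* models sk->sk_state == TCP_SYN_RECV */
	bool has_sk_socket;	/* models sk->sk_socket != NULL */
	uint32_t rcv_nxt;
	uint32_t snd_nxt;
};

/* True only for a bare SYN+ACK that exactly matches the crossed SYNs. */
static bool is_crossed_synack(const struct conn *c, const struct seg *s)
{
	return c->syn_recv && c->has_sk_socket && s->syn && s->ack &&
	       /* SYN occupies one sequence number and there is no payload */
	       (uint32_t)(s->seq + 1) == s->end_seq &&
	       /* it is the SYN we already accounted for */
	       (uint32_t)(s->seq + 1) == c->rcv_nxt &&
	       /* and it acknowledges exactly our own SYN */
	       s->ack_seq == c->snd_nxt;
}

void check_rfc_example(void)
{
	/* Peer A after step 3 of the diagram: sent SEQ=100, got SEQ=300. */
	struct conn a = { true, true, 301, 101 };
	struct seg synack = { 300, 301, 101, true, true };
	struct seg with_data = synack;
	struct seg stale_ack = synack;

	assert(is_crossed_synack(&a, &synack));

	with_data.end_seq = 311;	/* 10 bytes of payload: not bare */
	assert(!is_crossed_synack(&a, &with_data));

	stale_ack.ack_seq = 100;	/* does not cover our SYN */
	assert(!is_crossed_synack(&a, &stale_ack));

	a.has_sk_socket = false;	/* passive-open child: not our case */
	assert(!is_crossed_synack(&a, &synack));
}
```

Any segment that fails one of these tests would still take the existing
syn_challenge path in the quoted diff.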


Thread overview: 9+ messages
2024-07-10 17:12 [PATCH v3 net-next 0/2] tcp: Make simultaneous connect() RFC-compliant Kuniyuki Iwashima
2024-07-10 17:12 ` [PATCH v3 net-next 1/2] tcp: Don't drop SYN+ACK for simultaneous connect() Kuniyuki Iwashima
2024-07-11 15:34   ` Eric Dumazet
2024-07-15 15:58   ` Matthieu Baerts
2024-07-16 19:23     ` Kuniyuki Iwashima
2024-07-16 20:04       ` Matthieu Baerts
2024-07-16 20:28         ` Kuniyuki Iwashima
2024-07-10 17:12 ` [PATCH v3 net-next 2/2] selftests: tcp: Remove broken SNMP assumptions for TCP AO self-connect tests Kuniyuki Iwashima
2024-07-13 22:30 ` [PATCH v3 net-next 0/2] tcp: Make simultaneous connect() RFC-compliant patchwork-bot+netdevbpf
