netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Matthieu Baerts <matttbe@kernel.org>
To: Kuniyuki Iwashima <kuniyu@amazon.com>
Cc: davem@davemloft.net, dsahern@kernel.org, edumazet@google.com,
	kuba@kernel.org, kuni1840@gmail.com, netdev@vger.kernel.org,
	pabeni@redhat.com
Subject: Re: [PATCH v3 net-next 1/2] tcp: Don't drop SYN+ACK for simultaneous connect().
Date: Tue, 16 Jul 2024 22:04:25 +0200	[thread overview]
Message-ID: <31eb13bf-7ea9-436f-92a9-a8745ed86f9e@kernel.org> (raw)
In-Reply-To: <20240716192320.54815-1-kuniyu@amazon.com>

Hi Kuniyuki,

Thank you for your reply!

On 16/07/2024 21:23, Kuniyuki Iwashima wrote:
> Hi Matthieu,
> 
> From: Matthieu Baerts <matttbe@kernel.org>
> Date: Mon, 15 Jul 2024 17:58:49 +0200
>> Hi Kuniyuki,
>>
>> On 10/07/2024 19:12, Kuniyuki Iwashima wrote:
>>> RFC 9293 states that in the case of simultaneous connect(), the connection
>>> gets established when SYN+ACK is received. [0]
>>>
>>>       TCP Peer A                                       TCP Peer B
>>>
>>>   1.  CLOSED                                           CLOSED
>>>   2.  SYN-SENT     --> <SEQ=100><CTL=SYN>              ...
>>>   3.  SYN-RECEIVED <-- <SEQ=300><CTL=SYN>              <-- SYN-SENT
>>>   4.               ... <SEQ=100><CTL=SYN>              --> SYN-RECEIVED
>>>   5.  SYN-RECEIVED --> <SEQ=100><ACK=301><CTL=SYN,ACK> ...
>>>   6.  ESTABLISHED  <-- <SEQ=300><ACK=101><CTL=SYN,ACK> <-- SYN-RECEIVED
>>>   7.               ... <SEQ=100><ACK=301><CTL=SYN,ACK> --> ESTABLISHED
>>>
>>> However, since commit 0c24604b68fc ("tcp: implement RFC 5961 4.2"), such a
>>> SYN+ACK is dropped in tcp_validate_incoming() and responded with Challenge
>>> ACK.
>>>
>>> For example, the write() syscall in the following packetdrill script fails
>>> with -EAGAIN, and wrong SNMP stats get incremented.
>>>
>>>    0 socket(..., SOCK_STREAM|SOCK_NONBLOCK, IPPROTO_TCP) = 3
>>>   +0 connect(3, ..., ...) = -1 EINPROGRESS (Operation now in progress)
>>>
>>>   +0 > S  0:0(0) <mss 1460,sackOK,TS val 1000 ecr 0,nop,wscale 8>
>>>   +0 < S  0:0(0) win 1000 <mss 1000>
>>>   +0 > S. 0:0(0) ack 1 <mss 1460,sackOK,TS val 3308134035 ecr 0,nop,wscale 8>
>>>   +0 < S. 0:0(0) ack 1 win 1000
>>>
>>>   +0 write(3, ..., 100) = 100
>>>   +0 > P. 1:101(100) ack 1
>>>
>>>   --
>>>
>>>   # packetdrill cross-synack.pkt
>>>   cross-synack.pkt:13: runtime error in write call: Expected result 100 but got -1 with errno 11 (Resource temporarily unavailable)
>>>   # nstat
>>>   ...
>>>   TcpExtTCPChallengeACK           1                  0.0
>>>   TcpExtTCPSYNChallenge           1                  0.0
>>>
>>> The problem is that bpf_skops_established() is triggered by the Challenge
>>> ACK instead of SYN+ACK.  This causes the bpf prog to miss the chance to
>>> check if the peer supports a TCP option that is expected to be exchanged
>>> in SYN and SYN+ACK.
>>>
>>> Let's accept a bare SYN+ACK for active-open TCP_SYN_RECV sockets to avoid
>>> such a situation.
>>>
>>> Note that tcp_ack_snd_check() in tcp_rcv_state_process() is skipped not to
>>> send an unnecessary ACK, but this could be a bit risky for net.git, so this
>>> targets for net-next.
>>>
>>> Link: https://www.rfc-editor.org/rfc/rfc9293.html#section-3.5-7 [0]
>>> Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
>>
>> Thank you for having worked on this patch!
>>
>>> ---
>>>  net/ipv4/tcp_input.c | 9 +++++++++
>>>  1 file changed, 9 insertions(+)
>>>
>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>>> index 47dacb575f74..1eddb6b9fb2a 100644
>>> --- a/net/ipv4/tcp_input.c
>>> +++ b/net/ipv4/tcp_input.c
>>> @@ -5989,6 +5989,11 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
>>>  	 * RFC 5961 4.2 : Send a challenge ack
>>>  	 */
>>>  	if (th->syn) {
>>> +		if (sk->sk_state == TCP_SYN_RECV && sk->sk_socket && th->ack &&
>>> +		    TCP_SKB_CB(skb)->seq + 1 == TCP_SKB_CB(skb)->end_seq &&
>>> +		    TCP_SKB_CB(skb)->seq + 1 == tp->rcv_nxt &&
>>> +		    TCP_SKB_CB(skb)->ack_seq == tp->snd_nxt)
>>> +			goto pass;
>>>  syn_challenge:
>>>  		if (syn_inerr)
>>>  			TCP_INC_STATS(sock_net(sk), TCP_MIB_INERRS);
>>> @@ -5998,6 +6003,7 @@ static bool tcp_validate_incoming(struct sock *sk, struct sk_buff *skb,
>>>  		goto discard;
>>>  	}
>>>  
>>> +pass:
>>>  	bpf_skops_parse_hdr(sk, skb);
>>>  
>>>  	return true;
>>> @@ -6804,6 +6810,9 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
>>>  		tcp_fast_path_on(tp);
>>>  		if (sk->sk_shutdown & SEND_SHUTDOWN)
>>>  			tcp_shutdown(sk, SEND_SHUTDOWN);
>>> +
>>> +		if (sk->sk_socket)
>>> +			goto consume;
>>
>> It looks like this modification changes the behaviour for MPTCP Join
>> requests for listening sockets: when receiving the 3rd ACK of a request
>> adding a new path (MP_JOIN), sk->sk_socket will be set, and point to the
>> MPTCP sock that has been created when the MPTCP connection got created
>> before with the first path.
> 
> Thanks for catching this!
> 
> I completely missed how MPTCP sets sk->sk_socket before the 3rd ACK is
> processed.

No problem. That's a shame there was not a clear error in the selftests :)

> I debugged a bit and confirmed mptcp_stream_accept() sets
> the inflight subflow's sk->sk_socket with newsk->sk_socket.

Yes, that's correct.

>> This new 'goto' here will then skip the
>> process of the segment text (step 7) and not go through tcp_data_queue()
>> where the MPTCP options are validated, and some actions are triggered,
>> e.g. sending the MPJ 4th ACK [1].
>>
>> This doesn't fully break MPTCP, mainly the 4th MPJ ACK that will be
>> delayed,
> 
> Yes, the test failure depends on timing.  I only reproduced it by running
> the test many times on non-kvm qemu.

Thank you for having checked!

>> but it looks like it affects the MPTFO feature as well --
>> probably in case of retransmissions I suppose -- and being the reason
>> why the selftests started to be unstable the last few days [2].
>>
>> [1] https://datatracker.ietf.org/doc/html/rfc8684#fig_tokens
>> [2]
>> https://netdev.bots.linux.dev/contest.html?executor=vmksft-mptcp-dbg&test=mptcp-connect-sh
>>
>>
>> Looking at what this patch here is trying to fix, I wonder if it would
>> not be enough to apply this patch:
>>
>>> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
>>> index ff9ab3d01ced..ff981d7776c3 100644
>>> --- a/net/ipv4/tcp_input.c
>>> +++ b/net/ipv4/tcp_input.c
>>> @@ -6820,7 +6820,7 @@ tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb)
>>>                 if (sk->sk_shutdown & SEND_SHUTDOWN)
>>>                         tcp_shutdown(sk, SEND_SHUTDOWN);
>>>  
>>> -               if (sk->sk_socket)
>>> +               if (sk->sk_socket && !sk_is_mptcp(sk))
>>>                         goto consume;
>>>                 break;
>>>  
>>
>> But I still need to investigate how the issue that is being addressed by
>> your patch can be translated to the MPTCP case. I guess we could add
>> additional checks for MPTCP: new connection or additional path? etc. Or
>> maybe that's not needed.
> 
> My first intention was not to drop SYN+ACK in tcp_validate_incoming(),
> and the goto is added in v2, which is rather to be more compliant with
> RFC not to send an unnecessary ACK for simultaneous connect().
> 
> So, we could rewrite the condition as this,
> 
>   if (sk->sk_socket && !th->syn)

(Just to be sure, do you mean the opposite with th->syn?)

  if (sk->sk_socket && th->syn)
      goto consume;

That's a good idea!

I sent my patch a couple of minutes ago [1], then I saw your suggestion
here. It looks like it should work for the TFO case as well. Maybe your
suggestion is more generic and will cover more cases?

[1]
https://lore.kernel.org/all/20240716-upstream-net-next-20240716-tcp-3rd-ack-consume-sk_socket-v1-1-4e61d0b79233@kernel.org/

> but I think your patch is better to give a hint that MPTCP has a
> different logic.

Because TFO has also a different logic, it might be good to have a clear
comment about that.

> Also, a similar check done before the goto, and this could be
> improved ?
> 
>   if (sk->sk_socket)
>     sk_wake_async(sk, SOCK_WAKE_IO, POLL_OUT);

It is a bit late for me, but I think it can be kept as it is: for MPTCP,
it will not wake up the userspace as the subflow is managed by the
kernel. I would need to check if we could avoid that. Also, will this
wakeup not be useful for TFO?

Cheers,
Matt
-- 
Sponsored by the NGI0 Core fund.


  reply	other threads:[~2024-07-16 20:04 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-07-10 17:12 [PATCH v3 net-next 0/2] tcp: Make simultaneous connect() RFC-compliant Kuniyuki Iwashima
2024-07-10 17:12 ` [PATCH v3 net-next 1/2] tcp: Don't drop SYN+ACK for simultaneous connect() Kuniyuki Iwashima
2024-07-11 15:34   ` Eric Dumazet
2024-07-15 15:58   ` Matthieu Baerts
2024-07-16 19:23     ` Kuniyuki Iwashima
2024-07-16 20:04       ` Matthieu Baerts [this message]
2024-07-16 20:28         ` Kuniyuki Iwashima
2024-07-10 17:12 ` [PATCH v3 net-next 2/2] selftests: tcp: Remove broken SNMP assumptions for TCP AO self-connect tests Kuniyuki Iwashima
2024-07-13 22:30 ` [PATCH v3 net-next 0/2] tcp: Make simultaneous connect() RFC-compliant patchwork-bot+netdevbpf

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=31eb13bf-7ea9-436f-92a9-a8745ed86f9e@kernel.org \
    --to=matttbe@kernel.org \
    --cc=davem@davemloft.net \
    --cc=dsahern@kernel.org \
    --cc=edumazet@google.com \
    --cc=kuba@kernel.org \
    --cc=kuni1840@gmail.com \
    --cc=kuniyu@amazon.com \
    --cc=netdev@vger.kernel.org \
    --cc=pabeni@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).