netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH net v2] Fix race for duplicate reqsk on identical SYN
@ 2024-06-14 10:26 luoxuanqiang
  2024-06-14 10:54 ` Florian Westphal
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: luoxuanqiang @ 2024-06-14 10:26 UTC (permalink / raw)
  To: edumazet
  Cc: davem, dsahern, fw, kuba, linux-kernel, luoxuanqiang, netdev,
	pabeni, kuniyu, dccp

When bonding is configured in BOND_MODE_BROADCAST mode, if two identical
SYN packets are received at the same time and processed on different CPUs,
it can potentially create the same sk (sock) but two different reqsk
(request_sock) in tcp_conn_request().

These two different reqsk will respond with two SYNACK packets, and since
the generation of the seq (ISN) incorporates a timestamp, the final two
SYNACK packets will have different seq values.

The consequence is that when the Client receives and replies with an ACK
to the earlier SYNACK packet, we will reset(RST) it.

========================================================================

This behavior is consistently reproducible in my local setup,
which comprises:

                  | NETA1 ------ NETB1 |
PC_A --- bond --- |                    | --- bond --- PC_B
                  | NETA2 ------ NETB2 |

- PC_A is the Server and has two network cards, NETA1 and NETA2. I have
  bonded these two cards using BOND_MODE_BROADCAST mode and configured
  them to be handled by different CPU.

- PC_B is the Client, also equipped with two network cards, NETB1 and
  NETB2, which are also bonded and configured in BOND_MODE_BROADCAST mode.

If the client attempts a TCP connection to the server, it might encounter
a failure. Capturing packets from the server side reveals:

10.10.10.10.45182 > localhost: Flags [S], seq 320236027,
10.10.10.10.45182 > localhost: Flags [S], seq 320236027,
localhost > 10.10.10.10.45182: Flags [S.], seq 2967855116,
localhost > 10.10.10.10.45182: Flags [S.], seq 2967855123, <==
10.10.10.10.45182 > localhost: Flags [.], ack 4294967290,
10.10.10.10.45182 > localhost: Flags [.], ack 4294967290,
localhost > 10.10.10.10.45182: Flags [R], seq 2967855117, <==
localhost > 10.10.10.10.45182: Flags [R], seq 2967855117,

Two SYNACKs with different seq numbers are sent by localhost,
resulting in an anomaly.

========================================================================

The attempted solution is as follows:
In the tcp_conn_request(), while inserting reqsk into the ehash table,
it also checks if an entry already exists. If found, it avoids
reinsertion and releases it.

Simultaneously, In the reqsk_queue_hash_req(), the start of the
req->rsk_timer is adjusted to be after successful insertion.

Signed-off-by: luoxuanqiang <luoxuanqiang@kylinos.cn>
---
 include/net/inet_connection_sock.h |  2 +-
 net/dccp/ipv4.c                    |  2 +-
 net/dccp/ipv6.c                    |  2 +-
 net/ipv4/inet_connection_sock.c    | 15 +++++++++++----
 net/ipv4/tcp_input.c               | 11 ++++++++++-
 5 files changed, 24 insertions(+), 8 deletions(-)

diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 7d6b1254c92d..8773d161d184 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -264,7 +264,7 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
 				      struct request_sock *req,
 				      struct sock *child);
 void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
-				   unsigned long timeout);
+				   unsigned long timeout, bool *found_dup_sk);
 struct sock *inet_csk_complete_hashdance(struct sock *sk, struct sock *child,
 					 struct request_sock *req,
 					 bool own_req);
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index ff41bd6f99c3..13aafdeb9205 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -657,7 +657,7 @@ int dccp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
 	if (dccp_v4_send_response(sk, req))
 		goto drop_and_free;
 
-	inet_csk_reqsk_queue_hash_add(sk, req, DCCP_TIMEOUT_INIT);
+	inet_csk_reqsk_queue_hash_add(sk, req, DCCP_TIMEOUT_INIT, NULL);
 	reqsk_put(req);
 	return 0;
 
diff --git a/net/dccp/ipv6.c b/net/dccp/ipv6.c
index 85f4b8fdbe5e..493cdb12ce2b 100644
--- a/net/dccp/ipv6.c
+++ b/net/dccp/ipv6.c
@@ -400,7 +400,7 @@ static int dccp_v6_conn_request(struct sock *sk, struct sk_buff *skb)
 	if (dccp_v6_send_response(sk, req))
 		goto drop_and_free;
 
-	inet_csk_reqsk_queue_hash_add(sk, req, DCCP_TIMEOUT_INIT);
+	inet_csk_reqsk_queue_hash_add(sk, req, DCCP_TIMEOUT_INIT, NULL);
 	reqsk_put(req);
 	return 0;
 
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index d81f74ce0f02..045d0701acfd 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -1123,12 +1123,16 @@ static void reqsk_timer_handler(struct timer_list *t)
 }
 
 static void reqsk_queue_hash_req(struct request_sock *req,
-				 unsigned long timeout)
+				 unsigned long timeout, bool *found_dup_sk)
 {
+	inet_ehash_insert(req_to_sk(req), NULL, found_dup_sk);
+	if (found_dup_sk && *found_dup_sk)
+		return;
+
+	/* The timer needs to be setup after a successful insertion. */
 	timer_setup(&req->rsk_timer, reqsk_timer_handler, TIMER_PINNED);
 	mod_timer(&req->rsk_timer, jiffies + timeout);
 
-	inet_ehash_insert(req_to_sk(req), NULL, NULL);
 	/* before letting lookups find us, make sure all req fields
 	 * are committed to memory and refcnt initialized.
 	 */
@@ -1137,9 +1141,12 @@ static void reqsk_queue_hash_req(struct request_sock *req,
 }
 
 void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
-				   unsigned long timeout)
+				   unsigned long timeout, bool *found_dup_sk)
 {
-	reqsk_queue_hash_req(req, timeout);
+	reqsk_queue_hash_req(req, timeout, found_dup_sk);
+	if (found_dup_sk && *found_dup_sk)
+		return;
+
 	inet_csk_reqsk_queue_added(sk);
 }
 EXPORT_SYMBOL_GPL(inet_csk_reqsk_queue_hash_add);
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9c04a9c8be9d..49876477c2b9 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -7255,8 +7255,17 @@ int tcp_conn_request(struct request_sock_ops *rsk_ops,
 	} else {
 		tcp_rsk(req)->tfo_listener = false;
 		if (!want_cookie) {
+			bool found_dup_sk = false;
+
 			req->timeout = tcp_timeout_init((struct sock *)req);
-			inet_csk_reqsk_queue_hash_add(sk, req, req->timeout);
+			inet_csk_reqsk_queue_hash_add(sk, req, req->timeout,
+						      &found_dup_sk);
+
+			if (unlikely(found_dup_sk)) {
+				reqsk_free(req);
+				return 0;
+			}
+
 		}
 		af_ops->send_synack(sk, dst, &fl, req, &foc,
 				    !want_cookie ? TCP_SYNACK_NORMAL :
-- 
2.25.1


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH net v2] Fix race for duplicate reqsk on identical SYN
  2024-06-14 10:26 [PATCH net v2] Fix race for duplicate reqsk on identical SYN luoxuanqiang
@ 2024-06-14 10:54 ` Florian Westphal
  2024-06-14 12:42   ` luoxuanqiang
  2024-06-16 23:45 ` alexandre.ferrieux
       [not found] ` <1718586352627144.1.seg@mailgw.kylinos.cn>
  2 siblings, 1 reply; 11+ messages in thread
From: Florian Westphal @ 2024-06-14 10:54 UTC (permalink / raw)
  To: luoxuanqiang
  Cc: edumazet, davem, dsahern, fw, kuba, linux-kernel, netdev, pabeni,
	kuniyu, dccp

luoxuanqiang <luoxuanqiang@kylinos.cn> wrote:
>  include/net/inet_connection_sock.h |  2 +-
>  net/dccp/ipv4.c                    |  2 +-
>  net/dccp/ipv6.c                    |  2 +-
>  net/ipv4/inet_connection_sock.c    | 15 +++++++++++----
>  net/ipv4/tcp_input.c               | 11 ++++++++++-
>  5 files changed, 24 insertions(+), 8 deletions(-)
> 
> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> index 7d6b1254c92d..8773d161d184 100644
> --- a/include/net/inet_connection_sock.h
> +++ b/include/net/inet_connection_sock.h
> @@ -264,7 +264,7 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
>  				      struct request_sock *req,
>  				      struct sock *child);
>  void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
> -				   unsigned long timeout);
> +				   unsigned long timeout, bool *found_dup_sk);

Nit:

I think it would be preferrable to change retval to bool rather than
bool *found_dup_sk extra arg, so one can do

bool inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
  				   unsigned long timeout)
{
	if (!reqsk_queue_hash_req(req, timeout))
		return false;

i.e. let retval indicate wheter reqsk was inserted or not.

Patch looks good to me otherwise.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net v2] Fix race for duplicate reqsk on identical SYN
  2024-06-14 10:54 ` Florian Westphal
@ 2024-06-14 12:42   ` luoxuanqiang
  2024-06-14 22:24     ` Kuniyuki Iwashima
  0 siblings, 1 reply; 11+ messages in thread
From: luoxuanqiang @ 2024-06-14 12:42 UTC (permalink / raw)
  To: Florian Westphal
  Cc: edumazet, davem, dsahern, kuba, linux-kernel, netdev, pabeni,
	kuniyu, dccp


在 2024/6/14 18:54, Florian Westphal 写道:
> luoxuanqiang <luoxuanqiang@kylinos.cn> wrote:
>>   include/net/inet_connection_sock.h |  2 +-
>>   net/dccp/ipv4.c                    |  2 +-
>>   net/dccp/ipv6.c                    |  2 +-
>>   net/ipv4/inet_connection_sock.c    | 15 +++++++++++----
>>   net/ipv4/tcp_input.c               | 11 ++++++++++-
>>   5 files changed, 24 insertions(+), 8 deletions(-)
>>
>> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
>> index 7d6b1254c92d..8773d161d184 100644
>> --- a/include/net/inet_connection_sock.h
>> +++ b/include/net/inet_connection_sock.h
>> @@ -264,7 +264,7 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
>>   				      struct request_sock *req,
>>   				      struct sock *child);
>>   void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
>> -				   unsigned long timeout);
>> +				   unsigned long timeout, bool *found_dup_sk);
> Nit:
>
> I think it would be preferrable to change retval to bool rather than
> bool *found_dup_sk extra arg, so one can do
>
> bool inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
>    				   unsigned long timeout)
> {
> 	if (!reqsk_queue_hash_req(req, timeout))
> 		return false;
>
> i.e. let retval indicate wheter reqsk was inserted or not.
>
> Patch looks good to me otherwise.

Thank you for your confirmation!

Regarding your suggestion, I had considered it before,
but besides tcp_conn_request() calling inet_csk_reqsk_queue_hash_add(),
dccp_v4(v6)_conn_request() also calls it. However, there is no
consideration for a failed insertion within that function, so it's
reasonable to let the caller decide whether to check for duplicate
reqsk.

The purpose of my modification this time is solely to confirm if a
reqsk for the same connection has already been inserted into the ehash.
If the insertion fails, inet_ehash_insert() will handle the
non-insertion gracefully, and I only need to release the duplicate
reqsk. I believe this change is minimal and effective.

Those are my considerations.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net v2] Fix race for duplicate reqsk on identical SYN
  2024-06-14 12:42   ` luoxuanqiang
@ 2024-06-14 22:24     ` Kuniyuki Iwashima
  2024-06-15  6:40       ` Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread
From: Kuniyuki Iwashima @ 2024-06-14 22:24 UTC (permalink / raw)
  To: luoxuanqiang
  Cc: davem, dccp, dsahern, edumazet, fw, kuba, kuniyu, linux-kernel,
	netdev, pabeni

From: luoxuanqiang <luoxuanqiang@kylinos.cn>
Date: Fri, 14 Jun 2024 20:42:07 +0800
> 在 2024/6/14 18:54, Florian Westphal 写道:
> > luoxuanqiang <luoxuanqiang@kylinos.cn> wrote:
> >>   include/net/inet_connection_sock.h |  2 +-
> >>   net/dccp/ipv4.c                    |  2 +-
> >>   net/dccp/ipv6.c                    |  2 +-
> >>   net/ipv4/inet_connection_sock.c    | 15 +++++++++++----
> >>   net/ipv4/tcp_input.c               | 11 ++++++++++-
> >>   5 files changed, 24 insertions(+), 8 deletions(-)
> >>
> >> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> >> index 7d6b1254c92d..8773d161d184 100644
> >> --- a/include/net/inet_connection_sock.h
> >> +++ b/include/net/inet_connection_sock.h
> >> @@ -264,7 +264,7 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> >>   				      struct request_sock *req,
> >>   				      struct sock *child);
> >>   void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
> >> -				   unsigned long timeout);
> >> +				   unsigned long timeout, bool *found_dup_sk);
> > Nit:
> >
> > I think it would be preferrable to change retval to bool rather than
> > bool *found_dup_sk extra arg, so one can do

+1


> >
> > bool inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
> >    				   unsigned long timeout)
> > {
> > 	if (!reqsk_queue_hash_req(req, timeout))
> > 		return false;
> >
> > i.e. let retval indicate wheter reqsk was inserted or not.
> >
> > Patch looks good to me otherwise.
> 
> Thank you for your confirmation!
> 
> Regarding your suggestion, I had considered it before,
> but besides tcp_conn_request() calling inet_csk_reqsk_queue_hash_add(),
> dccp_v4(v6)_conn_request() also calls it. However, there is no
> consideration for a failed insertion within that function, so it's
> reasonable to let the caller decide whether to check for duplicate
> reqsk.

I guess you followed 01770a1661657 where found_dup_sk was introduced,
but note that the commit is specific to TCP SYN Cookie and TCP Fast Open
and DCCP is not related.

Then, own_req is common to TCP and DCCP, so found_dup_sk was added as an
additional argument.

However, another similar commit 5e0724d027f05 actually added own_req check
in DCCP path.

I personally would'nt care if DCCP was not changed to handle such a
failure because DCCP will be removed next year, but I still prefer
Florian's suggestion.

Thanks

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net v2] Fix race for duplicate reqsk on identical SYN
  2024-06-14 22:24     ` Kuniyuki Iwashima
@ 2024-06-15  6:40       ` Eric Dumazet
  2024-06-17  2:01         ` luoxuanqiang
  2024-06-17  8:07         ` luoxuanqiang
  0 siblings, 2 replies; 11+ messages in thread
From: Eric Dumazet @ 2024-06-15  6:40 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: luoxuanqiang, davem, dccp, dsahern, fw, kuba, linux-kernel,
	netdev, pabeni

On Sat, Jun 15, 2024 at 12:24 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>
> From: luoxuanqiang <luoxuanqiang@kylinos.cn>
> Date: Fri, 14 Jun 2024 20:42:07 +0800
> > 在 2024/6/14 18:54, Florian Westphal 写道:
> > > luoxuanqiang <luoxuanqiang@kylinos.cn> wrote:
> > >>   include/net/inet_connection_sock.h |  2 +-
> > >>   net/dccp/ipv4.c                    |  2 +-
> > >>   net/dccp/ipv6.c                    |  2 +-
> > >>   net/ipv4/inet_connection_sock.c    | 15 +++++++++++----
> > >>   net/ipv4/tcp_input.c               | 11 ++++++++++-
> > >>   5 files changed, 24 insertions(+), 8 deletions(-)
> > >>
> > >> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
> > >> index 7d6b1254c92d..8773d161d184 100644
> > >> --- a/include/net/inet_connection_sock.h
> > >> +++ b/include/net/inet_connection_sock.h
> > >> @@ -264,7 +264,7 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
> > >>                                  struct request_sock *req,
> > >>                                  struct sock *child);
> > >>   void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
> > >> -                             unsigned long timeout);
> > >> +                             unsigned long timeout, bool *found_dup_sk);
> > > Nit:
> > >
> > > I think it would be preferrable to change retval to bool rather than
> > > bool *found_dup_sk extra arg, so one can do
>
> +1
>
>
> > >
> > > bool inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
> > >                                unsigned long timeout)
> > > {
> > >     if (!reqsk_queue_hash_req(req, timeout))
> > >             return false;
> > >
> > > i.e. let retval indicate wheter reqsk was inserted or not.
> > >
> > > Patch looks good to me otherwise.
> >
> > Thank you for your confirmation!
> >
> > Regarding your suggestion, I had considered it before,
> > but besides tcp_conn_request() calling inet_csk_reqsk_queue_hash_add(),
> > dccp_v4(v6)_conn_request() also calls it. However, there is no
> > consideration for a failed insertion within that function, so it's
> > reasonable to let the caller decide whether to check for duplicate
> > reqsk.
>
> I guess you followed 01770a1661657 where found_dup_sk was introduced,
> but note that the commit is specific to TCP SYN Cookie and TCP Fast Open
> and DCCP is not related.
>
> Then, own_req is common to TCP and DCCP, so found_dup_sk was added as an
> additional argument.
>
> However, another similar commit 5e0724d027f05 actually added own_req check
> in DCCP path.
>
> I personally would'nt care if DCCP was not changed to handle such a
> failure because DCCP will be removed next year, but I still prefer
> Florian's suggestion.
>

Other things to consider :

- I presume this patch targets net tree, and luoxuanqiang needs the
fix to reach stable trees.

- This means a Fixes: tag is needed

- This also means that we should favor a patch with no or trivial
conflicts for stable backports.

Should the patch target the net-next tree, then the requirements can
be different.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net v2] Fix race for duplicate reqsk on identical SYN
  2024-06-14 10:26 [PATCH net v2] Fix race for duplicate reqsk on identical SYN luoxuanqiang
  2024-06-14 10:54 ` Florian Westphal
@ 2024-06-16 23:45 ` alexandre.ferrieux
       [not found] ` <1718586352627144.1.seg@mailgw.kylinos.cn>
  2 siblings, 0 replies; 11+ messages in thread
From: alexandre.ferrieux @ 2024-06-16 23:45 UTC (permalink / raw)
  To: luoxuanqiang, edumazet; +Cc: davem, dsahern, fw, kuba, netdev, pabeni, kuniyu

On 14/06/2024 12:26, luoxuanqiang wrote:
> When bonding is configured in BOND_MODE_BROADCAST mode, if two identical
> SYN packets are received at the same time and processed on different CPUs,
> it can potentially create the same sk (sock) but two different reqsk
> (request_sock) in tcp_conn_request().
>
> These two different reqsk will respond with two SYNACK packets, and since
> the generation of the seq (ISN) incorporates a timestamp, the final two
> SYNACK packets will have different seq values.
>
> The consequence is that when the Client receives and replies with an ACK
> to the earlier SYNACK packet, we will reset(RST) it.
>
> ========================================================================
This is close, but not identical, to a race we observed on a *single* CPU with
the TPROXY iptables target, in the following situation:

  - two identical SYNs, sent one second apart from the same client socket,
    arrive back-to-back on the interface (due to network jitter)

  - they happen to be handled in the same batch of packet from one softirq
    name_your_nic_poll()

  - there, two loops run sequentially: one for netfilter (doing TPROXY), one
    for the network stack (doing TCP processing)

  - the first generates two distinct contexts for the two SYNs

  - the second respects these contexts and never gets a chance to merge them

The result is exactly as you describe, but in this case there is no need for 
bonding,
and everything happens in one single CPU, which is pretty ironic for a race.
My uneducated feeling is that the two loops are the cause of a simulated
parallelism, yielding the race. If each packet of the batch was handled
"to completion" (full netfilter handling followed immediately by full network
stack ingestion), the problem would not exist.

____________________________________________________________________________________________________________
Ce message et ses pieces jointes peuvent contenir des informations confidentielles ou privilegiees et ne doivent donc
pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce message par erreur, veuillez le signaler
a l'expediteur et le detruire ainsi que les pieces jointes. Les messages electroniques etant susceptibles d'alteration,
Orange decline toute responsabilite si ce message a ete altere, deforme ou falsifie. Merci.

This message and its attachments may contain confidential or privileged information that may be protected by law;
they should not be distributed, used or copied without authorisation.
If you have received this email in error, please notify the sender and delete this message and its attachments.
As emails may be altered, Orange is not liable for messages that have been modified, changed or falsified.
Thank you.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net v2] Fix race for duplicate reqsk on identical SYN
  2024-06-15  6:40       ` Eric Dumazet
@ 2024-06-17  2:01         ` luoxuanqiang
  2024-06-17  8:07         ` luoxuanqiang
  1 sibling, 0 replies; 11+ messages in thread
From: luoxuanqiang @ 2024-06-17  2:01 UTC (permalink / raw)
  To: Eric Dumazet, Kuniyuki Iwashima
  Cc: davem, dccp, dsahern, fw, kuba, linux-kernel, netdev, pabeni


在 2024/6/15 14:40, Eric Dumazet 写道:
> On Sat, Jun 15, 2024 at 12:24 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
>> From: luoxuanqiang <luoxuanqiang@kylinos.cn>
>> Date: Fri, 14 Jun 2024 20:42:07 +0800
>>> 在 2024/6/14 18:54, Florian Westphal 写道:
>>>> luoxuanqiang <luoxuanqiang@kylinos.cn> wrote:
>>>>>    include/net/inet_connection_sock.h |  2 +-
>>>>>    net/dccp/ipv4.c                    |  2 +-
>>>>>    net/dccp/ipv6.c                    |  2 +-
>>>>>    net/ipv4/inet_connection_sock.c    | 15 +++++++++++----
>>>>>    net/ipv4/tcp_input.c               | 11 ++++++++++-
>>>>>    5 files changed, 24 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
>>>>> index 7d6b1254c92d..8773d161d184 100644
>>>>> --- a/include/net/inet_connection_sock.h
>>>>> +++ b/include/net/inet_connection_sock.h
>>>>> @@ -264,7 +264,7 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
>>>>>                                   struct request_sock *req,
>>>>>                                   struct sock *child);
>>>>>    void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
>>>>> -                             unsigned long timeout);
>>>>> +                             unsigned long timeout, bool *found_dup_sk);
>>>> Nit:
>>>>
>>>> I think it would be preferrable to change retval to bool rather than
>>>> bool *found_dup_sk extra arg, so one can do
>> +1
>>
>>
>>>> bool inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
>>>>                                 unsigned long timeout)
>>>> {
>>>>      if (!reqsk_queue_hash_req(req, timeout))
>>>>              return false;
>>>>
>>>> i.e. let retval indicate wheter reqsk was inserted or not.
>>>>
>>>> Patch looks good to me otherwise.
>>> Thank you for your confirmation!
>>>
>>> Regarding your suggestion, I had considered it before,
>>> but besides tcp_conn_request() calling inet_csk_reqsk_queue_hash_add(),
>>> dccp_v4(v6)_conn_request() also calls it. However, there is no
>>> consideration for a failed insertion within that function, so it's
>>> reasonable to let the caller decide whether to check for duplicate
>>> reqsk.
>> I guess you followed 01770a1661657 where found_dup_sk was introduced,
>> but note that the commit is specific to TCP SYN Cookie and TCP Fast Open
>> and DCCP is not related.
>>
>> Then, own_req is common to TCP and DCCP, so found_dup_sk was added as an
>> additional argument.
>>
>> However, another similar commit 5e0724d027f05 actually added own_req check
>> in DCCP path.
>>
>> I personally would'nt care if DCCP was not changed to handle such a
>> failure because DCCP will be removed next year, but I still prefer
>> Florian's suggestion.
>>
> Other things to consider :
>
> - I presume this patch targets net tree, and luoxuanqiang needs the
> fix to reach stable trees.
>
> - This means a Fixes: tag is needed
>
> - This also means that we should favor a patch with no or trivial
> conflicts for stable backports.
>
> Should the patch target the net-next tree, then the requirements can
> be different.

Hello Eric and Kuniyuk,

Thank you for the information!

I've tested the kernel versions 4.19 and 6.10, and they both have
similar issues (I suspect this problem has been around for quite some
time). My intention is to propose a fix to the more stable branches as
soon as possible to cover a wider range. Like Eric mentioned, I hope to
minimize conflicts, so I expect to keep the original DCCP logic intact
and refer to the check for found_dup_sk in 01770a1661657. For DCCP, if
insertion into ehash fails, we might also need to consider handling
rsk_refcnt, as tcp_conn_request() requires rsk_refcnt to be 0 to release
reqsk.

Of course, if DCCP will be removed from net-next, I agree with
Kuniyuki and Florian's suggestions and will envision a better commit
content.

BRs!


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net v2] Fix race for duplicate reqsk on identical SYN
       [not found] ` <1718586352627144.1.seg@mailgw.kylinos.cn>
@ 2024-06-17  2:53   ` luoxuanqiang
  2024-06-17 14:44     ` alexandre.ferrieux
  0 siblings, 1 reply; 11+ messages in thread
From: luoxuanqiang @ 2024-06-17  2:53 UTC (permalink / raw)
  To: alexandre.ferrieux, edumazet
  Cc: davem, dsahern, fw, kuba, netdev, pabeni, kuniyu


在 2024/6/17 07:45, alexandre.ferrieux@orange.com 写道:
> On 14/06/2024 12:26, luoxuanqiang wrote:
>> When bonding is configured in BOND_MODE_BROADCAST mode, if two identical
>> SYN packets are received at the same time and processed on different 
>> CPUs,
>> it can potentially create the same sk (sock) but two different reqsk
>> (request_sock) in tcp_conn_request().
>>
>> These two different reqsk will respond with two SYNACK packets, and 
>> since
>> the generation of the seq (ISN) incorporates a timestamp, the final two
>> SYNACK packets will have different seq values.
>>
>> The consequence is that when the Client receives and replies with an ACK
>> to the earlier SYNACK packet, we will reset(RST) it.
>>
>> ========================================================================
> This is close, but not identical, to a race we observed on a *single* 
> CPU with
> the TPROXY iptables target, in the following situation:
>
>  - two identical SYNs, sent one second apart from the same client socket,
>    arrive back-to-back on the interface (due to network jitter)
>
>  - they happen to be handled in the same batch of packet from one softirq
>    name_your_nic_poll()
>
>  - there, two loops run sequentially: one for netfilter (doing 
> TPROXY), one
>    for the network stack (doing TCP processing)
>
>  - the first generates two distinct contexts for the two SYNs
>
>  - the second respects these contexts and never gets a chance to merge 
> them
>
> The result is exactly as you describe, but in this case there is no 
> need for bonding,
> and everything happens in one single CPU, which is pretty ironic for a 
> race.
> My uneducated feeling is that the two loops are the cause of a simulated
> parallelism, yielding the race. If each packet of the batch was handled
> "to completion" (full netfilter handling followed immediately by full 
> network
> stack ingestion), the problem would not exist.

Based on your explanation, I believe a
similar issue can occur on a single CPU if two SYN packets are processed
  closely enough. However, apart from using bond3 mode and having them
processed on different CPUs to facilitate reproducibility, I haven't
found a good way to replicate it.

Could you please provide a more practical example or detailed test
steps to help me understand the reproduction scenario you mentioned?
Thank you very much!


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net v2] Fix race for duplicate reqsk on identical SYN
  2024-06-15  6:40       ` Eric Dumazet
  2024-06-17  2:01         ` luoxuanqiang
@ 2024-06-17  8:07         ` luoxuanqiang
  1 sibling, 0 replies; 11+ messages in thread
From: luoxuanqiang @ 2024-06-17  8:07 UTC (permalink / raw)
  To: Eric Dumazet, Kuniyuki Iwashima
  Cc: davem, dccp, dsahern, fw, kuba, linux-kernel, netdev, pabeni


在 2024/6/15 14:40, Eric Dumazet 写道:
> On Sat, Jun 15, 2024 at 12:24 AM Kuniyuki Iwashima <kuniyu@amazon.com> wrote:
> 你好 Eric和Kuniyuk,
>> From: luoxuanqiang <luoxuanqiang@kylinos.cn>
>> Date: Fri, 14 Jun 2024 20:42:07 +0800
>>> 在 2024/6/14 18:54, Florian Westphal 写道:
>>>> luoxuanqiang <luoxuanqiang@kylinos.cn> wrote:
>>>>>    include/net/inet_connection_sock.h |  2 +-
>>>>>    net/dccp/ipv4.c                    |  2 +-
>>>>>    net/dccp/ipv6.c                    |  2 +-
>>>>>    net/ipv4/inet_connection_sock.c    | 15 +++++++++++----
>>>>>    net/ipv4/tcp_input.c               | 11 ++++++++++-
>>>>>    5 files changed, 24 insertions(+), 8 deletions(-)
>>>>>
>>>>> diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
>>>>> index 7d6b1254c92d..8773d161d184 100644
>>>>> --- a/include/net/inet_connection_sock.h
>>>>> +++ b/include/net/inet_connection_sock.h
>>>>> @@ -264,7 +264,7 @@ struct sock *inet_csk_reqsk_queue_add(struct sock *sk,
>>>>>                                   struct request_sock *req,
>>>>>                                   struct sock *child);
>>>>>    void inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
>>>>> -                             unsigned long timeout);
>>>>> +                             unsigned long timeout, bool *found_dup_sk);
>>>> Nit:
>>>>
>>>> I think it would be preferrable to change retval to bool rather than
>>>> bool *found_dup_sk extra arg, so one can do
>> +1
>>
>>
>>>> bool inet_csk_reqsk_queue_hash_add(struct sock *sk, struct request_sock *req,
>>>>                                 unsigned long timeout)
>>>> {
>>>>      if (!reqsk_queue_hash_req(req, timeout))
>>>>              return false;
>>>>
>>>> i.e. let retval indicate wheter reqsk was inserted or not.
>>>>
>>>> Patch looks good to me otherwise.
>>> Thank you for your confirmation!
>>>
>>> Regarding your suggestion, I had considered it before,
>>> but besides tcp_conn_request() calling inet_csk_reqsk_queue_hash_add(),
>>> dccp_v4(v6)_conn_request() also calls it. However, there is no
>>> consideration for a failed insertion within that function, so it's
>>> reasonable to let the caller decide whether to check for duplicate
>>> reqsk.
>> I guess you followed 01770a1661657 where found_dup_sk was introduced,
>> but note that the commit is specific to TCP SYN Cookie and TCP Fast Open
>> and DCCP is not related.
>>
>> Then, own_req is common to TCP and DCCP, so found_dup_sk was added as an
>> additional argument.
>>
>> However, another similar commit 5e0724d027f05 actually added own_req check
>> in DCCP path.
>>
>> I personally would'nt care if DCCP was not changed to handle such a
>> failure because DCCP will be removed next year, but I still prefer
>> Florian's suggestion.
>>
> Other things to consider :
>
> - I presume this patch targets net tree, and luoxuanqiang needs the
> fix to reach stable trees.
>
> - This means a Fixes: tag is needed
>
> - This also means that we should favor a patch with no or trivial
> conflicts for stable backports.
>
> Should the patch target the net-next tree, then the requirements can
> be different.

Hi Kuniyuk and Florian,

I've created version 3 based on your suggestions, but I've kept the use
of 'found_dup_sk' since we need to pass NULL in DCCP to maintain its
logic unchanged. Could you please review this update and let me know if
it's okay? Thank you!

BRs!


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net v2] Fix race for duplicate reqsk on identical SYN
  2024-06-17  2:53   ` luoxuanqiang
@ 2024-06-17 14:44     ` alexandre.ferrieux
  2024-08-25 12:24       ` alexandre.ferrieux
  0 siblings, 1 reply; 11+ messages in thread
From: alexandre.ferrieux @ 2024-06-17 14:44 UTC (permalink / raw)
  To: luoxuanqiang, edumazet; +Cc: davem, dsahern, fw, kuba, netdev, pabeni, kuniyu

On 17/06/2024 04:53, luoxuanqiang wrote:
> 
> 在 2024/6/17 07:45, alexandre.ferrieux@orange.com 写道:
>> On 14/06/2024 12:26, luoxuanqiang wrote:
>>> When bonding is configured in BOND_MODE_BROADCAST mode, if two identical
>>> SYN packets are received at the same time and processed on different CPUs,
>>> it can potentially create the same sk (sock) but two different reqsk
>>> (request_sock) in tcp_conn_request().
>>>
>>> These two different reqsk will respond with two SYNACK packets, and since
>>> the generation of the seq (ISN) incorporates a timestamp, the final two
>>> SYNACK packets will have different seq values.
>>>
>>> The consequence is that when the Client receives and replies with an ACK
>>> to the earlier SYNACK packet, we will reset(RST) it.
>>>
>>> ========================================================================
>> This is close, but not identical, to a race we observed on a *single* CPU with
>> the TPROXY iptables target, in the following situation:
>>
>>  - two identical SYNs, sent one second apart from the same client socket,
>>    arrive back-to-back on the interface (due to network jitter)
>>
>>  - they happen to be handled in the same batch of packet from one softirq
>>    name_your_nic_poll()
>>
>>  - there, two loops run sequentially: one for netfilter (doing TPROXY), one
>>    for the network stack (doing TCP processing)
>>
>>  - the first generates two distinct contexts for the two SYNs
>>
>>  - the second respects these contexts and never gets a chance to merge them
>>
>> The result is exactly as you describe, but in this case there is no need for 
>> bonding,
>> and everything happens in one single CPU, which is pretty ironic for a race.
>> My uneducated feeling is that the two loops are the cause of a simulated
>> parallelism, yielding the race. If each packet of the batch was handled
>> "to completion" (full netfilter handling followed immediately by full network
>> stack ingestion), the problem would not exist.
> 
> Based on your explanation, I believe a
> similar issue can occur on a single CPU if two SYN packets are processed
>   closely enough. However, apart from using bond3 mode and having them
> processed on different CPUs to facilitate reproducibility, I haven't
> found a good way to replicate it.
> 
> Could you please provide a more practical example or detailed test
> steps to help me understand the reproduction scenario you mentioned?
> Thank you very much!

To reproduce in my case, I just need the two SYNs to arrive back-to-back in the 
ingress buffer and get in the same softirq run. To reach this goal easily, you 
can set the interrupt coalescence to a large value (like several milliseconds), 
and on the emitter side, send them in rapid sequence from userland. If that's 
not enough, you can just send one and duplicate it with TEE.

Then, if the packets are naturally aimed at the host (normal INPUT chain), I 
can't see the problem (as could be expected as 99.9999% of webservers do just 
this). Quite clearly, tcp_v4_rcv() does a good job in this case and is able to 
link the second packet to the context (reqsk?) of the first one.

But, if packets are "in transit", in a transparent proxying context, with the 
TPROXY target doing a redirection to the local listener, the race happens 
deterministically: I've even managed to squeeze 6 or 7 duplicate packets in the 
softirq run, and all of them get a different ISN !!!

In summary, the minimal setup is just:

    - ethtool -C $ITF rx-usecs 30000
    - a listener bound on port $PO
    - iptables -t mangle -A PREROUTING -i $ITF -p tcp -j TPROXY --on-port $PO 
--on-ip 0.0.0.0


And to get to the specifics, I have the impression that  in  ip_sublist_rcv(), 
the "two-pass" method of calling first NF_HOOK_LIST(), then 
ip_list_rcv_finish(), gets confused as the first pass (NF_HOOK_LIST calling 
TPROXY) does a "half-job" of attaching a context, making all of them different, 
while the second pass retrieves these contexts and doesn't try to "merge" them 
when needed:

   static void ip_sublist_rcv(...)
   {
     NF_HOOK_LIST(NFPROTO_IPV4, NF_INET_PRE_ROUTING, net, NULL,
                  head, dev, NULL, ip_rcv_finish);
     ip_list_rcv_finish(net, NULL, head);
   }

My naive impression is that reducing the "batch" size to 1 would do the job.
In other words, "run each packet to completion", netfilter *and* ip_*_rcv.
But I lack the vision of the big picture leading to the current choice.
Thanks in advance for shedding light on this :)

-Alex

____________________________________________________________________________________________________________
Ce message et ses pieces jointes peuvent contenir des informations confidentielles ou privilegiees et ne doivent donc
pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce message par erreur, veuillez le signaler
a l'expediteur et le detruire ainsi que les pieces jointes. Les messages electroniques etant susceptibles d'alteration,
Orange decline toute responsabilite si ce message a ete altere, deforme ou falsifie. Merci.

This message and its attachments may contain confidential or privileged information that may be protected by law;
they should not be distributed, used or copied without authorisation.
If you have received this email in error, please notify the sender and delete this message and its attachments.
As emails may be altered, Orange is not liable for messages that have been modified, changed or falsified.
Thank you.

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH net v2] Fix race for duplicate reqsk on identical SYN
  2024-06-17 14:44     ` alexandre.ferrieux
@ 2024-08-25 12:24       ` alexandre.ferrieux
  0 siblings, 0 replies; 11+ messages in thread
From: alexandre.ferrieux @ 2024-08-25 12:24 UTC (permalink / raw)
  To: luoxuanqiang, edumazet; +Cc: davem, dsahern, fw, kuba, netdev, pabeni, kuniyu

On 17/06/2024 16:44, Alexandre Ferrieux wrote:
> On 17/06/2024 04:53, luoxuanqiang wrote:
>>
>> 在 2024/6/17 07:45, alexandre.ferrieux@orange.com 写道:
>>> On 14/06/2024 12:26, luoxuanqiang wrote:
>>>> When bonding is configured in BOND_MODE_BROADCAST mode, if two identical
>>>> SYN packets are received at the same time and processed on different CPUs,
>>>> it can potentially create the same sk (sock) but two different reqsk
>>>> (request_sock) in tcp_conn_request().
>>>>
>>>> These two different reqsk will respond with two SYNACK packets, and since
>>>> the generation of the seq (ISN) incorporates a timestamp, the final two
>>>> SYNACK packets will have different seq values.
>>>>
>>>> The consequence is that when the Client receives and replies with an ACK
>>>> to the earlier SYNACK packet, we will reset(RST) it.
>>>>
>>>> ========================================================================
>>> This is close, but not identical, to a race we observed on a *single* CPU with
>>> the TPROXY iptables target, in the following situation:
>>>
>>>  - two identical SYNs, sent one second apart from the same client socket,
>>>    arrive back-to-back on the interface (due to network jitter)
>>>
>>>  - they happen to be handled in the same batch of packet from one softirq
>>>    name_your_nic_poll()
>>>
>>>  - there, two loops run sequentially: one for netfilter (doing TPROXY), one
>>>    for the network stack (doing TCP processing)
>>>
>>>  - the first generates two distinct contexts for the two SYNs
>>>
>>>  - the second respects these contexts and never gets a chance to merge them
>>>
>>> The result is exactly as you describe, but in this case there is no need for 
>>> bonding,
>>> and everything happens in one single CPU, which is pretty ironic for a race.
>>> My uneducated feeling is that the two loops are the cause of a simulated
>>> parallelism, yielding the race. If each packet of the batch was handled
>>> "to completion" (full netfilter handling followed immediately by full network
>>> stack ingestion), the problem would not exist.
>>
>> Based on your explanation, I believe a
>> similar issue can occur on a single CPU if two SYN packets are processed
>>   closely enough. However, apart from using bond3 mode and having them
>> processed on different CPUs to facilitate reproducibility, I haven't
>> found a good way to replicate it.
>>
>> Could you please provide a more practical example or detailed test
>> steps to help me understand the reproduction scenario you mentioned?
>> Thank you very much!
> 
> To reproduce in my case, I just need the two SYNs to arrive back-to-back in the 
> ingress buffer and get in the same softirq run. To reach this goal easily, you 
> can set the interrupt coalescence to a large value (like several milliseconds), 
> and on the emitter side, send them in rapid sequence from userland. If that's 
> not enough, you can just send one and duplicate it with TEE.

Good news: as I suspected, your fix (ff46e3b44219 shipped in 6.10) DOES solve my 
problem too !

As a consequence, this means the single-CPU scenario was exposed too, through 
netfilter's peculiar "breadth-first" iteration approach. This gives extra weight 
to the importance of your work.

TL;DR: thanks, kudos, congrats, and thanks !!!
____________________________________________________________________________________________________________
Ce message et ses pieces jointes peuvent contenir des informations confidentielles ou privilegiees et ne doivent donc
pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce message par erreur, veuillez le signaler
a l'expediteur et le detruire ainsi que les pieces jointes. Les messages electroniques etant susceptibles d'alteration,
Orange decline toute responsabilite si ce message a ete altere, deforme ou falsifie. Merci.

This message and its attachments may contain confidential or privileged information that may be protected by law;
they should not be distributed, used or copied without authorisation.
If you have received this email in error, please notify the sender and delete this message and its attachments.
As emails may be altered, Orange is not liable for messages that have been modified, changed or falsified.
Thank you.

^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2024-08-25 12:26 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-06-14 10:26 [PATCH net v2] Fix race for duplicate reqsk on identical SYN luoxuanqiang
2024-06-14 10:54 ` Florian Westphal
2024-06-14 12:42   ` luoxuanqiang
2024-06-14 22:24     ` Kuniyuki Iwashima
2024-06-15  6:40       ` Eric Dumazet
2024-06-17  2:01         ` luoxuanqiang
2024-06-17  8:07         ` luoxuanqiang
2024-06-16 23:45 ` alexandre.ferrieux
     [not found] ` <1718586352627144.1.seg@mailgw.kylinos.cn>
2024-06-17  2:53   ` luoxuanqiang
2024-06-17 14:44     ` alexandre.ferrieux
2024-08-25 12:24       ` alexandre.ferrieux

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).