Re: [PATCH net] inet: Avoid established lookup missing active sk

netdev.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: luoxuanqiang <xuanqiang.luo@linux.dev>
To: Kuniyuki Iwashima <kuniyu@google.com>
Cc: edumazet@google.com, davem@davemloft.net, kuba@kernel.org,
	kernelxing@tencent.com, netdev@vger.kernel.org,
	Xuanqiang Luo <luoxuanqiang@kylinos.cn>
Subject: Re: [PATCH net] inet: Avoid established lookup missing active sk
Date: Wed, 3 Sep 2025 15:54:41 +0800	[thread overview]
Message-ID: <3ca7ad52-bfd6-4619-abef-93bf5be3969c@linux.dev> (raw)
In-Reply-To: <CAAVpQUDZFCYNMQ08uRLu388cmggckzPeP=N7WncFyxN-_hgaMw@mail.gmail.com>


在 2025/9/3 13:48, Kuniyuki Iwashima 写道:
> On Tue, Sep 2, 2025 at 10:16 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>> On Tue, Sep 2, 2025 at 7:45 PM Xuanqiang Luo <xuanqiang.luo@linux.dev> wrote:
>>> From: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
>>>
>>> Since the lookup of sk in ehash is lockless, when one CPU is performing a
>>> lookup while another CPU is executing delete and insert operations
>>> (deleting reqsk and inserting sk), the lookup CPU may miss either of
>>> them, if sk cannot be found, an RST may be sent.
>>>
>>> The call trace map is drawn as follows:
>>>     CPU 0                           CPU 1
>>>     -----                           -----
>>>                                  spin_lock()
>>>                                  sk_nulls_del_node_init_rcu(osk)
>>> __inet_lookup_established()
>>>                                  __sk_nulls_add_node_rcu(sk, list)
>>>                                  spin_unlock()
>> This usually does not happen except for local communication, and
>> retrying on the client side is much better than penalising all lookups
>> for SYN.
>>
>>> We can try using spin_lock()/spin_unlock() to wait for ehash updates
>>> (ensuring all deletions and insertions are completed) after a failed
>>> lookup in ehash, then lookup sk again after the update. Since the sk
>>> expected to be found is unlikely to encounter the aforementioned scenario
>>> multiple times consecutively, we only need one update.
>>>
>>> Similarly, an issue occurs in tw hashdance. Try adjusting the order in
>>> which it operates on ehash: remove sk first, then add tw. If sk is missed
>>> during lookup, it will likewise wait for the update to find tw, without
>>> worrying about the skc_refcnt issue that would arise if tw were found
>>> first.
>>>
>>> Fixes: 3ab5aee7fe84 ("net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls")
>>> Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
>>> ---
>>>   net/ipv4/inet_hashtables.c    | 12 ++++++++++++
>>>   net/ipv4/inet_timewait_sock.c |  9 ++++-----
>>>   2 files changed, 16 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
>>> index ceeeec9b7290..4eb3a55b855b 100644
>>> --- a/net/ipv4/inet_hashtables.c
>>> +++ b/net/ipv4/inet_hashtables.c
>>> @@ -505,6 +505,7 @@ struct sock *__inet_lookup_established(const struct net *net,
>>>          unsigned int hash = inet_ehashfn(net, daddr, hnum, saddr, sport);
>>>          unsigned int slot = hash & hashinfo->ehash_mask;
>>>          struct inet_ehash_bucket *head = &hashinfo->ehash[slot];
>>> +       bool try_lock = true;
>>>
>>>   begin:
>>>          sk_nulls_for_each_rcu(sk, node, &head->chain) {
>>> @@ -528,6 +529,17 @@ struct sock *__inet_lookup_established(const struct net *net,
>>>           */
>>>          if (get_nulls_value(node) != slot)
>>>                  goto begin;
>>> +
>>> +       if (try_lock) {
>>> +               spinlock_t *lock = inet_ehash_lockp(hashinfo, hash);
>>> +
>>> +               try_lock = false;
>>> +               spin_lock(lock);
>>> +               /* Ensure ehash ops under spinlock complete. */
>>> +               spin_unlock(lock);
>>> +               goto begin;
>>> +       }
>>> +
>>>   out:
>>>          sk = NULL;
>>>   found:
>>> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
>>> index 875ff923a8ed..a91e02e19c53 100644
>>> --- a/net/ipv4/inet_timewait_sock.c
>>> +++ b/net/ipv4/inet_timewait_sock.c
>>> @@ -139,14 +139,10 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
>>>
>>>          spin_lock(lock);
>>>
>>> -       /* Step 2: Hash TW into tcp ehash chain */
>>> -       inet_twsk_add_node_rcu(tw, &ehead->chain);
>> You are adding a new RST scenario where the corresponding
>> socket is not found and a listener or no socket is found.
>>
>> The try_lock part is not guaranteed to happen after twsk
>> insertion below.

I'm sorry, I didn't get it. If try_lock can search again, other places 
should have left the spinlock critical section. That means twsk should 
have been inserted already. Or maybe I missed some details. Could you 
please explain more clearly? Your guidance would be really helpful. :)

> Oh no, spin_lock() dance sychronises the threads but I still
> think this is rather harmful for normal cases; now sending
> an unmatched packet can trigger lock dance, which is easily
> abused for DDoS.
>
I agree this scenario is possible. It does make DDoS attacks more 
impactful. Thanks for pointing that out. However, this is the best 
approach I can think of right now. Thanks Xuanqiang

>>
>>> -
>>> -       /* Step 3: Remove SK from hash chain */
>>> +       /* Step 2: Remove SK from hash chain */
>>>          if (__sk_nulls_del_node_init_rcu(sk))
>>>                  sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
>>>
>>> -
>>>          /* Ensure above writes are committed into memory before updating the
>>>           * refcount.
>>>           * Provides ordering vs later refcount_inc().
>>> @@ -161,6 +157,9 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
>>>           */
>>>          refcount_set(&tw->tw_refcnt, 3);
>>>
>>> +       /* Step 3: Hash TW into tcp ehash chain */
>>> +       inet_twsk_add_node_rcu(tw, &ehead->chain);
>>> +
>>>          inet_twsk_schedule(tw, timeo);
>>>
>>>          spin_unlock(lock);
>>> --
>>> 2.25.1
>>>

next prev parent reply	other threads:[~2025-09-03  7:55 UTC|newest]

Thread overview: 11+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-03  2:44 [PATCH net] inet: Avoid established lookup missing active sk Xuanqiang Luo
2025-09-03  5:16 ` Kuniyuki Iwashima
2025-09-03  5:48   ` Kuniyuki Iwashima
2025-09-03  7:54     ` luoxuanqiang [this message]
2025-09-03  6:40 ` Eric Dumazet
2025-09-03  6:52   ` Jason Xing
2025-09-03  8:03     ` luoxuanqiang
2025-09-03  8:35     ` Eric Dumazet
2025-09-03  9:05       ` Jason Xing
2025-09-03 11:51   ` luoxuanqiang
2025-09-03 21:53 ` [syzbot ci] " syzbot ci

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=3ca7ad52-bfd6-4619-abef-93bf5be3969c@linux.dev \
    --to=xuanqiang.luo@linux.dev \
    --cc=davem@davemloft.net \
    --cc=edumazet@google.com \
    --cc=kernelxing@tencent.com \
    --cc=kuba@kernel.org \
    --cc=kuniyu@google.com \
    --cc=luoxuanqiang@kylinos.cn \
    --cc=netdev@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).