From: luoxuanqiang <xuanqiang.luo@linux.dev>
To: Kuniyuki Iwashima <kuniyu@google.com>
Cc: edumazet@google.com, davem@davemloft.net, kuba@kernel.org,
kernelxing@tencent.com, netdev@vger.kernel.org,
Xuanqiang Luo <luoxuanqiang@kylinos.cn>
Subject: Re: [PATCH net] inet: Avoid established lookup missing active sk
Date: Wed, 3 Sep 2025 15:54:41 +0800 [thread overview]
Message-ID: <3ca7ad52-bfd6-4619-abef-93bf5be3969c@linux.dev> (raw)
In-Reply-To: <CAAVpQUDZFCYNMQ08uRLu388cmggckzPeP=N7WncFyxN-_hgaMw@mail.gmail.com>
在 2025/9/3 13:48, Kuniyuki Iwashima 写道:
> On Tue, Sep 2, 2025 at 10:16 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>> On Tue, Sep 2, 2025 at 7:45 PM Xuanqiang Luo <xuanqiang.luo@linux.dev> wrote:
>>> From: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
>>>
>>> Since the lookup of sk in ehash is lockless, when one CPU is performing a
>>> lookup while another CPU is executing delete and insert operations
>>> (deleting reqsk and inserting sk), the lookup CPU may miss either of
>>> them, if sk cannot be found, an RST may be sent.
>>>
>>> The call trace map is drawn as follows:
>>> CPU 0 CPU 1
>>> ----- -----
>>> spin_lock()
>>> sk_nulls_del_node_init_rcu(osk)
>>> __inet_lookup_established()
>>> __sk_nulls_add_node_rcu(sk, list)
>>> spin_unlock()
>> This usually does not happen except for local communication, and
>> retrying on the client side is much better than penalising all lookups
>> for SYN.
>>
>>> We can try using spin_lock()/spin_unlock() to wait for ehash updates
>>> (ensuring all deletions and insertions are completed) after a failed
>>> lookup in ehash, then lookup sk again after the update. Since the sk
>>> expected to be found is unlikely to encounter the aforementioned scenario
>>> multiple times consecutively, we only need one update.
>>>
>>> Similarly, an issue occurs in tw hashdance. Try adjusting the order in
>>> which it operates on ehash: remove sk first, then add tw. If sk is missed
>>> during lookup, it will likewise wait for the update to find tw, without
>>> worrying about the skc_refcnt issue that would arise if tw were found
>>> first.
>>>
>>> Fixes: 3ab5aee7fe84 ("net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls")
>>> Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
>>> ---
>>> net/ipv4/inet_hashtables.c | 12 ++++++++++++
>>> net/ipv4/inet_timewait_sock.c | 9 ++++-----
>>> 2 files changed, 16 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
>>> index ceeeec9b7290..4eb3a55b855b 100644
>>> --- a/net/ipv4/inet_hashtables.c
>>> +++ b/net/ipv4/inet_hashtables.c
>>> @@ -505,6 +505,7 @@ struct sock *__inet_lookup_established(const struct net *net,
>>> unsigned int hash = inet_ehashfn(net, daddr, hnum, saddr, sport);
>>> unsigned int slot = hash & hashinfo->ehash_mask;
>>> struct inet_ehash_bucket *head = &hashinfo->ehash[slot];
>>> + bool try_lock = true;
>>>
>>> begin:
>>> sk_nulls_for_each_rcu(sk, node, &head->chain) {
>>> @@ -528,6 +529,17 @@ struct sock *__inet_lookup_established(const struct net *net,
>>> */
>>> if (get_nulls_value(node) != slot)
>>> goto begin;
>>> +
>>> + if (try_lock) {
>>> + spinlock_t *lock = inet_ehash_lockp(hashinfo, hash);
>>> +
>>> + try_lock = false;
>>> + spin_lock(lock);
>>> + /* Ensure ehash ops under spinlock complete. */
>>> + spin_unlock(lock);
>>> + goto begin;
>>> + }
>>> +
>>> out:
>>> sk = NULL;
>>> found:
>>> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
>>> index 875ff923a8ed..a91e02e19c53 100644
>>> --- a/net/ipv4/inet_timewait_sock.c
>>> +++ b/net/ipv4/inet_timewait_sock.c
>>> @@ -139,14 +139,10 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
>>>
>>> spin_lock(lock);
>>>
>>> - /* Step 2: Hash TW into tcp ehash chain */
>>> - inet_twsk_add_node_rcu(tw, &ehead->chain);
>> You are adding a new RST scenario where the corresponding
>> socket is not found and a listener or no socket is found.
>>
>> The try_lock part is not guaranteed to happen after twsk
>> insertion below.
I'm sorry, I didn't get it. If try_lock can search again, other places
should have left the spinlock critical section. That means twsk should
have been inserted already. Or maybe I missed some details. Could you
please explain more clearly? Your guidance would be really helpful. :)
> Oh no, spin_lock() dance sychronises the threads but I still
> think this is rather harmful for normal cases; now sending
> an unmatched packet can trigger lock dance, which is easily
> abused for DDoS.
>
I agree this scenario is possible. It does make DDoS attacks more
impactful. Thanks for pointing that out. However, this is the best
approach I can think of right now. Thanks Xuanqiang
>>
>>> -
>>> - /* Step 3: Remove SK from hash chain */
>>> + /* Step 2: Remove SK from hash chain */
>>> if (__sk_nulls_del_node_init_rcu(sk))
>>> sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
>>>
>>> -
>>> /* Ensure above writes are committed into memory before updating the
>>> * refcount.
>>> * Provides ordering vs later refcount_inc().
>>> @@ -161,6 +157,9 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
>>> */
>>> refcount_set(&tw->tw_refcnt, 3);
>>>
>>> + /* Step 3: Hash TW into tcp ehash chain */
>>> + inet_twsk_add_node_rcu(tw, &ehead->chain);
>>> +
>>> inet_twsk_schedule(tw, timeo);
>>>
>>> spin_unlock(lock);
>>> --
>>> 2.25.1
>>>
next prev parent reply other threads:[~2025-09-03 7:55 UTC|newest]
Thread overview: 11+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-09-03 2:44 [PATCH net] inet: Avoid established lookup missing active sk Xuanqiang Luo
2025-09-03 5:16 ` Kuniyuki Iwashima
2025-09-03 5:48 ` Kuniyuki Iwashima
2025-09-03 7:54 ` luoxuanqiang [this message]
2025-09-03 6:40 ` Eric Dumazet
2025-09-03 6:52 ` Jason Xing
2025-09-03 8:03 ` luoxuanqiang
2025-09-03 8:35 ` Eric Dumazet
2025-09-03 9:05 ` Jason Xing
2025-09-03 11:51 ` luoxuanqiang
2025-09-03 21:53 ` [syzbot ci] " syzbot ci
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=3ca7ad52-bfd6-4619-abef-93bf5be3969c@linux.dev \
--to=xuanqiang.luo@linux.dev \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=kernelxing@tencent.com \
--cc=kuba@kernel.org \
--cc=kuniyu@google.com \
--cc=luoxuanqiang@kylinos.cn \
--cc=netdev@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).