From: Xuanqiang Luo <xuanqiang.luo@linux.dev>
To: edumazet@google.com, kuniyu@google.com
Cc: davem@davemloft.net, kuba@kernel.org, kernelxing@tencent.com,
netdev@vger.kernel.org, Xuanqiang Luo <luoxuanqiang@kylinos.cn>
Subject: [PATCH net] inet: Avoid established lookup missing active sk
Date: Wed, 3 Sep 2025 10:44:06 +0800
Message-ID: <20250903024406.2418362-1-xuanqiang.luo@linux.dev>
From: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
Since the lookup of sk in ehash is lockless, when one CPU is performing a
lookup while another CPU is executing delete and insert operations
(deleting reqsk and inserting sk), the lookup CPU may find neither of
them. If sk cannot be found, an RST may be sent.
The call trace map is drawn as follows:
   CPU 0                           CPU 1
   -----                           -----
                                   spin_lock()
                                   sk_nulls_del_node_init_rcu(osk)
   __inet_lookup_established()
                                   __sk_nulls_add_node_rcu(sk, list)
                                   spin_unlock()
After a failed lookup in ehash, we can take and release the bucket lock
with spin_lock()/spin_unlock() to wait for any in-flight ehash update
(ensuring all deletions and insertions are completed), then look up sk
again. Since the sk expected to be found is unlikely to hit the
aforementioned scenario multiple times consecutively, a single retry is
sufficient.
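The retry-once idea can be sketched in userspace C as follows. This is only
an illustration under simplifying assumptions, not the kernel code: the
bucket holds a single slot instead of an hlist_nulls chain, the lock is a
C11 atomic_flag spinlock, and the names struct entry, struct bucket,
bucket_replace() and bucket_lookup() are hypothetical.

```c
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-in for a socket hashed in an ehash bucket. */
struct entry {
	int key;
};

/* One bucket: a writer-side spinlock and one locklessly readable slot. */
struct bucket {
	atomic_flag lock;
	_Atomic(struct entry *) slot;
	atomic_int scans;	/* counts lockless scans, for illustration */
};

#define BUCKET_INIT(e) { .lock = ATOMIC_FLAG_INIT, .slot = (e), .scans = 0 }

static void bucket_lock(struct bucket *b)
{
	while (atomic_flag_test_and_set_explicit(&b->lock, memory_order_acquire))
		;	/* spin until the current holder releases the lock */
}

static void bucket_unlock(struct bucket *b)
{
	atomic_flag_clear_explicit(&b->lock, memory_order_release);
}

/* Writer: delete the old entry, then insert the new one, under the lock. */
static void bucket_replace(struct bucket *b, struct entry *new_entry)
{
	bucket_lock(b);
	atomic_store(&b->slot, NULL);		/* delete */
	/* A lockless reader scanning at this point sees an empty bucket. */
	atomic_store(&b->slot, new_entry);	/* insert */
	bucket_unlock(b);
}

/* One lockless scan of the bucket. */
static struct entry *bucket_scan(struct bucket *b, int key)
{
	struct entry *e = atomic_load(&b->slot);

	atomic_fetch_add(&b->scans, 1);
	return (e && e->key == key) ? e : NULL;
}

/* Reader: lockless scan; on a miss, wait for writers and rescan once. */
static struct entry *bucket_lookup(struct bucket *b, int key)
{
	struct entry *e = bucket_scan(b, key);

	if (e)
		return e;
	/* Lock/unlock pair: returns only after any in-flight update is done. */
	bucket_lock(b);
	bucket_unlock(b);
	return bucket_scan(b, key);
}
```

In this sketch a hit costs one scan and a miss costs two scans plus one
lock/unlock pair, which mirrors the single extra pass added to the lookup.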
A similar issue occurs in the tw hashdance. Adjust the order in which it
operates on ehash: remove sk first, then add tw. If sk is missed during
lookup, the lookup will likewise wait for the update and then find tw,
without worrying about the skc_refcnt issue that would arise if tw were
found first.
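The effect of the reordering on what a reader can observe can be shown with
a small single-threaded model. It is a sketch under assumptions: the chain
is a plain singly linked list with insertion at the head, a lookup returns
the first node it meets, and all names (struct node, struct chain,
hashdance()) are hypothetical rather than kernel APIs.

```c
#include <stddef.h>

enum kind { KIND_SK, KIND_TW };

struct node {
	enum kind kind;
	struct node *next;
};

/* Hypothetical single-bucket chain; new nodes are added at the head. */
struct chain {
	struct node *head;
};

static void chain_add(struct chain *c, struct node *n)
{
	n->next = c->head;
	c->head = n;
}

static void chain_del(struct chain *c, struct node *n)
{
	struct node **pp = &c->head;

	while (*pp && *pp != n)
		pp = &(*pp)->next;
	if (*pp)
		*pp = n->next;
}

/* What a scan from the head would return: 's' (sk), 't' (tw), '-' (none). */
static char chain_peek(const struct chain *c)
{
	if (!c->head)
		return '-';
	return c->head->kind == KIND_SK ? 's' : 't';
}

/* Record what a reader could observe after each step of the hashdance. */
static void hashdance(int remove_first, char seen[3])
{
	struct node sk = { KIND_SK, NULL };
	struct node tw = { KIND_TW, NULL };
	struct chain c = { NULL };

	chain_add(&c, &sk);
	seen[0] = chain_peek(&c);

	if (remove_first)
		chain_del(&c, &sk);	/* new order: remove sk first */
	else
		chain_add(&c, &tw);	/* old order: add tw first */
	seen[1] = chain_peek(&c);

	if (remove_first)
		chain_add(&c, &tw);
	else
		chain_del(&c, &sk);
	seen[2] = chain_peek(&c);
}
```

With the old order the intermediate state exposes tw while sk is still
hashed; with the new order the intermediate state is an empty chain, which
is the case the lock/unlock retry in the lookup is meant to cover.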
Fixes: 3ab5aee7fe84 ("net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls")
Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
---
net/ipv4/inet_hashtables.c | 12 ++++++++++++
net/ipv4/inet_timewait_sock.c | 9 ++++-----
2 files changed, 16 insertions(+), 5 deletions(-)
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index ceeeec9b7290..4eb3a55b855b 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -505,6 +505,7 @@ struct sock *__inet_lookup_established(const struct net *net,
 	unsigned int hash = inet_ehashfn(net, daddr, hnum, saddr, sport);
 	unsigned int slot = hash & hashinfo->ehash_mask;
 	struct inet_ehash_bucket *head = &hashinfo->ehash[slot];
+	bool try_lock = true;
 begin:
 	sk_nulls_for_each_rcu(sk, node, &head->chain) {
@@ -528,6 +529,17 @@ struct sock *__inet_lookup_established(const struct net *net,
 	 */
 	if (get_nulls_value(node) != slot)
 		goto begin;
+
+	if (try_lock) {
+		spinlock_t *lock = inet_ehash_lockp(hashinfo, hash);
+
+		try_lock = false;
+		spin_lock(lock);
+		/* Ensure ehash ops under spinlock complete. */
+		spin_unlock(lock);
+		goto begin;
+	}
+
 out:
 	sk = NULL;
 found:
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 875ff923a8ed..a91e02e19c53 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -139,14 +139,10 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
 	spin_lock(lock);
-	/* Step 2: Hash TW into tcp ehash chain */
-	inet_twsk_add_node_rcu(tw, &ehead->chain);
-
-	/* Step 3: Remove SK from hash chain */
+	/* Step 2: Remove SK from hash chain */
 	if (__sk_nulls_del_node_init_rcu(sk))
 		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
-
 	/* Ensure above writes are committed into memory before updating the
 	 * refcount.
 	 * Provides ordering vs later refcount_inc().
@@ -161,6 +157,9 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
 	 */
 	refcount_set(&tw->tw_refcnt, 3);
+	/* Step 3: Hash TW into tcp ehash chain */
+	inet_twsk_add_node_rcu(tw, &ehead->chain);
+
 	inet_twsk_schedule(tw, timeo);
 	spin_unlock(lock);
--
2.25.1