* [PATCH net-next v2 1/3] rculist: Add __hlist_nulls_replace_rcu() and hlist_nulls_replace_init_rcu()
From: xuanqiang.luo @ 2025-09-16 6:46 UTC
To: edumazet, kuniyu; +Cc: kerneljasonxing, davem, kuba, netdev, Xuanqiang Luo
From: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
Add two functions to atomically replace RCU-protected hlist_nulls entries.
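As an illustration only (not part of this patch), a minimal caller sketch using
hypothetical ex_* types and a hypothetical bucket lock; the point is that the
replace helper publishes the new node with a single rcu_assign_pointer(), so a
lockless reader sees either the old or the new entry, never neither:

#include <linux/rculist_nulls.h>
#include <linux/spinlock.h>

struct ex_item {                        /* hypothetical embedding type */
        struct hlist_nulls_node node;
        int key;
};

/* Swap @old for @new on @chain while lockless readers keep traversing.
 * The spinlock only serializes against other writers on the same chain.
 */
static void ex_replace(struct hlist_nulls_head *chain, spinlock_t *lock,
                       struct ex_item *old, struct ex_item *new)
{
        spin_lock(lock);
        /* If @old was already unhashed, fall back to a plain insert. */
        if (!hlist_nulls_replace_init_rcu(&old->node, &new->node))
                hlist_nulls_add_head_rcu(&new->node, chain);
        spin_unlock(lock);
}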
Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
---
include/linux/rculist_nulls.h | 61 +++++++++++++++++++++++++++++++++++
1 file changed, 61 insertions(+)
diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h
index 89186c499dd4..8ed604f65a3e 100644
--- a/include/linux/rculist_nulls.h
+++ b/include/linux/rculist_nulls.h
@@ -152,6 +152,67 @@ static inline void hlist_nulls_add_fake(struct hlist_nulls_node *n)
n->next = (struct hlist_nulls_node *)NULLS_MARKER(NULL);
}
+/**
+ * __hlist_nulls_replace_rcu - replace an old entry by a new one
+ * @old: the element to be replaced
+ * @new: the new element to insert
+ *
+ * Description:
+ * Replace the old entry with the new one in an RCU-protected hlist_nulls, while
+ * permitting racing traversals.
+ *
+ * The caller must take whatever precautions are necessary (such as holding
+ * appropriate locks) to avoid racing with another list-mutation primitive, such
+ * as hlist_nulls_add_head_rcu() or hlist_nulls_del_rcu(), running on this same
+ * list. However, it is perfectly legal to run concurrently with the _rcu
+ * list-traversal primitives, such as hlist_nulls_for_each_entry_rcu().
+ */
+static inline void __hlist_nulls_replace_rcu(struct hlist_nulls_node *old,
+ struct hlist_nulls_node *new)
+{
+ struct hlist_nulls_node *next = old->next;
+
+ new->next = next;
+ WRITE_ONCE(new->pprev, old->pprev);
+ rcu_assign_pointer(*(struct hlist_nulls_node __rcu **)new->pprev, new);
+ if (!is_a_nulls(next))
+ WRITE_ONCE(new->next->pprev, &new->next);
+}
+
+/**
+ * hlist_nulls_replace_init_rcu - replace an old entry by a new one and
+ * initialize the old
+ * @old: the element to be replaced
+ * @new: the new element to insert
+ *
+ * Description:
+ * Replace the old entry with the new one in an RCU-protected hlist_nulls, while
+ * permitting racing traversals, and reinitialize the old entry.
+ *
+ * Return: true if the old entry was hashed and was replaced successfully, false
+ * otherwise.
+ *
+ * Note: hlist_nulls_unhashed() on the old node returns true after this.
+ * It is useful for RCU based read lockfree traversal if the writer side must
+ * know if the list entry is still hashed or already unhashed.
+ *
+ * The caller must take whatever precautions are necessary (such as holding
+ * appropriate locks) to avoid racing with another list-mutation primitive, such
+ * as hlist_nulls_add_head_rcu() or hlist_nulls_del_rcu(), running on this same
+ * list. However, it is perfectly legal to run concurrently with the _rcu
+ * list-traversal primitives, such as hlist_nulls_for_each_entry_rcu().
+ */
+static inline bool hlist_nulls_replace_init_rcu(struct hlist_nulls_node *old,
+ struct hlist_nulls_node *new)
+{
+ if (!hlist_nulls_unhashed(old)) {
+ __hlist_nulls_replace_rcu(old, new);
+ WRITE_ONCE(old->pprev, NULL);
+ return true;
+ }
+ return false;
+}
+
/**
* hlist_nulls_for_each_entry_rcu - iterate over rcu list of given type
* @tpos: the type * to use as a loop cursor.
--
2.25.1
* [PATCH net-next v2 2/3] inet: Avoid ehash lookup race in inet_ehash_insert()
From: xuanqiang.luo @ 2025-09-16 6:46 UTC
To: edumazet, kuniyu; +Cc: kerneljasonxing, davem, kuba, netdev, Xuanqiang Luo
From: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
Since ehash lookups are lockless, if one CPU performs a lookup while another
concurrently deletes and inserts (removing reqsk and inserting sk), the lookup
may fail to find the socket, and an RST may be sent.
The race is illustrated below:
CPU 0                           CPU 1
-----                           -----
                                inet_ehash_insert()
                                spin_lock()
                                sk_nulls_del_node_init_rcu(osk)
__inet_lookup_established()
(lookup failed)
                                __sk_nulls_add_node_rcu(sk, list)
                                spin_unlock()
As both deletion and insertion operate on the same ehash chain, this patch
introduces two new sk_nulls_replace_* helper functions to implement atomic
replacement.
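For context, a minimal sketch of the lockless reader pattern this protects; the
ex_* names are illustrative and not taken from the kernel sources. With atomic
replacement, the chain never transiently contains neither osk nor sk:

#include <linux/rculist_nulls.h>

struct ex_sock {                        /* hypothetical stand-in for struct sock */
        struct hlist_nulls_node node;
        unsigned int hash;
};

/* Lockless lookup over one nulls chain; caller holds rcu_read_lock().
 * The walk restarts if it ends on a foreign nulls value, meaning the
 * entry migrated to another chain during the traversal.
 */
static struct ex_sock *ex_lookup(struct hlist_nulls_head *chain,
                                 unsigned int hash, unsigned long slot)
{
        struct hlist_nulls_node *pos;
        struct ex_sock *esk;

begin:
        hlist_nulls_for_each_entry_rcu(esk, pos, chain, node) {
                if (esk->hash == hash)
                        return esk;     /* sees either osk or the new sk */
        }
        if (get_nulls_value(pos) != slot)
                goto begin;
        return NULL;
}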
Fixes: 5e0724d027f0 ("tcp/dccp: fix hashdance race for passive sessions")
Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
---
include/net/sock.h | 23 +++++++++++++++++++++++
net/ipv4/inet_hashtables.c | 4 +++-
2 files changed, 26 insertions(+), 1 deletion(-)
diff --git a/include/net/sock.h b/include/net/sock.h
index 0fd465935334..e709376eaf0a 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -854,6 +854,29 @@ static inline bool sk_nulls_del_node_init_rcu(struct sock *sk)
return rc;
}
+static inline bool __sk_nulls_replace_node_init_rcu(struct sock *old,
+ struct sock *new)
+{
+ if (sk_hashed(old)) {
+ hlist_nulls_replace_init_rcu(&old->sk_nulls_node,
+ &new->sk_nulls_node);
+ return true;
+ }
+ return false;
+}
+
+static inline bool sk_nulls_replace_node_init_rcu(struct sock *old,
+ struct sock *new)
+{
+ bool rc = __sk_nulls_replace_node_init_rcu(old, new);
+
+ if (rc) {
+ WARN_ON(refcount_read(&old->sk_refcnt) == 1);
+ __sock_put(old);
+ }
+ return rc;
+}
+
static inline void __sk_add_node(struct sock *sk, struct hlist_head *list)
{
hlist_add_head(&sk->sk_node, list);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index ef4ccfd46ff6..83c9ec625419 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -685,7 +685,8 @@ bool inet_ehash_insert(struct sock *sk, struct sock *osk, bool *found_dup_sk)
spin_lock(lock);
if (osk) {
WARN_ON_ONCE(sk->sk_hash != osk->sk_hash);
- ret = sk_nulls_del_node_init_rcu(osk);
+ ret = sk_nulls_replace_node_init_rcu(osk, sk);
+ goto unlock;
} else if (found_dup_sk) {
*found_dup_sk = inet_ehash_lookup_by_sk(sk, list);
if (*found_dup_sk)
@@ -695,6 +696,7 @@ bool inet_ehash_insert(struct sock *sk, struct sock *osk, bool *found_dup_sk)
if (ret)
__sk_nulls_add_node_rcu(sk, list);
+unlock:
spin_unlock(lock);
return ret;
--
2.25.1
* [PATCH net-next v2 3/3] inet: Avoid ehash lookup race in inet_twsk_hashdance_schedule()
From: xuanqiang.luo @ 2025-09-16 6:46 UTC
To: edumazet, kuniyu; +Cc: kerneljasonxing, davem, kuba, netdev, Xuanqiang Luo
From: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
Since ehash lookups are lockless, if another CPU is concurrently converting sk
to tw, a lookup may fetch the newly inserted tw while tw->tw_refcnt is still 0,
causing the lookup to fail.
The race is illustrated below:
CPU 0                                      CPU 1
-----                                      -----
                                           inet_twsk_hashdance_schedule()
                                           spin_lock()
                                           inet_twsk_add_node_rcu(tw, ...)
__inet_lookup_established()
(find tw, failure due to tw_refcnt = 0)
                                           __sk_nulls_del_node_init_rcu(sk)
                                           refcount_set(&tw->tw_refcnt, 3)
                                           spin_unlock()
By replacing sk with tw atomically via hlist_nulls_replace_init_rcu() after
setting tw_refcnt, we ensure that tw is either fully initialized or not
visible to other CPUs, eliminating the race.
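For context, a minimal sketch of why the reader needs tw_refcnt to be non-zero
before tw becomes visible; refcount_inc_not_zero() is the real primitive, while
the ex_* wrapper is hypothetical and only for illustration:

#include <linux/refcount.h>

/* A lockless reader that finds a matching entry must still take a
 * reference before using it. A freshly hashed tw whose tw_refcnt is
 * still 0 makes this fail, so the lookup gives up even though the
 * connection exists. Publishing tw only after refcount_set(..., 3),
 * via the atomic replace, removes that window.
 */
static bool ex_try_take_ref(refcount_t *refcnt)
{
        return refcount_inc_not_zero(refcnt);  /* false while *refcnt == 0 */
}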
Fixes: 3ab5aee7fe84 ("net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls")
Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
---
net/ipv4/inet_timewait_sock.c | 15 ++++++---------
1 file changed, 6 insertions(+), 9 deletions(-)
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 5b5426b8ee92..1ba20c4cb73b 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -116,7 +116,7 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
spinlock_t *lock = inet_ehash_lockp(hashinfo, sk->sk_hash);
struct inet_bind_hashbucket *bhead, *bhead2;
- /* Step 1: Put TW into bind hash. Original socket stays there too.
+ /* Put TW into bind hash. Original socket stays there too.
Note, that any socket with inet->num != 0 MUST be bound in
binding cache, even if it is closed.
*/
@@ -140,14 +140,6 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
spin_lock(lock);
- /* Step 2: Hash TW into tcp ehash chain */
- inet_twsk_add_node_rcu(tw, &ehead->chain);
-
- /* Step 3: Remove SK from hash chain */
- if (__sk_nulls_del_node_init_rcu(sk))
- sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
-
-
/* Ensure above writes are committed into memory before updating the
* refcount.
* Provides ordering vs later refcount_inc().
@@ -162,6 +154,11 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
*/
refcount_set(&tw->tw_refcnt, 3);
+ if (hlist_nulls_replace_init_rcu(&sk->sk_nulls_node, &tw->tw_node))
+ sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
+ else
+ inet_twsk_add_node_rcu(tw, &ehead->chain);
+
inet_twsk_schedule(tw, timeo);
spin_unlock(lock);
--
2.25.1
* Re: [PATCH net-next v2 0/3] net: Avoid ehash lookup races
From: Eric Dumazet @ 2025-09-16 7:30 UTC
To: xuanqiang.luo; +Cc: kuniyu, kerneljasonxing, davem, kuba, netdev, Xuanqiang Luo
On Mon, Sep 15, 2025 at 11:47 PM <xuanqiang.luo@linux.dev> wrote:
>
> From: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
>
> After replacing R/W locks with RCU in commit 3ab5aee7fe84 ("net: Convert
> TCP & DCCP hash tables to use RCU / hlist_nulls"), a race window emerged
> during the switch from reqsk/sk to sk/tw.
>
> Now that both timewait sock (tw) and full sock (sk) reside on the same
> ehash chain, it is appropriate to introduce hlist_nulls replace
> operations, to eliminate the race conditions caused by this window.
>
> ---
> Changes:
> v2:
> * Patch 1
> * Use WRITE_ONCE() to initialize old->pprev.
> * Patch 2&3
> * Optimize sk hashed check. Thanks Kuni for pointing it out!
>
> v1: https://lore.kernel.org/all/20250915070308.111816-1-xuanqiang.luo@linux.dev/
Note : I think you sent an earlier version, you should have added a
link to the discussion,
and past feedback/suggestions.
Lack of credit is a bit annoying frankly.
I will take a look at your series, thanks.
>
> Xuanqiang Luo (3):
> rculist: Add __hlist_nulls_replace_rcu() and
> hlist_nulls_replace_init_rcu()
> inet: Avoid ehash lookup race in inet_ehash_insert()
> inet: Avoid ehash lookup race in inet_twsk_hashdance_schedule()
>
> include/linux/rculist_nulls.h | 61 +++++++++++++++++++++++++++++++++++
> include/net/sock.h | 23 +++++++++++++
> net/ipv4/inet_hashtables.c | 4 ++-
> net/ipv4/inet_timewait_sock.c | 15 ++++-----
> 4 files changed, 93 insertions(+), 10 deletions(-)
>
> --
> 2.25.1
>
* Re: [PATCH net-next v2 0/3] net: Avoid ehash lookup races
From: luoxuanqiang @ 2025-09-16 8:11 UTC
To: Eric Dumazet; +Cc: kuniyu, kerneljasonxing, davem, kuba, netdev, Xuanqiang Luo
On 2025/9/16 15:30, Eric Dumazet wrote:
> On Mon, Sep 15, 2025 at 11:47 PM <xuanqiang.luo@linux.dev> wrote:
>> From: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
>>
>> After replacing R/W locks with RCU in commit 3ab5aee7fe84 ("net: Convert
>> TCP & DCCP hash tables to use RCU / hlist_nulls"), a race window emerged
>> during the switch from reqsk/sk to sk/tw.
>>
>> Now that both timewait sock (tw) and full sock (sk) reside on the same
>> ehash chain, it is appropriate to introduce hlist_nulls replace
>> operations, to eliminate the race conditions caused by this window.
>>
>> ---
>> Changes:
>> v2:
>> * Patch 1
>> * Use WRITE_ONCE() to initialize old->pprev.
>> * Patch 2&3
>> * Optimize sk hashed check. Thanks Kuni for pointing it out!
>>
>> v1: https://lore.kernel.org/all/20250915070308.111816-1-xuanqiang.luo@linux.dev/
> Note : I think you sent an earlier version, you should have added a
> link to the discussion,
> and past feedback/suggestions.
>
> Lack of credit is a bit annoying frankly.
>
> I will take a look at your series, thanks.
This patch's solution isn't very related to previous ones, so I didn't
include prior discussions.
But to help others get the full picture, I'll share past links and
earlier discussions related to this type of issue in the next version
or a reply. Sorry for any oversight.
Thanks
Xuanqiang.
>> Xuanqiang Luo (3):
>> rculist: Add __hlist_nulls_replace_rcu() and
>> hlist_nulls_replace_init_rcu()
>> inet: Avoid ehash lookup race in inet_ehash_insert()
>> inet: Avoid ehash lookup race in inet_twsk_hashdance_schedule()
>>
>> include/linux/rculist_nulls.h | 61 +++++++++++++++++++++++++++++++++++
>> include/net/sock.h | 23 +++++++++++++
>> net/ipv4/inet_hashtables.c | 4 ++-
>> net/ipv4/inet_timewait_sock.c | 15 ++++-----
>> 4 files changed, 93 insertions(+), 10 deletions(-)
>>
>> --
>> 2.25.1
>>
* Re: [PATCH net-next v2 0/3] net: Avoid ehash lookup races
From: Eric Dumazet @ 2025-09-16 9:02 UTC
To: luoxuanqiang; +Cc: kuniyu, kerneljasonxing, davem, kuba, netdev, Xuanqiang Luo
On Tue, Sep 16, 2025 at 1:12 AM luoxuanqiang <xuanqiang.luo@linux.dev> wrote:
>
>
> On 2025/9/16 15:30, Eric Dumazet wrote:
> > On Mon, Sep 15, 2025 at 11:47 PM <xuanqiang.luo@linux.dev> wrote:
> >> From: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
> >>
> >> After replacing R/W locks with RCU in commit 3ab5aee7fe84 ("net: Convert
> >> TCP & DCCP hash tables to use RCU / hlist_nulls"), a race window emerged
> >> during the switch from reqsk/sk to sk/tw.
> >>
> >> Now that both timewait sock (tw) and full sock (sk) reside on the same
> >> ehash chain, it is appropriate to introduce hlist_nulls replace
> >> operations, to eliminate the race conditions caused by this window.
> >>
> >> ---
> >> Changes:
> >> v2:
> >> * Patch 1
> >> * Use WRITE_ONCE() to initialize old->pprev.
> >> * Patch 2&3
> >> * Optimize sk hashed check. Thanks Kuni for pointing it out!
> >>
> >> v1: https://lore.kernel.org/all/20250915070308.111816-1-xuanqiang.luo@linux.dev/
> > Note : I think you sent an earlier version, you should have added a
> > link to the discussion,
> > and past feedback/suggestions.
> >
> > Lack of credit is a bit annoying frankly.
> >
> > I will take a look at your series, thanks.
>
> This patch's solution isn't very related to previous ones, so I didn't
> include prior discussions.
This is completely related, aiming to fix the same issue, do not try
to pretend otherwise.
Really, adding more context and acknowledging that reviewers
made suggestions would be quite fair.
This is a difficult series, with a lot of potential bugs, you need to bring
us on board.
* Re: [PATCH net-next v2 0/3] net: Avoid ehash lookup races
From: luoxuanqiang @ 2025-09-16 9:20 UTC
To: Eric Dumazet; +Cc: kuniyu, kerneljasonxing, davem, kuba, netdev, Xuanqiang Luo
On 2025/9/16 17:02, Eric Dumazet wrote:
> On Tue, Sep 16, 2025 at 1:12 AM luoxuanqiang <xuanqiang.luo@linux.dev> wrote:
>>
>> On 2025/9/16 15:30, Eric Dumazet wrote:
>>> On Mon, Sep 15, 2025 at 11:47 PM <xuanqiang.luo@linux.dev> wrote:
>>>> From: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
>>>>
>>>> After replacing R/W locks with RCU in commit 3ab5aee7fe84 ("net: Convert
>>>> TCP & DCCP hash tables to use RCU / hlist_nulls"), a race window emerged
>>>> during the switch from reqsk/sk to sk/tw.
>>>>
>>>> Now that both timewait sock (tw) and full sock (sk) reside on the same
>>>> ehash chain, it is appropriate to introduce hlist_nulls replace
>>>> operations, to eliminate the race conditions caused by this window.
>>>>
>>>> ---
>>>> Changes:
>>>> v2:
>>>> * Patch 1
>>>> * Use WRITE_ONCE() to initialize old->pprev.
>>>> * Patch 2&3
>>>> * Optimize sk hashed check. Thanks Kuni for pointing it out!
>>>>
>>>> v1: https://lore.kernel.org/all/20250915070308.111816-1-xuanqiang.luo@linux.dev/
>>> Note : I think you sent an earlier version, you should have added a
>>> link to the discussion,
>>> and past feedback/suggestions.
>>>
>>> Lack of credit is a bit annoying frankly.
>>>
>>> I will take a look at your series, thanks.
>> This patch's solution isn't very related to previous ones, so I didn't
>> include prior discussions.
> This is completely related, aiming to fix the same issue, do not try
> to pretend otherwise.
>
> Really, adding more context and acknowledging that reviewers
> made suggestions would be quite fair.
>
> This is a difficult series, with a lot of potential bugs, you need to bring
> us on board.
I understand your point. I’m sorry for wasting your time on this. Please
disregard this series submission — I will send the next version soon,
which will include the mentioned background information. I apologize
again for my misunderstanding.
Thanks,
Xuanqiang.