* [PATCH net-next v1 0/3] net: Avoid ehash lookup races
@ 2025-09-15  7:03 xuanqiang.luo
  2025-09-15  7:03 ` [PATCH net-next v1 1/3] rculist: Add __hlist_nulls_replace_rcu() and hlist_nulls_replace_init_rcu() xuanqiang.luo
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: xuanqiang.luo @ 2025-09-15  7:03 UTC (permalink / raw)
  To: edumazet, kuniyu; +Cc: kerneljasonxing, davem, kuba, netdev, Xuanqiang Luo

From: Xuanqiang Luo <luoxuanqiang@kylinos.cn>

After R/W locks were replaced with RCU in commit 3ab5aee7fe84 ("net: Convert
TCP & DCCP hash tables to use RCU / hlist_nulls"), a race window has existed
while a reqsk is switched to a full sk, or a full sk to a tw, on the ehash
chain.

Since the timewait sock (tw) and the full sock (sk) reside on the same ehash
chain, introduce hlist_nulls replace operations to eliminate the race
conditions caused by this window.

Xuanqiang Luo (3):
  rculist: Add __hlist_nulls_replace_rcu() and
    hlist_nulls_replace_init_rcu()
  inet: Avoid ehash lookup race in inet_ehash_insert()
  inet: Avoid ehash lookup race in inet_twsk_hashdance_schedule()

 include/linux/rculist_nulls.h | 62 +++++++++++++++++++++++++++++++++++
 include/net/sock.h            | 23 +++++++++++++
 net/ipv4/inet_hashtables.c    |  7 ++++
 net/ipv4/inet_timewait_sock.c | 20 ++++++-----
 4 files changed, 103 insertions(+), 9 deletions(-)

-- 
2.27.0



* [PATCH net-next v1 1/3] rculist: Add __hlist_nulls_replace_rcu() and hlist_nulls_replace_init_rcu()
  2025-09-15  7:03 [PATCH net-next v1 0/3] net: Avoid ehash lookup races xuanqiang.luo
@ 2025-09-15  7:03 ` xuanqiang.luo
  2025-09-15 22:50   ` Kuniyuki Iwashima
  2025-09-15  7:03 ` [PATCH net-next v1 2/3] inet: Avoid ehash lookup race in inet_ehash_insert() xuanqiang.luo
  2025-09-15  7:03 ` [PATCH net-next v1 3/3] inet: Avoid ehash lookup race in inet_twsk_hashdance_schedule() xuanqiang.luo
  2 siblings, 1 reply; 12+ messages in thread
From: xuanqiang.luo @ 2025-09-15  7:03 UTC (permalink / raw)
  To: edumazet, kuniyu; +Cc: kerneljasonxing, davem, kuba, netdev, Xuanqiang Luo

From: Xuanqiang Luo <luoxuanqiang@kylinos.cn>

Add two functions to atomically replace RCU-protected hlist_nulls entries.

Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
---
 include/linux/rculist_nulls.h | 62 +++++++++++++++++++++++++++++++++++
 1 file changed, 62 insertions(+)

diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h
index 89186c499dd4..eaa3a0d2f206 100644
--- a/include/linux/rculist_nulls.h
+++ b/include/linux/rculist_nulls.h
@@ -152,6 +152,68 @@ static inline void hlist_nulls_add_fake(struct hlist_nulls_node *n)
 	n->next = (struct hlist_nulls_node *)NULLS_MARKER(NULL);
 }
 
+/**
+ * __hlist_nulls_replace_rcu - replace an old entry by a new one
+ * @old: the element to be replaced
+ * @new: the new element to insert
+ *
+ * Description:
+ * Replace the old entry with the new one in a RCU-protected hlist_nulls, while
+ * permitting racing traversals.
+ *
+ * The caller must take whatever precautions are necessary (such as holding
+ * appropriate locks) to avoid racing with another list-mutation primitive, such
+ * as hlist_nulls_add_head_rcu() or hlist_nulls_del_rcu(), running on this same
+ * list.  However, it is perfectly legal to run concurrently with the _rcu
+ * list-traversal primitives, such as hlist_nulls_for_each_entry_rcu().
+ */
+static inline void __hlist_nulls_replace_rcu(struct hlist_nulls_node *old,
+					     struct hlist_nulls_node *new)
+{
+	struct hlist_nulls_node *next = old->next;
+
+	new->next = next;
+	WRITE_ONCE(new->pprev, old->pprev);
+	rcu_assign_pointer(*(struct hlist_nulls_node __rcu **)new->pprev, new);
+	if (!is_a_nulls(next))
+		WRITE_ONCE(new->next->pprev, &new->next);
+}
+
+/**
+ * hlist_nulls_replace_init_rcu - replace an old entry by a new one and
+ * initialize the old
+ * @old: the element to be replaced
+ * @new: the new element to insert
+ *
+ * Description:
+ * Replace the old entry with the new one in a RCU-protected hlist_nulls, while
+ * permitting racing traversals, and reinitialize the old entry.
+ *
+ * Return: true if the old entry was hashed and was replaced successfully, false
+ * otherwise.
+ *
+ * Note: hlist_nulls_unhashed() on the old node returns true after this.
+ * It is useful for RCU based read lockfree traversal if the writer side must
+ * know if the list entry is still hashed or already unhashed.
+ *
+ * The caller must take whatever precautions are necessary (such as holding
+ * appropriate locks) to avoid racing with another list-mutation primitive, such
+ * as hlist_nulls_add_head_rcu() or hlist_nulls_del_rcu(), running on this same
+ * list. However, it is perfectly legal to run concurrently with the _rcu
+ * list-traversal primitives, such as hlist_nulls_for_each_entry_rcu().
+ */
+static inline bool hlist_nulls_replace_init_rcu(struct hlist_nulls_node *old,
+						struct hlist_nulls_node *new)
+{
+	if (!hlist_nulls_unhashed(old)) {
+		__hlist_nulls_replace_rcu(old, new);
+		old->pprev = NULL;
+		return true;
+	}
+
+	return false;
+}
+
 /**
  * hlist_nulls_for_each_entry_rcu - iterate over rcu list of given type
  * @tpos:	the type * to use as a loop cursor.
-- 
2.27.0



* [PATCH net-next v1 2/3] inet: Avoid ehash lookup race in inet_ehash_insert()
  2025-09-15  7:03 [PATCH net-next v1 0/3] net: Avoid ehash lookup races xuanqiang.luo
  2025-09-15  7:03 ` [PATCH net-next v1 1/3] rculist: Add __hlist_nulls_replace_rcu() and hlist_nulls_replace_init_rcu() xuanqiang.luo
@ 2025-09-15  7:03 ` xuanqiang.luo
  2025-09-15 23:00   ` Kuniyuki Iwashima
  2025-09-15  7:03 ` [PATCH net-next v1 3/3] inet: Avoid ehash lookup race in inet_twsk_hashdance_schedule() xuanqiang.luo
  2 siblings, 1 reply; 12+ messages in thread
From: xuanqiang.luo @ 2025-09-15  7:03 UTC (permalink / raw)
  To: edumazet, kuniyu; +Cc: kerneljasonxing, davem, kuba, netdev, Xuanqiang Luo

From: Xuanqiang Luo <luoxuanqiang@kylinos.cn>

Since ehash lookups are lockless, if one CPU performs a lookup while another
concurrently deletes and inserts (removing the reqsk and inserting the sk),
the lookup may fail to find the socket and an RST may be sent.

The call trace map is drawn as follows:
   CPU 0                           CPU 1
   -----                           -----
                                inet_ehash_insert()
                                spin_lock()
                                sk_nulls_del_node_init_rcu(osk)
__inet_lookup_established()
        (lookup failed)
                                __sk_nulls_add_node_rcu(sk, list)
                                spin_unlock()

As both deletion and insertion operate on the same ehash chain, this patch
introduces two new sk_nulls_replace_* helper functions to implement atomic
replacement.

If sk_nulls_replace_node_init_rcu() fails, osk is either hlist_unhashed or
hlist_nulls_unhashed: in the former case inet_ehash_insert() returns false,
in the latter sk is inserted without deleting osk.

Fixes: 5e0724d027f0 ("tcp/dccp: fix hashdance race for passive sessions")
Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
---
 include/net/sock.h         | 23 +++++++++++++++++++++++
 net/ipv4/inet_hashtables.c |  7 +++++++
 2 files changed, 30 insertions(+)

diff --git a/include/net/sock.h b/include/net/sock.h
index 896bec2d2176..26dacf7bc93e 100644
--- a/include/net/sock.h
+++ b/include/net/sock.h
@@ -859,6 +859,29 @@ static inline bool sk_nulls_del_node_init_rcu(struct sock *sk)
 	return rc;
 }
 
+static inline bool __sk_nulls_replace_node_init_rcu(struct sock *old,
+						    struct sock *new)
+{
+	if (sk_hashed(old) &&
+	    hlist_nulls_replace_init_rcu(&old->sk_nulls_node,
+					 &new->sk_nulls_node))
+		return true;
+
+	return false;
+}
+
+static inline bool sk_nulls_replace_node_init_rcu(struct sock *old,
+						  struct sock *new)
+{
+	bool rc = __sk_nulls_replace_node_init_rcu(old, new);
+
+	if (rc) {
+		WARN_ON(refcount_read(&old->sk_refcnt) == 1);
+		__sock_put(old);
+	}
+	return rc;
+}
+
 static inline void __sk_add_node(struct sock *sk, struct hlist_head *list)
 {
 	hlist_add_head(&sk->sk_node, list);
diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
index ef4ccfd46ff6..7803fd3cc8e9 100644
--- a/net/ipv4/inet_hashtables.c
+++ b/net/ipv4/inet_hashtables.c
@@ -685,6 +685,12 @@ bool inet_ehash_insert(struct sock *sk, struct sock *osk, bool *found_dup_sk)
 	spin_lock(lock);
 	if (osk) {
 		WARN_ON_ONCE(sk->sk_hash != osk->sk_hash);
+		/* Since osk and sk should be in the same ehash bucket, try
+		 * direct replacement to avoid lookup gaps. On failure, no
+		 * changes. sk_nulls_del_node_init_rcu() will handle the rest.
+		 */
+		if (sk_nulls_replace_node_init_rcu(osk, sk))
+			goto unlock;
 		ret = sk_nulls_del_node_init_rcu(osk);
 	} else if (found_dup_sk) {
 		*found_dup_sk = inet_ehash_lookup_by_sk(sk, list);
@@ -695,6 +701,7 @@ bool inet_ehash_insert(struct sock *sk, struct sock *osk, bool *found_dup_sk)
 	if (ret)
 		__sk_nulls_add_node_rcu(sk, list);
 
+unlock:
 	spin_unlock(lock);
 
 	return ret;
-- 
2.27.0



* [PATCH net-next v1 3/3] inet: Avoid ehash lookup race in inet_twsk_hashdance_schedule()
  2025-09-15  7:03 [PATCH net-next v1 0/3] net: Avoid ehash lookup races xuanqiang.luo
  2025-09-15  7:03 ` [PATCH net-next v1 1/3] rculist: Add __hlist_nulls_replace_rcu() and hlist_nulls_replace_init_rcu() xuanqiang.luo
  2025-09-15  7:03 ` [PATCH net-next v1 2/3] inet: Avoid ehash lookup race in inet_ehash_insert() xuanqiang.luo
@ 2025-09-15  7:03 ` xuanqiang.luo
  2025-09-15 23:25   ` Kuniyuki Iwashima
  2 siblings, 1 reply; 12+ messages in thread
From: xuanqiang.luo @ 2025-09-15  7:03 UTC (permalink / raw)
  To: edumazet, kuniyu; +Cc: kerneljasonxing, davem, kuba, netdev, Xuanqiang Luo

From: Xuanqiang Luo <luoxuanqiang@kylinos.cn>

Since ehash lookups are lockless, if another CPU is concurrently converting
sk to tw, a lookup may fetch the newly inserted tw while tw->tw_refcnt is
still 0 and fail.

The call trace map is drawn as follows:
   CPU 0                                CPU 1
   -----                                -----
                                     inet_twsk_hashdance_schedule()
                                     spin_lock()
                                     inet_twsk_add_node_rcu(tw, ...)
__inet_lookup_established()
(find tw, failure due to tw_refcnt = 0)
                                     __sk_nulls_del_node_init_rcu(sk)
                                     refcount_set(&tw->tw_refcnt, 3)
                                     spin_unlock()

By replacing sk with tw atomically via hlist_nulls_replace_init_rcu() after
setting tw_refcnt, we ensure that tw is either fully initialized or not
visible to other CPUs, eliminating the race.

Fixes: 3ab5aee7fe84 ("net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls")
Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
---
 net/ipv4/inet_timewait_sock.c | 20 +++++++++++---------
 1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 5b5426b8ee92..4c84a020315d 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -115,8 +115,9 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
 	struct inet_ehash_bucket *ehead = inet_ehash_bucket(hashinfo, sk->sk_hash);
 	spinlock_t *lock = inet_ehash_lockp(hashinfo, sk->sk_hash);
 	struct inet_bind_hashbucket *bhead, *bhead2;
+	bool ret = false;
 
-	/* Step 1: Put TW into bind hash. Original socket stays there too.
+	/* Put TW into bind hash. Original socket stays there too.
 	   Note, that any socket with inet->num != 0 MUST be bound in
 	   binding cache, even if it is closed.
 	 */
@@ -140,14 +141,6 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
 
 	spin_lock(lock);
 
-	/* Step 2: Hash TW into tcp ehash chain */
-	inet_twsk_add_node_rcu(tw, &ehead->chain);
-
-	/* Step 3: Remove SK from hash chain */
-	if (__sk_nulls_del_node_init_rcu(sk))
-		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
-
-
 	/* Ensure above writes are committed into memory before updating the
 	 * refcount.
 	 * Provides ordering vs later refcount_inc().
@@ -162,6 +155,15 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
 	 */
 	refcount_set(&tw->tw_refcnt, 3);
 
+	if (sk_hashed(sk)) {
+		ret = hlist_nulls_replace_init_rcu(&sk->sk_nulls_node,
+						   &tw->tw_node);
+		sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
+	}
+
+	if (!ret)
+		inet_twsk_add_node_rcu(tw, &ehead->chain);
+
 	inet_twsk_schedule(tw, timeo);
 
 	spin_unlock(lock);
-- 
2.27.0



* Re: [PATCH net-next v1 1/3] rculist: Add __hlist_nulls_replace_rcu() and hlist_nulls_replace_init_rcu()
  2025-09-15  7:03 ` [PATCH net-next v1 1/3] rculist: Add __hlist_nulls_replace_rcu() and hlist_nulls_replace_init_rcu() xuanqiang.luo
@ 2025-09-15 22:50   ` Kuniyuki Iwashima
  2025-09-16  1:33     ` luoxuanqiang
  0 siblings, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-15 22:50 UTC (permalink / raw)
  To: xuanqiang.luo
  Cc: edumazet, kerneljasonxing, davem, kuba, netdev, Xuanqiang Luo

On Mon, Sep 15, 2025 at 12:04 AM <xuanqiang.luo@linux.dev> wrote:
>
> From: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
>
> Add two functions to atomically replace RCU-protected hlist_nulls entries.
>
> Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
> ---
>  include/linux/rculist_nulls.h | 62 +++++++++++++++++++++++++++++++++++
>  1 file changed, 62 insertions(+)
>
> diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h
> index 89186c499dd4..eaa3a0d2f206 100644
> --- a/include/linux/rculist_nulls.h
> +++ b/include/linux/rculist_nulls.h
> @@ -152,6 +152,68 @@ static inline void hlist_nulls_add_fake(struct hlist_nulls_node *n)
>         n->next = (struct hlist_nulls_node *)NULLS_MARKER(NULL);
>  }
>
> +/**
> + * __hlist_nulls_replace_rcu - replace an old entry by a new one
> + * @old: the element to be replaced
> + * @new: the new element to insert
> + *
> + * Description:
> + * Replace the old entry with the new one in a RCU-protected hlist_nulls, while
> + * permitting racing traversals.
> + *
> + * The caller must take whatever precautions are necessary (such as holding
> + * appropriate locks) to avoid racing with another list-mutation primitive, such
> + * as hlist_nulls_add_head_rcu() or hlist_nulls_del_rcu(), running on this same
> + * list.  However, it is perfectly legal to run concurrently with the _rcu
> + * list-traversal primitives, such as hlist_nulls_for_each_entry_rcu().
> + */
> +static inline void __hlist_nulls_replace_rcu(struct hlist_nulls_node *old,
> +                                            struct hlist_nulls_node *new)
> +{
> +       struct hlist_nulls_node *next = old->next;
> +
> +       new->next = next;
> +       WRITE_ONCE(new->pprev, old->pprev);
> +       rcu_assign_pointer(*(struct hlist_nulls_node __rcu **)new->pprev, new);
> +       if (!is_a_nulls(next))
> +               WRITE_ONCE(new->next->pprev, &new->next);
> +}
> +
> +/**
> + * hlist_nulls_replace_init_rcu - replace an old entry by a new one and
> + * initialize the old
> + * @old: the element to be replaced
> + * @new: the new element to insert
> + *
> + * Description:
> + * Replace the old entry with the new one in a RCU-protected hlist_nulls, while
> + * permitting racing traversals, and reinitialize the old entry.
> + *
> + * Return: true if the old entry was hashed and was replaced successfully, false
> + * otherwise.
> + *
> + * Note: hlist_nulls_unhashed() on the old node returns true after this.
> + * It is useful for RCU based read lockfree traversal if the writer side must
> + * know if the list entry is still hashed or already unhashed.
> + *
> + * The caller must take whatever precautions are necessary (such as holding
> + * appropriate locks) to avoid racing with another list-mutation primitive, such
> + * as hlist_nulls_add_head_rcu() or hlist_nulls_del_rcu(), running on this same
> + * list. However, it is perfectly legal to run concurrently with the _rcu
> + * list-traversal primitives, such as hlist_nulls_for_each_entry_rcu().
> + */
> +static inline bool hlist_nulls_replace_init_rcu(struct hlist_nulls_node *old,
> +                                               struct hlist_nulls_node *new)
> +{
> +       if (!hlist_nulls_unhashed(old)) {

This is already checked by __sk_nulls_replace_node_init_rcu().


> +               __hlist_nulls_replace_rcu(old, new);
> +               old->pprev = NULL;
> +               return true;
> +       }
> +
> +       return false;
> +}
> +
>  /**
>   * hlist_nulls_for_each_entry_rcu - iterate over rcu list of given type
>   * @tpos:      the type * to use as a loop cursor.
> --
> 2.27.0
>


* Re: [PATCH net-next v1 2/3] inet: Avoid ehash lookup race in inet_ehash_insert()
  2025-09-15  7:03 ` [PATCH net-next v1 2/3] inet: Avoid ehash lookup race in inet_ehash_insert() xuanqiang.luo
@ 2025-09-15 23:00   ` Kuniyuki Iwashima
  2025-09-16  1:57     ` luoxuanqiang
  0 siblings, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-15 23:00 UTC (permalink / raw)
  To: xuanqiang.luo
  Cc: edumazet, kerneljasonxing, davem, kuba, netdev, Xuanqiang Luo

On Mon, Sep 15, 2025 at 12:04 AM <xuanqiang.luo@linux.dev> wrote:
>
> From: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
>
> Since ehash lookups are lockless, if one CPU performs a lookup while
> another concurrently deletes and inserts (removing reqsk and inserting sk),
> the lookup may fail to find the socket, an RST may be sent.
>
> The call trace map is drawn as follows:
>    CPU 0                           CPU 1
>    -----                           -----
>                                 inet_ehash_insert()
>                                 spin_lock()
>                                 sk_nulls_del_node_init_rcu(osk)
> __inet_lookup_established()
>         (lookup failed)
>                                 __sk_nulls_add_node_rcu(sk, list)
>                                 spin_unlock()
>
> As both deletion and insertion operate on the same ehash chain, this patch
> introduces two new sk_nulls_replace_* helper functions to implement atomic
> replacement.
>
> If sk_nulls_replace_node_init_rcu() fails, it indicates osk is either
> hlist_unhashed or hlist_nulls_unhashed. The former returns false; the
> latter performs insertion without deletion.
>
> Fixes: 5e0724d027f0 ("tcp/dccp: fix hashdance race for passive sessions")
> Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
> ---
>  include/net/sock.h         | 23 +++++++++++++++++++++++
>  net/ipv4/inet_hashtables.c |  7 +++++++
>  2 files changed, 30 insertions(+)
>
> diff --git a/include/net/sock.h b/include/net/sock.h
> index 896bec2d2176..26dacf7bc93e 100644
> --- a/include/net/sock.h
> +++ b/include/net/sock.h
> @@ -859,6 +859,29 @@ static inline bool sk_nulls_del_node_init_rcu(struct sock *sk)
>         return rc;
>  }
>
> +static inline bool __sk_nulls_replace_node_init_rcu(struct sock *old,
> +                                                   struct sock *new)
> +{
> +       if (sk_hashed(old) &&
> +           hlist_nulls_replace_init_rcu(&old->sk_nulls_node,
> +                                        &new->sk_nulls_node))
> +               return true;
> +
> +       return false;
> +}
> +
> +static inline bool sk_nulls_replace_node_init_rcu(struct sock *old,
> +                                                 struct sock *new)
> +{
> +       bool rc = __sk_nulls_replace_node_init_rcu(old, new);
> +
> +       if (rc) {
> +               WARN_ON(refcount_read(&old->sk_refcnt) == 1);
> +               __sock_put(old);
> +       }
> +       return rc;
> +}
> +
>  static inline void __sk_add_node(struct sock *sk, struct hlist_head *list)
>  {
>         hlist_add_head(&sk->sk_node, list);
> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> index ef4ccfd46ff6..7803fd3cc8e9 100644
> --- a/net/ipv4/inet_hashtables.c
> +++ b/net/ipv4/inet_hashtables.c
> @@ -685,6 +685,12 @@ bool inet_ehash_insert(struct sock *sk, struct sock *osk, bool *found_dup_sk)
>         spin_lock(lock);
>         if (osk) {
>                 WARN_ON_ONCE(sk->sk_hash != osk->sk_hash);
> +               /* Since osk and sk should be in the same ehash bucket, try
> +                * direct replacement to avoid lookup gaps. On failure, no
> +                * changes. sk_nulls_del_node_init_rcu() will handle the rest.

Both sk_nulls_replace_node_init_rcu() and
sk_nulls_del_node_init_rcu() return true only when
sk_hashed(osk) == true.

The only thing sk_nulls_del_node_init_rcu() does here is
set ret to false.
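
For reference, the existing helpers read roughly like this in
include/net/sock.h (paraphrased from the tree, not part of this patch):

static inline bool __sk_nulls_del_node_init_rcu(struct sock *sk)
{
        if (sk_hashed(sk)) {
                hlist_nulls_del_init_rcu(&sk->sk_nulls_node);
                return true;
        }
        return false;
}

static inline bool sk_nulls_del_node_init_rcu(struct sock *sk)
{
        bool rc = __sk_nulls_del_node_init_rcu(sk);

        if (rc) {
                /* drop the reference the hash list was holding */
                WARN_ON(refcount_read(&sk->sk_refcnt) == 1);
                __sock_put(sk);
        }
        return rc;
}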


> +                */
> +               if (sk_nulls_replace_node_init_rcu(osk, sk))
> +                       goto unlock;
>                 ret = sk_nulls_del_node_init_rcu(osk);

So, should we simply do

ret = sk_nulls_replace_node_init_rcu(osk, sk);
goto unlock;

?


>         } else if (found_dup_sk) {
>                 *found_dup_sk = inet_ehash_lookup_by_sk(sk, list);
> @@ -695,6 +701,7 @@ bool inet_ehash_insert(struct sock *sk, struct sock *osk, bool *found_dup_sk)
>         if (ret)
>                 __sk_nulls_add_node_rcu(sk, list);
>
> +unlock:
>         spin_unlock(lock);
>
>         return ret;
> --
> 2.27.0
>


* Re: [PATCH net-next v1 3/3] inet: Avoid ehash lookup race in inet_twsk_hashdance_schedule()
  2025-09-15  7:03 ` [PATCH net-next v1 3/3] inet: Avoid ehash lookup race in inet_twsk_hashdance_schedule() xuanqiang.luo
@ 2025-09-15 23:25   ` Kuniyuki Iwashima
  0 siblings, 0 replies; 12+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-15 23:25 UTC (permalink / raw)
  To: xuanqiang.luo
  Cc: edumazet, kerneljasonxing, davem, kuba, netdev, Xuanqiang Luo

On Mon, Sep 15, 2025 at 12:04 AM <xuanqiang.luo@linux.dev> wrote:
>
> From: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
>
> Since ehash lookups are lockless, if another CPU is converting sk to tw
> concurrently, fetching the newly inserted tw with tw->tw_refcnt == 0 cause
> lookup failure.
>
> The call trace map is drawn as follows:
>    CPU 0                                CPU 1
>    -----                                -----
>                                      inet_twsk_hashdance_schedule()
>                                      spin_lock()
>                                      inet_twsk_add_node_rcu(tw, ...)
> __inet_lookup_established()
> (find tw, failure due to tw_refcnt = 0)
>                                      __sk_nulls_del_node_init_rcu(sk)
>                                      refcount_set(&tw->tw_refcnt, 3)
>                                      spin_unlock()
>
> By replacing sk with tw atomically via hlist_nulls_replace_init_rcu() after
> setting tw_refcnt, we ensure that tw is either fully initialized or not
> visible to other CPUs, eliminating the race.
>
> Fixes: 3ab5aee7fe84 ("net: Convert TCP & DCCP hash tables to use RCU / hlist_nulls")
> Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
> ---
>  net/ipv4/inet_timewait_sock.c | 20 +++++++++++---------
>  1 file changed, 11 insertions(+), 9 deletions(-)
>
> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> index 5b5426b8ee92..4c84a020315d 100644
> --- a/net/ipv4/inet_timewait_sock.c
> +++ b/net/ipv4/inet_timewait_sock.c
> @@ -115,8 +115,9 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
>         struct inet_ehash_bucket *ehead = inet_ehash_bucket(hashinfo, sk->sk_hash);
>         spinlock_t *lock = inet_ehash_lockp(hashinfo, sk->sk_hash);
>         struct inet_bind_hashbucket *bhead, *bhead2;
> +       bool ret = false;
>
> -       /* Step 1: Put TW into bind hash. Original socket stays there too.
> +       /* Put TW into bind hash. Original socket stays there too.
>            Note, that any socket with inet->num != 0 MUST be bound in
>            binding cache, even if it is closed.
>          */
> @@ -140,14 +141,6 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
>
>         spin_lock(lock);
>
> -       /* Step 2: Hash TW into tcp ehash chain */
> -       inet_twsk_add_node_rcu(tw, &ehead->chain);
> -
> -       /* Step 3: Remove SK from hash chain */
> -       if (__sk_nulls_del_node_init_rcu(sk))
> -               sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
> -
> -
>         /* Ensure above writes are committed into memory before updating the
>          * refcount.
>          * Provides ordering vs later refcount_inc().
> @@ -162,6 +155,15 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
>          */
>         refcount_set(&tw->tw_refcnt, 3);
>
> +       if (sk_hashed(sk)) {

Is this check necessary?

I think tcp_time_wait() is called under bh_lock_sock(), so there
will be no racing thread, unlike reqsk vs full sk for the 3WHS.

I guess the existing code checks the retval of
__sk_nulls_del_node_init_rcu() just in case.


> +               ret = hlist_nulls_replace_init_rcu(&sk->sk_nulls_node,
> +                                                  &tw->tw_node);
> +               sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
> +       }
> +
> +       if (!ret)
> +               inet_twsk_add_node_rcu(tw, &ehead->chain);
> +
>         inet_twsk_schedule(tw, timeo);
>
>         spin_unlock(lock);
> --
> 2.27.0
>


* Re: [PATCH net-next v1 1/3] rculist: Add __hlist_nulls_replace_rcu() and hlist_nulls_replace_init_rcu()
  2025-09-15 22:50   ` Kuniyuki Iwashima
@ 2025-09-16  1:33     ` luoxuanqiang
  2025-09-16  2:10       ` Kuniyuki Iwashima
  0 siblings, 1 reply; 12+ messages in thread
From: luoxuanqiang @ 2025-09-16  1:33 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: edumazet, kerneljasonxing, davem, kuba, netdev, Xuanqiang Luo


On 2025/9/16 06:50, Kuniyuki Iwashima wrote:
> On Mon, Sep 15, 2025 at 12:04 AM <xuanqiang.luo@linux.dev> wrote:
>> From: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
>>
>> Add two functions to atomically replace RCU-protected hlist_nulls entries.
>>
>> Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
>> ---
>>   include/linux/rculist_nulls.h | 62 +++++++++++++++++++++++++++++++++++
>>   1 file changed, 62 insertions(+)
>>
>> diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h
>> index 89186c499dd4..eaa3a0d2f206 100644
>> --- a/include/linux/rculist_nulls.h
>> +++ b/include/linux/rculist_nulls.h
>> @@ -152,6 +152,68 @@ static inline void hlist_nulls_add_fake(struct hlist_nulls_node *n)
>>          n->next = (struct hlist_nulls_node *)NULLS_MARKER(NULL);
>>   }
>>
>> +/**
>> + * __hlist_nulls_replace_rcu - replace an old entry by a new one
>> + * @old: the element to be replaced
>> + * @new: the new element to insert
>> + *
>> + * Description:
>> + * Replace the old entry with the new one in a RCU-protected hlist_nulls, while
>> + * permitting racing traversals.
>> + *
>> + * The caller must take whatever precautions are necessary (such as holding
>> + * appropriate locks) to avoid racing with another list-mutation primitive, such
>> + * as hlist_nulls_add_head_rcu() or hlist_nulls_del_rcu(), running on this same
>> + * list.  However, it is perfectly legal to run concurrently with the _rcu
>> + * list-traversal primitives, such as hlist_nulls_for_each_entry_rcu().
>> + */
>> +static inline void __hlist_nulls_replace_rcu(struct hlist_nulls_node *old,
>> +                                            struct hlist_nulls_node *new)
>> +{
>> +       struct hlist_nulls_node *next = old->next;
>> +
>> +       new->next = next;
>> +       WRITE_ONCE(new->pprev, old->pprev);
>> +       rcu_assign_pointer(*(struct hlist_nulls_node __rcu **)new->pprev, new);
>> +       if (!is_a_nulls(next))
>> +               WRITE_ONCE(new->next->pprev, &new->next);
>> +}
>> +
>> +/**
>> + * hlist_nulls_replace_init_rcu - replace an old entry by a new one and
>> + * initialize the old
>> + * @old: the element to be replaced
>> + * @new: the new element to insert
>> + *
>> + * Description:
>> + * Replace the old entry with the new one in a RCU-protected hlist_nulls, while
>> + * permitting racing traversals, and reinitialize the old entry.
>> + *
>> + * Return: true if the old entry was hashed and was replaced successfully, false
>> + * otherwise.
>> + *
>> + * Note: hlist_nulls_unhashed() on the old node returns true after this.
>> + * It is useful for RCU based read lockfree traversal if the writer side must
>> + * know if the list entry is still hashed or already unhashed.
>> + *
>> + * The caller must take whatever precautions are necessary (such as holding
>> + * appropriate locks) to avoid racing with another list-mutation primitive, such
>> + * as hlist_nulls_add_head_rcu() or hlist_nulls_del_rcu(), running on this same
>> + * list. However, it is perfectly legal to run concurrently with the _rcu
>> + * list-traversal primitives, such as hlist_nulls_for_each_entry_rcu().
>> + */
>> +static inline bool hlist_nulls_replace_init_rcu(struct hlist_nulls_node *old,
>> +                                               struct hlist_nulls_node *new)
>> +{
>> +       if (!hlist_nulls_unhashed(old)) {
> This is already checked by __sk_nulls_replace_node_init_rcu().
>
It seems to me that hlist_nulls_replace_init_rcu() checks whether the
sk_nulls_node is unhashed, while __sk_nulls_replace_node_init_rcu() checks the
sk_node's unhashed status. Perhaps these serve different purposes?
This would also maintain parity with how hlist_nulls_del_init_rcu() verifies
sk_nulls_node while __sk_nulls_del_node_init_rcu() checks sk_node.


Thanks
Xuanqiang

>> +               __hlist_nulls_replace_rcu(old, new);
>> +               old->pprev = NULL;
>> +               return true;
>> +       }
>> +
>> +       return false;
>> +}
>> +
>>   /**
>>    * hlist_nulls_for_each_entry_rcu - iterate over rcu list of given type
>>    * @tpos:      the type * to use as a loop cursor.
>> --
>> 2.27.0
>>


* Re: [PATCH net-next v1 2/3] inet: Avoid ehash lookup race in inet_ehash_insert()
  2025-09-15 23:00   ` Kuniyuki Iwashima
@ 2025-09-16  1:57     ` luoxuanqiang
  2025-09-16  2:18       ` Kuniyuki Iwashima
  0 siblings, 1 reply; 12+ messages in thread
From: luoxuanqiang @ 2025-09-16  1:57 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: edumazet, kerneljasonxing, davem, kuba, netdev, Xuanqiang Luo


On 2025/9/16 07:00, Kuniyuki Iwashima wrote:
> On Mon, Sep 15, 2025 at 12:04 AM <xuanqiang.luo@linux.dev> wrote:
>> From: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
>>
>> Since ehash lookups are lockless, if one CPU performs a lookup while
>> another concurrently deletes and inserts (removing reqsk and inserting sk),
>> the lookup may fail to find the socket, an RST may be sent.
>>
>> The call trace map is drawn as follows:
>>     CPU 0                           CPU 1
>>     -----                           -----
>>                                  inet_ehash_insert()
>>                                  spin_lock()
>>                                  sk_nulls_del_node_init_rcu(osk)
>> __inet_lookup_established()
>>          (lookup failed)
>>                                  __sk_nulls_add_node_rcu(sk, list)
>>                                  spin_unlock()
>>
>> As both deletion and insertion operate on the same ehash chain, this patch
>> introduces two new sk_nulls_replace_* helper functions to implement atomic
>> replacement.
>>
>> If sk_nulls_replace_node_init_rcu() fails, it indicates osk is either
>> hlist_unhashed or hlist_nulls_unhashed. The former returns false; the
>> latter performs insertion without deletion.
>>
>> Fixes: 5e0724d027f0 ("tcp/dccp: fix hashdance race for passive sessions")
>> Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
>> ---
>>   include/net/sock.h         | 23 +++++++++++++++++++++++
>>   net/ipv4/inet_hashtables.c |  7 +++++++
>>   2 files changed, 30 insertions(+)
>>
>> diff --git a/include/net/sock.h b/include/net/sock.h
>> index 896bec2d2176..26dacf7bc93e 100644
>> --- a/include/net/sock.h
>> +++ b/include/net/sock.h
>> @@ -859,6 +859,29 @@ static inline bool sk_nulls_del_node_init_rcu(struct sock *sk)
>>          return rc;
>>   }
>>
>> +static inline bool __sk_nulls_replace_node_init_rcu(struct sock *old,
>> +                                                   struct sock *new)
>> +{
>> +       if (sk_hashed(old) &&
>> +           hlist_nulls_replace_init_rcu(&old->sk_nulls_node,
>> +                                        &new->sk_nulls_node))
>> +               return true;
>> +
>> +       return false;
>> +}
>> +
>> +static inline bool sk_nulls_replace_node_init_rcu(struct sock *old,
>> +                                                 struct sock *new)
>> +{
>> +       bool rc = __sk_nulls_replace_node_init_rcu(old, new);
>> +
>> +       if (rc) {
>> +               WARN_ON(refcount_read(&old->sk_refcnt) == 1);
>> +               __sock_put(old);
>> +       }
>> +       return rc;
>> +}
>> +
>>   static inline void __sk_add_node(struct sock *sk, struct hlist_head *list)
>>   {
>>          hlist_add_head(&sk->sk_node, list);
>> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
>> index ef4ccfd46ff6..7803fd3cc8e9 100644
>> --- a/net/ipv4/inet_hashtables.c
>> +++ b/net/ipv4/inet_hashtables.c
>> @@ -685,6 +685,12 @@ bool inet_ehash_insert(struct sock *sk, struct sock *osk, bool *found_dup_sk)
>>          spin_lock(lock);
>>          if (osk) {
>>                  WARN_ON_ONCE(sk->sk_hash != osk->sk_hash);
>> +               /* Since osk and sk should be in the same ehash bucket, try
>> +                * direct replacement to avoid lookup gaps. On failure, no
>> +                * changes. sk_nulls_del_node_init_rcu() will handle the rest.
> Both sk_nulls_replace_node_init_rcu() and
> sk_nulls_del_node_init_rcu() return true only when
> sk_hashed(osk) == true.
>
> Only thing sk_nulls_del_node_init_rcu() does is to
> set ret to false.
>
>
>> +                */
>> +               if (sk_nulls_replace_node_init_rcu(osk, sk))
>> +                       goto unlock;
>>                  ret = sk_nulls_del_node_init_rcu(osk);
> So, should we simply do
>
> ret = sk_nulls_replace_node_init_rcu(osk, sk);
> goto unlock;
>
> ?

sk_nulls_replace_node_init_rcu() only returns true if both
sk_hashed(osk) == true and hlist_nulls_unhashed(old) == false.
However, in the original sk_nulls_del_node_init_rcu() logic, when
sk_hashed(osk) == true it always performs __sock_put(sk), regardless of
the hlist_nulls_unhashed(old) check. Therefore, if
sk_nulls_replace_node_init_rcu() fails, we can safely let ret and
__sock_put(sk) be handled by the subsequent sk_nulls_del_node_init_rcu(osk)
call.

Thanks,
Xuanqiang

>
>>          } else if (found_dup_sk) {
>>                  *found_dup_sk = inet_ehash_lookup_by_sk(sk, list);
>> @@ -695,6 +701,7 @@ bool inet_ehash_insert(struct sock *sk, struct sock *osk, bool *found_dup_sk)
>>          if (ret)
>>                  __sk_nulls_add_node_rcu(sk, list);
>>
>> +unlock:
>>          spin_unlock(lock);
>>
>>          return ret;
>> --
>> 2.27.0
>>


* Re: [PATCH net-next v1 1/3] rculist: Add __hlist_nulls_replace_rcu() and hlist_nulls_replace_init_rcu()
  2025-09-16  1:33     ` luoxuanqiang
@ 2025-09-16  2:10       ` Kuniyuki Iwashima
  0 siblings, 0 replies; 12+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-16  2:10 UTC (permalink / raw)
  To: luoxuanqiang
  Cc: edumazet, kerneljasonxing, davem, kuba, netdev, Xuanqiang Luo

On Mon, Sep 15, 2025 at 6:34 PM luoxuanqiang <xuanqiang.luo@linux.dev> wrote:
>
>
> On 2025/9/16 06:50, Kuniyuki Iwashima wrote:
> > On Mon, Sep 15, 2025 at 12:04 AM <xuanqiang.luo@linux.dev> wrote:
> >> From: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
> >>
> >> Add two functions to atomically replace RCU-protected hlist_nulls entries.
> >>
> >> Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
> >> ---
> >>   include/linux/rculist_nulls.h | 62 +++++++++++++++++++++++++++++++++++
> >>   1 file changed, 62 insertions(+)
> >>
> >> diff --git a/include/linux/rculist_nulls.h b/include/linux/rculist_nulls.h
> >> index 89186c499dd4..eaa3a0d2f206 100644
> >> --- a/include/linux/rculist_nulls.h
> >> +++ b/include/linux/rculist_nulls.h
> >> @@ -152,6 +152,68 @@ static inline void hlist_nulls_add_fake(struct hlist_nulls_node *n)
> >>          n->next = (struct hlist_nulls_node *)NULLS_MARKER(NULL);
> >>   }
> >>
> >> +/**
> >> + * __hlist_nulls_replace_rcu - replace an old entry by a new one
> >> + * @old: the element to be replaced
> >> + * @new: the new element to insert
> >> + *
> >> + * Description:
> >> + * Replace the old entry with the new one in a RCU-protected hlist_nulls, while
> >> + * permitting racing traversals.
> >> + *
> >> + * The caller must take whatever precautions are necessary (such as holding
> >> + * appropriate locks) to avoid racing with another list-mutation primitive, such
> >> + * as hlist_nulls_add_head_rcu() or hlist_nulls_del_rcu(), running on this same
> >> + * list.  However, it is perfectly legal to run concurrently with the _rcu
> >> + * list-traversal primitives, such as hlist_nulls_for_each_entry_rcu().
> >> + */
> >> +static inline void __hlist_nulls_replace_rcu(struct hlist_nulls_node *old,
> >> +                                            struct hlist_nulls_node *new)
> >> +{
> >> +       struct hlist_nulls_node *next = old->next;
> >> +
> >> +       new->next = next;
> >> +       WRITE_ONCE(new->pprev, old->pprev);
> >> +       rcu_assign_pointer(*(struct hlist_nulls_node __rcu **)new->pprev, new);
> >> +       if (!is_a_nulls(next))
> >> +               WRITE_ONCE(new->next->pprev, &new->next);
> >> +}
> >> +
> >> +/**
> >> + * hlist_nulls_replace_init_rcu - replace an old entry by a new one and
> >> + * initialize the old
> >> + * @old: the element to be replaced
> >> + * @new: the new element to insert
> >> + *
> >> + * Description:
> >> + * Replace the old entry with the new one in a RCU-protected hlist_nulls, while
> >> + * permitting racing traversals, and reinitialize the old entry.
> >> + *
> >> + * Return: true if the old entry was hashed and was replaced successfully, false
> >> + * otherwise.
> >> + *
> >> + * Note: hlist_nulls_unhashed() on the old node returns true after this.
> >> + * It is useful for RCU based read lockfree traversal if the writer side must
> >> + * know if the list entry is still hashed or already unhashed.
> >> + *
> >> + * The caller must take whatever precautions are necessary (such as holding
> >> + * appropriate locks) to avoid racing with another list-mutation primitive, such
> >> + * as hlist_nulls_add_head_rcu() or hlist_nulls_del_rcu(), running on this same
> >> + * list. However, it is perfectly legal to run concurrently with the _rcu
> >> + * list-traversal primitives, such as hlist_nulls_for_each_entry_rcu().
> >> + */
> >> +static inline bool hlist_nulls_replace_init_rcu(struct hlist_nulls_node *old,
> >> +                                               struct hlist_nulls_node *new)
> >> +{
> >> +       if (!hlist_nulls_unhashed(old)) {
> > This is already checked by __sk_nulls_replace_node_init_rcu().
> >
> It seems to me that hlist_nulls_replace_init_rcu() checks whether the
> sk_nulls_node is unhashed, while __sk_nulls_replace_node_init_rcu() checks the
> sk_node's unhashed status.

sk_nulls_node and sk_node have the same size and
are defined in a union, so we can reuse sk_hashed()
for both types.
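
For completeness, the relevant definitions look roughly like this
(abridged from include/net/sock.h):

struct sock_common {
        ...
        union {
                struct hlist_node       skc_node;
                struct hlist_nulls_node skc_nulls_node;
        };
        ...
};

/* aliases used via struct sock */
#define sk_node         __sk_common.skc_node
#define sk_nulls_node   __sk_common.skc_nulls_node

static inline bool sk_unhashed(const struct sock *sk)
{
        /* pprev sits at the same offset in both union members */
        return hlist_unhashed(&sk->sk_node);
}

static inline bool sk_hashed(const struct sock *sk)
{
        return !sk_unhashed(sk);
}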


> Perhaps these serve different purposes?
> This would maintain parity with how hlist_nulls_del_init_rcu verifies
> sk_nulls_node and __sk_nulls_del_node_init_rcu() checks sk_node.
>
>
> Thanks
> Xuanqiang
>
> >> +               __hlist_nulls_replace_rcu(old, new);
> >> +               old->pprev = NULL;
> >> +               return true;
> >> +       }
> >> +
> >> +       return false;
> >> +}
> >> +
> >>   /**
> >>    * hlist_nulls_for_each_entry_rcu - iterate over rcu list of given type
> >>    * @tpos:      the type * to use as a loop cursor.
> >> --
> >> 2.27.0
> >>


* Re: [PATCH net-next v1 2/3] inet: Avoid ehash lookup race in inet_ehash_insert()
  2025-09-16  1:57     ` luoxuanqiang
@ 2025-09-16  2:18       ` Kuniyuki Iwashima
  2025-09-16  2:28         ` luoxuanqiang
  0 siblings, 1 reply; 12+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-16  2:18 UTC (permalink / raw)
  To: luoxuanqiang
  Cc: edumazet, kerneljasonxing, davem, kuba, netdev, Xuanqiang Luo

On Mon, Sep 15, 2025 at 6:57 PM luoxuanqiang <xuanqiang.luo@linux.dev> wrote:
>
>
> On 2025/9/16 07:00, Kuniyuki Iwashima wrote:
> > On Mon, Sep 15, 2025 at 12:04 AM <xuanqiang.luo@linux.dev> wrote:
> >> From: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
> >>
> >> Since ehash lookups are lockless, if one CPU performs a lookup while
> >> another concurrently deletes and inserts (removing reqsk and inserting sk),
> >> the lookup may fail to find the socket, an RST may be sent.
> >>
> >> The call trace map is drawn as follows:
> >>     CPU 0                           CPU 1
> >>     -----                           -----
> >>                                  inet_ehash_insert()
> >>                                  spin_lock()
> >>                                  sk_nulls_del_node_init_rcu(osk)
> >> __inet_lookup_established()
> >>          (lookup failed)
> >>                                  __sk_nulls_add_node_rcu(sk, list)
> >>                                  spin_unlock()
> >>
> >> As both deletion and insertion operate on the same ehash chain, this patch
> >> introduces two new sk_nulls_replace_* helper functions to implement atomic
> >> replacement.
> >>
> >> If sk_nulls_replace_node_init_rcu() fails, it indicates osk is either
> >> hlist_unhashed or hlist_nulls_unhashed. The former returns false; the
> >> latter performs insertion without deletion.
> >>
> >> Fixes: 5e0724d027f0 ("tcp/dccp: fix hashdance race for passive sessions")
> >> Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
> >> ---
> >>   include/net/sock.h         | 23 +++++++++++++++++++++++
> >>   net/ipv4/inet_hashtables.c |  7 +++++++
> >>   2 files changed, 30 insertions(+)
> >>
> >> diff --git a/include/net/sock.h b/include/net/sock.h
> >> index 896bec2d2176..26dacf7bc93e 100644
> >> --- a/include/net/sock.h
> >> +++ b/include/net/sock.h
> >> @@ -859,6 +859,29 @@ static inline bool sk_nulls_del_node_init_rcu(struct sock *sk)
> >>          return rc;
> >>   }
> >>
> >> +static inline bool __sk_nulls_replace_node_init_rcu(struct sock *old,
> >> +                                                   struct sock *new)
> >> +{
> >> +       if (sk_hashed(old) &&
> >> +           hlist_nulls_replace_init_rcu(&old->sk_nulls_node,
> >> +                                        &new->sk_nulls_node))
> >> +               return true;
> >> +
> >> +       return false;
> >> +}
> >> +
> >> +static inline bool sk_nulls_replace_node_init_rcu(struct sock *old,
> >> +                                                 struct sock *new)
> >> +{
> >> +       bool rc = __sk_nulls_replace_node_init_rcu(old, new);
> >> +
> >> +       if (rc) {
> >> +               WARN_ON(refcount_read(&old->sk_refcnt) == 1);
> >> +               __sock_put(old);
> >> +       }
> >> +       return rc;
> >> +}
> >> +
> >>   static inline void __sk_add_node(struct sock *sk, struct hlist_head *list)
> >>   {
> >>          hlist_add_head(&sk->sk_node, list);
> >> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
> >> index ef4ccfd46ff6..7803fd3cc8e9 100644
> >> --- a/net/ipv4/inet_hashtables.c
> >> +++ b/net/ipv4/inet_hashtables.c
> >> @@ -685,6 +685,12 @@ bool inet_ehash_insert(struct sock *sk, struct sock *osk, bool *found_dup_sk)
> >>          spin_lock(lock);
> >>          if (osk) {
> >>                  WARN_ON_ONCE(sk->sk_hash != osk->sk_hash);
> >> +               /* Since osk and sk should be in the same ehash bucket, try
> >> +                * direct replacement to avoid lookup gaps. On failure, no
> >> +                * changes. sk_nulls_del_node_init_rcu() will handle the rest.
> > Both sk_nulls_replace_node_init_rcu() and
> > sk_nulls_del_node_init_rcu() return true only when
> > sk_hashed(osk) == true.
> >
> > Only thing sk_nulls_del_node_init_rcu() does is to
> > set ret to false.
> >
> >
> >> +                */
> >> +               if (sk_nulls_replace_node_init_rcu(osk, sk))
> >> +                       goto unlock;
> >>                  ret = sk_nulls_del_node_init_rcu(osk);
> > So, should we simply do
> >
> > ret = sk_nulls_replace_node_init_rcu(osk, sk);
> > goto unlock;
> >
> > ?
>
> sk_nulls_replace_node_init_rcu() only returns true if both
> sk_hashed(osk) == true and hlist_nulls_unhashed(old) == false.

sk_hashed(sk) == !hlist_nulls_unhashed(&sk->sk_nulls_node)
is always true as sk_node and sk_nulls_node are in union.


> However, in the original sk_nulls_del_node_init_rcu() logic, when
> sk_hashed(osk) == true,

So this should be an unreachable branch.

> it always performs __sock_put(sk) regardless of
> the hlist_nulls_unhashed(old) check. Therefore, if
> sk_nulls_replace_node_init_rcu() fails, we can safely let ret or
> __sock_put(sk) be handled by the subsequent
> sk_nulls_del_node_init_rcu(osk) call. Thanks Xuanqiang.
>
> >
> >>          } else if (found_dup_sk) {
> >>                  *found_dup_sk = inet_ehash_lookup_by_sk(sk, list);
> >> @@ -695,6 +701,7 @@ bool inet_ehash_insert(struct sock *sk, struct sock *osk, bool *found_dup_sk)
> >>          if (ret)
> >>                  __sk_nulls_add_node_rcu(sk, list);
> >>
> >> +unlock:
> >>          spin_unlock(lock);
> >>
> >>          return ret;
> >> --
> >> 2.27.0
> >>


* Re: [PATCH net-next v1 2/3] inet: Avoid ehash lookup race in inet_ehash_insert()
  2025-09-16  2:18       ` Kuniyuki Iwashima
@ 2025-09-16  2:28         ` luoxuanqiang
  0 siblings, 0 replies; 12+ messages in thread
From: luoxuanqiang @ 2025-09-16  2:28 UTC (permalink / raw)
  To: Kuniyuki Iwashima
  Cc: edumazet, kerneljasonxing, davem, kuba, netdev, Xuanqiang Luo


On 2025/9/16 10:18, Kuniyuki Iwashima wrote:
> On Mon, Sep 15, 2025 at 6:57 PM luoxuanqiang <xuanqiang.luo@linux.dev> wrote:
>>
>> On 2025/9/16 07:00, Kuniyuki Iwashima wrote:
>>> On Mon, Sep 15, 2025 at 12:04 AM <xuanqiang.luo@linux.dev> wrote:
>>>> From: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
>>>>
>>>> Since ehash lookups are lockless, if one CPU performs a lookup while
>>>> another concurrently deletes and inserts (removing reqsk and inserting sk),
>>>> the lookup may fail to find the socket, an RST may be sent.
>>>>
>>>> The call trace map is drawn as follows:
>>>>      CPU 0                           CPU 1
>>>>      -----                           -----
>>>>                                   inet_ehash_insert()
>>>>                                   spin_lock()
>>>>                                   sk_nulls_del_node_init_rcu(osk)
>>>> __inet_lookup_established()
>>>>           (lookup failed)
>>>>                                   __sk_nulls_add_node_rcu(sk, list)
>>>>                                   spin_unlock()
>>>>
>>>> As both deletion and insertion operate on the same ehash chain, this patch
>>>> introduces two new sk_nulls_replace_* helper functions to implement atomic
>>>> replacement.
>>>>
>>>> If sk_nulls_replace_node_init_rcu() fails, it indicates osk is either
>>>> hlist_unhashed or hlist_nulls_unhashed. The former returns false; the
>>>> latter performs insertion without deletion.
>>>>
>>>> Fixes: 5e0724d027f0 ("tcp/dccp: fix hashdance race for passive sessions")
>>>> Signed-off-by: Xuanqiang Luo <luoxuanqiang@kylinos.cn>
>>>> ---
>>>>    include/net/sock.h         | 23 +++++++++++++++++++++++
>>>>    net/ipv4/inet_hashtables.c |  7 +++++++
>>>>    2 files changed, 30 insertions(+)
>>>>
>>>> diff --git a/include/net/sock.h b/include/net/sock.h
>>>> index 896bec2d2176..26dacf7bc93e 100644
>>>> --- a/include/net/sock.h
>>>> +++ b/include/net/sock.h
>>>> @@ -859,6 +859,29 @@ static inline bool sk_nulls_del_node_init_rcu(struct sock *sk)
>>>>           return rc;
>>>>    }
>>>>
>>>> +static inline bool __sk_nulls_replace_node_init_rcu(struct sock *old,
>>>> +                                                   struct sock *new)
>>>> +{
>>>> +       if (sk_hashed(old) &&
>>>> +           hlist_nulls_replace_init_rcu(&old->sk_nulls_node,
>>>> +                                        &new->sk_nulls_node))
>>>> +               return true;
>>>> +
>>>> +       return false;
>>>> +}
>>>> +
>>>> +static inline bool sk_nulls_replace_node_init_rcu(struct sock *old,
>>>> +                                                 struct sock *new)
>>>> +{
>>>> +       bool rc = __sk_nulls_replace_node_init_rcu(old, new);
>>>> +
>>>> +       if (rc) {
>>>> +               WARN_ON(refcount_read(&old->sk_refcnt) == 1);
>>>> +               __sock_put(old);
>>>> +       }
>>>> +       return rc;
>>>> +}
>>>> +
>>>>    static inline void __sk_add_node(struct sock *sk, struct hlist_head *list)
>>>>    {
>>>>           hlist_add_head(&sk->sk_node, list);
>>>> diff --git a/net/ipv4/inet_hashtables.c b/net/ipv4/inet_hashtables.c
>>>> index ef4ccfd46ff6..7803fd3cc8e9 100644
>>>> --- a/net/ipv4/inet_hashtables.c
>>>> +++ b/net/ipv4/inet_hashtables.c
>>>> @@ -685,6 +685,12 @@ bool inet_ehash_insert(struct sock *sk, struct sock *osk, bool *found_dup_sk)
>>>>           spin_lock(lock);
>>>>           if (osk) {
>>>>                   WARN_ON_ONCE(sk->sk_hash != osk->sk_hash);
>>>> +               /* Since osk and sk should be in the same ehash bucket, try
>>>> +                * direct replacement to avoid lookup gaps. On failure, no
>>>> +                * changes. sk_nulls_del_node_init_rcu() will handle the rest.
>>> Both sk_nulls_replace_node_init_rcu() and
>>> sk_nulls_del_node_init_rcu() return true only when
>>> sk_hashed(osk) == true.
>>>
>>> Only thing sk_nulls_del_node_init_rcu() does is to
>>> set ret to false.
>>>
>>>
>>>> +                */
>>>> +               if (sk_nulls_replace_node_init_rcu(osk, sk))
>>>> +                       goto unlock;
>>>>                   ret = sk_nulls_del_node_init_rcu(osk);
>>> So, should we simply do
>>>
>>> ret = sk_nulls_replace_node_init_rcu(osk, sk);
>>> goto unlock;
>>>
>>> ?
>> sk_nulls_replace_node_init_rcu() only returns true if both
>> sk_hashed(osk) == true and hlist_nulls_unhashed(old) == false.
> sk_hashed(sk) == !hlist_nulls_unhashed(&sk->sk_nulls_node)
> is always true as sk_node and sk_nulls_node are in union.
>
>
>> However, in the original sk_nulls_del_node_init_rcu() logic, when
>> sk_hashed(osk) == true,
> So this should be an unreachable branch.

Oh, you're right! That's a point I overlooked. I'll fix it in V2.


thanks Kuni!

>> it always performs __sock_put(sk) regardless of
>> the hlist_nulls_unhashed(old) check. Therefore, if
>> sk_nulls_replace_node_init_rcu() fails, we can safely let ret or
>> __sock_put(sk) be handled by the subsequent
>> sk_nulls_del_node_init_rcu(osk) call. Thanks Xuanqiang.
>>
>>>>           } else if (found_dup_sk) {
>>>>                   *found_dup_sk = inet_ehash_lookup_by_sk(sk, list);
>>>> @@ -695,6 +701,7 @@ bool inet_ehash_insert(struct sock *sk, struct sock *osk, bool *found_dup_sk)
>>>>           if (ret)
>>>>                   __sk_nulls_add_node_rcu(sk, list);
>>>>
>>>> +unlock:
>>>>           spin_unlock(lock);
>>>>
>>>>           return ret;
>>>> --
>>>> 2.27.0
>>>>

