* [PATCH net-next v6 0/3] net: tcp: un-pin tw timer
@ 2024-06-03 9:36 Florian Westphal
2024-06-03 9:36 ` [PATCH net-next v6 1/3] net: tcp/dcpp: prepare for tw_timer un-pinning Florian Westphal
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Florian Westphal @ 2024-06-03 9:36 UTC (permalink / raw)
To: netdev
Cc: Paolo Abeni, David S. Miller, Eric Dumazet, Jakub Kicinski,
mleitner, juri.lelli, vschneid, tglozar, dsahern, bigeasy, tglx
This is v6 of the series that un-pins the tw_timer to get rid of
interference on isolated-CPU setups.
The first patch makes the necessary preparations; the existing code
relies on TIMER_PINNED to avoid races.
The second patch un-pins the tw timer. It could be folded into the first
one, but keeping it separate might help wrt. bisection.
The third patch is a minor cleanup that moves a helper from the .h file
to the only remaining compilation unit.
Tested with iperf3 and stress-ng socket mode.
Changes since previous iteration:
- do not use work queues, keep timers as-is
- use timer_shutdown (can't wait for timer with spinlock held)
- keep existing tw sk refcount (3)
Florian Westphal (2):
net: tcp: un-pin the tw_timer
tcp: move inet_twsk_schedule helper out of header
Valentin Schneider (1):
net: tcp/dcpp: prepare for tw_timer un-pinning
include/net/inet_timewait_sock.h | 11 ++---
net/dccp/minisocks.c | 3 +-
net/ipv4/inet_timewait_sock.c | 81 +++++++++++++++++++++++++++-----
net/ipv4/tcp_minisocks.c | 3 +-
4 files changed, 74 insertions(+), 24 deletions(-)
--
2.44.2
* [PATCH net-next v6 1/3] net: tcp/dcpp: prepare for tw_timer un-pinning
2024-06-03 9:36 [PATCH net-next v6 0/3] net: tcp: un-pin tw timer Florian Westphal
@ 2024-06-03 9:36 ` Florian Westphal
2024-06-03 10:30 ` Eric Dumazet
2024-06-03 10:53 ` shaozhengchao
2024-06-03 9:36 ` [PATCH net-next v6 2/3] net: tcp: un-pin the tw_timer Florian Westphal
2024-06-03 9:36 ` [PATCH net-next v6 3/3] tcp: move inet_twsk_schedule helper out of header Florian Westphal
2 siblings, 2 replies; 11+ messages in thread
From: Florian Westphal @ 2024-06-03 9:36 UTC (permalink / raw)
To: netdev
Cc: Paolo Abeni, David S. Miller, Eric Dumazet, Jakub Kicinski,
mleitner, juri.lelli, vschneid, tglozar, dsahern, bigeasy, tglx
From: Valentin Schneider <vschneid@redhat.com>
The TCP timewait timer is proving to be problematic for setups where
scheduler CPU isolation is achieved at runtime via cpusets (as opposed to
statically via isolcpus=domains).
What happens there is a CPU goes through tcp_time_wait(), arming the
time_wait timer, then gets isolated. TCP_TIMEWAIT_LEN later, the timer
fires, causing interference for the now-isolated CPU. This is conceptually
similar to the issue described in commit e02b93124855 ("workqueue: Unbind
kworkers before sending them to exit()")
Move inet_twsk_schedule() to within inet_twsk_hashdance(), with the ehash
lock held. Expand the lock's critical section from inet_twsk_kill() to
inet_twsk_deschedule_put(), serializing the scheduling vs descheduling of
the timer. IOW, this prevents the following race:
tcp_time_wait()
  inet_twsk_hashdance()
                              inet_twsk_deschedule_put()
                                del_timer_sync()
  inet_twsk_schedule()
Thanks to Paolo Abeni for suggesting to leverage the ehash lock.
This also restores a comment from commit ec94c2696f0b ("tcp/dccp: avoid
one atomic operation for timewait hashdance") as inet_twsk_hashdance() had
a "Step 1" and "Step 3" comment, but the "Step 2" had gone missing.
inet_twsk_deschedule_put() now acquires the ehash spinlock to synchronize
vs. reschedule and timer firing: del_timer_sync() is replaced with
timer_shutdown().
This means that tw_timer may still be running on another CPU. However, as
the timer owns a reference on the tw sk that is only put at the end of the
handler, this should be fine.
To ease possible regression search, actual un-pin is done in next patch.
Link: https://lore.kernel.org/all/ZPhpfMjSiHVjQkTk@localhost.localdomain/
Signed-off-by: Valentin Schneider <vschneid@redhat.com>
Signed-off-by: Florian Westphal <fw@strlen.de>
---
include/net/inet_timewait_sock.h | 6 ++-
net/dccp/minisocks.c | 3 +-
net/ipv4/inet_timewait_sock.c | 74 ++++++++++++++++++++++++++------
net/ipv4/tcp_minisocks.c | 3 +-
4 files changed, 68 insertions(+), 18 deletions(-)
diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index 2a536eea9424..5b43d220243d 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -93,8 +93,10 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk,
struct inet_timewait_death_row *dr,
const int state);
-void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
- struct inet_hashinfo *hashinfo);
+void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
+ struct sock *sk,
+ struct inet_hashinfo *hashinfo,
+ int timeo);
void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo,
bool rearm);
diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c
index 251a57cf5822..deb52d7d31b4 100644
--- a/net/dccp/minisocks.c
+++ b/net/dccp/minisocks.c
@@ -59,11 +59,10 @@ void dccp_time_wait(struct sock *sk, int state, int timeo)
* we complete the initialization.
*/
local_bh_disable();
- inet_twsk_schedule(tw, timeo);
/* Linkage updates.
* Note that access to tw after this point is illegal.
*/
- inet_twsk_hashdance(tw, sk, &dccp_hashinfo);
+ inet_twsk_hashdance_schedule(tw, sk, &dccp_hashinfo, timeo);
local_bh_enable();
} else {
/* Sorry, if we're out of memory, just CLOSE this
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index e28075f0006e..2b1d836df64e 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -44,14 +44,14 @@ void inet_twsk_bind_unhash(struct inet_timewait_sock *tw,
__sock_put((struct sock *)tw);
}
-/* Must be called with locally disabled BHs. */
-static void inet_twsk_kill(struct inet_timewait_sock *tw)
+static void __inet_twsk_kill(struct inet_timewait_sock *tw, spinlock_t *lock)
+__releases(lock)
{
struct inet_hashinfo *hashinfo = tw->tw_dr->hashinfo;
- spinlock_t *lock = inet_ehash_lockp(hashinfo, tw->tw_hash);
struct inet_bind_hashbucket *bhead, *bhead2;
- spin_lock(lock);
+ lockdep_assert_held(lock);
+
sk_nulls_del_node_init_rcu((struct sock *)tw);
spin_unlock(lock);
@@ -71,6 +71,16 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
inet_twsk_put(tw);
}
+/* Must be called with locally disabled BHs. */
+static void inet_twsk_kill(struct inet_timewait_sock *tw)
+{
+ struct inet_hashinfo *hashinfo = tw->tw_dr->hashinfo;
+ spinlock_t *lock = inet_ehash_lockp(hashinfo, tw->tw_hash);
+
+ spin_lock(lock);
+ __inet_twsk_kill(tw, lock);
+}
+
void inet_twsk_free(struct inet_timewait_sock *tw)
{
struct module *owner = tw->tw_prot->owner;
@@ -96,9 +106,13 @@ static void inet_twsk_add_node_rcu(struct inet_timewait_sock *tw,
* Enter the time wait state. This is called with locally disabled BH.
* Essentially we whip up a timewait bucket, copy the relevant info into it
* from the SK, and mess with hash chains and list linkage.
+ *
+ * The caller must not access @tw anymore after this function returns.
*/
-void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
- struct inet_hashinfo *hashinfo)
+void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
+ struct sock *sk,
+ struct inet_hashinfo *hashinfo,
+ int timeo)
{
const struct inet_sock *inet = inet_sk(sk);
const struct inet_connection_sock *icsk = inet_csk(sk);
@@ -129,26 +143,33 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
spin_lock(lock);
+ /* Step 2: Hash TW into tcp ehash chain */
inet_twsk_add_node_rcu(tw, &ehead->chain);
/* Step 3: Remove SK from hash chain */
if (__sk_nulls_del_node_init_rcu(sk))
sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
- spin_unlock(lock);
+ /* Ensure above writes are committed into memory before updating the
+ * refcount.
+ * Provides ordering vs later refcount_inc().
+ */
+ smp_wmb();
/* tw_refcnt is set to 3 because we have :
* - one reference for bhash chain.
* - one reference for ehash chain.
* - one reference for timer.
- * We can use atomic_set() because prior spin_lock()/spin_unlock()
- * committed into memory all tw fields.
* Also note that after this point, we lost our implicit reference
* so we are not allowed to use tw anymore.
*/
refcount_set(&tw->tw_refcnt, 3);
+
+ inet_twsk_schedule(tw, timeo);
+
+ spin_unlock(lock);
}
-EXPORT_SYMBOL_GPL(inet_twsk_hashdance);
+EXPORT_SYMBOL_GPL(inet_twsk_hashdance_schedule);
static void tw_timer_handler(struct timer_list *t)
{
@@ -217,8 +238,37 @@ EXPORT_SYMBOL_GPL(inet_twsk_alloc);
*/
void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
{
- if (del_timer_sync(&tw->tw_timer))
- inet_twsk_kill(tw);
+ struct inet_hashinfo *hashinfo = tw->tw_dr->hashinfo;
+ spinlock_t *lock = inet_ehash_lockp(hashinfo, tw->tw_hash);
+
+ /* inet_twsk_purge() walks over all sockets, including tw ones,
+ * and removes them via inet_twsk_deschedule_put() after a
+ * refcount_inc_not_zero().
+ *
+ * inet_twsk_hashdance_schedule() must (re)init the refcount before
+ * arming the timer, i.e. inet_twsk_purge can obtain a reference to
+ * a twsk that did not yet schedule the timer.
+ *
+ * The ehash lock synchronizes these two:
+ * After acquiring the lock, the timer is always scheduled (else
+ * timer_shutdown returns false).
+ *
+ * With plain timer_shutdown_sync() and without grabbing the ehash
+ * lock, we can get:
+ * 1) cpu x sets twsk refcount to 3
+ * 2) cpu y bumps refcount to 4
+ * 3) cpu y calls inet_twsk_deschedule_put() and shuts timer down
+ * 4) cpu x tries to start timer, but mod_timer is a noop post-shutdown
+ * -> timer refcount is never decremented.
+ */
+ spin_lock(lock);
+ if (timer_shutdown(&tw->tw_timer)) {
+ /* releases @lock */
+ __inet_twsk_kill(tw, lock);
+ } else {
+ spin_unlock(lock);
+ }
+
inet_twsk_put(tw);
}
EXPORT_SYMBOL(inet_twsk_deschedule_put);
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 538c06f95918..47de6f3efc85 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -344,11 +344,10 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
* we complete the initialization.
*/
local_bh_disable();
- inet_twsk_schedule(tw, timeo);
/* Linkage updates.
* Note that access to tw after this point is illegal.
*/
- inet_twsk_hashdance(tw, sk, net->ipv4.tcp_death_row.hashinfo);
+ inet_twsk_hashdance_schedule(tw, sk, net->ipv4.tcp_death_row.hashinfo, timeo);
local_bh_enable();
} else {
/* Sorry, if we're out of memory, just CLOSE this
--
2.44.2
* [PATCH net-next v6 2/3] net: tcp: un-pin the tw_timer
2024-06-03 9:36 [PATCH net-next v6 0/3] net: tcp: un-pin tw timer Florian Westphal
2024-06-03 9:36 ` [PATCH net-next v6 1/3] net: tcp/dcpp: prepare for tw_timer un-pinning Florian Westphal
@ 2024-06-03 9:36 ` Florian Westphal
2024-06-03 9:36 ` [PATCH net-next v6 3/3] tcp: move inet_twsk_schedule helper out of header Florian Westphal
2 siblings, 0 replies; 11+ messages in thread
From: Florian Westphal @ 2024-06-03 9:36 UTC (permalink / raw)
To: netdev
Cc: Paolo Abeni, David S. Miller, Eric Dumazet, Jakub Kicinski,
mleitner, juri.lelli, vschneid, tglozar, dsahern, bigeasy, tglx
After the previous patch, even if the timer fires immediately on another
CPU, the context that schedules the timer now holds the ehash spinlock,
so the timer cannot reap the tw socket until the ehash lock is released.
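In other words (an illustrative timeline, hypothetical; the timer handler
blocks on the ehash lock that the arming context still holds):
cpu A: spin_lock(ehash lock)
cpu A: refcount_set(&tw->tw_refcnt, 3)
cpu A: inet_twsk_schedule(tw, timeo)
cpu B: tw_timer_handler() -> inet_twsk_kill() -> spin_lock(ehash lock) /* blocks */
cpu A: spin_unlock(ehash lock)
cpu B: acquires the lock and can now reap the tw socket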
Signed-off-by: Florian Westphal <fw@strlen.de>
---
net/ipv4/inet_timewait_sock.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 2b1d836df64e..66d9b76595b7 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -213,7 +213,7 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk,
tw->tw_prot = sk->sk_prot_creator;
atomic64_set(&tw->tw_cookie, atomic64_read(&sk->sk_cookie));
twsk_net_set(tw, sock_net(sk));
- timer_setup(&tw->tw_timer, tw_timer_handler, TIMER_PINNED);
+ timer_setup(&tw->tw_timer, tw_timer_handler, 0);
/*
* Because we use RCU lookups, we should not set tw_refcnt
* to a non null value before everything is setup for this
--
2.44.2
* [PATCH net-next v6 3/3] tcp: move inet_twsk_schedule helper out of header
2024-06-03 9:36 [PATCH net-next v6 0/3] net: tcp: un-pin tw timer Florian Westphal
2024-06-03 9:36 ` [PATCH net-next v6 1/3] net: tcp/dcpp: prepare for tw_timer un-pinning Florian Westphal
2024-06-03 9:36 ` [PATCH net-next v6 2/3] net: tcp: un-pin the tw_timer Florian Westphal
@ 2024-06-03 9:36 ` Florian Westphal
2 siblings, 0 replies; 11+ messages in thread
From: Florian Westphal @ 2024-06-03 9:36 UTC (permalink / raw)
To: netdev
Cc: Paolo Abeni, David S. Miller, Eric Dumazet, Jakub Kicinski,
mleitner, juri.lelli, vschneid, tglozar, dsahern, bigeasy, tglx
It's no longer used outside inet_timewait_sock.c, so move it there.
Signed-off-by: Florian Westphal <fw@strlen.de>
---
include/net/inet_timewait_sock.h | 5 -----
net/ipv4/inet_timewait_sock.c | 5 +++++
2 files changed, 5 insertions(+), 5 deletions(-)
diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
index 5b43d220243d..f88b68269012 100644
--- a/include/net/inet_timewait_sock.h
+++ b/include/net/inet_timewait_sock.h
@@ -101,11 +101,6 @@ void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo,
bool rearm);
-static inline void inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo)
-{
- __inet_twsk_schedule(tw, timeo, false);
-}
-
static inline void inet_twsk_reschedule(struct inet_timewait_sock *tw, int timeo)
{
__inet_twsk_schedule(tw, timeo, true);
diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
index 66d9b76595b7..13a66c0b6b2a 100644
--- a/net/ipv4/inet_timewait_sock.c
+++ b/net/ipv4/inet_timewait_sock.c
@@ -102,6 +102,11 @@ static void inet_twsk_add_node_rcu(struct inet_timewait_sock *tw,
hlist_nulls_add_head_rcu(&tw->tw_node, list);
}
+static void inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo)
+{
+ __inet_twsk_schedule(tw, timeo, false);
+}
+
/*
* Enter the time wait state. This is called with locally disabled BH.
* Essentially we whip up a timewait bucket, copy the relevant info into it
--
2.44.2
* Re: [PATCH net-next v6 1/3] net: tcp/dcpp: prepare for tw_timer un-pinning
2024-06-03 9:36 ` [PATCH net-next v6 1/3] net: tcp/dcpp: prepare for tw_timer un-pinning Florian Westphal
@ 2024-06-03 10:30 ` Eric Dumazet
2024-06-03 11:21 ` Florian Westphal
2024-06-03 10:53 ` shaozhengchao
1 sibling, 1 reply; 11+ messages in thread
From: Eric Dumazet @ 2024-06-03 10:30 UTC (permalink / raw)
To: Florian Westphal
Cc: netdev, Paolo Abeni, David S. Miller, Jakub Kicinski, mleitner,
juri.lelli, vschneid, tglozar, dsahern, bigeasy, tglx
On Mon, Jun 3, 2024 at 11:37 AM Florian Westphal <fw@strlen.de> wrote:
>
> From: Valentin Schneider <vschneid@redhat.com>
>
> The TCP timewait timer is proving to be problematic for setups where
> scheduler CPU isolation is achieved at runtime via cpusets (as opposed to
> statically via isolcpus=domains).
>
> What happens there is a CPU goes through tcp_time_wait(), arming the
> time_wait timer, then gets isolated. TCP_TIMEWAIT_LEN later, the timer
> fires, causing interference for the now-isolated CPU. This is conceptually
> similar to the issue described in commit e02b93124855 ("workqueue: Unbind
> kworkers before sending them to exit()")
>
> Move inet_twsk_schedule() to within inet_twsk_hashdance(), with the ehash
> lock held. Expand the lock's critical section from inet_twsk_kill() to
> inet_twsk_deschedule_put(), serializing the scheduling vs descheduling of
> the timer. IOW, this prevents the following race:
>
> tcp_time_wait()
>   inet_twsk_hashdance()
>                               inet_twsk_deschedule_put()
>                                 del_timer_sync()
>   inet_twsk_schedule()
>
> Thanks to Paolo Abeni for suggesting to leverage the ehash lock.
>
> This also restores a comment from commit ec94c2696f0b ("tcp/dccp: avoid
> one atomic operation for timewait hashdance") as inet_twsk_hashdance() had
> a "Step 1" and "Step 3" comment, but the "Step 2" had gone missing.
>
> inet_twsk_deschedule_put() now acquires the ehash spinlock to synchronize
> vs. reschedule and timer firing: del_timer_sync() is replaced with
> timer_shutdown().
>
> This means that tw_timer may still be running on another CPU. However, as
> the timer owns a reference on the tw sk that is only put at the end of the
> handler, this should be fine.
>
> To ease possible regression search, actual un-pin is done in next patch.
>
> Link: https://lore.kernel.org/all/ZPhpfMjSiHVjQkTk@localhost.localdomain/
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---
> include/net/inet_timewait_sock.h | 6 ++-
> net/dccp/minisocks.c | 3 +-
> net/ipv4/inet_timewait_sock.c | 74 ++++++++++++++++++++++++++------
> net/ipv4/tcp_minisocks.c | 3 +-
> 4 files changed, 68 insertions(+), 18 deletions(-)
>
> diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
> index 2a536eea9424..5b43d220243d 100644
> --- a/include/net/inet_timewait_sock.h
> +++ b/include/net/inet_timewait_sock.h
> @@ -93,8 +93,10 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk,
> struct inet_timewait_death_row *dr,
> const int state);
>
> -void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
> - struct inet_hashinfo *hashinfo);
> +void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
> + struct sock *sk,
> + struct inet_hashinfo *hashinfo,
> + int timeo);
>
> void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo,
> bool rearm);
> diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c
> index 251a57cf5822..deb52d7d31b4 100644
> --- a/net/dccp/minisocks.c
> +++ b/net/dccp/minisocks.c
> @@ -59,11 +59,10 @@ void dccp_time_wait(struct sock *sk, int state, int timeo)
> * we complete the initialization.
> */
> local_bh_disable();
> - inet_twsk_schedule(tw, timeo);
> /* Linkage updates.
> * Note that access to tw after this point is illegal.
> */
> - inet_twsk_hashdance(tw, sk, &dccp_hashinfo);
> + inet_twsk_hashdance_schedule(tw, sk, &dccp_hashinfo, timeo);
> local_bh_enable();
> } else {
> /* Sorry, if we're out of memory, just CLOSE this
> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> index e28075f0006e..2b1d836df64e 100644
> --- a/net/ipv4/inet_timewait_sock.c
> +++ b/net/ipv4/inet_timewait_sock.c
> @@ -44,14 +44,14 @@ void inet_twsk_bind_unhash(struct inet_timewait_sock *tw,
> __sock_put((struct sock *)tw);
> }
>
> -/* Must be called with locally disabled BHs. */
> -static void inet_twsk_kill(struct inet_timewait_sock *tw)
> +static void __inet_twsk_kill(struct inet_timewait_sock *tw, spinlock_t *lock)
> +__releases(lock)
> {
> struct inet_hashinfo *hashinfo = tw->tw_dr->hashinfo;
> - spinlock_t *lock = inet_ehash_lockp(hashinfo, tw->tw_hash);
> struct inet_bind_hashbucket *bhead, *bhead2;
>
> - spin_lock(lock);
> + lockdep_assert_held(lock);
> +
> sk_nulls_del_node_init_rcu((struct sock *)tw);
> spin_unlock(lock);
>
> @@ -71,6 +71,16 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
> inet_twsk_put(tw);
> }
>
> +/* Must be called with locally disabled BHs. */
> +static void inet_twsk_kill(struct inet_timewait_sock *tw)
> +{
> + struct inet_hashinfo *hashinfo = tw->tw_dr->hashinfo;
> + spinlock_t *lock = inet_ehash_lockp(hashinfo, tw->tw_hash);
> +
> + spin_lock(lock);
> + __inet_twsk_kill(tw, lock);
> +}
> +
> void inet_twsk_free(struct inet_timewait_sock *tw)
> {
> struct module *owner = tw->tw_prot->owner;
> @@ -96,9 +106,13 @@ static void inet_twsk_add_node_rcu(struct inet_timewait_sock *tw,
> * Enter the time wait state. This is called with locally disabled BH.
> * Essentially we whip up a timewait bucket, copy the relevant info into it
> * from the SK, and mess with hash chains and list linkage.
> + *
> + * The caller must not access @tw anymore after this function returns.
> */
> -void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
> - struct inet_hashinfo *hashinfo)
> +void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
> + struct sock *sk,
> + struct inet_hashinfo *hashinfo,
> + int timeo)
> {
> const struct inet_sock *inet = inet_sk(sk);
> const struct inet_connection_sock *icsk = inet_csk(sk);
> @@ -129,26 +143,33 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
>
> spin_lock(lock);
>
> + /* Step 2: Hash TW into tcp ehash chain */
> inet_twsk_add_node_rcu(tw, &ehead->chain);
>
> /* Step 3: Remove SK from hash chain */
> if (__sk_nulls_del_node_init_rcu(sk))
> sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
>
> - spin_unlock(lock);
>
> + /* Ensure above writes are committed into memory before updating the
> + * refcount.
> + * Provides ordering vs later refcount_inc().
> + */
> + smp_wmb();
> /* tw_refcnt is set to 3 because we have :
> * - one reference for bhash chain.
> * - one reference for ehash chain.
> * - one reference for timer.
> - * We can use atomic_set() because prior spin_lock()/spin_unlock()
> - * committed into memory all tw fields.
> * Also note that after this point, we lost our implicit reference
> * so we are not allowed to use tw anymore.
> */
> refcount_set(&tw->tw_refcnt, 3);
> +
> + inet_twsk_schedule(tw, timeo);
> +
> + spin_unlock(lock);
> }
> -EXPORT_SYMBOL_GPL(inet_twsk_hashdance);
> +EXPORT_SYMBOL_GPL(inet_twsk_hashdance_schedule);
>
> static void tw_timer_handler(struct timer_list *t)
> {
> @@ -217,8 +238,37 @@ EXPORT_SYMBOL_GPL(inet_twsk_alloc);
> */
> void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
> {
> - if (del_timer_sync(&tw->tw_timer))
> - inet_twsk_kill(tw);
> + struct inet_hashinfo *hashinfo = tw->tw_dr->hashinfo;
> + spinlock_t *lock = inet_ehash_lockp(hashinfo, tw->tw_hash);
> +
> + /* inet_twsk_purge() walks over all sockets, including tw ones,
> + * and removes them via inet_twsk_deschedule_put() after a
> + * refcount_inc_not_zero().
> + *
> + * inet_twsk_hashdance_schedule() must (re)init the refcount before
> + * arming the timer, i.e. inet_twsk_purge can obtain a reference to
> + * a twsk that did not yet schedule the timer.
> + *
> + * The ehash lock synchronizes these two:
> + * After acquiring the lock, the timer is always scheduled (else
> + * timer_shutdown returns false).
> + *
> + * With plain timer_shutdown_sync() and without grabbing the ehash
> + * lock, we can get:
> + * 1) cpu x sets twsk refcount to 3
> + * 2) cpu y bumps refcount to 4
> + * 3) cpu y calls inet_twsk_deschedule_put() and shuts timer down
> + * 4) cpu x tries to start timer, but mod_timer is a noop post-shutdown
> + * -> timer refcount is never decremented.
> + */
> + spin_lock(lock);
> + if (timer_shutdown(&tw->tw_timer)) {
> + /* releases @lock */
> + __inet_twsk_kill(tw, lock);
> + } else {
If we do not have a sync variant here, I think that inet_twsk_purge()
could return while ongoing timers are alive.
tcp_sk_exit_batch() would then possibly hit :
WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
The alive timers are releasing tw->tw_dr->tw_refcount at the end of
inet_twsk_kill().
Presumably, adding some udelay(200) in inet_twsk_kill() could help a
syzkaller instance to trigger a bug.
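For instance (hypothetical debug-only change, not for merging; assumes
<linux/delay.h> is available):
static void inet_twsk_kill(struct inet_timewait_sock *tw)
{
	struct inet_hashinfo *hashinfo = tw->tw_dr->hashinfo;
	spinlock_t *lock = inet_ehash_lockp(hashinfo, tw->tw_hash);

	udelay(200);	/* debug only: widen the race window for fuzzers */
	spin_lock(lock);
	__inet_twsk_kill(tw, lock);
}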
> + spin_unlock(lock);
> + }
> +
> inet_twsk_put(tw);
> }
> EXPORT_SYMBOL(inet_twsk_deschedule_put);
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index 538c06f95918..47de6f3efc85 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -344,11 +344,10 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
> * we complete the initialization.
> */
> local_bh_disable();
> - inet_twsk_schedule(tw, timeo);
> /* Linkage updates.
> * Note that access to tw after this point is illegal.
> */
> - inet_twsk_hashdance(tw, sk, net->ipv4.tcp_death_row.hashinfo);
> + inet_twsk_hashdance_schedule(tw, sk, net->ipv4.tcp_death_row.hashinfo, timeo);
> local_bh_enable();
> } else {
> /* Sorry, if we're out of memory, just CLOSE this
> --
> 2.44.2
>
* Re: [PATCH net-next v6 1/3] net: tcp/dcpp: prepare for tw_timer un-pinning
2024-06-03 9:36 ` [PATCH net-next v6 1/3] net: tcp/dcpp: prepare for tw_timer un-pinning Florian Westphal
2024-06-03 10:30 ` Eric Dumazet
@ 2024-06-03 10:53 ` shaozhengchao
1 sibling, 0 replies; 11+ messages in thread
From: shaozhengchao @ 2024-06-03 10:53 UTC (permalink / raw)
To: Florian Westphal, netdev
Cc: Paolo Abeni, David S. Miller, Eric Dumazet, Jakub Kicinski,
mleitner, juri.lelli, vschneid, tglozar, dsahern, bigeasy, tglx
The patch header should be "net: tcp/dccp ....".
On 2024/6/3 17:36, Florian Westphal wrote:
> From: Valentin Schneider <vschneid@redhat.com>
>
> The TCP timewait timer is proving to be problematic for setups where
> scheduler CPU isolation is achieved at runtime via cpusets (as opposed to
> statically via isolcpus=domains).
>
> What happens there is a CPU goes through tcp_time_wait(), arming the
> time_wait timer, then gets isolated. TCP_TIMEWAIT_LEN later, the timer
> fires, causing interference for the now-isolated CPU. This is conceptually
> similar to the issue described in commit e02b93124855 ("workqueue: Unbind
> kworkers before sending them to exit()")
>
> Move inet_twsk_schedule() to within inet_twsk_hashdance(), with the ehash
> lock held. Expand the lock's critical section from inet_twsk_kill() to
> inet_twsk_deschedule_put(), serializing the scheduling vs descheduling of
> the timer. IOW, this prevents the following race:
>
> tcp_time_wait()
>   inet_twsk_hashdance()
>                               inet_twsk_deschedule_put()
>                                 del_timer_sync()
>   inet_twsk_schedule()
>
> Thanks to Paolo Abeni for suggesting to leverage the ehash lock.
>
> This also restores a comment from commit ec94c2696f0b ("tcp/dccp: avoid
> one atomic operation for timewait hashdance") as inet_twsk_hashdance() had
> a "Step 1" and "Step 3" comment, but the "Step 2" had gone missing.
>
> inet_twsk_deschedule_put() now acquires the ehash spinlock to synchronize
> vs. reschedule and timer firing: del_timer_sync() is replaced with
> timer_shutdown().
>
> This means that tw_timer may still be running on another CPU. However, as
> the timer owns a reference on the tw sk that is only put at the end of the
> handler, this should be fine.
>
> To ease possible regression search, actual un-pin is done in next patch.
>
> Link: https://lore.kernel.org/all/ZPhpfMjSiHVjQkTk@localhost.localdomain/
> Signed-off-by: Valentin Schneider <vschneid@redhat.com>
> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---
> include/net/inet_timewait_sock.h | 6 ++-
> net/dccp/minisocks.c | 3 +-
> net/ipv4/inet_timewait_sock.c | 74 ++++++++++++++++++++++++++------
> net/ipv4/tcp_minisocks.c | 3 +-
> 4 files changed, 68 insertions(+), 18 deletions(-)
>
> diff --git a/include/net/inet_timewait_sock.h b/include/net/inet_timewait_sock.h
> index 2a536eea9424..5b43d220243d 100644
> --- a/include/net/inet_timewait_sock.h
> +++ b/include/net/inet_timewait_sock.h
> @@ -93,8 +93,10 @@ struct inet_timewait_sock *inet_twsk_alloc(const struct sock *sk,
> struct inet_timewait_death_row *dr,
> const int state);
>
> -void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
> - struct inet_hashinfo *hashinfo);
> +void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
> + struct sock *sk,
> + struct inet_hashinfo *hashinfo,
> + int timeo);
>
> void __inet_twsk_schedule(struct inet_timewait_sock *tw, int timeo,
> bool rearm);
> diff --git a/net/dccp/minisocks.c b/net/dccp/minisocks.c
> index 251a57cf5822..deb52d7d31b4 100644
> --- a/net/dccp/minisocks.c
> +++ b/net/dccp/minisocks.c
> @@ -59,11 +59,10 @@ void dccp_time_wait(struct sock *sk, int state, int timeo)
> * we complete the initialization.
> */
> local_bh_disable();
> - inet_twsk_schedule(tw, timeo);
> /* Linkage updates.
> * Note that access to tw after this point is illegal.
> */
> - inet_twsk_hashdance(tw, sk, &dccp_hashinfo);
> + inet_twsk_hashdance_schedule(tw, sk, &dccp_hashinfo, timeo);
> local_bh_enable();
> } else {
> /* Sorry, if we're out of memory, just CLOSE this
> diff --git a/net/ipv4/inet_timewait_sock.c b/net/ipv4/inet_timewait_sock.c
> index e28075f0006e..2b1d836df64e 100644
> --- a/net/ipv4/inet_timewait_sock.c
> +++ b/net/ipv4/inet_timewait_sock.c
> @@ -44,14 +44,14 @@ void inet_twsk_bind_unhash(struct inet_timewait_sock *tw,
> __sock_put((struct sock *)tw);
> }
>
> -/* Must be called with locally disabled BHs. */
> -static void inet_twsk_kill(struct inet_timewait_sock *tw)
> +static void __inet_twsk_kill(struct inet_timewait_sock *tw, spinlock_t *lock)
> +__releases(lock)
> {
> struct inet_hashinfo *hashinfo = tw->tw_dr->hashinfo;
> - spinlock_t *lock = inet_ehash_lockp(hashinfo, tw->tw_hash);
> struct inet_bind_hashbucket *bhead, *bhead2;
>
> - spin_lock(lock);
> + lockdep_assert_held(lock);
> +
> sk_nulls_del_node_init_rcu((struct sock *)tw);
> spin_unlock(lock);
>
> @@ -71,6 +71,16 @@ static void inet_twsk_kill(struct inet_timewait_sock *tw)
> inet_twsk_put(tw);
> }
>
> +/* Must be called with locally disabled BHs. */
> +static void inet_twsk_kill(struct inet_timewait_sock *tw)
> +{
> + struct inet_hashinfo *hashinfo = tw->tw_dr->hashinfo;
> + spinlock_t *lock = inet_ehash_lockp(hashinfo, tw->tw_hash);
> +
> + spin_lock(lock);
> + __inet_twsk_kill(tw, lock);
> +}
> +
> void inet_twsk_free(struct inet_timewait_sock *tw)
> {
> struct module *owner = tw->tw_prot->owner;
> @@ -96,9 +106,13 @@ static void inet_twsk_add_node_rcu(struct inet_timewait_sock *tw,
> * Enter the time wait state. This is called with locally disabled BH.
> * Essentially we whip up a timewait bucket, copy the relevant info into it
> * from the SK, and mess with hash chains and list linkage.
> + *
> + * The caller must not access @tw anymore after this function returns.
> */
> -void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
> - struct inet_hashinfo *hashinfo)
> +void inet_twsk_hashdance_schedule(struct inet_timewait_sock *tw,
> + struct sock *sk,
> + struct inet_hashinfo *hashinfo,
> + int timeo)
> {
> const struct inet_sock *inet = inet_sk(sk);
> const struct inet_connection_sock *icsk = inet_csk(sk);
> @@ -129,26 +143,33 @@ void inet_twsk_hashdance(struct inet_timewait_sock *tw, struct sock *sk,
>
> spin_lock(lock);
>
> + /* Step 2: Hash TW into tcp ehash chain */
> inet_twsk_add_node_rcu(tw, &ehead->chain);
>
> /* Step 3: Remove SK from hash chain */
> if (__sk_nulls_del_node_init_rcu(sk))
> sock_prot_inuse_add(sock_net(sk), sk->sk_prot, -1);
>
> - spin_unlock(lock);
>
> + /* Ensure above writes are committed into memory before updating the
> + * refcount.
> + * Provides ordering vs later refcount_inc().
> + */
> + smp_wmb();
> /* tw_refcnt is set to 3 because we have :
> * - one reference for bhash chain.
> * - one reference for ehash chain.
> * - one reference for timer.
> - * We can use atomic_set() because prior spin_lock()/spin_unlock()
> - * committed into memory all tw fields.
> * Also note that after this point, we lost our implicit reference
> * so we are not allowed to use tw anymore.
> */
> refcount_set(&tw->tw_refcnt, 3);
> +
> + inet_twsk_schedule(tw, timeo);
> +
> + spin_unlock(lock);
> }
> -EXPORT_SYMBOL_GPL(inet_twsk_hashdance);
> +EXPORT_SYMBOL_GPL(inet_twsk_hashdance_schedule);
>
> static void tw_timer_handler(struct timer_list *t)
> {
> @@ -217,8 +238,37 @@ EXPORT_SYMBOL_GPL(inet_twsk_alloc);
> */
> void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
> {
> - if (del_timer_sync(&tw->tw_timer))
> - inet_twsk_kill(tw);
> + struct inet_hashinfo *hashinfo = tw->tw_dr->hashinfo;
> + spinlock_t *lock = inet_ehash_lockp(hashinfo, tw->tw_hash);
> +
> + /* inet_twsk_purge() walks over all sockets, including tw ones,
> + * and removes them via inet_twsk_deschedule_put() after a
> + * refcount_inc_not_zero().
> + *
> + * inet_twsk_hashdance_schedule() must (re)init the refcount before
> + * arming the timer, i.e. inet_twsk_purge can obtain a reference to
> + * a twsk that did not yet schedule the timer.
> + *
> + * The ehash lock synchronizes these two:
> + * After acquiring the lock, the timer is always scheduled (else
> + * timer_shutdown returns false).
> + *
> + * With plain timer_shutdown_sync() and without grabbing the ehash
> + * lock, we can get:
> + * 1) cpu x sets twsk refcount to 3
> + * 2) cpu y bumps refcount to 4
> + * 3) cpu y calls inet_twsk_deschedule_put() and shuts timer down
> + * 4) cpu x tries to start timer, but mod_timer is a noop post-shutdown
> + * -> timer refcount is never decremented.
> + */
> + spin_lock(lock);
> + if (timer_shutdown(&tw->tw_timer)) {
> + /* releases @lock */
> + __inet_twsk_kill(tw, lock);
> + } else {
> + spin_unlock(lock);
> + }
> +
> inet_twsk_put(tw);
> }
> EXPORT_SYMBOL(inet_twsk_deschedule_put);
> diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
> index 538c06f95918..47de6f3efc85 100644
> --- a/net/ipv4/tcp_minisocks.c
> +++ b/net/ipv4/tcp_minisocks.c
> @@ -344,11 +344,10 @@ void tcp_time_wait(struct sock *sk, int state, int timeo)
> * we complete the initialization.
> */
> local_bh_disable();
> - inet_twsk_schedule(tw, timeo);
> /* Linkage updates.
> * Note that access to tw after this point is illegal.
> */
> - inet_twsk_hashdance(tw, sk, net->ipv4.tcp_death_row.hashinfo);
> + inet_twsk_hashdance_schedule(tw, sk, net->ipv4.tcp_death_row.hashinfo, timeo);
> local_bh_enable();
> } else {
> /* Sorry, if we're out of memory, just CLOSE this
* Re: [PATCH net-next v6 1/3] net: tcp/dcpp: prepare for tw_timer un-pinning
2024-06-03 10:30 ` Eric Dumazet
@ 2024-06-03 11:21 ` Florian Westphal
2024-06-03 11:53 ` Eric Dumazet
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Florian Westphal @ 2024-06-03 11:21 UTC (permalink / raw)
To: Eric Dumazet
Cc: Florian Westphal, netdev, Paolo Abeni, David S. Miller,
Jakub Kicinski, mleitner, juri.lelli, vschneid, tglozar, dsahern,
bigeasy, tglx
Eric Dumazet <edumazet@google.com> wrote:
> On Mon, Jun 3, 2024 at 11:37 AM Florian Westphal <fw@strlen.de> wrote:
> > + spin_lock(lock);
> > + if (timer_shutdown(&tw->tw_timer)) {
> > + /* releases @lock */
> > + __inet_twsk_kill(tw, lock);
> > + } else {
>
> If we do not have a sync variant here, I think that inet_twsk_purge()
> could return while ongoing timers are alive.
Yes.
We can't use sync variant, it would deadlock on ehash spinlock.
> tcp_sk_exit_batch() would then possibly hit :
>
> WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
>
> The alive timers are releasing tw->tw_dr->tw_refcount at the end of
> inet_twsk_kill().
Theoretically the tw socket can be unlinked from the tw hash already
(inet_twsk_purge won't encounter it), but timer is still running.
Only solution I see is to schedule() in tcp_sk_exit_batch() until
tw_refcount has dropped to the expected value, i.e. something like
static void tcp_wait_for_tw_timers(struct net *n)
{
	while (refcount_read(&n->ipv4.tcp_death_row.tw_refcount) > 1)
		schedule();
}
Any better idea?
I started to sketch a patch that keeps PINNED as-is but defers almost
all of the actual work to a work item.
The idea was that it would lower RT latencies to an acceptable level, but
it got so ugly that I did not follow this path.
I could resurrect this if you think it's worth a try.
* Re: [PATCH net-next v6 1/3] net: tcp/dcpp: prepare for tw_timer un-pinning
2024-06-03 11:21 ` Florian Westphal
@ 2024-06-03 11:53 ` Eric Dumazet
2024-06-03 12:10 ` Sebastian Andrzej Siewior
2024-06-03 13:31 ` Florian Westphal
2 siblings, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2024-06-03 11:53 UTC (permalink / raw)
To: Florian Westphal
Cc: netdev, Paolo Abeni, David S. Miller, Jakub Kicinski, mleitner,
juri.lelli, vschneid, tglozar, dsahern, bigeasy, tglx
On Mon, Jun 3, 2024 at 1:21 PM Florian Westphal <fw@strlen.de> wrote:
>
> Eric Dumazet <edumazet@google.com> wrote:
> > On Mon, Jun 3, 2024 at 11:37 AM Florian Westphal <fw@strlen.de> wrote:
> > > + spin_lock(lock);
> > > + if (timer_shutdown(&tw->tw_timer)) {
> > > + /* releases @lock */
> > > + __inet_twsk_kill(tw, lock);
> > > + } else {
> >
> > If we do not have a sync variant here, I think that inet_twsk_purge()
> > could return while ongoing timers are alive.
>
> Yes.
>
> We can't use sync variant, it would deadlock on ehash spinlock.
>
> > tcp_sk_exit_batch() would then possibly hit :
> >
> > WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
> >
> > The alive timers are releasing tw->tw_dr->tw_refcount at the end of
> > inet_twsk_kill().
>
> Theoretically the tw socket can be unlinked from the tw hash already
> (inet_twsk_purge won't encounter it), but timer is still running.
>
> Only solution I see is to schedule() in tcp_sk_exit_batch() until
> tw_refcount has dropped to the expected value, i.e. something like
>
> static void tcp_wait_for_tw_timers(struct net *n)
> {
> 	while (refcount_read(&n->ipv4.tcp_death_row.tw_refcount) > 1)
> 		schedule();
> }
>
> Any better idea?
Maybe usleep_range(500, 1000)
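Something like this, perhaps (hypothetical sketch combining the loop
above with usleep_range(); assumes <linux/delay.h>):
static void tcp_wait_for_tw_timers(struct net *net)
{
	/* wait for in-flight tw timers to finish without busy-looping */
	while (refcount_read(&net->ipv4.tcp_death_row.tw_refcount) > 1)
		usleep_range(500, 1000);
}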
>
> I started to sketch a patch that keeps PINNED as-is but defers almost
> all of the actual work to a work item.
>
> The idea was that it would lower RT latencies to an acceptable level, but
> it got so ugly that I did not follow this path.
>
> I could resurrect this if you think it's worth a try.
I would rather avoid a work queue.
* Re: [PATCH net-next v6 1/3] net: tcp/dcpp: prepare for tw_timer un-pinning
2024-06-03 11:21 ` Florian Westphal
2024-06-03 11:53 ` Eric Dumazet
@ 2024-06-03 12:10 ` Sebastian Andrzej Siewior
2024-06-03 13:31 ` Florian Westphal
2 siblings, 0 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2024-06-03 12:10 UTC (permalink / raw)
To: Florian Westphal
Cc: Eric Dumazet, netdev, Paolo Abeni, David S. Miller,
Jakub Kicinski, mleitner, juri.lelli, vschneid, tglozar, dsahern,
tglx
On 2024-06-03 13:21:52 [+0200], Florian Westphal wrote:
> Theoretically the tw socket can be unlinked from the tw hash already
> (inet_twsk_purge won't encounter it), but timer is still running.
>
> Only solution I see is to schedule() in tcp_sk_exit_batch() until
> tw_refcount has dropped to the expected value, i.e. something like
>
> static void tcp_wait_for_tw_timers(struct net *n)
> {
> 	while (refcount_read(&n->ipv4.tcp_death_row.tw_refcount) > 1)
> 		schedule();
> }
>
> Any better idea?
I need to read the thread but from the context, couldn't you
- grab a refcount
- timer_shutdown_sync() on it
?
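I.e., something like this (hypothetical sketch of that idea; per the
deadlock noted above, only valid when the ehash lock is not held):
	/* pin the twsk so it cannot be freed under us, then wait
	 * for a concurrently running timer to finish
	 */
	if (refcount_inc_not_zero(&tw->tw_refcnt)) {
		timer_shutdown_sync(&tw->tw_timer);
		inet_twsk_put(tw);
	}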
Sebastian
* Re: [PATCH net-next v6 1/3] net: tcp/dcpp: prepare for tw_timer un-pinning
2024-06-03 11:21 ` Florian Westphal
2024-06-03 11:53 ` Eric Dumazet
2024-06-03 12:10 ` Sebastian Andrzej Siewior
@ 2024-06-03 13:31 ` Florian Westphal
2024-06-03 13:50 ` Eric Dumazet
2 siblings, 1 reply; 11+ messages in thread
From: Florian Westphal @ 2024-06-03 13:31 UTC (permalink / raw)
To: Florian Westphal
Cc: Eric Dumazet, netdev, Paolo Abeni, David S. Miller,
Jakub Kicinski, mleitner, juri.lelli, vschneid, tglozar, dsahern,
bigeasy, tglx
Florian Westphal <fw@strlen.de> wrote:
> Eric Dumazet <edumazet@google.com> wrote:
> > On Mon, Jun 3, 2024 at 11:37 AM Florian Westphal <fw@strlen.de> wrote:
> > > + spin_lock(lock);
> > > + if (timer_shutdown(&tw->tw_timer)) {
> > > + /* releases @lock */
> > > + __inet_twsk_kill(tw, lock);
> > > + } else {
> >
> > If we do not have a sync variant here, I think that inet_twsk_purge()
> > could return while ongoing timers are alive.
>
> Yes.
>
> We can't use sync variant, it would deadlock on ehash spinlock.
>
> > tcp_sk_exit_batch() would then possibly hit :
> >
> > WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
> >
> > The alive timers are releasing tw->tw_dr->tw_refcount at the end of
> > inet_twsk_kill().
>
> Theoretically the tw socket can be unlinked from the tw hash already
> (inet_twsk_purge won't encounter it), but timer is still running.
>
> Only solution I see is to schedule() in tcp_sk_exit_batch() until
> tw_refcount has dropped to the expected value, i.e. something like
>
> static void tcp_wait_for_tw_timers(struct net *n)
> {
> 	while (refcount_read(&n->ipv4.tcp_death_row.tw_refcount) > 1)
> 		schedule();
> }
>
> Any better idea?
Actually, I think we can solve this in a much simpler way.
Instead of replacing:
void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
{
	if (del_timer_sync(&tw->tw_timer))
		inet_twsk_kill(tw);
	inet_twsk_put(tw);
}
With:
spinlock_t *lock = inet_ehash_lockp(hashinfo, tw->tw_hash);
spin_lock(lock);
if (timer_shutdown(&tw->tw_timer)) {
(which gets us into the tcp_sk_exit_batch() trouble Eric points out),
we can simply add an "empty" ehash lock/unlock pair before calling
del_timer_sync():
void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
{
+	spinlock_t *lock = inet_ehash_lockp(hashinfo, tw->tw_hash);
+	spin_lock(lock);
+	spin_unlock(lock);

	if (del_timer_sync(&tw->tw_timer))
		inet_twsk_kill(tw);
	inet_twsk_put(tw);
}
Rationale:
inet_twsk_deschedule_put() cannot be called before hashdance_schedule
calls refcount_set(&tw->tw_refcnt, 3).
Before this any refcount_inc_not_zero fails so we never get into
deschedule_put.
Hashdance_schedule holds the ehash lock when it sets the tw refcount.
The lock is released only after the timer is up and running.
When inet_twsk_deschedule_put() is called, and hashdance_schedule
is not yet done, the spinlock/unlock pair will guarantee that
the timer is up after the spin_unlock.
I think this is much better than the schedule loop waiting for the tw_dr
refcount to drop; it mainly needs a comment to explain what this is
doing.
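A commented version could look like this (hypothetical sketch):
void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
{
	struct inet_hashinfo *hashinfo = tw->tw_dr->hashinfo;
	spinlock_t *lock = inet_ehash_lockp(hashinfo, tw->tw_hash);

	/* inet_twsk_hashdance_schedule() sets tw_refcnt and arms
	 * tw_timer with the ehash lock held.  Taking and releasing
	 * that same lock here guarantees the timer is armed before
	 * del_timer_sync() runs, closing the race described above.
	 */
	spin_lock(lock);
	spin_unlock(lock);

	if (del_timer_sync(&tw->tw_timer))
		inet_twsk_kill(tw);
	inet_twsk_put(tw);
}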
Thoughts?
* Re: [PATCH net-next v6 1/3] net: tcp/dcpp: prepare for tw_timer un-pinning
2024-06-03 13:31 ` Florian Westphal
@ 2024-06-03 13:50 ` Eric Dumazet
0 siblings, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2024-06-03 13:50 UTC (permalink / raw)
To: Florian Westphal
Cc: netdev, Paolo Abeni, David S. Miller, Jakub Kicinski, mleitner,
juri.lelli, vschneid, tglozar, dsahern, bigeasy, tglx
On Mon, Jun 3, 2024 at 3:32 PM Florian Westphal <fw@strlen.de> wrote:
>
> Florian Westphal <fw@strlen.de> wrote:
> > Eric Dumazet <edumazet@google.com> wrote:
> > > On Mon, Jun 3, 2024 at 11:37 AM Florian Westphal <fw@strlen.de> wrote:
> > > > + spin_lock(lock);
> > > > + if (timer_shutdown(&tw->tw_timer)) {
> > > > + /* releases @lock */
> > > > + __inet_twsk_kill(tw, lock);
> > > > + } else {
> > >
> > > If we do not have a sync variant here, I think that inet_twsk_purge()
> > > could return while ongoing timers are alive.
> >
> > Yes.
> >
> > We can't use sync variant, it would deadlock on ehash spinlock.
> >
> > > tcp_sk_exit_batch() would then possibly hit :
> > >
> > > WARN_ON_ONCE(!refcount_dec_and_test(&net->ipv4.tcp_death_row.tw_refcount));
> > >
> > > The alive timers are releasing tw->tw_dr->tw_refcount at the end of
> > > inet_twsk_kill().
> >
> > Theoretically the tw socket can be unlinked from the tw hash already
> > (inet_twsk_purge won't encounter it), but timer is still running.
> >
> > Only solution I see is to schedule() in tcp_sk_exit_batch() until
> > tw_refcount has dropped to the expected value, i.e. something like
> >
> > static void tcp_wait_for_tw_timers(struct net *n)
> > {
> > 	while (refcount_read(&n->ipv4.tcp_death_row.tw_refcount) > 1)
> > 		schedule();
> > }
> >
> > Any better idea?
>
> Actually, I think we can solve this in a much simpler way.
>
> Instead of replacing:
>
> void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
> {
> 	if (del_timer_sync(&tw->tw_timer))
> 		inet_twsk_kill(tw);
> 	inet_twsk_put(tw);
> }
>
> With:
> spinlock_t *lock = inet_ehash_lockp(hashinfo, tw->tw_hash);
> spin_lock(lock);
> if (timer_shutdown(&tw->tw_timer)) {
>
> (which gets us into the tcp_sk_exit_batch() trouble Eric points out),
> we can simply add an "empty" ehash lock/unlock pair before calling
> del_timer_sync():
>
> void inet_twsk_deschedule_put(struct inet_timewait_sock *tw)
> {
> +	spinlock_t *lock = inet_ehash_lockp(hashinfo, tw->tw_hash);
> +	spin_lock(lock);
> +	spin_unlock(lock);
>
> 	if (del_timer_sync(&tw->tw_timer))
Sounds good, maybe use timer_shutdown_sync() here?
> 		inet_twsk_kill(tw);
> 	inet_twsk_put(tw);
> }
>
> Rationale:
> inet_twsk_deschedule_put() cannot be called before hashdance_schedule
> calls refcount_set(&tw->tw_refcnt, 3).
>
> Before this any refcount_inc_not_zero fails so we never get into
> deschedule_put.
>
> Hashdance_schedule holds the ehash lock when it sets the tw refcount.
> The lock is released only after the timer is up and running.
>
> When inet_twsk_deschedule_put() is called, and hashdance_schedule
> is not yet done, the spinlock/unlock pair will guarantee that
> the timer is up after the spin_unlock.
>
> I think this is much better than the schedule loop waiting for the tw_dr
> refcount to drop; it mainly needs a comment to explain what this is
> doing.
>
> Thoughts?