netdev.vger.kernel.org archive mirror

* [PATCH net 0/4] inet: frags: flush pending skbs in fqdir_pre_exit()
@ 2025-12-07  1:09 Jakub Kicinski
  2025-12-07  1:09 ` [PATCH net 1/4] inet: frags: avoid theoretical race in ip_frag_reinit() Jakub Kicinski
                   ` (4 more replies)
  0 siblings, 5 replies; 11+ messages in thread
From: Jakub Kicinski @ 2025-12-07  1:09 UTC (permalink / raw)
  To: davem
  Cc: netdev, edumazet, pabeni, andrew+netdev, horms, pablo, fw,
	netfilter-devel, willemdebruijn.kernel, kuniyu, Jakub Kicinski

Fix the issue reported by NIPA starting on Sep 18th [1], where
pernet_ops_rwsem is constantly held by a reader, preventing writers
from grabbing it (specifically driver modules from loading).

The fact that reports started around that time seems coincidental.
The issue seems to be skbs queued for defrag preventing conntrack
from exiting.

The first patch fixes a separate, theoretical issue. It's mostly a leftover
from an attempt to get rid of the inet_frag_queue refcnt, which
I gave up on (I still think it's doable, but it's a bit of a time sink).
The second patch is a minor refactor.

The real fix is in the third patch. It's the simplest fix I can
think of: flush the frag queues. Perhaps someone has
a better suggestion?

The last patch adds an explicit warning for conntrack getting stuck,
as this seems like something that can easily happen if bugs sneak in.
The warning will hopefully save us the first 20% of the investigation
effort.

Link: https://lore.kernel.org/20251001082036.0fc51440@kernel.org # [1]

Jakub Kicinski (4):
  inet: frags: avoid theoretical race in ip_frag_reinit()
  inet: frags: add inet_frag_queue_flush()
  inet: frags: flush pending skbs in fqdir_pre_exit()
  netfilter: conntrack: warn when cleanup is stuck

 include/net/inet_frag.h           | 18 ++--------
 include/net/ipv6_frag.h           |  9 +++--
 net/ipv4/inet_fragment.c          | 55 ++++++++++++++++++++++++++++---
 net/ipv4/ip_fragment.c            | 22 +++++--------
 net/netfilter/nf_conntrack_core.c |  3 ++
 5 files changed, 72 insertions(+), 35 deletions(-)

-- 
2.52.0


* [PATCH net 1/4] inet: frags: avoid theoretical race in ip_frag_reinit()
  2025-12-07  1:09 [PATCH net 0/4] inet: frags: flush pending skbs in fqdir_pre_exit() Jakub Kicinski
@ 2025-12-07  1:09 ` Jakub Kicinski
  2025-12-08 15:18   ` Eric Dumazet
  2025-12-07  1:09 ` [PATCH net 2/4] inet: frags: add inet_frag_queue_flush() Jakub Kicinski
                   ` (3 subsequent siblings)
  4 siblings, 1 reply; 11+ messages in thread
From: Jakub Kicinski @ 2025-12-07  1:09 UTC (permalink / raw)
  To: davem
  Cc: netdev, edumazet, pabeni, andrew+netdev, horms, pablo, fw,
	netfilter-devel, willemdebruijn.kernel, kuniyu, Jakub Kicinski

In ip_frag_reinit() we want to move the frag timeout timer into
the future. If the timer fires in the meantime we inadvertently
schedule it again, and since the timer assumes a ref on frag_queue
we need to acquire one to balance things out.

This is technically racy: we should have acquired the reference
_before_ we touch the timer, as it may fire again before we take the ref.
Avoid this entire dance by using mod_timer_pending(), which only modifies
the timer if it's pending (and which has existed since Linux v2.6.30).

Note that this was the only place we ever took a ref on frag_queue
since Eric's conversion to RCU. So we could potentially replace
the whole refcnt field with an atomic flag and a bit more RCU.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 net/ipv4/inet_fragment.c | 4 +++-
 net/ipv4/ip_fragment.c   | 4 +---
 2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 025895eb6ec5..30f4fa50ee2d 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -327,7 +327,9 @@ static struct inet_frag_queue *inet_frag_alloc(struct fqdir *fqdir,
 
 	timer_setup(&q->timer, f->frag_expire, 0);
 	spin_lock_init(&q->lock);
-	/* One reference for the timer, one for the hash table. */
+	/* One reference for the timer, one for the hash table.
+	 * We never take any extra references, only decrement this field.
+	 */
 	refcount_set(&q->refcnt, 2);
 
 	return q;
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index f7012479713b..d7bccdc9dc69 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -242,10 +242,8 @@ static int ip_frag_reinit(struct ipq *qp)
 {
 	unsigned int sum_truesize = 0;
 
-	if (!mod_timer(&qp->q.timer, jiffies + qp->q.fqdir->timeout)) {
-		refcount_inc(&qp->q.refcnt);
+	if (!mod_timer_pending(&qp->q.timer, jiffies + qp->q.fqdir->timeout))
 		return -ETIMEDOUT;
-	}
 
 	sum_truesize = inet_frag_rbtree_purge(&qp->q.rb_fragments,
 					      SKB_DROP_REASON_FRAG_TOO_FAR);
-- 
2.52.0
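
The refcount balancing described in the commit message above can be modeled in userspace. This is only a sketch: `toy_timer`, `toy_mod_timer()`, `toy_mod_timer_pending()`, and `toy_reinit_after_fire()` are hypothetical stand-ins for the kernel timer API and ip_frag_reinit(), not kernel code. It assumes the timer handler drops exactly one reference when it fires.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for struct timer_list: tracks only whether it is armed. */
struct toy_timer {
	bool pending;
	unsigned long expires;
};

/* Queue state we care about after a reinit attempt. */
struct toy_result {
	int refcnt;
	bool timer_pending;
};

/* Models mod_timer(): always (re)arms; reports whether it was pending. */
static bool toy_mod_timer(struct toy_timer *t, unsigned long expires)
{
	bool was_pending = t->pending;

	t->pending = true;
	t->expires = expires;
	return was_pending;
}

/* Models mod_timer_pending(): re-arms a pending timer, leaves an
 * inactive one alone.
 */
static bool toy_mod_timer_pending(struct toy_timer *t, unsigned long expires)
{
	if (!t->pending)
		return false;
	t->expires = expires;
	return true;
}

/* The timer fires: it is disarmed and its handler drops the timer's ref. */
static void toy_timer_fire(struct toy_timer *t, int *refcnt)
{
	assert(t->pending);
	t->pending = false;
	(*refcnt)--;
}

/* Reinit after the timer already fired; use_pending selects the new scheme. */
struct toy_result toy_reinit_after_fire(bool use_pending)
{
	struct toy_timer t = { .pending = true, .expires = 10 };
	int refcnt = 2; /* one for the timer, one for the hash table */

	toy_timer_fire(&t, &refcnt);

	if (use_pending) {
		/* New scheme: the dead timer stays dead, nothing to rebalance. */
		toy_mod_timer_pending(&t, 20);
	} else if (!toy_mod_timer(&t, 20)) {
		/* Old scheme: the timer was re-armed above, so a reference is
		 * taken here. In the kernel the timer could fire between the
		 * two steps; this single-threaded toy cannot show that window.
		 */
		refcnt++;
	}
	return (struct toy_result){ .refcnt = refcnt, .timer_pending = t.pending };
}
```

Calling `toy_reinit_after_fire(false)` leaves the timer armed and the count back at 2, which is the compensation the old code needed. With `true`, the timer stays inactive and the count is never incremented, matching the new comment that the field is only ever decremented.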



* Re: [PATCH net 1/4] inet: frags: avoid theoretical race in ip_frag_reinit()
  2025-12-07  1:09 ` [PATCH net 1/4] inet: frags: avoid theoretical race in ip_frag_reinit() Jakub Kicinski
@ 2025-12-08 15:18   ` Eric Dumazet
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2025-12-08 15:18 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, pabeni, andrew+netdev, horms, pablo, fw,
	netfilter-devel, willemdebruijn.kernel, kuniyu

On Sat, Dec 6, 2025 at 5:10 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> In ip_frag_reinit() we want to move the frag timeout timer into
> the future. If the timer fires in the meantime we inadvertently
> schedule it again, and since the timer assumes a ref on frag_queue
> we need to acquire one to balance things out.
>
> This is technically racy: we should have acquired the reference
> _before_ we touch the timer, as it may fire again before we take the ref.
> Avoid this entire dance by using mod_timer_pending(), which only modifies
> the timer if it's pending (and which has existed since Linux v2.6.30).
>
> Note that this was the only place we ever took a ref on frag_queue
> since Eric's conversion to RCU. So we could potentially replace
> the whole refcnt field with an atomic flag and a bit more RCU.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---

Reviewed-by: Eric Dumazet <edumazet@google.com>


* [PATCH net 2/4] inet: frags: add inet_frag_queue_flush()
  2025-12-07  1:09 [PATCH net 0/4] inet: frags: flush pending skbs in fqdir_pre_exit() Jakub Kicinski
  2025-12-07  1:09 ` [PATCH net 1/4] inet: frags: avoid theoretical race in ip_frag_reinit() Jakub Kicinski
@ 2025-12-07  1:09 ` Jakub Kicinski
  2025-12-08 15:19   ` Eric Dumazet
  2025-12-07  1:09 ` [PATCH net 3/4] inet: frags: flush pending skbs in fqdir_pre_exit() Jakub Kicinski
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 11+ messages in thread
From: Jakub Kicinski @ 2025-12-07  1:09 UTC (permalink / raw)
  To: davem
  Cc: netdev, edumazet, pabeni, andrew+netdev, horms, pablo, fw,
	netfilter-devel, willemdebruijn.kernel, kuniyu, Jakub Kicinski

Instead of exporting inet_frag_rbtree_purge(), which requires that
the caller take care of memory accounting, add a new helper. We will
need to call it from a few places in the next patch.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 include/net/inet_frag.h  |  5 ++---
 net/ipv4/inet_fragment.c | 15 ++++++++++++---
 net/ipv4/ip_fragment.c   |  6 +-----
 3 files changed, 15 insertions(+), 11 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 0eccd9c3a883..3ffaceee7bbc 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -141,9 +141,8 @@ void inet_frag_kill(struct inet_frag_queue *q, int *refs);
 void inet_frag_destroy(struct inet_frag_queue *q);
 struct inet_frag_queue *inet_frag_find(struct fqdir *fqdir, void *key);
 
-/* Free all skbs in the queue; return the sum of their truesizes. */
-unsigned int inet_frag_rbtree_purge(struct rb_root *root,
-				    enum skb_drop_reason reason);
+void inet_frag_queue_flush(struct inet_frag_queue *q,
+			   enum skb_drop_reason reason);
 
 static inline void inet_frag_putn(struct inet_frag_queue *q, int refs)
 {
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 30f4fa50ee2d..1bf969b5a1cb 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -263,8 +263,8 @@ static void inet_frag_destroy_rcu(struct rcu_head *head)
 	kmem_cache_free(f->frags_cachep, q);
 }
 
-unsigned int inet_frag_rbtree_purge(struct rb_root *root,
-				    enum skb_drop_reason reason)
+static unsigned int
+inet_frag_rbtree_purge(struct rb_root *root, enum skb_drop_reason reason)
 {
 	struct rb_node *p = rb_first(root);
 	unsigned int sum = 0;
@@ -284,7 +284,16 @@ unsigned int inet_frag_rbtree_purge(struct rb_root *root,
 	}
 	return sum;
 }
-EXPORT_SYMBOL(inet_frag_rbtree_purge);
+
+void inet_frag_queue_flush(struct inet_frag_queue *q,
+			   enum skb_drop_reason reason)
+{
+	unsigned int sum;
+
+	sum = inet_frag_rbtree_purge(&q->rb_fragments, reason);
+	sub_frag_mem_limit(q->fqdir, sum);
+}
+EXPORT_SYMBOL(inet_frag_queue_flush);
 
 void inet_frag_destroy(struct inet_frag_queue *q)
 {
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index d7bccdc9dc69..32f1c1a46ba7 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -240,14 +240,10 @@ static int ip_frag_too_far(struct ipq *qp)
 
 static int ip_frag_reinit(struct ipq *qp)
 {
-	unsigned int sum_truesize = 0;
-
 	if (!mod_timer_pending(&qp->q.timer, jiffies + qp->q.fqdir->timeout))
 		return -ETIMEDOUT;
 
-	sum_truesize = inet_frag_rbtree_purge(&qp->q.rb_fragments,
-					      SKB_DROP_REASON_FRAG_TOO_FAR);
-	sub_frag_mem_limit(qp->q.fqdir, sum_truesize);
+	inet_frag_queue_flush(&qp->q, SKB_DROP_REASON_FRAG_TOO_FAR);
 
 	qp->q.flags = 0;
 	qp->q.len = 0;
-- 
2.52.0
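
The point of the helper is that freeing fragments and adjusting the memory counter happen together, so no caller can forget the accounting step. Below is a minimal userspace sketch of that pattern. All `toy_*` names are hypothetical, a singly linked list stands in for the rbtree, and fragments are merely unlinked rather than freed.

```c
#include <assert.h>
#include <stddef.h>

/* Toy fragment: only its accounted size matters here. */
struct toy_frag {
	unsigned int truesize;
	struct toy_frag *next;
};

/* Toy fqdir: holds the shared memory counter. */
struct toy_fqdir {
	long mem;
};

struct toy_queue {
	struct toy_fqdir *fqdir;
	struct toy_frag *frags;
};

/* Unlink every fragment and return the sum of their sizes.
 * This is the low-level step that used to be exported on its own.
 */
static unsigned int toy_purge(struct toy_frag **head)
{
	unsigned int sum = 0;

	while (*head) {
		struct toy_frag *f = *head;

		*head = f->next;
		f->next = NULL;
		sum += f->truesize;
	}
	return sum;
}

/* Purge and account in one call, so callers cannot skip the accounting. */
static void toy_queue_flush(struct toy_queue *q)
{
	unsigned int sum = toy_purge(&q->frags);

	q->fqdir->mem -= sum;
}

/* Build a two-fragment queue, flush it, and report the remaining charge. */
long toy_flush_demo(void)
{
	struct toy_fqdir fqdir = { .mem = 0 };
	struct toy_frag b = { .truesize = 512, .next = NULL };
	struct toy_frag a = { .truesize = 768, .next = &b };
	struct toy_queue q = { .fqdir = &fqdir, .frags = &a };

	fqdir.mem = a.truesize + b.truesize; /* charge for both fragments */
	toy_queue_flush(&q);
	assert(q.frags == NULL);
	return fqdir.mem;
}
```

After the flush the queue is empty and the counter is back to zero, which is the invariant the old interface left to each caller.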



* Re: [PATCH net 2/4] inet: frags: add inet_frag_queue_flush()
  2025-12-07  1:09 ` [PATCH net 2/4] inet: frags: add inet_frag_queue_flush() Jakub Kicinski
@ 2025-12-08 15:19   ` Eric Dumazet
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2025-12-08 15:19 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, pabeni, andrew+netdev, horms, pablo, fw,
	netfilter-devel, willemdebruijn.kernel, kuniyu

On Sat, Dec 6, 2025 at 5:10 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> Instead of exporting inet_frag_rbtree_purge(), which requires that
> the caller take care of memory accounting, add a new helper. We will
> need to call it from a few places in the next patch.
>
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Reviewed-by: Eric Dumazet <edumazet@google.com>


* [PATCH net 3/4] inet: frags: flush pending skbs in fqdir_pre_exit()
  2025-12-07  1:09 [PATCH net 0/4] inet: frags: flush pending skbs in fqdir_pre_exit() Jakub Kicinski
  2025-12-07  1:09 ` [PATCH net 1/4] inet: frags: avoid theoretical race in ip_frag_reinit() Jakub Kicinski
  2025-12-07  1:09 ` [PATCH net 2/4] inet: frags: add inet_frag_queue_flush() Jakub Kicinski
@ 2025-12-07  1:09 ` Jakub Kicinski
  2025-12-08 15:17   ` Eric Dumazet
  2025-12-07  1:09 ` [PATCH net 4/4] netfilter: conntrack: warn when cleanup is stuck Jakub Kicinski
  2025-12-10  9:50 ` [PATCH net 0/4] inet: frags: flush pending skbs in fqdir_pre_exit() patchwork-bot+netdevbpf
  4 siblings, 1 reply; 11+ messages in thread
From: Jakub Kicinski @ 2025-12-07  1:09 UTC (permalink / raw)
  To: davem
  Cc: netdev, edumazet, pabeni, andrew+netdev, horms, pablo, fw,
	netfilter-devel, willemdebruijn.kernel, kuniyu, Jakub Kicinski

We have been seeing occasional deadlocks on pernet_ops_rwsem since
September in NIPA. The stuck task was usually modprobe (often loading
a driver like ipvlan), trying to take the lock as a Writer.
lockdep does not track readers for rwsems so the read wasn't obvious
from the reports.

On closer inspection the Reader holding the lock was conntrack looping
forever in nf_conntrack_cleanup_net_list(). Based on past experience
with occasional NIPA crashes I looked thru the tests which run before
the crash and noticed that the crash follows ip_defrag.sh. An immediate
red flag. Scouring thru (de)fragmentation queues reveals skbs sitting
around, holding conntrack references.

The problem is that since conntrack depends on nf_defrag_ipv6,
nf_defrag_ipv6 will load first. Since nf_defrag_ipv6 loads first its
netns exit hooks run _after_ conntrack's netns exit hook.

Flush all fragment queue SKBs during fqdir_pre_exit() to release
conntrack references before conntrack cleanup runs. Also flush
the queues in timer expiry handlers when they discover fqdir->dead
is set, in case a packet sneaks in while we're running the pre_exit
flush.

The commit under Fixes is not exactly the culprit, but I think
previously the timer firing would eventually unblock the spinning
conntrack.

Fixes: d5dd88794a13 ("inet: fix various use-after-free in defrags units")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 include/net/inet_frag.h  | 13 +------------
 include/net/ipv6_frag.h  |  9 ++++++---
 net/ipv4/inet_fragment.c | 36 ++++++++++++++++++++++++++++++++++++
 net/ipv4/ip_fragment.c   | 12 +++++++-----
 4 files changed, 50 insertions(+), 20 deletions(-)

diff --git a/include/net/inet_frag.h b/include/net/inet_frag.h
index 3ffaceee7bbc..365925c9d262 100644
--- a/include/net/inet_frag.h
+++ b/include/net/inet_frag.h
@@ -123,18 +123,7 @@ void inet_frags_fini(struct inet_frags *);
 
 int fqdir_init(struct fqdir **fqdirp, struct inet_frags *f, struct net *net);
 
-static inline void fqdir_pre_exit(struct fqdir *fqdir)
-{
-	/* Prevent creation of new frags.
-	 * Pairs with READ_ONCE() in inet_frag_find().
-	 */
-	WRITE_ONCE(fqdir->high_thresh, 0);
-
-	/* Pairs with READ_ONCE() in inet_frag_kill(), ip_expire()
-	 * and ip6frag_expire_frag_queue().
-	 */
-	WRITE_ONCE(fqdir->dead, true);
-}
+void fqdir_pre_exit(struct fqdir *fqdir);
 void fqdir_exit(struct fqdir *fqdir);
 
 void inet_frag_kill(struct inet_frag_queue *q, int *refs);
diff --git a/include/net/ipv6_frag.h b/include/net/ipv6_frag.h
index 38ef66826939..41d9fc6965f9 100644
--- a/include/net/ipv6_frag.h
+++ b/include/net/ipv6_frag.h
@@ -69,9 +69,6 @@ ip6frag_expire_frag_queue(struct net *net, struct frag_queue *fq)
 	int refs = 1;
 
 	rcu_read_lock();
-	/* Paired with the WRITE_ONCE() in fqdir_pre_exit(). */
-	if (READ_ONCE(fq->q.fqdir->dead))
-		goto out_rcu_unlock;
 	spin_lock(&fq->q.lock);
 
 	if (fq->q.flags & INET_FRAG_COMPLETE)
@@ -80,6 +77,12 @@ ip6frag_expire_frag_queue(struct net *net, struct frag_queue *fq)
 	fq->q.flags |= INET_FRAG_DROP;
 	inet_frag_kill(&fq->q, &refs);
 
+	/* Paired with the WRITE_ONCE() in fqdir_pre_exit(). */
+	if (READ_ONCE(fq->q.fqdir->dead)) {
+		inet_frag_queue_flush(&fq->q, 0);
+		goto out;
+	}
+
 	dev = dev_get_by_index_rcu(net, fq->iif);
 	if (!dev)
 		goto out;
diff --git a/net/ipv4/inet_fragment.c b/net/ipv4/inet_fragment.c
index 1bf969b5a1cb..001ee5c4d962 100644
--- a/net/ipv4/inet_fragment.c
+++ b/net/ipv4/inet_fragment.c
@@ -218,6 +218,41 @@ static int __init inet_frag_wq_init(void)
 
 pure_initcall(inet_frag_wq_init);
 
+void fqdir_pre_exit(struct fqdir *fqdir)
+{
+	struct inet_frag_queue *fq;
+	struct rhashtable_iter hti;
+
+	/* Prevent creation of new frags.
+	 * Pairs with READ_ONCE() in inet_frag_find().
+	 */
+	WRITE_ONCE(fqdir->high_thresh, 0);
+
+	/* Pairs with READ_ONCE() in inet_frag_kill(), ip_expire()
+	 * and ip6frag_expire_frag_queue().
+	 */
+	WRITE_ONCE(fqdir->dead, true);
+
+	rhashtable_walk_enter(&fqdir->rhashtable, &hti);
+	rhashtable_walk_start(&hti);
+
+	while ((fq = rhashtable_walk_next(&hti))) {
+		if (IS_ERR(fq)) {
+			if (PTR_ERR(fq) != -EAGAIN)
+				break;
+			continue;
+		}
+		spin_lock_bh(&fq->lock);
+		if (!(fq->flags & INET_FRAG_COMPLETE))
+			inet_frag_queue_flush(fq, 0);
+		spin_unlock_bh(&fq->lock);
+	}
+
+	rhashtable_walk_stop(&hti);
+	rhashtable_walk_exit(&hti);
+}
+EXPORT_SYMBOL(fqdir_pre_exit);
+
 void fqdir_exit(struct fqdir *fqdir)
 {
 	INIT_WORK(&fqdir->destroy_work, fqdir_work_fn);
@@ -290,6 +325,7 @@ void inet_frag_queue_flush(struct inet_frag_queue *q,
 {
 	unsigned int sum;
 
+	reason = reason ?: SKB_DROP_REASON_FRAG_REASM_TIMEOUT;
 	sum = inet_frag_rbtree_purge(&q->rb_fragments, reason);
 	sub_frag_mem_limit(q->fqdir, sum);
 }
diff --git a/net/ipv4/ip_fragment.c b/net/ipv4/ip_fragment.c
index 32f1c1a46ba7..56b0f738d2f2 100644
--- a/net/ipv4/ip_fragment.c
+++ b/net/ipv4/ip_fragment.c
@@ -134,11 +134,6 @@ static void ip_expire(struct timer_list *t)
 	net = qp->q.fqdir->net;
 
 	rcu_read_lock();
-
-	/* Paired with WRITE_ONCE() in fqdir_pre_exit(). */
-	if (READ_ONCE(qp->q.fqdir->dead))
-		goto out_rcu_unlock;
-
 	spin_lock(&qp->q.lock);
 
 	if (qp->q.flags & INET_FRAG_COMPLETE)
@@ -146,6 +141,13 @@ static void ip_expire(struct timer_list *t)
 
 	qp->q.flags |= INET_FRAG_DROP;
 	inet_frag_kill(&qp->q, &refs);
+
+	/* Paired with WRITE_ONCE() in fqdir_pre_exit(). */
+	if (READ_ONCE(qp->q.fqdir->dead)) {
+		inet_frag_queue_flush(&qp->q, 0);
+		goto out;
+	}
+
 	__IP_INC_STATS(net, IPSTATS_MIB_REASMFAILS);
 	__IP_INC_STATS(net, IPSTATS_MIB_REASMTIMEOUT);
 
-- 
2.52.0
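
The ordering problem from the commit message can be reduced to a small model. The sketch assumes, as I understand the netns teardown, that all pre_exit hooks run before any exit hook and that exit hooks run in reverse registration order, so conntrack's exit runs before the defrag exit. Every `toy_*` name is hypothetical, and the model assumes each queued fragment pins exactly one conntrack reference.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_net {
	int queued_frags; /* skbs sitting in defrag queues */
	int ct_refs;      /* conntrack references pinned by those skbs */
};

/* Dropping the queued skbs releases the conntrack references they hold. */
static void toy_defrag_flush(struct toy_net *net)
{
	net->ct_refs -= net->queued_frags;
	net->queued_frags = 0;
}

/* Conntrack's exit can only finish once nothing references its entries. */
static bool toy_conntrack_exit(const struct toy_net *net)
{
	return net->ct_refs == 0;
}

/* One teardown pass. Returns true if conntrack's exit could complete. */
static bool toy_teardown(struct toy_net *net, bool flush_in_pre_exit)
{
	bool ct_done;

	/* Phase 1: pre_exit hooks of every subsystem. */
	if (flush_in_pre_exit)
		toy_defrag_flush(net);

	/* Phase 2: exit hooks in reverse registration order. Defrag
	 * registered first, so conntrack's exit runs first.
	 */
	ct_done = toy_conntrack_exit(net);

	/* Defrag's own exit would drop the skbs, but it is only reached
	 * after conntrack's exit has finished.
	 */
	if (ct_done)
		toy_defrag_flush(net);
	return ct_done;
}

/* Tear down a netns with three queued fragments, with or without the fix. */
bool toy_teardown_demo(bool flush_in_pre_exit)
{
	struct toy_net net = { .queued_frags = 3, .ct_refs = 3 };

	return toy_teardown(&net, flush_in_pre_exit);
}
```

Without the pre_exit flush the model never gets past conntrack's exit, which corresponds to the loop in nf_conntrack_cleanup_net_list() spinning while holding pernet_ops_rwsem. With the flush, the references are gone before conntrack looks for them.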



* Re: [PATCH net 3/4] inet: frags: flush pending skbs in fqdir_pre_exit()
  2025-12-07  1:09 ` [PATCH net 3/4] inet: frags: flush pending skbs in fqdir_pre_exit() Jakub Kicinski
@ 2025-12-08 15:17   ` Eric Dumazet
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2025-12-08 15:17 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, pabeni, andrew+netdev, horms, pablo, fw,
	netfilter-devel, willemdebruijn.kernel, kuniyu

On Sat, Dec 6, 2025 at 5:10 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> We have been seeing occasional deadlocks on pernet_ops_rwsem since
> September in NIPA. The stuck task was usually modprobe (often loading
> a driver like ipvlan), trying to take the lock as a Writer.
> lockdep does not track readers for rwsems so the read wasn't obvious
> from the reports.
>
> On closer inspection the Reader holding the lock was conntrack looping
> forever in nf_conntrack_cleanup_net_list(). Based on past experience
> with occasional NIPA crashes I looked thru the tests which run before
> the crash and noticed that the crash follows ip_defrag.sh. An immediate
> red flag. Scouring thru (de)fragmentation queues reveals skbs sitting
> around, holding conntrack references.
>
> The problem is that since conntrack depends on nf_defrag_ipv6,
> nf_defrag_ipv6 will load first. Since nf_defrag_ipv6 loads first its
> netns exit hooks run _after_ conntrack's netns exit hook.
>
> Flush all fragment queue SKBs during fqdir_pre_exit() to release
> conntrack references before conntrack cleanup runs. Also flush
> the queues in timer expiry handlers when they discover fqdir->dead
> is set, in case a packet sneaks in while we're running the pre_exit
> flush.
>
> The commit under Fixes is not exactly the culprit, but I think
> previously the timer firing would eventually unblock the spinning
> conntrack.
>
> Fixes: d5dd88794a13 ("inet: fix various use-after-free in defrags units")
> Signed-off-by: Jakub Kicinski <kuba@kernel.org>

Reviewed-by: Eric Dumazet <edumazet@google.com>


* [PATCH net 4/4] netfilter: conntrack: warn when cleanup is stuck
  2025-12-07  1:09 [PATCH net 0/4] inet: frags: flush pending skbs in fqdir_pre_exit() Jakub Kicinski
                   ` (2 preceding siblings ...)
  2025-12-07  1:09 ` [PATCH net 3/4] inet: frags: flush pending skbs in fqdir_pre_exit() Jakub Kicinski
@ 2025-12-07  1:09 ` Jakub Kicinski
  2025-12-07 10:21   ` Florian Westphal
  2025-12-10  9:50 ` [PATCH net 0/4] inet: frags: flush pending skbs in fqdir_pre_exit() patchwork-bot+netdevbpf
  4 siblings, 1 reply; 11+ messages in thread
From: Jakub Kicinski @ 2025-12-07  1:09 UTC (permalink / raw)
  To: davem
  Cc: netdev, edumazet, pabeni, andrew+netdev, horms, pablo, fw,
	netfilter-devel, willemdebruijn.kernel, kuniyu, Jakub Kicinski

nf_conntrack_cleanup_net_list() calls schedule() so it does not
show up as a hung task. Add an explicit check to make debugging
leaked skbs/conntrack references more obvious.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 net/netfilter/nf_conntrack_core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/net/netfilter/nf_conntrack_core.c b/net/netfilter/nf_conntrack_core.c
index 0b95f226f211..d1f8eb725d42 100644
--- a/net/netfilter/nf_conntrack_core.c
+++ b/net/netfilter/nf_conntrack_core.c
@@ -2487,6 +2487,7 @@ void nf_conntrack_cleanup_net(struct net *net)
 void nf_conntrack_cleanup_net_list(struct list_head *net_exit_list)
 {
 	struct nf_ct_iter_data iter_data = {};
+	unsigned long start = jiffies;
 	struct net *net;
 	int busy;
 
@@ -2507,6 +2508,8 @@ void nf_conntrack_cleanup_net_list(struct list_head *net_exit_list)
 			busy = 1;
 	}
 	if (busy) {
+		DEBUG_NET_WARN_ONCE(time_after(jiffies, start + 60 * HZ),
+				    "conntrack cleanup blocked for 60s");
 		schedule();
 		goto i_see_dead_people;
 	}
-- 
2.52.0
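
The check added above relies on a wraparound-safe comparison of tick counters plus a warn-once latch. The following userspace sketch shows both ideas. `toy_time_after()` mirrors the signed-difference trick behind the kernel's time_after(), while `TOY_HZ`, `toy_watchdog`, and `toy_cleanup_stuck()` are hypothetical names, and the tick rate of 1000 is an assumption.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define TOY_HZ 1000UL /* assumed tick rate, ticks per second */

/* True if tick a is later than tick b. The signed difference keeps the
 * comparison correct when the counter wraps past ULONG_MAX.
 */
static bool toy_time_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

struct toy_watchdog {
	unsigned long start; /* tick at which the cleanup loop began */
	bool warned;         /* latch so the warning fires only once */
};

/* Returns true exactly once: the first time more than 60s have passed. */
bool toy_cleanup_stuck(struct toy_watchdog *w, unsigned long now)
{
	if (w->warned || !toy_time_after(now, w->start + 60 * TOY_HZ))
		return false;
	w->warned = true;
	return true;
}

/* Count warnings for a loop that starts just before the counter wraps. */
int toy_watchdog_demo(void)
{
	struct toy_watchdog w = { .start = ULONG_MAX - 10, .warned = false };
	int warnings = 0;

	warnings += toy_cleanup_stuck(&w, w.start + 59 * TOY_HZ); /* too early */
	warnings += toy_cleanup_stuck(&w, w.start + 61 * TOY_HZ); /* fires */
	warnings += toy_cleanup_stuck(&w, w.start + 62 * TOY_HZ); /* latched */
	return warnings;
}
```

The demo deliberately starts ten ticks before the counter overflows, so the 60 second deadline lands on the far side of the wrap. A plain `now > deadline` comparison would misfire in that situation, while the signed difference does not.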



* Re: [PATCH net 4/4] netfilter: conntrack: warn when cleanup is stuck
  2025-12-07  1:09 ` [PATCH net 4/4] netfilter: conntrack: warn when cleanup is stuck Jakub Kicinski
@ 2025-12-07 10:21   ` Florian Westphal
  2025-12-08 15:20     ` Eric Dumazet
  0 siblings, 1 reply; 11+ messages in thread
From: Florian Westphal @ 2025-12-07 10:21 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, pablo,
	netfilter-devel, willemdebruijn.kernel, kuniyu

Jakub Kicinski <kuba@kernel.org> wrote:
> nf_conntrack_cleanup_net_list() calls schedule() so it does not
> show up as a hung task. Add an explicit check to make debugging
> leaked skbs/conntrack references more obvious.

Acked-by: Florian Westphal <fw@strlen.de>


* Re: [PATCH net 4/4] netfilter: conntrack: warn when cleanup is stuck
  2025-12-07 10:21   ` Florian Westphal
@ 2025-12-08 15:20     ` Eric Dumazet
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2025-12-08 15:20 UTC (permalink / raw)
  To: Florian Westphal
  Cc: Jakub Kicinski, davem, netdev, pabeni, andrew+netdev, horms,
	pablo, netfilter-devel, willemdebruijn.kernel, kuniyu

On Sun, Dec 7, 2025 at 2:21 AM Florian Westphal <fw@strlen.de> wrote:
>
> Jakub Kicinski <kuba@kernel.org> wrote:
> > nf_conntrack_cleanup_net_list() calls schedule() so it does not
> > show up as a hung task. Add an explicit check to make debugging
> > leaked skbs/conntrack references more obvious.
>
> Acked-by: Florian Westphal <fw@strlen.de>

Reviewed-by: Eric Dumazet <edumazet@google.com>


* Re: [PATCH net 0/4] inet: frags: flush pending skbs in fqdir_pre_exit()
  2025-12-07  1:09 [PATCH net 0/4] inet: frags: flush pending skbs in fqdir_pre_exit() Jakub Kicinski
                   ` (3 preceding siblings ...)
  2025-12-07  1:09 ` [PATCH net 4/4] netfilter: conntrack: warn when cleanup is stuck Jakub Kicinski
@ 2025-12-10  9:50 ` patchwork-bot+netdevbpf
  4 siblings, 0 replies; 11+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-12-10  9:50 UTC (permalink / raw)
  To: Jakub Kicinski
  Cc: davem, netdev, edumazet, pabeni, andrew+netdev, horms, pablo, fw,
	netfilter-devel, willemdebruijn.kernel, kuniyu

Hello:

This series was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Sat,  6 Dec 2025 17:09:38 -0800 you wrote:
> Fix the issue reported by NIPA starting on Sep 18th [1], where
> pernet_ops_rwsem is constantly held by a reader, preventing writers
> from grabbing it (specifically driver modules from loading).
> 
> The fact that reports started around that time seems coincidental.
> The issue seems to be skbs queued for defrag preventing conntrack
> from exiting.
> 
> [...]

Here is the summary with links:
  - [net,1/4] inet: frags: avoid theoretical race in ip_frag_reinit()
    https://git.kernel.org/netdev/net/c/8ef522c8a59a
  - [net,2/4] inet: frags: add inet_frag_queue_flush()
    https://git.kernel.org/netdev/net/c/1231eec6994b
  - [net,3/4] inet: frags: flush pending skbs in fqdir_pre_exit()
    https://git.kernel.org/netdev/net/c/006a5035b495
  - [net,4/4] netfilter: conntrack: warn when cleanup is stuck
    https://git.kernel.org/netdev/net/c/92df4c56cf5b

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html


