[PATCH nf 0/2] netfilter: ctnetlink: fix memory leak in ctnetlink dump

netfilter-devel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH nf 0/2] netfilter: ctnetlink: fix memory leak in ctnetlink dump
@ 2025-08-01 15:25 Florian Westphal
  2025-08-01 15:25 ` [PATCH nf 1/2] netfilter: ctnetlink: fix refcount leak on table dump Florian Westphal
  2025-08-01 15:25 ` [PATCH nf 2/2] netfilter: ctnetlink: remove refcounting in expectation dumpers Florian Westphal
  0 siblings, 2 replies; 5+ messages in thread
From: Florian Westphal @ 2025-08-01 15:25 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

This series fixes a memory (refcount) leak in the ctnetlink dump code.
In some cases is possible that the entry being held in cb->args[] (with
refcount raised) fails to be delivered.

If this happens, the reference count is erronously incremented a second
time.

This results in a memory leak and non-recoverable hang in the netns
cleanup worker.

The second patch fixes a similar pattern in the expectation dump code.

In both cases the fix is to not use reference counting at all, the restart
hint is replaced by a cookie value, this has the same guarantees as the
existing code without need for keeping objects alive across partial dumps.

Note that the same pattern is used for dying lists, but as far as I can
see this problem can't happen there.  I will submit a patch for nf-next
that also uses refcount-less cookie values in the dying list dumper.

Florian Westphal (2):
  netfilter: ctnetlink: fix refcount leak on table dump
  netfilter: ctnetlink: remove refcounting in expectation dumpers

 net/netfilter/nf_conntrack_netlink.c | 65 +++++++++++++---------------
 1 file changed, 30 insertions(+), 35 deletions(-)

-- 
2.49.1

^ permalink raw reply	[flat|nested] 5+ messages in thread

* [PATCH nf 1/2] netfilter: ctnetlink: fix refcount leak on table dump
  2025-08-01 15:25 [PATCH nf 0/2] netfilter: ctnetlink: fix memory leak in ctnetlink dump Florian Westphal
@ 2025-08-01 15:25 ` Florian Westphal
  2025-08-07 10:57   ` Pablo Neira Ayuso
  2025-08-01 15:25 ` [PATCH nf 2/2] netfilter: ctnetlink: remove refcounting in expectation dumpers Florian Westphal
  1 sibling, 1 reply; 5+ messages in thread
From: Florian Westphal @ 2025-08-01 15:25 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

There is a reference count leak in ctnetlink_dump_table():
      if (res < 0) {
                nf_conntrack_get(&ct->ct_general); // HERE
                cb->args[1] = (unsigned long)ct;
                ...

While its very unlikely, its possible that ct == last.
If this happens, then the refcount of ct was already incremented.
This 2nd increment is never undone.

This prevents the conntrack object from being released, which in turn
keeps prevents cnet->count from dropping back to 0.

This will then block the netns dismantle (or conntrack rmmod) as
nf_conntrack_cleanup_net_list() will wait forever.

This can be reproduced by running conntrack_resize.sh selftest in a loop.
It takes ~20 minutes for me on a preemptible kernel on average before
I see a runaway kworker spinning in nf_conntrack_cleanup_net_list.

One fix would to change this to:
        if (res < 0) {
		if (ct != last)
	                nf_conntrack_get(&ct->ct_general);

But this reference counting isn't needed in the first place.
We can just store a cookie value instead.

A followup patch will do the same for ctnetlink_exp_dump_table,
it looks to me as if this has the same problem and like
ctnetlink_dump_table, we only need a 'skip hint', not the actual
object so we can apply the same cookie strategy there as well.

Fixes: d205dc40798d ("[NETFILTER]: ctnetlink: fix deadlock in table dumping")
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_conntrack_netlink.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 2cc0fde23344..5fdcae45e0bc 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -884,8 +884,6 @@ ctnetlink_conntrack_event(unsigned int events, const struct nf_ct_event *item)
 
 static int ctnetlink_done(struct netlink_callback *cb)
 {
-	if (cb->args[1])
-		nf_ct_put((struct nf_conn *)cb->args[1]);
 	kfree(cb->data);
 	return 0;
 }
@@ -1208,19 +1206,26 @@ static int ctnetlink_filter_match(struct nf_conn *ct, void *data)
 	return 0;
 }
 
+static unsigned long ctnetlink_get_id(const struct nf_conn *ct)
+{
+	unsigned long id = nf_ct_get_id(ct);
+
+	return id ? id : 1;
+}
+
 static int
 ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb)
 {
 	unsigned int flags = cb->data ? NLM_F_DUMP_FILTERED : 0;
 	struct net *net = sock_net(skb->sk);
-	struct nf_conn *ct, *last;
+	unsigned long last_id = cb->args[1];
 	struct nf_conntrack_tuple_hash *h;
 	struct hlist_nulls_node *n;
 	struct nf_conn *nf_ct_evict[8];
+	struct nf_conn *ct;
 	int res, i;
 	spinlock_t *lockp;
 
-	last = (struct nf_conn *)cb->args[1];
 	i = 0;
 
 	local_bh_disable();
@@ -1257,7 +1262,7 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb)
 				continue;
 
 			if (cb->args[1]) {
-				if (ct != last)
+				if (ctnetlink_get_id(ct) != last_id)
 					continue;
 				cb->args[1] = 0;
 			}
@@ -1270,8 +1275,7 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb)
 					    NFNL_MSG_TYPE(cb->nlh->nlmsg_type),
 					    ct, true, flags);
 			if (res < 0) {
-				nf_conntrack_get(&ct->ct_general);
-				cb->args[1] = (unsigned long)ct;
+				cb->args[1] = ctnetlink_get_id(ct);
 				spin_unlock(lockp);
 				goto out;
 			}
@@ -1284,12 +1288,10 @@ ctnetlink_dump_table(struct sk_buff *skb, struct netlink_callback *cb)
 	}
 out:
 	local_bh_enable();
-	if (last) {
+	if (last_id) {
 		/* nf ct hash resize happened, now clear the leftover. */
-		if ((struct nf_conn *)cb->args[1] == last)
+		if (cb->args[1] == last_id)
 			cb->args[1] = 0;
-
-		nf_ct_put(last);
 	}
 
 	while (i) {
-- 
2.49.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* [PATCH nf 2/2] netfilter: ctnetlink: remove refcounting in expectation dumpers
  2025-08-01 15:25 [PATCH nf 0/2] netfilter: ctnetlink: fix memory leak in ctnetlink dump Florian Westphal
  2025-08-01 15:25 ` [PATCH nf 1/2] netfilter: ctnetlink: fix refcount leak on table dump Florian Westphal
@ 2025-08-01 15:25 ` Florian Westphal
  1 sibling, 0 replies; 5+ messages in thread
From: Florian Westphal @ 2025-08-01 15:25 UTC (permalink / raw)
  To: netfilter-devel; +Cc: Florian Westphal

Same pattern as previous patch: do not keep the expectation object
alive via refcount, only store a cookie value and then use that
as the skip hint for dump resumption.

AFAICS this has the same issue as the one resolved in the conntrack
dumper, when we do
  if (!refcount_inc_not_zero(&exp->use))

to increment the refcount, there is a chance that exp == last, which
causes a double-increment of the refcount and subsequent memory leak.

Fixes: cf6994c2b981 ("[NETFILTER]: nf_conntrack_netlink: sync expectation dumping with conntrack table dumping")
Fixes: e844a928431f ("netfilter: ctnetlink: allow to dump expectation per master conntrack")
Signed-off-by: Florian Westphal <fw@strlen.de>
---
 net/netfilter/nf_conntrack_netlink.c | 41 ++++++++++++----------------
 1 file changed, 17 insertions(+), 24 deletions(-)

diff --git a/net/netfilter/nf_conntrack_netlink.c b/net/netfilter/nf_conntrack_netlink.c
index 5fdcae45e0bc..cbe42c987136 100644
--- a/net/netfilter/nf_conntrack_netlink.c
+++ b/net/netfilter/nf_conntrack_netlink.c
@@ -3171,23 +3171,27 @@ ctnetlink_expect_event(unsigned int events, const struct nf_exp_event *item)
 	return 0;
 }
 #endif
-static int ctnetlink_exp_done(struct netlink_callback *cb)
+
+static unsigned long ctnetlink_exp_id(const struct nf_conntrack_expect *exp)
 {
-	if (cb->args[1])
-		nf_ct_expect_put((struct nf_conntrack_expect *)cb->args[1]);
-	return 0;
+	unsigned long id = (unsigned long)exp;
+
+	id += nf_ct_get_id(exp->master);
+	id += exp->class;
+
+	return id ? id : 1;
 }
 
 static int
 ctnetlink_exp_dump_table(struct sk_buff *skb, struct netlink_callback *cb)
 {
+	unsigned long last_id = cb->args[1];
 	struct net *net = sock_net(skb->sk);
-	struct nf_conntrack_expect *exp, *last;
 	struct nfgenmsg *nfmsg = nlmsg_data(cb->nlh);
 	u_int8_t l3proto = nfmsg->nfgen_family;
+	struct nf_conntrack_expect *exp;
 
 	rcu_read_lock();
-	last = (struct nf_conntrack_expect *)cb->args[1];
 	for (; cb->args[0] < nf_ct_expect_hsize; cb->args[0]++) {
 restart:
 		hlist_for_each_entry_rcu(exp, &nf_ct_expect_hash[cb->args[0]],
@@ -3199,7 +3203,7 @@ ctnetlink_exp_dump_table(struct sk_buff *skb, struct netlink_callback *cb)
 				continue;
 
 			if (cb->args[1]) {
-				if (exp != last)
+				if (ctnetlink_exp_id(exp) != last_id)
 					continue;
 				cb->args[1] = 0;
 			}
@@ -3208,9 +3212,7 @@ ctnetlink_exp_dump_table(struct sk_buff *skb, struct netlink_callback *cb)
 						    cb->nlh->nlmsg_seq,
 						    IPCTNL_MSG_EXP_NEW,
 						    exp) < 0) {
-				if (!refcount_inc_not_zero(&exp->use))
-					continue;
-				cb->args[1] = (unsigned long)exp;
+				cb->args[1] = ctnetlink_exp_id(exp);
 				goto out;
 			}
 		}
@@ -3221,32 +3223,30 @@ ctnetlink_exp_dump_table(struct sk_buff *skb, struct netlink_callback *cb)
 	}
 out:
 	rcu_read_unlock();
-	if (last)
-		nf_ct_expect_put(last);
-
 	return skb->len;
 }
 
 static int
 ctnetlink_exp_ct_dump_table(struct sk_buff *skb, struct netlink_callback *cb)
 {
-	struct nf_conntrack_expect *exp, *last;
 	struct nfgenmsg *nfmsg = nlmsg_data(cb->nlh);
 	struct nf_conn *ct = cb->data;
 	struct nf_conn_help *help = nfct_help(ct);
 	u_int8_t l3proto = nfmsg->nfgen_family;
+	unsigned long last_id = cb->args[1];
+	struct nf_conntrack_expect *exp;
 
 	if (cb->args[0])
 		return 0;
 
 	rcu_read_lock();
-	last = (struct nf_conntrack_expect *)cb->args[1];
+
 restart:
 	hlist_for_each_entry_rcu(exp, &help->expectations, lnode) {
 		if (l3proto && exp->tuple.src.l3num != l3proto)
 			continue;
 		if (cb->args[1]) {
-			if (exp != last)
+			if (ctnetlink_exp_id(exp) != last_id)
 				continue;
 			cb->args[1] = 0;
 		}
@@ -3254,9 +3254,7 @@ ctnetlink_exp_ct_dump_table(struct sk_buff *skb, struct netlink_callback *cb)
 					    cb->nlh->nlmsg_seq,
 					    IPCTNL_MSG_EXP_NEW,
 					    exp) < 0) {
-			if (!refcount_inc_not_zero(&exp->use))
-				continue;
-			cb->args[1] = (unsigned long)exp;
+			cb->args[1] = ctnetlink_exp_id(exp);
 			goto out;
 		}
 	}
@@ -3267,9 +3265,6 @@ ctnetlink_exp_ct_dump_table(struct sk_buff *skb, struct netlink_callback *cb)
 	cb->args[0] = 1;
 out:
 	rcu_read_unlock();
-	if (last)
-		nf_ct_expect_put(last);
-
 	return skb->len;
 }
 
@@ -3288,7 +3283,6 @@ static int ctnetlink_dump_exp_ct(struct net *net, struct sock *ctnl,
 	struct nf_conntrack_zone zone;
 	struct netlink_dump_control c = {
 		.dump = ctnetlink_exp_ct_dump_table,
-		.done = ctnetlink_exp_done,
 	};
 
 	err = ctnetlink_parse_tuple(cda, &tuple, CTA_EXPECT_MASTER,
@@ -3338,7 +3332,6 @@ static int ctnetlink_get_expect(struct sk_buff *skb,
 		else {
 			struct netlink_dump_control c = {
 				.dump = ctnetlink_exp_dump_table,
-				.done = ctnetlink_exp_done,
 			};
 			return netlink_dump_start(info->sk, skb, info->nlh, &c);
 		}
-- 
2.49.1


^ permalink raw reply related	[flat|nested] 5+ messages in thread

* Re: [PATCH nf 1/2] netfilter: ctnetlink: fix refcount leak on table dump
  2025-08-01 15:25 ` [PATCH nf 1/2] netfilter: ctnetlink: fix refcount leak on table dump Florian Westphal
@ 2025-08-07 10:57   ` Pablo Neira Ayuso
  2025-08-07 11:29     ` Florian Westphal
  0 siblings, 1 reply; 5+ messages in thread
From: Pablo Neira Ayuso @ 2025-08-07 10:57 UTC (permalink / raw)
  To: Florian Westphal; +Cc: netfilter-devel

Hi Florian,

On Fri, Aug 01, 2025 at 05:25:08PM +0200, Florian Westphal wrote:
> There is a reference count leak in ctnetlink_dump_table():
>       if (res < 0) {
>                 nf_conntrack_get(&ct->ct_general); // HERE
>                 cb->args[1] = (unsigned long)ct;
>                 ...
                  goto out;

>
> While its very unlikely, its possible that ct == last.

out:
        ...
        if (last) {
                /* nf ct hash resize happened, now clear the leftover. */
                if ((struct nf_conn *)cb->args[1] == last) {
                        cb->args[1] = 0;
                }

                nf_ct_put(last);
        }

I think problem was introduced here:

  fefa92679dbe ("netfilter: ctnetlink: fix incorrect nf_ct_put during hash resize")

> If this happens, then the refcount of ct was already incremented.
> This 2nd increment is never undone.
> 
> This prevents the conntrack object from being released, which in turn
> keeps prevents cnet->count from dropping back to 0.
> 
> This will then block the netns dismantle (or conntrack rmmod) as
> nf_conntrack_cleanup_net_list() will wait forever.
> 
> This can be reproduced by running conntrack_resize.sh selftest in a loop.
> It takes ~20 minutes for me on a preemptible kernel on average before
> I see a runaway kworker spinning in nf_conntrack_cleanup_net_list.
> 
> One fix would to change this to:
>         if (res < 0) {
> 		if (ct != last)
> 	                nf_conntrack_get(&ct->ct_general);
> 
> But this reference counting isn't needed in the first place.
> We can just store a cookie value instead.

cookie is indeed safer approach.

IIRC, the concern is that cookie could result in providing a bogus
conntrack listing due to object recycling, which is more likely to
happen with SLAB_TYPESAFE_BY_RCU.

nf_ct_get_id() is adding using a random seed to generate the conntrack
id:

u32 nf_ct_get_id(const struct nf_conn *ct)
{
        static siphash_aligned_key_t ct_id_seed;
        unsigned long a, b, c, d;

        net_get_random_once(&ct_id_seed, sizeof(ct_id_seed));

Then, it should be very unlikely that such recycling that leads to
picking up from the wrong conntrack object because two conntrack
objects in the same memory spot will have different id.

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [PATCH nf 1/2] netfilter: ctnetlink: fix refcount leak on table dump
  2025-08-07 10:57   ` Pablo Neira Ayuso
@ 2025-08-07 11:29     ` Florian Westphal
  0 siblings, 0 replies; 5+ messages in thread
From: Florian Westphal @ 2025-08-07 11:29 UTC (permalink / raw)
  To: Pablo Neira Ayuso; +Cc: netfilter-devel

Pablo Neira Ayuso <pablo@netfilter.org> wrote:
> On Fri, Aug 01, 2025 at 05:25:08PM +0200, Florian Westphal wrote:
> > There is a reference count leak in ctnetlink_dump_table():
> >       if (res < 0) {
> >                 nf_conntrack_get(&ct->ct_general); // HERE
> >                 cb->args[1] = (unsigned long)ct;
> >                 ...
>                   goto out;
> 
> >
> > While its very unlikely, its possible that ct == last.
> 
> out:
>         ...
>         if (last) {
>                 /* nf ct hash resize happened, now clear the leftover. */
>                 if ((struct nf_conn *)cb->args[1] == last) {
>                         cb->args[1] = 0;
>                 }
> 
>                 nf_ct_put(last);
>         }
> 
> I think problem was introduced here:
> 
>   fefa92679dbe ("netfilter: ctnetlink: fix incorrect nf_ct_put during hash resize")

I think you'r right, the 'clear the leftover' is only correct if
we hit cb->args[0] >= htable_size condition.
OTOH reverting it gives the problem that commit fixed.

So I think that this code is just way too complicated,
i have no idea why this ever used reference counts, they do not buy
anything but headaches.

> cookie is indeed safer approach.
> 
> IIRC, the concern is that cookie could result in providing a bogus
> conntrack listing due to object recycling, which is more likely to
> happen with SLAB_TYPESAFE_BY_RCU.

Maybe, but even if this code would just store the address, the probability
of a recycle happening in such a way that a conntrack oject happens to be
stored, and then on next dump got re-added at exactly this slot is
almost 0.

And even if it would have been, the worst that can happen is that we
dump another entry a second time.  /proc code uses to walk the entire
table from start, counting dumped-entries and I'm not aware of 'dup'
complaints.

> Then, it should be very unlikely that such recycling that leads to
> picking up from the wrong conntrack object because two conntrack
> objects in the same memory spot will have different id.

Yes, it considers the tuples for the hash too, so its exteremly
unlikely for a recycle to result in same u32 hash value.

^ permalink raw reply	[flat|nested] 5+ messages in thread

end of thread, other threads:[~2025-08-07 11:29 UTC | newest]

Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-08-01 15:25 [PATCH nf 0/2] netfilter: ctnetlink: fix memory leak in ctnetlink dump Florian Westphal
2025-08-01 15:25 ` [PATCH nf 1/2] netfilter: ctnetlink: fix refcount leak on table dump Florian Westphal
2025-08-07 10:57   ` Pablo Neira Ayuso
2025-08-07 11:29     ` Florian Westphal
2025-08-01 15:25 ` [PATCH nf 2/2] netfilter: ctnetlink: remove refcounting in expectation dumpers Florian Westphal

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).