All of lore.kernel.org
 help / color / mirror / Atom feed
From: Pablo Neira Ayuso <pablo@netfilter.org>
To: Fernando Fernandez Mancera <fmancera@suse.de>
Cc: netfilter-devel@vger.kernel.org, coreteam@netfilter.org, fw@strlen.de
Subject: Re: [PATCH nf v2] netfilter: nft_connlimit: fix duplicated tracking of a connection
Date: Fri, 31 Oct 2025 00:04:52 +0100	[thread overview]
Message-ID: <aQPvFHEYZYacJQcC@calendula> (raw)
In-Reply-To: <c58ae9ad-46f3-4853-bc61-ac725c860160@suse.de>

Hi Fernando,

On Thu, Oct 30, 2025 at 09:12:32AM +0100, Fernando Fernandez Mancera wrote:
> 
> On 10/29/25 9:33 PM, Pablo Neira Ayuso wrote:
> > Hi Fernando,
> > 
> > On Wed, Oct 29, 2025 at 02:23:18PM +0100, Fernando Fernandez Mancera wrote:
> > > Connlimit expression can be used for all kind of packets and not only
> > > for packets with connection state new. See this ruleset as example:
> > > 
> > > table ip filter {
> > >          chain input {
> > >                  type filter hook input priority filter; policy accept;
> > >                  tcp dport 22 ct count over 4 counter
> > >          }
> > > }
> > > 
> > > Currently, if the connection count goes over the limit the counter will
> > > count the packets. When a connection is closed, the connection count
> > > won't decrement as it should because it is only updated for new
> > > connections due to an optimization on __nf_conncount_add() that prevents
> > > updating the list if the connection is duplicated.
> > > 
> > > In addition, since commit d265929930e2 ("netfilter: nf_conncount: reduce
> > > unnecessary GC") there can be situations where a duplicated connection
> > > is added to the list. This is caused by two packets from the same
> > > connection being processed during the same jiffy.
> > > 
> > > To solve these problems, check whether this is a new connection and only
> > > add the connection to the list if that is the case during connlimit
> > > evaluation. Otherwise run a GC to update the count. This doesn't yield a
> > > performance degradation.
> > 
> > This is true is list is small, e.g. ct count over 4.
> > 
> > But user could much larger value, then every packet could trigger a
> > long list walk, because gc is bound to CONNCOUNT_GC_MAX_NODES which is
> > the maximum number of nodes that is _collected_.
> > 
> > And maybe the user selects:
> > 
> >    ct count over N mark set 0x1
> > 
> > where N is high, the gc walk can be long.
> > 
> > TBH, I added this expression mainly focusing on being used with
> > dynset, I allowed it too in rules for parity. In the dynset case,
> > there is a front-end datastructure (set) and this conncount list
> > is per element. Maybe there high ct count is also possible.
> > 
> > With this patch, gc is called more frequently, not only for each new
> > packet.
> > 
> 
> How is it called more frequently? Before, it was calling nf_conncount_add()
> for every packet which is indeed performing a GC inside, both
> nf_conncount_add() and nf_conncount_gc_list() return immediately if a GC was
> performed during the same jiffy.

Before this patch, without 'ct state new' in front, this was just
adding duplicates, then count is wrong, ie. this is broken.

Assuming 'ct state new' in place, then gc is only called when new
entries for the initial packet of a connection (still broken because
duplicates due to retransmissions are possible).

My proposal:

- Follow a more conservative approach: Perform this gc cycle for
  confirmed ct only when 'ct count over' evaluates true or 'ct count'
  evaluates false.

- For the confirmed ct case, stop gc inmediately when one slot is
  released to short-circuit the walk.

... but still long walk could possible.

- More difficult: For the confirmed ct case, add a limit on the
  maximum entries that are walked over in the gc iteration for each
  packet. If no connections are found to be released, annotate the
  entry at which this stops and a jiffy timestamp, to resume from where
  the gc walk has stopped in the previous gc. The timestamp could be
  used to decide whether to make a full gc walk or not. I mean, explore
  a bit more advance gc logic now that this will be alled for every
  packet.

> I tried with a limit of 2000 connections and didn't notice any performance
> change. I could try with CONNCOUNT_GC_MAX_NODES too.
> 
> > > Fixed in xt_connlimit too.
> > > 
> > > Fixes: d265929930e2 ("netfilter: nf_conncount: reduce unnecessary GC")
> > > Fixes: 976afca1ceba ("netfilter: nf_conncount: Early exit in nf_conncount_lookup() and cleanup")
> > > Closes: https://lore.kernel.org/netfilter/trinity-85c72a88-d762-46c3-be97-36f10e5d9796-1761173693813@3c-app-mailcom-bs12/
> > > Signed-off-by: Fernando Fernandez Mancera <fmancera@suse.de>
> > > ---
> > > v2: use nf_ct_is_confirmed(), add comment about why the gc call is
> > > needed and fix this in xt_connlimit too.
> > > ---
> > >   net/netfilter/nft_connlimit.c | 17 ++++++++++++++---
> > >   net/netfilter/xt_connlimit.c  | 14 ++++++++++++--
> > >   2 files changed, 26 insertions(+), 5 deletions(-)
> > > 
> > > diff --git a/net/netfilter/nft_connlimit.c b/net/netfilter/nft_connlimit.c
> > > index fc35a11cdca2..dedea1681e73 100644
> > > --- a/net/netfilter/nft_connlimit.c
> > > +++ b/net/netfilter/nft_connlimit.c
> > > @@ -43,9 +43,20 @@ static inline void nft_connlimit_do_eval(struct nft_connlimit *priv,
> > >   		return;
> > >   	}
> > > -	if (nf_conncount_add(nft_net(pkt), priv->list, tuple_ptr, zone)) {
> > > -		regs->verdict.code = NF_DROP;
> > > -		return;
> > > +	if (!ct || !nf_ct_is_confirmed(ct)) {
> > > +		if (nf_conncount_add(nft_net(pkt), priv->list, tuple_ptr, zone)) {
> > > +			regs->verdict.code = NF_DROP;
> > > +			return;
> > > +		}
> > > +	} else {
> > > +		/* Call gc to update the list count if any connection has been
> > > +		 * closed already. This is useful to softlimit connections
> > > +		 * like limiting bandwidth based on a number of open
> > > +		 * connections.
> > > +		 */
> > > +		local_bh_disable();
> > > +		nf_conncount_gc_list(nft_net(pkt), priv->list);
> > > +		local_bh_enable();
> > >   	}
> > >   	count = READ_ONCE(priv->list->count);
> > > diff --git a/net/netfilter/xt_connlimit.c b/net/netfilter/xt_connlimit.c
> > > index 0189f8b6b0bd..5c90e1929d86 100644
> > > --- a/net/netfilter/xt_connlimit.c
> > > +++ b/net/netfilter/xt_connlimit.c
> > > @@ -69,8 +69,18 @@ connlimit_mt(const struct sk_buff *skb, struct xt_action_param *par)
> > >   		key[1] = zone->id;
> > >   	}
> > > -	connections = nf_conncount_count(net, info->data, key, tuple_ptr,
> > > -					 zone);
> > > +	if (!ct || !nf_ct_is_confirmed(ct)) {
> > > +		connections = nf_conncount_count(net, info->data, key, tuple_ptr,
> > > +						 zone);
> > > +	} else {
> > > +		/* Call nf_conncount_count() with NULL tuple and zone to update
> > > +		 * the list if any connection has been closed already. This is
> > > +		 * useful to softlimit connections like limiting bandwidth based
> > > +		 * on a number of open connections.
> > > +		 */
> > > +		connections = nf_conncount_count(net, info->data, key, NULL, NULL);
> > > +	}
> > 
> > Maybe remove this from xt_connlimit?
> > 
> 
> Dropping this would leave xt_connlimit broken for the use-cases I discussed
> with Florian on the v1.

I don't want to drop this, I am wondering if walking a very long
linear list for each packet can be an issue for every packet.

For xt_connlimit, there is a rbtree of lists, ie. the list is splitted
accross different rbtree nodes and the list should be smaller.

But for nft_connlimit, this is a raw linear list exposed to packet
path, and users can set an arbitrarily large count number.

> > > +
> > >   	if (connections == 0)
> > >   		/* kmalloc failed, drop it entirely */
> > >   		goto hotdrop;
> > > -- 
> > > 2.51.0
> > > 
> 

  reply	other threads:[~2025-10-30 23:04 UTC|newest]

Thread overview: 6+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-10-29 13:23 [PATCH nf v2] netfilter: nft_connlimit: fix duplicated tracking of a connection Fernando Fernandez Mancera
2025-10-29 20:33 ` Pablo Neira Ayuso
2025-10-30  8:12   ` Fernando Fernandez Mancera
2025-10-30 23:04     ` Pablo Neira Ayuso [this message]
2025-10-30 23:40       ` Florian Westphal
2025-10-31  8:55       ` Fernando Fernandez Mancera

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aQPvFHEYZYacJQcC@calendula \
    --to=pablo@netfilter.org \
    --cc=coreteam@netfilter.org \
    --cc=fmancera@suse.de \
    --cc=fw@strlen.de \
    --cc=netfilter-devel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.