public inbox for netfilter-devel@vger.kernel.org
 help / color / mirror / Atom feed
From: Stefano Brivio <sbrivio@redhat.com>
To: Florian Westphal <fw@strlen.de>
Cc: <netfilter-devel@vger.kernel.org>
Subject: Re: [PATCH nf 1/2] netfilter: nft_set_pipapo_avx2: don't return non-matching entry on expiry
Date: Sat, 21 Mar 2026 15:25:07 +0100 (CET)	[thread overview]
Message-ID: <20260321152506.037f68c0@elisabeth> (raw)
In-Reply-To: <20260318132417.31661-2-fw@strlen.de>

On Wed, 18 Mar 2026 14:24:13 +0100
Florian Westphal <fw@strlen.de> wrote:

> New test case fails unexpectedly when avx2 matching functions are used.
> 
> The test first loads a ranomly generated pipapo set
> with 'ipv4 . port' key, i.e.  nft -f foo.
> 
> This works.  Then, it reloads the set after a flush:
> (echo flush set t s; cat foo) | nft -f -
> 
> This is expected to work, because its the same set after all and it was
> already loaded succesfully once.
> 
> But with avx2, this fails: nft reports a clashing element.
> 
> The reported clash is of following form:
> 
>     We successfully re-inserted
>       a . b
>       c . d
> 
> Then we try to insert a . d
> 
> avx2 finds the already existing a . d, which (due to 'flush set') is marked
> as invalid in the new generation.  It skips the element and moves to next.
> 
> Due to incorrect masking, the skip-step finds the next matching
> element *only considering the first field*,
> 
> i.e. we return the already reinserted "a . b", even though the
> last field is different and the entry should not have been matched.
> 
> No such error is reported for the generic c implementation (no avx2) or when
> the last field has to use the 'nft_pipapo_avx2_lookup_slow' fallback.
> 
> Bisection points to
> 7711f4bb4b36 ("netfilter: nft_set_pipapo: fix range overlap detection")
> but that fix merely uncovers this bug.
> 
> Before this commit, the wrong element is returned, but erronously
> reported as a full, identical duplicate.
> 
> The root-cause is too early return in the avx2 match functions.
> When we process the last field, we should continue to process data
> until the entire input size has been consumed to make sure no stale
> bits remain in the map.

Oops, thanks for fixing this. It must have been a lot of "fun" to debug
it.

Reviewed-by: Stefano Brivio <sbrivio@redhat.com>

An explanation below.

> An alternative fix is to change the avx2 lookup functions to also
> return the last 'i_ul' (map store location) and then replace:
> 
> 	if (unlikely(__nft_set_elem_expired(&e->ext, tstamp) ||
>                      !nft_set_elem_active(&e->ext, genmask))) {
> 		ret = pipapo_refill(res, f->bsize, f->rules,
> 				    fill, f->mt, last);
> 		 goto next_match;
>  	}
> 
> With:
>             if (unlikely(__nft_set_elem_expired(&e->ext, tstamp) ||
>                          !nft_set_elem_active(&e->ext, genmask))) {
> 		if (slow)
>                       ret = pipapo_refill(res, f->bsize, f->rules,
>                                           fill, f->mt, last);
> 		else
> 		      ret = nft_pipapo_avx2_refill(i_ul, &res[i_ul], fill, f->mt, last);
>                                           fill, f->mt, last);
>                  goto next_match;

By the way, there are some mixed spaces and tabs and one duplicate
line in these snippets, which make them a bit hard to understand, they
should be:

	if (unlikely(__nft_set_elem_expired(&e->ext, tstamp) ||
		     !nft_set_elem_active(&e->ext, genmask))) {
		ret = pipapo_refill(res, f->bsize, f->rules,
				    fill, f->mt, last);
		goto next_match;
	}

and:

	if (unlikely(__nft_set_elem_expired(&e->ext, tstamp) ||
		     !nft_set_elem_active(&e->ext, genmask))) {
		if (slow) {
			ret = pipapo_refill(res, f->bsize, f->rules,
					    fill, f->mt, last);
		} else {
			ret = nft_pipapo_avx2_refill(i_ul, &res[i_ul],
						     fill, f->mt, last);
		}

		goto next_match;
	}

> ... so that irrelvant map parts aren't considered. However, the diffstat
> is significantly larger than this one.

Right, yes, I don't think it's worth it. The extra branch might
actually make things slower.

> Signed-off-by: Florian Westphal <fw@strlen.de>
> ---
>  net/netfilter/nft_set_pipapo_avx2.c | 20 ++++++++++----------
>  1 file changed, 10 insertions(+), 10 deletions(-)
> 
> diff --git a/net/netfilter/nft_set_pipapo_avx2.c b/net/netfilter/nft_set_pipapo_avx2.c
> index 7ff90325c97f..6395982e4d95 100644
> --- a/net/netfilter/nft_set_pipapo_avx2.c
> +++ b/net/netfilter/nft_set_pipapo_avx2.c
> @@ -242,7 +242,7 @@ static int nft_pipapo_avx2_lookup_4b_2(unsigned long *map, unsigned long *fill,
>  
>  		b = nft_pipapo_avx2_refill(i_ul, &map[i_ul], fill, f->mt, last);
>  		if (last)
> -			return b;

This came from this kind of optimisation:

  https://pipapo.lameexcu.se/pipapo/tree/avx2.h?id=a724e8dbd67ce3d9bf5a24bd836dea4ad3a5516f#n59

		if (!_mm256_testz_si256(r4, r4)) {
			if (last) {
				_mm256_store_si256((__m256i *)(map + i), r4);
				return i;
			}
			ret = 0;
		}

which, however, is only applied if the register with the current
matching results is all zeroes, *and* I think it's incorrect anyway as
it would only work for a 32-byte sized bucket (single iteration).

I think it's a left-over from the previous stage of implementation
where I only had 32-byte sized buckets for simplicity.

Now, we could probably reintroduce this kind of implementation on the
lines of what you suggested but a long branch like that doesn't look
promising in terms of clock cycles.

-- 
Stefano


  reply	other threads:[~2026-03-21 14:25 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-03-18 13:24 [PATCH nf 0/2] netfilter: nft_set_pipapo_avx2: don't return non-matching entry Florian Westphal
2026-03-18 13:24 ` [PATCH nf 1/2] netfilter: nft_set_pipapo_avx2: don't return non-matching entry on expiry Florian Westphal
2026-03-21 14:25   ` Stefano Brivio [this message]
2026-03-18 13:24 ` [PATCH nf 2/2] selftests: netfilter: nft_concat_range.sh: add check for flush+reload bug Florian Westphal
2026-03-21 14:25   ` Stefano Brivio

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20260321152506.037f68c0@elisabeth \
    --to=sbrivio@redhat.com \
    --cc=fw@strlen.de \
    --cc=netfilter-devel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox