* [PATCH nf-next RFC 0/3] netfilter: x_tables: statistic nth match account GRO/GSO packets
@ 2025-11-27 12:33 Jesper Dangaard Brouer
2025-11-27 12:33 ` [PATCH nf-next RFC 1/3] xt_statistic: taking GRO/GSO into account for nth-match Jesper Dangaard Brouer
` (2 more replies)
0 siblings, 3 replies; 10+ messages in thread
From: Jesper Dangaard Brouer @ 2025-11-27 12:33 UTC (permalink / raw)
To: netfilter-devel, Pablo Neira Ayuso, Florian Westphal
Cc: Jesper Dangaard Brouer, netdev, phil, Eric Dumazet,
David S. Miller, Jakub Kicinski, Paolo Abeni, kernel-team,
mfleming, matt
In production we have a service that samples 1 in every 10000 packets. It
leverages the iptables statistic module to reduce the number of samples sent to
userspace via the NFLOG target.
This worked nicely until a mathematician noticed that we were under-sampling
GRO/GSO packets. Sampling one packet every nth packet is an example of a
Bernoulli trial, but the counter only advances once per skb. When a GRO packet
contains e.g. just 2 packets, the match should instead fire every 5000 skbs to
keep the per-packet rate; at 10 packets per GRO it is every 1000 skbs. This
under-sampled GRO/GSO traffic enough to skew the statistics in the backend
systems consuming the samples.
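To make the skew concrete, using the numbers above:

  every = 10000, 2 segments per skb:
    per-skb counting  -> 1 match / 10000 skbs = 1 / 20000 wire packets
    per-packet target -> 1 match /  5000 skbs = 1 / 10000 wire packets

  At 10 segments per skb the match must fire every 1000 skbs to keep
  the 1-in-10000 per-packet rate.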
The production workaround is simply to send all packets larger than the MTU to
userspace (via NFLOG) and let the userspace sampler daemon pick 1 in 10000
packets to log to the backend. Needless to say, this doesn't scale: in
production, with enough CPUs participating, it causes lock contention and in
general limits throughput to 20Gbit/s out of 25Gbit/s.
This patchset avoids having to send all GRO/GSO packets to userspace by letting
the statistic nth mode account for the number of GRO/GSO segments.
---
Jesper Dangaard Brouer (3):
xt_statistic: taking GRO/GSO into account for nth-match
xt_statistic: do nth-mode accounting per CPU
xt_statistic: DEBUG patch
include/uapi/linux/netfilter/xt_statistic.h | 1 +
net/netfilter/xt_statistic.c | 94 +++++++++++++++++++--
2 files changed, 89 insertions(+), 6 deletions(-)
--
* [PATCH nf-next RFC 1/3] xt_statistic: taking GRO/GSO into account for nth-match
2025-11-27 12:33 [PATCH nf-next RFC 0/3] netfilter: x_tables: statistic nth match account GRO/GSO packets Jesper Dangaard Brouer
@ 2025-11-27 12:33 ` Jesper Dangaard Brouer
2025-11-27 14:40 ` Florian Westphal
2025-11-27 12:34 ` [PATCH nf-next RFC 2/3] xt_statistic: do nth-mode accounting per CPU Jesper Dangaard Brouer
2025-11-27 12:34 ` [PATCH nf-next RFC 3/3] xt_statistic: DEBUG patch Jesper Dangaard Brouer
2 siblings, 1 reply; 10+ messages in thread
From: Jesper Dangaard Brouer @ 2025-11-27 12:33 UTC (permalink / raw)
To: netfilter-devel, Pablo Neira Ayuso, Florian Westphal
Cc: Jesper Dangaard Brouer, netdev, phil, Eric Dumazet,
David S. Miller, Jakub Kicinski, Paolo Abeni, kernel-team,
mfleming, matt
The iptables statistic nth mode is documented to match one packet every nth
packets. When packets get GRO/GSO aggregated before traversing the statistic
nth match, the aggregate is accounted as a single packet.
This patch makes the xt_statistic match take into account the number of
segments a GRO/GSO packet contains.
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
net/netfilter/xt_statistic.c | 42 +++++++++++++++++++++++++++++++++++++-----
1 file changed, 37 insertions(+), 5 deletions(-)
diff --git a/net/netfilter/xt_statistic.c b/net/netfilter/xt_statistic.c
index b26c1dcfc27b..d352c171f24d 100644
--- a/net/netfilter/xt_statistic.c
+++ b/net/netfilter/xt_statistic.c
@@ -25,12 +25,37 @@ MODULE_DESCRIPTION("Xtables: statistics-based matching (\"Nth\", random)");
MODULE_ALIAS("ipt_statistic");
MODULE_ALIAS("ip6t_statistic");
+static int gso_pkt_cnt(const struct sk_buff *skb)
+{
+ int pkt_cnt = 1;
+
+ if (!skb_is_gso(skb))
+ return pkt_cnt;
+
+ /* GSO packets contain many smaller packets; counting the aggregate
+ * as one skb skews the match frequency when it should be per packet.
+ */
+ if (skb_has_frag_list(skb)) {
+ struct sk_buff *iter;
+
+ skb_walk_frags(skb, iter)
+ pkt_cnt++;
+ } else {
+ pkt_cnt += skb_shinfo(skb)->nr_frags;
+ }
+
+ return pkt_cnt;
+}
+
static bool
statistic_mt(const struct sk_buff *skb, struct xt_action_param *par)
{
const struct xt_statistic_info *info = par->matchinfo;
+ struct xt_statistic_priv *priv = info->master;
bool ret = info->flags & XT_STATISTIC_INVERT;
- int nval, oval;
+ u32 nval, oval;
+ int pkt_cnt;
+ bool match;
switch (info->mode) {
case XT_STATISTIC_MODE_RANDOM:
@@ -38,11 +63,18 @@ statistic_mt(const struct sk_buff *skb, struct xt_action_param *par)
ret = !ret;
break;
case XT_STATISTIC_MODE_NTH:
+ pkt_cnt = gso_pkt_cnt(skb);
do {
- oval = atomic_read(&info->master->count);
- nval = (oval == info->u.nth.every) ? 0 : oval + 1;
- } while (atomic_cmpxchg(&info->master->count, oval, nval) != oval);
- if (nval == 0)
+ match = false;
+ oval = atomic_read(&priv->count);
+ nval = oval + pkt_cnt;
+ if (nval > info->u.nth.every) {
+ match = true;
+ nval = nval - info->u.nth.every - 1;
+ nval = min(nval, info->u.nth.every);
+ }
+ } while (atomic_cmpxchg(&priv->count, oval, nval) != oval);
+ if (match)
ret = !ret;
break;
}
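To illustrate the intended semantics of the loop above, here is a
stand-alone userspace restatement (illustration only, not part of the
patch; note the kernel-side 'every' yields a period of every+1 counts):

  #include <stdio.h>

  static unsigned int count; /* stands in for priv->count */

  /* Advance by the segment count; match once the interval is
   * overrun, carrying the overshoot into the next interval.
   */
  static int nth_update(unsigned int every, unsigned int pkt_cnt)
  {
          unsigned int nval = count + pkt_cnt;
          int match = 0;

          if (nval > every) {
                  match = 1;
                  nval -= every + 1;
                  if (nval > every)
                          nval = every;
          }
          count = nval;
          return match;
  }

  int main(void)
  {
          unsigned int i, matches = 0;

          /* every=10: expect one match per 11 segments, however the
           * segments are packed into skbs.
           */
          for (i = 0; i < 11; i++)        /* 11 single-segment skbs */
                  matches += nth_update(10, 1);
          printf("11 x 1-seg skbs: %u match\n", matches);  /* 1 */

          count = 0;                      /* one 11-segment GSO skb */
          printf("1 x 11-seg skb:  %d match\n", nth_update(10, 11));
          return 0;
  }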
* [PATCH nf-next RFC 2/3] xt_statistic: do nth-mode accounting per CPU
2025-11-27 12:33 [PATCH nf-next RFC 0/3] netfilter: x_tables: statistic nth match account GRO/GSO packets Jesper Dangaard Brouer
2025-11-27 12:33 ` [PATCH nf-next RFC 1/3] xt_statistic: taking GRO/GSO into account for nth-match Jesper Dangaard Brouer
@ 2025-11-27 12:34 ` Jesper Dangaard Brouer
2025-11-27 14:48 ` Florian Westphal
2025-11-27 12:34 ` [PATCH nf-next RFC 3/3] xt_statistic: DEBUG patch Jesper Dangaard Brouer
2 siblings, 1 reply; 10+ messages in thread
From: Jesper Dangaard Brouer @ 2025-11-27 12:34 UTC (permalink / raw)
To: netfilter-devel, Pablo Neira Ayuso, Florian Westphal
Cc: Jesper Dangaard Brouer, netdev, phil, Eric Dumazet,
David S. Miller, Jakub Kicinski, Paolo Abeni, kernel-team,
mfleming, matt
The atomic cmpxchg operations for the nth-mode matching are a scaling
concern on our production servers with 192 CPUs. The iptables rules
that sample every 10000th packet exist on the INPUT and OUTPUT chains.
Thus, these nth-counter rules are hit for every packet on the system,
with high concurrency.
Our use-case is statistical sampling, where we don't need an accurate
packet count across all CPUs in the system. Thus, we implement per-CPU
counters for the nth-mode match.
This replaces XT_STATISTIC_MODE_NTH, to avoid having to change
userspace tooling. We keep the atomic variant, moved under the new
XT_STATISTIC_MODE_NTH_ATOMIC mode, which userspace can easily be
extended to use if necessary.
Signed-off-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
include/uapi/linux/netfilter/xt_statistic.h | 1 +
net/netfilter/xt_statistic.c | 37 ++++++++++++++++++++++++++-
2 files changed, 37 insertions(+), 1 deletion(-)
diff --git a/include/uapi/linux/netfilter/xt_statistic.h b/include/uapi/linux/netfilter/xt_statistic.h
index bbce6fcb26e3..f399dd27ff61 100644
--- a/include/uapi/linux/netfilter/xt_statistic.h
+++ b/include/uapi/linux/netfilter/xt_statistic.h
@@ -7,6 +7,7 @@
enum xt_statistic_mode {
XT_STATISTIC_MODE_RANDOM,
XT_STATISTIC_MODE_NTH,
+ XT_STATISTIC_MODE_NTH_ATOMIC,
__XT_STATISTIC_MODE_MAX
};
#define XT_STATISTIC_MODE_MAX (__XT_STATISTIC_MODE_MAX - 1)
diff --git a/net/netfilter/xt_statistic.c b/net/netfilter/xt_statistic.c
index d352c171f24d..165bff0a76e5 100644
--- a/net/netfilter/xt_statistic.c
+++ b/net/netfilter/xt_statistic.c
@@ -17,6 +17,7 @@
struct xt_statistic_priv {
atomic_t count;
+ u32 __percpu *cnt_pcpu;
} ____cacheline_aligned_in_smp;
MODULE_LICENSE("GPL");
@@ -63,6 +64,21 @@ statistic_mt(const struct sk_buff *skb, struct xt_action_param *par)
ret = !ret;
break;
case XT_STATISTIC_MODE_NTH:
+ pkt_cnt = gso_pkt_cnt(skb);
+ do {
+ match = false;
+ oval = this_cpu_read(*priv->cnt_pcpu);
+ nval = oval + pkt_cnt;
+ if (nval > info->u.nth.every) {
+ match = true;
+ nval = nval - info->u.nth.every - 1;
+ nval = min(nval, info->u.nth.every);
+ }
+ } while (this_cpu_cmpxchg(*priv->cnt_pcpu, oval, nval) != oval);
+ if (match)
+ ret = !ret;
+ break;
+ case XT_STATISTIC_MODE_NTH_ATOMIC:
pkt_cnt = gso_pkt_cnt(skb);
do {
match = false;
@@ -85,6 +101,10 @@ statistic_mt(const struct sk_buff *skb, struct xt_action_param *par)
static int statistic_mt_check(const struct xt_mtchk_param *par)
{
struct xt_statistic_info *info = par->matchinfo;
+ struct xt_statistic_priv *priv;
+ u32 *this_cpu;
+ u32 nth_count;
+ int cpu;
if (info->mode > XT_STATISTIC_MODE_MAX ||
info->flags & ~XT_STATISTIC_MASK)
@@ -93,7 +113,21 @@ static int statistic_mt_check(const struct xt_mtchk_param *par)
info->master = kzalloc(sizeof(*info->master), GFP_KERNEL);
if (info->master == NULL)
return -ENOMEM;
- atomic_set(&info->master->count, info->u.nth.count);
+ priv = info->master;
+
+ priv->cnt_pcpu = alloc_percpu(u32);
+ if (!priv->cnt_pcpu) {
+ kfree(priv);
+ return -ENOMEM;
+ }
+
+ /* Userspace specifies start nth.count value */
+ nth_count = info->u.nth.count;
+ for_each_possible_cpu(cpu) {
+ this_cpu = per_cpu_ptr(priv->cnt_pcpu, cpu);
+ (*this_cpu) = nth_count;
+ }
+ atomic_set(&priv->count, nth_count);
return 0;
}
@@ -102,6 +136,7 @@ static void statistic_mt_destroy(const struct xt_mtdtor_param *par)
{
const struct xt_statistic_info *info = par->matchinfo;
+ free_percpu(info->master->cnt_pcpu);
kfree(info->master);
}
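For reference, the kind of rule that exercises this path looks like the
following (an illustration; the exact production rules are not shown in
this thread):

  iptables -A INPUT -m statistic --mode nth --every 10000 --packet 0 \
           -j NFLOG --nflog-group 1

With this patch the same rule transparently uses the per-CPU counters;
selecting the old atomic behaviour via XT_STATISTIC_MODE_NTH_ATOMIC
would need the userspace extension mentioned above.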
* [PATCH nf-next RFC 3/3] xt_statistic: DEBUG patch
2025-11-27 12:33 [PATCH nf-next RFC 0/3] netfilter: x_tables: statistic nth match account GRO/GSO packets Jesper Dangaard Brouer
2025-11-27 12:33 ` [PATCH nf-next RFC 1/3] xt_statistic: taking GRO/GSO into account for nth-match Jesper Dangaard Brouer
2025-11-27 12:34 ` [PATCH nf-next RFC 2/3] xt_statistic: do nth-mode accounting per CPU Jesper Dangaard Brouer
@ 2025-11-27 12:34 ` Jesper Dangaard Brouer
2 siblings, 0 replies; 10+ messages in thread
From: Jesper Dangaard Brouer @ 2025-11-27 12:34 UTC (permalink / raw)
To: netfilter-devel, Pablo Neira Ayuso, Florian Westphal
Cc: Jesper Dangaard Brouer, netdev, phil, Eric Dumazet,
David S. Miller, Jakub Kicinski, Paolo Abeni, kernel-team,
mfleming, matt
This is not for upstream consumption.
I used this while developing the patches to validate that both cases were
exercised.
Nacked-by: Jesper Dangaard Brouer <hawk@kernel.org>
---
net/netfilter/xt_statistic.c | 23 +++++++++++++++++++----
1 file changed, 19 insertions(+), 4 deletions(-)
diff --git a/net/netfilter/xt_statistic.c b/net/netfilter/xt_statistic.c
index 165bff0a76e5..016669a71f2a 100644
--- a/net/netfilter/xt_statistic.c
+++ b/net/netfilter/xt_statistic.c
@@ -4,6 +4,7 @@
*
* Based on ipt_random and ipt_nth by Fabrice MARIE <fabrice@netfilter.org>.
*/
+#define pr_fmt(fmt) KBUILD_MODNAME ": " fmt
#include <linux/init.h>
#include <linux/spinlock.h>
@@ -26,7 +27,12 @@ MODULE_DESCRIPTION("Xtables: statistics-based matching (\"Nth\", random)");
MODULE_ALIAS("ipt_statistic");
MODULE_ALIAS("ip6t_statistic");
-static int gso_pkt_cnt(const struct sk_buff *skb)
+enum gso_type {
+ SKB_GSO_FRAGS_LIST,
+ SKB_GSO_FRAGS_ARRAY
+};
+
+static int gso_pkt_cnt(const struct sk_buff *skb, enum gso_type *type)
{
int pkt_cnt = 1;
@@ -39,9 +45,11 @@ static int gso_pkt_cnt(const struct sk_buff *skb)
if (skb_has_frag_list(skb)) {
struct sk_buff *iter;
+ *type = SKB_GSO_FRAGS_LIST;
skb_walk_frags(skb, iter)
pkt_cnt++;
} else {
+ *type = SKB_GSO_FRAGS_ARRAY;
pkt_cnt += skb_shinfo(skb)->nr_frags;
}
@@ -54,9 +62,10 @@ statistic_mt(const struct sk_buff *skb, struct xt_action_param *par)
const struct xt_statistic_info *info = par->matchinfo;
struct xt_statistic_priv *priv = info->master;
bool ret = info->flags & XT_STATISTIC_INVERT;
+ enum gso_type gso_type;
+ bool match = false;
u32 nval, oval;
int pkt_cnt;
- bool match;
switch (info->mode) {
case XT_STATISTIC_MODE_RANDOM:
@@ -64,7 +73,7 @@ statistic_mt(const struct sk_buff *skb, struct xt_action_param *par)
ret = !ret;
break;
case XT_STATISTIC_MODE_NTH:
- pkt_cnt = gso_pkt_cnt(skb);
+ pkt_cnt = gso_pkt_cnt(skb, &gso_type);
do {
match = false;
oval = this_cpu_read(*priv->cnt_pcpu);
@@ -79,7 +88,7 @@ statistic_mt(const struct sk_buff *skb, struct xt_action_param *par)
ret = !ret;
break;
case XT_STATISTIC_MODE_NTH_ATOMIC:
- pkt_cnt = gso_pkt_cnt(skb);
+ pkt_cnt = gso_pkt_cnt(skb, &gso_type);
do {
match = false;
oval = atomic_read(&priv->count);
@@ -95,6 +104,10 @@ statistic_mt(const struct sk_buff *skb, struct xt_action_param *par)
break;
}
+ if (match)
+ pr_info("debug XXX: SKB is GRO type:%d contains %d packets\n",
+ gso_type, pkt_cnt);
+
return ret;
}
@@ -154,11 +167,13 @@ static struct xt_match xt_statistic_mt_reg __read_mostly = {
static int __init statistic_mt_init(void)
{
+ pr_info("module init\n");
return xt_register_match(&xt_statistic_mt_reg);
}
static void __exit statistic_mt_exit(void)
{
+ pr_info("module exit\n");
xt_unregister_match(&xt_statistic_mt_reg);
}
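For the record, with the debug patch applied a match on an aggregated skb
logs along these lines (illustrative output reconstructed from the
pr_fmt/pr_info format strings above; values will differ):

  xt_statistic: module init
  xt_statistic: debug XXX: SKB is GRO type:1 contains 45 packets

where type:1 is SKB_GSO_FRAGS_ARRAY per the enum above.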
* Re: [PATCH nf-next RFC 1/3] xt_statistic: taking GRO/GSO into account for nth-match
2025-11-27 12:33 ` [PATCH nf-next RFC 1/3] xt_statistic: taking GRO/GSO into account for nth-match Jesper Dangaard Brouer
@ 2025-11-27 14:40 ` Florian Westphal
2025-12-05 16:23 ` Jesper Dangaard Brouer
0 siblings, 1 reply; 10+ messages in thread
From: Florian Westphal @ 2025-11-27 14:40 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: netfilter-devel, Pablo Neira Ayuso, netdev, phil, Eric Dumazet,
David S. Miller, Jakub Kicinski, Paolo Abeni, kernel-team,
mfleming, matt
Jesper Dangaard Brouer <hawk@kernel.org> wrote:
> The iptables statistic nth mode is documented to match one packet every nth
> packets. When packets get GRO/GSO aggregated before traversing the statistic
> nth match, the aggregate is accounted as a single packet.
>
> This patch makes the xt_statistic match take into account the number of
> segments a GRO/GSO packet contains.
I doubt we can do this upstream. Two reasons that come to mind, in no
particular order:
1. It introduces asymmetry. All matches use "skb == packet" design.
Examples that come to mind: xt_limit, xt_length.
2. This adds a compat issue with nftables:
iptables-translate -A INPUT -m statistic --mode nth --packet 0 --every 10
nft 'add rule ip filter INPUT numgen inc mod 10 0 counter'
'numgen' increments a counter for every skb, i.e. reg := i++;.
But after this patch -m statistic doesn't work this way anymore,
and the two rules no longer do the same thing.
But even if we'd ignore this or add a flag to control behavior, I don't
see how this could be implemented in nft.
And last but not least, I'm not sure the premise is correct.
Yes, when you think of 'packet sampling', then we don't 'match'
often enough for the gro/gso case.
However, when doing '-m statistic ... -j LOG' (or whatever), then the
entire GSO superpacket is logged, i.e. several 'packets' got matched
at once.
So the existing algorithm works correctly even when considering
aggregation, because on average the correct number of segments gets
matched (logged).
With this proposed new algo, we can now match 100% of skbs / aggregated
segments, even for something like '--every 10'. And that seems fishy to
me.
As far as I understood, it's only 'more correct' in your case because the
logging backend picks one individual segment out from the NFLOG'd
superpacket.
But if it would NOT do that, then you now sample (almost) all segments
seen on wire. Did I misunderstand something here?
* Re: [PATCH nf-next RFC 2/3] xt_statistic: do nth-mode accounting per CPU
2025-11-27 12:34 ` [PATCH nf-next RFC 2/3] xt_statistic: do nth-mode accounting per CPU Jesper Dangaard Brouer
@ 2025-11-27 14:48 ` Florian Westphal
2025-12-08 14:46 ` Florian Westphal
0 siblings, 1 reply; 10+ messages in thread
From: Florian Westphal @ 2025-11-27 14:48 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: netfilter-devel, Pablo Neira Ayuso, netdev, phil, Eric Dumazet,
David S. Miller, Jakub Kicinski, Paolo Abeni, kernel-team,
mfleming, matt
Jesper Dangaard Brouer <hawk@kernel.org> wrote:
> The atomic cmpxchg operations for the nth-mode matching are a scaling
> concern on our production servers with 192 CPUs. The iptables rules
> that sample every 10000th packet exist on the INPUT and OUTPUT chains.
> Thus, these nth-counter rules are hit for every packet on the system,
> with high concurrency.
> Our use-case is statistical sampling, where we don't need an accurate
> packet count across all CPUs in the system. Thus, we implement per-CPU
> counters for the nth-mode match.
>
> This replaces XT_STATISTIC_MODE_NTH, to avoid having to change
> userspace tooling. We keep the atomic variant, moved under the new
> XT_STATISTIC_MODE_NTH_ATOMIC mode, which userspace can easily be
> extended to use if necessary.
This patch seems acceptable to me (aside from the deliberate userspace
breakage).
But I do wonder why you can't move to random sampling instead; it
doesn't suffer from this problem (i.e. -m statistic --mode random).
I think a non-RFC version would have to add a new mode, plus the
userspace change, and an explanation of why -m random can't be used,
esp. because the changelog above implies to me that -m random would work
for this :-)
* Re: [PATCH nf-next RFC 1/3] xt_statistic: taking GRO/GSO into account for nth-match
2025-11-27 14:40 ` Florian Westphal
@ 2025-12-05 16:23 ` Jesper Dangaard Brouer
2025-12-08 10:37 ` Nick Wood
0 siblings, 1 reply; 10+ messages in thread
From: Jesper Dangaard Brouer @ 2025-12-05 16:23 UTC (permalink / raw)
To: Florian Westphal
Cc: netfilter-devel, Pablo Neira Ayuso, netdev, phil, Eric Dumazet,
David S. Miller, Jakub Kicinski, Paolo Abeni, kernel-team,
mfleming, matt, nwood, aforster
On 27/11/2025 15.40, Florian Westphal wrote:
> Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>> The iptables statistic nth mode is documented to match one packet every nth
>> packets. When packets get GRO/GSO aggregated before traversing the statistic
>> nth match, the aggregate is accounted as a single packet.
>>
>> This patch makes the xt_statistic match take into account the number of
>> segments a GRO/GSO packet contains.
>
> I doubt we can do this upstream. Two reasons that come to mind, in no
> particular order:
>
> 1. It introduces asymmetry. All matches use "skb == packet" design.
> Examples that come to mind: xt_limit, xt_length.
> 2. This adds a compat issue with nftables:
> iptables-translate -A INPUT -m statistic --mode nth --packet 0 --every 10
> nft 'add rule ip filter INPUT numgen inc mod 10 0 counter'
>
> 'numgen' increments a counter for every skb, i.e. reg := i++;.
> But after this patch -m statistic doesn't work this way anymore,
> and the two rules no longer do the same thing.
At some point we do want to move from iptables to nftables, so we do
need to come up with a solution for nft.
I wonder what nft building blocks we would need for something like this,
e.g. based on the number of GRO/GSO segments. We could have an inc_segs
operation and a greater-than-and-reset (gt_limit / gt_mod) counter-limit
modifier that wraps increments. Like the C code, it would match when the
counter overruns 'nth.every' and then reset it: the existing iptables
code resets to zero, and the new code takes into account how much we
overshot.
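Purely to make that concrete, a hypothetical rule could look like this
(invented syntax; neither building block exists in nft today):

  nft add rule ip filter input inc_segs gt_mod 10000 log group 1

i.e. advance a counter by the skb's segment count, match when it wraps
past the modulus, and carry the overshoot into the next interval.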
> But even if we'd ignore this or add a flag to control behavior, I don't
> see how this could be implemented in nft.
>
> And last but not least, I'm not sure the premise is correct.
> Yes, when you think of 'packet sampling', then we don't 'match'
> often enough for the gro/gso case.
>
> However, when doing '-m statistic ... -j LOG' (or whatever), then the
> entire GSO superpacket is logged, i.e. several 'packets' got matched
> at once.
>
> So the existing algorithm works correctly even when considering
> aggregation, because on average the correct number of segments gets
> matched (logged).
>
No, this is where the "on average" assumption fails. Packets with many
segments get statistically under-represented. As far as I'm told,
people noticed that the bandwidth estimate based on sampled packets was
too far off (compared to other correlating stats like interface stats).
I'm told (Cc Nick Wood) that the statistically correct way with the
--probability setting would be a Bernoulli trial[1] or a "binomial
experiment". This is how our userspace code (which gets all GRO/GSO
packets) does statistical sampling based on the number of segments (to
get the correct per-packet probability):
The Rust code does this:

  let probability = 1.0 / sample_interval as f64;
  let adjusted_probability = nr_packets * probability
      * (1.0 - probability).powf(nr_packets - 1.0);
[1] https://en.wikipedia.org/wiki/Bernoulli_trial
We could (also) update the kernel code for --probability to do this, but
as you can see the Rust code uses floating point calculations.
It was easier to change the nth code (and easier for me to reason about)
than converting the formula to an integer approximation (given we don't
have floating point in the kernel).
> With this proposed new algo, we can now match 100% of skbs / aggregated
> segments, even for something like '--every 10'. And that seems fishy to
> me.
>
> As far as I understood, it's only 'more correct' in your case because the
> logging backend picks one individual segment out from the NFLOG'd
> superpacket.
>
> But if it would NOT do that, then you now sample (almost) all segments
> seen on wire. Did I misunderstand something here?
See above explanation about Bernoulli trial[1].
--Jesper
* Re: [PATCH nf-next RFC 1/3] xt_statistic: taking GRO/GSO into account for nth-match
2025-12-05 16:23 ` Jesper Dangaard Brouer
@ 2025-12-08 10:37 ` Nick Wood
2025-12-08 14:18 ` Florian Westphal
0 siblings, 1 reply; 10+ messages in thread
From: Nick Wood @ 2025-12-08 10:37 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: Florian Westphal, netfilter-devel, Pablo Neira Ayuso, netdev,
phil, Eric Dumazet, David S. Miller, Jakub Kicinski, Paolo Abeni,
kernel-team, mfleming, matt, aforster
On Fri, 5 Dec 2025 at 16:23, Jesper Dangaard Brouer <hawk@kernel.org> wrote:
> > So the existing algorithm works correctly even when considering
> > aggregation, because on average the correct number of segments gets
> > matched (logged).
> >
>
> No, this is where the "on average" assumption fails. Packets with many
> segments get statistically under-represented. As far as I'm told,
> people noticed that the bandwidth estimate based on sampled packets was
> too far off (compared to other correlating stats like interface stats).
On-average is technically correct here; the issue is more subtle than
simple undercounting. To see it, consider a 50pps flow (without
GRO/GSO) with a sampling rate of 1/100. With 1-in-k sampling we take a
sample every other second, and to estimate the total flow we multiply by
the sampling interval, so we get a sawtooth-looking wave that alternates
between 100pps and 0 every second. This is not 'correct'; it's an
estimate, as we'd expect, and the absolute error here alternates between
+/-50 at the same frequency.
Now let's encapsulate these 50pps in a single GSO packet every second;
1-in-k sampling is now on an all-or-nothing basis. For 99 seconds out
of every hundred we sample no packets - this is the under-sampling
you're referencing. But for 1 second in 100 we sample the whole GRO/GSO
packet and capture 50 samples, so our pps estimate for that instant
spikes to 5,000pps - for an instant the absolute error jumps by two
orders of magnitude (remember, the 'true' rate is 50pps).
If you average the single spike into the 99 seconds of undersampling,
the figures do 'average out', but for very short flows this
amplification of error makes it impossible to accurately estimate the
true packet rate.
>
> I'm told (Cc Nick Wood) the statistically correct way with --probability
> setting would be doing a Bernoulli trial[1] or a "binomial experiment".
> This is how our userspace code (that gets all GRO/GSO packets) does
> statistical sampling based on the number of segments (to get the correct
> statistical probability):
>
> The Rust code does this:
> let probability = 1.0 / sample_interval as f64;
> let adjusted_probability = nr_packets * probability * (1.0 -
> probability).powf(nr_packets - 1.0);
>
> [1] https://en.wikipedia.org/wiki/Bernoulli_trial
>
> We could (also) update the kernel code for --probability to do this, but
> as you can see the Rust code uses floating point calculations.
>
> It was easier to change the nth code (and easier for me to reason about)
> than dealing with converting the the formula to use an integer
> approximation (given we don't have floating point calc in kernel).
with s = integer sample rate (i.e. s=100 if we're sampling 1/100)
and n = nr_packets:

  sample_threshold = [ 2**32 // s // s ] * [ n * (s - (n-1)) ];

  if get_random_u32() < sample_threshold {
      sample_single_subpacket
  }*

is an equivalent integer calculation for a Bernoulli trial treatment.
It undersamples by about 1 in 100k for s=100 and n=50, which is good
enough for most purposes. The error is smaller for smaller n. For
smaller s it may warrant an additional (cubic) term in n.
*I'm a mathematician, not a kernel developer
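Transcribing the above into kernel-style C (an untested sketch of my
formula, not code from this series; uses get_random_u32() from
<linux/random.h> and div_u64() from <linux/math64.h>, and assumes
1 <= n <= s):

  /* Sample a GSO aggregate with probability ~ n*p*(1-p)^(n-1),
   * p = 1/s, via the linear approximation above.
   */
  static bool gso_bernoulli_sample(u32 s, u32 n)
  {
          u64 base = div_u64(div_u64(1ULL << 32, s), s); /* 2^32 / s^2 */

          if (n == 0 || n > s)    /* outside the approximation's range */
                  return n > s;   /* aggregate spans a whole interval */

          return get_random_u32() < base * n * (s - (n - 1));
  }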
> > With this proposed new algo, we can now match 100% of skbs / aggregated
> > segments, even for something like '--every 10'. And that seems fishy to
> > me.
> >
> > As far as I understood, it's only 'more correct' in your case because the
> > logging backend picks one individual segment out from the NFLOG'd
> > superpacket.
> >
> > But if it would NOT do that, then you now sample (almost) all segments
> > seen on wire. Did I misunderstand something here?
>
> See above explanation about Bernoulli trial[1].
>
> --Jesper
>
* Re: [PATCH nf-next RFC 1/3] xt_statistic: taking GRO/GSO into account for nth-match
2025-12-08 10:37 ` Nick Wood
@ 2025-12-08 14:18 ` Florian Westphal
0 siblings, 0 replies; 10+ messages in thread
From: Florian Westphal @ 2025-12-08 14:18 UTC (permalink / raw)
To: Nick Wood
Cc: Jesper Dangaard Brouer, netfilter-devel, Pablo Neira Ayuso,
netdev, phil, Eric Dumazet, David S. Miller, Jakub Kicinski,
Paolo Abeni, kernel-team, mfleming, matt, aforster
Nick Wood <nwood@cloudflare.com> wrote:
> > It was easier to change the nth code (and easier for me to reason about)
> > than converting the formula to an integer approximation (given we don't
> > have floating point in the kernel).
>
> with s = integer sample rate (i.e. s=100 if we're sampling 1/100)
> and n = nr_packets:
>
>   sample_threshold = [ 2**32 // s // s ] * [ n * (s - (n-1)) ];
>
>   if get_random_u32() < sample_threshold {
>       sample_single_subpacket
>   }*
Thanks for clarifying. But we can't do that within the limitations
imposed by the netfilter framework.
We can only flag the entire skb/aggregate as either matching or not
matching. We can't signal a 'match this subpacket'.
While we could split the aggregate within the conditional, we can't
(re)inject the split packets back into the processing pipeline.
If there is a way, I can't think of it. At least not without major
rework of the entire netfilter stack :-|
* Re: [PATCH nf-next RFC 2/3] xt_statistic: do nth-mode accounting per CPU
2025-11-27 14:48 ` Florian Westphal
@ 2025-12-08 14:46 ` Florian Westphal
0 siblings, 0 replies; 10+ messages in thread
From: Florian Westphal @ 2025-12-08 14:46 UTC (permalink / raw)
To: Jesper Dangaard Brouer
Cc: netfilter-devel, Pablo Neira Ayuso, netdev, phil, Eric Dumazet,
David S. Miller, Jakub Kicinski, Paolo Abeni, kernel-team,
mfleming, matt
Florian Westphal <fw@strlen.de> wrote:
> Jesper Dangaard Brouer <hawk@kernel.org> wrote:
> > The atomic cmpxchg operations for the nth-mode matching are a scaling
> > concern on our production servers with 192 CPUs. The iptables rules
> > that sample every 10000th packet exist on the INPUT and OUTPUT chains.
> > Thus, these nth-counter rules are hit for every packet on the system,
> > with high concurrency.
>
> > Our use-case is statistical sampling, where we don't need an accurate
> > packet count across all CPUs in the system. Thus, we implement per-CPU
> > counters for the nth-mode match.
> >
> > This replaces XT_STATISTIC_MODE_NTH, to avoid having to change
> > userspace tooling. We keep the atomic variant, moved under the new
> > XT_STATISTIC_MODE_NTH_ATOMIC mode, which userspace can easily be
> > extended to use if necessary.
>
> This patch seems acceptable to me (aside from the deliberate userspace
> breakage).
>
> But I do wonder why you can't move to random sampling instead; it
> doesn't suffer from this problem (i.e. -m statistic --mode random).
Addendum, I did not think of this before. Another alternative is to
prefix '-m statistic' with '-m cpu' so only one core does the sampling.
If this should be done on all CPUs, the xtables framework would require
n rules for n CPUs, which scales poorly.
In nftables one could use a verdict map with 'meta cpu' as the key;
then one would be able to fan out based on the processing CPU.
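Untested sketches of both variants (chain names invented, and assuming
one sampling chain per CPU in the nft case):

  iptables -A INPUT -m cpu --cpu 0 \
           -m statistic --mode nth --every 10000 --packet 0 \
           -j NFLOG --nflog-group 1

  nft add rule ip filter input meta cpu vmap \
          { 0 : jump sample_cpu0, 1 : jump sample_cpu1 }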