* [PATCH net-next 0/3] Replace xt_recseq with u64_stats.
@ 2025-02-16 12:51 Sebastian Andrzej Siewior
2025-02-16 12:51 ` [PATCH net-next 1/3] netfilter: Make xt_table::private RCU protected Sebastian Andrzej Siewior
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-16 12:51 UTC (permalink / raw)
To: netfilter-devel, coreteam, linux-rt-devel
Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Thomas Gleixner,
Sebastian Andrzej Siewior
The per-CPU xt_recseq is a custom netfilter seqcount. It provides
synchronisation for the replacement of the xt_table::private pointer and
ensures that the two counters in xt_counters are observed consistently
during an update on 32bit architectures. xt_recseq also supports recursion.
This construct is less than optimal on PREEMPT_RT because the lack of a
lock associated with the seqcount can lead to a deadlock if a high
priority reader interrupts a writer. xt_recseq also relies on locking with
BH-disable, which becomes problematic if the lock, currently part of
local_bh_disable() on PREEMPT_RT, gets removed.
This can be optimized independently of PREEMPT_RT:
- Use RCU for synchronisation. This means ipt_do_table() (and the other
  two table walkers) access xt_table::private within an RCU read-side
  section. xt_replace_table() replaces the pointer with
  rcu_assign_pointer() and uses synchronize_rcu() to wait until every
  reader has left its read-side section.
- Use u64_stats_t for the statistics. The advantage here is that
  u64_stats_sync, which uses a seqcount, is optimized away on 64bit
  architectures. The increment becomes a plain add and the read a plain
  load without a retry loop. On 32bit architectures the seqcount remains
  but its scope is smaller (see the sketch below).
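To illustrate the second point, a minimal sketch using the helpers as they
appear in patch #3 (variable declarations omitted); on 64bit the
update_begin()/end() pair and the fetch/retry loop compile down to plain
loads and stores, on 32bit they fall back to the usual seqcount scheme:

	/* writer side, per-CPU counter, preemption/BH already disabled */
	u64_stats_update_begin(this_cpu_ptr(&xt_syncp));
	u64_stats_add(&counter->pcnt, packets);
	u64_stats_add(&counter->bcnt, bytes);
	u64_stats_update_end(this_cpu_ptr(&xt_syncp));

	/* reader side, summing one CPU's counter */
	do {
		start = u64_stats_fetch_begin(syncp);
		bcnt = u64_stats_read(&tmp->bcnt);
		pcnt = u64_stats_read(&tmp->pcnt);
	} while (u64_stats_fetch_retry(syncp, start));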
The struct xt_counters is defined in a userspace-exported (uapi) header.
So in patch #2 I split the regular u64 access from the "internal" access
which treats the struct either as two counters or as a per-CPU pointer. In
order not to expose u64_stats_t to userland I added a "pad" which is cast
to the internal type. The intent is to make it obvious that a function
like xt_get_this_cpu_counter() expects the possibly per-CPU type while
mark_source_chains() or get_counters() expect the plain u64 type without
pointers.
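As an illustration of that split (a sketch only; the helper name
example_get_counter() is made up here, the real helpers are introduced in
patch #2):

	/*
	 * The 16 byte counter field in arpt/ipt/ip6t_entry has three views:
	 *   struct xt_counters    - uapi: two __u64 counters
	 *   struct xt_counter_pad - opaque placeholder of the same size
	 *   union xt_counter_k    - kernel: per-CPU pointer (SMP) or
	 *                           embedded struct xt_counters_k (UP)
	 */
	static struct xt_counters_k *example_get_counter(struct xt_counter_pad *pad)
	{
		union xt_counter_k *cnt = (union xt_counter_k *)pad;

		if (nr_cpu_ids > 1)	/* SMP: the field holds a per-CPU pointer */
			return this_cpu_ptr(cnt->pcpu);
		return &cnt->local;	/* UP: the counters are stored inline */
	}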
Sebastian Andrzej Siewior (3):
netfilter: Make xt_table::private RCU protected.
netfilter: Split the xt_counters type between kernel and user.
netfilter: Use u64_stats for counters in xt_counters_k.
include/linux/netfilter/x_tables.h | 113 +++++++-----------
include/uapi/linux/netfilter/x_tables.h | 4 +
include/uapi/linux/netfilter_arp/arp_tables.h | 5 +-
include/uapi/linux/netfilter_ipv4/ip_tables.h | 5 +-
.../uapi/linux/netfilter_ipv6/ip6_tables.h | 5 +-
net/ipv4/netfilter/arp_tables.c | 65 +++++-----
net/ipv4/netfilter/ip_tables.c | 65 +++++-----
net/ipv6/netfilter/ip6_tables.c | 65 +++++-----
net/netfilter/x_tables.c | 77 ++++++------
9 files changed, 191 insertions(+), 213 deletions(-)
Sebastian
* [PATCH net-next 1/3] netfilter: Make xt_table::private RCU protected.
2025-02-16 12:51 [PATCH net-next 0/3] Replace xt_recseq with u64_stats Sebastian Andrzej Siewior
@ 2025-02-16 12:51 ` Sebastian Andrzej Siewior
2025-02-17 14:05 ` Florian Westphal
2025-02-16 12:51 ` [PATCH net-next 2/3] netfilter: Split the xt_counters type between kernel and user Sebastian Andrzej Siewior
2025-02-16 12:51 ` [PATCH net-next 3/3] netfilter: Use u64_stats for counters in xt_counters_k Sebastian Andrzej Siewior
2 siblings, 1 reply; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-16 12:51 UTC (permalink / raw)
To: netfilter-devel, coreteam, linux-rt-devel
Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Thomas Gleixner,
Sebastian Andrzej Siewior
The seqcount xt_recseq is used to synchronize the replacement of
xt_table::private in xt_replace_table() against all readers such as
ipt_do_table(). After the pointer is replaced, xt_replace_table()
iterates over all per-CPU xt_recseq counters to ensure that no CPU is
still within the critical section.
Once this is done, the old pointer can be examined and deallocated
safely.
This can also be achieved with RCU: each reader of the private pointer
is within an RCU read-side section. The new pointer is published with
rcu_assign_pointer() and synchronize_rcu() is used to wait until every
reader has left its critical section.
Use RCU to assign xt_table::private and synchronise against readers.
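A condensed sketch of the resulting pairing (illustration only, the actual
changes are in the diff below):

	/* reader, e.g. ipt_do_table() */
	rcu_read_lock();
	private = rcu_dereference(table->priv_info);
	/* ... walk private->entries ... */
	rcu_read_unlock();

	/* writer, xt_replace_table() */
	rcu_assign_pointer(table->priv_info, newinfo);
	synchronize_rcu();	/* wait until no reader uses the old pointer */
	/* the old xt_table_info can now be examined and freed */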
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/netfilter/x_tables.h | 8 ++++-
net/ipv4/netfilter/arp_tables.c | 20 +++++++-----
net/ipv4/netfilter/ip_tables.c | 20 +++++++-----
net/ipv6/netfilter/ip6_tables.c | 20 +++++++-----
net/netfilter/x_tables.c | 50 ++++++++++++------------------
5 files changed, 62 insertions(+), 56 deletions(-)
diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index f39f688d72852..b9cd82e845d08 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -227,7 +227,7 @@ struct xt_table {
unsigned int valid_hooks;
/* Man behind the curtain... */
- struct xt_table_info *private;
+ struct xt_table_info __rcu *priv_info;
/* hook ops that register the table with the netfilter core */
struct nf_hook_ops *ops;
@@ -345,6 +345,12 @@ void xt_free_table_info(struct xt_table_info *info);
*/
DECLARE_PER_CPU(seqcount_t, xt_recseq);
+bool xt_af_lock_held(u_int8_t af);
+static inline struct xt_table_info *nf_table_private(const struct xt_table *table)
+{
+ return rcu_dereference_check(table->priv_info, xt_af_lock_held(table->af));
+}
+
/* xt_tee_enabled - true if x_tables needs to handle reentrancy
*
* Enabled if current ip(6)tables ruleset has at least one -j TEE rule.
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index 1cdd9c28ab2da..0628e68910f7f 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -203,8 +203,9 @@ unsigned int arpt_do_table(void *priv,
outdev = state->out ? state->out->name : nulldevname;
local_bh_disable();
+ rcu_read_lock();
addend = xt_write_recseq_begin();
- private = READ_ONCE(table->private); /* Address dependency. */
+ private = rcu_dereference(table->priv_info);
cpu = smp_processor_id();
table_base = private->entries;
jumpstack = (struct arpt_entry **)private->jumpstack[cpu];
@@ -279,6 +280,7 @@ unsigned int arpt_do_table(void *priv,
}
} while (!acpar.hotdrop);
xt_write_recseq_end(addend);
+ rcu_read_unlock();
local_bh_enable();
if (acpar.hotdrop)
@@ -648,9 +650,9 @@ static void get_old_counters(const struct xt_table_info *t,
static struct xt_counters *alloc_counters(const struct xt_table *table)
{
+ const struct xt_table_info *private = nf_table_private(table);
unsigned int countersize;
struct xt_counters *counters;
- const struct xt_table_info *private = table->private;
/* We need atomic snapshot of counters: rest doesn't change
* (other than comefrom, which userspace doesn't care
@@ -671,10 +673,10 @@ static int copy_entries_to_user(unsigned int total_size,
const struct xt_table *table,
void __user *userptr)
{
+ struct xt_table_info *private = nf_table_private(table);
unsigned int off, num;
const struct arpt_entry *e;
struct xt_counters *counters;
- struct xt_table_info *private = table->private;
int ret = 0;
void *loc_cpu_entry;
@@ -808,7 +810,7 @@ static int get_info(struct net *net, void __user *user, const int *len)
t = xt_request_find_table_lock(net, NFPROTO_ARP, name);
if (!IS_ERR(t)) {
struct arpt_getinfo info;
- const struct xt_table_info *private = t->private;
+ const struct xt_table_info *private = nf_table_private(t);
#ifdef CONFIG_NETFILTER_XTABLES_COMPAT
struct xt_table_info tmp;
@@ -861,7 +863,7 @@ static int get_entries(struct net *net, struct arpt_get_entries __user *uptr,
t = xt_find_table_lock(net, NFPROTO_ARP, get.name);
if (!IS_ERR(t)) {
- const struct xt_table_info *private = t->private;
+ const struct xt_table_info *private = nf_table_private(t);
if (get.size == private->size)
ret = copy_entries_to_user(private->size,
@@ -1022,7 +1024,8 @@ static int do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
}
local_bh_disable();
- private = t->private;
+ rcu_read_lock();
+ private = rcu_dereference(t->priv_info);
if (private->number != tmp.num_counters) {
ret = -EINVAL;
goto unlock_up_free;
@@ -1040,6 +1043,7 @@ static int do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
}
xt_write_recseq_end(addend);
unlock_up_free:
+ rcu_read_unlock();
local_bh_enable();
xt_table_unlock(t);
module_put(t->me);
@@ -1340,8 +1344,8 @@ static int compat_copy_entries_to_user(unsigned int total_size,
struct xt_table *table,
void __user *userptr)
{
+ const struct xt_table_info *private = nf_table_private(table);
struct xt_counters *counters;
- const struct xt_table_info *private = table->private;
void __user *pos;
unsigned int size;
int ret = 0;
@@ -1390,7 +1394,7 @@ static int compat_get_entries(struct net *net,
xt_compat_lock(NFPROTO_ARP);
t = xt_find_table_lock(net, NFPROTO_ARP, get.name);
if (!IS_ERR(t)) {
- const struct xt_table_info *private = t->private;
+ const struct xt_table_info *private = nf_table_private(t);
struct xt_table_info info;
ret = compat_table_info(private, &info);
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 3d101613f27fa..20e8b46af8876 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -256,8 +256,9 @@ ipt_do_table(void *priv,
WARN_ON(!(table->valid_hooks & (1 << hook)));
local_bh_disable();
+ rcu_read_lock();
+ private = rcu_dereference(table->priv_info);
addend = xt_write_recseq_begin();
- private = READ_ONCE(table->private); /* Address dependency. */
cpu = smp_processor_id();
table_base = private->entries;
jumpstack = (struct ipt_entry **)private->jumpstack[cpu];
@@ -354,6 +355,7 @@ ipt_do_table(void *priv,
} while (!acpar.hotdrop);
xt_write_recseq_end(addend);
+ rcu_read_unlock();
local_bh_enable();
if (acpar.hotdrop)
@@ -788,9 +790,9 @@ static void get_old_counters(const struct xt_table_info *t,
static struct xt_counters *alloc_counters(const struct xt_table *table)
{
+ const struct xt_table_info *private = nf_table_private(table);
unsigned int countersize;
struct xt_counters *counters;
- const struct xt_table_info *private = table->private;
/* We need atomic snapshot of counters: rest doesn't change
(other than comefrom, which userspace doesn't care
@@ -811,10 +813,10 @@ copy_entries_to_user(unsigned int total_size,
const struct xt_table *table,
void __user *userptr)
{
+ const struct xt_table_info *private = nf_table_private(table);
unsigned int off, num;
const struct ipt_entry *e;
struct xt_counters *counters;
- const struct xt_table_info *private = table->private;
int ret = 0;
const void *loc_cpu_entry;
@@ -963,7 +965,7 @@ static int get_info(struct net *net, void __user *user, const int *len)
t = xt_request_find_table_lock(net, AF_INET, name);
if (!IS_ERR(t)) {
struct ipt_getinfo info;
- const struct xt_table_info *private = t->private;
+ const struct xt_table_info *private = nf_table_private(t);
#ifdef CONFIG_NETFILTER_XTABLES_COMPAT
struct xt_table_info tmp;
@@ -1017,7 +1019,7 @@ get_entries(struct net *net, struct ipt_get_entries __user *uptr,
t = xt_find_table_lock(net, AF_INET, get.name);
if (!IS_ERR(t)) {
- const struct xt_table_info *private = t->private;
+ const struct xt_table_info *private = nf_table_private(t);
if (get.size == private->size)
ret = copy_entries_to_user(private->size,
t, uptr->entrytable);
@@ -1175,7 +1177,8 @@ do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
}
local_bh_disable();
- private = t->private;
+ rcu_read_lock();
+ private = rcu_dereference(t->priv_info);
if (private->number != tmp.num_counters) {
ret = -EINVAL;
goto unlock_up_free;
@@ -1192,6 +1195,7 @@ do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
}
xt_write_recseq_end(addend);
unlock_up_free:
+ rcu_read_unlock();
local_bh_enable();
xt_table_unlock(t);
module_put(t->me);
@@ -1550,8 +1554,8 @@ static int
compat_copy_entries_to_user(unsigned int total_size, struct xt_table *table,
void __user *userptr)
{
+ const struct xt_table_info *private = nf_table_private(table);
struct xt_counters *counters;
- const struct xt_table_info *private = table->private;
void __user *pos;
unsigned int size;
int ret = 0;
@@ -1597,7 +1601,7 @@ compat_get_entries(struct net *net, struct compat_ipt_get_entries __user *uptr,
xt_compat_lock(AF_INET);
t = xt_find_table_lock(net, AF_INET, get.name);
if (!IS_ERR(t)) {
- const struct xt_table_info *private = t->private;
+ const struct xt_table_info *private = nf_table_private(t);
struct xt_table_info info;
ret = compat_table_info(private, &info);
if (!ret && get.size == info.size)
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index 7d5602950ae72..c12d489a09840 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -278,8 +278,9 @@ ip6t_do_table(void *priv, struct sk_buff *skb,
WARN_ON(!(table->valid_hooks & (1 << hook)));
local_bh_disable();
+ rcu_read_lock();
+ private = rcu_dereference(table->priv_info);
addend = xt_write_recseq_begin();
- private = READ_ONCE(table->private); /* Address dependency. */
cpu = smp_processor_id();
table_base = private->entries;
jumpstack = (struct ip6t_entry **)private->jumpstack[cpu];
@@ -372,6 +373,7 @@ ip6t_do_table(void *priv, struct sk_buff *skb,
} while (!acpar.hotdrop);
xt_write_recseq_end(addend);
+ rcu_read_unlock();
local_bh_enable();
if (acpar.hotdrop)
@@ -804,9 +806,9 @@ static void get_old_counters(const struct xt_table_info *t,
static struct xt_counters *alloc_counters(const struct xt_table *table)
{
+ const struct xt_table_info *private = nf_table_private(table);
unsigned int countersize;
struct xt_counters *counters;
- const struct xt_table_info *private = table->private;
/* We need atomic snapshot of counters: rest doesn't change
(other than comefrom, which userspace doesn't care
@@ -827,10 +829,10 @@ copy_entries_to_user(unsigned int total_size,
const struct xt_table *table,
void __user *userptr)
{
+ const struct xt_table_info *private = nf_table_private(table);
unsigned int off, num;
const struct ip6t_entry *e;
struct xt_counters *counters;
- const struct xt_table_info *private = table->private;
int ret = 0;
const void *loc_cpu_entry;
@@ -979,7 +981,7 @@ static int get_info(struct net *net, void __user *user, const int *len)
t = xt_request_find_table_lock(net, AF_INET6, name);
if (!IS_ERR(t)) {
struct ip6t_getinfo info;
- const struct xt_table_info *private = t->private;
+ const struct xt_table_info *private = nf_table_private(t);
#ifdef CONFIG_NETFILTER_XTABLES_COMPAT
struct xt_table_info tmp;
@@ -1034,7 +1036,7 @@ get_entries(struct net *net, struct ip6t_get_entries __user *uptr,
t = xt_find_table_lock(net, AF_INET6, get.name);
if (!IS_ERR(t)) {
- struct xt_table_info *private = t->private;
+ struct xt_table_info *private = nf_table_private(t);
if (get.size == private->size)
ret = copy_entries_to_user(private->size,
t, uptr->entrytable);
@@ -1191,7 +1193,8 @@ do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
}
local_bh_disable();
- private = t->private;
+ rcu_read_lock();
+ private = rcu_dereference(t->priv_info);
if (private->number != tmp.num_counters) {
ret = -EINVAL;
goto unlock_up_free;
@@ -1208,6 +1211,7 @@ do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
}
xt_write_recseq_end(addend);
unlock_up_free:
+ rcu_read_unlock();
local_bh_enable();
xt_table_unlock(t);
module_put(t->me);
@@ -1559,8 +1563,8 @@ static int
compat_copy_entries_to_user(unsigned int total_size, struct xt_table *table,
void __user *userptr)
{
+ const struct xt_table_info *private = nf_table_private(table);
struct xt_counters *counters;
- const struct xt_table_info *private = table->private;
void __user *pos;
unsigned int size;
int ret = 0;
@@ -1606,7 +1610,7 @@ compat_get_entries(struct net *net, struct compat_ip6t_get_entries __user *uptr,
xt_compat_lock(AF_INET6);
t = xt_find_table_lock(net, AF_INET6, get.name);
if (!IS_ERR(t)) {
- const struct xt_table_info *private = t->private;
+ const struct xt_table_info *private = nf_table_private(t);
struct xt_table_info info;
ret = compat_table_info(private, &info);
if (!ret && get.size == info.size)
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 709840612f0df..ee38272cca562 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -85,6 +85,19 @@ static const char *const xt_prefix[NFPROTO_NUMPROTO] = {
[NFPROTO_IPV6] = "ip6",
};
+#ifdef CONFIG_LOCKDEP
+bool xt_af_lock_held(u_int8_t af)
+{
+ return lockdep_is_held(&xt[af].mutex) ||
+#ifdef CONFIG_NETFILTER_XTABLES_COMPAT
+ lockdep_is_held(&xt[af].compat_mutex);
+#else
+ false;
+#endif
+}
+EXPORT_SYMBOL_GPL(xt_af_lock_held);
+#endif
+
/* Registration hooks for targets. */
int xt_register_target(struct xt_target *target)
{
@@ -1388,7 +1401,6 @@ xt_replace_table(struct xt_table *table,
int *error)
{
struct xt_table_info *private;
- unsigned int cpu;
int ret;
ret = xt_jumpstack_alloc(newinfo);
@@ -1397,48 +1409,24 @@ xt_replace_table(struct xt_table *table,
return NULL;
}
- /* Do the substitution. */
- local_bh_disable();
- private = table->private;
+ private = nf_table_private(table);
/* Check inside lock: is the old number correct? */
if (num_counters != private->number) {
pr_debug("num_counters != table->private->number (%u/%u)\n",
num_counters, private->number);
- local_bh_enable();
*error = -EAGAIN;
return NULL;
}
newinfo->initial_entries = private->initial_entries;
- /*
- * Ensure contents of newinfo are visible before assigning to
- * private.
- */
- smp_wmb();
- table->private = newinfo;
-
- /* make sure all cpus see new ->private value */
- smp_mb();
+ rcu_assign_pointer(table->priv_info, newinfo);
/*
* Even though table entries have now been swapped, other CPU's
* may still be using the old entries...
*/
- local_bh_enable();
-
- /* ... so wait for even xt_recseq on all cpus */
- for_each_possible_cpu(cpu) {
- seqcount_t *s = &per_cpu(xt_recseq, cpu);
- u32 seq = raw_read_seqcount(s);
-
- if (seq & 1) {
- do {
- cond_resched();
- cpu_relax();
- } while (seq == raw_read_seqcount(s));
- }
- }
+ synchronize_rcu();
audit_log_nfcfg(table->name, table->af, private->number,
!private->number ? AUDIT_XT_OP_REGISTER :
@@ -1475,12 +1463,12 @@ struct xt_table *xt_register_table(struct net *net,
}
/* Simplifies replace_table code. */
- table->private = bootstrap;
+ rcu_assign_pointer(table->priv_info, bootstrap);
if (!xt_replace_table(table, 0, newinfo, &ret))
goto unlock;
- private = table->private;
+ private = nf_table_private(table);
pr_debug("table->private->number = %u\n", private->number);
/* save number of initial entries */
@@ -1503,7 +1491,7 @@ void *xt_unregister_table(struct xt_table *table)
struct xt_table_info *private;
mutex_lock(&xt[table->af].mutex);
- private = table->private;
+ private = nf_table_private(table);
list_del(&table->list);
mutex_unlock(&xt[table->af].mutex);
audit_log_nfcfg(table->name, table->af, private->number,
--
2.47.2
* [PATCH net-next 2/3] netfilter: Split the xt_counters type between kernel and user.
2025-02-16 12:51 [PATCH net-next 0/3] Replace xt_recseq with u64_stats Sebastian Andrzej Siewior
2025-02-16 12:51 ` [PATCH net-next 1/3] netfilter: Make xt_table::private RCU protected Sebastian Andrzej Siewior
@ 2025-02-16 12:51 ` Sebastian Andrzej Siewior
2025-02-16 15:22 ` kernel test robot
2025-02-16 12:51 ` [PATCH net-next 3/3] netfilter: Use u64_stats for counters in xt_counters_k Sebastian Andrzej Siewior
2 siblings, 1 reply; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-16 12:51 UTC (permalink / raw)
To: netfilter-devel, coreteam, linux-rt-devel
Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Thomas Gleixner,
Sebastian Andrzej Siewior
The struct xt_counters contains two u64 values, counters for bytes and
packets. This type is exposed to userland via uapi and the kernel uses it
as-is when it communicates with userland.
However, the same type within an entry (such as ipt_entry) is treated
differently: within the kernel it is a plain two-value struct if the
system has only one CPU.
With more CPUs, the first value is a per-CPU pointer which points to
per-CPU memory holding the two u64 counters. How the struct is interpreted
therefore depends on the caller.
Introduce a struct xt_counter_pad which is simply used as a placeholder,
ensuring it is the same size as struct xt_counters. Kernel functions use
this type wherever the field might hold a per-CPU pointer.
Add this padding struct to arpt_entry, ipt_entry and ip6t_entry.
Pass this type to xt_get_this_cpu_counter(), xt_percpu_counter_free()
and xt_percpu_counter_alloc(). These functions cast it to union
xt_counter_k and return the proper pointer to struct xt_counters_k.
This is mostly the same as before but a bit more obvious, and it
introduces struct xt_counters_k for in-kernel usage. This can be replaced
later without breaking userland.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/netfilter/x_tables.h | 40 +++++++++++++------
include/uapi/linux/netfilter/x_tables.h | 4 ++
include/uapi/linux/netfilter_arp/arp_tables.h | 5 ++-
include/uapi/linux/netfilter_ipv4/ip_tables.h | 5 ++-
.../uapi/linux/netfilter_ipv6/ip6_tables.h | 5 ++-
net/ipv4/netfilter/arp_tables.c | 22 +++++-----
net/ipv4/netfilter/ip_tables.c | 22 +++++-----
net/ipv6/netfilter/ip6_tables.c | 22 +++++-----
net/netfilter/x_tables.c | 21 ++++++----
9 files changed, 89 insertions(+), 57 deletions(-)
diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index b9cd82e845d08..fc52a2ba90f6b 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -267,6 +267,17 @@ struct xt_table_info {
unsigned char entries[] __aligned(8);
};
+struct xt_counters_k {
+ /* Packet and byte counter */
+ __u64 pcnt;
+ __u64 bcnt;
+};
+
+union xt_counter_k {
+ struct xt_counters_k __percpu *pcpu;
+ struct xt_counters_k local;
+};
+
int xt_register_target(struct xt_target *target);
void xt_unregister_target(struct xt_target *target);
int xt_register_targets(struct xt_target *target, unsigned int n);
@@ -428,29 +439,32 @@ static inline unsigned long ifname_compare_aligned(const char *_a,
struct xt_percpu_counter_alloc_state {
unsigned int off;
- const char __percpu *mem;
+ void __percpu *mem;
};
bool xt_percpu_counter_alloc(struct xt_percpu_counter_alloc_state *state,
- struct xt_counters *counter);
-void xt_percpu_counter_free(struct xt_counters *cnt);
+ struct xt_counter_pad *xt_pad);
+void xt_percpu_counter_free(struct xt_counter_pad *xt_pad);
-static inline struct xt_counters *
-xt_get_this_cpu_counter(struct xt_counters *cnt)
+static inline struct xt_counters_k *xt_get_this_cpu_counter(struct xt_counter_pad *xt_pad)
{
- if (nr_cpu_ids > 1)
- return this_cpu_ptr((void __percpu *) (unsigned long) cnt->pcnt);
+ union xt_counter_k *xt_cnt = (union xt_counter_k *)xt_pad;
- return cnt;
+ if (nr_cpu_ids > 1)
+ return this_cpu_ptr(xt_cnt->pcpu);
+
+ return &xt_cnt->local;
}
-static inline struct xt_counters *
-xt_get_per_cpu_counter(struct xt_counters *cnt, unsigned int cpu)
+static inline struct xt_counters_k *xt_get_per_cpu_counter(struct xt_counter_pad *xt_pad,
+ unsigned int cpu)
{
- if (nr_cpu_ids > 1)
- return per_cpu_ptr((void __percpu *) (unsigned long) cnt->pcnt, cpu);
+ union xt_counter_k *xt_cnt = (union xt_counter_k *)xt_pad;
- return cnt;
+ if (nr_cpu_ids > 1)
+ return per_cpu_ptr(xt_cnt->pcpu, cpu);
+
+ return &xt_cnt->local;
}
struct nf_hook_ops *xt_hook_ops_alloc(const struct xt_table *, nf_hookfn *);
diff --git a/include/uapi/linux/netfilter/x_tables.h b/include/uapi/linux/netfilter/x_tables.h
index 796af83a963aa..70e19a140ab1e 100644
--- a/include/uapi/linux/netfilter/x_tables.h
+++ b/include/uapi/linux/netfilter/x_tables.h
@@ -111,6 +111,10 @@ struct xt_counters {
__u64 pcnt, bcnt; /* Packet and byte counters */
};
+struct xt_counter_pad {
+ __u8 pad[16];
+};
+
/* The argument to IPT_SO_ADD_COUNTERS. */
struct xt_counters_info {
/* Which table. */
diff --git a/include/uapi/linux/netfilter_arp/arp_tables.h b/include/uapi/linux/netfilter_arp/arp_tables.h
index a6ac2463f787a..4ca949a955412 100644
--- a/include/uapi/linux/netfilter_arp/arp_tables.h
+++ b/include/uapi/linux/netfilter_arp/arp_tables.h
@@ -106,7 +106,10 @@ struct arpt_entry
unsigned int comefrom;
/* Packet and byte counters. */
- struct xt_counters counters;
+ union {
+ struct xt_counters counters;
+ struct xt_counter_pad counter_pad;
+ };
/* The matches (if any), then the target. */
unsigned char elems[];
diff --git a/include/uapi/linux/netfilter_ipv4/ip_tables.h b/include/uapi/linux/netfilter_ipv4/ip_tables.h
index 1485df28b2391..a4874078ec058 100644
--- a/include/uapi/linux/netfilter_ipv4/ip_tables.h
+++ b/include/uapi/linux/netfilter_ipv4/ip_tables.h
@@ -118,7 +118,10 @@ struct ipt_entry {
unsigned int comefrom;
/* Packet and byte counters. */
- struct xt_counters counters;
+ union {
+ struct xt_counters counters;
+ struct xt_counter_pad counter_pad;
+ };
/* The matches (if any), then the target. */
unsigned char elems[];
diff --git a/include/uapi/linux/netfilter_ipv6/ip6_tables.h b/include/uapi/linux/netfilter_ipv6/ip6_tables.h
index 766e8e0bcc683..8634257e1cd59 100644
--- a/include/uapi/linux/netfilter_ipv6/ip6_tables.h
+++ b/include/uapi/linux/netfilter_ipv6/ip6_tables.h
@@ -122,7 +122,10 @@ struct ip6t_entry {
unsigned int comefrom;
/* Packet and byte counters. */
- struct xt_counters counters;
+ union {
+ struct xt_counters counters;
+ struct xt_counter_pad counter_pad;
+ };
/* The matches (if any), then the target. */
unsigned char elems[0];
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index 0628e68910f7f..ce3d73155ca9b 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -221,14 +221,14 @@ unsigned int arpt_do_table(void *priv,
arp = arp_hdr(skb);
do {
const struct xt_entry_target *t;
- struct xt_counters *counter;
+ struct xt_counters_k *counter;
if (!arp_packet_match(arp, skb->dev, indev, outdev, &e->arp)) {
e = arpt_next_entry(e);
continue;
}
- counter = xt_get_this_cpu_counter(&e->counters);
+ counter = xt_get_this_cpu_counter(&e->counter_pad);
ADD_COUNTER(*counter, arp_hdr_len(skb->dev), 1);
t = arpt_get_target_c(e);
@@ -412,7 +412,7 @@ find_check_entry(struct arpt_entry *e, struct net *net, const char *name,
struct xt_target *target;
int ret;
- if (!xt_percpu_counter_alloc(alloc_state, &e->counters))
+ if (!xt_percpu_counter_alloc(alloc_state, &e->counter_pad))
return -ENOMEM;
t = arpt_get_target(e);
@@ -431,7 +431,7 @@ find_check_entry(struct arpt_entry *e, struct net *net, const char *name,
err:
module_put(t->u.kernel.target->me);
out:
- xt_percpu_counter_free(&e->counters);
+ xt_percpu_counter_free(&e->counter_pad);
return ret;
}
@@ -512,7 +512,7 @@ static void cleanup_entry(struct arpt_entry *e, struct net *net)
if (par.target->destroy != NULL)
par.target->destroy(&par);
module_put(par.target->me);
- xt_percpu_counter_free(&e->counters);
+ xt_percpu_counter_free(&e->counter_pad);
}
/* Checks and translates the user-supplied table segment (held in
@@ -611,11 +611,11 @@ static void get_counters(const struct xt_table_info *t,
i = 0;
xt_entry_foreach(iter, t->entries, t->size) {
- struct xt_counters *tmp;
+ struct xt_counters_k *tmp;
u64 bcnt, pcnt;
unsigned int start;
- tmp = xt_get_per_cpu_counter(&iter->counters, cpu);
+ tmp = xt_get_per_cpu_counter(&iter->counter_pad, cpu);
do {
start = read_seqcount_begin(s);
bcnt = tmp->bcnt;
@@ -638,9 +638,9 @@ static void get_old_counters(const struct xt_table_info *t,
for_each_possible_cpu(cpu) {
i = 0;
xt_entry_foreach(iter, t->entries, t->size) {
- struct xt_counters *tmp;
+ struct xt_counters_k *tmp;
- tmp = xt_get_per_cpu_counter(&iter->counters, cpu);
+ tmp = xt_get_per_cpu_counter(&iter->counter_pad, cpu);
ADD_COUNTER(counters[i], tmp->bcnt, tmp->pcnt);
++i;
}
@@ -1035,9 +1035,9 @@ static int do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
addend = xt_write_recseq_begin();
xt_entry_foreach(iter, private->entries, private->size) {
- struct xt_counters *tmp;
+ struct xt_counters_k *tmp;
- tmp = xt_get_this_cpu_counter(&iter->counters);
+ tmp = xt_get_this_cpu_counter(&iter->counter_pad);
ADD_COUNTER(*tmp, paddc[i].bcnt, paddc[i].pcnt);
++i;
}
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 20e8b46af8876..95f917f5bceef 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -278,7 +278,7 @@ ipt_do_table(void *priv,
do {
const struct xt_entry_target *t;
const struct xt_entry_match *ematch;
- struct xt_counters *counter;
+ struct xt_counters_k *counter;
WARN_ON(!e);
if (!ip_packet_match(ip, indev, outdev,
@@ -295,7 +295,7 @@ ipt_do_table(void *priv,
goto no_match;
}
- counter = xt_get_this_cpu_counter(&e->counters);
+ counter = xt_get_this_cpu_counter(&e->counter_pad);
ADD_COUNTER(*counter, skb->len, 1);
t = ipt_get_target_c(e);
@@ -525,7 +525,7 @@ find_check_entry(struct ipt_entry *e, struct net *net, const char *name,
struct xt_mtchk_param mtpar;
struct xt_entry_match *ematch;
- if (!xt_percpu_counter_alloc(alloc_state, &e->counters))
+ if (!xt_percpu_counter_alloc(alloc_state, &e->counter_pad))
return -ENOMEM;
j = 0;
@@ -565,7 +565,7 @@ find_check_entry(struct ipt_entry *e, struct net *net, const char *name,
cleanup_match(ematch, net);
}
- xt_percpu_counter_free(&e->counters);
+ xt_percpu_counter_free(&e->counter_pad);
return ret;
}
@@ -653,7 +653,7 @@ cleanup_entry(struct ipt_entry *e, struct net *net)
if (par.target->destroy != NULL)
par.target->destroy(&par);
module_put(par.target->me);
- xt_percpu_counter_free(&e->counters);
+ xt_percpu_counter_free(&e->counter_pad);
}
/* Checks and translates the user-supplied table segment (held in
@@ -750,11 +750,11 @@ get_counters(const struct xt_table_info *t,
i = 0;
xt_entry_foreach(iter, t->entries, t->size) {
- struct xt_counters *tmp;
+ struct xt_counters_k *tmp;
u64 bcnt, pcnt;
unsigned int start;
- tmp = xt_get_per_cpu_counter(&iter->counters, cpu);
+ tmp = xt_get_per_cpu_counter(&iter->counter_pad, cpu);
do {
start = read_seqcount_begin(s);
bcnt = tmp->bcnt;
@@ -777,9 +777,9 @@ static void get_old_counters(const struct xt_table_info *t,
for_each_possible_cpu(cpu) {
i = 0;
xt_entry_foreach(iter, t->entries, t->size) {
- const struct xt_counters *tmp;
+ const struct xt_counters_k *tmp;
- tmp = xt_get_per_cpu_counter(&iter->counters, cpu);
+ tmp = xt_get_per_cpu_counter(&iter->counter_pad, cpu);
ADD_COUNTER(counters[i], tmp->bcnt, tmp->pcnt);
++i; /* macro does multi eval of i */
}
@@ -1187,9 +1187,9 @@ do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
i = 0;
addend = xt_write_recseq_begin();
xt_entry_foreach(iter, private->entries, private->size) {
- struct xt_counters *tmp;
+ struct xt_counters_k *tmp;
- tmp = xt_get_this_cpu_counter(&iter->counters);
+ tmp = xt_get_this_cpu_counter(&iter->counter_pad);
ADD_COUNTER(*tmp, paddc[i].bcnt, paddc[i].pcnt);
++i;
}
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index c12d489a09840..f4877b1b2463e 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -300,7 +300,7 @@ ip6t_do_table(void *priv, struct sk_buff *skb,
do {
const struct xt_entry_target *t;
const struct xt_entry_match *ematch;
- struct xt_counters *counter;
+ struct xt_counters_k *counter;
WARN_ON(!e);
acpar.thoff = 0;
@@ -318,7 +318,7 @@ ip6t_do_table(void *priv, struct sk_buff *skb,
goto no_match;
}
- counter = xt_get_this_cpu_counter(&e->counters);
+ counter = xt_get_this_cpu_counter(&e->counter_pad);
ADD_COUNTER(*counter, skb->len, 1);
t = ip6t_get_target_c(e);
@@ -544,7 +544,7 @@ find_check_entry(struct ip6t_entry *e, struct net *net, const char *name,
struct xt_mtchk_param mtpar;
struct xt_entry_match *ematch;
- if (!xt_percpu_counter_alloc(alloc_state, &e->counters))
+ if (!xt_percpu_counter_alloc(alloc_state, &e->counter_pad))
return -ENOMEM;
j = 0;
@@ -583,7 +583,7 @@ find_check_entry(struct ip6t_entry *e, struct net *net, const char *name,
cleanup_match(ematch, net);
}
- xt_percpu_counter_free(&e->counters);
+ xt_percpu_counter_free(&e->counter_pad);
return ret;
}
@@ -670,7 +670,7 @@ static void cleanup_entry(struct ip6t_entry *e, struct net *net)
if (par.target->destroy != NULL)
par.target->destroy(&par);
module_put(par.target->me);
- xt_percpu_counter_free(&e->counters);
+ xt_percpu_counter_free(&e->counter_pad);
}
/* Checks and translates the user-supplied table segment (held in
@@ -767,11 +767,11 @@ get_counters(const struct xt_table_info *t,
i = 0;
xt_entry_foreach(iter, t->entries, t->size) {
- struct xt_counters *tmp;
+ struct xt_counters_k *tmp;
u64 bcnt, pcnt;
unsigned int start;
- tmp = xt_get_per_cpu_counter(&iter->counters, cpu);
+ tmp = xt_get_per_cpu_counter(&iter->counter_pad, cpu);
do {
start = read_seqcount_begin(s);
bcnt = tmp->bcnt;
@@ -794,9 +794,9 @@ static void get_old_counters(const struct xt_table_info *t,
for_each_possible_cpu(cpu) {
i = 0;
xt_entry_foreach(iter, t->entries, t->size) {
- const struct xt_counters *tmp;
+ const struct xt_counters_k *tmp;
- tmp = xt_get_per_cpu_counter(&iter->counters, cpu);
+ tmp = xt_get_per_cpu_counter(&iter->counter_pad, cpu);
ADD_COUNTER(counters[i], tmp->bcnt, tmp->pcnt);
++i;
}
@@ -1203,9 +1203,9 @@ do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
i = 0;
addend = xt_write_recseq_begin();
xt_entry_foreach(iter, private->entries, private->size) {
- struct xt_counters *tmp;
+ struct xt_counters_k *tmp;
- tmp = xt_get_this_cpu_counter(&iter->counters);
+ tmp = xt_get_this_cpu_counter(&iter->counter_pad);
ADD_COUNTER(*tmp, paddc[i].bcnt, paddc[i].pcnt);
++i;
}
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index ee38272cca562..e3f48539f81d6 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1908,9 +1908,13 @@ EXPORT_SYMBOL_GPL(xt_proto_fini);
* returns false on error.
*/
bool xt_percpu_counter_alloc(struct xt_percpu_counter_alloc_state *state,
- struct xt_counters *counter)
+ struct xt_counter_pad *xt_pad)
{
- BUILD_BUG_ON(XT_PCPU_BLOCK_SIZE < (sizeof(*counter) * 2));
+ union xt_counter_k *xt_cnt = (union xt_counter_k *)xt_pad;
+
+ BUILD_BUG_ON(XT_PCPU_BLOCK_SIZE < (sizeof(struct xt_counters_k) * 2));
+ BUILD_BUG_ON(sizeof(struct xt_counters_k) != sizeof(struct xt_counters));
+ BUILD_BUG_ON(sizeof(struct xt_counters_k) != sizeof(struct xt_counter_pad));
if (nr_cpu_ids <= 1)
return true;
@@ -1921,9 +1925,9 @@ bool xt_percpu_counter_alloc(struct xt_percpu_counter_alloc_state *state,
if (!state->mem)
return false;
}
- counter->pcnt = (__force unsigned long)(state->mem + state->off);
- state->off += sizeof(*counter);
- if (state->off > (XT_PCPU_BLOCK_SIZE - sizeof(*counter))) {
+ xt_cnt->pcpu = state->mem + state->off;
+ state->off += sizeof(struct xt_counters_k);
+ if (state->off > (XT_PCPU_BLOCK_SIZE - sizeof(struct xt_counters_k))) {
state->mem = NULL;
state->off = 0;
}
@@ -1931,12 +1935,13 @@ bool xt_percpu_counter_alloc(struct xt_percpu_counter_alloc_state *state,
}
EXPORT_SYMBOL_GPL(xt_percpu_counter_alloc);
-void xt_percpu_counter_free(struct xt_counters *counters)
+void xt_percpu_counter_free(struct xt_counter_pad *xt_pad)
{
- unsigned long pcnt = counters->pcnt;
+ union xt_counter_k *xt_cnt = (union xt_counter_k *)xt_pad;
+ unsigned long pcnt = (unsigned long)xt_cnt->pcpu;
if (nr_cpu_ids > 1 && (pcnt & (XT_PCPU_BLOCK_SIZE - 1)) == 0)
- free_percpu((void __percpu *)pcnt);
+ free_percpu(xt_cnt->pcpu);
}
EXPORT_SYMBOL_GPL(xt_percpu_counter_free);
--
2.47.2
* [PATCH net-next 3/3] netfilter: Use u64_stats for counters in xt_counters_k.
2025-02-16 12:51 [PATCH net-next 0/3] Replace xt_recseq with u64_stats Sebastian Andrzej Siewior
2025-02-16 12:51 ` [PATCH net-next 1/3] netfilter: Make xt_table::private RCU protected Sebastian Andrzej Siewior
2025-02-16 12:51 ` [PATCH net-next 2/3] netfilter: Split the xt_counters type between kernel and user Sebastian Andrzej Siewior
@ 2025-02-16 12:51 ` Sebastian Andrzej Siewior
2 siblings, 0 replies; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-16 12:51 UTC (permalink / raw)
To: netfilter-devel, coreteam, linux-rt-devel
Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Thomas Gleixner,
Sebastian Andrzej Siewior
Using u64_stats for statistics has the advantage that the seqcount_t
synchronisation is only needed on 32bit architectures. On 64bit
architectures the update can happen locklessly since there is only one
writer.
The critical section (xt_write_recseq_begin() - xt_write_recseq_end())
can be reduced to just the update of the value because the reader and its
xt_table_info::private access are already RCU protected.
Use u64_stats_t in xt_counters_k for statistics. Add xt_counter_add() as
a helper to update counters within the critical section.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/netfilter/x_tables.h | 71 ++++++------------------------
net/ipv4/netfilter/arp_tables.c | 23 ++++------
net/ipv4/netfilter/ip_tables.c | 23 ++++------
net/ipv6/netfilter/ip6_tables.c | 23 ++++------
net/netfilter/x_tables.c | 6 +--
5 files changed, 43 insertions(+), 103 deletions(-)
diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index fc52a2ba90f6b..df429d0198c92 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -269,10 +269,21 @@ struct xt_table_info {
struct xt_counters_k {
/* Packet and byte counter */
- __u64 pcnt;
- __u64 bcnt;
+ u64_stats_t pcnt;
+ u64_stats_t bcnt;
};
+DECLARE_PER_CPU(struct u64_stats_sync, xt_syncp);
+
+static inline void xt_counter_add(struct xt_counters_k *xt_counter,
+ unsigned int bytes, unsigned int packets)
+{
+ u64_stats_update_begin(this_cpu_ptr(&xt_syncp));
+ u64_stats_add(&xt_counter->pcnt, packets);
+ u64_stats_add(&xt_counter->bcnt, bytes);
+ u64_stats_update_end(this_cpu_ptr(&xt_syncp));
+}
+
union xt_counter_k {
struct xt_counters_k __percpu *pcpu;
struct xt_counters_k local;
@@ -346,16 +357,6 @@ void xt_proto_fini(struct net *net, u_int8_t af);
struct xt_table_info *xt_alloc_table_info(unsigned int size);
void xt_free_table_info(struct xt_table_info *info);
-/**
- * xt_recseq - recursive seqcount for netfilter use
- *
- * Packet processing changes the seqcount only if no recursion happened
- * get_counters() can use read_seqcount_begin()/read_seqcount_retry(),
- * because we use the normal seqcount convention :
- * Low order bit set to 1 if a writer is active.
- */
-DECLARE_PER_CPU(seqcount_t, xt_recseq);
-
bool xt_af_lock_held(u_int8_t af);
static inline struct xt_table_info *nf_table_private(const struct xt_table *table)
{
@@ -368,52 +369,6 @@ static inline struct xt_table_info *nf_table_private(const struct xt_table *tabl
*/
extern struct static_key xt_tee_enabled;
-/**
- * xt_write_recseq_begin - start of a write section
- *
- * Begin packet processing : all readers must wait the end
- * 1) Must be called with preemption disabled
- * 2) softirqs must be disabled too (or we should use this_cpu_add())
- * Returns:
- * 1 if no recursion on this cpu
- * 0 if recursion detected
- */
-static inline unsigned int xt_write_recseq_begin(void)
-{
- unsigned int addend;
-
- /*
- * Low order bit of sequence is set if we already
- * called xt_write_recseq_begin().
- */
- addend = (__this_cpu_read(xt_recseq.sequence) + 1) & 1;
-
- /*
- * This is kind of a write_seqcount_begin(), but addend is 0 or 1
- * We dont check addend value to avoid a test and conditional jump,
- * since addend is most likely 1
- */
- __this_cpu_add(xt_recseq.sequence, addend);
- smp_mb();
-
- return addend;
-}
-
-/**
- * xt_write_recseq_end - end of a write section
- * @addend: return value from previous xt_write_recseq_begin()
- *
- * End packet processing : all readers can proceed
- * 1) Must be called with preemption disabled
- * 2) softirqs must be disabled too (or we should use this_cpu_add())
- */
-static inline void xt_write_recseq_end(unsigned int addend)
-{
- /* this is kind of a write_seqcount_end(), but addend is 0 or 1 */
- smp_wmb();
- __this_cpu_add(xt_recseq.sequence, addend);
-}
-
/*
* This helper is performance critical and must be inlined
*/
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index ce3d73155ca9b..6b957de58d2cf 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -194,7 +194,6 @@ unsigned int arpt_do_table(void *priv,
unsigned int cpu, stackidx = 0;
const struct xt_table_info *private;
struct xt_action_param acpar;
- unsigned int addend;
if (!pskb_may_pull(skb, arp_hdr_len(skb->dev)))
return NF_DROP;
@@ -204,7 +203,6 @@ unsigned int arpt_do_table(void *priv,
local_bh_disable();
rcu_read_lock();
- addend = xt_write_recseq_begin();
private = rcu_dereference(table->priv_info);
cpu = smp_processor_id();
table_base = private->entries;
@@ -229,7 +227,7 @@ unsigned int arpt_do_table(void *priv,
}
counter = xt_get_this_cpu_counter(&e->counter_pad);
- ADD_COUNTER(*counter, arp_hdr_len(skb->dev), 1);
+ xt_counter_add(counter, arp_hdr_len(skb->dev), 1);
t = arpt_get_target_c(e);
@@ -279,7 +277,6 @@ unsigned int arpt_do_table(void *priv,
break;
}
} while (!acpar.hotdrop);
- xt_write_recseq_end(addend);
rcu_read_unlock();
local_bh_enable();
@@ -607,7 +604,7 @@ static void get_counters(const struct xt_table_info *t,
unsigned int i;
for_each_possible_cpu(cpu) {
- seqcount_t *s = &per_cpu(xt_recseq, cpu);
+ struct u64_stats_sync *syncp = &per_cpu(xt_syncp, cpu);
i = 0;
xt_entry_foreach(iter, t->entries, t->size) {
@@ -617,10 +614,10 @@ static void get_counters(const struct xt_table_info *t,
tmp = xt_get_per_cpu_counter(&iter->counter_pad, cpu);
do {
- start = read_seqcount_begin(s);
- bcnt = tmp->bcnt;
- pcnt = tmp->pcnt;
- } while (read_seqcount_retry(s, start));
+ start = u64_stats_fetch_begin(syncp);
+ bcnt = u64_stats_read(&tmp->bcnt);
+ pcnt = u64_stats_read(&tmp->pcnt);
+ } while (u64_stats_fetch_retry(syncp, start));
ADD_COUNTER(counters[i], bcnt, pcnt);
++i;
@@ -641,7 +638,8 @@ static void get_old_counters(const struct xt_table_info *t,
struct xt_counters_k *tmp;
tmp = xt_get_per_cpu_counter(&iter->counter_pad, cpu);
- ADD_COUNTER(counters[i], tmp->bcnt, tmp->pcnt);
+ ADD_COUNTER(counters[i], u64_stats_read(&tmp->bcnt),
+ u64_stats_read(&tmp->pcnt));
++i;
}
cond_resched();
@@ -1011,7 +1009,6 @@ static int do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
const struct xt_table_info *private;
int ret = 0;
struct arpt_entry *iter;
- unsigned int addend;
paddc = xt_copy_counters(arg, len, &tmp);
if (IS_ERR(paddc))
@@ -1033,15 +1030,13 @@ static int do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
i = 0;
- addend = xt_write_recseq_begin();
xt_entry_foreach(iter, private->entries, private->size) {
struct xt_counters_k *tmp;
tmp = xt_get_this_cpu_counter(&iter->counter_pad);
- ADD_COUNTER(*tmp, paddc[i].bcnt, paddc[i].pcnt);
+ xt_counter_add(tmp, paddc[i].bcnt, paddc[i].pcnt);
++i;
}
- xt_write_recseq_end(addend);
unlock_up_free:
rcu_read_unlock();
local_bh_enable();
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 95f917f5bceef..c5c90e2f724ba 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -236,7 +236,6 @@ ipt_do_table(void *priv,
unsigned int stackidx, cpu;
const struct xt_table_info *private;
struct xt_action_param acpar;
- unsigned int addend;
/* Initialization */
stackidx = 0;
@@ -258,7 +257,6 @@ ipt_do_table(void *priv,
local_bh_disable();
rcu_read_lock();
private = rcu_dereference(table->priv_info);
- addend = xt_write_recseq_begin();
cpu = smp_processor_id();
table_base = private->entries;
jumpstack = (struct ipt_entry **)private->jumpstack[cpu];
@@ -296,7 +294,7 @@ ipt_do_table(void *priv,
}
counter = xt_get_this_cpu_counter(&e->counter_pad);
- ADD_COUNTER(*counter, skb->len, 1);
+ xt_counter_add(counter, skb->len, 1);
t = ipt_get_target_c(e);
WARN_ON(!t->u.kernel.target);
@@ -354,7 +352,6 @@ ipt_do_table(void *priv,
}
} while (!acpar.hotdrop);
- xt_write_recseq_end(addend);
rcu_read_unlock();
local_bh_enable();
@@ -746,7 +743,7 @@ get_counters(const struct xt_table_info *t,
unsigned int i;
for_each_possible_cpu(cpu) {
- seqcount_t *s = &per_cpu(xt_recseq, cpu);
+ struct u64_stats_sync *syncp = &per_cpu(xt_syncp, cpu);
i = 0;
xt_entry_foreach(iter, t->entries, t->size) {
@@ -756,10 +753,10 @@ get_counters(const struct xt_table_info *t,
tmp = xt_get_per_cpu_counter(&iter->counter_pad, cpu);
do {
- start = read_seqcount_begin(s);
- bcnt = tmp->bcnt;
- pcnt = tmp->pcnt;
- } while (read_seqcount_retry(s, start));
+ start = u64_stats_fetch_begin(syncp);
+ bcnt = u64_stats_read(&tmp->bcnt);
+ pcnt = u64_stats_read(&tmp->pcnt);
+ } while (u64_stats_fetch_retry(syncp, start));
ADD_COUNTER(counters[i], bcnt, pcnt);
++i; /* macro does multi eval of i */
@@ -780,7 +777,8 @@ static void get_old_counters(const struct xt_table_info *t,
const struct xt_counters_k *tmp;
tmp = xt_get_per_cpu_counter(&iter->counter_pad, cpu);
- ADD_COUNTER(counters[i], tmp->bcnt, tmp->pcnt);
+ ADD_COUNTER(counters[i], u64_stats_read(&tmp->bcnt),
+ u64_stats_read(&tmp->pcnt));
++i; /* macro does multi eval of i */
}
@@ -1164,7 +1162,6 @@ do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
const struct xt_table_info *private;
int ret = 0;
struct ipt_entry *iter;
- unsigned int addend;
paddc = xt_copy_counters(arg, len, &tmp);
if (IS_ERR(paddc))
@@ -1185,15 +1182,13 @@ do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
}
i = 0;
- addend = xt_write_recseq_begin();
xt_entry_foreach(iter, private->entries, private->size) {
struct xt_counters_k *tmp;
tmp = xt_get_this_cpu_counter(&iter->counter_pad);
- ADD_COUNTER(*tmp, paddc[i].bcnt, paddc[i].pcnt);
+ xt_counter_add(tmp, paddc[i].bcnt, paddc[i].pcnt);
++i;
}
- xt_write_recseq_end(addend);
unlock_up_free:
rcu_read_unlock();
local_bh_enable();
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index f4877b1b2463e..5a6a592cd53dc 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -259,7 +259,6 @@ ip6t_do_table(void *priv, struct sk_buff *skb,
unsigned int stackidx, cpu;
const struct xt_table_info *private;
struct xt_action_param acpar;
- unsigned int addend;
/* Initialization */
stackidx = 0;
@@ -280,7 +279,6 @@ ip6t_do_table(void *priv, struct sk_buff *skb,
local_bh_disable();
rcu_read_lock();
private = rcu_dereference(table->priv_info);
- addend = xt_write_recseq_begin();
cpu = smp_processor_id();
table_base = private->entries;
jumpstack = (struct ip6t_entry **)private->jumpstack[cpu];
@@ -319,7 +317,7 @@ ip6t_do_table(void *priv, struct sk_buff *skb,
}
counter = xt_get_this_cpu_counter(&e->counter_pad);
- ADD_COUNTER(*counter, skb->len, 1);
+ xt_counter_add(counter, skb->len, 1);
t = ip6t_get_target_c(e);
WARN_ON(!t->u.kernel.target);
@@ -372,7 +370,6 @@ ip6t_do_table(void *priv, struct sk_buff *skb,
break;
} while (!acpar.hotdrop);
- xt_write_recseq_end(addend);
rcu_read_unlock();
local_bh_enable();
@@ -763,7 +760,7 @@ get_counters(const struct xt_table_info *t,
unsigned int i;
for_each_possible_cpu(cpu) {
- seqcount_t *s = &per_cpu(xt_recseq, cpu);
+ struct u64_stats_sync *syncp = &per_cpu(xt_syncp, cpu);
i = 0;
xt_entry_foreach(iter, t->entries, t->size) {
@@ -773,10 +770,10 @@ get_counters(const struct xt_table_info *t,
tmp = xt_get_per_cpu_counter(&iter->counter_pad, cpu);
do {
- start = read_seqcount_begin(s);
- bcnt = tmp->bcnt;
- pcnt = tmp->pcnt;
- } while (read_seqcount_retry(s, start));
+ start = u64_stats_fetch_begin(syncp);
+ bcnt = u64_stats_read(&tmp->bcnt);
+ pcnt = u64_stats_read(&tmp->pcnt);
+ } while (u64_stats_fetch_retry(syncp, start));
ADD_COUNTER(counters[i], bcnt, pcnt);
++i;
@@ -797,7 +794,8 @@ static void get_old_counters(const struct xt_table_info *t,
const struct xt_counters_k *tmp;
tmp = xt_get_per_cpu_counter(&iter->counter_pad, cpu);
- ADD_COUNTER(counters[i], tmp->bcnt, tmp->pcnt);
+ ADD_COUNTER(counters[i], u64_stats_read(&tmp->bcnt),
+ u64_stats_read(&tmp->pcnt));
++i;
}
cond_resched();
@@ -1181,7 +1179,6 @@ do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
const struct xt_table_info *private;
int ret = 0;
struct ip6t_entry *iter;
- unsigned int addend;
paddc = xt_copy_counters(arg, len, &tmp);
if (IS_ERR(paddc))
@@ -1201,15 +1198,13 @@ do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
}
i = 0;
- addend = xt_write_recseq_begin();
xt_entry_foreach(iter, private->entries, private->size) {
struct xt_counters_k *tmp;
tmp = xt_get_this_cpu_counter(&iter->counter_pad);
- ADD_COUNTER(*tmp, paddc[i].bcnt, paddc[i].pcnt);
+ xt_counter_add(tmp, paddc[i].bcnt, paddc[i].pcnt);
++i;
}
- xt_write_recseq_end(addend);
unlock_up_free:
rcu_read_unlock();
local_bh_enable();
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index e3f48539f81d6..d9a31ee920fc0 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1330,8 +1330,8 @@ void xt_compat_unlock(u_int8_t af)
EXPORT_SYMBOL_GPL(xt_compat_unlock);
#endif
-DEFINE_PER_CPU(seqcount_t, xt_recseq);
-EXPORT_PER_CPU_SYMBOL_GPL(xt_recseq);
+DEFINE_PER_CPU(struct u64_stats_sync, xt_syncp);
+EXPORT_PER_CPU_SYMBOL_GPL(xt_syncp);
struct static_key xt_tee_enabled __read_mostly;
EXPORT_SYMBOL_GPL(xt_tee_enabled);
@@ -1977,7 +1977,7 @@ static int __init xt_init(void)
int rv;
for_each_possible_cpu(i) {
- seqcount_init(&per_cpu(xt_recseq, i));
+ u64_stats_init(&per_cpu(xt_syncp, i));
}
xt = kcalloc(NFPROTO_NUMPROTO, sizeof(struct xt_af), GFP_KERNEL);
--
2.47.2
* Re: [PATCH net-next 2/3] netfilter: Split the xt_counters type between kernel and user.
2025-02-16 12:51 ` [PATCH net-next 2/3] netfilter: Split the xt_counters type between kernel and user Sebastian Andrzej Siewior
@ 2025-02-16 15:22 ` kernel test robot
0 siblings, 0 replies; 12+ messages in thread
From: kernel test robot @ 2025-02-16 15:22 UTC (permalink / raw)
To: Sebastian Andrzej Siewior, netfilter-devel, coreteam,
linux-rt-devel
Cc: oe-kbuild-all, Pablo Neira Ayuso, Jozsef Kadlecsik,
Thomas Gleixner, Sebastian Andrzej Siewior
Hi Sebastian,
kernel test robot noticed the following build warnings:
[auto build test WARNING on net-next/main]
url: https://github.com/intel-lab-lkp/linux/commits/Sebastian-Andrzej-Siewior/netfilter-Make-xt_table-private-RCU-protected/20250216-210418
base: net-next/main
patch link: https://lore.kernel.org/r/20250216125135.3037967-3-bigeasy%40linutronix.de
patch subject: [PATCH net-next 2/3] netfilter: Split the xt_counters type between kernel and user.
config: i386-buildonly-randconfig-001-20250216 (https://download.01.org/0day-ci/archive/20250216/202502162342.iQ248XrR-lkp@intel.com/config)
compiler: gcc-12 (Debian 12.2.0-14) 12.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20250216/202502162342.iQ248XrR-lkp@intel.com/reproduce)
If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <lkp@intel.com>
| Closes: https://lore.kernel.org/oe-kbuild-all/202502162342.iQ248XrR-lkp@intel.com/
All warnings (new ones prefixed by >>):
>> net/netfilter/x_tables.c:1912: warning: Function parameter or struct member 'xt_pad' not described in 'xt_percpu_counter_alloc'
>> net/netfilter/x_tables.c:1912: warning: Excess function parameter 'counter' description in 'xt_percpu_counter_alloc'
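The warnings refer to the kernel-doc comment above xt_percpu_counter_alloc(),
which still documents the old parameter name after the rename to xt_pad; a
possible fix (a sketch, not part of this report) would be:

	- * @counter: pointer to counter struct inside the ip(6)/arpt_entry struct
	+ * @xt_pad: pointer to the counter pad inside the ip(6)/arpt_entry struct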
vim +1912 net/netfilter/x_tables.c
2e4e6a17af35be Harald Welte 2006-01-12 1887
f28e15bacedd44 Florian Westphal 2016-11-22 1888 /**
f28e15bacedd44 Florian Westphal 2016-11-22 1889 * xt_percpu_counter_alloc - allocate x_tables rule counter
f28e15bacedd44 Florian Westphal 2016-11-22 1890 *
ae0ac0ed6fcf5a Florian Westphal 2016-11-22 1891 * @state: pointer to xt_percpu allocation state
f28e15bacedd44 Florian Westphal 2016-11-22 1892 * @counter: pointer to counter struct inside the ip(6)/arpt_entry struct
f28e15bacedd44 Florian Westphal 2016-11-22 1893 *
f28e15bacedd44 Florian Westphal 2016-11-22 1894 * On SMP, the packet counter [ ip(6)t_entry->counters.pcnt ] will then
f28e15bacedd44 Florian Westphal 2016-11-22 1895 * contain the address of the real (percpu) counter.
f28e15bacedd44 Florian Westphal 2016-11-22 1896 *
f28e15bacedd44 Florian Westphal 2016-11-22 1897 * Rule evaluation needs to use xt_get_this_cpu_counter() helper
f28e15bacedd44 Florian Westphal 2016-11-22 1898 * to fetch the real percpu counter.
f28e15bacedd44 Florian Westphal 2016-11-22 1899 *
ae0ac0ed6fcf5a Florian Westphal 2016-11-22 1900 * To speed up allocation and improve data locality, a 4kb block is
9ba5c404bf1d62 Ben Hutchings 2018-03-29 1901 * allocated. Freeing any counter may free an entire block, so all
9ba5c404bf1d62 Ben Hutchings 2018-03-29 1902 * counters allocated using the same state must be freed at the same
9ba5c404bf1d62 Ben Hutchings 2018-03-29 1903 * time.
ae0ac0ed6fcf5a Florian Westphal 2016-11-22 1904 *
ae0ac0ed6fcf5a Florian Westphal 2016-11-22 1905 * xt_percpu_counter_alloc_state contains the base address of the
ae0ac0ed6fcf5a Florian Westphal 2016-11-22 1906 * allocated page and the current sub-offset.
ae0ac0ed6fcf5a Florian Westphal 2016-11-22 1907 *
f28e15bacedd44 Florian Westphal 2016-11-22 1908 * returns false on error.
f28e15bacedd44 Florian Westphal 2016-11-22 1909 */
ae0ac0ed6fcf5a Florian Westphal 2016-11-22 1910 bool xt_percpu_counter_alloc(struct xt_percpu_counter_alloc_state *state,
e99ca3c3e3ed04 Sebastian Andrzej Siewior 2025-02-16 1911 struct xt_counter_pad *xt_pad)
f28e15bacedd44 Florian Westphal 2016-11-22 @1912 {
e99ca3c3e3ed04 Sebastian Andrzej Siewior 2025-02-16 1913 union xt_counter_k *xt_cnt = (union xt_counter_k *)xt_pad;
e99ca3c3e3ed04 Sebastian Andrzej Siewior 2025-02-16 1914
e99ca3c3e3ed04 Sebastian Andrzej Siewior 2025-02-16 1915 BUILD_BUG_ON(XT_PCPU_BLOCK_SIZE < (sizeof(struct xt_counters_k) * 2));
e99ca3c3e3ed04 Sebastian Andrzej Siewior 2025-02-16 1916 BUILD_BUG_ON(sizeof(struct xt_counters_k) != sizeof(struct xt_counters));
e99ca3c3e3ed04 Sebastian Andrzej Siewior 2025-02-16 1917 BUILD_BUG_ON(sizeof(struct xt_counters_k) != sizeof(struct xt_counter_pad));
f28e15bacedd44 Florian Westphal 2016-11-22 1918
f28e15bacedd44 Florian Westphal 2016-11-22 1919 if (nr_cpu_ids <= 1)
f28e15bacedd44 Florian Westphal 2016-11-22 1920 return true;
f28e15bacedd44 Florian Westphal 2016-11-22 1921
ae0ac0ed6fcf5a Florian Westphal 2016-11-22 1922 if (!state->mem) {
ae0ac0ed6fcf5a Florian Westphal 2016-11-22 1923 state->mem = __alloc_percpu(XT_PCPU_BLOCK_SIZE,
ae0ac0ed6fcf5a Florian Westphal 2016-11-22 1924 XT_PCPU_BLOCK_SIZE);
ae0ac0ed6fcf5a Florian Westphal 2016-11-22 1925 if (!state->mem)
f28e15bacedd44 Florian Westphal 2016-11-22 1926 return false;
ae0ac0ed6fcf5a Florian Westphal 2016-11-22 1927 }
e99ca3c3e3ed04 Sebastian Andrzej Siewior 2025-02-16 1928 xt_cnt->pcpu = state->mem + state->off;
e99ca3c3e3ed04 Sebastian Andrzej Siewior 2025-02-16 1929 state->off += sizeof(struct xt_counters_k);
e99ca3c3e3ed04 Sebastian Andrzej Siewior 2025-02-16 1930 if (state->off > (XT_PCPU_BLOCK_SIZE - sizeof(struct xt_counters_k))) {
ae0ac0ed6fcf5a Florian Westphal 2016-11-22 1931 state->mem = NULL;
ae0ac0ed6fcf5a Florian Westphal 2016-11-22 1932 state->off = 0;
ae0ac0ed6fcf5a Florian Westphal 2016-11-22 1933 }
f28e15bacedd44 Florian Westphal 2016-11-22 1934 return true;
f28e15bacedd44 Florian Westphal 2016-11-22 1935 }
f28e15bacedd44 Florian Westphal 2016-11-22 1936 EXPORT_SYMBOL_GPL(xt_percpu_counter_alloc);
f28e15bacedd44 Florian Westphal 2016-11-22 1937
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
* Re: [PATCH net-next 1/3] netfilter: Make xt_table::private RCU protected.
2025-02-16 12:51 ` [PATCH net-next 1/3] netfilter: Make xt_table::private RCU protected Sebastian Andrzej Siewior
@ 2025-02-17 14:05 ` Florian Westphal
2025-02-17 14:57 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 12+ messages in thread
From: Florian Westphal @ 2025-02-17 14:05 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: netfilter-devel, coreteam, linux-rt-devel, Pablo Neira Ayuso,
Jozsef Kadlecsik, Thomas Gleixner
Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> The seqcount xt_recseq is used to synchronize the replacement of
> xt_table::private in xt_replace_table() against all readers such as
> ipt_do_table(). After the pointer is replaced, xt_register_target()
> iterates over all per-CPU xt_recseq to ensure that none of CPUs is
> within the critical section.
> Once this is done, the old pointer can be examined and deallocated
> safely.
>
> This can also be achieved with RCU: Each reader of the private pointer
> will be within an RCU read section. The new pointer will be published
> with rcu_assign_pointer() and synchronize_rcu() is used to wait until
> every reader has left its critical section.
Note we had this before and it was reverted in
d3d40f237480 ("Revert "netfilter: x_tables: Switch synchronization to RCU"")
I'm not saying it's wrong, but I think you need a plan B when the same
complaints wrt table replace slowdown come in.
And unfortunately I can't find a solution for this, unless we keep
either the existing wait-scheme for the counters' sake or we accept
that some counter update might be lost between copy to userland and
destruction (it would need to use rcu_work or similar, the xt
target/module destroy callbacks can sleep).
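To make the trade-off concrete, here is a minimal sketch of the reader/writer
pattern being discussed (simplified pseudo-kernel code, not the exact hunks
from the series):

	/* reader side, e.g. ipt_do_table() */
	rcu_read_lock();
	private = rcu_dereference(table->private);
	/* ... walk the ruleset, update per-rule counters ... */
	rcu_read_unlock();

	/* writer side, xt_replace_table(), called with the table mutex held */
	oldinfo = rcu_dereference_protected(table->private, 1);
	rcu_assign_pointer(table->private, newinfo);
	synchronize_rcu();	/* waits until every reader has left its section */
	/* only now may oldinfo be examined (counters copied) and freed */

The per-replace synchronize_rcu() is where the earlier slowdown complaints
most likely came from.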
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next 1/3] netfilter: Make xt_table::private RCU protected.
2025-02-17 14:05 ` Florian Westphal
@ 2025-02-17 14:57 ` Sebastian Andrzej Siewior
2025-02-17 15:35 ` Florian Westphal
0 siblings, 1 reply; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-17 14:57 UTC (permalink / raw)
To: Florian Westphal
Cc: netfilter-devel, coreteam, linux-rt-devel, Pablo Neira Ayuso,
Jozsef Kadlecsik, Thomas Gleixner
On 2025-02-17 15:05:38 [+0100], Florian Westphal wrote:
> Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> > The seqcount xt_recseq is used to synchronize the replacement of
> > xt_table::private in xt_replace_table() against all readers such as
> > ipt_do_table(). After the pointer is replaced, xt_replace_table()
> > iterates over all per-CPU xt_recseq to ensure that none of the CPUs is
> > within the critical section.
> > Once this is done, the old pointer can be examined and deallocated
> > safely.
> >
> > This can also be achieved with RCU: Each reader of the private pointer
> > will be within an RCU read section. The new pointer will be published
> > with rcu_assign_pointer() and synchronize_rcu() is used to wait until
> > every reader has left its critical section.
>
> Note we had this before and it was reverted in
> d3d40f237480 ("Revert "netfilter: x_tables: Switch synchronization to RCU"")
>
> I'm not saying it's wrong, but I think you need a plan B when the same
> complaints wrt table replace slowdown come in.
>
> And unfortunately I can't find a solution for this, unless we keep
> either the existing wait-scheme for the counters' sake or we accept
> that some counter update might be lost between copy to userland and
> destruction (it would need to use rcu_work or similar, the xt
> target/module destroy callbacks can sleep).
Urgh. Is this fast & frequent update a real-world thing or a benchmark
of some sort? I mean adding rule after rule is possible but…
I used synchronize_rcu() here and there is also
synchronize_rcu_expedited(), but I do hate it. With everything.
What are the counters used in userland for? I've seen that they are
copied but did not understand why.
iptables-legacy -A INPUT -j ACCEPT
ends up in xt_replace_table() but iptables-nft doesn't. Different
interface, better design? Or did I just use legacy while nft is now
considered the only one?
I see two invocations on iptables-legacy-restore.
But the question remains: Why copy counters after replacing the rule
set?
Sebastian
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next 1/3] netfilter: Make xt_table::private RCU protected.
2025-02-17 14:57 ` Sebastian Andrzej Siewior
@ 2025-02-17 15:35 ` Florian Westphal
2025-02-17 15:56 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 12+ messages in thread
From: Florian Westphal @ 2025-02-17 15:35 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Florian Westphal, netfilter-devel, coreteam, linux-rt-devel,
Pablo Neira Ayuso, Jozsef Kadlecsik, Thomas Gleixner
Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> > > This can also be achieved with RCU: Each reader of the private pointer
> > > will be within an RCU read section. The new pointer will be published
> > > with rcu_assign_pointer() and synchronize_rcu() is used to wait until
> > > every reader has left its critical section.
> >
> > Note we had this before and it was reverted in
> > d3d40f237480 ("Revert "netfilter: x_tables: Switch synchronization to RCU"")
> >
> > I'm not saying it's wrong, but I think you need a plan B when the same
> > complaints wrt table replace slowdown come in.
> >
> > And unfortunately I can't find a solution for this, unless we keep
> > either the existing wait-scheme for the counters' sake or we accept
> > that some counter update might be lost between copy to userland and
> > destruction (it would need to use rcu_work or similar, the xt
> > target/module destroy callbacks can sleep).
>
> Urgh. Is this fast & frequent update a real-world thing or a benchmark
> of some sort? I mean adding rule after rule is possible but…
No idea, it's certainly bad design (iptables-restore exists for a
reason).
> I used synchronize_rcu() here and there is also
> synchronize_rcu_expedited(), but I do hate it. With everything.
Agreed.
> What are the counters used in userland for? I've seen that they are
> copied but did not understand why.
> iptables-legacy -A INPUT -j ACCEPT
> ends up in xt_replace_table() but iptables-nft doesn't. Different
> interface, better design? Or did I just use legacy while nft is now
> considered the only one?
iptables-nft uses the nftables netlink API, I don't think it suffers
from the preempt_rt issues you're resolving in the setsockopt api.
It won't go anywhere near xt_replace_table and it doesn't use the
xtables binary format on the user/kernel api side.
> I see two invocations on iptables-legacy-restore.
>
> But the question remains: Why copy counters after replacing the rule
> set?
Good question. What I can gather from
https://git.netfilter.org/iptables/tree/libiptc/libiptc.c#n2597
After replace we get a copy of the old counters, depending on mode
we can force-update counters post-replace in kernel, via
ret = setsockopt(handle->sockfd, TC_IPPROTO, SO_SET_ADD_COUNTERS,
newcounters, counterlen);
I suspect this is what's used by 'iptables -L -v -Z INPUT'.
But I don't know if anyone uses this in practice.
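Spelled out, the two-step flow described above looks roughly like this (a
sketch reusing libiptc's shorthand names from the snippet; not the verbatim
library code):

	/* 1) Replace the whole ruleset; the kernel copies the counters of the
	 *    table being replaced back into the user-supplied buffer.
	 */
	ret = setsockopt(handle->sockfd, TC_IPPROTO, SO_SET_REPLACE,
			 repl, sizeof(*repl) + repl->size);

	/* 2) Optionally push adjusted counters back, so traffic counted while
	 *    userspace was rebuilding the table is not lost.
	 */
	ret = setsockopt(handle->sockfd, TC_IPPROTO, SO_SET_ADD_COUNTERS,
			 newcounters, counterlen);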
Maybe Pablo remembers the details, this is ancient code by kernel
standards.
I think we might get away with losing counters on the update
side, i.e. rcu_assign_pointer() so new CPUs won't find the old
binary blob, copy the counters, then defer destruction via rcu_work
or similar and accept that the counters might have marginally
changed after the copy to userland and before the last CPU left its
RCU critical section.
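In code, that idea could look roughly like the sketch below; the free_rwork
member and its placement are assumptions for illustration, while
INIT_RCU_WORK()/queue_rcu_work()/to_rcu_work() are the existing workqueue
helpers:

	/* hypothetical: embed an rcu_work in the table blob */
	struct xt_table_info {
		/* ... existing fields ... */
		struct rcu_work free_rwork;
	};

	static void xt_table_info_free_work(struct work_struct *work)
	{
		struct rcu_work *rwork = to_rcu_work(work);
		struct xt_table_info *old;

		old = container_of(rwork, struct xt_table_info, free_rwork);
		xt_free_table_info(old);	/* may sleep, hence a workqueue, not call_rcu() */
	}

	/* in xt_replace_table(), after rcu_assign_pointer(table->private, newinfo): */
	get_counters(oldinfo, counters);	/* snapshot now; may miss in-flight updates */
	INIT_RCU_WORK(&oldinfo->free_rwork, xt_table_info_free_work);
	queue_rcu_work(system_wq, &oldinfo->free_rwork);	/* free after a grace period */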
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next 1/3] netfilter: Make xt_table::private RCU protected.
2025-02-17 15:35 ` Florian Westphal
@ 2025-02-17 15:56 ` Sebastian Andrzej Siewior
2025-02-17 16:20 ` Florian Westphal
0 siblings, 1 reply; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-17 15:56 UTC (permalink / raw)
To: Florian Westphal
Cc: netfilter-devel, coreteam, linux-rt-devel, Pablo Neira Ayuso,
Jozsef Kadlecsik, Thomas Gleixner
On 2025-02-17 16:35:48 [+0100], Florian Westphal wrote:
>
> > What are the counters used in userland for? I've seen that they are
> > copied but did not understand why.
> > iptables-legacy -A INPUT -j ACCEPT
> > ends up in xt_replace_table() but iptables-nft doesn't. Different
> > interface, better design? Or did I just use legacy while nft is now
> > considered the only one?
>
> iptables-nft uses the nftables netlink API, I don't think it suffers
> from the preempt_rt issues you're resolving in the setsockopt api.
The thing is, it is there, so I would like to see it fixed one way or another.
But if this is the old API, it means the counter change in #3 is not
beneficial to iptables-nft.
> It won't go anywhere near xt_replace_table and it doesn't use the
> xtables binary format on the user/kernel api side.
>
> > I see two invocations on iptables-legacy-restore.
> >
> > But the question remains: Why copy counters after replacing the rule
> > set?
>
> Good question. What I can gather from
> https://git.netfilter.org/iptables/tree/libiptc/libiptc.c#n2597
>
> After replace we get a copy of the old counters, depending on mode
> we can force-update counters post-replace in kernel, via
>
> ret = setsockopt(handle->sockfd, TC_IPPROTO, SO_SET_ADD_COUNTERS,
> newcounters, counterlen);
>
> I suspect this is what's used by 'iptables -L -v -Z INPUT'.
> But I don't know if anyone uses this in practice.
Oh. Wonderful. So even if we skip counter updates after pointer
replacement, the damage is very limited.
> Maybe Pablo remembers the details, this is ancient code by kernel
> standards.
>
> I think we might get away with losing counters on the update
> side, i.e. rcu_assign_pointer() so new CPUs won't find the old
> binary blob, copy the counters, then defer destruction via rcu_work
> or similar and accept that the counters might have marginally
> changed after the copy to userland and before the last CPU left its
> RCU critical section.
I could do that if we want to accelerate it. That is if we don't have
the muscle to point people to iptables-nft or iptables-legacy-restore.
Speaking of: Are there plans to remove the legacy interface? This could
be used as a way to meet the users ;) But seriously, the nft part has been
around for quite some time and there is no downside to it.
Sebastian
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next 1/3] netfilter: Make xt_table::private RCU protected.
2025-02-17 15:56 ` Sebastian Andrzej Siewior
@ 2025-02-17 16:20 ` Florian Westphal
2025-02-18 8:18 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 12+ messages in thread
From: Florian Westphal @ 2025-02-17 16:20 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Florian Westphal, netfilter-devel, coreteam, linux-rt-devel,
Pablo Neira Ayuso, Jozsef Kadlecsik, Thomas Gleixner
Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> On 2025-02-17 16:35:48 [+0100], Florian Westphal wrote:
> > I suspect this is what's used by 'iptables -L -v -Z INPUT'.
> > But I don't know if anyone uses this in practice.
>
> Oh. Wonderful. So even if we skip counter updates after pointer
> replacement, the damage is very limited.
Yes, I think so.
> > I think we might get away with losing counters on the update
> > side, i.e. rcu_assign_pointer() so new CPUs won't find the old
> > binary blob, copy the counters, then defer destruction via rcu_work
> > or similar and accept that the counters might have marginally
> > changed after the copy to userland and before the last CPU left its
> > RCU critical section.
>
> I could do that if we want to accelerate it. That is if we don't have
> the muscle to point people to iptables-nft or iptables-legacy-restore.
>
> > Speaking of: Are there plans to remove the legacy interface? This could
> > be used as a way to meet the users ;) But seriously, the nft part has been
> > around for quite some time and there is no downside to it.
Yes, since 6c959fd5e17387201dba3619b2e6af213939a0a7
the legacy symbol is user-visible, so the next step is to replace
various "select ...TABLES_LEGACY" with "depends on" clauses.
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next 1/3] netfilter: Make xt_table::private RCU protected.
2025-02-17 16:20 ` Florian Westphal
@ 2025-02-18 8:18 ` Sebastian Andrzej Siewior
2025-02-18 12:46 ` Florian Westphal
0 siblings, 1 reply; 12+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-18 8:18 UTC (permalink / raw)
To: Florian Westphal
Cc: netfilter-devel, coreteam, linux-rt-devel, Pablo Neira Ayuso,
Jozsef Kadlecsik, Thomas Gleixner
On 2025-02-17 17:20:53 [+0100], Florian Westphal wrote:
> > > I think we might get away with losing counters on the update
> > > side, i.e. rcu_assign_pointer() so new CPUs won't find the old
> > > binary blob, copy the counters, then defer destruction via rcu_work
> > > or similar and accept that the counters might have marginally
> > > changed after the copy to userland and before the last CPU left its
> > > RCU critical section.
> >
> > I could do that if we want to accelerate it. That is if we don't have
> > the muscle to point people to iptables-nft or iptables-legacy-restore.
> >
> > > Speaking of: Are there plans to remove the legacy interface? This could
> > > be used as a way to meet the users ;) But seriously, the nft part has been
> > > around for quite some time and there is no downside to it.
>
> Yes, since 6c959fd5e17387201dba3619b2e6af213939a0a7
> the legacy symbol is user-visible, so the next step is to replace
> various "select ...TABLES_LEGACY" with "depends on" clauses.
Okay. So I would repost the series fixing what the bot complained about in
2/3. The action in case people complain about slow insertion would be:
- Use iptables-legacy-restore if mass insertion is performance critical.
- Use iptables-nft, which does not have this problem.
- If both options don't work, copy the counters immediately, risking
missing in-flight updates, and free the memory after a grace period.
Any objections?
Sebastian
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [PATCH net-next 1/3] netfilter: Make xt_table::private RCU protected.
2025-02-18 8:18 ` Sebastian Andrzej Siewior
@ 2025-02-18 12:46 ` Florian Westphal
0 siblings, 0 replies; 12+ messages in thread
From: Florian Westphal @ 2025-02-18 12:46 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: Florian Westphal, netfilter-devel, coreteam, linux-rt-devel,
Pablo Neira Ayuso, Jozsef Kadlecsik, Thomas Gleixner
Sebastian Andrzej Siewior <bigeasy@linutronix.de> wrote:
> > Yes, since 6c959fd5e17387201dba3619b2e6af213939a0a7
> > the legacy symbol is user-visible, so the next step is to replace
> > various "select ...TABLES_LEGACY" with "depends on" clauses.
>
> Okay. So I would repost the series fixing what the bot complained about in
> 2/3. The action in case people complain about slow insertion would be:
> - Use iptables-legacy-restore if mass insertion is performance critical.
> - Use iptables-nft, which does not have this problem.
> - If both options don't work, copy the counters immediately, risking
> missing in-flight updates, and free the memory after a grace period.
Seems like a good plan, thanks Sebastian.
> Any objections?
Not from my side.
^ permalink raw reply [flat|nested] 12+ messages in thread
end of thread (newest: 2025-02-18 12:46 UTC)
Thread overview: 12+ messages
2025-02-16 12:51 [PATCH net-next 0/3] Replace xt_recseq with u64_stats Sebastian Andrzej Siewior
2025-02-16 12:51 ` [PATCH net-next 1/3] netfilter: Make xt_table::private RCU protected Sebastian Andrzej Siewior
2025-02-17 14:05 ` Florian Westphal
2025-02-17 14:57 ` Sebastian Andrzej Siewior
2025-02-17 15:35 ` Florian Westphal
2025-02-17 15:56 ` Sebastian Andrzej Siewior
2025-02-17 16:20 ` Florian Westphal
2025-02-18 8:18 ` Sebastian Andrzej Siewior
2025-02-18 12:46 ` Florian Westphal
2025-02-16 12:51 ` [PATCH net-next 2/3] netfilter: Split the xt_counters type between kernel and user Sebastian Andrzej Siewior
2025-02-16 15:22 ` kernel test robot
2025-02-16 12:51 ` [PATCH net-next 3/3] netfilter: Use u64_stats for counters in xt_counters_k Sebastian Andrzej Siewior