* [PATCH net-next v2 1/3] netfilter: Make xt_table::private RCU protected.
2025-02-21 13:31 [PATCH net-next v2 0/3] Replace xt_recseq with u64_stats Sebastian Andrzej Siewior
@ 2025-02-21 13:31 ` Sebastian Andrzej Siewior
2025-02-21 13:31 ` [PATCH net-next v2 2/3] netfilter: Split the xt_counters type between kernel and user Sebastian Andrzej Siewior
` (3 subsequent siblings)
4 siblings, 0 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-21 13:31 UTC (permalink / raw)
To: netfilter-devel, coreteam, linux-rt-devel
Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Thomas Gleixner,
Sebastian Andrzej Siewior
The seqcount xt_recseq is used to synchronize the replacement of
xt_table::private in xt_replace_table() against all readers such as
ipt_do_table(). After the pointer is replaced, xt_replace_table()
iterates over all per-CPU xt_recseq to ensure that no CPU is still
within the critical section.
Once this is done, the old pointer can be examined and deallocated
safely.
This can also be achieved with RCU: Each reader of the private pointer
will be within an RCU read-side critical section. The new pointer will be
published with rcu_assign_pointer() and synchronize_rcu() is used to wait
until each reader has left its critical section.
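For illustration, the resulting reader/writer pattern boils down to the
following (condensed from the diff below):

        /* reader side, e.g. ipt_do_table() */
        rcu_read_lock();
        private = rcu_dereference(table->priv_info);
        /* ... walk private->entries ... */
        rcu_read_unlock();

        /* writer side, xt_replace_table() */
        rcu_assign_pointer(table->priv_info, newinfo);
        synchronize_rcu();      /* wait for all readers to leave */
        /* the old private can now be examined and freed */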
Should this lead to a drop in performance due to synchronize_rcu() in the
replacement path, there are possible workarounds:
- Use iptables-legacy-restore instead of multiple iptables-legacy invocations
- Use iptables-nft instead of iptables-legacy
- Don't copy the counters after the replacement at all, or simply don't
wait for the in-flight counters and just copy what is there.
Use RCU to assign xt_table::private and synchronise against readers.
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/netfilter/x_tables.h | 8 ++++-
net/ipv4/netfilter/arp_tables.c | 20 +++++++-----
net/ipv4/netfilter/ip_tables.c | 20 +++++++-----
net/ipv6/netfilter/ip6_tables.c | 20 +++++++-----
net/netfilter/x_tables.c | 50 ++++++++++++------------------
5 files changed, 62 insertions(+), 56 deletions(-)
diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index f39f688d72852..b9cd82e845d08 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -227,7 +227,7 @@ struct xt_table {
unsigned int valid_hooks;
/* Man behind the curtain... */
- struct xt_table_info *private;
+ struct xt_table_info __rcu *priv_info;
/* hook ops that register the table with the netfilter core */
struct nf_hook_ops *ops;
@@ -345,6 +345,12 @@ void xt_free_table_info(struct xt_table_info *info);
*/
DECLARE_PER_CPU(seqcount_t, xt_recseq);
+bool xt_af_lock_held(u_int8_t af);
+static inline struct xt_table_info *nf_table_private(const struct xt_table *table)
+{
+ return rcu_dereference_check(table->priv_info, xt_af_lock_held(table->af));
+}
+
/* xt_tee_enabled - true if x_tables needs to handle reentrancy
*
* Enabled if current ip(6)tables ruleset has at least one -j TEE rule.
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index 1cdd9c28ab2da..0628e68910f7f 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -203,8 +203,9 @@ unsigned int arpt_do_table(void *priv,
outdev = state->out ? state->out->name : nulldevname;
local_bh_disable();
+ rcu_read_lock();
addend = xt_write_recseq_begin();
- private = READ_ONCE(table->private); /* Address dependency. */
+ private = rcu_dereference(table->priv_info);
cpu = smp_processor_id();
table_base = private->entries;
jumpstack = (struct arpt_entry **)private->jumpstack[cpu];
@@ -279,6 +280,7 @@ unsigned int arpt_do_table(void *priv,
}
} while (!acpar.hotdrop);
xt_write_recseq_end(addend);
+ rcu_read_unlock();
local_bh_enable();
if (acpar.hotdrop)
@@ -648,9 +650,9 @@ static void get_old_counters(const struct xt_table_info *t,
static struct xt_counters *alloc_counters(const struct xt_table *table)
{
+ const struct xt_table_info *private = nf_table_private(table);
unsigned int countersize;
struct xt_counters *counters;
- const struct xt_table_info *private = table->private;
/* We need atomic snapshot of counters: rest doesn't change
* (other than comefrom, which userspace doesn't care
@@ -671,10 +673,10 @@ static int copy_entries_to_user(unsigned int total_size,
const struct xt_table *table,
void __user *userptr)
{
+ struct xt_table_info *private = nf_table_private(table);
unsigned int off, num;
const struct arpt_entry *e;
struct xt_counters *counters;
- struct xt_table_info *private = table->private;
int ret = 0;
void *loc_cpu_entry;
@@ -808,7 +810,7 @@ static int get_info(struct net *net, void __user *user, const int *len)
t = xt_request_find_table_lock(net, NFPROTO_ARP, name);
if (!IS_ERR(t)) {
struct arpt_getinfo info;
- const struct xt_table_info *private = t->private;
+ const struct xt_table_info *private = nf_table_private(t);
#ifdef CONFIG_NETFILTER_XTABLES_COMPAT
struct xt_table_info tmp;
@@ -861,7 +863,7 @@ static int get_entries(struct net *net, struct arpt_get_entries __user *uptr,
t = xt_find_table_lock(net, NFPROTO_ARP, get.name);
if (!IS_ERR(t)) {
- const struct xt_table_info *private = t->private;
+ const struct xt_table_info *private = nf_table_private(t);
if (get.size == private->size)
ret = copy_entries_to_user(private->size,
@@ -1022,7 +1024,8 @@ static int do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
}
local_bh_disable();
- private = t->private;
+ rcu_read_lock();
+ private = rcu_dereference(t->priv_info);
if (private->number != tmp.num_counters) {
ret = -EINVAL;
goto unlock_up_free;
@@ -1040,6 +1043,7 @@ static int do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
}
xt_write_recseq_end(addend);
unlock_up_free:
+ rcu_read_unlock();
local_bh_enable();
xt_table_unlock(t);
module_put(t->me);
@@ -1340,8 +1344,8 @@ static int compat_copy_entries_to_user(unsigned int total_size,
struct xt_table *table,
void __user *userptr)
{
+ const struct xt_table_info *private = nf_table_private(table);
struct xt_counters *counters;
- const struct xt_table_info *private = table->private;
void __user *pos;
unsigned int size;
int ret = 0;
@@ -1390,7 +1394,7 @@ static int compat_get_entries(struct net *net,
xt_compat_lock(NFPROTO_ARP);
t = xt_find_table_lock(net, NFPROTO_ARP, get.name);
if (!IS_ERR(t)) {
- const struct xt_table_info *private = t->private;
+ const struct xt_table_info *private = nf_table_private(t);
struct xt_table_info info;
ret = compat_table_info(private, &info);
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 3d101613f27fa..20e8b46af8876 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -256,8 +256,9 @@ ipt_do_table(void *priv,
WARN_ON(!(table->valid_hooks & (1 << hook)));
local_bh_disable();
+ rcu_read_lock();
+ private = rcu_dereference(table->priv_info);
addend = xt_write_recseq_begin();
- private = READ_ONCE(table->private); /* Address dependency. */
cpu = smp_processor_id();
table_base = private->entries;
jumpstack = (struct ipt_entry **)private->jumpstack[cpu];
@@ -354,6 +355,7 @@ ipt_do_table(void *priv,
} while (!acpar.hotdrop);
xt_write_recseq_end(addend);
+ rcu_read_unlock();
local_bh_enable();
if (acpar.hotdrop)
@@ -788,9 +790,9 @@ static void get_old_counters(const struct xt_table_info *t,
static struct xt_counters *alloc_counters(const struct xt_table *table)
{
+ const struct xt_table_info *private = nf_table_private(table);
unsigned int countersize;
struct xt_counters *counters;
- const struct xt_table_info *private = table->private;
/* We need atomic snapshot of counters: rest doesn't change
(other than comefrom, which userspace doesn't care
@@ -811,10 +813,10 @@ copy_entries_to_user(unsigned int total_size,
const struct xt_table *table,
void __user *userptr)
{
+ const struct xt_table_info *private = nf_table_private(table);
unsigned int off, num;
const struct ipt_entry *e;
struct xt_counters *counters;
- const struct xt_table_info *private = table->private;
int ret = 0;
const void *loc_cpu_entry;
@@ -963,7 +965,7 @@ static int get_info(struct net *net, void __user *user, const int *len)
t = xt_request_find_table_lock(net, AF_INET, name);
if (!IS_ERR(t)) {
struct ipt_getinfo info;
- const struct xt_table_info *private = t->private;
+ const struct xt_table_info *private = nf_table_private(t);
#ifdef CONFIG_NETFILTER_XTABLES_COMPAT
struct xt_table_info tmp;
@@ -1017,7 +1019,7 @@ get_entries(struct net *net, struct ipt_get_entries __user *uptr,
t = xt_find_table_lock(net, AF_INET, get.name);
if (!IS_ERR(t)) {
- const struct xt_table_info *private = t->private;
+ const struct xt_table_info *private = nf_table_private(t);
if (get.size == private->size)
ret = copy_entries_to_user(private->size,
t, uptr->entrytable);
@@ -1175,7 +1177,8 @@ do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
}
local_bh_disable();
- private = t->private;
+ rcu_read_lock();
+ private = rcu_dereference(t->priv_info);
if (private->number != tmp.num_counters) {
ret = -EINVAL;
goto unlock_up_free;
@@ -1192,6 +1195,7 @@ do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
}
xt_write_recseq_end(addend);
unlock_up_free:
+ rcu_read_unlock();
local_bh_enable();
xt_table_unlock(t);
module_put(t->me);
@@ -1550,8 +1554,8 @@ static int
compat_copy_entries_to_user(unsigned int total_size, struct xt_table *table,
void __user *userptr)
{
+ const struct xt_table_info *private = nf_table_private(table);
struct xt_counters *counters;
- const struct xt_table_info *private = table->private;
void __user *pos;
unsigned int size;
int ret = 0;
@@ -1597,7 +1601,7 @@ compat_get_entries(struct net *net, struct compat_ipt_get_entries __user *uptr,
xt_compat_lock(AF_INET);
t = xt_find_table_lock(net, AF_INET, get.name);
if (!IS_ERR(t)) {
- const struct xt_table_info *private = t->private;
+ const struct xt_table_info *private = nf_table_private(t);
struct xt_table_info info;
ret = compat_table_info(private, &info);
if (!ret && get.size == info.size)
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index 7d5602950ae72..c12d489a09840 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -278,8 +278,9 @@ ip6t_do_table(void *priv, struct sk_buff *skb,
WARN_ON(!(table->valid_hooks & (1 << hook)));
local_bh_disable();
+ rcu_read_lock();
+ private = rcu_dereference(table->priv_info);
addend = xt_write_recseq_begin();
- private = READ_ONCE(table->private); /* Address dependency. */
cpu = smp_processor_id();
table_base = private->entries;
jumpstack = (struct ip6t_entry **)private->jumpstack[cpu];
@@ -372,6 +373,7 @@ ip6t_do_table(void *priv, struct sk_buff *skb,
} while (!acpar.hotdrop);
xt_write_recseq_end(addend);
+ rcu_read_unlock();
local_bh_enable();
if (acpar.hotdrop)
@@ -804,9 +806,9 @@ static void get_old_counters(const struct xt_table_info *t,
static struct xt_counters *alloc_counters(const struct xt_table *table)
{
+ const struct xt_table_info *private = nf_table_private(table);
unsigned int countersize;
struct xt_counters *counters;
- const struct xt_table_info *private = table->private;
/* We need atomic snapshot of counters: rest doesn't change
(other than comefrom, which userspace doesn't care
@@ -827,10 +829,10 @@ copy_entries_to_user(unsigned int total_size,
const struct xt_table *table,
void __user *userptr)
{
+ const struct xt_table_info *private = nf_table_private(table);
unsigned int off, num;
const struct ip6t_entry *e;
struct xt_counters *counters;
- const struct xt_table_info *private = table->private;
int ret = 0;
const void *loc_cpu_entry;
@@ -979,7 +981,7 @@ static int get_info(struct net *net, void __user *user, const int *len)
t = xt_request_find_table_lock(net, AF_INET6, name);
if (!IS_ERR(t)) {
struct ip6t_getinfo info;
- const struct xt_table_info *private = t->private;
+ const struct xt_table_info *private = nf_table_private(t);
#ifdef CONFIG_NETFILTER_XTABLES_COMPAT
struct xt_table_info tmp;
@@ -1034,7 +1036,7 @@ get_entries(struct net *net, struct ip6t_get_entries __user *uptr,
t = xt_find_table_lock(net, AF_INET6, get.name);
if (!IS_ERR(t)) {
- struct xt_table_info *private = t->private;
+ struct xt_table_info *private = nf_table_private(t);
if (get.size == private->size)
ret = copy_entries_to_user(private->size,
t, uptr->entrytable);
@@ -1191,7 +1193,8 @@ do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
}
local_bh_disable();
- private = t->private;
+ rcu_read_lock();
+ private = rcu_dereference(t->priv_info);
if (private->number != tmp.num_counters) {
ret = -EINVAL;
goto unlock_up_free;
@@ -1208,6 +1211,7 @@ do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
}
xt_write_recseq_end(addend);
unlock_up_free:
+ rcu_read_unlock();
local_bh_enable();
xt_table_unlock(t);
module_put(t->me);
@@ -1559,8 +1563,8 @@ static int
compat_copy_entries_to_user(unsigned int total_size, struct xt_table *table,
void __user *userptr)
{
+ const struct xt_table_info *private = nf_table_private(table);
struct xt_counters *counters;
- const struct xt_table_info *private = table->private;
void __user *pos;
unsigned int size;
int ret = 0;
@@ -1606,7 +1610,7 @@ compat_get_entries(struct net *net, struct compat_ip6t_get_entries __user *uptr,
xt_compat_lock(AF_INET6);
t = xt_find_table_lock(net, AF_INET6, get.name);
if (!IS_ERR(t)) {
- const struct xt_table_info *private = t->private;
+ const struct xt_table_info *private = nf_table_private(t);
struct xt_table_info info;
ret = compat_table_info(private, &info);
if (!ret && get.size == info.size)
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 709840612f0df..ee38272cca562 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -85,6 +85,19 @@ static const char *const xt_prefix[NFPROTO_NUMPROTO] = {
[NFPROTO_IPV6] = "ip6",
};
+#ifdef CONFIG_LOCKDEP
+bool xt_af_lock_held(u_int8_t af)
+{
+ return lockdep_is_held(&xt[af].mutex) ||
+#ifdef CONFIG_NETFILTER_XTABLES_COMPAT
+ lockdep_is_held(&xt[af].compat_mutex);
+#else
+ false;
+#endif
+}
+EXPORT_SYMBOL_GPL(xt_af_lock_held);
+#endif
+
/* Registration hooks for targets. */
int xt_register_target(struct xt_target *target)
{
@@ -1388,7 +1401,6 @@ xt_replace_table(struct xt_table *table,
int *error)
{
struct xt_table_info *private;
- unsigned int cpu;
int ret;
ret = xt_jumpstack_alloc(newinfo);
@@ -1397,48 +1409,24 @@ xt_replace_table(struct xt_table *table,
return NULL;
}
- /* Do the substitution. */
- local_bh_disable();
- private = table->private;
+ private = nf_table_private(table);
/* Check inside lock: is the old number correct? */
if (num_counters != private->number) {
pr_debug("num_counters != table->private->number (%u/%u)\n",
num_counters, private->number);
- local_bh_enable();
*error = -EAGAIN;
return NULL;
}
newinfo->initial_entries = private->initial_entries;
- /*
- * Ensure contents of newinfo are visible before assigning to
- * private.
- */
- smp_wmb();
- table->private = newinfo;
-
- /* make sure all cpus see new ->private value */
- smp_mb();
+ rcu_assign_pointer(table->priv_info, newinfo);
/*
* Even though table entries have now been swapped, other CPU's
* may still be using the old entries...
*/
- local_bh_enable();
-
- /* ... so wait for even xt_recseq on all cpus */
- for_each_possible_cpu(cpu) {
- seqcount_t *s = &per_cpu(xt_recseq, cpu);
- u32 seq = raw_read_seqcount(s);
-
- if (seq & 1) {
- do {
- cond_resched();
- cpu_relax();
- } while (seq == raw_read_seqcount(s));
- }
- }
+ synchronize_rcu();
audit_log_nfcfg(table->name, table->af, private->number,
!private->number ? AUDIT_XT_OP_REGISTER :
@@ -1475,12 +1463,12 @@ struct xt_table *xt_register_table(struct net *net,
}
/* Simplifies replace_table code. */
- table->private = bootstrap;
+ rcu_assign_pointer(table->priv_info, bootstrap);
if (!xt_replace_table(table, 0, newinfo, &ret))
goto unlock;
- private = table->private;
+ private = nf_table_private(table);
pr_debug("table->private->number = %u\n", private->number);
/* save number of initial entries */
@@ -1503,7 +1491,7 @@ void *xt_unregister_table(struct xt_table *table)
struct xt_table_info *private;
mutex_lock(&xt[table->af].mutex);
- private = table->private;
+ private = nf_table_private(table);
list_del(&table->list);
mutex_unlock(&xt[table->af].mutex);
audit_log_nfcfg(table->name, table->af, private->number,
--
2.47.2
* [PATCH net-next v2 2/3] netfilter: Split the xt_counters type between kernel and user.
2025-02-21 13:31 [PATCH net-next v2 0/3] Replace xt_recseq with u64_stats Sebastian Andrzej Siewior
2025-02-21 13:31 ` [PATCH net-next v2 1/3] netfilter: Make xt_table::private RCU protected Sebastian Andrzej Siewior
@ 2025-02-21 13:31 ` Sebastian Andrzej Siewior
2025-02-21 13:31 ` [PATCH net-next v2 3/3] netfilter: Use u64_stats for counters in xt_counters_k Sebastian Andrzej Siewior
` (2 subsequent siblings)
4 siblings, 0 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-21 13:31 UTC (permalink / raw)
To: netfilter-devel, coreteam, linux-rt-devel
Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Thomas Gleixner,
Sebastian Andrzej Siewior
The struct xt_counters contains two u64 values as counters for bytes and
packets. This type is exposed to userland via uapi. The kernel uses the
type as such when it communicates with userland.
However the type within an entry (such as ipt_entry) is treated
differently: Within the kernel it is a two-value struct if the system
has only one CPU.
With more CPUs, the first value is a per-CPU pointer which points to
per-CPU memory holding the two u64 counters. How the struct is
interpreted depends on the user.
Introduce a struct xt_counter_pad which is simply used as a placeholder,
ensuring it is the same size as struct xt_counters. Kernel functions will
use this type wherever the value might be a per-CPU pointer.
Add this padding struct to arpt_entry, ipt_entry and ip6t_entry.
Pass this type to xt_get_this_cpu_counter(), xt_percpu_counter_free()
and xt_percpu_counter_alloc(). These functions will cast it to union
xt_counter_k and return the proper pointer to struct xt_counters_k.
This is mostly the same as before but a bit more obvious, and it
introduces the struct xt_counters_k for in-kernel usage. This can be
replaced later without breaking userland.
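A condensed view of the reinterpretation, taken from the hunks below:

        struct xt_counter_pad { __u8 pad[16]; };       /* uapi placeholder */

        union xt_counter_k {
                struct xt_counters_k __percpu *pcpu;    /* nr_cpu_ids > 1 */
                struct xt_counters_k local;             /* single-CPU system */
        };

        /* kernel side: the pad inside the entry is cast to the union */
        static inline struct xt_counters_k *
        xt_get_this_cpu_counter(struct xt_counter_pad *xt_pad)
        {
                union xt_counter_k *xt_cnt = (union xt_counter_k *)xt_pad;

                if (nr_cpu_ids > 1)
                        return this_cpu_ptr(xt_cnt->pcpu);
                return &xt_cnt->local;
        }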
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/netfilter/x_tables.h | 40 +++++++++++++------
include/uapi/linux/netfilter/x_tables.h | 4 ++
include/uapi/linux/netfilter_arp/arp_tables.h | 5 ++-
include/uapi/linux/netfilter_ipv4/ip_tables.h | 5 ++-
.../uapi/linux/netfilter_ipv6/ip6_tables.h | 5 ++-
net/ipv4/netfilter/arp_tables.c | 22 +++++-----
net/ipv4/netfilter/ip_tables.c | 22 +++++-----
net/ipv6/netfilter/ip6_tables.c | 22 +++++-----
net/netfilter/x_tables.c | 23 ++++++-----
9 files changed, 90 insertions(+), 58 deletions(-)
diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index b9cd82e845d08..fc52a2ba90f6b 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -267,6 +267,17 @@ struct xt_table_info {
unsigned char entries[] __aligned(8);
};
+struct xt_counters_k {
+ /* Packet and byte counter */
+ __u64 pcnt;
+ __u64 bcnt;
+};
+
+union xt_counter_k {
+ struct xt_counters_k __percpu *pcpu;
+ struct xt_counters_k local;
+};
+
int xt_register_target(struct xt_target *target);
void xt_unregister_target(struct xt_target *target);
int xt_register_targets(struct xt_target *target, unsigned int n);
@@ -428,29 +439,32 @@ static inline unsigned long ifname_compare_aligned(const char *_a,
struct xt_percpu_counter_alloc_state {
unsigned int off;
- const char __percpu *mem;
+ void __percpu *mem;
};
bool xt_percpu_counter_alloc(struct xt_percpu_counter_alloc_state *state,
- struct xt_counters *counter);
-void xt_percpu_counter_free(struct xt_counters *cnt);
+ struct xt_counter_pad *xt_pad);
+void xt_percpu_counter_free(struct xt_counter_pad *xt_pad);
-static inline struct xt_counters *
-xt_get_this_cpu_counter(struct xt_counters *cnt)
+static inline struct xt_counters_k *xt_get_this_cpu_counter(struct xt_counter_pad *xt_pad)
{
- if (nr_cpu_ids > 1)
- return this_cpu_ptr((void __percpu *) (unsigned long) cnt->pcnt);
+ union xt_counter_k *xt_cnt = (union xt_counter_k *)xt_pad;
- return cnt;
+ if (nr_cpu_ids > 1)
+ return this_cpu_ptr(xt_cnt->pcpu);
+
+ return &xt_cnt->local;
}
-static inline struct xt_counters *
-xt_get_per_cpu_counter(struct xt_counters *cnt, unsigned int cpu)
+static inline struct xt_counters_k *xt_get_per_cpu_counter(struct xt_counter_pad *xt_pad,
+ unsigned int cpu)
{
- if (nr_cpu_ids > 1)
- return per_cpu_ptr((void __percpu *) (unsigned long) cnt->pcnt, cpu);
+ union xt_counter_k *xt_cnt = (union xt_counter_k *)xt_pad;
- return cnt;
+ if (nr_cpu_ids > 1)
+ return per_cpu_ptr(xt_cnt->pcpu, cpu);
+
+ return &xt_cnt->local;
}
struct nf_hook_ops *xt_hook_ops_alloc(const struct xt_table *, nf_hookfn *);
diff --git a/include/uapi/linux/netfilter/x_tables.h b/include/uapi/linux/netfilter/x_tables.h
index 796af83a963aa..70e19a140ab1e 100644
--- a/include/uapi/linux/netfilter/x_tables.h
+++ b/include/uapi/linux/netfilter/x_tables.h
@@ -111,6 +111,10 @@ struct xt_counters {
__u64 pcnt, bcnt; /* Packet and byte counters */
};
+struct xt_counter_pad {
+ __u8 pad[16];
+};
+
/* The argument to IPT_SO_ADD_COUNTERS. */
struct xt_counters_info {
/* Which table. */
diff --git a/include/uapi/linux/netfilter_arp/arp_tables.h b/include/uapi/linux/netfilter_arp/arp_tables.h
index a6ac2463f787a..4ca949a955412 100644
--- a/include/uapi/linux/netfilter_arp/arp_tables.h
+++ b/include/uapi/linux/netfilter_arp/arp_tables.h
@@ -106,7 +106,10 @@ struct arpt_entry
unsigned int comefrom;
/* Packet and byte counters. */
- struct xt_counters counters;
+ union {
+ struct xt_counters counters;
+ struct xt_counter_pad counter_pad;
+ };
/* The matches (if any), then the target. */
unsigned char elems[];
diff --git a/include/uapi/linux/netfilter_ipv4/ip_tables.h b/include/uapi/linux/netfilter_ipv4/ip_tables.h
index 1485df28b2391..a4874078ec058 100644
--- a/include/uapi/linux/netfilter_ipv4/ip_tables.h
+++ b/include/uapi/linux/netfilter_ipv4/ip_tables.h
@@ -118,7 +118,10 @@ struct ipt_entry {
unsigned int comefrom;
/* Packet and byte counters. */
- struct xt_counters counters;
+ union {
+ struct xt_counters counters;
+ struct xt_counter_pad counter_pad;
+ };
/* The matches (if any), then the target. */
unsigned char elems[];
diff --git a/include/uapi/linux/netfilter_ipv6/ip6_tables.h b/include/uapi/linux/netfilter_ipv6/ip6_tables.h
index 766e8e0bcc683..8634257e1cd59 100644
--- a/include/uapi/linux/netfilter_ipv6/ip6_tables.h
+++ b/include/uapi/linux/netfilter_ipv6/ip6_tables.h
@@ -122,7 +122,10 @@ struct ip6t_entry {
unsigned int comefrom;
/* Packet and byte counters. */
- struct xt_counters counters;
+ union {
+ struct xt_counters counters;
+ struct xt_counter_pad counter_pad;
+ };
/* The matches (if any), then the target. */
unsigned char elems[0];
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index 0628e68910f7f..ce3d73155ca9b 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -221,14 +221,14 @@ unsigned int arpt_do_table(void *priv,
arp = arp_hdr(skb);
do {
const struct xt_entry_target *t;
- struct xt_counters *counter;
+ struct xt_counters_k *counter;
if (!arp_packet_match(arp, skb->dev, indev, outdev, &e->arp)) {
e = arpt_next_entry(e);
continue;
}
- counter = xt_get_this_cpu_counter(&e->counters);
+ counter = xt_get_this_cpu_counter(&e->counter_pad);
ADD_COUNTER(*counter, arp_hdr_len(skb->dev), 1);
t = arpt_get_target_c(e);
@@ -412,7 +412,7 @@ find_check_entry(struct arpt_entry *e, struct net *net, const char *name,
struct xt_target *target;
int ret;
- if (!xt_percpu_counter_alloc(alloc_state, &e->counters))
+ if (!xt_percpu_counter_alloc(alloc_state, &e->counter_pad))
return -ENOMEM;
t = arpt_get_target(e);
@@ -431,7 +431,7 @@ find_check_entry(struct arpt_entry *e, struct net *net, const char *name,
err:
module_put(t->u.kernel.target->me);
out:
- xt_percpu_counter_free(&e->counters);
+ xt_percpu_counter_free(&e->counter_pad);
return ret;
}
@@ -512,7 +512,7 @@ static void cleanup_entry(struct arpt_entry *e, struct net *net)
if (par.target->destroy != NULL)
par.target->destroy(&par);
module_put(par.target->me);
- xt_percpu_counter_free(&e->counters);
+ xt_percpu_counter_free(&e->counter_pad);
}
/* Checks and translates the user-supplied table segment (held in
@@ -611,11 +611,11 @@ static void get_counters(const struct xt_table_info *t,
i = 0;
xt_entry_foreach(iter, t->entries, t->size) {
- struct xt_counters *tmp;
+ struct xt_counters_k *tmp;
u64 bcnt, pcnt;
unsigned int start;
- tmp = xt_get_per_cpu_counter(&iter->counters, cpu);
+ tmp = xt_get_per_cpu_counter(&iter->counter_pad, cpu);
do {
start = read_seqcount_begin(s);
bcnt = tmp->bcnt;
@@ -638,9 +638,9 @@ static void get_old_counters(const struct xt_table_info *t,
for_each_possible_cpu(cpu) {
i = 0;
xt_entry_foreach(iter, t->entries, t->size) {
- struct xt_counters *tmp;
+ struct xt_counters_k *tmp;
- tmp = xt_get_per_cpu_counter(&iter->counters, cpu);
+ tmp = xt_get_per_cpu_counter(&iter->counter_pad, cpu);
ADD_COUNTER(counters[i], tmp->bcnt, tmp->pcnt);
++i;
}
@@ -1035,9 +1035,9 @@ static int do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
addend = xt_write_recseq_begin();
xt_entry_foreach(iter, private->entries, private->size) {
- struct xt_counters *tmp;
+ struct xt_counters_k *tmp;
- tmp = xt_get_this_cpu_counter(&iter->counters);
+ tmp = xt_get_this_cpu_counter(&iter->counter_pad);
ADD_COUNTER(*tmp, paddc[i].bcnt, paddc[i].pcnt);
++i;
}
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 20e8b46af8876..95f917f5bceef 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -278,7 +278,7 @@ ipt_do_table(void *priv,
do {
const struct xt_entry_target *t;
const struct xt_entry_match *ematch;
- struct xt_counters *counter;
+ struct xt_counters_k *counter;
WARN_ON(!e);
if (!ip_packet_match(ip, indev, outdev,
@@ -295,7 +295,7 @@ ipt_do_table(void *priv,
goto no_match;
}
- counter = xt_get_this_cpu_counter(&e->counters);
+ counter = xt_get_this_cpu_counter(&e->counter_pad);
ADD_COUNTER(*counter, skb->len, 1);
t = ipt_get_target_c(e);
@@ -525,7 +525,7 @@ find_check_entry(struct ipt_entry *e, struct net *net, const char *name,
struct xt_mtchk_param mtpar;
struct xt_entry_match *ematch;
- if (!xt_percpu_counter_alloc(alloc_state, &e->counters))
+ if (!xt_percpu_counter_alloc(alloc_state, &e->counter_pad))
return -ENOMEM;
j = 0;
@@ -565,7 +565,7 @@ find_check_entry(struct ipt_entry *e, struct net *net, const char *name,
cleanup_match(ematch, net);
}
- xt_percpu_counter_free(&e->counters);
+ xt_percpu_counter_free(&e->counter_pad);
return ret;
}
@@ -653,7 +653,7 @@ cleanup_entry(struct ipt_entry *e, struct net *net)
if (par.target->destroy != NULL)
par.target->destroy(&par);
module_put(par.target->me);
- xt_percpu_counter_free(&e->counters);
+ xt_percpu_counter_free(&e->counter_pad);
}
/* Checks and translates the user-supplied table segment (held in
@@ -750,11 +750,11 @@ get_counters(const struct xt_table_info *t,
i = 0;
xt_entry_foreach(iter, t->entries, t->size) {
- struct xt_counters *tmp;
+ struct xt_counters_k *tmp;
u64 bcnt, pcnt;
unsigned int start;
- tmp = xt_get_per_cpu_counter(&iter->counters, cpu);
+ tmp = xt_get_per_cpu_counter(&iter->counter_pad, cpu);
do {
start = read_seqcount_begin(s);
bcnt = tmp->bcnt;
@@ -777,9 +777,9 @@ static void get_old_counters(const struct xt_table_info *t,
for_each_possible_cpu(cpu) {
i = 0;
xt_entry_foreach(iter, t->entries, t->size) {
- const struct xt_counters *tmp;
+ const struct xt_counters_k *tmp;
- tmp = xt_get_per_cpu_counter(&iter->counters, cpu);
+ tmp = xt_get_per_cpu_counter(&iter->counter_pad, cpu);
ADD_COUNTER(counters[i], tmp->bcnt, tmp->pcnt);
++i; /* macro does multi eval of i */
}
@@ -1187,9 +1187,9 @@ do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
i = 0;
addend = xt_write_recseq_begin();
xt_entry_foreach(iter, private->entries, private->size) {
- struct xt_counters *tmp;
+ struct xt_counters_k *tmp;
- tmp = xt_get_this_cpu_counter(&iter->counters);
+ tmp = xt_get_this_cpu_counter(&iter->counter_pad);
ADD_COUNTER(*tmp, paddc[i].bcnt, paddc[i].pcnt);
++i;
}
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index c12d489a09840..f4877b1b2463e 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -300,7 +300,7 @@ ip6t_do_table(void *priv, struct sk_buff *skb,
do {
const struct xt_entry_target *t;
const struct xt_entry_match *ematch;
- struct xt_counters *counter;
+ struct xt_counters_k *counter;
WARN_ON(!e);
acpar.thoff = 0;
@@ -318,7 +318,7 @@ ip6t_do_table(void *priv, struct sk_buff *skb,
goto no_match;
}
- counter = xt_get_this_cpu_counter(&e->counters);
+ counter = xt_get_this_cpu_counter(&e->counter_pad);
ADD_COUNTER(*counter, skb->len, 1);
t = ip6t_get_target_c(e);
@@ -544,7 +544,7 @@ find_check_entry(struct ip6t_entry *e, struct net *net, const char *name,
struct xt_mtchk_param mtpar;
struct xt_entry_match *ematch;
- if (!xt_percpu_counter_alloc(alloc_state, &e->counters))
+ if (!xt_percpu_counter_alloc(alloc_state, &e->counter_pad))
return -ENOMEM;
j = 0;
@@ -583,7 +583,7 @@ find_check_entry(struct ip6t_entry *e, struct net *net, const char *name,
cleanup_match(ematch, net);
}
- xt_percpu_counter_free(&e->counters);
+ xt_percpu_counter_free(&e->counter_pad);
return ret;
}
@@ -670,7 +670,7 @@ static void cleanup_entry(struct ip6t_entry *e, struct net *net)
if (par.target->destroy != NULL)
par.target->destroy(&par);
module_put(par.target->me);
- xt_percpu_counter_free(&e->counters);
+ xt_percpu_counter_free(&e->counter_pad);
}
/* Checks and translates the user-supplied table segment (held in
@@ -767,11 +767,11 @@ get_counters(const struct xt_table_info *t,
i = 0;
xt_entry_foreach(iter, t->entries, t->size) {
- struct xt_counters *tmp;
+ struct xt_counters_k *tmp;
u64 bcnt, pcnt;
unsigned int start;
- tmp = xt_get_per_cpu_counter(&iter->counters, cpu);
+ tmp = xt_get_per_cpu_counter(&iter->counter_pad, cpu);
do {
start = read_seqcount_begin(s);
bcnt = tmp->bcnt;
@@ -794,9 +794,9 @@ static void get_old_counters(const struct xt_table_info *t,
for_each_possible_cpu(cpu) {
i = 0;
xt_entry_foreach(iter, t->entries, t->size) {
- const struct xt_counters *tmp;
+ const struct xt_counters_k *tmp;
- tmp = xt_get_per_cpu_counter(&iter->counters, cpu);
+ tmp = xt_get_per_cpu_counter(&iter->counter_pad, cpu);
ADD_COUNTER(counters[i], tmp->bcnt, tmp->pcnt);
++i;
}
@@ -1203,9 +1203,9 @@ do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
i = 0;
addend = xt_write_recseq_begin();
xt_entry_foreach(iter, private->entries, private->size) {
- struct xt_counters *tmp;
+ struct xt_counters_k *tmp;
- tmp = xt_get_this_cpu_counter(&iter->counters);
+ tmp = xt_get_this_cpu_counter(&iter->counter_pad);
ADD_COUNTER(*tmp, paddc[i].bcnt, paddc[i].pcnt);
++i;
}
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index ee38272cca562..5379ed82abd59 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1889,7 +1889,7 @@ EXPORT_SYMBOL_GPL(xt_proto_fini);
* xt_percpu_counter_alloc - allocate x_tables rule counter
*
* @state: pointer to xt_percpu allocation state
- * @counter: pointer to counter struct inside the ip(6)/arpt_entry struct
+ * @xt_pad: pointer to the counter padding inside the ip(6)/arpt_entry struct
*
* On SMP, the packet counter [ ip(6)t_entry->counters.pcnt ] will then
* contain the address of the real (percpu) counter.
@@ -1908,9 +1908,13 @@ EXPORT_SYMBOL_GPL(xt_proto_fini);
* returns false on error.
*/
bool xt_percpu_counter_alloc(struct xt_percpu_counter_alloc_state *state,
- struct xt_counters *counter)
+ struct xt_counter_pad *xt_pad)
{
- BUILD_BUG_ON(XT_PCPU_BLOCK_SIZE < (sizeof(*counter) * 2));
+ union xt_counter_k *xt_cnt = (union xt_counter_k *)xt_pad;
+
+ BUILD_BUG_ON(XT_PCPU_BLOCK_SIZE < (sizeof(struct xt_counters_k) * 2));
+ BUILD_BUG_ON(sizeof(struct xt_counters_k) != sizeof(struct xt_counters));
+ BUILD_BUG_ON(sizeof(struct xt_counters_k) != sizeof(struct xt_counter_pad));
if (nr_cpu_ids <= 1)
return true;
@@ -1921,9 +1925,9 @@ bool xt_percpu_counter_alloc(struct xt_percpu_counter_alloc_state *state,
if (!state->mem)
return false;
}
- counter->pcnt = (__force unsigned long)(state->mem + state->off);
- state->off += sizeof(*counter);
- if (state->off > (XT_PCPU_BLOCK_SIZE - sizeof(*counter))) {
+ xt_cnt->pcpu = state->mem + state->off;
+ state->off += sizeof(struct xt_counters_k);
+ if (state->off > (XT_PCPU_BLOCK_SIZE - sizeof(struct xt_counters_k))) {
state->mem = NULL;
state->off = 0;
}
@@ -1931,12 +1935,13 @@ bool xt_percpu_counter_alloc(struct xt_percpu_counter_alloc_state *state,
}
EXPORT_SYMBOL_GPL(xt_percpu_counter_alloc);
-void xt_percpu_counter_free(struct xt_counters *counters)
+void xt_percpu_counter_free(struct xt_counter_pad *xt_pad)
{
- unsigned long pcnt = counters->pcnt;
+ union xt_counter_k *xt_cnt = (union xt_counter_k *)xt_pad;
+ unsigned long pcnt = (unsigned long)xt_cnt->pcpu;
if (nr_cpu_ids > 1 && (pcnt & (XT_PCPU_BLOCK_SIZE - 1)) == 0)
- free_percpu((void __percpu *)pcnt);
+ free_percpu(xt_cnt->pcpu);
}
EXPORT_SYMBOL_GPL(xt_percpu_counter_free);
--
2.47.2
* [PATCH net-next v2 3/3] netfilter: Use u64_stats for counters in xt_counters_k.
2025-02-21 13:31 [PATCH net-next v2 0/3] Replace xt_recseq with u64_stats Sebastian Andrzej Siewior
2025-02-21 13:31 ` [PATCH net-next v2 1/3] netfilter: Make xt_table::private RCU protected Sebastian Andrzej Siewior
2025-02-21 13:31 ` [PATCH net-next v2 2/3] netfilter: Split the xt_counters type between kernel and user Sebastian Andrzej Siewior
@ 2025-02-21 13:31 ` Sebastian Andrzej Siewior
2025-03-07 16:57 ` [PATCH net-next v2 0/3] Replace xt_recseq with u64_stats Sebastian Andrzej Siewior
2025-03-12 23:16 ` Pablo Neira Ayuso
4 siblings, 0 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-02-21 13:31 UTC (permalink / raw)
To: netfilter-devel, coreteam, linux-rt-devel
Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Thomas Gleixner,
Sebastian Andrzej Siewior
Using u64_stats for statistics has the advantage that the seqcount_t
synchronisation can be reduced to 32bit architectures only. On 64bit
architectures the update can happen locklessly given there is only one
writer.
The critical section (xt_write_recseq_begin() - xt_write_recseq_end())
can be reduced to just the update of the value since the reader and its
access to the private xt_table_info are RCU protected.
Use u64_stats_t in xt_counters_k for statistics. Add xt_counter_add() as
a helper to update counters within the critical section.
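The resulting update and read pattern, condensed from the hunks below:

        /* hot path: one writer per CPU, the seqcount only exists on 32bit */
        u64_stats_update_begin(this_cpu_ptr(&xt_syncp));
        u64_stats_add(&counter->bcnt, skb->len);
        u64_stats_add(&counter->pcnt, 1);
        u64_stats_update_end(this_cpu_ptr(&xt_syncp));

        /* read side, get_counters() */
        do {
                start = u64_stats_fetch_begin(syncp);
                bcnt  = u64_stats_read(&tmp->bcnt);
                pcnt  = u64_stats_read(&tmp->pcnt);
        } while (u64_stats_fetch_retry(syncp, start));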
Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de>
---
include/linux/netfilter/x_tables.h | 71 ++++++------------------------
net/ipv4/netfilter/arp_tables.c | 23 ++++------
net/ipv4/netfilter/ip_tables.c | 23 ++++------
net/ipv6/netfilter/ip6_tables.c | 23 ++++------
net/netfilter/x_tables.c | 6 +--
5 files changed, 43 insertions(+), 103 deletions(-)
diff --git a/include/linux/netfilter/x_tables.h b/include/linux/netfilter/x_tables.h
index fc52a2ba90f6b..df429d0198c92 100644
--- a/include/linux/netfilter/x_tables.h
+++ b/include/linux/netfilter/x_tables.h
@@ -269,10 +269,21 @@ struct xt_table_info {
struct xt_counters_k {
/* Packet and byte counter */
- __u64 pcnt;
- __u64 bcnt;
+ u64_stats_t pcnt;
+ u64_stats_t bcnt;
};
+DECLARE_PER_CPU(struct u64_stats_sync, xt_syncp);
+
+static inline void xt_counter_add(struct xt_counters_k *xt_counter,
+ unsigned int bytes, unsigned int packets)
+{
+ u64_stats_update_begin(this_cpu_ptr(&xt_syncp));
+ u64_stats_add(&xt_counter->pcnt, packets);
+ u64_stats_add(&xt_counter->bcnt, bytes);
+ u64_stats_update_end(this_cpu_ptr(&xt_syncp));
+}
+
union xt_counter_k {
struct xt_counters_k __percpu *pcpu;
struct xt_counters_k local;
@@ -346,16 +357,6 @@ void xt_proto_fini(struct net *net, u_int8_t af);
struct xt_table_info *xt_alloc_table_info(unsigned int size);
void xt_free_table_info(struct xt_table_info *info);
-/**
- * xt_recseq - recursive seqcount for netfilter use
- *
- * Packet processing changes the seqcount only if no recursion happened
- * get_counters() can use read_seqcount_begin()/read_seqcount_retry(),
- * because we use the normal seqcount convention :
- * Low order bit set to 1 if a writer is active.
- */
-DECLARE_PER_CPU(seqcount_t, xt_recseq);
-
bool xt_af_lock_held(u_int8_t af);
static inline struct xt_table_info *nf_table_private(const struct xt_table *table)
{
@@ -368,52 +369,6 @@ static inline struct xt_table_info *nf_table_private(const struct xt_table *tabl
*/
extern struct static_key xt_tee_enabled;
-/**
- * xt_write_recseq_begin - start of a write section
- *
- * Begin packet processing : all readers must wait the end
- * 1) Must be called with preemption disabled
- * 2) softirqs must be disabled too (or we should use this_cpu_add())
- * Returns:
- * 1 if no recursion on this cpu
- * 0 if recursion detected
- */
-static inline unsigned int xt_write_recseq_begin(void)
-{
- unsigned int addend;
-
- /*
- * Low order bit of sequence is set if we already
- * called xt_write_recseq_begin().
- */
- addend = (__this_cpu_read(xt_recseq.sequence) + 1) & 1;
-
- /*
- * This is kind of a write_seqcount_begin(), but addend is 0 or 1
- * We dont check addend value to avoid a test and conditional jump,
- * since addend is most likely 1
- */
- __this_cpu_add(xt_recseq.sequence, addend);
- smp_mb();
-
- return addend;
-}
-
-/**
- * xt_write_recseq_end - end of a write section
- * @addend: return value from previous xt_write_recseq_begin()
- *
- * End packet processing : all readers can proceed
- * 1) Must be called with preemption disabled
- * 2) softirqs must be disabled too (or we should use this_cpu_add())
- */
-static inline void xt_write_recseq_end(unsigned int addend)
-{
- /* this is kind of a write_seqcount_end(), but addend is 0 or 1 */
- smp_wmb();
- __this_cpu_add(xt_recseq.sequence, addend);
-}
-
/*
* This helper is performance critical and must be inlined
*/
diff --git a/net/ipv4/netfilter/arp_tables.c b/net/ipv4/netfilter/arp_tables.c
index ce3d73155ca9b..6b957de58d2cf 100644
--- a/net/ipv4/netfilter/arp_tables.c
+++ b/net/ipv4/netfilter/arp_tables.c
@@ -194,7 +194,6 @@ unsigned int arpt_do_table(void *priv,
unsigned int cpu, stackidx = 0;
const struct xt_table_info *private;
struct xt_action_param acpar;
- unsigned int addend;
if (!pskb_may_pull(skb, arp_hdr_len(skb->dev)))
return NF_DROP;
@@ -204,7 +203,6 @@ unsigned int arpt_do_table(void *priv,
local_bh_disable();
rcu_read_lock();
- addend = xt_write_recseq_begin();
private = rcu_dereference(table->priv_info);
cpu = smp_processor_id();
table_base = private->entries;
@@ -229,7 +227,7 @@ unsigned int arpt_do_table(void *priv,
}
counter = xt_get_this_cpu_counter(&e->counter_pad);
- ADD_COUNTER(*counter, arp_hdr_len(skb->dev), 1);
+ xt_counter_add(counter, arp_hdr_len(skb->dev), 1);
t = arpt_get_target_c(e);
@@ -279,7 +277,6 @@ unsigned int arpt_do_table(void *priv,
break;
}
} while (!acpar.hotdrop);
- xt_write_recseq_end(addend);
rcu_read_unlock();
local_bh_enable();
@@ -607,7 +604,7 @@ static void get_counters(const struct xt_table_info *t,
unsigned int i;
for_each_possible_cpu(cpu) {
- seqcount_t *s = &per_cpu(xt_recseq, cpu);
+ struct u64_stats_sync *syncp = &per_cpu(xt_syncp, cpu);
i = 0;
xt_entry_foreach(iter, t->entries, t->size) {
@@ -617,10 +614,10 @@ static void get_counters(const struct xt_table_info *t,
tmp = xt_get_per_cpu_counter(&iter->counter_pad, cpu);
do {
- start = read_seqcount_begin(s);
- bcnt = tmp->bcnt;
- pcnt = tmp->pcnt;
- } while (read_seqcount_retry(s, start));
+ start = u64_stats_fetch_begin(syncp);
+ bcnt = u64_stats_read(&tmp->bcnt);
+ pcnt = u64_stats_read(&tmp->pcnt);
+ } while (u64_stats_fetch_retry(syncp, start));
ADD_COUNTER(counters[i], bcnt, pcnt);
++i;
@@ -641,7 +638,8 @@ static void get_old_counters(const struct xt_table_info *t,
struct xt_counters_k *tmp;
tmp = xt_get_per_cpu_counter(&iter->counter_pad, cpu);
- ADD_COUNTER(counters[i], tmp->bcnt, tmp->pcnt);
+ ADD_COUNTER(counters[i], u64_stats_read(&tmp->bcnt),
+ u64_stats_read(&tmp->pcnt));
++i;
}
cond_resched();
@@ -1011,7 +1009,6 @@ static int do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
const struct xt_table_info *private;
int ret = 0;
struct arpt_entry *iter;
- unsigned int addend;
paddc = xt_copy_counters(arg, len, &tmp);
if (IS_ERR(paddc))
@@ -1033,15 +1030,13 @@ static int do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
i = 0;
- addend = xt_write_recseq_begin();
xt_entry_foreach(iter, private->entries, private->size) {
struct xt_counters_k *tmp;
tmp = xt_get_this_cpu_counter(&iter->counter_pad);
- ADD_COUNTER(*tmp, paddc[i].bcnt, paddc[i].pcnt);
+ xt_counter_add(tmp, paddc[i].bcnt, paddc[i].pcnt);
++i;
}
- xt_write_recseq_end(addend);
unlock_up_free:
rcu_read_unlock();
local_bh_enable();
diff --git a/net/ipv4/netfilter/ip_tables.c b/net/ipv4/netfilter/ip_tables.c
index 95f917f5bceef..c5c90e2f724ba 100644
--- a/net/ipv4/netfilter/ip_tables.c
+++ b/net/ipv4/netfilter/ip_tables.c
@@ -236,7 +236,6 @@ ipt_do_table(void *priv,
unsigned int stackidx, cpu;
const struct xt_table_info *private;
struct xt_action_param acpar;
- unsigned int addend;
/* Initialization */
stackidx = 0;
@@ -258,7 +257,6 @@ ipt_do_table(void *priv,
local_bh_disable();
rcu_read_lock();
private = rcu_dereference(table->priv_info);
- addend = xt_write_recseq_begin();
cpu = smp_processor_id();
table_base = private->entries;
jumpstack = (struct ipt_entry **)private->jumpstack[cpu];
@@ -296,7 +294,7 @@ ipt_do_table(void *priv,
}
counter = xt_get_this_cpu_counter(&e->counter_pad);
- ADD_COUNTER(*counter, skb->len, 1);
+ xt_counter_add(counter, skb->len, 1);
t = ipt_get_target_c(e);
WARN_ON(!t->u.kernel.target);
@@ -354,7 +352,6 @@ ipt_do_table(void *priv,
}
} while (!acpar.hotdrop);
- xt_write_recseq_end(addend);
rcu_read_unlock();
local_bh_enable();
@@ -746,7 +743,7 @@ get_counters(const struct xt_table_info *t,
unsigned int i;
for_each_possible_cpu(cpu) {
- seqcount_t *s = &per_cpu(xt_recseq, cpu);
+ struct u64_stats_sync *syncp = &per_cpu(xt_syncp, cpu);
i = 0;
xt_entry_foreach(iter, t->entries, t->size) {
@@ -756,10 +753,10 @@ get_counters(const struct xt_table_info *t,
tmp = xt_get_per_cpu_counter(&iter->counter_pad, cpu);
do {
- start = read_seqcount_begin(s);
- bcnt = tmp->bcnt;
- pcnt = tmp->pcnt;
- } while (read_seqcount_retry(s, start));
+ start = u64_stats_fetch_begin(syncp);
+ bcnt = u64_stats_read(&tmp->bcnt);
+ pcnt = u64_stats_read(&tmp->pcnt);
+ } while (u64_stats_fetch_retry(syncp, start));
ADD_COUNTER(counters[i], bcnt, pcnt);
++i; /* macro does multi eval of i */
@@ -780,7 +777,8 @@ static void get_old_counters(const struct xt_table_info *t,
const struct xt_counters_k *tmp;
tmp = xt_get_per_cpu_counter(&iter->counter_pad, cpu);
- ADD_COUNTER(counters[i], tmp->bcnt, tmp->pcnt);
+ ADD_COUNTER(counters[i], u64_stats_read(&tmp->bcnt),
+ u64_stats_read(&tmp->pcnt));
++i; /* macro does multi eval of i */
}
@@ -1164,7 +1162,6 @@ do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
const struct xt_table_info *private;
int ret = 0;
struct ipt_entry *iter;
- unsigned int addend;
paddc = xt_copy_counters(arg, len, &tmp);
if (IS_ERR(paddc))
@@ -1185,15 +1182,13 @@ do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
}
i = 0;
- addend = xt_write_recseq_begin();
xt_entry_foreach(iter, private->entries, private->size) {
struct xt_counters_k *tmp;
tmp = xt_get_this_cpu_counter(&iter->counter_pad);
- ADD_COUNTER(*tmp, paddc[i].bcnt, paddc[i].pcnt);
+ xt_counter_add(tmp, paddc[i].bcnt, paddc[i].pcnt);
++i;
}
- xt_write_recseq_end(addend);
unlock_up_free:
rcu_read_unlock();
local_bh_enable();
diff --git a/net/ipv6/netfilter/ip6_tables.c b/net/ipv6/netfilter/ip6_tables.c
index f4877b1b2463e..5a6a592cd53dc 100644
--- a/net/ipv6/netfilter/ip6_tables.c
+++ b/net/ipv6/netfilter/ip6_tables.c
@@ -259,7 +259,6 @@ ip6t_do_table(void *priv, struct sk_buff *skb,
unsigned int stackidx, cpu;
const struct xt_table_info *private;
struct xt_action_param acpar;
- unsigned int addend;
/* Initialization */
stackidx = 0;
@@ -280,7 +279,6 @@ ip6t_do_table(void *priv, struct sk_buff *skb,
local_bh_disable();
rcu_read_lock();
private = rcu_dereference(table->priv_info);
- addend = xt_write_recseq_begin();
cpu = smp_processor_id();
table_base = private->entries;
jumpstack = (struct ip6t_entry **)private->jumpstack[cpu];
@@ -319,7 +317,7 @@ ip6t_do_table(void *priv, struct sk_buff *skb,
}
counter = xt_get_this_cpu_counter(&e->counter_pad);
- ADD_COUNTER(*counter, skb->len, 1);
+ xt_counter_add(counter, skb->len, 1);
t = ip6t_get_target_c(e);
WARN_ON(!t->u.kernel.target);
@@ -372,7 +370,6 @@ ip6t_do_table(void *priv, struct sk_buff *skb,
break;
} while (!acpar.hotdrop);
- xt_write_recseq_end(addend);
rcu_read_unlock();
local_bh_enable();
@@ -763,7 +760,7 @@ get_counters(const struct xt_table_info *t,
unsigned int i;
for_each_possible_cpu(cpu) {
- seqcount_t *s = &per_cpu(xt_recseq, cpu);
+ struct u64_stats_sync *syncp = &per_cpu(xt_syncp, cpu);
i = 0;
xt_entry_foreach(iter, t->entries, t->size) {
@@ -773,10 +770,10 @@ get_counters(const struct xt_table_info *t,
tmp = xt_get_per_cpu_counter(&iter->counter_pad, cpu);
do {
- start = read_seqcount_begin(s);
- bcnt = tmp->bcnt;
- pcnt = tmp->pcnt;
- } while (read_seqcount_retry(s, start));
+ start = u64_stats_fetch_begin(syncp);
+ bcnt = u64_stats_read(&tmp->bcnt);
+ pcnt = u64_stats_read(&tmp->pcnt);
+ } while (u64_stats_fetch_retry(syncp, start));
ADD_COUNTER(counters[i], bcnt, pcnt);
++i;
@@ -797,7 +794,8 @@ static void get_old_counters(const struct xt_table_info *t,
const struct xt_counters_k *tmp;
tmp = xt_get_per_cpu_counter(&iter->counter_pad, cpu);
- ADD_COUNTER(counters[i], tmp->bcnt, tmp->pcnt);
+ ADD_COUNTER(counters[i], u64_stats_read(&tmp->bcnt),
+ u64_stats_read(&tmp->pcnt));
++i;
}
cond_resched();
@@ -1181,7 +1179,6 @@ do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
const struct xt_table_info *private;
int ret = 0;
struct ip6t_entry *iter;
- unsigned int addend;
paddc = xt_copy_counters(arg, len, &tmp);
if (IS_ERR(paddc))
@@ -1201,15 +1198,13 @@ do_add_counters(struct net *net, sockptr_t arg, unsigned int len)
}
i = 0;
- addend = xt_write_recseq_begin();
xt_entry_foreach(iter, private->entries, private->size) {
struct xt_counters_k *tmp;
tmp = xt_get_this_cpu_counter(&iter->counter_pad);
- ADD_COUNTER(*tmp, paddc[i].bcnt, paddc[i].pcnt);
+ xt_counter_add(tmp, paddc[i].bcnt, paddc[i].pcnt);
++i;
}
- xt_write_recseq_end(addend);
unlock_up_free:
rcu_read_unlock();
local_bh_enable();
diff --git a/net/netfilter/x_tables.c b/net/netfilter/x_tables.c
index 5379ed82abd59..c865a4470a901 100644
--- a/net/netfilter/x_tables.c
+++ b/net/netfilter/x_tables.c
@@ -1330,8 +1330,8 @@ void xt_compat_unlock(u_int8_t af)
EXPORT_SYMBOL_GPL(xt_compat_unlock);
#endif
-DEFINE_PER_CPU(seqcount_t, xt_recseq);
-EXPORT_PER_CPU_SYMBOL_GPL(xt_recseq);
+DEFINE_PER_CPU(struct u64_stats_sync, xt_syncp);
+EXPORT_PER_CPU_SYMBOL_GPL(xt_syncp);
struct static_key xt_tee_enabled __read_mostly;
EXPORT_SYMBOL_GPL(xt_tee_enabled);
@@ -1977,7 +1977,7 @@ static int __init xt_init(void)
int rv;
for_each_possible_cpu(i) {
- seqcount_init(&per_cpu(xt_recseq, i));
+ u64_stats_init(&per_cpu(xt_syncp, i));
}
xt = kcalloc(NFPROTO_NUMPROTO, sizeof(struct xt_af), GFP_KERNEL);
--
2.47.2
* Re: [PATCH net-next v2 0/3] Replace xt_recseq with u64_stats.
2025-02-21 13:31 [PATCH net-next v2 0/3] Replace xt_recseq with u64_stats Sebastian Andrzej Siewior
` (2 preceding siblings ...)
2025-02-21 13:31 ` [PATCH net-next v2 3/3] netfilter: Use u64_stats for counters in xt_counters_k Sebastian Andrzej Siewior
@ 2025-03-07 16:57 ` Sebastian Andrzej Siewior
2025-03-12 23:16 ` Pablo Neira Ayuso
4 siblings, 0 replies; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-07 16:57 UTC (permalink / raw)
To: netfilter-devel, coreteam, linux-rt-devel
Cc: Pablo Neira Ayuso, Jozsef Kadlecsik, Thomas Gleixner
On 2025-02-21 14:31:40 [+0100], To netfilter-devel@vger.kernel.org wrote:
> The per-CPU xt_recseq is a custom netfilter seqcount. It provides
> synchronisation for the replacement of the xt_table::private pointer and
> ensures that the two counters in xt_counters are properly observed during
> an update on 32bit architectures. xt_recseq also supports recursion.
>
> This construct is less than optimal on PREEMPT_RT because the lack of an
> associated lock (with the seqcount) can lead to a deadlock if a high
> priority reader interrupts a writer. Also xt_recseq relies on locking
> via BH-disable, which becomes problematic if the lock, currently part of
> local_bh_disable() on PREEMPT_RT, gets removed.
>
> This can be optimized unrelated to PREEMPT_RT:
> - Use RCU for synchronisation. This means ipt_do_table() (and the two
> other) access xt_table::private within a RCU section.
> xt_replace_table() replaces the pointer with rcu_assign_pointer() and
> uses synchronize_rcu() to wait until each reader left RCU section.
>
> - Use u64_stats_t for the statistics. The advantage here is that
> u64_stats_sync, which uses a seqcount, is optimized away on 64bit
> architectures. The increment becomes just an add, the read just a read
> of the variable without a loop. On 32bit architectures the seqcount
> remains but the scope is smaller.
>
> The struct xt_counters is defined in a user-exported header (uapi). So
> in patch #2 I tried to split the regular u64 access and the "internal
> access" which treats the struct either as two counters or a per-CPU
> pointer. In order not to expose u64_stats_t to userland I added a "pad"
> which is cast to the internal type. I hoped that this makes it obvious
> that a function like xt_get_this_cpu_counter() expects the possible
> per-CPU type but mark_source_chains() or get_counters() expect the u64
> type without pointers.
A gentle ping in case this got forgotten.
Sebastian
* Re: [PATCH net-next v2 0/3] Replace xt_recseq with u64_stats.
2025-02-21 13:31 [PATCH net-next v2 0/3] Replace xt_recseq with u64_stats Sebastian Andrzej Siewior
` (3 preceding siblings ...)
2025-03-07 16:57 ` [PATCH net-next v2 0/3] Replace xt_recseq with u64_stats Sebastian Andrzej Siewior
@ 2025-03-12 23:16 ` Pablo Neira Ayuso
2025-03-13 8:34 ` Sebastian Andrzej Siewior
4 siblings, 1 reply; 11+ messages in thread
From: Pablo Neira Ayuso @ 2025-03-12 23:16 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: netfilter-devel, coreteam, linux-rt-devel, Jozsef Kadlecsik,
Thomas Gleixner
Hi Sebastian,
On Fri, Feb 21, 2025 at 02:31:40PM +0100, Sebastian Andrzej Siewior wrote:
> The per-CPU xt_recseq is a custom netfilter seqcount. It provides
> synchronisation for the replacement of the xt_table::private pointer and
> ensures that the two counters in xt_counters are properly observed during
> an update on 32bit architectures. xt_recseq also supports recursion.
>
> This construct is less than optimal on PREEMPT_RT because the lack of an
> associated lock (with the seqcount) can lead to a deadlock if a high
> priority reader interrupts a writer. Also xt_recseq relies on locking
> via BH-disable, which becomes problematic if the lock, currently part of
> local_bh_disable() on PREEMPT_RT, gets removed.
>
> This can be optimized unrelated to PREEMPT_RT:
> - Use RCU for synchronisation. This means ipt_do_table() (and the two
> other) access xt_table::private within a RCU section.
> xt_replace_table() replaces the pointer with rcu_assign_pointer() and
> uses synchronize_rcu() to wait until each reader left RCU section.
>
> - Use u64_stats_t for the statistics. The advantage here is that
> u64_stats_sync, which uses a seqcount, is optimized away on 64bit
> architectures. The increment becomes just an add, the read just a read
> of the variable without a loop. On 32bit architectures the seqcount
> remains but the scope is smaller.
>
> The struct xt_counters is defined in a user-exported header (uapi). So
> in patch #2 I tried to split the regular u64 access and the "internal
> access" which treats the struct either as two counters or a per-CPU
> pointer. In order not to expose u64_stats_t to userland I added a "pad"
> which is cast to the internal type. I hoped that this makes it obvious
> that a function like xt_get_this_cpu_counter() expects the possible
> per-CPU type but mark_source_chains() or get_counters() expect the u64
> type without pointers.
>
> v1…v2 https://lore.kernel.org/all/20250216125135.3037967-1-bigeasy@linutronix.de/
> - Updated kerneldoc in 2/3 so that the renamed parameter is part of
> it.
> - Updated description 1/3 in case there are complains regarding the
> synchronize_rcu(). The suggested course of action is to motivate
> people to move away from "legacy" towards "nft" tooling. Last resort
> is not to wait for the in-flight counter and just copy what is
> there.
Kconfig !PREEMPT_RT for this is not an option, right?
Thanks.
* Re: [PATCH net-next v2 0/3] Replace xt_recseq with u64_stats.
2025-03-12 23:16 ` Pablo Neira Ayuso
@ 2025-03-13 8:34 ` Sebastian Andrzej Siewior
2025-03-20 12:41 ` Pablo Neira Ayuso
0 siblings, 1 reply; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-13 8:34 UTC (permalink / raw)
To: Pablo Neira Ayuso
Cc: netfilter-devel, coreteam, linux-rt-devel, Jozsef Kadlecsik,
Thomas Gleixner
On 2025-03-13 00:16:03 [+0100], Pablo Neira Ayuso wrote:
> Hi Sebastian,
Hi Pablo,
> Kconfig !PREEMPT_RT for this is not an option, right?
That bad? I thought it would make you happy ;)
Making it !PREEMPT_RT would essentially disable the whole nf-legacy
interface. Given that it is intended to get rid of it eventually it
might be an option. I mean there is nothing you can do with
iptables-legacy that you can't do with iptables-nft?
I mean if this is not going to happen because of $reasons then that
would be the next best thing.
> Thanks.
Sebastian
* Re: [PATCH net-next v2 0/3] Replace xt_recseq with u64_stats.
2025-03-13 8:34 ` Sebastian Andrzej Siewior
@ 2025-03-20 12:41 ` Pablo Neira Ayuso
2025-03-21 10:24 ` Pablo Neira Ayuso
0 siblings, 1 reply; 11+ messages in thread
From: Pablo Neira Ayuso @ 2025-03-20 12:41 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: netfilter-devel, coreteam, linux-rt-devel, Jozsef Kadlecsik,
Thomas Gleixner
Hi Sebastian,
On Thu, Mar 13, 2025 at 09:34:40AM +0100, Sebastian Andrzej Siewior wrote:
> On 2025-03-13 00:16:03 [+0100], Pablo Neira Ayuso wrote:
> > Hi Sebastian,
> Hi Pablo,
>
> > Kconfig !PREEMPT_RT for this is not an option, right?
>
> > That bad? I thought it would make you happy ;)
> Making it !PREEMPT_RT would essentially disable the whole nf-legacy
> interface. Given that it is intended to get rid of it eventually it
> might be an option. I mean there is nothing you can do with
> iptables-legacy that you can't do with iptables-nft?
> I mean if this is not going to happen because of $reasons then that
> would be the next best thing.
We could give this series a try and see.
Thanks
* Re: [PATCH net-next v2 0/3] Replace xt_recseq with u64_stats.
2025-03-20 12:41 ` Pablo Neira Ayuso
@ 2025-03-21 10:24 ` Pablo Neira Ayuso
2025-03-24 16:29 ` Sebastian Andrzej Siewior
0 siblings, 1 reply; 11+ messages in thread
From: Pablo Neira Ayuso @ 2025-03-21 10:24 UTC (permalink / raw)
To: Sebastian Andrzej Siewior
Cc: netfilter-devel, coreteam, linux-rt-devel, Jozsef Kadlecsik,
Thomas Gleixner
Hi Sebastian,
On Thu, Mar 20, 2025 at 01:41:26PM +0100, Pablo Neira Ayuso wrote:
> On Thu, Mar 13, 2025 at 09:34:40AM +0100, Sebastian Andrzej Siewior wrote:
> > On 2025-03-13 00:16:03 [+0100], Pablo Neira Ayuso wrote:
> > > Kconfig !PREEMPT_RT for this is not an option, right?
> >
> > That bad? I thought it would make you happy ;)
> > Making it !PREEMPT_RT would essentially disable the whole nf-legacy
> > interface. Given that it is intended to get rid of it eventually it
> > might be an option. I mean there is nothing you can do with
> > iptables-legacy that you can't do with iptables-nft?
> > I mean if this is not going to happen because of $reasons then that
> > would be the next best thing.
>
> We could give a try to this series and see.
I have been discussing this with Florian; our proposal:
1. Make iptables legacy depend on !PREEMPT_RT, which effectively
disables iptables classic for RT.
This should be OK, iptables-nft should work for RT.
2. Make iptables-legacy user-selectable.
These two are relatively simple.
If this does not make you happy, it should be possible to take your
patches and hide the synchronize_rcu() latency behind a deferred free
(call_rcu+workqueue).
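Purely as a rough sketch of that direction (all names made up, and the
copying of the old counters would need to move along with it):

        /* free the old xt_table_info after a grace period, from a workqueue */
        struct old_info_free {
                struct rcu_head         rcu;
                struct work_struct      work;
                struct xt_table_info    *info;
        };

        static void old_info_free_work(struct work_struct *work)
        {
                struct old_info_free *f = container_of(work, struct old_info_free, work);

                xt_free_table_info(f->info);
                kfree(f);
        }

        static void old_info_free_rcu(struct rcu_head *head)
        {
                struct old_info_free *f = container_of(head, struct old_info_free, rcu);

                /* punt the actual free to process context */
                INIT_WORK(&f->work, old_info_free_work);
                schedule_work(&f->work);
        }

        /* replace path: rcu_assign_pointer(...); call_rcu(&f->rcu, old_info_free_rcu); */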
As this looks now, I am afraid chances are high that this series will
require a follow-up.
Thanks.
* Re: [PATCH net-next v2 0/3] Replace xt_recseq with u64_stats.
2025-03-21 10:24 ` Pablo Neira Ayuso
@ 2025-03-24 16:29 ` Sebastian Andrzej Siewior
2025-03-24 17:47 ` Florian Westphal
0 siblings, 1 reply; 11+ messages in thread
From: Sebastian Andrzej Siewior @ 2025-03-24 16:29 UTC (permalink / raw)
To: Pablo Neira Ayuso
Cc: netfilter-devel, coreteam, linux-rt-devel, Jozsef Kadlecsik,
Thomas Gleixner
On 2025-03-21 11:24:03 [+0100], Pablo Neira Ayuso wrote:
> Hi Sebastian,
Hi Pablo,
> I have been discussing this with Florian, our proposal:
>
> 1. Make ipatbles legacy depend on !PREEMPT_RT which effectively
> disabled iptables classic for RT.
>
> This should be ok, iptables-nft should work for RT.
>
> 2. make iptables-legacy user-selectable.
>
> these two are relatively simple.
Okay. Let me try that.
> If this does not make you happy, it should be possible to take your
> patches plus hide synchronize_rcu() latency behind deferred free
> (call_rcu+workqueue).
I will try what was suggested above and then see who complains. But I
think it should be doable to roll with nft only for future releases.
> Thanks.
Sebastian