* [RFC PATCH 0/2] net: introduce per netns packet type chains
@ 2025-03-14 13:04 Paolo Abeni
2025-03-14 13:05 ` [RFC PATCH 1/2] net: introduce per netns packet chains Paolo Abeni
2025-03-14 13:05 ` [RFC PATCH 2/2] net: hotdata optimization for netns ptypes Paolo Abeni
0 siblings, 2 replies; 9+ messages in thread
From: Paolo Abeni @ 2025-03-14 13:04 UTC (permalink / raw)
To: netdev
Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Simon Horman, Jonathan Corbet
The stack uses lists shared among all the network namespaces to store
all the packet taps not bound to any device.
As a consequence, creating such taps in any namespace affects the
performance of all the network namespaces.
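For context, such a tap is simply an AF_PACKET socket opened on
ETH_P_ALL and not bound to any device; a minimal userspace illustration
(not part of the series, the function name is arbitrary):

	#include <sys/socket.h>
	#include <arpa/inet.h>
	#include <linux/if_ether.h>

	/* No bind() to an ifindex: the tap only sees traffic from its own
	 * netns, but its packet_type entry currently sits on the shared
	 * list walked for every packet in every namespace.
	 */
	int open_unbound_tap(void)
	{
		return socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	}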
Patch 1 addresses the issue by introducing new per network namespace
packet type chains, while patch 2 tries to minimize the impact of such
an addition.
The hotdata implications are IMHO not trivial, hence the RFC tag; I
suspect patch 2 will be the most controversial. As such, a possible
alternative is also presented.
Any feedback welcome!
Paolo Abeni (2):
net: introduce per netns packet chains
net: hotdata optimization for netns ptypes
.../networking/net_cachelines/net_device.rst | 2 +
include/linux/netdevice.h | 9 +-
include/net/hotdata.h | 1 -
include/net/net_namespace.h | 3 +
net/core/dev.c | 82 +++++++++++++++----
net/core/hotdata.c | 1 -
net/core/net-procfs.c | 16 ++--
net/core/net_namespace.c | 2 +
8 files changed, 86 insertions(+), 30 deletions(-)
--
2.48.1
* [RFC PATCH 1/2] net: introduce per netns packet chains
2025-03-14 13:04 [RFC PATCH 0/2] net: introduce per netns packet type chains Paolo Abeni
@ 2025-03-14 13:05 ` Paolo Abeni
2025-03-14 22:33 ` Sabrina Dubroca
2025-03-14 13:05 ` [RFC PATCH 2/2] net: hotdata optimization for netns ptypes Paolo Abeni
1 sibling, 1 reply; 9+ messages in thread
From: Paolo Abeni @ 2025-03-14 13:05 UTC (permalink / raw)
To: netdev
Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Simon Horman, Jonathan Corbet
Currently network taps unbound to any interface are linked in the
global ptype_all list, affecting the performance in all the network
namespaces.
Add per netns ptype chains, so that in the mentioned case only
the netns owning the packet socket(s) is affected.
While at it, drop the global ptype_all list: no in-kernel user
registers a tap on the "any" type without specifying either the target
device or the target namespace (and IMHO doing that would not make
any sense).
Note that this adds a conditional in the fast path (to check the
per netns ptype_specific list) and increases the dataset size by
a cacheline (owing to the per netns lists).
The next patch will try to address the above.
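For reference, af_packet already records the owning namespace when
registering its hook, which is what the new ptype_head() uses; roughly
(sketch based on the existing packet_create()/packet_do_bind() code,
details omitted):

	po->prot_hook.type = htons(ETH_P_ALL);      /* or the bound proto */
	po->prot_hook.dev = NULL;                   /* no target device */
	po->prot_hook.af_packet_net = sock_net(sk); /* owning netns */
	dev_add_pack(&po->prot_hook);               /* -> net->ptype_all */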
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
include/net/hotdata.h | 1 -
include/net/net_namespace.h | 3 +++
net/core/dev.c | 32 ++++++++++++++++++++++++--------
net/core/hotdata.c | 1 -
net/core/net-procfs.c | 16 ++++++++--------
net/core/net_namespace.c | 2 ++
6 files changed, 37 insertions(+), 18 deletions(-)
diff --git a/include/net/hotdata.h b/include/net/hotdata.h
index 30e9570beb2af..fda94b2647ffa 100644
--- a/include/net/hotdata.h
+++ b/include/net/hotdata.h
@@ -23,7 +23,6 @@ struct net_hotdata {
struct net_offload udpv6_offload;
#endif
struct list_head offload_base;
- struct list_head ptype_all;
struct kmem_cache *skbuff_cache;
struct kmem_cache *skbuff_fclone_cache;
struct kmem_cache *skb_small_head_cache;
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index f467a66abc6b1..bd57d8fb54f14 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -83,6 +83,9 @@ struct net {
struct llist_node defer_free_list;
struct llist_node cleanup_list; /* namespaces on death row */
+ struct list_head ptype_all;
+ struct list_head ptype_specific;
+
#ifdef CONFIG_KEYS
struct key_tag *key_domain; /* Key domain of operation tag */
#endif
diff --git a/net/core/dev.c b/net/core/dev.c
index 6fa6ed5b57987..00bdd8316cb5e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -572,11 +572,19 @@ static inline void netdev_set_addr_lockdep_class(struct net_device *dev)
static inline struct list_head *ptype_head(const struct packet_type *pt)
{
- if (pt->type == htons(ETH_P_ALL))
- return pt->dev ? &pt->dev->ptype_all : &net_hotdata.ptype_all;
- else
+ if (pt->type == htons(ETH_P_ALL)) {
+ if (!pt->af_packet_net && !pt->dev)
+ return NULL;
+
return pt->dev ? &pt->dev->ptype_specific :
- &ptype_base[ntohs(pt->type) & PTYPE_HASH_MASK];
+ &pt->af_packet_net->ptype_all;
+ }
+
+ if (pt->dev)
+ return &pt->dev->ptype_specific;
+
+ return pt->af_packet_net ? &pt->af_packet_net->ptype_specific :
+ &ptype_base[ntohs(pt->type) & PTYPE_HASH_MASK];
}
/**
@@ -596,6 +604,9 @@ void dev_add_pack(struct packet_type *pt)
{
struct list_head *head = ptype_head(pt);
+ if (WARN_ON_ONCE(!head))
+ return;
+
spin_lock(&ptype_lock);
list_add_rcu(&pt->list, head);
spin_unlock(&ptype_lock);
@@ -620,6 +631,9 @@ void __dev_remove_pack(struct packet_type *pt)
struct list_head *head = ptype_head(pt);
struct packet_type *pt1;
+ if (!head)
+ return;
+
spin_lock(&ptype_lock);
list_for_each_entry(pt1, head, list) {
@@ -2469,7 +2483,7 @@ static inline bool skb_loop_sk(struct packet_type *ptype, struct sk_buff *skb)
*/
bool dev_nit_active(struct net_device *dev)
{
- return !list_empty(&net_hotdata.ptype_all) ||
+ return !list_empty(&dev_net(dev)->ptype_all) ||
!list_empty(&dev->ptype_all);
}
EXPORT_SYMBOL_GPL(dev_nit_active);
@@ -2481,7 +2495,7 @@ EXPORT_SYMBOL_GPL(dev_nit_active);
void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
{
- struct list_head *ptype_list = &net_hotdata.ptype_all;
+ struct list_head *ptype_list = &dev_net(dev)->ptype_all;
struct packet_type *ptype, *pt_prev = NULL;
struct sk_buff *skb2 = NULL;
@@ -2529,7 +2543,7 @@ void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
pt_prev = ptype;
}
- if (ptype_list == &net_hotdata.ptype_all) {
+ if (ptype_list == &dev_net(dev)->ptype_all) {
ptype_list = &dev->ptype_all;
goto again;
}
@@ -5718,7 +5732,7 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
if (pfmemalloc)
goto skip_taps;
- list_for_each_entry_rcu(ptype, &net_hotdata.ptype_all, list) {
+ list_for_each_entry_rcu(ptype, &dev_net(skb->dev)->ptype_all, list) {
if (pt_prev)
ret = deliver_skb(skb, pt_prev, orig_dev);
pt_prev = ptype;
@@ -5830,6 +5844,8 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
deliver_ptype_list_skb(skb, &pt_prev, orig_dev, type,
&ptype_base[ntohs(type) &
PTYPE_HASH_MASK]);
+ deliver_ptype_list_skb(skb, &pt_prev, orig_dev, type,
+ &dev_net(orig_dev)->ptype_specific);
}
deliver_ptype_list_skb(skb, &pt_prev, orig_dev, type,
diff --git a/net/core/hotdata.c b/net/core/hotdata.c
index d0aaaaa556f22..0bc893d5f07b0 100644
--- a/net/core/hotdata.c
+++ b/net/core/hotdata.c
@@ -7,7 +7,6 @@
struct net_hotdata net_hotdata __cacheline_aligned = {
.offload_base = LIST_HEAD_INIT(net_hotdata.offload_base),
- .ptype_all = LIST_HEAD_INIT(net_hotdata.ptype_all),
.gro_normal_batch = 8,
.netdev_budget = 300,
diff --git a/net/core/net-procfs.c b/net/core/net-procfs.c
index fa6d3969734a6..8a5d93eb9d77a 100644
--- a/net/core/net-procfs.c
+++ b/net/core/net-procfs.c
@@ -185,7 +185,7 @@ static void *ptype_get_idx(struct seq_file *seq, loff_t pos)
}
}
- list_for_each_entry_rcu(pt, &net_hotdata.ptype_all, list) {
+ list_for_each_entry_rcu(pt, &seq_file_net(seq)->ptype_all, list) {
if (i == pos)
return pt;
++i;
@@ -210,6 +210,7 @@ static void *ptype_seq_start(struct seq_file *seq, loff_t *pos)
static void *ptype_seq_next(struct seq_file *seq, void *v, loff_t *pos)
{
+ struct net *net = seq_file_net(seq);
struct net_device *dev;
struct packet_type *pt;
struct list_head *nxt;
@@ -232,16 +233,15 @@ static void *ptype_seq_next(struct seq_file *seq, void *v, loff_t *pos)
goto found;
}
}
-
- nxt = net_hotdata.ptype_all.next;
- goto ptype_all;
+ nxt = net->ptype_all.next;
+ goto net_ptype_all;
}
- if (pt->type == htons(ETH_P_ALL)) {
-ptype_all:
- if (nxt != &net_hotdata.ptype_all)
+ if (pt->af_packet_net) {
+net_ptype_all:
+ if (nxt != &net->ptype_all)
goto found;
- hash = 0;
+
nxt = ptype_base[0].next;
} else
hash = ntohs(pt->type) & PTYPE_HASH_MASK;
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index 4303f2a492624..b0dfdf791ece5 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -340,6 +340,8 @@ static __net_init void preinit_net(struct net *net, struct user_namespace *user_
lock_set_cmp_fn(&net->rtnl_mutex, rtnl_net_lock_cmp_fn, NULL);
#endif
+ INIT_LIST_HEAD(&net->ptype_all);
+ INIT_LIST_HEAD(&net->ptype_specific);
preinit_net_sysctl(net);
}
--
2.48.1
* [RFC PATCH 2/2] net: hotdata optimization for netns ptypes
2025-03-14 13:04 [RFC PATCH 0/2] net: introduce per netns packet type chains Paolo Abeni
2025-03-14 13:05 ` [RFC PATCH 1/2] net: introduce per netns packet chains Paolo Abeni
@ 2025-03-14 13:05 ` Paolo Abeni
2025-03-17 9:18 ` Paolo Abeni
1 sibling, 1 reply; 9+ messages in thread
From: Paolo Abeni @ 2025-03-14 13:05 UTC (permalink / raw)
To: netdev
Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Simon Horman, Jonathan Corbet
Per netns ptype usage is/should be an exception, but the current code
unconditionally touches the related lists for each RX and TX packet.
Add a per device flag in the hot data net_device section to cache the
'netns ptype required' information, and update it according to the
relevant netns status. The new fields are placed in existing holes,
with a few existing fields moved slightly to fit the relevant
cacheline groups.
Be sure to keep such flag up to date when new devices are created and/or
devices are moved across namespaces, initializing it in list_netdevice().
In the fast path we can skip the per-netns list processing when such
flag is clear.
This avoids touching in the fast path the additional cacheline needed by
the previous patch.
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
---
This is a little cumbersome for possibly little gain. An alternative
could be caching even the per-device list status in similar
flags. Both RX and TX could then use a single conditional to completely
skip all the per dev/netns lists, potentially even moving the per
device lists out of hotdata.
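E.g. something like the following for the TX side (hypothetical
'ptype_any_active' flag, just to give the idea, not implemented here):

	bool dev_nit_active(struct net_device *dev)
	{
		/* single flag, set whenever either dev->ptype_all or the
		 * owning netns ptype_all list becomes non-empty
		 */
		return READ_ONCE(dev->ptype_any_active);
	}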
Side note: despite being unconditionally touched in the fast path on
both RX and TX, dev->ptype_all is currently not placed in any hot data
cacheline group.
---
.../networking/net_cachelines/net_device.rst | 2 +
include/linux/netdevice.h | 9 ++-
net/core/dev.c | 58 ++++++++++++++-----
3 files changed, 53 insertions(+), 16 deletions(-)
diff --git a/Documentation/networking/net_cachelines/net_device.rst b/Documentation/networking/net_cachelines/net_device.rst
index 6327e689e8a84..206f4afded60c 100644
--- a/Documentation/networking/net_cachelines/net_device.rst
+++ b/Documentation/networking/net_cachelines/net_device.rst
@@ -22,7 +22,9 @@ struct list_head napi_list
struct list_head unreg_list
struct list_head close_list
struct list_head ptype_all read_mostly dev_nit_active(tx)
+bool ptype_all_ns read_mostly dev_nit_active(tx)
struct list_head ptype_specific read_mostly deliver_ptype_list_skb/__netif_receive_skb_core(rx)
+bool ptype_specific_ns read_mostly deliver_ptype_list_skb/__netif_receive_skb_core(rx)
struct adj_list
unsigned_int flags read_mostly read_mostly __dev_queue_xmit,__dev_xmit_skb,ip6_output,__ip6_finish_output(tx);ip6_rcv_core(rx)
xdp_features_t xdp_features
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0dbfe069a6e38..031e3e42db4a6 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -1791,6 +1791,9 @@ enum netdev_reg_state {
* @close_list: List entry used when we are closing the device
* @ptype_all: Device-specific packet handlers for all protocols
* @ptype_specific: Device-specific, protocol-specific packet handlers
+ * @ptype_all_ns: The owning netns has packet handlers for all protocols
+ * @ptype_specific_ns: The owning netns has protocol-specific packet
+ * handlers
*
* @adj_list: Directly linked devices, like slaves for bonding
* @features: Currently active device features
@@ -2125,14 +2128,16 @@ struct net_device {
struct pcpu_dstats __percpu *dstats;
};
unsigned long state;
- unsigned int flags;
- unsigned short hard_header_len;
netdev_features_t features;
struct inet6_dev __rcu *ip6_ptr;
+ unsigned int flags;
+ unsigned short hard_header_len;
+ bool ptype_all_ns;
__cacheline_group_end(net_device_read_txrx);
/* RX read-mostly hotpath */
__cacheline_group_begin(net_device_read_rx);
+ bool ptype_specific_ns;
struct bpf_prog __rcu *xdp_prog;
struct list_head ptype_specific;
int ifindex;
diff --git a/net/core/dev.c b/net/core/dev.c
index 00bdd8316cb5e..878122757c78b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -405,6 +405,12 @@ static void list_netdevice(struct net_device *dev)
ASSERT_RTNL();
+ /* update ptype flags according to the current netns setting */
+ spin_lock(&ptype_lock);
+ dev->ptype_all_ns = !list_empty(&net->ptype_all);
+ dev->ptype_specific_ns = !list_empty(&net->ptype_specific);
+ spin_unlock(&ptype_lock);
+
list_add_tail_rcu(&dev->dev_list, &net->dev_base_head);
netdev_name_node_add(net, dev->name_node);
hlist_add_head_rcu(&dev->index_hlist,
@@ -587,6 +593,20 @@ static inline struct list_head *ptype_head(const struct packet_type *pt)
&ptype_base[ntohs(pt->type) & PTYPE_HASH_MASK];
}
+static void net_set_ptype(struct net *net, bool ptype_all, bool val)
+{
+ struct net_device *dev;
+
+ rcu_read_lock();
+ for_each_netdev_rcu(net, dev) {
+ if (ptype_all)
+ WRITE_ONCE(dev->ptype_all_ns, val);
+ else
+ WRITE_ONCE(dev->ptype_specific_ns, val);
+ }
+ rcu_read_unlock();
+}
+
/**
* dev_add_pack - add packet handler
* @pt: packet type declaration
@@ -609,6 +629,9 @@ void dev_add_pack(struct packet_type *pt)
spin_lock(&ptype_lock);
list_add_rcu(&pt->list, head);
+ if (pt->af_packet_net && !pt->dev && list_is_singular(head))
+ net_set_ptype(pt->af_packet_net, pt->type == htons(ETH_P_ALL),
+ true);
spin_unlock(&ptype_lock);
}
EXPORT_SYMBOL(dev_add_pack);
@@ -637,10 +660,13 @@ void __dev_remove_pack(struct packet_type *pt)
spin_lock(&ptype_lock);
list_for_each_entry(pt1, head, list) {
- if (pt == pt1) {
- list_del_rcu(&pt->list);
- goto out;
- }
+ if (pt != pt1)
+ continue;
+ list_del_rcu(&pt->list);
+ if (pt->af_packet_net && !pt->dev && list_empty(head))
+ net_set_ptype(pt->af_packet_net,
+ pt->type == htons(ETH_P_ALL), false);
+ goto out;
}
pr_warn("dev_remove_pack: %p not found\n", pt);
@@ -2483,8 +2509,7 @@ static inline bool skb_loop_sk(struct packet_type *ptype, struct sk_buff *skb)
*/
bool dev_nit_active(struct net_device *dev)
{
- return !list_empty(&dev_net(dev)->ptype_all) ||
- !list_empty(&dev->ptype_all);
+ return READ_ONCE(dev->ptype_all_ns) || !list_empty(&dev->ptype_all);
}
EXPORT_SYMBOL_GPL(dev_nit_active);
@@ -5732,10 +5757,12 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
if (pfmemalloc)
goto skip_taps;
- list_for_each_entry_rcu(ptype, &dev_net(skb->dev)->ptype_all, list) {
- if (pt_prev)
- ret = deliver_skb(skb, pt_prev, orig_dev);
- pt_prev = ptype;
+ if (READ_ONCE(skb->dev->ptype_all_ns)) {
+ list_for_each_entry_rcu(ptype, &dev_net(skb->dev)->ptype_all, list) {
+ if (pt_prev)
+ ret = deliver_skb(skb, pt_prev, orig_dev);
+ pt_prev = ptype;
+ }
}
list_for_each_entry_rcu(ptype, &skb->dev->ptype_all, list) {
@@ -5844,8 +5871,9 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
deliver_ptype_list_skb(skb, &pt_prev, orig_dev, type,
&ptype_base[ntohs(type) &
PTYPE_HASH_MASK]);
- deliver_ptype_list_skb(skb, &pt_prev, orig_dev, type,
- &dev_net(orig_dev)->ptype_specific);
+ if (READ_ONCE(skb->dev->ptype_specific_ns))
+ deliver_ptype_list_skb(skb, &pt_prev, orig_dev, type,
+ &dev_net(skb->dev)->ptype_specific);
}
deliver_ptype_list_skb(skb, &pt_prev, orig_dev, type,
@@ -12563,10 +12591,12 @@ static void __init net_dev_struct_check(void)
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_txrx, hard_header_len);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_txrx, features);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_txrx, ip6_ptr);
- CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_txrx, 46);
+ CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_txrx, ptype_all_ns);
+ CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_txrx, 47);
/* RX read-mostly hotpath */
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, ptype_specific);
+ CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, ptype_specific_ns);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, ifindex);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, real_num_rx_queues);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, _rx);
@@ -12581,7 +12611,7 @@ static void __init net_dev_struct_check(void)
#ifdef CONFIG_NET_XGRESS
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, tcx_ingress);
#endif
- CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_rx, 92);
+ CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_rx, 93);
}
/*
--
2.48.1
* Re: [RFC PATCH 1/2] net: introduce per netns packet chains
2025-03-14 13:05 ` [RFC PATCH 1/2] net: introduce per netns packet chains Paolo Abeni
@ 2025-03-14 22:33 ` Sabrina Dubroca
2025-03-17 8:40 ` Paolo Abeni
0 siblings, 1 reply; 9+ messages in thread
From: Sabrina Dubroca @ 2025-03-14 22:33 UTC (permalink / raw)
To: Paolo Abeni
Cc: netdev, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Simon Horman, Jonathan Corbet
2025-03-14, 14:05:00 +0100, Paolo Abeni wrote:
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 6fa6ed5b57987..00bdd8316cb5e 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -572,11 +572,19 @@ static inline void netdev_set_addr_lockdep_class(struct net_device *dev)
>
> static inline struct list_head *ptype_head(const struct packet_type *pt)
> {
> - if (pt->type == htons(ETH_P_ALL))
> - return pt->dev ? &pt->dev->ptype_all : &net_hotdata.ptype_all;
> - else
> + if (pt->type == htons(ETH_P_ALL)) {
> + if (!pt->af_packet_net && !pt->dev)
> + return NULL;
> +
> return pt->dev ? &pt->dev->ptype_specific :
s/specific/all/ ?
(ie ETH_P_ALL with pt->dev should go on &pt->dev->ptype_all like before)
> - &ptype_base[ntohs(pt->type) & PTYPE_HASH_MASK];
> + &pt->af_packet_net->ptype_all;
> + }
> +
> + if (pt->dev)
> + return &pt->dev->ptype_specific;
> +
> + return pt->af_packet_net ? &pt->af_packet_net->ptype_specific :
> + &ptype_base[ntohs(pt->type) & PTYPE_HASH_MASK];
> }
[...]
> diff --git a/net/core/net-procfs.c b/net/core/net-procfs.c
> index fa6d3969734a6..8a5d93eb9d77a 100644
> --- a/net/core/net-procfs.c
> +++ b/net/core/net-procfs.c
[...]
> @@ -232,16 +233,15 @@ static void *ptype_seq_next(struct seq_file *seq, void *v, loff_t *pos)
> goto found;
> }
> }
> -
> - nxt = net_hotdata.ptype_all.next;
> - goto ptype_all;
> + nxt = net->ptype_all.next;
> + goto net_ptype_all;
> }
>
> - if (pt->type == htons(ETH_P_ALL)) {
> -ptype_all:
> - if (nxt != &net_hotdata.ptype_all)
> + if (pt->af_packet_net) {
> +net_ptype_all:
> + if (nxt != &net->ptype_all)
> goto found;
This is missing similar code to find items on the new
net->ptype_specific list.
I think something like:
if (pt->af_packet_net) {
net_ptype_all:
- if (nxt != &net->ptype_all)
+ if (nxt != &net->ptype_all && nxt != &net->ptype_specific)
goto found;
+ if (nxt == &net->ptype_all) {
+ /* continue with ->ptype_specific if it's not empty */
+ nxt = net->ptype_specific.next;
+ if (nxt != &net->ptype_specific)
+ goto found;
+ }
nxt = ptype_base[0].next;
} else
(and probably something in ptype_get_idx as well)
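(for ptype_get_idx, maybe simply walking the new list right after the
net->ptype_all one would do - untested:)

	list_for_each_entry_rcu(pt, &seq_file_net(seq)->ptype_specific, list) {
		if (i == pos)
			return pt;
		++i;
	}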
> - hash = 0;
hash will now be used uninitialized in the loop a bit later in the
function?
while (nxt == &ptype_base[hash]) {
> +
> nxt = ptype_base[0].next;
--
Sabrina
* Re: [RFC PATCH 1/2] net: introduce per netns packet chains
2025-03-14 22:33 ` Sabrina Dubroca
@ 2025-03-17 8:40 ` Paolo Abeni
2025-03-17 8:43 ` Eric Dumazet
0 siblings, 1 reply; 9+ messages in thread
From: Paolo Abeni @ 2025-03-17 8:40 UTC (permalink / raw)
To: Sabrina Dubroca
Cc: netdev, Andrew Lunn, David S. Miller, Eric Dumazet,
Jakub Kicinski, Simon Horman, Jonathan Corbet
On 3/14/25 11:33 PM, Sabrina Dubroca wrote:
> 2025-03-14, 14:05:00 +0100, Paolo Abeni wrote:
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index 6fa6ed5b57987..00bdd8316cb5e 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -572,11 +572,19 @@ static inline void netdev_set_addr_lockdep_class(struct net_device *dev)
>>
>> static inline struct list_head *ptype_head(const struct packet_type *pt)
>> {
>> - if (pt->type == htons(ETH_P_ALL))
>> - return pt->dev ? &pt->dev->ptype_all : &net_hotdata.ptype_all;
>> - else
>> + if (pt->type == htons(ETH_P_ALL)) {
>> + if (!pt->af_packet_net && !pt->dev)
>> + return NULL;
>> +
>> return pt->dev ? &pt->dev->ptype_specific :
>
> s/specific/all/ ?
> (ie ETH_P_ALL with pt->dev should go on &pt->dev->ptype_all like before)
>
>> - &ptype_base[ntohs(pt->type) & PTYPE_HASH_MASK];
>> + &pt->af_packet_net->ptype_all;
>> + }
>> +
>> + if (pt->dev)
>> + return &pt->dev->ptype_specific;
>> +
>> + return pt->af_packet_net ? &pt->af_packet_net->ptype_specific :
>> + &ptype_base[ntohs(pt->type) & PTYPE_HASH_MASK];
>> }
>
> [...]
>> diff --git a/net/core/net-procfs.c b/net/core/net-procfs.c
>> index fa6d3969734a6..8a5d93eb9d77a 100644
>> --- a/net/core/net-procfs.c
>> +++ b/net/core/net-procfs.c
> [...]
>> @@ -232,16 +233,15 @@ static void *ptype_seq_next(struct seq_file *seq, void *v, loff_t *pos)
>> goto found;
>> }
>> }
>> -
>> - nxt = net_hotdata.ptype_all.next;
>> - goto ptype_all;
>> + nxt = net->ptype_all.next;
>> + goto net_ptype_all;
>> }
>>
>> - if (pt->type == htons(ETH_P_ALL)) {
>> -ptype_all:
>> - if (nxt != &net_hotdata.ptype_all)
>> + if (pt->af_packet_net) {
>> +net_ptype_all:
>> + if (nxt != &net->ptype_all)
>> goto found;
>
> This is missing similar code to find items on the new
> net->ptype_specific list.
>
> I think something like:
>
> if (pt->af_packet_net) {
> net_ptype_all:
> - if (nxt != &net->ptype_all)
> + if (nxt != &net->ptype_all && nxt != &net->ptype_specific)
> goto found;
> + if (nxt == &net->ptype_all) {
> + /* continue with ->ptype_specific if it's not empty */
> + nxt = net->ptype_specific.next;
> + if (nxt != &net->ptype_specific)
> + goto found;
> + }
>
> nxt = ptype_base[0].next;
> } else
>
>
> (and probably something in ptype_get_idx as well)
>
>
>> - hash = 0;
>
> hash will now be used uninitialized in the loop a bit later in the
> function?
Thanks Sabrina! I'll try to address the above in the next revision.
/P
* Re: [RFC PATCH 1/2] net: introduce per netns packet chains
2025-03-17 8:40 ` Paolo Abeni
@ 2025-03-17 8:43 ` Eric Dumazet
2025-03-17 9:23 ` Paolo Abeni
0 siblings, 1 reply; 9+ messages in thread
From: Eric Dumazet @ 2025-03-17 8:43 UTC (permalink / raw)
To: Paolo Abeni
Cc: Sabrina Dubroca, netdev, Andrew Lunn, David S. Miller,
Jakub Kicinski, Simon Horman, Jonathan Corbet
On Mon, Mar 17, 2025 at 9:40 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
>
> Thanks Sabrina! I'll try to address the above in the next revision.
Could you use dev_net_rcu() instead of dev_net() ?
Thanks !
* Re: [RFC PATCH 2/2] net: hotdata optimization for netns ptypes
2025-03-14 13:05 ` [RFC PATCH 2/2] net: hotdata optimization for netns ptypes Paolo Abeni
@ 2025-03-17 9:18 ` Paolo Abeni
0 siblings, 0 replies; 9+ messages in thread
From: Paolo Abeni @ 2025-03-17 9:18 UTC (permalink / raw)
To: netdev
Cc: Andrew Lunn, David S. Miller, Eric Dumazet, Jakub Kicinski,
Simon Horman, Jonathan Corbet, Sabrina Dubroca
On 3/14/25 2:05 PM, Paolo Abeni wrote:
> Per netns ptype usage is/should be an exception, but the current code
> unconditionally touches the related lists for each RX and TX packet.
>
> Add a per device flag in the hot data net_device section to cache the
> 'netns ptype required' information, and update it according to the
> relevant netns status. The new fields are placed in existing holes,
> with a few existing fields moved slightly to fit the relevant cacheline groups.
>
> Be sure to keep such flag up to date when new devices are created and/or
> devices are moved across namespaces, initializing it in list_netdevice().
>
> In the fast path we can skip the per-netns list processing when such
> flag is clear.
>
> This avoids touching in the fast path the additional cacheline needed by
> the previous patch.
>
> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
> ---
> This is a little cumbersome for possibly little gain. An alternative
> could be caching even the per-device list status in similar
> flags. Both RX and TX could then use a single conditional to completely
> skip all the per dev/netns lists, potentially even moving the per
> device lists out of hotdata.
>
> Side note: despite being unconditionally touched in the fast path on
> both RX and TX, dev->ptype_all is currently not placed in any hot data
> cacheline group.
I think we could consider non-empty ptype_all/specific lists as a
slightly less fast path - packet socket processing will add
considerably more overhead.
If so, we could consolidate the packet type checks into a couple of
read-mostly counters in hotdata, decreasing both the number of
conditionals and the data set size in the fast path. I'll test
something like the following:
---
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 0dbfe069a6e38..755dc9d9f9f4a 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -2134,7 +2134,6 @@ struct net_device {
/* RX read-mostly hotpath */
__cacheline_group_begin(net_device_read_rx);
struct bpf_prog __rcu *xdp_prog;
- struct list_head ptype_specific;
int ifindex;
unsigned int real_num_rx_queues;
struct netdev_rx_queue *_rx;
@@ -2174,6 +2173,7 @@ struct net_device {
struct list_head unreg_list;
struct list_head close_list;
struct list_head ptype_all;
+ struct list_head ptype_specific;
struct {
struct list_head upper;
diff --git a/include/net/net_namespace.h b/include/net/net_namespace.h
index bd57d8fb54f14..41ea0febf2260 100644
--- a/include/net/net_namespace.h
+++ b/include/net/net_namespace.h
@@ -575,6 +575,26 @@ static inline void fnhe_genid_bump(struct net *net)
atomic_inc(&net->fnhe_genid);
}
+static inline int net_ptype_all_cnt(const struct net *net)
+{
+ return READ_ONCE(net->ipv4.ptype_all_count);
+}
+
+static inline void net_ptype_all_set_cnt(struct net *net, int cnt)
+{
+ WRITE_ONCE(net->ipv4.ptype_all_count, cnt);
+}
+
+static inline int net_ptype_specific_cnt(const struct net *net)
+{
+ return READ_ONCE(net->ipv4.ptype_specific_count);
+}
+
+static inline void net_ptype_specific_set_cnt(struct net *net, int cnt)
+{
+ WRITE_ONCE(net->ipv4.ptype_specific_count, cnt);
+}
+
#ifdef CONFIG_NET
void net_ns_init(void);
#else
diff --git a/include/net/netns/ipv4.h b/include/net/netns/ipv4.h
index 6373e3f17da84..8df28c3c88929 100644
--- a/include/net/netns/ipv4.h
+++ b/include/net/netns/ipv4.h
@@ -75,6 +75,8 @@ struct netns_ipv4 {
/* TXRX readonly hotpath cache lines */
__cacheline_group_begin(netns_ipv4_read_txrx);
u8 sysctl_tcp_moderate_rcvbuf;
+ /* 2 bytes hole */
+ int ptype_all_count;
__cacheline_group_end(netns_ipv4_read_txrx);
/* RX readonly hotpath cache line */
@@ -82,9 +84,10 @@ struct netns_ipv4 {
u8 sysctl_ip_early_demux;
u8 sysctl_tcp_early_demux;
u8 sysctl_tcp_l3mdev_accept;
- /* 3 bytes hole, try to pack */
+ /* 1 bytes hole, try to pack */
int sysctl_tcp_reordering;
int sysctl_tcp_rmem[3];
+ int ptype_specific_count;
__cacheline_group_end(netns_ipv4_read_rx);
struct inet_timewait_death_row tcp_death_row;
diff --git a/net/core/dev.c b/net/core/dev.c
index ad1853da0a4b5..51cce0a164937 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -603,11 +603,18 @@ static inline struct list_head *ptype_head(const struct packet_type *pt)
void dev_add_pack(struct packet_type *pt)
{
struct list_head *head = ptype_head(pt);
+ struct net *net;
if (WARN_ON_ONCE(!head))
return;
spin_lock(&ptype_lock);
+ net = pt->dev ? dev_net(pt->dev) : pt->af_packet_net;
+ if (pt->type == htons(ETH_P_ALL))
+ net_ptype_all_set_cnt(net, net_ptype_all_cnt(net) + 1);
+ else
+ net_ptype_specific_set_cnt(net,
+ net_ptype_specific_cnt(net) + 1);
list_add_rcu(&pt->list, head);
spin_unlock(&ptype_lock);
}
@@ -630,6 +637,7 @@ void __dev_remove_pack(struct packet_type *pt)
{
struct list_head *head = ptype_head(pt);
struct packet_type *pt1;
+ struct net *net;
if (!head)
return;
@@ -637,10 +645,16 @@ void __dev_remove_pack(struct packet_type *pt)
spin_lock(&ptype_lock);
list_for_each_entry(pt1, head, list) {
- if (pt == pt1) {
- list_del_rcu(&pt->list);
- goto out;
- }
+ if (pt != pt1)
+ continue;
+ list_del_rcu(&pt->list);
+ net = pt->dev ? dev_net(pt->dev) : pt->af_packet_net;
+ if (pt->type == htons(ETH_P_ALL))
+ net_ptype_all_set_cnt(net, net_ptype_all_cnt(net) - 1);
+ else
+ net_ptype_specific_set_cnt(net,
+ net_ptype_specific_cnt(net) - 1);
+ goto out;
}
pr_warn("dev_remove_pack: %p not found\n", pt);
@@ -2483,8 +2497,7 @@ static inline bool skb_loop_sk(struct packet_type *ptype, struct sk_buff *skb)
*/
bool dev_nit_active(struct net_device *dev)
{
- return !list_empty(&dev_net(dev)->ptype_all) ||
- !list_empty(&dev->ptype_all);
+ return net_ptype_all_cnt(dev_net(dev)) > 0;
}
EXPORT_SYMBOL_GPL(dev_nit_active);
@@ -5671,11 +5684,49 @@ static inline int nf_ingress(struct sk_buff *skb, struct packet_type **pt_prev,
return 0;
}
+static int deliver_ptype_all_skb(struct sk_buff *skb, int ret,
+ struct packet_type **ppt_prev,
+ struct net_device *orig_dev)
+{
+ struct packet_type *ptype, *pt_prev = *ppt_prev;
+
+ list_for_each_entry_rcu(ptype, &dev_net(skb->dev)->ptype_all, list) {
+ if (pt_prev)
+ ret = deliver_skb(skb, pt_prev, orig_dev);
+ pt_prev = ptype;
+ }
+
+ list_for_each_entry_rcu(ptype, &skb->dev->ptype_all, list) {
+ if (pt_prev)
+ ret = deliver_skb(skb, pt_prev, orig_dev);
+ pt_prev = ptype;
+ }
+ *ppt_prev = pt_prev;
+ return ret;
+}
+
+static void deliver_ptype_specific_skb(struct sk_buff *skb, bool deliver_exact,
+ struct packet_type **ppt_prev,
+ struct net_device *orig_dev,
+ __be16 type)
+{
+ if (deliver_exact)
+ deliver_ptype_list_skb(skb, ppt_prev, orig_dev, type,
+ &dev_net(skb->dev)->ptype_specific);
+
+ deliver_ptype_list_skb(skb, ppt_prev, orig_dev, type,
+ &orig_dev->ptype_specific);
+
+ if (unlikely(skb->dev != orig_dev))
+ deliver_ptype_list_skb(skb, ppt_prev, orig_dev, type,
+ &skb->dev->ptype_specific);
+}
+
static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
struct packet_type **ppt_prev)
{
- struct packet_type *ptype, *pt_prev;
rx_handler_func_t *rx_handler;
+ struct packet_type *pt_prev;
struct sk_buff *skb = *pskb;
struct net_device *orig_dev;
bool deliver_exact = false;
@@ -5732,17 +5783,8 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
if (pfmemalloc)
goto skip_taps;
- list_for_each_entry_rcu(ptype, &dev_net(skb->dev)->ptype_all, list) {
- if (pt_prev)
- ret = deliver_skb(skb, pt_prev, orig_dev);
- pt_prev = ptype;
- }
-
- list_for_each_entry_rcu(ptype, &skb->dev->ptype_all, list) {
- if (pt_prev)
- ret = deliver_skb(skb, pt_prev, orig_dev);
- pt_prev = ptype;
- }
+ if (net_ptype_all_cnt(dev_net(skb->dev)) > 0)
+ ret = deliver_ptype_all_skb(skb, ret, &pt_prev, orig_dev);
skip_taps:
#ifdef CONFIG_NET_INGRESS
@@ -5840,21 +5882,14 @@ static int __netif_receive_skb_core(struct sk_buff **pskb, bool pfmemalloc,
type = skb->protocol;
/* deliver only exact match when indicated */
- if (likely(!deliver_exact)) {
+ if (likely(!deliver_exact))
deliver_ptype_list_skb(skb, &pt_prev, orig_dev, type,
&ptype_base[ntohs(type) &
PTYPE_HASH_MASK]);
- deliver_ptype_list_skb(skb, &pt_prev, orig_dev, type,
- &dev_net(orig_dev)->ptype_specific);
- }
-
- deliver_ptype_list_skb(skb, &pt_prev, orig_dev, type,
- &orig_dev->ptype_specific);
- if (unlikely(skb->dev != orig_dev)) {
- deliver_ptype_list_skb(skb, &pt_prev, orig_dev, type,
- &skb->dev->ptype_specific);
- }
+ if (net_ptype_specific_cnt(dev_net(skb->dev)) > 0)
+ deliver_ptype_specific_skb(skb, deliver_exact, &pt_prev,
+ orig_dev, type);
if (pt_prev) {
if (unlikely(skb_orphan_frags_rx(skb, GFP_ATOMIC)))
@@ -12566,7 +12601,6 @@ static void __init net_dev_struct_check(void)
CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_txrx, 46);
/* RX read-mostly hotpath */
- CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, ptype_specific);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, ifindex);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, real_num_rx_queues);
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, _rx);
@@ -12581,7 +12615,7 @@ static void __init net_dev_struct_check(void)
#ifdef CONFIG_NET_XGRESS
CACHELINE_ASSERT_GROUP_MEMBER(struct net_device, net_device_read_rx, tcx_ingress);
#endif
- CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_rx, 92);
+ CACHELINE_ASSERT_GROUP_SIZE(struct net_device, net_device_read_rx, 76);
}
/*
diff --git a/net/core/net_namespace.c b/net/core/net_namespace.c
index b0dfdf791ece5..551355665aaee 100644
--- a/net/core/net_namespace.c
+++ b/net/core/net_namespace.c
@@ -1174,7 +1174,7 @@ static void __init netns_ipv4_struct_check(void)
/* TXRX readonly hotpath cache lines */
CACHELINE_ASSERT_GROUP_MEMBER(struct netns_ipv4, netns_ipv4_read_txrx,
sysctl_tcp_moderate_rcvbuf);
- CACHELINE_ASSERT_GROUP_SIZE(struct netns_ipv4, netns_ipv4_read_txrx, 1);
+ CACHELINE_ASSERT_GROUP_SIZE(struct netns_ipv4, netns_ipv4_read_txrx, 7);
/* RX readonly hotpath cache line */
CACHELINE_ASSERT_GROUP_MEMBER(struct netns_ipv4, netns_ipv4_read_rx,
@@ -1187,7 +1187,7 @@ static void __init netns_ipv4_struct_check(void)
sysctl_tcp_reordering);
CACHELINE_ASSERT_GROUP_MEMBER(struct netns_ipv4, netns_ipv4_read_rx,
sysctl_tcp_rmem);
- CACHELINE_ASSERT_GROUP_SIZE(struct netns_ipv4, netns_ipv4_read_rx, 22);
+ CACHELINE_ASSERT_GROUP_SIZE(struct netns_ipv4, netns_ipv4_read_rx, 24);
}
#endif
* Re: [RFC PATCH 1/2] net: introduce per netns packet chains
2025-03-17 8:43 ` Eric Dumazet
@ 2025-03-17 9:23 ` Paolo Abeni
2025-03-17 10:46 ` Eric Dumazet
0 siblings, 1 reply; 9+ messages in thread
From: Paolo Abeni @ 2025-03-17 9:23 UTC (permalink / raw)
To: Eric Dumazet
Cc: Sabrina Dubroca, netdev, Andrew Lunn, David S. Miller,
Jakub Kicinski, Simon Horman, Jonathan Corbet
On 3/17/25 9:43 AM, Eric Dumazet wrote:
> On Mon, Mar 17, 2025 at 9:40 AM Paolo Abeni <pabeni@redhat.com> wrote:
>>
>>
>> Thanks Sabrina! I'll try to address the above in the next revision.
>
> Could you use dev_net_rcu() instead of dev_net() ?
Indeed all the call sites are under RCU - except the one in
dev_queue_xmit_nit(), which will need a little adjustment. I'll do that
in the next revision, thanks!
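Something along these lines on top of patch 1 (completely untested),
moving the list lookup under the RCU read side lock:

 void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
 {
-	struct list_head *ptype_list = &dev_net(dev)->ptype_all;
 	struct packet_type *ptype, *pt_prev = NULL;
+	struct list_head *ptype_list;
 	struct sk_buff *skb2 = NULL;
 
 	rcu_read_lock();
+	ptype_list = &dev_net_rcu(dev)->ptype_all;
 again: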
Paolo
* Re: [RFC PATCH 1/2] net: introduce per netns packet chains
2025-03-17 9:23 ` Paolo Abeni
@ 2025-03-17 10:46 ` Eric Dumazet
0 siblings, 0 replies; 9+ messages in thread
From: Eric Dumazet @ 2025-03-17 10:46 UTC (permalink / raw)
To: Paolo Abeni
Cc: Sabrina Dubroca, netdev, Andrew Lunn, David S. Miller,
Jakub Kicinski, Simon Horman, Jonathan Corbet
On Mon, Mar 17, 2025 at 10:24 AM Paolo Abeni <pabeni@redhat.com> wrote:
>
> On 3/17/25 9:43 AM, Eric Dumazet wrote:
> > On Mon, Mar 17, 2025 at 9:40 AM Paolo Abeni <pabeni@redhat.com> wrote:
> >>
> >>
> >> Thanks Sabrina! I'll try to address the above in the next revision.
> >
> > Could you use dev_net_rcu() instead of dev_net() ?
>
> Indeed all the call site are under rcu - except the one in
> dev_queue_xmit_nit() that will need a little adjustment. I'll do in the
> next revision, thanks!
I think this patch makes a lot of sense; I would submit it as a standalone one?
Second patch I honestly don't know yet :)
Thread overview: 9+ messages
2025-03-14 13:04 [RFC PATCH 0/2] net: introduce per netns packet type chains Paolo Abeni
2025-03-14 13:05 ` [RFC PATCH 1/2] net: introduce per netns packet chains Paolo Abeni
2025-03-14 22:33 ` Sabrina Dubroca
2025-03-17 8:40 ` Paolo Abeni
2025-03-17 8:43 ` Eric Dumazet
2025-03-17 9:23 ` Paolo Abeni
2025-03-17 10:46 ` Eric Dumazet
2025-03-14 13:05 ` [RFC PATCH 2/2] net: hotdata optimization for netns ptypes Paolo Abeni
2025-03-17 9:18 ` Paolo Abeni