* [PATCH net-next 0/3] net: lockless skb_attempt_defer_free()
@ 2025-09-26 15:13 Eric Dumazet
2025-09-26 15:13 ` [PATCH net-next 1/3] net: make softnet_data.defer_count an atomic Eric Dumazet
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Eric Dumazet @ 2025-09-26 15:13 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
eric.dumazet, Eric Dumazet

Platforms with many cpus and a relatively slow interconnect show
significant spinlock contention in skb_attempt_defer_free().

This series refactors this infrastructure to be NUMA aware
and lockless.

Tested on various platforms, including AMD Zen 2/3/4
and Intel Granite Rapids, showing significant cost reductions
under network stress (more than 20 Mpps).

Eric Dumazet (3):
  net: make softnet_data.defer_count an atomic
  net: use llist for sd->defer_list
  net: add NUMA awareness to skb_attempt_defer_free()

 include/linux/netdevice.h |  6 +-----
 include/net/hotdata.h     |  7 +++++++
 net/core/dev.c            | 43 +++++++++++++++++++++++----------------
 net/core/dev.h            |  2 +-
 net/core/skbuff.c         | 24 ++++++++++------------
 5 files changed, 45 insertions(+), 37 deletions(-)
--
2.51.0.536.g15c5d4f767-goog
^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH net-next 1/3] net: make softnet_data.defer_count an atomic
  2025-09-26 15:13 [PATCH net-next 0/3] net: lockless skb_attempt_defer_free() Eric Dumazet
@ 2025-09-26 15:13 ` Eric Dumazet
  2025-09-27  1:10   ` Jason Xing
  2025-09-27 19:39   ` Kuniyuki Iwashima
  2025-09-26 15:13 ` [PATCH net-next 2/3] net: use llist for sd->defer_list Eric Dumazet
  2025-09-26 15:13 ` [PATCH net-next 3/3] net: add NUMA awareness to skb_attempt_defer_free() Eric Dumazet
  2 siblings, 2 replies; 11+ messages in thread
From: Eric Dumazet @ 2025-09-26 15:13 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, Eric Dumazet

This is preparation work to remove the softnet_data.defer_lock,
as it is contended on hosts with large number of cores.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/netdevice.h | 2 +-
 net/core/dev.c            | 2 +-
 net/core/skbuff.c         | 6 ++----
 3 files changed, 4 insertions(+), 6 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1b85454116f666ced61a1450d3f899940f499c05..27e3fa69253f694b98d32b6138cf491da5a8b824 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3538,7 +3538,7 @@ struct softnet_data {
 
 	/* Another possibly contended cache line */
 	spinlock_t	defer_lock ____cacheline_aligned_in_smp;
-	int		defer_count;
+	atomic_t	defer_count;
 	int		defer_ipi_scheduled;
 	struct sk_buff	*defer_list;
 	call_single_data_t	defer_csd;

diff --git a/net/core/dev.c b/net/core/dev.c
index 8b54fdf0289ab223fc37d27a078536db37646b55..8566678d83444e8aacbfea4842878279cf28516f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6726,7 +6726,7 @@ static void skb_defer_free_flush(struct softnet_data *sd)
 	spin_lock(&sd->defer_lock);
 	skb = sd->defer_list;
 	sd->defer_list = NULL;
-	sd->defer_count = 0;
+	atomic_set(&sd->defer_count, 0);
 	spin_unlock(&sd->defer_lock);
 
 	while (skb != NULL) {

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index daaf6da43cc9e199389c3afcd6621c177d247884..f91571f51c69ecf8c2fffed5f3a3cd33fd95828b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -7201,14 +7201,12 @@ nodefer:	kfree_skb_napi_cache(skb);
 
 	sd = &per_cpu(softnet_data, cpu);
 	defer_max = READ_ONCE(net_hotdata.sysctl_skb_defer_max);
-	if (READ_ONCE(sd->defer_count) >= defer_max)
+	if (atomic_read(&sd->defer_count) >= defer_max)
 		goto nodefer;
 
 	spin_lock_bh(&sd->defer_lock);
 	/* Send an IPI every time queue reaches half capacity. */
-	kick = sd->defer_count == (defer_max >> 1);
-	/* Paired with the READ_ONCE() few lines above */
-	WRITE_ONCE(sd->defer_count, sd->defer_count + 1);
+	kick = (atomic_inc_return(&sd->defer_count) - 1) == (defer_max >> 1);
 
 	skb->next = sd->defer_list;
 	/* Paired with READ_ONCE() in skb_defer_free_flush() */
-- 
2.51.0.536.g15c5d4f767-goog

^ permalink raw reply related	[flat|nested] 11+ messages in thread
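
A side note on the conversion above: the kernel's atomic_inc_return() returns the value *after*
the increment, so "atomic_inc_return(&sd->defer_count) - 1" is the pre-increment value, and the
kick still fires exactly once, when the queue crosses half of defer_max, just as the old
"defer_count == (defer_max >> 1)" test did under the lock. The standalone C11 sketch below
reproduces that single-winner election in userspace; it is only an analogue of the kernel atomic
API, and all names in it are illustrative, not taken from the patch.

/* Userspace analogue of the half-capacity kick election in patch 1/3.
 * C11 atomic_fetch_add() returns the value *before* the increment,
 * which is what "atomic_inc_return(...) - 1" computes in the patch.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static atomic_long defer_count;

static bool defer_one(long defer_max)
{
	long before = atomic_fetch_add(&defer_count, 1);

	/* Exactly one caller observes the half-capacity value and is
	 * elected to kick the remote CPU.
	 */
	return before == (defer_max >> 1);
}

int main(void)
{
	const long defer_max = 64;

	for (int i = 0; i < 100; i++)
		if (defer_one(defer_max))
			printf("kick after %d queued entries\n", i + 1);
	return 0;
}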
* Re: [PATCH net-next 1/3] net: make softnet_data.defer_count an atomic
  2025-09-26 15:13 ` [PATCH net-next 1/3] net: make softnet_data.defer_count an atomic Eric Dumazet
@ 2025-09-27  1:10   ` Jason Xing
  2025-09-27 19:39   ` Kuniyuki Iwashima
  1 sibling, 0 replies; 11+ messages in thread
From: Jason Xing @ 2025-09-27 1:10 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet

On Fri, Sep 26, 2025 at 11:13 PM Eric Dumazet <edumazet@google.com> wrote:
>
> This is preparation work to remove the softnet_data.defer_lock,
> as it is contended on hosts with large number of cores.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH net-next 1/3] net: make softnet_data.defer_count an atomic
  2025-09-26 15:13 ` [PATCH net-next 1/3] net: make softnet_data.defer_count an atomic Eric Dumazet
  2025-09-27  1:10   ` Jason Xing
@ 2025-09-27 19:39   ` Kuniyuki Iwashima
  1 sibling, 0 replies; 11+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-27 19:39 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Willem de Bruijn, netdev, eric.dumazet

On Fri, Sep 26, 2025 at 8:13 AM Eric Dumazet <edumazet@google.com> wrote:
>
> This is preparation work to remove the softnet_data.defer_lock,
> as it is contended on hosts with large number of cores.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>

^ permalink raw reply	[flat|nested] 11+ messages in thread
* [PATCH net-next 2/3] net: use llist for sd->defer_list
  2025-09-26 15:13 [PATCH net-next 0/3] net: lockless skb_attempt_defer_free() Eric Dumazet
  2025-09-26 15:13 ` [PATCH net-next 1/3] net: make softnet_data.defer_count an atomic Eric Dumazet
@ 2025-09-26 15:13 ` Eric Dumazet
  2025-09-27  1:09   ` Jason Xing
  2025-09-27 19:52   ` Kuniyuki Iwashima
  2025-09-26 15:13 ` [PATCH net-next 3/3] net: add NUMA awareness to skb_attempt_defer_free() Eric Dumazet
  2 siblings, 2 replies; 11+ messages in thread
From: Eric Dumazet @ 2025-09-26 15:13 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, Eric Dumazet

Get rid of sd->defer_lock and adopt llist operations.

We optimize skb_attempt_defer_free() for the common case,
where the packet is queued. Otherwise sd->defer_count
is increasing, until skb_defer_free_flush() clears it.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/netdevice.h |  8 ++++----
 net/core/dev.c            | 18 ++++++------------
 net/core/skbuff.c         | 15 +++++++--------
 3 files changed, 17 insertions(+), 24 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 27e3fa69253f694b98d32b6138cf491da5a8b824..5c9aa16933d197f70746d64e5f44cae052d9971c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3537,10 +3537,10 @@ struct softnet_data {
 	struct numa_drop_counters drop_counters;
 
 	/* Another possibly contended cache line */
-	spinlock_t	defer_lock ____cacheline_aligned_in_smp;
-	atomic_t	defer_count;
-	int		defer_ipi_scheduled;
-	struct sk_buff	*defer_list;
+	struct llist_head defer_list ____cacheline_aligned_in_smp;
+	atomic_long_t	defer_count;
+
+	int		defer_ipi_scheduled ____cacheline_aligned_in_smp;
 	call_single_data_t	defer_csd;
 };
 

diff --git a/net/core/dev.c b/net/core/dev.c
index 8566678d83444e8aacbfea4842878279cf28516f..fb67372774de10b0b112ca71c7c7a13819c2325b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6717,22 +6717,16 @@ EXPORT_SYMBOL(napi_complete_done);
 
 static void skb_defer_free_flush(struct softnet_data *sd)
 {
+	struct llist_node *free_list;
 	struct sk_buff *skb, *next;
 
-	/* Paired with WRITE_ONCE() in skb_attempt_defer_free() */
-	if (!READ_ONCE(sd->defer_list))
+	if (llist_empty(&sd->defer_list))
 		return;
+	atomic_long_set(&sd->defer_count, 0);
+	free_list = llist_del_all(&sd->defer_list);
 
-	spin_lock(&sd->defer_lock);
-	skb = sd->defer_list;
-	sd->defer_list = NULL;
-	atomic_set(&sd->defer_count, 0);
-	spin_unlock(&sd->defer_lock);
-
-	while (skb != NULL) {
-		next = skb->next;
+	llist_for_each_entry_safe(skb, next, free_list, ll_node) {
 		napi_consume_skb(skb, 1);
-		skb = next;
 	}
 }
 
@@ -12995,7 +12989,7 @@ static int __init net_dev_init(void)
 		sd->cpu = i;
 #endif
 		INIT_CSD(&sd->defer_csd, trigger_rx_softirq, sd);
-		spin_lock_init(&sd->defer_lock);
+		init_llist_head(&sd->defer_list);
 
 		gro_init(&sd->backlog.gro);
 		sd->backlog.poll = process_backlog;

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f91571f51c69ecf8c2fffed5f3a3cd33fd95828b..22d9dba0e433cf67243a5b7dda77e61d146baf50 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -7184,6 +7184,7 @@ static void kfree_skb_napi_cache(struct sk_buff *skb)
  */
 void skb_attempt_defer_free(struct sk_buff *skb)
 {
+	unsigned long defer_count;
 	int cpu = skb->alloc_cpu;
 	struct softnet_data *sd;
 	unsigned int defer_max;
@@ -7201,17 +7202,15 @@ nodefer:	kfree_skb_napi_cache(skb);
 
 	sd = &per_cpu(softnet_data, cpu);
 	defer_max = READ_ONCE(net_hotdata.sysctl_skb_defer_max);
-	if (atomic_read(&sd->defer_count) >= defer_max)
+	defer_count = atomic_long_inc_return(&sd->defer_count);
+
+	if (defer_count >= defer_max)
 		goto nodefer;
 
-	spin_lock_bh(&sd->defer_lock);
-	/* Send an IPI every time queue reaches half capacity. */
-	kick = (atomic_inc_return(&sd->defer_count) - 1) == (defer_max >> 1);
+	llist_add(&skb->ll_node, &sd->defer_list);
 
-	skb->next = sd->defer_list;
-	/* Paired with READ_ONCE() in skb_defer_free_flush() */
-	WRITE_ONCE(sd->defer_list, skb);
-	spin_unlock_bh(&sd->defer_lock);
+	/* Send an IPI every time queue reaches half capacity. */
+	kick = (defer_count - 1) == (defer_max >> 1);
 
 	/* Make sure to trigger NET_RX_SOFTIRQ on the remote CPU
 	 * if we are unlucky enough (this seems very unlikely).
-- 
2.51.0.536.g15c5d4f767-goog

^ permalink raw reply related	[flat|nested] 11+ messages in thread
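
The llist API adopted above is the kernel's lock-free singly linked list: llist_add() pushes a
node with a single cmpxchg on the list head, and llist_del_all() detaches the entire chain with
one atomic exchange, so skb_defer_free_flush() can walk the detached list with no lock held.
Below is a standalone C11 sketch of that push/take-all pattern; it only mimics <linux/llist.h>
with C11 atomics, and the node and field names are illustrative. Note also that, as the commit
message says, defer_count becomes an approximation once the queue is full: callers that hit the
nodefer path have still incremented it, and it is simply reset to zero at the next flush.

/* Userspace sketch of the llist_add()/llist_del_all() pattern used in
 * patch 2/3: producers push with a compare-and-swap loop, the consumer
 * detaches the whole list with one atomic exchange and walks it lock-free.
 */
#include <stdatomic.h>
#include <stddef.h>
#include <stdio.h>

struct node {
	struct node *next;
	int payload;
};

static _Atomic(struct node *) defer_list;

/* Producer side: analogue of llist_add(&skb->ll_node, &sd->defer_list). */
static void push(struct node *n)
{
	struct node *first = atomic_load(&defer_list);

	do {
		n->next = first;
	} while (!atomic_compare_exchange_weak(&defer_list, &first, n));
}

/* Consumer side: analogue of llist_del_all(&sd->defer_list). */
static struct node *del_all(void)
{
	return atomic_exchange(&defer_list, NULL);
}

int main(void)
{
	struct node nodes[4];

	for (int i = 0; i < 4; i++) {
		nodes[i].payload = i;
		push(&nodes[i]);
	}

	/* Nodes come back in LIFO order, as with llist. */
	for (struct node *n = del_all(); n; n = n->next)
		printf("freeing node %d\n", n->payload);
	return 0;
}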
* Re: [PATCH net-next 2/3] net: use llist for sd->defer_list
  2025-09-26 15:13 ` [PATCH net-next 2/3] net: use llist for sd->defer_list Eric Dumazet
@ 2025-09-27  1:09   ` Jason Xing
  2025-09-27 19:52   ` Kuniyuki Iwashima
  1 sibling, 0 replies; 11+ messages in thread
From: Jason Xing @ 2025-09-27 1:09 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet

On Fri, Sep 26, 2025 at 11:13 PM Eric Dumazet <edumazet@google.com> wrote:
>
> Get rid of sd->defer_lock and adopt llist operations.
>
> We optimize skb_attempt_defer_free() for the common case,
> where the packet is queued. Otherwise sd->defer_count
> is increasing, until skb_defer_free_flush() clears it.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Quite interesting optimization! I like the no lock version. Thanks!

Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>

> ---
>  include/linux/netdevice.h |  8 ++++----
>  net/core/dev.c            | 18 ++++++------------
>  net/core/skbuff.c         | 15 +++++++--------
>  3 files changed, 17 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 27e3fa69253f694b98d32b6138cf491da5a8b824..5c9aa16933d197f70746d64e5f44cae052d9971c 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3537,10 +3537,10 @@ struct softnet_data {
>         struct numa_drop_counters drop_counters;
>
>         /* Another possibly contended cache line */
> -       spinlock_t      defer_lock ____cacheline_aligned_in_smp;
> -       atomic_t        defer_count;
> -       int             defer_ipi_scheduled;
> -       struct sk_buff  *defer_list;
> +       struct llist_head defer_list ____cacheline_aligned_in_smp;
> +       atomic_long_t   defer_count;
> +
> +       int             defer_ipi_scheduled ____cacheline_aligned_in_smp;
>         call_single_data_t      defer_csd;
>  };
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 8566678d83444e8aacbfea4842878279cf28516f..fb67372774de10b0b112ca71c7c7a13819c2325b 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6717,22 +6717,16 @@ EXPORT_SYMBOL(napi_complete_done);
>
>  static void skb_defer_free_flush(struct softnet_data *sd)
>  {
> +       struct llist_node *free_list;
>         struct sk_buff *skb, *next;
>
> -       /* Paired with WRITE_ONCE() in skb_attempt_defer_free() */
> -       if (!READ_ONCE(sd->defer_list))
> +       if (llist_empty(&sd->defer_list))
>                 return;
> +       atomic_long_set(&sd->defer_count, 0);
> +       free_list = llist_del_all(&sd->defer_list);
>
> -       spin_lock(&sd->defer_lock);
> -       skb = sd->defer_list;
> -       sd->defer_list = NULL;
> -       atomic_set(&sd->defer_count, 0);
> -       spin_unlock(&sd->defer_lock);
> -
> -       while (skb != NULL) {
> -               next = skb->next;
> +       llist_for_each_entry_safe(skb, next, free_list, ll_node) {

nit: no need to keep brackets

>                 napi_consume_skb(skb, 1);
> -               skb = next;
>         }
>  }
>
> @@ -12995,7 +12989,7 @@ static int __init net_dev_init(void)
>                 sd->cpu = i;
>  #endif
>                 INIT_CSD(&sd->defer_csd, trigger_rx_softirq, sd);
> -               spin_lock_init(&sd->defer_lock);
> +               init_llist_head(&sd->defer_list);
>
>                 gro_init(&sd->backlog.gro);
>                 sd->backlog.poll = process_backlog;
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f91571f51c69ecf8c2fffed5f3a3cd33fd95828b..22d9dba0e433cf67243a5b7dda77e61d146baf50 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -7184,6 +7184,7 @@ static void kfree_skb_napi_cache(struct sk_buff *skb)
>   */
>  void skb_attempt_defer_free(struct sk_buff *skb)
>  {
> +       unsigned long defer_count;
>         int cpu = skb->alloc_cpu;
>         struct softnet_data *sd;
>         unsigned int defer_max;
> @@ -7201,17 +7202,15 @@ nodefer:        kfree_skb_napi_cache(skb);
>
>         sd = &per_cpu(softnet_data, cpu);
>         defer_max = READ_ONCE(net_hotdata.sysctl_skb_defer_max);
> -       if (atomic_read(&sd->defer_count) >= defer_max)
> +       defer_count = atomic_long_inc_return(&sd->defer_count);
> +
> +       if (defer_count >= defer_max)
>                 goto nodefer;
>
> -       spin_lock_bh(&sd->defer_lock);
> -       /* Send an IPI every time queue reaches half capacity. */
> -       kick = (atomic_inc_return(&sd->defer_count) - 1) == (defer_max >> 1);
> +       llist_add(&skb->ll_node, &sd->defer_list);
>
> -       skb->next = sd->defer_list;
> -       /* Paired with READ_ONCE() in skb_defer_free_flush() */
> -       WRITE_ONCE(sd->defer_list, skb);
> -       spin_unlock_bh(&sd->defer_lock);
> +       /* Send an IPI every time queue reaches half capacity. */
> +       kick = (defer_count - 1) == (defer_max >> 1);
>
>         /* Make sure to trigger NET_RX_SOFTIRQ on the remote CPU
>          * if we are unlucky enough (this seems very unlikely).
> --
> 2.51.0.536.g15c5d4f767-goog
>

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH net-next 2/3] net: use llist for sd->defer_list
  2025-09-26 15:13 ` [PATCH net-next 2/3] net: use llist for sd->defer_list Eric Dumazet
  2025-09-27  1:09   ` Jason Xing
@ 2025-09-27 19:52   ` Kuniyuki Iwashima
  1 sibling, 0 replies; 11+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-27 19:52 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Willem de Bruijn, netdev, eric.dumazet

On Fri, Sep 26, 2025 at 8:13 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Get rid of sd->defer_lock and adopt llist operations.
>
> We optimize skb_attempt_defer_free() for the common case,
> where the packet is queued. Otherwise sd->defer_count
> is increasing, until skb_defer_free_flush() clears it.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>

^ permalink raw reply	[flat|nested] 11+ messages in thread
* [PATCH net-next 3/3] net: add NUMA awareness to skb_attempt_defer_free()
  2025-09-26 15:13 [PATCH net-next 0/3] net: lockless skb_attempt_defer_free() Eric Dumazet
  2025-09-26 15:13 ` [PATCH net-next 1/3] net: make softnet_data.defer_count an atomic Eric Dumazet
  2025-09-26 15:13 ` [PATCH net-next 2/3] net: use llist for sd->defer_list Eric Dumazet
@ 2025-09-26 15:13 ` Eric Dumazet
  2025-09-27  1:33   ` Jason Xing
  2025-09-27 20:28   ` Kuniyuki Iwashima
  2 siblings, 2 replies; 11+ messages in thread
From: Eric Dumazet @ 2025-09-26 15:13 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet, Eric Dumazet

Instead of sharing sd->defer_list & sd->defer_count with
many cpus, add one pair for each NUMA node.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/netdevice.h |  4 ----
 include/net/hotdata.h     |  7 +++++++
 net/core/dev.c            | 37 +++++++++++++++++++++++------------
 net/core/dev.h            |  2 +-
 net/core/skbuff.c         | 11 ++++++-----
 5 files changed, 39 insertions(+), 22 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5c9aa16933d197f70746d64e5f44cae052d9971c..d1a687444b275d45d105e336d2ede264fd310f1b 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3536,10 +3536,6 @@ struct softnet_data {
 
 	struct numa_drop_counters drop_counters;
 
-	/* Another possibly contended cache line */
-	struct llist_head defer_list ____cacheline_aligned_in_smp;
-	atomic_long_t	defer_count;
-
 	int		defer_ipi_scheduled ____cacheline_aligned_in_smp;
 	call_single_data_t	defer_csd;
 };

diff --git a/include/net/hotdata.h b/include/net/hotdata.h
index fda94b2647ffa242c256c95ae929d9ef25e54f96..4acec191c54ab367ca12fff590d1f8c8aad64651 100644
--- a/include/net/hotdata.h
+++ b/include/net/hotdata.h
@@ -2,10 +2,16 @@
 #ifndef _NET_HOTDATA_H
 #define _NET_HOTDATA_H
 
+#include <linux/llist.h>
 #include <linux/types.h>
 #include <linux/netdevice.h>
 #include <net/protocol.h>
 
+struct skb_defer_node {
+	struct llist_head	defer_list;
+	atomic_long_t		defer_count;
+} ____cacheline_aligned_in_smp;
+
 /* Read mostly data used in network fast paths.
  */
 struct net_hotdata {
 #if IS_ENABLED(CONFIG_INET)
@@ -30,6 +36,7 @@ struct net_hotdata {
 	struct rps_sock_flow_table __rcu *rps_sock_flow_table;
 	u32			rps_cpu_mask;
 #endif
+	struct skb_defer_node __percpu *skb_defer_nodes;
 	int			gro_normal_batch;
 	int			netdev_budget;
 	int			netdev_budget_usecs;

diff --git a/net/core/dev.c b/net/core/dev.c
index fb67372774de10b0b112ca71c7c7a13819c2325b..afcf07352eaa3b9a563173106c84167ebe1ab387 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5180,8 +5180,9 @@ static void napi_schedule_rps(struct softnet_data *sd)
 	__napi_schedule_irqoff(&mysd->backlog);
 }
 
-void kick_defer_list_purge(struct softnet_data *sd, unsigned int cpu)
+void kick_defer_list_purge(unsigned int cpu)
 {
+	struct softnet_data *sd = &per_cpu(softnet_data, cpu);
 	unsigned long flags;
 
 	if (use_backlog_threads()) {
@@ -6715,18 +6716,26 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
 }
 EXPORT_SYMBOL(napi_complete_done);
 
-static void skb_defer_free_flush(struct softnet_data *sd)
+struct skb_defer_node __percpu *skb_defer_nodes;
+
+static void skb_defer_free_flush(void)
 {
 	struct llist_node *free_list;
 	struct sk_buff *skb, *next;
+	struct skb_defer_node *sdn;
+	int node;
 
-	if (llist_empty(&sd->defer_list))
-		return;
-	atomic_long_set(&sd->defer_count, 0);
-	free_list = llist_del_all(&sd->defer_list);
+	for_each_node(node) {
+		sdn = this_cpu_ptr(net_hotdata.skb_defer_nodes) + node;
 
-	llist_for_each_entry_safe(skb, next, free_list, ll_node) {
-		napi_consume_skb(skb, 1);
+		if (llist_empty(&sdn->defer_list))
+			continue;
+		atomic_long_set(&sdn->defer_count, 0);
+		free_list = llist_del_all(&sdn->defer_list);
+
+		llist_for_each_entry_safe(skb, next, free_list, ll_node) {
+			napi_consume_skb(skb, 1);
+		}
 	}
 }
 
@@ -6854,7 +6863,7 @@ static void __napi_busy_loop(unsigned int napi_id,
 		if (work > 0)
 			__NET_ADD_STATS(dev_net(napi->dev),
 					LINUX_MIB_BUSYPOLLRXPACKETS, work);
-		skb_defer_free_flush(this_cpu_ptr(&softnet_data));
+		skb_defer_free_flush();
 		bpf_net_ctx_clear(bpf_net_ctx);
 		local_bh_enable();
 
@@ -7713,7 +7722,7 @@ static void napi_threaded_poll_loop(struct napi_struct *napi)
 			local_irq_disable();
 			net_rps_action_and_irq_enable(sd);
 		}
-		skb_defer_free_flush(sd);
+		skb_defer_free_flush();
 		bpf_net_ctx_clear(bpf_net_ctx);
 		local_bh_enable();
 
@@ -7755,7 +7764,7 @@ static __latent_entropy void net_rx_action(void)
 	for (;;) {
 		struct napi_struct *n;
 
-		skb_defer_free_flush(sd);
+		skb_defer_free_flush();
 
 		if (list_empty(&list)) {
 			if (list_empty(&repoll)) {
@@ -12989,7 +12998,6 @@ static int __init net_dev_init(void)
 		sd->cpu = i;
 #endif
 		INIT_CSD(&sd->defer_csd, trigger_rx_softirq, sd);
-		init_llist_head(&sd->defer_list);
 
 		gro_init(&sd->backlog.gro);
 		sd->backlog.poll = process_backlog;
@@ -12999,6 +13007,11 @@ static int __init net_dev_init(void)
 		if (net_page_pool_create(i))
 			goto out;
 	}
+	net_hotdata.skb_defer_nodes =
+		__alloc_percpu(sizeof(struct skb_defer_node) * nr_node_ids,
+			       __alignof__(struct skb_defer_node));
+	if (!net_hotdata.skb_defer_nodes)
+		goto out;
 
 	if (use_backlog_threads())
 		smpboot_register_percpu_thread(&backlog_threads);

diff --git a/net/core/dev.h b/net/core/dev.h
index d6b08d435479b2ba476b1ddeeaae1dce6ac875a2..900880e8b5b4b9492eca23a4d9201045e6bf7f74 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -357,7 +357,7 @@ static inline void napi_assert_will_not_race(const struct napi_struct *napi)
 	WARN_ON(READ_ONCE(napi->list_owner) != -1);
 }
 
-void kick_defer_list_purge(struct softnet_data *sd, unsigned int cpu);
+void kick_defer_list_purge(unsigned int cpu);
 
 #define XMIT_RECURSION_LIMIT	8

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 22d9dba0e433cf67243a5b7dda77e61d146baf50..03ed51050efe81b582c2bad147afecce3a7115e1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -7184,9 +7184,9 @@ static void kfree_skb_napi_cache(struct sk_buff *skb)
  */
 void skb_attempt_defer_free(struct sk_buff *skb)
 {
+	struct skb_defer_node *sdn;
 	unsigned long defer_count;
 	int cpu = skb->alloc_cpu;
-	struct softnet_data *sd;
 	unsigned int defer_max;
 	bool kick;
 
@@ -7200,14 +7200,15 @@ nodefer:	kfree_skb_napi_cache(skb);
 	DEBUG_NET_WARN_ON_ONCE(skb_dst(skb));
 	DEBUG_NET_WARN_ON_ONCE(skb->destructor);
 
-	sd = &per_cpu(softnet_data, cpu);
+	sdn = per_cpu_ptr(net_hotdata.skb_defer_nodes, cpu) + numa_node_id();
+
 	defer_max = READ_ONCE(net_hotdata.sysctl_skb_defer_max);
-	defer_count = atomic_long_inc_return(&sd->defer_count);
+	defer_count = atomic_long_inc_return(&sdn->defer_count);
 
 	if (defer_count >= defer_max)
 		goto nodefer;
 
-	llist_add(&skb->ll_node, &sd->defer_list);
+	llist_add(&skb->ll_node, &sdn->defer_list);
 
 	/* Send an IPI every time queue reaches half capacity. */
 	kick = (defer_count - 1) == (defer_max >> 1);
@@ -7216,7 +7217,7 @@ nodefer:	kfree_skb_napi_cache(skb);
 	 * if we are unlucky enough (this seems very unlikely).
 	 */
 	if (unlikely(kick))
-		kick_defer_list_purge(sd, cpu);
+		kick_defer_list_purge(cpu);
 }
 
 static void skb_splice_csum_page(struct sk_buff *skb, struct page *page,
-- 
2.51.0.536.g15c5d4f767-goog

^ permalink raw reply related	[flat|nested] 11+ messages in thread
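
The allocation added above gives each CPU an array of nr_node_ids skb_defer_node entries, so a
deferred skb is queued on the slot indexed by (skb->alloc_cpu, node of the freeing CPU):
per_cpu_ptr(net_hotdata.skb_defer_nodes, cpu) + numa_node_id() on the producer side, and
for_each_node() over this_cpu_ptr() when the allocating CPU flushes. A plain-C sketch of that
indexing follows; the sizes, names, and the single counter standing in for the llist/atomic pair
are all illustrative, not part of the patch.

/* Userspace sketch of the (cpu, NUMA node) slot layout in patch 3/3.
 * per_cpu_ptr(base, cpu) + node corresponds to
 * defer_nodes[cpu * nr_nodes + node] in this flattened array.
 */
#include <stdio.h>
#include <stdlib.h>

struct skb_defer_node {
	long defer_count;	/* stands in for the llist + atomic pair */
};

static struct skb_defer_node *defer_nodes;
static const int nr_cpus = 8, nr_nodes = 2;

/* Producer: queue toward the allocating CPU, on the slot of the
 * producer's own NUMA node, so producers on different nodes never
 * touch the same cache line.
 */
static struct skb_defer_node *defer_slot(int alloc_cpu, int my_node)
{
	return &defer_nodes[alloc_cpu * nr_nodes + my_node];
}

/* Consumer: the allocating CPU drains every node slot of its own row,
 * like skb_defer_free_flush() iterating for_each_node().
 */
static void flush_cpu(int cpu)
{
	for (int node = 0; node < nr_nodes; node++) {
		struct skb_defer_node *sdn = defer_slot(cpu, node);

		printf("cpu%d draining node%d: %ld deferred\n",
		       cpu, node, sdn->defer_count);
		sdn->defer_count = 0;
	}
}

int main(void)
{
	defer_nodes = calloc((size_t)nr_cpus * nr_nodes, sizeof(*defer_nodes));
	if (!defer_nodes)
		return 1;

	defer_slot(3, 0)->defer_count = 10;	/* producers running on node 0 */
	defer_slot(3, 1)->defer_count = 7;	/* producers running on node 1 */
	flush_cpu(3);

	free(defer_nodes);
	return 0;
}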
* Re: [PATCH net-next 3/3] net: add NUMA awareness to skb_attempt_defer_free()
  2025-09-26 15:13 ` [PATCH net-next 3/3] net: add NUMA awareness to skb_attempt_defer_free() Eric Dumazet
@ 2025-09-27  1:33   ` Jason Xing
  2025-09-27 20:28   ` Kuniyuki Iwashima
  1 sibling, 0 replies; 11+ messages in thread
From: Jason Xing @ 2025-09-27 1:33 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet

On Fri, Sep 26, 2025 at 11:13 PM Eric Dumazet <edumazet@google.com> wrote:
>
> Instead of sharing sd->defer_list & sd->defer_count with
> many cpus, add one pair for each NUMA node.

Great! I think I might borrow this idea to optimize xsk in the xmit path
because previously I saw the performance impact among numa nodes.

Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>

Thanks!

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH net-next 3/3] net: add NUMA awareness to skb_attempt_defer_free()
  2025-09-26 15:13 ` [PATCH net-next 3/3] net: add NUMA awareness to skb_attempt_defer_free() Eric Dumazet
  2025-09-27  1:33   ` Jason Xing
@ 2025-09-27 20:28   ` Kuniyuki Iwashima
  2025-09-28  8:45     ` Eric Dumazet
  1 sibling, 1 reply; 11+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-27 20:28 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Willem de Bruijn, netdev, eric.dumazet

On Fri, Sep 26, 2025 at 8:13 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Instead of sharing sd->defer_list & sd->defer_count with
> many cpus, add one pair for each NUMA node.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
>  include/linux/netdevice.h |  4 ----
>  include/net/hotdata.h     |  7 +++++++
>  net/core/dev.c            | 37 +++++++++++++++++++++++------------
>  net/core/dev.h            |  2 +-
>  net/core/skbuff.c         | 11 ++++++-----
>  5 files changed, 39 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 5c9aa16933d197f70746d64e5f44cae052d9971c..d1a687444b275d45d105e336d2ede264fd310f1b 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3536,10 +3536,6 @@ struct softnet_data {
>
>         struct numa_drop_counters drop_counters;
>
> -       /* Another possibly contended cache line */
> -       struct llist_head defer_list ____cacheline_aligned_in_smp;
> -       atomic_long_t   defer_count;
> -
>         int             defer_ipi_scheduled ____cacheline_aligned_in_smp;
>         call_single_data_t      defer_csd;
>  };
> diff --git a/include/net/hotdata.h b/include/net/hotdata.h
> index fda94b2647ffa242c256c95ae929d9ef25e54f96..4acec191c54ab367ca12fff590d1f8c8aad64651 100644
> --- a/include/net/hotdata.h
> +++ b/include/net/hotdata.h
> @@ -2,10 +2,16 @@
>  #ifndef _NET_HOTDATA_H
>  #define _NET_HOTDATA_H
>
> +#include <linux/llist.h>
>  #include <linux/types.h>
>  #include <linux/netdevice.h>
>  #include <net/protocol.h>
>
> +struct skb_defer_node {
> +       struct llist_head       defer_list;
> +       atomic_long_t           defer_count;
> +} ____cacheline_aligned_in_smp;
> +
>  /* Read mostly data used in network fast paths.
>   */
>  struct net_hotdata {
>  #if IS_ENABLED(CONFIG_INET)
> @@ -30,6 +36,7 @@ struct net_hotdata {
>         struct rps_sock_flow_table __rcu *rps_sock_flow_table;
>         u32                     rps_cpu_mask;
>  #endif
> +       struct skb_defer_node __percpu *skb_defer_nodes;
>         int                     gro_normal_batch;
>         int                     netdev_budget;
>         int                     netdev_budget_usecs;
> diff --git a/net/core/dev.c b/net/core/dev.c
> index fb67372774de10b0b112ca71c7c7a13819c2325b..afcf07352eaa3b9a563173106c84167ebe1ab387 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5180,8 +5180,9 @@ static void napi_schedule_rps(struct softnet_data *sd)
>         __napi_schedule_irqoff(&mysd->backlog);
>  }
>
> -void kick_defer_list_purge(struct softnet_data *sd, unsigned int cpu)
> +void kick_defer_list_purge(unsigned int cpu)
>  {
> +       struct softnet_data *sd = &per_cpu(softnet_data, cpu);
>         unsigned long flags;
>
>         if (use_backlog_threads()) {
> @@ -6715,18 +6716,26 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
>  }
>  EXPORT_SYMBOL(napi_complete_done);
>
> -static void skb_defer_free_flush(struct softnet_data *sd)
> +struct skb_defer_node __percpu *skb_defer_nodes;

seems this is unused, given it's close to the end of cycle
maybe this can be fixed up while applying ? :)

Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>

Thanks!

^ permalink raw reply	[flat|nested] 11+ messages in thread
* Re: [PATCH net-next 3/3] net: add NUMA awareness to skb_attempt_defer_free()
  2025-09-27 20:28   ` Kuniyuki Iwashima
@ 2025-09-28  8:45     ` Eric Dumazet
  0 siblings, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2025-09-28 8:45 UTC (permalink / raw)
To: Kuniyuki Iwashima
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Willem de Bruijn, netdev, eric.dumazet

On Sat, Sep 27, 2025 at 1:28 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> On Fri, Sep 26, 2025 at 8:13 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > Instead of sharing sd->defer_list & sd->defer_count with
> > many cpus, add one pair for each NUMA node.
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > -static void skb_defer_free_flush(struct softnet_data *sd)
> > +struct skb_defer_node __percpu *skb_defer_nodes;
>
> seems this is unused, given it's close to the end of cycle
> maybe this can be fixed up while applying ? :)

Good catch, I moved it to net_hotdata but forgot to remove this leftover.

Will send a V2.

Thanks.

^ permalink raw reply	[flat|nested] 11+ messages in thread
end of thread, other threads:[~2025-09-28  8:45 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-09-26 15:13 [PATCH net-next 0/3] net: lockless skb_attempt_defer_free() Eric Dumazet
2025-09-26 15:13 ` [PATCH net-next 1/3] net: make softnet_data.defer_count an atomic Eric Dumazet
2025-09-27  1:10   ` Jason Xing
2025-09-27 19:39   ` Kuniyuki Iwashima
2025-09-26 15:13 ` [PATCH net-next 2/3] net: use llist for sd->defer_list Eric Dumazet
2025-09-27  1:09   ` Jason Xing
2025-09-27 19:52   ` Kuniyuki Iwashima
2025-09-26 15:13 ` [PATCH net-next 3/3] net: add NUMA awareness to skb_attempt_defer_free() Eric Dumazet
2025-09-27  1:33   ` Jason Xing
2025-09-27 20:28   ` Kuniyuki Iwashima
2025-09-28  8:45     ` Eric Dumazet