* [PATCH net-next 0/3] net: lockless skb_attempt_defer_free()
@ 2025-09-26 15:13 Eric Dumazet
2025-09-26 15:13 ` [PATCH net-next 1/3] net: make softnet_data.defer_count an atomic Eric Dumazet
` (2 more replies)
0 siblings, 3 replies; 11+ messages in thread
From: Eric Dumazet @ 2025-09-26 15:13 UTC
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
eric.dumazet, Eric Dumazet
Platforms with many CPUs and a relatively slow interconnect show
significant spinlock contention in skb_attempt_defer_free().
This series refactors this infrastructure to be NUMA-aware
and lockless.
Tested on various platforms, including AMD Zen 2/3/4
and Intel Granite Rapids, showing significant cost reductions
under network stress (more than 20 Mpps).
Eric Dumazet (3):
net: make softnet_data.defer_count an atomic
net: use llist for sd->defer_list
net: add NUMA awareness to skb_attempt_defer_free()
include/linux/netdevice.h | 6 +-----
include/net/hotdata.h | 7 +++++++
net/core/dev.c | 43 +++++++++++++++++++++++----------------
net/core/dev.h | 2 +-
net/core/skbuff.c | 24 ++++++++++------------
5 files changed, 45 insertions(+), 37 deletions(-)
--
2.51.0.536.g15c5d4f767-goog
* [PATCH net-next 1/3] net: make softnet_data.defer_count an atomic
2025-09-26 15:13 [PATCH net-next 0/3] net: lockless skb_attempt_defer_free() Eric Dumazet
@ 2025-09-26 15:13 ` Eric Dumazet
2025-09-27 1:10 ` Jason Xing
2025-09-27 19:39 ` Kuniyuki Iwashima
2025-09-26 15:13 ` [PATCH net-next 2/3] net: use llist for sd->defer_list Eric Dumazet
2025-09-26 15:13 ` [PATCH net-next 3/3] net: add NUMA awareness to skb_attempt_defer_free() Eric Dumazet
2 siblings, 2 replies; 11+ messages in thread
From: Eric Dumazet @ 2025-09-26 15:13 UTC
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
eric.dumazet, Eric Dumazet
This is preparation work to remove softnet_data.defer_lock,
as it is contended on hosts with a large number of cores.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/linux/netdevice.h | 2 +-
net/core/dev.c | 2 +-
net/core/skbuff.c | 6 ++----
3 files changed, 4 insertions(+), 6 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 1b85454116f666ced61a1450d3f899940f499c05..27e3fa69253f694b98d32b6138cf491da5a8b824 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3538,7 +3538,7 @@ struct softnet_data {
/* Another possibly contended cache line */
spinlock_t defer_lock ____cacheline_aligned_in_smp;
- int defer_count;
+ atomic_t defer_count;
int defer_ipi_scheduled;
struct sk_buff *defer_list;
call_single_data_t defer_csd;
diff --git a/net/core/dev.c b/net/core/dev.c
index 8b54fdf0289ab223fc37d27a078536db37646b55..8566678d83444e8aacbfea4842878279cf28516f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6726,7 +6726,7 @@ static void skb_defer_free_flush(struct softnet_data *sd)
spin_lock(&sd->defer_lock);
skb = sd->defer_list;
sd->defer_list = NULL;
- sd->defer_count = 0;
+ atomic_set(&sd->defer_count, 0);
spin_unlock(&sd->defer_lock);
while (skb != NULL) {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index daaf6da43cc9e199389c3afcd6621c177d247884..f91571f51c69ecf8c2fffed5f3a3cd33fd95828b 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -7201,14 +7201,12 @@ nodefer: kfree_skb_napi_cache(skb);
sd = &per_cpu(softnet_data, cpu);
defer_max = READ_ONCE(net_hotdata.sysctl_skb_defer_max);
- if (READ_ONCE(sd->defer_count) >= defer_max)
+ if (atomic_read(&sd->defer_count) >= defer_max)
goto nodefer;
spin_lock_bh(&sd->defer_lock);
/* Send an IPI every time queue reaches half capacity. */
- kick = sd->defer_count == (defer_max >> 1);
- /* Paired with the READ_ONCE() few lines above */
- WRITE_ONCE(sd->defer_count, sd->defer_count + 1);
+ kick = (atomic_inc_return(&sd->defer_count) - 1) == (defer_max >> 1);
skb->next = sd->defer_list;
/* Paired with READ_ONCE() in skb_defer_free_flush() */
--
2.51.0.536.g15c5d4f767-goog
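Editor's note: the counter logic of this patch can be sketched in user space with C11 atomics. This is only an illustration under stated assumptions, not kernel code: struct defer_counter and defer_try_account() are hypothetical names, and atomic_fetch_add() + 1 stands in for the kernel's atomic_inc_return(). In the patch itself the increment still runs under defer_lock; the sketch covers the accounting only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the defer_count field of softnet_data. */
struct defer_counter {
	atomic_int count;
};

/*
 * Account for one more deferred skb.
 * Returns false when the counter has already reached defer_max (the
 * caller would free the skb right away). Otherwise returns true and
 * sets *kick when the pre-increment value equals half of defer_max,
 * the point where the patch sends an IPI to the remote CPU.
 */
static bool defer_try_account(struct defer_counter *dc,
			      unsigned int defer_max, bool *kick)
{
	int new_count;

	assert(dc != NULL && kick != NULL);

	if ((unsigned int)atomic_load(&dc->count) >= defer_max)
		return false;

	/* atomic_fetch_add() returns the old value; add 1 to get the new one. */
	new_count = atomic_fetch_add(&dc->count, 1) + 1;
	*kick = (unsigned int)(new_count - 1) == (defer_max >> 1);
	return true;
}
```

With defer_max set to 4, the third successful call is the one that reports a kick, since the counter held 2 before that increment.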
* [PATCH net-next 2/3] net: use llist for sd->defer_list
2025-09-26 15:13 [PATCH net-next 0/3] net: lockless skb_attempt_defer_free() Eric Dumazet
2025-09-26 15:13 ` [PATCH net-next 1/3] net: make softnet_data.defer_count an atomic Eric Dumazet
@ 2025-09-26 15:13 ` Eric Dumazet
2025-09-27 1:09 ` Jason Xing
2025-09-27 19:52 ` Kuniyuki Iwashima
2025-09-26 15:13 ` [PATCH net-next 3/3] net: add NUMA awareness to skb_attempt_defer_free() Eric Dumazet
2 siblings, 2 replies; 11+ messages in thread
From: Eric Dumazet @ 2025-09-26 15:13 UTC
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
eric.dumazet, Eric Dumazet
Get rid of sd->defer_lock and adopt llist operations.
We optimize skb_attempt_defer_free() for the common case,
where the packet is queued. Otherwise (the queue is full),
sd->defer_count keeps increasing until skb_defer_free_flush()
clears it.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/linux/netdevice.h | 8 ++++----
net/core/dev.c | 18 ++++++------------
net/core/skbuff.c | 15 +++++++--------
3 files changed, 17 insertions(+), 24 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 27e3fa69253f694b98d32b6138cf491da5a8b824..5c9aa16933d197f70746d64e5f44cae052d9971c 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3537,10 +3537,10 @@ struct softnet_data {
struct numa_drop_counters drop_counters;
/* Another possibly contended cache line */
- spinlock_t defer_lock ____cacheline_aligned_in_smp;
- atomic_t defer_count;
- int defer_ipi_scheduled;
- struct sk_buff *defer_list;
+ struct llist_head defer_list ____cacheline_aligned_in_smp;
+ atomic_long_t defer_count;
+
+ int defer_ipi_scheduled ____cacheline_aligned_in_smp;
call_single_data_t defer_csd;
};
diff --git a/net/core/dev.c b/net/core/dev.c
index 8566678d83444e8aacbfea4842878279cf28516f..fb67372774de10b0b112ca71c7c7a13819c2325b 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6717,22 +6717,16 @@ EXPORT_SYMBOL(napi_complete_done);
static void skb_defer_free_flush(struct softnet_data *sd)
{
+ struct llist_node *free_list;
struct sk_buff *skb, *next;
- /* Paired with WRITE_ONCE() in skb_attempt_defer_free() */
- if (!READ_ONCE(sd->defer_list))
+ if (llist_empty(&sd->defer_list))
return;
+ atomic_long_set(&sd->defer_count, 0);
+ free_list = llist_del_all(&sd->defer_list);
- spin_lock(&sd->defer_lock);
- skb = sd->defer_list;
- sd->defer_list = NULL;
- atomic_set(&sd->defer_count, 0);
- spin_unlock(&sd->defer_lock);
-
- while (skb != NULL) {
- next = skb->next;
+ llist_for_each_entry_safe(skb, next, free_list, ll_node) {
napi_consume_skb(skb, 1);
- skb = next;
}
}
@@ -12995,7 +12989,7 @@ static int __init net_dev_init(void)
sd->cpu = i;
#endif
INIT_CSD(&sd->defer_csd, trigger_rx_softirq, sd);
- spin_lock_init(&sd->defer_lock);
+ init_llist_head(&sd->defer_list);
gro_init(&sd->backlog.gro);
sd->backlog.poll = process_backlog;
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index f91571f51c69ecf8c2fffed5f3a3cd33fd95828b..22d9dba0e433cf67243a5b7dda77e61d146baf50 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -7184,6 +7184,7 @@ static void kfree_skb_napi_cache(struct sk_buff *skb)
*/
void skb_attempt_defer_free(struct sk_buff *skb)
{
+ unsigned long defer_count;
int cpu = skb->alloc_cpu;
struct softnet_data *sd;
unsigned int defer_max;
@@ -7201,17 +7202,15 @@ nodefer: kfree_skb_napi_cache(skb);
sd = &per_cpu(softnet_data, cpu);
defer_max = READ_ONCE(net_hotdata.sysctl_skb_defer_max);
- if (atomic_read(&sd->defer_count) >= defer_max)
+ defer_count = atomic_long_inc_return(&sd->defer_count);
+
+ if (defer_count >= defer_max)
goto nodefer;
- spin_lock_bh(&sd->defer_lock);
- /* Send an IPI every time queue reaches half capacity. */
- kick = (atomic_inc_return(&sd->defer_count) - 1) == (defer_max >> 1);
+ llist_add(&skb->ll_node, &sd->defer_list);
- skb->next = sd->defer_list;
- /* Paired with READ_ONCE() in skb_defer_free_flush() */
- WRITE_ONCE(sd->defer_list, skb);
- spin_unlock_bh(&sd->defer_lock);
+ /* Send an IPI every time queue reaches half capacity. */
+ kick = (defer_count - 1) == (defer_max >> 1);
/* Make sure to trigger NET_RX_SOFTIRQ on the remote CPU
* if we are unlucky enough (this seems very unlikely).
--
2.51.0.536.g15c5d4f767-goog
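Editor's note: the lockless scheme in this patch can be approximated in user space as follows. This is a minimal sketch, assuming C11 atomics in place of the kernel's llist and atomic_long_t helpers; struct defer_node, struct defer_queue, defer_enqueue() and defer_flush() are hypothetical names. The push uses a compare-and-swap loop and the flush detaches the whole list with a single exchange, which mirrors the general shape of llist_add() and llist_del_all(). The IPI kick is left out.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical list node, standing in for skb->ll_node. */
struct defer_node {
	struct defer_node *next;
};

/* Hypothetical queue, standing in for sd->defer_list and sd->defer_count. */
struct defer_queue {
	_Atomic(struct defer_node *) first;
	atomic_long count;
};

static void defer_queue_init(struct defer_queue *q)
{
	atomic_init(&q->first, NULL);
	atomic_init(&q->count, 0);
}

/*
 * Producer side. The counter is incremented first, as in the patch, so
 * a rejected enqueue still leaves the counter one higher until the next
 * flush resets it. Returns false when the caller must free immediately.
 */
static bool defer_enqueue(struct defer_queue *q, struct defer_node *n,
			  unsigned long defer_max)
{
	unsigned long new_count;
	struct defer_node *old;

	assert(q != NULL && n != NULL);

	new_count = (unsigned long)atomic_fetch_add(&q->count, 1) + 1;
	if (new_count >= defer_max)
		return false;

	/* Lockless push: retry until the head is swapped from old to n. */
	old = atomic_load(&q->first);
	do {
		n->next = old;
	} while (!atomic_compare_exchange_weak(&q->first, &old, n));
	return true;
}

/* Consumer side: reset the counter, then detach the whole list at once. */
static struct defer_node *defer_flush(struct defer_queue *q)
{
	if (atomic_load(&q->first) == NULL)
		return NULL;
	atomic_store(&q->count, 0);
	return atomic_exchange(&q->first, NULL);
}
```

The caller of defer_flush() owns the returned chain and can walk it without further synchronization; entries come back in last-in, first-out order.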
* [PATCH net-next 3/3] net: add NUMA awareness to skb_attempt_defer_free()
2025-09-26 15:13 [PATCH net-next 0/3] net: lockless skb_attempt_defer_free() Eric Dumazet
2025-09-26 15:13 ` [PATCH net-next 1/3] net: make softnet_data.defer_count an atomic Eric Dumazet
2025-09-26 15:13 ` [PATCH net-next 2/3] net: use llist for sd->defer_list Eric Dumazet
@ 2025-09-26 15:13 ` Eric Dumazet
2025-09-27 1:33 ` Jason Xing
2025-09-27 20:28 ` Kuniyuki Iwashima
2 siblings, 2 replies; 11+ messages in thread
From: Eric Dumazet @ 2025-09-26 15:13 UTC
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, Kuniyuki Iwashima, Willem de Bruijn, netdev,
eric.dumazet, Eric Dumazet
Instead of sharing a single sd->defer_list and sd->defer_count
among many CPUs, give each CPU one such pair per NUMA node.
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
include/linux/netdevice.h | 4 ----
include/net/hotdata.h | 7 +++++++
net/core/dev.c | 37 +++++++++++++++++++++++++------------
net/core/dev.h | 2 +-
net/core/skbuff.c | 11 ++++++-----
5 files changed, 39 insertions(+), 22 deletions(-)
diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index 5c9aa16933d197f70746d64e5f44cae052d9971c..d1a687444b275d45d105e336d2ede264fd310f1b 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3536,10 +3536,6 @@ struct softnet_data {
struct numa_drop_counters drop_counters;
- /* Another possibly contended cache line */
- struct llist_head defer_list ____cacheline_aligned_in_smp;
- atomic_long_t defer_count;
-
int defer_ipi_scheduled ____cacheline_aligned_in_smp;
call_single_data_t defer_csd;
};
diff --git a/include/net/hotdata.h b/include/net/hotdata.h
index fda94b2647ffa242c256c95ae929d9ef25e54f96..4acec191c54ab367ca12fff590d1f8c8aad64651 100644
--- a/include/net/hotdata.h
+++ b/include/net/hotdata.h
@@ -2,10 +2,16 @@
#ifndef _NET_HOTDATA_H
#define _NET_HOTDATA_H
+#include <linux/llist.h>
#include <linux/types.h>
#include <linux/netdevice.h>
#include <net/protocol.h>
+struct skb_defer_node {
+ struct llist_head defer_list;
+ atomic_long_t defer_count;
+} ____cacheline_aligned_in_smp;
+
/* Read mostly data used in network fast paths. */
struct net_hotdata {
#if IS_ENABLED(CONFIG_INET)
@@ -30,6 +36,7 @@ struct net_hotdata {
struct rps_sock_flow_table __rcu *rps_sock_flow_table;
u32 rps_cpu_mask;
#endif
+ struct skb_defer_node __percpu *skb_defer_nodes;
int gro_normal_batch;
int netdev_budget;
int netdev_budget_usecs;
diff --git a/net/core/dev.c b/net/core/dev.c
index fb67372774de10b0b112ca71c7c7a13819c2325b..afcf07352eaa3b9a563173106c84167ebe1ab387 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5180,8 +5180,9 @@ static void napi_schedule_rps(struct softnet_data *sd)
__napi_schedule_irqoff(&mysd->backlog);
}
-void kick_defer_list_purge(struct softnet_data *sd, unsigned int cpu)
+void kick_defer_list_purge(unsigned int cpu)
{
+ struct softnet_data *sd = &per_cpu(softnet_data, cpu);
unsigned long flags;
if (use_backlog_threads()) {
@@ -6715,18 +6716,26 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
}
EXPORT_SYMBOL(napi_complete_done);
-static void skb_defer_free_flush(struct softnet_data *sd)
+struct skb_defer_node __percpu *skb_defer_nodes;
+
+static void skb_defer_free_flush(void)
{
struct llist_node *free_list;
struct sk_buff *skb, *next;
+ struct skb_defer_node *sdn;
+ int node;
- if (llist_empty(&sd->defer_list))
- return;
- atomic_long_set(&sd->defer_count, 0);
- free_list = llist_del_all(&sd->defer_list);
+ for_each_node(node) {
+ sdn = this_cpu_ptr(net_hotdata.skb_defer_nodes) + node;
- llist_for_each_entry_safe(skb, next, free_list, ll_node) {
- napi_consume_skb(skb, 1);
+ if (llist_empty(&sdn->defer_list))
+ continue;
+ atomic_long_set(&sdn->defer_count, 0);
+ free_list = llist_del_all(&sdn->defer_list);
+
+ llist_for_each_entry_safe(skb, next, free_list, ll_node) {
+ napi_consume_skb(skb, 1);
+ }
}
}
@@ -6854,7 +6863,7 @@ static void __napi_busy_loop(unsigned int napi_id,
if (work > 0)
__NET_ADD_STATS(dev_net(napi->dev),
LINUX_MIB_BUSYPOLLRXPACKETS, work);
- skb_defer_free_flush(this_cpu_ptr(&softnet_data));
+ skb_defer_free_flush();
bpf_net_ctx_clear(bpf_net_ctx);
local_bh_enable();
@@ -7713,7 +7722,7 @@ static void napi_threaded_poll_loop(struct napi_struct *napi)
local_irq_disable();
net_rps_action_and_irq_enable(sd);
}
- skb_defer_free_flush(sd);
+ skb_defer_free_flush();
bpf_net_ctx_clear(bpf_net_ctx);
local_bh_enable();
@@ -7755,7 +7764,7 @@ static __latent_entropy void net_rx_action(void)
for (;;) {
struct napi_struct *n;
- skb_defer_free_flush(sd);
+ skb_defer_free_flush();
if (list_empty(&list)) {
if (list_empty(&repoll)) {
@@ -12989,7 +12998,6 @@ static int __init net_dev_init(void)
sd->cpu = i;
#endif
INIT_CSD(&sd->defer_csd, trigger_rx_softirq, sd);
- init_llist_head(&sd->defer_list);
gro_init(&sd->backlog.gro);
sd->backlog.poll = process_backlog;
@@ -12999,6 +13007,11 @@ static int __init net_dev_init(void)
if (net_page_pool_create(i))
goto out;
}
+ net_hotdata.skb_defer_nodes =
+ __alloc_percpu(sizeof(struct skb_defer_node) * nr_node_ids,
+ __alignof__(struct skb_defer_node));
+ if (!net_hotdata.skb_defer_nodes)
+ goto out;
if (use_backlog_threads())
smpboot_register_percpu_thread(&backlog_threads);
diff --git a/net/core/dev.h b/net/core/dev.h
index d6b08d435479b2ba476b1ddeeaae1dce6ac875a2..900880e8b5b4b9492eca23a4d9201045e6bf7f74 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -357,7 +357,7 @@ static inline void napi_assert_will_not_race(const struct napi_struct *napi)
WARN_ON(READ_ONCE(napi->list_owner) != -1);
}
-void kick_defer_list_purge(struct softnet_data *sd, unsigned int cpu);
+void kick_defer_list_purge(unsigned int cpu);
#define XMIT_RECURSION_LIMIT 8
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 22d9dba0e433cf67243a5b7dda77e61d146baf50..03ed51050efe81b582c2bad147afecce3a7115e1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -7184,9 +7184,9 @@ static void kfree_skb_napi_cache(struct sk_buff *skb)
*/
void skb_attempt_defer_free(struct sk_buff *skb)
{
+ struct skb_defer_node *sdn;
unsigned long defer_count;
int cpu = skb->alloc_cpu;
- struct softnet_data *sd;
unsigned int defer_max;
bool kick;
@@ -7200,14 +7200,15 @@ nodefer: kfree_skb_napi_cache(skb);
DEBUG_NET_WARN_ON_ONCE(skb_dst(skb));
DEBUG_NET_WARN_ON_ONCE(skb->destructor);
- sd = &per_cpu(softnet_data, cpu);
+ sdn = per_cpu_ptr(net_hotdata.skb_defer_nodes, cpu) + numa_node_id();
+
defer_max = READ_ONCE(net_hotdata.sysctl_skb_defer_max);
- defer_count = atomic_long_inc_return(&sd->defer_count);
+ defer_count = atomic_long_inc_return(&sdn->defer_count);
if (defer_count >= defer_max)
goto nodefer;
- llist_add(&skb->ll_node, &sd->defer_list);
+ llist_add(&skb->ll_node, &sdn->defer_list);
/* Send an IPI every time queue reaches half capacity. */
kick = (defer_count - 1) == (defer_max >> 1);
@@ -7216,7 +7217,7 @@ nodefer: kfree_skb_napi_cache(skb);
* if we are unlucky enough (this seems very unlikely).
*/
if (unlikely(kick))
- kick_defer_list_purge(sd, cpu);
+ kick_defer_list_purge(cpu);
}
static void skb_splice_csum_page(struct sk_buff *skb, struct page *page,
--
2.51.0.536.g15c5d4f767-goog
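Editor's note: the indexing introduced by this patch (one slot per allocating CPU and per freeing NUMA node) can be sketched in user space as a flat two-dimensional table. This is an illustration under stated assumptions, not the kernel implementation: struct defer_slot, struct defer_table and the three helpers are hypothetical names, malloc() replaces __alloc_percpu(), the list head and cache-line alignment are omitted, and the flush returns the drained count instead of freeing skbs.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct skb_defer_node (list head omitted). */
struct defer_slot {
	atomic_long count;
};

/* Hypothetical table: nr_cpus rows of nr_nodes slots each. */
struct defer_table {
	unsigned int nr_cpus;
	unsigned int nr_nodes;
	struct defer_slot *slots;
};

/* Allocate and zero the table. Returns 0 on success, -1 on failure. */
static int defer_table_init(struct defer_table *t, unsigned int nr_cpus,
			    unsigned int nr_nodes)
{
	size_t i, n = (size_t)nr_cpus * nr_nodes;

	t->slots = malloc(n * sizeof(*t->slots));
	if (!t->slots)
		return -1;
	for (i = 0; i < n; i++)
		atomic_init(&t->slots[i].count, 0);
	t->nr_cpus = nr_cpus;
	t->nr_nodes = nr_nodes;
	return 0;
}

/* Producer side: slot owned by alloc_cpu, dedicated to the freeing node. */
static struct defer_slot *defer_slot_for(struct defer_table *t,
					 unsigned int alloc_cpu,
					 unsigned int freeing_node)
{
	assert(alloc_cpu < t->nr_cpus && freeing_node < t->nr_nodes);
	return &t->slots[(size_t)alloc_cpu * t->nr_nodes + freeing_node];
}

/* Consumer side: the owning CPU drains every node's slot in turn. */
static long defer_flush_cpu(struct defer_table *t, unsigned int cpu)
{
	long total = 0;
	unsigned int node;

	for (node = 0; node < t->nr_nodes; node++)
		total += atomic_exchange(&defer_slot_for(t, cpu, node)->count, 0);
	return total;
}
```

Producers running on different NUMA nodes touch different slots of the same CPU row, so they no longer contend on one counter and one list head; the price is that the consumer has to visit every node's slot on each flush.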
* Re: [PATCH net-next 2/3] net: use llist for sd->defer_list
2025-09-26 15:13 ` [PATCH net-next 2/3] net: use llist for sd->defer_list Eric Dumazet
@ 2025-09-27 1:09 ` Jason Xing
2025-09-27 19:52 ` Kuniyuki Iwashima
1 sibling, 0 replies; 11+ messages in thread
From: Jason Xing @ 2025-09-27 1:09 UTC
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
On Fri, Sep 26, 2025 at 11:13 PM Eric Dumazet <edumazet@google.com> wrote:
>
> Get rid of sd->defer_lock and adopt llist operations.
>
> We optimize skb_attempt_defer_free() for the common case,
> where the packet is queued. Otherwise (the queue is full),
> sd->defer_count keeps increasing until skb_defer_free_flush()
> clears it.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Quite an interesting optimization! I like the lockless version. Thanks!
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
> ---
> include/linux/netdevice.h | 8 ++++----
> net/core/dev.c | 18 ++++++------------
> net/core/skbuff.c | 15 +++++++--------
> 3 files changed, 17 insertions(+), 24 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 27e3fa69253f694b98d32b6138cf491da5a8b824..5c9aa16933d197f70746d64e5f44cae052d9971c 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3537,10 +3537,10 @@ struct softnet_data {
> struct numa_drop_counters drop_counters;
>
> /* Another possibly contended cache line */
> - spinlock_t defer_lock ____cacheline_aligned_in_smp;
> - atomic_t defer_count;
> - int defer_ipi_scheduled;
> - struct sk_buff *defer_list;
> + struct llist_head defer_list ____cacheline_aligned_in_smp;
> + atomic_long_t defer_count;
> +
> + int defer_ipi_scheduled ____cacheline_aligned_in_smp;
> call_single_data_t defer_csd;
> };
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 8566678d83444e8aacbfea4842878279cf28516f..fb67372774de10b0b112ca71c7c7a13819c2325b 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -6717,22 +6717,16 @@ EXPORT_SYMBOL(napi_complete_done);
>
> static void skb_defer_free_flush(struct softnet_data *sd)
> {
> + struct llist_node *free_list;
> struct sk_buff *skb, *next;
>
> - /* Paired with WRITE_ONCE() in skb_attempt_defer_free() */
> - if (!READ_ONCE(sd->defer_list))
> + if (llist_empty(&sd->defer_list))
> return;
> + atomic_long_set(&sd->defer_count, 0);
> + free_list = llist_del_all(&sd->defer_list);
>
> - spin_lock(&sd->defer_lock);
> - skb = sd->defer_list;
> - sd->defer_list = NULL;
> - atomic_set(&sd->defer_count, 0);
> - spin_unlock(&sd->defer_lock);
> -
> - while (skb != NULL) {
> - next = skb->next;
> + llist_for_each_entry_safe(skb, next, free_list, ll_node) {
nit: no need to keep the braces
> napi_consume_skb(skb, 1);
> - skb = next;
> }
> }
>
> @@ -12995,7 +12989,7 @@ static int __init net_dev_init(void)
> sd->cpu = i;
> #endif
> INIT_CSD(&sd->defer_csd, trigger_rx_softirq, sd);
> - spin_lock_init(&sd->defer_lock);
> + init_llist_head(&sd->defer_list);
>
> gro_init(&sd->backlog.gro);
> sd->backlog.poll = process_backlog;
> diff --git a/net/core/skbuff.c b/net/core/skbuff.c
> index f91571f51c69ecf8c2fffed5f3a3cd33fd95828b..22d9dba0e433cf67243a5b7dda77e61d146baf50 100644
> --- a/net/core/skbuff.c
> +++ b/net/core/skbuff.c
> @@ -7184,6 +7184,7 @@ static void kfree_skb_napi_cache(struct sk_buff *skb)
> */
> void skb_attempt_defer_free(struct sk_buff *skb)
> {
> + unsigned long defer_count;
> int cpu = skb->alloc_cpu;
> struct softnet_data *sd;
> unsigned int defer_max;
> @@ -7201,17 +7202,15 @@ nodefer: kfree_skb_napi_cache(skb);
>
> sd = &per_cpu(softnet_data, cpu);
> defer_max = READ_ONCE(net_hotdata.sysctl_skb_defer_max);
> - if (atomic_read(&sd->defer_count) >= defer_max)
> + defer_count = atomic_long_inc_return(&sd->defer_count);
> +
> + if (defer_count >= defer_max)
> goto nodefer;
>
> - spin_lock_bh(&sd->defer_lock);
> - /* Send an IPI every time queue reaches half capacity. */
> - kick = (atomic_inc_return(&sd->defer_count) - 1) == (defer_max >> 1);
> + llist_add(&skb->ll_node, &sd->defer_list);
>
> - skb->next = sd->defer_list;
> - /* Paired with READ_ONCE() in skb_defer_free_flush() */
> - WRITE_ONCE(sd->defer_list, skb);
> - spin_unlock_bh(&sd->defer_lock);
> + /* Send an IPI every time queue reaches half capacity. */
> + kick = (defer_count - 1) == (defer_max >> 1);
>
> /* Make sure to trigger NET_RX_SOFTIRQ on the remote CPU
> * if we are unlucky enough (this seems very unlikely).
> --
> 2.51.0.536.g15c5d4f767-goog
>
>
* Re: [PATCH net-next 1/3] net: make softnet_data.defer_count an atomic
2025-09-26 15:13 ` [PATCH net-next 1/3] net: make softnet_data.defer_count an atomic Eric Dumazet
@ 2025-09-27 1:10 ` Jason Xing
2025-09-27 19:39 ` Kuniyuki Iwashima
1 sibling, 0 replies; 11+ messages in thread
From: Jason Xing @ 2025-09-27 1:10 UTC
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
On Fri, Sep 26, 2025 at 11:13 PM Eric Dumazet <edumazet@google.com> wrote:
>
> This is preparation work to remove softnet_data.defer_lock,
> as it is contended on hosts with a large number of cores.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
* Re: [PATCH net-next 3/3] net: add NUMA awareness to skb_attempt_defer_free()
2025-09-26 15:13 ` [PATCH net-next 3/3] net: add NUMA awareness to skb_attempt_defer_free() Eric Dumazet
@ 2025-09-27 1:33 ` Jason Xing
2025-09-27 20:28 ` Kuniyuki Iwashima
1 sibling, 0 replies; 11+ messages in thread
From: Jason Xing @ 2025-09-27 1:33 UTC
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet
On Fri, Sep 26, 2025 at 11:13 PM Eric Dumazet <edumazet@google.com> wrote:
>
> Instead of sharing a single sd->defer_list and sd->defer_count
> among many CPUs, give each CPU one such pair per NUMA node.
Great! I think I might borrow this idea to optimize xsk in the xmit
path, because I previously saw a performance impact across NUMA nodes.
Reviewed-by: Jason Xing <kerneljasonxing@gmail.com>
Thanks!
* Re: [PATCH net-next 1/3] net: make softnet_data.defer_count an atomic
2025-09-26 15:13 ` [PATCH net-next 1/3] net: make softnet_data.defer_count an atomic Eric Dumazet
2025-09-27 1:10 ` Jason Xing
@ 2025-09-27 19:39 ` Kuniyuki Iwashima
1 sibling, 0 replies; 11+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-27 19:39 UTC
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Willem de Bruijn, netdev, eric.dumazet
On Fri, Sep 26, 2025 at 8:13 AM Eric Dumazet <edumazet@google.com> wrote:
>
> This is preparation work to remove softnet_data.defer_lock,
> as it is contended on hosts with a large number of cores.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
* Re: [PATCH net-next 2/3] net: use llist for sd->defer_list
2025-09-26 15:13 ` [PATCH net-next 2/3] net: use llist for sd->defer_list Eric Dumazet
2025-09-27 1:09 ` Jason Xing
@ 2025-09-27 19:52 ` Kuniyuki Iwashima
1 sibling, 0 replies; 11+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-27 19:52 UTC
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Willem de Bruijn, netdev, eric.dumazet
On Fri, Sep 26, 2025 at 8:13 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Get rid of sd->defer_lock and adopt llist operations.
>
> We optimize skb_attempt_defer_free() for the common case,
> where the packet is queued. Otherwise (the queue is full),
> sd->defer_count keeps increasing until skb_defer_free_flush()
> clears it.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
* Re: [PATCH net-next 3/3] net: add NUMA awareness to skb_attempt_defer_free()
2025-09-26 15:13 ` [PATCH net-next 3/3] net: add NUMA awareness to skb_attempt_defer_free() Eric Dumazet
2025-09-27 1:33 ` Jason Xing
@ 2025-09-27 20:28 ` Kuniyuki Iwashima
2025-09-28 8:45 ` Eric Dumazet
1 sibling, 1 reply; 11+ messages in thread
From: Kuniyuki Iwashima @ 2025-09-27 20:28 UTC
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Willem de Bruijn, netdev, eric.dumazet
On Fri, Sep 26, 2025 at 8:13 AM Eric Dumazet <edumazet@google.com> wrote:
>
> Instead of sharing a single sd->defer_list and sd->defer_count
> among many CPUs, give each CPU one such pair per NUMA node.
>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
> include/linux/netdevice.h | 4 ----
> include/net/hotdata.h | 7 +++++++
> net/core/dev.c | 37 +++++++++++++++++++++++++------------
> net/core/dev.h | 2 +-
> net/core/skbuff.c | 11 ++++++-----
> 5 files changed, 39 insertions(+), 22 deletions(-)
>
> diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
> index 5c9aa16933d197f70746d64e5f44cae052d9971c..d1a687444b275d45d105e336d2ede264fd310f1b 100644
> --- a/include/linux/netdevice.h
> +++ b/include/linux/netdevice.h
> @@ -3536,10 +3536,6 @@ struct softnet_data {
>
> struct numa_drop_counters drop_counters;
>
> - /* Another possibly contended cache line */
> - struct llist_head defer_list ____cacheline_aligned_in_smp;
> - atomic_long_t defer_count;
> -
> int defer_ipi_scheduled ____cacheline_aligned_in_smp;
> call_single_data_t defer_csd;
> };
> diff --git a/include/net/hotdata.h b/include/net/hotdata.h
> index fda94b2647ffa242c256c95ae929d9ef25e54f96..4acec191c54ab367ca12fff590d1f8c8aad64651 100644
> --- a/include/net/hotdata.h
> +++ b/include/net/hotdata.h
> @@ -2,10 +2,16 @@
> #ifndef _NET_HOTDATA_H
> #define _NET_HOTDATA_H
>
> +#include <linux/llist.h>
> #include <linux/types.h>
> #include <linux/netdevice.h>
> #include <net/protocol.h>
>
> +struct skb_defer_node {
> + struct llist_head defer_list;
> + atomic_long_t defer_count;
> +} ____cacheline_aligned_in_smp;
> +
> /* Read mostly data used in network fast paths. */
> struct net_hotdata {
> #if IS_ENABLED(CONFIG_INET)
> @@ -30,6 +36,7 @@ struct net_hotdata {
> struct rps_sock_flow_table __rcu *rps_sock_flow_table;
> u32 rps_cpu_mask;
> #endif
> + struct skb_defer_node __percpu *skb_defer_nodes;
> int gro_normal_batch;
> int netdev_budget;
> int netdev_budget_usecs;
> diff --git a/net/core/dev.c b/net/core/dev.c
> index fb67372774de10b0b112ca71c7c7a13819c2325b..afcf07352eaa3b9a563173106c84167ebe1ab387 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -5180,8 +5180,9 @@ static void napi_schedule_rps(struct softnet_data *sd)
> __napi_schedule_irqoff(&mysd->backlog);
> }
>
> -void kick_defer_list_purge(struct softnet_data *sd, unsigned int cpu)
> +void kick_defer_list_purge(unsigned int cpu)
> {
> + struct softnet_data *sd = &per_cpu(softnet_data, cpu);
> unsigned long flags;
>
> if (use_backlog_threads()) {
> @@ -6715,18 +6716,26 @@ bool napi_complete_done(struct napi_struct *n, int work_done)
> }
> EXPORT_SYMBOL(napi_complete_done);
>
> -static void skb_defer_free_flush(struct softnet_data *sd)
> +struct skb_defer_node __percpu *skb_defer_nodes;
This seems to be unused. Given that it's close to the end of the cycle,
maybe this can be fixed up while applying? :)
Reviewed-by: Kuniyuki Iwashima <kuniyu@google.com>
Thanks!
* Re: [PATCH net-next 3/3] net: add NUMA awareness to skb_attempt_defer_free()
2025-09-27 20:28 ` Kuniyuki Iwashima
@ 2025-09-28 8:45 ` Eric Dumazet
0 siblings, 0 replies; 11+ messages in thread
From: Eric Dumazet @ 2025-09-28 8:45 UTC
To: Kuniyuki Iwashima
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
Willem de Bruijn, netdev, eric.dumazet
On Sat, Sep 27, 2025 at 1:28 PM Kuniyuki Iwashima <kuniyu@google.com> wrote:
>
> On Fri, Sep 26, 2025 at 8:13 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > Instead of sharing a single sd->defer_list and sd->defer_count
> > among many CPUs, give each CPU one such pair per NUMA node.
> >
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > -static void skb_defer_free_flush(struct softnet_data *sd)
> > +struct skb_defer_node __percpu *skb_defer_nodes;
>
> This seems to be unused. Given that it's close to the end of the cycle,
> maybe this can be fixed up while applying? :)
Good catch, I moved it to net_hotdata but forgot to remove this leftover.
Will send a V2.
Thanks.
End of thread; last message 2025-09-28 8:45 UTC.
Thread overview: 11+ messages
2025-09-26 15:13 [PATCH net-next 0/3] net: lockless skb_attempt_defer_free() Eric Dumazet
2025-09-26 15:13 ` [PATCH net-next 1/3] net: make softnet_data.defer_count an atomic Eric Dumazet
2025-09-27 1:10 ` Jason Xing
2025-09-27 19:39 ` Kuniyuki Iwashima
2025-09-26 15:13 ` [PATCH net-next 2/3] net: use llist for sd->defer_list Eric Dumazet
2025-09-27 1:09 ` Jason Xing
2025-09-27 19:52 ` Kuniyuki Iwashima
2025-09-26 15:13 ` [PATCH net-next 3/3] net: add NUMA awareness to skb_attempt_defer_free() Eric Dumazet
2025-09-27 1:33 ` Jason Xing
2025-09-27 20:28 ` Kuniyuki Iwashima
2025-09-28 8:45 ` Eric Dumazet