* [PATCH net-next] tcp: try to defer / return acked skbs to originating CPU
@ 2026-01-17 16:42 Jakub Kicinski
  2026-01-17 17:10 ` Jakub Kicinski
  2026-01-17 18:16 ` Eric Dumazet
  0 siblings, 2 replies; 15+ messages in thread

From: Jakub Kicinski @ 2026-01-17 16:42 UTC (permalink / raw)
To: edumazet
Cc: kuniyu, ncardwell, netdev, davem, pabeni, andrew+netdev, horms,
	Jakub Kicinski

Running a memcache-like workload under production(ish) load
on a 300 thread AMD machine we see ~3% of CPU time spent
in kmem_cache_free() via tcp_ack(), freeing skbs from rtx queue.
This workload pins workers away from softirq CPU so
the Tx skbs are pretty much always allocated on a different
CPU than where the ACKs arrive. Try to use the defer skb free
queue to return the skbs back to where they came from.
This results in a ~4% performance improvement for the workload.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 include/net/tcp.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index ef0fee58fde8..e290651da508 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -332,7 +332,7 @@ static inline void tcp_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
 		sk_mem_uncharge(sk, skb->truesize);
 	else
 		sk_mem_uncharge(sk, SKB_TRUESIZE(skb_end_offset(skb)));
-	__kfree_skb(skb);
+	skb_attempt_defer_free(skb);
 }

 void sk_forced_mem_schedule(struct sock *sk, int size);
--
2.52.0

^ permalink raw reply related [flat|nested] 15+ messages in thread
* Re: [PATCH net-next] tcp: try to defer / return acked skbs to originating CPU
  2026-01-17 16:42 [PATCH net-next] tcp: try to defer / return acked skbs to originating CPU Jakub Kicinski
@ 2026-01-17 17:10 ` Jakub Kicinski
  2026-01-17 18:16 ` Eric Dumazet
  1 sibling, 0 replies; 15+ messages in thread

From: Jakub Kicinski @ 2026-01-17 17:10 UTC (permalink / raw)
To: edumazet; +Cc: kuniyu, ncardwell, netdev, davem, pabeni, andrew+netdev, horms

On Sat, 17 Jan 2026 08:42:55 -0800 Jakub Kicinski wrote:
> Running a memcache-like workload under production(ish) load
> on a 300 thread AMD machine we see ~3% of CPU time spent
> in kmem_cache_free() via tcp_ack(), freeing skbs from rtx queue.
> This workload pins workers away from softirq CPU so
> the Tx skbs are pretty much always allocated on a different
> CPU than where the ACKs arrive. Try to use the defer skb free
> queue to return the skbs back to where they came from.
> This results in a ~4% performance improvement for the workload.

In the interest of full transparency the performance testing was
done on a 6.13-ish kernel. But I don't see anything that'd make
the situation better upstream..
* Re: [PATCH net-next] tcp: try to defer / return acked skbs to originating CPU
  2026-01-17 16:42 [PATCH net-next] tcp: try to defer / return acked skbs to originating CPU Jakub Kicinski
  2026-01-17 17:10 ` Jakub Kicinski
@ 2026-01-17 18:16 ` Eric Dumazet
  2026-01-17 23:03 ` Jakub Kicinski
  1 sibling, 1 reply; 15+ messages in thread

From: Eric Dumazet @ 2026-01-17 18:16 UTC (permalink / raw)
To: Jakub Kicinski
Cc: kuniyu, ncardwell, netdev, davem, pabeni, andrew+netdev, horms

On Sat, Jan 17, 2026 at 5:43 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> Running a memcache-like workload under production(ish) load
> on a 300 thread AMD machine we see ~3% of CPU time spent
> in kmem_cache_free() via tcp_ack(), freeing skbs from rtx queue.
> This workload pins workers away from softirq CPU so
> the Tx skbs are pretty much always allocated on a different
> CPU than where the ACKs arrive. Try to use the defer skb free
> queue to return the skbs back to where they came from.
> This results in a ~4% performance improvement for the workload.

This probably makes sense when RFS is not used.
Here, RFS gives us ~40% performance improvement for typical RPC workloads,
so I never took a look at this side :)

Have you tested what happens for bulk sends?
sendmsg() allocates skbs and pushes them to the transmit queue,
but an ACK can decide to split TSO packets, and the new allocation is done
on the softirq CPU (assuming RFS is not used).

Perhaps tso_fragment()/tcp_fragment() could copy the source
skb->alloc_cpu to (new)buff->alloc_cpu.

Also, if workers are away from softirq, they will only process the
defer queue in large batches, after receiving a trigger_rx_softirq()
IPI.
Any idea of skb_defer_free_flush() latency when dealing with batches
of ~64 big TSO packets?

> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
> ---
>  include/net/tcp.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index ef0fee58fde8..e290651da508 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -332,7 +332,7 @@ static inline void tcp_wmem_free_skb(struct sock *sk, struct sk_buff *skb)
>  		sk_mem_uncharge(sk, skb->truesize);
>  	else
>  		sk_mem_uncharge(sk, SKB_TRUESIZE(skb_end_offset(skb)));
> -	__kfree_skb(skb);
> +	skb_attempt_defer_free(skb);
>  }
>
>  void sk_forced_mem_schedule(struct sock *sk, int size);
> --
> 2.52.0
* Re: [PATCH net-next] tcp: try to defer / return acked skbs to originating CPU
  2026-01-17 18:16 ` Eric Dumazet
@ 2026-01-17 23:03 ` Jakub Kicinski
  2026-01-18 12:15 ` Eric Dumazet
  0 siblings, 1 reply; 15+ messages in thread

From: Jakub Kicinski @ 2026-01-17 23:03 UTC (permalink / raw)
To: Eric Dumazet
Cc: kuniyu, ncardwell, netdev, davem, pabeni, andrew+netdev, horms

On Sat, 17 Jan 2026 19:16:57 +0100 Eric Dumazet wrote:
> On Sat, Jan 17, 2026 at 5:43 PM Jakub Kicinski <kuba@kernel.org> wrote:
> > Running a memcache-like workload under production(ish) load
> > on a 300 thread AMD machine we see ~3% of CPU time spent
> > in kmem_cache_free() via tcp_ack(), freeing skbs from rtx queue.
> > This workload pins workers away from softirq CPU so
> > the Tx skbs are pretty much always allocated on a different
> > CPU than where the ACKs arrive. Try to use the defer skb free
> > queue to return the skbs back to where they came from.
> > This results in a ~4% performance improvement for the workload.
>
> This probably makes sense when RFS is not used.
> Here, RFS gives us ~40% performance improvement for typical RPC workloads,
> so I never took a look at this side :)

This workload doesn't like RFS. Maybe because it has 1M sockets..
I'll need to look closer, the patchwork queue first tho.. :)

> Have you tested what happens for bulk sends?
> sendmsg() allocates skbs and pushes them to the transmit queue,
> but an ACK can decide to split TSO packets, and the new allocation is done
> on the softirq CPU (assuming RFS is not used).
>
> Perhaps tso_fragment()/tcp_fragment() could copy the source
> skb->alloc_cpu to (new)buff->alloc_cpu.

I'll do some synthetic testing and get back.

> Also, if workers are away from softirq, they will only process the
> defer queue in large batches, after receiving a trigger_rx_softirq()
> IPI.
> Any idea of skb_defer_free_flush() latency when dealing with batches
> of ~64 big TSO packets?

Not sure if there's much we can do about that.. Perhaps we should have
a shrinker that flushes the defer queues? I chatted with Shakeel briefly
and it sounded fairly straightforward.
* Re: [PATCH net-next] tcp: try to defer / return acked skbs to originating CPU
  2026-01-17 23:03 ` Jakub Kicinski
@ 2026-01-18 12:15 ` Eric Dumazet
  2026-01-19 17:04 ` Jakub Kicinski
  0 siblings, 1 reply; 15+ messages in thread

From: Eric Dumazet @ 2026-01-18 12:15 UTC (permalink / raw)
To: Jakub Kicinski
Cc: kuniyu, ncardwell, netdev, davem, pabeni, andrew+netdev, horms

On Sun, Jan 18, 2026 at 12:03 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Sat, 17 Jan 2026 19:16:57 +0100 Eric Dumazet wrote:
> > This probably makes sense when RFS is not used.
> > Here, RFS gives us ~40% performance improvement for typical RPC workloads,
> > so I never took a look at this side :)
>
> This workload doesn't like RFS. Maybe because it has 1M sockets..
> I'll need to look closer, the patchwork queue first tho.. :)
>
> > Have you tested what happens for bulk sends?
> > sendmsg() allocates skbs and pushes them to the transmit queue,
> > but an ACK can decide to split TSO packets, and the new allocation is done
> > on the softirq CPU (assuming RFS is not used).
> >
> > Perhaps tso_fragment()/tcp_fragment() could copy the source
> > skb->alloc_cpu to (new)buff->alloc_cpu.
>
> I'll do some synthetic testing and get back.
>
> > Also, if workers are away from softirq, they will only process the
> > defer queue in large batches, after receiving a trigger_rx_softirq()
> > IPI.
> > Any idea of skb_defer_free_flush() latency when dealing with batches
> > of ~64 big TSO packets?
>
> Not sure if there's much we can do about that.. Perhaps we should have
> a shrinker that flushes the defer queues? I chatted with Shakeel briefly
> and it sounded fairly straightforward.

I was mostly concerned about latency spikes, I did some tests here and
this seems fine.

(I assume you asked Shakeel about the extra memory being held in the
per-cpu queue, and pcp implications?)

Reviewed-by: Eric Dumazet <edumazet@google.com>
* Re: [PATCH net-next] tcp: try to defer / return acked skbs to originating CPU
  2026-01-18 12:15 ` Eric Dumazet
@ 2026-01-19 17:04 ` Jakub Kicinski
  2026-01-29 23:04 ` Jakub Kicinski
  0 siblings, 1 reply; 15+ messages in thread

From: Jakub Kicinski @ 2026-01-19 17:04 UTC (permalink / raw)
To: Eric Dumazet
Cc: kuniyu, ncardwell, netdev, davem, pabeni, andrew+netdev, horms

On Sun, 18 Jan 2026 13:15:00 +0100 Eric Dumazet wrote:
> > > Also, if workers are away from softirq, they will only process the
> > > defer queue in large batches, after receiving a trigger_rx_softirq()
> > > IPI.
> > > Any idea of skb_defer_free_flush() latency when dealing with batches
> > > of ~64 big TSO packets?
> >
> > Not sure if there's much we can do about that.. Perhaps we should have
> > a shrinker that flushes the defer queues? I chatted with Shakeel briefly
> > and it sounded fairly straightforward.
>
> I was mostly concerned about latency spikes, I did some tests here and
> this seems fine.

Looks like selftests run into the zerocopy Tx latency issue.
I'll drop this version from patchwork..

> (I assume you asked Shakeel about the extra memory being held in the
> per-cpu queue, and pcp implications?)

Under real load it helps quite a bit but real load flushes the queues
frequently. I'll talk to him.
* Re: [PATCH net-next] tcp: try to defer / return acked skbs to originating CPU
  2026-01-19 17:04 ` Jakub Kicinski
@ 2026-01-29 23:04 ` Jakub Kicinski
  2026-01-29 23:10 ` Jakub Kicinski
  2026-02-16 16:06 ` Eric Dumazet
  0 siblings, 2 replies; 15+ messages in thread

From: Jakub Kicinski @ 2026-01-29 23:04 UTC (permalink / raw)
To: Eric Dumazet
Cc: kuniyu, ncardwell, netdev, davem, pabeni, andrew+netdev, horms

[-- Attachment #1: Type: text/plain, Size: 691 bytes --]

On Mon, 19 Jan 2026 09:04:35 -0800 Jakub Kicinski wrote:
> > > Not sure if there's much we can do about that.. Perhaps we should have
> > > a shrinker that flushes the defer queues? I chatted with Shakeel briefly
> > > and it sounded fairly straightforward.
> >
> > I was mostly concerned about latency spikes, I did some tests here and
> > this seems fine.
>
> Looks like selftests run into the zerocopy Tx latency issue.
> I'll drop this version from patchwork..

Delaying zero copy forever is a bit of an annoyance.
I believe the same thing can happen in net-next with UDP
but I haven't tested to confirm.

I assume the attached patch is out of the question since it came up before?

[-- Attachment #2: 0001-net-periodically-flush-the-defer-queues.patch --]
[-- Type: text/x-patch, Size: 9441 bytes --]

From ec1dcd1d542880fdaa126a5d4d57147ce69c80d3 Mon Sep 17 00:00:00 2001
From: Jakub Kicinski <kuba@kernel.org>
Date: Wed, 28 Jan 2026 15:01:27 -0800
Subject: [--tree name--] net: periodically flush the defer queues

Zero-copy skbs may sit in the defer skb queue forever. AFAIU we
don't see this for UDP in SW tests because over veth we end up
hitting skb_orphan_frags_rx(). But it should happen when we
zero-copy Tx on a real interface, which frees skbs via
napi_consume_skb(). Since commit 6471658dc66c ("udp: use
skb_attempt_defer_free()") we will queue the skb with a ubuf
to a remote core where it may never be freed. If we make TCP
defer freeing Tx skbs this will be even more obvious and
trigger over veth.

In TCP the stack is traversed by a skb clone, and freeing happens
when the ACK comes in, so no lucky skb_orphan_frags_rx().

This patch attempts a periodic scrub of the defer queues.
Hopefully perf overhead is negligible under normal load, even on
huge machines, as we'll just check that the timer is already
pending the vast majority of the time.

Signed-off-by: Jakub Kicinski <kuba@kernel.org>
---
 Documentation/admin-guide/sysctl/net.rst |  13 ++
 include/net/hotdata.h                    |   1 +
 net/core/dev.h                           |   4 +
 net/core/dev.c                           |   1 +
 net/core/skbuff.c                        | 152 +++++++++++++++++++++++
 net/core/sysctl_net_core.c               |   8 ++
 6 files changed, 179 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
index 19408da2390b..dbfae381818a 100644
--- a/Documentation/admin-guide/sysctl/net.rst
+++ b/Documentation/admin-guide/sysctl/net.rst
@@ -368,6 +368,19 @@ by the cpu which allocated them.
 
 Default: 128
 
+skb_defer_timeout_us
+--------------------
+
+Timeout in microseconds for the safety timer that flushes deferred skb
+queues. When skbs are deferred to a CPU that never runs network softirq
+(e.g., application threads that allocated TX skbs for zero-copy sends),
+they may remain queued indefinitely. This timer periodically triggers
+an IPI to the affected CPUs to drain their defer queues.
+
+Setting this to 0 disables the safety timer.
+
+Default: 20000 (20 milliseconds)
+
 optmem_max
 ----------
 
diff --git a/include/net/hotdata.h b/include/net/hotdata.h
index 6632b1aa7584..fb738da24fe6 100644
--- a/include/net/hotdata.h
+++ b/include/net/hotdata.h
@@ -10,6 +10,7 @@ struct skb_defer_node {
 	struct llist_head defer_list;
 	atomic_long_t defer_count;
+	u8 needs_flush;
 } ____cacheline_aligned_in_smp;
 
 /* Read mostly data used in network fast paths.
  */
diff --git a/net/core/dev.h b/net/core/dev.h
index 98793a738f43..e481b5e67363 100644
--- a/net/core/dev.h
+++ b/net/core/dev.h
@@ -366,6 +366,10 @@ static inline void napi_assert_will_not_race(const struct napi_struct *napi)
 
 void kick_defer_list_purge(unsigned int cpu);
 
+extern unsigned int sysctl_skb_defer_timeout_us;
+int sysctl_skb_defer_timeout(const struct ctl_table *table, int write,
+			     void *buffer, size_t *lenp, loff_t *ppos);
+
 #define XMIT_RECURSION_LIMIT	8
 
 #ifndef CONFIG_PREEMPT_RT
diff --git a/net/core/dev.c b/net/core/dev.c
index 43de5af0d6ec..e6e004f56c58 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6801,6 +6801,7 @@ static void skb_defer_free_flush(void)
 		if (llist_empty(&sdn->defer_list))
 			continue;
 		atomic_long_set(&sdn->defer_count, 0);
+		WRITE_ONCE(sdn->needs_flush, 0);
 		free_list = llist_del_all(&sdn->defer_list);
 
 		llist_for_each_entry_safe(skb, next, free_list, ll_node) {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 63c7c4519d63..4c234617d0ab 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -86,6 +86,7 @@
 #include <linux/uaccess.h>
 #include <trace/events/skb.h>
 #include <linux/highmem.h>
+#include <linux/hrtimer.h>
 #include <linux/capability.h>
 #include <linux/user_namespace.h>
 #include <linux/indirect_call_wrapper.h>
@@ -7219,6 +7220,152 @@ static void kfree_skb_napi_cache(struct sk_buff *skb)
 	local_bh_enable();
 }
 
+/*
+ * Deferred skb flush timer.
+ *
+ * SKBs may get stuck in the defer queue if the originating CPU never runs
+ * network softirq (e.g., application threads that allocated TX skbs).
+ * This timer periodically flushes aged skbs from all defer queues.
+ *
+ * State machine:
+ *   DISABLED -> IDLE:     sysctl changed from 0 to non-zero
+ *   IDLE -> ACTIVE:       first defer schedules the timer
+ *   IDLE -> DISABLED:     sysctl set to 0 while timer idle
+ *   ACTIVE -> RECHECK:    timer fires, reschedules to check for races
+ *   ACTIVE -> DISABLED:   timer fires, sees timeout=0
+ *   RECHECK -> ACTIVE:    new defer arrives, timer will continue
+ *   RECHECK -> IDLE:      timer fires again with no new work
+ *   RECHECK -> DISABLED:  timer fires, sees timeout=0
+ *
+ * IDLE is 0 so kick_defer_safety_timer() fast path is a single comparison.
+ * DISABLED is -1, outside the IDLE/ACTIVE/RECHECK cycle.
+ */
+enum {
+	SKB_DEFER_TIMER_DISABLED = -1,
+	SKB_DEFER_TIMER_IDLE = 0,
+	SKB_DEFER_TIMER_ACTIVE = 1,
+	SKB_DEFER_TIMER_RECHECK = 2,
+	__SKB_DEFER_TIMER_STATE_CNT
+};
+
+static atomic_t skb_defer_timer_state = ATOMIC_INIT(SKB_DEFER_TIMER_DISABLED);
+static struct hrtimer skb_defer_flush_timer;
+unsigned int sysctl_skb_defer_timeout_us __read_mostly = 20000;
+
+static void skb_defer_flush_aged(void)
+{
+	struct skb_defer_node *sdn;
+	int cpu;
+
+	for_each_online_cpu(cpu) {
+		bool kick = false;
+		int node;
+
+		for_each_node(node) {
+			sdn = per_cpu_ptr(net_hotdata.skb_defer_nodes, cpu) + node;
+
+			if (!atomic_long_read(&sdn->defer_count)) {
+				WRITE_ONCE(sdn->needs_flush, 0);
+				continue;
+			}
+
+			kick |= READ_ONCE(sdn->needs_flush);
+			WRITE_ONCE(sdn->needs_flush, 1);
+		}
+		if (kick)
+			kick_defer_list_purge(cpu);
+	}
+}
+
+static enum hrtimer_restart skb_defer_flush_timer_fn(struct hrtimer *timer)
+{
+	enum hrtimer_restart ret;
+	unsigned int timeout;
+	int new_state, state;
+	ktime_t delay;
+
+	state = atomic_read(&skb_defer_timer_state);
+	timeout = READ_ONCE(sysctl_skb_defer_timeout_us);
+	if (!timeout) {
+		atomic_set(&skb_defer_timer_state, SKB_DEFER_TIMER_DISABLED);
+		return HRTIMER_NORESTART;
+	}
+
+	WARN_ON_ONCE(state == SKB_DEFER_TIMER_IDLE);
+	WARN_ON_ONCE(state == SKB_DEFER_TIMER_DISABLED);
+
+	/* State machine: ACTIVE (1) -> RECHECK (2) -> IDLE (0) */
+	new_state = (state + 1) % __SKB_DEFER_TIMER_STATE_CNT;
+	new_state = atomic_cmpxchg(&skb_defer_timer_state, state, new_state);
+
+	if (new_state == SKB_DEFER_TIMER_IDLE) {
+		ret = HRTIMER_NORESTART;
+	} else {
+		delay = ns_to_ktime((u64)timeout * (NSEC_PER_USEC / 2));
+		hrtimer_forward_now(timer, delay);
+		ret = HRTIMER_RESTART;
+	}
+
+	skb_defer_flush_aged();
+
+	return ret;
+}
+
+static DEFINE_SPINLOCK(skb_defer_sysctl_lock);
+
+int sysctl_skb_defer_timeout(const struct ctl_table *table, int write,
+			     void *buffer, size_t *lenp, loff_t *ppos)
+{
+	unsigned int old_timeout;
+	int ret;
+
+	if (!write)
+		return proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+
+	spin_lock(&skb_defer_sysctl_lock);
+
+	old_timeout = sysctl_skb_defer_timeout_us;
+	ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos);
+	if (ret)
+		goto unlock;
+
+	if (!old_timeout && sysctl_skb_defer_timeout_us)
+		atomic_cmpxchg(&skb_defer_timer_state,
+			       SKB_DEFER_TIMER_DISABLED,
+			       SKB_DEFER_TIMER_IDLE);
+	if (old_timeout && !sysctl_skb_defer_timeout_us)
+		atomic_cmpxchg(&skb_defer_timer_state,
+			       SKB_DEFER_TIMER_IDLE,
+			       SKB_DEFER_TIMER_DISABLED);
+unlock:
+	spin_unlock(&skb_defer_sysctl_lock);
+	return ret;
+}
+
+static void kick_defer_safety_timer(void)
+{
+	unsigned int timeout;
+	ktime_t delay;
+	int state;
+
+	state = atomic_read(&skb_defer_timer_state);
+	if (likely(state))
+		return; /* already running (or disabled) */
+
+	timeout = READ_ONCE(sysctl_skb_defer_timeout_us);
+	if (!timeout)
+		return;
+
+	state = atomic_cmpxchg(&skb_defer_timer_state, SKB_DEFER_TIMER_IDLE,
+			       SKB_DEFER_TIMER_ACTIVE);
+	if (state != SKB_DEFER_TIMER_IDLE)
+		return;
+
+	/* Half the sysctl period, one cycle marks and the second will flush */
+	delay = ns_to_ktime((u64)timeout * (NSEC_PER_USEC / 2));
+	hrtimer_start(&skb_defer_flush_timer, delay, HRTIMER_MODE_REL);
+}
+
 /**
  * skb_attempt_defer_free - queue skb for remote freeing
  * @skb: buffer
@@ -7264,6 +7411,8 @@ nodefer:	kfree_skb_napi_cache(skb);
 	 */
 	if (unlikely(kick))
 		kick_defer_list_purge(cpu);
+	else
+		kick_defer_safety_timer();
 }
 
 static void skb_splice_csum_page(struct sk_buff *skb, struct page *page,
@@ -7439,4 +7588,7 @@ void __init skb_init(void)
 						SKB_SMALL_HEAD_HEADROOM,
 						NULL);
 	skb_extensions_init();
+
+	hrtimer_setup(&skb_defer_flush_timer, skb_defer_flush_timer_fn,
+		      CLOCK_MONOTONIC, HRTIMER_MODE_REL);
 }
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 03aea10073f0..af8528a2337c 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -645,6 +645,14 @@ static struct ctl_table net_core_table[] = {
 		.proc_handler	= proc_dointvec_minmax,
 		.extra1		= SYSCTL_ZERO,
 	},
+	{
+		.procname	= "skb_defer_timeout_us",
+		.data		= &sysctl_skb_defer_timeout_us,
+		.maxlen		= sizeof(unsigned int),
+		.mode		= 0644,
+		.proc_handler	= sysctl_skb_defer_timeout,
+		.extra1		= SYSCTL_ZERO,
+	},
 };
 
 static struct ctl_table netns_core_table[] = {
-- 
2.52.0
* Re: [PATCH net-next] tcp: try to defer / return acked skbs to originating CPU
  2026-01-29 23:04 ` Jakub Kicinski
@ 2026-01-29 23:10 ` Jakub Kicinski
  2026-02-16 16:06 ` Eric Dumazet
  0 siblings, 0 replies; 15+ messages in thread

From: Jakub Kicinski @ 2026-01-29 23:10 UTC (permalink / raw)
To: Eric Dumazet
Cc: kuniyu, ncardwell, netdev, davem, pabeni, andrew+netdev, horms

On Thu, 29 Jan 2026 15:04:59 -0800 Jakub Kicinski wrote:
> > > I was mostly concerned about latency spikes, I did some tests here and
> > > this seems fine.
> >
> > Looks like selftests run into the zerocopy Tx latency issue.
> > I'll drop this version from patchwork..
>
> Delaying zero copy forever is a bit of an annoyance.
> I believe the same thing can happen in net-next with UDP
> but I haven't tested to confirm.
>
> I assume the attached patch is out of the question since it came up before?
>
> [0001-net-periodically-flush-the-defer-queues.patch text/x-patch (9441 bytes)]

Ugh, too many patches in my /tmp. I attached an old buggy version,
but you get the point.
* Re: [PATCH net-next] tcp: try to defer / return acked skbs to originating CPU
  2026-01-29 23:04 ` Jakub Kicinski
  2026-01-29 23:10 ` Jakub Kicinski
@ 2026-02-16 16:06 ` Eric Dumazet
  2026-02-16 17:49 ` Jakub Kicinski
  1 sibling, 1 reply; 15+ messages in thread

From: Eric Dumazet @ 2026-02-16 16:06 UTC (permalink / raw)
To: Jakub Kicinski
Cc: kuniyu, ncardwell, netdev, davem, pabeni, andrew+netdev, horms

On Fri, Jan 30, 2026 at 12:05 AM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 19 Jan 2026 09:04:35 -0800 Jakub Kicinski wrote:
> > > > Not sure if there's much we can do about that.. Perhaps we should have
> > > > a shrinker that flushes the defer queues? I chatted with Shakeel briefly
> > > > and it sounded fairly straightforward.
> > >
> > > I was mostly concerned about latency spikes, I did some tests here and
> > > this seems fine.
> >
> > Looks like selftests run into the zerocopy Tx latency issue.
> > I'll drop this version from patchwork..
>
> Delaying zero copy forever is a bit of an annoyance.
> I believe the same thing can happen in net-next with UDP
> but I haven't tested to confirm.
>
> I assume the attached patch is out of the question since it came up before?

I think I totally missed your email :/

What about not attempting defer for zero copy skbs?

It turns out the existing patch e20dfbad8aab ("net: fix napi_consume_skb()
with alien skbs") is already a problem for zcopy.

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 699c401a5eae9c497a42b6bdd8593af7890529f4..dc47d3efc72ed86dce5e382d505eda7bc863669a 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -7266,10 +7266,15 @@ void skb_attempt_defer_free(struct sk_buff *skb)
 {
 	struct skb_defer_node *sdn;
 	unsigned long defer_count;
-	int cpu = skb->alloc_cpu;
 	unsigned int defer_max;
 	bool kick;
+	int cpu;
 
+	/* zero copy notifications should not be delayed. */
+	if (skb_zcopy(skb))
+		goto nodefer;
+
+	cpu = skb->alloc_cpu;
 	if (cpu == raw_smp_processor_id() ||
 	    WARN_ON_ONCE(cpu >= nr_cpu_ids) ||
 	    !cpu_online(cpu)) {
* Re: [PATCH net-next] tcp: try to defer / return acked skbs to originating CPU
  2026-02-16 16:06 ` Eric Dumazet
@ 2026-02-16 17:49 ` Jakub Kicinski
  2026-02-16 17:58 ` Eric Dumazet
  0 siblings, 1 reply; 15+ messages in thread

From: Jakub Kicinski @ 2026-02-16 17:49 UTC (permalink / raw)
To: Eric Dumazet
Cc: kuniyu, ncardwell, netdev, davem, pabeni, andrew+netdev, horms

On Mon, 16 Feb 2026 17:06:58 +0100 Eric Dumazet wrote:
> > > Looks like selftests run into the zerocopy Tx latency issue.
> > > I'll drop this version from patchwork..
> >
> > Delaying zero copy forever is a bit of an annoyance.
> > I believe the same thing can happen in net-next with UDP
> > but I haven't tested to confirm.
> >
> > I assume the attached patch is out of the question since it came up before?
>
> I think I totally missed your email :/
>
> What about not attempting defer for zero copy skbs?
>
> It turns out the existing patch e20dfbad8aab ("net: fix napi_consume_skb()
> with alien skbs") is already a problem for zcopy.

We definitely need either this or the timer patch I attached, for UDP.

I put the TCP write side on a back burner because slab sheaves got
merged, I think skb defer free will still make a difference but IDK
how much. Maybe the juice will no longer be worth the squeeze?
* Re: [PATCH net-next] tcp: try to defer / return acked skbs to originating CPU
  2026-02-16 17:49 ` Jakub Kicinski
@ 2026-02-16 17:58 ` Eric Dumazet
  2026-02-16 18:11 ` Jakub Kicinski
  0 siblings, 1 reply; 15+ messages in thread

From: Eric Dumazet @ 2026-02-16 17:58 UTC (permalink / raw)
To: Jakub Kicinski
Cc: kuniyu, ncardwell, netdev, davem, pabeni, andrew+netdev, horms

On Mon, Feb 16, 2026 at 6:49 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 16 Feb 2026 17:06:58 +0100 Eric Dumazet wrote:
> > I think I totally missed your email :/
> >
> > What about not attempting defer for zero copy skbs?
> >
> > It turns out the existing patch e20dfbad8aab ("net: fix napi_consume_skb()
> > with alien skbs") is already a problem for zcopy.
>
> We definitely need either this or the timer patch I attached, for UDP.
>
> I put the TCP write side on a back burner because slab sheaves got
> merged, I think skb defer free will still make a difference but IDK
> how much. Maybe the juice will no longer be worth the squeeze?

It depends on how many frags are attached to each skb.

page frags are not yet handled by SLUB sheaves :)
* Re: [PATCH net-next] tcp: try to defer / return acked skbs to originating CPU
  2026-02-16 17:58 ` Eric Dumazet
@ 2026-02-16 18:11 ` Jakub Kicinski
  2026-02-16 18:16 ` Eric Dumazet
  0 siblings, 1 reply; 15+ messages in thread

From: Jakub Kicinski @ 2026-02-16 18:11 UTC (permalink / raw)
To: Eric Dumazet
Cc: kuniyu, ncardwell, netdev, davem, pabeni, andrew+netdev, horms

On Mon, 16 Feb 2026 18:58:08 +0100 Eric Dumazet wrote:
> > > I think I totally missed your email :/
> > >
> > > What about not attempting defer for zero copy skbs?
> > >
> > > It turns out the existing patch e20dfbad8aab ("net: fix napi_consume_skb()
> > > with alien skbs") is already a problem for zcopy.
> >
> > We definitely need either this or the timer patch I attached, for UDP.
> >
> > I put the TCP write side on a back burner because slab sheaves got
> > merged, I think skb defer free will still make a difference but IDK
> > how much. Maybe the juice will no longer be worth the squeeze?
>
> It depends on how many frags are attached to each skb.
>
> page frags are not yet handled by SLUB sheaves :)

Ack, my recollection is that more of the cycles are spent in slab than
in pcp handling, tho. I could be misremembering. But my thinking was that
if we both:
 - lose zc sends due to the unbounded completion time
 - lose the slab benefit due to sheaves
we are only left with the fairly narrow case of page handling of small skbs.
My gut feeling was that, however unclean, the timer fix would be best
since it gives us zc back. But I don't have a way to experiment with
sheaves yet to get data. All of this is pure speculation.

Regardless, the patch you shared earlier is probably best as a fix for
zero-copy UDP for now.
* Re: [PATCH net-next] tcp: try to defer / return acked skbs to originating CPU
  2026-02-16 18:11 ` Jakub Kicinski
@ 2026-02-16 18:16 ` Eric Dumazet
  2026-02-17 21:50 ` Jakub Kicinski
  0 siblings, 1 reply; 15+ messages in thread

From: Eric Dumazet @ 2026-02-16 18:16 UTC (permalink / raw)
To: Jakub Kicinski
Cc: kuniyu, ncardwell, netdev, davem, pabeni, andrew+netdev, horms

On Mon, Feb 16, 2026 at 7:11 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 16 Feb 2026 18:58:08 +0100 Eric Dumazet wrote:
> > It depends on how many frags are attached to each skb.
> >
> > page frags are not yet handled by SLUB sheaves :)
>
> Ack, my recollection is that more of the cycles are spent in slab than
> in pcp handling, tho. I could be misremembering. But my thinking was that
> if we both:
>  - lose zc sends due to the unbounded completion time
>  - lose the slab benefit due to sheaves
> we are only left with the fairly narrow case of page handling of small skbs.
> My gut feeling was that, however unclean, the timer fix would be best
> since it gives us zc back. But I don't have a way to experiment with
> sheaves yet to get data. All of this is pure speculation.
>
> Regardless, the patch you shared earlier is probably best as a fix for
> zero-copy UDP for now.

I am cooking a formal patch, but I do not see why UDP is a problem today?

Definitely I am seeing problems with e20dfbad8aab ("net: fix
napi_consume_skb() with alien skbs"), so I was tempted to use it for
the Fixes: tag.

This is the first time a TX skb would be potentially delayed.

UDP use of skb_attempt_defer_free() is with RX skbs, and normally
their zcopy status has been cleared?

Thanks!
* Re: [PATCH net-next] tcp: try to defer / return acked skbs to originating CPU
  2026-02-16 18:16 ` Eric Dumazet
@ 2026-02-17 21:50 ` Jakub Kicinski
  2026-02-17 21:56 ` Eric Dumazet
  0 siblings, 1 reply; 15+ messages in thread

From: Jakub Kicinski @ 2026-02-17 21:50 UTC (permalink / raw)
To: Eric Dumazet
Cc: kuniyu, ncardwell, netdev, davem, pabeni, andrew+netdev, horms

On Mon, 16 Feb 2026 19:16:46 +0100 Eric Dumazet wrote:
> UDP use of skb_attempt_defer_free() is with RX skbs, and normally
> their zcopy status has been cleared?

I meant coming in via napi_consume_skb() -> skb_attempt_defer_free().
There we are freeing Tx skbs, which for UDP may legitimately
have zc state. No?
* Re: [PATCH net-next] tcp: try to defer / return acked skbs to originating CPU
  2026-02-17 21:50 ` Jakub Kicinski
@ 2026-02-17 21:56 ` Eric Dumazet
  0 siblings, 0 replies; 15+ messages in thread

From: Eric Dumazet @ 2026-02-17 21:56 UTC (permalink / raw)
To: Jakub Kicinski
Cc: kuniyu, ncardwell, netdev, davem, pabeni, andrew+netdev, horms

On Tue, Feb 17, 2026 at 10:50 PM Jakub Kicinski <kuba@kernel.org> wrote:
>
> On Mon, 16 Feb 2026 19:16:46 +0100 Eric Dumazet wrote:
> > UDP use of skb_attempt_defer_free() is with RX skbs, and normally
> > their zcopy status has been cleared?
>
> I meant coming in via napi_consume_skb() -> skb_attempt_defer_free().
> There we are freeing Tx skbs, which for UDP may legitimately
> have zc state. No?

Ah, I guess I was referring to a kernel before e20dfbad8aab
("net: fix napi_consume_skb() with alien skbs")

I guess we agree then ;)