netdev.vger.kernel.org archive mirror
* [PATCH v1 net-next 0/5] net: optimize TX throughput and efficiency
@ 2025-10-13 14:54 Eric Dumazet
  2025-10-13 14:54 ` [PATCH v1 net-next 1/5] net: add indirect call wrapper in skb_release_head_state() Eric Dumazet
                   ` (5 more replies)
  0 siblings, 6 replies; 24+ messages in thread
From: Eric Dumazet @ 2025-10-13 14:54 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
	Eric Dumazet

In this series, I replace the busylock spinlock we have in
__dev_queue_xmit() with a lockless list (llist) to reduce
spinlock contention to a minimum.

The idea is that only one cpu might spin on the qdisc spinlock,
while the others simply add their skb to the llist.

After this series, we get a 300 % (4x) efficiency improvement on heavy
TX workloads: twice the number of packets per second, for half the cpu cycles.
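
The core of the handoff in __dev_xmit_skb() boils down to the sketch
below (simplified: the real patch 5/5 open codes llist_add() so it can
also enforce a queue limit via q->defer_count, and it keeps the
TCQ_F_CAN_BYPASS direct-xmit path; q->defer_list and skb->ll_node are
new fields added by that patch):

	/* llist_add() returns true if the list was empty, i.e. we are
	 * the cpu that must grab the qdisc spinlock and flush the list.
	 */
	if (!llist_add(&skb->ll_node, &q->defer_list))
		return NET_XMIT_SUCCESS;	/* another cpu will flush for us */

	spin_lock(qdisc_lock(q));
	ll_list = llist_reverse_order(llist_del_all(&q->defer_list));
	llist_for_each_entry_safe(skb, next, ll_list, ll_node) {
		skb_mark_not_on_list(skb);
		dev_qdisc_enqueue(skb, q, &to_free, txq);
	}
	qdisc_run(q);
	spin_unlock(qdisc_lock(q));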

Eric Dumazet (5):
  net: add indirect call wrapper in skb_release_head_state()
  net/sched: act_mirred: add loop detection
  Revert "net/sched: Fix mirred deadlock on device recursion"
  net: sched: claim one cache line in Qdisc
  net: dev_queue_xmit() llist adoption

 include/linux/netdevice_xmit.h |  9 +++-
 include/net/sch_generic.h      | 23 ++++-----
 net/core/dev.c                 | 94 +++++++++++++++++++---------------
 net/core/skbuff.c              | 11 +++-
 net/sched/act_mirred.c         | 62 +++++++++-------------
 net/sched/sch_generic.c        |  7 ---
 6 files changed, 103 insertions(+), 103 deletions(-)

-- 
2.51.0.740.g6adb054d12-goog


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v1 net-next 1/5] net: add indirect call wrapper in skb_release_head_state()
  2025-10-13 14:54 [PATCH v1 net-next 0/5] net: optimize TX throughput and efficiency Eric Dumazet
@ 2025-10-13 14:54 ` Eric Dumazet
  2025-10-13 14:54 ` [PATCH v1 net-next 2/5] net/sched: act_mirred: add loop detection Eric Dumazet
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Eric Dumazet @ 2025-10-13 14:54 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
	Eric Dumazet

While stress testing UDP senders on a host with expensive indirect
calls, I found that cpus processing TX completions were showing
a very high cost (20%) in sock_wfree() due to
CONFIG_MITIGATION_RETPOLINE=y.

Take care of the TCP and UDP TX destructors with the INDIRECT_CALL_3() macro.
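
For reference, when CONFIG_MITIGATION_RETPOLINE=y INDIRECT_CALL_3()
roughly expands to a chain of direct comparisons, falling back to the
(retpoline protected) indirect call only when no candidate matches:

	if (skb->destructor == tcp_wfree)
		tcp_wfree(skb);
	else if (skb->destructor == __sock_wfree)
		__sock_wfree(skb);
	else if (skb->destructor == sock_wfree)
		sock_wfree(skb);
	else
		skb->destructor(skb);	/* indirect call, retpoline thunk */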

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 net/core/skbuff.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)

diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index bc12790017b0b5c0be99f8fb9d362b3730fa4eb0..692e3a70e75ed14786a67d72bc2ebc0282eee2be 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -1136,7 +1136,16 @@ void skb_release_head_state(struct sk_buff *skb)
 	skb_dst_drop(skb);
 	if (skb->destructor) {
 		DEBUG_NET_WARN_ON_ONCE(in_hardirq());
-		skb->destructor(skb);
+#ifdef CONFIG_INET
+		INDIRECT_CALL_3(skb->destructor,
+				tcp_wfree, __sock_wfree, sock_wfree,
+				skb);
+#else
+		INDIRECT_CALL_1(skb->destructor,
+				sock_wfree,
+				skb);
+
+#endif
 	}
 #if IS_ENABLED(CONFIG_NF_CONNTRACK)
 	nf_conntrack_put(skb_nfct(skb));
-- 
2.51.0.740.g6adb054d12-goog


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v1 net-next 2/5] net/sched: act_mirred: add loop detection
  2025-10-13 14:54 [PATCH v1 net-next 0/5] net: optimize TX throughput and efficiency Eric Dumazet
  2025-10-13 14:54 ` [PATCH v1 net-next 1/5] net: add indirect call wrapper in skb_release_head_state() Eric Dumazet
@ 2025-10-13 14:54 ` Eric Dumazet
  2025-10-13 14:54 ` [PATCH v1 net-next 3/5] Revert "net/sched: Fix mirred deadlock on device recursion" Eric Dumazet
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Eric Dumazet @ 2025-10-13 14:54 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
	Eric Dumazet

Commit 0f022d32c3ec ("net/sched: Fix mirred deadlock on device recursion")
added code to the fast path, even when act_mirred is not used.

Prepare its revert by implementing loop detection in act_mirred.

Add an array of device pointers to struct netdev_xmit.

tcf_mirred_act() can then detect whether the array
already contains the target device.
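
Conceptually the check is (simplified from the tcf_mirred_act() hunk
below): record the target device on the way in, scan the already
recorded devices to catch a loop, and pop on the way out.

	if (xmit->sched_mirred_nest >= MIRRED_NEST_LIMIT)
		return TC_ACT_SHOT;		/* recursion too deep */

	for (i = 0; i < xmit->sched_mirred_nest; i++)
		if (xmit->sched_mirred_dev[i] == dev)
			return retval;		/* device already visited: loop */

	xmit->sched_mirred_dev[xmit->sched_mirred_nest++] = dev;
	retval = tcf_mirred_to_dev(skb, m, dev, m_mac_header_xmit,
				   m_eaction, retval);
	xmit->sched_mirred_nest--;		/* pop on the way out */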

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/linux/netdevice_xmit.h |  9 ++++-
 net/sched/act_mirred.c         | 62 +++++++++++++---------------------
 2 files changed, 31 insertions(+), 40 deletions(-)

diff --git a/include/linux/netdevice_xmit.h b/include/linux/netdevice_xmit.h
index 813a19122ebbb2c6a04176330b1055b7c2b9c902..cc232508e695eefe95ea6e55a21978be11d5da83 100644
--- a/include/linux/netdevice_xmit.h
+++ b/include/linux/netdevice_xmit.h
@@ -2,6 +2,12 @@
 #ifndef _LINUX_NETDEVICE_XMIT_H
 #define _LINUX_NETDEVICE_XMIT_H
 
+#if IS_ENABLED(CONFIG_NET_ACT_MIRRED)
+#define MIRRED_NEST_LIMIT	4
+#endif
+
+struct net_device;
+
 struct netdev_xmit {
 	u16 recursion;
 	u8  more;
@@ -9,7 +15,8 @@ struct netdev_xmit {
 	u8  skip_txqueue;
 #endif
 #if IS_ENABLED(CONFIG_NET_ACT_MIRRED)
-	u8 sched_mirred_nest;
+	u8			sched_mirred_nest;
+	struct net_device	*sched_mirred_dev[MIRRED_NEST_LIMIT];
 #endif
 #if IS_ENABLED(CONFIG_NF_DUP_NETDEV)
 	u8 nf_dup_skb_recursion;
diff --git a/net/sched/act_mirred.c b/net/sched/act_mirred.c
index 5f01f567c934d3669d9a3058cff861a8fe5f88b6..f27b583def78e4afecc7112854b93d59c2520201 100644
--- a/net/sched/act_mirred.c
+++ b/net/sched/act_mirred.c
@@ -29,31 +29,6 @@
 static LIST_HEAD(mirred_list);
 static DEFINE_SPINLOCK(mirred_list_lock);
 
-#define MIRRED_NEST_LIMIT    4
-
-#ifndef CONFIG_PREEMPT_RT
-static u8 tcf_mirred_nest_level_inc_return(void)
-{
-	return __this_cpu_inc_return(softnet_data.xmit.sched_mirred_nest);
-}
-
-static void tcf_mirred_nest_level_dec(void)
-{
-	__this_cpu_dec(softnet_data.xmit.sched_mirred_nest);
-}
-
-#else
-static u8 tcf_mirred_nest_level_inc_return(void)
-{
-	return current->net_xmit.sched_mirred_nest++;
-}
-
-static void tcf_mirred_nest_level_dec(void)
-{
-	current->net_xmit.sched_mirred_nest--;
-}
-#endif
-
 static bool tcf_mirred_is_act_redirect(int action)
 {
 	return action == TCA_EGRESS_REDIR || action == TCA_INGRESS_REDIR;
@@ -439,44 +414,53 @@ TC_INDIRECT_SCOPE int tcf_mirred_act(struct sk_buff *skb,
 {
 	struct tcf_mirred *m = to_mirred(a);
 	int retval = READ_ONCE(m->tcf_action);
-	unsigned int nest_level;
+	struct netdev_xmit *xmit;
 	bool m_mac_header_xmit;
 	struct net_device *dev;
-	int m_eaction;
+	int i, m_eaction;
 	u32 blockid;
 
-	nest_level = tcf_mirred_nest_level_inc_return();
-	if (unlikely(nest_level > MIRRED_NEST_LIMIT)) {
+#ifdef CONFIG_PREEMPT_RT
+	xmit = &current->net_xmit;
+#else
+	xmit = this_cpu_ptr(&softnet_data.xmit);
+#endif
+	if (unlikely(xmit->sched_mirred_nest >= MIRRED_NEST_LIMIT)) {
 		net_warn_ratelimited("Packet exceeded mirred recursion limit on dev %s\n",
 				     netdev_name(skb->dev));
-		retval = TC_ACT_SHOT;
-		goto dec_nest_level;
+		return TC_ACT_SHOT;
 	}
 
 	tcf_lastuse_update(&m->tcf_tm);
 	tcf_action_update_bstats(&m->common, skb);
 
 	blockid = READ_ONCE(m->tcfm_blockid);
-	if (blockid) {
-		retval = tcf_blockcast(skb, m, blockid, res, retval);
-		goto dec_nest_level;
-	}
+	if (blockid)
+		return tcf_blockcast(skb, m, blockid, res, retval);
 
 	dev = rcu_dereference_bh(m->tcfm_dev);
 	if (unlikely(!dev)) {
 		pr_notice_once("tc mirred: target device is gone\n");
 		tcf_action_inc_overlimit_qstats(&m->common);
-		goto dec_nest_level;
+		return retval;
 	}
+	for (i = 0; i < xmit->sched_mirred_nest; i++) {
+		if (xmit->sched_mirred_dev[i] != dev)
+			continue;
+		pr_notice_once("tc mirred: loop on device %s\n",
+			       netdev_name(dev));
+		tcf_action_inc_overlimit_qstats(&m->common);
+		return retval;
+	}
+
+	xmit->sched_mirred_dev[xmit->sched_mirred_nest++] = dev;
 
 	m_mac_header_xmit = READ_ONCE(m->tcfm_mac_header_xmit);
 	m_eaction = READ_ONCE(m->tcfm_eaction);
 
 	retval = tcf_mirred_to_dev(skb, m, dev, m_mac_header_xmit, m_eaction,
 				   retval);
-
-dec_nest_level:
-	tcf_mirred_nest_level_dec();
+	xmit->sched_mirred_nest--;
 
 	return retval;
 }
-- 
2.51.0.740.g6adb054d12-goog


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v1 net-next 3/5] Revert "net/sched: Fix mirred deadlock on device recursion"
  2025-10-13 14:54 [PATCH v1 net-next 0/5] net: optimize TX throughput and efficiency Eric Dumazet
  2025-10-13 14:54 ` [PATCH v1 net-next 1/5] net: add indirect call wrapper in skb_release_head_state() Eric Dumazet
  2025-10-13 14:54 ` [PATCH v1 net-next 2/5] net/sched: act_mirred: add loop detection Eric Dumazet
@ 2025-10-13 14:54 ` Eric Dumazet
  2025-10-13 14:54 ` [PATCH v1 net-next 4/5] net: sched: claim one cache line in Qdisc Eric Dumazet
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 24+ messages in thread
From: Eric Dumazet @ 2025-10-13 14:54 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
	Eric Dumazet

This reverts commits 0f022d32c3eca477fbf79a205243a6123ed0fe11
and 44180feaccf266d9b0b28cc4ceaac019817deb5c.

A prior patch in this series implemented loop detection
in act_mirred, so we can remove q->owner to save some cycles
in the fast path.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/sch_generic.h | 1 -
 net/core/dev.c            | 6 ------
 net/sched/sch_generic.c   | 2 --
 3 files changed, 9 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 738cd5b13c62fb8501619625880b87991f3dc17c..32e9961570b467b6066f1bb2c00ff1a270e342bc 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -117,7 +117,6 @@ struct Qdisc {
 	struct qdisc_skb_head	q;
 	struct gnet_stats_basic_sync bstats;
 	struct gnet_stats_queue	qstats;
-	int                     owner;
 	unsigned long		state;
 	unsigned long		state2; /* must be written under qdisc spinlock */
 	struct Qdisc            *next_sched;
diff --git a/net/core/dev.c b/net/core/dev.c
index a64cef2c537e98ee87776e6f8d3ca3d98f0711b3..0ff178399b2d28ca2754b3f06d69a97f5d6dcf71 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4167,10 +4167,6 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		return rc;
 	}
 
-	if (unlikely(READ_ONCE(q->owner) == smp_processor_id())) {
-		kfree_skb_reason(skb, SKB_DROP_REASON_TC_RECLASSIFY_LOOP);
-		return NET_XMIT_DROP;
-	}
 	/*
 	 * Heuristic to force contended enqueues to serialize on a
 	 * separate lock before trying to get qdisc main lock.
@@ -4210,9 +4206,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		qdisc_run_end(q);
 		rc = NET_XMIT_SUCCESS;
 	} else {
-		WRITE_ONCE(q->owner, smp_processor_id());
 		rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
-		WRITE_ONCE(q->owner, -1);
 		if (qdisc_run_begin(q)) {
 			if (unlikely(contended)) {
 				spin_unlock(&q->busylock);
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index 1e008a228ebdf846d4ef7f83d655ac1142ec3596..dfa8e8e667d24a435b0c9cb3c1f05c8075f63e89 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -679,7 +679,6 @@ struct Qdisc noop_qdisc = {
 		.qlen = 0,
 		.lock = __SPIN_LOCK_UNLOCKED(noop_qdisc.skb_bad_txq.lock),
 	},
-	.owner = -1,
 };
 EXPORT_SYMBOL(noop_qdisc);
 
@@ -985,7 +984,6 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
 	sch->enqueue = ops->enqueue;
 	sch->dequeue = ops->dequeue;
 	sch->dev_queue = dev_queue;
-	sch->owner = -1;
 	netdev_hold(dev, &sch->dev_tracker, GFP_KERNEL);
 	refcount_set(&sch->refcnt, 1);
 
-- 
2.51.0.740.g6adb054d12-goog


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v1 net-next 4/5] net: sched: claim one cache line in Qdisc
  2025-10-13 14:54 [PATCH v1 net-next 0/5] net: optimize TX throughput and efficiency Eric Dumazet
                   ` (2 preceding siblings ...)
  2025-10-13 14:54 ` [PATCH v1 net-next 3/5] Revert "net/sched: Fix mirred deadlock on device recursion" Eric Dumazet
@ 2025-10-13 14:54 ` Eric Dumazet
  2025-10-13 14:54 ` [PATCH v1 net-next 5/5] net: dev_queue_xmit() llist adoption Eric Dumazet
  2025-10-13 16:23 ` [PATCH v1 net-next 0/5] net: optimize TX throughput and efficiency Toke Høiland-Jørgensen
  5 siblings, 0 replies; 24+ messages in thread
From: Eric Dumazet @ 2025-10-13 14:54 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
	Eric Dumazet

Replace the state2 field with a boolean.

Move it into a hole between qstats and state so that
we shrink struct Qdisc by a full cache line.

Signed-off-by: Eric Dumazet <edumazet@google.com>
---
 include/net/sch_generic.h | 18 +++++++-----------
 1 file changed, 7 insertions(+), 11 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 32e9961570b467b6066f1bb2c00ff1a270e342bc..31561291bc92fd70d4d3ca8f5f7dbc4c94c895a0 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -41,13 +41,6 @@ enum qdisc_state_t {
 	__QDISC_STATE_DRAINING,
 };
 
-enum qdisc_state2_t {
-	/* Only for !TCQ_F_NOLOCK qdisc. Never access it directly.
-	 * Use qdisc_run_begin/end() or qdisc_is_running() instead.
-	 */
-	__QDISC_STATE2_RUNNING,
-};
-
 #define QDISC_STATE_MISSED	BIT(__QDISC_STATE_MISSED)
 #define QDISC_STATE_DRAINING	BIT(__QDISC_STATE_DRAINING)
 
@@ -117,8 +110,8 @@ struct Qdisc {
 	struct qdisc_skb_head	q;
 	struct gnet_stats_basic_sync bstats;
 	struct gnet_stats_queue	qstats;
+	bool			running; /* must be written under qdisc spinlock */
 	unsigned long		state;
-	unsigned long		state2; /* must be written under qdisc spinlock */
 	struct Qdisc            *next_sched;
 	struct sk_buff_head	skb_bad_txq;
 
@@ -167,7 +160,7 @@ static inline bool qdisc_is_running(struct Qdisc *qdisc)
 {
 	if (qdisc->flags & TCQ_F_NOLOCK)
 		return spin_is_locked(&qdisc->seqlock);
-	return test_bit(__QDISC_STATE2_RUNNING, &qdisc->state2);
+	return READ_ONCE(qdisc->running);
 }
 
 static inline bool nolock_qdisc_is_empty(const struct Qdisc *qdisc)
@@ -210,7 +203,10 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
 		 */
 		return spin_trylock(&qdisc->seqlock);
 	}
-	return !__test_and_set_bit(__QDISC_STATE2_RUNNING, &qdisc->state2);
+	if (READ_ONCE(qdisc->running))
+		return false;
+	WRITE_ONCE(qdisc->running, true);
+	return true;
 }
 
 static inline void qdisc_run_end(struct Qdisc *qdisc)
@@ -228,7 +224,7 @@ static inline void qdisc_run_end(struct Qdisc *qdisc)
 				      &qdisc->state)))
 			__netif_schedule(qdisc);
 	} else {
-		__clear_bit(__QDISC_STATE2_RUNNING, &qdisc->state2);
+		WRITE_ONCE(qdisc->running, false);
 	}
 }
 
-- 
2.51.0.740.g6adb054d12-goog


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* [PATCH v1 net-next 5/5] net: dev_queue_xmit() llist adoption
  2025-10-13 14:54 [PATCH v1 net-next 0/5] net: optimize TX throughput and efficiency Eric Dumazet
                   ` (3 preceding siblings ...)
  2025-10-13 14:54 ` [PATCH v1 net-next 4/5] net: sched: claim one cache line in Qdisc Eric Dumazet
@ 2025-10-13 14:54 ` Eric Dumazet
  2025-11-07 15:28   ` Toke Høiland-Jørgensen
  2025-10-13 16:23 ` [PATCH v1 net-next 0/5] net: optimize TX throughput and efficiency Toke Høiland-Jørgensen
  5 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2025-10-13 14:54 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
	Eric Dumazet, Toke Høiland-Jørgensen

Remove the busylock spinlock and use a lockless list (llist)
to reduce spinlock contention to a minimum.

The idea is that only one cpu might spin on the qdisc spinlock,
while the others simply add their skb to the llist.

After this patch, we get a 300 % improvement on heavy TX workloads:
- Sending twice the number of packets per second.
- While consuming 50 % fewer cycles.

Note that this also allows submitting batches to the various
qdisc->enqueue() methods in the future.

Tested:

- Dual Intel(R) Xeon(R) 6985P-C  (480 hyper threads).
- 100Gbit NIC, 30 TX queues with FQ packet scheduler.
- echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
- 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"

Before:

16 Mpps (41 Mpps if each thread is pinned to a different cpu)

vmstat 2 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
243  0      0 2368988672  51036 1100852    0    0   146     1  242   60  0  9 91  0  0
244  0      0 2368988672  51036 1100852    0    0   536    10 487745 14718  0 52 48  0  0
244  0      0 2368988672  51036 1100852    0    0   512     0 503067 46033  0 52 48  0  0
244  0      0 2368988672  51036 1100852    0    0   512     0 494807 12107  0 52 48  0  0
244  0      0 2368988672  51036 1100852    0    0   702    26 492845 10110  0 52 48  0  0

Lock contention (1 second sample taken on 8 cores)
perf lock record -C0-7 sleep 1; perf lock contention
 contended   total wait     max wait     avg wait         type   caller

    442111      6.79 s     162.47 ms     15.35 us     spinlock   dev_hard_start_xmit+0xcd
      5961      9.57 ms      8.12 us      1.60 us     spinlock   __dev_queue_xmit+0x3a0
       244    560.63 us      7.63 us      2.30 us     spinlock   do_softirq+0x5b
        13     25.09 us      3.21 us      1.93 us     spinlock   net_tx_action+0xf8

If netperf threads are pinned, spinlock stress is very high.
perf lock record -C0-7 sleep 1; perf lock contention
 contended   total wait     max wait     avg wait         type   caller

    964508      7.10 s     147.25 ms      7.36 us     spinlock   dev_hard_start_xmit+0xcd
       201    268.05 us      4.65 us      1.33 us     spinlock   __dev_queue_xmit+0x3a0
        12     26.05 us      3.84 us      2.17 us     spinlock   do_softirq+0x5b

@__dev_queue_xmit_ns:
[256, 512)            21 |                                                    |
[512, 1K)            631 |                                                    |
[1K, 2K)           27328 |@                                                   |
[2K, 4K)          265392 |@@@@@@@@@@@@@@@@                                    |
[4K, 8K)          417543 |@@@@@@@@@@@@@@@@@@@@@@@@@@                          |
[8K, 16K)         826292 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[16K, 32K)        733822 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@      |
[32K, 64K)         19055 |@                                                   |
[64K, 128K)        17240 |@                                                   |
[128K, 256K)       25633 |@                                                   |
[256K, 512K)           4 |                                                    |

After:

29 Mpps (57 Mpps if each thread is pinned to a different cpu)

vmstat 2 5
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
78  0      0 2369573632  32896 1350988    0    0    22     0  331  254  0  8 92  0  0
75  0      0 2369573632  32896 1350988    0    0    22    50 425713 280199  0 23 76  0  0
104  0      0 2369573632  32896 1350988    0    0   290     0 430238 298247  0 23 76  0  0
86  0      0 2369573632  32896 1350988    0    0   132     0 428019 291865  0 24 76  0  0
90  0      0 2369573632  32896 1350988    0    0   502     0 422498 278672  0 23 76  0  0

perf lock record -C0-7 sleep 1; perf lock contention
 contended   total wait     max wait     avg wait         type   caller

      2524    116.15 ms    486.61 us     46.02 us     spinlock   __dev_queue_xmit+0x55b
      5821    107.18 ms    371.67 us     18.41 us     spinlock   dev_hard_start_xmit+0xcd
      2377      9.73 ms     35.86 us      4.09 us     spinlock   ___slab_alloc+0x4e0
       923      5.74 ms     20.91 us      6.22 us     spinlock   ___slab_alloc+0x5c9
       121      3.42 ms    193.05 us     28.24 us     spinlock   net_tx_action+0xf8
         6    564.33 us    167.60 us     94.05 us     spinlock   do_softirq+0x5b

If netperf threads are pinned (~54 Mpps)
perf lock record -C0-7 sleep 1; perf lock contention
     32907    316.98 ms    195.98 us      9.63 us     spinlock   dev_hard_start_xmit+0xcd
      4507     61.83 ms    212.73 us     13.72 us     spinlock   __dev_queue_xmit+0x554
      2781     23.53 ms     40.03 us      8.46 us     spinlock   ___slab_alloc+0x5c9
      3554     18.94 ms     34.69 us      5.33 us     spinlock   ___slab_alloc+0x4e0
       233      9.09 ms    215.70 us     38.99 us     spinlock   do_softirq+0x5b
       153    930.66 us     48.67 us      6.08 us     spinlock   net_tx_action+0xfd
        84    331.10 us     14.22 us      3.94 us     spinlock   ___slab_alloc+0x5c9
       140    323.71 us      9.94 us      2.31 us     spinlock   ___slab_alloc+0x4e0

@__dev_queue_xmit_ns:
[128, 256)       1539830 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@                  |
[256, 512)       2299558 |@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@|
[512, 1K)         483936 |@@@@@@@@@@                                          |
[1K, 2K)          265345 |@@@@@@                                              |
[2K, 4K)          145463 |@@@                                                 |
[4K, 8K)           54571 |@                                                   |
[8K, 16K)          10270 |                                                    |
[16K, 32K)          9385 |                                                    |
[32K, 64K)          7749 |                                                    |
[64K, 128K)        26799 |                                                    |
[128K, 256K)        2665 |                                                    |
[256K, 512K)         665 |                                                    |

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>
---
 include/net/sch_generic.h |  4 +-
 net/core/dev.c            | 88 +++++++++++++++++++++++----------------
 net/sched/sch_generic.c   |  5 ---
 3 files changed, 55 insertions(+), 42 deletions(-)

diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 31561291bc92fd70d4d3ca8f5f7dbc4c94c895a0..94966692ccdf51db085c236319705aecba8c30cf 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -115,7 +115,9 @@ struct Qdisc {
 	struct Qdisc            *next_sched;
 	struct sk_buff_head	skb_bad_txq;
 
-	spinlock_t		busylock ____cacheline_aligned_in_smp;
+	atomic_long_t		defer_count ____cacheline_aligned_in_smp;
+	struct llist_head	defer_list;
+
 	spinlock_t		seqlock;
 
 	struct rcu_head		rcu;
diff --git a/net/core/dev.c b/net/core/dev.c
index 0ff178399b2d28ca2754b3f06d69a97f5d6dcf71..e281bae9b15046fa435c3a7cc9eaeb0b5609bab4 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4125,9 +4125,10 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 				 struct net_device *dev,
 				 struct netdev_queue *txq)
 {
+	struct sk_buff *next, *to_free = NULL;
 	spinlock_t *root_lock = qdisc_lock(q);
-	struct sk_buff *to_free = NULL;
-	bool contended;
+	struct llist_node *ll_list, *first_n;
+	unsigned long defer_count = 0;
 	int rc;
 
 	qdisc_calculate_pkt_len(skb, q);
@@ -4167,61 +4168,76 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
 		return rc;
 	}
 
-	/*
-	 * Heuristic to force contended enqueues to serialize on a
-	 * separate lock before trying to get qdisc main lock.
-	 * This permits qdisc->running owner to get the lock more
-	 * often and dequeue packets faster.
-	 * On PREEMPT_RT it is possible to preempt the qdisc owner during xmit
-	 * and then other tasks will only enqueue packets. The packets will be
-	 * sent after the qdisc owner is scheduled again. To prevent this
-	 * scenario the task always serialize on the lock.
+	/* Open code llist_add(&skb->ll_node, &q->defer_list) + queue limit.
+	 * In the try_cmpxchg() loop, we want to increment q->defer_count
+	 * at most once to limit the number of skbs in defer_list.
+	 * We perform the defer_count increment only if the list is not empty,
+	 * because some arches have slow atomic_long_inc_return().
+	 */
+	first_n = READ_ONCE(q->defer_list.first);
+	do {
+		if (first_n && !defer_count) {
+			defer_count = atomic_long_inc_return(&q->defer_count);
+			if (unlikely(defer_count > q->limit)) {
+				kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
+				return NET_XMIT_DROP;
+			}
+		}
+		skb->ll_node.next = first_n;
+	} while (!try_cmpxchg(&q->defer_list.first, &first_n, &skb->ll_node));
+
+	/* If defer_list was not empty, we know the cpu which queued
+	 * the first skb will process the whole list for us.
 	 */
-	contended = qdisc_is_running(q) || IS_ENABLED(CONFIG_PREEMPT_RT);
-	if (unlikely(contended))
-		spin_lock(&q->busylock);
+	if (first_n)
+		return NET_XMIT_SUCCESS;
 
 	spin_lock(root_lock);
+
+	ll_list = llist_del_all(&q->defer_list);
+	/* There is a small race because we clear defer_count not atomically
+	 * with the prior llist_del_all(). This means defer_list could grow
+	 * over q->limit.
+	 */
+	atomic_long_set(&q->defer_count, 0);
+
+	ll_list = llist_reverse_order(ll_list);
+
 	if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
-		__qdisc_drop(skb, &to_free);
+		llist_for_each_entry_safe(skb, next, ll_list, ll_node)
+			__qdisc_drop(skb, &to_free);
 		rc = NET_XMIT_DROP;
-	} else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
-		   qdisc_run_begin(q)) {
+		goto unlock;
+	}
+	rc = NET_XMIT_SUCCESS;
+	if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
+	    !llist_next(ll_list) && qdisc_run_begin(q)) {
 		/*
 		 * This is a work-conserving queue; there are no old skbs
 		 * waiting to be sent out; and the qdisc is not running -
 		 * xmit the skb directly.
 		 */
 
+		DEBUG_NET_WARN_ON_ONCE(skb != llist_entry(ll_list,
+							  struct sk_buff,
+							  ll_node));
 		qdisc_bstats_update(q, skb);
-
-		if (sch_direct_xmit(skb, q, dev, txq, root_lock, true)) {
-			if (unlikely(contended)) {
-				spin_unlock(&q->busylock);
-				contended = false;
-			}
+		if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
 			__qdisc_run(q);
-		}
-
 		qdisc_run_end(q);
-		rc = NET_XMIT_SUCCESS;
 	} else {
-		rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
-		if (qdisc_run_begin(q)) {
-			if (unlikely(contended)) {
-				spin_unlock(&q->busylock);
-				contended = false;
-			}
-			__qdisc_run(q);
-			qdisc_run_end(q);
+		llist_for_each_entry_safe(skb, next, ll_list, ll_node) {
+			prefetch(next);
+			skb_mark_not_on_list(skb);
+			dev_qdisc_enqueue(skb, q, &to_free, txq);
 		}
+		qdisc_run(q);
 	}
+unlock:
 	spin_unlock(root_lock);
 	if (unlikely(to_free))
 		kfree_skb_list_reason(to_free,
 				      tcf_get_drop_reason(to_free));
-	if (unlikely(contended))
-		spin_unlock(&q->busylock);
 	return rc;
 }
 
diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index dfa8e8e667d24a435b0c9cb3c1f05c8075f63e89..d9a98d02a55fc361a223f3201e37b6a2b698bb5e 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -666,7 +666,6 @@ struct Qdisc noop_qdisc = {
 	.ops		=	&noop_qdisc_ops,
 	.q.lock		=	__SPIN_LOCK_UNLOCKED(noop_qdisc.q.lock),
 	.dev_queue	=	&noop_netdev_queue,
-	.busylock	=	__SPIN_LOCK_UNLOCKED(noop_qdisc.busylock),
 	.gso_skb = {
 		.next = (struct sk_buff *)&noop_qdisc.gso_skb,
 		.prev = (struct sk_buff *)&noop_qdisc.gso_skb,
@@ -970,10 +969,6 @@ struct Qdisc *qdisc_alloc(struct netdev_queue *dev_queue,
 		}
 	}
 
-	spin_lock_init(&sch->busylock);
-	lockdep_set_class(&sch->busylock,
-			  dev->qdisc_tx_busylock ?: &qdisc_tx_busylock);
-
 	/* seqlock has the same scope of busylock, for NOLOCK qdisc */
 	spin_lock_init(&sch->seqlock);
 	lockdep_set_class(&sch->seqlock,
-- 
2.51.0.740.g6adb054d12-goog


^ permalink raw reply related	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 net-next 0/5] net: optimize TX throughput and efficiency
  2025-10-13 14:54 [PATCH v1 net-next 0/5] net: optimize TX throughput and efficiency Eric Dumazet
                   ` (4 preceding siblings ...)
  2025-10-13 14:54 ` [PATCH v1 net-next 5/5] net: dev_queue_xmit() llist adoption Eric Dumazet
@ 2025-10-13 16:23 ` Toke Høiland-Jørgensen
  5 siblings, 0 replies; 24+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-10-13 16:23 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
	Eric Dumazet

Eric Dumazet <edumazet@google.com> writes:

> In this series, I replace the busylock spinlock we have in
> __dev_queue_xmit() with a lockless list (llist) to reduce
> spinlock contention to a minimum.
>
> The idea is that only one cpu might spin on the qdisc spinlock,
> while the others simply add their skb to the llist.
>
> After this series, we get a 300 % (4x) efficiency improvement on heavy
> TX workloads: twice the number of packets per second, for half the cpu
> cycles.

Figured I might as well take a closer look at the rest of the series. So
for the whole thing:

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 net-next 5/5] net: dev_queue_xmit() llist adoption
  2025-10-13 14:54 ` [PATCH v1 net-next 5/5] net: dev_queue_xmit() llist adoption Eric Dumazet
@ 2025-11-07 15:28   ` Toke Høiland-Jørgensen
  2025-11-07 15:37     ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-11-07 15:28 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
	Eric Dumazet

Eric Dumazet <edumazet@google.com> writes:

> Remove the busylock spinlock and use a lockless list (llist)
> to reduce spinlock contention to a minimum.
>
> The idea is that only one cpu might spin on the qdisc spinlock,
> while the others simply add their skb to the llist.
>
> After this patch, we get a 300 % improvement on heavy TX workloads:
> - Sending twice the number of packets per second.
> - While consuming 50 % fewer cycles.
>
> Note that this also allows submitting batches to the various
> qdisc->enqueue() methods in the future.
>
> Tested:
>
> - Dual Intel(R) Xeon(R) 6985P-C  (480 hyper threads).
> - 100Gbit NIC, 30 TX queues with FQ packet scheduler.
> - echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
> - 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"

Hi Eric

While testing this with sch_cake (to get a new baseline for the mq_cake
patches as Jamal suggested), I found that this patch completely destroys
the performance of cake in particular.

I run a small UDP test (64-byte packets across 16 flows through
xdp-trafficgen, offered load is ~5Mpps) with a single cake instance
as the root interface qdisc.

With a stock Fedora (6.17.7) kernel, this gets me around 630 Kpps across
8 queues (on an E810-C, ice driver):

Ethtool(ice0p1  ) stat:     40321218 (     40,321,218) <= tx_bytes /sec
Ethtool(ice0p1  ) stat:     42841424 (     42,841,424) <= tx_bytes.nic /sec
Ethtool(ice0p1  ) stat:      5248505 (      5,248,505) <= tx_queue_0_bytes /sec
Ethtool(ice0p1  ) stat:        82008 (         82,008) <= tx_queue_0_packets /sec
Ethtool(ice0p1  ) stat:      3425984 (      3,425,984) <= tx_queue_1_bytes /sec
Ethtool(ice0p1  ) stat:        53531 (         53,531) <= tx_queue_1_packets /sec
Ethtool(ice0p1  ) stat:      5277496 (      5,277,496) <= tx_queue_2_bytes /sec
Ethtool(ice0p1  ) stat:        82461 (         82,461) <= tx_queue_2_packets /sec
Ethtool(ice0p1  ) stat:      5285736 (      5,285,736) <= tx_queue_3_bytes /sec
Ethtool(ice0p1  ) stat:        82590 (         82,590) <= tx_queue_3_packets /sec
Ethtool(ice0p1  ) stat:      5280731 (      5,280,731) <= tx_queue_4_bytes /sec
Ethtool(ice0p1  ) stat:        82511 (         82,511) <= tx_queue_4_packets /sec
Ethtool(ice0p1  ) stat:      5275665 (      5,275,665) <= tx_queue_5_bytes /sec
Ethtool(ice0p1  ) stat:        82432 (         82,432) <= tx_queue_5_packets /sec
Ethtool(ice0p1  ) stat:      5276398 (      5,276,398) <= tx_queue_6_bytes /sec
Ethtool(ice0p1  ) stat:        82444 (         82,444) <= tx_queue_6_packets /sec
Ethtool(ice0p1  ) stat:      5250946 (      5,250,946) <= tx_queue_7_bytes /sec
Ethtool(ice0p1  ) stat:        82046 (         82,046) <= tx_queue_7_packets /sec
Ethtool(ice0p1  ) stat:            1 (              1) <= tx_restart /sec
Ethtool(ice0p1  ) stat:       630023 (        630,023) <= tx_size_127.nic /sec
Ethtool(ice0p1  ) stat:       630019 (        630,019) <= tx_unicast /sec
Ethtool(ice0p1  ) stat:       630020 (        630,020) <= tx_unicast.nic /sec

However, running the same test on a net-next kernel, performance drops
to around 10 Kpps(!):

Ethtool(ice0p1  ) stat:       679003 (        679,003) <= tx_bytes /sec
Ethtool(ice0p1  ) stat:       721440 (        721,440) <= tx_bytes.nic /sec
Ethtool(ice0p1  ) stat:       123539 (        123,539) <= tx_queue_0_bytes /sec
Ethtool(ice0p1  ) stat:         1930 (          1,930) <= tx_queue_0_packets /sec
Ethtool(ice0p1  ) stat:         1776 (          1,776) <= tx_queue_1_bytes /sec
Ethtool(ice0p1  ) stat:           28 (             28) <= tx_queue_1_packets /sec
Ethtool(ice0p1  ) stat:         1837 (          1,837) <= tx_queue_2_bytes /sec
Ethtool(ice0p1  ) stat:           29 (             29) <= tx_queue_2_packets /sec
Ethtool(ice0p1  ) stat:         1776 (          1,776) <= tx_queue_3_bytes /sec
Ethtool(ice0p1  ) stat:           28 (             28) <= tx_queue_3_packets /sec
Ethtool(ice0p1  ) stat:         1654 (          1,654) <= tx_queue_4_bytes /sec
Ethtool(ice0p1  ) stat:           26 (             26) <= tx_queue_4_packets /sec
Ethtool(ice0p1  ) stat:       222026 (        222,026) <= tx_queue_5_bytes /sec
Ethtool(ice0p1  ) stat:         3469 (          3,469) <= tx_queue_5_packets /sec
Ethtool(ice0p1  ) stat:       183072 (        183,072) <= tx_queue_6_bytes /sec
Ethtool(ice0p1  ) stat:         2861 (          2,861) <= tx_queue_6_packets /sec
Ethtool(ice0p1  ) stat:       143322 (        143,322) <= tx_queue_7_bytes /sec
Ethtool(ice0p1  ) stat:         2239 (          2,239) <= tx_queue_7_packets /sec
Ethtool(ice0p1  ) stat:        10609 (         10,609) <= tx_size_127.nic /sec
Ethtool(ice0p1  ) stat:        10609 (         10,609) <= tx_unicast /sec
Ethtool(ice0p1  ) stat:        10609 (         10,609) <= tx_unicast.nic /sec

Reverting commit 100dfa74cad9 ("net: dev_queue_xmit() llist adoption")
(and the followon f8a55d5e71e6 ("net: add a fast path in
__netif_schedule()"), but that alone makes no difference) gets me back
to the previous 630-650 Kpps range.

I couldn't find any other qdisc that suffers in the same way (tried
fq_codel, sfq and netem as single root qdiscs), so this seems to be some
specific interaction between the llist implementation and sch_cake. Any
idea what could be causing this?

-Toke


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 net-next 5/5] net: dev_queue_xmit() llist adoption
  2025-11-07 15:28   ` Toke Høiland-Jørgensen
@ 2025-11-07 15:37     ` Eric Dumazet
  2025-11-07 15:46       ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2025-11-07 15:37 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
	Willem de Bruijn, netdev, eric.dumazet

On Fri, Nov 7, 2025 at 7:28 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Eric Dumazet <edumazet@google.com> writes:
>
> > Remove the busylock spinlock and use a lockless list (llist)
> > to reduce spinlock contention to a minimum.
> >
> > The idea is that only one cpu might spin on the qdisc spinlock,
> > while the others simply add their skb to the llist.
> >
> > After this patch, we get a 300 % improvement on heavy TX workloads:
> > - Sending twice the number of packets per second.
> > - While consuming 50 % fewer cycles.
> >
> > Note that this also allows submitting batches to the various
> > qdisc->enqueue() methods in the future.
> >
> > Tested:
> >
> > - Dual Intel(R) Xeon(R) 6985P-C  (480 hyper threads).
> > - 100Gbit NIC, 30 TX queues with FQ packet scheduler.
> > - echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
> > - 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
>
> Hi Eric
>
> While testing this with sch_cake (to get a new baseline for the mq_cake
> patches as Jamal suggested), I found that this patch completely destroys
> the performance of cake in particular.
>
> I run a small UDP test (64-byte packets across 16 flows through
> xdp-trafficgen, offered load is ~5Mpps) with a single cake instance
> as the root interface qdisc.
>
> With a stock Fedora (6.17.7) kernel, this gets me around 630 Kpps across
> 8 queues (on an E810-C, ice driver):
>
> Ethtool(ice0p1  ) stat:     40321218 (     40,321,218) <= tx_bytes /sec
> Ethtool(ice0p1  ) stat:     42841424 (     42,841,424) <= tx_bytes.nic /sec
> Ethtool(ice0p1  ) stat:      5248505 (      5,248,505) <= tx_queue_0_bytes /sec
> Ethtool(ice0p1  ) stat:        82008 (         82,008) <= tx_queue_0_packets /sec
> Ethtool(ice0p1  ) stat:      3425984 (      3,425,984) <= tx_queue_1_bytes /sec
> Ethtool(ice0p1  ) stat:        53531 (         53,531) <= tx_queue_1_packets /sec
> Ethtool(ice0p1  ) stat:      5277496 (      5,277,496) <= tx_queue_2_bytes /sec
> Ethtool(ice0p1  ) stat:        82461 (         82,461) <= tx_queue_2_packets /sec
> Ethtool(ice0p1  ) stat:      5285736 (      5,285,736) <= tx_queue_3_bytes /sec
> Ethtool(ice0p1  ) stat:        82590 (         82,590) <= tx_queue_3_packets /sec
> Ethtool(ice0p1  ) stat:      5280731 (      5,280,731) <= tx_queue_4_bytes /sec
> Ethtool(ice0p1  ) stat:        82511 (         82,511) <= tx_queue_4_packets /sec
> Ethtool(ice0p1  ) stat:      5275665 (      5,275,665) <= tx_queue_5_bytes /sec
> Ethtool(ice0p1  ) stat:        82432 (         82,432) <= tx_queue_5_packets /sec
> Ethtool(ice0p1  ) stat:      5276398 (      5,276,398) <= tx_queue_6_bytes /sec
> Ethtool(ice0p1  ) stat:        82444 (         82,444) <= tx_queue_6_packets /sec
> Ethtool(ice0p1  ) stat:      5250946 (      5,250,946) <= tx_queue_7_bytes /sec
> Ethtool(ice0p1  ) stat:        82046 (         82,046) <= tx_queue_7_packets /sec
> Ethtool(ice0p1  ) stat:            1 (              1) <= tx_restart /sec
> Ethtool(ice0p1  ) stat:       630023 (        630,023) <= tx_size_127.nic /sec
> Ethtool(ice0p1  ) stat:       630019 (        630,019) <= tx_unicast /sec
> Ethtool(ice0p1  ) stat:       630020 (        630,020) <= tx_unicast.nic /sec
>
> However, running the same test on a net-next kernel, performance drops
> to around 10 Kpps(!):
>
> Ethtool(ice0p1  ) stat:       679003 (        679,003) <= tx_bytes /sec
> Ethtool(ice0p1  ) stat:       721440 (        721,440) <= tx_bytes.nic /sec
> Ethtool(ice0p1  ) stat:       123539 (        123,539) <= tx_queue_0_bytes /sec
> Ethtool(ice0p1  ) stat:         1930 (          1,930) <= tx_queue_0_packets /sec
> Ethtool(ice0p1  ) stat:         1776 (          1,776) <= tx_queue_1_bytes /sec
> Ethtool(ice0p1  ) stat:           28 (             28) <= tx_queue_1_packets /sec
> Ethtool(ice0p1  ) stat:         1837 (          1,837) <= tx_queue_2_bytes /sec
> Ethtool(ice0p1  ) stat:           29 (             29) <= tx_queue_2_packets /sec
> Ethtool(ice0p1  ) stat:         1776 (          1,776) <= tx_queue_3_bytes /sec
> Ethtool(ice0p1  ) stat:           28 (             28) <= tx_queue_3_packets /sec
> Ethtool(ice0p1  ) stat:         1654 (          1,654) <= tx_queue_4_bytes /sec
> Ethtool(ice0p1  ) stat:           26 (             26) <= tx_queue_4_packets /sec
> Ethtool(ice0p1  ) stat:       222026 (        222,026) <= tx_queue_5_bytes /sec
> Ethtool(ice0p1  ) stat:         3469 (          3,469) <= tx_queue_5_packets /sec
> Ethtool(ice0p1  ) stat:       183072 (        183,072) <= tx_queue_6_bytes /sec
> Ethtool(ice0p1  ) stat:         2861 (          2,861) <= tx_queue_6_packets /sec
> Ethtool(ice0p1  ) stat:       143322 (        143,322) <= tx_queue_7_bytes /sec
> Ethtool(ice0p1  ) stat:         2239 (          2,239) <= tx_queue_7_packets /sec
> Ethtool(ice0p1  ) stat:        10609 (         10,609) <= tx_size_127.nic /sec
> Ethtool(ice0p1  ) stat:        10609 (         10,609) <= tx_unicast /sec
> Ethtool(ice0p1  ) stat:        10609 (         10,609) <= tx_unicast.nic /sec
>
> Reverting commit 100dfa74cad9 ("net: dev_queue_xmit() llist adoption")
> (and the followon f8a55d5e71e6 ("net: add a fast path in
> __netif_schedule()"), but that alone makes no difference) gets me back
> to the previous 630-650 Kpps range.
>
> I couldn't find any other qdisc that suffers in the same way (tried
> fq_codel, sfq and netem as single root qdiscs), so this seems to be some
> specific interaction between the llist implementation and sch_cake. Any
> idea what could be causing this?

I would take a look at the full "tc -s -d qdisc" output and see if
anything interesting is showing up (requeues?).

Also look if you have drops (perf record -a -e skb:kfree_skb).

You are sharing one qdisc across 8 queues?

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 net-next 5/5] net: dev_queue_xmit() llist adoption
  2025-11-07 15:37     ` Eric Dumazet
@ 2025-11-07 15:46       ` Eric Dumazet
  2025-11-09 10:09         ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2025-11-07 15:46 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
	Willem de Bruijn, netdev, eric.dumazet

On Fri, Nov 7, 2025 at 7:37 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Nov 7, 2025 at 7:28 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> >
> > Eric Dumazet <edumazet@google.com> writes:
> >
> > > Remove the busylock spinlock and use a lockless list (llist)
> > > to reduce spinlock contention to a minimum.
> > >
> > > The idea is that only one cpu might spin on the qdisc spinlock,
> > > while the others simply add their skb to the llist.
> > >
> > > After this patch, we get a 300 % improvement on heavy TX workloads:
> > > - Sending twice the number of packets per second.
> > > - While consuming 50 % fewer cycles.
> > >
> > > Note that this also allows submitting batches to the various
> > > qdisc->enqueue() methods in the future.
> > >
> > > Tested:
> > >
> > > - Dual Intel(R) Xeon(R) 6985P-C  (480 hyper threads).
> > > - 100Gbit NIC, 30 TX queues with FQ packet scheduler.
> > > - echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
> > > - 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
> >
> > Hi Eric
> >
> > While testing this with sch_cake (to get a new baseline for the mq_cake
> > patches as Jamal suggested), I found that this patch completely destroys
> > the performance of cake in particular.
> >
> > I run a small UDP test (64-byte packets across 16 flows through
> > xdp-trafficgen, offered load is ~5Mpps) with a single cake instance
> > as the root interface qdisc.
> >
> > With a stock Fedora (6.17.7) kernel, this gets me around 630 Kpps across
> > 8 queues (on an E810-C, ice driver):
> >
> > Ethtool(ice0p1  ) stat:     40321218 (     40,321,218) <= tx_bytes /sec
> > Ethtool(ice0p1  ) stat:     42841424 (     42,841,424) <= tx_bytes.nic /sec
> > Ethtool(ice0p1  ) stat:      5248505 (      5,248,505) <= tx_queue_0_bytes /sec
> > Ethtool(ice0p1  ) stat:        82008 (         82,008) <= tx_queue_0_packets /sec
> > Ethtool(ice0p1  ) stat:      3425984 (      3,425,984) <= tx_queue_1_bytes /sec
> > Ethtool(ice0p1  ) stat:        53531 (         53,531) <= tx_queue_1_packets /sec
> > Ethtool(ice0p1  ) stat:      5277496 (      5,277,496) <= tx_queue_2_bytes /sec
> > Ethtool(ice0p1  ) stat:        82461 (         82,461) <= tx_queue_2_packets /sec
> > Ethtool(ice0p1  ) stat:      5285736 (      5,285,736) <= tx_queue_3_bytes /sec
> > Ethtool(ice0p1  ) stat:        82590 (         82,590) <= tx_queue_3_packets /sec
> > Ethtool(ice0p1  ) stat:      5280731 (      5,280,731) <= tx_queue_4_bytes /sec
> > Ethtool(ice0p1  ) stat:        82511 (         82,511) <= tx_queue_4_packets /sec
> > Ethtool(ice0p1  ) stat:      5275665 (      5,275,665) <= tx_queue_5_bytes /sec
> > Ethtool(ice0p1  ) stat:        82432 (         82,432) <= tx_queue_5_packets /sec
> > Ethtool(ice0p1  ) stat:      5276398 (      5,276,398) <= tx_queue_6_bytes /sec
> > Ethtool(ice0p1  ) stat:        82444 (         82,444) <= tx_queue_6_packets /sec
> > Ethtool(ice0p1  ) stat:      5250946 (      5,250,946) <= tx_queue_7_bytes /sec
> > Ethtool(ice0p1  ) stat:        82046 (         82,046) <= tx_queue_7_packets /sec
> > Ethtool(ice0p1  ) stat:            1 (              1) <= tx_restart /sec
> > Ethtool(ice0p1  ) stat:       630023 (        630,023) <= tx_size_127.nic /sec
> > Ethtool(ice0p1  ) stat:       630019 (        630,019) <= tx_unicast /sec
> > Ethtool(ice0p1  ) stat:       630020 (        630,020) <= tx_unicast.nic /sec
> >
> > However, running the same test on a net-next kernel, performance drops
> > to around 10 Kpps(!):
> >
> > Ethtool(ice0p1  ) stat:       679003 (        679,003) <= tx_bytes /sec
> > Ethtool(ice0p1  ) stat:       721440 (        721,440) <= tx_bytes.nic /sec
> > Ethtool(ice0p1  ) stat:       123539 (        123,539) <= tx_queue_0_bytes /sec
> > Ethtool(ice0p1  ) stat:         1930 (          1,930) <= tx_queue_0_packets /sec
> > Ethtool(ice0p1  ) stat:         1776 (          1,776) <= tx_queue_1_bytes /sec
> > Ethtool(ice0p1  ) stat:           28 (             28) <= tx_queue_1_packets /sec
> > Ethtool(ice0p1  ) stat:         1837 (          1,837) <= tx_queue_2_bytes /sec
> > Ethtool(ice0p1  ) stat:           29 (             29) <= tx_queue_2_packets /sec
> > Ethtool(ice0p1  ) stat:         1776 (          1,776) <= tx_queue_3_bytes /sec
> > Ethtool(ice0p1  ) stat:           28 (             28) <= tx_queue_3_packets /sec
> > Ethtool(ice0p1  ) stat:         1654 (          1,654) <= tx_queue_4_bytes /sec
> > Ethtool(ice0p1  ) stat:           26 (             26) <= tx_queue_4_packets /sec
> > Ethtool(ice0p1  ) stat:       222026 (        222,026) <= tx_queue_5_bytes /sec
> > Ethtool(ice0p1  ) stat:         3469 (          3,469) <= tx_queue_5_packets /sec
> > Ethtool(ice0p1  ) stat:       183072 (        183,072) <= tx_queue_6_bytes /sec
> > Ethtool(ice0p1  ) stat:         2861 (          2,861) <= tx_queue_6_packets /sec
> > Ethtool(ice0p1  ) stat:       143322 (        143,322) <= tx_queue_7_bytes /sec
> > Ethtool(ice0p1  ) stat:         2239 (          2,239) <= tx_queue_7_packets /sec
> > Ethtool(ice0p1  ) stat:        10609 (         10,609) <= tx_size_127.nic /sec
> > Ethtool(ice0p1  ) stat:        10609 (         10,609) <= tx_unicast /sec
> > Ethtool(ice0p1  ) stat:        10609 (         10,609) <= tx_unicast.nic /sec
> >
> > Reverting commit 100dfa74cad9 ("net: dev_queue_xmit() llist adoption")
> > (and the followon f8a55d5e71e6 ("net: add a fast path in
> > __netif_schedule()"), but that alone makes no difference) gets me back
> > to the previous 630-650 Kpps range.
> >
> > I couldn't find any other qdisc that suffers in the same way (tried
> > fq_codel, sfq and netem as single root qdiscs), so this seems to be some
> > specific interaction between the llist implementation and sch_cake. Any
> > idea what could be causing this?
>
> I would take a look at the full "tc -s -d qdisc" output and see if
> anything interesting is showing up (requeues?).
>
> Also look if you have drops (perf record -a -e skb:kfree_skb).
>
> You are sharing one qdisc across 8 queues?

I also assume you are running net-next, because the final patch was a bit different:

int count = 0;

llist_for_each_entry_safe(skb, next, ll_list, ll_node) {
    prefetch(next);
    skb_mark_not_on_list(skb);
    rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
    count++;
}
qdisc_run(q);
if (count != 1)
    rc = NET_XMIT_SUCCESS;

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 net-next 5/5] net: dev_queue_xmit() llist adoption
  2025-11-07 15:46       ` Eric Dumazet
@ 2025-11-09 10:09         ` Eric Dumazet
  2025-11-09 12:54           ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2025-11-09 10:09 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
	Willem de Bruijn, netdev, eric.dumazet

On Fri, Nov 7, 2025 at 7:46 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Nov 7, 2025 at 7:37 AM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Fri, Nov 7, 2025 at 7:28 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
> > >
> > > Eric Dumazet <edumazet@google.com> writes:
> > >
> > > > Remove the busylock spinlock and use a lockless list (llist)
> > > > to reduce spinlock contention to a minimum.
> > > >
> > > > The idea is that only one cpu might spin on the qdisc spinlock,
> > > > while the others simply add their skb to the llist.
> > > >
> > > > After this patch, we get a 300 % improvement on heavy TX workloads:
> > > > - Sending twice the number of packets per second.
> > > > - While consuming 50 % fewer cycles.
> > > >
> > > > Note that this also allows submitting batches to the various
> > > > qdisc->enqueue() methods in the future.
> > > >
> > > > Tested:
> > > >
> > > > - Dual Intel(R) Xeon(R) 6985P-C  (480 hyper threads).
> > > > - 100Gbit NIC, 30 TX queues with FQ packet scheduler.
> > > > - echo 64 >/sys/kernel/slab/skbuff_small_head/cpu_partial (avoid contention in mm)
> > > > - 240 concurrent "netperf -t UDP_STREAM -- -m 120 -n"
> > >
> > > Hi Eric
> > >
> > > While testing this with sch_cake (to get a new baseline for the mq_cake
> > > patches as Jamal suggested), I found that this patch completely destroys
> > > the performance of cake in particular.
> > >
> > > I run a small UDP test (64-byte packets across 16 flows through
> > > xdp-trafficgen, offered load is ~5Mpps) with a single cake instance
> > > as the root interface qdisc.
> > >
> > > With a stock Fedora (6.17.7) kernel, this gets me around 630 Kpps across
> > > 8 queues (on an E810-C, ice driver):
> > >
> > > Ethtool(ice0p1  ) stat:     40321218 (     40,321,218) <= tx_bytes /sec
> > > Ethtool(ice0p1  ) stat:     42841424 (     42,841,424) <= tx_bytes.nic /sec
> > > Ethtool(ice0p1  ) stat:      5248505 (      5,248,505) <= tx_queue_0_bytes /sec
> > > Ethtool(ice0p1  ) stat:        82008 (         82,008) <= tx_queue_0_packets /sec
> > > Ethtool(ice0p1  ) stat:      3425984 (      3,425,984) <= tx_queue_1_bytes /sec
> > > Ethtool(ice0p1  ) stat:        53531 (         53,531) <= tx_queue_1_packets /sec
> > > Ethtool(ice0p1  ) stat:      5277496 (      5,277,496) <= tx_queue_2_bytes /sec
> > > Ethtool(ice0p1  ) stat:        82461 (         82,461) <= tx_queue_2_packets /sec
> > > Ethtool(ice0p1  ) stat:      5285736 (      5,285,736) <= tx_queue_3_bytes /sec
> > > Ethtool(ice0p1  ) stat:        82590 (         82,590) <= tx_queue_3_packets /sec
> > > Ethtool(ice0p1  ) stat:      5280731 (      5,280,731) <= tx_queue_4_bytes /sec
> > > Ethtool(ice0p1  ) stat:        82511 (         82,511) <= tx_queue_4_packets /sec
> > > Ethtool(ice0p1  ) stat:      5275665 (      5,275,665) <= tx_queue_5_bytes /sec
> > > Ethtool(ice0p1  ) stat:        82432 (         82,432) <= tx_queue_5_packets /sec
> > > Ethtool(ice0p1  ) stat:      5276398 (      5,276,398) <= tx_queue_6_bytes /sec
> > > Ethtool(ice0p1  ) stat:        82444 (         82,444) <= tx_queue_6_packets /sec
> > > Ethtool(ice0p1  ) stat:      5250946 (      5,250,946) <= tx_queue_7_bytes /sec
> > > Ethtool(ice0p1  ) stat:        82046 (         82,046) <= tx_queue_7_packets /sec
> > > Ethtool(ice0p1  ) stat:            1 (              1) <= tx_restart /sec
> > > Ethtool(ice0p1  ) stat:       630023 (        630,023) <= tx_size_127.nic /sec
> > > Ethtool(ice0p1  ) stat:       630019 (        630,019) <= tx_unicast /sec
> > > Ethtool(ice0p1  ) stat:       630020 (        630,020) <= tx_unicast.nic /sec
> > >
> > > However, running the same test on a net-next kernel, performance drops
> > > to around 10 Kpps(!):
> > >
> > > Ethtool(ice0p1  ) stat:       679003 (        679,003) <= tx_bytes /sec
> > > Ethtool(ice0p1  ) stat:       721440 (        721,440) <= tx_bytes.nic /sec
> > > Ethtool(ice0p1  ) stat:       123539 (        123,539) <= tx_queue_0_bytes /sec
> > > Ethtool(ice0p1  ) stat:         1930 (          1,930) <= tx_queue_0_packets /sec
> > > Ethtool(ice0p1  ) stat:         1776 (          1,776) <= tx_queue_1_bytes /sec
> > > Ethtool(ice0p1  ) stat:           28 (             28) <= tx_queue_1_packets /sec
> > > Ethtool(ice0p1  ) stat:         1837 (          1,837) <= tx_queue_2_bytes /sec
> > > Ethtool(ice0p1  ) stat:           29 (             29) <= tx_queue_2_packets /sec
> > > Ethtool(ice0p1  ) stat:         1776 (          1,776) <= tx_queue_3_bytes /sec
> > > Ethtool(ice0p1  ) stat:           28 (             28) <= tx_queue_3_packets /sec
> > > Ethtool(ice0p1  ) stat:         1654 (          1,654) <= tx_queue_4_bytes /sec
> > > Ethtool(ice0p1  ) stat:           26 (             26) <= tx_queue_4_packets /sec
> > > Ethtool(ice0p1  ) stat:       222026 (        222,026) <= tx_queue_5_bytes /sec
> > > Ethtool(ice0p1  ) stat:         3469 (          3,469) <= tx_queue_5_packets /sec
> > > Ethtool(ice0p1  ) stat:       183072 (        183,072) <= tx_queue_6_bytes /sec
> > > Ethtool(ice0p1  ) stat:         2861 (          2,861) <= tx_queue_6_packets /sec
> > > Ethtool(ice0p1  ) stat:       143322 (        143,322) <= tx_queue_7_bytes /sec
> > > Ethtool(ice0p1  ) stat:         2239 (          2,239) <= tx_queue_7_packets /sec
> > > Ethtool(ice0p1  ) stat:        10609 (         10,609) <= tx_size_127.nic /sec
> > > Ethtool(ice0p1  ) stat:        10609 (         10,609) <= tx_unicast /sec
> > > Ethtool(ice0p1  ) stat:        10609 (         10,609) <= tx_unicast.nic /sec
> > >
> > > Reverting commit 100dfa74cad9 ("net: dev_queue_xmit() llist adoption")
> > > (and the followon f8a55d5e71e6 ("net: add a fast path in
> > > __netif_schedule()"), but that alone makes no difference) gets me back
> > > to the previous 630-650 Kpps range.
> > >
> > > I couldn't find any other qdisc that suffers in the same way (tried
> > > fq_codel, sfq and netem as single root qdiscs), so this seems to be some
> > > specific interaction between the llist implementation and sch_cake. Any
> > > idea what could be causing this?
> >
> > I would take a look at the full "tc -s -d qdisc" output and see if
> > anything interesting is showing up (requeues?).
> >
> > Also look if you have drops (perf record -a -e skb:kfree_skb).
> >
> > You are sharing one qdisc across 8 queues?

This might be something related to XDP, because I ran the following
test (IDPF, 32 TX queues)

tc qd replace dev eth1 root cake
./super_netperf 16 -H tjbp27 -t UDP_STREAM -l 1000 -- -m 64 -Nn &

Before my series : ~360 Kpps
After my series : ~550 Kpps

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 net-next 5/5] net: dev_queue_xmit() llist adoption
  2025-11-09 10:09         ` Eric Dumazet
@ 2025-11-09 12:54           ` Eric Dumazet
  2025-11-09 16:33             ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2025-11-09 12:54 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
	Willem de Bruijn, netdev, eric.dumazet

On Sun, Nov 9, 2025 at 2:09 AM Eric Dumazet <edumazet@google.com> wrote:
>
>
> This might be something related to XDP, because I ran the following
> test (IDPF, 32 TX queues)
>
> tc qd replace dev eth1 root cake
> ./super_netperf 16 -H tjbp27 -t UDP_STREAM -l 1000 -- -m 64 -Nn &
>
> Before my series : ~360 Kpps
> After my series : ~550 Kpps

Or ... being faster uncovered an old qdisc bug.

I mentioned the 'requeues' counter because I have seen it increase
lately, and was wondering if this could be a driver bug.

It seems the bug is in the generic qdisc code: try_bulk_dequeue_skb()
trusts BQL, but cannot see that the driver might stop the queue before
BQL would.

I am testing the following patch; it would be great if this solution
works for you.

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index d9a98d02a55fc361a223f3201e37b6a2b698bb5e..e18584604f0faab4e4d86a29565d7d982c9eb41d 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -180,9 +180,10 @@ static inline void dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
 static void try_bulk_dequeue_skb(struct Qdisc *q,
                                 struct sk_buff *skb,
                                 const struct netdev_queue *txq,
-                                int *packets)
+                                int *packets, int quota)
 {
        int bytelimit = qdisc_avail_bulklimit(txq) - skb->len;
+       int cnt = 0;

        while (bytelimit > 0) {
                struct sk_buff *nskb = q->dequeue(q);
@@ -193,8 +194,10 @@ static void try_bulk_dequeue_skb(struct Qdisc *q,
                bytelimit -= nskb->len; /* covers GSO len */
                skb->next = nskb;
                skb = nskb;
-               (*packets)++; /* GSO counts as one pkt */
+               if (++cnt >= quota)
+                       break;
        }
+       (*packets) += cnt;
        skb_mark_not_on_list(skb);
 }

@@ -228,7 +231,7 @@ static void try_bulk_dequeue_skb_slow(struct Qdisc *q,
  * A requeued skb (via q->gso_skb) can also be a SKB list.
  */
 static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
-                                  int *packets)
+                                  int *packets, int quota)
 {
        const struct netdev_queue *txq = q->dev_queue;
        struct sk_buff *skb = NULL;
@@ -295,7 +298,7 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
        if (skb) {
 bulk:
                if (qdisc_may_bulk(q))
-                       try_bulk_dequeue_skb(q, skb, txq, packets);
+                       try_bulk_dequeue_skb(q, skb, txq, packets, quota);
                else
                        try_bulk_dequeue_skb_slow(q, skb, packets);
        }
@@ -387,7 +390,7 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
  *                             >0 - queue is not empty.
  *
  */
-static inline bool qdisc_restart(struct Qdisc *q, int *packets)
+static inline bool qdisc_restart(struct Qdisc *q, int *packets, int quota)
 {
        spinlock_t *root_lock = NULL;
        struct netdev_queue *txq;
@@ -396,7 +399,7 @@ static inline bool qdisc_restart(struct Qdisc *q, int *packets)
        bool validate;

        /* Dequeue packet */
-       skb = dequeue_skb(q, &validate, packets);
+       skb = dequeue_skb(q, &validate, packets, quota);
        if (unlikely(!skb))
                return false;

@@ -414,7 +417,7 @@ void __qdisc_run(struct Qdisc *q)
        int quota = READ_ONCE(net_hotdata.dev_tx_weight);
        int packets;

-       while (qdisc_restart(q, &packets)) {
+       while (qdisc_restart(q, &packets, quota)) {
                quota -= packets;
                if (quota <= 0) {
                        if (q->flags & TCQ_F_NOLOCK)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 net-next 5/5] net: dev_queue_xmit() llist adoption
  2025-11-09 12:54           ` Eric Dumazet
@ 2025-11-09 16:33             ` Toke Høiland-Jørgensen
  2025-11-09 17:14               ` Eric Dumazet
  2025-11-09 19:18               ` Jonas Köppeler
  0 siblings, 2 replies; 24+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-11-09 16:33 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
	Willem de Bruijn, netdev, eric.dumazet

Eric Dumazet <edumazet@google.com> writes:

> On Sun, Nov 9, 2025 at 2:09 AM Eric Dumazet <edumazet@google.com> wrote:
>>
>>
>> This might be something related to XDP, because I ran the following
>> test (IDPF, 32 TX queues)
>>
>> tc qd replace dev eth1 root cake
>> ./super_netperf 16 -H tjbp27 -t UDP_STREAM -l 1000 -- -m 64 -Nn &
>>
>> Before my series : ~360 Kpps
>> After my series : ~550 Kpps
>
> Or ... being faster uncovered an old qdisc bug.
>
> I mentioned the 'requeues' because I have seen this counter lately,
> and was wondering if this could
> be a driver bug.
>
> It seems the bug is in generic qdisc code: try_bulk_dequeue_skb() is
> trusting BQL, but can not see the driver might block before BQL.
>
>  I am testing the following patch, it would be great if this solution
> works for you.

That does not seem to make any difference. I am not really seeing any
requeues either, just a whole bunch of drops:

qdisc cake 8001: dev ice0p1 root refcnt 37 bandwidth unlimited diffserv3 triple-isolate nonat nowash no-ack-filter split-gso rtt 100ms raw overhead 0 
 Sent 9633155852 bytes 13658545 pkt (dropped 36165260, overlimits 0 requeues 42) 

Tried with 16 netperf UDP_STREAMs instead of xdp-trafficgen, and with
that it's even worse (down to less than 100 PPS). A single netperf
instance gets me back to the ~600k PPS range, so definitely something to
do with contention.

The drops seem to come mainly from two places:

# dropwatch -l kas
Initializing kallsyms db
dropwatch> start
Enabling monitoring...
Kernel monitoring activated.
Issue Ctrl-C to stop monitoring
1 drops at __netif_receive_skb_core.constprop.0+160 (0xffffffff87272de0) [software]
2132 drops at __dev_xmit_skb+3f5 (0xffffffff8726d475) [software]
1 drops at skb_queue_purge_reason+100 (0xffffffff8724e130) [software]
52901 drops at __dev_xmit_skb+3f5 (0xffffffff8726d475) [software]
153583 drops at __dev_xmit_skb+13c (0xffffffff8726d1bc) [software]
1 drops at __netif_receive_skb_core.constprop.0+160 (0xffffffff87272de0) [software]
93968 drops at __dev_xmit_skb+3f5 (0xffffffff8726d475) [software]
212982 drops at __dev_xmit_skb+13c (0xffffffff8726d1bc) [software]
239359 drops at __dev_xmit_skb+13c (0xffffffff8726d1bc) [software]
108219 drops at __dev_xmit_skb+3f5 (0xffffffff8726d475) [software]
191163 drops at __dev_xmit_skb+13c (0xffffffff8726d1bc) [software]
93300 drops at __dev_xmit_skb+3f5 (0xffffffff8726d475) [software]
131201 drops at __dev_xmit_skb+13c (0xffffffff8726d1bc) [software]

+13c corresponds to the defer_count check in your patch:

			defer_count = atomic_long_inc_return(&q->defer_count);
			if (unlikely(defer_count > q->limit)) {
				kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
				return NET_XMIT_DROP;
			}

and +3f5 is the to_free drop at the end of the function:

unlock:
	spin_unlock(root_lock);
	if (unlikely(to_free))
		kfree_skb_list_reason(to_free,
				      tcf_get_drop_reason(to_free));

Not sure why there's this difference between your setup and mine; some
.config or hardware difference related to the use of atomics? Any other
ideas?

-Toke


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 net-next 5/5] net: dev_queue_xmit() llist adoption
  2025-11-09 16:33             ` Toke Høiland-Jørgensen
@ 2025-11-09 17:14               ` Eric Dumazet
  2025-11-09 19:18               ` Jonas Köppeler
  1 sibling, 0 replies; 24+ messages in thread
From: Eric Dumazet @ 2025-11-09 17:14 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
	Willem de Bruijn, netdev, eric.dumazet

On Sun, Nov 9, 2025 at 8:33 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Eric Dumazet <edumazet@google.com> writes:
>
> > On Sun, Nov 9, 2025 at 2:09 AM Eric Dumazet <edumazet@google.com> wrote:
> >>
> >>
> >> This might be something related to XDP, because I ran the following
> >> test (IDPF, 32 TX queues)
> >>
> >> tc qd replace dev eth1 root cake
> >> ./super_netperf 16 -H tjbp27 -t UDP_STREAM -l 1000 -- -m 64 -Nn &
> >>
> >> Before my series : ~360 Kpps
> >> After my series : ~550 Kpps
> >
> > Or ... being faster uncovered an old qdisc bug.
> >
> > I mentioned the 'requeues' because I have seen this counter lately,
> > and was wondering if this could
> > be a driver bug.
> >
> > It seems the bug is in generic qdisc code: try_bulk_dequeue_skb() is
> > trusting BQL, but can not see the driver might block before BQL.
> >
> >  I am testing the following patch, it would be great if this solution
> > works for you.
>
> That does not seem to make any difference. I am not really seeing any
> requeues either, just a whole bunch of drops:
>
> qdisc cake 8001: dev ice0p1 root refcnt 37 bandwidth unlimited diffserv3 triple-isolate nonat nowash no-ack-filter split-gso rtt 100ms raw overhead 0
>  Sent 9633155852 bytes 13658545 pkt (dropped 36165260, overlimits 0 requeues 42)
>
> Tried with 16 netperf UDP_STREAMs instead of xdp-trafficgen, and with
> that it's even worse (down to less than 100 PPS). A single netperf
> instance gets me back to the ~600k PPS range, so definitely something to
> do with contention.
>
> The drops seem to come from mainly two places:
>
> # dropwatch -l kas
> Initializing kallsyms db
> dropwatch> start
> Enabling monitoring...
> Kernel monitoring activated.
> Issue Ctrl-C to stop monitoring
> 1 drops at __netif_receive_skb_core.constprop.0+160 (0xffffffff87272de0) [software]
> 2132 drops at __dev_xmit_skb+3f5 (0xffffffff8726d475) [software]
> 1 drops at skb_queue_purge_reason+100 (0xffffffff8724e130) [software]
> 52901 drops at __dev_xmit_skb+3f5 (0xffffffff8726d475) [software]
> 153583 drops at __dev_xmit_skb+13c (0xffffffff8726d1bc) [software]
> 1 drops at __netif_receive_skb_core.constprop.0+160 (0xffffffff87272de0) [software]
> 93968 drops at __dev_xmit_skb+3f5 (0xffffffff8726d475) [software]
> 212982 drops at __dev_xmit_skb+13c (0xffffffff8726d1bc) [software]
> 239359 drops at __dev_xmit_skb+13c (0xffffffff8726d1bc) [software]
> 108219 drops at __dev_xmit_skb+3f5 (0xffffffff8726d475) [software]
> 191163 drops at __dev_xmit_skb+13c (0xffffffff8726d1bc) [software]
> 93300 drops at __dev_xmit_skb+3f5 (0xffffffff8726d475) [software]
> 131201 drops at __dev_xmit_skb+13c (0xffffffff8726d1bc) [software]
>
> +13c corresponds to the defer_count check in your patch:
>
>                         defer_count = atomic_long_inc_return(&q->defer_count);
>                         if (unlikely(defer_count > q->limit)) {
>                                 kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
>                                 return NET_XMIT_DROP;
>                         }
>
> and +3f5 is the to_free drop at the end of the function:
>
> unlock:
>         spin_unlock(root_lock);
>         if (unlikely(to_free))
>                 kfree_skb_list_reason(to_free,
>                                       tcf_get_drop_reason(to_free));
>
> Not sure why there's this difference between your setup or mine; some
> .config or hardware difference related to the use of atomics? Any other
> ideas?

I would think atomics do not depend on .config.

Hmmm... maybe some CONFIG_PREEMPT_RT stuff ?

The CPU that won the race cannot make progress for some reason.

The qdisc can also be restarted from ksoftirqd, and that might compete
with your user threads, because why not :)

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 net-next 5/5] net: dev_queue_xmit() llist adoption
  2025-11-09 16:33             ` Toke Høiland-Jørgensen
  2025-11-09 17:14               ` Eric Dumazet
@ 2025-11-09 19:18               ` Jonas Köppeler
  2025-11-09 19:28                 ` Eric Dumazet
  1 sibling, 1 reply; 24+ messages in thread
From: Jonas Köppeler @ 2025-11-09 19:18 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen, Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
	Willem de Bruijn, netdev, eric.dumazet

On 11/9/25 5:33 PM, Toke Høiland-Jørgensen wrote:
> Not sure why there's this difference between your setup or mine; some
> .config or hardware difference related to the use of atomics? Any other
> ideas?

Hi Eric, hi Toke,

I observed a similar behavior where CAKE's throughput collapses after the patch.

Test setup:
- 4 queues CAKE root qdisc
- 64-byte packets at ~21 Mpps
- Intel Xeon Gold 6209U + 25GbE Intel XXV710 NIC
- DuT forwards incoming traffic back to traffic generator through cake

Throughput over 10 seconds before/after patch:

Before patch:
0.475   mpps
0.481   mpps
0.477   mpps
0.478   mpps
0.478   mpps
0.477   mpps
0.479   mpps
0.481   mpps
0.481   mpps

After patch:
0.265  mpps
0.035  mpps
0.003  mpps
0.002  mpps
0.001  mpps
0.002  mpps
0.002  mpps
0.002  mpps
0.002  mpps

---


 From the qdisc I also see a large number of drops. Running:

     perf record -a -e skb:kfree_skb

shows `QDISC_OVERLIMIT` and `CAKE_FLOOD` as the drop reasons.

`tc` statistics before/after the patch:

Before patch:
- drops: 32
- packets: 4,786,109
- memory_used: 8,916,480
- requeues: 254

After patch:
- drops: 13,601,075
- packets: 322,540
- memory_used: 15,504,576
- requeues: 273

---

Call graph of `__dev_queue_xmit` after the patch (CPU time percentages):

53.37%  __dev_queue_xmit
   21.02%  __qdisc_run
     13.79%  sch_direct_xmit
       12.01%  _raw_spin_lock
         11.30%  do_raw_spin_lock
           11.06%  __pv_queued_spin_lock_slowpath
     0.73%  _raw_spin_unlock
       0.58%  lock_release
     0.69%  dev_hard_start_xmit
     6.91%  cake_dequeue
       1.82%  sk_skb_reason_drop
         1.10%  skb_release_data
         0.65%  kfree_skbmem
           0.61%  kmem_cache_free
       1.64%  get_random_u32
       0.97%  ktime_get
         0.86%  seqcount_lockdep_reader_access.constprop.0
       0.91%  cake_dequeue_one
   16.49%  _raw_spin_lock
     15.71%  do_raw_spin_lock
       15.54%  __pv_queued_spin_lock_slowpath
   10.00%  dev_qdisc_enqueue
     9.94%  cake_enqueue
       4.90%  cake_hash
       2.85%  __skb_flow_dissect
         1.08%  lock_acquire
         0.65%  lock_release
       1.17%  __siphash_unaligned
       2.20%  ktime_get
         1.94%  seqcount_lockdep_reader_access.constprop.0
       0.69%  cake_get_flow_quantum / get_random_u16
   1.99%  netdev_core_pick_tx
     1.79%  i40e_lan_select_queue
     1.62%  netdev_pick_tx
       0.78%  lock_acquire
       0.52%  lock_release
     0.82%  lock_acquire
   0.76%  kfree_skb_list_reason
     0.52%  skb_release_data
   1.02%  lock_acquire
     0.63%  lock_release

---

The `_raw_spin_lock` portion under `__qdisc_run -> sch_direct_xmit` is roughly twice as high after the patch as before (from 5.68% to 12.01%).
It feels like once sch_cake starts dropping packets (due to overlimit and cobalt drops), the throughput collapses. Could it be that the overlimit
is reached "faster" when more CPUs are trying to enqueue packets, so cake's queue limit is hit sooner because of the "batch" enqueue behavior,
which then leads to cake starting to drop packets?


Jonas


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 net-next 5/5] net: dev_queue_xmit() llist adoption
  2025-11-09 19:18               ` Jonas Köppeler
@ 2025-11-09 19:28                 ` Eric Dumazet
  2025-11-09 20:18                   ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2025-11-09 19:28 UTC (permalink / raw)
  To: Jonas Köppeler
  Cc: Toke Høiland-Jørgensen, David S . Miller,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Kuniyuki Iwashima, Willem de Bruijn,
	netdev, eric.dumazet

On Sun, Nov 9, 2025 at 11:18 AM Jonas Köppeler <j.koeppeler@tu-berlin.de> wrote:
>
> On 11/9/25 5:33 PM, Toke Høiland-Jørgensen wrote:
> > Not sure why there's this difference between your setup or mine; some
> > .config or hardware difference related to the use of atomics? Any other
> > ideas?
>
> Hi Eric, hi Toke,
>
> I observed a similar behavior where CAKE's throughput collapses after the patch.
>
> Test setup:
> - 4 queues CAKE root qdisc

Please send

tc -s -d qd sh


> - 64-byte packets at ~21 Mpps
> - Intel Xeon Gold 6209U + 25GbE Intel XXV710 NIC
> - DuT forwards incoming traffic back to traffic generator through cake
>
> Throughput over 10 seconds before/after patch:
>
> Before patch:
> 0.475   mpps
> 0.481   mpps
> 0.477   mpps
> 0.478   mpps
> 0.478   mpps
> 0.477   mpps
> 0.479   mpps
> 0.481   mpps
> 0.481   mpps
>
> After patch:
> 0.265  mpps
> 0.035  mpps
> 0.003  mpps
> 0.002  mpps
> 0.001  mpps
> 0.002  mpps
> 0.002  mpps
> 0.002  mpps
> 0.002  mpps
>
> ---
>
>
>  From the qdisc I also see a large number of drops. Running:
>
>      perf record -a -e skb:kfree_skb
>
> shows `QDISC_OVERLIMIT` and `CAKE_FLOOD` as the drop reasons.


Cake drops packets from dequeue() while the qdisc spinlock is held,
unfortunately.

So it is quite possible that feeding more packets to the qdisc than before
enters a mode where dequeue() has to drop more packets and slow down
the whole thing.

Presumably cake enqueue() should 'drop' the packet when the queue is
under high pressure,
because enqueue() can drop the packet without holding the qdisc spinlock.


>
> `tc` statistics before/after the patch:
>
> Before patch:
> - drops: 32
> - packets: 4,786,109
> - memory_used: 8,916,480
> - requeues: 254
>
> After patch:
> - drops: 13,601,075
> - packets: 322,540
> - memory_used: 15,504,576
> - requeues: 273
>
> ---
>
> Call graph of `__dev_queue_xmit` after the patch (CPU time percentages):
>
> 53.37%  __dev_queue_xmit
>    21.02%  __qdisc_run
>      13.79%  sch_direct_xmit
>        12.01%  _raw_spin_lock
>          11.30%  do_raw_spin_lock
>            11.06%  __pv_queued_spin_lock_slowpath
>      0.73%  _raw_spin_unlock
>        0.58%  lock_release
>      0.69%  dev_hard_start_xmit
>      6.91%  cake_dequeue
>        1.82%  sk_skb_reason_drop
>          1.10%  skb_release_data
>          0.65%  kfree_skbmem
>            0.61%  kmem_cache_free
>        1.64%  get_random_u32
>        0.97%  ktime_get
>          0.86%  seqcount_lockdep_reader_access.constprop.0
>        0.91%  cake_dequeue_one
>    16.49%  _raw_spin_lock
>      15.71%  do_raw_spin_lock
>        15.54%  __pv_queued_spin_lock_slowpath
>    10.00%  dev_qdisc_enqueue
>      9.94%  cake_enqueue
>        4.90%  cake_hash
>        2.85%  __skb_flow_dissect
>          1.08%  lock_acquire
>          0.65%  lock_release
>        1.17%  __siphash_unaligned
>        2.20%  ktime_get
>          1.94%  seqcount_lockdep_reader_access.constprop.0
>        0.69%  cake_get_flow_quantum / get_random_u16
>    1.99%  netdev_core_pick_tx
>      1.79%  i40e_lan_select_queue
>      1.62%  netdev_pick_tx
>        0.78%  lock_acquire
>        0.52%  lock_release
>      0.82%  lock_acquire
>    0.76%  kfree_skb_list_reason
>      0.52%  skb_release_data
>    1.02%  lock_acquire
>      0.63%  lock_release
>
> ---
>
> The `_raw_spin_lock` portion under `__qdisc_run -> sch_direct_xmit` is slightly higher after the patch compared to before (from 5.68% to 12.01%).
> It feels like once sch_cake starts dropping packets it (due to overlimit and cobalt-drops) the throughput collapses. Could it be that the overlimit
> is reached "faster" when there are more CPUs trying to enqueue packets, thus reaching cake's queue limit due to the "batch" enqueue behavior,
> which then leads to cake starting to drop packets?
>

Yes, probably.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 net-next 5/5] net: dev_queue_xmit() llist adoption
  2025-11-09 19:28                 ` Eric Dumazet
@ 2025-11-09 20:18                   ` Eric Dumazet
  2025-11-09 20:29                     ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2025-11-09 20:18 UTC (permalink / raw)
  To: Jonas Köppeler
  Cc: Toke Høiland-Jørgensen, David S . Miller,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Kuniyuki Iwashima, Willem de Bruijn,
	netdev, eric.dumazet

On Sun, Nov 9, 2025 at 11:28 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Sun, Nov 9, 2025 at 11:18 AM Jonas Köppeler <j.koeppeler@tu-berlin.de> wrote:
> >
> > On 11/9/25 5:33 PM, Toke Høiland-Jørgensen wrote:
> > > Not sure why there's this difference between your setup or mine; some
> > > .config or hardware difference related to the use of atomics? Any other
> > > ideas?
> >
> > Hi Eric, hi Toke,
> >
> > I observed a similar behavior where CAKE's throughput collapses after the patch.
> >
> > Test setup:
> > - 4 queues CAKE root qdisc
>
> Please send
>
> tc -s -d qd sh
>
>
> > - 64-byte packets at ~21 Mpps
> > - Intel Xeon Gold 6209U + 25GbE Intel XXV710 NIC
> > - DuT forwards incoming traffic back to traffic generator through cake
> >
> > Throughput over 10 seconds before/after patch:
> >
> > Before patch:
> > 0.475   mpps
> > 0.481   mpps
> > 0.477   mpps
> > 0.478   mpps
> > 0.478   mpps
> > 0.477   mpps
> > 0.479   mpps
> > 0.481   mpps
> > 0.481   mpps
> >
> > After patch:
> > 0.265  mpps
> > 0.035  mpps
> > 0.003  mpps
> > 0.002  mpps
> > 0.001  mpps
> > 0.002  mpps
> > 0.002  mpps
> > 0.002  mpps
> > 0.002  mpps
> >
> > ---
> >
> >
> >  From the qdisc I also see a large number of drops. Running:
> >
> >      perf record -a -e skb:kfree_skb
> >
> > shows `QDISC_OVERLIMIT` and `CAKE_FLOOD` as the drop reasons.
>
>
> Cake drops packets from dequeue() while the qdisc spinlock is held,
> unfortunately.
>
> So it is quite possible that feeding more packets to the qdisc than before
> enters a mode where dequeue() has to drop more packets and slow down
> the whole thing.
>
> Presumably cake enqueue() should 'drop' the packet when the queue is
> under high pressure,
> because enqueue() can drop the packet without holding the qdisc spinlock.
>
>
> >
> > `tc` statistics before/after the patch:
> >
> > Before patch:
> > - drops: 32
> > - packets: 4,786,109
> > - memory_used: 8,916,480
> > - requeues: 254
> >
> > After patch:
> > - drops: 13,601,075
> > - packets: 322,540
> > - memory_used: 15,504,576
> > - requeues: 273
> >
> > ---
> >
> > Call graph of `__dev_queue_xmit` after the patch (CPU time percentages):
> >
> > 53.37%  __dev_queue_xmit
> >    21.02%  __qdisc_run
> >      13.79%  sch_direct_xmit
> >        12.01%  _raw_spin_lock
> >          11.30%  do_raw_spin_lock
> >            11.06%  __pv_queued_spin_lock_slowpath
> >      0.73%  _raw_spin_unlock
> >        0.58%  lock_release
> >      0.69%  dev_hard_start_xmit
> >      6.91%  cake_dequeue
> >        1.82%  sk_skb_reason_drop
> >          1.10%  skb_release_data
> >          0.65%  kfree_skbmem
> >            0.61%  kmem_cache_free
> >        1.64%  get_random_u32
> >        0.97%  ktime_get
> >          0.86%  seqcount_lockdep_reader_access.constprop.0
> >        0.91%  cake_dequeue_one
> >    16.49%  _raw_spin_lock
> >      15.71%  do_raw_spin_lock
> >        15.54%  __pv_queued_spin_lock_slowpath
> >    10.00%  dev_qdisc_enqueue
> >      9.94%  cake_enqueue
> >        4.90%  cake_hash
> >        2.85%  __skb_flow_dissect
> >          1.08%  lock_acquire
> >          0.65%  lock_release
> >        1.17%  __siphash_unaligned
> >        2.20%  ktime_get
> >          1.94%  seqcount_lockdep_reader_access.constprop.0
> >        0.69%  cake_get_flow_quantum / get_random_u16
> >    1.99%  netdev_core_pick_tx
> >      1.79%  i40e_lan_select_queue
> >      1.62%  netdev_pick_tx
> >        0.78%  lock_acquire
> >        0.52%  lock_release
> >      0.82%  lock_acquire
> >    0.76%  kfree_skb_list_reason
> >      0.52%  skb_release_data
> >    1.02%  lock_acquire
> >      0.63%  lock_release
> >
> > ---
> >
> > The `_raw_spin_lock` portion under `__qdisc_run -> sch_direct_xmit` is slightly higher after the patch compared to before (from 5.68% to 12.01%).
> > It feels like once sch_cake starts dropping packets it (due to overlimit and cobalt-drops) the throughput collapses. Could it be that the overlimit
> > is reached "faster" when there are more CPUs trying to enqueue packets, thus reaching cake's queue limit due to the "batch" enqueue behavior,
> > which then leads to cake starting to drop packets?
> >
>
> Yes, probably.

I think the issue is really about TCQ_F_ONETXQUEUE :


Perhaps we should not accept q->limit packets in the ll_list, but a
much smaller limit.

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 net-next 5/5] net: dev_queue_xmit() llist adoption
  2025-11-09 20:18                   ` Eric Dumazet
@ 2025-11-09 20:29                     ` Eric Dumazet
  2025-11-10 11:31                       ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2025-11-09 20:29 UTC (permalink / raw)
  To: Jonas Köppeler
  Cc: Toke Høiland-Jørgensen, David S . Miller,
	Jakub Kicinski, Paolo Abeni, Simon Horman, Jamal Hadi Salim,
	Cong Wang, Jiri Pirko, Kuniyuki Iwashima, Willem de Bruijn,
	netdev, eric.dumazet

On Sun, Nov 9, 2025 at 12:18 PM Eric Dumazet <edumazet@google.com> wrote:
>

> I think the issue is really about TCQ_F_ONETXQUEUE :

dequeue_skb() can only dequeue 8 packets at a time, then has to
release the qdisc spinlock.

>
>
> Perhaps we should not accept q->limit packets in the ll_list, but a
> much smaller limit.

I will test something like this

diff --git a/net/core/dev.c b/net/core/dev.c
index 69515edd17bc6a157046f31b3dd343a59ae192ab..e4187e2ca6324781216c073de2ec20626119327a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4185,8 +4185,12 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
        first_n = READ_ONCE(q->defer_list.first);
        do {
                if (first_n && !defer_count) {
+                       unsigned long total;
+
                        defer_count = atomic_long_inc_return(&q->defer_count);
-                       if (unlikely(defer_count > q->limit)) {
+                       total = defer_count + READ_ONCE(q->q.qlen);
+
+                       if (unlikely(defer_count > 256 || total > READ_ONCE(q->limit))) {
                                kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
                                return NET_XMIT_DROP;
                        }

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 net-next 5/5] net: dev_queue_xmit() llist adoption
  2025-11-09 20:29                     ` Eric Dumazet
@ 2025-11-10 11:31                       ` Toke Høiland-Jørgensen
  2025-11-10 13:26                         ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-11-10 11:31 UTC (permalink / raw)
  To: Eric Dumazet, Jonas Köppeler
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
	Willem de Bruijn, netdev, eric.dumazet

Eric Dumazet <edumazet@google.com> writes:

> On Sun, Nov 9, 2025 at 12:18 PM Eric Dumazet <edumazet@google.com> wrote:
>>
>
>> I think the issue is really about TCQ_F_ONETXQUEUE :
>
> dequeue_skb() can only dequeue 8 packets at a time, then has to
> release the qdisc spinlock.

So after looking at this a bit more, I think I understand more or less
what's going on in the interaction between cake and your llist patch:

Basically, the llist patch moves the bottleneck from qdisc enqueue to
qdisc dequeue (in this setup that we're testing where the actual link
speed is not itself a bottleneck). Before, enqueue contends with dequeue
on the qdisc lock, meaning dequeue has no trouble keeping up, and the
qdisc never fills up.

With the llist patch, suddenly we're enqueueing a whole batch of packets
every time we take the lock, which means that dequeue can no longer keep
up, making it the bottleneck.

The complete collapse in throughput comes from the way cake deals with
unresponsive flows once the qdisc fills up: the BLUE part of its AQM
will drive up its drop probability to 1, where it will stay until the
flow responds (which, in this case, it never does).

Turning off the BLUE algorithm prevents the throughput collapse; there's
still a delta compared to a stock 6.17 kernel, which I think is because
cake is simply quite inefficient at dropping packets in an overload
situation. I'll experiment with a variant of the bulk dropping you
introduced to fq_codel and see if that helps. We should probably also
cap the drop probability of BLUE to something lower than 1.
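
To illustrate the ratcheting described above, here is a rough sketch of
a BLUE-style probability update (illustrative only, not the actual
cobalt code in sch_cake; the struct and function names are made up):

/* Rough sketch of a BLUE-style AQM update (illustrative only): p_drop
 * only rises while the buffer keeps overflowing and only decays once
 * the queue has been seen completely empty, so an unresponsive flood
 * ends up pinned at the maximum.
 */
struct sketch_blue_vars {
	u32 p_drop;		/* drop probability, 0 .. U32_MAX */
};

static void sketch_blue_queue_full(struct sketch_blue_vars *v, u32 p_inc)
{
	/* every overflow ratchets the probability up, saturating at 1 */
	v->p_drop = (U32_MAX - v->p_drop < p_inc) ? U32_MAX
						  : v->p_drop + p_inc;
}

static void sketch_blue_queue_empty(struct sketch_blue_vars *v, u32 p_dec)
{
	/* only a completely drained queue lets the probability decay */
	v->p_drop = (v->p_drop > p_dec) ? v->p_drop - p_dec : 0;
}

static bool sketch_blue_should_drop(const struct sketch_blue_vars *v)
{
	return v->p_drop && get_random_u32() < v->p_drop;
}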

The patch you sent (below) does not in itself help anything, but
lowering the constant to to 8 instead of 256 does help. I'm not sure
we want something that low, though; probably better to fix the behaviour
of cake, no?

-Toke

>> Perhaps we should not accept q->limit packets in the ll_list, but a
>> much smaller limit.
>
> I will test something like this
>
> diff --git a/net/core/dev.c b/net/core/dev.c
> index 69515edd17bc6a157046f31b3dd343a59ae192ab..e4187e2ca6324781216c073de2ec20626119327a
> 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4185,8 +4185,12 @@ static inline int __dev_xmit_skb(struct sk_buff
> *skb, struct Qdisc *q,
>         first_n = READ_ONCE(q->defer_list.first);
>         do {
>                 if (first_n && !defer_count) {
> +                       unsigned long total;
> +
>                         defer_count = atomic_long_inc_return(&q->defer_count);
> -                       if (unlikely(defer_count > q->limit)) {
> +                       total = defer_count + READ_ONCE(q->q.qlen);
> +
> +                       if (unlikely(defer_count > 256 || total >
> READ_ONCE(q->limit))) {
>                                 kfree_skb_reason(skb,
> SKB_DROP_REASON_QDISC_DROP);
>                                 return NET_XMIT_DROP;
>                         }


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 net-next 5/5] net: dev_queue_xmit() llist adoption
  2025-11-10 11:31                       ` Toke Høiland-Jørgensen
@ 2025-11-10 13:26                         ` Eric Dumazet
  2025-11-10 14:49                           ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2025-11-10 13:26 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Jonas Köppeler, David S . Miller, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet

On Mon, Nov 10, 2025 at 3:31 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Eric Dumazet <edumazet@google.com> writes:
>
> > On Sun, Nov 9, 2025 at 12:18 PM Eric Dumazet <edumazet@google.com> wrote:
> >>
> >
> >> I think the issue is really about TCQ_F_ONETXQUEUE :
> >
> > dequeue_skb() can only dequeue 8 packets at a time, then has to
> > release the qdisc spinlock.
>
> So after looking at this a bit more, I think I understand more or less
> what's going on in the interaction between cake and your llist patch:
>
> Basically, the llist patch moves the bottleneck from qdisc enqueue to
> qdisc dequeue (in this setup that we're testing where the actual link
> speed is not itself a bottleneck). Before, enqueue contends with dequeue
> on the qdisc lock, meaning dequeue has no trouble keeping up, and the
> qdisc never fills up.
>
> With the llist patch, suddenly we're enqueueing a whole batch of packets
> every time we take the lock, which means that dequeue can no longer keep
> up, making it the bottleneck.
>
> The complete collapse in throughput comes from the way cake deals with
> unresponsive flows once the qdisc fills up: the BLUE part of its AQM
> will drive up its drop probability to 1, where it will stay until the
> flow responds (which, in this case, it never does).
>
> Turning off the BLUE algorithm prevents the throughput collapse; there's
> still a delta compared to a stock 6.17 kernel, which I think is because
> cake is simply quite inefficient at dropping packets in an overload
> situation. I'll experiment with a variant of the bulk dropping you
> introduced to fq_codel and see if that helps. We should probably also
> cap the drop probability of BLUE to something lower than 1.
>
> The patch you sent (below) does not in itself help anything, but
> lowering the constant to 8 instead of 256 does help. I'm not sure
> we want something that low, though; probably better to fix the behaviour
> of cake, no?

Presumably codel has a similar issue ?

We can add to dequeue() a mechanism to queue skbs that need to be dropped
after the spinlock and running bit are released.

We did something similar in 2016 for the enqueue part [1]

In 2025 this might be a bit more challenging because of eBPF qdisc.

Instead of adding a new parameter, perhaps add in 'struct Qdisc' a
*tofree pointer.

I can work on a patch today.

[1]
commit 520ac30f45519b0a82dd92117c181d1d6144677b
Author: Eric Dumazet <edumazet@google.com>
Date:   Tue Jun 21 23:16:49 2016 -0700

    net_sched: drop packets after root qdisc lock is released

    Qdisc performance suffers when packets are dropped at enqueue()
    time because drops (kfree_skb()) are done while qdisc lock is held,
    delaying a dequeue() draining the queue.

    Nominal throughput can be reduced by 50 % when this happens,
    at a time we would like the dequeue() to proceed as fast as possible.

    Even FQ is vulnerable to this problem, while one of FQ goals was
    to provide some flow isolation.

    This patch adds a 'struct sk_buff **to_free' parameter to all
    qdisc->enqueue(), and in qdisc_drop() helper.

    I measured a performance increase of up to 12 %, but this patch
    is a prereq so that future batches in enqueue() can fly.

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Acked-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>




>
> -Toke
>
> >> Perhaps we should not accept q->limit packets in the ll_list, but a
> >> much smaller limit.
> >
> > I will test something like this
> >
> > diff --git a/net/core/dev.c b/net/core/dev.c
> > index 69515edd17bc6a157046f31b3dd343a59ae192ab..e4187e2ca6324781216c073de2ec20626119327a
> > 100644
> > --- a/net/core/dev.c
> > +++ b/net/core/dev.c
> > @@ -4185,8 +4185,12 @@ static inline int __dev_xmit_skb(struct sk_buff
> > *skb, struct Qdisc *q,
> >         first_n = READ_ONCE(q->defer_list.first);
> >         do {
> >                 if (first_n && !defer_count) {
> > +                       unsigned long total;
> > +
> >                         defer_count = atomic_long_inc_return(&q->defer_count);
> > -                       if (unlikely(defer_count > q->limit)) {
> > +                       total = defer_count + READ_ONCE(q->q.qlen);
> > +
> > +                       if (unlikely(defer_count > 256 || total >
> > READ_ONCE(q->limit))) {
> >                                 kfree_skb_reason(skb,
> > SKB_DROP_REASON_QDISC_DROP);
> >                                 return NET_XMIT_DROP;
> >                         }
>

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 net-next 5/5] net: dev_queue_xmit() llist adoption
  2025-11-10 13:26                         ` Eric Dumazet
@ 2025-11-10 14:49                           ` Toke Høiland-Jørgensen
  2025-11-10 17:34                             ` Eric Dumazet
  0 siblings, 1 reply; 24+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-11-10 14:49 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: Jonas Köppeler, David S . Miller, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet

Eric Dumazet <edumazet@google.com> writes:

> On Mon, Nov 10, 2025 at 3:31 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>
>> Eric Dumazet <edumazet@google.com> writes:
>>
>> > On Sun, Nov 9, 2025 at 12:18 PM Eric Dumazet <edumazet@google.com> wrote:
>> >>
>> >
>> >> I think the issue is really about TCQ_F_ONETXQUEUE :
>> >
>> > dequeue_skb() can only dequeue 8 packets at a time, then has to
>> > release the qdisc spinlock.
>>
>> So after looking at this a bit more, I think I understand more or less
>> what's going on in the interaction between cake and your llist patch:
>>
>> Basically, the llist patch moves the bottleneck from qdisc enqueue to
>> qdisc dequeue (in this setup that we're testing where the actual link
>> speed is not itself a bottleneck). Before, enqueue contends with dequeue
>> on the qdisc lock, meaning dequeue has no trouble keeping up, and the
>> qdisc never fills up.
>>
>> With the llist patch, suddenly we're enqueueing a whole batch of packets
>> every time we take the lock, which means that dequeue can no longer keep
>> up, making it the bottleneck.
>>
>> The complete collapse in throughput comes from the way cake deals with
>> unresponsive flows once the qdisc fills up: the BLUE part of its AQM
>> will drive up its drop probability to 1, where it will stay until the
>> flow responds (which, in this case, it never does).
>>
>> Turning off the BLUE algorithm prevents the throughput collapse; there's
>> still a delta compared to a stock 6.17 kernel, which I think is because
>> cake is simply quite inefficient at dropping packets in an overload
>> situation. I'll experiment with a variant of the bulk dropping you
>> introduced to fq_codel and see if that helps. We should probably also
>> cap the drop probability of BLUE to something lower than 1.
>>
>> The patch you sent (below) does not in itself help anything, but
>> lowering the constant to to 8 instead of 256 does help. I'm not sure
>> we want something that low, though; probably better to fix the behaviour
>> of cake, no?
>
> Presumably codel has a similar issue ?

Not directly, because codel is sojourn-time based, which means it
triggers only when packets stay in the queue for an extended period of
time; so as long as there's some progress being made, codel will get out
of its drop state (or not get into it in the first place). BLUE, on the
other hand, is based solely on the fact that the queue is overflowing,
and it doesn't back off until the queue is completely empty.

BLUE was added as a mechanism to aggressively punish unresponsive flows;
I guess it's succeeding in this case? :P
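
To make the contrast concrete, here is a minimal sketch of the
sojourn-time side (illustrative only, not the actual codel/cobalt code;
the struct and function names are made up, though the real qdiscs do
stash the enqueue time in skb->cb):

/* Minimal sketch of a sojourn-time trigger (illustrative only): the
 * decision keys off how long the head packet waited, not off how full
 * the queue is, so it relaxes as soon as packets drain quickly again,
 * even if the queue never becomes completely empty.
 */
struct sketch_skb_cb {
	u64 enqueue_time_ns;	/* stamped when the packet was enqueued */
};

static bool sketch_sojourn_over_target(const struct sketch_skb_cb *cb,
				       u64 now_ns, u64 target_ns)
{
	return now_ns - cb->enqueue_time_ns > target_ns;
}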

> We can add to dequeue() a mechanism to queue skbs that need to be dropped
> after the spinlock and running bit are released.
>
> We did something similar in 2016 for the enqueue part [1]
>
> In 2025 this might be a bit more challenging because of eBPF qdisc.
>
> Instead of adding a new parameter, perhaps add in 'struct Qdisc' a
> *tofree pointer.
>
> I can work on a patch today.

This sounds like an excellent idea in any case - thanks! :)

-Toke


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 net-next 5/5] net: dev_queue_xmit() llist adoption
  2025-11-10 14:49                           ` Toke Høiland-Jørgensen
@ 2025-11-10 17:34                             ` Eric Dumazet
  2025-11-11 13:44                               ` Jonas Köppeler
  0 siblings, 1 reply; 24+ messages in thread
From: Eric Dumazet @ 2025-11-10 17:34 UTC (permalink / raw)
  To: Toke Høiland-Jørgensen
  Cc: Jonas Köppeler, David S . Miller, Jakub Kicinski,
	Paolo Abeni, Simon Horman, Jamal Hadi Salim, Cong Wang,
	Jiri Pirko, Kuniyuki Iwashima, Willem de Bruijn, netdev,
	eric.dumazet

On Mon, Nov 10, 2025 at 6:49 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>
> Eric Dumazet <edumazet@google.com> writes:
>
> > I can work on a patch today.
>
> This sounds like an excellent idea in any case - thanks! :)

The following (on top of my last series) seems to work for me

 include/net/pkt_sched.h   |    5 +++--
 include/net/sch_generic.h |   24 +++++++++++++++++++++++-
 net/core/dev.c            |   33 +++++++++++++++++++--------------
 net/sched/sch_cake.c      |    4 +++-
 4 files changed, 48 insertions(+), 18 deletions(-)

diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
index 4678db45832a1e3bf7b8a07756fb89ab868bd5d2..e703c507d0daa97ae7c3bf131e322b1eafcc5664 100644
--- a/include/net/pkt_sched.h
+++ b/include/net/pkt_sched.h
@@ -114,12 +114,13 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,

 void __qdisc_run(struct Qdisc *q);

-static inline void qdisc_run(struct Qdisc *q)
+static inline struct sk_buff *qdisc_run(struct Qdisc *q)
 {
        if (qdisc_run_begin(q)) {
                __qdisc_run(q);
-               qdisc_run_end(q);
+               return qdisc_run_end(q);
        }
+       return NULL;
 }

 extern const struct nla_policy rtm_tca_policy[TCA_MAX + 1];
diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
index 79501499dafba56271b9ebd97a8f379ffdc83cac..19cd2bc13bdba48f941b1599f896c15c8c7860ae 100644
--- a/include/net/sch_generic.h
+++ b/include/net/sch_generic.h
@@ -88,6 +88,8 @@ struct Qdisc {
 #define TCQ_F_INVISIBLE                0x80 /* invisible by default in dump */
 #define TCQ_F_NOLOCK           0x100 /* qdisc does not require locking */
 #define TCQ_F_OFFLOADED                0x200 /* qdisc is offloaded to HW */
+#define TCQ_F_DEQUEUE_DROPS    0x400 /* ->dequeue() can drop packets in q->to_free */
+
        u32                     limit;
        const struct Qdisc_ops  *ops;
        struct qdisc_size_table __rcu *stab;
@@ -119,6 +121,8 @@ struct Qdisc {

                /* Note : we only change qstats.backlog in fast path. */
                struct gnet_stats_queue qstats;
+
+               struct sk_buff          *to_free;
        __cacheline_group_end(Qdisc_write);


@@ -218,8 +222,15 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
        return true;
 }

-static inline void qdisc_run_end(struct Qdisc *qdisc)
+static inline struct sk_buff *qdisc_run_end(struct Qdisc *qdisc)
 {
+       struct sk_buff *to_free = NULL;
+
+       if (qdisc->flags & TCQ_F_DEQUEUE_DROPS) {
+               to_free = qdisc->to_free;
+               if (to_free)
+                       qdisc->to_free = NULL;
+       }
        if (qdisc->flags & TCQ_F_NOLOCK) {
                spin_unlock(&qdisc->seqlock);

@@ -235,6 +246,7 @@ static inline void qdisc_run_end(struct Qdisc *qdisc)
        } else {
                WRITE_ONCE(qdisc->running, false);
        }
+       return to_free;
 }

 static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
@@ -1105,6 +1117,16 @@ static inline void tcf_set_drop_reason(const struct sk_buff *skb,
        tc_skb_cb(skb)->drop_reason = reason;
 }

+static inline void qdisc_dequeue_drop(struct Qdisc *q, struct sk_buff *skb,
+                                     enum skb_drop_reason reason)
+{
+       DEBUG_NET_WARN_ON_ONCE(!(q->flags & TCQ_F_DEQUEUE_DROPS));
+
+       tcf_set_drop_reason(skb, reason);
+       skb->next = q->to_free;
+       q->to_free = skb;
+}
+
 /* Instead of calling kfree_skb() while root qdisc lock is held,
  * queue the skb for future freeing at end of __dev_xmit_skb()
  */
diff --git a/net/core/dev.c b/net/core/dev.c
index ac994974e2a81889fcc0a2e664edcdb7cfd0496d..18cfcd765b1b3e4af1c5339e36df517e7abc914f 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4141,7 +4141,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
                                 struct net_device *dev,
                                 struct netdev_queue *txq)
 {
-       struct sk_buff *next, *to_free = NULL;
+       struct sk_buff *next, *to_free = NULL, *to_free2 = NULL;
        spinlock_t *root_lock = qdisc_lock(q);
        struct llist_node *ll_list, *first_n;
        unsigned long defer_count = 0;
@@ -4160,9 +4160,9 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
                        if (unlikely(!nolock_qdisc_is_empty(q))) {
                                rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
                                __qdisc_run(q);
-                               qdisc_run_end(q);
+                               to_free2 = qdisc_run_end(q);

-                               goto no_lock_out;
+                               goto free_out;
                        }

                        qdisc_bstats_cpu_update(q, skb);
@@ -4170,18 +4170,15 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
                            !nolock_qdisc_is_empty(q))
                                __qdisc_run(q);

-                       qdisc_run_end(q);
-                       return NET_XMIT_SUCCESS;
+                       to_free2 = qdisc_run_end(q);
+                       rc = NET_XMIT_SUCCESS;
+                       goto free_out;
                }

                rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
-               qdisc_run(q);
+               to_free2 = qdisc_run(q);

-no_lock_out:
-               if (unlikely(to_free))
-                       kfree_skb_list_reason(to_free,
-                                             tcf_get_drop_reason(to_free));
-               return rc;
+               goto free_out;
        }

        /* Open code llist_add(&skb->ll_node, &q->defer_list) + queue limit.
@@ -4239,7 +4236,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
                qdisc_bstats_update(q, skb);
                if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
                        __qdisc_run(q);
-               qdisc_run_end(q);
+               to_free2 = qdisc_run_end(q);
                rc = NET_XMIT_SUCCESS;
        } else {
                int count = 0;
@@ -4251,15 +4248,19 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
                        rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
                        count++;
                }
-               qdisc_run(q);
+               to_free2 = qdisc_run(q);
                if (count != 1)
                        rc = NET_XMIT_SUCCESS;
        }
 unlock:
        spin_unlock(root_lock);
+free_out:
        if (unlikely(to_free))
                kfree_skb_list_reason(to_free,
                                      tcf_get_drop_reason(to_free));
+       if (unlikely(to_free2))
+               kfree_skb_list_reason(to_free2,
+                                     tcf_get_drop_reason(to_free2));
        return rc;
 }

@@ -5741,6 +5742,7 @@ static __latent_entropy void net_tx_action(void)
        }

        if (sd->output_queue) {
+               struct sk_buff *to_free;
                struct Qdisc *head;

                local_irq_disable();
@@ -5780,9 +5782,12 @@ static __latent_entropy void net_tx_action(void)
                        }

                        clear_bit(__QDISC_STATE_SCHED, &q->state);
-                       qdisc_run(q);
+                       to_free = qdisc_run(q);
                        if (root_lock)
                                spin_unlock(root_lock);
+                       if (unlikely(to_free))
+                               kfree_skb_list_reason(to_free,
+                                             tcf_get_drop_reason(to_free));
                }

                rcu_read_unlock();
diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
index 312f5b000ffb67d74faf70f26d808e26315b4ab8..a717cc4e0606e80123ec9c76331d454dad699b69 100644
--- a/net/sched/sch_cake.c
+++ b/net/sched/sch_cake.c
@@ -2183,7 +2183,7 @@ static struct sk_buff *cake_dequeue(struct Qdisc *sch)
                b->tin_dropped++;
                qdisc_tree_reduce_backlog(sch, 1, qdisc_pkt_len(skb));
                qdisc_qstats_drop(sch);
-               kfree_skb_reason(skb, reason);
+               qdisc_dequeue_drop(sch, skb, reason);
                if (q->rate_flags & CAKE_FLAG_INGRESS)
                        goto retry;
        }
@@ -2569,6 +2569,8 @@ static void cake_reconfigure(struct Qdisc *sch)

        sch->flags &= ~TCQ_F_CAN_BYPASS;

+       sch->flags |= TCQ_F_DEQUEUE_DROPS;
+
        q->buffer_limit = min(q->buffer_limit,
                              max(sch->limit * psched_mtu(qdisc_dev(sch)),
                                  q->buffer_config_limit));

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 net-next 5/5] net: dev_queue_xmit() llist adoption
  2025-11-10 17:34                             ` Eric Dumazet
@ 2025-11-11 13:44                               ` Jonas Köppeler
  2025-11-11 16:42                                 ` Toke Høiland-Jørgensen
  0 siblings, 1 reply; 24+ messages in thread
From: Jonas Köppeler @ 2025-11-11 13:44 UTC (permalink / raw)
  To: Eric Dumazet, Toke Høiland-Jørgensen
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
	Willem de Bruijn, netdev, eric.dumazet


On 11/10/25 18:34, Eric Dumazet wrote:
> On Mon, Nov 10, 2025 at 6:49 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>> Eric Dumazet <edumazet@google.com> writes:
>>
>>> I can work on a patch today.
>> This sounds like an excellent idea in any case - thanks! :)
> The following (on top of my last series) seems to work for me
>
>   include/net/pkt_sched.h   |    5 +++--
>   include/net/sch_generic.h |   24 +++++++++++++++++++++++-
>   net/core/dev.c            |   33 +++++++++++++++++++--------------
>   net/sched/sch_cake.c      |    4 +++-
>   4 files changed, 48 insertions(+), 18 deletions(-)
>
> diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
> index 4678db45832a1e3bf7b8a07756fb89ab868bd5d2..e703c507d0daa97ae7c3bf131e322b1eafcc5664
> 100644
> --- a/include/net/pkt_sched.h
> +++ b/include/net/pkt_sched.h
> @@ -114,12 +114,13 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
>
>   void __qdisc_run(struct Qdisc *q);
>
> -static inline void qdisc_run(struct Qdisc *q)
> +static inline struct sk_buff *qdisc_run(struct Qdisc *q)
>   {
>          if (qdisc_run_begin(q)) {
>                  __qdisc_run(q);
> -               qdisc_run_end(q);
> +               return qdisc_run_end(q);
>          }
> +       return NULL;
>   }
>
>   extern const struct nla_policy rtm_tca_policy[TCA_MAX + 1];
> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
> index 79501499dafba56271b9ebd97a8f379ffdc83cac..19cd2bc13bdba48f941b1599f896c15c8c7860ae
> 100644
> --- a/include/net/sch_generic.h
> +++ b/include/net/sch_generic.h
> @@ -88,6 +88,8 @@ struct Qdisc {
>   #define TCQ_F_INVISIBLE                0x80 /* invisible by default in dump */
>   #define TCQ_F_NOLOCK           0x100 /* qdisc does not require locking */
>   #define TCQ_F_OFFLOADED                0x200 /* qdisc is offloaded to HW */
> +#define TCQ_F_DEQUEUE_DROPS    0x400 /* ->dequeue() can drop packets
> in q->to_free */
> +
>          u32                     limit;
>          const struct Qdisc_ops  *ops;
>          struct qdisc_size_table __rcu *stab;
> @@ -119,6 +121,8 @@ struct Qdisc {
>
>                  /* Note : we only change qstats.backlog in fast path. */
>                  struct gnet_stats_queue qstats;
> +
> +               struct sk_buff          *to_free;
>          __cacheline_group_end(Qdisc_write);
>
>
> @@ -218,8 +222,15 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>          return true;
>   }
>
> -static inline void qdisc_run_end(struct Qdisc *qdisc)
> +static inline struct sk_buff *qdisc_run_end(struct Qdisc *qdisc)
>   {
> +       struct sk_buff *to_free = NULL;
> +
> +       if (qdisc->flags & TCQ_F_DEQUEUE_DROPS) {
> +               to_free = qdisc->to_free;
> +               if (to_free)
> +                       qdisc->to_free = NULL;
> +       }
>          if (qdisc->flags & TCQ_F_NOLOCK) {
>                  spin_unlock(&qdisc->seqlock);
>
> @@ -235,6 +246,7 @@ static inline void qdisc_run_end(struct Qdisc *qdisc)
>          } else {
>                  WRITE_ONCE(qdisc->running, false);
>          }
> +       return to_free;
>   }
>
>   static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
> @@ -1105,6 +1117,16 @@ static inline void tcf_set_drop_reason(const
> struct sk_buff *skb,
>          tc_skb_cb(skb)->drop_reason = reason;
>   }
>
> +static inline void qdisc_dequeue_drop(struct Qdisc *q, struct sk_buff *skb,
> +                                     enum skb_drop_reason reason)
> +{
> +       DEBUG_NET_WARN_ON_ONCE(!(q->flags & TCQ_F_DEQUEUE_DROPS));
> +
> +       tcf_set_drop_reason(skb, reason);
> +       skb->next = q->to_free;
> +       q->to_free = skb;
> +}
> +
>   /* Instead of calling kfree_skb() while root qdisc lock is held,
>    * queue the skb for future freeing at end of __dev_xmit_skb()
>    */
> diff --git a/net/core/dev.c b/net/core/dev.c
> index ac994974e2a81889fcc0a2e664edcdb7cfd0496d..18cfcd765b1b3e4af1c5339e36df517e7abc914f
> 100644
> --- a/net/core/dev.c
> +++ b/net/core/dev.c
> @@ -4141,7 +4141,7 @@ static inline int __dev_xmit_skb(struct sk_buff
> *skb, struct Qdisc *q,
>                                   struct net_device *dev,
>                                   struct netdev_queue *txq)
>   {
> -       struct sk_buff *next, *to_free = NULL;
> +       struct sk_buff *next, *to_free = NULL, *to_free2 = NULL;
>          spinlock_t *root_lock = qdisc_lock(q);
>          struct llist_node *ll_list, *first_n;
>          unsigned long defer_count = 0;
> @@ -4160,9 +4160,9 @@ static inline int __dev_xmit_skb(struct sk_buff
> *skb, struct Qdisc *q,
>                          if (unlikely(!nolock_qdisc_is_empty(q))) {
>                                  rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
>                                  __qdisc_run(q);
> -                               qdisc_run_end(q);
> +                               to_free2 = qdisc_run_end(q);
>
> -                               goto no_lock_out;
> +                               goto free_out;
>                          }
>
>                          qdisc_bstats_cpu_update(q, skb);
> @@ -4170,18 +4170,15 @@ static inline int __dev_xmit_skb(struct
> sk_buff *skb, struct Qdisc *q,
>                              !nolock_qdisc_is_empty(q))
>                                  __qdisc_run(q);
>
> -                       qdisc_run_end(q);
> -                       return NET_XMIT_SUCCESS;
> +                       to_free2 = qdisc_run_end(q);
> +                       rc = NET_XMIT_SUCCESS;
> +                       goto free_out;
>                  }
>
>                  rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
> -               qdisc_run(q);
> +               to_free2 = qdisc_run(q);
>
> -no_lock_out:
> -               if (unlikely(to_free))
> -                       kfree_skb_list_reason(to_free,
> -                                             tcf_get_drop_reason(to_free));
> -               return rc;
> +               goto free_out;
>          }
>
>          /* Open code llist_add(&skb->ll_node, &q->defer_list) + queue limit.
> @@ -4239,7 +4236,7 @@ static inline int __dev_xmit_skb(struct sk_buff
> *skb, struct Qdisc *q,
>                  qdisc_bstats_update(q, skb);
>                  if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
>                          __qdisc_run(q);
> -               qdisc_run_end(q);
> +               to_free2 = qdisc_run_end(q);
>                  rc = NET_XMIT_SUCCESS;
>          } else {
>                  int count = 0;
> @@ -4251,15 +4248,19 @@ static inline int __dev_xmit_skb(struct
> sk_buff *skb, struct Qdisc *q,
>                          rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
>                          count++;
>                  }
> -               qdisc_run(q);
> +               to_free2 = qdisc_run(q);
>                  if (count != 1)
>                          rc = NET_XMIT_SUCCESS;
>          }
>   unlock:
>          spin_unlock(root_lock);
> +free_out:
>          if (unlikely(to_free))
>                  kfree_skb_list_reason(to_free,
>                                        tcf_get_drop_reason(to_free));
> +       if (unlikely(to_free2))
> +               kfree_skb_list_reason(to_free2,
> +                                     tcf_get_drop_reason(to_free2));
>          return rc;
>   }
>
> @@ -5741,6 +5742,7 @@ static __latent_entropy void net_tx_action(void)
>          }
>
>          if (sd->output_queue) {
> +               struct sk_buff *to_free;
>                  struct Qdisc *head;
>
>                  local_irq_disable();
> @@ -5780,9 +5782,12 @@ static __latent_entropy void net_tx_action(void)
>                          }
>
>                          clear_bit(__QDISC_STATE_SCHED, &q->state);
> -                       qdisc_run(q);
> +                       to_free = qdisc_run(q);
>                          if (root_lock)
>                                  spin_unlock(root_lock);
> +                       if (unlikely(to_free))
> +                               kfree_skb_list_reason(to_free,
> +                                             tcf_get_drop_reason(to_free));
>                  }
>
>                  rcu_read_unlock();
> diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
> index 312f5b000ffb67d74faf70f26d808e26315b4ab8..a717cc4e0606e80123ec9c76331d454dad699b69 100644
> --- a/net/sched/sch_cake.c
> +++ b/net/sched/sch_cake.c
> @@ -2183,7 +2183,7 @@ static struct sk_buff *cake_dequeue(struct Qdisc *sch)
>                  b->tin_dropped++;
>                  qdisc_tree_reduce_backlog(sch, 1, qdisc_pkt_len(skb));
>                  qdisc_qstats_drop(sch);
> -               kfree_skb_reason(skb, reason);
> +               qdisc_dequeue_drop(sch, skb, reason);
>                  if (q->rate_flags & CAKE_FLAG_INGRESS)
>                          goto retry;
>          }
> @@ -2569,6 +2569,8 @@ static void cake_reconfigure(struct Qdisc *sch)
>
>          sch->flags &= ~TCQ_F_CAN_BYPASS;
>
> +       sch->flags |= TCQ_F_DEQUEUE_DROPS;
> +
>          q->buffer_limit = min(q->buffer_limit,
>                                max(sch->limit * psched_mtu(qdisc_dev(sch)),
>                                    q->buffer_config_limit));

Thanks for the patches. I experimented with them and these are the results:
Running a UDP flood (~21 Mpps load) over 2 minutes.
pre-patch (baseline):
   cake achieves a stable packet rate of 0.476 Mpps

     qdisc cake 8001: dev enp7s0np1 root refcnt 33 bandwidth unlimited
         besteffort flows nonat nowash no-ack-filter split-gso rtt 100ms
         noatm overhead 18 mpu 64
       Sent 3593552166 bytes 56149224 pkt (dropped 2183, overlimits 0
         requeues 311)
       backlog 0b 0p requeues 311
       memory used: 15503616b of 15140Kb

net-next/main:
   cake throughput drops from 0.61 Mpps to 0.15 Mpps (the expected collapse
   we've seen before)

     qdisc cake 8001: dev enp7s0np1 root refcnt 33 bandwidth unlimited
         besteffort flows nonat nowash no-ack-filter split-gso rtt 100ms
         noatm overhead 18 mpu 64
       Sent 1166199994 bytes 18221827 pkt (dropped 71317773, overlimits 0
         requeues 1914)
       backlog 0b 0p requeues 1914
       memory used: 15504576b of 15140Kb

net-next/main + this dequeue patch:
   cake throughput drops from 0.65 Mpps within the first 6 seconds and then
   oscillates between 0.27–0.36 Mpps

     qdisc cake 8001: dev enp7s0np1 root refcnt 33 bandwidth unlimited
         besteffort flows nonat nowash no-ack-filter split-gso rtt 100ms
         noatm overhead 18 mpu 64
       Sent 2627464378 bytes 41054083 pkt (dropped 50102108, overlimits 0
         requeues 1008)
       backlog 0b 0p requeues 1008
       memory used: 15503616b of 15140Kb

net-next/main + this dequeue patch + limiting the number of deferred packets:
   cake throughput drops from 0.65 Mpps within the first 6 seconds and then
   oscillates between 0.35–0.43 Mpps

     qdisc cake 8001: dev enp7s0np1 root refcnt 33 bandwidth unlimited
         besteffort flows nonat nowash no-ack-filter split-gso rtt 100ms
         noatm overhead 18 mpu 64
       Sent 2969529126 bytes 46398843 pkt (dropped 43618919, overlimits 0
         requeues 814)
       backlog 0b 0p requeues 814
       memory used: 15503616b of 15140Kb

I can provide the full throughput traces if needed.
So the last patches definitely mitigate cake's performance degradation.
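
(For reference, a comparable small-packet UDP load can be generated with a
plain sendmmsg() loop, one instance pinned per core. The sketch below is
illustrative only: destination address, packet size and batch size are made
up, and this is not the generator that produced the numbers above.)

/* Illustrative stand-alone sender; NOT the tool used for the numbers above.
 * Destination, packet size and batch size are placeholders.
 */
#define _GNU_SOURCE
#include <arpa/inet.h>
#include <errno.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>

#define BATCH   64
#define PKT_LEN 64

int main(void)
{
        struct sockaddr_in dst = { .sin_family = AF_INET };
        struct mmsghdr msgs[BATCH] = {};
        struct iovec iovs[BATCH];
        char payload[PKT_LEN] = {};
        int fd, i;

        dst.sin_port = htons(9);                        /* discard port */
        inet_pton(AF_INET, "192.0.2.1", &dst.sin_addr); /* placeholder target */

        fd = socket(AF_INET, SOCK_DGRAM, 0);
        if (fd < 0 || connect(fd, (struct sockaddr *)&dst, sizeof(dst)) < 0) {
                perror("socket/connect");
                return 1;
        }

        for (i = 0; i < BATCH; i++) {
                iovs[i].iov_base = payload;
                iovs[i].iov_len = PKT_LEN;
                msgs[i].msg_hdr.msg_iov = &iovs[i];
                msgs[i].msg_hdr.msg_iovlen = 1;
        }

        for (;;) {
                /* Blast batches as fast as possible; ENOBUFS only means the
                 * device queue is full, so keep going.
                 */
                if (sendmmsg(fd, msgs, BATCH, 0) < 0 && errno != ENOBUFS) {
                        perror("sendmmsg");
                        return 1;
                }
        }
}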

- Jonas



^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v1 net-next 5/5] net: dev_queue_xmit() llist adoption
  2025-11-11 13:44                               ` Jonas Köppeler
@ 2025-11-11 16:42                                 ` Toke Høiland-Jørgensen
  0 siblings, 0 replies; 24+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-11-11 16:42 UTC (permalink / raw)
  To: Jonas Köppeler, Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
	Willem de Bruijn, netdev, eric.dumazet

Jonas Köppeler <j.koeppeler@tu-berlin.de> writes:

> On 11/10/25 18:34, Eric Dumazet wrote:
>> On Mon, Nov 10, 2025 at 6:49 AM Toke Høiland-Jørgensen <toke@redhat.com> wrote:
>>> Eric Dumazet <edumazet@google.com> writes:
>>>
>>>> I can work on a patch today.
>>> This sounds like an excellent idea in any case - thanks! :)
>> The following (on top of my last series) seems to work for me
>>
>>   include/net/pkt_sched.h   |    5 +++--
>>   include/net/sch_generic.h |   24 +++++++++++++++++++++++-
>>   net/core/dev.c            |   33 +++++++++++++++++++--------------
>>   net/sched/sch_cake.c      |    4 +++-
>>   4 files changed, 48 insertions(+), 18 deletions(-)
>>
>> diff --git a/include/net/pkt_sched.h b/include/net/pkt_sched.h
>> index 4678db45832a1e3bf7b8a07756fb89ab868bd5d2..e703c507d0daa97ae7c3bf131e322b1eafcc5664 100644
>> --- a/include/net/pkt_sched.h
>> +++ b/include/net/pkt_sched.h
>> @@ -114,12 +114,13 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
>>
>>   void __qdisc_run(struct Qdisc *q);
>>
>> -static inline void qdisc_run(struct Qdisc *q)
>> +static inline struct sk_buff *qdisc_run(struct Qdisc *q)
>>   {
>>          if (qdisc_run_begin(q)) {
>>                  __qdisc_run(q);
>> -               qdisc_run_end(q);
>> +               return qdisc_run_end(q);
>>          }
>> +       return NULL;
>>   }
>>
>>   extern const struct nla_policy rtm_tca_policy[TCA_MAX + 1];
>> diff --git a/include/net/sch_generic.h b/include/net/sch_generic.h
>> index 79501499dafba56271b9ebd97a8f379ffdc83cac..19cd2bc13bdba48f941b1599f896c15c8c7860ae 100644
>> --- a/include/net/sch_generic.h
>> +++ b/include/net/sch_generic.h
>> @@ -88,6 +88,8 @@ struct Qdisc {
>>   #define TCQ_F_INVISIBLE                0x80 /* invisible by default in dump */
>>   #define TCQ_F_NOLOCK           0x100 /* qdisc does not require locking */
>>   #define TCQ_F_OFFLOADED                0x200 /* qdisc is offloaded to HW */
>> +#define TCQ_F_DEQUEUE_DROPS    0x400 /* ->dequeue() can drop packets in q->to_free */
>> +
>>          u32                     limit;
>>          const struct Qdisc_ops  *ops;
>>          struct qdisc_size_table __rcu *stab;
>> @@ -119,6 +121,8 @@ struct Qdisc {
>>
>>                  /* Note : we only change qstats.backlog in fast path. */
>>                  struct gnet_stats_queue qstats;
>> +
>> +               struct sk_buff          *to_free;
>>          __cacheline_group_end(Qdisc_write);
>>
>>
>> @@ -218,8 +222,15 @@ static inline bool qdisc_run_begin(struct Qdisc *qdisc)
>>          return true;
>>   }
>>
>> -static inline void qdisc_run_end(struct Qdisc *qdisc)
>> +static inline struct sk_buff *qdisc_run_end(struct Qdisc *qdisc)
>>   {
>> +       struct sk_buff *to_free = NULL;
>> +
>> +       if (qdisc->flags & TCQ_F_DEQUEUE_DROPS) {
>> +               to_free = qdisc->to_free;
>> +               if (to_free)
>> +                       qdisc->to_free = NULL;
>> +       }
>>          if (qdisc->flags & TCQ_F_NOLOCK) {
>>                  spin_unlock(&qdisc->seqlock);
>>
>> @@ -235,6 +246,7 @@ static inline void qdisc_run_end(struct Qdisc *qdisc)
>>          } else {
>>                  WRITE_ONCE(qdisc->running, false);
>>          }
>> +       return to_free;
>>   }
>>
>>   static inline bool qdisc_may_bulk(const struct Qdisc *qdisc)
>> @@ -1105,6 +1117,16 @@ static inline void tcf_set_drop_reason(const struct sk_buff *skb,
>>          tc_skb_cb(skb)->drop_reason = reason;
>>   }
>>
>> +static inline void qdisc_dequeue_drop(struct Qdisc *q, struct sk_buff *skb,
>> +                                     enum skb_drop_reason reason)
>> +{
>> +       DEBUG_NET_WARN_ON_ONCE(!(q->flags & TCQ_F_DEQUEUE_DROPS));
>> +
>> +       tcf_set_drop_reason(skb, reason);
>> +       skb->next = q->to_free;
>> +       q->to_free = skb;
>> +}
>> +
>>   /* Instead of calling kfree_skb() while root qdisc lock is held,
>>    * queue the skb for future freeing at end of __dev_xmit_skb()
>>    */
>> diff --git a/net/core/dev.c b/net/core/dev.c
>> index ac994974e2a81889fcc0a2e664edcdb7cfd0496d..18cfcd765b1b3e4af1c5339e36df517e7abc914f 100644
>> --- a/net/core/dev.c
>> +++ b/net/core/dev.c
>> @@ -4141,7 +4141,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>>                                   struct net_device *dev,
>>                                   struct netdev_queue *txq)
>>   {
>> -       struct sk_buff *next, *to_free = NULL;
>> +       struct sk_buff *next, *to_free = NULL, *to_free2 = NULL;
>>          spinlock_t *root_lock = qdisc_lock(q);
>>          struct llist_node *ll_list, *first_n;
>>          unsigned long defer_count = 0;
>> @@ -4160,9 +4160,9 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>>                          if (unlikely(!nolock_qdisc_is_empty(q))) {
>>                                  rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
>>                                  __qdisc_run(q);
>> -                               qdisc_run_end(q);
>> +                               to_free2 = qdisc_run_end(q);
>>
>> -                               goto no_lock_out;
>> +                               goto free_out;
>>                          }
>>
>>                          qdisc_bstats_cpu_update(q, skb);
>> @@ -4170,18 +4170,15 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>>                              !nolock_qdisc_is_empty(q))
>>                                  __qdisc_run(q);
>>
>> -                       qdisc_run_end(q);
>> -                       return NET_XMIT_SUCCESS;
>> +                       to_free2 = qdisc_run_end(q);
>> +                       rc = NET_XMIT_SUCCESS;
>> +                       goto free_out;
>>                  }
>>
>>                  rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
>> -               qdisc_run(q);
>> +               to_free2 = qdisc_run(q);
>>
>> -no_lock_out:
>> -               if (unlikely(to_free))
>> -                       kfree_skb_list_reason(to_free,
>> -                                             tcf_get_drop_reason(to_free));
>> -               return rc;
>> +               goto free_out;
>>          }
>>
>>          /* Open code llist_add(&skb->ll_node, &q->defer_list) + queue limit.
>> @@ -4239,7 +4236,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>>                  qdisc_bstats_update(q, skb);
>>                  if (sch_direct_xmit(skb, q, dev, txq, root_lock, true))
>>                          __qdisc_run(q);
>> -               qdisc_run_end(q);
>> +               to_free2 = qdisc_run_end(q);
>>                  rc = NET_XMIT_SUCCESS;
>>          } else {
>>                  int count = 0;
>> @@ -4251,15 +4248,19 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
>>                          rc = dev_qdisc_enqueue(skb, q, &to_free, txq);
>>                          count++;
>>                  }
>> -               qdisc_run(q);
>> +               to_free2 = qdisc_run(q);
>>                  if (count != 1)
>>                          rc = NET_XMIT_SUCCESS;
>>          }
>>   unlock:
>>          spin_unlock(root_lock);
>> +free_out:
>>          if (unlikely(to_free))
>>                  kfree_skb_list_reason(to_free,
>>                                        tcf_get_drop_reason(to_free));
>> +       if (unlikely(to_free2))
>> +               kfree_skb_list_reason(to_free2,
>> +                                     tcf_get_drop_reason(to_free2));
>>          return rc;
>>   }
>>
>> @@ -5741,6 +5742,7 @@ static __latent_entropy void net_tx_action(void)
>>          }
>>
>>          if (sd->output_queue) {
>> +               struct sk_buff *to_free;
>>                  struct Qdisc *head;
>>
>>                  local_irq_disable();
>> @@ -5780,9 +5782,12 @@ static __latent_entropy void net_tx_action(void)
>>                          }
>>
>>                          clear_bit(__QDISC_STATE_SCHED, &q->state);
>> -                       qdisc_run(q);
>> +                       to_free = qdisc_run(q);
>>                          if (root_lock)
>>                                  spin_unlock(root_lock);
>> +                       if (unlikely(to_free))
>> +                               kfree_skb_list_reason(to_free,
>> +                                             tcf_get_drop_reason(to_free));
>>                  }
>>
>>                  rcu_read_unlock();
>> diff --git a/net/sched/sch_cake.c b/net/sched/sch_cake.c
>> index 312f5b000ffb67d74faf70f26d808e26315b4ab8..a717cc4e0606e80123ec9c76331d454dad699b69 100644
>> --- a/net/sched/sch_cake.c
>> +++ b/net/sched/sch_cake.c
>> @@ -2183,7 +2183,7 @@ static struct sk_buff *cake_dequeue(struct Qdisc *sch)
>>                  b->tin_dropped++;
>>                  qdisc_tree_reduce_backlog(sch, 1, qdisc_pkt_len(skb));
>>                  qdisc_qstats_drop(sch);
>> -               kfree_skb_reason(skb, reason);
>> +               qdisc_dequeue_drop(sch, skb, reason);
>>                  if (q->rate_flags & CAKE_FLAG_INGRESS)
>>                          goto retry;
>>          }
>> @@ -2569,6 +2569,8 @@ static void cake_reconfigure(struct Qdisc *sch)
>>
>>          sch->flags &= ~TCQ_F_CAN_BYPASS;
>>
>> +       sch->flags |= TCQ_F_DEQUEUE_DROPS;
>> +
>>          q->buffer_limit = min(q->buffer_limit,
>>                                max(sch->limit * psched_mtu(qdisc_dev(sch)),
>>                                    q->buffer_config_limit));
>
> Thanks for the patches. I experimented with them and these are the results:
> Running a UDP flood (~21 Mpps load) over 2 minutes.
> pre-patch (baseline):
>    cake achieves a stable packet rate of 0.476 Mpps
>
>      qdisc cake 8001: dev enp7s0np1 root refcnt 33 bandwidth unlimited
>          besteffort flows nonat nowash no-ack-filter split-gso rtt 100ms
>          noatm overhead 18 mpu 64
>        Sent 3593552166 bytes 56149224 pkt (dropped 2183, overlimits 0
>          requeues 311)
>        backlog 0b 0p requeues 311
>        memory used: 15503616b of 15140Kb
>
> net-next/main:
>    cake throughput drops from 0.61 Mpps to 0.15 Mpps (the expected collapse
>    we've seen before)
>
>      qdisc cake 8001: dev enp7s0np1 root refcnt 33 bandwidth unlimited
>          besteffort flows nonat nowash no-ack-filter split-gso rtt 100ms
>          noatm overhead 18 mpu 64
>        Sent 1166199994 bytes 18221827 pkt (dropped 71317773, overlimits 0
>          requeues 1914)
>        backlog 0b 0p requeues 1914
>        memory used: 15504576b of 15140Kb
>
> net-next/main + this dequeue patch:
>    cake throughput drops from 0.65 Mpps within the first 6 seconds and then
>    oscillates between 0.27–0.36 Mpps
>
>      qdisc cake 8001: dev enp7s0np1 root refcnt 33 bandwidth unlimited
>          besteffort flows nonat nowash no-ack-filter split-gso rtt 100ms
>          noatm overhead 18 mpu 64
>        Sent 2627464378 bytes 41054083 pkt (dropped 50102108, overlimits 0
>          requeues 1008)
>        backlog 0b 0p requeues 1008
>        memory used: 15503616b of 15140Kb
>
> net-next/main + this dequeue patch + limiting the number of deferred packets:
>    cake throughput drops from 0.65 Mpps within the first 6 seconds and then
>    oscillates between 0.35–0.43 Mpps
>
>      qdisc cake 8001: dev enp7s0np1 root refcnt 33 bandwidth unlimited
>          besteffort flows nonat nowash no-ack-filter split-gso rtt 100ms
>          noatm overhead 18 mpu 64
>        Sent 2969529126 bytes 46398843 pkt (dropped 43618919, overlimits 0
>          requeues 814)
>        backlog 0b 0p requeues 814
>        memory used: 15503616b of 15140Kb
>
> I can provide the full throughput traces if needed.
> So the last patches definitely mitigate cake's performance degradation.

I also did a series of test runs with the variant in Eric's v2 patch set [0]
that includes the dequeue-side improvement, and it definitely helps, to
the point where I can now get up to 50% higher PPS with a cake instance
serving multiple flows.

However, it's unstable: if I add enough flows I end up back in a state
where throughput collapses to a dozen Kpps. The exact number of flows it
takes before things break down varies a bit, but the collapse is quite
distinct when it does. I have not found a definite cause for this yet,
but I'll keep looking.
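
For anyone skimming the thread, the dequeue-side mechanism these numbers
exercise is small: a qdisc sets TCQ_F_DEQUEUE_DROPS and calls
qdisc_dequeue_drop() from its ->dequeue() instead of kfree_skb_reason(),
so dropped skbs are parked on q->to_free and only freed once
qdisc_run_end() has handed them back and the qdisc lock is released. A
rough sketch of how another qdisc could opt in, modelled on the cake hunk
quoted above (the "foo" qdisc, its foo_over_limit() policy and the choice
of drop reason are illustrative, not code from the patch):

/* Illustrative only; assumes the TCQ_F_DEQUEUE_DROPS flag and the
 * qdisc_dequeue_drop() helper from the patch are present.
 */
#include <net/pkt_sched.h>
#include <net/sch_generic.h>

/* Hypothetical head-drop policy: shed packets while over the packet limit. */
static bool foo_over_limit(const struct Qdisc *sch)
{
        return sch->q.qlen > READ_ONCE(sch->limit);
}

static struct sk_buff *foo_dequeue(struct Qdisc *sch)
{
        struct sk_buff *skb;

        while ((skb = __qdisc_dequeue_head(&sch->q)) != NULL) {
                qdisc_qstats_backlog_dec(sch, skb);

                if (likely(!foo_over_limit(sch)))
                        break;

                /* Previously: kfree_skb_reason() under the qdisc lock.
                 * Now the skb is chained on sch->to_free and the caller
                 * frees the whole list after the lock is released.
                 */
                qdisc_qstats_drop(sch);
                qdisc_tree_reduce_backlog(sch, 1, qdisc_pkt_len(skb));
                qdisc_dequeue_drop(sch, skb, SKB_DROP_REASON_QDISC_OVERLIMIT);
        }
        if (skb)
                qdisc_bstats_update(sch, skb);
        return skb;
}

static int foo_init(struct Qdisc *sch, struct nlattr *opt,
                    struct netlink_ext_ack *extack)
{
        /* Opt in: ->dequeue() may park drops on sch->to_free. */
        sch->flags |= TCQ_F_DEQUEUE_DROPS;
        return 0;
}

The point for the performance discussion is that kfree_skb_list_reason()
then runs in __dev_xmit_skb() / net_tx_action() after the qdisc lock has
been dropped, so a flooded cake instance no longer spends its lock hold
time freeing skbs.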

-Toke

[0] https://lore.kernel.org/r/20251111093204.1432437-1-edumazet@google.com


^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2025-11-11 16:42 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2025-10-13 14:54 [PATCH v1 net-next 0/5] net: optimize TX throughput and efficiency Eric Dumazet
2025-10-13 14:54 ` [PATCH v1 net-next 1/5] net: add add indirect call wrapper in skb_release_head_state() Eric Dumazet
2025-10-13 14:54 ` [PATCH v1 net-next 2/5] net/sched: act_mirred: add loop detection Eric Dumazet
2025-10-13 14:54 ` [PATCH v1 net-next 3/5] Revert "net/sched: Fix mirred deadlock on device recursion" Eric Dumazet
2025-10-13 14:54 ` [PATCH v1 net-next 4/5] net: sched: claim one cache line in Qdisc Eric Dumazet
2025-10-13 14:54 ` [PATCH v1 net-next 5/5] net: dev_queue_xmit() llist adoption Eric Dumazet
2025-11-07 15:28   ` Toke Høiland-Jørgensen
2025-11-07 15:37     ` Eric Dumazet
2025-11-07 15:46       ` Eric Dumazet
2025-11-09 10:09         ` Eric Dumazet
2025-11-09 12:54           ` Eric Dumazet
2025-11-09 16:33             ` Toke Høiland-Jørgensen
2025-11-09 17:14               ` Eric Dumazet
2025-11-09 19:18               ` Jonas Köppeler
2025-11-09 19:28                 ` Eric Dumazet
2025-11-09 20:18                   ` Eric Dumazet
2025-11-09 20:29                     ` Eric Dumazet
2025-11-10 11:31                       ` Toke Høiland-Jørgensen
2025-11-10 13:26                         ` Eric Dumazet
2025-11-10 14:49                           ` Toke Høiland-Jørgensen
2025-11-10 17:34                             ` Eric Dumazet
2025-11-11 13:44                               ` Jonas Köppeler
2025-11-11 16:42                                 ` Toke Høiland-Jørgensen
2025-10-13 16:23 ` [PATCH v1 net-next 0/5] net: optimize TX throughput and efficiency Toke Høiland-Jørgensen
