netdev.vger.kernel.org archive mirror
* [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches
@ 2025-11-09 16:12 Eric Dumazet
  2025-11-10 10:14 ` Toke Høiland-Jørgensen
                   ` (2 more replies)
  0 siblings, 3 replies; 21+ messages in thread
From: Eric Dumazet @ 2025-11-09 16:12 UTC (permalink / raw)
  To: David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
	Eric Dumazet, Jesper Dangaard Brouer,
	Toke Høiland-Jørgensen

After commit 100dfa74cad9 ("inet: dev_queue_xmit() llist adoption")
I started seeing many qdisc requeues on IDPF under high TX workload.

$ tc -s qd sh dev eth1 handle 1: ; sleep 1; tc -s qd sh dev eth1 handle 1:
qdisc mq 1: root
 Sent 43534617319319 bytes 268186451819 pkt (dropped 0, overlimits 0 requeues 3532840114)
 backlog 1056Kb 6675p requeues 3532840114
qdisc mq 1: root
 Sent 43554665866695 bytes 268309964788 pkt (dropped 0, overlimits 0 requeues 3537737653)
 backlog 781164b 4822p requeues 3537737653

This is caused by try_bulk_dequeue_skb() being limited only by the BQL budget.

perf record -C120-239 -e qdisc:qdisc_dequeue sleep 1 ; perf script
...
 netperf 75332 [146]  2711.138269: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1292 skbaddr=0xff378005a1e9f200
 netperf 75332 [146]  2711.138953: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1213 skbaddr=0xff378004d607a500
 netperf 75330 [144]  2711.139631: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1233 skbaddr=0xff3780046be20100
 netperf 75333 [147]  2711.140356: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1093 skbaddr=0xff37800514845b00
 netperf 75337 [151]  2711.141037: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1353 skbaddr=0xff37800460753300
 netperf 75337 [151]  2711.141877: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1367 skbaddr=0xff378004e72c7b00
 netperf 75330 [144]  2711.142643: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1202 skbaddr=0xff3780045bd60000
...

This is bad because:

1) Large batches hold one victim cpu for a very long time.

2) Drivers often hit their own TX ring limit (all slots are used).

3) We then call dev_requeue_skb().

4) Requeues are using a FIFO (q->gso_skb), breaking the qdisc's ability to
   implement FQ or priority scheduling.

5) dequeue_skb() gets packets from q->gso_skb one skb at a time
   with no xmit_more support. This causes many spinlock games
   between the qdisc and the device driver.

Requeues were supposed to be very rare; let's keep them that way.

Limit batch sizes to /proc/sys/net/core/dev_weight (default 64), the
budget __qdisc_run() was designed to use.
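
For reference, the knob (and its default) can be inspected with:

$ cat /proc/sys/net/core/dev_weight
64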

Fixes: 5772e9a3463b ("qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Jesper Dangaard Brouer <hawk@kernel.org>
Cc: Toke Høiland-Jørgensen <toke@redhat.com>
---
 net/sched/sch_generic.c | 17 ++++++++++-------
 1 file changed, 10 insertions(+), 7 deletions(-)

diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
index d9a98d02a55fc361a223f3201e37b6a2b698bb5e..852e603c17551ee719bf1c561848d5ef0699ab5d 100644
--- a/net/sched/sch_generic.c
+++ b/net/sched/sch_generic.c
@@ -180,9 +180,10 @@ static inline void dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
 static void try_bulk_dequeue_skb(struct Qdisc *q,
 				 struct sk_buff *skb,
 				 const struct netdev_queue *txq,
-				 int *packets)
+				 int *packets, int budget)
 {
 	int bytelimit = qdisc_avail_bulklimit(txq) - skb->len;
+	int cnt = 0;
 
 	while (bytelimit > 0) {
 		struct sk_buff *nskb = q->dequeue(q);
@@ -193,8 +194,10 @@ static void try_bulk_dequeue_skb(struct Qdisc *q,
 		bytelimit -= nskb->len; /* covers GSO len */
 		skb->next = nskb;
 		skb = nskb;
-		(*packets)++; /* GSO counts as one pkt */
+		if (++cnt >= budget)
+			break;
 	}
+	(*packets) += cnt;
 	skb_mark_not_on_list(skb);
 }
 
@@ -228,7 +231,7 @@ static void try_bulk_dequeue_skb_slow(struct Qdisc *q,
  * A requeued skb (via q->gso_skb) can also be a SKB list.
  */
 static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
-				   int *packets)
+				   int *packets, int budget)
 {
 	const struct netdev_queue *txq = q->dev_queue;
 	struct sk_buff *skb = NULL;
@@ -295,7 +298,7 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
 	if (skb) {
 bulk:
 		if (qdisc_may_bulk(q))
-			try_bulk_dequeue_skb(q, skb, txq, packets);
+			try_bulk_dequeue_skb(q, skb, txq, packets, budget);
 		else
 			try_bulk_dequeue_skb_slow(q, skb, packets);
 	}
@@ -387,7 +390,7 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
  *				>0 - queue is not empty.
  *
  */
-static inline bool qdisc_restart(struct Qdisc *q, int *packets)
+static inline bool qdisc_restart(struct Qdisc *q, int *packets, int budget)
 {
 	spinlock_t *root_lock = NULL;
 	struct netdev_queue *txq;
@@ -396,7 +399,7 @@ static inline bool qdisc_restart(struct Qdisc *q, int *packets)
 	bool validate;
 
 	/* Dequeue packet */
-	skb = dequeue_skb(q, &validate, packets);
+	skb = dequeue_skb(q, &validate, packets, budget);
 	if (unlikely(!skb))
 		return false;
 
@@ -414,7 +417,7 @@ void __qdisc_run(struct Qdisc *q)
 	int quota = READ_ONCE(net_hotdata.dev_tx_weight);
 	int packets;
 
-	while (qdisc_restart(q, &packets)) {
+	while (qdisc_restart(q, &packets, quota)) {
 		quota -= packets;
 		if (quota <= 0) {
 			if (q->flags & TCQ_F_NOLOCK)
-- 
2.51.2.1041.gc1ab5b90ca-goog



* Re: [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches
  2025-11-09 16:12 [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches Eric Dumazet
@ 2025-11-10 10:14 ` Toke Høiland-Jørgensen
  2025-11-10 10:36 ` Jesper Dangaard Brouer
  2025-11-12  2:10 ` patchwork-bot+netdevbpf
  2 siblings, 0 replies; 21+ messages in thread
From: Toke Høiland-Jørgensen @ 2025-11-10 10:14 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
	Eric Dumazet, Jesper Dangaard Brouer

Eric Dumazet <edumazet@google.com> writes:

> After commit 100dfa74cad9 ("inet: dev_queue_xmit() llist adoption")
> I started seeing many qdisc requeues on IDPF under high TX workload.
>
> $ tc -s qd sh dev eth1 handle 1: ; sleep 1; tc -s qd sh dev eth1 handle 1:
> qdisc mq 1: root
>  Sent 43534617319319 bytes 268186451819 pkt (dropped 0, overlimits 0 requeues 3532840114)
>  backlog 1056Kb 6675p requeues 3532840114
> qdisc mq 1: root
>  Sent 43554665866695 bytes 268309964788 pkt (dropped 0, overlimits 0 requeues 3537737653)
>  backlog 781164b 4822p requeues 3537737653
>
> This is caused by try_bulk_dequeue_skb() being limited only by the BQL budget.
>
> perf record -C120-239 -e qdisc:qdisc_dequeue sleep 1 ; perf script
> ...
>  netperf 75332 [146]  2711.138269: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1292 skbaddr=0xff378005a1e9f200
>  netperf 75332 [146]  2711.138953: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1213 skbaddr=0xff378004d607a500
>  netperf 75330 [144]  2711.139631: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1233 skbaddr=0xff3780046be20100
>  netperf 75333 [147]  2711.140356: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1093 skbaddr=0xff37800514845b00
>  netperf 75337 [151]  2711.141037: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1353 skbaddr=0xff37800460753300
>  netperf 75337 [151]  2711.141877: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1367 skbaddr=0xff378004e72c7b00
>  netperf 75330 [144]  2711.142643: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1202 skbaddr=0xff3780045bd60000
> ...
>
> This is bad because:
>
> 1) Large batches hold one victim cpu for a very long time.
>
> 2) Drivers often hit their own TX ring limit (all slots are used).
>
> 3) We then call dev_requeue_skb().
>
> 4) Requeues are using a FIFO (q->gso_skb), breaking the qdisc's ability to
>    implement FQ or priority scheduling.
>
> 5) dequeue_skb() gets packets from q->gso_skb one skb at a time
>    with no xmit_more support. This causes many spinlock games
>    between the qdisc and the device driver.
>
> Requeues were supposed to be very rare; let's keep them that way.
>
> Limit batch sizes to /proc/sys/net/core/dev_weight (default 64), the
> budget __qdisc_run() was designed to use.
>
> Fixes: 5772e9a3463b ("qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Jesper Dangaard Brouer <hawk@kernel.org>
> Cc: Toke Høiland-Jørgensen <toke@redhat.com>

Makes sense!

Reviewed-by: Toke Høiland-Jørgensen <toke@redhat.com>



* Re: [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches
  2025-11-09 16:12 [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches Eric Dumazet
  2025-11-10 10:14 ` Toke Høiland-Jørgensen
@ 2025-11-10 10:36 ` Jesper Dangaard Brouer
  2025-11-10 11:06   ` Eric Dumazet
  2025-11-12  2:10 ` patchwork-bot+netdevbpf
  2 siblings, 1 reply; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2025-11-10 10:36 UTC (permalink / raw)
  To: Eric Dumazet, David S . Miller, Jakub Kicinski, Paolo Abeni
  Cc: Simon Horman, Jamal Hadi Salim, Cong Wang, Jiri Pirko,
	Kuniyuki Iwashima, Willem de Bruijn, netdev, eric.dumazet,
	Toke Høiland-Jørgensen



On 09/11/2025 17.12, Eric Dumazet wrote:
> After commit 100dfa74cad9 ("inet: dev_queue_xmit() llist adoption")
> I started seeing many qdisc requeues on IDPF under high TX workload.
> 
> $ tc -s qd sh dev eth1 handle 1: ; sleep 1; tc -s qd sh dev eth1 handle 1:
> qdisc mq 1: root
>   Sent 43534617319319 bytes 268186451819 pkt (dropped 0, overlimits 0 requeues 3532840114)
>   backlog 1056Kb 6675p requeues 3532840114
> qdisc mq 1: root
>   Sent 43554665866695 bytes 268309964788 pkt (dropped 0, overlimits 0 requeues 3537737653)
>   backlog 781164b 4822p requeues 3537737653
> 
> This is caused by try_bulk_dequeue_skb() being limited only by the BQL budget.
> 
> perf record -C120-239 -e qdisc:qdisc_dequeue sleep 1 ; perf script
> ...
>   netperf 75332 [146]  2711.138269: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1292 skbaddr=0xff378005a1e9f200
>   netperf 75332 [146]  2711.138953: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1213 skbaddr=0xff378004d607a500
>   netperf 75330 [144]  2711.139631: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1233 skbaddr=0xff3780046be20100
>   netperf 75333 [147]  2711.140356: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1093 skbaddr=0xff37800514845b00
>   netperf 75337 [151]  2711.141037: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1353 skbaddr=0xff37800460753300
>   netperf 75337 [151]  2711.141877: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1367 skbaddr=0xff378004e72c7b00
>   netperf 75330 [144]  2711.142643: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1202 skbaddr=0xff3780045bd60000
> ...
> 
> This is bad because:
> 
> 1) Large batches hold one victim cpu for a very long time.
> 
> 2) Drivers often hit their own TX ring limit (all slots are used).
> 
> 3) We then call dev_requeue_skb().
> 
> 4) Requeues are using a FIFO (q->gso_skb), breaking the qdisc's ability to
>     implement FQ or priority scheduling.
> 
> 5) dequeue_skb() gets packets from q->gso_skb one skb at a time
>     with no xmit_more support. This causes many spinlock games
>     between the qdisc and the device driver.
> 
> Requeues were supposed to be very rare; let's keep them that way.
> 
> Limit batch sizes to /proc/sys/net/core/dev_weight (default 64), the
> budget __qdisc_run() was designed to use.
> 
> Fixes: 5772e9a3463b ("qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Jesper Dangaard Brouer <hawk@kernel.org>
> Cc: Toke Høiland-Jørgensen <toke@redhat.com>
> ---
>   net/sched/sch_generic.c | 17 ++++++++++-------
>   1 file changed, 10 insertions(+), 7 deletions(-)
> 
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index d9a98d02a55fc361a223f3201e37b6a2b698bb5e..852e603c17551ee719bf1c561848d5ef0699ab5d 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -180,9 +180,10 @@ static inline void dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
>   static void try_bulk_dequeue_skb(struct Qdisc *q,
>   				 struct sk_buff *skb,
>   				 const struct netdev_queue *txq,
> -				 int *packets)
> +				 int *packets, int budget)
>   {
>   	int bytelimit = qdisc_avail_bulklimit(txq) - skb->len;
> +	int cnt = 0;

Your patch makes perfect sense; we do want this budget limit.

But: Why isn't bytelimit saving us?


Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>

>   	while (bytelimit > 0) {
>   		struct sk_buff *nskb = q->dequeue(q);
> @@ -193,8 +194,10 @@ static void try_bulk_dequeue_skb(struct Qdisc *q,
>   		bytelimit -= nskb->len; /* covers GSO len */
>   		skb->next = nskb;
>   		skb = nskb;
> -		(*packets)++; /* GSO counts as one pkt */
> +		if (++cnt >= budget)
> +			break;
>   	}
> +	(*packets) += cnt;
>   	skb_mark_not_on_list(skb);
>   }
>   
> @@ -228,7 +231,7 @@ static void try_bulk_dequeue_skb_slow(struct Qdisc *q,
>    * A requeued skb (via q->gso_skb) can also be a SKB list.
>    */
>   static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
> -				   int *packets)
> +				   int *packets, int budget)
>   {
>   	const struct netdev_queue *txq = q->dev_queue;
>   	struct sk_buff *skb = NULL;
> @@ -295,7 +298,7 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
>   	if (skb) {
>   bulk:
>   		if (qdisc_may_bulk(q))
> -			try_bulk_dequeue_skb(q, skb, txq, packets);
> +			try_bulk_dequeue_skb(q, skb, txq, packets, budget);
>   		else
>   			try_bulk_dequeue_skb_slow(q, skb, packets);
>   	}
> @@ -387,7 +390,7 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
>    *				>0 - queue is not empty.
>    *
>    */
> -static inline bool qdisc_restart(struct Qdisc *q, int *packets)
> +static inline bool qdisc_restart(struct Qdisc *q, int *packets, int budget)
>   {
>   	spinlock_t *root_lock = NULL;
>   	struct netdev_queue *txq;
> @@ -396,7 +399,7 @@ static inline bool qdisc_restart(struct Qdisc *q, int *packets)
>   	bool validate;
>   
>   	/* Dequeue packet */
> -	skb = dequeue_skb(q, &validate, packets);
> +	skb = dequeue_skb(q, &validate, packets, budget);
>   	if (unlikely(!skb))
>   		return false;
>   
> @@ -414,7 +417,7 @@ void __qdisc_run(struct Qdisc *q)
>   	int quota = READ_ONCE(net_hotdata.dev_tx_weight);
>   	int packets;
>   
> -	while (qdisc_restart(q, &packets)) {
> +	while (qdisc_restart(q, &packets, quota)) {
>   		quota -= packets;
>   		if (quota <= 0) {
>   			if (q->flags & TCQ_F_NOLOCK)



* Re: [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches
  2025-11-10 10:36 ` Jesper Dangaard Brouer
@ 2025-11-10 11:06   ` Eric Dumazet
  2025-11-10 15:04     ` Jesper Dangaard Brouer
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2025-11-10 11:06 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
	Willem de Bruijn, netdev, eric.dumazet,
	Toke Høiland-Jørgensen

On Mon, Nov 10, 2025 at 2:36 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>
>
>
> On 09/11/2025 17.12, Eric Dumazet wrote:
> > After commit 100dfa74cad9 ("inet: dev_queue_xmit() llist adoption")
> > I started seeing many qdisc requeues on IDPF under high TX workload.
> >
> > $ tc -s qd sh dev eth1 handle 1: ; sleep 1; tc -s qd sh dev eth1 handle 1:
> > qdisc mq 1: root
> >   Sent 43534617319319 bytes 268186451819 pkt (dropped 0, overlimits 0 requeues 3532840114)
> >   backlog 1056Kb 6675p requeues 3532840114
> > qdisc mq 1: root
> >   Sent 43554665866695 bytes 268309964788 pkt (dropped 0, overlimits 0 requeues 3537737653)
> >   backlog 781164b 4822p requeues 3537737653
> >
> > This is caused by try_bulk_dequeue_skb() being limited only by the BQL budget.
> >
> > perf record -C120-239 -e qdisc:qdisc_dequeue sleep 1 ; perf script
> > ...
> >   netperf 75332 [146]  2711.138269: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1292 skbaddr=0xff378005a1e9f200
> >   netperf 75332 [146]  2711.138953: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1213 skbaddr=0xff378004d607a500
> >   netperf 75330 [144]  2711.139631: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1233 skbaddr=0xff3780046be20100
> >   netperf 75333 [147]  2711.140356: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1093 skbaddr=0xff37800514845b00
> >   netperf 75337 [151]  2711.141037: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1353 skbaddr=0xff37800460753300
> >   netperf 75337 [151]  2711.141877: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1367 skbaddr=0xff378004e72c7b00
> >   netperf 75330 [144]  2711.142643: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1202 skbaddr=0xff3780045bd60000
> > ...
> >
> > This is bad because:
> >
> > 1) Large batches hold one victim cpu for a very long time.
> >
> > 2) Drivers often hit their own TX ring limit (all slots are used).
> >
> > 3) We then call dev_requeue_skb().
> >
> > 4) Requeues are using a FIFO (q->gso_skb), breaking the qdisc's ability to
> >     implement FQ or priority scheduling.
> >
> > 5) dequeue_skb() gets packets from q->gso_skb one skb at a time
> >     with no xmit_more support. This causes many spinlock games
> >     between the qdisc and the device driver.
> >
> > Requeues were supposed to be very rare; let's keep them that way.
> >
> > Limit batch sizes to /proc/sys/net/core/dev_weight (default 64), the
> > budget __qdisc_run() was designed to use.
> >
> > Fixes: 5772e9a3463b ("qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE")
> > Signed-off-by: Eric Dumazet <edumazet@google.com>
> > Cc: Jesper Dangaard Brouer <hawk@kernel.org>
> > Cc: Toke Høiland-Jørgensen <toke@redhat.com>
> > ---
> >   net/sched/sch_generic.c | 17 ++++++++++-------
> >   1 file changed, 10 insertions(+), 7 deletions(-)
> >
> > diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> > index d9a98d02a55fc361a223f3201e37b6a2b698bb5e..852e603c17551ee719bf1c561848d5ef0699ab5d 100644
> > --- a/net/sched/sch_generic.c
> > +++ b/net/sched/sch_generic.c
> > @@ -180,9 +180,10 @@ static inline void dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
> >   static void try_bulk_dequeue_skb(struct Qdisc *q,
> >                                struct sk_buff *skb,
> >                                const struct netdev_queue *txq,
> > -                              int *packets)
> > +                              int *packets, int budget)
> >   {
> >       int bytelimit = qdisc_avail_bulklimit(txq) - skb->len;
> > +     int cnt = 0;
>
> Your patch makes perfect sense; we do want this budget limit.
>
> But: Why isn't bytelimit saving us?

BQL can easily grow
/sys/class/net/eth1/queues/tx-XXX/byte_queue_limits/limit to quite
large values on MQ high-speed devices.

Each TX queue is usually serviced in round-robin fashion, meaning that
some of them can build up a long-standing queue.


tjbp26:/home/edumazet# ./super_netperf 200 -H tjbp27 -l 100 &
[1] 198996

tjbp26:/home/edumazet# grep .
/sys/class/net/eth1/queues/tx-*/byte_queue_limits/limit
/sys/class/net/eth1/queues/tx-0/byte_queue_limits/limit:116826
/sys/class/net/eth1/queues/tx-10/byte_queue_limits/limit:84534
/sys/class/net/eth1/queues/tx-11/byte_queue_limits/limit:342924
/sys/class/net/eth1/queues/tx-12/byte_queue_limits/limit:433302
/sys/class/net/eth1/queues/tx-13/byte_queue_limits/limit:409254
/sys/class/net/eth1/queues/tx-14/byte_queue_limits/limit:434112
/sys/class/net/eth1/queues/tx-15/byte_queue_limits/limit:68304
/sys/class/net/eth1/queues/tx-16/byte_queue_limits/limit:65610
/sys/class/net/eth1/queues/tx-17/byte_queue_limits/limit:65772
/sys/class/net/eth1/queues/tx-18/byte_queue_limits/limit:69822
/sys/class/net/eth1/queues/tx-19/byte_queue_limits/limit:440634
/sys/class/net/eth1/queues/tx-1/byte_queue_limits/limit:70308
/sys/class/net/eth1/queues/tx-20/byte_queue_limits/limit:304824
/sys/class/net/eth1/queues/tx-21/byte_queue_limits/limit:497856
/sys/class/net/eth1/queues/tx-22/byte_queue_limits/limit:70308
/sys/class/net/eth1/queues/tx-23/byte_queue_limits/limit:535408
/sys/class/net/eth1/queues/tx-24/byte_queue_limits/limit:79419
/sys/class/net/eth1/queues/tx-25/byte_queue_limits/limit:70170
/sys/class/net/eth1/queues/tx-26/byte_queue_limits/limit:1595568
/sys/class/net/eth1/queues/tx-27/byte_queue_limits/limit:579108
/sys/class/net/eth1/queues/tx-28/byte_queue_limits/limit:430578
/sys/class/net/eth1/queues/tx-29/byte_queue_limits/limit:647172
/sys/class/net/eth1/queues/tx-2/byte_queue_limits/limit:345492
/sys/class/net/eth1/queues/tx-30/byte_queue_limits/limit:612392
/sys/class/net/eth1/queues/tx-31/byte_queue_limits/limit:344376
/sys/class/net/eth1/queues/tx-3/byte_queue_limits/limit:154740
/sys/class/net/eth1/queues/tx-4/byte_queue_limits/limit:60588
/sys/class/net/eth1/queues/tx-5/byte_queue_limits/limit:71970
/sys/class/net/eth1/queues/tx-6/byte_queue_limits/limit:70308
/sys/class/net/eth1/queues/tx-7/byte_queue_limits/limit:695454
/sys/class/net/eth1/queues/tx-8/byte_queue_limits/limit:101760
/sys/class/net/eth1/queues/tx-9/byte_queue_limits/limit:65286

Then if we send many small packets in a row, limit/pkt_avg_len can go
to arbitrary values.
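
To put a number on it (back-of-the-envelope): with the tx-26 limit
above and small (~64 byte) packets, a single batch could in theory
reach 1595568 / 64 ~= 24930 skbs.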

Thanks.

>
>
> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
>
> >       while (bytelimit > 0) {
> >               struct sk_buff *nskb = q->dequeue(q);
> > @@ -193,8 +194,10 @@ static void try_bulk_dequeue_skb(struct Qdisc *q,
> >               bytelimit -= nskb->len; /* covers GSO len */
> >               skb->next = nskb;
> >               skb = nskb;
> > -             (*packets)++; /* GSO counts as one pkt */
> > +             if (++cnt >= budget)
> > +                     break;
> >       }
> > +     (*packets) += cnt;
> >       skb_mark_not_on_list(skb);
> >   }
> >
> > @@ -228,7 +231,7 @@ static void try_bulk_dequeue_skb_slow(struct Qdisc *q,
> >    * A requeued skb (via q->gso_skb) can also be a SKB list.
> >    */
> >   static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
> > -                                int *packets)
> > +                                int *packets, int budget)
> >   {
> >       const struct netdev_queue *txq = q->dev_queue;
> >       struct sk_buff *skb = NULL;
> > @@ -295,7 +298,7 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
> >       if (skb) {
> >   bulk:
> >               if (qdisc_may_bulk(q))
> > -                     try_bulk_dequeue_skb(q, skb, txq, packets);
> > +                     try_bulk_dequeue_skb(q, skb, txq, packets, budget);
> >               else
> >                       try_bulk_dequeue_skb_slow(q, skb, packets);
> >       }
> > @@ -387,7 +390,7 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
> >    *                          >0 - queue is not empty.
> >    *
> >    */
> > -static inline bool qdisc_restart(struct Qdisc *q, int *packets)
> > +static inline bool qdisc_restart(struct Qdisc *q, int *packets, int budget)
> >   {
> >       spinlock_t *root_lock = NULL;
> >       struct netdev_queue *txq;
> > @@ -396,7 +399,7 @@ static inline bool qdisc_restart(struct Qdisc *q, int *packets)
> >       bool validate;
> >
> >       /* Dequeue packet */
> > -     skb = dequeue_skb(q, &validate, packets);
> > +     skb = dequeue_skb(q, &validate, packets, budget);
> >       if (unlikely(!skb))
> >               return false;
> >
> > @@ -414,7 +417,7 @@ void __qdisc_run(struct Qdisc *q)
> >       int quota = READ_ONCE(net_hotdata.dev_tx_weight);
> >       int packets;
> >
> > -     while (qdisc_restart(q, &packets)) {
> > +     while (qdisc_restart(q, &packets, quota)) {
> >               quota -= packets;
> >               if (quota <= 0) {
> >                       if (q->flags & TCQ_F_NOLOCK)
>


* Re: [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches
  2025-11-10 11:06   ` Eric Dumazet
@ 2025-11-10 15:04     ` Jesper Dangaard Brouer
  2025-11-10 15:13       ` Eric Dumazet
  0 siblings, 1 reply; 21+ messages in thread
From: Jesper Dangaard Brouer @ 2025-11-10 15:04 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
	Willem de Bruijn, netdev, eric.dumazet,
	Toke Høiland-Jørgensen, kernel-team, Jesse Brandeburg



On 10/11/2025 12.06, Eric Dumazet wrote:
> On Mon, Nov 10, 2025 at 2:36 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>>
>>
>>
>> On 09/11/2025 17.12, Eric Dumazet wrote:
>>> After commit 100dfa74cad9 ("inet: dev_queue_xmit() llist adoption")
>>> I started seeing many qdisc requeues on IDPF under high TX workload.
>>>
>>> $ tc -s qd sh dev eth1 handle 1: ; sleep 1; tc -s qd sh dev eth1 handle 1:
>>> qdisc mq 1: root
>>>    Sent 43534617319319 bytes 268186451819 pkt (dropped 0, overlimits 0 requeues 3532840114)
>>>    backlog 1056Kb 6675p requeues 3532840114
>>> qdisc mq 1: root
>>>    Sent 43554665866695 bytes 268309964788 pkt (dropped 0, overlimits 0 requeues 3537737653)
>>>    backlog 781164b 4822p requeues 3537737653
>>>
>>> This is caused by try_bulk_dequeue_skb() being limited only by the BQL budget.
>>>
>>> perf record -C120-239 -e qdisc:qdisc_dequeue sleep 1 ; perf script
>>> ...
>>>    netperf 75332 [146]  2711.138269: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1292 skbaddr=0xff378005a1e9f200

To Jesse: see how Eric is using the qdisc:qdisc_dequeue tracepoint here

>>>    netperf 75332 [146]  2711.138953: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1213 skbaddr=0xff378004d607a500
>>>    netperf 75330 [144]  2711.139631: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1233 skbaddr=0xff3780046be20100
>>>    netperf 75333 [147]  2711.140356: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1093 skbaddr=0xff37800514845b00
>>>    netperf 75337 [151]  2711.141037: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1353 skbaddr=0xff37800460753300
>>>    netperf 75337 [151]  2711.141877: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1367 skbaddr=0xff378004e72c7b00
>>>    netperf 75330 [144]  2711.142643: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1202 skbaddr=0xff3780045bd60000
>>> ...
>>>
>>> This is bad because:
>>>
>>> 1) Large batches hold one victim cpu for a very long time.
>>>
>>> 2) Drivers often hit their own TX ring limit (all slots are used).
>>>
>>> 3) We then call dev_requeue_skb().
>>>
>>> 4) Requeues are using a FIFO (q->gso_skb), breaking the qdisc's ability to
>>>      implement FQ or priority scheduling.
>>>
>>> 5) dequeue_skb() gets packets from q->gso_skb one skb at a time
>>>      with no xmit_more support. This causes many spinlock games
>>>      between the qdisc and the device driver.
>>>
>>> Requeues were supposed to be very rare; let's keep them that way.
>>>
>>> Limit batch sizes to /proc/sys/net/core/dev_weight (default 64), the
>>> budget __qdisc_run() was designed to use.
>>>
>>> Fixes: 5772e9a3463b ("qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE")
>>> Signed-off-by: Eric Dumazet <edumazet@google.com>
>>> Cc: Jesper Dangaard Brouer <hawk@kernel.org>
>>> Cc: Toke Høiland-Jørgensen <toke@redhat.com>
>>> ---
>>>    net/sched/sch_generic.c | 17 ++++++++++-------
>>>    1 file changed, 10 insertions(+), 7 deletions(-)
>>>
>>> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
>>> index d9a98d02a55fc361a223f3201e37b6a2b698bb5e..852e603c17551ee719bf1c561848d5ef0699ab5d 100644
>>> --- a/net/sched/sch_generic.c
>>> +++ b/net/sched/sch_generic.c
>>> @@ -180,9 +180,10 @@ static inline void dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
>>>    static void try_bulk_dequeue_skb(struct Qdisc *q,
>>>                                 struct sk_buff *skb,
>>>                                 const struct netdev_queue *txq,
>>> -                              int *packets)
>>> +                              int *packets, int budget)
>>>    {
>>>        int bytelimit = qdisc_avail_bulklimit(txq) - skb->len;
>>> +     int cnt = 0;
>>
>> Your patch makes perfect sense; we do want this budget limit.
>>
>> But: Why isn't bytelimit saving us?
> 
> BQL can easily grow
> /sys/class/net/eth1/queues/tx-XXX/byte_queue_limits/limit to quite
> large values on MQ high-speed devices.
> 
> Each TX queue is usually serviced in round-robin fashion, meaning that
> some of them can build up a long-standing queue.
> 
> 
> tjbp26:/home/edumazet# ./super_netperf 200 -H tjbp27 -l 100 &
> [1] 198996
> 
> tjbp26:/home/edumazet# grep .
> /sys/class/net/eth1/queues/tx-*/byte_queue_limits/limit
> /sys/class/net/eth1/queues/tx-0/byte_queue_limits/limit:116826
> /sys/class/net/eth1/queues/tx-10/byte_queue_limits/limit:84534
> /sys/class/net/eth1/queues/tx-11/byte_queue_limits/limit:342924
> /sys/class/net/eth1/queues/tx-12/byte_queue_limits/limit:433302
> /sys/class/net/eth1/queues/tx-13/byte_queue_limits/limit:409254
> /sys/class/net/eth1/queues/tx-14/byte_queue_limits/limit:434112
> /sys/class/net/eth1/queues/tx-15/byte_queue_limits/limit:68304
> /sys/class/net/eth1/queues/tx-16/byte_queue_limits/limit:65610
> /sys/class/net/eth1/queues/tx-17/byte_queue_limits/limit:65772
> /sys/class/net/eth1/queues/tx-18/byte_queue_limits/limit:69822
> /sys/class/net/eth1/queues/tx-19/byte_queue_limits/limit:440634
> /sys/class/net/eth1/queues/tx-1/byte_queue_limits/limit:70308
> /sys/class/net/eth1/queues/tx-20/byte_queue_limits/limit:304824
> /sys/class/net/eth1/queues/tx-21/byte_queue_limits/limit:497856
> /sys/class/net/eth1/queues/tx-22/byte_queue_limits/limit:70308
> /sys/class/net/eth1/queues/tx-23/byte_queue_limits/limit:535408
> /sys/class/net/eth1/queues/tx-24/byte_queue_limits/limit:79419
> /sys/class/net/eth1/queues/tx-25/byte_queue_limits/limit:70170
> /sys/class/net/eth1/queues/tx-26/byte_queue_limits/limit:1595568
> /sys/class/net/eth1/queues/tx-27/byte_queue_limits/limit:579108
> /sys/class/net/eth1/queues/tx-28/byte_queue_limits/limit:430578
> /sys/class/net/eth1/queues/tx-29/byte_queue_limits/limit:647172
> /sys/class/net/eth1/queues/tx-2/byte_queue_limits/limit:345492
> /sys/class/net/eth1/queues/tx-30/byte_queue_limits/limit:612392
> /sys/class/net/eth1/queues/tx-31/byte_queue_limits/limit:344376
> /sys/class/net/eth1/queues/tx-3/byte_queue_limits/limit:154740
> /sys/class/net/eth1/queues/tx-4/byte_queue_limits/limit:60588
> /sys/class/net/eth1/queues/tx-5/byte_queue_limits/limit:71970
> /sys/class/net/eth1/queues/tx-6/byte_queue_limits/limit:70308
> /sys/class/net/eth1/queues/tx-7/byte_queue_limits/limit:695454
> /sys/class/net/eth1/queues/tx-8/byte_queue_limits/limit:101760
> /sys/class/net/eth1/queues/tx-9/byte_queue_limits/limit:65286
> 
> Then if we send many small packets in a row, limit/pkt_avg_len can go
> to arbitrary values.
> 

Thanks for sharing.

With these numbers it makes sense that the BQL bytelimit isn't
limiting this code path much.

e.g. 1595568 bytes / 1500 MTU = 1063 packets.

Our prod machines also have large numbers:

$ grep -H . /sys/class/net/REDACT0/queues/tx-*/byte_queue_limits/limit | 
sort -k2rn -t: | head -n 10
/sys/class/net/ext0/queues/tx-38/byte_queue_limits/limit:819432
/sys/class/net/ext0/queues/tx-95/byte_queue_limits/limit:766227
/sys/class/net/ext0/queues/tx-2/byte_queue_limits/limit:715412
/sys/class/net/ext0/queues/tx-66/byte_queue_limits/limit:692073
/sys/class/net/ext0/queues/tx-20/byte_queue_limits/limit:679817
/sys/class/net/ext0/queues/tx-61/byte_queue_limits/limit:647638
/sys/class/net/ext0/queues/tx-11/byte_queue_limits/limit:642212
/sys/class/net/ext0/queues/tx-10/byte_queue_limits/limit:615188
/sys/class/net/ext0/queues/tx-48/byte_queue_limits/limit:613745
/sys/class/net/ext0/queues/tx-80/byte_queue_limits/limit:584850

--Jesper

>>
>> Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
>>
>>>        while (bytelimit > 0) {
>>>                struct sk_buff *nskb = q->dequeue(q);
>>> @@ -193,8 +194,10 @@ static void try_bulk_dequeue_skb(struct Qdisc *q,
>>>                bytelimit -= nskb->len; /* covers GSO len */
>>>                skb->next = nskb;
>>>                skb = nskb;
>>> -             (*packets)++; /* GSO counts as one pkt */
>>> +             if (++cnt >= budget)
>>> +                     break;
>>>        }
>>> +     (*packets) += cnt;
>>>        skb_mark_not_on_list(skb);
>>>    }
>>>
>>> @@ -228,7 +231,7 @@ static void try_bulk_dequeue_skb_slow(struct Qdisc *q,
>>>     * A requeued skb (via q->gso_skb) can also be a SKB list.
>>>     */
>>>    static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
>>> -                                int *packets)
>>> +                                int *packets, int budget)
>>>    {
>>>        const struct netdev_queue *txq = q->dev_queue;
>>>        struct sk_buff *skb = NULL;
>>> @@ -295,7 +298,7 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
>>>        if (skb) {
>>>    bulk:
>>>                if (qdisc_may_bulk(q))
>>> -                     try_bulk_dequeue_skb(q, skb, txq, packets);
>>> +                     try_bulk_dequeue_skb(q, skb, txq, packets, budget);
>>>                else
>>>                        try_bulk_dequeue_skb_slow(q, skb, packets);
>>>        }
>>> @@ -387,7 +390,7 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
>>>     *                          >0 - queue is not empty.
>>>     *
>>>     */
>>> -static inline bool qdisc_restart(struct Qdisc *q, int *packets)
>>> +static inline bool qdisc_restart(struct Qdisc *q, int *packets, int budget)
>>>    {
>>>        spinlock_t *root_lock = NULL;
>>>        struct netdev_queue *txq;
>>> @@ -396,7 +399,7 @@ static inline bool qdisc_restart(struct Qdisc *q, int *packets)
>>>        bool validate;
>>>
>>>        /* Dequeue packet */
>>> -     skb = dequeue_skb(q, &validate, packets);
>>> +     skb = dequeue_skb(q, &validate, packets, budget);
>>>        if (unlikely(!skb))
>>>                return false;
>>>
>>> @@ -414,7 +417,7 @@ void __qdisc_run(struct Qdisc *q)
>>>        int quota = READ_ONCE(net_hotdata.dev_tx_weight);
>>>        int packets;
>>>
>>> -     while (qdisc_restart(q, &packets)) {
>>> +     while (qdisc_restart(q, &packets, quota)) {
>>>                quota -= packets;
>>>                if (quota <= 0) {
>>>                        if (q->flags & TCQ_F_NOLOCK)
>>



* Re: [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches
  2025-11-10 15:04     ` Jesper Dangaard Brouer
@ 2025-11-10 15:13       ` Eric Dumazet
  0 siblings, 0 replies; 21+ messages in thread
From: Eric Dumazet @ 2025-11-10 15:13 UTC (permalink / raw)
  To: Jesper Dangaard Brouer
  Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
	Jamal Hadi Salim, Cong Wang, Jiri Pirko, Kuniyuki Iwashima,
	Willem de Bruijn, netdev, eric.dumazet,
	Toke Høiland-Jørgensen, kernel-team, Jesse Brandeburg

On Mon, Nov 10, 2025 at 7:04 AM Jesper Dangaard Brouer <hawk@kernel.org> wrote:
>
> Thanks for sharing.
>
> With these numbers it makes sense that the BQL bytelimit isn't
> limiting this code path much.
>
> e.g. 1595568 bytes / 1500 MTU = 1063 packets.
>
> Our prod machines also have large numbers:
>
> $ grep -H . /sys/class/net/REDACT0/queues/tx-*/byte_queue_limits/limit |
> sort -k2rn -t: | head -n 10
> /sys/class/net/ext0/queues/tx-38/byte_queue_limits/limit:819432
> /sys/class/net/ext0/queues/tx-95/byte_queue_limits/limit:766227
> /sys/class/net/ext0/queues/tx-2/byte_queue_limits/limit:715412
> /sys/class/net/ext0/queues/tx-66/byte_queue_limits/limit:692073
> /sys/class/net/ext0/queues/tx-20/byte_queue_limits/limit:679817
> /sys/class/net/ext0/queues/tx-61/byte_queue_limits/limit:647638
> /sys/class/net/ext0/queues/tx-11/byte_queue_limits/limit:642212
> /sys/class/net/ext0/queues/tx-10/byte_queue_limits/limit:615188
> /sys/class/net/ext0/queues/tx-48/byte_queue_limits/limit:613745
> /sys/class/net/ext0/queues/tx-80/byte_queue_limits/limit:584850
>

Exactly :/

BQL was very nice when we had a single queue and/or relatively slow NICs.

For a 200Gbit NIC, assuming we are able to run a napi poll every 100 usec,
the napi poll has to tx-complete 2.5 MB per run if a single flow is
pushing 200Gbit on a single tx-queue.

In reality, interrupt mitigation tends to space out napi polls, and BQL
reflects this by inflating the limit to the point where it is no longer
effective.
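
For reference, the arithmetic behind the 2.5 MB figure (back-of-the-envelope):

    200 Gbit/s = 25 GB/s
    25 GB/s * 100 usec  = 2.5 MB to tx-complete per napi poll
    2.5 MB / 1500 bytes ~= 1666 full-size packets per poll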


* Re: [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches
  2025-11-09 16:12 [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches Eric Dumazet
  2025-11-10 10:14 ` Toke Høiland-Jørgensen
  2025-11-10 10:36 ` Jesper Dangaard Brouer
@ 2025-11-12  2:10 ` patchwork-bot+netdevbpf
  2025-11-12 10:48   ` Jamal Hadi Salim
  2 siblings, 1 reply; 21+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-11-12  2:10 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: davem, kuba, pabeni, horms, jhs, xiyou.wangcong, jiri, kuniyu,
	willemb, netdev, eric.dumazet, hawk, toke

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Sun,  9 Nov 2025 16:12:15 +0000 you wrote:
> After commit 100dfa74cad9 ("inet: dev_queue_xmit() llist adoption")
> I started seeing many qdisc requeues on IDPF under high TX workload.
> 
> $ tc -s qd sh dev eth1 handle 1: ; sleep 1; tc -s qd sh dev eth1 handle 1:
> qdisc mq 1: root
>  Sent 43534617319319 bytes 268186451819 pkt (dropped 0, overlimits 0 requeues 3532840114)
>  backlog 1056Kb 6675p requeues 3532840114
> qdisc mq 1: root
>  Sent 43554665866695 bytes 268309964788 pkt (dropped 0, overlimits 0 requeues 3537737653)
>  backlog 781164b 4822p requeues 3537737653
> 
> [...]

Here is the summary with links:
  - [net] net_sched: limit try_bulk_dequeue_skb() batches
    https://git.kernel.org/netdev/net/c/0345552a653c

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html




* Re: [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches
  2025-11-12  2:10 ` patchwork-bot+netdevbpf
@ 2025-11-12 10:48   ` Jamal Hadi Salim
  2025-11-12 11:21     ` Eric Dumazet
  2025-11-13 17:53     ` Jamal Hadi Salim
  0 siblings, 2 replies; 21+ messages in thread
From: Jamal Hadi Salim @ 2025-11-12 10:48 UTC (permalink / raw)
  To: kuba
  Cc: Eric Dumazet, davem, pabeni, horms, xiyou.wangcong, jiri, kuniyu,
	willemb, netdev, eric.dumazet, hawk, patchwork-bot+netdevbpf,
	toke

On Tue, Nov 11, 2025 at 9:10 PM <patchwork-bot+netdevbpf@kernel.org> wrote:
>
> Hello:
>
> This patch was applied to netdev/net.git (main)
> by Jakub Kicinski <kuba@kernel.org>:
>
> On Sun,  9 Nov 2025 16:12:15 +0000 you wrote:
> > After commit 100dfa74cad9 ("inet: dev_queue_xmit() llist adoption")
> > I started seeing many qdisc requeues on IDPF under high TX workload.
> >
> > $ tc -s qd sh dev eth1 handle 1: ; sleep 1; tc -s qd sh dev eth1 handle 1:
> > qdisc mq 1: root
> >  Sent 43534617319319 bytes 268186451819 pkt (dropped 0, overlimits 0 requeues 3532840114)
> >  backlog 1056Kb 6675p requeues 3532840114
> > qdisc mq 1: root
> >  Sent 43554665866695 bytes 268309964788 pkt (dropped 0, overlimits 0 requeues 3537737653)
> >  backlog 781164b 4822p requeues 3537737653
> >

Hrm. Should this have gone into net-next instead of net, given that
the changes causing the regression are still in net-next?
Don't think it's a big deal if the merge is about to happen, and I can
manually apply it since I was going to run some tests today.

cheers,
jamal

> > [...]
>
> Here is the summary with links:
>   - [net] net_sched: limit try_bulk_dequeue_skb() batches
>     https://git.kernel.org/netdev/net/c/0345552a653c
>
> You are awesome, thank you!
> --
> Deet-doot-dot, I am a bot.
> https://korg.docs.kernel.org/patchwork/pwbot.html
>
>


* Re: [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches
  2025-11-12 10:48   ` Jamal Hadi Salim
@ 2025-11-12 11:21     ` Eric Dumazet
  2025-11-13 17:53     ` Jamal Hadi Salim
  1 sibling, 0 replies; 21+ messages in thread
From: Eric Dumazet @ 2025-11-12 11:21 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: kuba, davem, pabeni, horms, xiyou.wangcong, jiri, kuniyu, willemb,
	netdev, eric.dumazet, hawk, patchwork-bot+netdevbpf, toke

On Wed, Nov 12, 2025 at 2:48 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Tue, Nov 11, 2025 at 9:10 PM <patchwork-bot+netdevbpf@kernel.org> wrote:
> >
> > Hello:
> >
> > This patch was applied to netdev/net.git (main)
> > by Jakub Kicinski <kuba@kernel.org>:
> >
> > On Sun,  9 Nov 2025 16:12:15 +0000 you wrote:
> > > After commit 100dfa74cad9 ("inet: dev_queue_xmit() llist adoption")
> > > I started seeing many qdisc requeues on IDPF under high TX workload.
> > >
> > > $ tc -s qd sh dev eth1 handle 1: ; sleep 1; tc -s qd sh dev eth1 handle 1:
> > > qdisc mq 1: root
> > >  Sent 43534617319319 bytes 268186451819 pkt (dropped 0, overlimits 0 requeues 3532840114)
> > >  backlog 1056Kb 6675p requeues 3532840114
> > > qdisc mq 1: root
> > >  Sent 43554665866695 bytes 268309964788 pkt (dropped 0, overlimits 0 requeues 3537737653)
> > >  backlog 781164b 4822p requeues 3537737653
> > >
>
> Hrm. Should this have gone into net-next instead of net? Given that
> the changes causing regression are still in net-next.
> Dont think its a big deal if the merge is about to happen and i can
> manually apply it since i was going to run some tests today.
>
> cheers,
> jamal
>

Let me clarify: before my recent work, I also had requeues but did
not care much.

Since then, I have been trying to reduce the cost of TX, and latencies.

This try_bulk_dequeue_skb() behavior was a major issue; I guess nobody
caught it before.


* Re: [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches
  2025-11-12 10:48   ` Jamal Hadi Salim
  2025-11-12 11:21     ` Eric Dumazet
@ 2025-11-13 17:53     ` Jamal Hadi Salim
  2025-11-13 18:08       ` Eric Dumazet
  1 sibling, 1 reply; 21+ messages in thread
From: Jamal Hadi Salim @ 2025-11-13 17:53 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: kuba, davem, pabeni, horms, xiyou.wangcong, jiri, kuniyu, willemb,
	netdev, eric.dumazet, hawk, patchwork-bot+netdevbpf, toke

[..]
Eric,

So you are correct that requeues existed even before your changes to
speed up the tx path - two machines, one with 6.5 and another with a
6.8 variant, exhibit this phenomenon with very low traffic... which
got me a little curious.
My initial thought was that it was related to the mq/fq_codel combo,
but a short run shows requeues occur on a couple of other qdiscs
(e.g. prio) and mq children (e.g. pfifo), which rules out fq_codel as
a contributor to the requeues.
For example, this NUC I am typing on right now, after changing the
root qdisc:

--
$ uname -r
6.8.0-87-generic
$
qdisc prio 8004: dev eno1 root refcnt 5 bands 8 priomap 1 2 2 2 1 2 0
0 1 1 1 1 1 1 1 1
 Sent 360948039 bytes 1015807 pkt (dropped 0, overlimits 0 requeues 1528)
 backlog 0b 0p requeues 1528
---

and 20-30  seconds later:
---
qdisc prio 8004: dev eno1 root refcnt 5 bands 8 priomap 1 2 2 2 1 2 0
0 1 1 1 1 1 1 1 1
 Sent 361867275 bytes 1017386 pkt (dropped 0, overlimits 0 requeues 1531)
 backlog 0b 0p requeues 1531
----

Real cheap NIC doing 1G with 4 tx rings:
---
$ ethtool -i eno1
driver: igc
version: 6.8.0-87-generic
firmware-version: 1085:8770
expansion-rom-version:
bus-info: 0000:02:00.0
supports-statistics: yes
supports-test: yes
supports-eeprom-access: yes
supports-register-dump: yes
supports-priv-flags: yes

$ ethtool eno1
Settings for eno1:
Supported ports: [ TP ]
Supported link modes:   10baseT/Half 10baseT/Full
                        100baseT/Half 100baseT/Full
                        1000baseT/Full
                        2500baseT/Full
Supported pause frame use: Symmetric
Supports auto-negotiation: Yes
Supported FEC modes: Not reported
Advertised link modes:  10baseT/Half 10baseT/Full
                        100baseT/Half 100baseT/Full
                        1000baseT/Full
                        2500baseT/Full
Advertised pause frame use: Symmetric
Advertised auto-negotiation: Yes
Advertised FEC modes: Not reported
Speed: 1000Mb/s
Duplex: Full
Auto-negotiation: on
Port: Twisted Pair
PHYAD: 0
Transceiver: internal
MDI-X: off (auto)
netlink error: Operation not permitted
        Current message level: 0x00000007 (7)
                               drv probe link
Link detected: yes
----

Requeues should only happen if the driver is overwhelmed on the tx
side - i.e. the tx ring of choice has no more space. Back in the day,
this was not a very common event.
That can certainly be justified today with several explanations: a)
modern processors are getting faster, b) the tx code path has become
more efficient (true from inspection and your results, but those
patches are not on my small systems), c) (unlikely, but) we are
misaccounting for requeues (need to look at the code), or d) the
driver is too eager to return NETDEV_TX_BUSY.

Thoughts?
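
One crude way to watch for d) on this box (reusing commands from
earlier in the thread; the ethtool stat name is a guess and varies by
driver):

$ watch -d "tc -s qdisc show dev eno1 | grep requeues; ethtool -S eno1 | grep -i restart"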

We will run some forwarding performance tests and let you know if we
spot anything..

cheers,
jamal


* Re: [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches
  2025-11-13 17:53     ` Jamal Hadi Salim
@ 2025-11-13 18:08       ` Eric Dumazet
  2025-11-13 18:30         ` Jamal Hadi Salim
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2025-11-13 18:08 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: kuba, davem, pabeni, horms, xiyou.wangcong, jiri, kuniyu, willemb,
	netdev, eric.dumazet, hawk, patchwork-bot+netdevbpf, toke

On Thu, Nov 13, 2025 at 9:53 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> [..]
> Eric,
>
> So you are correct that requeues existed even before your changes to
> speed up the tx path - two machines, one with 6.5 and another with a
> 6.8 variant, exhibit this phenomenon with very low traffic... which
> got me a little curious.
> My initial thought was that it was related to the mq/fq_codel combo,
> but a short run shows requeues occur on a couple of other qdiscs
> (e.g. prio) and mq children (e.g. pfifo), which rules out fq_codel as
> a contributor to the requeues.
> For example, this NUC I am typing on right now, after changing the
> root qdisc:
>
> --
> $ uname -r
> 6.8.0-87-generic
> $
> qdisc prio 8004: dev eno1 root refcnt 5 bands 8 priomap 1 2 2 2 1 2 0
> 0 1 1 1 1 1 1 1 1
>  Sent 360948039 bytes 1015807 pkt (dropped 0, overlimits 0 requeues 1528)
>  backlog 0b 0p requeues 1528
> ---
>
> and 20-30  seconds later:
> ---
> qdisc prio 8004: dev eno1 root refcnt 5 bands 8 priomap 1 2 2 2 1 2 0
> 0 1 1 1 1 1 1 1 1
>  Sent 361867275 bytes 1017386 pkt (dropped 0, overlimits 0 requeues 1531)
>  backlog 0b 0p requeues 1531
> ----
>
> Real cheap NIC doing 1G with 4 tx rings:
> ---
> $ ethtool -i eno1
> driver: igc
> version: 6.8.0-87-generic
> firmware-version: 1085:8770
> expansion-rom-version:
> bus-info: 0000:02:00.0
> supports-statistics: yes
> supports-test: yes
> supports-eeprom-access: yes
> supports-register-dump: yes
> supports-priv-flags: yes
>
> $ ethtool eno1
> Settings for eno1:
> Supported ports: [ TP ]
> Supported link modes:   10baseT/Half 10baseT/Full
>                         100baseT/Half 100baseT/Full
>                         1000baseT/Full
>                         2500baseT/Full
> Supported pause frame use: Symmetric
> Supports auto-negotiation: Yes
> Supported FEC modes: Not reported
> Advertised link modes:  10baseT/Half 10baseT/Full
>                         100baseT/Half 100baseT/Full
>                         1000baseT/Full
>                         2500baseT/Full
> Advertised pause frame use: Symmetric
> Advertised auto-negotiation: Yes
> Advertised FEC modes: Not reported
> Speed: 1000Mb/s
> Duplex: Full
> Auto-negotiation: on
> Port: Twisted Pair
> PHYAD: 0
> Transceiver: internal
> MDI-X: off (auto)
> netlink error: Operation not permitted
>         Current message level: 0x00000007 (7)
>                                drv probe link
> Link detected: yes
> ----
>
> Requeues should only happen if the driver is overwhelmed on the tx
> side - i.e. the tx ring of choice has no more space. Back in the day,
> this was not a very common event.
> That can certainly be justified today with several explanations: a)
> modern processors are getting faster, b) the tx code path has become
> more efficient (true from inspection and your results, but those
> patches are not on my small systems), c) (unlikely, but) we are
> misaccounting for requeues (need to look at the code), or d) the
> driver is too eager to return NETDEV_TX_BUSY.
>
> Thoughts?

requeues can happen because some drivers do not use skb->len for the
BQL budget, but something bigger for GSO packets,
because they want to account for the (N) headers.

So the core networking stack could pull too many packets from the
qdisc for one xmit_more batch;
then ndo_start_xmit() at some point stops the queue before the end of
the batch, because the BQL limit is hit sooner.

I think drivers should not be overzealous; BQL is a best effort, we do
not care about extra headers.

drivers/net/ethernet/intel/igc/igc_main.c is one of the overzealous drivers ;)

igc_tso() ...

/* update gso size and bytecount with header size */
first->gso_segs = skb_shinfo(skb)->gso_segs;
first->bytecount += (first->gso_segs - 1) * *hdr_len;
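
To illustrate with made-up but plausible numbers (a toy calculation,
not actual driver code):

/* One 64KB GSO skb with a 1448-byte MSS and 66 bytes of headers */
unsigned int skb_len   = 65226;  /* headers once + 45 * 1448 payload */
unsigned int gso_segs  = 45;
unsigned int hdr_len   = 66;
unsigned int bytecount = skb_len + (gso_segs - 1) * hdr_len;
/* bytecount = 65226 + 44 * 66 = 68130, ~4.5% more than skb->len,
 * so BQL fills up before the qdisc batch is fully transmitted.
 */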

>
> We will run some forwarding performance tests and let you know if we
> spot anything..
>
> cheers,
> jamal


* Re: [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches
  2025-11-13 18:08       ` Eric Dumazet
@ 2025-11-13 18:30         ` Jamal Hadi Salim
  2025-11-13 18:35           ` Eric Dumazet
  0 siblings, 1 reply; 21+ messages in thread
From: Jamal Hadi Salim @ 2025-11-13 18:30 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: kuba, davem, pabeni, horms, xiyou.wangcong, jiri, kuniyu, willemb,
	netdev, eric.dumazet, hawk, patchwork-bot+netdevbpf, toke

On Thu, Nov 13, 2025 at 1:08 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Nov 13, 2025 at 9:53 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >
> > [..]
> > Eric,
> >
> > So you are correct that requeues existed even before your changes to
> > speed up the tx path - two machines, one with 6.5 and another with a
> > 6.8 variant, exhibit this phenomenon with very low traffic... which
> > got me a little curious.
> > My initial thought was that it was related to the mq/fq_codel combo,
> > but a short run shows requeues occur on a couple of other qdiscs
> > (e.g. prio) and mq children (e.g. pfifo), which rules out fq_codel as
> > a contributor to the requeues.
> > For example, this NUC I am typing on right now, after changing the
> > root qdisc:
> >
> > --
> > $ uname -r
> > 6.8.0-87-generic
> > $
> > qdisc prio 8004: dev eno1 root refcnt 5 bands 8 priomap 1 2 2 2 1 2 0
> > 0 1 1 1 1 1 1 1 1
> >  Sent 360948039 bytes 1015807 pkt (dropped 0, overlimits 0 requeues 1528)
> >  backlog 0b 0p requeues 1528
> > ---
> >
> > and 20-30  seconds later:
> > ---
> > qdisc prio 8004: dev eno1 root refcnt 5 bands 8 priomap 1 2 2 2 1 2 0
> > 0 1 1 1 1 1 1 1 1
> >  Sent 361867275 bytes 1017386 pkt (dropped 0, overlimits 0 requeues 1531)
> >  backlog 0b 0p requeues 1531
> > ----
> >
> > Real cheap NIC doing 1G with 4 tx rings:
> > ---
> > $ ethtool -i eno1
> > driver: igc
> > version: 6.8.0-87-generic
> > firmware-version: 1085:8770
> > expansion-rom-version:
> > bus-info: 0000:02:00.0
> > supports-statistics: yes
> > supports-test: yes
> > supports-eeprom-access: yes
> > supports-register-dump: yes
> > supports-priv-flags: yes
> >
> > $ ethtool eno1
> > Settings for eno1:
> > Supported ports: [ TP ]
> > Supported link modes:   10baseT/Half 10baseT/Full
> >                         100baseT/Half 100baseT/Full
> >                         1000baseT/Full
> >                         2500baseT/Full
> > Supported pause frame use: Symmetric
> > Supports auto-negotiation: Yes
> > Supported FEC modes: Not reported
> > Advertised link modes:  10baseT/Half 10baseT/Full
> >                         100baseT/Half 100baseT/Full
> >                         1000baseT/Full
> >                         2500baseT/Full
> > Advertised pause frame use: Symmetric
> > Advertised auto-negotiation: Yes
> > Advertised FEC modes: Not reported
> > Speed: 1000Mb/s
> > Duplex: Full
> > Auto-negotiation: on
> > Port: Twisted Pair
> > PHYAD: 0
> > Transceiver: internal
> > MDI-X: off (auto)
> > netlink error: Operation not permitted
> >         Current message level: 0x00000007 (7)
> >                                drv probe link
> > Link detected: yes
> > ----
> >
> > Requeues should only happen if the driver is overwhelmed on the tx
> > side - i.e. the tx ring of choice has no more space. Back in the day,
> > this was not a very common event.
> > That can certainly be justified today with several explanations: a)
> > modern processors are getting faster, b) the tx code path has become
> > more efficient (true from inspection and your results, but those
> > patches are not on my small systems), c) (unlikely, but) we are
> > misaccounting for requeues (need to look at the code), or d) the
> > driver is too eager to return NETDEV_TX_BUSY.
> >
> > Thoughts?
>
> requeues can happen because some drivers do not use skb->len for the
> BQL budget, but something bigger for GSO packets,
> because they want to account for the (N) headers.
>
> So the core networking stack could pull too many packets from the
> qdisc for one xmit_more batch;
> then ndo_start_xmit() at some point stops the queue before the end of
> the batch, because the BQL limit is hit sooner.
>
> I think drivers should not be overzealous; BQL is a best effort, we do
> not care about extra headers.
>
> drivers/net/ethernet/intel/igc/igc_main.c is one of the overzealous drivers ;)
>
> igc_tso() ...
>
> /* update gso size and bytecount with header size */
> first->gso_segs = skb_shinfo(skb)->gso_segs;
> first->bytecount += (first->gso_segs - 1) * *hdr_len;
>


Ok, the 25G i40e driver we are going to run tests on seems to be
suffering from the same enthusiasm ;->
I guess it shares the same codebase..
Very few drivers though seem to be doing what you suggest - with idpf
of course being one of those ;->

cheers,
jamal

> > We will run some forwarding performance tests and let you know if we
> > spot anything..
> >
> > cheers,
> > jamal


* Re: [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches
  2025-11-13 18:30         ` Jamal Hadi Salim
@ 2025-11-13 18:35           ` Eric Dumazet
  2025-11-13 18:55             ` Jamal Hadi Salim
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2025-11-13 18:35 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: kuba, davem, pabeni, horms, xiyou.wangcong, jiri, kuniyu, willemb,
	netdev, eric.dumazet, hawk, patchwork-bot+netdevbpf, toke

On Thu, Nov 13, 2025 at 10:30 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> On Thu, Nov 13, 2025 at 1:08 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Thu, Nov 13, 2025 at 9:53 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > [...]
> > >
> > > Thoughts?
> >
> > [...]
> > drivers/net/ethernet/intel/igc/igc_main.c is one of the overzealous drivers ;)
> > [...]
> >
>
>
> Ok, the 25G i40e driver we are going to run tests on seems to be
> suffering from the same enthusiasm ;->
> I guess it shares the same codebase..
> Very few drivers though seem to be doing what you suggest. Of course idpf
> is one of those ;->

Note that a few requeues are ok.

In my case, I had 5 million requeues per second, and at that point
you start noticing something is wrong ;)

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches
  2025-11-13 18:35           ` Eric Dumazet
@ 2025-11-13 18:55             ` Jamal Hadi Salim
  2025-11-14 16:28               ` Jamal Hadi Salim
  0 siblings, 1 reply; 21+ messages in thread
From: Jamal Hadi Salim @ 2025-11-13 18:55 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: kuba, davem, pabeni, horms, xiyou.wangcong, jiri, kuniyu, willemb,
	netdev, eric.dumazet, hawk, patchwork-bot+netdevbpf, toke

On Thu, Nov 13, 2025 at 1:36 PM Eric Dumazet <edumazet@google.com> wrote:
>
> On Thu, Nov 13, 2025 at 10:30 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >
> > [...]
> > Very few drivers though seem to be doing what you suggest. Of course idpf
> > is one of those ;->
>
> Note that a few requeues are ok.
>
> In my case, I had 5 million requeues per second, and at that point
> you start noticing something is wrong ;)

That's high ;-> For the NUC with igc, it's <1%. Regardless, the
eagerness for TX BUSY implies reduced performance due to the early
bailout..

cheers,
jamal

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches
  2025-11-13 18:55             ` Jamal Hadi Salim
@ 2025-11-14 16:28               ` Jamal Hadi Salim
  2025-11-14 16:35                 ` Eric Dumazet
  0 siblings, 1 reply; 21+ messages in thread
From: Jamal Hadi Salim @ 2025-11-14 16:28 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: kuba, davem, pabeni, horms, xiyou.wangcong, jiri, kuniyu, willemb,
	netdev, eric.dumazet, hawk, patchwork-bot+netdevbpf, toke

On Thu, Nov 13, 2025 at 1:55 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> [...]
>
> > In my case, I had 5 million requeues per second, and at that point
> > you start noticing something is wrong ;)
>
> That's high ;-> For the NUC with igc, it's <1%. Regardless, the
> eagerness for TX BUSY implies reduced performance due to the early
> bailout..

Ok, we are going to do some testing RSN, however, my ADHD won't let
this requeue thing go ;->

So on at least i40e, when you start sending say >2Mpps (forwarding or
tc mirred redirect) - the TX BUSY is most certainly not because the
driver is enthusiastically bailing out. Rather, this is due to BQL -
and specifically because netdev_tx_sent_queue() stops the queue;
subsequent packets from the stack will get the magic TX_BUSY label in
sch_direct_xmit().
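
To illustrate - a minimal sketch of the usual driver pattern, written
from memory, so the names and details are illustrative rather than
lifted from any specific driver:

	/* tail of a typical ndo_start_xmit() with BQL.
	 * netdev_tx_sent_queue() charges the bytes against the BQL
	 * limit and stops the txq once the budget goes negative; the
	 * next packet the stack hands to sch_direct_xmit() then sees
	 * a stopped queue and is counted as a requeue, without the
	 * driver ever returning NETDEV_TX_BUSY itself.
	 */
	netdev_tx_sent_queue(txq, bytecount);

	/* ring the doorbell only at the end of an xmit_more batch,
	 * or if the queue was just stopped
	 */
	if (!netdev_xmit_more() || netif_xmit_stopped(txq))
		writel(next_to_use, tx_ring->tail);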

Some context:
For forwarding benchmarks, the typical idea is to use RSS and IRQ
binding to a specific CPU, then craft traffic patterns on the sender
so that the test machine sees a very smooth distribution across the
different CPUs, i.e. the goal is almost perfect load balancing.

Q: In that case, would the defer list ever accumulate more than one
packet? Gut feeling says no.
It would have been nice to have an optional histogram to see the distribution..

cheers,
jamal

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches
  2025-11-14 16:28               ` Jamal Hadi Salim
@ 2025-11-14 16:35                 ` Eric Dumazet
  2025-11-14 17:13                   ` Jamal Hadi Salim
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2025-11-14 16:35 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: kuba, davem, pabeni, horms, xiyou.wangcong, jiri, kuniyu, willemb,
	netdev, eric.dumazet, hawk, patchwork-bot+netdevbpf, toke

On Fri, Nov 14, 2025 at 8:28 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> [...]
>
> Q: In that case, would the defer list ever accumulate more than one
> packet? Gut feeling says no.

It can accumulate way more, when there is a mix of very small packets
and big TSO ones.

If you had a lot of large TSO packets being sent, the queue's BQL limit
can reach 600,000 easily.
TX completion happens, the queue is empty, but the latched limit is
600,000, based on the last tx-completion round.

Then you have small packets of 64 bytes being sent very fast (say from pktgen):

600,000 / 64 = 9375 packets before BQL would stop the queue.

But most TX queues have a limit of 1024 or 2048 skbs... so they will
stop the queue _before_ BQL does.


> It would have been nice to have an optional histogram to see the distribution..

bpftrace is a nice way to build histograms.
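
For example (untested; the qdisc:qdisc_dequeue tracepoint exposes the
batch size as 'packets'):

bpftrace -e 'tracepoint:qdisc:qdisc_dequeue { @batch = hist(args->packets); }'

That gives a log2 histogram of how many packets each dequeue pulled.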


^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches
  2025-11-14 16:35                 ` Eric Dumazet
@ 2025-11-14 17:13                   ` Jamal Hadi Salim
  2025-11-14 18:52                     ` Eric Dumazet
  0 siblings, 1 reply; 21+ messages in thread
From: Jamal Hadi Salim @ 2025-11-14 17:13 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: kuba, davem, pabeni, horms, xiyou.wangcong, jiri, kuniyu, willemb,
	netdev, eric.dumazet, hawk, patchwork-bot+netdevbpf, toke

On Fri, Nov 14, 2025 at 11:36 AM Eric Dumazet <edumazet@google.com> wrote:
>
> On Fri, Nov 14, 2025 at 8:28 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> >
> > [...]
> >
> > Q: In that case, would the defer list ever accumulate more than one
> > packet? Gut feeling says no.
>
> It can accumulate way more, when there is a mix of very small packets
> and big TSO ones.
>
> If you had a lot of large TSO packets being sent, the queue's BQL limit
> can reach 600,000 easily.
> TX completion happens, the queue is empty, but the latched limit is
> 600,000, based on the last tx-completion round.
>
> Then you have small packets of 64 bytes being sent very fast (say from pktgen):
>
> 600,000 / 64 = 9375 packets before BQL would stop the queue.
>
> But most TX queues have a limit of 1024 or 2048 skbs... so they will
> stop the queue _before_ BQL does.
>

Nice description, will check.
Remember, our use case is a middle box which receives pkts on one
netdev, does some processing, and sends to another. Essentially, we
pull from the rx ring of the src netdev, process, and send to the tx
ring of the other netdev. No batching or multiple CPUs funneling into
one txq, and very little if any TSO or locally generated traffic - and
of course the benchmark is on 64B pkts.

>
> > It would have been nice to have an optional histogram to see the distribution..
>
> bpftrace is a nice way to build histograms.

Any tip will help; otherwise we will figure it out..

cheers,
jamal

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches
  2025-11-14 17:13                   ` Jamal Hadi Salim
@ 2025-11-14 18:52                     ` Eric Dumazet
  2025-11-17 21:21                       ` Jamal Hadi Salim
  0 siblings, 1 reply; 21+ messages in thread
From: Eric Dumazet @ 2025-11-14 18:52 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: kuba, davem, pabeni, horms, xiyou.wangcong, jiri, kuniyu, willemb,
	netdev, eric.dumazet, hawk, patchwork-bot+netdevbpf, toke

On Fri, Nov 14, 2025 at 9:13 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> [...]
>
> Nice description, will check.
> Remember, our use case is a middle box which receives pkts on one
> netdev, does some processing, and sends to another. Essentially, we
> pull from the rx ring of the src netdev, process, and send to the tx
> ring of the other netdev. No batching or multiple CPUs funneling into
> one txq, and very little if any TSO or locally generated traffic - and
> of course the benchmark is on 64B pkts.

One thing that many drivers get wrong is that they limit the number of
packets that a napi_poll() can tx-complete at a time.

BQL was meant to adjust its limit based on the number of bytes per round,
and on whether the queue has been stopped (because of the BQL limit) in
the last round.

So really, a driver must tx-complete as many packets as possible.

Otherwise you may see spikes under stress, even if your load is almost
constant.
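
Something like this - a hand-written sketch, all foo_* names made up:

	/* the anti-pattern: bounding TX completion by the NAPI budget,
	 * so a burst of small packets is reclaimed over many polls and
	 * BQL keeps the queue stopped longer than necessary.
	 */
	static int foo_napi_poll(struct napi_struct *napi, int budget)
	{
		struct foo_ring *ring = container_of(napi, struct foo_ring, napi);

		/* wrong: reclaims at most 'budget' TX descriptors per poll */
		foo_clean_tx(ring, budget);

		/* better: reclaim everything that completed, e.g.
		 * foo_clean_tx(ring, ring->count);
		 * the NAPI budget is only meant to bound RX work.
		 */
		return foo_clean_rx(ring, budget);
	}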

In fact I am a bit lost on what your problem is...

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches
  2025-11-14 18:52                     ` Eric Dumazet
@ 2025-11-17 21:21                       ` Jamal Hadi Salim
  2025-11-18 15:33                         ` Jamal Hadi Salim
  0 siblings, 1 reply; 21+ messages in thread
From: Jamal Hadi Salim @ 2025-11-17 21:21 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: kuba, davem, pabeni, horms, xiyou.wangcong, jiri, kuniyu, willemb,
	netdev, eric.dumazet, hawk, patchwork-bot+netdevbpf, toke

Hi Eric,
Sorry - was distracted.

On Fri, Nov 14, 2025 at 1:52 PM Eric Dumazet <edumazet@google.com> wrote:
>
> [...]
>
> One thing that many drivers get wrong is that they limit the number of
> packets that a napi_poll() can tx-complete at a time.
>

I suppose drivers these days do the replenishing at napi_poll() time -
but it could also be done in the tx path when a driver fails to get
space on the tx ring. I think at one point another strategy was to turn
on thresholds for TX completion interrupts, so you get a trigger to
replenish - my gut feeling is that this last one was probably deemed
bad for performance.

> BQL was meant to adjust its limit based on the number of bytes per round,
> and on whether the queue has been stopped (because of the BQL limit) in
> the last round.
>

For our benchmarking I don't think BQL is adding much value - more below..

> So really, a driver must tx-complete as many packets as possible.
>

IIUC, the i40e will replenish up to 256 descriptors, which is more than
the default NAPI weight (64).
So it should be fine there?
Is 256 a good number for a weight of 64? I'm not sure how these
thresholds are chosen; is it a factor of the tx ring size (default of
512, so 256 is 50%) or is it based on the napi weight? The max ring
size i40e can do for tx/rx is 8160; it defaults to 512/512.

It used to be that you knew your hardware and its limitations, and your
strategy of when and where to replenish was sometimes based on
experimentation.
i40e_clean_tx_irq() is entered on every napi poll for example...

> Otherwise you may see spikes under stress, even if your load is almost
> constant.

I see.
So the trick is to use the max tx ring size and then increase the
weight for every replenish? We can set the TX ring to the max and
increase the replenish size to cover all of it..
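
E.g. something like this (assuming the ring can be resized with
ethtool -G; 8160 is the i40e maximum mentioned above):

$ ethtool -G eth1 tx 8160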

> In fact I am a bit lost on what your problem is...

Well, it started with observation that there are many requeues ;-> And
in my initial thought was the tx side was not keeping up. And then it
turned out that it was bql that was causing the requeues.

A forwarding example:
--> rx ring x on eth0 --> tc mirred --> tx ring y on eth1 ("tc mirred"
could be replaced with forwarding/bridging)
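
i.e. roughly this setup (interface names illustrative):

$ tc qdisc add dev eth0 clsact
$ tc filter add dev eth0 ingress matchall action mirred egress redirect dev eth1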

As you can see, the rx softirq will likely run from napi poll all the
way to completion on the tx side. Our tests are for 16 flows which are
crafted to distribute nicely via RSS to hit 16 CPUs on 16 rx rings
(one per CPU), then forwarding to 16 tx rings. Each packet is 64B. That
way 16 CPUs are kept busy in parallel. If it all works out, there
should be zero lock contention on the tx side..

If we turn off BQL, there should be _zero_ requeues, which in my
thinking should make things faster..
we'll compare with/without BQL.
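
For the "without" case, a sketch of one way to effectively disable BQL
via sysfs (IIRC the bql knobs accept "max"):

for q in /sys/class/net/eth1/queues/tx-*/byte_queue_limits; do
        echo max > $q/limit_min
done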

The motivation for all this was your work to add the defer list - I
would like to check if we have a regression for forwarding workloads.
In theory, for our benchmark, we should never be able to accumulate
more than one packet on that defer list (if ever), so all that extra
code is not going to be useful for us.
That is fine as long as the additional lines of code we are now
hitting don't affect us too much..

cheers,
jamal

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches
  2025-11-17 21:21                       ` Jamal Hadi Salim
@ 2025-11-18 15:33                         ` Jamal Hadi Salim
  2025-11-18 15:38                           ` Eric Dumazet
  0 siblings, 1 reply; 21+ messages in thread
From: Jamal Hadi Salim @ 2025-11-18 15:33 UTC (permalink / raw)
  To: Eric Dumazet
  Cc: kuba, davem, pabeni, horms, xiyou.wangcong, jiri, kuniyu, willemb,
	netdev, eric.dumazet, hawk, patchwork-bot+netdevbpf, toke

On Mon, Nov 17, 2025 at 4:21 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>
> Hi Eric,
> Sorry - was distracted.
>
> On Fri, Nov 14, 2025 at 1:52 PM Eric Dumazet <edumazet@google.com> wrote:
> >
> > On Fri, Nov 14, 2025 at 9:13 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > >
> > > On Fri, Nov 14, 2025 at 11:36 AM Eric Dumazet <edumazet@google.com> wrote:
> > > >
> > > > On Fri, Nov 14, 2025 at 8:28 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > >
> > > > > On Thu, Nov 13, 2025 at 1:55 PM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > >
> > > > > > On Thu, Nov 13, 2025 at 1:36 PM Eric Dumazet <edumazet@google.com> wrote:
> > > > > > >
> > > > > > > On Thu, Nov 13, 2025 at 10:30 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > > > >
> > > > > > > > On Thu, Nov 13, 2025 at 1:08 PM Eric Dumazet <edumazet@google.com> wrote:
> > > > > > > > >
> > > > > > > > > On Thu, Nov 13, 2025 at 9:53 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
> > > > > > > > > >
> > > > > > > > > > [..]
> > > > > > > > > > Eric,
> > > > > > > > > >
> > > > > > > > > > So you are correct that requeues exist even before your changes to
> > > > > > > > > > speed up the tx path - two machines, one with a 6.5 and another with a
> > > > > > > > > > 6.8 variant, exhibit this phenomenon with very low traffic... which got
> > > > > > > > > > me a little curious.
> > > > > > > > > > My initial thought was that perhaps it was related to the mq/fq_codel
> > > > > > > > > > combo, but a short run shows requeues occur on a couple of other qdiscs
> > > > > > > > > > (e.g., prio) and mq children (e.g., pfifo), which rules out fq_codel as
> > > > > > > > > > a contributor to the requeues.
> > > > > > > > > > Example, this NUC i am typing on right now, after changing the root qdisc:
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > $ uname -r
> > > > > > > > > > 6.8.0-87-generic
> > > > > > > > > > $
> > > > > > > > > > qdisc prio 8004: dev eno1 root refcnt 5 bands 8 priomap 1 2 2 2 1 2 0
> > > > > > > > > > 0 1 1 1 1 1 1 1 1
> > > > > > > > > >  Sent 360948039 bytes 1015807 pkt (dropped 0, overlimits 0 requeues 1528)
> > > > > > > > > >  backlog 0b 0p requeues 1528
> > > > > > > > > > ---
> > > > > > > > > >
> > > > > > > > > > and 20-30 seconds later:
> > > > > > > > > > ---
> > > > > > > > > > qdisc prio 8004: dev eno1 root refcnt 5 bands 8 priomap 1 2 2 2 1 2 0
> > > > > > > > > > 0 1 1 1 1 1 1 1 1
> > > > > > > > > >  Sent 361867275 bytes 1017386 pkt (dropped 0, overlimits 0 requeues 1531)
> > > > > > > > > >  backlog 0b 0p requeues 1531
> > > > > > > > > > ----
> > > > > > > > > >
> > > > > > > > > > Reel cheep NIC doing 1G with 4 tx rings:
> > > > > > > > > > ---
> > > > > > > > > > $ ethtool -i eno1
> > > > > > > > > > driver: igc
> > > > > > > > > > version: 6.8.0-87-generic
> > > > > > > > > > firmware-version: 1085:8770
> > > > > > > > > > expansion-rom-version:
> > > > > > > > > > bus-info: 0000:02:00.0
> > > > > > > > > > supports-statistics: yes
> > > > > > > > > > supports-test: yes
> > > > > > > > > > supports-eeprom-access: yes
> > > > > > > > > > supports-register-dump: yes
> > > > > > > > > > supports-priv-flags: yes
> > > > > > > > > >
> > > > > > > > > > $ ethtool eno1
> > > > > > > > > > Settings for eno1:
> > > > > > > > > > Supported ports: [ TP ]
> > > > > > > > > > Supported link modes:   10baseT/Half 10baseT/Full
> > > > > > > > > >                         100baseT/Half 100baseT/Full
> > > > > > > > > >                         1000baseT/Full
> > > > > > > > > >                         2500baseT/Full
> > > > > > > > > > Supported pause frame use: Symmetric
> > > > > > > > > > Supports auto-negotiation: Yes
> > > > > > > > > > Supported FEC modes: Not reported
> > > > > > > > > > Advertised link modes:  10baseT/Half 10baseT/Full
> > > > > > > > > >                         100baseT/Half 100baseT/Full
> > > > > > > > > >                         1000baseT/Full
> > > > > > > > > >                         2500baseT/Full
> > > > > > > > > > Advertised pause frame use: Symmetric
> > > > > > > > > > Advertised auto-negotiation: Yes
> > > > > > > > > > Advertised FEC modes: Not reported
> > > > > > > > > > Speed: 1000Mb/s
> > > > > > > > > > Duplex: Full
> > > > > > > > > > Auto-negotiation: on
> > > > > > > > > > Port: Twisted Pair
> > > > > > > > > > PHYAD: 0
> > > > > > > > > > Transceiver: internal
> > > > > > > > > > MDI-X: off (auto)
> > > > > > > > > > netlink error: Operation not permitted
> > > > > > > > > >         Current message level: 0x00000007 (7)
> > > > > > > > > >                                drv probe link
> > > > > > > > > > Link detected: yes
> > > > > > > > > > ----
> > > > > > > > > >
> > > > > > > > > > Requeues should only happen if the driver is overwhelmed on the tx
> > > > > > > > > > side - i.e., the tx ring of choice has no more space. Back in the day,
> > > > > > > > > > this was not a very common event.
> > > > > > > > > > That can certainly be justified today with several explanations: a)
> > > > > > > > > > modern processors are getting faster, b) the tx code path has become
> > > > > > > > > > more efficient (true from inspection and your results, but those
> > > > > > > > > > patches are not on my small systems), c) (unlikely, but) we are
> > > > > > > > > > misaccounting requeues (need to look at the code), or d) the driver
> > > > > > > > > > is too eager to return TX_BUSY.
> > > > > > > > > >
> > > > > > > > > > Thoughts?
> > > > > > > > >
> > > > > > > > > requeues can happen because some drivers do not use skb->len for the
> > > > > > > > > BQL budget, but something bigger for GSO packets,
> > > > > > > > > because they want to account for the (N) headers.
> > > > > > > > >
> > > > > > > > > So the core networking stack could pull too many packets from the
> > > > > > > > > qdisc for one xmit_more batch; then ndo_start_xmit() at some point
> > > > > > > > > stops the queue before the end of the batch, because the BQL limit
> > > > > > > > > is hit sooner.
> > > > > > > > >
> > > > > > > > > I think drivers should not be overzealous; BQL is a best effort, we do
> > > > > > > > > not care about extra headers.
> > > > > > > > >
> > > > > > > > > drivers/net/ethernet/intel/igc/igc_main.c is one of the overzealous drivers ;)
> > > > > > > > >
> > > > > > > > > igc_tso() ...
> > > > > > > > >
> > > > > > > > > /* update gso size and bytecount with header size */
> > > > > > > > > first->gso_segs = skb_shinfo(skb)->gso_segs;
> > > > > > > > > first->bytecount += (first->gso_segs - 1) * *hdr_len;
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > Ok, the 25G i40e driver we are going to run tests on seems to be
> > > > > > > > suffering from the same enthusiasm ;->
> > > > > > > > I guess it's the same codebase..
> > > > > > > > Very few drivers though seem to be doing what you suggest. Of course
> > > > > > > > idpf is one of those ;->
> > > > > > >
> > > > > > > Note that few requeues are ok.
> > > > > > >
> > > > > > > In my case, I had 5 million requeues per second, and at that point
> > > > > > > you start noticing something is wrong ;)
> > > > > >
> > > > > > That's high ;-> For the nuc with igc, it's <1%. Regardless, the
> > > > > > eagerness for TX_BUSY implies reduced performance due to the early
> > > > > > bailout..
> > > > >
> > > > > Ok, we are going to do some testing RSN; however, my adhd won't let
> > > > > this requeue thing go ;->
> > > > >
> > > > > So on at least i40e, when you start sending say >2Mpps (forwarding or
> > > > > tc mirred redirect), the TX_BUSY is most certainly not because the
> > > > > driver is enthusiastically bailing out. Rather, this is due to BQL -
> > > > > specifically because netdev_tx_sent_queue() stops the queue;
> > > > > subsequent packets from the stack will get the magic TX_BUSY label in
> > > > > sch_direct_xmit().
> > > > >
> > > > > Some context:
> > > > > For forwarding use case benchmarking, the typical idea is to use RSS
> > > > > and IRQ binding to a specific CPU, then craft some traffic patterns on
> > > > > the sender so that the test machine has a very smooth distribution
> > > > > across the different CPUs, i.e., the goal is to have almost perfect
> > > > > load balancing.
> > > > >
> > > > > Q: In that case, would the defer list ever accumulate more than one
> > > > > packet? Gut feeling says no.
> > > >
> > > > It can accumulate way more, when there is a mix of very small packets
> > > > and big TSO ones.
> > > >
> > > > If you had a lot of large TSO packets being sent, the queue bql/limit
> > > > can reach 600,000 easily.
> > > > TX completion happens, queue is empty, but latched limit is 600,000
> > > > based on the last tx-completion round.
> > > >
> > > > Then you have small packets of 64 bytes being sent very fast (say from pktgen)
> > > >
> > > > 600,000 / 64 = 9375
> > > >
> > > > But most TX queues have a limit of 1024 or 2048 skbs... so they will
> > > > stop the queue _before_ BQL does.
> > > >
> > >
> > > Nice description, will check.
> > > Remember, our use case is a middle box which receives pkts on one
> > > netdev, does some processing, and sends to another. Essentially, we
> > > pull from the rx ring of the src netdev, process, and send to the tx
> > > ring of the other netdev. No batching or multiple CPUs funneling to
> > > one txq, and very little if any TSO or locally generated traffic - and
> > > of course the benchmark is on 64B pkts.
> >
> > One thing that many drivers get wrong is that they limit the number of
> > packets that a napi_poll() can tx-complete at a time.
> >
>
> I suppose drivers these days do the replenishing at napi_poll() time -
> but it could also be done in the tx path when a driver fails to get
> space on the tx ring. I think at one point another strategy was to
> enable thresholds for TX completion interrupts, so you get the trigger
> to replenish - my gut feeling is this last one was probably deemed bad
> for performance.
>
> > BQL was meant to adjust its limit based on the number of bytes per round,
> > and the fact that the queue has been stopped (because of BQL limit) in
> > the last round.
> >
>
> For our benchmarking I don't think BQL is adding much value - more below..
>
> > So really, a driver must dequeue as many packets as possible.
> >
>
> IIUC, the i40e will replenish up to 256 descriptors, which is more than
> the default NAPI weight (64).
> So it should be fine there?
> Is 256 a good number for a weight of 64? I'm not sure how these
> thresholds are chosen; is it a factor of the tx ring size (default of
> 512, so 256 is 50%) or is it based on the napi weight? The max tx/rx
> ring size i40e can do is 8160; it defaults to 512/512.
>
> It used to be that you knew your hardware and its limitations, and your
> strategy of when and where to replenish was sometimes based on
> experimentation.
> i40e_clean_tx_irq() is entered on every napi poll, for example...
>
> > Otherwise you may have spikes, even if your load is almost constant,
> > when under stress.
>
> I see.
> So the trick is to use the max tx ring size and then increase the
> weight for every replenish? We can set the TX ring to the max and
> increase the replenish size to cover all of it..
>
> > In fact I am a bit lost on what your problem is...
>
> Well, it started with the observation that there are many requeues ;->
> My initial thought was that the tx side was not keeping up, and then it
> turned out that it was BQL that was causing the requeues.
>
> A forwarding example:
> --> rx ring x on eth0 --> tc mirred --> tx ring y on eth1 ("tc mirred"
> could be replaced with forwarding/bridging)
>
> As you can see, the rx softirq will likely run from napi poll all the
> way to completion on the tx side. Our tests use 16 flows which are
> crafted to distribute nicely via RSS and hit 16 cpus on 16 rx rings
> (one per cpu), then forward to 16 tx rings. Each packet is 64B. That
> way 16 CPUs are kept busy in parallel. If it all works out there
> should be zero lock contention on the tx side..
>
> If we turn off BQL, there should be _zero_ requeues, which in my
> thinking should make things faster..
> We'll compare with/without BQL.
>
> Motivation for all this was your work to add the defer list - I would
> like to check whether we have a regression for forwarding workloads.
> In theory, for our benchmark, we should never be able to accumulate
> more than one packet on that defer list (if ever), so all that extra
> code is not going to be useful for us.
> That is fine as long as the new additional lines of code we are
> hitting don't affect us too much..
>

And the summary is: there's a very tiny regression - but it is not
really noticeable. The test was in the 10Mpps+ range, and although the
difference is consistent/repeatable, it falls within the margin of
error of 1-2Kpps.
So, it should be fine..
Still interested in the questions I asked though ;->

cheers,
jamal

^ permalink raw reply	[flat|nested] 21+ messages in thread

* Re: [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches
  2025-11-18 15:33                         ` Jamal Hadi Salim
@ 2025-11-18 15:38                           ` Eric Dumazet
  0 siblings, 0 replies; 21+ messages in thread
From: Eric Dumazet @ 2025-11-18 15:38 UTC (permalink / raw)
  To: Jamal Hadi Salim
  Cc: kuba, davem, pabeni, horms, xiyou.wangcong, jiri, kuniyu, willemb,
	netdev, eric.dumazet, hawk, patchwork-bot+netdevbpf, toke

On Tue, Nov 18, 2025 at 7:33 AM Jamal Hadi Salim <jhs@mojatatu.com> wrote:
>

> And the summary is: there's a very tiny regression - but it is not
> really noticeable. The test was in the 10Mpps+ range, and although the
> difference is consistent/repeatable, it falls within the margin of
> error of 1-2Kpps.
> So, it should be fine..
> Still interested in the questions I asked though ;->

If you do not care about latency spikes and want to squeeze every
possible bit from your NIC,
you can still increase /proc/sys/net/core/dev_weight as you like.
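
For context, __qdisc_run() burns that budget roughly like this
(paraphrased from net/sched/sch_generic.c; details vary by kernel
version, so treat this as a sketch rather than a verbatim quote):

void __qdisc_run(struct Qdisc *q)
{
        int quota = dev_tx_weight; /* /proc/sys/net/core/dev_weight, default 64 */
        int packets;

        while (qdisc_restart(q, &packets)) {
                quota -= packets;
                if (quota <= 0 || need_resched()) {
                        /* out of budget: reschedule and finish later */
                        __netif_schedule(q);
                        break;
                }
        }
}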

^ permalink raw reply	[flat|nested] 21+ messages in thread

end of thread, other threads:[~2025-11-18 15:38 UTC | newest]

Thread overview: 21+ messages
2025-11-09 16:12 [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches Eric Dumazet
2025-11-10 10:14 ` Toke Høiland-Jørgensen
2025-11-10 10:36 ` Jesper Dangaard Brouer
2025-11-10 11:06   ` Eric Dumazet
2025-11-10 15:04     ` Jesper Dangaard Brouer
2025-11-10 15:13       ` Eric Dumazet
2025-11-12  2:10 ` patchwork-bot+netdevbpf
2025-11-12 10:48   ` Jamal Hadi Salim
2025-11-12 11:21     ` Eric Dumazet
2025-11-13 17:53     ` Jamal Hadi Salim
2025-11-13 18:08       ` Eric Dumazet
2025-11-13 18:30         ` Jamal Hadi Salim
2025-11-13 18:35           ` Eric Dumazet
2025-11-13 18:55             ` Jamal Hadi Salim
2025-11-14 16:28               ` Jamal Hadi Salim
2025-11-14 16:35                 ` Eric Dumazet
2025-11-14 17:13                   ` Jamal Hadi Salim
2025-11-14 18:52                     ` Eric Dumazet
2025-11-17 21:21                       ` Jamal Hadi Salim
2025-11-18 15:33                         ` Jamal Hadi Salim
2025-11-18 15:38                           ` Eric Dumazet
