From: Jesper Dangaard Brouer <hawk@kernel.org>
To: Eric Dumazet <edumazet@google.com>,
"David S . Miller" <davem@davemloft.net>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>
Cc: "Simon Horman" <horms@kernel.org>,
"Jamal Hadi Salim" <jhs@mojatatu.com>,
"Cong Wang" <xiyou.wangcong@gmail.com>,
"Jiri Pirko" <jiri@resnulli.us>,
"Kuniyuki Iwashima" <kuniyu@google.com>,
"Willem de Bruijn" <willemb@google.com>,
netdev@vger.kernel.org, eric.dumazet@gmail.com,
"Toke Høiland-Jørgensen" <toke@redhat.com>
Subject: Re: [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches
Date: Mon, 10 Nov 2025 11:36:33 +0100 [thread overview]
Message-ID: <cb568e91-9114-4e9a-ba88-eb4fc3772690@kernel.org> (raw)
In-Reply-To: <20251109161215.2574081-1-edumazet@google.com>
On 09/11/2025 17.12, Eric Dumazet wrote:
> After commit 100dfa74cad9 ("inet: dev_queue_xmit() llist adoption")
> I started seeing many qdisc requeues on IDPF under high TX workload.
>
> $ tc -s qd sh dev eth1 handle 1: ; sleep 1; tc -s qd sh dev eth1 handle 1:
> qdisc mq 1: root
> Sent 43534617319319 bytes 268186451819 pkt (dropped 0, overlimits 0 requeues 3532840114)
> backlog 1056Kb 6675p requeues 3532840114
> qdisc mq 1: root
> Sent 43554665866695 bytes 268309964788 pkt (dropped 0, overlimits 0 requeues 3537737653)
> backlog 781164b 4822p requeues 3537737653
>
> This is caused by try_bulk_dequeue_skb() being only limited by BQL budget.
>
> perf record -C120-239 -e qdisc:qdisc_dequeue sleep 1 ; perf script
> ...
> netperf 75332 [146] 2711.138269: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1292 skbaddr=0xff378005a1e9f200
> netperf 75332 [146] 2711.138953: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1213 skbaddr=0xff378004d607a500
> netperf 75330 [144] 2711.139631: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1233 skbaddr=0xff3780046be20100
> netperf 75333 [147] 2711.140356: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1093 skbaddr=0xff37800514845b00
> netperf 75337 [151] 2711.141037: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1353 skbaddr=0xff37800460753300
> netperf 75337 [151] 2711.141877: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1367 skbaddr=0xff378004e72c7b00
> netperf 75330 [144] 2711.142643: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1202 skbaddr=0xff3780045bd60000
> ...
>
> This is bad because :
>
> 1) Large batches hold one victim cpu for a very long time.
>
> 2) Driver often hit their own TX ring limit (all slots are used).
>
> 3) We call dev_requeue_skb()
>
> 4) Requeues are using a FIFO (q->gso_skb), breaking qdisc ability to
> implement FQ or priority scheduling.
>
> 5) dequeue_skb() gets packets from q->gso_skb one skb at a time
> with no xmit_more support. This is causing many spinlock games
> between the qdisc and the device driver.
>
> Requeues were supposed to be very rare, lets keep them this way.
>
> Limit batch sizes to /proc/sys/net/core/dev_weight (default 64) as
> __qdisc_run() was designed to use.
>
> Fixes: 5772e9a3463b ("qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Jesper Dangaard Brouer <hawk@kernel.org>
> Cc: Toke Høiland-Jørgensen <toke@redhat.com>
> ---
> net/sched/sch_generic.c | 17 ++++++++++-------
> 1 file changed, 10 insertions(+), 7 deletions(-)
>
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index d9a98d02a55fc361a223f3201e37b6a2b698bb5e..852e603c17551ee719bf1c561848d5ef0699ab5d 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -180,9 +180,10 @@ static inline void dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
> static void try_bulk_dequeue_skb(struct Qdisc *q,
> struct sk_buff *skb,
> const struct netdev_queue *txq,
> - int *packets)
> + int *packets, int budget)
> {
> int bytelimit = qdisc_avail_bulklimit(txq) - skb->len;
> + int cnt = 0;
You patch makes perfect sense, that we want this budget limit.
But: Why isn't bytelimit saving us?
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
> while (bytelimit > 0) {
> struct sk_buff *nskb = q->dequeue(q);
> @@ -193,8 +194,10 @@ static void try_bulk_dequeue_skb(struct Qdisc *q,
> bytelimit -= nskb->len; /* covers GSO len */
> skb->next = nskb;
> skb = nskb;
> - (*packets)++; /* GSO counts as one pkt */
> + if (++cnt >= budget)
> + break;
> }
> + (*packets) += cnt;
> skb_mark_not_on_list(skb);
> }
>
> @@ -228,7 +231,7 @@ static void try_bulk_dequeue_skb_slow(struct Qdisc *q,
> * A requeued skb (via q->gso_skb) can also be a SKB list.
> */
> static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
> - int *packets)
> + int *packets, int budget)
> {
> const struct netdev_queue *txq = q->dev_queue;
> struct sk_buff *skb = NULL;
> @@ -295,7 +298,7 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
> if (skb) {
> bulk:
> if (qdisc_may_bulk(q))
> - try_bulk_dequeue_skb(q, skb, txq, packets);
> + try_bulk_dequeue_skb(q, skb, txq, packets, budget);
> else
> try_bulk_dequeue_skb_slow(q, skb, packets);
> }
> @@ -387,7 +390,7 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
> * >0 - queue is not empty.
> *
> */
> -static inline bool qdisc_restart(struct Qdisc *q, int *packets)
> +static inline bool qdisc_restart(struct Qdisc *q, int *packets, int budget)
> {
> spinlock_t *root_lock = NULL;
> struct netdev_queue *txq;
> @@ -396,7 +399,7 @@ static inline bool qdisc_restart(struct Qdisc *q, int *packets)
> bool validate;
>
> /* Dequeue packet */
> - skb = dequeue_skb(q, &validate, packets);
> + skb = dequeue_skb(q, &validate, packets, budget);
> if (unlikely(!skb))
> return false;
>
> @@ -414,7 +417,7 @@ void __qdisc_run(struct Qdisc *q)
> int quota = READ_ONCE(net_hotdata.dev_tx_weight);
> int packets;
>
> - while (qdisc_restart(q, &packets)) {
> + while (qdisc_restart(q, &packets, quota)) {
> quota -= packets;
> if (quota <= 0) {
> if (q->flags & TCQ_F_NOLOCK)
next prev parent reply other threads:[~2025-11-10 10:36 UTC|newest]
Thread overview: 21+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-11-09 16:12 [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches Eric Dumazet
2025-11-10 10:14 ` Toke Høiland-Jørgensen
2025-11-10 10:36 ` Jesper Dangaard Brouer [this message]
2025-11-10 11:06 ` Eric Dumazet
2025-11-10 15:04 ` Jesper Dangaard Brouer
2025-11-10 15:13 ` Eric Dumazet
2025-11-12 2:10 ` patchwork-bot+netdevbpf
2025-11-12 10:48 ` Jamal Hadi Salim
2025-11-12 11:21 ` Eric Dumazet
2025-11-13 17:53 ` Jamal Hadi Salim
2025-11-13 18:08 ` Eric Dumazet
2025-11-13 18:30 ` Jamal Hadi Salim
2025-11-13 18:35 ` Eric Dumazet
2025-11-13 18:55 ` Jamal Hadi Salim
2025-11-14 16:28 ` Jamal Hadi Salim
2025-11-14 16:35 ` Eric Dumazet
2025-11-14 17:13 ` Jamal Hadi Salim
2025-11-14 18:52 ` Eric Dumazet
2025-11-17 21:21 ` Jamal Hadi Salim
2025-11-18 15:33 ` Jamal Hadi Salim
2025-11-18 15:38 ` Eric Dumazet
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=cb568e91-9114-4e9a-ba88-eb4fc3772690@kernel.org \
--to=hawk@kernel.org \
--cc=davem@davemloft.net \
--cc=edumazet@google.com \
--cc=eric.dumazet@gmail.com \
--cc=horms@kernel.org \
--cc=jhs@mojatatu.com \
--cc=jiri@resnulli.us \
--cc=kuba@kernel.org \
--cc=kuniyu@google.com \
--cc=netdev@vger.kernel.org \
--cc=pabeni@redhat.com \
--cc=toke@redhat.com \
--cc=willemb@google.com \
--cc=xiyou.wangcong@gmail.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.