From: Jesper Dangaard Brouer <hawk@kernel.org>
To: Eric Dumazet <edumazet@google.com>,
"David S . Miller" <davem@davemloft.net>,
Jakub Kicinski <kuba@kernel.org>, Paolo Abeni <pabeni@redhat.com>
Cc: "Simon Horman" <horms@kernel.org>,
"Jamal Hadi Salim" <jhs@mojatatu.com>,
"Cong Wang" <xiyou.wangcong@gmail.com>,
"Jiri Pirko" <jiri@resnulli.us>,
"Kuniyuki Iwashima" <kuniyu@google.com>,
"Willem de Bruijn" <willemb@google.com>,
netdev@vger.kernel.org, eric.dumazet@gmail.com,
"Toke Høiland-Jørgensen" <toke@redhat.com>
Subject: Re: [PATCH net] net_sched: limit try_bulk_dequeue_skb() batches
Date: Mon, 10 Nov 2025 11:36:33 +0100
Message-ID: <cb568e91-9114-4e9a-ba88-eb4fc3772690@kernel.org>
In-Reply-To: <20251109161215.2574081-1-edumazet@google.com>
On 09/11/2025 17.12, Eric Dumazet wrote:
> After commit 100dfa74cad9 ("inet: dev_queue_xmit() llist adoption")
> I started seeing many qdisc requeues on IDPF under high TX workload.
>
> $ tc -s qd sh dev eth1 handle 1: ; sleep 1; tc -s qd sh dev eth1 handle 1:
> qdisc mq 1: root
> Sent 43534617319319 bytes 268186451819 pkt (dropped 0, overlimits 0 requeues 3532840114)
> backlog 1056Kb 6675p requeues 3532840114
> qdisc mq 1: root
> Sent 43554665866695 bytes 268309964788 pkt (dropped 0, overlimits 0 requeues 3537737653)
> backlog 781164b 4822p requeues 3537737653
>
> This is caused by try_bulk_dequeue_skb() being only limited by BQL budget.
>
> perf record -C120-239 -e qdisc:qdisc_dequeue sleep 1 ; perf script
> ...
> netperf 75332 [146] 2711.138269: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1292 skbaddr=0xff378005a1e9f200
> netperf 75332 [146] 2711.138953: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1213 skbaddr=0xff378004d607a500
> netperf 75330 [144] 2711.139631: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1233 skbaddr=0xff3780046be20100
> netperf 75333 [147] 2711.140356: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1093 skbaddr=0xff37800514845b00
> netperf 75337 [151] 2711.141037: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1353 skbaddr=0xff37800460753300
> netperf 75337 [151] 2711.141877: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1367 skbaddr=0xff378004e72c7b00
> netperf 75330 [144] 2711.142643: qdisc:qdisc_dequeue: dequeue ifindex=5 qdisc handle=0x80150000 parent=0x10013 txq_state=0x0 packets=1202 skbaddr=0xff3780045bd60000
> ...
>
> This is bad because:
>
> 1) Large batches hold one victim CPU for a very long time.
>
> 2) Drivers often hit their own TX ring limit (all slots are used).
>
> 3) We then call dev_requeue_skb().
>
> 4) Requeues use a FIFO (q->gso_skb), breaking the qdisc's ability to
>    implement FQ or priority scheduling.
>
> 5) dequeue_skb() gets packets from q->gso_skb one skb at a time
>    with no xmit_more support. This causes many spinlock games
>    between the qdisc and the device driver.
>
> Requeues were supposed to be very rare; let's keep them that way.
>
> Limit batch sizes to /proc/sys/net/core/dev_weight (default 64), the
> budget __qdisc_run() was designed to use.
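
(Side note, not part of the patch: the knob is easy to inspect at runtime,
e.g. with a minimal C reader like the sketch below. Note the effective TX
quota, net_hotdata.dev_tx_weight, is dev_weight scaled by
net.core.dev_weight_tx_bias, which I do not read here.)

/* Minimal sketch (not from the patch): print the base dev_weight
 * sysctl that the per-run packet quota is derived from.
 */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/net/core/dev_weight", "r");
	int weight = -1;

	if (f) {
		if (fscanf(f, "%d", &weight) != 1)
			weight = -1;
		fclose(f);
	}
	printf("net.core.dev_weight = %d\n", weight);
	return 0;
}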
>
> Fixes: 5772e9a3463b ("qdisc: bulk dequeue support for qdiscs with TCQ_F_ONETXQUEUE")
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> Cc: Jesper Dangaard Brouer <hawk@kernel.org>
> Cc: Toke Høiland-Jørgensen <toke@redhat.com>
> ---
> net/sched/sch_generic.c | 17 ++++++++++-------
> 1 file changed, 10 insertions(+), 7 deletions(-)
>
> diff --git a/net/sched/sch_generic.c b/net/sched/sch_generic.c
> index d9a98d02a55fc361a223f3201e37b6a2b698bb5e..852e603c17551ee719bf1c561848d5ef0699ab5d 100644
> --- a/net/sched/sch_generic.c
> +++ b/net/sched/sch_generic.c
> @@ -180,9 +180,10 @@ static inline void dev_requeue_skb(struct sk_buff *skb, struct Qdisc *q)
>  static void try_bulk_dequeue_skb(struct Qdisc *q,
>  				 struct sk_buff *skb,
>  				 const struct netdev_queue *txq,
> -				 int *packets)
> +				 int *packets, int budget)
>  {
>  	int bytelimit = qdisc_avail_bulklimit(txq) - skb->len;
> +	int cnt = 0;
Your patch makes perfect sense; we do want this budget limit.
But: why isn't the bytelimit saving us?
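
Partly answering my own question with a quick back-of-envelope, as a tiny
userspace C sketch (the BQL limit and packet size below are assumptions
picked for illustration, not values measured on the IDPF box):

/* Back-of-envelope only, not kernel code: how many skbs can one
 * try_bulk_dequeue_skb() call pull when only the BQL byte budget
 * bounds it? Both inputs below are assumed values.
 */
#include <stdio.h>

int main(void)
{
	int bql_limit = 256 * 1024;	/* assumed BQL limit on a fast NIC */
	int pkt_len   = 200;		/* assumed small-packet TX workload */

	printf("max bulk batch ~= %d packets\n", bql_limit / pkt_len);
	return 0;
}

With numbers in that ballpark the byte limit alone already allows batches
in the 1200-1400 packet range seen in the trace above, so it cannot save
us for small packets; hence the extra packet budget.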
Acked-by: Jesper Dangaard Brouer <hawk@kernel.org>
>  	while (bytelimit > 0) {
>  		struct sk_buff *nskb = q->dequeue(q);
> @@ -193,8 +194,10 @@ static void try_bulk_dequeue_skb(struct Qdisc *q,
>  		bytelimit -= nskb->len; /* covers GSO len */
>  		skb->next = nskb;
>  		skb = nskb;
> -		(*packets)++; /* GSO counts as one pkt */
> +		if (++cnt >= budget)
> +			break;
>  	}
> +	(*packets) += cnt;
>  	skb_mark_not_on_list(skb);
>  }
>
> @@ -228,7 +231,7 @@ static void try_bulk_dequeue_skb_slow(struct Qdisc *q,
>   * A requeued skb (via q->gso_skb) can also be a SKB list.
>   */
>  static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
> -				   int *packets)
> +				   int *packets, int budget)
>  {
>  	const struct netdev_queue *txq = q->dev_queue;
>  	struct sk_buff *skb = NULL;
> @@ -295,7 +298,7 @@ static struct sk_buff *dequeue_skb(struct Qdisc *q, bool *validate,
>  	if (skb) {
>  bulk:
>  		if (qdisc_may_bulk(q))
> -			try_bulk_dequeue_skb(q, skb, txq, packets);
> +			try_bulk_dequeue_skb(q, skb, txq, packets, budget);
>  		else
>  			try_bulk_dequeue_skb_slow(q, skb, packets);
>  	}
> @@ -387,7 +390,7 @@ bool sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
>   * >0 - queue is not empty.
>   *
>   */
> -static inline bool qdisc_restart(struct Qdisc *q, int *packets)
> +static inline bool qdisc_restart(struct Qdisc *q, int *packets, int budget)
>  {
>  	spinlock_t *root_lock = NULL;
>  	struct netdev_queue *txq;
> @@ -396,7 +399,7 @@ static inline bool qdisc_restart(struct Qdisc *q, int *packets)
>  	bool validate;
>  
>  	/* Dequeue packet */
> -	skb = dequeue_skb(q, &validate, packets);
> +	skb = dequeue_skb(q, &validate, packets, budget);
>  	if (unlikely(!skb))
>  		return false;
>  
> @@ -414,7 +417,7 @@ void __qdisc_run(struct Qdisc *q)
>  	int quota = READ_ONCE(net_hotdata.dev_tx_weight);
>  	int packets;
>  
> -	while (qdisc_restart(q, &packets)) {
> +	while (qdisc_restart(q, &packets, quota)) {
>  		quota -= packets;
>  		if (quota <= 0) {
>  			if (q->flags & TCQ_F_NOLOCK)
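
To see what the cap does in practice, here is a simplified userspace model
of the new dequeue loop (queue depth, BQL headroom and packet size are
made-up inputs, and this is of course not the real kernel code path):

/* Simplified model of the patched behaviour: a bulk dequeue now also
 * stops at the packet budget, even with plenty of BQL byte budget
 * left. All inputs below are made up for illustration.
 */
#include <stdio.h>

static int bulk_dequeue(int queue_len, int bytelimit, int pkt_len, int budget)
{
	int cnt = 0;

	while (bytelimit > 0 && queue_len > 0) {
		bytelimit -= pkt_len;
		queue_len--;
		if (++cnt >= budget)	/* the new cap from this patch */
			break;
	}
	return cnt;
}

int main(void)
{
	/* 10000 queued 200-byte packets, 256 KB of BQL headroom */
	printf("old-style batch (no pkt cap): %d\n",
	       bulk_dequeue(10000, 256 * 1024, 200, 1 << 30));
	printf("batch with budget = 64:       %d\n",
	       bulk_dequeue(10000, 256 * 1024, 200, 64));
	return 0;
}

With the default dev_weight of 64 a single bulk dequeue is now capped at 64
skbs (fewer once part of the quota has been consumed), while BQL still
bounds the bytes, so requeues should become rare again.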