* [PATCH net] net: add net.core.qdisc_max_burst
@ 2026-01-07 10:41 Eric Dumazet
2026-01-07 23:25 ` Cong Wang
` (2 more replies)
0 siblings, 3 replies; 5+ messages in thread
From: Eric Dumazet @ 2026-01-07 10:41 UTC (permalink / raw)
To: David S . Miller, Jakub Kicinski, Paolo Abeni
Cc: Simon Horman, netdev, eric.dumazet, Eric Dumazet, Neal Cardwell
In blamed commit, I added a check against the temporary queue
built in __dev_xmit_skb(). Idea was to drop packets early,
before any spinlock was acquired.
if (unlikely(defer_count > READ_ONCE(q->limit))) {
kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
return NET_XMIT_DROP;
}
It turned out that HTB Qdisc has a zero q->limit.
HTB limits packets on a per-class basis.
Some of our tests became flaky.
Add a new sysctl : net.core.qdisc_max_burst to control
how many packets can be stored in the temporary lockless queue.
Also add a new QDISC_BURST_DROP drop reason to better diagnose
future issues.
Thanks Neal !
Fixes: 100dfa74cad9 ("net: dev_queue_xmit() llist adoption")
Reported-and-bisected-by: Neal Cardwell <ncardwell@google.com>
Signed-off-by: Eric Dumazet <edumazet@google.com>
---
Documentation/admin-guide/sysctl/net.rst | 8 ++++++++
include/net/dropreason-core.h | 6 ++++++
include/net/hotdata.h | 1 +
net/core/dev.c | 6 +++---
net/core/hotdata.c | 1 +
net/core/sysctl_net_core.c | 7 +++++++
6 files changed, 26 insertions(+), 3 deletions(-)
diff --git a/Documentation/admin-guide/sysctl/net.rst b/Documentation/admin-guide/sysctl/net.rst
index 369a738a68193e897d880eeb2c5a22cd90833938..91fa4ccd326c2b6351fd028a1c5d1c69126bee5f 100644
--- a/Documentation/admin-guide/sysctl/net.rst
+++ b/Documentation/admin-guide/sysctl/net.rst
@@ -303,6 +303,14 @@ netdev_max_backlog
Maximum number of packets, queued on the INPUT side, when the interface
receives packets faster than kernel can process them.
+qdisc_max_burst
+------------------
+
+Maximum number of packets that can be temporarily stored before
+reaching qdisc.
+
+Default: 1000
+
netdev_rss_key
--------------
diff --git a/include/net/dropreason-core.h b/include/net/dropreason-core.h
index 58d91ccc56e0b54368c432fb9075ab174dc3a09f..a7b7abd66e215c4bcaece6f00ca03de3ac81396f 100644
--- a/include/net/dropreason-core.h
+++ b/include/net/dropreason-core.h
@@ -67,6 +67,7 @@
FN(TC_EGRESS) \
FN(SECURITY_HOOK) \
FN(QDISC_DROP) \
+ FN(QDISC_BURST_DROP) \
FN(QDISC_OVERLIMIT) \
FN(QDISC_CONGESTED) \
FN(CAKE_FLOOD) \
@@ -374,6 +375,11 @@ enum skb_drop_reason {
* failed to enqueue to current qdisc)
*/
SKB_DROP_REASON_QDISC_DROP,
+ /**
+ * @SKB_DROP_REASON_QDISC_BURST_DROP: dropped when net.core.qdisc_max_burst
+ * limit is hit.
+ */
+ SKB_DROP_REASON_QDISC_BURST_DROP,
/**
* @SKB_DROP_REASON_QDISC_OVERLIMIT: dropped by qdisc when a qdisc
* instance exceeds its total buffer size limit.
diff --git a/include/net/hotdata.h b/include/net/hotdata.h
index 4acec191c54ab367ca12fff590d1f8c8aad64651..6632b1aa7584821fd4ab42163b77dfff6732a45e 100644
--- a/include/net/hotdata.h
+++ b/include/net/hotdata.h
@@ -42,6 +42,7 @@ struct net_hotdata {
int netdev_budget_usecs;
int tstamp_prequeue;
int max_backlog;
+ int qdisc_max_burst;
int dev_tx_weight;
int dev_rx_weight;
int sysctl_max_skb_frags;
diff --git a/net/core/dev.c b/net/core/dev.c
index 36dc5199037edb1506e67f6ab5e977ff41efef59..a8238023774407adbdb2e5d09dc009802870bd4e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4190,8 +4190,8 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
do {
if (first_n && !defer_count) {
defer_count = atomic_long_inc_return(&q->defer_count);
- if (unlikely(defer_count > READ_ONCE(q->limit))) {
- kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
+ if (unlikely(defer_count > READ_ONCE(net_hotdata.qdisc_max_burst))) {
+ kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_BURST_DROP);
return NET_XMIT_DROP;
}
}
@@ -4209,7 +4209,7 @@ static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
ll_list = llist_del_all(&q->defer_list);
/* There is a small race because we clear defer_count not atomically
* with the prior llist_del_all(). This means defer_list could grow
- * over q->limit.
+ * over qdisc_max_burst.
*/
atomic_long_set(&q->defer_count, 0);
diff --git a/net/core/hotdata.c b/net/core/hotdata.c
index dddd5c287cf08ba75aec1cc546fd1bc48c0f7b26..a6db365808178d243f53ae1a817938fb17c3f968 100644
--- a/net/core/hotdata.c
+++ b/net/core/hotdata.c
@@ -17,6 +17,7 @@ struct net_hotdata net_hotdata __cacheline_aligned = {
.tstamp_prequeue = 1,
.max_backlog = 1000,
+ .qdisc_max_burst = 1000,
.dev_tx_weight = 64,
.dev_rx_weight = 64,
.sysctl_max_skb_frags = MAX_SKB_FRAGS,
diff --git a/net/core/sysctl_net_core.c b/net/core/sysctl_net_core.c
index 8d4decb2606fa18222a02e59dc889efa995d2eaa..05dd55cf8b58e6c6fce498a11c09f23fd56d8f34 100644
--- a/net/core/sysctl_net_core.c
+++ b/net/core/sysctl_net_core.c
@@ -429,6 +429,13 @@ static struct ctl_table net_core_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec
},
+ {
+ .procname = "qdisc_max_burst",
+ .data = &net_hotdata.qdisc_max_burst,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec
+ },
{
.procname = "netdev_rss_key",
.data = &netdev_rss_key,
--
2.52.0.351.gbe84eed79e-goog
^ permalink raw reply related [flat|nested] 5+ messages in thread* Re: [PATCH net] net: add net.core.qdisc_max_burst
2026-01-07 10:41 [PATCH net] net: add net.core.qdisc_max_burst Eric Dumazet
@ 2026-01-07 23:25 ` Cong Wang
2026-01-09 14:42 ` Eric Dumazet
2026-01-09 15:15 ` Neal Cardwell
2026-01-13 9:30 ` patchwork-bot+netdevbpf
2 siblings, 1 reply; 5+ messages in thread
From: Cong Wang @ 2026-01-07 23:25 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
netdev, eric.dumazet, Neal Cardwell
On Wed, Jan 07, 2026 at 10:41:59AM +0000, Eric Dumazet wrote:
> In blamed commit, I added a check against the temporary queue
> built in __dev_xmit_skb(). Idea was to drop packets early,
> before any spinlock was acquired.
>
> if (unlikely(defer_count > READ_ONCE(q->limit))) {
> kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
> return NET_XMIT_DROP;
> }
>
> It turned out that HTB Qdisc has a zero q->limit.
> HTB limits packets on a per-class basis.
> Some of our tests became flaky.
Hm, if q->limit is the problem here, why not introduce a new Qdisc
option for this?
>
> Add a new sysctl : net.core.qdisc_max_burst to control
> how many packets can be stored in the temporary lockless queue.
This becomes global instead of per-Qdisc. If this is intended, you might
want to document it explicitly in the documentation.
Regards,
Cong
^ permalink raw reply [flat|nested] 5+ messages in thread* Re: [PATCH net] net: add net.core.qdisc_max_burst
2026-01-07 23:25 ` Cong Wang
@ 2026-01-09 14:42 ` Eric Dumazet
0 siblings, 0 replies; 5+ messages in thread
From: Eric Dumazet @ 2026-01-09 14:42 UTC (permalink / raw)
To: Cong Wang
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
netdev, eric.dumazet, Neal Cardwell
On Thu, Jan 8, 2026 at 12:25 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
>
> On Wed, Jan 07, 2026 at 10:41:59AM +0000, Eric Dumazet wrote:
> > In blamed commit, I added a check against the temporary queue
> > built in __dev_xmit_skb(). Idea was to drop packets early,
> > before any spinlock was acquired.
> >
> > if (unlikely(defer_count > READ_ONCE(q->limit))) {
> > kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
> > return NET_XMIT_DROP;
> > }
> >
> > It turned out that HTB Qdisc has a zero q->limit.
> > HTB limits packets on a per-class basis.
> > Some of our tests became flaky.
>
> Hm, if q->limit is the problem here, why not introduce a new Qdisc
> option for this?
My first patch was using a plain macro.
I think the global switch makes more sense : I do not expect anyone
would need to
tune it, but better be safe than sorry.
A per-qdisc parameter would need iproute2 changes
A per qdisc attribute would force all iproute2 users to update their scripts
in case of an emergency.
>
> >
> > Add a new sysctl : net.core.qdisc_max_burst to control
> > how many packets can be stored in the temporary lockless queue.
>
> This becomes global instead of per-Qdisc. If this is intended, you might
> want to document it explicitly in the documentation.
My patch did update Documentation/admin-guide/sysctl/net.rst
Are you thinking of something else ?
An alternative would be to set q->limit to 1000 in HTB only.
And eventually add an TCA_HTB_LIMIT new attribute to tune it.
diff --git a/net/sched/sch_htb.c b/net/sched/sch_htb.c
index b5e40c51655a731e7c1879e4eb4932b9c1687ea5..7f2402dfb0bfb4e8ee8a07f6d18eab0e8555c6d0
100644
--- a/net/sched/sch_htb.c
+++ b/net/sched/sch_htb.c
@@ -1059,6 +1059,7 @@ static int htb_init(struct Qdisc *sch, struct nlattr *opt,
bool offload;
int err;
+ sch->limit = 1000;
qdisc_watchdog_init(&q->watchdog, sch);
INIT_WORK(&q->work, htb_work_func);
^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [PATCH net] net: add net.core.qdisc_max_burst
2026-01-07 10:41 [PATCH net] net: add net.core.qdisc_max_burst Eric Dumazet
2026-01-07 23:25 ` Cong Wang
@ 2026-01-09 15:15 ` Neal Cardwell
2026-01-13 9:30 ` patchwork-bot+netdevbpf
2 siblings, 0 replies; 5+ messages in thread
From: Neal Cardwell @ 2026-01-09 15:15 UTC (permalink / raw)
To: Eric Dumazet
Cc: David S . Miller, Jakub Kicinski, Paolo Abeni, Simon Horman,
netdev, eric.dumazet
On Wed, Jan 7, 2026 at 5:42 AM Eric Dumazet <edumazet@google.com> wrote:
>
> In blamed commit, I added a check against the temporary queue
> built in __dev_xmit_skb(). Idea was to drop packets early,
> before any spinlock was acquired.
>
> if (unlikely(defer_count > READ_ONCE(q->limit))) {
> kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
> return NET_XMIT_DROP;
> }
>
> It turned out that HTB Qdisc has a zero q->limit.
> HTB limits packets on a per-class basis.
> Some of our tests became flaky.
>
> Add a new sysctl : net.core.qdisc_max_burst to control
> how many packets can be stored in the temporary lockless queue.
>
> Also add a new QDISC_BURST_DROP drop reason to better diagnose
> future issues.
>
> Thanks Neal !
>
> Fixes: 100dfa74cad9 ("net: dev_queue_xmit() llist adoption")
> Reported-and-bisected-by: Neal Cardwell <ncardwell@google.com>
> Signed-off-by: Eric Dumazet <edumazet@google.com>
> ---
Reviewed-by: Neal Cardwell <ncardwell@google.com>
Thanks, Eric! I really like this solution.
On Thu, Jan 8, 2026 at 12:25 AM Cong Wang <xiyou.wangcong@gmail.com> wrote:
> Hm, if q->limit is the problem here, why not introduce a new Qdisc
> option for this?
Adding a new Qdisc option for this would be more complicated and
error-prone than having a single global limit with a robust,
well-chosen default like this.
Having this separate burst limit for each qdisc would make qdiscs even
more complex to understand and use correctly, particularly given that
some qdiscs (like htb) already have attributes with "burst" in the
names ("cburst" and "burst" for htb). Having a third htb limit with
"burst" in the name or description would make things even more
difficult for users.
The value that is needed for good performance is a function of the
number of CPUs on the system and the networking transmit workload. I'm
having trouble thinking of a good reason why we'd want to be able to
customize this on a per-qdisc basis. Making this a per-qdisc limit
would incorrectly imply that there's some reason we can imagine
wanting a different value for different qdiscs, and would incorrectly
imply that we believe qdisc developers and system administrators
deploying qdiscs should spend time thinking about how to tune this
parameter.
Eric wrote:
> An alternative would be to set q->limit to 1000 in HTB only.
There are at least 11 different qdiscs that currently have no way to
set the q->limit, so the q->limit ends up staying 0, and suffering
this problem. So with that alternative approach we'd have to change
all 11 of those qdiscs, and make sure that in the future new qdiscs
that enqueue skbs in children instead of the root qdisc remembered to
set q->limit to some big number as well.
So Eric's patch seems like the best approach by far among those
discussed in this thread.
Thanks!
neal
^ permalink raw reply [flat|nested] 5+ messages in thread* Re: [PATCH net] net: add net.core.qdisc_max_burst
2026-01-07 10:41 [PATCH net] net: add net.core.qdisc_max_burst Eric Dumazet
2026-01-07 23:25 ` Cong Wang
2026-01-09 15:15 ` Neal Cardwell
@ 2026-01-13 9:30 ` patchwork-bot+netdevbpf
2 siblings, 0 replies; 5+ messages in thread
From: patchwork-bot+netdevbpf @ 2026-01-13 9:30 UTC (permalink / raw)
To: Eric Dumazet; +Cc: davem, kuba, pabeni, horms, netdev, eric.dumazet, ncardwell
Hello:
This patch was applied to netdev/net.git (main)
by Paolo Abeni <pabeni@redhat.com>:
On Wed, 7 Jan 2026 10:41:59 +0000 you wrote:
> In blamed commit, I added a check against the temporary queue
> built in __dev_xmit_skb(). Idea was to drop packets early,
> before any spinlock was acquired.
>
> if (unlikely(defer_count > READ_ONCE(q->limit))) {
> kfree_skb_reason(skb, SKB_DROP_REASON_QDISC_DROP);
> return NET_XMIT_DROP;
> }
>
> [...]
Here is the summary with links:
- [net] net: add net.core.qdisc_max_burst
https://git.kernel.org/netdev/net/c/ffe4ccd359d0
You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html
^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2026-01-13 9:33 UTC | newest]
Thread overview: 5+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2026-01-07 10:41 [PATCH net] net: add net.core.qdisc_max_burst Eric Dumazet
2026-01-07 23:25 ` Cong Wang
2026-01-09 14:42 ` Eric Dumazet
2026-01-09 15:15 ` Neal Cardwell
2026-01-13 9:30 ` patchwork-bot+netdevbpf
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox