On 07/05/2026 16.46, Simon Schippers wrote:
>
>
> On 5/7/26 16:34, Paolo Abeni wrote:
>> On 5/7/26 8:54 AM, Simon Schippers wrote:
>>> On 5/5/26 15:21, hawk@kernel.org wrote:
>>>> @@ -928,9 +968,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>>>>  			}
>>>>  		} else {
>>>>  			/* ndo_start_xmit */
>>>> -			struct sk_buff *skb = ptr;
>>>> +			bool bql_charged = veth_ptr_is_bql(ptr);
>>>> +			struct sk_buff *skb = veth_ptr_to_skb(ptr);
>>>>
>>>>  			stats->xdp_bytes += skb->len;
>>>> +			if (peer_txq && bql_charged)
>>>> +				netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
>>>
>>> In the discussion with Jonas [1], I left a comment explaining why I
>>> think this doesn't work.
>>>

I've experimented with doing the "completion" at NAPI-end in
veth_poll(), but that resulted in the BQL limit being 128 packets,
which leads to bad latency results (not acceptable). (See the detailed
report later.)

>>> I still think first that adding an option to modify the hard-coded
>>> VETH_RING_SIZE is the way to go.
>>>

Not against being able to modify VETH_RING_SIZE, but I don't think it
is the solution here.

The simple solution is to configure the BQL limit_min:

  /sys/class/net/<iface>/queues/tx-N/byte_queue_limits/limit_min

My experiments (below) find that limit_min=8 gives good performance.
We can simply set the default to 8, as this still allows userspace to
change it later if lower latency is preferred.

>>> Thanks!
>>>
>>> [1] Link: https://lore.kernel.org/netdev/e8cdba04-aa9a-45c6-9807-8274b62920df@tu-dortmund.de/
>>
>> In the above discussion a 20% regression is reported, which IMHO
>> can't be ignored. Still the tput figures in the data are extremely
>> low, something is possibly off?!? I would expect a few Mpps with
>> pktgen on top of veth, while the reported data is ~20-30Kpps.
>>
>> /P
>>
>
> The ~20-30Kpps occur when thousands of iptables rules are applied and
> a UDP userspace application is sending.
>
> And there is a 20% pktgen regression (no iptables rules applied).
>

The pktgen test is a little dubious/weird, and Jonas had to modify
pktgen to test this. John Fastabend added a config option to pktgen
that allows benchmarking the egress qdisc path; it might be better to
use that. The samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh
script is a demo of its usage.

If redoing the tests, can you adjust limit_min to see the effect?

  /sys/class/net/<iface>/queues/tx-N/byte_queue_limits/limit_min

A 20% throughput regression is of course too much, but I will remind
us that adding a qdisc will "cost" some overhead; that is a
configuration choice. Our purpose here is to reduce bufferbloat and
latency, not to optimize for throughput.

> I am pretty sure the reason is because the BQL limit is stuck at 2
> packets (because the completed queue is always called with 1 packet
> and not in an interrupt/timer with multiple packets...).
>

I've run a lot of experiments, which I had AI write a report on; see
the attachment. The TL;DR is that the best performance vs latency
tradeoff is defaulting the BQL/DQL limit_min to 8 packets.

I fear this patchset will stall forever if we keep searching for a
perfect solution without any overhead. The qdisc layer will be a
baseline overhead. A limit of 2 packets is actually the optimal
darkbuffer queue size, but I acknowledge that this causes too many
qdisc requeue events (leading to overhead).

I suggest that I add another patch in V6 that defaults limit_min to 8
(as a separate patch, to make it easier to revert/adjust later).
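For completeness, nothing new is needed to adjust or revert this from
userspace, since limit_min is already writable via sysfs. A minimal
sketch (the veth0 name and the 64-byte stand-in for VETH_BQL_UNIT are
illustrative assumptions, not values taken from the patch):

  # Per-packet byte charge used by this patchset's BQL accounting;
  # 64 is only an assumed stand-in for the real VETH_BQL_UNIT value.
  UNIT=64
  Q=/sys/class/net/veth0/queues/tx-0/byte_queue_limits

  echo $((8 * UNIT)) > "$Q"/limit_min  # floor the BQL limit at 8 packets
  cat "$Q"/limit                       # inspect the current DQL limit
  echo 0 > "$Q"/limit_min              # remove the floor (pure auto-tuning)

Writing 0 restores the normal DQL auto-tuning, so shipping a default of
8 does not take any tuning freedom away from userspace.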
I've talked with Jonas, and we want to experiment with different
solutions to make BQL/DQL work better with virtual devices.

This patchset helps our (production) use case reduce mice-flow latency
under load from approx 22 ms to 1.3 ms.

Due to the consumer namespace being the bottleneck, the requeue
overhead is negligible in comparison.

-Jesper