On 07/05/2026 16.46, Simon Schippers wrote:
>
>
> On 5/7/26 16:34, Paolo Abeni wrote:
>> On 5/7/26 8:54 AM, Simon Schippers wrote:
>>> On 5/5/26 15:21, hawk@kernel.org wrote:
>>>> @@ -928,9 +968,13 @@ static int veth_xdp_rcv(struct veth_rq *rq, int budget,
>>>>  			}
>>>>  		} else {
>>>>  			/* ndo_start_xmit */
>>>> -			struct sk_buff *skb = ptr;
>>>> +			bool bql_charged = veth_ptr_is_bql(ptr);
>>>> +			struct sk_buff *skb = veth_ptr_to_skb(ptr);
>>>>
>>>>  			stats->xdp_bytes += skb->len;
>>>> +			if (peer_txq && bql_charged)
>>>> +				netdev_tx_completed_queue(peer_txq, 1, VETH_BQL_UNIT);
>>>
>>> In the discussion with Jonas [1], I left a comment explaining why I
>>> think this doesn't work.
>>>

I've experimented with doing the "completion" at NAPI-end in
veth_poll(), but that resulted in the BQL limit being 128 packets,
which leads to bad latency results (not acceptable). (See the detailed
report later.)

>>> I still think first that adding an option to modify the hard-coded
>>> VETH_RING_SIZE is the way to go.
>>>

Not against being able to modify VETH_RING_SIZE, but I don't think it
is the solution here.

The simple solution is to configure the BQL limit_min:

  /sys/class/net/<iface>/queues/tx-N/byte_queue_limits/limit_min

My experiments (below) find that limit_min=8 gives good performance.
We can simply set the default to 8, as this still allows userspace to
change it later if lower latency is preferred.

>>> Thanks!
>>>
>>> [1] Link: https://lore.kernel.org/netdev/e8cdba04-aa9a-45c6-9807-8274b62920df@tu-dortmund.de/
>>
>> In the above discussion a 20% regression is reported, which IMHO
>> can't be ignored. Still the tput figures in the data are extremely
>> low, something is possibly off?!? I would expect a few Mpps with
>> pktgen on top of veth, while the reported data is ~20-30Kpps.
>>
>> /P
>>
>
> The ~20-30Kpps occur when thousands of iptables rules are applied and
> a UDP userspace application is sending.
>
> And there is a 20% pktgen regression (no iptables rules applied).
>

The pktgen test is a little dubious/weird, and Jonas had to modify
pktgen to test this. John Fastabend added a config option to pktgen
that allows benchmarking the egress qdisc path; it might be better to
use that. The samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh
script is a demo of its usage.

If redoing the tests, can you adjust limit_min to see the effect?

  /sys/class/net/<iface>/queues/tx-N/byte_queue_limits/limit_min

A 20% throughput regression is of course too much, but I will remind
us that adding a qdisc will "cost" some overhead; that is a
configuration choice. Our purpose here is to reduce bufferbloat and
latency, not to optimize for throughput.

> I am pretty sure the reason is because the BQL limit is stuck at 2
> packets (because the completed queue is always called with 1 packet
> and not in an interrupt/timer with multiple packets...).
>

I've run a lot of experiments, which I had AI write a report on; see
the attachment. The TL;DR is that the best performance vs latency
tradeoff is defaulting the BQL/DQL limit_min to 8 packets.

I fear this patchset will stall forever if we keep searching for a
perfect solution without any overhead. The qdisc layer will be a
baseline overhead. A limit of 2 packets is actually the optimal
darkbuffer queue size, but I acknowledge that this causes too many
qdisc requeue events (leading to overhead).

I suggest that I add another patch in V6 that defaults limit_min to 8
(as a separate patch, to make it easier to revert/adjust later).
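For completeness, nothing new is needed to adjust or revert this from
userspace, since limit_min is already writable via sysfs. A minimal
sketch (the veth0 name and the 64-byte stand-in for VETH_BQL_UNIT are
illustrative assumptions, not values taken from the patch):

  # Per-packet byte charge used by this patchset's BQL accounting;
  # 64 is only an assumed stand-in for the real VETH_BQL_UNIT value.
  UNIT=64
  Q=/sys/class/net/veth0/queues/tx-0/byte_queue_limits

  echo $((8 * UNIT)) > "$Q"/limit_min  # floor the BQL limit at 8 packets
  cat "$Q"/limit                       # inspect the current DQL limit
  echo 0 > "$Q"/limit_min              # remove the floor (pure auto-tuning)

Writing 0 restores the normal DQL auto-tuning, so shipping a default of
8 does not take any tuning freedom away from userspace.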
I've talked with Jonas, and we want to experiment with different
solutions to make BQL/DQL work better with virtual devices.

This patchset helps our (production) use case reduce mice-flow latency
under load from approx 22 ms to 1.3 ms.

Due to the consumer namespace being the bottleneck, the requeue
overhead is negligible in comparison.

-Jesper