[REGRESSION][BISECTED] tun/tap & vhost-net: multi-threaded network performance

Netdev List

* [REGRESSION][BISECTED] tun/tap & vhost-net: multi-threaded network performance
@ 2026-07-01 19:16 Brett Sheffield
  2026-07-01 20:56 ` Michael S. Tsirkin
  2026-07-03 10:41 ` Simon Schippers
  0 siblings, 2 replies; 14+ messages in thread
From: Brett Sheffield @ 2026-07-01 19:16 UTC (permalink / raw)
  To: regressions, netdev
  Cc: Jakub Kicinski, Michael S. Tsirkin, Simon Schippers, Tim Gebauer,
	Willem de Bruijn, Jason Wang, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, linux-kernel

TL;DR - Commit 1d6e569b7d0c0b2736636749e4be0a27f3cefcb3 causes
significant performance regressions with TAP interfaces and multithreaded
network code. Please revert.


Librecast is an IPv6 multicast library. One of the tests (0055) fails under
Linux 7.2-rc1. The test performs data synchronization over IPv6 multicast using a TAP
interface. This test has run successfully on every stable, LTS and mainline RC
released in the past year. Every kernel with my Tested-by has run this test.

There have been a bunch of changes to MLDv2 so I started bisecting there, but
the culprit is actually 1d6e569b7d0c0b2736636749e4be0a27f3cefcb3 "tun/tap &
vhost-net: avoid ptr_ring tail-drop when a qdisc is present"

Reverting this commit fixes the test.

To eliminate my code and any multicast weirdness, I ran tests with iperf3
comparing the same host running 7.2-rc1 both with and without 1d6e569b7d0
reverted.

CPU: AMD Ryzen 9 9950X

[ host ] - [ bridge ] - [ tap ] - [ guest (qemu) ]

Running matching kernels on host and guest, I started iperf3 in server mode on
the guest and tested from the host so traffic passes through the tap interface.

iperf3 -s -V                 # server
iperf3 -c guest -P nthreads  # client

7.2.0-rc1 (threads 1):

[  5]   0.00-10.00  sec  20.2 GBytes  17.4 Gbits/sec    0            sender
[  5]   0.00-10.00  sec  2.00 GBytes  1.72 Gbits/sec                  receiver

7.2.0-rc1 (threads 1, reverted):

[  5]   0.00-10.00  sec  15.3 GBytes  13.1 Gbits/sec  368            sender
[  5]   0.00-10.00  sec  2.00 GBytes  1.72 Gbits/sec                  receiver

7.2.0-rc1 (threads 2):

[SUM]   0.00-10.00  sec  10.9 GBytes  9.33 Gbits/sec    0             sender
[SUM]   0.00-10.00  sec  4.00 GBytes  3.43 Gbits/sec                  receiver

7.2.0-rc1 (threads 2, reverted):

[SUM]   0.00-10.00  sec  15.9 GBytes  13.7 Gbits/sec  1567             sender
[SUM]   0.00-10.00  sec  4.00 GBytes  3.43 Gbits/sec                  receiver

7.2.0-rc1 (threads 4):

[SUM]   0.00-10.00  sec  10.9 GBytes  9.33 Gbits/sec    0             sender
[SUM]   0.00-10.00  sec  8.00 GBytes  6.87 Gbits/sec                  receiver

7.2.0-rc1 (threads 4, reverted):

[SUM]   0.00-10.00  sec  16.5 GBytes  14.1 Gbits/sec  6701             sender
[SUM]   0.00-10.00  sec  8.00 GBytes  6.87 Gbits/sec                  receiver

7.2.0-rc1 (threads 8):

[SUM]   0.00-10.00  sec  10.7 GBytes  9.15 Gbits/sec    0             sender
[SUM]   0.00-10.01  sec  10.6 GBytes  9.13 Gbits/sec                  receiver

7.2.0-rc1 (threads 8, reverted):

[SUM]   0.00-10.00  sec  16.2 GBytes  14.0 Gbits/sec  19319             sender
[SUM]   0.00-10.00  sec  15.7 GBytes  13.5 Gbits/sec                  receiver

7.2.0-rc1 (threads 16):

[SUM]   0.00-10.00  sec  10.9 GBytes  9.35 Gbits/sec    0             sender
[SUM]   0.00-10.01  sec  10.9 GBytes  9.32 Gbits/sec                  receiver

7.2.0-rc1 (threads 16, reverted):

[SUM]   0.00-10.00  sec  14.4 GBytes  12.4 Gbits/sec  43593             sender
[SUM]   0.00-10.00  sec  14.4 GBytes  12.4 Gbits/sec                  receiver


As you can see, the new code performs well in the single-threaded case, but in
every multi-threaded case there is a significant throughput drop. I see this
trade-off is mentioned in the commit, but with the current patch the drop is
much worse than the commit suggests.
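
For a rough sense of scale, the sender-side bitrates listed above can be turned
into percentage changes. The Python sketch below is an illustration only and
was not part of the original test setup; the dictionary and function names are
made up for the example, and only the Gbits/sec figures come from the iperf3
results above.

```python
# Sender-side bitrate in Gbits/sec, copied from the iperf3 results above.
# Keys are the -P thread counts; values are (7.2-rc1, 7.2-rc1 with revert).
SENDER_GBITS = {
    1: (17.4, 13.1),
    2: (9.33, 13.7),
    4: (9.33, 14.1),
    8: (9.15, 14.0),
    16: (9.35, 12.4),
}


def relative_change(patched: float, reverted: float) -> float:
    """Percent change of the patched kernel relative to the reverted one."""
    return (patched - reverted) / reverted * 100.0


def summarize(results: dict) -> dict:
    """Map each thread count to its percent change, rounded to one decimal."""
    return {
        threads: round(relative_change(patched, reverted), 1)
        for threads, (patched, reverted) in results.items()
    }


if __name__ == "__main__":
    for threads, change in summarize(SENDER_GBITS).items():
        print(f"threads {threads:2d}: {change:+.1f}%")
```

On these figures the unreverted kernel comes out ahead only with one thread,
and roughly a quarter to a third behind with 2 to 16 threads.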

In our multicast use case, data is sent by multiple threads to multiple groups
simultaneously. The regression breaks this to the extent that a <2 second test
times out after 5 minutes.


git bisect start
# status: waiting for both good and bad commits
# bad: [dc59e4fea9d83f03bad6bddf3fa2e52491777482] Linux 7.2-rc1
git bisect bad dc59e4fea9d83f03bad6bddf3fa2e52491777482
# status: waiting for good commit(s), bad commit known
# good: [36bdc0e815b4e8a05b9028d8ef8a25e1ead35cc1] net: usb: asix: ax88772: re-add usbnet_link_change() in phylink callbacks
git bisect good 36bdc0e815b4e8a05b9028d8ef8a25e1ead35cc1
# good: [db314398f618a3a23315f73c87f7d318eaf06c1b] Merge branch 'net-bridge-mcast-support-exponential-field-encoding'
git bisect good db314398f618a3a23315f73c87f7d318eaf06c1b
# bad: [079a028d6327e68cfa5d38b36123637b321c19a7] string: Remove strncpy() from the kernel
git bisect bad 079a028d6327e68cfa5d38b36123637b321c19a7
# bad: [f396f4005180928cd9e15e352a6512865d3bc908] Bluetooth: btmtk: fix URB leak in alloc_mtk_intr_urb error path
git bisect bad f396f4005180928cd9e15e352a6512865d3bc908
# bad: [ec1806a730a1c0b3d68a7f9afe81514fb0dd7991] netfilter: x_tables: disable 32bit compat interface in user namespaces
git bisect bad ec1806a730a1c0b3d68a7f9afe81514fb0dd7991
# good: [50c2d91c5dfa0e465826ec1f8dbad9cdc254bd85] mptcp: do not drop partial packets
git bisect good 50c2d91c5dfa0e465826ec1f8dbad9cdc254bd85
# good: [68993ced0f618e36cf33388f1e50223e5e6e78cc] Merge tag 'net-7.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
git bisect good 68993ced0f618e36cf33388f1e50223e5e6e78cc
# good: [34c78dff59a25110a4ce50c208e42a91490fe615] Merge branch 'net-use-ip_outnoroutes-drop-reason'
git bisect good 34c78dff59a25110a4ce50c208e42a91490fe615
# bad: [9587ed8137fb83d93f84b858337412f4500b21e9] Merge branch 'gve-add-support-for-ptp-gettimex64'
git bisect bad 9587ed8137fb83d93f84b858337412f4500b21e9
# bad: [83ea7fd73b11dd8cbf4416507a5eac3890b49fb0] net: dsa: microchip: remove unused phylink_mac_link_up() callback
git bisect bad 83ea7fd73b11dd8cbf4416507a5eac3890b49fb0
# bad: [f0de88303d5e7e04a1224bc7a00512b5a1c4fe7a] net: make is_skb_wmem() available to modules
git bisect bad f0de88303d5e7e04a1224bc7a00512b5a1c4fe7a
# bad: [c411baa463e85a779a7e68a00ba6298770b58c4c] netconsole: move push_ipv6() from netpoll
git bisect bad c411baa463e85a779a7e68a00ba6298770b58c4c
# good: [fba362c17d9d9211fc51f272156bb84fc23bdf98] ptr_ring: move free-space check into separate helper
git bisect good fba362c17d9d9211fc51f272156bb84fc23bdf98
# bad: [d0273dbe8be1640e597552f81faf1d6c9997d3e3] ipvlan: use netif_receive_skb() in ipvlan_process_multicast()
git bisect bad d0273dbe8be1640e597552f81faf1d6c9997d3e3
# bad: [3803065cd6b0630d4161d86aa04e2d1db0f3a0b5] Merge branch 'tun-tap-vhost-net-apply-qdisc-backpressure-on-full-ptr_ring-to-reduce-tx-drops'
git bisect bad 3803065cd6b0630d4161d86aa04e2d1db0f3a0b5
# bad: [1d6e569b7d0c0b2736636749e4be0a27f3cefcb3] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present
git bisect bad 1d6e569b7d0c0b2736636749e4be0a27f3cefcb3
# first bad commit: [1d6e569b7d0c0b2736636749e4be0a27f3cefcb3] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present

-- 
Brett Sheffield (he/him)
Librecast - Decentralising the Internet with Multicast
https://librecast.net/
https://blog.brettsheffield.com/


* Re: [REGRESSION][BISECTED] tun/tap & vhost-net: multi-threaded network performance
  2026-07-01 19:16 [REGRESSION][BISECTED] tun/tap & vhost-net: multi-threaded network performance Brett Sheffield
@ 2026-07-01 20:56 ` Michael S. Tsirkin
  2026-07-02  7:24   ` Simon Schippers
  2026-07-03 10:41 ` Simon Schippers
  1 sibling, 1 reply; 14+ messages in thread
From: Michael S. Tsirkin @ 2026-07-01 20:56 UTC (permalink / raw)
  To: Brett Sheffield
  Cc: regressions, netdev, Jakub Kicinski, Simon Schippers, Tim Gebauer,
	Willem de Bruijn, Jason Wang, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, linux-kernel

On Wed, Jul 01, 2026 at 09:16:48PM +0200, Brett Sheffield wrote:
> TL;DR - Commit 1d6e569b7d0c0b2736636749e4be0a27f3cefcb3 causes
> significant performance regressions with TAP interfaces and multithreaded
> network code. Please revert.
> 
> 
> Librecast is an IPv6 multicast library. One of the tests (0055) fails under
> Linux 7.2-rc1. The test performs data synchronization over IPv6 multicast using a TAP
> interface. This test has run successfully on every stable, LTS and mainline RC
> released in the past year. Every kernel with my Tested-by has run this test.
> 
> There have been a bunch of changes to MLDv2 so I started bisecting there, but
> the culprit is actually 1d6e569b7d0c0b2736636749e4be0a27f3cefcb3 "tun/tap &
> vhost-net: avoid ptr_ring tail-drop when a qdisc is present"
> 
> Reverting this commit fixes the test.
> 
> To eliminate my code and any multicast weirdness, I ran tests with iperf3
> comparing the same host running 7.2-rc1 both with and without 1d6e569b7d0
> reverted.

Thanks a lot for the bisect! Reverting is not out of the question, but
before we do, it is worth analyzing the situation.

Could you pls tell us
- do you see packet drops?
- does it help to increase the tun queue size?

Thanks a lot!



* Re: [REGRESSION][BISECTED] tun/tap & vhost-net: multi-threaded network performance
  2026-07-01 20:56 ` Michael S. Tsirkin
@ 2026-07-02  7:24   ` Simon Schippers
  2026-07-02  7:42     ` Michael S. Tsirkin
  2026-07-02 11:07     ` Brett Sheffield
  0 siblings, 2 replies; 14+ messages in thread
From: Simon Schippers @ 2026-07-02  7:24 UTC (permalink / raw)
  To: Michael S. Tsirkin, Brett Sheffield
  Cc: regressions, netdev, Jakub Kicinski, Tim Gebauer,
	Willem de Bruijn, Jason Wang, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, linux-kernel

On 7/1/26 22:56, Michael S. Tsirkin wrote:
> On Wed, Jul 01, 2026 at 09:16:48PM +0200, Brett Sheffield wrote:
>> TL;DR - Commit 1d6e569b7d0c0b2736636749e4be0a27f3cefcb3 causes
>> significant performance regressions with TAP interfaces and multithreaded
>> network code. Please revert.
>>
>>
>> Librecast is an IPv6 multicast library. One of the tests (0055) fails under
>> Linux 7.2-rc1. The test performs data synchronization over IPv6 multicast using a TAP
>> interface. This test has run successfully on every stable, LTS and mainline RC
>> released in the past year. Every kernel with my Tested-by has run this test.
>>
>> There have been a bunch of changes to MLDv2 so I started bisecting there, but
>> the culprit is actually 1d6e569b7d0c0b2736636749e4be0a27f3cefcb3 "tun/tap &
>> vhost-net: avoid ptr_ring tail-drop when a qdisc is present"
>>
>> Reverting this commit fixes the test.
>>
>> To eliminate my code and any multicast weirdness, I ran tests with iperf3
>> comparing the same host running 7.2-rc1 both with and without 1d6e569b7d0
>> reverted.

Thank you very much for your bisect!

As the author, I am sorry for that regression!

> 
> Thanks a lot for the bisect! Reverting is not out of the question, but
> before we do, it is worth analyzing the situation.
> 
> Could you pls tell us
> - do you see packet drops?

iperf3 shows no TCP retransmissions with the patch applied, so no packets
were dropped in those runs.
The retransmission count is the number after the sender data rate (example:
threads 1, reverted has 368 retransmissions/drops).
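
The retransmission count can also be pulled out of the summary lines
mechanically. A minimal Python sketch, assuming the plain-text summary format
shown in the results quoted in this thread; the regular expression and the
parse_summary() helper are hypothetical names for this example, and iperf3's
JSON output (-J) would be a sturdier source where available.

```python
import re

# Matches iperf3 end-of-run summary lines such as:
# [  5]   0.00-10.00  sec  15.3 GBytes  13.1 Gbits/sec  368   sender
# The retransmission count is optional because receiver lines omit it.
SUMMARY_RE = re.compile(
    r"^\[\s*(?P<id>\w+)\]\s+"
    r"(?P<start>[\d.]+)-(?P<end>[\d.]+)\s+sec\s+"
    r"(?P<transfer>[\d.]+)\s+(?P<transfer_unit>\w+)\s+"
    r"(?P<bitrate>[\d.]+)\s+(?P<bitrate_unit>\S+)\s+"
    r"(?:(?P<retr>\d+)\s+)?"
    r"(?P<role>sender|receiver)\s*$"
)


def parse_summary(line: str):
    """Return the fields of one summary line, or None if it does not match."""
    match = SUMMARY_RE.match(line.strip())
    if match is None:
        return None
    retr = match.group("retr")
    return {
        "id": match.group("id"),
        "role": match.group("role"),
        "bitrate": float(match.group("bitrate")),
        "bitrate_unit": match.group("bitrate_unit"),
        # None means the line carried no retransmission column.
        "retransmits": int(retr) if retr is not None else None,
    }


if __name__ == "__main__":
    sample = "[  5]   0.00-10.00  sec  15.3 GBytes  13.1 Gbits/sec  368            sender"
    print(parse_summary(sample))
```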

> - does it help to increase the tun queue size?

I agree, this would be great to know.

However, even then we must act. I am considering IFF_BACKPRESSURE
as a feature flag, defaulting to off. It would just enable/disable
the stopping logic in tun_net_xmit() and the waking logic
in __tun_wake_queue(). If disabled, it would result in the same logic
as before.

I could provide such a patch as [net] material.

Thanks again!

> 
> Thanks a lot!
> 
> 

* Re: [REGRESSION][BISECTED] tun/tap & vhost-net: multi-threaded network performance
  2026-07-02  7:24   ` Simon Schippers
@ 2026-07-02  7:42     ` Michael S. Tsirkin
  2026-07-02  8:01       ` Simon Schippers
  2026-07-02 11:07     ` Brett Sheffield
  1 sibling, 1 reply; 14+ messages in thread
From: Michael S. Tsirkin @ 2026-07-02  7:42 UTC (permalink / raw)
  To: Simon Schippers
  Cc: Brett Sheffield, regressions, netdev, Jakub Kicinski, Tim Gebauer,
	Willem de Bruijn, Jason Wang, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, linux-kernel

On Thu, Jul 02, 2026 at 09:24:59AM +0200, Simon Schippers wrote:
> On 7/1/26 22:56, Michael S. Tsirkin wrote:
> > On Wed, Jul 01, 2026 at 09:16:48PM +0200, Brett Sheffield wrote:
> >> TL;DR - Commit 1d6e569b7d0c0b2736636749e4be0a27f3cefcb3 causes
> >> significant performance regressions with TAP interfaces and multithreaded
> >> network code. Please revert.
> >>
> >>
> >> Librecast is an IPv6 multicast library. One of the tests (0055) fails under
> >> Linux 7.2-rc1. The test performs data synchronization over IPv6 multicast using a TAP
> >> interface. This test has run successfully on every stable, LTS and mainline RC
> >> released in the past year. Every kernel with my Tested-by has run this test.
> >>
> >> There have been a bunch of changes to MLDv2 so I started bisecting there, but
> >> the culprit is actually 1d6e569b7d0c0b2736636749e4be0a27f3cefcb3 "tun/tap &
> >> vhost-net: avoid ptr_ring tail-drop when a qdisc is present"
> >>
> >> Reverting this commit fixes the test.
> >>
> >> To eliminate my code and any multicast weirdness, I ran tests with iperf3
> >> comparing the same host running 7.2-rc1 both with and without 1d6e569b7d0
> >> reverted.
> 
> Thank you very much for your bisect!
> 
> As the author, I am sorry for that regression!
> 
> > 
> > Thanks a lot for the bisect! Reverting is not out of the question, but
> > before we do, it is worth analyzing the situation.
> > 
> > Could you pls tell us
> > - do you see packet drops?
> 
> iperf3 shows no TCP retransmissions with the patch applied, so no packets
> were dropped in those runs.
> The retransmission count is the number after the sender data rate (example:
> threads 1, reverted has 368 retransmissions/drops).
> 
> > - does it help to increase the tun queue size?
> 
> I agree, this would be great to know.
> 
> However, even then we must act. I am considering IFF_BACKPRESSURE
> as a feature flag, defaulting to off. It would just enable/disable
> the stopping logic in tun_net_xmit() and the waking logic
> in __tun_wake_queue(). If disabled, it would result in the same logic
> as before.
> 
> I could provide such a patch as [net] material.
> 
> Thanks again!

Or BQL? I quickly wrote a prototype of that and it seems to work well -
could you help test maybe?


> > 
> > Thanks a lot!
> > 
> > 
> >> CPU: AMD Ryzen 9 9950X
> >>
> >> [ host ] - [ bridge ] - [ tap ] - [ guest (qemu) ]
> >>
> >> Running matching kernels on host and guest, I started iperf3 in server mode on
> >> the guest and tested from the host so traffic passes through the tap interface.
> >>
> >> iperf3 -s -V                 # server
> >> iperf3 -c guest -P nthreads  # client
> >>
> >> 7.2.0-rc1 (threads 1):
> >>
> >> [  5]   0.00-10.00  sec  20.2 GBytes  17.4 Gbits/sec    0            sender
> >> [  5]   0.00-10.00  sec  2.00 GBytes  1.72 Gbits/sec                  receiver
> >>
> >> 7.2.0-rc1 (threads 1, reverted):
> >>
> >> [  5]   0.00-10.00  sec  15.3 GBytes  13.1 Gbits/sec  368            sender
> >> [  5]   0.00-10.00  sec  2.00 GBytes  1.72 Gbits/sec                  receiver
> >>
> >> 7.2.0-rc1 (threads 2):
> >>
> >> [SUM]   0.00-10.00  sec  10.9 GBytes  9.33 Gbits/sec    0             sender
> >> [SUM]   0.00-10.00  sec  4.00 GBytes  3.43 Gbits/sec                  receiver
> >>
> >> 7.2.0-rc1 (threads 2, reverted):
> >>
> >> [SUM]   0.00-10.00  sec  15.9 GBytes  13.7 Gbits/sec  1567             sender
> >> [SUM]   0.00-10.00  sec  4.00 GBytes  3.43 Gbits/sec                  receiver
> >>
> >> 7.2.0-rc1 (threads 4):
> >>
> >> [SUM]   0.00-10.00  sec  10.9 GBytes  9.33 Gbits/sec    0             sender
> >> [SUM]   0.00-10.00  sec  8.00 GBytes  6.87 Gbits/sec                  receiver
> >>
> >> 7.2.0-rc1 (threads 4, reverted):
> >>
> >> [SUM]   0.00-10.00  sec  16.5 GBytes  14.1 Gbits/sec  6701             sender
> >> [SUM]   0.00-10.00  sec  8.00 GBytes  6.87 Gbits/sec                  receiver
> >>
> >> 7.2.0-rc1 (threads 8):
> >>
> >> [SUM]   0.00-10.00  sec  10.7 GBytes  9.15 Gbits/sec    0             sender
> >> [SUM]   0.00-10.01  sec  10.6 GBytes  9.13 Gbits/sec                  receiver
> >>
> >> 7.2.0-rc1 (threads 8, reverted):
> >>
> >> [SUM]   0.00-10.00  sec  16.2 GBytes  14.0 Gbits/sec  19319             sender
> >> [SUM]   0.00-10.00  sec  15.7 GBytes  13.5 Gbits/sec                  receiver
> >>
> >> 7.2.0-rc1 (threads 16):
> >>
> >> [SUM]   0.00-10.00  sec  10.9 GBytes  9.35 Gbits/sec    0             sender
> >> [SUM]   0.00-10.01  sec  10.9 GBytes  9.32 Gbits/sec                  receiver
> >>
> >> 7.2.0-rc1 (threads 16, reverted):
> >>
> >> [SUM]   0.00-10.00  sec  14.4 GBytes  12.4 Gbits/sec  43593             sender
> >> [SUM]   0.00-10.00  sec  14.4 GBytes  12.4 Gbits/sec                  receiver
> >>
> >>
> >> As you can see, the new code works for single threaded, but for all other cases
> >> there's a significant performance drop. I see this trade-off is mentioned in the
> >> commit, but the performance drop off is much worse than suggested with the
> >> current patch.
> >>
> >> In our multicast use case data is sent by multiple threads to multiple groups
> >> simultaneously, this just breaks things to the extent that a <2 second test
> >> times out after 5 minutes.
> >>
> >>
> >> git bisect start
> >> # status: waiting for both good and bad commits
> >> # bad: [dc59e4fea9d83f03bad6bddf3fa2e52491777482] Linux 7.2-rc1
> >> git bisect bad dc59e4fea9d83f03bad6bddf3fa2e52491777482
> >> # status: waiting for good commit(s), bad commit known
> >> # good: [36bdc0e815b4e8a05b9028d8ef8a25e1ead35cc1] net: usb: asix: ax88772: re-add usbnet_link_change() in phylink callbacks
> >> git bisect good 36bdc0e815b4e8a05b9028d8ef8a25e1ead35cc1
> >> # good: [db314398f618a3a23315f73c87f7d318eaf06c1b] Merge branch 'net-bridge-mcast-support-exponential-field-encoding'
> >> git bisect good db314398f618a3a23315f73c87f7d318eaf06c1b
> >> # bad: [079a028d6327e68cfa5d38b36123637b321c19a7] string: Remove strncpy() from the kernel
> >> git bisect bad 079a028d6327e68cfa5d38b36123637b321c19a7
> >> # bad: [f396f4005180928cd9e15e352a6512865d3bc908] Bluetooth: btmtk: fix URB leak in alloc_mtk_intr_urb error path
> >> git bisect bad f396f4005180928cd9e15e352a6512865d3bc908
> >> # bad: [ec1806a730a1c0b3d68a7f9afe81514fb0dd7991] netfilter: x_tables: disable 32bit compat interface in user namespaces
> >> git bisect bad ec1806a730a1c0b3d68a7f9afe81514fb0dd7991
> >> # good: [50c2d91c5dfa0e465826ec1f8dbad9cdc254bd85] mptcp: do not drop partial packets
> >> git bisect good 50c2d91c5dfa0e465826ec1f8dbad9cdc254bd85
> >> # good: [68993ced0f618e36cf33388f1e50223e5e6e78cc] Merge tag 'net-7.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
> >> git bisect good 68993ced0f618e36cf33388f1e50223e5e6e78cc
> >> # good: [34c78dff59a25110a4ce50c208e42a91490fe615] Merge branch 'net-use-ip_outnoroutes-drop-reason'
> >> git bisect good 34c78dff59a25110a4ce50c208e42a91490fe615
> >> # bad: [9587ed8137fb83d93f84b858337412f4500b21e9] Merge branch 'gve-add-support-for-ptp-gettimex64'
> >> git bisect bad 9587ed8137fb83d93f84b858337412f4500b21e9
> >> # bad: [83ea7fd73b11dd8cbf4416507a5eac3890b49fb0] net: dsa: microchip: remove unused phylink_mac_link_up() callback
> >> git bisect bad 83ea7fd73b11dd8cbf4416507a5eac3890b49fb0
> >> # bad: [f0de88303d5e7e04a1224bc7a00512b5a1c4fe7a] net: make is_skb_wmem() available to modules
> >> git bisect bad f0de88303d5e7e04a1224bc7a00512b5a1c4fe7a
> >> # bad: [c411baa463e85a779a7e68a00ba6298770b58c4c] netconsole: move push_ipv6() from netpoll
> >> git bisect bad c411baa463e85a779a7e68a00ba6298770b58c4c
> >> # good: [fba362c17d9d9211fc51f272156bb84fc23bdf98] ptr_ring: move free-space check into separate helper
> >> git bisect good fba362c17d9d9211fc51f272156bb84fc23bdf98
> >> # bad: [d0273dbe8be1640e597552f81faf1d6c9997d3e3] ipvlan: use netif_receive_skb() in ipvlan_process_multicast()
> >> git bisect bad d0273dbe8be1640e597552f81faf1d6c9997d3e3
> >> # bad: [3803065cd6b0630d4161d86aa04e2d1db0f3a0b5] Merge branch 'tun-tap-vhost-net-apply-qdisc-backpressure-on-full-ptr_ring-to-reduce-tx-drops'
> >> git bisect bad 3803065cd6b0630d4161d86aa04e2d1db0f3a0b5
> >> # bad: [1d6e569b7d0c0b2736636749e4be0a27f3cefcb3] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present
> >> git bisect bad 1d6e569b7d0c0b2736636749e4be0a27f3cefcb3
> >> # first bad commit: [1d6e569b7d0c0b2736636749e4be0a27f3cefcb3] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present
> >>
> >> -- 
> >> Brett Sheffield (he/him)
> >> Librecast - Decentralising the Internet with Multicast
> >> https://librecast.net/
> >> https://blog.brettsheffield.com/
> > 



* Re: [REGRESSION][BISECTED] tun/tap & vhost-net: multi-threaded network performance
  2026-07-02  7:42     ` Michael S. Tsirkin
@ 2026-07-02  8:01       ` Simon Schippers
  0 siblings, 0 replies; 14+ messages in thread
From: Simon Schippers @ 2026-07-02  8:01 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Brett Sheffield, regressions, netdev, Jakub Kicinski, Tim Gebauer,
	Willem de Bruijn, Jason Wang, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, linux-kernel

On 7/2/26 09:42, Michael S. Tsirkin wrote:
> On Thu, Jul 02, 2026 at 09:24:59AM +0200, Simon Schippers wrote:
>> On 7/1/26 22:56, Michael S. Tsirkin wrote:
>>> On Wed, Jul 01, 2026 at 09:16:48PM +0200, Brett Sheffield wrote:
>>>> TL;DR - Commit 1d6e569b7d0c0b2736636749e4be0a27f3cefcb3 causes
>>>> significant performance regressions with TAP interfaces and multithreaded
>>>> network code. Please revert.
>>>>
>>>>
>>>> Librecast is an IPv6 multicast library. One of the tests (0055) fails under
>>>> Linux 7.2-rc1. The test performs data synchronization over IPv6 multicast using a TAP
>>>> interface. This test has run successfully on every stable, LTS and mainline RC
>>>> released in the past year. Every kernel with my Tested-by has run this test.
>>>>
>>>> There have been a bunch of changes to MLDv2 so I started bisecting there, but
>>>> the culprit is actually 1d6e569b7d0c0b2736636749e4be0a27f3cefcb3 "tun/tap &
>>>> vhost-net: avoid ptr_ring tail-drop when a qdisc is present"
>>>>
>>>> Reverting this commit fixes the test.
>>>>
>>>> To eliminate my code and any multicast weirdness, I ran tests with iperf3
>>>> comparing the same host running 7.2-rc1 both with and without 1d6e569b7d0
>>>> reverted.
>>
>> Thank you very much for your bisect!
>>
>> As the author, I am sorry for that regression!
>>
>>>
>>> Thanks a lot for the bisect! Reverting is not out of question, but
>>> just before we do, it is worth analyzing the situation.
>>>
>>> Could you pls tell us
>>> - do you see packet drops?
>>
>> iperf3 shows no TCP retransmissions, so there were never packet drops
>> when the patch was enabled.
>> It is the number after the sender data rate (example: threads 1,
>> reverted has 368 retransmissions/drops).
>>
>>> - does it help to increase the tun queue size?
>>
>> I agree, this would be great to know.
>>
>> However, even then we must act. I am considering IFF_BACKPRESSURE
>> as a feature flag, defaulting to off. It would just enable/disable
>> the stopping logic in tun_net_xmit() and the waking logic
>> in __tun_wake_queue(). If disabled, it would result in the same logic
>> as before.
>>
>> I could provide such a patch as [net] material.
>>
>> Thanks again!
> 
> Or BQL? I quickly wrote a prototype of that and it seems to work well -
> could you help test maybe?

I don't think BQL will fix it.
A bigger TUN queue would probably fix it, but BQL can only adjust
to a queue smaller than 500 packets.

Unless you don't take the ptr_ring size as an upper limit for BQL and
resize the ptr_ring on the fly? I don't think this is viable.

I am currently working with others to get BQL working for veth [1].
TLDR: The issue with software interfaces is that there is no *periodic*
completion process [2]. For hardware there is.
I applied the same veth approach to TUN and it was shown to fix
bufferbloat... But again, I don't think it will fix the issue here.

[1] Link: https://lore.kernel.org/netdev/20260612083530.1650245-1-hawk@kernel.org/T/#u
[2] Link: https://lwn.net/Articles/469651/
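
To make the argument above concrete, here is a toy userspace model (not
kernel code) of why BQL alone cannot help. It assumes the number of packets
that can be queued is bounded both by the ring size and by a BQL byte limit.
The helper name effective_queue_pkts() and its parameters are hypothetical
and only serve the illustration; real BQL accounting is more involved.

```c
#include <stddef.h>

/* Toy model, not kernel code: how many packets can sit in the TUN queue
 * when both a fixed-size ring and a BQL byte limit apply.
 * All names here are hypothetical. */
static size_t effective_queue_pkts(size_t ring_size, size_t bql_limit_bytes,
				   size_t pkt_len)
{
	size_t bql_pkts;

	if (pkt_len == 0)
		return 0;

	/* Packets that fit under the byte limit */
	bql_pkts = bql_limit_bytes / pkt_len;

	/* The ring is a hard upper bound; BQL can only go below it */
	return bql_pkts < ring_size ? bql_pkts : ring_size;
}
```

In this model, raising the byte limit past what the ring can hold changes
nothing: the result saturates at ring_size. Only a bigger ring raises the
ceiling.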

> 
> 
>>>
>>> Thanks a lot!
>>>
>>>
>>>> CPU: AMD Ryzen 9 9950X

This processor has 2 CCDs. This probably makes the issue worse than it
was for me. My (older) Ryzen 5 5600X only has a single CCD.

Thanks.

>>>>
>>>> [ host ] - [ bridge ] - [ tap ] - [ guest (qemu) ]
>>>>
>>>> Running matching kernels on host and guest, I started iperf3 in server mode on
>>>> the guest and tested from the host so traffic passes through the tap interface.
>>>>
>>>> iperf3 -s -V                 # server
>>>> iperf3 -c guest -P nthreads  # client
>>>>
>>>> 7.2.0-rc1 (threads 1):
>>>>
>>>> [  5]   0.00-10.00  sec  20.2 GBytes  17.4 Gbits/sec    0            sender
>>>> [  5]   0.00-10.00  sec  2.00 GBytes  1.72 Gbits/sec                  receiver
>>>>
>>>> 7.2.0-rc1 (threads 1, reverted):
>>>>
>>>> [  5]   0.00-10.00  sec  15.3 GBytes  13.1 Gbits/sec  368            sender
>>>> [  5]   0.00-10.00  sec  2.00 GBytes  1.72 Gbits/sec                  receiver
>>>>
>>>> 7.2.0-rc1 (threads 2):
>>>>
>>>> [SUM]   0.00-10.00  sec  10.9 GBytes  9.33 Gbits/sec    0             sender
>>>> [SUM]   0.00-10.00  sec  4.00 GBytes  3.43 Gbits/sec                  receiver
>>>>
>>>> 7.2.0-rc1 (threads 2, reverted):
>>>>
>>>> [SUM]   0.00-10.00  sec  15.9 GBytes  13.7 Gbits/sec  1567             sender
>>>> [SUM]   0.00-10.00  sec  4.00 GBytes  3.43 Gbits/sec                  receiver
>>>>
>>>> 7.2.0-rc1 (threads 4):
>>>>
>>>> [SUM]   0.00-10.00  sec  10.9 GBytes  9.33 Gbits/sec    0             sender
>>>> [SUM]   0.00-10.00  sec  8.00 GBytes  6.87 Gbits/sec                  receiver
>>>>
>>>> 7.2.0-rc1 (threads 4, reverted):
>>>>
>>>> [SUM]   0.00-10.00  sec  16.5 GBytes  14.1 Gbits/sec  6701             sender
>>>> [SUM]   0.00-10.00  sec  8.00 GBytes  6.87 Gbits/sec                  receiver
>>>>
>>>> 7.2.0-rc1 (threads 8):
>>>>
>>>> [SUM]   0.00-10.00  sec  10.7 GBytes  9.15 Gbits/sec    0             sender
>>>> [SUM]   0.00-10.01  sec  10.6 GBytes  9.13 Gbits/sec                  receiver
>>>>
>>>> 7.2.0-rc1 (threads 8, reverted):
>>>>
>>>> [SUM]   0.00-10.00  sec  16.2 GBytes  14.0 Gbits/sec  19319             sender
>>>> [SUM]   0.00-10.00  sec  15.7 GBytes  13.5 Gbits/sec                  receiver
>>>>
>>>> 7.2.0-rc1 (threads 16):
>>>>
>>>> [SUM]   0.00-10.00  sec  10.9 GBytes  9.35 Gbits/sec    0             sender
>>>> [SUM]   0.00-10.01  sec  10.9 GBytes  9.32 Gbits/sec                  receiver
>>>>
>>>> 7.2.0-rc1 (threads 16, reverted):
>>>>
>>>> [SUM]   0.00-10.00  sec  14.4 GBytes  12.4 Gbits/sec  43593             sender
>>>> [SUM]   0.00-10.00  sec  14.4 GBytes  12.4 Gbits/sec                  receiver
>>>>
>>>>
>>>> As you can see, the new code works for single threaded, but for all other cases
>>>> there's a significant performance drop. I see this trade-off is mentioned in the
>>>> commit, but the performance drop off is much worse than suggested with the
>>>> current patch.
>>>>
>>>> In our multicast use case data is sent by multiple threads to multiple groups
>>>> simultaneously, this just breaks things to the extent that a <2 second test
>>>> times out after 5 minutes.
>>>>
>>>>
>>>> git bisect start
>>>> # status: waiting for both good and bad commits
>>>> # bad: [dc59e4fea9d83f03bad6bddf3fa2e52491777482] Linux 7.2-rc1
>>>> git bisect bad dc59e4fea9d83f03bad6bddf3fa2e52491777482
>>>> # status: waiting for good commit(s), bad commit known
>>>> # good: [36bdc0e815b4e8a05b9028d8ef8a25e1ead35cc1] net: usb: asix: ax88772: re-add usbnet_link_change() in phylink callbacks
>>>> git bisect good 36bdc0e815b4e8a05b9028d8ef8a25e1ead35cc1
>>>> # good: [db314398f618a3a23315f73c87f7d318eaf06c1b] Merge branch 'net-bridge-mcast-support-exponential-field-encoding'
>>>> git bisect good db314398f618a3a23315f73c87f7d318eaf06c1b
>>>> # bad: [079a028d6327e68cfa5d38b36123637b321c19a7] string: Remove strncpy() from the kernel
>>>> git bisect bad 079a028d6327e68cfa5d38b36123637b321c19a7
>>>> # bad: [f396f4005180928cd9e15e352a6512865d3bc908] Bluetooth: btmtk: fix URB leak in alloc_mtk_intr_urb error path
>>>> git bisect bad f396f4005180928cd9e15e352a6512865d3bc908
>>>> # bad: [ec1806a730a1c0b3d68a7f9afe81514fb0dd7991] netfilter: x_tables: disable 32bit compat interface in user namespaces
>>>> git bisect bad ec1806a730a1c0b3d68a7f9afe81514fb0dd7991
>>>> # good: [50c2d91c5dfa0e465826ec1f8dbad9cdc254bd85] mptcp: do not drop partial packets
>>>> git bisect good 50c2d91c5dfa0e465826ec1f8dbad9cdc254bd85
>>>> # good: [68993ced0f618e36cf33388f1e50223e5e6e78cc] Merge tag 'net-7.1-rc5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
>>>> git bisect good 68993ced0f618e36cf33388f1e50223e5e6e78cc
>>>> # good: [34c78dff59a25110a4ce50c208e42a91490fe615] Merge branch 'net-use-ip_outnoroutes-drop-reason'
>>>> git bisect good 34c78dff59a25110a4ce50c208e42a91490fe615
>>>> # bad: [9587ed8137fb83d93f84b858337412f4500b21e9] Merge branch 'gve-add-support-for-ptp-gettimex64'
>>>> git bisect bad 9587ed8137fb83d93f84b858337412f4500b21e9
>>>> # bad: [83ea7fd73b11dd8cbf4416507a5eac3890b49fb0] net: dsa: microchip: remove unused phylink_mac_link_up() callback
>>>> git bisect bad 83ea7fd73b11dd8cbf4416507a5eac3890b49fb0
>>>> # bad: [f0de88303d5e7e04a1224bc7a00512b5a1c4fe7a] net: make is_skb_wmem() available to modules
>>>> git bisect bad f0de88303d5e7e04a1224bc7a00512b5a1c4fe7a
>>>> # bad: [c411baa463e85a779a7e68a00ba6298770b58c4c] netconsole: move push_ipv6() from netpoll
>>>> git bisect bad c411baa463e85a779a7e68a00ba6298770b58c4c
>>>> # good: [fba362c17d9d9211fc51f272156bb84fc23bdf98] ptr_ring: move free-space check into separate helper
>>>> git bisect good fba362c17d9d9211fc51f272156bb84fc23bdf98
>>>> # bad: [d0273dbe8be1640e597552f81faf1d6c9997d3e3] ipvlan: use netif_receive_skb() in ipvlan_process_multicast()
>>>> git bisect bad d0273dbe8be1640e597552f81faf1d6c9997d3e3
>>>> # bad: [3803065cd6b0630d4161d86aa04e2d1db0f3a0b5] Merge branch 'tun-tap-vhost-net-apply-qdisc-backpressure-on-full-ptr_ring-to-reduce-tx-drops'
>>>> git bisect bad 3803065cd6b0630d4161d86aa04e2d1db0f3a0b5
>>>> # bad: [1d6e569b7d0c0b2736636749e4be0a27f3cefcb3] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present
>>>> git bisect bad 1d6e569b7d0c0b2736636749e4be0a27f3cefcb3
>>>> # first bad commit: [1d6e569b7d0c0b2736636749e4be0a27f3cefcb3] tun/tap & vhost-net: avoid ptr_ring tail-drop when a qdisc is present
>>>>
>>>> -- 
>>>> Brett Sheffield (he/him)
>>>> Librecast - Decentralising the Internet with Multicast
>>>> https://librecast.net/
>>>> https://blog.brettsheffield.com/
>>>
> 


* Re: [REGRESSION][BISECTED] tun/tap & vhost-net: multi-threaded network performance
  2026-07-02  7:24   ` Simon Schippers
  2026-07-02  7:42     ` Michael S. Tsirkin
@ 2026-07-02 11:07     ` Brett Sheffield
  2026-07-02 22:44       ` Michael S. Tsirkin
  2026-07-02 22:55       ` Michael S. Tsirkin
  1 sibling, 2 replies; 14+ messages in thread
From: Brett Sheffield @ 2026-07-02 11:07 UTC (permalink / raw)
  To: Simon Schippers
  Cc: Michael S. Tsirkin, regressions, netdev, Jakub Kicinski,
	Tim Gebauer, Willem de Bruijn, Jason Wang, Andrew Lunn,
	David S. Miller, Eric Dumazet, Paolo Abeni, linux-kernel

On 2026-07-02 09:24, Simon Schippers wrote:
> On 7/1/26 22:56, Michael S. Tsirkin wrote:
> > On Wed, Jul 01, 2026 at 09:16:48PM +0200, Brett Sheffield wrote:
> >> TL;DR - Commit 1d6e569b7d0c0b2736636749e4be0a27f3cefcb3 causes
> >> significant performance regressions with TAP interfaces and multithreaded
> >> network code. Please revert.
> >>
> >>
> >> Librecast is an IPv6 multicast library. One of the tests (0055) fails under
> >> Linux 7.2-rc1. The test performs data synchronization over IPv6 multicast using a TAP
> >> interface. This test has run successfully on every stable, LTS and mainline RC
> >> released in the past year. Every kernel with my Tested-by has run this test.
> >>
> >> There have been a bunch of changes to MLDv2 so I started bisecting there, but
> >> the culprit is actually 1d6e569b7d0c0b2736636749e4be0a27f3cefcb3 "tun/tap &
> >> vhost-net: avoid ptr_ring tail-drop when a qdisc is present"
> >>
> >> Reverting this commit fixes the test.
> >>
> >> To eliminate my code and any multicast weirdness, I ran tests with iperf3
> >> comparing the same host running 7.2-rc1 both with and without 1d6e569b7d0
> >> reverted.
> 
> Thank you very much for your bisect!
> 
> As the author, I am sorry for that regression!

No worries. That's why we test :-)

> > - does it help to increase the tun queue size?
> 
> I agree, this would be great to know.
> 
> However, even then we must act. I am considering IFF_BACKPRESSURE
> as a feature flag, defaulting to off. It would just enable/disable
> the stopping logic in tun_net_xmit() and the waking logic
> in __tun_wake_queue(). If disabled, it would result in the same logic
> as before.
> 
> I could provide such a patch as [net] material.

I'm going to make myself a strong cup of tea and dig into it a bit more here and
will let you know if I find anything worth reporting.

If you need me to try re-testing with specific settings or test a patch I'm
happy to do so.

Cheers,


Brett
-- 
Brett Sheffield (he/him)
Librecast - Decentralising the Internet with Multicast
https://librecast.net/
https://blog.brettsheffield.com/


* Re: [REGRESSION][BISECTED] tun/tap & vhost-net: multi-threaded network performance
  2026-07-02 11:07     ` Brett Sheffield
@ 2026-07-02 22:44       ` Michael S. Tsirkin
  2026-07-03 10:34         ` Simon Schippers
  2026-07-02 22:55       ` Michael S. Tsirkin
  1 sibling, 1 reply; 14+ messages in thread
From: Michael S. Tsirkin @ 2026-07-02 22:44 UTC (permalink / raw)
  To: Brett Sheffield
  Cc: Simon Schippers, regressions, netdev, Jakub Kicinski, Tim Gebauer,
	Willem de Bruijn, Jason Wang, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, linux-kernel

On Thu, Jul 02, 2026 at 01:07:47PM +0200, Brett Sheffield wrote:
> On 2026-07-02 09:24, Simon Schippers wrote:
> > On 7/1/26 22:56, Michael S. Tsirkin wrote:
> > > On Wed, Jul 01, 2026 at 09:16:48PM +0200, Brett Sheffield wrote:
> > >> TL;DR - Commit 1d6e569b7d0c0b2736636749e4be0a27f3cefcb3 causes
> > >> significant performance regressions with TAP interfaces and multithreaded
> > >> network code. Please revert.
> > >>
> > >>
> > >> Librecast is an IPv6 multicast library. One of the tests (0055) fails under
> > >> Linux 7.2-rc1. The test performs data synchronization over IPv6 multicast using a TAP
> > >> interface. This test has run successfully on every stable, LTS and mainline RC
> > >> released in the past year. Every kernel with my Tested-by has run this test.
> > >>
> > >> There have been a bunch of changes to MLDv2 so I started bisecting there, but
> > >> the culprit is actually 1d6e569b7d0c0b2736636749e4be0a27f3cefcb3 "tun/tap &
> > >> vhost-net: avoid ptr_ring tail-drop when a qdisc is present"
> > >>
> > >> Reverting this commit fixes the test.
> > >>
> > >> To eliminate my code and any multicast weirdness, I ran tests with iperf3
> > >> comparing the same host running 7.2-rc1 both with and without 1d6e569b7d0
> > >> reverted.
> > 
> > Thank you very much for your bisect!
> > 
> > As the author, I am sorry for that regression!
> 
> No worries. That's why we test :-)
> 
> > > - does it help to increase the tun queue size?
> > 
> > I agree, this would be great to know.
> > 
> > However, even then we must act. I am considering IFF_BACKPRESSURE
> > as a feature flag, defaulting to off. It would just enable/disable
> > the stopping logic in tun_net_xmit() and the waking logic
> > in __tun_wake_queue(). If disabled, it would result in the same logic
> > as before.
> > 
> > I could provide such a patch as [net] material.
> 
> I'm going to make myself a strong cup of tea and dig into it a bit more here and
> will let you know if I find anything worth reporting.
> 
> If you need me to try re-testing with specific settings or test a patch I'm
> happy to do so.
> 
> Cheers,
> 
> 
> Brett
> -- 
> Brett Sheffield (he/him)
> Librecast - Decentralising the Internet with Multicast
> https://librecast.net/
> https://blog.brettsheffield.com/


Well, the issue was with host to guest, right?
Then testing what BQL does might be interesting.
It might help.
Something like this? Lightly tested.


diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index bfa49fa9e3a1..abc46354c107 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1076,6 +1076,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 	queue = netdev_get_tx_queue(dev, txq);
 
 	spin_lock(&tfile->tx_ring.producer_lock);
+	netdev_tx_sent_queue(queue, len);
 	ret = __ptr_ring_produce(&tfile->tx_ring, skb);
 	if (!qdisc_txq_has_no_queue(queue) &&
 	    __ptr_ring_check_produce(&tfile->tx_ring) == -ENOSPC) {
@@ -1088,6 +1089,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 	spin_unlock(&tfile->tx_ring.producer_lock);
 
 	if (ret) {
+		netdev_tx_completed_queue(queue, 1, len);
 		/* This should be a rare case if a qdisc is present, but
 		 * can happen due to lltx.
 		 * Since skb_tx_timestamp(), skb_orphan(),
@@ -2148,15 +2150,19 @@ static ssize_t tun_put_user(struct tun_struct *tun,
 
 /* Callers must hold ring.consumer_lock */
 static void __tun_wake_queue(struct tun_struct *tun,
-			     struct tun_file *tfile, int consumed)
+			     struct tun_file *tfile,
+			     unsigned int pkts, unsigned int bytes)
 {
 	struct netdev_queue *txq = netdev_get_tx_queue(tun->dev,
 						tfile->queue_index);
 
+	if (bytes)
+		netdev_tx_completed_queue(txq, pkts, bytes);
+
 	/* Paired with smp_mb__after_atomic() in tun_net_xmit() */
 	smp_mb();
 	if (netif_tx_queue_stopped(txq)) {
-		tfile->cons_cnt += consumed;
+		tfile->cons_cnt += pkts;
 		if (tfile->cons_cnt >= tfile->tx_ring.size / 2 ||
 		    __ptr_ring_empty(&tfile->tx_ring)) {
 			netif_tx_wake_queue(txq);
@@ -2167,12 +2173,16 @@ static void __tun_wake_queue(struct tun_struct *tun,
 
 static void *tun_ring_consume(struct tun_struct *tun, struct tun_file *tfile)
 {
+	unsigned int bytes = 0;
 	void *ptr;
 
 	spin_lock(&tfile->tx_ring.consumer_lock);
 	ptr = __ptr_ring_consume(&tfile->tx_ring);
-	if (ptr)
-		__tun_wake_queue(tun, tfile, 1);
+	if (ptr) {
+		if (!tun_is_xdp_frame(ptr))
+			bytes = ((struct sk_buff *)ptr)->len;
+		__tun_wake_queue(tun, tfile, 1, bytes);
+	}
 
 	spin_unlock(&tfile->tx_ring.consumer_lock);
 	return ptr;
@@ -3805,7 +3815,7 @@ struct ptr_ring *tun_get_tx_ring(struct file *file)
 EXPORT_SYMBOL_GPL(tun_get_tx_ring);
 
 /* Callers must hold ring.consumer_lock */
-void tun_wake_queue(struct file *file, int consumed)
+void tun_wake_queue(struct file *file, unsigned int pkts, unsigned int bytes)
 {
 	struct tun_file *tfile;
 	struct tun_struct *tun;
@@ -3821,7 +3831,7 @@ void tun_wake_queue(struct file *file, int consumed)
 
 	tun = rcu_dereference(tfile->tun);
 	if (tun)
-		__tun_wake_queue(tun, tfile, consumed);
+		__tun_wake_queue(tun, tfile, pkts, bytes);
 
 	rcu_read_unlock();
 }
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index db341c922673..5267b323bd59 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -181,14 +181,23 @@ static int vhost_net_buf_produce(struct sock *sk,
 {
 	struct file *file = sk->sk_socket->file;
 	struct vhost_net_buf *rxq = &nvq->rxq;
+	unsigned int bytes = 0;
+	int i;
 
 	rxq->head = 0;
 	spin_lock(&nvq->rx_ring->consumer_lock);
 	rxq->tail = __ptr_ring_consume_batched(nvq->rx_ring, rxq->queue,
 					       VHOST_NET_BATCH);
 
-	if (rxq->tail)
-		tun_wake_queue(file, rxq->tail);
+	if (rxq->tail) {
+		for (i = 0; i < rxq->tail; i++) {
+			void *ptr = rxq->queue[i];
+
+			if (!tun_is_xdp_frame(ptr))
+				bytes += ((struct sk_buff *)ptr)->len;
+		}
+		tun_wake_queue(file, rxq->tail, bytes);
+	}
 
 	spin_unlock(&nvq->rx_ring->consumer_lock);
 	return rxq->tail;
diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
index 5f3e206c7a73..49b85bf4f828 100644
--- a/include/linux/if_tun.h
+++ b/include/linux/if_tun.h
@@ -22,7 +22,7 @@ struct tun_msg_ctl {
 #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
 struct socket *tun_get_socket(struct file *);
 struct ptr_ring *tun_get_tx_ring(struct file *file);
-void tun_wake_queue(struct file *file, int consumed);
+void tun_wake_queue(struct file *file, unsigned int pkts, unsigned int bytes);
 
 static inline bool tun_is_xdp_frame(void *ptr)
 {
@@ -56,7 +56,8 @@ static inline struct ptr_ring *tun_get_tx_ring(struct file *f)
 	return ERR_PTR(-EINVAL);
 }
 
-static inline void tun_wake_queue(struct file *f, int consumed) {}
+static inline void tun_wake_queue(struct file *f,
+				  unsigned int pkts, unsigned int bytes) {}
 
 static inline bool tun_is_xdp_frame(void *ptr)
 {



* Re: [REGRESSION][BISECTED] tun/tap & vhost-net: multi-threaded network performance
  2026-07-02 11:07     ` Brett Sheffield
  2026-07-02 22:44       ` Michael S. Tsirkin
@ 2026-07-02 22:55       ` Michael S. Tsirkin
  2026-07-03 10:35         ` Simon Schippers
  2026-07-03 14:13         ` Brett Sheffield
  1 sibling, 2 replies; 14+ messages in thread
From: Michael S. Tsirkin @ 2026-07-02 22:55 UTC (permalink / raw)
  To: Brett Sheffield
  Cc: Simon Schippers, regressions, netdev, Jakub Kicinski, Tim Gebauer,
	Willem de Bruijn, Jason Wang, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, linux-kernel

On Thu, Jul 02, 2026 at 01:07:47PM +0200, Brett Sheffield wrote:
> On 2026-07-02 09:24, Simon Schippers wrote:
> > On 7/1/26 22:56, Michael S. Tsirkin wrote:
> > > On Wed, Jul 01, 2026 at 09:16:48PM +0200, Brett Sheffield wrote:
> > >> TL;DR - Commit 1d6e569b7d0c0b2736636749e4be0a27f3cefcb3 causes
> > >> significant performance regressions with TAP interfaces and multithreaded
> > >> network code. Please revert.
> > >>
> > >>
> > >> Librecast is an IPv6 multicast library. One of the tests (0055) fails under
> > >> Linux 7.2-rc1. The test performs data synchronization over IPv6 multicast using a TAP
> > >> interface. This test has run successfully on every stable, LTS and mainline RC
> > >> released in the past year. Every kernel with my Tested-by has run this test.
> > >>
> > >> There have been a bunch of changes to MLDv2 so I started bisecting there, but
> > >> the culprit is actually 1d6e569b7d0c0b2736636749e4be0a27f3cefcb3 "tun/tap &
> > >> vhost-net: avoid ptr_ring tail-drop when a qdisc is present"
> > >>
> > >> Reverting this commit fixes the test.
> > >>
> > >> To eliminate my code and any multicast weirdness, I ran tests with iperf3
> > >> comparing the same host running 7.2-rc1 both with and without 1d6e569b7d0
> > >> reverted.
> > 
> > Thank you very much for your bisect!
> > 
> > As the author, I am sorry for that regression!
> 
> No worries. That's why we test :-)
> 
> > > - does it help to increase the tun queue size?
> > 
> > I agree, this would be great to know.
> > 
> > However, even then we must act. I am considering IFF_BACKPRESSURE
> > as a feature flag, defaulting to off. It would just enable/disable
> > the stopping logic in tun_net_xmit() and the waking logic
> > in __tun_wake_queue(). If disabled, it would result in the same logic
> > as before.
> > 
> > I could provide such a patch as [net] material.
> 
> I'm going to make myself a strong cup of tea and dig into it a bit more here and
> will let you know if I find anything worth reporting.
> 
> If you need me to try re-testing with specific settings or test a patch I'm
> happy to do so.
> 
> Cheers,
> 
> 
> Brett
> -- 
> Brett Sheffield (he/him)
> Librecast - Decentralising the Internet with Multicast
> https://librecast.net/
> https://blog.brettsheffield.com/

Maybe it's the supposedly rare case? Does this change anything
for you?


diff --git a/drivers/net/tun.c b/drivers/net/tun.c
index bfa49fa9e3a1..bacd89460078 100644
--- a/drivers/net/tun.c
+++ b/drivers/net/tun.c
@@ -1018,7 +1018,6 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 	struct netdev_queue *queue;
 	struct tun_file *tfile;
 	int len = skb->len;
-	int ret;
 
 	rcu_read_lock();
 	tfile = rcu_dereference(tun->tfiles[txq]);
@@ -1064,19 +1063,24 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 		goto drop;
 	}
 
-	skb_tx_timestamp(skb);
-
-	/* Orphan the skb - required as we might hang on to it
-	 * for indefinite time.
-	 */
-	skb_orphan(skb);
-
-	nf_reset_ct(skb);
-
 	queue = netdev_get_tx_queue(dev, txq);
 
 	spin_lock(&tfile->tx_ring.producer_lock);
-	ret = __ptr_ring_produce(&tfile->tx_ring, skb);
+	if (__ptr_ring_check_produce(&tfile->tx_ring)) {
+		spin_unlock(&tfile->tx_ring.producer_lock);
+		netif_tx_stop_queue(queue);
+		smp_mb__after_atomic();
+		if (!__ptr_ring_check_produce(&tfile->tx_ring))
+			netif_tx_wake_queue(queue);
+		rcu_read_unlock();
+		return NETDEV_TX_BUSY;
+	}
+
+	skb_tx_timestamp(skb);
+	skb_orphan(skb);
+	nf_reset_ct(skb);
+
+	__ptr_ring_produce(&tfile->tx_ring, skb);
 	if (!qdisc_txq_has_no_queue(queue) &&
 	    __ptr_ring_check_produce(&tfile->tx_ring) == -ENOSPC) {
 		netif_tx_stop_queue(queue);
@@ -1087,18 +1091,6 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
 	}
 	spin_unlock(&tfile->tx_ring.producer_lock);
 
-	if (ret) {
-		/* This should be a rare case if a qdisc is present, but
-		 * can happen due to lltx.
-		 * Since skb_tx_timestamp(), skb_orphan(),
-		 * run_ebpf_filter() and pskb_trim() could have tinkered
-		 * with the SKB, returning NETDEV_TX_BUSY is unsafe and
-		 * we must drop instead.
-		 */
-		drop_reason = SKB_DROP_REASON_FULL_RING;
-		goto drop;
-	}
-
 	/* dev->lltx requires to do our own update of trans_start */
 	txq_trans_cond_update(queue);
 



* Re: [REGRESSION][BISECTED] tun/tap & vhost-net: multi-threaded network performance
  2026-07-02 22:44       ` Michael S. Tsirkin
@ 2026-07-03 10:34         ` Simon Schippers
  0 siblings, 0 replies; 14+ messages in thread
From: Simon Schippers @ 2026-07-03 10:34 UTC (permalink / raw)
  To: Michael S. Tsirkin, Brett Sheffield
  Cc: regressions, netdev, Jakub Kicinski, Tim Gebauer,
	Willem de Bruijn, Jason Wang, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, linux-kernel

On 7/3/26 00:44, Michael S. Tsirkin wrote:
> Well, the issue was with host to guest right?
> Then testing what does bql do might be interesting.
> Might help.
> Something like this? Lightly tested.

Your approach calls netdev_tx_completed_queue() once per individual
packet, which is wrong and will cause a constant BQL limit of 2, as we
have seen in [1]. That will cause a regression, *100%*.
Citing the documentation of netdev_tx_completed_queue() in netdevice.h:

 * Must be called at most once per TX completion round (and not per
 * individual packet), so that BQL can adjust its limits appropriately.

[1] Link: https://lore.kernel.org/all/e8cdba04-aa9a-45c6-9807-8274b62920df@tu-dortmund.de/
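
The difference between per-packet and per-round completion can be sketched
in a small userspace model. This is not kernel code: struct toy_bql and
toy_tx_completed() are hypothetical stand-ins for the BQL state and for
netdev_tx_completed_queue(), and the model only counts how often the
completion hook runs. It does not reproduce how BQL adapts its limit.

```c
#include <stddef.h>

/* Hypothetical stand-in for per-queue BQL accounting state */
struct toy_bql {
	unsigned int completion_calls;	/* number of completion rounds seen */
	unsigned int completed_pkts;
	unsigned int completed_bytes;
};

/* Stand-in for netdev_tx_completed_queue(): each call is one round */
static void toy_tx_completed(struct toy_bql *q,
			     unsigned int pkts, unsigned int bytes)
{
	q->completion_calls++;
	q->completed_pkts += pkts;
	q->completed_bytes += bytes;
}

/* Sum up the whole consumed batch, then report it in a single round */
static void toy_consume_batched(struct toy_bql *q,
				const unsigned int *lens, size_t n)
{
	unsigned int bytes = 0;
	size_t i;

	for (i = 0; i < n; i++)
		bytes += lens[i];
	if (n)
		toy_tx_completed(q, (unsigned int)n, bytes);
}

/* Problematic pattern: one completion round per individual packet */
static void toy_consume_per_packet(struct toy_bql *q,
				   const unsigned int *lens, size_t n)
{
	size_t i;

	for (i = 0; i < n; i++)
		toy_tx_completed(q, 1, lens[i]);
}
```

Both variants account the same packets and bytes. They differ only in how
many completion rounds are reported, and that count is what the quoted
documentation is about.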

> 
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index bfa49fa9e3a1..abc46354c107 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1076,6 +1076,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>  	queue = netdev_get_tx_queue(dev, txq);
>  
>  	spin_lock(&tfile->tx_ring.producer_lock);
> +	netdev_tx_sent_queue(queue, len);
>  	ret = __ptr_ring_produce(&tfile->tx_ring, skb);
>  	if (!qdisc_txq_has_no_queue(queue) &&
>  	    __ptr_ring_check_produce(&tfile->tx_ring) == -ENOSPC) {
> @@ -1088,6 +1089,7 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>  	spin_unlock(&tfile->tx_ring.producer_lock);
>  
>  	if (ret) {
> +		netdev_tx_completed_queue(queue, 1, len);
>  		/* This should be a rare case if a qdisc is present, but
>  		 * can happen due to lltx.
>  		 * Since skb_tx_timestamp(), skb_orphan(),
> @@ -2148,15 +2150,19 @@ static ssize_t tun_put_user(struct tun_struct *tun,
>  
>  /* Callers must hold ring.consumer_lock */
>  static void __tun_wake_queue(struct tun_struct *tun,
> -			     struct tun_file *tfile, int consumed)
> +			     struct tun_file *tfile,
> +			     unsigned int pkts, unsigned int bytes)
>  {
>  	struct netdev_queue *txq = netdev_get_tx_queue(tun->dev,
>  						tfile->queue_index);
>  
> +	if (bytes)
> +		netdev_tx_completed_queue(txq, pkts, bytes);
> +


Right here.



>  	/* Paired with smp_mb__after_atomic() in tun_net_xmit() */
>  	smp_mb();
>  	if (netif_tx_queue_stopped(txq)) {
> -		tfile->cons_cnt += consumed;
> +		tfile->cons_cnt += pkts;
>  		if (tfile->cons_cnt >= tfile->tx_ring.size / 2 ||
>  		    __ptr_ring_empty(&tfile->tx_ring)) {
>  			netif_tx_wake_queue(txq);
> @@ -2167,12 +2173,16 @@ static void __tun_wake_queue(struct tun_struct *tun,
>  
>  static void *tun_ring_consume(struct tun_struct *tun, struct tun_file *tfile)
>  {
> +	unsigned int bytes = 0;
>  	void *ptr;
>  
>  	spin_lock(&tfile->tx_ring.consumer_lock);
>  	ptr = __ptr_ring_consume(&tfile->tx_ring);
> -	if (ptr)
> -		__tun_wake_queue(tun, tfile, 1);
> +	if (ptr) {
> +		if (!tun_is_xdp_frame(ptr))
> +			bytes = ((struct sk_buff *)ptr)->len;
> +		__tun_wake_queue(tun, tfile, 1, bytes);
> +	}
>  
>  	spin_unlock(&tfile->tx_ring.consumer_lock);
>  	return ptr;
> @@ -3805,7 +3815,7 @@ struct ptr_ring *tun_get_tx_ring(struct file *file)
>  EXPORT_SYMBOL_GPL(tun_get_tx_ring);
>  
>  /* Callers must hold ring.consumer_lock */
> -void tun_wake_queue(struct file *file, int consumed)
> +void tun_wake_queue(struct file *file, unsigned int pkts, unsigned int bytes)
>  {
>  	struct tun_file *tfile;
>  	struct tun_struct *tun;
> @@ -3821,7 +3831,7 @@ void tun_wake_queue(struct file *file, int consumed)
>  
>  	tun = rcu_dereference(tfile->tun);
>  	if (tun)
> -		__tun_wake_queue(tun, tfile, consumed);
> +		__tun_wake_queue(tun, tfile, pkts, bytes);
>  
>  	rcu_read_unlock();
>  }
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index db341c922673..5267b323bd59 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -181,14 +181,23 @@ static int vhost_net_buf_produce(struct sock *sk,
>  {
>  	struct file *file = sk->sk_socket->file;
>  	struct vhost_net_buf *rxq = &nvq->rxq;
> +	unsigned int bytes = 0;
> +	int i;
>  
>  	rxq->head = 0;
>  	spin_lock(&nvq->rx_ring->consumer_lock);
>  	rxq->tail = __ptr_ring_consume_batched(nvq->rx_ring, rxq->queue,
>  					       VHOST_NET_BATCH);
>  
> -	if (rxq->tail)
> -		tun_wake_queue(file, rxq->tail);
> +	if (rxq->tail) {
> +		for (i = 0; i < rxq->tail; i++) {
> +			void *ptr = rxq->queue[i];
> +
> +			if (!tun_is_xdp_frame(ptr))
> +				bytes += ((struct sk_buff *)ptr)->len;
> +		}
> +		tun_wake_queue(file, rxq->tail, bytes);
> +	}
>  
>  	spin_unlock(&nvq->rx_ring->consumer_lock);
>  	return rxq->tail;
> diff --git a/include/linux/if_tun.h b/include/linux/if_tun.h
> index 5f3e206c7a73..49b85bf4f828 100644
> --- a/include/linux/if_tun.h
> +++ b/include/linux/if_tun.h
> @@ -22,7 +22,7 @@ struct tun_msg_ctl {
>  #if defined(CONFIG_TUN) || defined(CONFIG_TUN_MODULE)
>  struct socket *tun_get_socket(struct file *);
>  struct ptr_ring *tun_get_tx_ring(struct file *file);
> -void tun_wake_queue(struct file *file, int consumed);
> +void tun_wake_queue(struct file *file, unsigned int pkts, unsigned int bytes);
>  
>  static inline bool tun_is_xdp_frame(void *ptr)
>  {
> @@ -56,7 +56,8 @@ static inline struct ptr_ring *tun_get_tx_ring(struct file *f)
>  	return ERR_PTR(-EINVAL);
>  }
>  
> -static inline void tun_wake_queue(struct file *f, int consumed) {}
> +static inline void tun_wake_queue(struct file *f,
> +				  unsigned int pkts, unsigned int bytes) {}
>  
>  static inline bool tun_is_xdp_frame(void *ptr)
>  {
> 
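
For readers following the byte accounting in the vhost_net_buf_produce()
hunk above, here is a minimal userspace sketch of the same idea: walk a
consumed batch, skip entries tagged as XDP frames, and sum the lengths of
the rest. Everything prefixed fake_/FAKE_ is a hypothetical stand-in, and
the low-bit pointer tag is an assumption made for illustration; this is
not the kernel code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct sk_buff: only the length matters here. */
struct fake_skb {
	unsigned int len;
};

/* Assumption: XDP frames are marked by a tag in the pointer's low bit. */
#define FAKE_XDP_FLAG ((uintptr_t)1)

static int fake_is_xdp_frame(const void *ptr)
{
	return ((uintptr_t)ptr & FAKE_XDP_FLAG) != 0;
}

/* Tag a (suitably aligned) pointer so it is treated as an XDP frame. */
static void *fake_xdp_to_ptr(void *frame)
{
	return (void *)((uintptr_t)frame | FAKE_XDP_FLAG);
}

/* Sum the lengths of all non-XDP entries in a consumed batch of n entries. */
static unsigned int batch_skb_bytes(void *const *queue, int n)
{
	unsigned int bytes = 0;
	int i;

	assert(n >= 0);
	for (i = 0; i < n; i++) {
		const void *ptr = queue[i];

		/* XDP frames contribute a packet but no bytes, as in the hunk. */
		if (!fake_is_xdp_frame(ptr))
			bytes += ((const struct fake_skb *)ptr)->len;
	}
	return bytes;
}
```

The packet count passed to the wake helper is simply the batch size; only
the byte count needs the loop.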


* Re: [REGRESSION][BISECTED] tun/tap & vhost-net: multi-threaded network performance
  2026-07-02 22:55       ` Michael S. Tsirkin
@ 2026-07-03 10:35         ` Simon Schippers
  2026-07-03 14:13         ` Brett Sheffield
  1 sibling, 0 replies; 14+ messages in thread
From: Simon Schippers @ 2026-07-03 10:35 UTC (permalink / raw)
  To: Michael S. Tsirkin, Brett Sheffield
  Cc: regressions, netdev, Jakub Kicinski, Tim Gebauer,
	Willem de Bruijn, Jason Wang, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, linux-kernel

On 7/3/26 00:55, Michael S. Tsirkin wrote:
> Maybe it's the supposedly rare case? Does this change anything
> for you?

In the "rare case" we would have packet drops, visible as iperf3 TCP
retransmissions. But, as I stated before, these packet drops don't occur.

Also, we are still not allowed to return NETDEV_TX_BUSY here, because
run_ebpf_filter() and pskb_trim() could have tinkered with the SKB.

> 
> 
> diff --git a/drivers/net/tun.c b/drivers/net/tun.c
> index bfa49fa9e3a1..bacd89460078 100644
> --- a/drivers/net/tun.c
> +++ b/drivers/net/tun.c
> @@ -1018,7 +1018,6 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>  	struct netdev_queue *queue;
>  	struct tun_file *tfile;
>  	int len = skb->len;
> -	int ret;
>  
>  	rcu_read_lock();
>  	tfile = rcu_dereference(tun->tfiles[txq]);
> @@ -1064,19 +1063,24 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>  		goto drop;
>  	}
>  
> -	skb_tx_timestamp(skb);
> -
> -	/* Orphan the skb - required as we might hang on to it
> -	 * for indefinite time.
> -	 */
> -	skb_orphan(skb);
> -
> -	nf_reset_ct(skb);
> -
>  	queue = netdev_get_tx_queue(dev, txq);
>  
>  	spin_lock(&tfile->tx_ring.producer_lock);
> -	ret = __ptr_ring_produce(&tfile->tx_ring, skb);
> +	if (__ptr_ring_check_produce(&tfile->tx_ring)) {
> +		spin_unlock(&tfile->tx_ring.producer_lock);
> +		netif_tx_stop_queue(queue);
> +		smp_mb__after_atomic();
> +		if (!__ptr_ring_check_produce(&tfile->tx_ring))
> +			netif_tx_wake_queue(queue);
> +		rcu_read_unlock();
> +		return NETDEV_TX_BUSY;
> +	}
> +
> +	skb_tx_timestamp(skb);
> +	skb_orphan(skb);
> +	nf_reset_ct(skb);
> +
> +	__ptr_ring_produce(&tfile->tx_ring, skb);
>  	if (!qdisc_txq_has_no_queue(queue) &&
>  	    __ptr_ring_check_produce(&tfile->tx_ring) == -ENOSPC) {
>  		netif_tx_stop_queue(queue);
> @@ -1087,18 +1091,6 @@ static netdev_tx_t tun_net_xmit(struct sk_buff *skb, struct net_device *dev)
>  	}
>  	spin_unlock(&tfile->tx_ring.producer_lock);
>  
> -	if (ret) {
> -		/* This should be a rare case if a qdisc is present, but
> -		 * can happen due to lltx.
> -		 * Since skb_tx_timestamp(), skb_orphan(),
> -		 * run_ebpf_filter() and pskb_trim() could have tinkered
> -		 * with the SKB, returning NETDEV_TX_BUSY is unsafe and
> -		 * we must drop instead.
> -		 */
> -		drop_reason = SKB_DROP_REASON_FULL_RING;
> -		goto drop;
> -	}
> -
>  	/* dev->lltx requires to do our own update of trans_start */
>  	txq_trans_cond_update(queue);
>  
> 
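
As an illustration of the check, stop, re-check, wake ordering that the
quoted hunk adds at the top of the produce path, here is a small userspace
sketch using C11 atomics. The ring is reduced to a counter, the producer
lock and all skb handling are left out, and every fake_/FAKE_ name is
hypothetical. It only shows why the second fullness check follows the
barrier: a consumer may have freed a slot between the first check and the
stop.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical ring: just an occupancy counter and a capacity. */
struct fake_ring {
	atomic_int used;
	int size;
};

/* Hypothetical TX queue state: stopped or running. */
struct fake_txq {
	atomic_bool stopped;
};

enum fake_tx { FAKE_TX_OK, FAKE_TX_BUSY };

static void fake_ring_init(struct fake_ring *r, int size)
{
	assert(size > 0);
	atomic_init(&r->used, 0);
	r->size = size;
}

static void fake_txq_init(struct fake_txq *q)
{
	atomic_init(&q->stopped, false);
}

static bool fake_ring_full(struct fake_ring *r)
{
	return atomic_load(&r->used) >= r->size;
}

static bool fake_txq_stopped(struct fake_txq *q)
{
	return atomic_load(&q->stopped);
}

/* Producer: refuse the packet while the ring is full, otherwise enqueue. */
static enum fake_tx fake_xmit(struct fake_ring *r, struct fake_txq *q)
{
	if (fake_ring_full(r)) {
		atomic_store(&q->stopped, true);
		/* Order the stop before the re-check of the ring. */
		atomic_thread_fence(memory_order_seq_cst);
		/* A slot may have been freed concurrently: undo the stop. */
		if (!fake_ring_full(r))
			atomic_store(&q->stopped, false);
		return FAKE_TX_BUSY;
	}
	atomic_fetch_add(&r->used, 1);
	return FAKE_TX_OK;
}

/* Consumer: take one entry if present and wake the queue. */
static void fake_consume(struct fake_ring *r, struct fake_txq *q)
{
	if (atomic_load(&r->used) > 0) {
		atomic_fetch_sub(&r->used, 1);
		atomic_store(&q->stopped, false);
	}
}
```

The sketch says nothing about whether returning busy is safe at that point
in tun_net_xmit(); that depends on what has already been done to the skb.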


* Re: [REGRESSION][BISECTED] tun/tap & vhost-net: multi-threaded network performance
  2026-07-01 19:16 [REGRESSION][BISECTED] tun/tap & vhost-net: multi-threaded network performance Brett Sheffield
  2026-07-01 20:56 ` Michael S. Tsirkin
@ 2026-07-03 10:41 ` Simon Schippers
  2026-07-03 11:55   ` Michael S. Tsirkin
  1 sibling, 1 reply; 14+ messages in thread
From: Simon Schippers @ 2026-07-03 10:41 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jakub Kicinski, Tim Gebauer, Willem de Bruijn, Jason Wang,
	Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
	linux-kernel, Brett Sheffield, regressions, netdev

On 7/1/26 21:16, Brett Sheffield wrote:
> TL;DR - Commit 1d6e569b7d0c0b2736636749e4be0a27f3cefcb3 causes
> significant performance regressions with TAP interfaces and multithreaded
> network code. Please revert.

Michael, I am convinced that netif_tx_stop_queue() and
netif_tx_wake_queue() are too slow. The communication between the CPUs
is just too slow. We already spent *a lot* of time speeding it up; the
patchset was only merged at v12...

Yes, I would *love* to have qdisc backpressure on by default but I do
not see how anything could speed it up.

*Is there any issue if we introduce IFF_BACKPRESSURE?*
I implemented it and it is a simple patch that only executes the
stopping / waking if (tun->flags & IFF_BACKPRESSURE).
I could submit that to [net] very soon.

Thank you,
Simon
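
To make the opt-in idea concrete, here is a rough userspace sketch of
gating the stop / wake logic behind a flag. It is not the patch mentioned
above, which is not shown in this thread; the flag value, the struct and
every fake_/FAKE_ name are hypothetical. Without the flag a full ring
tail-drops as before; with the flag the queue is stopped when the ring
fills and woken when the consumer frees a slot.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag value, chosen only for this sketch. */
#define FAKE_IFF_BACKPRESSURE 0x0001u

/* Hypothetical device state: a counter stands in for the ptr_ring. */
struct fake_tun {
	unsigned int flags;
	unsigned int ring_used;
	unsigned int ring_size;
	unsigned long drops;
	bool txq_stopped;
};

static void fake_tun_init(struct fake_tun *tun, unsigned int flags,
			  unsigned int ring_size)
{
	assert(ring_size > 0);
	tun->flags = flags;
	tun->ring_used = 0;
	tun->ring_size = ring_size;
	tun->drops = 0;
	tun->txq_stopped = false;
}

/* Returns true if the packet was queued, false if it was tail-dropped. */
static bool fake_tun_xmit(struct fake_tun *tun)
{
	if (tun->ring_used == tun->ring_size) {
		tun->drops++;
		return false;
	}
	tun->ring_used++;

	/* Only stop the queue when backpressure was requested. */
	if ((tun->flags & FAKE_IFF_BACKPRESSURE) &&
	    tun->ring_used == tun->ring_size)
		tun->txq_stopped = true;
	return true;
}

/* Consume one packet; wake the queue only in backpressure mode. */
static void fake_tun_consume(struct fake_tun *tun)
{
	if (tun->ring_used == 0)
		return;
	tun->ring_used--;
	if ((tun->flags & FAKE_IFF_BACKPRESSURE) && tun->txq_stopped)
		tun->txq_stopped = false;
}
```

With the flag clear, the stop / wake branches are never taken, so the
cross-CPU cost discussed above is paid only by users who opt in.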


* Re: [REGRESSION][BISECTED] tun/tap & vhost-net: multi-threaded network performance
  2026-07-03 10:41 ` Simon Schippers
@ 2026-07-03 11:55   ` Michael S. Tsirkin
  0 siblings, 0 replies; 14+ messages in thread
From: Michael S. Tsirkin @ 2026-07-03 11:55 UTC (permalink / raw)
  To: Simon Schippers
  Cc: Jakub Kicinski, Tim Gebauer, Willem de Bruijn, Jason Wang,
	Andrew Lunn, David S. Miller, Eric Dumazet, Paolo Abeni,
	linux-kernel, Brett Sheffield, regressions, netdev

On Fri, Jul 03, 2026 at 12:41:19PM +0200, Simon Schippers wrote:
> On 7/1/26 21:16, Brett Sheffield wrote:
> > TL;DR - Commit 1d6e569b7d0c0b2736636749e4be0a27f3cefcb3 causes
> > significant performance regressions with TAP interfaces and multithreaded
> > network code. Please revert.
> 
> Michael, I am convinced that netif_tx_stop_queue() and
> netif_tx_wake_queue() are too slow. The communication between the CPUs
> is just too slow. We already spent *a lot* of time speeding it up; the
> patchset was only merged at v12...
> 
> Yes, I would *love* to have qdisc backpressure on by default but I do
> not see how anything could speed it up.
> 
> 
> *Is there any issue if we introduce IFF_BACKPRESSURE?*
> I implemented it and it is a simple patch that only executes the
> stopping / waking if (tun->flags & IFF_BACKPRESSURE).
> I could submit that to [net] very soon.
> 
> Thank you,
> Simon

Please go ahead.

-- 
MST



* Re: [REGRESSION][BISECTED] tun/tap & vhost-net: multi-threaded network performance
  2026-07-02 22:55       ` Michael S. Tsirkin
  2026-07-03 10:35         ` Simon Schippers
@ 2026-07-03 14:13         ` Brett Sheffield
  2026-07-03 14:19           ` Michael S. Tsirkin
  1 sibling, 1 reply; 14+ messages in thread
From: Brett Sheffield @ 2026-07-03 14:13 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Simon Schippers, regressions, netdev, Jakub Kicinski, Tim Gebauer,
	Willem de Bruijn, Jason Wang, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, linux-kernel

On 2026-07-02 18:55, Michael S. Tsirkin wrote:
> On Thu, Jul 02, 2026 at 01:07:47PM +0200, Brett Sheffield wrote:
> > On 2026-07-02 09:24, Simon Schippers wrote:
> > > On 7/1/26 22:56, Michael S. Tsirkin wrote:
> > > > - does it help to increase the tun queue size?
> > > 
> > > I agree, this would be great to know.

I tried adding a queue (qdisc pfifo limit 500) on both host and guest - no
apparent difference in drops or performance.

If there's any other setting you want me to try, let me know.

> Maybe it's the supposedly rare case? Does this change anything
> for you?

I also tried both patches, but again, no noticeable difference.


-- 
Brett Sheffield (he/him)
Librecast - Decentralising the Internet with Multicast
https://librecast.net/
https://blog.brettsheffield.com/


* Re: [REGRESSION][BISECTED] tun/tap & vhost-net: multi-threaded network performance
  2026-07-03 14:13         ` Brett Sheffield
@ 2026-07-03 14:19           ` Michael S. Tsirkin
  0 siblings, 0 replies; 14+ messages in thread
From: Michael S. Tsirkin @ 2026-07-03 14:19 UTC (permalink / raw)
  To: Brett Sheffield
  Cc: Simon Schippers, regressions, netdev, Jakub Kicinski, Tim Gebauer,
	Willem de Bruijn, Jason Wang, Andrew Lunn, David S. Miller,
	Eric Dumazet, Paolo Abeni, linux-kernel

On Fri, Jul 03, 2026 at 04:13:49PM +0200, Brett Sheffield wrote:
> On 2026-07-02 18:55, Michael S. Tsirkin wrote:
> > On Thu, Jul 02, 2026 at 01:07:47PM +0200, Brett Sheffield wrote:
> > > On 2026-07-02 09:24, Simon Schippers wrote:
> > > > On 7/1/26 22:56, Michael S. Tsirkin wrote:
> > > > > - does it help to increase the tun queue size?
> > > > 
> > > > I agree, this would be great to know.
> 
> I tried adding a queue (qdisc pfifo limit 500) on both host and guest - no
> apparent difference in drops or performance.
> 
> If there's any other setting you want me to try, let me know.
> 
> > Maybe it's the supposedly rare case? Does this change anything
> > for you?
> 
> I also tried both patches, but again, no noticeable difference.
> 

Thank you. Simon pointed out issues with the patches, and he also said his
strong intuition is that this is not the way to go.
Let's wait for him to post the backpressure patch and see.

-- 
MST



end of thread (last message: 2026-07-03 14:19 UTC)

Thread overview: 14+ messages
2026-07-01 19:16 [REGRESSION][BISECTED] tun/tap & vhost-net: multi-threaded network performance Brett Sheffield
2026-07-01 20:56 ` Michael S. Tsirkin
2026-07-02  7:24   ` Simon Schippers
2026-07-02  7:42     ` Michael S. Tsirkin
2026-07-02  8:01       ` Simon Schippers
2026-07-02 11:07     ` Brett Sheffield
2026-07-02 22:44       ` Michael S. Tsirkin
2026-07-03 10:34         ` Simon Schippers
2026-07-02 22:55       ` Michael S. Tsirkin
2026-07-03 10:35         ` Simon Schippers
2026-07-03 14:13         ` Brett Sheffield
2026-07-03 14:19           ` Michael S. Tsirkin
2026-07-03 10:41 ` Simon Schippers
2026-07-03 11:55   ` Michael S. Tsirkin
