* [PATCH net-next v3] vhost/net: Defer TX queue re-enable until after sendmsg
@ 2025-05-01  2:04 Jon Kohler
  2025-05-01 13:44 ` Michael S. Tsirkin
  2025-05-06  1:40 ` patchwork-bot+netdevbpf
  0 siblings, 2 replies; 8+ messages in thread
From: Jon Kohler @ 2025-05-01  2:04 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang, Eugenio Pérez, kvm,
	virtualization, netdev, linux-kernel
  Cc: Jon Kohler

In handle_tx_copy, TX batching processes packets below ~PAGE_SIZE and
batches up to 64 messages before calling sock->sendmsg.

Currently, when there are no more messages on the ring to dequeue,
handle_tx_copy re-enables kicks on the ring *before* firing off the
batch sendmsg. However, sock->sendmsg incurs a non-zero delay,
especially if it needs to wake up a thread (e.g., another vhost worker).

If the guest submits additional messages immediately after the last ring
check and disablement, it triggers an EPT_MISCONFIG vmexit to attempt to
kick the vhost worker. This may happen while the worker is still
processing the sendmsg, leading to wasteful exit(s).

This is particularly problematic for single-threaded guest submission
threads, as they must exit, wait for the exit to be processed
(potentially involving a TTWU), and then resume.

In scenarios like a constant stream of UDP messages, this results in a
sawtooth pattern where the submitter frequently vmexits, and the
vhost-net worker alternates between sleeping and waking.

A common solution is to configure vhost-net busy polling via userspace
(e.g., qemu poll-us). However, treating the sendmsg as the "busy"
period by keeping kicks disabled during the final sendmsg and
performing one additional ring check afterward provides a significant
performance improvement without any excess busy poll cycles.

If messages are found in the ring after the final sendmsg, requeue the
TX handler. This ensures fairness for the RX handler and allows
vhost_run_work_list to cond_resched() as needed.

Test Case
    TX VM: taskset -c 2 iperf3  -c rx-ip-here -t 60 -p 5200 -b 0 -u -i 5
    RX VM: taskset -c 2 iperf3 -s -p 5200 -D
    6.12.0, each worker backed by tun interface with IFF_NAPI setup.
    Note: TCP side is largely unchanged as that was copy bound

6.12.0 unpatched
    EPT_MISCONFIG/second: 5411
    Datagrams/second: ~382k
    Interval         Transfer     Bitrate         Lost/Total Datagrams
    0.00-30.00  sec  15.5 GBytes  4.43 Gbits/sec  0/11481630 (0%)  sender

6.12.0 patched
    EPT_MISCONFIG/second: 58 (~93x reduction)
    Datagrams/second: ~650k  (~1.7x increase)
    Interval         Transfer     Bitrate         Lost/Total Datagrams
    0.00-30.00  sec  26.4 GBytes  7.55 Gbits/sec  0/19554720 (0%)  sender

Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
---
v2->v3: Address MST's comments regarding busyloop_intr
	https://patchwork.kernel.org/project/netdevbpf/patch/20250420010518.2842335-1-jon@nutanix.com/
v1->v2: Move from net to net-next (no changes)
	https://patchwork.kernel.org/project/netdevbpf/patch/20250401043230.790419-1-jon@nutanix.com/
---
 drivers/vhost/net.c | 30 +++++++++++++++++++++---------
 1 file changed, 21 insertions(+), 9 deletions(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index b9b9e9d40951..7cbfc7d718b3 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -755,10 +755,10 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
 	int err;
 	int sent_pkts = 0;
 	bool sock_can_batch = (sock->sk->sk_sndbuf == INT_MAX);
+	bool busyloop_intr;
 
 	do {
-		bool busyloop_intr = false;
-
+		busyloop_intr = false;
 		if (nvq->done_idx == VHOST_NET_BATCH)
 			vhost_tx_batch(net, nvq, sock, &msg);
 
@@ -769,13 +769,10 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
 			break;
 		/* Nothing new?  Wait for eventfd to tell us they refilled. */
 		if (head == vq->num) {
-			if (unlikely(busyloop_intr)) {
-				vhost_poll_queue(&vq->poll);
-			} else if (unlikely(vhost_enable_notify(&net->dev,
-								vq))) {
-				vhost_disable_notify(&net->dev, vq);
-				continue;
-			}
+			/* Kicks are disabled at this point, break loop and
+			 * process any remaining batched packets. Queue will
+			 * be re-enabled afterwards.
+			 */
 			break;
 		}
 
@@ -825,7 +822,22 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
 		++nvq->done_idx;
 	} while (likely(!vhost_exceeds_weight(vq, ++sent_pkts, total_len)));
 
+	/* Kicks are still disabled, dispatch any remaining batched msgs. */
 	vhost_tx_batch(net, nvq, sock, &msg);
+
+	if (unlikely(busyloop_intr))
+		/* If interrupted while doing busy polling, requeue the
+		 * handler to be fair to handle_rx as well as other tasks
+		 * waiting on the CPU.
+		 */
+		vhost_poll_queue(&vq->poll);
+	else
+		/* All of our work has been completed; however, before
+		 * leaving the TX handler, do one last check for work,
+		 * and requeue the handler if necessary. If there is no
+		 * work, the queue will be re-enabled.
+		 */
+		vhost_net_busy_poll_try_queue(net, vq);
 }
 
 static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
-- 
2.43.0


* Re: [PATCH net-next v3] vhost/net: Defer TX queue re-enable until after sendmsg
  2025-05-01  2:04 [PATCH net-next v3] vhost/net: Defer TX queue re-enable until after sendmsg Jon Kohler
@ 2025-05-01 13:44 ` Michael S. Tsirkin
  2025-05-06  1:40 ` patchwork-bot+netdevbpf
  1 sibling, 0 replies; 8+ messages in thread
From: Michael S. Tsirkin @ 2025-05-01 13:44 UTC (permalink / raw)
  To: Jon Kohler
  Cc: Jason Wang, Eugenio Pérez, kvm, virtualization, netdev,
	linux-kernel

On Wed, Apr 30, 2025 at 07:04:28PM -0700, Jon Kohler wrote:
> In handle_tx_copy, TX batching processes packets below ~PAGE_SIZE and
> batches up to 64 messages before calling sock->sendmsg.
> 
> [...]
> 
> Acked-by: Jason Wang <jasowang@redhat.com>
> Signed-off-by: Jon Kohler <jon@nutanix.com>

Acked-by: Michael S. Tsirkin <mst@redhat.com>



* Re: [PATCH net-next v3] vhost/net: Defer TX queue re-enable until after sendmsg
  2025-05-01  2:04 [PATCH net-next v3] vhost/net: Defer TX queue re-enable until after sendmsg Jon Kohler
  2025-05-01 13:44 ` Michael S. Tsirkin
@ 2025-05-06  1:40 ` patchwork-bot+netdevbpf
  2025-09-10 18:58   ` vhost_iotlb_miss tight loop lockup - RE " Jon Kohler
  1 sibling, 1 reply; 8+ messages in thread
From: patchwork-bot+netdevbpf @ 2025-05-06  1:40 UTC (permalink / raw)
  To: Jon Kohler
  Cc: mst, jasowang, eperezma, kvm, virtualization, netdev,
	linux-kernel

Hello:

This patch was applied to netdev/net-next.git (main)
by Jakub Kicinski <kuba@kernel.org>:

On Wed, 30 Apr 2025 19:04:28 -0700 you wrote:
> In handle_tx_copy, TX batching processes packets below ~PAGE_SIZE and
> batches up to 64 messages before calling sock->sendmsg.
> 
> Currently, when there are no more messages on the ring to dequeue,
> handle_tx_copy re-enables kicks on the ring *before* firing off the
> batch sendmsg. However, sock->sendmsg incurs a non-zero delay,
> especially if it needs to wake up a thread (e.g., another vhost worker).
> 
> [...]

Here is the summary with links:
  - [net-next,v3] vhost/net: Defer TX queue re-enable until after sendmsg
    https://git.kernel.org/netdev/net-next/c/8c2e6b26ffe2

You are awesome, thank you!
-- 
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



* vhost_iotlb_miss tight loop lockup - RE vhost/net: Defer TX queue re-enable until after sendmsg
  2025-05-06  1:40 ` patchwork-bot+netdevbpf
@ 2025-09-10 18:58   ` Jon Kohler
  2025-09-10 20:10     ` Michael S. Tsirkin
  0 siblings, 1 reply; 8+ messages in thread
From: Jon Kohler @ 2025-09-10 18:58 UTC (permalink / raw)
  To: patchwork-bot+netdevbpf@kernel.org
  Cc: mst@redhat.com, jasowang@redhat.com, eperezma@redhat.com,
	kvm@vger.kernel.org, virtualization@lists.linux.dev,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org



> On May 5, 2025, at 9:40 PM, patchwork-bot+netdevbpf@kernel.org wrote:
> 
> Hello:
> 
> This patch was applied to netdev/net-next.git (main)
> by Jakub Kicinski <kuba@kernel.org>:

Hey all,
Writing to fire up a flare and point out a problem that we’re seeing
with this patch internally, specifically when we enable iommu on the
virtio-net device.

With this patch applied on 6.12.y-based bare metal instance and then
starting a 6.12.y based guest with iommu enabled, we see lockups
within the guest in short order, as well as vmm (qemu) stuck in a tight
loop responding to iommu misses from vhost net loop.

We've bisected this in our internal tree, and the bisect points squarely
at this patch, so I wanted to point out that there is some sort of issue
here.

Working on trying to figure this out, but if it jumps off the page to
anyone, happy to take advice!

Flamegraph:
https://gist.github.com/JonKohler/0e83c014230ab59ddc950f10441335f1#file-iotlb-lockup-svg

Guest dmesg errors like so:
[   66.081694] virtio_net virtio0 eth0: NETDEV WATCHDOG: CPU: 1: transmit queue 0 timed out 5500 ms
[   68.145155] virtio_net virtio0 eth0: TX timeout on queue: 0, sq: output.0, vq: 0x1, name: output.0, 7560000 usecs ago
[  112.907012] virtio_net virtio0 eth0: NETDEV WATCHDOG: CPU: 1: transmit queue 0 timed out 5568 ms
[  124.117540] virtio_net virtio0 eth0: TX timeout on queue: 0, sq: output.0, vq: 0x1, name: output.0, 16776000 usecs ago
[  124.118050] virtio_net virtio0 eth0: NETDEV WATCHDOG: CPU: 1: transmit queue 0 timed out 16776 ms
[  124.118447] virtio_net virtio0 eth0: TX timeout on queue: 0, sq: output.0, vq: 0x1, name: output.0, 16776000 usecs ago

Host level top output
3992758 qemu      20   0   16.6g  52168  26704 R  99.9   0.0  21:23.72 qemu-kvm       <<< this is the qemu main thread
3992769 qemu      20   0   16.6g  52168  26704 R  58.8   0.0  13:33.44 vhost-3992758 <<< this is the vhost-net kthread

For qemu-kvm main thread: 
Samples: 13K of event 'cycles:P', 4000 Hz, Event count (approx.): 5131922583 lost: 0/0 drop: 0/0
  Children      Self  Shared Object     Symbol
-   87.41%     0.30%  [kernel]          [k] entry_SYSCALL_64_after_hwframe
   - 87.11% entry_SYSCALL_64_after_hwframe
      - do_syscall_64
         - 44.79% ksys_write
            - 43.74% vfs_write
               - 40.96% vhost_chr_write_iter
                  - 38.22% vhost_process_iotlb_msg
                     - 13.72% vhost_iotlb_add_range_ctx
                        - 7.43% vhost_iotlb_map_free
                           - 4.37% vhost_iotlb_itree_remove
                                rb_next
                             1.78% __rb_erase_color
                             0.73% kfree
                          1.15% __rb_insert_augmented
                          0.68% __kmalloc_cache_noprof
                     - 10.73% vhost_vq_work_queue
                        - 7.65% try_to_wake_up
                           - 2.55% ttwu_queue_wakelist
                              - 1.72% __smp_call_single_queue
                                   1.36% call_function_single_prep_ipi
                           - 1.32% __task_rq_lock
                              - _raw_spin_lock
                                   native_queued_spin_lock_slowpath
                           - 1.30% select_task_rq
                              - select_task_rq_fair
                                 - 0.88% wake_affine
                                      available_idle_cpu
                          2.06% llist_add_batch
                     - 4.05% __mutex_lock.constprop.0
                          2.14% mutex_spin_on_owner
                          0.72% osq_lock
                       3.00% mutex_lock
                     - 1.72% kfree
                        - 1.16% __slab_free
                             slab_update_freelist.constprop.0.isra.0
                       1.37% _raw_spin_lock
                       1.08% mutex_unlock
                    1.98% _copy_from_iter
               - 1.86% rw_verify_area
                  - security_file_permission
                     - 1.13% file_has_perm
                          0.69% avc_has_perm
              0.63% fdget_pos
         - 27.86% syscall_exit_to_user_mode
            - syscall_exit_to_user_mode_prepare
               - 25.96% __audit_syscall_exit
                  - 25.03% __audit_filter_op
                       6.66% audit_filter_rules.constprop.0
                 1.27% audit_reset_context.part.0.constprop.0
         - 10.86% ksys_read
            - 9.37% vfs_read
               - 6.67% vhost_chr_read_iter
                    1.48% _copy_to_iter
                    1.36% _raw_spin_lock
                  - 1.30% __wake_up
                       0.81% _raw_spin_lock_irqsave
                  - 1.25% vhost_enqueue_msg
                       _raw_spin_lock
               - 1.83% rw_verify_area
                  - security_file_permission
                     - 1.03% file_has_perm
                          0.64% avc_has_perm
              0.65% fdget_pos
              0.57% fput
         - 2.56% syscall_trace_enter
            - 1.25% __seccomp_filter
                 seccomp_run_filters
              0.54% __audit_syscall_entry
              
vhost-net thread
Samples: 20K of event 'cycles:P', 4000 Hz, Event count (approx.): 7796456297 lost: 0/0 drop: 0/0
  Children      Self  Shared Object     Symbol
-  100.00%     3.38%  [kernel]          [k] vhost_task_fn
     38.26% 0xffffffff930bb8c0
   - 3.36% 0
        ret_from_fork_asm
        ret_from_fork
   - 1.16% vhost_task_fn
      - 2.35% vhost_run_work_list
         - 1.67% handle_tx
            - 7.09% __mutex_lock.constprop.0
                 6.64% mutex_spin_on_owner
            - 0.84% vq_meta_prefetch
               - 3.22% iotlb_access_ok
                    2.50% vhost_iotlb_itree_first
              0.80% mutex_lock
            - 0.75% handle_tx_copy
           0.86% llist_reverse_order



* Re: vhost_iotlb_miss tight loop lockup - RE vhost/net: Defer TX queue re-enable until after sendmsg
  2025-09-10 18:58   ` vhost_iotlb_miss tight loop lockup - RE " Jon Kohler
@ 2025-09-10 20:10     ` Michael S. Tsirkin
  2025-09-11  5:49       ` Jason Wang
  0 siblings, 1 reply; 8+ messages in thread
From: Michael S. Tsirkin @ 2025-09-10 20:10 UTC (permalink / raw)
  To: Jon Kohler
  Cc: patchwork-bot+netdevbpf@kernel.org, jasowang@redhat.com,
	eperezma@redhat.com, kvm@vger.kernel.org,
	virtualization@lists.linux.dev, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org

On Wed, Sep 10, 2025 at 06:58:18PM +0000, Jon Kohler wrote:
> Hey all,
> Writing to fire up a flare and point out a problem that we’re seeing
> with this patch internally, specifically when we enable iommu on the
> virtio-net device.
> 
> With this patch applied on 6.12.y-based bare metal instance and then
> starting a 6.12.y based guest with iommu enabled, we see lockups
> within the guest in short order, as well as vmm (qemu) stuck in a tight
> loop responding to iommu misses from vhost net loop.
> 
> [...]


Well it seems that if  get_tx_bufs failed with -EAGAIN then we
previously bailed out, but now we will redo poll and so on, forever.


No?
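
To make the suspected sequence concrete, here is the tail of the patched
handle_tx_copy() from the patch above, annotated with how the vIOMMU case
seems to go wrong (the numbered comments are interpretation, not taken
from the source):

	/* 1. vhost_get_vq_desc() hits an IOTLB miss, get_tx_bufs()
	 *    returns -EAGAIN, and the main loop breaks with the
	 *    descriptor still sitting in the avail ring and
	 *    busyloop_intr == false.
	 */
	vhost_tx_batch(net, nvq, sock, &msg);

	if (unlikely(busyloop_intr))
		vhost_poll_queue(&vq->poll);
	else
		/* 2. The avail ring is not empty (nothing was consumed),
		 *    so this requeues the TX handler right away, before
		 *    userspace has had a chance to service the miss.
		 * 3. The requeued handler misses again, posts another
		 *    IOTLB miss message, and the vhost worker and the
		 *    QEMU main thread spin on each other.
		 */
		vhost_net_busy_poll_try_queue(net, vq);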


-- 
MST


* Re: vhost_iotlb_miss tight loop lockup - RE vhost/net: Defer TX queue re-enable until after sendmsg
  2025-09-10 20:10     ` Michael S. Tsirkin
@ 2025-09-11  5:49       ` Jason Wang
  2025-09-11  5:55         ` Michael S. Tsirkin
  2025-09-11 13:24         ` Jon Kohler
  0 siblings, 2 replies; 8+ messages in thread
From: Jason Wang @ 2025-09-11  5:49 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jon Kohler, patchwork-bot+netdevbpf@kernel.org,
	eperezma@redhat.com, kvm@vger.kernel.org,
	virtualization@lists.linux.dev, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org

On Thu, Sep 11, 2025 at 4:11 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>
> On Wed, Sep 10, 2025 at 06:58:18PM +0000, Jon Kohler wrote:
> >
> > [...]
>
> Well it seems that if  get_tx_bufs failed with -EAGAIN then we
> previously bailed out, but now we will redo poll and so on, forever.

Something like this: the vhost_vq_avail_empty() check will cause
vhost_poll_queue() to requeue the handler in this case.
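
For reference, vhost_net_busy_poll_try_queue() is roughly the following
(paraphrased, may differ slightly from the exact code in
drivers/vhost/net.c; the comment is an annotation, not from the source):

static void vhost_net_busy_poll_try_queue(struct vhost_net *net,
					  struct vhost_virtqueue *vq)
{
	if (!vhost_vq_avail_empty(&net->dev, vq))
		/* The ring still holds the descriptor we failed to
		 * translate, so the handler is requeued immediately.
		 */
		vhost_poll_queue(&vq->poll);
	else if (unlikely(vhost_enable_notify(&net->dev, vq))) {
		vhost_disable_notify(&net->dev, vq);
		vhost_poll_queue(&vq->poll);
	}
}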

Let me post a patch to fix that.

Thanks

>
>
> No?
>
>
> --
> MST
>


* Re: vhost_iotlb_miss tight loop lockup - RE vhost/net: Defer TX queue re-enable until after sendmsg
  2025-09-11  5:49       ` Jason Wang
@ 2025-09-11  5:55         ` Michael S. Tsirkin
  2025-09-11 13:24         ` Jon Kohler
  1 sibling, 0 replies; 8+ messages in thread
From: Michael S. Tsirkin @ 2025-09-11  5:55 UTC (permalink / raw)
  To: Jason Wang
  Cc: Jon Kohler, patchwork-bot+netdevbpf@kernel.org,
	eperezma@redhat.com, kvm@vger.kernel.org,
	virtualization@lists.linux.dev, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org

On Thu, Sep 11, 2025 at 01:49:05PM +0800, Jason Wang wrote:
> On Thu, Sep 11, 2025 at 4:11 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Wed, Sep 10, 2025 at 06:58:18PM +0000, Jon Kohler wrote:
> > > [...]
> >
> > Well it seems that if  get_tx_bufs failed with -EAGAIN then we
> > previously bailed out, but now we will redo poll and so on, forever.
> 
> Something like this: the vhost_vq_avail_empty() check will cause
> vhost_poll_queue() to requeue the handler in this case.
> 
> Let me post a patch to fix that.
> 
> Thanks


To add to that, busyloop_intr getting set is now ignored except in the
last loop iteration, which also doesn't look nice, but is not a bug,
strictly speaking.
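
As a trimmed sketch of the patched loop (not the full source), the reset
at the top of every iteration is what drops the earlier values:

	do {
		/* A busyloop_intr set by get_tx_bufs() in an earlier
		 * pass is dropped here; only the final iteration's
		 * value survives to the post-loop check below.
		 */
		busyloop_intr = false;

		/* ... get_tx_bufs() / vhost_tx_batch() as in the patch ... */
	} while (likely(!vhost_exceeds_weight(vq, ++sent_pkts, total_len)));

	if (unlikely(busyloop_intr))
		vhost_poll_queue(&vq->poll);
	else
		vhost_net_busy_poll_try_queue(net, vq);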

> >
> >
> > No?
> >
> >
> > --
> > MST
> >


* Re: vhost_iotlb_miss tight loop lockup - RE vhost/net: Defer TX queue re-enable until after sendmsg
  2025-09-11  5:49       ` Jason Wang
  2025-09-11  5:55         ` Michael S. Tsirkin
@ 2025-09-11 13:24         ` Jon Kohler
  1 sibling, 0 replies; 8+ messages in thread
From: Jon Kohler @ 2025-09-11 13:24 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, patchwork-bot+netdevbpf@kernel.org,
	eperezma@redhat.com, kvm@vger.kernel.org,
	virtualization@lists.linux.dev, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org



> On Sep 11, 2025, at 1:49 AM, Jason Wang <jasowang@redhat.com> wrote:
> 
> On Thu, Sep 11, 2025 at 4:11 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>> 
>> On Wed, Sep 10, 2025 at 06:58:18PM +0000, Jon Kohler wrote:
>>> [...]
>>
>> Well it seems that if  get_tx_bufs failed with -EAGAIN then we
>> previously bailed out, but now we will redo poll and so on, forever.
> 
> Something like this: the vhost_vq_avail_empty() check will cause
> vhost_poll_queue() to requeue the handler in this case.
> 
> Let me post a patch to fix that.

Ok great, thanks Jason and Michael. Happy to test this when available
as we’ve got a solid repro.

> 
> Thanks
> 
>> 
>> 
>> No?
>> 
>> 
>> --
>> MST
>> 
> 

