virtualization.lists.linux-foundation.org archive mirror
* [PATCH net-next] vhost/net: check peek_head_len after signal to guest to avoid delays
@ 2025-11-25 18:00 Jon Kohler
  2025-11-25 23:50 ` Michael S. Tsirkin
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Jon Kohler @ 2025-11-25 18:00 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jason Wang, Eugenio Pérez, kvm,
	virtualization, netdev, linux-kernel
  Cc: Jon Kohler

In non-busypoll handle_rx paths, if peek_head_len returns 0, the RX
loop breaks, the RX wait queue is re-enabled, and vhost_net_signal_used
is called to flush done_idx and notify the guest if needed.

However, signaling the guest can take non-trivial time. During this
window, additional RX payloads may arrive on rx_ring without further
kicks. These new payloads will sit unprocessed until another kick
arrives, increasing latency. In high-rate UDP RX workloads, this was
observed to occur over 20k times per second.

To minimize this window and improve opportunities to process packets
promptly, immediately call peek_head_len after signaling. If new packets
are found, treat it as a busy poll interrupt and requeue handle_rx,
improving fairness to TX handlers and other pending CPU work. This also
helps suppress unnecessary thread wakeups, reducing waker CPU demand.

Signed-off-by: Jon Kohler <jon@nutanix.com>
---
 drivers/vhost/net.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index 35ded4330431..04cb5f1dc6e4 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -1015,6 +1015,27 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk,
 	struct vhost_virtqueue *tvq = &tnvq->vq;
 	int len = peek_head_len(rnvq, sk);
 
+	if (!len && rnvq->done_idx) {
+		/* When idle, flush signal first, which can take some
+		 * time for ring management and guest notification.
+		 * Afterwards, check one last time for work, as the ring
+		 * may have received new work during the notification
+		 * window.
+		 */
+		vhost_net_signal_used(rnvq, *count);
+		*count = 0;
+		if (peek_head_len(rnvq, sk)) {
+			/* More work came in during the notification
+			 * window. To be fair to the TX handler and other
+			 * potentially pending work items, pretend like
+			 * this was a busy poll interruption so that
+			 * the RX handler will be rescheduled and try
+			 * again.
+			 */
+			*busyloop_intr = true;
+		}
+	}
+
 	if (!len && rvq->busyloop_timeout) {
 		/* Flush batched heads first */
 		vhost_net_signal_used(rnvq, *count);
-- 
2.43.0



* Re: [PATCH net-next] vhost/net: check peek_head_len after signal to guest to avoid delays
  2025-11-25 18:00 [PATCH net-next] vhost/net: check peek_head_len after signal to guest to avoid delays Jon Kohler
@ 2025-11-25 23:50 ` Michael S. Tsirkin
  2025-11-26 16:49   ` Jon Kohler
  2025-11-26  6:15 ` Jason Wang
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: Michael S. Tsirkin @ 2025-11-25 23:50 UTC (permalink / raw)
  To: Jon Kohler
  Cc: Jason Wang, Eugenio Pérez, kvm, virtualization, netdev,
	linux-kernel

On Tue, Nov 25, 2025 at 11:00:33AM -0700, Jon Kohler wrote:
> In non-busypoll handle_rx paths, if peek_head_len returns 0, the RX
> loop breaks, the RX wait queue is re-enabled, and vhost_net_signal_used
> is called to flush done_idx and notify the guest if needed.
> 
> However, signaling the guest can take non-trivial time. During this
> window, additional RX payloads may arrive on rx_ring without further
> kicks. These new payloads will sit unprocessed until another kick
> arrives, increasing latency. In high-rate UDP RX workloads, this was
> observed to occur over 20k times per second.
> 
> To minimize this window and improve opportunities to process packets
> promptly, immediately call peek_head_len after signaling. If new packets
> are found, treat it as a busy poll interrupt and requeue handle_rx,
> improving fairness to TX handlers and other pending CPU work. This also
> helps suppress unnecessary thread wakeups, reducing waker CPU demand.
> 
> Signed-off-by: Jon Kohler <jon@nutanix.com>

Given this is supposed to be a performance improvement,
pls include info on the effect this has on performance. Thanks!

> ---
>  drivers/vhost/net.c | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 35ded4330431..04cb5f1dc6e4 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -1015,6 +1015,27 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk,
>  	struct vhost_virtqueue *tvq = &tnvq->vq;
>  	int len = peek_head_len(rnvq, sk);
>  
> +	if (!len && rnvq->done_idx) {
> +		/* When idle, flush signal first, which can take some
> +		 * time for ring management and guest notification.
> +		 * Afterwards, check one last time for work, as the ring
> +		 * may have received new work during the notification
> +		 * window.
> +		 */
> +		vhost_net_signal_used(rnvq, *count);
> +		*count = 0;
> +		if (peek_head_len(rnvq, sk)) {
> +			/* More work came in during the notification
> +			 * window. To be fair to the TX handler and other
> +			 * potentially pending work items, pretend like
> +			 * this was a busy poll interruption so that
> +			 * the RX handler will be rescheduled and try
> +			 * again.
> +			 */
> +			*busyloop_intr = true;
> +		}
> +	}
> +
>  	if (!len && rvq->busyloop_timeout) {
>  		/* Flush batched heads first */
>  		vhost_net_signal_used(rnvq, *count);
> -- 
> 2.43.0



* Re: [PATCH net-next] vhost/net: check peek_head_len after signal to guest to avoid delays
  2025-11-25 18:00 [PATCH net-next] vhost/net: check peek_head_len after signal to guest to avoid delays Jon Kohler
  2025-11-25 23:50 ` Michael S. Tsirkin
@ 2025-11-26  6:15 ` Jason Wang
  2025-11-26 16:48   ` Jon Kohler
  2025-11-26  6:41 ` Michael S. Tsirkin
  2025-11-26  8:56 ` Michael S. Tsirkin
  3 siblings, 1 reply; 10+ messages in thread
From: Jason Wang @ 2025-11-26  6:15 UTC (permalink / raw)
  To: Jon Kohler
  Cc: Michael S. Tsirkin, Eugenio Pérez, kvm, virtualization,
	netdev, linux-kernel

On Wed, Nov 26, 2025 at 1:18 AM Jon Kohler <jon@nutanix.com> wrote:
>
> In non-busypoll handle_rx paths, if peek_head_len returns 0, the RX
> loop breaks, the RX wait queue is re-enabled, and vhost_net_signal_used
> is called to flush done_idx and notify the guest if needed.
>
> However, signaling the guest can take non-trivial time. During this
> window, additional RX payloads may arrive on rx_ring without further
> kicks. These new payloads will sit unprocessed until another kick
> arrives, increasing latency. In high-rate UDP RX workloads, this was
> observed to occur over 20k times per second.
>
> To minimize this window and improve opportunities to process packets
> promptly, immediately call peek_head_len after signaling. If new packets
> are found, treat it as a busy poll interrupt and requeue handle_rx,
> improving fairness to TX handlers and other pending CPU work. This also
> helps suppress unnecessary thread wakeups, reducing waker CPU demand.
>
> Signed-off-by: Jon Kohler <jon@nutanix.com>
> ---
>  drivers/vhost/net.c | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
>
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 35ded4330431..04cb5f1dc6e4 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -1015,6 +1015,27 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk,
>         struct vhost_virtqueue *tvq = &tnvq->vq;
>         int len = peek_head_len(rnvq, sk);
>
> +       if (!len && rnvq->done_idx) {
> +               /* When idle, flush signal first, which can take some
> +                * time for ring management and guest notification.
> +                * Afterwards, check one last time for work, as the ring
> +                * may have received new work during the notification
> +                * window.
> +                */
> +               vhost_net_signal_used(rnvq, *count);
> +               *count = 0;
> +               if (peek_head_len(rnvq, sk)) {
> +                       /* More work came in during the notification
> +                        * window. To be fair to the TX handler and other
> +                        * potentially pending work items, pretend like
> +                        * this was a busy poll interruption so that
> +                        * the RX handler will be rescheduled and try
> +                        * again.
> +                        */
> +                       *busyloop_intr = true;
> +               }
> +       }

I'm not sure we will get here.

Once vhost_net_rx_peek_head_len() returns 0, we exit the loop to:

	if (unlikely(busyloop_intr))
		vhost_poll_queue(&vq->poll);
	else if (!sock_len)
		vhost_net_enable_vq(net, vq);
out:
	vhost_net_signal_used(nvq, count);

Are you suggesting signalling before enabling vq actually?

Thanks

> +
>         if (!len && rvq->busyloop_timeout) {
>                 /* Flush batched heads first */
>                 vhost_net_signal_used(rnvq, *count);
> --
> 2.43.0
>



* Re: [PATCH net-next] vhost/net: check peek_head_len after signal to guest to avoid delays
  2025-11-25 18:00 [PATCH net-next] vhost/net: check peek_head_len after signal to guest to avoid delays Jon Kohler
  2025-11-25 23:50 ` Michael S. Tsirkin
  2025-11-26  6:15 ` Jason Wang
@ 2025-11-26  6:41 ` Michael S. Tsirkin
  2025-11-26 16:47   ` Jon Kohler
  2025-11-26  8:56 ` Michael S. Tsirkin
  3 siblings, 1 reply; 10+ messages in thread
From: Michael S. Tsirkin @ 2025-11-26  6:41 UTC (permalink / raw)
  To: Jon Kohler
  Cc: Jason Wang, Eugenio Pérez, kvm, virtualization, netdev,
	linux-kernel

On Tue, Nov 25, 2025 at 11:00:33AM -0700, Jon Kohler wrote:
> In non-busypoll handle_rx paths, if peek_head_len returns 0, the RX
> loop breaks, the RX wait queue is re-enabled, and vhost_net_signal_used
> is called to flush done_idx and notify the guest if needed.
> 
> However, signaling the guest can take non-trivial time. During this
> window, additional RX payloads may arrive on rx_ring without further
> kicks. These new payloads will sit unprocessed until another kick
> arrives, increasing latency. In high-rate UDP RX workloads, this was
> observed to occur over 20k times per second.
> 
> To minimize this window and improve opportunities to process packets
> promptly, immediately call peek_head_len after signaling. If new packets
> are found, treat it as a busy poll interrupt and requeue handle_rx,
> improving fairness to TX handlers and other pending CPU work. This also
> helps suppress unnecessary thread wakeups, reducing waker CPU demand.
> 
> Signed-off-by: Jon Kohler <jon@nutanix.com>
> ---
>  drivers/vhost/net.c | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 35ded4330431..04cb5f1dc6e4 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -1015,6 +1015,27 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk,
>  	struct vhost_virtqueue *tvq = &tnvq->vq;
>  	int len = peek_head_len(rnvq, sk);
>  
> +	if (!len && rnvq->done_idx) {
> +		/* When idle, flush signal first, which can take some
> +		 * time for ring management and guest notification.
> +		 * Afterwards, check one last time for work, as the ring
> +		 * may have received new work during the notification
> +		 * window.
> +		 */
> +		vhost_net_signal_used(rnvq, *count);
> +		*count = 0;
> +		if (peek_head_len(rnvq, sk)) {
> +			/* More work came in during the notification
> +			 * window. To be fair to the TX handler and other
> +			 * potentially pending work items, pretend like
> +			 * this was a busy poll interruption so that
> +			 * the RX handler will be rescheduled and try
> +			 * again.
> +			 */
> +			*busyloop_intr = true;
> +		}
> +	}
> +
>  	if (!len && rvq->busyloop_timeout) {
>  		/* Flush batched heads first */
>  		vhost_net_signal_used(rnvq, *count);


Looks like this can easily send more interrupts than originally?
How can this be good?

From the description, I would expect the changes to just add another call to
peek_head_len after the existing vhost_net_signal_used.
What am I missing?


> -- 
> 2.43.0



* Re: [PATCH net-next] vhost/net: check peek_head_len after signal to guest to avoid delays
  2025-11-25 18:00 [PATCH net-next] vhost/net: check peek_head_len after signal to guest to avoid delays Jon Kohler
                   ` (2 preceding siblings ...)
  2025-11-26  6:41 ` Michael S. Tsirkin
@ 2025-11-26  8:56 ` Michael S. Tsirkin
  2025-11-26 16:43   ` Jon Kohler
  3 siblings, 1 reply; 10+ messages in thread
From: Michael S. Tsirkin @ 2025-11-26  8:56 UTC (permalink / raw)
  To: Jon Kohler
  Cc: Jason Wang, Eugenio Pérez, kvm, virtualization, netdev,
	linux-kernel

On Tue, Nov 25, 2025 at 11:00:33AM -0700, Jon Kohler wrote:
> In non-busypoll handle_rx paths, if peek_head_len returns 0, the RX
> loop breaks, the RX wait queue is re-enabled, and vhost_net_signal_used
> is called to flush done_idx and notify the guest if needed.
> 
> However, signaling the guest can take non-trivial time. During this
> window, additional RX payloads may arrive on rx_ring without further
> kicks. These new payloads will sit unprocessed until another kick
> arrives, increasing latency. In high-rate UDP RX workloads, this was
> observed to occur over 20k times per second.
> 
> To minimize this window and improve opportunities to process packets
> promptly, immediately call peek_head_len after signaling. If new packets
> are found, treat it as a busy poll interrupt and requeue handle_rx,
> improving fairness to TX handlers and other pending CPU work. This also
> helps suppress unnecessary thread wakeups, reducing waker CPU demand.
> 
> Signed-off-by: Jon Kohler <jon@nutanix.com>
> ---
>  drivers/vhost/net.c | 21 +++++++++++++++++++++
>  1 file changed, 21 insertions(+)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index 35ded4330431..04cb5f1dc6e4 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -1015,6 +1015,27 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk,
>  	struct vhost_virtqueue *tvq = &tnvq->vq;
>  	int len = peek_head_len(rnvq, sk);
>  
> +	if (!len && rnvq->done_idx) {
> +		/* When idle, flush signal first, which can take some
> +		 * time for ring management and guest notification.
> +		 * Afterwards, check one last time for work, as the ring
> +		 * may have received new work during the notification
> +		 * window.
> +		 */
> +		vhost_net_signal_used(rnvq, *count);
> +		*count = 0;
> +		if (peek_head_len(rnvq, sk)) {


I also wonder why we don't assign len here.
I get the point about being fair to TX, but it's not
an indefinite poll, just a single peek ...

> +			/* More work came in during the notification
> +			 * window. To be fair to the TX handler and other
> +			 * potentially pending work items, pretend like
> +			 * this was a busy poll interruption so that
> +			 * the RX handler will be rescheduled and try
> +			 * again.
> +			 */
> +			*busyloop_intr = true;
> +		}
> +	}
> +
>  	if (!len && rvq->busyloop_timeout) {
>  		/* Flush batched heads first */
>  		vhost_net_signal_used(rnvq, *count);
> -- 
> 2.43.0



* Re: [PATCH net-next] vhost/net: check peek_head_len after signal to guest to avoid delays
  2025-11-26  8:56 ` Michael S. Tsirkin
@ 2025-11-26 16:43   ` Jon Kohler
  0 siblings, 0 replies; 10+ messages in thread
From: Jon Kohler @ 2025-11-26 16:43 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Eugenio Pérez, kvm@vger.kernel.org,
	virtualization@lists.linux.dev, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org



> On Nov 26, 2025, at 3:56 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Tue, Nov 25, 2025 at 11:00:33AM -0700, Jon Kohler wrote:
>> In non-busypoll handle_rx paths, if peek_head_len returns 0, the RX
>> loop breaks, the RX wait queue is re-enabled, and vhost_net_signal_used
>> is called to flush done_idx and notify the guest if needed.
>> 
>> However, signaling the guest can take non-trivial time. During this
>> window, additional RX payloads may arrive on rx_ring without further
>> kicks. These new payloads will sit unprocessed until another kick
>> arrives, increasing latency. In high-rate UDP RX workloads, this was
>> observed to occur over 20k times per second.
>> 
>> To minimize this window and improve opportunities to process packets
>> promptly, immediately call peek_head_len after signaling. If new packets
>> are found, treat it as a busy poll interrupt and requeue handle_rx,
>> improving fairness to TX handlers and other pending CPU work. This also
>> helps suppress unnecessary thread wakeups, reducing waker CPU demand.
>> 
>> Signed-off-by: Jon Kohler <jon@nutanix.com>
>> ---
>> drivers/vhost/net.c | 21 +++++++++++++++++++++
>> 1 file changed, 21 insertions(+)
>> 
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index 35ded4330431..04cb5f1dc6e4 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -1015,6 +1015,27 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk,
>> struct vhost_virtqueue *tvq = &tnvq->vq;
>> int len = peek_head_len(rnvq, sk);
>> 
>> + if (!len && rnvq->done_idx) {
>> + /* When idle, flush signal first, which can take some
>> + * time for ring management and guest notification.
>> + * Afterwards, check one last time for work, as the ring
>> + * may have received new work during the notification
>> + * window.
>> + */
>> + vhost_net_signal_used(rnvq, *count);
>> + *count = 0;
>> + if (peek_head_len(rnvq, sk)) {
> 
> 
> I also wonder why we don't assign len here.
> I get the point about being fair to TX, but it's not
> an indefinite poll, just a single peek …

The first version of this patch did exactly that. It works,
but I liked the idea of having fairness baked into the design.
It could go either way, though?
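
(For illustration, a rough sketch of what that assign-len variant
could look like, not the actual earlier patch:)

	if (!len && rnvq->done_idx) {
		vhost_net_signal_used(rnvq, *count);
		*count = 0;
		/* consume the racing work right here instead of
		 * requeueing via busyloop_intr
		 */
		len = peek_head_len(rnvq, sk);
	}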

The nice bit about deferring to the TX handler (or other
work) is that you’d then naturally batch up more work, so
the ebb and flow should be nicer in a mixed-load environment.

Thoughts?

>> + /* More work came in during the notification
>> + * window. To be fair to the TX handler and other
>> + * potentially pending work items, pretend like
>> + * this was a busy poll interruption so that
>> + * the RX handler will be rescheduled and try
>> + * again.
>> + */
>> + *busyloop_intr = true;
>> + }
>> + }
>> +
>> if (!len && rvq->busyloop_timeout) {
>> /* Flush batched heads first */
>> vhost_net_signal_used(rnvq, *count);
>> -- 
>> 2.43.0
> 



* Re: [PATCH net-next] vhost/net: check peek_head_len after signal to guest to avoid delays
  2025-11-26  6:41 ` Michael S. Tsirkin
@ 2025-11-26 16:47   ` Jon Kohler
  0 siblings, 0 replies; 10+ messages in thread
From: Jon Kohler @ 2025-11-26 16:47 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Eugenio Pérez, kvm@vger.kernel.org,
	virtualization@lists.linux.dev, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org



> On Nov 26, 2025, at 1:41 AM, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Tue, Nov 25, 2025 at 11:00:33AM -0700, Jon Kohler wrote:
>> In non-busypoll handle_rx paths, if peek_head_len returns 0, the RX
>> loop breaks, the RX wait queue is re-enabled, and vhost_net_signal_used
>> is called to flush done_idx and notify the guest if needed.
>> 
>> However, signaling the guest can take non-trivial time. During this
>> window, additional RX payloads may arrive on rx_ring without further
>> kicks. These new payloads will sit unprocessed until another kick
>> arrives, increasing latency. In high-rate UDP RX workloads, this was
>> observed to occur over 20k times per second.
>> 
>> To minimize this window and improve opportunities to process packets
>> promptly, immediately call peek_head_len after signaling. If new packets
>> are found, treat it as a busy poll interrupt and requeue handle_rx,
>> improving fairness to TX handlers and other pending CPU work. This also
>> helps suppress unnecessary thread wakeups, reducing waker CPU demand.
>> 
>> Signed-off-by: Jon Kohler <jon@nutanix.com>
>> ---
>> drivers/vhost/net.c | 21 +++++++++++++++++++++
>> 1 file changed, 21 insertions(+)
>> 
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index 35ded4330431..04cb5f1dc6e4 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -1015,6 +1015,27 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk,
>> struct vhost_virtqueue *tvq = &tnvq->vq;
>> int len = peek_head_len(rnvq, sk);
>> 
>> + if (!len && rnvq->done_idx) {
>> + /* When idle, flush signal first, which can take some
>> + * time for ring management and guest notification.
>> + * Afterwards, check one last time for work, as the ring
>> + * may have received new work during the notification
>> + * window.
>> + */
>> + vhost_net_signal_used(rnvq, *count);
>> + *count = 0;
>> + if (peek_head_len(rnvq, sk)) {
>> + /* More work came in during the notification
>> + * window. To be fair to the TX handler and other
>> + * potentially pending work items, pretend like
>> + * this was a busy poll interruption so that
>> + * the RX handler will be rescheduled and try
>> + * again.
>> + */
>> + *busyloop_intr = true;
>> + }
>> + }
>> +
>> if (!len && rvq->busyloop_timeout) {
>> /* Flush batched heads first */
>> vhost_net_signal_used(rnvq, *count);
> 
> 
> Looks like this can easily send more interrupts than originally?
> How can this be good?
> 
> From the description, I would expect the changes to just add another call to
> peek_head_len after the existing vhost_net_signal_used.
> What am I missing?

Consider the following race, across NUMA nodes where it is most expensive.

Socket 1 (pNIC) <-------- NUMA -------> Socket 2 (vhost worker)
                                        vhost_net_rx_peek_head_len = 0
                                          peek_head_len = 0
                                            vhost_net_buf_peek
                                            vhost_net_buf_produce
tun_net_xmit                                ptr_ring_consume_batched
ptr_ring_produce()=0
                                        else if (!sock_len)
                                          vhost_net_enable_vq(net, vq);
                                            tun_chr_poll
                                              add_wait_queue
tfile->socket.sk->sk_data_ready()
  sock_def_readable
    skwq_has_sleeper=true
      wake_up ...
        vhost_poll_wakeup
          TTWU
                                        vhost_net_signal_used();
                                          ring operations (takes time, SMAP)
                                          signalling guest (takes time)
                                        vhost_task_fn
                                        schedule()!!
          ttwu rq spinlock!               rq lock
                                          schedules out
          IPI!                              
                                        idle path
                                          sched_ttwu_pending

All I’m saying here is that if we issue the signal during the
time when we are *not* yet added to the wait queue, that gives
the race above plenty of time to resolve: the lockless TX’ers
get time to add more work to the rx_ring, and we can either
a) process it right there (by assigning len, as you suggested
in the other mail) or b) set busy poll and go process the TX
handler or other work, with notification disabled.

For the sake of argument, let’s pretend signaling took a full
second. A lot can happen in that time, so if we signal first
and then check after, we can avoid the IPIs and the trips in
and out of the scheduler.
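
To put that in code terms, with this patch the tail of the window
above would behave roughly like the sketch below (illustrative only,
reusing the names from the patch):

	vhost_net_signal_used(rnvq, *count);	/* ring updates + guest
						 * notification; the racing
						 * tun_net_xmit() / ptr_ring_produce()
						 * above can land during this window
						 */
	*count = 0;
	if (peek_head_len(rnvq, sk))
		/* pick up that racing work now and requeue handle_rx,
		 * rather than re-enabling the vq, scheduling out, and
		 * paying for the wakeup + IPI
		 */
		*busyloop_intr = true;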

> 
> 
>> -- 
>> 2.43.0
> 



* Re: [PATCH net-next] vhost/net: check peek_head_len after signal to guest to avoid delays
  2025-11-26  6:15 ` Jason Wang
@ 2025-11-26 16:48   ` Jon Kohler
  0 siblings, 0 replies; 10+ messages in thread
From: Jon Kohler @ 2025-11-26 16:48 UTC (permalink / raw)
  To: Jason Wang
  Cc: Michael S. Tsirkin, Eugenio Pérez, kvm@vger.kernel.org,
	virtualization@lists.linux.dev, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org



> On Nov 26, 2025, at 1:15 AM, Jason Wang <jasowang@redhat.com> wrote:
> 
> On Wed, Nov 26, 2025 at 1:18 AM Jon Kohler <jon@nutanix.com> wrote:
>> 
>> In non-busypoll handle_rx paths, if peek_head_len returns 0, the RX
>> loop breaks, the RX wait queue is re-enabled, and vhost_net_signal_used
>> is called to flush done_idx and notify the guest if needed.
>> 
>> However, signaling the guest can take non-trivial time. During this
>> window, additional RX payloads may arrive on rx_ring without further
>> kicks. These new payloads will sit unprocessed until another kick
>> arrives, increasing latency. In high-rate UDP RX workloads, this was
>> observed to occur over 20k times per second.
>> 
>> To minimize this window and improve opportunities to process packets
>> promptly, immediately call peek_head_len after signaling. If new packets
>> are found, treat it as a busy poll interrupt and requeue handle_rx,
>> improving fairness to TX handlers and other pending CPU work. This also
>> helps suppress unnecessary thread wakeups, reducing waker CPU demand.
>> 
>> Signed-off-by: Jon Kohler <jon@nutanix.com>
>> ---
>> drivers/vhost/net.c | 21 +++++++++++++++++++++
>> 1 file changed, 21 insertions(+)
>> 
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index 35ded4330431..04cb5f1dc6e4 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -1015,6 +1015,27 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk,
>>        struct vhost_virtqueue *tvq = &tnvq->vq;
>>        int len = peek_head_len(rnvq, sk);
>> 
>> +       if (!len && rnvq->done_idx) {
>> +               /* When idle, flush signal first, which can take some
>> +                * time for ring management and guest notification.
>> +                * Afterwards, check one last time for work, as the ring
>> +                * may have received new work during the notification
>> +                * window.
>> +                */
>> +               vhost_net_signal_used(rnvq, *count);
>> +               *count = 0;
>> +               if (peek_head_len(rnvq, sk)) {
>> +                       /* More work came in during the notification
>> +                        * window. To be fair to the TX handler and other
>> +                        * potentially pending work items, pretend like
>> +                        * this was a busy poll interruption so that
>> +                        * the RX handler will be rescheduled and try
>> +                        * again.
>> +                        */
>> +                       *busyloop_intr = true;
>> +               }
>> +       }
> 
> I'm not sure I will get here.
> 
> Once vhost_net_rx_peek_head_len() returns 0, we exit the loop to:
> 
> 	if (unlikely(busyloop_intr))
> 		vhost_poll_queue(&vq->poll);
> 	else if (!sock_len)
> 		vhost_net_enable_vq(net, vq);
> out:
> 	vhost_net_signal_used(nvq, count);
> 
> Are you suggesting signalling before enabling vq actually?

See my other note I just sent: yes, that’s exactly what I’m suggesting.

Signaling takes some time, and if we do it before our last peek for
work, we can pick up racing additions to the ring and avoid a trip
to the scheduler, IPIs, etc.
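
Roughly, the exit path you quoted then behaves like this for the
idle case (simplified sketch, not the literal handle_rx code):

	/* the flush + notify already happened inside
	 * vhost_net_rx_peek_head_len(), and the re-peek there decided
	 * whether busyloop_intr got set
	 */
	if (unlikely(busyloop_intr))
		vhost_poll_queue(&vq->poll);	/* racing work found: run handle_rx again */
	else if (!sock_len)
		vhost_net_enable_vq(net, vq);	/* truly idle: arm the wait queue */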

> 
> Thanks
> 
>> +
>>        if (!len && rvq->busyloop_timeout) {
>>                /* Flush batched heads first */
>>                vhost_net_signal_used(rnvq, *count);
>> --
>> 2.43.0
>> 
> 



* Re: [PATCH net-next] vhost/net: check peek_head_len after signal to guest to avoid delays
  2025-11-25 23:50 ` Michael S. Tsirkin
@ 2025-11-26 16:49   ` Jon Kohler
  2025-12-26 14:50     ` Michael S. Tsirkin
  0 siblings, 1 reply; 10+ messages in thread
From: Jon Kohler @ 2025-11-26 16:49 UTC (permalink / raw)
  To: Michael S. Tsirkin
  Cc: Jason Wang, Eugenio Pérez, kvm@vger.kernel.org,
	virtualization@lists.linux.dev, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org



> On Nov 25, 2025, at 6:50 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> 
> On Tue, Nov 25, 2025 at 11:00:33AM -0700, Jon Kohler wrote:
>> In non-busypoll handle_rx paths, if peek_head_len returns 0, the RX
>> loop breaks, the RX wait queue is re-enabled, and vhost_net_signal_used
>> is called to flush done_idx and notify the guest if needed.
>> 
>> However, signaling the guest can take non-trivial time. During this
>> window, additional RX payloads may arrive on rx_ring without further
>> kicks. These new payloads will sit unprocessed until another kick
>> arrives, increasing latency. In high-rate UDP RX workloads, this was
>> observed to occur over 20k times per second.
>> 
>> To minimize this window and improve opportunities to process packets
>> promptly, immediately call peek_head_len after signaling. If new packets
>> are found, treat it as a busy poll interrupt and requeue handle_rx,
>> improving fairness to TX handlers and other pending CPU work. This also
>> helps suppress unnecessary thread wakeups, reducing waker CPU demand.
>> 
>> Signed-off-by: Jon Kohler <jon@nutanix.com>
> 
> Given this is supposed to be a performance improvement,
> pls include info on the effect this has on performance. Thanks!

I had already mentioned we’re avoiding ~20k scheduler wakeups/IPIs per
second in that example, but I can add more detail. Let’s resolve the
other parts of the thread first and go from there?

> 
>> ---
>> drivers/vhost/net.c | 21 +++++++++++++++++++++
>> 1 file changed, 21 insertions(+)
>> 
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index 35ded4330431..04cb5f1dc6e4 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -1015,6 +1015,27 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk,
>> struct vhost_virtqueue *tvq = &tnvq->vq;
>> int len = peek_head_len(rnvq, sk);
>> 
>> + if (!len && rnvq->done_idx) {
>> + /* When idle, flush signal first, which can take some
>> + * time for ring management and guest notification.
>> + * Afterwards, check one last time for work, as the ring
>> + * may have received new work during the notification
>> + * window.
>> + */
>> + vhost_net_signal_used(rnvq, *count);
>> + *count = 0;
>> + if (peek_head_len(rnvq, sk)) {
>> + /* More work came in during the notification
>> + * window. To be fair to the TX handler and other
>> + * potentially pending work items, pretend like
>> + * this was a busy poll interruption so that
>> + * the RX handler will be rescheduled and try
>> + * again.
>> + */
>> + *busyloop_intr = true;
>> + }
>> + }
>> +
>> if (!len && rvq->busyloop_timeout) {
>> /* Flush batched heads first */
>> vhost_net_signal_used(rnvq, *count);
>> -- 
>> 2.43.0
> 



* Re: [PATCH net-next] vhost/net: check peek_head_len after signal to guest to avoid delays
  2025-11-26 16:49   ` Jon Kohler
@ 2025-12-26 14:50     ` Michael S. Tsirkin
  0 siblings, 0 replies; 10+ messages in thread
From: Michael S. Tsirkin @ 2025-12-26 14:50 UTC (permalink / raw)
  To: Jon Kohler
  Cc: Jason Wang, Eugenio Pérez, kvm@vger.kernel.org,
	virtualization@lists.linux.dev, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org

On Wed, Nov 26, 2025 at 04:49:11PM +0000, Jon Kohler wrote:
> 
> 
> > On Nov 25, 2025, at 6:50 PM, Michael S. Tsirkin <mst@redhat.com> wrote:
> > 
> > On Tue, Nov 25, 2025 at 11:00:33AM -0700, Jon Kohler wrote:
> >> In non-busypoll handle_rx paths, if peek_head_len returns 0, the RX
> >> loop breaks, the RX wait queue is re-enabled, and vhost_net_signal_used
> >> is called to flush done_idx and notify the guest if needed.
> >> 
> >> However, signaling the guest can take non-trivial time. During this
> >> window, additional RX payloads may arrive on rx_ring without further
> >> kicks. These new payloads will sit unprocessed until another kick
> >> arrives, increasing latency. In high-rate UDP RX workloads, this was
> >> observed to occur over 20k times per second.
> >> 
> >> To minimize this window and improve opportunities to process packets
> >> promptly, immediately call peek_head_len after signaling. If new packets
> >> are found, treat it as a busy poll interrupt and requeue handle_rx,
> >> improving fairness to TX handlers and other pending CPU work. This also
> >> helps suppress unnecessary thread wakeups, reducing waker CPU demand.
> >> 
> >> Signed-off-by: Jon Kohler <jon@nutanix.com>
> > 
> > Given this is supposed to be a performance improvement,
> > pls include info on the effect this has on performance. Thanks!
> 
> I had already mentioned we’re avoiding ~20k scheduler wakeups/IPIs per
> second in that example, but I can add more detail. Let’s resolve the
> other parts of the thread first and go from there?


The discussion seems to have died down.
I suggest reposting with the perf data you have
(which test, how much improvement, what CPU usage)
included in the commit log.

thanks!

> > 
> >> ---
> >> drivers/vhost/net.c | 21 +++++++++++++++++++++
> >> 1 file changed, 21 insertions(+)
> >> 
> >> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> >> index 35ded4330431..04cb5f1dc6e4 100644
> >> --- a/drivers/vhost/net.c
> >> +++ b/drivers/vhost/net.c
> >> @@ -1015,6 +1015,27 @@ static int vhost_net_rx_peek_head_len(struct vhost_net *net, struct sock *sk,
> >> struct vhost_virtqueue *tvq = &tnvq->vq;
> >> int len = peek_head_len(rnvq, sk);
> >> 
> >> + if (!len && rnvq->done_idx) {
> >> + /* When idle, flush signal first, which can take some
> >> + * time for ring management and guest notification.
> >> + * Afterwards, check one last time for work, as the ring
> >> + * may have received new work during the notification
> >> + * window.
> >> + */
> >> + vhost_net_signal_used(rnvq, *count);
> >> + *count = 0;
> >> + if (peek_head_len(rnvq, sk)) {
> >> + /* More work came in during the notification
> >> + * window. To be fair to the TX handler and other
> >> + * potentially pending work items, pretend like
> >> + * this was a busy poll interruption so that
> >> + * the RX handler will be rescheduled and try
> >> + * again.
> >> + */
> >> + *busyloop_intr = true;
> >> + }
> >> + }
> >> +
> >> if (!len && rvq->busyloop_timeout) {
> >> /* Flush batched heads first */
> >> vhost_net_signal_used(rnvq, *count);
> >> -- 
> >> 2.43.0
> > 
> 




Thread overview: 10+ messages
2025-11-25 18:00 [PATCH net-next] vhost/net: check peek_head_len after signal to guest to avoid delays Jon Kohler
2025-11-25 23:50 ` Michael S. Tsirkin
2025-11-26 16:49   ` Jon Kohler
2025-12-26 14:50     ` Michael S. Tsirkin
2025-11-26  6:15 ` Jason Wang
2025-11-26 16:48   ` Jon Kohler
2025-11-26  6:41 ` Michael S. Tsirkin
2025-11-26 16:47   ` Jon Kohler
2025-11-26  8:56 ` Michael S. Tsirkin
2025-11-26 16:43   ` Jon Kohler
