From: Jon Kohler <jon@nutanix.com>
To: "Michael S. Tsirkin" <mst@redhat.com>,
"Jason Wang" <jasowang@redhat.com>,
"Eugenio Pérez" <eperezma@redhat.com>,
kvm@vger.kernel.org, virtualization@lists.linux.dev,
netdev@vger.kernel.org, linux-kernel@vger.kernel.org
Cc: Jon Kohler <jon@nutanix.com>
Subject: [PATCH net-next v2] vhost/net: Defer TX queue re-enable until after sendmsg
Date: Sat, 19 Apr 2025 18:05:18 -0700
Message-ID: <20250420010518.2842335-1-jon@nutanix.com>

In handle_tx_copy, TX batching processes packets below ~PAGE_SIZE and
batches up to 64 messages before calling sock->sendmsg.

Currently, when there are no more messages on the ring to dequeue,
handle_tx_copy re-enables kicks on the ring *before* firing off the
batch sendmsg. However, sock->sendmsg incurs a non-zero delay,
especially if it needs to wake up a thread (e.g., another vhost worker).

If the guest submits additional messages immediately after the last ring
check and kick re-enablement, it triggers an EPT_MISCONFIG vmexit to
attempt to kick the vhost worker. This may happen while the worker is
still processing the sendmsg, leading to wasteful exit(s).

This is particularly problematic for a single-threaded guest submission
thread, which must exit, wait for the exit to be processed
(potentially involving a TTWU), and then resume.

In scenarios like a constant stream of UDP messages, this results in a
sawtooth pattern where the submitter frequently vmexits, and the
vhost-net worker alternates between sleeping and waking.

A common solution is to configure vhost-net busy polling via userspace
(e.g., qemu poll-us). However, treating the sendmsg as the "busy"
period by keeping kicks disabled during the final sendmsg and
performing one additional ring check afterward provides a significant
performance improvement without any excess busy poll cycles.

If messages are found in the ring after the final sendmsg, requeue the
TX handler. This ensures fairness for the RX handler and allows
vhost_run_work_list to cond_resched() as needed.
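
In short, the tail of handle_tx_copy changes roughly as follows (condensed
pseudocode, not the literal source; the function names are the ones used in
the diff below):

Before (kicks re-enabled before the final flush):

        /* ring empty: re-enable guest notifications, then flush the batch */
        vhost_enable_notify(&net->dev, vq);
        vhost_tx_batch(net, nvq, sock, &msg);   /* guest may already be kicking */

After (kicks stay disabled across the flush):

        /* ring empty: leave notifications disabled and flush first */
        vhost_tx_batch(net, nvq, sock, &msg);   /* no doorbell exits during this */
        vhost_net_busy_poll_try_queue(net, vq); /* one more ring check: requeue
                                                 * the handler if work appeared,
                                                 * otherwise re-enable kicks */
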
Test Case
TX VM: taskset -c 2 iperf3 -c rx-ip-here -t 60 -p 5200 -b 0 -u -i 5
RX VM: taskset -c 2 iperf3 -s -p 5200 -D
6.12.0, each worker backed by a tun interface configured with IFF_NAPI.
Note: the TCP side is largely unchanged, as that test was copy bound.
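
How the EPT_MISCONFIG rates below were collected is not stated; one possible
way to gather comparable numbers is perf's KVM exit accounting against the
TX VM's QEMU process (<qemu-pid> is a placeholder):

  perf kvm stat record -p <qemu-pid>    # stop with Ctrl-C after the run
  perf kvm stat report --event vmexit   # per-exit-reason counts, incl. EPT_MISCONFIG
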
6.12.0 unpatched
EPT_MISCONFIG/second: 5411
Datagrams/second: ~382k

Interval        Transfer     Bitrate         Lost/Total Datagrams
0.00-30.00 sec  15.5 GBytes  4.43 Gbits/sec  0/11481630 (0%)  sender

6.12.0 patched
EPT_MISCONFIG/second: 58 (~93x reduction)
Datagrams/second: ~650k (~1.7x increase)

Interval        Transfer     Bitrate         Lost/Total Datagrams
0.00-30.00 sec  26.4 GBytes  7.55 Gbits/sec  0/19554720 (0%)  sender

Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: Jon Kohler <jon@nutanix.com>
---
drivers/vhost/net.c | 19 +++++++++++++++----
1 file changed, 15 insertions(+), 4 deletions(-)
diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index b9b9e9d40951..9b04025eea66 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -769,13 +769,17 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
                         break;
                 /* Nothing new? Wait for eventfd to tell us they refilled. */
                 if (head == vq->num) {
+                        /* If interrupted while doing busy polling, requeue
+                         * the handler to be fair to handle_rx as well as
+                         * other tasks waiting on cpu
+                         */
                         if (unlikely(busyloop_intr)) {
                                 vhost_poll_queue(&vq->poll);
-                        } else if (unlikely(vhost_enable_notify(&net->dev,
-                                                                vq))) {
-                                vhost_disable_notify(&net->dev, vq);
-                                continue;
                         }
+                        /* Kicks are disabled at this point, break loop and
+                         * process any remaining batched packets. Queue will
+                         * be re-enabled afterwards.
+                         */
                         break;
                 }

@@ -825,7 +829,14 @@ static void handle_tx_copy(struct vhost_net *net, struct socket *sock)
                 ++nvq->done_idx;
         } while (likely(!vhost_exceeds_weight(vq, ++sent_pkts, total_len)));

+        /* Kicks are still disabled, dispatch any remaining batched msgs. */
         vhost_tx_batch(net, nvq, sock, &msg);
+
+        /* All of our work has been completed; however, before leaving the
+         * TX handler, do one last check for work, and requeue handler if
+         * necessary. If there is no work, queue will be reenabled.
+         */
+        vhost_net_busy_poll_try_queue(net, vq);
 }

 static void handle_tx_zerocopy(struct vhost_net *net, struct socket *sock)
--
2.43.0