From: "Michael S. Tsirkin" <mst@redhat.com>
To: ShuangYu <shuangyu@yunyoo.cc>
Cc: "jasowang@redhat.com" <jasowang@redhat.com>,
	"virtualization@lists.linux.dev" <virtualization@lists.linux.dev>,
	"netdev@vger.kernel.org" <netdev@vger.kernel.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"kvm@vger.kernel.org" <kvm@vger.kernel.org>
Subject: Re: [BUG] vhost_net: livelock in handle_rx() when GRO packet exceeds virtqueue capacity
Date: Sun, 1 Mar 2026 19:39:27 -0500
Message-ID: <20260301193655-mutt-send-email-mst@kernel.org>
In-Reply-To: <20260301190906-mutt-send-email-mst@kernel.org>

On Sun, Mar 01, 2026 at 07:10:06PM -0500, Michael S. Tsirkin wrote:
> On Sun, Mar 01, 2026 at 10:36:39PM +0000, ShuangYu wrote:
> > Hi,
> > 
> > We have hit a severe livelock in vhost_net on 6.18.x. The vhost
> > kernel thread spins at 100% CPU indefinitely in handle_rx(), and
> > QEMU becomes unkillable (stuck in D state).
> > 
> > Environment
> > -----------
> >   Kernel:  6.18.10-1.el8.elrepo.x86_64
> >   QEMU:    7.2.19
> >   Virtio:  VIRTIO_F_IN_ORDER is negotiated
> >   Backend: vhost (kernel)
> > 
> > Symptoms
> > --------
> >   - vhost-<pid> kernel thread at 100% CPU (R state, never yields)
> >   - QEMU stuck in D state at vhost_dev_flush() after receiving SIGTERM
> >   - kill -9 has no effect on the QEMU process
> >   - libvirt management plane deadlocks ("cannot acquire state change lock")
> > 
> > Root Cause
> > ----------
> > The livelock is triggered when a GRO-merged packet on the host TAP
> > interface (e.g., ~60KB) exceeds the remaining free capacity of the
> > guest's RX virtqueue (e.g., ~40KB of available buffers).
> > 
> > The loop in handle_rx() (drivers/vhost/net.c) proceeds as follows:
> > 
> >   1. get_rx_bufs() calls vhost_get_vq_desc_n() to fetch descriptors.
> >     It advances vq->last_avail_idx and vq->next_avail_head as it
> >     consumes buffers, but runs out before satisfying datalen.
> > 
> >   2. get_rx_bufs() jumps to err: and calls
> >     vhost_discard_vq_desc(vq, headcount, n), which rolls back
> >     vq->last_avail_idx and vq->next_avail_head.
> > 
> >     Critically, vq->avail_idx (the cached copy of the guest's
> >     avail->idx) is NOT rolled back. This is correct behavior in
> >     isolation, but creates a persistent mismatch:
> > 
> >       vq->avail_idx      = 108  (cached, unchanged)
> >       vq->last_avail_idx = 104  (rolled back)
> > 
> >   3. handle_rx() sees headcount == 0 and calls vhost_enable_notify().
> >     Inside, vhost_get_avail_idx() finds:
> > 
> >       vq->avail_idx (108) != vq->last_avail_idx (104)
> > 
> >     It returns 1 (true), indicating "new buffers available."
> >     But these are the SAME buffers that were just discarded.
> > 
> >   4. handle_rx() hits `continue`, restarting the loop.
> > 
> >   5. In the next iteration, vhost_get_vq_desc_n() checks:
> > 
> >       if (vq->avail_idx == vq->last_avail_idx)
> > 
> >     This is FALSE (108 != 104), so it skips re-reading the guest's
> >     actual avail->idx and directly fetches the same descriptors.
> > 
> >   6. The exact same sequence repeats: fetch -> too small -> discard
> >     -> rollback -> "new buffers!" -> continue. Indefinitely.
> > 
> > This appears to be a regression introduced by the VIRTIO_F_IN_ORDER
> > support, which added vhost_get_vq_desc_n() with the cached avail_idx
> > short-circuit check, and the three-argument vhost_discard_vq_desc()
> > with next_avail_head rollback. The mismatch between the rollback
> > scope (last_avail_idx, next_avail_head) and the check scope
> > (avail_idx vs last_avail_idx) was not present before this change.
> > 
> > bpftrace Evidence
> > -----------------
> > During the 100% CPU lockup, we traced:
> > 
> >   @get_rx_ret[0]:      4468052   // get_rx_bufs() returns 0 every time
> >   @peek_ret[60366]:    4385533   // same 60KB packet seen every iteration
> >   @sock_err[recvmsg]:        0   // tun_recvmsg() is never reached
> > 
> > vhost_get_vq_desc_n() was observed iterating over the exact same 11
> > descriptor addresses millions of times per second.
> > 
> > Workaround
> > ----------
> > Either of the following avoids the livelock:
> > 
> >   - Disable GRO/GSO on the TAP interface:
> >      ethtool -K <tap> gro off gso off
> > 
> >   - Switch from kernel vhost to userspace QEMU backend:
> >      <driver name='qemu'/> in libvirt XML
> > 
> > Bisect
> > ------
> > We have not yet completed a full git bisect, but the issue does not
> > occur on 6.17.x kernels which lack the VIRTIO_F_IN_ORDER vhost
> > support. We will follow up with a Fixes: tag if we can identify the
> > exact commit.
> > 
> > Suggested Fix Direction
> > -----------------------
> > In handle_rx(), when get_rx_bufs() returns 0 (headcount == 0) due to
> > insufficient buffers (not because the queue is truly empty), the code
> > should break out of the loop rather than relying on
> > vhost_enable_notify() to make that determination. For example, when
> > get_rx_bufs() returns r == 0 with datalen still > 0, this indicates a
> > "packet too large" condition, not a "queue empty" condition, and
> > should be handled differently.
> > 
> > Thanks,
> > ShuangYu
> 
> Hmm. On a hunch, does the following help? Completely untested;
> it is night here, sorry.
> 
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 2f2c45d20883..aafae15d5156 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1522,6 +1522,7 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d)
>  static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq)
>  {
>  	__virtio16 idx;
> +	u16 avail_idx;
>  	int r;
>  
>  	r = vhost_get_avail(vq, idx, &vq->avail->idx);
> @@ -1532,17 +1533,19 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq)
>  	}
>  
>  	/* Check it isn't doing very strange thing with available indexes */
> -	vq->avail_idx = vhost16_to_cpu(vq, idx);
> -	if (unlikely((u16)(vq->avail_idx - vq->last_avail_idx) > vq->num)) {
> +	avail_idx = vhost16_to_cpu(vq, idx);
> +	if (unlikely((u16)(avail_idx - vq->last_avail_idx) > vq->num)) {
>  		vq_err(vq, "Invalid available index change from %u to %u",
>  		       vq->last_avail_idx, vq->avail_idx);
>  		return -EINVAL;
>  	}
>  
>  	/* We're done if there is nothing new */
> -	if (vq->avail_idx == vq->last_avail_idx)
> +	if (avail_idx == vq->avail_idx)
>  		return 0;
>  
> +	vq->avail_idx == avail_idx;
> +

Meaning
	vq->avail_idx = avail_idx;
of course (an assignment, not a comparison).

>  	/*
>  	 * We updated vq->avail_idx so we need a memory barrier between
>  	 * the index read above and the caller reading avail ring entries.

