* [BUG] vhost_net: livelock in handle_rx() when GRO packet exceeds virtqueue capacity
@ 2026-03-01 22:36 ShuangYu
2026-03-02 0:10 ` Michael S. Tsirkin
0 siblings, 1 reply; 5+ messages in thread
From: ShuangYu @ 2026-03-01 22:36 UTC (permalink / raw)
To: mst@redhat.com, jasowang@redhat.com
Cc: virtualization@lists.linux.dev, netdev@vger.kernel.org,
linux-kernel@vger.kernel.org, kvm@vger.kernel.org
Hi,
We have hit a severe livelock in vhost_net on 6.18.x. The vhost
kernel thread spins at 100% CPU indefinitely in handle_rx(), and
QEMU becomes unkillable (stuck in D state).
[This is a text/plain message]
Environment
-----------
Kernel: 6.18.10-1.el8.elrepo.x86_64
QEMU: 7.2.19
Virtio: VIRTIO_F_IN_ORDER is negotiated
Backend: vhost (kernel)
Symptoms
--------
- vhost-<pid> kernel thread at 100% CPU (R state, never yields)
- QEMU stuck in D state at vhost_dev_flush() after receiving SIGTERM
- kill -9 has no effect on the QEMU process
- libvirt management plane deadlocks ("cannot acquire state change lock")
Root Cause
----------
The livelock is triggered when a GRO-merged packet on the host TAP
interface (e.g., ~60KB) exceeds the remaining free capacity of the
guest's RX virtqueue (e.g., ~40KB of available buffers).
The loop in handle_rx() (drivers/vhost/net.c) proceeds as follows:
1. get_rx_bufs() calls vhost_get_vq_desc_n() to fetch descriptors.
It advances vq->last_avail_idx and vq->next_avail_head as it
consumes buffers, but runs out before satisfying datalen.
2. get_rx_bufs() jumps to err: and calls
vhost_discard_vq_desc(vq, headcount, n), which rolls back
vq->last_avail_idx and vq->next_avail_head.
Critically, vq->avail_idx (the cached copy of the guest's
avail->idx) is NOT rolled back. This is correct behavior in
isolation, but creates a persistent mismatch:
vq->avail_idx = 108 (cached, unchanged)
vq->last_avail_idx = 104 (rolled back)
3. handle_rx() sees headcount == 0 and calls vhost_enable_notify().
Inside, vhost_get_avail_idx() finds:
vq->avail_idx (108) != vq->last_avail_idx (104)
It returns 1 (true), indicating "new buffers available."
But these are the SAME buffers that were just discarded.
4. handle_rx() hits `continue`, restarting the loop.
5. In the next iteration, vhost_get_vq_desc_n() checks:
if (vq->avail_idx == vq->last_avail_idx)
This is FALSE (108 != 104), so it skips re-reading the guest's
actual avail->idx and directly fetches the same descriptors.
6. The exact same sequence repeats: fetch -> too small -> discard
-> rollback -> "new buffers!" -> continue. Indefinitely.
This appears to be a regression introduced by the VIRTIO_F_IN_ORDER
support, which added vhost_get_vq_desc_n() with the cached avail_idx
short-circuit check, and the two-argument vhost_discard_vq_desc()
with next_avail_head rollback. The mismatch between the rollback
scope (last_avail_idx, next_avail_head) and the check scope
(avail_idx vs last_avail_idx) was not present before this change.
bpftrace Evidence
-----------------
During the 100% CPU lockup, we traced:
@get_rx_ret[0]: 4468052 // get_rx_bufs() returns 0 every time
@peek_ret[60366]: 4385533 // same 60KB packet seen every iteration
@sock_err[recvmsg]: 0 // tun_recvmsg() is never reached
vhost_get_vq_desc_n() was observed iterating over the exact same 11
descriptor addresses millions of times per second.
Workaround
----------
Either of the following avoids the livelock:
- Disable GRO/GSO on the TAP interface:
ethtool -K <tap> gro off gso off
- Switch from kernel vhost to userspace QEMU backend:
<driver name='qemu'/> in libvirt XML
Bisect
------
We have not yet completed a full git bisect, but the issue does not
occur on 6.17.x kernels, which lack the VIRTIO_F_IN_ORDER vhost
support. We will follow up with a Fixes: tag if we can identify the
exact commit.
Suggested Fix Direction
-----------------------
In handle_rx(), when get_rx_bufs() returns 0 (headcount == 0) due to
insufficient buffers (not because the queue is truly empty), the code
should break out of the loop rather than relying on
vhost_enable_notify() to make that determination. For example, when
get_rx_bufs() returns r == 0 with datalen still > 0, this indicates a
"packet too large" condition, not a "queue empty" condition, and
should be handled differently.
Thanks,
ShuangYu
^ permalink raw reply [flat|nested] 5+ messages in thread* Re: [BUG] vhost_net: livelock in handle_rx() when GRO packet exceeds virtqueue capacity 2026-03-01 22:36 [BUG] vhost_net: livelock in handle_rx() when GRO packet exceeds virtqueue capacity ShuangYu @ 2026-03-02 0:10 ` Michael S. Tsirkin 2026-03-02 0:39 ` Michael S. Tsirkin 0 siblings, 1 reply; 5+ messages in thread From: Michael S. Tsirkin @ 2026-03-02 0:10 UTC (permalink / raw) To: ShuangYu Cc: jasowang@redhat.com, virtualization@lists.linux.dev, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org On Sun, Mar 01, 2026 at 10:36:39PM +0000, ShuangYu wrote: > Hi, > > We have hit a severe livelock in vhost_net on 6.18.x. The vhost > kernel thread spins at 100% CPU indefinitely in handle_rx(), and > QEMU becomes unkillable (stuck in D state). > [This is a text/plain messages] > > Environment > ----------- > Kernel: 6.18.10-1.el8.elrepo.x86_64 > QEMU: 7.2.19 > Virtio: VIRTIO_F_IN_ORDER is negotiated > Backend: vhost (kernel) > > Symptoms > -------- > - vhost-<pid> kernel thread at 100% CPU (R state, never yields) > - QEMU stuck in D state at vhost_dev_flush() after receiving SIGTERM > - kill -9 has no effect on the QEMU process > - libvirt management plane deadlocks ("cannot acquire state change lock") > > Root Cause > ---------- > The livelock is triggered when a GRO-merged packet on the host TAP > interface (e.g., ~60KB) exceeds the remaining free capacity of the > guest's RX virtqueue (e.g., ~40KB of available buffers). > > The loop in handle_rx() (drivers/vhost/net.c) proceeds as follows: > > 1. get_rx_bufs() calls vhost_get_vq_desc_n() to fetch descriptors. > It advances vq->last_avail_idx and vq->next_avail_head as it > consumes buffers, but runs out before satisfying datalen. > > 2. get_rx_bufs() jumps to err: and calls > vhost_discard_vq_desc(vq, headcount, n), which rolls back > vq->last_avail_idx and vq->next_avail_head. 
> > Critically, vq->avail_idx (the cached copy of the guest's > avail->idx) is NOT rolled back. This is correct behavior in > isolation, but creates a persistent mismatch: > > vq->avail_idx = 108 (cached, unchanged) > vq->last_avail_idx = 104 (rolled back) > > 3. handle_rx() sees headcount == 0 and calls vhost_enable_notify(). > Inside, vhost_get_avail_idx() finds: > > vq->avail_idx (108) != vq->last_avail_idx (104) > > It returns 1 (true), indicating "new buffers available." > But these are the SAME buffers that were just discarded. > > 4. handle_rx() hits `continue`, restarting the loop. > > 5. In the next iteration, vhost_get_vq_desc_n() checks: > > if (vq->avail_idx == vq->last_avail_idx) > > This is FALSE (108 != 104), so it skips re-reading the guest's > actual avail->idx and directly fetches the same descriptors. > > 6. The exact same sequence repeats: fetch -> too small -> discard > -> rollback -> "new buffers!" -> continue. Indefinitely. > > This appears to be a regression introduced by the VIRTIO_F_IN_ORDER > support, which added vhost_get_vq_desc_n() with the cached avail_idx > short-circuit check, and the two-argument vhost_discard_vq_desc() > with next_avail_head rollback. The mismatch between the rollback > scope (last_avail_idx, next_avail_head) and the check scope > (avail_idx vs last_avail_idx) was not present before this change. > > bpftrace Evidence > ----------------- > During the 100% CPU lockup, we traced: > > @get_rx_ret[0]: 4468052 // get_rx_bufs() returns 0 every time > @peek_ret[60366]: 4385533 // same 60KB packet seen every iteration > @sock_err[recvmsg]: 0 // tun_recvmsg() is never reached > > vhost_get_vq_desc_n() was observed iterating over the exact same 11 > descriptor addresses millions of times per second. 
> > Workaround > ---------- > Either of the following avoids the livelock: > > - Disable GRO/GSO on the TAP interface: > ethtool -K <tap> gro off gso off > > - Switch from kernel vhost to userspace QEMU backend: > <driver name='qemu'/> in libvirt XML > > Bisect > ------ > We have not yet completed a full git bisect, but the issue does not > occur on 6.17.x kernels which lack the VIRTIO_F_IN_ORDER vhost > support. We will follow up with a Fixes: tag if we can identify the > exact commit. > > Suggested Fix Direction > ----------------------- > In handle_rx(), when get_rx_bufs() returns 0 (headcount == 0) due to > insufficient buffers (not because the queue is truly empty), the code > should break out of the loop rather than relying on > vhost_enable_notify() to make that determination. For example, when > get_rx_bufs() returns r == 0 with datalen still > 0, this indicates a > "packet too large" condition, not a "queue empty" condition, and > should be handled differently. > > Thanks, > ShuangYu Hmm. On a hunch, does the following help? completely untested, it is night here, sorry. 
diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c index 2f2c45d20883..aafae15d5156 100644 --- a/drivers/vhost/vhost.c +++ b/drivers/vhost/vhost.c @@ -1522,6 +1522,7 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d) static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq) { __virtio16 idx; + u16 avail_idx; int r; r = vhost_get_avail(vq, idx, &vq->avail->idx); @@ -1532,17 +1533,19 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq) } /* Check it isn't doing very strange thing with available indexes */ - vq->avail_idx = vhost16_to_cpu(vq, idx); - if (unlikely((u16)(vq->avail_idx - vq->last_avail_idx) > vq->num)) { + avail_idx = vhost16_to_cpu(vq, idx); + if (unlikely((u16)(avail_idx - vq->last_avail_idx) > vq->num)) { vq_err(vq, "Invalid available index change from %u to %u", vq->last_avail_idx, vq->avail_idx); return -EINVAL; } /* We're done if there is nothing new */ - if (vq->avail_idx == vq->last_avail_idx) + if (avail_idx == vq->avail_idx) return 0; + vq->avail_idx == avail_idx; + /* * We updated vq->avail_idx so we need a memory barrier between * the index read above and the caller reading avail ring entries. ^ permalink raw reply related [flat|nested] 5+ messages in thread
* Re: [BUG] vhost_net: livelock in handle_rx() when GRO packet exceeds virtqueue capacity 2026-03-02 0:10 ` Michael S. Tsirkin @ 2026-03-02 0:39 ` Michael S. Tsirkin 2026-03-02 0:45 ` Michael S. Tsirkin 0 siblings, 1 reply; 5+ messages in thread From: Michael S. Tsirkin @ 2026-03-02 0:39 UTC (permalink / raw) To: ShuangYu Cc: jasowang@redhat.com, virtualization@lists.linux.dev, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org On Sun, Mar 01, 2026 at 07:10:06PM -0500, Michael S. Tsirkin wrote: > On Sun, Mar 01, 2026 at 10:36:39PM +0000, ShuangYu wrote: > > Hi, > > > > We have hit a severe livelock in vhost_net on 6.18.x. The vhost > > kernel thread spins at 100% CPU indefinitely in handle_rx(), and > > QEMU becomes unkillable (stuck in D state). > > [This is a text/plain messages] > > > > Environment > > ----------- > > Kernel: 6.18.10-1.el8.elrepo.x86_64 > > QEMU: 7.2.19 > > Virtio: VIRTIO_F_IN_ORDER is negotiated > > Backend: vhost (kernel) > > > > Symptoms > > -------- > > - vhost-<pid> kernel thread at 100% CPU (R state, never yields) > > - QEMU stuck in D state at vhost_dev_flush() after receiving SIGTERM > > - kill -9 has no effect on the QEMU process > > - libvirt management plane deadlocks ("cannot acquire state change lock") > > > > Root Cause > > ---------- > > The livelock is triggered when a GRO-merged packet on the host TAP > > interface (e.g., ~60KB) exceeds the remaining free capacity of the > > guest's RX virtqueue (e.g., ~40KB of available buffers). > > > > The loop in handle_rx() (drivers/vhost/net.c) proceeds as follows: > > > > 1. get_rx_bufs() calls vhost_get_vq_desc_n() to fetch descriptors. > > It advances vq->last_avail_idx and vq->next_avail_head as it > > consumes buffers, but runs out before satisfying datalen. > > > > 2. get_rx_bufs() jumps to err: and calls > > vhost_discard_vq_desc(vq, headcount, n), which rolls back > > vq->last_avail_idx and vq->next_avail_head. 
> > > > Critically, vq->avail_idx (the cached copy of the guest's > > avail->idx) is NOT rolled back. This is correct behavior in > > isolation, but creates a persistent mismatch: > > > > vq->avail_idx = 108 (cached, unchanged) > > vq->last_avail_idx = 104 (rolled back) > > > > 3. handle_rx() sees headcount == 0 and calls vhost_enable_notify(). > > Inside, vhost_get_avail_idx() finds: > > > > vq->avail_idx (108) != vq->last_avail_idx (104) > > > > It returns 1 (true), indicating "new buffers available." > > But these are the SAME buffers that were just discarded. > > > > 4. handle_rx() hits `continue`, restarting the loop. > > > > 5. In the next iteration, vhost_get_vq_desc_n() checks: > > > > if (vq->avail_idx == vq->last_avail_idx) > > > > This is FALSE (108 != 104), so it skips re-reading the guest's > > actual avail->idx and directly fetches the same descriptors. > > > > 6. The exact same sequence repeats: fetch -> too small -> discard > > -> rollback -> "new buffers!" -> continue. Indefinitely. > > > > This appears to be a regression introduced by the VIRTIO_F_IN_ORDER > > support, which added vhost_get_vq_desc_n() with the cached avail_idx > > short-circuit check, and the two-argument vhost_discard_vq_desc() > > with next_avail_head rollback. The mismatch between the rollback > > scope (last_avail_idx, next_avail_head) and the check scope > > (avail_idx vs last_avail_idx) was not present before this change. > > > > bpftrace Evidence > > ----------------- > > During the 100% CPU lockup, we traced: > > > > @get_rx_ret[0]: 4468052 // get_rx_bufs() returns 0 every time > > @peek_ret[60366]: 4385533 // same 60KB packet seen every iteration > > @sock_err[recvmsg]: 0 // tun_recvmsg() is never reached > > > > vhost_get_vq_desc_n() was observed iterating over the exact same 11 > > descriptor addresses millions of times per second. 
> > > > Workaround > > ---------- > > Either of the following avoids the livelock: > > > > - Disable GRO/GSO on the TAP interface: > > ethtool -K <tap> gro off gso off > > > > - Switch from kernel vhost to userspace QEMU backend: > > <driver name='qemu'/> in libvirt XML > > > > Bisect > > ------ > > We have not yet completed a full git bisect, but the issue does not > > occur on 6.17.x kernels which lack the VIRTIO_F_IN_ORDER vhost > > support. We will follow up with a Fixes: tag if we can identify the > > exact commit. > > > > Suggested Fix Direction > > ----------------------- > > In handle_rx(), when get_rx_bufs() returns 0 (headcount == 0) due to > > insufficient buffers (not because the queue is truly empty), the code > > should break out of the loop rather than relying on > > vhost_enable_notify() to make that determination. For example, when > > get_rx_bufs() returns r == 0 with datalen still > 0, this indicates a > > "packet too large" condition, not a "queue empty" condition, and > > should be handled differently. > > > > Thanks, > > ShuangYu > > Hmm. On a hunch, does the following help? completely untested, > it is night here, sorry. 
> > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > index 2f2c45d20883..aafae15d5156 100644 > --- a/drivers/vhost/vhost.c > +++ b/drivers/vhost/vhost.c > @@ -1522,6 +1522,7 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d) > static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq) > { > __virtio16 idx; > + u16 avail_idx; > int r; > > r = vhost_get_avail(vq, idx, &vq->avail->idx); > @@ -1532,17 +1533,19 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq) > } > > /* Check it isn't doing very strange thing with available indexes */ > - vq->avail_idx = vhost16_to_cpu(vq, idx); > - if (unlikely((u16)(vq->avail_idx - vq->last_avail_idx) > vq->num)) { > + avail_idx = vhost16_to_cpu(vq, idx); > + if (unlikely((u16)(avail_idx - vq->last_avail_idx) > vq->num)) { > vq_err(vq, "Invalid available index change from %u to %u", > vq->last_avail_idx, vq->avail_idx); > return -EINVAL; > } > > /* We're done if there is nothing new */ > - if (vq->avail_idx == vq->last_avail_idx) > + if (avail_idx == vq->avail_idx) > return 0; > > + vq->avail_idx == avail_idx; > + meaning vq->avail_idx = avail_idx; of course > /* > * We updated vq->avail_idx so we need a memory barrier between > * the index read above and the caller reading avail ring entries. ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [BUG] vhost_net: livelock in handle_rx() when GRO packet exceeds virtqueue capacity 2026-03-02 0:39 ` Michael S. Tsirkin @ 2026-03-02 0:45 ` Michael S. Tsirkin 2026-03-02 1:57 ` ShuangYu 0 siblings, 1 reply; 5+ messages in thread From: Michael S. Tsirkin @ 2026-03-02 0:45 UTC (permalink / raw) To: ShuangYu Cc: jasowang@redhat.com, virtualization@lists.linux.dev, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org On Sun, Mar 01, 2026 at 07:39:30PM -0500, Michael S. Tsirkin wrote: > On Sun, Mar 01, 2026 at 07:10:06PM -0500, Michael S. Tsirkin wrote: > > On Sun, Mar 01, 2026 at 10:36:39PM +0000, ShuangYu wrote: > > > Hi, > > > > > > We have hit a severe livelock in vhost_net on 6.18.x. The vhost > > > kernel thread spins at 100% CPU indefinitely in handle_rx(), and > > > QEMU becomes unkillable (stuck in D state). > > > [This is a text/plain messages] > > > > > > Environment > > > ----------- > > > Kernel: 6.18.10-1.el8.elrepo.x86_64 > > > QEMU: 7.2.19 > > > Virtio: VIRTIO_F_IN_ORDER is negotiated > > > Backend: vhost (kernel) > > > > > > Symptoms > > > -------- > > > - vhost-<pid> kernel thread at 100% CPU (R state, never yields) > > > - QEMU stuck in D state at vhost_dev_flush() after receiving SIGTERM > > > - kill -9 has no effect on the QEMU process > > > - libvirt management plane deadlocks ("cannot acquire state change lock") > > > > > > Root Cause > > > ---------- > > > The livelock is triggered when a GRO-merged packet on the host TAP > > > interface (e.g., ~60KB) exceeds the remaining free capacity of the > > > guest's RX virtqueue (e.g., ~40KB of available buffers). > > > > > > The loop in handle_rx() (drivers/vhost/net.c) proceeds as follows: > > > > > > 1. get_rx_bufs() calls vhost_get_vq_desc_n() to fetch descriptors. > > > It advances vq->last_avail_idx and vq->next_avail_head as it > > > consumes buffers, but runs out before satisfying datalen. > > > > > > 2. 
get_rx_bufs() jumps to err: and calls > > > vhost_discard_vq_desc(vq, headcount, n), which rolls back > > > vq->last_avail_idx and vq->next_avail_head. > > > > > > Critically, vq->avail_idx (the cached copy of the guest's > > > avail->idx) is NOT rolled back. This is correct behavior in > > > isolation, but creates a persistent mismatch: > > > > > > vq->avail_idx = 108 (cached, unchanged) > > > vq->last_avail_idx = 104 (rolled back) > > > > > > 3. handle_rx() sees headcount == 0 and calls vhost_enable_notify(). > > > Inside, vhost_get_avail_idx() finds: > > > > > > vq->avail_idx (108) != vq->last_avail_idx (104) > > > > > > It returns 1 (true), indicating "new buffers available." > > > But these are the SAME buffers that were just discarded. > > > > > > 4. handle_rx() hits `continue`, restarting the loop. > > > > > > 5. In the next iteration, vhost_get_vq_desc_n() checks: > > > > > > if (vq->avail_idx == vq->last_avail_idx) > > > > > > This is FALSE (108 != 104), so it skips re-reading the guest's > > > actual avail->idx and directly fetches the same descriptors. > > > > > > 6. The exact same sequence repeats: fetch -> too small -> discard > > > -> rollback -> "new buffers!" -> continue. Indefinitely. > > > > > > This appears to be a regression introduced by the VIRTIO_F_IN_ORDER > > > support, which added vhost_get_vq_desc_n() with the cached avail_idx > > > short-circuit check, and the two-argument vhost_discard_vq_desc() > > > with next_avail_head rollback. The mismatch between the rollback > > > scope (last_avail_idx, next_avail_head) and the check scope > > > (avail_idx vs last_avail_idx) was not present before this change. 
> > > > > > bpftrace Evidence > > > ----------------- > > > During the 100% CPU lockup, we traced: > > > > > > @get_rx_ret[0]: 4468052 // get_rx_bufs() returns 0 every time > > > @peek_ret[60366]: 4385533 // same 60KB packet seen every iteration > > > @sock_err[recvmsg]: 0 // tun_recvmsg() is never reached > > > > > > vhost_get_vq_desc_n() was observed iterating over the exact same 11 > > > descriptor addresses millions of times per second. > > > > > > Workaround > > > ---------- > > > Either of the following avoids the livelock: > > > > > > - Disable GRO/GSO on the TAP interface: > > > ethtool -K <tap> gro off gso off > > > > > > - Switch from kernel vhost to userspace QEMU backend: > > > <driver name='qemu'/> in libvirt XML > > > > > > Bisect > > > ------ > > > We have not yet completed a full git bisect, but the issue does not > > > occur on 6.17.x kernels which lack the VIRTIO_F_IN_ORDER vhost > > > support. We will follow up with a Fixes: tag if we can identify the > > > exact commit. > > > > > > Suggested Fix Direction > > > ----------------------- > > > In handle_rx(), when get_rx_bufs() returns 0 (headcount == 0) due to > > > insufficient buffers (not because the queue is truly empty), the code > > > should break out of the loop rather than relying on > > > vhost_enable_notify() to make that determination. For example, when > > > get_rx_bufs() returns r == 0 with datalen still > 0, this indicates a > > > "packet too large" condition, not a "queue empty" condition, and > > > should be handled differently. > > > > > > Thanks, > > > ShuangYu > > > > Hmm. On a hunch, does the following help? completely untested, > > it is night here, sorry. 
> > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > > index 2f2c45d20883..aafae15d5156 100644 > > --- a/drivers/vhost/vhost.c > > +++ b/drivers/vhost/vhost.c > > @@ -1522,6 +1522,7 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d) > > static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq) > > { > > __virtio16 idx; > > + u16 avail_idx; > > int r; > > > > r = vhost_get_avail(vq, idx, &vq->avail->idx); > > @@ -1532,17 +1533,19 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq) > > } > > > > /* Check it isn't doing very strange thing with available indexes */ > > - vq->avail_idx = vhost16_to_cpu(vq, idx); > > - if (unlikely((u16)(vq->avail_idx - vq->last_avail_idx) > vq->num)) { > > + avail_idx = vhost16_to_cpu(vq, idx); > > + if (unlikely((u16)(avail_idx - vq->last_avail_idx) > vq->num)) { > > vq_err(vq, "Invalid available index change from %u to %u", > > vq->last_avail_idx, vq->avail_idx); > > return -EINVAL; > > } > > > > /* We're done if there is nothing new */ > > - if (vq->avail_idx == vq->last_avail_idx) > > + if (avail_idx == vq->avail_idx) > > return 0; > > > > + vq->avail_idx == avail_idx; > > + > > meaning > vq->avail_idx = avail_idx; > of course > > > /* > > * We updated vq->avail_idx so we need a memory barrier between > > * the index read above and the caller reading avail ring entries. and the change this is fixing was done in d3bb267bbdcba199568f1325743d9d501dea0560 -- MST ^ permalink raw reply [flat|nested] 5+ messages in thread
* Re: [BUG] vhost_net: livelock in handle_rx() when GRO packet exceeds virtqueue capacity 2026-03-02 0:45 ` Michael S. Tsirkin @ 2026-03-02 1:57 ` ShuangYu 0 siblings, 0 replies; 5+ messages in thread From: ShuangYu @ 2026-03-02 1:57 UTC (permalink / raw) To: Michael S. Tsirkin Cc: jasowang@redhat.com, virtualization@lists.linux.dev, netdev@vger.kernel.org, linux-kernel@vger.kernel.org, kvm@vger.kernel.org > From: "Michael S. Tsirkin"<mst@redhat.com> > Date: Mon, Mar 2, 2026, 08:45 > Subject: Re: [BUG] vhost_net: livelock in handle_rx() when GRO packet exceeds virtqueue capacity > To: "ShuangYu"<shuangyu@yunyoo.cc> > Cc: "jasowang@redhat.com"<jasowang@redhat.com>, "virtualization@lists.linux.dev"<virtualization@lists.linux.dev>, "netdev@vger.kernel.org"<netdev@vger.kernel.org>, "linux-kernel@vger.kernel.org"<linux-kernel@vger.kernel.org>, "kvm@vger.kernel.org"<kvm@vger.kernel.org> > On Sun, Mar 01, 2026 at 07:39:30PM -0500, Michael S. Tsirkin wrote: > > On Sun, Mar 01, 2026 at 07:10:06PM -0500, Michael S. Tsirkin wrote: > > > On Sun, Mar 01, 2026 at 10:36:39PM +0000, ShuangYu wrote: > > > > Hi, > > > > > > > > We have hit a severe livelock in vhost_net on 6.18.x. The vhost > > > > kernel thread spins at 100% CPU indefinitely in handle_rx(), and > > > > QEMU becomes unkillable (stuck in D state). 
> > > > [This is a text/plain messages] > > > > > > > > Environment > > > > ----------- > > > > Kernel: 6.18.10-1.el8.elrepo.x86_64 > > > > QEMU: 7.2.19 > > > > Virtio: VIRTIO_F_IN_ORDER is negotiated > > > > Backend: vhost (kernel) > > > > > > > > Symptoms > > > > -------- > > > > - vhost-<pid> kernel thread at 100% CPU (R state, never yields) > > > > - QEMU stuck in D state at vhost_dev_flush() after receiving SIGTERM > > > > - kill -9 has no effect on the QEMU process > > > > - libvirt management plane deadlocks ("cannot acquire state change lock") > > > > > > > > Root Cause > > > > ---------- > > > > The livelock is triggered when a GRO-merged packet on the host TAP > > > > interface (e.g., ~60KB) exceeds the remaining free capacity of the > > > > guest's RX virtqueue (e.g., ~40KB of available buffers). > > > > > > > > The loop in handle_rx() (drivers/vhost/net.c) proceeds as follows: > > > > > > > > 1. get_rx_bufs() calls vhost_get_vq_desc_n() to fetch descriptors. > > > > It advances vq->last_avail_idx and vq->next_avail_head as it > > > > consumes buffers, but runs out before satisfying datalen. > > > > > > > > 2. get_rx_bufs() jumps to err: and calls > > > > vhost_discard_vq_desc(vq, headcount, n), which rolls back > > > > vq->last_avail_idx and vq->next_avail_head. > > > > > > > > Critically, vq->avail_idx (the cached copy of the guest's > > > > avail->idx) is NOT rolled back. This is correct behavior in > > > > isolation, but creates a persistent mismatch: > > > > > > > > vq->avail_idx = 108 (cached, unchanged) > > > > vq->last_avail_idx = 104 (rolled back) > > > > > > > > 3. handle_rx() sees headcount == 0 and calls vhost_enable_notify(). > > > > Inside, vhost_get_avail_idx() finds: > > > > > > > > vq->avail_idx (108) != vq->last_avail_idx (104) > > > > > > > > It returns 1 (true), indicating "new buffers available." > > > > But these are the SAME buffers that were just discarded. > > > > > > > > 4. handle_rx() hits `continue`, restarting the loop. 
> > > > > > > > 5. In the next iteration, vhost_get_vq_desc_n() checks: > > > > > > > > if (vq->avail_idx == vq->last_avail_idx) > > > > > > > > This is FALSE (108 != 104), so it skips re-reading the guest's > > > > actual avail->idx and directly fetches the same descriptors. > > > > > > > > 6. The exact same sequence repeats: fetch -> too small -> discard > > > > -> rollback -> "new buffers!" -> continue. Indefinitely. > > > > > > > > This appears to be a regression introduced by the VIRTIO_F_IN_ORDER > > > > support, which added vhost_get_vq_desc_n() with the cached avail_idx > > > > short-circuit check, and the two-argument vhost_discard_vq_desc() > > > > with next_avail_head rollback. The mismatch between the rollback > > > > scope (last_avail_idx, next_avail_head) and the check scope > > > > (avail_idx vs last_avail_idx) was not present before this change. > > > > > > > > bpftrace Evidence > > > > ----------------- > > > > During the 100% CPU lockup, we traced: > > > > > > > > @get_rx_ret[0]: 4468052 // get_rx_bufs() returns 0 every time > > > > @peek_ret[60366]: 4385533 // same 60KB packet seen every iteration > > > > @sock_err[recvmsg]: 0 // tun_recvmsg() is never reached > > > > > > > > vhost_get_vq_desc_n() was observed iterating over the exact same 11 > > > > descriptor addresses millions of times per second. > > > > > > > > Workaround > > > > ---------- > > > > Either of the following avoids the livelock: > > > > > > > > - Disable GRO/GSO on the TAP interface: > > > > ethtool -K <tap> gro off gso off > > > > > > > > - Switch from kernel vhost to userspace QEMU backend: > > > > <driver name='qemu'/> in libvirt XML > > > > > > > > Bisect > > > > ------ > > > > We have not yet completed a full git bisect, but the issue does not > > > > occur on 6.17.x kernels which lack the VIRTIO_F_IN_ORDER vhost > > > > support. We will follow up with a Fixes: tag if we can identify the > > > > exact commit. 
> > > > > > > > Suggested Fix Direction > > > > ----------------------- > > > > In handle_rx(), when get_rx_bufs() returns 0 (headcount == 0) due to > > > > insufficient buffers (not because the queue is truly empty), the code > > > > should break out of the loop rather than relying on > > > > vhost_enable_notify() to make that determination. For example, when > > > > get_rx_bufs() returns r == 0 with datalen still > 0, this indicates a > > > > "packet too large" condition, not a "queue empty" condition, and > > > > should be handled differently. > > > > > > > > Thanks, > > > > ShuangYu > > > > > > Hmm. On a hunch, does the following help? completely untested, > > > it is night here, sorry. > > > > > > > > > diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c > > > index 2f2c45d20883..aafae15d5156 100644 > > > --- a/drivers/vhost/vhost.c > > > +++ b/drivers/vhost/vhost.c > > > @@ -1522,6 +1522,7 @@ static void vhost_dev_unlock_vqs(struct vhost_dev *d) > > > static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq) > > > { > > > __virtio16 idx; > > > + u16 avail_idx; > > > int r; > > > > > > r = vhost_get_avail(vq, idx, &vq->avail->idx); > > > @@ -1532,17 +1533,19 @@ static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq) > > > } > > > > > > /* Check it isn't doing very strange thing with available indexes */ > > > - vq->avail_idx = vhost16_to_cpu(vq, idx); > > > - if (unlikely((u16)(vq->avail_idx - vq->last_avail_idx) > vq->num)) { > > > + avail_idx = vhost16_to_cpu(vq, idx); > > > + if (unlikely((u16)(avail_idx - vq->last_avail_idx) > vq->num)) { > > > vq_err(vq, "Invalid available index change from %u to %u", > > > vq->last_avail_idx, vq->avail_idx); > > > return -EINVAL; > > > } > > > > > > /* We're done if there is nothing new */ > > > - if (vq->avail_idx == vq->last_avail_idx) > > > + if (avail_idx == vq->avail_idx) > > > return 0; > > > > > > + vq->avail_idx == avail_idx; > > > + > > > > meaning > > vq->avail_idx = avail_idx; > > of 
course > > > > > /* > > > * We updated vq->avail_idx so we need a memory barrier between > > > * the index read above and the caller reading avail ring entries. > > > and the change this is fixing was done in d3bb267bbdcba199568f1325743d9d501dea0560 > > -- > MST > Thank you for the quick fix and for identifying the root commit. I've reviewed the patch and I believe the logic is correct — changing the "nothing new" check in vhost_get_avail_idx() from comparing against vq->last_avail_idx to comparing against the cached vq->avail_idx makes it immune to the rollback done by vhost_discard_vq_desc(), which is exactly what breaks the loop. One minor nit: the vq_err message on the sanity check path still references vq->avail_idx before it has been updated: vq_err(vq, "Invalid available index change from %u to %u", - vq->last_avail_idx, vq->avail_idx); + vq->last_avail_idx, avail_idx); Since this issue was found in production, I need some time to prepare a test setup to verify the patch Thanks, ShuangYu ^ permalink raw reply [flat|nested] 5+ messages in thread
end of thread, other threads:[~2026-03-02 1:58 UTC | newest]

Thread overview: 5+ messages
2026-03-01 22:36 [BUG] vhost_net: livelock in handle_rx() when GRO packet exceeds virtqueue capacity ShuangYu
2026-03-02  0:10 ` Michael S. Tsirkin
2026-03-02  0:39   ` Michael S. Tsirkin
2026-03-02  0:45     ` Michael S. Tsirkin
2026-03-02  1:57       ` ShuangYu