* [RFC v5 0/7] Add packed format to shadow virtqueue
@ 2025-03-24 13:59 Sahil Siddiq
2025-03-24 13:59 ` [RFC v5 1/7] vhost: Refactor vhost_svq_add_split Sahil Siddiq
` (7 more replies)
0 siblings, 8 replies; 44+ messages in thread
From: Sahil Siddiq @ 2025-03-24 13:59 UTC (permalink / raw)
To: eperezma, sgarzare; +Cc: mst, qemu-devel, sahilcdq
Hi,
I managed to fix a few issues while testing this patch series.
There is still one issue that I am unable to resolve. I thought
I would send this patch series for review in case I have missed
something.
The issue is that this patch series does not work reliably. When
it does work, I am able to ping L0 from L2 and vice versa via the
packed SVQ.
When this doesn't work, both VMs throw a "Destination Host
Unreachable" error. This is sometimes (not always) accompanied
by the following kernel error (thrown by L2-kernel):
virtio_net virtio1: output.0:id 1 is not a head!
This error is not always thrown, but when it is, the id varies.
It is invariably followed by a soft lockup:
[ 284.662292] watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [swapper/1:0]
[ 284.662292] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core intel_vsec pmt_telemetry pmt_class vfg
[ 284.662292] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.8.7-200.fc39.x86_64 #1
[ 284.662292] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
[ 284.662292] RIP: 0010:virtqueue_enable_cb_delayed+0x115/0x150
[ 284.662292] Code: 44 77 04 0f ae f0 48 8b 42 70 0f b7 40 02 66 2b 42 50 66 39 c1 0f 93 c0 c3 cc cc cc cc 66 87 44 77 04 eb e2 f0 83 44 24 fc 00 <e9> 5a f1
[ 284.662292] RSP: 0018:ffffb8f000100cb0 EFLAGS: 00000246
[ 284.662292] RAX: 0000000000000000 RBX: ffff96f20204d800 RCX: ffff96f206f5e000
[ 284.662292] RDX: ffff96f2054fd900 RSI: ffffb8f000100c7c RDI: ffff96f2054fd900
[ 284.662292] RBP: ffff96f2078bb000 R08: 0000000000000001 R09: 0000000000000001
[ 284.662292] R10: ffff96f2078bb000 R11: 0000000000000005 R12: ffff96f207bb4a00
[ 284.662292] R13: 0000000000000000 R14: 0000000000000000 R15: ffff96f20452fd00
[ 284.662292] FS: 0000000000000000(0000) GS:ffff96f27bc80000(0000) knlGS:0000000000000000
[ 284.662292] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 284.662292] CR2: 00007f2a9ca191e8 CR3: 0000000136422003 CR4: 0000000000770ef0
[ 284.662292] PKRU: 55555554
[ 284.662292] Call Trace:
[ 284.662292] <IRQ>
[ 284.662292] ? watchdog_timer_fn+0x1e6/0x270
[ 284.662292] ? __pfx_watchdog_timer_fn+0x10/0x10
[ 284.662292] ? __hrtimer_run_queues+0x10f/0x2b0
[ 284.662292] ? hrtimer_interrupt+0xf8/0x230
[ 284.662292] ? __sysvec_apic_timer_interrupt+0x4d/0x140
[ 284.662292] ? sysvec_apic_timer_interrupt+0x39/0x90
[ 284.662292] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 284.662292] ? virtqueue_enable_cb_delayed+0x115/0x150
[ 284.662292] start_xmit+0x2a6/0x4f0 [virtio_net]
[ 284.662292] ? netif_skb_features+0x98/0x300
[ 284.662292] dev_hard_start_xmit+0x61/0x1d0
[ 284.662292] sch_direct_xmit+0xa4/0x390
[ 284.662292] __dev_queue_xmit+0x84f/0xdc0
[ 284.662292] ? nf_hook_slow+0x42/0xf0
[ 284.662292] ip_finish_output2+0x2b8/0x580
[ 284.662292] igmp_ifc_timer_expire+0x1d5/0x430
[ 284.662292] ? __pfx_igmp_ifc_timer_expire+0x10/0x10
[ 284.662292] call_timer_fn+0x21/0x130
[ 284.662292] ? __pfx_igmp_ifc_timer_expire+0x10/0x10
[ 284.662292] __run_timers+0x21f/0x2b0
[ 284.662292] run_timer_softirq+0x1d/0x40
[ 284.662292] __do_softirq+0xc9/0x2c8
[ 284.662292] __irq_exit_rcu+0xa6/0xc0
[ 284.662292] sysvec_apic_timer_interrupt+0x72/0x90
[ 284.662292] </IRQ>
[ 284.662292] <TASK>
[ 284.662292] asm_sysvec_apic_timer_interrupt+0x1a/0x20
[ 284.662292] RIP: 0010:pv_native_safe_halt+0xf/0x20
[ 284.662292] Code: 22 d7 c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 53 75 3f 00 fb f4 <c3> cc c0
[ 284.662292] RSP: 0018:ffffb8f0000b3ed8 EFLAGS: 00000212
[ 284.662292] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
[ 284.662292] RDX: 4000000000000000 RSI: 0000000000000083 RDI: 00000000000289ec
[ 284.662292] RBP: ffff96f200810000 R08: 0000000000000000 R09: 0000000000000001
[ 284.662292] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
[ 284.662292] R13: 0000000000000000 R14: ffff96f200810000 R15: 0000000000000000
[ 284.662292] default_idle+0x9/0x20
[ 284.662292] default_idle_call+0x2c/0xe0
[ 284.662292] do_idle+0x226/0x270
[ 284.662292] cpu_startup_entry+0x2a/0x30
[ 284.662292] start_secondary+0x11e/0x140
[ 284.662292] secondary_startup_64_no_verify+0x184/0x18b
[ 284.662292] </TASK>
The soft lockup seems to happen in
drivers/net/virtio_net.c:start_xmit() [1].
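For context, the "id N is not a head!" message comes from a sanity
check in the guest's virtio driver: before completing a used buffer,
the driver validates the buffer id the device wrote into the used
descriptor. The following is a hedged, standalone sketch of that check
(an illustration of the symptom, not the kernel source; the
`packed_vq`/`in_flight` names are hypothetical):

```c
/*
 * Minimal sketch of the guest-side validation that produces
 * "id N is not a head!". Not the kernel implementation.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct packed_vq {
    uint16_t num;          /* size of the descriptor ring */
    const bool *in_flight; /* hypothetical: ids currently owned by the device */
};

/* Returns false (and logs) when the device hands back an id that was
 * never exposed -- exactly the symptom if SVQ writes a wrong or
 * garbled id field into the packed descriptor. */
static bool used_id_is_valid(const struct packed_vq *vq, uint16_t id)
{
    if (id >= vq->num || !vq->in_flight[id]) {
        fprintf(stderr, "id %u is not a head!\n", id);
        return false;
    }
    return true;
}
```

If SVQ publishes a descriptor whose id field is corrupted, this check
fires in L2 and the transmit path stalls, which would explain the
subsequent soft lockup in start_xmit().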
I don't think the issue is in the kernel, because I haven't seen
any problems when testing my changes with split vqs. Only packed
vqs are affected.
L0 kernel version: 6.12.13-1-lts
QEMU command to boot L1:
$ sudo ./qemu/build/qemu-system-x86_64 \
-enable-kvm \
-drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
-net nic,model=virtio \
-net user,hostfwd=tcp::2222-:22 \
-device intel-iommu,snoop-control=on \
-device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,mq=off,ctrl_vq=off,ctrl_rx=off,ctrl_vlan=off,ctrl_mac_addr=off,packed=on,event_idx=off,bus=pcie.0,addr=0x4 \
-netdev tap,id=net0,script=no,downscript=no,vhost=off \
-nographic \
-m 8G \
-smp 4 \
-M q35 \
-cpu host 2>&1 | tee vm.log
L1 kernel version: 6.8.5-201.fc39.x86_64
I have been following the "Hands on vDPA - Part 2" blog
to set up the environment in L1 [2].
QEMU command to boot L2:
# ./qemu/build/qemu-system-x86_64 \
-nographic \
-m 4G \
-enable-kvm \
-M q35 \
-drive file=//root/L2.qcow2,media=disk,if=virtio \
-netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
-device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=off,ctrl_rx=off,ctrl_vlan=off,ctrl_mac_addr=off,event_idx=off,packed=on,bus=pcie.0,addr=0x7 \
-smp 4 \
-cpu host \
2>&1 | tee vm.log
L2 kernel version: 6.8.7-200.fc39.x86_64
I confirmed that packed vqs are enabled in L2 by running the
following:
# cut -c35 /sys/devices/pci0000\:00/0000\:00\:07.0/virtio1/features
1
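What the `cut -c35` check above relies on: the virtio sysfs "features"
attribute is a string of '0'/'1' characters where character N+1
corresponds to feature bit N, so column 35 is bit 34,
VIRTIO_F_RING_PACKED. A hedged C sketch of the same check (the helper
name is illustrative):

```c
/*
 * Sketch: parse a virtio sysfs "features" bitstring, where
 * features[bit] == '1' means the feature bit is negotiated.
 */
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define VIRTIO_F_RING_PACKED 34

static bool feature_bit_set(const char *features, unsigned int bit)
{
    return bit < strlen(features) && features[bit] == '1';
}
```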
I may be wrong, but I think the issue in my implementation might be
related to:
1. incorrect endianness conversions.
2. implementation of "vhost_svq_more_used_packed" in commit #5.
3. implementation of "vhost_svq_(en|dis)able_notification" in commit #5.
4. something else?
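Regarding suspicion (1), the discipline to check is: every field that
lives in the ring must be stored little-endian, while QEMU-side
bookkeeping (free_head, the desc_next chain) stays host-endian until
the moment it is written out. A standalone sketch of that pattern
(portable stand-ins for QEMU's cpu_to_le*() helpers, illustrative
names, not the patch code):

```c
/*
 * Sketch of the endianness discipline for writing one packed
 * descriptor. Ring fields are converted; host-side values are not.
 */
#include <assert.h>
#include <stdint.h>

static inline int host_is_le(void)
{
    const union { uint16_t u; uint8_t b[2]; } probe = { .u = 1 };
    return probe.b[0];
}

static inline uint16_t cpu_to_le16(uint16_t v)
{
    return host_is_le() ? v : __builtin_bswap16(v);
}

static inline uint32_t cpu_to_le32(uint32_t v)
{
    return host_is_le() ? v : __builtin_bswap32(v);
}

static inline uint64_t cpu_to_le64(uint64_t v)
{
    return host_is_le() ? v : __builtin_bswap64(v);
}

struct packed_desc {
    uint64_t addr;   /* little-endian in the ring */
    uint32_t len;    /* little-endian in the ring */
    uint16_t id;     /* little-endian in the ring */
    uint16_t flags;  /* little-endian in the ring */
};

/* Every ring field goes through a conversion exactly once. */
static void write_packed_desc(struct packed_desc *d, uint64_t addr,
                              uint32_t len, uint16_t id, uint16_t flags)
{
    d->addr  = cpu_to_le64(addr);
    d->len   = cpu_to_le32(len);
    d->id    = cpu_to_le16(id);
    d->flags = cpu_to_le16(flags);
}
```

Measured against this pattern, two spots in commit #3 may be worth a
second look: `descs[i].id = id;` stores a ring field without a
conversion, and `curr = cpu_to_le16(svq->desc_next[curr]);` converts a
value that is only used as a host-side index. Neither would show up on
a little-endian host, where the conversions are identities.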
Thanks,
Sahil
[1] https://github.com/torvalds/linux/blob/master/drivers/net/virtio_net.c#L3245
[2] https://www.redhat.com/en/blog/hands-vdpa-what-do-you-do-when-you-aint-got-hardware-part-2
Sahil Siddiq (7):
vhost: Refactor vhost_svq_add_split
vhost: Data structure changes to support packed vqs
vhost: Forward descriptors to device via packed SVQ
vdpa: Allocate memory for SVQ and map them to vdpa
vhost: Forward descriptors to guest via packed vqs
vhost: Validate transport device features for packed vqs
vdpa: Support setting vring_base for packed SVQ
hw/virtio/vhost-shadow-virtqueue.c | 396 ++++++++++++++++++++++-------
hw/virtio/vhost-shadow-virtqueue.h | 88 ++++---
hw/virtio/vhost-vdpa.c | 52 +++-
3 files changed, 404 insertions(+), 132 deletions(-)
--
2.48.1
^ permalink raw reply [flat|nested] 44+ messages in thread
* [RFC v5 1/7] vhost: Refactor vhost_svq_add_split
2025-03-24 13:59 [RFC v5 0/7] Add packed format to shadow virtqueue Sahil Siddiq
@ 2025-03-24 13:59 ` Sahil Siddiq
2025-03-26 11:25 ` Eugenio Perez Martin
2025-03-24 13:59 ` [RFC v5 2/7] vhost: Data structure changes to support packed vqs Sahil Siddiq
` (6 subsequent siblings)
7 siblings, 1 reply; 44+ messages in thread
From: Sahil Siddiq @ 2025-03-24 13:59 UTC (permalink / raw)
To: eperezma, sgarzare; +Cc: mst, qemu-devel, sahilcdq
This commit refactors vhost_svq_add_split and vhost_svq_add to simplify
their implementation and prepare for the addition of packed vqs in the
following commits.
Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
---
No changes from v4 -> v5.
hw/virtio/vhost-shadow-virtqueue.c | 107 +++++++++++------------------
1 file changed, 41 insertions(+), 66 deletions(-)
diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index 2481d49345..4f74ad402a 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -139,87 +139,48 @@ static bool vhost_svq_translate_addr(const VhostShadowVirtqueue *svq,
}
/**
- * Write descriptors to SVQ vring
+ * Write descriptors to SVQ split vring
*
* @svq: The shadow virtqueue
- * @sg: Cache for hwaddr
- * @iovec: The iovec from the guest
- * @num: iovec length
- * @addr: Descriptors' GPAs, if backed by guest memory
- * @more_descs: True if more descriptors come in the chain
- * @write: True if they are writeable descriptors
- *
- * Return true if success, false otherwise and print error.
+ * @out_sg: The iovec to the guest
+ * @out_num: Outgoing iovec length
+ * @in_sg: The iovec from the guest
+ * @in_num: Incoming iovec length
+ * @sgs: Cache for hwaddr
+ * @head: Saves current free_head
*/
-static bool vhost_svq_vring_write_descs(VhostShadowVirtqueue *svq, hwaddr *sg,
- const struct iovec *iovec, size_t num,
- const hwaddr *addr, bool more_descs,
- bool write)
+static void vhost_svq_add_split(VhostShadowVirtqueue *svq,
+ const struct iovec *out_sg, size_t out_num,
+ const struct iovec *in_sg, size_t in_num,
+ hwaddr *sgs, unsigned *head)
{
+ unsigned avail_idx, n;
uint16_t i = svq->free_head, last = svq->free_head;
- unsigned n;
- uint16_t flags = write ? cpu_to_le16(VRING_DESC_F_WRITE) : 0;
+ vring_avail_t *avail = svq->vring.avail;
vring_desc_t *descs = svq->vring.desc;
- bool ok;
-
- if (num == 0) {
- return true;
- }
+ size_t num = in_num + out_num;
- ok = vhost_svq_translate_addr(svq, sg, iovec, num, addr);
- if (unlikely(!ok)) {
- return false;
- }
+ *head = svq->free_head;
for (n = 0; n < num; n++) {
- if (more_descs || (n + 1 < num)) {
- descs[i].flags = flags | cpu_to_le16(VRING_DESC_F_NEXT);
+ descs[i].flags = cpu_to_le16(n < out_num ? 0 : VRING_DESC_F_WRITE);
+ if (n + 1 < num) {
+ descs[i].flags |= cpu_to_le16(VRING_DESC_F_NEXT);
descs[i].next = cpu_to_le16(svq->desc_next[i]);
+ }
+
+ descs[i].addr = cpu_to_le64(sgs[n]);
+ if (n < out_num) {
+ descs[i].len = cpu_to_le32(out_sg[n].iov_len);
} else {
- descs[i].flags = flags;
+ descs[i].len = cpu_to_le32(in_sg[n - out_num].iov_len);
}
- descs[i].addr = cpu_to_le64(sg[n]);
- descs[i].len = cpu_to_le32(iovec[n].iov_len);
last = i;
i = svq->desc_next[i];
}
svq->free_head = svq->desc_next[last];
- return true;
-}
-
-static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
- const struct iovec *out_sg, size_t out_num,
- const hwaddr *out_addr,
- const struct iovec *in_sg, size_t in_num,
- const hwaddr *in_addr, unsigned *head)
-{
- unsigned avail_idx;
- vring_avail_t *avail = svq->vring.avail;
- bool ok;
- g_autofree hwaddr *sgs = g_new(hwaddr, MAX(out_num, in_num));
-
- *head = svq->free_head;
-
- /* We need some descriptors here */
- if (unlikely(!out_num && !in_num)) {
- qemu_log_mask(LOG_GUEST_ERROR,
- "Guest provided element with no descriptors");
- return false;
- }
-
- ok = vhost_svq_vring_write_descs(svq, sgs, out_sg, out_num, out_addr,
- in_num > 0, false);
- if (unlikely(!ok)) {
- return false;
- }
-
- ok = vhost_svq_vring_write_descs(svq, sgs, in_sg, in_num, in_addr, false,
- true);
- if (unlikely(!ok)) {
- return false;
- }
/*
* Put the entry in the available array (but don't update avail->idx until
@@ -233,7 +194,6 @@ static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
smp_wmb();
avail->idx = cpu_to_le16(svq->shadow_avail_idx);
- return true;
}
static void vhost_svq_kick(VhostShadowVirtqueue *svq)
@@ -276,16 +236,31 @@ int vhost_svq_add(VhostShadowVirtqueue *svq, const struct iovec *out_sg,
unsigned ndescs = in_num + out_num;
bool ok;
+ /* We need some descriptors here */
+ if (unlikely(!ndescs)) {
+ qemu_log_mask(LOG_GUEST_ERROR,
+ "Guest provided element with no descriptors");
+ return -EINVAL;
+ }
+
if (unlikely(ndescs > vhost_svq_available_slots(svq))) {
return -ENOSPC;
}
- ok = vhost_svq_add_split(svq, out_sg, out_num, out_addr, in_sg, in_num,
- in_addr, &qemu_head);
+ g_autofree hwaddr *sgs = g_new(hwaddr, ndescs);
+ ok = vhost_svq_translate_addr(svq, sgs, out_sg, out_num, out_addr);
if (unlikely(!ok)) {
return -EINVAL;
}
+ ok = vhost_svq_translate_addr(svq, sgs + out_num, in_sg, in_num, in_addr);
+ if (unlikely(!ok)) {
+ return -EINVAL;
+ }
+
+ vhost_svq_add_split(svq, out_sg, out_num, in_sg,
+ in_num, sgs, &qemu_head);
+
svq->num_free -= ndescs;
svq->desc_state[qemu_head].elem = elem;
svq->desc_state[qemu_head].ndescs = ndescs;
--
2.48.1
* [RFC v5 2/7] vhost: Data structure changes to support packed vqs
2025-03-24 13:59 [RFC v5 0/7] Add packed format to shadow virtqueue Sahil Siddiq
2025-03-24 13:59 ` [RFC v5 1/7] vhost: Refactor vhost_svq_add_split Sahil Siddiq
@ 2025-03-24 13:59 ` Sahil Siddiq
2025-03-26 11:26 ` Eugenio Perez Martin
2025-03-24 13:59 ` [RFC v5 3/7] vhost: Forward descriptors to device via packed SVQ Sahil Siddiq
` (5 subsequent siblings)
7 siblings, 1 reply; 44+ messages in thread
From: Sahil Siddiq @ 2025-03-24 13:59 UTC (permalink / raw)
To: eperezma, sgarzare; +Cc: mst, qemu-devel, sahilcdq
Introduce "struct vring_packed".
Modify VhostShadowVirtqueue so it can support split and packed virtqueue
formats.
Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
---
Changes from v4 -> v5:
- This was commit #3 in v4. This has been reordered to commit #2
based on review comments.
- Place shadow_avail_idx, shadow_used_idx, last_used_idx
above the "shadow vring" union.
hw/virtio/vhost-shadow-virtqueue.h | 87 +++++++++++++++++++-----------
1 file changed, 56 insertions(+), 31 deletions(-)
diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
index 9c273739d6..5f7699da9d 100644
--- a/hw/virtio/vhost-shadow-virtqueue.h
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -46,10 +46,65 @@ typedef struct VhostShadowVirtqueueOps {
VirtQueueAvailCallback avail_handler;
} VhostShadowVirtqueueOps;
+struct vring_packed {
+ /* Actual memory layout for this queue. */
+ struct {
+ unsigned int num;
+ struct vring_packed_desc *desc;
+ struct vring_packed_desc_event *driver;
+ struct vring_packed_desc_event *device;
+ } vring;
+
+ /* Avail used flags. */
+ uint16_t avail_used_flags;
+
+ /* Index of the next avail descriptor. */
+ uint16_t next_avail_idx;
+
+ /* Driver ring wrap counter */
+ bool avail_wrap_counter;
+};
+
/* Shadow virtqueue to relay notifications */
typedef struct VhostShadowVirtqueue {
+ /* True if packed virtqueue */
+ bool is_packed;
+
+ /* Virtio queue shadowing */
+ VirtQueue *vq;
+
+ /* Virtio device */
+ VirtIODevice *vdev;
+
+ /* SVQ vring descriptors state */
+ SVQDescState *desc_state;
+
+ /*
+ * Backup next field for each descriptor so we can recover securely, not
+ * needing to trust the device access.
+ */
+ uint16_t *desc_next;
+
+ /* Next free descriptor */
+ uint16_t free_head;
+
+ /* Size of SVQ vring free descriptors */
+ uint16_t num_free;
+
+ /* Next head to expose to the device */
+ uint16_t shadow_avail_idx;
+
+ /* Last seen used idx */
+ uint16_t shadow_used_idx;
+
+ /* Next head to consume from the device */
+ uint16_t last_used_idx;
+
/* Shadow vring */
- struct vring vring;
+ union {
+ struct vring vring;
+ struct vring_packed vring_packed;
+ };
/* Shadow kick notifier, sent to vhost */
EventNotifier hdev_kick;
@@ -69,47 +124,17 @@ typedef struct VhostShadowVirtqueue {
/* Guest's call notifier, where the SVQ calls guest. */
EventNotifier svq_call;
- /* Virtio queue shadowing */
- VirtQueue *vq;
-
- /* Virtio device */
- VirtIODevice *vdev;
-
/* IOVA mapping */
VhostIOVATree *iova_tree;
- /* SVQ vring descriptors state */
- SVQDescState *desc_state;
-
/* Next VirtQueue element that guest made available */
VirtQueueElement *next_guest_avail_elem;
- /*
- * Backup next field for each descriptor so we can recover securely, not
- * needing to trust the device access.
- */
- uint16_t *desc_next;
-
/* Caller callbacks */
const VhostShadowVirtqueueOps *ops;
/* Caller callbacks opaque */
void *ops_opaque;
-
- /* Next head to expose to the device */
- uint16_t shadow_avail_idx;
-
- /* Next free descriptor */
- uint16_t free_head;
-
- /* Last seen used idx */
- uint16_t shadow_used_idx;
-
- /* Next head to consume from the device */
- uint16_t last_used_idx;
-
- /* Size of SVQ vring free descriptors */
- uint16_t num_free;
} VhostShadowVirtqueue;
bool vhost_svq_valid_features(uint64_t features, Error **errp);
--
2.48.1
* [RFC v5 3/7] vhost: Forward descriptors to device via packed SVQ
2025-03-24 13:59 [RFC v5 0/7] Add packed format to shadow virtqueue Sahil Siddiq
2025-03-24 13:59 ` [RFC v5 1/7] vhost: Refactor vhost_svq_add_split Sahil Siddiq
2025-03-24 13:59 ` [RFC v5 2/7] vhost: Data structure changes to support packed vqs Sahil Siddiq
@ 2025-03-24 13:59 ` Sahil Siddiq
2025-03-24 14:14 ` Sahil Siddiq
2025-03-26 12:02 ` Eugenio Perez Martin
2025-03-24 13:59 ` [RFC v5 4/7] vdpa: Allocate memory for SVQ and map them to vdpa Sahil Siddiq
` (4 subsequent siblings)
7 siblings, 2 replies; 44+ messages in thread
From: Sahil Siddiq @ 2025-03-24 13:59 UTC (permalink / raw)
To: eperezma, sgarzare; +Cc: mst, qemu-devel, sahilcdq
Implement the insertion of available buffers in the descriptor area of
packed shadow virtqueues. It takes into account descriptor chains, but
does not consider indirect descriptors.
Enable the packed SVQ to forward the descriptors to the device.
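For reviewers, the flag pattern the driver must publish can be sketched
in isolation. Per the VIRTIO 1.1 packed-ring rule, the AVAIL bit must
equal the driver's avail wrap counter and the USED bit must be its
inverse; this is a minimal illustration, not the patch code:

```c
/*
 * Sketch: descriptor flags a driver writes while its avail wrap
 * counter is `wrap`. AVAIL == wrap and USED == !wrap lets the device
 * distinguish a freshly available descriptor from a stale one left
 * over from the previous lap around the ring. Toggling both bits on
 * wrap-around is what the XOR of avail_used_flags implements.
 */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VRING_PACKED_DESC_F_AVAIL 7
#define VRING_PACKED_DESC_F_USED  15

static uint16_t driver_avail_flags(bool wrap)
{
    return (uint16_t)(((uint16_t)wrap << VRING_PACKED_DESC_F_AVAIL) |
                      ((uint16_t)!wrap << VRING_PACKED_DESC_F_USED));
}
```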
Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
---
Changes from v4 -> v5:
- This was commit #2 in v4. This has been reordered to commit #3
based on review comments.
- vhost-shadow-virtqueue.c:
(vhost_svq_valid_features): Move addition of enums to commit #6
based on review comments.
(vhost_svq_add_packed): Set head_idx to buffer id instead of vring's
index.
(vhost_svq_kick): Split into vhost_svq_kick_split and
vhost_svq_kick_packed.
(vhost_svq_add): Use new vhost_svq_kick_* functions.
hw/virtio/vhost-shadow-virtqueue.c | 117 +++++++++++++++++++++++++++--
1 file changed, 112 insertions(+), 5 deletions(-)
diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index 4f74ad402a..6e16cd4bdf 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -193,10 +193,83 @@ static void vhost_svq_add_split(VhostShadowVirtqueue *svq,
/* Update the avail index after write the descriptor */
smp_wmb();
avail->idx = cpu_to_le16(svq->shadow_avail_idx);
+}
+
+/**
+ * Write descriptors to SVQ packed vring
+ *
+ * @svq: The shadow virtqueue
+ * @out_sg: The iovec to the guest
+ * @out_num: Outgoing iovec length
+ * @in_sg: The iovec from the guest
+ * @in_num: Incoming iovec length
+ * @sgs: Cache for hwaddr
+ * @head: Saves current free_head
+ */
+static void vhost_svq_add_packed(VhostShadowVirtqueue *svq,
+ const struct iovec *out_sg, size_t out_num,
+ const struct iovec *in_sg, size_t in_num,
+ hwaddr *sgs, unsigned *head)
+{
+ uint16_t id, curr, i, head_flags = 0, head_idx;
+ size_t num = out_num + in_num;
+ unsigned n;
+
+ struct vring_packed_desc *descs = svq->vring_packed.vring.desc;
+
+ head_idx = svq->vring_packed.next_avail_idx;
+ i = head_idx;
+ id = svq->free_head;
+ curr = id;
+ *head = id;
+
+ /* Write descriptors to SVQ packed vring */
+ for (n = 0; n < num; n++) {
+ uint16_t flags = cpu_to_le16(svq->vring_packed.avail_used_flags |
+ (n < out_num ? 0 : VRING_DESC_F_WRITE) |
+ (n + 1 == num ? 0 : VRING_DESC_F_NEXT));
+ if (i == head_idx) {
+ head_flags = flags;
+ } else {
+ descs[i].flags = flags;
+ }
+
+ descs[i].addr = cpu_to_le64(sgs[n]);
+ descs[i].id = id;
+ if (n < out_num) {
+ descs[i].len = cpu_to_le32(out_sg[n].iov_len);
+ } else {
+ descs[i].len = cpu_to_le32(in_sg[n - out_num].iov_len);
+ }
+
+ curr = cpu_to_le16(svq->desc_next[curr]);
+
+ if (++i >= svq->vring_packed.vring.num) {
+ i = 0;
+ svq->vring_packed.avail_used_flags ^=
+ 1 << VRING_PACKED_DESC_F_AVAIL |
+ 1 << VRING_PACKED_DESC_F_USED;
+ }
+ }
+ if (i <= head_idx) {
+ svq->vring_packed.avail_wrap_counter ^= 1;
+ }
+
+ svq->vring_packed.next_avail_idx = i;
+ svq->shadow_avail_idx = i;
+ svq->free_head = curr;
+
+ /*
+ * A driver MUST NOT make the first descriptor in the list
+ * available before all subsequent descriptors comprising
+ * the list are made available.
+ */
+ smp_wmb();
+ svq->vring_packed.vring.desc[head_idx].flags = head_flags;
}
-static void vhost_svq_kick(VhostShadowVirtqueue *svq)
+static void vhost_svq_kick_split(VhostShadowVirtqueue *svq)
{
bool needs_kick;
@@ -209,7 +282,8 @@ static void vhost_svq_kick(VhostShadowVirtqueue *svq)
if (virtio_vdev_has_feature(svq->vdev, VIRTIO_RING_F_EVENT_IDX)) {
uint16_t avail_event = le16_to_cpu(
*(uint16_t *)(&svq->vring.used->ring[svq->vring.num]));
- needs_kick = vring_need_event(avail_event, svq->shadow_avail_idx, svq->shadow_avail_idx - 1);
+ needs_kick = vring_need_event(avail_event, svq->shadow_avail_idx,
+ svq->shadow_avail_idx - 1);
} else {
needs_kick =
!(svq->vring.used->flags & cpu_to_le16(VRING_USED_F_NO_NOTIFY));
@@ -222,6 +296,30 @@ static void vhost_svq_kick(VhostShadowVirtqueue *svq)
event_notifier_set(&svq->hdev_kick);
}
+static void vhost_svq_kick_packed(VhostShadowVirtqueue *svq)
+{
+ bool needs_kick;
+
+ /*
+ * We need to expose the available array entries before checking
+ * notification suppressions.
+ */
+ smp_mb();
+
+ if (virtio_vdev_has_feature(svq->vdev, VIRTIO_RING_F_EVENT_IDX)) {
+ return;
+ } else {
+ needs_kick = (svq->vring_packed.vring.device->flags !=
+ cpu_to_le16(VRING_PACKED_EVENT_FLAG_DISABLE));
+ }
+
+ if (!needs_kick) {
+ return;
+ }
+
+ event_notifier_set(&svq->hdev_kick);
+}
+
/**
* Add an element to a SVQ.
*
@@ -258,13 +356,22 @@ int vhost_svq_add(VhostShadowVirtqueue *svq, const struct iovec *out_sg,
return -EINVAL;
}
- vhost_svq_add_split(svq, out_sg, out_num, in_sg,
- in_num, sgs, &qemu_head);
+ if (svq->is_packed) {
+ vhost_svq_add_packed(svq, out_sg, out_num, in_sg,
+ in_num, sgs, &qemu_head);
+ } else {
+ vhost_svq_add_split(svq, out_sg, out_num, in_sg,
+ in_num, sgs, &qemu_head);
+ }
svq->num_free -= ndescs;
svq->desc_state[qemu_head].elem = elem;
svq->desc_state[qemu_head].ndescs = ndescs;
- vhost_svq_kick(svq);
+ if (svq->is_packed) {
+ vhost_svq_kick_packed(svq);
+ } else {
+ vhost_svq_kick_split(svq);
+ }
return 0;
}
--
2.48.1
* [RFC v5 4/7] vdpa: Allocate memory for SVQ and map them to vdpa
2025-03-24 13:59 [RFC v5 0/7] Add packed format to shadow virtqueue Sahil Siddiq
` (2 preceding siblings ...)
2025-03-24 13:59 ` [RFC v5 3/7] vhost: Forward descriptors to device via packed SVQ Sahil Siddiq
@ 2025-03-24 13:59 ` Sahil Siddiq
2025-03-26 12:05 ` Eugenio Perez Martin
2025-03-24 13:59 ` [RFC v5 5/7] vhost: Forward descriptors to guest via packed vqs Sahil Siddiq
` (3 subsequent siblings)
7 siblings, 1 reply; 44+ messages in thread
From: Sahil Siddiq @ 2025-03-24 13:59 UTC (permalink / raw)
To: eperezma, sgarzare; +Cc: mst, qemu-devel, sahilcdq
Allocate memory for the packed vq format's rings and map them to the vdpa device.
Since "struct vring" and "struct vring_packed's vring" both have the same
memory layout, the implementation in SVQ start and SVQ stop should not
differ based on the vq's format.
Also initialize flags, counters and indices for packed vqs before they
are utilized.
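The mapping-permission change in this commit can be summarized as
follows, from the device's DMA point of view: in the packed format the
device writes used elements back into the descriptor ring itself, so
the descriptor area becomes RW, while the packed device area only holds
the device event suppression struct, which the device only writes.
A hedged sketch (illustrative enum, not QEMU's IOMMUAccessFlags):

```c
/* Sketch of the per-format permission choice for the SVQ mappings. */
#include <assert.h>
#include <stdbool.h>

typedef enum { PERM_RO, PERM_WO, PERM_RW } AreaPerm;

/* Split: device only reads the descriptor table.
 * Packed: device also writes used elements into it. */
static AreaPerm desc_area_perm(bool is_packed)
{
    return is_packed ? PERM_RW : PERM_RO;
}

/* Split: the used ring lives here.
 * Packed: only the device event suppression struct, device-written. */
static AreaPerm device_area_perm(bool is_packed)
{
    return is_packed ? PERM_WO : PERM_RW;
}
```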
Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
---
Changes from v4 -> v5:
- vhost-shadow-virtqueue.c:
(vhost_svq_start): Initialize variables used by packed vring.
hw/virtio/vhost-shadow-virtqueue.c | 52 +++++++++++++++++++++---------
hw/virtio/vhost-shadow-virtqueue.h | 1 +
hw/virtio/vhost-vdpa.c | 37 +++++++++++++++++----
3 files changed, 69 insertions(+), 21 deletions(-)
diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index 6e16cd4bdf..126957231d 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -707,19 +707,33 @@ void vhost_svq_get_vring_addr(const VhostShadowVirtqueue *svq,
addr->used_user_addr = (uint64_t)(uintptr_t)svq->vring.used;
}
-size_t vhost_svq_driver_area_size(const VhostShadowVirtqueue *svq)
+size_t vhost_svq_descriptor_area_size(const VhostShadowVirtqueue *svq)
{
size_t desc_size = sizeof(vring_desc_t) * svq->vring.num;
- size_t avail_size = offsetof(vring_avail_t, ring[svq->vring.num]) +
- sizeof(uint16_t);
+ return ROUND_UP(desc_size, qemu_real_host_page_size());
+}
- return ROUND_UP(desc_size + avail_size, qemu_real_host_page_size());
+size_t vhost_svq_driver_area_size(const VhostShadowVirtqueue *svq)
+{
+ size_t avail_size;
+ if (svq->is_packed) {
+ avail_size = sizeof(uint32_t);
+ } else {
+ avail_size = offsetof(vring_avail_t, ring[svq->vring.num]) +
+ sizeof(uint16_t);
+ }
+ return ROUND_UP(avail_size, qemu_real_host_page_size());
}
size_t vhost_svq_device_area_size(const VhostShadowVirtqueue *svq)
{
- size_t used_size = offsetof(vring_used_t, ring[svq->vring.num]) +
- sizeof(uint16_t);
+ size_t used_size;
+ if (svq->is_packed) {
+ used_size = sizeof(uint32_t);
+ } else {
+ used_size = offsetof(vring_used_t, ring[svq->vring.num]) +
+ sizeof(uint16_t);
+ }
return ROUND_UP(used_size, qemu_real_host_page_size());
}
@@ -764,8 +778,6 @@ void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd)
void vhost_svq_start(VhostShadowVirtqueue *svq, VirtIODevice *vdev,
VirtQueue *vq, VhostIOVATree *iova_tree)
{
- size_t desc_size;
-
event_notifier_set_handler(&svq->hdev_call, vhost_svq_handle_call);
svq->next_guest_avail_elem = NULL;
svq->shadow_avail_idx = 0;
@@ -774,20 +786,29 @@ void vhost_svq_start(VhostShadowVirtqueue *svq, VirtIODevice *vdev,
svq->vdev = vdev;
svq->vq = vq;
svq->iova_tree = iova_tree;
+ svq->is_packed = virtio_vdev_has_feature(svq->vdev, VIRTIO_F_RING_PACKED);
+
+ if (svq->is_packed) {
+ svq->vring_packed.avail_wrap_counter = 1;
+ svq->vring_packed.next_avail_idx = 0;
+ svq->vring_packed.avail_used_flags = 1 << VRING_PACKED_DESC_F_AVAIL;
+ svq->last_used_idx = 0 | (1 << VRING_PACKED_EVENT_F_WRAP_CTR);
+ }
svq->vring.num = virtio_queue_get_num(vdev, virtio_get_queue_index(vq));
svq->num_free = svq->vring.num;
- svq->vring.desc = mmap(NULL, vhost_svq_driver_area_size(svq),
+ svq->vring.desc = mmap(NULL, vhost_svq_descriptor_area_size(svq),
PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS,
-1, 0);
- desc_size = sizeof(vring_desc_t) * svq->vring.num;
- svq->vring.avail = (void *)((char *)svq->vring.desc + desc_size);
+ svq->vring.avail = mmap(NULL, vhost_svq_driver_area_size(svq),
+ PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS,
+ -1, 0);
svq->vring.used = mmap(NULL, vhost_svq_device_area_size(svq),
PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS,
-1, 0);
- svq->desc_state = g_new0(SVQDescState, svq->vring.num);
- svq->desc_next = g_new0(uint16_t, svq->vring.num);
- for (unsigned i = 0; i < svq->vring.num - 1; i++) {
+ svq->desc_state = g_new0(SVQDescState, svq->num_free);
+ svq->desc_next = g_new0(uint16_t, svq->num_free);
+ for (unsigned i = 0; i < svq->num_free - 1; i++) {
svq->desc_next[i] = i + 1;
}
}
@@ -827,7 +848,8 @@ void vhost_svq_stop(VhostShadowVirtqueue *svq)
svq->vq = NULL;
g_free(svq->desc_next);
g_free(svq->desc_state);
- munmap(svq->vring.desc, vhost_svq_driver_area_size(svq));
+ munmap(svq->vring.desc, vhost_svq_descriptor_area_size(svq));
+ munmap(svq->vring.avail, vhost_svq_driver_area_size(svq));
munmap(svq->vring.used, vhost_svq_device_area_size(svq));
event_notifier_set_handler(&svq->hdev_call, NULL);
}
diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
index 5f7699da9d..12c6ea8be2 100644
--- a/hw/virtio/vhost-shadow-virtqueue.h
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -152,6 +152,7 @@ void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd);
void vhost_svq_set_svq_call_fd(VhostShadowVirtqueue *svq, int call_fd);
void vhost_svq_get_vring_addr(const VhostShadowVirtqueue *svq,
struct vhost_vring_addr *addr);
+size_t vhost_svq_descriptor_area_size(const VhostShadowVirtqueue *svq);
size_t vhost_svq_driver_area_size(const VhostShadowVirtqueue *svq);
size_t vhost_svq_device_area_size(const VhostShadowVirtqueue *svq);
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 7efbde3d4c..58c8931d89 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -1137,6 +1137,8 @@ static void vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
vhost_vdpa_svq_unmap_ring(v, svq_addr.desc_user_addr);
+ vhost_vdpa_svq_unmap_ring(v, svq_addr.avail_user_addr);
+
vhost_vdpa_svq_unmap_ring(v, svq_addr.used_user_addr);
}
@@ -1191,38 +1193,61 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
Error **errp)
{
ERRP_GUARD();
- DMAMap device_region, driver_region;
+ DMAMap descriptor_region, device_region, driver_region;
struct vhost_vring_addr svq_addr;
struct vhost_vdpa *v = dev->opaque;
+ size_t descriptor_size = vhost_svq_descriptor_area_size(svq);
size_t device_size = vhost_svq_device_area_size(svq);
size_t driver_size = vhost_svq_driver_area_size(svq);
- size_t avail_offset;
bool ok;
vhost_svq_get_vring_addr(svq, &svq_addr);
+ descriptor_region = (DMAMap) {
+ .translated_addr = svq_addr.desc_user_addr,
+ .size = descriptor_size - 1,
+ .perm = IOMMU_RO,
+ };
+ if (svq->is_packed) {
+ descriptor_region.perm = IOMMU_RW;
+ }
+
+ ok = vhost_vdpa_svq_map_ring(v, &descriptor_region, svq_addr.desc_user_addr,
+ errp);
+ if (unlikely(!ok)) {
+ error_prepend(errp, "Cannot create vq descriptor region: ");
+ return false;
+ }
+ addr->desc_user_addr = descriptor_region.iova;
+
driver_region = (DMAMap) {
+ .translated_addr = svq_addr.avail_user_addr,
.size = driver_size - 1,
.perm = IOMMU_RO,
};
- ok = vhost_vdpa_svq_map_ring(v, &driver_region, svq_addr.desc_user_addr,
+ ok = vhost_vdpa_svq_map_ring(v, &driver_region, svq_addr.avail_user_addr,
errp);
if (unlikely(!ok)) {
error_prepend(errp, "Cannot create vq driver region: ");
+ vhost_vdpa_svq_unmap_ring(v, descriptor_region.translated_addr);
return false;
}
- addr->desc_user_addr = driver_region.iova;
- avail_offset = svq_addr.avail_user_addr - svq_addr.desc_user_addr;
- addr->avail_user_addr = driver_region.iova + avail_offset;
+ addr->avail_user_addr = driver_region.iova;
device_region = (DMAMap) {
+ .translated_addr = svq_addr.used_user_addr,
.size = device_size - 1,
.perm = IOMMU_RW,
};
+ if (svq->is_packed) {
+ device_region.perm = IOMMU_WO;
+ }
+
ok = vhost_vdpa_svq_map_ring(v, &device_region, svq_addr.used_user_addr,
errp);
if (unlikely(!ok)) {
error_prepend(errp, "Cannot create vq device region: ");
+ vhost_vdpa_svq_unmap_ring(v, descriptor_region.translated_addr);
vhost_vdpa_svq_unmap_ring(v, driver_region.translated_addr);
}
addr->used_user_addr = device_region.iova;
--
2.48.1
* [RFC v5 5/7] vhost: Forward descriptors to guest via packed vqs
2025-03-24 13:59 [RFC v5 0/7] Add packed format to shadow virtqueue Sahil Siddiq
` (3 preceding siblings ...)
2025-03-24 13:59 ` [RFC v5 4/7] vdpa: Allocate memory for SVQ and map them to vdpa Sahil Siddiq
@ 2025-03-24 13:59 ` Sahil Siddiq
2025-03-24 14:34 ` Sahil Siddiq
2025-03-24 13:59 ` [RFC v5 6/7] vhost: Validate transport device features for " Sahil Siddiq
` (2 subsequent siblings)
7 siblings, 1 reply; 44+ messages in thread
From: Sahil Siddiq @ 2025-03-24 13:59 UTC (permalink / raw)
To: eperezma, sgarzare; +Cc: mst, qemu-devel, sahilcdq
Detect when used descriptors are ready for consumption by the guest via
packed virtqueues and forward them from the device to the guest.
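The used-detection rule this commit implements can be sketched in
isolation (a minimal illustration of the logic in
vhost_svq_more_used_packed, not the patch code): a descriptor is used
when its AVAIL and USED bits are equal and both match the expected used
wrap counter.

```c
/* Sketch: has the device marked this packed descriptor as used? */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VRING_PACKED_DESC_F_AVAIL 7
#define VRING_PACKED_DESC_F_USED  15

static bool desc_is_used(uint16_t flags, bool used_wrap_counter)
{
    bool avail = flags & (1u << VRING_PACKED_DESC_F_AVAIL);
    bool used  = flags & (1u << VRING_PACKED_DESC_F_USED);

    return avail == used && used == used_wrap_counter;
}
```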
Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
---
Changes from v4 -> v5:
- New commit.
- vhost-shadow-virtqueue.c:
(vhost_svq_more_used): Split into vhost_svq_more_used_split and
vhost_svq_more_used_packed.
(vhost_svq_enable_notification): Handle split and packed vqs.
(vhost_svq_disable_notification): Likewise.
(vhost_svq_get_buf): Split into vhost_svq_get_buf_split and
vhost_svq_get_buf_packed.
(vhost_svq_poll): Use new functions.
hw/virtio/vhost-shadow-virtqueue.c | 121 ++++++++++++++++++++++++++---
1 file changed, 110 insertions(+), 11 deletions(-)
diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index 126957231d..8430b3c94a 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -463,7 +463,7 @@ static void vhost_handle_guest_kick_notifier(EventNotifier *n)
vhost_handle_guest_kick(svq);
}
-static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
+static bool vhost_svq_more_used_split(VhostShadowVirtqueue *svq)
{
uint16_t *used_idx = &svq->vring.used->idx;
if (svq->last_used_idx != svq->shadow_used_idx) {
@@ -475,6 +475,22 @@ static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
return svq->last_used_idx != svq->shadow_used_idx;
}
+static bool vhost_svq_more_used_packed(VhostShadowVirtqueue *svq)
+{
+ bool avail_flag, used_flag, used_wrap_counter;
+ uint16_t last_used_idx, last_used, flags;
+
+ last_used_idx = svq->last_used_idx;
+ last_used = last_used_idx & ~(1 << VRING_PACKED_EVENT_F_WRAP_CTR);
+ used_wrap_counter = !!(last_used_idx & (1 << VRING_PACKED_EVENT_F_WRAP_CTR));
+
+ flags = le16_to_cpu(svq->vring_packed.vring.desc[last_used].flags);
+ avail_flag = !!(flags & (1 << VRING_PACKED_DESC_F_AVAIL));
+ used_flag = !!(flags & (1 << VRING_PACKED_DESC_F_USED));
+
+ return avail_flag == used_flag && used_flag == used_wrap_counter;
+}
+
/**
* Enable vhost device calls after disable them.
*
@@ -486,16 +502,31 @@ static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
*/
static bool vhost_svq_enable_notification(VhostShadowVirtqueue *svq)
{
+ bool more_used;
if (virtio_vdev_has_feature(svq->vdev, VIRTIO_RING_F_EVENT_IDX)) {
- uint16_t *used_event = (uint16_t *)&svq->vring.avail->ring[svq->vring.num];
- *used_event = cpu_to_le16(svq->shadow_used_idx);
+ if (!svq->is_packed) {
+ uint16_t *used_event = (uint16_t *)&svq->vring.avail->ring[svq->vring.num];
+ *used_event = cpu_to_le16(svq->shadow_used_idx);
+ }
} else {
- svq->vring.avail->flags &= ~cpu_to_le16(VRING_AVAIL_F_NO_INTERRUPT);
+ if (svq->is_packed) {
+ /* vq->vring_packed.vring.driver->off_wrap = cpu_to_le16(svq->last_used_idx); */
+ svq->vring_packed.vring.driver->flags =
+ cpu_to_le16(VRING_PACKED_EVENT_FLAG_ENABLE);
+ } else {
+ svq->vring.avail->flags &= ~cpu_to_le16(VRING_AVAIL_F_NO_INTERRUPT);
+ }
}
/* Make sure the event is enabled before the read of used_idx */
smp_mb();
- return !vhost_svq_more_used(svq);
+ if (svq->is_packed) {
+ more_used = !vhost_svq_more_used_packed(svq);
+ } else {
+ more_used = !vhost_svq_more_used_split(svq);
+ }
+
+ return more_used;
}
static void vhost_svq_disable_notification(VhostShadowVirtqueue *svq)
@@ -505,7 +536,12 @@ static void vhost_svq_disable_notification(VhostShadowVirtqueue *svq)
* index is already an index too far away.
*/
if (!virtio_vdev_has_feature(svq->vdev, VIRTIO_RING_F_EVENT_IDX)) {
- svq->vring.avail->flags |= cpu_to_le16(VRING_AVAIL_F_NO_INTERRUPT);
+ if (svq->is_packed) {
+ svq->vring_packed.vring.driver->flags =
+ cpu_to_le16(VRING_PACKED_EVENT_FLAG_DISABLE);
+ } else {
+ svq->vring.avail->flags |= cpu_to_le16(VRING_AVAIL_F_NO_INTERRUPT);
+ }
}
}
@@ -519,15 +555,14 @@ static uint16_t vhost_svq_last_desc_of_chain(const VhostShadowVirtqueue *svq,
return i;
}
-G_GNUC_WARN_UNUSED_RESULT
-static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq,
- uint32_t *len)
+static VirtQueueElement *vhost_svq_get_buf_split(VhostShadowVirtqueue *svq,
+ uint32_t *len)
{
const vring_used_t *used = svq->vring.used;
vring_used_elem_t used_elem;
uint16_t last_used, last_used_chain, num;
- if (!vhost_svq_more_used(svq)) {
+ if (!vhost_svq_more_used_split(svq)) {
return NULL;
}
@@ -562,6 +597,66 @@ static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq,
return g_steal_pointer(&svq->desc_state[used_elem.id].elem);
}
+static VirtQueueElement *vhost_svq_get_buf_packed(VhostShadowVirtqueue *svq,
+ uint32_t *len)
+{
+ bool used_wrap_counter;
+ uint16_t last_used_idx, last_used, id, num, last_used_chain;
+
+ if (!vhost_svq_more_used_packed(svq)) {
+ return NULL;
+ }
+
+ /* Only get used array entries after they have been exposed by dev */
+ smp_rmb();
+ last_used_idx = svq->last_used_idx;
+ last_used = last_used_idx & ~(1 << VRING_PACKED_EVENT_F_WRAP_CTR);
+ used_wrap_counter = !!(last_used_idx & (1 << VRING_PACKED_EVENT_F_WRAP_CTR));
+ id = le32_to_cpu(svq->vring_packed.vring.desc[last_used].id);
+ *len = le32_to_cpu(svq->vring_packed.vring.desc[last_used].len);
+
+ if (unlikely(id >= svq->vring.num)) {
+ qemu_log_mask(LOG_GUEST_ERROR, "Device %s says index %u is used",
+ svq->vdev->name, id);
+ return NULL;
+ }
+
+ if (unlikely(!svq->desc_state[id].ndescs)) {
+ qemu_log_mask(LOG_GUEST_ERROR,
+ "Device %s says index %u is used, but it was not available",
+ svq->vdev->name, id);
+ return NULL;
+ }
+
+ num = svq->desc_state[id].ndescs;
+ svq->desc_state[id].ndescs = 0;
+ last_used_chain = vhost_svq_last_desc_of_chain(svq, num, id);
+ svq->desc_next[last_used_chain] = svq->free_head;
+ svq->free_head = id;
+ svq->num_free += num;
+
+ last_used += num;
+ if (unlikely(last_used >= svq->vring_packed.vring.num)) {
+ last_used -= svq->vring_packed.vring.num;
+ used_wrap_counter ^= 1;
+ }
+
+ last_used = (last_used | (used_wrap_counter << VRING_PACKED_EVENT_F_WRAP_CTR));
+ svq->last_used_idx = last_used;
+ return g_steal_pointer(&svq->desc_state[id].elem);
+}
+
+G_GNUC_WARN_UNUSED_RESULT
+static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq,
+ uint32_t *len)
+{
+ if (svq->is_packed) {
+ return vhost_svq_get_buf_packed(svq, len);
+ }
+
+ return vhost_svq_get_buf_split(svq, len);
+}
+
/**
* Push an element to SVQ, returning it to the guest.
*/
@@ -639,7 +734,11 @@ size_t vhost_svq_poll(VhostShadowVirtqueue *svq, size_t num)
uint32_t r = 0;
do {
- if (vhost_svq_more_used(svq)) {
+ if (!svq->is_packed && vhost_svq_more_used_split(svq)) {
+ break;
+ }
+
+ if (svq->is_packed && vhost_svq_more_used_packed(svq)) {
break;
}
--
2.48.1
* [RFC v5 6/7] vhost: Validate transport device features for packed vqs
2025-03-24 13:59 [RFC v5 0/7] Add packed format to shadow virtqueue Sahil Siddiq
` (4 preceding siblings ...)
2025-03-24 13:59 ` [RFC v5 5/7] vhost: Forward descriptors to guest via packed vqs Sahil Siddiq
@ 2025-03-24 13:59 ` Sahil Siddiq
2025-03-26 12:06 ` Eugenio Perez Martin
2025-03-24 13:59 ` [RFC v5 7/7] vdpa: Support setting vring_base for packed SVQ Sahil Siddiq
2025-03-26 7:35 ` [RFC v5 0/7] Add packed format to shadow virtqueue Eugenio Perez Martin
7 siblings, 1 reply; 44+ messages in thread
From: Sahil Siddiq @ 2025-03-24 13:59 UTC (permalink / raw)
To: eperezma, sgarzare; +Cc: mst, qemu-devel, sahilcdq
Validate the transport device features required for utilizing packed
SVQs: both the features that guests can use with the SVQ and those
that SVQs can use with vdpa.
Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
---
Changes from v4 -> v5:
- Split from commit #2 in v4.
hw/virtio/vhost-shadow-virtqueue.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index 8430b3c94a..035ab1e66f 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -33,6 +33,9 @@ bool vhost_svq_valid_features(uint64_t features, Error **errp)
++b) {
switch (b) {
case VIRTIO_F_ANY_LAYOUT:
+ case VIRTIO_F_RING_PACKED:
+ case VIRTIO_F_RING_RESET:
+ case VIRTIO_RING_F_INDIRECT_DESC:
case VIRTIO_RING_F_EVENT_IDX:
continue;
--
2.48.1
* [RFC v5 7/7] vdpa: Support setting vring_base for packed SVQ
2025-03-24 13:59 [RFC v5 0/7] Add packed format to shadow virtqueue Sahil Siddiq
` (5 preceding siblings ...)
2025-03-24 13:59 ` [RFC v5 6/7] vhost: Validate transport device features for " Sahil Siddiq
@ 2025-03-24 13:59 ` Sahil Siddiq
2025-03-26 12:08 ` Eugenio Perez Martin
2025-03-26 7:35 ` [RFC v5 0/7] Add packed format to shadow virtqueue Eugenio Perez Martin
7 siblings, 1 reply; 44+ messages in thread
From: Sahil Siddiq @ 2025-03-24 13:59 UTC (permalink / raw)
To: eperezma, sgarzare; +Cc: mst, qemu-devel, sahilcdq
This commit concludes the series adding support for packed
virtqueues in vhost_shadow_virtqueue.
Linux commit 1225c216d954 ("vp_vdpa: allow set vq state to initial
state after reset") enabled the vp_vdpa driver to set the vq state to
the device's initial state. This works differently for split and packed
vqs.
With shadow virtqueues enabled, vhost-vdpa sets the vring base using
the VHOST_SET_VRING_BASE ioctl. The payload (vhost_vring_state)
differs for split and packed vqs. The implementation in QEMU currently
uses the payload required for split vqs (i.e., the num field of
vhost_vring_state is set to 0). The kernel throws EOPNOTSUPP when this
payload is used with packed vqs.
This patch sets the num field in the payload appropriately so vhost-vdpa
(with the vp_vdpa driver) can use packed SVQs.
Link: https://lists.nongnu.org/archive/html/qemu-devel/2024-10/msg05106.html
Link: https://lore.kernel.org/r/20210602021536.39525-4-jasowang@redhat.com
Link: 1225c216d954 ("vp_vdpa: allow set vq state to initial state after reset")
Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
---
Changes from v4 -> v5:
- Initially commit #5 in v4.
- Fix coding style of commit block as stated by checkpatch.pl.
hw/virtio/vhost-vdpa.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 58c8931d89..0625e349b3 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -1265,6 +1265,21 @@ static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,
};
int r;
+ /*
+ * In Linux, the upper 16 bits of s.num encode the
+ * last used idx while the lower 16 bits encode the
+ * last avail idx when using packed vqs. The most
+ * significant bit of each idx represents the wrap
+ * counter and should be set in both cases, while the
+ * remaining bits are cleared.
+ */
+ if (virtio_vdev_has_feature(dev->vdev, VIRTIO_F_RING_PACKED)) {
+ uint32_t last_avail_idx = 0 | (1 << VRING_PACKED_EVENT_F_WRAP_CTR);
+ uint32_t last_used_idx = 0 | (1 << VRING_PACKED_EVENT_F_WRAP_CTR);
+
+ s.num = (last_used_idx << 16) | last_avail_idx;
+ }
+
r = vhost_vdpa_set_dev_vring_base(dev, &s);
if (unlikely(r)) {
error_setg_errno(errp, -r, "Cannot set vring base");
--
2.48.1
* Re: [RFC v5 3/7] vhost: Forward descriptors to device via packed SVQ
2025-03-24 13:59 ` [RFC v5 3/7] vhost: Forward descriptors to device via packed SVQ Sahil Siddiq
@ 2025-03-24 14:14 ` Sahil Siddiq
2025-03-26 8:03 ` Eugenio Perez Martin
2025-03-26 12:02 ` Eugenio Perez Martin
1 sibling, 1 reply; 44+ messages in thread
From: Sahil Siddiq @ 2025-03-24 14:14 UTC (permalink / raw)
To: eperezma, sgarzare; +Cc: mst, qemu-devel, sahilcdq
Hi,
I had a few queries here.
On 3/24/25 7:29 PM, Sahil Siddiq wrote:
> Implement the insertion of available buffers in the descriptor area of
> packed shadow virtqueues. It takes into account descriptor chains, but
> does not consider indirect descriptors.
>
> Enable the packed SVQ to forward the descriptors to the device.
>
> Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
> ---
> Changes from v4 -> v5:
> - This was commit #2 in v4. This has been reordered to commit #3
> based on review comments.
> - vhost-shadow-virtqueue.c:
> (vhost_svq_valid_features): Move addition of enums to commit #6
> based on review comments.
> (vhost_svq_add_packed): Set head_idx to buffer id instead of vring's
> index.
> (vhost_svq_kick): Split into vhost_svq_kick_split and
> vhost_svq_kick_packed.
> (vhost_svq_add): Use new vhost_svq_kick_* functions.
>
> hw/virtio/vhost-shadow-virtqueue.c | 117 +++++++++++++++++++++++++++--
> 1 file changed, 112 insertions(+), 5 deletions(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index 4f74ad402a..6e16cd4bdf 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -193,10 +193,83 @@ static void vhost_svq_add_split(VhostShadowVirtqueue *svq,
> /* Update the avail index after write the descriptor */
> smp_wmb();
> avail->idx = cpu_to_le16(svq->shadow_avail_idx);
> +}
> +
> +/**
> + * Write descriptors to SVQ packed vring
> + *
> + * @svq: The shadow virtqueue
> + * @out_sg: The iovec to the guest
> + * @out_num: Outgoing iovec length
> + * @in_sg: The iovec from the guest
> + * @in_num: Incoming iovec length
> + * @sgs: Cache for hwaddr
> + * @head: Saves current free_head
> + */
> +static void vhost_svq_add_packed(VhostShadowVirtqueue *svq,
> + const struct iovec *out_sg, size_t out_num,
> + const struct iovec *in_sg, size_t in_num,
> + hwaddr *sgs, unsigned *head)
> +{
> + uint16_t id, curr, i, head_flags = 0, head_idx;
> + size_t num = out_num + in_num;
> + unsigned n;
> +
> + struct vring_packed_desc *descs = svq->vring_packed.vring.desc;
> +
> + head_idx = svq->vring_packed.next_avail_idx;
Since "svq->vring_packed.next_avail_idx" is part of QEMU internals and not
stored in guest memory, no endianness conversion is required here, right?
> + i = head_idx;
> + id = svq->free_head;
> + curr = id;
> + *head = id;
Should head be the buffer id or the idx of the descriptor ring where the
first descriptor of a descriptor chain is inserted?
> + /* Write descriptors to SVQ packed vring */
> + for (n = 0; n < num; n++) {
> + uint16_t flags = cpu_to_le16(svq->vring_packed.avail_used_flags |
> + (n < out_num ? 0 : VRING_DESC_F_WRITE) |
> + (n + 1 == num ? 0 : VRING_DESC_F_NEXT));
> + if (i == head_idx) {
> + head_flags = flags;
> + } else {
> + descs[i].flags = flags;
> + }
> +
> + descs[i].addr = cpu_to_le64(sgs[n]);
> + descs[i].id = id;
> + if (n < out_num) {
> + descs[i].len = cpu_to_le32(out_sg[n].iov_len);
> + } else {
> + descs[i].len = cpu_to_le32(in_sg[n - out_num].iov_len);
> + }
> +
> + curr = cpu_to_le16(svq->desc_next[curr]);
> +
> + if (++i >= svq->vring_packed.vring.num) {
> + i = 0;
> + svq->vring_packed.avail_used_flags ^=
> + 1 << VRING_PACKED_DESC_F_AVAIL |
> + 1 << VRING_PACKED_DESC_F_USED;
> + }
> + }
>
> + if (i <= head_idx) {
> + svq->vring_packed.avail_wrap_counter ^= 1;
> + }
> +
> + svq->vring_packed.next_avail_idx = i;
> + svq->shadow_avail_idx = i;
> + svq->free_head = curr;
> +
> + /*
> + * A driver MUST NOT make the first descriptor in the list
> + * available before all subsequent descriptors comprising
> + * the list are made available.
> + */
> + smp_wmb();
> + svq->vring_packed.vring.desc[head_idx].flags = head_flags;
> }
>
> -static void vhost_svq_kick(VhostShadowVirtqueue *svq)
> +static void vhost_svq_kick_split(VhostShadowVirtqueue *svq)
> {
> bool needs_kick;
>
> @@ -209,7 +282,8 @@ static void vhost_svq_kick(VhostShadowVirtqueue *svq)
> if (virtio_vdev_has_feature(svq->vdev, VIRTIO_RING_F_EVENT_IDX)) {
> uint16_t avail_event = le16_to_cpu(
> *(uint16_t *)(&svq->vring.used->ring[svq->vring.num]));
> - needs_kick = vring_need_event(avail_event, svq->shadow_avail_idx, svq->shadow_avail_idx - 1);
> + needs_kick = vring_need_event(avail_event, svq->shadow_avail_idx,
> + svq->shadow_avail_idx - 1);
> } else {
> needs_kick =
> !(svq->vring.used->flags & cpu_to_le16(VRING_USED_F_NO_NOTIFY));
> @@ -222,6 +296,30 @@ static void vhost_svq_kick(VhostShadowVirtqueue *svq)
> event_notifier_set(&svq->hdev_kick);
> }
>
> +static void vhost_svq_kick_packed(VhostShadowVirtqueue *svq)
> +{
> + bool needs_kick;
> +
> + /*
> + * We need to expose the available array entries before checking
> + * notification suppressions.
> + */
> + smp_mb();
> +
> + if (virtio_vdev_has_feature(svq->vdev, VIRTIO_RING_F_EVENT_IDX)) {
> + return;
> + } else {
> + needs_kick = (svq->vring_packed.vring.device->flags !=
> + cpu_to_le16(VRING_PACKED_EVENT_FLAG_DISABLE));
> + }
> +
> + if (!needs_kick) {
> + return;
> + }
> +
> + event_notifier_set(&svq->hdev_kick);
> +}
> +
> /**
> * Add an element to a SVQ.
> *
> @@ -258,13 +356,22 @@ int vhost_svq_add(VhostShadowVirtqueue *svq, const struct iovec *out_sg,
> return -EINVAL;
> }
>
> - vhost_svq_add_split(svq, out_sg, out_num, in_sg,
> - in_num, sgs, &qemu_head);
> + if (svq->is_packed) {
> + vhost_svq_add_packed(svq, out_sg, out_num, in_sg,
> + in_num, sgs, &qemu_head);
> + } else {
> + vhost_svq_add_split(svq, out_sg, out_num, in_sg,
> + in_num, sgs, &qemu_head);
> + }
>
> svq->num_free -= ndescs;
> svq->desc_state[qemu_head].elem = elem;
> svq->desc_state[qemu_head].ndescs = ndescs;
*head in vhost_svq_add_packed() is stored in "qemu_head" here.
> - vhost_svq_kick(svq);
> + if (svq->is_packed) {
> + vhost_svq_kick_packed(svq);
> + } else {
> + vhost_svq_kick_split(svq);
> + }
> return 0;
> }
>
Thanks,
Sahil
* Re: [RFC v5 5/7] vhost: Forward descriptors to guest via packed vqs
2025-03-24 13:59 ` [RFC v5 5/7] vhost: Forward descriptors to guest via packed vqs Sahil Siddiq
@ 2025-03-24 14:34 ` Sahil Siddiq
2025-03-26 8:34 ` Eugenio Perez Martin
0 siblings, 1 reply; 44+ messages in thread
From: Sahil Siddiq @ 2025-03-24 14:34 UTC (permalink / raw)
To: eperezma, sgarzare; +Cc: mst, qemu-devel, sahilcdq
Hi,
I had a few more queries here as well.
On 3/24/25 7:29 PM, Sahil Siddiq wrote:
> Detect when used descriptors are ready for consumption by the guest via
> packed virtqueues and forward them from the device to the guest.
>
> Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
> ---
> Changes from v4 -> v5:
> - New commit.
> - vhost-shadow-virtqueue.c:
> (vhost_svq_more_used): Split into vhost_svq_more_used_split and
> vhost_svq_more_used_packed.
> (vhost_svq_enable_notification): Handle split and packed vqs.
> (vhost_svq_disable_notification): Likewise.
> (vhost_svq_get_buf): Split into vhost_svq_get_buf_split and
> vhost_svq_get_buf_packed.
> (vhost_svq_poll): Use new functions.
>
> hw/virtio/vhost-shadow-virtqueue.c | 121 ++++++++++++++++++++++++++---
> 1 file changed, 110 insertions(+), 11 deletions(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index 126957231d..8430b3c94a 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -463,7 +463,7 @@ static void vhost_handle_guest_kick_notifier(EventNotifier *n)
> vhost_handle_guest_kick(svq);
> }
>
> -static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
> +static bool vhost_svq_more_used_split(VhostShadowVirtqueue *svq)
> {
> uint16_t *used_idx = &svq->vring.used->idx;
> if (svq->last_used_idx != svq->shadow_used_idx) {
> @@ -475,6 +475,22 @@ static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
> return svq->last_used_idx != svq->shadow_used_idx;
> }
>
> +static bool vhost_svq_more_used_packed(VhostShadowVirtqueue *svq)
> +{
> + bool avail_flag, used_flag, used_wrap_counter;
> + uint16_t last_used_idx, last_used, flags;
> +
> + last_used_idx = svq->last_used_idx;
> + last_used = last_used_idx & ~(1 << VRING_PACKED_EVENT_F_WRAP_CTR);
In the Linux kernel, last_used is calculated as:
last_used_idx & ~(-(1 << VRING_PACKED_EVENT_F_WRAP_CTR))
...instead of...
last_used_idx & ~(1 << VRING_PACKED_EVENT_F_WRAP_CTR)
Isn't the second option sufficient, given that last_used_idx is a
uint16_t and VRING_PACKED_EVENT_F_WRAP_CTR is defined as 15?
> + used_wrap_counter = !!(last_used_idx & (1 << VRING_PACKED_EVENT_F_WRAP_CTR));
> +
> + flags = le16_to_cpu(svq->vring_packed.vring.desc[last_used].flags);
> + avail_flag = !!(flags & (1 << VRING_PACKED_DESC_F_AVAIL));
> + used_flag = !!(flags & (1 << VRING_PACKED_DESC_F_USED));
> +
> + return avail_flag == used_flag && used_flag == used_wrap_counter;
> +}
> +
Also, in the implementation of vhost_svq_more_used_split() [1], I don't
understand why the following condition:
svq->last_used_idx != svq->shadow_used_idx
is checked before updating the value of "svq->shadow_used_idx":
svq->shadow_used_idx = le16_to_cpu(*(volatile uint16_t *)used_idx)
Thanks,
Sahil
[1] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-shadow-virtqueue.c#L387
* Re: [RFC v5 0/7] Add packed format to shadow virtqueue
2025-03-24 13:59 [RFC v5 0/7] Add packed format to shadow virtqueue Sahil Siddiq
` (6 preceding siblings ...)
2025-03-24 13:59 ` [RFC v5 7/7] vdpa: Support setting vring_base for packed SVQ Sahil Siddiq
@ 2025-03-26 7:35 ` Eugenio Perez Martin
2025-04-14 9:20 ` Sahil Siddiq
7 siblings, 1 reply; 44+ messages in thread
From: Eugenio Perez Martin @ 2025-03-26 7:35 UTC (permalink / raw)
To: Sahil Siddiq; +Cc: sgarzare, mst, qemu-devel, sahilcdq, Jason Wang
On Mon, Mar 24, 2025 at 2:59 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> I managed to fix a few issues while testing this patch series.
> There is still one issue that I am unable to resolve. I thought
> I would send this patch series for review in case I have missed
> something.
>
> The issue is that this patch series does not work every time. I
> am able to ping L0 from L2 and vice versa via packed SVQ when it
> works.
>
So we're on a very good track then!
> When this doesn't work, both VMs throw a "Destination Host
> Unreachable" error. This is sometimes (not always) accompanied
> by the following kernel error (thrown by L2-kernel):
>
> virtio_net virtio1: output.0:id 1 is not a head!
>
How many packets have been sent or received before hitting this? If
the answer to that is "the vq size", maybe there is a bug in the code
that handles the wraparound of the packed vq, as the used and avail
flags need to be twisted. You can count them in the SVQ code.
> This error is not thrown always, but when it is thrown, the id
> varies. This is invariably followed by a soft lockup:
>
> [ 284.662292] watchdog: BUG: soft lockup - CPU#1 stuck for 26s! [swapper/1:0]
> [ 284.662292] Modules linked in: rfkill intel_rapl_msr intel_rapl_common intel_uncore_frequency_common intel_pmc_core intel_vsec pmt_telemetry pmt_class vfg
> [ 284.662292] CPU: 1 PID: 0 Comm: swapper/1 Not tainted 6.8.7-200.fc39.x86_64 #1
> [ 284.662292] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebuilt.qemu.org 04/01/2014
> [ 284.662292] RIP: 0010:virtqueue_enable_cb_delayed+0x115/0x150
> [ 284.662292] Code: 44 77 04 0f ae f0 48 8b 42 70 0f b7 40 02 66 2b 42 50 66 39 c1 0f 93 c0 c3 cc cc cc cc 66 87 44 77 04 eb e2 f0 83 44 24 fc 00 <e9> 5a f1
> [ 284.662292] RSP: 0018:ffffb8f000100cb0 EFLAGS: 00000246
> [ 284.662292] RAX: 0000000000000000 RBX: ffff96f20204d800 RCX: ffff96f206f5e000
> [ 284.662292] RDX: ffff96f2054fd900 RSI: ffffb8f000100c7c RDI: ffff96f2054fd900
> [ 284.662292] RBP: ffff96f2078bb000 R08: 0000000000000001 R09: 0000000000000001
> [ 284.662292] R10: ffff96f2078bb000 R11: 0000000000000005 R12: ffff96f207bb4a00
> [ 284.662292] R13: 0000000000000000 R14: 0000000000000000 R15: ffff96f20452fd00
> [ 284.662292] FS: 0000000000000000(0000) GS:ffff96f27bc80000(0000) knlGS:0000000000000000
> [ 284.662292] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 284.662292] CR2: 00007f2a9ca191e8 CR3: 0000000136422003 CR4: 0000000000770ef0
> [ 284.662292] PKRU: 55555554
> [ 284.662292] Call Trace:
> [ 284.662292] <IRQ>
> [ 284.662292] ? watchdog_timer_fn+0x1e6/0x270
> [ 284.662292] ? __pfx_watchdog_timer_fn+0x10/0x10
> [ 284.662292] ? __hrtimer_run_queues+0x10f/0x2b0
> [ 284.662292] ? hrtimer_interrupt+0xf8/0x230
> [ 284.662292] ? __sysvec_apic_timer_interrupt+0x4d/0x140
> [ 284.662292] ? sysvec_apic_timer_interrupt+0x39/0x90
> [ 284.662292] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [ 284.662292] ? virtqueue_enable_cb_delayed+0x115/0x150
> [ 284.662292] start_xmit+0x2a6/0x4f0 [virtio_net]
> [ 284.662292] ? netif_skb_features+0x98/0x300
> [ 284.662292] dev_hard_start_xmit+0x61/0x1d0
> [ 284.662292] sch_direct_xmit+0xa4/0x390
> [ 284.662292] __dev_queue_xmit+0x84f/0xdc0
> [ 284.662292] ? nf_hook_slow+0x42/0xf0
> [ 284.662292] ip_finish_output2+0x2b8/0x580
> [ 284.662292] igmp_ifc_timer_expire+0x1d5/0x430
> [ 284.662292] ? __pfx_igmp_ifc_timer_expire+0x10/0x10
> [ 284.662292] call_timer_fn+0x21/0x130
> [ 284.662292] ? __pfx_igmp_ifc_timer_expire+0x10/0x10
> [ 284.662292] __run_timers+0x21f/0x2b0
> [ 284.662292] run_timer_softirq+0x1d/0x40
> [ 284.662292] __do_softirq+0xc9/0x2c8
> [ 284.662292] __irq_exit_rcu+0xa6/0xc0
> [ 284.662292] sysvec_apic_timer_interrupt+0x72/0x90
> [ 284.662292] </IRQ>
> [ 284.662292] <TASK>
> [ 284.662292] asm_sysvec_apic_timer_interrupt+0x1a/0x20
> [ 284.662292] RIP: 0010:pv_native_safe_halt+0xf/0x20
> [ 284.662292] Code: 22 d7 c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 53 75 3f 00 fb f4 <c3> cc c0
> [ 284.662292] RSP: 0018:ffffb8f0000b3ed8 EFLAGS: 00000212
> [ 284.662292] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
> [ 284.662292] RDX: 4000000000000000 RSI: 0000000000000083 RDI: 00000000000289ec
> [ 284.662292] RBP: ffff96f200810000 R08: 0000000000000000 R09: 0000000000000001
> [ 284.662292] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
> [ 284.662292] R13: 0000000000000000 R14: ffff96f200810000 R15: 0000000000000000
> [ 284.662292] default_idle+0x9/0x20
> [ 284.662292] default_idle_call+0x2c/0xe0
> [ 284.662292] do_idle+0x226/0x270
> [ 284.662292] cpu_startup_entry+0x2a/0x30
> [ 284.662292] start_secondary+0x11e/0x140
> [ 284.662292] secondary_startup_64_no_verify+0x184/0x18b
> [ 284.662292] </TASK>
>
> The soft lockup seems to happen in
> drivers/net/virtio_net.c:start_xmit() [1].
>
Maybe it gets stuck in the do {} while (...
!virtqueue_enable_cb_delayed()) loop? You can add a printk at the
return of virtqueue_enable_cb_delayed and check whether its frequency
matches the rate at which you're sending or receiving pings. For
example, if you ping once per second, you should not see a lot of
traces.
If this does not work I'd try never disabling notifications, both in
the kernel and SVQ, and check if that works.
> I don't think the issue is in the kernel because I haven't seen
> any issue when testing my changes with split vqs. Only packed vqs
> give an issue.
>
> L0 kernel version: 6.12.13-1-lts
>
> QEMU command to boot L1:
>
> $ sudo ./qemu/build/qemu-system-x86_64 \
> -enable-kvm \
> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
> -net nic,model=virtio \
> -net user,hostfwd=tcp::2222-:22 \
> -device intel-iommu,snoop-control=on \
> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,mq=off,ctrl_vq=off,ctrl_rx=off,ctrl_vlan=off,ctrl_mac_addr=off,packed=on,event_idx=off,bus=pcie.0,addr=0x4 \
> -netdev tap,id=net0,script=no,downscript=no,vhost=off \
> -nographic \
> -m 8G \
> -smp 4 \
> -M q35 \
> -cpu host 2>&1 | tee vm.log
>
> L1 kernel version: 6.8.5-201.fc39.x86_64
>
> I have been following the "Hands on vDPA - Part 2" blog
> to set up the environment in L1 [2].
>
> QEMU command to boot L2:
>
> # ./qemu/build/qemu-system-x86_64 \
> -nographic \
> -m 4G \
> -enable-kvm \
> -M q35 \
> -drive file=//root/L2.qcow2,media=disk,if=virtio \
> -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0 \
> -device virtio-net-pci,netdev=vhost-vdpa0,disable-legacy=on,disable-modern=off,ctrl_vq=off,ctrl_rx=off,ctrl_vlan=off,ctrl_mac_addr=off,event_idx=off,packed=on,bus=pcie.0,addr=0x7 \
> -smp 4 \
> -cpu host \
> 2>&1 | tee vm.log
>
> L2 kernel version: 6.8.7-200.fc39.x86_64
>
> I confirmed that packed vqs are enabled in L2 by running the
> following:
>
> # cut -c35 /sys/devices/pci0000\:00/0000\:00\:07.0/virtio1/features
> 1
>
> I may be wrong, but I think the issue in my implementation might be
> related to:
>
> 1. incorrect endianness conversions.
> 2. implementation of "vhost_svq_more_used_packed" in commit #5.
> 3. implementation of "vhost_svq_(en|dis)able_notification" in commit #5.
> 4. something else?
>
I think 1 is unlikely. I'd go with 2 and 3.
Let me know if the proposed changes work!
> Thanks,
> Sahil
>
> [1] https://github.com/torvalds/linux/blob/master/drivers/net/virtio_net.c#L3245
> [2] https://www.redhat.com/en/blog/hands-vdpa-what-do-you-do-when-you-aint-got-hardware-part-2
>
> Sahil Siddiq (7):
> vhost: Refactor vhost_svq_add_split
> vhost: Data structure changes to support packed vqs
> vhost: Forward descriptors to device via packed SVQ
> vdpa: Allocate memory for SVQ and map them to vdpa
> vhost: Forward descriptors to guest via packed vqs
> vhost: Validate transport device features for packed vqs
> vdpa: Support setting vring_base for packed SVQ
>
> hw/virtio/vhost-shadow-virtqueue.c | 396 ++++++++++++++++++++++-------
> hw/virtio/vhost-shadow-virtqueue.h | 88 ++++---
> hw/virtio/vhost-vdpa.c | 52 +++-
> 3 files changed, 404 insertions(+), 132 deletions(-)
>
> --
> 2.48.1
>
* Re: [RFC v5 3/7] vhost: Forward descriptors to device via packed SVQ
2025-03-24 14:14 ` Sahil Siddiq
@ 2025-03-26 8:03 ` Eugenio Perez Martin
2025-03-27 18:42 ` Sahil Siddiq
0 siblings, 1 reply; 44+ messages in thread
From: Eugenio Perez Martin @ 2025-03-26 8:03 UTC (permalink / raw)
To: Sahil Siddiq; +Cc: sgarzare, mst, qemu-devel, sahilcdq
On Mon, Mar 24, 2025 at 3:14 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> I had a few queries here.
>
> On 3/24/25 7:29 PM, Sahil Siddiq wrote:
> > Implement the insertion of available buffers in the descriptor area of
> > packed shadow virtqueues. It takes into account descriptor chains, but
> > does not consider indirect descriptors.
> >
> > Enable the packed SVQ to forward the descriptors to the device.
> >
> > Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
> > ---
> > Changes from v4 -> v5:
> > - This was commit #2 in v4. This has been reordered to commit #3
> > based on review comments.
> > - vhost-shadow-virtqueue.c:
> > (vhost_svq_valid_features): Move addition of enums to commit #6
> > based on review comments.
> > (vhost_svq_add_packed): Set head_idx to buffer id instead of vring's
> > index.
> > (vhost_svq_kick): Split into vhost_svq_kick_split and
> > vhost_svq_kick_packed.
> > (vhost_svq_add): Use new vhost_svq_kick_* functions.
> >
> > hw/virtio/vhost-shadow-virtqueue.c | 117 +++++++++++++++++++++++++++--
> > 1 file changed, 112 insertions(+), 5 deletions(-)
> >
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > index 4f74ad402a..6e16cd4bdf 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > @@ -193,10 +193,83 @@ static void vhost_svq_add_split(VhostShadowVirtqueue *svq,
> > /* Update the avail index after write the descriptor */
> > smp_wmb();
> > avail->idx = cpu_to_le16(svq->shadow_avail_idx);
> > +}
> > +
> > +/**
> > + * Write descriptors to SVQ packed vring
> > + *
> > + * @svq: The shadow virtqueue
> > + * @out_sg: The iovec to the guest
> > + * @out_num: Outgoing iovec length
> > + * @in_sg: The iovec from the guest
> > + * @in_num: Incoming iovec length
> > + * @sgs: Cache for hwaddr
> > + * @head: Saves current free_head
> > + */
> > +static void vhost_svq_add_packed(VhostShadowVirtqueue *svq,
> > + const struct iovec *out_sg, size_t out_num,
> > + const struct iovec *in_sg, size_t in_num,
> > + hwaddr *sgs, unsigned *head)
> > +{
> > + uint16_t id, curr, i, head_flags = 0, head_idx;
> > + size_t num = out_num + in_num;
> > + unsigned n;
> > +
> > + struct vring_packed_desc *descs = svq->vring_packed.vring.desc;
> > +
> > + head_idx = svq->vring_packed.next_avail_idx;
>
> Since "svq->vring_packed.next_avail_idx" is part of QEMU internals and not
> stored in guest memory, no endianness conversion is required here, right?
>
Right!
> > + i = head_idx;
> > + id = svq->free_head;
> > + curr = id;
> > + *head = id;
>
> Should head be the buffer id or the idx of the descriptor ring where the
> first descriptor of a descriptor chain is inserted?
>
The buffer id of the *last* descriptor of a chain. See "2.8.6 Next
Flag: Descriptor Chaining" at [1].
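To illustrate the spec point, here is a minimal, self-contained sketch (not QEMU code; the struct and helper names are invented for illustration, and wrap handling and endianness conversions are omitted) of how every descriptor in a packed-ring chain can be stamped with the buffer id, while only the id in the last descriptor of the chain, the one without VRING_DESC_F_NEXT, is what the device reports back:

```c
#include <assert.h>
#include <stdint.h>

#define VRING_DESC_F_NEXT 1

/* Toy packed descriptor; field layout mirrors struct vring_packed_desc. */
struct packed_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t id;
    uint16_t flags;
};

/*
 * Toy chain writer, mirroring the patch's loop. Every descriptor carries
 * the buffer id, but per virtio 1.3 section 2.8.6 only the id in the
 * last descriptor of the chain is meaningful to the device.
 */
static void write_chain(struct packed_desc *descs, uint16_t start,
                        uint16_t buffer_id, int num)
{
    for (int n = 0; n < num; n++) {
        descs[start + n].id = buffer_id;
        descs[start + n].flags = (n + 1 == num) ? 0 : VRING_DESC_F_NEXT;
    }
}

/*
 * What a device conceptually does on completion: walk to the end of the
 * chain and report that descriptor's id in a used element.
 */
static uint16_t device_used_id(const struct packed_desc *descs, uint16_t start)
{
    uint16_t i = start;
    while (descs[i].flags & VRING_DESC_F_NEXT) {
        i++;
    }
    return descs[i].id;
}
```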
> > + /* Write descriptors to SVQ packed vring */
> > + for (n = 0; n < num; n++) {
> > + uint16_t flags = cpu_to_le16(svq->vring_packed.avail_used_flags |
> > + (n < out_num ? 0 : VRING_DESC_F_WRITE) |
> > + (n + 1 == num ? 0 : VRING_DESC_F_NEXT));
> > + if (i == head_idx) {
> > + head_flags = flags;
> > + } else {
> > + descs[i].flags = flags;
> > + }
> > +
> > + descs[i].addr = cpu_to_le64(sgs[n]);
> > + descs[i].id = id;
> > + if (n < out_num) {
> > + descs[i].len = cpu_to_le32(out_sg[n].iov_len);
> > + } else {
> > + descs[i].len = cpu_to_le32(in_sg[n - out_num].iov_len);
> > + }
> > +
> > + curr = cpu_to_le16(svq->desc_next[curr]);
> > +
> > + if (++i >= svq->vring_packed.vring.num) {
> > + i = 0;
> > + svq->vring_packed.avail_used_flags ^=
> > + 1 << VRING_PACKED_DESC_F_AVAIL |
> > + 1 << VRING_PACKED_DESC_F_USED;
> > + }
> > + }
> >
> > + if (i <= head_idx) {
> > + svq->vring_packed.avail_wrap_counter ^= 1;
> > + }
> > +
> > + svq->vring_packed.next_avail_idx = i;
> > + svq->shadow_avail_idx = i;
> > + svq->free_head = curr;
> > +
> > + /*
> > + * A driver MUST NOT make the first descriptor in the list
> > + * available before all subsequent descriptors comprising
> > + * the list are made available.
> > + */
> > + smp_wmb();
> > + svq->vring_packed.vring.desc[head_idx].flags = head_flags;
> > }
> >
> > -static void vhost_svq_kick(VhostShadowVirtqueue *svq)
> > +static void vhost_svq_kick_split(VhostShadowVirtqueue *svq)
> > {
> > bool needs_kick;
> >
> > @@ -209,7 +282,8 @@ static void vhost_svq_kick(VhostShadowVirtqueue *svq)
> > if (virtio_vdev_has_feature(svq->vdev, VIRTIO_RING_F_EVENT_IDX)) {
> > uint16_t avail_event = le16_to_cpu(
> > *(uint16_t *)(&svq->vring.used->ring[svq->vring.num]));
> > - needs_kick = vring_need_event(avail_event, svq->shadow_avail_idx, svq->shadow_avail_idx - 1);
> > + needs_kick = vring_need_event(avail_event, svq->shadow_avail_idx,
> > + svq->shadow_avail_idx - 1);
> > } else {
> > needs_kick =
> > !(svq->vring.used->flags & cpu_to_le16(VRING_USED_F_NO_NOTIFY));
> > @@ -222,6 +296,30 @@ static void vhost_svq_kick(VhostShadowVirtqueue *svq)
> > event_notifier_set(&svq->hdev_kick);
> > }
> >
> > +static void vhost_svq_kick_packed(VhostShadowVirtqueue *svq)
> > +{
> > + bool needs_kick;
> > +
> > + /*
> > + * We need to expose the available array entries before checking
> > + * notification suppressions.
> > + */
> > + smp_mb();
> > +
> > + if (virtio_vdev_has_feature(svq->vdev, VIRTIO_RING_F_EVENT_IDX)) {
> > + return;
> > + } else {
> > + needs_kick = (svq->vring_packed.vring.device->flags !=
> > + cpu_to_le16(VRING_PACKED_EVENT_FLAG_DISABLE));
> > + }
> > +
> > + if (!needs_kick) {
> > + return;
> > + }
> > +
> > + event_notifier_set(&svq->hdev_kick);
> > +}
> > +
> > /**
> > * Add an element to a SVQ.
> > *
> > @@ -258,13 +356,22 @@ int vhost_svq_add(VhostShadowVirtqueue *svq, const struct iovec *out_sg,
> > return -EINVAL;
> > }
> >
> > - vhost_svq_add_split(svq, out_sg, out_num, in_sg,
> > - in_num, sgs, &qemu_head);
> > + if (svq->is_packed) {
> > + vhost_svq_add_packed(svq, out_sg, out_num, in_sg,
> > + in_num, sgs, &qemu_head);
> > + } else {
> > + vhost_svq_add_split(svq, out_sg, out_num, in_sg,
> > + in_num, sgs, &qemu_head);
> > + }
> >
> > svq->num_free -= ndescs;
> > svq->desc_state[qemu_head].elem = elem;
> > svq->desc_state[qemu_head].ndescs = ndescs;
>
> *head in vhost_svq_add_packed() is stored in "qemu_head" here.
>
Sorry, I don't follow. Can you expand on this?
[1] https://docs.oasis-open.org/virtio/virtio/v1.3/virtio-v1.3.html
> > - vhost_svq_kick(svq);
> > + if (svq->is_packed) {
> > + vhost_svq_kick_packed(svq);
> > + } else {
> > + vhost_svq_kick_split(svq);
> > + }
> > return 0;
> > }
> >
>
> Thanks,
> Sahil
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC v5 5/7] vhost: Forward descriptors to guest via packed vqs
2025-03-24 14:34 ` Sahil Siddiq
@ 2025-03-26 8:34 ` Eugenio Perez Martin
2025-03-28 5:22 ` Sahil Siddiq
0 siblings, 1 reply; 44+ messages in thread
From: Eugenio Perez Martin @ 2025-03-26 8:34 UTC (permalink / raw)
To: Sahil Siddiq; +Cc: sgarzare, mst, qemu-devel, sahilcdq
On Mon, Mar 24, 2025 at 3:34 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> I had a few more queries here as well.
>
> On 3/24/25 7:29 PM, Sahil Siddiq wrote:
> > Detect when used descriptors are ready for consumption by the guest via
> > packed virtqueues and forward them from the device to the guest.
> >
> > Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
> > ---
> > Changes from v4 -> v5:
> > - New commit.
> > - vhost-shadow-virtqueue.c:
> > (vhost_svq_more_used): Split into vhost_svq_more_used_split and
> > vhost_svq_more_used_packed.
> > (vhost_svq_enable_notification): Handle split and packed vqs.
> > (vhost_svq_disable_notification): Likewise.
> > (vhost_svq_get_buf): Split into vhost_svq_get_buf_split and
> > vhost_svq_get_buf_packed.
> > (vhost_svq_poll): Use new functions.
> >
> > hw/virtio/vhost-shadow-virtqueue.c | 121 ++++++++++++++++++++++++++---
> > 1 file changed, 110 insertions(+), 11 deletions(-)
> >
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > index 126957231d..8430b3c94a 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > @@ -463,7 +463,7 @@ static void vhost_handle_guest_kick_notifier(EventNotifier *n)
> > vhost_handle_guest_kick(svq);
> > }
> >
> > -static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
> > +static bool vhost_svq_more_used_split(VhostShadowVirtqueue *svq)
> > {
> > uint16_t *used_idx = &svq->vring.used->idx;
> > if (svq->last_used_idx != svq->shadow_used_idx) {
> > @@ -475,6 +475,22 @@ static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
> > return svq->last_used_idx != svq->shadow_used_idx;
> > }
> >
> > +static bool vhost_svq_more_used_packed(VhostShadowVirtqueue *svq)
> > +{
> > + bool avail_flag, used_flag, used_wrap_counter;
> > + uint16_t last_used_idx, last_used, flags;
> > +
> > + last_used_idx = svq->last_used_idx;
> > + last_used = last_used_idx & ~(1 << VRING_PACKED_EVENT_F_WRAP_CTR);
>
> In the linux kernel, last_used is calculated as:
>
> last_used_idx & ~(-(1 << VRING_PACKED_EVENT_F_WRAP_CTR))
>
> ...instead of...
>
> last_used_idx & ~(1 << VRING_PACKED_EVENT_F_WRAP_CTR)
>
> Isn't the second option good enough if last_used_idx is uint16_t
> and VRING_PACKED_EVENT_F_WRAP_CTR is defined as 15.
>
I think it is good enough given the u16 restriction; the kernel's
extra negation is just defensive code.
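A quick way to convince yourself: for a 16-bit index with VRING_PACKED_EVENT_F_WRAP_CTR == 15, the two masks agree on every possible value; the negated form only makes a difference if the index were ever held in a wider type, since ~(-(1 << 15)) clears bit 15 and every bit above it. A small standalone check (the helper names are made up for this sketch):

```c
#include <assert.h>
#include <stdint.h>

#define VRING_PACKED_EVENT_F_WRAP_CTR 15

/* The form used in this patch: clears only bit 15. */
static uint16_t mask_simple(uint16_t last_used_idx)
{
    return last_used_idx & ~(1 << VRING_PACKED_EVENT_F_WRAP_CTR);
}

/*
 * The kernel's defensive form: -(1 << 15) sign-extends, so the
 * complement clears bit 15 and all higher bits. Identical for u16.
 */
static uint16_t mask_defensive(uint16_t last_used_idx)
{
    return last_used_idx & ~(-(1 << VRING_PACKED_EVENT_F_WRAP_CTR));
}
```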
> > + used_wrap_counter = !!(last_used_idx & (1 << VRING_PACKED_EVENT_F_WRAP_CTR));
> > +
> > + flags = le16_to_cpu(svq->vring_packed.vring.desc[last_used].flags);
> > + avail_flag = !!(flags & (1 << VRING_PACKED_DESC_F_AVAIL));
> > + used_flag = !!(flags & (1 << VRING_PACKED_DESC_F_USED));
> > +
> > + return avail_flag == used_flag && used_flag == used_wrap_counter;
> > +}
> > +
>
> Also in the implementation of vhost_svq_more_used_split() [1], I haven't
> understood why the following condition:
>
> svq->last_used_idx != svq->shadow_used_idx
>
> is checked before updating the value of "svq->shadow_used_idx":
>
> svq->shadow_used_idx = le16_to_cpu(*(volatile uint16_t *)used_idx)
>
As far as I know, this avoids touching the used_idx shared with the
device when the cached shadow_used_idx already shows pending entries,
saving the cache-line sharing, the memory barrier, and the potentially
costly volatile access.
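In other words, the cached comparison is a fast path: the shared index is only re-read once the cache is exhausted. A toy model (the struct and helpers are invented for illustration, assuming a little-endian host so le16_to_cpu would be a no-op) showing how few shared-memory reads it takes to drain three used entries:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy model of the SVQ fields involved. */
struct toy_svq {
    uint16_t last_used_idx;   /* next entry QEMU will consume */
    uint16_t shadow_used_idx; /* cached copy of the device index */
    uint16_t device_used_idx; /* stands in for svq->vring.used->idx */
};

static bool toy_more_used(struct toy_svq *svq, int *device_reads)
{
    if (svq->last_used_idx != svq->shadow_used_idx) {
        /* Cached work pending: skip the shared-memory access. */
        return true;
    }
    /* Cache exhausted: re-read the device-written index. */
    (*device_reads)++;
    svq->shadow_used_idx = *(volatile uint16_t *)&svq->device_used_idx;
    return svq->last_used_idx != svq->shadow_used_idx;
}
```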
* Re: [RFC v5 1/7] vhost: Refactor vhost_svq_add_split
2025-03-24 13:59 ` [RFC v5 1/7] vhost: Refactor vhost_svq_add_split Sahil Siddiq
@ 2025-03-26 11:25 ` Eugenio Perez Martin
2025-03-28 5:18 ` Sahil Siddiq
0 siblings, 1 reply; 44+ messages in thread
From: Eugenio Perez Martin @ 2025-03-26 11:25 UTC (permalink / raw)
To: Sahil Siddiq; +Cc: sgarzare, mst, qemu-devel, sahilcdq
On Mon, Mar 24, 2025 at 2:59 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> This commit refactors vhost_svq_add_split and vhost_svq_add to simplify
> their implementation and prepare for the addition of packed vqs in the
> following commits.
>
> Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
> ---
> No changes from v4 -> v5.
>
You can carry the Acked-by from the previous series if you make no
changes (or even small changes).
Acked-by: Eugenio Pérez <eperezma@redhat.com>
> hw/virtio/vhost-shadow-virtqueue.c | 107 +++++++++++------------------
> 1 file changed, 41 insertions(+), 66 deletions(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index 2481d49345..4f74ad402a 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -139,87 +139,48 @@ static bool vhost_svq_translate_addr(const VhostShadowVirtqueue *svq,
> }
>
> /**
> - * Write descriptors to SVQ vring
> + * Write descriptors to SVQ split vring
> *
> * @svq: The shadow virtqueue
> - * @sg: Cache for hwaddr
> - * @iovec: The iovec from the guest
> - * @num: iovec length
> - * @addr: Descriptors' GPAs, if backed by guest memory
> - * @more_descs: True if more descriptors come in the chain
> - * @write: True if they are writeable descriptors
> - *
> - * Return true if success, false otherwise and print error.
> + * @out_sg: The iovec to the guest
> + * @out_num: Outgoing iovec length
> + * @in_sg: The iovec from the guest
> + * @in_num: Incoming iovec length
> + * @sgs: Cache for hwaddr
> + * @head: Saves current free_head
> */
> -static bool vhost_svq_vring_write_descs(VhostShadowVirtqueue *svq, hwaddr *sg,
> - const struct iovec *iovec, size_t num,
> - const hwaddr *addr, bool more_descs,
> - bool write)
> +static void vhost_svq_add_split(VhostShadowVirtqueue *svq,
> + const struct iovec *out_sg, size_t out_num,
> + const struct iovec *in_sg, size_t in_num,
> + hwaddr *sgs, unsigned *head)
> {
> + unsigned avail_idx, n;
> uint16_t i = svq->free_head, last = svq->free_head;
> - unsigned n;
> - uint16_t flags = write ? cpu_to_le16(VRING_DESC_F_WRITE) : 0;
> + vring_avail_t *avail = svq->vring.avail;
> vring_desc_t *descs = svq->vring.desc;
> - bool ok;
> -
> - if (num == 0) {
> - return true;
> - }
> + size_t num = in_num + out_num;
>
> - ok = vhost_svq_translate_addr(svq, sg, iovec, num, addr);
> - if (unlikely(!ok)) {
> - return false;
> - }
> + *head = svq->free_head;
>
> for (n = 0; n < num; n++) {
> - if (more_descs || (n + 1 < num)) {
> - descs[i].flags = flags | cpu_to_le16(VRING_DESC_F_NEXT);
> + descs[i].flags = cpu_to_le16(n < out_num ? 0 : VRING_DESC_F_WRITE);
> + if (n + 1 < num) {
> + descs[i].flags |= cpu_to_le16(VRING_DESC_F_NEXT);
> descs[i].next = cpu_to_le16(svq->desc_next[i]);
> + }
> +
> + descs[i].addr = cpu_to_le64(sgs[n]);
> + if (n < out_num) {
> + descs[i].len = cpu_to_le32(out_sg[n].iov_len);
> } else {
> - descs[i].flags = flags;
> + descs[i].len = cpu_to_le32(in_sg[n - out_num].iov_len);
> }
> - descs[i].addr = cpu_to_le64(sg[n]);
> - descs[i].len = cpu_to_le32(iovec[n].iov_len);
>
> last = i;
> i = svq->desc_next[i];
> }
>
> svq->free_head = svq->desc_next[last];
> - return true;
> -}
> -
> -static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
> - const struct iovec *out_sg, size_t out_num,
> - const hwaddr *out_addr,
> - const struct iovec *in_sg, size_t in_num,
> - const hwaddr *in_addr, unsigned *head)
> -{
> - unsigned avail_idx;
> - vring_avail_t *avail = svq->vring.avail;
> - bool ok;
> - g_autofree hwaddr *sgs = g_new(hwaddr, MAX(out_num, in_num));
> -
> - *head = svq->free_head;
> -
> - /* We need some descriptors here */
> - if (unlikely(!out_num && !in_num)) {
> - qemu_log_mask(LOG_GUEST_ERROR,
> - "Guest provided element with no descriptors");
> - return false;
> - }
> -
> - ok = vhost_svq_vring_write_descs(svq, sgs, out_sg, out_num, out_addr,
> - in_num > 0, false);
> - if (unlikely(!ok)) {
> - return false;
> - }
> -
> - ok = vhost_svq_vring_write_descs(svq, sgs, in_sg, in_num, in_addr, false,
> - true);
> - if (unlikely(!ok)) {
> - return false;
> - }
>
> /*
> * Put the entry in the available array (but don't update avail->idx until
> @@ -233,7 +194,6 @@ static bool vhost_svq_add_split(VhostShadowVirtqueue *svq,
> smp_wmb();
> avail->idx = cpu_to_le16(svq->shadow_avail_idx);
>
> - return true;
> }
>
> static void vhost_svq_kick(VhostShadowVirtqueue *svq)
> @@ -276,16 +236,31 @@ int vhost_svq_add(VhostShadowVirtqueue *svq, const struct iovec *out_sg,
> unsigned ndescs = in_num + out_num;
> bool ok;
>
> + /* We need some descriptors here */
> + if (unlikely(!ndescs)) {
> + qemu_log_mask(LOG_GUEST_ERROR,
> + "Guest provided element with no descriptors");
> + return -EINVAL;
> + }
> +
> if (unlikely(ndescs > vhost_svq_available_slots(svq))) {
> return -ENOSPC;
> }
>
> - ok = vhost_svq_add_split(svq, out_sg, out_num, out_addr, in_sg, in_num,
> - in_addr, &qemu_head);
> + g_autofree hwaddr *sgs = g_new(hwaddr, ndescs);
> + ok = vhost_svq_translate_addr(svq, sgs, out_sg, out_num, out_addr);
> if (unlikely(!ok)) {
> return -EINVAL;
> }
>
> + ok = vhost_svq_translate_addr(svq, sgs + out_num, in_sg, in_num, in_addr);
> + if (unlikely(!ok)) {
> + return -EINVAL;
> + }
> +
> + vhost_svq_add_split(svq, out_sg, out_num, in_sg,
> + in_num, sgs, &qemu_head);
> +
> svq->num_free -= ndescs;
> svq->desc_state[qemu_head].elem = elem;
> svq->desc_state[qemu_head].ndescs = ndescs;
> --
> 2.48.1
>
* Re: [RFC v5 2/7] vhost: Data structure changes to support packed vqs
2025-03-24 13:59 ` [RFC v5 2/7] vhost: Data structure changes to support packed vqs Sahil Siddiq
@ 2025-03-26 11:26 ` Eugenio Perez Martin
2025-03-28 5:17 ` Sahil Siddiq
0 siblings, 1 reply; 44+ messages in thread
From: Eugenio Perez Martin @ 2025-03-26 11:26 UTC (permalink / raw)
To: Sahil Siddiq; +Cc: sgarzare, mst, qemu-devel, sahilcdq
On Mon, Mar 24, 2025 at 3:00 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Introduce "struct vring_packed".
>
> Modify VhostShadowVirtqueue so it can support split and packed virtqueue
> formats.
>
> Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
> ---
> Changes from v4 -> v5:
> - This was commit #3 in v4. This has been reordered to commit #2
> based on review comments.
> - Place shadow_avail_idx, shadow_used_idx, last_used_idx
> above the "shadow vring" union.
>
What is the reason for the member reorder?
> hw/virtio/vhost-shadow-virtqueue.h | 87 +++++++++++++++++++-----------
> 1 file changed, 56 insertions(+), 31 deletions(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> index 9c273739d6..5f7699da9d 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.h
> +++ b/hw/virtio/vhost-shadow-virtqueue.h
> @@ -46,10 +46,65 @@ typedef struct VhostShadowVirtqueueOps {
> VirtQueueAvailCallback avail_handler;
> } VhostShadowVirtqueueOps;
>
> +struct vring_packed {
> + /* Actual memory layout for this queue. */
> + struct {
> + unsigned int num;
> + struct vring_packed_desc *desc;
> + struct vring_packed_desc_event *driver;
> + struct vring_packed_desc_event *device;
> + } vring;
> +
> + /* Avail used flags. */
> + uint16_t avail_used_flags;
> +
> + /* Index of the next avail descriptor. */
> + uint16_t next_avail_idx;
> +
> + /* Driver ring wrap counter */
> + bool avail_wrap_counter;
> +};
> +
> /* Shadow virtqueue to relay notifications */
> typedef struct VhostShadowVirtqueue {
> + /* True if packed virtqueue */
> + bool is_packed;
> +
> + /* Virtio queue shadowing */
> + VirtQueue *vq;
> +
> + /* Virtio device */
> + VirtIODevice *vdev;
> +
> + /* SVQ vring descriptors state */
> + SVQDescState *desc_state;
> +
> + /*
> + * Backup next field for each descriptor so we can recover securely, not
> + * needing to trust the device access.
> + */
> + uint16_t *desc_next;
> +
> + /* Next free descriptor */
> + uint16_t free_head;
> +
> + /* Size of SVQ vring free descriptors */
> + uint16_t num_free;
> +
> + /* Next head to expose to the device */
> + uint16_t shadow_avail_idx;
> +
> + /* Last seen used idx */
> + uint16_t shadow_used_idx;
> +
> + /* Next head to consume from the device */
> + uint16_t last_used_idx;
> +
> /* Shadow vring */
> - struct vring vring;
> + union {
> + struct vring vring;
> + struct vring_packed vring_packed;
> + };
>
> /* Shadow kick notifier, sent to vhost */
> EventNotifier hdev_kick;
> @@ -69,47 +124,17 @@ typedef struct VhostShadowVirtqueue {
> /* Guest's call notifier, where the SVQ calls guest. */
> EventNotifier svq_call;
>
> - /* Virtio queue shadowing */
> - VirtQueue *vq;
> -
> - /* Virtio device */
> - VirtIODevice *vdev;
> -
> /* IOVA mapping */
> VhostIOVATree *iova_tree;
>
> - /* SVQ vring descriptors state */
> - SVQDescState *desc_state;
> -
> /* Next VirtQueue element that guest made available */
> VirtQueueElement *next_guest_avail_elem;
>
> - /*
> - * Backup next field for each descriptor so we can recover securely, not
> - * needing to trust the device access.
> - */
> - uint16_t *desc_next;
> -
> /* Caller callbacks */
> const VhostShadowVirtqueueOps *ops;
>
> /* Caller callbacks opaque */
> void *ops_opaque;
> -
> - /* Next head to expose to the device */
> - uint16_t shadow_avail_idx;
> -
> - /* Next free descriptor */
> - uint16_t free_head;
> -
> - /* Last seen used idx */
> - uint16_t shadow_used_idx;
> -
> - /* Next head to consume from the device */
> - uint16_t last_used_idx;
> -
> - /* Size of SVQ vring free descriptors */
> - uint16_t num_free;
> } VhostShadowVirtqueue;
>
> bool vhost_svq_valid_features(uint64_t features, Error **errp);
> --
> 2.48.1
>
* Re: [RFC v5 3/7] vhost: Forward descriptors to device via packed SVQ
2025-03-24 13:59 ` [RFC v5 3/7] vhost: Forward descriptors to device via packed SVQ Sahil Siddiq
2025-03-24 14:14 ` Sahil Siddiq
@ 2025-03-26 12:02 ` Eugenio Perez Martin
2025-03-28 5:09 ` Sahil Siddiq
1 sibling, 1 reply; 44+ messages in thread
From: Eugenio Perez Martin @ 2025-03-26 12:02 UTC (permalink / raw)
To: Sahil Siddiq; +Cc: sgarzare, mst, qemu-devel, sahilcdq
On Mon, Mar 24, 2025 at 3:00 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Implement the insertion of available buffers in the descriptor area of
> packed shadow virtqueues. It takes into account descriptor chains, but
> does not consider indirect descriptors.
>
> Enable the packed SVQ to forward the descriptors to the device.
>
> Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
> ---
> Changes from v4 -> v5:
> - This was commit #2 in v4. This has been reordered to commit #3
> based on review comments.
> - vhost-shadow-virtqueue.c:
> (vhost_svq_valid_features): Move addition of enums to commit #6
> based on review comments.
> (vhost_svq_add_packed): Set head_idx to buffer id instead of vring's
> index.
> (vhost_svq_kick): Split into vhost_svq_kick_split and
> vhost_svq_kick_packed.
> (vhost_svq_add): Use new vhost_svq_kick_* functions.
>
> hw/virtio/vhost-shadow-virtqueue.c | 117 +++++++++++++++++++++++++++--
> 1 file changed, 112 insertions(+), 5 deletions(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index 4f74ad402a..6e16cd4bdf 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -193,10 +193,83 @@ static void vhost_svq_add_split(VhostShadowVirtqueue *svq,
> /* Update the avail index after write the descriptor */
> smp_wmb();
> avail->idx = cpu_to_le16(svq->shadow_avail_idx);
> +}
> +
> +/**
> + * Write descriptors to SVQ packed vring
> + *
> + * @svq: The shadow virtqueue
> + * @out_sg: The iovec to the guest
> + * @out_num: Outgoing iovec length
> + * @in_sg: The iovec from the guest
> + * @in_num: Incoming iovec length
> + * @sgs: Cache for hwaddr
> + * @head: Saves current free_head
> + */
> +static void vhost_svq_add_packed(VhostShadowVirtqueue *svq,
> + const struct iovec *out_sg, size_t out_num,
> + const struct iovec *in_sg, size_t in_num,
> + hwaddr *sgs, unsigned *head)
> +{
> + uint16_t id, curr, i, head_flags = 0, head_idx;
> + size_t num = out_num + in_num;
> + unsigned n;
> +
> + struct vring_packed_desc *descs = svq->vring_packed.vring.desc;
> +
> + head_idx = svq->vring_packed.next_avail_idx;
> + i = head_idx;
> + id = svq->free_head;
> + curr = id;
> + *head = id;
> +
> + /* Write descriptors to SVQ packed vring */
> + for (n = 0; n < num; n++) {
> + uint16_t flags = cpu_to_le16(svq->vring_packed.avail_used_flags |
> + (n < out_num ? 0 : VRING_DESC_F_WRITE) |
> + (n + 1 == num ? 0 : VRING_DESC_F_NEXT));
> + if (i == head_idx) {
> + head_flags = flags;
> + } else {
> + descs[i].flags = flags;
> + }
> +
> + descs[i].addr = cpu_to_le64(sgs[n]);
> + descs[i].id = id;
> + if (n < out_num) {
> + descs[i].len = cpu_to_le32(out_sg[n].iov_len);
> + } else {
> + descs[i].len = cpu_to_le32(in_sg[n - out_num].iov_len);
> + }
> +
> + curr = cpu_to_le16(svq->desc_next[curr]);
> +
> + if (++i >= svq->vring_packed.vring.num) {
> + i = 0;
> + svq->vring_packed.avail_used_flags ^=
> + 1 << VRING_PACKED_DESC_F_AVAIL |
> + 1 << VRING_PACKED_DESC_F_USED;
> + }
> + }
>
> + if (i <= head_idx) {
> + svq->vring_packed.avail_wrap_counter ^= 1;
> + }
> +
> + svq->vring_packed.next_avail_idx = i;
> + svq->shadow_avail_idx = i;
> + svq->free_head = curr;
> +
> + /*
> + * A driver MUST NOT make the first descriptor in the list
> + * available before all subsequent descriptors comprising
> + * the list are made available.
> + */
> + smp_wmb();
> + svq->vring_packed.vring.desc[head_idx].flags = head_flags;
> }
>
> -static void vhost_svq_kick(VhostShadowVirtqueue *svq)
> +static void vhost_svq_kick_split(VhostShadowVirtqueue *svq)
> {
> bool needs_kick;
>
> @@ -209,7 +282,8 @@ static void vhost_svq_kick(VhostShadowVirtqueue *svq)
> if (virtio_vdev_has_feature(svq->vdev, VIRTIO_RING_F_EVENT_IDX)) {
> uint16_t avail_event = le16_to_cpu(
> *(uint16_t *)(&svq->vring.used->ring[svq->vring.num]));
> - needs_kick = vring_need_event(avail_event, svq->shadow_avail_idx, svq->shadow_avail_idx - 1);
> + needs_kick = vring_need_event(avail_event, svq->shadow_avail_idx,
> + svq->shadow_avail_idx - 1);
> } else {
> needs_kick =
> !(svq->vring.used->flags & cpu_to_le16(VRING_USED_F_NO_NOTIFY));
> @@ -222,6 +296,30 @@ static void vhost_svq_kick(VhostShadowVirtqueue *svq)
> event_notifier_set(&svq->hdev_kick);
> }
>
> +static void vhost_svq_kick_packed(VhostShadowVirtqueue *svq)
> +{
> + bool needs_kick;
> +
> + /*
> + * We need to expose the available array entries before checking
> + * notification suppressions.
> + */
> + smp_mb();
> +
> + if (virtio_vdev_has_feature(svq->vdev, VIRTIO_RING_F_EVENT_IDX)) {
> + return;
It's weird that SVQ never needs to kick when _F_EVENT_IDX is
negotiated. This branch should check the device ring flags etc.
instead of returning unconditionally.
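For reference, a hedged sketch of what such a check could look like, loosely modeled on the Linux driver's virtqueue_kick_prepare_packed(); all names, parameters, and the exact wrap adjustment here are assumptions for illustration, not the QEMU implementation:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VRING_PACKED_EVENT_FLAG_ENABLE  0x0
#define VRING_PACKED_EVENT_FLAG_DISABLE 0x1
#define VRING_PACKED_EVENT_FLAG_DESC    0x2
#define VRING_PACKED_EVENT_F_WRAP_CTR   15

/* vring_need_event() as defined by the virtio spec. */
static bool vring_need_event(uint16_t event_idx, uint16_t new_idx,
                             uint16_t old)
{
    return (uint16_t)(new_idx - event_idx - 1) < (uint16_t)(new_idx - old);
}

/*
 * Hypothetical kick check for a packed ring with EVENT_IDX negotiated.
 * device_flags/off_wrap stand in for a snapshot of the device event
 * suppression area; new_idx/old_idx are descriptor ring indices and
 * wrap_counter is the driver's avail wrap counter.
 */
static bool packed_needs_kick(uint16_t device_flags, uint16_t off_wrap,
                              uint16_t new_idx, uint16_t old_idx,
                              bool wrap_counter, uint16_t ring_size)
{
    if (device_flags == VRING_PACKED_EVENT_FLAG_DESC) {
        uint16_t event_idx =
            off_wrap & ~(1 << VRING_PACKED_EVENT_F_WRAP_CTR);
        bool event_wrap = off_wrap >> VRING_PACKED_EVENT_F_WRAP_CTR;

        /* Translate the event index into the same space as new/old. */
        if (event_wrap != wrap_counter) {
            event_idx -= ring_size;
        }
        return vring_need_event(event_idx, new_idx, old_idx);
    }
    return device_flags != VRING_PACKED_EVENT_FLAG_DISABLE;
}
```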
> + } else {
> + needs_kick = (svq->vring_packed.vring.device->flags !=
> + cpu_to_le16(VRING_PACKED_EVENT_FLAG_DISABLE));
> + }
> +
> + if (!needs_kick) {
> + return;
> + }
> +
> + event_notifier_set(&svq->hdev_kick);
> +}
> +
> /**
> * Add an element to a SVQ.
> *
> @@ -258,13 +356,22 @@ int vhost_svq_add(VhostShadowVirtqueue *svq, const struct iovec *out_sg,
> return -EINVAL;
> }
>
> - vhost_svq_add_split(svq, out_sg, out_num, in_sg,
> - in_num, sgs, &qemu_head);
> + if (svq->is_packed) {
> + vhost_svq_add_packed(svq, out_sg, out_num, in_sg,
> + in_num, sgs, &qemu_head);
> + } else {
> + vhost_svq_add_split(svq, out_sg, out_num, in_sg,
> + in_num, sgs, &qemu_head);
> + }
>
> svq->num_free -= ndescs;
> svq->desc_state[qemu_head].elem = elem;
> svq->desc_state[qemu_head].ndescs = ndescs;
> - vhost_svq_kick(svq);
> + if (svq->is_packed) {
> + vhost_svq_kick_packed(svq);
> + } else {
> + vhost_svq_kick_split(svq);
> + }
> return 0;
> }
>
> --
> 2.48.1
>
* Re: [RFC v5 4/7] vdpa: Allocate memory for SVQ and map them to vdpa
2025-03-24 13:59 ` [RFC v5 4/7] vdpa: Allocate memory for SVQ and map them to vdpa Sahil Siddiq
@ 2025-03-26 12:05 ` Eugenio Perez Martin
0 siblings, 0 replies; 44+ messages in thread
From: Eugenio Perez Martin @ 2025-03-26 12:05 UTC (permalink / raw)
To: Sahil Siddiq; +Cc: sgarzare, mst, qemu-devel, sahilcdq
On Mon, Mar 24, 2025 at 3:00 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Allocate memory for the packed vq format and map it to the vdpa device.
>
> Since "struct vring" and "struct vring_packed's vring" both have the same
> memory layout, the implementation in SVQ start and SVQ stop should not
> differ based on the vq's format.
>
> Also initialize flags, counters and indices for packed vqs before they
> are utilized.
>
> Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
Acked-by: Eugenio Pérez <eperezma@redhat.com>
> ---
> Changes from v4 -> v5:
> - vhost-shadow-virtqueue.c:
> (vhost_svq_start): Initialize variables used by packed vring.
>
> hw/virtio/vhost-shadow-virtqueue.c | 52 +++++++++++++++++++++---------
> hw/virtio/vhost-shadow-virtqueue.h | 1 +
> hw/virtio/vhost-vdpa.c | 37 +++++++++++++++++----
> 3 files changed, 69 insertions(+), 21 deletions(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index 6e16cd4bdf..126957231d 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -707,19 +707,33 @@ void vhost_svq_get_vring_addr(const VhostShadowVirtqueue *svq,
> addr->used_user_addr = (uint64_t)(uintptr_t)svq->vring.used;
> }
>
> -size_t vhost_svq_driver_area_size(const VhostShadowVirtqueue *svq)
> +size_t vhost_svq_descriptor_area_size(const VhostShadowVirtqueue *svq)
> {
> size_t desc_size = sizeof(vring_desc_t) * svq->vring.num;
> - size_t avail_size = offsetof(vring_avail_t, ring[svq->vring.num]) +
> - sizeof(uint16_t);
> + return ROUND_UP(desc_size, qemu_real_host_page_size());
> +}
>
> - return ROUND_UP(desc_size + avail_size, qemu_real_host_page_size());
> +size_t vhost_svq_driver_area_size(const VhostShadowVirtqueue *svq)
> +{
> + size_t avail_size;
> + if (svq->is_packed) {
> + avail_size = sizeof(uint32_t);
> + } else {
> + avail_size = offsetof(vring_avail_t, ring[svq->vring.num]) +
> + sizeof(uint16_t);
> + }
> + return ROUND_UP(avail_size, qemu_real_host_page_size());
> }
>
> size_t vhost_svq_device_area_size(const VhostShadowVirtqueue *svq)
> {
> - size_t used_size = offsetof(vring_used_t, ring[svq->vring.num]) +
> - sizeof(uint16_t);
> + size_t used_size;
> + if (svq->is_packed) {
> + used_size = sizeof(uint32_t);
> + } else {
> + used_size = offsetof(vring_used_t, ring[svq->vring.num]) +
> + sizeof(uint16_t);
> + }
> return ROUND_UP(used_size, qemu_real_host_page_size());
> }
>
> @@ -764,8 +778,6 @@ void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd)
> void vhost_svq_start(VhostShadowVirtqueue *svq, VirtIODevice *vdev,
> VirtQueue *vq, VhostIOVATree *iova_tree)
> {
> - size_t desc_size;
> -
> event_notifier_set_handler(&svq->hdev_call, vhost_svq_handle_call);
> svq->next_guest_avail_elem = NULL;
> svq->shadow_avail_idx = 0;
> @@ -774,20 +786,29 @@ void vhost_svq_start(VhostShadowVirtqueue *svq, VirtIODevice *vdev,
> svq->vdev = vdev;
> svq->vq = vq;
> svq->iova_tree = iova_tree;
> + svq->is_packed = virtio_vdev_has_feature(svq->vdev, VIRTIO_F_RING_PACKED);
> +
> + if (svq->is_packed) {
> + svq->vring_packed.avail_wrap_counter = 1;
> + svq->vring_packed.next_avail_idx = 0;
> + svq->vring_packed.avail_used_flags = 1 << VRING_PACKED_DESC_F_AVAIL;
> + svq->last_used_idx = 0 | (1 << VRING_PACKED_EVENT_F_WRAP_CTR);
> + }
>
> svq->vring.num = virtio_queue_get_num(vdev, virtio_get_queue_index(vq));
> svq->num_free = svq->vring.num;
> - svq->vring.desc = mmap(NULL, vhost_svq_driver_area_size(svq),
> + svq->vring.desc = mmap(NULL, vhost_svq_descriptor_area_size(svq),
> PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS,
> -1, 0);
> - desc_size = sizeof(vring_desc_t) * svq->vring.num;
> - svq->vring.avail = (void *)((char *)svq->vring.desc + desc_size);
> + svq->vring.avail = mmap(NULL, vhost_svq_driver_area_size(svq),
> + PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS,
> + -1, 0);
> svq->vring.used = mmap(NULL, vhost_svq_device_area_size(svq),
> PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS,
> -1, 0);
> - svq->desc_state = g_new0(SVQDescState, svq->vring.num);
> - svq->desc_next = g_new0(uint16_t, svq->vring.num);
> - for (unsigned i = 0; i < svq->vring.num - 1; i++) {
> + svq->desc_state = g_new0(SVQDescState, svq->num_free);
> + svq->desc_next = g_new0(uint16_t, svq->num_free);
> + for (unsigned i = 0; i < svq->num_free - 1; i++) {
> svq->desc_next[i] = i + 1;
> }
> }
> @@ -827,7 +848,8 @@ void vhost_svq_stop(VhostShadowVirtqueue *svq)
> svq->vq = NULL;
> g_free(svq->desc_next);
> g_free(svq->desc_state);
> - munmap(svq->vring.desc, vhost_svq_driver_area_size(svq));
> + munmap(svq->vring.desc, vhost_svq_descriptor_area_size(svq));
> + munmap(svq->vring.avail, vhost_svq_driver_area_size(svq));
> munmap(svq->vring.used, vhost_svq_device_area_size(svq));
> event_notifier_set_handler(&svq->hdev_call, NULL);
> }
> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> index 5f7699da9d..12c6ea8be2 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.h
> +++ b/hw/virtio/vhost-shadow-virtqueue.h
> @@ -152,6 +152,7 @@ void vhost_svq_set_svq_kick_fd(VhostShadowVirtqueue *svq, int svq_kick_fd);
> void vhost_svq_set_svq_call_fd(VhostShadowVirtqueue *svq, int call_fd);
> void vhost_svq_get_vring_addr(const VhostShadowVirtqueue *svq,
> struct vhost_vring_addr *addr);
> +size_t vhost_svq_descriptor_area_size(const VhostShadowVirtqueue *svq);
> size_t vhost_svq_driver_area_size(const VhostShadowVirtqueue *svq);
> size_t vhost_svq_device_area_size(const VhostShadowVirtqueue *svq);
>
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 7efbde3d4c..58c8931d89 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -1137,6 +1137,8 @@ static void vhost_vdpa_svq_unmap_rings(struct vhost_dev *dev,
>
> vhost_vdpa_svq_unmap_ring(v, svq_addr.desc_user_addr);
>
> + vhost_vdpa_svq_unmap_ring(v, svq_addr.avail_user_addr);
> +
> vhost_vdpa_svq_unmap_ring(v, svq_addr.used_user_addr);
> }
>
> @@ -1191,38 +1193,61 @@ static bool vhost_vdpa_svq_map_rings(struct vhost_dev *dev,
> Error **errp)
> {
> ERRP_GUARD();
> - DMAMap device_region, driver_region;
> + DMAMap descriptor_region, device_region, driver_region;
> struct vhost_vring_addr svq_addr;
> struct vhost_vdpa *v = dev->opaque;
> + size_t descriptor_size = vhost_svq_descriptor_area_size(svq);
> size_t device_size = vhost_svq_device_area_size(svq);
> size_t driver_size = vhost_svq_driver_area_size(svq);
> - size_t avail_offset;
> bool ok;
>
> vhost_svq_get_vring_addr(svq, &svq_addr);
>
> + descriptor_region = (DMAMap) {
> + .translated_addr = svq_addr.desc_user_addr,
> + .size = descriptor_size - 1,
> + .perm = IOMMU_RO,
> + };
> + if (svq->is_packed) {
> + descriptor_region.perm = IOMMU_RW;
> + }
> +
> + ok = vhost_vdpa_svq_map_ring(v, &descriptor_region, svq_addr.desc_user_addr,
> + errp);
> + if (unlikely(!ok)) {
> + error_prepend(errp, "Cannot create vq descriptor region: ");
> + return false;
> + }
> + addr->desc_user_addr = descriptor_region.iova;
> +
> driver_region = (DMAMap) {
> + .translated_addr = svq_addr.avail_user_addr,
> .size = driver_size - 1,
> .perm = IOMMU_RO,
> };
> - ok = vhost_vdpa_svq_map_ring(v, &driver_region, svq_addr.desc_user_addr,
> + ok = vhost_vdpa_svq_map_ring(v, &driver_region, svq_addr.avail_user_addr,
> errp);
> if (unlikely(!ok)) {
> error_prepend(errp, "Cannot create vq driver region: ");
> + vhost_vdpa_svq_unmap_ring(v, descriptor_region.translated_addr);
> return false;
> }
> - addr->desc_user_addr = driver_region.iova;
> - avail_offset = svq_addr.avail_user_addr - svq_addr.desc_user_addr;
> - addr->avail_user_addr = driver_region.iova + avail_offset;
> + addr->avail_user_addr = driver_region.iova;
>
> device_region = (DMAMap) {
> + .translated_addr = svq_addr.used_user_addr,
> .size = device_size - 1,
> .perm = IOMMU_RW,
> };
> + if (svq->is_packed) {
> + device_region.perm = IOMMU_WO;
> + }
> +
> ok = vhost_vdpa_svq_map_ring(v, &device_region, svq_addr.used_user_addr,
> errp);
> if (unlikely(!ok)) {
> error_prepend(errp, "Cannot create vq device region: ");
> + vhost_vdpa_svq_unmap_ring(v, descriptor_region.translated_addr);
> vhost_vdpa_svq_unmap_ring(v, driver_region.translated_addr);
> }
> addr->used_user_addr = device_region.iova;
> --
> 2.48.1
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC v5 6/7] vhost: Validate transport device features for packed vqs
2025-03-24 13:59 ` [RFC v5 6/7] vhost: Validate transport device features for " Sahil Siddiq
@ 2025-03-26 12:06 ` Eugenio Perez Martin
2025-03-28 5:33 ` Sahil Siddiq
0 siblings, 1 reply; 44+ messages in thread
From: Eugenio Perez Martin @ 2025-03-26 12:06 UTC (permalink / raw)
To: Sahil Siddiq; +Cc: sgarzare, mst, qemu-devel, sahilcdq
On Mon, Mar 24, 2025 at 3:00 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Validate the transport device features required for utilizing packed SVQs:
> the features that guests can use with the SVQ and that SVQs can use with vdpa.
>
> Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
> ---
> Changes from v4 -> v5:
> - Split from commit #2 in v4.
>
> hw/virtio/vhost-shadow-virtqueue.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index 8430b3c94a..035ab1e66f 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -33,6 +33,9 @@ bool vhost_svq_valid_features(uint64_t features, Error **errp)
> ++b) {
> switch (b) {
> case VIRTIO_F_ANY_LAYOUT:
> + case VIRTIO_F_RING_PACKED:
> + case VIRTIO_F_RING_RESET:
> + case VIRTIO_RING_F_INDIRECT_DESC:
This should only enable _F_RING_PACKED, there is no code supporting
either reset or indirect descriptors.
> case VIRTIO_RING_F_EVENT_IDX:
> continue;
>
> --
> 2.48.1
>
* Re: [RFC v5 7/7] vdpa: Support setting vring_base for packed SVQ
2025-03-24 13:59 ` [RFC v5 7/7] vdpa: Support setting vring_base for packed SVQ Sahil Siddiq
@ 2025-03-26 12:08 ` Eugenio Perez Martin
2025-03-27 18:44 ` Sahil Siddiq
0 siblings, 1 reply; 44+ messages in thread
From: Eugenio Perez Martin @ 2025-03-26 12:08 UTC (permalink / raw)
To: Sahil Siddiq; +Cc: sgarzare, mst, qemu-devel, sahilcdq
On Mon, Mar 24, 2025 at 3:00 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> This commit is the first in a series to add support for packed
> virtqueues in vhost_shadow_virtqueue.
>
> Linux commit 1225c216d954 ("vp_vdpa: allow set vq state to initial
> state after reset") enabled the vp_vdpa driver to set the vq state to
> the device's initial state. This works differently for split and packed
> vqs.
>
> With shadow virtqueues enabled, vhost-vdpa sets the vring base using
> the VHOST_SET_VRING_BASE ioctl. The payload (vhost_vring_state)
> differs for split and packed vqs. The implementation in QEMU currently
> uses the payload required for split vqs (i.e., the num field of
> vhost_vring_state is set to 0). The kernel throws EOPNOTSUPP when this
> payload is used with packed vqs.
>
> This patch sets the num field in the payload appropriately so vhost-vdpa
> (with the vp_vdpa driver) can use packed SVQs.
>
> Link: https://lists.nongnu.org/archive/html/qemu-devel/2024-10/msg05106.html
> Link: https://lore.kernel.org/r/20210602021536.39525-4-jasowang@redhat.com
> Link: 1225c216d954 ("vp_vdpa: allow set vq state to initial state after reset")
> Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
> Acked-by: Eugenio Pérez <eperezma@redhat.com>
> ---
> Changes from v4 -> v5:
> - Initially commit #5 in v4.
> - Fix coding style of commit block as stated by checkpatch.pl.
>
> hw/virtio/vhost-vdpa.c | 15 +++++++++++++++
> 1 file changed, 15 insertions(+)
>
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 58c8931d89..0625e349b3 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -1265,6 +1265,21 @@ static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,
> };
> int r;
>
> + /*
> + * In Linux, the upper 16 bits of s.num is encoded as
> + * the last used idx while the lower 16 bits is encoded
> + * as the last avail idx when using packed vqs. The most
> + * significant bit for each idx represents the counter
> + * and should be set in both cases while the remaining
> + * bits are cleared.
> + */
> + if (virtio_vdev_has_feature(dev->vdev, VIRTIO_F_RING_PACKED)) {
> + uint32_t last_avail_idx = 0 | (1 << VRING_PACKED_EVENT_F_WRAP_CTR);
> + uint32_t last_used_idx = 0 | (1 << VRING_PACKED_EVENT_F_WRAP_CTR);
> +
> + s.num = (last_used_idx << 16) | last_avail_idx;
> + }
> +
This should be added before 6/7 so we don't declare we support packed
without this.
> r = vhost_vdpa_set_dev_vring_base(dev, &s);
> if (unlikely(r)) {
> error_setg_errno(errp, -r, "Cannot set vring base");
> --
> 2.48.1
>
* Re: [RFC v5 3/7] vhost: Forward descriptors to device via packed SVQ
2025-03-26 8:03 ` Eugenio Perez Martin
@ 2025-03-27 18:42 ` Sahil Siddiq
2025-03-28 7:51 ` Eugenio Perez Martin
0 siblings, 1 reply; 44+ messages in thread
From: Sahil Siddiq @ 2025-03-27 18:42 UTC (permalink / raw)
To: Eugenio Perez Martin; +Cc: sgarzare, mst, qemu-devel, sahilcdq
Hi,
On 3/26/25 1:33 PM, Eugenio Perez Martin wrote:
> On Mon, Mar 24, 2025 at 3:14 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>> On 3/24/25 7:29 PM, Sahil Siddiq wrote:
>>> Implement the insertion of available buffers in the descriptor area of
>>> packed shadow virtqueues. It takes into account descriptor chains, but
>>> does not consider indirect descriptors.
>>>
>>> Enable the packed SVQ to forward the descriptors to the device.
>>>
>>> Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
>>> ---
>>> Changes from v4 -> v5:
>>> - This was commit #2 in v4. This has been reordered to commit #3
>>> based on review comments.
>>> - vhost-shadow-virtqueue.c:
>>> (vhost_svq_valid_features): Move addition of enums to commit #6
>>> based on review comments.
>>> (vhost_svq_add_packed): Set head_idx to buffer id instead of vring's
>>> index.
>>> (vhost_svq_kick): Split into vhost_svq_kick_split and
>>> vhost_svq_kick_packed.
>>> (vhost_svq_add): Use new vhost_svq_kick_* functions.
>>>
>>> hw/virtio/vhost-shadow-virtqueue.c | 117 +++++++++++++++++++++++++++--
>>> 1 file changed, 112 insertions(+), 5 deletions(-)
>>>
>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
>>> index 4f74ad402a..6e16cd4bdf 100644
>>> --- a/hw/virtio/vhost-shadow-virtqueue.c
>>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
>>> @@ -193,10 +193,83 @@ static void vhost_svq_add_split(VhostShadowVirtqueue *svq,
>>> /* Update the avail index after write the descriptor */
>>> smp_wmb();
>>> avail->idx = cpu_to_le16(svq->shadow_avail_idx);
>>> +}
>>> +
>>> +/**
>>> + * Write descriptors to SVQ packed vring
>>> + *
>>> + * @svq: The shadow virtqueue
>>> + * @out_sg: The iovec to the guest
>>> + * @out_num: Outgoing iovec length
>>> + * @in_sg: The iovec from the guest
>>> + * @in_num: Incoming iovec length
>>> + * @sgs: Cache for hwaddr
>>> + * @head: Saves current free_head
>>> + */
>>> +static void vhost_svq_add_packed(VhostShadowVirtqueue *svq,
>>> + const struct iovec *out_sg, size_t out_num,
>>> + const struct iovec *in_sg, size_t in_num,
>>> + hwaddr *sgs, unsigned *head)
>>> +{
>>> + uint16_t id, curr, i, head_flags = 0, head_idx;
>>> + size_t num = out_num + in_num;
>>> + unsigned n;
>>> +
>>> + struct vring_packed_desc *descs = svq->vring_packed.vring.desc;
>>> +
>>> + head_idx = svq->vring_packed.next_avail_idx;
>>
>> Since "svq->vring_packed.next_avail_idx" is part of QEMU internals and not
>> stored in guest memory, no endianness conversion is required here, right?
>>
>
> Right!
Understood.
>>> + i = head_idx;
>>> + id = svq->free_head;
>>> + curr = id;
>>> + *head = id;
>>
>> Should head be the buffer id or the idx of the descriptor ring where the
>> first descriptor of a descriptor chain is inserted?
>>
>
> The buffer id of the *last* descriptor of a chain. See "2.8.6 Next
> Flag: Descriptor Chaining" at [1].
Ah, yes. The second half of my question is incorrect.
The tail descriptor of the chain includes the buffer id. In this implementation
we place the same tail buffer id in other locations of the descriptor ring since
they will be ignored anyway [1].
The explanation below frames my query better.
>>> + /* Write descriptors to SVQ packed vring */
>>> + for (n = 0; n < num; n++) {
>>> + uint16_t flags = cpu_to_le16(svq->vring_packed.avail_used_flags |
>>> + (n < out_num ? 0 : VRING_DESC_F_WRITE) |
>>> + (n + 1 == num ? 0 : VRING_DESC_F_NEXT));
>>> + if (i == head_idx) {
>>> + head_flags = flags;
>>> + } else {
>>> + descs[i].flags = flags;
>>> + }
>>> +
>>> + descs[i].addr = cpu_to_le64(sgs[n]);
>>> + descs[i].id = id;
>>> + if (n < out_num) {
>>> + descs[i].len = cpu_to_le32(out_sg[n].iov_len);
>>> + } else {
>>> + descs[i].len = cpu_to_le32(in_sg[n - out_num].iov_len);
>>> + }
>>> +
>>> + curr = cpu_to_le16(svq->desc_next[curr]);
>>> +
>>> + if (++i >= svq->vring_packed.vring.num) {
>>> + i = 0;
>>> + svq->vring_packed.avail_used_flags ^=
>>> + 1 << VRING_PACKED_DESC_F_AVAIL |
>>> + 1 << VRING_PACKED_DESC_F_USED;
>>> + }
>>> + }
>>>
>>> + if (i <= head_idx) {
>>> + svq->vring_packed.avail_wrap_counter ^= 1;
>>> + }
>>> +
>>> + svq->vring_packed.next_avail_idx = i;
>>> + svq->shadow_avail_idx = i;
>>> + svq->free_head = curr;
>>> +
>>> + /*
>>> + * A driver MUST NOT make the first descriptor in the list
>>> + * available before all subsequent descriptors comprising
>>> + * the list are made available.
>>> + */
>>> + smp_wmb();
>>> + svq->vring_packed.vring.desc[head_idx].flags = head_flags;
>>> }
>>>
>>> -static void vhost_svq_kick(VhostShadowVirtqueue *svq)
>>> +static void vhost_svq_kick_split(VhostShadowVirtqueue *svq)
>>> {
>>> bool needs_kick;
>>>
>>> @@ -209,7 +282,8 @@ static void vhost_svq_kick(VhostShadowVirtqueue *svq)
>>> if (virtio_vdev_has_feature(svq->vdev, VIRTIO_RING_F_EVENT_IDX)) {
>>> uint16_t avail_event = le16_to_cpu(
>>> *(uint16_t *)(&svq->vring.used->ring[svq->vring.num]));
>>> - needs_kick = vring_need_event(avail_event, svq->shadow_avail_idx, svq->shadow_avail_idx - 1);
>>> + needs_kick = vring_need_event(avail_event, svq->shadow_avail_idx,
>>> + svq->shadow_avail_idx - 1);
>>> } else {
>>> needs_kick =
>>> !(svq->vring.used->flags & cpu_to_le16(VRING_USED_F_NO_NOTIFY));
>>> @@ -222,6 +296,30 @@ static void vhost_svq_kick(VhostShadowVirtqueue *svq)
>>> event_notifier_set(&svq->hdev_kick);
>>> }
>>>
>>> +static void vhost_svq_kick_packed(VhostShadowVirtqueue *svq)
>>> +{
>>> + bool needs_kick;
>>> +
>>> + /*
>>> + * We need to expose the available array entries before checking
>>> + * notification suppressions.
>>> + */
>>> + smp_mb();
>>> +
>>> + if (virtio_vdev_has_feature(svq->vdev, VIRTIO_RING_F_EVENT_IDX)) {
>>> + return;
>>> + } else {
>>> + needs_kick = (svq->vring_packed.vring.device->flags !=
>>> + cpu_to_le16(VRING_PACKED_EVENT_FLAG_DISABLE));
>>> + }
>>> +
>>> + if (!needs_kick) {
>>> + return;
>>> + }
>>> +
>>> + event_notifier_set(&svq->hdev_kick);
>>> +}
>>> +
>>> /**
>>> * Add an element to a SVQ.
>>> *
>>> @@ -258,13 +356,22 @@ int vhost_svq_add(VhostShadowVirtqueue *svq, const struct iovec *out_sg,
>>> return -EINVAL;
>>> }
>>>
>>> - vhost_svq_add_split(svq, out_sg, out_num, in_sg,
>>> - in_num, sgs, &qemu_head);
>>> + if (svq->is_packed) {
>>> + vhost_svq_add_packed(svq, out_sg, out_num, in_sg,
>>> + in_num, sgs, &qemu_head);
>>> + } else {
>>> + vhost_svq_add_split(svq, out_sg, out_num, in_sg,
>>> + in_num, sgs, &qemu_head);
>>> + }
>>>
>>> svq->num_free -= ndescs;
>>> svq->desc_state[qemu_head].elem = elem;
>>> svq->desc_state[qemu_head].ndescs = ndescs;
>>
>> *head in vhost_svq_add_packed() is stored in "qemu_head" here.
>>
>
> Sorry I don't get this, can you expand?
Sure. In vhost_svq_add(), after the descriptors have been added
(either using vhost_svq_add_split or vhost_svq_add_packed),
VirtQueueElement elem and ndescs are both saved in the
svq->desc_state array. "elem" and "ndescs" are later used when
the guest consumes used descriptors from the device in
vhost_svq_get_buf_(split|packed).
For split vqs, the index of svq->desc where elem and ndescs are
saved matches the index of the descriptor ring where the head of
the descriptor ring is placed.
In vhost_svq_add_split:
*head = svq->free_head;
[...]
avail_idx = svq->shadow_avail_idx & (svq->vring.num - 1);
avail->ring[avail_idx] = cpu_to_le16(*head);
"qemu_head" in vhost_svq_add gets its value from "*head" in
vhost_svq_add_split:
svq->desc_state[qemu_head].elem = elem;
svq->desc_state[qemu_head].ndescs = ndescs;
For packed vq, something similar has to be done. My approach was
to have the index of svq->desc_state match the buffer id in the
tail of the descriptor ring.
The entire chain is written to the descriptor ring in the loop
in vhost_svq_add_packed. I am not sure if the index of
svq->desc_state should be the buffer id or if it should be a
descriptor index ("head_idx" or the index corresponding to the
tail of the chain).
Thanks,
Sahil
[1] https://lists.nongnu.org/archive/html/qemu-devel/2024-06/msg03512.html
* Re: [RFC v5 7/7] vdpa: Support setting vring_base for packed SVQ
2025-03-26 12:08 ` Eugenio Perez Martin
@ 2025-03-27 18:44 ` Sahil Siddiq
0 siblings, 0 replies; 44+ messages in thread
From: Sahil Siddiq @ 2025-03-27 18:44 UTC (permalink / raw)
To: Eugenio Perez Martin; +Cc: sgarzare, mst, qemu-devel, sahilcdq
Hi,
On 3/26/25 5:38 PM, Eugenio Perez Martin wrote:
> On Mon, Mar 24, 2025 at 3:00 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>> [...]
>> Link: https://lists.nongnu.org/archive/html/qemu-devel/2024-10/msg05106.html
>> Link: https://lore.kernel.org/r/20210602021536.39525-4-jasowang@redhat.com
>> Link: 1225c216d954 ("vp_vdpa: allow set vq state to initial state after reset")
>> Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
>> Acked-by: Eugenio Pérez <eperezma@redhat.com>
>> ---
>> Changes from v4 -> v5:
>> - Initially commit #5 in v4.
>> - Fix coding style of commit block as stated by checkpatch.pl.
>>
>> hw/virtio/vhost-vdpa.c | 15 +++++++++++++++
>> 1 file changed, 15 insertions(+)
>>
>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>> index 58c8931d89..0625e349b3 100644
>> --- a/hw/virtio/vhost-vdpa.c
>> +++ b/hw/virtio/vhost-vdpa.c
>> @@ -1265,6 +1265,21 @@ static bool vhost_vdpa_svq_setup(struct vhost_dev *dev,
>> };
>> int r;
>>
>> + /*
>> + * In Linux, the upper 16 bits of s.num is encoded as
>> + * the last used idx while the lower 16 bits is encoded
>> + * as the last avail idx when using packed vqs. The most
>> + * significant bit for each idx represents the counter
>> + * and should be set in both cases while the remaining
>> + * bits are cleared.
>> + */
>> + if (virtio_vdev_has_feature(dev->vdev, VIRTIO_F_RING_PACKED)) {
>> + uint32_t last_avail_idx = 0 | (1 << VRING_PACKED_EVENT_F_WRAP_CTR);
>> + uint32_t last_used_idx = 0 | (1 << VRING_PACKED_EVENT_F_WRAP_CTR);
>> +
>> + s.num = (last_used_idx << 16) | last_avail_idx;
>> + }
>> +
>
> This should be added before 6/7 so we don't declare we support packed
> without this.
Sure, I'll change the ordering in the next patch series.
Thanks,
Sahil
* Re: [RFC v5 3/7] vhost: Forward descriptors to device via packed SVQ
2025-03-26 12:02 ` Eugenio Perez Martin
@ 2025-03-28 5:09 ` Sahil Siddiq
2025-03-28 6:42 ` Eugenio Perez Martin
0 siblings, 1 reply; 44+ messages in thread
From: Sahil Siddiq @ 2025-03-28 5:09 UTC (permalink / raw)
To: Eugenio Perez Martin; +Cc: sgarzare, mst, qemu-devel, sahilcdq
Hi,
On 3/26/25 5:32 PM, Eugenio Perez Martin wrote:
> On Mon, Mar 24, 2025 at 3:00 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>
>> Implement the insertion of available buffers in the descriptor area of
>> packed shadow virtqueues. It takes into account descriptor chains, but
>> does not consider indirect descriptors.
>>
>> Enable the packed SVQ to forward the descriptors to the device.
>>
>> Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
>> ---
>> Changes from v4 -> v5:
>> - This was commit #2 in v4. This has been reordered to commit #3
>> based on review comments.
>> - vhost-shadow-virtqueue.c:
>> (vhost_svq_valid_features): Move addition of enums to commit #6
>> based on review comments.
>> (vhost_svq_add_packed): Set head_idx to buffer id instead of vring's
>> index.
>> (vhost_svq_kick): Split into vhost_svq_kick_split and
>> vhost_svq_kick_packed.
>> (vhost_svq_add): Use new vhost_svq_kick_* functions.
>>
>> hw/virtio/vhost-shadow-virtqueue.c | 117 +++++++++++++++++++++++++++--
>> 1 file changed, 112 insertions(+), 5 deletions(-)
>>
>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
>> index 4f74ad402a..6e16cd4bdf 100644
>> --- a/hw/virtio/vhost-shadow-virtqueue.c
>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
>> @@ -193,10 +193,83 @@ static void vhost_svq_add_split(VhostShadowVirtqueue *svq,
>> /* Update the avail index after write the descriptor */
>> smp_wmb();
>> avail->idx = cpu_to_le16(svq->shadow_avail_idx);
>> +}
>> +
>> +/**
>> + * Write descriptors to SVQ packed vring
>> + *
>> + * @svq: The shadow virtqueue
>> + * @out_sg: The iovec to the guest
>> + * @out_num: Outgoing iovec length
>> + * @in_sg: The iovec from the guest
>> + * @in_num: Incoming iovec length
>> + * @sgs: Cache for hwaddr
>> + * @head: Saves current free_head
>> + */
>> +static void vhost_svq_add_packed(VhostShadowVirtqueue *svq,
>> + const struct iovec *out_sg, size_t out_num,
>> + const struct iovec *in_sg, size_t in_num,
>> + hwaddr *sgs, unsigned *head)
>> +{
>> + uint16_t id, curr, i, head_flags = 0, head_idx;
>> + size_t num = out_num + in_num;
>> + unsigned n;
>> +
>> + struct vring_packed_desc *descs = svq->vring_packed.vring.desc;
>> +
>> + head_idx = svq->vring_packed.next_avail_idx;
>> + i = head_idx;
>> + id = svq->free_head;
>> + curr = id;
>> + *head = id;
>> +
>> + /* Write descriptors to SVQ packed vring */
>> + for (n = 0; n < num; n++) {
>> + uint16_t flags = cpu_to_le16(svq->vring_packed.avail_used_flags |
>> + (n < out_num ? 0 : VRING_DESC_F_WRITE) |
>> + (n + 1 == num ? 0 : VRING_DESC_F_NEXT));
>> + if (i == head_idx) {
>> + head_flags = flags;
>> + } else {
>> + descs[i].flags = flags;
>> + }
>> +
>> + descs[i].addr = cpu_to_le64(sgs[n]);
>> + descs[i].id = id;
>> + if (n < out_num) {
>> + descs[i].len = cpu_to_le32(out_sg[n].iov_len);
>> + } else {
>> + descs[i].len = cpu_to_le32(in_sg[n - out_num].iov_len);
>> + }
>> +
>> + curr = cpu_to_le16(svq->desc_next[curr]);
>> +
>> + if (++i >= svq->vring_packed.vring.num) {
>> + i = 0;
>> + svq->vring_packed.avail_used_flags ^=
>> + 1 << VRING_PACKED_DESC_F_AVAIL |
>> + 1 << VRING_PACKED_DESC_F_USED;
>> + }
>> + }
>>
>> + if (i <= head_idx) {
>> + svq->vring_packed.avail_wrap_counter ^= 1;
>> + }
>> +
>> + svq->vring_packed.next_avail_idx = i;
>> + svq->shadow_avail_idx = i;
>> + svq->free_head = curr;
>> +
>> + /*
>> + * A driver MUST NOT make the first descriptor in the list
>> + * available before all subsequent descriptors comprising
>> + * the list are made available.
>> + */
>> + smp_wmb();
>> + svq->vring_packed.vring.desc[head_idx].flags = head_flags;
>> }
>>
>> -static void vhost_svq_kick(VhostShadowVirtqueue *svq)
>> +static void vhost_svq_kick_split(VhostShadowVirtqueue *svq)
>> {
>> bool needs_kick;
>>
>> @@ -209,7 +282,8 @@ static void vhost_svq_kick(VhostShadowVirtqueue *svq)
>> if (virtio_vdev_has_feature(svq->vdev, VIRTIO_RING_F_EVENT_IDX)) {
>> uint16_t avail_event = le16_to_cpu(
>> *(uint16_t *)(&svq->vring.used->ring[svq->vring.num]));
>> - needs_kick = vring_need_event(avail_event, svq->shadow_avail_idx, svq->shadow_avail_idx - 1);
>> + needs_kick = vring_need_event(avail_event, svq->shadow_avail_idx,
>> + svq->shadow_avail_idx - 1);
>> } else {
>> needs_kick =
>> !(svq->vring.used->flags & cpu_to_le16(VRING_USED_F_NO_NOTIFY));
>> @@ -222,6 +296,30 @@ static void vhost_svq_kick(VhostShadowVirtqueue *svq)
>> event_notifier_set(&svq->hdev_kick);
>> }
>>
>> +static void vhost_svq_kick_packed(VhostShadowVirtqueue *svq)
>> +{
>> + bool needs_kick;
>> +
>> + /*
>> + * We need to expose the available array entries before checking
>> + * notification suppressions.
>> + */
>> + smp_mb();
>> +
>> + if (virtio_vdev_has_feature(svq->vdev, VIRTIO_RING_F_EVENT_IDX)) {
>> + return;
>
> It's weird SVQ does not need to kick if _F_EVENT_IDX. This should have
> code checking the device ring flags etc.
>
Right, I haven't implemented this yet. Since the current implementation is
being tested with event_idx=off (points 3 and 4 of the roadmap [1]), I thought
I would leave this for later.
Maybe I can add a comment in the implementation explaining this.
Thanks,
Sahil
[1] https://wiki.qemu.org/Internships/ProjectIdeas/PackedShadowVirtqueue
* Re: [RFC v5 2/7] vhost: Data structure changes to support packed vqs
2025-03-26 11:26 ` Eugenio Perez Martin
@ 2025-03-28 5:17 ` Sahil Siddiq
0 siblings, 0 replies; 44+ messages in thread
From: Sahil Siddiq @ 2025-03-28 5:17 UTC (permalink / raw)
To: Eugenio Perez Martin; +Cc: sgarzare, mst, qemu-devel, sahilcdq
Hi,
On 3/26/25 4:56 PM, Eugenio Perez Martin wrote:
> On Mon, Mar 24, 2025 at 3:00 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>
>> Introduce "struct vring_packed".
>>
>> Modify VhostShadowVirtqueue so it can support split and packed virtqueue
>> formats.
>>
>> Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
>> ---
>> Changes from v4 -> v5:
>> - This was commit #3 in v4. This has been reordered to commit #2
>> based on review comments.
>> - Place shadow_avail_idx, shadow_used_idx, last_used_idx
>> above the "shadow vring" union.
>>
>
> What is the reason for the member reorder?
>
In v1 [1], I had made the decision to shift all the fields in the structure that
are used by split and packed vqs above the "shadow vring" union. But I hadn't
moved shadow_avail_idx, shadow_used_idx or last_used_idx.
The latter two are also used by both vring formats. "shadow_avail_idx" is only
used by split vqs but I think it'll have to be used in packed vqs as well in the
future.
To keep things consistent, I thought I would shift these fields as well, although
these shifts shouldn't have an impact on the functionality. The shifts can be
dropped without issue.
>> hw/virtio/vhost-shadow-virtqueue.h | 87 +++++++++++++++++++-----------
>> 1 file changed, 56 insertions(+), 31 deletions(-)
>>
>> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
>> index 9c273739d6..5f7699da9d 100644
>> --- a/hw/virtio/vhost-shadow-virtqueue.h
>> +++ b/hw/virtio/vhost-shadow-virtqueue.h
>> @@ -46,10 +46,65 @@ typedef struct VhostShadowVirtqueueOps {
>> VirtQueueAvailCallback avail_handler;
>> } VhostShadowVirtqueueOps;
>>
>> +struct vring_packed {
>> + /* Actual memory layout for this queue. */
>> + struct {
>> + unsigned int num;
>> + struct vring_packed_desc *desc;
>> + struct vring_packed_desc_event *driver;
>> + struct vring_packed_desc_event *device;
>> + } vring;
>> +
>> + /* Avail used flags. */
>> + uint16_t avail_used_flags;
>> +
>> + /* Index of the next avail descriptor. */
>> + uint16_t next_avail_idx;
>> +
>> + /* Driver ring wrap counter */
>> + bool avail_wrap_counter;
>> +};
>> +
>> /* Shadow virtqueue to relay notifications */
>> typedef struct VhostShadowVirtqueue {
>> + /* True if packed virtqueue */
>> + bool is_packed;
>> +
>> + /* Virtio queue shadowing */
>> + VirtQueue *vq;
>> +
>> + /* Virtio device */
>> + VirtIODevice *vdev;
>> +
>> + /* SVQ vring descriptors state */
>> + SVQDescState *desc_state;
>> +
>> + /*
>> + * Backup next field for each descriptor so we can recover securely, not
>> + * needing to trust the device access.
>> + */
>> + uint16_t *desc_next;
>> +
>> + /* Next free descriptor */
>> + uint16_t free_head;
>> +
>> + /* Size of SVQ vring free descriptors */
>> + uint16_t num_free;
>> +
>> + /* Next head to expose to the device */
>> + uint16_t shadow_avail_idx;
>> +
>> + /* Last seen used idx */
>> + uint16_t shadow_used_idx;
>> +
>> + /* Next head to consume from the device */
>> + uint16_t last_used_idx;
>> +
>> /* Shadow vring */
>> - struct vring vring;
>> + union {
>> + struct vring vring;
>> + struct vring_packed vring_packed;
>> + };
>>
>> /* Shadow kick notifier, sent to vhost */
>> EventNotifier hdev_kick;
>> @@ -69,47 +124,17 @@ typedef struct VhostShadowVirtqueue {
>> /* Guest's call notifier, where the SVQ calls guest. */
>> EventNotifier svq_call;
>>
>> - /* Virtio queue shadowing */
>> - VirtQueue *vq;
>> -
>> - /* Virtio device */
>> - VirtIODevice *vdev;
>> -
>> /* IOVA mapping */
>> VhostIOVATree *iova_tree;
>>
>> - /* SVQ vring descriptors state */
>> - SVQDescState *desc_state;
>> -
>> /* Next VirtQueue element that guest made available */
>> VirtQueueElement *next_guest_avail_elem;
>>
>> - /*
>> - * Backup next field for each descriptor so we can recover securely, not
>> - * needing to trust the device access.
>> - */
>> - uint16_t *desc_next;
>> -
>> /* Caller callbacks */
>> const VhostShadowVirtqueueOps *ops;
>>
>> /* Caller callbacks opaque */
>> void *ops_opaque;
>> -
>> - /* Next head to expose to the device */
>> - uint16_t shadow_avail_idx;
>> -
>> - /* Next free descriptor */
>> - uint16_t free_head;
>> -
>> - /* Last seen used idx */
>> - uint16_t shadow_used_idx;
>> -
>> - /* Next head to consume from the device */
>> - uint16_t last_used_idx;
>> -
>> - /* Size of SVQ vring free descriptors */
>> - uint16_t num_free;
>> } VhostShadowVirtqueue;
>>
>> bool vhost_svq_valid_features(uint64_t features, Error **errp);
>> --
>> 2.48.1
>>
>
Thanks,
Sahil
[1] https://lists.nongnu.org/archive/html/qemu-devel/2024-06/msg03417.html
* Re: [RFC v5 1/7] vhost: Refactor vhost_svq_add_split
2025-03-26 11:25 ` Eugenio Perez Martin
@ 2025-03-28 5:18 ` Sahil Siddiq
0 siblings, 0 replies; 44+ messages in thread
From: Sahil Siddiq @ 2025-03-28 5:18 UTC (permalink / raw)
To: Eugenio Perez Martin; +Cc: sgarzare, mst, qemu-devel, sahilcdq
Hi,
On 3/26/25 4:55 PM, Eugenio Perez Martin wrote:
> On Mon, Mar 24, 2025 at 2:59 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>
>> This commit refactors vhost_svq_add_split and vhost_svq_add to simplify
>> their implementation and prepare for the addition of packed vqs in the
>> following commits.
>>
>> Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
>> ---
>> No changes from v4 -> v5.
>>
>
> You can carry the Acked-by from previous series if you make no changes
> (or even small changes).
>
> Acked-by: Eugenio Pérez <eperezma@redhat.com>
>
Understood.
Thanks,
Sahil
* Re: [RFC v5 5/7] vhost: Forward descriptors to guest via packed vqs
2025-03-26 8:34 ` Eugenio Perez Martin
@ 2025-03-28 5:22 ` Sahil Siddiq
2025-03-28 7:53 ` Eugenio Perez Martin
0 siblings, 1 reply; 44+ messages in thread
From: Sahil Siddiq @ 2025-03-28 5:22 UTC (permalink / raw)
To: Eugenio Perez Martin; +Cc: sgarzare, mst, qemu-devel, sahilcdq
Hi,
On 3/26/25 2:04 PM, Eugenio Perez Martin wrote:
> On Mon, Mar 24, 2025 at 3:34 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>
>> Hi,
>>
>> I had a few more queries here as well.
>>
>> On 3/24/25 7:29 PM, Sahil Siddiq wrote:
>>> Detect when used descriptors are ready for consumption by the guest via
>>> packed virtqueues and forward them from the device to the guest.
>>>
>>> Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
>>> ---
>>> Changes from v4 -> v5:
>>> - New commit.
>>> - vhost-shadow-virtqueue.c:
>>> (vhost_svq_more_used): Split into vhost_svq_more_used_split and
>>> vhost_svq_more_used_packed.
>>> (vhost_svq_enable_notification): Handle split and packed vqs.
>>> (vhost_svq_disable_notification): Likewise.
>>> (vhost_svq_get_buf): Split into vhost_svq_get_buf_split and
>>> vhost_svq_get_buf_packed.
>>> (vhost_svq_poll): Use new functions.
>>>
>>> hw/virtio/vhost-shadow-virtqueue.c | 121 ++++++++++++++++++++++++++---
>>> 1 file changed, 110 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
>>> index 126957231d..8430b3c94a 100644
>>> --- a/hw/virtio/vhost-shadow-virtqueue.c
>>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
>>> @@ -463,7 +463,7 @@ static void vhost_handle_guest_kick_notifier(EventNotifier *n)
>>> vhost_handle_guest_kick(svq);
>>> }
>>>
>>> -static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
>>> +static bool vhost_svq_more_used_split(VhostShadowVirtqueue *svq)
>>> {
>>> uint16_t *used_idx = &svq->vring.used->idx;
>>> if (svq->last_used_idx != svq->shadow_used_idx) {
>>> @@ -475,6 +475,22 @@ static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
>>> return svq->last_used_idx != svq->shadow_used_idx;
>>> }
>>>
>>> +static bool vhost_svq_more_used_packed(VhostShadowVirtqueue *svq)
>>> +{
>>> + bool avail_flag, used_flag, used_wrap_counter;
>>> + uint16_t last_used_idx, last_used, flags;
>>> +
>>> + last_used_idx = svq->last_used_idx;
>>> + last_used = last_used_idx & ~(1 << VRING_PACKED_EVENT_F_WRAP_CTR);
>>
>> In the linux kernel, last_used is calculated as:
>>
>> last_used_idx & ~(-(1 << VRING_PACKED_EVENT_F_WRAP_CTR))
>>
>> ...instead of...
>>
>> last_used_idx & ~(1 << VRING_PACKED_EVENT_F_WRAP_CTR)
>>
>> Isn't the second option good enough if last_used_idx is uint16_t
>> and VRING_PACKED_EVENT_F_WRAP_CTR is defined as 15.
>>
>
> I think it is good enough with the u16 restrictions but it's just
> defensive code.
>
Got it. I think it'll be better, then, to follow the kernel's implementation
to keep it more robust.
>>> + used_wrap_counter = !!(last_used_idx & (1 << VRING_PACKED_EVENT_F_WRAP_CTR));
>>> +
>>> + flags = le16_to_cpu(svq->vring_packed.vring.desc[last_used].flags);
>>> + avail_flag = !!(flags & (1 << VRING_PACKED_DESC_F_AVAIL));
>>> + used_flag = !!(flags & (1 << VRING_PACKED_DESC_F_USED));
>>> +
>>> + return avail_flag == used_flag && used_flag == used_wrap_counter;
>>> +}
>>> +
>>
>> Also in the implementation of vhost_svq_more_used_split() [1], I haven't
>> understood why the following condition:
>>
>> svq->last_used_idx != svq->shadow_used_idx
>>
>> is checked before updating the value of "svq->shadow_used_idx":
>>
>> svq->shadow_used_idx = le16_to_cpu(*(volatile uint16_t *)used_idx)
>>
>
> As far as I know this is used to avoid concurrent access to guest's
> used_idx, avoiding cache sharing, the memory barrier, and the
> potentially costly volatile access.
>
By concurrent access, do you mean the case where one thread has already updated
the value of used_idx?
Thanks,
Sahil
* Re: [RFC v5 6/7] vhost: Validate transport device features for packed vqs
2025-03-26 12:06 ` Eugenio Perez Martin
@ 2025-03-28 5:33 ` Sahil Siddiq
2025-03-28 8:02 ` Eugenio Perez Martin
0 siblings, 1 reply; 44+ messages in thread
From: Sahil Siddiq @ 2025-03-28 5:33 UTC (permalink / raw)
To: Eugenio Perez Martin; +Cc: sgarzare, mst, qemu-devel, sahilcdq
Hi,
On 3/26/25 5:36 PM, Eugenio Perez Martin wrote:
> On Mon, Mar 24, 2025 at 3:00 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>
>> Validate the transport device features required for utilizing packed SVQs:
>> the features that guests can use with the SVQ and that SVQs can use with vdpa.
>>
>> Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
>> ---
>> Changes from v4 -> v5:
>> - Split from commit #2 in v4.
>>
>> hw/virtio/vhost-shadow-virtqueue.c | 3 +++
>> 1 file changed, 3 insertions(+)
>>
>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
>> index 8430b3c94a..035ab1e66f 100644
>> --- a/hw/virtio/vhost-shadow-virtqueue.c
>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
>> @@ -33,6 +33,9 @@ bool vhost_svq_valid_features(uint64_t features, Error **errp)
>> ++b) {
>> switch (b) {
>> case VIRTIO_F_ANY_LAYOUT:
>> + case VIRTIO_F_RING_PACKED:
>> + case VIRTIO_F_RING_RESET:
>> + case VIRTIO_RING_F_INDIRECT_DESC:
>
> This should only enable _F_RING_PACKED, there is no code supporting
> either reset or indirect descriptors.
>
Without _F_RING_RESET and _RING_F_INDIRECT_DESC, I get the following error:
qemu-system-x86_64: -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0: SVQ Invalid device feature flags, offer: 0x1071011ffa7, ok: 0x70011ffa7
Evaluating 0x1071011ffa7 & ~0x70011ffa7 gives me 0x10010000000 as the
set of invalid features. This corresponds to _F_RING_RESET (1 << 40)
and _RING_F_INDIRECT_DESC (1 << 28) [1].
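The bit arithmetic can be checked mechanically; this minimal helper (hypothetical, not QEMU code) reproduces the computation, with bits 40 and 28 matching _F_RING_RESET and _RING_F_INDIRECT_DESC:

```c
#include <stdint.h>

/* Features offered by the device that the SVQ does not accept. */
static uint64_t invalid_features(uint64_t offered, uint64_t ok)
{
    return offered & ~ok;
}
```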
I get this error when x-svq=true irrespective of whether split vqs or packed
vqs are used.
Is there a way to turn them off in the QEMU command?
Thanks,
Sahil
[1] https://gitlab.com/qemu-project/qemu/-/blob/master/include/standard-headers/linux/virtio_config.h
* Re: [RFC v5 3/7] vhost: Forward descriptors to device via packed SVQ
2025-03-28 5:09 ` Sahil Siddiq
@ 2025-03-28 6:42 ` Eugenio Perez Martin
0 siblings, 0 replies; 44+ messages in thread
From: Eugenio Perez Martin @ 2025-03-28 6:42 UTC (permalink / raw)
To: Sahil Siddiq; +Cc: sgarzare, mst, qemu-devel, sahilcdq
On Fri, Mar 28, 2025 at 6:10 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> On 3/26/25 5:32 PM, Eugenio Perez Martin wrote:
> > On Mon, Mar 24, 2025 at 3:00 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>
> >> Implement the insertion of available buffers in the descriptor area of
> >> packed shadow virtqueues. It takes into account descriptor chains, but
> >> does not consider indirect descriptors.
> >>
> >> Enable the packed SVQ to forward the descriptors to the device.
> >>
> >> Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
> >> ---
> >> Changes from v4 -> v5:
> >> - This was commit #2 in v4. This has been reordered to commit #3
> >> based on review comments.
> >> - vhost-shadow-virtqueue.c:
> >> (vhost_svq_valid_features): Move addition of enums to commit #6
> >> based on review comments.
> >> (vhost_svq_add_packed): Set head_idx to buffer id instead of vring's
> >> index.
> >> (vhost_svq_kick): Split into vhost_svq_kick_split and
> >> vhost_svq_kick_packed.
> >> (vhost_svq_add): Use new vhost_svq_kick_* functions.
> >>
> >> hw/virtio/vhost-shadow-virtqueue.c | 117 +++++++++++++++++++++++++++--
> >> 1 file changed, 112 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> >> index 4f74ad402a..6e16cd4bdf 100644
> >> --- a/hw/virtio/vhost-shadow-virtqueue.c
> >> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> >> @@ -193,10 +193,83 @@ static void vhost_svq_add_split(VhostShadowVirtqueue *svq,
> >> /* Update the avail index after write the descriptor */
> >> smp_wmb();
> >> avail->idx = cpu_to_le16(svq->shadow_avail_idx);
> >> +}
> >> +
> >> +/**
> >> + * Write descriptors to SVQ packed vring
> >> + *
> >> + * @svq: The shadow virtqueue
> >> + * @out_sg: The iovec to the guest
> >> + * @out_num: Outgoing iovec length
> >> + * @in_sg: The iovec from the guest
> >> + * @in_num: Incoming iovec length
> >> + * @sgs: Cache for hwaddr
> >> + * @head: Saves current free_head
> >> + */
> >> +static void vhost_svq_add_packed(VhostShadowVirtqueue *svq,
> >> + const struct iovec *out_sg, size_t out_num,
> >> + const struct iovec *in_sg, size_t in_num,
> >> + hwaddr *sgs, unsigned *head)
> >> +{
> >> + uint16_t id, curr, i, head_flags = 0, head_idx;
> >> + size_t num = out_num + in_num;
> >> + unsigned n;
> >> +
> >> + struct vring_packed_desc *descs = svq->vring_packed.vring.desc;
> >> +
> >> + head_idx = svq->vring_packed.next_avail_idx;
> >> + i = head_idx;
> >> + id = svq->free_head;
> >> + curr = id;
> >> + *head = id;
> >> +
> >> + /* Write descriptors to SVQ packed vring */
> >> + for (n = 0; n < num; n++) {
> >> + uint16_t flags = cpu_to_le16(svq->vring_packed.avail_used_flags |
> >> + (n < out_num ? 0 : VRING_DESC_F_WRITE) |
> >> + (n + 1 == num ? 0 : VRING_DESC_F_NEXT));
> >> + if (i == head_idx) {
> >> + head_flags = flags;
> >> + } else {
> >> + descs[i].flags = flags;
> >> + }
> >> +
> >> + descs[i].addr = cpu_to_le64(sgs[n]);
> >> + descs[i].id = id;
> >> + if (n < out_num) {
> >> + descs[i].len = cpu_to_le32(out_sg[n].iov_len);
> >> + } else {
> >> + descs[i].len = cpu_to_le32(in_sg[n - out_num].iov_len);
> >> + }
> >> +
> >> + curr = cpu_to_le16(svq->desc_next[curr]);
> >> +
> >> + if (++i >= svq->vring_packed.vring.num) {
> >> + i = 0;
> >> + svq->vring_packed.avail_used_flags ^=
> >> + 1 << VRING_PACKED_DESC_F_AVAIL |
> >> + 1 << VRING_PACKED_DESC_F_USED;
> >> + }
> >> + }
> >>
> >> + if (i <= head_idx) {
> >> + svq->vring_packed.avail_wrap_counter ^= 1;
> >> + }
> >> +
> >> + svq->vring_packed.next_avail_idx = i;
> >> + svq->shadow_avail_idx = i;
> >> + svq->free_head = curr;
> >> +
> >> + /*
> >> + * A driver MUST NOT make the first descriptor in the list
> >> + * available before all subsequent descriptors comprising
> >> + * the list are made available.
> >> + */
> >> + smp_wmb();
> >> + svq->vring_packed.vring.desc[head_idx].flags = head_flags;
> >> }
> >>
> >> -static void vhost_svq_kick(VhostShadowVirtqueue *svq)
> >> +static void vhost_svq_kick_split(VhostShadowVirtqueue *svq)
> >> {
> >> bool needs_kick;
> >>
> >> @@ -209,7 +282,8 @@ static void vhost_svq_kick(VhostShadowVirtqueue *svq)
> >> if (virtio_vdev_has_feature(svq->vdev, VIRTIO_RING_F_EVENT_IDX)) {
> >> uint16_t avail_event = le16_to_cpu(
> >> *(uint16_t *)(&svq->vring.used->ring[svq->vring.num]));
> >> - needs_kick = vring_need_event(avail_event, svq->shadow_avail_idx, svq->shadow_avail_idx - 1);
> >> + needs_kick = vring_need_event(avail_event, svq->shadow_avail_idx,
> >> + svq->shadow_avail_idx - 1);
> >> } else {
> >> needs_kick =
> >> !(svq->vring.used->flags & cpu_to_le16(VRING_USED_F_NO_NOTIFY));
> >> @@ -222,6 +296,30 @@ static void vhost_svq_kick(VhostShadowVirtqueue *svq)
> >> event_notifier_set(&svq->hdev_kick);
> >> }
> >>
> >> +static void vhost_svq_kick_packed(VhostShadowVirtqueue *svq)
> >> +{
> >> + bool needs_kick;
> >> +
> >> + /*
> >> + * We need to expose the available array entries before checking
> >> + * notification suppressions.
> >> + */
> >> + smp_mb();
> >> +
> >> + if (virtio_vdev_has_feature(svq->vdev, VIRTIO_RING_F_EVENT_IDX)) {
> >> + return;
> >
> > It's weird SVQ does not need to kick if _F_EVENT_IDX. This should have
> > code checking the device ring flags etc.
> >
>
> Right, I haven't implemented this yet. Since the current implementation is
> being tested with event_idx=off (points 3 and 4 of the roadmap [1]), I thought
> I would leave this for later.
>
> Maybe I can add a comment in the implementation explaining this.
>
Sure that's fine, and probably even better than trying to address
everything in one shot :) Can you add a TODO in each place you
identify so we're sure we don't miss any?
> Thanks,
> Sahil
>
> [1] https://wiki.qemu.org/Internships/ProjectIdeas/PackedShadowVirtqueue
>
* Re: [RFC v5 3/7] vhost: Forward descriptors to device via packed SVQ
2025-03-27 18:42 ` Sahil Siddiq
@ 2025-03-28 7:51 ` Eugenio Perez Martin
2025-04-14 9:37 ` Sahil Siddiq
0 siblings, 1 reply; 44+ messages in thread
From: Eugenio Perez Martin @ 2025-03-28 7:51 UTC (permalink / raw)
To: Sahil Siddiq; +Cc: sgarzare, mst, qemu-devel, sahilcdq
On Thu, Mar 27, 2025 at 7:42 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> On 3/26/25 1:33 PM, Eugenio Perez Martin wrote:
> > On Mon, Mar 24, 2025 at 3:14 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >> On 3/24/25 7:29 PM, Sahil Siddiq wrote:
> >>> Implement the insertion of available buffers in the descriptor area of
> >>> packed shadow virtqueues. It takes into account descriptor chains, but
> >>> does not consider indirect descriptors.
> >>>
> >>> Enable the packed SVQ to forward the descriptors to the device.
> >>>
> >>> Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
> >>> ---
> >>> Changes from v4 -> v5:
> >>> - This was commit #2 in v4. This has been reordered to commit #3
> >>> based on review comments.
> >>> - vhost-shadow-virtqueue.c:
> >>> (vhost_svq_valid_features): Move addition of enums to commit #6
> >>> based on review comments.
> >>> (vhost_svq_add_packed): Set head_idx to buffer id instead of vring's
> >>> index.
> >>> (vhost_svq_kick): Split into vhost_svq_kick_split and
> >>> vhost_svq_kick_packed.
> >>> (vhost_svq_add): Use new vhost_svq_kick_* functions.
> >>>
> >>> hw/virtio/vhost-shadow-virtqueue.c | 117 +++++++++++++++++++++++++++--
> >>> 1 file changed, 112 insertions(+), 5 deletions(-)
> >>>
> >>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> >>> index 4f74ad402a..6e16cd4bdf 100644
> >>> --- a/hw/virtio/vhost-shadow-virtqueue.c
> >>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> >>> @@ -193,10 +193,83 @@ static void vhost_svq_add_split(VhostShadowVirtqueue *svq,
> >>> /* Update the avail index after write the descriptor */
> >>> smp_wmb();
> >>> avail->idx = cpu_to_le16(svq->shadow_avail_idx);
> >>> +}
> >>> +
> >>> +/**
> >>> + * Write descriptors to SVQ packed vring
> >>> + *
> >>> + * @svq: The shadow virtqueue
> >>> + * @out_sg: The iovec to the guest
> >>> + * @out_num: Outgoing iovec length
> >>> + * @in_sg: The iovec from the guest
> >>> + * @in_num: Incoming iovec length
> >>> + * @sgs: Cache for hwaddr
> >>> + * @head: Saves current free_head
> >>> + */
> >>> +static void vhost_svq_add_packed(VhostShadowVirtqueue *svq,
> >>> + const struct iovec *out_sg, size_t out_num,
> >>> + const struct iovec *in_sg, size_t in_num,
> >>> + hwaddr *sgs, unsigned *head)
> >>> +{
> >>> + uint16_t id, curr, i, head_flags = 0, head_idx;
> >>> + size_t num = out_num + in_num;
> >>> + unsigned n;
> >>> +
> >>> + struct vring_packed_desc *descs = svq->vring_packed.vring.desc;
> >>> +
> >>> + head_idx = svq->vring_packed.next_avail_idx;
> >>
> >> Since "svq->vring_packed.next_avail_idx" is part of QEMU internals and not
> >> stored in guest memory, no endianness conversion is required here, right?
> >>
> >
> > Right!
>
> Understood.
>
> >>> + i = head_idx;
> >>> + id = svq->free_head;
> >>> + curr = id;
> >>> + *head = id;
> >>
> >> Should head be the buffer id or the idx of the descriptor ring where the
> >> first descriptor of a descriptor chain is inserted?
> >>
> >
> > The buffer id of the *last* descriptor of a chain. See "2.8.6 Next
> > Flag: Descriptor Chaining" at [1].
>
> Ah, yes. The second half of my question is incorrect.
>
> The tail descriptor of the chain includes the buffer id. In this implementation
> we place the same tail buffer id in other locations of the descriptor ring since
> they will be ignored anyway [1].
>
> The explanation below frames my query better.
>
> >>> + /* Write descriptors to SVQ packed vring */
> >>> + for (n = 0; n < num; n++) {
> >>> + uint16_t flags = cpu_to_le16(svq->vring_packed.avail_used_flags |
> >>> + (n < out_num ? 0 : VRING_DESC_F_WRITE) |
> >>> + (n + 1 == num ? 0 : VRING_DESC_F_NEXT));
> >>> + if (i == head_idx) {
> >>> + head_flags = flags;
> >>> + } else {
> >>> + descs[i].flags = flags;
> >>> + }
> >>> +
> >>> + descs[i].addr = cpu_to_le64(sgs[n]);
> >>> + descs[i].id = id;
> >>> + if (n < out_num) {
> >>> + descs[i].len = cpu_to_le32(out_sg[n].iov_len);
> >>> + } else {
> >>> + descs[i].len = cpu_to_le32(in_sg[n - out_num].iov_len);
> >>> + }
> >>> +
> >>> + curr = cpu_to_le16(svq->desc_next[curr]);
> >>> +
> >>> + if (++i >= svq->vring_packed.vring.num) {
> >>> + i = 0;
> >>> + svq->vring_packed.avail_used_flags ^=
> >>> + 1 << VRING_PACKED_DESC_F_AVAIL |
> >>> + 1 << VRING_PACKED_DESC_F_USED;
> >>> + }
> >>> + }
> >>>
> >>> + if (i <= head_idx) {
> >>> + svq->vring_packed.avail_wrap_counter ^= 1;
> >>> + }
> >>> +
> >>> + svq->vring_packed.next_avail_idx = i;
> >>> + svq->shadow_avail_idx = i;
> >>> + svq->free_head = curr;
> >>> +
> >>> + /*
> >>> + * A driver MUST NOT make the first descriptor in the list
> >>> + * available before all subsequent descriptors comprising
> >>> + * the list are made available.
> >>> + */
> >>> + smp_wmb();
> >>> + svq->vring_packed.vring.desc[head_idx].flags = head_flags;
> >>> }
> >>>
> >>> -static void vhost_svq_kick(VhostShadowVirtqueue *svq)
> >>> +static void vhost_svq_kick_split(VhostShadowVirtqueue *svq)
> >>> {
> >>> bool needs_kick;
> >>>
> >>> @@ -209,7 +282,8 @@ static void vhost_svq_kick(VhostShadowVirtqueue *svq)
> >>> if (virtio_vdev_has_feature(svq->vdev, VIRTIO_RING_F_EVENT_IDX)) {
> >>> uint16_t avail_event = le16_to_cpu(
> >>> *(uint16_t *)(&svq->vring.used->ring[svq->vring.num]));
> >>> - needs_kick = vring_need_event(avail_event, svq->shadow_avail_idx, svq->shadow_avail_idx - 1);
> >>> + needs_kick = vring_need_event(avail_event, svq->shadow_avail_idx,
> >>> + svq->shadow_avail_idx - 1);
> >>> } else {
> >>> needs_kick =
> >>> !(svq->vring.used->flags & cpu_to_le16(VRING_USED_F_NO_NOTIFY));
> >>> @@ -222,6 +296,30 @@ static void vhost_svq_kick(VhostShadowVirtqueue *svq)
> >>> event_notifier_set(&svq->hdev_kick);
> >>> }
> >>>
> >>> +static void vhost_svq_kick_packed(VhostShadowVirtqueue *svq)
> >>> +{
> >>> + bool needs_kick;
> >>> +
> >>> + /*
> >>> + * We need to expose the available array entries before checking
> >>> + * notification suppressions.
> >>> + */
> >>> + smp_mb();
> >>> +
> >>> + if (virtio_vdev_has_feature(svq->vdev, VIRTIO_RING_F_EVENT_IDX)) {
> >>> + return;
> >>> + } else {
> >>> + needs_kick = (svq->vring_packed.vring.device->flags !=
> >>> + cpu_to_le16(VRING_PACKED_EVENT_FLAG_DISABLE));
> >>> + }
> >>> +
> >>> + if (!needs_kick) {
> >>> + return;
> >>> + }
> >>> +
> >>> + event_notifier_set(&svq->hdev_kick);
> >>> +}
> >>> +
> >>> /**
> >>> * Add an element to a SVQ.
> >>> *
> >>> @@ -258,13 +356,22 @@ int vhost_svq_add(VhostShadowVirtqueue *svq, const struct iovec *out_sg,
> >>> return -EINVAL;
> >>> }
> >>>
> >>> - vhost_svq_add_split(svq, out_sg, out_num, in_sg,
> >>> - in_num, sgs, &qemu_head);
> >>> + if (svq->is_packed) {
> >>> + vhost_svq_add_packed(svq, out_sg, out_num, in_sg,
> >>> + in_num, sgs, &qemu_head);
> >>> + } else {
> >>> + vhost_svq_add_split(svq, out_sg, out_num, in_sg,
> >>> + in_num, sgs, &qemu_head);
> >>> + }
> >>>
> >>> svq->num_free -= ndescs;
> >>> svq->desc_state[qemu_head].elem = elem;
> >>> svq->desc_state[qemu_head].ndescs = ndescs;
> >>
> >> *head in vhost_svq_add_packed() is stored in "qemu_head" here.
> >>
> >
> > Sorry I don't get this, can you expand?
>
> Sure. In vhost_svq_add(), after the descriptors have been added
> (either using vhost_svq_add_split or vhost_svq_add_packed),
> VirtQueueElement elem and ndescs are both saved in the
> svq->desc_state array. "elem" and "ndescs" are later used when
> the guest consumes used descriptors from the device in
> vhost_svq_get_buf_(split|packed).
>
> For split vqs, the index of svq->desc where elem and ndescs are
> saved matches the index of the descriptor ring where the head of
> the descriptor ring is placed.
>
> In vhost_svq_add_split:
>
> *head = svq->free_head;
> [...]
> avail_idx = svq->shadow_avail_idx & (svq->vring.num - 1);
> avail->ring[avail_idx] = cpu_to_le16(*head);
>
> "qemu_head" in vhost_svq_add gets its value from "*head" in
> vhost_svq_add_split:
>
> svq->desc_state[qemu_head].elem = elem;
> svq->desc_state[qemu_head].ndescs = ndescs;
>
> For packed vq, something similar has to be done. My approach was
> to have the index of svq->desc_state match the buffer id in the
> tail of the descriptor ring.
>
> The entire chain is written to the descriptor ring in the loop
> in vhost_svq_add_packed. I am not sure if the index of
> svq->desc_state should be the buffer id or if it should be a
> descriptor index ("head_idx" or the index corresponding to the
> tail of the chain).
>
I think both approaches should be valid. My advice is to follow
Linux's code and let it be the tail descriptor id. This descriptor id
is pushed and popped from vq->free_head in a stack style.
In addition to that, Linux also sets the same id to all the chain
elements. I think this is useful when dealing with bad devices. In
particular, QEMU's packed vq implementation looked at the first
descriptor's id, which is incorrect behavior.
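A minimal sketch of that free-list discipline (hypothetical names, loosely modeled on Linux's virtio_ring.c, not the actual QEMU code): one id is popped per chain, written into every descriptor of the chain, and pushed back when the chain comes back used:

```c
#include <stdint.h>

#define VQ_NUM 8

/* Hypothetical, simplified buffer-id free list for a packed vq. */
struct id_pool {
    uint16_t free_head;    /* next free buffer id */
    uint16_t next[VQ_NUM]; /* singly-linked free list */
};

static void id_pool_init(struct id_pool *p)
{
    p->free_head = 0;
    for (uint16_t i = 0; i < VQ_NUM; i++) {
        p->next[i] = i + 1;
    }
}

/* Pop one id per chain; the same id is written to every descriptor. */
static uint16_t id_pool_get(struct id_pool *p)
{
    uint16_t id = p->free_head;
    p->free_head = p->next[id];
    return id;
}

/* Push the id back when the device marks the chain as used. */
static void id_pool_put(struct id_pool *p, uint16_t id)
{
    p->next[id] = p->free_head;
    p->free_head = id;
}
```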
* Re: [RFC v5 5/7] vhost: Forward descriptors to guest via packed vqs
2025-03-28 5:22 ` Sahil Siddiq
@ 2025-03-28 7:53 ` Eugenio Perez Martin
0 siblings, 0 replies; 44+ messages in thread
From: Eugenio Perez Martin @ 2025-03-28 7:53 UTC (permalink / raw)
To: Sahil Siddiq; +Cc: sgarzare, mst, qemu-devel, sahilcdq
On Fri, Mar 28, 2025 at 6:22 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> On 3/26/25 2:04 PM, Eugenio Perez Martin wrote:
> > On Mon, Mar 24, 2025 at 3:34 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> I had a few more queries here as well.
> >>
> >> On 3/24/25 7:29 PM, Sahil Siddiq wrote:
> >>> Detect when used descriptors are ready for consumption by the guest via
> >>> packed virtqueues and forward them from the device to the guest.
> >>>
> >>> Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
> >>> ---
> >>> Changes from v4 -> v5:
> >>> - New commit.
> >>> - vhost-shadow-virtqueue.c:
> >>> (vhost_svq_more_used): Split into vhost_svq_more_used_split and
> >>> vhost_svq_more_used_packed.
> >>> (vhost_svq_enable_notification): Handle split and packed vqs.
> >>> (vhost_svq_disable_notification): Likewise.
> >>> (vhost_svq_get_buf): Split into vhost_svq_get_buf_split and
> >>> vhost_svq_get_buf_packed.
> >>> (vhost_svq_poll): Use new functions.
> >>>
> >>> hw/virtio/vhost-shadow-virtqueue.c | 121 ++++++++++++++++++++++++++---
> >>> 1 file changed, 110 insertions(+), 11 deletions(-)
> >>>
> >>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> >>> index 126957231d..8430b3c94a 100644
> >>> --- a/hw/virtio/vhost-shadow-virtqueue.c
> >>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> >>> @@ -463,7 +463,7 @@ static void vhost_handle_guest_kick_notifier(EventNotifier *n)
> >>> vhost_handle_guest_kick(svq);
> >>> }
> >>>
> >>> -static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
> >>> +static bool vhost_svq_more_used_split(VhostShadowVirtqueue *svq)
> >>> {
> >>> uint16_t *used_idx = &svq->vring.used->idx;
> >>> if (svq->last_used_idx != svq->shadow_used_idx) {
> >>> @@ -475,6 +475,22 @@ static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
> >>> return svq->last_used_idx != svq->shadow_used_idx;
> >>> }
> >>>
> >>> +static bool vhost_svq_more_used_packed(VhostShadowVirtqueue *svq)
> >>> +{
> >>> + bool avail_flag, used_flag, used_wrap_counter;
> >>> + uint16_t last_used_idx, last_used, flags;
> >>> +
> >>> + last_used_idx = svq->last_used_idx;
> >>> + last_used = last_used_idx & ~(1 << VRING_PACKED_EVENT_F_WRAP_CTR);
> >>
> >> In the linux kernel, last_used is calculated as:
> >>
> >> last_used_idx & ~(-(1 << VRING_PACKED_EVENT_F_WRAP_CTR))
> >>
> >> ...instead of...
> >>
> >> last_used_idx & ~(1 << VRING_PACKED_EVENT_F_WRAP_CTR)
> >>
> >> Isn't the second option good enough if last_used_idx is uint16_t
> >> and VRING_PACKED_EVENT_F_WRAP_CTR is defined as 15.
> >>
> >
> > I think it is good enough with the u16 restrictions but it's just
> > defensive code.
> >
>
> Got it. I think it'll be better, then, to follow the implementation in
> the kernel to keep it more robust.
>
> >>> + used_wrap_counter = !!(last_used_idx & (1 << VRING_PACKED_EVENT_F_WRAP_CTR));
> >>> +
> >>> + flags = le16_to_cpu(svq->vring_packed.vring.desc[last_used].flags);
> >>> + avail_flag = !!(flags & (1 << VRING_PACKED_DESC_F_AVAIL));
> >>> + used_flag = !!(flags & (1 << VRING_PACKED_DESC_F_USED));
> >>> +
> >>> + return avail_flag == used_flag && used_flag == used_wrap_counter;
> >>> +}
> >>> +
> >>
> >> Also in the implementation of vhost_svq_more_used_split() [1], I haven't
> >> understood why the following condition:
> >>
> >> svq->last_used_idx != svq->shadow_used_idx
> >>
> >> is checked before updating the value of "svq->shadow_used_idx":
> >>
> >> svq->shadow_used_idx = le16_to_cpu(*(volatile uint16_t *)used_idx)
> >>
> >
> > As far as I know this is used to avoid concurrent access to guest's
> > used_idx, avoiding cache sharing, the memory barrier, and the
> > potentially costly volatile access.
> >
>
> By concurrent access, do you mean in case one thread has already updated
> the value of used_idx?
>
Yes, concurrent access by the driver and the device. This could be the
case of different threads if the device is virtual in QEMU. The two
CPU threads are accessing the same memory region.
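The caching described here can be sketched as follows (hypothetical names; the volatile pointer stands in for the device-written used->idx in shared memory):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified state mirroring the split SVQ fields. */
struct used_tracker {
    uint16_t last_used_idx;   /* last index the driver consumed */
    uint16_t shadow_used_idx; /* cached copy of the device's used->idx */
};

/*
 * Only touch the shared (volatile) index when the cached copy shows
 * nothing pending; otherwise skip the costly shared access entirely.
 */
static bool more_used(struct used_tracker *t, const volatile uint16_t *used_idx)
{
    if (t->last_used_idx != t->shadow_used_idx) {
        return true; /* still-unprocessed entries known locally */
    }
    t->shadow_used_idx = *used_idx;
    return t->last_used_idx != t->shadow_used_idx;
}
```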
* Re: [RFC v5 6/7] vhost: Validate transport device features for packed vqs
2025-03-28 5:33 ` Sahil Siddiq
@ 2025-03-28 8:02 ` Eugenio Perez Martin
0 siblings, 0 replies; 44+ messages in thread
From: Eugenio Perez Martin @ 2025-03-28 8:02 UTC (permalink / raw)
To: Sahil Siddiq; +Cc: sgarzare, mst, qemu-devel, sahilcdq
On Fri, Mar 28, 2025 at 6:34 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> On 3/26/25 5:36 PM, Eugenio Perez Martin wrote:
> > On Mon, Mar 24, 2025 at 3:00 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>
> >> Validate the transport device features required for utilizing packed SVQ,
> >> both those that guests can use with the SVQ and those that SVQs can use with vdpa.
> >>
> >> Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
> >> ---
> >> Changes from v4 -> v5:
> >> - Split from commit #2 in v4.
> >>
> >> hw/virtio/vhost-shadow-virtqueue.c | 3 +++
> >> 1 file changed, 3 insertions(+)
> >>
> >> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> >> index 8430b3c94a..035ab1e66f 100644
> >> --- a/hw/virtio/vhost-shadow-virtqueue.c
> >> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> >> @@ -33,6 +33,9 @@ bool vhost_svq_valid_features(uint64_t features, Error **errp)
> >> ++b) {
> >> switch (b) {
> >> case VIRTIO_F_ANY_LAYOUT:
> >> + case VIRTIO_F_RING_PACKED:
> >> + case VIRTIO_F_RING_RESET:
> >> + case VIRTIO_RING_F_INDIRECT_DESC:
> >
> > This should only enable _F_RING_PACKED, there is no code supporting
> > either reset or indirect descriptors.
> >
>
> Without _F_RING_RESET and _RING_F_INDIRECT_DESC, I get the following error:
>
> qemu-system-x86_64: -netdev type=vhost-vdpa,vhostdev=/dev/vhost-vdpa-0,x-svq=true,id=vhost-vdpa0: SVQ Invalid device feature flags, offer: 0x1071011ffa7, ok: 0x70011ffa7
>
> Evaluating 0x1071011ffa7 & ~0x70011ffa7 gives me 0x10010000000 as the
> set of invalid features. This corresponds to _F_RING_RESET (1 << 40)
> and _RING_F_INDIRECT_DESC (1 << 28) [1].
>
> I get this error when x-svq=true irrespective of whether split vqs or packed
> vqs are used.
>
> Is there a way to turn them off in the QEMU command?
>
In the case of nested virtualization you should be able to disable it by:
-device virtio-net-pci,indirect_desc=off,queue_reset=off,...
In L0 QEMU cmdline.
> Thanks,
> Sahil
>
> [1] https://gitlab.com/qemu-project/qemu/-/blob/master/include/standard-headers/linux/virtio_config.h
>
* Re: [RFC v5 0/7] Add packed format to shadow virtqueue
2025-03-26 7:35 ` [RFC v5 0/7] Add packed format to shadow virtqueue Eugenio Perez Martin
@ 2025-04-14 9:20 ` Sahil Siddiq
2025-04-15 19:20 ` Sahil Siddiq
2025-04-16 7:20 ` Eugenio Perez Martin
0 siblings, 2 replies; 44+ messages in thread
From: Sahil Siddiq @ 2025-04-14 9:20 UTC (permalink / raw)
To: Eugenio Perez Martin; +Cc: sgarzare, mst, qemu-devel, sahilcdq, Jason Wang
Hi,
On 3/26/25 1:05 PM, Eugenio Perez Martin wrote:
> On Mon, Mar 24, 2025 at 2:59 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>> I managed to fix a few issues while testing this patch series.
>> There is still one issue that I am unable to resolve. I thought
>> I would send this patch series for review in case I have missed
>> something.
>>
>> The issue is that this patch series does not work every time. I
>> am able to ping L0 from L2 and vice versa via packed SVQ when it
>> works.
>
> So we're on a very good track then!
>
>> When this doesn't work, both VMs throw a "Destination Host
>> Unreachable" error. This is sometimes (not always) accompanied
>> by the following kernel error (thrown by L2-kernel):
>>
>> virtio_net virtio1: output.0:id 1 is not a head!
>>
>
> How many packets have been sent or received before hitting this? If
> the answer to that is "the vq size", maybe there is a bug in the code
> that handles the wraparound of the packed vq, as the used and avail
> flags need to be twisted. You can count them in the SVQ code.
I did a lot more testing. This issue is quite unpredictable in terms
of the time at which it appears after booting L2. So far, it almost
always appears after booting L2. Even when pinging works, this issue
appears after several seconds of pinging.
The total number of svq descriptors varied in every test run. But in
every case, all 256 indices were filled in the descriptor region for
vq with vq_idx = 0. This is the RX vq, right? This was filled while L2
was booting. In the case when the ctrl vq is disabled, I am not sure
what is responsible for filling the vqs in the data plane during
booting.
=====
The issue is hit most frequently when the following command is run
in L0:
$ ip addr add 111.1.1.1/24 dev tap0
$ ip link set tap0 up
or, running the following in L2:
# ip addr add 111.1.1.2/24 dev eth0
The other vq (vq_idx=1) is not filled completely before the issue is
hit. I have been noting down the numbers and here is an example:
295 descriptors were added individually to the queues i.e., there were no chains (vhost_svq_add_packed)
|_ 256 additions in vq_idx = 0, all with unique ids
|---- 27 descriptors (ids 0 through 26) were received later from the device (vhost_svq_get_buf_packed)
|_ 39 additions in vq_idx = 1
|_ 13 descriptors had id = 0
|_ 26 descriptors had id = 1
|---- All descriptors were received at some point from the device (vhost_svq_get_buf_packed)
There was one case in which vq_idx=0 had wrapped around. I verified
that flags were set appropriately during the wrap (avail and used flags
were flipped as expected).
=====
The next common situation where this issue is hit is during startup.
Before L2 can finish booting successfully, this error is thrown:
virtio_net virtio1: output.0:id 0 is not a head!
258 descriptors were added individually to the queues during startup (there were no chains) (vhost_svq_add_packed)
|_ 256 additions in vq_idx = 0, all with unique ids
|---- None of them were received by the device (vhost_svq_get_buf_packed)
|_ 2 additions in vq_idx = 1
|_ id = 0 in index 0
|_ id = 1 in index 1
|---- Both descriptors were received at some point during startup from the device (vhost_svq_get_buf_packed)
=====
Another case is after several seconds of pinging L0 from L2.
[ 99.034114] virtio_net virtio1: output.0:id 0 is not a head!
366 descriptors were added individually to the queues i.e., there were no chains (vhost_svq_add_packed)
|_ 289 additions in vq_idx = 0, wrap-around was observed with avail and used flags inverted for 33 descriptors
| |---- 40 descriptors (ids 0 through 39) were received from the device (vhost_svq_get_buf_packed)
|_ 77 additions in vq_idx = 1
|_ 76 descriptors had id = 0
|_ 1 descriptor had id = 1
|---- all 77 descriptors were received at some point from the device (vhost_svq_get_buf_packed)
I am not entirely sure now if there's an issue in the packed vq
implementation in QEMU or if this is being caused due to some sort
of race condition in linux.
"id is not a head" is being thrown because vq->packed.desc_state[id].data
doesn't exist for the corresponding id in Linux [1]. But QEMU seems to have
stored some data for this id via vhost_svq_add() [2]. Linux sets the value
of vq->packed.desc_state[id].data in its version of virtqueue_add_packed() [3].
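A sketch of the guest-side check that produces this error (hypothetical, heavily simplified from Linux's desc_state bookkeeping):

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define VQ_NUM 4

/* Hypothetical per-id state, analogous to desc_state[].data in Linux. */
struct desc_state {
    void *data; /* NULL means the id is not an outstanding head */
};

/*
 * A used id must refer to a buffer the driver previously made
 * available; otherwise the driver reports "id ... is not a head!".
 */
static bool id_is_head(const struct desc_state *state, uint16_t id)
{
    return id < VQ_NUM && state[id].data != NULL;
}
```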
>> This error is not thrown always, but when it is thrown, the id
>> varies. This is invariably followed by a soft lockup:
>> [...]
>> [ 284.662292] Call Trace:
>> [ 284.662292] <IRQ>
>> [ 284.662292] ? watchdog_timer_fn+0x1e6/0x270
>> [ 284.662292] ? __pfx_watchdog_timer_fn+0x10/0x10
>> [ 284.662292] ? __hrtimer_run_queues+0x10f/0x2b0
>> [ 284.662292] ? hrtimer_interrupt+0xf8/0x230
>> [ 284.662292] ? __sysvec_apic_timer_interrupt+0x4d/0x140
>> [ 284.662292] ? sysvec_apic_timer_interrupt+0x39/0x90
>> [ 284.662292] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
>> [ 284.662292] ? virtqueue_enable_cb_delayed+0x115/0x150
>> [ 284.662292] start_xmit+0x2a6/0x4f0 [virtio_net]
>> [ 284.662292] ? netif_skb_features+0x98/0x300
>> [ 284.662292] dev_hard_start_xmit+0x61/0x1d0
>> [ 284.662292] sch_direct_xmit+0xa4/0x390
>> [ 284.662292] __dev_queue_xmit+0x84f/0xdc0
>> [ 284.662292] ? nf_hook_slow+0x42/0xf0
>> [ 284.662292] ip_finish_output2+0x2b8/0x580
>> [ 284.662292] igmp_ifc_timer_expire+0x1d5/0x430
>> [ 284.662292] ? __pfx_igmp_ifc_timer_expire+0x10/0x10
>> [ 284.662292] call_timer_fn+0x21/0x130
>> [ 284.662292] ? __pfx_igmp_ifc_timer_expire+0x10/0x10
>> [ 284.662292] __run_timers+0x21f/0x2b0
>> [ 284.662292] run_timer_softirq+0x1d/0x40
>> [ 284.662292] __do_softirq+0xc9/0x2c8
>> [ 284.662292] __irq_exit_rcu+0xa6/0xc0
>> [ 284.662292] sysvec_apic_timer_interrupt+0x72/0x90
>> [ 284.662292] </IRQ>
>> [ 284.662292] <TASK>
>> [ 284.662292] asm_sysvec_apic_timer_interrupt+0x1a/0x20
>> [ 284.662292] RIP: 0010:pv_native_safe_halt+0xf/0x20
>> [ 284.662292] Code: 22 d7 c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 53 75 3f 00 fb f4 <c3> cc c0
>> [ 284.662292] RSP: 0018:ffffb8f0000b3ed8 EFLAGS: 00000212
>> [ 284.662292] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
>> [ 284.662292] RDX: 4000000000000000 RSI: 0000000000000083 RDI: 00000000000289ec
>> [ 284.662292] RBP: ffff96f200810000 R08: 0000000000000000 R09: 0000000000000001
>> [ 284.662292] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
>> [ 284.662292] R13: 0000000000000000 R14: ffff96f200810000 R15: 0000000000000000
>> [ 284.662292] default_idle+0x9/0x20
>> [ 284.662292] default_idle_call+0x2c/0xe0
>> [ 284.662292] do_idle+0x226/0x270
>> [ 284.662292] cpu_startup_entry+0x2a/0x30
>> [ 284.662292] start_secondary+0x11e/0x140
>> [ 284.662292] secondary_startup_64_no_verify+0x184/0x18b
>> [ 284.662292] </TASK>
>>
>> The soft lockup seems to happen in
>> drivers/net/virtio_net.c:start_xmit() [1].
>>
>
> Maybe it gets stuck in the do {} while(...
> !virtqueue_enable_cb_delayed()) ? you can add a printk in
> virtqueue_enable_cb_delayed return and check if it matches with the
> speed you're sending or receiving ping. For example, if ping is each
> second, you should not see a lot of traces.
>
> If this does not work I'd try never disabling notifications, both in
> the kernel and SVQ, and check if that works.
In order to disable notifications, will something have to be commented
out in the implementation?
>> [...]
>> QEMU command to boot L1:
>>
>> $ sudo ./qemu/build/qemu-system-x86_64 \
>> -enable-kvm \
>> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
>> -net nic,model=virtio \
>> -net user,hostfwd=tcp::2222-:22 \
>> -device intel-iommu,snoop-control=on \
>> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,mq=off,ctrl_vq=off,ctrl_rx=off,ctrl_vlan=off,ctrl_mac_addr=off,packed=on,event_idx=off,bus=pcie.0,addr=0x4 \
>> -netdev tap,id=net0,script=no,downscript=no,vhost=off \
>> -nographic \
>> -m 8G \
>> -smp 4 \
>> -M q35 \
>> -cpu host 2>&1 | tee vm.log
>>
I have added "-device virtio-net-pci,indirect_desc=off,queue_reset=off"
to the L0 QEMU command to boot L1.
Thanks,
Sahil
[1] https://github.com/torvalds/linux/blob/master/drivers/virtio/virtio_ring.c#L1762
[2] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-shadow-virtqueue.c#L290
[3] https://github.com/torvalds/linux/blob/master/drivers/virtio/virtio_ring.c#L1564
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC v5 3/7] vhost: Forward descriptors to device via packed SVQ
2025-03-28 7:51 ` Eugenio Perez Martin
@ 2025-04-14 9:37 ` Sahil Siddiq
2025-04-14 15:07 ` Eugenio Perez Martin
0 siblings, 1 reply; 44+ messages in thread
From: Sahil Siddiq @ 2025-04-14 9:37 UTC (permalink / raw)
To: Eugenio Perez Martin; +Cc: sgarzare, mst, qemu-devel, sahilcdq
Hi,
On 3/28/25 1:21 PM, Eugenio Perez Martin wrote:
> On Thu, Mar 27, 2025 at 7:42 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>> On 3/26/25 1:33 PM, Eugenio Perez Martin wrote:
>>> On Mon, Mar 24, 2025 at 3:14 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>> On 3/24/25 7:29 PM, Sahil Siddiq wrote:
>>>>> Implement the insertion of available buffers in the descriptor area of
>>>>> packed shadow virtqueues. It takes into account descriptor chains, but
>>>>> does not consider indirect descriptors.
>>>>>
>>>>> Enable the packed SVQ to forward the descriptors to the device.
>>>>>
>>>>> Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
>>>>> ---
>>>>> [...]
>>>>> +
>>>>> +/**
>>>>> + * Write descriptors to SVQ packed vring
>>>>> + *
>>>>> + * @svq: The shadow virtqueue
>>>>> + * @out_sg: The iovec to the guest
>>>>> + * @out_num: Outgoing iovec length
>>>>> + * @in_sg: The iovec from the guest
>>>>> + * @in_num: Incoming iovec length
>>>>> + * @sgs: Cache for hwaddr
>>>>> + * @head: Saves current free_head
>>>>> + */
>>>>> +static void vhost_svq_add_packed(VhostShadowVirtqueue *svq,
>>>>> + const struct iovec *out_sg, size_t out_num,
>>>>> + const struct iovec *in_sg, size_t in_num,
>>>>> + hwaddr *sgs, unsigned *head)
>>>>> +{
>>>>> + uint16_t id, curr, i, head_flags = 0, head_idx;
>>>>> + size_t num = out_num + in_num;
>>>>> + unsigned n;
>>>>> +
>>>>> + struct vring_packed_desc *descs = svq->vring_packed.vring.desc;
>>>>> +
>>>>> + head_idx = svq->vring_packed.next_avail_idx;
>>>>
>>>> Since "svq->vring_packed.next_avail_idx" is part of QEMU internals and not
>>>> stored in guest memory, no endianness conversion is required here, right?
>>>>
>>>
>>> Right!
>>
>> Understood.
>>
>>>>> + i = head_idx;
>>>>> + id = svq->free_head;
>>>>> + curr = id;
>>>>> + *head = id;
>>>>
>>>> Should head be the buffer id or the idx of the descriptor ring where the
>>>> first descriptor of a descriptor chain is inserted?
>>>>
>>>
>>> The buffer id of the *last* descriptor of a chain. See "2.8.6 Next
>>> Flag: Descriptor Chaining" at [1].
>>
>> Ah, yes. The second half of my question is incorrect.
>>
>> The tail descriptor of the chain includes the buffer id. In this implementation
>> we place the same tail buffer id in other locations of the descriptor ring since
>> they will be ignored anyway [1].
>>
>> The explanation below frames my query better.
>>
>>>>> + /* Write descriptors to SVQ packed vring */
>>>>> + for (n = 0; n < num; n++) {
>>>>> + uint16_t flags = cpu_to_le16(svq->vring_packed.avail_used_flags |
>>>>> + (n < out_num ? 0 : VRING_DESC_F_WRITE) |
>>>>> + (n + 1 == num ? 0 : VRING_DESC_F_NEXT));
>>>>> + if (i == head_idx) {
>>>>> + head_flags = flags;
>>>>> + } else {
>>>>> + descs[i].flags = flags;
>>>>> + }
>>>>> +
>>>>> + descs[i].addr = cpu_to_le64(sgs[n]);
>>>>> + descs[i].id = id;
>>>>> + if (n < out_num) {
>>>>> + descs[i].len = cpu_to_le32(out_sg[n].iov_len);
>>>>> + } else {
>>>>> + descs[i].len = cpu_to_le32(in_sg[n - out_num].iov_len);
>>>>> + }
>>>>> +
>>>>> + curr = cpu_to_le16(svq->desc_next[curr]);
>>>>> +
>>>>> + if (++i >= svq->vring_packed.vring.num) {
>>>>> + i = 0;
>>>>> + svq->vring_packed.avail_used_flags ^=
>>>>> + 1 << VRING_PACKED_DESC_F_AVAIL |
>>>>> + 1 << VRING_PACKED_DESC_F_USED;
>>>>> + }
>>>>> + }
>>>>>
>>>>> + if (i <= head_idx) {
>>>>> + svq->vring_packed.avail_wrap_counter ^= 1;
>>>>> + }
>>>>> +
>>>>> + svq->vring_packed.next_avail_idx = i;
>>>>> + svq->shadow_avail_idx = i;
>>>>> + svq->free_head = curr;
>>>>> +
>>>>> + /*
>>>>> + * A driver MUST NOT make the first descriptor in the list
>>>>> + * available before all subsequent descriptors comprising
>>>>> + * the list are made available.
>>>>> + */
>>>>> + smp_wmb();
>>>>> + svq->vring_packed.vring.desc[head_idx].flags = head_flags;
>>>>> }
>>>>>
>>>>> [...]
>>>>> @@ -258,13 +356,22 @@ int vhost_svq_add(VhostShadowVirtqueue *svq, const struct iovec *out_sg,
>>>>> return -EINVAL;
>>>>> }
>>>>>
>>>>> - vhost_svq_add_split(svq, out_sg, out_num, in_sg,
>>>>> - in_num, sgs, &qemu_head);
>>>>> + if (svq->is_packed) {
>>>>> + vhost_svq_add_packed(svq, out_sg, out_num, in_sg,
>>>>> + in_num, sgs, &qemu_head);
>>>>> + } else {
>>>>> + vhost_svq_add_split(svq, out_sg, out_num, in_sg,
>>>>> + in_num, sgs, &qemu_head);
>>>>> + }
>>>>>
>>>>> svq->num_free -= ndescs;
>>>>> svq->desc_state[qemu_head].elem = elem;
>>>>> svq->desc_state[qemu_head].ndescs = ndescs;
>>>>
>>>> *head in vhost_svq_add_packed() is stored in "qemu_head" here.
>>>>
>>>
>>> Sorry I don't get this, can you expand?
>>
>> Sure. In vhost_svq_add(), after the descriptors have been added
>> (either using vhost_svq_add_split or vhost_svq_add_packed),
>> VirtQueueElement elem and ndescs are both saved in the
>> svq->desc_state array. "elem" and "ndescs" are later used when
>> the guest consumes used descriptors from the device in
>> vhost_svq_get_buf_(split|packed).
>>
>> For split vqs, the index of svq->desc where elem and ndescs are
>> saved matches the index of the descriptor ring where the head of
>> the descriptor ring is placed.
>>
>> In vhost_svq_add_split:
>>
>> *head = svq->free_head;
>> [...]
>> avail_idx = svq->shadow_avail_idx & (svq->vring.num - 1);
>> avail->ring[avail_idx] = cpu_to_le16(*head);
>>
>> "qemu_head" in vhost_svq_add gets its value from "*head" in
>> vhost_svq_add_split:
>>
>> svq->desc_state[qemu_head].elem = elem;
>> svq->desc_state[qemu_head].ndescs = ndescs;
>>
>> For packed vq, something similar has to be done. My approach was
>> to have the index of svq->desc_state match the buffer id in the
>> tail of the descriptor ring.
>>
>> The entire chain is written to the descriptor ring in the loop
>> in vhost_svq_add_packed. I am not sure if the index of
>> svq->desc_state should be the buffer id or if it should be a
>> descriptor index ("head_idx" or the index corresponding to the
>> tail of the chain).
>>
>
> I think both approaches should be valid. My advice is to follow
> Linux's code and let it be the tail descriptor id. This descriptor id
> is pushed and popped from vq->free_head in a stack style.
>
> In addition to that, Linux also sets the same id to all the chain
> elements. I think this is useful when dealing with bad devices. In
> particular,
Understood. So far, I have implemented this so it matches the
implementation in Linux.
> QEMU's packed vq implementation looked at the first
> descriptor's id, which is an incorrect behavior.
Are you referring to:
1. svq->desc_state[qemu_head].elem = elem (in vhost_svq_add()), and
2. *head = id (in vhost_svq_add_packed())
According to the virtio spec, the buffer id must be saved in the last
descriptor of the list in the descriptor region [1]. QEMU and Linux [2][3]
both save the value of vq->free_head (instead of the id that precedes
curr [4]) in the descriptor region and use it to index svq->desc_state.
Thanks,
Sahil
[1] https://docs.oasis-open.org/virtio/virtio/v1.3/csd01/virtio-v1.3-csd01.html#x1-780006
[2] https://github.com/torvalds/linux/blob/master/drivers/virtio/virtio_ring.c#L1507
[3] https://github.com/torvalds/linux/blob/master/drivers/virtio/virtio_ring.c#L1563
[4] https://github.com/torvalds/linux/blob/master/drivers/virtio/virtio_ring.c#L1560
* Re: [RFC v5 3/7] vhost: Forward descriptors to device via packed SVQ
2025-04-14 9:37 ` Sahil Siddiq
@ 2025-04-14 15:07 ` Eugenio Perez Martin
2025-04-15 19:10 ` Sahil Siddiq
0 siblings, 1 reply; 44+ messages in thread
From: Eugenio Perez Martin @ 2025-04-14 15:07 UTC (permalink / raw)
To: Sahil Siddiq; +Cc: sgarzare, mst, qemu-devel, sahilcdq
On Mon, Apr 14, 2025 at 11:38 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> On 3/28/25 1:21 PM, Eugenio Perez Martin wrote:
> > On Thu, Mar 27, 2025 at 7:42 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >> On 3/26/25 1:33 PM, Eugenio Perez Martin wrote:
> >>> On Mon, Mar 24, 2025 at 3:14 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>> On 3/24/25 7:29 PM, Sahil Siddiq wrote:
> >>>>> Implement the insertion of available buffers in the descriptor area of
> >>>>> packed shadow virtqueues. It takes into account descriptor chains, but
> >>>>> does not consider indirect descriptors.
> >>>>>
> >>>>> Enable the packed SVQ to forward the descriptors to the device.
> >>>>>
> >>>>> Signed-off-by: Sahil Siddiq <sahilcdq@proton.me>
> >>>>> ---
> >>>>> [...]
> >>>>> +
> >>>>> +/**
> >>>>> + * Write descriptors to SVQ packed vring
> >>>>> + *
> >>>>> + * @svq: The shadow virtqueue
> >>>>> + * @out_sg: The iovec to the guest
> >>>>> + * @out_num: Outgoing iovec length
> >>>>> + * @in_sg: The iovec from the guest
> >>>>> + * @in_num: Incoming iovec length
> >>>>> + * @sgs: Cache for hwaddr
> >>>>> + * @head: Saves current free_head
> >>>>> + */
> >>>>> +static void vhost_svq_add_packed(VhostShadowVirtqueue *svq,
> >>>>> + const struct iovec *out_sg, size_t out_num,
> >>>>> + const struct iovec *in_sg, size_t in_num,
> >>>>> + hwaddr *sgs, unsigned *head)
> >>>>> +{
> >>>>> + uint16_t id, curr, i, head_flags = 0, head_idx;
> >>>>> + size_t num = out_num + in_num;
> >>>>> + unsigned n;
> >>>>> +
> >>>>> + struct vring_packed_desc *descs = svq->vring_packed.vring.desc;
> >>>>> +
> >>>>> + head_idx = svq->vring_packed.next_avail_idx;
> >>>>
> >>>> Since "svq->vring_packed.next_avail_idx" is part of QEMU internals and not
> >>>> stored in guest memory, no endianness conversion is required here, right?
> >>>>
> >>>
> >>> Right!
> >>
> >> Understood.
> >>
> >>>>> + i = head_idx;
> >>>>> + id = svq->free_head;
> >>>>> + curr = id;
> >>>>> + *head = id;
> >>>>
> >>>> Should head be the buffer id or the idx of the descriptor ring where the
> >>>> first descriptor of a descriptor chain is inserted?
> >>>>
> >>>
> >>> The buffer id of the *last* descriptor of a chain. See "2.8.6 Next
> >>> Flag: Descriptor Chaining" at [1].
> >>
> >> Ah, yes. The second half of my question is incorrect.
> >>
> >> The tail descriptor of the chain includes the buffer id. In this implementation
> >> we place the same tail buffer id in other locations of the descriptor ring since
> >> they will be ignored anyway [1].
> >>
> >> The explanation below frames my query better.
> >>
> >>>>> + /* Write descriptors to SVQ packed vring */
> >>>>> + for (n = 0; n < num; n++) {
> >>>>> + uint16_t flags = cpu_to_le16(svq->vring_packed.avail_used_flags |
> >>>>> + (n < out_num ? 0 : VRING_DESC_F_WRITE) |
> >>>>> + (n + 1 == num ? 0 : VRING_DESC_F_NEXT));
> >>>>> + if (i == head_idx) {
> >>>>> + head_flags = flags;
> >>>>> + } else {
> >>>>> + descs[i].flags = flags;
> >>>>> + }
> >>>>> +
> >>>>> + descs[i].addr = cpu_to_le64(sgs[n]);
> >>>>> + descs[i].id = id;
> >>>>> + if (n < out_num) {
> >>>>> + descs[i].len = cpu_to_le32(out_sg[n].iov_len);
> >>>>> + } else {
> >>>>> + descs[i].len = cpu_to_le32(in_sg[n - out_num].iov_len);
> >>>>> + }
> >>>>> +
> >>>>> + curr = cpu_to_le16(svq->desc_next[curr]);
> >>>>> +
> >>>>> + if (++i >= svq->vring_packed.vring.num) {
> >>>>> + i = 0;
> >>>>> + svq->vring_packed.avail_used_flags ^=
> >>>>> + 1 << VRING_PACKED_DESC_F_AVAIL |
> >>>>> + 1 << VRING_PACKED_DESC_F_USED;
> >>>>> + }
> >>>>> + }
> >>>>>
> >>>>> + if (i <= head_idx) {
> >>>>> + svq->vring_packed.avail_wrap_counter ^= 1;
> >>>>> + }
> >>>>> +
> >>>>> + svq->vring_packed.next_avail_idx = i;
> >>>>> + svq->shadow_avail_idx = i;
> >>>>> + svq->free_head = curr;
> >>>>> +
> >>>>> + /*
> >>>>> + * A driver MUST NOT make the first descriptor in the list
> >>>>> + * available before all subsequent descriptors comprising
> >>>>> + * the list are made available.
> >>>>> + */
> >>>>> + smp_wmb();
> >>>>> + svq->vring_packed.vring.desc[head_idx].flags = head_flags;
> >>>>> }
> >>>>>
> >>>>> [...]
> >>>>> @@ -258,13 +356,22 @@ int vhost_svq_add(VhostShadowVirtqueue *svq, const struct iovec *out_sg,
> >>>>> return -EINVAL;
> >>>>> }
> >>>>>
> >>>>> - vhost_svq_add_split(svq, out_sg, out_num, in_sg,
> >>>>> - in_num, sgs, &qemu_head);
> >>>>> + if (svq->is_packed) {
> >>>>> + vhost_svq_add_packed(svq, out_sg, out_num, in_sg,
> >>>>> + in_num, sgs, &qemu_head);
> >>>>> + } else {
> >>>>> + vhost_svq_add_split(svq, out_sg, out_num, in_sg,
> >>>>> + in_num, sgs, &qemu_head);
> >>>>> + }
> >>>>>
> >>>>> svq->num_free -= ndescs;
> >>>>> svq->desc_state[qemu_head].elem = elem;
> >>>>> svq->desc_state[qemu_head].ndescs = ndescs;
> >>>>
> >>>> *head in vhost_svq_add_packed() is stored in "qemu_head" here.
> >>>>
> >>>
> >>> Sorry I don't get this, can you expand?
> >>
> >> Sure. In vhost_svq_add(), after the descriptors have been added
> >> (either using vhost_svq_add_split or vhost_svq_add_packed),
> >> VirtQueueElement elem and ndescs are both saved in the
> >> svq->desc_state array. "elem" and "ndescs" are later used when
> >> the guest consumes used descriptors from the device in
> >> vhost_svq_get_buf_(split|packed).
> >>
> >> For split vqs, the index of svq->desc where elem and ndescs are
> >> saved matches the index of the descriptor ring where the head of
> >> the descriptor ring is placed.
> >>
> >> In vhost_svq_add_split:
> >>
> >> *head = svq->free_head;
> >> [...]
> >> avail_idx = svq->shadow_avail_idx & (svq->vring.num - 1);
> >> avail->ring[avail_idx] = cpu_to_le16(*head);
> >>
> >> "qemu_head" in vhost_svq_add gets its value from "*head" in
> >> vhost_svq_add_split:
> >>
> >> svq->desc_state[qemu_head].elem = elem;
> >> svq->desc_state[qemu_head].ndescs = ndescs;
> >>
> >> For packed vq, something similar has to be done. My approach was
> >> to have the index of svq->desc_state match the buffer id in the
> >> tail of the descriptor ring.
> >>
> >> The entire chain is written to the descriptor ring in the loop
> >> in vhost_svq_add_packed. I am not sure if the index of
> >> svq->desc_state should be the buffer id or if it should be a
> >> descriptor index ("head_idx" or the index corresponding to the
> >> tail of the chain).
> >>
> >
> > I think both approaches should be valid. My advice is to follow
> > Linux's code and let it be the tail descriptor id. This descriptor id
> > is pushed and popped from vq->free_head in a stack style.
> >
> > In addition to that, Linux also sets the same id to all the chain
> > elements. I think this is useful when dealing with bad devices. In
> > particular,
>
> Understood. So far, I have implemented this so it matches the
> implementation in Linux.
>
> > QEMU's packed vq implementation looked at the first
> > desciptor's id, which is an incorrect behavior.
>
> Are you referring to:
>
> 1. svq->desc_state[qemu_head].elem = elem (in vhost_svq_add()), and
> 2. *head = id (in vhost_svq_add_packed())
>
I meant "it used to use the first descriptor id by mistake". It was
fixed in commit 33abfea23959 ("hw/virtio: Fix obtain the buffer id
from the last descriptor"). It is better to set the descriptor id in
all the descriptors of the chain, so if QEMU does not contain this
patch in the nested VM case it can still work with this version.
* Re: [RFC v5 3/7] vhost: Forward descriptors to device via packed SVQ
2025-04-14 15:07 ` Eugenio Perez Martin
@ 2025-04-15 19:10 ` Sahil Siddiq
0 siblings, 0 replies; 44+ messages in thread
From: Sahil Siddiq @ 2025-04-15 19:10 UTC (permalink / raw)
To: Eugenio Perez Martin; +Cc: sgarzare, mst, qemu-devel, sahilcdq
Hi,
On 4/14/25 8:37 PM, Eugenio Perez Martin wrote:
> On Mon, Apr 14, 2025 at 11:38 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>> On 3/28/25 1:21 PM, Eugenio Perez Martin wrote:
>>> On Thu, Mar 27, 2025 at 7:42 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>> On 3/26/25 1:33 PM, Eugenio Perez Martin wrote:
>>>> [...]
>>> I think both approaches should be valid. My advice is to follow
>>> Linux's code and let it be the tail descriptor id. This descriptor id
>>> is pushed and popped from vq->free_head in a stack style.
>>>
>>> In addition to that, Linux also sets the same id to all the chain
>>> elements. I think this is useful when dealing with bad devices. In
>>> particular,
>>
>> Understood. So far, I have implemented this so it matches the
>> implementation in Linux.
>>
>>> QEMU's packed vq implementation looked at the first
>>> descriptor's id, which is an incorrect behavior.
>>
>> Are you referring to:
>>
>> 1. svq->desc_state[qemu_head].elem = elem (in vhost_svq_add()), and
>> 2. *head = id (in vhost_svq_add_packed())
>>
>
> I meant "it used to use the first descriptor id by mistake". It was
> fixed in commit 33abfea23959 ("hw/virtio: Fix obtain the buffer id
> from the last descriptor"). It is better to set the descriptor id in
> all the descriptors of the chain, so if QEMU does not contain this
> patch in the nested VM case it can still work with this version.
>
Oh, ok. I have understood this now.
Thanks,
Sahil
* Re: [RFC v5 0/7] Add packed format to shadow virtqueue
2025-04-14 9:20 ` Sahil Siddiq
@ 2025-04-15 19:20 ` Sahil Siddiq
2025-04-16 7:20 ` Eugenio Perez Martin
1 sibling, 0 replies; 44+ messages in thread
From: Sahil Siddiq @ 2025-04-15 19:20 UTC (permalink / raw)
To: Eugenio Perez Martin; +Cc: sgarzare, mst, qemu-devel, sahilcdq
Hi,
On 4/14/25 2:50 PM, Sahil Siddiq wrote:
> On 3/26/25 1:05 PM, Eugenio Perez Martin wrote:
>> On Mon, Mar 24, 2025 at 2:59 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>> I managed to fix a few issues while testing this patch series.
>>> There is still one issue that I am unable to resolve. I thought
>>> I would send this patch series for review in case I have missed
>>> something.
>>>
>>> The issue is that this patch series does not work every time. I
>>> am able to ping L0 from L2 and vice versa via packed SVQ when it
>>> works.
>>
>> So we're on a very good track then!
>>
>>> When this doesn't work, both VMs throw a "Destination Host
>>> Unreachable" error. This is sometimes (not always) accompanied
>>> by the following kernel error (thrown by L2-kernel):
>>>
>>> virtio_net virtio1: output.0:id 1 is not a head!
>>>
>>
>> How many packets have been sent or received before hitting this? If
>> the answer to that is "the vq size", maybe there is a bug in the code
>> that handles the wraparound of the packed vq, as the used and avail
>> flags need to be twisted. You can count them in the SVQ code.
>
> I did a lot more testing. This issue is quite unpredictable in terms
> of the time at which it appears after booting L2. So far, it almost
> always appears after booting L2. Even when pinging works, this issue
> appears after several seconds of pinging.
>
> The total number of svq descriptors varied in every test run. But in
> every case, all 256 indices were filled in the descriptor region for
> vq with vq_idx = 0. This is the RX vq, right? This was filled while L2
> was booting. In the case when the ctrl vq is disabled, I am not sure
> what is responsible for filling the vqs in the data plane during
> booting.
>
> =====
> The issue is hit most frequently when the following command is run
> in L0:
> $ ip addr add 111.1.1.1/24 dev tap0
> $ ip link set tap0 up
>
> or, running the following in L2:
> # ip addr add 111.1.1.2/24 dev eth0
>
> The other vq (vq_idx=1) is not filled completely before the issue is
> hit. I have been noting down the numbers and here is an example:
>
> 295 descriptors were added individually to the queues i.e., there were no chains (vhost_svq_add_packed)
> |_ 256 additions in vq_idx = 0, all with unique ids
> |---- 27 descriptors (ids 0 through 26) were received later from the device (vhost_svq_get_buf_packed)
> |_ 39 additions in vq_idx = 1
> |_ 13 descriptors had id = 0
> |_ 26 descriptors had id = 1
> |---- All descriptors were received at some point from the device (vhost_svq_get_buf_packed)
>
> There was one case in which vq_idx=0 had wrapped around. I verified
> that flags were set appropriately during the wrap (avail and used flags
> were flipped as expected).
>
> =====
> The next common situation where this issue is hit is during startup.
> Before L2 can finish booting successfully, this error is thrown:
>
> virtio_net virtio1: output.0:id 0 is not a head!
>
> 258 descriptors were added individually to the queues during startup (there were no chains) (vhost_svq_add_packed)
> |_ 256 additions in vq_idx = 0, all with unique ids
> |---- None of them were received by the device (vhost_svq_get_buf_packed)
> |_ 2 additions in vq_idx = 1
> |_ id = 0 in index 0
> |_ id = 1 in index 1
> |---- Both descriptors were received at some point during startup from the device (vhost_svq_get_buf_packed)
>
> =====
> Another case is after several seconds of pinging L0 from L2.
>
> [ 99.034114] virtio_net virtio1: output.0:id 0 is not a head!
>
> 366 descriptors were added individually to the queues i.e., there were no chains (vhost_svq_add_packed)
> |_ 289 additions in vq_idx = 0, wrap-around was observed with avail and used flags inverted for 33 descriptors
> | |---- 40 descriptors (ids 0 through 39) were received from the device (vhost_svq_get_buf_packed)
> |_ 77 additions in vq_idx = 1
> |_ 76 descriptors had id = 0
> |_ 1 descriptor had id = 1
> |---- all 77 descriptors were received at some point from the device (vhost_svq_get_buf_packed)
>
> I am not entirely sure now if there's an issue in the packed vq
> implementation in QEMU or if this is being caused due to some sort
> of race condition in linux.
After some more testing, I think the issue is indeed in the current
implementation of packed vq in QEMU. The kernel does not crash when
using packed vqs with x-svq=false. I have an idea that might help
find the issue. It involves debugging the Linux kernel. I'll try this
out and let you know how it goes.
> "id is not a head" is being thrown because vq->packed.desc_state[id].data
> doesn't exist for the corresponding id in Linux [1]. But QEMU seems to have
> stored some data for this id via vhost_svq_add() [2]. Linux sets the value
> of vq->packed.desc_state[id].data in its version of virtqueue_add_packed() [3].
> [...]
Thanks,
Sahil
* Re: [RFC v5 0/7] Add packed format to shadow virtqueue
2025-04-14 9:20 ` Sahil Siddiq
2025-04-15 19:20 ` Sahil Siddiq
@ 2025-04-16 7:20 ` Eugenio Perez Martin
2025-05-14 6:21 ` Sahil Siddiq
1 sibling, 1 reply; 44+ messages in thread
From: Eugenio Perez Martin @ 2025-04-16 7:20 UTC (permalink / raw)
To: Sahil Siddiq; +Cc: sgarzare, mst, qemu-devel, sahilcdq, Jason Wang
On Mon, Apr 14, 2025 at 11:20 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> On 3/26/25 1:05 PM, Eugenio Perez Martin wrote:
> > On Mon, Mar 24, 2025 at 2:59 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >> I managed to fix a few issues while testing this patch series.
> >> There is still one issue that I am unable to resolve. I thought
> >> I would send this patch series for review in case I have missed
> >> something.
> >>
> >> The issue is that this patch series does not work every time. I
> >> am able to ping L0 from L2 and vice versa via packed SVQ when it
> >> works.
> >
> > So we're on a very good track then!
> >
> >> When this doesn't work, both VMs throw a "Destination Host
> >> Unreachable" error. This is sometimes (not always) accompanied
> >> by the following kernel error (thrown by L2-kernel):
> >>
> >> virtio_net virtio1: output.0:id 1 is not a head!
> >>
> >
> > How many packets have been sent or received before hitting this? If
> > the answer to that is "the vq size", maybe there is a bug in the code
> > that handles the wraparound of the packed vq, as the used and avail
> > flags need to be twisted. You can count them in the SVQ code.
>
> I did a lot more testing. This issue is quite unpredictable in terms
> of the time at which it appears after booting L2. So far, it almost
> always appears after booting L2. Even when pinging works, this issue
> appears after several seconds of pinging.
>
Maybe you can speed it up with ping -f?
> The total number of svq descriptors varied in every test run. But in
> every case, all 256 indices were filled in the descriptor region for
> vq with vq_idx = 0. This is the RX vq, right?
Right!
> This was filled while L2
> was booting. In the case when the ctrl vq is disabled, I am not sure
> what is responsible for filling the vqs in the data plane during
> booting.
>
The nested guest's driver fills the rx queue at startup. The nested
guest then kicks, and SVQ receives the descriptors, copies them to
the shadow virtqueue, and kicks L0 QEMU.
> =====
> The issue is hit most frequently when the following command is run
> in L0:
> $ ip addr add 111.1.1.1/24 dev tap0
> $ ip link set tap0 up
>
> or, running the following in L2:
> # ip addr add 111.1.1.2/24 dev eth0
>
I guess those are able to start the network, aren't they?
> The other vq (vq_idx=1) is not filled completely before the issue is
> hit.
> I have been noting down the numbers and here is an example:
>
> 295 descriptors were added individually to the queues i.e., there were no chains (vhost_svq_add_packed)
> |_ 256 additions in vq_idx = 0, all with unique ids
> |---- 27 descriptors (ids 0 through 26) were received later from the device (vhost_svq_get_buf_packed)
> |_ 39 additions in vq_idx = 1
> |_ 13 descriptors had id = 0
> |_ 26 descriptors had id = 1
> |---- All descriptors were received at some point from the device (vhost_svq_get_buf_packed)
>
> There was one case in which vq_idx=0 had wrapped around. I verified
> that flags were set appropriately during the wrap (avail and used flags
> were flipped as expected).
>
OK, sounds like you're able to reach it before filling the queue. I'd
go for debugging notifications for this one then. More on this below.
> =====
> The next common situation where this issue is hit is during startup.
> Before L2 can finish booting successfully, this error is thrown:
>
> virtio_net virtio1: output.0:id 0 is not a head!
>
> 258 descriptors were added individually to the queues during startup (there were no chains) (vhost_svq_add_packed)
> |_ 256 additions in vq_idx = 0, all with unique ids
> |---- None of them were received by the device (vhost_svq_get_buf_packed)
> |_ 2 additions in vq_idx = 1
> |_ id = 0 in index 0
> |_ id = 1 in index 1
> |---- Both descriptors were received at some point during startup from the device (vhost_svq_get_buf_packed)
>
> =====
> Another case is after several seconds of pinging L0 from L2.
>
> [ 99.034114] virtio_net virtio1: output.0:id 0 is not a head!
>
So the L2 guest sees a descriptor it has not made available
previously. This can happen because SVQ returns the same descriptor
twice, or because it doesn't fill the id or flags properly. It could
also be caused by a missing write-ordering barrier in the ring, but I
don't see anything obviously wrong by looking at the code.
> 366 descriptors were added individually to the queues i.e., there were no chains (vhost_svq_add_packed)
> |_ 289 additions in vq_idx = 0, wrap-around was observed with avail and used flags inverted for 33 descriptors
> | |---- 40 descriptors (ids 0 through 39) were received from the device (vhost_svq_get_buf_packed)
> |_ 77 additions in vq_idx = 1
> |_ 76 descriptors had id = 0
> |_ 1 descriptor had id = 1
> |---- all 77 descriptors were received at some point from the device (vhost_svq_get_buf_packed)
>
> I am not entirely sure now if there's an issue in the packed vq
> implementation in QEMU or if this is being caused due to some sort
> of race condition in linux.
>
> "id is not a head" is being thrown because vq->packed.desc_state[id].data
> doesn't exist for the corresponding id in Linux [1]. But QEMU seems to have
> stored some data for this id via vhost_svq_add() [2]. Linux sets the value
> of vq->packed.desc_state[id].data in its version of virtqueue_add_packed() [3].
>
Let's keep debugging further. Can you trace the ids that the L2 kernel
makes available, and then the ones that it uses? At the same time, can
you trace the ids that the SVQ sees in vhost_svq_get_buf and the ones
that it flushes? This allows us to check the set of available
descriptors at any given time.
> >> This error is not thrown always, but when it is thrown, the id
> >> varies. This is invariably followed by a soft lockup:
> >> [...]
> >> [ 284.662292] Call Trace:
> >> [ 284.662292] <IRQ>
> >> [ 284.662292] ? watchdog_timer_fn+0x1e6/0x270
> >> [ 284.662292] ? __pfx_watchdog_timer_fn+0x10/0x10
> >> [ 284.662292] ? __hrtimer_run_queues+0x10f/0x2b0
> >> [ 284.662292] ? hrtimer_interrupt+0xf8/0x230
> >> [ 284.662292] ? __sysvec_apic_timer_interrupt+0x4d/0x140
> >> [ 284.662292] ? sysvec_apic_timer_interrupt+0x39/0x90
> >> [ 284.662292] ? asm_sysvec_apic_timer_interrupt+0x1a/0x20
> >> [ 284.662292] ? virtqueue_enable_cb_delayed+0x115/0x150
> >> [ 284.662292] start_xmit+0x2a6/0x4f0 [virtio_net]
> >> [ 284.662292] ? netif_skb_features+0x98/0x300
> >> [ 284.662292] dev_hard_start_xmit+0x61/0x1d0
> >> [ 284.662292] sch_direct_xmit+0xa4/0x390
> >> [ 284.662292] __dev_queue_xmit+0x84f/0xdc0
> >> [ 284.662292] ? nf_hook_slow+0x42/0xf0
> >> [ 284.662292] ip_finish_output2+0x2b8/0x580
> >> [ 284.662292] igmp_ifc_timer_expire+0x1d5/0x430
> >> [ 284.662292] ? __pfx_igmp_ifc_timer_expire+0x10/0x10
> >> [ 284.662292] call_timer_fn+0x21/0x130
> >> [ 284.662292] ? __pfx_igmp_ifc_timer_expire+0x10/0x10
> >> [ 284.662292] __run_timers+0x21f/0x2b0
> >> [ 284.662292] run_timer_softirq+0x1d/0x40
> >> [ 284.662292] __do_softirq+0xc9/0x2c8
> >> [ 284.662292] __irq_exit_rcu+0xa6/0xc0
> >> [ 284.662292] sysvec_apic_timer_interrupt+0x72/0x90
> >> [ 284.662292] </IRQ>
> >> [ 284.662292] <TASK>
> >> [ 284.662292] asm_sysvec_apic_timer_interrupt+0x1a/0x20
> >> [ 284.662292] RIP: 0010:pv_native_safe_halt+0xf/0x20
> >> [ 284.662292] Code: 22 d7 c3 cc cc cc cc 0f 1f 40 00 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa eb 07 0f 00 2d 53 75 3f 00 fb f4 <c3> cc c0
> >> [ 284.662292] RSP: 0018:ffffb8f0000b3ed8 EFLAGS: 00000212
> >> [ 284.662292] RAX: 0000000000000001 RBX: 0000000000000001 RCX: 0000000000000000
> >> [ 284.662292] RDX: 4000000000000000 RSI: 0000000000000083 RDI: 00000000000289ec
> >> [ 284.662292] RBP: ffff96f200810000 R08: 0000000000000000 R09: 0000000000000001
> >> [ 284.662292] R10: 0000000000000001 R11: 0000000000000000 R12: 0000000000000000
> >> [ 284.662292] R13: 0000000000000000 R14: ffff96f200810000 R15: 0000000000000000
> >> [ 284.662292] default_idle+0x9/0x20
> >> [ 284.662292] default_idle_call+0x2c/0xe0
> >> [ 284.662292] do_idle+0x226/0x270
> >> [ 284.662292] cpu_startup_entry+0x2a/0x30
> >> [ 284.662292] start_secondary+0x11e/0x140
> >> [ 284.662292] secondary_startup_64_no_verify+0x184/0x18b
> >> [ 284.662292] </TASK>
> >>
> >> The soft lockup seems to happen in
> >> drivers/net/virtio_net.c:start_xmit() [1].
> >>
> >
> > Maybe it gets stuck in the do {} while(...
> > !virtqueue_enable_cb_delayed()) ? you can add a printk in
> > virtqueue_enable_cb_delayed return and check if it matches with the
> > speed you're sending or receiving ping. For example, if ping is each
> > second, you should not see a lot of traces.
> >
> > If this does not work I'd try never disabling notifications, both in
> > the kernel and SVQ, and check if that works.
>
> In order to disable notifications, will something have to be commented
> out in the implementation?
>
To *never* disable notifications you should comment out the SVQ calls
to virtio_queue_set_notification and vhost_svq_disable_notification.
This way the other sides (the L0 device in QEMU and the guest) are forced
to always notify SVQ, and we can check if that solves the first issue.
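Something along these lines, as a sketch (the exact call sites and hunk
locations in hw/virtio/vhost-shadow-virtqueue.c are assumptions; adjust
to your tree):

```diff
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ illustrative hunk: wherever SVQ turns notifications off
-    vhost_svq_disable_notification(svq);
+    /* debugging: never disable notifications */
+    /* vhost_svq_disable_notification(svq); */
@@ illustrative hunk: likewise for the VirtQueue side
-    virtio_queue_set_notification(svq->vq, false);
+    /* debugging: keep notifications always enabled */
+    virtio_queue_set_notification(svq->vq, true);
```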
> >> [...]
> >> QEMU command to boot L1:
> >>
> >> $ sudo ./qemu/build/qemu-system-x86_64 \
> >> -enable-kvm \
> >> -drive file=//home/valdaarhun/valdaarhun/qcow2_img/L1.qcow2,media=disk,if=virtio \
> >> -net nic,model=virtio \
> >> -net user,hostfwd=tcp::2222-:22 \
> >> -device intel-iommu,snoop-control=on \
> >> -device virtio-net-pci,netdev=net0,disable-legacy=on,disable-modern=off,iommu_platform=on,guest_uso4=off,guest_uso6=off,host_uso=off,guest_announce=off,mq=off,ctrl_vq=off,ctrl_rx=off,ctrl_vlan=off,ctrl_mac_addr=off,packed=on,event_idx=off,bus=pcie.0,addr=0x4 \
> >> -netdev tap,id=net0,script=no,downscript=no,vhost=off \
> >> -nographic \
> >> -m 8G \
> >> -smp 4 \
> >> -M q35 \
> >> -cpu host 2>&1 | tee vm.log
> >>
>
> I have added "-device virtio-net-pci,indirect_desc=off,queue_reset=off"
> to the L0 QEMU command to boot L1.
>
> Thanks,
> Sahil
>
> [1] https://github.com/torvalds/linux/blob/master/drivers/virtio/virtio_ring.c#L1762
> [2] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-shadow-virtqueue.c#L290
> [3] https://github.com/torvalds/linux/blob/master/drivers/virtio/virtio_ring.c#L1564
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC v5 0/7] Add packed format to shadow virtqueue
2025-04-16 7:20 ` Eugenio Perez Martin
@ 2025-05-14 6:21 ` Sahil Siddiq
2025-05-15 6:19 ` Eugenio Perez Martin
0 siblings, 1 reply; 44+ messages in thread
From: Sahil Siddiq @ 2025-05-14 6:21 UTC (permalink / raw)
To: Eugenio Perez Martin; +Cc: sgarzare, mst, qemu-devel, sahilcdq
Hi,
Apologies, I haven't been in touch for a while. I have an update that
I would like to give.
On 4/16/25 12:50 PM, Eugenio Perez Martin wrote:
> On Mon, Apr 14, 2025 at 11:20 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>
>> Hi,
>>
>> On 3/26/25 1:05 PM, Eugenio Perez Martin wrote:
>>> On Mon, Mar 24, 2025 at 2:59 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>> I managed to fix a few issues while testing this patch series.
>>>> There is still one issue that I am unable to resolve. I thought
>>>> I would send this patch series for review in case I have missed
>>>> something.
>>>>
>>>> The issue is that this patch series does not work every time. I
>>>> am able to ping L0 from L2 and vice versa via packed SVQ when it
>>>> works.
>>>
>>> So we're on a very good track then!
>>>
>>>> When this doesn't work, both VMs throw a "Destination Host
>>>> Unreachable" error. This is sometimes (not always) accompanied
>>>> by the following kernel error (thrown by L2-kernel):
>>>>
>>>> virtio_net virtio1: output.0:id 1 is not a head!
>>>>
>>>
>>> How many packets have been sent or received before hitting this? If
>>> the answer to that is "the vq size", maybe there is a bug in the code
>>> that handles the wraparound of the packed vq, as the used and avail
>>> flags need to be twisted. You can count them in the SVQ code.
>>
>> I did a lot more testing. This issue is quite unpredictable in terms
>> of the time at which it appears after booting L2. So far, it almost
>> always appears after booting L2. Even when pinging works, this issue
>> appears after several seconds of pinging.
>>
>
> Maybe you can speed it up with ping -f?
Thank you, I was able to run tests much faster with the -f option. So
far I have noticed that the RX queue does not cause problems. When all
the descriptors are used, it is able to wrap around without issues.
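For what it's worth, the wrap-around behavior being verified here can be
modeled standalone like this (a toy model of the VIRTIO 1.1 packed-ring
rule that the avail/used flag bits invert on each wrap; names are
illustrative, not code from the patch series):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VRING_PACKED_DESC_F_AVAIL (1 << 7)
#define VRING_PACKED_DESC_F_USED  (1 << 15)

struct model_ring {
    uint16_t num;        /* ring size, e.g. 256 */
    uint16_t next_avail; /* next slot the driver writes */
    bool avail_wrap;     /* driver's avail wrap counter, starts at 1 */
};

/* Flags the driver must write for the next available descriptor:
 * the avail bit equals the wrap counter, the used bit its inverse. */
static uint16_t model_avail_flags(const struct model_ring *r)
{
    return (r->avail_wrap ? VRING_PACKED_DESC_F_AVAIL : 0) |
           (r->avail_wrap ? 0 : VRING_PACKED_DESC_F_USED);
}

/* Advance one descriptor; flip the wrap counter when passing the end,
 * so descriptors written after the wrap carry inverted flag bits. */
static void model_add_one(struct model_ring *r)
{
    if (++r->next_avail >= r->num) {
        r->next_avail = 0;
        r->avail_wrap = !r->avail_wrap;
    }
}
```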
>> The total number of svq descriptors varied in every test run. But in
>> every case, all 256 indices were filled in the descriptor region for
>> vq with vq_idx = 0. This is the RX vq, right?
>
> Right!
The TX queue seems to be problematic. More on this below.
>> This was filled while L2
>> was booting. In the case when the ctrl vq is disabled, I am not sure
>> what is responsible for filling the vqs in the data plane during
>> booting.
>>
> The nested guest's driver fills the rx queue at startup. After that,
> that nested guest kicks and SVQ receives the descriptors. It copies
> the descriptors to the shadow virtqueue and then kicks L0 QEMU.
Understood.
>> =====
>> The issue is hit most frequently when the following command is run
>> in L0:
>> $ ip addr add 111.1.1.1/24 dev tap0
>> $ ip link set tap0 up
>>
>> or, running the following in L2:
>> # ip addr add 111.1.1.2/24 dev eth0
>>
>
> I guess those are able to start the network, aren't they?
Yes, that's correct.
>> The other vq (vq_idx=1) is not filled completely before the issue is
>> hit.
>> I have been noting down the numbers and here is an example:
>>
>> 295 descriptors were added individually to the queues i.e., there were no chains (vhost_svq_add_packed)
>> |_ 256 additions in vq_idx = 0, all with unique ids
>> |---- 27 descriptors (ids 0 through 26) were received later from the device (vhost_svq_get_buf_packed)
>> |_ 39 additions in vq_idx = 1
>> |_ 13 descriptors had id = 0
>> |_ 26 descriptors had id = 1
>> |---- All descriptors were received at some point from the device (vhost_svq_get_buf_packed)
>>
>> There was one case in which vq_idx=0 had wrapped around. I verified
>> that flags were set appropriately during the wrap (avail and used flags
>> were flipped as expected).
>>
>
> Ok sounds like you're able to reach it before filling the queue. I'd
> go for debugging notifications for this one then. More on this below.
>
>> =====
>> The next common situation where this issue is hit is during startup.
>> Before L2 can finish booting successfully, this error is thrown:
>>
>> virtio_net virtio1: output.0:id 0 is not a head!
>>
>> 258 descriptors were added individually to the queues during startup (there were no chains) (vhost_svq_add_packed)
>> |_ 256 additions in vq_idx = 0, all with unique ids
>> |---- None of them were received by the device (vhost_svq_get_buf_packed)
>> |_ 2 additions in vq_idx = 1
>> |_ id = 0 in index 0
>> |_ id = 1 in index 1
>> |---- Both descriptors were received at some point during startup from the device (vhost_svq_get_buf_packed)
>>
>> =====
>> Another case is after several seconds of pinging L0 from L2.
>>
>> [ 99.034114] virtio_net virtio1: output.0:id 0 is not a head!
>>
>
> So the L2 guest sees a descriptor it has not made available
> previously. This can be caused because SVQ returns the same descriptor
> twice, or it doesn't fill the id or flags properly. It can also be
> caused because we're not protecting the write ordering in the ring,
> but I don't see anything obviously wrong by looking at the code.
>
>> 366 descriptors were added individually to the queues i.e., there were no chains (vhost_svq_add_packed)
>> |_ 289 additions in vq_idx = 0, wrap-around was observed with avail and used flags inverted for 33 descriptors
>> | |---- 40 descriptors (ids 0 through 39) were received from the device (vhost_svq_get_buf_packed)
>> |_ 77 additions in vq_idx = 1
>> |_ 76 descriptors had id = 0
>> |_ 1 descriptor had id = 1
>> |---- all 77 descriptors were received at some point from the device (vhost_svq_get_buf_packed)
>>
>> I am not entirely sure now if there's an issue in the packed vq
>> implementation in QEMU or if this is being caused due to some sort
>> of race condition in linux.
>>
>> "id is not a head" is being thrown because vq->packed.desc_state[id].data
>> doesn't exist for the corresponding id in Linux [1]. But QEMU seems to have
>> stored some data for this id via vhost_svq_add() [2]. Linux sets the value
>> of vq->packed.desc_state[id].data in its version of virtqueue_add_packed() [3].
>>
>
> Let's keep debugging further. Can you trace the ids that the L2 kernel
> makes available, and then the ones that it uses? At the same time, can
> you trace the ids that the svq sees in vhost_svq_get_buf and the ones
> that flushes? This allows us to check the set of available descriptors
> at any given time.
>
In the Linux kernel, I am printing which descriptor is received in which
queue in drivers/virtio/virtio_ring.c:virtqueue_get_buf_ctx_packed() [1].
I see the following lines getting printed for the TX queue:
[ 192.101591] output.0 -> id: 0
[ 213.737417] output.0 -> id: 0
[ 213.738714] output.0 -> id: 1
[ 213.740093] output.0 -> id: 0
[ 213.741521] virtio_net virtio1: output.0:id 0 is not a head!
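An instrumentation along these lines would produce a trace like the one
above (the exact placement inside virtqueue_get_buf_ctx_packed() is an
assumption):

```diff
--- a/drivers/virtio/virtio_ring.c
+++ b/drivers/virtio/virtio_ring.c
@@ virtqueue_get_buf_ctx_packed() (approximate placement)
 	id = le16_to_cpu(vq->packed.vring.desc[last_used].id);
+	/* debug: which id the driver sees the device return */
+	pr_info("%s -> id: %u\n", vq->vq.name, id);
```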
In QEMU's hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_add_packed(), I am
printing the head_idx, id, len, flags and vq_idx. Just before the crash,
the following lines are printed:
head_idx: 157, id: 0, len: 122, flags: 32768, vq idx: 1
head_idx: 158, id: 0, len: 122, flags: 32768, vq idx: 1
head_idx: 159, id: 0, len: 66, flags: 32768, vq idx: 1
head_idx: 160, id: 1, len: 102, flags: 32768, vq idx: 1
In QEMU's hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf_packed(), I
am printing the id, last_used index, used wrap counter and vq_idx. These
are the lines just before the crash:
id: 0, last_used: 158, used_wrap_counter: 0, vq idx: 1
id: 0, last_used: 159, used_wrap_counter: 0, vq idx: 1
id: 0, last_used: 160, used_wrap_counter: 0, vq idx: 1
id: 1, last_used: 161, used_wrap_counter: 0, vq idx: 1
In QEMU's hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_flush() [2], I am tracking
the values of i and vq_idx in the outer do..while() loop as well as in the inner
while(true) loop. The value of i is used as the "idx" in virtqueue_fill() [3] and
as "count" in virtqueue_flush() [4]. Lines printed in each iteration of the outer
do...while loop are enclosed between "===" lines. These are the lines just before
the crash:
===
in_loop: i: 0, vq idx: 1
in_loop: i: 1, vq idx: 1
out_loop: i: 1, vq idx: 1
===
in_loop: i: 0, vq idx: 1
in_loop: i: 1, vq idx: 1
out_loop: i: 1, vq idx: 1
===
in_loop: i: 0, vq idx: 1
in_loop: i: 1, vq idx: 1
in_loop: i: 2, vq idx: 1
out_loop: i: 2, vq idx: 1
===
in_loop: i: 0, vq idx: 1
out_loop: i: 0, vq idx: 1
I have only investigated which descriptors the kernel uses. I'll also check
which descriptors are made available by the kernel. I'll let you know what I
find.
Thanks,
Sahil
[1] https://github.com/torvalds/linux/blob/master/drivers/virtio/virtio_ring.c#L1727
[2] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-shadow-virtqueue.c#L499
[3] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/virtio.c#L1008
[4] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/virtio.c#L1147
* Re: [RFC v5 0/7] Add packed format to shadow virtqueue
2025-05-14 6:21 ` Sahil Siddiq
@ 2025-05-15 6:19 ` Eugenio Perez Martin
2025-06-26 5:16 ` Sahil Siddiq
0 siblings, 1 reply; 44+ messages in thread
From: Eugenio Perez Martin @ 2025-05-15 6:19 UTC (permalink / raw)
To: Sahil Siddiq; +Cc: sgarzare, mst, qemu-devel, sahilcdq
On Wed, May 14, 2025 at 8:22 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> Apologies, I haven't been in touch for a while. I have an update that
> I would like to give.
>
> On 4/16/25 12:50 PM, Eugenio Perez Martin wrote:
> > On Mon, Apr 14, 2025 at 11:20 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>
> >> Hi,
> >>
> >> On 3/26/25 1:05 PM, Eugenio Perez Martin wrote:
> >>> On Mon, Mar 24, 2025 at 2:59 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>> I managed to fix a few issues while testing this patch series.
> >>>> There is still one issue that I am unable to resolve. I thought
> >>>> I would send this patch series for review in case I have missed
> >>>> something.
> >>>>
> >>>> The issue is that this patch series does not work every time. I
> >>>> am able to ping L0 from L2 and vice versa via packed SVQ when it
> >>>> works.
> >>>
> >>> So we're on a very good track then!
> >>>
> >>>> When this doesn't work, both VMs throw a "Destination Host
> >>>> Unreachable" error. This is sometimes (not always) accompanied
> >>>> by the following kernel error (thrown by L2-kernel):
> >>>>
> >>>> virtio_net virtio1: output.0:id 1 is not a head!
> >>>>
> >>>
> >>> How many packets have been sent or received before hitting this? If
> >>> the answer to that is "the vq size", maybe there is a bug in the code
> >>> that handles the wraparound of the packed vq, as the used and avail
> >>> flags need to be twisted. You can count them in the SVQ code.
> >>
> >> I did a lot more testing. This issue is quite unpredictable in terms
> >> of the time at which it appears after booting L2. So far, it almost
> >> always appears after booting L2. Even when pinging works, this issue
> >> appears after several seconds of pinging.
> >>
> >
> > Maybe you can speed it up with ping -f?
>
> Thank you, I was able to run tests much faster with the -f option. So
> far I have noticed that the RX queue does not give problems. When all
> the descriptors are used it is able to wrap around without issues.
>
> >> The total number of svq descriptors varied in every test run. But in
> >> every case, all 256 indices were filled in the descriptor region for
> >> vq with vq_idx = 0. This is the RX vq, right?
> >
> > Right!
>
> The TX queue seems to be problematic. More on this below.
>
> >> This was filled while L2
> >> was booting. In the case when the ctrl vq is disabled, I am not sure
> >> what is responsible for filling the vqs in the data plane during
> >> booting.
> >>
> > The nested guest's driver fills the rx queue at startup. After that,
> > that nested guest kicks and SVQ receives the descriptors. It copies
> > the descriptors to the shadow virtqueue and then kicks L0 QEMU.
>
> Understood.
>
> >> =====
> >> The issue is hit most frequently when the following command is run
> >> in L0:
> >> $ ip addr add 111.1.1.1/24 dev tap0
> >> $ ip link set tap0 up
> >>
> >> or, running the following in L2:
> >> # ip addr add 111.1.1.2/24 dev eth0
> >>
> >
> > I guess those are able to start the network, aren't they?
>
> Yes, that's correct.
>
> >> The other vq (vq_idx=1) is not filled completely before the issue is
> >> hit.
> >> I have been noting down the numbers and here is an example:
> >>
> >> 295 descriptors were added individually to the queues i.e., there were no chains (vhost_svq_add_packed)
> >> |_ 256 additions in vq_idx = 0, all with unique ids
> >> |---- 27 descriptors (ids 0 through 26) were received later from the device (vhost_svq_get_buf_packed)
> >> |_ 39 additions in vq_idx = 1
> >> |_ 13 descriptors had id = 0
> >> |_ 26 descriptors had id = 1
> >> |---- All descriptors were received at some point from the device (vhost_svq_get_buf_packed)
> >>
> >> There was one case in which vq_idx=0 had wrapped around. I verified
> >> that flags were set appropriately during the wrap (avail and used flags
> >> were flipped as expected).
> >>
> >
> > Ok sounds like you're able to reach it before filling the queue. I'd
> > go for debugging notifications for this one then. More on this below.
> >
> >> =====
> >> The next common situation where this issue is hit is during startup.
> >> Before L2 can finish booting successfully, this error is thrown:
> >>
> >> virtio_net virtio1: output.0:id 0 is not a head!
> >>
> >> 258 descriptors were added individually to the queues during startup (there were no chains) (vhost_svq_add_packed)
> >> |_ 256 additions in vq_idx = 0, all with unique ids
> >> |---- None of them were received by the device (vhost_svq_get_buf_packed)
> >> |_ 2 additions in vq_idx = 1
> >> |_ id = 0 in index 0
> >> |_ id = 1 in index 1
> >> |---- Both descriptors were received at some point during startup from the device (vhost_svq_get_buf_packed)
> >>
> >> =====
> >> Another case is after several seconds of pinging L0 from L2.
> >>
> >> [ 99.034114] virtio_net virtio1: output.0:id 0 is not a head!
> >>
> >
> > So the L2 guest sees a descriptor it has not made available
> > previously. This can be caused because SVQ returns the same descriptor
> > twice, or it doesn't fill the id or flags properly. It can also be
> > caused because we're not protecting the write ordering in the ring,
> > but I don't see anything obviously wrong by looking at the code.
> >
> >> 366 descriptors were added individually to the queues i.e., there were no chains (vhost_svq_add_packed)
> >> |_ 289 additions in vq_idx = 0, wrap-around was observed with avail and used flags inverted for 33 descriptors
> >> | |---- 40 descriptors (ids 0 through 39) were received from the device (vhost_svq_get_buf_packed)
> >> |_ 77 additions in vq_idx = 1
> >> |_ 76 descriptors had id = 0
> >> |_ 1 descriptor had id = 1
> >> |---- all 77 descriptors were received at some point from the device (vhost_svq_get_buf_packed)
> >>
> >> I am not entirely sure now if there's an issue in the packed vq
> >> implementation in QEMU or if this is being caused due to some sort
> >> of race condition in linux.
> >>
> >> "id is not a head" is being thrown because vq->packed.desc_state[id].data
> >> doesn't exist for the corresponding id in Linux [1]. But QEMU seems to have
> >> stored some data for this id via vhost_svq_add() [2]. Linux sets the value
> >> of vq->packed.desc_state[id].data in its version of virtqueue_add_packed() [3].
> >>
> >
> > Let's keep debugging further. Can you trace the ids that the L2 kernel
> > makes available, and then the ones that it uses? At the same time, can
> > you trace the ids that the svq sees in vhost_svq_get_buf and the ones
> > that flushes? This allows us to check the set of available descriptors
> > at any given time.
> >
> In the linux kernel, I am printing which descriptor is received in which
> queue in drivers/virtio/virtio_ring.c:virtqueue_get_buf_ctx_packed() [1].
> I see the following lines getting printed for the TX queue:
>
> [ 192.101591] output.0 -> id: 0
> [ 213.737417] output.0 -> id: 0
> [ 213.738714] output.0 -> id: 1
> [ 213.740093] output.0 -> id: 0
> [ 213.741521] virtio_net virtio1: output.0:id 0 is not a head!
>
I find it peculiar that this is the first descriptor with id 1. Do you
have any other descriptors with id 1 previously? Does it fail
consistently with id 1?
You should have descriptors with id 1 and higher in the RX queue, and the
code should not be able to tell the difference, so it seems weird that it
fails with TX. But who knows :).
> In QEMU's hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_add_packed(), I am
> printing the head_idx, id, len, flags and vq_idx. Just before the crash,
> the following lines are printed:
>
> head_idx: 157, id: 0, len: 122, flags: 32768, vq idx: 1
> head_idx: 158, id: 0, len: 122, flags: 32768, vq idx: 1
> head_idx: 159, id: 0, len: 66, flags: 32768, vq idx: 1
> head_idx: 160, id: 1, len: 102, flags: 32768, vq idx: 1
>
> In QEMU's hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_get_buf_packed(), I
> am printing the id, last_used index, used wrap counter and vq_idx. These
> are the lines just before the crash:
>
> id: 0, last_used: 158, used_wrap_counter: 0, vq idx: 1
> id: 0, last_used: 159, used_wrap_counter: 0, vq idx: 1
> id: 0, last_used: 160, used_wrap_counter: 0, vq idx: 1
> id: 1, last_used: 161, used_wrap_counter: 0, vq idx: 1
>
> In QEMU's hw/virtio/vhost-shadow-virtqueue.c:vhost_svq_flush() [2], I am tracking
> the values of i and vq_idx in the outer do..while() loop as well as in the inner
> while(true) loop. The value of i is used as the "idx" in virtqueue_fill() [3] and
> as "count" in virtqueue_flush() [4]. Lines printed in each iteration of the outer
> do...while loop are enclosed between "===" lines. These are the lines just before
> the crash:
>
I'd print VirtQueueElement members too.
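Something like the following, perhaps; the struct here is a stand-in
mirroring only the fields of interest from QEMU's VirtQueueElement (see
include/hw/virtio/virtio.h; treat the field choice as an assumption):

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in with the VirtQueueElement members worth tracing; the real
 * struct in include/hw/virtio/virtio.h has more fields. */
typedef struct {
    unsigned int index;   /* head index the guest made available */
    unsigned int len;     /* bytes written, for used descriptors */
    unsigned int ndescs;  /* descriptors in the chain */
    unsigned int out_num; /* guest-readable buffers */
    unsigned int in_num;  /* guest-writable buffers */
} VirtQueueElementView;

/* Format the members into buf so they can be logged on one line. */
static int svq_format_elem(char *buf, size_t n, const VirtQueueElementView *e)
{
    return snprintf(buf, n,
                    "elem index: %u, len: %u, ndescs: %u, out_num: %u, in_num: %u",
                    e->index, e->len, e->ndescs, e->out_num, e->in_num);
}
```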
It seems you're super close to fixing it :).
Thanks!
> ===
> in_loop: i: 0, vq idx: 1
> in_loop: i: 1, vq idx: 1
> out_loop: i: 1, vq idx: 1
> ===
> in_loop: i: 0, vq idx: 1
> in_loop: i: 1, vq idx: 1
> out_loop: i: 1, vq idx: 1
> ===
> in_loop: i: 0, vq idx: 1
> in_loop: i: 1, vq idx: 1
> in_loop: i: 2, vq idx: 1
> out_loop: i: 2, vq idx: 1
> ===
> in_loop: i: 0, vq idx: 1
> out_loop: i: 0, vq idx: 1
>
> I have only investigated which descriptors the kernel uses. I'll also check
> which descriptors are made available by the kernel. I'll let you know what I
> find.
>
> Thanks,
> Sahil
>
> [1] https://github.com/torvalds/linux/blob/master/drivers/virtio/virtio_ring.c#L1727
> [2] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/vhost-shadow-virtqueue.c#L499
> [3] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/virtio.c#L1008
> [4] https://gitlab.com/qemu-project/qemu/-/blob/master/hw/virtio/virtio.c#L1147
>
* Re: [RFC v5 0/7] Add packed format to shadow virtqueue
2025-05-15 6:19 ` Eugenio Perez Martin
@ 2025-06-26 5:16 ` Sahil Siddiq
2025-06-26 7:37 ` Eugenio Perez Martin
0 siblings, 1 reply; 44+ messages in thread
From: Sahil Siddiq @ 2025-06-26 5:16 UTC (permalink / raw)
To: Eugenio Perez Martin; +Cc: sgarzare, mst, qemu-devel, sahilcdq
Hi,
It's been a while since I sent an email. I thought I would send an update
to keep you in the loop.
I have been comparing SVQ's mechanisms for split and packed vqs, hoping
to find something that might lead to the source of the issue.
One thing worth noting is that when I use kernel version 6.8.5 for testing,
the crashes are far more frequent. In kernel version 6.15.0-rc3+, it's much
harder to reproduce.
On 5/15/25 11:49 AM, Eugenio Perez Martin wrote:
> On Wed, May 14, 2025 at 8:22 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>> On 4/16/25 12:50 PM, Eugenio Perez Martin wrote:
>>> On Mon, Apr 14, 2025 at 11:20 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>> On 3/26/25 1:05 PM, Eugenio Perez Martin wrote:
>>>>> On Mon, Mar 24, 2025 at 2:59 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>>>> I managed to fix a few issues while testing this patch series.
>>>>>> There is still one issue that I am unable to resolve. I thought
>>>>>> I would send this patch series for review in case I have missed
>>>>>> something.
>>>>>>
>>>>>> The issue is that this patch series does not work every time. I
>>>>>> am able to ping L0 from L2 and vice versa via packed SVQ when it
>>>>>> works.
>>>>>>
>>>>>> When this doesn't work, both VMs throw a "Destination Host
>>>>>> Unreachable" error. This is sometimes (not always) accompanied
>>>>>> by the following kernel error (thrown by L2-kernel):
>>>>>>
>>>>>> virtio_net virtio1: output.0:id 1 is not a head!
>>>>>>
>>
>> The TX queue seems to be problematic. More on this below.
Sometimes RX also results in this crash, but it seems to be less frequent.
>>>> This was filled while L2
>>>> was booting. In the case when the ctrl vq is disabled, I am not sure
>>>> what is responsible for filling the vqs in the data plane during
>>>> booting.
>>>>
>>> The nested guest's driver fills the rx queue at startup. After that,
>>> that nested guest kicks and SVQ receives the descriptors. It copies
>>> the descriptors to the shadow virtqueue and then kicks L0 QEMU.
>>
>> Understood.
>>
>>>> The other vq (vq_idx=1) is not filled completely before the issue is
>>>> hit.
>>>> I have been noting down the numbers and here is an example:
>>>>
>>>> 295 descriptors were added individually to the queues i.e., there were no chains (vhost_svq_add_packed)
>>>> |_ 256 additions in vq_idx = 0, all with unique ids
>>>> |---- 27 descriptors (ids 0 through 26) were received later from the device (vhost_svq_get_buf_packed)
>>>> |_ 39 additions in vq_idx = 1
>>>> |_ 13 descriptors had id = 0
>>>> |_ 26 descriptors had id = 1
>>>> |---- All descriptors were received at some point from the device (vhost_svq_get_buf_packed)
>>>>
>>>> There was one case in which vq_idx=0 had wrapped around. I verified
>>>> that flags were set appropriately during the wrap (avail and used flags
>>>> were flipped as expected).
>>>>
>>>
>>> Ok sounds like you're able to reach it before filling the queue. I'd
>>> go for debugging notifications for this one then. More on this below.
>>>
>>>> =====
>>>> The next common situation where this issue is hit is during startup.
>>>> Before L2 can finish booting successfully, this error is thrown:
>>>>
>>>> virtio_net virtio1: output.0:id 0 is not a head!
>>>>
>>>> 258 descriptors were added individually to the queues during startup (there were no chains) (vhost_svq_add_packed)
>>>> |_ 256 additions in vq_idx = 0, all with unique ids
>>>> |---- None of them were received by the device (vhost_svq_get_buf_packed)
>>>> |_ 2 additions in vq_idx = 1
>>>> |_ id = 0 in index 0
>>>> |_ id = 1 in index 1
>>>> |---- Both descriptors were received at some point during startup from the device (vhost_svq_get_buf_packed)
>>>>
>>>> =====
>>>> Another case is after several seconds of pinging L0 from L2.
>>>>
>>>> [ 99.034114] virtio_net virtio1: output.0:id 0 is not a head!
>>>>
>>>
>>> So the L2 guest sees a descriptor it has not made available
>>> previously. This can be caused because SVQ returns the same descriptor
>>> twice, or it doesn't fill the id or flags properly. It can also be
>>> caused because we're not protecting the write ordering in the ring,
>>> but I don't see anything obviously wrong by looking at the code.
>>>
>>>> 366 descriptors were added individually to the queues i.e., there were no chains (vhost_svq_add_packed)
>>>> |_ 289 additions in vq_idx = 0, wrap-around was observed with avail and used flags inverted for 33 descriptors
>>>> | |---- 40 descriptors (ids 0 through 39) were received from the device (vhost_svq_get_buf_packed)
>>>> |_ 77 additions in vq_idx = 1
>>>> |_ 76 descriptors had id = 0
>>>> |_ 1 descriptor had id = 1
>>>> |---- all 77 descriptors were received at some point from the device (vhost_svq_get_buf_packed)
>>>>
>>>>
>>>> "id is not a head" is being thrown because vq->packed.desc_state[id].data
>>>> doesn't exist for the corresponding id in Linux [1]. But QEMU seems to have
>>>> stored some data for this id via vhost_svq_add() [2]. Linux sets the value
>>>> of vq->packed.desc_state[id].data in its version of virtqueue_add_packed() [3].
>>>>
>>>
>>> Let's keep debugging further. Can you trace the ids that the L2 kernel
>>> makes available, and then the ones that it uses? At the same time, can
>>> you trace the ids that the svq sees in vhost_svq_get_buf and the ones
>>> that flushes? This allows us to check the set of available descriptors
>>> at any given time.
>>>
>> In the linux kernel, I am printing which descriptor is received in which
>> queue in drivers/virtio/virtio_ring.c:virtqueue_get_buf_ctx_packed() [1].
>> I see the following lines getting printed for the TX queue:
>>
>> [ 192.101591] output.0 -> id: 0
>> [ 213.737417] output.0 -> id: 0
>> [ 213.738714] output.0 -> id: 1
>> [ 213.740093] output.0 -> id: 0
>> [ 213.741521] virtio_net virtio1: output.0:id 0 is not a head!
>>
>
> I find it particular that it is the first descriptor with id 1. Do you
> have any other descriptor with id 1 previously? Does it fail
> consistently with id 1?
Yes, the descriptor with id 1 was used previously in TX. It varies between
test runs. It has failed with other ids as well during some test runs. In
one test run, it failed with id 17. I think there's an off-by-one bug here.
It crashes when it receives id 'x - 1' instead of 'x'.
> You should have descriptors with id 1 and more in the rx queue and the
> code should not be able to tell the difference, so it seems weird it
> fails with tx. But who knows :).
Oh, I thought it would be able to differentiate between them since it knows
which vq->idx it's coming from.
I think there's something off in the way "free_head", "last_used_idx" and
"desc_next" values are calculated in vhost_svq_get_buf_packed() [1].
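As a sanity check on that suspicion, the free_head/desc_next bookkeeping
can be modeled as a plain free list (a toy model, not the SVQ code; the
invariant is that an id may only come back from the device while it is
outstanding, which is exactly what "id is not a head" polices):

```c
#include <assert.h>
#include <stdbool.h>

#define MODEL_NUM 8

static int free_head;
static int desc_next[MODEL_NUM];
static bool outstanding[MODEL_NUM];

static void model_init(void)
{
    free_head = 0;
    for (int i = 0; i < MODEL_NUM; i++) {
        desc_next[i] = i + 1;     /* last entry points past the end */
        outstanding[i] = false;
    }
}

/* Driver side: take the next free id and mark it outstanding. */
static int model_get_id(void)
{
    int id = free_head;
    assert(id < MODEL_NUM);       /* ring would be full otherwise */
    free_head = desc_next[id];
    outstanding[id] = true;
    return id;
}

/* Device side: return an id. Returning an id that is not outstanding
 * is the "id X is not a head!" condition. */
static bool model_put_id(int id)
{
    if (!outstanding[id]) {
        return false;
    }
    outstanding[id] = false;
    desc_next[id] = free_head;    /* push the id back on the free list */
    free_head = id;
    return true;
}
```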
In the latest test run, QEMU sent ids 0 through 28 to L2. L2 started receiving
them in order till id 8. At this point it received id 7 again for some reason
and then crashed.
L2:
[ 1641.129218] (prepare_packed) output.0 -> needs_kick: 1
[ 1641.130621] (notify) output.0 -> function will return true
[ 1641.132022] output.0 -> id: 0
[ 1739.502358] input.0 -> id: 0
[ 1739.503003] input.0 -> id: 1
[ 1739.562024] input.0 -> id: 2
[ 1739.578682] input.0 -> id: 3
[ 1739.661913] input.0 -> id: 4
[ 1739.828796] input.0 -> id: 5
[ 1739.829789] input.0 -> id: 6
[ 1740.078757] input.0 -> id: 7
[ 1740.079749] input.0 -> id: 8
[ 1740.080382] input.0 -> id: 7 <----Received 7 again
[ 1740.081614] virtio_net virtio1: input.0:id 7 is not a head!
QEMU logs (vhost_svq_get_buf_packed):
------
size : svq->vring.num
len : svq->vring_packed.vring.desc[last_used].len
id : svq->vring_packed.vring.desc[last_used].id
num : svq->desc_state[id].ndescs
last_used_chain : Result of vhost_svq_last_desc_of_chain(svq, num, id) [2]
free_head : svq->free_head
last_used : (last_used_idx & ~(1 << VRING_PACKED_EVENT_F_WRAP_CTR)) + num
used_wrap_counter : !!(last_used_idx & (1 << VRING_PACKED_EVENT_F_WRAP_CTR))
------
size: 256, len: 102, id: 0, vq idx: 0
id: 0, last_used_chain: 0, free_head: 0, vq idx: 0
num: 1, free_head: 0, id: 0, last_used: 1, used_wrap_counter: 1, vq idx: 0
------
size: 256, len: 74, id: 1, vq idx: 0
id: 1, last_used_chain: 1, free_head: 0, vq idx: 0
num: 1, free_head: 1, id: 1, last_used: 2, used_wrap_counter: 1, vq idx: 0
------
size: 256, len: 102, id: 2, vq idx: 0
id: 2, last_used_chain: 2, free_head: 1, vq idx: 0
num: 1, free_head: 2, id: 2, last_used: 3, used_wrap_counter: 1, vq idx: 0
------
size: 256, len: 82, id: 3, vq idx: 0
id: 3, last_used_chain: 3, free_head: 2, vq idx: 0
num: 1, free_head: 3, id: 3, last_used: 4, used_wrap_counter: 1, vq idx: 0
------
size: 256, len: 74, id: 4, vq idx: 0
id: 4, last_used_chain: 4, free_head: 3, vq idx: 0
num: 1, free_head: 4, id: 4, last_used: 5, used_wrap_counter: 1, vq idx: 0
------
size: 256, len: 82, id: 5, vq idx: 0
id: 5, last_used_chain: 5, free_head: 4, vq idx: 0
num: 1, free_head: 5, id: 5, last_used: 6, used_wrap_counter: 1, vq idx: 0
------
size: 256, len: 104, id: 6, vq idx: 0
id: 6, last_used_chain: 6, free_head: 5, vq idx: 0
num: 1, free_head: 6, id: 6, last_used: 7, used_wrap_counter: 1, vq idx: 0
------
size: 256, len: 82, id: 7, vq idx: 0
id: 7, last_used_chain: 7, free_head: 6, vq idx: 0
num: 1, free_head: 7, id: 7, last_used: 8, used_wrap_counter: 1, vq idx: 0
------
size: 256, len: 104, id: 8, vq idx: 0
id: 8, last_used_chain: 8, free_head: 7, vq idx: 0
num: 1, free_head: 8, id: 8, last_used: 9, used_wrap_counter: 1, vq idx: 0
------
size: 256, len: 98, id: 9, vq idx: 0
id: 9, last_used_chain: 9, free_head: 8, vq idx: 0
num: 1, free_head: 9, id: 9, last_used: 10, used_wrap_counter: 1, vq idx: 0
------
size: 256, len: 104, id: 10, vq idx: 0
id: 10, last_used_chain: 10, free_head: 9, vq idx: 0
num: 1, free_head: 10, id: 10, last_used: 11, used_wrap_counter: 1, vq idx: 0
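The decode in the legend above (index in the low 15 bits, used wrap
counter in bit VRING_PACKED_EVENT_F_WRAP_CTR) can be checked standalone,
ignoring the '+ num' advance:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VRING_PACKED_EVENT_F_WRAP_CTR 15

/* Low 15 bits of last_used_idx: the ring index. */
static uint16_t packed_last_used(uint16_t last_used_idx)
{
    return last_used_idx & ~(1 << VRING_PACKED_EVENT_F_WRAP_CTR);
}

/* Bit 15 of last_used_idx: the used wrap counter. */
static bool packed_used_wrap_counter(uint16_t last_used_idx)
{
    return !!(last_used_idx & (1 << VRING_PACKED_EVENT_F_WRAP_CTR));
}
```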
I have a few more ideas of what to do. I'll let you know if I find something
else.
Thanks,
Sahil
[1] https://github.com/valdaarhun/qemu/blob/packed_vq/hw/virtio/vhost-shadow-virtqueue.c#L687
[2] https://github.com/valdaarhun/qemu/blob/packed_vq/hw/virtio/vhost-shadow-virtqueue.c#L629
* Re: [RFC v5 0/7] Add packed format to shadow virtqueue
2025-06-26 5:16 ` Sahil Siddiq
@ 2025-06-26 7:37 ` Eugenio Perez Martin
2025-07-30 14:32 ` Sahil Siddiq
0 siblings, 1 reply; 44+ messages in thread
From: Eugenio Perez Martin @ 2025-06-26 7:37 UTC (permalink / raw)
To: Sahil Siddiq; +Cc: sgarzare, mst, qemu-devel, sahilcdq
On Thu, Jun 26, 2025 at 7:16 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> It's been a while since I sent an email. I thought I would send an update
> to keep you in the loop.
>
> I have been comparing svq's mechanism for split and packed vqs hoping to
> find something that might lead to the source of the issue.
>
> One thing worth noting is that when I use kernel version 6.8.5 for testing,
> the crashes are far more frequent. In kernel version 6.15.0-rc3+, it's much
> harder to reproduce.
>
> On 5/15/25 11:49 AM, Eugenio Perez Martin wrote:
> > On Wed, May 14, 2025 at 8:22 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >> On 4/16/25 12:50 PM, Eugenio Perez Martin wrote:
> >>> On Mon, Apr 14, 2025 at 11:20 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>> On 3/26/25 1:05 PM, Eugenio Perez Martin wrote:
> >>>>> On Mon, Mar 24, 2025 at 2:59 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>>>> I managed to fix a few issues while testing this patch series.
> >>>>>> There is still one issue that I am unable to resolve. I thought
> >>>>>> I would send this patch series for review in case I have missed
> >>>>>> something.
> >>>>>>
> >>>>>> The issue is that this patch series does not work every time. I
> >>>>>> am able to ping L0 from L2 and vice versa via packed SVQ when it
> >>>>>> works.
> >>>>>>
> >>>>>> When this doesn't work, both VMs throw a "Destination Host
> >>>>>> Unreachable" error. This is sometimes (not always) accompanied
> >>>>>> by the following kernel error (thrown by L2-kernel):
> >>>>>>
> >>>>>> virtio_net virtio1: output.0:id 1 is not a head!
> >>>>>>
> >>
> >> The TX queue seems to be problematic. More on this below.
>
> Sometimes RX also results in this crash, but it seems to be less frequent.
>
> >>>> This was filled while L2
> >>>> was booting. In the case when the ctrl vq is disabled, I am not sure
> >>>> what is responsible for filling the vqs in the data plane during
> >>>> booting.
> >>>>
> >>> The nested guest's driver fills the rx queue at startup. After that,
> >>> that nested guest kicks and SVQ receives the descriptors. It copies
> >>> the descriptors to the shadow virtqueue and then kicks L0 QEMU.
> >>
> >> Understood.
> >>
> >>>> The other vq (vq_idx=1) is not filled completely before the issue is
> >>>> hit.
> >>>> I have been noting down the numbers and here is an example:
> >>>>
> >>>> 295 descriptors were added individually to the queues i.e., there were no chains (vhost_svq_add_packed)
> >>>> |_ 256 additions in vq_idx = 0, all with unique ids
> >>>> |---- 27 descriptors (ids 0 through 26) were received later from the device (vhost_svq_get_buf_packed)
> >>>> |_ 39 additions in vq_idx = 1
> >>>> |_ 13 descriptors had id = 0
> >>>> |_ 26 descriptors had id = 1
> >>>> |---- All descriptors were received at some point from the device (vhost_svq_get_buf_packed)
> >>>>
> >>>> There was one case in which vq_idx=0 had wrapped around. I verified
> >>>> that flags were set appropriately during the wrap (avail and used flags
> >>>> were flipped as expected).
> >>>>
> >>>
> >>> Ok sounds like you're able to reach it before filling the queue. I'd
> >>> go for debugging notifications for this one then. More on this below.
> >>>
> >>>> =====
> >>>> The next common situation where this issue is hit is during startup.
> >>>> Before L2 can finish booting successfully, this error is thrown:
> >>>>
> >>>> virtio_net virtio1: output.0:id 0 is not a head!
> >>>>
> >>>> 258 descriptors were added individually to the queues during startup (there were no chains) (vhost_svq_add_packed)
> >>>> |_ 256 additions in vq_idx = 0, all with unique ids
> >>>> |---- None of them were received by the device (vhost_svq_get_buf_packed)
> >>>> |_ 2 additions in vq_idx = 1
> >>>> |_ id = 0 in index 0
> >>>> |_ id = 1 in index 1
> >>>> |---- Both descriptors were received at some point during startup from the device (vhost_svq_get_buf_packed)
> >>>>
> >>>> =====
> >>>> Another case is after several seconds of pinging L0 from L2.
> >>>>
> >>>> [ 99.034114] virtio_net virtio1: output.0:id 0 is not a head!
> >>>>
> >>>
> >>> So the L2 guest sees a descriptor it has not made available
> >>> previously. This can be caused because SVQ returns the same descriptor
> >>> twice, or it doesn't fill the id or flags properly. It can also be
> >>> caused because we're not protecting the write ordering in the ring,
> >>> but I don't see anything obviously wrong by looking at the code.
> >>>
> >>>> 366 descriptors were added individually to the queues i.e., there were no chains (vhost_svq_add_packed)
> >>>> |_ 289 additions in vq_idx = 0, wrap-around was observed with avail and used flags inverted for 33 descriptors
> >>>> | |---- 40 descriptors (ids 0 through 39) were received from the device (vhost_svq_get_buf_packed)
> >>>> |_ 77 additions in vq_idx = 1
> >>>> |_ 76 descriptors had id = 0
> >>>> |_ 1 descriptor had id = 1
> >>>> |---- all 77 descriptors were received at some point from the device (vhost_svq_get_buf_packed)
> >>>>
> >>>>
> >>>> "id is not a head" is being thrown because vq->packed.desc_state[id].data
> >>>> doesn't exist for the corresponding id in Linux [1]. But QEMU seems to have
> >>>> stored some data for this id via vhost_svq_add() [2]. Linux sets the value
> >>>> of vq->packed.desc_state[id].data in its version of virtqueue_add_packed() [3].
> >>>>
> >>>
> >>> Let's keep debugging further. Can you trace the ids that the L2 kernel
> >>> makes available, and then the ones that it uses? At the same time, can
> >>> you trace the ids that the svq sees in vhost_svq_get_buf and the ones
> >>> that flushes? This allows us to check the set of available descriptors
> >>> at any given time.
> >>>
> >> In the linux kernel, I am printing which descriptor is received in which
> >> queue in drivers/virtio/virtio_ring.c:virtqueue_get_buf_ctx_packed() [1].
> >> I see the following lines getting printed for the TX queue:
> >>
> >> [ 192.101591] output.0 -> id: 0
> >> [ 213.737417] output.0 -> id: 0
> >> [ 213.738714] output.0 -> id: 1
> >> [ 213.740093] output.0 -> id: 0
> >> [ 213.741521] virtio_net virtio1: output.0:id 0 is not a head!
> >>
> >
> > I find it particular that it is the first descriptor with id 1. Do you
> > have any other descriptor with id 1 previously? Does it fail
> > consistently with id 1?
>
> Yes, the descriptor with id 1 was used previously in TX. It varies between
> test runs. It has failed with other ids as well during some test runs. In
> one test run, it failed with id 17. I think there's an off-by-one bug here.
> It crashes when it receives id 'x - 1' instead of 'x'.
> > You should have descriptors with id 1 and more in the rx queue and the
> > code should not be able to tell the difference, so it seems weird it
> > fails with tx. But who knows :).
> Oh, I thought it would be able to differentiate between them since it knows
> which vq->idx it's coming from.
>
> I think there's something off in the way "free_head", "last_used_idx" and
> "desc_next" values are calculated in vhost_svq_get_buf_packed() [1].
>
> In the latest test run, QEMU sent ids 0 through 28 to L2. L2 started receiving
> them in order till id 8. At this point it received id 7 again for some reason
> and then crashed.
>
> L2:
>
> [ 1641.129218] (prepare_packed) output.0 -> needs_kick: 1
> [ 1641.130621] (notify) output.0 -> function will return true
> [ 1641.132022] output.0 -> id: 0
> [ 1739.502358] input.0 -> id: 0
> [ 1739.503003] input.0 -> id: 1
> [ 1739.562024] input.0 -> id: 2
> [ 1739.578682] input.0 -> id: 3
> [ 1739.661913] input.0 -> id: 4
> [ 1739.828796] input.0 -> id: 5
> [ 1739.829789] input.0 -> id: 6
> [ 1740.078757] input.0 -> id: 7
> [ 1740.079749] input.0 -> id: 8
> [ 1740.080382] input.0 -> id: 7 <----Received 7 again
> [ 1740.081614] virtio_net virtio1: input.0:id 7 is not a head!
>
> QEMU logs (vhost_svq_get_buf_packed):
> ------
> size : svq->vring.num
> len : svq->vring_packed.vring.desc[last_used].len
> id : svq->vring_packed.vring.desc[last_used].id
> num : svq->desc_state[id].ndescs
> last_used_chain : Result of vhost_svq_last_desc_of_chain(svq, num, id) [2]
> free_head : svq->free_head
> last_used : (last_used_idx & ~(1 << VRING_PACKED_EVENT_F_WRAP_CTR)) + num
> used_wrap_counter : !!(last_used_idx & (1 << VRING_PACKED_EVENT_F_WRAP_CTR))
> ------
>
> size: 256, len: 102, id: 0, vq idx: 0
> id: 0, last_used_chain: 0, free_head: 0, vq idx: 0
> num: 1, free_head: 0, id: 0, last_used: 1, used_wrap_counter: 1, vq idx: 0
> ------
> size: 256, len: 74, id: 1, vq idx: 0
> id: 1, last_used_chain: 1, free_head: 0, vq idx: 0
> num: 1, free_head: 1, id: 1, last_used: 2, used_wrap_counter: 1, vq idx: 0
> ------
> size: 256, len: 102, id: 2, vq idx: 0
> id: 2, last_used_chain: 2, free_head: 1, vq idx: 0
> num: 1, free_head: 2, id: 2, last_used: 3, used_wrap_counter: 1, vq idx: 0
> ------
> size: 256, len: 82, id: 3, vq idx: 0
> id: 3, last_used_chain: 3, free_head: 2, vq idx: 0
> num: 1, free_head: 3, id: 3, last_used: 4, used_wrap_counter: 1, vq idx: 0
> ------
> size: 256, len: 74, id: 4, vq idx: 0
> id: 4, last_used_chain: 4, free_head: 3, vq idx: 0
> num: 1, free_head: 4, id: 4, last_used: 5, used_wrap_counter: 1, vq idx: 0
> ------
> size: 256, len: 82, id: 5, vq idx: 0
> id: 5, last_used_chain: 5, free_head: 4, vq idx: 0
> num: 1, free_head: 5, id: 5, last_used: 6, used_wrap_counter: 1, vq idx: 0
> ------
> size: 256, len: 104, id: 6, vq idx: 0
> id: 6, last_used_chain: 6, free_head: 5, vq idx: 0
> num: 1, free_head: 6, id: 6, last_used: 7, used_wrap_counter: 1, vq idx: 0
> ------
> size: 256, len: 82, id: 7, vq idx: 0
> id: 7, last_used_chain: 7, free_head: 6, vq idx: 0
> num: 1, free_head: 7, id: 7, last_used: 8, used_wrap_counter: 1, vq idx: 0
> ------
> size: 256, len: 104, id: 8, vq idx: 0
> id: 8, last_used_chain: 8, free_head: 7, vq idx: 0
> num: 1, free_head: 8, id: 8, last_used: 9, used_wrap_counter: 1, vq idx: 0
> ------
> size: 256, len: 98, id: 9, vq idx: 0
> id: 9, last_used_chain: 9, free_head: 8, vq idx: 0
> num: 1, free_head: 9, id: 9, last_used: 10, used_wrap_counter: 1, vq idx: 0
> ------
> size: 256, len: 104, id: 10, vq idx: 0
> id: 10, last_used_chain: 10, free_head: 9, vq idx: 0
> num: 1, free_head: 10, id: 10, last_used: 11, used_wrap_counter: 1, vq idx: 0
>
> I have a few more ideas of what to do. I'll let you know if I find something
> else.
>
I cannot find anything just by inspection. What about printing all the
desc_state and desc_next entries to check for inconsistencies on each
svq_add and get_buf?
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC v5 0/7] Add packed format to shadow virtqueue
2025-06-26 7:37 ` Eugenio Perez Martin
@ 2025-07-30 14:32 ` Sahil Siddiq
2025-07-31 13:52 ` Eugenio Perez Martin
0 siblings, 1 reply; 44+ messages in thread
From: Sahil Siddiq @ 2025-07-30 14:32 UTC (permalink / raw)
To: Eugenio Perez Martin; +Cc: sgarzare, mst, qemu-devel, sahilcdq
Hi,
I think I have finally found the reason behind this issue.
The order in which "add_packed" and "get_buf_packed" are performed differs
slightly between the nested guest kernel (L2 kernel) and QEMU. Because of
this, the values of free_head and svq->desc_next[] diverge and the guest
eventually crashes. More below.
On 6/26/25 1:07 PM, Eugenio Perez Martin wrote:
> On Thu, Jun 26, 2025 at 7:16 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>> I think there's something off in the way "free_head", "last_used_idx" and
>> "desc_next" values are calculated in vhost_svq_get_buf_packed() [1].
>>
>> In the latest test run, QEMU sent ids 0 through 28 to L2. L2 started receiving
>> them in order till id 8. At this point it received id 7 again for some reason
>> and then crashed.
>>
>> L2:
>>
>> [ 1641.129218] (prepare_packed) output.0 -> needs_kick: 1
>> [ 1641.130621] (notify) output.0 -> function will return true
>> [ 1641.132022] output.0 -> id: 0
>> [ 1739.502358] input.0 -> id: 0
>> [ 1739.503003] input.0 -> id: 1
>> [ 1739.562024] input.0 -> id: 2
>> [ 1739.578682] input.0 -> id: 3
>> [ 1739.661913] input.0 -> id: 4
>> [ 1739.828796] input.0 -> id: 5
>> [ 1739.829789] input.0 -> id: 6
>> [ 1740.078757] input.0 -> id: 7
>> [ 1740.079749] input.0 -> id: 8
>> [ 1740.080382] input.0 -> id: 7 <----Received 7 again
>> [ 1740.081614] virtio_net virtio1: input.0:id 7 is not a head!
>>
>> QEMU logs (vhost_svq_get_buf_packed):
>> ------
>> size : svq->vring.num
>> len : svq->vring_packed.vring.desc[last_used].len
>> id : svq->vring_packed.vring.desc[last_used].id
>> num : svq->desc_state[id].ndescs
>> last_used_chain : Result of vhost_svq_last_desc_of_chain(svq, num, id) [2]
>> free_head : svq->free_head
>> last_used : (last_used_idx & ~(1 << VRING_PACKED_EVENT_F_WRAP_CTR)) + num
>> used_wrap_counter : !!(last_used_idx & (1 << VRING_PACKED_EVENT_F_WRAP_CTR))
>> ------
>> size: 256, len: 82, id: 7, vq idx: 0
>> id: 7, last_used_chain: 7, free_head: 6, vq idx: 0
>> num: 1, free_head: 7, id: 7, last_used: 8, used_wrap_counter: 1, vq idx: 0
>> ------
>> size: 256, len: 104, id: 8, vq idx: 0
>> id: 8, last_used_chain: 8, free_head: 7, vq idx: 0
>> num: 1, free_head: 8, id: 8, last_used: 9, used_wrap_counter: 1, vq idx: 0
>> ------
>> size: 256, len: 98, id: 9, vq idx: 0
>> id: 9, last_used_chain: 9, free_head: 8, vq idx: 0
>> num: 1, free_head: 9, id: 9, last_used: 10, used_wrap_counter: 1, vq idx: 0
>> ------
>> size: 256, len: 104, id: 10, vq idx: 0
>> id: 10, last_used_chain: 10, free_head: 9, vq idx: 0
>> num: 1, free_head: 10, id: 10, last_used: 11, used_wrap_counter: 1, vq idx: 0
>>
>> I have a few more ideas of what to do. I'll let you know if I find something
>> else.
>>
> I cannot find anything just by inspection. What about printing all the
> desc_state and all desc_next to check for incoherencies in each
> svq_add and get_buf?
In this test, all 256 descriptors were filled in the RX vq.
In the TX queue, the L2 kernel would add one descriptor at a time and notify
QEMU. QEMU would then register it in its SVQ and mark it as "available".
After processing the descriptor, QEMU would mark it as "used" and flush it
back to L2. L2, in turn, would mark this descriptor as "used". After this,
L2 would add the next descriptor to the TX vq while reusing this ID. This
was observed from idx 0 through idx 7.
L2's debug logs:
[ 18.379112] (use_indirect?) output.0 -> verdict: 0 <----- Begin adding descriptor in idx 6
[ 18.387134] (add_packed) output.0 -> idx: 6
[ 18.389897] (add_packed) output.0 -> id: 0
[ 18.392290] (add_packed) output.0 -> len: 74
[ 18.394606] (add_packed) output.0 -> addr: 5012315726
[ 18.397043] (add_packed) output.0 -> next id: 1
[ 18.399861] Entering prepare_packed: output.0
[ 18.402478] (prepare_packed) output.0 -> needs_kick: 1
[ 18.404998] (notify) output.0 -> function will return true <----- Notify QEMU
[ 18.406349] output.0 -> id: 0, idx: 6 <----- Mark ID 0 in idx 6 as used
[ 18.409482] output.0 -> old free_head: 1, new free_head: 0 <----- ID 0 can be reused
[ 18.410919] (after get_buf processed) output.0 -> id: 0, idx: 7 <----- Next slot is idx 7
[ 18.921895] (use_indirect?) output.0 -> verdict: 0 <----- Begin adding descriptor with ID = 0 in idx 7
[ 18.930093] (add_packed) output.0 -> idx: 7
[ 18.935715] (add_packed) output.0 -> id: 0
[ 18.937609] (add_packed) output.0 -> len: 122
[ 18.939614] (add_packed) output.0 -> addr: 4925868038
[ 18.941710] (add_packed) output.0 -> next id: 1
[ 18.944032] Entering prepare_packed: output.0
[ 18.946148] (prepare_packed) output.0 -> needs_kick: 1
[ 18.948234] (notify) output.0 -> function will return true <----- Notify QEMU
[ 18.949606] output.0 -> id: 0, idx: 7 <----- Mark ID 0 in idx 7 as used
[ 18.952756] output.0 -> old free_head: 1, new free_head: 0 <----- ID 0 can be reused
[ 18.955154] (after get_buf processed) output.0 -> id: 0, idx: 8 <----- Next slot is idx 8
There was no issue in QEMU till this point.
[ 19.177536] (use_indirect?) output.0 -> verdict: 0 <----- Begin adding descriptor with ID = 0 in idx 8
[ 19.182415] (add_packed) output.0 -> idx: 8
[ 19.187257] (add_packed) output.0 -> id: 0
[ 19.191355] (add_packed) output.0 -> len: 102
[ 19.195131] (add_packed) output.0 -> addr: 4370702342
[ 19.199224] (add_packed) output.0 -> next id: 1
[ 19.204929] Entering prepare_packed: output.0
[ 19.209505] (prepare_packed) output.0 -> needs_kick: 1
[ 19.213820] (notify) output.0 -> function will return true <----- Notify QEMU
[ 19.218792] (use_indirect?) output.0 -> verdict: 0 <----- Next slot is idx 9
[ 19.224730] (add_packed) output.0 -> idx: 9
[ 19.227067] (add_packed) output.0 -> id: 1 <----- ID 0 can't be reused yet, so use ID = 1
[ 19.229090] (add_packed) output.0 -> len: 330
[ 19.231182] (add_packed) output.0 -> addr: 4311020614
[ 19.233302] (add_packed) output.0 -> next id: 2
[ 19.235620] Entering prepare_packed: output.0
[ 19.237781] (prepare_packed) output.0 -> needs_kick: 1
[ 19.239958] (notify) output.0 -> function will return true <----- Notify QEMU
[ 19.237780] output.0 -> id: 0, idx: 8 <----- Mark ID 0 in idx 8 as used
[ 19.243676] output.0 -> old free_head: 2, new free_head: 0 <----- ID 0 can now be reused
[ 19.245214] (after get_buf processed) output.0 -> id: 0, idx: 9 <----- Next slot is idx 9
[ 19.247097] output.0 -> id: 1, idx: 9 <----- Mark ID 1 in idx 9 as used
[ 19.249612] output.0 -> old free_head: 0, new free_head: 1 <----- ID 1 can now be reused
[ 19.252266] (after get_buf processed) output.0 -> id: 1, idx: 10 <----- Next slot is idx 10
ID 0 and ID 1, in idx 8 and idx 9 respectively, are pushed to QEMU
before either of them is marked as used.
But in QEMU, the order is slightly different.
num: 1, init_flags: 128 <----- vhost_svq_add_packed()
idx: 8, id: 0, len: 0, flags: 0, vq idx: 1 <----- Before adding descriptor
idx: 8, id: 0, len: 102, flags: 128, vq idx: 1 <----- After adding descriptor
Finally: new_idx: 9, head_idx: 8, id: 0, len: 102, flags: 128, vq idx: 1
svq->vring.num: 256 <----- Begin vhost_svq_get_buf_packed()
descriptor_len: 0
descriptor_id: 0 <----- Mark ID = 0 as used
last_used: 8 <----- Processing idx 8
used_wrap_counter: 1
svq->desc_state[id].ndescs: 1
free_head: 0 <----- Update free_head to 0.
last_used: 9 <----- Update last_used to 9.
vq idx: 1 <----- End vhost_svq_get_buf_packed()
i: 0 <----- vhost_svq_flush()
descriptor_len: 0
elem->len: 22086
i: 1
elem_is_null: 1
vq idx: 1 <----- End vhost_svq_flush()
num: 1, init_flags: 128 <----- vhost_svq_add_packed()
idx: 9, id: 0, len: 0, flags: 0, curr: 0, vq idx: 1 <----- Before adding descriptor
idx: 9, id: 0, len: 330, flags: 128, curr: 1, vq idx: 1 <----- After adding descriptor
Finally: new_idx: 10, head_idx: 9, id: 0, len: 330, flags: 128, vq idx: 1 <----- ID 0 has been reused (versus ID 1 in L2)
svq->vring.num: 256 <----- Begin vhost_svq_get_buf_packed()
descriptor_len: 0
descriptor_id: 0 <----- Mark ID = 0 as used
last_used: 9 <----- Processing idx 9
used_wrap_counter: 1
svq->desc_state[id].ndescs: 1
free_head: 0 <----- Update free_head to 0.
last_used: 10 <----- Update last_used to 10.
vq idx: 1 <----- End vhost_svq_get_buf_packed()
i: 0 <----- vhost_svq_flush()
descriptor_len: 0
elem->len: 22086
i: 1
elem_is_null: 1
vq idx: 1 <----- End vhost_svq_flush()
In QEMU, id 0 is added at idx 8, but it is marked as used before a
descriptor can be added at idx 9. Because of this, the values of free_head
and svq->desc_next diverge between QEMU and the guest.
In the current implementation, the ID values are generated, maintained
and processed by QEMU instead of being read from the guest's memory. I
think reading the value of the ID from guest memory (similar to how the
descriptor length is read from guest memory) should resolve this issue.
The alternative would be to ensure that "add_packed" and "get_buf_packed"
are synchronized between the guest and QEMU.
What are your thoughts on this?
Thanks,
Sahil
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC v5 0/7] Add packed format to shadow virtqueue
2025-07-30 14:32 ` Sahil Siddiq
@ 2025-07-31 13:52 ` Eugenio Perez Martin
2025-08-04 6:04 ` Sahil Siddiq
0 siblings, 1 reply; 44+ messages in thread
From: Eugenio Perez Martin @ 2025-07-31 13:52 UTC (permalink / raw)
To: Sahil Siddiq; +Cc: sgarzare, mst, qemu-devel, sahilcdq
On Wed, Jul 30, 2025 at 4:33 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> I think I have finally found the reason behind this issue.
>
> The order in which "add_packed" and "get_buf_packed" are performed in the
> nested guest kernel (L2 kernel) and QEMU are a little different. Due to
> this, the values in free_head and svq->desc_next[] differ and the guest
> crashes at some point. More below.
>
> On 6/26/25 1:07 PM, Eugenio Perez Martin wrote:
> > On Thu, Jun 26, 2025 at 7:16 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >> I think there's something off in the way "free_head", "last_used_idx" and
> >> "desc_next" values are calculated in vhost_svq_get_buf_packed() [1].
> >>
> >> In the latest test run, QEMU sent ids 0 through 28 to L2. L2 started receiving
> >> them in order till id 8. At this point it received id 7 again for some reason
> >> and then crashed.
> >>
> >> L2:
> >>
> >> [ 1641.129218] (prepare_packed) output.0 -> needs_kick: 1
> >> [ 1641.130621] (notify) output.0 -> function will return true
> >> [ 1641.132022] output.0 -> id: 0
> >> [ 1739.502358] input.0 -> id: 0
> >> [ 1739.503003] input.0 -> id: 1
> >> [ 1739.562024] input.0 -> id: 2
> >> [ 1739.578682] input.0 -> id: 3
> >> [ 1739.661913] input.0 -> id: 4
> >> [ 1739.828796] input.0 -> id: 5
> >> [ 1739.829789] input.0 -> id: 6
> >> [ 1740.078757] input.0 -> id: 7
> >> [ 1740.079749] input.0 -> id: 8
> >> [ 1740.080382] input.0 -> id: 7 <----Received 7 again
> >> [ 1740.081614] virtio_net virtio1: input.0:id 7 is not a head!
> >>
> >> QEMU logs (vhost_svq_get_buf_packed):
> >> ------
> >> size : svq->vring.num
> >> len : svq->vring_packed.vring.desc[last_used].len
> >> id : svq->vring_packed.vring.desc[last_used].id
> >> num : svq->desc_state[id].ndescs
> >> last_used_chain : Result of vhost_svq_last_desc_of_chain(svq, num, id) [2]
> >> free_head : svq->free_head
> >> last_used : (last_used_idx & ~(1 << VRING_PACKED_EVENT_F_WRAP_CTR)) + num
> >> used_wrap_counter : !!(last_used_idx & (1 << VRING_PACKED_EVENT_F_WRAP_CTR))
> >> ------
> >> size: 256, len: 82, id: 7, vq idx: 0
> >> id: 7, last_used_chain: 7, free_head: 6, vq idx: 0
> >> num: 1, free_head: 7, id: 7, last_used: 8, used_wrap_counter: 1, vq idx: 0
> >> ------
> >> size: 256, len: 104, id: 8, vq idx: 0
> >> id: 8, last_used_chain: 8, free_head: 7, vq idx: 0
> >> num: 1, free_head: 8, id: 8, last_used: 9, used_wrap_counter: 1, vq idx: 0
> >> ------
> >> size: 256, len: 98, id: 9, vq idx: 0
> >> id: 9, last_used_chain: 9, free_head: 8, vq idx: 0
> >> num: 1, free_head: 9, id: 9, last_used: 10, used_wrap_counter: 1, vq idx: 0
> >> ------
> >> size: 256, len: 104, id: 10, vq idx: 0
> >> id: 10, last_used_chain: 10, free_head: 9, vq idx: 0
> >> num: 1, free_head: 10, id: 10, last_used: 11, used_wrap_counter: 1, vq idx: 0
> >>
> >> I have a few more ideas of what to do. I'll let you know if I find something
> >> else.
> >>
> > I cannot find anything just by inspection. What about printing all the
> > desc_state and all desc_next to check for incoherencies in each
> > svq_add and get_buf?
> In this test, all 256 descriptors were filled in the RX vq.
>
> In the TX queue, L2 kernel would add one descriptor at a time and notify
> QEMU. QEMU would then register it in its SVQ and mark it as "available".
> After processing the descriptor, QEMU would mark it as "used" and flush it
> back to L2. L2, in turn, would mark this descriptor as "used". After this
> process, L2 would add the next descriptor in the TX vq while reusing this
> ID. This was observed from idx 0 till idx 7.
>
> L2's debug logs:
>
> [ 18.379112] (use_indirect?) output.0 -> verdict: 0 <----- Begin adding descriptor in idx 6
> [ 18.387134] (add_packed) output.0 -> idx: 6
> [ 18.389897] (add_packed) output.0 -> id: 0
> [ 18.392290] (add_packed) output.0 -> len: 74
> [ 18.394606] (add_packed) output.0 -> addr: 5012315726
> [ 18.397043] (add_packed) output.0 -> next id: 1
> [ 18.399861] Entering prepare_packed: output.0
> [ 18.402478] (prepare_packed) output.0 -> needs_kick: 1
> [ 18.404998] (notify) output.0 -> function will return true <----- Notify QEMU
> [ 18.406349] output.0 -> id: 0, idx: 6 <----- Mark ID 0 in idx 6 as used
> [ 18.409482] output.0 -> old free_head: 1, new free_head: 0 <----- ID 0 can be reused
> [ 18.410919] (after get_buf processed) output.0 -> id: 0, idx: 7 <----- Next slot is idx 7
> [ 18.921895] (use_indirect?) output.0 -> verdict: 0 <----- Begin adding descriptor with ID = 0 in idx 7
> [ 18.930093] (add_packed) output.0 -> idx: 7
> [ 18.935715] (add_packed) output.0 -> id: 0
> [ 18.937609] (add_packed) output.0 -> len: 122
> [ 18.939614] (add_packed) output.0 -> addr: 4925868038
> [ 18.941710] (add_packed) output.0 -> next id: 1
> [ 18.944032] Entering prepare_packed: output.0
> [ 18.946148] (prepare_packed) output.0 -> needs_kick: 1
> [ 18.948234] (notify) output.0 -> function will return true <----- Notify QEMU
> [ 18.949606] output.0 -> id: 0, idx: 7 <----- Mark ID 0 in idx 7 as used
> [ 18.952756] output.0 -> old free_head: 1, new free_head: 0 <----- ID 0 can be reused
> [ 18.955154] (after get_buf processed) output.0 -> id: 0, idx: 8 <----- Next slot is idx 8
>
> There was no issue in QEMU till this point.
>
> [ 19.177536] (use_indirect?) output.0 -> verdict: 0 <----- Begin adding descriptor with ID = 0 in idx 8
> [ 19.182415] (add_packed) output.0 -> idx: 8
> [ 19.187257] (add_packed) output.0 -> id: 0
> [ 19.191355] (add_packed) output.0 -> len: 102
> [ 19.195131] (add_packed) output.0 -> addr: 4370702342
> [ 19.199224] (add_packed) output.0 -> next id: 1
> [ 19.204929] Entering prepare_packed: output.0
> [ 19.209505] (prepare_packed) output.0 -> needs_kick: 1
> [ 19.213820] (notify) output.0 -> function will return true <----- Notify QEMU
> [ 19.218792] (use_indirect?) output.0 -> verdict: 0 <----- Next slot is idx 9
> [ 19.224730] (add_packed) output.0 -> idx: 9
> [ 19.227067] (add_packed) output.0 -> id: 1 <----- ID 0 can't be reused yet, so use ID = 1
> [ 19.229090] (add_packed) output.0 -> len: 330
> [ 19.231182] (add_packed) output.0 -> addr: 4311020614
> [ 19.233302] (add_packed) output.0 -> next id: 2
> [ 19.235620] Entering prepare_packed: output.0
> [ 19.237781] (prepare_packed) output.0 -> needs_kick: 1
> [ 19.239958] (notify) output.0 -> function will return true <----- Notify QEMU
> [ 19.237780] output.0 -> id: 0, idx: 8 <----- Mark ID 0 in idx 8 as used
> [ 19.243676] output.0 -> old free_head: 2, new free_head: 0 <----- ID 0 can now be reused
> [ 19.245214] (after get_buf processed) output.0 -> id: 0, idx: 9 <----- Next slot is idx 9
> [ 19.247097] output.0 -> id: 1, idx: 9 <----- Mark ID 1 in idx 9 as used
> [ 19.249612] output.0 -> old free_head: 0, new free_head: 1 <----- ID 1 can now be reused
> [ 19.252266] (after get_buf processed) output.0 -> id: 1, idx: 10 <----- Next slot is idx 10
>
> ID 0 and ID 1 in idx 8 and idx 9 respectively are pushed to QEMU
> before either of them are marked as used.
>
> But in QEMU, the order is slightly different.
>
> num: 1, init_flags: 128 <----- vhost_svq_add_packed()
> idx: 8, id: 0, len: 0, flags: 0, vq idx: 1 <----- Before adding descriptor
> idx: 8, id: 0, len: 102, flags: 128, vq idx: 1 <----- After adding descriptor
> Finally: new_idx: 9, head_idx: 8, id: 0, len: 102, flags: 128, vq idx: 1
> svq->vring.num: 256 <----- Begin vhost_svq_get_buf_packed()
> descriptor_len: 0
> descriptor_id: 0 <----- Mark ID = 0 as used
> last_used: 8 <----- Processing idx 8
> used_wrap_counter: 1
> svq->desc_state[id].ndescs: 1
> free_head: 0 <----- Update free_head to 0.
> last_used: 9 <----- Update last_used to 9.
> vq idx: 1 <----- End vhost_svq_get_buf_packed()
> i: 0 <----- vhost_svq_flush()
> descriptor_len: 0
> elem->len: 22086
> i: 1
> elem_is_null: 1
> vq idx: 1 <----- End vhost_svq_flush()
> num: 1, init_flags: 128 <----- vhost_svq_add_packed()
> idx: 9, id: 0, len: 0, flags: 0, curr: 0, vq idx: 1 <----- Before adding descriptor
> idx: 9, id: 0, len: 330, flags: 128, curr: 1, vq idx: 1 <----- After adding descriptor
> Finally: new_idx: 10, head_idx: 9, id: 0, len: 330, flags: 128, vq idx: 1 <----- ID 0 has been reused (versus ID 1 in L2)
> svq->vring.num: 256 <----- Begin vhost_svq_get_buf_packed()
> descriptor_len: 0
> descriptor_id: 0 <----- Mark ID = 0 as used
> last_used: 9 <----- Processing idx 9
> used_wrap_counter: 1
> svq->desc_state[id].ndescs: 1
> free_head: 0 <----- Update free_head to 0.
> last_used: 10 <----- Update last_used to 10.
> vq idx: 1 <----- End vhost_svq_get_buf_packed()
> i: 0 <----- vhost_svq_flush()
> descriptor_len: 0
> elem->len: 22086
> i: 1
> elem_is_null: 1
> vq idx: 1 <----- End vhost_svq_flush()
>
> In QEMU, id 0 is added in idx 8. But it's marked as used before a
> descriptor can be added in idx 9. Because of this there's a discrepancy
> in the value of free_head and in svq->desc_next.
>
> In the current implementation, the values of ID are generated, maintained
> and processed by QEMU instead of reading from the guest's memory. I think
> reading the value of ID from the guest memory (similar to reading the
> descriptor length from guest memory) should resolve this issue.
>
Ok you made a good catch here :).
The 1:1 sync is hard to achieve as a single buffer in the guest may
need to be split into many buffers on the host.
> The alternative would be to ensure that "add_packed" and "get_buf_packed"
> are synchronized between the guest and QEMU.
>
Yes, they're synchronized. When the guest makes an available
descriptor, its head is saved in the VirtQueueElement of the SVQ's
head idx on svq->desc_state.
Reviewing patch 3/7, I see you're actually returning the id of the
first descriptor of the chain in *head, while it should be the id of
the *last* descriptor. It should not be the cause of the failure, as I
don't see any descriptor chains in the log. To keep the free linked
list happy, we may need to store the head of the descriptor chain in
the vq too.
Now, why is SVQ id 0 being reused? Sounds like free_list is not
initialized to 0, 1, 2... but to something else like 0, 0, 0, etc. Can
you print the whole list in each iteration?
> What are your thoughts on this?
>
> Thanks,
> Sahil
>
>
>
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC v5 0/7] Add packed format to shadow virtqueue
2025-07-31 13:52 ` Eugenio Perez Martin
@ 2025-08-04 6:04 ` Sahil Siddiq
2025-08-05 9:07 ` Eugenio Perez Martin
0 siblings, 1 reply; 44+ messages in thread
From: Sahil Siddiq @ 2025-08-04 6:04 UTC (permalink / raw)
To: Eugenio Perez Martin; +Cc: sgarzare, mst, qemu-devel, sahilcdq
Hi,
On 7/31/25 7:22 PM, Eugenio Perez Martin wrote:
> On Wed, Jul 30, 2025 at 4:33 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
>> I think I have finally found the reason behind this issue.
>>
>> The order in which "add_packed" and "get_buf_packed" are performed in the
>> nested guest kernel (L2 kernel) and QEMU are a little different. Due to
>> this, the values in free_head and svq->desc_next[] differ and the guest
>> crashes at some point. More below.
>>
>> On 6/26/25 1:07 PM, Eugenio Perez Martin wrote:
>>> On Thu, Jun 26, 2025 at 7:16 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>>>> I think there's something off in the way "free_head", "last_used_idx" and
>>>> "desc_next" values are calculated in vhost_svq_get_buf_packed() [1].
>>>>
>>>> In the latest test run, QEMU sent ids 0 through 28 to L2. L2 started receiving
>>>> them in order till id 8. At this point it received id 7 again for some reason
>>>> and then crashed.
>>>>
>>>> L2:
>>>>
>>>> [ 1641.129218] (prepare_packed) output.0 -> needs_kick: 1
>>>> [ 1641.130621] (notify) output.0 -> function will return true
>>>> [ 1641.132022] output.0 -> id: 0
>>>> [ 1739.502358] input.0 -> id: 0
>>>> [ 1739.503003] input.0 -> id: 1
>>>> [ 1739.562024] input.0 -> id: 2
>>>> [ 1739.578682] input.0 -> id: 3
>>>> [ 1739.661913] input.0 -> id: 4
>>>> [ 1739.828796] input.0 -> id: 5
>>>> [ 1739.829789] input.0 -> id: 6
>>>> [ 1740.078757] input.0 -> id: 7
>>>> [ 1740.079749] input.0 -> id: 8
>>>> [ 1740.080382] input.0 -> id: 7 <----Received 7 again
>>>> [ 1740.081614] virtio_net virtio1: input.0:id 7 is not a head!
>>>>
>>>> QEMU logs (vhost_svq_get_buf_packed):
>>>> ------
>>>> size : svq->vring.num
>>>> len : svq->vring_packed.vring.desc[last_used].len
>>>> id : svq->vring_packed.vring.desc[last_used].id
>>>> num : svq->desc_state[id].ndescs
>>>> last_used_chain : Result of vhost_svq_last_desc_of_chain(svq, num, id) [2]
>>>> free_head : svq->free_head
>>>> last_used : (last_used_idx & ~(1 << VRING_PACKED_EVENT_F_WRAP_CTR)) + num
>>>> used_wrap_counter : !!(last_used_idx & (1 << VRING_PACKED_EVENT_F_WRAP_CTR))
>>>> ------
>>>> size: 256, len: 82, id: 7, vq idx: 0
>>>> id: 7, last_used_chain: 7, free_head: 6, vq idx: 0
>>>> num: 1, free_head: 7, id: 7, last_used: 8, used_wrap_counter: 1, vq idx: 0
>>>> ------
>>>> size: 256, len: 104, id: 8, vq idx: 0
>>>> id: 8, last_used_chain: 8, free_head: 7, vq idx: 0
>>>> num: 1, free_head: 8, id: 8, last_used: 9, used_wrap_counter: 1, vq idx: 0
>>>> ------
>>>> size: 256, len: 98, id: 9, vq idx: 0
>>>> id: 9, last_used_chain: 9, free_head: 8, vq idx: 0
>>>> num: 1, free_head: 9, id: 9, last_used: 10, used_wrap_counter: 1, vq idx: 0
>>>> ------
>>>> size: 256, len: 104, id: 10, vq idx: 0
>>>> id: 10, last_used_chain: 10, free_head: 9, vq idx: 0
>>>> num: 1, free_head: 10, id: 10, last_used: 11, used_wrap_counter: 1, vq idx: 0
>>>>
>>>> I have a few more ideas of what to do. I'll let you know if I find something
>>>> else.
>>>>
>>> I cannot find anything just by inspection. What about printing all the
>>> desc_state and all desc_next to check for incoherencies in each
>>> svq_add and get_buf?
>> In this test, all 256 descriptors were filled in the RX vq.
>>
>> In the TX queue, L2 kernel would add one descriptor at a time and notify
>> QEMU. QEMU would then register it in its SVQ and mark it as "available".
>> After processing the descriptor, QEMU would mark it as "used" and flush it
>> back to L2. L2, in turn, would mark this descriptor as "used". After this
>> process, L2 would add the next descriptor in the TX vq while reusing this
>> ID. This was observed from idx 0 till idx 7.
>>
>> L2's debug logs:
>>
>> [ 18.379112] (use_indirect?) output.0 -> verdict: 0 <----- Begin adding descriptor in idx 6
>> [ 18.387134] (add_packed) output.0 -> idx: 6
>> [ 18.389897] (add_packed) output.0 -> id: 0
>> [ 18.392290] (add_packed) output.0 -> len: 74
>> [ 18.394606] (add_packed) output.0 -> addr: 5012315726
>> [ 18.397043] (add_packed) output.0 -> next id: 1
>> [ 18.399861] Entering prepare_packed: output.0
>> [ 18.402478] (prepare_packed) output.0 -> needs_kick: 1
>> [ 18.404998] (notify) output.0 -> function will return true <----- Notify QEMU
>> [ 18.406349] output.0 -> id: 0, idx: 6 <----- Mark ID 0 in idx 6 as used
>> [ 18.409482] output.0 -> old free_head: 1, new free_head: 0 <----- ID 0 can be reused
>> [ 18.410919] (after get_buf processed) output.0 -> id: 0, idx: 7 <----- Next slot is idx 7
>> [ 18.921895] (use_indirect?) output.0 -> verdict: 0 <----- Begin adding descriptor with ID = 0 in idx 7
>> [ 18.930093] (add_packed) output.0 -> idx: 7
>> [ 18.935715] (add_packed) output.0 -> id: 0
>> [ 18.937609] (add_packed) output.0 -> len: 122
>> [ 18.939614] (add_packed) output.0 -> addr: 4925868038
>> [ 18.941710] (add_packed) output.0 -> next id: 1
>> [ 18.944032] Entering prepare_packed: output.0
>> [ 18.946148] (prepare_packed) output.0 -> needs_kick: 1
>> [ 18.948234] (notify) output.0 -> function will return true <----- Notify QEMU
>> [ 18.949606] output.0 -> id: 0, idx: 7 <----- Mark ID 0 in idx 7 as used
>> [ 18.952756] output.0 -> old free_head: 1, new free_head: 0 <----- ID 0 can be reused
>> [ 18.955154] (after get_buf processed) output.0 -> id: 0, idx: 8 <----- Next slot is idx 8
>>
>> There was no issue in QEMU till this point.
>>
>> [ 19.177536] (use_indirect?) output.0 -> verdict: 0 <----- Begin adding descriptor with ID = 0 in idx 8
>> [ 19.182415] (add_packed) output.0 -> idx: 8
>> [ 19.187257] (add_packed) output.0 -> id: 0
>> [ 19.191355] (add_packed) output.0 -> len: 102
>> [ 19.195131] (add_packed) output.0 -> addr: 4370702342
>> [ 19.199224] (add_packed) output.0 -> next id: 1
>> [ 19.204929] Entering prepare_packed: output.0
>> [ 19.209505] (prepare_packed) output.0 -> needs_kick: 1
>> [ 19.213820] (notify) output.0 -> function will return true <----- Notify QEMU
>> [ 19.218792] (use_indirect?) output.0 -> verdict: 0 <----- Next slot is idx 9
>> [ 19.224730] (add_packed) output.0 -> idx: 9
>> [ 19.227067] (add_packed) output.0 -> id: 1 <----- ID 0 can't be reused yet, so use ID = 1
>> [ 19.229090] (add_packed) output.0 -> len: 330
>> [ 19.231182] (add_packed) output.0 -> addr: 4311020614
>> [ 19.233302] (add_packed) output.0 -> next id: 2
>> [ 19.235620] Entering prepare_packed: output.0
>> [ 19.237781] (prepare_packed) output.0 -> needs_kick: 1
>> [ 19.239958] (notify) output.0 -> function will return true <----- Notify QEMU
>> [ 19.237780] output.0 -> id: 0, idx: 8 <----- Mark ID 0 in idx 8 as used
>> [ 19.243676] output.0 -> old free_head: 2, new free_head: 0 <----- ID 0 can now be reused
>> [ 19.245214] (after get_buf processed) output.0 -> id: 0, idx: 9 <----- Next slot is idx 9
>> [ 19.247097] output.0 -> id: 1, idx: 9 <----- Mark ID 1 in idx 9 as used
>> [ 19.249612] output.0 -> old free_head: 0, new free_head: 1 <----- ID 1 can now be reused
>> [ 19.252266] (after get_buf processed) output.0 -> id: 1, idx: 10 <----- Next slot is idx 10
>>
>> ID 0 and ID 1 in idx 8 and idx 9 respectively are pushed to QEMU
>> before either of them are marked as used.
>>
>> But in QEMU, the order is slightly different.
>>
>> num: 1, init_flags: 128 <----- vhost_svq_add_packed()
>> idx: 8, id: 0, len: 0, flags: 0, vq idx: 1 <----- Before adding descriptor
>> idx: 8, id: 0, len: 102, flags: 128, vq idx: 1 <----- After adding descriptor
>> Finally: new_idx: 9, head_idx: 8, id: 0, len: 102, flags: 128, vq idx: 1
>> svq->vring.num: 256 <----- Begin vhost_svq_get_buf_packed()
>> descriptor_len: 0
>> descriptor_id: 0 <----- Mark ID = 0 as used
>> last_used: 8 <----- Processing idx 8
>> used_wrap_counter: 1
>> svq->desc_state[id].ndescs: 1
>> free_head: 0 <----- Update free_head to 0.
>> last_used: 9 <----- Update last_used to 9.
>> vq idx: 1 <----- End vhost_svq_get_buf_packed()
>> i: 0 <----- vhost_svq_flush()
>> descriptor_len: 0
>> elem->len: 22086
>> i: 1
>> elem_is_null: 1
>> vq idx: 1 <----- End vhost_svq_flush()
>> num: 1, init_flags: 128 <----- vhost_svq_add_packed()
>> idx: 9, id: 0, len: 0, flags: 0, curr: 0, vq idx: 1 <----- Before adding descriptor
>> idx: 9, id: 0, len: 330, flags: 128, curr: 1, vq idx: 1 <----- After adding descriptor
>> Finally: new_idx: 10, head_idx: 9, id: 0, len: 330, flags: 128, vq idx: 1 <----- ID 0 has been reused (versus ID 1 in L2)
>> svq->vring.num: 256 <----- Begin vhost_svq_get_buf_packed()
>> descriptor_len: 0
>> descriptor_id: 0 <----- Mark ID = 0 as used
>> last_used: 9 <----- Processing idx 9
>> used_wrap_counter: 1
>> svq->desc_state[id].ndescs: 1
>> free_head: 0 <----- Update free_head to 0.
>> last_used: 10 <----- Update last_used to 10.
>> vq idx: 1 <----- End vhost_svq_get_buf_packed()
>> i: 0 <----- vhost_svq_flush()
>> descriptor_len: 0
>> elem->len: 22086
>> i: 1
>> elem_is_null: 1
>> vq idx: 1 <----- End vhost_svq_flush()
>>
>> In QEMU, id 0 is added in idx 8. But it's marked as used before a
>> descriptor can be added in idx 9. Because of this there's a discrepancy
>> in the value of free_head and in svq->desc_next.
>>
In the current implementation, the ID values are generated, maintained,
and processed by QEMU instead of being read from the guest's memory. I think
reading the ID from guest memory (similar to reading the descriptor
length from guest memory) should resolve this issue.
>>
>
> Ok you made a good catch here :).
>
> The 1:1 sync is hard to achieve as a single buffer in the guest may
> need to be split into many buffers in the host.
>
>> The alternative would be to ensure that "add_packed" and "get_buf_packed"
>> are synchronized between the guest and QEMU.
>
> Yes, they're synchronized. When the guest makes a descriptor
> available, its head is saved in the VirtQueueElement of the SVQ's
> head idx in svq->desc_state.
>
> Reviewing patch 3/7 I see you're actually returning the id of the
> first descriptor of the chain in *head, while it should be the id of
> the *last* descriptor. It should not be the cause of the failure, as I
> don't see any descriptor chain in the log.
Does this mean if the current free_head is 3 and the length of the chain
is 4, then ID 7 should be saved in the descriptor ring? In the current
implementation, *all* descriptors in the chain are being assigned the same
ID (= free_head instead of free_head + length of chain).
> To keep the free linked list happy we may need to store the head of the
> descriptor chain in the vq too.
>
> Now, why is SVQ id 0 being reused? Sounds like free_list is not
> initialized to 0, 1, 2... but to something else like 0, 0, 0, etc. Can
> you print the whole list in each iteration?
>
The free_list initially has the following values:
index 0 -> 1
index 1 -> 2
...
index 254 -> 255
index 255 -> 0
free_head is set to 0 at the beginning. When add_packed() executes for
the first time ID 0 is used and free_head is set to 1. If get_buf_packed()
is run immediately after the "add" operation, free_list[ID] is set to
the current free_head (ID 1 in this case). After this free_head is set to
the used ID (0 in this case).
So, free_list still looks like this:
index 0 -> 1
index 1 -> 2
...
index 254 -> 255
index 255 -> 0
But the free_head is 0. So, ID = 0 is reused again.
When 2 IDs (such as with idx 8 and idx 9 in the guest) are added to the
SVQ without either being used, then free_head is updated as shown:
free_head = 0 -> 1 -> 2
And then marking both IDs as "used" results in:
free_list[0] = current free_head (= 2)
free_head = 0
free_list[1] = current free_head (= 0)
free_head = 1
So free_list looks like this:
index 0 -> 2
index 1 -> 0
index 2 -> 3
index 3 -> 4
...
index 254 -> 255
index 255 -> 0
None of the indices in free_list hold the value 1.
Since at this point free_head is 1, ID 1 is now used very frequently
when add_packed() and get_buf_packed() run one at a time in an interleaved
fashion.
Thanks,
Sahil
^ permalink raw reply [flat|nested] 44+ messages in thread
* Re: [RFC v5 0/7] Add packed format to shadow virtqueue
2025-08-04 6:04 ` Sahil Siddiq
@ 2025-08-05 9:07 ` Eugenio Perez Martin
0 siblings, 0 replies; 44+ messages in thread
From: Eugenio Perez Martin @ 2025-08-05 9:07 UTC (permalink / raw)
To: Sahil Siddiq; +Cc: sgarzare, mst, qemu-devel, sahilcdq
On Mon, Aug 4, 2025 at 8:04 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
>
> Hi,
>
> On 7/31/25 7:22 PM, Eugenio Perez Martin wrote:
> > On Wed, Jul 30, 2025 at 4:33 PM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >> I think I have finally found the reason behind this issue.
> >>
> >> The order in which "add_packed" and "get_buf_packed" are performed in the
> >> nested guest kernel (L2 kernel) and QEMU are a little different. Due to
> >> this, the values in free_head and svq->desc_next[] differ and the guest
> >> crashes at some point. More below.
> >>
> >> On 6/26/25 1:07 PM, Eugenio Perez Martin wrote:
> >>> On Thu, Jun 26, 2025 at 7:16 AM Sahil Siddiq <icegambit91@gmail.com> wrote:
> >>>> I think there's something off in the way "free_head", "last_used_idx" and
> >>>> "desc_next" values are calculated in vhost_svq_get_buf_packed() [1].
> >>>>
> >>>> In the latest test run, QEMU sent ids 0 through 28 to L2. L2 started receiving
> >>>> them in order till id 8. At this point it received id 7 again for some reason
> >>>> and then crashed.
> >>>>
> >>>> [L2 and QEMU logs snipped; unchanged from the previous message.]
> >>>>
> >>>> I have a few more ideas of what to do. I'll let you know if I find something
> >>>> else.
> >>>>
> >>> I cannot find anything just by inspection. What about printing all the
> >>> desc_state and all desc_next to check for incoherencies in each
> >>> svq_add and get_buf?
> >> In this test, all 256 descriptors were filled in the RX vq.
> >>
> >> In the TX queue, L2 kernel would add one descriptor at a time and notify
> >> QEMU. QEMU would then register it in its SVQ and mark it as "available".
> >> After processing the descriptor, QEMU would mark it as "used" and flush it
> >> back to L2. L2, in turn, would mark this descriptor as "used". After this
> >> process, L2 would add the next descriptor in the TX vq while reusing this
> >> ID. This was observed from idx 0 till idx 7.
> >>
> >> [L2 debug logs snipped; unchanged from the previous message.]
> >>
> >> ID 0 and ID 1 in idx 8 and idx 9 respectively are pushed to QEMU
> >> before either of them are marked as used.
> >>
> >> But in QEMU, the order is slightly different.
> >>
> >> [QEMU logs snipped; unchanged from the previous message.]
> >>
> >> In QEMU, id 0 is added in idx 8. But it's marked as used before a
> >> descriptor can be added in idx 9. Because of this there's a discrepancy
> >> in the value of free_head and in svq->desc_next.
> >>
> >> In the current implementation, the ID values are generated, maintained,
> >> and processed by QEMU instead of being read from the guest's memory. I think
> >> reading the ID from guest memory (similar to reading the descriptor
> >> length from guest memory) should resolve this issue.
> >>
> >
> > Ok you made a good catch here :).
> >
> > The 1:1 sync is hard to achieve as a single buffer in the guest may
> > need to be split into many buffers in the host.
> >
> >> The alternative would be to ensure that "add_packed" and "get_buf_packed"
> >> are synchronized between the guest and QEMU.
> >
> > Yes, they're synchronized. When the guest makes a descriptor
> > available, its head is saved in the VirtQueueElement of the SVQ's
> > head idx in svq->desc_state.
> >
> > Reviewing patch 3/7 I see you're actually returning the id of the
> > first descriptor of the chain in *head, while it should be the id of
> > the *last* descriptor. It should not be the cause of the failure, as I
> > don't see any descriptor chain in the log.
>
> Does this mean if the current free_head is 3 and the length of the chain
> is 4, then ID 7 should be saved in the descriptor ring? In the current
> implementation, *all* descriptors in the chain are being assigned the same
> ID (= free_head instead of free_head + length of chain).
>
Ouch, you're right! I recall I had the same comment while reading the
kernel's version and I forgot it.
> > To keep the free linked list happy we may need to store the head of the
> > descriptor chain in the vq too.
> >
> > Now, why is SVQ id 0 being reused? Sounds like free_list is not
> > initialized to 0, 1, 2... but to something else like 0, 0, 0, etc. Can
> > you print the whole list in each iteration?
> >
>
> The free_list initially has the following values:
> index 0 -> 1
> index 1 -> 2
> ...
> index 254 -> 255
> index 255 -> 0
>
> free_head is set to 0 at the beginning. When add_packed() executes for
> the first time ID 0 is used and free_head is set to 1. If get_buf_packed()
> is run immediately after the "add" operation, free_list[ID] is set to
> the current free_head (ID 1 in this case). After this free_head is set to
> the used ID (0 in this case).
>
> So, free_list still looks like this:
> index 0 -> 1
> index 1 -> 2
> ...
> index 254 -> 255
> index 255 -> 0
>
> But the free_head is 0. So, ID = 0 is reused again.
>
> When 2 IDs (such as with idx 8 and idx 9 in the guest) are added to the
> SVQ without either being used, then free_head is updated as shown:
>
> free_head = 0 -> 1 -> 2
>
> And then marking both IDs as "used" results in:
>
> free_list[0] = current free_head (= 2)
> free_head = 0
> free_list[1] = current free_head (= 0)
> free_head = 1
>
> So free_list looks like this:
>
> index 0 -> 2
> index 1 -> 0
> index 2 -> 3
> index 3 -> 4
> ...
> index 254 -> 255
> index 255 -> 0
>
> None of the indices in free_list hold the value 1.
>
So the free_list should always traverse all the indices. If SVQ
makes available 0, 1 and then the device uses 0, 1 in that order, the
free list should be 1 -> 0 -> 2 -> 3... Otherwise the loop will make
available the same descriptor twice. You should always have this case
in the rx queue in the Linux kernel virtio driver once it uses two
descriptors, maybe you can trace it and compare it with your SVQ code?
I just spotted that you're using le32_to_cpu(used descriptor id) in
patch 5/7, but it should be le16_to_cpu(). It shouldn't matter on x86
but it would cause problems on big-endian architectures.
> Since at this point free_head is 1, ID 1 is now used very frequently
> when add_packed() and get_buf_packed() run one at a time in an interleaved
> fashion.
>
This part is expected.
end of thread, other threads:[~2025-08-05 9:09 UTC | newest]
Thread overview: 44+ messages
2025-03-24 13:59 [RFC v5 0/7] Add packed format to shadow virtqueue Sahil Siddiq
2025-03-24 13:59 ` [RFC v5 1/7] vhost: Refactor vhost_svq_add_split Sahil Siddiq
2025-03-26 11:25 ` Eugenio Perez Martin
2025-03-28 5:18 ` Sahil Siddiq
2025-03-24 13:59 ` [RFC v5 2/7] vhost: Data structure changes to support packed vqs Sahil Siddiq
2025-03-26 11:26 ` Eugenio Perez Martin
2025-03-28 5:17 ` Sahil Siddiq
2025-03-24 13:59 ` [RFC v5 3/7] vhost: Forward descriptors to device via packed SVQ Sahil Siddiq
2025-03-24 14:14 ` Sahil Siddiq
2025-03-26 8:03 ` Eugenio Perez Martin
2025-03-27 18:42 ` Sahil Siddiq
2025-03-28 7:51 ` Eugenio Perez Martin
2025-04-14 9:37 ` Sahil Siddiq
2025-04-14 15:07 ` Eugenio Perez Martin
2025-04-15 19:10 ` Sahil Siddiq
2025-03-26 12:02 ` Eugenio Perez Martin
2025-03-28 5:09 ` Sahil Siddiq
2025-03-28 6:42 ` Eugenio Perez Martin
2025-03-24 13:59 ` [RFC v5 4/7] vdpa: Allocate memory for SVQ and map them to vdpa Sahil Siddiq
2025-03-26 12:05 ` Eugenio Perez Martin
2025-03-24 13:59 ` [RFC v5 5/7] vhost: Forward descriptors to guest via packed vqs Sahil Siddiq
2025-03-24 14:34 ` Sahil Siddiq
2025-03-26 8:34 ` Eugenio Perez Martin
2025-03-28 5:22 ` Sahil Siddiq
2025-03-28 7:53 ` Eugenio Perez Martin
2025-03-24 13:59 ` [RFC v5 6/7] vhost: Validate transport device features for " Sahil Siddiq
2025-03-26 12:06 ` Eugenio Perez Martin
2025-03-28 5:33 ` Sahil Siddiq
2025-03-28 8:02 ` Eugenio Perez Martin
2025-03-24 13:59 ` [RFC v5 7/7] vdpa: Support setting vring_base for packed SVQ Sahil Siddiq
2025-03-26 12:08 ` Eugenio Perez Martin
2025-03-27 18:44 ` Sahil Siddiq
2025-03-26 7:35 ` [RFC v5 0/7] Add packed format to shadow virtqueue Eugenio Perez Martin
2025-04-14 9:20 ` Sahil Siddiq
2025-04-15 19:20 ` Sahil Siddiq
2025-04-16 7:20 ` Eugenio Perez Martin
2025-05-14 6:21 ` Sahil Siddiq
2025-05-15 6:19 ` Eugenio Perez Martin
2025-06-26 5:16 ` Sahil Siddiq
2025-06-26 7:37 ` Eugenio Perez Martin
2025-07-30 14:32 ` Sahil Siddiq
2025-07-31 13:52 ` Eugenio Perez Martin
2025-08-04 6:04 ` Sahil Siddiq
2025-08-05 9:07 ` Eugenio Perez Martin