* [RFC PATCH v4 00/20] vDPA shadow virtqueue
@ 2021-10-01  7:05 Eugenio Pérez
  2021-10-01  7:05 ` [RFC PATCH v4 01/20] virtio: Add VIRTIO_F_QUEUE_STATE Eugenio Pérez
                   ` (20 more replies)
  0 siblings, 21 replies; 90+ messages in thread
From: Eugenio Pérez @ 2021-10-01  7:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
This series enables shadow virtqueue (SVQ) for vhost-vdpa devices. This
is intended as a new method of tracking the memory the devices touch
during a migration process: instead of relying on the vhost device's
dirty logging capability, SVQ intercepts the VQ dataplane, forwarding
the descriptors between VM and device. This way qemu is the effective
writer of the guest's memory, as in qemu's virtio device operation.
When SVQ is enabled, qemu offers a new vring to the device to read
and write into, and also intercepts kicks and calls between the device
and the guest. Relaying used buffers would cause dirty memory to be
tracked, but in this RFC SVQ is not enabled automatically on migration.
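To make the dirty tracking idea concrete, here is a minimal sketch of how a relay that sees every used buffer could log the pages the device wrote. It is not code from this series: the DirtyLog type, the function names and the 16-page guest size are all hypothetical, and the real implementation would go through qemu's existing virtqueue and migration code instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical names and sizes for illustration; none of this is QEMU API. */
#define SKETCH_PAGE_SHIFT 12
#define SKETCH_GUEST_PAGES 16
#define SKETCH_GUEST_SIZE ((size_t)SKETCH_GUEST_PAGES << SKETCH_PAGE_SHIFT)

typedef struct DirtyLog {
    bool dirty[SKETCH_GUEST_PAGES];  /* one flag per guest page */
} DirtyLog;

/* Mark every page overlapped by [gpa, gpa + len) as dirty. */
static void dirty_log_mark(DirtyLog *log, size_t gpa, size_t len)
{
    if (len == 0) {
        return;
    }
    assert(gpa < SKETCH_GUEST_SIZE && len <= SKETCH_GUEST_SIZE - gpa);
    size_t first = gpa >> SKETCH_PAGE_SHIFT;
    size_t last = (gpa + len - 1) >> SKETCH_PAGE_SHIFT;
    for (size_t page = first; page <= last; page++) {
        log->dirty[page] = true;
    }
}

/*
 * Called when the relay sees a used buffer: the device reported that it
 * wrote `written` bytes at guest physical address `gpa`. Since the relay
 * is what hands the buffer back to the guest, it can log the pages here.
 */
static void relay_used_buffer(DirtyLog *log, size_t gpa, size_t written)
{
    dirty_log_mark(log, gpa, written);
}
```

A buffer that straddles a page boundary marks both pages, which is why the loop runs from the first to the last overlapped page rather than marking only the page of the start address.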
It is based on the ideas of DPDK SW assisted LM, in the series at
https://patchwork.dpdk.org/cover/48370/ . However, this series does
not map the shadow vq in the guest's VA, but in qemu's.
For qemu to use shadow virtqueues, the guest virtio driver must not use
features like event_idx or indirect descriptors. These limitations will
be addressed in later series, but they are left out for simplicity at
the moment.
SVQ needs to be enabled with the QMP command:
{ "execute": "x-vhost-enable-shadow-vq",
      "arguments": { "name": "dev0", "enable": true } }
This series includes some patches, to be deleted in the final version,
that help with its testing. The first two patches of the series freely
implement the feature to stop the device and retrieve its status. It is
intended to be used with the vp_vdpa driver in a nested environment.
This driver also needs modifications to forward the new status bit.
Patches 2-8 prepare the SVQ and QMP command to support guest to host
notifications forwarding. If the SVQ is enabled with these patches
applied and the device supports it, that part can be tested in
isolation (for example, with networking), hopping through SVQ.
The same is true of patches 9-13, but with device to guest
notifications.
The rest of the patches implement the actual buffer forwarding.
Comments are welcome.
TODO:
* Event, indirect, packed, and other features of virtio - Waiting for
  confirmation of the big picture.
* Use already available iova tree to track mappings.
* Separate buffer forwarding into its own AIO context, so we can
  throw more threads at that task and we don't need to stop the main
  event loop.
* Unmap iommu memory. Now the tree can only grow from SVQ enable, but
  it should be fine as long as not a lot of memory is added to the
  guest.
* Rebase on top of latest qemu (and, hopefully, on top of multiqueue
  vdpa).
* Some assertions need to become appropriate error handling paths.
* Proper documentation.
Changes from v3 RFC:
  * Move everything to vhost-vdpa backend. A big change, this allowed
    some cleanup but more code has been added in other places.
  * More use of glib utilities, especially to manage memory.
v3 link:
https://lists.nongnu.org/archive/html/qemu-devel/2021-05/msg06032.html
Changes from v2 RFC:
  * Adding vhost-vdpa devices support
  * Fixed some memory leaks pointed out by different comments
v2 link:
https://lists.nongnu.org/archive/html/qemu-devel/2021-03/msg05600.html
Changes from v1 RFC:
  * Use QMP instead of migration to start SVQ mode.
  * Only accepting IOMMU devices, closer behavior with target devices
    (vDPA)
  * Fix invalid masking/unmasking of vhost call fd.
  * Use of proper methods for synchronization.
  * No need to modify VirtIO device code, all of the changes are
    contained in vhost code.
  * Delete superfluous code.
  * An intermediate RFC was sent with only the notifications forwarding
    changes. It can be seen in
    https://patchew.org/QEMU/20210129205415.876290-1-eperezma@redhat.com/
v1 link:
https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg05372.html
Eugenio Pérez (20):
      virtio: Add VIRTIO_F_QUEUE_STATE
      virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED
      virtio: Add virtio_queue_is_host_notifier_enabled
      vhost: Make vhost_virtqueue_{start,stop} public
      vhost: Add x-vhost-enable-shadow-vq qmp
      vhost: Add VhostShadowVirtqueue
      vdpa: Register vdpa devices in a list
      vhost: Route guest->host notification through shadow virtqueue
      Add vhost_svq_get_svq_call_notifier
      Add vhost_svq_set_guest_call_notifier
      vdpa: Save call_fd in vhost-vdpa
      vhost-vdpa: Take into account SVQ in vhost_vdpa_set_vring_call
      vhost: Route host->guest notification through shadow virtqueue
      virtio: Add vhost_shadow_vq_get_vring_addr
      vdpa: Save host and guest features
      vhost: Add vhost_svq_valid_device_features to shadow vq
      vhost: Shadow virtqueue buffers forwarding
      vhost: Add VhostIOVATree
      vhost: Use a tree to store memory mappings
      vdpa: Add custom IOTLB translations to SVQ
Eugenio Pérez (20):
  virtio: Add VIRTIO_F_QUEUE_STATE
  virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED
  virtio: Add virtio_queue_is_host_notifier_enabled
  vhost: Make vhost_virtqueue_{start,stop} public
  vhost: Add x-vhost-enable-shadow-vq qmp
  vhost: Add VhostShadowVirtqueue
  vdpa: Register vdpa devices in a list
  vhost: Route guest->host notification through shadow virtqueue
  vdpa: Save call_fd in vhost-vdpa
  vhost-vdpa: Take into account SVQ in vhost_vdpa_set_vring_call
  vhost: Route host->guest notification through shadow virtqueue
  virtio: Add vhost_shadow_vq_get_vring_addr
  vdpa: Save host and guest features
  vhost: Add vhost_svq_valid_device_features to shadow vq
  vhost: Shadow virtqueue buffers forwarding
  vhost: Check for device VRING_USED_F_NO_NOTIFY at shadow virtqueue
    kick
  vhost: Use VRING_AVAIL_F_NO_INTERRUPT at device call on shadow
    virtqueue
  vhost: Add VhostIOVATree
  vhost: Use a tree to store memory mappings
  vdpa: Add custom IOTLB translations to SVQ
 qapi/net.json                                 |  23 +
 hw/virtio/vhost-iova-tree.h                   |  40 ++
 hw/virtio/vhost-shadow-virtqueue.h            |  37 ++
 hw/virtio/virtio-pci.h                        |   1 +
 include/hw/virtio/vhost-vdpa.h                |  13 +
 include/hw/virtio/vhost.h                     |   4 +
 include/hw/virtio/virtio.h                    |   5 +-
 .../standard-headers/linux/virtio_config.h    |   5 +
 include/standard-headers/linux/virtio_pci.h   |   2 +
 hw/net/virtio-net.c                           |   6 +-
 hw/virtio/vhost-iova-tree.c                   | 230 +++++++
 hw/virtio/vhost-shadow-virtqueue.c            | 619 ++++++++++++++++++
 hw/virtio/vhost-vdpa.c                        | 412 +++++++++++-
 hw/virtio/vhost.c                             |  12 +-
 hw/virtio/virtio-pci.c                        |  16 +-
 hw/virtio/virtio.c                            |   5 +
 hw/virtio/meson.build                         |   2 +-
 hw/virtio/trace-events                        |   1 +
 18 files changed, 1413 insertions(+), 20 deletions(-)
 create mode 100644 hw/virtio/vhost-iova-tree.h
 create mode 100644 hw/virtio/vhost-shadow-virtqueue.h
 create mode 100644 hw/virtio/vhost-iova-tree.c
 create mode 100644 hw/virtio/vhost-shadow-virtqueue.c
-- 
2.27.0
* [RFC PATCH v4 01/20] virtio: Add VIRTIO_F_QUEUE_STATE
  2021-10-01  7:05 [RFC PATCH v4 00/20] vDPA shadow virtqueue Eugenio Pérez
@ 2021-10-01  7:05 ` Eugenio Pérez
  2021-10-01  7:05 ` [RFC PATCH v4 02/20] virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED Eugenio Pérez
                   ` (19 subsequent siblings)
  20 siblings, 0 replies; 90+ messages in thread
From: Eugenio Pérez @ 2021-10-01  7:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
virtio: Add VIRTIO_F_QUEUE_STATE
Implementation of RFC of device state capability:
https://lists.oasis-open.org/archives/virtio-comment/202012/msg00005.html
With this capability, a vdpa device can reset its index so it can start
consuming from the guest after disabling shadow virtqueue (SVQ), with a
state other than 0.
The use case is to test SVQ with virtio-pci vdpa (vp_vdpa) with nested
virtualization: spawn an L0 qemu with a virtio-net device, use the
vp_vdpa driver to handle it in the guest, and then spawn an L1 qemu
using that vdpa device. When the L1 qemu calls the device to set a new
state through the vdpa ioctl, vp_vdpa should set each queue state
through virtio VIRTIO_PCI_COMMON_Q_AVAIL_STATE.
Since this is only for testing vhost-vdpa, it's added here before
proposing it to kernel code. No effort is made to check that the device
can actually change its state, its layout, or whether the device even
supports changing state at all. These will be added in the future.
Also, a modified version of vp_vdpa that allows setting these in PCI
config is needed.
TODO: Check for feature enabled and split in virtio pci config
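The register semantics described above can be summarized with a small standalone model. This is only a sketch under simplifying assumptions: QueueModel and its helpers are hypothetical names, not qemu structures, and the model ignores feature negotiation entirely. It mirrors the behavior in the diff below: a write to the state register is latched, the latched value becomes the starting index when the queue is enabled, and a read returns the device's current last_avail_idx.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified model of one virtqueue; not QEMU's structs. */
typedef struct QueueModel {
    uint16_t state;           /* value latched by a write to the state register */
    uint16_t last_avail_idx;  /* index the device consumes from next */
    bool enabled;
} QueueModel;

/* Driver writes the state register: the value is only latched. */
static void queue_write_avail_state(QueueModel *q, uint16_t val)
{
    q->state = val;
}

/* Driver enables the queue: the latched state becomes the start index. */
static void queue_enable(QueueModel *q)
{
    assert(!q->enabled);  /* model restriction: enable only once per reset */
    q->last_avail_idx = q->state;
    q->enabled = true;
}

/* Driver reads the state register: the device reports its current index. */
static uint16_t queue_read_avail_state(const QueueModel *q)
{
    return q->last_avail_idx;
}

/* Device reset clears the latched state along with everything else. */
static void queue_reset(QueueModel *q)
{
    q->state = 0;
    q->last_avail_idx = 0;
    q->enabled = false;
}
```

One consequence of latching is that a value written after the queue is enabled has no effect until the next enable, so in this model the driver has to program the state before enabling the queue.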
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/virtio-pci.h                         | 1 +
 include/hw/virtio/virtio.h                     | 4 +++-
 include/standard-headers/linux/virtio_config.h | 3 +++
 include/standard-headers/linux/virtio_pci.h    | 2 ++
 hw/virtio/virtio-pci.c                         | 9 +++++++++
 5 files changed, 18 insertions(+), 1 deletion(-)
diff --git a/hw/virtio/virtio-pci.h b/hw/virtio/virtio-pci.h
index 2446dcd9ae..019badbd7c 100644
--- a/hw/virtio/virtio-pci.h
+++ b/hw/virtio/virtio-pci.h
@@ -120,6 +120,7 @@ typedef struct VirtIOPCIQueue {
   uint32_t desc[2];
   uint32_t avail[2];
   uint32_t used[2];
+  uint16_t state;
 } VirtIOPCIQueue;
 
 struct VirtIOPCIProxy {
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 8bab9cfb75..5fe575b8f0 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -289,7 +289,9 @@ typedef struct VirtIORNGConf VirtIORNGConf;
     DEFINE_PROP_BIT64("iommu_platform", _state, _field, \
                       VIRTIO_F_IOMMU_PLATFORM, false), \
     DEFINE_PROP_BIT64("packed", _state, _field, \
-                      VIRTIO_F_RING_PACKED, false)
+                      VIRTIO_F_RING_PACKED, false), \
+    DEFINE_PROP_BIT64("save_restore_q_state", _state, _field, \
+                      VIRTIO_F_QUEUE_STATE, true)
 
 hwaddr virtio_queue_get_desc_addr(VirtIODevice *vdev, int n);
 bool virtio_queue_enabled_legacy(VirtIODevice *vdev, int n);
diff --git a/include/standard-headers/linux/virtio_config.h b/include/standard-headers/linux/virtio_config.h
index 22e3a85f67..59fad3eb45 100644
--- a/include/standard-headers/linux/virtio_config.h
+++ b/include/standard-headers/linux/virtio_config.h
@@ -90,4 +90,7 @@
  * Does the device support Single Root I/O Virtualization?
  */
 #define VIRTIO_F_SR_IOV			37
+
+/* Device support save and restore virtqueue state */
+#define VIRTIO_F_QUEUE_STATE            40
 #endif /* _LINUX_VIRTIO_CONFIG_H */
diff --git a/include/standard-headers/linux/virtio_pci.h b/include/standard-headers/linux/virtio_pci.h
index db7a8e2fcb..c8d9802a87 100644
--- a/include/standard-headers/linux/virtio_pci.h
+++ b/include/standard-headers/linux/virtio_pci.h
@@ -164,6 +164,7 @@ struct virtio_pci_common_cfg {
 	uint32_t queue_avail_hi;		/* read-write */
 	uint32_t queue_used_lo;		/* read-write */
 	uint32_t queue_used_hi;		/* read-write */
+	uint16_t queue_avail_state;     /* read-write */
 };
 
 /* Fields in VIRTIO_PCI_CAP_PCI_CFG: */
@@ -202,6 +203,7 @@ struct virtio_pci_cfg_cap {
 #define VIRTIO_PCI_COMMON_Q_AVAILHI	44
 #define VIRTIO_PCI_COMMON_Q_USEDLO	48
 #define VIRTIO_PCI_COMMON_Q_USEDHI	52
+#define VIRTIO_PCI_COMMON_Q_AVAIL_STATE	56
 
 #endif /* VIRTIO_PCI_NO_MODERN */
 
diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index b321604d9b..6f30118c4e 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -1216,6 +1216,9 @@ static uint64_t virtio_pci_common_read(void *opaque, hwaddr addr,
     case VIRTIO_PCI_COMMON_Q_USEDHI:
         val = proxy->vqs[vdev->queue_sel].used[1];
         break;
+    case VIRTIO_PCI_COMMON_Q_AVAIL_STATE:
+        val = virtio_queue_get_last_avail_idx(vdev, vdev->queue_sel);
+        break;
     default:
         val = 0;
     }
@@ -1298,6 +1301,8 @@ static void virtio_pci_common_write(void *opaque, hwaddr addr,
                        proxy->vqs[vdev->queue_sel].avail[0],
                        ((uint64_t)proxy->vqs[vdev->queue_sel].used[1]) << 32 |
                        proxy->vqs[vdev->queue_sel].used[0]);
+            virtio_queue_set_last_avail_idx(vdev, vdev->queue_sel,
+                        proxy->vqs[vdev->queue_sel].state);
             proxy->vqs[vdev->queue_sel].enabled = 1;
         } else {
             virtio_error(vdev, "wrong value for queue_enable %"PRIx64, val);
@@ -1321,6 +1326,9 @@ static void virtio_pci_common_write(void *opaque, hwaddr addr,
     case VIRTIO_PCI_COMMON_Q_USEDHI:
         proxy->vqs[vdev->queue_sel].used[1] = val;
         break;
+    case VIRTIO_PCI_COMMON_Q_AVAIL_STATE:
+        proxy->vqs[vdev->queue_sel].state = val;
+        break;
     default:
         break;
     }
@@ -1909,6 +1917,7 @@ static void virtio_pci_reset(DeviceState *qdev)
         proxy->vqs[i].desc[0] = proxy->vqs[i].desc[1] = 0;
         proxy->vqs[i].avail[0] = proxy->vqs[i].avail[1] = 0;
         proxy->vqs[i].used[0] = proxy->vqs[i].used[1] = 0;
+        proxy->vqs[i].state = 0;
     }
 
     if (pci_is_express(dev)) {
-- 
2.27.0
* [RFC PATCH v4 02/20] virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED
  2021-10-01  7:05 [RFC PATCH v4 00/20] vDPA shadow virtqueue Eugenio Pérez
  2021-10-01  7:05 ` [RFC PATCH v4 01/20] virtio: Add VIRTIO_F_QUEUE_STATE Eugenio Pérez
@ 2021-10-01  7:05 ` Eugenio Pérez
  2021-10-01  7:05 ` [RFC PATCH v4 03/20] virtio: Add virtio_queue_is_host_notifier_enabled Eugenio Pérez
                   ` (18 subsequent siblings)
  20 siblings, 0 replies; 90+ messages in thread
From: Eugenio Pérez @ 2021-10-01  7:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
So the guest can stop and start the net device. It freely implements
the RFC
https://lists.oasis-open.org/archives/virtio-comment/202012/msg00027.html
Stopping (as in "pausing") the device is required to migrate status and
vring addresses between device and SVQ. Once the device is stopped, the
driver can request avail_idx, so it can be assigned to SVQ.
This is a WIP commit: as with VIRTIO_F_QUEUE_STATE, it is introduced in
virtio_config.h before even being proposed for the kernel, with no
feature flag and with no checking in the device. It also needs a
modified vp_vdpa driver that supports setting and retrieving status.
For virtio-net with a qemu device there is no need to restore avail
state: since every tx and rx operation is done entirely under the BQL
as far as virtio is concerned, it would be enough to restore
last_avail_idx from used_idx. Doing it this way tests the vq state part
of the rest of the series.
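As a rough illustration of the intended semantics, the check "the device should process its queues" can be written as a predicate over the status byte. The two constants are the values used in this patch; device_should_run is a hypothetical helper that exists only in this sketch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Values as defined in this patch's virtio_config.h hunk. */
#define VIRTIO_CONFIG_S_DRIVER_OK       4
#define VIRTIO_CONFIG_S_DEVICE_STOPPED  32

/* Hypothetical helper: should the device process its queues? */
static bool device_should_run(uint8_t status)
{
    return (status & VIRTIO_CONFIG_S_DRIVER_OK) &&
           !(status & VIRTIO_CONFIG_S_DEVICE_STOPPED);
}
```

The status byte carries several bits at once (a stopped device normally still has DRIVER_OK set), so the sketch tests the stopped bit with a bitwise AND rather than comparing the whole byte against the constant.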
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/standard-headers/linux/virtio_config.h | 2 ++
 hw/net/virtio-net.c                            | 6 ++++--
 hw/virtio/virtio-pci.c                         | 7 +++++--
 3 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/include/standard-headers/linux/virtio_config.h b/include/standard-headers/linux/virtio_config.h
index 59fad3eb45..b3f6b1365d 100644
--- a/include/standard-headers/linux/virtio_config.h
+++ b/include/standard-headers/linux/virtio_config.h
@@ -40,6 +40,8 @@
 #define VIRTIO_CONFIG_S_DRIVER_OK	4
 /* Driver has finished configuring features */
 #define VIRTIO_CONFIG_S_FEATURES_OK	8
+/* Device is stopped */
+#define VIRTIO_CONFIG_S_DEVICE_STOPPED 32
 /* Device entered invalid state, driver must reset it */
 #define VIRTIO_CONFIG_S_NEEDS_RESET	0x40
 /* We've given up on this device. */
diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
index bd7958b9f0..e8f55cdeba 100644
--- a/hw/net/virtio-net.c
+++ b/hw/net/virtio-net.c
@@ -198,6 +198,7 @@ static bool virtio_net_started(VirtIONet *n, uint8_t status)
 {
     VirtIODevice *vdev = VIRTIO_DEVICE(n);
     return (status & VIRTIO_CONFIG_S_DRIVER_OK) &&
+        (!(status & VIRTIO_CONFIG_S_DEVICE_STOPPED)) &&
         (n->status & VIRTIO_NET_S_LINK_UP) && vdev->vm_running;
 }
 
@@ -385,7 +386,7 @@ static void virtio_net_set_status(struct VirtIODevice *vdev, uint8_t status)
             qemu_flush_queued_packets(ncs);
         }
 
-        if (!q->tx_waiting) {
+        if (!q->tx_waiting && !(status & VIRTIO_CONFIG_S_DEVICE_STOPPED)) {
             continue;
         }
 
@@ -1503,7 +1504,8 @@ static bool virtio_net_can_receive(NetClientState *nc)
     }
 
     if (!virtio_queue_ready(q->rx_vq) ||
-        !(vdev->status & VIRTIO_CONFIG_S_DRIVER_OK)) {
+        !(vdev->status & VIRTIO_CONFIG_S_DRIVER_OK) ||
+        vdev->status == VIRTIO_CONFIG_S_DEVICE_STOPPED) {
         return false;
     }
 
diff --git a/hw/virtio/virtio-pci.c b/hw/virtio/virtio-pci.c
index 6f30118c4e..9ed0f62222 100644
--- a/hw/virtio/virtio-pci.c
+++ b/hw/virtio/virtio-pci.c
@@ -326,13 +326,15 @@ static void virtio_ioport_write(void *opaque, uint32_t addr, uint32_t val)
         }
         break;
     case VIRTIO_PCI_STATUS:
-        if (!(val & VIRTIO_CONFIG_S_DRIVER_OK)) {
+        if (!(val & VIRTIO_CONFIG_S_DRIVER_OK) ||
+            val & VIRTIO_CONFIG_S_DEVICE_STOPPED) {
             virtio_pci_stop_ioeventfd(proxy);
         }
 
         virtio_set_status(vdev, val & 0xFF);
 
-        if (val & VIRTIO_CONFIG_S_DRIVER_OK) {
+        if (val & VIRTIO_CONFIG_S_DRIVER_OK &&
+            !(val & VIRTIO_CONFIG_S_DEVICE_STOPPED)) {
             virtio_pci_start_ioeventfd(proxy);
         }
 
@@ -1303,6 +1305,7 @@ static void virtio_pci_common_write(void *opaque, hwaddr addr,
                        proxy->vqs[vdev->queue_sel].used[0]);
             virtio_queue_set_last_avail_idx(vdev, vdev->queue_sel,
                         proxy->vqs[vdev->queue_sel].state);
+            virtio_queue_update_used_idx(vdev, vdev->queue_sel);
             proxy->vqs[vdev->queue_sel].enabled = 1;
         } else {
             virtio_error(vdev, "wrong value for queue_enable %"PRIx64, val);
-- 
2.27.0
* [RFC PATCH v4 03/20] virtio: Add virtio_queue_is_host_notifier_enabled
  2021-10-01  7:05 [RFC PATCH v4 00/20] vDPA shadow virtqueue Eugenio Pérez
  2021-10-01  7:05 ` [RFC PATCH v4 01/20] virtio: Add VIRTIO_F_QUEUE_STATE Eugenio Pérez
  2021-10-01  7:05 ` [RFC PATCH v4 02/20] virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED Eugenio Pérez
@ 2021-10-01  7:05 ` Eugenio Pérez
  2021-10-01  7:05 ` [RFC PATCH v4 04/20] vhost: Make vhost_virtqueue_{start,stop} public Eugenio Pérez
                   ` (17 subsequent siblings)
  20 siblings, 0 replies; 90+ messages in thread
From: Eugenio Pérez @ 2021-10-01  7:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
This allows shadow virtqueue code to assert the queue status before
making changes.
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/virtio.h | 1 +
 hw/virtio/virtio.c         | 5 +++++
 2 files changed, 6 insertions(+)
diff --git a/include/hw/virtio/virtio.h b/include/hw/virtio/virtio.h
index 5fe575b8f0..2fe03f64c6 100644
--- a/include/hw/virtio/virtio.h
+++ b/include/hw/virtio/virtio.h
@@ -318,6 +318,7 @@ void virtio_device_release_ioeventfd(VirtIODevice *vdev);
 bool virtio_device_ioeventfd_enabled(VirtIODevice *vdev);
 EventNotifier *virtio_queue_get_host_notifier(VirtQueue *vq);
 void virtio_queue_set_host_notifier_enabled(VirtQueue *vq, bool enabled);
+bool virtio_queue_is_host_notifier_enabled(const VirtQueue *vq);
 void virtio_queue_host_notifier_read(EventNotifier *n);
 void virtio_queue_aio_set_host_notifier_handler(VirtQueue *vq, AioContext *ctx,
                                                 VirtIOHandleAIOOutput handle_output);
diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index ab516ac614..f04de57409 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -3591,6 +3591,11 @@ EventNotifier *virtio_queue_get_host_notifier(VirtQueue *vq)
     return &vq->host_notifier;
 }
 
+bool virtio_queue_is_host_notifier_enabled(const VirtQueue *vq)
+{
+    return vq->host_notifier_enabled;
+}
+
 void virtio_queue_set_host_notifier_enabled(VirtQueue *vq, bool enabled)
 {
     vq->host_notifier_enabled = enabled;
-- 
2.27.0
* [RFC PATCH v4 04/20] vhost: Make vhost_virtqueue_{start,stop} public
  2021-10-01  7:05 [RFC PATCH v4 00/20] vDPA shadow virtqueue Eugenio Pérez
                   ` (2 preceding siblings ...)
  2021-10-01  7:05 ` [RFC PATCH v4 03/20] virtio: Add virtio_queue_is_host_notifier_enabled Eugenio Pérez
@ 2021-10-01  7:05 ` Eugenio Pérez
  2021-10-01  7:05 ` [RFC PATCH v4 05/20] vhost: Add x-vhost-enable-shadow-vq qmp Eugenio Pérez
                   ` (16 subsequent siblings)
  20 siblings, 0 replies; 90+ messages in thread
From: Eugenio Pérez @ 2021-10-01  7:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
The easiest way to reset vhost behavior after enabling shadow virtqueue
is to call vhost_virtqueue_start. Also, vhost_virtqueue_stop provides
the symmetrical call to it, and it will be used.
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost.h |  4 ++++
 hw/virtio/vhost.c         | 12 ++++--------
 2 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/include/hw/virtio/vhost.h b/include/hw/virtio/vhost.h
index 045d0fd9f2..1cdcded4c5 100644
--- a/include/hw/virtio/vhost.h
+++ b/include/hw/virtio/vhost.h
@@ -130,6 +130,10 @@ int vhost_net_set_backend(struct vhost_dev *hdev,
                           struct vhost_vring_file *file);
 
 int vhost_device_iotlb_miss(struct vhost_dev *dev, uint64_t iova, int write);
+int vhost_virtqueue_start(struct vhost_dev *dev, struct VirtIODevice *vdev,
+                          struct vhost_virtqueue *vq, unsigned idx);
+void vhost_virtqueue_stop(struct vhost_dev *dev, struct VirtIODevice *vdev,
+                          struct vhost_virtqueue *vq, unsigned idx);
 int vhost_dev_get_config(struct vhost_dev *hdev, uint8_t *config,
                          uint32_t config_len, Error **errp);
 int vhost_dev_set_config(struct vhost_dev *dev, const uint8_t *data,
diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
index e040f631c6..5d2aa6d72d 100644
--- a/hw/virtio/vhost.c
+++ b/hw/virtio/vhost.c
@@ -1049,10 +1049,8 @@ out:
     return ret;
 }
 
-static int vhost_virtqueue_start(struct vhost_dev *dev,
-                                struct VirtIODevice *vdev,
-                                struct vhost_virtqueue *vq,
-                                unsigned idx)
+int vhost_virtqueue_start(struct vhost_dev *dev, struct VirtIODevice *vdev,
+                          struct vhost_virtqueue *vq, unsigned idx)
 {
     BusState *qbus = BUS(qdev_get_parent_bus(DEVICE(vdev)));
     VirtioBusState *vbus = VIRTIO_BUS(qbus);
@@ -1171,10 +1169,8 @@ fail_alloc_desc:
     return r;
 }
 
-static void vhost_virtqueue_stop(struct vhost_dev *dev,
-                                    struct VirtIODevice *vdev,
-                                    struct vhost_virtqueue *vq,
-                                    unsigned idx)
+void vhost_virtqueue_stop(struct vhost_dev *dev, struct VirtIODevice *vdev,
+                          struct vhost_virtqueue *vq, unsigned idx)
 {
     int vhost_vq_index = dev->vhost_ops->vhost_get_vq_index(dev, idx);
     struct vhost_vring_state state = {
-- 
2.27.0
* [RFC PATCH v4 05/20] vhost: Add x-vhost-enable-shadow-vq qmp
  2021-10-01  7:05 [RFC PATCH v4 00/20] vDPA shadow virtqueue Eugenio Pérez
                   ` (3 preceding siblings ...)
  2021-10-01  7:05 ` [RFC PATCH v4 04/20] vhost: Make vhost_virtqueue_{start,stop} public Eugenio Pérez
@ 2021-10-01  7:05 ` Eugenio Pérez
  2021-10-12  5:18   ` Markus Armbruster
  2021-10-01  7:05 ` [RFC PATCH v4 06/20] vhost: Add VhostShadowVirtqueue Eugenio Pérez
                   ` (15 subsequent siblings)
  20 siblings, 1 reply; 90+ messages in thread
From: Eugenio Pérez @ 2021-10-01  7:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
Command to enable shadow virtqueue.
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 qapi/net.json          | 23 +++++++++++++++++++++++
 hw/virtio/vhost-vdpa.c |  8 ++++++++
 2 files changed, 31 insertions(+)
diff --git a/qapi/net.json b/qapi/net.json
index 7fab2e7cd8..a2c30fd455 100644
--- a/qapi/net.json
+++ b/qapi/net.json
@@ -79,6 +79,29 @@
 { 'command': 'netdev_del', 'data': {'id': 'str'},
   'allow-preconfig': true }
 
+##
+# @x-vhost-enable-shadow-vq:
+#
+# Use vhost shadow virtqueue.
+#
+# @name: the device name of the VirtIO device
+#
+# @enable: true to use the alternate shadow VQ notifications
+#
+# Returns: Always error, since SVQ is not implemented at the moment.
+#
+# Since: 6.2
+#
+# Example:
+#
+# -> { "execute": "x-vhost-enable-shadow-vq",
+#     "arguments": { "name": "virtio-net", "enable": false } }
+#
+##
+{ 'command': 'x-vhost-enable-shadow-vq',
+  'data': {'name': 'str', 'enable': 'bool'},
+  'if': 'defined(CONFIG_VHOST_KERNEL)' }
+
 ##
 # @NetLegacyNicOptions:
 #
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 4fa414feea..c63e311d7c 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -23,6 +23,8 @@
 #include "cpu.h"
 #include "trace.h"
 #include "qemu-common.h"
+#include "qapi/qapi-commands-net.h"
+#include "qapi/error.h"
 
 static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section)
 {
@@ -656,6 +658,12 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
     return true;
 }
 
+
+void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
+{
+    error_setg(errp, "Shadow virtqueue still not implemented");
+}
+
 const VhostOps vdpa_ops = {
         .backend_type = VHOST_BACKEND_TYPE_VDPA,
         .vhost_backend_init = vhost_vdpa_init,
-- 
2.27.0
* [RFC PATCH v4 06/20] vhost: Add VhostShadowVirtqueue
  2021-10-01  7:05 [RFC PATCH v4 00/20] vDPA shadow virtqueue Eugenio Pérez
                   ` (4 preceding siblings ...)
  2021-10-01  7:05 ` [RFC PATCH v4 05/20] vhost: Add x-vhost-enable-shadow-vq qmp Eugenio Pérez
@ 2021-10-01  7:05 ` Eugenio Pérez
  2021-10-01  7:05 ` [RFC PATCH v4 07/20] vdpa: Register vdpa devices in a list Eugenio Pérez
                   ` (14 subsequent siblings)
  20 siblings, 0 replies; 90+ messages in thread
From: Eugenio Pérez @ 2021-10-01  7:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
Vhost shadow virtqueue (SVQ) is an intermediate hop for virtqueue
notifications and buffers, allowing qemu to track them. While qemu is
forwarding the buffers and virtqueue changes, it is able to commit the
memory being dirtied, the same way regular qemu VirtIO devices do.
This commit only exposes basic SVQ allocation and free, so changes
regarding different aspects of SVQ (notifications forwarding, buffer
forwarding, starting/stopping) are more isolated and easier to bisect.
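For readers unfamiliar with the allocation pattern, the following standalone sketch shows the same shape outside qemu: two notification descriptors are created in order, and a failure unwinds only what was already set up. It assumes Linux, uses raw eventfds where the patch uses EventNotifier, and ShadowVq, shadow_vq_new and shadow_vq_free are hypothetical names.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical stand-in for VhostShadowVirtqueue, using raw eventfds. */
typedef struct ShadowVq {
    int kick_fd;  /* kick notifier handed to the device side */
    int call_fd;  /* call notifier handed to the device side */
} ShadowVq;

/* Allocate the structure and both notifiers; returns NULL on failure. */
static ShadowVq *shadow_vq_new(void)
{
    ShadowVq *svq = calloc(1, sizeof(*svq));
    if (!svq) {
        return NULL;
    }

    svq->kick_fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
    if (svq->kick_fd < 0) {
        goto err_kick;
    }

    svq->call_fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
    if (svq->call_fd < 0) {
        goto err_call;
    }

    return svq;

err_call:
    close(svq->kick_fd);  /* undo only what succeeded */
err_kick:
    free(svq);
    return NULL;
}

/* Release both notifiers and the structure itself. */
static void shadow_vq_free(ShadowVq *svq)
{
    if (!svq) {
        return;
    }
    close(svq->kick_fd);
    close(svq->call_fd);
    free(svq);
}
```

The patch itself gets the final free through g_autofree and g_steal_pointer instead of an explicit free() on the error path; the sketch spells it out because it does not depend on glib.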
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.h | 21 ++++++++++
 hw/virtio/vhost-shadow-virtqueue.c | 64 ++++++++++++++++++++++++++++++
 hw/virtio/meson.build              |  2 +-
 3 files changed, 86 insertions(+), 1 deletion(-)
 create mode 100644 hw/virtio/vhost-shadow-virtqueue.h
 create mode 100644 hw/virtio/vhost-shadow-virtqueue.c
diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
new file mode 100644
index 0000000000..27ac6388fa
--- /dev/null
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -0,0 +1,21 @@
+/*
+ * vhost shadow virtqueue
+ *
+ * SPDX-FileCopyrightText: Red Hat, Inc. 2021
+ * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef VHOST_SHADOW_VIRTQUEUE_H
+#define VHOST_SHADOW_VIRTQUEUE_H
+
+#include "hw/virtio/vhost.h"
+
+typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
+
+VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx);
+
+void vhost_svq_free(VhostShadowVirtqueue *vq);
+
+#endif
diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
new file mode 100644
index 0000000000..c4826a1b56
--- /dev/null
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -0,0 +1,64 @@
+/*
+ * vhost shadow virtqueue
+ *
+ * SPDX-FileCopyrightText: Red Hat, Inc. 2021
+ * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include "qemu/osdep.h"
+#include "hw/virtio/vhost-shadow-virtqueue.h"
+
+#include "qemu/error-report.h"
+#include "qemu/event_notifier.h"
+
+/* Shadow virtqueue to relay notifications */
+typedef struct VhostShadowVirtqueue {
+    /* Shadow kick notifier, sent to vhost */
+    EventNotifier kick_notifier;
+    /* Shadow call notifier, sent to vhost */
+    EventNotifier call_notifier;
+} VhostShadowVirtqueue;
+
+/*
+ * Creates vhost shadow virtqueue, and instruct vhost device to use the shadow
+ * methods and file descriptors.
+ */
+VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
+{
+    g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
+    int r;
+
+    r = event_notifier_init(&svq->kick_notifier, 0);
+    if (r != 0) {
+        error_report("Couldn't create kick event notifier: %s",
+                     strerror(errno));
+        goto err_init_kick_notifier;
+    }
+
+    r = event_notifier_init(&svq->call_notifier, 0);
+    if (r != 0) {
+        error_report("Couldn't create call event notifier: %s",
+                     strerror(errno));
+        goto err_init_call_notifier;
+    }
+
+    return g_steal_pointer(&svq);
+
+err_init_call_notifier:
+    event_notifier_cleanup(&svq->kick_notifier);
+
+err_init_kick_notifier:
+    return NULL;
+}
+
+/*
+ * Free the resources of the shadow virtqueue.
+ */
+void vhost_svq_free(VhostShadowVirtqueue *vq)
+{
+    event_notifier_cleanup(&vq->kick_notifier);
+    event_notifier_cleanup(&vq->call_notifier);
+    g_free(vq);
+}
diff --git a/hw/virtio/meson.build b/hw/virtio/meson.build
index fbff9bc9d4..8b5a0225fe 100644
--- a/hw/virtio/meson.build
+++ b/hw/virtio/meson.build
@@ -11,7 +11,7 @@ softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-stub.c'))
 
 virtio_ss = ss.source_set()
 virtio_ss.add(files('virtio.c'))
-virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c'))
+virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c'))
 virtio_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user.c'))
 virtio_ss.add(when: 'CONFIG_VHOST_VDPA', if_true: files('vhost-vdpa.c'))
 virtio_ss.add(when: 'CONFIG_VIRTIO_BALLOON', if_true: files('virtio-balloon.c'))
-- 
2.27.0
* [RFC PATCH v4 07/20] vdpa: Register vdpa devices in a list
  2021-10-01  7:05 [RFC PATCH v4 00/20] vDPA shadow virtqueue Eugenio Pérez
                   ` (5 preceding siblings ...)
  2021-10-01  7:05 ` [RFC PATCH v4 06/20] vhost: Add VhostShadowVirtqueue Eugenio Pérez
@ 2021-10-01  7:05 ` Eugenio Pérez
  2021-10-01  7:05 ` [RFC PATCH v4 08/20] vhost: Route guest->host notification through shadow virtqueue Eugenio Pérez
                   ` (13 subsequent siblings)
  20 siblings, 0 replies; 90+ messages in thread
From: Eugenio Pérez @ 2021-10-01  7:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
This way the QMP command can iterate through them and find the device
on which it needs to enable the shadow virtqueue.
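
As a rough illustration of the registry idea (not the patch itself), the sketch below keeps devices on an intrusive list and looks one up by name. It uses the BSD <sys/queue.h> LIST macros as a stand-in for QEMU's QLIST macros; struct demo_dev and the demo_* helpers are hypothetical names invented for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/queue.h>

/* Hypothetical stand-in for struct vhost_vdpa; only the linkage matters. */
struct demo_dev {
    const char *name;
    LIST_ENTRY(demo_dev) entry;
};

/* Global registry, analogous to the static list head added by the patch. */
static LIST_HEAD(demo_dev_list, demo_dev) demo_devices =
    LIST_HEAD_INITIALIZER(demo_devices);

/* Would be called from the device init path. */
static void demo_register(struct demo_dev *d)
{
    assert(d->name != NULL);
    LIST_INSERT_HEAD(&demo_devices, d, entry);
}

/* Would be called from the device cleanup path. */
static void demo_unregister(struct demo_dev *d)
{
    LIST_REMOVE(d, entry);
}

/* Walk the registry; returns NULL if no device carries that name. */
static struct demo_dev *demo_find(const char *name)
{
    struct demo_dev *d;

    LIST_FOREACH(d, &demo_devices, entry) {
        if (strcmp(d->name, name) == 0) {
            return d;
        }
    }
    return NULL;
}
```

Because the linkage lives inside the device struct, registering and unregistering need no extra allocation, which is presumably why an intrusive list is convenient in init/cleanup paths.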
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost-vdpa.h | 2 ++
 hw/virtio/vhost-vdpa.c         | 5 +++++
 2 files changed, 7 insertions(+)
diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 9188226d8b..0d565bb5bd 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -12,6 +12,7 @@
 #ifndef HW_VIRTIO_VHOST_VDPA_H
 #define HW_VIRTIO_VHOST_VDPA_H
 
+#include "qemu/queue.h"
 #include "hw/virtio/virtio.h"
 
 typedef struct VhostVDPAHostNotifier {
@@ -24,6 +25,7 @@ typedef struct vhost_vdpa {
     uint32_t msg_type;
     MemoryListener listener;
     struct vhost_dev *dev;
+    QLIST_ENTRY(vhost_vdpa) entry;
     VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
 } VhostVDPA;
 
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index c63e311d7c..e0dc7508c3 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -26,6 +26,9 @@
 #include "qapi/qapi-commands-net.h"
 #include "qapi/error.h"
 
+static QLIST_HEAD(, vhost_vdpa) vhost_vdpa_devices =
+    QLIST_HEAD_INITIALIZER(vhost_vdpa_devices);
+
 static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section)
 {
     return (!memory_region_is_ram(section->mr) &&
@@ -280,6 +283,7 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
     dev->opaque =  opaque ;
     v->listener = vhost_vdpa_memory_listener;
     v->msg_type = VHOST_IOTLB_MSG_V2;
+    QLIST_INSERT_HEAD(&vhost_vdpa_devices, v, entry);
 
     vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
                                VIRTIO_CONFIG_S_DRIVER);
@@ -377,6 +381,7 @@ static int vhost_vdpa_cleanup(struct vhost_dev *dev)
     trace_vhost_vdpa_cleanup(dev, v);
     vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs);
     memory_listener_unregister(&v->listener);
+    QLIST_REMOVE(v, entry);
 
     dev->opaque = NULL;
     return 0;
-- 
2.27.0
* [RFC PATCH v4 08/20] vhost: Route guest->host notification through shadow virtqueue
  2021-10-01  7:05 [RFC PATCH v4 00/20] vDPA shadow virtqueue Eugenio Pérez
                   ` (6 preceding siblings ...)
  2021-10-01  7:05 ` [RFC PATCH v4 07/20] vdpa: Register vdpa devices in a list Eugenio Pérez
@ 2021-10-01  7:05 ` Eugenio Pérez
  2021-10-12  5:19   ` Markus Armbruster
  2021-10-13  3:27   ` Jason Wang
  2021-10-01  7:05 ` [RFC PATCH v4 09/20] vdpa: Save call_fd in vhost-vdpa Eugenio Pérez
                   ` (12 subsequent siblings)
  20 siblings, 2 replies; 90+ messages in thread
From: Eugenio Pérez @ 2021-10-01  7:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
Shadow virtqueue notification forwarding is disabled when the vhost_dev
stops, so the code flow follows the usual cleanup path.
Also, host notifiers must be disabled at SVQ start, and they will not
start if SVQ has been enabled while the device is stopped. This is
trivial to address, but it is left out for simplicity at this moment.
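
The guest->host relay can be pictured with plain Linux eventfds. The sketch below is only an approximation under stated assumptions: struct demo_svq and the demo_* helpers are hypothetical, and it calls eventfd(2), read(2) and write(2) directly rather than QEMU's EventNotifier API. The guest signals one fd, the handler test-and-clears it, then signals the second fd, which is the one the device would be told to poll.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Hypothetical miniature of the SVQ notifier pair; not QEMU's struct. */
struct demo_svq {
    int guest_kick_fd;  /* fd the guest driver's notification arrives on */
    int shadow_kick_fd; /* fd handed to the device in place of the above */
};

/* Create both eventfds in non-blocking mode; returns false on failure. */
static bool demo_svq_init(struct demo_svq *svq)
{
    svq->guest_kick_fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
    if (svq->guest_kick_fd < 0) {
        return false;
    }
    svq->shadow_kick_fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC);
    if (svq->shadow_kick_fd < 0) {
        close(svq->guest_kick_fd);
        return false;
    }
    return true;
}

/* Test-and-clear: true if at least one event was pending on fd. */
static bool demo_test_and_clear(int fd)
{
    uint64_t count;

    return read(fd, &count, sizeof(count)) == (ssize_t)sizeof(count);
}

/* Raise one event on fd. */
static bool demo_signal(int fd)
{
    uint64_t one = 1;

    assert(fd >= 0);
    return write(fd, &one, sizeof(one)) == (ssize_t)sizeof(one);
}

/* Handler for the guest kick: forward it to the device's kick fd. */
static bool demo_handle_guest_kick(struct demo_svq *svq)
{
    if (!demo_test_and_clear(svq->guest_kick_fd)) {
        return false; /* spurious wakeup, nothing to forward */
    }
    return demo_signal(svq->shadow_kick_fd);
}
```

Since an eventfd is a counter, several guest kicks that arrive before the handler runs collapse into a single forwarded kick. That is harmless for virtqueue notifications, which only mean "look at the ring again".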
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 qapi/net.json                      |   2 +-
 hw/virtio/vhost-shadow-virtqueue.h |   8 ++
 include/hw/virtio/vhost-vdpa.h     |   4 +
 hw/virtio/vhost-shadow-virtqueue.c | 138 ++++++++++++++++++++++++++++-
 hw/virtio/vhost-vdpa.c             | 116 +++++++++++++++++++++++-
 5 files changed, 264 insertions(+), 4 deletions(-)
diff --git a/qapi/net.json b/qapi/net.json
index a2c30fd455..fe546b0e7c 100644
--- a/qapi/net.json
+++ b/qapi/net.json
@@ -88,7 +88,7 @@
 #
 # @enable: true to use the alternate shadow VQ notifications
 #
-# Returns: Always error, since SVQ is not implemented at the moment.
+# Returns: Error if failure, or 'no error' for success.
 #
 # Since: 6.2
 #
diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
index 27ac6388fa..237cfceb9c 100644
--- a/hw/virtio/vhost-shadow-virtqueue.h
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -14,6 +14,14 @@
 
 typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
 
+EventNotifier *vhost_svq_get_svq_call_notifier(VhostShadowVirtqueue *svq);
+void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd);
+
+bool vhost_svq_start(struct vhost_dev *dev, unsigned idx,
+                     VhostShadowVirtqueue *svq);
+void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
+                    VhostShadowVirtqueue *svq);
+
 VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx);
 
 void vhost_svq_free(VhostShadowVirtqueue *vq);
diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 0d565bb5bd..48aae59d8e 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -12,6 +12,8 @@
 #ifndef HW_VIRTIO_VHOST_VDPA_H
 #define HW_VIRTIO_VHOST_VDPA_H
 
+#include <gmodule.h>
+
 #include "qemu/queue.h"
 #include "hw/virtio/virtio.h"
 
@@ -24,6 +26,8 @@ typedef struct vhost_vdpa {
     int device_fd;
     uint32_t msg_type;
     MemoryListener listener;
+    bool shadow_vqs_enabled;
+    GPtrArray *shadow_vqs;
     struct vhost_dev *dev;
     QLIST_ENTRY(vhost_vdpa) entry;
     VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index c4826a1b56..21dc99ab5d 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -9,9 +9,12 @@
 
 #include "qemu/osdep.h"
 #include "hw/virtio/vhost-shadow-virtqueue.h"
+#include "hw/virtio/vhost.h"
+
+#include "standard-headers/linux/vhost_types.h"
 
 #include "qemu/error-report.h"
-#include "qemu/event_notifier.h"
+#include "qemu/main-loop.h"
 
 /* Shadow virtqueue to relay notifications */
 typedef struct VhostShadowVirtqueue {
@@ -19,14 +22,146 @@ typedef struct VhostShadowVirtqueue {
     EventNotifier kick_notifier;
     /* Shadow call notifier, sent to vhost */
     EventNotifier call_notifier;
+
+    /*
+     * Borrowed virtqueue's guest to host notifier.
+     * To borrow it in this event notifier allows to register on the event
+     * loop and access the associated shadow virtqueue easily. If we use the
+     * VirtQueue, we don't have an easy way to retrieve it.
+     *
+     * So shadow virtqueue must not clean it, or we would lose VirtQueue one.
+     */
+    EventNotifier host_notifier;
+
+    /* Guest's call notifier, where SVQ calls guest. */
+    EventNotifier guest_call_notifier;
+
+    /* Virtio queue shadowing */
+    VirtQueue *vq;
 } VhostShadowVirtqueue;
 
+/* Forward guest notifications */
+static void vhost_handle_guest_kick(EventNotifier *n)
+{
+    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
+                                             host_notifier);
+
+    if (unlikely(!event_notifier_test_and_clear(n))) {
+        return;
+    }
+
+    event_notifier_set(&svq->kick_notifier);
+}
+
+/*
+ * Obtain the SVQ call notifier, where vhost device notifies SVQ that there
+ * exists pending used buffers.
+ *
+ * @svq Shadow Virtqueue
+ */
+EventNotifier *vhost_svq_get_svq_call_notifier(VhostShadowVirtqueue *svq)
+{
+    return &svq->call_notifier;
+}
+
+/*
+ * Set the call notifier for the SVQ to call the guest
+ *
+ * @svq Shadow virtqueue
+ * @call_fd call notifier
+ *
+ * Called on BQL context.
+ */
+void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd)
+{
+    event_notifier_init_fd(&svq->guest_call_notifier, call_fd);
+}
+
+/*
+ * Restore the vhost guest to host notifier, i.e., disables svq effect.
+ */
+static int vhost_svq_restore_vdev_host_notifier(struct vhost_dev *dev,
+                                                unsigned vhost_index,
+                                                VhostShadowVirtqueue *svq)
+{
+    EventNotifier *vq_host_notifier = virtio_queue_get_host_notifier(svq->vq);
+    struct vhost_vring_file file = {
+        .index = vhost_index,
+        .fd = event_notifier_get_fd(vq_host_notifier),
+    };
+    int r;
+
+    /* Restore vhost kick */
+    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
+    return r ? -errno : 0;
+}
+
+/*
+ * Start shadow virtqueue operation.
+ * @dev vhost device
+ * @idx vhost virtqueue index
+ * @svq Shadow Virtqueue
+ */
+bool vhost_svq_start(struct vhost_dev *dev, unsigned idx,
+                     VhostShadowVirtqueue *svq)
+{
+    EventNotifier *vq_host_notifier = virtio_queue_get_host_notifier(svq->vq);
+    struct vhost_vring_file file = {
+        .index = dev->vhost_ops->vhost_get_vq_index(dev, dev->vq_index + idx),
+        .fd = event_notifier_get_fd(&svq->kick_notifier),
+    };
+    int r;
+
+    /* Check that notifications are still going directly to vhost dev */
+    assert(virtio_queue_is_host_notifier_enabled(svq->vq));
+
+    /*
+     * event_notifier_set_handler already checks for guest's notifications if
+     * they arrive in the switch, so there is no need to explicitly check for
+     * them.
+     */
+    event_notifier_init_fd(&svq->host_notifier,
+                           event_notifier_get_fd(vq_host_notifier));
+    event_notifier_set_handler(&svq->host_notifier, vhost_handle_guest_kick);
+
+    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
+    if (unlikely(r != 0)) {
+        error_report("Couldn't set kick fd: %s", strerror(errno));
+        goto err_set_vring_kick;
+    }
+
+    return true;
+
+err_set_vring_kick:
+    event_notifier_set_handler(&svq->host_notifier, NULL);
+
+    return false;
+}
+
+/*
+ * Stop shadow virtqueue operation.
+ * @dev vhost device
+ * @idx vhost queue index
+ * @svq Shadow Virtqueue
+ */
+void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
+                    VhostShadowVirtqueue *svq)
+{
+    int r = vhost_svq_restore_vdev_host_notifier(dev, idx, svq);
+    if (unlikely(r < 0)) {
+        error_report("Couldn't restore vq kick fd: %s", strerror(-r));
+    }
+
+    event_notifier_set_handler(&svq->host_notifier, NULL);
+}
+
 /*
  * Creates vhost shadow virtqueue, and instruct vhost device to use the shadow
  * methods and file descriptors.
  */
 VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
 {
+    int vq_idx = dev->vq_index + idx;
     g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
     int r;
 
@@ -44,6 +179,7 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
         goto err_init_call_notifier;
     }
 
+    svq->vq = virtio_get_queue(dev->vdev, vq_idx);
     return g_steal_pointer(&svq);
 
 err_init_call_notifier:
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index e0dc7508c3..36c954a779 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -17,6 +17,7 @@
 #include "hw/virtio/vhost.h"
 #include "hw/virtio/vhost-backend.h"
 #include "hw/virtio/virtio-net.h"
+#include "hw/virtio/vhost-shadow-virtqueue.h"
 #include "hw/virtio/vhost-vdpa.h"
 #include "exec/address-spaces.h"
 #include "qemu/main-loop.h"
@@ -272,6 +273,16 @@ static void vhost_vdpa_add_status(struct vhost_dev *dev, uint8_t status)
     vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &s);
 }
 
+/**
+ * Adaptor function to free shadow virtqueue through gpointer
+ *
+ * @svq   The Shadow Virtqueue
+ */
+static void vhost_psvq_free(gpointer svq)
+{
+    vhost_svq_free(svq);
+}
+
 static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
 {
     struct vhost_vdpa *v;
@@ -283,6 +294,7 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
     dev->opaque =  opaque ;
     v->listener = vhost_vdpa_memory_listener;
     v->msg_type = VHOST_IOTLB_MSG_V2;
+    v->shadow_vqs = g_ptr_array_new_full(dev->nvqs, vhost_psvq_free);
     QLIST_INSERT_HEAD(&vhost_vdpa_devices, v, entry);
 
     vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
@@ -373,6 +385,17 @@ err:
     return;
 }
 
+static void vhost_vdpa_svq_cleanup(struct vhost_dev *dev)
+{
+    struct vhost_vdpa *v = dev->opaque;
+    size_t idx;
+
+    for (idx = 0; idx < v->shadow_vqs->len; ++idx) {
+        vhost_svq_stop(dev, idx, g_ptr_array_index(v->shadow_vqs, idx));
+    }
+    g_ptr_array_free(v->shadow_vqs, true);
+}
+
 static int vhost_vdpa_cleanup(struct vhost_dev *dev)
 {
     struct vhost_vdpa *v;
@@ -381,6 +404,7 @@ static int vhost_vdpa_cleanup(struct vhost_dev *dev)
     trace_vhost_vdpa_cleanup(dev, v);
     vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs);
     memory_listener_unregister(&v->listener);
+    vhost_vdpa_svq_cleanup(dev);
     QLIST_REMOVE(v, entry);
 
     dev->opaque = NULL;
@@ -557,7 +581,9 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
     if (started) {
         uint8_t status = 0;
         memory_listener_register(&v->listener, &address_space_memory);
-        vhost_vdpa_host_notifiers_init(dev);
+        if (!v->shadow_vqs_enabled) {
+            vhost_vdpa_host_notifiers_init(dev);
+        }
         vhost_vdpa_set_vring_ready(dev);
         vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
         vhost_vdpa_call(dev, VHOST_VDPA_GET_STATUS, &status);
@@ -663,10 +689,96 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
     return true;
 }
 
+/*
+ * Start shadow virtqueue.
+ */
+static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
+{
+    struct vhost_vdpa *v = dev->opaque;
+    VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, idx);
+    return vhost_svq_start(dev, idx, svq);
+}
+
+static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
+{
+    struct vhost_dev *hdev = v->dev;
+    unsigned n;
+
+    if (enable == v->shadow_vqs_enabled) {
+        return hdev->nvqs;
+    }
+
+    if (enable) {
+        /* Allocate resources */
+        assert(v->shadow_vqs->len == 0);
+        for (n = 0; n < hdev->nvqs; ++n) {
+            VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n);
+            bool ok;
+
+            if (unlikely(!svq)) {
+                g_ptr_array_set_size(v->shadow_vqs, 0);
+                return 0;
+            }
+            g_ptr_array_add(v->shadow_vqs, svq);
+
+            ok = vhost_vdpa_svq_start_vq(hdev, n);
+            if (unlikely(!ok)) {
+                /* Free still not started svqs */
+                g_ptr_array_set_size(v->shadow_vqs, n);
+                enable = false;
+                break;
+            }
+        }
+    }
+
+    v->shadow_vqs_enabled = enable;
+
+    if (!enable) {
+        /* Disable all queues or clean up failed start */
+        for (n = 0; n < v->shadow_vqs->len; ++n) {
+            unsigned vq_idx = vhost_vdpa_get_vq_index(hdev, n);
+            VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, n);
+            vhost_svq_stop(hdev, n, svq);
+            vhost_virtqueue_start(hdev, hdev->vdev, &hdev->vqs[n], vq_idx);
+        }
+
+        /* Resources cleanup */
+        g_ptr_array_set_size(v->shadow_vqs, 0);
+    }
+
+    return n;
+}
 
 void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
 {
-    error_setg(errp, "Shadow virtqueue still not implemented");
+    struct vhost_vdpa *v;
+    const char *err_cause = NULL;
+    bool r;
+
+    QLIST_FOREACH(v, &vhost_vdpa_devices, entry) {
+        if (v->dev->vdev && 0 == strcmp(v->dev->vdev->name, name)) {
+            break;
+        }
+    }
+
+    if (!v) {
+        err_cause = "Device not found";
+        goto err;
+    } else if (v->notifier[0].addr) {
+        err_cause = "Device has host notifiers enabled";
+        goto err;
+    }
+
+    r = vhost_vdpa_enable_svq(v, enable);
+    if (unlikely(!r)) {
+        err_cause = "Error enabling (see monitor)";
+        goto err;
+    }
+
+err:
+    if (err_cause) {
+        error_setg(errp, "Can't enable shadow vq on %s: %s", name, err_cause);
+    }
 }
 
 const VhostOps vdpa_ops = {
-- 
2.27.0
* [RFC PATCH v4 09/20] vdpa: Save call_fd in vhost-vdpa
  2021-10-01  7:05 [RFC PATCH v4 00/20] vDPA shadow virtqueue Eugenio Pérez
                   ` (7 preceding siblings ...)
  2021-10-01  7:05 ` [RFC PATCH v4 08/20] vhost: Route guest->host notification through shadow virtqueue Eugenio Pérez
@ 2021-10-01  7:05 ` Eugenio Pérez
  2021-10-13  3:43   ` Jason Wang
  2021-10-01  7:05 ` [RFC PATCH v4 10/20] vhost-vdpa: Take into account SVQ in vhost_vdpa_set_vring_call Eugenio Pérez
                   ` (11 subsequent siblings)
  20 siblings, 1 reply; 90+ messages in thread
From: Eugenio Pérez @ 2021-10-01  7:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
The call file descriptor must be known in order to switch to the shadow
virtqueue.
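
A minimal sketch of why the fd is remembered, combining this patch with the routing introduced by the next one in the series. All names here (struct demo_vdpa, demo_set_vring_call, the two "programmed" arrays) are hypothetical and only model where the guest's call fd ends up: it is always saved, and is then handed either to the device or to the shadow virtqueue depending on whether SVQ is enabled.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

enum { DEMO_QUEUE_MAX = 4 };

/* Hypothetical model of the per-device state; not struct vhost_vdpa. */
struct demo_vdpa {
    bool shadow_vqs_enabled;
    int call_fd[DEMO_QUEUE_MAX];           /* last fd the guest side asked for */
    int device_call_fd[DEMO_QUEUE_MAX];    /* fd programmed into the device */
    int svq_guest_call_fd[DEMO_QUEUE_MAX]; /* fd the SVQ uses to call the guest */
};

static void demo_vdpa_init(struct demo_vdpa *v)
{
    assert(v != NULL);
    v->shadow_vqs_enabled = false;
    for (unsigned i = 0; i < DEMO_QUEUE_MAX; i++) {
        v->call_fd[i] = -1;
        v->device_call_fd[i] = -1;
        v->svq_guest_call_fd[i] = -1;
    }
}

/* Remember the fd unconditionally, then route it according to SVQ state. */
static int demo_set_vring_call(struct demo_vdpa *v, unsigned idx, int fd)
{
    if (idx >= DEMO_QUEUE_MAX) {
        return -EINVAL;
    }

    v->call_fd[idx] = fd;
    if (v->shadow_vqs_enabled) {
        /* The device keeps calling the SVQ; only the SVQ learns the new fd. */
        v->svq_guest_call_fd[idx] = fd;
    } else {
        v->device_call_fd[idx] = fd;
    }
    return 0;
}
```

Keeping the saved copy separate from whatever is currently programmed is what would allow the original fd to be restored when SVQ is turned off again.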
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost-vdpa.h | 2 ++
 hw/virtio/vhost-vdpa.c         | 5 +++++
 2 files changed, 7 insertions(+)
diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 48aae59d8e..fddac248b3 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -30,6 +30,8 @@ typedef struct vhost_vdpa {
     GPtrArray *shadow_vqs;
     struct vhost_dev *dev;
     QLIST_ENTRY(vhost_vdpa) entry;
+    /* File descriptor the device uses to call VM/SVQ */
+    int call_fd[VIRTIO_QUEUE_MAX];
     VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
 } VhostVDPA;
 
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 36c954a779..57a857444a 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -652,7 +652,12 @@ static int vhost_vdpa_set_vring_kick(struct vhost_dev *dev,
 static int vhost_vdpa_set_vring_call(struct vhost_dev *dev,
                                        struct vhost_vring_file *file)
 {
+    struct vhost_vdpa *v = dev->opaque;
+    int vdpa_idx = vhost_vdpa_get_vq_index(dev, file->index);
+
     trace_vhost_vdpa_set_vring_call(dev, file->index, file->fd);
+
+    v->call_fd[vdpa_idx] = file->fd;
     return vhost_vdpa_call(dev, VHOST_SET_VRING_CALL, file);
 }
 
-- 
2.27.0
* [RFC PATCH v4 10/20] vhost-vdpa: Take into account SVQ in vhost_vdpa_set_vring_call
  2021-10-01  7:05 [RFC PATCH v4 00/20] vDPA shadow virtqueue Eugenio Pérez
                   ` (8 preceding siblings ...)
  2021-10-01  7:05 ` [RFC PATCH v4 09/20] vdpa: Save call_fd in vhost-vdpa Eugenio Pérez
@ 2021-10-01  7:05 ` Eugenio Pérez
  2021-10-13  3:43   ` Jason Wang
  2021-10-01  7:05 ` [RFC PATCH v4 11/20] vhost: Route host->guest notification through shadow virtqueue Eugenio Pérez
                   ` (10 subsequent siblings)
  20 siblings, 1 reply; 90+ messages in thread
From: Eugenio Pérez @ 2021-10-01  7:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-vdpa.c | 17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 57a857444a..bc34de2439 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -649,16 +649,27 @@ static int vhost_vdpa_set_vring_kick(struct vhost_dev *dev,
     return vhost_vdpa_call(dev, VHOST_SET_VRING_KICK, file);
 }
 
+static int vhost_vdpa_set_vring_dev_call(struct vhost_dev *dev,
+                                         struct vhost_vring_file *file)
+{
+    trace_vhost_vdpa_set_vring_call(dev, file->index, file->fd);
+    return vhost_vdpa_call(dev, VHOST_SET_VRING_CALL, file);
+}
+
 static int vhost_vdpa_set_vring_call(struct vhost_dev *dev,
                                        struct vhost_vring_file *file)
 {
     struct vhost_vdpa *v = dev->opaque;
     int vdpa_idx = vhost_vdpa_get_vq_index(dev, file->index);
 
-    trace_vhost_vdpa_set_vring_call(dev, file->index, file->fd);
-
     v->call_fd[vdpa_idx] = file->fd;
-    return vhost_vdpa_call(dev, VHOST_SET_VRING_CALL, file);
+    if (v->shadow_vqs_enabled) {
+        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, vdpa_idx);
+        vhost_svq_set_guest_call_notifier(svq, file->fd);
+        return 0;
+    } else {
+        return vhost_vdpa_set_vring_dev_call(dev, file);
+    }
 }
 
 static int vhost_vdpa_get_features(struct vhost_dev *dev,
-- 
2.27.0
* [RFC PATCH v4 11/20] vhost: Route host->guest notification through shadow virtqueue
  2021-10-01  7:05 [RFC PATCH v4 00/20] vDPA shadow virtqueue Eugenio Pérez
                   ` (9 preceding siblings ...)
  2021-10-01  7:05 ` [RFC PATCH v4 10/20] vhost-vdpa: Take into account SVQ in vhost_vdpa_set_vring_call Eugenio Pérez
@ 2021-10-01  7:05 ` Eugenio Pérez
  2021-10-13  3:47   ` Jason Wang
  2021-10-13  3:49   ` Jason Wang
  2021-10-01  7:05 ` [RFC PATCH v4 12/20] virtio: Add vhost_shadow_vq_get_vring_addr Eugenio Pérez
                   ` (9 subsequent siblings)
  20 siblings, 2 replies; 90+ messages in thread
From: Eugenio Pérez @ 2021-10-01  7:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
This will make qemu aware of the device's used buffers, allowing it to
write the guest memory with their contents if needed.
Since the use of vhost_virtqueue_start can unmask and discard call
events, vhost_virtqueue_start should be modified in one of these ways:
* Split it in two: one function that holds all the logic to start a
  queue with no side effects for the guest, and another one that
  actually assumes that the guest has just started the device. Vdpa
  should use just the former.
* Actually store and check whether the guest notifier is masked, and do
  it conditionally.
* Leave it as it is, and duplicate all the logic in vhost-vdpa.
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.c | 19 +++++++++++++++
 hw/virtio/vhost-vdpa.c             | 38 +++++++++++++++++++++++++++++-
 2 files changed, 56 insertions(+), 1 deletion(-)
diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index 21dc99ab5d..3fe129cf63 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -53,6 +53,22 @@ static void vhost_handle_guest_kick(EventNotifier *n)
     event_notifier_set(&svq->kick_notifier);
 }
 
+/* Forward vhost notifications */
+static void vhost_svq_handle_call_no_test(EventNotifier *n)
+{
+    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
+                                             call_notifier);
+
+    event_notifier_set(&svq->guest_call_notifier);
+}
+
+static void vhost_svq_handle_call(EventNotifier *n)
+{
+    if (likely(event_notifier_test_and_clear(n))) {
+        vhost_svq_handle_call_no_test(n);
+    }
+}
+
 /*
  * Obtain the SVQ call notifier, where vhost device notifies SVQ that there
  * exists pending used buffers.
@@ -180,6 +196,8 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
     }
 
     svq->vq = virtio_get_queue(dev->vdev, vq_idx);
+    event_notifier_set_handler(&svq->call_notifier,
+                               vhost_svq_handle_call);
     return g_steal_pointer(&svq);
 
 err_init_call_notifier:
@@ -195,6 +213,7 @@ err_init_kick_notifier:
 void vhost_svq_free(VhostShadowVirtqueue *vq)
 {
     event_notifier_cleanup(&vq->kick_notifier);
+    event_notifier_set_handler(&vq->call_notifier, NULL);
     event_notifier_cleanup(&vq->call_notifier);
     g_free(vq);
 }
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index bc34de2439..6c5f4c98b8 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -712,13 +712,40 @@ static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
 {
     struct vhost_vdpa *v = dev->opaque;
     VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, idx);
-    return vhost_svq_start(dev, idx, svq);
+    EventNotifier *vhost_call_notifier = vhost_svq_get_svq_call_notifier(svq);
+    struct vhost_vring_file vhost_call_file = {
+        .index = idx + dev->vq_index,
+        .fd = event_notifier_get_fd(vhost_call_notifier),
+    };
+    int r;
+    bool b;
+
+    /* Set shadow vq -> guest notifier */
+    assert(v->call_fd[idx]);
+    vhost_svq_set_guest_call_notifier(svq, v->call_fd[idx]);
+
+    b = vhost_svq_start(dev, idx, svq);
+    if (unlikely(!b)) {
+        return false;
+    }
+
+    /* Set device -> SVQ notifier */
+    r = vhost_vdpa_set_vring_dev_call(dev, &vhost_call_file);
+    if (unlikely(r)) {
+        error_report("vhost_vdpa_set_vring_call for shadow vq failed");
+        return false;
+    }
+
+    /* Check for pending calls */
+    event_notifier_set(vhost_call_notifier);
+    return true;
 }
 
 static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
 {
     struct vhost_dev *hdev = v->dev;
     unsigned n;
+    int r;
 
     if (enable == v->shadow_vqs_enabled) {
         return hdev->nvqs;
@@ -752,9 +779,18 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
     if (!enable) {
         /* Disable all queues or clean up failed start */
         for (n = 0; n < v->shadow_vqs->len; ++n) {
+            struct vhost_vring_file file = {
+                .index = vhost_vdpa_get_vq_index(hdev, n),
+                .fd = v->call_fd[n],
+            };
+
+            r = vhost_vdpa_set_vring_call(hdev, &file);
+            assert(r == 0);
+
             unsigned vq_idx = vhost_vdpa_get_vq_index(hdev, n);
             VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, n);
             vhost_svq_stop(hdev, n, svq);
+            /* TODO: This can unmask or override call fd! */
             vhost_virtqueue_start(hdev, hdev->vdev, &hdev->vqs[n], vq_idx);
         }
 
-- 
2.27.0
* [RFC PATCH v4 12/20] virtio: Add vhost_shadow_vq_get_vring_addr
  2021-10-01  7:05 [RFC PATCH v4 00/20] vDPA shadow virtqueue Eugenio Pérez
                   ` (10 preceding siblings ...)
  2021-10-01  7:05 ` [RFC PATCH v4 11/20] vhost: Route host->guest notification through shadow virtqueue Eugenio Pérez
@ 2021-10-01  7:05 ` Eugenio Pérez
  2021-10-13  3:54   ` Jason Wang
  2021-10-01  7:05 ` [RFC PATCH v4 13/20] vdpa: Save host and guest features Eugenio Pérez
                   ` (8 subsequent siblings)
  20 siblings, 1 reply; 90+ messages in thread
From: Eugenio Pérez @ 2021-10-01  7:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
It reports the shadow virtqueue address from qemu's virtual address space.
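
The area sizes that go with those addresses follow from the split-ring layout in the virtio specification. The sketch below is a standalone approximation, not the patch code: the demo_* names are hypothetical, the element sizes are written out as constants instead of being queried through QEMU's virtio_queue_get_*_size() helpers, and the page size is passed in as a parameter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Split-ring element sizes as laid out by the virtio specification. */
#define DEMO_DESC_SIZE       16u /* addr(8) + len(4) + flags(2) + next(2) */
#define DEMO_RING_HDR_SIZE    4u /* flags(2) + idx(2) */
#define DEMO_AVAIL_ELEM_SIZE  2u /* one descriptor head index */
#define DEMO_USED_ELEM_SIZE   8u /* id(4) + len(4) */
#define DEMO_EVENT_SIZE       2u /* trailing used_event / avail_event */

/* Round n up to the next multiple of align. */
static size_t demo_round_up(size_t n, size_t align)
{
    assert(align != 0);
    return (n + align - 1) / align * align;
}

/* Descriptor table plus avail ring, rounded up to whole pages. */
static size_t demo_driver_area_size(unsigned num, bool event_idx,
                                    size_t page_size)
{
    size_t desc = (size_t)DEMO_DESC_SIZE * num;
    size_t avail = DEMO_RING_HDR_SIZE + (size_t)DEMO_AVAIL_ELEM_SIZE * num +
                   (event_idx ? DEMO_EVENT_SIZE : 0);

    return demo_round_up(desc + avail, page_size);
}

/* Used ring, rounded up to whole pages. */
static size_t demo_device_area_size(unsigned num, bool event_idx,
                                    size_t page_size)
{
    size_t used = DEMO_RING_HDR_SIZE + (size_t)DEMO_USED_ELEM_SIZE * num +
                  (event_idx ? DEMO_EVENT_SIZE : 0);

    return demo_round_up(used, page_size);
}
```

With these assumptions, a 256-entry queue on 4 KiB pages needs two pages for the driver area (4096 bytes of descriptors plus 516 bytes of avail ring) and one page for the device area. Page rounding matters because the areas are allocated page-aligned.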
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.h |  4 +++
 hw/virtio/vhost-shadow-virtqueue.c | 50 ++++++++++++++++++++++++++++++
 2 files changed, 54 insertions(+)
diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
index 237cfceb9c..2df3d117f5 100644
--- a/hw/virtio/vhost-shadow-virtqueue.h
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -16,6 +16,10 @@ typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
 
 EventNotifier *vhost_svq_get_svq_call_notifier(VhostShadowVirtqueue *svq);
 void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd);
+void vhost_svq_get_vring_addr(const VhostShadowVirtqueue *svq,
+                              struct vhost_vring_addr *addr);
+size_t vhost_svq_driver_area_size(const VhostShadowVirtqueue *svq);
+size_t vhost_svq_device_area_size(const VhostShadowVirtqueue *svq);
 
 bool vhost_svq_start(struct vhost_dev *dev, unsigned idx,
                      VhostShadowVirtqueue *svq);
diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index 3fe129cf63..5c1899f6af 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -18,6 +18,9 @@
 
 /* Shadow virtqueue to relay notifications */
 typedef struct VhostShadowVirtqueue {
+    /* Shadow vring */
+    struct vring vring;
+
     /* Shadow kick notifier, sent to vhost */
     EventNotifier kick_notifier;
     /* Shadow call notifier, sent to vhost */
@@ -38,6 +41,9 @@ typedef struct VhostShadowVirtqueue {
 
     /* Virtio queue shadowing */
     VirtQueue *vq;
+
+    /* Virtio device */
+    VirtIODevice *vdev;
 } VhostShadowVirtqueue;
 
 /* Forward guest notifications */
@@ -93,6 +99,35 @@ void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd)
     event_notifier_init_fd(&svq->guest_call_notifier, call_fd);
 }
 
+/*
+ * Get the shadow vq vring address.
+ * @svq Shadow virtqueue
+ * @addr Destination to store address
+ */
+void vhost_svq_get_vring_addr(const VhostShadowVirtqueue *svq,
+                              struct vhost_vring_addr *addr)
+{
+    addr->desc_user_addr = (uint64_t)svq->vring.desc;
+    addr->avail_user_addr = (uint64_t)svq->vring.avail;
+    addr->used_user_addr = (uint64_t)svq->vring.used;
+}
+
+size_t vhost_svq_driver_area_size(const VhostShadowVirtqueue *svq)
+{
+    uint16_t vq_idx = virtio_get_queue_index(svq->vq);
+    size_t desc_size = virtio_queue_get_desc_size(svq->vdev, vq_idx);
+    size_t avail_size = virtio_queue_get_avail_size(svq->vdev, vq_idx);
+
+    return ROUND_UP(desc_size + avail_size, qemu_real_host_page_size);
+}
+
+size_t vhost_svq_device_area_size(const VhostShadowVirtqueue *svq)
+{
+    uint16_t vq_idx = virtio_get_queue_index(svq->vq);
+    size_t used_size = virtio_queue_get_used_size(svq->vdev, vq_idx);
+    return ROUND_UP(used_size, qemu_real_host_page_size);
+}
+
 /*
  * Restore the vhost guest to host notifier, i.e., disables svq effect.
  */
@@ -178,6 +213,10 @@ void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
 VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
 {
     int vq_idx = dev->vq_index + idx;
+    unsigned num = virtio_queue_get_num(dev->vdev, vq_idx);
+    size_t desc_size = virtio_queue_get_desc_size(dev->vdev, vq_idx);
+    size_t driver_size;
+    size_t device_size;
     g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
     int r;
 
@@ -196,6 +235,15 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
     }
 
     svq->vq = virtio_get_queue(dev->vdev, vq_idx);
+    svq->vdev = dev->vdev;
+    driver_size = vhost_svq_driver_area_size(svq);
+    device_size = vhost_svq_device_area_size(svq);
+    svq->vring.num = num;
+    svq->vring.desc = qemu_memalign(qemu_real_host_page_size, driver_size);
+    svq->vring.avail = (void *)((char *)svq->vring.desc + desc_size);
+    memset(svq->vring.desc, 0, driver_size);
+    svq->vring.used = qemu_memalign(qemu_real_host_page_size, device_size);
+    memset(svq->vring.used, 0, device_size);
     event_notifier_set_handler(&svq->call_notifier,
                                vhost_svq_handle_call);
     return g_steal_pointer(&svq);
@@ -215,5 +263,7 @@ void vhost_svq_free(VhostShadowVirtqueue *vq)
     event_notifier_cleanup(&vq->kick_notifier);
     event_notifier_set_handler(&vq->call_notifier, NULL);
     event_notifier_cleanup(&vq->call_notifier);
+    qemu_vfree(vq->vring.desc);
+    qemu_vfree(vq->vring.used);
     g_free(vq);
 }
-- 
2.27.0
* [RFC PATCH v4 13/20] vdpa: Save host and guest features
  2021-10-01  7:05 [RFC PATCH v4 00/20] vDPA shadow virtqueue Eugenio Pérez
                   ` (11 preceding siblings ...)
  2021-10-01  7:05 ` [RFC PATCH v4 12/20] virtio: Add vhost_shadow_vq_get_vring_addr Eugenio Pérez
@ 2021-10-01  7:05 ` Eugenio Pérez
  2021-10-13  3:56   ` Jason Wang
  2021-10-01  7:05 ` [RFC PATCH v4 14/20] vhost: Add vhost_svq_valid_device_features to shadow vq Eugenio Pérez
                   ` (7 subsequent siblings)
  20 siblings, 1 reply; 90+ messages in thread
From: Eugenio Pérez @ 2021-10-01  7:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
Both are needed for SVQ: the host features are needed to check whether
SVQ knows how to talk with the device and for feature negotiation, and
the guest features to know whether SVQ can talk with the guest.
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 include/hw/virtio/vhost-vdpa.h |  2 ++
 hw/virtio/vhost-vdpa.c         | 31 ++++++++++++++++++++++++++++---
 2 files changed, 30 insertions(+), 3 deletions(-)
diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index fddac248b3..9044ae694b 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -26,6 +26,8 @@ typedef struct vhost_vdpa {
     int device_fd;
     uint32_t msg_type;
     MemoryListener listener;
+    uint64_t host_features;
+    uint64_t guest_features;
     bool shadow_vqs_enabled;
     GPtrArray *shadow_vqs;
     struct vhost_dev *dev;
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index 6c5f4c98b8..a057e8277d 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -439,10 +439,19 @@ static int vhost_vdpa_set_mem_table(struct vhost_dev *dev,
     return 0;
 }
 
-static int vhost_vdpa_set_features(struct vhost_dev *dev,
-                                   uint64_t features)
+/**
+ * Internal set_features() that follows vhost/VirtIO protocol for that
+ */
+static int vhost_vdpa_backend_set_features(struct vhost_dev *dev,
+                                           uint64_t features)
 {
+    struct vhost_vdpa *v = dev->opaque;
+
     int ret;
+    if (v->host_features & BIT_ULL(VIRTIO_F_QUEUE_STATE)) {
+        features |= BIT_ULL(VIRTIO_F_QUEUE_STATE);
+    }
+
     trace_vhost_vdpa_set_features(dev, features);
     ret = vhost_vdpa_call(dev, VHOST_SET_FEATURES, &features);
     uint8_t status = 0;
@@ -455,6 +464,17 @@ static int vhost_vdpa_set_features(struct vhost_dev *dev,
     return !(status & VIRTIO_CONFIG_S_FEATURES_OK);
 }
 
+/**
+ * Exposed vhost set features
+ */
+static int vhost_vdpa_set_features(struct vhost_dev *dev,
+                                   uint64_t features)
+{
+    struct vhost_vdpa *v = dev->opaque;
+    v->guest_features = features;
+    return vhost_vdpa_backend_set_features(dev, features);
+}
+
 static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)
 {
     uint64_t features;
@@ -673,12 +693,17 @@ static int vhost_vdpa_set_vring_call(struct vhost_dev *dev,
 }
 
 static int vhost_vdpa_get_features(struct vhost_dev *dev,
-                                     uint64_t *features)
+                                   uint64_t *features)
 {
     int ret;
 
     ret = vhost_vdpa_call(dev, VHOST_GET_FEATURES, features);
     trace_vhost_vdpa_get_features(dev, *features);
+
+    if (ret == 0) {
+        struct vhost_vdpa *v = dev->opaque;
+        v->host_features = *features;
+    }
     return ret;
 }
 
-- 
2.27.0
* [RFC PATCH v4 14/20] vhost: Add vhost_svq_valid_device_features to shadow vq
  2021-10-01  7:05 [RFC PATCH v4 00/20] vDPA shadow virtqueue Eugenio Pérez
                   ` (12 preceding siblings ...)
  2021-10-01  7:05 ` [RFC PATCH v4 13/20] vdpa: Save host and guest features Eugenio Pérez
@ 2021-10-01  7:05 ` Eugenio Pérez
  2021-10-01  7:05 ` [RFC PATCH v4 15/20] vhost: Shadow virtqueue buffers forwarding Eugenio Pérez
                   ` (6 subsequent siblings)
  20 siblings, 0 replies; 90+ messages in thread
From: Eugenio Pérez @ 2021-10-01  7:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
This allows it to test whether the guest has acknowledged a transport
feature that is invalid for SVQ. At the moment we start forwarding
buffers, this will include packed vq layout, indirect descriptors and
event idx.
We don't check for device features here since they will be
renegotiated. This allows SVQ both to use more advanced features of the
device when they are available and the guest is not capable of running
them, and to make SVQ compatible with future transport features.
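As a rough illustration of the check described above, here is a minimal
sketch. It assumes the unsupported set is the packed layout, event idx and
indirect descriptors (reading the list in the message that way). The bit
numbers are the ones assigned by the VirtIO specification; every sketch_*
name is hypothetical and not part of this series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Feature bit numbers as assigned by the VirtIO specification. */
#define SKETCH_RING_F_INDIRECT_DESC 28
#define SKETCH_RING_F_EVENT_IDX     29
#define SKETCH_F_RING_PACKED        34

/* Hypothetical helper: mask with only feature bit n set. */
static uint64_t sketch_bit(unsigned n)
{
    assert(n < 64);
    return UINT64_C(1) << n;
}

/*
 * Hypothetical helper: returns true if the guest acknowledged none of the
 * transport features this sketch treats as unsupported by SVQ.
 */
static bool sketch_svq_valid_guest_features(uint64_t guest_features)
{
    const uint64_t unsupported = sketch_bit(SKETCH_RING_F_INDIRECT_DESC) |
                                 sketch_bit(SKETCH_RING_F_EVENT_IDX) |
                                 sketch_bit(SKETCH_F_RING_PACKED);

    return (guest_features & unsupported) == 0;
}
```

A caller would pass the feature set acknowledged by the guest and refuse to
start forwarding buffers when the helper returns false.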
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.h | 2 ++
 hw/virtio/vhost-shadow-virtqueue.c | 6 ++++++
 2 files changed, 8 insertions(+)
diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
index 2df3d117f5..b7baa424a7 100644
--- a/hw/virtio/vhost-shadow-virtqueue.h
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -14,6 +14,8 @@
 
 typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
 
+bool vhost_svq_valid_device_features(uint64_t *features);
+
 EventNotifier *vhost_svq_get_svq_call_notifier(VhostShadowVirtqueue *svq);
 void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd);
 void vhost_svq_get_vring_addr(const VhostShadowVirtqueue *svq,
diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index 5c1899f6af..34e159d4fd 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -46,6 +46,12 @@ typedef struct VhostShadowVirtqueue {
     VirtIODevice *vdev;
 } VhostShadowVirtqueue;
 
+/* If the device is using some of these, SVQ cannot communicate */
+bool vhost_svq_valid_device_features(uint64_t *dev_features)
+{
+    return true;
+}
+
 /* Forward guest notifications */
 static void vhost_handle_guest_kick(EventNotifier *n)
 {
-- 
2.27.0
* [RFC PATCH v4 15/20] vhost: Shadow virtqueue buffers forwarding
  2021-10-01  7:05 [RFC PATCH v4 00/20] vDPA shadow virtqueue Eugenio Pérez
                   ` (13 preceding siblings ...)
  2021-10-01  7:05 ` [RFC PATCH v4 14/20] vhost: Add vhost_svq_valid_device_features to shadow vq Eugenio Pérez
@ 2021-10-01  7:05 ` Eugenio Pérez
  2021-10-12  5:21   ` Markus Armbruster
  2021-10-13  4:31   ` Jason Wang
  2021-10-01  7:05 ` [RFC PATCH v4 16/20] vhost: Check for device VRING_USED_F_NO_NOTIFY at shadow virtqueue kick Eugenio Pérez
                   ` (5 subsequent siblings)
  20 siblings, 2 replies; 90+ messages in thread
From: Eugenio Pérez @ 2021-10-01  7:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
Initial version of shadow virtqueue that actually forwards buffers. There
is no IOMMU support at the moment; that will be addressed in future
patches of this series. Since all vhost-vdpa devices use forced IOMMU,
this means that SVQ is not usable on any device at this point of the
series.
For simplicity it only supports modern devices, which expect the vring
in little endian, with split ring and no event idx or indirect
descriptors. Support for them will not be added in this series.
It reuses the VirtQueue code for the device part. The driver part is
based on Linux's virtio_ring driver, but with functionality and
optimizations stripped out so it's easier to review. Later commits add
the simpler ones.
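The driver part boils down to a descriptor free list plus an available
ring. The following is a simplified sketch of that idea, not the code of
this patch: it ignores endianness conversion and memory barriers, uses a
fixed ring size, and all sketch_* names are hypothetical. Unlike the patch,
it walks to the tail of a chain when reclaiming it, so the whole chain
returns to the free list.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_NUM    4   /* ring size, must be a power of two */
#define SKETCH_F_NEXT 1   /* same value as VRING_DESC_F_NEXT */

struct sketch_desc {
    uint64_t addr;
    uint32_t len;
    uint16_t flags;
    uint16_t next;
};

struct sketch_ring {
    struct sketch_desc desc[SKETCH_NUM];
    uint16_t avail_ring[SKETCH_NUM];
    uint16_t avail_idx;   /* free-running index, like avail->idx */
    uint16_t free_head;   /* first descriptor of the free list */
};

/* Link every descriptor into the free list: 0 -> 1 -> ... -> NUM - 1. */
static void sketch_init(struct sketch_ring *r)
{
    *r = (struct sketch_ring){0};
    for (uint16_t i = 0; i + 1 < SKETCH_NUM; i++) {
        r->desc[i].next = i + 1;
    }
}

/*
 * Expose a chain of n buffers and return its head descriptor.
 * The caller must guarantee that n descriptors are free.
 */
static uint16_t sketch_add(struct sketch_ring *r, const uint64_t *addrs,
                           const uint32_t *lens, unsigned n)
{
    uint16_t head = r->free_head, i = head, last = head;

    assert(n > 0 && n <= SKETCH_NUM);
    for (unsigned k = 0; k < n; k++) {
        r->desc[i].addr = addrs[k];
        r->desc[i].len = lens[k];
        r->desc[i].flags = (k + 1 < n) ? SKETCH_F_NEXT : 0;
        last = i;
        i = r->desc[i].next;
    }

    /* The free list now starts after the last descriptor we consumed. */
    r->free_head = r->desc[last].next;
    r->avail_ring[r->avail_idx & (SKETCH_NUM - 1)] = head;
    r->avail_idx++;
    return head;
}

/* Return a used chain, identified by its head, to the free list. */
static void sketch_reclaim(struct sketch_ring *r, uint16_t head)
{
    uint16_t tail = head;

    while (r->desc[tail].flags & SKETCH_F_NEXT) {
        tail = r->desc[tail].next;
    }
    r->desc[tail].next = r->free_head;
    r->free_head = head;
}
```

In a real ring the avail index must only be published after the descriptors
are written, which is what the write barrier in the patch is for.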
SVQ uses VIRTIO_CONFIG_S_DEVICE_STOPPED to pause the device and
retrieve its status (next available idx the device was going to
consume) race-free. It can later reset the device to replace vring
addresses etc. When SVQ starts qemu can resume consuming the guest's
driver ring from that state, without notice from the latter.
The VIRTIO_CONFIG_S_DEVICE_STOPPED status bit is currently under
discussion in VirtIO, and is implemented in qemu VirtIO-net devices in
previous commits.
Removal of the _S_DEVICE_STOPPED bit (in other words, resuming the
device) can be done in the future if a use case arises. At this moment we
can just rely on resetting the full device.
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 qapi/net.json                      |   2 +-
 hw/virtio/vhost-shadow-virtqueue.c | 237 ++++++++++++++++++++++++++++-
 hw/virtio/vhost-vdpa.c             | 109 ++++++++++++-
 3 files changed, 337 insertions(+), 11 deletions(-)
diff --git a/qapi/net.json b/qapi/net.json
index fe546b0e7c..1f4a55f2c5 100644
--- a/qapi/net.json
+++ b/qapi/net.json
@@ -86,7 +86,7 @@
 #
 # @name: the device name of the VirtIO device
 #
-# @enable: true to use the alternate shadow VQ notifications
+# @enable: true to use the alternate shadow VQ buffers forwarding path
 #
 # Returns: Error if failure, or 'no error' for success.
 #
diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index 34e159d4fd..df7e6fa3ec 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -10,6 +10,7 @@
 #include "qemu/osdep.h"
 #include "hw/virtio/vhost-shadow-virtqueue.h"
 #include "hw/virtio/vhost.h"
+#include "hw/virtio/virtio-access.h"
 
 #include "standard-headers/linux/vhost_types.h"
 
@@ -44,15 +45,135 @@ typedef struct VhostShadowVirtqueue {
 
     /* Virtio device */
     VirtIODevice *vdev;
+
+    /* Map for returning guest's descriptors */
+    VirtQueueElement **ring_id_maps;
+
+    /* Next head to expose to device */
+    uint16_t avail_idx_shadow;
+
+    /* Next free descriptor */
+    uint16_t free_head;
+
+    /* Last seen used idx */
+    uint16_t shadow_used_idx;
+
+    /* Next head to consume from device */
+    uint16_t used_idx;
 } VhostShadowVirtqueue;
 
 /* If the device is using some of these, SVQ cannot communicate */
 bool vhost_svq_valid_device_features(uint64_t *dev_features)
 {
-    return true;
+    uint64_t b;
+    bool r = true;
+
+    for (b = VIRTIO_TRANSPORT_F_START; b <= VIRTIO_TRANSPORT_F_END; ++b) {
+        switch (b) {
+        case VIRTIO_F_NOTIFY_ON_EMPTY:
+        case VIRTIO_F_ANY_LAYOUT:
+            /* SVQ is fine with this feature */
+            continue;
+
+        case VIRTIO_F_ACCESS_PLATFORM:
+            /* SVQ needs this feature disabled. Can't continue */
+            if (*dev_features & BIT_ULL(b)) {
+                clear_bit(b, dev_features);
+                r = false;
+            }
+            break;
+
+        case VIRTIO_F_VERSION_1:
+            /* SVQ needs this feature, so can't continue */
+            if (!(*dev_features & BIT_ULL(b))) {
+                set_bit(b, dev_features);
+                r = false;
+            }
+            continue;
+
+        default:
+            /*
+             * SVQ must disable this feature, let's hope the device is fine
+             * without it.
+             */
+            if (*dev_features & BIT_ULL(b)) {
+                clear_bit(b, dev_features);
+            }
+        }
+    }
+
+    return r;
+}
+
+static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
+                                    const struct iovec *iovec,
+                                    size_t num, bool more_descs, bool write)
+{
+    uint16_t i = svq->free_head, last = svq->free_head;
+    unsigned n;
+    uint16_t flags = write ? cpu_to_le16(VRING_DESC_F_WRITE) : 0;
+    vring_desc_t *descs = svq->vring.desc;
+
+    if (num == 0) {
+        return;
+    }
+
+    for (n = 0; n < num; n++) {
+        if (more_descs || (n + 1 < num)) {
+            descs[i].flags = flags | cpu_to_le16(VRING_DESC_F_NEXT);
+        } else {
+            descs[i].flags = flags;
+        }
+        descs[i].addr = cpu_to_le64((hwaddr)iovec[n].iov_base);
+        descs[i].len = cpu_to_le32(iovec[n].iov_len);
+
+        last = i;
+        i = le16_to_cpu(descs[i].next);
+    }
+
+    svq->free_head = le16_to_cpu(descs[last].next);
+}
+
+static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
+                                    VirtQueueElement *elem)
+{
+    int head;
+    unsigned avail_idx;
+    vring_avail_t *avail = svq->vring.avail;
+
+    head = svq->free_head;
+
+    /* We need some descriptors here */
+    assert(elem->out_num || elem->in_num);
+
+    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
+                            elem->in_num > 0, false);
+    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
+
+    /*
+     * Put entry in available array (but don't update avail->idx until they
+     * do sync).
+     */
+    avail_idx = svq->avail_idx_shadow & (svq->vring.num - 1);
+    avail->ring[avail_idx] = cpu_to_le16(head);
+    svq->avail_idx_shadow++;
+
+    /* Update avail index after the descriptor is written */
+    smp_wmb();
+    avail->idx = cpu_to_le16(svq->avail_idx_shadow);
+
+    return head;
+
 }
 
-/* Forward guest notifications */
+static void vhost_svq_add(VhostShadowVirtqueue *svq, VirtQueueElement *elem)
+{
+    unsigned qemu_head = vhost_svq_add_split(svq, elem);
+
+    svq->ring_id_maps[qemu_head] = elem;
+}
+
+/* Handle guest->device notifications */
 static void vhost_handle_guest_kick(EventNotifier *n)
 {
     VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
@@ -62,7 +183,74 @@ static void vhost_handle_guest_kick(EventNotifier *n)
         return;
     }
 
-    event_notifier_set(&svq->kick_notifier);
+    /* Make available as many buffers as possible */
+    do {
+        if (virtio_queue_get_notification(svq->vq)) {
+            /* No more notifications until all available are processed */
+            virtio_queue_set_notification(svq->vq, false);
+        }
+
+        while (true) {
+            VirtQueueElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
+            if (!elem) {
+                break;
+            }
+
+            vhost_svq_add(svq, elem);
+            event_notifier_set(&svq->kick_notifier);
+        }
+
+        virtio_queue_set_notification(svq->vq, true);
+    } while (!virtio_queue_empty(svq->vq));
+}
+
+static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
+{
+    if (svq->used_idx != svq->shadow_used_idx) {
+        return true;
+    }
+
+    /* Get used idx must not be reordered */
+    smp_rmb();
+    svq->shadow_used_idx = le16_to_cpu(svq->vring.used->idx);
+
+    return svq->used_idx != svq->shadow_used_idx;
+}
+
+static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
+{
+    vring_desc_t *descs = svq->vring.desc;
+    const vring_used_t *used = svq->vring.used;
+    vring_used_elem_t used_elem;
+    uint16_t last_used;
+
+    if (!vhost_svq_more_used(svq)) {
+        return NULL;
+    }
+
+    last_used = svq->used_idx & (svq->vring.num - 1);
+    used_elem.id = le32_to_cpu(used->ring[last_used].id);
+    used_elem.len = le32_to_cpu(used->ring[last_used].len);
+
+    svq->used_idx++;
+    if (unlikely(used_elem.id >= svq->vring.num)) {
+        error_report("Device %s says index %u is used", svq->vdev->name,
+                     used_elem.id);
+        return NULL;
+    }
+
+    if (unlikely(!svq->ring_id_maps[used_elem.id])) {
+        error_report(
+            "Device %s says index %u is used, but it was not available",
+            svq->vdev->name, used_elem.id);
+        return NULL;
+    }
+
+    descs[used_elem.id].next = cpu_to_le16(svq->free_head);
+    svq->free_head = used_elem.id;
+
+    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
+    return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
 }
 
 /* Forward vhost notifications */
@@ -70,8 +258,26 @@ static void vhost_svq_handle_call_no_test(EventNotifier *n)
 {
     VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
                                              call_notifier);
-
-    event_notifier_set(&svq->guest_call_notifier);
+    VirtQueue *vq = svq->vq;
+
+    /* Make as many buffers as possible used. */
+    do {
+        unsigned i = 0;
+
+        /* TODO: Use VRING_AVAIL_F_NO_INTERRUPT */
+        while (true) {
+            g_autofree VirtQueueElement *elem = vhost_svq_get_buf(svq);
+            if (!elem) {
+                break;
+            }
+
+            assert(i < svq->vring.num);
+            virtqueue_fill(vq, elem, elem->len, i++);
+        }
+
+        virtqueue_flush(vq, i);
+        event_notifier_set(&svq->guest_call_notifier);
+    } while (vhost_svq_more_used(svq));
 }
 
 static void vhost_svq_handle_call(EventNotifier *n)
@@ -204,12 +410,25 @@ err_set_vring_kick:
 void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
                     VhostShadowVirtqueue *svq)
 {
+    int i;
     int r = vhost_svq_restore_vdev_host_notifier(dev, idx, svq);
+
     if (unlikely(r < 0)) {
         error_report("Couldn't restore vq kick fd: %s", strerror(-r));
     }
 
     event_notifier_set_handler(&svq->host_notifier, NULL);
+
+    for (i = 0; i < svq->vring.num; ++i) {
+        g_autofree VirtQueueElement *elem = svq->ring_id_maps[i];
+        /*
+         * Although the doc says we must unpop in order, it's ok to unpop
+         * everything.
+         */
+        if (elem) {
+            virtqueue_unpop(svq->vq, elem, elem->len);
+        }
+    }
 }
 
 /*
@@ -224,7 +443,7 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
     size_t driver_size;
     size_t device_size;
     g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
-    int r;
+    int r, i;
 
     r = event_notifier_init(&svq->kick_notifier, 0);
     if (r != 0) {
@@ -250,6 +469,11 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
     memset(svq->vring.desc, 0, driver_size);
     svq->vring.used = qemu_memalign(qemu_real_host_page_size, device_size);
     memset(svq->vring.used, 0, device_size);
+    for (i = 0; i < num - 1; i++) {
+        svq->vring.desc[i].next = cpu_to_le16(i + 1);
+    }
+
+    svq->ring_id_maps = g_new0(VirtQueueElement *, num);
     event_notifier_set_handler(&svq->call_notifier,
                                vhost_svq_handle_call);
     return g_steal_pointer(&svq);
@@ -269,6 +493,7 @@ void vhost_svq_free(VhostShadowVirtqueue *vq)
     event_notifier_cleanup(&vq->kick_notifier);
     event_notifier_set_handler(&vq->call_notifier, NULL);
     event_notifier_cleanup(&vq->call_notifier);
+    g_free(vq->ring_id_maps);
     qemu_vfree(vq->vring.desc);
     qemu_vfree(vq->vring.used);
     g_free(vq);
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index a057e8277d..bb7010ddb5 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -19,6 +19,7 @@
 #include "hw/virtio/virtio-net.h"
 #include "hw/virtio/vhost-shadow-virtqueue.h"
 #include "hw/virtio/vhost-vdpa.h"
+#include "hw/virtio/vhost-shadow-virtqueue.h"
 #include "exec/address-spaces.h"
 #include "qemu/main-loop.h"
 #include "cpu.h"
@@ -475,6 +476,28 @@ static int vhost_vdpa_set_features(struct vhost_dev *dev,
     return vhost_vdpa_backend_set_features(dev, features);
 }
 
+/**
+ * Restore guest features to vdpa device
+ */
+static int vhost_vdpa_set_guest_features(struct vhost_dev *dev)
+{
+    struct vhost_vdpa *v = dev->opaque;
+    return vhost_vdpa_backend_set_features(dev, v->guest_features);
+}
+
+/**
+ * Set shadow virtqueue supported features
+ */
+static int vhost_vdpa_set_svq_features(struct vhost_dev *dev)
+{
+    struct vhost_vdpa *v = dev->opaque;
+    uint64_t features = v->host_features;
+    bool b = vhost_svq_valid_device_features(&features);
+    assert(b);
+
+    return vhost_vdpa_backend_set_features(dev, features);
+}
+
 static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)
 {
     uint64_t features;
@@ -730,6 +753,19 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
     return true;
 }
 
+static int vhost_vdpa_vring_pause(struct vhost_dev *dev)
+{
+    int r;
+    uint8_t status;
+
+    vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DEVICE_STOPPED);
+    do {
+        r = vhost_vdpa_call(dev, VHOST_VDPA_GET_STATUS, &status);
+    } while (r == 0 && !(status & VIRTIO_CONFIG_S_DEVICE_STOPPED));
+
+    return r;
+}
+
 /*
  * Start shadow virtqueue.
  */
@@ -742,9 +778,29 @@ static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
         .index = idx + dev->vq_index,
         .fd = event_notifier_get_fd(vhost_call_notifier),
     };
+    struct vhost_vring_addr addr = {
+        .index = idx + dev->vq_index,
+    };
+    struct vhost_vring_state num = {
+        .index = idx + dev->vq_index,
+        .num = virtio_queue_get_num(dev->vdev, idx),
+    };
     int r;
     bool b;
 
+    vhost_svq_get_vring_addr(svq, &addr);
+    r = vhost_vdpa_set_vring_addr(dev, &addr);
+    if (unlikely(r)) {
+        error_report("vhost_set_vring_addr for shadow vq failed");
+        return false;
+    }
+
+    r = vhost_vdpa_set_vring_num(dev, &num);
+    if (unlikely(r)) {
+        error_report("vhost_vdpa_set_vring_num for shadow vq failed");
+        return false;
+    }
+
     /* Set shadow vq -> guest notifier */
     assert(v->call_fd[idx]);
     vhost_svq_set_guest_call_notifier(svq, v->call_fd[idx]);
@@ -781,15 +837,32 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
         assert(v->shadow_vqs->len == 0);
         for (n = 0; n < hdev->nvqs; ++n) {
             VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n);
-            bool ok;
-
             if (unlikely(!svq)) {
                 g_ptr_array_set_size(v->shadow_vqs, 0);
                 return 0;
             }
             g_ptr_array_add(v->shadow_vqs, svq);
+        }
+    }
 
-            ok = vhost_vdpa_svq_start_vq(hdev, n);
+    r = vhost_vdpa_vring_pause(hdev);
+    assert(r == 0);
+
+    if (enable) {
+        for (n = 0; n < v->shadow_vqs->len; ++n) {
+            /* Obtain Virtqueue state */
+            vhost_virtqueue_stop(hdev, hdev->vdev, &hdev->vqs[n], n);
+        }
+    }
+
+    /* Reset device so it can be configured */
+    r = vhost_vdpa_dev_start(hdev, false);
+    assert(r == 0);
+
+    if (enable) {
+        int r;
+        for (n = 0; n < v->shadow_vqs->len; ++n) {
+            bool ok = vhost_vdpa_svq_start_vq(hdev, n);
             if (unlikely(!ok)) {
                 /* Free still not started svqs */
                 g_ptr_array_set_size(v->shadow_vqs, n);
@@ -797,11 +870,19 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
                 break;
             }
         }
+
+        /* Need to ack features to set state in vp_vdpa devices */
+        r = vhost_vdpa_set_svq_features(hdev);
+        if (unlikely(r)) {
+            enable = false;
+        }
     }
 
     v->shadow_vqs_enabled = enable;
 
     if (!enable) {
+        vhost_vdpa_set_guest_features(hdev);
+
         /* Disable all queues or clean up failed start */
         for (n = 0; n < v->shadow_vqs->len; ++n) {
             struct vhost_vring_file file = {
@@ -818,7 +899,12 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
             /* TODO: This can unmask or override call fd! */
             vhost_virtqueue_start(hdev, hdev->vdev, &hdev->vqs[n], vq_idx);
         }
+    }
 
+    r = vhost_vdpa_dev_start(hdev, true);
+    assert(r == 0);
+
+    if (!enable) {
         /* Resources cleanup */
         g_ptr_array_set_size(v->shadow_vqs, 0);
     }
@@ -831,6 +917,7 @@ void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
     struct vhost_vdpa *v;
     const char *err_cause = NULL;
     bool r;
+    uint64_t svq_features;
 
     QLIST_FOREACH(v, &vhost_vdpa_devices, entry) {
         if (v->dev->vdev && 0 == strcmp(v->dev->vdev->name, name)) {
@@ -846,6 +933,20 @@ void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
         goto err;
     }
 
+    svq_features = v->host_features;
+    if (!vhost_svq_valid_device_features(&svq_features)) {
+        error_setg(errp,
+            "Can't enable shadow vq on %s: Unexpected feature flags (%"PRIx64
+            "-%"PRIx64")", name, v->host_features, svq_features);
+        return;
+    } else {
+        /* TODO: Check for virtio_vdpa + IOMMU & modern device */
+    }
+
+    if (err_cause) {
+        goto err;
+    }
+
     r = vhost_vdpa_enable_svq(v, enable);
     if (unlikely(!r)) {
         err_cause = "Error enabling (see monitor)";
@@ -853,7 +954,7 @@ void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
     }
 
 err:
-    if (err_cause) {
+    if (errp == NULL && err_cause) {
         error_setg(errp, "Can't enable shadow vq on %s: %s", name, err_cause);
     }
 }
-- 
2.27.0
* [RFC PATCH v4 16/20] vhost: Check for device VRING_USED_F_NO_NOTIFY at shadow virtqueue kick
  2021-10-01  7:05 [RFC PATCH v4 00/20] vDPA shadow virtqueue Eugenio Pérez
                   ` (14 preceding siblings ...)
  2021-10-01  7:05 ` [RFC PATCH v4 15/20] vhost: Shadow virtqueue buffers forwarding Eugenio Pérez
@ 2021-10-01  7:05 ` Eugenio Pérez
  2021-10-13  4:35   ` Jason Wang
  2021-10-01  7:06 ` [RFC PATCH v4 17/20] vhost: Use VRING_AVAIL_F_NO_INTERRUPT at device call on shadow virtqueue Eugenio Pérez
                   ` (4 subsequent siblings)
  20 siblings, 1 reply; 90+ messages in thread
From: Eugenio Pérez @ 2021-10-01  7:05 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.c | 11 ++++++++++-
 1 file changed, 10 insertions(+), 1 deletion(-)
diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index df7e6fa3ec..775f8d36a0 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -173,6 +173,15 @@ static void vhost_svq_add(VhostShadowVirtqueue *svq, VirtQueueElement *elem)
     svq->ring_id_maps[qemu_head] = elem;
 }
 
+static void vhost_svq_kick(VhostShadowVirtqueue *svq)
+{
+    /* Make sure we are reading updated device flag */
+    smp_mb();
+    if (!(svq->vring.used->flags & cpu_to_le16(VRING_USED_F_NO_NOTIFY))) {
+        event_notifier_set(&svq->kick_notifier);
+    }
+}
+
 /* Handle guest->device notifications */
 static void vhost_handle_guest_kick(EventNotifier *n)
 {
@@ -197,7 +206,7 @@ static void vhost_handle_guest_kick(EventNotifier *n)
             }
 
             vhost_svq_add(svq, elem);
-            event_notifier_set(&svq->kick_notifier);
+            vhost_svq_kick(svq);
         }
 
         virtio_queue_set_notification(svq->vq, true);
-- 
2.27.0
* [RFC PATCH v4 17/20] vhost: Use VRING_AVAIL_F_NO_INTERRUPT at device call on shadow virtqueue
  2021-10-01  7:05 [RFC PATCH v4 00/20] vDPA shadow virtqueue Eugenio Pérez
                   ` (15 preceding siblings ...)
  2021-10-01  7:05 ` [RFC PATCH v4 16/20] vhost: Check for device VRING_USED_F_NO_NOTIFY at shadow virtqueue kick Eugenio Pérez
@ 2021-10-01  7:06 ` Eugenio Pérez
  2021-10-13  4:36   ` Jason Wang
  2021-10-01  7:06 ` [RFC PATCH v4 18/20] vhost: Add VhostIOVATree Eugenio Pérez
                   ` (3 subsequent siblings)
  20 siblings, 1 reply; 90+ messages in thread
From: Eugenio Pérez @ 2021-10-01  7:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
 hw/virtio/vhost-shadow-virtqueue.c | 24 +++++++++++++++++++++++-
 1 file changed, 23 insertions(+), 1 deletion(-)
diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index 775f8d36a0..2fd0bab75d 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -60,6 +60,9 @@ typedef struct VhostShadowVirtqueue {
 
     /* Next head to consume from device */
     uint16_t used_idx;
+
+    /* Cache for the exposed notification flag */
+    bool notification;
 } VhostShadowVirtqueue;
 
 /* If the device is using some of these, SVQ cannot communicate */
@@ -105,6 +108,24 @@ bool vhost_svq_valid_device_features(uint64_t *dev_features)
     return r;
 }
 
+static void vhost_svq_set_notification(VhostShadowVirtqueue *svq, bool enable)
+{
+    uint16_t notification_flag;
+
+    if (svq->notification == enable) {
+        return;
+    }
+
+    notification_flag = cpu_to_le16(VRING_AVAIL_F_NO_INTERRUPT);
+
+    svq->notification = enable;
+    if (enable) {
+        svq->vring.avail->flags &= ~notification_flag;
+    } else {
+        svq->vring.avail->flags |= notification_flag;
+    }
+}
+
 static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
                                     const struct iovec *iovec,
                                     size_t num, bool more_descs, bool write)
@@ -273,7 +294,7 @@ static void vhost_svq_handle_call_no_test(EventNotifier *n)
     do {
         unsigned i = 0;
 
-        /* TODO: Use VRING_AVAIL_F_NO_INTERRUPT */
+        vhost_svq_set_notification(svq, false);
         while (true) {
             g_autofree VirtQueueElement *elem = vhost_svq_get_buf(svq);
             if (!elem) {
@@ -286,6 +307,7 @@ static void vhost_svq_handle_call_no_test(EventNotifier *n)
 
         virtqueue_flush(vq, i);
         event_notifier_set(&svq->guest_call_notifier);
+        vhost_svq_set_notification(svq, true);
     } while (vhost_svq_more_used(svq));
 }
 
-- 
2.27.0
* [RFC PATCH v4 18/20] vhost: Add VhostIOVATree
  2021-10-01  7:05 [RFC PATCH v4 00/20] vDPA shadow virtqueue Eugenio Pérez
                   ` (16 preceding siblings ...)
  2021-10-01  7:06 ` [RFC PATCH v4 17/20] vhost: Use VRING_AVAIL_F_NO_INTERRUPT at device call on shadow virtqueue Eugenio Pérez
@ 2021-10-01  7:06 ` Eugenio Pérez
  2021-10-19  8:32   ` Jason Wang
  2021-10-01  7:06 ` [RFC PATCH v4 19/20] vhost: Use a tree to store memory mappings Eugenio Pérez
                   ` (2 subsequent siblings)
  20 siblings, 1 reply; 90+ messages in thread
From: Eugenio Pérez @ 2021-10-01  7:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
This tree is able to look up a translated address from an IOVA address.
At first glance it is similar to util/iova-tree. However, SVQ working on
devices with limited IOVA space needs more capabilities, like allocating
IOVA chunks or performing reverse translations (qemu addresses to IOVA).
The allocation capability, as in "assign a free IOVA address to this chunk
of memory in qemu's address space", allows shadow virtqueue to create a
new address space that is not restricted by the guest's addressable one, so
we can allocate the shadow vq vrings outside of the guest's reachable
range, and of qemu's as well. At the moment, the allocation only grows;
deletion is not supported.
A different name could be used, but "ordered searchable array" is a
little bit long.
It duplicates the array so it can search efficiently in both directions,
and it will signal overlap if the iova or the translated address is
already present in either array.
Use of arrays will be changed to util/iova-tree in a future series.
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
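As an illustration of the allocation described above, here is a minimal,
self-contained sketch of a first-fit hole search over an array ordered by
iova. It is not the code of this patch: SketchMap and sketch_find_hole are
hypothetical names, the upper bound iova_last is an assumed extra parameter,
and the function reports an insertion position instead of a pointer to the
previous element. As in VhostDMAMap, size is inclusive (last byte minus first
byte).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for VhostDMAMap: covers [iova, iova + size]. */
typedef struct {
    uint64_t iova;
    uint64_t size; /* Inclusive */
} SketchMap;

/*
 * First-fit search in maps[], which must be sorted by ascending iova and
 * must not contain overlapping entries.
 *
 * On success, stores the start of the hole in *iova and the array position
 * where the new entry keeps the order in *pos, and returns true.
 */
static bool sketch_find_hole(const SketchMap *maps, size_t len,
                             uint64_t iova_min, uint64_t iova_last,
                             uint64_t size, uint64_t *iova, size_t *pos)
{
    uint64_t start = iova_min;
    size_t i;

    assert(iova_min <= iova_last);

    for (i = 0; i < len; i++) {
        uint64_t last = maps[i].iova + maps[i].size;

        /* The hole [start, maps[i].iova - 1] holds maps[i].iova - start bytes */
        if (maps[i].iova > start && size < maps[i].iova - start) {
            *iova = start;
            *pos = i;
            return true;
        }

        if (last >= iova_last) {
            /* Nothing left above this entry */
            return false;
        }

        /* The next candidate starts one past the last byte of this entry */
        if (last + 1 > start) {
            start = last + 1;
        }
    }

    /* Tail hole [start, iova_last] */
    if (size <= iova_last - start) {
        *iova = start;
        *pos = len;
        return true;
    }

    return false;
}
```

With entries at [0x1000, 0x1fff] and [0x4000, 0x4fff], a request for 0x1000
bytes (size 0xfff) lands at 0x2000, between the two, while a request that
does not fit there is placed after the last entry if the range allows it.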
 hw/virtio/vhost-iova-tree.h |  40 +++++++
 hw/virtio/vhost-iova-tree.c | 230 ++++++++++++++++++++++++++++++++++++
 hw/virtio/meson.build       |   2 +-
 3 files changed, 271 insertions(+), 1 deletion(-)
 create mode 100644 hw/virtio/vhost-iova-tree.h
 create mode 100644 hw/virtio/vhost-iova-tree.c
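The reverse translation (qemu address to iova) relies on a comparator that
treats overlapping ranges as equal, so a plain bsearch() finds the entry that
contains the needle. The following standalone sketch shows the idea under
simplified assumptions: SketchRevMap, sketch_cmp_taddr and sketch_find are
hypothetical names, addresses are kept as integers, and sizes are inclusive
as in VhostDMAMap.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical mapping: qemu range [taddr, taddr + size] starts at iova. */
typedef struct {
    uintptr_t taddr;
    uint64_t iova;
    uint64_t size; /* Inclusive */
} SketchRevMap;

/* Two ranges compare equal when they share at least one byte. */
static int sketch_cmp_taddr(const void *a, const void *b)
{
    const SketchRevMap *m1 = a, *m2 = b;

    if (m1->taddr > m2->taddr + m2->size) {
        return 1;
    }

    if (m1->taddr + m1->size < m2->taddr) {
        return -1;
    }

    /* Overlapped */
    return 0;
}

/*
 * Look for an entry overlapping [taddr, taddr + size] in maps[], which must
 * be sorted by ascending taddr without overlaps. Returns NULL if none.
 */
static const SketchRevMap *sketch_find(const SketchRevMap *maps, size_t len,
                                       uintptr_t taddr, uint64_t size)
{
    SketchRevMap needle = { .taddr = taddr, .size = size };

    /* bsearch always passes the key (needle) as the first argument */
    return bsearch(&needle, maps, len, sizeof(*maps), sketch_cmp_taddr);
}
```

Once an entry is found, the iova of an address inside it is the entry's iova
plus the offset of the address from the entry's taddr.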
diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
new file mode 100644
index 0000000000..d163a88905
--- /dev/null
+++ b/hw/virtio/vhost-iova-tree.h
@@ -0,0 +1,40 @@
+/*
+ * vhost software live migration ring
+ *
+ * SPDX-FileCopyrightText: Red Hat, Inc. 2021
+ * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#ifndef HW_VIRTIO_VHOST_IOVA_TREE_H
+#define HW_VIRTIO_VHOST_IOVA_TREE_H
+
+#include "exec/memory.h"
+
+typedef struct VhostDMAMap {
+    void *translated_addr;
+    hwaddr iova;
+    hwaddr size;                /* Inclusive */
+    IOMMUAccessFlags perm;
+} VhostDMAMap;
+
+typedef enum VhostDMAMapNewRC {
+    VHOST_DMA_MAP_NO_SPACE = -3,
+    VHOST_DMA_MAP_OVERLAP = -2,
+    VHOST_DMA_MAP_INVALID = -1,
+    VHOST_DMA_MAP_OK = 0,
+} VhostDMAMapNewRC;
+
+typedef struct VhostIOVATree VhostIOVATree;
+
+VhostIOVATree *vhost_iova_tree_new(void);
+void vhost_iova_tree_unref(VhostIOVATree *iova_rm);
+G_DEFINE_AUTOPTR_CLEANUP_FUNC(VhostIOVATree, vhost_iova_tree_unref);
+
+const VhostDMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *iova_rm,
+                                             const VhostDMAMap *map);
+VhostDMAMapNewRC vhost_iova_tree_alloc(VhostIOVATree *iova_rm,
+                                       VhostDMAMap *map);
+
+#endif
diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
new file mode 100644
index 0000000000..c284e27607
--- /dev/null
+++ b/hw/virtio/vhost-iova-tree.c
@@ -0,0 +1,230 @@
+/*
+ * vhost software live migration ring
+ *
+ * SPDX-FileCopyrightText: Red Hat, Inc. 2021
+ * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
+ *
+ * SPDX-License-Identifier: GPL-2.0-or-later
+ */
+
+#include "qemu/osdep.h"
+#include "vhost-iova-tree.h"
+
+#define G_ARRAY_NOT_ZERO_TERMINATED false
+#define G_ARRAY_NOT_CLEAR_ON_ALLOC false
+
+#define iova_min qemu_real_host_page_size
+
+/**
+ * VhostIOVATree, able to:
+ * - Translate iova address
+ * - Reverse translate iova address (from translated to iova)
+ * - Allocate IOVA regions for translated range (potentially slow operation)
+ *
+ * Note that it cannot remove nodes.
+ */
+struct VhostIOVATree {
+    /* Ordered array of reverse translations, IOVA address to qemu memory. */
+    GArray *iova_taddr_map;
+
+    /*
+     * Ordered array of translations from qemu virtual memory address to iova
+     */
+    GArray *taddr_iova_map;
+};
+
+/**
+ * Inserts an element after an existing one in garray.
+ *
+ * @array      The array
+ * @prev_elem  The previous element of array or NULL if prepending
+ * @map        The DMA map
+ *
+ * It provides the additional advantage of being type safe over
+ * g_array_insert_val, which accepts a reference pointer instead of a value
+ * with no complaints.
+ */
+static void vhost_iova_tree_insert_after(GArray *array,
+                                         const VhostDMAMap *prev_elem,
+                                         const VhostDMAMap *map)
+{
+    size_t pos;
+
+    if (!prev_elem) {
+        pos = 0;
+    } else {
+        pos = prev_elem - &g_array_index(array, typeof(*prev_elem), 0) + 1;
+    }
+
+    g_array_insert_val(array, pos, *map);
+}
+
+static gint vhost_iova_tree_cmp_taddr(gconstpointer a, gconstpointer b)
+{
+    const VhostDMAMap *m1 = a, *m2 = b;
+
+    if (m1->translated_addr > m2->translated_addr + m2->size) {
+        return 1;
+    }
+
+    if (m1->translated_addr + m1->size < m2->translated_addr) {
+        return -1;
+    }
+
+    /* Overlapped */
+    return 0;
+}
+
+/**
+ * Find the previous node to a given iova
+ *
+ * @array  The ascending ordered-by-translated-addr array of VhostDMAMap
+ * @map    The map to insert
+ * @prev   Returned location of the previous map
+ *
+ * Return VHOST_DMA_MAP_OK if everything went well, or VHOST_DMA_MAP_OVERLAP if
+ * it already exists. It is ok to use this function to check if a given range
+ * exists, but it will use a linear search.
+ *
+ * TODO: We can use bsearch to locate the entry if we save the state in the
+ * needle, knowing that the needle is always the first argument to
+ * compare_func.
+ */
+static VhostDMAMapNewRC vhost_iova_tree_find_prev(const GArray *array,
+                                                  GCompareFunc compare_func,
+                                                  const VhostDMAMap *map,
+                                                  const VhostDMAMap **prev)
+{
+    size_t i;
+    int r;
+
+    *prev = NULL;
+    for (i = 0; i < array->len; ++i) {
+        r = compare_func(map, &g_array_index(array, typeof(*map), i));
+        if (r == 0) {
+            return VHOST_DMA_MAP_OVERLAP;
+        }
+        if (r < 0) {
+            return VHOST_DMA_MAP_OK;
+        }
+
+        *prev = &g_array_index(array, typeof(**prev), i);
+    }
+
+    return VHOST_DMA_MAP_OK;
+}
+
+/**
+ * Create a new IOVA tree
+ *
+ * Returns the new IOVA tree
+ */
+VhostIOVATree *vhost_iova_tree_new(void)
+{
+    VhostIOVATree *tree = g_new(VhostIOVATree, 1);
+    tree->iova_taddr_map = g_array_new(G_ARRAY_NOT_ZERO_TERMINATED,
+                                       G_ARRAY_NOT_CLEAR_ON_ALLOC,
+                                       sizeof(VhostDMAMap));
+    tree->taddr_iova_map = g_array_new(G_ARRAY_NOT_ZERO_TERMINATED,
+                                       G_ARRAY_NOT_CLEAR_ON_ALLOC,
+                                       sizeof(VhostDMAMap));
+    return tree;
+}
+
+/**
+ * Destroy an IOVA tree
+ *
+ * @tree  The iova tree
+ */
+void vhost_iova_tree_unref(VhostIOVATree *tree)
+{
+    g_array_unref(g_steal_pointer(&tree->iova_taddr_map));
+    g_array_unref(g_steal_pointer(&tree->taddr_iova_map));
+}
+
+/**
+ * Find the IOVA address stored from a memory address
+ *
+ * @tree     The iova tree
+ * @map      The map with the memory address
+ *
+ * Return the stored mapping, or NULL if not found.
+ */
+const VhostDMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *tree,
+                                             const VhostDMAMap *map)
+{
+    /*
+     * This can be replaced with g_array_binary_search (since glib 2.62) when
+     * that version becomes common enough.
+     */
+    return bsearch(map, tree->taddr_iova_map->data, tree->taddr_iova_map->len,
+                   sizeof(*map), vhost_iova_tree_cmp_taddr);
+}
+
+static bool vhost_iova_tree_find_iova_hole(const GArray *iova_map,
+                                           const VhostDMAMap *map,
+                                           const VhostDMAMap **prev_elem)
+{
+    size_t i;
+    hwaddr iova = iova_min;
+
+    *prev_elem = NULL;
+    for (i = 0; i < iova_map->len; i++) {
+        const VhostDMAMap *next = &g_array_index(iova_map, typeof(*next), i);
+        hwaddr hole_end = next->iova;
+        if (map->size < hole_end - iova) {
+            return true;
+        }
+
+        iova = next->iova + next->size + 1;
+        *prev_elem = next;
+    }
+
+    return ((hwaddr)-1 - iova) > map->size;
+}
+
+/**
+ * Allocate a new mapping
+ *
+ * @tree  The iova tree
+ * @map   The iova map
+ *
+ * Returns:
+ * - VHOST_DMA_MAP_OK if the map fits in the container
+ * - VHOST_DMA_MAP_INVALID if the map does not make sense (like size overflow)
+ * - VHOST_DMA_MAP_OVERLAP if the tree already contains that map
+ * - VHOST_DMA_MAP_NO_SPACE if the tree cannot allocate more space.
+ *
+ * On VHOST_DMA_MAP_OK, the assigned iova is returned in map->iova.
+ */
+VhostDMAMapNewRC vhost_iova_tree_alloc(VhostIOVATree *tree,
+                                       VhostDMAMap *map)
+{
+    const VhostDMAMap *qemu_prev, *iova_prev;
+    int find_prev_rc;
+    bool fit;
+
+    if (map->translated_addr + map->size < map->translated_addr ||
+        map->iova + map->size < map->iova || map->perm == IOMMU_NONE) {
+        return VHOST_DMA_MAP_INVALID;
+    }
+
+    /* Search for a hole in iova space big enough */
+    fit = vhost_iova_tree_find_iova_hole(tree->iova_taddr_map, map,
+                                         &iova_prev);
+    if (!fit) {
+        return VHOST_DMA_MAP_NO_SPACE;
+    }
+
+    map->iova = iova_prev ? (iova_prev->iova + iova_prev->size) + 1 : iova_min;
+    find_prev_rc = vhost_iova_tree_find_prev(tree->taddr_iova_map,
+                                             vhost_iova_tree_cmp_taddr, map,
+                                             &qemu_prev);
+    if (find_prev_rc == VHOST_DMA_MAP_OVERLAP) {
+        return VHOST_DMA_MAP_OVERLAP;
+    }
+
+    vhost_iova_tree_insert_after(tree->iova_taddr_map, iova_prev, map);
+    vhost_iova_tree_insert_after(tree->taddr_iova_map, qemu_prev, map);
+    return VHOST_DMA_MAP_OK;
+}
diff --git a/hw/virtio/meson.build b/hw/virtio/meson.build
index 8b5a0225fe..cb306b83c6 100644
--- a/hw/virtio/meson.build
+++ b/hw/virtio/meson.build
@@ -11,7 +11,7 @@ softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-stub.c'))
 
 virtio_ss = ss.source_set()
 virtio_ss.add(files('virtio.c'))
-virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c'))
+virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c', 'vhost-iova-tree.c'))
 virtio_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user.c'))
 virtio_ss.add(when: 'CONFIG_VHOST_VDPA', if_true: files('vhost-vdpa.c'))
 virtio_ss.add(when: 'CONFIG_VIRTIO_BALLOON', if_true: files('virtio-balloon.c'))
-- 
2.27.0
* [RFC PATCH v4 19/20] vhost: Use a tree to store memory mappings
  2021-10-01  7:05 [RFC PATCH v4 00/20] vDPA shadow virtqueue Eugenio Pérez
                   ` (17 preceding siblings ...)
  2021-10-01  7:06 ` [RFC PATCH v4 18/20] vhost: Add VhostIOVATree Eugenio Pérez
@ 2021-10-01  7:06 ` Eugenio Pérez
  2021-10-01  7:06 ` [RFC PATCH v4 20/20] vdpa: Add custom IOTLB translations to SVQ Eugenio Pérez
  2021-10-12  3:59 ` [RFC PATCH v4 00/20] vDPA shadow virtqueue Jason Wang
  20 siblings, 0 replies; 90+ messages in thread
From: Eugenio Pérez @ 2021-10-01  7:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
Track memory translations of devices with IOMMU (all vhost-vdpa
devices at the moment). It does not work yet if the device has
restrictions in its iova range.
Updates to the tree are protected by the BQL, since each one always runs
from the main event loop context.
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
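To make the address handling in vhost_vdpa_svq_start_vq easier to follow,
here is a small standalone sketch of the two conventions it depends on: the
inclusive size stored in a map, and the rebasing of the vring addresses once
the driver and device areas have been assigned an iova. The names
SketchVringAddr, sketch_inclusive_size and sketch_rebase are hypothetical,
and the sketch assumes, as the patch does, that the avail ring lives in the
same area as the descriptor table.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of struct vhost_vring_addr. */
typedef struct {
    uint64_t desc;
    uint64_t avail;
    uint64_t used;
} SketchVringAddr;

/* A region of nbytes bytes has an inclusive size of nbytes - 1. */
static uint64_t sketch_inclusive_size(uint64_t nbytes)
{
    assert(nbytes > 0);
    return nbytes - 1;
}

/*
 * Replace qemu virtual addresses with the iova the device must use.
 * The avail ring keeps its offset from the start of the driver area.
 */
static SketchVringAddr sketch_rebase(SketchVringAddr va,
                                     uint64_t driver_iova,
                                     uint64_t device_iova)
{
    SketchVringAddr out;

    assert(va.avail >= va.desc);
    out.avail = driver_iova + (va.avail - va.desc);
    out.desc = driver_iova;
    out.used = device_iova;
    return out;
}
```

The avail address is computed before desc is overwritten, which is the same
ordering constraint the patch follows when it updates addr in place.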
 include/hw/virtio/vhost-vdpa.h |  3 ++
 hw/virtio/vhost-vdpa.c         | 59 ++++++++++++++++++++++++++++++++++
 2 files changed, 62 insertions(+)
diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
index 9044ae694b..7353e36884 100644
--- a/include/hw/virtio/vhost-vdpa.h
+++ b/include/hw/virtio/vhost-vdpa.h
@@ -15,6 +15,7 @@
 #include <gmodule.h>
 
 #include "qemu/queue.h"
+#include "hw/virtio/vhost-iova-tree.h"
 #include "hw/virtio/virtio.h"
 
 typedef struct VhostVDPAHostNotifier {
@@ -29,6 +30,8 @@ typedef struct vhost_vdpa {
     uint64_t host_features;
     uint64_t guest_features;
     bool shadow_vqs_enabled;
+    /* IOVA mapping used by Shadow Virtqueue */
+    VhostIOVATree *iova_map;
     GPtrArray *shadow_vqs;
     struct vhost_dev *dev;
     QLIST_ENTRY(vhost_vdpa) entry;
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index bb7010ddb5..a9c680b487 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -395,6 +395,7 @@ static void vhost_vdpa_svq_cleanup(struct vhost_dev *dev)
         vhost_svq_stop(dev, idx, g_ptr_array_index(v->shadow_vqs, idx));
     }
     g_ptr_array_free(v->shadow_vqs, true);
+    g_clear_pointer(&v->iova_map, vhost_iova_tree_unref);
 }
 
 static int vhost_vdpa_cleanup(struct vhost_dev *dev)
@@ -753,6 +754,22 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
     return true;
 }
 
+/**
+ * Maps QEMU vaddr memory to device in a suitable way for shadow virtqueue:
+ * - It always reference qemu memory address, not guest's memory.
+ * - TODO It's always in range of device.
+ *
+ * It returns the vhost_vdpa_dma_map() result; the IOVA is left in map->iova.
+ */
+static int vhost_vdpa_svq_map(struct vhost_vdpa *v, VhostDMAMap *map)
+{
+    int r = vhost_iova_tree_alloc(v->iova_map, map);
+    assert(r == VHOST_DMA_MAP_OK);
+
+    return vhost_vdpa_dma_map(v, map->iova, map->size, map->translated_addr,
+                              false);
+}
+
 static int vhost_vdpa_vring_pause(struct vhost_dev *dev)
 {
     int r;
@@ -771,6 +788,7 @@ static int vhost_vdpa_vring_pause(struct vhost_dev *dev)
  */
 static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
 {
+    VhostDMAMap device_region, driver_region;
     struct vhost_vdpa *v = dev->opaque;
     VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, idx);
     EventNotifier *vhost_call_notifier = vhost_svq_get_svq_call_notifier(svq);
@@ -789,6 +807,33 @@ static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
     bool b;
 
     vhost_svq_get_vring_addr(svq, &addr);
+    driver_region = (VhostDMAMap) {
+        .translated_addr = (void *)addr.desc_user_addr,
+
+        /*
+         * VhostDMAMap.size includes the last byte of the range, while
+         * sizeof marks one past it. Subtract one byte to make them match.
+         */
+        .size = vhost_svq_driver_area_size(svq) - 1,
+        .perm = VHOST_ACCESS_RO,
+    };
+    device_region = (VhostDMAMap) {
+        .translated_addr = (void *)addr.used_user_addr,
+        .size = vhost_svq_device_area_size(svq) - 1,
+        .perm = VHOST_ACCESS_RW,
+    };
+
+    r = vhost_vdpa_svq_map(v, &driver_region);
+    assert(r == 0);
+    r = vhost_vdpa_svq_map(v, &device_region);
+    assert(r == 0);
+
+    /* Expose IOVA addresses to vDPA device */
+    addr.avail_user_addr = driver_region.iova + addr.avail_user_addr
+                           - addr.desc_user_addr;
+    addr.desc_user_addr = driver_region.iova;
+    addr.used_user_addr = device_region.iova;
+
     r = vhost_vdpa_set_vring_addr(dev, &addr);
     if (unlikely(r)) {
         error_report("vhost_set_vring_addr for shadow vq failed");
@@ -822,6 +867,17 @@ static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
     return true;
 }
 
+/**
+ * Enable or disable shadow virtqueue in a vhost vdpa device.
+ *
+ * This function is idempotent: calling it many times with the same value for
+ * enable simply returns success.
+ *
+ * @v       The vhost vdpa device
+ * @enable  The value of shadow virtqueue we want.
+ *
+ * Returns the number of queues changed.
+ */
 static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
 {
     struct vhost_dev *hdev = v->dev;
@@ -833,6 +889,8 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
     }
 
     if (enable) {
+        v->iova_map = vhost_iova_tree_new();
+
         /* Allocate resources */
         assert(v->shadow_vqs->len == 0);
         for (n = 0; n < hdev->nvqs; ++n) {
@@ -907,6 +965,7 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
     if (!enable) {
         /* Resources cleanup */
         g_ptr_array_set_size(v->shadow_vqs, 0);
+        g_clear_pointer(&v->iova_map, vhost_iova_tree_unref);
     }
 
     return n;
-- 
2.27.0
* [RFC PATCH v4 20/20] vdpa: Add custom IOTLB translations to SVQ
  2021-10-01  7:05 [RFC PATCH v4 00/20] vDPA shadow virtqueue Eugenio Pérez
                   ` (18 preceding siblings ...)
  2021-10-01  7:06 ` [RFC PATCH v4 19/20] vhost: Use a tree to store memory mappings Eugenio Pérez
@ 2021-10-01  7:06 ` Eugenio Pérez
  2021-10-13  5:34   ` Jason Wang
  2021-10-19  9:24   ` Jason Wang
  2021-10-12  3:59 ` [RFC PATCH v4 00/20] vDPA shadow virtqueue Jason Wang
  20 siblings, 2 replies; 90+ messages in thread
From: Eugenio Pérez @ 2021-10-01  7:06 UTC (permalink / raw)
  To: qemu-devel
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
Use translations added in VhostIOVATree in SVQ.
Now every element also needs to store the previous addresses, so VirtQueue
can consume the elements properly. This adds a little overhead per VQ
element, since more memory must be allocated to stash them. As a possible
optimization, this allocation could be avoided if the descriptor is not
a chain but a single one, but this is left undone.
TODO: the iova range should be queried beforehand, with logic added to fail
when a GPA is outside of that range and the memory listener or SVQ adds it.
Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
---
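The stash / translate / unstash cycle described above can be sketched in
isolation as follows. This is only an illustration under simplified
assumptions: a single hypothetical SketchMapping replaces the VhostIOVATree
lookup, plain malloc/free replace g_new/g_free, and the sketch_* names do not
exist in QEMU.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/uio.h>

/* Hypothetical mapping: qemu range [vaddr, vaddr + size] starts at iova. */
typedef struct {
    uintptr_t vaddr;
    uint64_t iova;
    uint64_t size; /* Inclusive */
} SketchMapping;

/* Save the original iov_base of each entry. Returns NULL if num is 0. */
static void **sketch_stash(const struct iovec *iov, size_t num)
{
    void **stash;
    size_t i;

    if (num == 0) {
        return NULL;
    }

    stash = malloc(num * sizeof(*stash));
    assert(stash);
    for (i = 0; i < num; ++i) {
        stash[i] = iov[i].iov_base;
    }
    return stash;
}

/* Rewrite each iov_base with the iova the device must see. */
static void sketch_translate(const SketchMapping *m, struct iovec *iov,
                             size_t num)
{
    size_t i;

    for (i = 0; i < num; ++i) {
        uintptr_t va = (uintptr_t)iov[i].iov_base;

        /* The buffer must start inside the mapped range */
        assert(va >= m->vaddr && va - m->vaddr <= m->size);
        iov[i].iov_base = (void *)(uintptr_t)(m->iova + (va - m->vaddr));
    }
}

/* Restore the original addresses and release the stash. */
static void sketch_unstash(void **stash, struct iovec *iov, size_t num)
{
    size_t i;

    for (i = 0; i < num; ++i) {
        iov[i].iov_base = stash[i];
    }
    free(stash);
}
```

The stash must be taken before translating, and the element must be restored
before it is handed back to the guest-facing VirtQueue, which is why the
patch unstashes both in the used path and in vhost_svq_stop.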
 hw/virtio/vhost-shadow-virtqueue.h |   4 +-
 hw/virtio/vhost-shadow-virtqueue.c | 130 ++++++++++++++++++++++++-----
 hw/virtio/vhost-vdpa.c             |  40 ++++++++-
 hw/virtio/trace-events             |   1 +
 4 files changed, 152 insertions(+), 23 deletions(-)
diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
index b7baa424a7..a0e6b5267a 100644
--- a/hw/virtio/vhost-shadow-virtqueue.h
+++ b/hw/virtio/vhost-shadow-virtqueue.h
@@ -11,6 +11,7 @@
 #define VHOST_SHADOW_VIRTQUEUE_H
 
 #include "hw/virtio/vhost.h"
+#include "hw/virtio/vhost-iova-tree.h"
 
 typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
 
@@ -28,7 +29,8 @@ bool vhost_svq_start(struct vhost_dev *dev, unsigned idx,
 void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
                     VhostShadowVirtqueue *svq);
 
-VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx);
+VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx,
+                                    VhostIOVATree *iova_map);
 
 void vhost_svq_free(VhostShadowVirtqueue *vq);
 
diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
index 2fd0bab75d..9db538547e 100644
--- a/hw/virtio/vhost-shadow-virtqueue.c
+++ b/hw/virtio/vhost-shadow-virtqueue.c
@@ -11,12 +11,19 @@
 #include "hw/virtio/vhost-shadow-virtqueue.h"
 #include "hw/virtio/vhost.h"
 #include "hw/virtio/virtio-access.h"
+#include "hw/virtio/vhost-iova-tree.h"
 
 #include "standard-headers/linux/vhost_types.h"
 
 #include "qemu/error-report.h"
 #include "qemu/main-loop.h"
 
+typedef struct SVQElement {
+    VirtQueueElement elem;
+    void **in_sg_stash;
+    void **out_sg_stash;
+} SVQElement;
+
 /* Shadow virtqueue to relay notifications */
 typedef struct VhostShadowVirtqueue {
     /* Shadow vring */
@@ -46,8 +53,11 @@ typedef struct VhostShadowVirtqueue {
     /* Virtio device */
     VirtIODevice *vdev;
 
+    /* IOVA mapping if used */
+    VhostIOVATree *iova_map;
+
     /* Map for returning guest's descriptors */
-    VirtQueueElement **ring_id_maps;
+    SVQElement **ring_id_maps;
 
     /* Next head to expose to device */
     uint16_t avail_idx_shadow;
@@ -79,13 +89,6 @@ bool vhost_svq_valid_device_features(uint64_t *dev_features)
             continue;
 
         case VIRTIO_F_ACCESS_PLATFORM:
-            /* SVQ needs this feature disabled. Can't continue */
-            if (*dev_features & BIT_ULL(b)) {
-                clear_bit(b, dev_features);
-                r = false;
-            }
-            break;
-
         case VIRTIO_F_VERSION_1:
             /* SVQ needs this feature, so can't continue */
             if (!(*dev_features & BIT_ULL(b))) {
@@ -126,6 +129,64 @@ static void vhost_svq_set_notification(VhostShadowVirtqueue *svq, bool enable)
     }
 }
 
+static void vhost_svq_stash_addr(void ***stash, const struct iovec *iov,
+                                 size_t num)
+{
+    size_t i;
+
+    if (num == 0) {
+        return;
+    }
+
+    *stash = g_new(void *, num);
+    for (i = 0; i < num; ++i) {
+        (*stash)[i] = iov[i].iov_base;
+    }
+}
+
+static void vhost_svq_unstash_addr(void **stash, struct iovec *iov, size_t num)
+{
+    size_t i;
+
+    if (num == 0) {
+        return;
+    }
+
+    for (i = 0; i < num; ++i) {
+        iov[i].iov_base = stash[i];
+    }
+    g_free(stash);
+}
+
+static void vhost_svq_translate_addr(const VhostShadowVirtqueue *svq,
+                                     struct iovec *iovec, size_t num)
+{
+    size_t i;
+
+    for (i = 0; i < num; ++i) {
+        VhostDMAMap needle = {
+            .translated_addr = iovec[i].iov_base,
+            .size = iovec[i].iov_len,
+        };
+        size_t off;
+
+        const VhostDMAMap *map = vhost_iova_tree_find_iova(svq->iova_map,
+                                                           &needle);
+        /*
+         * Map cannot be NULL since iova map contains all guest space and
+         * qemu already has a physical address mapped
+         */
+        assert(map);
+
+        /*
+         * Map->iova chunk size is ignored. What to do if descriptor
+         * (addr, size) does not fit is delegated to the device.
+         */
+        off = needle.translated_addr - map->translated_addr;
+        iovec[i].iov_base = (void *)(map->iova + off);
+    }
+}
+
 static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
                                     const struct iovec *iovec,
                                     size_t num, bool more_descs, bool write)
@@ -156,8 +217,9 @@ static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
 }
 
 static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
-                                    VirtQueueElement *elem)
+                                    SVQElement *svq_elem)
 {
+    VirtQueueElement *elem = &svq_elem->elem;
     int head;
     unsigned avail_idx;
     vring_avail_t *avail = svq->vring.avail;
@@ -167,6 +229,12 @@ static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
     /* We need some descriptors here */
     assert(elem->out_num || elem->in_num);
 
+    vhost_svq_stash_addr(&svq_elem->in_sg_stash, elem->in_sg, elem->in_num);
+    vhost_svq_stash_addr(&svq_elem->out_sg_stash, elem->out_sg, elem->out_num);
+
+    vhost_svq_translate_addr(svq, elem->in_sg, elem->in_num);
+    vhost_svq_translate_addr(svq, elem->out_sg, elem->out_num);
+
     vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
                             elem->in_num > 0, false);
     vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
@@ -187,7 +255,7 @@ static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
 
 }
 
-static void vhost_svq_add(VhostShadowVirtqueue *svq, VirtQueueElement *elem)
+static void vhost_svq_add(VhostShadowVirtqueue *svq, SVQElement *elem)
 {
     unsigned qemu_head = vhost_svq_add_split(svq, elem);
 
@@ -221,7 +289,7 @@ static void vhost_handle_guest_kick(EventNotifier *n)
         }
 
         while (true) {
-            VirtQueueElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
+            SVQElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
             if (!elem) {
                 break;
             }
@@ -247,7 +315,7 @@ static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
     return svq->used_idx != svq->shadow_used_idx;
 }
 
-static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
+static SVQElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
 {
     vring_desc_t *descs = svq->vring.desc;
     const vring_used_t *used = svq->vring.used;
@@ -279,7 +347,7 @@ static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
     descs[used_elem.id].next = svq->free_head;
     svq->free_head = used_elem.id;
 
-    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
+    svq->ring_id_maps[used_elem.id]->elem.len = used_elem.len;
     return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
 }
 
@@ -296,12 +364,19 @@ static void vhost_svq_handle_call_no_test(EventNotifier *n)
 
         vhost_svq_set_notification(svq, false);
         while (true) {
-            g_autofree VirtQueueElement *elem = vhost_svq_get_buf(svq);
-            if (!elem) {
+            g_autofree SVQElement *svq_elem = vhost_svq_get_buf(svq);
+            VirtQueueElement *elem;
+            if (!svq_elem) {
                 break;
             }
 
             assert(i < svq->vring.num);
+            elem = &svq_elem->elem;
+
+            vhost_svq_unstash_addr(svq_elem->in_sg_stash, elem->in_sg,
+                                   elem->in_num);
+            vhost_svq_unstash_addr(svq_elem->out_sg_stash, elem->out_sg,
+                                   elem->out_num);
             virtqueue_fill(vq, elem, elem->len, i++);
         }
 
@@ -451,14 +526,24 @@ void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
     event_notifier_set_handler(&svq->host_notifier, NULL);
 
     for (i = 0; i < svq->vring.num; ++i) {
-        g_autofree VirtQueueElement *elem = svq->ring_id_maps[i];
+        g_autofree SVQElement *svq_elem = svq->ring_id_maps[i];
+        VirtQueueElement *elem;
+
+        if (!svq_elem) {
+            continue;
+        }
+
+        elem = &svq_elem->elem;
+        vhost_svq_unstash_addr(svq_elem->in_sg_stash, elem->in_sg,
+                               elem->in_num);
+        vhost_svq_unstash_addr(svq_elem->out_sg_stash, elem->out_sg,
+                               elem->out_num);
+
         /*
          * Although the doc says we must unpop in order, it's ok to unpop
          * everything.
          */
-        if (elem) {
-            virtqueue_unpop(svq->vq, elem, elem->len);
-        }
+        virtqueue_unpop(svq->vq, elem, elem->len);
     }
 }
 
@@ -466,7 +551,8 @@ void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
  * Creates vhost shadow virtqueue, and instruct vhost device to use the shadow
  * methods and file descriptors.
  */
-VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
+VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx,
+                                    VhostIOVATree *iova_map)
 {
     int vq_idx = dev->vq_index + idx;
     unsigned num = virtio_queue_get_num(dev->vdev, vq_idx);
@@ -500,11 +586,13 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
     memset(svq->vring.desc, 0, driver_size);
     svq->vring.used = qemu_memalign(qemu_real_host_page_size, device_size);
     memset(svq->vring.used, 0, device_size);
+    svq->iova_map = iova_map;
+
     for (i = 0; i < num - 1; i++) {
         svq->vring.desc[i].next = cpu_to_le16(i + 1);
     }
 
-    svq->ring_id_maps = g_new0(VirtQueueElement *, num);
+    svq->ring_id_maps = g_new0(SVQElement *, num);
     event_notifier_set_handler(&svq->call_notifier,
                                vhost_svq_handle_call);
     return g_steal_pointer(&svq);
diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
index a9c680b487..f5a12fee9d 100644
--- a/hw/virtio/vhost-vdpa.c
+++ b/hw/virtio/vhost-vdpa.c
@@ -176,6 +176,18 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
                                          vaddr, section->readonly);
 
     llsize = int128_sub(llend, int128_make64(iova));
+    if (v->shadow_vqs_enabled) {
+        VhostDMAMap mem_region = {
+            .translated_addr = vaddr,
+            .size = int128_get64(llsize) - 1,
+            .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
+        };
+
+        int r = vhost_iova_tree_alloc(v->iova_map, &mem_region);
+        assert(r == VHOST_DMA_MAP_OK);
+
+        iova = mem_region.iova;
+    }
 
     ret = vhost_vdpa_dma_map(v, iova, int128_get64(llsize),
                              vaddr, section->readonly);
@@ -754,6 +766,23 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
     return true;
 }
 
+static int vhost_vdpa_get_iova_range(struct vhost_dev *dev,
+                                     hwaddr *first, hwaddr *last)
+{
+    int ret;
+    struct vhost_vdpa_iova_range range;
+
+    ret = vhost_vdpa_call(dev, VHOST_VDPA_GET_IOVA_RANGE, &range);
+    if (ret != 0) {
+        return ret;
+    }
+
+    *first = range.first;
+    *last = range.last;
+    trace_vhost_vdpa_get_iova_range(dev, *first, *last);
+    return ret;
+}
+
 /**
  * Maps QEMU vaddr memory to device in a suitable way for shadow virtqueue:
  * - It always reference qemu memory address, not guest's memory.
@@ -881,6 +910,7 @@ static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
 static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
 {
     struct vhost_dev *hdev = v->dev;
+    hwaddr iova_first, iova_last;
     unsigned n;
     int r;
 
@@ -894,7 +924,7 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
         /* Allocate resources */
         assert(v->shadow_vqs->len == 0);
         for (n = 0; n < hdev->nvqs; ++n) {
-            VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n);
+            VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n, v->iova_map);
             if (unlikely(!svq)) {
                 g_ptr_array_set_size(v->shadow_vqs, 0);
                 return 0;
@@ -903,6 +933,8 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
         }
     }
 
+    r = vhost_vdpa_get_iova_range(hdev, &iova_first, &iova_last);
+    assert(r == 0);
     r = vhost_vdpa_vring_pause(hdev);
     assert(r == 0);
 
@@ -913,6 +945,12 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
         }
     }
 
+    memory_listener_unregister(&v->listener);
+    if (vhost_vdpa_dma_unmap(v, iova_first,
+                             (iova_last - iova_first) & TARGET_PAGE_MASK)) {
+        error_report("Fail to invalidate device iotlb");
+    }
+
     /* Reset device so it can be configured */
     r = vhost_vdpa_dev_start(hdev, false);
     assert(r == 0);
diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
index 8ed19e9d0c..650e521e35 100644
--- a/hw/virtio/trace-events
+++ b/hw/virtio/trace-events
@@ -52,6 +52,7 @@ vhost_vdpa_set_vring_call(void *dev, unsigned int index, int fd) "dev: %p index:
 vhost_vdpa_get_features(void *dev, uint64_t features) "dev: %p features: 0x%"PRIx64
 vhost_vdpa_set_owner(void *dev) "dev: %p"
 vhost_vdpa_vq_get_addr(void *dev, void *vq, uint64_t desc_user_addr, uint64_t avail_user_addr, uint64_t used_user_addr) "dev: %p vq: %p desc_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" used_user_addr: 0x%"PRIx64
+vhost_vdpa_get_iova_range(void *dev, uint64_t first, uint64_t last) "dev: %p first: 0x%"PRIx64" last: 0x%"PRIx64
 
 # virtio.c
 virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
-- 
2.27.0
^ permalink raw reply related	[flat|nested] 90+ messages in thread
* Re: [RFC PATCH v4 00/20] vDPA shadow virtqueue
  2021-10-01  7:05 [RFC PATCH v4 00/20] vDPA shadow virtqueue Eugenio Pérez
                   ` (19 preceding siblings ...)
  2021-10-01  7:06 ` [RFC PATCH v4 20/20] vdpa: Add custom IOTLB translations to SVQ Eugenio Pérez
@ 2021-10-12  3:59 ` Jason Wang
  2021-10-12  4:06   ` Jason Wang
  20 siblings, 1 reply; 90+ messages in thread
From: Jason Wang @ 2021-10-12  3:59 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Juan Quintela, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
On 2021/10/1 at 3:05 PM, Eugenio Pérez wrote:
> This series enables shadow virtqueue (SVQ) for vhost-vdpa devices. This
> is intended as a new method of tracking the memory the devices touch
> during a migration process: instead of relying on the vhost device's dirty
> logging capability, SVQ intercepts the VQ dataplane, forwarding the
> descriptors between VM and device. This way qemu is the effective
> writer of the guest's memory, like in qemu's virtio device operation.
>
> When SVQ is enabled qemu offers a new vring to the device to read
> and write into, and also intercepts kicks and calls between the device
> and the guest. Relaying used buffers would cause dirty memory to be
> tracked, but in this RFC SVQ is not enabled on migration automatically.
>
> It is based on the ideas of DPDK SW assisted LM, in the series of
> DPDK's https://patchwork.dpdk.org/cover/48370/ . However, this series does
> not map the shadow vq in the guest's VA, but in qemu's.
>
> For qemu to use shadow virtqueues the guest virtio driver must not use
> features like event_idx or indirect descriptors. These limitations will
> be addressed in later series, but they are left out for simplicity at
> the moment.
>
> SVQ needs to be enabled with QMP command:
>
> { "execute": "x-vhost-enable-shadow-vq",
>        "arguments": { "name": "dev0", "enable": true } }
>
> This series includes some patches, to be deleted in the final version, that
> help with its testing. The first two of the series freely implement
> the feature to stop the device and be able to retrieve its status. It's
> intended to be used with the vp_vdpa driver in a nested environment. This
> driver also needs modifications to forward the new status bit.
>
> Patches 2-8 prepare the SVQ and QMP command to support guest to host
> notifications forwarding. If the SVQ is enabled with these ones
> applied and the device supports it, that part can be tested in
> isolation (for example, with networking), hopping through SVQ.
>
> Same thing is true with patches 9-13, but with device to guest
> notifications.
>
> The rest of the patches implement the actual buffer forwarding.
>
> Comments are welcome.
Hi Eugenio:
It would be helpful to have a public git repo for us to ease the review.
Thanks
>
> TODO:
> * Event, indirect, packed, and other features of virtio - Waiting for
>    confirmation of the big picture.
> * Use already available iova tree to track mappings.
> * Separate buffer forwarding into its own AIO context, so we can
>    throw more threads at that task and we don't need to stop the main
>    event loop.
> * unmap iommu memory. Now the tree can only grow from SVQ enable, but
>    it should be fine as long as not a lot of memory is added to the
>    guest.
> * Rebase on top of latest qemu (and, hopefully, on top of multiqueue
>    vdpa).
> * Some assertions need to become appropriate error handling paths.
> * Proper documentation.
>
> Changes from v3 RFC:
>    * Move everything to vhost-vdpa backend. A big change, this allowed
>      some cleanup but more code has been added in other places.
>    * More use of glib utilities, especially to manage memory.
> v3 link:
> https://lists.nongnu.org/archive/html/qemu-devel/2021-05/msg06032.html
>
> Changes from v2 RFC:
>    * Adding vhost-vdpa devices support
>    * Fixed some memory leaks pointed out by different comments
> v2 link:
> https://lists.nongnu.org/archive/html/qemu-devel/2021-03/msg05600.html
>
> Changes from v1 RFC:
>    * Use QMP instead of migration to start SVQ mode.
>    * Only accepting IOMMU devices, closer behavior with target devices
>      (vDPA)
>    * Fix invalid masking/unmasking of vhost call fd.
>    * Use of proper methods for synchronization.
>    * No need to modify VirtIO device code, all of the changes are
>      contained in vhost code.
>    * Delete superfluous code.
>    * An intermediate RFC was sent with only the notifications forwarding
>      changes. It can be seen in
>      https://patchew.org/QEMU/20210129205415.876290-1-eperezma@redhat.com/
> v1 link:
> https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg05372.html
>
> Eugenio Pérez (20):
>        virtio: Add VIRTIO_F_QUEUE_STATE
>        virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED
>        virtio: Add virtio_queue_is_host_notifier_enabled
>        vhost: Make vhost_virtqueue_{start,stop} public
>        vhost: Add x-vhost-enable-shadow-vq qmp
>        vhost: Add VhostShadowVirtqueue
>        vdpa: Register vdpa devices in a list
>        vhost: Route guest->host notification through shadow virtqueue
>        Add vhost_svq_get_svq_call_notifier
>        Add vhost_svq_set_guest_call_notifier
>        vdpa: Save call_fd in vhost-vdpa
>        vhost-vdpa: Take into account SVQ in vhost_vdpa_set_vring_call
>        vhost: Route host->guest notification through shadow virtqueue
>        virtio: Add vhost_shadow_vq_get_vring_addr
>        vdpa: Save host and guest features
>        vhost: Add vhost_svq_valid_device_features to shadow vq
>        vhost: Shadow virtqueue buffers forwarding
>        vhost: Add VhostIOVATree
>        vhost: Use a tree to store memory mappings
>        vdpa: Add custom IOTLB translations to SVQ
>
> Eugenio Pérez (20):
>    virtio: Add VIRTIO_F_QUEUE_STATE
>    virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED
>    virtio: Add virtio_queue_is_host_notifier_enabled
>    vhost: Make vhost_virtqueue_{start,stop} public
>    vhost: Add x-vhost-enable-shadow-vq qmp
>    vhost: Add VhostShadowVirtqueue
>    vdpa: Register vdpa devices in a list
>    vhost: Route guest->host notification through shadow virtqueue
>    vdpa: Save call_fd in vhost-vdpa
>    vhost-vdpa: Take into account SVQ in vhost_vdpa_set_vring_call
>    vhost: Route host->guest notification through shadow virtqueue
>    virtio: Add vhost_shadow_vq_get_vring_addr
>    vdpa: Save host and guest features
>    vhost: Add vhost_svq_valid_device_features to shadow vq
>    vhost: Shadow virtqueue buffers forwarding
>    vhost: Check for device VRING_USED_F_NO_NOTIFY at shadow virtqueue
>      kick
>    vhost: Use VRING_AVAIL_F_NO_INTERRUPT at device call on shadow
>      virtqueue
>    vhost: Add VhostIOVATree
>    vhost: Use a tree to store memory mappings
>    vdpa: Add custom IOTLB translations to SVQ
>
>   qapi/net.json                                 |  23 +
>   hw/virtio/vhost-iova-tree.h                   |  40 ++
>   hw/virtio/vhost-shadow-virtqueue.h            |  37 ++
>   hw/virtio/virtio-pci.h                        |   1 +
>   include/hw/virtio/vhost-vdpa.h                |  13 +
>   include/hw/virtio/vhost.h                     |   4 +
>   include/hw/virtio/virtio.h                    |   5 +-
>   .../standard-headers/linux/virtio_config.h    |   5 +
>   include/standard-headers/linux/virtio_pci.h   |   2 +
>   hw/net/virtio-net.c                           |   6 +-
>   hw/virtio/vhost-iova-tree.c                   | 230 +++++++
>   hw/virtio/vhost-shadow-virtqueue.c            | 619 ++++++++++++++++++
>   hw/virtio/vhost-vdpa.c                        | 412 +++++++++++-
>   hw/virtio/vhost.c                             |  12 +-
>   hw/virtio/virtio-pci.c                        |  16 +-
>   hw/virtio/virtio.c                            |   5 +
>   hw/virtio/meson.build                         |   2 +-
>   hw/virtio/trace-events                        |   1 +
>   18 files changed, 1413 insertions(+), 20 deletions(-)
>   create mode 100644 hw/virtio/vhost-iova-tree.h
>   create mode 100644 hw/virtio/vhost-shadow-virtqueue.h
>   create mode 100644 hw/virtio/vhost-iova-tree.c
>   create mode 100644 hw/virtio/vhost-shadow-virtqueue.c
>
^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [RFC PATCH v4 00/20] vDPA shadow virtqueue
  2021-10-12  3:59 ` [RFC PATCH v4 00/20] vDPA shadow virtqueue Jason Wang
@ 2021-10-12  4:06   ` Jason Wang
  2021-10-12  9:09     ` Eugenio Perez Martin
  0 siblings, 1 reply; 90+ messages in thread
From: Jason Wang @ 2021-10-12  4:06 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Juan Quintela, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
On Tue, Oct 12, 2021 at 11:59 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/10/1 at 3:05 PM, Eugenio Pérez wrote:
> > This series enables shadow virtqueue (SVQ) for vhost-vdpa devices. This
> > is intended as a new method of tracking the memory the devices touch
> > during a migration process: instead of relying on the vhost device's dirty
> > logging capability, SVQ intercepts the VQ dataplane, forwarding the
> > descriptors between VM and device. This way qemu is the effective
> > writer of the guest's memory, like in qemu's virtio device operation.
> >
> > When SVQ is enabled qemu offers a new vring to the device to read
> > and write into, and also intercepts kicks and calls between the device
> > and the guest. Relaying used buffers would cause dirty memory to be
> > tracked, but in this RFC SVQ is not enabled on migration automatically.
> >
> > It is based on the ideas of DPDK SW assisted LM, in the series of
> > DPDK's https://patchwork.dpdk.org/cover/48370/ . However, this series does
> > not map the shadow vq in the guest's VA, but in qemu's.
> >
> > For qemu to use shadow virtqueues the guest virtio driver must not use
> > features like event_idx or indirect descriptors. These limitations will
> > be addressed in later series, but they are left out for simplicity at
> > the moment.
> >
> > SVQ needs to be enabled with QMP command:
> >
> > { "execute": "x-vhost-enable-shadow-vq",
> >        "arguments": { "name": "dev0", "enable": true } }
> >
> > This series includes some patches, to be deleted in the final version, that
> > help with its testing. The first two of the series freely implement
> > the feature to stop the device and be able to retrieve its status. It's
> > intended to be used with the vp_vdpa driver in a nested environment. This
> > driver also needs modifications to forward the new status bit.
> >
> > Patches 2-8 prepare the SVQ and QMP command to support guest to host
> > notifications forwarding. If the SVQ is enabled with these ones
> > applied and the device supports it, that part can be tested in
> > isolation (for example, with networking), hopping through SVQ.
> >
> > Same thing is true with patches 9-13, but with device to guest
> > notifications.
> >
> > The rest of the patches implement the actual buffer forwarding.
> >
> > Comments are welcome.
>
>
> Hi Eugenio:
>
>
> It would be helpful to have a public git repo for us to ease the review.
>
> Thanks
>
Btw, we also need to measure the performance impact of the shadow virtqueue.
Thanks
^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [RFC PATCH v4 05/20] vhost: Add x-vhost-enable-shadow-vq qmp
  2021-10-01  7:05 ` [RFC PATCH v4 05/20] vhost: Add x-vhost-enable-shadow-vq qmp Eugenio Pérez
@ 2021-10-12  5:18   ` Markus Armbruster
  2021-10-12 13:08     ` Eugenio Perez Martin
  0 siblings, 1 reply; 90+ messages in thread
From: Markus Armbruster @ 2021-10-12  5:18 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	qemu-devel, virtualization, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, Eric Blake, Michael Lilja,
	Stefano Garzarella
Eugenio Pérez <eperezma@redhat.com> writes:
> Command to enable shadow virtqueue.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  qapi/net.json          | 23 +++++++++++++++++++++++
>  hw/virtio/vhost-vdpa.c |  8 ++++++++
>  2 files changed, 31 insertions(+)
>
> diff --git a/qapi/net.json b/qapi/net.json
> index 7fab2e7cd8..a2c30fd455 100644
> --- a/qapi/net.json
> +++ b/qapi/net.json
> @@ -79,6 +79,29 @@
>  { 'command': 'netdev_del', 'data': {'id': 'str'},
>    'allow-preconfig': true }
>  
> +##
> +# @x-vhost-enable-shadow-vq:
> +#
> +# Use vhost shadow virtqueue.
> +#
> +# @name: the device name of the VirtIO device
Is this a qdev ID?  A network client name?
> +#
> +# @enable: true to use the alternate shadow VQ notifications
> +#
> +# Returns: Always error, since SVQ is not implemented at the moment.
> +#
> +# Since: 6.2
> +#
> +# Example:
> +#
> +# -> { "execute": "x-vhost-enable-shadow-vq",
> +#     "arguments": { "name": "virtio-net", "enable": false } }
> +#
> +##
> +{ 'command': 'x-vhost-enable-shadow-vq',
> +  'data': {'name': 'str', 'enable': 'bool'},
> +  'if': 'defined(CONFIG_VHOST_KERNEL)' }
> +
Adding a command just for controlling a flag in some object is fine for
quick experiments.  As a permanent interface, it's problematic: one
command per flag would result in way too many commands.  Better: one
command to control a set of related properties.
I hesitate to suggest qom-set, because qom-set is not introspectable.
Recurring complaint about QOM: poor integration with QAPI/QMP.
Naming nitpick: since the command can both enable and disable, I'd call
it -set-vq instead of -enable-vq.
>  ##
>  # @NetLegacyNicOptions:
>  #
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 4fa414feea..c63e311d7c 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -23,6 +23,8 @@
>  #include "cpu.h"
>  #include "trace.h"
>  #include "qemu-common.h"
> +#include "qapi/qapi-commands-net.h"
> +#include "qapi/error.h"
>  
>  static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section)
>  {
> @@ -656,6 +658,12 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
>      return true;
>  }
>  
> +
> +void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
> +{
> +    error_setg(errp, "Shadow virtqueue still not implemented");
> +}
> +
>  const VhostOps vdpa_ops = {
>          .backend_type = VHOST_BACKEND_TYPE_VDPA,
>          .vhost_backend_init = vhost_vdpa_init,
^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [RFC PATCH v4 08/20] vhost: Route guest->host notification through shadow virtqueue
  2021-10-01  7:05 ` [RFC PATCH v4 08/20] vhost: Route guest->host notification through shadow virtqueue Eugenio Pérez
@ 2021-10-12  5:19   ` Markus Armbruster
  2021-10-12 13:09     ` Eugenio Perez Martin
  2021-10-13  3:27   ` Jason Wang
  1 sibling, 1 reply; 90+ messages in thread
From: Markus Armbruster @ 2021-10-12  5:19 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	qemu-devel, virtualization, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, Eric Blake, Michael Lilja,
	Stefano Garzarella
Eugenio Pérez <eperezma@redhat.com> writes:
> Shadow virtqueue notifications forwarding is disabled when vhost_dev
> stops, so code flow follows usual cleanup.
>
> Also, host notifiers must be disabled at SVQ start, and they will not
> start if SVQ has been enabled when device is stopped. This is trivial
> to address, but it is left out for simplicity at this moment.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  qapi/net.json                      |   2 +-
>  hw/virtio/vhost-shadow-virtqueue.h |   8 ++
>  include/hw/virtio/vhost-vdpa.h     |   4 +
>  hw/virtio/vhost-shadow-virtqueue.c | 138 ++++++++++++++++++++++++++++-
>  hw/virtio/vhost-vdpa.c             | 116 +++++++++++++++++++++++-
>  5 files changed, 264 insertions(+), 4 deletions(-)
>
> diff --git a/qapi/net.json b/qapi/net.json
> index a2c30fd455..fe546b0e7c 100644
> --- a/qapi/net.json
> +++ b/qapi/net.json
> @@ -88,7 +88,7 @@
>  #
>  # @enable: true to use the alternate shadow VQ notifications
>  #
> -# Returns: Always error, since SVQ is not implemented at the moment.
> +# Returns: Error if failure, or 'no error' for success.
Delete the whole line, please.
>  #
>  # Since: 6.2
>  #
[...]
^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [RFC PATCH v4 15/20] vhost: Shadow virtqueue buffers forwarding
  2021-10-01  7:05 ` [RFC PATCH v4 15/20] vhost: Shadow virtqueue buffers forwarding Eugenio Pérez
@ 2021-10-12  5:21   ` Markus Armbruster
  2021-10-12 13:28     ` Eugenio Perez Martin
  2021-10-13  4:31   ` Jason Wang
  1 sibling, 1 reply; 90+ messages in thread
From: Markus Armbruster @ 2021-10-12  5:21 UTC (permalink / raw)
  To: Eugenio Pérez
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	qemu-devel, virtualization, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, Eric Blake, Michael Lilja,
	Stefano Garzarella
Eugenio Pérez <eperezma@redhat.com> writes:
> Initial version of shadow virtqueue that actually forwards buffers. There
> is no iommu support at the moment, and that will be addressed in future
> patches of this series. Since all vhost-vdpa devices use forced IOMMU,
> this means that SVQ is not usable at this point of the series on any
> device.
>
> For simplicity it only supports modern devices, that expect vring
> in little endian, with split ring and no event idx or indirect
> descriptors. Support for them will not be added in this series.
>
> It reuses the VirtQueue code for the device part. The driver part is
> based on Linux's virtio_ring driver, but with stripped functionality
> and optimizations so it's easier to review. Later commits add simpler
> ones.
>
> SVQ uses VIRTIO_CONFIG_S_DEVICE_STOPPED to pause the device and
> retrieve its status (next available idx the device was going to
> consume) race-free. It can later reset the device to replace vring
> addresses etc. When SVQ starts qemu can resume consuming the guest's
> driver ring from that state, without notice from the latter.
>
> This status bit VIRTIO_CONFIG_S_DEVICE_STOPPED is currently discussed
> in VirtIO, and is implemented in qemu VirtIO-net devices in previous
> commits.
>
> Removal of _S_DEVICE_STOPPED bit (in other words, resuming the device)
> can be done in the future if a use case arises. At this moment we can
> just rely on resetting the full device.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>  qapi/net.json                      |   2 +-
>  hw/virtio/vhost-shadow-virtqueue.c | 237 ++++++++++++++++++++++++++++-
>  hw/virtio/vhost-vdpa.c             | 109 ++++++++++++-
>  3 files changed, 337 insertions(+), 11 deletions(-)
>
> diff --git a/qapi/net.json b/qapi/net.json
> index fe546b0e7c..1f4a55f2c5 100644
> --- a/qapi/net.json
> +++ b/qapi/net.json
> @@ -86,7 +86,7 @@
>  #
>  # @name: the device name of the VirtIO device
>  #
> -# @enable: true to use the alternate shadow VQ notifications
> +# @enable: true to use the alternate shadow VQ buffers fowarding path
Uh, why does the flag change meaning half-way through this series?
>  #
>  # Returns: Error if failure, or 'no error' for success.
>  #
[...]
^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [RFC PATCH v4 00/20] vDPA shadow virtqueue
  2021-10-12  4:06   ` Jason Wang
@ 2021-10-12  9:09     ` Eugenio Perez Martin
  0 siblings, 0 replies; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-12  9:09 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-devel, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Michael Lilja, Stefano Garzarella
On Tue, Oct 12, 2021 at 6:06 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Tue, Oct 12, 2021 at 11:59 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> >
> > On 2021/10/1 at 3:05 PM, Eugenio Pérez wrote:
> > > This series enables shadow virtqueue (SVQ) for vhost-vdpa devices. This
> > > is intended as a new method of tracking the memory the devices touch
> > > during a migration process: instead of relying on the vhost device's dirty
> > > logging capability, SVQ intercepts the VQ dataplane, forwarding the
> > > descriptors between VM and device. This way qemu is the effective
> > > writer of the guest's memory, like in qemu's virtio device operation.
> > >
> > > When SVQ is enabled qemu offers a new vring to the device to read
> > > and write into, and also intercepts kicks and calls between the device
> > > and the guest. Relaying used buffers would cause dirty memory to be
> > > tracked, but in this RFC SVQ is not enabled on migration automatically.
> > >
> > > It is based on the ideas of DPDK SW assisted LM, in the series of
> > > DPDK's https://patchwork.dpdk.org/cover/48370/ . However, this series does
> > > not map the shadow vq in the guest's VA, but in qemu's.
> > >
> > > For qemu to use shadow virtqueues the guest virtio driver must not use
> > > features like event_idx or indirect descriptors. These limitations will
> > > be addressed in later series, but they are left out for simplicity at
> > > the moment.
> > >
> > > SVQ needs to be enabled with QMP command:
> > >
> > > { "execute": "x-vhost-enable-shadow-vq",
> > >        "arguments": { "name": "dev0", "enable": true } }
> > >
> > > This series includes some patches, to be deleted in the final version, that
> > > help with its testing. The first two of the series freely implement
> > > the feature to stop the device and be able to retrieve its status. It's
> > > intended to be used with the vp_vdpa driver in a nested environment. This
> > > driver also needs modifications to forward the new status bit.
> > >
> > > Patches 2-8 prepare the SVQ and QMP command to support guest to host
> > > notifications forwarding. If the SVQ is enabled with these ones
> > > applied and the device supports it, that part can be tested in
> > > isolation (for example, with networking), hopping through SVQ.
> > >
> > > Same thing is true with patches 9-13, but with device to guest
> > > notifications.
> > >
> > > The rest of the patches implement the actual buffer forwarding.
> > >
> > > Comments are welcome.
> >
> >
> > Hi Eugenio:
> >
> >
> > It would be helpful to have a public git repo for us to ease the review.
> >
> > Thanks
> >
Hi Jason,
I just pushed this tag to
https://github.com/eugpermar/qemu/tree/vdpa_sw_live_migration.d/vdpa-v4 ,
but let me know if you find another way more convenient.
Thanks!
>
> Btw, we also need to measure the performance impact of the shadow virtqueue.
>
I will measure it in subsequent series, since I'm still making some
changes. At the moment I'm also testing with nested virtualization,
which can affect the numbers.
However, we need to take into account that this series still has a lot
of room for improvement. I would say that packed vq and isolating the code
in its own aio context could give a noticeable boost to the numbers.
Thanks!
> Thanks
>
^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [RFC PATCH v4 05/20] vhost: Add x-vhost-enable-shadow-vq qmp
  2021-10-12  5:18   ` Markus Armbruster
@ 2021-10-12 13:08     ` Eugenio Perez Martin
  2021-10-12 13:45       ` Markus Armbruster
  0 siblings, 1 reply; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-12 13:08 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	qemu-devel, virtualization, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, Eric Blake, Michael Lilja,
	Stefano Garzarella
On Tue, Oct 12, 2021 at 7:18 AM Markus Armbruster <armbru@redhat.com> wrote:
>
> Eugenio Pérez <eperezma@redhat.com> writes:
>
> > Command to enable shadow virtqueue.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >  qapi/net.json          | 23 +++++++++++++++++++++++
> >  hw/virtio/vhost-vdpa.c |  8 ++++++++
> >  2 files changed, 31 insertions(+)
> >
> > diff --git a/qapi/net.json b/qapi/net.json
> > index 7fab2e7cd8..a2c30fd455 100644
> > --- a/qapi/net.json
> > +++ b/qapi/net.json
> > @@ -79,6 +79,29 @@
> >  { 'command': 'netdev_del', 'data': {'id': 'str'},
> >    'allow-preconfig': true }
> >
> > +##
> > +# @x-vhost-enable-shadow-vq:
> > +#
> > +# Use vhost shadow virtqueue.
> > +#
> > +# @name: the device name of the VirtIO device
>
> Is this a qdev ID?  A network client name?
>
At the moment it is the virtio device name, the one specified in the
call to "virtio_init". But this should change; maybe the qdev id or
something that can be provided on the command line fits better here.
> > +#
> > +# @enable: true to use the alternate shadow VQ notifications
> > +#
> > +# Returns: Always error, since SVQ is not implemented at the moment.
> > +#
> > +# Since: 6.2
> > +#
> > +# Example:
> > +#
> > +# -> { "execute": "x-vhost-enable-shadow-vq",
> > +#     "arguments": { "name": "virtio-net", "enable": false } }
> > +#
> > +##
> > +{ 'command': 'x-vhost-enable-shadow-vq',
> > +  'data': {'name': 'str', 'enable': 'bool'},
> > +  'if': 'defined(CONFIG_VHOST_KERNEL)' }
> > +
>
> Adding a command just for controlling a flag in some object is fine for
> quick experiments.  As a permanent interface, it's problematic: one
> command per flag would result in way too many commands.  Better: one
> command to control a set of related properties.
>
> I hesitate to suggest qom-set, because qom-set is not introspectable.
> Recurring complaint about QOM: poor integration with QAPI/QMP.
>
I will take it into account, but the command is only temporary; that's why
it has the x- prefix. It's not like event_idx or other device feature
flags: every vDPA device can potentially use the SVQ datapath in a
transparent way, and neither the vDPA device nor the guest knows that qemu
supports it.
Ideally, this mode will kick in automatically at migration time, with
no need for further action.
> Naming nitpick: since the command can both enable and disable, I'd call
> it -set-vq instead of -enable-vq.
>
Got it, I will replace it.
Thanks!
> >  ##
> >  # @NetLegacyNicOptions:
> >  #
> > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > index 4fa414feea..c63e311d7c 100644
> > --- a/hw/virtio/vhost-vdpa.c
> > +++ b/hw/virtio/vhost-vdpa.c
> > @@ -23,6 +23,8 @@
> >  #include "cpu.h"
> >  #include "trace.h"
> >  #include "qemu-common.h"
> > +#include "qapi/qapi-commands-net.h"
> > +#include "qapi/error.h"
> >
> >  static bool vhost_vdpa_listener_skipped_section(MemoryRegionSection *section)
> >  {
> > @@ -656,6 +658,12 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
> >      return true;
> >  }
> >
> > +
> > +void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
> > +{
> > +    error_setg(errp, "Shadow virtqueue still not implemented");
> > +}
> > +
> >  const VhostOps vdpa_ops = {
> >          .backend_type = VHOST_BACKEND_TYPE_VDPA,
> >          .vhost_backend_init = vhost_vdpa_init,
>
^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [RFC PATCH v4 08/20] vhost: Route guest->host notification through shadow virtqueue
  2021-10-12  5:19   ` Markus Armbruster
@ 2021-10-12 13:09     ` Eugenio Perez Martin
  0 siblings, 0 replies; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-12 13:09 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	qemu-devel, virtualization, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, Eric Blake, Michael Lilja,
	Stefano Garzarella
On Tue, Oct 12, 2021 at 7:20 AM Markus Armbruster <armbru@redhat.com> wrote:
>
> Eugenio Pérez <eperezma@redhat.com> writes:
>
> > Shadow virtqueue notifications forwarding is disabled when vhost_dev
> > stops, so code flow follows usual cleanup.
> >
> > Also, host notifiers must be disabled at SVQ start, and they will not
> > start if SVQ has been enabled when device is stopped. This is trivial
> > to address, but it is left out for simplicity at this moment.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >  qapi/net.json                      |   2 +-
> >  hw/virtio/vhost-shadow-virtqueue.h |   8 ++
> >  include/hw/virtio/vhost-vdpa.h     |   4 +
> >  hw/virtio/vhost-shadow-virtqueue.c | 138 ++++++++++++++++++++++++++++-
> >  hw/virtio/vhost-vdpa.c             | 116 +++++++++++++++++++++++-
> >  5 files changed, 264 insertions(+), 4 deletions(-)
> >
> > diff --git a/qapi/net.json b/qapi/net.json
> > index a2c30fd455..fe546b0e7c 100644
> > --- a/qapi/net.json
> > +++ b/qapi/net.json
> > @@ -88,7 +88,7 @@
> >  #
> >  # @enable: true to use the alternate shadow VQ notifications
> >  #
> > -# Returns: Always error, since SVQ is not implemented at the moment.
> > +# Returns: Error if failure, or 'no error' for success.
>
> Delete the whole line, please.
>
I will do it.
Thanks!
> >  #
> >  # Since: 6.2
> >  #
> [...]
>
^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [RFC PATCH v4 15/20] vhost: Shadow virtqueue buffers forwarding
  2021-10-12  5:21   ` Markus Armbruster
@ 2021-10-12 13:28     ` Eugenio Perez Martin
  2021-10-12 13:48       ` Markus Armbruster
  0 siblings, 1 reply; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-12 13:28 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	qemu-devel, virtualization, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, Eric Blake, Michael Lilja,
	Stefano Garzarella
On Tue, Oct 12, 2021 at 7:21 AM Markus Armbruster <armbru@redhat.com> wrote:
>
> Eugenio Pérez <eperezma@redhat.com> writes:
>
> > Initial version of shadow virtqueue that actually forwards buffers. There
> > is no iommu support at the moment, and that will be addressed in future
> > patches of this series. Since all vhost-vdpa devices use forced IOMMU,
> > this means that SVQ is not usable at this point of the series on any
> > device.
> >
> > For simplicity it only supports modern devices, that expect vring
> > in little endian, with split ring and no event idx or indirect
> > descriptors. Support for them will not be added in this series.
> >
> > It reuses the VirtQueue code for the device part. The driver part is
> > based on Linux's virtio_ring driver, but with stripped functionality
> > and optimizations so it's easier to review. Later commits add simpler
> > ones.
> >
> > SVQ uses VIRTIO_CONFIG_S_DEVICE_STOPPED to pause the device and
> > retrieve its status (next available idx the device was going to
> > consume) race-free. It can later reset the device to replace vring
> > addresses etc. When SVQ starts qemu can resume consuming the guest's
> > driver ring from that state, without notice from the latter.
> >
> > This status bit VIRTIO_CONFIG_S_DEVICE_STOPPED is currently discussed
> > in VirtIO, and is implemented in qemu VirtIO-net devices in previous
> > commits.
> >
> > Removal of _S_DEVICE_STOPPED bit (in other words, resuming the device)
> > can be done in the future if a use case arises. At this moment we can
> > just rely on resetting the full device.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >  qapi/net.json                      |   2 +-
> >  hw/virtio/vhost-shadow-virtqueue.c | 237 ++++++++++++++++++++++++++++-
> >  hw/virtio/vhost-vdpa.c             | 109 ++++++++++++-
> >  3 files changed, 337 insertions(+), 11 deletions(-)
> >
> > diff --git a/qapi/net.json b/qapi/net.json
> > index fe546b0e7c..1f4a55f2c5 100644
> > --- a/qapi/net.json
> > +++ b/qapi/net.json
> > @@ -86,7 +86,7 @@
> >  #
> >  # @name: the device name of the VirtIO device
> >  #
> > -# @enable: true to use the alternate shadow VQ notifications
> > +# @enable: true to use the alternate shadow VQ buffers fowarding path
>
> Uh, why does the flag change meaning half-way through this series?
>
Before this patch, SVQ mode only adds an extra hop for notifications.
Guest notifications are received by qemu via ioeventfd, and qemu
forwards them to the device using a different eventfd. The reverse is
also true: device notifications are received by qemu through the
device call fd, and qemu then forwards them to the guest using a
different irqfd.
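As a rough illustration of that extra hop, here is a minimal sketch
using plain Linux eventfds instead of qemu's EventNotifier API.
relay_notification() and its parameter names are hypothetical and only
meant for illustration; the same function models both directions
(guest kick -> device, and device call -> guest). It assumes both fds
were created with EFD_NONBLOCK.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Hypothetical stand-in for one direction of the SVQ notification hop:
 * consume a pending event on from_fd and re-raise it on to_fd.
 *
 * Assumes both fds are non-blocking eventfds.
 * Returns 1 if an event was forwarded, 0 if none was pending, -1 on error.
 */
static int relay_notification(int from_fd, int to_fd)
{
    uint64_t count;
    ssize_t n;

    assert(from_fd >= 0 && to_fd >= 0);

    /* Test and clear: reading an eventfd resets its counter to zero */
    n = read(from_fd, &count, sizeof(count));
    if (n < 0) {
        /* EAGAIN means nothing was pending on the non-blocking fd */
        return errno == EAGAIN ? 0 : -1;
    }
    if (n != (ssize_t)sizeof(count)) {
        return -1;
    }

    /* Several coalesced source events become one forwarded event */
    count = 1;
    if (write(to_fd, &count, sizeof(count)) != (ssize_t)sizeof(count)) {
        return -1;
    }
    return 1;
}
```

In this model qemu would call relay_notification(guest_kick_fd,
svq_kick_fd) from the handler of the guest's ioeventfd, and
relay_notification(svq_call_fd, guest_call_fd) from the handler of the
device call fd.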
This intermediate step is not very useful by itself, but it helps to
check that this part of the communication works, with no need for the
shadow virtqueue to understand the vring format. Doing it that way
also produces smaller patches.
So it makes sense to me to document exactly what the QMP command does
at each point of the series. However, I can directly document it as
"use the alternate shadow VQ buffers forwarding path" from the beginning.
Does this make sense, or would it be better to document the final
intention of the command?
Thanks!
> >  #
> >  # Returns: Error if failure, or 'no error' for success.
> >  #
>
> [...]
>
^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [RFC PATCH v4 05/20] vhost: Add x-vhost-enable-shadow-vq qmp
  2021-10-12 13:08     ` Eugenio Perez Martin
@ 2021-10-12 13:45       ` Markus Armbruster
  2021-10-14 12:01         ` Eugenio Perez Martin
  0 siblings, 1 reply; 90+ messages in thread
From: Markus Armbruster @ 2021-10-12 13:45 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	qemu-level, virtualization, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, Eric Blake, Michael Lilja,
	Stefano Garzarella
Eugenio Perez Martin <eperezma@redhat.com> writes:
> On Tue, Oct 12, 2021 at 7:18 AM Markus Armbruster <armbru@redhat.com> wrote:
>>
>> Eugenio Pérez <eperezma@redhat.com> writes:
>>
>> > Command to enable shadow virtqueue.
>> >
>> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>> > ---
>> >  qapi/net.json          | 23 +++++++++++++++++++++++
>> >  hw/virtio/vhost-vdpa.c |  8 ++++++++
>> >  2 files changed, 31 insertions(+)
>> >
>> > diff --git a/qapi/net.json b/qapi/net.json
>> > index 7fab2e7cd8..a2c30fd455 100644
>> > --- a/qapi/net.json
>> > +++ b/qapi/net.json
>> > @@ -79,6 +79,29 @@
>> >  { 'command': 'netdev_del', 'data': {'id': 'str'},
>> >    'allow-preconfig': true }
>> >
>> > +##
>> > +# @x-vhost-enable-shadow-vq:
>> > +#
>> > +# Use vhost shadow virtqueue.
>> > +#
>> > +# @name: the device name of the VirtIO device
>>
>> Is this a qdev ID?  A network client name?
>
> At this moment it is the virtio device name, the one specified in the
> call to "virtio_init". But this should change; maybe the qdev id or
> something that can be provided on the command line fits better here.
To refer to a device backend, we commonly use a backend-specific ID.
For network backends, that's NetClientState member name.
To refer to a device frontend, we commonly use a qdev ID or a QOM path.
[...]
^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [RFC PATCH v4 15/20] vhost: Shadow virtqueue buffers forwarding
  2021-10-12 13:28     ` Eugenio Perez Martin
@ 2021-10-12 13:48       ` Markus Armbruster
  2021-10-14 15:04         ` Eugenio Perez Martin
  0 siblings, 1 reply; 90+ messages in thread
From: Markus Armbruster @ 2021-10-12 13:48 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	Markus Armbruster, qemu-level, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, virtualization, Eric Blake,
	Michael Lilja, Stefano Garzarella
Eugenio Perez Martin <eperezma@redhat.com> writes:
> On Tue, Oct 12, 2021 at 7:21 AM Markus Armbruster <armbru@redhat.com> wrote:
>>
>> Eugenio Pérez <eperezma@redhat.com> writes:
>>
>> > Initial version of shadow virtqueue that actually forwards buffers. There
>> > is no IOMMU support at the moment; that will be addressed in future
>> > patches of this series. Since all vhost-vdpa devices use forced IOMMU,
>> > this means that SVQ is not usable at this point of the series on any
>> > device.
>> >
>> > For simplicity it only supports modern devices, which expect the vring
>> > in little endian, with split ring and no event idx or indirect
>> > descriptors. Support for them will not be added in this series.
>> >
>> > It reuses the VirtQueue code for the device part. The driver part is
>> > based on Linux's virtio_ring driver, but with stripped functionality
>> > and optimizations so it's easier to review. Later commits add simpler
>> > ones.
>> >
>> > SVQ uses VIRTIO_CONFIG_S_DEVICE_STOPPED to pause the device and
>> > retrieve its status (next available idx the device was going to
>> > consume) race-free. It can later reset the device to replace vring
>> > addresses etc. When SVQ starts qemu can resume consuming the guest's
>> > driver ring from that state, without notice from the latter.
>> >
>> > This status bit VIRTIO_CONFIG_S_DEVICE_STOPPED is currently discussed
>> > in VirtIO, and is implemented in qemu VirtIO-net devices in previous
>> > commits.
>> >
>> > Removal of the _S_DEVICE_STOPPED bit (in other words, resuming the device)
>> > can be done in the future if a use case arises. At this moment we can
>> > just rely on resetting the full device.
>> >
>> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>> > ---
>> >  qapi/net.json                      |   2 +-
>> >  hw/virtio/vhost-shadow-virtqueue.c | 237 ++++++++++++++++++++++++++++-
>> >  hw/virtio/vhost-vdpa.c             | 109 ++++++++++++-
>> >  3 files changed, 337 insertions(+), 11 deletions(-)
>> >
>> > diff --git a/qapi/net.json b/qapi/net.json
>> > index fe546b0e7c..1f4a55f2c5 100644
>> > --- a/qapi/net.json
>> > +++ b/qapi/net.json
>> > @@ -86,7 +86,7 @@
>> >  #
>> >  # @name: the device name of the VirtIO device
>> >  #
>> > -# @enable: true to use the alternate shadow VQ notifications
>> > +# @enable: true to use the alternate shadow VQ buffers fowarding path
>>
>> Uh, why does the flag change meaning half-way through this series?
>>
>
> Before this patch, SVQ mode only adds an extra hop for notifications.
> Guest notifications are received by qemu via ioeventfd, and qemu
> forwards them to the device using a different eventfd. The reverse is
> also true: device notifications are received by qemu through the
> device call fd, and qemu then forwards them to the guest using a
> different irqfd.
>
> This intermediate step is not very useful by itself, but it helps to
> check that this part of the communication works, with no need for the
> shadow virtqueue to understand the vring format. Doing it that way
> also produces smaller patches.
>
> So it makes sense to me to document exactly what the QMP command does
> at each point of the series. However, I can directly document it as
> "use the alternate shadow VQ buffers forwarding path" from the beginning.
>
> Does this make sense, or would it be better to document the final
> intention of the command?
>
> Thanks!
Working your explanation into commit messages and possibly comments
should do.
^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [RFC PATCH v4 08/20] vhost: Route guest->host notification through shadow virtqueue
  2021-10-01  7:05 ` [RFC PATCH v4 08/20] vhost: Route guest->host notification through shadow virtqueue Eugenio Pérez
  2021-10-12  5:19   ` Markus Armbruster
@ 2021-10-13  3:27   ` Jason Wang
  2021-10-14 12:00     ` Eugenio Perez Martin
  1 sibling, 1 reply; 90+ messages in thread
From: Jason Wang @ 2021-10-13  3:27 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Juan Quintela, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
On 2021/10/1 at 3:05 PM, Eugenio Pérez wrote:
> Shadow virtqueue notifications forwarding is disabled when vhost_dev
> stops, so code flow follows usual cleanup.
>
> Also, host notifiers must be disabled at SVQ start,
Any reason for this?
> and they will not
> start if SVQ has been enabled when device is stopped. This is trivial
> to address, but it is left out for simplicity at this moment.
It looks to me like this patch also contains the following logic:
1) code to enable svq
2) code to let svq be enabled from QMP.
I think these need to be split out; we may end up with the following
series of patches:
1) svq skeleton with enable/disable
2) route host notifier to svq
3) route guest notifier to svq
4) code to enable svq
5) enable svq via QMP
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   qapi/net.json                      |   2 +-
>   hw/virtio/vhost-shadow-virtqueue.h |   8 ++
>   include/hw/virtio/vhost-vdpa.h     |   4 +
>   hw/virtio/vhost-shadow-virtqueue.c | 138 ++++++++++++++++++++++++++++-
>   hw/virtio/vhost-vdpa.c             | 116 +++++++++++++++++++++++-
>   5 files changed, 264 insertions(+), 4 deletions(-)
>
> diff --git a/qapi/net.json b/qapi/net.json
> index a2c30fd455..fe546b0e7c 100644
> --- a/qapi/net.json
> +++ b/qapi/net.json
> @@ -88,7 +88,7 @@
>   #
>   # @enable: true to use the alternate shadow VQ notifications
>   #
> -# Returns: Always error, since SVQ is not implemented at the moment.
> +# Returns: Error if failure, or 'no error' for success.
>   #
>   # Since: 6.2
>   #
> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> index 27ac6388fa..237cfceb9c 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.h
> +++ b/hw/virtio/vhost-shadow-virtqueue.h
> @@ -14,6 +14,14 @@
>   
>   typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
>   
> +EventNotifier *vhost_svq_get_svq_call_notifier(VhostShadowVirtqueue *svq);
Let's move this function to another patch since it's unrelated to the 
guest->host routing.
> +void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd);
> +
> +bool vhost_svq_start(struct vhost_dev *dev, unsigned idx,
> +                     VhostShadowVirtqueue *svq);
> +void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> +                    VhostShadowVirtqueue *svq);
> +
>   VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx);
>   
>   void vhost_svq_free(VhostShadowVirtqueue *vq);
> diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
> index 0d565bb5bd..48aae59d8e 100644
> --- a/include/hw/virtio/vhost-vdpa.h
> +++ b/include/hw/virtio/vhost-vdpa.h
> @@ -12,6 +12,8 @@
>   #ifndef HW_VIRTIO_VHOST_VDPA_H
>   #define HW_VIRTIO_VHOST_VDPA_H
>   
> +#include <gmodule.h>
> +
>   #include "qemu/queue.h"
>   #include "hw/virtio/virtio.h"
>   
> @@ -24,6 +26,8 @@ typedef struct vhost_vdpa {
>       int device_fd;
>       uint32_t msg_type;
>       MemoryListener listener;
> +    bool shadow_vqs_enabled;
> +    GPtrArray *shadow_vqs;
>       struct vhost_dev *dev;
>       QLIST_ENTRY(vhost_vdpa) entry;
>       VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index c4826a1b56..21dc99ab5d 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -9,9 +9,12 @@
>   
>   #include "qemu/osdep.h"
>   #include "hw/virtio/vhost-shadow-virtqueue.h"
> +#include "hw/virtio/vhost.h"
> +
> +#include "standard-headers/linux/vhost_types.h"
>   
>   #include "qemu/error-report.h"
> -#include "qemu/event_notifier.h"
> +#include "qemu/main-loop.h"
>   
>   /* Shadow virtqueue to relay notifications */
>   typedef struct VhostShadowVirtqueue {
> @@ -19,14 +22,146 @@ typedef struct VhostShadowVirtqueue {
>       EventNotifier kick_notifier;
>       /* Shadow call notifier, sent to vhost */
>       EventNotifier call_notifier;
> +
> +    /*
> +     * Borrowed virtqueue's guest to host notifier.
> +     * To borrow it in this event notifier allows to register on the event
> +     * loop and access the associated shadow virtqueue easily. If we use the
> +     * VirtQueue, we don't have an easy way to retrieve it.
> +     *
> +     * So shadow virtqueue must not clean it, or we would lose VirtQueue one.
> +     */
> +    EventNotifier host_notifier;
> +
> +    /* Guest's call notifier, where SVQ calls guest. */
> +    EventNotifier guest_call_notifier;
To be consistent, let's simply use "guest_notifier" here.
> +
> +    /* Virtio queue shadowing */
> +    VirtQueue *vq;
>   } VhostShadowVirtqueue;
>   
> +/* Forward guest notifications */
> +static void vhost_handle_guest_kick(EventNotifier *n)
> +{
> +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> +                                             host_notifier);
> +
> +    if (unlikely(!event_notifier_test_and_clear(n))) {
> +        return;
> +    }
Is there a chance that the processing of available buffers stalls while
svq is being enabled? There might be no further kick from the guest in
that case.
> +
> +    event_notifier_set(&svq->kick_notifier);
> +}
> +
> +/*
> + * Obtain the SVQ call notifier, where vhost device notifies SVQ that there
> + * exists pending used buffers.
> + *
> + * @svq Shadow Virtqueue
> + */
> +EventNotifier *vhost_svq_get_svq_call_notifier(VhostShadowVirtqueue *svq)
> +{
> +    return &svq->call_notifier;
> +}
> +
> +/*
> + * Set the call notifier for the SVQ to call the guest
> + *
> + * @svq Shadow virtqueue
> + * @call_fd call notifier
> + *
> + * Called on BQL context.
> + */
> +void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd)
> +{
> +    event_notifier_init_fd(&svq->guest_call_notifier, call_fd);
> +}
> +
> +/*
> + * Restore the vhost guest to host notifier, i.e., disables svq effect.
> + */
> +static int vhost_svq_restore_vdev_host_notifier(struct vhost_dev *dev,
> +                                                unsigned vhost_index,
> +                                                VhostShadowVirtqueue *svq)
> +{
> +    EventNotifier *vq_host_notifier = virtio_queue_get_host_notifier(svq->vq);
> +    struct vhost_vring_file file = {
> +        .index = vhost_index,
> +        .fd = event_notifier_get_fd(vq_host_notifier),
> +    };
> +    int r;
> +
> +    /* Restore vhost kick */
> +    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
And remap the notification area if necessary.
> +    return r ? -errno : 0;
> +}
> +
> +/*
> + * Start shadow virtqueue operation.
> + * @dev vhost device
> + * @hidx vhost virtqueue index
> + * @svq Shadow Virtqueue
> + */
> +bool vhost_svq_start(struct vhost_dev *dev, unsigned idx,
> +                     VhostShadowVirtqueue *svq)
> +{
> +    EventNotifier *vq_host_notifier = virtio_queue_get_host_notifier(svq->vq);
> +    struct vhost_vring_file file = {
> +        .index = dev->vhost_ops->vhost_get_vq_index(dev, dev->vq_index + idx),
> +        .fd = event_notifier_get_fd(&svq->kick_notifier),
> +    };
> +    int r;
> +
> +    /* Check that notifications are still going directly to vhost dev */
> +    assert(virtio_queue_is_host_notifier_enabled(svq->vq));
> +
> +    /*
> +     * event_notifier_set_handler already checks for guest's notifications if
> +     * they arrive in the switch, so there is no need to explicitely check for
> +     * them.
> +     */
If this is true, shouldn't we call vhost_set_vring_kick() before the 
event_notifier_set_handler()?
Btw, I think we should update the fd if set_vring_kick() was called 
after this function?
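To illustrate the ordering concern with plain eventfd semantics (a
sketch only, not qemu's EventNotifier code; the notifier_* helpers are
hypothetical names): an event written before anyone polls the fd stays
pending until it is read, so it is still visible to whoever starts
polling later. Whether the switch is race-free then depends on which
consumer reads the fd at each point in time.

```c
#include <assert.h>
#include <poll.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Raise an event on the eventfd; returns 0 on success, -1 on error. */
static int notifier_raise(int fd)
{
    uint64_t one = 1;

    assert(fd >= 0);
    return write(fd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Check for a pending event without consuming it.
 * Returns 1 if pending, 0 if not, -1 on error.
 */
static int notifier_is_pending(int fd)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int r;

    assert(fd >= 0);
    r = poll(&pfd, 1, 0); /* zero timeout: do not block */
    if (r < 0) {
        return -1;
    }
    return (r > 0 && (pfd.revents & POLLIN)) ? 1 : 0;
}

/*
 * Consume the pending event (assumes a non-blocking eventfd).
 * Returns 1 if there was one, 0 otherwise.
 */
static int notifier_test_and_clear(int fd)
{
    uint64_t count;

    assert(fd >= 0);
    return read(fd, &count, sizeof(count)) == (ssize_t)sizeof(count) ? 1 : 0;
}
```

With these helpers, a kick raised before a handler is installed is
still reported by notifier_is_pending() afterwards, and only goes away
once some consumer calls notifier_test_and_clear().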
> +    event_notifier_init_fd(&svq->host_notifier,
> +                           event_notifier_get_fd(vq_host_notifier));
> +    event_notifier_set_handler(&svq->host_notifier, vhost_handle_guest_kick);
> +
> +    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
And we need to stop the notification area mmap.
> +    if (unlikely(r != 0)) {
> +        error_report("Couldn't set kick fd: %s", strerror(errno));
> +        goto err_set_vring_kick;
> +    }
> +
> +    return true;
> +
> +err_set_vring_kick:
> +    event_notifier_set_handler(&svq->host_notifier, NULL);
> +
> +    return false;
> +}
> +
> +/*
> + * Stop shadow virtqueue operation.
> + * @dev vhost device
> + * @idx vhost queue index
> + * @svq Shadow Virtqueue
> + */
> +void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> +                    VhostShadowVirtqueue *svq)
> +{
> +    int r = vhost_svq_restore_vdev_host_notifier(dev, idx, svq);
> +    if (unlikely(r < 0)) {
> +        error_report("Couldn't restore vq kick fd: %s", strerror(-r));
> +    }
> +
> +    event_notifier_set_handler(&svq->host_notifier, NULL);
> +}
> +
>   /*
>    * Creates vhost shadow virtqueue, and instruct vhost device to use the shadow
>    * methods and file descriptors.
>    */
>   VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
>   {
> +    int vq_idx = dev->vq_index + idx;
>       g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
>       int r;
>   
> @@ -44,6 +179,7 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
>           goto err_init_call_notifier;
>       }
>   
> +    svq->vq = virtio_get_queue(dev->vdev, vq_idx);
>       return g_steal_pointer(&svq);
>   
>   err_init_call_notifier:
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index e0dc7508c3..36c954a779 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -17,6 +17,7 @@
>   #include "hw/virtio/vhost.h"
>   #include "hw/virtio/vhost-backend.h"
>   #include "hw/virtio/virtio-net.h"
> +#include "hw/virtio/vhost-shadow-virtqueue.h"
>   #include "hw/virtio/vhost-vdpa.h"
>   #include "exec/address-spaces.h"
>   #include "qemu/main-loop.h"
> @@ -272,6 +273,16 @@ static void vhost_vdpa_add_status(struct vhost_dev *dev, uint8_t status)
>       vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &s);
>   }
>   
> +/**
> + * Adaptor function to free shadow virtqueue through gpointer
> + *
> + * @svq   The Shadow Virtqueue
> + */
> +static void vhost_psvq_free(gpointer svq)
> +{
> +    vhost_svq_free(svq);
> +}
> +
>   static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
>   {
>       struct vhost_vdpa *v;
> @@ -283,6 +294,7 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
>       dev->opaque =  opaque ;
>       v->listener = vhost_vdpa_memory_listener;
>       v->msg_type = VHOST_IOTLB_MSG_V2;
> +    v->shadow_vqs = g_ptr_array_new_full(dev->nvqs, vhost_psvq_free);
>       QLIST_INSERT_HEAD(&vhost_vdpa_devices, v, entry);
>   
>       vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
> @@ -373,6 +385,17 @@ err:
>       return;
>   }
>   
> +static void vhost_vdpa_svq_cleanup(struct vhost_dev *dev)
> +{
> +    struct vhost_vdpa *v = dev->opaque;
> +    size_t idx;
> +
> +    for (idx = 0; idx < v->shadow_vqs->len; ++idx) {
> +        vhost_svq_stop(dev, idx, g_ptr_array_index(v->shadow_vqs, idx));
> +    }
> +    g_ptr_array_free(v->shadow_vqs, true);
> +}
> +
>   static int vhost_vdpa_cleanup(struct vhost_dev *dev)
>   {
>       struct vhost_vdpa *v;
> @@ -381,6 +404,7 @@ static int vhost_vdpa_cleanup(struct vhost_dev *dev)
>       trace_vhost_vdpa_cleanup(dev, v);
>       vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs);
>       memory_listener_unregister(&v->listener);
> +    vhost_vdpa_svq_cleanup(dev);
>       QLIST_REMOVE(v, entry);
>   
>       dev->opaque = NULL;
> @@ -557,7 +581,9 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
>       if (started) {
>           uint8_t status = 0;
>           memory_listener_register(&v->listener, &address_space_memory);
> -        vhost_vdpa_host_notifiers_init(dev);
> +        if (!v->shadow_vqs_enabled) {
> +            vhost_vdpa_host_notifiers_init(dev);
> +        }
This looks like a trick. Why not check and set up shadow_vqs inside:
1) vhost_vdpa_host_notifiers_init()
and
2) vhost_vdpa_set_vring_kick()
>           vhost_vdpa_set_vring_ready(dev);
>           vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
>           vhost_vdpa_call(dev, VHOST_VDPA_GET_STATUS, &status);
> @@ -663,10 +689,96 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
>       return true;
>   }
>   
> +/*
> + * Start shadow virtqueue.
> + */
> +static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
> +{
> +    struct vhost_vdpa *v = dev->opaque;
> +    VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, idx);
> +    return vhost_svq_start(dev, idx, svq);
> +}
> +
> +static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> +{
> +    struct vhost_dev *hdev = v->dev;
> +    unsigned n;
> +
> +    if (enable == v->shadow_vqs_enabled) {
> +        return hdev->nvqs;
> +    }
> +
> +    if (enable) {
> +        /* Allocate resources */
> +        assert(v->shadow_vqs->len == 0);
> +        for (n = 0; n < hdev->nvqs; ++n) {
> +            VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n);
> +            bool ok;
> +
> +            if (unlikely(!svq)) {
> +                g_ptr_array_set_size(v->shadow_vqs, 0);
> +                return 0;
> +            }
> +            g_ptr_array_add(v->shadow_vqs, svq);
> +
> +            ok = vhost_vdpa_svq_start_vq(hdev, n);
> +            if (unlikely(!ok)) {
> +                /* Free still not started svqs */
> +                g_ptr_array_set_size(v->shadow_vqs, n);
> +                enable = false;
> +                break;
> +            }
> +        }
Since almost no logic can be shared between enable and disable, let's
split them out into dedicated functions, where the code is easier to
review (e.g. with better error handling).
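A rough sketch of that split, using hypothetical stand-in types
(struct demo_dev and the demo_svq_* helpers are made up for
illustration and are not qemu code). The enable path unwinds the queues
it already started when one of them fails, and the disable path is a
separate, simpler function:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_MAX_VQS 8

/* Hypothetical stand-in for the per-device SVQ state. */
struct demo_dev {
    size_t nvqs;                /* number of virtqueues */
    size_t fail_at;             /* queue whose start fails; >= nvqs: none */
    bool started[DEMO_MAX_VQS]; /* per-queue "SVQ running" flag */
    bool svq_enabled;
};

/* Start one shadow virtqueue; fails for the queue selected by fail_at. */
static bool demo_svq_start(struct demo_dev *d, size_t idx)
{
    if (idx == d->fail_at) {
        return false;
    }
    d->started[idx] = true;
    return true;
}

/* Stop queues [0, n) in reverse order of their start. */
static void demo_svq_stop_n(struct demo_dev *d, size_t n)
{
    while (n-- > 0) {
        d->started[n] = false;
    }
}

/* Enable path: either every queue is started or none is left running. */
static bool demo_svq_enable(struct demo_dev *d)
{
    assert(d->nvqs <= DEMO_MAX_VQS);

    if (d->svq_enabled) {
        return true;
    }
    for (size_t n = 0; n < d->nvqs; ++n) {
        if (!demo_svq_start(d, n)) {
            /* Roll back only the queues that were already started */
            demo_svq_stop_n(d, n);
            return false;
        }
    }
    d->svq_enabled = true;
    return true;
}

/* Disable path: no error handling shared with the enable path. */
static void demo_svq_disable(struct demo_dev *d)
{
    if (!d->svq_enabled) {
        return;
    }
    demo_svq_stop_n(d, d->nvqs);
    d->svq_enabled = false;
}
```

Keeping the rollback inside the enable function means the caller only
sees a boolean result and never has to clean up a half-enabled device.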
> +    }
> +
> +    v->shadow_vqs_enabled = enable;
> +
> +    if (!enable) {
> +        /* Disable all queues or clean up failed start */
> +        for (n = 0; n < v->shadow_vqs->len; ++n) {
> +            unsigned vq_idx = vhost_vdpa_get_vq_index(hdev, n);
> +            VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, n);
> +            vhost_svq_stop(hdev, n, svq);
> +            vhost_virtqueue_start(hdev, hdev->vdev, &hdev->vqs[n], vq_idx);
> +        }
> +
> +        /* Resources cleanup */
> +        g_ptr_array_set_size(v->shadow_vqs, 0);
> +    }
> +
> +    return n;
> +}
>   
>   void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
>   {
> -    error_setg(errp, "Shadow virtqueue still not implemented");
> +    struct vhost_vdpa *v;
> +    const char *err_cause = NULL;
> +    bool r;
> +
> +    QLIST_FOREACH(v, &vhost_vdpa_devices, entry) {
> +        if (v->dev->vdev && 0 == strcmp(v->dev->vdev->name, name)) {
> +            break;
> +        }
> +    }
I think you can iterate the NetClientStates to get the vhost-vdpa backends.
> +
> +    if (!v) {
> +        err_cause = "Device not found";
> +        goto err;
> +    } else if (v->notifier[0].addr) {
> +        err_cause = "Device has host notifiers enabled";
I don't get this.
Btw this function should be implemented in an independent patch after 
svq is fully functional.
Thanks
> +        goto err;
> +    }
> +
> +    r = vhost_vdpa_enable_svq(v, enable);
> +    if (unlikely(!r)) {
> +        err_cause = "Error enabling (see monitor)";
> +        goto err;
> +    }
> +
> +err:
> +    if (err_cause) {
> +        error_setg(errp, "Can't enable shadow vq on %s: %s", name, err_cause);
> +    }
>   }
>   
>   const VhostOps vdpa_ops = {
^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [RFC PATCH v4 09/20] vdpa: Save call_fd in vhost-vdpa
  2021-10-01  7:05 ` [RFC PATCH v4 09/20] vdpa: Save call_fd in vhost-vdpa Eugenio Pérez
@ 2021-10-13  3:43   ` Jason Wang
  2021-10-14 12:11     ` Eugenio Perez Martin
  0 siblings, 1 reply; 90+ messages in thread
From: Jason Wang @ 2021-10-13  3:43 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Juan Quintela, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
On 2021/10/1 at 3:05 PM, Eugenio Pérez wrote:
> We need to know it to switch to Shadow VirtQueue.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   include/hw/virtio/vhost-vdpa.h | 2 ++
>   hw/virtio/vhost-vdpa.c         | 5 +++++
>   2 files changed, 7 insertions(+)
>
> diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
> index 48aae59d8e..fddac248b3 100644
> --- a/include/hw/virtio/vhost-vdpa.h
> +++ b/include/hw/virtio/vhost-vdpa.h
> @@ -30,6 +30,8 @@ typedef struct vhost_vdpa {
>       GPtrArray *shadow_vqs;
>       struct vhost_dev *dev;
>       QLIST_ENTRY(vhost_vdpa) entry;
> +    /* File descriptor the device uses to call VM/SVQ */
> +    int call_fd[VIRTIO_QUEUE_MAX];
Is there a reason we don't do this for kick_fd, or why
virtio_queue_get_guest_notifier() can't work here? This needs a comment
or a note in the commit log.
I think we need to have a consistent way to handle both kick and call fd.
Thanks
>       VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
>   } VhostVDPA;
>   
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 36c954a779..57a857444a 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -652,7 +652,12 @@ static int vhost_vdpa_set_vring_kick(struct vhost_dev *dev,
>   static int vhost_vdpa_set_vring_call(struct vhost_dev *dev,
>                                          struct vhost_vring_file *file)
>   {
> +    struct vhost_vdpa *v = dev->opaque;
> +    int vdpa_idx = vhost_vdpa_get_vq_index(dev, file->index);
> +
>       trace_vhost_vdpa_set_vring_call(dev, file->index, file->fd);
> +
> +    v->call_fd[vdpa_idx] = file->fd;
>       return vhost_vdpa_call(dev, VHOST_SET_VRING_CALL, file);
>   }
>   
^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [RFC PATCH v4 10/20] vhost-vdpa: Take into account SVQ in vhost_vdpa_set_vring_call
  2021-10-01  7:05 ` [RFC PATCH v4 10/20] vhost-vdpa: Take into account SVQ in vhost_vdpa_set_vring_call Eugenio Pérez
@ 2021-10-13  3:43   ` Jason Wang
  2021-10-14 12:18     ` Eugenio Perez Martin
  0 siblings, 1 reply; 90+ messages in thread
From: Jason Wang @ 2021-10-13  3:43 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Juan Quintela, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
On 2021/10/1 at 3:05 PM, Eugenio Pérez wrote:
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   hw/virtio/vhost-vdpa.c | 17 ++++++++++++++---
>   1 file changed, 14 insertions(+), 3 deletions(-)
>
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 57a857444a..bc34de2439 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -649,16 +649,27 @@ static int vhost_vdpa_set_vring_kick(struct vhost_dev *dev,
>       return vhost_vdpa_call(dev, VHOST_SET_VRING_KICK, file);
>   }
>   
> +static int vhost_vdpa_set_vring_dev_call(struct vhost_dev *dev,
> +                                         struct vhost_vring_file *file)
> +{
> +    trace_vhost_vdpa_set_vring_call(dev, file->index, file->fd);
> +    return vhost_vdpa_call(dev, VHOST_SET_VRING_CALL, file);
> +}
> +
>   static int vhost_vdpa_set_vring_call(struct vhost_dev *dev,
>                                          struct vhost_vring_file *file)
>   {
>       struct vhost_vdpa *v = dev->opaque;
>       int vdpa_idx = vhost_vdpa_get_vq_index(dev, file->index);
>   
> -    trace_vhost_vdpa_set_vring_call(dev, file->index, file->fd);
> -
>       v->call_fd[vdpa_idx] = file->fd;
> -    return vhost_vdpa_call(dev, VHOST_SET_VRING_CALL, file);
> +    if (v->shadow_vqs_enabled) {
> +        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, vdpa_idx);
> +        vhost_svq_set_guest_call_notifier(svq, file->fd);
> +        return 0;
> +    } else {
> +        return vhost_vdpa_set_vring_dev_call(dev, file);
> +    }
I feel like we should do the same for kick fd.
Thanks
>   }
>   
>   static int vhost_vdpa_get_features(struct vhost_dev *dev,
^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [RFC PATCH v4 11/20] vhost: Route host->guest notification through shadow virtqueue
  2021-10-01  7:05 ` [RFC PATCH v4 11/20] vhost: Route host->guest notification through shadow virtqueue Eugenio Pérez
@ 2021-10-13  3:47   ` Jason Wang
  2021-10-14 16:39     ` Eugenio Perez Martin
  2021-10-13  3:49   ` Jason Wang
  1 sibling, 1 reply; 90+ messages in thread
From: Jason Wang @ 2021-10-13  3:47 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Juan Quintela, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
On 2021/10/1 at 3:05 PM, Eugenio Pérez wrote:
> This will make qemu aware of the device used buffers, allowing it to
> write the guest memory with its contents if needed.
>
> Since the use of vhost_virtqueue_start can unmask and discard call
> events, vhost_virtqueue_start should be modified in one of these ways:
> * Split in two: One of them uses all logic to start a queue with no
>    side effects for the guest, and another one that actually assumes that
>    the guest has just started the device. Vdpa should use just the
>    former.
> * Actually store and check if the guest notifier is masked, and do it
>    conditionally.
> * Leave it as it is, and duplicate all the logic in vhost-vdpa.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   hw/virtio/vhost-shadow-virtqueue.c | 19 +++++++++++++++
>   hw/virtio/vhost-vdpa.c             | 38 +++++++++++++++++++++++++++++-
>   2 files changed, 56 insertions(+), 1 deletion(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index 21dc99ab5d..3fe129cf63 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -53,6 +53,22 @@ static void vhost_handle_guest_kick(EventNotifier *n)
>       event_notifier_set(&svq->kick_notifier);
>   }
>   
> +/* Forward vhost notifications */
> +static void vhost_svq_handle_call_no_test(EventNotifier *n)
> +{
> +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> +                                             call_notifier);
> +
> +    event_notifier_set(&svq->guest_call_notifier);
> +}
> +
> +static void vhost_svq_handle_call(EventNotifier *n)
> +{
> +    if (likely(event_notifier_test_and_clear(n))) {
> +        vhost_svq_handle_call_no_test(n);
> +    }
> +}
> +
>   /*
>    * Obtain the SVQ call notifier, where vhost device notifies SVQ that there
>    * exists pending used buffers.
> @@ -180,6 +196,8 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
>       }
>   
>       svq->vq = virtio_get_queue(dev->vdev, vq_idx);
> +    event_notifier_set_handler(&svq->call_notifier,
> +                               vhost_svq_handle_call);
>       return g_steal_pointer(&svq);
>   
>   err_init_call_notifier:
> @@ -195,6 +213,7 @@ err_init_kick_notifier:
>   void vhost_svq_free(VhostShadowVirtqueue *vq)
>   {
>       event_notifier_cleanup(&vq->kick_notifier);
> +    event_notifier_set_handler(&vq->call_notifier, NULL);
>       event_notifier_cleanup(&vq->call_notifier);
>       g_free(vq);
>   }
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index bc34de2439..6c5f4c98b8 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -712,13 +712,40 @@ static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
>   {
>       struct vhost_vdpa *v = dev->opaque;
>       VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, idx);
> -    return vhost_svq_start(dev, idx, svq);
> +    EventNotifier *vhost_call_notifier = vhost_svq_get_svq_call_notifier(svq);
> +    struct vhost_vring_file vhost_call_file = {
> +        .index = idx + dev->vq_index,
> +        .fd = event_notifier_get_fd(vhost_call_notifier),
> +    };
> +    int r;
> +    bool b;
> +
> +    /* Set shadow vq -> guest notifier */
> +    assert(v->call_fd[idx]);
We need to avoid the assert() here. In which case can we hit this?
> +    vhost_svq_set_guest_call_notifier(svq, v->call_fd[idx]);
> +
> +    b = vhost_svq_start(dev, idx, svq);
> +    if (unlikely(!b)) {
> +        return false;
> +    }
> +
> +    /* Set device -> SVQ notifier */
> +    r = vhost_vdpa_set_vring_dev_call(dev, &vhost_call_file);
> +    if (unlikely(r)) {
> +        error_report("vhost_vdpa_set_vring_call for shadow vq failed");
> +        return false;
> +    }
Similar to kick, do we need to set_vring_call() before vhost_svq_start()?
> +
> +    /* Check for pending calls */
> +    event_notifier_set(vhost_call_notifier);
Interesting, can this result in a spurious interrupt?
> +    return true;
>   }
>   
>   static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
>   {
>       struct vhost_dev *hdev = v->dev;
>       unsigned n;
> +    int r;
>   
>       if (enable == v->shadow_vqs_enabled) {
>           return hdev->nvqs;
> @@ -752,9 +779,18 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
>       if (!enable) {
>           /* Disable all queues or clean up failed start */
>           for (n = 0; n < v->shadow_vqs->len; ++n) {
> +            struct vhost_vring_file file = {
> +                .index = vhost_vdpa_get_vq_index(hdev, n),
> +                .fd = v->call_fd[n],
> +            };
> +
> +            r = vhost_vdpa_set_vring_call(hdev, &file);
> +            assert(r == 0);
> +
>               unsigned vq_idx = vhost_vdpa_get_vq_index(hdev, n);
>               VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, n);
>               vhost_svq_stop(hdev, n, svq);
> +            /* TODO: This can unmask or override call fd! */
I don't get this comment. Does this mean the current code can't work 
with mask_notifiers? If yes, this is something we need to fix.
Thanks
>               vhost_virtqueue_start(hdev, hdev->vdev, &hdev->vqs[n], vq_idx);
>           }
>   
* Re: [RFC PATCH v4 11/20] vhost: Route host->guest notification through shadow virtqueue
  2021-10-01  7:05 ` [RFC PATCH v4 11/20] vhost: Route host->guest notification through shadow virtqueue Eugenio Pérez
  2021-10-13  3:47   ` Jason Wang
@ 2021-10-13  3:49   ` Jason Wang
  2021-10-14 15:58     ` Eugenio Perez Martin
  1 sibling, 1 reply; 90+ messages in thread
From: Jason Wang @ 2021-10-13  3:49 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Juan Quintela, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
On 2021/10/1 3:05 PM, Eugenio Pérez wrote:
> This will make qemu aware of the device used buffers, allowing it to
> write the guest memory with its contents if needed.
>
> Since the use of vhost_virtqueue_start can unmask and discard call
> events, vhost_virtqueue_start should be modified in one of these ways:
> * Split in two: One of them uses all logic to start a queue with no
>    side effects for the guest, and another one that actually assumes that
>    the guest has just started the device. Vdpa should use just the
>    former.
> * Actually store and check if the guest notifier is masked, and do it
>    conditionally.
> * Leave it as it is, and duplicate all the logic in vhost-vdpa.
Btw, the commit log is not clear. I guess this patch goes for method 3. If
yes, we need to say so and explain why.
Thanks
>
> Signed-off-by: Eugenio Pérez<eperezma@redhat.com>
* Re: [RFC PATCH v4 12/20] virtio: Add vhost_shadow_vq_get_vring_addr
  2021-10-01  7:05 ` [RFC PATCH v4 12/20] virtio: Add vhost_shadow_vq_get_vring_addr Eugenio Pérez
@ 2021-10-13  3:54   ` Jason Wang
  2021-10-14 14:39     ` Eugenio Perez Martin
  0 siblings, 1 reply; 90+ messages in thread
From: Jason Wang @ 2021-10-13  3:54 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Juan Quintela, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
On 2021/10/1 3:05 PM, Eugenio Pérez wrote:
> It reports the shadow virtqueue address from qemu virtual address space
I think both the title and the commit log need more tweaks. Looking at
the code, what it does is actually introduce a vring into svq.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   hw/virtio/vhost-shadow-virtqueue.h |  4 +++
>   hw/virtio/vhost-shadow-virtqueue.c | 50 ++++++++++++++++++++++++++++++
>   2 files changed, 54 insertions(+)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> index 237cfceb9c..2df3d117f5 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.h
> +++ b/hw/virtio/vhost-shadow-virtqueue.h
> @@ -16,6 +16,10 @@ typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
>   
>   EventNotifier *vhost_svq_get_svq_call_notifier(VhostShadowVirtqueue *svq);
>   void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd);
> +void vhost_svq_get_vring_addr(const VhostShadowVirtqueue *svq,
> +                              struct vhost_vring_addr *addr);
> +size_t vhost_svq_driver_area_size(const VhostShadowVirtqueue *svq);
> +size_t vhost_svq_device_area_size(const VhostShadowVirtqueue *svq);
>   
>   bool vhost_svq_start(struct vhost_dev *dev, unsigned idx,
>                        VhostShadowVirtqueue *svq);
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index 3fe129cf63..5c1899f6af 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -18,6 +18,9 @@
>   
>   /* Shadow virtqueue to relay notifications */
>   typedef struct VhostShadowVirtqueue {
> +    /* Shadow vring */
> +    struct vring vring;
> +
>       /* Shadow kick notifier, sent to vhost */
>       EventNotifier kick_notifier;
>       /* Shadow call notifier, sent to vhost */
> @@ -38,6 +41,9 @@ typedef struct VhostShadowVirtqueue {
>   
>       /* Virtio queue shadowing */
>       VirtQueue *vq;
> +
> +    /* Virtio device */
> +    VirtIODevice *vdev;
>   } VhostShadowVirtqueue;
>   
>   /* Forward guest notifications */
> @@ -93,6 +99,35 @@ void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd)
>       event_notifier_init_fd(&svq->guest_call_notifier, call_fd);
>   }
>   
> +/*
> + * Get the shadow vq vring address.
> + * @svq Shadow virtqueue
> + * @addr Destination to store address
> + */
> +void vhost_svq_get_vring_addr(const VhostShadowVirtqueue *svq,
> +                              struct vhost_vring_addr *addr)
> +{
> +    addr->desc_user_addr = (uint64_t)svq->vring.desc;
> +    addr->avail_user_addr = (uint64_t)svq->vring.avail;
> +    addr->used_user_addr = (uint64_t)svq->vring.used;
> +}
> +
> +size_t vhost_svq_driver_area_size(const VhostShadowVirtqueue *svq)
> +{
> +    uint16_t vq_idx = virtio_get_queue_index(svq->vq);
> +    size_t desc_size = virtio_queue_get_desc_size(svq->vdev, vq_idx);
> +    size_t avail_size = virtio_queue_get_avail_size(svq->vdev, vq_idx);
> +
> +    return ROUND_UP(desc_size + avail_size, qemu_real_host_page_size);
Is this round up required by the spec?
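For reference, a minimal sketch of the split-ring sizes involved. The per-area formulas (16 bytes per descriptor, 6 + 2 * queue size for the available ring, 6 + 8 * queue size for the used ring) are the ones the VirtIO 1.x spec gives for split virtqueues, which only asks for 16, 2 and 4 byte alignment respectively. The page rounding is modelled here as a separate, optional step; the helper names and the 4096-byte page in the usage note are assumptions for illustration, not QEMU code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Round n up to a multiple of align; align must be a power of two. */
static size_t round_up_pow2(size_t n, size_t align)
{
    return (n + align - 1) & ~(align - 1);
}

/* Split virtqueue area sizes as listed in the VirtIO 1.x spec. */
static size_t split_desc_size(uint16_t qs)  { return 16u * (size_t)qs; }
static size_t split_avail_size(uint16_t qs) { return 6u + 2u * (size_t)qs; }
static size_t split_used_size(uint16_t qs)  { return 6u + 8u * (size_t)qs; }

/* Driver area: descriptor table followed directly by the available ring. */
static size_t driver_area_size(uint16_t qs, size_t page)
{
    return round_up_pow2(split_desc_size(qs) + split_avail_size(qs), page);
}

/* Device area: the used ring only. */
static size_t device_area_size(uint16_t qs, size_t page)
{
    return round_up_pow2(split_used_size(qs), page);
}
```

With a queue size of 256 and a 4096-byte page this gives 8192 bytes for the driver area (4096 + 518, rounded up) and 4096 bytes for the device area (2054, rounded up). Placing the available ring right after the descriptor table keeps its 2-byte alignment, since the table size is always a multiple of 16.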
> +}
> +
> +size_t vhost_svq_device_area_size(const VhostShadowVirtqueue *svq)
> +{
> +    uint16_t vq_idx = virtio_get_queue_index(svq->vq);
> +    size_t used_size = virtio_queue_get_used_size(svq->vdev, vq_idx);
> +    return ROUND_UP(used_size, qemu_real_host_page_size);
> +}
> +
>   /*
>    * Restore the vhost guest to host notifier, i.e., disables svq effect.
>    */
> @@ -178,6 +213,10 @@ void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
>   VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
>   {
>       int vq_idx = dev->vq_index + idx;
> +    unsigned num = virtio_queue_get_num(dev->vdev, vq_idx);
> +    size_t desc_size = virtio_queue_get_desc_size(dev->vdev, vq_idx);
> +    size_t driver_size;
> +    size_t device_size;
>       g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
>       int r;
>   
> @@ -196,6 +235,15 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
>       }
>   
>       svq->vq = virtio_get_queue(dev->vdev, vq_idx);
> +    svq->vdev = dev->vdev;
> +    driver_size = vhost_svq_driver_area_size(svq);
> +    device_size = vhost_svq_device_area_size(svq);
> +    svq->vring.num = num;
> +    svq->vring.desc = qemu_memalign(qemu_real_host_page_size, driver_size);
> +    svq->vring.avail = (void *)((char *)svq->vring.desc + desc_size);
> +    memset(svq->vring.desc, 0, driver_size);
Any reason for using a contiguous area for both desc and avail?
Thanks
> +    svq->vring.used = qemu_memalign(qemu_real_host_page_size, device_size);
> +    memset(svq->vring.used, 0, device_size);
>       event_notifier_set_handler(&svq->call_notifier,
>                                  vhost_svq_handle_call);
>       return g_steal_pointer(&svq);
> @@ -215,5 +263,7 @@ void vhost_svq_free(VhostShadowVirtqueue *vq)
>       event_notifier_cleanup(&vq->kick_notifier);
>       event_notifier_set_handler(&vq->call_notifier, NULL);
>       event_notifier_cleanup(&vq->call_notifier);
> +    qemu_vfree(vq->vring.desc);
> +    qemu_vfree(vq->vring.used);
>       g_free(vq);
>   }
* Re: [RFC PATCH v4 13/20] vdpa: Save host and guest features
  2021-10-01  7:05 ` [RFC PATCH v4 13/20] vdpa: Save host and guest features Eugenio Pérez
@ 2021-10-13  3:56   ` Jason Wang
  2021-10-14 15:03     ` Eugenio Perez Martin
  0 siblings, 1 reply; 90+ messages in thread
From: Jason Wang @ 2021-10-13  3:56 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Juan Quintela, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
On 2021/10/1 3:05 PM, Eugenio Pérez wrote:
> Those are needed for SVQ: Host ones are needed to check if SVQ knows
> how to talk with the device and for feature negotiation, and guest ones
> to know if SVQ can talk with it.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   include/hw/virtio/vhost-vdpa.h |  2 ++
>   hw/virtio/vhost-vdpa.c         | 31 ++++++++++++++++++++++++++++---
>   2 files changed, 30 insertions(+), 3 deletions(-)
>
> diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
> index fddac248b3..9044ae694b 100644
> --- a/include/hw/virtio/vhost-vdpa.h
> +++ b/include/hw/virtio/vhost-vdpa.h
> @@ -26,6 +26,8 @@ typedef struct vhost_vdpa {
>       int device_fd;
>       uint32_t msg_type;
>       MemoryListener listener;
> +    uint64_t host_features;
> +    uint64_t guest_features;
Any reason that we can't use the features stored in VirtioDevice?
Thanks
>       bool shadow_vqs_enabled;
>       GPtrArray *shadow_vqs;
>       struct vhost_dev *dev;
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 6c5f4c98b8..a057e8277d 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -439,10 +439,19 @@ static int vhost_vdpa_set_mem_table(struct vhost_dev *dev,
>       return 0;
>   }
>   
> -static int vhost_vdpa_set_features(struct vhost_dev *dev,
> -                                   uint64_t features)
> +/**
> + * Internal set_features() that follows vhost/VirtIO protocol for that
> + */
> +static int vhost_vdpa_backend_set_features(struct vhost_dev *dev,
> +                                           uint64_t features)
>   {
> +    struct vhost_vdpa *v = dev->opaque;
> +
>       int ret;
> +    if (v->host_features & BIT_ULL(VIRTIO_F_QUEUE_STATE)) {
> +        features |= BIT_ULL(VIRTIO_F_QUEUE_STATE);
> +    }
> +
>       trace_vhost_vdpa_set_features(dev, features);
>       ret = vhost_vdpa_call(dev, VHOST_SET_FEATURES, &features);
>       uint8_t status = 0;
> @@ -455,6 +464,17 @@ static int vhost_vdpa_set_features(struct vhost_dev *dev,
>       return !(status & VIRTIO_CONFIG_S_FEATURES_OK);
>   }
>   
> +/**
> + * Exposed vhost set features
> + */
> +static int vhost_vdpa_set_features(struct vhost_dev *dev,
> +                                   uint64_t features)
> +{
> +    struct vhost_vdpa *v = dev->opaque;
> +    v->guest_features = features;
> +    return vhost_vdpa_backend_set_features(dev, features);
> +}
> +
>   static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)
>   {
>       uint64_t features;
> @@ -673,12 +693,17 @@ static int vhost_vdpa_set_vring_call(struct vhost_dev *dev,
>   }
>   
>   static int vhost_vdpa_get_features(struct vhost_dev *dev,
> -                                     uint64_t *features)
> +                                   uint64_t *features)
>   {
>       int ret;
>   
>       ret = vhost_vdpa_call(dev, VHOST_GET_FEATURES, features);
>       trace_vhost_vdpa_get_features(dev, *features);
> +
> +    if (ret == 0) {
> +        struct vhost_vdpa *v = dev->opaque;
> +        v->host_features = *features;
> +    }
>       return ret;
>   }
>   
* Re: [RFC PATCH v4 15/20] vhost: Shadow virtqueue buffers forwarding
  2021-10-01  7:05 ` [RFC PATCH v4 15/20] vhost: Shadow virtqueue buffers forwarding Eugenio Pérez
  2021-10-12  5:21   ` Markus Armbruster
@ 2021-10-13  4:31   ` Jason Wang
  2021-10-14 17:56     ` Eugenio Perez Martin
  1 sibling, 1 reply; 90+ messages in thread
From: Jason Wang @ 2021-10-13  4:31 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Juan Quintela, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
On 2021/10/1 3:05 PM, Eugenio Pérez wrote:
> Initial version of shadow virtqueue that actually forwards buffers. There
> is no IOMMU support at the moment, and that will be addressed in future
> patches of this series. Since all vhost-vdpa devices use forced IOMMU,
> this means that SVQ is not usable at this point of the series on any
> device.
>
> For simplicity it only supports modern devices, that expect vring
> in little endian, with split ring and no event idx or indirect
> descriptors. Support for them will not be added in this series.
>
> It reuses the VirtQueue code for the device part. The driver part is
> based on Linux's virtio_ring driver, but with stripped functionality
> and optimizations so it's easier to review. Later commits add simpler
> ones.
>
> SVQ uses VIRTIO_CONFIG_S_DEVICE_STOPPED to pause the device and
> retrieve its status (next available idx the device was going to
> consume) race-free. It can later reset the device to replace vring
> addresses etc. When SVQ starts qemu can resume consuming the guest's
> driver ring from that state, without notice from the latter.
>
> This status bit VIRTIO_CONFIG_S_DEVICE_STOPPED is currently discussed
> in VirtIO, and is implemented in qemu VirtIO-net devices in previous
> commits.
>
> Removal of _S_DEVICE_STOPPED bit (in other words, resuming the device)
> can be done in the future if a use case arises. At this moment we can
> just rely on resetting the full device.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   qapi/net.json                      |   2 +-
>   hw/virtio/vhost-shadow-virtqueue.c | 237 ++++++++++++++++++++++++++++-
>   hw/virtio/vhost-vdpa.c             | 109 ++++++++++++-
>   3 files changed, 337 insertions(+), 11 deletions(-)
>
> diff --git a/qapi/net.json b/qapi/net.json
> index fe546b0e7c..1f4a55f2c5 100644
> --- a/qapi/net.json
> +++ b/qapi/net.json
> @@ -86,7 +86,7 @@
>   #
>   # @name: the device name of the VirtIO device
>   #
> -# @enable: true to use the alternate shadow VQ notifications
> +# @enable: true to use the alternate shadow VQ buffers fowarding path
>   #
>   # Returns: Error if failure, or 'no error' for success.
>   #
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index 34e159d4fd..df7e6fa3ec 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -10,6 +10,7 @@
>   #include "qemu/osdep.h"
>   #include "hw/virtio/vhost-shadow-virtqueue.h"
>   #include "hw/virtio/vhost.h"
> +#include "hw/virtio/virtio-access.h"
>   
>   #include "standard-headers/linux/vhost_types.h"
>   
> @@ -44,15 +45,135 @@ typedef struct VhostShadowVirtqueue {
>   
>       /* Virtio device */
>       VirtIODevice *vdev;
> +
> +    /* Map for returning guest's descriptors */
> +    VirtQueueElement **ring_id_maps;
> +
> +    /* Next head to expose to device */
> +    uint16_t avail_idx_shadow;
> +
> +    /* Next free descriptor */
> +    uint16_t free_head;
> +
> +    /* Last seen used idx */
> +    uint16_t shadow_used_idx;
> +
> +    /* Next head to consume from device */
> +    uint16_t used_idx;
Let's use "last_used_idx" as the kernel driver does.
>   } VhostShadowVirtqueue;
>   
>   /* If the device is using some of these, SVQ cannot communicate */
>   bool vhost_svq_valid_device_features(uint64_t *dev_features)
>   {
> -    return true;
> +    uint64_t b;
> +    bool r = true;
> +
> +    for (b = VIRTIO_TRANSPORT_F_START; b <= VIRTIO_TRANSPORT_F_END; ++b) {
> +        switch (b) {
> +        case VIRTIO_F_NOTIFY_ON_EMPTY:
> +        case VIRTIO_F_ANY_LAYOUT:
> +            /* SVQ is fine with this feature */
> +            continue;
> +
> +        case VIRTIO_F_ACCESS_PLATFORM:
> +            /* SVQ needs this feature disabled. Can't continue */
The code already says what it does; we need a comment that explains why.
> +            if (*dev_features & BIT_ULL(b)) {
> +                clear_bit(b, dev_features);
> +                r = false;
> +            }
> +            break;
> +
> +        case VIRTIO_F_VERSION_1:
> +            /* SVQ needs this feature, so can't continue */
A comment to explain why SVQ needs this feature.
> +            if (!(*dev_features & BIT_ULL(b))) {
> +                set_bit(b, dev_features);
> +                r = false;
> +            }
> +            continue;
> +
> +        default:
> +            /*
> +             * SVQ must disable this feature, let's hope the device is fine
> +             * without it.
> +             */
> +            if (*dev_features & BIT_ULL(b)) {
> +                clear_bit(b, dev_features);
> +            }
> +        }
> +    }
> +
> +    return r;
> +}
Let's move this to patch 14.
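One possible shape for such a check, sketched under assumptions: it takes the device features by value, compares them against a set of required bits and a set of unsupported bits, and reports which bits caused the failure without modifying the input. The bit numbers 32 (VIRTIO_F_VERSION_1) and 33 (VIRTIO_F_ACCESS_PLATFORM) are the standard ones; every `toy_` name is hypothetical and not part of the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_BIT_ULL(n)        (1ULL << (n))
#define TOY_F_VERSION_1       32 /* same bit as VIRTIO_F_VERSION_1 */
#define TOY_F_ACCESS_PLATFORM 33 /* same bit as VIRTIO_F_ACCESS_PLATFORM */

/* Result of a check: which bits are the reason for a failure, if any. */
struct toy_feature_check {
    uint64_t missing;     /* required by SVQ but not offered by the device */
    uint64_t unsupported; /* offered by the device but SVQ cannot handle it */
};

/*
 * Returns true when the device offers every required bit and none of the
 * forbidden ones. The device feature set itself is left untouched.
 */
static bool toy_svq_check_features(uint64_t dev_features, uint64_t required,
                                   uint64_t forbidden,
                                   struct toy_feature_check *out)
{
    out->missing = required & ~dev_features;
    out->unsupported = forbidden & dev_features;
    return out->missing == 0 && out->unsupported == 0;
}
```

Keeping the "why" next to each required or forbidden bit, for example in the place where the two masks are built, would also cover the request for comments above.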
> +
> +static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> +                                    const struct iovec *iovec,
> +                                    size_t num, bool more_descs, bool write)
> +{
> +    uint16_t i = svq->free_head, last = svq->free_head;
> +    unsigned n;
> +    uint16_t flags = write ? cpu_to_le16(VRING_DESC_F_WRITE) : 0;
> +    vring_desc_t *descs = svq->vring.desc;
> +
> +    if (num == 0) {
> +        return;
> +    }
> +
> +    for (n = 0; n < num; n++) {
> +        if (more_descs || (n + 1 < num)) {
> +            descs[i].flags = flags | cpu_to_le16(VRING_DESC_F_NEXT);
> +        } else {
> +            descs[i].flags = flags;
> +        }
> +        descs[i].addr = cpu_to_le64((hwaddr)iovec[n].iov_base);
> +        descs[i].len = cpu_to_le32(iovec[n].iov_len);
> +
> +        last = i;
> +        i = cpu_to_le16(descs[i].next);
> +    }
> +
> +    svq->free_head = le16_to_cpu(descs[last].next);
> +}
> +
> +static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
> +                                    VirtQueueElement *elem)
> +{
> +    int head;
> +    unsigned avail_idx;
> +    vring_avail_t *avail = svq->vring.avail;
> +
> +    head = svq->free_head;
> +
> +    /* We need some descriptors here */
> +    assert(elem->out_num || elem->in_num);
> +
> +    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
> +                            elem->in_num > 0, false);
> +    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
> +
> +    /*
> +     * Put entry in available array (but don't update avail->idx until they
> +     * do sync).
> +     */
> +    avail_idx = svq->avail_idx_shadow & (svq->vring.num - 1);
> +    avail->ring[avail_idx] = cpu_to_le16(head);
> +    svq->avail_idx_shadow++;
> +
> +    /* Update avail index after the descriptor is wrote */
> +    smp_wmb();
> +    avail->idx = cpu_to_le16(svq->avail_idx_shadow);
> +
> +    return head;
> +
>   }
>   
> -/* Forward guest notifications */
> +static void vhost_svq_add(VhostShadowVirtqueue *svq, VirtQueueElement *elem)
> +{
> +    unsigned qemu_head = vhost_svq_add_split(svq, elem);
> +
> +    svq->ring_id_maps[qemu_head] = elem;
> +}
> +
> +/* Handle guest->device notifications */
>   static void vhost_handle_guest_kick(EventNotifier *n)
>   {
>       VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> @@ -62,7 +183,74 @@ static void vhost_handle_guest_kick(EventNotifier *n)
>           return;
>       }
>   
> -    event_notifier_set(&svq->kick_notifier);
> +    /* Make available as many buffers as possible */
> +    do {
> +        if (virtio_queue_get_notification(svq->vq)) {
> +            /* No more notifications until process all available */
> +            virtio_queue_set_notification(svq->vq, false);
> +        }
This can be done outside the loop.
> +
> +        while (true) {
> +            VirtQueueElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
> +            if (!elem) {
> +                break;
> +            }
> +
> +            vhost_svq_add(svq, elem);
> +            event_notifier_set(&svq->kick_notifier);
> +        }
> +
> +        virtio_queue_set_notification(svq->vq, true);
I think this can be moved to the end of this function.
Btw, we probably need a quota to make sure the svq is not hogging the 
main event loop.
Similar issue could be found in both virtio-net TX (using timer or bh) 
and TAP (a quota).
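A minimal sketch of the quota idea, assuming a toy queue that only counts pending elements: the handler forwards at most `quota` elements per invocation and tells the caller whether work remains, so the caller can reschedule itself (for instance with a bottom half or timer) instead of looping until the queue is empty. All names here are hypothetical stand-ins, not the virtio-net or TAP code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a virtqueue: just counts elements. */
struct toy_queue {
    size_t pending;   /* elements the guest has made available */
    size_t forwarded; /* elements already handed to the device */
};

/* Pop one element; returns false when the queue is empty. */
static bool toy_pop(struct toy_queue *q)
{
    if (q->pending == 0) {
        return false;
    }
    q->pending--;
    return true;
}

/*
 * Forward at most `quota` elements. Returns true if elements remain, in
 * which case the caller should reschedule itself instead of looping.
 */
static bool toy_forward_with_quota(struct toy_queue *q, size_t quota)
{
    size_t done = 0;

    while (done < quota && toy_pop(q)) {
        q->forwarded++;
        done++;
    }

    return q->pending > 0;
}
```

With 10 pending elements and a quota of 4, the first call forwards 4 and returns true; the queue is drained two calls later.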
> +    } while (!virtio_queue_empty(svq->vq));
> +}
> +
> +static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
> +{
> +    if (svq->used_idx != svq->shadow_used_idx) {
> +        return true;
> +    }
> +
> +    /* Get used idx must not be reordered */
> +    smp_rmb();
Interesting, we don't do this for kernel drivers. It would be helpful to
explain it more clearly, in the form "X must be done before Y".
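As an illustration of that form, the usual rule for a used ring can be stated as "the read of used->idx must be done before the reads of the used->ring[] entries it covers". The sketch below shows that ordering with C11 acquire/release atomics on a toy ring. It does not use QEMU's smp_rmb(), and the structure and function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_RING_SIZE 8u /* must be a power of two */

struct toy_used_ring {
    _Atomic uint16_t idx;       /* written by the producer (device side) */
    uint32_t id[TOY_RING_SIZE]; /* entries, written before idx is bumped */
};

/* Producer: write the entry first, then publish it by bumping idx. */
static void toy_publish(struct toy_used_ring *r, uint32_t id)
{
    uint16_t idx = atomic_load_explicit(&r->idx, memory_order_relaxed);

    r->id[idx & (TOY_RING_SIZE - 1)] = id;
    atomic_store_explicit(&r->idx, (uint16_t)(idx + 1),
                          memory_order_release);
}

/*
 * Consumer: read idx first (acquire), and only then read the entry. The
 * acquire load is what keeps the entry read from being reordered before
 * the idx read.
 */
static bool toy_consume(struct toy_used_ring *r, uint16_t *last_used_idx,
                        uint32_t *id)
{
    uint16_t idx = atomic_load_explicit(&r->idx, memory_order_acquire);

    if (idx == *last_used_idx) {
        return false; /* nothing new */
    }

    *id = r->id[*last_used_idx & (TOY_RING_SIZE - 1)];
    (*last_used_idx)++;
    return true;
}
```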
> +    svq->shadow_used_idx = cpu_to_le16(svq->vring.used->idx);
> +
> +    return svq->used_idx != svq->shadow_used_idx;
> +}
> +
> +static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> +{
> +    vring_desc_t *descs = svq->vring.desc;
> +    const vring_used_t *used = svq->vring.used;
> +    vring_used_elem_t used_elem;
> +    uint16_t last_used;
> +
> +    if (!vhost_svq_more_used(svq)) {
> +        return NULL;
> +    }
> +
> +    last_used = svq->used_idx & (svq->vring.num - 1);
> +    used_elem.id = le32_to_cpu(used->ring[last_used].id);
> +    used_elem.len = le32_to_cpu(used->ring[last_used].len);
> +
> +    svq->used_idx++;
> +    if (unlikely(used_elem.id >= svq->vring.num)) {
> +        error_report("Device %s says index %u is used", svq->vdev->name,
> +                     used_elem.id);
> +        return NULL;
> +    }
> +
> +    if (unlikely(!svq->ring_id_maps[used_elem.id])) {
> +        error_report(
> +            "Device %s says index %u is used, but it was not available",
> +            svq->vdev->name, used_elem.id);
> +        return NULL;
> +    }
> +
> +    descs[used_elem.id].next = svq->free_head;
> +    svq->free_head = used_elem.id;
> +
> +    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
> +    return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
>   }
>   
>   /* Forward vhost notifications */
> @@ -70,8 +258,26 @@ static void vhost_svq_handle_call_no_test(EventNotifier *n)
>   {
>       VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
>                                                call_notifier);
> -
> -    event_notifier_set(&svq->guest_call_notifier);
> +    VirtQueue *vq = svq->vq;
> +
> +    /* Make as many buffers as possible used. */
> +    do {
> +        unsigned i = 0;
> +
> +        /* TODO: Use VRING_AVAIL_F_NO_INTERRUPT */
> +        while (true) {
> +            g_autofree VirtQueueElement *elem = vhost_svq_get_buf(svq);
> +            if (!elem) {
> +                break;
> +            }
> +
> +            assert(i < svq->vring.num);
Let's return error instead of using the assert.
> +            virtqueue_fill(vq, elem, elem->len, i++);
> +        }
> +
> +        virtqueue_flush(vq, i);
> +        event_notifier_set(&svq->guest_call_notifier);
> +    } while (vhost_svq_more_used(svq));
>   }
>   
>   static void vhost_svq_handle_call(EventNotifier *n)
> @@ -204,12 +410,25 @@ err_set_vring_kick:
>   void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
>                       VhostShadowVirtqueue *svq)
>   {
> +    int i;
>       int r = vhost_svq_restore_vdev_host_notifier(dev, idx, svq);
> +
>       if (unlikely(r < 0)) {
>           error_report("Couldn't restore vq kick fd: %s", strerror(-r));
>       }
>   
>       event_notifier_set_handler(&svq->host_notifier, NULL);
> +
> +    for (i = 0; i < svq->vring.num; ++i) {
> +        g_autofree VirtQueueElement *elem = svq->ring_id_maps[i];
> +        /*
> +         * Although the doc says we must unpop in order, it's ok to unpop
> +         * everything.
> +         */
> +        if (elem) {
> +            virtqueue_unpop(svq->vq, elem, elem->len);
> +        }
Will this result in some of the "pending" buffers being submitted multiple
times? If yes, should we wait for all the buffers to be used instead of
doing the unpop here?
> +    }
>   }
>   
>   /*
> @@ -224,7 +443,7 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
>       size_t driver_size;
>       size_t device_size;
>       g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
> -    int r;
> +    int r, i;
>   
>       r = event_notifier_init(&svq->kick_notifier, 0);
>       if (r != 0) {
> @@ -250,6 +469,11 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
>       memset(svq->vring.desc, 0, driver_size);
>       svq->vring.used = qemu_memalign(qemu_real_host_page_size, device_size);
>       memset(svq->vring.used, 0, device_size);
> +    for (i = 0; i < num - 1; i++) {
> +        svq->vring.desc[i].next = cpu_to_le16(i + 1);
> +    }
> +
> +    svq->ring_id_maps = g_new0(VirtQueueElement *, num);
>       event_notifier_set_handler(&svq->call_notifier,
>                                  vhost_svq_handle_call);
>       return g_steal_pointer(&svq);
> @@ -269,6 +493,7 @@ void vhost_svq_free(VhostShadowVirtqueue *vq)
>       event_notifier_cleanup(&vq->kick_notifier);
>       event_notifier_set_handler(&vq->call_notifier, NULL);
>       event_notifier_cleanup(&vq->call_notifier);
> +    g_free(vq->ring_id_maps);
>       qemu_vfree(vq->vring.desc);
>       qemu_vfree(vq->vring.used);
>       g_free(vq);
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index a057e8277d..bb7010ddb5 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -19,6 +19,7 @@
>   #include "hw/virtio/virtio-net.h"
>   #include "hw/virtio/vhost-shadow-virtqueue.h"
>   #include "hw/virtio/vhost-vdpa.h"
> +#include "hw/virtio/vhost-shadow-virtqueue.h"
>   #include "exec/address-spaces.h"
>   #include "qemu/main-loop.h"
>   #include "cpu.h"
> @@ -475,6 +476,28 @@ static int vhost_vdpa_set_features(struct vhost_dev *dev,
>       return vhost_vdpa_backend_set_features(dev, features);
>   }
>   
> +/**
> + * Restore guest features to vdpa device
> + */
> +static int vhost_vdpa_set_guest_features(struct vhost_dev *dev)
> +{
> +    struct vhost_vdpa *v = dev->opaque;
> +    return vhost_vdpa_backend_set_features(dev, v->guest_features);
> +}
> +
> +/**
> + * Set shadow virtqueue supported features
> + */
> +static int vhost_vdpa_set_svq_features(struct vhost_dev *dev)
> +{
> +    struct vhost_vdpa *v = dev->opaque;
> +    uint64_t features = v->host_features;
> +    bool b = vhost_svq_valid_device_features(&features);
> +    assert(b);
> +
> +    return vhost_vdpa_backend_set_features(dev, features);
> +}
> +
>   static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)
>   {
>       uint64_t features;
> @@ -730,6 +753,19 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
>       return true;
>   }
>   
> +static int vhost_vdpa_vring_pause(struct vhost_dev *dev)
> +{
> +    int r;
> +    uint8_t status;
> +
> +    vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DEVICE_STOPPED);
> +    do {
> +        r = vhost_vdpa_call(dev, VHOST_VDPA_GET_STATUS, &status);
I guess we'd better add some sleep here.
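A minimal sketch of such a polling loop, assuming a bounded number of tries and a short sleep between them. Unlike the quoted loop, it also propagates an error from the status query instead of returning 0 unconditionally. The callback type, the toy device and the 0x20 bit used in the usage note are hypothetical; only nanosleep() and ETIMEDOUT are real POSIX interfaces.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical status query: returns 0 on success and fills *status. */
typedef int (*toy_get_status_fn)(void *opaque, uint8_t *status);

/*
 * Poll until `wanted` is set in the status, sleeping `sleep_ns` between
 * tries. Returns 0 on success, the query error if one occurs, or
 * -ETIMEDOUT after `max_tries` unsuccessful polls.
 */
static int toy_wait_for_status(toy_get_status_fn get_status, void *opaque,
                               uint8_t wanted, unsigned max_tries,
                               long sleep_ns)
{
    const struct timespec delay = { .tv_sec = 0, .tv_nsec = sleep_ns };

    for (unsigned i = 0; i < max_tries; i++) {
        uint8_t status = 0;
        int r = get_status(opaque, &status);

        if (r != 0) {
            return r;
        }
        if (status & wanted) {
            return 0;
        }
        nanosleep(&delay, NULL);
    }

    return -ETIMEDOUT;
}

/* Toy device: reports the bit only after a number of polls. */
struct toy_dev {
    unsigned polls_until_stopped;
    uint8_t stopped_bit;
};

static int toy_dev_get_status(void *opaque, uint8_t *status)
{
    struct toy_dev *d = opaque;

    if (d->polls_until_stopped > 0) {
        d->polls_until_stopped--;
        *status = 0;
    } else {
        *status = d->stopped_bit;
    }
    return 0;
}
```

For example, a toy device that needs three polls before reporting the bit succeeds within a budget of ten tries, while one that needs a hundred polls times out with a budget of five.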
> +    } while (r == 0 && !(status & VIRTIO_CONFIG_S_DEVICE_STOPPED));
> +
> +    return 0;
> +}
> +
>   /*
>    * Start shadow virtqueue.
>    */
> @@ -742,9 +778,29 @@ static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
>           .index = idx + dev->vq_index,
>           .fd = event_notifier_get_fd(vhost_call_notifier),
>       };
> +    struct vhost_vring_addr addr = {
> +        .index = idx + dev->vq_index,
> +    };
> +    struct vhost_vring_state num = {
> +        .index = idx + dev->vq_index,
> +        .num = virtio_queue_get_num(dev->vdev, idx),
> +    };
>       int r;
>       bool b;
>   
> +    vhost_svq_get_vring_addr(svq, &addr);
> +    r = vhost_vdpa_set_vring_addr(dev, &addr);
> +    if (unlikely(r)) {
> +        error_report("vhost_set_vring_addr for shadow vq failed");
> +        return false;
> +    }
> +
> +    r = vhost_vdpa_set_vring_num(dev, &num);
> +    if (unlikely(r)) {
> +        error_report("vhost_vdpa_set_vring_num for shadow vq failed");
> +        return false;
> +    }
> +
>       /* Set shadow vq -> guest notifier */
>       assert(v->call_fd[idx]);
>       vhost_svq_set_guest_call_notifier(svq, v->call_fd[idx]);
> @@ -781,15 +837,32 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
>           assert(v->shadow_vqs->len == 0);
>           for (n = 0; n < hdev->nvqs; ++n) {
>               VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n);
> -            bool ok;
> -
>               if (unlikely(!svq)) {
>                   g_ptr_array_set_size(v->shadow_vqs, 0);
>                   return 0;
>               }
>               g_ptr_array_add(v->shadow_vqs, svq);
> +        }
> +    }
>   
> -            ok = vhost_vdpa_svq_start_vq(hdev, n);
> +    r = vhost_vdpa_vring_pause(hdev);
> +    assert(r == 0);
> +
> +    if (enable) {
> +        for (n = 0; n < v->shadow_vqs->len; ++n) {
> +            /* Obtain Virtqueue state */
> +            vhost_virtqueue_stop(hdev, hdev->vdev, &hdev->vqs[n], n);
> +        }
> +    }
> +
> +    /* Reset device so it can be configured */
> +    r = vhost_vdpa_dev_start(hdev, false);
> +    assert(r == 0);
> +
> +    if (enable) {
> +        int r;
> +        for (n = 0; n < v->shadow_vqs->len; ++n) {
> +            bool ok = vhost_vdpa_svq_start_vq(hdev, n);
>               if (unlikely(!ok)) {
>                   /* Free still not started svqs */
>                   g_ptr_array_set_size(v->shadow_vqs, n);
> @@ -797,11 +870,19 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
>                   break;
>               }
>           }
> +
> +        /* Need to ack features to set state in vp_vdpa devices */
vhost_vdpa actually?
> +        r = vhost_vdpa_set_svq_features(hdev);
> +        if (unlikely(r)) {
> +            enable = false;
> +        }
>       }
>   
>       v->shadow_vqs_enabled = enable;
>   
>       if (!enable) {
> +        vhost_vdpa_set_guest_features(hdev);
> +
>           /* Disable all queues or clean up failed start */
>           for (n = 0; n < v->shadow_vqs->len; ++n) {
>               struct vhost_vring_file file = {
> @@ -818,7 +899,12 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
>               /* TODO: This can unmask or override call fd! */
>               vhost_virtqueue_start(hdev, hdev->vdev, &hdev->vqs[n], vq_idx);
>           }
> +    }
>   
> +    r = vhost_vdpa_dev_start(hdev, true);
> +    assert(r == 0);
> +
> +    if (!enable) {
>           /* Resources cleanup */
>           g_ptr_array_set_size(v->shadow_vqs, 0);
>       }
> @@ -831,6 +917,7 @@ void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
>       struct vhost_vdpa *v;
>       const char *err_cause = NULL;
>       bool r;
> +    uint64_t svq_features;
>   
>       QLIST_FOREACH(v, &vhost_vdpa_devices, entry) {
>           if (v->dev->vdev && 0 == strcmp(v->dev->vdev->name, name)) {
> @@ -846,6 +933,20 @@ void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
>           goto err;
>       }
>   
> +    svq_features = v->host_features;
> +    if (!vhost_svq_valid_device_features(&svq_features)) {
> +        error_setg(errp,
> +            "Can't enable shadow vq on %s: Unexpected feature flags (%lx-%lx)",
> +            name, v->host_features, svq_features);
> +        return;
> +    } else {
> +        /* TODO: Check for virtio_vdpa + IOMMU & modern device */
I guess you mean "vhost_vdpa" here. For IOMMU, I guess you mean "vIOMMU" 
actually?
Thanks
> +    }
> +
> +    if (err_cause) {
> +        goto err;
> +    }
> +
>       r = vhost_vdpa_enable_svq(v, enable);
>       if (unlikely(!r)) {
>           err_cause = "Error enabling (see monitor)";
> @@ -853,7 +954,7 @@ void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
>       }
>   
>   err:
> -    if (err_cause) {
> +    if (errp == NULL && err_cause) {
>           error_setg(errp, "Can't enable shadow vq on %s: %s", name, err_cause);
>       }
>   }
* Re: [RFC PATCH v4 16/20] vhost: Check for device VRING_USED_F_NO_NOTIFY at shadow virtqueue kick
  2021-10-01  7:05 ` [RFC PATCH v4 16/20] vhost: Check for device VRING_USED_F_NO_NOTIFY at shadow virtqueue kick Eugenio Pérez
@ 2021-10-13  4:35   ` Jason Wang
  2021-10-15  6:17     ` Eugenio Perez Martin
  0 siblings, 1 reply; 90+ messages in thread
From: Jason Wang @ 2021-10-13  4:35 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Juan Quintela, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
On 2021/10/1 3:05 PM, Eugenio Pérez wrote:
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   hw/virtio/vhost-shadow-virtqueue.c | 11 ++++++++++-
>   1 file changed, 10 insertions(+), 1 deletion(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index df7e6fa3ec..775f8d36a0 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -173,6 +173,15 @@ static void vhost_svq_add(VhostShadowVirtqueue *svq, VirtQueueElement *elem)
>       svq->ring_id_maps[qemu_head] = elem;
>   }
>   
> +static void vhost_svq_kick(VhostShadowVirtqueue *svq)
> +{
> +    /* Make sure we are reading updated device flag */
I guess this would be better:
         /* We need to expose available array entries before checking used
          * flags. */
(Borrowed from the kernel code.)
Thanks
> +    smp_mb();
> +    if (!(svq->vring.used->flags & VRING_USED_F_NO_NOTIFY)) {
> +        event_notifier_set(&svq->kick_notifier);
> +    }
> +}
> +
>   /* Handle guest->device notifications */
>   static void vhost_handle_guest_kick(EventNotifier *n)
>   {
> @@ -197,7 +206,7 @@ static void vhost_handle_guest_kick(EventNotifier *n)
>               }
>   
>               vhost_svq_add(svq, elem);
> -            event_notifier_set(&svq->kick_notifier);
> +            vhost_svq_kick(svq);
>           }
>   
>           virtio_queue_set_notification(svq->vq, true);
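For reference, the ordering the suggested comment describes can be sketched outside of qemu. This is only a minimal illustration under stated assumptions: the `toy_` names and the struct layout are hypothetical, C11 atomics stand in for qemu's smp_mb(), and endianness conversion is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define VRING_USED_F_NO_NOTIFY 1 /* value defined by the virtio spec */

/* Hypothetical, simplified view of the fields a kick decision needs. */
struct toy_vring {
    _Atomic uint16_t avail_idx;  /* written by the driver side (SVQ) */
    _Atomic uint16_t used_flags; /* written by the device */
};

/*
 * Publish a new avail index, then decide whether the device wants a kick.
 * Returns true when the caller should signal the kick eventfd.
 */
static bool toy_publish_and_need_kick(struct toy_vring *vr, uint16_t new_idx)
{
    atomic_store_explicit(&vr->avail_idx, new_idx, memory_order_relaxed);

    /* Expose the avail entries before checking the used flags. */
    atomic_thread_fence(memory_order_seq_cst);

    uint16_t flags = atomic_load_explicit(&vr->used_flags,
                                          memory_order_relaxed);
    return !(flags & VRING_USED_F_NO_NOTIFY);
}
```

Without the full fence, the flag load could be ordered before the index store, so the device might set the flag back to "notify me" and go idle while this side still skips the kick.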
* Re: [RFC PATCH v4 17/20] vhost: Use VRING_AVAIL_F_NO_INTERRUPT at device call on shadow virtqueue
  2021-10-01  7:06 ` [RFC PATCH v4 17/20] vhost: Use VRING_AVAIL_F_NO_INTERRUPT at device call on shadow virtqueue Eugenio Pérez
@ 2021-10-13  4:36   ` Jason Wang
  2021-10-15  6:22     ` Eugenio Perez Martin
  0 siblings, 1 reply; 90+ messages in thread
From: Jason Wang @ 2021-10-13  4:36 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Juan Quintela, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
On 2021/10/1 3:06 PM, Eugenio Pérez wrote:
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
Commit log please.
Thanks
> ---
>   hw/virtio/vhost-shadow-virtqueue.c | 24 +++++++++++++++++++++++-
>   1 file changed, 23 insertions(+), 1 deletion(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index 775f8d36a0..2fd0bab75d 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -60,6 +60,9 @@ typedef struct VhostShadowVirtqueue {
>   
>       /* Next head to consume from device */
>       uint16_t used_idx;
> +
> +    /* Cache for the exposed notification flag */
> +    bool notification;
>   } VhostShadowVirtqueue;
>   
>   /* If the device is using some of these, SVQ cannot communicate */
> @@ -105,6 +108,24 @@ bool vhost_svq_valid_device_features(uint64_t *dev_features)
>       return r;
>   }
>   
> +static void vhost_svq_set_notification(VhostShadowVirtqueue *svq, bool enable)
> +{
> +    uint16_t notification_flag;
> +
> +    if (svq->notification == enable) {
> +        return;
> +    }
> +
> +    notification_flag = cpu_to_le16(VRING_AVAIL_F_NO_INTERRUPT);
> +
> +    svq->notification = enable;
> +    if (enable) {
> +        svq->vring.avail->flags &= ~notification_flag;
> +    } else {
> +        svq->vring.avail->flags |= notification_flag;
> +    }
> +}
> +
>   static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
>                                       const struct iovec *iovec,
>                                       size_t num, bool more_descs, bool write)
> @@ -273,7 +294,7 @@ static void vhost_svq_handle_call_no_test(EventNotifier *n)
>       do {
>           unsigned i = 0;
>   
> -        /* TODO: Use VRING_AVAIL_F_NO_INTERRUPT */
> +        vhost_svq_set_notification(svq, false);
>           while (true) {
>               g_autofree VirtQueueElement *elem = vhost_svq_get_buf(svq);
>               if (!elem) {
> @@ -286,6 +307,7 @@ static void vhost_svq_handle_call_no_test(EventNotifier *n)
>   
>           virtqueue_flush(vq, i);
>           event_notifier_set(&svq->guest_call_notifier);
> +        vhost_svq_set_notification(svq, true);
>       } while (vhost_svq_more_used(svq));
>   }
>   
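As a possible aid for the missing commit log, the disable / drain / enable / re-check pattern in the patch can be sketched as below. It is a single-threaded illustration only: the `toy_` names and counters are hypothetical stand-ins for the used ring, and no barriers or endianness conversions are shown.

```c
#include <stdbool.h>
#include <stdint.h>

#define VRING_AVAIL_F_NO_INTERRUPT 1 /* value defined by the virtio spec */

/* Hypothetical stand-in for the shadow virtqueue state used here. */
struct toy_svq {
    uint16_t avail_flags; /* flag word the device reads */
    bool notification;    /* cached copy of the exposed setting */
    uint16_t device_used; /* buffers the device has marked as used */
    uint16_t last_used;   /* buffers already consumed by this side */
};

static void toy_set_notification(struct toy_svq *svq, bool enable)
{
    if (svq->notification == enable) {
        return; /* avoid touching shared memory when nothing changes */
    }
    svq->notification = enable;
    if (enable) {
        svq->avail_flags &= (uint16_t)~VRING_AVAIL_F_NO_INTERRUPT;
    } else {
        svq->avail_flags |= VRING_AVAIL_F_NO_INTERRUPT;
    }
}

static bool toy_more_used(const struct toy_svq *svq)
{
    return svq->last_used != svq->device_used;
}

/* Drain used buffers; returns how many were consumed. */
static unsigned toy_handle_call(struct toy_svq *svq)
{
    unsigned consumed = 0;

    do {
        toy_set_notification(svq, false);
        while (toy_more_used(svq)) {
            svq->last_used++;
            consumed++;
        }
        toy_set_notification(svq, true);
        /* Re-check: buffers may have been added before the re-enable. */
    } while (toy_more_used(svq));

    return consumed;
}
```

In this sketch the cached value is expected to start as true, so that it agrees with a zeroed flag word (notifications enabled); that initial value is an assumption of the example.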
* Re: [RFC PATCH v4 20/20] vdpa: Add custom IOTLB translations to SVQ
  2021-10-01  7:06 ` [RFC PATCH v4 20/20] vdpa: Add custom IOTLB translations to SVQ Eugenio Pérez
@ 2021-10-13  5:34   ` Jason Wang
  2021-10-15  7:27     ` Eugenio Perez Martin
  2021-10-19  9:24   ` Jason Wang
  1 sibling, 1 reply; 90+ messages in thread
From: Jason Wang @ 2021-10-13  5:34 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Juan Quintela, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
On 2021/10/1 3:06 PM, Eugenio Pérez wrote:
> Use translations added in VhostIOVATree in SVQ.
>
> Now every element needs to store the previous address also, so VirtQueue
> can consume the elements properly. This adds a little overhead per VQ
> element, having to allocate more memory to stash them. As a possible
> optimization, this allocation could be avoided if the descriptor is not
> a chain but a single one, but this is left undone.
>
> TODO: iova range should be queried before, and add logic to fail when
> GPA is outside of its range and memory listener or svq add it.
>
> Signed-off-by: Eugenio Pérez<eperezma@redhat.com>
> ---
>   hw/virtio/vhost-shadow-virtqueue.h |   4 +-
>   hw/virtio/vhost-shadow-virtqueue.c | 130 ++++++++++++++++++++++++-----
>   hw/virtio/vhost-vdpa.c             |  40 ++++++++-
>   hw/virtio/trace-events             |   1 +
>   4 files changed, 152 insertions(+), 23 deletions(-)
Thinking hard about the whole logic, this is safe since the qemu memory map
will fail if the guest submits an invalid IOVA.
Then I wonder if we could do something much simpler:
1) Use qemu VA as IOVA, but only map the VA that belongs to the guest
2) Then we don't need any IOVA tree here; we just need to map the
vring and use qemu VA without any translation
Thanks
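Whichever addressing scheme is chosen, the TODO in the commit message (fail when an address falls outside the device's IOVA range) comes down to a bounds check. A minimal sketch, assuming the usable range is known as an inclusive first/last pair; the struct and function names are hypothetical and not taken from the series:

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical description of the IOVA window a device accepts. */
struct toy_iova_range {
    uint64_t first; /* lowest usable address, inclusive */
    uint64_t last;  /* highest usable address, inclusive */
};

/* True when [addr, addr + len) lies entirely inside the range. */
static bool toy_iova_range_contains(const struct toy_iova_range *r,
                                    uint64_t addr, uint64_t len)
{
    if (len == 0 || addr < r->first || addr > r->last) {
        return false;
    }
    /* Compare against the remaining room to avoid addr + len overflow. */
    return len - 1 <= r->last - addr;
}
```

Such a check would apply both when the memory listener adds a region and when SVQ maps its own vring, since either path could produce an address the device cannot reach.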
* Re: [RFC PATCH v4 08/20] vhost: Route guest->host notification through shadow virtqueue
  2021-10-13  3:27   ` Jason Wang
@ 2021-10-14 12:00     ` Eugenio Perez Martin
  2021-10-15  3:45       ` Jason Wang
  2021-10-15 18:21       ` Eugenio Perez Martin
  0 siblings, 2 replies; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-14 12:00 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-level, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Michael Lilja, Stefano Garzarella
On Wed, Oct 13, 2021 at 5:27 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/10/1 3:05 PM, Eugenio Pérez wrote:
> > Shadow virtqueue notifications forwarding is disabled when vhost_dev
> > stops, so code flow follows usual cleanup.
> >
> > Also, host notifiers must be disabled at SVQ start,
>
>
> Any reason for this?
>
It will be addressed in a later series, sorry.
>
> > and they will not
> > start if SVQ has been enabled when device is stopped. This is trivial
> > to address, but it is left out for simplicity at this moment.
>
>
> It looks to me this patch also contains the following logics
>
> 1) codes to enable svq
>
> 2) codes to let svq to be enabled from QMP.
>
> I think they need to be split out,
I agree that we can split this more, separating the code that belongs to
SVQ from the code that belongs to vhost-vdpa. It will be addressed in
a future series.
> we may endup with the following
> series of patches
>
By "series of patches", do you mean sending every step as a separate
series? Chances are that later ones will need to modify code that was
already sent and merged. If you confirm to me that it is fine, I
can do it that way for sure.
> 1) svq skeleton with enable/disable
> 2) route host notifier to svq
> 3) route guest notifier to svq
> 4) codes to enable svq
> 5) enable svq via QMP
>
I'm totally fine with that, but there is code that is never called if
the qmp command is not added. The compiler complains about static
functions that are not called, which makes things like bisecting
through these commits impossible unless I use __attribute__((unused))
or similar. Or have I missed something?
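To illustrate the workaround mentioned above: GCC and Clang accept an attribute that silences the unused-function warning for a static function that has no caller yet. The helper below is purely hypothetical; glib's G_GNUC_UNUSED macro expands to the same attribute on those compilers.

```c
/* Hypothetical helper that has no caller until a later patch adds one. */
__attribute__((unused))
static int toy_svq_start_stub(int queue_index)
{
    /* Placeholder logic: accept any non-negative queue index. */
    return queue_index >= 0 ? 0 : -1;
}
```

The annotation would have to be removed again by the patch that adds the first caller, which is part of the churn I would like to avoid.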
We could do it that way with the code that belongs to SVQ though, since
all of it is declared in headers. But delaying the "enable svq via
qmp" patch to the last one makes debugging harder, as we cannot just enable
notification forwarding with no buffer forwarding.
If I introduce a change in the notifications code, I can simply go to
these commits and enable SVQ for notifications. This way I can have an
idea of what part is failing. A similar logic can be applied to other
devices than vp_vdpa. We would lose it if we
>
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   qapi/net.json                      |   2 +-
> >   hw/virtio/vhost-shadow-virtqueue.h |   8 ++
> >   include/hw/virtio/vhost-vdpa.h     |   4 +
> >   hw/virtio/vhost-shadow-virtqueue.c | 138 ++++++++++++++++++++++++++++-
> >   hw/virtio/vhost-vdpa.c             | 116 +++++++++++++++++++++++-
> >   5 files changed, 264 insertions(+), 4 deletions(-)
> >
> > diff --git a/qapi/net.json b/qapi/net.json
> > index a2c30fd455..fe546b0e7c 100644
> > --- a/qapi/net.json
> > +++ b/qapi/net.json
> > @@ -88,7 +88,7 @@
> >   #
> >   # @enable: true to use the alternate shadow VQ notifications
> >   #
> > -# Returns: Always error, since SVQ is not implemented at the moment.
> > +# Returns: Error if failure, or 'no error' for success.
> >   #
> >   # Since: 6.2
> >   #
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> > index 27ac6388fa..237cfceb9c 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.h
> > +++ b/hw/virtio/vhost-shadow-virtqueue.h
> > @@ -14,6 +14,14 @@
> >
> >   typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
> >
> > +EventNotifier *vhost_svq_get_svq_call_notifier(VhostShadowVirtqueue *svq);
>
>
> Let's move this function to another patch since it's unrelated to the
> guest->host routing.
>
Right, I missed it while squashing commits and at later reviews.
>
> > +void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd);
> > +
> > +bool vhost_svq_start(struct vhost_dev *dev, unsigned idx,
> > +                     VhostShadowVirtqueue *svq);
> > +void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> > +                    VhostShadowVirtqueue *svq);
> > +
> >   VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx);
> >
> >   void vhost_svq_free(VhostShadowVirtqueue *vq);
> > diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
> > index 0d565bb5bd..48aae59d8e 100644
> > --- a/include/hw/virtio/vhost-vdpa.h
> > +++ b/include/hw/virtio/vhost-vdpa.h
> > @@ -12,6 +12,8 @@
> >   #ifndef HW_VIRTIO_VHOST_VDPA_H
> >   #define HW_VIRTIO_VHOST_VDPA_H
> >
> > +#include <gmodule.h>
> > +
> >   #include "qemu/queue.h"
> >   #include "hw/virtio/virtio.h"
> >
> > @@ -24,6 +26,8 @@ typedef struct vhost_vdpa {
> >       int device_fd;
> >       uint32_t msg_type;
> >       MemoryListener listener;
> > +    bool shadow_vqs_enabled;
> > +    GPtrArray *shadow_vqs;
> >       struct vhost_dev *dev;
> >       QLIST_ENTRY(vhost_vdpa) entry;
> >       VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > index c4826a1b56..21dc99ab5d 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > @@ -9,9 +9,12 @@
> >
> >   #include "qemu/osdep.h"
> >   #include "hw/virtio/vhost-shadow-virtqueue.h"
> > +#include "hw/virtio/vhost.h"
> > +
> > +#include "standard-headers/linux/vhost_types.h"
> >
> >   #include "qemu/error-report.h"
> > -#include "qemu/event_notifier.h"
> > +#include "qemu/main-loop.h"
> >
> >   /* Shadow virtqueue to relay notifications */
> >   typedef struct VhostShadowVirtqueue {
> > @@ -19,14 +22,146 @@ typedef struct VhostShadowVirtqueue {
> >       EventNotifier kick_notifier;
> >       /* Shadow call notifier, sent to vhost */
> >       EventNotifier call_notifier;
> > +
> > +    /*
> > +     * Borrowed virtqueue's guest to host notifier.
> > +     * To borrow it in this event notifier allows to register on the event
> > +     * loop and access the associated shadow virtqueue easily. If we use the
> > +     * VirtQueue, we don't have an easy way to retrieve it.
> > +     *
> > +     * So shadow virtqueue must not clean it, or we would lose VirtQueue one.
> > +     */
> > +    EventNotifier host_notifier;
> > +
> > +    /* Guest's call notifier, where SVQ calls guest. */
> > +    EventNotifier guest_call_notifier;
>
>
> To be consistent, let's simply use "guest_notifier" here.
>
That could be confusing once the series adds a guest -> qemu kick
notifier. Actually, I think it would be better to rename
host_notifier to something like host_svq_notifier. Or host_call and
guest_call, since "notifier" is already in the type, which makes the name
read a little bit like "Hungarian notation".
>
> > +
> > +    /* Virtio queue shadowing */
> > +    VirtQueue *vq;
> >   } VhostShadowVirtqueue;
> >
> > +/* Forward guest notifications */
> > +static void vhost_handle_guest_kick(EventNotifier *n)
> > +{
> > +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> > +                                             host_notifier);
> > +
> > +    if (unlikely(!event_notifier_test_and_clear(n))) {
> > +        return;
> > +    }
>
>
> Is there a chance that we may stop the processing of available buffers
> during the svq enabling? There could be no kick from the guest in this case.
>
Actually, yes, I think you are right. The guest kick eventfd could
have been consumed by vhost but there may be still pending buffers.
I think it would be better to check for available buffers first, then
clear the notifier unconditionally, and then re-check and process them
if any [1].
However, this problem arises later in the series: At this moment the
device is not reset and guest's host notifier is not replaced, so
either vhost/device receives the kick, or SVQ does and forwards it.
Does it make sense to you?
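The ordering proposed in [1] can be sketched as follows. This is a single-threaded illustration with hypothetical `toy_` names: a bool stands in for the kick eventfd and two counters stand in for the avail ring, so no barriers are shown.

```c
#include <stdbool.h>

/* Hypothetical stand-ins: an eventfd-like flag and queued-buffer counters. */
struct toy_queue {
    bool kick_pending;  /* set by the guest's kick */
    unsigned avail;     /* buffers the guest has made available */
    unsigned forwarded; /* buffers already passed on to the device */
};

/* Forward everything currently available; returns how many were moved. */
static unsigned toy_forward_avail(struct toy_queue *q)
{
    unsigned n = q->avail - q->forwarded;
    q->forwarded = q->avail;
    return n;
}

/* Returns the number of buffers forwarded by this invocation. */
static unsigned toy_handle_guest_kick(struct toy_queue *q)
{
    unsigned total = 0;

    do {
        /* 1. Forward whatever is already available, kick or no kick. */
        total += toy_forward_avail(q);
        /* 2. Clear the notifier unconditionally. */
        q->kick_pending = false;
        /* 3. Re-check: buffers may have arrived before the clear. */
    } while (q->avail != q->forwarded);

    return total;
}
```

With this shape the handler no longer returns early when the eventfd was already consumed by vhost, so pending buffers are picked up even if no new kick arrives.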
>
> > +
> > +    event_notifier_set(&svq->kick_notifier);
> > +}
> > +
> > +/*
> > + * Obtain the SVQ call notifier, where vhost device notifies SVQ that there
> > + * exists pending used buffers.
> > + *
> > + * @svq Shadow Virtqueue
> > + */
> > +EventNotifier *vhost_svq_get_svq_call_notifier(VhostShadowVirtqueue *svq)
> > +{
> > +    return &svq->call_notifier;
> > +}
> > +
> > +/*
> > + * Set the call notifier for the SVQ to call the guest
> > + *
> > + * @svq Shadow virtqueue
> > + * @call_fd call notifier
> > + *
> > + * Called on BQL context.
> > + */
> > +void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd)
> > +{
> > +    event_notifier_init_fd(&svq->guest_call_notifier, call_fd);
> > +}
> > +
> > +/*
> > + * Restore the vhost guest to host notifier, i.e., disables svq effect.
> > + */
> > +static int vhost_svq_restore_vdev_host_notifier(struct vhost_dev *dev,
> > +                                                unsigned vhost_index,
> > +                                                VhostShadowVirtqueue *svq)
> > +{
> > +    EventNotifier *vq_host_notifier = virtio_queue_get_host_notifier(svq->vq);
> > +    struct vhost_vring_file file = {
> > +        .index = vhost_index,
> > +        .fd = event_notifier_get_fd(vq_host_notifier),
> > +    };
> > +    int r;
> > +
> > +    /* Restore vhost kick */
> > +    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
>
>
> And remap the notification area if necessary.
Totally right, that step is missing in this series.
However, remapping the guest host notifier memory region has no advantage
over using ioeventfd to perform guest -> SVQ notifications, does
it? With both methods the flow needs to go through the hypervisor kernel.
>
>
> > +    return r ? -errno : 0;
> > +}
> > +
> > +/*
> > + * Start shadow virtqueue operation.
> > + * @dev vhost device
> > + * @hidx vhost virtqueue index
> > + * @svq Shadow Virtqueue
> > + */
> > +bool vhost_svq_start(struct vhost_dev *dev, unsigned idx,
> > +                     VhostShadowVirtqueue *svq)
> > +{
> > +    EventNotifier *vq_host_notifier = virtio_queue_get_host_notifier(svq->vq);
> > +    struct vhost_vring_file file = {
> > +        .index = dev->vhost_ops->vhost_get_vq_index(dev, dev->vq_index + idx),
> > +        .fd = event_notifier_get_fd(&svq->kick_notifier),
> > +    };
> > +    int r;
> > +
> > +    /* Check that notifications are still going directly to vhost dev */
> > +    assert(virtio_queue_is_host_notifier_enabled(svq->vq));
> > +
> > +    /*
> > +     * event_notifier_set_handler already checks for guest's notifications if
> > +     * they arrive in the switch, so there is no need to explicitely check for
> > +     * them.
> > +     */
>
>
> If this is true, shouldn't we call vhost_set_vring_kick() before the
> event_notifier_set_handler()?
>
Not at this point of the series, but it could be another solution when
we need to reset the device and we are unsure if all buffers have been
read. But I think I prefer the solution described in [1], and to
explicitly call vhost_handle_guest_kick here. Do you think
differently?
> Btw, I think we should update the fd if set_vring_kick() was called
> after this function?
>
Kind of. This is currently bad in the code, but...
Backend callbacks vhost_ops->vhost_set_vring_kick and
vhost_ops->vhost_set_vring_addr are only called at
vhost_virtqueue_start. And they are always called with known data
already stored in VirtQueue.
To avoid storing more state in vhost_vdpa, I think that we should
avoid duplicating them, but ignore a new kick_fd or address in SVQ mode,
and retrieve them again at the moment the device is (re)started in SVQ
mode. Qemu already avoids things like allowing the guest to set
addresses at random times, using the VirtIOPCIProxy to store them.
I also see how duplicating that status could protect vdpa SVQ code
against future changes to vhost code, but that would make this series
bigger and more complex with no need at this moment in my opinion.
Do you agree?
>
> > +    event_notifier_init_fd(&svq->host_notifier,
> > +                           event_notifier_get_fd(vq_host_notifier));
> > +    event_notifier_set_handler(&svq->host_notifier, vhost_handle_guest_kick);
> > +
> > +    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
>
>
> And we need to stop the notification area mmap.
>
Right.
>
> > +    if (unlikely(r != 0)) {
> > +        error_report("Couldn't set kick fd: %s", strerror(errno));
> > +        goto err_set_vring_kick;
> > +    }
> > +
> > +    return true;
> > +
> > +err_set_vring_kick:
> > +    event_notifier_set_handler(&svq->host_notifier, NULL);
> > +
> > +    return false;
> > +}
> > +
> > +/*
> > + * Stop shadow virtqueue operation.
> > + * @dev vhost device
> > + * @idx vhost queue index
> > + * @svq Shadow Virtqueue
> > + */
> > +void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> > +                    VhostShadowVirtqueue *svq)
> > +{
> > +    int r = vhost_svq_restore_vdev_host_notifier(dev, idx, svq);
> > +    if (unlikely(r < 0)) {
> > +        error_report("Couldn't restore vq kick fd: %s", strerror(-r));
> > +    }
> > +
> > +    event_notifier_set_handler(&svq->host_notifier, NULL);
> > +}
> > +
> >   /*
> >    * Creates vhost shadow virtqueue, and instruct vhost device to use the shadow
> >    * methods and file descriptors.
> >    */
> >   VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> >   {
> > +    int vq_idx = dev->vq_index + idx;
> >       g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
> >       int r;
> >
> > @@ -44,6 +179,7 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> >           goto err_init_call_notifier;
> >       }
> >
> > +    svq->vq = virtio_get_queue(dev->vdev, vq_idx);
> >       return g_steal_pointer(&svq);
> >
> >   err_init_call_notifier:
> > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > index e0dc7508c3..36c954a779 100644
> > --- a/hw/virtio/vhost-vdpa.c
> > +++ b/hw/virtio/vhost-vdpa.c
> > @@ -17,6 +17,7 @@
> >   #include "hw/virtio/vhost.h"
> >   #include "hw/virtio/vhost-backend.h"
> >   #include "hw/virtio/virtio-net.h"
> > +#include "hw/virtio/vhost-shadow-virtqueue.h"
> >   #include "hw/virtio/vhost-vdpa.h"
> >   #include "exec/address-spaces.h"
> >   #include "qemu/main-loop.h"
> > @@ -272,6 +273,16 @@ static void vhost_vdpa_add_status(struct vhost_dev *dev, uint8_t status)
> >       vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &s);
> >   }
> >
> > +/**
> > + * Adaptor function to free shadow virtqueue through gpointer
> > + *
> > + * @svq   The Shadow Virtqueue
> > + */
> > +static void vhost_psvq_free(gpointer svq)
> > +{
> > +    vhost_svq_free(svq);
> > +}
> > +
> >   static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
> >   {
> >       struct vhost_vdpa *v;
> > @@ -283,6 +294,7 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
> >       dev->opaque =  opaque ;
> >       v->listener = vhost_vdpa_memory_listener;
> >       v->msg_type = VHOST_IOTLB_MSG_V2;
> > +    v->shadow_vqs = g_ptr_array_new_full(dev->nvqs, vhost_psvq_free);
> >       QLIST_INSERT_HEAD(&vhost_vdpa_devices, v, entry);
> >
> >       vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
> > @@ -373,6 +385,17 @@ err:
> >       return;
> >   }
> >
> > +static void vhost_vdpa_svq_cleanup(struct vhost_dev *dev)
> > +{
> > +    struct vhost_vdpa *v = dev->opaque;
> > +    size_t idx;
> > +
> > +    for (idx = 0; idx < v->shadow_vqs->len; ++idx) {
> > +        vhost_svq_stop(dev, idx, g_ptr_array_index(v->shadow_vqs, idx));
> > +    }
> > +    g_ptr_array_free(v->shadow_vqs, true);
> > +}
> > +
> >   static int vhost_vdpa_cleanup(struct vhost_dev *dev)
> >   {
> >       struct vhost_vdpa *v;
> > @@ -381,6 +404,7 @@ static int vhost_vdpa_cleanup(struct vhost_dev *dev)
> >       trace_vhost_vdpa_cleanup(dev, v);
> >       vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs);
> >       memory_listener_unregister(&v->listener);
> > +    vhost_vdpa_svq_cleanup(dev);
> >       QLIST_REMOVE(v, entry);
> >
> >       dev->opaque = NULL;
> > @@ -557,7 +581,9 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
> >       if (started) {
> >           uint8_t status = 0;
> >           memory_listener_register(&v->listener, &address_space_memory);
> > -        vhost_vdpa_host_notifiers_init(dev);
> > +        if (!v->shadow_vqs_enabled) {
> > +            vhost_vdpa_host_notifiers_init(dev);
> > +        }
>
>
> This looks like a trick, why not check and setup shadow_vqs inside:
>
> 1) vhost_vdpa_host_notifiers_init()
>
> and
>
> 2) vhost_vdpa_set_vring_kick()
>
Ok I will move the checks there.
>
> >           vhost_vdpa_set_vring_ready(dev);
> >           vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
> >           vhost_vdpa_call(dev, VHOST_VDPA_GET_STATUS, &status);
> > @@ -663,10 +689,96 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
> >       return true;
> >   }
> >
> > +/*
> > + * Start shadow virtqueue.
> > + */
> > +static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
> > +{
> > +    struct vhost_vdpa *v = dev->opaque;
> > +    VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, idx);
> > +    return vhost_svq_start(dev, idx, svq);
> > +}
> > +
> > +static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > +{
> > +    struct vhost_dev *hdev = v->dev;
> > +    unsigned n;
> > +
> > +    if (enable == v->shadow_vqs_enabled) {
> > +        return hdev->nvqs;
> > +    }
> > +
> > +    if (enable) {
> > +        /* Allocate resources */
> > +        assert(v->shadow_vqs->len == 0);
> > +        for (n = 0; n < hdev->nvqs; ++n) {
> > +            VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n);
> > +            bool ok;
> > +
> > +            if (unlikely(!svq)) {
> > +                g_ptr_array_set_size(v->shadow_vqs, 0);
> > +                return 0;
> > +            }
> > +            g_ptr_array_add(v->shadow_vqs, svq);
> > +
> > +            ok = vhost_vdpa_svq_start_vq(hdev, n);
> > +            if (unlikely(!ok)) {
> > +                /* Free still not started svqs */
> > +                g_ptr_array_set_size(v->shadow_vqs, n);
> > +                enable = false;
[2]
> > +                break;
> > +            }
> > +        }
>
>
> Since there's almost no logic could be shared between enable and
> disable. Let's split those logic out into dedicated functions where the
> codes looks more easy to be reviewed (e.g have a better error handling etc).
>
Maybe it could be more clear in the code, but the reused logic is the
disabling of SVQ and the fallback in case it cannot be enabled with
[2]. But I'm not against splitting in two different functions if it
makes review easier.
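A rough shape of that split, where the enable path reuses the disable path as its fallback, could look like the sketch below. Everything here is hypothetical (the `toy_` names, the fixed queue count, the failure injector); it only illustrates the control flow, not the vhost-vdpa calls.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NVQS 4

/* Hypothetical device: per-queue "shadowed" flags plus a failure injector. */
struct toy_dev {
    bool shadowed[TOY_NVQS];
    int fail_at;      /* queue index whose start fails, or -1 for none */
    bool svq_enabled;
};

static bool toy_start_vq(struct toy_dev *d, size_t n)
{
    if ((int)n == d->fail_at) {
        return false;
    }
    d->shadowed[n] = true;
    return true;
}

static void toy_stop_vq(struct toy_dev *d, size_t n)
{
    d->shadowed[n] = false;
}

/* Stop the first 'started' queues and mark SVQ as disabled. */
static void toy_disable_svq(struct toy_dev *d, size_t started)
{
    for (size_t n = 0; n < started; n++) {
        toy_stop_vq(d, n);
    }
    d->svq_enabled = false;
}

/* Start every queue; on failure, undo the ones already started. */
static bool toy_enable_svq(struct toy_dev *d)
{
    for (size_t n = 0; n < TOY_NVQS; n++) {
        if (!toy_start_vq(d, n)) {
            toy_disable_svq(d, n); /* fallback shared with the disable path */
            return false;
        }
    }
    d->svq_enabled = true;
    return true;
}
```

The caller would then pick one of the two functions based on the requested state, instead of threading the enable flag through a single body.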
>
> > +    }
> > +
> > +    v->shadow_vqs_enabled = enable;
> > +
> > +    if (!enable) {
> > +        /* Disable all queues or clean up failed start */
> > +        for (n = 0; n < v->shadow_vqs->len; ++n) {
> > +            unsigned vq_idx = vhost_vdpa_get_vq_index(hdev, n);
> > +            VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, n);
> > +            vhost_svq_stop(hdev, n, svq);
> > +            vhost_virtqueue_start(hdev, hdev->vdev, &hdev->vqs[n], vq_idx);
> > +        }
> > +
> > +        /* Resources cleanup */
> > +        g_ptr_array_set_size(v->shadow_vqs, 0);
> > +    }
> > +
> > +    return n;
> > +}
> >
> >   void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
> >   {
> > -    error_setg(errp, "Shadow virtqueue still not implemented");
> > +    struct vhost_vdpa *v;
> > +    const char *err_cause = NULL;
> > +    bool r;
> > +
> > +    QLIST_FOREACH(v, &vhost_vdpa_devices, entry) {
> > +        if (v->dev->vdev && 0 == strcmp(v->dev->vdev->name, name)) {
> > +            break;
> > +        }
> > +    }
>
>
> I think you can iterate the NetClientStates to get the vhost-vdpa backends.
>
Right, I missed it.
>
> > +
> > +    if (!v) {
> > +        err_cause = "Device not found";
> > +        goto err;
> > +    } else if (v->notifier[0].addr) {
> > +        err_cause = "Device has host notifiers enabled";
>
>
> I don't get this.
>
At this moment of the series you can enable guest -> SVQ -> 'vdpa
device' if the device is not using the host notifiers memory region.
The right solution is to disable it for the guest, and to handle it in
SVQ. Otherwise, guest kick will bypass SVQ and
It can be done in the same patch, or we can at least disable (unmap)
them at this moment and handle them in a later patch, but for
prototyping the solution I just ignored it in this series. It will be
handled one way or another in the next one. I prefer the latter, handling
it in a different patch, but let me know if you think
otherwise.
> Btw this function should be implemented in an independent patch after
> svq is fully functional.
>
(Reasons for that are already commented at the top of this mail :) ).
Thanks!
> Thanks
>
>
> > +        goto err;
> > +    }
> > +
> > +    r = vhost_vdpa_enable_svq(v, enable);
> > +    if (unlikely(!r)) {
> > +        err_cause = "Error enabling (see monitor)";
> > +        goto err;
> > +    }
> > +
> > +err:
> > +    if (err_cause) {
> > +        error_setg(errp, "Can't enable shadow vq on %s: %s", name, err_cause);
> > +    }
> >   }
> >
> >   const VhostOps vdpa_ops = {
>
* Re: [RFC PATCH v4 05/20] vhost: Add x-vhost-enable-shadow-vq qmp
  2021-10-12 13:45       ` Markus Armbruster
@ 2021-10-14 12:01         ` Eugenio Perez Martin
  0 siblings, 0 replies; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-14 12:01 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	qemu-level, virtualization, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, Eric Blake, Stefano Garzarella
On Tue, Oct 12, 2021 at 3:46 PM Markus Armbruster <armbru@redhat.com> wrote:
>
> Eugenio Perez Martin <eperezma@redhat.com> writes:
>
> > On Tue, Oct 12, 2021 at 7:18 AM Markus Armbruster <armbru@redhat.com> wrote:
> >>
> >> Eugenio Pérez <eperezma@redhat.com> writes:
> >>
> >> > Command to enable shadow virtqueue.
> >> >
> >> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >> > ---
> >> >  qapi/net.json          | 23 +++++++++++++++++++++++
> >> >  hw/virtio/vhost-vdpa.c |  8 ++++++++
> >> >  2 files changed, 31 insertions(+)
> >> >
> >> > diff --git a/qapi/net.json b/qapi/net.json
> >> > index 7fab2e7cd8..a2c30fd455 100644
> >> > --- a/qapi/net.json
> >> > +++ b/qapi/net.json
> >> > @@ -79,6 +79,29 @@
> >> >  { 'command': 'netdev_del', 'data': {'id': 'str'},
> >> >    'allow-preconfig': true }
> >> >
> >> > +##
> >> > +# @x-vhost-enable-shadow-vq:
> >> > +#
> >> > +# Use vhost shadow virtqueue.
> >> > +#
> >> > +# @name: the device name of the VirtIO device
> >>
> >> Is this a qdev ID?  A network client name?
> >
> > At this moment it is the virtio device name, the one specified at the
> > call of "virtio_init". But this should change, maybe the qdev id or
> > something that can be provided by the command line fits better here.
>
> To refer to a device backend, we commonly use a backend-specific ID.
> For network backends, that's NetClientState member name.
>
OK, so I will use the NetClientState member name; it fits much better
here than the virtio device name.
Thanks!
> To refer to a device frontend, we commonly use a qdev ID or a QOM path.
>
> [...]
>
* Re: [RFC PATCH v4 09/20] vdpa: Save call_fd in vhost-vdpa
  2021-10-13  3:43   ` Jason Wang
@ 2021-10-14 12:11     ` Eugenio Perez Martin
  0 siblings, 0 replies; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-14 12:11 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-devel, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Stefano Garzarella
On Wed, Oct 13, 2021 at 5:43 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/10/1 3:05 PM, Eugenio Pérez wrote:
> > We need to know it to switch to Shadow VirtQueue.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   include/hw/virtio/vhost-vdpa.h | 2 ++
> >   hw/virtio/vhost-vdpa.c         | 5 +++++
> >   2 files changed, 7 insertions(+)
> >
> > diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
> > index 48aae59d8e..fddac248b3 100644
> > --- a/include/hw/virtio/vhost-vdpa.h
> > +++ b/include/hw/virtio/vhost-vdpa.h
> > @@ -30,6 +30,8 @@ typedef struct vhost_vdpa {
> >       GPtrArray *shadow_vqs;
> >       struct vhost_dev *dev;
> >       QLIST_ENTRY(vhost_vdpa) entry;
> > +    /* File descriptor the device uses to call VM/SVQ */
> > +    int call_fd[VIRTIO_QUEUE_MAX];
>
>
> Any reason we don't do this for kick_fd or why
> virtio_queue_get_guest_notifier() can't work here? Need a comment or
> commit log.
>
> I think we need to have a consistent way to handle both kick and call fd.
>
> Thanks
>
The reasons for it have been given in the answers to patch 08/20, since
both discussions have somehow converged there. Please let me know if
you think otherwise and this needs to be continued here.
Thanks!
>
> >       VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
> >   } VhostVDPA;
> >
> > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > index 36c954a779..57a857444a 100644
> > --- a/hw/virtio/vhost-vdpa.c
> > +++ b/hw/virtio/vhost-vdpa.c
> > @@ -652,7 +652,12 @@ static int vhost_vdpa_set_vring_kick(struct vhost_dev *dev,
> >   static int vhost_vdpa_set_vring_call(struct vhost_dev *dev,
> >                                          struct vhost_vring_file *file)
> >   {
> > +    struct vhost_vdpa *v = dev->opaque;
> > +    int vdpa_idx = vhost_vdpa_get_vq_index(dev, file->index);
> > +
> >       trace_vhost_vdpa_set_vring_call(dev, file->index, file->fd);
> > +
> > +    v->call_fd[vdpa_idx] = file->fd;
> >       return vhost_vdpa_call(dev, VHOST_SET_VRING_CALL, file);
> >   }
> >
>
* Re: [RFC PATCH v4 10/20] vhost-vdpa: Take into account SVQ in vhost_vdpa_set_vring_call
  2021-10-13  3:43   ` Jason Wang
@ 2021-10-14 12:18     ` Eugenio Perez Martin
  0 siblings, 0 replies; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-14 12:18 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-devel, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Michael Lilja, Stefano Garzarella
On Wed, Oct 13, 2021 at 5:43 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/10/1 3:05 PM, Eugenio Pérez wrote:
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   hw/virtio/vhost-vdpa.c | 17 ++++++++++++++---
> >   1 file changed, 14 insertions(+), 3 deletions(-)
> >
> > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > index 57a857444a..bc34de2439 100644
> > --- a/hw/virtio/vhost-vdpa.c
> > +++ b/hw/virtio/vhost-vdpa.c
> > @@ -649,16 +649,27 @@ static int vhost_vdpa_set_vring_kick(struct vhost_dev *dev,
> >       return vhost_vdpa_call(dev, VHOST_SET_VRING_KICK, file);
> >   }
> >
> > +static int vhost_vdpa_set_vring_dev_call(struct vhost_dev *dev,
> > +                                         struct vhost_vring_file *file)
> > +{
> > +    trace_vhost_vdpa_set_vring_call(dev, file->index, file->fd);
> > +    return vhost_vdpa_call(dev, VHOST_SET_VRING_CALL, file);
> > +}
> > +
> >   static int vhost_vdpa_set_vring_call(struct vhost_dev *dev,
> >                                          struct vhost_vring_file *file)
> >   {
> >       struct vhost_vdpa *v = dev->opaque;
> >       int vdpa_idx = vhost_vdpa_get_vq_index(dev, file->index);
> >
> > -    trace_vhost_vdpa_set_vring_call(dev, file->index, file->fd);
> > -
> >       v->call_fd[vdpa_idx] = file->fd;
> > -    return vhost_vdpa_call(dev, VHOST_SET_VRING_CALL, file);
> > +    if (v->shadow_vqs_enabled) {
> > +        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, vdpa_idx);
> > +        vhost_svq_set_guest_call_notifier(svq, file->fd);
> > +        return 0;
> > +    } else {
> > +        return vhost_vdpa_set_vring_dev_call(dev, file);
> > +    }
>
>
> I feel like we should do the same for kick fd.
>
> Thanks
>
I think this also has been answered on 08/20, but feel free to tell me
otherwise if I missed something.
Thanks!
>
> >   }
> >
> >   static int vhost_vdpa_get_features(struct vhost_dev *dev,
>
* Re: [RFC PATCH v4 12/20] virtio: Add vhost_shadow_vq_get_vring_addr
  2021-10-13  3:54   ` Jason Wang
@ 2021-10-14 14:39     ` Eugenio Perez Martin
  0 siblings, 0 replies; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-14 14:39 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-devel, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Michael Lilja, Stefano Garzarella
On Wed, Oct 13, 2021 at 5:54 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/10/1 3:05 PM, Eugenio Pérez wrote:
> > It reports the shadow virtqueue address from qemu virtual address space
>
>
> I think both the title and commit log need more tweaks. Looking at
> the code, what it does is actually introduce the vring into svq.
>
Right, this commit evolved a little bit, providing more functionality
that is not reflected in the commit message. I will expand it.
>
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   hw/virtio/vhost-shadow-virtqueue.h |  4 +++
> >   hw/virtio/vhost-shadow-virtqueue.c | 50 ++++++++++++++++++++++++++++++
> >   2 files changed, 54 insertions(+)
> >
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> > index 237cfceb9c..2df3d117f5 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.h
> > +++ b/hw/virtio/vhost-shadow-virtqueue.h
> > @@ -16,6 +16,10 @@ typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
> >
> >   EventNotifier *vhost_svq_get_svq_call_notifier(VhostShadowVirtqueue *svq);
> >   void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd);
> > +void vhost_svq_get_vring_addr(const VhostShadowVirtqueue *svq,
> > +                              struct vhost_vring_addr *addr);
> > +size_t vhost_svq_driver_area_size(const VhostShadowVirtqueue *svq);
> > +size_t vhost_svq_device_area_size(const VhostShadowVirtqueue *svq);
> >
> >   bool vhost_svq_start(struct vhost_dev *dev, unsigned idx,
> >                        VhostShadowVirtqueue *svq);
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > index 3fe129cf63..5c1899f6af 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > @@ -18,6 +18,9 @@
> >
> >   /* Shadow virtqueue to relay notifications */
> >   typedef struct VhostShadowVirtqueue {
> > +    /* Shadow vring */
> > +    struct vring vring;
> > +
> >       /* Shadow kick notifier, sent to vhost */
> >       EventNotifier kick_notifier;
> >       /* Shadow call notifier, sent to vhost */
> > @@ -38,6 +41,9 @@ typedef struct VhostShadowVirtqueue {
> >
> >       /* Virtio queue shadowing */
> >       VirtQueue *vq;
> > +
> > +    /* Virtio device */
> > +    VirtIODevice *vdev;
> >   } VhostShadowVirtqueue;
> >
> >   /* Forward guest notifications */
> > @@ -93,6 +99,35 @@ void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd)
> >       event_notifier_init_fd(&svq->guest_call_notifier, call_fd);
> >   }
> >
> > +/*
> > + * Get the shadow vq vring address.
> > + * @svq Shadow virtqueue
> > + * @addr Destination to store address
> > + */
> > +void vhost_svq_get_vring_addr(const VhostShadowVirtqueue *svq,
> > +                              struct vhost_vring_addr *addr)
> > +{
> > +    addr->desc_user_addr = (uint64_t)svq->vring.desc;
> > +    addr->avail_user_addr = (uint64_t)svq->vring.avail;
> > +    addr->used_user_addr = (uint64_t)svq->vring.used;
> > +}
> > +
> > +size_t vhost_svq_driver_area_size(const VhostShadowVirtqueue *svq)
> > +{
> > +    uint16_t vq_idx = virtio_get_queue_index(svq->vq);
> > +    size_t desc_size = virtio_queue_get_desc_size(svq->vdev, vq_idx);
> > +    size_t avail_size = virtio_queue_get_avail_size(svq->vdev, vq_idx);
> > +
> > +    return ROUND_UP(desc_size + avail_size, qemu_real_host_page_size);
>
>
> Is this round up required by the spec?
>
No. I was trying to avoid exposing more qemu data to the device than
needed: since mapping is done at page granularity, anything allocated
right after the region within the same page would be mapped too. I will
expand this with a comment, but if there are other ways to achieve it,
or it is not needed, please let me know!
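
To illustrate the page-granularity concern above, here is a minimal
sketch in plain C. It is not the patch code: round_up_pow2() and
exposed_tail_bytes() are hypothetical helpers that only stand in for
what ROUND_UP(..., qemu_real_host_page_size) achieves, and the 4 KiB
page size used in the note below is an assumption.

```c
#include <assert.h>
#include <stddef.h>

/* Round n up to the next multiple of align; align must be a power of two. */
static size_t round_up_pow2(size_t n, size_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (n + align - 1) & ~(align - 1);
}

/*
 * Bytes at the tail of the last mapped page that do not belong to the
 * ring. If the allocation is not padded up to a page boundary, unrelated
 * data could be placed there and would be visible to the device once the
 * page is mapped.
 */
static size_t exposed_tail_bytes(size_t ring_bytes, size_t page_size)
{
    return round_up_pow2(ring_bytes, page_size) - ring_bytes;
}
```

For example, a 4612-byte driver area with 4096-byte pages is mapped as
8192 bytes, so 3580 bytes of the second page would be reachable by the
device unless the allocation itself is padded to 8192 bytes.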
>
> > +}
> > +
> > +size_t vhost_svq_device_area_size(const VhostShadowVirtqueue *svq)
> > +{
> > +    uint16_t vq_idx = virtio_get_queue_index(svq->vq);
> > +    size_t used_size = virtio_queue_get_used_size(svq->vdev, vq_idx);
> > +    return ROUND_UP(used_size, qemu_real_host_page_size);
> > +}
> > +
> >   /*
> >    * Restore the vhost guest to host notifier, i.e., disables svq effect.
> >    */
> > @@ -178,6 +213,10 @@ void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> >   VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> >   {
> >       int vq_idx = dev->vq_index + idx;
> > +    unsigned num = virtio_queue_get_num(dev->vdev, vq_idx);
> > +    size_t desc_size = virtio_queue_get_desc_size(dev->vdev, vq_idx);
> > +    size_t driver_size;
> > +    size_t device_size;
> >       g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
> >       int r;
> >
> > @@ -196,6 +235,15 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> >       }
> >
> >       svq->vq = virtio_get_queue(dev->vdev, vq_idx);
> > +    svq->vdev = dev->vdev;
> > +    driver_size = vhost_svq_driver_area_size(svq);
> > +    device_size = vhost_svq_device_area_size(svq);
> > +    svq->vring.num = num;
> > +    svq->vring.desc = qemu_memalign(qemu_real_host_page_size, driver_size);
> > +    svq->vring.avail = (void *)((char *)svq->vring.desc + desc_size);
> > +    memset(svq->vring.desc, 0, driver_size);
>
>
> Any reason for using the contiguous area for both desc and avail?
>
No special reason. It can be split, but if we keep the page-size
padding, a contiguous area could save memory, IOTLB entries, etc. Not
that it's going to be a big difference, but still.
Thanks!
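
As a rough illustration of the saving mentioned above, the sketch below
counts the pages needed for the driver area when the descriptor table
and the avail ring share one allocation versus when each is padded
separately. It assumes a split ring without event idx (16 bytes per
descriptor, 4 + 2 * num bytes for the avail ring); all function names
are hypothetical and not part of the patch.

```c
#include <stddef.h>

/* Number of pages needed to hold the given number of bytes. */
static size_t pages_for(size_t bytes, size_t page_size)
{
    return (bytes + page_size - 1) / page_size;
}

/* Split-ring descriptor table: 16 bytes per entry. */
static size_t desc_bytes(unsigned num)
{
    return 16 * (size_t)num;
}

/* Split-ring avail area without event idx: flags + idx + ring[num]. */
static size_t avail_bytes(unsigned num)
{
    return 4 + 2 * (size_t)num;
}

/* desc and avail in one page-aligned, page-padded allocation. */
static size_t driver_area_pages_contiguous(unsigned num, size_t page_size)
{
    return pages_for(desc_bytes(num) + avail_bytes(num), page_size);
}

/* desc and avail each in their own page-aligned, page-padded allocation. */
static size_t driver_area_pages_split(unsigned num, size_t page_size)
{
    return pages_for(desc_bytes(num), page_size) +
           pages_for(avail_bytes(num), page_size);
}
```

With 4096-byte pages, a 128-entry queue needs one page when contiguous
and two when split, while a 256-entry queue needs two pages either way,
which matches the "not a big difference" remark.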
> Thanks
>
>
> > +    svq->vring.used = qemu_memalign(qemu_real_host_page_size, device_size);
> > +    memset(svq->vring.used, 0, device_size);
> >       event_notifier_set_handler(&svq->call_notifier,
> >                                  vhost_svq_handle_call);
> >       return g_steal_pointer(&svq);
> > @@ -215,5 +263,7 @@ void vhost_svq_free(VhostShadowVirtqueue *vq)
> >       event_notifier_cleanup(&vq->kick_notifier);
> >       event_notifier_set_handler(&vq->call_notifier, NULL);
> >       event_notifier_cleanup(&vq->call_notifier);
> > +    qemu_vfree(vq->vring.desc);
> > +    qemu_vfree(vq->vring.used);
> >       g_free(vq);
> >   }
>
* Re: [RFC PATCH v4 13/20] vdpa: Save host and guest features
  2021-10-13  3:56   ` Jason Wang
@ 2021-10-14 15:03     ` Eugenio Perez Martin
  0 siblings, 0 replies; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-14 15:03 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-devel, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Stefano Garzarella
On Wed, Oct 13, 2021 at 5:57 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/10/1 3:05 PM, Eugenio Pérez wrote:
> > Those are needed for SVQ: Host ones are needed to check if SVQ knows
> > how to talk with the device and for feature negotiation, and guest ones
> > to know if SVQ can talk with it.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   include/hw/virtio/vhost-vdpa.h |  2 ++
> >   hw/virtio/vhost-vdpa.c         | 31 ++++++++++++++++++++++++++++---
> >   2 files changed, 30 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
> > index fddac248b3..9044ae694b 100644
> > --- a/include/hw/virtio/vhost-vdpa.h
> > +++ b/include/hw/virtio/vhost-vdpa.h
> > @@ -26,6 +26,8 @@ typedef struct vhost_vdpa {
> >       int device_fd;
> >       uint32_t msg_type;
> >       MemoryListener listener;
> > +    uint64_t host_features;
> > +    uint64_t guest_features;
>
>
> Any reason that we can't use the features stored in VirtioDevice?
>
> Thanks
>
It was easier to handle the non-standard _F_STOP feature flag in
vhost-vdpa, but I think we can use VirtIODevice flags for the next
series.
Thanks!
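
For context, the kind of check the saved feature sets are meant to
enable could look like the sketch below. It is only an assumption about
the shape of such a check, based on the restrictions stated in this
series (modern devices only, no event idx, no indirect descriptors).
svq_features_ok() is a hypothetical name; the bit numbers are the ones
defined by the VirtIO specification.

```c
#include <stdbool.h>
#include <stdint.h>

#define BIT_ULL(n) (1ULL << (n))

/* Feature bit numbers as defined by the VirtIO specification. */
#define VIRTIO_RING_F_INDIRECT_DESC 28
#define VIRTIO_RING_F_EVENT_IDX     29
#define VIRTIO_F_VERSION_1          32

/*
 * Hypothetical helper: return true if a feature set (host or guest) only
 * uses what this sketch of SVQ understands, i.e. a modern device without
 * indirect descriptors and without event idx.
 */
static bool svq_features_ok(uint64_t features)
{
    const uint64_t unsupported = BIT_ULL(VIRTIO_RING_F_INDIRECT_DESC) |
                                 BIT_ULL(VIRTIO_RING_F_EVENT_IDX);

    if (!(features & BIT_ULL(VIRTIO_F_VERSION_1))) {
        return false; /* legacy or transitional layout is not handled */
    }
    return (features & unsupported) == 0;
}
```

The same predicate could be applied to both v->host_features and
v->guest_features, wherever they end up being stored.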
>
> >       bool shadow_vqs_enabled;
> >       GPtrArray *shadow_vqs;
> >       struct vhost_dev *dev;
> > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > index 6c5f4c98b8..a057e8277d 100644
> > --- a/hw/virtio/vhost-vdpa.c
> > +++ b/hw/virtio/vhost-vdpa.c
> > @@ -439,10 +439,19 @@ static int vhost_vdpa_set_mem_table(struct vhost_dev *dev,
> >       return 0;
> >   }
> >
> > -static int vhost_vdpa_set_features(struct vhost_dev *dev,
> > -                                   uint64_t features)
> > +/**
> > + * Internal set_features() that follows vhost/VirtIO protocol for that
> > + */
> > +static int vhost_vdpa_backend_set_features(struct vhost_dev *dev,
> > +                                           uint64_t features)
> >   {
> > +    struct vhost_vdpa *v = dev->opaque;
> > +
> >       int ret;
> > +    if (v->host_features & BIT_ULL(VIRTIO_F_QUEUE_STATE)) {
> > +        features |= BIT_ULL(VIRTIO_F_QUEUE_STATE);
> > +    }
> > +
> >       trace_vhost_vdpa_set_features(dev, features);
> >       ret = vhost_vdpa_call(dev, VHOST_SET_FEATURES, &features);
> >       uint8_t status = 0;
> > @@ -455,6 +464,17 @@ static int vhost_vdpa_set_features(struct vhost_dev *dev,
> >       return !(status & VIRTIO_CONFIG_S_FEATURES_OK);
> >   }
> >
> > +/**
> > + * Exposed vhost set features
> > + */
> > +static int vhost_vdpa_set_features(struct vhost_dev *dev,
> > +                                   uint64_t features)
> > +{
> > +    struct vhost_vdpa *v = dev->opaque;
> > +    v->guest_features = features;
> > +    return vhost_vdpa_backend_set_features(dev, features);
> > +}
> > +
> >   static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)
> >   {
> >       uint64_t features;
> > @@ -673,12 +693,17 @@ static int vhost_vdpa_set_vring_call(struct vhost_dev *dev,
> >   }
> >
> >   static int vhost_vdpa_get_features(struct vhost_dev *dev,
> > -                                     uint64_t *features)
> > +                                   uint64_t *features)
> >   {
> >       int ret;
> >
> >       ret = vhost_vdpa_call(dev, VHOST_GET_FEATURES, features);
> >       trace_vhost_vdpa_get_features(dev, *features);
> > +
> > +    if (ret == 0) {
> > +        struct vhost_vdpa *v = dev->opaque;
> > +        v->host_features = *features;
> > +    }
> >       return ret;
> >   }
> >
>
* Re: [RFC PATCH v4 15/20] vhost: Shadow virtqueue buffers forwarding
  2021-10-12 13:48       ` Markus Armbruster
@ 2021-10-14 15:04         ` Eugenio Perez Martin
  0 siblings, 0 replies; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-14 15:04 UTC (permalink / raw)
  To: Markus Armbruster
  Cc: Parav Pandit, Juan Quintela, Jason Wang, Michael S. Tsirkin,
	qemu-devel, virtualization, Harpreet Singh Anand, Xiao W Wang,
	Stefan Hajnoczi, Eli Cohen, Eric Blake, Michael Lilja,
	Stefano Garzarella
On Tue, Oct 12, 2021 at 3:48 PM Markus Armbruster <armbru@redhat.com> wrote:
>
> Eugenio Perez Martin <eperezma@redhat.com> writes:
>
> > On Tue, Oct 12, 2021 at 7:21 AM Markus Armbruster <armbru@redhat.com> wrote:
> >>
> >> Eugenio Pérez <eperezma@redhat.com> writes:
> >>
> >> > Initial version of shadow virtqueue that actually forward buffers. There
> >> > are no iommu support at the moment, and that will be addressed in future
> >> > patches of this series. Since all vhost-vdpa devices uses forced IOMMU,
> >> > this means that SVQ is not usable at this point of the series on any
> >> > device.
> >> >
> >> > For simplicity it only supports modern devices, that expects vring
> >> > in little endian, with split ring and no event idx or indirect
> >> > descriptors. Support for them will not be added in this series.
> >> >
> >> > It reuses the VirtQueue code for the device part. The driver part is
> >> > based on Linux's virtio_ring driver, but with stripped functionality
> >> > and optimizations so it's easier to review. Later commits add simpler
> >> > ones.
> >> >
> >> > SVQ uses VIRTIO_CONFIG_S_DEVICE_STOPPED to pause the device and
> >> > retrieve its status (next available idx the device was going to
> >> > consume) race-free. It can later reset the device to replace vring
> >> > addresses etc. When SVQ starts qemu can resume consuming the guest's
> >> > driver ring from that state, without notice from the latter.
> >> >
> >> > This status bit VIRTIO_CONFIG_S_DEVICE_STOPPED is currently discussed
> >> > in VirtIO, and is implemented in qemu VirtIO-net devices in previous
> >> > commits.
> >> >
> >> > Removal of _S_DEVICE_STOPPED bit (in other words, resuming the device)
> >> > can be done in the future if an use case arises. At this moment we can
> >> > just rely on reseting the full device.
> >> >
> >> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >> > ---
> >> >  qapi/net.json                      |   2 +-
> >> >  hw/virtio/vhost-shadow-virtqueue.c | 237 ++++++++++++++++++++++++++++-
> >> >  hw/virtio/vhost-vdpa.c             | 109 ++++++++++++-
> >> >  3 files changed, 337 insertions(+), 11 deletions(-)
> >> >
> >> > diff --git a/qapi/net.json b/qapi/net.json
> >> > index fe546b0e7c..1f4a55f2c5 100644
> >> > --- a/qapi/net.json
> >> > +++ b/qapi/net.json
> >> > @@ -86,7 +86,7 @@
> >> >  #
> >> >  # @name: the device name of the VirtIO device
> >> >  #
> >> > -# @enable: true to use the alternate shadow VQ notifications
> >> > +# @enable: true to use the alternate shadow VQ buffers fowarding path
> >>
> >> Uh, why does the flag change meaning half-way through this series?
> >>
> >
> > Before this patch, the SVQ mode just makes an extra hop for
> > notifications. Guest ones are now received by qemu via ioeventfd, and
> > qemu forwards them to the device using a different eventfd. The
> > reverse is also true: the device ones will be received by qemu by
> > device call fd, and then qemu will forward them to the guest using a
> > different irqfd.
> >
> > This intermediate step is not very useful by itself, but helps for
> > checking that that part of the communication works fine, with no need
> > for shadow virtqueue to understand vring format. Doing that way also
> > produces smaller patches.
> >
> > So it makes sense to me to tell what QMP command does exactly at every
> > point of the series. However I can directly document it as "use the
> > alternate shadow VQ buffers forwarding path" from the beginning.
> >
> > Does this make sense, or will it be better to write the final
> > intention of the command?
> >
> > Thanks!
>
> Working your explanation into commit messages and possibly comments
> should do.
>
Got it, I will include them in both. Thanks!
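
The "extra hop" for notifications described in the quoted explanation
can be sketched with plain Linux eventfds. This is not the QEMU
EventNotifier code: relay_notification() is a hypothetical helper, and
it assumes both descriptors were created with
eventfd(0, EFD_NONBLOCK).

```c
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Consume a pending notification on from_fd and, if there was one,
 * signal to_fd. Returns 1 if a notification was forwarded, 0 if nothing
 * was pending, and -1 on error.
 */
static int relay_notification(int from_fd, int to_fd)
{
    uint64_t count;
    uint64_t one = 1;
    ssize_t r = read(from_fd, &count, sizeof(count));

    if (r != (ssize_t)sizeof(count)) {
        /* A non-blocking eventfd with a zero counter fails with EAGAIN. */
        return (r < 0 && errno == EAGAIN) ? 0 : -1;
    }

    /* Several writes to from_fd collapse into a single forwarded event. */
    if (write(to_fd, &one, sizeof(one)) != (ssize_t)sizeof(one)) {
        return -1;
    }
    return 1;
}
```

In the guest-to-device direction from_fd would play the role of the
ioeventfd and to_fd the device kick fd; in the reverse direction they
would be the device call fd and the guest irqfd.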
* Re: [RFC PATCH v4 11/20] vhost: Route host->guest notification through shadow virtqueue
  2021-10-13  3:49   ` Jason Wang
@ 2021-10-14 15:58     ` Eugenio Perez Martin
  2021-10-15  4:24       ` Jason Wang
  0 siblings, 1 reply; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-14 15:58 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-devel, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Michael Lilja, Stefano Garzarella
On Wed, Oct 13, 2021 at 5:49 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/10/1 3:05 PM, Eugenio Pérez wrote:
> > This will make qemu aware of the device used buffers, allowing it to
> > write the guest memory with its contents if needed.
> >
> > Since the use of vhost_virtqueue_start can unmasks and discard call
> > events, vhost_virtqueue_start should be modified in one of these ways:
> > * Split in two: One of them uses all logic to start a queue with no
> >    side effects for the guest, and another one tha actually assumes that
> >    the guest has just started the device. Vdpa should use just the
> >    former.
> > * Actually store and check if the guest notifier is masked, and do it
> >    conditionally.
> > * Left as it is, and duplicate all the logic in vhost-vdpa.
>
>
> Btw, the log is not clear. I guess this patch goes for method 3. If
> yes, we need to explain it and why.
>
> Thanks
>
Sorry for being unclear. This commit log (and code) just exposes the
problem and the solutions I came up with, but does nothing to solve it.
I'm actually going for method 3 in the next series, but I'm open to
doing it differently.
>
> >
> > Signed-off-by: Eugenio Pérez<eperezma@redhat.com>
>
* Re: [RFC PATCH v4 11/20] vhost: Route host->guest notification through shadow virtqueue
  2021-10-13  3:47   ` Jason Wang
@ 2021-10-14 16:39     ` Eugenio Perez Martin
  2021-10-15  4:42       ` Jason Wang
  0 siblings, 1 reply; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-14 16:39 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-devel, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Michael Lilja, Stefano Garzarella
On Wed, Oct 13, 2021 at 5:47 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/10/1 3:05 PM, Eugenio Pérez wrote:
> > This will make qemu aware of the device used buffers, allowing it to
> > write the guest memory with its contents if needed.
> >
> > Since the use of vhost_virtqueue_start can unmasks and discard call
> > events, vhost_virtqueue_start should be modified in one of these ways:
> > * Split in two: One of them uses all logic to start a queue with no
> >    side effects for the guest, and another one tha actually assumes that
> >    the guest has just started the device. Vdpa should use just the
> >    former.
> > * Actually store and check if the guest notifier is masked, and do it
> >    conditionally.
> > * Left as it is, and duplicate all the logic in vhost-vdpa.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   hw/virtio/vhost-shadow-virtqueue.c | 19 +++++++++++++++
> >   hw/virtio/vhost-vdpa.c             | 38 +++++++++++++++++++++++++++++-
> >   2 files changed, 56 insertions(+), 1 deletion(-)
> >
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > index 21dc99ab5d..3fe129cf63 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > @@ -53,6 +53,22 @@ static void vhost_handle_guest_kick(EventNotifier *n)
> >       event_notifier_set(&svq->kick_notifier);
> >   }
> >
> > +/* Forward vhost notifications */
> > +static void vhost_svq_handle_call_no_test(EventNotifier *n)
> > +{
> > +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> > +                                             call_notifier);
> > +
> > +    event_notifier_set(&svq->guest_call_notifier);
> > +}
> > +
> > +static void vhost_svq_handle_call(EventNotifier *n)
> > +{
> > +    if (likely(event_notifier_test_and_clear(n))) {
> > +        vhost_svq_handle_call_no_test(n);
> > +    }
> > +}
> > +
> >   /*
> >    * Obtain the SVQ call notifier, where vhost device notifies SVQ that there
> >    * exists pending used buffers.
> > @@ -180,6 +196,8 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> >       }
> >
> >       svq->vq = virtio_get_queue(dev->vdev, vq_idx);
> > +    event_notifier_set_handler(&svq->call_notifier,
> > +                               vhost_svq_handle_call);
> >       return g_steal_pointer(&svq);
> >
> >   err_init_call_notifier:
> > @@ -195,6 +213,7 @@ err_init_kick_notifier:
> >   void vhost_svq_free(VhostShadowVirtqueue *vq)
> >   {
> >       event_notifier_cleanup(&vq->kick_notifier);
> > +    event_notifier_set_handler(&vq->call_notifier, NULL);
> >       event_notifier_cleanup(&vq->call_notifier);
> >       g_free(vq);
> >   }
> > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > index bc34de2439..6c5f4c98b8 100644
> > --- a/hw/virtio/vhost-vdpa.c
> > +++ b/hw/virtio/vhost-vdpa.c
> > @@ -712,13 +712,40 @@ static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
> >   {
> >       struct vhost_vdpa *v = dev->opaque;
> >       VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, idx);
> > -    return vhost_svq_start(dev, idx, svq);
> > +    EventNotifier *vhost_call_notifier = vhost_svq_get_svq_call_notifier(svq);
> > +    struct vhost_vring_file vhost_call_file = {
> > +        .index = idx + dev->vq_index,
> > +        .fd = event_notifier_get_fd(vhost_call_notifier),
> > +    };
> > +    int r;
> > +    bool b;
> > +
> > +    /* Set shadow vq -> guest notifier */
> > +    assert(v->call_fd[idx]);
>
>
> We need to avoid the assert() here. In which case can we hit this?
>
I would say that there is no way we can actually hit it, so let's remove it.
>
> > +    vhost_svq_set_guest_call_notifier(svq, v->call_fd[idx]);
> > +
> > +    b = vhost_svq_start(dev, idx, svq);
> > +    if (unlikely(!b)) {
> > +        return false;
> > +    }
> > +
> > +    /* Set device -> SVQ notifier */
> > +    r = vhost_vdpa_set_vring_dev_call(dev, &vhost_call_file);
> > +    if (unlikely(r)) {
> > +        error_report("vhost_vdpa_set_vring_call for shadow vq failed");
> > +        return false;
> > +    }
>
>
> Similar to kick, do we need to set_vring_call() before vhost_svq_start()?
>
It should not matter at this moment, because the device should not be
started at this point, and device calls should not run
vhost_svq_handle_call until the BQL is released.
The "logic" of doing it afterwards is to make clear that svq must be
fully initialized before processing device calls, even if we move SVQ
to its own iothread or similar. But this could certainly be done
before vhost_svq_start.
>
> > +
> > +    /* Check for pending calls */
> > +    event_notifier_set(vhost_call_notifier);
>
>
> Interesting, can this result in a spurious interrupt?
>
This actually "queues" a vhost_svq_handle_call to run after the BQL is
released, at which point the device should be fully reset. In that
regard, if there are no used descriptors, no irq will be raised to the
guest. Does that answer the question? Or have I missed something?
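
A minimal sketch of why a manually set call notifier does not have to
produce a spurious interrupt, assuming the call handler compares the
device's used index with the last one it relayed (as the reply implies
for the buffer-forwarding version; the plain notification relay in this
patch forwards unconditionally). struct svq_sketch and
svq_sketch_handle_call() are hypothetical and not the names used in the
series.

```c
#include <stdbool.h>
#include <stdint.h>

struct svq_sketch {
    uint16_t device_used_idx; /* used idx as published by the device */
    uint16_t last_used_idx;   /* last used idx already relayed to the guest */
    unsigned guest_irqs;      /* how many times the guest was notified */
};

/*
 * Call handler: runs whenever the call notifier fires, including the
 * manual "check for pending calls" kick. It only notifies the guest if
 * the device actually published new used buffers.
 */
static bool svq_sketch_handle_call(struct svq_sketch *svq)
{
    if (svq->device_used_idx == svq->last_used_idx) {
        return false; /* nothing new, so no irq for the guest */
    }
    svq->last_used_idx = svq->device_used_idx;
    svq->guest_irqs++;
    return true;
}
```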
>
> > +    return true;
> >   }
> >
> >   static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> >   {
> >       struct vhost_dev *hdev = v->dev;
> >       unsigned n;
> > +    int r;
> >
> >       if (enable == v->shadow_vqs_enabled) {
> >           return hdev->nvqs;
> > @@ -752,9 +779,18 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> >       if (!enable) {
> >           /* Disable all queues or clean up failed start */
> >           for (n = 0; n < v->shadow_vqs->len; ++n) {
> > +            struct vhost_vring_file file = {
> > +                .index = vhost_vdpa_get_vq_index(hdev, n),
> > +                .fd = v->call_fd[n],
> > +            };
> > +
> > +            r = vhost_vdpa_set_vring_call(hdev, &file);
> > +            assert(r == 0);
> > +
> >               unsigned vq_idx = vhost_vdpa_get_vq_index(hdev, n);
> >               VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, n);
> >               vhost_svq_stop(hdev, n, svq);
> > +            /* TODO: This can unmask or override call fd! */
>
>
> I don't get this comment. Does this mean the current code can't work
> with mask_notifiers? If yes, this is something we need to fix.
>
Yes, but it will be addressed in the next series. I should have
explained it better here, sorry :).
Thanks!
> Thanks
>
>
> >               vhost_virtqueue_start(hdev, hdev->vdev, &hdev->vqs[n], vq_idx);
> >           }
> >
>
* Re: [RFC PATCH v4 15/20] vhost: Shadow virtqueue buffers forwarding
  2021-10-13  4:31   ` Jason Wang
@ 2021-10-14 17:56     ` Eugenio Perez Martin
  2021-10-15  4:23       ` Jason Wang
  0 siblings, 1 reply; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-14 17:56 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-devel, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Michael Lilja, Stefano Garzarella
On Wed, Oct 13, 2021 at 6:31 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/10/1 3:05 PM, Eugenio Pérez wrote:
> > Initial version of shadow virtqueue that actually forward buffers. There
> > are no iommu support at the moment, and that will be addressed in future
> > patches of this series. Since all vhost-vdpa devices uses forced IOMMU,
> > this means that SVQ is not usable at this point of the series on any
> > device.
> >
> > For simplicity it only supports modern devices, that expects vring
> > in little endian, with split ring and no event idx or indirect
> > descriptors. Support for them will not be added in this series.
> >
> > It reuses the VirtQueue code for the device part. The driver part is
> > based on Linux's virtio_ring driver, but with stripped functionality
> > and optimizations so it's easier to review. Later commits add simpler
> > ones.
> >
> > SVQ uses VIRTIO_CONFIG_S_DEVICE_STOPPED to pause the device and
> > retrieve its status (next available idx the device was going to
> > consume) race-free. It can later reset the device to replace vring
> > addresses etc. When SVQ starts qemu can resume consuming the guest's
> > driver ring from that state, without notice from the latter.
> >
> > This status bit VIRTIO_CONFIG_S_DEVICE_STOPPED is currently discussed
> > in VirtIO, and is implemented in qemu VirtIO-net devices in previous
> > commits.
> >
> > Removal of _S_DEVICE_STOPPED bit (in other words, resuming the device)
> > can be done in the future if an use case arises. At this moment we can
> > just rely on reseting the full device.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   qapi/net.json                      |   2 +-
> >   hw/virtio/vhost-shadow-virtqueue.c | 237 ++++++++++++++++++++++++++++-
> >   hw/virtio/vhost-vdpa.c             | 109 ++++++++++++-
> >   3 files changed, 337 insertions(+), 11 deletions(-)
> >
> > diff --git a/qapi/net.json b/qapi/net.json
> > index fe546b0e7c..1f4a55f2c5 100644
> > --- a/qapi/net.json
> > +++ b/qapi/net.json
> > @@ -86,7 +86,7 @@
> >   #
> >   # @name: the device name of the VirtIO device
> >   #
> > -# @enable: true to use the alternate shadow VQ notifications
> > +# @enable: true to use the alternate shadow VQ buffers fowarding path
> >   #
> >   # Returns: Error if failure, or 'no error' for success.
> >   #
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > index 34e159d4fd..df7e6fa3ec 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > @@ -10,6 +10,7 @@
> >   #include "qemu/osdep.h"
> >   #include "hw/virtio/vhost-shadow-virtqueue.h"
> >   #include "hw/virtio/vhost.h"
> > +#include "hw/virtio/virtio-access.h"
> >
> >   #include "standard-headers/linux/vhost_types.h"
> >
> > @@ -44,15 +45,135 @@ typedef struct VhostShadowVirtqueue {
> >
> >       /* Virtio device */
> >       VirtIODevice *vdev;
> > +
> > +    /* Map for returning guest's descriptors */
> > +    VirtQueueElement **ring_id_maps;
> > +
> > +    /* Next head to expose to device */
> > +    uint16_t avail_idx_shadow;
> > +
> > +    /* Next free descriptor */
> > +    uint16_t free_head;
> > +
> > +    /* Last seen used idx */
> > +    uint16_t shadow_used_idx;
> > +
> > +    /* Next head to consume from device */
> > +    uint16_t used_idx;
>
>
> Let's use "last_used_idx" as kernel driver did.
>
Ok I will change it.
>
> >   } VhostShadowVirtqueue;
> >
> >   /* If the device is using some of these, SVQ cannot communicate */
> >   bool vhost_svq_valid_device_features(uint64_t *dev_features)
> >   {
> > -    return true;
> > +    uint64_t b;
> > +    bool r = true;
> > +
> > +    for (b = VIRTIO_TRANSPORT_F_START; b <= VIRTIO_TRANSPORT_F_END; ++b) {
> > +        switch (b) {
> > +        case VIRTIO_F_NOTIFY_ON_EMPTY:
> > +        case VIRTIO_F_ANY_LAYOUT:
> > +            /* SVQ is fine with this feature */
> > +            continue;
> > +
> > +        case VIRTIO_F_ACCESS_PLATFORM:
> > +            /* SVQ needs this feature disabled. Can't continue */
>
>
> So code can explain itself, need a comment to explain why.
>
Do you mean that it *doesn't* need a comment to explain why? In that
case I will delete them.
>
> > +            if (*dev_features & BIT_ULL(b)) {
> > +                clear_bit(b, dev_features);
> > +                r = false;
> > +            }
> > +            break;
> > +
> > +        case VIRTIO_F_VERSION_1:
> > +            /* SVQ needs this feature, so can't continue */
>
>
> A comment to explain why SVQ needs this feature.
>
Sure I will add it.
>
> > +            if (!(*dev_features & BIT_ULL(b))) {
> > +                set_bit(b, dev_features);
> > +                r = false;
> > +            }
> > +            continue;
> > +
> > +        default:
> > +            /*
> > +             * SVQ must disable this feature, let's hope the device is fine
> > +             * without it.
> > +             */
> > +            if (*dev_features & BIT_ULL(b)) {
> > +                clear_bit(b, dev_features);
> > +            }
> > +        }
> > +    }
> > +
> > +    return r;
> > +}
>
>
> Let's move this to patch 14.
>
I can move it down to 14/20, but then it is not really accurate, since
notifications forwarding can work with all feature sets. Not like we
are introducing a regression, but still.
I can always explain that in the patch message though, would that be ok?
>
> > +
> > +static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> > +                                    const struct iovec *iovec,
> > +                                    size_t num, bool more_descs, bool write)
> > +{
> > +    uint16_t i = svq->free_head, last = svq->free_head;
> > +    unsigned n;
> > +    uint16_t flags = write ? cpu_to_le16(VRING_DESC_F_WRITE) : 0;
> > +    vring_desc_t *descs = svq->vring.desc;
> > +
> > +    if (num == 0) {
> > +        return;
> > +    }
> > +
> > +    for (n = 0; n < num; n++) {
> > +        if (more_descs || (n + 1 < num)) {
> > +            descs[i].flags = flags | cpu_to_le16(VRING_DESC_F_NEXT);
> > +        } else {
> > +            descs[i].flags = flags;
> > +        }
> > +        descs[i].addr = cpu_to_le64((hwaddr)iovec[n].iov_base);
> > +        descs[i].len = cpu_to_le32(iovec[n].iov_len);
> > +
> > +        last = i;
> > +        i = cpu_to_le16(descs[i].next);
> > +    }
> > +
> > +    svq->free_head = le16_to_cpu(descs[last].next);
> > +}
> > +
> > +static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
> > +                                    VirtQueueElement *elem)
> > +{
> > +    int head;
> > +    unsigned avail_idx;
> > +    vring_avail_t *avail = svq->vring.avail;
> > +
> > +    head = svq->free_head;
> > +
> > +    /* We need some descriptors here */
> > +    assert(elem->out_num || elem->in_num);
> > +
> > +    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
> > +                            elem->in_num > 0, false);
> > +    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
> > +
> > +    /*
> > +     * Put entry in available array (but don't update avail->idx until they
> > +     * do sync).
> > +     */
> > +    avail_idx = svq->avail_idx_shadow & (svq->vring.num - 1);
> > +    avail->ring[avail_idx] = cpu_to_le16(head);
> > +    svq->avail_idx_shadow++;
> > +
> > +    /* Update avail index after the descriptor is wrote */
> > +    smp_wmb();
> > +    avail->idx = cpu_to_le16(svq->avail_idx_shadow);
> > +
> > +    return head;
> > +
> >   }
> >
> > -/* Forward guest notifications */
> > +static void vhost_svq_add(VhostShadowVirtqueue *svq, VirtQueueElement *elem)
> > +{
> > +    unsigned qemu_head = vhost_svq_add_split(svq, elem);
> > +
> > +    svq->ring_id_maps[qemu_head] = elem;
> > +}
> > +
> > +/* Handle guest->device notifications */
> >   static void vhost_handle_guest_kick(EventNotifier *n)
> >   {
> >       VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> > @@ -62,7 +183,74 @@ static void vhost_handle_guest_kick(EventNotifier *n)
> >           return;
> >       }
> >
> > -    event_notifier_set(&svq->kick_notifier);
> > +    /* Make available as many buffers as possible */
> > +    do {
> > +        if (virtio_queue_get_notification(svq->vq)) {
> > +            /* No more notifications until process all available */
> > +            virtio_queue_set_notification(svq->vq, false);
> > +        }
>
>
> This can be done outside the loop.
>
I think it cannot. The intention of doing it this way is that we check
for new available buffers *also after* enabling notifications, so we
don't miss any of them. It is more or less copied from
virtio_blk_handle_vq, which also needs to run to completion.
If we need to loop again because there are more available buffers, we
want to disable notifications again. Or am I missing something?
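To illustrate the pattern, here is a minimal standalone sketch. It uses
a toy queue, not the real VirtQueue API; every toy_* name is made up for
this example, and the "late arrivals" field only simulates a guest that
adds buffers while notifications are being re-enabled:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for a virtqueue; this is not QEMU's VirtQueue. */
struct toy_queue {
    int pending;          /* buffers already made available */
    bool notify_enabled;  /* whether the producer would kick us */
    int late_arrivals;    /* buffers that race with re-enabling */
};

/* Re-enable notifications; the producer may add buffers meanwhile. */
static void toy_enable_notification(struct toy_queue *q)
{
    q->notify_enabled = true;
    q->pending += q->late_arrivals;
    q->late_arrivals = 0;
}

/* Drain the queue; returns the number of buffers handled. */
static int toy_drain(struct toy_queue *q)
{
    int handled = 0;

    do {
        /* No more notifications until all available are processed */
        q->notify_enabled = false;

        while (q->pending > 0) {
            q->pending--;
            handled++;
        }

        toy_enable_notification(q);
        /* Re-check: buffers added during the enable got no kick */
    } while (q->pending > 0);

    return handled;
}
```

If the disable were hoisted out of the loop, the second pass in this
sketch would run with notifications still enabled.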
>
> > +
> > +        while (true) {
> > +            VirtQueueElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
> > +            if (!elem) {
> > +                break;
> > +            }
> > +
> > +            vhost_svq_add(svq, elem);
> > +            event_notifier_set(&svq->kick_notifier);
> > +        }
> > +
> > +        virtio_queue_set_notification(svq->vq, true);
>
>
> I think this can be moved to the end of this function.
>
(Same as previous answer)
> Btw, we probably need a quota to make sure the svq is not hogging the
> main event loop.
>
> Similar issue could be found in both virtio-net TX (using timer or bh)
> and TAP (a quota).
>
I think that the virtqueue size is the natural limit to that: since we
are not marking any buffers as used in the loop, there is no way that it
runs more than virtqueue size times. If it does because of an evil/bogus
guest, virtqueue_pop raises the message "Virtqueue size exceeded" and
returns NULL, effectively breaking the loop.
Virtio-net tx functions mark each buffer as used right after making it
available and using it, so they can hog the BQL. But my understanding is
that this is not possible in the SVQ case.
I can add a comment in the code to make it clearer though.
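A rough standalone model of that bound, assuming a pop that refuses once
"num" elements are in flight (the toy_* names are hypothetical and only
mimic the behaviour described above, they are not QEMU functions):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a split virtqueue; this is not QEMU's VirtQueue. */
struct toy_vq {
    unsigned num;      /* ring size */
    unsigned inuse;    /* popped but not yet marked as used */
    unsigned claimed;  /* buffers the guest claims are available */
};

/* Pop one element; fail once num elements are already in flight. */
static bool toy_pop(struct toy_vq *vq)
{
    if (vq->claimed == 0) {
        return false;
    }
    if (vq->inuse >= vq->num) {
        /* Models the "Virtqueue size exceeded" failure path */
        return false;
    }
    vq->claimed--;
    vq->inuse++;
    return true;
}

/* Count how many times a kick handler loop can iterate. */
static unsigned toy_kick_iterations(struct toy_vq *vq)
{
    unsigned n = 0;

    /* Nothing is marked as used here, so inuse only grows */
    while (toy_pop(vq)) {
        n++;
    }
    return n;
}
```

Even with a guest claiming far more buffers than the ring holds, the
loop in this model stops after "num" iterations.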
>
> > +    } while (!virtio_queue_empty(svq->vq));
> > +}
> > +
> > +static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
> > +{
> > +    if (svq->used_idx != svq->shadow_used_idx) {
> > +        return true;
> > +    }
> > +
> > +    /* Get used idx must not be reordered */
> > +    smp_rmb();
>
>
> Interesting, we don't do this for kernel drivers. It would be helpful to
> explain it more clear by "X must be done before Y".
>
I think this got reordered: it's supposed to be *after* getting the used
idx, so it matches the one in the kernel with the comment "Only get
used array entries after they have been exposed by host.".
I will change it for the next series.
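As a sketch of the intended ordering (used idx is read first, then the
barrier, then the used ring entry), here is a standalone version that
uses a C11 acquire fence as a stand-in for smp_rmb(). The toy_* types
are made up for the example and endianness conversion is left out:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_RING_SIZE 4

struct toy_used_ring {
    _Atomic uint16_t idx;        /* bumped by the device last */
    uint32_t id[TOY_RING_SIZE];  /* written before idx is bumped */
};

struct toy_svq {
    struct toy_used_ring *used;
    uint16_t last_used_idx;      /* next entry to consume */
    uint16_t shadow_used_idx;    /* last device idx observed */
};

static bool toy_more_used(struct toy_svq *svq)
{
    if (svq->last_used_idx != svq->shadow_used_idx) {
        return true;
    }

    svq->shadow_used_idx = atomic_load_explicit(&svq->used->idx,
                                                memory_order_relaxed);
    /* The idx load must happen before any later load of used->id[] */
    atomic_thread_fence(memory_order_acquire);

    return svq->last_used_idx != svq->shadow_used_idx;
}

/* Fetch the next used id, if the device has exposed one. */
static bool toy_get_used(struct toy_svq *svq, uint32_t *id)
{
    if (!toy_more_used(svq)) {
        return false;
    }

    *id = svq->used->id[svq->last_used_idx % TOY_RING_SIZE];
    svq->last_used_idx++;
    return true;
}
```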
>
> > +    svq->shadow_used_idx = cpu_to_le16(svq->vring.used->idx);
> > +
> > +    return svq->used_idx != svq->shadow_used_idx;
> > +}
> > +
> > +static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> > +{
> > +    vring_desc_t *descs = svq->vring.desc;
> > +    const vring_used_t *used = svq->vring.used;
> > +    vring_used_elem_t used_elem;
> > +    uint16_t last_used;
> > +
> > +    if (!vhost_svq_more_used(svq)) {
> > +        return NULL;
> > +    }
> > +
> > +    last_used = svq->used_idx & (svq->vring.num - 1);
> > +    used_elem.id = le32_to_cpu(used->ring[last_used].id);
> > +    used_elem.len = le32_to_cpu(used->ring[last_used].len);
> > +
> > +    svq->used_idx++;
> > +    if (unlikely(used_elem.id >= svq->vring.num)) {
> > +        error_report("Device %s says index %u is used", svq->vdev->name,
> > +                     used_elem.id);
> > +        return NULL;
> > +    }
> > +
> > +    if (unlikely(!svq->ring_id_maps[used_elem.id])) {
> > +        error_report(
> > +            "Device %s says index %u is used, but it was not available",
> > +            svq->vdev->name, used_elem.id);
> > +        return NULL;
> > +    }
> > +
> > +    descs[used_elem.id].next = svq->free_head;
> > +    svq->free_head = used_elem.id;
> > +
> > +    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
> > +    return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
> >   }
> >
> >   /* Forward vhost notifications */
> > @@ -70,8 +258,26 @@ static void vhost_svq_handle_call_no_test(EventNotifier *n)
> >   {
> >       VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> >                                                call_notifier);
> > -
> > -    event_notifier_set(&svq->guest_call_notifier);
> > +    VirtQueue *vq = svq->vq;
> > +
> > +    /* Make as many buffers as possible used. */
> > +    do {
> > +        unsigned i = 0;
> > +
> > +        /* TODO: Use VRING_AVAIL_F_NO_INTERRUPT */
> > +        while (true) {
> > +            g_autofree VirtQueueElement *elem = vhost_svq_get_buf(svq);
> > +            if (!elem) {
> > +                break;
> > +            }
> > +
> > +            assert(i < svq->vring.num);
>
>
> Let's return error instead of using the assert.
>
Actually this is a condition that we should never meet: in the case of
a ring overrun, the device would try to mark as used a descriptor that
is either beyond the vring size *or* one of the already used ones.
In both cases, elem should be NULL and the loop should break.
So this is a safety net protecting from both: if we have an i >
svq->vring.num, it means we are not processing used buffers well anymore,
and (moreover) this is happening after marking all descriptors as used.
Taking that into account, should we delete it?
>
> > +            virtqueue_fill(vq, elem, elem->len, i++);
> > +        }
> > +
> > +        virtqueue_flush(vq, i);
> > +        event_notifier_set(&svq->guest_call_notifier);
> > +    } while (vhost_svq_more_used(svq));
> >   }
> >
> >   static void vhost_svq_handle_call(EventNotifier *n)
> > @@ -204,12 +410,25 @@ err_set_vring_kick:
> >   void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> >                       VhostShadowVirtqueue *svq)
> >   {
> > +    int i;
> >       int r = vhost_svq_restore_vdev_host_notifier(dev, idx, svq);
> > +
> >       if (unlikely(r < 0)) {
> >           error_report("Couldn't restore vq kick fd: %s", strerror(-r));
> >       }
> >
> >       event_notifier_set_handler(&svq->host_notifier, NULL);
> > +
> > +    for (i = 0; i < svq->vring.num; ++i) {
> > +        g_autofree VirtQueueElement *elem = svq->ring_id_maps[i];
> > +        /*
> > +         * Although the doc says we must unpop in order, it's ok to unpop
> > +         * everything.
> > +         */
> > +        if (elem) {
> > +            virtqueue_unpop(svq->vq, elem, elem->len);
> > +        }
>
>
> Will this result some of the "pending" buffers to be submitted multiple
> times? If yes, should we wait for all the buffers used instead of doing
> the unpop here?
>
Do you mean to call virtqueue_unpop with the same elem (or elem.id)
multiple times? That should never happen, because elem.id should be
the position in the ring_id_maps. Also, unpop() should just unmap the
element and never sync again.
Maybe it is way clearer to call virtqueue_detach_element here directly.
>
> > +    }
> >   }
> >
> >   /*
> > @@ -224,7 +443,7 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> >       size_t driver_size;
> >       size_t device_size;
> >       g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
> > -    int r;
> > +    int r, i;
> >
> >       r = event_notifier_init(&svq->kick_notifier, 0);
> >       if (r != 0) {
> > @@ -250,6 +469,11 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> >       memset(svq->vring.desc, 0, driver_size);
> >       svq->vring.used = qemu_memalign(qemu_real_host_page_size, device_size);
> >       memset(svq->vring.used, 0, device_size);
> > +    for (i = 0; i < num - 1; i++) {
> > +        svq->vring.desc[i].next = cpu_to_le16(i + 1);
> > +    }
> > +
> > +    svq->ring_id_maps = g_new0(VirtQueueElement *, num);
> >       event_notifier_set_handler(&svq->call_notifier,
> >                                  vhost_svq_handle_call);
> >       return g_steal_pointer(&svq);
> > @@ -269,6 +493,7 @@ void vhost_svq_free(VhostShadowVirtqueue *vq)
> >       event_notifier_cleanup(&vq->kick_notifier);
> >       event_notifier_set_handler(&vq->call_notifier, NULL);
> >       event_notifier_cleanup(&vq->call_notifier);
> > +    g_free(vq->ring_id_maps);
> >       qemu_vfree(vq->vring.desc);
> >       qemu_vfree(vq->vring.used);
> >       g_free(vq);
> > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > index a057e8277d..bb7010ddb5 100644
> > --- a/hw/virtio/vhost-vdpa.c
> > +++ b/hw/virtio/vhost-vdpa.c
> > @@ -19,6 +19,7 @@
> >   #include "hw/virtio/virtio-net.h"
> >   #include "hw/virtio/vhost-shadow-virtqueue.h"
> >   #include "hw/virtio/vhost-vdpa.h"
> > +#include "hw/virtio/vhost-shadow-virtqueue.h"
> >   #include "exec/address-spaces.h"
> >   #include "qemu/main-loop.h"
> >   #include "cpu.h"
> > @@ -475,6 +476,28 @@ static int vhost_vdpa_set_features(struct vhost_dev *dev,
> >       return vhost_vdpa_backend_set_features(dev, features);
> >   }
> >
> > +/**
> > + * Restore guest features to vdpa device
> > + */
> > +static int vhost_vdpa_set_guest_features(struct vhost_dev *dev)
> > +{
> > +    struct vhost_vdpa *v = dev->opaque;
> > +    return vhost_vdpa_backend_set_features(dev, v->guest_features);
> > +}
> > +
> > +/**
> > + * Set shadow virtqueue supported features
> > + */
> > +static int vhost_vdpa_set_svq_features(struct vhost_dev *dev)
> > +{
> > +    struct vhost_vdpa *v = dev->opaque;
> > +    uint64_t features = v->host_features;
> > +    bool b = vhost_svq_valid_device_features(&features);
> > +    assert(b);
> > +
> > +    return vhost_vdpa_backend_set_features(dev, features);
> > +}
> > +
> >   static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)
> >   {
> >       uint64_t features;
> > @@ -730,6 +753,19 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
> >       return true;
> >   }
> >
> > +static int vhost_vdpa_vring_pause(struct vhost_dev *dev)
> > +{
> > +    int r;
> > +    uint8_t status;
> > +
> > +    vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DEVICE_STOPPED);
> > +    do {
> > +        r = vhost_vdpa_call(dev, VHOST_VDPA_GET_STATUS, &status);
>
>
> I guess we'd better add some sleep here.
>
If the final version still contains the call, I will add the sleep. At
the moment I think it's better if we stop the device by a vdpa ioctl.
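For reference, a bounded poll with a sleep could look roughly like the
sketch below. It is standalone: toy_dev and toy_get_status() fake the
device, the value of TOY_S_DEVICE_STOPPED is an arbitrary placeholder,
and the retry limit and 1 ms delay are assumptions, not something taken
from the patch:

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define TOY_S_DEVICE_STOPPED 0x20  /* placeholder bit, not from a spec */

/* Fake device that reports "stopped" after a number of polls. */
struct toy_dev {
    unsigned polls_until_stopped;
};

static int toy_get_status(struct toy_dev *dev, uint8_t *status)
{
    if (dev->polls_until_stopped > 0) {
        dev->polls_until_stopped--;
        *status = 0;
    } else {
        *status = TOY_S_DEVICE_STOPPED;
    }
    return 0;
}

/* Returns 0 once stopped, the getter's error, or -1 on timeout. */
static int toy_wait_stopped(struct toy_dev *dev, unsigned max_polls)
{
    const struct timespec delay = { .tv_sec = 0, .tv_nsec = 1000000 };
    uint8_t status = 0;

    for (unsigned i = 0; i < max_polls; i++) {
        int r = toy_get_status(dev, &status);
        if (r != 0) {
            return r;   /* propagate the failure instead of hiding it */
        }
        if (status & TOY_S_DEVICE_STOPPED) {
            return 0;
        }
        nanosleep(&delay, NULL);  /* avoid spinning on the status read */
    }

    return -1;
}
```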
>
> > +    } while (r == 0 && !(status & VIRTIO_CONFIG_S_DEVICE_STOPPED));
> > +
> > +    return 0;
> > +}
> > +
> >   /*
> >    * Start shadow virtqueue.
> >    */
> > @@ -742,9 +778,29 @@ static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
> >           .index = idx + dev->vq_index,
> >           .fd = event_notifier_get_fd(vhost_call_notifier),
> >       };
> > +    struct vhost_vring_addr addr = {
> > +        .index = idx + dev->vq_index,
> > +    };
> > +    struct vhost_vring_state num = {
> > +        .index = idx + dev->vq_index,
> > +        .num = virtio_queue_get_num(dev->vdev, idx),
> > +    };
> >       int r;
> >       bool b;
> >
> > +    vhost_svq_get_vring_addr(svq, &addr);
> > +    r = vhost_vdpa_set_vring_addr(dev, &addr);
> > +    if (unlikely(r)) {
> > +        error_report("vhost_set_vring_addr for shadow vq failed");
> > +        return false;
> > +    }
> > +
> > +    r = vhost_vdpa_set_vring_num(dev, &num);
> > +    if (unlikely(r)) {
> > +        error_report("vhost_vdpa_set_vring_num for shadow vq failed");
> > +        return false;
> > +    }
> > +
> >       /* Set shadow vq -> guest notifier */
> >       assert(v->call_fd[idx]);
> >       vhost_svq_set_guest_call_notifier(svq, v->call_fd[idx]);
> > @@ -781,15 +837,32 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> >           assert(v->shadow_vqs->len == 0);
> >           for (n = 0; n < hdev->nvqs; ++n) {
> >               VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n);
> > -            bool ok;
> > -
> >               if (unlikely(!svq)) {
> >                   g_ptr_array_set_size(v->shadow_vqs, 0);
> >                   return 0;
> >               }
> >               g_ptr_array_add(v->shadow_vqs, svq);
> > +        }
> > +    }
> >
> > -            ok = vhost_vdpa_svq_start_vq(hdev, n);
> > +    r = vhost_vdpa_vring_pause(hdev);
> > +    assert(r == 0);
> > +
> > +    if (enable) {
> > +        for (n = 0; n < v->shadow_vqs->len; ++n) {
> > +            /* Obtain Virtqueue state */
> > +            vhost_virtqueue_stop(hdev, hdev->vdev, &hdev->vqs[n], n);
> > +        }
> > +    }
> > +
> > +    /* Reset device so it can be configured */
> > +    r = vhost_vdpa_dev_start(hdev, false);
> > +    assert(r == 0);
> > +
> > +    if (enable) {
> > +        int r;
> > +        for (n = 0; n < v->shadow_vqs->len; ++n) {
> > +            bool ok = vhost_vdpa_svq_start_vq(hdev, n);
> >               if (unlikely(!ok)) {
> >                   /* Free still not started svqs */
> >                   g_ptr_array_set_size(v->shadow_vqs, n);
> > @@ -797,11 +870,19 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> >                   break;
> >               }
> >           }
> > +
> > +        /* Need to ack features to set state in vp_vdpa devices */
>
>
> vhost_vdpa actually?
>
Yes, what a mistake!
>
> > +        r = vhost_vdpa_set_svq_features(hdev);
> > +        if (unlikely(r)) {
> > +            enable = false;
> > +        }
> >       }
> >
> >       v->shadow_vqs_enabled = enable;
> >
> >       if (!enable) {
> > +        vhost_vdpa_set_guest_features(hdev);
> > +
> >           /* Disable all queues or clean up failed start */
> >           for (n = 0; n < v->shadow_vqs->len; ++n) {
> >               struct vhost_vring_file file = {
> > @@ -818,7 +899,12 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> >               /* TODO: This can unmask or override call fd! */
> >               vhost_virtqueue_start(hdev, hdev->vdev, &hdev->vqs[n], vq_idx);
> >           }
> > +    }
> >
> > +    r = vhost_vdpa_dev_start(hdev, true);
> > +    assert(r == 0);
> > +
> > +    if (!enable) {
> >           /* Resources cleanup */
> >           g_ptr_array_set_size(v->shadow_vqs, 0);
> >       }
> > @@ -831,6 +917,7 @@ void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
> >       struct vhost_vdpa *v;
> >       const char *err_cause = NULL;
> >       bool r;
> > +    uint64_t svq_features;
> >
> >       QLIST_FOREACH(v, &vhost_vdpa_devices, entry) {
> >           if (v->dev->vdev && 0 == strcmp(v->dev->vdev->name, name)) {
> > @@ -846,6 +933,20 @@ void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
> >           goto err;
> >       }
> >
> > +    svq_features = v->host_features;
> > +    if (!vhost_svq_valid_device_features(&svq_features)) {
> > +        error_setg(errp,
> > +            "Can't enable shadow vq on %s: Unexpected feature flags (%lx-%lx)",
> > +            name, v->host_features, svq_features);
> > +        return;
> > +    } else {
> > +        /* TODO: Check for virtio_vdpa + IOMMU & modern device */
>
>
> I guess you mean "vhost_vdpa" here.
Yes, a similar mistake in less than 50 lines :).
> For IOMMU, I guess you mean "vIOMMU"
> actually?
>
This comment is out of date and inherited from the vhost version,
where only the IOMMU version was developed, so it will be deleted in
the next series. I think it makes little sense to check vIOMMU if we
stick with vDPA since it still does not support it, but we could make
the check here for sure.
Thanks!
> Thanks
>
>
> > +    }
> > +
> > +    if (err_cause) {
> > +        goto err;
> > +    }
> > +
> >       r = vhost_vdpa_enable_svq(v, enable);
> >       if (unlikely(!r)) {
> >           err_cause = "Error enabling (see monitor)";
> > @@ -853,7 +954,7 @@ void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
> >       }
> >
> >   err:
> > -    if (err_cause) {
> > +    if (errp == NULL && err_cause) {
> >           error_setg(errp, "Can't enable shadow vq on %s: %s", name, err_cause);
> >       }
> >   }
>
* Re: [RFC PATCH v4 08/20] vhost: Route guest->host notification through shadow virtqueue
  2021-10-14 12:00     ` Eugenio Perez Martin
@ 2021-10-15  3:45       ` Jason Wang
  2021-10-15  9:08         ` Eugenio Perez Martin
  2021-10-15 18:21       ` Eugenio Perez Martin
  1 sibling, 1 reply; 90+ messages in thread
From: Jason Wang @ 2021-10-15  3:45 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-devel, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Michael Lilja, Stefano Garzarella
On 2021/10/14 at 8:00 PM, Eugenio Perez Martin wrote:
> On Wed, Oct 13, 2021 at 5:27 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/10/1 at 3:05 PM, Eugenio Pérez wrote:
>>> Shadow virtqueue notifications forwarding is disabled when vhost_dev
>>> stops, so code flow follows usual cleanup.
>>>
>>> Also, host notifiers must be disabled at SVQ start,
>>
>> Any reason for this?
>>
> It will be addressed in a later series, sorry.
>
>>> and they will not
>>> start if SVQ has been enabled when device is stopped. This is trivial
>>> to address, but it is left out for simplicity at this moment.
>>
>> It looks to me this patch also contains the following logics
>>
>> 1) codes to enable svq
>>
>> 2) codes to let svq to be enabled from QMP.
>>
>> I think they need to be split out,
> I agree that we can split this more, with the code that belongs to SVQ
> and the code that belongs to vhost-vdpa. it will be addressed in
> future series.
>
>> we may endup with the following
>> series of patches
>>
> With "series of patches" do you mean to send every step in a separated
> series? There are odds of having the need of modifying code already
> sent & merged with later ones. If you confirm to me that it is fine, I
> can do it that way for sure.
Sorry for being unclear. I actually meant a sub-series within the series.
>
>> 1) svq skeleton with enable/disable
>> 2) route host notifier to svq
>> 3) route guest notifier to svq
>> 4) codes to enable svq
>> 5) enable svq via QMP
>>
> I'm totally fine with that, but there is code that is never called if
> the qmp command is not added. The compiler complains about static
> functions that are not called, making impossible things like bisecting
> through these commits, unless I use attribute((unused)) or similar. Or
> have I missed something?
You're right. Then I think we can do:
1) svq skeleton with enable/disable via QMP
2) route host notifier to svq
3) route guest notifier to svq
>
> We could do that way with the code that belongs to SVQ though, since
> all of it is declared in headers. But to delay the "enable svq via
> qmp" to the last one makes debugging harder, as we cannot just enable
> notifications forwarding with no buffers forwarding.
Yes.
>
> If I introduce a change in the notifications code, I can simply go to
> these commits and enable SVQ for notifications. This way I can have an
> idea of what part is failing. A similar logic can be applied to other
> devices than vp_vdpa.
vhost-vdpa actually?
> We would lose it if we
>
>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>> ---
>>>    qapi/net.json                      |   2 +-
>>>    hw/virtio/vhost-shadow-virtqueue.h |   8 ++
>>>    include/hw/virtio/vhost-vdpa.h     |   4 +
>>>    hw/virtio/vhost-shadow-virtqueue.c | 138 ++++++++++++++++++++++++++++-
>>>    hw/virtio/vhost-vdpa.c             | 116 +++++++++++++++++++++++-
>>>    5 files changed, 264 insertions(+), 4 deletions(-)
>>>
>>> diff --git a/qapi/net.json b/qapi/net.json
>>> index a2c30fd455..fe546b0e7c 100644
>>> --- a/qapi/net.json
>>> +++ b/qapi/net.json
>>> @@ -88,7 +88,7 @@
>>>    #
>>>    # @enable: true to use the alternate shadow VQ notifications
>>>    #
>>> -# Returns: Always error, since SVQ is not implemented at the moment.
>>> +# Returns: Error if failure, or 'no error' for success.
>>>    #
>>>    # Since: 6.2
>>>    #
>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
>>> index 27ac6388fa..237cfceb9c 100644
>>> --- a/hw/virtio/vhost-shadow-virtqueue.h
>>> +++ b/hw/virtio/vhost-shadow-virtqueue.h
>>> @@ -14,6 +14,14 @@
>>>
>>>    typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
>>>
>>> +EventNotifier *vhost_svq_get_svq_call_notifier(VhostShadowVirtqueue *svq);
>>
>> Let's move this function to another patch since it's unrelated to the
>> guest->host routing.
>>
> Right, I missed it while squashing commits and at later reviews.
>
>>> +void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd);
>>> +
>>> +bool vhost_svq_start(struct vhost_dev *dev, unsigned idx,
>>> +                     VhostShadowVirtqueue *svq);
>>> +void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
>>> +                    VhostShadowVirtqueue *svq);
>>> +
>>>    VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx);
>>>
>>>    void vhost_svq_free(VhostShadowVirtqueue *vq);
>>> diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
>>> index 0d565bb5bd..48aae59d8e 100644
>>> --- a/include/hw/virtio/vhost-vdpa.h
>>> +++ b/include/hw/virtio/vhost-vdpa.h
>>> @@ -12,6 +12,8 @@
>>>    #ifndef HW_VIRTIO_VHOST_VDPA_H
>>>    #define HW_VIRTIO_VHOST_VDPA_H
>>>
>>> +#include <gmodule.h>
>>> +
>>>    #include "qemu/queue.h"
>>>    #include "hw/virtio/virtio.h"
>>>
>>> @@ -24,6 +26,8 @@ typedef struct vhost_vdpa {
>>>        int device_fd;
>>>        uint32_t msg_type;
>>>        MemoryListener listener;
>>> +    bool shadow_vqs_enabled;
>>> +    GPtrArray *shadow_vqs;
>>>        struct vhost_dev *dev;
>>>        QLIST_ENTRY(vhost_vdpa) entry;
>>>        VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
>>> index c4826a1b56..21dc99ab5d 100644
>>> --- a/hw/virtio/vhost-shadow-virtqueue.c
>>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
>>> @@ -9,9 +9,12 @@
>>>
>>>    #include "qemu/osdep.h"
>>>    #include "hw/virtio/vhost-shadow-virtqueue.h"
>>> +#include "hw/virtio/vhost.h"
>>> +
>>> +#include "standard-headers/linux/vhost_types.h"
>>>
>>>    #include "qemu/error-report.h"
>>> -#include "qemu/event_notifier.h"
>>> +#include "qemu/main-loop.h"
>>>
>>>    /* Shadow virtqueue to relay notifications */
>>>    typedef struct VhostShadowVirtqueue {
>>> @@ -19,14 +22,146 @@ typedef struct VhostShadowVirtqueue {
>>>        EventNotifier kick_notifier;
>>>        /* Shadow call notifier, sent to vhost */
>>>        EventNotifier call_notifier;
>>> +
>>> +    /*
>>> +     * Borrowed virtqueue's guest to host notifier.
>>> +     * To borrow it in this event notifier allows to register on the event
>>> +     * loop and access the associated shadow virtqueue easily. If we use the
>>> +     * VirtQueue, we don't have an easy way to retrieve it.
>>> +     *
>>> +     * So shadow virtqueue must not clean it, or we would lose VirtQueue one.
>>> +     */
>>> +    EventNotifier host_notifier;
>>> +
>>> +    /* Guest's call notifier, where SVQ calls guest. */
>>> +    EventNotifier guest_call_notifier;
>>
>> To be consistent, let's simply use "guest_notifier" here.
>>
> It could be confused when the series adds a guest -> qemu kick
> notifier then. Actually, I think it would be better to rename
> host_notifier to something like host_svq_notifier. Or host_call and
> guest_call, since "notifier" is already in the type, making the name
> to be a little bit "Hungarian notation".
I think that's fine, just need to make sure we have a consistent name 
for SVQ notifier.
>
>>> +
>>> +    /* Virtio queue shadowing */
>>> +    VirtQueue *vq;
>>>    } VhostShadowVirtqueue;
>>>
>>> +/* Forward guest notifications */
>>> +static void vhost_handle_guest_kick(EventNotifier *n)
>>> +{
>>> +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
>>> +                                             host_notifier);
>>> +
>>> +    if (unlikely(!event_notifier_test_and_clear(n))) {
>>> +        return;
>>> +    }
>>
>> Is there a chance that we may stop the processing of available buffers
>> during the svq enabling? There could be no kick from the guest in this case.
>>
> Actually, yes, I think you are right. The guest kick eventfd could
> have been consumed by vhost but there may be still pending buffers.
>
> I think it would be better to check for available buffers first, then
> clear the notifier unconditionally, and then re-check and process them
> if any [1].
Looks like I can't find "[1]" anywhere.
>
> However, this problem arises later in the series: At this moment the
> device is not reset and guest's host notifier is not replaced, so
> either vhost/device receives the kick, or SVQ does and forwards it.
>
> Does it make sense to you?
Kind of, so I think we can:
1) As you said, always check available buffers when switching to SVQ
2) always kick the vhost when switching back to vhost
>
>>> +
>>> +    event_notifier_set(&svq->kick_notifier);
>>> +}
>>> +
>>> +/*
>>> + * Obtain the SVQ call notifier, where vhost device notifies SVQ that there
>>> + * exists pending used buffers.
>>> + *
>>> + * @svq Shadow Virtqueue
>>> + */
>>> +EventNotifier *vhost_svq_get_svq_call_notifier(VhostShadowVirtqueue *svq)
>>> +{
>>> +    return &svq->call_notifier;
>>> +}
>>> +
>>> +/*
>>> + * Set the call notifier for the SVQ to call the guest
>>> + *
>>> + * @svq Shadow virtqueue
>>> + * @call_fd call notifier
>>> + *
>>> + * Called on BQL context.
>>> + */
>>> +void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd)
>>> +{
>>> +    event_notifier_init_fd(&svq->guest_call_notifier, call_fd);
>>> +}
>>> +
>>> +/*
>>> + * Restore the vhost guest to host notifier, i.e., disables svq effect.
>>> + */
>>> +static int vhost_svq_restore_vdev_host_notifier(struct vhost_dev *dev,
>>> +                                                unsigned vhost_index,
>>> +                                                VhostShadowVirtqueue *svq)
>>> +{
>>> +    EventNotifier *vq_host_notifier = virtio_queue_get_host_notifier(svq->vq);
>>> +    struct vhost_vring_file file = {
>>> +        .index = vhost_index,
>>> +        .fd = event_notifier_get_fd(vq_host_notifier),
>>> +    };
>>> +    int r;
>>> +
>>> +    /* Restore vhost kick */
>>> +    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
>>
>> And remap the notification area if necessary.
> Totally right, that step is missed in this series.
>
> However, remapping guest host notifier memory region has no advantages
> over using ioeventfd to perform guest -> SVQ notifications, doesn't
> it? By both methods flow needs to go through the hypervisor kernel.
To be clear, I meant restoring the notification area mapping from the guest
to the device directly. For SVQ, you are right, there's not much value in
bothering with the notification area map. (Or we can do it in the future.)
>
>>
>>> +    return r ? -errno : 0;
>>> +}
>>> +
>>> +/*
>>> + * Start shadow virtqueue operation.
>>> + * @dev vhost device
>>> + * @hidx vhost virtqueue index
>>> + * @svq Shadow Virtqueue
>>> + */
>>> +bool vhost_svq_start(struct vhost_dev *dev, unsigned idx,
>>> +                     VhostShadowVirtqueue *svq)
>>> +{
>>> +    EventNotifier *vq_host_notifier = virtio_queue_get_host_notifier(svq->vq);
>>> +    struct vhost_vring_file file = {
>>> +        .index = dev->vhost_ops->vhost_get_vq_index(dev, dev->vq_index + idx),
>>> +        .fd = event_notifier_get_fd(&svq->kick_notifier),
>>> +    };
>>> +    int r;
>>> +
>>> +    /* Check that notifications are still going directly to vhost dev */
>>> +    assert(virtio_queue_is_host_notifier_enabled(svq->vq));
>>> +
>>> +    /*
>>> +     * event_notifier_set_handler already checks for guest's notifications if
>>> +     * they arrive in the switch, so there is no need to explicitly check for
>>> +     * them.
>>> +     */
>>
>> If this is true, shouldn't we call vhost_set_vring_kick() before the
>> event_notifier_set_handler()?
>>
> Not at this point of the series, but it could be another solution when
> we need to reset the device and we are unsure if all buffers have been
> read. But I think I prefer the solution exposed in [1] and to
> explicitly call vhost_handle_guest_kick here. Do you think
> differently?
I actually meant: can we end up in this situation, since SVQ takes over
the host notifier before set_vring_kick()?
1) guest kicks vq, vhost is running
2) qemu takes over the host notifier
3) guest kicks vq
4) qemu routes host notifier to SVQ
Then the vq will be handled by both SVQ and vhost?
>
>> Btw, I think we should update the fd if set_vring_kick() was called
>> after this function?
>>
> Kind of. This is currently bad in the code, but...
>
> Backend callbacks vhost_ops->vhost_set_vring_kick and
> vhost_ops->vhost_set_vring_addr are only called at
> vhost_virtqueue_start. And they are always called with known data
> already stored in VirtQueue.
This is true for now, but I'd suggest not depending on this since:
1) it might be changed in the future
2) we're working at the vhost layer and expose an API to the virtio device,
so the code should be robust enough to handle set_vring_kick() at any time
3) I think we've already handled a similar situation for set_vring_call, so
let's be consistent
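
A rough sketch of what "handle set_vring_kick() at any time" could look like: always remember the guest's fd, and decide separately which fd the device polls. DemoVring and the demo_* functions are hypothetical names for illustration only; the real code would go through vhost_ops.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-vring state; not the actual vhost_vdpa layout. */
typedef struct {
    bool svq_enabled;
    int guest_kick_fd;  /* last fd received through set_vring_kick */
    int svq_kick_fd;    /* fd SVQ uses to kick the device */
    int device_kick_fd; /* fd the device was last told to poll */
} DemoVring;

/* Single place that decides which fd the device must poll. */
static void demo_apply_kick(DemoVring *v)
{
    v->device_kick_fd = v->svq_enabled ? v->svq_kick_fd : v->guest_kick_fd;
}

/* May be called at any time, with SVQ enabled or not. */
static void demo_set_vring_kick(DemoVring *v, int fd)
{
    v->guest_kick_fd = fd;
    demo_apply_kick(v);
}

static void demo_set_svq_enabled(DemoVring *v, bool enable)
{
    v->svq_enabled = enable;
    demo_apply_kick(v);
}
```

The cost is one extra stored fd per vring; in exchange, a kick fd set while SVQ is active is not lost when switching back to vhost.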
>
> To avoid storing more state in vhost_vdpa, I think that we should
> avoid duplicating them, but ignore new kick_fd or address in SVQ mode,
> and retrieve them again at the moment the device is (re)started in SVQ
> mode. Qemu already avoids things like allowing the guest to set
> addresses at random time, using the VirtIOPCIProxy to store them.
>
> I also see how duplicating that status could protect vdpa SVQ code
> against future changes to vhost code, but that would make this series
> bigger and more complex with no need at this moment in my opinion.
>
> Do you agree?
Somewhat, but considering we can handle set_vring_call(), let's at least make
set_vring_kick() more robust.
>
>>> +    event_notifier_init_fd(&svq->host_notifier,
>>> +                           event_notifier_get_fd(vq_host_notifier));
>>> +    event_notifier_set_handler(&svq->host_notifier, vhost_handle_guest_kick);
>>> +
>>> +    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
>>
>> And we need to stop the notification area mmap.
>>
> Right.
>
>>> +    if (unlikely(r != 0)) {
>>> +        error_report("Couldn't set kick fd: %s", strerror(errno));
>>> +        goto err_set_vring_kick;
>>> +    }
>>> +
>>> +    return true;
>>> +
>>> +err_set_vring_kick:
>>> +    event_notifier_set_handler(&svq->host_notifier, NULL);
>>> +
>>> +    return false;
>>> +}
>>> +
>>> +/*
>>> + * Stop shadow virtqueue operation.
>>> + * @dev vhost device
>>> + * @idx vhost queue index
>>> + * @svq Shadow Virtqueue
>>> + */
>>> +void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
>>> +                    VhostShadowVirtqueue *svq)
>>> +{
>>> +    int r = vhost_svq_restore_vdev_host_notifier(dev, idx, svq);
>>> +    if (unlikely(r < 0)) {
>>> +        error_report("Couldn't restore vq kick fd: %s", strerror(-r));
>>> +    }
>>> +
>>> +    event_notifier_set_handler(&svq->host_notifier, NULL);
>>> +}
>>> +
>>>    /*
>>>     * Creates vhost shadow virtqueue, and instruct vhost device to use the shadow
>>>     * methods and file descriptors.
>>>     */
>>>    VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
>>>    {
>>> +    int vq_idx = dev->vq_index + idx;
>>>        g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
>>>        int r;
>>>
>>> @@ -44,6 +179,7 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
>>>            goto err_init_call_notifier;
>>>        }
>>>
>>> +    svq->vq = virtio_get_queue(dev->vdev, vq_idx);
>>>        return g_steal_pointer(&svq);
>>>
>>>    err_init_call_notifier:
>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>>> index e0dc7508c3..36c954a779 100644
>>> --- a/hw/virtio/vhost-vdpa.c
>>> +++ b/hw/virtio/vhost-vdpa.c
>>> @@ -17,6 +17,7 @@
>>>    #include "hw/virtio/vhost.h"
>>>    #include "hw/virtio/vhost-backend.h"
>>>    #include "hw/virtio/virtio-net.h"
>>> +#include "hw/virtio/vhost-shadow-virtqueue.h"
>>>    #include "hw/virtio/vhost-vdpa.h"
>>>    #include "exec/address-spaces.h"
>>>    #include "qemu/main-loop.h"
>>> @@ -272,6 +273,16 @@ static void vhost_vdpa_add_status(struct vhost_dev *dev, uint8_t status)
>>>        vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &s);
>>>    }
>>>
>>> +/**
>>> + * Adaptor function to free shadow virtqueue through gpointer
>>> + *
>>> + * @svq   The Shadow Virtqueue
>>> + */
>>> +static void vhost_psvq_free(gpointer svq)
>>> +{
>>> +    vhost_svq_free(svq);
>>> +}
>>> +
>>>    static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
>>>    {
>>>        struct vhost_vdpa *v;
>>> @@ -283,6 +294,7 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
>>>        dev->opaque =  opaque ;
>>>        v->listener = vhost_vdpa_memory_listener;
>>>        v->msg_type = VHOST_IOTLB_MSG_V2;
>>> +    v->shadow_vqs = g_ptr_array_new_full(dev->nvqs, vhost_psvq_free);
>>>        QLIST_INSERT_HEAD(&vhost_vdpa_devices, v, entry);
>>>
>>>        vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
>>> @@ -373,6 +385,17 @@ err:
>>>        return;
>>>    }
>>>
>>> +static void vhost_vdpa_svq_cleanup(struct vhost_dev *dev)
>>> +{
>>> +    struct vhost_vdpa *v = dev->opaque;
>>> +    size_t idx;
>>> +
>>> +    for (idx = 0; idx < v->shadow_vqs->len; ++idx) {
>>> +        vhost_svq_stop(dev, idx, g_ptr_array_index(v->shadow_vqs, idx));
>>> +    }
>>> +    g_ptr_array_free(v->shadow_vqs, true);
>>> +}
>>> +
>>>    static int vhost_vdpa_cleanup(struct vhost_dev *dev)
>>>    {
>>>        struct vhost_vdpa *v;
>>> @@ -381,6 +404,7 @@ static int vhost_vdpa_cleanup(struct vhost_dev *dev)
>>>        trace_vhost_vdpa_cleanup(dev, v);
>>>        vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs);
>>>        memory_listener_unregister(&v->listener);
>>> +    vhost_vdpa_svq_cleanup(dev);
>>>        QLIST_REMOVE(v, entry);
>>>
>>>        dev->opaque = NULL;
>>> @@ -557,7 +581,9 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
>>>        if (started) {
>>>            uint8_t status = 0;
>>>            memory_listener_register(&v->listener, &address_space_memory);
>>> -        vhost_vdpa_host_notifiers_init(dev);
>>> +        if (!v->shadow_vqs_enabled) {
>>> +            vhost_vdpa_host_notifiers_init(dev);
>>> +        }
>>
>> This looks like a trick, why not check and setup shadow_vqs inside:
>>
>> 1) vhost_vdpa_host_notifiers_init()
>>
>> and
>>
>> 2) vhost_vdpa_set_vring_kick()
>>
> Ok I will move the checks there.
>
>>>            vhost_vdpa_set_vring_ready(dev);
>>>            vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
>>>            vhost_vdpa_call(dev, VHOST_VDPA_GET_STATUS, &status);
>>> @@ -663,10 +689,96 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
>>>        return true;
>>>    }
>>>
>>> +/*
>>> + * Start shadow virtqueue.
>>> + */
>>> +static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
>>> +{
>>> +    struct vhost_vdpa *v = dev->opaque;
>>> +    VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, idx);
>>> +    return vhost_svq_start(dev, idx, svq);
>>> +}
>>> +
>>> +static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
>>> +{
>>> +    struct vhost_dev *hdev = v->dev;
>>> +    unsigned n;
>>> +
>>> +    if (enable == v->shadow_vqs_enabled) {
>>> +        return hdev->nvqs;
>>> +    }
>>> +
>>> +    if (enable) {
>>> +        /* Allocate resources */
>>> +        assert(v->shadow_vqs->len == 0);
>>> +        for (n = 0; n < hdev->nvqs; ++n) {
>>> +            VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n);
>>> +            bool ok;
>>> +
>>> +            if (unlikely(!svq)) {
>>> +                g_ptr_array_set_size(v->shadow_vqs, 0);
>>> +                return 0;
>>> +            }
>>> +            g_ptr_array_add(v->shadow_vqs, svq);
>>> +
>>> +            ok = vhost_vdpa_svq_start_vq(hdev, n);
>>> +            if (unlikely(!ok)) {
>>> +                /* Free still not started svqs */
>>> +                g_ptr_array_set_size(v->shadow_vqs, n);
>>> +                enable = false;
> [2]
>
>>> +                break;
>>> +            }
>>> +        }
>>
>> Since there's almost no logic that could be shared between enable and
>> disable, let's split that logic out into dedicated functions where the
>> code is easier to review (e.g. with better error handling etc).
>>
> Maybe it could be more clear in the code, but the reused logic is the
> disabling of SVQ and the fallback in case it cannot be enabled with
> [2]. But I'm not against splitting in two different functions if it
> makes review easier.
Ok.
>
>>> +    }
>>> +
>>> +    v->shadow_vqs_enabled = enable;
>>> +
>>> +    if (!enable) {
>>> +        /* Disable all queues or clean up failed start */
>>> +        for (n = 0; n < v->shadow_vqs->len; ++n) {
>>> +            unsigned vq_idx = vhost_vdpa_get_vq_index(hdev, n);
>>> +            VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, n);
>>> +            vhost_svq_stop(hdev, n, svq);
>>> +            vhost_virtqueue_start(hdev, hdev->vdev, &hdev->vqs[n], vq_idx);
>>> +        }
>>> +
>>> +        /* Resources cleanup */
>>> +        g_ptr_array_set_size(v->shadow_vqs, 0);
>>> +    }
>>> +
>>> +    return n;
>>> +}
>>>
>>>    void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
>>>    {
>>> -    error_setg(errp, "Shadow virtqueue still not implemented");
>>> +    struct vhost_vdpa *v;
>>> +    const char *err_cause = NULL;
>>> +    bool r;
>>> +
>>> +    QLIST_FOREACH(v, &vhost_vdpa_devices, entry) {
>>> +        if (v->dev->vdev && 0 == strcmp(v->dev->vdev->name, name)) {
>>> +            break;
>>> +        }
>>> +    }
>>
>> I think you can iterate the NetClientStates to get the vhost-vdpa backends.
>>
> Right, I missed it.
>
>>> +
>>> +    if (!v) {
>>> +        err_cause = "Device not found";
>>> +        goto err;
>>> +    } else if (v->notifier[0].addr) {
>>> +        err_cause = "Device has host notifiers enabled";
>>
>> I don't get this.
>>
> At this moment of the series you can enable guest -> SVQ -> 'vdpa
> device' if the device is not using the host notifiers memory region.
> The right solution is to disable it for the guest, and to handle it in
> SVQ. Otherwise, guest kick will bypass SVQ and
>
> It can be done in the same patch, or at least to disable (as unmap)
> them at this moment and handle them in a posterior patch. but for
> prototyping the solution I just ignored it in this series. It will be
> handled some way or another in the next one. I prefer the last one, to
> handle in a different patch, but let me know if you think it is better
> otherwise.
Aha, I see. But I think we need to do that in this patch, otherwise we
cannot route the host notifier to SVQ.
Thanks
>
>> Btw this function should be implemented in an independent patch after
>> svq is fully functional.
>>
> (Reasons for that are already commented at the top of this mail :) ).
>
> Thanks!
>
>> Thanks
>>
>>
>>> +        goto err;
>>> +    }
>>> +
>>> +    r = vhost_vdpa_enable_svq(v, enable);
>>> +    if (unlikely(!r)) {
>>> +        err_cause = "Error enabling (see monitor)";
>>> +        goto err;
>>> +    }
>>> +
>>> +err:
>>> +    if (err_cause) {
>>> +        error_setg(errp, "Can't enable shadow vq on %s: %s", name, err_cause);
>>> +    }
>>>    }
>>>
>>>    const VhostOps vdpa_ops = {
^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [RFC PATCH v4 15/20] vhost: Shadow virtqueue buffers forwarding
  2021-10-14 17:56     ` Eugenio Perez Martin
@ 2021-10-15  4:23       ` Jason Wang
  2021-10-15  9:33         ` Eugenio Perez Martin
  0 siblings, 1 reply; 90+ messages in thread
From: Jason Wang @ 2021-10-15  4:23 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-devel, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Michael Lilja, Stefano Garzarella
On 2021/10/15 at 1:56 AM, Eugenio Perez Martin wrote:
> On Wed, Oct 13, 2021 at 6:31 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/10/1 at 3:05 PM, Eugenio Pérez wrote:
>>> Initial version of shadow virtqueue that actually forwards buffers. There
>>> is no iommu support at the moment, and that will be addressed in future
>>> patches of this series. Since all vhost-vdpa devices use forced IOMMU,
>>> this means that SVQ is not usable at this point of the series on any
>>> device.
>>>
>>> For simplicity it only supports modern devices, which expect the vring
>>> in little endian, with split ring and no event idx or indirect
>>> descriptors. Support for them will not be added in this series.
>>>
>>> It reuses the VirtQueue code for the device part. The driver part is
>>> based on Linux's virtio_ring driver, but with stripped functionality
>>> and optimizations so it's easier to review. Later commits add simpler
>>> ones.
>>>
>>> SVQ uses VIRTIO_CONFIG_S_DEVICE_STOPPED to pause the device and
>>> retrieve its status (next available idx the device was going to
>>> consume) race-free. It can later reset the device to replace vring
>>> addresses etc. When SVQ starts qemu can resume consuming the guest's
>>> driver ring from that state, without notice from the latter.
>>>
>>> This status bit VIRTIO_CONFIG_S_DEVICE_STOPPED is currently discussed
>>> in VirtIO, and is implemented in qemu VirtIO-net devices in previous
>>> commits.
>>>
>>> Removal of _S_DEVICE_STOPPED bit (in other words, resuming the device)
>>> can be done in the future if a use case arises. At this moment we can
>>> just rely on resetting the full device.
>>>
>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>> ---
>>>    qapi/net.json                      |   2 +-
>>>    hw/virtio/vhost-shadow-virtqueue.c | 237 ++++++++++++++++++++++++++++-
>>>    hw/virtio/vhost-vdpa.c             | 109 ++++++++++++-
>>>    3 files changed, 337 insertions(+), 11 deletions(-)
>>>
>>> diff --git a/qapi/net.json b/qapi/net.json
>>> index fe546b0e7c..1f4a55f2c5 100644
>>> --- a/qapi/net.json
>>> +++ b/qapi/net.json
>>> @@ -86,7 +86,7 @@
>>>    #
>>>    # @name: the device name of the VirtIO device
>>>    #
>>> -# @enable: true to use the alternate shadow VQ notifications
>>> +# @enable: true to use the alternate shadow VQ buffers forwarding path
>>>    #
>>>    # Returns: Error if failure, or 'no error' for success.
>>>    #
>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
>>> index 34e159d4fd..df7e6fa3ec 100644
>>> --- a/hw/virtio/vhost-shadow-virtqueue.c
>>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
>>> @@ -10,6 +10,7 @@
>>>    #include "qemu/osdep.h"
>>>    #include "hw/virtio/vhost-shadow-virtqueue.h"
>>>    #include "hw/virtio/vhost.h"
>>> +#include "hw/virtio/virtio-access.h"
>>>
>>>    #include "standard-headers/linux/vhost_types.h"
>>>
>>> @@ -44,15 +45,135 @@ typedef struct VhostShadowVirtqueue {
>>>
>>>        /* Virtio device */
>>>        VirtIODevice *vdev;
>>> +
>>> +    /* Map for returning guest's descriptors */
>>> +    VirtQueueElement **ring_id_maps;
>>> +
>>> +    /* Next head to expose to device */
>>> +    uint16_t avail_idx_shadow;
>>> +
>>> +    /* Next free descriptor */
>>> +    uint16_t free_head;
>>> +
>>> +    /* Last seen used idx */
>>> +    uint16_t shadow_used_idx;
>>> +
>>> +    /* Next head to consume from device */
>>> +    uint16_t used_idx;
>>
>> Let's use "last_used_idx" as kernel driver did.
>>
> Ok I will change it.
>
>>>    } VhostShadowVirtqueue;
>>>
>>>    /* If the device is using some of these, SVQ cannot communicate */
>>>    bool vhost_svq_valid_device_features(uint64_t *dev_features)
>>>    {
>>> -    return true;
>>> +    uint64_t b;
>>> +    bool r = true;
>>> +
>>> +    for (b = VIRTIO_TRANSPORT_F_START; b <= VIRTIO_TRANSPORT_F_END; ++b) {
>>> +        switch (b) {
>>> +        case VIRTIO_F_NOTIFY_ON_EMPTY:
>>> +        case VIRTIO_F_ANY_LAYOUT:
>>> +            /* SVQ is fine with this feature */
>>> +            continue;
>>> +
>>> +        case VIRTIO_F_ACCESS_PLATFORM:
>>> +            /* SVQ needs this feature disabled. Can't continue */
>>
>> So code can explain itself, need a comment to explain why.
>>
> Do you mean that it *doesn't* need a comment to explain why? In that
> case I will delete them.
I meant that the comment duplicates the code. If you wish, you can
explain why ACCESS_PLATFORM needs to be disabled.
>
>>> +            if (*dev_features & BIT_ULL(b)) {
>>> +                clear_bit(b, dev_features);
>>> +                r = false;
>>> +            }
>>> +            break;
>>> +
>>> +        case VIRTIO_F_VERSION_1:
>>> +            /* SVQ needs this feature, so can't continue */
>>
>> A comment to explain why SVQ needs this feature.
>>
> Sure I will add it.
>
>>> +            if (!(*dev_features & BIT_ULL(b))) {
>>> +                set_bit(b, dev_features);
>>> +                r = false;
>>> +            }
>>> +            continue;
>>> +
>>> +        default:
>>> +            /*
>>> +             * SVQ must disable this feature, let's hope the device is fine
>>> +             * without it.
>>> +             */
>>> +            if (*dev_features & BIT_ULL(b)) {
>>> +                clear_bit(b, dev_features);
>>> +            }
>>> +        }
>>> +    }
>>> +
>>> +    return r;
>>> +}
>>
>> Let's move this to patch 14.
>>
> I can move it down to 14/20, but then it is not really accurate, since
> notifications forwarding can work with all feature sets. Not like we
> are introducing a regression, but still.
>
> I can always explain that in the patch message though, would that be ok?
I'm afraid this will break bisection. E.g., with patch 14 it works for any
features, but with patch 15 it doesn't.
>
>>> +
>>> +static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
>>> +                                    const struct iovec *iovec,
>>> +                                    size_t num, bool more_descs, bool write)
>>> +{
>>> +    uint16_t i = svq->free_head, last = svq->free_head;
>>> +    unsigned n;
>>> +    uint16_t flags = write ? cpu_to_le16(VRING_DESC_F_WRITE) : 0;
>>> +    vring_desc_t *descs = svq->vring.desc;
>>> +
>>> +    if (num == 0) {
>>> +        return;
>>> +    }
>>> +
>>> +    for (n = 0; n < num; n++) {
>>> +        if (more_descs || (n + 1 < num)) {
>>> +            descs[i].flags = flags | cpu_to_le16(VRING_DESC_F_NEXT);
>>> +        } else {
>>> +            descs[i].flags = flags;
>>> +        }
>>> +        descs[i].addr = cpu_to_le64((hwaddr)iovec[n].iov_base);
>>> +        descs[i].len = cpu_to_le32(iovec[n].iov_len);
>>> +
>>> +        last = i;
>>> +        i = cpu_to_le16(descs[i].next);
>>> +    }
>>> +
>>> +    svq->free_head = le16_to_cpu(descs[last].next);
>>> +}
>>> +
>>> +static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
>>> +                                    VirtQueueElement *elem)
>>> +{
>>> +    int head;
>>> +    unsigned avail_idx;
>>> +    vring_avail_t *avail = svq->vring.avail;
>>> +
>>> +    head = svq->free_head;
>>> +
>>> +    /* We need some descriptors here */
>>> +    assert(elem->out_num || elem->in_num);
>>> +
>>> +    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
>>> +                            elem->in_num > 0, false);
>>> +    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
>>> +
>>> +    /*
>>> +     * Put entry in available array (but don't update avail->idx until they
>>> +     * do sync).
>>> +     */
>>> +    avail_idx = svq->avail_idx_shadow & (svq->vring.num - 1);
>>> +    avail->ring[avail_idx] = cpu_to_le16(head);
>>> +    svq->avail_idx_shadow++;
>>> +
>>> +    /* Update avail index after the descriptor is wrote */
>>> +    smp_wmb();
>>> +    avail->idx = cpu_to_le16(svq->avail_idx_shadow);
>>> +
>>> +    return head;
>>> +
>>>    }
>>>
>>> -/* Forward guest notifications */
>>> +static void vhost_svq_add(VhostShadowVirtqueue *svq, VirtQueueElement *elem)
>>> +{
>>> +    unsigned qemu_head = vhost_svq_add_split(svq, elem);
>>> +
>>> +    svq->ring_id_maps[qemu_head] = elem;
>>> +}
>>> +
>>> +/* Handle guest->device notifications */
>>>    static void vhost_handle_guest_kick(EventNotifier *n)
>>>    {
>>>        VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
>>> @@ -62,7 +183,74 @@ static void vhost_handle_guest_kick(EventNotifier *n)
>>>            return;
>>>        }
>>>
>>> -    event_notifier_set(&svq->kick_notifier);
>>> +    /* Make available as many buffers as possible */
>>> +    do {
>>> +        if (virtio_queue_get_notification(svq->vq)) {
>>> +            /* No more notifications until process all available */
>>> +            virtio_queue_set_notification(svq->vq, false);
>>> +        }
>>
>> This can be done outside the loop.
>>
> I think it cannot. The intention of doing this way is that we check
> for new available buffers *also after* enabling notifications, so we
> don't miss any of them. It is more or less copied from
> virtio_blk_handle_vq, which also needs to run to completion.
>
> If we need to loop again because there are more available buffers, we
> want to disable notifications again. Or am I missing something?
I think you're right.
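
For reference, a small standalone model of why the disable/enable pair has to stay inside the loop. DemoGuestVq is a hypothetical stand-in for the guest's VirtQueue, and late_buffers simulates a guest that adds buffers while notifications are still disabled, so no kick arrives for them.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the guest-facing virtqueue. */
typedef struct {
    unsigned avail;        /* buffers made available by the guest */
    unsigned popped;       /* buffers already popped by SVQ */
    bool notify_enabled;   /* whether the guest is asked to kick */
    unsigned late_buffers; /* buffers added while notifications are off */
} DemoGuestVq;

static bool demo_vq_empty(const DemoGuestVq *vq)
{
    return vq->popped == vq->avail;
}

static bool demo_vq_pop(DemoGuestVq *vq)
{
    if (demo_vq_empty(vq)) {
        return false;
    }
    vq->popped++;
    return true;
}

/* Returns the number of buffers forwarded for one kick. */
static unsigned demo_handle_guest_kick(DemoGuestVq *vq)
{
    unsigned forwarded = 0;

    do {
        /* No more notifications until everything available is processed */
        vq->notify_enabled = false;
        while (demo_vq_pop(vq)) {
            forwarded++;
        }

        /* Simulated race: the guest adds buffers without kicking */
        vq->avail += vq->late_buffers;
        vq->late_buffers = 0;

        vq->notify_enabled = true;
        /* Re-check after enabling, so late buffers are not missed */
    } while (!demo_vq_empty(vq));

    return forwarded;
}
```

If the re-check after enabling notifications were dropped, the late buffer in this model would wait for a kick that never arrives.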
>
>>> +
>>> +        while (true) {
>>> +            VirtQueueElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
>>> +            if (!elem) {
>>> +                break;
>>> +            }
>>> +
>>> +            vhost_svq_add(svq, elem);
>>> +            event_notifier_set(&svq->kick_notifier);
>>> +        }
>>> +
>>> +        virtio_queue_set_notification(svq->vq, true);
>>
>> I think this can be moved to the end of this function.
>>
> (Same as previous answer)
>
>> Btw, we probably need a quota to make sure the svq is not hogging the
>> main event loop.
>>
>> Similar issue could be found in both virtio-net TX (using timer or bh)
>> and TAP (a quota).
>>
> I think that virtqueue size is the natural limit to that: since we are
> not making any buffers used in the loop, there is no way that it runs
> more than virtqueue size times. If it does because of an evil/bogus
> guest, virtqueue_pop raises the message "Virtqueue size exceeded" and
> returns NULL, effectively breaking the loop.
>
> Virtio-net tx functions mark each buffer right after making them
> available and use it, so they can hog BQL. But my understanding is
> that is not possible in the SVQ case.
Right.
>
> I can add a comment in the code to make it clearer though.
Yes, please.
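
Something like the following could back that comment. It is only a model under the assumption described above, namely that the pop routine refuses to hand out more in-flight elements than the ring size; DEMO_VQ_SIZE, DemoBoundedVq and the demo_* helpers are made-up names.

```c
#include <assert.h>
#include <stdbool.h>

#define DEMO_VQ_SIZE 4u

/* Hypothetical model: a guest may claim any number of available buffers. */
typedef struct {
    unsigned claimed_avail; /* what a (possibly bogus) guest advertises */
    unsigned inflight;      /* popped but not yet marked as used */
} DemoBoundedVq;

/* Mimics a pop that fails once the ring size is reached. */
static bool demo_pop_bounded(DemoBoundedVq *vq)
{
    if (vq->inflight >= DEMO_VQ_SIZE) {
        return false; /* the "Virtqueue size exceeded" case */
    }
    if (vq->claimed_avail == 0) {
        return false; /* nothing left to pop */
    }
    vq->claimed_avail--;
    vq->inflight++;
    return true;
}

/*
 * The kick handler never marks buffers as used, so inflight only grows
 * and the loop runs at most DEMO_VQ_SIZE times.
 */
static unsigned demo_drain(DemoBoundedVq *vq)
{
    unsigned iterations = 0;

    while (demo_pop_bounded(vq)) {
        iterations++;
    }
    return iterations;
}
```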
>
>>> +    } while (!virtio_queue_empty(svq->vq));
>>> +}
>>> +
>>> +static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
>>> +{
>>> +    if (svq->used_idx != svq->shadow_used_idx) {
>>> +        return true;
>>> +    }
>>> +
>>> +    /* Get used idx must not be reordered */
>>> +    smp_rmb();
>>
>> Interesting, we don't do this for kernel drivers. It would be helpful to
>> explain it more clearly as "X must be done before Y".
>>
> I think this got reordered, it's supposed to be *after* get the used
> idx, so it matches the one in the kernel with the comment "Only get
> used array entries after they have been exposed by host.".
Right.
>
> I will change it for the next series.
Ok.
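
For the record, the intended ordering ("read the used index, then the barrier, then the used ring entries") can be sketched in portable C11 as below. This is an illustration only: it uses an acquire fence in place of smp_rmb(), skips the endianness conversions, and all Demo* names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_RING_SIZE 8u

/* Hypothetical used ring written by the device side. */
typedef struct {
    _Atomic uint16_t idx;
    uint32_t ring[DEMO_RING_SIZE];
} DemoUsedRing;

typedef struct {
    DemoUsedRing *used;
    uint16_t last_used_idx;   /* next entry to consume */
    uint16_t shadow_used_idx; /* last index read from the device */
} DemoSvq;

static bool demo_more_used(DemoSvq *svq)
{
    if (svq->last_used_idx != svq->shadow_used_idx) {
        return true;
    }

    svq->shadow_used_idx = atomic_load_explicit(&svq->used->idx,
                                                memory_order_relaxed);
    /* Only read used ring entries after the index that exposes them */
    atomic_thread_fence(memory_order_acquire);

    return svq->last_used_idx != svq->shadow_used_idx;
}

static bool demo_get_used(DemoSvq *svq, uint32_t *id)
{
    if (!demo_more_used(svq)) {
        return false;
    }
    *id = svq->used->ring[svq->last_used_idx % DEMO_RING_SIZE];
    svq->last_used_idx++;
    return true;
}
```

So here X is "load used->idx" and Y is "load used->ring[]", matching the kernel comment quoted above.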
>
>>> +    svq->shadow_used_idx = cpu_to_le16(svq->vring.used->idx);
>>> +
>>> +    return svq->used_idx != svq->shadow_used_idx;
>>> +}
>>> +
>>> +static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
>>> +{
>>> +    vring_desc_t *descs = svq->vring.desc;
>>> +    const vring_used_t *used = svq->vring.used;
>>> +    vring_used_elem_t used_elem;
>>> +    uint16_t last_used;
>>> +
>>> +    if (!vhost_svq_more_used(svq)) {
>>> +        return NULL;
>>> +    }
>>> +
>>> +    last_used = svq->used_idx & (svq->vring.num - 1);
>>> +    used_elem.id = le32_to_cpu(used->ring[last_used].id);
>>> +    used_elem.len = le32_to_cpu(used->ring[last_used].len);
>>> +
>>> +    svq->used_idx++;
>>> +    if (unlikely(used_elem.id >= svq->vring.num)) {
>>> +        error_report("Device %s says index %u is used", svq->vdev->name,
>>> +                     used_elem.id);
>>> +        return NULL;
>>> +    }
>>> +
>>> +    if (unlikely(!svq->ring_id_maps[used_elem.id])) {
>>> +        error_report(
>>> +            "Device %s says index %u is used, but it was not available",
>>> +            svq->vdev->name, used_elem.id);
>>> +        return NULL;
>>> +    }
>>> +
>>> +    descs[used_elem.id].next = svq->free_head;
>>> +    svq->free_head = used_elem.id;
>>> +
>>> +    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
>>> +    return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
>>>    }
>>>
>>>    /* Forward vhost notifications */
>>> @@ -70,8 +258,26 @@ static void vhost_svq_handle_call_no_test(EventNotifier *n)
>>>    {
>>>        VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
>>>                                                 call_notifier);
>>> -
>>> -    event_notifier_set(&svq->guest_call_notifier);
>>> +    VirtQueue *vq = svq->vq;
>>> +
>>> +    /* Make as many buffers as possible used. */
>>> +    do {
>>> +        unsigned i = 0;
>>> +
>>> +        /* TODO: Use VRING_AVAIL_F_NO_INTERRUPT */
>>> +        while (true) {
>>> +            g_autofree VirtQueueElement *elem = vhost_svq_get_buf(svq);
>>> +            if (!elem) {
>>> +                break;
>>> +            }
>>> +
>>> +            assert(i < svq->vring.num);
>>
>> Let's return error instead of using the assert.
>>
> Actually this is a condition that we should never meet: In the case of
> ring overrun, the device would try to mark as used a descriptor that is
> either beyond the vring size *or* one of the already used ones.
> In both cases, elem should be NULL and the loop should break.
>
> So this is a safety net protecting from both, if we have an i >
> svq->vring.num means we are not processing used buffers well anymore,
> and (moreover) this is happening after making used all descriptors.
>
> Taking that into account, should we delete it?
Maybe a warning instead.
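
As a sketch of the warning variant: report and stop instead of aborting. The function below is a hypothetical standalone model, with device_reported standing for how many used buffers the device keeps returning, and fprintf standing in for whatever reporting helper the real code would use.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Collect used buffers, but never more than ring_size of them. Returns
 * how many were accepted and counts a warning if the bound was hit.
 */
static unsigned demo_collect_used(unsigned ring_size,
                                  unsigned device_reported,
                                  unsigned *warnings)
{
    unsigned i = 0;

    while (i < device_reported) {
        if (i >= ring_size) {
            /* Warn and stop processing instead of asserting */
            fprintf(stderr,
                    "demo: more used buffers than ring size (%u)\n",
                    ring_size);
            (*warnings)++;
            break;
        }
        i++;
    }

    return i;
}
```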
>
>>> +            virtqueue_fill(vq, elem, elem->len, i++);
>>> +        }
>>> +
>>> +        virtqueue_flush(vq, i);
>>> +        event_notifier_set(&svq->guest_call_notifier);
>>> +    } while (vhost_svq_more_used(svq));
>>>    }
>>>
>>>    static void vhost_svq_handle_call(EventNotifier *n)
>>> @@ -204,12 +410,25 @@ err_set_vring_kick:
>>>    void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
>>>                        VhostShadowVirtqueue *svq)
>>>    {
>>> +    int i;
>>>        int r = vhost_svq_restore_vdev_host_notifier(dev, idx, svq);
>>> +
>>>        if (unlikely(r < 0)) {
>>>            error_report("Couldn't restore vq kick fd: %s", strerror(-r));
>>>        }
>>>
>>>        event_notifier_set_handler(&svq->host_notifier, NULL);
>>> +
>>> +    for (i = 0; i < svq->vring.num; ++i) {
>>> +        g_autofree VirtQueueElement *elem = svq->ring_id_maps[i];
>>> +        /*
>>> +         * Although the doc says we must unpop in order, it's ok to unpop
>>> +         * everything.
>>> +         */
>>> +        if (elem) {
>>> +            virtqueue_unpop(svq->vq, elem, elem->len);
>>> +        }
>>
>> Will this result in some of the "pending" buffers being submitted multiple
>> times? If yes, should we wait for all the buffers to be used instead of doing
>> the unpop here?
>>
> Do you mean to call virtqueue_unpop with the same elem (or elem.id)
> multiple times? That should never happen, because elem.id should be
> the position in the ring_id_maps. Also, unpop() should just unmap the
> element and never sync again.
>
> Maybe it is way clearer to call virtqueue_detach_element here directly.
I basically meant the buffers that were consumed by the device but
not yet made used. If we unpop them here, will they be processed by
the device again later via vhost-vdpa?
This is probably fine for net but I'm not sure it works for other
devices. Another way is to wait until all the consumed buffers are used.
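
A minimal model of the "wait until everything consumed is used" alternative, assuming the in-flight count can be derived from the two 16-bit indexes. DemoStopState, demo_device_step() and the poll limit are hypothetical; the real code would poll the used ring of the actual device.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical SVQ state at stop time. */
typedef struct {
    uint16_t avail_idx_shadow; /* next head SVQ would expose */
    uint16_t last_used_idx;    /* next used entry SVQ would consume */
    uint16_t device_used_idx;  /* simulated used index of the device */
} DemoStopState;

/* Buffers exposed to the device but not yet returned; wrap-safe. */
static uint16_t demo_inflight(const DemoStopState *s)
{
    return (uint16_t)(s->avail_idx_shadow - s->last_used_idx);
}

/* Simulated device: completes at most one buffer per poll. */
static void demo_device_step(DemoStopState *s)
{
    if (s->device_used_idx != s->avail_idx_shadow) {
        s->device_used_idx++;
    }
}

/* Poll until nothing is in flight or max_polls is reached. */
static unsigned demo_drain_before_stop(DemoStopState *s, unsigned max_polls)
{
    unsigned returned = 0;
    unsigned polls;

    for (polls = 0; polls < max_polls && demo_inflight(s) != 0; polls++) {
        demo_device_step(s);
        while (s->last_used_idx != s->device_used_idx) {
            s->last_used_idx++;
            returned++;
        }
    }

    return returned;
}
```

The poll limit is there because a device that never returns its buffers would otherwise block the stop path forever; what to do when the limit is hit is left open here.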
>
>>> +    }
>>>    }
>>>
>>>    /*
>>> @@ -224,7 +443,7 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
>>>        size_t driver_size;
>>>        size_t device_size;
>>>        g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
>>> -    int r;
>>> +    int r, i;
>>>
>>>        r = event_notifier_init(&svq->kick_notifier, 0);
>>>        if (r != 0) {
>>> @@ -250,6 +469,11 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
>>>        memset(svq->vring.desc, 0, driver_size);
>>>        svq->vring.used = qemu_memalign(qemu_real_host_page_size, device_size);
>>>        memset(svq->vring.used, 0, device_size);
>>> +    for (i = 0; i < num - 1; i++) {
>>> +        svq->vring.desc[i].next = cpu_to_le16(i + 1);
>>> +    }
>>> +
>>> +    svq->ring_id_maps = g_new0(VirtQueueElement *, num);
>>>        event_notifier_set_handler(&svq->call_notifier,
>>>                                   vhost_svq_handle_call);
>>>        return g_steal_pointer(&svq);
>>> @@ -269,6 +493,7 @@ void vhost_svq_free(VhostShadowVirtqueue *vq)
>>>        event_notifier_cleanup(&vq->kick_notifier);
>>>        event_notifier_set_handler(&vq->call_notifier, NULL);
>>>        event_notifier_cleanup(&vq->call_notifier);
>>> +    g_free(vq->ring_id_maps);
>>>        qemu_vfree(vq->vring.desc);
>>>        qemu_vfree(vq->vring.used);
>>>        g_free(vq);
>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>>> index a057e8277d..bb7010ddb5 100644
>>> --- a/hw/virtio/vhost-vdpa.c
>>> +++ b/hw/virtio/vhost-vdpa.c
>>> @@ -19,6 +19,7 @@
>>>    #include "hw/virtio/virtio-net.h"
>>>    #include "hw/virtio/vhost-shadow-virtqueue.h"
>>>    #include "hw/virtio/vhost-vdpa.h"
>>> +#include "hw/virtio/vhost-shadow-virtqueue.h"
>>>    #include "exec/address-spaces.h"
>>>    #include "qemu/main-loop.h"
>>>    #include "cpu.h"
>>> @@ -475,6 +476,28 @@ static int vhost_vdpa_set_features(struct vhost_dev *dev,
>>>        return vhost_vdpa_backend_set_features(dev, features);
>>>    }
>>>
>>> +/**
>>> + * Restore guest features to vdpa device
>>> + */
>>> +static int vhost_vdpa_set_guest_features(struct vhost_dev *dev)
>>> +{
>>> +    struct vhost_vdpa *v = dev->opaque;
>>> +    return vhost_vdpa_backend_set_features(dev, v->guest_features);
>>> +}
>>> +
>>> +/**
>>> + * Set shadow virtqueue supported features
>>> + */
>>> +static int vhost_vdpa_set_svq_features(struct vhost_dev *dev)
>>> +{
>>> +    struct vhost_vdpa *v = dev->opaque;
>>> +    uint64_t features = v->host_features;
>>> +    bool b = vhost_svq_valid_device_features(&features);
>>> +    assert(b);
>>> +
>>> +    return vhost_vdpa_backend_set_features(dev, features);
>>> +}
>>> +
>>>    static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)
>>>    {
>>>        uint64_t features;
>>> @@ -730,6 +753,19 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
>>>        return true;
>>>    }
>>>
>>> +static int vhost_vdpa_vring_pause(struct vhost_dev *dev)
>>> +{
>>> +    int r;
>>> +    uint8_t status;
>>> +
>>> +    vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DEVICE_STOPPED);
>>> +    do {
>>> +        r = vhost_vdpa_call(dev, VHOST_VDPA_GET_STATUS, &status);
>>
>> I guess we'd better add some sleep here.
>>
> If the final version still contains the call, I will add the sleep. At
> the moment I think it's better if we stop the device by a vdpa ioctl.
Ok, so the idea is to sleep in the ioctl?
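For reference, if the polling stays in qemu, a bounded loop with a sleep
could look roughly like the sketch below. The names toy_wait_stopped,
toy_get_status_fn and the TOY_STATUS_STOPPED bit are hypothetical; the
getter callback stands in for the VHOST_VDPA_GET_STATUS call. Unlike the
loop in the patch, it also propagates the getter's error instead of
always returning 0.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define TOY_STATUS_STOPPED 0x20 /* hypothetical bit, illustration only */

typedef int (*toy_get_status_fn)(void *opaque, uint8_t *status);

/*
 * Returns 0 once the STOPPED bit is seen, the getter's error code if it
 * fails, or -1 if max_tries polls elapsed. sleep_ns must be < 1e9.
 */
static int toy_wait_stopped(toy_get_status_fn get_status, void *opaque,
                            unsigned max_tries, long sleep_ns)
{
    struct timespec ts = { .tv_sec = 0, .tv_nsec = sleep_ns };

    for (unsigned i = 0; i < max_tries; i++) {
        uint8_t status = 0;
        int r = get_status(opaque, &status);

        if (r != 0) {
            return r;
        }
        if (status & TOY_STATUS_STOPPED) {
            return 0;
        }
        nanosleep(&ts, NULL); /* yield instead of busy-waiting */
    }
    return -1;
}

/* Fake device: reports STOPPED once the counter has reached zero. */
static int toy_fake_status(void *opaque, uint8_t *status)
{
    int *remaining = opaque;

    *status = (*remaining)-- <= 0 ? TOY_STATUS_STOPPED : 0;
    return 0;
}
```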
>
>>> +    } while (r == 0 && !(status & VIRTIO_CONFIG_S_DEVICE_STOPPED));
>>> +
>>> +    return 0;
>>> +}
>>> +
>>>    /*
>>>     * Start shadow virtqueue.
>>>     */
>>> @@ -742,9 +778,29 @@ static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
>>>            .index = idx + dev->vq_index,
>>>            .fd = event_notifier_get_fd(vhost_call_notifier),
>>>        };
>>> +    struct vhost_vring_addr addr = {
>>> +        .index = idx + dev->vq_index,
>>> +    };
>>> +    struct vhost_vring_state num = {
>>> +        .index = idx + dev->vq_index,
>>> +        .num = virtio_queue_get_num(dev->vdev, idx),
>>> +    };
>>>        int r;
>>>        bool b;
>>>
>>> +    vhost_svq_get_vring_addr(svq, &addr);
>>> +    r = vhost_vdpa_set_vring_addr(dev, &addr);
>>> +    if (unlikely(r)) {
>>> +        error_report("vhost_set_vring_addr for shadow vq failed");
>>> +        return false;
>>> +    }
>>> +
>>> +    r = vhost_vdpa_set_vring_num(dev, &num);
>>> +    if (unlikely(r)) {
>>> +        error_report("vhost_vdpa_set_vring_num for shadow vq failed");
>>> +        return false;
>>> +    }
>>> +
>>>        /* Set shadow vq -> guest notifier */
>>>        assert(v->call_fd[idx]);
>>>        vhost_svq_set_guest_call_notifier(svq, v->call_fd[idx]);
>>> @@ -781,15 +837,32 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
>>>            assert(v->shadow_vqs->len == 0);
>>>            for (n = 0; n < hdev->nvqs; ++n) {
>>>                VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n);
>>> -            bool ok;
>>> -
>>>                if (unlikely(!svq)) {
>>>                    g_ptr_array_set_size(v->shadow_vqs, 0);
>>>                    return 0;
>>>                }
>>>                g_ptr_array_add(v->shadow_vqs, svq);
>>> +        }
>>> +    }
>>>
>>> -            ok = vhost_vdpa_svq_start_vq(hdev, n);
>>> +    r = vhost_vdpa_vring_pause(hdev);
>>> +    assert(r == 0);
>>> +
>>> +    if (enable) {
>>> +        for (n = 0; n < v->shadow_vqs->len; ++n) {
>>> +            /* Obtain Virtqueue state */
>>> +            vhost_virtqueue_stop(hdev, hdev->vdev, &hdev->vqs[n], n);
>>> +        }
>>> +    }
>>> +
>>> +    /* Reset device so it can be configured */
>>> +    r = vhost_vdpa_dev_start(hdev, false);
>>> +    assert(r == 0);
>>> +
>>> +    if (enable) {
>>> +        int r;
>>> +        for (n = 0; n < v->shadow_vqs->len; ++n) {
>>> +            bool ok = vhost_vdpa_svq_start_vq(hdev, n);
>>>                if (unlikely(!ok)) {
>>>                    /* Free still not started svqs */
>>>                    g_ptr_array_set_size(v->shadow_vqs, n);
>>> @@ -797,11 +870,19 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
>>>                    break;
>>>                }
>>>            }
>>> +
>>> +        /* Need to ack features to set state in vp_vdpa devices */
>>
>> vhost_vdpa actually?
>>
> Yes, what a mistake!
>
>>> +        r = vhost_vdpa_set_svq_features(hdev);
>>> +        if (unlikely(r)) {
>>> +            enable = false;
>>> +        }
>>>        }
>>>
>>>        v->shadow_vqs_enabled = enable;
>>>
>>>        if (!enable) {
>>> +        vhost_vdpa_set_guest_features(hdev);
>>> +
>>>            /* Disable all queues or clean up failed start */
>>>            for (n = 0; n < v->shadow_vqs->len; ++n) {
>>>                struct vhost_vring_file file = {
>>> @@ -818,7 +899,12 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
>>>                /* TODO: This can unmask or override call fd! */
>>>                vhost_virtqueue_start(hdev, hdev->vdev, &hdev->vqs[n], vq_idx);
>>>            }
>>> +    }
>>>
>>> +    r = vhost_vdpa_dev_start(hdev, true);
>>> +    assert(r == 0);
>>> +
>>> +    if (!enable) {
>>>            /* Resources cleanup */
>>>            g_ptr_array_set_size(v->shadow_vqs, 0);
>>>        }
>>> @@ -831,6 +917,7 @@ void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
>>>        struct vhost_vdpa *v;
>>>        const char *err_cause = NULL;
>>>        bool r;
>>> +    uint64_t svq_features;
>>>
>>>        QLIST_FOREACH(v, &vhost_vdpa_devices, entry) {
>>>            if (v->dev->vdev && 0 == strcmp(v->dev->vdev->name, name)) {
>>> @@ -846,6 +933,20 @@ void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
>>>            goto err;
>>>        }
>>>
>>> +    svq_features = v->host_features;
>>> +    if (!vhost_svq_valid_device_features(&svq_features)) {
>>> +        error_setg(errp,
>>> +            "Can't enable shadow vq on %s: Unexpected feature flags (%lx-%lx)",
>>> +            name, v->host_features, svq_features);
>>> +        return;
>>> +    } else {
>>> +        /* TODO: Check for virtio_vdpa + IOMMU & modern device */
>>
>> I guess you mean "vhost_vdpa" here.
> Yes, a similar mistake in less than 50 lines :).
>
>> For IOMMU, I guess you mean "vIOMMU"
>> actually?
>>
> This comment is out of date and inherited from the vhost version,
> where only the IOMMU version was developed, so it will be deleted in
> the next series. I think it makes little sense to check vIOMMU if we
> stick with vDPA since it still does not support it, but we could make
> the check here for sure.
Right.
Thanks
>
> Thanks!
>
>> Thanks
>>
>>
>>> +    }
>>> +
>>> +    if (err_cause) {
>>> +        goto err;
>>> +    }
>>> +
>>>        r = vhost_vdpa_enable_svq(v, enable);
>>>        if (unlikely(!r)) {
>>>            err_cause = "Error enabling (see monitor)";
>>> @@ -853,7 +954,7 @@ void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
>>>        }
>>>
>>>    err:
>>> -    if (err_cause) {
>>> +    if (errp == NULL && err_cause) {
>>>            error_setg(errp, "Can't enable shadow vq on %s: %s", name, err_cause);
>>>        }
>>>    }
^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [RFC PATCH v4 11/20] vhost: Route host->guest notification through shadow virtqueue
  2021-10-14 15:58     ` Eugenio Perez Martin
@ 2021-10-15  4:24       ` Jason Wang
  0 siblings, 0 replies; 90+ messages in thread
From: Jason Wang @ 2021-10-15  4:24 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-level, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Michael Lilja, Stefano Garzarella
On 2021/10/14 11:58 PM, Eugenio Perez Martin wrote:
> On Wed, Oct 13, 2021 at 5:49 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/10/1 3:05 PM, Eugenio Pérez wrote:
>>> This will make qemu aware of the device used buffers, allowing it to
>>> write the guest memory with its contents if needed.
>>>
>>> Since the use of vhost_virtqueue_start can unmask and discard call
>>> events, vhost_virtqueue_start should be modified in one of these ways:
>>> * Split in two: One of them uses all logic to start a queue with no
>>>     side effects for the guest, and another one that actually assumes that
>>>     the guest has just started the device. Vdpa should use just the
>>>     former.
>>> * Actually store and check if the guest notifier is masked, and do it
>>>     conditionally.
>>> * Leave it as it is, and duplicate all the logic in vhost-vdpa.
>>
>> Btw, the log is not clear. I guess this patch goes for method 3. If
>> yes, we need to explain it and why.
>>
>> Thanks
>>
> Sorry about being unclear. This commit log (and code) just exposes the
> problem and the solutions I came up with but does nothing to solve it.
> I'm actually going for method 3 for the next series but I'm open to
> doing it differently.
That's fine, but we need to document that method 3 is what is done in
the patch.
Thanks
>
>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [RFC PATCH v4 11/20] vhost: Route host->guest notification through shadow virtqueue
  2021-10-14 16:39     ` Eugenio Perez Martin
@ 2021-10-15  4:42       ` Jason Wang
  2021-10-19  8:39         ` Eugenio Perez Martin
  0 siblings, 1 reply; 90+ messages in thread
From: Jason Wang @ 2021-10-15  4:42 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-level, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Michael Lilja, Stefano Garzarella
On 2021/10/15 12:39 AM, Eugenio Perez Martin wrote:
> On Wed, Oct 13, 2021 at 5:47 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/10/1 3:05 PM, Eugenio Pérez wrote:
>>> This will make qemu aware of the device used buffers, allowing it to
>>> write the guest memory with its contents if needed.
>>>
>>> Since the use of vhost_virtqueue_start can unmask and discard call
>>> events, vhost_virtqueue_start should be modified in one of these ways:
>>> * Split in two: One of them uses all logic to start a queue with no
>>>     side effects for the guest, and another one that actually assumes that
>>>     the guest has just started the device. Vdpa should use just the
>>>     former.
>>> * Actually store and check if the guest notifier is masked, and do it
>>>     conditionally.
>>> * Leave it as it is, and duplicate all the logic in vhost-vdpa.
>>>
>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>> ---
>>>    hw/virtio/vhost-shadow-virtqueue.c | 19 +++++++++++++++
>>>    hw/virtio/vhost-vdpa.c             | 38 +++++++++++++++++++++++++++++-
>>>    2 files changed, 56 insertions(+), 1 deletion(-)
>>>
>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
>>> index 21dc99ab5d..3fe129cf63 100644
>>> --- a/hw/virtio/vhost-shadow-virtqueue.c
>>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
>>> @@ -53,6 +53,22 @@ static void vhost_handle_guest_kick(EventNotifier *n)
>>>        event_notifier_set(&svq->kick_notifier);
>>>    }
>>>
>>> +/* Forward vhost notifications */
>>> +static void vhost_svq_handle_call_no_test(EventNotifier *n)
>>> +{
>>> +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
>>> +                                             call_notifier);
>>> +
>>> +    event_notifier_set(&svq->guest_call_notifier);
>>> +}
>>> +
>>> +static void vhost_svq_handle_call(EventNotifier *n)
>>> +{
>>> +    if (likely(event_notifier_test_and_clear(n))) {
>>> +        vhost_svq_handle_call_no_test(n);
>>> +    }
>>> +}
>>> +
>>>    /*
>>>     * Obtain the SVQ call notifier, where vhost device notifies SVQ that there
>>>     * exists pending used buffers.
>>> @@ -180,6 +196,8 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
>>>        }
>>>
>>>        svq->vq = virtio_get_queue(dev->vdev, vq_idx);
>>> +    event_notifier_set_handler(&svq->call_notifier,
>>> +                               vhost_svq_handle_call);
>>>        return g_steal_pointer(&svq);
>>>
>>>    err_init_call_notifier:
>>> @@ -195,6 +213,7 @@ err_init_kick_notifier:
>>>    void vhost_svq_free(VhostShadowVirtqueue *vq)
>>>    {
>>>        event_notifier_cleanup(&vq->kick_notifier);
>>> +    event_notifier_set_handler(&vq->call_notifier, NULL);
>>>        event_notifier_cleanup(&vq->call_notifier);
>>>        g_free(vq);
>>>    }
>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>>> index bc34de2439..6c5f4c98b8 100644
>>> --- a/hw/virtio/vhost-vdpa.c
>>> +++ b/hw/virtio/vhost-vdpa.c
>>> @@ -712,13 +712,40 @@ static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
>>>    {
>>>        struct vhost_vdpa *v = dev->opaque;
>>>        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, idx);
>>> -    return vhost_svq_start(dev, idx, svq);
>>> +    EventNotifier *vhost_call_notifier = vhost_svq_get_svq_call_notifier(svq);
>>> +    struct vhost_vring_file vhost_call_file = {
>>> +        .index = idx + dev->vq_index,
>>> +        .fd = event_notifier_get_fd(vhost_call_notifier),
>>> +    };
>>> +    int r;
>>> +    bool b;
>>> +
>>> +    /* Set shadow vq -> guest notifier */
>>> +    assert(v->call_fd[idx]);
>>
>> We need to avoid the assert() here. In which case can we hit this?
>>
> I would say that there is no way we can actually hit it, so let's remove it.
>
>>> +    vhost_svq_set_guest_call_notifier(svq, v->call_fd[idx]);
>>> +
>>> +    b = vhost_svq_start(dev, idx, svq);
>>> +    if (unlikely(!b)) {
>>> +        return false;
>>> +    }
>>> +
>>> +    /* Set device -> SVQ notifier */
>>> +    r = vhost_vdpa_set_vring_dev_call(dev, &vhost_call_file);
>>> +    if (unlikely(r)) {
>>> +        error_report("vhost_vdpa_set_vring_call for shadow vq failed");
>>> +        return false;
>>> +    }
>>
>> Similar to kick, do we need to set_vring_call() before vhost_svq_start()?
>>
> It should not matter at this moment because the device should not be
> started at this point and device calls should not run
> vhost_svq_handle_call until BQL is released.
Yes, we stop the virtqueue before that.
>
> The "logic" of doing it after is to make clear that svq must be fully
> initialized before processing device calls, even in the case that we
> extract SVQ in its own iothread or similar. But this could be done
> before vhost_svq_start for sure.
>
>>> +
>>> +    /* Check for pending calls */
>>> +    event_notifier_set(vhost_call_notifier);
>>
>> Interesting, can this result in a spurious interrupt?
>>
> This actually "queues" a vhost_svq_handle_call after the BQL release,
> where the device should be fully reset. In that regard, if there are
> no used descriptors there will not be an irq raised to the guest. Does
> that answer the question? Or have I missed something?
Yes, please explain this in the comment.
>
>>> +    return true;
>>>    }
>>>
>>>    static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
>>>    {
>>>        struct vhost_dev *hdev = v->dev;
>>>        unsigned n;
>>> +    int r;
>>>
>>>        if (enable == v->shadow_vqs_enabled) {
>>>            return hdev->nvqs;
>>> @@ -752,9 +779,18 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
>>>        if (!enable) {
>>>            /* Disable all queues or clean up failed start */
>>>            for (n = 0; n < v->shadow_vqs->len; ++n) {
>>> +            struct vhost_vring_file file = {
>>> +                .index = vhost_vdpa_get_vq_index(hdev, n),
>>> +                .fd = v->call_fd[n],
>>> +            };
>>> +
>>> +            r = vhost_vdpa_set_vring_call(hdev, &file);
>>> +            assert(r == 0);
>>> +
>>>                unsigned vq_idx = vhost_vdpa_get_vq_index(hdev, n);
>>>                VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, n);
>>>                vhost_svq_stop(hdev, n, svq);
>>> +            /* TODO: This can unmask or override call fd! */
>>
>> I don't get this comment. Does this mean the current code can't work
>> with mask_notifiers? If yes, this is something we need to fix.
>>
> Yes, but it will be addressed in the next series. I should have
> explained it better here, sorry :).
Ok.
Thanks
>
> Thanks!
>
>> Thanks
>>
>>
>>>                vhost_virtqueue_start(hdev, hdev->vdev, &hdev->vqs[n], vq_idx);
>>>            }
>>>
^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [RFC PATCH v4 16/20] vhost: Check for device VRING_USED_F_NO_NOTIFY at shadow virtqueue kick
  2021-10-13  4:35   ` Jason Wang
@ 2021-10-15  6:17     ` Eugenio Perez Martin
  0 siblings, 0 replies; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-15  6:17 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-level, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Michael Lilja, Stefano Garzarella
On Wed, Oct 13, 2021 at 6:35 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/10/1 3:05 PM, Eugenio Pérez wrote:
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   hw/virtio/vhost-shadow-virtqueue.c | 11 ++++++++++-
> >   1 file changed, 10 insertions(+), 1 deletion(-)
> >
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > index df7e6fa3ec..775f8d36a0 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > @@ -173,6 +173,15 @@ static void vhost_svq_add(VhostShadowVirtqueue *svq, VirtQueueElement *elem)
> >       svq->ring_id_maps[qemu_head] = elem;
> >   }
> >
> > +static void vhost_svq_kick(VhostShadowVirtqueue *svq)
> > +{
> > +    /* Make sure we are reading updated device flag */
>
>
> I guess this would be better:
>
>          /* We need to expose available array entries before checking used
>           * flags. */
>
> (Borrowed from the kernel code).
>
> Thanks
>
Right, I will replace it, thanks!
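For the record, the ordering the comment describes can be sketched with
C11 atomics as below. This is only an illustration, not the QEMU code:
struct toy_ring and toy_publish_and_need_kick are hypothetical names,
endianness conversion is ignored, and the flag value 1 matches
VRING_USED_F_NO_NOTIFY in the virtio split ring layout.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_USED_F_NO_NOTIFY 1 /* same value as VRING_USED_F_NO_NOTIFY */

/* Hypothetical, reduced view of the two fields that matter here. */
struct toy_ring {
    _Atomic uint16_t avail_idx;  /* written by the driver side (SVQ) */
    _Atomic uint16_t used_flags; /* written by the device */
};

/* Publish a new avail index, then decide whether a kick is needed. */
static bool toy_publish_and_need_kick(struct toy_ring *r, uint16_t new_idx)
{
    uint16_t flags;

    atomic_store_explicit(&r->avail_idx, new_idx, memory_order_release);

    /*
     * Full barrier: the avail entries must be exposed before the used
     * flags are read. Otherwise the device could set NO_NOTIFY-clear and
     * go to sleep while we read a stale flag and skip the kick.
     */
    atomic_thread_fence(memory_order_seq_cst);

    flags = atomic_load_explicit(&r->used_flags, memory_order_relaxed);
    return !(flags & TOY_USED_F_NO_NOTIFY);
}
```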
>
> > +    smp_mb();
> > +    if (!(svq->vring.used->flags & VRING_USED_F_NO_NOTIFY)) {
> > +        event_notifier_set(&svq->kick_notifier);
> > +    }
> > +}
> > +
> >   /* Handle guest->device notifications */
> >   static void vhost_handle_guest_kick(EventNotifier *n)
> >   {
> > @@ -197,7 +206,7 @@ static void vhost_handle_guest_kick(EventNotifier *n)
> >               }
> >
> >               vhost_svq_add(svq, elem);
> > -            event_notifier_set(&svq->kick_notifier);
> > +            vhost_svq_kick(svq);
> >           }
> >
> >           virtio_queue_set_notification(svq->vq, true);
>
^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [RFC PATCH v4 17/20] vhost: Use VRING_AVAIL_F_NO_INTERRUPT at device call on shadow virtqueue
  2021-10-13  4:36   ` Jason Wang
@ 2021-10-15  6:22     ` Eugenio Perez Martin
  0 siblings, 0 replies; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-15  6:22 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-level, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Michael Lilja, Stefano Garzarella
On Wed, Oct 13, 2021 at 6:36 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/10/1 3:06 PM, Eugenio Pérez wrote:
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>
>
> Commit log please.
>
> Thanks
>
Sorry, this was another commit that was intended to be squashed.
I think I squashed two other (previous) commits by mistake in the
rebase editor, that's why you detected unrelated changes and mixed SVQ
& vhost-vdpa changes. I will take more care for the next series I
send.
Thanks!
>
> > ---
> >   hw/virtio/vhost-shadow-virtqueue.c | 24 +++++++++++++++++++++++-
> >   1 file changed, 23 insertions(+), 1 deletion(-)
> >
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > index 775f8d36a0..2fd0bab75d 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > @@ -60,6 +60,9 @@ typedef struct VhostShadowVirtqueue {
> >
> >       /* Next head to consume from device */
> >       uint16_t used_idx;
> > +
> > +    /* Cache for the exposed notification flag */
> > +    bool notification;
> >   } VhostShadowVirtqueue;
> >
> >   /* If the device is using some of these, SVQ cannot communicate */
> > @@ -105,6 +108,24 @@ bool vhost_svq_valid_device_features(uint64_t *dev_features)
> >       return r;
> >   }
> >
> > +static void vhost_svq_set_notification(VhostShadowVirtqueue *svq, bool enable)
> > +{
> > +    uint16_t notification_flag;
> > +
> > +    if (svq->notification == enable) {
> > +        return;
> > +    }
> > +
> > +    notification_flag = cpu_to_le16(VRING_AVAIL_F_NO_INTERRUPT);
> > +
> > +    svq->notification = enable;
> > +    if (enable) {
> > +        svq->vring.avail->flags &= ~notification_flag;
> > +    } else {
> > +        svq->vring.avail->flags |= notification_flag;
> > +    }
> > +}
> > +
> >   static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> >                                       const struct iovec *iovec,
> >                                       size_t num, bool more_descs, bool write)
> > @@ -273,7 +294,7 @@ static void vhost_svq_handle_call_no_test(EventNotifier *n)
> >       do {
> >           unsigned i = 0;
> >
> > -        /* TODO: Use VRING_AVAIL_F_NO_INTERRUPT */
> > +        vhost_svq_set_notification(svq, false);
> >           while (true) {
> >               g_autofree VirtQueueElement *elem = vhost_svq_get_buf(svq);
> >               if (!elem) {
> > @@ -286,6 +307,7 @@ static void vhost_svq_handle_call_no_test(EventNotifier *n)
> >
> >           virtqueue_flush(vq, i);
> >           event_notifier_set(&svq->guest_call_notifier);
> > +        vhost_svq_set_notification(svq, true);
> >       } while (vhost_svq_more_used(svq));
> >   }
> >
>
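As a stand-in for the missing commit log, the pattern in the hunks above
(suppress device calls, drain the used ring, re-enable, then re-check)
can be sketched as follows. The toy_* names are hypothetical, the code is
single-threaded, and a real implementation would need a memory barrier
between re-enabling notifications and the final re-check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_AVAIL_F_NO_INTERRUPT 1 /* same value as VRING_AVAIL_F_NO_INTERRUPT */

/* Hypothetical, reduced queue state; not the QEMU structure. */
struct toy_vq {
    uint16_t avail_flags;   /* driver-written, read by the device */
    uint16_t used_idx;      /* device-written */
    uint16_t last_used_idx; /* driver's private cursor */
    bool notification;      /* cached flag state, avoids redundant writes */
};

static void toy_set_notification(struct toy_vq *vq, bool enable)
{
    if (vq->notification == enable) {
        return;
    }
    vq->notification = enable;
    if (enable) {
        vq->avail_flags &= (uint16_t)~TOY_AVAIL_F_NO_INTERRUPT;
    } else {
        vq->avail_flags |= TOY_AVAIL_F_NO_INTERRUPT;
    }
}

static bool toy_more_used(const struct toy_vq *vq)
{
    return vq->used_idx != vq->last_used_idx;
}

/* Returns how many used buffers were consumed. */
static unsigned toy_handle_call(struct toy_vq *vq)
{
    unsigned total = 0;

    do {
        toy_set_notification(vq, false);
        while (toy_more_used(vq)) {
            vq->last_used_idx++; /* stands in for fetching one buffer */
            total++;
        }
        toy_set_notification(vq, true);
        /* Loop again if the device used buffers while calls were off. */
    } while (toy_more_used(vq));

    return total;
}
```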
^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [RFC PATCH v4 20/20] vdpa: Add custom IOTLB translations to SVQ
  2021-10-13  5:34   ` Jason Wang
@ 2021-10-15  7:27     ` Eugenio Perez Martin
  2021-10-15  7:37       ` Jason Wang
  0 siblings, 1 reply; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-15  7:27 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-level, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Stefano Garzarella
On Wed, Oct 13, 2021 at 7:34 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/10/1 3:06 PM, Eugenio Pérez wrote:
> > Use translations added in VhostIOVATree in SVQ.
> >
> > Now every element needs to store the previous address also, so VirtQueue
> > can consume the elements properly. This adds a little overhead per VQ
> > element, having to allocate more memory to stash them. As a possible
> > optimization, this allocation could be avoided if the descriptor is not
> > a chain but a single one, but this is left undone.
> >
> > TODO: iova range should be queried before, and add logic to fail when
> > GPA is outside of its range and memory listener or svq add it.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   hw/virtio/vhost-shadow-virtqueue.h |   4 +-
> >   hw/virtio/vhost-shadow-virtqueue.c | 130 ++++++++++++++++++++++++-----
> >   hw/virtio/vhost-vdpa.c             |  40 ++++++++-
> >   hw/virtio/trace-events             |   1 +
> >   4 files changed, 152 insertions(+), 23 deletions(-)
>
>
> I thought hard about the whole logic. This is safe since qemu memory map
> will fail if the guest submits an invalid IOVA.
>
Can you expand on this? Do you mean that VirtQueue already
protects the SVQ code if the guest sets an invalid buffer address (GPA)?
> Then I wonder if we could do something much simpler:
>
> 1) Using qemu VA as IOVA but only maps the VA that belongs to guest
> 2) Then we don't need any IOVA tree here, what we need is to just map
> vring and use qemu VA without any translation
>
That would be great, but either qemu's SVQ vring or guest translated
buffers address (in qemu VA form) were already in high addresses,
outside of the device's iova range (in my test).
I didn't try remapping tricks to make them fit in the range, but I
think it does complicate the solution relatively fast if there was
already memory in that range owned by qemu before enabling SVQ:
* Guest memory must be contiguous in VA address space, but it "must"
support hotplug/unplug (although vDPA currently pins it). Hotplug
memory could always overlap with SVQ vring, so we would need to move
it.
* Duplicating mapped memory for writing? (Not sure if guest memory is
actually movable in qemu).
* Indirect descriptors will need to allocate and free memory more or
less frequently, increasing the possibility of overlapping.
If we can move guest memory, however, I can see how we can track it in
a tree *but* mark when the tree is 1:1 with qemu's VA, so buffer
forwarding does not take the translation penalty. When guest memory
cannot be mapped 1:1, we can resort to the tree, and come back to 1:1
translation if the offending tree node(s) get deleted.
However, I think this puts the solution a little bit farther than
"starting simple" :).
Does it make sense?
Thanks!
> Thanks
>
^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [RFC PATCH v4 20/20] vdpa: Add custom IOTLB translations to SVQ
  2021-10-15  7:27     ` Eugenio Perez Martin
@ 2021-10-15  7:37       ` Jason Wang
  2021-10-15  8:20         ` Eugenio Perez Martin
  0 siblings, 1 reply; 90+ messages in thread
From: Jason Wang @ 2021-10-15  7:37 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-level, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Stefano Garzarella
On Fri, Oct 15, 2021 at 3:28 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Wed, Oct 13, 2021 at 7:34 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> >
> > On 2021/10/1 3:06 PM, Eugenio Pérez wrote:
> > > Use translations added in VhostIOVATree in SVQ.
> > >
> > > Now every element needs to store the previous address also, so VirtQueue
> > > can consume the elements properly. This adds a little overhead per VQ
> > > element, having to allocate more memory to stash them. As a possible
> > > optimization, this allocation could be avoided if the descriptor is not
> > > a chain but a single one, but this is left undone.
> > >
> > > TODO: iova range should be queried before, and add logic to fail when
> > > GPA is outside of its range and memory listener or svq add it.
> > >
> > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > ---
> > >   hw/virtio/vhost-shadow-virtqueue.h |   4 +-
> > >   hw/virtio/vhost-shadow-virtqueue.c | 130 ++++++++++++++++++++++++-----
> > >   hw/virtio/vhost-vdpa.c             |  40 ++++++++-
> > >   hw/virtio/trace-events             |   1 +
> > >   4 files changed, 152 insertions(+), 23 deletions(-)
> >
> >
> > I thought hard about the whole logic. This is safe since qemu memory map
> > will fail if the guest submits an invalid IOVA.
> >
>
> Can you expand on this? Do you mean that VirtQueue already
> protects the SVQ code if the guest sets an invalid buffer address (GPA)?
Yes.
>
> > Then I wonder if we could do something much simpler:
> >
> > 1) Using qemu VA as IOVA but only maps the VA that belongs to guest
> > 2) Then we don't need any IOVA tree here, what we need is to just map
> > vring and use qemu VA without any translation
> >
>
> That would be great, but either qemu's SVQ vring or guest translated
> buffers address (in qemu VA form) were already in high addresses,
> outside of the device's iova range (in my test).
You're right. I missed that, and that's why we need e.g. an iova tree and allocator.
What I proposed only makes sense when shared virtual memory (SVA) is
implemented. In the case of SVA, the valid iova range should be the
full VA range.
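To illustrate why an allocator plus a translation structure is needed
when the device's iova range is smaller than qemu's VA space, here is a
deliberately naive sketch. It is not the VhostIOVATree from the series:
the toy_* names, the fixed-size array and the bump allocation are all
assumptions made for brevity, and it assumes iova_last < UINT64_MAX.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_MAPS 16

struct toy_iova_map {
    uint64_t va;   /* qemu virtual address of the region */
    uint64_t iova; /* address the device will use */
    uint64_t size; /* length in bytes, non-zero */
};

struct toy_iova_tree {
    uint64_t iova_first; /* first usable device address */
    uint64_t iova_last;  /* last usable device address, inclusive */
    uint64_t next;       /* bump pointer; starts at iova_first */
    struct toy_iova_map maps[TOY_MAX_MAPS];
    size_t n;
};

/* Reserve an iova for [va, va + size); fails if the range is exhausted. */
static bool toy_iova_alloc(struct toy_iova_tree *t, uint64_t va,
                           uint64_t size, uint64_t *iova)
{
    if (size == 0 || t->n == TOY_MAX_MAPS ||
        t->next < t->iova_first || t->next > t->iova_last ||
        size - 1 > t->iova_last - t->next) {
        return false;
    }
    t->maps[t->n++] = (struct toy_iova_map){ va, t->next, size };
    *iova = t->next;
    t->next += size;
    return true;
}

/* Translate a qemu VA into the iova the device must be given. */
static bool toy_iova_translate(const struct toy_iova_tree *t, uint64_t va,
                               uint64_t *iova)
{
    for (size_t i = 0; i < t->n; i++) {
        const struct toy_iova_map *m = &t->maps[i];

        if (va >= m->va && va - m->va < m->size) {
            *iova = m->iova + (va - m->va);
            return true;
        }
    }
    return false;
}
```

With SVA the translate step would collapse to the identity, which is the
case the simpler proposal relies on.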
>
> I didn't try remapping tricks to make them fit in the range, but I
> think it does complicate the solution relatively fast if there was
> already memory in that range owned by qemu before enabling SVQ:
>
> * Guest memory must be contiguous in VA address space, but it "must"
> support hotplug/unplug (although vDPA currently pins it). Hotplug
> memory could always overlap with SVQ vring, so we would need to move
> it.
> * Duplicating mapped memory for writing? (Not sure if guest memory is
> actually movable in qemu).
> * Indirect descriptors will need to allocate and free memory more or
> less frequently, increasing the possibility of overlapping.
I'm not sure I get the problem, but overlapping is not an issue since
we're using VA.
>
> If we can move guest memory,
I'm not sure we can do this, or at least it looks very tricky.
> however, I can see how we can track it in
> a tree *but* mark when the tree is 1:1 with qemu's VA, so buffer
> forwarding does not take the translation penalty. When guest memory
> cannot be mapped 1:1, we can resort to the tree, and come back to 1:1
> translation if the offending tree node(s) get deleted.
>
> However I think this puts the solution a little bit farther than
> "starting simple" :).
>
> Does it make sense?
Yes. So I think I will review the IOVA tree code and get back to you.
Thanks
>
> Thanks!
>
> > Thanks
> >
>
^ permalink raw reply	[flat|nested] 90+ messages in thread
* Re: [RFC PATCH v4 20/20] vdpa: Add custom IOTLB translations to SVQ
  2021-10-15  7:37       ` Jason Wang
@ 2021-10-15  8:20         ` Eugenio Perez Martin
  2021-10-15  8:37           ` Jason Wang
  2021-10-15  9:14           ` Eugenio Perez Martin
  0 siblings, 2 replies; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-15  8:20 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-level, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Stefano Garzarella
On Fri, Oct 15, 2021 at 9:37 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Fri, Oct 15, 2021 at 3:28 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Wed, Oct 13, 2021 at 7:34 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > >
> > > On 2021/10/1 3:06 PM, Eugenio Pérez wrote:
> > > > Use translations added in VhostIOVATree in SVQ.
> > > >
> > > > Now every element needs to store the previous address also, so VirtQueue
> > > > can consume the elements properly. This adds a little overhead per VQ
> > > > element, having to allocate more memory to stash them. As a possible
> > > > optimization, this allocation could be avoided if the descriptor is not
> > > > a chain but a single one, but this is left undone.
> > > >
> > > > TODO: iova range should be queried before, and add logic to fail when
> > > > GPA is outside of its range and memory listener or svq add it.
> > > >
> > > > Signed-off-by: Eugenio Pérez<eperezma@redhat.com>
> > > > ---
> > > >   hw/virtio/vhost-shadow-virtqueue.h |   4 +-
> > > >   hw/virtio/vhost-shadow-virtqueue.c | 130 ++++++++++++++++++++++++-----
> > > >   hw/virtio/vhost-vdpa.c             |  40 ++++++++-
> > > >   hw/virtio/trace-events             |   1 +
> > > >   4 files changed, 152 insertions(+), 23 deletions(-)
> > >
> > >
> > > Think hard about the whole logic. This is safe since qemu memory map
> > > will fail if guest submits a invalidate IOVA.
> > >
> >
> > Can you expand on this? What you mean is that VirtQueue already
> > protects SVQ code if the guest sets an invalid buffer address (GPA),
> > isn't it?
>
> Yes.
>
> >
> > > Then I wonder if we do something much more simpler:
> > >
> > > 1) Using qemu VA as IOVA but only maps the VA that belongs to guest
> > > 2) Then we don't need any IOVA tree here, what we need is to just map
> > > vring and use qemu VA without any translation
> > >
> >
> > That would be great, but either qemu's SVQ vring or guest translated
> > buffers address (in qemu VA form) were already in high addresses,
> > outside of the device's iova range (in my test).
>
> You're right. I miss that and that's why we need e.g iova tree and allocator.
>
> What I proposed only makes sense when shared virtual memory (SVA) is
> implemented. In the case of SVA, the valid iova range should be the
> full VA range.
>
> >
> > I didn't try remapping tricks to make them fit in the range, but I
> > think it does complicate the solution relatively fast if there was
> > already memory in that range owned by qemu before enabling SVQ:
> >
> > * Guest memory must be contiguous in VA address space, but it "must"
> > support hotplug/unplug (although vDPA currently pins it). Hotplug
> > memory could always overlap with SVQ vring, so we would need to move
> > it.
> > * Duplicating mapped memory for writing? (Not sure if guest memory is
> > actually movable in qemu).
> > * Indirect descriptors will need to allocate and free memory more or
> > less frequently, increasing the possibility of overlapping.
>
> I'm not sure I get the problem, but overlapping is not an issue since
> we're using VA.
>
It's basically the same (potential) problem as DPDK's SVQ: the IOVA range
goes from 0 to X, so both the GPA mappings and the SVQ vrings must fit in
that range. As an example, we put GPA at the beginning of the range, growing
upwards when memory is hotplugged, and the SVQ vrings at the end, growing
downwards when devices are added or set in SVQ mode.
Even leaving aside the fragmentation problems of both spaces, we could reach
a point where both would take the same address, and we would need to fall
back to the tree.
But since we are able to detect those situations, I can see how we could
work in two modes as an optimization: 1:1 when they don't overlap, and
a fragmented tree when they do. But I don't think it's a good idea to
include it from the beginning, and I'm not sure it is worth it
without measuring the tree translation cost first.
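To make the collision concrete, here is a minimal sketch of the layout
described above, with guest memory growing up from 0 and SVQ vrings growing
down from the top of the range. IovaLayout and the iova_* helpers are
hypothetical names used only for illustration; nothing like this exists in
the series. A false return is the "we would need to go to the tree" case.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a device IOVA range [0, limit). */
typedef struct {
    uint64_t limit;     /* one past the last valid IOVA */
    uint64_t gpa_top;   /* guest memory occupies [0, gpa_top) */
    uint64_t svq_base;  /* SVQ vrings occupy [svq_base, limit) */
} IovaLayout;

static void iova_layout_init(IovaLayout *l, uint64_t limit)
{
    l->limit = limit;
    l->gpa_top = 0;
    l->svq_base = limit;
}

/* Grow guest memory upwards (e.g. hotplug). False: 1:1 no longer fits. */
static bool iova_grow_guest(IovaLayout *l, uint64_t size)
{
    assert(l->gpa_top <= l->svq_base);
    if (size > l->svq_base - l->gpa_top) {
        return false;
    }
    l->gpa_top += size;
    return true;
}

/* Carve a vring from the top of the range; its IOVA is stored in *iova. */
static bool iova_alloc_svq(IovaLayout *l, uint64_t size, uint64_t *iova)
{
    assert(l->gpa_top <= l->svq_base);
    if (size > l->svq_base - l->gpa_top) {
        return false;
    }
    l->svq_base -= size;
    *iova = l->svq_base;
    return true;
}
```

Detection is one comparison against the free gap between the two regions, so
a 1:1 fast path would be cheap to guard. Fragmentation after unplug is not
modelled here.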
> >
> > If we can move guest memory,
>
> I'm not sure we can do this or it looks very tricky.
>
Just thinking out loud here, but maybe we could map all memory and
play with remap_file_pages [1] a little bit for that.
> > however, I can see how we can track it in
> > a tree *but* mark when the tree is 1:1 with qemu's VA, so buffers
> > forwarding does not take the translation penalty. When guest memory
> > cannot be map 1:1, we can resort to tree, and come back to 1:1
> > translation if the offending tree node(s) get deleted.
> >
> > However I think this puts the solution a little bit farther than
> > "starting simple" :).
> >
> > Does it make sense?
>
> Yes. So I think I will review the IOVA tree codes and get back to you.
>
Looking forward to it :).
Thanks!
[1] https://linux.die.net/man/2/remap_file_pages
> THanks
>
> >
> > Thanks!
> >
> > > Thanks
> > >
> >
>
* Re: [RFC PATCH v4 20/20] vdpa: Add custom IOTLB translations to SVQ
  2021-10-15  8:20         ` Eugenio Perez Martin
@ 2021-10-15  8:37           ` Jason Wang
  2021-10-15  9:14           ` Eugenio Perez Martin
  1 sibling, 0 replies; 90+ messages in thread
From: Jason Wang @ 2021-10-15  8:37 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-devel, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Stefano Garzarella
On Fri, Oct 15, 2021 at 4:21 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Fri, Oct 15, 2021 at 9:37 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Fri, Oct 15, 2021 at 3:28 PM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Wed, Oct 13, 2021 at 7:34 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > >
> > > > On 2021/10/1 3:06 PM, Eugenio Pérez wrote:
> > > > > Use translations added in VhostIOVATree in SVQ.
> > > > >
> > > > > Now every element needs to store the previous address also, so VirtQueue
> > > > > can consume the elements properly. This adds a little overhead per VQ
> > > > > element, having to allocate more memory to stash them. As a possible
> > > > > optimization, this allocation could be avoided if the descriptor is not
> > > > > a chain but a single one, but this is left undone.
> > > > >
> > > > > TODO: iova range should be queried before, and add logic to fail when
> > > > > GPA is outside of its range and memory listener or svq add it.
> > > > >
> > > > > Signed-off-by: Eugenio Pérez<eperezma@redhat.com>
> > > > > ---
> > > > >   hw/virtio/vhost-shadow-virtqueue.h |   4 +-
> > > > >   hw/virtio/vhost-shadow-virtqueue.c | 130 ++++++++++++++++++++++++-----
> > > > >   hw/virtio/vhost-vdpa.c             |  40 ++++++++-
> > > > >   hw/virtio/trace-events             |   1 +
> > > > >   4 files changed, 152 insertions(+), 23 deletions(-)
> > > >
> > > >
> > > > Think hard about the whole logic. This is safe since qemu memory map
> > > > will fail if guest submits a invalidate IOVA.
> > > >
> > >
> > > Can you expand on this? What you mean is that VirtQueue already
> > > protects SVQ code if the guest sets an invalid buffer address (GPA),
> > > isn't it?
> >
> > Yes.
> >
> > >
> > > > Then I wonder if we do something much more simpler:
> > > >
> > > > 1) Using qemu VA as IOVA but only maps the VA that belongs to guest
> > > > 2) Then we don't need any IOVA tree here, what we need is to just map
> > > > vring and use qemu VA without any translation
> > > >
> > >
> > > That would be great, but either qemu's SVQ vring or guest translated
> > > buffers address (in qemu VA form) were already in high addresses,
> > > outside of the device's iova range (in my test).
> >
> > You're right. I miss that and that's why we need e.g iova tree and allocator.
> >
> > What I proposed only makes sense when shared virtual memory (SVA) is
> > implemented. In the case of SVA, the valid iova range should be the
> > full VA range.
> >
> > >
> > > I didn't try remapping tricks to make them fit in the range, but I
> > > think it does complicate the solution relatively fast if there was
> > > already memory in that range owned by qemu before enabling SVQ:
> > >
> > > * Guest memory must be contiguous in VA address space, but it "must"
> > > support hotplug/unplug (although vDPA currently pins it). Hotplug
> > > memory could always overlap with SVQ vring, so we would need to move
> > > it.
> > > * Duplicating mapped memory for writing? (Not sure if guest memory is
> > > actually movable in qemu).
> > > * Indirect descriptors will need to allocate and free memory more or
> > > less frequently, increasing the possibility of overlapping.
> >
> > I'm not sure I get the problem, but overlapping is not an issue since
> > we're using VA.
> >
>
> It's basically the same (potential) problem of DPDK's SVQ: IOVA Range
> goes from 0 to X. That means that both GPA and SVQ must be in IOVA
> range. As an example, we put GPA at the beginning of the range, that
> grows upwards when memory is hot plugged, and SVQ vrings that grows
> downwards when devices are added or set in SVQ mode.
Yes, but this is not the case if we're using VA.
>
> Even without both space fragmentation problems, we could reach a point
> where both will take the same address, and we would need to go to the
> tree.
>
> But since we are able to detect those situations, I can see how we can
> work in two modes as an optimization: 1:1 when they don't overlap, and
> fragmented tree where it does. But I don't think it's a good idea to
> include it from the beginning, and I'm not sure if that is worth it
> without measuring the  tree translation cost first.
>
> > >
> > > If we can move guest memory,
> >
> > I'm not sure we can do this or it looks very tricky.
> >
>
> Just thinking out loud here, but maybe we could map all memory and
> play with remap_file_pages [1] a little bit for that.
The problem is that there's no guarantee that it will always succeed.
So let's start with the current dedicated IOVA address space. We can
optimize on top of that later anyhow.
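For reference, the translation cost being discussed is roughly one lookup
per buffer address. A minimal sketch, assuming a sorted, non-overlapping
array of mappings instead of the tree used in the series; VaIovaMap and
va_to_iova() are hypothetical names for illustration only:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry: [va, va + size) maps to [iova, iova + size). */
typedef struct {
    uint64_t va;
    uint64_t iova;
    uint64_t size;
} VaIovaMap;

/*
 * Translate a qemu VA into a device IOVA by binary search. `maps` must be
 * sorted by va and non-overlapping. Returns 0 on success, -1 if unmapped.
 */
static int va_to_iova(const VaIovaMap *maps, size_t n, uint64_t va,
                      uint64_t *iova)
{
    size_t lo = 0, hi = n;

    assert(iova != NULL);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const VaIovaMap *m = &maps[mid];

        if (va < m->va) {
            hi = mid;               /* target is below this mapping */
        } else if (va - m->va >= m->size) {
            lo = mid + 1;           /* target is above this mapping */
        } else {
            *iova = m->iova + (va - m->va);
            return 0;
        }
    }
    return -1;
}
```

The lookup is O(log n) in the number of mappings, which is the penalty a
1:1 mode would avoid; whether that matters in practice still needs
measuring.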
>
> > > however, I can see how we can track it in
> > > a tree *but* mark when the tree is 1:1 with qemu's VA, so buffers
> > > forwarding does not take the translation penalty. When guest memory
> > > cannot be map 1:1, we can resort to tree, and come back to 1:1
> > > translation if the offending tree node(s) get deleted.
> > >
> > > However I think this puts the solution a little bit farther than
> > > "starting simple" :).
> > >
> > > Does it make sense?
> >
> > Yes. So I think I will review the IOVA tree codes and get back to you.
> >
>
> Looking forward to it :).
Thanks
>
> Thanks!
>
> [1] https://linux.die.net/man/2/remap_file_pages
>
> > THanks
> >
> > >
> > > Thanks!
> > >
> > > > Thanks
> > > >
> > >
> >
>
* Re: [RFC PATCH v4 08/20] vhost: Route guest->host notification through shadow virtqueue
  2021-10-15  3:45       ` Jason Wang
@ 2021-10-15  9:08         ` Eugenio Perez Martin
  0 siblings, 0 replies; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-15  9:08 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-devel, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Stefano Garzarella
On Fri, Oct 15, 2021 at 5:46 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/10/14 8:00 PM, Eugenio Perez Martin wrote:
> > On Wed, Oct 13, 2021 at 5:27 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/10/1 3:05 PM, Eugenio Pérez wrote:
> >>> Shadow virtqueue notifications forwarding is disabled when vhost_dev
> >>> stops, so code flow follows usual cleanup.
> >>>
> >>> Also, host notifiers must be disabled at SVQ start,
> >>
> >> Any reason for this?
> >>
> > It will be addressed in a later series, sorry.
> >
> >>> and they will not
> >>> start if SVQ has been enabled when device is stopped. This is trivial
> >>> to address, but it is left out for simplicity at this moment.
> >>
> >> It looks to me this patch also contains the following logics
> >>
> >> 1) codes to enable svq
> >>
> >> 2) codes to let svq to be enabled from QMP.
> >>
> >> I think they need to be split out,
> > I agree that we can split this more, with the code that belongs to SVQ
> > and the code that belongs to vhost-vdpa. it will be addressed in
> > future series.
> >
> >> we may endup with the following
> >> series of patches
> >>
> > With "series of patches" do you mean to send every step in a separated
> > series? There are odds of having the need of modifying code already
> > sent & merged with later ones. If you confirm to me that it is fine, I
> > can do it that way for sure.
>
>
> Sorry for being unclear. I meant it's a sub-series actually of the series.
>
>
> >
> >> 1) svq skeleton with enable/disable
> >> 2) route host notifier to svq
> >> 3) route guest notifier to svq
> >> 4) codes to enable svq
> >> 5) enable svq via QMP
> >>
> > I'm totally fine with that, but there is code that is never called if
> > the qmp command is not added. The compiler complains about static
> > functions that are not called, making impossible things like bisecting
> > through these commits, unless I use attribute((unused)) or similar. Or
> > have I missed something?
>
>
> You're right, then I think we can then:
>
> 1) svq skeleton with enable/disable via QMP
> 2) route host notifier to svq
> 3) route guest notifier to svq
>
Got it, I will try to adapt the series.
>
> >
> > We could do that way with the code that belongs to SVQ though, since
> > all of it is declared in headers. But to delay the "enable svq via
> > qmp" to the last one makes debugging harder, as we cannot just enable
> > notifications forwarding with no buffers forwarding.
>
>
> Yes.
>
>
> >
> > If I introduce a change in the notifications code, I can simply go to
> > these commits and enable SVQ for notifications. This way I can have an
> > idea of what part is failing. A similar logic can be applied to other
> > devices than vp_vdpa.
>
>
> vhost-vdpa actually?
>
Not this time :).
I was actually talking about other devices with other features, or even
subtle differences because they are hardware instead of software.
For example, vp_vdpa does not use host notifiers at the moment. If I
enable SVQ with only notifications forwarding and not buffer forwarding, I
can bisect pretty fast that there is a problem in the notifications
forwarding mechanism. This saved me a few times while writing the
series.
There are other ways of reaching the same conclusion of course, and
future developments or fixes to this series will deprecate this method,
but I think it is a good idea to leave the SVQ enabling early in the
series as a way to test what part of the series is failing.
This is just my opinion of course, and I'm biased by the fact that I'm
the one proposing the series. If it makes the reviewing too hard, I'm
totally fine with delaying the QMP command to the end of the series.
>
> > We would lose it if we
Sorry, I think I sent it without a proper review.
> >
> >>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>> ---
> >>>    qapi/net.json                      |   2 +-
> >>>    hw/virtio/vhost-shadow-virtqueue.h |   8 ++
> >>>    include/hw/virtio/vhost-vdpa.h     |   4 +
> >>>    hw/virtio/vhost-shadow-virtqueue.c | 138 ++++++++++++++++++++++++++++-
> >>>    hw/virtio/vhost-vdpa.c             | 116 +++++++++++++++++++++++-
> >>>    5 files changed, 264 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/qapi/net.json b/qapi/net.json
> >>> index a2c30fd455..fe546b0e7c 100644
> >>> --- a/qapi/net.json
> >>> +++ b/qapi/net.json
> >>> @@ -88,7 +88,7 @@
> >>>    #
> >>>    # @enable: true to use the alternate shadow VQ notifications
> >>>    #
> >>> -# Returns: Always error, since SVQ is not implemented at the moment.
> >>> +# Returns: Error if failure, or 'no error' for success.
> >>>    #
> >>>    # Since: 6.2
> >>>    #
> >>> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> >>> index 27ac6388fa..237cfceb9c 100644
> >>> --- a/hw/virtio/vhost-shadow-virtqueue.h
> >>> +++ b/hw/virtio/vhost-shadow-virtqueue.h
> >>> @@ -14,6 +14,14 @@
> >>>
> >>>    typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
> >>>
> >>> +EventNotifier *vhost_svq_get_svq_call_notifier(VhostShadowVirtqueue *svq);
> >>
> >> Let's move this function to another patch since it's unrelated to the
> >> guest->host routing.
> >>
> > Right, I missed it while squashing commits and at later reviews.
> >
> >>> +void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd);
> >>> +
> >>> +bool vhost_svq_start(struct vhost_dev *dev, unsigned idx,
> >>> +                     VhostShadowVirtqueue *svq);
> >>> +void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> >>> +                    VhostShadowVirtqueue *svq);
> >>> +
> >>>    VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx);
> >>>
> >>>    void vhost_svq_free(VhostShadowVirtqueue *vq);
> >>> diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
> >>> index 0d565bb5bd..48aae59d8e 100644
> >>> --- a/include/hw/virtio/vhost-vdpa.h
> >>> +++ b/include/hw/virtio/vhost-vdpa.h
> >>> @@ -12,6 +12,8 @@
> >>>    #ifndef HW_VIRTIO_VHOST_VDPA_H
> >>>    #define HW_VIRTIO_VHOST_VDPA_H
> >>>
> >>> +#include <gmodule.h>
> >>> +
> >>>    #include "qemu/queue.h"
> >>>    #include "hw/virtio/virtio.h"
> >>>
> >>> @@ -24,6 +26,8 @@ typedef struct vhost_vdpa {
> >>>        int device_fd;
> >>>        uint32_t msg_type;
> >>>        MemoryListener listener;
> >>> +    bool shadow_vqs_enabled;
> >>> +    GPtrArray *shadow_vqs;
> >>>        struct vhost_dev *dev;
> >>>        QLIST_ENTRY(vhost_vdpa) entry;
> >>>        VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
> >>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> >>> index c4826a1b56..21dc99ab5d 100644
> >>> --- a/hw/virtio/vhost-shadow-virtqueue.c
> >>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> >>> @@ -9,9 +9,12 @@
> >>>
> >>>    #include "qemu/osdep.h"
> >>>    #include "hw/virtio/vhost-shadow-virtqueue.h"
> >>> +#include "hw/virtio/vhost.h"
> >>> +
> >>> +#include "standard-headers/linux/vhost_types.h"
> >>>
> >>>    #include "qemu/error-report.h"
> >>> -#include "qemu/event_notifier.h"
> >>> +#include "qemu/main-loop.h"
> >>>
> >>>    /* Shadow virtqueue to relay notifications */
> >>>    typedef struct VhostShadowVirtqueue {
> >>> @@ -19,14 +22,146 @@ typedef struct VhostShadowVirtqueue {
> >>>        EventNotifier kick_notifier;
> >>>        /* Shadow call notifier, sent to vhost */
> >>>        EventNotifier call_notifier;
> >>> +
> >>> +    /*
> >>> +     * Borrowed virtqueue's guest to host notifier.
> >>> +     * To borrow it in this event notifier allows to register on the event
> >>> +     * loop and access the associated shadow virtqueue easily. If we use the
> >>> +     * VirtQueue, we don't have an easy way to retrieve it.
> >>> +     *
> >>> +     * So shadow virtqueue must not clean it, or we would lose VirtQueue one.
> >>> +     */
> >>> +    EventNotifier host_notifier;
> >>> +
> >>> +    /* Guest's call notifier, where SVQ calls guest. */
> >>> +    EventNotifier guest_call_notifier;
> >>
> >> To be consistent, let's simply use "guest_notifier" here.
> >>
> > It could be confused when the series adds a guest -> qemu kick
> > notifier then. Actually, I think it would be better to rename
> > host_notifier to something like host_svq_notifier. Or host_call and
> > guest_call, since "notifier" is already in the type, making the name
> > to be a little bit "Hungarian notation".
>
>
> I think that's fine, just need to make sure we have a consistent name
> for SVQ notifier.
>
Got it.
>
> >
> >>> +
> >>> +    /* Virtio queue shadowing */
> >>> +    VirtQueue *vq;
> >>>    } VhostShadowVirtqueue;
> >>>
> >>> +/* Forward guest notifications */
> >>> +static void vhost_handle_guest_kick(EventNotifier *n)
> >>> +{
> >>> +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> >>> +                                             host_notifier);
> >>> +
> >>> +    if (unlikely(!event_notifier_test_and_clear(n))) {
> >>> +        return;
> >>> +    }
> >>
> >> Is there a chance that we may stop the processing of available buffers
> >> during the svq enabling? There could be no kick from the guest in this case.
> >>
> > Actually, yes, I think you are right. The guest kick eventfd could
> > have been consumed by vhost but there may be still pending buffers.
> >
> > I think it would be better to check for available buffers first, then
> > clear the notifier unconditionally, and then re-check and process them
> > if any [1].
>
>
> Looks like I can't find "[1]" anywhere.
>
Sorry, it's in the comment about the vhost_svq_start function. It's not
on a line of its own but in the middle of a paragraph, because both
comments are related: the solution to one of them affects the other.
>
> >
> > However, this problem arises later in the series: At this moment the
> > device is not reset and guest's host notifier is not replaced, so
> > either vhost/device receives the kick, or SVQ does and forwards it.
> >
> > Does it make sense to you?
>
>
> Kind of, so I think we can:
>
> 1) As you said, always check available buffers when switching to SVQ
> 2) always kick the vhost when switching back to vhost
>
Right. I think I prefer solution 1, but let me postpone the choice until
I can compare them properly.
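To illustrate solution 1, here is a rough sketch of the check, clear,
re-check ordering in the kick handler. MockSvq and the mock_* functions are
hypothetical stand-ins for the eventfd and the avail ring; this is not the
code in the series, only the ordering I have in mind.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kick eventfd and the guest's avail ring. */
typedef struct {
    bool kick_pending;      /* models the guest kick eventfd state */
    unsigned avail_idx;     /* written by the guest */
    unsigned last_avail;    /* next buffer SVQ will forward */
    unsigned forwarded;     /* buffers handed to the device so far */
} MockSvq;

/* Forward every buffer the guest has made available so far. */
static void mock_forward_avail(MockSvq *svq)
{
    while (svq->last_avail != svq->avail_idx) {
        svq->last_avail++;
        svq->forwarded++;
    }
}

/* Runs on a guest kick and also once when SVQ takes over the queue. */
static void mock_handle_kick(MockSvq *svq)
{
    assert(svq != NULL);
    /* 1) Forward what is available, even if no kick is pending. */
    mock_forward_avail(svq);
    /* 2) Clear the notifier unconditionally. */
    svq->kick_pending = false;
    /* 3) Re-check: buffers may have been added before the clear. */
    mock_forward_avail(svq);
}
```

Because step 1 does not depend on the notifier, buffers whose kick was
already consumed by vhost are still picked up when SVQ is enabled.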
>
> >
> >>> +
> >>> +    event_notifier_set(&svq->kick_notifier);
> >>> +}
> >>> +
> >>> +/*
> >>> + * Obtain the SVQ call notifier, where vhost device notifies SVQ that there
> >>> + * exists pending used buffers.
> >>> + *
> >>> + * @svq Shadow Virtqueue
> >>> + */
> >>> +EventNotifier *vhost_svq_get_svq_call_notifier(VhostShadowVirtqueue *svq)
> >>> +{
> >>> +    return &svq->call_notifier;
> >>> +}
> >>> +
> >>> +/*
> >>> + * Set the call notifier for the SVQ to call the guest
> >>> + *
> >>> + * @svq Shadow virtqueue
> >>> + * @call_fd call notifier
> >>> + *
> >>> + * Called on BQL context.
> >>> + */
> >>> +void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd)
> >>> +{
> >>> +    event_notifier_init_fd(&svq->guest_call_notifier, call_fd);
> >>> +}
> >>> +
> >>> +/*
> >>> + * Restore the vhost guest to host notifier, i.e., disables svq effect.
> >>> + */
> >>> +static int vhost_svq_restore_vdev_host_notifier(struct vhost_dev *dev,
> >>> +                                                unsigned vhost_index,
> >>> +                                                VhostShadowVirtqueue *svq)
> >>> +{
> >>> +    EventNotifier *vq_host_notifier = virtio_queue_get_host_notifier(svq->vq);
> >>> +    struct vhost_vring_file file = {
> >>> +        .index = vhost_index,
> >>> +        .fd = event_notifier_get_fd(vq_host_notifier),
> >>> +    };
> >>> +    int r;
> >>> +
> >>> +    /* Restore vhost kick */
> >>> +    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
> >>
> >> And remap the notification area if necessary.
> > Totally right, that step is missed in this series.
> >
> > However, remapping guest host notifier memory region has no advantages
> > over using ioeventfd to perform guest -> SVQ notifications, doesn't
> > it? By both methods flow needs to go through the hypervisor kernel.
>
>
> To be clear, I meant restore the notification area mapping from guest to
> device directly. For SVQ, you are right, there's no much value for
> bothering notification area map. (Or we can do it in the future).
>
Ok, I think I confused the location of your comment so I misunderstood
it. You are right, the notification area must be restored here.
>
> >
> >>
> >>> +    return r ? -errno : 0;
> >>> +}
> >>> +
> >>> +/*
> >>> + * Start shadow virtqueue operation.
> >>> + * @dev vhost device
> >>> + * @hidx vhost virtqueue index
> >>> + * @svq Shadow Virtqueue
> >>> + */
> >>> +bool vhost_svq_start(struct vhost_dev *dev, unsigned idx,
> >>> +                     VhostShadowVirtqueue *svq)
> >>> +{
> >>> +    EventNotifier *vq_host_notifier = virtio_queue_get_host_notifier(svq->vq);
> >>> +    struct vhost_vring_file file = {
> >>> +        .index = dev->vhost_ops->vhost_get_vq_index(dev, dev->vq_index + idx),
> >>> +        .fd = event_notifier_get_fd(&svq->kick_notifier),
> >>> +    };
> >>> +    int r;
> >>> +
> >>> +    /* Check that notifications are still going directly to vhost dev */
> >>> +    assert(virtio_queue_is_host_notifier_enabled(svq->vq));
> >>> +
> >>> +    /*
> >>> +     * event_notifier_set_handler already checks for guest's notifications if
> >>> +     * they arrive in the switch, so there is no need to explicitely check for
> >>> +     * them.
> >>> +     */
> >>
> >> If this is true, shouldn't we call vhost_set_vring_kick() before the
> >> event_notifier_set_handler()?
> >>
> > Not at this point of the series, but it could be another solution when
> > we need to reset the device and we are unsure if all buffers have been
> > read. But I think I prefer the solution exposed in [1] and to
> > explicitly call vhost_handle_guest_kick here. Do you think
> > differently?
>
>
> I actually mean if we can end up with this situation since SVQ take over
> the host notifier before set_vring_kick().
>
> 1) guest kick vq, vhost is running
> 2) qemu take over the host notifier
> 3) guest kick vq
> 4) qemu route host notifier to SVQ
>
> Then the vq will be handled by both SVQ and vhost?
>
It shouldn't, because the vhost device must be paused at this moment.
What can happen is that vhost swallows that notification but the
device does not respond to the kick.
>
> >
> >> Btw, I think we should update the fd if set_vring_kick() was called
> >> after this function?
> >>
> > Kind of. This is currently bad in the code, but...
> >
> > Backend callbacks vhost_ops->vhost_set_vring_kick and
> > vhost_ops->vhost_set_vring_addr are only called at
> > vhost_virtqueue_start. And they are always called with known data
> > already stored in VirtQueue.
>
>
> This is true for now, but I'd suggest to not depend on this since it:
>
> 1) it might be changed in the future
> 2) we're working at vhost layer and expose API to virtio device, the
> code should be robust to handle set_vring_kick() at any time
> 3) I think we've already handled similar situation of set_vring_call, so
> let's be consistent
>
Ok. Do you think the same for addr? Or can we stick to that justification there?
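For the kick side, my understanding of the suggestion is something like the
sketch below: always record the guest's fd, and only decide which fd the
device sees based on the SVQ mode. MockVring and the mock_* functions are
hypothetical and do not match the real vhost_vdpa layout.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-vring state; not the real vhost_vdpa layout. */
typedef struct {
    bool svq_enabled;
    int guest_kick_fd;   /* fd the guest signals */
    int svq_kick_fd;     /* fd SVQ signals to kick the device */
    int device_kick_fd;  /* fd currently programmed into the device */
} MockVring;

/* Accept a new guest kick fd at any time, in either mode. */
static void mock_set_vring_kick(MockVring *v, int fd)
{
    assert(v != NULL);
    /* Always remember the guest's fd so SVQ can poll the current one. */
    v->guest_kick_fd = fd;
    /* The device only sees the guest's fd when SVQ is off. */
    v->device_kick_fd = v->svq_enabled ? v->svq_kick_fd : fd;
}

/* Switch modes and reprogram the device from the stored state. */
static void mock_set_svq_enabled(MockVring *v, bool enable)
{
    assert(v != NULL);
    v->svq_enabled = enable;
    mock_set_vring_kick(v, v->guest_kick_fd);
}
```

This costs one extra int of state per vring, and switching back to vhost
does not need to re-read anything from VirtQueue.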
>
> >
> > To avoid storing more state in vhost_vdpa, I think that we should
> > avoid duplicating them, but ignore new kick_fd or address in SVQ mode,
> > and retrieve them again at the moment the device is (re)started in SVQ
> > mode. Qemu already avoids things like allowing the guest to set
> > addresses at random time, using the VirtIOPCIProxy to store them.
> >
> > I also see how duplicating that status could protect vdpa SVQ code
> > against future changes to vhost code, but that would make this series
> > bigger and more complex with no need at this moment in my opinion.
> >
> > Do you agree?
>
>
> Somehow, but consider we can handle set_vring_call(), let's at last make
> set_vring_kick() more robust.
>
>
> >
> >>> +    event_notifier_init_fd(&svq->host_notifier,
> >>> +                           event_notifier_get_fd(vq_host_notifier));
> >>> +    event_notifier_set_handler(&svq->host_notifier, vhost_handle_guest_kick);
> >>> +
> >>> +    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
> >>
> >> And we need to stop the notification area mmap.
> >>
> > Right.
> >
> >>> +    if (unlikely(r != 0)) {
> >>> +        error_report("Couldn't set kick fd: %s", strerror(errno));
> >>> +        goto err_set_vring_kick;
> >>> +    }
> >>> +
> >>> +    return true;
> >>> +
> >>> +err_set_vring_kick:
> >>> +    event_notifier_set_handler(&svq->host_notifier, NULL);
> >>> +
> >>> +    return false;
> >>> +}
> >>> +
> >>> +/*
> >>> + * Stop shadow virtqueue operation.
> >>> + * @dev vhost device
> >>> + * @idx vhost queue index
> >>> + * @svq Shadow Virtqueue
> >>> + */
> >>> +void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> >>> +                    VhostShadowVirtqueue *svq)
> >>> +{
> >>> +    int r = vhost_svq_restore_vdev_host_notifier(dev, idx, svq);
> >>> +    if (unlikely(r < 0)) {
> >>> +        error_report("Couldn't restore vq kick fd: %s", strerror(-r));
> >>> +    }
> >>> +
> >>> +    event_notifier_set_handler(&svq->host_notifier, NULL);
> >>> +}
> >>> +
> >>>    /*
> >>>     * Creates vhost shadow virtqueue, and instruct vhost device to use the shadow
> >>>     * methods and file descriptors.
> >>>     */
> >>>    VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> >>>    {
> >>> +    int vq_idx = dev->vq_index + idx;
> >>>        g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
> >>>        int r;
> >>>
> >>> @@ -44,6 +179,7 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> >>>            goto err_init_call_notifier;
> >>>        }
> >>>
> >>> +    svq->vq = virtio_get_queue(dev->vdev, vq_idx);
> >>>        return g_steal_pointer(&svq);
> >>>
> >>>    err_init_call_notifier:
> >>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> >>> index e0dc7508c3..36c954a779 100644
> >>> --- a/hw/virtio/vhost-vdpa.c
> >>> +++ b/hw/virtio/vhost-vdpa.c
> >>> @@ -17,6 +17,7 @@
> >>>    #include "hw/virtio/vhost.h"
> >>>    #include "hw/virtio/vhost-backend.h"
> >>>    #include "hw/virtio/virtio-net.h"
> >>> +#include "hw/virtio/vhost-shadow-virtqueue.h"
> >>>    #include "hw/virtio/vhost-vdpa.h"
> >>>    #include "exec/address-spaces.h"
> >>>    #include "qemu/main-loop.h"
> >>> @@ -272,6 +273,16 @@ static void vhost_vdpa_add_status(struct vhost_dev *dev, uint8_t status)
> >>>        vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &s);
> >>>    }
> >>>
> >>> +/**
> >>> + * Adaptor function to free shadow virtqueue through gpointer
> >>> + *
> >>> + * @svq   The Shadow Virtqueue
> >>> + */
> >>> +static void vhost_psvq_free(gpointer svq)
> >>> +{
> >>> +    vhost_svq_free(svq);
> >>> +}
> >>> +
> >>>    static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
> >>>    {
> >>>        struct vhost_vdpa *v;
> >>> @@ -283,6 +294,7 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
> >>>        dev->opaque =  opaque ;
> >>>        v->listener = vhost_vdpa_memory_listener;
> >>>        v->msg_type = VHOST_IOTLB_MSG_V2;
> >>> +    v->shadow_vqs = g_ptr_array_new_full(dev->nvqs, vhost_psvq_free);
> >>>        QLIST_INSERT_HEAD(&vhost_vdpa_devices, v, entry);
> >>>
> >>>        vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
> >>> @@ -373,6 +385,17 @@ err:
> >>>        return;
> >>>    }
> >>>
> >>> +static void vhost_vdpa_svq_cleanup(struct vhost_dev *dev)
> >>> +{
> >>> +    struct vhost_vdpa *v = dev->opaque;
> >>> +    size_t idx;
> >>> +
> >>> +    for (idx = 0; idx < v->shadow_vqs->len; ++idx) {
> >>> +        vhost_svq_stop(dev, idx, g_ptr_array_index(v->shadow_vqs, idx));
> >>> +    }
> >>> +    g_ptr_array_free(v->shadow_vqs, true);
> >>> +}
> >>> +
> >>>    static int vhost_vdpa_cleanup(struct vhost_dev *dev)
> >>>    {
> >>>        struct vhost_vdpa *v;
> >>> @@ -381,6 +404,7 @@ static int vhost_vdpa_cleanup(struct vhost_dev *dev)
> >>>        trace_vhost_vdpa_cleanup(dev, v);
> >>>        vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs);
> >>>        memory_listener_unregister(&v->listener);
> >>> +    vhost_vdpa_svq_cleanup(dev);
> >>>        QLIST_REMOVE(v, entry);
> >>>
> >>>        dev->opaque = NULL;
> >>> @@ -557,7 +581,9 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
> >>>        if (started) {
> >>>            uint8_t status = 0;
> >>>            memory_listener_register(&v->listener, &address_space_memory);
> >>> -        vhost_vdpa_host_notifiers_init(dev);
> >>> +        if (!v->shadow_vqs_enabled) {
> >>> +            vhost_vdpa_host_notifiers_init(dev);
> >>> +        }
> >>
> >> This looks like a trick, why not check and setup shadow_vqs inside:
> >>
> >> 1) vhost_vdpa_host_notifiers_init()
> >>
> >> and
> >>
> >> 2) vhost_vdpa_set_vring_kick()
> >>
> > Ok I will move the checks there.
> >
> >>>            vhost_vdpa_set_vring_ready(dev);
> >>>            vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
> >>>            vhost_vdpa_call(dev, VHOST_VDPA_GET_STATUS, &status);
> >>> @@ -663,10 +689,96 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
> >>>        return true;
> >>>    }
> >>>
> >>> +/*
> >>> + * Start shadow virtqueue.
> >>> + */
> >>> +static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
> >>> +{
> >>> +    struct vhost_vdpa *v = dev->opaque;
> >>> +    VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, idx);
> >>> +    return vhost_svq_start(dev, idx, svq);
> >>> +}
> >>> +
> >>> +static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> >>> +{
> >>> +    struct vhost_dev *hdev = v->dev;
> >>> +    unsigned n;
> >>> +
> >>> +    if (enable == v->shadow_vqs_enabled) {
> >>> +        return hdev->nvqs;
> >>> +    }
> >>> +
> >>> +    if (enable) {
> >>> +        /* Allocate resources */
> >>> +        assert(v->shadow_vqs->len == 0);
> >>> +        for (n = 0; n < hdev->nvqs; ++n) {
> >>> +            VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n);
> >>> +            bool ok;
> >>> +
> >>> +            if (unlikely(!svq)) {
> >>> +                g_ptr_array_set_size(v->shadow_vqs, 0);
> >>> +                return 0;
> >>> +            }
> >>> +            g_ptr_array_add(v->shadow_vqs, svq);
> >>> +
> >>> +            ok = vhost_vdpa_svq_start_vq(hdev, n);
> >>> +            if (unlikely(!ok)) {
> >>> +                /* Free still not started svqs */
> >>> +                g_ptr_array_set_size(v->shadow_vqs, n);
> >>> +                enable = false;
> > [2]
> >
> >>> +                break;
> >>> +            }
> >>> +        }
> >>
> >> Since there's almost no logic could be shared between enable and
> >> disable. Let's split those logic out into dedicated functions where the
> >> codes looks more easy to be reviewed (e.g have a better error handling etc).
> >>
> > Maybe it could be more clear in the code, but the reused logic is the
> > disabling of SVQ and the fallback in case it cannot be enabled with
> > [2]. But I'm not against splitting in two different functions if it
> > makes review easier.
>
>
> Ok.
>
>
> >
> >>> +    }
> >>> +
> >>> +    v->shadow_vqs_enabled = enable;
> >>> +
> >>> +    if (!enable) {
> >>> +        /* Disable all queues or clean up failed start */
> >>> +        for (n = 0; n < v->shadow_vqs->len; ++n) {
> >>> +            unsigned vq_idx = vhost_vdpa_get_vq_index(hdev, n);
> >>> +            VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, n);
> >>> +            vhost_svq_stop(hdev, n, svq);
> >>> +            vhost_virtqueue_start(hdev, hdev->vdev, &hdev->vqs[n], vq_idx);
> >>> +        }
> >>> +
> >>> +        /* Resources cleanup */
> >>> +        g_ptr_array_set_size(v->shadow_vqs, 0);
> >>> +    }
> >>> +
> >>> +    return n;
> >>> +}
> >>>
> >>>    void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
> >>>    {
> >>> -    error_setg(errp, "Shadow virtqueue still not implemented");
> >>> +    struct vhost_vdpa *v;
> >>> +    const char *err_cause = NULL;
> >>> +    bool r;
> >>> +
> >>> +    QLIST_FOREACH(v, &vhost_vdpa_devices, entry) {
> >>> +        if (v->dev->vdev && 0 == strcmp(v->dev->vdev->name, name)) {
> >>> +            break;
> >>> +        }
> >>> +    }
> >>
> >> I think you can iterate the NetClientStates to get the vhost-vdpa backends.
> >>
> > Right, I missed it.
> >
> >>> +
> >>> +    if (!v) {
> >>> +        err_cause = "Device not found";
> >>> +        goto err;
> >>> +    } else if (v->notifier[0].addr) {
> >>> +        err_cause = "Device has host notifiers enabled";
> >>
> >> I don't get this.
> >>
> > At this moment of the series you can enable guest -> SVQ -> 'vdpa
> > device' if the device is not using the host notifiers memory region.
> > The right solution is to disable it for the guest, and to handle it in
> > SVQ. Otherwise, guest kick will bypass SVQ and
> >
> > It can be done in the same patch, or at least to disable (as unmap)
> > them at this moment and handle them in a posterior patch. but for
> > prototyping the solution I just ignored it in this series. It will be
> > handled some way or another in the next one. I prefer the last one, to
> > handle in a different patch, but let me know if you think it is better
> > otherwise.
>
>
> Aha, I see. But I think we need to do that in this patch, otherwise
> we can route host notifier to SVQ.
>
> Thanks
>
Ok, I will check it.
Thanks!
>
> >
> >> Btw this function should be implemented in an independent patch after
> >> svq is fully functional.
> >>
> > (Reasons for that are already commented at the top of this mail :) ).
> >
> > Thanks!
> >
> >> Thanks
> >>
> >>
> >>> +        goto err;
> >>> +    }
> >>> +
> >>> +    r = vhost_vdpa_enable_svq(v, enable);
> >>> +    if (unlikely(!r)) {
> >>> +        err_cause = "Error enabling (see monitor)";
> >>> +        goto err;
> >>> +    }
> >>> +
> >>> +err:
> >>> +    if (err_cause) {
> >>> +        error_setg(errp, "Can't enable shadow vq on %s: %s", name, err_cause);
> >>> +    }
> >>>    }
> >>>
> >>>    const VhostOps vdpa_ops = {
>
* Re: [RFC PATCH v4 20/20] vdpa: Add custom IOTLB translations to SVQ
  2021-10-15  8:20         ` Eugenio Perez Martin
  2021-10-15  8:37           ` Jason Wang
@ 2021-10-15  9:14           ` Eugenio Perez Martin
  1 sibling, 0 replies; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-15  9:14 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-level, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Stefano Garzarella
On Fri, Oct 15, 2021 at 10:20 AM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Fri, Oct 15, 2021 at 9:37 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Fri, Oct 15, 2021 at 3:28 PM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Wed, Oct 13, 2021 at 7:34 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > >
> > > > On 2021/10/1 3:06 PM, Eugenio Pérez wrote:
> > > > > Use translations added in VhostIOVATree in SVQ.
> > > > >
> > > > > Now every element needs to store the previous address also, so VirtQueue
> > > > > can consume the elements properly. This adds a little overhead per VQ
> > > > > element, having to allocate more memory to stash them. As a possible
> > > > > optimization, this allocation could be avoided if the descriptor is not
> > > > > a chain but a single one, but this is left undone.
> > > > >
> > > > > TODO: iova range should be queried before, and add logic to fail when
> > > > > GPA is outside of its range and memory listener or svq add it.
> > > > >
> > > > > Signed-off-by: Eugenio Pérez<eperezma@redhat.com>
> > > > > ---
> > > > >   hw/virtio/vhost-shadow-virtqueue.h |   4 +-
> > > > >   hw/virtio/vhost-shadow-virtqueue.c | 130 ++++++++++++++++++++++++-----
> > > > >   hw/virtio/vhost-vdpa.c             |  40 ++++++++-
> > > > >   hw/virtio/trace-events             |   1 +
> > > > >   4 files changed, 152 insertions(+), 23 deletions(-)
> > > >
> > > >
> > > > I thought hard about the whole logic. This is safe since qemu memory map
> > > > will fail if the guest submits an invalid IOVA.
> > > >
> > >
> > > Can you expand on this? What you mean is that VirtQueue already
> > > protects SVQ code if the guest sets an invalid buffer address (GPA),
> > > isn't it?
> >
> > Yes.
> >
> > >
> > > > Then I wonder if we could do something much simpler:
> > > >
> > > > 1) Using qemu VA as IOVA but only maps the VA that belongs to guest
> > > > 2) Then we don't need any IOVA tree here, what we need is to just map
> > > > vring and use qemu VA without any translation
> > > >
> > >
> > > That would be great, but either qemu's SVQ vring or guest translated
> > > buffers address (in qemu VA form) were already in high addresses,
> > > outside of the device's iova range (in my test).
> >
> > You're right. I missed that, and that's why we need e.g. an iova tree and allocator.
> >
> > What I proposed only makes sense when shared virtual memory (SVA) is
> > implemented. In the case of SVA, the valid iova range should be the
> > full VA range.
> >
> > >
> > > I didn't try remapping tricks to make them fit in the range, but I
> > > think it does complicate the solution relatively fast if there was
> > > already memory in that range owned by qemu before enabling SVQ:
> > >
> > > * Guest memory must be contiguous in VA address space, but it "must"
> > > support hotplug/unplug (although vDPA currently pins it). Hotplug
> > > memory could always overlap with SVQ vring, so we would need to move
> > > it.
> > > * Duplicating mapped memory for writing? (Not sure if guest memory is
> > > actually movable in qemu).
> > > * Indirect descriptors will need to allocate and free memory more or
> > > less frequently, increasing the possibility of overlapping.
> >
> > I'm not sure I get the problem, but overlapping is not an issue since
> > we're using VA.
> >
>
> It's basically the same (potential) problem as DPDK's SVQ: the IOVA range
> goes from 0 to X. That means that both GPA and SVQ must be in the IOVA
> range. As an example, we put GPA at the beginning of the range, which
> grows upwards when memory is hot plugged, and SVQ vrings at the end, which
> grow downwards when devices are added or set in SVQ mode.
>
> Even without both space fragmentation problems, we could reach a point
> where both will take the same address, and we would need to go to the
> tree.
>
> But since we are able to detect those situations, I can see how we can
> work in two modes as an optimization: 1:1 when they don't overlap, and
> fragmented tree where it does. But I don't think it's a good idea to
> include it from the beginning, and I'm not sure if that is worth it
> without measuring the  tree translation cost first.
>
> > >
> > > If we can move guest memory,
> >
> > I'm not sure we can do this or it looks very tricky.
> >
>
> Just thinking out loud here, but maybe we could map all memory and
> play with remap_file_pages [1] a little bit for that.
>
> > > however, I can see how we can track it in
> > > a tree *but* mark when the tree is 1:1 with qemu's VA, so buffers
> > > forwarding does not take the translation penalty. When guest memory
> > > cannot be mapped 1:1, we can resort to tree, and come back to 1:1
> > > translation if the offending tree node(s) get deleted.
> > >
> > > However I think this puts the solution a little bit farther than
> > > "starting simple" :).
> > >
> > > Does it make sense?
> >
> > Yes. So I think I will review the IOVA tree codes and get back to you.
> >
>
> Looking forward to it :).
>
PS: Actually, they still use the GArray solution as the previous
series. I'm currently migrating to use an actual tree and adding
allocation features to util/iova-tree, so maybe it is not worth
reviewing it at this moment. The public interface to it is the same,
but there is little to review there.
Thanks!
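
To make the two-ended layout discussed above a bit more concrete, here is
a minimal sketch of it. This is not code from the series: IovaRange,
iova_alloc_low() and iova_alloc_high() are hypothetical names, and the
model only covers the collision check. It ignores fragmentation, hot
unplug and alignment, which is where the tree would come in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical two-ended allocator over the device IOVA range [0, limit). */
typedef struct {
    uint64_t low_end;    /* first IOVA not used by guest memory (grows up) */
    uint64_t high_start; /* lowest IOVA used by SVQ vrings (grows down) */
} IovaRange;

static void iova_range_init(IovaRange *r, uint64_t limit)
{
    r->low_end = 0;
    r->high_start = limit;
}

/* Reserve size bytes for guest memory at the bottom; false on collision. */
static bool iova_alloc_low(IovaRange *r, uint64_t size, uint64_t *iova)
{
    if (size > r->high_start - r->low_end) {
        return false;
    }
    *iova = r->low_end;
    r->low_end += size;
    return true;
}

/* Reserve size bytes for an SVQ vring at the top; false on collision. */
static bool iova_alloc_high(IovaRange *r, uint64_t size, uint64_t *iova)
{
    if (size > r->high_start - r->low_end) {
        return false;
    }
    r->high_start -= size;
    *iova = r->high_start;
    return true;
}
```

Once either allocation fails, both ends have met, and that is the point
where we would need to fall back to the tree.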
> Thanks!
>
> [1] https://linux.die.net/man/2/remap_file_pages
>
> > Thanks
> >
> > >
> > > Thanks!
> > >
> > > > Thanks
> > > >
> > >
> >
* Re: [RFC PATCH v4 15/20] vhost: Shadow virtqueue buffers forwarding
  2021-10-15  4:23       ` Jason Wang
@ 2021-10-15  9:33         ` Eugenio Perez Martin
  0 siblings, 0 replies; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-15  9:33 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-level, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Michael Lilja, Stefano Garzarella
On Fri, Oct 15, 2021 at 6:23 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/10/15 1:56 AM, Eugenio Perez Martin wrote:
> > On Wed, Oct 13, 2021 at 6:31 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/10/1 3:05 PM, Eugenio Pérez wrote:
> >>> Initial version of shadow virtqueue that actually forward buffers. There
> >>> are no iommu support at the moment, and that will be addressed in future
> >>> patches of this series. Since all vhost-vdpa devices uses forced IOMMU,
> >>> this means that SVQ is not usable at this point of the series on any
> >>> device.
> >>>
> >>> For simplicity it only supports modern devices, that expects vring
> >>> in little endian, with split ring and no event idx or indirect
> >>> descriptors. Support for them will not be added in this series.
> >>>
> >>> It reuses the VirtQueue code for the device part. The driver part is
> >>> based on Linux's virtio_ring driver, but with stripped functionality
> >>> and optimizations so it's easier to review. Later commits add simpler
> >>> ones.
> >>>
> >>> SVQ uses VIRTIO_CONFIG_S_DEVICE_STOPPED to pause the device and
> >>> retrieve its status (next available idx the device was going to
> >>> consume) race-free. It can later reset the device to replace vring
> >>> addresses etc. When SVQ starts qemu can resume consuming the guest's
> >>> driver ring from that state, without notice from the latter.
> >>>
> >>> This status bit VIRTIO_CONFIG_S_DEVICE_STOPPED is currently discussed
> >>> in VirtIO, and is implemented in qemu VirtIO-net devices in previous
> >>> commits.
> >>>
> >>> Removal of _S_DEVICE_STOPPED bit (in other words, resuming the device)
> >>> can be done in the future if an use case arises. At this moment we can
> >>> just rely on reseting the full device.
> >>>
> >>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>> ---
> >>>    qapi/net.json                      |   2 +-
> >>>    hw/virtio/vhost-shadow-virtqueue.c | 237 ++++++++++++++++++++++++++++-
> >>>    hw/virtio/vhost-vdpa.c             | 109 ++++++++++++-
> >>>    3 files changed, 337 insertions(+), 11 deletions(-)
> >>>
> >>> diff --git a/qapi/net.json b/qapi/net.json
> >>> index fe546b0e7c..1f4a55f2c5 100644
> >>> --- a/qapi/net.json
> >>> +++ b/qapi/net.json
> >>> @@ -86,7 +86,7 @@
> >>>    #
> >>>    # @name: the device name of the VirtIO device
> >>>    #
> >>> -# @enable: true to use the alternate shadow VQ notifications
> >>> +# @enable: true to use the alternate shadow VQ buffers fowarding path
> >>>    #
> >>>    # Returns: Error if failure, or 'no error' for success.
> >>>    #
> >>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> >>> index 34e159d4fd..df7e6fa3ec 100644
> >>> --- a/hw/virtio/vhost-shadow-virtqueue.c
> >>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> >>> @@ -10,6 +10,7 @@
> >>>    #include "qemu/osdep.h"
> >>>    #include "hw/virtio/vhost-shadow-virtqueue.h"
> >>>    #include "hw/virtio/vhost.h"
> >>> +#include "hw/virtio/virtio-access.h"
> >>>
> >>>    #include "standard-headers/linux/vhost_types.h"
> >>>
> >>> @@ -44,15 +45,135 @@ typedef struct VhostShadowVirtqueue {
> >>>
> >>>        /* Virtio device */
> >>>        VirtIODevice *vdev;
> >>> +
> >>> +    /* Map for returning guest's descriptors */
> >>> +    VirtQueueElement **ring_id_maps;
> >>> +
> >>> +    /* Next head to expose to device */
> >>> +    uint16_t avail_idx_shadow;
> >>> +
> >>> +    /* Next free descriptor */
> >>> +    uint16_t free_head;
> >>> +
> >>> +    /* Last seen used idx */
> >>> +    uint16_t shadow_used_idx;
> >>> +
> >>> +    /* Next head to consume from device */
> >>> +    uint16_t used_idx;
> >>
> >> Let's use "last_used_idx" as kernel driver did.
> >>
> > Ok I will change it.
> >
> >>>    } VhostShadowVirtqueue;
> >>>
> >>>    /* If the device is using some of these, SVQ cannot communicate */
> >>>    bool vhost_svq_valid_device_features(uint64_t *dev_features)
> >>>    {
> >>> -    return true;
> >>> +    uint64_t b;
> >>> +    bool r = true;
> >>> +
> >>> +    for (b = VIRTIO_TRANSPORT_F_START; b <= VIRTIO_TRANSPORT_F_END; ++b) {
> >>> +        switch (b) {
> >>> +        case VIRTIO_F_NOTIFY_ON_EMPTY:
> >>> +        case VIRTIO_F_ANY_LAYOUT:
> >>> +            /* SVQ is fine with this feature */
> >>> +            continue;
> >>> +
> >>> +        case VIRTIO_F_ACCESS_PLATFORM:
> >>> +            /* SVQ needs this feature disabled. Can't continue */
> >>
> >> So code can explain itself, need a comment to explain why.
> >>
> > Do you mean that it *doesn't* need a comment to explain why? In that
> > case I will delete them.
>
>
> I meant the comment is duplicated with the code. If you wish, you can
> explain why ACCESS_PLATFORM needs to be disabled.
>
Got it, I will do something about it.
>
> >
> >>> +            if (*dev_features & BIT_ULL(b)) {
> >>> +                clear_bit(b, dev_features);
> >>> +                r = false;
> >>> +            }
> >>> +            break;
> >>> +
> >>> +        case VIRTIO_F_VERSION_1:
> >>> +            /* SVQ needs this feature, so can't continue */
> >>
> >> A comment to explain why SVQ needs this feature.
> >>
> > Sure I will add it.
> >
> >>> +            if (!(*dev_features & BIT_ULL(b))) {
> >>> +                set_bit(b, dev_features);
> >>> +                r = false;
> >>> +            }
> >>> +            continue;
> >>> +
> >>> +        default:
> >>> +            /*
> >>> +             * SVQ must disable this feature, let's hope the device is fine
> >>> +             * without it.
> >>> +             */
> >>> +            if (*dev_features & BIT_ULL(b)) {
> >>> +                clear_bit(b, dev_features);
> >>> +            }
> >>> +        }
> >>> +    }
> >>> +
> >>> +    return r;
> >>> +}
> >>
> >> Let's move this to patch 14.
> >>
> > I can move it down to 14/20, but then it is not really accurate, since
> > notifications forwarding can work with all feature sets. Not like we
> > are introducing a regression, but still.
> >
> > I can always explain that in the patch message though, would that be ok?
>
>
> I'm afraid this will break bisection. E.g for patch 14, it works for any
> features but for patch 15 it doesn't.
>
So no moving them? :).
>
> >
> >>> +
> >>> +static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> >>> +                                    const struct iovec *iovec,
> >>> +                                    size_t num, bool more_descs, bool write)
> >>> +{
> >>> +    uint16_t i = svq->free_head, last = svq->free_head;
> >>> +    unsigned n;
> >>> +    uint16_t flags = write ? cpu_to_le16(VRING_DESC_F_WRITE) : 0;
> >>> +    vring_desc_t *descs = svq->vring.desc;
> >>> +
> >>> +    if (num == 0) {
> >>> +        return;
> >>> +    }
> >>> +
> >>> +    for (n = 0; n < num; n++) {
> >>> +        if (more_descs || (n + 1 < num)) {
> >>> +            descs[i].flags = flags | cpu_to_le16(VRING_DESC_F_NEXT);
> >>> +        } else {
> >>> +            descs[i].flags = flags;
> >>> +        }
> >>> +        descs[i].addr = cpu_to_le64((hwaddr)iovec[n].iov_base);
> >>> +        descs[i].len = cpu_to_le32(iovec[n].iov_len);
> >>> +
> >>> +        last = i;
> >>> +        i = cpu_to_le16(descs[i].next);
> >>> +    }
> >>> +
> >>> +    svq->free_head = le16_to_cpu(descs[last].next);
> >>> +}
> >>> +
> >>> +static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
> >>> +                                    VirtQueueElement *elem)
> >>> +{
> >>> +    int head;
> >>> +    unsigned avail_idx;
> >>> +    vring_avail_t *avail = svq->vring.avail;
> >>> +
> >>> +    head = svq->free_head;
> >>> +
> >>> +    /* We need some descriptors here */
> >>> +    assert(elem->out_num || elem->in_num);
> >>> +
> >>> +    vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
> >>> +                            elem->in_num > 0, false);
> >>> +    vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
> >>> +
> >>> +    /*
> >>> +     * Put entry in available array (but don't update avail->idx until they
> >>> +     * do sync).
> >>> +     */
> >>> +    avail_idx = svq->avail_idx_shadow & (svq->vring.num - 1);
> >>> +    avail->ring[avail_idx] = cpu_to_le16(head);
> >>> +    svq->avail_idx_shadow++;
> >>> +
> >>> +    /* Update avail index after the descriptor is wrote */
> >>> +    smp_wmb();
> >>> +    avail->idx = cpu_to_le16(svq->avail_idx_shadow);
> >>> +
> >>> +    return head;
> >>> +
> >>>    }
> >>>
> >>> -/* Forward guest notifications */
> >>> +static void vhost_svq_add(VhostShadowVirtqueue *svq, VirtQueueElement *elem)
> >>> +{
> >>> +    unsigned qemu_head = vhost_svq_add_split(svq, elem);
> >>> +
> >>> +    svq->ring_id_maps[qemu_head] = elem;
> >>> +}
> >>> +
> >>> +/* Handle guest->device notifications */
> >>>    static void vhost_handle_guest_kick(EventNotifier *n)
> >>>    {
> >>>        VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> >>> @@ -62,7 +183,74 @@ static void vhost_handle_guest_kick(EventNotifier *n)
> >>>            return;
> >>>        }
> >>>
> >>> -    event_notifier_set(&svq->kick_notifier);
> >>> +    /* Make available as many buffers as possible */
> >>> +    do {
> >>> +        if (virtio_queue_get_notification(svq->vq)) {
> >>> +            /* No more notifications until process all available */
> >>> +            virtio_queue_set_notification(svq->vq, false);
> >>> +        }
> >>
> >> This can be done outside the loop.
> >>
> > I think it cannot. The intention of doing this way is that we check
> > for new available buffers *also after* enabling notifications, so we
> > don't miss any of them. It is more or less copied from
> > virtio_blk_handle_vq, which also needs to run to completion.
> >
> > If we need to loop again because there are more available buffers, we
> > want to disable notifications again. Or am I missing something?
>
>
> I think you're right.
>
>
> >
> >>> +
> >>> +        while (true) {
> >>> +            VirtQueueElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
> >>> +            if (!elem) {
> >>> +                break;
> >>> +            }
> >>> +
> >>> +            vhost_svq_add(svq, elem);
> >>> +            event_notifier_set(&svq->kick_notifier);
> >>> +        }
> >>> +
> >>> +        virtio_queue_set_notification(svq->vq, true);
> >>
> >> I think this can be moved to the end of this function.
> >>
> > (Same as previous answer)
> >
> >> Btw, we probably need a quota to make sure the svq is not hogging the
> >> main event loop.
> >>
> >> Similar issue could be found in both virtio-net TX (using timer or bh)
> >> and TAP (a quota).
> >>
> > I think that virtqueue size is the natural limit to that: since we are
> > not making any buffers used in the loop, there is no way that it runs
> > more than virtqueue size times. If it does because of an evil/bogus
> > guest, virtqueue_pop raises the message "Virtqueue size exceeded" and
> > returns NULL, effectively breaking the loop.
> >
> > Virtio-net tx functions mark each buffer right after making them
> > available and use it, so they can hog BQL. But my understanding is
> > that is not possible in the SVQ case.
>
>
> Right.
>
>
> >
> > I can add a comment in the code to make it clearer though.
>
>
> Yes, please.
>
Ok I will add the comment.
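
As a rough illustration of why the disable/enable pair stays inside the
loop, here is a toy model of the pattern. It is only a sketch: ToyQueue
and the toy_* helpers are made up for this mail and are not qemu APIs.
The late_arrivals field simulates the guest adding buffers just before it
sees notifications re-enabled, which is the window the final re-check
closes.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a guest queue; all names here are illustrative only. */
typedef struct {
    unsigned pending;       /* buffers the guest has made available */
    bool notify_enabled;    /* whether the guest is asked to kick us */
    unsigned late_arrivals; /* buffers that show up while re-enabling */
} ToyQueue;

static bool toy_pop(ToyQueue *q)
{
    if (q->pending == 0) {
        return false;
    }
    q->pending--;
    return true;
}

static void toy_set_notification(ToyQueue *q, bool enable)
{
    q->notify_enabled = enable;
    if (enable && q->late_arrivals) {
        /* Guest raced with us: it added buffers without kicking */
        q->pending += q->late_arrivals;
        q->late_arrivals = 0;
    }
}

/* Returns the number of buffers forwarded by one kick. */
static unsigned toy_handle_kick(ToyQueue *q)
{
    unsigned forwarded = 0;

    do {
        /* No more notifications until all available buffers are processed */
        toy_set_notification(q, false);
        while (toy_pop(q)) {
            forwarded++;
        }
        toy_set_notification(q, true);
        /* Re-check: buffers may have arrived before the flag was visible */
    } while (q->pending > 0);

    return forwarded;
}
```

With the enable moved to the end of the function and no re-check, the
late buffers in this model would sit in the queue with no kick pending.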
>
> >
> >>> +    } while (!virtio_queue_empty(svq->vq));
> >>> +}
> >>> +
> >>> +static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
> >>> +{
> >>> +    if (svq->used_idx != svq->shadow_used_idx) {
> >>> +        return true;
> >>> +    }
> >>> +
> >>> +    /* Get used idx must not be reordered */
> >>> +    smp_rmb();
> >>
> >> Interesting, we don't do this for kernel drivers. It would be helpful to
> >> explain it more clear by "X must be done before Y".
> >>
> > I think this got reordered, it's supposed to be *after* get the used
> > idx, so it matches the one in the kernel with the comment "Only get
> > used array entries after they have been exposed by host.".
>
>
> Right.
>
>
> >
> > I will change it for the next series.
>
>
> Ok.
>
>
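
For reference, the intended ordering (read the used idx first, barrier
after) would look roughly like the sketch below. It uses C11 atomics
instead of qemu's smp_rmb() so that it stands alone, and ToyUsedRing is a
made-up type. Endianness conversion is left out on purpose.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Made-up stand-in for the parts of SVQ that track the used ring. */
typedef struct {
    _Atomic uint16_t device_used_idx; /* written by the device */
    uint16_t shadow_used_idx;         /* last value read from the device */
    uint16_t last_used_idx;           /* next entry the driver consumes */
} ToyUsedRing;

static bool toy_more_used(ToyUsedRing *r)
{
    if (r->last_used_idx != r->shadow_used_idx) {
        /* Entries already known from a previous read of the used idx */
        return true;
    }

    r->shadow_used_idx = atomic_load_explicit(&r->device_used_idx,
                                              memory_order_relaxed);
    /*
     * The used idx must be read before any used ring entry: only get
     * used array entries after they have been exposed by the device.
     */
    atomic_thread_fence(memory_order_acquire);

    return r->last_used_idx != r->shadow_used_idx;
}
```

So the rule is "read used idx" before "read used ring entries", with the
barrier sitting between the two.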
> >
> >>> +    svq->shadow_used_idx = cpu_to_le16(svq->vring.used->idx);
> >>> +
> >>> +    return svq->used_idx != svq->shadow_used_idx;
> >>> +}
> >>> +
> >>> +static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> >>> +{
> >>> +    vring_desc_t *descs = svq->vring.desc;
> >>> +    const vring_used_t *used = svq->vring.used;
> >>> +    vring_used_elem_t used_elem;
> >>> +    uint16_t last_used;
> >>> +
> >>> +    if (!vhost_svq_more_used(svq)) {
> >>> +        return NULL;
> >>> +    }
> >>> +
> >>> +    last_used = svq->used_idx & (svq->vring.num - 1);
> >>> +    used_elem.id = le32_to_cpu(used->ring[last_used].id);
> >>> +    used_elem.len = le32_to_cpu(used->ring[last_used].len);
> >>> +
> >>> +    svq->used_idx++;
> >>> +    if (unlikely(used_elem.id >= svq->vring.num)) {
> >>> +        error_report("Device %s says index %u is used", svq->vdev->name,
> >>> +                     used_elem.id);
> >>> +        return NULL;
> >>> +    }
> >>> +
> >>> +    if (unlikely(!svq->ring_id_maps[used_elem.id])) {
> >>> +        error_report(
> >>> +            "Device %s says index %u is used, but it was not available",
> >>> +            svq->vdev->name, used_elem.id);
> >>> +        return NULL;
> >>> +    }
> >>> +
> >>> +    descs[used_elem.id].next = svq->free_head;
> >>> +    svq->free_head = used_elem.id;
> >>> +
> >>> +    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
> >>> +    return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
> >>>    }
> >>>
> >>>    /* Forward vhost notifications */
> >>> @@ -70,8 +258,26 @@ static void vhost_svq_handle_call_no_test(EventNotifier *n)
> >>>    {
> >>>        VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> >>>                                                 call_notifier);
> >>> -
> >>> -    event_notifier_set(&svq->guest_call_notifier);
> >>> +    VirtQueue *vq = svq->vq;
> >>> +
> >>> +    /* Make as many buffers as possible used. */
> >>> +    do {
> >>> +        unsigned i = 0;
> >>> +
> >>> +        /* TODO: Use VRING_AVAIL_F_NO_INTERRUPT */
> >>> +        while (true) {
> >>> +            g_autofree VirtQueueElement *elem = vhost_svq_get_buf(svq);
> >>> +            if (!elem) {
> >>> +                break;
> >>> +            }
> >>> +
> >>> +            assert(i < svq->vring.num);
> >>
> >> Let's return error instead of using the assert.
> >>
> > Actually this is a condition that we should never meet: In the case of
> > ring overrun, device would try to set used a descriptor that is either
> > greater than vring size *or* should try to overrun some of the already used ones.
> > In both cases, elem should be NULL and the loop should break.
> >
> > So this is a safety net protecting from both, if we have an i >
> > svq->vring.num means we are not processing used buffers well anymore,
> > and (moreover) this is happening after making used all descriptors.
> >
> > Taking that into account, should we delete it?
>
>
> Maybe a warning instead.
>
Ok I will do it.
>
> >
> >>> +            virtqueue_fill(vq, elem, elem->len, i++);
> >>> +        }
> >>> +
> >>> +        virtqueue_flush(vq, i);
> >>> +        event_notifier_set(&svq->guest_call_notifier);
> >>> +    } while (vhost_svq_more_used(svq));
> >>>    }
> >>>
> >>>    static void vhost_svq_handle_call(EventNotifier *n)
> >>> @@ -204,12 +410,25 @@ err_set_vring_kick:
> >>>    void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> >>>                        VhostShadowVirtqueue *svq)
> >>>    {
> >>> +    int i;
> >>>        int r = vhost_svq_restore_vdev_host_notifier(dev, idx, svq);
> >>> +
> >>>        if (unlikely(r < 0)) {
> >>>            error_report("Couldn't restore vq kick fd: %s", strerror(-r));
> >>>        }
> >>>
> >>>        event_notifier_set_handler(&svq->host_notifier, NULL);
> >>> +
> >>> +    for (i = 0; i < svq->vring.num; ++i) {
> >>> +        g_autofree VirtQueueElement *elem = svq->ring_id_maps[i];
> >>> +        /*
> >>> +         * Although the doc says we must unpop in order, it's ok to unpop
> >>> +         * everything.
> >>> +         */
> >>> +        if (elem) {
> >>> +            virtqueue_unpop(svq->vq, elem, elem->len);
> >>> +        }
> >>
> >> Will this result some of the "pending" buffers to be submitted multiple
> >> times? If yes, should we wait for all the buffers used instead of doing
> >> the unpop here?
> >>
> > Do you mean to call virtqueue_unpop with the same elem (or elem.id)
> > multiple times? That should never happen, because elem.id should be
> > the position in the ring_id_maps. Also, unpop() should just unmap the
> > element and never sync again.
> >
> > Maybe it is way clearer to call virtqueue_detach_element here directly.
>
>
> I meant basically the buffers that were consumed by the device but
> not yet marked as used. In this case, if we unpop here, will they be
> processed by the device again later via vhost-vdpa?
>
> This is probably fine for net but I'm not sure it works for other
> devices. Another way is to wait until all the consumed buffers are used.
>
My understanding is that it is the responsibility of the device to
pause and mark all of them as used. Without things like INFLIGHT_FD
there is no way for qemu to tell the difference between a buffer that
should be marked as used but is not, and a buffer that is properly not
marked as used. So if the device is not able to do it by itself, it
(or vdpa backend device driver) should not offer / fail the pause
command.
Similarly, if the device *needs* the inflight_fd mechanism to pause,
it should fail the pause if it has not been negotiated.
>
> >
> >>> +    }
> >>>    }
> >>>
> >>>    /*
> >>> @@ -224,7 +443,7 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> >>>        size_t driver_size;
> >>>        size_t device_size;
> >>>        g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
> >>> -    int r;
> >>> +    int r, i;
> >>>
> >>>        r = event_notifier_init(&svq->kick_notifier, 0);
> >>>        if (r != 0) {
> >>> @@ -250,6 +469,11 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> >>>        memset(svq->vring.desc, 0, driver_size);
> >>>        svq->vring.used = qemu_memalign(qemu_real_host_page_size, device_size);
> >>>        memset(svq->vring.used, 0, device_size);
> >>> +    for (i = 0; i < num - 1; i++) {
> >>> +        svq->vring.desc[i].next = cpu_to_le16(i + 1);
> >>> +    }
> >>> +
> >>> +    svq->ring_id_maps = g_new0(VirtQueueElement *, num);
> >>>        event_notifier_set_handler(&svq->call_notifier,
> >>>                                   vhost_svq_handle_call);
> >>>        return g_steal_pointer(&svq);
> >>> @@ -269,6 +493,7 @@ void vhost_svq_free(VhostShadowVirtqueue *vq)
> >>>        event_notifier_cleanup(&vq->kick_notifier);
> >>>        event_notifier_set_handler(&vq->call_notifier, NULL);
> >>>        event_notifier_cleanup(&vq->call_notifier);
> >>> +    g_free(vq->ring_id_maps);
> >>>        qemu_vfree(vq->vring.desc);
> >>>        qemu_vfree(vq->vring.used);
> >>>        g_free(vq);
> >>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> >>> index a057e8277d..bb7010ddb5 100644
> >>> --- a/hw/virtio/vhost-vdpa.c
> >>> +++ b/hw/virtio/vhost-vdpa.c
> >>> @@ -19,6 +19,7 @@
> >>>    #include "hw/virtio/virtio-net.h"
> >>>    #include "hw/virtio/vhost-shadow-virtqueue.h"
> >>>    #include "hw/virtio/vhost-vdpa.h"
> >>> +#include "hw/virtio/vhost-shadow-virtqueue.h"
> >>>    #include "exec/address-spaces.h"
> >>>    #include "qemu/main-loop.h"
> >>>    #include "cpu.h"
> >>> @@ -475,6 +476,28 @@ static int vhost_vdpa_set_features(struct vhost_dev *dev,
> >>>        return vhost_vdpa_backend_set_features(dev, features);
> >>>    }
> >>>
> >>> +/**
> >>> + * Restore guest features to vdpa device
> >>> + */
> >>> +static int vhost_vdpa_set_guest_features(struct vhost_dev *dev)
> >>> +{
> >>> +    struct vhost_vdpa *v = dev->opaque;
> >>> +    return vhost_vdpa_backend_set_features(dev, v->guest_features);
> >>> +}
> >>> +
> >>> +/**
> >>> + * Set shadow virtqueue supported features
> >>> + */
> >>> +static int vhost_vdpa_set_svq_features(struct vhost_dev *dev)
> >>> +{
> >>> +    struct vhost_vdpa *v = dev->opaque;
> >>> +    uint64_t features = v->host_features;
> >>> +    bool b = vhost_svq_valid_device_features(&features);
> >>> +    assert(b);
> >>> +
> >>> +    return vhost_vdpa_backend_set_features(dev, features);
> >>> +}
> >>> +
> >>>    static int vhost_vdpa_set_backend_cap(struct vhost_dev *dev)
> >>>    {
> >>>        uint64_t features;
> >>> @@ -730,6 +753,19 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
> >>>        return true;
> >>>    }
> >>>
> >>> +static int vhost_vdpa_vring_pause(struct vhost_dev *dev)
> >>> +{
> >>> +    int r;
> >>> +    uint8_t status;
> >>> +
> >>> +    vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DEVICE_STOPPED);
> >>> +    do {
> >>> +        r = vhost_vdpa_call(dev, VHOST_VDPA_GET_STATUS, &status);
> >>
> >> I guess we'd better add some sleep here.
> >>
> > If the final version still contains the call, I will add the sleep. At
> > the moment I think it's better if we stop the device by a vdpa ioctl.
>
>
> Ok, so the idea is to sleep in the ioctl?
>
At the moment I would say so: The ioctl can take "as long as it
wants". If we decide to support both methods (virtio pause & vdpa
pause), then sleep should be added, yes.
If we don't trust the device regarding the time it will take to pause,
we could try to move the wait for the pause outside of the BQL in the
future, so qemu can keep operating normally. I think it is not worth it
in the first series, but it could help adoption by other, stricter or
more stateful devices.
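
As an illustration only, a bounded poll with a short sleep between status
reads could look like the sketch below. It is not the code of this series:
STATUS_STOPPED, read_status_fn, wait_for_stopped and the fake_dev test
double are hypothetical names, and the bit value is an arbitrary
placeholder, not the value of VIRTIO_CONFIG_S_DEVICE_STOPPED.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L /* for nanosleep() */
#endif

#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical placeholder for the "device stopped" status bit */
#define STATUS_STOPPED 0x20u

/* Hypothetical callback that reads the device status; 0 on success */
typedef int (*read_status_fn)(void *opaque, uint8_t *status);

/*
 * Poll the status at most max_polls times, sleeping sleep_ns between
 * reads. Returns 0 once the stopped bit is seen, the callback's error
 * if a read fails, or -ETIMEDOUT if the bit never shows up.
 */
static int wait_for_stopped(read_status_fn read_status, void *opaque,
                            unsigned max_polls, long sleep_ns)
{
    const struct timespec delay = { .tv_sec = 0, .tv_nsec = sleep_ns };
    unsigned i;

    assert(read_status);
    for (i = 0; i < max_polls; i++) {
        uint8_t status = 0;
        int r = read_status(opaque, &status);

        if (r != 0) {
            return r;
        }
        if (status & STATUS_STOPPED) {
            return 0;
        }
        nanosleep(&delay, NULL); /* give the device time to quiesce */
    }
    return -ETIMEDOUT;
}

/* Test double: reports "stopped" from the Nth status read onwards */
struct fake_dev {
    unsigned polls_until_stopped;
    unsigned polls;
};

static int fake_read_status(void *opaque, uint8_t *status)
{
    struct fake_dev *d = opaque;

    d->polls++;
    *status = d->polls >= d->polls_until_stopped ? STATUS_STOPPED : 0;
    return 0;
}
```

The upper bound on the number of polls is the part that matters if the
wait stays under the BQL: a device that never reports the bit then turns
into an error instead of a hang.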
Thanks!
>
> >
> >>> +    } while (r == 0 && !(status & VIRTIO_CONFIG_S_DEVICE_STOPPED));
> >>> +
> >>> +    return 0;
> >>> +}
> >>> +
> >>>    /*
> >>>     * Start shadow virtqueue.
> >>>     */
> >>> @@ -742,9 +778,29 @@ static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
> >>>            .index = idx + dev->vq_index,
> >>>            .fd = event_notifier_get_fd(vhost_call_notifier),
> >>>        };
> >>> +    struct vhost_vring_addr addr = {
> >>> +        .index = idx + dev->vq_index,
> >>> +    };
> >>> +    struct vhost_vring_state num = {
> >>> +        .index = idx + dev->vq_index,
> >>> +        .num = virtio_queue_get_num(dev->vdev, idx),
> >>> +    };
> >>>        int r;
> >>>        bool b;
> >>>
> >>> +    vhost_svq_get_vring_addr(svq, &addr);
> >>> +    r = vhost_vdpa_set_vring_addr(dev, &addr);
> >>> +    if (unlikely(r)) {
> >>> +        error_report("vhost_set_vring_addr for shadow vq failed");
> >>> +        return false;
> >>> +    }
> >>> +
> >>> +    r = vhost_vdpa_set_vring_num(dev, &num);
> >>> +    if (unlikely(r)) {
> >>> +        error_report("vhost_vdpa_set_vring_num for shadow vq failed");
> >>> +        return false;
> >>> +    }
> >>> +
> >>>        /* Set shadow vq -> guest notifier */
> >>>        assert(v->call_fd[idx]);
> >>>        vhost_svq_set_guest_call_notifier(svq, v->call_fd[idx]);
> >>> @@ -781,15 +837,32 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> >>>            assert(v->shadow_vqs->len == 0);
> >>>            for (n = 0; n < hdev->nvqs; ++n) {
> >>>                VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n);
> >>> -            bool ok;
> >>> -
> >>>                if (unlikely(!svq)) {
> >>>                    g_ptr_array_set_size(v->shadow_vqs, 0);
> >>>                    return 0;
> >>>                }
> >>>                g_ptr_array_add(v->shadow_vqs, svq);
> >>> +        }
> >>> +    }
> >>>
> >>> -            ok = vhost_vdpa_svq_start_vq(hdev, n);
> >>> +    r = vhost_vdpa_vring_pause(hdev);
> >>> +    assert(r == 0);
> >>> +
> >>> +    if (enable) {
> >>> +        for (n = 0; n < v->shadow_vqs->len; ++n) {
> >>> +            /* Obtain Virtqueue state */
> >>> +            vhost_virtqueue_stop(hdev, hdev->vdev, &hdev->vqs[n], n);
> >>> +        }
> >>> +    }
> >>> +
> >>> +    /* Reset device so it can be configured */
> >>> +    r = vhost_vdpa_dev_start(hdev, false);
> >>> +    assert(r == 0);
> >>> +
> >>> +    if (enable) {
> >>> +        int r;
> >>> +        for (n = 0; n < v->shadow_vqs->len; ++n) {
> >>> +            bool ok = vhost_vdpa_svq_start_vq(hdev, n);
> >>>                if (unlikely(!ok)) {
> >>>                    /* Free still not started svqs */
> >>>                    g_ptr_array_set_size(v->shadow_vqs, n);
> >>> @@ -797,11 +870,19 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> >>>                    break;
> >>>                }
> >>>            }
> >>> +
> >>> +        /* Need to ack features to set state in vp_vdpa devices */
> >>
> >> vhost_vdpa actually?
> >>
> > Yes, what a mistake!
> >
> >>> +        r = vhost_vdpa_set_svq_features(hdev);
> >>> +        if (unlikely(r)) {
> >>> +            enable = false;
> >>> +        }
> >>>        }
> >>>
> >>>        v->shadow_vqs_enabled = enable;
> >>>
> >>>        if (!enable) {
> >>> +        vhost_vdpa_set_guest_features(hdev);
> >>> +
> >>>            /* Disable all queues or clean up failed start */
> >>>            for (n = 0; n < v->shadow_vqs->len; ++n) {
> >>>                struct vhost_vring_file file = {
> >>> @@ -818,7 +899,12 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> >>>                /* TODO: This can unmask or override call fd! */
> >>>                vhost_virtqueue_start(hdev, hdev->vdev, &hdev->vqs[n], vq_idx);
> >>>            }
> >>> +    }
> >>>
> >>> +    r = vhost_vdpa_dev_start(hdev, true);
> >>> +    assert(r == 0);
> >>> +
> >>> +    if (!enable) {
> >>>            /* Resources cleanup */
> >>>            g_ptr_array_set_size(v->shadow_vqs, 0);
> >>>        }
> >>> @@ -831,6 +917,7 @@ void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
> >>>        struct vhost_vdpa *v;
> >>>        const char *err_cause = NULL;
> >>>        bool r;
> >>> +    uint64_t svq_features;
> >>>
> >>>        QLIST_FOREACH(v, &vhost_vdpa_devices, entry) {
> >>>            if (v->dev->vdev && 0 == strcmp(v->dev->vdev->name, name)) {
> >>> @@ -846,6 +933,20 @@ void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
> >>>            goto err;
> >>>        }
> >>>
> >>> +    svq_features = v->host_features;
> >>> +    if (!vhost_svq_valid_device_features(&svq_features)) {
> >>> +        error_setg(errp,
> >>> +            "Can't enable shadow vq on %s: Unexpected feature flags (%lx-%lx)",
> >>> +            name, v->host_features, svq_features);
> >>> +        return;
> >>> +    } else {
> >>> +        /* TODO: Check for virtio_vdpa + IOMMU & modern device */
> >>
> >> I guess you mean "vhost_vdpa" here.
> > Yes, a similar mistake in less than 50 lines :).
> >
> >> For IOMMU, I guess you mean "vIOMMU"
> >> actually?
> >>
> > This comment is out of date and inherited from the vhost version,
> > where only the IOMMU version was developed, so it will be deleted in
> > the next series. I think it makes little sense to check vIOMMU if we
> > stick with vDPA since it still does not support it, but we could make
> > the check here for sure.
>
>
> Right.
>
> Thanks
>
>
> >
> > Thanks!
> >
> >> Thanks
> >>
> >>
> >>> +    }
> >>> +
> >>> +    if (err_cause) {
> >>> +        goto err;
> >>> +    }
> >>> +
> >>>        r = vhost_vdpa_enable_svq(v, enable);
> >>>        if (unlikely(!r)) {
> >>>            err_cause = "Error enabling (see monitor)";
> >>> @@ -853,7 +954,7 @@ void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
> >>>        }
> >>>
> >>>    err:
> >>> -    if (err_cause) {
> >>> +    if (errp == NULL && err_cause) {
> >>>            error_setg(errp, "Can't enable shadow vq on %s: %s", name, err_cause);
> >>>        }
> >>>    }
>
* Re: [RFC PATCH v4 08/20] vhost: Route guest->host notification through shadow virtqueue
  2021-10-14 12:00     ` Eugenio Perez Martin
  2021-10-15  3:45       ` Jason Wang
@ 2021-10-15 18:21       ` Eugenio Perez Martin
  1 sibling, 0 replies; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-15 18:21 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-devel, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Michael Lilja, Stefano Garzarella
On Thu, Oct 14, 2021 at 2:00 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Wed, Oct 13, 2021 at 5:27 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> >
> > On 2021/10/1 3:05 PM, Eugenio Pérez wrote:
> > > Shadow virtqueue notifications forwarding is disabled when vhost_dev
> > > stops, so code flow follows usual cleanup.
> > >
> > > Also, host notifiers must be disabled at SVQ start,
> >
> >
> > Any reason for this?
> >
>
> It will be addressed in a later series, sorry.
>
> >
> > > and they will not
> > > start if SVQ has been enabled when device is stopped. This is trivial
> > > to address, but it is left out for simplicity at this moment.
> >
> >
> > It looks to me like this patch also contains the following logic:
> >
> > 1) code to enable svq
> >
> > 2) code to let svq be enabled from QMP.
> >
> > I think they need to be split out,
>
> I agree that we can split this more, into the code that belongs to SVQ
> and the code that belongs to vhost-vdpa. It will be addressed in a
> future series.
>
> > we may end up with the following
> > series of patches
> >
>
> By "series of patches", do you mean sending every step as a separate
> series? Chances are that later series would need to modify code that
> was already sent & merged. If you confirm to me that it is fine, I
> can do it that way for sure.
>
> > 1) svq skeleton with enable/disable
> > 2) route host notifier to svq
> > 3) route guest notifier to svq
> > 4) code to enable svq
> > 5) enable svq via QMP
> >
>
> I'm totally fine with that, but there is code that is never called if
> the qmp command is not added. The compiler complains about static
> functions that are not called, which makes things like bisecting
> through these commits impossible unless I use __attribute__((unused))
> or similar. Or have I missed something?
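
For reference, a minimal sketch of the attribute approach mentioned above.
TOY_UNUSED and toy_enable_svq are hypothetical names used only for this
example; in a glib-based tree the existing G_GNUC_UNUSED macro would be the
natural spelling instead of a hand-rolled wrapper.

```c
#include <assert.h>

/* Hypothetical wrapper; glib already offers G_GNUC_UNUSED for this */
#if defined(__GNUC__) || defined(__clang__)
#define TOY_UNUSED __attribute__((unused))
#else
#define TOY_UNUSED
#endif

/*
 * Stand-in for a static helper whose only caller arrives in a later
 * patch. Without the attribute, -Wunused-function (fatal with -Werror)
 * would fire on every commit before that caller exists.
 */
static TOY_UNUSED unsigned toy_enable_svq(unsigned nvqs, int enable)
{
    assert(nvqs > 0);
    /* Placeholder body: number of queues when enabling, 0 otherwise */
    return enable ? nvqs : 0;
}
```

The attribute would have to be dropped again in the patch that adds the
caller, which is extra churn in exchange for a bisectable history.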
>
> We could do it that way with the code that belongs to SVQ though, since
> all of it is declared in headers. But delaying the "enable svq via
> qmp" patch to the last one makes debugging harder, as we cannot just
> enable notifications forwarding with no buffers forwarding.
>
> If I introduce a change in the notifications code, I can simply go to
> these commits and enable SVQ for notifications. This way I can have an
> idea of what part is failing. A similar logic can be applied to other
> devices than vp_vdpa. We would lose it if we delayed the QMP command
> to the last patch.
>
> >
> > >
> > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > ---
> > >   qapi/net.json                      |   2 +-
> > >   hw/virtio/vhost-shadow-virtqueue.h |   8 ++
> > >   include/hw/virtio/vhost-vdpa.h     |   4 +
> > >   hw/virtio/vhost-shadow-virtqueue.c | 138 ++++++++++++++++++++++++++++-
> > >   hw/virtio/vhost-vdpa.c             | 116 +++++++++++++++++++++++-
> > >   5 files changed, 264 insertions(+), 4 deletions(-)
> > >
> > > diff --git a/qapi/net.json b/qapi/net.json
> > > index a2c30fd455..fe546b0e7c 100644
> > > --- a/qapi/net.json
> > > +++ b/qapi/net.json
> > > @@ -88,7 +88,7 @@
> > >   #
> > >   # @enable: true to use the alternate shadow VQ notifications
> > >   #
> > > -# Returns: Always error, since SVQ is not implemented at the moment.
> > > +# Returns: Error if failure, or 'no error' for success.
> > >   #
> > >   # Since: 6.2
> > >   #
> > > diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> > > index 27ac6388fa..237cfceb9c 100644
> > > --- a/hw/virtio/vhost-shadow-virtqueue.h
> > > +++ b/hw/virtio/vhost-shadow-virtqueue.h
> > > @@ -14,6 +14,14 @@
> > >
> > >   typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
> > >
> > > +EventNotifier *vhost_svq_get_svq_call_notifier(VhostShadowVirtqueue *svq);
> >
> >
> > Let's move this function to another patch since it's unrelated to the
> > guest->host routing.
> >
>
> Right, I missed it while squashing commits and at later reviews.
>
> >
> > > +void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd);
> > > +
> > > +bool vhost_svq_start(struct vhost_dev *dev, unsigned idx,
> > > +                     VhostShadowVirtqueue *svq);
> > > +void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> > > +                    VhostShadowVirtqueue *svq);
> > > +
> > >   VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx);
> > >
> > >   void vhost_svq_free(VhostShadowVirtqueue *vq);
> > > diff --git a/include/hw/virtio/vhost-vdpa.h b/include/hw/virtio/vhost-vdpa.h
> > > index 0d565bb5bd..48aae59d8e 100644
> > > --- a/include/hw/virtio/vhost-vdpa.h
> > > +++ b/include/hw/virtio/vhost-vdpa.h
> > > @@ -12,6 +12,8 @@
> > >   #ifndef HW_VIRTIO_VHOST_VDPA_H
> > >   #define HW_VIRTIO_VHOST_VDPA_H
> > >
> > > +#include <gmodule.h>
> > > +
> > >   #include "qemu/queue.h"
> > >   #include "hw/virtio/virtio.h"
> > >
> > > @@ -24,6 +26,8 @@ typedef struct vhost_vdpa {
> > >       int device_fd;
> > >       uint32_t msg_type;
> > >       MemoryListener listener;
> > > +    bool shadow_vqs_enabled;
> > > +    GPtrArray *shadow_vqs;
> > >       struct vhost_dev *dev;
> > >       QLIST_ENTRY(vhost_vdpa) entry;
> > >       VhostVDPAHostNotifier notifier[VIRTIO_QUEUE_MAX];
> > > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > > index c4826a1b56..21dc99ab5d 100644
> > > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > > @@ -9,9 +9,12 @@
> > >
> > >   #include "qemu/osdep.h"
> > >   #include "hw/virtio/vhost-shadow-virtqueue.h"
> > > +#include "hw/virtio/vhost.h"
> > > +
> > > +#include "standard-headers/linux/vhost_types.h"
> > >
> > >   #include "qemu/error-report.h"
> > > -#include "qemu/event_notifier.h"
> > > +#include "qemu/main-loop.h"
> > >
> > >   /* Shadow virtqueue to relay notifications */
> > >   typedef struct VhostShadowVirtqueue {
> > > @@ -19,14 +22,146 @@ typedef struct VhostShadowVirtqueue {
> > >       EventNotifier kick_notifier;
> > >       /* Shadow call notifier, sent to vhost */
> > >       EventNotifier call_notifier;
> > > +
> > > +    /*
> > > +     * Borrowed virtqueue's guest to host notifier.
> > > +     * To borrow it in this event notifier allows to register on the event
> > > +     * loop and access the associated shadow virtqueue easily. If we use the
> > > +     * VirtQueue, we don't have an easy way to retrieve it.
> > > +     *
> > > +     * So shadow virtqueue must not clean it, or we would lose VirtQueue one.
> > > +     */
> > > +    EventNotifier host_notifier;
> > > +
> > > +    /* Guest's call notifier, where SVQ calls guest. */
> > > +    EventNotifier guest_call_notifier;
> >
> >
> > To be consistent, let's simply use "guest_notifier" here.
> >
>
> It could be confusing once the series adds a guest -> qemu kick
> notifier. Actually, I think it would be better to rename
> host_notifier to something like host_svq_notifier. Or host_call and
> guest_call, since "notifier" is already in the type, which makes the
> name a little bit like "Hungarian notation".
>
> >
> > > +
> > > +    /* Virtio queue shadowing */
> > > +    VirtQueue *vq;
> > >   } VhostShadowVirtqueue;
> > >
> > > +/* Forward guest notifications */
> > > +static void vhost_handle_guest_kick(EventNotifier *n)
> > > +{
> > > +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> > > +                                             host_notifier);
> > > +
> > > +    if (unlikely(!event_notifier_test_and_clear(n))) {
> > > +        return;
> > > +    }
> >
> >
> > Is there a chance that we may stop the processing of available buffers
> > during the svq enabling? There could be no kick from the guest in this case.
> >
>
> Actually, yes, I think you are right. The guest kick eventfd could
> have been consumed by vhost but there may still be pending buffers.
>
> I think it would be better to check for available buffers first, then
> clear the notifier unconditionally, and then re-check and process them
> if any [1].
>
> However, this problem arises later in the series: At this moment the
> device is not reset and guest's host notifier is not replaced, so
> either vhost/device receives the kick, or SVQ does and forwards it.
>
> Does it make sense to you?
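
The ordering described in [1] can be sketched with a toy model. This is
only an illustration under simplifying assumptions: struct toy_svq and the
toy_* helpers are hypothetical, a plain bool stands in for the kick
eventfd, and "forwarding" a buffer just advances an index.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for a shadow virtqueue; all fields are hypothetical */
struct toy_svq {
    uint16_t avail_idx;      /* advanced by the "guest" */
    uint16_t last_avail_idx; /* next entry SVQ has to forward */
    bool kick_pending;       /* stands in for the kick eventfd state */
    unsigned forwarded;      /* number of buffers forwarded so far */
};

static bool toy_more_avail(const struct toy_svq *q)
{
    return q->last_avail_idx != q->avail_idx;
}

static void toy_forward_avail(struct toy_svq *q)
{
    while (toy_more_avail(q)) {
        q->last_avail_idx++; /* uint16_t wraps like the ring index */
        q->forwarded++;
    }
}

/*
 * Kick handler that does not depend on the notifier being set:
 * forward what is available, clear the notifier unconditionally, then
 * re-check, so buffers added in between are not left waiting for a
 * kick that was already consumed elsewhere.
 */
static void toy_handle_kick(struct toy_svq *q)
{
    assert(q);
    do {
        toy_forward_avail(q);
        q->kick_pending = false;
    } while (toy_more_avail(q));
}
```

In this single-threaded model the final re-check never finds new work; it
is there to show where a concurrent guest could add buffers between the
forwarding pass and the clearing of the notifier.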
>
> >
> > > +
> > > +    event_notifier_set(&svq->kick_notifier);
> > > +}
> > > +
> > > +/*
> > > + * Obtain the SVQ call notifier, where vhost device notifies SVQ that there
> > > + * exists pending used buffers.
> > > + *
> > > + * @svq Shadow Virtqueue
> > > + */
> > > +EventNotifier *vhost_svq_get_svq_call_notifier(VhostShadowVirtqueue *svq)
> > > +{
> > > +    return &svq->call_notifier;
> > > +}
> > > +
> > > +/*
> > > + * Set the call notifier for the SVQ to call the guest
> > > + *
> > > + * @svq Shadow virtqueue
> > > + * @call_fd call notifier
> > > + *
> > > + * Called on BQL context.
> > > + */
> > > +void vhost_svq_set_guest_call_notifier(VhostShadowVirtqueue *svq, int call_fd)
> > > +{
> > > +    event_notifier_init_fd(&svq->guest_call_notifier, call_fd);
> > > +}
> > > +
> > > +/*
> > > + * Restore the vhost guest to host notifier, i.e., disables svq effect.
> > > + */
> > > +static int vhost_svq_restore_vdev_host_notifier(struct vhost_dev *dev,
> > > +                                                unsigned vhost_index,
> > > +                                                VhostShadowVirtqueue *svq)
> > > +{
> > > +    EventNotifier *vq_host_notifier = virtio_queue_get_host_notifier(svq->vq);
> > > +    struct vhost_vring_file file = {
> > > +        .index = vhost_index,
> > > +        .fd = event_notifier_get_fd(vq_host_notifier),
> > > +    };
> > > +    int r;
> > > +
> > > +    /* Restore vhost kick */
> > > +    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
> >
> >
> > And remap the notification area if necessary.
>
> Totally right, that step is missed in this series.
>
> However, remapping the guest host notifier memory region has no
> advantage over using ioeventfd to perform guest -> SVQ notifications,
> does it? With both methods the flow needs to go through the hypervisor
> kernel.
>
> >
> >
> > > +    return r ? -errno : 0;
> > > +}
> > > +
> > > +/*
> > > + * Start shadow virtqueue operation.
> > > + * @dev vhost device
> > > + * @idx vhost virtqueue index
> > > + * @svq Shadow Virtqueue
> > > + */
> > > +bool vhost_svq_start(struct vhost_dev *dev, unsigned idx,
> > > +                     VhostShadowVirtqueue *svq)
> > > +{
> > > +    EventNotifier *vq_host_notifier = virtio_queue_get_host_notifier(svq->vq);
> > > +    struct vhost_vring_file file = {
> > > +        .index = dev->vhost_ops->vhost_get_vq_index(dev, dev->vq_index + idx),
> > > +        .fd = event_notifier_get_fd(&svq->kick_notifier),
> > > +    };
> > > +    int r;
> > > +
> > > +    /* Check that notifications are still going directly to vhost dev */
> > > +    assert(virtio_queue_is_host_notifier_enabled(svq->vq));
> > > +
> > > +    /*
> > > +     * event_notifier_set_handler already checks for guest's notifications if
> > > +     * they arrive in the switch, so there is no need to explicitly check for
> > > +     * them.
> > > +     */
> >
> >
> > If this is true, shouldn't we call vhost_set_vring_kick() before the
> > event_notifier_set_handler()?
> >
>
> Not at this point of the series, but it could be another solution when
> we need to reset the device and we are unsure if all buffers have been
> read. But I think I prefer the solution proposed in [1] and to
> explicitly call vhost_handle_guest_kick here. Do you think
> differently?
>
> > Btw, I think we should update the fd if set_vring_kick() was called
> > after this function?
> >
>
> Kind of. This is currently bad in the code, but...
>
> Backend callbacks vhost_ops->vhost_set_vring_kick and
> vhost_ops->vhost_set_vring_addr are only called at
> vhost_virtqueue_start. And they are always called with known data
> already stored in VirtQueue.
>
> To avoid storing more state in vhost_vdpa, I think that we should
> avoid duplicating them, but ignore new kick_fd or address in SVQ mode,
> and retrieve them again at the moment the device is (re)started in SVQ
> mode. Qemu already avoids things like allowing the guest to set
> addresses at random times, using the VirtIOPCIProxy to store them.
>
> I also see how duplicating that status could protect vdpa SVQ code
> against future changes to vhost code, but that would make this series
> bigger and more complex with no need at this moment in my opinion.
>
> Do you agree?
>
> >
> > > +    event_notifier_init_fd(&svq->host_notifier,
> > > +                           event_notifier_get_fd(vq_host_notifier));
> > > +    event_notifier_set_handler(&svq->host_notifier, vhost_handle_guest_kick);
> > > +
> > > +    r = dev->vhost_ops->vhost_set_vring_kick(dev, &file);
> >
> >
> > And we need to stop the notification area mmap.
> >
>
> Right.
>
> >
> > > +    if (unlikely(r != 0)) {
> > > +        error_report("Couldn't set kick fd: %s", strerror(errno));
> > > +        goto err_set_vring_kick;
> > > +    }
> > > +
> > > +    return true;
> > > +
> > > +err_set_vring_kick:
> > > +    event_notifier_set_handler(&svq->host_notifier, NULL);
> > > +
> > > +    return false;
> > > +}
> > > +
> > > +/*
> > > + * Stop shadow virtqueue operation.
> > > + * @dev vhost device
> > > + * @idx vhost queue index
> > > + * @svq Shadow Virtqueue
> > > + */
> > > +void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> > > +                    VhostShadowVirtqueue *svq)
> > > +{
> > > +    int r = vhost_svq_restore_vdev_host_notifier(dev, idx, svq);
> > > +    if (unlikely(r < 0)) {
> > > +        error_report("Couldn't restore vq kick fd: %s", strerror(-r));
> > > +    }
> > > +
> > > +    event_notifier_set_handler(&svq->host_notifier, NULL);
> > > +}
> > > +
> > >   /*
> > >    * Creates vhost shadow virtqueue, and instruct vhost device to use the shadow
> > >    * methods and file descriptors.
> > >    */
> > >   VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> > >   {
> > > +    int vq_idx = dev->vq_index + idx;
> > >       g_autofree VhostShadowVirtqueue *svq = g_new0(VhostShadowVirtqueue, 1);
> > >       int r;
> > >
> > > @@ -44,6 +179,7 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> > >           goto err_init_call_notifier;
> > >       }
> > >
> > > +    svq->vq = virtio_get_queue(dev->vdev, vq_idx);
> > >       return g_steal_pointer(&svq);
> > >
> > >   err_init_call_notifier:
> > > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > > index e0dc7508c3..36c954a779 100644
> > > --- a/hw/virtio/vhost-vdpa.c
> > > +++ b/hw/virtio/vhost-vdpa.c
> > > @@ -17,6 +17,7 @@
> > >   #include "hw/virtio/vhost.h"
> > >   #include "hw/virtio/vhost-backend.h"
> > >   #include "hw/virtio/virtio-net.h"
> > > +#include "hw/virtio/vhost-shadow-virtqueue.h"
> > >   #include "hw/virtio/vhost-vdpa.h"
> > >   #include "exec/address-spaces.h"
> > >   #include "qemu/main-loop.h"
> > > @@ -272,6 +273,16 @@ static void vhost_vdpa_add_status(struct vhost_dev *dev, uint8_t status)
> > >       vhost_vdpa_call(dev, VHOST_VDPA_SET_STATUS, &s);
> > >   }
> > >
> > > +/**
> > > + * Adaptor function to free shadow virtqueue through gpointer
> > > + *
> > > + * @svq   The Shadow Virtqueue
> > > + */
> > > +static void vhost_psvq_free(gpointer svq)
> > > +{
> > > +    vhost_svq_free(svq);
> > > +}
> > > +
> > >   static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
> > >   {
> > >       struct vhost_vdpa *v;
> > > @@ -283,6 +294,7 @@ static int vhost_vdpa_init(struct vhost_dev *dev, void *opaque, Error **errp)
> > >       dev->opaque =  opaque ;
> > >       v->listener = vhost_vdpa_memory_listener;
> > >       v->msg_type = VHOST_IOTLB_MSG_V2;
> > > +    v->shadow_vqs = g_ptr_array_new_full(dev->nvqs, vhost_psvq_free);
> > >       QLIST_INSERT_HEAD(&vhost_vdpa_devices, v, entry);
> > >
> > >       vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_ACKNOWLEDGE |
> > > @@ -373,6 +385,17 @@ err:
> > >       return;
> > >   }
> > >
> > > +static void vhost_vdpa_svq_cleanup(struct vhost_dev *dev)
> > > +{
> > > +    struct vhost_vdpa *v = dev->opaque;
> > > +    size_t idx;
> > > +
> > > +    for (idx = 0; idx < v->shadow_vqs->len; ++idx) {
> > > +        vhost_svq_stop(dev, idx, g_ptr_array_index(v->shadow_vqs, idx));
> > > +    }
> > > +    g_ptr_array_free(v->shadow_vqs, true);
> > > +}
> > > +
> > >   static int vhost_vdpa_cleanup(struct vhost_dev *dev)
> > >   {
> > >       struct vhost_vdpa *v;
> > > @@ -381,6 +404,7 @@ static int vhost_vdpa_cleanup(struct vhost_dev *dev)
> > >       trace_vhost_vdpa_cleanup(dev, v);
> > >       vhost_vdpa_host_notifiers_uninit(dev, dev->nvqs);
> > >       memory_listener_unregister(&v->listener);
> > > +    vhost_vdpa_svq_cleanup(dev);
> > >       QLIST_REMOVE(v, entry);
> > >
> > >       dev->opaque = NULL;
> > > @@ -557,7 +581,9 @@ static int vhost_vdpa_dev_start(struct vhost_dev *dev, bool started)
> > >       if (started) {
> > >           uint8_t status = 0;
> > >           memory_listener_register(&v->listener, &address_space_memory);
> > > -        vhost_vdpa_host_notifiers_init(dev);
> > > +        if (!v->shadow_vqs_enabled) {
> > > +            vhost_vdpa_host_notifiers_init(dev);
> > > +        }
> >
> >
> > This looks like a trick, why not check and set up shadow_vqs inside:
> >
> > 1) vhost_vdpa_host_notifiers_init()
> >
> > and
> >
> > 2) vhost_vdpa_set_vring_kick()
> >
>
> Ok I will move the checks there.
>
> >
> > >           vhost_vdpa_set_vring_ready(dev);
> > >           vhost_vdpa_add_status(dev, VIRTIO_CONFIG_S_DRIVER_OK);
> > >           vhost_vdpa_call(dev, VHOST_VDPA_GET_STATUS, &status);
> > > @@ -663,10 +689,96 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
> > >       return true;
> > >   }
> > >
> > > +/*
> > > + * Start shadow virtqueue.
> > > + */
> > > +static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
> > > +{
> > > +    struct vhost_vdpa *v = dev->opaque;
> > > +    VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, idx);
> > > +    return vhost_svq_start(dev, idx, svq);
> > > +}
> > > +
> > > +static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > > +{
> > > +    struct vhost_dev *hdev = v->dev;
> > > +    unsigned n;
> > > +
> > > +    if (enable == v->shadow_vqs_enabled) {
> > > +        return hdev->nvqs;
> > > +    }
> > > +
> > > +    if (enable) {
> > > +        /* Allocate resources */
> > > +        assert(v->shadow_vqs->len == 0);
> > > +        for (n = 0; n < hdev->nvqs; ++n) {
> > > +            VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n);
> > > +            bool ok;
> > > +
> > > +            if (unlikely(!svq)) {
> > > +                g_ptr_array_set_size(v->shadow_vqs, 0);
> > > +                return 0;
> > > +            }
> > > +            g_ptr_array_add(v->shadow_vqs, svq);
> > > +
> > > +            ok = vhost_vdpa_svq_start_vq(hdev, n);
> > > +            if (unlikely(!ok)) {
> > > +                /* Free still not started svqs */
> > > +                g_ptr_array_set_size(v->shadow_vqs, n);
> > > +                enable = false;
>
> [2]
>
> > > +                break;
> > > +            }
> > > +        }
> >
> >
> > Since there's almost no logic that could be shared between enable and
> > disable, let's split that logic out into dedicated functions so the
> > code is easier to review (e.g. with better error handling, etc.).
> >
>
> Maybe it could be more clear in the code, but the reused logic is the
> disabling of SVQ and the fallback in case it cannot be enabled with
> [2]. But I'm not against splitting in two different functions if it
> makes review easier.
>
> >
> > > +    }
> > > +
> > > +    v->shadow_vqs_enabled = enable;
> > > +
> > > +    if (!enable) {
> > > +        /* Disable all queues or clean up failed start */
> > > +        for (n = 0; n < v->shadow_vqs->len; ++n) {
> > > +            unsigned vq_idx = vhost_vdpa_get_vq_index(hdev, n);
> > > +            VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, n);
> > > +            vhost_svq_stop(hdev, n, svq);
> > > +            vhost_virtqueue_start(hdev, hdev->vdev, &hdev->vqs[n], vq_idx);
> > > +        }
> > > +
> > > +        /* Resources cleanup */
> > > +        g_ptr_array_set_size(v->shadow_vqs, 0);
> > > +    }
> > > +
> > > +    return n;
> > > +}
> > >
> > >   void qmp_x_vhost_enable_shadow_vq(const char *name, bool enable, Error **errp)
> > >   {
> > > -    error_setg(errp, "Shadow virtqueue still not implemented");
> > > +    struct vhost_vdpa *v;
> > > +    const char *err_cause = NULL;
> > > +    bool r;
> > > +
> > > +    QLIST_FOREACH(v, &vhost_vdpa_devices, entry) {
> > > +        if (v->dev->vdev && 0 == strcmp(v->dev->vdev->name, name)) {
> > > +            break;
> > > +        }
> > > +    }
> >
> >
> > I think you can iterate the NetClientStates to get the vhost-vdpa backends.
> >
>
> Right, I missed it.
>
Actually, that would always miss other device types like blk, wouldn't it?
But using just the name is definitely a bad idea.
> >
> > > +
> > > +    if (!v) {
> > > +        err_cause = "Device not found";
> > > +        goto err;
> > > +    } else if (v->notifier[0].addr) {
> > > +        err_cause = "Device has host notifiers enabled";
> >
> >
> > I don't get this.
> >
>
> At this moment of the series you can enable guest -> SVQ -> 'vdpa
> device' if the device is not using the host notifiers memory region.
> The right solution is to disable it for the guest, and to handle it in
> SVQ. Otherwise, guest kick will bypass SVQ and go directly to the
> device.
>
> It can be done in the same patch, or we can at least disable (as in
> unmap) them at this moment and handle them in a later patch. But for
> prototyping the solution I just ignored it in this series. It will be
> handled one way or another in the next one. I prefer the latter option,
> handling it in a different patch, but let me know if you think
> otherwise.
>
> > Btw this function should be implemented in an independent patch after
> > svq is fully functional.
> >
>
> (Reasons for that are already commented at the top of this mail :) ).
>
> Thanks!
>
> > Thanks
> >
> >
> > > +        goto err;
> > > +    }
> > > +
> > > +    r = vhost_vdpa_enable_svq(v, enable);
> > > +    if (unlikely(!r)) {
> > > +        err_cause = "Error enabling (see monitor)";
> > > +        goto err;
> > > +    }
> > > +
> > > +err:
> > > +    if (err_cause) {
> > > +        error_setg(errp, "Can't enable shadow vq on %s: %s", name, err_cause);
> > > +    }
> > >   }
> > >
> > >   const VhostOps vdpa_ops = {
> >
* Re: [RFC PATCH v4 18/20] vhost: Add VhostIOVATree
  2021-10-01  7:06 ` [RFC PATCH v4 18/20] vhost: Add VhostIOVATree Eugenio Pérez
@ 2021-10-19  8:32   ` Jason Wang
  2021-10-19  9:22     ` Jason Wang
  2021-10-20  7:36     ` Eugenio Perez Martin
  0 siblings, 2 replies; 90+ messages in thread
From: Jason Wang @ 2021-10-19  8:32 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Juan Quintela, Michael S. Tsirkin,
	Markus Armbruster, Peter Xu, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
On 2021/10/1 3:06 PM, Eugenio Pérez wrote:
> This tree is able to look for a translated address from an IOVA address.
>
> At first glance it is similar to util/iova-tree. However, SVQ working on
> devices with limited IOVA space needs more capabilities, like allocating
> IOVA chunks or performing reverse translations (qemu addresses to iova).
I don't see any reverse translation used in the shadow code. Did I
miss anything?
>
> The allocation capability, as in "assign a free IOVA address to this
> chunk of memory in qemu's address space", allows shadow virtqueue to
> create a new address space that is not restricted by the guest's
> addressable one, so we can allocate shadow vq vrings outside of the
> guest's reach, and outside of qemu's as well. At the moment, allocation
> only grows; deletion is not supported.
>
> A different name could be used, but ordered searchable array is a
> little bit long though.
>
> It duplicates the array so it can search efficiently in both
> directions, and it will signal overlap if the iova or the translated
> address is already present in either array.
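
A lookup over such an ordered array can be sketched with bsearch(). This
is a simplified model and not the patch's code: struct toy_map and the
toy_* functions are hypothetical, only the iova -> translated direction
is shown, and "size" is inclusive as in VhostDMAMap (the last byte of a
map is iova + size).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical, simplified counterpart of VhostDMAMap */
struct toy_map {
    uint64_t iova;
    uint64_t size;    /* inclusive: the map covers [iova, iova + size] */
    void *translated;
};

/*
 * bsearch() comparator: key first, array element second. Two ranges
 * that overlap compare as equal, so looking up a single address
 * returns the map that contains it.
 */
static int toy_map_cmp(const void *key, const void *elem)
{
    const struct toy_map *k = key;
    const struct toy_map *e = elem;

    if (k->iova + k->size < e->iova) {
        return -1;
    }
    if (e->iova + e->size < k->iova) {
        return 1;
    }
    return 0;
}

/* maps must be sorted by iova and must not overlap each other */
static const struct toy_map *toy_find_iova(const struct toy_map *maps,
                                           size_t n, uint64_t iova)
{
    const struct toy_map key = { .iova = iova, .size = 0 };

    assert(maps || n == 0);
    return bsearch(&key, maps, n, sizeof(*maps), toy_map_cmp);
}
```

The same comparator doubles as an overlap check before inserting: a
non-NULL result for a candidate range means it collides with an existing
map.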
>
> Use of array will be changed to util-iova-tree in future series.
Adding Peter.
It looks to me that the only thing missing is the iova allocator. And I
think it's better to decouple the allocator from the iova tree.
Then we would have:
1) initialize iova range
2) iova = iova_alloc(size)
3) build the iova tree map
4) buffer forwarding
5) iova_free(size)
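For illustration only, steps 1, 2 and 5 could look roughly like the
sketch below: a first-fit allocator that knows nothing about
translations. Every name in it (IOVAAllocator, iova_allocator_init,
the fixed IOVA_ALLOC_MAX capacity) is hypothetical and not existing
QEMU API, and it frees by start address rather than by size. Steps 3
and 4 (building the tree map and forwarding buffers) are left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define IOVA_ALLOC_MAX 64            /* arbitrary capacity for the sketch */

typedef struct IOVARange {
    uint64_t first;                  /* inclusive */
    uint64_t last;                   /* inclusive */
} IOVARange;

typedef struct IOVAAllocator {
    IOVARange usable;                /* range the device can address */
    IOVARange used[IOVA_ALLOC_MAX];  /* sorted by first, non-overlapping */
    size_t n_used;
} IOVAAllocator;

/* Step 1: initialize with the usable range (inclusive bounds). */
static void iova_allocator_init(IOVAAllocator *a, uint64_t first,
                                uint64_t last)
{
    assert(first <= last);
    a->usable.first = first;
    a->usable.last = last;
    a->n_used = 0;
}

/* Step 2: first-fit allocation; returns false if no hole is big enough. */
static bool iova_alloc(IOVAAllocator *a, uint64_t size, uint64_t *iova)
{
    uint64_t start = a->usable.first;
    size_t i;

    if (size == 0 || a->n_used == IOVA_ALLOC_MAX) {
        return false;
    }

    for (i = 0; ; i++) {
        bool at_end = (i == a->n_used);
        /* Candidate hole is [start, limit], inclusive; it may be empty. */
        bool hole_empty = !at_end && a->used[i].first == start;
        uint64_t limit = at_end ? a->usable.last : a->used[i].first - 1;

        if (!hole_empty && start <= limit && limit - start >= size - 1) {
            /* Shift the tail up by one slot to keep the array sorted. */
            memmove(&a->used[i + 1], &a->used[i],
                    (a->n_used - i) * sizeof(a->used[0]));
            a->used[i].first = start;
            a->used[i].last = start + (size - 1);
            a->n_used++;
            *iova = start;
            return true;
        }
        if (at_end || a->used[i].last == UINT64_MAX) {
            return false;
        }
        start = a->used[i].last + 1;
    }
}

/* Step 5: release a chunk, identified here by its start address. */
static bool iova_free(IOVAAllocator *a, uint64_t iova)
{
    size_t i;

    for (i = 0; i < a->n_used; i++) {
        if (a->used[i].first == iova) {
            memmove(&a->used[i], &a->used[i + 1],
                    (a->n_used - i - 1) * sizeof(a->used[0]));
            a->n_used--;
            return true;
        }
    }
    return false;
}
```

With the allocator kept separate like this, the tree would only need to
store (iova, translated address, size) entries for lookups.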
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   hw/virtio/vhost-iova-tree.h |  40 +++++++
>   hw/virtio/vhost-iova-tree.c | 230 ++++++++++++++++++++++++++++++++++++
>   hw/virtio/meson.build       |   2 +-
>   3 files changed, 271 insertions(+), 1 deletion(-)
>   create mode 100644 hw/virtio/vhost-iova-tree.h
>   create mode 100644 hw/virtio/vhost-iova-tree.c
>
> diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
> new file mode 100644
> index 0000000000..d163a88905
> --- /dev/null
> +++ b/hw/virtio/vhost-iova-tree.h
> @@ -0,0 +1,40 @@
> +/*
> + * vhost software live migration ring
> + *
> + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
> + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#ifndef HW_VIRTIO_VHOST_IOVA_TREE_H
> +#define HW_VIRTIO_VHOST_IOVA_TREE_H
> +
> +#include "exec/memory.h"
> +
> +typedef struct VhostDMAMap {
> +    void *translated_addr;
> +    hwaddr iova;
> +    hwaddr size;                /* Inclusive */
> +    IOMMUAccessFlags perm;
> +} VhostDMAMap;
> +
> +typedef enum VhostDMAMapNewRC {
> +    VHOST_DMA_MAP_NO_SPACE = -3,
> +    VHOST_DMA_MAP_OVERLAP = -2,
> +    VHOST_DMA_MAP_INVALID = -1,
> +    VHOST_DMA_MAP_OK = 0,
> +} VhostDMAMapNewRC;
> +
> +typedef struct VhostIOVATree VhostIOVATree;
> +
> +VhostIOVATree *vhost_iova_tree_new(void);
> +void vhost_iova_tree_unref(VhostIOVATree *iova_rm);
> +G_DEFINE_AUTOPTR_CLEANUP_FUNC(VhostIOVATree, vhost_iova_tree_unref);
> +
> +const VhostDMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *iova_rm,
> +                                             const VhostDMAMap *map);
> +VhostDMAMapNewRC vhost_iova_tree_alloc(VhostIOVATree *iova_rm,
> +                                       VhostDMAMap *map);
> +
> +#endif
> diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
> new file mode 100644
> index 0000000000..c284e27607
> --- /dev/null
> +++ b/hw/virtio/vhost-iova-tree.c
> @@ -0,0 +1,230 @@
> +/*
> + * vhost software live migration ring
> + *
> + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
> + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#include "qemu/osdep.h"
> +#include "vhost-iova-tree.h"
> +
> +#define G_ARRAY_NOT_ZERO_TERMINATED false
> +#define G_ARRAY_NOT_CLEAR_ON_ALLOC false
> +
> +#define iova_min qemu_real_host_page_size
> +
> +/**
> + * VhostIOVATree, able to:
> + * - Translate iova address
> + * - Reverse translate iova address (from translated to iova)
> + * - Allocate IOVA regions for translated range (potentially slow operation)
> + *
> + * Note that it cannot remove nodes.
> + */
> +struct VhostIOVATree {
> +    /* Ordered array of reverse translations, IOVA address to qemu memory. */
> +    GArray *iova_taddr_map;
> +
> +    /*
> +     * Ordered array of translations from qemu virtual memory address to iova
> +     */
> +    GArray *taddr_iova_map;
> +};
Any reason for using GArray? Is it faster?
> +
> +/**
> + * Inserts an element after an existing one in garray.
> + *
> + * @array      The array
> + * @prev_elem  The previous element of array of NULL if prepending
> + * @map        The DMA map
> + *
> + * It provides the aditional advantage of being type safe over
> + * g_array_insert_val, which accepts a reference pointer instead of a value
> + * with no complains.
> + */
> +static void vhost_iova_tree_insert_after(GArray *array,
> +                                         const VhostDMAMap *prev_elem,
> +                                         const VhostDMAMap *map)
> +{
> +    size_t pos;
> +
> +    if (!prev_elem) {
> +        pos = 0;
> +    } else {
> +        pos = prev_elem - &g_array_index(array, typeof(*prev_elem), 0) + 1;
> +    }
> +
> +    g_array_insert_val(array, pos, *map);
> +}
> +
> +static gint vhost_iova_tree_cmp_taddr(gconstpointer a, gconstpointer b)
> +{
> +    const VhostDMAMap *m1 = a, *m2 = b;
> +
> +    if (m1->translated_addr > m2->translated_addr + m2->size) {
> +        return 1;
> +    }
> +
> +    if (m1->translated_addr + m1->size < m2->translated_addr) {
> +        return -1;
> +    }
> +
> +    /* Overlapped */
> +    return 0;
> +}
> +
> +/**
> + * Find the previous node to a given iova
> + *
> + * @array  The ascending ordered-by-translated-addr array of VhostDMAMap
> + * @map    The map to insert
> + * @prev   Returned location of the previous map
> + *
> + * Return VHOST_DMA_MAP_OK if everything went well, or VHOST_DMA_MAP_OVERLAP if
> + * it already exists. It is ok to use this function to check if a given range
> + * exists, but it will use a linear search.
> + *
> + * TODO: We can use bsearch to locate the entry if we save the state in the
> + * needle, knowing that the needle is always the first argument to
> + * compare_func.
> + */
> +static VhostDMAMapNewRC vhost_iova_tree_find_prev(const GArray *array,
> +                                                  GCompareFunc compare_func,
> +                                                  const VhostDMAMap *map,
> +                                                  const VhostDMAMap **prev)
> +{
> +    size_t i;
> +    int r;
> +
> +    *prev = NULL;
> +    for (i = 0; i < array->len; ++i) {
> +        r = compare_func(map, &g_array_index(array, typeof(*map), i));
> +        if (r == 0) {
> +            return VHOST_DMA_MAP_OVERLAP;
> +        }
> +        if (r < 0) {
> +            return VHOST_DMA_MAP_OK;
> +        }
> +
> +        *prev = &g_array_index(array, typeof(**prev), i);
> +    }
> +
> +    return VHOST_DMA_MAP_OK;
> +}
> +
> +/**
> + * Create a new IOVA tree
> + *
> + * Returns the new IOVA tree
> + */
> +VhostIOVATree *vhost_iova_tree_new(void)
> +{
So I think it needs to be initialized with the range we get from 
get_iova_range().
Thanks
> +    VhostIOVATree *tree = g_new(VhostIOVATree, 1);
> +    tree->iova_taddr_map = g_array_new(G_ARRAY_NOT_ZERO_TERMINATED,
> +                                       G_ARRAY_NOT_CLEAR_ON_ALLOC,
> +                                       sizeof(VhostDMAMap));
> +    tree->taddr_iova_map = g_array_new(G_ARRAY_NOT_ZERO_TERMINATED,
> +                                       G_ARRAY_NOT_CLEAR_ON_ALLOC,
> +                                       sizeof(VhostDMAMap));
> +    return tree;
> +}
> +
> +/**
> + * Destroy an IOVA tree
> + *
> + * @tree  The iova tree
> + */
> +void vhost_iova_tree_unref(VhostIOVATree *tree)
> +{
> +    g_array_unref(g_steal_pointer(&tree->iova_taddr_map));
> +    g_array_unref(g_steal_pointer(&tree->taddr_iova_map));
> +}
> +
> +/**
> + * Find the IOVA address stored from a memory address
> + *
> + * @tree     The iova tree
> + * @map      The map with the memory address
> + *
> + * Return the stored mapping, or NULL if not found.
> + */
> +const VhostDMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *tree,
> +                                             const VhostDMAMap *map)
> +{
> +    /*
> +     * This can be replaced with g_array_binary_search (Since glib 2.62) when
> +     * that version become common enough.
> +     */
> +    return bsearch(map, tree->taddr_iova_map->data, tree->taddr_iova_map->len,
> +                   sizeof(*map), vhost_iova_tree_cmp_taddr);
> +}
> +
> +static bool vhost_iova_tree_find_iova_hole(const GArray *iova_map,
> +                                           const VhostDMAMap *map,
> +                                           const VhostDMAMap **prev_elem)
> +{
> +    size_t i;
> +    hwaddr iova = iova_min;
> +
> +    *prev_elem = NULL;
> +    for (i = 0; i < iova_map->len; i++) {
> +        const VhostDMAMap *next = &g_array_index(iova_map, typeof(*next), i);
> +        hwaddr hole_end = next->iova;
> +        if (map->size < hole_end - iova) {
> +            return true;
> +        }
> +
> +        iova = next->iova + next->size + 1;
> +        *prev_elem = next;
> +    }
> +
> +    return ((hwaddr)-1 - iova) > iova_map->len;
> +}
> +
> +/**
> + * Allocate a new mapping
> + *
> + * @tree  The iova tree
> + * @map   The iova map
> + *
> + * Returns:
> + * - VHOST_DMA_MAP_OK if the map fits in the container
> + * - VHOST_DMA_MAP_INVALID if the map does not make sense (like size overflow)
> + * - VHOST_DMA_MAP_OVERLAP if the tree already contains that map
> + * - VHOST_DMA_MAP_NO_SPACE if iova_rm cannot allocate more space.
> + *
> + * It returns assignated iova in map->iova if return value is VHOST_DMA_MAP_OK.
> + */
> +VhostDMAMapNewRC vhost_iova_tree_alloc(VhostIOVATree *tree,
> +                                       VhostDMAMap *map)
> +{
> +    const VhostDMAMap *qemu_prev, *iova_prev;
> +    int find_prev_rc;
> +    bool fit;
> +
> +    if (map->translated_addr + map->size < map->translated_addr ||
> +        map->iova + map->size < map->iova || map->perm == IOMMU_NONE) {
> +        return VHOST_DMA_MAP_INVALID;
> +    }
> +
> +    /* Search for a hole in iova space big enough */
> +    fit = vhost_iova_tree_find_iova_hole(tree->iova_taddr_map, map,
> +                                         &iova_prev);
> +    if (!fit) {
> +        return VHOST_DMA_MAP_NO_SPACE;
> +    }
> +
> +    map->iova = iova_prev ? (iova_prev->iova + iova_prev->size) + 1 : iova_min;
> +    find_prev_rc = vhost_iova_tree_find_prev(tree->taddr_iova_map,
> +                                             vhost_iova_tree_cmp_taddr, map,
> +                                             &qemu_prev);
> +    if (find_prev_rc == VHOST_DMA_MAP_OVERLAP) {
> +        return VHOST_DMA_MAP_OVERLAP;
> +    }
> +
> +    vhost_iova_tree_insert_after(tree->iova_taddr_map, iova_prev, map);
> +    vhost_iova_tree_insert_after(tree->taddr_iova_map, qemu_prev, map);
> +    return VHOST_DMA_MAP_OK;
> +}
> diff --git a/hw/virtio/meson.build b/hw/virtio/meson.build
> index 8b5a0225fe..cb306b83c6 100644
> --- a/hw/virtio/meson.build
> +++ b/hw/virtio/meson.build
> @@ -11,7 +11,7 @@ softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-stub.c'))
>   
>   virtio_ss = ss.source_set()
>   virtio_ss.add(files('virtio.c'))
> -virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c'))
> +virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c', 'vhost-iova-tree.c'))
>   virtio_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user.c'))
>   virtio_ss.add(when: 'CONFIG_VHOST_VDPA', if_true: files('vhost-vdpa.c'))
>   virtio_ss.add(when: 'CONFIG_VIRTIO_BALLOON', if_true: files('virtio-balloon.c'))
* Re: [RFC PATCH v4 11/20] vhost: Route host->guest notification through shadow virtqueue
  2021-10-15  4:42       ` Jason Wang
@ 2021-10-19  8:39         ` Eugenio Perez Martin
  2021-10-20  2:01           ` Jason Wang
  0 siblings, 1 reply; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-19  8:39 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-level, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Stefano Garzarella
On Fri, Oct 15, 2021 at 6:42 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/10/15 at 12:39 AM, Eugenio Perez Martin wrote:
> > On Wed, Oct 13, 2021 at 5:47 AM Jason Wang <jasowang@redhat.com> wrote:
> >>
> >> On 2021/10/1 at 3:05 PM, Eugenio Pérez wrote:
> >>> This will make qemu aware of the device used buffers, allowing it to
> >>> write the guest memory with its contents if needed.
> >>>
> >>> Since the use of vhost_virtqueue_start can unmasks and discard call
> >>> events, vhost_virtqueue_start should be modified in one of these ways:
> >>> * Split in two: One of them uses all logic to start a queue with no
> >>>     side effects for the guest, and another one tha actually assumes that
> >>>     the guest has just started the device. Vdpa should use just the
> >>>     former.
> >>> * Actually store and check if the guest notifier is masked, and do it
> >>>     conditionally.
> >>> * Left as it is, and duplicate all the logic in vhost-vdpa.
> >>>
> >>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> >>> ---
> >>>    hw/virtio/vhost-shadow-virtqueue.c | 19 +++++++++++++++
> >>>    hw/virtio/vhost-vdpa.c             | 38 +++++++++++++++++++++++++++++-
> >>>    2 files changed, 56 insertions(+), 1 deletion(-)
> >>>
> >>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> >>> index 21dc99ab5d..3fe129cf63 100644
> >>> --- a/hw/virtio/vhost-shadow-virtqueue.c
> >>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> >>> @@ -53,6 +53,22 @@ static void vhost_handle_guest_kick(EventNotifier *n)
> >>>        event_notifier_set(&svq->kick_notifier);
> >>>    }
> >>>
> >>> +/* Forward vhost notifications */
> >>> +static void vhost_svq_handle_call_no_test(EventNotifier *n)
> >>> +{
> >>> +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> >>> +                                             call_notifier);
> >>> +
> >>> +    event_notifier_set(&svq->guest_call_notifier);
> >>> +}
> >>> +
> >>> +static void vhost_svq_handle_call(EventNotifier *n)
> >>> +{
> >>> +    if (likely(event_notifier_test_and_clear(n))) {
> >>> +        vhost_svq_handle_call_no_test(n);
> >>> +    }
> >>> +}
> >>> +
> >>>    /*
> >>>     * Obtain the SVQ call notifier, where vhost device notifies SVQ that there
> >>>     * exists pending used buffers.
> >>> @@ -180,6 +196,8 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> >>>        }
> >>>
> >>>        svq->vq = virtio_get_queue(dev->vdev, vq_idx);
> >>> +    event_notifier_set_handler(&svq->call_notifier,
> >>> +                               vhost_svq_handle_call);
> >>>        return g_steal_pointer(&svq);
> >>>
> >>>    err_init_call_notifier:
> >>> @@ -195,6 +213,7 @@ err_init_kick_notifier:
> >>>    void vhost_svq_free(VhostShadowVirtqueue *vq)
> >>>    {
> >>>        event_notifier_cleanup(&vq->kick_notifier);
> >>> +    event_notifier_set_handler(&vq->call_notifier, NULL);
> >>>        event_notifier_cleanup(&vq->call_notifier);
> >>>        g_free(vq);
> >>>    }
> >>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> >>> index bc34de2439..6c5f4c98b8 100644
> >>> --- a/hw/virtio/vhost-vdpa.c
> >>> +++ b/hw/virtio/vhost-vdpa.c
> >>> @@ -712,13 +712,40 @@ static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
> >>>    {
> >>>        struct vhost_vdpa *v = dev->opaque;
> >>>        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, idx);
> >>> -    return vhost_svq_start(dev, idx, svq);
> >>> +    EventNotifier *vhost_call_notifier = vhost_svq_get_svq_call_notifier(svq);
> >>> +    struct vhost_vring_file vhost_call_file = {
> >>> +        .index = idx + dev->vq_index,
> >>> +        .fd = event_notifier_get_fd(vhost_call_notifier),
> >>> +    };
> >>> +    int r;
> >>> +    bool b;
> >>> +
> >>> +    /* Set shadow vq -> guest notifier */
> >>> +    assert(v->call_fd[idx]);
> >>
> >> We need to avoid the assert() here. In which case can we hit this?
> >>
> > I would say that there is no way we can actually hit it, so let's remove it.
> >
> >>> +    vhost_svq_set_guest_call_notifier(svq, v->call_fd[idx]);
> >>> +
> >>> +    b = vhost_svq_start(dev, idx, svq);
> >>> +    if (unlikely(!b)) {
> >>> +        return false;
> >>> +    }
> >>> +
> >>> +    /* Set device -> SVQ notifier */
> >>> +    r = vhost_vdpa_set_vring_dev_call(dev, &vhost_call_file);
> >>> +    if (unlikely(r)) {
> >>> +        error_report("vhost_vdpa_set_vring_call for shadow vq failed");
> >>> +        return false;
> >>> +    }
> >>
> >> Similar to kick, do we need to set_vring_call() before vhost_svq_start()?
> >>
> > It should not matter at this moment because the device should not be
> > started at this point and device calls should not run
> > vhost_svq_handle_call until BQL is released.
>
>
> Yes, we stop virtqueue before.
>
>
> >
> > The "logic" of doing it after is to make clear that svq must be fully
> > initialized before processing device calls, even in the case that we
> > extract SVQ in its own iothread or similar. But this could be done
> > before vhost_svq_start for sure.
> >
> >>> +
> >>> +    /* Check for pending calls */
> >>> +    event_notifier_set(vhost_call_notifier);
> >>
> >> Interesting, can this result in a spurious interrupt?
> >>
> > This actually "queues" a vhost_svq_handle_call after the BQL release,
> > where the device should be fully reset. In that regard, if there are
> > no used descriptors there will not be an irq raised to the guest. Does
> > that answer the question? Or have I missed something?
>
>
> Yes, please explain this in the comment.
>
I'm reviewing this again, and I actually think I was wrong about how to
solve the issue.
Since the device is being configured at this point, there is no chance
that we missed a call notification here: a previous kick is needed for
the device to generate any calls, and these cannot be processed.
What is not solved in this series is that we could have pending used
buffers in the vdpa device when stopping SVQ, but queuing a check for
that is not going to solve anything, since the SVQ vring would already
be destroyed:
* vdpa device marks N > 0 buffers as used, and calls.
* Before processing them, SVQ stop is called. SVQ has not processed
these, and cleans them up, making this event_notifier_set useless.
So this would require a few changes. Mainly, instead of queueing a
check for used buffers, they need to be checked before SVQ cleanup.
After that, obtain the VQ state (it is not obtained on stop at the
moment, trusting the guest's used idx) and run a last
vhost_svq_handle_call_no_test while the device is paused.
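As a toy model of that ordering (one possible reading of it, not the
actual SVQ code): pause the device, forward whatever it already marked
as used, record the final index, and only then tear the vring down.
ToySVQ and both functions are made-up names; the real logic would live
around vhost_svq_stop and vhost_svq_handle_call_no_test.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct ToySVQ {
    uint16_t device_used_idx;  /* advanced by the (simulated) device */
    uint16_t last_used_idx;    /* last used entry SVQ has forwarded */
    uint16_t guest_used_idx;   /* used idx as seen by the guest */
    uint16_t saved_used_idx;   /* state captured at stop time */
    bool device_running;
    bool vring_valid;
} ToySVQ;

/* Forward every pending used buffer to the guest; returns how many. */
static unsigned toy_svq_flush_used(ToySVQ *svq)
{
    unsigned n = 0;

    /* Flushing after cleanup would read a destroyed vring. */
    assert(svq->vring_valid);
    while (svq->last_used_idx != svq->device_used_idx) {
        svq->last_used_idx++;      /* uint16_t wraps like the ring index */
        svq->guest_used_idx++;
        n++;
    }
    return n;
}

/* Stop sequence: pause, flush, save state, then clean up. */
static unsigned toy_svq_stop(ToySVQ *svq)
{
    unsigned flushed;

    svq->device_running = false;               /* device stops marking used */
    flushed = toy_svq_flush_used(svq);         /* last pass over used ring */
    svq->saved_used_idx = svq->last_used_idx;  /* VQ state after the flush */
    svq->vring_valid = false;                  /* only now destroy the vring */
    return flushed;
}
```

The point of the model is only the order of the last four statements:
flushing after vring_valid is cleared trips the assert, which mirrors
the problem with queuing the check.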
Thanks!
>
> >
> >>> +    return true;
> >>>    }
> >>>
> >>>    static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> >>>    {
> >>>        struct vhost_dev *hdev = v->dev;
> >>>        unsigned n;
> >>> +    int r;
> >>>
> >>>        if (enable == v->shadow_vqs_enabled) {
> >>>            return hdev->nvqs;
> >>> @@ -752,9 +779,18 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> >>>        if (!enable) {
> >>>            /* Disable all queues or clean up failed start */
> >>>            for (n = 0; n < v->shadow_vqs->len; ++n) {
> >>> +            struct vhost_vring_file file = {
> >>> +                .index = vhost_vdpa_get_vq_index(hdev, n),
> >>> +                .fd = v->call_fd[n],
> >>> +            };
> >>> +
> >>> +            r = vhost_vdpa_set_vring_call(hdev, &file);
> >>> +            assert(r == 0);
> >>> +
> >>>                unsigned vq_idx = vhost_vdpa_get_vq_index(hdev, n);
> >>>                VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, n);
> >>>                vhost_svq_stop(hdev, n, svq);
> >>> +            /* TODO: This can unmask or override call fd! */
> >>
> >> I don't get this comment. Does this mean the current code can't work
> >> with mask_notifiers? If yes, this is something we need to fix.
> >>
> > Yes, but it will be addressed in the next series. I should have
> > explained it better here, sorry :).
>
>
> Ok.
>
> Thanks
>
>
> >
> > Thanks!
> >
> >> Thanks
> >>
> >>
> >>>                vhost_virtqueue_start(hdev, hdev->vdev, &hdev->vqs[n], vq_idx);
> >>>            }
> >>>
>
* Re: [RFC PATCH v4 18/20] vhost: Add VhostIOVATree
  2021-10-19  8:32   ` Jason Wang
@ 2021-10-19  9:22     ` Jason Wang
  2021-10-20  7:54       ` Eugenio Perez Martin
  2021-10-20  7:36     ` Eugenio Perez Martin
  1 sibling, 1 reply; 90+ messages in thread
From: Jason Wang @ 2021-10-19  9:22 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
On Tue, Oct 19, 2021 at 4:32 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/10/1 at 3:06 PM, Eugenio Pérez wrote:
> > This tree is able to look for a translated address from an IOVA address.
> >
> > At first glance is similar to util/iova-tree. However, SVQ working on
> > devices with limited IOVA space need more capabilities, like allocating
> > IOVA chunks or perform reverse translations (qemu addresses to iova).
>
>
> I don't see reverse translation being used anywhere in the shadow code.
> Or did I miss something?
Ok, it looks to me that it is used in the iova allocator. But I think
it's better to decouple it into an independent allocator instead of
keeping it in the vhost iova tree.
Thanks
* Re: [RFC PATCH v4 20/20] vdpa: Add custom IOTLB translations to SVQ
  2021-10-01  7:06 ` [RFC PATCH v4 20/20] vdpa: Add custom IOTLB translations to SVQ Eugenio Pérez
  2021-10-13  5:34   ` Jason Wang
@ 2021-10-19  9:24   ` Jason Wang
  2021-10-19 10:28     ` Eugenio Perez Martin
  1 sibling, 1 reply; 90+ messages in thread
From: Jason Wang @ 2021-10-19  9:24 UTC (permalink / raw)
  To: Eugenio Pérez, qemu-devel
  Cc: Parav Pandit, Juan Quintela, Michael S. Tsirkin,
	Markus Armbruster, virtualization, Harpreet Singh Anand,
	Xiao W Wang, Stefan Hajnoczi, Eli Cohen, Eric Blake,
	Michael Lilja, Stefano Garzarella
On 2021/10/1 at 3:06 PM, Eugenio Pérez wrote:
> Use translations added in VhostIOVATree in SVQ.
>
> Now every element needs to store the previous address also, so VirtQueue
> can consume the elements properly. This adds a little overhead per VQ
> element, having to allocate more memory to stash them. As a possible
> optimization, this allocation could be avoided if the descriptor is not
> a chain but a single one, but this is left undone.
>
> TODO: iova range should be queried before, and add logic to fail when
> GPA is outside of its range and memory listener or svq add it.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
>   hw/virtio/vhost-shadow-virtqueue.h |   4 +-
>   hw/virtio/vhost-shadow-virtqueue.c | 130 ++++++++++++++++++++++++-----
>   hw/virtio/vhost-vdpa.c             |  40 ++++++++-
>   hw/virtio/trace-events             |   1 +
>   4 files changed, 152 insertions(+), 23 deletions(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> index b7baa424a7..a0e6b5267a 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.h
> +++ b/hw/virtio/vhost-shadow-virtqueue.h
> @@ -11,6 +11,7 @@
>   #define VHOST_SHADOW_VIRTQUEUE_H
>   
>   #include "hw/virtio/vhost.h"
> +#include "hw/virtio/vhost-iova-tree.h"
>   
>   typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
>   
> @@ -28,7 +29,8 @@ bool vhost_svq_start(struct vhost_dev *dev, unsigned idx,
>   void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
>                       VhostShadowVirtqueue *svq);
>   
> -VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx);
> +VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx,
> +                                    VhostIOVATree *iova_map);
>   
>   void vhost_svq_free(VhostShadowVirtqueue *vq);
>   
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index 2fd0bab75d..9db538547e 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -11,12 +11,19 @@
>   #include "hw/virtio/vhost-shadow-virtqueue.h"
>   #include "hw/virtio/vhost.h"
>   #include "hw/virtio/virtio-access.h"
> +#include "hw/virtio/vhost-iova-tree.h"
>   
>   #include "standard-headers/linux/vhost_types.h"
>   
>   #include "qemu/error-report.h"
>   #include "qemu/main-loop.h"
>   
> +typedef struct SVQElement {
> +    VirtQueueElement elem;
> +    void **in_sg_stash;
> +    void **out_sg_stash;
> +} SVQElement;
> +
>   /* Shadow virtqueue to relay notifications */
>   typedef struct VhostShadowVirtqueue {
>       /* Shadow vring */
> @@ -46,8 +53,11 @@ typedef struct VhostShadowVirtqueue {
>       /* Virtio device */
>       VirtIODevice *vdev;
>   
> +    /* IOVA mapping if used */
> +    VhostIOVATree *iova_map;
> +
>       /* Map for returning guest's descriptors */
> -    VirtQueueElement **ring_id_maps;
> +    SVQElement **ring_id_maps;
>   
>       /* Next head to expose to device */
>       uint16_t avail_idx_shadow;
> @@ -79,13 +89,6 @@ bool vhost_svq_valid_device_features(uint64_t *dev_features)
>               continue;
>   
>           case VIRTIO_F_ACCESS_PLATFORM:
> -            /* SVQ needs this feature disabled. Can't continue */
> -            if (*dev_features & BIT_ULL(b)) {
> -                clear_bit(b, dev_features);
> -                r = false;
> -            }
> -            break;
> -
>           case VIRTIO_F_VERSION_1:
>               /* SVQ needs this feature, so can't continue */
>               if (!(*dev_features & BIT_ULL(b))) {
> @@ -126,6 +129,64 @@ static void vhost_svq_set_notification(VhostShadowVirtqueue *svq, bool enable)
>       }
>   }
>   
> +static void vhost_svq_stash_addr(void ***stash, const struct iovec *iov,
> +                                 size_t num)
> +{
> +    size_t i;
> +
> +    if (num == 0) {
> +        return;
> +    }
> +
> +    *stash = g_new(void *, num);
> +    for (i = 0; i < num; ++i) {
> +        (*stash)[i] = iov[i].iov_base;
> +    }
> +}
> +
> +static void vhost_svq_unstash_addr(void **stash, struct iovec *iov, size_t num)
> +{
> +    size_t i;
> +
> +    if (num == 0) {
> +        return;
> +    }
> +
> +    for (i = 0; i < num; ++i) {
> +        iov[i].iov_base = stash[i];
> +    }
> +    g_free(stash);
> +}
> +
> +static void vhost_svq_translate_addr(const VhostShadowVirtqueue *svq,
> +                                     struct iovec *iovec, size_t num)
> +{
> +    size_t i;
> +
> +    for (i = 0; i < num; ++i) {
> +        VhostDMAMap needle = {
> +            .translated_addr = iovec[i].iov_base,
> +            .size = iovec[i].iov_len,
> +        };
> +        size_t off;
> +
> +        const VhostDMAMap *map = vhost_iova_tree_find_iova(svq->iova_map,
> +                                                           &needle);
Is it possible that we end up with more than one map here?
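To make the concern concrete, here is a small standalone sketch
(hypothetical names, plain integers instead of pointers, and an
exclusive size, unlike the inclusive VhostDMAMap.size in the patch). A
buffer that is contiguous in qemu's address space can straddle two
maps whose IOVAs are not contiguous, so a single lookup is not enough
and the buffer has to be split into one segment per map.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct ToyMap {
    uint64_t taddr;   /* start address in qemu's address space */
    uint64_t iova;    /* IOVA assigned to that start address */
    uint64_t size;    /* length in bytes (exclusive end, for simplicity) */
} ToyMap;

typedef struct ToySeg {
    uint64_t iova;
    uint64_t len;
} ToySeg;

/*
 * Translate [addr, addr + len) using maps sorted by taddr.
 * Returns the number of segments written to out, or -1 if part of the
 * buffer is unmapped or out is too small.
 */
static int toy_translate(const ToyMap *maps, size_t n_maps,
                         uint64_t addr, uint64_t len,
                         ToySeg *out, size_t out_max)
{
    size_t i, n_out = 0;

    for (i = 0; i < n_maps && len > 0; i++) {
        const ToyMap *m = &maps[i];
        uint64_t off, avail, chunk;

        if (addr < m->taddr) {
            return -1;              /* hole before this map */
        }
        off = addr - m->taddr;
        if (off >= m->size) {
            continue;               /* addr lies beyond this map */
        }
        if (n_out == out_max) {
            return -1;
        }

        avail = m->size - off;      /* bytes left in this map */
        chunk = len < avail ? len : avail;
        out[n_out].iova = m->iova + off;
        out[n_out].len = chunk;
        n_out++;

        addr += chunk;
        len -= chunk;
    }

    return len == 0 ? (int)n_out : -1;
}
```

For two adjacent 4 KiB maps at taddr 0x1000 (iova 0x80000) and 0x2000
(iova 0x10000), a 4 KiB buffer starting at 0x1800 comes back as two
segments: 0x80800 and 0x10000, each 0x800 bytes long.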
> +        /*
> +         * Map cannot be NULL since iova map contains all guest space and
> +         * qemu already has a physical address mapped
> +         */
> +        assert(map);
> +
> +        /*
> +         * Map->iova chunk size is ignored. What to do if descriptor
> +         * (addr, size) does not fit is delegated to the device.
> +         */
> +        off = needle.translated_addr - map->translated_addr;
> +        iovec[i].iov_base = (void *)(map->iova + off);
> +    }
> +}
> +
>   static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
>                                       const struct iovec *iovec,
>                                       size_t num, bool more_descs, bool write)
> @@ -156,8 +217,9 @@ static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
>   }
>   
>   static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
> -                                    VirtQueueElement *elem)
> +                                    SVQElement *svq_elem)
>   {
> +    VirtQueueElement *elem = &svq_elem->elem;
>       int head;
>       unsigned avail_idx;
>       vring_avail_t *avail = svq->vring.avail;
> @@ -167,6 +229,12 @@ static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
>       /* We need some descriptors here */
>       assert(elem->out_num || elem->in_num);
>   
> +    vhost_svq_stash_addr(&svq_elem->in_sg_stash, elem->in_sg, elem->in_num);
> +    vhost_svq_stash_addr(&svq_elem->out_sg_stash, elem->out_sg, elem->out_num);
I wonder if we can avoid tricks like stash and unstash by using
dedicated sgs in svq_elem, instead of reusing the elem.
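Roughly what I have in mind, as a sketch with stand-in types (ToyIov
and ToyElem replace struct iovec and VirtQueueElement, and the fixed
offset replaces the real iova tree lookup; none of these names exist
in the series). The guest's sg list is never touched, so there is
nothing to restore before virtqueue_fill or virtqueue_unpop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

typedef struct ToyIov {
    void *base;
    size_t len;
} ToyIov;

typedef struct ToyElem {       /* stands in for VirtQueueElement */
    unsigned in_num;
    ToyIov *in_sg;
} ToyElem;

typedef struct ToySVQElem {
    ToyElem elem;              /* guest's view, left unmodified */
    ToyIov *dev_in_sg;         /* translated copy exposed to the device */
} ToySVQElem;

/* Fill dev_in_sg with translated addresses; elem.in_sg is only read. */
static bool toy_svq_elem_translate(ToySVQElem *e, uintptr_t offset)
{
    unsigned i;

    if (e->elem.in_num == 0) {
        e->dev_in_sg = NULL;
        return true;
    }

    e->dev_in_sg = calloc(e->elem.in_num, sizeof(*e->dev_in_sg));
    if (!e->dev_in_sg) {
        return false;
    }

    for (i = 0; i < e->elem.in_num; i++) {
        /* Placeholder translation: a constant offset, not a tree lookup. */
        e->dev_in_sg[i].base =
            (void *)((uintptr_t)e->elem.in_sg[i].base + offset);
        e->dev_in_sg[i].len = e->elem.in_sg[i].len;
    }
    return true;
}

/* Drop the translated copy; the guest's sg list needs no restoring. */
static void toy_svq_elem_release(ToySVQElem *e)
{
    free(e->dev_in_sg);
    e->dev_in_sg = NULL;
}
```

The cost is the same extra allocation per element as the stash, but
the element handed back to the guest is always the original one.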
Thanks
> +
> +    vhost_svq_translate_addr(svq, elem->in_sg, elem->in_num);
> +    vhost_svq_translate_addr(svq, elem->out_sg, elem->out_num);
> +
>       vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
>                               elem->in_num > 0, false);
>       vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
> @@ -187,7 +255,7 @@ static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
>   
>   }
>   
> -static void vhost_svq_add(VhostShadowVirtqueue *svq, VirtQueueElement *elem)
> +static void vhost_svq_add(VhostShadowVirtqueue *svq, SVQElement *elem)
>   {
>       unsigned qemu_head = vhost_svq_add_split(svq, elem);
>   
> @@ -221,7 +289,7 @@ static void vhost_handle_guest_kick(EventNotifier *n)
>           }
>   
>           while (true) {
> -            VirtQueueElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
> +            SVQElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
>               if (!elem) {
>                   break;
>               }
> @@ -247,7 +315,7 @@ static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
>       return svq->used_idx != svq->shadow_used_idx;
>   }
>   
> -static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> +static SVQElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
>   {
>       vring_desc_t *descs = svq->vring.desc;
>       const vring_used_t *used = svq->vring.used;
> @@ -279,7 +347,7 @@ static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
>       descs[used_elem.id].next = svq->free_head;
>       svq->free_head = used_elem.id;
>   
> -    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
> +    svq->ring_id_maps[used_elem.id]->elem.len = used_elem.len;
>       return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
>   }
>   
> @@ -296,12 +364,19 @@ static void vhost_svq_handle_call_no_test(EventNotifier *n)
>   
>           vhost_svq_set_notification(svq, false);
>           while (true) {
> -            g_autofree VirtQueueElement *elem = vhost_svq_get_buf(svq);
> -            if (!elem) {
> +            g_autofree SVQElement *svq_elem = vhost_svq_get_buf(svq);
> +            VirtQueueElement *elem;
> +            if (!svq_elem) {
>                   break;
>               }
>   
>               assert(i < svq->vring.num);
> +            elem = &svq_elem->elem;
> +
> +            vhost_svq_unstash_addr(svq_elem->in_sg_stash, elem->in_sg,
> +                                   elem->in_num);
> +            vhost_svq_unstash_addr(svq_elem->out_sg_stash, elem->out_sg,
> +                                   elem->out_num);
>               virtqueue_fill(vq, elem, elem->len, i++);
>           }
>   
> @@ -451,14 +526,24 @@ void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
>       event_notifier_set_handler(&svq->host_notifier, NULL);
>   
>       for (i = 0; i < svq->vring.num; ++i) {
> -        g_autofree VirtQueueElement *elem = svq->ring_id_maps[i];
> +        g_autofree SVQElement *svq_elem = svq->ring_id_maps[i];
> +        VirtQueueElement *elem;
> +
> +        if (!svq_elem) {
> +            continue;
> +        }
> +
> +        elem = &svq_elem->elem;
> +        vhost_svq_unstash_addr(svq_elem->in_sg_stash, elem->in_sg,
> +                               elem->in_num);
> +        vhost_svq_unstash_addr(svq_elem->out_sg_stash, elem->out_sg,
> +                               elem->out_num);
> +
>           /*
>            * Although the doc says we must unpop in order, it's ok to unpop
>            * everything.
>            */
> -        if (elem) {
> -            virtqueue_unpop(svq->vq, elem, elem->len);
> -        }
> +        virtqueue_unpop(svq->vq, elem, elem->len);
>       }
>   }
>   
> @@ -466,7 +551,8 @@ void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
>    * Creates vhost shadow virtqueue, and instruct vhost device to use the shadow
>    * methods and file descriptors.
>    */
> -VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> +VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx,
> +                                    VhostIOVATree *iova_map)
>   {
>       int vq_idx = dev->vq_index + idx;
>       unsigned num = virtio_queue_get_num(dev->vdev, vq_idx);
> @@ -500,11 +586,13 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
>       memset(svq->vring.desc, 0, driver_size);
>       svq->vring.used = qemu_memalign(qemu_real_host_page_size, device_size);
>       memset(svq->vring.used, 0, device_size);
> +    svq->iova_map = iova_map;
> +
>       for (i = 0; i < num - 1; i++) {
>           svq->vring.desc[i].next = cpu_to_le16(i + 1);
>       }
>   
> -    svq->ring_id_maps = g_new0(VirtQueueElement *, num);
> +    svq->ring_id_maps = g_new0(SVQElement *, num);
>       event_notifier_set_handler(&svq->call_notifier,
>                                  vhost_svq_handle_call);
>       return g_steal_pointer(&svq);
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index a9c680b487..f5a12fee9d 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -176,6 +176,18 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
>                                            vaddr, section->readonly);
>   
>       llsize = int128_sub(llend, int128_make64(iova));
> +    if (v->shadow_vqs_enabled) {
> +        VhostDMAMap mem_region = {
> +            .translated_addr = vaddr,
> +            .size = int128_get64(llsize) - 1,
> +            .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> +        };
> +
> +        int r = vhost_iova_tree_alloc(v->iova_map, &mem_region);
> +        assert(r == VHOST_DMA_MAP_OK);
> +
> +        iova = mem_region.iova;
> +    }
>   
>       ret = vhost_vdpa_dma_map(v, iova, int128_get64(llsize),
>                                vaddr, section->readonly);
> @@ -754,6 +766,23 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
>       return true;
>   }
>   
> +static int vhost_vdpa_get_iova_range(struct vhost_dev *dev,
> +                                     hwaddr *first, hwaddr *last)
> +{
> +    int ret;
> +    struct vhost_vdpa_iova_range range;
> +
> +    ret = vhost_vdpa_call(dev, VHOST_VDPA_GET_IOVA_RANGE, &range);
> +    if (ret != 0) {
> +        return ret;
> +    }
> +
> +    *first = range.first;
> +    *last = range.last;
> +    trace_vhost_vdpa_get_iova_range(dev, *first, *last);
> +    return ret;
> +}
> +
>   /**
>    * Maps QEMU vaddr memory to device in a suitable way for shadow virtqueue:
>    * - It always reference qemu memory address, not guest's memory.
> @@ -881,6 +910,7 @@ static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
>   static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
>   {
>       struct vhost_dev *hdev = v->dev;
> +    hwaddr iova_first, iova_last;
>       unsigned n;
>       int r;
>   
> @@ -894,7 +924,7 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
>           /* Allocate resources */
>           assert(v->shadow_vqs->len == 0);
>           for (n = 0; n < hdev->nvqs; ++n) {
> -            VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n);
> +            VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n, v->iova_map);
>               if (unlikely(!svq)) {
>                   g_ptr_array_set_size(v->shadow_vqs, 0);
>                   return 0;
> @@ -903,6 +933,8 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
>           }
>       }
>   
> +    r = vhost_vdpa_get_iova_range(hdev, &iova_first, &iova_last);
> +    assert(r == 0);
>       r = vhost_vdpa_vring_pause(hdev);
>       assert(r == 0);
>   
> @@ -913,6 +945,12 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
>           }
>       }
>   
> +    memory_listener_unregister(&v->listener);
> +    if (vhost_vdpa_dma_unmap(v, iova_first,
> +                             (iova_last - iova_first) & TARGET_PAGE_MASK)) {
> +        error_report("Fail to invalidate device iotlb");
> +    }
> +
>       /* Reset device so it can be configured */
>       r = vhost_vdpa_dev_start(hdev, false);
>       assert(r == 0);
> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> index 8ed19e9d0c..650e521e35 100644
> --- a/hw/virtio/trace-events
> +++ b/hw/virtio/trace-events
> @@ -52,6 +52,7 @@ vhost_vdpa_set_vring_call(void *dev, unsigned int index, int fd) "dev: %p index:
>   vhost_vdpa_get_features(void *dev, uint64_t features) "dev: %p features: 0x%"PRIx64
>   vhost_vdpa_set_owner(void *dev) "dev: %p"
>   vhost_vdpa_vq_get_addr(void *dev, void *vq, uint64_t desc_user_addr, uint64_t avail_user_addr, uint64_t used_user_addr) "dev: %p vq: %p desc_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" used_user_addr: 0x%"PRIx64
> +vhost_vdpa_get_iova_range(void *dev, uint64_t first, uint64_t last) "dev: %p first: 0x%"PRIx64" last: 0x%"PRIx64
>   
>   # virtio.c
>   virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
* Re: [RFC PATCH v4 20/20] vdpa: Add custom IOTLB translations to SVQ
  2021-10-19  9:24   ` Jason Wang
@ 2021-10-19 10:28     ` Eugenio Perez Martin
  2021-10-20  2:02       ` Jason Wang
  0 siblings, 1 reply; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-19 10:28 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-devel, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Michael Lilja, Stefano Garzarella
On Tue, Oct 19, 2021 at 11:25 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/10/1 3:06 PM, Eugenio Pérez wrote:
> > Use translations added in VhostIOVATree in SVQ.
> >
> > Now every element needs to store the previous address also, so VirtQueue
> > can consume the elements properly. This adds a little overhead per VQ
> > element, having to allocate more memory to stash them. As a possible
> > optimization, this allocation could be avoided if the descriptor is not
> > a chain but a single one, but this is left undone.
> >
> > TODO: the iova range should be queried beforehand, and logic added to
> > fail when a GPA added by the memory listener or SVQ is outside of it.
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   hw/virtio/vhost-shadow-virtqueue.h |   4 +-
> >   hw/virtio/vhost-shadow-virtqueue.c | 130 ++++++++++++++++++++++++-----
> >   hw/virtio/vhost-vdpa.c             |  40 ++++++++-
> >   hw/virtio/trace-events             |   1 +
> >   4 files changed, 152 insertions(+), 23 deletions(-)
> >
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> > index b7baa424a7..a0e6b5267a 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.h
> > +++ b/hw/virtio/vhost-shadow-virtqueue.h
> > @@ -11,6 +11,7 @@
> >   #define VHOST_SHADOW_VIRTQUEUE_H
> >
> >   #include "hw/virtio/vhost.h"
> > +#include "hw/virtio/vhost-iova-tree.h"
> >
> >   typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
> >
> > @@ -28,7 +29,8 @@ bool vhost_svq_start(struct vhost_dev *dev, unsigned idx,
> >   void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> >                       VhostShadowVirtqueue *svq);
> >
> > -VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx);
> > +VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx,
> > +                                    VhostIOVATree *iova_map);
> >
> >   void vhost_svq_free(VhostShadowVirtqueue *vq);
> >
> > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > index 2fd0bab75d..9db538547e 100644
> > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > @@ -11,12 +11,19 @@
> >   #include "hw/virtio/vhost-shadow-virtqueue.h"
> >   #include "hw/virtio/vhost.h"
> >   #include "hw/virtio/virtio-access.h"
> > +#include "hw/virtio/vhost-iova-tree.h"
> >
> >   #include "standard-headers/linux/vhost_types.h"
> >
> >   #include "qemu/error-report.h"
> >   #include "qemu/main-loop.h"
> >
> > +typedef struct SVQElement {
> > +    VirtQueueElement elem;
> > +    void **in_sg_stash;
> > +    void **out_sg_stash;
> > +} SVQElement;
> > +
> >   /* Shadow virtqueue to relay notifications */
> >   typedef struct VhostShadowVirtqueue {
> >       /* Shadow vring */
> > @@ -46,8 +53,11 @@ typedef struct VhostShadowVirtqueue {
> >       /* Virtio device */
> >       VirtIODevice *vdev;
> >
> > +    /* IOVA mapping if used */
> > +    VhostIOVATree *iova_map;
> > +
> >       /* Map for returning guest's descriptors */
> > -    VirtQueueElement **ring_id_maps;
> > +    SVQElement **ring_id_maps;
> >
> >       /* Next head to expose to device */
> >       uint16_t avail_idx_shadow;
> > @@ -79,13 +89,6 @@ bool vhost_svq_valid_device_features(uint64_t *dev_features)
> >               continue;
> >
> >           case VIRTIO_F_ACCESS_PLATFORM:
> > -            /* SVQ needs this feature disabled. Can't continue */
> > -            if (*dev_features & BIT_ULL(b)) {
> > -                clear_bit(b, dev_features);
> > -                r = false;
> > -            }
> > -            break;
> > -
> >           case VIRTIO_F_VERSION_1:
> >               /* SVQ needs this feature, so can't continue */
> >               if (!(*dev_features & BIT_ULL(b))) {
> > @@ -126,6 +129,64 @@ static void vhost_svq_set_notification(VhostShadowVirtqueue *svq, bool enable)
> >       }
> >   }
> >
> > +static void vhost_svq_stash_addr(void ***stash, const struct iovec *iov,
> > +                                 size_t num)
> > +{
> > +    size_t i;
> > +
> > +    if (num == 0) {
> > +        return;
> > +    }
> > +
> > +    *stash = g_new(void *, num);
> > +    for (i = 0; i < num; ++i) {
> > +        (*stash)[i] = iov[i].iov_base;
> > +    }
> > +}
> > +
> > +static void vhost_svq_unstash_addr(void **stash, struct iovec *iov, size_t num)
> > +{
> > +    size_t i;
> > +
> > +    if (num == 0) {
> > +        return;
> > +    }
> > +
> > +    for (i = 0; i < num; ++i) {
> > +        iov[i].iov_base = stash[i];
> > +    }
> > +    g_free(stash);
> > +}
> > +
> > +static void vhost_svq_translate_addr(const VhostShadowVirtqueue *svq,
> > +                                     struct iovec *iovec, size_t num)
> > +{
> > +    size_t i;
> > +
> > +    for (i = 0; i < num; ++i) {
> > +        VhostDMAMap needle = {
> > +            .translated_addr = iovec[i].iov_base,
> > +            .size = iovec[i].iov_len,
> > +        };
> > +        size_t off;
> > +
> > +        const VhostDMAMap *map = vhost_iova_tree_find_iova(svq->iova_map,
> > +                                                           &needle);
>
>
> Is it possible that we end up with more than one map here?
>
Actually it is possible, since there is no guarantee that one
descriptor (or indirect descriptor) maps exactly to one iov. It could
map to many if the qemu vaddr is not contiguous but GPA + size is. This
is something that must be fixed for the next revision, so thanks for
pointing it out!
Taking that into account, the condition that the SVQ vring's avail_idx -
used_idx is always less than or equal to the guest vring's avail_idx -
used_idx no longer holds. Checking for that before adding buffers
to SVQ is the easy part, but how could we recover in that case?
I think that the easy solution is to check for more available buffers
unconditionally at the end of vhost_svq_handle_call, which handles the
SVQ used buffers and is supposed to make more room for available
buffers. So vhost_handle_guest_kick would no longer check whether the
eventfd is set.
Would that make sense?
Thanks!
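To illustrate the problem above, here is a rough toy model. It is not
code from this series: ToyRegion, toy_chunks_needed and toy_svq_has_room
are hypothetical names, and each guest region is assumed to be backed by
a host mapping that is not contiguous with its neighbours. It counts how
many iovs a single guest (GPA, size) buffer turns into, then checks
whether the shadow ring still has room for them:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical guest memory region, backed by its own host mapping. */
typedef struct ToyRegion {
    uint64_t gpa;   /* first guest physical address of the region */
    uint64_t size;  /* number of bytes covered by the region */
} ToyRegion;

/*
 * Count how many regions (sorted by gpa, non-overlapping) the buffer
 * [addr, addr + len) touches. Each touched region needs its own iov,
 * and so its own SVQ descriptor. Returns 0 if part of the buffer is
 * not covered by any region.
 */
static size_t toy_chunks_needed(const ToyRegion *regions, size_t n_regions,
                                uint64_t addr, uint64_t len)
{
    size_t chunks = 0;
    uint64_t cur = addr;
    uint64_t end = addr + len;

    for (size_t i = 0; i < n_regions && cur < end; ++i) {
        uint64_t region_end = regions[i].gpa + regions[i].size;

        if (cur < regions[i].gpa) {
            return 0;   /* hole before this region */
        }
        if (cur >= region_end) {
            continue;   /* region lies entirely before the buffer */
        }
        cur = region_end < end ? region_end : end;
        chunks++;
    }
    return cur == end ? chunks : 0;
}

/* True if the shadow ring can still take `needed` more descriptors. */
static bool toy_svq_has_room(uint16_t ring_size, uint16_t in_flight,
                             size_t needed)
{
    return needed <= (size_t)(ring_size - in_flight);
}
```

With two adjacent 4 KiB regions, a buffer that straddles the boundary
needs two descriptors, so a shadow ring with a single free slot cannot
take it even though the guest used only one descriptor for it.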
>
> > +        /*
> > +         * Map cannot be NULL since iova map contains all guest space and
> > +         * qemu already has a physical address mapped
> > +         */
> > +        assert(map);
> > +
> > +        /*
> > +         * Map->iova chunk size is ignored. What to do if descriptor
> > +         * (addr, size) does not fit is delegated to the device.
> > +         */
> > +        off = needle.translated_addr - map->translated_addr;
> > +        iovec[i].iov_base = (void *)(map->iova + off);
> > +    }
> > +}
> > +
> >   static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> >                                       const struct iovec *iovec,
> >                                       size_t num, bool more_descs, bool write)
> > @@ -156,8 +217,9 @@ static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> >   }
> >
> >   static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
> > -                                    VirtQueueElement *elem)
> > +                                    SVQElement *svq_elem)
> >   {
> > +    VirtQueueElement *elem = &svq_elem->elem;
> >       int head;
> >       unsigned avail_idx;
> >       vring_avail_t *avail = svq->vring.avail;
> > @@ -167,6 +229,12 @@ static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
> >       /* We need some descriptors here */
> >       assert(elem->out_num || elem->in_num);
> >
> > +    vhost_svq_stash_addr(&svq_elem->in_sg_stash, elem->in_sg, elem->in_num);
> > +    vhost_svq_stash_addr(&svq_elem->out_sg_stash, elem->out_sg, elem->out_num);
>
>
> I wonder if we can avoid the stash/unstash trick by using a
> dedicated sgs array in svq_elem, instead of reusing the elem.
>
Actually yes, it would be way simpler to use a new sgs array in
svq_elem. I will change that.
Thanks!
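A minimal sketch of what a dedicated sgs array could look like, under
assumptions: ToySVQElement, toy_elem_translate and toy_elem_release are
made-up names, and the translation is a fixed offset standing in for the
real IOVA tree lookup. The guest-visible iovecs are never modified, so
nothing needs to be stashed or restored:

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/uio.h>

/* Hypothetical element: guest addresses and device addresses kept apart. */
typedef struct ToySVQElement {
    struct iovec *guest_sg; /* as consumed by the VirtQueue, left untouched */
    struct iovec *dev_sg;   /* translated copy exposed to the device */
    size_t num;             /* number of entries in both arrays */
} ToySVQElement;

/*
 * Fill dev_sg with translated addresses. The fixed `offset` is only a
 * stand-in for a real vaddr -> IOVA lookup. Returns 0 on success and
 * -1 on allocation failure.
 */
static int toy_elem_translate(ToySVQElement *e, uintptr_t offset)
{
    e->dev_sg = calloc(e->num, sizeof(*e->dev_sg));
    if (!e->dev_sg && e->num) {
        return -1;
    }

    for (size_t i = 0; i < e->num; ++i) {
        uintptr_t base = (uintptr_t)e->guest_sg[i].iov_base;

        e->dev_sg[i].iov_base = (void *)(base + offset);
        e->dev_sg[i].iov_len = e->guest_sg[i].iov_len;
    }
    return 0;
}

/* Drop the translated copy; guest_sg needs no restoring. */
static void toy_elem_release(ToySVQElement *e)
{
    free(e->dev_sg);
    e->dev_sg = NULL;
}
```

The cost is the same extra allocation per element, but the used path
only has to free dev_sg instead of copying addresses back.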
> Thanks
>
>
> > +
> > +    vhost_svq_translate_addr(svq, elem->in_sg, elem->in_num);
> > +    vhost_svq_translate_addr(svq, elem->out_sg, elem->out_num);
> > +
> >       vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
> >                               elem->in_num > 0, false);
> >       vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
> > @@ -187,7 +255,7 @@ static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
> >
> >   }
> >
> > -static void vhost_svq_add(VhostShadowVirtqueue *svq, VirtQueueElement *elem)
> > +static void vhost_svq_add(VhostShadowVirtqueue *svq, SVQElement *elem)
> >   {
> >       unsigned qemu_head = vhost_svq_add_split(svq, elem);
> >
> > @@ -221,7 +289,7 @@ static void vhost_handle_guest_kick(EventNotifier *n)
> >           }
> >
> >           while (true) {
> > -            VirtQueueElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
> > +            SVQElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
> >               if (!elem) {
> >                   break;
> >               }
> > @@ -247,7 +315,7 @@ static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
> >       return svq->used_idx != svq->shadow_used_idx;
> >   }
> >
> > -static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> > +static SVQElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> >   {
> >       vring_desc_t *descs = svq->vring.desc;
> >       const vring_used_t *used = svq->vring.used;
> > @@ -279,7 +347,7 @@ static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> >       descs[used_elem.id].next = svq->free_head;
> >       svq->free_head = used_elem.id;
> >
> > -    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
> > +    svq->ring_id_maps[used_elem.id]->elem.len = used_elem.len;
> >       return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
> >   }
> >
> > @@ -296,12 +364,19 @@ static void vhost_svq_handle_call_no_test(EventNotifier *n)
> >
> >           vhost_svq_set_notification(svq, false);
> >           while (true) {
> > -            g_autofree VirtQueueElement *elem = vhost_svq_get_buf(svq);
> > -            if (!elem) {
> > +            g_autofree SVQElement *svq_elem = vhost_svq_get_buf(svq);
> > +            VirtQueueElement *elem;
> > +            if (!svq_elem) {
> >                   break;
> >               }
> >
> >               assert(i < svq->vring.num);
> > +            elem = &svq_elem->elem;
> > +
> > +            vhost_svq_unstash_addr(svq_elem->in_sg_stash, elem->in_sg,
> > +                                   elem->in_num);
> > +            vhost_svq_unstash_addr(svq_elem->out_sg_stash, elem->out_sg,
> > +                                   elem->out_num);
> >               virtqueue_fill(vq, elem, elem->len, i++);
> >           }
> >
> > @@ -451,14 +526,24 @@ void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> >       event_notifier_set_handler(&svq->host_notifier, NULL);
> >
> >       for (i = 0; i < svq->vring.num; ++i) {
> > -        g_autofree VirtQueueElement *elem = svq->ring_id_maps[i];
> > +        g_autofree SVQElement *svq_elem = svq->ring_id_maps[i];
> > +        VirtQueueElement *elem;
> > +
> > +        if (!svq_elem) {
> > +            continue;
> > +        }
> > +
> > +        elem = &svq_elem->elem;
> > +        vhost_svq_unstash_addr(svq_elem->in_sg_stash, elem->in_sg,
> > +                               elem->in_num);
> > +        vhost_svq_unstash_addr(svq_elem->out_sg_stash, elem->out_sg,
> > +                               elem->out_num);
> > +
> >           /*
> >            * Although the doc says we must unpop in order, it's ok to unpop
> >            * everything.
> >            */
> > -        if (elem) {
> > -            virtqueue_unpop(svq->vq, elem, elem->len);
> > -        }
> > +        virtqueue_unpop(svq->vq, elem, elem->len);
> >       }
> >   }
> >
> > @@ -466,7 +551,8 @@ void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> >    * Creates vhost shadow virtqueue, and instruct vhost device to use the shadow
> >    * methods and file descriptors.
> >    */
> > -VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> > +VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx,
> > +                                    VhostIOVATree *iova_map)
> >   {
> >       int vq_idx = dev->vq_index + idx;
> >       unsigned num = virtio_queue_get_num(dev->vdev, vq_idx);
> > @@ -500,11 +586,13 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> >       memset(svq->vring.desc, 0, driver_size);
> >       svq->vring.used = qemu_memalign(qemu_real_host_page_size, device_size);
> >       memset(svq->vring.used, 0, device_size);
> > +    svq->iova_map = iova_map;
> > +
> >       for (i = 0; i < num - 1; i++) {
> >           svq->vring.desc[i].next = cpu_to_le16(i + 1);
> >       }
> >
> > -    svq->ring_id_maps = g_new0(VirtQueueElement *, num);
> > +    svq->ring_id_maps = g_new0(SVQElement *, num);
> >       event_notifier_set_handler(&svq->call_notifier,
> >                                  vhost_svq_handle_call);
> >       return g_steal_pointer(&svq);
> > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > index a9c680b487..f5a12fee9d 100644
> > --- a/hw/virtio/vhost-vdpa.c
> > +++ b/hw/virtio/vhost-vdpa.c
> > @@ -176,6 +176,18 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
> >                                            vaddr, section->readonly);
> >
> >       llsize = int128_sub(llend, int128_make64(iova));
> > +    if (v->shadow_vqs_enabled) {
> > +        VhostDMAMap mem_region = {
> > +            .translated_addr = vaddr,
> > +            .size = int128_get64(llsize) - 1,
> > +            .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> > +        };
> > +
> > +        int r = vhost_iova_tree_alloc(v->iova_map, &mem_region);
> > +        assert(r == VHOST_DMA_MAP_OK);
> > +
> > +        iova = mem_region.iova;
> > +    }
> >
> >       ret = vhost_vdpa_dma_map(v, iova, int128_get64(llsize),
> >                                vaddr, section->readonly);
> > @@ -754,6 +766,23 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
> >       return true;
> >   }
> >
> > +static int vhost_vdpa_get_iova_range(struct vhost_dev *dev,
> > +                                     hwaddr *first, hwaddr *last)
> > +{
> > +    int ret;
> > +    struct vhost_vdpa_iova_range range;
> > +
> > +    ret = vhost_vdpa_call(dev, VHOST_VDPA_GET_IOVA_RANGE, &range);
> > +    if (ret != 0) {
> > +        return ret;
> > +    }
> > +
> > +    *first = range.first;
> > +    *last = range.last;
> > +    trace_vhost_vdpa_get_iova_range(dev, *first, *last);
> > +    return ret;
> > +}
> > +
> >   /**
> >    * Maps QEMU vaddr memory to device in a suitable way for shadow virtqueue:
> >    * - It always reference qemu memory address, not guest's memory.
> > @@ -881,6 +910,7 @@ static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
> >   static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> >   {
> >       struct vhost_dev *hdev = v->dev;
> > +    hwaddr iova_first, iova_last;
> >       unsigned n;
> >       int r;
> >
> > @@ -894,7 +924,7 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> >           /* Allocate resources */
> >           assert(v->shadow_vqs->len == 0);
> >           for (n = 0; n < hdev->nvqs; ++n) {
> > -            VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n);
> > +            VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n, v->iova_map);
> >               if (unlikely(!svq)) {
> >                   g_ptr_array_set_size(v->shadow_vqs, 0);
> >                   return 0;
> > @@ -903,6 +933,8 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> >           }
> >       }
> >
> > +    r = vhost_vdpa_get_iova_range(hdev, &iova_first, &iova_last);
> > +    assert(r == 0);
> >       r = vhost_vdpa_vring_pause(hdev);
> >       assert(r == 0);
> >
> > @@ -913,6 +945,12 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> >           }
> >       }
> >
> > +    memory_listener_unregister(&v->listener);
> > +    if (vhost_vdpa_dma_unmap(v, iova_first,
> > +                             (iova_last - iova_first) & TARGET_PAGE_MASK)) {
> > +        error_report("Fail to invalidate device iotlb");
> > +    }
> > +
> >       /* Reset device so it can be configured */
> >       r = vhost_vdpa_dev_start(hdev, false);
> >       assert(r == 0);
> > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > index 8ed19e9d0c..650e521e35 100644
> > --- a/hw/virtio/trace-events
> > +++ b/hw/virtio/trace-events
> > @@ -52,6 +52,7 @@ vhost_vdpa_set_vring_call(void *dev, unsigned int index, int fd) "dev: %p index:
> >   vhost_vdpa_get_features(void *dev, uint64_t features) "dev: %p features: 0x%"PRIx64
> >   vhost_vdpa_set_owner(void *dev) "dev: %p"
> >   vhost_vdpa_vq_get_addr(void *dev, void *vq, uint64_t desc_user_addr, uint64_t avail_user_addr, uint64_t used_user_addr) "dev: %p vq: %p desc_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" used_user_addr: 0x%"PRIx64
> > +vhost_vdpa_get_iova_range(void *dev, uint64_t first, uint64_t last) "dev: %p first: 0x%"PRIx64" last: 0x%"PRIx64
> >
> >   # virtio.c
> >   virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
>
* Re: [RFC PATCH v4 11/20] vhost: Route host->guest notification through shadow virtqueue
  2021-10-19  8:39         ` Eugenio Perez Martin
@ 2021-10-20  2:01           ` Jason Wang
  2021-10-20  6:36             ` Eugenio Perez Martin
  0 siblings, 1 reply; 90+ messages in thread
From: Jason Wang @ 2021-10-20  2:01 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-devel, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Stefano Garzarella
On Tue, Oct 19, 2021 at 4:40 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Fri, Oct 15, 2021 at 6:42 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> >
> > On 2021/10/15 12:39 AM, Eugenio Perez Martin wrote:
> > > On Wed, Oct 13, 2021 at 5:47 AM Jason Wang <jasowang@redhat.com> wrote:
> > >>
> > >> On 2021/10/1 3:05 PM, Eugenio Pérez wrote:
> > >>> This will make qemu aware of the device used buffers, allowing it to
> > >>> write the guest memory with its contents if needed.
> > >>>
> > >>> Since the use of vhost_virtqueue_start can unmask and discard call
> > >>> events, vhost_virtqueue_start should be modified in one of these ways:
> > >>> * Split in two: One of them uses all logic to start a queue with no
> > >>>     side effects for the guest, and another one that actually assumes that
> > >>>     the guest has just started the device. Vdpa should use just the
> > >>>     former.
> > >>> * Actually store and check if the guest notifier is masked, and do it
> > >>>     conditionally.
> > >>> * Leave it as it is, and duplicate all the logic in vhost-vdpa.
> > >>>
> > >>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > >>> ---
> > >>>    hw/virtio/vhost-shadow-virtqueue.c | 19 +++++++++++++++
> > >>>    hw/virtio/vhost-vdpa.c             | 38 +++++++++++++++++++++++++++++-
> > >>>    2 files changed, 56 insertions(+), 1 deletion(-)
> > >>>
> > >>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > >>> index 21dc99ab5d..3fe129cf63 100644
> > >>> --- a/hw/virtio/vhost-shadow-virtqueue.c
> > >>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > >>> @@ -53,6 +53,22 @@ static void vhost_handle_guest_kick(EventNotifier *n)
> > >>>        event_notifier_set(&svq->kick_notifier);
> > >>>    }
> > >>>
> > >>> +/* Forward vhost notifications */
> > >>> +static void vhost_svq_handle_call_no_test(EventNotifier *n)
> > >>> +{
> > >>> +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> > >>> +                                             call_notifier);
> > >>> +
> > >>> +    event_notifier_set(&svq->guest_call_notifier);
> > >>> +}
> > >>> +
> > >>> +static void vhost_svq_handle_call(EventNotifier *n)
> > >>> +{
> > >>> +    if (likely(event_notifier_test_and_clear(n))) {
> > >>> +        vhost_svq_handle_call_no_test(n);
> > >>> +    }
> > >>> +}
> > >>> +
> > >>>    /*
> > >>>     * Obtain the SVQ call notifier, where vhost device notifies SVQ that there
> > >>>     * exists pending used buffers.
> > >>> @@ -180,6 +196,8 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> > >>>        }
> > >>>
> > >>>        svq->vq = virtio_get_queue(dev->vdev, vq_idx);
> > >>> +    event_notifier_set_handler(&svq->call_notifier,
> > >>> +                               vhost_svq_handle_call);
> > >>>        return g_steal_pointer(&svq);
> > >>>
> > >>>    err_init_call_notifier:
> > >>> @@ -195,6 +213,7 @@ err_init_kick_notifier:
> > >>>    void vhost_svq_free(VhostShadowVirtqueue *vq)
> > >>>    {
> > >>>        event_notifier_cleanup(&vq->kick_notifier);
> > >>> +    event_notifier_set_handler(&vq->call_notifier, NULL);
> > >>>        event_notifier_cleanup(&vq->call_notifier);
> > >>>        g_free(vq);
> > >>>    }
> > >>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > >>> index bc34de2439..6c5f4c98b8 100644
> > >>> --- a/hw/virtio/vhost-vdpa.c
> > >>> +++ b/hw/virtio/vhost-vdpa.c
> > >>> @@ -712,13 +712,40 @@ static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
> > >>>    {
> > >>>        struct vhost_vdpa *v = dev->opaque;
> > >>>        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, idx);
> > >>> -    return vhost_svq_start(dev, idx, svq);
> > >>> +    EventNotifier *vhost_call_notifier = vhost_svq_get_svq_call_notifier(svq);
> > >>> +    struct vhost_vring_file vhost_call_file = {
> > >>> +        .index = idx + dev->vq_index,
> > >>> +        .fd = event_notifier_get_fd(vhost_call_notifier),
> > >>> +    };
> > >>> +    int r;
> > >>> +    bool b;
> > >>> +
> > >>> +    /* Set shadow vq -> guest notifier */
> > >>> +    assert(v->call_fd[idx]);
> > >>
> > >> We need to avoid the assert() here. In which case can we hit this?
> > >>
> > > I would say that there is no way we can actually hit it, so let's remove it.
> > >
> > >>> +    vhost_svq_set_guest_call_notifier(svq, v->call_fd[idx]);
> > >>> +
> > >>> +    b = vhost_svq_start(dev, idx, svq);
> > >>> +    if (unlikely(!b)) {
> > >>> +        return false;
> > >>> +    }
> > >>> +
> > >>> +    /* Set device -> SVQ notifier */
> > >>> +    r = vhost_vdpa_set_vring_dev_call(dev, &vhost_call_file);
> > >>> +    if (unlikely(r)) {
> > >>> +        error_report("vhost_vdpa_set_vring_call for shadow vq failed");
> > >>> +        return false;
> > >>> +    }
> > >>
> > >> Similar to kick, do we need to set_vring_call() before vhost_svq_start()?
> > >>
> > > It should not matter at this moment because the device should not be
> > > started at this point and device calls should not run
> > > vhost_svq_handle_call until BQL is released.
> >
> >
> > Yes, we stop virtqueue before.
> >
> >
> > >
> > > The "logic" of doing it after is to make clear that svq must be fully
> > > initialized before processing device calls, even in the case that we
> > > extract SVQ in its own iothread or similar. But this could be done
> > > before vhost_svq_start for sure.
> > >
> > >>> +
> > >>> +    /* Check for pending calls */
> > >>> +    event_notifier_set(vhost_call_notifier);
> > >>
> > >> Interesting, can this result in a spurious interrupt?
> > >>
> > > This actually "queues" a vhost_svq_handle_call after the BQL release,
> > > where the device should be fully reset. In that regard, if there are
> > > no used descriptors there will not be an irq raised to the guest. Does
> > > that answer the question? Or have I missed something?
> >
> >
> > Yes, please explain this in the comment.
> >
>
> I'm reviewing this again, and actually I think I was wrong in solving the issue.
>
> Since at this point the device is being configured, there is no chance
> that we had a missing call notification here: A previous kick is
> needed for the device to generate any calls, and these cannot be
> processed.
>
> What is not solved in this series is that the vdpa device could have
> pending used buffers when SVQ is stopped, but queuing a check for that
> is not going to solve anything, since the SVQ vring would already be
> destroyed:
>
> * vdpa device marks N > 0 buffers as used, and calls.
> * Before processing them, SVQ stop is called. SVQ has not processed
> these, and cleans them up, making this event_notifier_set useless.
>
> So this would require a few changes. Mainly, instead of queueing a
> check for used buffers, these need to be checked before SVQ cleanup.
> After that, obtain the VQ state (it is not obtained on stop at the
> moment, trusting the guest's used idx) and run a last
> vhost_svq_handle_call_no_test while the device is paused.
It looks to me like what's really important is that SVQ needs to
drain/forward used buffers after vdpa is stopped. Then we should be
fine.
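As a rough sketch of that ordering (a toy model only: ToySVQ,
toy_svq_flush_used and toy_svq_stop are hypothetical names, and the
"device" is just a used index that nothing advances once it is marked
stopped):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shadow virtqueue state, reduced to what the sketch needs. */
typedef struct ToySVQ {
    uint16_t device_used_idx; /* used index as written by the toy device */
    uint16_t last_used_idx;   /* last used entry SVQ has forwarded */
    unsigned forwarded;       /* buffers returned to the guest so far */
    bool device_running;
} ToySVQ;

/* Forward every used buffer not seen yet; returns how many were forwarded. */
static unsigned toy_svq_flush_used(ToySVQ *svq)
{
    unsigned n = 0;

    /* Indexes are free-running uint16_t values, so wrap-around is fine. */
    while (svq->last_used_idx != svq->device_used_idx) {
        svq->last_used_idx++;
        svq->forwarded++;
        n++;
    }
    return n;
}

/*
 * Stop order: first make sure the device cannot mark more buffers as
 * used, then drain what it already marked. Only after this returns
 * would it be safe to tear down the shadow vring.
 */
static unsigned toy_svq_stop(ToySVQ *svq)
{
    svq->device_running = false;
    return toy_svq_flush_used(svq);
}
```

Draining after the stop, rather than before, is what closes the window
in which the device could mark a buffer as used between the last check
and the cleanup.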
>
> Thanks!
>
> >
> > >
> > >>> +    return true;
> > >>>    }
> > >>>
> > >>>    static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > >>>    {
> > >>>        struct vhost_dev *hdev = v->dev;
> > >>>        unsigned n;
> > >>> +    int r;
> > >>>
> > >>>        if (enable == v->shadow_vqs_enabled) {
> > >>>            return hdev->nvqs;
> > >>> @@ -752,9 +779,18 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > >>>        if (!enable) {
> > >>>            /* Disable all queues or clean up failed start */
> > >>>            for (n = 0; n < v->shadow_vqs->len; ++n) {
> > >>> +            struct vhost_vring_file file = {
> > >>> +                .index = vhost_vdpa_get_vq_index(hdev, n),
> > >>> +                .fd = v->call_fd[n],
> > >>> +            };
> > >>> +
> > >>> +            r = vhost_vdpa_set_vring_call(hdev, &file);
> > >>> +            assert(r == 0);
> > >>> +
> > >>>                unsigned vq_idx = vhost_vdpa_get_vq_index(hdev, n);
> > >>>                VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, n);
> > >>>                vhost_svq_stop(hdev, n, svq);
> > >>> +            /* TODO: This can unmask or override call fd! */
> > >>
> > >> I don't get this comment. Does this mean the current code can't work
> > >> with mask_notifiers? If yes, this is something we need to fix.
> > >>
> > > Yes, but it will be addressed in the next series. I should have
> > > explained it better here, sorry :).
> >
> >
> > Ok.
> >
> > Thanks
> >
> >
> > >
> > > Thanks!
> > >
> > >> Thanks
> > >>
> > >>
> > >>>                vhost_virtqueue_start(hdev, hdev->vdev, &hdev->vqs[n], vq_idx);
> > >>>            }
> > >>>
> >
>
* Re: [RFC PATCH v4 20/20] vdpa: Add custom IOTLB translations to SVQ
  2021-10-19 10:28     ` Eugenio Perez Martin
@ 2021-10-20  2:02       ` Jason Wang
  2021-10-20  2:07         ` Jason Wang
  0 siblings, 1 reply; 90+ messages in thread
From: Jason Wang @ 2021-10-20  2:02 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-devel, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Michael Lilja, Stefano Garzarella
On Tue, Oct 19, 2021 at 6:29 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Tue, Oct 19, 2021 at 11:25 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> >
> > On 2021/10/1 3:06 PM, Eugenio Pérez wrote:
> > > Use translations added in VhostIOVATree in SVQ.
> > >
> > > Now every element needs to store the previous address also, so VirtQueue
> > > can consume the elements properly. This adds a little overhead per VQ
> > > element, having to allocate more memory to stash them. As a possible
> > > optimization, this allocation could be avoided if the descriptor is not
> > > a chain but a single one, but this is left undone.
> > >
> > > TODO: iova range should be queried before, and add logic to fail when
> > > GPA is outside of its range and memory listener or svq add it.
> > >
> > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > ---
> > >   hw/virtio/vhost-shadow-virtqueue.h |   4 +-
> > >   hw/virtio/vhost-shadow-virtqueue.c | 130 ++++++++++++++++++++++++-----
> > >   hw/virtio/vhost-vdpa.c             |  40 ++++++++-
> > >   hw/virtio/trace-events             |   1 +
> > >   4 files changed, 152 insertions(+), 23 deletions(-)
> > >
> > > diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> > > index b7baa424a7..a0e6b5267a 100644
> > > --- a/hw/virtio/vhost-shadow-virtqueue.h
> > > +++ b/hw/virtio/vhost-shadow-virtqueue.h
> > > @@ -11,6 +11,7 @@
> > >   #define VHOST_SHADOW_VIRTQUEUE_H
> > >
> > >   #include "hw/virtio/vhost.h"
> > > +#include "hw/virtio/vhost-iova-tree.h"
> > >
> > >   typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
> > >
> > > @@ -28,7 +29,8 @@ bool vhost_svq_start(struct vhost_dev *dev, unsigned idx,
> > >   void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> > >                       VhostShadowVirtqueue *svq);
> > >
> > > -VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx);
> > > +VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx,
> > > +                                    VhostIOVATree *iova_map);
> > >
> > >   void vhost_svq_free(VhostShadowVirtqueue *vq);
> > >
> > > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > > index 2fd0bab75d..9db538547e 100644
> > > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > > @@ -11,12 +11,19 @@
> > >   #include "hw/virtio/vhost-shadow-virtqueue.h"
> > >   #include "hw/virtio/vhost.h"
> > >   #include "hw/virtio/virtio-access.h"
> > > +#include "hw/virtio/vhost-iova-tree.h"
> > >
> > >   #include "standard-headers/linux/vhost_types.h"
> > >
> > >   #include "qemu/error-report.h"
> > >   #include "qemu/main-loop.h"
> > >
> > > +typedef struct SVQElement {
> > > +    VirtQueueElement elem;
> > > +    void **in_sg_stash;
> > > +    void **out_sg_stash;
> > > +} SVQElement;
> > > +
> > >   /* Shadow virtqueue to relay notifications */
> > >   typedef struct VhostShadowVirtqueue {
> > >       /* Shadow vring */
> > > @@ -46,8 +53,11 @@ typedef struct VhostShadowVirtqueue {
> > >       /* Virtio device */
> > >       VirtIODevice *vdev;
> > >
> > > +    /* IOVA mapping if used */
> > > +    VhostIOVATree *iova_map;
> > > +
> > >       /* Map for returning guest's descriptors */
> > > -    VirtQueueElement **ring_id_maps;
> > > +    SVQElement **ring_id_maps;
> > >
> > >       /* Next head to expose to device */
> > >       uint16_t avail_idx_shadow;
> > > @@ -79,13 +89,6 @@ bool vhost_svq_valid_device_features(uint64_t *dev_features)
> > >               continue;
> > >
> > >           case VIRTIO_F_ACCESS_PLATFORM:
> > > -            /* SVQ needs this feature disabled. Can't continue */
> > > -            if (*dev_features & BIT_ULL(b)) {
> > > -                clear_bit(b, dev_features);
> > > -                r = false;
> > > -            }
> > > -            break;
> > > -
> > >           case VIRTIO_F_VERSION_1:
> > >               /* SVQ needs this feature, so can't continue */
> > >               if (!(*dev_features & BIT_ULL(b))) {
> > > @@ -126,6 +129,64 @@ static void vhost_svq_set_notification(VhostShadowVirtqueue *svq, bool enable)
> > >       }
> > >   }
> > >
> > > +static void vhost_svq_stash_addr(void ***stash, const struct iovec *iov,
> > > +                                 size_t num)
> > > +{
> > > +    size_t i;
> > > +
> > > +    if (num == 0) {
> > > +        return;
> > > +    }
> > > +
> > > +    *stash = g_new(void *, num);
> > > +    for (i = 0; i < num; ++i) {
> > > +        (*stash)[i] = iov[i].iov_base;
> > > +    }
> > > +}
> > > +
> > > +static void vhost_svq_unstash_addr(void **stash, struct iovec *iov, size_t num)
> > > +{
> > > +    size_t i;
> > > +
> > > +    if (num == 0) {
> > > +        return;
> > > +    }
> > > +
> > > +    for (i = 0; i < num; ++i) {
> > > +        iov[i].iov_base = stash[i];
> > > +    }
> > > +    g_free(stash);
> > > +}
> > > +
> > > +static void vhost_svq_translate_addr(const VhostShadowVirtqueue *svq,
> > > +                                     struct iovec *iovec, size_t num)
> > > +{
> > > +    size_t i;
> > > +
> > > +    for (i = 0; i < num; ++i) {
> > > +        VhostDMAMap needle = {
> > > +            .translated_addr = iovec[i].iov_base,
> > > +            .size = iovec[i].iov_len,
> > > +        };
> > > +        size_t off;
> > > +
> > > +        const VhostDMAMap *map = vhost_iova_tree_find_iova(svq->iova_map,
> > > +                                                           &needle);
> >
> >
> > Is it possible that we end up with more than one maps here?
> >
>
> Actually it is possible, since there is no guarantee that one
> descriptor (or indirect descriptor) maps exactly to one iov. It could
> map to many if qemu vaddr is not contiguous but GPA + size is. This is
> something that must be fixed for the next revision, so thanks for
> pointing it out!
>
> Taking that into account, the condition that svq vring avail_idx -
> used_idx was always less or equal than guest's vring avail_idx -
> used_idx is not true anymore. Checking for that before adding buffers
> to SVQ is the easy part, but how could we recover in that case?
>
> I think that the easy solution is to check for more available buffers
> unconditionally at the end of vhost_svq_handle_call, which handles the
> SVQ used and is supposed to make more room for available buffers. So
> vhost_handle_guest_kick would not check if eventfd is set or not
> anymore.
>
> Would that make sense?
Yes, I think it should work.
Thanks
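
To make the "more than one map" case concrete, here is a minimal sketch,
assuming a toy sorted array instead of the real VhostIOVATree. The types
toy_map and toy_chunk and the function toy_translate are hypothetical,
and lengths are exclusive byte counts (unlike the "size - 1" used by the
patch). It shows one buffer that is contiguous for the guest but spans
two mappings, so it needs two SVQ descriptors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical mapping entry: qemu vaddr range -> IOVA range. */
struct toy_map {
    uint64_t vaddr; /* start of the qemu virtual range */
    uint64_t iova;  /* start of the matching IOVA range */
    uint64_t len;   /* length in bytes */
};

/* One translated piece of a buffer, i.e. one SVQ descriptor. */
struct toy_chunk {
    uint64_t iova;
    uint64_t len;
};

/*
 * Translate [vaddr, vaddr + len) into IOVA chunks. "maps" must be sorted
 * by vaddr and must not overlap. Returns the number of chunks written,
 * or -1 if part of the range is unmapped or "out" is too small.
 */
static int toy_translate(const struct toy_map *maps, size_t nmaps,
                         uint64_t vaddr, uint64_t len,
                         struct toy_chunk *out, size_t max_out)
{
    size_t n = 0;

    assert(out != NULL);
    for (size_t i = 0; i < nmaps && len > 0; i++) {
        const struct toy_map *m = &maps[i];
        uint64_t off, avail;

        if (vaddr < m->vaddr) {
            return -1; /* hole before this mapping */
        }
        if (vaddr >= m->vaddr + m->len) {
            continue;  /* range starts after this mapping */
        }
        if (n == max_out) {
            return -1; /* caller did not leave room for another chunk */
        }
        off = vaddr - m->vaddr;
        avail = m->len - off;
        out[n].iova = m->iova + off;
        out[n].len = len < avail ? len : avail;
        vaddr += out[n].len;
        len -= out[n].len;
        n++;
    }
    return len == 0 ? (int)n : -1;
}
```

With mappings {0x1000 -> 0x8000} and {0x2000 -> 0x20000}, both 0x1000
bytes long, a buffer at 0x1800 of length 0x1000 comes back as two
chunks, which is why the SVQ can run out of descriptors before the
guest's vring does.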
>
> Thanks!
>
> >
> > > +        /*
> > > +         * Map cannot be NULL since iova map contains all guest space and
> > > +         * qemu already has a physical address mapped
> > > +         */
> > > +        assert(map);
> > > +
> > > +        /*
> > > +         * Map->iova chunk size is ignored. What to do if descriptor
> > > +         * (addr, size) does not fit is delegated to the device.
> > > +         */
> > > +        off = needle.translated_addr - map->translated_addr;
> > > +        iovec[i].iov_base = (void *)(map->iova + off);
> > > +    }
> > > +}
> > > +
> > >   static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> > >                                       const struct iovec *iovec,
> > >                                       size_t num, bool more_descs, bool write)
> > > @@ -156,8 +217,9 @@ static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> > >   }
> > >
> > >   static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
> > > -                                    VirtQueueElement *elem)
> > > +                                    SVQElement *svq_elem)
> > >   {
> > > +    VirtQueueElement *elem = &svq_elem->elem;
> > >       int head;
> > >       unsigned avail_idx;
> > >       vring_avail_t *avail = svq->vring.avail;
> > > @@ -167,6 +229,12 @@ static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
> > >       /* We need some descriptors here */
> > >       assert(elem->out_num || elem->in_num);
> > >
> > > +    vhost_svq_stash_addr(&svq_elem->in_sg_stash, elem->in_sg, elem->in_num);
> > > +    vhost_svq_stash_addr(&svq_elem->out_sg_stash, elem->out_sg, elem->out_num);
> >
> >
> > I wonder if we can solve the trick like stash and unstash with a
> > dedicated sgs in svq_elem, instead of reusing the elem.
> >
>
> Actually yes, it would be way simpler to use a new sgs array in
> svq_elem. I will change that.
>
> Thanks!
>
> > Thanks
> >
> >
> > > +
> > > +    vhost_svq_translate_addr(svq, elem->in_sg, elem->in_num);
> > > +    vhost_svq_translate_addr(svq, elem->out_sg, elem->out_num);
> > > +
> > >       vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
> > >                               elem->in_num > 0, false);
> > >       vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
> > > @@ -187,7 +255,7 @@ static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
> > >
> > >   }
> > >
> > > -static void vhost_svq_add(VhostShadowVirtqueue *svq, VirtQueueElement *elem)
> > > +static void vhost_svq_add(VhostShadowVirtqueue *svq, SVQElement *elem)
> > >   {
> > >       unsigned qemu_head = vhost_svq_add_split(svq, elem);
> > >
> > > @@ -221,7 +289,7 @@ static void vhost_handle_guest_kick(EventNotifier *n)
> > >           }
> > >
> > >           while (true) {
> > > -            VirtQueueElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
> > > +            SVQElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
> > >               if (!elem) {
> > >                   break;
> > >               }
> > > @@ -247,7 +315,7 @@ static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
> > >       return svq->used_idx != svq->shadow_used_idx;
> > >   }
> > >
> > > -static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> > > +static SVQElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> > >   {
> > >       vring_desc_t *descs = svq->vring.desc;
> > >       const vring_used_t *used = svq->vring.used;
> > > @@ -279,7 +347,7 @@ static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> > >       descs[used_elem.id].next = svq->free_head;
> > >       svq->free_head = used_elem.id;
> > >
> > > -    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
> > > +    svq->ring_id_maps[used_elem.id]->elem.len = used_elem.len;
> > >       return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
> > >   }
> > >
> > > @@ -296,12 +364,19 @@ static void vhost_svq_handle_call_no_test(EventNotifier *n)
> > >
> > >           vhost_svq_set_notification(svq, false);
> > >           while (true) {
> > > -            g_autofree VirtQueueElement *elem = vhost_svq_get_buf(svq);
> > > -            if (!elem) {
> > > +            g_autofree SVQElement *svq_elem = vhost_svq_get_buf(svq);
> > > +            VirtQueueElement *elem;
> > > +            if (!svq_elem) {
> > >                   break;
> > >               }
> > >
> > >               assert(i < svq->vring.num);
> > > +            elem = &svq_elem->elem;
> > > +
> > > +            vhost_svq_unstash_addr(svq_elem->in_sg_stash, elem->in_sg,
> > > +                                   elem->in_num);
> > > +            vhost_svq_unstash_addr(svq_elem->out_sg_stash, elem->out_sg,
> > > +                                   elem->out_num);
> > >               virtqueue_fill(vq, elem, elem->len, i++);
> > >           }
> > >
> > > @@ -451,14 +526,24 @@ void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> > >       event_notifier_set_handler(&svq->host_notifier, NULL);
> > >
> > >       for (i = 0; i < svq->vring.num; ++i) {
> > > -        g_autofree VirtQueueElement *elem = svq->ring_id_maps[i];
> > > +        g_autofree SVQElement *svq_elem = svq->ring_id_maps[i];
> > > +        VirtQueueElement *elem;
> > > +
> > > +        if (!svq_elem) {
> > > +            continue;
> > > +        }
> > > +
> > > +        elem = &svq_elem->elem;
> > > +        vhost_svq_unstash_addr(svq_elem->in_sg_stash, elem->in_sg,
> > > +                               elem->in_num);
> > > +        vhost_svq_unstash_addr(svq_elem->out_sg_stash, elem->out_sg,
> > > +                               elem->out_num);
> > > +
> > >           /*
> > >            * Although the doc says we must unpop in order, it's ok to unpop
> > >            * everything.
> > >            */
> > > -        if (elem) {
> > > -            virtqueue_unpop(svq->vq, elem, elem->len);
> > > -        }
> > > +        virtqueue_unpop(svq->vq, elem, elem->len);
> > >       }
> > >   }
> > >
> > > @@ -466,7 +551,8 @@ void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> > >    * Creates vhost shadow virtqueue, and instruct vhost device to use the shadow
> > >    * methods and file descriptors.
> > >    */
> > > -VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> > > +VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx,
> > > +                                    VhostIOVATree *iova_map)
> > >   {
> > >       int vq_idx = dev->vq_index + idx;
> > >       unsigned num = virtio_queue_get_num(dev->vdev, vq_idx);
> > > @@ -500,11 +586,13 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> > >       memset(svq->vring.desc, 0, driver_size);
> > >       svq->vring.used = qemu_memalign(qemu_real_host_page_size, device_size);
> > >       memset(svq->vring.used, 0, device_size);
> > > +    svq->iova_map = iova_map;
> > > +
> > >       for (i = 0; i < num - 1; i++) {
> > >           svq->vring.desc[i].next = cpu_to_le16(i + 1);
> > >       }
> > >
> > > -    svq->ring_id_maps = g_new0(VirtQueueElement *, num);
> > > +    svq->ring_id_maps = g_new0(SVQElement *, num);
> > >       event_notifier_set_handler(&svq->call_notifier,
> > >                                  vhost_svq_handle_call);
> > >       return g_steal_pointer(&svq);
> > > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > > index a9c680b487..f5a12fee9d 100644
> > > --- a/hw/virtio/vhost-vdpa.c
> > > +++ b/hw/virtio/vhost-vdpa.c
> > > @@ -176,6 +176,18 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
> > >                                            vaddr, section->readonly);
> > >
> > >       llsize = int128_sub(llend, int128_make64(iova));
> > > +    if (v->shadow_vqs_enabled) {
> > > +        VhostDMAMap mem_region = {
> > > +            .translated_addr = vaddr,
> > > +            .size = int128_get64(llsize) - 1,
> > > +            .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> > > +        };
> > > +
> > > +        int r = vhost_iova_tree_alloc(v->iova_map, &mem_region);
> > > +        assert(r == VHOST_DMA_MAP_OK);
> > > +
> > > +        iova = mem_region.iova;
> > > +    }
> > >
> > >       ret = vhost_vdpa_dma_map(v, iova, int128_get64(llsize),
> > >                                vaddr, section->readonly);
> > > @@ -754,6 +766,23 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
> > >       return true;
> > >   }
> > >
> > > +static int vhost_vdpa_get_iova_range(struct vhost_dev *dev,
> > > +                                     hwaddr *first, hwaddr *last)
> > > +{
> > > +    int ret;
> > > +    struct vhost_vdpa_iova_range range;
> > > +
> > > +    ret = vhost_vdpa_call(dev, VHOST_VDPA_GET_IOVA_RANGE, &range);
> > > +    if (ret != 0) {
> > > +        return ret;
> > > +    }
> > > +
> > > +    *first = range.first;
> > > +    *last = range.last;
> > > +    trace_vhost_vdpa_get_iova_range(dev, *first, *last);
> > > +    return ret;
> > > +}
> > > +
> > >   /**
> > >    * Maps QEMU vaddr memory to device in a suitable way for shadow virtqueue:
> > >    * - It always reference qemu memory address, not guest's memory.
> > > @@ -881,6 +910,7 @@ static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
> > >   static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > >   {
> > >       struct vhost_dev *hdev = v->dev;
> > > +    hwaddr iova_first, iova_last;
> > >       unsigned n;
> > >       int r;
> > >
> > > @@ -894,7 +924,7 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > >           /* Allocate resources */
> > >           assert(v->shadow_vqs->len == 0);
> > >           for (n = 0; n < hdev->nvqs; ++n) {
> > > -            VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n);
> > > +            VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n, v->iova_map);
> > >               if (unlikely(!svq)) {
> > >                   g_ptr_array_set_size(v->shadow_vqs, 0);
> > >                   return 0;
> > > @@ -903,6 +933,8 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > >           }
> > >       }
> > >
> > > +    r = vhost_vdpa_get_iova_range(hdev, &iova_first, &iova_last);
> > > +    assert(r == 0);
> > >       r = vhost_vdpa_vring_pause(hdev);
> > >       assert(r == 0);
> > >
> > > @@ -913,6 +945,12 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > >           }
> > >       }
> > >
> > > +    memory_listener_unregister(&v->listener);
> > > +    if (vhost_vdpa_dma_unmap(v, iova_first,
> > > +                             (iova_last - iova_first) & TARGET_PAGE_MASK)) {
> > > +        error_report("Fail to invalidate device iotlb");
> > > +    }
> > > +
> > >       /* Reset device so it can be configured */
> > >       r = vhost_vdpa_dev_start(hdev, false);
> > >       assert(r == 0);
> > > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > > index 8ed19e9d0c..650e521e35 100644
> > > --- a/hw/virtio/trace-events
> > > +++ b/hw/virtio/trace-events
> > > @@ -52,6 +52,7 @@ vhost_vdpa_set_vring_call(void *dev, unsigned int index, int fd) "dev: %p index:
> > >   vhost_vdpa_get_features(void *dev, uint64_t features) "dev: %p features: 0x%"PRIx64
> > >   vhost_vdpa_set_owner(void *dev) "dev: %p"
> > >   vhost_vdpa_vq_get_addr(void *dev, void *vq, uint64_t desc_user_addr, uint64_t avail_user_addr, uint64_t used_user_addr) "dev: %p vq: %p desc_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" used_user_addr: 0x%"PRIx64
> > > +vhost_vdpa_get_iova_range(void *dev, uint64_t first, uint64_t last) "dev: %p first: 0x%"PRIx64" last: 0x%"PRIx64
> > >
> > >   # virtio.c
> > >   virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
> >
>
* Re: [RFC PATCH v4 20/20] vdpa: Add custom IOTLB translations to SVQ
  2021-10-20  2:02       ` Jason Wang
@ 2021-10-20  2:07         ` Jason Wang
  2021-10-20  6:51           ` Eugenio Perez Martin
  0 siblings, 1 reply; 90+ messages in thread
From: Jason Wang @ 2021-10-20  2:07 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-devel, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Michael Lilja, Stefano Garzarella
On Wed, Oct 20, 2021 at 10:02 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Tue, Oct 19, 2021 at 6:29 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Tue, Oct 19, 2021 at 11:25 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > >
> > > On 2021/10/1 3:06 PM, Eugenio Pérez wrote:
> > > > Use translations added in VhostIOVATree in SVQ.
> > > >
> > > > Now every element needs to store the previous address also, so VirtQueue
> > > > can consume the elements properly. This adds a little overhead per VQ
> > > > element, having to allocate more memory to stash them. As a possible
> > > > optimization, this allocation could be avoided if the descriptor is not
> > > > a chain but a single one, but this is left undone.
> > > >
> > > > TODO: iova range should be queried before, and add logic to fail when
> > > > GPA is outside of its range and memory listener or svq add it.
> > > >
> > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > ---
> > > >   hw/virtio/vhost-shadow-virtqueue.h |   4 +-
> > > >   hw/virtio/vhost-shadow-virtqueue.c | 130 ++++++++++++++++++++++++-----
> > > >   hw/virtio/vhost-vdpa.c             |  40 ++++++++-
> > > >   hw/virtio/trace-events             |   1 +
> > > >   4 files changed, 152 insertions(+), 23 deletions(-)
> > > >
> > > > diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> > > > index b7baa424a7..a0e6b5267a 100644
> > > > --- a/hw/virtio/vhost-shadow-virtqueue.h
> > > > +++ b/hw/virtio/vhost-shadow-virtqueue.h
> > > > @@ -11,6 +11,7 @@
> > > >   #define VHOST_SHADOW_VIRTQUEUE_H
> > > >
> > > >   #include "hw/virtio/vhost.h"
> > > > +#include "hw/virtio/vhost-iova-tree.h"
> > > >
> > > >   typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
> > > >
> > > > @@ -28,7 +29,8 @@ bool vhost_svq_start(struct vhost_dev *dev, unsigned idx,
> > > >   void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> > > >                       VhostShadowVirtqueue *svq);
> > > >
> > > > -VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx);
> > > > +VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx,
> > > > +                                    VhostIOVATree *iova_map);
> > > >
> > > >   void vhost_svq_free(VhostShadowVirtqueue *vq);
> > > >
> > > > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > > > index 2fd0bab75d..9db538547e 100644
> > > > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > > > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > > > @@ -11,12 +11,19 @@
> > > >   #include "hw/virtio/vhost-shadow-virtqueue.h"
> > > >   #include "hw/virtio/vhost.h"
> > > >   #include "hw/virtio/virtio-access.h"
> > > > +#include "hw/virtio/vhost-iova-tree.h"
> > > >
> > > >   #include "standard-headers/linux/vhost_types.h"
> > > >
> > > >   #include "qemu/error-report.h"
> > > >   #include "qemu/main-loop.h"
> > > >
> > > > +typedef struct SVQElement {
> > > > +    VirtQueueElement elem;
> > > > +    void **in_sg_stash;
> > > > +    void **out_sg_stash;
> > > > +} SVQElement;
> > > > +
> > > >   /* Shadow virtqueue to relay notifications */
> > > >   typedef struct VhostShadowVirtqueue {
> > > >       /* Shadow vring */
> > > > @@ -46,8 +53,11 @@ typedef struct VhostShadowVirtqueue {
> > > >       /* Virtio device */
> > > >       VirtIODevice *vdev;
> > > >
> > > > +    /* IOVA mapping if used */
> > > > +    VhostIOVATree *iova_map;
> > > > +
> > > >       /* Map for returning guest's descriptors */
> > > > -    VirtQueueElement **ring_id_maps;
> > > > +    SVQElement **ring_id_maps;
> > > >
> > > >       /* Next head to expose to device */
> > > >       uint16_t avail_idx_shadow;
> > > > @@ -79,13 +89,6 @@ bool vhost_svq_valid_device_features(uint64_t *dev_features)
> > > >               continue;
> > > >
> > > >           case VIRTIO_F_ACCESS_PLATFORM:
> > > > -            /* SVQ needs this feature disabled. Can't continue */
> > > > -            if (*dev_features & BIT_ULL(b)) {
> > > > -                clear_bit(b, dev_features);
> > > > -                r = false;
> > > > -            }
> > > > -            break;
> > > > -
> > > >           case VIRTIO_F_VERSION_1:
> > > >               /* SVQ needs this feature, so can't continue */
> > > >               if (!(*dev_features & BIT_ULL(b))) {
> > > > @@ -126,6 +129,64 @@ static void vhost_svq_set_notification(VhostShadowVirtqueue *svq, bool enable)
> > > >       }
> > > >   }
> > > >
> > > > +static void vhost_svq_stash_addr(void ***stash, const struct iovec *iov,
> > > > +                                 size_t num)
> > > > +{
> > > > +    size_t i;
> > > > +
> > > > +    if (num == 0) {
> > > > +        return;
> > > > +    }
> > > > +
> > > > +    *stash = g_new(void *, num);
> > > > +    for (i = 0; i < num; ++i) {
> > > > +        (*stash)[i] = iov[i].iov_base;
> > > > +    }
> > > > +}
> > > > +
> > > > +static void vhost_svq_unstash_addr(void **stash, struct iovec *iov, size_t num)
> > > > +{
> > > > +    size_t i;
> > > > +
> > > > +    if (num == 0) {
> > > > +        return;
> > > > +    }
> > > > +
> > > > +    for (i = 0; i < num; ++i) {
> > > > +        iov[i].iov_base = stash[i];
> > > > +    }
> > > > +    g_free(stash);
> > > > +}
> > > > +
> > > > +static void vhost_svq_translate_addr(const VhostShadowVirtqueue *svq,
> > > > +                                     struct iovec *iovec, size_t num)
> > > > +{
> > > > +    size_t i;
> > > > +
> > > > +    for (i = 0; i < num; ++i) {
> > > > +        VhostDMAMap needle = {
> > > > +            .translated_addr = iovec[i].iov_base,
> > > > +            .size = iovec[i].iov_len,
> > > > +        };
> > > > +        size_t off;
> > > > +
> > > > +        const VhostDMAMap *map = vhost_iova_tree_find_iova(svq->iova_map,
> > > > +                                                           &needle);
> > >
> > >
> > > Is it possible that we end up with more than one maps here?
> > >
> >
> > Actually it is possible, since there is no guarantee that one
> > descriptor (or indirect descriptor) maps exactly to one iov. It could
> > map to many if qemu vaddr is not contiguous but GPA + size is. This is
> > something that must be fixed for the next revision, so thanks for
> > pointing it out!
> >
> > Taking that into account, the condition that svq vring avail_idx -
> > used_idx was always less or equal than guest's vring avail_idx -
> > used_idx is not true anymore. Checking for that before adding buffers
> > to SVQ is the easy part, but how could we recover in that case?
> >
> > I think that the easy solution is to check for more available buffers
> > unconditionally at the end of vhost_svq_handle_call, which handles the
> > SVQ used and is supposed to make more room for available buffers. So
> > vhost_handle_guest_kick would not check if eventfd is set or not
> > anymore.
> >
> > Would that make sense?
>
> Yes, I think it should work.
Btw, I wonder how to handle indirect descriptors. SVQ doesn't use
indirect descriptors for now, but they look like a must; otherwise the
SVQ may fill up before the guest's VQ does.
It looks to me that an easy way is to always use indirect descriptors
if #sg >= 2?
Thanks
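
A minimal sketch of the slot accounting behind that suggestion, assuming
a toy model where each element has the same number of sg entries. The
helpers toy_slots_needed and toy_elems_that_fit are hypothetical and are
not part of the series.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Ring slots one element consumes: a chain takes one slot per sg entry,
 * while an indirect descriptor takes a single slot that points to a
 * separate table holding all the sg entries.
 */
static unsigned toy_slots_needed(unsigned num_sg, bool indirect_ok)
{
    if (indirect_ok && num_sg >= 2) {
        return 1;
    }
    return num_sg;
}

/* How many of "num_elems" elements fit in a ring of "ring_size" slots. */
static unsigned toy_elems_that_fit(unsigned ring_size, unsigned num_elems,
                                   unsigned sg_per_elem, bool indirect_ok)
{
    unsigned used = 0, fit = 0;

    assert(ring_size > 0);
    for (unsigned i = 0; i < num_elems; i++) {
        unsigned need = toy_slots_needed(sg_per_elem, indirect_ok);

        if (used + need > ring_size) {
            break;
        }
        used += need;
        fit++;
    }
    return fit;
}
```

For a 256-entry ring and 256 guest elements of 4 sg entries each, only
64 elements fit as chains, while all 256 fit once every element with
#sg >= 2 goes through an indirect descriptor.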
>
> Thanks
>
> >
> > Thanks!
> >
> > >
> > > > +        /*
> > > > +         * Map cannot be NULL since iova map contains all guest space and
> > > > +         * qemu already has a physical address mapped
> > > > +         */
> > > > +        assert(map);
> > > > +
> > > > +        /*
> > > > +         * Map->iova chunk size is ignored. What to do if descriptor
> > > > +         * (addr, size) does not fit is delegated to the device.
> > > > +         */
> > > > +        off = needle.translated_addr - map->translated_addr;
> > > > +        iovec[i].iov_base = (void *)(map->iova + off);
> > > > +    }
> > > > +}
> > > > +
> > > >   static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> > > >                                       const struct iovec *iovec,
> > > >                                       size_t num, bool more_descs, bool write)
> > > > @@ -156,8 +217,9 @@ static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> > > >   }
> > > >
> > > >   static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
> > > > -                                    VirtQueueElement *elem)
> > > > +                                    SVQElement *svq_elem)
> > > >   {
> > > > +    VirtQueueElement *elem = &svq_elem->elem;
> > > >       int head;
> > > >       unsigned avail_idx;
> > > >       vring_avail_t *avail = svq->vring.avail;
> > > > @@ -167,6 +229,12 @@ static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
> > > >       /* We need some descriptors here */
> > > >       assert(elem->out_num || elem->in_num);
> > > >
> > > > +    vhost_svq_stash_addr(&svq_elem->in_sg_stash, elem->in_sg, elem->in_num);
> > > > +    vhost_svq_stash_addr(&svq_elem->out_sg_stash, elem->out_sg, elem->out_num);
> > >
> > >
> > > I wonder if we can solve the trick like stash and unstash with a
> > > dedicated sgs in svq_elem, instead of reusing the elem.
> > >
> >
> > Actually yes, it would be way simpler to use a new sgs array in
> > svq_elem. I will change that.
> >
> > Thanks!
> >
> > > Thanks
> > >
> > >
> > > > +
> > > > +    vhost_svq_translate_addr(svq, elem->in_sg, elem->in_num);
> > > > +    vhost_svq_translate_addr(svq, elem->out_sg, elem->out_num);
> > > > +
> > > >       vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
> > > >                               elem->in_num > 0, false);
> > > >       vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
> > > > @@ -187,7 +255,7 @@ static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
> > > >
> > > >   }
> > > >
> > > > -static void vhost_svq_add(VhostShadowVirtqueue *svq, VirtQueueElement *elem)
> > > > +static void vhost_svq_add(VhostShadowVirtqueue *svq, SVQElement *elem)
> > > >   {
> > > >       unsigned qemu_head = vhost_svq_add_split(svq, elem);
> > > >
> > > > @@ -221,7 +289,7 @@ static void vhost_handle_guest_kick(EventNotifier *n)
> > > >           }
> > > >
> > > >           while (true) {
> > > > -            VirtQueueElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
> > > > +            SVQElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
> > > >               if (!elem) {
> > > >                   break;
> > > >               }
> > > > @@ -247,7 +315,7 @@ static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
> > > >       return svq->used_idx != svq->shadow_used_idx;
> > > >   }
> > > >
> > > > -static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> > > > +static SVQElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> > > >   {
> > > >       vring_desc_t *descs = svq->vring.desc;
> > > >       const vring_used_t *used = svq->vring.used;
> > > > @@ -279,7 +347,7 @@ static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> > > >       descs[used_elem.id].next = svq->free_head;
> > > >       svq->free_head = used_elem.id;
> > > >
> > > > -    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
> > > > +    svq->ring_id_maps[used_elem.id]->elem.len = used_elem.len;
> > > >       return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
> > > >   }
> > > >
> > > > @@ -296,12 +364,19 @@ static void vhost_svq_handle_call_no_test(EventNotifier *n)
> > > >
> > > >           vhost_svq_set_notification(svq, false);
> > > >           while (true) {
> > > > -            g_autofree VirtQueueElement *elem = vhost_svq_get_buf(svq);
> > > > -            if (!elem) {
> > > > +            g_autofree SVQElement *svq_elem = vhost_svq_get_buf(svq);
> > > > +            VirtQueueElement *elem;
> > > > +            if (!svq_elem) {
> > > >                   break;
> > > >               }
> > > >
> > > >               assert(i < svq->vring.num);
> > > > +            elem = &svq_elem->elem;
> > > > +
> > > > +            vhost_svq_unstash_addr(svq_elem->in_sg_stash, elem->in_sg,
> > > > +                                   elem->in_num);
> > > > +            vhost_svq_unstash_addr(svq_elem->out_sg_stash, elem->out_sg,
> > > > +                                   elem->out_num);
> > > >               virtqueue_fill(vq, elem, elem->len, i++);
> > > >           }
> > > >
> > > > @@ -451,14 +526,24 @@ void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> > > >       event_notifier_set_handler(&svq->host_notifier, NULL);
> > > >
> > > >       for (i = 0; i < svq->vring.num; ++i) {
> > > > -        g_autofree VirtQueueElement *elem = svq->ring_id_maps[i];
> > > > +        g_autofree SVQElement *svq_elem = svq->ring_id_maps[i];
> > > > +        VirtQueueElement *elem;
> > > > +
> > > > +        if (!svq_elem) {
> > > > +            continue;
> > > > +        }
> > > > +
> > > > +        elem = &svq_elem->elem;
> > > > +        vhost_svq_unstash_addr(svq_elem->in_sg_stash, elem->in_sg,
> > > > +                               elem->in_num);
> > > > +        vhost_svq_unstash_addr(svq_elem->out_sg_stash, elem->out_sg,
> > > > +                               elem->out_num);
> > > > +
> > > >           /*
> > > >            * Although the doc says we must unpop in order, it's ok to unpop
> > > >            * everything.
> > > >            */
> > > > -        if (elem) {
> > > > -            virtqueue_unpop(svq->vq, elem, elem->len);
> > > > -        }
> > > > +        virtqueue_unpop(svq->vq, elem, elem->len);
> > > >       }
> > > >   }
> > > >
> > > > @@ -466,7 +551,8 @@ void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> > > >    * Creates vhost shadow virtqueue, and instruct vhost device to use the shadow
> > > >    * methods and file descriptors.
> > > >    */
> > > > -VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> > > > +VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx,
> > > > +                                    VhostIOVATree *iova_map)
> > > >   {
> > > >       int vq_idx = dev->vq_index + idx;
> > > >       unsigned num = virtio_queue_get_num(dev->vdev, vq_idx);
> > > > @@ -500,11 +586,13 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> > > >       memset(svq->vring.desc, 0, driver_size);
> > > >       svq->vring.used = qemu_memalign(qemu_real_host_page_size, device_size);
> > > >       memset(svq->vring.used, 0, device_size);
> > > > +    svq->iova_map = iova_map;
> > > > +
> > > >       for (i = 0; i < num - 1; i++) {
> > > >           svq->vring.desc[i].next = cpu_to_le16(i + 1);
> > > >       }
> > > >
> > > > -    svq->ring_id_maps = g_new0(VirtQueueElement *, num);
> > > > +    svq->ring_id_maps = g_new0(SVQElement *, num);
> > > >       event_notifier_set_handler(&svq->call_notifier,
> > > >                                  vhost_svq_handle_call);
> > > >       return g_steal_pointer(&svq);
> > > > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > > > index a9c680b487..f5a12fee9d 100644
> > > > --- a/hw/virtio/vhost-vdpa.c
> > > > +++ b/hw/virtio/vhost-vdpa.c
> > > > @@ -176,6 +176,18 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
> > > >                                            vaddr, section->readonly);
> > > >
> > > >       llsize = int128_sub(llend, int128_make64(iova));
> > > > +    if (v->shadow_vqs_enabled) {
> > > > +        VhostDMAMap mem_region = {
> > > > +            .translated_addr = vaddr,
> > > > +            .size = int128_get64(llsize) - 1,
> > > > +            .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> > > > +        };
> > > > +
> > > > +        int r = vhost_iova_tree_alloc(v->iova_map, &mem_region);
> > > > +        assert(r == VHOST_DMA_MAP_OK);
> > > > +
> > > > +        iova = mem_region.iova;
> > > > +    }
> > > >
> > > >       ret = vhost_vdpa_dma_map(v, iova, int128_get64(llsize),
> > > >                                vaddr, section->readonly);
> > > > @@ -754,6 +766,23 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
> > > >       return true;
> > > >   }
> > > >
> > > > +static int vhost_vdpa_get_iova_range(struct vhost_dev *dev,
> > > > +                                     hwaddr *first, hwaddr *last)
> > > > +{
> > > > +    int ret;
> > > > +    struct vhost_vdpa_iova_range range;
> > > > +
> > > > +    ret = vhost_vdpa_call(dev, VHOST_VDPA_GET_IOVA_RANGE, &range);
> > > > +    if (ret != 0) {
> > > > +        return ret;
> > > > +    }
> > > > +
> > > > +    *first = range.first;
> > > > +    *last = range.last;
> > > > +    trace_vhost_vdpa_get_iova_range(dev, *first, *last);
> > > > +    return ret;
> > > > +}
> > > > +
> > > >   /**
> > > >    * Maps QEMU vaddr memory to device in a suitable way for shadow virtqueue:
> > > >    * - It always reference qemu memory address, not guest's memory.
> > > > @@ -881,6 +910,7 @@ static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
> > > >   static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > > >   {
> > > >       struct vhost_dev *hdev = v->dev;
> > > > +    hwaddr iova_first, iova_last;
> > > >       unsigned n;
> > > >       int r;
> > > >
> > > > @@ -894,7 +924,7 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > > >           /* Allocate resources */
> > > >           assert(v->shadow_vqs->len == 0);
> > > >           for (n = 0; n < hdev->nvqs; ++n) {
> > > > -            VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n);
> > > > +            VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n, v->iova_map);
> > > >               if (unlikely(!svq)) {
> > > >                   g_ptr_array_set_size(v->shadow_vqs, 0);
> > > >                   return 0;
> > > > @@ -903,6 +933,8 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > > >           }
> > > >       }
> > > >
> > > > +    r = vhost_vdpa_get_iova_range(hdev, &iova_first, &iova_last);
> > > > +    assert(r == 0);
> > > >       r = vhost_vdpa_vring_pause(hdev);
> > > >       assert(r == 0);
> > > >
> > > > @@ -913,6 +945,12 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > > >           }
> > > >       }
> > > >
> > > > +    memory_listener_unregister(&v->listener);
> > > > +    if (vhost_vdpa_dma_unmap(v, iova_first,
> > > > +                             (iova_last - iova_first) & TARGET_PAGE_MASK)) {
> > > > +        error_report("Fail to invalidate device iotlb");
> > > > +    }
> > > > +
> > > >       /* Reset device so it can be configured */
> > > >       r = vhost_vdpa_dev_start(hdev, false);
> > > >       assert(r == 0);
> > > > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > > > index 8ed19e9d0c..650e521e35 100644
> > > > --- a/hw/virtio/trace-events
> > > > +++ b/hw/virtio/trace-events
> > > > @@ -52,6 +52,7 @@ vhost_vdpa_set_vring_call(void *dev, unsigned int index, int fd) "dev: %p index:
> > > >   vhost_vdpa_get_features(void *dev, uint64_t features) "dev: %p features: 0x%"PRIx64
> > > >   vhost_vdpa_set_owner(void *dev) "dev: %p"
> > > >   vhost_vdpa_vq_get_addr(void *dev, void *vq, uint64_t desc_user_addr, uint64_t avail_user_addr, uint64_t used_user_addr) "dev: %p vq: %p desc_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" used_user_addr: 0x%"PRIx64
> > > > +vhost_vdpa_get_iova_range(void *dev, uint64_t first, uint64_t last) "dev: %p first: 0x%"PRIx64" last: 0x%"PRIx64
> > > >
> > > >   # virtio.c
> > > >   virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
> > >
> >
* Re: [RFC PATCH v4 11/20] vhost: Route host->guest notification through shadow virtqueue
  2021-10-20  2:01           ` Jason Wang
@ 2021-10-20  6:36             ` Eugenio Perez Martin
  0 siblings, 0 replies; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-20  6:36 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-level, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Stefano Garzarella
On Wed, Oct 20, 2021 at 4:01 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Tue, Oct 19, 2021 at 4:40 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Fri, Oct 15, 2021 at 6:42 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > >
> > > On 2021/10/15 at 12:39 AM, Eugenio Perez Martin wrote:
> > > > On Wed, Oct 13, 2021 at 5:47 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >>
> > > >> On 2021/10/1 at 3:05 PM, Eugenio Pérez wrote:
> > > >>> This will make qemu aware of the device used buffers, allowing it to
> > > >>> write the guest memory with its contents if needed.
> > > >>>
> > > >>> Since the use of vhost_virtqueue_start can unmask and discard call
> > > >>> events, vhost_virtqueue_start should be modified in one of these ways:
> > > >>> * Split in two: One of them uses all logic to start a queue with no
> > > >>>     side effects for the guest, and another one that actually assumes that
> > > >>>     the guest has just started the device. Vdpa should use just the
> > > >>>     former.
> > > >>> * Actually store and check if the guest notifier is masked, and do it
> > > >>>     conditionally.
> > > >>> * Leave it as it is, and duplicate all the logic in vhost-vdpa.
> > > >>>
> > > >>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > >>> ---
> > > >>>    hw/virtio/vhost-shadow-virtqueue.c | 19 +++++++++++++++
> > > >>>    hw/virtio/vhost-vdpa.c             | 38 +++++++++++++++++++++++++++++-
> > > >>>    2 files changed, 56 insertions(+), 1 deletion(-)
> > > >>>
> > > >>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > > >>> index 21dc99ab5d..3fe129cf63 100644
> > > >>> --- a/hw/virtio/vhost-shadow-virtqueue.c
> > > >>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > > >>> @@ -53,6 +53,22 @@ static void vhost_handle_guest_kick(EventNotifier *n)
> > > >>>        event_notifier_set(&svq->kick_notifier);
> > > >>>    }
> > > >>>
> > > >>> +/* Forward vhost notifications */
> > > >>> +static void vhost_svq_handle_call_no_test(EventNotifier *n)
> > > >>> +{
> > > >>> +    VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> > > >>> +                                             call_notifier);
> > > >>> +
> > > >>> +    event_notifier_set(&svq->guest_call_notifier);
> > > >>> +}
> > > >>> +
> > > >>> +static void vhost_svq_handle_call(EventNotifier *n)
> > > >>> +{
> > > >>> +    if (likely(event_notifier_test_and_clear(n))) {
> > > >>> +        vhost_svq_handle_call_no_test(n);
> > > >>> +    }
> > > >>> +}
> > > >>> +
> > > >>>    /*
> > > >>>     * Obtain the SVQ call notifier, where vhost device notifies SVQ that there
> > > >>>     * exists pending used buffers.
> > > >>> @@ -180,6 +196,8 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> > > >>>        }
> > > >>>
> > > >>>        svq->vq = virtio_get_queue(dev->vdev, vq_idx);
> > > >>> +    event_notifier_set_handler(&svq->call_notifier,
> > > >>> +                               vhost_svq_handle_call);
> > > >>>        return g_steal_pointer(&svq);
> > > >>>
> > > >>>    err_init_call_notifier:
> > > >>> @@ -195,6 +213,7 @@ err_init_kick_notifier:
> > > >>>    void vhost_svq_free(VhostShadowVirtqueue *vq)
> > > >>>    {
> > > >>>        event_notifier_cleanup(&vq->kick_notifier);
> > > >>> +    event_notifier_set_handler(&vq->call_notifier, NULL);
> > > >>>        event_notifier_cleanup(&vq->call_notifier);
> > > >>>        g_free(vq);
> > > >>>    }
> > > >>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > > >>> index bc34de2439..6c5f4c98b8 100644
> > > >>> --- a/hw/virtio/vhost-vdpa.c
> > > >>> +++ b/hw/virtio/vhost-vdpa.c
> > > >>> @@ -712,13 +712,40 @@ static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
> > > >>>    {
> > > >>>        struct vhost_vdpa *v = dev->opaque;
> > > >>>        VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, idx);
> > > >>> -    return vhost_svq_start(dev, idx, svq);
> > > >>> +    EventNotifier *vhost_call_notifier = vhost_svq_get_svq_call_notifier(svq);
> > > >>> +    struct vhost_vring_file vhost_call_file = {
> > > >>> +        .index = idx + dev->vq_index,
> > > >>> +        .fd = event_notifier_get_fd(vhost_call_notifier),
> > > >>> +    };
> > > >>> +    int r;
> > > >>> +    bool b;
> > > >>> +
> > > >>> +    /* Set shadow vq -> guest notifier */
> > > >>> +    assert(v->call_fd[idx]);
> > > >>
> > > >> We need to avoid the assert() here. In which case can we hit this?
> > > >>
> > > > I would say that there is no way we can actually hit it, so let's remove it.
> > > >
> > > >>> +    vhost_svq_set_guest_call_notifier(svq, v->call_fd[idx]);
> > > >>> +
> > > >>> +    b = vhost_svq_start(dev, idx, svq);
> > > >>> +    if (unlikely(!b)) {
> > > >>> +        return false;
> > > >>> +    }
> > > >>> +
> > > >>> +    /* Set device -> SVQ notifier */
> > > >>> +    r = vhost_vdpa_set_vring_dev_call(dev, &vhost_call_file);
> > > >>> +    if (unlikely(r)) {
> > > >>> +        error_report("vhost_vdpa_set_vring_call for shadow vq failed");
> > > >>> +        return false;
> > > >>> +    }
> > > >>
> > > >> Similar to kick, do we need to set_vring_call() before vhost_svq_start()?
> > > >>
> > > > It should not matter at this moment because the device should not be
> > > > started at this point and device calls should not run
> > > > vhost_svq_handle_call until BQL is released.
> > >
> > >
> > > Yes, we stop virtqueue before.
> > >
> > >
> > > >
> > > > The "logic" of doing it after is to make clear that svq must be fully
> > > > initialized before processing device calls, even in the case that we
> > > > extract SVQ in its own iothread or similar. But this could be done
> > > > before vhost_svq_start for sure.
> > > >
> > > >>> +
> > > >>> +    /* Check for pending calls */
> > > >>> +    event_notifier_set(vhost_call_notifier);
> > > >>
> > > >> Interesting, can this result in a spurious interrupt?
> > > >>
> > > > This actually "queues" a vhost_svq_handle_call after the BQL release,
> > > > where the device should be fully reset. In that regard, if there are
> > > > no used descriptors there will not be an irq raised to the guest. Does
> > > > that answer the question? Or have I missed something?
> > >
> > >
> > > Yes, please explain this in the comment.
> > >
> >
> > I'm reviewing this again, and actually I think I was wrong in solving the issue.
> >
> > Since at this point the device is being configured, there is no chance
> > that we had a missing call notification here: A previous kick is
> > needed for the device to generate any calls, and these cannot be
> > processed.
> >
> > What is not solved in this series is that we could have pending used
> > buffers in the vdpa device when stopping SVQ, but queuing a check for
> > that is not going to solve anything, since the SVQ vring would already
> > be destroyed:
> >
> > * vdpa device marks N > 0 buffers as used, and calls.
> > * Before processing them, SVQ stop is called. SVQ has not processed
> > these, and cleans them, making this event_notifier_set useless.
> >
> > So this would require a few changes. Mainly, instead of queueing a
> > check for used buffers, these need to be checked before SVQ cleaning.
> > After that, obtain the VQ state (it is not obtained at stop at the
> > moment, which trusts the guest's used idx) and run a last
> > vhost_svq_handle_call_no_test while the device is paused.
>
> It looks to me that what's really important is that SVQ needs to
> drain/forward used buffers after vdpa is stopped. Then we should be
> fine.
>
Right. I think I picked the wrong place to raise the concern, but the
next revision will include the drain of the pending buffers.
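
To make the intended ordering concrete, here is a toy model of it. This
is only a sketch and not QEMU code: ToySvq, toy_svq_drain and
toy_svq_stop are hypothetical names, and the "ring" is reduced to a few
counters. It shows the device being paused first, the used buffers being
forwarded while the shadow vring still exists, and cleanup coming last.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the SVQ state; not the QEMU structure. */
typedef struct ToySvq {
    uint16_t device_used_idx; /* advanced by the device as it uses buffers */
    uint16_t last_used_idx;   /* last used index SVQ has forwarded */
    uint16_t guest_used_idx;  /* what the guest sees in its own used ring */
    bool device_running;
    bool ring_alive;          /* false once the shadow vring is torn down */
} ToySvq;

/* Forward every buffer the device marked as used; returns how many. */
static unsigned toy_svq_drain(ToySvq *svq)
{
    unsigned n = 0;

    /* Draining only makes sense while the shadow vring still exists. */
    assert(svq->ring_alive);
    while (svq->last_used_idx != svq->device_used_idx) {
        svq->last_used_idx++;
        svq->guest_used_idx++;
        n++;
    }
    return n;
}

/* Stop sequence: pause the device, drain, and only then destroy the ring. */
static unsigned toy_svq_stop(ToySvq *svq)
{
    unsigned forwarded;

    svq->device_running = false;    /* no more used buffers after this */
    forwarded = toy_svq_drain(svq); /* last pass while the ring is valid */
    svq->ring_alive = false;        /* cleanup comes last */
    return forwarded;
}
```

With two buffers used by the device but not yet forwarded,
toy_svq_stop() forwards both before the ring goes away, which is the
property the queued event_notifier_set could not give us.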
Thanks!
> >
> > Thanks!
> >
> > >
> > > >
> > > >>> +    return true;
> > > >>>    }
> > > >>>
> > > >>>    static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > > >>>    {
> > > >>>        struct vhost_dev *hdev = v->dev;
> > > >>>        unsigned n;
> > > >>> +    int r;
> > > >>>
> > > >>>        if (enable == v->shadow_vqs_enabled) {
> > > >>>            return hdev->nvqs;
> > > >>> @@ -752,9 +779,18 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > > >>>        if (!enable) {
> > > >>>            /* Disable all queues or clean up failed start */
> > > >>>            for (n = 0; n < v->shadow_vqs->len; ++n) {
> > > >>> +            struct vhost_vring_file file = {
> > > >>> +                .index = vhost_vdpa_get_vq_index(hdev, n),
> > > >>> +                .fd = v->call_fd[n],
> > > >>> +            };
> > > >>> +
> > > >>> +            r = vhost_vdpa_set_vring_call(hdev, &file);
> > > >>> +            assert(r == 0);
> > > >>> +
> > > >>>                unsigned vq_idx = vhost_vdpa_get_vq_index(hdev, n);
> > > >>>                VhostShadowVirtqueue *svq = g_ptr_array_index(v->shadow_vqs, n);
> > > >>>                vhost_svq_stop(hdev, n, svq);
> > > >>> +            /* TODO: This can unmask or override call fd! */
> > > >>
> > > >> I don't get this comment. Does this mean the current code can't work
> > > >> with mask_notifiers? If yes, this is something we need to fix.
> > > >>
> > > > Yes, but it will be addressed in the next series. I should have
> > > > explained it better here, sorry :).
> > >
> > >
> > > Ok.
> > >
> > > Thanks
> > >
> > >
> > > >
> > > > Thanks!
> > > >
> > > >> Thanks
> > > >>
> > > >>
> > > >>>                vhost_virtqueue_start(hdev, hdev->vdev, &hdev->vqs[n], vq_idx);
> > > >>>            }
> > > >>>
> > >
> >
>
* Re: [RFC PATCH v4 20/20] vdpa: Add custom IOTLB translations to SVQ
  2021-10-20  2:07         ` Jason Wang
@ 2021-10-20  6:51           ` Eugenio Perez Martin
  2021-10-20  9:03             ` Jason Wang
  0 siblings, 1 reply; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-20  6:51 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-level, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Michael Lilja, Stefano Garzarella
On Wed, Oct 20, 2021 at 4:07 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Oct 20, 2021 at 10:02 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Tue, Oct 19, 2021 at 6:29 PM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Tue, Oct 19, 2021 at 11:25 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > >
> > > > On 2021/10/1 at 3:06 PM, Eugenio Pérez wrote:
> > > > > Use translations added in VhostIOVATree in SVQ.
> > > > >
> > > > > Now every element needs to store the previous address also, so VirtQueue
> > > > > can consume the elements properly. This adds a little overhead per VQ
> > > > > element, having to allocate more memory to stash them. As a possible
> > > > > optimization, this allocation could be avoided if the descriptor is not
> > > > > a chain but a single one, but this is left undone.
> > > > >
> > > > > TODO: iova range should be queried beforehand, and logic added to fail
> > > > > when a GPA outside of that range is added by the memory listener or SVQ.
> > > > >
> > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > > ---
> > > > >   hw/virtio/vhost-shadow-virtqueue.h |   4 +-
> > > > >   hw/virtio/vhost-shadow-virtqueue.c | 130 ++++++++++++++++++++++++-----
> > > > >   hw/virtio/vhost-vdpa.c             |  40 ++++++++-
> > > > >   hw/virtio/trace-events             |   1 +
> > > > >   4 files changed, 152 insertions(+), 23 deletions(-)
> > > > >
> > > > > diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> > > > > index b7baa424a7..a0e6b5267a 100644
> > > > > --- a/hw/virtio/vhost-shadow-virtqueue.h
> > > > > +++ b/hw/virtio/vhost-shadow-virtqueue.h
> > > > > @@ -11,6 +11,7 @@
> > > > >   #define VHOST_SHADOW_VIRTQUEUE_H
> > > > >
> > > > >   #include "hw/virtio/vhost.h"
> > > > > +#include "hw/virtio/vhost-iova-tree.h"
> > > > >
> > > > >   typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
> > > > >
> > > > > @@ -28,7 +29,8 @@ bool vhost_svq_start(struct vhost_dev *dev, unsigned idx,
> > > > >   void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> > > > >                       VhostShadowVirtqueue *svq);
> > > > >
> > > > > -VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx);
> > > > > +VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx,
> > > > > +                                    VhostIOVATree *iova_map);
> > > > >
> > > > >   void vhost_svq_free(VhostShadowVirtqueue *vq);
> > > > >
> > > > > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > > > > index 2fd0bab75d..9db538547e 100644
> > > > > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > > > > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > > > > @@ -11,12 +11,19 @@
> > > > >   #include "hw/virtio/vhost-shadow-virtqueue.h"
> > > > >   #include "hw/virtio/vhost.h"
> > > > >   #include "hw/virtio/virtio-access.h"
> > > > > +#include "hw/virtio/vhost-iova-tree.h"
> > > > >
> > > > >   #include "standard-headers/linux/vhost_types.h"
> > > > >
> > > > >   #include "qemu/error-report.h"
> > > > >   #include "qemu/main-loop.h"
> > > > >
> > > > > +typedef struct SVQElement {
> > > > > +    VirtQueueElement elem;
> > > > > +    void **in_sg_stash;
> > > > > +    void **out_sg_stash;
> > > > > +} SVQElement;
> > > > > +
> > > > >   /* Shadow virtqueue to relay notifications */
> > > > >   typedef struct VhostShadowVirtqueue {
> > > > >       /* Shadow vring */
> > > > > @@ -46,8 +53,11 @@ typedef struct VhostShadowVirtqueue {
> > > > >       /* Virtio device */
> > > > >       VirtIODevice *vdev;
> > > > >
> > > > > +    /* IOVA mapping if used */
> > > > > +    VhostIOVATree *iova_map;
> > > > > +
> > > > >       /* Map for returning guest's descriptors */
> > > > > -    VirtQueueElement **ring_id_maps;
> > > > > +    SVQElement **ring_id_maps;
> > > > >
> > > > >       /* Next head to expose to device */
> > > > >       uint16_t avail_idx_shadow;
> > > > > @@ -79,13 +89,6 @@ bool vhost_svq_valid_device_features(uint64_t *dev_features)
> > > > >               continue;
> > > > >
> > > > >           case VIRTIO_F_ACCESS_PLATFORM:
> > > > > -            /* SVQ needs this feature disabled. Can't continue */
> > > > > -            if (*dev_features & BIT_ULL(b)) {
> > > > > -                clear_bit(b, dev_features);
> > > > > -                r = false;
> > > > > -            }
> > > > > -            break;
> > > > > -
> > > > >           case VIRTIO_F_VERSION_1:
> > > > >               /* SVQ needs this feature, so can't continue */
> > > > >               if (!(*dev_features & BIT_ULL(b))) {
> > > > > @@ -126,6 +129,64 @@ static void vhost_svq_set_notification(VhostShadowVirtqueue *svq, bool enable)
> > > > >       }
> > > > >   }
> > > > >
> > > > > +static void vhost_svq_stash_addr(void ***stash, const struct iovec *iov,
> > > > > +                                 size_t num)
> > > > > +{
> > > > > +    size_t i;
> > > > > +
> > > > > +    if (num == 0) {
> > > > > +        return;
> > > > > +    }
> > > > > +
> > > > > +    *stash = g_new(void *, num);
> > > > > +    for (i = 0; i < num; ++i) {
> > > > > +        (*stash)[i] = iov[i].iov_base;
> > > > > +    }
> > > > > +}
> > > > > +
> > > > > +static void vhost_svq_unstash_addr(void **stash, struct iovec *iov, size_t num)
> > > > > +{
> > > > > +    size_t i;
> > > > > +
> > > > > +    if (num == 0) {
> > > > > +        return;
> > > > > +    }
> > > > > +
> > > > > +    for (i = 0; i < num; ++i) {
> > > > > +        iov[i].iov_base = stash[i];
> > > > > +    }
> > > > > +    g_free(stash);
> > > > > +}
> > > > > +
> > > > > +static void vhost_svq_translate_addr(const VhostShadowVirtqueue *svq,
> > > > > +                                     struct iovec *iovec, size_t num)
> > > > > +{
> > > > > +    size_t i;
> > > > > +
> > > > > +    for (i = 0; i < num; ++i) {
> > > > > +        VhostDMAMap needle = {
> > > > > +            .translated_addr = iovec[i].iov_base,
> > > > > +            .size = iovec[i].iov_len,
> > > > > +        };
> > > > > +        size_t off;
> > > > > +
> > > > > +        const VhostDMAMap *map = vhost_iova_tree_find_iova(svq->iova_map,
> > > > > +                                                           &needle);
> > > >
> > > >
> > > > Is it possible that we end up with more than one map here?
> > > >
> > >
> > > Actually it is possible, since there is no guarantee that one
> > > descriptor (or indirect descriptor) maps exactly to one iov. It could
> > > map to many if qemu vaddr is not contiguous but GPA + size is. This is
> > > something that must be fixed for the next revision, so thanks for
> > > pointing it out!
> > >
> > > Taking that into account, the condition that svq vring avail_idx -
> > > used_idx was always less than or equal to guest's vring avail_idx -
> > > used_idx is not true anymore. Checking for that before adding buffers
> > > to SVQ is the easy part, but how could we recover in that case?
> > >
> > > I think that the easy solution is to check for more available buffers
> > > unconditionally at the end of vhost_svq_handle_call, which handles the
> > > SVQ used and is supposed to make more room for available buffers. So
> > > vhost_handle_guest_kick would not check if eventfd is set or not
> > > anymore.
> > >
> > > Would that make sense?
> >
> > Yes, I think it should work.
>
> Btw, I wonder how to handle indirect descriptors. SVQ doesn't use
> indirect descriptors for now, but it looks like a must, otherwise we
> may end up with SVQ full before VQ.
>
We can get to that situation without indirect descriptors too, if a
single descriptor maps to more than one sg buffer. The next revision is
going to handle that too.
> It looks to me like an easy way is to always use indirect descriptors if #sg >= 2?
>
I will use that, but that does not solve the case where a descriptor
maps to > 1 different buffers in qemu vaddr. So I think that some
check after marking descriptors as used is a must somehow.
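
As an illustration of why one guest descriptor can need several SVQ
descriptors, here is a small self-contained sketch. It is not the
VhostIOVATree API: ToyRegion, ToyChunk, toy_find and toy_split are
hypothetical names, and the lookup is a linear search over an array
instead of a tree. It splits a guest (addr, len) range into one chunk
per region crossed, so the caller can compare the chunk count with the
free SVQ descriptors before adding anything.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical region: guest range [gpa, gpa + size) is mapped at iova. */
typedef struct ToyRegion {
    uint64_t gpa;
    uint64_t size;
    uint64_t iova;
} ToyRegion;

/* One contiguous piece as the device would see it. */
typedef struct ToyChunk {
    uint64_t iova;
    uint64_t len;
} ToyChunk;

/* Linear search for the region containing gpa; NULL if unmapped. */
static const ToyRegion *toy_find(const ToyRegion *regions, size_t n_regions,
                                 uint64_t gpa)
{
    for (size_t i = 0; i < n_regions; i++) {
        if (gpa >= regions[i].gpa && gpa - regions[i].gpa < regions[i].size) {
            return &regions[i];
        }
    }
    return NULL;
}

/*
 * Split [gpa, gpa + len) into chunks, one per region crossed.
 * Returns the number of chunks written to out, or -1 if part of the range
 * is unmapped or more than max_chunks would be needed.
 */
static int toy_split(const ToyRegion *regions, size_t n_regions,
                     uint64_t gpa, uint64_t len,
                     ToyChunk *out, size_t max_chunks)
{
    size_t n = 0;

    while (len > 0) {
        const ToyRegion *r = toy_find(regions, n_regions, gpa);
        uint64_t off, avail, take;

        if (!r || n == max_chunks) {
            return -1;
        }
        off = gpa - r->gpa;
        avail = r->size - off;            /* bytes left in this region */
        take = len < avail ? len : avail;
        out[n].iova = r->iova + off;
        out[n].len = take;
        n++;
        gpa += take;
        len -= take;
    }
    return (int)n;
}
```

A range that straddles two regions comes back as two chunks even though
the guest used a single descriptor for it, which is exactly the case
where SVQ can fill up before the guest's VQ does.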
> Thanks
>
> >
> > Thanks
> >
> > >
> > > Thanks!
> > >
> > > >
> > > > > +        /*
> > > > > +         * Map cannot be NULL since iova map contains all guest space and
> > > > > +         * qemu already has a physical address mapped
> > > > > +         */
> > > > > +        assert(map);
> > > > > +
> > > > > +        /*
> > > > > +         * Map->iova chunk size is ignored. What to do if descriptor
> > > > > +         * (addr, size) does not fit is delegated to the device.
> > > > > +         */
> > > > > +        off = needle.translated_addr - map->translated_addr;
> > > > > +        iovec[i].iov_base = (void *)(map->iova + off);
> > > > > +    }
> > > > > +}
> > > > > +
> > > > >   static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> > > > >                                       const struct iovec *iovec,
> > > > >                                       size_t num, bool more_descs, bool write)
> > > > > @@ -156,8 +217,9 @@ static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> > > > >   }
> > > > >
> > > > >   static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
> > > > > -                                    VirtQueueElement *elem)
> > > > > +                                    SVQElement *svq_elem)
> > > > >   {
> > > > > +    VirtQueueElement *elem = &svq_elem->elem;
> > > > >       int head;
> > > > >       unsigned avail_idx;
> > > > >       vring_avail_t *avail = svq->vring.avail;
> > > > > @@ -167,6 +229,12 @@ static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
> > > > >       /* We need some descriptors here */
> > > > >       assert(elem->out_num || elem->in_num);
> > > > >
> > > > > +    vhost_svq_stash_addr(&svq_elem->in_sg_stash, elem->in_sg, elem->in_num);
> > > > > +    vhost_svq_stash_addr(&svq_elem->out_sg_stash, elem->out_sg, elem->out_num);
> > > >
> > > >
> > > > I wonder if we can avoid tricks like stash and unstash by using a
> > > > dedicated sgs array in svq_elem, instead of reusing the elem.
> > > >
> > >
> > > Actually yes, it would be way simpler to use a new sgs array in
> > > svq_elem. I will change that.
> > >
> > > Thanks!
> > >
> > > > Thanks
> > > >
> > > >
> > > > > +
> > > > > +    vhost_svq_translate_addr(svq, elem->in_sg, elem->in_num);
> > > > > +    vhost_svq_translate_addr(svq, elem->out_sg, elem->out_num);
> > > > > +
> > > > >       vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
> > > > >                               elem->in_num > 0, false);
> > > > >       vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
> > > > > @@ -187,7 +255,7 @@ static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
> > > > >
> > > > >   }
> > > > >
> > > > > -static void vhost_svq_add(VhostShadowVirtqueue *svq, VirtQueueElement *elem)
> > > > > +static void vhost_svq_add(VhostShadowVirtqueue *svq, SVQElement *elem)
> > > > >   {
> > > > >       unsigned qemu_head = vhost_svq_add_split(svq, elem);
> > > > >
> > > > > @@ -221,7 +289,7 @@ static void vhost_handle_guest_kick(EventNotifier *n)
> > > > >           }
> > > > >
> > > > >           while (true) {
> > > > > -            VirtQueueElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
> > > > > +            SVQElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
> > > > >               if (!elem) {
> > > > >                   break;
> > > > >               }
> > > > > @@ -247,7 +315,7 @@ static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
> > > > >       return svq->used_idx != svq->shadow_used_idx;
> > > > >   }
> > > > >
> > > > > -static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> > > > > +static SVQElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> > > > >   {
> > > > >       vring_desc_t *descs = svq->vring.desc;
> > > > >       const vring_used_t *used = svq->vring.used;
> > > > > @@ -279,7 +347,7 @@ static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> > > > >       descs[used_elem.id].next = svq->free_head;
> > > > >       svq->free_head = used_elem.id;
> > > > >
> > > > > -    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
> > > > > +    svq->ring_id_maps[used_elem.id]->elem.len = used_elem.len;
> > > > >       return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
> > > > >   }
> > > > >
> > > > > @@ -296,12 +364,19 @@ static void vhost_svq_handle_call_no_test(EventNotifier *n)
> > > > >
> > > > >           vhost_svq_set_notification(svq, false);
> > > > >           while (true) {
> > > > > -            g_autofree VirtQueueElement *elem = vhost_svq_get_buf(svq);
> > > > > -            if (!elem) {
> > > > > +            g_autofree SVQElement *svq_elem = vhost_svq_get_buf(svq);
> > > > > +            VirtQueueElement *elem;
> > > > > +            if (!svq_elem) {
> > > > >                   break;
> > > > >               }
> > > > >
> > > > >               assert(i < svq->vring.num);
> > > > > +            elem = &svq_elem->elem;
> > > > > +
> > > > > +            vhost_svq_unstash_addr(svq_elem->in_sg_stash, elem->in_sg,
> > > > > +                                   elem->in_num);
> > > > > +            vhost_svq_unstash_addr(svq_elem->out_sg_stash, elem->out_sg,
> > > > > +                                   elem->out_num);
> > > > >               virtqueue_fill(vq, elem, elem->len, i++);
> > > > >           }
> > > > >
> > > > > @@ -451,14 +526,24 @@ void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> > > > >       event_notifier_set_handler(&svq->host_notifier, NULL);
> > > > >
> > > > >       for (i = 0; i < svq->vring.num; ++i) {
> > > > > -        g_autofree VirtQueueElement *elem = svq->ring_id_maps[i];
> > > > > +        g_autofree SVQElement *svq_elem = svq->ring_id_maps[i];
> > > > > +        VirtQueueElement *elem;
> > > > > +
> > > > > +        if (!svq_elem) {
> > > > > +            continue;
> > > > > +        }
> > > > > +
> > > > > +        elem = &svq_elem->elem;
> > > > > +        vhost_svq_unstash_addr(svq_elem->in_sg_stash, elem->in_sg,
> > > > > +                               elem->in_num);
> > > > > +        vhost_svq_unstash_addr(svq_elem->out_sg_stash, elem->out_sg,
> > > > > +                               elem->out_num);
> > > > > +
> > > > >           /*
> > > > >            * Although the doc says we must unpop in order, it's ok to unpop
> > > > >            * everything.
> > > > >            */
> > > > > -        if (elem) {
> > > > > -            virtqueue_unpop(svq->vq, elem, elem->len);
> > > > > -        }
> > > > > +        virtqueue_unpop(svq->vq, elem, elem->len);
> > > > >       }
> > > > >   }
> > > > >
> > > > > @@ -466,7 +551,8 @@ void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> > > > >    * Creates vhost shadow virtqueue, and instruct vhost device to use the shadow
> > > > >    * methods and file descriptors.
> > > > >    */
> > > > > -VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> > > > > +VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx,
> > > > > +                                    VhostIOVATree *iova_map)
> > > > >   {
> > > > >       int vq_idx = dev->vq_index + idx;
> > > > >       unsigned num = virtio_queue_get_num(dev->vdev, vq_idx);
> > > > > @@ -500,11 +586,13 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> > > > >       memset(svq->vring.desc, 0, driver_size);
> > > > >       svq->vring.used = qemu_memalign(qemu_real_host_page_size, device_size);
> > > > >       memset(svq->vring.used, 0, device_size);
> > > > > +    svq->iova_map = iova_map;
> > > > > +
> > > > >       for (i = 0; i < num - 1; i++) {
> > > > >           svq->vring.desc[i].next = cpu_to_le16(i + 1);
> > > > >       }
> > > > >
> > > > > -    svq->ring_id_maps = g_new0(VirtQueueElement *, num);
> > > > > +    svq->ring_id_maps = g_new0(SVQElement *, num);
> > > > >       event_notifier_set_handler(&svq->call_notifier,
> > > > >                                  vhost_svq_handle_call);
> > > > >       return g_steal_pointer(&svq);
> > > > > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > > > > index a9c680b487..f5a12fee9d 100644
> > > > > --- a/hw/virtio/vhost-vdpa.c
> > > > > +++ b/hw/virtio/vhost-vdpa.c
> > > > > @@ -176,6 +176,18 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
> > > > >                                            vaddr, section->readonly);
> > > > >
> > > > >       llsize = int128_sub(llend, int128_make64(iova));
> > > > > +    if (v->shadow_vqs_enabled) {
> > > > > +        VhostDMAMap mem_region = {
> > > > > +            .translated_addr = vaddr,
> > > > > +            .size = int128_get64(llsize) - 1,
> > > > > +            .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> > > > > +        };
> > > > > +
> > > > > +        int r = vhost_iova_tree_alloc(v->iova_map, &mem_region);
> > > > > +        assert(r == VHOST_DMA_MAP_OK);
> > > > > +
> > > > > +        iova = mem_region.iova;
> > > > > +    }
> > > > >
> > > > >       ret = vhost_vdpa_dma_map(v, iova, int128_get64(llsize),
> > > > >                                vaddr, section->readonly);
> > > > > @@ -754,6 +766,23 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
> > > > >       return true;
> > > > >   }
> > > > >
> > > > > +static int vhost_vdpa_get_iova_range(struct vhost_dev *dev,
> > > > > +                                     hwaddr *first, hwaddr *last)
> > > > > +{
> > > > > +    int ret;
> > > > > +    struct vhost_vdpa_iova_range range;
> > > > > +
> > > > > +    ret = vhost_vdpa_call(dev, VHOST_VDPA_GET_IOVA_RANGE, &range);
> > > > > +    if (ret != 0) {
> > > > > +        return ret;
> > > > > +    }
> > > > > +
> > > > > +    *first = range.first;
> > > > > +    *last = range.last;
> > > > > +    trace_vhost_vdpa_get_iova_range(dev, *first, *last);
> > > > > +    return ret;
> > > > > +}
> > > > > +
> > > > >   /**
> > > > >    * Maps QEMU vaddr memory to device in a suitable way for shadow virtqueue:
> > > > >    * - It always reference qemu memory address, not guest's memory.
> > > > > @@ -881,6 +910,7 @@ static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
> > > > >   static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > > > >   {
> > > > >       struct vhost_dev *hdev = v->dev;
> > > > > +    hwaddr iova_first, iova_last;
> > > > >       unsigned n;
> > > > >       int r;
> > > > >
> > > > > @@ -894,7 +924,7 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > > > >           /* Allocate resources */
> > > > >           assert(v->shadow_vqs->len == 0);
> > > > >           for (n = 0; n < hdev->nvqs; ++n) {
> > > > > -            VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n);
> > > > > +            VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n, v->iova_map);
> > > > >               if (unlikely(!svq)) {
> > > > >                   g_ptr_array_set_size(v->shadow_vqs, 0);
> > > > >                   return 0;
> > > > > @@ -903,6 +933,8 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > > > >           }
> > > > >       }
> > > > >
> > > > > +    r = vhost_vdpa_get_iova_range(hdev, &iova_first, &iova_last);
> > > > > +    assert(r == 0);
> > > > >       r = vhost_vdpa_vring_pause(hdev);
> > > > >       assert(r == 0);
> > > > >
> > > > > @@ -913,6 +945,12 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > > > >           }
> > > > >       }
> > > > >
> > > > > +    memory_listener_unregister(&v->listener);
> > > > > +    if (vhost_vdpa_dma_unmap(v, iova_first,
> > > > > +                             (iova_last - iova_first) & TARGET_PAGE_MASK)) {
> > > > > +        error_report("Fail to invalidate device iotlb");
> > > > > +    }
> > > > > +
> > > > >       /* Reset device so it can be configured */
> > > > >       r = vhost_vdpa_dev_start(hdev, false);
> > > > >       assert(r == 0);
> > > > > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > > > > index 8ed19e9d0c..650e521e35 100644
> > > > > --- a/hw/virtio/trace-events
> > > > > +++ b/hw/virtio/trace-events
> > > > > @@ -52,6 +52,7 @@ vhost_vdpa_set_vring_call(void *dev, unsigned int index, int fd) "dev: %p index:
> > > > >   vhost_vdpa_get_features(void *dev, uint64_t features) "dev: %p features: 0x%"PRIx64
> > > > >   vhost_vdpa_set_owner(void *dev) "dev: %p"
> > > > >   vhost_vdpa_vq_get_addr(void *dev, void *vq, uint64_t desc_user_addr, uint64_t avail_user_addr, uint64_t used_user_addr) "dev: %p vq: %p desc_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" used_user_addr: 0x%"PRIx64
> > > > > +vhost_vdpa_get_iova_range(void *dev, uint64_t first, uint64_t last) "dev: %p first: 0x%"PRIx64" last: 0x%"PRIx64
> > > > >
> > > > >   # virtio.c
> > > > >   virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
> > > >
> > >
>
* Re: [RFC PATCH v4 18/20] vhost: Add VhostIOVATree
  2021-10-19  8:32   ` Jason Wang
  2021-10-19  9:22     ` Jason Wang
@ 2021-10-20  7:36     ` Eugenio Perez Martin
  1 sibling, 0 replies; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-20  7:36 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-level, Peter Xu, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Stefano Garzarella
On Tue, Oct 19, 2021 at 10:32 AM Jason Wang <jasowang@redhat.com> wrote:
>
>
> On 2021/10/1 at 3:06 PM, Eugenio Pérez wrote:
> > This tree is able to look for a translated address from an IOVA address.
> >
> > At first glance it is similar to util/iova-tree. However, SVQ working on
> > devices with limited IOVA space needs more capabilities, like allocating
> > IOVA chunks or performing reverse translations (qemu addresses to iova).
>
>
> I don't see reverse translation being used anywhere in the shadow code.
> Did I miss anything?
>
>
> >
> > The allocation capability, as in "assign a free IOVA address to this chunk
> > of memory in qemu's address space", allows shadow virtqueue to create a
> > new address space that is not restricted by the guest's addressable one,
> > so we can allocate the shadow vq vrings outside of the guest's
> > reachability, and outside of qemu's. At the moment, the allocation only
> > grows; deletion is not allowed.
> >
> > A different name could be used, but "ordered searchable array" is a
> > little bit long.
> >
> > It duplicates the array so it can search efficiently in both directions,
> > and it will signal an overlap if the iova or the translated address is
> > already present in the corresponding array.
> >
> > Use of array will be changed to util-iova-tree in future series.
>
>
> Adding Peter.
>
Thanks, I forgot to CC him!
> It looks to me like the only thing missing is the iova allocator, and I
> think it's better to decouple the allocator from the iova tree.
>
> Then we would have:
>
> 1) initialize iova range
> 2) iova = iova_alloc(size)
> 3) build the iova tree map
> 4) buffer forwarding
> 5) iova_free(size)
>
The next series, which I will send once I have addressed all the
comments, works that way, except that the allocation is done inside the
iova tree, not outside of it. The reasons are below.
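To make the allocator steps listed above concrete, here is a minimal
sketch of a first-fit IOVA allocator bounded by a usable window. It is not
code from this series: IovaRange, IovaAllocator, iova_alloc() and
iova_free() are hypothetical names, the fixed-size array stands in for
whatever container is finally used, and iova_free() takes the start
address instead of a size. Sizes are inclusive, as in VhostDMAMap, and the
window is assumed not to end at UINT64_MAX.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical types; they only mirror the shape of VhostDMAMap. */
typedef struct IovaRange {
    uint64_t iova;
    uint64_t size;          /* Inclusive: covers [iova, iova + size] */
} IovaRange;

typedef struct IovaAllocator {
    uint64_t first, last;   /* Usable window, both inclusive */
    IovaRange used[16];     /* Sorted by iova, never overlapping */
    size_t n_used;
} IovaAllocator;

/* First fit: walk the sorted ranges and take the first hole big enough. */
static bool iova_alloc(IovaAllocator *a, uint64_t size, uint64_t *iova)
{
    uint64_t start = a->first;
    size_t pos, i;

    if (a->n_used == sizeof(a->used) / sizeof(a->used[0])) {
        return false;
    }

    for (pos = 0; pos < a->n_used; pos++) {
        /* The hole before used[pos] is [start, used[pos].iova - 1] */
        if (a->used[pos].iova - start > size) {
            break;
        }
        start = a->used[pos].iova + a->used[pos].size + 1;
    }

    /* No hole between ranges: check the tail of the window */
    if (pos == a->n_used && (start > a->last || a->last - start < size)) {
        return false;
    }

    /* Keep the array sorted by shifting the tail one slot to the right */
    for (i = a->n_used; i > pos; i--) {
        a->used[i] = a->used[i - 1];
    }
    a->used[pos] = (IovaRange){ .iova = start, .size = size };
    a->n_used++;
    *iova = start;
    return true;
}

/* Release the range that starts at iova, if there is one. */
static bool iova_free(IovaAllocator *a, uint64_t iova)
{
    for (size_t i = 0; i < a->n_used; i++) {
        if (a->used[i].iova == iova) {
            for (; i + 1 < a->n_used; i++) {
                a->used[i] = a->used[i + 1];
            }
            a->n_used--;
            return true;
        }
    }
    return false;
}
```

With this shape, the window would come from whatever the device reports
as its iova range, and the translation tree would only be populated after
iova_alloc() succeeds.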
>
> >
> > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > ---
> >   hw/virtio/vhost-iova-tree.h |  40 +++++++
> >   hw/virtio/vhost-iova-tree.c | 230 ++++++++++++++++++++++++++++++++++++
> >   hw/virtio/meson.build       |   2 +-
> >   3 files changed, 271 insertions(+), 1 deletion(-)
> >   create mode 100644 hw/virtio/vhost-iova-tree.h
> >   create mode 100644 hw/virtio/vhost-iova-tree.c
> >
> > diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
> > new file mode 100644
> > index 0000000000..d163a88905
> > --- /dev/null
> > +++ b/hw/virtio/vhost-iova-tree.h
> > @@ -0,0 +1,40 @@
> > +/*
> > + * vhost software live migration ring
> > + *
> > + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
> > + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
> > + *
> > + * SPDX-License-Identifier: GPL-2.0-or-later
> > + */
> > +
> > +#ifndef HW_VIRTIO_VHOST_IOVA_TREE_H
> > +#define HW_VIRTIO_VHOST_IOVA_TREE_H
> > +
> > +#include "exec/memory.h"
> > +
> > +typedef struct VhostDMAMap {
> > +    void *translated_addr;
> > +    hwaddr iova;
> > +    hwaddr size;                /* Inclusive */
> > +    IOMMUAccessFlags perm;
> > +} VhostDMAMap;
> > +
> > +typedef enum VhostDMAMapNewRC {
> > +    VHOST_DMA_MAP_NO_SPACE = -3,
> > +    VHOST_DMA_MAP_OVERLAP = -2,
> > +    VHOST_DMA_MAP_INVALID = -1,
> > +    VHOST_DMA_MAP_OK = 0,
> > +} VhostDMAMapNewRC;
> > +
> > +typedef struct VhostIOVATree VhostIOVATree;
> > +
> > +VhostIOVATree *vhost_iova_tree_new(void);
> > +void vhost_iova_tree_unref(VhostIOVATree *iova_rm);
> > +G_DEFINE_AUTOPTR_CLEANUP_FUNC(VhostIOVATree, vhost_iova_tree_unref);
> > +
> > +const VhostDMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *iova_rm,
> > +                                             const VhostDMAMap *map);
> > +VhostDMAMapNewRC vhost_iova_tree_alloc(VhostIOVATree *iova_rm,
> > +                                       VhostDMAMap *map);
> > +
> > +#endif
> > diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
> > new file mode 100644
> > index 0000000000..c284e27607
> > --- /dev/null
> > +++ b/hw/virtio/vhost-iova-tree.c
> > @@ -0,0 +1,230 @@
> > +/*
> > + * vhost software live migration ring
> > + *
> > + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
> > + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
> > + *
> > + * SPDX-License-Identifier: GPL-2.0-or-later
> > + */
> > +
> > +#include "qemu/osdep.h"
> > +#include "vhost-iova-tree.h"
> > +
> > +#define G_ARRAY_NOT_ZERO_TERMINATED false
> > +#define G_ARRAY_NOT_CLEAR_ON_ALLOC false
> > +
> > +#define iova_min qemu_real_host_page_size
> > +
> > +/**
> > + * VhostIOVATree, able to:
> > + * - Translate iova address
> > + * - Reverse translate iova address (from translated to iova)
> > + * - Allocate IOVA regions for translated range (potentially slow operation)
> > + *
> > + * Note that it cannot remove nodes.
> > + */
> > +struct VhostIOVATree {
> > +    /* Ordered array of reverse translations, IOVA address to qemu memory. */
> > +    GArray *iova_taddr_map;
> > +
> > +    /*
> > +     * Ordered array of translations from qemu virtual memory address to iova
> > +     */
> > +    GArray *taddr_iova_map;
> > +};
>
>
> Any reason for using GArray? Is it faster?
>
To be honest, I used a GArray mainly for prototyping reasons, because
it allowed me to "insert a next element" once I had located either a
hole (in the case of iova) or an address. Doing that in the iova tree
required either adding code to iterate over elements with arguments or
allocating them. Another possibility is to add yet another structure
with "free regions".
I would say that at this moment GArray will be faster than GTree, thanks
to the ability to bisect a GArray and its better data locality, since we
will not be adding and deleting regions frequently during migration.
But that could change if we support viommu platforms, and I have not
measured it.
For the next revision I've added allocation capabilities to the iova
tree, which I think fit pretty well there, and a separate qemu vaddr ->
iova translation GTree. The allocation still traverses the util/iova-tree
linearly, but we can easily add a more performant allocator in the
future if needed. It should not matter anyway unless we hotplug
memory or similar.
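As an illustration of the "bisect" point, the sketch below does a lookup
in an ordered array of inclusive ranges with plain bsearch(), using a
comparator that returns 0 on overlap, in the same spirit as
vhost_iova_tree_cmp_taddr in the patch. Map, map_cmp_taddr() and
map_find() are hypothetical stand-ins, and addresses are kept as
uintptr_t to avoid pointer arithmetic on void *.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical entry; it only mirrors the shape of VhostDMAMap. */
typedef struct Map {
    uintptr_t taddr;        /* Start of the qemu virtual address range */
    uint64_t iova;          /* IOVA assigned to that range */
    uint64_t size;          /* Inclusive: covers [taddr, taddr + size] */
} Map;

/*
 * Returns 0 when both inclusive ranges overlap, so bsearch() stops at the
 * stored entry that intersects the needle. The needle is always the first
 * argument, as bsearch() guarantees.
 */
static int map_cmp_taddr(const void *a, const void *b)
{
    const Map *needle = a, *entry = b;

    if (needle->taddr > entry->taddr + entry->size) {
        return 1;
    }
    if (needle->taddr + needle->size < entry->taddr) {
        return -1;
    }
    return 0;
}

/* maps must be sorted by taddr and must not contain overlapping entries. */
static const Map *map_find(const Map *maps, size_t n, const Map *needle)
{
    return bsearch(needle, maps, n, sizeof(*maps), map_cmp_taddr);
}
```

The lookup is O(log n) over contiguous memory. The price is paid at
insertion time, where keeping the array ordered means shifting elements,
which is acceptable as long as regions are rarely added or removed.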
>
> > +
> > +/**
> > + * Inserts an element after an existing one in garray.
> > + *
> > + * @array      The array
> > + * @prev_elem  The previous element of array or NULL if prepending
> > + * @map        The DMA map
> > + *
> > + * It provides the additional advantage of being type safe over
> > + * g_array_insert_val, which accepts a reference pointer instead of a value
> > + * with no complaints.
> > + */
> > +static void vhost_iova_tree_insert_after(GArray *array,
> > +                                         const VhostDMAMap *prev_elem,
> > +                                         const VhostDMAMap *map)
> > +{
> > +    size_t pos;
> > +
> > +    if (!prev_elem) {
> > +        pos = 0;
> > +    } else {
> > +        pos = prev_elem - &g_array_index(array, typeof(*prev_elem), 0) + 1;
> > +    }
> > +
> > +    g_array_insert_val(array, pos, *map);
> > +}
> > +
> > +static gint vhost_iova_tree_cmp_taddr(gconstpointer a, gconstpointer b)
> > +{
> > +    const VhostDMAMap *m1 = a, *m2 = b;
> > +
> > +    if (m1->translated_addr > m2->translated_addr + m2->size) {
> > +        return 1;
> > +    }
> > +
> > +    if (m1->translated_addr + m1->size < m2->translated_addr) {
> > +        return -1;
> > +    }
> > +
> > +    /* Overlapped */
> > +    return 0;
> > +}
> > +
> > +/**
> > + * Find the previous node to a given iova
> > + *
> > + * @array  The ascending ordered-by-translated-addr array of VhostDMAMap
> > + * @map    The map to insert
> > + * @prev   Returned location of the previous map
> > + *
> > + * Return VHOST_DMA_MAP_OK if everything went well, or VHOST_DMA_MAP_OVERLAP if
> > + * it already exists. It is ok to use this function to check if a given range
> > + * exists, but it will use a linear search.
> > + *
> > + * TODO: We can use bsearch to locate the entry if we save the state in the
> > + * needle, knowing that the needle is always the first argument to
> > + * compare_func.
> > + */
> > +static VhostDMAMapNewRC vhost_iova_tree_find_prev(const GArray *array,
> > +                                                  GCompareFunc compare_func,
> > +                                                  const VhostDMAMap *map,
> > +                                                  const VhostDMAMap **prev)
> > +{
> > +    size_t i;
> > +    int r;
> > +
> > +    *prev = NULL;
> > +    for (i = 0; i < array->len; ++i) {
> > +        r = compare_func(map, &g_array_index(array, typeof(*map), i));
> > +        if (r == 0) {
> > +            return VHOST_DMA_MAP_OVERLAP;
> > +        }
> > +        if (r < 0) {
> > +            return VHOST_DMA_MAP_OK;
> > +        }
> > +
> > +        *prev = &g_array_index(array, typeof(**prev), i);
> > +    }
> > +
> > +    return VHOST_DMA_MAP_OK;
> > +}
> > +
> > +/**
> > + * Create a new IOVA tree
> > + *
> > + * Returns the new IOVA tree
> > + */
> > +VhostIOVATree *vhost_iova_tree_new(void)
> > +{
>
>
> So I think it needs to be initialized with the range we get from
> get_iova_range().
>
Right, it is done that way for the next revision.
> Thanks
>
>
> > +    VhostIOVATree *tree = g_new(VhostIOVATree, 1);
> > +    tree->iova_taddr_map = g_array_new(G_ARRAY_NOT_ZERO_TERMINATED,
> > +                                       G_ARRAY_NOT_CLEAR_ON_ALLOC,
> > +                                       sizeof(VhostDMAMap));
> > +    tree->taddr_iova_map = g_array_new(G_ARRAY_NOT_ZERO_TERMINATED,
> > +                                       G_ARRAY_NOT_CLEAR_ON_ALLOC,
> > +                                       sizeof(VhostDMAMap));
> > +    return tree;
> > +}
> > +
> > +/**
> > + * Destroy an IOVA tree
> > + *
> > + * @tree  The iova tree
> > + */
> > +void vhost_iova_tree_unref(VhostIOVATree *tree)
> > +{
> > +    g_array_unref(g_steal_pointer(&tree->iova_taddr_map));
> > +    g_array_unref(g_steal_pointer(&tree->taddr_iova_map));
> > +}
> > +
> > +/**
> > + * Find the IOVA address stored from a memory address
> > + *
> > + * @tree     The iova tree
> > + * @map      The map with the memory address
> > + *
> > + * Return the stored mapping, or NULL if not found.
> > + */
> > +const VhostDMAMap *vhost_iova_tree_find_iova(const VhostIOVATree *tree,
> > +                                             const VhostDMAMap *map)
> > +{
> > +    /*
> > +     * This can be replaced with g_array_binary_search (Since glib 2.62) when
> > +     * that version become common enough.
> > +     */
> > +    return bsearch(map, tree->taddr_iova_map->data, tree->taddr_iova_map->len,
> > +                   sizeof(*map), vhost_iova_tree_cmp_taddr);
> > +}
> > +
> > +static bool vhost_iova_tree_find_iova_hole(const GArray *iova_map,
> > +                                           const VhostDMAMap *map,
> > +                                           const VhostDMAMap **prev_elem)
> > +{
> > +    size_t i;
> > +    hwaddr iova = iova_min;
> > +
> > +    *prev_elem = NULL;
> > +    for (i = 0; i < iova_map->len; i++) {
> > +        const VhostDMAMap *next = &g_array_index(iova_map, typeof(*next), i);
> > +        hwaddr hole_end = next->iova;
> > +        if (map->size < hole_end - iova) {
> > +            return true;
> > +        }
> > +
> > +        iova = next->iova + next->size + 1;
> > +        *prev_elem = next;
> > +    }
> > +
> > +    return ((hwaddr)-1 - iova) > map->size;
> > +}
> > +
> > +/**
> > + * Allocate a new mapping
> > + *
> > + * @tree  The iova tree
> > + * @map   The iova map
> > + *
> > + * Returns:
> > + * - VHOST_DMA_MAP_OK if the map fits in the container
> > + * - VHOST_DMA_MAP_INVALID if the map does not make sense (like size overflow)
> > + * - VHOST_DMA_MAP_OVERLAP if the tree already contains that map
> > + * - VHOST_DMA_MAP_NO_SPACE if iova_rm cannot allocate more space.
> > + *
> > + * It returns the assigned iova in map->iova if return value is VHOST_DMA_MAP_OK.
> > + */
> > +VhostDMAMapNewRC vhost_iova_tree_alloc(VhostIOVATree *tree,
> > +                                       VhostDMAMap *map)
> > +{
> > +    const VhostDMAMap *qemu_prev, *iova_prev;
> > +    int find_prev_rc;
> > +    bool fit;
> > +
> > +    if (map->translated_addr + map->size < map->translated_addr ||
> > +        map->iova + map->size < map->iova || map->perm == IOMMU_NONE) {
> > +        return VHOST_DMA_MAP_INVALID;
> > +    }
> > +
> > +    /* Search for a hole in iova space big enough */
> > +    fit = vhost_iova_tree_find_iova_hole(tree->iova_taddr_map, map,
> > +                                         &iova_prev);
> > +    if (!fit) {
> > +        return VHOST_DMA_MAP_NO_SPACE;
> > +    }
> > +
> > +    map->iova = iova_prev ? (iova_prev->iova + iova_prev->size) + 1 : iova_min;
> > +    find_prev_rc = vhost_iova_tree_find_prev(tree->taddr_iova_map,
> > +                                             vhost_iova_tree_cmp_taddr, map,
> > +                                             &qemu_prev);
> > +    if (find_prev_rc == VHOST_DMA_MAP_OVERLAP) {
> > +        return VHOST_DMA_MAP_OVERLAP;
> > +    }
> > +
> > +    vhost_iova_tree_insert_after(tree->iova_taddr_map, iova_prev, map);
> > +    vhost_iova_tree_insert_after(tree->taddr_iova_map, qemu_prev, map);
> > +    return VHOST_DMA_MAP_OK;
> > +}
> > diff --git a/hw/virtio/meson.build b/hw/virtio/meson.build
> > index 8b5a0225fe..cb306b83c6 100644
> > --- a/hw/virtio/meson.build
> > +++ b/hw/virtio/meson.build
> > @@ -11,7 +11,7 @@ softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-stub.c'))
> >
> >   virtio_ss = ss.source_set()
> >   virtio_ss.add(files('virtio.c'))
> > -virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c'))
> > +virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c', 'vhost-iova-tree.c'))
> >   virtio_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user.c'))
> >   virtio_ss.add(when: 'CONFIG_VHOST_VDPA', if_true: files('vhost-vdpa.c'))
> >   virtio_ss.add(when: 'CONFIG_VIRTIO_BALLOON', if_true: files('virtio-balloon.c'))
>
* Re: [RFC PATCH v4 18/20] vhost: Add VhostIOVATree
  2021-10-19  9:22     ` Jason Wang
@ 2021-10-20  7:54       ` Eugenio Perez Martin
  2021-10-20  9:01         ` Jason Wang
  0 siblings, 1 reply; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-20  7:54 UTC (permalink / raw)
  To: Jason Wang; +Cc: qemu-devel
On Tue, Oct 19, 2021 at 11:23 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Tue, Oct 19, 2021 at 4:32 PM Jason Wang <jasowang@redhat.com> wrote:
> >
> >
> > On 2021/10/1 at 3:06 PM, Eugenio Pérez wrote:
> > > This tree is able to look for a translated address from an IOVA address.
> > >
> > > At first glance it is similar to util/iova-tree. However, SVQ working on
> > > devices with limited IOVA space needs more capabilities, like allocating
> > > IOVA chunks or performing reverse translations (qemu addresses to iova).
> >
> >
> > I don't see reverse translation being used anywhere in the shadow code.
> > Did I miss anything?
>
> Ok, it looks to me that it is used in the iova allocator. But I think
> it's better to decouple it to an independent allocator instead of
> vhost iova tree.
>
Reverse translation is used every time a buffer is made available,
since the buffers' contents are not copied; only the descriptors are
copied to the SVQ vring.
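A minimal sketch of that reverse translation, from a qemu virtual address
to the IOVA the device must see, follows. It is not the code of the
series: Map and translate_iov() are hypothetical names, a linear scan
replaces the tree lookup to keep the focus on the offset arithmetic, and
struct iovec comes from the POSIX <sys/uio.h> header. A buffer that is not
fully contained in a single mapping is rejected here, although a real
implementation might need to split it instead.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical entry; it only mirrors the shape of VhostDMAMap. */
typedef struct Map {
    void *translated_addr;  /* Start of the range in qemu's address space */
    uint64_t iova;          /* IOVA assigned to that range */
    uint64_t size;          /* Inclusive */
} Map;

/*
 * Stores in iova_out[i] the IOVA equivalent of iov[i].iov_base.
 * Returns false if some buffer is empty or is not covered by one mapping.
 */
static bool translate_iov(const Map *maps, size_t n_maps,
                          const struct iovec *iov, size_t num,
                          uint64_t *iova_out)
{
    for (size_t i = 0; i < num; i++) {
        uintptr_t start = (uintptr_t)iov[i].iov_base;
        uintptr_t last;
        bool found = false;

        if (iov[i].iov_len == 0) {
            return false;
        }
        last = start + iov[i].iov_len - 1;

        for (size_t j = 0; j < n_maps && !found; j++) {
            uintptr_t m_start = (uintptr_t)maps[j].translated_addr;
            uintptr_t m_last = m_start + maps[j].size;

            if (start >= m_start && last <= m_last) {
                /* Same offset inside the mapping, different base */
                iova_out[i] = maps[j].iova + (start - m_start);
                found = true;
            }
        }

        if (!found) {
            return false;
        }
    }

    return true;
}
```

Only the address written in the SVQ descriptor changes; the data itself
stays where the guest placed it.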
In the next revision I will send, all the limits are copied to the vhost
iova tree, defined at its creation in vhost_iova_tree_new(). They live
outside of util/iova-tree and are only passed to the latter at
allocation time.
Since vhost_iova_tree has its own vhost_iova_tree_alloc(), which wraps
iova_tree_alloc() [1], the limits could be kept in vhost-vdpa and made
an argument of vhost_iova_tree_alloc(). But I'm not sure whether that is
what you are proposing or whether I'm missing something.
Either way, I think it is harder to talk about this specific case
without code, since this series still does not address the limits. Would
you prefer me to send another RFC in WIP quality, with *not* all
comments addressed? I would say that there is not a lot of pending
work before sending the next one, but it might be easier for all of us.
Thanks!
[1] This util/iova-tree method will be proposed in the next series,
and vhost_iova_tree wraps it since it needs to keep both trees in
sync: iova->qemu vaddr for iova allocation and the reverse one to
translate available buffers.
> Thanks
>
* Re: [RFC PATCH v4 18/20] vhost: Add VhostIOVATree
  2021-10-20  7:54       ` Eugenio Perez Martin
@ 2021-10-20  9:01         ` Jason Wang
  2021-10-20 12:06           ` Eugenio Perez Martin
  0 siblings, 1 reply; 90+ messages in thread
From: Jason Wang @ 2021-10-20  9:01 UTC (permalink / raw)
  To: Eugenio Perez Martin; +Cc: qemu-devel
On Wed, Oct 20, 2021 at 3:54 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Tue, Oct 19, 2021 at 11:23 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Tue, Oct 19, 2021 at 4:32 PM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > >
> > > > On 2021/10/1 at 3:06 PM, Eugenio Pérez wrote:
> > > > This tree is able to look for a translated address from an IOVA address.
> > > >
> > > > At first glance it is similar to util/iova-tree. However, SVQ working on
> > > > devices with limited IOVA space needs more capabilities, like allocating
> > > > IOVA chunks or performing reverse translations (qemu addresses to iova).
> > >
> > >
> > > I don't see reverse translation being used anywhere in the shadow code.
> > > Did I miss anything?
> >
> > Ok, it looks to me that it is used in the iova allocator. But I think
> > it's better to decouple it to an independent allocator instead of
> > vhost iova tree.
> >
>
> Reverse translation is used every time a buffer is made available,
> since buffers content are not copied, only the descriptors to SVQ
> vring.
I may be missing something, but I didn't see that code. Qemu knows the
VA of the virtqueue, and the VA of the VQ is stored in the
VirtQueueElement, right?
>
> At this point all the limits are copied to vhost iova tree in the next
> revision I will send, defined at its creation at
> vhost_iova_tree_new(). They are outside of util/iova-tree, only sent
> to the latter at allocation time.
>
> Since vhost_iova_tree has its own vhost_iova_tree_alloc(), that wraps
> the iova_tree_alloc() [1], limits could be kept in vhost-vdpa and make
> them an argument of vhost_iova_tree_alloc. But I'm not sure if it's
> what you are proposing or I'm missing something.
If the reverse translation is only used in iova allocation, what I meant
was to split out the IOVA allocation logic itself.
>
> Either way, I think it is harder to talk about this specific case
> without code, since this one still does not address the limits. Would
> you prefer me to send another RFC in WIP quality, with *not* all
> comments addressed? I would say that there is not a lot of pending
> work to send the next one, but it might be easier for all of us.
I'd prefer to try to address them all, otherwise it's not easy to see
what is missing.
Thanks
>
> Thanks!
>
> [1] This util/iova-tree method will be proposed in the next series,
> and vhost_iova_tree wraps it since it needs to keep in sync both
> trees: iova->qemu vaddr for iova allocation and the reverse one to
> translate available buffers.
>
> > Thanks
> >
>
* Re: [RFC PATCH v4 20/20] vdpa: Add custom IOTLB translations to SVQ
  2021-10-20  6:51           ` Eugenio Perez Martin
@ 2021-10-20  9:03             ` Jason Wang
  2021-10-20 11:56               ` Eugenio Perez Martin
  0 siblings, 1 reply; 90+ messages in thread
From: Jason Wang @ 2021-10-20  9:03 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-level, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Michael Lilja, Stefano Garzarella
On Wed, Oct 20, 2021 at 2:52 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Wed, Oct 20, 2021 at 4:07 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Wed, Oct 20, 2021 at 10:02 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Tue, Oct 19, 2021 at 6:29 PM Eugenio Perez Martin
> > > <eperezma@redhat.com> wrote:
> > > >
> > > > On Tue, Oct 19, 2021 at 11:25 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > >
> > > > > > On 2021/10/1 at 3:06 PM, Eugenio Pérez wrote:
> > > > > > Use translations added in VhostIOVATree in SVQ.
> > > > > >
> > > > > > Now every element needs to store the previous address also, so VirtQueue
> > > > > > can consume the elements properly. This adds a little overhead per VQ
> > > > > > element, having to allocate more memory to stash them. As a possible
> > > > > > optimization, this allocation could be avoided if the descriptor is not
> > > > > > a chain but a single one, but this is left undone.
> > > > > >
> > > > > > TODO: iova range should be queried before, and add logic to fail when
> > > > > > GPA is outside of its range and memory listener or svq add it.
> > > > > >
> > > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > > > ---
> > > > > >   hw/virtio/vhost-shadow-virtqueue.h |   4 +-
> > > > > >   hw/virtio/vhost-shadow-virtqueue.c | 130 ++++++++++++++++++++++++-----
> > > > > >   hw/virtio/vhost-vdpa.c             |  40 ++++++++-
> > > > > >   hw/virtio/trace-events             |   1 +
> > > > > >   4 files changed, 152 insertions(+), 23 deletions(-)
> > > > > >
> > > > > > diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> > > > > > index b7baa424a7..a0e6b5267a 100644
> > > > > > --- a/hw/virtio/vhost-shadow-virtqueue.h
> > > > > > +++ b/hw/virtio/vhost-shadow-virtqueue.h
> > > > > > @@ -11,6 +11,7 @@
> > > > > >   #define VHOST_SHADOW_VIRTQUEUE_H
> > > > > >
> > > > > >   #include "hw/virtio/vhost.h"
> > > > > > +#include "hw/virtio/vhost-iova-tree.h"
> > > > > >
> > > > > >   typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
> > > > > >
> > > > > > @@ -28,7 +29,8 @@ bool vhost_svq_start(struct vhost_dev *dev, unsigned idx,
> > > > > >   void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> > > > > >                       VhostShadowVirtqueue *svq);
> > > > > >
> > > > > > -VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx);
> > > > > > +VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx,
> > > > > > +                                    VhostIOVATree *iova_map);
> > > > > >
> > > > > >   void vhost_svq_free(VhostShadowVirtqueue *vq);
> > > > > >
> > > > > > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > > > > > index 2fd0bab75d..9db538547e 100644
> > > > > > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > > > > > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > > > > > @@ -11,12 +11,19 @@
> > > > > >   #include "hw/virtio/vhost-shadow-virtqueue.h"
> > > > > >   #include "hw/virtio/vhost.h"
> > > > > >   #include "hw/virtio/virtio-access.h"
> > > > > > +#include "hw/virtio/vhost-iova-tree.h"
> > > > > >
> > > > > >   #include "standard-headers/linux/vhost_types.h"
> > > > > >
> > > > > >   #include "qemu/error-report.h"
> > > > > >   #include "qemu/main-loop.h"
> > > > > >
> > > > > > +typedef struct SVQElement {
> > > > > > +    VirtQueueElement elem;
> > > > > > +    void **in_sg_stash;
> > > > > > +    void **out_sg_stash;
> > > > > > +} SVQElement;
> > > > > > +
> > > > > >   /* Shadow virtqueue to relay notifications */
> > > > > >   typedef struct VhostShadowVirtqueue {
> > > > > >       /* Shadow vring */
> > > > > > @@ -46,8 +53,11 @@ typedef struct VhostShadowVirtqueue {
> > > > > >       /* Virtio device */
> > > > > >       VirtIODevice *vdev;
> > > > > >
> > > > > > +    /* IOVA mapping if used */
> > > > > > +    VhostIOVATree *iova_map;
> > > > > > +
> > > > > >       /* Map for returning guest's descriptors */
> > > > > > -    VirtQueueElement **ring_id_maps;
> > > > > > +    SVQElement **ring_id_maps;
> > > > > >
> > > > > >       /* Next head to expose to device */
> > > > > >       uint16_t avail_idx_shadow;
> > > > > > @@ -79,13 +89,6 @@ bool vhost_svq_valid_device_features(uint64_t *dev_features)
> > > > > >               continue;
> > > > > >
> > > > > >           case VIRTIO_F_ACCESS_PLATFORM:
> > > > > > -            /* SVQ needs this feature disabled. Can't continue */
> > > > > > -            if (*dev_features & BIT_ULL(b)) {
> > > > > > -                clear_bit(b, dev_features);
> > > > > > -                r = false;
> > > > > > -            }
> > > > > > -            break;
> > > > > > -
> > > > > >           case VIRTIO_F_VERSION_1:
> > > > > >               /* SVQ needs this feature, so can't continue */
> > > > > >               if (!(*dev_features & BIT_ULL(b))) {
> > > > > > @@ -126,6 +129,64 @@ static void vhost_svq_set_notification(VhostShadowVirtqueue *svq, bool enable)
> > > > > >       }
> > > > > >   }
> > > > > >
> > > > > > +static void vhost_svq_stash_addr(void ***stash, const struct iovec *iov,
> > > > > > +                                 size_t num)
> > > > > > +{
> > > > > > +    size_t i;
> > > > > > +
> > > > > > +    if (num == 0) {
> > > > > > +        return;
> > > > > > +    }
> > > > > > +
> > > > > > +    *stash = g_new(void *, num);
> > > > > > +    for (i = 0; i < num; ++i) {
> > > > > > +        (*stash)[i] = iov[i].iov_base;
> > > > > > +    }
> > > > > > +}
> > > > > > +
> > > > > > +static void vhost_svq_unstash_addr(void **stash, struct iovec *iov, size_t num)
> > > > > > +{
> > > > > > +    size_t i;
> > > > > > +
> > > > > > +    if (num == 0) {
> > > > > > +        return;
> > > > > > +    }
> > > > > > +
> > > > > > +    for (i = 0; i < num; ++i) {
> > > > > > +        iov[i].iov_base = stash[i];
> > > > > > +    }
> > > > > > +    g_free(stash);
> > > > > > +}
> > > > > > +
> > > > > > +static void vhost_svq_translate_addr(const VhostShadowVirtqueue *svq,
> > > > > > +                                     struct iovec *iovec, size_t num)
> > > > > > +{
> > > > > > +    size_t i;
> > > > > > +
> > > > > > +    for (i = 0; i < num; ++i) {
> > > > > > +        VhostDMAMap needle = {
> > > > > > +            .translated_addr = iovec[i].iov_base,
> > > > > > +            .size = iovec[i].iov_len,
> > > > > > +        };
> > > > > > +        size_t off;
> > > > > > +
> > > > > > +        const VhostDMAMap *map = vhost_iova_tree_find_iova(svq->iova_map,
> > > > > > +                                                           &needle);
> > > > >
> > > > >
> > > > > Is it possible that we end up with more than one maps here?
> > > > >
> > > >
> > > > Actually it is possible, since there is no guarantee that one
> > > > descriptor (or indirect descriptor) maps exactly to one iov. It could
> > > > map to many if qemu vaddr is not contiguous but GPA + size is. This is
> > > > something that must be fixed for the next revision, so thanks for
> > > > pointing it out!
> > > >
> > > > Taking that into account, the condition that svq vring avail_idx -
> > > > used_idx was always less or equal than guest's vring avail_idx -
> > > > used_idx is not true anymore. Checking for that before adding buffers
> > > > to SVQ is the easy part, but how could we recover in that case?
> > > >
> > > > I think that the easy solution is to check for more available buffers
> > > > unconditionally at the end of vhost_svq_handle_call, which handles the
> > > > SVQ used and is supposed to make more room for available buffers. So
> > > > vhost_handle_guest_kick would not check if eventfd is set or not
> > > > anymore.
> > > >
> > > > Would that make sense?
> > >
> > > Yes, I think it should work.
> >
> > Btw, I wonder how to handle indirect descriptors. SVQ doesn't use
> > indirect descriptors for now, but it looks like a must otherwise we
> > may end up SVQ is full before VQ.
> >
>
> We can get to that situation without indirect too, if a single
> descriptor maps to more than one sg buffer. The next revision is going
> to control that too.
>
> > It looks to me an easy way is to always use indirect descriptors if #sg >= 2?
> >
>
> I will use that, but that does not solve the case where a descriptor
> maps to > 1 different buffers in qemu vaddr.
Right, so we need to deal with the case when SVQ is out of space.
> So I think that some
> check after marking descriptors as used is a must somehow.
I thought it should be before processing the available buffer? Isn't it
the guest driver that makes sure there's sufficient space in the used
ring?

Thanks
>
>
> > Thanks
> >
> > >
> > > Thanks
> > >
> > > >
> > > > Thanks!
> > > >
> > > > >
> > > > > > +        /*
> > > > > > +         * Map cannot be NULL since iova map contains all guest space and
> > > > > > +         * qemu already has a physical address mapped
> > > > > > +         */
> > > > > > +        assert(map);
> > > > > > +
> > > > > > +        /*
> > > > > > +         * Map->iova chunk size is ignored. What to do if descriptor
> > > > > > +         * (addr, size) does not fit is delegated to the device.
> > > > > > +         */
> > > > > > +        off = needle.translated_addr - map->translated_addr;
> > > > > > +        iovec[i].iov_base = (void *)(map->iova + off);
> > > > > > +    }
> > > > > > +}
> > > > > > +
> > > > > >   static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> > > > > >                                       const struct iovec *iovec,
> > > > > >                                       size_t num, bool more_descs, bool write)
> > > > > > @@ -156,8 +217,9 @@ static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> > > > > >   }
> > > > > >
> > > > > >   static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
> > > > > > -                                    VirtQueueElement *elem)
> > > > > > +                                    SVQElement *svq_elem)
> > > > > >   {
> > > > > > +    VirtQueueElement *elem = &svq_elem->elem;
> > > > > >       int head;
> > > > > >       unsigned avail_idx;
> > > > > >       vring_avail_t *avail = svq->vring.avail;
> > > > > > @@ -167,6 +229,12 @@ static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
> > > > > >       /* We need some descriptors here */
> > > > > >       assert(elem->out_num || elem->in_num);
> > > > > >
> > > > > > +    vhost_svq_stash_addr(&svq_elem->in_sg_stash, elem->in_sg, elem->in_num);
> > > > > > +    vhost_svq_stash_addr(&svq_elem->out_sg_stash, elem->out_sg, elem->out_num);
> > > > >
> > > > >
> > > > > I wonder if we can solve the trick like stash and unstash with a
> > > > > dedicated sgs in svq_elem, instead of reusing the elem.
> > > > >
> > > >
> > > > Actually yes, it would be way simpler to use a new sgs array in
> > > > svq_elem. I will change that.
> > > >
> > > > Thanks!
> > > >
> > > > > Thanks
> > > > >
> > > > >
> > > > > > +
> > > > > > +    vhost_svq_translate_addr(svq, elem->in_sg, elem->in_num);
> > > > > > +    vhost_svq_translate_addr(svq, elem->out_sg, elem->out_num);
> > > > > > +
> > > > > >       vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
> > > > > >                               elem->in_num > 0, false);
> > > > > >       vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
> > > > > > @@ -187,7 +255,7 @@ static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
> > > > > >
> > > > > >   }
> > > > > >
> > > > > > -static void vhost_svq_add(VhostShadowVirtqueue *svq, VirtQueueElement *elem)
> > > > > > +static void vhost_svq_add(VhostShadowVirtqueue *svq, SVQElement *elem)
> > > > > >   {
> > > > > >       unsigned qemu_head = vhost_svq_add_split(svq, elem);
> > > > > >
> > > > > > @@ -221,7 +289,7 @@ static void vhost_handle_guest_kick(EventNotifier *n)
> > > > > >           }
> > > > > >
> > > > > >           while (true) {
> > > > > > -            VirtQueueElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
> > > > > > +            SVQElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
> > > > > >               if (!elem) {
> > > > > >                   break;
> > > > > >               }
> > > > > > @@ -247,7 +315,7 @@ static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
> > > > > >       return svq->used_idx != svq->shadow_used_idx;
> > > > > >   }
> > > > > >
> > > > > > -static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> > > > > > +static SVQElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> > > > > >   {
> > > > > >       vring_desc_t *descs = svq->vring.desc;
> > > > > >       const vring_used_t *used = svq->vring.used;
> > > > > > @@ -279,7 +347,7 @@ static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> > > > > >       descs[used_elem.id].next = svq->free_head;
> > > > > >       svq->free_head = used_elem.id;
> > > > > >
> > > > > > -    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
> > > > > > +    svq->ring_id_maps[used_elem.id]->elem.len = used_elem.len;
> > > > > >       return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
> > > > > >   }
> > > > > >
> > > > > > @@ -296,12 +364,19 @@ static void vhost_svq_handle_call_no_test(EventNotifier *n)
> > > > > >
> > > > > >           vhost_svq_set_notification(svq, false);
> > > > > >           while (true) {
> > > > > > -            g_autofree VirtQueueElement *elem = vhost_svq_get_buf(svq);
> > > > > > -            if (!elem) {
> > > > > > +            g_autofree SVQElement *svq_elem = vhost_svq_get_buf(svq);
> > > > > > +            VirtQueueElement *elem;
> > > > > > +            if (!svq_elem) {
> > > > > >                   break;
> > > > > >               }
> > > > > >
> > > > > >               assert(i < svq->vring.num);
> > > > > > +            elem = &svq_elem->elem;
> > > > > > +
> > > > > > +            vhost_svq_unstash_addr(svq_elem->in_sg_stash, elem->in_sg,
> > > > > > +                                   elem->in_num);
> > > > > > +            vhost_svq_unstash_addr(svq_elem->out_sg_stash, elem->out_sg,
> > > > > > +                                   elem->out_num);
> > > > > >               virtqueue_fill(vq, elem, elem->len, i++);
> > > > > >           }
> > > > > >
> > > > > > @@ -451,14 +526,24 @@ void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> > > > > >       event_notifier_set_handler(&svq->host_notifier, NULL);
> > > > > >
> > > > > >       for (i = 0; i < svq->vring.num; ++i) {
> > > > > > -        g_autofree VirtQueueElement *elem = svq->ring_id_maps[i];
> > > > > > +        g_autofree SVQElement *svq_elem = svq->ring_id_maps[i];
> > > > > > +        VirtQueueElement *elem;
> > > > > > +
> > > > > > +        if (!svq_elem) {
> > > > > > +            continue;
> > > > > > +        }
> > > > > > +
> > > > > > +        elem = &svq_elem->elem;
> > > > > > +        vhost_svq_unstash_addr(svq_elem->in_sg_stash, elem->in_sg,
> > > > > > +                               elem->in_num);
> > > > > > +        vhost_svq_unstash_addr(svq_elem->out_sg_stash, elem->out_sg,
> > > > > > +                               elem->out_num);
> > > > > > +
> > > > > >           /*
> > > > > >            * Although the doc says we must unpop in order, it's ok to unpop
> > > > > >            * everything.
> > > > > >            */
> > > > > > -        if (elem) {
> > > > > > -            virtqueue_unpop(svq->vq, elem, elem->len);
> > > > > > -        }
> > > > > > +        virtqueue_unpop(svq->vq, elem, elem->len);
> > > > > >       }
> > > > > >   }
> > > > > >
> > > > > > @@ -466,7 +551,8 @@ void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> > > > > >    * Creates vhost shadow virtqueue, and instruct vhost device to use the shadow
> > > > > >    * methods and file descriptors.
> > > > > >    */
> > > > > > -VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> > > > > > +VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx,
> > > > > > +                                    VhostIOVATree *iova_map)
> > > > > >   {
> > > > > >       int vq_idx = dev->vq_index + idx;
> > > > > >       unsigned num = virtio_queue_get_num(dev->vdev, vq_idx);
> > > > > > @@ -500,11 +586,13 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> > > > > >       memset(svq->vring.desc, 0, driver_size);
> > > > > >       svq->vring.used = qemu_memalign(qemu_real_host_page_size, device_size);
> > > > > >       memset(svq->vring.used, 0, device_size);
> > > > > > +    svq->iova_map = iova_map;
> > > > > > +
> > > > > >       for (i = 0; i < num - 1; i++) {
> > > > > >           svq->vring.desc[i].next = cpu_to_le16(i + 1);
> > > > > >       }
> > > > > >
> > > > > > -    svq->ring_id_maps = g_new0(VirtQueueElement *, num);
> > > > > > +    svq->ring_id_maps = g_new0(SVQElement *, num);
> > > > > >       event_notifier_set_handler(&svq->call_notifier,
> > > > > >                                  vhost_svq_handle_call);
> > > > > >       return g_steal_pointer(&svq);
> > > > > > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > > > > > index a9c680b487..f5a12fee9d 100644
> > > > > > --- a/hw/virtio/vhost-vdpa.c
> > > > > > +++ b/hw/virtio/vhost-vdpa.c
> > > > > > @@ -176,6 +176,18 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
> > > > > >                                            vaddr, section->readonly);
> > > > > >
> > > > > >       llsize = int128_sub(llend, int128_make64(iova));
> > > > > > +    if (v->shadow_vqs_enabled) {
> > > > > > +        VhostDMAMap mem_region = {
> > > > > > +            .translated_addr = vaddr,
> > > > > > +            .size = int128_get64(llsize) - 1,
> > > > > > +            .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> > > > > > +        };
> > > > > > +
> > > > > > +        int r = vhost_iova_tree_alloc(v->iova_map, &mem_region);
> > > > > > +        assert(r == VHOST_DMA_MAP_OK);
> > > > > > +
> > > > > > +        iova = mem_region.iova;
> > > > > > +    }
> > > > > >
> > > > > >       ret = vhost_vdpa_dma_map(v, iova, int128_get64(llsize),
> > > > > >                                vaddr, section->readonly);
> > > > > > @@ -754,6 +766,23 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
> > > > > >       return true;
> > > > > >   }
> > > > > >
> > > > > > +static int vhost_vdpa_get_iova_range(struct vhost_dev *dev,
> > > > > > +                                     hwaddr *first, hwaddr *last)
> > > > > > +{
> > > > > > +    int ret;
> > > > > > +    struct vhost_vdpa_iova_range range;
> > > > > > +
> > > > > > +    ret = vhost_vdpa_call(dev, VHOST_VDPA_GET_IOVA_RANGE, &range);
> > > > > > +    if (ret != 0) {
> > > > > > +        return ret;
> > > > > > +    }
> > > > > > +
> > > > > > +    *first = range.first;
> > > > > > +    *last = range.last;
> > > > > > +    trace_vhost_vdpa_get_iova_range(dev, *first, *last);
> > > > > > +    return ret;
> > > > > > +}
> > > > > > +
> > > > > >   /**
> > > > > >    * Maps QEMU vaddr memory to device in a suitable way for shadow virtqueue:
> > > > > >    * - It always reference qemu memory address, not guest's memory.
> > > > > > @@ -881,6 +910,7 @@ static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
> > > > > >   static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > > > > >   {
> > > > > >       struct vhost_dev *hdev = v->dev;
> > > > > > +    hwaddr iova_first, iova_last;
> > > > > >       unsigned n;
> > > > > >       int r;
> > > > > >
> > > > > > @@ -894,7 +924,7 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > > > > >           /* Allocate resources */
> > > > > >           assert(v->shadow_vqs->len == 0);
> > > > > >           for (n = 0; n < hdev->nvqs; ++n) {
> > > > > > -            VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n);
> > > > > > +            VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n, v->iova_map);
> > > > > >               if (unlikely(!svq)) {
> > > > > >                   g_ptr_array_set_size(v->shadow_vqs, 0);
> > > > > >                   return 0;
> > > > > > @@ -903,6 +933,8 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > > > > >           }
> > > > > >       }
> > > > > >
> > > > > > +    r = vhost_vdpa_get_iova_range(hdev, &iova_first, &iova_last);
> > > > > > +    assert(r == 0);
> > > > > >       r = vhost_vdpa_vring_pause(hdev);
> > > > > >       assert(r == 0);
> > > > > >
> > > > > > @@ -913,6 +945,12 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > > > > >           }
> > > > > >       }
> > > > > >
> > > > > > +    memory_listener_unregister(&v->listener);
> > > > > > +    if (vhost_vdpa_dma_unmap(v, iova_first,
> > > > > > +                             (iova_last - iova_first) & TARGET_PAGE_MASK)) {
> > > > > > +        error_report("Fail to invalidate device iotlb");
> > > > > > +    }
> > > > > > +
> > > > > >       /* Reset device so it can be configured */
> > > > > >       r = vhost_vdpa_dev_start(hdev, false);
> > > > > >       assert(r == 0);
> > > > > > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > > > > > index 8ed19e9d0c..650e521e35 100644
> > > > > > --- a/hw/virtio/trace-events
> > > > > > +++ b/hw/virtio/trace-events
> > > > > > @@ -52,6 +52,7 @@ vhost_vdpa_set_vring_call(void *dev, unsigned int index, int fd) "dev: %p index:
> > > > > >   vhost_vdpa_get_features(void *dev, uint64_t features) "dev: %p features: 0x%"PRIx64
> > > > > >   vhost_vdpa_set_owner(void *dev) "dev: %p"
> > > > > >   vhost_vdpa_vq_get_addr(void *dev, void *vq, uint64_t desc_user_addr, uint64_t avail_user_addr, uint64_t used_user_addr) "dev: %p vq: %p desc_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" used_user_addr: 0x%"PRIx64
> > > > > > +vhost_vdpa_get_iova_range(void *dev, uint64_t first, uint64_t last) "dev: %p first: 0x%"PRIx64" last: 0x%"PRIx64
> > > > > >
> > > > > >   # virtio.c
> > > > > >   virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
> > > > >
> > > >
> >
>
* Re: [RFC PATCH v4 20/20] vdpa: Add custom IOTLB translations to SVQ
  2021-10-20  9:03             ` Jason Wang
@ 2021-10-20 11:56               ` Eugenio Perez Martin
  2021-10-21  2:38                 ` Jason Wang
  2021-10-26  4:32                 ` Jason Wang
  0 siblings, 2 replies; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-20 11:56 UTC (permalink / raw)
  To: Jason Wang
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-devel, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Michael Lilja, Stefano Garzarella
On Wed, Oct 20, 2021 at 11:03 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Oct 20, 2021 at 2:52 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Wed, Oct 20, 2021 at 4:07 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Wed, Oct 20, 2021 at 10:02 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Tue, Oct 19, 2021 at 6:29 PM Eugenio Perez Martin
> > > > <eperezma@redhat.com> wrote:
> > > > >
> > > > > On Tue, Oct 19, 2021 at 11:25 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > >
> > > > > > > On 2021/10/1 3:06 PM, Eugenio Pérez wrote:
> > > > > > > Use translations added in VhostIOVATree in SVQ.
> > > > > > >
> > > > > > > Now every element needs to store the previous address also, so VirtQueue
> > > > > > > can consume the elements properly. This adds a little overhead per VQ
> > > > > > > element, having to allocate more memory to stash them. As a possible
> > > > > > > optimization, this allocation could be avoided if the descriptor is not
> > > > > > > a chain but a single one, but this is left undone.
> > > > > > >
> > > > > > > TODO: iova range should be queried before, and add logic to fail when
> > > > > > > GPA is outside of its range and memory listener or svq add it.
> > > > > > >
> > > > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > > > > ---
> > > > > > >   hw/virtio/vhost-shadow-virtqueue.h |   4 +-
> > > > > > >   hw/virtio/vhost-shadow-virtqueue.c | 130 ++++++++++++++++++++++++-----
> > > > > > >   hw/virtio/vhost-vdpa.c             |  40 ++++++++-
> > > > > > >   hw/virtio/trace-events             |   1 +
> > > > > > >   4 files changed, 152 insertions(+), 23 deletions(-)
> > > > > > >
> > > > > > > diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> > > > > > > index b7baa424a7..a0e6b5267a 100644
> > > > > > > --- a/hw/virtio/vhost-shadow-virtqueue.h
> > > > > > > +++ b/hw/virtio/vhost-shadow-virtqueue.h
> > > > > > > @@ -11,6 +11,7 @@
> > > > > > >   #define VHOST_SHADOW_VIRTQUEUE_H
> > > > > > >
> > > > > > >   #include "hw/virtio/vhost.h"
> > > > > > > +#include "hw/virtio/vhost-iova-tree.h"
> > > > > > >
> > > > > > >   typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
> > > > > > >
> > > > > > > @@ -28,7 +29,8 @@ bool vhost_svq_start(struct vhost_dev *dev, unsigned idx,
> > > > > > >   void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> > > > > > >                       VhostShadowVirtqueue *svq);
> > > > > > >
> > > > > > > -VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx);
> > > > > > > +VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx,
> > > > > > > +                                    VhostIOVATree *iova_map);
> > > > > > >
> > > > > > >   void vhost_svq_free(VhostShadowVirtqueue *vq);
> > > > > > >
> > > > > > > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > > > > > > index 2fd0bab75d..9db538547e 100644
> > > > > > > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > > > > > > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > > > > > > @@ -11,12 +11,19 @@
> > > > > > >   #include "hw/virtio/vhost-shadow-virtqueue.h"
> > > > > > >   #include "hw/virtio/vhost.h"
> > > > > > >   #include "hw/virtio/virtio-access.h"
> > > > > > > +#include "hw/virtio/vhost-iova-tree.h"
> > > > > > >
> > > > > > >   #include "standard-headers/linux/vhost_types.h"
> > > > > > >
> > > > > > >   #include "qemu/error-report.h"
> > > > > > >   #include "qemu/main-loop.h"
> > > > > > >
> > > > > > > +typedef struct SVQElement {
> > > > > > > +    VirtQueueElement elem;
> > > > > > > +    void **in_sg_stash;
> > > > > > > +    void **out_sg_stash;
> > > > > > > +} SVQElement;
> > > > > > > +
> > > > > > >   /* Shadow virtqueue to relay notifications */
> > > > > > >   typedef struct VhostShadowVirtqueue {
> > > > > > >       /* Shadow vring */
> > > > > > > @@ -46,8 +53,11 @@ typedef struct VhostShadowVirtqueue {
> > > > > > >       /* Virtio device */
> > > > > > >       VirtIODevice *vdev;
> > > > > > >
> > > > > > > +    /* IOVA mapping if used */
> > > > > > > +    VhostIOVATree *iova_map;
> > > > > > > +
> > > > > > >       /* Map for returning guest's descriptors */
> > > > > > > -    VirtQueueElement **ring_id_maps;
> > > > > > > +    SVQElement **ring_id_maps;
> > > > > > >
> > > > > > >       /* Next head to expose to device */
> > > > > > >       uint16_t avail_idx_shadow;
> > > > > > > @@ -79,13 +89,6 @@ bool vhost_svq_valid_device_features(uint64_t *dev_features)
> > > > > > >               continue;
> > > > > > >
> > > > > > >           case VIRTIO_F_ACCESS_PLATFORM:
> > > > > > > -            /* SVQ needs this feature disabled. Can't continue */
> > > > > > > -            if (*dev_features & BIT_ULL(b)) {
> > > > > > > -                clear_bit(b, dev_features);
> > > > > > > -                r = false;
> > > > > > > -            }
> > > > > > > -            break;
> > > > > > > -
> > > > > > >           case VIRTIO_F_VERSION_1:
> > > > > > >               /* SVQ needs this feature, so can't continue */
> > > > > > >               if (!(*dev_features & BIT_ULL(b))) {
> > > > > > > @@ -126,6 +129,64 @@ static void vhost_svq_set_notification(VhostShadowVirtqueue *svq, bool enable)
> > > > > > >       }
> > > > > > >   }
> > > > > > >
> > > > > > > +static void vhost_svq_stash_addr(void ***stash, const struct iovec *iov,
> > > > > > > +                                 size_t num)
> > > > > > > +{
> > > > > > > +    size_t i;
> > > > > > > +
> > > > > > > +    if (num == 0) {
> > > > > > > +        return;
> > > > > > > +    }
> > > > > > > +
> > > > > > > +    *stash = g_new(void *, num);
> > > > > > > +    for (i = 0; i < num; ++i) {
> > > > > > > +        (*stash)[i] = iov[i].iov_base;
> > > > > > > +    }
> > > > > > > +}
> > > > > > > +
> > > > > > > +static void vhost_svq_unstash_addr(void **stash, struct iovec *iov, size_t num)
> > > > > > > +{
> > > > > > > +    size_t i;
> > > > > > > +
> > > > > > > +    if (num == 0) {
> > > > > > > +        return;
> > > > > > > +    }
> > > > > > > +
> > > > > > > +    for (i = 0; i < num; ++i) {
> > > > > > > +        iov[i].iov_base = stash[i];
> > > > > > > +    }
> > > > > > > +    g_free(stash);
> > > > > > > +}
> > > > > > > +
> > > > > > > +static void vhost_svq_translate_addr(const VhostShadowVirtqueue *svq,
> > > > > > > +                                     struct iovec *iovec, size_t num)
> > > > > > > +{
> > > > > > > +    size_t i;
> > > > > > > +
> > > > > > > +    for (i = 0; i < num; ++i) {
> > > > > > > +        VhostDMAMap needle = {
> > > > > > > +            .translated_addr = iovec[i].iov_base,
> > > > > > > +            .size = iovec[i].iov_len,
> > > > > > > +        };
> > > > > > > +        size_t off;
> > > > > > > +
> > > > > > > +        const VhostDMAMap *map = vhost_iova_tree_find_iova(svq->iova_map,
> > > > > > > +                                                           &needle);
> > > > > >
> > > > > >
> > > > > > Is it possible that we end up with more than one maps here?
> > > > > >
> > > > >
> > > > > Actually it is possible, since there is no guarantee that one
> > > > > descriptor (or indirect descriptor) maps exactly to one iov. It could
> > > > > map to many if qemu vaddr is not contiguous but GPA + size is. This is
> > > > > something that must be fixed for the next revision, so thanks for
> > > > > pointing it out!
> > > > >
> > > > > Taking that into account, the condition that svq vring avail_idx -
> > > > > used_idx was always less or equal than guest's vring avail_idx -
> > > > > used_idx is not true anymore. Checking for that before adding buffers
> > > > > to SVQ is the easy part, but how could we recover in that case?
> > > > >
> > > > > I think that the easy solution is to check for more available buffers
> > > > > unconditionally at the end of vhost_svq_handle_call, which handles the
> > > > > SVQ used and is supposed to make more room for available buffers. So
> > > > > vhost_handle_guest_kick would not check if eventfd is set or not
> > > > > anymore.
> > > > >
> > > > > Would that make sense?
> > > >
> > > > Yes, I think it should work.
> > >
> > > Btw, I wonder how to handle indirect descriptors. SVQ doesn't use
> > > indirect descriptors for now, but it looks like a must otherwise we
> > > may end up SVQ is full before VQ.
> > >
> >
> > We can get to that situation without indirect too, if a single
> > descriptor maps to more than one sg buffer. The next revision is going
> > to control that too.
> >
> > > It looks to me an easy way is to always use indirect descriptors if #sg >= 2?
> > >
> >
> > I will use that, but that does not solve the case where a descriptor
> > maps to > 1 different buffers in qemu vaddr.
>
> Right, so we need to deal with the case when SVQ is out of space.
>
>
> > So I think that some
> > check after marking descriptors as used is a must somehow.
>
> I thought it should be before processing the available buffer?
Yes, I meant after that. In a sense, at least, because I count checking
the number of sg buffers as part of "processing". :)
> It's
> the guest driver that make sure there's sufficient space for used
> ring?
>
(I think we are saying the same thing in different words, but just in
case I will develop the idea here with an example.)

The guest is able to check if there is enough space in the SVQ's
vring, but not in the device's vring. As an example of this, imagine
that a guest makes available a GPA-contiguous buffer of 64K using a
single descriptor. However, this memory is divided into 16 chunks of
4K in qemu's VA space. Imagine that at this moment there are only eight
free slots in each vring, and that neither side is using indirect
descriptors.

The guest only needs 1 free descriptor to make that buffer available,
so it will add it to the avail ring. But SVQ needs 16 chained
descriptors, so the buffer is not going to reach the device until the
device marks at least 8 more descriptors as used. SVQ checked the
amount of available room, as you said, but it cannot forward the
available buffer.
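
To make the arithmetic concrete, here is a minimal sketch in C. The
helper names and the assumption that qemu's VA is contiguous only inside
aligned, power-of-two sized chunks are made up for illustration; nothing
like this exists in the series.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: number of SVQ descriptors needed to forward one
 * guest buffer of `len` bytes starting at guest physical address `gpa`,
 * assuming qemu's VA is only contiguous within aligned chunks of `chunk`
 * bytes (`chunk` must be a power of two).
 */
static size_t svq_descs_needed(uint64_t gpa, uint64_t len, uint64_t chunk)
{
    assert(chunk != 0 && (chunk & (chunk - 1)) == 0);

    if (len == 0) {
        return 0;
    }

    /* Index of the first and last chunk touched by the buffer */
    uint64_t first = gpa / chunk;
    uint64_t last = (gpa + len - 1) / chunk;

    return (size_t)(last - first + 1);
}

/* Hypothetical check: the buffer can only be forwarded if it fits */
static bool svq_can_forward(size_t needed, size_t free_slots)
{
    return needed <= free_slots;
}
```

With the numbers of the example, svq_descs_needed(0, 64 * 1024, 4096)
is 16, and svq_can_forward(16, 8) is false.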

Since the guest already sent the kick when it made the descriptor
available, we need another mechanism to know when we have all the
needed free slots in the SVQ vring. And that's what I meant by the
check after marking some buffers as used.
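
A toy model of such a mechanism, as a sketch only: the struct and
function names are hypothetical, it tracks a single stalled buffer, and
it is not how the series (or its next revision) implements it.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model, not QEMU code: all names are hypothetical */
typedef struct ToySVQ {
    size_t free_slots;     /* Free descriptors in the SVQ vring */
    size_t pending_descs;  /* Descriptors the stalled buffer needs, 0 if none */
} ToySVQ;

/* Guest kick path: forward the buffer if it fits, remember it otherwise */
static bool toy_svq_try_add(ToySVQ *svq, size_t needed)
{
    if (needed > svq->free_slots) {
        /* The guest will not kick again for this buffer, so stash it */
        svq->pending_descs = needed;
        return false;
    }

    svq->free_slots -= needed;
    svq->pending_descs = 0;
    return true;
}

/*
 * Device call path: after `freed` descriptors are marked as used, retry
 * the stalled buffer without waiting for a new guest kick.
 * Returns true if a stalled buffer was forwarded.
 */
static bool toy_svq_handle_used(ToySVQ *svq, size_t freed)
{
    svq->free_slots += freed;
    if (svq->pending_descs == 0) {
        return false;
    }

    return toy_svq_try_add(svq, svq->pending_descs);
}
```

Starting with 8 free slots and a buffer that needs 16, the add is
deferred and only goes through once the used path has returned 8 more
descriptors.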

I still think it is not worth protecting the forwarding methods from
hogging the BQL, since there must be a limit sooner or later, but it is
something worth putting on the table again. In any case, this requires
changes in the next version for sure.

I can think of more scenarios, like the guest making available an
indirect descriptor of vq size that needs to be split into even more
sgs. Qemu already does not support more than 1024 sg buffers in
VirtQueue, but a driver (such as SVQ) must *not* create an indirect
descriptor chain longer than the Queue Size. Should we always increase
the vq size to 1024? I think these cases are highly unlikely, but again
these concerns must at least be commented on here.

Does it make sense?

Thanks!

> Thanks
>
> >
> >
> > > Thanks
> > >
> > > >
> > > > Thanks
> > > >
> > > > >
> > > > > Thanks!
> > > > >
> > > > > >
> > > > > > > +        /*
> > > > > > > +         * Map cannot be NULL since iova map contains all guest space and
> > > > > > > +         * qemu already has a physical address mapped
> > > > > > > +         */
> > > > > > > +        assert(map);
> > > > > > > +
> > > > > > > +        /*
> > > > > > > +         * Map->iova chunk size is ignored. What to do if descriptor
> > > > > > > +         * (addr, size) does not fit is delegated to the device.
> > > > > > > +         */
> > > > > > > +        off = needle.translated_addr - map->translated_addr;
> > > > > > > +        iovec[i].iov_base = (void *)(map->iova + off);
> > > > > > > +    }
> > > > > > > +}
> > > > > > > +
> > > > > > >   static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> > > > > > >                                       const struct iovec *iovec,
> > > > > > >                                       size_t num, bool more_descs, bool write)
> > > > > > > @@ -156,8 +217,9 @@ static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> > > > > > >   }
> > > > > > >
> > > > > > >   static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
> > > > > > > -                                    VirtQueueElement *elem)
> > > > > > > +                                    SVQElement *svq_elem)
> > > > > > >   {
> > > > > > > +    VirtQueueElement *elem = &svq_elem->elem;
> > > > > > >       int head;
> > > > > > >       unsigned avail_idx;
> > > > > > >       vring_avail_t *avail = svq->vring.avail;
> > > > > > > @@ -167,6 +229,12 @@ static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
> > > > > > >       /* We need some descriptors here */
> > > > > > >       assert(elem->out_num || elem->in_num);
> > > > > > >
> > > > > > > +    vhost_svq_stash_addr(&svq_elem->in_sg_stash, elem->in_sg, elem->in_num);
> > > > > > > +    vhost_svq_stash_addr(&svq_elem->out_sg_stash, elem->out_sg, elem->out_num);
> > > > > >
> > > > > >
> > > > > > I wonder if we can solve the trick like stash and unstash with a
> > > > > > dedicated sgs in svq_elem, instead of reusing the elem.
> > > > > >
> > > > >
> > > > > Actually yes, it would be way simpler to use a new sgs array in
> > > > > svq_elem. I will change that.
> > > > >
> > > > > Thanks!
> > > > >
> > > > > > Thanks
> > > > > >
> > > > > >
> > > > > > > +
> > > > > > > +    vhost_svq_translate_addr(svq, elem->in_sg, elem->in_num);
> > > > > > > +    vhost_svq_translate_addr(svq, elem->out_sg, elem->out_num);
> > > > > > > +
> > > > > > >       vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
> > > > > > >                               elem->in_num > 0, false);
> > > > > > >       vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
> > > > > > > @@ -187,7 +255,7 @@ static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
> > > > > > >
> > > > > > >   }
> > > > > > >
> > > > > > > -static void vhost_svq_add(VhostShadowVirtqueue *svq, VirtQueueElement *elem)
> > > > > > > +static void vhost_svq_add(VhostShadowVirtqueue *svq, SVQElement *elem)
> > > > > > >   {
> > > > > > >       unsigned qemu_head = vhost_svq_add_split(svq, elem);
> > > > > > >
> > > > > > > @@ -221,7 +289,7 @@ static void vhost_handle_guest_kick(EventNotifier *n)
> > > > > > >           }
> > > > > > >
> > > > > > >           while (true) {
> > > > > > > -            VirtQueueElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
> > > > > > > +            SVQElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
> > > > > > >               if (!elem) {
> > > > > > >                   break;
> > > > > > >               }
> > > > > > > @@ -247,7 +315,7 @@ static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
> > > > > > >       return svq->used_idx != svq->shadow_used_idx;
> > > > > > >   }
> > > > > > >
> > > > > > > -static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> > > > > > > +static SVQElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> > > > > > >   {
> > > > > > >       vring_desc_t *descs = svq->vring.desc;
> > > > > > >       const vring_used_t *used = svq->vring.used;
> > > > > > > @@ -279,7 +347,7 @@ static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> > > > > > >       descs[used_elem.id].next = svq->free_head;
> > > > > > >       svq->free_head = used_elem.id;
> > > > > > >
> > > > > > > -    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
> > > > > > > +    svq->ring_id_maps[used_elem.id]->elem.len = used_elem.len;
> > > > > > >       return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
> > > > > > >   }
> > > > > > >
> > > > > > > @@ -296,12 +364,19 @@ static void vhost_svq_handle_call_no_test(EventNotifier *n)
> > > > > > >
> > > > > > >           vhost_svq_set_notification(svq, false);
> > > > > > >           while (true) {
> > > > > > > -            g_autofree VirtQueueElement *elem = vhost_svq_get_buf(svq);
> > > > > > > -            if (!elem) {
> > > > > > > +            g_autofree SVQElement *svq_elem = vhost_svq_get_buf(svq);
> > > > > > > +            VirtQueueElement *elem;
> > > > > > > +            if (!svq_elem) {
> > > > > > >                   break;
> > > > > > >               }
> > > > > > >
> > > > > > >               assert(i < svq->vring.num);
> > > > > > > +            elem = &svq_elem->elem;
> > > > > > > +
> > > > > > > +            vhost_svq_unstash_addr(svq_elem->in_sg_stash, elem->in_sg,
> > > > > > > +                                   elem->in_num);
> > > > > > > +            vhost_svq_unstash_addr(svq_elem->out_sg_stash, elem->out_sg,
> > > > > > > +                                   elem->out_num);
> > > > > > >               virtqueue_fill(vq, elem, elem->len, i++);
> > > > > > >           }
> > > > > > >
> > > > > > > @@ -451,14 +526,24 @@ void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> > > > > > >       event_notifier_set_handler(&svq->host_notifier, NULL);
> > > > > > >
> > > > > > >       for (i = 0; i < svq->vring.num; ++i) {
> > > > > > > -        g_autofree VirtQueueElement *elem = svq->ring_id_maps[i];
> > > > > > > +        g_autofree SVQElement *svq_elem = svq->ring_id_maps[i];
> > > > > > > +        VirtQueueElement *elem;
> > > > > > > +
> > > > > > > +        if (!svq_elem) {
> > > > > > > +            continue;
> > > > > > > +        }
> > > > > > > +
> > > > > > > +        elem = &svq_elem->elem;
> > > > > > > +        vhost_svq_unstash_addr(svq_elem->in_sg_stash, elem->in_sg,
> > > > > > > +                               elem->in_num);
> > > > > > > +        vhost_svq_unstash_addr(svq_elem->out_sg_stash, elem->out_sg,
> > > > > > > +                               elem->out_num);
> > > > > > > +
> > > > > > >           /*
> > > > > > >            * Although the doc says we must unpop in order, it's ok to unpop
> > > > > > >            * everything.
> > > > > > >            */
> > > > > > > -        if (elem) {
> > > > > > > -            virtqueue_unpop(svq->vq, elem, elem->len);
> > > > > > > -        }
> > > > > > > +        virtqueue_unpop(svq->vq, elem, elem->len);
> > > > > > >       }
> > > > > > >   }
> > > > > > >
> > > > > > > @@ -466,7 +551,8 @@ void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> > > > > > >    * Creates vhost shadow virtqueue, and instruct vhost device to use the shadow
> > > > > > >    * methods and file descriptors.
> > > > > > >    */
> > > > > > > -VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> > > > > > > +VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx,
> > > > > > > +                                    VhostIOVATree *iova_map)
> > > > > > >   {
> > > > > > >       int vq_idx = dev->vq_index + idx;
> > > > > > >       unsigned num = virtio_queue_get_num(dev->vdev, vq_idx);
> > > > > > > @@ -500,11 +586,13 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> > > > > > >       memset(svq->vring.desc, 0, driver_size);
> > > > > > >       svq->vring.used = qemu_memalign(qemu_real_host_page_size, device_size);
> > > > > > >       memset(svq->vring.used, 0, device_size);
> > > > > > > +    svq->iova_map = iova_map;
> > > > > > > +
> > > > > > >       for (i = 0; i < num - 1; i++) {
> > > > > > >           svq->vring.desc[i].next = cpu_to_le16(i + 1);
> > > > > > >       }
> > > > > > >
> > > > > > > -    svq->ring_id_maps = g_new0(VirtQueueElement *, num);
> > > > > > > +    svq->ring_id_maps = g_new0(SVQElement *, num);
> > > > > > >       event_notifier_set_handler(&svq->call_notifier,
> > > > > > >                                  vhost_svq_handle_call);
> > > > > > >       return g_steal_pointer(&svq);
> > > > > > > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > > > > > > index a9c680b487..f5a12fee9d 100644
> > > > > > > --- a/hw/virtio/vhost-vdpa.c
> > > > > > > +++ b/hw/virtio/vhost-vdpa.c
> > > > > > > @@ -176,6 +176,18 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
> > > > > > >                                            vaddr, section->readonly);
> > > > > > >
> > > > > > >       llsize = int128_sub(llend, int128_make64(iova));
> > > > > > > +    if (v->shadow_vqs_enabled) {
> > > > > > > +        VhostDMAMap mem_region = {
> > > > > > > +            .translated_addr = vaddr,
> > > > > > > +            .size = int128_get64(llsize) - 1,
> > > > > > > +            .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> > > > > > > +        };
> > > > > > > +
> > > > > > > +        int r = vhost_iova_tree_alloc(v->iova_map, &mem_region);
> > > > > > > +        assert(r == VHOST_DMA_MAP_OK);
> > > > > > > +
> > > > > > > +        iova = mem_region.iova;
> > > > > > > +    }
> > > > > > >
> > > > > > >       ret = vhost_vdpa_dma_map(v, iova, int128_get64(llsize),
> > > > > > >                                vaddr, section->readonly);
> > > > > > > @@ -754,6 +766,23 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
> > > > > > >       return true;
> > > > > > >   }
> > > > > > >
> > > > > > > +static int vhost_vdpa_get_iova_range(struct vhost_dev *dev,
> > > > > > > +                                     hwaddr *first, hwaddr *last)
> > > > > > > +{
> > > > > > > +    int ret;
> > > > > > > +    struct vhost_vdpa_iova_range range;
> > > > > > > +
> > > > > > > +    ret = vhost_vdpa_call(dev, VHOST_VDPA_GET_IOVA_RANGE, &range);
> > > > > > > +    if (ret != 0) {
> > > > > > > +        return ret;
> > > > > > > +    }
> > > > > > > +
> > > > > > > +    *first = range.first;
> > > > > > > +    *last = range.last;
> > > > > > > +    trace_vhost_vdpa_get_iova_range(dev, *first, *last);
> > > > > > > +    return ret;
> > > > > > > +}
> > > > > > > +
> > > > > > >   /**
> > > > > > >    * Maps QEMU vaddr memory to device in a suitable way for shadow virtqueue:
> > > > > > >    * - It always reference qemu memory address, not guest's memory.
> > > > > > > @@ -881,6 +910,7 @@ static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
> > > > > > >   static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > > > > > >   {
> > > > > > >       struct vhost_dev *hdev = v->dev;
> > > > > > > +    hwaddr iova_first, iova_last;
> > > > > > >       unsigned n;
> > > > > > >       int r;
> > > > > > >
> > > > > > > @@ -894,7 +924,7 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > > > > > >           /* Allocate resources */
> > > > > > >           assert(v->shadow_vqs->len == 0);
> > > > > > >           for (n = 0; n < hdev->nvqs; ++n) {
> > > > > > > -            VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n);
> > > > > > > +            VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n, v->iova_map);
> > > > > > >               if (unlikely(!svq)) {
> > > > > > >                   g_ptr_array_set_size(v->shadow_vqs, 0);
> > > > > > >                   return 0;
> > > > > > > @@ -903,6 +933,8 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > > > > > >           }
> > > > > > >       }
> > > > > > >
> > > > > > > +    r = vhost_vdpa_get_iova_range(hdev, &iova_first, &iova_last);
> > > > > > > +    assert(r == 0);
> > > > > > >       r = vhost_vdpa_vring_pause(hdev);
> > > > > > >       assert(r == 0);
> > > > > > >
> > > > > > > @@ -913,6 +945,12 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > > > > > >           }
> > > > > > >       }
> > > > > > >
> > > > > > > +    memory_listener_unregister(&v->listener);
> > > > > > > +    if (vhost_vdpa_dma_unmap(v, iova_first,
> > > > > > > +                             (iova_last - iova_first) & TARGET_PAGE_MASK)) {
> > > > > > > +        error_report("Fail to invalidate device iotlb");
> > > > > > > +    }
> > > > > > > +
> > > > > > >       /* Reset device so it can be configured */
> > > > > > >       r = vhost_vdpa_dev_start(hdev, false);
> > > > > > >       assert(r == 0);
> > > > > > > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > > > > > > index 8ed19e9d0c..650e521e35 100644
> > > > > > > --- a/hw/virtio/trace-events
> > > > > > > +++ b/hw/virtio/trace-events
> > > > > > > @@ -52,6 +52,7 @@ vhost_vdpa_set_vring_call(void *dev, unsigned int index, int fd) "dev: %p index:
> > > > > > >   vhost_vdpa_get_features(void *dev, uint64_t features) "dev: %p features: 0x%"PRIx64
> > > > > > >   vhost_vdpa_set_owner(void *dev) "dev: %p"
> > > > > > >   vhost_vdpa_vq_get_addr(void *dev, void *vq, uint64_t desc_user_addr, uint64_t avail_user_addr, uint64_t used_user_addr) "dev: %p vq: %p desc_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" used_user_addr: 0x%"PRIx64
> > > > > > > +vhost_vdpa_get_iova_range(void *dev, uint64_t first, uint64_t last) "dev: %p first: 0x%"PRIx64" last: 0x%"PRIx64
> > > > > > >
> > > > > > >   # virtio.c
> > > > > > >   virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
> > > > > >
> > > > >
> > >
> >
>
* Re: [RFC PATCH v4 18/20] vhost: Add VhostIOVATree
  2021-10-20  9:01         ` Jason Wang
@ 2021-10-20 12:06           ` Eugenio Perez Martin
  2021-10-21  2:34             ` Jason Wang
  0 siblings, 1 reply; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-20 12:06 UTC (permalink / raw)
  To: Jason Wang; +Cc: qemu-devel
On Wed, Oct 20, 2021 at 11:01 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Oct 20, 2021 at 3:54 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Tue, Oct 19, 2021 at 11:23 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Tue, Oct 19, 2021 at 4:32 PM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > >
> > > > On 2021/10/1 at 3:06 PM, Eugenio Pérez wrote:
> > > > > This tree is able to look for a translated address from an IOVA address.
> > > > >
> > > > > At first glance is similar to util/iova-tree. However, SVQ working on
> > > > > devices with limited IOVA space need more capabilities, like allocating
> > > > > IOVA chunks or perform reverse translations (qemu addresses to iova).
> > > >
> > > >
> > > > I don't see any reverse translation is used in the shadow code. Or
> > > > anything I missed?
> > >
> > > Ok, it looks to me that it is used in the iova allocator. But I think
> > > it's better to decouple it to an independent allocator instead of
> > > vhost iova tree.
> > >
> >
> > Reverse translation is used every time a buffer is made available,
> > since buffers content are not copied, only the descriptors to SVQ
> > vring.
>
> I may miss something but I didn't see the code? Qemu knows the VA of
> virtqueue, and the VA of the VQ is stored in the VirtQueueElem?
>
It's used in patch 20/20; could that be the source of the misunderstanding?
The function that calls it is vhost_svq_translate_addr.

Qemu knows the VA of the buffer, but it must offer a valid SVQ IOVA to the
device. That is the translation I mean.
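
To illustrate the VA -> IOVA lookup, here is a minimal standalone sketch. It is
not the code from the series: SketchMap and sketch_vaddr_to_iova() are
hypothetical names, the maps live in a plain array instead of a tree, and
"size" is a byte length rather than the "length - 1" convention the patch uses
in region_add. It also reports a miss to the caller instead of asserting.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for VhostDMAMap. */
typedef struct SketchMap {
    uint64_t iova;      /* address the device will use */
    uintptr_t vaddr;    /* qemu virtual address backing the mapping */
    size_t size;        /* length of the mapping in bytes */
} SketchMap;

/*
 * Translate [vaddr, vaddr + len) to an IOVA.
 * Returns 0 and stores the result in *iova if a single mapping covers the
 * whole buffer, or -1 otherwise. Linear scan for brevity only.
 */
static int sketch_vaddr_to_iova(const SketchMap *maps, size_t n,
                                uintptr_t vaddr, size_t len, uint64_t *iova)
{
    assert(maps && iova);
    for (size_t i = 0; i < n; i++) {
        const SketchMap *m = &maps[i];

        /* The offset check is written so that it cannot overflow. */
        if (vaddr >= m->vaddr && len <= m->size &&
            vaddr - m->vaddr <= m->size - len) {
            *iova = m->iova + (vaddr - m->vaddr);
            return 0;
        }
    }
    return -1;
}
```

A buffer that straddles two mappings returns -1 here, which is the case where
one guest descriptor would have to become several SVQ descriptors.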
> >
> > At this point all the limits are copied to vhost iova tree in the next
> > revision I will send, defined at its creation at
> > vhost_iova_tree_new(). They are outside of util/iova-tree, only sent
> > to the latter at allocation time.
> >
> > Since vhost_iova_tree has its own vhost_iova_tree_alloc(), that wraps
> > the iova_tree_alloc() [1], limits could be kept in vhost-vdpa and make
> > them an argument of vhost_iova_tree_alloc. But I'm not sure if it's
> > what you are proposing or I'm missing something.
>
> If the reverse translation is only used in iova allocation, I meant to
> split the logic of IOVA allocation itself.
>
I still don't understand it, sorry :). During SVQ setup we allocate an IOVA
for every GPA the guest's driver can use. After that there should be no
further allocation unless memory is hotplugged.

So the limits are needed only at allocation time. I'm not sure if that is
what you mean here, but allocating first and then checking whether the result
is within the range could lead to false negatives: there could be a valid
free range in the address space, but the IOVA allocator returned another one
that fell outside the device's range. How could we know the cause if the
allocator is not using the range itself?
> >
> > Either way, I think it is harder to talk about this specific case
> > without code, since this one still does not address the limits. Would
> > you prefer me to send another RFC in WIP quality, with *not* all
> > comments addressed? I would say that there is not a lot of pending
> > work to send the next one, but it might be easier for all of us.
>
> I'd prefer to try to address them all, otherwise it's not easy to see
> what is missing.
>
Got it, I will do it that way then!
Thanks!
> Thanks
>
> >
> > Thanks!
> >
> > [1] This util/iova-tree method will be proposed in the next series,
> > and vhost_iova_tree wraps it since it needs to keep in sync both
> > trees: iova->qemu vaddr for iova allocation and the reverse one to
> > translate available buffers.
> >
> > > Thanks
> > >
> >
>
* Re: [RFC PATCH v4 18/20] vhost: Add VhostIOVATree
  2021-10-20 12:06           ` Eugenio Perez Martin
@ 2021-10-21  2:34             ` Jason Wang
  2021-10-21  7:03               ` Eugenio Perez Martin
  0 siblings, 1 reply; 90+ messages in thread
From: Jason Wang @ 2021-10-21  2:34 UTC (permalink / raw)
  To: Eugenio Perez Martin; +Cc: qemu-devel
On Wed, Oct 20, 2021 at 8:07 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Wed, Oct 20, 2021 at 11:01 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Wed, Oct 20, 2021 at 3:54 PM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Tue, Oct 19, 2021 at 11:23 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Tue, Oct 19, 2021 at 4:32 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > >
> > > > > On 2021/10/1 at 3:06 PM, Eugenio Pérez wrote:
> > > > > > This tree is able to look for a translated address from an IOVA address.
> > > > > >
> > > > > > At first glance is similar to util/iova-tree. However, SVQ working on
> > > > > > devices with limited IOVA space need more capabilities, like allocating
> > > > > > IOVA chunks or perform reverse translations (qemu addresses to iova).
> > > > >
> > > > >
> > > > > I don't see any reverse translation is used in the shadow code. Or
> > > > > anything I missed?
> > > >
> > > > Ok, it looks to me that it is used in the iova allocator. But I think
> > > > it's better to decouple it to an independent allocator instead of
> > > > vhost iova tree.
> > > >
> > >
> > > Reverse translation is used every time a buffer is made available,
> > > since buffers content are not copied, only the descriptors to SVQ
> > > vring.
> >
> > I may miss something but I didn't see the code? Qemu knows the VA of
> > virtqueue, and the VA of the VQ is stored in the VirtQueueElem?
> >
>
> It's used in the patch 20/20, could that be the misunderstanding? The
> function calling it is vhost_svq_translate_addr.
>
> Qemu knows the VA address of the buffer, but it must offer a valid SVQ
> iova to the device. That is the translation I mean.
Ok, I get you. So if I understand correctly, what you did is:

1) allocate IOVA during region_add
2) perform VA->IOVA reverse lookup in handle_kick

This should be fine, but here are some suggestions:

1) remove the assert(map) in vhost_svq_translate_addr(), since the guest
can add e.g. a BAR address
2) we probably need a better name for vhost_iova_tree_alloc(), maybe
"vhost_iova_tree_map_alloc()"

There's actually another method:

1) don't do IOVA/map allocation in region_add()
2) do the allocation in handle_kick(), then we know the IOVA so no
reverse lookup is needed

The advantage is that this can work for the case of vIOMMU. And they
should perform the same:

1) your method avoids the IOVA allocation per sg
2) my method avoids the reverse lookup per sg
>
> > >
> > > At this point all the limits are copied to vhost iova tree in the next
> > > revision I will send, defined at its creation at
> > > vhost_iova_tree_new(). They are outside of util/iova-tree, only sent
> > > to the latter at allocation time.
> > >
> > > Since vhost_iova_tree has its own vhost_iova_tree_alloc(), that wraps
> > > the iova_tree_alloc() [1], limits could be kept in vhost-vdpa and make
> > > them an argument of vhost_iova_tree_alloc. But I'm not sure if it's
> > > what you are proposing or I'm missing something.
> >
> > If the reverse translation is only used in iova allocation, I meant to
> > split the logic of IOVA allocation itself.
> >
>
> Still don't understand it, sorry :). In SVQ setup we allocate an iova
> address for every guest's GPA address its driver can use. After that
> there should be no allocation unless memory is hotplugged.
>
> So the limits are only needed precisely at allocation time. Not sure
> if that is what you mean here, but to first allocate and then check if
> it is within the range could lead to false negatives, since there
> could be a valid range *in* the address but the iova allocator
> returned us another range that fell outside the range. How could we
> know the cause if it is not using the range itself?
See my reply above. We can also teach the IOVA allocator to return only
IOVAs within the range that vhost-vDPA supports.
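
A rough sketch of what a range-aware allocator could look like, in plain C.
This is not the util/iova-tree API: SketchRange and sketch_iova_alloc() are
hypothetical names, the used chunks are kept in a sorted array instead of a
tree, and [lo, hi] stands in for the range reported by
VHOST_VDPA_GET_IOVA_RANGE.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical allocated chunk; [first, last] is inclusive. */
typedef struct SketchRange {
    uint64_t first;
    uint64_t last;
} SketchRange;

/*
 * First-fit search for "size" bytes inside [lo, hi], skipping the chunks in
 * "used", which must be sorted by address and non-overlapping.
 * Returns 0 and stores the start in *iova, or -1 if nothing fits.
 */
static int sketch_iova_alloc(const SketchRange *used, size_t n,
                             uint64_t lo, uint64_t hi, uint64_t size,
                             uint64_t *iova)
{
    uint64_t start = lo;

    assert(size > 0 && lo <= hi);
    for (size_t i = 0; i < n; i++) {
        if (used[i].last < start) {
            continue;               /* chunk lies below the search window */
        }
        if (used[i].first > hi) {
            break;                  /* chunk lies above the device range */
        }
        if (used[i].first > start && used[i].first - start >= size) {
            *iova = start;          /* hole before this chunk is big enough */
            return 0;
        }
        if (used[i].last >= hi) {
            return -1;              /* chunk reaches the end of the range */
        }
        start = used[i].last + 1;
    }
    /* Tail hole [start, hi]; written to avoid overflow near UINT64_MAX. */
    if (hi - start >= size - 1) {
        *iova = start;
        return 0;
    }
    return -1;
}
```

With chunks in use at 0x1000-0x1fff and 0x3000-0x3fff and a device range of
[0x1000, 0x4fff], a 0x1000-byte request yields 0x2000. An allocator unaware of
the range could pick 0x0 instead, and a later range check would reject it even
though a valid hole exists.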
Thanks
>
> > >
> > > Either way, I think it is harder to talk about this specific case
> > > without code, since this one still does not address the limits. Would
> > > you prefer me to send another RFC in WIP quality, with *not* all
> > > comments addressed? I would say that there is not a lot of pending
> > > work to send the next one, but it might be easier for all of us.
> >
> > I'd prefer to try to address them all, otherwise it's not easy to see
> > what is missing.
> >
>
> Got it, I will do it that way then!
>
> Thanks!
>
> > Thanks
> >
> > >
> > > Thanks!
> > >
> > > [1] This util/iova-tree method will be proposed in the next series,
> > > and vhost_iova_tree wraps it since it needs to keep in sync both
> > > trees: iova->qemu vaddr for iova allocation and the reverse one to
> > > translate available buffers.
> > >
> > > > Thanks
> > > >
> > >
> >
>
* Re: [RFC PATCH v4 20/20] vdpa: Add custom IOTLB translations to SVQ
  2021-10-20 11:56               ` Eugenio Perez Martin
@ 2021-10-21  2:38                 ` Jason Wang
  2021-10-26  4:32                 ` Jason Wang
  1 sibling, 0 replies; 90+ messages in thread
From: Jason Wang @ 2021-10-21  2:38 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-devel, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Michael Lilja, Stefano Garzarella
On Wed, Oct 20, 2021 at 7:57 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Wed, Oct 20, 2021 at 11:03 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Wed, Oct 20, 2021 at 2:52 PM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Wed, Oct 20, 2021 at 4:07 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Wed, Oct 20, 2021 at 10:02 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Tue, Oct 19, 2021 at 6:29 PM Eugenio Perez Martin
> > > > > <eperezma@redhat.com> wrote:
> > > > > >
> > > > > > On Tue, Oct 19, 2021 at 11:25 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > >
> > > > > > >
> > > > > > > On 2021/10/1 at 3:06 PM, Eugenio Pérez wrote:
> > > > > > > > Use translations added in VhostIOVATree in SVQ.
> > > > > > > >
> > > > > > > > Now every element needs to store the previous address also, so VirtQueue
> > > > > > > > can consume the elements properly. This adds a little overhead per VQ
> > > > > > > > element, having to allocate more memory to stash them. As a possible
> > > > > > > > optimization, this allocation could be avoided if the descriptor is not
> > > > > > > > a chain but a single one, but this is left undone.
> > > > > > > >
> > > > > > > > TODO: iova range should be queried before, and add logic to fail when
> > > > > > > > GPA is outside of its range and memory listener or svq add it.
> > > > > > > >
> > > > > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > > > > > ---
> > > > > > > >   hw/virtio/vhost-shadow-virtqueue.h |   4 +-
> > > > > > > >   hw/virtio/vhost-shadow-virtqueue.c | 130 ++++++++++++++++++++++++-----
> > > > > > > >   hw/virtio/vhost-vdpa.c             |  40 ++++++++-
> > > > > > > >   hw/virtio/trace-events             |   1 +
> > > > > > > >   4 files changed, 152 insertions(+), 23 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> > > > > > > > index b7baa424a7..a0e6b5267a 100644
> > > > > > > > --- a/hw/virtio/vhost-shadow-virtqueue.h
> > > > > > > > +++ b/hw/virtio/vhost-shadow-virtqueue.h
> > > > > > > > @@ -11,6 +11,7 @@
> > > > > > > >   #define VHOST_SHADOW_VIRTQUEUE_H
> > > > > > > >
> > > > > > > >   #include "hw/virtio/vhost.h"
> > > > > > > > +#include "hw/virtio/vhost-iova-tree.h"
> > > > > > > >
> > > > > > > >   typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
> > > > > > > >
> > > > > > > > @@ -28,7 +29,8 @@ bool vhost_svq_start(struct vhost_dev *dev, unsigned idx,
> > > > > > > >   void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> > > > > > > >                       VhostShadowVirtqueue *svq);
> > > > > > > >
> > > > > > > > -VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx);
> > > > > > > > +VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx,
> > > > > > > > +                                    VhostIOVATree *iova_map);
> > > > > > > >
> > > > > > > >   void vhost_svq_free(VhostShadowVirtqueue *vq);
> > > > > > > >
> > > > > > > > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > > > > > > > index 2fd0bab75d..9db538547e 100644
> > > > > > > > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > > > > > > > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > > > > > > > @@ -11,12 +11,19 @@
> > > > > > > >   #include "hw/virtio/vhost-shadow-virtqueue.h"
> > > > > > > >   #include "hw/virtio/vhost.h"
> > > > > > > >   #include "hw/virtio/virtio-access.h"
> > > > > > > > +#include "hw/virtio/vhost-iova-tree.h"
> > > > > > > >
> > > > > > > >   #include "standard-headers/linux/vhost_types.h"
> > > > > > > >
> > > > > > > >   #include "qemu/error-report.h"
> > > > > > > >   #include "qemu/main-loop.h"
> > > > > > > >
> > > > > > > > +typedef struct SVQElement {
> > > > > > > > +    VirtQueueElement elem;
> > > > > > > > +    void **in_sg_stash;
> > > > > > > > +    void **out_sg_stash;
> > > > > > > > +} SVQElement;
> > > > > > > > +
> > > > > > > >   /* Shadow virtqueue to relay notifications */
> > > > > > > >   typedef struct VhostShadowVirtqueue {
> > > > > > > >       /* Shadow vring */
> > > > > > > > @@ -46,8 +53,11 @@ typedef struct VhostShadowVirtqueue {
> > > > > > > >       /* Virtio device */
> > > > > > > >       VirtIODevice *vdev;
> > > > > > > >
> > > > > > > > +    /* IOVA mapping if used */
> > > > > > > > +    VhostIOVATree *iova_map;
> > > > > > > > +
> > > > > > > >       /* Map for returning guest's descriptors */
> > > > > > > > -    VirtQueueElement **ring_id_maps;
> > > > > > > > +    SVQElement **ring_id_maps;
> > > > > > > >
> > > > > > > >       /* Next head to expose to device */
> > > > > > > >       uint16_t avail_idx_shadow;
> > > > > > > > @@ -79,13 +89,6 @@ bool vhost_svq_valid_device_features(uint64_t *dev_features)
> > > > > > > >               continue;
> > > > > > > >
> > > > > > > >           case VIRTIO_F_ACCESS_PLATFORM:
> > > > > > > > -            /* SVQ needs this feature disabled. Can't continue */
> > > > > > > > -            if (*dev_features & BIT_ULL(b)) {
> > > > > > > > -                clear_bit(b, dev_features);
> > > > > > > > -                r = false;
> > > > > > > > -            }
> > > > > > > > -            break;
> > > > > > > > -
> > > > > > > >           case VIRTIO_F_VERSION_1:
> > > > > > > >               /* SVQ needs this feature, so can't continue */
> > > > > > > >               if (!(*dev_features & BIT_ULL(b))) {
> > > > > > > > @@ -126,6 +129,64 @@ static void vhost_svq_set_notification(VhostShadowVirtqueue *svq, bool enable)
> > > > > > > >       }
> > > > > > > >   }
> > > > > > > >
> > > > > > > > +static void vhost_svq_stash_addr(void ***stash, const struct iovec *iov,
> > > > > > > > +                                 size_t num)
> > > > > > > > +{
> > > > > > > > +    size_t i;
> > > > > > > > +
> > > > > > > > +    if (num == 0) {
> > > > > > > > +        return;
> > > > > > > > +    }
> > > > > > > > +
> > > > > > > > +    *stash = g_new(void *, num);
> > > > > > > > +    for (i = 0; i < num; ++i) {
> > > > > > > > +        (*stash)[i] = iov[i].iov_base;
> > > > > > > > +    }
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +static void vhost_svq_unstash_addr(void **stash, struct iovec *iov, size_t num)
> > > > > > > > +{
> > > > > > > > +    size_t i;
> > > > > > > > +
> > > > > > > > +    if (num == 0) {
> > > > > > > > +        return;
> > > > > > > > +    }
> > > > > > > > +
> > > > > > > > +    for (i = 0; i < num; ++i) {
> > > > > > > > +        iov[i].iov_base = stash[i];
> > > > > > > > +    }
> > > > > > > > +    g_free(stash);
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +static void vhost_svq_translate_addr(const VhostShadowVirtqueue *svq,
> > > > > > > > +                                     struct iovec *iovec, size_t num)
> > > > > > > > +{
> > > > > > > > +    size_t i;
> > > > > > > > +
> > > > > > > > +    for (i = 0; i < num; ++i) {
> > > > > > > > +        VhostDMAMap needle = {
> > > > > > > > +            .translated_addr = iovec[i].iov_base,
> > > > > > > > +            .size = iovec[i].iov_len,
> > > > > > > > +        };
> > > > > > > > +        size_t off;
> > > > > > > > +
> > > > > > > > +        const VhostDMAMap *map = vhost_iova_tree_find_iova(svq->iova_map,
> > > > > > > > +                                                           &needle);
> > > > > > >
> > > > > > >
> > > > > > > Is it possible that we end up with more than one maps here?
> > > > > > >
> > > > > >
> > > > > > Actually it is possible, since there is no guarantee that one
> > > > > > descriptor (or indirect descriptor) maps exactly to one iov. It could
> > > > > > map to many if qemu vaddr is not contiguous but GPA + size is. This is
> > > > > > something that must be fixed for the next revision, so thanks for
> > > > > > pointing it out!
> > > > > >
> > > > > > Taking that into account, the condition that svq vring avail_idx -
> > > > > > used_idx was always less or equal than guest's vring avail_idx -
> > > > > > used_idx is not true anymore. Checking for that before adding buffers
> > > > > > to SVQ is the easy part, but how could we recover in that case?
> > > > > >
> > > > > > I think that the easy solution is to check for more available buffers
> > > > > > unconditionally at the end of vhost_svq_handle_call, which handles the
> > > > > > SVQ used and is supposed to make more room for available buffers. So
> > > > > > vhost_handle_guest_kick would not check if eventfd is set or not
> > > > > > anymore.
> > > > > >
> > > > > > Would that make sense?
> > > > >
> > > > > Yes, I think it should work.
> > > >
> > > > Btw, I wonder how to handle indirect descriptors. SVQ doesn't use
> > > > indirect descriptors for now, but it looks like a must otherwise we
> > > > may end up SVQ is full before VQ.
> > > >
> > >
> > > We can get to that situation without indirect too, if a single
> > > descriptor maps to more than one sg buffer. The next revision is going
> > > to control that too.
> > >
> > > > It looks to me an easy way is to always use indirect descriptors if #sg >= 2?
> > > >
> > >
> > > I will use that, but that does not solve the case where a descriptor
> > > maps to > 1 different buffers in qemu vaddr.
> >
> > Right, so we need to deal with the case when SVQ is out of space.
> >
> >
> > > So I think that some
> > > check after marking descriptors as used is a must somehow.
> >
> > I thought it should be before processing the available buffer?
>
> Yes, I meant after that. Somehow, because I include checking the
> number of sg buffers as "processing". :).
>
> > It's
> > the guest driver that make sure there's sufficient space for used
> > ring?
> >
>
> (I think we are talking the same with different words, but just in
> case I will develop the idea here with an example).
>
> The guest is able to check if there is enough space in the SVQ's
> vring, but not in the device's vring. As an example of this, imagine
> that a guest makes available a GPA contiguous buffer of 64K, one
> descriptor. However, this memory is divided into 16 chunks of 4K in
> qemu's VA space. Imagine that at this moment there are only eight
> slots free in each vring, and that neither communication is using
> indirect descriptors.
>
> The guest only needs 1 descriptor available to make that buffer
> available, so it will add it to the avail ring. But SVQ needs 16 chained
> descriptors, so the buffer is not going to reach the device until the
> device marks at least 8 more descriptors as used. SVQ checked for the
> amount of available room, as you said, but it cannot forward the
> available one.
>
> Since the guest already sent kick when it made the descriptor
> available, we need another mechanism to know when we have all the
> needed free slots in the SVQ vring. And that's what I meant with the
> check after marking some buffers as available.
>
> I still think it is not worth protecting the forwarding methods from
> hogging the BQL, since there must be a limit sooner or later, but it is
> something that is worth putting on the table again. But this requires
> changes for the next version for sure.
Ok.
>
> I can think of more scenarios, like the guest making available an indirect
> descriptor of vq size that needs to be split into even more sgs. Qemu
> already does not support more than 1024 sg buffers in a VirtQueue, but
> a driver (such as SVQ) must *not* create an indirect descriptor chain
> longer than the Queue Size. Should we always increase the vq size to
> 1024? I think these are highly unlikely, but again these concerns
> must be at least commented here.
>
> Does it make sense?
Right. So I think the SVQ code should be ready to handle all those cases.
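As a rough sketch of the first case above (the element needs 16
descriptors but only 8 are free), SVQ could park the element and retry
it whenever the device returns descriptors. Everything below (DemoSVQ,
demo_try_add, demo_on_used) is hypothetical bookkeeping for
illustration, not the structures or functions of this series:

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-queue bookkeeping, not VhostShadowVirtqueue. */
typedef struct {
    unsigned num;           /* SVQ vring size */
    unsigned in_flight;     /* descriptors currently exposed to the device */
    bool has_pending;       /* one element is parked waiting for room */
    unsigned pending_descs; /* descriptors the parked element needs */
} DemoSVQ;

static unsigned demo_free_slots(const DemoSVQ *svq)
{
    return svq->num - svq->in_flight;
}

/*
 * Try to forward an element that needs ndescs chained descriptors.
 * If it does not fit, park it: the guest already kicked, so nobody
 * will tell us again that this element exists.
 */
static bool demo_try_add(DemoSVQ *svq, unsigned ndescs)
{
    assert(ndescs <= svq->num);
    if (ndescs > demo_free_slots(svq)) {
        svq->has_pending = true;
        svq->pending_descs = ndescs;
        return false;
    }
    svq->in_flight += ndescs;
    return true;
}

/*
 * Called after the device marks ndescs descriptors as used. Returns
 * true if the parked element could be forwarded with the freed room.
 */
static bool demo_on_used(DemoSVQ *svq, unsigned ndescs)
{
    assert(ndescs <= svq->in_flight);
    svq->in_flight -= ndescs;
    if (svq->has_pending && svq->pending_descs <= demo_free_slots(svq)) {
        svq->has_pending = false;
        svq->in_flight += svq->pending_descs;
        return true;
    }
    return false;
}
```

The point of the sketch is only that the retry has to be driven from
the used-buffer path, since no further guest kick is guaranteed.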
Thanks
>
> Thanks!
>
> > Thanks
> >
> > >
> > >
> > > > Thanks
> > > >
> > > > >
> > > > > Thanks
> > > > >
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > >
> > > > > > > > +        /*
> > > > > > > > +         * Map cannot be NULL since iova map contains all guest space and
> > > > > > > > +         * qemu already has a physical address mapped
> > > > > > > > +         */
> > > > > > > > +        assert(map);
> > > > > > > > +
> > > > > > > > +        /*
> > > > > > > > +         * Map->iova chunk size is ignored. What to do if descriptor
> > > > > > > > +         * (addr, size) does not fit is delegated to the device.
> > > > > > > > +         */
> > > > > > > > +        off = needle.translated_addr - map->translated_addr;
> > > > > > > > +        iovec[i].iov_base = (void *)(map->iova + off);
> > > > > > > > +    }
> > > > > > > > +}
> > > > > > > > +
> > > > > > > >   static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> > > > > > > >                                       const struct iovec *iovec,
> > > > > > > >                                       size_t num, bool more_descs, bool write)
> > > > > > > > @@ -156,8 +217,9 @@ static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> > > > > > > >   }
> > > > > > > >
> > > > > > > >   static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
> > > > > > > > -                                    VirtQueueElement *elem)
> > > > > > > > +                                    SVQElement *svq_elem)
> > > > > > > >   {
> > > > > > > > +    VirtQueueElement *elem = &svq_elem->elem;
> > > > > > > >       int head;
> > > > > > > >       unsigned avail_idx;
> > > > > > > >       vring_avail_t *avail = svq->vring.avail;
> > > > > > > > @@ -167,6 +229,12 @@ static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
> > > > > > > >       /* We need some descriptors here */
> > > > > > > >       assert(elem->out_num || elem->in_num);
> > > > > > > >
> > > > > > > > +    vhost_svq_stash_addr(&svq_elem->in_sg_stash, elem->in_sg, elem->in_num);
> > > > > > > > +    vhost_svq_stash_addr(&svq_elem->out_sg_stash, elem->out_sg, elem->out_num);
> > > > > > >
> > > > > > >
> > > > > > > I wonder if we can solve the trick like stash and unstash with a
> > > > > > > dedicated sgs in svq_elem, instead of reusing the elem.
> > > > > > >
> > > > > >
> > > > > > Actually yes, it would be way simpler to use a new sgs array in
> > > > > > svq_elem. I will change that.
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > >
> > > > > > > > +
> > > > > > > > +    vhost_svq_translate_addr(svq, elem->in_sg, elem->in_num);
> > > > > > > > +    vhost_svq_translate_addr(svq, elem->out_sg, elem->out_num);
> > > > > > > > +
> > > > > > > >       vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
> > > > > > > >                               elem->in_num > 0, false);
> > > > > > > >       vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
> > > > > > > > @@ -187,7 +255,7 @@ static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
> > > > > > > >
> > > > > > > >   }
> > > > > > > >
> > > > > > > > -static void vhost_svq_add(VhostShadowVirtqueue *svq, VirtQueueElement *elem)
> > > > > > > > +static void vhost_svq_add(VhostShadowVirtqueue *svq, SVQElement *elem)
> > > > > > > >   {
> > > > > > > >       unsigned qemu_head = vhost_svq_add_split(svq, elem);
> > > > > > > >
> > > > > > > > @@ -221,7 +289,7 @@ static void vhost_handle_guest_kick(EventNotifier *n)
> > > > > > > >           }
> > > > > > > >
> > > > > > > >           while (true) {
> > > > > > > > -            VirtQueueElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
> > > > > > > > +            SVQElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
> > > > > > > >               if (!elem) {
> > > > > > > >                   break;
> > > > > > > >               }
> > > > > > > > @@ -247,7 +315,7 @@ static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
> > > > > > > >       return svq->used_idx != svq->shadow_used_idx;
> > > > > > > >   }
> > > > > > > >
> > > > > > > > -static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> > > > > > > > +static SVQElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> > > > > > > >   {
> > > > > > > >       vring_desc_t *descs = svq->vring.desc;
> > > > > > > >       const vring_used_t *used = svq->vring.used;
> > > > > > > > @@ -279,7 +347,7 @@ static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> > > > > > > >       descs[used_elem.id].next = svq->free_head;
> > > > > > > >       svq->free_head = used_elem.id;
> > > > > > > >
> > > > > > > > -    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
> > > > > > > > +    svq->ring_id_maps[used_elem.id]->elem.len = used_elem.len;
> > > > > > > >       return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
> > > > > > > >   }
> > > > > > > >
> > > > > > > > @@ -296,12 +364,19 @@ static void vhost_svq_handle_call_no_test(EventNotifier *n)
> > > > > > > >
> > > > > > > >           vhost_svq_set_notification(svq, false);
> > > > > > > >           while (true) {
> > > > > > > > -            g_autofree VirtQueueElement *elem = vhost_svq_get_buf(svq);
> > > > > > > > -            if (!elem) {
> > > > > > > > +            g_autofree SVQElement *svq_elem = vhost_svq_get_buf(svq);
> > > > > > > > +            VirtQueueElement *elem;
> > > > > > > > +            if (!svq_elem) {
> > > > > > > >                   break;
> > > > > > > >               }
> > > > > > > >
> > > > > > > >               assert(i < svq->vring.num);
> > > > > > > > +            elem = &svq_elem->elem;
> > > > > > > > +
> > > > > > > > +            vhost_svq_unstash_addr(svq_elem->in_sg_stash, elem->in_sg,
> > > > > > > > +                                   elem->in_num);
> > > > > > > > +            vhost_svq_unstash_addr(svq_elem->out_sg_stash, elem->out_sg,
> > > > > > > > +                                   elem->out_num);
> > > > > > > >               virtqueue_fill(vq, elem, elem->len, i++);
> > > > > > > >           }
> > > > > > > >
> > > > > > > > @@ -451,14 +526,24 @@ void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> > > > > > > >       event_notifier_set_handler(&svq->host_notifier, NULL);
> > > > > > > >
> > > > > > > >       for (i = 0; i < svq->vring.num; ++i) {
> > > > > > > > -        g_autofree VirtQueueElement *elem = svq->ring_id_maps[i];
> > > > > > > > +        g_autofree SVQElement *svq_elem = svq->ring_id_maps[i];
> > > > > > > > +        VirtQueueElement *elem;
> > > > > > > > +
> > > > > > > > +        if (!svq_elem) {
> > > > > > > > +            continue;
> > > > > > > > +        }
> > > > > > > > +
> > > > > > > > +        elem = &svq_elem->elem;
> > > > > > > > +        vhost_svq_unstash_addr(svq_elem->in_sg_stash, elem->in_sg,
> > > > > > > > +                               elem->in_num);
> > > > > > > > +        vhost_svq_unstash_addr(svq_elem->out_sg_stash, elem->out_sg,
> > > > > > > > +                               elem->out_num);
> > > > > > > > +
> > > > > > > >           /*
> > > > > > > >            * Although the doc says we must unpop in order, it's ok to unpop
> > > > > > > >            * everything.
> > > > > > > >            */
> > > > > > > > -        if (elem) {
> > > > > > > > -            virtqueue_unpop(svq->vq, elem, elem->len);
> > > > > > > > -        }
> > > > > > > > +        virtqueue_unpop(svq->vq, elem, elem->len);
> > > > > > > >       }
> > > > > > > >   }
> > > > > > > >
> > > > > > > > @@ -466,7 +551,8 @@ void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> > > > > > > >    * Creates vhost shadow virtqueue, and instruct vhost device to use the shadow
> > > > > > > >    * methods and file descriptors.
> > > > > > > >    */
> > > > > > > > -VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> > > > > > > > +VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx,
> > > > > > > > +                                    VhostIOVATree *iova_map)
> > > > > > > >   {
> > > > > > > >       int vq_idx = dev->vq_index + idx;
> > > > > > > >       unsigned num = virtio_queue_get_num(dev->vdev, vq_idx);
> > > > > > > > @@ -500,11 +586,13 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> > > > > > > >       memset(svq->vring.desc, 0, driver_size);
> > > > > > > >       svq->vring.used = qemu_memalign(qemu_real_host_page_size, device_size);
> > > > > > > >       memset(svq->vring.used, 0, device_size);
> > > > > > > > +    svq->iova_map = iova_map;
> > > > > > > > +
> > > > > > > >       for (i = 0; i < num - 1; i++) {
> > > > > > > >           svq->vring.desc[i].next = cpu_to_le16(i + 1);
> > > > > > > >       }
> > > > > > > >
> > > > > > > > -    svq->ring_id_maps = g_new0(VirtQueueElement *, num);
> > > > > > > > +    svq->ring_id_maps = g_new0(SVQElement *, num);
> > > > > > > >       event_notifier_set_handler(&svq->call_notifier,
> > > > > > > >                                  vhost_svq_handle_call);
> > > > > > > >       return g_steal_pointer(&svq);
> > > > > > > > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > > > > > > > index a9c680b487..f5a12fee9d 100644
> > > > > > > > --- a/hw/virtio/vhost-vdpa.c
> > > > > > > > +++ b/hw/virtio/vhost-vdpa.c
> > > > > > > > @@ -176,6 +176,18 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
> > > > > > > >                                            vaddr, section->readonly);
> > > > > > > >
> > > > > > > >       llsize = int128_sub(llend, int128_make64(iova));
> > > > > > > > +    if (v->shadow_vqs_enabled) {
> > > > > > > > +        VhostDMAMap mem_region = {
> > > > > > > > +            .translated_addr = vaddr,
> > > > > > > > +            .size = int128_get64(llsize) - 1,
> > > > > > > > +            .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> > > > > > > > +        };
> > > > > > > > +
> > > > > > > > +        int r = vhost_iova_tree_alloc(v->iova_map, &mem_region);
> > > > > > > > +        assert(r == VHOST_DMA_MAP_OK);
> > > > > > > > +
> > > > > > > > +        iova = mem_region.iova;
> > > > > > > > +    }
> > > > > > > >
> > > > > > > >       ret = vhost_vdpa_dma_map(v, iova, int128_get64(llsize),
> > > > > > > >                                vaddr, section->readonly);
> > > > > > > > @@ -754,6 +766,23 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
> > > > > > > >       return true;
> > > > > > > >   }
> > > > > > > >
> > > > > > > > +static int vhost_vdpa_get_iova_range(struct vhost_dev *dev,
> > > > > > > > +                                     hwaddr *first, hwaddr *last)
> > > > > > > > +{
> > > > > > > > +    int ret;
> > > > > > > > +    struct vhost_vdpa_iova_range range;
> > > > > > > > +
> > > > > > > > +    ret = vhost_vdpa_call(dev, VHOST_VDPA_GET_IOVA_RANGE, &range);
> > > > > > > > +    if (ret != 0) {
> > > > > > > > +        return ret;
> > > > > > > > +    }
> > > > > > > > +
> > > > > > > > +    *first = range.first;
> > > > > > > > +    *last = range.last;
> > > > > > > > +    trace_vhost_vdpa_get_iova_range(dev, *first, *last);
> > > > > > > > +    return ret;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > >   /**
> > > > > > > >    * Maps QEMU vaddr memory to device in a suitable way for shadow virtqueue:
> > > > > > > >    * - It always reference qemu memory address, not guest's memory.
> > > > > > > > @@ -881,6 +910,7 @@ static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
> > > > > > > >   static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > > > > > > >   {
> > > > > > > >       struct vhost_dev *hdev = v->dev;
> > > > > > > > +    hwaddr iova_first, iova_last;
> > > > > > > >       unsigned n;
> > > > > > > >       int r;
> > > > > > > >
> > > > > > > > @@ -894,7 +924,7 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > > > > > > >           /* Allocate resources */
> > > > > > > >           assert(v->shadow_vqs->len == 0);
> > > > > > > >           for (n = 0; n < hdev->nvqs; ++n) {
> > > > > > > > -            VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n);
> > > > > > > > +            VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n, v->iova_map);
> > > > > > > >               if (unlikely(!svq)) {
> > > > > > > >                   g_ptr_array_set_size(v->shadow_vqs, 0);
> > > > > > > >                   return 0;
> > > > > > > > @@ -903,6 +933,8 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > > > > > > >           }
> > > > > > > >       }
> > > > > > > >
> > > > > > > > +    r = vhost_vdpa_get_iova_range(hdev, &iova_first, &iova_last);
> > > > > > > > +    assert(r == 0);
> > > > > > > >       r = vhost_vdpa_vring_pause(hdev);
> > > > > > > >       assert(r == 0);
> > > > > > > >
> > > > > > > > @@ -913,6 +945,12 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > > > > > > >           }
> > > > > > > >       }
> > > > > > > >
> > > > > > > > +    memory_listener_unregister(&v->listener);
> > > > > > > > +    if (vhost_vdpa_dma_unmap(v, iova_first,
> > > > > > > > +                             (iova_last - iova_first) & TARGET_PAGE_MASK)) {
> > > > > > > > +        error_report("Fail to invalidate device iotlb");
> > > > > > > > +    }
> > > > > > > > +
> > > > > > > >       /* Reset device so it can be configured */
> > > > > > > >       r = vhost_vdpa_dev_start(hdev, false);
> > > > > > > >       assert(r == 0);
> > > > > > > > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > > > > > > > index 8ed19e9d0c..650e521e35 100644
> > > > > > > > --- a/hw/virtio/trace-events
> > > > > > > > +++ b/hw/virtio/trace-events
> > > > > > > > @@ -52,6 +52,7 @@ vhost_vdpa_set_vring_call(void *dev, unsigned int index, int fd) "dev: %p index:
> > > > > > > >   vhost_vdpa_get_features(void *dev, uint64_t features) "dev: %p features: 0x%"PRIx64
> > > > > > > >   vhost_vdpa_set_owner(void *dev) "dev: %p"
> > > > > > > >   vhost_vdpa_vq_get_addr(void *dev, void *vq, uint64_t desc_user_addr, uint64_t avail_user_addr, uint64_t used_user_addr) "dev: %p vq: %p desc_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" used_user_addr: 0x%"PRIx64
> > > > > > > > +vhost_vdpa_get_iova_range(void *dev, uint64_t first, uint64_t last) "dev: %p first: 0x%"PRIx64" last: 0x%"PRIx64
> > > > > > > >
> > > > > > > >   # virtio.c
> > > > > > > >   virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
> > > > > > >
> > > > > >
> > > >
> > >
> >
>
* Re: [RFC PATCH v4 18/20] vhost: Add VhostIOVATree
  2021-10-21  2:34             ` Jason Wang
@ 2021-10-21  7:03               ` Eugenio Perez Martin
  2021-10-21  8:12                 ` Jason Wang
  0 siblings, 1 reply; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-21  7:03 UTC (permalink / raw)
  To: Jason Wang; +Cc: qemu-devel
On Thu, Oct 21, 2021 at 4:34 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Wed, Oct 20, 2021 at 8:07 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Wed, Oct 20, 2021 at 11:01 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Wed, Oct 20, 2021 at 3:54 PM Eugenio Perez Martin
> > > <eperezma@redhat.com> wrote:
> > > >
> > > > On Tue, Oct 19, 2021 at 11:23 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Tue, Oct 19, 2021 at 4:32 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > >
> > > > > > On 2021/10/1 3:06 PM, Eugenio Pérez wrote:
> > > > > > > This tree is able to look for a translated address from an IOVA address.
> > > > > > >
> > > > > > > At first glance is similar to util/iova-tree. However, SVQ working on
> > > > > > > devices with limited IOVA space need more capabilities, like allocating
> > > > > > > IOVA chunks or perform reverse translations (qemu addresses to iova).
> > > > > >
> > > > > >
> > > > > > I don't see any reverse translation is used in the shadow code. Or
> > > > > > anything I missed?
> > > > >
> > > > > Ok, it looks to me that it is used in the iova allocator. But I think
> > > > > it's better to decouple it to an independent allocator instead of
> > > > > vhost iova tree.
> > > > >
> > > >
> > > > Reverse translation is used every time a buffer is made available,
> > > > since buffers content are not copied, only the descriptors to SVQ
> > > > vring.
> > >
> > > I may miss something but I didn't see the code? Qemu knows the VA of
> > > virtqueue, and the VA of the VQ is stored in the VirtQueueElem?
> > >
> >
> > It's used in the patch 20/20, could that be the misunderstanding? The
> > function calling it is vhost_svq_translate_addr.
> >
> > Qemu knows the VA address of the buffer, but it must offer a valid SVQ
> > iova to the device. That is the translation I mean.
>
> Ok, I get you. So if I understand correctly, what you did is:
>
> 1) allocate IOVA during region_add
> 2) perform VA->IOVA reverse lookup in handle_kick
>
> This should be fine, but here're some suggestions:
>
> 1) remove the assert(map) in vhost_svq_translate_addr() since guest
> can add e.g BAR address
Wouldn't VirtQueue block them in the virtqueue_pop / address_space_read_*
functions? I'm fine with removing it, but I would say it adds value as a
guard against coding errors.
> 2) we probably need a better name vhost_iova_tree_alloc(), maybe
> "vhost_iova_tree_map_alloc()"
>
Ok, I will change it in the next version.
> There's actually another method.
>
> 1) don't do IOVA/map allocation in region_add()
> 2) do the allocation in handle_kick(), then we know the IOVA so no
> reverse lookup
>
> The advantage is that this can work for the case of vIOMMU. And they
> should perform the same:
>
> 1) your method avoids the iova allocation per sg
> 2) my method avoids the reverse lookup per sg
>
It's doable, but at this moment we would be replacing a tree search
with a linear insertion.
I would say that the guest's IOVA -> qemu vaddr part works with no
change for vIOMMU, since VirtQueue's virtqueue_pop already gives us the
vaddr even in the case of vIOMMU. The only change I would add for that
case is the SVQ -> device map/unmap part, so the device cannot access
random addresses but only the exposed ones. I'm assuming that part is
O(1).
This way, we already have a tree with all the possible guest
addresses, and we only need to look for its SVQ iova -> vaddr
translation. This is an O(log(N)) operation, and read only, so it's
easily parallelizable when we run each SVQ in its own thread (if
needed). The only thing left is to expose that with an iommu miss
(O(1)) and unmap it on used buffers processing (also O(1)). The
dominant operation is still VirtQueue's own lookup for the
guest's IOVA -> GPA, which I'm assuming is already well optimized and
will benefit from future optimizations since qemu's memory system is
frequently used.
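To make the lookup cost concrete, here is a minimal sketch of the
read-only reverse translation (qemu vaddr -> SVQ iova). It assumes the
maps are kept in an array sorted by vaddr instead of a real tree, and
DemoDMAMap / demo_vaddr_to_iova are hypothetical names that only mirror
the fields used in vhost_svq_translate_addr:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry; fields mirror the patch's VhostDMAMap. */
typedef struct {
    uint64_t translated_addr; /* qemu vaddr of the first byte */
    uint64_t iova;            /* SVQ iova of the first byte */
    uint64_t size;            /* length - 1 (inclusive last offset) */
} DemoDMAMap;

/*
 * Binary search over maps sorted by translated_addr with no overlaps.
 * Read only, so several SVQ threads could run it concurrently.
 */
static const DemoDMAMap *demo_find_by_vaddr(const DemoDMAMap *maps,
                                            size_t n, uint64_t vaddr)
{
    size_t lo = 0, hi = n;

    assert(maps || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const DemoDMAMap *m = &maps[mid];

        if (vaddr < m->translated_addr) {
            hi = mid;
        } else if (vaddr - m->translated_addr > m->size) {
            lo = mid + 1;
        } else {
            return m;
        }
    }
    return NULL;
}

/* Returns 0 and stores the iova, or -1 if vaddr is not mapped. */
static int demo_vaddr_to_iova(const DemoDMAMap *maps, size_t n,
                              uint64_t vaddr, uint64_t *iova)
{
    const DemoDMAMap *m = demo_find_by_vaddr(maps, n, vaddr);

    if (!m) {
        return -1;
    }
    /* Same offset arithmetic as in the patch: iova + (vaddr - base). */
    *iova = m->iova + (vaddr - m->translated_addr);
    return 0;
}
```

Returning an error instead of asserting is a choice made for the
sketch; the patch as posted asserts that the map exists.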
To optimize your use case we would need to add a custom (and smarter
than the currently used) allocator to SVQ. I've been looking for ways
to reuse glibc or similar in our own arenas but with no luck. It will
be code that SVQ needs to maintain by and for itself anyway.
In either case it should not be hard to switch to your method, just a
few call changes in the future, if we achieve a faster allocator.
Would that make sense?
> >
> > > >
> > > > At this point all the limits are copied to vhost iova tree in the next
> > > > revision I will send, defined at its creation at
> > > > vhost_iova_tree_new(). They are outside of util/iova-tree, only sent
> > > > to the latter at allocation time.
> > > >
> > > > Since vhost_iova_tree has its own vhost_iova_tree_alloc(), that wraps
> > > > the iova_tree_alloc() [1], limits could be kept in vhost-vdpa and make
> > > > them an argument of vhost_iova_tree_alloc. But I'm not sure if it's
> > > > what you are proposing or I'm missing something.
> > >
> > > If the reverse translation is only used in iova allocation, I meant to
> > > split the logic of IOVA allocation itself.
> > >
> >
> > Still don't understand it, sorry :). In SVQ setup we allocate an iova
> > address for every guest's GPA address its driver can use. After that
> > there should be no allocation unless memory is hotplugged.
> >
> > So the limits are only needed precisely at allocation time. Not sure
> > if that is what you mean here, but to first allocate and then check if
> > it is within the range could lead to false negatives, since there
> > could be a valid range *in* the address but the iova allocator
> > returned us another range that fell outside the range. How could we
> > know the cause if it is not using the range itself?
>
> See my above reply. And we can teach the iova allocator to return the
> IOVA in the range that vhost-vDPA supports.
>
Ok,
For the next series it will be that way. I'm pretty sure we are
aligned on this part, but the lack of code in this series makes it
very hard to discuss it :).
Thanks!
> Thanks
>
> >
> > > >
> > > > Either way, I think it is harder to talk about this specific case
> > > > without code, since this one still does not address the limits. Would
> > > > you prefer me to send another RFC in WIP quality, with *not* all
> > > > comments addressed? I would say that there is not a lot of pending
> > > > work to send the next one, but it might be easier for all of us.
> > >
> > > I'd prefer to try to address them all, otherwise it's not easy to see
> > > what is missing.
> > >
> >
> > Got it, I will do it that way then!
> >
> > Thanks!
> >
> > > Thanks
> > >
> > > >
> > > > Thanks!
> > > >
> > > > [1] This util/iova-tree method will be proposed in the next series,
> > > > and vhost_iova_tree wraps it since it needs to keep in sync both
> > > > trees: iova->qemu vaddr for iova allocation and the reverse one to
> > > > translate available buffers.
> > > >
> > > > > Thanks
> > > > >
> > > >
> > >
> >
>
* Re: [RFC PATCH v4 18/20] vhost: Add VhostIOVATree
  2021-10-21  7:03               ` Eugenio Perez Martin
@ 2021-10-21  8:12                 ` Jason Wang
  2021-10-21 14:33                   ` Eugenio Perez Martin
  0 siblings, 1 reply; 90+ messages in thread
From: Jason Wang @ 2021-10-21  8:12 UTC (permalink / raw)
  To: Eugenio Perez Martin; +Cc: qemu-devel
On Thu, Oct 21, 2021 at 3:03 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Thu, Oct 21, 2021 at 4:34 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Wed, Oct 20, 2021 at 8:07 PM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Wed, Oct 20, 2021 at 11:01 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Wed, Oct 20, 2021 at 3:54 PM Eugenio Perez Martin
> > > > <eperezma@redhat.com> wrote:
> > > > >
> > > > > On Tue, Oct 19, 2021 at 11:23 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > > On Tue, Oct 19, 2021 at 4:32 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > >
> > > > > > >
> > > > > > > On 2021/10/1 3:06 PM, Eugenio Pérez wrote:
> > > > > > > > This tree is able to look for a translated address from an IOVA address.
> > > > > > > >
> > > > > > > > At first glance is similar to util/iova-tree. However, SVQ working on
> > > > > > > > devices with limited IOVA space need more capabilities, like allocating
> > > > > > > > IOVA chunks or perform reverse translations (qemu addresses to iova).
> > > > > > >
> > > > > > >
> > > > > > > I don't see any reverse translation is used in the shadow code. Or
> > > > > > > anything I missed?
> > > > > >
> > > > > > Ok, it looks to me that it is used in the iova allocator. But I think
> > > > > > it's better to decouple it to an independent allocator instead of
> > > > > > vhost iova tree.
> > > > > >
> > > > >
> > > > > Reverse translation is used every time a buffer is made available,
> > > > > since buffers content are not copied, only the descriptors to SVQ
> > > > > vring.
> > > >
> > > > I may miss something but I didn't see the code? Qemu knows the VA of
> > > > virtqueue, and the VA of the VQ is stored in the VirtQueueElem?
> > > >
> > >
> > > It's used in the patch 20/20, could that be the misunderstanding? The
> > > function calling it is vhost_svq_translate_addr.
> > >
> > > Qemu knows the VA address of the buffer, but it must offer a valid SVQ
> > > iova to the device. That is the translation I mean.
> >
> > Ok, I get you. So if I understand correctly, what you did is:
> >
> > 1) allocate IOVA during region_add
> > 2) perform VA->IOVA reverse lookup in handle_kick
> >
> > This should be fine, but here're some suggestions:
> >
> > 1) remove the assert(map) in vhost_svq_translate_addr() since guest
> > can add e.g BAR address
>
> Wouldn't VirtQueue block them in virtqueue_pop / address_space_read_*
> functions? I'm fine to remove it but I would say it adds value against
> coding error.
I think not. Although these addresses are excluded in
vhost_vdpa_listener_skipped_section(), they are valid addresses for the
Qemu memory core. Qemu emulates how hardware works (e.g. PCI p2p), so
DMA to a BAR is allowed.
>
> > 2) we probably need a better name vhost_iova_tree_alloc(), maybe
> > "vhost_iova_tree_map_alloc()"
> >
>
> Ok I will change for the next version.
>
> > There's actually another method.
> >
> > 1) don't do IOVA/map allocation in region_add()
> > 2) do the allocation in handle_kick(), then we know the IOVA so no
> > reverse lookup
> >
> > The advantage is that this can work for the case of vIOMMU. And they
> > should perform the same:
> >
> > 1) your method avoids the iova allocation per sg
> > 2) my method avoids the reverse lookup per sg
> >
>
> It's somehow doable, but we are replacing a tree search with a linear
> insertion at this moment.
>
> I would say that guest's IOVA -> qemu vaddr part works with no change
> for vIOMMU, since VirtQueue's virtqueue_pop already gives us the vaddr
> even in the case of vIOMMU.
So in this case:
1) listener gives us GPA->host IOVA (host IOVA is allocated per GPA)
2) virtqueue_pop gives us guest IOVA -> VA
We still need extra logic to look up the vIOMMU to get the guest IOVA
-> GPA, and only then can we know the host IOVA.
If we allocate after virtqueue_pop(), we can follow the same logic as
without vIOMMU. Just allocate a host IOVA and then all is done.
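A minimal sketch of what "allocate a host IOVA" within the range the
device reports could look like, using a first-fit search over the
ranges already in use. DemoRange and demo_iova_first_fit are
hypothetical names; this is not the util/iova-tree API, and a real
allocator would search a tree rather than an array:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical used range; size is length - 1, as in the patch. */
typedef struct {
    uint64_t iova;
    uint64_t size;
} DemoRange;

/*
 * Find the first hole of len bytes inside [first, last], given used
 * ranges sorted by iova and not overlapping. [first, last] stands for
 * the limits returned by VHOST_VDPA_GET_IOVA_RANGE.
 * Returns 0 and stores the start in *out, or -1 if nothing fits.
 */
static int demo_iova_first_fit(const DemoRange *used, size_t n,
                               uint64_t first, uint64_t last,
                               uint64_t len, uint64_t *out)
{
    uint64_t cursor = first;

    assert(len > 0 && first <= last);
    for (size_t i = 0; i < n; i++) {
        uint64_t used_last = used[i].iova + used[i].size;

        if (used_last < cursor) {
            continue; /* range lies entirely below the cursor */
        }
        if (used[i].iova > cursor && used[i].iova - cursor >= len) {
            break; /* the hole before this range is big enough */
        }
        if (used_last >= last) {
            return -1; /* no room left after this range */
        }
        cursor = used_last + 1;
    }

    /* Check that [cursor, cursor + len - 1] still fits below last. */
    if (cursor > last || last - cursor < len - 1) {
        return -1;
    }
    *out = cursor;
    return 0;
}
```

The linear walk is what makes a per-sg allocation costly compared to
the per-sg reverse lookup, which is the trade-off discussed here.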
> The only change I would add for that case
> is the SVQ -> device map/unmapping part, so the device cannot access
> random addresses but only the exposed ones. I'm assuming that part is
> O(1).
>
> This way, we already have a tree with all the possible guest's
> addresses, and we only need to look for it's SVQ iova -> vaddr
> translation. This is a O(log(N)) operation,
Yes, but it requires traversing the vIOMMU page table, which should be
slower than our own iova tree?
> and read only, so it's
> easily parallelizable when we make each SVQ in it's own thread (if
> needed).
Yes, this is because the host IOVA was allocated before by the memory listener.
> The only thing left is to expose that with an iommu miss
> (O(1)) and unmap it on used buffers processing (also O(1)). The
> domination operation keeps being VirtQueue's own code lookup for
> guest's IOVA -> GPA, which I'm assuming is already well optimized and
> will benefit from future optimizations since qemu's memory system is
> frequently used.
>
> To optimize your use case we would need to add a custom (and smarter
> than the currently used) allocator to SVQ. I've been looking for ways
> to reuse glibc or similar in our own arenas but with no luck. It will
> be code that SVQ needs to maintain by and for itself anyway.
The benefit is to have the iova allocation separate from the tree.
>
> In either case it should not be hard to switch to your method, just a
> few call changes in the future, if we achieve a faster allocator.
>
> Would that make sense?
Yes, feel free to choose whichever method you prefer or find simpler for the next series.
>
> > >
> > > > >
> > > > > At this point all the limits are copied to vhost iova tree in the next
> > > > > revision I will send, defined at its creation at
> > > > > vhost_iova_tree_new(). They are outside of util/iova-tree, only sent
> > > > > to the latter at allocation time.
> > > > >
> > > > > Since vhost_iova_tree has its own vhost_iova_tree_alloc(), that wraps
> > > > > the iova_tree_alloc() [1], limits could be kept in vhost-vdpa and make
> > > > > them an argument of vhost_iova_tree_alloc. But I'm not sure if it's
> > > > > what you are proposing or I'm missing something.
> > > >
> > > > If the reverse translation is only used in iova allocation, I meant to
> > > > split the logic of IOVA allocation itself.
> > > >
> > >
> > > Still don't understand it, sorry :). In SVQ setup we allocate an iova
> > > address for every guest's GPA address its driver can use. After that
> > > there should be no allocation unless memory is hotplugged.
> > >
> > > So the limits are only needed precisely at allocation time. Not sure
> > > if that is what you mean here, but to first allocate and then check if
> > > it is within the range could lead to false negatives, since there
> > > could be a valid range *in* the address but the iova allocator
> > > returned us another range that fell outside the range. How could we
> > > know the cause if it is not using the range itself?
> >
> > See my above reply. And we can teach the iova allocator to return the
> > IOVA in the range that vhost-vDPA supports.
> >
>
> Ok,
>
> For the next series it will be that way. I'm pretty sure we are
> aligned in this part, but the lack of code in this series makes it
> very hard to discuss it :).
Fine. Let's see.
Thanks
>
> Thanks!
>
> > Thanks
> >
> > >
> > > > >
> > > > > Either way, I think it is harder to talk about this specific case
> > > > > without code, since this one still does not address the limits. Would
> > > > > you prefer me to send another RFC in WIP quality, with *not* all
> > > > > comments addressed? I would say that there is not a lot of pending
> > > > > work to send the next one, but it might be easier for all of us.
> > > >
> > > > I'd prefer to try to address them all, otherwise it's not easy to see
> > > > what is missing.
> > > >
> > >
> > > Got it, I will do it that way then!
> > >
> > > Thanks!
> > >
> > > > Thanks
> > > >
> > > > >
> > > > > Thanks!
> > > > >
> > > > > [1] This util/iova-tree method will be proposed in the next series,
> > > > > and vhost_iova_tree wraps it since it needs to keep in sync both
> > > > > trees: iova->qemu vaddr for iova allocation and the reverse one to
> > > > > translate available buffers.
> > > > >
> > > > > > Thanks
> > > > > >
> > > > >
> > > >
> > >
> >
>
* Re: [RFC PATCH v4 18/20] vhost: Add VhostIOVATree
  2021-10-21  8:12                 ` Jason Wang
@ 2021-10-21 14:33                   ` Eugenio Perez Martin
  2021-10-26  4:29                     ` Jason Wang
  0 siblings, 1 reply; 90+ messages in thread
From: Eugenio Perez Martin @ 2021-10-21 14:33 UTC (permalink / raw)
  To: Jason Wang; +Cc: qemu-devel
On Thu, Oct 21, 2021 at 10:12 AM Jason Wang <jasowang@redhat.com> wrote:
>
> On Thu, Oct 21, 2021 at 3:03 PM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
> >
> > On Thu, Oct 21, 2021 at 4:34 AM Jason Wang <jasowang@redhat.com> wrote:
> > >
> > > On Wed, Oct 20, 2021 at 8:07 PM Eugenio Perez Martin
> > > <eperezma@redhat.com> wrote:
> > > >
> > > > On Wed, Oct 20, 2021 at 11:01 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Wed, Oct 20, 2021 at 3:54 PM Eugenio Perez Martin
> > > > > <eperezma@redhat.com> wrote:
> > > > > >
> > > > > > On Tue, Oct 19, 2021 at 11:23 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > >
> > > > > > > On Tue, Oct 19, 2021 at 4:32 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > > On 2021/10/1 at 3:06 PM, Eugenio Pérez wrote:
> > > > > > > > > This tree is able to look for a translated address from an IOVA address.
> > > > > > > > >
> > > > > > > > > At first glance it is similar to util/iova-tree. However, SVQ working on
> > > > > > > > > devices with limited IOVA space needs more capabilities, like allocating
> > > > > > > > > IOVA chunks or performing reverse translations (qemu addresses to iova).
> > > > > > > >
> > > > > > > >
> > > > > > > > I don't see any reverse translation is used in the shadow code. Or
> > > > > > > > anything I missed?
> > > > > > >
> > > > > > > Ok, it looks to me that it is used in the iova allocator. But I think
> > > > > > > it's better to decouple it to an independent allocator instead of
> > > > > > > vhost iova tree.
> > > > > > >
> > > > > >
> > > > > > Reverse translation is used every time a buffer is made available,
> > > > > > since buffers content are not copied, only the descriptors to SVQ
> > > > > > vring.
> > > > >
> > > > > I may miss something but I didn't see the code? Qemu knows the VA of
> > > > > virtqueue, and the VA of the VQ is stored in the VirtQueueElem?
> > > > >
> > > >
> > > > It's used in the patch 20/20, could that be the misunderstanding? The
> > > > function calling it is vhost_svq_translate_addr.
> > > >
> > > > Qemu knows the VA address of the buffer, but it must offer a valid SVQ
> > > > iova to the device. That is the translation I mean.
> > >
> > > Ok, I get you. So if I understand correctly, what you did is:
> > >
> > > 1) allocate IOVA during region_add
> > > 2) perform VA->IOVA reverse lookup in handle_kick
> > >
> > > This should be fine, but here're some suggestions:
> > >
> > > 1) remove the assert(map) in vhost_svq_translate_addr() since guest
> > > can add e.g BAR address
> >
> > Wouldn't VirtQueue block them in virtqueue_pop / address_space_read_*
> > functions? I'm fine to remove it but I would say it adds value against
> > coding error.
>
> I think not. Though these addresses were excluded in
> vhost_vdpa_listener_skipped_section(). For Qemu memory core, they are
> valid addresses. Qemu emulate how hardware work (e.g pci p2p), so dma
> to bar is allowed.
>
Ok I will treat them as errors.
> >
> > > 2) we probably need a better name vhost_iova_tree_alloc(), maybe
> > > "vhost_iova_tree_map_alloc()"
> > >
> >
> > Ok I will change for the next version.
> >
> > > There's actually another method.
> > >
> > > 1) don't do IOVA/map allocation in region_add()
> > > 2) do the allocation in handle_kick(), then we know the IOVA so no
> > > reverse lookup
> > >
> > > The advantage is that this can work for the case of vIOMMU. And they
> > > should perform the same:
> > >
> > > 1) your method avoids the iova allocation per sg
> > > 2) my method avoids the reverse lookup per sg
> > >
> >
> > It's somehow doable, but we are replacing a tree search with a linear
> > insertion at this moment.
> >
> > I would say that guest's IOVA -> qemu vaddr part works with no change
> > for vIOMMU, since VirtQueue's virtqueue_pop already gives us the vaddr
> > even in the case of vIOMMU.
>
> So in this case:
>
> 1) listener gives us GPA->host IOVA (host IOVA is allocated per GPA)
Right, that was a miss on my side; I think I get your point much better now.
So now vhost-iova-tree translates GPA -> host IOVA in the vIOMMU case, and
it is updated at the same frequency as guest physical memory hotplug
/ unplug (rarely during migration, I guess). There are special entries
for SVQ vrings, which the tree does not map with a GPA for obvious
reasons, so you cannot locate them when looking up by GPA.
Let's assume too that only SVQ vrings have been sent as IOMMU / IOTLB
map, with the relation Host iova -> qemu's VA.
> 2) virtqueue_pop gives us guest IOVA -> VA
>
> We still need extra logic to look up the vIOMMU to get the guest IOVA ->
> GPA translation; then we can know the host IOVA.
>
That's somehow right, but I think this does not need to be *another*
search, insertion, etc. Please see below.
> If we allocate after virtqueue_pop(), we can follow the same logic as
> without vIOMMU. Just allocate an host IOVA then all is done.
>
> > The only change I would add for that case
> > is the SVQ -> device map/unmapping part, so the device cannot access
> > random addresses but only the exposed ones. I'm assuming that part is
> > O(1).
> >
> > This way, we already have a tree with all the possible guest's
> > addresses, and we only need to look for it's SVQ iova -> vaddr
> > translation. This is a O(log(N)) operation,
>
> Yes, but it requires traversing the vIOMMU page table, which should be
> slower than our own iova tree?
>
The lookup over vIOMMU does not need to be performed twice, since
virtqueue_pop already does it. We already have that data here, we just
need to extract it. I'm not saying that it is complicated, only that I
didn't dedicate a lot of time to figuring out how. Its call trace
is:
#0  address_space_translate_iommu
    (iommu_mr, xlat, plen_out, page_mask_out, is_write, is_mmio,
     target_as, attrs) at ../softmmu/physmem.c:418
#1  flatview_do_translate
    (fv, addr, xlat, plen_out, page_mask_out, is_write, is_mmio,
     target_as, attrs) at ../softmmu/physmem.c:505
#2  flatview_translate
    (fv, addr, xlat, plen, is_write, attrs) at ../softmmu/physmem.c:565
#3  address_space_map (as, addr, plen, is_write, attrs)
    at ../softmmu/physmem.c:3183
#4  dma_memory_map (as, addr, len, dir)
    at /home/qemu/svq/include/sysemu/dma.h:202
#5  virtqueue_map_desc
    (vdev, p_num_sg, addr, iov, max_num_sg, is_write, pa, sz)
    at ../hw/virtio/virtio.c:1314
#6  virtqueue_split_pop (vq, sz) at ../hw/virtio/virtio.c:1488
So with that GPA we can locate its corresponding entry in the
vhost-iova-tree, in a read-only operation, O(log(N)). And the element's
address in qemu's VA is not going to change until we mark it as used.
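To make the read-only O(log(N)) lookup concrete, here is a minimal sketch,
assuming the GPA -> host IOVA entries are kept in a sorted array of
non-overlapping ranges. GpaMap, gpa_map_find() and gpa_to_iova() are
hypothetical names used only for illustration; they are not the
vhost-iova-tree API of this series, which is built on a tree rather than
an array.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry: one guest-physical range and the host IOVA it got. */
typedef struct {
    uint64_t gpa;   /* first guest-physical address of the range */
    uint64_t size;  /* length of the range in bytes */
    uint64_t iova;  /* host IOVA assigned to 'gpa' */
} GpaMap;

/*
 * Binary search for the entry that fully contains [gpa, gpa + len).
 * 'maps' must be sorted by gpa and must not overlap.
 * Returns NULL if no single entry covers the whole buffer.
 */
static const GpaMap *gpa_map_find(const GpaMap *maps, size_t n,
                                  uint64_t gpa, uint64_t len)
{
    size_t lo = 0, hi = n;

    assert(maps || n == 0);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        const GpaMap *m = &maps[mid];

        if (gpa < m->gpa) {
            hi = mid;                       /* look in the lower half */
        } else if (gpa - m->gpa >= m->size) {
            lo = mid + 1;                   /* look in the upper half */
        } else {
            /* gpa starts inside m; the end of the buffer must fit too */
            return len <= m->size - (gpa - m->gpa) ? m : NULL;
        }
    }
    return NULL;
}

/* Translate a GPA to a host IOVA. Returns 0 on success, -1 if unmapped. */
static int gpa_to_iova(const GpaMap *maps, size_t n, uint64_t gpa,
                       uint64_t len, uint64_t *iova)
{
    const GpaMap *m = gpa_map_find(maps, n, gpa, len);

    if (!m) {
        return -1;
    }
    *iova = m->iova + (gpa - m->gpa);
    return 0;
}
```

The lookup never modifies the array, which is what would make it easy to
run from several SVQ threads at once; a buffer that crosses the end of an
entry is reported as unmapped instead of being silently truncated.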
This process (the whole call trace above) needs to be serialized somehow
in qemu's memory system internals. I'm just assuming that it will be
faster than what we can do in SVQ with little effort, and that it will
help to reduce duplication. If that is not the case, I think it is even
more beneficial to improve it than to reinvent it in SVQ.
After that, an iommu map needs to be sent to the device, as (qemu's
iova obtained from the tree, qemu's VA, length, ...). We may even
batch them. Another option is to wait for the miss(), but I think that
would be a waste of resources.
The reverse is also true for the unmapping: when we see a used
descriptor, IOTLB unmap(s) will be sent before sending the descriptor to
the guest as used.
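The ordering described above (map before exposing the buffer, unmap
before returning it to the guest as used) can be sketched with a toy
model. Everything here is hypothetical: ToyIotlb and the toy_* helpers
only stand in for the device IOTLB and the SVQ add / used paths, and none
of them are functions of this series or of vhost-vdpa.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_MAX_MAPS 8

/* Toy stand-in for the device IOTLB: the ranges the device may touch. */
typedef struct {
    uint64_t iova[TOY_MAX_MAPS];
    uint64_t size[TOY_MAX_MAPS];
    bool     used[TOY_MAX_MAPS];
} ToyIotlb;

/* Add a mapping. Returns 0 on success, -1 if the table is full. */
static int toy_iotlb_map(ToyIotlb *t, uint64_t iova, uint64_t size)
{
    for (size_t i = 0; i < TOY_MAX_MAPS; i++) {
        if (!t->used[i]) {
            t->iova[i] = iova;
            t->size[i] = size;
            t->used[i] = true;
            return 0;
        }
    }
    return -1;
}

/* Drop the mapping that starts at 'iova', if any. */
static void toy_iotlb_unmap(ToyIotlb *t, uint64_t iova)
{
    for (size_t i = 0; i < TOY_MAX_MAPS; i++) {
        if (t->used[i] && t->iova[i] == iova) {
            t->used[i] = false;
        }
    }
}

/* Would the device be allowed to access [iova, iova + len)? */
static bool toy_iotlb_can_access(const ToyIotlb *t, uint64_t iova,
                                 uint64_t len)
{
    for (size_t i = 0; i < TOY_MAX_MAPS; i++) {
        if (t->used[i] && iova >= t->iova[i] &&
            iova - t->iova[i] < t->size[i] &&
            len <= t->size[i] - (iova - t->iova[i])) {
            return true;
        }
    }
    return false;
}

/* Make a buffer available: map first, only then expose the descriptor. */
static int toy_svq_add(ToyIotlb *t, uint64_t iova, uint64_t len)
{
    if (toy_iotlb_map(t, iova, len) < 0) {
        return -1;
    }
    /* ...the descriptor would be written to the shadow vring here... */
    return 0;
}

/* Return a used buffer: unmap first, only then flag it used to the guest. */
static void toy_svq_return_used(ToyIotlb *t, uint64_t iova)
{
    toy_iotlb_unmap(t, iova);
    /* ...the element would be pushed to the guest's used ring here... */
}
```

With this ordering the device can only reach a buffer between the add and
the moment it is returned, so it never sees random addresses. Batching
would group several map or unmap calls per kick or call instead of
issuing one per buffer.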
> > and read only, so it's
> > easily parallelizable when we make each SVQ in it's own thread (if
> > needed).
>
> Yes, this is because the host IOVA was allocated before by the memory listener.
>
Right.
> > The only thing left is to expose that with an iommu miss
> > (O(1)) and unmap it on used buffers processing (also O(1)). The
> > domination operation keeps being VirtQueue's own code lookup for
> > guest's IOVA -> GPA, which I'm assuming is already well optimized and
> > will benefit from future optimizations since qemu's memory system is
> > frequently used.
> >
> > To optimize your use case we would need to add a custom (and smarter
> > than the currently used) allocator to SVQ. I've been looking for ways
> > to reuse glibc or similar in our own arenas but with no luck. It will
> > be code that SVQ needs to maintain by and for itself anyway.
>
> The benefit is to have separate iova allocation from the tree.
>
> >
> > In either case it should not be hard to switch to your method, just a
> > few call changes in the future, if we achieve a faster allocator.
> >
> > Would that make sense?
>
> Yes, feel free to choose any method you wish or feel simpler in the next series.
>
> >
> > > >
> > > > > >
> > > > > > At this point all the limits are copied to vhost iova tree in the next
> > > > > > revision I will send, defined at its creation at
> > > > > > vhost_iova_tree_new(). They are outside of util/iova-tree, only sent
> > > > > > to the latter at allocation time.
> > > > > >
> > > > > > Since vhost_iova_tree has its own vhost_iova_tree_alloc(), that wraps
> > > > > > the iova_tree_alloc() [1], limits could be kept in vhost-vdpa and make
> > > > > > them an argument of vhost_iova_tree_alloc. But I'm not sure if it's
> > > > > > what you are proposing or I'm missing something.
> > > > >
> > > > > If the reverse translation is only used in iova allocation, I meant to
> > > > > split the logic of IOVA allocation itself.
> > > > >
> > > >
> > > > Still don't understand it, sorry :). In SVQ setup we allocate an iova
> > > > address for every guest's GPA address its driver can use. After that
> > > > there should be no allocation unless memory is hotplugged.
> > > >
> > > > So the limits are only needed precisely at allocation time. Not sure
> > > > if that is what you mean here, but to first allocate and then check if
> > > > it is within the range could lead to false negatives, since there
> > > > could be a valid range *in* the address but the iova allocator
> > > > returned us another range that fell outside the range. How could we
> > > > know the cause if it is not using the range itself?
> > >
> > > See my above reply. And we can teach the iova allocator to return the
> > > IOVA in the range that vhost-vDPA supports.
> > >
> >
> > Ok,
> >
> > For the next series it will be that way. I'm pretty sure we are
> > aligned in this part, but the lack of code in this series makes it
> > very hard to discuss it :).
>
> Fine. Let's see.
>
> Thanks
>
> >
> > Thanks!
> >
> > > Thanks
> > >
> > > >
> > > > > >
> > > > > > Either way, I think it is harder to talk about this specific case
> > > > > > without code, since this one still does not address the limits. Would
> > > > > > you prefer me to send another RFC in WIP quality, with *not* all
> > > > > > comments addressed? I would say that there is not a lot of pending
> > > > > > work to send the next one, but it might be easier for all of us.
> > > > >
> > > > > I'd prefer to try to address them all, otherwise it's not easy to see
> > > > > what is missing.
> > > > >
> > > >
> > > > Got it, I will do it that way then!
> > > >
> > > > Thanks!
> > > >
> > > > > Thanks
> > > > >
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > [1] This util/iova-tree method will be proposed in the next series,
> > > > > > and vhost_iova_tree wraps it since it needs to keep in sync both
> > > > > > trees: iova->qemu vaddr for iova allocation and the reverse one to
> > > > > > translate available buffers.
> > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
* Re: [RFC PATCH v4 18/20] vhost: Add VhostIOVATree
  2021-10-21 14:33                   ` Eugenio Perez Martin
@ 2021-10-26  4:29                     ` Jason Wang
  0 siblings, 0 replies; 90+ messages in thread
From: Jason Wang @ 2021-10-26  4:29 UTC (permalink / raw)
  To: Eugenio Perez Martin; +Cc: qemu-devel
On Thu, Oct 21, 2021 at 10:34 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Thu, Oct 21, 2021 at 10:12 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Thu, Oct 21, 2021 at 3:03 PM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Thu, Oct 21, 2021 at 4:34 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Wed, Oct 20, 2021 at 8:07 PM Eugenio Perez Martin
> > > > <eperezma@redhat.com> wrote:
> > > > >
> > > > > On Wed, Oct 20, 2021 at 11:01 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > >
> > > > > > On Wed, Oct 20, 2021 at 3:54 PM Eugenio Perez Martin
> > > > > > <eperezma@redhat.com> wrote:
> > > > > > >
> > > > > > > On Tue, Oct 19, 2021 at 11:23 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > >
> > > > > > > > On Tue, Oct 19, 2021 at 4:32 PM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > On 2021/10/1 at 3:06 PM, Eugenio Pérez wrote:
> > > > > > > > > > This tree is able to look for a translated address from an IOVA address.
> > > > > > > > > >
> > > > > > > > > > At first glance it is similar to util/iova-tree. However, SVQ working on
> > > > > > > > > > devices with limited IOVA space needs more capabilities, like allocating
> > > > > > > > > > IOVA chunks or performing reverse translations (qemu addresses to iova).
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > I don't see any reverse translation is used in the shadow code. Or
> > > > > > > > > anything I missed?
> > > > > > > >
> > > > > > > > Ok, it looks to me that it is used in the iova allocator. But I think
> > > > > > > > it's better to decouple it to an independent allocator instead of
> > > > > > > > vhost iova tree.
> > > > > > > >
> > > > > > >
> > > > > > > Reverse translation is used every time a buffer is made available,
> > > > > > > since buffers content are not copied, only the descriptors to SVQ
> > > > > > > vring.
> > > > > >
> > > > > > I may miss something but I didn't see the code? Qemu knows the VA of
> > > > > > virtqueue, and the VA of the VQ is stored in the VirtQueueElem?
> > > > > >
> > > > >
> > > > > It's used in the patch 20/20, could that be the misunderstanding? The
> > > > > function calling it is vhost_svq_translate_addr.
> > > > >
> > > > > Qemu knows the VA address of the buffer, but it must offer a valid SVQ
> > > > > iova to the device. That is the translation I mean.
> > > >
> > > > Ok, I get you. So if I understand correctly, what you did is:
> > > >
> > > > 1) allocate IOVA during region_add
> > > > 2) perform VA->IOVA reverse lookup in handle_kick
> > > >
> > > > This should be fine, but here're some suggestions:
> > > >
> > > > 1) remove the assert(map) in vhost_svq_translate_addr() since guest
> > > > can add e.g BAR address
> > >
> > > Wouldn't VirtQueue block them in virtqueue_pop / address_space_read_*
> > > functions? I'm fine to remove it but I would say it adds value against
> > > coding error.
> >
> > I think not. Though these addresses were excluded in
> > vhost_vdpa_listener_skipped_section(). For Qemu memory core, they are
> > valid addresses. Qemu emulate how hardware work (e.g pci p2p), so dma
> > to bar is allowed.
> >
>
> Ok I will treat them as errors.
>
> > >
> > > > 2) we probably need a better name vhost_iova_tree_alloc(), maybe
> > > > "vhost_iova_tree_map_alloc()"
> > > >
> > >
> > > Ok I will change for the next version.
> > >
> > > > There's actually another method.
> > > >
> > > > 1) don't do IOVA/map allocation in region_add()
> > > > 2) do the allocation in handle_kick(), then we know the IOVA so no
> > > > reverse lookup
> > > >
> > > > The advantage is that this can work for the case of vIOMMU. And they
> > > > should perform the same:
> > > >
> > > > 1) your method avoids the iova allocation per sg
> > > > 2) my method avoids the reverse lookup per sg
> > > >
> > >
> > > It's somehow doable, but we are replacing a tree search with a linear
> > > insertion at this moment.
> > >
> > > I would say that guest's IOVA -> qemu vaddr part works with no change
> > > for vIOMMU, since VirtQueue's virtqueue_pop already gives us the vaddr
> > > even in the case of vIOMMU.
> >
> > So in this case:
> >
> > 1) listener gives us GPA->host IOVA (host IOVA is allocated per GPA)
>
> Right, that was a miss on my side; I think I get your point much better now.
>
> So now vhost-iova-tree translates GPA -> host IOVA in the vIOMMU case, and
> it is updated at the same frequency as guest physical memory hotplug
> / unplug (rarely during migration, I guess). There are special entries
> for SVQ vrings, which the tree does not map with a GPA for obvious
> reasons, so you cannot locate them when looking up by GPA.
Yes.
>
> Let's assume too that only SVQ vrings have been sent as IOMMU / IOTLB
> map, with the relation Host iova -> qemu's VA.
>
> > 2) virtqueue_pop gives us guest IOVA -> VA
> >
> > We still need extra logic to look up the vIOMMU to get the guest IOVA ->
> > GPA translation; then we can know the host IOVA.
> >
>
> That's somehow right, but I think this does not need to be *another*
> search, insertion, etc. Please see below.
>
> > If we allocate after virtqueue_pop(), we can follow the same logic as
> > without vIOMMU. Just allocate an host IOVA then all is done.
> >
> > > The only change I would add for that case
> > > is the SVQ -> device map/unmapping part, so the device cannot access
> > > random addresses but only the exposed ones. I'm assuming that part is
> > > O(1).
> > >
> > > This way, we already have a tree with all the possible guest's
> > > addresses, and we only need to look for it's SVQ iova -> vaddr
> > > translation. This is a O(log(N)) operation,
> >
> > Yes, but it requires traversing the vIOMMU page table, which should be
> > slower than our own iova tree?
> >
>
> The lookup over vIOMMU does not need to be performed twice, since
> virtqueue_pop already does it. We already have that data here, we just
> need to extract it.
By 'extract', do you mean fetching it from the IOMMU's IOTLB via
address_space_get_iotlb_entry()? Yes, it would be faster and probably
O(1).
> I'm not saying that it is complicated, only that I
> didn't dedicate a lot of time to figuring out how. Its call trace
> is:
>
> #0  address_space_translate_iommu
>     (iommu_mr, xlat, plen_out, page_mask_out, is_write, is_mmio,
> target_as, attrs) at ../softmmu/physmem.c:418
> #1  flatview_do_translate
>     (fv, addr, xlat, plen_out, page_mask_out, is_write, is_mmio,
> target_as, attrs) at ../softmmu/physmem.c:505
> #2  flatview_translate
>     (fv, addr, xlat, plen, is_write, attrs) at ../softmmu/physmem.c:565
> #3  address_space_map (as, addr, plen, is_write, attrs)
>     at ../softmmu/physmem.c:3183
> #4  dma_memory_map (as, addr, len, dir)
>     at /home/qemu/svq/include/sysemu/dma.h:202
> #5  virtqueue_map_desc
>     (vdev, p_num_sg, addr, iov, max_num_sg, is_write, pa, sz) at
> ../hw/virtio/virtio.c:1314
> #6  virtqueue_split_pop (vq, sz) at ../hw/virtio/virtio.c:1488
>
> So with that GPA we can locate its corresponding entry in the
> vhost-iova-tree, in a read-only operation, O(log(N)). And the element's
> address in qemu's VA is not going to change until we mark it as used.
>
> This process (the whole call trace above) needs to be serialized somehow
> in qemu's memory system internals. I'm just assuming that it will be
> faster than what we can do in SVQ with little effort, and that it will
> help to reduce duplication. If that is not the case, I think it is even
> more beneficial to improve it than to reinvent it in SVQ.
I think so.
Thanks
>
> After that, an iommu map needs to be sent to the device, as (qemu's
> iova obtained from the tree, qemu's VA, length, ...). We may even
> batch them. Another option is to wait for the miss(), but I think that
> would be a waste of resources.
>
> The reverse is also true for the unmapping: when we see a used
> descriptor, IOTLB unmap(s) will be sent before sending the descriptor to
> the guest as used.
>
> > > and read only, so it's
> > > easily parallelizable when we make each SVQ in it's own thread (if
> > > needed).
> >
> > Yes, this is because the host IOVA was allocated before by the memory listener.
> >
>
> Right.
>
> > > The only thing left is to expose that with an iommu miss
> > > (O(1)) and unmap it on used buffers processing (also O(1)). The
> > > domination operation keeps being VirtQueue's own code lookup for
> > > guest's IOVA -> GPA, which I'm assuming is already well optimized and
> > > will benefit from future optimizations since qemu's memory system is
> > > frequently used.
> > >
> > > To optimize your use case we would need to add a custom (and smarter
> > > than the currently used) allocator to SVQ. I've been looking for ways
> > > to reuse glibc or similar in our own arenas but with no luck. It will
> > > be code that SVQ needs to maintain by and for itself anyway.
> >
> > The benefit is to have separate iova allocation from the tree.
> >
> > >
> > > In either case it should not be hard to switch to your method, just a
> > > few call changes in the future, if we achieve a faster allocator.
> > >
> > > Would that make sense?
> >
> > Yes, feel free to choose any method you wish or feel simpler in the next series.
> >
> > >
> > > > >
> > > > > > >
> > > > > > > At this point all the limits are copied to vhost iova tree in the next
> > > > > > > revision I will send, defined at its creation at
> > > > > > > vhost_iova_tree_new(). They are outside of util/iova-tree, only sent
> > > > > > > to the latter at allocation time.
> > > > > > >
> > > > > > > Since vhost_iova_tree has its own vhost_iova_tree_alloc(), that wraps
> > > > > > > the iova_tree_alloc() [1], limits could be kept in vhost-vdpa and make
> > > > > > > them an argument of vhost_iova_tree_alloc. But I'm not sure if it's
> > > > > > > what you are proposing or I'm missing something.
> > > > > >
> > > > > > If the reverse translation is only used in iova allocation, I meant to
> > > > > > split the logic of IOVA allocation itself.
> > > > > >
> > > > >
> > > > > Still don't understand it, sorry :). In SVQ setup we allocate an iova
> > > > > address for every guest's GPA address its driver can use. After that
> > > > > there should be no allocation unless memory is hotplugged.
> > > > >
> > > > > So the limits are only needed precisely at allocation time. Not sure
> > > > > if that is what you mean here, but to first allocate and then check if
> > > > > it is within the range could lead to false negatives, since there
> > > > > could be a valid range *in* the address but the iova allocator
> > > > > returned us another range that fell outside the range. How could we
> > > > > know the cause if it is not using the range itself?
> > > >
> > > > See my above reply. And we can teach the iova allocator to return the
> > > > IOVA in the range that vhost-vDPA supports.
> > > >
> > >
> > > Ok,
> > >
> > > For the next series it will be that way. I'm pretty sure we are
> > > aligned in this part, but the lack of code in this series makes it
> > > very hard to discuss it :).
> >
> > Fine. Let's see.
> >
> > Thanks
> >
> > >
> > > Thanks!
> > >
> > > > Thanks
> > > >
> > > > >
> > > > > > >
> > > > > > > Either way, I think it is harder to talk about this specific case
> > > > > > > without code, since this one still does not address the limits. Would
> > > > > > > you prefer me to send another RFC in WIP quality, with *not* all
> > > > > > > comments addressed? I would say that there is not a lot of pending
> > > > > > > work to send the next one, but it might be easier for all of us.
> > > > > >
> > > > > > I'd prefer to try to address them all, otherwise it's not easy to see
> > > > > > what is missing.
> > > > > >
> > > > >
> > > > > Got it, I will do it that way then!
> > > > >
> > > > > Thanks!
> > > > >
> > > > > > Thanks
> > > > > >
> > > > > > >
> > > > > > > Thanks!
> > > > > > >
> > > > > > > [1] This util/iova-tree method will be proposed in the next series,
> > > > > > > and vhost_iova_tree wraps it since it needs to keep in sync both
> > > > > > > trees: iova->qemu vaddr for iova allocation and the reverse one to
> > > > > > > translate available buffers.
> > > > > > >
> > > > > > > > Thanks
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
* Re: [RFC PATCH v4 20/20] vdpa: Add custom IOTLB translations to SVQ
  2021-10-20 11:56               ` Eugenio Perez Martin
  2021-10-21  2:38                 ` Jason Wang
@ 2021-10-26  4:32                 ` Jason Wang
  1 sibling, 0 replies; 90+ messages in thread
From: Jason Wang @ 2021-10-26  4:32 UTC (permalink / raw)
  To: Eugenio Perez Martin
  Cc: Parav Pandit, Juan Quintela, Markus Armbruster,
	Michael S. Tsirkin, qemu-level, virtualization,
	Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
	Eric Blake, Michael Lilja, Stefano Garzarella
On Wed, Oct 20, 2021 at 7:57 PM Eugenio Perez Martin
<eperezma@redhat.com> wrote:
>
> On Wed, Oct 20, 2021 at 11:03 AM Jason Wang <jasowang@redhat.com> wrote:
> >
> > On Wed, Oct 20, 2021 at 2:52 PM Eugenio Perez Martin
> > <eperezma@redhat.com> wrote:
> > >
> > > On Wed, Oct 20, 2021 at 4:07 AM Jason Wang <jasowang@redhat.com> wrote:
> > > >
> > > > On Wed, Oct 20, 2021 at 10:02 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > >
> > > > > On Tue, Oct 19, 2021 at 6:29 PM Eugenio Perez Martin
> > > > > <eperezma@redhat.com> wrote:
> > > > > >
> > > > > > On Tue, Oct 19, 2021 at 11:25 AM Jason Wang <jasowang@redhat.com> wrote:
> > > > > > >
> > > > > > >
> > > > > > > On 2021/10/1 at 3:06 PM, Eugenio Pérez wrote:
> > > > > > > > Use translations added in VhostIOVATree in SVQ.
> > > > > > > >
> > > > > > > > Now every element needs to store the previous address also, so VirtQueue
> > > > > > > > can consume the elements properly. This adds a little overhead per VQ
> > > > > > > > element, having to allocate more memory to stash them. As a possible
> > > > > > > > optimization, this allocation could be avoided if the descriptor is not
> > > > > > > > a chain but a single one, but this is left undone.
> > > > > > > >
> > > > > > > > TODO: iova range should be queried before, and add logic to fail when
> > > > > > > > GPA is outside of its range and memory listener or svq add it.
> > > > > > > >
> > > > > > > > Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> > > > > > > > ---
> > > > > > > >   hw/virtio/vhost-shadow-virtqueue.h |   4 +-
> > > > > > > >   hw/virtio/vhost-shadow-virtqueue.c | 130 ++++++++++++++++++++++++-----
> > > > > > > >   hw/virtio/vhost-vdpa.c             |  40 ++++++++-
> > > > > > > >   hw/virtio/trace-events             |   1 +
> > > > > > > >   4 files changed, 152 insertions(+), 23 deletions(-)
> > > > > > > >
> > > > > > > > diff --git a/hw/virtio/vhost-shadow-virtqueue.h b/hw/virtio/vhost-shadow-virtqueue.h
> > > > > > > > index b7baa424a7..a0e6b5267a 100644
> > > > > > > > --- a/hw/virtio/vhost-shadow-virtqueue.h
> > > > > > > > +++ b/hw/virtio/vhost-shadow-virtqueue.h
> > > > > > > > @@ -11,6 +11,7 @@
> > > > > > > >   #define VHOST_SHADOW_VIRTQUEUE_H
> > > > > > > >
> > > > > > > >   #include "hw/virtio/vhost.h"
> > > > > > > > +#include "hw/virtio/vhost-iova-tree.h"
> > > > > > > >
> > > > > > > >   typedef struct VhostShadowVirtqueue VhostShadowVirtqueue;
> > > > > > > >
> > > > > > > > @@ -28,7 +29,8 @@ bool vhost_svq_start(struct vhost_dev *dev, unsigned idx,
> > > > > > > >   void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> > > > > > > >                       VhostShadowVirtqueue *svq);
> > > > > > > >
> > > > > > > > -VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx);
> > > > > > > > +VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx,
> > > > > > > > +                                    VhostIOVATree *iova_map);
> > > > > > > >
> > > > > > > >   void vhost_svq_free(VhostShadowVirtqueue *vq);
> > > > > > > >
> > > > > > > > diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> > > > > > > > index 2fd0bab75d..9db538547e 100644
> > > > > > > > --- a/hw/virtio/vhost-shadow-virtqueue.c
> > > > > > > > +++ b/hw/virtio/vhost-shadow-virtqueue.c
> > > > > > > > @@ -11,12 +11,19 @@
> > > > > > > >   #include "hw/virtio/vhost-shadow-virtqueue.h"
> > > > > > > >   #include "hw/virtio/vhost.h"
> > > > > > > >   #include "hw/virtio/virtio-access.h"
> > > > > > > > +#include "hw/virtio/vhost-iova-tree.h"
> > > > > > > >
> > > > > > > >   #include "standard-headers/linux/vhost_types.h"
> > > > > > > >
> > > > > > > >   #include "qemu/error-report.h"
> > > > > > > >   #include "qemu/main-loop.h"
> > > > > > > >
> > > > > > > > +typedef struct SVQElement {
> > > > > > > > +    VirtQueueElement elem;
> > > > > > > > +    void **in_sg_stash;
> > > > > > > > +    void **out_sg_stash;
> > > > > > > > +} SVQElement;
> > > > > > > > +
> > > > > > > >   /* Shadow virtqueue to relay notifications */
> > > > > > > >   typedef struct VhostShadowVirtqueue {
> > > > > > > >       /* Shadow vring */
> > > > > > > > @@ -46,8 +53,11 @@ typedef struct VhostShadowVirtqueue {
> > > > > > > >       /* Virtio device */
> > > > > > > >       VirtIODevice *vdev;
> > > > > > > >
> > > > > > > > +    /* IOVA mapping if used */
> > > > > > > > +    VhostIOVATree *iova_map;
> > > > > > > > +
> > > > > > > >       /* Map for returning guest's descriptors */
> > > > > > > > -    VirtQueueElement **ring_id_maps;
> > > > > > > > +    SVQElement **ring_id_maps;
> > > > > > > >
> > > > > > > >       /* Next head to expose to device */
> > > > > > > >       uint16_t avail_idx_shadow;
> > > > > > > > @@ -79,13 +89,6 @@ bool vhost_svq_valid_device_features(uint64_t *dev_features)
> > > > > > > >               continue;
> > > > > > > >
> > > > > > > >           case VIRTIO_F_ACCESS_PLATFORM:
> > > > > > > > -            /* SVQ needs this feature disabled. Can't continue */
> > > > > > > > -            if (*dev_features & BIT_ULL(b)) {
> > > > > > > > -                clear_bit(b, dev_features);
> > > > > > > > -                r = false;
> > > > > > > > -            }
> > > > > > > > -            break;
> > > > > > > > -
> > > > > > > >           case VIRTIO_F_VERSION_1:
> > > > > > > >               /* SVQ needs this feature, so can't continue */
> > > > > > > >               if (!(*dev_features & BIT_ULL(b))) {
> > > > > > > > @@ -126,6 +129,64 @@ static void vhost_svq_set_notification(VhostShadowVirtqueue *svq, bool enable)
> > > > > > > >       }
> > > > > > > >   }
> > > > > > > >
> > > > > > > > +static void vhost_svq_stash_addr(void ***stash, const struct iovec *iov,
> > > > > > > > +                                 size_t num)
> > > > > > > > +{
> > > > > > > > +    size_t i;
> > > > > > > > +
> > > > > > > > +    if (num == 0) {
> > > > > > > > +        return;
> > > > > > > > +    }
> > > > > > > > +
> > > > > > > > +    *stash = g_new(void *, num);
> > > > > > > > +    for (i = 0; i < num; ++i) {
> > > > > > > > +        (*stash)[i] = iov[i].iov_base;
> > > > > > > > +    }
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +static void vhost_svq_unstash_addr(void **stash, struct iovec *iov, size_t num)
> > > > > > > > +{
> > > > > > > > +    size_t i;
> > > > > > > > +
> > > > > > > > +    if (num == 0) {
> > > > > > > > +        return;
> > > > > > > > +    }
> > > > > > > > +
> > > > > > > > +    for (i = 0; i < num; ++i) {
> > > > > > > > +        iov[i].iov_base = stash[i];
> > > > > > > > +    }
> > > > > > > > +    g_free(stash);
> > > > > > > > +}
> > > > > > > > +
> > > > > > > > +static void vhost_svq_translate_addr(const VhostShadowVirtqueue *svq,
> > > > > > > > +                                     struct iovec *iovec, size_t num)
> > > > > > > > +{
> > > > > > > > +    size_t i;
> > > > > > > > +
> > > > > > > > +    for (i = 0; i < num; ++i) {
> > > > > > > > +        VhostDMAMap needle = {
> > > > > > > > +            .translated_addr = iovec[i].iov_base,
> > > > > > > > +            .size = iovec[i].iov_len,
> > > > > > > > +        };
> > > > > > > > +        size_t off;
> > > > > > > > +
> > > > > > > > +        const VhostDMAMap *map = vhost_iova_tree_find_iova(svq->iova_map,
> > > > > > > > +                                                           &needle);
> > > > > > >
> > > > > > >
> > > > > > > Is it possible that we end up with more than one map here?
> > > > > > >
> > > > > >
> > > > > > Actually it is possible, since there is no guarantee that one
> > > > > > descriptor (or indirect descriptor) maps exactly to one iov. It could
> > > > > > map to many if qemu vaddr is not contiguous but GPA + size is. This is
> > > > > > something that must be fixed for the next revision, so thanks for
> > > > > > pointing it out!
> > > > > >
> > > > > > Taking that into account, the condition that svq vring avail_idx -
> > > > > > used_idx was always less than or equal to guest's vring avail_idx -
> > > > > > used_idx is not true anymore. Checking for that before adding buffers
> > > > > > to SVQ is the easy part, but how could we recover in that case?
> > > > > >
> > > > > > I think that the easy solution is to check for more available buffers
> > > > > > unconditionally at the end of vhost_svq_handle_call, which handles the
> > > > > > SVQ used and is supposed to make more room for available buffers. So
> > > > > > vhost_handle_guest_kick would not check if eventfd is set or not
> > > > > > anymore.
> > > > > >
> > > > > > Would that make sense?
> > > > >
> > > > > Yes, I think it should work.
> > > >
> > > > Btw, I wonder how to handle indirect descriptors. SVQ doesn't use
> > > > indirect descriptors for now, but it looks like a must; otherwise the
> > > > SVQ may end up full before the VQ.
> > > >
> > >
> > > We can get to that situation without indirect too, if a single
> > > descriptor maps to more than one sg buffer. The next revision is going
> > > to control that too.
> > >
> > > > It looks to me like an easy way is to always use indirect descriptors if #sg >= 2?
> > > >
> > >
> > > I will use that, but that does not solve the case where a descriptor
> > > maps to > 1 different buffers in qemu vaddr.
> >
> > Right, so we need to deal with the case when SVQ is out of space.
> >
> >
> > > So I think that some
> > > check after marking descriptors as used is a must somehow.
> >
> > I thought it should be before processing the available buffer?
>
> Yes, I meant after that. Somehow, because I include checking the
> number of sg buffers as "processing". :).
>
> > It's
> > the guest driver that makes sure there's sufficient space for the used
> > ring?
> >
>
> (I think we are saying the same thing in different words, but just in
> case I will develop the idea here with an example).
>
> The guest is able to check if there is enough space in the SVQ's
> vring, but not in the device's vring. As an example of this, imagine
> that a guest makes available a GPA contiguous buffer of 64K, one
> descriptor. However, this memory is divided into 16 chunks of 4K in
> qemu's VA space. Imagine that at this moment there are only eight
> slots free in each vring, and that neither communication is using
> indirect descriptors.
>
> The guest only needs 1 free descriptor to make that buffer
> available, so it will add it to the avail ring. But SVQ needs 16 chained
> descriptors, so the buffer is not going to reach the device until the
> device marks at least 8 more descriptors as used. SVQ checked for the
> amount of available room, as you said, but it cannot forward the
> available one.
>
> Since the guest already sent kick when it made the descriptor
> available, we need another mechanism to know when we have all the
> needed free slots in the SVQ vring. And that's what I meant by the
> check after marking some buffers as available.
>
> I still think it is not worth it to protect the forwarding methods from
> hogging the BQL, since there must be a limit sooner or later, but it is
> something that is worth putting on the table again. But this requires
> changes for the next version for sure.
>
> I can think of more scenarios, like the guest making available an indirect
> descriptor of vq size that needs to be split into even more sgs. Qemu
> already does not support more than 1024 sg buffers in a VirtQueue, but
> a driver (such as SVQ) must *not* create an indirect descriptor chain
> longer than the Queue Size. Should we always increase the vq size to
> 1024? I think these are highly unlikely, but again these concerns
> must be at least commented here.
>
> Does it make sense?
Makes a lot of sense. It's better to make the code robust without any
assumptions about the host or guest configuration.
Thanks
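
To make the 64K example quoted above concrete, here is a minimal sketch of
the accounting involved. It is only an illustration under stated
assumptions: `Chunk`, `svq_descs_needed` and `svq_can_forward` are
hypothetical names that do not exist in QEMU, and the chunk list is assumed
to be sorted, non-overlapping and to cover the whole guest buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical: one piece of guest memory that is contiguous in qemu's VA. */
typedef struct {
    uint64_t gpa;   /* guest physical start address */
    uint64_t size;  /* length in bytes */
} Chunk;

/*
 * Count how many chunks the guest buffer [gpa, gpa + len) touches. Each
 * touched chunk costs one descriptor in the shadow vring when indirect
 * descriptors are not used.
 */
static size_t svq_descs_needed(const Chunk *chunks, size_t n,
                               uint64_t gpa, uint64_t len)
{
    size_t needed = 0;
    uint64_t end = gpa + len;

    assert(len > 0);
    for (size_t i = 0; i < n; i++) {
        uint64_t chunk_end = chunks[i].gpa + chunks[i].size;

        /* Half-open interval overlap test */
        if (chunks[i].gpa < end && gpa < chunk_end) {
            needed++;
        }
    }
    return needed;
}

/* The buffer can only be forwarded if all of its descriptors fit at once. */
static bool svq_can_forward(size_t free_slots, size_t needed)
{
    return needed <= free_slots;
}
```

With 16 chunks of 4K and a 64K buffer, `svq_descs_needed` returns 16, so
`svq_can_forward(8, 16)` is false even though the guest only consumed one
descriptor. The element has to stay pending and be retried once the device
has returned enough buffers.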
>
> Thanks!
>
> > Thanks
> >
> > >
> > >
> > > > Thanks
> > > >
> > > > >
> > > > > Thanks
> > > > >
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > >
> > > > > > > > +        /*
> > > > > > > > +         * Map cannot be NULL since iova map contains all guest space and
> > > > > > > > +         * qemu already has a physical address mapped
> > > > > > > > +         */
> > > > > > > > +        assert(map);
> > > > > > > > +
> > > > > > > > +        /*
> > > > > > > > +         * Map->iova chunk size is ignored. What to do if descriptor
> > > > > > > > +         * (addr, size) does not fit is delegated to the device.
> > > > > > > > +         */
> > > > > > > > +        off = needle.translated_addr - map->translated_addr;
> > > > > > > > +        iovec[i].iov_base = (void *)(map->iova + off);
> > > > > > > > +    }
> > > > > > > > +}
> > > > > > > > +
> > > > > > > >   static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> > > > > > > >                                       const struct iovec *iovec,
> > > > > > > >                                       size_t num, bool more_descs, bool write)
> > > > > > > > @@ -156,8 +217,9 @@ static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> > > > > > > >   }
> > > > > > > >
> > > > > > > >   static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
> > > > > > > > -                                    VirtQueueElement *elem)
> > > > > > > > +                                    SVQElement *svq_elem)
> > > > > > > >   {
> > > > > > > > +    VirtQueueElement *elem = &svq_elem->elem;
> > > > > > > >       int head;
> > > > > > > >       unsigned avail_idx;
> > > > > > > >       vring_avail_t *avail = svq->vring.avail;
> > > > > > > > @@ -167,6 +229,12 @@ static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
> > > > > > > >       /* We need some descriptors here */
> > > > > > > >       assert(elem->out_num || elem->in_num);
> > > > > > > >
> > > > > > > > +    vhost_svq_stash_addr(&svq_elem->in_sg_stash, elem->in_sg, elem->in_num);
> > > > > > > > +    vhost_svq_stash_addr(&svq_elem->out_sg_stash, elem->out_sg, elem->out_num);
> > > > > > >
> > > > > > >
> > > > > > > I wonder if we can replace the stash/unstash trick with a
> > > > > > > dedicated sgs array in svq_elem, instead of reusing the elem.
> > > > > > >
> > > > > >
> > > > > > Actually yes, it would be way simpler to use a new sgs array in
> > > > > > svq_elem. I will change that.
> > > > > >
> > > > > > Thanks!
> > > > > >
> > > > > > > Thanks
> > > > > > >
> > > > > > >
> > > > > > > > +
> > > > > > > > +    vhost_svq_translate_addr(svq, elem->in_sg, elem->in_num);
> > > > > > > > +    vhost_svq_translate_addr(svq, elem->out_sg, elem->out_num);
> > > > > > > > +
> > > > > > > >       vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
> > > > > > > >                               elem->in_num > 0, false);
> > > > > > > >       vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
> > > > > > > > @@ -187,7 +255,7 @@ static unsigned vhost_svq_add_split(VhostShadowVirtqueue *svq,
> > > > > > > >
> > > > > > > >   }
> > > > > > > >
> > > > > > > > -static void vhost_svq_add(VhostShadowVirtqueue *svq, VirtQueueElement *elem)
> > > > > > > > +static void vhost_svq_add(VhostShadowVirtqueue *svq, SVQElement *elem)
> > > > > > > >   {
> > > > > > > >       unsigned qemu_head = vhost_svq_add_split(svq, elem);
> > > > > > > >
> > > > > > > > @@ -221,7 +289,7 @@ static void vhost_handle_guest_kick(EventNotifier *n)
> > > > > > > >           }
> > > > > > > >
> > > > > > > >           while (true) {
> > > > > > > > -            VirtQueueElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
> > > > > > > > +            SVQElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
> > > > > > > >               if (!elem) {
> > > > > > > >                   break;
> > > > > > > >               }
> > > > > > > > @@ -247,7 +315,7 @@ static bool vhost_svq_more_used(VhostShadowVirtqueue *svq)
> > > > > > > >       return svq->used_idx != svq->shadow_used_idx;
> > > > > > > >   }
> > > > > > > >
> > > > > > > > -static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> > > > > > > > +static SVQElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> > > > > > > >   {
> > > > > > > >       vring_desc_t *descs = svq->vring.desc;
> > > > > > > >       const vring_used_t *used = svq->vring.used;
> > > > > > > > @@ -279,7 +347,7 @@ static VirtQueueElement *vhost_svq_get_buf(VhostShadowVirtqueue *svq)
> > > > > > > >       descs[used_elem.id].next = svq->free_head;
> > > > > > > >       svq->free_head = used_elem.id;
> > > > > > > >
> > > > > > > > -    svq->ring_id_maps[used_elem.id]->len = used_elem.len;
> > > > > > > > +    svq->ring_id_maps[used_elem.id]->elem.len = used_elem.len;
> > > > > > > >       return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
> > > > > > > >   }
> > > > > > > >
> > > > > > > > @@ -296,12 +364,19 @@ static void vhost_svq_handle_call_no_test(EventNotifier *n)
> > > > > > > >
> > > > > > > >           vhost_svq_set_notification(svq, false);
> > > > > > > >           while (true) {
> > > > > > > > -            g_autofree VirtQueueElement *elem = vhost_svq_get_buf(svq);
> > > > > > > > -            if (!elem) {
> > > > > > > > +            g_autofree SVQElement *svq_elem = vhost_svq_get_buf(svq);
> > > > > > > > +            VirtQueueElement *elem;
> > > > > > > > +            if (!svq_elem) {
> > > > > > > >                   break;
> > > > > > > >               }
> > > > > > > >
> > > > > > > >               assert(i < svq->vring.num);
> > > > > > > > +            elem = &svq_elem->elem;
> > > > > > > > +
> > > > > > > > +            vhost_svq_unstash_addr(svq_elem->in_sg_stash, elem->in_sg,
> > > > > > > > +                                   elem->in_num);
> > > > > > > > +            vhost_svq_unstash_addr(svq_elem->out_sg_stash, elem->out_sg,
> > > > > > > > +                                   elem->out_num);
> > > > > > > >               virtqueue_fill(vq, elem, elem->len, i++);
> > > > > > > >           }
> > > > > > > >
> > > > > > > > @@ -451,14 +526,24 @@ void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> > > > > > > >       event_notifier_set_handler(&svq->host_notifier, NULL);
> > > > > > > >
> > > > > > > >       for (i = 0; i < svq->vring.num; ++i) {
> > > > > > > > -        g_autofree VirtQueueElement *elem = svq->ring_id_maps[i];
> > > > > > > > +        g_autofree SVQElement *svq_elem = svq->ring_id_maps[i];
> > > > > > > > +        VirtQueueElement *elem;
> > > > > > > > +
> > > > > > > > +        if (!svq_elem) {
> > > > > > > > +            continue;
> > > > > > > > +        }
> > > > > > > > +
> > > > > > > > +        elem = &svq_elem->elem;
> > > > > > > > +        vhost_svq_unstash_addr(svq_elem->in_sg_stash, elem->in_sg,
> > > > > > > > +                               elem->in_num);
> > > > > > > > +        vhost_svq_unstash_addr(svq_elem->out_sg_stash, elem->out_sg,
> > > > > > > > +                               elem->out_num);
> > > > > > > > +
> > > > > > > >           /*
> > > > > > > >            * Although the doc says we must unpop in order, it's ok to unpop
> > > > > > > >            * everything.
> > > > > > > >            */
> > > > > > > > -        if (elem) {
> > > > > > > > -            virtqueue_unpop(svq->vq, elem, elem->len);
> > > > > > > > -        }
> > > > > > > > +        virtqueue_unpop(svq->vq, elem, elem->len);
> > > > > > > >       }
> > > > > > > >   }
> > > > > > > >
> > > > > > > > @@ -466,7 +551,8 @@ void vhost_svq_stop(struct vhost_dev *dev, unsigned idx,
> > > > > > > >    * Creates vhost shadow virtqueue, and instruct vhost device to use the shadow
> > > > > > > >    * methods and file descriptors.
> > > > > > > >    */
> > > > > > > > -VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> > > > > > > > +VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx,
> > > > > > > > +                                    VhostIOVATree *iova_map)
> > > > > > > >   {
> > > > > > > >       int vq_idx = dev->vq_index + idx;
> > > > > > > >       unsigned num = virtio_queue_get_num(dev->vdev, vq_idx);
> > > > > > > > @@ -500,11 +586,13 @@ VhostShadowVirtqueue *vhost_svq_new(struct vhost_dev *dev, int idx)
> > > > > > > >       memset(svq->vring.desc, 0, driver_size);
> > > > > > > >       svq->vring.used = qemu_memalign(qemu_real_host_page_size, device_size);
> > > > > > > >       memset(svq->vring.used, 0, device_size);
> > > > > > > > +    svq->iova_map = iova_map;
> > > > > > > > +
> > > > > > > >       for (i = 0; i < num - 1; i++) {
> > > > > > > >           svq->vring.desc[i].next = cpu_to_le16(i + 1);
> > > > > > > >       }
> > > > > > > >
> > > > > > > > -    svq->ring_id_maps = g_new0(VirtQueueElement *, num);
> > > > > > > > +    svq->ring_id_maps = g_new0(SVQElement *, num);
> > > > > > > >       event_notifier_set_handler(&svq->call_notifier,
> > > > > > > >                                  vhost_svq_handle_call);
> > > > > > > >       return g_steal_pointer(&svq);
> > > > > > > > diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> > > > > > > > index a9c680b487..f5a12fee9d 100644
> > > > > > > > --- a/hw/virtio/vhost-vdpa.c
> > > > > > > > +++ b/hw/virtio/vhost-vdpa.c
> > > > > > > > @@ -176,6 +176,18 @@ static void vhost_vdpa_listener_region_add(MemoryListener *listener,
> > > > > > > >                                            vaddr, section->readonly);
> > > > > > > >
> > > > > > > >       llsize = int128_sub(llend, int128_make64(iova));
> > > > > > > > +    if (v->shadow_vqs_enabled) {
> > > > > > > > +        VhostDMAMap mem_region = {
> > > > > > > > +            .translated_addr = vaddr,
> > > > > > > > +            .size = int128_get64(llsize) - 1,
> > > > > > > > +            .perm = IOMMU_ACCESS_FLAG(true, section->readonly),
> > > > > > > > +        };
> > > > > > > > +
> > > > > > > > +        int r = vhost_iova_tree_alloc(v->iova_map, &mem_region);
> > > > > > > > +        assert(r == VHOST_DMA_MAP_OK);
> > > > > > > > +
> > > > > > > > +        iova = mem_region.iova;
> > > > > > > > +    }
> > > > > > > >
> > > > > > > >       ret = vhost_vdpa_dma_map(v, iova, int128_get64(llsize),
> > > > > > > >                                vaddr, section->readonly);
> > > > > > > > @@ -754,6 +766,23 @@ static bool  vhost_vdpa_force_iommu(struct vhost_dev *dev)
> > > > > > > >       return true;
> > > > > > > >   }
> > > > > > > >
> > > > > > > > +static int vhost_vdpa_get_iova_range(struct vhost_dev *dev,
> > > > > > > > +                                     hwaddr *first, hwaddr *last)
> > > > > > > > +{
> > > > > > > > +    int ret;
> > > > > > > > +    struct vhost_vdpa_iova_range range;
> > > > > > > > +
> > > > > > > > +    ret = vhost_vdpa_call(dev, VHOST_VDPA_GET_IOVA_RANGE, &range);
> > > > > > > > +    if (ret != 0) {
> > > > > > > > +        return ret;
> > > > > > > > +    }
> > > > > > > > +
> > > > > > > > +    *first = range.first;
> > > > > > > > +    *last = range.last;
> > > > > > > > +    trace_vhost_vdpa_get_iova_range(dev, *first, *last);
> > > > > > > > +    return ret;
> > > > > > > > +}
> > > > > > > > +
> > > > > > > >   /**
> > > > > > > >    * Maps QEMU vaddr memory to device in a suitable way for shadow virtqueue:
> > > > > > > >    * - It always reference qemu memory address, not guest's memory.
> > > > > > > > @@ -881,6 +910,7 @@ static bool vhost_vdpa_svq_start_vq(struct vhost_dev *dev, unsigned idx)
> > > > > > > >   static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > > > > > > >   {
> > > > > > > >       struct vhost_dev *hdev = v->dev;
> > > > > > > > +    hwaddr iova_first, iova_last;
> > > > > > > >       unsigned n;
> > > > > > > >       int r;
> > > > > > > >
> > > > > > > > @@ -894,7 +924,7 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > > > > > > >           /* Allocate resources */
> > > > > > > >           assert(v->shadow_vqs->len == 0);
> > > > > > > >           for (n = 0; n < hdev->nvqs; ++n) {
> > > > > > > > -            VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n);
> > > > > > > > +            VhostShadowVirtqueue *svq = vhost_svq_new(hdev, n, v->iova_map);
> > > > > > > >               if (unlikely(!svq)) {
> > > > > > > >                   g_ptr_array_set_size(v->shadow_vqs, 0);
> > > > > > > >                   return 0;
> > > > > > > > @@ -903,6 +933,8 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > > > > > > >           }
> > > > > > > >       }
> > > > > > > >
> > > > > > > > +    r = vhost_vdpa_get_iova_range(hdev, &iova_first, &iova_last);
> > > > > > > > +    assert(r == 0);
> > > > > > > >       r = vhost_vdpa_vring_pause(hdev);
> > > > > > > >       assert(r == 0);
> > > > > > > >
> > > > > > > > @@ -913,6 +945,12 @@ static unsigned vhost_vdpa_enable_svq(struct vhost_vdpa *v, bool enable)
> > > > > > > >           }
> > > > > > > >       }
> > > > > > > >
> > > > > > > > +    memory_listener_unregister(&v->listener);
> > > > > > > > +    if (vhost_vdpa_dma_unmap(v, iova_first,
> > > > > > > > +                             (iova_last - iova_first) & TARGET_PAGE_MASK)) {
> > > > > > > > +        error_report("Fail to invalidate device iotlb");
> > > > > > > > +    }
> > > > > > > > +
> > > > > > > >       /* Reset device so it can be configured */
> > > > > > > >       r = vhost_vdpa_dev_start(hdev, false);
> > > > > > > >       assert(r == 0);
> > > > > > > > diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> > > > > > > > index 8ed19e9d0c..650e521e35 100644
> > > > > > > > --- a/hw/virtio/trace-events
> > > > > > > > +++ b/hw/virtio/trace-events
> > > > > > > > @@ -52,6 +52,7 @@ vhost_vdpa_set_vring_call(void *dev, unsigned int index, int fd) "dev: %p index:
> > > > > > > >   vhost_vdpa_get_features(void *dev, uint64_t features) "dev: %p features: 0x%"PRIx64
> > > > > > > >   vhost_vdpa_set_owner(void *dev) "dev: %p"
> > > > > > > >   vhost_vdpa_vq_get_addr(void *dev, void *vq, uint64_t desc_user_addr, uint64_t avail_user_addr, uint64_t used_user_addr) "dev: %p vq: %p desc_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" used_user_addr: 0x%"PRIx64
> > > > > > > > +vhost_vdpa_get_iova_range(void *dev, uint64_t first, uint64_t last) "dev: %p first: 0x%"PRIx64" last: 0x%"PRIx64
> > > > > > > >
> > > > > > > >   # virtio.c
> > > > > > > >   virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
> > > > > > >
> > > > > >
> > > >
> > >
> >
>
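
Two points raised in the review above are that one lookup may hit more than
one map, and that a dedicated sgs array would avoid the stash/unstash step.
The following sketch shows how both could be combined. It is not the code
from the patch: `Map`, `find_map` and `translate_sg` are hypothetical names,
the linear search stands in for the IOVA tree lookup, and `size` here is an
exclusive byte length, whereas the quoted patch stores `llsize - 1`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Hypothetical stand-in for one qemu vaddr -> IOVA mapping. */
typedef struct {
    uintptr_t vaddr;  /* start of the range in qemu's address space */
    uint64_t iova;    /* start of the range as seen by the device */
    size_t size;      /* length in bytes (exclusive end is vaddr + size) */
} Map;

/* Linear search; a real implementation would query a tree instead. */
static const Map *find_map(const Map *maps, size_t n, uintptr_t addr)
{
    for (size_t i = 0; i < n; i++) {
        if (addr >= maps[i].vaddr && addr - maps[i].vaddr < maps[i].size) {
            return &maps[i];
        }
    }
    return NULL;
}

/*
 * Translate `in` into the separate array `out`, leaving `in` untouched so
 * nothing has to be stashed. An entry that crosses a map boundary is split
 * into several output entries. Returns the number of entries written, or
 * -1 if an address is unmapped or `out` is too small.
 */
static long translate_sg(const Map *maps, size_t n_maps,
                         const struct iovec *in, size_t n_in,
                         struct iovec *out, size_t max_out)
{
    size_t used = 0;

    for (size_t i = 0; i < n_in; i++) {
        uintptr_t addr = (uintptr_t)in[i].iov_base;
        size_t left = in[i].iov_len;

        while (left > 0) {
            const Map *m = find_map(maps, n_maps, addr);
            size_t off, chunk;

            if (!m || used == max_out) {
                return -1;
            }
            off = addr - m->vaddr;
            chunk = m->size - off;      /* bytes left in this map */
            if (chunk > left) {
                chunk = left;
            }
            out[used].iov_base = (void *)(uintptr_t)(m->iova + off);
            out[used].iov_len = chunk;
            used++;
            addr += chunk;
            left -= chunk;
        }
    }
    return (long)used;
}
```

The return value doubles as the number of shadow descriptors the element
needs, so the caller can compare it against the free slots before writing
anything to the vring.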
end of thread (newest message ~2021-10-26  4:34 UTC)
Thread overview: 90+ messages
2021-10-01  7:05 [RFC PATCH v4 00/20] vDPA shadow virtqueue Eugenio Pérez
2021-10-01  7:05 ` [RFC PATCH v4 01/20] virtio: Add VIRTIO_F_QUEUE_STATE Eugenio Pérez
2021-10-01  7:05 ` [RFC PATCH v4 02/20] virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED Eugenio Pérez
2021-10-01  7:05 ` [RFC PATCH v4 03/20] virtio: Add virtio_queue_is_host_notifier_enabled Eugenio Pérez
2021-10-01  7:05 ` [RFC PATCH v4 04/20] vhost: Make vhost_virtqueue_{start,stop} public Eugenio Pérez
2021-10-01  7:05 ` [RFC PATCH v4 05/20] vhost: Add x-vhost-enable-shadow-vq qmp Eugenio Pérez
2021-10-12  5:18   ` Markus Armbruster
2021-10-12 13:08     ` Eugenio Perez Martin
2021-10-12 13:45       ` Markus Armbruster
2021-10-14 12:01         ` Eugenio Perez Martin
2021-10-01  7:05 ` [RFC PATCH v4 06/20] vhost: Add VhostShadowVirtqueue Eugenio Pérez
2021-10-01  7:05 ` [RFC PATCH v4 07/20] vdpa: Register vdpa devices in a list Eugenio Pérez
2021-10-01  7:05 ` [RFC PATCH v4 08/20] vhost: Route guest->host notification through shadow virtqueue Eugenio Pérez
2021-10-12  5:19   ` Markus Armbruster
2021-10-12 13:09     ` Eugenio Perez Martin
2021-10-13  3:27   ` Jason Wang
2021-10-14 12:00     ` Eugenio Perez Martin
2021-10-15  3:45       ` Jason Wang
2021-10-15  9:08         ` Eugenio Perez Martin
2021-10-15 18:21       ` Eugenio Perez Martin
2021-10-01  7:05 ` [RFC PATCH v4 09/20] vdpa: Save call_fd in vhost-vdpa Eugenio Pérez
2021-10-13  3:43   ` Jason Wang
2021-10-14 12:11     ` Eugenio Perez Martin
2021-10-01  7:05 ` [RFC PATCH v4 10/20] vhost-vdpa: Take into account SVQ in vhost_vdpa_set_vring_call Eugenio Pérez
2021-10-13  3:43   ` Jason Wang
2021-10-14 12:18     ` Eugenio Perez Martin
2021-10-01  7:05 ` [RFC PATCH v4 11/20] vhost: Route host->guest notification through shadow virtqueue Eugenio Pérez
2021-10-13  3:47   ` Jason Wang
2021-10-14 16:39     ` Eugenio Perez Martin
2021-10-15  4:42       ` Jason Wang
2021-10-19  8:39         ` Eugenio Perez Martin
2021-10-20  2:01           ` Jason Wang
2021-10-20  6:36             ` Eugenio Perez Martin
2021-10-13  3:49   ` Jason Wang
2021-10-14 15:58     ` Eugenio Perez Martin
2021-10-15  4:24       ` Jason Wang
2021-10-01  7:05 ` [RFC PATCH v4 12/20] virtio: Add vhost_shadow_vq_get_vring_addr Eugenio Pérez
2021-10-13  3:54   ` Jason Wang
2021-10-14 14:39     ` Eugenio Perez Martin
2021-10-01  7:05 ` [RFC PATCH v4 13/20] vdpa: Save host and guest features Eugenio Pérez
2021-10-13  3:56   ` Jason Wang
2021-10-14 15:03     ` Eugenio Perez Martin
2021-10-01  7:05 ` [RFC PATCH v4 14/20] vhost: Add vhost_svq_valid_device_features to shadow vq Eugenio Pérez
2021-10-01  7:05 ` [RFC PATCH v4 15/20] vhost: Shadow virtqueue buffers forwarding Eugenio Pérez
2021-10-12  5:21   ` Markus Armbruster
2021-10-12 13:28     ` Eugenio Perez Martin
2021-10-12 13:48       ` Markus Armbruster
2021-10-14 15:04         ` Eugenio Perez Martin
2021-10-13  4:31   ` Jason Wang
2021-10-14 17:56     ` Eugenio Perez Martin
2021-10-15  4:23       ` Jason Wang
2021-10-15  9:33         ` Eugenio Perez Martin
2021-10-01  7:05 ` [RFC PATCH v4 16/20] vhost: Check for device VRING_USED_F_NO_NOTIFY at shadow virtqueue kick Eugenio Pérez
2021-10-13  4:35   ` Jason Wang
2021-10-15  6:17     ` Eugenio Perez Martin
2021-10-01  7:06 ` [RFC PATCH v4 17/20] vhost: Use VRING_AVAIL_F_NO_INTERRUPT at device call on shadow virtqueue Eugenio Pérez
2021-10-13  4:36   ` Jason Wang
2021-10-15  6:22     ` Eugenio Perez Martin
2021-10-01  7:06 ` [RFC PATCH v4 18/20] vhost: Add VhostIOVATree Eugenio Pérez
2021-10-19  8:32   ` Jason Wang
2021-10-19  9:22     ` Jason Wang
2021-10-20  7:54       ` Eugenio Perez Martin
2021-10-20  9:01         ` Jason Wang
2021-10-20 12:06           ` Eugenio Perez Martin
2021-10-21  2:34             ` Jason Wang
2021-10-21  7:03               ` Eugenio Perez Martin
2021-10-21  8:12                 ` Jason Wang
2021-10-21 14:33                   ` Eugenio Perez Martin
2021-10-26  4:29                     ` Jason Wang
2021-10-20  7:36     ` Eugenio Perez Martin
2021-10-01  7:06 ` [RFC PATCH v4 19/20] vhost: Use a tree to store memory mappings Eugenio Pérez
2021-10-01  7:06 ` [RFC PATCH v4 20/20] vdpa: Add custom IOTLB translations to SVQ Eugenio Pérez
2021-10-13  5:34   ` Jason Wang
2021-10-15  7:27     ` Eugenio Perez Martin
2021-10-15  7:37       ` Jason Wang
2021-10-15  8:20         ` Eugenio Perez Martin
2021-10-15  8:37           ` Jason Wang
2021-10-15  9:14           ` Eugenio Perez Martin
2021-10-19  9:24   ` Jason Wang
2021-10-19 10:28     ` Eugenio Perez Martin
2021-10-20  2:02       ` Jason Wang
2021-10-20  2:07         ` Jason Wang
2021-10-20  6:51           ` Eugenio Perez Martin
2021-10-20  9:03             ` Jason Wang
2021-10-20 11:56               ` Eugenio Perez Martin
2021-10-21  2:38                 ` Jason Wang
2021-10-26  4:32                 ` Jason Wang
2021-10-12  3:59 ` [RFC PATCH v4 00/20] vDPA shadow virtqueue Jason Wang
2021-10-12  4:06   ` Jason Wang
2021-10-12  9:09     ` Eugenio Perez Martin