* Re: [RFC v3 00/29] vDPA software assisted live migration
[not found] <20210519162903.1172366-1-eperezma@redhat.com>
@ 2021-05-24 9:38 ` Michael S. Tsirkin
[not found] ` <CAJaqyWcVm55qjaDpQsuLzaY0FCzjW2ARyvOWCdfS9RJNoOM7Aw@mail.gmail.com>
[not found] ` <20210519162903.1172366-7-eperezma@redhat.com>
` (6 subsequent siblings)
7 siblings, 1 reply; 19+ messages in thread
From: Michael S. Tsirkin @ 2021-05-24 9:38 UTC (permalink / raw)
To: Eugenio Pérez
Cc: Parav Pandit, Markus Armbruster, qemu-devel, Harpreet Singh Anand,
Xiao W Wang, Stefan Hajnoczi, Eli Cohen, virtualization,
Eric Blake, Michael Lilja
On Wed, May 19, 2021 at 06:28:34PM +0200, Eugenio Pérez wrote:
> Commit 17 introduces the buffer forwarding. The previous ones are
> preparations again, and the later ones enable some obvious
> optimizations. However, it needs the vdpa device to be able to map
> every IOVA space, and some vDPA devices are not able to do so. Checking
> for this is added in previous commits.
That might become a significant limitation. And it worries me that
this is such a big patchset which might yet take a while to get
finalized.
I have an idea: how about as a first step we implement a transparent
switch from vdpa to a software virtio in QEMU or a software vhost in
kernel?
This will give us live migration quickly with performance comparable
to failover but without dependence on guest cooperation.
Next step could be driving vdpa from userspace while still copying
packets to a pre-registered buffer.
Finally your approach will be a performance optimization for devices
that support arbitrary IOVA.
Thoughts?
--
MST
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC v3 00/29] vDPA software assisted live migration
[not found] ` <CAJaqyWcVm55qjaDpQsuLzaY0FCzjW2ARyvOWCdfS9RJNoOM7Aw@mail.gmail.com>
@ 2021-05-24 11:29 ` Michael S. Tsirkin
2021-07-19 14:13 ` Stefan Hajnoczi
2021-05-25 0:09 ` Jason Wang
1 sibling, 1 reply; 19+ messages in thread
From: Michael S. Tsirkin @ 2021-05-24 11:29 UTC (permalink / raw)
To: Eugenio Perez Martin
Cc: Parav Pandit, Markus Armbruster, qemu-level, Harpreet Singh Anand,
Xiao W Wang, Stefan Hajnoczi, Eli Cohen, virtualization,
Eric Blake, Michael Lilja
On Mon, May 24, 2021 at 12:37:48PM +0200, Eugenio Perez Martin wrote:
> On Mon, May 24, 2021 at 11:38 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> >
> > On Wed, May 19, 2021 at 06:28:34PM +0200, Eugenio Pérez wrote:
> > > Commit 17 introduces the buffer forwarding. The previous ones are
> > > preparations again, and the later ones enable some obvious
> > > optimizations. However, it needs the vdpa device to be able to map
> > > every IOVA space, and some vDPA devices are not able to do so. Checking
> > > for this is added in previous commits.
> >
> > That might become a significant limitation. And it worries me that
> > this is such a big patchset which might yet take a while to get
> > finalized.
> >
>
> Sorry, maybe I've been unclear here: Latter commits in this series
> address this limitation. Still not perfect: for example, it does not
> support adding or removing guest's memory at the moment, but this
> should be easy to implement on top.
>
> The main issue I'm observing is from the kernel if I'm not wrong: If I
> unmap every address, I cannot re-map them again. But the code in this
> patchset is mostly final, except for the comments that may arise on the
> mailing list, of course.
>
> > I have an idea: how about as a first step we implement a transparent
> > switch from vdpa to a software virtio in QEMU or a software vhost in
> > kernel?
> >
> > This will give us live migration quickly with performance comparable
> > to failover but without dependence on guest cooperation.
> >
>
> I think it should be doable. I'm not sure about the effort that needs
> to be done in qemu to hide these "hypervisor-failover devices" from
> guest's view but it should be comparable to failover, as you say.
>
> Networking should be ok by its nature, although it could require care
> on the host hardware setup. But I'm not sure how other types of
> vhost/vdpa devices may work that way. How would a disk/scsi device
> switch modes? Can the kernel take control of the vdpa device through
> vhost, and just start reporting with a dirty bitmap?
>
> Thanks!
It depends of course; e.g. blk is mostly reads/writes, so there is
not a lot of state. Just don't reorder or drop requests.
> > Next step could be driving vdpa from userspace while still copying
> > packets to a pre-registered buffer.
> >
> > Finally your approach will be a performance optimization for devices
> > that support arbitrary IOVA.
> >
> > Thoughts?
> >
> > --
> > MST
> >
* Re: [RFC v3 00/29] vDPA software assisted live migration
[not found] ` <CAJaqyWcVm55qjaDpQsuLzaY0FCzjW2ARyvOWCdfS9RJNoOM7Aw@mail.gmail.com>
2021-05-24 11:29 ` Michael S. Tsirkin
@ 2021-05-25 0:09 ` Jason Wang
1 sibling, 0 replies; 19+ messages in thread
From: Jason Wang @ 2021-05-25 0:09 UTC (permalink / raw)
To: Eugenio Perez Martin, Michael S. Tsirkin
Cc: Parav Pandit, Markus Armbruster, qemu-level, Harpreet Singh Anand,
Xiao W Wang, Stefan Hajnoczi, Eli Cohen, virtualization,
Eric Blake, Michael Lilja
On 2021/5/24 6:37 PM, Eugenio Perez Martin wrote:
> On Mon, May 24, 2021 at 11:38 AM Michael S. Tsirkin <mst@redhat.com> wrote:
>> On Wed, May 19, 2021 at 06:28:34PM +0200, Eugenio Pérez wrote:
>>> Commit 17 introduces the buffer forwarding. The previous ones are
>>> preparations again, and the later ones enable some obvious
>>> optimizations. However, it needs the vdpa device to be able to map
>>> every IOVA space, and some vDPA devices are not able to do so. Checking
>>> for this is added in previous commits.
>> That might become a significant limitation. And it worries me that
>> this is such a big patchset which might yet take a while to get
>> finalized.
>>
> Sorry, maybe I've been unclear here: Latter commits in this series
> address this limitation. Still not perfect: for example, it does not
> support adding or removing guest's memory at the moment, but this
> should be easy to implement on top.
>
> The main issue I'm observing is from the kernel if I'm not wrong: If I
> unmap every address, I cannot re-map them again.
This looks like a bug.
Does this happen only on some specific device (e.g. vp_vdpa), or is it a
general issue of vhost-vdpa?
> But the code in this
> patchset is mostly final, except for the comments that may arise on the
> mailing list, of course.
>
>> I have an idea: how about as a first step we implement a transparent
>> switch from vdpa to a software virtio in QEMU or a software vhost in
>> kernel?
>>
>> This will give us live migration quickly with performance comparable
>> to failover but without dependence on guest cooperation.
>>
> I think it should be doable. I'm not sure about the effort that needs
> to be done in qemu to hide these "hypervisor-failover devices" from
> guest's view but it should be comparable to failover, as you say.
Yes, if we want to switch, I'd go with a fallback to the vhost-vdpa network
backend instead.
Thanks
>
> Networking should be ok by its nature, although it could require care
> on the host hardware setup. But I'm not sure how other types of
> vhost/vdpa devices may work that way. How would a disk/scsi device
> switch modes? Can the kernel take control of the vdpa device through
> vhost, and just start reporting with a dirty bitmap?
>
> Thanks!
>
>> Next step could be driving vdpa from userspace while still copying
>> packets to a pre-registered buffer.
>>
>> Finally your approach will be a performance optimization for devices
>> that support arbitrary IOVA.
>>
>> Thoughts?
>>
>> --
>> MST
>>
* Re: [RFC v3 06/29] virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED
[not found] ` <20210519162903.1172366-7-eperezma@redhat.com>
@ 2021-05-26 1:06 ` Jason Wang
2021-05-26 1:10 ` Jason Wang
0 siblings, 1 reply; 19+ messages in thread
From: Jason Wang @ 2021-05-26 1:06 UTC (permalink / raw)
To: Eugenio Pérez, qemu-devel
Cc: Parav Pandit, Michael S. Tsirkin, Markus Armbruster,
virtualization, Harpreet Singh Anand, Xiao W Wang,
Stefan Hajnoczi, Eli Cohen, Eric Blake, Michael Lilja
On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
> So the guest can stop and start net device. It implements the RFC
> https://lists.oasis-open.org/archives/virtio-comment/202012/msg00027.html
>
> Stopping (as in "pausing") the device is required to migrate status and
> vring addresses between the device and SVQ.
>
> This is a WIP commit: as with VIRTIO_F_QUEUE_STATE, it is introduced in
> virtio_config.h before even being proposed for the kernel, with no feature
> flag and with no checking in the device. It also needs a modified
> vp_vdpa driver that supports setting and retrieving the status.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
> include/standard-headers/linux/virtio_config.h | 2 ++
> hw/net/virtio-net.c | 4 +++-
> 2 files changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/include/standard-headers/linux/virtio_config.h b/include/standard-headers/linux/virtio_config.h
> index 59fad3eb45..b3f6b1365d 100644
> --- a/include/standard-headers/linux/virtio_config.h
> +++ b/include/standard-headers/linux/virtio_config.h
> @@ -40,6 +40,8 @@
> #define VIRTIO_CONFIG_S_DRIVER_OK 4
> /* Driver has finished configuring features */
> #define VIRTIO_CONFIG_S_FEATURES_OK 8
> +/* Device is stopped */
> +#define VIRTIO_CONFIG_S_DEVICE_STOPPED 32
> /* Device entered invalid state, driver must reset it */
> #define VIRTIO_CONFIG_S_NEEDS_RESET 0x40
> /* We've given up on this device. */
> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
> index 96a3cc8357..2d3caea289 100644
> --- a/hw/net/virtio-net.c
> +++ b/hw/net/virtio-net.c
> @@ -198,7 +198,9 @@ static bool virtio_net_started(VirtIONet *n, uint8_t status)
> {
> VirtIODevice *vdev = VIRTIO_DEVICE(n);
> return (status & VIRTIO_CONFIG_S_DRIVER_OK) &&
> - (n->status & VIRTIO_NET_S_LINK_UP) && vdev->vm_running;
> + (!(n->status & VIRTIO_CONFIG_S_DEVICE_STOPPED)) &&
> + (n->status & VIRTIO_NET_S_LINK_UP) &&
> + vdev->vm_running;
> }
>
> static void virtio_net_announce_notify(VirtIONet *net)
It looks to me that this is only the pause part. We still need the resume?
Thanks
* Re: [RFC v3 06/29] virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED
2021-05-26 1:06 ` [RFC v3 06/29] virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED Jason Wang
@ 2021-05-26 1:10 ` Jason Wang
[not found] ` <CAJaqyWeV+za=xeKHb9vn=Y+0mfekCb8w5dmWNMgzQ6uOtU3jxw@mail.gmail.com>
0 siblings, 1 reply; 19+ messages in thread
From: Jason Wang @ 2021-05-26 1:10 UTC (permalink / raw)
To: Eugenio Pérez, qemu-devel
Cc: Parav Pandit, Michael S. Tsirkin, Markus Armbruster,
virtualization, Harpreet Singh Anand, Xiao W Wang,
Stefan Hajnoczi, Eli Cohen, Eric Blake, Michael Lilja
On 2021/5/26 9:06 AM, Jason Wang wrote:
>
> On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
>> So the guest can stop and start net device. It implements the RFC
>> https://lists.oasis-open.org/archives/virtio-comment/202012/msg00027.html
>>
>>
>> Stopping (as in "pausing") the device is required to migrate status and
>> vring addresses between the device and SVQ.
>>
>> This is a WIP commit: as with VIRTIO_F_QUEUE_STATE, it is introduced in
>> virtio_config.h before even being proposed for the kernel, with no feature
>> flag and with no checking in the device. It also needs a modified
>> vp_vdpa driver that supports setting and retrieving the status.
>>
>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>> ---
>> include/standard-headers/linux/virtio_config.h | 2 ++
>> hw/net/virtio-net.c | 4 +++-
>> 2 files changed, 5 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/standard-headers/linux/virtio_config.h
>> b/include/standard-headers/linux/virtio_config.h
>> index 59fad3eb45..b3f6b1365d 100644
>> --- a/include/standard-headers/linux/virtio_config.h
>> +++ b/include/standard-headers/linux/virtio_config.h
>> @@ -40,6 +40,8 @@
>> #define VIRTIO_CONFIG_S_DRIVER_OK 4
>> /* Driver has finished configuring features */
>> #define VIRTIO_CONFIG_S_FEATURES_OK 8
>> +/* Device is stopped */
>> +#define VIRTIO_CONFIG_S_DEVICE_STOPPED 32
>> /* Device entered invalid state, driver must reset it */
>> #define VIRTIO_CONFIG_S_NEEDS_RESET 0x40
>> /* We've given up on this device. */
>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>> index 96a3cc8357..2d3caea289 100644
>> --- a/hw/net/virtio-net.c
>> +++ b/hw/net/virtio-net.c
>> @@ -198,7 +198,9 @@ static bool virtio_net_started(VirtIONet *n,
>> uint8_t status)
>> {
>> VirtIODevice *vdev = VIRTIO_DEVICE(n);
>> return (status & VIRTIO_CONFIG_S_DRIVER_OK) &&
>> - (n->status & VIRTIO_NET_S_LINK_UP) && vdev->vm_running;
>> + (!(n->status & VIRTIO_CONFIG_S_DEVICE_STOPPED)) &&
>> + (n->status & VIRTIO_NET_S_LINK_UP) &&
>> + vdev->vm_running;
>> }
>> static void virtio_net_announce_notify(VirtIONet *net)
>
>
> It looks to me that this is only the pause part.
And even for pause, I don't see anything that prevents rx/tx from being
executed (e.g. virtio_net_handle_tx_bh or virtio_net_handle_rx)?
Thanks
> We still need the resume?
>
> Thanks
>
>
* Re: [RFC v3 13/29] vhost: Add vhost_get_iova_range operation
[not found] ` <20210519162903.1172366-14-eperezma@redhat.com>
@ 2021-05-26 1:14 ` Jason Wang
[not found] ` <CAJaqyWeL-0KjsBcXs1tYdvn9xLAK-x0Sb+RFuzPgngXxYtF9uw@mail.gmail.com>
0 siblings, 1 reply; 19+ messages in thread
From: Jason Wang @ 2021-05-26 1:14 UTC (permalink / raw)
To: Eugenio Pérez, qemu-devel
Cc: Parav Pandit, Michael S. Tsirkin, Markus Armbruster,
virtualization, Harpreet Singh Anand, Xiao W Wang,
Stefan Hajnoczi, Eli Cohen, Eric Blake, Michael Lilja
On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
> For simplicity, if a device does not support this operation, it means
> that it can handle the full (uint64_t)-1 IOVA address space.
Note that we probably need a separate patch for this.
And we need to do this during vhost-vdpa initialization. If the GPA is out
of the range, we need to fail the start of vhost-vdpa.
Thanks
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
> include/hw/virtio/vhost-backend.h | 5 +++++
> hw/virtio/vhost-vdpa.c | 18 ++++++++++++++++++
> hw/virtio/trace-events | 1 +
> 3 files changed, 24 insertions(+)
>
> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> index 94d3323905..bcb112c166 100644
> --- a/include/hw/virtio/vhost-backend.h
> +++ b/include/hw/virtio/vhost-backend.h
> @@ -36,6 +36,7 @@ struct vhost_vring_addr;
> struct vhost_scsi_target;
> struct vhost_iotlb_msg;
> struct vhost_virtqueue;
> +struct vhost_vdpa_iova_range;
>
> typedef int (*vhost_backend_init)(struct vhost_dev *dev, void *opaque);
> typedef int (*vhost_backend_cleanup)(struct vhost_dev *dev);
> @@ -127,6 +128,9 @@ typedef bool (*vhost_force_iommu_op)(struct vhost_dev *dev);
>
> typedef int (*vhost_vring_pause_op)(struct vhost_dev *dev);
>
> +typedef int (*vhost_get_iova_range)(struct vhost_dev *dev,
> + hwaddr *first, hwaddr *last);
> +
> typedef struct VhostOps {
> VhostBackendType backend_type;
> vhost_backend_init vhost_backend_init;
> @@ -173,6 +177,7 @@ typedef struct VhostOps {
> vhost_get_device_id_op vhost_get_device_id;
> vhost_vring_pause_op vhost_vring_pause;
> vhost_force_iommu_op vhost_force_iommu;
> + vhost_get_iova_range vhost_get_iova_range;
> } VhostOps;
>
> extern const VhostOps user_ops;
> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
> index 01d2101d09..74fe92935e 100644
> --- a/hw/virtio/vhost-vdpa.c
> +++ b/hw/virtio/vhost-vdpa.c
> @@ -579,6 +579,23 @@ static bool vhost_vdpa_force_iommu(struct vhost_dev *dev)
> return true;
> }
>
> +static int vhost_vdpa_get_iova_range(struct vhost_dev *dev,
> + hwaddr *first, hwaddr *last)
> +{
> + int ret;
> + struct vhost_vdpa_iova_range range;
> +
> + ret = vhost_vdpa_call(dev, VHOST_VDPA_GET_IOVA_RANGE, &range);
> + if (ret != 0) {
> + return ret;
> + }
> +
> + *first = range.first;
> + *last = range.last;
> + trace_vhost_vdpa_get_iova_range(dev, *first, *last);
> + return ret;
> +}
> +
> const VhostOps vdpa_ops = {
> .backend_type = VHOST_BACKEND_TYPE_VDPA,
> .vhost_backend_init = vhost_vdpa_init,
> @@ -611,4 +628,5 @@ const VhostOps vdpa_ops = {
> .vhost_get_device_id = vhost_vdpa_get_device_id,
> .vhost_vq_get_addr = vhost_vdpa_vq_get_addr,
> .vhost_force_iommu = vhost_vdpa_force_iommu,
> + .vhost_get_iova_range = vhost_vdpa_get_iova_range,
> };
> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
> index c62727f879..5debe3a681 100644
> --- a/hw/virtio/trace-events
> +++ b/hw/virtio/trace-events
> @@ -52,6 +52,7 @@ vhost_vdpa_set_vring_call(void *dev, unsigned int index, int fd) "dev: %p index:
> vhost_vdpa_get_features(void *dev, uint64_t features) "dev: %p features: 0x%"PRIx64
> vhost_vdpa_set_owner(void *dev) "dev: %p"
> vhost_vdpa_vq_get_addr(void *dev, void *vq, uint64_t desc_user_addr, uint64_t avail_user_addr, uint64_t used_user_addr) "dev: %p vq: %p desc_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" used_user_addr: 0x%"PRIx64
> +vhost_vdpa_get_iova_range(void *dev, uint64_t first, uint64_t last) "dev: %p first: 0x%"PRIx64" last: 0x%"PRIx64
>
> # virtio.c
> virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
* Re: [RFC v3 13/29] vhost: Add vhost_get_iova_range operation
[not found] ` <CAJaqyWeL-0KjsBcXs1tYdvn9xLAK-x0Sb+RFuzPgngXxYtF9uw@mail.gmail.com>
@ 2021-05-27 4:51 ` Jason Wang
[not found] ` <CAJaqyWf+=-nwOsS=zZEhmiTA_TotVMQibUgE0grCMZgXVDNpxg@mail.gmail.com>
0 siblings, 1 reply; 19+ messages in thread
From: Jason Wang @ 2021-05-27 4:51 UTC (permalink / raw)
To: Eugenio Perez Martin
Cc: Parav Pandit, Michael S. Tsirkin, Markus Armbruster, qemu-level,
Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
virtualization, Eric Blake, Michael Lilja
On 2021/5/27 1:49 AM, Eugenio Perez Martin wrote:
> On Wed, May 26, 2021 at 3:14 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
>>> For simplicity, if a device does not support this operation, it means
>>> that it can handle the full (uint64_t)-1 IOVA address space.
>>
>> Note that we probably need a separate patch for this.
>>
> Actually the comment is not in the right commit, the next one is the
> one that uses it. Is that what you mean?
No, it's about the following suggestions.
>
>> And we need to do this during vhost-vdpa initialization. If the GPA is out
>> of the range, we need to fail the start of vhost-vdpa.
Note that this is for non-IOMMU case. For the case of vIOMMU we probably
need to validate it against address width or other similar attributes.
Thanks
>>
> Right, that is still to-do.
>
> Maybe a series with just these two commits and failing the start if
> GPA is not in the range, as you say, would help to split the amount of
> changes.
>
> I will send it if no more comments arise about it.
>
> Thanks!
>
>> Thanks
>>
>>
>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>> ---
>>> include/hw/virtio/vhost-backend.h | 5 +++++
>>> hw/virtio/vhost-vdpa.c | 18 ++++++++++++++++++
>>> hw/virtio/trace-events | 1 +
>>> 3 files changed, 24 insertions(+)
>>>
>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
>>> index 94d3323905..bcb112c166 100644
>>> --- a/include/hw/virtio/vhost-backend.h
>>> +++ b/include/hw/virtio/vhost-backend.h
>>> @@ -36,6 +36,7 @@ struct vhost_vring_addr;
>>> struct vhost_scsi_target;
>>> struct vhost_iotlb_msg;
>>> struct vhost_virtqueue;
>>> +struct vhost_vdpa_iova_range;
>>>
>>> typedef int (*vhost_backend_init)(struct vhost_dev *dev, void *opaque);
>>> typedef int (*vhost_backend_cleanup)(struct vhost_dev *dev);
>>> @@ -127,6 +128,9 @@ typedef bool (*vhost_force_iommu_op)(struct vhost_dev *dev);
>>>
>>> typedef int (*vhost_vring_pause_op)(struct vhost_dev *dev);
>>>
>>> +typedef int (*vhost_get_iova_range)(struct vhost_dev *dev,
>>> + hwaddr *first, hwaddr *last);
>>> +
>>> typedef struct VhostOps {
>>> VhostBackendType backend_type;
>>> vhost_backend_init vhost_backend_init;
>>> @@ -173,6 +177,7 @@ typedef struct VhostOps {
>>> vhost_get_device_id_op vhost_get_device_id;
>>> vhost_vring_pause_op vhost_vring_pause;
>>> vhost_force_iommu_op vhost_force_iommu;
>>> + vhost_get_iova_range vhost_get_iova_range;
>>> } VhostOps;
>>>
>>> extern const VhostOps user_ops;
>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>>> index 01d2101d09..74fe92935e 100644
>>> --- a/hw/virtio/vhost-vdpa.c
>>> +++ b/hw/virtio/vhost-vdpa.c
>>> @@ -579,6 +579,23 @@ static bool vhost_vdpa_force_iommu(struct vhost_dev *dev)
>>> return true;
>>> }
>>>
>>> +static int vhost_vdpa_get_iova_range(struct vhost_dev *dev,
>>> + hwaddr *first, hwaddr *last)
>>> +{
>>> + int ret;
>>> + struct vhost_vdpa_iova_range range;
>>> +
>>> + ret = vhost_vdpa_call(dev, VHOST_VDPA_GET_IOVA_RANGE, &range);
>>> + if (ret != 0) {
>>> + return ret;
>>> + }
>>> +
>>> + *first = range.first;
>>> + *last = range.last;
>>> + trace_vhost_vdpa_get_iova_range(dev, *first, *last);
>>> + return ret;
>>> +}
>>> +
>>> const VhostOps vdpa_ops = {
>>> .backend_type = VHOST_BACKEND_TYPE_VDPA,
>>> .vhost_backend_init = vhost_vdpa_init,
>>> @@ -611,4 +628,5 @@ const VhostOps vdpa_ops = {
>>> .vhost_get_device_id = vhost_vdpa_get_device_id,
>>> .vhost_vq_get_addr = vhost_vdpa_vq_get_addr,
>>> .vhost_force_iommu = vhost_vdpa_force_iommu,
>>> + .vhost_get_iova_range = vhost_vdpa_get_iova_range,
>>> };
>>> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
>>> index c62727f879..5debe3a681 100644
>>> --- a/hw/virtio/trace-events
>>> +++ b/hw/virtio/trace-events
>>> @@ -52,6 +52,7 @@ vhost_vdpa_set_vring_call(void *dev, unsigned int index, int fd) "dev: %p index:
>>> vhost_vdpa_get_features(void *dev, uint64_t features) "dev: %p features: 0x%"PRIx64
>>> vhost_vdpa_set_owner(void *dev) "dev: %p"
>>> vhost_vdpa_vq_get_addr(void *dev, void *vq, uint64_t desc_user_addr, uint64_t avail_user_addr, uint64_t used_user_addr) "dev: %p vq: %p desc_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" used_user_addr: 0x%"PRIx64
>>> +vhost_vdpa_get_iova_range(void *dev, uint64_t first, uint64_t last) "dev: %p first: 0x%"PRIx64" last: 0x%"PRIx64
>>>
>>> # virtio.c
>>> virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
* Re: [RFC v3 15/29] vhost: Add enable_custom_iommu to VhostOps
[not found] ` <20210519162903.1172366-16-eperezma@redhat.com>
@ 2021-05-31 9:01 ` Jason Wang
0 siblings, 0 replies; 19+ messages in thread
From: Jason Wang @ 2021-05-31 9:01 UTC (permalink / raw)
To: Eugenio Pérez, qemu-devel
Cc: Parav Pandit, Michael S. Tsirkin, Markus Armbruster,
virtualization, Harpreet Singh Anand, Xiao W Wang,
Stefan Hajnoczi, Eli Cohen, Eric Blake, Michael Lilja
On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
> This operation enables the backend-specific IOTLB entries.
>
> If a backend supports this, it starts managing its own entries, and vhost
> can disable it through this operation and recover control.
>
> Every enable/disable operation must also clear all IOTLB device entries.
>
> At the moment, the only backend that does so is vhost-vdpa. To fully
> support these, vdpa also needs to expose a way for the vhost subsystem to
> map and unmap entries. This will be done in future commits.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
I think there's probably no need to introduce this helper.
Instead, we can introduce ops like shadow_vq_start()/stop(). Then
details like this could be hidden there.
(And hide the backend details from vhost.c, avoiding calling
vhost_vdpa_dma_map() directly.)
Thanks
> ---
> include/hw/virtio/vhost-backend.h | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
> index bcb112c166..f8eed2ace5 100644
> --- a/include/hw/virtio/vhost-backend.h
> +++ b/include/hw/virtio/vhost-backend.h
> @@ -128,6 +128,9 @@ typedef bool (*vhost_force_iommu_op)(struct vhost_dev *dev);
>
> typedef int (*vhost_vring_pause_op)(struct vhost_dev *dev);
>
> +typedef int (*vhost_enable_custom_iommu_op)(struct vhost_dev *dev,
> + bool enable);
> +
> typedef int (*vhost_get_iova_range)(struct vhost_dev *dev,
> hwaddr *first, hwaddr *last);
>
> @@ -177,6 +180,7 @@ typedef struct VhostOps {
> vhost_get_device_id_op vhost_get_device_id;
> vhost_vring_pause_op vhost_vring_pause;
> vhost_force_iommu_op vhost_force_iommu;
> + vhost_enable_custom_iommu_op vhost_enable_custom_iommu;
> vhost_get_iova_range vhost_get_iova_range;
> } VhostOps;
>
* Re: [RFC v3 21/29] vhost: Add VhostIOVATree
[not found] ` <20210519162903.1172366-22-eperezma@redhat.com>
@ 2021-05-31 9:40 ` Jason Wang
[not found] ` <CAJaqyWfKTiKeKLLjB9qDf4qJwL420YRo6FrJgozp_tn0Z57pOA@mail.gmail.com>
0 siblings, 1 reply; 19+ messages in thread
From: Jason Wang @ 2021-05-31 9:40 UTC (permalink / raw)
To: Eugenio Pérez, qemu-devel
Cc: Parav Pandit, Michael S. Tsirkin, Markus Armbruster,
virtualization, Harpreet Singh Anand, Xiao W Wang,
Stefan Hajnoczi, Eli Cohen, Eric Blake, Michael Lilja
On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
> This tree is able to look up a translated address from an IOVA address.
>
> At first glance it is similar to util/iova-tree. However, SVQ working on
> devices with limited IOVA space needs more capabilities, like allocating
> IOVA chunks or performing reverse translations (qemu addresses to iova).
>
> Starting a separate implementation. Knowing that insertions/deletions
> will not be as frequent as searches,
This might not be true if vIOMMU is enabled.
> it uses an ordered array as
> the implementation.
I wonder how much overhead g_array could have if it needs to grow.
> A different name could be used, but ordered
> searchable array is a little bit long though.
Note that we had a very good example for this. That is the kernel iova
allocator which is implemented via rbtree.
Instead of figuring out g_array vs g_tree stuff, I would simply go with
g_tree first (based on util/iova-tree) and borrow the well-designed kernel
iova allocator API to have a generic IOVA one instead of coupling it
with vhost. It could be used by other userspace drivers in the future:
init_iova_domain()/put_iova_domain();
alloc_iova()/free_iova();
find_iova();
Another reference is the iova allocator that is implemented in VFIO.
Thanks
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
> hw/virtio/vhost-iova-tree.h | 50 ++++++++++
> hw/virtio/vhost-iova-tree.c | 188 ++++++++++++++++++++++++++++++++++++
> hw/virtio/meson.build | 2 +-
> 3 files changed, 239 insertions(+), 1 deletion(-)
> create mode 100644 hw/virtio/vhost-iova-tree.h
> create mode 100644 hw/virtio/vhost-iova-tree.c
>
> diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
> new file mode 100644
> index 0000000000..2a44af8b3a
> --- /dev/null
> +++ b/hw/virtio/vhost-iova-tree.h
> @@ -0,0 +1,50 @@
> +/*
> + * vhost software live migration ring
> + *
> + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
> + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#ifndef HW_VIRTIO_VHOST_IOVA_TREE_H
> +#define HW_VIRTIO_VHOST_IOVA_TREE_H
> +
> +#include <gmodule.h>
> +
> +#include "exec/memory.h"
> +
> +typedef struct VhostDMAMap {
> + void *translated_addr;
> + hwaddr iova;
> + hwaddr size; /* Inclusive */
> + IOMMUAccessFlags perm;
> +} VhostDMAMap;
> +
> +typedef enum VhostDMAMapNewRC {
> + VHOST_DMA_MAP_OVERLAP = -2,
> + VHOST_DMA_MAP_INVALID = -1,
> + VHOST_DMA_MAP_OK = 0,
> +} VhostDMAMapNewRC;
> +
> +/**
> + * VhostIOVATree
> + *
> + * Store and search IOVA -> Translated mappings.
> + *
> + * Note that it cannot remove nodes.
> + */
> +typedef struct VhostIOVATree {
> + /* Ordered array of reverse translations, IOVA address to qemu memory. */
> + GArray *iova_taddr_map;
> +} VhostIOVATree;
> +
> +void vhost_iova_tree_new(VhostIOVATree *iova_rm);
> +void vhost_iova_tree_destroy(VhostIOVATree *iova_rm);
> +
> +const VhostDMAMap *vhost_iova_tree_find_taddr(const VhostIOVATree *iova_rm,
> + const VhostDMAMap *map);
> +VhostDMAMapNewRC vhost_iova_tree_insert(VhostIOVATree *iova_rm,
> + VhostDMAMap *map);
> +
> +#endif
> diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
> new file mode 100644
> index 0000000000..dfd7e448b5
> --- /dev/null
> +++ b/hw/virtio/vhost-iova-tree.c
> @@ -0,0 +1,188 @@
> +/*
> + * vhost software live migration ring
> + *
> + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
> + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
> + *
> + * SPDX-License-Identifier: GPL-2.0-or-later
> + */
> +
> +#include "qemu/osdep.h"
> +#include "vhost-iova-tree.h"
> +
> +#define G_ARRAY_NOT_ZERO_TERMINATED false
> +#define G_ARRAY_NOT_CLEAR_ON_ALLOC false
> +
> +/**
> + * Inserts an element after an existing one in garray.
> + *
> + * @array The array
> + * @prev_elem The previous element of the array, or NULL if prepending
> + * @map The DMA map
> + *
> + * It provides the additional advantage of being type safe over
> + * g_array_insert_val, which accepts a reference pointer instead of a value
> + * with no complaints.
> + */
> +static void vhost_iova_tree_insert_after(GArray *array,
> + const VhostDMAMap *prev_elem,
> + const VhostDMAMap *map)
> +{
> + size_t pos;
> +
> + if (!prev_elem) {
> + pos = 0;
> + } else {
> + pos = prev_elem - &g_array_index(array, typeof(*prev_elem), 0) + 1;
> + }
> +
> + g_array_insert_val(array, pos, *map);
> +}
> +
> +static gint vhost_iova_tree_cmp_iova(gconstpointer a, gconstpointer b)
> +{
> + const VhostDMAMap *m1 = a, *m2 = b;
> +
> + if (m1->iova > m2->iova + m2->size) {
> + return 1;
> + }
> +
> + if (m1->iova + m1->size < m2->iova) {
> + return -1;
> + }
> +
> + /* Overlapped */
> + return 0;
> +}
> +
> +/**
> + * Find the previous node to a given iova
> + *
> + * @array The ascending ordered-by-translated-addr array of VhostDMAMap
> + * @map The map to insert
> + * @prev Returned location of the previous map
> + *
> + * Return VHOST_DMA_MAP_OK if everything went well, or VHOST_DMA_MAP_OVERLAP if
> + * it already exists. It is ok to use this function to check if a given range
> + * exists, but it will use a linear search.
> + *
> + * TODO: We can use bsearch to locate the entry if we save the state in the
> + * needle, knowing that the needle is always the first argument to
> + * compare_func.
> + */
> +static VhostDMAMapNewRC vhost_iova_tree_find_prev(const GArray *array,
> + GCompareFunc compare_func,
> + const VhostDMAMap *map,
> + const VhostDMAMap **prev)
> +{
> + size_t i;
> + int r;
> +
> + *prev = NULL;
> + for (i = 0; i < array->len; ++i) {
> + r = compare_func(map, &g_array_index(array, typeof(*map), i));
> + if (r == 0) {
> + return VHOST_DMA_MAP_OVERLAP;
> + }
> + if (r < 0) {
> + return VHOST_DMA_MAP_OK;
> + }
> +
> + *prev = &g_array_index(array, typeof(**prev), i);
> + }
> +
> + return VHOST_DMA_MAP_OK;
> +}
> +
> +/**
> + * Create a new IOVA tree
> + *
> + * @tree The IOVA tree
> + */
> +void vhost_iova_tree_new(VhostIOVATree *tree)
> +{
> + assert(tree);
> +
> + tree->iova_taddr_map = g_array_new(G_ARRAY_NOT_ZERO_TERMINATED,
> + G_ARRAY_NOT_CLEAR_ON_ALLOC,
> + sizeof(VhostDMAMap));
> +}
> +
> +/**
> + * Destroy an IOVA tree
> + *
> + * @tree The iova tree
> + */
> +void vhost_iova_tree_destroy(VhostIOVATree *tree)
> +{
> + g_array_unref(g_steal_pointer(&tree->iova_taddr_map));
> +}
> +
> +/**
> + * Perform a search on a GArray.
> + *
> + * @array Glib array
> + * @map Map to look up
> + * @compare_func Compare function to use
> + *
> + * Return The found element or NULL if not found.
> + *
> + * This can be replaced with g_array_binary_search (Since glib 2.62) when that
> + * is common enough.
> + */
> +static const VhostDMAMap *vhost_iova_tree_bsearch(const GArray *array,
> + const VhostDMAMap *map,
> + GCompareFunc compare_func)
> +{
> + return bsearch(map, array->data, array->len, sizeof(*map), compare_func);
> +}
> +
> +/**
> + * Find the translated address stored from an IOVA address
> + *
> + * @tree The iova tree
> + * @map The map with the memory address
> + *
> + * Return the stored mapping, or NULL if not found.
> + */
> +const VhostDMAMap *vhost_iova_tree_find_taddr(const VhostIOVATree *tree,
> + const VhostDMAMap *map)
> +{
> + return vhost_iova_tree_bsearch(tree->iova_taddr_map, map,
> + vhost_iova_tree_cmp_iova);
> +}
> +
> +/**
> + * Insert a new map
> + *
> + * @tree The iova tree
> + * @map The iova map
> + *
> + * Returns:
> + * - VHOST_DMA_MAP_OK if the map fits in the container
> + * - VHOST_DMA_MAP_INVALID if the map does not make sense (like size overflow)
> + * - VHOST_DMA_MAP_OVERLAP if the tree already contains that map
> + * The assigned iova can be queried in map.
> + */
> +VhostDMAMapNewRC vhost_iova_tree_insert(VhostIOVATree *tree,
> + VhostDMAMap *map)
> +{
> + const VhostDMAMap *prev;
> + int find_prev_rc;
> +
> + if (map->translated_addr + map->size < map->translated_addr ||
> + map->iova + map->size < map->iova || map->perm == IOMMU_NONE) {
> + return VHOST_DMA_MAP_INVALID;
> + }
> +
> + /* Check for duplicates, and save position for insertion */
> + find_prev_rc = vhost_iova_tree_find_prev(tree->iova_taddr_map,
> + vhost_iova_tree_cmp_iova, map,
> + &prev);
> + if (find_prev_rc == VHOST_DMA_MAP_OVERLAP) {
> + return VHOST_DMA_MAP_OVERLAP;
> + }
> +
> + vhost_iova_tree_insert_after(tree->iova_taddr_map, prev, map);
> + return VHOST_DMA_MAP_OK;
> +}
> diff --git a/hw/virtio/meson.build b/hw/virtio/meson.build
> index 8b5a0225fe..cb306b83c6 100644
> --- a/hw/virtio/meson.build
> +++ b/hw/virtio/meson.build
> @@ -11,7 +11,7 @@ softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-stub.c'))
>
> virtio_ss = ss.source_set()
> virtio_ss.add(files('virtio.c'))
> -virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c'))
> +virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c', 'vhost-iova-tree.c'))
> virtio_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user.c'))
> virtio_ss.add(when: 'CONFIG_VHOST_VDPA', if_true: files('vhost-vdpa.c'))
> virtio_ss.add(when: 'CONFIG_VIRTIO_BALLOON', if_true: files('virtio-balloon.c'))
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC v3 17/29] vhost: Shadow virtqueue buffers forwarding
[not found] ` <20210519162903.1172366-18-eperezma@redhat.com>
@ 2021-06-02 9:50 ` Jason Wang
[not found] ` <CAJaqyWf7M1fjrd+kr-2bcYj+ibrqZVoREZuTiJ0i+p6dA+Dukw@mail.gmail.com>
0 siblings, 1 reply; 19+ messages in thread
From: Jason Wang @ 2021-06-02 9:50 UTC (permalink / raw)
To: Eugenio Pérez, qemu-devel
Cc: Parav Pandit, Michael S. Tsirkin, Markus Armbruster,
virtualization, Harpreet Singh Anand, Xiao W Wang,
Stefan Hajnoczi, Eli Cohen, Eric Blake, Michael Lilja
On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
> Initial version of the shadow virtqueue that actually forwards buffers. The
> exposed addresses are qemu's virtual addresses, so devices with an IOMMU
> that does not allow full mapping of qemu's address space do not work
> at the moment.
>
> Also for simplicity it only supports modern devices, that expect the vring
> in little endian, with a split ring and no event idx or indirect
> descriptors.
>
> It reuses the VirtQueue code for the device part. The driver part is
> based on Linux's virtio_ring driver, but with stripped functionality
> and optimizations so it's easier to review.
>
> Later commits will solve some of these concerns.
It would be much easier to review if you squashed those
"enhancements" into this patch.
>
> The code also needs to map the used ring (device part) as RW in, and only
> in, vhost-net. Mapping it (or calling vhost_device_iotlb_miss)
> unconditionally would print an error in the case of vhost devices with
> their own mapping (vdpa).
I think we should not depend on the IOTLB miss. Instead, we should
program the device IOTLB before starting the svq. Or is there anything
that prevents you from doing this?
> To know if this call is needed, vhost_sw_live_migration_start_vq and
> vhost_sw_live_migration_stop copy the test performed in
> vhost_dev_start. Testing for the actual backend type could be cleaner,
> or checking for a non-NULL vhost_force_iommu, enable_custom_iommu, or
> another vhostOp. We could also extract this test into its own function,
> so its name could give a better hint. For now, just copy the
> vhost_dev_start check.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
> hw/virtio/vhost-shadow-virtqueue.c | 205 +++++++++++++++++++++++++++--
> hw/virtio/vhost.c | 134 ++++++++++++++++++-
> 2 files changed, 325 insertions(+), 14 deletions(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index ff50f12410..6d767fe248 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -9,6 +9,7 @@
>
> #include "hw/virtio/vhost-shadow-virtqueue.h"
> #include "hw/virtio/vhost.h"
> +#include "hw/virtio/virtio-access.h"
>
> #include "standard-headers/linux/vhost_types.h"
>
> @@ -48,9 +49,93 @@ typedef struct VhostShadowVirtqueue {
>
> /* Virtio device */
> VirtIODevice *vdev;
> +
> + /* Map for returning guest's descriptors */
> + VirtQueueElement **ring_id_maps;
> +
> + /* Next head to expose to device */
> + uint16_t avail_idx_shadow;
> +
> + /* Next free descriptor */
> + uint16_t free_head;
> +
> + /* Last seen used idx */
> + uint16_t shadow_used_idx;
> +
> + /* Next head to consume from device */
> + uint16_t used_idx;
> } VhostShadowVirtqueue;
>
> -/* Forward guest notifications */
> +static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
> + const struct iovec *iovec,
> + size_t num, bool more_descs, bool write)
> +{
> + uint16_t i = svq->free_head, last = svq->free_head;
> + unsigned n;
> + uint16_t flags = write ? cpu_to_le16(VRING_DESC_F_WRITE) : 0;
> + vring_desc_t *descs = svq->vring.desc;
> +
> + if (num == 0) {
> + return;
> + }
> +
> + for (n = 0; n < num; n++) {
> + if (more_descs || (n + 1 < num)) {
> + descs[i].flags = flags | cpu_to_le16(VRING_DESC_F_NEXT);
> + } else {
> + descs[i].flags = flags;
> + }
> + descs[i].addr = cpu_to_le64((hwaddr)iovec[n].iov_base);
> + descs[i].len = cpu_to_le32(iovec[n].iov_len);
> +
> + last = i;
> + i = cpu_to_le16(descs[i].next);
> + }
> +
> + svq->free_head = le16_to_cpu(descs[last].next);
> +}
> +
> +static unsigned vhost_shadow_vq_add_split(VhostShadowVirtqueue *svq,
> + VirtQueueElement *elem)
> +{
> + int head;
> + unsigned avail_idx;
> + vring_avail_t *avail = svq->vring.avail;
> +
> + head = svq->free_head;
> +
> + /* We need some descriptors here */
> + assert(elem->out_num || elem->in_num);
> +
> + vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
> + elem->in_num > 0, false);
> + vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
> +
> + /*
> + * Put entry in available array (but don't update avail->idx until they
> + * do sync).
> + */
> + avail_idx = svq->avail_idx_shadow & (svq->vring.num - 1);
> + avail->ring[avail_idx] = cpu_to_le16(head);
> + svq->avail_idx_shadow++;
> +
> + /* Expose descriptors to device */
It's better to describe the exact ordering here, e.g. "Update the avail
index only after the descriptors have been written".
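For reference, the publish/consume ordering being discussed can be sketched outside QEMU with C11 fences standing in for smp_wmb()/smp_rmb(). All names below are illustrative, not QEMU's:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define RING_SIZE 4  /* power of two, as in split virtqueues */

struct ring {
    uint16_t ring[RING_SIZE];   /* stands in for avail->ring[] */
    _Atomic uint16_t idx;       /* stands in for avail->idx */
};

/* Driver side: write the ring entry first, publish idx second. */
static void ring_publish(struct ring *r, uint16_t shadow_idx, uint16_t head)
{
    r->ring[shadow_idx % RING_SIZE] = head;
    /* Plays the role of smp_wmb(); pairs with the fence in ring_peek(). */
    atomic_thread_fence(memory_order_release);
    atomic_store_explicit(&r->idx, (uint16_t)(shadow_idx + 1),
                          memory_order_relaxed);
}

/* Device side: read idx first; only then is the entry safe to read. */
static int ring_peek(struct ring *r, uint16_t last_seen, uint16_t *head)
{
    uint16_t idx = atomic_load_explicit(&r->idx, memory_order_relaxed);
    if (idx == last_seen) {
        return 0;  /* nothing new */
    }
    /* Plays the role of smp_rmb(). */
    atomic_thread_fence(memory_order_acquire);
    *head = r->ring[last_seen % RING_SIZE];
    return 1;
}
```

The release fence guarantees the descriptor write is visible before the new index, which is exactly the property the comment in the patch should spell out.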
> + smp_wmb();
> + avail->idx = cpu_to_le16(svq->avail_idx_shadow);
> +
> + return head;
> +
> +}
> +
> +static void vhost_shadow_vq_add(VhostShadowVirtqueue *svq,
> + VirtQueueElement *elem)
> +{
> + unsigned qemu_head = vhost_shadow_vq_add_split(svq, elem);
> +
> + svq->ring_id_maps[qemu_head] = elem;
> +}
> +
> +/* Handle guest->device notifications */
> static void vhost_handle_guest_kick(EventNotifier *n)
> {
> VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> @@ -60,7 +145,67 @@ static void vhost_handle_guest_kick(EventNotifier *n)
> return;
> }
>
> - event_notifier_set(&svq->kick_notifier);
> + /* Make available as many buffers as possible */
> + do {
> + if (virtio_queue_get_notification(svq->vq)) {
> + /* No more notifications until process all available */
> + virtio_queue_set_notification(svq->vq, false);
> + }
> +
> + while (true) {
> + VirtQueueElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
> + if (!elem) {
> + break;
> + }
> +
> + vhost_shadow_vq_add(svq, elem);
> + event_notifier_set(&svq->kick_notifier);
> + }
> +
> + virtio_queue_set_notification(svq->vq, true);
> + } while (!virtio_queue_empty(svq->vq));
> +}
> +
> +static bool vhost_shadow_vq_more_used(VhostShadowVirtqueue *svq)
> +{
> + if (svq->used_idx != svq->shadow_used_idx) {
> + return true;
> + }
> +
> + /* Get used idx must not be reordered */
> + smp_rmb();
> + svq->shadow_used_idx = cpu_to_le16(svq->vring.used->idx);
> +
> + return svq->used_idx != svq->shadow_used_idx;
> +}
> +
> +static VirtQueueElement *vhost_shadow_vq_get_buf(VhostShadowVirtqueue *svq)
> +{
> + vring_desc_t *descs = svq->vring.desc;
> + const vring_used_t *used = svq->vring.used;
> + vring_used_elem_t used_elem;
> + uint16_t last_used;
> +
> + if (!vhost_shadow_vq_more_used(svq)) {
> + return NULL;
> + }
> +
> + last_used = svq->used_idx & (svq->vring.num - 1);
> + used_elem.id = le32_to_cpu(used->ring[last_used].id);
> + used_elem.len = le32_to_cpu(used->ring[last_used].len);
> +
> + if (unlikely(used_elem.id >= svq->vring.num)) {
> + error_report("Device %s says index %u is available", svq->vdev->name,
> + used_elem.id);
> + return NULL;
> + }
> +
> + descs[used_elem.id].next = svq->free_head;
> + svq->free_head = used_elem.id;
> +
> + svq->used_idx++;
> + svq->ring_id_maps[used_elem.id]->len = used_elem.len;
> + return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
> }
>
> /* Forward vhost notifications */
> @@ -69,17 +214,33 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
> VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
> call_notifier);
> EventNotifier *masked_notifier;
> + VirtQueue *vq = svq->vq;
>
> masked_notifier = svq->masked_notifier.n;
>
> - if (!masked_notifier) {
> - unsigned n = virtio_get_queue_index(svq->vq);
> - virtio_queue_invalidate_signalled_used(svq->vdev, n);
> - virtio_notify_irqfd(svq->vdev, svq->vq);
> - } else if (!svq->masked_notifier.signaled) {
> - svq->masked_notifier.signaled = true;
> - event_notifier_set(svq->masked_notifier.n);
> - }
> + /* Make as many buffers as possible used. */
> + do {
> + unsigned i = 0;
> +
> + /* TODO: Use VRING_AVAIL_F_NO_INTERRUPT */
> + while (true) {
> + g_autofree VirtQueueElement *elem = vhost_shadow_vq_get_buf(svq);
> + if (!elem) {
> + break;
> + }
> +
> + assert(i < svq->vring.num);
> + virtqueue_fill(vq, elem, elem->len, i++);
> + }
> +
> + virtqueue_flush(vq, i);
> + if (!masked_notifier) {
> + virtio_notify_irqfd(svq->vdev, svq->vq);
Any reason that you don't use virtio_notify() here?
> + } else if (!svq->masked_notifier.signaled) {
> + svq->masked_notifier.signaled = true;
> + event_notifier_set(svq->masked_notifier.n);
> + }
This is an example of the extra complexity you get by doing the shadow
virtqueue at the virtio level.
If you did everything at e.g. the vhost-vdpa level, you wouldn't need to
care about masked_notifier at all.
Thanks
* Re: [RFC v3 25/29] vhost: Add custom IOTLB translations to SVQ
[not found] ` <20210519162903.1172366-26-eperezma@redhat.com>
@ 2021-06-02 9:51 ` Jason Wang
[not found] ` <CAJaqyWc7OWeQnXHihY=mYp=N+rRJLcbFUsJA-OszD6tyr6v-FQ@mail.gmail.com>
0 siblings, 1 reply; 19+ messages in thread
From: Jason Wang @ 2021-06-02 9:51 UTC (permalink / raw)
To: Eugenio Pérez, qemu-devel
Cc: Parav Pandit, Michael S. Tsirkin, Markus Armbruster,
virtualization, Harpreet Singh Anand, Xiao W Wang,
Stefan Hajnoczi, Eli Cohen, Eric Blake, Michael Lilja
On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
> Use translations added in IOVAReverseMaps in SVQ if the vhost device
> does not support the mapping of the full qemu's virtual address space.
> In other cases, Shadow Virtqueue still uses the qemu's virtual address
> of the buffer pointed by the descriptor, which has been translated
> already by qemu's VirtQueue machinery.
I'd say let's stick to a single kind of translation (an IOVA allocator)
that works for all cases first, and add optimizations on top.
>
> Now every element needs to store the previous address also, so VirtQueue
> can consume the elements properly. This adds a little overhead per VQ
> element, having to allocate more memory to stash them. As a possible
> optimization, this allocation could be avoided if the descriptor is not
> a chain but a single one, but this is left undone.
>
> Checking also for vhost_set_iotlb_callback to send used ring remapping.
> This is only needed for kernel, and would print an error in case of
> vhost devices with its own mapping (vdpa).
>
> This could change for other callback, like checking for
> vhost_force_iommu, enable_custom_iommu, or another. Another option could
> be to, at least, extract the check of "is map(used, writable) needed?"
> in another function. But at the moment just copy the check used in
> vhost_dev_start here.
>
> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
> ---
> hw/virtio/vhost-shadow-virtqueue.c | 134 ++++++++++++++++++++++++++---
> hw/virtio/vhost.c | 29 +++++--
> 2 files changed, 145 insertions(+), 18 deletions(-)
>
> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
> index 934d3bb27b..a92da979d1 100644
> --- a/hw/virtio/vhost-shadow-virtqueue.c
> +++ b/hw/virtio/vhost-shadow-virtqueue.c
> @@ -10,12 +10,19 @@
> #include "hw/virtio/vhost-shadow-virtqueue.h"
> #include "hw/virtio/vhost.h"
> #include "hw/virtio/virtio-access.h"
> +#include "hw/virtio/vhost-iova-tree.h"
>
> #include "standard-headers/linux/vhost_types.h"
>
> #include "qemu/error-report.h"
> #include "qemu/main-loop.h"
>
> +typedef struct SVQElement {
> + VirtQueueElement elem;
> + void **in_sg_stash;
> + void **out_sg_stash;
Any reason for a trick like this?
Can we simply use an iovec and iov_copy() here?
Thanks
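A minimal standalone sketch of the iovec-based alternative: keep a plain copy of the original iovec instead of a `void **` stash, so restoring the guest addresses is a single memcpy. The helper names below are illustrative; they are not QEMU's iov_copy() API:

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Instead of stashing only the iov_base pointers in a void ** array,
 * duplicate the whole iovec.  Restoring is then one memcpy and no
 * per-pointer juggling is needed.
 */
static struct iovec *iov_save(const struct iovec *iov, size_t num)
{
    struct iovec *copy = malloc(num * sizeof(*copy));
    if (copy) {
        memcpy(copy, iov, num * sizeof(*copy));
    }
    return copy;
}

/* Rewrite iov_base with translated addresses (here: a fixed offset). */
static void iov_translate(struct iovec *iov, size_t num, ptrdiff_t delta)
{
    for (size_t i = 0; i < num; i++) {
        iov[i].iov_base = (char *)iov[i].iov_base + delta;
    }
}

/* Undo the translation using the saved copy, then release it. */
static void iov_restore(struct iovec *iov, size_t num, struct iovec *saved)
{
    memcpy(iov, saved, num * sizeof(*iov));
    free(saved);
}
```

This keeps the element's scatter-gather list self-describing and avoids the parallel in/out stash arrays the comment is questioning.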
* Re: [RFC v3 00/29] vDPA software assisted live migration
[not found] <20210519162903.1172366-1-eperezma@redhat.com>
` (6 preceding siblings ...)
[not found] ` <20210519162903.1172366-26-eperezma@redhat.com>
@ 2021-06-02 9:59 ` Jason Wang
7 siblings, 0 replies; 19+ messages in thread
From: Jason Wang @ 2021-06-02 9:59 UTC (permalink / raw)
To: Eugenio Pérez, qemu-devel
Cc: Parav Pandit, Michael S. Tsirkin, Markus Armbruster,
virtualization, Harpreet Singh Anand, Xiao W Wang,
Stefan Hajnoczi, Eli Cohen, Eric Blake, Michael Lilja
On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
> This series enables shadow virtqueue for vhost-vdpa devices. This is a
> new method of vhost device migration: instead of relying on the vhost
> device's dirty logging capability, SW assisted LM intercepts the dataplane,
> forwarding the descriptors between VM and device. It is intended for vDPA
> devices with no dirty memory tracking capabilities.
>
> In this migration mode, qemu offers a new vring to the device to
> read and write into, and disables vhost notifiers, processing guest and
> vhost notifications in qemu. On used buffer relay, qemu will mark the
> dirty memory as with plain virtio-net devices. This way, devices do
> not need to have dirty page logging capability.
>
> This series is a POC doing SW LM for vhost-net and vhost-vdpa devices.
> The former already has dirty page logging capabilities, but it is both
> easier to test and uses different code paths in qemu.
>
> For qemu to use shadow virtqueues the vhost-net devices need to be
> instantiated:
> * With IOMMU (iommu_platform=on,ats=on)
> * Without event_idx (event_idx=off)
>
> And shadow virtqueue needs to be enabled for them with QMP command
> like:
>
> { "execute": "x-vhost-enable-shadow-vq",
> "arguments": { "name": "dev0", "enable": true } }
>
> The series includes some commits to delete in the final version. One
> of them is the one that adds vhost_kernel_vring_pause to vhost kernel
> devices. This is only intended to work with vhost-net devices, as a way
> to test the solution, so don't use any other vhost kernel device in the
> same test.
>
> The vhost-vdpa devices should work the same way. However, vp_vdpa is
> not working properly with intel iommu unmapping, so this series adds two
> extra commits to allow testing the solution: enabling SVQ mode from
> device start and forbidding any other vhost-vdpa memory mapping. The
> causes of this are still being debugged.
>
> For testing vhost-vdpa devices vp_vdpa device has been used with nested
> virtualization, using a qemu virtio-net device in L0. To be able to
> stop and reset status, features in RFC status has been implemented in
> commits 5 and 6. After that, virtio-net driver in L0 guest is replaced
> by vp_vdpa driver, and a nested qemu instance is launched using it.
>
> This vp_vdpa driver also needs to be modified to support the RFCs,
> mainly removing the _S_STOPPED status flag and
> implementing actual vp_vdpa_set_vq_state and vp_vdpa_get_vq_state
> callbacks.
>
> Just the notification forwarding (with no descriptor relay) can be
> achieved with patches 7 and 8, and starting SVQ. Previous commits
> are cleanup ones and declaration of QMP command.
>
> Commit 17 introduces the buffer forwarding. Previous ones are
> preparations again, and later ones enable some obvious
> optimizations. However, it needs the vdpa device to be able to map
> every IOVA space, and some vDPA devices are not able to do so. Checking
> for this is added in previous commits.
>
> Later commits allow vhost and the shadow virtqueue to track and translate
> between qemu virtual addresses and a restricted iommu range. At the
> moment it is not able to delete old translations or limit the maximum
> range it can translate, nor can vhost add new memory regions once
> SVQ is enabled, but it is somewhat straightforward to add these.
>
> This is a big series, so the idea is to send it in logical chunks once
> all comments have been collected. As a first complete use case, an SVQ
> mode with no possibility of going back to regular mode would be
> enough, and this RFC already has all the ingredients except internal
> memory tracking.
>
> It is based on the ideas of DPDK SW assisted LM, in the series of
> DPDK's https://patchwork.dpdk.org/cover/48370/ . However, this one does
> not map the shadow vq in the guest's VA, but in qemu's.
>
> Comments are welcome!
Thanks a lot for working on this.
I feel like we need to start from something simple and get some minimal
functionality working first.
There are two major issues for the current complexity:
1) the current code tries to work on general virtio level (all kinds of
vhost backends)
2) two kinds of translations are used (qemu HVA and qemu IOVA)
For 1), I'd suggest starting from vhost-vdpa, and it's even better to
hide all the svq stuff from the vhost API first. That is to say, do it
entirely inside vhost-vDPA and leave the rest of QEMU untouched.
This makes things easier, and after that part is merged we can start to
think about how to generalize it to other vhost backends (though it's
still questionable whether that's worth doing). I believe most of the
code could be reused.
For 2), I think we'd better always go with qemu IOVA (an IOVA allocator).
This should work for all cases and may simplify the code a lot. In the
future, if we find qemu HVA is useful, we can implement a dedicated
allocator for vhost-net to get a 1:1 mapping.
Thoughts?
Thanks
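A minimal sketch of what such an IOVA allocator could look like — a bump allocator over a fixed range. The names and the 4 KiB alignment are assumptions, and a real implementation would also support freeing and reuse:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t hwaddr;

/*
 * Hand out device-visible addresses from a fixed [first, last] range,
 * independently of qemu virtual addresses.  This only illustrates the
 * allocation side of the "qemu IOVA" scheme being suggested.
 */
struct iova_allocator {
    hwaddr next;  /* first free IOVA */
    hwaddr last;  /* inclusive end of the usable range */
};

static void iova_allocator_init(struct iova_allocator *a,
                                hwaddr first, hwaddr last)
{
    a->next = first;
    a->last = last;
}

/* Returns 0 on success, -1 if the range is exhausted. */
static int iova_alloc(struct iova_allocator *a, hwaddr size, hwaddr *iova)
{
    /* Align allocations to 4 KiB, the usual IOMMU page size. */
    hwaddr aligned = (a->next + 0xfffULL) & ~0xfffULL;

    if (size == 0 || aligned + size < aligned ||
        aligned + size - 1 > a->last) {
        return -1;
    }
    *iova = aligned;
    a->next = aligned + size;
    return 0;
}
```

With something like this, every guest buffer gets a fresh IOVA and the shadow virtqueue never has to expose qemu HVAs to the device, which is what makes the scheme work for devices with a limited IOVA range.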
>
> TODO:
> * Event, indirect, packed, and other features of virtio - waiting for
>    confirmation of the big picture.
> * vDPA devices: Grow the IOVA tree to track new or deleted memory. Cap
>    the IOVA limit in the tree so it cannot grow forever.
> * Separate buffer forwarding into its own AIO context, so we can
>    throw more threads at that task and don't need to stop the main
>    event loop.
> * IOMMU optimizations, so batching and bigger chunks of IOVA can be
>    sent to the device.
> * Automatic kick-in on live-migration.
> * Proper documentation.
>
> Thanks!
>
> Changes from v2 RFC:
> * Adding vhost-vdpa devices support
> * Fixed some memory leaks pointed by different comments
>
> Changes from v1 RFC:
> * Use QMP instead of migration to start SVQ mode.
> * Only accepting IOMMU devices, closer behavior with target devices
> (vDPA)
> * Fix invalid masking/unmasking of vhost call fd.
> * Use of proper methods for synchronization.
> * No need to modify VirtIO device code, all of the changes are
> contained in vhost code.
> * Delete superfluous code.
> * An intermediate RFC was sent with only the notifications forwarding
> changes. It can be seen in
> https://patchew.org/QEMU/20210129205415.876290-1-eperezma@redhat.com/
> * v1 at
> https://lists.gnu.org/archive/html/qemu-devel/2020-11/msg05372.html
>
> Eugenio Pérez (29):
> virtio: Add virtio_queue_is_host_notifier_enabled
> vhost: Save masked_notifier state
> vhost: Add VhostShadowVirtqueue
> vhost: Add x-vhost-enable-shadow-vq qmp
> virtio: Add VIRTIO_F_QUEUE_STATE
> virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED
> vhost: Route guest->host notification through shadow virtqueue
> vhost: Route host->guest notification through shadow virtqueue
> vhost: Avoid re-set masked notifier in shadow vq
> virtio: Add vhost_shadow_vq_get_vring_addr
> vhost: Add vhost_vring_pause operation
> vhost: add vhost_kernel_vring_pause
> vhost: Add vhost_get_iova_range operation
> vhost: add vhost_has_limited_iova_range
> vhost: Add enable_custom_iommu to VhostOps
> vhost-vdpa: Add vhost_vdpa_enable_custom_iommu
> vhost: Shadow virtqueue buffers forwarding
> vhost: Use vhost_enable_custom_iommu to unmap everything if available
> vhost: Check for device VRING_USED_F_NO_NOTIFY at shadow virtqueue
> kick
> vhost: Use VRING_AVAIL_F_NO_INTERRUPT at device call on shadow
> virtqueue
> vhost: Add VhostIOVATree
> vhost: Add iova_rev_maps_find_iova to IOVAReverseMaps
> vhost: Use a tree to store memory mappings
> vhost: Add iova_rev_maps_alloc
> vhost: Add custom IOTLB translations to SVQ
> vhost: Map in vdpa-dev
> vhost-vdpa: Implement vhost_vdpa_vring_pause operation
> vhost-vdpa: never map with vDPA listener
> vhost: Start vhost-vdpa SVQ directly
>
> qapi/net.json | 22 +
> hw/virtio/vhost-iova-tree.h | 61 ++
> hw/virtio/vhost-shadow-virtqueue.h | 38 ++
> hw/virtio/virtio-pci.h | 1 +
> include/hw/virtio/vhost-backend.h | 16 +
> include/hw/virtio/vhost-vdpa.h | 2 +-
> include/hw/virtio/vhost.h | 14 +
> include/hw/virtio/virtio.h | 5 +-
> .../standard-headers/linux/virtio_config.h | 5 +
> include/standard-headers/linux/virtio_pci.h | 2 +
> hw/net/virtio-net.c | 4 +-
> hw/virtio/vhost-backend.c | 42 ++
> hw/virtio/vhost-iova-tree.c | 283 ++++++++
> hw/virtio/vhost-shadow-virtqueue.c | 643 ++++++++++++++++++
> hw/virtio/vhost-vdpa.c | 73 +-
> hw/virtio/vhost.c | 459 ++++++++++++-
> hw/virtio/virtio-pci.c | 9 +
> hw/virtio/virtio.c | 5 +
> hw/virtio/meson.build | 2 +-
> hw/virtio/trace-events | 1 +
> 20 files changed, 1663 insertions(+), 24 deletions(-)
> create mode 100644 hw/virtio/vhost-iova-tree.h
> create mode 100644 hw/virtio/vhost-shadow-virtqueue.h
> create mode 100644 hw/virtio/vhost-iova-tree.c
> create mode 100644 hw/virtio/vhost-shadow-virtqueue.c
>
* Re: [RFC v3 06/29] virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED
[not found] ` <CAJaqyWeV+za=xeKHb9vn=Y+0mfekCb8w5dmWNMgzQ6uOtU3jxw@mail.gmail.com>
@ 2021-06-03 3:12 ` Jason Wang
0 siblings, 0 replies; 19+ messages in thread
From: Jason Wang @ 2021-06-03 3:12 UTC (permalink / raw)
To: Eugenio Perez Martin
Cc: Parav Pandit, Michael S. Tsirkin, Markus Armbruster, qemu-level,
Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
virtualization, Eric Blake, Michael Lilja
On 2021/6/1 3:13 PM, Eugenio Perez Martin wrote:
> On Wed, May 26, 2021 at 3:10 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/5/26 9:06 AM, Jason Wang wrote:
>>> On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
>>>> So the guest can stop and start net device. It implements the RFC
>>>> https://lists.oasis-open.org/archives/virtio-comment/202012/msg00027.html
>>>>
>>>>
>>>> To stop (as "pause") the device is required to migrate status and vring
>>>> addresses between device and SVQ.
>>>>
>>>> This is a WIP commit: as with VIRTIO_F_QUEUE_STATE, is introduced in
>>>> virtio_config.h before of even proposing for the kernel, with no feature
>>>> flag, and, with no checking in the device. It also needs a modified
>>>> vp_vdpa driver that supports to set and retrieve status.
>>>>
>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>> ---
>>>> include/standard-headers/linux/virtio_config.h | 2 ++
>>>> hw/net/virtio-net.c | 4 +++-
>>>> 2 files changed, 5 insertions(+), 1 deletion(-)
>>>>
>>>> diff --git a/include/standard-headers/linux/virtio_config.h
>>>> b/include/standard-headers/linux/virtio_config.h
>>>> index 59fad3eb45..b3f6b1365d 100644
>>>> --- a/include/standard-headers/linux/virtio_config.h
>>>> +++ b/include/standard-headers/linux/virtio_config.h
>>>> @@ -40,6 +40,8 @@
>>>> #define VIRTIO_CONFIG_S_DRIVER_OK 4
>>>> /* Driver has finished configuring features */
>>>> #define VIRTIO_CONFIG_S_FEATURES_OK 8
>>>> +/* Device is stopped */
>>>> +#define VIRTIO_CONFIG_S_DEVICE_STOPPED 32
>>>> /* Device entered invalid state, driver must reset it */
>>>> #define VIRTIO_CONFIG_S_NEEDS_RESET 0x40
>>>> /* We've given up on this device. */
>>>> diff --git a/hw/net/virtio-net.c b/hw/net/virtio-net.c
>>>> index 96a3cc8357..2d3caea289 100644
>>>> --- a/hw/net/virtio-net.c
>>>> +++ b/hw/net/virtio-net.c
>>>> @@ -198,7 +198,9 @@ static bool virtio_net_started(VirtIONet *n,
>>>> uint8_t status)
>>>> {
>>>> VirtIODevice *vdev = VIRTIO_DEVICE(n);
>>>> return (status & VIRTIO_CONFIG_S_DRIVER_OK) &&
>>>> - (n->status & VIRTIO_NET_S_LINK_UP) && vdev->vm_running;
>>>> + (!(n->status & VIRTIO_CONFIG_S_DEVICE_STOPPED)) &&
>>>> + (n->status & VIRTIO_NET_S_LINK_UP) &&
>>>> + vdev->vm_running;
>>>> }
>>>> static void virtio_net_announce_notify(VirtIONet *net)
>>>
>>> It looks to me this is only the pause part.
> For SVQ we need to switch vring addresses, and a full reset of the
> device is required for that. Actually, the pause is just used to
> recover
>
> If you prefer, this could be sent as a separate series where the full
> pause/resume cycle is implemented, and then SVQ uses the pause part.
> However, there is no use for the resume part at the moment.
That would be fine if you can send it in another series.
>
>> And even for pause, I don't see anything that prevents rx/tx from being
>> executed? (E.g virtio_net_handle_tx_bh or virtio_net_handle_rx).
>>
> virtio_net_started is called from virtio_net_set_status. If
> _S_DEVICE_STOPPED is true, the former returns false, and the variable
> queue_started is false in the latter:
> queue_started =
> virtio_net_started(n, queue_status) && !n->vhost_started;
>
> After that, it should work like a regular device reset or link down if
> I'm not wrong, and the last part of virtio_net_set_status should
> delete the timer or cancel the bh.
You are right.
Thanks
>
>> Thanks
>>
>>
>>> We still need the resume?
>>>
>>> Thanks
>>>
>>>
* Re: [RFC v3 13/29] vhost: Add vhost_get_iova_range operation
[not found] ` <CAJaqyWf+=-nwOsS=zZEhmiTA_TotVMQibUgE0grCMZgXVDNpxg@mail.gmail.com>
@ 2021-06-03 3:13 ` Jason Wang
0 siblings, 0 replies; 19+ messages in thread
From: Jason Wang @ 2021-06-03 3:13 UTC (permalink / raw)
To: Eugenio Perez Martin
Cc: Parav Pandit, Michael S. Tsirkin, Markus Armbruster, qemu-level,
Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
virtualization, Eric Blake, Michael Lilja
On 2021/6/1 3:17 PM, Eugenio Perez Martin wrote:
> On Thu, May 27, 2021 at 6:51 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/5/27 1:49 AM, Eugenio Perez Martin wrote:
>>> On Wed, May 26, 2021 at 3:14 AM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
>>>>> For simplicity, if a device does not support this operation it means
>>>>> that it can handle the full (uint64_t)-1 iova address space.
>>>> Note that, we probably need a separated patch for this.
>>>>
>>> Actually the comment is not in the right commit, the next one is the
>>> one that uses it. Is that what you mean?
>>
>> No, it's about the following suggestions.
>>
>>
>>>> And we need to this during vhost-vdpa initialization. If GPA is out of
>>>> the range, we need to fail the start of vhost-vdpa.
>>
>> Note that this is for non-IOMMU case. For the case of vIOMMU we probably
>> need to validate it against address width or other similar attributes.
>>
> Right.
>
> What should qemu do if the memory of the guest gets expanded outside
> of the range? I think there is no clean way to fail the memory
> addition, is there?
I'm not sure, but I guess there should be a way to fail the memory hot-add
(otherwise we could introduce error reporting for it).
Thanks
>
>> Thanks
>>
>>
>>> Right, that is still to-do.
>>>
>>> Maybe a series with just these two commits and failing the start if
>>> GPA is not in the range, as you say, would help to split the amount of
>>> changes.
>>>
>>> I will send it if no more comments arise about it.
>>>
>>> Thanks!
>>>
>>>> THanks
>>>>
>>>>
>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>> ---
>>>>> include/hw/virtio/vhost-backend.h | 5 +++++
>>>>> hw/virtio/vhost-vdpa.c | 18 ++++++++++++++++++
>>>>> hw/virtio/trace-events | 1 +
>>>>> 3 files changed, 24 insertions(+)
>>>>>
>>>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
>>>>> index 94d3323905..bcb112c166 100644
>>>>> --- a/include/hw/virtio/vhost-backend.h
>>>>> +++ b/include/hw/virtio/vhost-backend.h
>>>>> @@ -36,6 +36,7 @@ struct vhost_vring_addr;
>>>>> struct vhost_scsi_target;
>>>>> struct vhost_iotlb_msg;
>>>>> struct vhost_virtqueue;
>>>>> +struct vhost_vdpa_iova_range;
>>>>>
>>>>> typedef int (*vhost_backend_init)(struct vhost_dev *dev, void *opaque);
>>>>> typedef int (*vhost_backend_cleanup)(struct vhost_dev *dev);
>>>>> @@ -127,6 +128,9 @@ typedef bool (*vhost_force_iommu_op)(struct vhost_dev *dev);
>>>>>
>>>>> typedef int (*vhost_vring_pause_op)(struct vhost_dev *dev);
>>>>>
>>>>> +typedef int (*vhost_get_iova_range)(struct vhost_dev *dev,
>>>>> + hwaddr *first, hwaddr *last);
>>>>> +
>>>>> typedef struct VhostOps {
>>>>> VhostBackendType backend_type;
>>>>> vhost_backend_init vhost_backend_init;
>>>>> @@ -173,6 +177,7 @@ typedef struct VhostOps {
>>>>> vhost_get_device_id_op vhost_get_device_id;
>>>>> vhost_vring_pause_op vhost_vring_pause;
>>>>> vhost_force_iommu_op vhost_force_iommu;
>>>>> + vhost_get_iova_range vhost_get_iova_range;
>>>>> } VhostOps;
>>>>>
>>>>> extern const VhostOps user_ops;
>>>>> diff --git a/hw/virtio/vhost-vdpa.c b/hw/virtio/vhost-vdpa.c
>>>>> index 01d2101d09..74fe92935e 100644
>>>>> --- a/hw/virtio/vhost-vdpa.c
>>>>> +++ b/hw/virtio/vhost-vdpa.c
>>>>> @@ -579,6 +579,23 @@ static bool vhost_vdpa_force_iommu(struct vhost_dev *dev)
>>>>> return true;
>>>>> }
>>>>>
>>>>> +static int vhost_vdpa_get_iova_range(struct vhost_dev *dev,
>>>>> + hwaddr *first, hwaddr *last)
>>>>> +{
>>>>> + int ret;
>>>>> + struct vhost_vdpa_iova_range range;
>>>>> +
>>>>> + ret = vhost_vdpa_call(dev, VHOST_VDPA_GET_IOVA_RANGE, &range);
>>>>> + if (ret != 0) {
>>>>> + return ret;
>>>>> + }
>>>>> +
>>>>> + *first = range.first;
>>>>> + *last = range.last;
>>>>> + trace_vhost_vdpa_get_iova_range(dev, *first, *last);
>>>>> + return ret;
>>>>> +}
>>>>> +
>>>>> const VhostOps vdpa_ops = {
>>>>> .backend_type = VHOST_BACKEND_TYPE_VDPA,
>>>>> .vhost_backend_init = vhost_vdpa_init,
>>>>> @@ -611,4 +628,5 @@ const VhostOps vdpa_ops = {
>>>>> .vhost_get_device_id = vhost_vdpa_get_device_id,
>>>>> .vhost_vq_get_addr = vhost_vdpa_vq_get_addr,
>>>>> .vhost_force_iommu = vhost_vdpa_force_iommu,
>>>>> + .vhost_get_iova_range = vhost_vdpa_get_iova_range,
>>>>> };
>>>>> diff --git a/hw/virtio/trace-events b/hw/virtio/trace-events
>>>>> index c62727f879..5debe3a681 100644
>>>>> --- a/hw/virtio/trace-events
>>>>> +++ b/hw/virtio/trace-events
>>>>> @@ -52,6 +52,7 @@ vhost_vdpa_set_vring_call(void *dev, unsigned int index, int fd) "dev: %p index:
>>>>> vhost_vdpa_get_features(void *dev, uint64_t features) "dev: %p features: 0x%"PRIx64
>>>>> vhost_vdpa_set_owner(void *dev) "dev: %p"
>>>>> vhost_vdpa_vq_get_addr(void *dev, void *vq, uint64_t desc_user_addr, uint64_t avail_user_addr, uint64_t used_user_addr) "dev: %p vq: %p desc_user_addr: 0x%"PRIx64" avail_user_addr: 0x%"PRIx64" used_user_addr: 0x%"PRIx64
>>>>> +vhost_vdpa_get_iova_range(void *dev, uint64_t first, uint64_t last) "dev: %p first: 0x%"PRIx64" last: 0x%"PRIx64
>>>>>
>>>>> # virtio.c
>>>>> virtqueue_alloc_element(void *elem, size_t sz, unsigned in_num, unsigned out_num) "elem %p size %zd in_num %u out_num %u"
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC v3 17/29] vhost: Shadow virtqueue buffers forwarding
[not found] ` <CAJaqyWf7M1fjrd+kr-2bcYj+ibrqZVoREZuTiJ0i+p6dA+Dukw@mail.gmail.com>
@ 2021-06-03 3:34 ` Jason Wang
0 siblings, 0 replies; 19+ messages in thread
From: Jason Wang @ 2021-06-03 3:34 UTC (permalink / raw)
To: Eugenio Perez Martin
Cc: Parav Pandit, Michael S. Tsirkin, Markus Armbruster, qemu-level,
Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
virtualization, Eric Blake, Michael Lilja
On 2021/6/3 1:18 AM, Eugenio Perez Martin wrote:
> On Wed, Jun 2, 2021 at 11:51 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
>>> Initial version of shadow virtqueue that actually forwards buffers. The
>>> exposed addresses are qemu's virtual addresses, so devices with an IOMMU
>>> that does not allow full mapping of qemu's address space do not work
>>> at the moment.
>>>
>>> Also for simplicity it only supports modern devices, which expect the
>>> vring in little endian, with a split ring and no event idx or indirect
>>> descriptors.
>>>
>>> It reuses the VirtQueue code for the device part. The driver part is
>>> based on Linux's virtio_ring driver, but with stripped functionality
>>> and optimizations so it's easier to review.
>>>
>>> Later commits will solve some of these concerns.
>>
>> It would be much easier to review if you squash those
>> "enhancements" into this patch.
>>
> Ok, they will be in the same commit for the next version.
>
>>> Code also needs to map the used ring (device part) as RW in, and only in,
>>> vhost-net. To map (or call vhost_device_iotlb_miss) unconditionally
>>> would print an error in case of vhost devices with their own mapping
>>> (vdpa).
>>
>> I think we should not depend on the IOTLB miss. Instead, we should
>> program the device IOTLB before starting the svq. Or is there anything
>> that prevent you from doing this?
>>
> Sorry for being unclear, that is what I meant in the message: No other
> device than kernel vhost needs the map (as "sent iotlb miss ahead"),
> so we must make it conditional. Doing it unconditionally would make
> nothing but an error appear on the screen, but it is incorrect anyway.
>
> Is it clearer this way?
So what I'm worried about is the following code:
int vhost_device_iotlb_miss(struct vhost_dev *dev, uint64_t iova, int write)
{
    ...
    if (dev->shadow_vqs_enabled) {
        /* Shadow virtqueue translations in its Virtual Address Space */
        const VhostDMAMap *result;
        const VhostDMAMap needle = {
            .iova = iova,
        };
        result = vhost_iova_tree_find_taddr(&dev->iova_map, &needle);
        ...
    }
I wonder what the reason is for doing that (sorry if I've asked this in the
previous version).
I think the correct way is to map those in the device IOTLB beforehand,
instead of relying on the miss.
Then we can have a unified code for vhost-vDPA and vhost-kernel.
>
>>> To know if this call is needed, vhost_sw_live_migration_start_vq and
>>> vhost_sw_live_migration_stop copy the test performed in
>>> vhost_dev_start. Testing for the actual backend type could be cleaner,
>>> or checking for non-NULL vhost_force_iommu, enable_custom_iommu, or
>>> another vhostOp. We could extract this test in its own function too,
>>> so its name could give a better hint. Just copy the vhost_dev_start
>>> check at the moment.
>>>
>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>> ---
>>> hw/virtio/vhost-shadow-virtqueue.c | 205 +++++++++++++++++++++++++++--
>>> hw/virtio/vhost.c | 134 ++++++++++++++++++-
>>> 2 files changed, 325 insertions(+), 14 deletions(-)
>>>
>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
>>> index ff50f12410..6d767fe248 100644
>>> --- a/hw/virtio/vhost-shadow-virtqueue.c
>>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
>>> @@ -9,6 +9,7 @@
>>>
>>> #include "hw/virtio/vhost-shadow-virtqueue.h"
>>> #include "hw/virtio/vhost.h"
>>> +#include "hw/virtio/virtio-access.h"
>>>
>>> #include "standard-headers/linux/vhost_types.h"
>>>
>>> @@ -48,9 +49,93 @@ typedef struct VhostShadowVirtqueue {
>>>
>>> /* Virtio device */
>>> VirtIODevice *vdev;
>>> +
>>> + /* Map for returning guest's descriptors */
>>> + VirtQueueElement **ring_id_maps;
>>> +
>>> + /* Next head to expose to device */
>>> + uint16_t avail_idx_shadow;
>>> +
>>> + /* Next free descriptor */
>>> + uint16_t free_head;
>>> +
>>> + /* Last seen used idx */
>>> + uint16_t shadow_used_idx;
>>> +
>>> + /* Next head to consume from device */
>>> + uint16_t used_idx;
>>> } VhostShadowVirtqueue;
>>>
>>> -/* Forward guest notifications */
>>> +static void vhost_vring_write_descs(VhostShadowVirtqueue *svq,
>>> + const struct iovec *iovec,
>>> + size_t num, bool more_descs, bool write)
>>> +{
>>> + uint16_t i = svq->free_head, last = svq->free_head;
>>> + unsigned n;
>>> + uint16_t flags = write ? cpu_to_le16(VRING_DESC_F_WRITE) : 0;
>>> + vring_desc_t *descs = svq->vring.desc;
>>> +
>>> + if (num == 0) {
>>> + return;
>>> + }
>>> +
>>> + for (n = 0; n < num; n++) {
>>> + if (more_descs || (n + 1 < num)) {
>>> + descs[i].flags = flags | cpu_to_le16(VRING_DESC_F_NEXT);
>>> + } else {
>>> + descs[i].flags = flags;
>>> + }
>>> + descs[i].addr = cpu_to_le64((hwaddr)iovec[n].iov_base);
>>> + descs[i].len = cpu_to_le32(iovec[n].iov_len);
>>> +
>>> + last = i;
>>> + i = cpu_to_le16(descs[i].next);
>>> + }
>>> +
>>> + svq->free_head = le16_to_cpu(descs[last].next);
>>> +}
>>> +
>>> +static unsigned vhost_shadow_vq_add_split(VhostShadowVirtqueue *svq,
>>> + VirtQueueElement *elem)
>>> +{
>>> + int head;
>>> + unsigned avail_idx;
>>> + vring_avail_t *avail = svq->vring.avail;
>>> +
>>> + head = svq->free_head;
>>> +
>>> + /* We need some descriptors here */
>>> + assert(elem->out_num || elem->in_num);
>>> +
>>> + vhost_vring_write_descs(svq, elem->out_sg, elem->out_num,
>>> + elem->in_num > 0, false);
>>> + vhost_vring_write_descs(svq, elem->in_sg, elem->in_num, false, true);
>>> +
>>> + /*
>>> + * Put entry in available array (but don't update avail->idx until they
>>> + * do sync).
>>> + */
>>> + avail_idx = svq->avail_idx_shadow & (svq->vring.num - 1);
>>> + avail->ring[avail_idx] = cpu_to_le16(head);
>>> + svq->avail_idx_shadow++;
>>> +
>>> + /* Expose descriptors to device */
>>
>> It's better to describe the detail order.
>>
>> E.g "Update avail index after the descriptor is wrote"
>>
> Agree, I will replace it with your wording.
>
>>> + smp_wmb();
>>> + avail->idx = cpu_to_le16(svq->avail_idx_shadow);
>>> +
>>> + return head;
>>> +
>>> +}
>>> +
>>> +static void vhost_shadow_vq_add(VhostShadowVirtqueue *svq,
>>> + VirtQueueElement *elem)
>>> +{
>>> + unsigned qemu_head = vhost_shadow_vq_add_split(svq, elem);
>>> +
>>> + svq->ring_id_maps[qemu_head] = elem;
>>> +}
>>> +
>>> +/* Handle guest->device notifications */
>>> static void vhost_handle_guest_kick(EventNotifier *n)
>>> {
>>> VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
>>> @@ -60,7 +145,67 @@ static void vhost_handle_guest_kick(EventNotifier *n)
>>> return;
>>> }
>>>
>>> - event_notifier_set(&svq->kick_notifier);
>>> + /* Make available as many buffers as possible */
>>> + do {
>>> + if (virtio_queue_get_notification(svq->vq)) {
>>> + /* No more notifications until process all available */
>>> + virtio_queue_set_notification(svq->vq, false);
>>> + }
>>> +
>>> + while (true) {
>>> + VirtQueueElement *elem = virtqueue_pop(svq->vq, sizeof(*elem));
>>> + if (!elem) {
>>> + break;
>>> + }
>>> +
>>> + vhost_shadow_vq_add(svq, elem);
>>> + event_notifier_set(&svq->kick_notifier);
>>> + }
>>> +
>>> + virtio_queue_set_notification(svq->vq, true);
>>> + } while (!virtio_queue_empty(svq->vq));
>>> +}
>>> +
>>> +static bool vhost_shadow_vq_more_used(VhostShadowVirtqueue *svq)
>>> +{
>>> + if (svq->used_idx != svq->shadow_used_idx) {
>>> + return true;
>>> + }
>>> +
>>> + /* Get used idx must not be reordered */
>>> + smp_rmb();
>>> + svq->shadow_used_idx = cpu_to_le16(svq->vring.used->idx);
>>> +
>>> + return svq->used_idx != svq->shadow_used_idx;
>>> +}
>>> +
>>> +static VirtQueueElement *vhost_shadow_vq_get_buf(VhostShadowVirtqueue *svq)
>>> +{
>>> + vring_desc_t *descs = svq->vring.desc;
>>> + const vring_used_t *used = svq->vring.used;
>>> + vring_used_elem_t used_elem;
>>> + uint16_t last_used;
>>> +
>>> + if (!vhost_shadow_vq_more_used(svq)) {
>>> + return NULL;
>>> + }
>>> +
>>> + last_used = svq->used_idx & (svq->vring.num - 1);
>>> + used_elem.id = le32_to_cpu(used->ring[last_used].id);
>>> + used_elem.len = le32_to_cpu(used->ring[last_used].len);
>>> +
>>> + if (unlikely(used_elem.id >= svq->vring.num)) {
>>> + error_report("Device %s says index %u is available", svq->vdev->name,
>>> + used_elem.id);
>>> + return NULL;
>>> + }
>>> +
>>> + descs[used_elem.id].next = svq->free_head;
>>> + svq->free_head = used_elem.id;
>>> +
>>> + svq->used_idx++;
>>> + svq->ring_id_maps[used_elem.id]->len = used_elem.len;
>>> + return g_steal_pointer(&svq->ring_id_maps[used_elem.id]);
>>> }
>>>
>>> /* Forward vhost notifications */
>>> @@ -69,17 +214,33 @@ static void vhost_shadow_vq_handle_call_no_test(EventNotifier *n)
>>> VhostShadowVirtqueue *svq = container_of(n, VhostShadowVirtqueue,
>>> call_notifier);
>>> EventNotifier *masked_notifier;
>>> + VirtQueue *vq = svq->vq;
>>>
>>> masked_notifier = svq->masked_notifier.n;
>>>
>>> - if (!masked_notifier) {
>>> - unsigned n = virtio_get_queue_index(svq->vq);
>>> - virtio_queue_invalidate_signalled_used(svq->vdev, n);
>>> - virtio_notify_irqfd(svq->vdev, svq->vq);
>>> - } else if (!svq->masked_notifier.signaled) {
>>> - svq->masked_notifier.signaled = true;
>>> - event_notifier_set(svq->masked_notifier.n);
>>> - }
>>> + /* Make as many buffers as possible used. */
>>> + do {
>>> + unsigned i = 0;
>>> +
>>> + /* TODO: Use VRING_AVAIL_F_NO_INTERRUPT */
>>> + while (true) {
>>> + g_autofree VirtQueueElement *elem = vhost_shadow_vq_get_buf(svq);
>>> + if (!elem) {
>>> + break;
>>> + }
>>> +
>>> + assert(i < svq->vring.num);
>>> + virtqueue_fill(vq, elem, elem->len, i++);
>>> + }
>>> +
>>> + virtqueue_flush(vq, i);
>>> + if (!masked_notifier) {
>>> + virtio_notify_irqfd(svq->vdev, svq->vq);
>>
>> Any reason that you don't use virtio_notify() here?
>>
> No reason but to make sure guest_notifier is used. I'm not sure if
> this is an implementation detail though.
The difference is that virtio_notify() will go through the memory API
which will finally go to irqfd in this case.
>
> I can test to switch to virtio_notify, what would be the advantage here?
Probably no.
>
>>> + } else if (!svq->masked_notifier.signaled) {
>>> + svq->masked_notifier.signaled = true;
>>> + event_notifier_set(svq->masked_notifier.n);
>>> + }
>>
>> This is an example of the extra complexity if you do shadow virtqueue at
>> virtio level.
>> If you do everything at e.g vhost-vdpa, you don't need to care about
>> masked_notifer at all.
>>
> Correct me if I'm wrong, what you mean is to use the backend
> vhost_set_vring_call to set the guest notifier (from SVQ point of
> view), and then set it unconditionally. The function
> vhost_virtqueue_mask would set the masked one by itself, no
> modification is needed here.
Something like this: from the point of view of vhost, it doesn't even need
to know whether the notifier is masked or not. All it needs is to
write to the eventfd set via the vq call.
>
> Backend would still need a conditional checking if SVQ is enabled, so
> it either sends call_fd to backend or to SVQ.
Yes.
> The call to
> virtqueue_fill, would still be needed if we don't want to duplicate
> all the device virtio's logic in the vhost-vdpa backend.
Yes, you can make the buffer forwarding a common library; then it could
be used by other vhost backends in the future.
The point is to avoid touching vhost protocols to reduce the changeset
and have something minimal for our requirements (vhost-vDPA mainly).
>
> Another possibility would be to just store guest_notifier in SVQ, and
> replace it with masked notifier and back. I think this is more aligned
> with what you have in mind, but it still needs changes to
> vhost_virtqueue_mask. Note that the boolean store
> masked_notifier.signaled is just a (maybe premature) optimization to
> skip the unneeded write syscall, but it could be omitted for brevity.
> Or maybe a cleaner solution is to use io_uring for this write? :).
Looks like not what I meant :)
To clarify, it works like:
1) record the vq call fd1 during vhost_vdpa_set_vring_call
2) when svq is not enabled, set this fd1 to vhost-VDPA via
VHOST_SET_VRING_CALL
3) when svq is enabled, initialize and set fd2 to vhost-vDPA, poll and
handle the guest kick via fd1 and relay fd1 to fd2
So we don't need to care much about the masking; in the svq code, we just
stick to using the fd that was set via the most recent
vhost_vdpa_set_vring_call().
That means, if the virtqueue is masked, we're actually using the
masked_notifier, but it's totally transparent to us.
So the idea is to behave like a normal vhost-vDPA backend, and hide the
shadowing from the virtio code.
Thanks
>
> Thanks!
>
>> Thanks
>>
* Re: [RFC v3 25/29] vhost: Add custom IOTLB translations to SVQ
[not found] ` <CAJaqyWc7OWeQnXHihY=mYp=N+rRJLcbFUsJA-OszD6tyr6v-FQ@mail.gmail.com>
@ 2021-06-03 3:39 ` Jason Wang
0 siblings, 0 replies; 19+ messages in thread
From: Jason Wang @ 2021-06-03 3:39 UTC (permalink / raw)
To: Eugenio Perez Martin
Cc: Parav Pandit, Michael S. Tsirkin, Markus Armbruster, qemu-level,
Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
virtualization, Eric Blake, Michael Lilja
On 2021/6/3 1:51 AM, Eugenio Perez Martin wrote:
> On Wed, Jun 2, 2021 at 11:52 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
>>> Use translations added in IOVAReverseMaps in SVQ if the vhost device
>>> does not support the mapping of the full qemu's virtual address space.
>>> In other cases, Shadow Virtqueue still uses the qemu's virtual address
>>> of the buffer pointed by the descriptor, which has been translated
>>> already by qemu's VirtQueue machinery.
>>
>> I'd say let stick to a single kind of translation (iova allocator) that
>> works for all the cases first and add optimizations on top.
>>
> Ok, I will start from here for the next revision.
>
>>> Now every element needs to store the previous address also, so VirtQueue
>>> can consume the elements properly. This adds a little overhead per VQ
>>> element, having to allocate more memory to stash them. As a possible
>>> optimization, this allocation could be avoided if the descriptor is not
>>> a chain but a single one, but this is left undone.
>>>
>>> Checking also for vhost_set_iotlb_callback to send used ring remapping.
>>> This is only needed for kernel, and would print an error in case of
>>> vhost devices with its own mapping (vdpa).
>>>
>>> This could change for other callback, like checking for
>>> vhost_force_iommu, enable_custom_iommu, or another. Another option could
>>> be to, at least, extract the check of "is map(used, writable) needed?"
>>> in another function. But at the moment just copy the check used in
>>> vhost_dev_start here.
>>>
>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>> ---
>>> hw/virtio/vhost-shadow-virtqueue.c | 134 ++++++++++++++++++++++++++---
>>> hw/virtio/vhost.c | 29 +++++--
>>> 2 files changed, 145 insertions(+), 18 deletions(-)
>>>
>>> diff --git a/hw/virtio/vhost-shadow-virtqueue.c b/hw/virtio/vhost-shadow-virtqueue.c
>>> index 934d3bb27b..a92da979d1 100644
>>> --- a/hw/virtio/vhost-shadow-virtqueue.c
>>> +++ b/hw/virtio/vhost-shadow-virtqueue.c
>>> @@ -10,12 +10,19 @@
>>> #include "hw/virtio/vhost-shadow-virtqueue.h"
>>> #include "hw/virtio/vhost.h"
>>> #include "hw/virtio/virtio-access.h"
>>> +#include "hw/virtio/vhost-iova-tree.h"
>>>
>>> #include "standard-headers/linux/vhost_types.h"
>>>
>>> #include "qemu/error-report.h"
>>> #include "qemu/main-loop.h"
>>>
>>> +typedef struct SVQElement {
>>> + VirtQueueElement elem;
>>> + void **in_sg_stash;
>>> + void **out_sg_stash;
>>
>> Any reason for the trick like this?
>>
>> Can we simply use iovec and iov_copy() here?
>>
> At the moment the device writes the buffer directly to the guest's
> memory, and SVQ only translates the descriptor. In this scenario,
> there would be no need for iov_copy, would there?
It depends on which kind of translation you use.
If I read the code correctly, stash is used for storing HVAs after the
HVA->IOVA translation.
This looks exactly like the work of iov (and do we guarantee that there
will be a 1:1 translation?)
And if the mapping is 1:1 you can simply use iov_copy().
But this won't be an option if we always use the iova allocator.
>
> The reason for stashing and unstashing them was to allow the 1:1 mapping
> with qemu memory and IOMMU and iova allocator to work with fewer
> changes. In particular, the reason for unstashing is that virtqueue_fill
> expects qemu pointers to set the guest memory page as dirty in
> virtqueue_unmap_sg->dma_memory_unmap.
>
> Now I think that just storing the iova address from the allocator in a
> separate field and using a wrapper to get the IOVA addresses in SVQ
> would be a better idea, so I would change to this if everyone agrees.
I agree.
Thanks
>
> Thanks!
>
>> Thanks
>>
>>
* Re: [RFC v3 21/29] vhost: Add VhostIOVATree
[not found] ` <CAJaqyWfKTiKeKLLjB9qDf4qJwL420YRo6FrJgozp_tn0Z57pOA@mail.gmail.com>
@ 2021-07-14 3:04 ` Jason Wang
[not found] ` <CAJaqyWdZtvUM0_O1BiZNMkkgJEa_yrOh=XDgjTt2byZwU9J8tQ@mail.gmail.com>
0 siblings, 1 reply; 19+ messages in thread
From: Jason Wang @ 2021-07-14 3:04 UTC (permalink / raw)
To: Eugenio Perez Martin
Cc: Parav Pandit, Michael S. Tsirkin, Markus Armbruster, qemu-level,
Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
virtualization, Eric Blake, Michael Lilja
On 2021/6/1 4:15 PM, Eugenio Perez Martin wrote:
> On Mon, May 31, 2021 at 11:40 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/5/20 12:28 AM, Eugenio Pérez wrote:
>>> This tree is able to look for a translated address from a IOVA address.
>>>
>>> At first glance it is similar to util/iova-tree. However, SVQ working on
>>> devices with limited IOVA space needs more capabilities, like allocating
>>> IOVA chunks or performing reverse translations (qemu addresses to iova).
>>>
>>> Starting a separate implementation. Knowing that insertions/deletions
>>> will not be as frequent as searches,
>>
>> This might not be true if vIOMMU is enabled.
>>
> Right.
>
>>> it uses an ordered array at
>>> implementation.
>>
>> I wonder how much overhead g_array could have if it needs to grow.
>>
> I didn't do any tests, actually. But I see this VhostIOVATree as a
> replaceable tool, just to get the buffer translations to work. So I'm
> both ok with changing it now and ok with delaying it, since they should not
> be hard to do.
>
>>> A different name could be used, but ordered
>>> searchable array is a little bit long though.
>>
>> Note that we had a very good example for this. That is the kernel iova
>> allocator which is implemented via rbtree.
>>
>> Instead of figuring out g_array vs g_tree stuff, I would simply go with
>> g_tree first (based on util/iova-tree) and borrow the well-designed kernel
>> iova allocator API to have a generic IOVA one instead of coupling it
>> with vhost. It could be used by other userspace drivers in the future:
>>
>> init_iova_domain()/put_iova_domain();
>>
>> alloc_iova()/free_iova();
>>
>> find_iova();
>>
> We could go that way, but then the iova-tree should be extended to
> support both translations (iova->translated_addr is now implemented in
> iova-tree, the reverse is not). If I understood you correctly,
> borrowing the kernel iova allocator would give us both, right?
No, the reverse lookup is done via a specific IOMMU driver, if I
understand it correctly.
And if the mapping is 1:1 we can just use two iova-trees, I guess.
>
> Note that it is not coupled to vhost at all except in the name: all
> the implementations only work with hwaddr and void pointers memory.
> Just to illustrate the point, I think it could be a drop-in
> replacement for iova-tree at this moment (with all the
> drawbacks/advantages of an array vs tree).
Ok.
Thanks
>
>> Another reference is the iova allocator that is implemented in VFIO.
> I will check this too.
>
> Thanks!
>
>
>> Thanks
>>
>>
>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>> ---
>>> hw/virtio/vhost-iova-tree.h | 50 ++++++++++
>>> hw/virtio/vhost-iova-tree.c | 188 ++++++++++++++++++++++++++++++++++++
>>> hw/virtio/meson.build | 2 +-
>>> 3 files changed, 239 insertions(+), 1 deletion(-)
>>> create mode 100644 hw/virtio/vhost-iova-tree.h
>>> create mode 100644 hw/virtio/vhost-iova-tree.c
>>>
>>> diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
>>> new file mode 100644
>>> index 0000000000..2a44af8b3a
>>> --- /dev/null
>>> +++ b/hw/virtio/vhost-iova-tree.h
>>> @@ -0,0 +1,50 @@
>>> +/*
>>> + * vhost software live migration ring
>>> + *
>>> + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
>>> + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
>>> + *
>>> + * SPDX-License-Identifier: GPL-2.0-or-later
>>> + */
>>> +
>>> +#ifndef HW_VIRTIO_VHOST_IOVA_TREE_H
>>> +#define HW_VIRTIO_VHOST_IOVA_TREE_H
>>> +
>>> +#include <gmodule.h>
>>> +
>>> +#include "exec/memory.h"
>>> +
>>> +typedef struct VhostDMAMap {
>>> + void *translated_addr;
>>> + hwaddr iova;
>>> + hwaddr size; /* Inclusive */
>>> + IOMMUAccessFlags perm;
>>> +} VhostDMAMap;
>>> +
>>> +typedef enum VhostDMAMapNewRC {
>>> + VHOST_DMA_MAP_OVERLAP = -2,
>>> + VHOST_DMA_MAP_INVALID = -1,
>>> + VHOST_DMA_MAP_OK = 0,
>>> +} VhostDMAMapNewRC;
>>> +
>>> +/**
>>> + * VhostIOVATree
>>> + *
>>> + * Store and search IOVA -> Translated mappings.
>>> + *
>>> + * Note that it cannot remove nodes.
>>> + */
>>> +typedef struct VhostIOVATree {
>>> + /* Ordered array of reverse translations, IOVA address to qemu memory. */
>>> + GArray *iova_taddr_map;
>>> +} VhostIOVATree;
>>> +
>>> +void vhost_iova_tree_new(VhostIOVATree *iova_rm);
>>> +void vhost_iova_tree_destroy(VhostIOVATree *iova_rm);
>>> +
>>> +const VhostDMAMap *vhost_iova_tree_find_taddr(const VhostIOVATree *iova_rm,
>>> + const VhostDMAMap *map);
>>> +VhostDMAMapNewRC vhost_iova_tree_insert(VhostIOVATree *iova_rm,
>>> + VhostDMAMap *map);
>>> +
>>> +#endif
>>> diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
>>> new file mode 100644
>>> index 0000000000..dfd7e448b5
>>> --- /dev/null
>>> +++ b/hw/virtio/vhost-iova-tree.c
>>> @@ -0,0 +1,188 @@
>>> +/*
>>> + * vhost software live migration ring
>>> + *
>>> + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
>>> + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
>>> + *
>>> + * SPDX-License-Identifier: GPL-2.0-or-later
>>> + */
>>> +
>>> +#include "qemu/osdep.h"
>>> +#include "vhost-iova-tree.h"
>>> +
>>> +#define G_ARRAY_NOT_ZERO_TERMINATED false
>>> +#define G_ARRAY_NOT_CLEAR_ON_ALLOC false
>>> +
>>> +/**
>>> + * Inserts an element after an existing one in garray.
>>> + *
>>> + * @array The array
>>> + * @prev_elem The previous element of array of NULL if prepending
>>> + * @map The DMA map
>>> + *
>>> + * It provides the aditional advantage of being type safe over
>>> + * g_array_insert_val, which accepts a reference pointer instead of a value
>>> + * with no complains.
>>> + */
>>> +static void vhost_iova_tree_insert_after(GArray *array,
>>> + const VhostDMAMap *prev_elem,
>>> + const VhostDMAMap *map)
>>> +{
>>> + size_t pos;
>>> +
>>> + if (!prev_elem) {
>>> + pos = 0;
>>> + } else {
>>> + pos = prev_elem - &g_array_index(array, typeof(*prev_elem), 0) + 1;
>>> + }
>>> +
>>> + g_array_insert_val(array, pos, *map);
>>> +}
>>> +
>>> +static gint vhost_iova_tree_cmp_iova(gconstpointer a, gconstpointer b)
>>> +{
>>> + const VhostDMAMap *m1 = a, *m2 = b;
>>> +
>>> + if (m1->iova > m2->iova + m2->size) {
>>> + return 1;
>>> + }
>>> +
>>> + if (m1->iova + m1->size < m2->iova) {
>>> + return -1;
>>> + }
>>> +
>>> + /* Overlapped */
>>> + return 0;
>>> +}
>>> +
>>> +/**
>>> + * Find the previous node to a given iova
>>> + *
>>> + * @array The ascending ordered-by-translated-addr array of VhostDMAMap
>>> + * @map The map to insert
>>> + * @prev Returned location of the previous map
>>> + *
>>> + * Return VHOST_DMA_MAP_OK if everything went well, or VHOST_DMA_MAP_OVERLAP if
>>> + * it already exists. It is ok to use this function to check if a given range
>>> + * exists, but it will use a linear search.
>>> + *
>>> + * TODO: We can use bsearch to locate the entry if we save the state in the
>>> + * needle, knowing that the needle is always the first argument to
>>> + * compare_func.
>>> + */
>>> +static VhostDMAMapNewRC vhost_iova_tree_find_prev(const GArray *array,
>>> + GCompareFunc compare_func,
>>> + const VhostDMAMap *map,
>>> + const VhostDMAMap **prev)
>>> +{
>>> + size_t i;
>>> + int r;
>>> +
>>> + *prev = NULL;
>>> + for (i = 0; i < array->len; ++i) {
>>> + r = compare_func(map, &g_array_index(array, typeof(*map), i));
>>> + if (r == 0) {
>>> + return VHOST_DMA_MAP_OVERLAP;
>>> + }
>>> + if (r < 0) {
>>> + return VHOST_DMA_MAP_OK;
>>> + }
>>> +
>>> + *prev = &g_array_index(array, typeof(**prev), i);
>>> + }
>>> +
>>> + return VHOST_DMA_MAP_OK;
>>> +}
>>> +
>>> +/**
>>> + * Create a new IOVA tree
>>> + *
>>> + * @tree The IOVA tree
>>> + */
>>> +void vhost_iova_tree_new(VhostIOVATree *tree)
>>> +{
>>> + assert(tree);
>>> +
>>> + tree->iova_taddr_map = g_array_new(G_ARRAY_NOT_ZERO_TERMINATED,
>>> + G_ARRAY_NOT_CLEAR_ON_ALLOC,
>>> + sizeof(VhostDMAMap));
>>> +}
>>> +
>>> +/**
>>> + * Destroy an IOVA tree
>>> + *
>>> + * @tree The iova tree
>>> + */
>>> +void vhost_iova_tree_destroy(VhostIOVATree *tree)
>>> +{
>>> + g_array_unref(g_steal_pointer(&tree->iova_taddr_map));
>>> +}
>>> +
>>> +/**
>>> + * Perform a search on a GArray.
>>> + *
>>> + * @array Glib array
>>> + * @map Map to look up
>>> + * @compare_func Compare function to use
>>> + *
>>> + * Return The found element or NULL if not found.
>>> + *
>>> + * This can be replaced with g_array_binary_search (Since glib 2.62) when that
>>> + * is common enough.
>>> + */
>>> +static const VhostDMAMap *vhost_iova_tree_bsearch(const GArray *array,
>>> + const VhostDMAMap *map,
>>> + GCompareFunc compare_func)
>>> +{
>>> + return bsearch(map, array->data, array->len, sizeof(*map), compare_func);
>>> +}
>>> +
>>> +/**
>>> + * Find the translated address stored from a IOVA address
>>> + *
>>> + * @tree The iova tree
>>> + * @map The map with the memory address
>>> + *
>>> + * Return the stored mapping, or NULL if not found.
>>> + */
>>> +const VhostDMAMap *vhost_iova_tree_find_taddr(const VhostIOVATree *tree,
>>> + const VhostDMAMap *map)
>>> +{
>>> + return vhost_iova_tree_bsearch(tree->iova_taddr_map, map,
>>> + vhost_iova_tree_cmp_iova);
>>> +}
>>> +
>>> +/**
>>> + * Insert a new map
>>> + *
>>> + * @tree The iova tree
>>> + * @map The iova map
>>> + *
>>> + * Returns:
>>> + * - VHOST_DMA_MAP_OK if the map fits in the container
>>> + * - VHOST_DMA_MAP_INVALID if the map does not make sense (like size overflow)
>>> + * - VHOST_DMA_MAP_OVERLAP if the tree already contains that map
>>> + * Can query the assignated iova in map.
>>> + */
>>> +VhostDMAMapNewRC vhost_iova_tree_insert(VhostIOVATree *tree,
>>> + VhostDMAMap *map)
>>> +{
>>> + const VhostDMAMap *prev;
>>> + int find_prev_rc;
>>> +
>>> + if (map->translated_addr + map->size < map->translated_addr ||
>>> + map->iova + map->size < map->iova || map->perm == IOMMU_NONE) {
>>> + return VHOST_DMA_MAP_INVALID;
>>> + }
>>> +
>>> + /* Check for duplicates, and save position for insertion */
>>> + find_prev_rc = vhost_iova_tree_find_prev(tree->iova_taddr_map,
>>> + vhost_iova_tree_cmp_iova, map,
>>> + &prev);
>>> + if (find_prev_rc == VHOST_DMA_MAP_OVERLAP) {
>>> + return VHOST_DMA_MAP_OVERLAP;
>>> + }
>>> +
>>> + vhost_iova_tree_insert_after(tree->iova_taddr_map, prev, map);
>>> + return VHOST_DMA_MAP_OK;
>>> +}
>>> diff --git a/hw/virtio/meson.build b/hw/virtio/meson.build
>>> index 8b5a0225fe..cb306b83c6 100644
>>> --- a/hw/virtio/meson.build
>>> +++ b/hw/virtio/meson.build
>>> @@ -11,7 +11,7 @@ softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-stub.c'))
>>>
>>> virtio_ss = ss.source_set()
>>> virtio_ss.add(files('virtio.c'))
>>> -virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c'))
>>> +virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c', 'vhost-iova-tree.c'))
>>> virtio_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user.c'))
>>> virtio_ss.add(when: 'CONFIG_VHOST_VDPA', if_true: files('vhost-vdpa.c'))
>>> virtio_ss.add(when: 'CONFIG_VIRTIO_BALLOON', if_true: files('virtio-balloon.c'))
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [RFC v3 21/29] vhost: Add VhostIOVATree
[not found] ` <CAJaqyWePZSo8-q=d3fWBGUwgM8EfHHVYgR_KGKxjcgeZn5C=Vw@mail.gmail.com>
@ 2021-07-14 9:33 ` Jason Wang
0 siblings, 0 replies; 19+ messages in thread
From: Jason Wang @ 2021-07-14 9:33 UTC (permalink / raw)
To: Eugenio Perez Martin
Cc: Parav Pandit, Michael S. Tsirkin, Markus Armbruster, qemu-level,
Harpreet Singh Anand, Xiao W Wang, Stefan Hajnoczi, Eli Cohen,
virtualization, Eric Blake, Michael Lilja
On 2021/7/14 at 5:14 PM, Eugenio Perez Martin wrote:
> On Wed, Jul 14, 2021 at 8:54 AM Eugenio Perez Martin
> <eperezma@redhat.com> wrote:
>> On Wed, Jul 14, 2021 at 5:04 AM Jason Wang <jasowang@redhat.com> wrote:
>>>
>>> On 2021/6/1 at 4:15 PM, Eugenio Perez Martin wrote:
>>>> On Mon, May 31, 2021 at 11:40 AM Jason Wang <jasowang@redhat.com> wrote:
>>>>> On 2021/5/20 at 12:28 AM, Eugenio Pérez wrote:
>>>>>> This tree is able to look up a translated address from an IOVA address.
>>>>>>
>>>>>> At first glance it is similar to util/iova-tree. However, SVQ, working on
>>>>>> devices with limited IOVA space, needs more capabilities, like allocating
>>>>>> IOVA chunks or performing reverse translations (qemu addresses to iova).
>>>>>>
>>>>>> Starting a separate implementation. Knowing that insertions/deletions
>>>>>> will not be as frequent as searches,
>>>>> This might not be true if vIOMMU is enabled.
>>>>>
>>>> Right.
>>>>
>>>>>> it uses an ordered array as
>>>>>> implementation.
>>>>> I wonder how much overhead g_array could have if it needs to grow.
>>>>>
>>>> I didn't do any tests, actually. But I see this VhostIOVATree as a
>>>> replaceable tool, just to get the buffer translations to work. So I'm
>>>> both ok with changing it now and ok with delaying it, since neither
>>>> should be hard to do.
>>>>
>>>>>> A different name could be used, but "ordered
>>>>>> searchable array" is a bit long though.
>>>>> Note that we had a very good example for this. That is the kernel iova
>>>>> allocator which is implemented via rbtree.
>>>>>
>>>>> Instead of figuring out the g_array vs g_tree question, I would simply go with
>>>>> g_tree first (based on util/iova-tree) and borrow the well-designed kernel
>>>>> iova allocator API to have a generic IOVA allocator instead of coupling it
>>>>> with vhost. It could be used by other userspace drivers in the future:
>>>>>
>>>>> init_iova_domain()/put_iova_domain();
>>>>>
>>>>> alloc_iova()/free_iova();
>>>>>
>>>>> find_iova();
>>>>>
>>>> We could go that way, but then the iova-tree should be extended to
>>>> support both translations (iova->translated_addr is now implemented in
>>>> iova-tree, the reverse is not). If I understood you correctly,
>>>> borrowing the kernel iova allocator would give us both, right?
>>>
>>> No the reverse lookup is done via a specific IOMMU driver if I
>>> understand it correctly.
>>>
>>> And if the mapping is 1:1 we can just use two iova-tree I guess.
>>>
>> I did try with two IOVATrees, but the usage of the reversed one is
>> confusing at best. To reuse most of the code, .iova needs to be
>> .translated_addr, and the opposite. Maybe I can try again using a
>> wrapper structure that reverses them on each operation (insert,
>> search, ...).
>>
>> Thanks!
>>
> Another feature is also needed that IOVATree does not support:
> allocation, i.e. searching for a free hole in the iova range [1]. Linux's
> alloc_iova is actually the equivalent of vhost_iova_tree_insert in
> this commit, which expects iova addresses to be specified, not
> "allocated".
Yes, that's why we need a general iova allocator. (But use it for the shadow
virtqueue first to speed up the merge.)
>
> My first attempt to solve that was to add a second (third?) tree with
> the free space. But this again complicates the code a lot, and misses
> a few optimization opportunities (like needing to search many trees
> instead of just one and then reusing the result [2]). Maybe
> iova_tree_foreach can be modified to achieve this, but then more code
> needs to be modified.
Ok, I think you can propose what you think is better and let's see the code
then.
>
> [1] Ranges themselves are not supported natively in IOVATree, but I
> think they are more or less straightforward to implement in
> vhost-vdpa.
You mean in the kernel? We only have the abstraction of entries, which is
kind of a range but more than just a simple range.
Thanks
>
>>>> Note that it is not coupled to vhost at all except in the name: the
>>>> implementation only works with hwaddr and void pointer memory.
>>>> Just to illustrate the point, I think it could be a drop-in
>>>> replacement for iova-tree at this moment (with all the
>>>> drawbacks/advantages of an array vs tree).
>>>
>>> Ok.
>>>
>>> Thanks
>>>
>>>
>>>>> Another reference is the iova allocator that is implemented in VFIO.
>>>> I will check this too.
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>> Signed-off-by: Eugenio Pérez <eperezma@redhat.com>
>>>>>> ---
>>>>>> hw/virtio/vhost-iova-tree.h | 50 ++++++++++
>>>>>> hw/virtio/vhost-iova-tree.c | 188 ++++++++++++++++++++++++++++++++++++
>>>>>> hw/virtio/meson.build | 2 +-
>>>>>> 3 files changed, 239 insertions(+), 1 deletion(-)
>>>>>> create mode 100644 hw/virtio/vhost-iova-tree.h
>>>>>> create mode 100644 hw/virtio/vhost-iova-tree.c
>>>>>>
>>>>>> diff --git a/hw/virtio/vhost-iova-tree.h b/hw/virtio/vhost-iova-tree.h
>>>>>> new file mode 100644
>>>>>> index 0000000000..2a44af8b3a
>>>>>> --- /dev/null
>>>>>> +++ b/hw/virtio/vhost-iova-tree.h
>>>>>> @@ -0,0 +1,50 @@
>>>>>> +/*
>>>>>> + * vhost software live migration ring
>>>>>> + *
>>>>>> + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
>>>>>> + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
>>>>>> + *
>>>>>> + * SPDX-License-Identifier: GPL-2.0-or-later
>>>>>> + */
>>>>>> +
>>>>>> +#ifndef HW_VIRTIO_VHOST_IOVA_TREE_H
>>>>>> +#define HW_VIRTIO_VHOST_IOVA_TREE_H
>>>>>> +
>>>>>> +#include <gmodule.h>
>>>>>> +
>>>>>> +#include "exec/memory.h"
>>>>>> +
>>>>>> +typedef struct VhostDMAMap {
>>>>>> + void *translated_addr;
>>>>>> + hwaddr iova;
>>>>>> + hwaddr size; /* Inclusive */
>>>>>> + IOMMUAccessFlags perm;
>>>>>> +} VhostDMAMap;
>>>>>> +
>>>>>> +typedef enum VhostDMAMapNewRC {
>>>>>> + VHOST_DMA_MAP_OVERLAP = -2,
>>>>>> + VHOST_DMA_MAP_INVALID = -1,
>>>>>> + VHOST_DMA_MAP_OK = 0,
>>>>>> +} VhostDMAMapNewRC;
>>>>>> +
>>>>>> +/**
>>>>>> + * VhostIOVATree
>>>>>> + *
>>>>>> + * Store and search IOVA -> Translated mappings.
>>>>>> + *
>>>>>> + * Note that it cannot remove nodes.
>>>>>> + */
>>>>>> +typedef struct VhostIOVATree {
>>>>>> + /* Ordered array of reverse translations, IOVA address to qemu memory. */
>>>>>> + GArray *iova_taddr_map;
>>>>>> +} VhostIOVATree;
>>>>>> +
>>>>>> +void vhost_iova_tree_new(VhostIOVATree *iova_rm);
>>>>>> +void vhost_iova_tree_destroy(VhostIOVATree *iova_rm);
>>>>>> +
>>>>>> +const VhostDMAMap *vhost_iova_tree_find_taddr(const VhostIOVATree *iova_rm,
>>>>>> + const VhostDMAMap *map);
>>>>>> +VhostDMAMapNewRC vhost_iova_tree_insert(VhostIOVATree *iova_rm,
>>>>>> + VhostDMAMap *map);
>>>>>> +
>>>>>> +#endif
>>>>>> diff --git a/hw/virtio/vhost-iova-tree.c b/hw/virtio/vhost-iova-tree.c
>>>>>> new file mode 100644
>>>>>> index 0000000000..dfd7e448b5
>>>>>> --- /dev/null
>>>>>> +++ b/hw/virtio/vhost-iova-tree.c
>>>>>> @@ -0,0 +1,188 @@
>>>>>> +/*
>>>>>> + * vhost software live migration ring
>>>>>> + *
>>>>>> + * SPDX-FileCopyrightText: Red Hat, Inc. 2021
>>>>>> + * SPDX-FileContributor: Author: Eugenio Pérez <eperezma@redhat.com>
>>>>>> + *
>>>>>> + * SPDX-License-Identifier: GPL-2.0-or-later
>>>>>> + */
>>>>>> +
>>>>>> +#include "qemu/osdep.h"
>>>>>> +#include "vhost-iova-tree.h"
>>>>>> +
>>>>>> +#define G_ARRAY_NOT_ZERO_TERMINATED false
>>>>>> +#define G_ARRAY_NOT_CLEAR_ON_ALLOC false
>>>>>> +
>>>>>> +/**
>>>>>> + * Inserts an element after an existing one in garray.
>>>>>> + *
>>>>>> + * @array The array
>>>>>> + * @prev_elem The previous element of array or NULL if prepending
> [2] If I remember correctly.
>
>>>>>> + * @map The DMA map
>>>>>> + *
>>>>>> + * It provides the additional advantage of being type safe over
>>>>>> + * g_array_insert_val, which accepts a reference pointer instead of a value
>>>>>> + * without complaint.
>>>>>> + */
>>>>>> +static void vhost_iova_tree_insert_after(GArray *array,
>>>>>> + const VhostDMAMap *prev_elem,
>>>>>> + const VhostDMAMap *map)
>>>>>> +{
>>>>>> + size_t pos;
>>>>>> +
>>>>>> + if (!prev_elem) {
>>>>>> + pos = 0;
>>>>>> + } else {
>>>>>> + pos = prev_elem - &g_array_index(array, typeof(*prev_elem), 0) + 1;
>>>>>> + }
>>>>>> +
>>>>>> + g_array_insert_val(array, pos, *map);
>>>>>> +}
>>>>>> +
>>>>>> +static gint vhost_iova_tree_cmp_iova(gconstpointer a, gconstpointer b)
>>>>>> +{
>>>>>> + const VhostDMAMap *m1 = a, *m2 = b;
>>>>>> +
>>>>>> + if (m1->iova > m2->iova + m2->size) {
>>>>>> + return 1;
>>>>>> + }
>>>>>> +
>>>>>> + if (m1->iova + m1->size < m2->iova) {
>>>>>> + return -1;
>>>>>> + }
>>>>>> +
>>>>>> + /* Overlapped */
>>>>>> + return 0;
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>> + * Find the previous node to a given iova
>>>>>> + *
>>>>>> + * @array The ascending array of VhostDMAMap, ordered by iova
>>>>>> + * @map The map to insert
>>>>>> + * @prev Returned location of the previous map
>>>>>> + *
>>>>>> + * Return VHOST_DMA_MAP_OK if everything went well, or VHOST_DMA_MAP_OVERLAP if
>>>>>> + * it already exists. It is ok to use this function to check if a given range
>>>>>> + * exists, but it will use a linear search.
>>>>>> + *
>>>>>> + * TODO: We can use bsearch to locate the entry if we save the state in the
>>>>>> + * needle, knowing that the needle is always the first argument to
>>>>>> + * compare_func.
>>>>>> + */
>>>>>> +static VhostDMAMapNewRC vhost_iova_tree_find_prev(const GArray *array,
>>>>>> + GCompareFunc compare_func,
>>>>>> + const VhostDMAMap *map,
>>>>>> + const VhostDMAMap **prev)
>>>>>> +{
>>>>>> + size_t i;
>>>>>> + int r;
>>>>>> +
>>>>>> + *prev = NULL;
>>>>>> + for (i = 0; i < array->len; ++i) {
>>>>>> + r = compare_func(map, &g_array_index(array, typeof(*map), i));
>>>>>> + if (r == 0) {
>>>>>> + return VHOST_DMA_MAP_OVERLAP;
>>>>>> + }
>>>>>> + if (r < 0) {
>>>>>> + return VHOST_DMA_MAP_OK;
>>>>>> + }
>>>>>> +
>>>>>> + *prev = &g_array_index(array, typeof(**prev), i);
>>>>>> + }
>>>>>> +
>>>>>> + return VHOST_DMA_MAP_OK;
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>> + * Create a new IOVA tree
>>>>>> + *
>>>>>> + * @tree The IOVA tree
>>>>>> + */
>>>>>> +void vhost_iova_tree_new(VhostIOVATree *tree)
>>>>>> +{
>>>>>> + assert(tree);
>>>>>> +
>>>>>> + tree->iova_taddr_map = g_array_new(G_ARRAY_NOT_ZERO_TERMINATED,
>>>>>> + G_ARRAY_NOT_CLEAR_ON_ALLOC,
>>>>>> + sizeof(VhostDMAMap));
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>> + * Destroy an IOVA tree
>>>>>> + *
>>>>>> + * @tree The iova tree
>>>>>> + */
>>>>>> +void vhost_iova_tree_destroy(VhostIOVATree *tree)
>>>>>> +{
>>>>>> + g_array_unref(g_steal_pointer(&tree->iova_taddr_map));
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>> + * Perform a search on a GArray.
>>>>>> + *
>>>>>> + * @array Glib array
>>>>>> + * @map Map to look up
>>>>>> + * @compare_func Compare function to use
>>>>>> + *
>>>>>> + * Return The found element or NULL if not found.
>>>>>> + *
>>>>>> + * This can be replaced with g_array_binary_search (Since glib 2.62) when that
>>>>>> + * is common enough.
>>>>>> + */
>>>>>> +static const VhostDMAMap *vhost_iova_tree_bsearch(const GArray *array,
>>>>>> + const VhostDMAMap *map,
>>>>>> + GCompareFunc compare_func)
>>>>>> +{
>>>>>> + return bsearch(map, array->data, array->len, sizeof(*map), compare_func);
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>> + * Find the translated address stored for an IOVA address
>>>>>> + *
>>>>>> + * @tree The iova tree
>>>>>> + * @map The map with the memory address
>>>>>> + *
>>>>>> + * Return the stored mapping, or NULL if not found.
>>>>>> + */
>>>>>> +const VhostDMAMap *vhost_iova_tree_find_taddr(const VhostIOVATree *tree,
>>>>>> + const VhostDMAMap *map)
>>>>>> +{
>>>>>> + return vhost_iova_tree_bsearch(tree->iova_taddr_map, map,
>>>>>> + vhost_iova_tree_cmp_iova);
>>>>>> +}
>>>>>> +
>>>>>> +/**
>>>>>> + * Insert a new map
>>>>>> + *
>>>>>> + * @tree The iova tree
>>>>>> + * @map The iova map
>>>>>> + *
>>>>>> + * Returns:
>>>>>> + * - VHOST_DMA_MAP_OK if the map fits in the container
>>>>>> + * - VHOST_DMA_MAP_INVALID if the map does not make sense (like size overflow)
>>>>>> + * - VHOST_DMA_MAP_OVERLAP if the tree already contains that map
>>>>>> + * The assigned iova can be queried in map.
>>>>>> + */
>>>>>> +VhostDMAMapNewRC vhost_iova_tree_insert(VhostIOVATree *tree,
>>>>>> + VhostDMAMap *map)
>>>>>> +{
>>>>>> + const VhostDMAMap *prev;
>>>>>> + int find_prev_rc;
>>>>>> +
>>>>>> + if (map->translated_addr + map->size < map->translated_addr ||
>>>>>> + map->iova + map->size < map->iova || map->perm == IOMMU_NONE) {
>>>>>> + return VHOST_DMA_MAP_INVALID;
>>>>>> + }
>>>>>> +
>>>>>> + /* Check for duplicates, and save position for insertion */
>>>>>> + find_prev_rc = vhost_iova_tree_find_prev(tree->iova_taddr_map,
>>>>>> + vhost_iova_tree_cmp_iova, map,
>>>>>> + &prev);
>>>>>> + if (find_prev_rc == VHOST_DMA_MAP_OVERLAP) {
>>>>>> + return VHOST_DMA_MAP_OVERLAP;
>>>>>> + }
>>>>>> +
>>>>>> + vhost_iova_tree_insert_after(tree->iova_taddr_map, prev, map);
>>>>>> + return VHOST_DMA_MAP_OK;
>>>>>> +}
>>>>>> diff --git a/hw/virtio/meson.build b/hw/virtio/meson.build
>>>>>> index 8b5a0225fe..cb306b83c6 100644
>>>>>> --- a/hw/virtio/meson.build
>>>>>> +++ b/hw/virtio/meson.build
>>>>>> @@ -11,7 +11,7 @@ softmmu_ss.add(when: 'CONFIG_ALL', if_true: files('vhost-stub.c'))
>>>>>>
>>>>>> virtio_ss = ss.source_set()
>>>>>> virtio_ss.add(files('virtio.c'))
>>>>>> -virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c'))
>>>>>> +virtio_ss.add(when: 'CONFIG_VHOST', if_true: files('vhost.c', 'vhost-backend.c', 'vhost-shadow-virtqueue.c', 'vhost-iova-tree.c'))
>>>>>> virtio_ss.add(when: 'CONFIG_VHOST_USER', if_true: files('vhost-user.c'))
>>>>>> virtio_ss.add(when: 'CONFIG_VHOST_VDPA', if_true: files('vhost-vdpa.c'))
>>>>>> virtio_ss.add(when: 'CONFIG_VIRTIO_BALLOON', if_true: files('virtio-balloon.c'))
* Re: [RFC v3 00/29] vDPA software assisted live migration
2021-05-24 11:29 ` Michael S. Tsirkin
@ 2021-07-19 14:13 ` Stefan Hajnoczi
0 siblings, 0 replies; 19+ messages in thread
From: Stefan Hajnoczi @ 2021-07-19 14:13 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: Parav Pandit, qemu-level, Markus Armbruster, Eugenio Perez Martin,
Harpreet Singh Anand, Xiao W Wang, Eli Cohen, virtualization,
Eric Blake, Michael Lilja
On Mon, May 24, 2021 at 07:29:06AM -0400, Michael S. Tsirkin wrote:
> On Mon, May 24, 2021 at 12:37:48PM +0200, Eugenio Perez Martin wrote:
> > On Mon, May 24, 2021 at 11:38 AM Michael S. Tsirkin <mst@redhat.com> wrote:
> > >
> > > On Wed, May 19, 2021 at 06:28:34PM +0200, Eugenio Pérez wrote:
> > > > Commit 17 introduces the buffer forwarding. Previous ones are for
> > > > preparations again, and later ones are for enabling some obvious
> > > > optimizations. However, it needs the vdpa device to be able to map
> > > > every IOVA space, and some vDPA devices are not able to do so. Checking
> > > > for this is added in previous commits.
> > >
> > > That might become a significant limitation. And it worries me that
> > > this is such a big patchset which might yet take a while to get
> > > finalized.
> > >
> >
> > Sorry, maybe I've been unclear here: later commits in this series
> > address this limitation. Still not perfect: for example, it does not
> > support adding or removing guest memory at the moment, but this
> > should be easy to implement on top.
> >
> > The main issue I'm observing is from the kernel, if I'm not wrong: if I
> > unmap every address, I cannot re-map them again. But the code in this
> > patchset is mostly final, except for the comments that may arise on the
> > mailing list, of course.
> >
> > > I have an idea: how about as a first step we implement a transparent
> > > switch from vdpa to a software virtio in QEMU or a software vhost in
> > > kernel?
> > >
> > > This will give us live migration quickly with performance comparable
> > > to failover but without dependence on guest cooperation.
> > >
> >
> > I think it should be doable. I'm not sure about the effort that needs
> > to be done in qemu to hide these "hypervisor-failover devices" from the
> > guest's view, but it should be comparable to failover, as you say.
> >
> > Networking should be ok by its nature, although it could require care
> > on the host hardware setup. But I'm not sure how other types of
> > vhost/vdpa devices may work that way. How would a disk/scsi device
> > switch modes? Can the kernel take control of the vdpa device through
> > vhost, and just start reporting with a dirty bitmap?
> >
> > Thanks!
>
> It depends, of course; e.g. blk is mostly reads/writes, so
> not a lot of state. Just don't reorder or drop requests.
QEMU's virtio-blk does not attempt to change states (e.g. quiesce the
device or switch between vhost kernel/QEMU, etc) while there are
in-flight requests. Instead all currently active requests must complete
(in some cases they can be cancelled to stop them early). Note that
failed requests can be kept in a list across the switch and then
resubmitted later.
The underlying storage never has requests in flight while the device is
switched. The reason QEMU does this is because there's no way to hand
over an in-flight preadv(2), Linux AIO, or other host kernel block layer
request to another process.
Stefan
end of thread, other threads:[~2021-07-19 14:14 UTC | newest]
Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
[not found] <20210519162903.1172366-1-eperezma@redhat.com>
2021-05-24 9:38 ` [RFC v3 00/29] vDPA software assisted live migration Michael S. Tsirkin
[not found] ` <CAJaqyWcVm55qjaDpQsuLzaY0FCzjW2ARyvOWCdfS9RJNoOM7Aw@mail.gmail.com>
2021-05-24 11:29 ` Michael S. Tsirkin
2021-07-19 14:13 ` Stefan Hajnoczi
2021-05-25 0:09 ` Jason Wang
[not found] ` <20210519162903.1172366-7-eperezma@redhat.com>
2021-05-26 1:06 ` [RFC v3 06/29] virtio-net: Honor VIRTIO_CONFIG_S_DEVICE_STOPPED Jason Wang
2021-05-26 1:10 ` Jason Wang
[not found] ` <CAJaqyWeV+za=xeKHb9vn=Y+0mfekCb8w5dmWNMgzQ6uOtU3jxw@mail.gmail.com>
2021-06-03 3:12 ` Jason Wang
[not found] ` <20210519162903.1172366-14-eperezma@redhat.com>
2021-05-26 1:14 ` [RFC v3 13/29] vhost: Add vhost_get_iova_range operation Jason Wang
[not found] ` <CAJaqyWeL-0KjsBcXs1tYdvn9xLAK-x0Sb+RFuzPgngXxYtF9uw@mail.gmail.com>
2021-05-27 4:51 ` Jason Wang
[not found] ` <CAJaqyWf+=-nwOsS=zZEhmiTA_TotVMQibUgE0grCMZgXVDNpxg@mail.gmail.com>
2021-06-03 3:13 ` Jason Wang
[not found] ` <20210519162903.1172366-16-eperezma@redhat.com>
2021-05-31 9:01 ` [RFC v3 15/29] vhost: Add enable_custom_iommu to VhostOps Jason Wang
[not found] ` <20210519162903.1172366-22-eperezma@redhat.com>
2021-05-31 9:40 ` [RFC v3 21/29] vhost: Add VhostIOVATree Jason Wang
[not found] ` <CAJaqyWfKTiKeKLLjB9qDf4qJwL420YRo6FrJgozp_tn0Z57pOA@mail.gmail.com>
2021-07-14 3:04 ` Jason Wang
[not found] ` <CAJaqyWdZtvUM0_O1BiZNMkkgJEa_yrOh=XDgjTt2byZwU9J8tQ@mail.gmail.com>
[not found] ` <CAJaqyWePZSo8-q=d3fWBGUwgM8EfHHVYgR_KGKxjcgeZn5C=Vw@mail.gmail.com>
2021-07-14 9:33 ` Jason Wang
[not found] ` <20210519162903.1172366-18-eperezma@redhat.com>
2021-06-02 9:50 ` [RFC v3 17/29] vhost: Shadow virtqueue buffers forwarding Jason Wang
[not found] ` <CAJaqyWf7M1fjrd+kr-2bcYj+ibrqZVoREZuTiJ0i+p6dA+Dukw@mail.gmail.com>
2021-06-03 3:34 ` Jason Wang
[not found] ` <20210519162903.1172366-26-eperezma@redhat.com>
2021-06-02 9:51 ` [RFC v3 25/29] vhost: Add custom IOTLB translations to SVQ Jason Wang
[not found] ` <CAJaqyWc7OWeQnXHihY=mYp=N+rRJLcbFUsJA-OszD6tyr6v-FQ@mail.gmail.com>
2021-06-03 3:39 ` Jason Wang
2021-06-02 9:59 ` [RFC v3 00/29] vDPA software assisted live migration Jason Wang
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).