References: <20180411072027.5656-1-tiwei.bie@intel.com> <20180411161926-mutt-send-email-mst@kernel.org> <20180412011059.yywn73znjdip2cyv@debian> <20180412042724-mutt-send-email-mst@kernel.org> <20180412013942.egucc4isxkokta7z@debian> <20180412044404-mutt-send-email-mst@kernel.org>
From: Jason Wang
Message-ID: <1726abc8-92ff-420a-adbd-c08e9fa251d2@redhat.com>
Date: Thu, 12 Apr 2018 11:37:35 +0800
In-Reply-To: <20180412044404-mutt-send-email-mst@kernel.org>
Subject: Re: [Qemu-devel] [RFC] vhost-user: introduce F_NEED_ALL_IOTLB protocol feature
To: "Michael S. Tsirkin" , Tiwei Bie
Cc: cunming.liang@intel.com, qemu-devel@nongnu.org, peterx@redhat.com, zhihong.wang@intel.com, dan.daly@intel.com

On 2018-04-12 09:57, Michael S. Tsirkin wrote:
> On Thu, Apr 12, 2018 at 09:39:43AM +0800, Tiwei Bie wrote:
>> On Thu, Apr 12, 2018 at 04:29:29AM +0300, Michael S. Tsirkin wrote:
>>> On Thu, Apr 12, 2018 at 09:10:59AM +0800, Tiwei Bie wrote:
>>>> On Wed, Apr 11, 2018 at 04:22:21PM +0300, Michael S.
Tsirkin wrote:
>>>>> On Wed, Apr 11, 2018 at 03:20:27PM +0800, Tiwei Bie wrote:
>>>>>> This patch introduces VHOST_USER_PROTOCOL_F_NEED_ALL_IOTLB
>>>>>> feature for vhost-user. By default, vhost-user backend needs
>>>>>> to query the IOTLBs from QEMU after meeting unknown IOVAs.
>>>>>> With this protocol feature negotiated, QEMU will provide all
>>>>>> the IOTLBs to vhost-user backend without waiting for the
>>>>>> queries from backend. This is helpful when using a hardware
>>>>>> accelerator which is not able to handle unknown IOVAs at the
>>>>>> vhost-user backend.
>>>>>>
>>>>>> Signed-off-by: Tiwei Bie
>>>>> This is potentially a large amount of data to be sent
>>>>> on a socket.
>>>> If we take the hardware accelerator out of this picture, we
>>>> will find that it's actually a question of "pre-loading" vs
>>>> "lazy-loading". I think neither of them is perfect.
>>>>
>>>> For "pre-loading", as you said, we may have a tough starting.
>>>> But for "lazy-loading", we can't have a predictable performance.
>>>> A sudden, unexpected performance drop may happen at any time,
>>>> because we may meet an unknown IOVA at any time in this case.
>>> That's how hardware behaves too though. So we can expect guests
>>> to try to optimize locality.
>> The difference is that, the software implementation needs to
>> query the mappings via socket. And it's much slower..
> If you are proposing this new feature as an optimization,
> then I'd like to see numbers showing the performance gains.
>
> It's definitely possible to optimize things out. Pre-loading isn't
> where I would start optimizing though. For example, DPDK could have its
> own VTD emulation, then it could access guest memory directly.

Having VT-d emulation in DPDK has several disadvantages:

- vendor lock-in: it can only work for Intel
- duplication of code and bugs
- a huge number of new message types would need to be invented

So I tend to go the reverse way: link DPDK into QEMU.
>
>
>>>> Once we meet an unknown IOVA, the backend's data path will need
>>>> to stop and query the mapping of the IOVA via the socket and
>>>> wait for the reply. And the latency is not negligible (sometimes
>>>> it's even unacceptable), especially in high performance network
>>>> case. So maybe it's better to make both of them available to
>>>> the vhost backend.
>>>>
>>>>> I had an impression that a hardware accelerator was using
>>>>> VFIO anyway. Given this, can't we have QEMU program
>>>>> the shadow IOMMU tables into VFIO directly?
>>>> I think it's a good idea! Currently, my concern about it is
>>>> that, the hardware device may not use IOMMU and it may have
>>>> its builtin address translation unit. And it will be a pain
>>>> for device vendors to teach VFIO to be able to work with the
>>>> builtin address translation unit.
>>> I think such drivers would have to integrate with VFIO somehow.
>>> Otherwise, what is the plan for assigning such devices then?
>> Such devices are just for vhost data path acceleration.
> That's not true I think. E.g. RDMA devices have an on-card MMU.
>
>> They have many available queue pairs, the switch logic
>> will be done among those queue pairs. And different queue
>> pairs will serve different VMs directly.
>>
>> Best regards,
>> Tiwei Bie
> The way I would do it is attach different PASID values to
> different queues. This way you can use the standard IOMMU
> to enforce protection.

So that's just shared virtual memory on the host, which can share an IOVA
address space between a specific queue pair and a process. I'm not sure
how hard it would be for existing vhost-user backends to support this.

Thanks

>
>
>
>>>
>>>> Best regards,
>>>> Tiwei Bie
>>>>
>>>>>
>>>>>> ---
>>>>>> The idea of this patch is to let QEMU push all the IOTLBs
>>>>>> to vhost-user backend without waiting for the queries from
>>>>>> the backend. Because hardware accelerator at the vhost-user
>>>>>> backend may not be able to handle unknown IOVAs.
>>>>>>
>>>>>> This is just a RFC for now. It seems that, it doesn't work
>>>>>> as expected when guest is using kernel driver (To handle
>>>>>> this case, it seems that some RAM regions' events also need
>>>>>> to be listened). Any comments would be appreciated! Thanks!
>>>>>>
>>>>>>  docs/interop/vhost-user.txt       |  9 ++++++++
>>>>>>  hw/virtio/vhost-backend.c         |  7 ++++++
>>>>>>  hw/virtio/vhost-user.c            |  8 +++++++
>>>>>>  hw/virtio/vhost.c                 | 47 ++++++++++++++++++++++++++++++++++++++++++---
>>>>>>  include/hw/virtio/vhost-backend.h |  3 +++
>>>>>>  5 files changed, 71 insertions(+), 3 deletions(-)
>>>>>>
>>>>>> diff --git a/docs/interop/vhost-user.txt b/docs/interop/vhost-user.txt
>>>>>> index 534caab18a..73e07f9dad 100644
>>>>>> --- a/docs/interop/vhost-user.txt
>>>>>> +++ b/docs/interop/vhost-user.txt
>>>>>> @@ -380,6 +380,7 @@ Protocol features
>>>>>>  #define VHOST_USER_PROTOCOL_F_CRYPTO_SESSION 7
>>>>>>  #define VHOST_USER_PROTOCOL_F_PAGEFAULT      8
>>>>>>  #define VHOST_USER_PROTOCOL_F_CONFIG         9
>>>>>> +#define VHOST_USER_PROTOCOL_F_NEED_ALL_IOTLB 10
>>>>>>
>>>>>>  Master message types
>>>>>>  --------------------
>>>>>> @@ -797,3 +798,11 @@ resilient for selective requests.
>>>>>>  For the message types that already solicit a reply from the client, the
>>>>>>  presence of VHOST_USER_PROTOCOL_F_REPLY_ACK or need_reply bit being set brings
>>>>>>  no behavioural change. (See the 'Communication' section for details.)
>>>>>> +
>>>>>> +VHOST_USER_PROTOCOL_F_NEED_ALL_IOTLB
>>>>>> +------------------------------------
>>>>>> +By default, vhost-user backend needs to query the IOTLBs from QEMU after
>>>>>> +meeting unknown IOVAs. With this protocol feature negotiated, QEMU will
>>>>>> +provide all the IOTLBs to vhost backend without waiting for the queries
>>>>>> +from backend. This is helpful when using a hardware accelerator which is
>>>>>> +not able to handle unknown IOVAs at the vhost-user backend.
>>>>>> diff --git a/hw/virtio/vhost-backend.c b/hw/virtio/vhost-backend.c
>>>>>> index 7f09efab8b..d72994e9a5 100644
>>>>>> --- a/hw/virtio/vhost-backend.c
>>>>>> +++ b/hw/virtio/vhost-backend.c
>>>>>> @@ -233,6 +233,11 @@ static void vhost_kernel_set_iotlb_callback(struct vhost_dev *dev,
>>>>>>          qemu_set_fd_handler((uintptr_t)dev->opaque, NULL, NULL, NULL);
>>>>>>  }
>>>>>>
>>>>>> +static bool vhost_kernel_need_all_device_iotlb(struct vhost_dev *dev)
>>>>>> +{
>>>>>> +    return false;
>>>>>> +}
>>>>>> +
>>>>>>  static const VhostOps kernel_ops = {
>>>>>>          .backend_type = VHOST_BACKEND_TYPE_KERNEL,
>>>>>>          .vhost_backend_init = vhost_kernel_init,
>>>>>> @@ -264,6 +269,8 @@ static const VhostOps kernel_ops = {
>>>>>>  #endif /* CONFIG_VHOST_VSOCK */
>>>>>>          .vhost_set_iotlb_callback = vhost_kernel_set_iotlb_callback,
>>>>>>          .vhost_send_device_iotlb_msg = vhost_kernel_send_device_iotlb_msg,
>>>>>> +        .vhost_backend_need_all_device_iotlb =
>>>>>> +                                    vhost_kernel_need_all_device_iotlb,
>>>>>>  };
>>>>>>
>>>>>>  int vhost_set_backend_type(struct vhost_dev *dev, VhostBackendType backend_type)
>>>>>> diff --git a/hw/virtio/vhost-user.c b/hw/virtio/vhost-user.c
>>>>>> index 38da8692bb..30e0b1c13b 100644
>>>>>> --- a/hw/virtio/vhost-user.c
>>>>>> +++ b/hw/virtio/vhost-user.c
>>>>>> @@ -47,6 +47,7 @@ enum VhostUserProtocolFeature {
>>>>>>      VHOST_USER_PROTOCOL_F_CRYPTO_SESSION = 7,
>>>>>>      VHOST_USER_PROTOCOL_F_PAGEFAULT = 8,
>>>>>>      VHOST_USER_PROTOCOL_F_CONFIG = 9,
>>>>>> +    VHOST_USER_PROTOCOL_F_NEED_ALL_IOTLB = 10,
>>>>>>      VHOST_USER_PROTOCOL_F_MAX
>>>>>>  };
>>>>>>
>>>>>> @@ -1581,6 +1582,12 @@ vhost_user_crypto_close_session(struct vhost_dev *dev, uint64_t session_id)
>>>>>>      return 0;
>>>>>>  }
>>>>>>
>>>>>> +static bool vhost_user_need_all_device_iotlb(struct vhost_dev *dev)
>>>>>> +{
>>>>>> +    return virtio_has_feature(dev->protocol_features,
>>>>>> +                              VHOST_USER_PROTOCOL_F_NEED_ALL_IOTLB);
>>>>>> +}
>>>>>> +
>>>>>>  const VhostOps
user_ops = {
>>>>>>      .backend_type = VHOST_BACKEND_TYPE_USER,
>>>>>>      .vhost_backend_init = vhost_user_init,
>>>>>> @@ -1611,4 +1618,5 @@ const VhostOps user_ops = {
>>>>>>      .vhost_set_config = vhost_user_set_config,
>>>>>>      .vhost_crypto_create_session = vhost_user_crypto_create_session,
>>>>>>      .vhost_crypto_close_session = vhost_user_crypto_close_session,
>>>>>> +    .vhost_backend_need_all_device_iotlb = vhost_user_need_all_device_iotlb,
>>>>>>  };
>>>>>> diff --git a/hw/virtio/vhost.c b/hw/virtio/vhost.c
>>>>>> index f51bf573d5..70922ee998 100644
>>>>>> --- a/hw/virtio/vhost.c
>>>>>> +++ b/hw/virtio/vhost.c
>>>>>> @@ -48,6 +48,9 @@ static unsigned int used_memslots;
>>>>>>  static QLIST_HEAD(, vhost_dev) vhost_devices =
>>>>>>      QLIST_HEAD_INITIALIZER(vhost_devices);
>>>>>>
>>>>>> +static int vhost_memory_region_lookup(struct vhost_dev *hdev, uint64_t gpa,
>>>>>> +                                      uint64_t *uaddr, uint64_t *len);
>>>>>> +
>>>>>>  bool vhost_has_free_slot(void)
>>>>>>  {
>>>>>>      unsigned int slots_limit = ~0U;
>>>>>> @@ -634,12 +637,39 @@ static void vhost_region_addnop(MemoryListener *listener,
>>>>>>      vhost_region_add_section(dev, section);
>>>>>>  }
>>>>>>
>>>>>> -static void vhost_iommu_unmap_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>>>>>> +static void vhost_iommu_map_notify(IOMMUNotifier *n, IOMMUTLBEntry *iotlb)
>>>>>>  {
>>>>>>      struct vhost_iommu *iommu = container_of(n, struct vhost_iommu, n);
>>>>>>      struct vhost_dev *hdev = iommu->hdev;
>>>>>>      hwaddr iova = iotlb->iova + iommu->iommu_offset;
>>>>>>
>>>>>> +    if ((iotlb->perm & IOMMU_RW) != IOMMU_NONE) {
>>>>>> +        uint64_t uaddr, len;
>>>>>> +
>>>>>> +        rcu_read_lock();
>>>>>> +
>>>>>> +        if (iotlb->target_as != NULL) {
>>>>>> +            if (vhost_memory_region_lookup(hdev, iotlb->translated_addr,
>>>>>> +                                           &uaddr, &len)) {
>>>>>> +                error_report("Fail to lookup the translated address "
>>>>>> +                             "%"PRIx64, iotlb->translated_addr);
>>>>>> +                goto out;
>>>>>> +            }
>>>>>> +
>>>>>> +            len
= MIN(iotlb->addr_mask + 1, len);
>>>>>> +            iova = iova & ~iotlb->addr_mask;
>>>>>> +
>>>>>> +            if (vhost_backend_update_device_iotlb(hdev, iova, uaddr,
>>>>>> +                                                  len, iotlb->perm)) {
>>>>>> +                error_report("Fail to update device iotlb");
>>>>>> +                goto out;
>>>>>> +            }
>>>>>> +        }
>>>>>> +out:
>>>>>> +        rcu_read_unlock();
>>>>>> +        return;
>>>>>> +    }
>>>>>> +
>>>>>>      if (vhost_backend_invalidate_device_iotlb(hdev, iova,
>>>>>>                                                iotlb->addr_mask + 1)) {
>>>>>>          error_report("Fail to invalidate device iotlb");
>>>>>> @@ -652,6 +682,7 @@ static void vhost_iommu_region_add(MemoryListener *listener,
>>>>>>      struct vhost_dev *dev = container_of(listener, struct vhost_dev,
>>>>>>                                           iommu_listener);
>>>>>>      struct vhost_iommu *iommu;
>>>>>> +    IOMMUNotifierFlag flags;
>>>>>>      Int128 end;
>>>>>>
>>>>>>      if (!memory_region_is_iommu(section->mr)) {
>>>>>> @@ -662,8 +693,15 @@ static void vhost_iommu_region_add(MemoryListener *listener,
>>>>>>      end = int128_add(int128_make64(section->offset_within_region),
>>>>>>                       section->size);
>>>>>>      end = int128_sub(end, int128_one());
>>>>>> -    iommu_notifier_init(&iommu->n, vhost_iommu_unmap_notify,
>>>>>> -                        IOMMU_NOTIFIER_UNMAP,
>>>>>> +
>>>>>> +    if (dev->vhost_ops->vhost_backend_need_all_device_iotlb(dev)) {
>>>>>> +        flags = IOMMU_NOTIFIER_ALL;
>>>>>> +    } else {
>>>>>> +        flags = IOMMU_NOTIFIER_UNMAP;
>>>>>> +    }
>>>>>> +
>>>>>> +    iommu_notifier_init(&iommu->n, vhost_iommu_map_notify,
>>>>>> +                        flags,
>>>>>>                          section->offset_within_region,
>>>>>>                          int128_get64(end));
>>>>>>      iommu->mr = section->mr;
>>>>>> @@ -673,6 +711,9 @@ static void vhost_iommu_region_add(MemoryListener *listener,
>>>>>>      memory_region_register_iommu_notifier(section->mr, &iommu->n);
>>>>>>      QLIST_INSERT_HEAD(&dev->iommu_list, iommu, iommu_next);
>>>>>>      /* TODO: can replay help performance here?
*/
>>>>>> +    if (flags == IOMMU_NOTIFIER_ALL) {
>>>>>> +        memory_region_iommu_replay(IOMMU_MEMORY_REGION(iommu->mr), &iommu->n);
>>>>>> +    }
>>>>>>  }
>>>>>>
>>>>>>  static void vhost_iommu_region_del(MemoryListener *listener,
>>>>>> diff --git a/include/hw/virtio/vhost-backend.h b/include/hw/virtio/vhost-backend.h
>>>>>> index 5dac61f9ea..602fd987a4 100644
>>>>>> --- a/include/hw/virtio/vhost-backend.h
>>>>>> +++ b/include/hw/virtio/vhost-backend.h
>>>>>> @@ -101,6 +101,8 @@ typedef int (*vhost_crypto_create_session_op)(struct vhost_dev *dev,
>>>>>>  typedef int (*vhost_crypto_close_session_op)(struct vhost_dev *dev,
>>>>>>                                               uint64_t session_id);
>>>>>>
>>>>>> +typedef bool (*vhost_backend_need_all_device_iotlb_op)(struct vhost_dev *dev);
>>>>>> +
>>>>>>  typedef struct VhostOps {
>>>>>>      VhostBackendType backend_type;
>>>>>>      vhost_backend_init vhost_backend_init;
>>>>>> @@ -138,6 +140,7 @@ typedef struct VhostOps {
>>>>>>      vhost_set_config_op vhost_set_config;
>>>>>>      vhost_crypto_create_session_op vhost_crypto_create_session;
>>>>>>      vhost_crypto_close_session_op vhost_crypto_close_session;
>>>>>> +    vhost_backend_need_all_device_iotlb_op vhost_backend_need_all_device_iotlb;
>>>>>>  } VhostOps;
>>>>>>
>>>>>>  extern const VhostOps user_ops;
>>>>>> --
>>>>>> 2.11.0