Re: [RFC PATCH V2 2/2] vhost: device IOTLB API

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Jason Wang <jasowang@redhat.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: kvm@vger.kernel.org, qemu-devel@nongnu.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	peterx@redhat.com, virtualization@lists.linux-foundation.org,
	pbonzini@redhat.com
Subject: Re: [RFC PATCH V2 2/2] vhost: device IOTLB API
Date: Thu, 28 Apr 2016 14:37:16 +0800	[thread overview]
Message-ID: <5721AF9C.9030209@redhat.com> (raw)
In-Reply-To: <20160427143057-mutt-send-email-mst@redhat.com>



On 04/27/2016 07:45 PM, Michael S. Tsirkin wrote:
> On Fri, Mar 25, 2016 at 10:34:34AM +0800, Jason Wang wrote:
>> This patch tries to implement an device IOTLB for vhost. This could be
>> used with for co-operation with userspace(qemu) implementation of DMA
>> remapping.
>>
>> The idea is simple. When vhost meets an IOTLB miss, it will request
>> the assistance of userspace to do the translation, this is done
>> through:
>>
>> - Fill the translation request in a preset userspace address (This
>>   address is set through ioctl VHOST_SET_IOTLB_REQUEST_ENTRY).
>> - Notify userspace through eventfd (This eventfd was set through ioctl
>>   VHOST_SET_IOTLB_FD).
> Why use an eventfd for this?

The aim is to implement the API all through ioctls.

>  We use them for interrupts because
> that happens to be what kvm wants, but here - why don't we
> just add a generic support for reading out events
> on the vhost fd itself?

I've considered this approach, but what's the advantages of this? I mean
looks like all other ioctls could be done through vhost fd
reading/writing too.

>
>> - device IOTLB were started and stopped through VHOST_RUN_IOTLB ioctl
>>
>> When userspace finishes the translation, it will update the vhost
>> IOTLB through VHOST_UPDATE_IOTLB ioctl. Userspace is also in charge of
>> snooping the IOTLB invalidation of IOMMU IOTLB and use
>> VHOST_UPDATE_IOTLB to invalidate the possible entry in vhost.
> There's one problem here, and that is that VQs still do not undergo
> translation.  In theory VQ could be mapped in such a way
> that it's not contigious in userspace memory.

I'm not sure I get the issue, current vhost API support setting
desc_user_addr, used_user_addr and avail_user_addr independently. So
looks ok? If not, looks not a problem to device IOTLB API itself.

>
>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
> What limits amount of entries that kernel keeps around?

It depends on guest working set I think. Looking at
http://dpdk.org/doc/guides/linux_gsg/sys_reqs.html:

- For 2MB page size in guest, it suggests hugepages=1024
- For 1GB page size, it suggests a hugepages=4

So I choose 2048 to make sure it can cover this.

> Do we want at least a mod parameter for this?

Maybe.

>
>> ---
>>  drivers/vhost/net.c        |   6 +-
>>  drivers/vhost/vhost.c      | 301 +++++++++++++++++++++++++++++++++++++++------
>>  drivers/vhost/vhost.h      |  17 ++-
>>  fs/eventfd.c               |   3 +-
>>  include/uapi/linux/vhost.h |  35 ++++++
>>  5 files changed, 320 insertions(+), 42 deletions(-)
>>

[...]

>> +struct vhost_iotlb_entry {
>> +	__u64 iova;
>> +	__u64 size;
>> +	__u64 userspace_addr;
> Alignment requirements?

The API does not require any alignment. Will add a comment for this.

>
>> +	struct {
>> +#define VHOST_ACCESS_RO      0x1
>> +#define VHOST_ACCESS_WO      0x2
>> +#define VHOST_ACCESS_RW      0x3
>> +		__u8  perm;
>> +#define VHOST_IOTLB_MISS           1
>> +#define VHOST_IOTLB_UPDATE         2
>> +#define VHOST_IOTLB_INVALIDATE     3
>> +		__u8  type;
>> +#define VHOST_IOTLB_INVALID        0x1
>> +#define VHOST_IOTLB_VALID          0x2
>> +		__u8  valid;
> why do we need this flag?

Useless, will remove.

>
>> +		__u8  u8_padding;
>> +		__u32 padding;
>> +	} flags;
>> +};
>> +
>> +struct vhost_vring_iotlb_entry {
>> +	unsigned int index;
>> +	__u64 userspace_addr;
>> +};
>> +
>>  struct vhost_memory_region {
>>  	__u64 guest_phys_addr;
>>  	__u64 memory_size; /* bytes */
>> @@ -127,6 +153,15 @@ struct vhost_memory {
>>  /* Set eventfd to signal an error */
>>  #define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
>>  
>> +/* IOTLB */
>> +/* Specify an eventfd file descriptor to signle on IOTLB miss */
> typo

Will fix it.

>
>> +#define VHOST_SET_VRING_IOTLB_CALL _IOW(VHOST_VIRTIO, 0x23, struct      \
>> +                                        vhost_vring_file)
>> +#define VHOST_SET_VRING_IOTLB_REQUEST _IOW(VHOST_VIRTIO, 0x25, struct   \
>> +                                           vhost_vring_iotlb_entry)
>> +#define VHOST_UPDATE_IOTLB _IOW(VHOST_VIRTIO, 0x24, struct vhost_iotlb_entry)
>> +#define VHOST_RUN_IOTLB _IOW(VHOST_VIRTIO, 0x26, int)
>> +
> Is the assumption that userspace must dedicate a thread to running the iotlb? 
> I dislike this one.
> Please support asynchronous APIs at least optionally to make
> userspace make its own threading decisions.

Nope, my qemu patches does not use a dedicated thread. This API is used
to start or top DMAR according to e.g whether guest enable DMAR for
intel IOMMU.

>
>>  /* VHOST_NET specific defines */
>>  
>>  /* Attach virtio net ring to a raw socket, or tap device.
> Don't we need a feature bit for this?

Yes we need it. The feature bit is not considered in this patch and
looks like it was still under discussion. After we finalize it, I will add.

> Are we short on feature bits? If yes maybe it's time to add
> something like PROTOCOL_FEATURES that we have in vhost-user.
>

I believe it can just work like VERSION_1, or is there anything I missed?

WARNING: multiple messages have this Message-ID (diff)

From: Jason Wang <jasowang@redhat.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	peterx@redhat.com, pbonzini@redhat.com, qemu-devel@nongnu.org
Subject: Re: [RFC PATCH V2 2/2] vhost: device IOTLB API
Date: Thu, 28 Apr 2016 14:37:16 +0800	[thread overview]
Message-ID: <5721AF9C.9030209@redhat.com> (raw)
In-Reply-To: <20160427143057-mutt-send-email-mst@redhat.com>



On 04/27/2016 07:45 PM, Michael S. Tsirkin wrote:
> On Fri, Mar 25, 2016 at 10:34:34AM +0800, Jason Wang wrote:
>> This patch tries to implement an device IOTLB for vhost. This could be
>> used with for co-operation with userspace(qemu) implementation of DMA
>> remapping.
>>
>> The idea is simple. When vhost meets an IOTLB miss, it will request
>> the assistance of userspace to do the translation, this is done
>> through:
>>
>> - Fill the translation request in a preset userspace address (This
>>   address is set through ioctl VHOST_SET_IOTLB_REQUEST_ENTRY).
>> - Notify userspace through eventfd (This eventfd was set through ioctl
>>   VHOST_SET_IOTLB_FD).
> Why use an eventfd for this?

The aim is to implement the API all through ioctls.

>  We use them for interrupts because
> that happens to be what kvm wants, but here - why don't we
> just add a generic support for reading out events
> on the vhost fd itself?

I've considered this approach, but what's the advantages of this? I mean
looks like all other ioctls could be done through vhost fd
reading/writing too.

>
>> - device IOTLB were started and stopped through VHOST_RUN_IOTLB ioctl
>>
>> When userspace finishes the translation, it will update the vhost
>> IOTLB through VHOST_UPDATE_IOTLB ioctl. Userspace is also in charge of
>> snooping the IOTLB invalidation of IOMMU IOTLB and use
>> VHOST_UPDATE_IOTLB to invalidate the possible entry in vhost.
> There's one problem here, and that is that VQs still do not undergo
> translation.  In theory VQ could be mapped in such a way
> that it's not contigious in userspace memory.

I'm not sure I get the issue, current vhost API support setting
desc_user_addr, used_user_addr and avail_user_addr independently. So
looks ok? If not, looks not a problem to device IOTLB API itself.

>
>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
> What limits amount of entries that kernel keeps around?

It depends on guest working set I think. Looking at
http://dpdk.org/doc/guides/linux_gsg/sys_reqs.html:

- For 2MB page size in guest, it suggests hugepages=1024
- For 1GB page size, it suggests a hugepages=4

So I choose 2048 to make sure it can cover this.

> Do we want at least a mod parameter for this?

Maybe.

>
>> ---
>>  drivers/vhost/net.c        |   6 +-
>>  drivers/vhost/vhost.c      | 301 +++++++++++++++++++++++++++++++++++++++------
>>  drivers/vhost/vhost.h      |  17 ++-
>>  fs/eventfd.c               |   3 +-
>>  include/uapi/linux/vhost.h |  35 ++++++
>>  5 files changed, 320 insertions(+), 42 deletions(-)
>>

[...]

>> +struct vhost_iotlb_entry {
>> +	__u64 iova;
>> +	__u64 size;
>> +	__u64 userspace_addr;
> Alignment requirements?

The API does not require any alignment. Will add a comment for this.

>
>> +	struct {
>> +#define VHOST_ACCESS_RO      0x1
>> +#define VHOST_ACCESS_WO      0x2
>> +#define VHOST_ACCESS_RW      0x3
>> +		__u8  perm;
>> +#define VHOST_IOTLB_MISS           1
>> +#define VHOST_IOTLB_UPDATE         2
>> +#define VHOST_IOTLB_INVALIDATE     3
>> +		__u8  type;
>> +#define VHOST_IOTLB_INVALID        0x1
>> +#define VHOST_IOTLB_VALID          0x2
>> +		__u8  valid;
> why do we need this flag?

Useless, will remove.

>
>> +		__u8  u8_padding;
>> +		__u32 padding;
>> +	} flags;
>> +};
>> +
>> +struct vhost_vring_iotlb_entry {
>> +	unsigned int index;
>> +	__u64 userspace_addr;
>> +};
>> +
>>  struct vhost_memory_region {
>>  	__u64 guest_phys_addr;
>>  	__u64 memory_size; /* bytes */
>> @@ -127,6 +153,15 @@ struct vhost_memory {
>>  /* Set eventfd to signal an error */
>>  #define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
>>  
>> +/* IOTLB */
>> +/* Specify an eventfd file descriptor to signle on IOTLB miss */
> typo

Will fix it.

>
>> +#define VHOST_SET_VRING_IOTLB_CALL _IOW(VHOST_VIRTIO, 0x23, struct      \
>> +                                        vhost_vring_file)
>> +#define VHOST_SET_VRING_IOTLB_REQUEST _IOW(VHOST_VIRTIO, 0x25, struct   \
>> +                                           vhost_vring_iotlb_entry)
>> +#define VHOST_UPDATE_IOTLB _IOW(VHOST_VIRTIO, 0x24, struct vhost_iotlb_entry)
>> +#define VHOST_RUN_IOTLB _IOW(VHOST_VIRTIO, 0x26, int)
>> +
> Is the assumption that userspace must dedicate a thread to running the iotlb? 
> I dislike this one.
> Please support asynchronous APIs at least optionally to make
> userspace make its own threading decisions.

Nope, my qemu patches does not use a dedicated thread. This API is used
to start or top DMAR according to e.g whether guest enable DMAR for
intel IOMMU.

>
>>  /* VHOST_NET specific defines */
>>  
>>  /* Attach virtio net ring to a raw socket, or tap device.
> Don't we need a feature bit for this?

Yes we need it. The feature bit is not considered in this patch and
looks like it was still under discussion. After we finalize it, I will add.

> Are we short on feature bits? If yes maybe it's time to add
> something like PROTOCOL_FEATURES that we have in vhost-user.
>

I believe it can just work like VERSION_1, or is there anything I missed?

WARNING: multiple messages have this Message-ID (diff)

From: Jason Wang <jasowang@redhat.com>
To: "Michael S. Tsirkin" <mst@redhat.com>
Cc: kvm@vger.kernel.org, virtualization@lists.linux-foundation.org,
	netdev@vger.kernel.org, linux-kernel@vger.kernel.org,
	peterx@redhat.com, pbonzini@redhat.com, qemu-devel@nongnu.org
Subject: Re: [Qemu-devel] [RFC PATCH V2 2/2] vhost: device IOTLB API
Date: Thu, 28 Apr 2016 14:37:16 +0800	[thread overview]
Message-ID: <5721AF9C.9030209@redhat.com> (raw)
In-Reply-To: <20160427143057-mutt-send-email-mst@redhat.com>



On 04/27/2016 07:45 PM, Michael S. Tsirkin wrote:
> On Fri, Mar 25, 2016 at 10:34:34AM +0800, Jason Wang wrote:
>> This patch tries to implement an device IOTLB for vhost. This could be
>> used with for co-operation with userspace(qemu) implementation of DMA
>> remapping.
>>
>> The idea is simple. When vhost meets an IOTLB miss, it will request
>> the assistance of userspace to do the translation, this is done
>> through:
>>
>> - Fill the translation request in a preset userspace address (This
>>   address is set through ioctl VHOST_SET_IOTLB_REQUEST_ENTRY).
>> - Notify userspace through eventfd (This eventfd was set through ioctl
>>   VHOST_SET_IOTLB_FD).
> Why use an eventfd for this?

The aim is to implement the API all through ioctls.

>  We use them for interrupts because
> that happens to be what kvm wants, but here - why don't we
> just add a generic support for reading out events
> on the vhost fd itself?

I've considered this approach, but what's the advantages of this? I mean
looks like all other ioctls could be done through vhost fd
reading/writing too.

>
>> - device IOTLB were started and stopped through VHOST_RUN_IOTLB ioctl
>>
>> When userspace finishes the translation, it will update the vhost
>> IOTLB through VHOST_UPDATE_IOTLB ioctl. Userspace is also in charge of
>> snooping the IOTLB invalidation of IOMMU IOTLB and use
>> VHOST_UPDATE_IOTLB to invalidate the possible entry in vhost.
> There's one problem here, and that is that VQs still do not undergo
> translation.  In theory VQ could be mapped in such a way
> that it's not contigious in userspace memory.

I'm not sure I get the issue, current vhost API support setting
desc_user_addr, used_user_addr and avail_user_addr independently. So
looks ok? If not, looks not a problem to device IOTLB API itself.

>
>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
> What limits amount of entries that kernel keeps around?

It depends on guest working set I think. Looking at
http://dpdk.org/doc/guides/linux_gsg/sys_reqs.html:

- For 2MB page size in guest, it suggests hugepages=1024
- For 1GB page size, it suggests a hugepages=4

So I choose 2048 to make sure it can cover this.

> Do we want at least a mod parameter for this?

Maybe.

>
>> ---
>>  drivers/vhost/net.c        |   6 +-
>>  drivers/vhost/vhost.c      | 301 +++++++++++++++++++++++++++++++++++++++------
>>  drivers/vhost/vhost.h      |  17 ++-
>>  fs/eventfd.c               |   3 +-
>>  include/uapi/linux/vhost.h |  35 ++++++
>>  5 files changed, 320 insertions(+), 42 deletions(-)
>>

[...]

>> +struct vhost_iotlb_entry {
>> +	__u64 iova;
>> +	__u64 size;
>> +	__u64 userspace_addr;
> Alignment requirements?

The API does not require any alignment. Will add a comment for this.

>
>> +	struct {
>> +#define VHOST_ACCESS_RO      0x1
>> +#define VHOST_ACCESS_WO      0x2
>> +#define VHOST_ACCESS_RW      0x3
>> +		__u8  perm;
>> +#define VHOST_IOTLB_MISS           1
>> +#define VHOST_IOTLB_UPDATE         2
>> +#define VHOST_IOTLB_INVALIDATE     3
>> +		__u8  type;
>> +#define VHOST_IOTLB_INVALID        0x1
>> +#define VHOST_IOTLB_VALID          0x2
>> +		__u8  valid;
> why do we need this flag?

Useless, will remove.

>
>> +		__u8  u8_padding;
>> +		__u32 padding;
>> +	} flags;
>> +};
>> +
>> +struct vhost_vring_iotlb_entry {
>> +	unsigned int index;
>> +	__u64 userspace_addr;
>> +};
>> +
>>  struct vhost_memory_region {
>>  	__u64 guest_phys_addr;
>>  	__u64 memory_size; /* bytes */
>> @@ -127,6 +153,15 @@ struct vhost_memory {
>>  /* Set eventfd to signal an error */
>>  #define VHOST_SET_VRING_ERR _IOW(VHOST_VIRTIO, 0x22, struct vhost_vring_file)
>>  
>> +/* IOTLB */
>> +/* Specify an eventfd file descriptor to signle on IOTLB miss */
> typo

Will fix it.

>
>> +#define VHOST_SET_VRING_IOTLB_CALL _IOW(VHOST_VIRTIO, 0x23, struct      \
>> +                                        vhost_vring_file)
>> +#define VHOST_SET_VRING_IOTLB_REQUEST _IOW(VHOST_VIRTIO, 0x25, struct   \
>> +                                           vhost_vring_iotlb_entry)
>> +#define VHOST_UPDATE_IOTLB _IOW(VHOST_VIRTIO, 0x24, struct vhost_iotlb_entry)
>> +#define VHOST_RUN_IOTLB _IOW(VHOST_VIRTIO, 0x26, int)
>> +
> Is the assumption that userspace must dedicate a thread to running the iotlb? 
> I dislike this one.
> Please support asynchronous APIs at least optionally to make
> userspace make its own threading decisions.

Nope, my qemu patches does not use a dedicated thread. This API is used
to start or top DMAR according to e.g whether guest enable DMAR for
intel IOMMU.

>
>>  /* VHOST_NET specific defines */
>>  
>>  /* Attach virtio net ring to a raw socket, or tap device.
> Don't we need a feature bit for this?

Yes we need it. The feature bit is not considered in this patch and
looks like it was still under discussion. After we finalize it, I will add.

> Are we short on feature bits? If yes maybe it's time to add
> something like PROTOCOL_FEATURES that we have in vhost-user.
>

I believe it can just work like VERSION_1, or is there anything I missed?

next prev parent reply	other threads:[~2016-04-28  6:37 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2016-03-25  2:34 [RFC PATCH V2 0/2] basic device IOTLB support Jason Wang
2016-03-25  2:34 ` [Qemu-devel] " Jason Wang
2016-03-25  2:34 ` Jason Wang
2016-03-25  2:34 ` [RFC PATCH V2 1/2] vhost: convert pre sorted vhost memory array to interval tree Jason Wang
2016-03-25  2:34   ` [Qemu-devel] " Jason Wang
2016-03-25  2:34   ` Jason Wang
2016-04-27 11:30   ` Michael S. Tsirkin
2016-04-27 11:30     ` [Qemu-devel] " Michael S. Tsirkin
2016-04-27 11:30     ` Michael S. Tsirkin
2016-04-28  6:20     ` Jason Wang
2016-04-28  6:20       ` [Qemu-devel] " Jason Wang
2016-04-28  6:20       ` Jason Wang
2016-03-25  2:34 ` [RFC PATCH V2 2/2] vhost: device IOTLB API Jason Wang
2016-03-25  2:34   ` [Qemu-devel] " Jason Wang
2016-03-25  2:34   ` Jason Wang
2016-04-27 11:45   ` Michael S. Tsirkin
2016-04-27 11:45   ` Michael S. Tsirkin
2016-04-27 11:45     ` [Qemu-devel] " Michael S. Tsirkin
2016-04-27 11:45     ` Michael S. Tsirkin
2016-04-28  6:37     ` Jason Wang [this message]
2016-04-28  6:37       ` [Qemu-devel] " Jason Wang
2016-04-28  6:37       ` Jason Wang
2016-04-28 14:43       ` Michael S. Tsirkin
2016-04-28 14:43         ` [Qemu-devel] " Michael S. Tsirkin
2016-04-28 14:43         ` Michael S. Tsirkin
2016-04-29  1:12         ` Jason Wang
2016-04-29  1:12           ` [Qemu-devel] " Jason Wang
2016-04-29  1:12           ` Jason Wang
2016-04-29  4:44           ` Jason Wang
2016-04-29  4:44             ` [Qemu-devel] " Jason Wang
2016-04-29  4:44             ` Jason Wang

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=5721AF9C.9030209@redhat.com \
    --to=jasowang@redhat.com \
    --cc=kvm@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mst@redhat.com \
    --cc=netdev@vger.kernel.org \
    --cc=pbonzini@redhat.com \
    --cc=peterx@redhat.com \
    --cc=qemu-devel@nongnu.org \
    --cc=virtualization@lists.linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.