virtualization.lists.linux-foundation.org archive mirror
* Re: [RFC 3/4] vduse: grab the module's references until there is no vduse device
       [not found] ` <20201019145623.671-4-xieyongji@bytedance.com>
@ 2020-10-19 15:05   ` Michael S. Tsirkin
       [not found]     ` <CACycT3sQ-rw+weEktyK5jQTfMNWYR6qSaD1vAUEyCP6x7C9rRQ@mail.gmail.com>
  0 siblings, 1 reply; 12+ messages in thread
From: Michael S. Tsirkin @ 2020-10-19 15:05 UTC (permalink / raw)
  To: Xie Yongji; +Cc: linux-mm, akpm, virtualization

On Mon, Oct 19, 2020 at 10:56:22PM +0800, Xie Yongji wrote:
> The module should not be unloaded if any vduse device exists.
> So increase the module's reference count when creating a vduse
> device, and keep the reference until the device is destroyed.
> 
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>  drivers/vdpa/vdpa_user/vduse_dev.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> index 6787ba66725c..f04aa02de8c1 100644
> --- a/drivers/vdpa/vdpa_user/vduse_dev.c
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -887,6 +887,7 @@ static int vduse_destroy_dev(u32 id)
>  	kfree(dev->vqs);
>  	vduse_iova_domain_destroy(dev->domain);
>  	vduse_dev_destroy(dev);
> +	module_put(THIS_MODULE);
>  
>  	return 0;
>  }
> @@ -931,6 +932,7 @@ static int vduse_create_dev(struct vduse_dev_config *config)
>  
>  	dev->connected = true;
>  	list_add(&dev->list, &vduse_devs);
> +	__module_get(THIS_MODULE);
>  
>  	return fd;
>  err_fd:

This kind of thing is usually an indicator of a bug. E.g.
if the refcount drops to 0 on module_put(THIS_MODULE) it
will be unloaded and the following return will not run.
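To illustrate the hazard being pointed out, here is a hypothetical sketch (not code from the patch): if the last module_put() is issued from a function inside the module itself, the refcount can hit zero while that function is still executing. The conventional fix is to let core kernel code hold the reference on the module's behalf, e.g. via file_operations.owner.

```c
/* Hypothetical illustration of the hazard, not code from the patch.
 * If the final module_put() runs from a function inside the module,
 * the module text could be freed while it is still executing: */
static int vduse_destroy_dev_sketch(u32 id)
{
	/* ... teardown ... */
	module_put(THIS_MODULE);  /* refcount may drop to 0 here... */
	return 0;                 /* ...and this runs in freed text */
}

/* The usual safe pattern pins the module from *outside* it instead:
 * the VFS takes a reference via file_operations.owner, so the last
 * module_put() is issued by fput() in core kernel code, never by the
 * module's own text: */
static const struct file_operations vduse_dev_fops_sketch = {
	.owner = THIS_MODULE,  /* release path does the final put */
	/* ... */
};
```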



> -- 
> 2.25.1

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [RFC 2/4] vduse: Introduce VDUSE - vDPA Device in Userspace
       [not found] ` <20201019145623.671-3-xieyongji@bytedance.com>
@ 2020-10-19 15:08   ` Michael S. Tsirkin
  2020-10-19 15:24     ` Randy Dunlap
  0 siblings, 1 reply; 12+ messages in thread
From: Michael S. Tsirkin @ 2020-10-19 15:08 UTC (permalink / raw)
  To: Xie Yongji; +Cc: linux-mm, akpm, virtualization

On Mon, Oct 19, 2020 at 10:56:21PM +0800, Xie Yongji wrote:
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> new file mode 100644
> index 000000000000..855d2116b3a6
> --- /dev/null
> +++ b/include/uapi/linux/vduse.h
> @@ -0,0 +1,85 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _UAPI_VDUSE_H_
> +#define _UAPI_VDUSE_H_
> +
> +#include <linux/types.h>
> +
> +#define VDUSE_CONFIG_DATA_LEN	8
> +
> +enum vduse_req_type {
> +	VDUSE_SET_VQ_STATE,
> +	VDUSE_SET_FEATURES,
> +	VDUSE_GET_FEATURES,
> +	VDUSE_SET_STATUS,
> +	VDUSE_GET_STATUS,
> +	VDUSE_SET_CONFIG,
> +	VDUSE_GET_CONFIG,
> +};
> +
> +struct vduse_vq_state {
> +	__u32 index;
> +	__u32 num;
> +	__u64 desc_addr;
> +	__u64 driver_addr;
> +	__u64 device_addr;
> +	__u8 ready;
> +};
> +
> +struct vduse_dev_config_data {
> +	__u32 offset;
> +	__u32 len;
> +	__u8 data[VDUSE_CONFIG_DATA_LEN];
> +};
> +
> +struct vduse_dev_request {
> +	__u32 type;
> +	__u32 unique;
> +	__u32 flags;
> +	__u32 size;
> +	union {
> +		struct vduse_vq_state vq_state;
> +		struct vduse_dev_config_data config;
> +		__u64 features;
> +		__u8 status;
> +	};
> +};
> +
> +struct vduse_dev_response {
> +	__u32 unique;
> +	__s32 result;
> +	union {
> +		struct vduse_dev_config_data config;
> +		__u64 features;
> +		__u8 status;
> +	};
> +};
> +
> +/* ioctl */
> +
> +struct vduse_dev_config {
> +	__u32 id;
> +	__u32 vendor_id;
> +	__u32 device_id;
> +	__u64 iova_size;
> +	__u16 vq_num;
> +	__u16 vq_size_max;
> +	__u32 vq_align;
> +};
> +
> +struct vduse_vq_eventfd {
> +	__u32 index;
> +	__u32 fd;
> +};
> +
> +#define VDUSE_BASE	'V'
> +
> +#define VDUSE_CREATE_DEV	_IOW(VDUSE_BASE, 0x01, struct vduse_dev_config)
> +#define VDUSE_GET_DEV		_IO(VDUSE_BASE, 0x02)
> +#define VDUSE_DESTROY_DEV	_IO(VDUSE_BASE, 0x03)
> +
> +#define VDUSE_DEV_START		_IO(VDUSE_BASE, 0x04)
> +#define VDUSE_DEV_STOP		_IO(VDUSE_BASE, 0x05)
> +#define VDUSE_VQ_SETUP_KICKFD	_IOW(VDUSE_BASE, 0x06, struct vduse_vq_eventfd)
> +#define VDUSE_VQ_SETUP_IRQFD	_IOW(VDUSE_BASE, 0x07, struct vduse_vq_eventfd)
> +
> +#endif /* _UAPI_VDUSE_H_ */


Could we see some documentation about the user interface of this module please?

> -- 
> 2.25.1


* Re: [RFC 1/4] mm: export zap_page_range() for driver use
       [not found] ` <20201019145623.671-2-xieyongji@bytedance.com>
@ 2020-10-19 15:14   ` Matthew Wilcox
  0 siblings, 0 replies; 12+ messages in thread
From: Matthew Wilcox @ 2020-10-19 15:14 UTC (permalink / raw)
  To: Xie Yongji; +Cc: linux-mm, virtualization, akpm, mst

On Mon, Oct 19, 2020 at 10:56:20PM +0800, Xie Yongji wrote:
> Export zap_page_range() for use in VDUSE.

I think you're missing a lot of MMU notifier work by calling this
directly.  It probably works in every scenario you've tested, but won't
work for others.  I see you're using VM_MIXEDMAP -- would it make sense
to use VM_PFNMAP instead and use zap_vma_ptes()?  Or would it make sense
to change zap_vma_ptes() to handle VM_MIXEDMAP as well as VM_PFNMAP?


* Re: [RFC 2/4] vduse: Introduce VDUSE - vDPA Device in Userspace
  2020-10-19 15:08   ` [RFC 2/4] vduse: Introduce VDUSE - vDPA Device in Userspace Michael S. Tsirkin
@ 2020-10-19 15:24     ` Randy Dunlap
  0 siblings, 0 replies; 12+ messages in thread
From: Randy Dunlap @ 2020-10-19 15:24 UTC (permalink / raw)
  To: Michael S. Tsirkin, Xie Yongji; +Cc: linux-mm, akpm, virtualization

On 10/19/20 8:08 AM, Michael S. Tsirkin wrote:
> On Mon, Oct 19, 2020 at 10:56:21PM +0800, Xie Yongji wrote:
>> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h

>> +#define VDUSE_BASE	'V'
>> +
> 
> Could we see some documentation about the user interface of this module please?
> 

Also, the VDUSE_BASE value should be documented in
Documentation/userspace-api/ioctl/ioctl-number.rst.
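For reference, the entry would be one row in the table in that file, roughly as below. The "in development" comment is a placeholder; the exact wording is up to the submitter, and note that 'V' is already heavily used (e.g. by Video4Linux), which the table would flag:

```
Code  Seq#   Include File            Comments
====  =====  ======================  ===============
'V'   01-07  uapi/linux/vduse.h      in development
```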

thanks.
-- 
~Randy


* Re: [External] Re: [RFC 3/4] vduse: grab the module's references until there is no vduse device
       [not found]     ` <CACycT3sQ-rw+weEktyK5jQTfMNWYR6qSaD1vAUEyCP6x7C9rRQ@mail.gmail.com>
@ 2020-10-19 15:47       ` Michael S. Tsirkin
       [not found]         ` <CACycT3vG+ZEhn3SYy=7c5rkMz4XRbQZL21NdpPozPnH_x6Srhg@mail.gmail.com>
  0 siblings, 1 reply; 12+ messages in thread
From: Michael S. Tsirkin @ 2020-10-19 15:47 UTC (permalink / raw)
  To: Xie Yongji; +Cc: linux-mm, akpm, virtualization

On Mon, Oct 19, 2020 at 11:44:36PM +0800, Xie Yongji wrote:
> 
> 
> On Mon, Oct 19, 2020 at 11:05 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> 
>     On Mon, Oct 19, 2020 at 10:56:22PM +0800, Xie Yongji wrote:
>     > The module should not be unloaded if any vduse device exists.
>     > So increase the module's reference count when creating a vduse
>     > device, and keep the reference until the device is destroyed.
>     >
>     > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>     > ---
>     >  drivers/vdpa/vdpa_user/vduse_dev.c | 2 ++
>     >  1 file changed, 2 insertions(+)
>     >
>     > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/
>     vduse_dev.c
>     > index 6787ba66725c..f04aa02de8c1 100644
>     > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
>     > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
>     > @@ -887,6 +887,7 @@ static int vduse_destroy_dev(u32 id)
>     >       kfree(dev->vqs);
>     >       vduse_iova_domain_destroy(dev->domain);
>     >       vduse_dev_destroy(dev);
>     > +     module_put(THIS_MODULE);
>     > 
>     >       return 0;
>     >  }
>     > @@ -931,6 +932,7 @@ static int vduse_create_dev(struct vduse_dev_config
>     *config)
>     > 
>     >       dev->connected = true;
>     >       list_add(&dev->list, &vduse_devs);
>     > +     __module_get(THIS_MODULE);
>     > 
>     >       return fd;
>     >  err_fd:
> 
>     This kind of thing is usually an indicator of a bug. E.g.
>     if the refcount drops to 0 on module_put(THIS_MODULE) it
>     will be unloaded and the following return will not run.
> 
> 
> 
> Should this happen?  The refcount should be only decreased to 0 after the
> misc_device is closed?
> 
> Thanks,
> Yongji
> 

OTOH if it never drops to 0 anyway then why do you need to increase it?

> 
> 
>     > --
>     > 2.25.1
> 
> 


* Re: [External] Re: [RFC 3/4] vduse: grab the module's references until there is no vduse device
       [not found]         ` <CACycT3vG+ZEhn3SYy=7c5rkMz4XRbQZL21NdpPozPnH_x6Srhg@mail.gmail.com>
@ 2020-10-19 16:41           ` Michael S. Tsirkin
  0 siblings, 0 replies; 12+ messages in thread
From: Michael S. Tsirkin @ 2020-10-19 16:41 UTC (permalink / raw)
  To: Xie Yongji; +Cc: linux-mm, akpm, virtualization

On Mon, Oct 19, 2020 at 11:56:35PM +0800, Xie Yongji wrote:
> 
> 
> 
> On Mon, Oct 19, 2020 at 11:47 PM Michael S. Tsirkin <mst@redhat.com> wrote:
> 
>     On Mon, Oct 19, 2020 at 11:44:36PM +0800, Xie Yongji wrote:
>     >
>     >
>     > On Mon, Oct 19, 2020 at 11:05 PM Michael S. Tsirkin <mst@redhat.com>
>     wrote:
>     >
>     >     On Mon, Oct 19, 2020 at 10:56:22PM +0800, Xie Yongji wrote:
>     >     > The module should not be unloaded if any vduse device exists.
>     >     > So increase the module's reference count when creating a vduse
>     >     > device, and keep the reference until the device is destroyed.
>     >     >
>     >     > Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>     >     > ---
>     >     >  drivers/vdpa/vdpa_user/vduse_dev.c | 2 ++
>     >     >  1 file changed, 2 insertions(+)
>     >     >
>     >     > diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/
>     vdpa_user/
>     >     vduse_dev.c
>     >     > index 6787ba66725c..f04aa02de8c1 100644
>     >     > --- a/drivers/vdpa/vdpa_user/vduse_dev.c
>     >     > +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
>     >     > @@ -887,6 +887,7 @@ static int vduse_destroy_dev(u32 id)
>     >     >       kfree(dev->vqs);
>     >     >       vduse_iova_domain_destroy(dev->domain);
>     >     >       vduse_dev_destroy(dev);
>     >     > +     module_put(THIS_MODULE);
>     >     > 
>     >     >       return 0;
>     >     >  }
>     >     > @@ -931,6 +932,7 @@ static int vduse_create_dev(struct
>     vduse_dev_config
>     >     *config)
>     >     > 
>     >     >       dev->connected = true;
>     >     >       list_add(&dev->list, &vduse_devs);
>     >     > +     __module_get(THIS_MODULE);
>     >     > 
>     >     >       return fd;
>     >     >  err_fd:
>     >
>     >     This kind of thing is usually an indicator of a bug. E.g.
>     >     if the refcount drops to 0 on module_put(THIS_MODULE) it
>     >     will be unloaded and the following return will not run.
>     >
>     >
>     >
>     > Should this happen?  The refcount should be only decreased to 0 after the
>     > misc_device is closed?
>     >
>     > Thanks,
>     > Yongji
>     >
> 
>     OTOH if it never drops to 0 anyway then why do you need to increase it?
> 
> 
> 
> To prevent unloading the module in the case that the device is created, but no
> user process is using it (e.g. the user process crashed).
> 
> Thanks,
> Yongji

Looks like it can drop to 0 if that is the case then?


> 
>     >
>     >
>     >     > --
>     >     > 2.25.1
>     >
>     >
> 
> 


* Re: [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace
       [not found] <20201019145623.671-1-xieyongji@bytedance.com>
                   ` (2 preceding siblings ...)
       [not found] ` <20201019145623.671-2-xieyongji@bytedance.com>
@ 2020-10-19 17:16 ` Michael S. Tsirkin
       [not found]   ` <CACycT3vzpm_+v-DbqeVRMg8BRny_GoL2JxpbzYC3JYTMKGn_vg@mail.gmail.com>
  2020-10-20  3:20 ` Jason Wang
  4 siblings, 1 reply; 12+ messages in thread
From: Michael S. Tsirkin @ 2020-10-19 17:16 UTC (permalink / raw)
  To: Xie Yongji; +Cc: linux-mm, akpm, virtualization

On Mon, Oct 19, 2020 at 10:56:19PM +0800, Xie Yongji wrote:
> This series introduces a framework, which can be used to implement
> vDPA devices in a userspace program. To implement it, the work
> consists of two parts: control path emulation and data path offloading.
> 
> In the control path, the VDUSE driver makes use of a message
> mechanism to forward the actions (get/set features, get/set status,
> get/set config space and set virtqueue states) from the virtio-vdpa
> driver to userspace. Userspace can use read()/write() to
> receive/reply to those control messages.
> 
> In the data path, the VDUSE driver implements an MMU-based
> on-chip IOMMU driver which supports both direct mapping and
> indirect mapping with a bounce buffer. Then userspace can access
> that IOVA space via mmap(). Besides, an eventfd mechanism is used to
> trigger interrupts and forward virtqueue kicks.
> 
> The details and our use case are shown below:
> 
> ------------------------     -----------------------------------------------------------
> |                  APP |     |                          QEMU                           |
> |       ---------      |     | --------------------    -------------------+<-->+------ |
> |       |dev/vdx|      |     | | device emulation |    | virtio dataplane |    | BDS | |
> ------------+-----------     -----------+-----------------------+-----------------+-----
>             |                           |                       |                 |
>             |                           | emulating             | offloading      |
> ------------+---------------------------+-----------------------+-----------------+------
> |    | block device |           |  vduse driver |        |  vdpa device |    | TCP/IP | |
> |    -------+--------           --------+--------        +------+-------     -----+---- |
> |           |                           |                |      |                 |     |
> |           |                           |                |      |                 |     |
> | ----------+----------       ----------+-----------     |      |                 |     |
> | | virtio-blk driver |       | virtio-vdpa driver |     |      |                 |     |
> | ----------+----------       ----------+-----------     |      |                 |     |
> |           |                           |                |      |                 |     |
> |           |                           ------------------      |                 |     |
> |           -----------------------------------------------------              ---+---  |
> ------------------------------------------------------------------------------ | NIC |---
>                                                                                ---+---
>                                                                                   |
>                                                                          ---------+---------
>                                                                          | Remote Storages |
>                                                                          -------------------
> We make use of it to implement a block device connecting to
> our distributed storage, which can be used in containers and
> bare metal.

What is not exactly clear is what the APP above is doing.

Taking virtio blk requests and sending them over the network
in some proprietary way?
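As an aside, the read()/write() control path described in the cover letter can be sketched from the userspace side as follows. The struct layouts are copied from patch 2's uapi header (trimmed to the fields used); the handler logic is illustrative, not the reference implementation. A real device server would loop: `read(fd, &req, sizeof(req))`, handle the request, `write(fd, &resp, sizeof(resp))`.

```c
/* Userspace side of the VDUSE control path, sketched.  Struct layouts
 * follow the proposed uapi header; handler behavior is illustrative. */
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum vduse_req_type {
	VDUSE_SET_VQ_STATE,
	VDUSE_SET_FEATURES,
	VDUSE_GET_FEATURES,
	VDUSE_SET_STATUS,
	VDUSE_GET_STATUS,
	VDUSE_SET_CONFIG,
	VDUSE_GET_CONFIG,
};

struct vduse_dev_request {
	uint32_t type;
	uint32_t unique;
	uint32_t flags;
	uint32_t size;
	union {
		uint64_t features;
		uint8_t status;
	};
};

struct vduse_dev_response {
	uint32_t unique;
	int32_t result;
	union {
		uint64_t features;
		uint8_t status;
	};
};

/* Pure request handler: easy to unit-test without a real device fd. */
static void handle_request(const struct vduse_dev_request *req,
			   struct vduse_dev_response *resp,
			   uint64_t *features, uint8_t *status)
{
	memset(resp, 0, sizeof(*resp));
	resp->unique = req->unique;  /* echo the id so the kernel can match */
	resp->result = 0;
	switch (req->type) {
	case VDUSE_SET_FEATURES: *features = req->features; break;
	case VDUSE_GET_FEATURES: resp->features = *features; break;
	case VDUSE_SET_STATUS:   *status = req->status; break;
	case VDUSE_GET_STATUS:   resp->status = *status; break;
	default:                 resp->result = -1; break;
	}
}
```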

> Compared with the qemu-nbd solution, this solution has
> higher performance, and we can have a unified technology stack
> in VMs and containers for remote storage.
> 
> To test it with a host disk (e.g. /dev/sdx):
> 
>   $ qemu-storage-daemon \
>       --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \
>       --monitor chardev=charmonitor \
>       --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/sdx,node-name=disk0 \
>       --export vduse-blk,id=test,node-name=disk0,writable=on,vduse-id=1,num-queues=16,queue-size=128
> 
> The qemu-storage-daemon can be found at https://github.com/bytedance/qemu/tree/vduse
> 
> Future work:
>   - Improve performance (e.g. zero copy implementation in datapath)
>   - Config interrupt support
>   - Userspace library (find a way to reuse device emulation code in qemu/rust-vmm)


How does this driver compare with vhost-user-blk (which doesn't need kernel support)?



> Xie Yongji (4):
>   mm: export zap_page_range() for driver use
>   vduse: Introduce VDUSE - vDPA Device in Userspace
>   vduse: grab the module's references until there is no vduse device
>   vduse: Add memory shrinker to reclaim bounce pages
> 
>  drivers/vdpa/Kconfig                 |    8 +
>  drivers/vdpa/Makefile                |    1 +
>  drivers/vdpa/vdpa_user/Makefile      |    5 +
>  drivers/vdpa/vdpa_user/eventfd.c     |  221 ++++++
>  drivers/vdpa/vdpa_user/eventfd.h     |   48 ++
>  drivers/vdpa/vdpa_user/iova_domain.c |  488 ++++++++++++
>  drivers/vdpa/vdpa_user/iova_domain.h |  104 +++
>  drivers/vdpa/vdpa_user/vduse.h       |   66 ++
>  drivers/vdpa/vdpa_user/vduse_dev.c   | 1081 ++++++++++++++++++++++++++
>  include/uapi/linux/vduse.h           |   85 ++
>  mm/memory.c                          |    1 +
>  11 files changed, 2108 insertions(+)
>  create mode 100644 drivers/vdpa/vdpa_user/Makefile
>  create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
>  create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
>  create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
>  create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
>  create mode 100644 drivers/vdpa/vdpa_user/vduse.h
>  create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>  create mode 100644 include/uapi/linux/vduse.h
> 
> -- 
> 2.25.1


* Re: [External] Re: [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace
       [not found]   ` <CACycT3vzpm_+v-DbqeVRMg8BRny_GoL2JxpbzYC3JYTMKGn_vg@mail.gmail.com>
@ 2020-10-20  2:20     ` Jason Wang
  0 siblings, 0 replies; 12+ messages in thread
From: Jason Wang @ 2020-10-20  2:20 UTC (permalink / raw)
  To: Xie Yongji, Michael S. Tsirkin
  Cc: linux-mm, akpm, virtualization


On 2020/10/20 10:18 AM, Xie Yongji wrote:
>
>
>
>     How does this driver compare with vhost-user-blk (which doesn't
>     need kernel support)?
>
>
> We want to implement a block device rather than a virtio-blk 
> dataplane. And with this driver's help, the vhost-user-blk process 
> could provide storage service to all APPs in the host.
>
> Thanks,
> Yongji


I guess the point is that, with the help of VDUSE, besides vhost-vDPA
for VMs, you can have a kernel virtio interface through virtio-vdpa, which
cannot be done with vhost-user-blk.

Thanks


* Re: [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace
       [not found] <20201019145623.671-1-xieyongji@bytedance.com>
                   ` (3 preceding siblings ...)
  2020-10-19 17:16 ` [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace Michael S. Tsirkin
@ 2020-10-20  3:20 ` Jason Wang
       [not found]   ` <CACycT3srzADF63rotgHwfsqn5GJOCbXx+19Dcnw8HLyTGY_7Eg@mail.gmail.com>
  4 siblings, 1 reply; 12+ messages in thread
From: Jason Wang @ 2020-10-20  3:20 UTC (permalink / raw)
  To: Xie Yongji, mst, akpm; +Cc: linux-mm, virtualization


On 2020/10/19 10:56 PM, Xie Yongji wrote:
> This series introduces a framework, which can be used to implement
> vDPA devices in a userspace program. To implement it, the work
> consists of two parts: control path emulation and data path offloading.
>
> In the control path, the VDUSE driver makes use of a message
> mechanism to forward the actions (get/set features, get/set status,
> get/set config space and set virtqueue states) from the virtio-vdpa
> driver to userspace. Userspace can use read()/write() to
> receive/reply to those control messages.
>
> In the data path, the VDUSE driver implements an MMU-based
> on-chip IOMMU driver which supports both direct mapping and
> indirect mapping with a bounce buffer. Then userspace can access
> that IOVA space via mmap(). Besides, an eventfd mechanism is used to
> trigger interrupts and forward virtqueue kicks.


This is pretty interesting!

For vhost-vdpa, it should work, but for virtio-vdpa, I think we should
carefully deal with the IOMMU/DMA ops stuff.

I notice that neither dma_map nor set_map is implemented in
vduse_vdpa_config_ops, which means you want to let vhost-vDPA deal
with the IOMMU domain stuff. Any reason for doing that?

The reason for the questions are:

1) You've implemented an on-chip IOMMU driver but don't expose it to the
generic IOMMU layer (or the generic IOMMU layer may need some extension to
support this)
2) We will probably remove the IOMMU domain management in vhost-vDPA, 
and move it to the device(parent).

So if it's possible, please implement either set_map() or
dma_map()/dma_unmap(); this may align with our future goal and may speed
up the development.

Btw, it would be helpful to give even more details on how the on-chip
IOMMU driver is implemented.


>
> The details and our use case are shown below:
>
> ------------------------     -----------------------------------------------------------
> |                  APP |     |                          QEMU                           |
> |       ---------      |     | --------------------    -------------------+<-->+------ |
> |       |dev/vdx|      |     | | device emulation |    | virtio dataplane |    | BDS | |
> ------------+-----------     -----------+-----------------------+-----------------+-----
>              |                           |                       |                 |
>              |                           | emulating             | offloading      |
> ------------+---------------------------+-----------------------+-----------------+------
> |    | block device |           |  vduse driver |        |  vdpa device |    | TCP/IP | |
> |    -------+--------           --------+--------        +------+-------     -----+---- |
> |           |                           |                |      |                 |     |
> |           |                           |                |      |                 |     |
> | ----------+----------       ----------+-----------     |      |                 |     |
> | | virtio-blk driver |       | virtio-vdpa driver |     |      |                 |     |
> | ----------+----------       ----------+-----------     |      |                 |     |
> |           |                           |                |      |                 |     |
> |           |                           ------------------      |                 |     |
> |           -----------------------------------------------------              ---+---  |
> ------------------------------------------------------------------------------ | NIC |---
>                                                                                 ---+---
>                                                                                    |
>                                                                           ---------+---------
>                                                                           | Remote Storages |
>                                                                           -------------------


The figure is not very clear to me in the following points:

1) if the device emulation and virtio dataplane are all implemented in
QEMU, what's the point of doing this? I thought the device should be a
remote process?
2) it would be better to draw a vDPA bus somewhere to help people
understand the architecture
3) for the "offloading" I guess it should be done via vhost-vDPA, so
it's better to draw a vhost-vDPA block there


> We make use of it to implement a block device connecting to
> our distributed storage, which can be used in containers and
> bare metal. Compared with the qemu-nbd solution, this solution has
> higher performance, and we can have a unified technology stack
> in VMs and containers for remote storage.
>
> To test it with a host disk (e.g. /dev/sdx):
>
>    $ qemu-storage-daemon \
>        --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \
>        --monitor chardev=charmonitor \
>        --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/sdx,node-name=disk0 \
>        --export vduse-blk,id=test,node-name=disk0,writable=on,vduse-id=1,num-queues=16,queue-size=128
>
> The qemu-storage-daemon can be found at https://github.com/bytedance/qemu/tree/vduse
>
> Future work:
>    - Improve performance (e.g. zero copy implementation in datapath)
>    - Config interrupt support
>    - Userspace library (find a way to reuse device emulation code in qemu/rust-vmm)


Right, a library will be very useful.

Thanks


>
> Xie Yongji (4):
>    mm: export zap_page_range() for driver use
>    vduse: Introduce VDUSE - vDPA Device in Userspace
>    vduse: grab the module's references until there is no vduse device
>    vduse: Add memory shrinker to reclaim bounce pages
>
>   drivers/vdpa/Kconfig                 |    8 +
>   drivers/vdpa/Makefile                |    1 +
>   drivers/vdpa/vdpa_user/Makefile      |    5 +
>   drivers/vdpa/vdpa_user/eventfd.c     |  221 ++++++
>   drivers/vdpa/vdpa_user/eventfd.h     |   48 ++
>   drivers/vdpa/vdpa_user/iova_domain.c |  488 ++++++++++++
>   drivers/vdpa/vdpa_user/iova_domain.h |  104 +++
>   drivers/vdpa/vdpa_user/vduse.h       |   66 ++
>   drivers/vdpa/vdpa_user/vduse_dev.c   | 1081 ++++++++++++++++++++++++++
>   include/uapi/linux/vduse.h           |   85 ++
>   mm/memory.c                          |    1 +
>   11 files changed, 2108 insertions(+)
>   create mode 100644 drivers/vdpa/vdpa_user/Makefile
>   create mode 100644 drivers/vdpa/vdpa_user/eventfd.c
>   create mode 100644 drivers/vdpa/vdpa_user/eventfd.h
>   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
>   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
>   create mode 100644 drivers/vdpa/vdpa_user/vduse.h
>   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>   create mode 100644 include/uapi/linux/vduse.h
>


* Re: [External] Re: [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace
       [not found]   ` <CACycT3srzADF63rotgHwfsqn5GJOCbXx+19Dcnw8HLyTGY_7Eg@mail.gmail.com>
@ 2020-10-20  8:01     ` Jason Wang
       [not found]       ` <CACycT3ssE-iMquAmrrHGQyBCv7XkQ2WrinFMMPTTubxuuOQ92g@mail.gmail.com>
  0 siblings, 1 reply; 12+ messages in thread
From: Jason Wang @ 2020-10-20  8:01 UTC (permalink / raw)
  To: Yongji Xie; +Cc: linux-mm, akpm, virtualization, Michael S. Tsirkin


On 2020/10/20 3:39 PM, Yongji Xie wrote:
>
>
> On Tue, Oct 20, 2020 at 11:20 AM Jason Wang <jasowang@redhat.com 
> <mailto:jasowang@redhat.com>> wrote:
>
>
>     On 2020/10/19 10:56 PM, Xie Yongji wrote:
>     > This series introduces a framework, which can be used to implement
>     > vDPA devices in a userspace program. To implement it, the work
>     > consists of two parts: control path emulation and data path
>     > offloading.
>     >
>     > In the control path, the VDUSE driver makes use of a message
>     > mechanism to forward the actions (get/set features, get/set status,
>     > get/set config space and set virtqueue states) from the virtio-vdpa
>     > driver to userspace. Userspace can use read()/write() to
>     > receive/reply to those control messages.
>     >
>     > In the data path, the VDUSE driver implements an MMU-based
>     > on-chip IOMMU driver which supports both direct mapping and
>     > indirect mapping with a bounce buffer. Then userspace can access
>     > that IOVA space via mmap(). Besides, an eventfd mechanism is used to
>     > trigger interrupts and forward virtqueue kicks.
>
>
>     This is pretty interesting!
>
>     For vhost-vdpa, it should work, but for virtio-vdpa, I think we
>     should
>     carefully deal with the IOMMU/DMA ops stuffs.
>
>
>     I notice that neither dma_map nor set_map is implemented in
>     vduse_vdpa_config_ops, this means you want to let vhost-vDPA to deal
>     with IOMMU domains stuffs.  Any reason for doing that?
>
> Actually, this series only focuses on the virtio-vdpa case now. To support
> vhost-vdpa, as you said, we need to implement dma_map/dma_unmap. But
> there is a limitation: the VM's memory can't be anonymous pages, which are
> forbidden in vm_insert_page(). Maybe we need to add some limits on
> vhost-vdpa?


I'm not sure I get this. Any reason that you want to use
vm_insert_page() on the VM's memory? Or do you mean you want to implement
some kind of zero-copy?

I guess from the software device implementation in user space it only needs
to receive IOVA ranges and map them in its own address space.


>     The reason for the questions are:
>
>     1) You've implemented a on-chip IOMMU driver but don't expose it to
>     generic IOMMU layer (or generic IOMMU layer may need some
>     extension to
>     support this)
>     2) We will probably remove the IOMMU domain management in vhost-vDPA,
>     and move it to the device(parent).
>
>     So if it's possible, please implement either set_map() or
>     dma_map()/dma_unmap(), this may align with our future goal and may
>     speed
>     up the development.
>
>     Btw, it would be helpful to give even more details on how the on-chip
>     IOMMU driver in implemented.
>
>
> The basic idea is treating the MMU (VA->PA) as an IOMMU (IOVA->PA), using
> vm_insert_page()/zap_page_range() to do address mapping/unmapping. The
> address mapping is done in the page fault handler because
> vm_insert_page() can't be called in atomic context such
> as dma_map_ops->map_page().


Ok, please add it in the cover letter or patch 2 in the next version.
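To make that scheme concrete, the fault-driven mapping Yongji describes would look roughly like the kernel-side sketch below. This is hypothetical, reconstructed from the description rather than taken from the patch, and the vduse_domain_get_page() helper is invented for illustration.

```c
/* Hypothetical sketch of the MMU-as-IOMMU idea described above, not
 * actual patch code.  Userspace mmap()s the IOVA space; pages are
 * inserted lazily from the fault handler, because vm_insert_page()
 * can sleep and so cannot run from dma_map_ops->map_page(). */
static vm_fault_t vduse_domain_mmap_fault(struct vm_fault *vmf)
{
	struct vduse_iova_domain *domain = vmf->vma->vm_private_data;
	unsigned long iova = vmf->address - vmf->vma->vm_start;
	struct page *page;

	/* Look up the page backing this IOVA (bounce page or direct
	 * mapping) -- hypothetical helper. */
	page = vduse_domain_get_page(domain, iova);
	if (!page)
		return VM_FAULT_SIGBUS;

	return vmf_insert_page(vmf->vma, vmf->address, page);
}
```

The unmap side would then be the exported zap_page_range() from patch 1, tearing PTEs back out of the userspace mapping when an IOVA range is unmapped.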


>
>     >
>     > The details and our use case are shown below:
>     >
>     > ------------------------     -----------------------------------------------------------
>     > |                  APP |     |                          QEMU                           |
>     > |       ---------      |     | --------------------    -------------------+<-->+------ |
>     > |       |dev/vdx|      |     | | device emulation |    | virtio dataplane |    | BDS | |
>     > ------------+-----------     -----------+-----------------------+-----------------+-----
>     >             |                           |                       |                 |
>     >             |                           | emulating             | offloading      |
>     > ------------+---------------------------+-----------------------+-----------------+------
>     > |    | block device |           |  vduse driver |        |  vdpa device |    | TCP/IP | |
>     > |    -------+--------           --------+--------        +------+-------     -----+---- |
>     > |           |                           |                |      |                 |     |
>     > |           |                           |                |      |                 |     |
>     > | ----------+----------       ----------+-----------     |      |                 |     |
>     > | | virtio-blk driver |       | virtio-vdpa driver |     |      |                 |     |
>     > | ----------+----------       ----------+-----------     |      |                 |     |
>     > |           |                           |                |      |                 |     |
>     > |           |                           ------------------      |                 |     |
>     > |           -----------------------------------------------------              ---+---  |
>     > ------------------------------------------------------------------------------ | NIC |---
>     >                                                                                ---+---
>     >                                                                                   |
>     >                                                                          ---------+---------
>     >                                                                          | Remote Storages |
>     >                                                                          -------------------
>
>
>     The figure is not very clear to me in the following points:
>
>     1) if the device emulation and virtio dataplane are all implemented in
>     QEMU, what's the point of doing this? I thought the device should be a
>     remote process?
>
>     2) it would be better to draw a vDPA bus somewhere to help people to
>     understand the architecture
>     3) for the "offloading" I guess it should be done via vhost-vDPA, so
>     it's better to draw a vhost-vDPA block there
>
>
> This figure only shows the virtio-vdpa case. I will take the vhost-vdpa
> case into consideration in the next version.


Please do that, otherwise this proposal is incomplete.

Thanks


>
> Thanks,
> Yongji

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [External] Re: [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace
       [not found]       ` <CACycT3ssE-iMquAmrrHGQyBCv7XkQ2WrinFMMPTTubxuuOQ92g@mail.gmail.com>
@ 2020-10-20  9:12         ` Jason Wang
       [not found]           ` <CACycT3s2GZ3yKP+Xn2V83_-=tXg342J4n91ZAb0c-+UD_+sFnA@mail.gmail.com>
  0 siblings, 1 reply; 12+ messages in thread
From: Jason Wang @ 2020-10-20  9:12 UTC (permalink / raw)
  To: Yongji Xie; +Cc: linux-mm, akpm, virtualization, Michael S. Tsirkin


On 2020/10/20 4:35 PM, Yongji Xie wrote:
>
>
> On Tue, Oct 20, 2020 at 4:01 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
>     On 2020/10/20 3:39 PM, Yongji Xie wrote:
>     >
>     >
>     > On Tue, Oct 20, 2020 at 11:20 AM Jason Wang <jasowang@redhat.com> wrote:
>     >
>     >
>     >     On 2020/10/19 10:56 PM, Xie Yongji wrote:
>     >     > This series introduces a framework which can be used to
>     >     > implement vDPA devices in a userspace program. To implement it,
>     >     > the work consists of two parts: control path emulation and data
>     >     > path offloading.
>     >     >
>     >     > In the control path, the VDUSE driver will make use of a message
>     >     > mechanism to forward the actions (get/set features, get/set
>     >     > status, get/set config space and set virtqueue states) from the
>     >     > virtio-vdpa driver to userspace. Userspace can use read()/write()
>     >     > to receive/reply to those control messages.
>     >     >
>     >     > In the data path, the VDUSE driver implements an MMU-based
>     >     > on-chip IOMMU driver which supports both direct mapping and
>     >     > indirect mapping with a bounce buffer. Then userspace can access
>     >     > that IOVA space via mmap(). Besides, the eventfd mechanism is
>     >     > used to trigger interrupts and forward virtqueue kicks.
>     >
>     >
>     >     This is pretty interesting!
>     >
>     >     For vhost-vdpa, it should work, but for virtio-vdpa, I think we
>     >     should
>     >     carefully deal with the IOMMU/DMA ops stuffs.
>     >
>     >
>     >     I notice that neither dma_map nor set_map is implemented in
>     >     vduse_vdpa_config_ops, this means you want to let vhost-vDPA
>     to deal
>     >     with IOMMU domains stuffs.  Any reason for doing that?
>     >
>     > Actually, this series only focus on virtio-vdpa case now. To
>     support
>     > vhost-vdpa,  as you said, we need to implement
>     dma_map/dma_unmap. But
>     > there is a limit that vm's memory can't be anonymous pages which
>     are
>     > forbidden in vm_insert_page(). Maybe we need to add some limits on
>     > vhost-vdpa?
>
>
>     I'm not sure I get this, any reason that you want to use
>     vm_insert_page() to VM's memory. Or do you mean you want to implement
>     some kind of zero-copy? 
>
>
>
> If my understanding is right, we will have a QEMU (VM) process and a
> device emulation process in the vhost-vdpa case, right? When I/O
> happens, the virtio driver in the VM will put the IOVA into the vring and
> the device emulation process will get the IOVA from the vring. Then the
> device emulation process will translate the IOVA to its own VA to access
> the DMA buffer, which resides in the VM's memory. That means the device
> emulation process needs to access the VM's memory, so we should use
> vm_insert_page() to build the page tables of the device emulation process.


Ok, I get you now. So it looks to me that the real issue is not the
limitation to anonymous pages but the comment above vm_insert_page():

"

  * The page has to be a nice clean _individual_ kernel allocation.
"

So I suspect that using vm_insert_page() to share pages between
processes is not legal. We need inputs from MM experts.

Thanks


>
>     I guess from the software device implemention in user space it
>     only need
>     to receive IOVA ranges and map them in its own address space.
>
>
> How to map them in its own address space if we don't use vm_insert_page()?


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [External] Re: [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace
       [not found]           ` <CACycT3s2GZ3yKP+Xn2V83_-=tXg342J4n91ZAb0c-+UD_+sFnA@mail.gmail.com>
@ 2020-10-23  8:44             ` Jason Wang
  0 siblings, 0 replies; 12+ messages in thread
From: Jason Wang @ 2020-10-23  8:44 UTC (permalink / raw)
  To: Yongji Xie; +Cc: linux-mm, akpm, virtualization, Michael S. Tsirkin


On 2020/10/23 10:55 AM, Yongji Xie wrote:
>
>
> On Tue, Oct 20, 2020 at 5:13 PM Jason Wang <jasowang@redhat.com> wrote:
>
>
>     On 2020/10/20 4:35 PM, Yongji Xie wrote:
>     >
>     >
>     > On Tue, Oct 20, 2020 at 4:01 PM Jason Wang <jasowang@redhat.com> wrote:
>     >
>     >
>     >     On 2020/10/20 3:39 PM, Yongji Xie wrote:
>     >     >
>     >     >
>     >     > On Tue, Oct 20, 2020 at 11:20 AM Jason Wang <jasowang@redhat.com> wrote:
>     >     >
>     >     >
>     >     >     On 2020/10/19 10:56 PM, Xie Yongji wrote:
>     >     >     > This series introduces a framework, which can be used to
>     >     implement
>     >     >     > vDPA Devices in a userspace program. To implement
>     it, the work
>     >     >     > consist of two parts: control path emulating and
>     data path
>     >     >     offloading.
>     >     >     >
>     >     >     > In the control path, the VDUSE driver will make use
>     of message
>     >     >     > mechnism to forward the actions (get/set features,
>     get/st
>     >     status,
>     >     >     > get/set config space and set virtqueue states) from
>     >     virtio-vdpa
>     >     >     > driver to userspace. Userspace can use read()/write() to
>     >     >     > receive/reply to those control messages.
>     >     >     >
>     >     >     > In the data path, the VDUSE driver implements a
>     MMU-based
>     >     >     > on-chip IOMMU driver which supports both direct
>     mapping and
>     >     >     > indirect mapping with bounce buffer. Then userspace
>     can access
>     >     >     > those iova space via mmap(). Besides, eventfd
>     mechnism is
>     >     used to
>     >     >     > trigger interrupts and forward virtqueue kicks.
>     >     >
>     >     >
>     >     >     This is pretty interesting!
>     >     >
>     >     >     For vhost-vdpa, it should work, but for virtio-vdpa, I
>     think we
>     >     >     should
>     >     >     carefully deal with the IOMMU/DMA ops stuffs.
>     >     >
>     >     >
>     >     >     I notice that neither dma_map nor set_map is
>     implemented in
>     >     >     vduse_vdpa_config_ops, this means you want to let
>     vhost-vDPA
>     >     to deal
>     >     >     with IOMMU domains stuffs.  Any reason for doing that?
>     >     >
>     >     > Actually, this series only focus on virtio-vdpa case now. To
>     >     support
>     >     > vhost-vdpa,  as you said, we need to implement
>     >     dma_map/dma_unmap. But
>     >     > there is a limit that vm's memory can't be anonymous pages
>     which
>     >     are
>     >     > forbidden in vm_insert_page(). Maybe we need to add some
>     limits on
>     >     > vhost-vdpa?
>     >
>     >
>     >     I'm not sure I get this, any reason that you want to use
>     >     vm_insert_page() to VM's memory. Or do you mean you want to
>     implement
>     >     some kind of zero-copy?
>     >
>     >
>     >
>     > If my understanding is right, we will have a QEMU (VM) process
>     and a
>     > device emulation process in the vhost-vdpa case, right? When I/O
>     > happens, the virtio driver in VM will put the IOVA to vring and
>     device
>     > emulation process will get the IOVA from vring. Then the device
>     > emulation process will translate the IOVA to its VA to access
>     the dma
>     > buffer which resides in VM's memory. That means the device
>     emulation
>     > process needs to access VM's memory, so we should use
>     vm_insert_page()
>     > to build the page table of the device emulation process.
>
>
>     Ok, I get you now. So it looks to me that the real issue is not the
>     limitation to anonymous pages but the comment above
>     vm_insert_page():
>
>     "
>
>       * The page has to be a nice clean _individual_ kernel allocation.
>     "
>
>     So I suspect that using vm_insert_page() to share pages between
>     processes is not legal. We need inputs from MM experts.
>
>
> Yes, vm_insert_page() can't be used in this case. So could we add the
> shmfd to the vhost iotlb msg and pass it to the device emulation
> process as a new iova_domain, just like vhost-user does?
>
> Thanks,
> Yongji


I think vhost-user does that via SET_MEM_TABLE, which is not supported by
vDPA. Note that the current IOTLB message will be used when vIOMMU is
enabled.

This needs more thought. Will come back if I have any thoughts.

Thanks


>
>
>
>
>     >
>     >     I guess from the software device implemention in user space it
>     >     only need
>     >     to receive IOVA ranges and map them in its own address space.
>     >
>     >
>     > How to map them in its own address space if we don't use
>     vm_insert_page()?
>


^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2020-10-23  8:45 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20201019145623.671-1-xieyongji@bytedance.com>
     [not found] ` <20201019145623.671-4-xieyongji@bytedance.com>
2020-10-19 15:05   ` [RFC 3/4] vduse: grab the module's references until there is no vduse device Michael S. Tsirkin
     [not found]     ` <CACycT3sQ-rw+weEktyK5jQTfMNWYR6qSaD1vAUEyCP6x7C9rRQ@mail.gmail.com>
2020-10-19 15:47       ` [External] " Michael S. Tsirkin
     [not found]         ` <CACycT3vG+ZEhn3SYy=7c5rkMz4XRbQZL21NdpPozPnH_x6Srhg@mail.gmail.com>
2020-10-19 16:41           ` Michael S. Tsirkin
     [not found] ` <20201019145623.671-3-xieyongji@bytedance.com>
2020-10-19 15:08   ` [RFC 2/4] vduse: Introduce VDUSE - vDPA Device in Userspace Michael S. Tsirkin
2020-10-19 15:24     ` Randy Dunlap
     [not found] ` <20201019145623.671-2-xieyongji@bytedance.com>
2020-10-19 15:14   ` [RFC 1/4] mm: export zap_page_range() for driver use Matthew Wilcox
2020-10-19 17:16 ` [RFC 0/4] Introduce VDUSE - vDPA Device in Userspace Michael S. Tsirkin
     [not found]   ` <CACycT3vzpm_+v-DbqeVRMg8BRny_GoL2JxpbzYC3JYTMKGn_vg@mail.gmail.com>
2020-10-20  2:20     ` [External] " Jason Wang
2020-10-20  3:20 ` Jason Wang
     [not found]   ` <CACycT3srzADF63rotgHwfsqn5GJOCbXx+19Dcnw8HLyTGY_7Eg@mail.gmail.com>
2020-10-20  8:01     ` [External] " Jason Wang
     [not found]       ` <CACycT3ssE-iMquAmrrHGQyBCv7XkQ2WrinFMMPTTubxuuOQ92g@mail.gmail.com>
2020-10-20  9:12         ` Jason Wang
     [not found]           ` <CACycT3s2GZ3yKP+Xn2V83_-=tXg342J4n91ZAb0c-+UD_+sFnA@mail.gmail.com>
2020-10-23  8:44             ` Jason Wang

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).