virtualization.lists.linux-foundation.org archive mirror
* Re: [PATCH v8 03/10] eventfd: Increase the recursion depth of eventfd_signal()
       [not found] ` <20210615141331.407-4-xieyongji@bytedance.com>
@ 2021-06-17  8:33   ` He Zhe
       [not found]     ` <CACycT3t1Dgrzsr7LbBrDhRLDa3qZ85ZOgj9H7r1fqPi-kf7r6Q@mail.gmail.com>
  0 siblings, 1 reply; 41+ messages in thread
From: He Zhe @ 2021-06-17  8:33 UTC (permalink / raw)
  To: Xie Yongji, mst, jasowang, stefanha, sgarzare, parav, hch,
	christian.brauner, rdunlap, willy, viro, axboe, bcrl, corbet,
	mika.penttila, dan.carpenter, joro, gregkh
  Cc: He Zhe, kvm, netdev, linux-kernel, virtualization, iommu,
	songmuchun, linux-fsdevel



On 6/15/21 10:13 PM, Xie Yongji wrote:
> Increase the recursion depth of eventfd_signal() to 1. This
> is the maximum recursion depth we have found so far, which
> can be triggered with the following call chain:
>
>     kvm_io_bus_write                        [kvm]
>       --> ioeventfd_write                   [kvm]
>         --> eventfd_signal                  [eventfd]
>           --> vhost_poll_wakeup             [vhost]
>             --> vduse_vdpa_kick_vq          [vduse]
>               --> eventfd_signal            [eventfd]
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> Acked-by: Jason Wang <jasowang@redhat.com>

A fix for this was posted a year ago:

https://lore.kernel.org/lkml/20200410114720.24838-1-zhe.he@windriver.com/


> ---
>  fs/eventfd.c            | 2 +-
>  include/linux/eventfd.h | 5 ++++-
>  2 files changed, 5 insertions(+), 2 deletions(-)
>
> diff --git a/fs/eventfd.c b/fs/eventfd.c
> index e265b6dd4f34..cc7cd1dbedd3 100644
> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -71,7 +71,7 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
>  	 * it returns true, the eventfd_signal() call should be deferred to a
>  	 * safe context.
>  	 */
> -	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
> +	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH))
>  		return 0;
>  
>  	spin_lock_irqsave(&ctx->wqh.lock, flags);
> diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
> index fa0a524baed0..886d99cd38ef 100644
> --- a/include/linux/eventfd.h
> +++ b/include/linux/eventfd.h
> @@ -29,6 +29,9 @@
>  #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
>  #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
>  
> +/* Maximum recursion depth */
> +#define EFD_WAKE_DEPTH 1
> +
>  struct eventfd_ctx;
>  struct file;
>  
> @@ -47,7 +50,7 @@ DECLARE_PER_CPU(int, eventfd_wake_count);
>  
>  static inline bool eventfd_signal_count(void)
>  {
> -	return this_cpu_read(eventfd_wake_count);
> +	return this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH;

A count is just a count. The check for how deep is acceptable should be
made where eventfd_signal_count() is called.


Zhe

>  }
>  
>  #else /* CONFIG_EVENTFD */
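
For readers following the thread, the guard being changed in the hunk above
can be modeled in userspace. This is only a sketch under stated assumptions:
a plain static counter stands in for the per-CPU eventfd_wake_count, and
sim_signal(), sim_nested_wakeup() and sim_run_chain() are hypothetical names,
not kernel API.

```c
#include <assert.h>
#include <stddef.h>

/* Mirrors the limit proposed in the patch: one nested call is tolerated. */
#define EFD_WAKE_DEPTH 1

/* Stand-in for the per-CPU eventfd_wake_count (assumption: single thread). */
static int sim_wake_count;

/* Number of signals refused because the nesting limit was exceeded. */
static int sim_refused;

/* Hypothetical wakeup callback type; a callback may signal again. */
typedef void (*sim_wakeup_cb)(void *arg);

static void sim_signal(sim_wakeup_cb cb, void *arg)
{
	/* Same comparison as the patched check in eventfd_signal(). */
	if (sim_wake_count > EFD_WAKE_DEPTH) {
		sim_refused++;
		return;
	}

	sim_wake_count++;
	if (cb)
		cb(arg);        /* the wakeup may recurse into sim_signal() */
	sim_wake_count--;
}

/* Callback that signals again while *arg (an int) is still positive. */
static void sim_nested_wakeup(void *arg)
{
	int *remaining = arg;

	if (*remaining > 0) {
		(*remaining)--;
		sim_signal(sim_nested_wakeup, remaining);
	}
}

/* Run one outer signal followed by `nested` nested ones; return refusals. */
static int sim_run_chain(int nested)
{
	int remaining = nested;

	sim_refused = 0;
	sim_signal(sim_nested_wakeup, &remaining);
	assert(sim_wake_count == 0);    /* counter must be balanced again */
	return sim_refused;
}
```

With this model, a chain with one nested signal (the kvm -> vhost -> vduse
case from the commit message) passes, while a second level of nesting is
refused.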

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 03/10] eventfd: Increase the recursion depth of eventfd_signal()
       [not found]     ` <CACycT3t1Dgrzsr7LbBrDhRLDa3qZ85ZOgj9H7r1fqPi-kf7r6Q@mail.gmail.com>
@ 2021-06-18  8:41       ` He Zhe
  2021-06-18  8:44       ` [PATCH] eventfd: Enlarge recursion limit to allow vhost to work He Zhe
  1 sibling, 0 replies; 41+ messages in thread
From: He Zhe @ 2021-06-18  8:41 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	qiang.zhang, Jonathan Corbet, joro, Matthew Wilcox,
	Christoph Hellwig, Dan Carpenter, Al Viro, Stefan Hajnoczi,
	songmuchun, Jens Axboe, Greg KH, Randy Dunlap, linux-kernel,
	iommu, bcrl, netdev, linux-fsdevel, Mika Penttilä



On 6/18/21 11:29 AM, Yongji Xie wrote:
> On Thu, Jun 17, 2021 at 4:34 PM He Zhe <zhe.he@windriver.com> wrote:
>>
>>
>> On 6/15/21 10:13 PM, Xie Yongji wrote:
>>> Increase the recursion depth of eventfd_signal() to 1. This
>>> is the maximum recursion depth we have found so far, which
>>> can be triggered with the following call chain:
>>>
>>>     kvm_io_bus_write                        [kvm]
>>>       --> ioeventfd_write                   [kvm]
>>>         --> eventfd_signal                  [eventfd]
>>>           --> vhost_poll_wakeup             [vhost]
>>>             --> vduse_vdpa_kick_vq          [vduse]
>>>               --> eventfd_signal            [eventfd]
>>>
>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>> Acked-by: Jason Wang <jasowang@redhat.com>
>> The fix had been posted one year ago.
>>
>> https://lore.kernel.org/lkml/20200410114720.24838-1-zhe.he@windriver.com/
>>
> OK, so it seems to be a fix for the RT system if my understanding is
> correct? Any reason why it's not merged? I'm happy to rebase my series
> on your patch if you'd like to repost it.

It applies to both mainline and RT kernels. The reporters just happened to
reproduce the problem in their RT environments.

The patch never got a reply from the maintainer, so it has not been merged yet.

OK, I'll resend the patch.

>
> BTW, I also notice another thread for this issue:
>
> https://lore.kernel.org/linux-fsdevel/DM6PR11MB420291B550A10853403C7592FF349@DM6PR11MB4202.namprd11.prod.outlook.com/T/

That takes the same approach as my v1

https://lore.kernel.org/lkml/3b4aa4cb-0e76-89c2-c48a-cf24e1a36bc2@kernel.dk/

which was not what the maintainer wanted.


>
>>> ---
>>>  fs/eventfd.c            | 2 +-
>>>  include/linux/eventfd.h | 5 ++++-
>>>  2 files changed, 5 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/eventfd.c b/fs/eventfd.c
>>> index e265b6dd4f34..cc7cd1dbedd3 100644
>>> --- a/fs/eventfd.c
>>> +++ b/fs/eventfd.c
>>> @@ -71,7 +71,7 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
>>>        * it returns true, the eventfd_signal() call should be deferred to a
>>>        * safe context.
>>>        */
>>> -     if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
>>> +     if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH))
>>>               return 0;
>>>
>>>       spin_lock_irqsave(&ctx->wqh.lock, flags);
>>> diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
>>> index fa0a524baed0..886d99cd38ef 100644
>>> --- a/include/linux/eventfd.h
>>> +++ b/include/linux/eventfd.h
>>> @@ -29,6 +29,9 @@
>>>  #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
>>>  #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
>>>
>>> +/* Maximum recursion depth */
>>> +#define EFD_WAKE_DEPTH 1
>>> +
>>>  struct eventfd_ctx;
>>>  struct file;
>>>
>>> @@ -47,7 +50,7 @@ DECLARE_PER_CPU(int, eventfd_wake_count);
>>>
>>>  static inline bool eventfd_signal_count(void)
>>>  {
>>> -     return this_cpu_read(eventfd_wake_count);
>>> +     return this_cpu_read(eventfd_wake_count) > EFD_WAKE_DEPTH;
>> count is just count. How deep is acceptable should be put
>> where eventfd_signal_count is called.
>>
> The return value of this function is boolean rather than integer.
> Please see the comments in eventfd_signal():
>
> "then it should check eventfd_signal_count() before calling this
> function. If it returns true, the eventfd_signal() call should be
> deferred to a safe context."

OK. Since the maintainer's comment says so, we can use it that way, though
I still feel that the function name and the type of its return value don't
match.


Thanks,
Zhe

>
> Thanks,
> Yongji
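
The comment quoted above describes a caller-side pattern: consult the boolean
predicate first and, if it reports that signalling is unsafe, defer the
signal. A minimal userspace sketch follows. The names sim_signal_unsafe(),
sim_notify() and sim_flush_deferred() are hypothetical stand-ins, and it is
an assumption of this sketch that deferral is modeled with a flag; a real
caller would more likely hand the work to a workqueue.

```c
#include <assert.h>
#include <stdbool.h>

#define EFD_WAKE_DEPTH 1

/* Stand-in for the per-CPU eventfd_wake_count (assumption: single thread). */
static int sim_wake_count;

struct sim_notifier {
	unsigned long long delivered;   /* signals delivered so far */
	bool deferred;                  /* one signal is waiting for a safe context */
};

/* Boolean predicate in the spirit of the patched eventfd_signal_count(). */
static bool sim_signal_unsafe(void)
{
	return sim_wake_count > EFD_WAKE_DEPTH;
}

/* Deliver now if it is safe, otherwise remember that a signal is pending. */
static void sim_notify(struct sim_notifier *n)
{
	if (sim_signal_unsafe()) {
		n->deferred = true;
		return;
	}
	n->delivered++;
}

/* Called later from a context where nesting is no longer a concern. */
static void sim_flush_deferred(struct sim_notifier *n)
{
	assert(!sim_signal_unsafe());
	if (n->deferred) {
		n->deferred = false;
		n->delivered++;
	}
}
```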


^ permalink raw reply	[flat|nested] 41+ messages in thread

* [PATCH] eventfd: Enlarge recursion limit to allow vhost to work
       [not found]     ` <CACycT3t1Dgrzsr7LbBrDhRLDa3qZ85ZOgj9H7r1fqPi-kf7r6Q@mail.gmail.com>
  2021-06-18  8:41       ` He Zhe
@ 2021-06-18  8:44       ` He Zhe
  2021-07-03  8:31         ` Michael S. Tsirkin
  1 sibling, 1 reply; 41+ messages in thread
From: He Zhe @ 2021-06-18  8:44 UTC (permalink / raw)
  To: xieyongji, mst, jasowang, stefanha, sgarzare, parav, hch,
	christian.brauner, rdunlap, willy, viro, axboe, bcrl, corbet,
	mika.penttila, dan.carpenter, gregkh, songmuchun, virtualization,
	kvm, linux-fsdevel, iommu, linux-kernel, qiang.zhang, zhe.he

commit b5e683d5cab8 ("eventfd: track eventfd_signal() recursion depth")
introduced a percpu counter that tracks the percpu recursion depth and
warns if it is greater than zero, to avoid potential deadlock and stack
overflow.

However, different eventfds may sometimes be used in parallel. Specifically,
when heavy network load goes through kvm and vhost, as shown below, it
triggers the following call trace.

-  100.00%
   - 66.51%
        ret_from_fork
        kthread
      - vhost_worker
         - 33.47% handle_tx_kick
              handle_tx
              handle_tx_copy
              vhost_tx_batch.isra.0
              vhost_add_used_and_signal_n
              eventfd_signal
         - 33.05% handle_rx_net
              handle_rx
              vhost_add_used_and_signal_n
              eventfd_signal
   - 33.49%
        ioctl
        entry_SYSCALL_64_after_hwframe
        do_syscall_64
        __x64_sys_ioctl
        ksys_ioctl
        do_vfs_ioctl
        kvm_vcpu_ioctl
        kvm_arch_vcpu_ioctl_run
        vmx_handle_exit
        handle_ept_misconfig
        kvm_io_bus_write
        __kvm_io_bus_write
        eventfd_signal

001: WARNING: CPU: 1 PID: 1503 at fs/eventfd.c:73 eventfd_signal+0x85/0xa0
---- snip ----
001: Call Trace:
001:  vhost_signal+0x15e/0x1b0 [vhost]
001:  vhost_add_used_and_signal_n+0x2b/0x40 [vhost]
001:  handle_rx+0xb9/0x900 [vhost_net]
001:  handle_rx_net+0x15/0x20 [vhost_net]
001:  vhost_worker+0xbe/0x120 [vhost]
001:  kthread+0x106/0x140
001:  ? log_used.part.0+0x20/0x20 [vhost]
001:  ? kthread_park+0x90/0x90
001:  ret_from_fork+0x35/0x40
001: ---[ end trace 0000000000000003 ]---

This patch raises the limit to 1, which is the maximum recursion depth we
have found so far.

Credit for the eventfd_signal_count() modification goes to
Xie Yongji <xieyongji@bytedance.com>

Signed-off-by: He Zhe <zhe.he@windriver.com>
---
 fs/eventfd.c            | 3 ++-
 include/linux/eventfd.h | 5 ++++-
 2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/fs/eventfd.c b/fs/eventfd.c
index e265b6dd4f34..add6af91cacf 100644
--- a/fs/eventfd.c
+++ b/fs/eventfd.c
@@ -71,7 +71,8 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
 	 * it returns true, the eventfd_signal() call should be deferred to a
 	 * safe context.
 	 */
-	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
+	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count) >
+	    EFD_WAKE_COUNT_MAX))
 		return 0;
 
 	spin_lock_irqsave(&ctx->wqh.lock, flags);
diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
index fa0a524baed0..74be152ebe87 100644
--- a/include/linux/eventfd.h
+++ b/include/linux/eventfd.h
@@ -29,6 +29,9 @@
 #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
 #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
 
+/* This is the maximum recursion depth we find so far */
+#define EFD_WAKE_COUNT_MAX 1
+
 struct eventfd_ctx;
 struct file;
 
@@ -47,7 +50,7 @@ DECLARE_PER_CPU(int, eventfd_wake_count);
 
 static inline bool eventfd_signal_count(void)
 {
-	return this_cpu_read(eventfd_wake_count);
+	return this_cpu_read(eventfd_wake_count) > EFD_WAKE_COUNT_MAX;
 }
 
 #else /* CONFIG_EVENTFD */
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
       [not found] ` <20210615141331.407-10-xieyongji@bytedance.com>
@ 2021-06-21  9:13   ` Jason Wang
       [not found]     ` <CACycT3tAON+-qZev+9EqyL2XbgH5HDspOqNt3ohQLQ8GqVK=EA@mail.gmail.com>
  2021-06-24 14:46   ` Stefan Hajnoczi
  2021-07-07  8:52   ` Stefan Hajnoczi
  2 siblings, 1 reply; 41+ messages in thread
From: Jason Wang @ 2021-06-21  9:13 UTC (permalink / raw)
  To: Xie Yongji, mst, stefanha, sgarzare, parav, hch,
	christian.brauner, rdunlap, willy, viro, axboe, bcrl, corbet,
	mika.penttila, dan.carpenter, joro, gregkh
  Cc: kvm, netdev, linux-kernel, virtualization, iommu, songmuchun,
	linux-fsdevel


On 2021/6/15 10:13 PM, Xie Yongji wrote:
> This VDUSE driver enables implementing vDPA devices in userspace.
> The vDPA device's control path is handled in kernel and the data
> path is handled in userspace.
>
> A message mechanism is used by VDUSE driver to forward some control
> messages such as starting/stopping datapath to userspace. Userspace
> can use read()/write() to receive/reply those control messages.
>
> And some ioctls are introduced to help userspace to implement the
> data path. VDUSE_IOTLB_GET_FD ioctl can be used to get the file
> descriptors referring to vDPA device's iova regions. Then userspace
> can use mmap() to access those iova regions. VDUSE_DEV_GET_FEATURES
> and VDUSE_VQ_GET_INFO ioctls are used to get the negotiated features
> and metadata of virtqueues. VDUSE_INJECT_VQ_IRQ and VDUSE_VQ_SETUP_KICKFD
> ioctls can be used to inject interrupt and setup the kickfd for
> virtqueues. VDUSE_DEV_UPDATE_CONFIG ioctl is used to update the
> configuration space and inject a config interrupt.
>
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>   Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>   drivers/vdpa/Kconfig                               |   10 +
>   drivers/vdpa/Makefile                              |    1 +
>   drivers/vdpa/vdpa_user/Makefile                    |    5 +
>   drivers/vdpa/vdpa_user/vduse_dev.c                 | 1453 ++++++++++++++++++++
>   include/uapi/linux/vduse.h                         |  143 ++
>   6 files changed, 1613 insertions(+)
>   create mode 100644 drivers/vdpa/vdpa_user/Makefile
>   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>   create mode 100644 include/uapi/linux/vduse.h
>
> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
> index 9bfc2b510c64..acd95e9dcfe7 100644
> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
> @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
>   'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
>   '|'   00-7F  linux/media.h
>   0x80  00-1F  linux/fb.h
> +0x81  00-1F  linux/vduse.h
>   0x89  00-06  arch/x86/include/asm/sockios.h
>   0x89  0B-DF  linux/sockios.h
>   0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
> index a503c1b2bfd9..6e23bce6433a 100644
> --- a/drivers/vdpa/Kconfig
> +++ b/drivers/vdpa/Kconfig
> @@ -33,6 +33,16 @@ config VDPA_SIM_BLOCK
>   	  vDPA block device simulator which terminates IO request in a
>   	  memory buffer.
>   
> +config VDPA_USER
> +	tristate "VDUSE (vDPA Device in Userspace) support"
> +	depends on EVENTFD && MMU && HAS_DMA
> +	select DMA_OPS
> +	select VHOST_IOTLB
> +	select IOMMU_IOVA
> +	help
> +	  With VDUSE it is possible to emulate a vDPA Device
> +	  in a userspace program.
> +
>   config IFCVF
>   	tristate "Intel IFC VF vDPA driver"
>   	depends on PCI_MSI
> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
> index 67fe7f3d6943..f02ebed33f19 100644
> --- a/drivers/vdpa/Makefile
> +++ b/drivers/vdpa/Makefile
> @@ -1,6 +1,7 @@
>   # SPDX-License-Identifier: GPL-2.0
>   obj-$(CONFIG_VDPA) += vdpa.o
>   obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
> +obj-$(CONFIG_VDPA_USER) += vdpa_user/
>   obj-$(CONFIG_IFCVF)    += ifcvf/
>   obj-$(CONFIG_MLX5_VDPA) += mlx5/
>   obj-$(CONFIG_VP_VDPA)    += virtio_pci/
> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
> new file mode 100644
> index 000000000000..260e0b26af99
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/Makefile
> @@ -0,0 +1,5 @@
> +# SPDX-License-Identifier: GPL-2.0
> +
> +vduse-y := vduse_dev.o iova_domain.o
> +
> +obj-$(CONFIG_VDPA_USER) += vduse.o
> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
> new file mode 100644
> index 000000000000..5271cbd15e28
> --- /dev/null
> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
> @@ -0,0 +1,1453 @@
> +// SPDX-License-Identifier: GPL-2.0-only
> +/*
> + * VDUSE: vDPA Device in Userspace
> + *
> + * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved.
> + *
> + * Author: Xie Yongji <xieyongji@bytedance.com>
> + *
> + */
> +
> +#include <linux/init.h>
> +#include <linux/module.h>
> +#include <linux/cdev.h>
> +#include <linux/device.h>
> +#include <linux/eventfd.h>
> +#include <linux/slab.h>
> +#include <linux/wait.h>
> +#include <linux/dma-map-ops.h>
> +#include <linux/poll.h>
> +#include <linux/file.h>
> +#include <linux/uio.h>
> +#include <linux/vdpa.h>
> +#include <linux/nospec.h>
> +#include <uapi/linux/vduse.h>
> +#include <uapi/linux/vdpa.h>
> +#include <uapi/linux/virtio_config.h>
> +#include <uapi/linux/virtio_ids.h>
> +#include <uapi/linux/virtio_blk.h>
> +#include <linux/mod_devicetable.h>
> +
> +#include "iova_domain.h"
> +
> +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
> +#define DRV_DESC     "vDPA Device in Userspace"
> +#define DRV_LICENSE  "GPL v2"
> +
> +#define VDUSE_DEV_MAX (1U << MINORBITS)
> +#define VDUSE_MAX_BOUNCE_SIZE (64 * 1024 * 1024)
> +#define VDUSE_IOVA_SIZE (128 * 1024 * 1024)
> +#define VDUSE_REQUEST_TIMEOUT 30
> +
> +struct vduse_virtqueue {
> +	u16 index;
> +	u32 num;
> +	u32 avail_idx;
> +	u64 desc_addr;
> +	u64 driver_addr;
> +	u64 device_addr;
> +	bool ready;
> +	bool kicked;
> +	spinlock_t kick_lock;
> +	spinlock_t irq_lock;
> +	struct eventfd_ctx *kickfd;
> +	struct vdpa_callback cb;
> +	struct work_struct inject;
> +};
> +
> +struct vduse_dev;
> +
> +struct vduse_vdpa {
> +	struct vdpa_device vdpa;
> +	struct vduse_dev *dev;
> +};
> +
> +struct vduse_dev {
> +	struct vduse_vdpa *vdev;
> +	struct device *dev;
> +	struct vduse_virtqueue *vqs;
> +	struct vduse_iova_domain *domain;
> +	char *name;
> +	struct mutex lock;
> +	spinlock_t msg_lock;
> +	u64 msg_unique;
> +	wait_queue_head_t waitq;
> +	struct list_head send_list;
> +	struct list_head recv_list;
> +	struct vdpa_callback config_cb;
> +	struct work_struct inject;
> +	spinlock_t irq_lock;
> +	int minor;
> +	bool connected;
> +	bool started;
> +	u64 api_version;
> +	u64 user_features;


Let's name this device_features.


> +	u64 features;


And name this one driver_features.


> +	u32 device_id;
> +	u32 vendor_id;
> +	u32 generation;
> +	u32 config_size;
> +	void *config;
> +	u8 status;
> +	u16 vq_size_max;
> +	u32 vq_num;
> +	u32 vq_align;
> +};
> +
> +struct vduse_dev_msg {
> +	struct vduse_dev_request req;
> +	struct vduse_dev_response resp;
> +	struct list_head list;
> +	wait_queue_head_t waitq;
> +	bool completed;
> +};
> +
> +struct vduse_control {
> +	u64 api_version;
> +};
> +
> +static DEFINE_MUTEX(vduse_lock);
> +static DEFINE_IDR(vduse_idr);
> +
> +static dev_t vduse_major;
> +static struct class *vduse_class;
> +static struct cdev vduse_ctrl_cdev;
> +static struct cdev vduse_cdev;
> +static struct workqueue_struct *vduse_irq_wq;
> +
> +static u32 allowed_device_id[] = {
> +	VIRTIO_ID_BLOCK,
> +};
> +
> +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
> +{
> +	struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
> +
> +	return vdev->dev;
> +}
> +
> +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
> +{
> +	struct vdpa_device *vdpa = dev_to_vdpa(dev);
> +
> +	return vdpa_to_vduse(vdpa);
> +}
> +
> +static struct vduse_dev_msg *vduse_find_msg(struct list_head *head,
> +					    uint32_t request_id)
> +{
> +	struct vduse_dev_msg *msg;
> +
> +	list_for_each_entry(msg, head, list) {
> +		if (msg->req.request_id == request_id) {
> +			list_del(&msg->list);
> +			return msg;
> +		}
> +	}
> +
> +	return NULL;
> +}
> +
> +static struct vduse_dev_msg *vduse_dequeue_msg(struct list_head *head)
> +{
> +	struct vduse_dev_msg *msg = NULL;
> +
> +	if (!list_empty(head)) {
> +		msg = list_first_entry(head, struct vduse_dev_msg, list);
> +		list_del(&msg->list);
> +	}
> +
> +	return msg;
> +}
> +
> +static void vduse_enqueue_msg(struct list_head *head,
> +			      struct vduse_dev_msg *msg)
> +{
> +	list_add_tail(&msg->list, head);
> +}
> +
> +static int vduse_dev_msg_send(struct vduse_dev *dev,
> +			      struct vduse_dev_msg *msg, bool no_reply)
> +{


It looks to me like the only user of no_reply=true is the dataplane start.
I wonder whether no_reply is really needed, considering we have switched to
wait_event_killable_timeout().

Put another way, no_reply is false for vq state synchronization and IOTLB
updating. I wonder if we can simply use no_reply = true for them.


> +	init_waitqueue_head(&msg->waitq);
> +	spin_lock(&dev->msg_lock);
> +	msg->req.request_id = dev->msg_unique++;
> +	vduse_enqueue_msg(&dev->send_list, msg);
> +	wake_up(&dev->waitq);
> +	spin_unlock(&dev->msg_lock);
> +	if (no_reply)
> +		return 0;
> +
> +	wait_event_killable_timeout(msg->waitq, msg->completed,
> +				    VDUSE_REQUEST_TIMEOUT * HZ);
> +	spin_lock(&dev->msg_lock);
> +	if (!msg->completed) {
> +		list_del(&msg->list);
> +		msg->resp.result = VDUSE_REQ_RESULT_FAILED;
> +	}
> +	spin_unlock(&dev->msg_lock);
> +
> +	return (msg->resp.result == VDUSE_REQ_RESULT_OK) ? 0 : -EIO;


Do we need to serialize the check by protecting it with the spinlock above?
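
One way to make the answer to the question above unambiguous is to compute
the return value while the lock is still held. The following is a userspace
sketch only: a pthread mutex stands in for dev->msg_lock, the SIM_RESULT_*
values are assumed for illustration, and sim_msg_finish() is a hypothetical
name covering just the tail of vduse_dev_msg_send().

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Assumed values; the real VDUSE_REQ_RESULT_* constants are not shown here. */
enum { SIM_RESULT_OK = 0, SIM_RESULT_FAILED = 1 };

struct sim_msg {
	int result;
	bool completed;
};

/* Stand-in for dev->msg_lock. */
static pthread_mutex_t sim_msg_lock = PTHREAD_MUTEX_INITIALIZER;

/* Mark a timed-out message as failed and read the result under one lock. */
static int sim_msg_finish(struct sim_msg *msg)
{
	int ret;

	pthread_mutex_lock(&sim_msg_lock);
	if (!msg->completed)
		msg->result = SIM_RESULT_FAILED;
	/* The result is sampled before the lock is dropped. */
	ret = (msg->result == SIM_RESULT_OK) ? 0 : -EIO;
	pthread_mutex_unlock(&sim_msg_lock);

	return ret;
}
```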


> +}
> +
> +static void vduse_dev_msg_cleanup(struct vduse_dev *dev)
> +{
> +	struct vduse_dev_msg *msg;
> +
> +	spin_lock(&dev->msg_lock);
> +	while ((msg = vduse_dequeue_msg(&dev->send_list))) {
> +		if (msg->req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
> +			kfree(msg);
> +		else
> +			vduse_enqueue_msg(&dev->recv_list, msg);
> +	}
> +	while ((msg = vduse_dequeue_msg(&dev->recv_list))) {
> +		msg->resp.result = VDUSE_REQ_RESULT_FAILED;
> +		msg->completed = 1;
> +		wake_up(&msg->waitq);
> +	}
> +	spin_unlock(&dev->msg_lock);
> +}
> +
> +static void vduse_dev_start_dataplane(struct vduse_dev *dev)
> +{
> +	struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
> +					    GFP_KERNEL | __GFP_NOFAIL);
> +
> +	msg->req.type = VDUSE_START_DATAPLANE;
> +	msg->req.flags |= VDUSE_REQ_FLAGS_NO_REPLY;
> +	vduse_dev_msg_send(dev, msg, true);
> +}
> +
> +static void vduse_dev_stop_dataplane(struct vduse_dev *dev)
> +{
> +	struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
> +					    GFP_KERNEL | __GFP_NOFAIL);
> +
> +	msg->req.type = VDUSE_STOP_DATAPLANE;
> +	msg->req.flags |= VDUSE_REQ_FLAGS_NO_REPLY;


Can we simply use this flag instead of introducing a new parameter 
(no_reply) in vduse_dev_msg_send()?
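
A rough sketch of that suggestion, in userspace form: the send path decides
whether to wait by looking at the request flags, so no extra parameter is
needed. The SIM_* names and the flag value are assumptions for illustration;
the real VDUSE_REQ_FLAGS_NO_REPLY definition lives in the uapi header, which
is not shown in this excerpt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bit value, for illustration only. */
#define SIM_REQ_FLAGS_NO_REPLY (1u << 0)

struct sim_request {
	uint32_t type;
	uint32_t flags;
};

struct sim_msg {
	struct sim_request req;
	bool completed;
};

enum sim_send_path {
	SIM_SEND_ASYNC,  /* queued, the caller does not wait */
	SIM_SEND_WAIT,   /* queued, the caller waits for the reply */
};

/* The flag already carried by the request tells us whether to wait. */
static bool sim_msg_wants_reply(const struct sim_msg *msg)
{
	return !(msg->req.flags & SIM_REQ_FLAGS_NO_REPLY);
}

static enum sim_send_path sim_msg_send(struct sim_msg *msg)
{
	/* Enqueue step elided: the quoted code adds msg to dev->send_list. */
	if (!sim_msg_wants_reply(msg))
		return SIM_SEND_ASYNC;

	return SIM_SEND_WAIT;
}
```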


> +	vduse_dev_msg_send(dev, msg, true);
> +}
> +
> +static int vduse_dev_get_vq_state(struct vduse_dev *dev,
> +				  struct vduse_virtqueue *vq,
> +				  struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +	int ret;


Note that I posted a series that implements packed virtqueue support:

https://lists.linuxfoundation.org/pipermail/virtualization/2021-June/054501.html

So this patch needs to be updated as well.


> +
> +	msg.req.type = VDUSE_GET_VQ_STATE;
> +	msg.req.vq_state.index = vq->index;
> +
> +	ret = vduse_dev_msg_send(dev, &msg, false);
> +	if (ret)
> +		return ret;
> +
> +	state->avail_index = msg.resp.vq_state.avail_idx;
> +	return 0;
> +}
> +
> +static int vduse_dev_update_iotlb(struct vduse_dev *dev,
> +				u64 start, u64 last)
> +{
> +	struct vduse_dev_msg msg = { 0 };
> +
> +	if (last < start)
> +		return -EINVAL;
> +
> +	msg.req.type = VDUSE_UPDATE_IOTLB;
> +	msg.req.iova.start = start;
> +	msg.req.iova.last = last;
> +
> +	return vduse_dev_msg_send(dev, &msg, false);
> +}
> +
> +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct vduse_dev *dev = file->private_data;
> +	struct vduse_dev_msg *msg;
> +	int size = sizeof(struct vduse_dev_request);
> +	ssize_t ret;
> +
> +	if (iov_iter_count(to) < size)
> +		return -EINVAL;
> +
> +	spin_lock(&dev->msg_lock);
> +	while (1) {
> +		msg = vduse_dequeue_msg(&dev->send_list);
> +		if (msg)
> +			break;
> +
> +		ret = -EAGAIN;
> +		if (file->f_flags & O_NONBLOCK)
> +			goto unlock;
> +
> +		spin_unlock(&dev->msg_lock);
> +		ret = wait_event_interruptible_exclusive(dev->waitq,
> +					!list_empty(&dev->send_list));
> +		if (ret)
> +			return ret;
> +
> +		spin_lock(&dev->msg_lock);
> +	}
> +	spin_unlock(&dev->msg_lock);
> +	ret = copy_to_iter(&msg->req, size, to);
> +	spin_lock(&dev->msg_lock);
> +	if (ret != size) {
> +		ret = -EFAULT;
> +		vduse_enqueue_msg(&dev->send_list, msg);
> +		goto unlock;
> +	}
> +	if (msg->req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
> +		kfree(msg);
> +	else
> +		vduse_enqueue_msg(&dev->recv_list, msg);
> +unlock:
> +	spin_unlock(&dev->msg_lock);
> +
> +	return ret;
> +}
> +
> +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
> +{
> +	struct file *file = iocb->ki_filp;
> +	struct vduse_dev *dev = file->private_data;
> +	struct vduse_dev_response resp;
> +	struct vduse_dev_msg *msg;
> +	size_t ret;
> +
> +	ret = copy_from_iter(&resp, sizeof(resp), from);
> +	if (ret != sizeof(resp))
> +		return -EINVAL;
> +
> +	spin_lock(&dev->msg_lock);
> +	msg = vduse_find_msg(&dev->recv_list, resp.request_id);
> +	if (!msg) {
> +		ret = -ENOENT;
> +		goto unlock;
> +	}
> +
> +	memcpy(&msg->resp, &resp, sizeof(resp));
> +	msg->completed = 1;
> +	wake_up(&msg->waitq);
> +unlock:
> +	spin_unlock(&dev->msg_lock);
> +
> +	return ret;
> +}
> +
> +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +	__poll_t mask = 0;
> +
> +	poll_wait(file, &dev->waitq, wait);
> +
> +	if (!list_empty(&dev->send_list))
> +		mask |= EPOLLIN | EPOLLRDNORM;
> +	if (!list_empty(&dev->recv_list))
> +		mask |= EPOLLOUT | EPOLLWRNORM;
> +
> +	return mask;
> +}
> +
> +static void vduse_dev_reset(struct vduse_dev *dev)
> +{
> +	int i;
> +	struct vduse_iova_domain *domain = dev->domain;
> +
> +	/* The coherent mappings are handled in vduse_dev_free_coherent() */
> +	if (domain->bounce_map)
> +		vduse_domain_reset_bounce_map(domain);
> +
> +	dev->features = 0;
> +	dev->generation++;
> +	spin_lock(&dev->irq_lock);
> +	dev->config_cb.callback = NULL;
> +	dev->config_cb.private = NULL;
> +	spin_unlock(&dev->irq_lock);
> +
> +	for (i = 0; i < dev->vq_num; i++) {
> +		struct vduse_virtqueue *vq = &dev->vqs[i];
> +
> +		vq->ready = false;
> +		vq->desc_addr = 0;
> +		vq->driver_addr = 0;
> +		vq->device_addr = 0;
> +		vq->avail_idx = 0;
> +		vq->num = 0;
> +
> +		spin_lock(&vq->kick_lock);
> +		vq->kicked = false;
> +		if (vq->kickfd)
> +			eventfd_ctx_put(vq->kickfd);
> +		vq->kickfd = NULL;
> +		spin_unlock(&vq->kick_lock);
> +
> +		spin_lock(&vq->irq_lock);
> +		vq->cb.callback = NULL;
> +		vq->cb.private = NULL;
> +		spin_unlock(&vq->irq_lock);
> +	}
> +}
> +
> +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
> +				u64 desc_area, u64 driver_area,
> +				u64 device_area)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vq->desc_addr = desc_area;
> +	vq->driver_addr = driver_area;
> +	vq->device_addr = device_area;
> +
> +	return 0;
> +}
> +
> +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	spin_lock(&vq->kick_lock);
> +	if (!vq->ready)
> +		goto unlock;
> +
> +	if (vq->kickfd)
> +		eventfd_signal(vq->kickfd, 1);
> +	else
> +		vq->kicked = true;
> +unlock:
> +	spin_unlock(&vq->kick_lock);
> +}
> +
> +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
> +			      struct vdpa_callback *cb)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	spin_lock(&vq->irq_lock);
> +	vq->cb.callback = cb->callback;
> +	vq->cb.private = cb->private;
> +	spin_unlock(&vq->irq_lock);
> +}
> +
> +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vq->num = num;
> +}
> +
> +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
> +					u16 idx, bool ready)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vq->ready = ready;
> +}
> +
> +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	return vq->ready;
> +}
> +
> +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
> +				const struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	vq->avail_idx = state->avail_index;
> +	return 0;
> +}
> +
> +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
> +				struct vdpa_vq_state *state)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	struct vduse_virtqueue *vq = &dev->vqs[idx];
> +
> +	return vduse_dev_get_vq_state(dev, vq, state);
> +}
> +
> +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vq_align;
> +}
> +
> +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->user_features;
> +}
> +
> +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	dev->features = features;
> +	return 0;
> +}
> +
> +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
> +				  struct vdpa_callback *cb)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	spin_lock(&dev->irq_lock);
> +	dev->config_cb.callback = cb->callback;
> +	dev->config_cb.private = cb->private;
> +	spin_unlock(&dev->irq_lock);
> +}
> +
> +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vq_size_max;
> +}
> +
> +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->device_id;
> +}
> +
> +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->vendor_id;
> +}
> +
> +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->status;
> +}
> +
> +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	bool started = !!(status & VIRTIO_CONFIG_S_DRIVER_OK);
> +
> +	dev->status = status;
> +
> +	if (dev->started == started)
> +		return;


If we check dev->status == status (or only check the DRIVER_OK bit),
then there's no need to introduce an extra dev->started.
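
As a sketch of the DRIVER_OK variant of this idea: compare the bit in the old
and new status and act only on a transition. The sim_* names are hypothetical,
the returned action stands in for the start/stop calls in the quoted code, and
SIM_S_DRIVER_OK uses the value 4 that the virtio specification assigns to
DRIVER_OK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* DRIVER_OK status bit as defined by the virtio specification. */
#define SIM_S_DRIVER_OK 4

enum sim_action { SIM_NONE, SIM_START, SIM_STOP };

struct sim_dev {
	uint8_t status;   /* no separate "started" field */
};

/* Store the new status and report whether DRIVER_OK was set or cleared. */
static enum sim_action sim_set_status(struct sim_dev *dev, uint8_t status)
{
	bool was_ok = dev->status & SIM_S_DRIVER_OK;
	bool now_ok = status & SIM_S_DRIVER_OK;

	dev->status = status;

	if (was_ok == now_ok)
		return SIM_NONE;

	return now_ok ? SIM_START : SIM_STOP;
}
```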


> +
> +	dev->started = started;
> +	if (dev->started) {
> +		vduse_dev_start_dataplane(dev);
> +	} else {
> +		vduse_dev_reset(dev);
> +		vduse_dev_stop_dataplane(dev);


I wonder if no_reply works for the vhost-vDPA case. For virtio-vDPA,
we have bounce buffers, so it's harmless if the userspace dataplane keeps
performing reads/writes. For vhost-vDPA we don't have anything like that.


> +	}
> +}
> +
> +static size_t vduse_vdpa_get_config_size(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->config_size;
> +}
> +
> +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
> +				  void *buf, unsigned int len)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	memcpy(buf, dev->config + offset, len);
> +}
> +
> +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
> +			const void *buf, unsigned int len)
> +{
> +	/* Now we only support read-only configuration space */
> +}
> +
> +static u32 vduse_vdpa_get_generation(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	return dev->generation;
> +}
> +
> +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
> +				struct vhost_iotlb *iotlb)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +	int ret;
> +
> +	ret = vduse_domain_set_map(dev->domain, iotlb);
> +	if (ret)
> +		return ret;
> +
> +	ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
> +	if (ret) {
> +		vduse_domain_clear_map(dev->domain, iotlb);
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static void vduse_vdpa_free(struct vdpa_device *vdpa)
> +{
> +	struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +
> +	dev->vdev = NULL;
> +}
> +
> +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
> +	.set_vq_address		= vduse_vdpa_set_vq_address,
> +	.kick_vq		= vduse_vdpa_kick_vq,
> +	.set_vq_cb		= vduse_vdpa_set_vq_cb,
> +	.set_vq_num             = vduse_vdpa_set_vq_num,
> +	.set_vq_ready		= vduse_vdpa_set_vq_ready,
> +	.get_vq_ready		= vduse_vdpa_get_vq_ready,
> +	.set_vq_state		= vduse_vdpa_set_vq_state,
> +	.get_vq_state		= vduse_vdpa_get_vq_state,
> +	.get_vq_align		= vduse_vdpa_get_vq_align,
> +	.get_features		= vduse_vdpa_get_features,
> +	.set_features		= vduse_vdpa_set_features,
> +	.set_config_cb		= vduse_vdpa_set_config_cb,
> +	.get_vq_num_max		= vduse_vdpa_get_vq_num_max,
> +	.get_device_id		= vduse_vdpa_get_device_id,
> +	.get_vendor_id		= vduse_vdpa_get_vendor_id,
> +	.get_status		= vduse_vdpa_get_status,
> +	.set_status		= vduse_vdpa_set_status,
> +	.get_config_size	= vduse_vdpa_get_config_size,
> +	.get_config		= vduse_vdpa_get_config,
> +	.set_config		= vduse_vdpa_set_config,
> +	.get_generation		= vduse_vdpa_get_generation,
> +	.set_map		= vduse_vdpa_set_map,
> +	.free			= vduse_vdpa_free,
> +};
> +
> +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
> +				     unsigned long offset, size_t size,
> +				     enum dma_data_direction dir,
> +				     unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +
> +	return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
> +}
> +
> +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
> +				size_t size, enum dma_data_direction dir,
> +				unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +
> +	return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
> +}
> +
> +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
> +					dma_addr_t *dma_addr, gfp_t flag,
> +					unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +	unsigned long iova;
> +	void *addr;
> +
> +	*dma_addr = DMA_MAPPING_ERROR;
> +	addr = vduse_domain_alloc_coherent(domain, size,
> +				(dma_addr_t *)&iova, flag, attrs);
> +	if (!addr)
> +		return NULL;
> +
> +	*dma_addr = (dma_addr_t)iova;
> +
> +	return addr;
> +}
> +
> +static void vduse_dev_free_coherent(struct device *dev, size_t size,
> +					void *vaddr, dma_addr_t dma_addr,
> +					unsigned long attrs)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +
> +	vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
> +}
> +
> +static size_t vduse_dev_max_mapping_size(struct device *dev)
> +{
> +	struct vduse_dev *vdev = dev_to_vduse(dev);
> +	struct vduse_iova_domain *domain = vdev->domain;
> +
> +	return domain->bounce_size;
> +}
> +
> +static const struct dma_map_ops vduse_dev_dma_ops = {
> +	.map_page = vduse_dev_map_page,
> +	.unmap_page = vduse_dev_unmap_page,
> +	.alloc = vduse_dev_alloc_coherent,
> +	.free = vduse_dev_free_coherent,
> +	.max_mapping_size = vduse_dev_max_mapping_size,
> +};
> +
> +static unsigned int perm_to_file_flags(u8 perm)
> +{
> +	unsigned int flags = 0;
> +
> +	switch (perm) {
> +	case VDUSE_ACCESS_WO:
> +		flags |= O_WRONLY;
> +		break;
> +	case VDUSE_ACCESS_RO:
> +		flags |= O_RDONLY;
> +		break;
> +	case VDUSE_ACCESS_RW:
> +		flags |= O_RDWR;
> +		break;
> +	default:
> +		WARN(1, "invalid vhost IOTLB permission\n");
> +		break;
> +	}
> +
> +	return flags;
> +}
> +
> +static int vduse_kickfd_setup(struct vduse_dev *dev,
> +			struct vduse_vq_eventfd *eventfd)
> +{
> +	struct eventfd_ctx *ctx = NULL;
> +	struct vduse_virtqueue *vq;
> +	u32 index;
> +
> +	if (eventfd->index >= dev->vq_num)
> +		return -EINVAL;
> +
> +	index = array_index_nospec(eventfd->index, dev->vq_num);
> +	vq = &dev->vqs[index];
> +	if (eventfd->fd >= 0) {
> +		ctx = eventfd_ctx_fdget(eventfd->fd);
> +		if (IS_ERR(ctx))
> +			return PTR_ERR(ctx);
> +	} else if (eventfd->fd != VDUSE_EVENTFD_DEASSIGN)
> +		return 0;
> +
> +	spin_lock(&vq->kick_lock);
> +	if (vq->kickfd)
> +		eventfd_ctx_put(vq->kickfd);
> +	vq->kickfd = ctx;
> +	if (vq->ready && vq->kicked && vq->kickfd) {
> +		eventfd_signal(vq->kickfd, 1);
> +		vq->kicked = false;
> +	}
> +	spin_unlock(&vq->kick_lock);
> +
> +	return 0;
> +}
> +
> +static void vduse_dev_irq_inject(struct work_struct *work)
> +{
> +	struct vduse_dev *dev = container_of(work, struct vduse_dev, inject);
> +
> +	spin_lock_irq(&dev->irq_lock);
> +	if (dev->config_cb.callback)
> +		dev->config_cb.callback(dev->config_cb.private);
> +	spin_unlock_irq(&dev->irq_lock);
> +}
> +
> +static void vduse_vq_irq_inject(struct work_struct *work)
> +{
> +	struct vduse_virtqueue *vq = container_of(work,
> +					struct vduse_virtqueue, inject);
> +
> +	spin_lock_irq(&vq->irq_lock);
> +	if (vq->ready && vq->cb.callback)
> +		vq->cb.callback(vq->cb.private);
> +	spin_unlock_irq(&vq->irq_lock);
> +}
> +
> +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
> +			    unsigned long arg)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +	void __user *argp = (void __user *)arg;
> +	int ret;
> +
> +	switch (cmd) {
> +	case VDUSE_IOTLB_GET_FD: {
> +		struct vduse_iotlb_entry entry;
> +		struct vhost_iotlb_map *map;
> +		struct vdpa_map_file *map_file;
> +		struct vduse_iova_domain *domain = dev->domain;
> +		struct file *f = NULL;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&entry, argp, sizeof(entry)))
> +			break;
> +
> +		ret = -EINVAL;
> +		if (entry.start > entry.last)
> +			break;
> +
> +		spin_lock(&domain->iotlb_lock);
> +		map = vhost_iotlb_itree_first(domain->iotlb,
> +					      entry.start, entry.last);
> +		if (map) {
> +			map_file = (struct vdpa_map_file *)map->opaque;
> +			f = get_file(map_file->file);
> +			entry.offset = map_file->offset;
> +			entry.start = map->start;
> +			entry.last = map->last;
> +			entry.perm = map->perm;
> +		}
> +		spin_unlock(&domain->iotlb_lock);
> +		ret = -EINVAL;
> +		if (!f)
> +			break;
> +
> +		ret = -EFAULT;
> +		if (copy_to_user(argp, &entry, sizeof(entry))) {
> +			fput(f);
> +			break;
> +		}
> +		ret = receive_fd(f, perm_to_file_flags(entry.perm));
> +		fput(f);
> +		break;
> +	}
> +	case VDUSE_DEV_GET_FEATURES:
> +		ret = put_user(dev->features, (u64 __user *)argp);
> +		break;
> +	case VDUSE_DEV_UPDATE_CONFIG: {
> +		struct vduse_config_update config;
> +		unsigned long size = offsetof(struct vduse_config_update,
> +					      buffer);
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&config, argp, size))
> +			break;
> +
> +		ret = -EINVAL;
> +		if (config.length == 0 ||
> +		    config.length > dev->config_size - config.offset)
> +			break;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(dev->config + config.offset, argp + size,
> +				   config.length))
> +			break;
> +
> +		ret = 0;
> +		queue_work(vduse_irq_wq, &dev->inject);


I wonder if it's better to separate the config interrupt from the config
update; otherwise we need to document this.
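
As an illustration of the split, here is a minimal userspace model in
which updating the config space and raising the config interrupt are two
separate operations. All names here (sketch_dev, sketch_update_config,
sketch_inject_config_irq) are hypothetical, and the counter only stands in
for queueing the inject work; this is a sketch of the idea, not a proposed
uAPI.

```c
#include <errno.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of a device with a small config space. */
struct sketch_dev {
	uint8_t config[64];
	uint32_t config_size;		/* assumed <= sizeof(config) */
	unsigned int config_irqs;	/* stands in for queued config interrupts */
};

/* Only copies data; does not notify the driver. */
static int sketch_update_config(struct sketch_dev *dev, uint32_t offset,
				const void *buf, uint32_t len)
{
	/* Check offset first so the subtraction below cannot wrap. */
	if (len == 0 || offset > dev->config_size ||
	    len > dev->config_size - offset)
		return -EINVAL;

	memcpy(dev->config + offset, buf, len);
	return 0;
}

/* Separate step: userspace decides when the driver gets notified. */
static void sketch_inject_config_irq(struct sketch_dev *dev)
{
	dev->config_irqs++;
}
```

With this shape userspace could batch several updates and raise a single
config interrupt afterwards.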


> +		break;
> +	}
> +	case VDUSE_VQ_GET_INFO: {


Do we need to limit this to when DRIVER_OK is set?


> +		struct vduse_vq_info vq_info;
> +		u32 vq_index;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&vq_info, argp, sizeof(vq_info)))
> +			break;
> +
> +		ret = -EINVAL;
> +		if (vq_info.index >= dev->vq_num)
> +			break;
> +
> +		vq_index = array_index_nospec(vq_info.index, dev->vq_num);
> +		vq_info.desc_addr = dev->vqs[vq_index].desc_addr;
> +		vq_info.driver_addr = dev->vqs[vq_index].driver_addr;
> +		vq_info.device_addr = dev->vqs[vq_index].device_addr;
> +		vq_info.num = dev->vqs[vq_index].num;
> +		vq_info.avail_idx = dev->vqs[vq_index].avail_idx;
> +		vq_info.ready = dev->vqs[vq_index].ready;
> +
> +		ret = -EFAULT;
> +		if (copy_to_user(argp, &vq_info, sizeof(vq_info)))
> +			break;
> +
> +		ret = 0;
> +		break;
> +	}
> +	case VDUSE_VQ_SETUP_KICKFD: {
> +		struct vduse_vq_eventfd eventfd;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
> +			break;
> +
> +		ret = vduse_kickfd_setup(dev, &eventfd);
> +		break;
> +	}
> +	case VDUSE_VQ_INJECT_IRQ: {
> +		u32 vq_index;
> +
> +		ret = -EFAULT;
> +		if (get_user(vq_index, (u32 __user *)argp))
> +			break;
> +
> +		ret = -EINVAL;
> +		if (vq_index >= dev->vq_num)
> +			break;
> +
> +		ret = 0;
> +		vq_index = array_index_nospec(vq_index, dev->vq_num);
> +		queue_work(vduse_irq_wq, &dev->vqs[vq_index].inject);
> +		break;
> +	}
> +	default:
> +		ret = -ENOIOCTLCMD;
> +		break;
> +	}
> +
> +	return ret;
> +}
> +
> +static int vduse_dev_release(struct inode *inode, struct file *file)
> +{
> +	struct vduse_dev *dev = file->private_data;
> +
> +	spin_lock(&dev->msg_lock);
> +	/* Make sure the inflight messages can be processed after reconnection */
> +	list_splice_init(&dev->recv_list, &dev->send_list);
> +	spin_unlock(&dev->msg_lock);
> +	dev->connected = false;
> +
> +	return 0;
> +}
> +
> +static struct vduse_dev *vduse_dev_get_from_minor(int minor)
> +{
> +	struct vduse_dev *dev;
> +
> +	mutex_lock(&vduse_lock);
> +	dev = idr_find(&vduse_idr, minor);
> +	mutex_unlock(&vduse_lock);
> +
> +	return dev;
> +}
> +
> +static int vduse_dev_open(struct inode *inode, struct file *file)
> +{
> +	int ret;
> +	struct vduse_dev *dev = vduse_dev_get_from_minor(iminor(inode));
> +
> +	if (!dev)
> +		return -ENODEV;
> +
> +	ret = -EBUSY;
> +	mutex_lock(&dev->lock);
> +	if (dev->connected)
> +		goto unlock;
> +
> +	ret = 0;
> +	dev->connected = true;
> +	file->private_data = dev;
> +unlock:
> +	mutex_unlock(&dev->lock);
> +
> +	return ret;
> +}
> +
> +static const struct file_operations vduse_dev_fops = {
> +	.owner		= THIS_MODULE,
> +	.open		= vduse_dev_open,
> +	.release	= vduse_dev_release,
> +	.read_iter	= vduse_dev_read_iter,
> +	.write_iter	= vduse_dev_write_iter,
> +	.poll		= vduse_dev_poll,
> +	.unlocked_ioctl	= vduse_dev_ioctl,
> +	.compat_ioctl	= compat_ptr_ioctl,
> +	.llseek		= noop_llseek,
> +};
> +
> +static struct vduse_dev *vduse_dev_create(void)
> +{
> +	struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
> +
> +	if (!dev)
> +		return NULL;
> +
> +	mutex_init(&dev->lock);
> +	spin_lock_init(&dev->msg_lock);
> +	INIT_LIST_HEAD(&dev->send_list);
> +	INIT_LIST_HEAD(&dev->recv_list);
> +	spin_lock_init(&dev->irq_lock);
> +
> +	INIT_WORK(&dev->inject, vduse_dev_irq_inject);
> +	init_waitqueue_head(&dev->waitq);
> +
> +	return dev;
> +}
> +
> +static void vduse_dev_destroy(struct vduse_dev *dev)
> +{
> +	kfree(dev);
> +}
> +
> +static struct vduse_dev *vduse_find_dev(const char *name)
> +{
> +	struct vduse_dev *dev;
> +	int id;
> +
> +	idr_for_each_entry(&vduse_idr, dev, id)
> +		if (!strcmp(dev->name, name))
> +			return dev;
> +
> +	return NULL;
> +}
> +
> +static int vduse_destroy_dev(char *name)
> +{
> +	struct vduse_dev *dev = vduse_find_dev(name);
> +
> +	if (!dev)
> +		return -EINVAL;
> +
> +	mutex_lock(&dev->lock);
> +	if (dev->vdev || dev->connected) {
> +		mutex_unlock(&dev->lock);
> +		return -EBUSY;
> +	}
> +	dev->connected = true;
> +	mutex_unlock(&dev->lock);
> +
> +	vduse_dev_msg_cleanup(dev);
> +	device_destroy(vduse_class, MKDEV(MAJOR(vduse_major), dev->minor));
> +	idr_remove(&vduse_idr, dev->minor);
> +	kvfree(dev->config);
> +	kfree(dev->vqs);
> +	vduse_domain_destroy(dev->domain);
> +	kfree(dev->name);
> +	vduse_dev_destroy(dev);
> +	module_put(THIS_MODULE);
> +
> +	return 0;
> +}
> +
> +static bool device_is_allowed(u32 device_id)
> +{
> +	int i;
> +
> +	for (i = 0; i < ARRAY_SIZE(allowed_device_id); i++)
> +		if (allowed_device_id[i] == device_id)
> +			return true;
> +
> +	return false;
> +}
> +
> +static bool features_is_valid(u64 features)
> +{
> +	if (!(features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
> +		return false;
> +
> +	/* Now we only support read-only configuration space */
> +	if (features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE))
> +		return false;
> +
> +	return true;
> +}
> +
> +static bool vduse_validate_config(struct vduse_dev_config *config)
> +{
> +	if (config->bounce_size > VDUSE_MAX_BOUNCE_SIZE)
> +		return false;
> +
> +	if (config->vq_align > PAGE_SIZE)
> +		return false;
> +
> +	if (config->config_size > PAGE_SIZE)
> +		return false;
> +
> +	if (!device_is_allowed(config->device_id))
> +		return false;
> +
> +	if (!features_is_valid(config->features))
> +		return false;


Do we need to validate whether config_size is too small? Otherwise
we may have an OOB access in get_config().
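
The quoted get_config() copies len bytes from dev->config + offset without
checking them against config_size. A minimal sketch of the kind of range
check that would cover this, written so the subtraction cannot wrap; the
helper name sketch_config_range_ok is hypothetical and not from the patch.
The quoted VDUSE_DEV_UPDATE_CONFIG check uses a similar subtraction without
first comparing the offset, so it may be worth the same treatment if those
fields are unsigned.

```c
#include <stdbool.h>
#include <stddef.h>

/*
 * Return true if [offset, offset + len) lies inside a config space of
 * config_size bytes. The offset is checked first so that
 * config_size - offset never underflows.
 */
static bool sketch_config_range_ok(size_t config_size, unsigned int offset,
				   unsigned int len)
{
	if (offset > config_size)
		return false;

	return len <= config_size - offset;
}
```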


> +
> +	return true;
> +}
> +
> +static int vduse_create_dev(struct vduse_dev_config *config,
> +			    void *config_buf, u64 api_version)
> +{
> +	int i, ret;
> +	struct vduse_dev *dev;
> +
> +	ret = -EEXIST;
> +	if (vduse_find_dev(config->name))
> +		goto err;
> +
> +	ret = -ENOMEM;
> +	dev = vduse_dev_create();
> +	if (!dev)
> +		goto err;
> +
> +	dev->api_version = api_version;
> +	dev->user_features = config->features;
> +	dev->device_id = config->device_id;
> +	dev->vendor_id = config->vendor_id;
> +	dev->name = kstrdup(config->name, GFP_KERNEL);
> +	if (!dev->name)
> +		goto err_str;
> +
> +	dev->domain = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
> +					  config->bounce_size);
> +	if (!dev->domain)
> +		goto err_domain;
> +
> +	dev->config = config_buf;
> +	dev->config_size = config->config_size;
> +	dev->vq_align = config->vq_align;
> +	dev->vq_size_max = config->vq_size_max;
> +	dev->vq_num = config->vq_num;
> +	dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
> +	if (!dev->vqs)
> +		goto err_vqs;
> +
> +	for (i = 0; i < dev->vq_num; i++) {
> +		dev->vqs[i].index = i;
> +		INIT_WORK(&dev->vqs[i].inject, vduse_vq_irq_inject);
> +		spin_lock_init(&dev->vqs[i].kick_lock);
> +		spin_lock_init(&dev->vqs[i].irq_lock);
> +	}
> +
> +	ret = idr_alloc(&vduse_idr, dev, 1, VDUSE_DEV_MAX, GFP_KERNEL);
> +	if (ret < 0)
> +		goto err_idr;
> +
> +	dev->minor = ret;
> +	dev->dev = device_create(vduse_class, NULL,
> +				 MKDEV(MAJOR(vduse_major), dev->minor),
> +				 NULL, "%s", config->name);
> +	if (IS_ERR(dev->dev)) {
> +		ret = PTR_ERR(dev->dev);
> +		goto err_dev;
> +	}
> +	__module_get(THIS_MODULE);
> +
> +	return 0;
> +err_dev:
> +	idr_remove(&vduse_idr, dev->minor);
> +err_idr:
> +	kfree(dev->vqs);
> +err_vqs:
> +	vduse_domain_destroy(dev->domain);
> +err_domain:
> +	kfree(dev->name);
> +err_str:
> +	vduse_dev_destroy(dev);
> +err:
> +	kvfree(config_buf);
> +	return ret;
> +}
> +
> +static long vduse_ioctl(struct file *file, unsigned int cmd,
> +			unsigned long arg)
> +{
> +	int ret;
> +	void __user *argp = (void __user *)arg;
> +	struct vduse_control *control = file->private_data;
> +
> +	mutex_lock(&vduse_lock);
> +	switch (cmd) {
> +	case VDUSE_GET_API_VERSION:
> +		ret = put_user(control->api_version, (u64 __user *)argp);
> +		break;
> +	case VDUSE_SET_API_VERSION: {
> +		u64 api_version;
> +
> +		ret = -EFAULT;
> +		if (get_user(api_version, (u64 __user *)argp))
> +			break;
> +
> +		ret = -EINVAL;
> +		if (api_version > VDUSE_API_VERSION)
> +			break;
> +
> +		ret = 0;
> +		control->api_version = api_version;
> +		break;
> +	}
> +	case VDUSE_CREATE_DEV: {
> +		struct vduse_dev_config config;
> +		unsigned long size = offsetof(struct vduse_dev_config, config);
> +		void *buf;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&config, argp, size))
> +			break;
> +
> +		ret = -EINVAL;
> +		if (vduse_validate_config(&config) == false)
> +			break;
> +
> +		buf = vmemdup_user(argp + size, config.config_size);
> +		if (IS_ERR(buf)) {
> +			ret = PTR_ERR(buf);
> +			break;
> +		}
> +		ret = vduse_create_dev(&config, buf, control->api_version);
> +		break;
> +	}
> +	case VDUSE_DESTROY_DEV: {
> +		char name[VDUSE_NAME_MAX];
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(name, argp, VDUSE_NAME_MAX))
> +			break;
> +
> +		ret = vduse_destroy_dev(name);
> +		break;
> +	}
> +	default:
> +		ret = -EINVAL;
> +		break;
> +	}
> +	mutex_unlock(&vduse_lock);
> +
> +	return ret;
> +}
> +
> +static int vduse_release(struct inode *inode, struct file *file)
> +{
> +	struct vduse_control *control = file->private_data;
> +
> +	kfree(control);
> +	return 0;
> +}
> +
> +static int vduse_open(struct inode *inode, struct file *file)
> +{
> +	struct vduse_control *control;
> +
> +	control = kmalloc(sizeof(struct vduse_control), GFP_KERNEL);
> +	if (!control)
> +		return -ENOMEM;
> +
> +	control->api_version = VDUSE_API_VERSION;
> +	file->private_data = control;
> +
> +	return 0;
> +}
> +
> +static const struct file_operations vduse_ctrl_fops = {
> +	.owner		= THIS_MODULE,
> +	.open		= vduse_open,
> +	.release	= vduse_release,
> +	.unlocked_ioctl	= vduse_ioctl,
> +	.compat_ioctl	= compat_ptr_ioctl,
> +	.llseek		= noop_llseek,
> +};
> +
> +static char *vduse_devnode(struct device *dev, umode_t *mode)
> +{
> +	return kasprintf(GFP_KERNEL, "vduse/%s", dev_name(dev));
> +}
> +
> +static void vduse_mgmtdev_release(struct device *dev)
> +{
> +}
> +
> +static struct device vduse_mgmtdev = {
> +	.init_name = "vduse",
> +	.release = vduse_mgmtdev_release,
> +};
> +
> +static struct vdpa_mgmt_dev mgmt_dev;
> +
> +static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
> +{
> +	struct vduse_vdpa *vdev;
> +	int ret;
> +
> +	if (dev->vdev)
> +		return -EEXIST;
> +
> +	vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
> +				 &vduse_vdpa_config_ops, name, true);
> +	if (!vdev)
> +		return -ENOMEM;
> +
> +	dev->vdev = vdev;
> +	vdev->dev = dev;
> +	vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
> +	ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
> +	if (ret) {
> +		put_device(&vdev->vdpa.dev);
> +		return ret;
> +	}
> +	set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
> +	vdev->vdpa.dma_dev = &vdev->vdpa.dev;
> +	vdev->vdpa.mdev = &mgmt_dev;
> +
> +	return 0;
> +}
> +
> +static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name)
> +{
> +	struct vduse_dev *dev;
> +	int ret;
> +
> +	mutex_lock(&vduse_lock);
> +	dev = vduse_find_dev(name);
> +	if (!dev) {
> +		mutex_unlock(&vduse_lock);
> +		return -EINVAL;
> +	}
> +	ret = vduse_dev_init_vdpa(dev, name);
> +	mutex_unlock(&vduse_lock);
> +	if (ret)
> +		return ret;
> +
> +	ret = _vdpa_register_device(&dev->vdev->vdpa, dev->vq_num);
> +	if (ret) {
> +		put_device(&dev->vdev->vdpa.dev);
> +		return ret;
> +	}
> +
> +	return 0;
> +}
> +
> +static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
> +{
> +	_vdpa_unregister_device(dev);
> +}
> +
> +static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops = {
> +	.dev_add = vdpa_dev_add,
> +	.dev_del = vdpa_dev_del,
> +};
> +
> +static struct virtio_device_id id_table[] = {
> +	{ VIRTIO_ID_BLOCK, VIRTIO_DEV_ANY_ID },
> +	{ 0 },
> +};
> +
> +static struct vdpa_mgmt_dev mgmt_dev = {
> +	.device = &vduse_mgmtdev,
> +	.id_table = id_table,
> +	.ops = &vdpa_dev_mgmtdev_ops,
> +};
> +
> +static int vduse_mgmtdev_init(void)
> +{
> +	int ret;
> +
> +	ret = device_register(&vduse_mgmtdev);
> +	if (ret)
> +		return ret;
> +
> +	ret = vdpa_mgmtdev_register(&mgmt_dev);
> +	if (ret)
> +		goto err;
> +
> +	return 0;
> +err:
> +	device_unregister(&vduse_mgmtdev);
> +	return ret;
> +}
> +
> +static void vduse_mgmtdev_exit(void)
> +{
> +	vdpa_mgmtdev_unregister(&mgmt_dev);
> +	device_unregister(&vduse_mgmtdev);
> +}
> +
> +static int vduse_init(void)
> +{
> +	int ret;
> +	struct device *dev;
> +
> +	vduse_class = class_create(THIS_MODULE, "vduse");
> +	if (IS_ERR(vduse_class))
> +		return PTR_ERR(vduse_class);
> +
> +	vduse_class->devnode = vduse_devnode;
> +
> +	ret = alloc_chrdev_region(&vduse_major, 0, VDUSE_DEV_MAX, "vduse");
> +	if (ret)
> +		goto err_chardev_region;
> +
> +	/* /dev/vduse/control */
> +	cdev_init(&vduse_ctrl_cdev, &vduse_ctrl_fops);
> +	vduse_ctrl_cdev.owner = THIS_MODULE;
> +	ret = cdev_add(&vduse_ctrl_cdev, vduse_major, 1);
> +	if (ret)
> +		goto err_ctrl_cdev;
> +
> +	dev = device_create(vduse_class, NULL, vduse_major, NULL, "control");
> +	if (IS_ERR(dev)) {
> +		ret = PTR_ERR(dev);
> +		goto err_device;
> +	}
> +
> +	/* /dev/vduse/$DEVICE */
> +	cdev_init(&vduse_cdev, &vduse_dev_fops);
> +	vduse_cdev.owner = THIS_MODULE;
> +	ret = cdev_add(&vduse_cdev, MKDEV(MAJOR(vduse_major), 1),
> +		       VDUSE_DEV_MAX - 1);
> +	if (ret)
> +		goto err_cdev;
> +
> +	vduse_irq_wq = alloc_workqueue("vduse-irq",
> +				WQ_HIGHPRI | WQ_SYSFS | WQ_UNBOUND, 0);
> +	if (!vduse_irq_wq)
> +		goto err_wq;
> +
> +	ret = vduse_domain_init();
> +	if (ret)
> +		goto err_domain;
> +
> +	ret = vduse_mgmtdev_init();
> +	if (ret)
> +		goto err_mgmtdev;
> +
> +	return 0;
> +err_mgmtdev:
> +	vduse_domain_exit();
> +err_domain:
> +	destroy_workqueue(vduse_irq_wq);
> +err_wq:
> +	cdev_del(&vduse_cdev);
> +err_cdev:
> +	device_destroy(vduse_class, vduse_major);
> +err_device:
> +	cdev_del(&vduse_ctrl_cdev);
> +err_ctrl_cdev:
> +	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> +err_chardev_region:
> +	class_destroy(vduse_class);
> +	return ret;
> +}
> +module_init(vduse_init);
> +
> +static void vduse_exit(void)
> +{
> +	vduse_mgmtdev_exit();
> +	vduse_domain_exit();
> +	destroy_workqueue(vduse_irq_wq);
> +	cdev_del(&vduse_cdev);
> +	device_destroy(vduse_class, vduse_major);
> +	cdev_del(&vduse_ctrl_cdev);
> +	unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
> +	class_destroy(vduse_class);
> +}
> +module_exit(vduse_exit);
> +
> +MODULE_LICENSE(DRV_LICENSE);
> +MODULE_AUTHOR(DRV_AUTHOR);
> +MODULE_DESCRIPTION(DRV_DESC);
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> new file mode 100644
> index 000000000000..f21b2e51b5c8
> --- /dev/null
> +++ b/include/uapi/linux/vduse.h
> @@ -0,0 +1,143 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_VDUSE_H_
> +#define _UAPI_VDUSE_H_
> +
> +#include <linux/types.h>
> +
> +#define VDUSE_API_VERSION	0
> +
> +#define VDUSE_NAME_MAX	256
> +
> +/* the control messages definition for read/write */
> +
> +enum vduse_req_type {
> +	/* Get the state for virtqueue from userspace */
> +	VDUSE_GET_VQ_STATE,
> +	/* Notify userspace to start the dataplane, no reply */
> +	VDUSE_START_DATAPLANE,
> +	/* Notify userspace to stop the dataplane, no reply */
> +	VDUSE_STOP_DATAPLANE,
> +	/* Notify userspace to update the memory mapping in device IOTLB */
> +	VDUSE_UPDATE_IOTLB,
> +};
> +
> +struct vduse_vq_state {
> +	__u32 index; /* virtqueue index */
> +	__u32 avail_idx; /* virtqueue state (last_avail_idx) */
> +};


This needs some tweaks to support packed virtqueue.
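
For reference, a packed virtqueue's state is more than a single avail
index: the device tracks a last avail index and a last used index, each
with a one-bit wrap counter. A rough sketch of what a state structure
covering both layouts could look like follows. The struct and field names
are hypothetical, not a proposed uAPI, and the helper only illustrates how
an index and its wrap counter move together.

```c
#include <stdint.h>

/* Hypothetical split-ring state: one index is enough. */
struct sketch_vq_state_split {
	uint16_t avail_index;
};

/* Hypothetical packed-ring state: two indexes, each with a wrap counter. */
struct sketch_vq_state_packed {
	uint16_t last_avail_counter;	/* 0 or 1 */
	uint16_t last_avail_idx;
	uint16_t last_used_counter;	/* 0 or 1 */
	uint16_t last_used_idx;
};

struct sketch_vq_state {
	uint32_t index;			/* virtqueue index */
	union {
		struct sketch_vq_state_split split;
		struct sketch_vq_state_packed packed;
	};
};

/*
 * Advance a packed-ring index by n descriptors, flipping the wrap
 * counter when the index passes the end of the ring.
 * Assumes *idx < vq_size and n <= vq_size.
 */
static void sketch_packed_advance(uint16_t *idx, uint16_t *counter,
				  uint16_t n, uint16_t vq_size)
{
	uint32_t next = (uint32_t)*idx + n;

	if (next >= vq_size) {
		next -= vq_size;
		*counter ^= 1;
	}
	*idx = (uint16_t)next;
}
```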


> +
> +struct vduse_iova_range {
> +	__u64 start; /* start of the IOVA range */
> +	__u64 last; /* end of the IOVA range */
> +};
> +
> +struct vduse_dev_request {
> +	__u32 type; /* request type */
> +	__u32 request_id; /* request id */
> +#define VDUSE_REQ_FLAGS_NO_REPLY	(1 << 0) /* No need to reply */
> +	__u32 flags; /* request flags */
> +	__u32 reserved; /* for future use */
> +	union {
> +		struct vduse_vq_state vq_state; /* virtqueue state */
> +		struct vduse_iova_range iova; /* iova range for updating */
> +		__u32 padding[16]; /* padding */
> +	};
> +};
> +
> +struct vduse_dev_response {
> +	__u32 request_id; /* corresponding request id */
> +#define VDUSE_REQ_RESULT_OK	0x00
> +#define VDUSE_REQ_RESULT_FAILED	0x01
> +	__u32 result; /* the result of request */
> +	__u32 reserved[2]; /* for future use */
> +	union {
> +		struct vduse_vq_state vq_state; /* virtqueue state */
> +		__u32 padding[16]; /* padding */
> +	};
> +};
> +
> +/* ioctls */
> +
> +struct vduse_dev_config {
> +	char name[VDUSE_NAME_MAX]; /* vduse device name */
> +	__u32 vendor_id; /* virtio vendor id */
> +	__u32 device_id; /* virtio device id */
> +	__u64 features; /* device features */
> +	__u64 bounce_size; /* bounce buffer size for iommu */
> +	__u16 vq_size_max; /* the max size of virtqueue */
> +	__u16 padding; /* padding */
> +	__u32 vq_num; /* the number of virtqueues */
> +	__u32 vq_align; /* the allocation alignment of virtqueue's metadata */
> +	__u32 config_size; /* the size of the configuration space */
> +	__u32 reserved[15]; /* for future use */
> +	__u8 config[0]; /* the buffer of the configuration space */
> +};
> +
> +struct vduse_iotlb_entry {
> +	__u64 offset; /* the mmap offset on fd */
> +	__u64 start; /* start of the IOVA range */
> +	__u64 last; /* last of the IOVA range */
> +#define VDUSE_ACCESS_RO 0x1
> +#define VDUSE_ACCESS_WO 0x2
> +#define VDUSE_ACCESS_RW 0x3
> +	__u8 perm; /* access permission of this range */
> +};
> +
> +struct vduse_config_update {
> +	__u32 offset; /* offset from the beginning of configuration space */
> +	__u32 length; /* the length to write to configuration space */
> +	__u8 buffer[0]; /* buffer used to write from */
> +};
> +
> +struct vduse_vq_info {
> +	__u32 index; /* virtqueue index */
> +	__u32 avail_idx; /* virtqueue state (last_avail_idx) */
> +	__u64 desc_addr; /* address of desc area */
> +	__u64 driver_addr; /* address of driver area */
> +	__u64 device_addr; /* address of device area */
> +	__u32 num; /* the size of virtqueue */
> +	__u8 ready; /* ready status of virtqueue */
> +};
> +
> +struct vduse_vq_eventfd {
> +	__u32 index; /* virtqueue index */
> +#define VDUSE_EVENTFD_DEASSIGN -1
> +	int fd; /* eventfd, -1 means de-assigning the eventfd */
> +};
> +
> +#define VDUSE_BASE	0x81
> +
> +/* Get the version of VDUSE API. This is used for future extension */
> +#define VDUSE_GET_API_VERSION	_IOR(VDUSE_BASE, 0x00, __u64)
> +
> +/* Set the version of VDUSE API. */
> +#define VDUSE_SET_API_VERSION	_IOW(VDUSE_BASE, 0x01, __u64)
> +
> +/* Create a vduse device which is represented by a char device (/dev/vduse/<name>) */
> +#define VDUSE_CREATE_DEV	_IOW(VDUSE_BASE, 0x02, struct vduse_dev_config)
> +
> +/* Destroy a vduse device. Make sure there are no references to the char device */
> +#define VDUSE_DESTROY_DEV	_IOW(VDUSE_BASE, 0x03, char[VDUSE_NAME_MAX])
> +
> +/*
> + * Get a file descriptor for the first overlapped iova region,
> + * -EINVAL means the iova region doesn't exist.
> + */
> +#define VDUSE_IOTLB_GET_FD	_IOWR(VDUSE_BASE, 0x04, struct vduse_iotlb_entry)
> +
> +/* Get the negotiated features */
> +#define VDUSE_DEV_GET_FEATURES	_IOR(VDUSE_BASE, 0x05, __u64)
> +
> +/* Update the configuration space */
> +#define VDUSE_DEV_UPDATE_CONFIG	_IOW(VDUSE_BASE, 0x06, struct vduse_config_update)
> +
> +/* Get the specified virtqueue's information */
> +#define VDUSE_VQ_GET_INFO	_IOWR(VDUSE_BASE, 0x07, struct vduse_vq_info)
> +
> +/* Setup an eventfd to receive kick for virtqueue */
> +#define VDUSE_VQ_SETUP_KICKFD	_IOW(VDUSE_BASE, 0x08, struct vduse_vq_eventfd)
> +
> +/* Inject an interrupt for specific virtqueue */
> +#define VDUSE_VQ_INJECT_IRQ	_IOW(VDUSE_BASE, 0x09, __u32)
> +
> +#endif /* _UAPI_VDUSE_H_ */


_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
       [not found]     ` <CACycT3tAON+-qZev+9EqyL2XbgH5HDspOqNt3ohQLQ8GqVK=EA@mail.gmail.com>
@ 2021-06-22  5:06       ` Jason Wang
       [not found]         ` <CACycT3uzMJS7vw6MVMOgY4rb=SPfT2srV+8DPdwUVeELEiJgbA@mail.gmail.com>
  0 siblings, 1 reply; 41+ messages in thread
From: Jason Wang @ 2021-06-22  5:06 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, joro, Matthew Wilcox, Christoph Hellwig,
	Dan Carpenter, Al Viro, Stefan Hajnoczi, songmuchun, Jens Axboe,
	Greg KH, Randy Dunlap, linux-kernel, iommu, bcrl, netdev,
	linux-fsdevel, Mika Penttilä


On 2021/6/21 6:41 PM, Yongji Xie wrote:
> On Mon, Jun 21, 2021 at 5:14 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/6/15 10:13 PM, Xie Yongji wrote:
>>> This VDUSE driver enables implementing vDPA devices in userspace.
>>> The vDPA device's control path is handled in kernel and the data
>>> path is handled in userspace.
>>>
>>> A message mechanism is used by VDUSE driver to forward some control
>>> messages such as starting/stopping datapath to userspace. Userspace
>>> can use read()/write() to receive/reply those control messages.
>>>
>>> And some ioctls are introduced to help userspace to implement the
>>> data path. VDUSE_IOTLB_GET_FD ioctl can be used to get the file
>>> descriptors referring to vDPA device's iova regions. Then userspace
>>> can use mmap() to access those iova regions. VDUSE_DEV_GET_FEATURES
>>> and VDUSE_VQ_GET_INFO ioctls are used to get the negotiated features
>>> and metadata of virtqueues. VDUSE_VQ_INJECT_IRQ and VDUSE_VQ_SETUP_KICKFD
>>> ioctls can be used to inject interrupt and setup the kickfd for
>>> virtqueues. VDUSE_DEV_UPDATE_CONFIG ioctl is used to update the
>>> configuration space and inject a config interrupt.
>>>
>>> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
>>> ---
>>>    Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>>>    drivers/vdpa/Kconfig                               |   10 +
>>>    drivers/vdpa/Makefile                              |    1 +
>>>    drivers/vdpa/vdpa_user/Makefile                    |    5 +
>>>    drivers/vdpa/vdpa_user/vduse_dev.c                 | 1453 ++++++++++++++++++++
>>>    include/uapi/linux/vduse.h                         |  143 ++
>>>    6 files changed, 1613 insertions(+)
>>>    create mode 100644 drivers/vdpa/vdpa_user/Makefile
>>>    create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>>>    create mode 100644 include/uapi/linux/vduse.h
>>>
>>> diff --git a/Documentation/userspace-api/ioctl/ioctl-number.rst b/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> index 9bfc2b510c64..acd95e9dcfe7 100644
>>> --- a/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> +++ b/Documentation/userspace-api/ioctl/ioctl-number.rst
>>> @@ -300,6 +300,7 @@ Code  Seq#    Include File                                           Comments
>>>    'z'   10-4F  drivers/s390/crypto/zcrypt_api.h                        conflict!
>>>    '|'   00-7F  linux/media.h
>>>    0x80  00-1F  linux/fb.h
>>> +0x81  00-1F  linux/vduse.h
>>>    0x89  00-06  arch/x86/include/asm/sockios.h
>>>    0x89  0B-DF  linux/sockios.h
>>>    0x89  E0-EF  linux/sockios.h                                         SIOCPROTOPRIVATE range
>>> diff --git a/drivers/vdpa/Kconfig b/drivers/vdpa/Kconfig
>>> index a503c1b2bfd9..6e23bce6433a 100644
>>> --- a/drivers/vdpa/Kconfig
>>> +++ b/drivers/vdpa/Kconfig
>>> @@ -33,6 +33,16 @@ config VDPA_SIM_BLOCK
>>>          vDPA block device simulator which terminates IO request in a
>>>          memory buffer.
>>>
>>> +config VDPA_USER
>>> +     tristate "VDUSE (vDPA Device in Userspace) support"
>>> +     depends on EVENTFD && MMU && HAS_DMA
>>> +     select DMA_OPS
>>> +     select VHOST_IOTLB
>>> +     select IOMMU_IOVA
>>> +     help
>>> +       With VDUSE it is possible to emulate a vDPA Device
>>> +       in a userspace program.
>>> +
>>>    config IFCVF
>>>        tristate "Intel IFC VF vDPA driver"
>>>        depends on PCI_MSI
>>> diff --git a/drivers/vdpa/Makefile b/drivers/vdpa/Makefile
>>> index 67fe7f3d6943..f02ebed33f19 100644
>>> --- a/drivers/vdpa/Makefile
>>> +++ b/drivers/vdpa/Makefile
>>> @@ -1,6 +1,7 @@
>>>    # SPDX-License-Identifier: GPL-2.0
>>>    obj-$(CONFIG_VDPA) += vdpa.o
>>>    obj-$(CONFIG_VDPA_SIM) += vdpa_sim/
>>> +obj-$(CONFIG_VDPA_USER) += vdpa_user/
>>>    obj-$(CONFIG_IFCVF)    += ifcvf/
>>>    obj-$(CONFIG_MLX5_VDPA) += mlx5/
>>>    obj-$(CONFIG_VP_VDPA)    += virtio_pci/
>>> diff --git a/drivers/vdpa/vdpa_user/Makefile b/drivers/vdpa/vdpa_user/Makefile
>>> new file mode 100644
>>> index 000000000000..260e0b26af99
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/Makefile
>>> @@ -0,0 +1,5 @@
>>> +# SPDX-License-Identifier: GPL-2.0
>>> +
>>> +vduse-y := vduse_dev.o iova_domain.o
>>> +
>>> +obj-$(CONFIG_VDPA_USER) += vduse.o
>>> diff --git a/drivers/vdpa/vdpa_user/vduse_dev.c b/drivers/vdpa/vdpa_user/vduse_dev.c
>>> new file mode 100644
>>> index 000000000000..5271cbd15e28
>>> --- /dev/null
>>> +++ b/drivers/vdpa/vdpa_user/vduse_dev.c
>>> @@ -0,0 +1,1453 @@
>>> +// SPDX-License-Identifier: GPL-2.0-only
>>> +/*
>>> + * VDUSE: vDPA Device in Userspace
>>> + *
>>> + * Copyright (C) 2020-2021 Bytedance Inc. and/or its affiliates. All rights reserved.
>>> + *
>>> + * Author: Xie Yongji <xieyongji@bytedance.com>
>>> + *
>>> + */
>>> +
>>> +#include <linux/init.h>
>>> +#include <linux/module.h>
>>> +#include <linux/cdev.h>
>>> +#include <linux/device.h>
>>> +#include <linux/eventfd.h>
>>> +#include <linux/slab.h>
>>> +#include <linux/wait.h>
>>> +#include <linux/dma-map-ops.h>
>>> +#include <linux/poll.h>
>>> +#include <linux/file.h>
>>> +#include <linux/uio.h>
>>> +#include <linux/vdpa.h>
>>> +#include <linux/nospec.h>
>>> +#include <uapi/linux/vduse.h>
>>> +#include <uapi/linux/vdpa.h>
>>> +#include <uapi/linux/virtio_config.h>
>>> +#include <uapi/linux/virtio_ids.h>
>>> +#include <uapi/linux/virtio_blk.h>
>>> +#include <linux/mod_devicetable.h>
>>> +
>>> +#include "iova_domain.h"
>>> +
>>> +#define DRV_AUTHOR   "Yongji Xie <xieyongji@bytedance.com>"
>>> +#define DRV_DESC     "vDPA Device in Userspace"
>>> +#define DRV_LICENSE  "GPL v2"
>>> +
>>> +#define VDUSE_DEV_MAX (1U << MINORBITS)
>>> +#define VDUSE_MAX_BOUNCE_SIZE (64 * 1024 * 1024)
>>> +#define VDUSE_IOVA_SIZE (128 * 1024 * 1024)
>>> +#define VDUSE_REQUEST_TIMEOUT 30
>>> +
>>> +struct vduse_virtqueue {
>>> +     u16 index;
>>> +     u32 num;
>>> +     u32 avail_idx;
>>> +     u64 desc_addr;
>>> +     u64 driver_addr;
>>> +     u64 device_addr;
>>> +     bool ready;
>>> +     bool kicked;
>>> +     spinlock_t kick_lock;
>>> +     spinlock_t irq_lock;
>>> +     struct eventfd_ctx *kickfd;
>>> +     struct vdpa_callback cb;
>>> +     struct work_struct inject;
>>> +};
>>> +
>>> +struct vduse_dev;
>>> +
>>> +struct vduse_vdpa {
>>> +     struct vdpa_device vdpa;
>>> +     struct vduse_dev *dev;
>>> +};
>>> +
>>> +struct vduse_dev {
>>> +     struct vduse_vdpa *vdev;
>>> +     struct device *dev;
>>> +     struct vduse_virtqueue *vqs;
>>> +     struct vduse_iova_domain *domain;
>>> +     char *name;
>>> +     struct mutex lock;
>>> +     spinlock_t msg_lock;
>>> +     u64 msg_unique;
>>> +     wait_queue_head_t waitq;
>>> +     struct list_head send_list;
>>> +     struct list_head recv_list;
>>> +     struct vdpa_callback config_cb;
>>> +     struct work_struct inject;
>>> +     spinlock_t irq_lock;
>>> +     int minor;
>>> +     bool connected;
>>> +     bool started;
>>> +     u64 api_version;
>>> +     u64 user_features;
>>
>> Let's use device_features.
>>
> OK.
>
>>> +     u64 features;
>>
>> And driver features.
>>
> OK.
>
>>> +     u32 device_id;
>>> +     u32 vendor_id;
>>> +     u32 generation;
>>> +     u32 config_size;
>>> +     void *config;
>>> +     u8 status;
>>> +     u16 vq_size_max;
>>> +     u32 vq_num;
>>> +     u32 vq_align;
>>> +};
>>> +
>>> +struct vduse_dev_msg {
>>> +     struct vduse_dev_request req;
>>> +     struct vduse_dev_response resp;
>>> +     struct list_head list;
>>> +     wait_queue_head_t waitq;
>>> +     bool completed;
>>> +};
>>> +
>>> +struct vduse_control {
>>> +     u64 api_version;
>>> +};
>>> +
>>> +static DEFINE_MUTEX(vduse_lock);
>>> +static DEFINE_IDR(vduse_idr);
>>> +
>>> +static dev_t vduse_major;
>>> +static struct class *vduse_class;
>>> +static struct cdev vduse_ctrl_cdev;
>>> +static struct cdev vduse_cdev;
>>> +static struct workqueue_struct *vduse_irq_wq;
>>> +
>>> +static u32 allowed_device_id[] = {
>>> +     VIRTIO_ID_BLOCK,
>>> +};
>>> +
>>> +static inline struct vduse_dev *vdpa_to_vduse(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_vdpa *vdev = container_of(vdpa, struct vduse_vdpa, vdpa);
>>> +
>>> +     return vdev->dev;
>>> +}
>>> +
>>> +static inline struct vduse_dev *dev_to_vduse(struct device *dev)
>>> +{
>>> +     struct vdpa_device *vdpa = dev_to_vdpa(dev);
>>> +
>>> +     return vdpa_to_vduse(vdpa);
>>> +}
>>> +
>>> +static struct vduse_dev_msg *vduse_find_msg(struct list_head *head,
>>> +                                         uint32_t request_id)
>>> +{
>>> +     struct vduse_dev_msg *msg;
>>> +
>>> +     list_for_each_entry(msg, head, list) {
>>> +             if (msg->req.request_id == request_id) {
>>> +                     list_del(&msg->list);
>>> +                     return msg;
>>> +             }
>>> +     }
>>> +
>>> +     return NULL;
>>> +}
>>> +
>>> +static struct vduse_dev_msg *vduse_dequeue_msg(struct list_head *head)
>>> +{
>>> +     struct vduse_dev_msg *msg = NULL;
>>> +
>>> +     if (!list_empty(head)) {
>>> +             msg = list_first_entry(head, struct vduse_dev_msg, list);
>>> +             list_del(&msg->list);
>>> +     }
>>> +
>>> +     return msg;
>>> +}
>>> +
>>> +static void vduse_enqueue_msg(struct list_head *head,
>>> +                           struct vduse_dev_msg *msg)
>>> +{
>>> +     list_add_tail(&msg->list, head);
>>> +}
>>> +
>>> +static int vduse_dev_msg_send(struct vduse_dev *dev,
>>> +                           struct vduse_dev_msg *msg, bool no_reply)
>>> +{
>>
>> It looks to me like the only user of no_reply=true is the dataplane start.
>> I wonder whether no_reply is really needed, considering we have switched to
>> wait_event_killable_timeout().
>>
> Do we need to handle the error in this case if we remove the no_reply
> flag? Print a warning message?


See below.


>
>> Put another way, no_reply is false for vq state synchronization and IOTLB
>> updating. I wonder if we can simply use no_reply = true for them.
>>
> Looks like we can't, e.g. we need to get a reply from userspace for vq state.


Right.


>
>>> +     init_waitqueue_head(&msg->waitq);
>>> +     spin_lock(&dev->msg_lock);
>>> +     msg->req.request_id = dev->msg_unique++;
>>> +     vduse_enqueue_msg(&dev->send_list, msg);
>>> +     wake_up(&dev->waitq);
>>> +     spin_unlock(&dev->msg_lock);
>>> +     if (no_reply)
>>> +             return 0;
>>> +
>>> +     wait_event_killable_timeout(msg->waitq, msg->completed,
>>> +                                 VDUSE_REQUEST_TIMEOUT * HZ);
>>> +     spin_lock(&dev->msg_lock);
>>> +     if (!msg->completed) {
>>> +             list_del(&msg->list);
>>> +             msg->resp.result = VDUSE_REQ_RESULT_FAILED;
>>> +     }
>>> +     spin_unlock(&dev->msg_lock);
>>> +
>>> +     return (msg->resp.result == VDUSE_REQ_RESULT_OK) ? 0 : -EIO;
>>
>> Do we need to serialize the check by protecting it with the spinlock above?
>>
> Good point.
>
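A rough userspace sketch of the serialized check discussed above, assuming
a pthread mutex in place of dev->msg_lock. The names msg_model, msg_finish
and the REQ_RESULT_* values are made up for illustration and are not the
VDUSE definitions. The point is only that the result is read while the
lock is still held, so a late completion cannot change it between the
timeout handling and the return value.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the request result codes. */
enum { REQ_RESULT_OK = 0, REQ_RESULT_FAILED = 1 };

/* Hypothetical model of a pending request; not struct vduse_dev_msg. */
struct msg_model {
	pthread_mutex_t lock;   /* plays the role of dev->msg_lock */
	bool completed;         /* set by the responder under the lock */
	int result;
};

/* Called after the wait returns (completed, timed out or killed). */
static int msg_finish(struct msg_model *msg)
{
	int ret;

	pthread_mutex_lock(&msg->lock);
	if (!msg->completed)
		msg->result = REQ_RESULT_FAILED;
	/* Read the result before dropping the lock. */
	ret = (msg->result == REQ_RESULT_OK) ? 0 : -EIO;
	pthread_mutex_unlock(&msg->lock);

	return ret;
}
```

With this shape, a request that never completed yields -EIO and a
completed one with an OK result yields 0.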
>>> +}
>>> +
>>> +static void vduse_dev_msg_cleanup(struct vduse_dev *dev)
>>> +{
>>> +     struct vduse_dev_msg *msg;
>>> +
>>> +     spin_lock(&dev->msg_lock);
>>> +     while ((msg = vduse_dequeue_msg(&dev->send_list))) {
>>> +             if (msg->req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
>>> +                     kfree(msg);
>>> +             else
>>> +                     vduse_enqueue_msg(&dev->recv_list, msg);
>>> +     }
>>> +     while ((msg = vduse_dequeue_msg(&dev->recv_list))) {
>>> +             msg->resp.result = VDUSE_REQ_RESULT_FAILED;
>>> +             msg->completed = 1;
>>> +             wake_up(&msg->waitq);
>>> +     }
>>> +     spin_unlock(&dev->msg_lock);
>>> +}
>>> +
>>> +static void vduse_dev_start_dataplane(struct vduse_dev *dev)
>>> +{
>>> +     struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
>>> +                                         GFP_KERNEL | __GFP_NOFAIL);
>>> +
>>> +     msg->req.type = VDUSE_START_DATAPLANE;
>>> +     msg->req.flags |= VDUSE_REQ_FLAGS_NO_REPLY;
>>> +     vduse_dev_msg_send(dev, msg, true);
>>> +}
>>> +
>>> +static void vduse_dev_stop_dataplane(struct vduse_dev *dev)
>>> +{
>>> +     struct vduse_dev_msg *msg = kzalloc(sizeof(*msg),
>>> +                                         GFP_KERNEL | __GFP_NOFAIL);
>>> +
>>> +     msg->req.type = VDUSE_STOP_DATAPLANE;
>>> +     msg->req.flags |= VDUSE_REQ_FLAGS_NO_REPLY;
>>
>> Can we simply use this flag instead of introducing a new parameter
>> (no_reply) in vduse_dev_msg_send()?
>>
> Looks good to me.
>
>>> +     vduse_dev_msg_send(dev, msg, true);
>>> +}
>>> +
>>> +static int vduse_dev_get_vq_state(struct vduse_dev *dev,
>>> +                               struct vduse_virtqueue *vq,
>>> +                               struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +     int ret;
>>
>> Note that I posted a series that implements packed virtqueue support:
>>
>> https://lists.linuxfoundation.org/pipermail/virtualization/2021-June/054501.html
>>
>> So this patch needs to be updated as well.
>>
> Will do it.
>
>>> +
>>> +     msg.req.type = VDUSE_GET_VQ_STATE;
>>> +     msg.req.vq_state.index = vq->index;
>>> +
>>> +     ret = vduse_dev_msg_send(dev, &msg, false);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     state->avail_index = msg.resp.vq_state.avail_idx;
>>> +     return 0;
>>> +}
>>> +
>>> +static int vduse_dev_update_iotlb(struct vduse_dev *dev,
>>> +                             u64 start, u64 last)
>>> +{
>>> +     struct vduse_dev_msg msg = { 0 };
>>> +
>>> +     if (last < start)
>>> +             return -EINVAL;
>>> +
>>> +     msg.req.type = VDUSE_UPDATE_IOTLB;
>>> +     msg.req.iova.start = start;
>>> +     msg.req.iova.last = last;
>>> +
>>> +     return vduse_dev_msg_send(dev, &msg, false);
>>> +}
>>> +
>>> +static ssize_t vduse_dev_read_iter(struct kiocb *iocb, struct iov_iter *to)
>>> +{
>>> +     struct file *file = iocb->ki_filp;
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     struct vduse_dev_msg *msg;
>>> +     int size = sizeof(struct vduse_dev_request);
>>> +     ssize_t ret;
>>> +
>>> +     if (iov_iter_count(to) < size)
>>> +             return -EINVAL;
>>> +
>>> +     spin_lock(&dev->msg_lock);
>>> +     while (1) {
>>> +             msg = vduse_dequeue_msg(&dev->send_list);
>>> +             if (msg)
>>> +                     break;
>>> +
>>> +             ret = -EAGAIN;
>>> +             if (file->f_flags & O_NONBLOCK)
>>> +                     goto unlock;
>>> +
>>> +             spin_unlock(&dev->msg_lock);
>>> +             ret = wait_event_interruptible_exclusive(dev->waitq,
>>> +                                     !list_empty(&dev->send_list));
>>> +             if (ret)
>>> +                     return ret;
>>> +
>>> +             spin_lock(&dev->msg_lock);
>>> +     }
>>> +     spin_unlock(&dev->msg_lock);
>>> +     ret = copy_to_iter(&msg->req, size, to);
>>> +     spin_lock(&dev->msg_lock);
>>> +     if (ret != size) {
>>> +             ret = -EFAULT;
>>> +             vduse_enqueue_msg(&dev->send_list, msg);
>>> +             goto unlock;
>>> +     }
>>> +     if (msg->req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
>>> +             kfree(msg);
>>> +     else
>>> +             vduse_enqueue_msg(&dev->recv_list, msg);
>>> +unlock:
>>> +     spin_unlock(&dev->msg_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static ssize_t vduse_dev_write_iter(struct kiocb *iocb, struct iov_iter *from)
>>> +{
>>> +     struct file *file = iocb->ki_filp;
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     struct vduse_dev_response resp;
>>> +     struct vduse_dev_msg *msg;
>>> +     size_t ret;
>>> +
>>> +     ret = copy_from_iter(&resp, sizeof(resp), from);
>>> +     if (ret != sizeof(resp))
>>> +             return -EINVAL;
>>> +
>>> +     spin_lock(&dev->msg_lock);
>>> +     msg = vduse_find_msg(&dev->recv_list, resp.request_id);
>>> +     if (!msg) {
>>> +             ret = -ENOENT;
>>> +             goto unlock;
>>> +     }
>>> +
>>> +     memcpy(&msg->resp, &resp, sizeof(resp));
>>> +     msg->completed = 1;
>>> +     wake_up(&msg->waitq);
>>> +unlock:
>>> +     spin_unlock(&dev->msg_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static __poll_t vduse_dev_poll(struct file *file, poll_table *wait)
>>> +{
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     __poll_t mask = 0;
>>> +
>>> +     poll_wait(file, &dev->waitq, wait);
>>> +
>>> +     if (!list_empty(&dev->send_list))
>>> +             mask |= EPOLLIN | EPOLLRDNORM;
>>> +     if (!list_empty(&dev->recv_list))
>>> +             mask |= EPOLLOUT | EPOLLWRNORM;
>>> +
>>> +     return mask;
>>> +}
>>> +
>>> +static void vduse_dev_reset(struct vduse_dev *dev)
>>> +{
>>> +     int i;
>>> +     struct vduse_iova_domain *domain = dev->domain;
>>> +
>>> +     /* The coherent mappings are handled in vduse_dev_free_coherent() */
>>> +     if (domain->bounce_map)
>>> +             vduse_domain_reset_bounce_map(domain);
>>> +
>>> +     dev->features = 0;
>>> +     dev->generation++;
>>> +     spin_lock(&dev->irq_lock);
>>> +     dev->config_cb.callback = NULL;
>>> +     dev->config_cb.private = NULL;
>>> +     spin_unlock(&dev->irq_lock);
>>> +
>>> +     for (i = 0; i < dev->vq_num; i++) {
>>> +             struct vduse_virtqueue *vq = &dev->vqs[i];
>>> +
>>> +             vq->ready = false;
>>> +             vq->desc_addr = 0;
>>> +             vq->driver_addr = 0;
>>> +             vq->device_addr = 0;
>>> +             vq->avail_idx = 0;
>>> +             vq->num = 0;
>>> +
>>> +             spin_lock(&vq->kick_lock);
>>> +             vq->kicked = false;
>>> +             if (vq->kickfd)
>>> +                     eventfd_ctx_put(vq->kickfd);
>>> +             vq->kickfd = NULL;
>>> +             spin_unlock(&vq->kick_lock);
>>> +
>>> +             spin_lock(&vq->irq_lock);
>>> +             vq->cb.callback = NULL;
>>> +             vq->cb.private = NULL;
>>> +             spin_unlock(&vq->irq_lock);
>>> +     }
>>> +}
>>> +
>>> +static int vduse_vdpa_set_vq_address(struct vdpa_device *vdpa, u16 idx,
>>> +                             u64 desc_area, u64 driver_area,
>>> +                             u64 device_area)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vq->desc_addr = desc_area;
>>> +     vq->driver_addr = driver_area;
>>> +     vq->device_addr = device_area;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static void vduse_vdpa_kick_vq(struct vdpa_device *vdpa, u16 idx)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     spin_lock(&vq->kick_lock);
>>> +     if (!vq->ready)
>>> +             goto unlock;
>>> +
>>> +     if (vq->kickfd)
>>> +             eventfd_signal(vq->kickfd, 1);
>>> +     else
>>> +             vq->kicked = true;
>>> +unlock:
>>> +     spin_unlock(&vq->kick_lock);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_vq_cb(struct vdpa_device *vdpa, u16 idx,
>>> +                           struct vdpa_callback *cb)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     spin_lock(&vq->irq_lock);
>>> +     vq->cb.callback = cb->callback;
>>> +     vq->cb.private = cb->private;
>>> +     spin_unlock(&vq->irq_lock);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_vq_num(struct vdpa_device *vdpa, u16 idx, u32 num)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vq->num = num;
>>> +}
>>> +
>>> +static void vduse_vdpa_set_vq_ready(struct vdpa_device *vdpa,
>>> +                                     u16 idx, bool ready)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vq->ready = ready;
>>> +}
>>> +
>>> +static bool vduse_vdpa_get_vq_ready(struct vdpa_device *vdpa, u16 idx)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     return vq->ready;
>>> +}
>>> +
>>> +static int vduse_vdpa_set_vq_state(struct vdpa_device *vdpa, u16 idx,
>>> +                             const struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     vq->avail_idx = state->avail_index;
>>> +     return 0;
>>> +}
>>> +
>>> +static int vduse_vdpa_get_vq_state(struct vdpa_device *vdpa, u16 idx,
>>> +                             struct vdpa_vq_state *state)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     struct vduse_virtqueue *vq = &dev->vqs[idx];
>>> +
>>> +     return vduse_dev_get_vq_state(dev, vq, state);
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_vq_align(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->vq_align;
>>> +}
>>> +
>>> +static u64 vduse_vdpa_get_features(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->user_features;
>>> +}
>>> +
>>> +static int vduse_vdpa_set_features(struct vdpa_device *vdpa, u64 features)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     dev->features = features;
>>> +     return 0;
>>> +}
>>> +
>>> +static void vduse_vdpa_set_config_cb(struct vdpa_device *vdpa,
>>> +                               struct vdpa_callback *cb)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     spin_lock(&dev->irq_lock);
>>> +     dev->config_cb.callback = cb->callback;
>>> +     dev->config_cb.private = cb->private;
>>> +     spin_unlock(&dev->irq_lock);
>>> +}
>>> +
>>> +static u16 vduse_vdpa_get_vq_num_max(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->vq_size_max;
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_device_id(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->device_id;
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_vendor_id(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->vendor_id;
>>> +}
>>> +
>>> +static u8 vduse_vdpa_get_status(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->status;
>>> +}
>>> +
>>> +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     bool started = !!(status & VIRTIO_CONFIG_S_DRIVER_OK);
>>> +
>>> +     dev->status = status;
>>> +
>>> +     if (dev->started == started)
>>> +             return;
>>
>> If we check dev->status == status, (or only check the DRIVER_OK bit)
>> then there's no need to introduce an extra dev->started.
>>
> Will do it.
>
>>> +
>>> +     dev->started = started;
>>> +     if (dev->started) {
>>> +             vduse_dev_start_dataplane(dev);
>>> +     } else {
>>> +             vduse_dev_reset(dev);
>>> +             vduse_dev_stop_dataplane(dev);
>>
>> I wonder if no_reply works for the case of vhost-vDPA. For virtio-vDPA,
>> we have bounce buffers, so it's harmless if the userspace dataplane keeps
>> performing reads/writes. For vhost-vDPA we don't have such a thing.
>>
> OK. So it still needs to be synchronized here. If so, how to handle
> the error? Looks like printing a warning message should be enough.


We need to find a way to propagate the error to userspace.

E.g. if we want to stop the device, should we delay the status reset until
we get a response from userspace?
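A minimal userspace sketch of that idea, under stated assumptions: the
names dev_model, dev_set_status, stop_ok and stop_timeout are hypothetical,
and the set_status path is assumed to be able to return an error, which
the callback in the patch above (returning void) cannot do. The status is
only cleared once the (simulated) userspace daemon has acknowledged the
stop request.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define VIRTIO_CONFIG_S_DRIVER_OK 4  /* value from the virtio spec */

/* Hypothetical: returns 0 once userspace acknowledged the stop request. */
typedef int (*stop_request_fn)(void *opaque);

/* Hypothetical device model; not struct vduse_dev. */
struct dev_model {
	uint8_t status;
	stop_request_fn send_stop;
	void *opaque;
};

/* Example stubs standing in for the daemon's reply. */
static int stop_ok(void *opaque)      { (void)opaque; return 0; }
static int stop_timeout(void *opaque) { (void)opaque; return -EIO; }

static int dev_set_status(struct dev_model *dev, uint8_t status)
{
	int was_started = dev->status & VIRTIO_CONFIG_S_DRIVER_OK;
	int started = status & VIRTIO_CONFIG_S_DRIVER_OK;

	if (was_started && !started) {
		int ret = dev->send_stop(dev->opaque);

		if (ret)
			return ret;  /* keep the old status, report the error */
	}

	dev->status = status;
	return 0;
}
```

If the stop request fails or times out, the old status stays in place and
the caller sees the error instead of a silently reset device.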


>
>>> +     }
>>> +}
>>> +
>>> +static size_t vduse_vdpa_get_config_size(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->config_size;
>>> +}
>>> +
>>> +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
>>> +                               void *buf, unsigned int len)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     memcpy(buf, dev->config + offset, len);
>>> +}
>>> +
>>> +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
>>> +                     const void *buf, unsigned int len)
>>> +{
>>> +     /* Now we only support read-only configuration space */
>>> +}
>>> +
>>> +static u32 vduse_vdpa_get_generation(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     return dev->generation;
>>> +}
>>> +
>>> +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
>>> +                             struct vhost_iotlb *iotlb)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +     int ret;
>>> +
>>> +     ret = vduse_domain_set_map(dev->domain, iotlb);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
>>> +     if (ret) {
>>> +             vduse_domain_clear_map(dev->domain, iotlb);
>>> +             return ret;
>>> +     }
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static void vduse_vdpa_free(struct vdpa_device *vdpa)
>>> +{
>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>> +
>>> +     dev->vdev = NULL;
>>> +}
>>> +
>>> +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
>>> +     .set_vq_address         = vduse_vdpa_set_vq_address,
>>> +     .kick_vq                = vduse_vdpa_kick_vq,
>>> +     .set_vq_cb              = vduse_vdpa_set_vq_cb,
>>> +     .set_vq_num             = vduse_vdpa_set_vq_num,
>>> +     .set_vq_ready           = vduse_vdpa_set_vq_ready,
>>> +     .get_vq_ready           = vduse_vdpa_get_vq_ready,
>>> +     .set_vq_state           = vduse_vdpa_set_vq_state,
>>> +     .get_vq_state           = vduse_vdpa_get_vq_state,
>>> +     .get_vq_align           = vduse_vdpa_get_vq_align,
>>> +     .get_features           = vduse_vdpa_get_features,
>>> +     .set_features           = vduse_vdpa_set_features,
>>> +     .set_config_cb          = vduse_vdpa_set_config_cb,
>>> +     .get_vq_num_max         = vduse_vdpa_get_vq_num_max,
>>> +     .get_device_id          = vduse_vdpa_get_device_id,
>>> +     .get_vendor_id          = vduse_vdpa_get_vendor_id,
>>> +     .get_status             = vduse_vdpa_get_status,
>>> +     .set_status             = vduse_vdpa_set_status,
>>> +     .get_config_size        = vduse_vdpa_get_config_size,
>>> +     .get_config             = vduse_vdpa_get_config,
>>> +     .set_config             = vduse_vdpa_set_config,
>>> +     .get_generation         = vduse_vdpa_get_generation,
>>> +     .set_map                = vduse_vdpa_set_map,
>>> +     .free                   = vduse_vdpa_free,
>>> +};
>>> +
>>> +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
>>> +                                  unsigned long offset, size_t size,
>>> +                                  enum dma_data_direction dir,
>>> +                                  unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +
>>> +     return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
>>> +}
>>> +
>>> +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
>>> +                             size_t size, enum dma_data_direction dir,
>>> +                             unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +
>>> +     return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
>>> +}
>>> +
>>> +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
>>> +                                     dma_addr_t *dma_addr, gfp_t flag,
>>> +                                     unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +     unsigned long iova;
>>> +     void *addr;
>>> +
>>> +     *dma_addr = DMA_MAPPING_ERROR;
>>> +     addr = vduse_domain_alloc_coherent(domain, size,
>>> +                             (dma_addr_t *)&iova, flag, attrs);
>>> +     if (!addr)
>>> +             return NULL;
>>> +
>>> +     *dma_addr = (dma_addr_t)iova;
>>> +
>>> +     return addr;
>>> +}
>>> +
>>> +static void vduse_dev_free_coherent(struct device *dev, size_t size,
>>> +                                     void *vaddr, dma_addr_t dma_addr,
>>> +                                     unsigned long attrs)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +
>>> +     vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
>>> +}
>>> +
>>> +static size_t vduse_dev_max_mapping_size(struct device *dev)
>>> +{
>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>> +
>>> +     return domain->bounce_size;
>>> +}
>>> +
>>> +static const struct dma_map_ops vduse_dev_dma_ops = {
>>> +     .map_page = vduse_dev_map_page,
>>> +     .unmap_page = vduse_dev_unmap_page,
>>> +     .alloc = vduse_dev_alloc_coherent,
>>> +     .free = vduse_dev_free_coherent,
>>> +     .max_mapping_size = vduse_dev_max_mapping_size,
>>> +};
>>> +
>>> +static unsigned int perm_to_file_flags(u8 perm)
>>> +{
>>> +     unsigned int flags = 0;
>>> +
>>> +     switch (perm) {
>>> +     case VDUSE_ACCESS_WO:
>>> +             flags |= O_WRONLY;
>>> +             break;
>>> +     case VDUSE_ACCESS_RO:
>>> +             flags |= O_RDONLY;
>>> +             break;
>>> +     case VDUSE_ACCESS_RW:
>>> +             flags |= O_RDWR;
>>> +             break;
>>> +     default:
>>> +             WARN(1, "invalidate vhost IOTLB permission\n");
>>> +             break;
>>> +     }
>>> +
>>> +     return flags;
>>> +}
>>> +
>>> +static int vduse_kickfd_setup(struct vduse_dev *dev,
>>> +                     struct vduse_vq_eventfd *eventfd)
>>> +{
>>> +     struct eventfd_ctx *ctx = NULL;
>>> +     struct vduse_virtqueue *vq;
>>> +     u32 index;
>>> +
>>> +     if (eventfd->index >= dev->vq_num)
>>> +             return -EINVAL;
>>> +
>>> +     index = array_index_nospec(eventfd->index, dev->vq_num);
>>> +     vq = &dev->vqs[index];
>>> +     if (eventfd->fd >= 0) {
>>> +             ctx = eventfd_ctx_fdget(eventfd->fd);
>>> +             if (IS_ERR(ctx))
>>> +                     return PTR_ERR(ctx);
>>> +     } else if (eventfd->fd != VDUSE_EVENTFD_DEASSIGN)
>>> +             return 0;
>>> +
>>> +     spin_lock(&vq->kick_lock);
>>> +     if (vq->kickfd)
>>> +             eventfd_ctx_put(vq->kickfd);
>>> +     vq->kickfd = ctx;
>>> +     if (vq->ready && vq->kicked && vq->kickfd) {
>>> +             eventfd_signal(vq->kickfd, 1);
>>> +             vq->kicked = false;
>>> +     }
>>> +     spin_unlock(&vq->kick_lock);
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static void vduse_dev_irq_inject(struct work_struct *work)
>>> +{
>>> +     struct vduse_dev *dev = container_of(work, struct vduse_dev, inject);
>>> +
>>> +     spin_lock_irq(&dev->irq_lock);
>>> +     if (dev->config_cb.callback)
>>> +             dev->config_cb.callback(dev->config_cb.private);
>>> +     spin_unlock_irq(&dev->irq_lock);
>>> +}
>>> +
>>> +static void vduse_vq_irq_inject(struct work_struct *work)
>>> +{
>>> +     struct vduse_virtqueue *vq = container_of(work,
>>> +                                     struct vduse_virtqueue, inject);
>>> +
>>> +     spin_lock_irq(&vq->irq_lock);
>>> +     if (vq->ready && vq->cb.callback)
>>> +             vq->cb.callback(vq->cb.private);
>>> +     spin_unlock_irq(&vq->irq_lock);
>>> +}
>>> +
>>> +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>>> +                         unsigned long arg)
>>> +{
>>> +     struct vduse_dev *dev = file->private_data;
>>> +     void __user *argp = (void __user *)arg;
>>> +     int ret;
>>> +
>>> +     switch (cmd) {
>>> +     case VDUSE_IOTLB_GET_FD: {
>>> +             struct vduse_iotlb_entry entry;
>>> +             struct vhost_iotlb_map *map;
>>> +             struct vdpa_map_file *map_file;
>>> +             struct vduse_iova_domain *domain = dev->domain;
>>> +             struct file *f = NULL;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&entry, argp, sizeof(entry)))
>>> +                     break;
>>> +
>>> +             ret = -EINVAL;
>>> +             if (entry.start > entry.last)
>>> +                     break;
>>> +
>>> +             spin_lock(&domain->iotlb_lock);
>>> +             map = vhost_iotlb_itree_first(domain->iotlb,
>>> +                                           entry.start, entry.last);
>>> +             if (map) {
>>> +                     map_file = (struct vdpa_map_file *)map->opaque;
>>> +                     f = get_file(map_file->file);
>>> +                     entry.offset = map_file->offset;
>>> +                     entry.start = map->start;
>>> +                     entry.last = map->last;
>>> +                     entry.perm = map->perm;
>>> +             }
>>> +             spin_unlock(&domain->iotlb_lock);
>>> +             ret = -EINVAL;
>>> +             if (!f)
>>> +                     break;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_to_user(argp, &entry, sizeof(entry))) {
>>> +                     fput(f);
>>> +                     break;
>>> +             }
>>> +             ret = receive_fd(f, perm_to_file_flags(entry.perm));
>>> +             fput(f);
>>> +             break;
>>> +     }
>>> +     case VDUSE_DEV_GET_FEATURES:
>>> +             ret = put_user(dev->features, (u64 __user *)argp);
>>> +             break;
>>> +     case VDUSE_DEV_UPDATE_CONFIG: {
>>> +             struct vduse_config_update config;
>>> +             unsigned long size = offsetof(struct vduse_config_update,
>>> +                                           buffer);
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&config, argp, size))
>>> +                     break;
>>> +
>>> +             ret = -EINVAL;
>>> +             if (config.length == 0 ||
>>> +                 config.length > dev->config_size - config.offset)
>>> +                     break;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(dev->config + config.offset, argp + size,
>>> +                                config.length))
>>> +                     break;
>>> +
>>> +             ret = 0;
>>> +             queue_work(vduse_irq_wq, &dev->inject);
>>
>> I wonder if it's better to separate the config interrupt from the config
>> update; otherwise we need to document this.
>>
> I have documented it in the docs. Looks like a config update should
> always be followed by a config interrupt. I didn't find a case that uses
> them separately.


The uAPI doesn't prevent the following scenario:

update_config(mac[0], ..);
update_config(mac[1], ..);

So it looks to me like it's better to separate the config interrupt from the
config update.
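To make the scenario concrete, here is a small userspace model of the
split, not kernel code. The names cfg_dev, cfg_update and cfg_inject_irq
are assumptions for illustration and do not exist in the patch. Several
partial updates are applied first, and the interrupt is raised once
afterwards as a separate step.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical model of a device config space; not the VDUSE uAPI. */
struct cfg_dev {
	unsigned char config[16];
	size_t config_size;
	unsigned int irq_count;  /* config interrupts delivered so far */
};

/* Update part of the config space without raising an interrupt. */
static int cfg_update(struct cfg_dev *dev, size_t offset,
		      const void *buf, size_t len)
{
	/* offset is checked first so the subtraction cannot wrap. */
	if (len == 0 || offset > dev->config_size ||
	    len > dev->config_size - offset)
		return -EINVAL;

	memcpy(dev->config + offset, buf, len);
	return 0;
}

/* Separate step: notify the driver once all updates are in place. */
static void cfg_inject_irq(struct cfg_dev *dev)
{
	dev->irq_count++;
}
```

A caller would issue cfg_update() for mac[0] and mac[1] and then call
cfg_inject_irq() once, so the driver never sees an interrupt for a
half-updated field.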


>
>>> +             break;
>>> +     }
>>> +     case VDUSE_VQ_GET_INFO: {
>>
>> Do we need to limit this only when DRIVER_OK is set?
>>
> Any reason to add this limitation?


Otherwise the vq is not fully initialized, e.g. the desc_addr might not
be correct.
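
A rough sketch of what such a limitation could look like, written as a
standalone userspace model rather than the actual ioctl handler. The
dev_model and vq_model types, the vq_get_desc_addr() helper and the choice
of -EAGAIN are assumptions for illustration only; the DRIVER_OK bit value is
the one from the virtio specification.

```c
#include <errno.h>
#include <stdint.h>

/* Status bit value as defined by the virtio specification. */
#define VIRTIO_CONFIG_S_DRIVER_OK 4

/* Hypothetical, trimmed-down stand-ins for the device and its vqs. */
struct vq_model {
	uint64_t desc_addr;
};

struct dev_model {
	uint8_t status;          /* virtio device status byte */
	uint32_t vq_num;
	struct vq_model vqs[2];
};

/* Report a vq's descriptor address only after the driver set DRIVER_OK. */
int vq_get_desc_addr(const struct dev_model *dev, uint32_t index,
		     uint64_t *desc_addr)
{
	if (index >= dev->vq_num)
		return -EINVAL;

	/* Before DRIVER_OK the addresses may not have been programmed yet. */
	if (!(dev->status & VIRTIO_CONFIG_S_DRIVER_OK))
		return -EAGAIN;

	*desc_addr = dev->vqs[index].desc_addr;
	return 0;
}
```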


>
>>> +             struct vduse_vq_info vq_info;
>>> +             u32 vq_index;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&vq_info, argp, sizeof(vq_info)))
>>> +                     break;
>>> +
>>> +             ret = -EINVAL;
>>> +             if (vq_info.index >= dev->vq_num)
>>> +                     break;
>>> +
>>> +             vq_index = array_index_nospec(vq_info.index, dev->vq_num);
>>> +             vq_info.desc_addr = dev->vqs[vq_index].desc_addr;
>>> +             vq_info.driver_addr = dev->vqs[vq_index].driver_addr;
>>> +             vq_info.device_addr = dev->vqs[vq_index].device_addr;
>>> +             vq_info.num = dev->vqs[vq_index].num;
>>> +             vq_info.avail_idx = dev->vqs[vq_index].avail_idx;
>>> +             vq_info.ready = dev->vqs[vq_index].ready;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_to_user(argp, &vq_info, sizeof(vq_info)))
>>> +                     break;
>>> +
>>> +             ret = 0;
>>> +             break;
>>> +     }
>>> +     case VDUSE_VQ_SETUP_KICKFD: {
>>> +             struct vduse_vq_eventfd eventfd;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&eventfd, argp, sizeof(eventfd)))
>>> +                     break;
>>> +
>>> +             ret = vduse_kickfd_setup(dev, &eventfd);
>>> +             break;
>>> +     }
>>> +     case VDUSE_VQ_INJECT_IRQ: {
>>> +             u32 vq_index;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (get_user(vq_index, (u32 __user *)argp))
>>> +                     break;
>>> +
>>> +             ret = -EINVAL;
>>> +             if (vq_index >= dev->vq_num)
>>> +                     break;
>>> +
>>> +             ret = 0;
>>> +             vq_index = array_index_nospec(vq_index, dev->vq_num);
>>> +             queue_work(vduse_irq_wq, &dev->vqs[vq_index].inject);
>>> +             break;
>>> +     }
>>> +     default:
>>> +             ret = -ENOIOCTLCMD;
>>> +             break;
>>> +     }
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static int vduse_dev_release(struct inode *inode, struct file *file)
>>> +{
>>> +     struct vduse_dev *dev = file->private_data;
>>> +
>>> +     spin_lock(&dev->msg_lock);
>>> +     /* Make sure the inflight messages can be processed after reconnection */
>>> +     list_splice_init(&dev->recv_list, &dev->send_list);
>>> +     spin_unlock(&dev->msg_lock);
>>> +     dev->connected = false;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static struct vduse_dev *vduse_dev_get_from_minor(int minor)
>>> +{
>>> +     struct vduse_dev *dev;
>>> +
>>> +     mutex_lock(&vduse_lock);
>>> +     dev = idr_find(&vduse_idr, minor);
>>> +     mutex_unlock(&vduse_lock);
>>> +
>>> +     return dev;
>>> +}
>>> +
>>> +static int vduse_dev_open(struct inode *inode, struct file *file)
>>> +{
>>> +     int ret;
>>> +     struct vduse_dev *dev = vduse_dev_get_from_minor(iminor(inode));
>>> +
>>> +     if (!dev)
>>> +             return -ENODEV;
>>> +
>>> +     ret = -EBUSY;
>>> +     mutex_lock(&dev->lock);
>>> +     if (dev->connected)
>>> +             goto unlock;
>>> +
>>> +     ret = 0;
>>> +     dev->connected = true;
>>> +     file->private_data = dev;
>>> +unlock:
>>> +     mutex_unlock(&dev->lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static const struct file_operations vduse_dev_fops = {
>>> +     .owner          = THIS_MODULE,
>>> +     .open           = vduse_dev_open,
>>> +     .release        = vduse_dev_release,
>>> +     .read_iter      = vduse_dev_read_iter,
>>> +     .write_iter     = vduse_dev_write_iter,
>>> +     .poll           = vduse_dev_poll,
>>> +     .unlocked_ioctl = vduse_dev_ioctl,
>>> +     .compat_ioctl   = compat_ptr_ioctl,
>>> +     .llseek         = noop_llseek,
>>> +};
>>> +
>>> +static struct vduse_dev *vduse_dev_create(void)
>>> +{
>>> +     struct vduse_dev *dev = kzalloc(sizeof(*dev), GFP_KERNEL);
>>> +
>>> +     if (!dev)
>>> +             return NULL;
>>> +
>>> +     mutex_init(&dev->lock);
>>> +     spin_lock_init(&dev->msg_lock);
>>> +     INIT_LIST_HEAD(&dev->send_list);
>>> +     INIT_LIST_HEAD(&dev->recv_list);
>>> +     spin_lock_init(&dev->irq_lock);
>>> +
>>> +     INIT_WORK(&dev->inject, vduse_dev_irq_inject);
>>> +     init_waitqueue_head(&dev->waitq);
>>> +
>>> +     return dev;
>>> +}
>>> +
>>> +static void vduse_dev_destroy(struct vduse_dev *dev)
>>> +{
>>> +     kfree(dev);
>>> +}
>>> +
>>> +static struct vduse_dev *vduse_find_dev(const char *name)
>>> +{
>>> +     struct vduse_dev *dev;
>>> +     int id;
>>> +
>>> +     idr_for_each_entry(&vduse_idr, dev, id)
>>> +             if (!strcmp(dev->name, name))
>>> +                     return dev;
>>> +
>>> +     return NULL;
>>> +}
>>> +
>>> +static int vduse_destroy_dev(char *name)
>>> +{
>>> +     struct vduse_dev *dev = vduse_find_dev(name);
>>> +
>>> +     if (!dev)
>>> +             return -EINVAL;
>>> +
>>> +     mutex_lock(&dev->lock);
>>> +     if (dev->vdev || dev->connected) {
>>> +             mutex_unlock(&dev->lock);
>>> +             return -EBUSY;
>>> +     }
>>> +     dev->connected = true;
>>> +     mutex_unlock(&dev->lock);
>>> +
>>> +     vduse_dev_msg_cleanup(dev);
>>> +     device_destroy(vduse_class, MKDEV(MAJOR(vduse_major), dev->minor));
>>> +     idr_remove(&vduse_idr, dev->minor);
>>> +     kvfree(dev->config);
>>> +     kfree(dev->vqs);
>>> +     vduse_domain_destroy(dev->domain);
>>> +     kfree(dev->name);
>>> +     vduse_dev_destroy(dev);
>>> +     module_put(THIS_MODULE);
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static bool device_is_allowed(u32 device_id)
>>> +{
>>> +     int i;
>>> +
>>> +     for (i = 0; i < ARRAY_SIZE(allowed_device_id); i++)
>>> +             if (allowed_device_id[i] == device_id)
>>> +                     return true;
>>> +
>>> +     return false;
>>> +}
>>> +
>>> +static bool features_is_valid(u64 features)
>>> +{
>>> +     if (!(features & (1ULL << VIRTIO_F_ACCESS_PLATFORM)))
>>> +             return false;
>>> +
>>> +     /* Now we only support read-only configuration space */
>>> +     if (features & (1ULL << VIRTIO_BLK_F_CONFIG_WCE))
>>> +             return false;
>>> +
>>> +     return true;
>>> +}
>>> +
>>> +static bool vduse_validate_config(struct vduse_dev_config *config)
>>> +{
>>> +     if (config->bounce_size > VDUSE_MAX_BOUNCE_SIZE)
>>> +             return false;
>>> +
>>> +     if (config->vq_align > PAGE_SIZE)
>>> +             return false;
>>> +
>>> +     if (config->config_size > PAGE_SIZE)
>>> +             return false;
>>> +
>>> +     if (!device_is_allowed(config->device_id))
>>> +             return false;
>>> +
>>> +     if (!features_is_valid(config->features))
>>> +             return false;
>>
>> Do we need to validate whether config_size is too small? Otherwise
>> we may have OOB access in get_config().
>>
> How about adding validation in get_config()? It seems to be hard to
> define the lower bound.


It should work.

Thanks
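
For reference, one possible shape of such a check, as a standalone sketch
and not the driver code. The helper name get_config_bounded() is
hypothetical, and zero-filling the part of the buffer that lies beyond the
config space is an assumption here; the thread does not settle whether to
zero-fill or to skip the copy.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Hypothetical bounded variant of a get_config() callback: copy at most
 * what the config space holds and zero-fill the rest of the caller's buffer.
 * Returns the number of bytes actually copied from the config space.
 */
size_t get_config_bounded(const uint8_t *config, size_t config_size,
			  size_t offset, void *buf, size_t len)
{
	size_t avail = 0;

	/* Comparing offset first keeps config_size - offset from wrapping. */
	if (offset < config_size)
		avail = config_size - offset;
	if (avail > len)
		avail = len;

	memset(buf, 0, len);
	if (avail)
		memcpy(buf, config + offset, avail);
	return avail;
}
```

A read that starts past the end of the config space then returns zeros
instead of touching memory outside the buffer.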


>
>>> +
>>> +     return true;
>>> +}
>>> +
>>> +static int vduse_create_dev(struct vduse_dev_config *config,
>>> +                         void *config_buf, u64 api_version)
>>> +{
>>> +     int i, ret;
>>> +     struct vduse_dev *dev;
>>> +
>>> +     ret = -EEXIST;
>>> +     if (vduse_find_dev(config->name))
>>> +             goto err;
>>> +
>>> +     ret = -ENOMEM;
>>> +     dev = vduse_dev_create();
>>> +     if (!dev)
>>> +             goto err;
>>> +
>>> +     dev->api_version = api_version;
>>> +     dev->user_features = config->features;
>>> +     dev->device_id = config->device_id;
>>> +     dev->vendor_id = config->vendor_id;
>>> +     dev->name = kstrdup(config->name, GFP_KERNEL);
>>> +     if (!dev->name)
>>> +             goto err_str;
>>> +
>>> +     dev->domain = vduse_domain_create(VDUSE_IOVA_SIZE - 1,
>>> +                                       config->bounce_size);
>>> +     if (!dev->domain)
>>> +             goto err_domain;
>>> +
>>> +     dev->config = config_buf;
>>> +     dev->config_size = config->config_size;
>>> +     dev->vq_align = config->vq_align;
>>> +     dev->vq_size_max = config->vq_size_max;
>>> +     dev->vq_num = config->vq_num;
>>> +     dev->vqs = kcalloc(dev->vq_num, sizeof(*dev->vqs), GFP_KERNEL);
>>> +     if (!dev->vqs)
>>> +             goto err_vqs;
>>> +
>>> +     for (i = 0; i < dev->vq_num; i++) {
>>> +             dev->vqs[i].index = i;
>>> +             INIT_WORK(&dev->vqs[i].inject, vduse_vq_irq_inject);
>>> +             spin_lock_init(&dev->vqs[i].kick_lock);
>>> +             spin_lock_init(&dev->vqs[i].irq_lock);
>>> +     }
>>> +
>>> +     ret = idr_alloc(&vduse_idr, dev, 1, VDUSE_DEV_MAX, GFP_KERNEL);
>>> +     if (ret < 0)
>>> +             goto err_idr;
>>> +
>>> +     dev->minor = ret;
>>> +     dev->dev = device_create(vduse_class, NULL,
>>> +                              MKDEV(MAJOR(vduse_major), dev->minor),
>>> +                              NULL, "%s", config->name);
>>> +     if (IS_ERR(dev->dev)) {
>>> +             ret = PTR_ERR(dev->dev);
>>> +             goto err_dev;
>>> +     }
>>> +     __module_get(THIS_MODULE);
>>> +
>>> +     return 0;
>>> +err_dev:
>>> +     idr_remove(&vduse_idr, dev->minor);
>>> +err_idr:
>>> +     kfree(dev->vqs);
>>> +err_vqs:
>>> +     vduse_domain_destroy(dev->domain);
>>> +err_domain:
>>> +     kfree(dev->name);
>>> +err_str:
>>> +     vduse_dev_destroy(dev);
>>> +err:
>>> +     kvfree(config_buf);
>>> +     return ret;
>>> +}
>>> +
>>> +static long vduse_ioctl(struct file *file, unsigned int cmd,
>>> +                     unsigned long arg)
>>> +{
>>> +     int ret;
>>> +     void __user *argp = (void __user *)arg;
>>> +     struct vduse_control *control = file->private_data;
>>> +
>>> +     mutex_lock(&vduse_lock);
>>> +     switch (cmd) {
>>> +     case VDUSE_GET_API_VERSION:
>>> +             ret = put_user(control->api_version, (u64 __user *)argp);
>>> +             break;
>>> +     case VDUSE_SET_API_VERSION: {
>>> +             u64 api_version;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (get_user(api_version, (u64 __user *)argp))
>>> +                     break;
>>> +
>>> +             ret = -EINVAL;
>>> +             if (api_version > VDUSE_API_VERSION)
>>> +                     break;
>>> +
>>> +             ret = 0;
>>> +             control->api_version = api_version;
>>> +             break;
>>> +     }
>>> +     case VDUSE_CREATE_DEV: {
>>> +             struct vduse_dev_config config;
>>> +             unsigned long size = offsetof(struct vduse_dev_config, config);
>>> +             void *buf;
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(&config, argp, size))
>>> +                     break;
>>> +
>>> +             ret = -EINVAL;
>>> +             if (vduse_validate_config(&config) == false)
>>> +                     break;
>>> +
>>> +             buf = vmemdup_user(argp + size, config.config_size);
>>> +             if (IS_ERR(buf)) {
>>> +                     ret = PTR_ERR(buf);
>>> +                     break;
>>> +             }
>>> +             ret = vduse_create_dev(&config, buf, control->api_version);
>>> +             break;
>>> +     }
>>> +     case VDUSE_DESTROY_DEV: {
>>> +             char name[VDUSE_NAME_MAX];
>>> +
>>> +             ret = -EFAULT;
>>> +             if (copy_from_user(name, argp, VDUSE_NAME_MAX))
>>> +                     break;
>>> +
>>> +             ret = vduse_destroy_dev(name);
>>> +             break;
>>> +     }
>>> +     default:
>>> +             ret = -EINVAL;
>>> +             break;
>>> +     }
>>> +     mutex_unlock(&vduse_lock);
>>> +
>>> +     return ret;
>>> +}
>>> +
>>> +static int vduse_release(struct inode *inode, struct file *file)
>>> +{
>>> +     struct vduse_control *control = file->private_data;
>>> +
>>> +     kfree(control);
>>> +     return 0;
>>> +}
>>> +
>>> +static int vduse_open(struct inode *inode, struct file *file)
>>> +{
>>> +     struct vduse_control *control;
>>> +
>>> +     control = kmalloc(sizeof(struct vduse_control), GFP_KERNEL);
>>> +     if (!control)
>>> +             return -ENOMEM;
>>> +
>>> +     control->api_version = VDUSE_API_VERSION;
>>> +     file->private_data = control;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static const struct file_operations vduse_ctrl_fops = {
>>> +     .owner          = THIS_MODULE,
>>> +     .open           = vduse_open,
>>> +     .release        = vduse_release,
>>> +     .unlocked_ioctl = vduse_ioctl,
>>> +     .compat_ioctl   = compat_ptr_ioctl,
>>> +     .llseek         = noop_llseek,
>>> +};
>>> +
>>> +static char *vduse_devnode(struct device *dev, umode_t *mode)
>>> +{
>>> +     return kasprintf(GFP_KERNEL, "vduse/%s", dev_name(dev));
>>> +}
>>> +
>>> +static void vduse_mgmtdev_release(struct device *dev)
>>> +{
>>> +}
>>> +
>>> +static struct device vduse_mgmtdev = {
>>> +     .init_name = "vduse",
>>> +     .release = vduse_mgmtdev_release,
>>> +};
>>> +
>>> +static struct vdpa_mgmt_dev mgmt_dev;
>>> +
>>> +static int vduse_dev_init_vdpa(struct vduse_dev *dev, const char *name)
>>> +{
>>> +     struct vduse_vdpa *vdev;
>>> +     int ret;
>>> +
>>> +     if (dev->vdev)
>>> +             return -EEXIST;
>>> +
>>> +     vdev = vdpa_alloc_device(struct vduse_vdpa, vdpa, dev->dev,
>>> +                              &vduse_vdpa_config_ops, name, true);
>>> +     if (!vdev)
>>> +             return -ENOMEM;
>>> +
>>> +     dev->vdev = vdev;
>>> +     vdev->dev = dev;
>>> +     vdev->vdpa.dev.dma_mask = &vdev->vdpa.dev.coherent_dma_mask;
>>> +     ret = dma_set_mask_and_coherent(&vdev->vdpa.dev, DMA_BIT_MASK(64));
>>> +     if (ret) {
>>> +             put_device(&vdev->vdpa.dev);
>>> +             return ret;
>>> +     }
>>> +     set_dma_ops(&vdev->vdpa.dev, &vduse_dev_dma_ops);
>>> +     vdev->vdpa.dma_dev = &vdev->vdpa.dev;
>>> +     vdev->vdpa.mdev = &mgmt_dev;
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static int vdpa_dev_add(struct vdpa_mgmt_dev *mdev, const char *name)
>>> +{
>>> +     struct vduse_dev *dev;
>>> +     int ret;
>>> +
>>> +     mutex_lock(&vduse_lock);
>>> +     dev = vduse_find_dev(name);
>>> +     if (!dev) {
>>> +             mutex_unlock(&vduse_lock);
>>> +             return -EINVAL;
>>> +     }
>>> +     ret = vduse_dev_init_vdpa(dev, name);
>>> +     mutex_unlock(&vduse_lock);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     ret = _vdpa_register_device(&dev->vdev->vdpa, dev->vq_num);
>>> +     if (ret) {
>>> +             put_device(&dev->vdev->vdpa.dev);
>>> +             return ret;
>>> +     }
>>> +
>>> +     return 0;
>>> +}
>>> +
>>> +static void vdpa_dev_del(struct vdpa_mgmt_dev *mdev, struct vdpa_device *dev)
>>> +{
>>> +     _vdpa_unregister_device(dev);
>>> +}
>>> +
>>> +static const struct vdpa_mgmtdev_ops vdpa_dev_mgmtdev_ops = {
>>> +     .dev_add = vdpa_dev_add,
>>> +     .dev_del = vdpa_dev_del,
>>> +};
>>> +
>>> +static struct virtio_device_id id_table[] = {
>>> +     { VIRTIO_ID_BLOCK, VIRTIO_DEV_ANY_ID },
>>> +     { 0 },
>>> +};
>>> +
>>> +static struct vdpa_mgmt_dev mgmt_dev = {
>>> +     .device = &vduse_mgmtdev,
>>> +     .id_table = id_table,
>>> +     .ops = &vdpa_dev_mgmtdev_ops,
>>> +};
>>> +
>>> +static int vduse_mgmtdev_init(void)
>>> +{
>>> +     int ret;
>>> +
>>> +     ret = device_register(&vduse_mgmtdev);
>>> +     if (ret)
>>> +             return ret;
>>> +
>>> +     ret = vdpa_mgmtdev_register(&mgmt_dev);
>>> +     if (ret)
>>> +             goto err;
>>> +
>>> +     return 0;
>>> +err:
>>> +     device_unregister(&vduse_mgmtdev);
>>> +     return ret;
>>> +}
>>> +
>>> +static void vduse_mgmtdev_exit(void)
>>> +{
>>> +     vdpa_mgmtdev_unregister(&mgmt_dev);
>>> +     device_unregister(&vduse_mgmtdev);
>>> +}
>>> +
>>> +static int vduse_init(void)
>>> +{
>>> +     int ret;
>>> +     struct device *dev;
>>> +
>>> +     vduse_class = class_create(THIS_MODULE, "vduse");
>>> +     if (IS_ERR(vduse_class))
>>> +             return PTR_ERR(vduse_class);
>>> +
>>> +     vduse_class->devnode = vduse_devnode;
>>> +
>>> +     ret = alloc_chrdev_region(&vduse_major, 0, VDUSE_DEV_MAX, "vduse");
>>> +     if (ret)
>>> +             goto err_chardev_region;
>>> +
>>> +     /* /dev/vduse/control */
>>> +     cdev_init(&vduse_ctrl_cdev, &vduse_ctrl_fops);
>>> +     vduse_ctrl_cdev.owner = THIS_MODULE;
>>> +     ret = cdev_add(&vduse_ctrl_cdev, vduse_major, 1);
>>> +     if (ret)
>>> +             goto err_ctrl_cdev;
>>> +
>>> +     dev = device_create(vduse_class, NULL, vduse_major, NULL, "control");
>>> +     if (IS_ERR(dev)) {
>>> +             ret = PTR_ERR(dev);
>>> +             goto err_device;
>>> +     }
>>> +
>>> +     /* /dev/vduse/$DEVICE */
>>> +     cdev_init(&vduse_cdev, &vduse_dev_fops);
>>> +     vduse_cdev.owner = THIS_MODULE;
>>> +     ret = cdev_add(&vduse_cdev, MKDEV(MAJOR(vduse_major), 1),
>>> +                    VDUSE_DEV_MAX - 1);
>>> +     if (ret)
>>> +             goto err_cdev;
>>> +
>>> +     vduse_irq_wq = alloc_workqueue("vduse-irq",
>>> +                             WQ_HIGHPRI | WQ_SYSFS | WQ_UNBOUND, 0);
>>> +     if (!vduse_irq_wq)
>>> +             goto err_wq;
>>> +
>>> +     ret = vduse_domain_init();
>>> +     if (ret)
>>> +             goto err_domain;
>>> +
>>> +     ret = vduse_mgmtdev_init();
>>> +     if (ret)
>>> +             goto err_mgmtdev;
>>> +
>>> +     return 0;
>>> +err_mgmtdev:
>>> +     vduse_domain_exit();
>>> +err_domain:
>>> +     destroy_workqueue(vduse_irq_wq);
>>> +err_wq:
>>> +     cdev_del(&vduse_cdev);
>>> +err_cdev:
>>> +     device_destroy(vduse_class, vduse_major);
>>> +err_device:
>>> +     cdev_del(&vduse_ctrl_cdev);
>>> +err_ctrl_cdev:
>>> +     unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
>>> +err_chardev_region:
>>> +     class_destroy(vduse_class);
>>> +     return ret;
>>> +}
>>> +module_init(vduse_init);
>>> +
>>> +static void vduse_exit(void)
>>> +{
>>> +     vduse_mgmtdev_exit();
>>> +     vduse_domain_exit();
>>> +     destroy_workqueue(vduse_irq_wq);
>>> +     cdev_del(&vduse_cdev);
>>> +     device_destroy(vduse_class, vduse_major);
>>> +     cdev_del(&vduse_ctrl_cdev);
>>> +     unregister_chrdev_region(vduse_major, VDUSE_DEV_MAX);
>>> +     class_destroy(vduse_class);
>>> +}
>>> +module_exit(vduse_exit);
>>> +
>>> +MODULE_LICENSE(DRV_LICENSE);
>>> +MODULE_AUTHOR(DRV_AUTHOR);
>>> +MODULE_DESCRIPTION(DRV_DESC);
>>> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
>>> new file mode 100644
>>> index 000000000000..f21b2e51b5c8
>>> --- /dev/null
>>> +++ b/include/uapi/linux/vduse.h
>>> @@ -0,0 +1,143 @@
>>> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
>>> +#ifndef _UAPI_VDUSE_H_
>>> +#define _UAPI_VDUSE_H_
>>> +
>>> +#include <linux/types.h>
>>> +
>>> +#define VDUSE_API_VERSION    0
>>> +
>>> +#define VDUSE_NAME_MAX       256
>>> +
>>> +/* the control messages definition for read/write */
>>> +
>>> +enum vduse_req_type {
>>> +     /* Get the state for virtqueue from userspace */
>>> +     VDUSE_GET_VQ_STATE,
>>> +     /* Notify userspace to start the dataplane, no reply */
>>> +     VDUSE_START_DATAPLANE,
>>> +     /* Notify userspace to stop the dataplane, no reply */
>>> +     VDUSE_STOP_DATAPLANE,
>>> +     /* Notify userspace to update the memory mapping in device IOTLB */
>>> +     VDUSE_UPDATE_IOTLB,
>>> +};
>>> +
>>> +struct vduse_vq_state {
>>> +     __u32 index; /* virtqueue index */
>>> +     __u32 avail_idx; /* virtqueue state (last_avail_idx) */
>>> +};
>>
>> This needs some tweaks to support packed virtqueue.
>>
> OK.
>
> Thanks,
> Yongji
>
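
As background for the packed virtqueue remark: a split ring's state is a
single last_avail_idx, while a packed ring also has to carry wrap counters.
The sketch below shows one way such a state could be encoded. The struct
and helper names are hypothetical and this layout is not taken from the
patch; it assumes a 15-bit ring index with the wrap counter in the top bit
of a 16-bit value.

```c
#include <stdint.h>

/*
 * Hypothetical state layout, not part of the patch: each position is a
 * 15-bit ring index plus a 1-bit wrap counter, packed into 16 bits.
 */
struct vq_state_packed_model {
	uint16_t last_avail;   /* bit 15: wrap counter, bits 0-14: index */
	uint16_t last_used;    /* same encoding for the used side */
};

/* Combine a ring index and a wrap counter into one 16-bit position. */
uint16_t packed_pos_encode(uint16_t idx, unsigned int wrap)
{
	return (uint16_t)((idx & 0x7fffu) | ((wrap & 1u) << 15));
}

/* Extract the 15-bit ring index from an encoded position. */
uint16_t packed_pos_index(uint16_t pos)
{
	return pos & 0x7fffu;
}

/* Extract the wrap counter (0 or 1) from an encoded position. */
unsigned int packed_pos_wrap(uint16_t pos)
{
	return pos >> 15;
}
```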


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
       [not found]         ` <CACycT3uzMJS7vw6MVMOgY4rb=SPfT2srV+8DPdwUVeELEiJgbA@mail.gmail.com>
@ 2021-06-22  7:49           ` Jason Wang
       [not found]             ` <CACycT3uuooKLNnpPHewGZ=q46Fap2P4XCFirdxxn=FxK+X1ECg@mail.gmail.com>
  0 siblings, 1 reply; 41+ messages in thread
From: Jason Wang @ 2021-06-22  7:49 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, joro, Matthew Wilcox, Christoph Hellwig,
	Dan Carpenter, Al Viro, Stefan Hajnoczi, songmuchun, Jens Axboe,
	Greg KH, Randy Dunlap, linux-kernel, iommu, bcrl, netdev,
	linux-fsdevel, Mika Penttilä


On 2021/6/22 at 3:22 PM, Yongji Xie wrote:
>> We need to find a way to propagate the error to userspace.
>>
>> E.g. if we want to stop the device, will we delay the status reset until
>> we get a response from userspace?
>>
> I didn't get how to delay the status reset. And wouldn't it be a DoS
> that we need to fix if userspace never gives a response?


You're right. So let's first allow set_status() to fail, then propagate
its failure via VHOST_VDPA_SET_STATUS.
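
Roughly, the idea could look like the sketch below. This is a userspace
model with made-up names (status_ops, status_dev, model_set_status(),
model_ioctl_set_status()), not the vdpa_config_ops or vhost-vdpa code, and
-ETIMEDOUT is only an assumed error code. The device side gives up after a
bounded wait instead of stalling forever, and the caller hands the error
back to whoever requested the status change.

```c
#include <errno.h>
#include <stdint.h>

struct status_dev;

/* Hypothetical ops table in which set_status() can report failure. */
struct status_ops {
	int (*set_status)(struct status_dev *dev, uint8_t status);
};

struct status_dev {
	const struct status_ops *ops;
	uint8_t status;
	int backend_alive;   /* stand-in for "userspace answered in time" */
};

/* Device side: refuse the change if the userspace backend did not answer. */
int model_set_status(struct status_dev *dev, uint8_t status)
{
	if (!dev->backend_alive)
		return -ETIMEDOUT;   /* bounded wait, so no indefinite stall */

	dev->status = status;
	return 0;
}

/* Caller side, in the role of a SET_STATUS ioctl handler: pass errors up. */
long model_ioctl_set_status(struct status_dev *dev, uint8_t status)
{
	return dev->ops->set_status(dev, status);
}
```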


>
>>>>> +     }
>>>>> +}
>>>>> +
>>>>> +static size_t vduse_vdpa_get_config_size(struct vdpa_device *vdpa)
>>>>> +{
>>>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>>>> +
>>>>> +     return dev->config_size;
>>>>> +}
>>>>> +
>>>>> +static void vduse_vdpa_get_config(struct vdpa_device *vdpa, unsigned int offset,
>>>>> +                               void *buf, unsigned int len)
>>>>> +{
>>>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>>>> +
>>>>> +     memcpy(buf, dev->config + offset, len);
>>>>> +}
>>>>> +
>>>>> +static void vduse_vdpa_set_config(struct vdpa_device *vdpa, unsigned int offset,
>>>>> +                     const void *buf, unsigned int len)
>>>>> +{
>>>>> +     /* Now we only support read-only configuration space */
>>>>> +}
>>>>> +
>>>>> +static u32 vduse_vdpa_get_generation(struct vdpa_device *vdpa)
>>>>> +{
>>>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>>>> +
>>>>> +     return dev->generation;
>>>>> +}
>>>>> +
>>>>> +static int vduse_vdpa_set_map(struct vdpa_device *vdpa,
>>>>> +                             struct vhost_iotlb *iotlb)
>>>>> +{
>>>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>>>> +     int ret;
>>>>> +
>>>>> +     ret = vduse_domain_set_map(dev->domain, iotlb);
>>>>> +     if (ret)
>>>>> +             return ret;
>>>>> +
>>>>> +     ret = vduse_dev_update_iotlb(dev, 0ULL, ULLONG_MAX);
>>>>> +     if (ret) {
>>>>> +             vduse_domain_clear_map(dev->domain, iotlb);
>>>>> +             return ret;
>>>>> +     }
>>>>> +
>>>>> +     return 0;
>>>>> +}
>>>>> +
>>>>> +static void vduse_vdpa_free(struct vdpa_device *vdpa)
>>>>> +{
>>>>> +     struct vduse_dev *dev = vdpa_to_vduse(vdpa);
>>>>> +
>>>>> +     dev->vdev = NULL;
>>>>> +}
>>>>> +
>>>>> +static const struct vdpa_config_ops vduse_vdpa_config_ops = {
>>>>> +     .set_vq_address         = vduse_vdpa_set_vq_address,
>>>>> +     .kick_vq                = vduse_vdpa_kick_vq,
>>>>> +     .set_vq_cb              = vduse_vdpa_set_vq_cb,
>>>>> +     .set_vq_num             = vduse_vdpa_set_vq_num,
>>>>> +     .set_vq_ready           = vduse_vdpa_set_vq_ready,
>>>>> +     .get_vq_ready           = vduse_vdpa_get_vq_ready,
>>>>> +     .set_vq_state           = vduse_vdpa_set_vq_state,
>>>>> +     .get_vq_state           = vduse_vdpa_get_vq_state,
>>>>> +     .get_vq_align           = vduse_vdpa_get_vq_align,
>>>>> +     .get_features           = vduse_vdpa_get_features,
>>>>> +     .set_features           = vduse_vdpa_set_features,
>>>>> +     .set_config_cb          = vduse_vdpa_set_config_cb,
>>>>> +     .get_vq_num_max         = vduse_vdpa_get_vq_num_max,
>>>>> +     .get_device_id          = vduse_vdpa_get_device_id,
>>>>> +     .get_vendor_id          = vduse_vdpa_get_vendor_id,
>>>>> +     .get_status             = vduse_vdpa_get_status,
>>>>> +     .set_status             = vduse_vdpa_set_status,
>>>>> +     .get_config_size        = vduse_vdpa_get_config_size,
>>>>> +     .get_config             = vduse_vdpa_get_config,
>>>>> +     .set_config             = vduse_vdpa_set_config,
>>>>> +     .get_generation         = vduse_vdpa_get_generation,
>>>>> +     .set_map                = vduse_vdpa_set_map,
>>>>> +     .free                   = vduse_vdpa_free,
>>>>> +};
>>>>> +
>>>>> +static dma_addr_t vduse_dev_map_page(struct device *dev, struct page *page,
>>>>> +                                  unsigned long offset, size_t size,
>>>>> +                                  enum dma_data_direction dir,
>>>>> +                                  unsigned long attrs)
>>>>> +{
>>>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>>>> +
>>>>> +     return vduse_domain_map_page(domain, page, offset, size, dir, attrs);
>>>>> +}
>>>>> +
>>>>> +static void vduse_dev_unmap_page(struct device *dev, dma_addr_t dma_addr,
>>>>> +                             size_t size, enum dma_data_direction dir,
>>>>> +                             unsigned long attrs)
>>>>> +{
>>>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>>>> +
>>>>> +     return vduse_domain_unmap_page(domain, dma_addr, size, dir, attrs);
>>>>> +}
>>>>> +
>>>>> +static void *vduse_dev_alloc_coherent(struct device *dev, size_t size,
>>>>> +                                     dma_addr_t *dma_addr, gfp_t flag,
>>>>> +                                     unsigned long attrs)
>>>>> +{
>>>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>>>> +     unsigned long iova;
>>>>> +     void *addr;
>>>>> +
>>>>> +     *dma_addr = DMA_MAPPING_ERROR;
>>>>> +     addr = vduse_domain_alloc_coherent(domain, size,
>>>>> +                             (dma_addr_t *)&iova, flag, attrs);
>>>>> +     if (!addr)
>>>>> +             return NULL;
>>>>> +
>>>>> +     *dma_addr = (dma_addr_t)iova;
>>>>> +
>>>>> +     return addr;
>>>>> +}
>>>>> +
>>>>> +static void vduse_dev_free_coherent(struct device *dev, size_t size,
>>>>> +                                     void *vaddr, dma_addr_t dma_addr,
>>>>> +                                     unsigned long attrs)
>>>>> +{
>>>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>>>> +
>>>>> +     vduse_domain_free_coherent(domain, size, vaddr, dma_addr, attrs);
>>>>> +}
>>>>> +
>>>>> +static size_t vduse_dev_max_mapping_size(struct device *dev)
>>>>> +{
>>>>> +     struct vduse_dev *vdev = dev_to_vduse(dev);
>>>>> +     struct vduse_iova_domain *domain = vdev->domain;
>>>>> +
>>>>> +     return domain->bounce_size;
>>>>> +}
>>>>> +
>>>>> +static const struct dma_map_ops vduse_dev_dma_ops = {
>>>>> +     .map_page = vduse_dev_map_page,
>>>>> +     .unmap_page = vduse_dev_unmap_page,
>>>>> +     .alloc = vduse_dev_alloc_coherent,
>>>>> +     .free = vduse_dev_free_coherent,
>>>>> +     .max_mapping_size = vduse_dev_max_mapping_size,
>>>>> +};
>>>>> +
>>>>> +static unsigned int perm_to_file_flags(u8 perm)
>>>>> +{
>>>>> +     unsigned int flags = 0;
>>>>> +
>>>>> +     switch (perm) {
>>>>> +     case VDUSE_ACCESS_WO:
>>>>> +             flags |= O_WRONLY;
>>>>> +             break;
>>>>> +     case VDUSE_ACCESS_RO:
>>>>> +             flags |= O_RDONLY;
>>>>> +             break;
>>>>> +     case VDUSE_ACCESS_RW:
>>>>> +             flags |= O_RDWR;
>>>>> +             break;
>>>>> +     default:
>>>>> +             WARN(1, "invalid vhost IOTLB permission\n");
>>>>> +             break;
>>>>> +     }
>>>>> +
>>>>> +     return flags;
>>>>> +}
>>>>> +
>>>>> +static int vduse_kickfd_setup(struct vduse_dev *dev,
>>>>> +                     struct vduse_vq_eventfd *eventfd)
>>>>> +{
>>>>> +     struct eventfd_ctx *ctx = NULL;
>>>>> +     struct vduse_virtqueue *vq;
>>>>> +     u32 index;
>>>>> +
>>>>> +     if (eventfd->index >= dev->vq_num)
>>>>> +             return -EINVAL;
>>>>> +
>>>>> +     index = array_index_nospec(eventfd->index, dev->vq_num);
>>>>> +     vq = &dev->vqs[index];
>>>>> +     if (eventfd->fd >= 0) {
>>>>> +             ctx = eventfd_ctx_fdget(eventfd->fd);
>>>>> +             if (IS_ERR(ctx))
>>>>> +                     return PTR_ERR(ctx);
>>>>> +     } else if (eventfd->fd != VDUSE_EVENTFD_DEASSIGN)
>>>>> +             return 0;
>>>>> +
>>>>> +     spin_lock(&vq->kick_lock);
>>>>> +     if (vq->kickfd)
>>>>> +             eventfd_ctx_put(vq->kickfd);
>>>>> +     vq->kickfd = ctx;
>>>>> +     if (vq->ready && vq->kicked && vq->kickfd) {
>>>>> +             eventfd_signal(vq->kickfd, 1);
>>>>> +             vq->kicked = false;
>>>>> +     }
>>>>> +     spin_unlock(&vq->kick_lock);
>>>>> +
>>>>> +     return 0;
>>>>> +}
>>>>> +
>>>>> +static void vduse_dev_irq_inject(struct work_struct *work)
>>>>> +{
>>>>> +     struct vduse_dev *dev = container_of(work, struct vduse_dev, inject);
>>>>> +
>>>>> +     spin_lock_irq(&dev->irq_lock);
>>>>> +     if (dev->config_cb.callback)
>>>>> +             dev->config_cb.callback(dev->config_cb.private);
>>>>> +     spin_unlock_irq(&dev->irq_lock);
>>>>> +}
>>>>> +
>>>>> +static void vduse_vq_irq_inject(struct work_struct *work)
>>>>> +{
>>>>> +     struct vduse_virtqueue *vq = container_of(work,
>>>>> +                                     struct vduse_virtqueue, inject);
>>>>> +
>>>>> +     spin_lock_irq(&vq->irq_lock);
>>>>> +     if (vq->ready && vq->cb.callback)
>>>>> +             vq->cb.callback(vq->cb.private);
>>>>> +     spin_unlock_irq(&vq->irq_lock);
>>>>> +}
>>>>> +
>>>>> +static long vduse_dev_ioctl(struct file *file, unsigned int cmd,
>>>>> +                         unsigned long arg)
>>>>> +{
>>>>> +     struct vduse_dev *dev = file->private_data;
>>>>> +     void __user *argp = (void __user *)arg;
>>>>> +     int ret;
>>>>> +
>>>>> +     switch (cmd) {
>>>>> +     case VDUSE_IOTLB_GET_FD: {
>>>>> +             struct vduse_iotlb_entry entry;
>>>>> +             struct vhost_iotlb_map *map;
>>>>> +             struct vdpa_map_file *map_file;
>>>>> +             struct vduse_iova_domain *domain = dev->domain;
>>>>> +             struct file *f = NULL;
>>>>> +
>>>>> +             ret = -EFAULT;
>>>>> +             if (copy_from_user(&entry, argp, sizeof(entry)))
>>>>> +                     break;
>>>>> +
>>>>> +             ret = -EINVAL;
>>>>> +             if (entry.start > entry.last)
>>>>> +                     break;
>>>>> +
>>>>> +             spin_lock(&domain->iotlb_lock);
>>>>> +             map = vhost_iotlb_itree_first(domain->iotlb,
>>>>> +                                           entry.start, entry.last);
>>>>> +             if (map) {
>>>>> +                     map_file = (struct vdpa_map_file *)map->opaque;
>>>>> +                     f = get_file(map_file->file);
>>>>> +                     entry.offset = map_file->offset;
>>>>> +                     entry.start = map->start;
>>>>> +                     entry.last = map->last;
>>>>> +                     entry.perm = map->perm;
>>>>> +             }
>>>>> +             spin_unlock(&domain->iotlb_lock);
>>>>> +             ret = -EINVAL;
>>>>> +             if (!f)
>>>>> +                     break;
>>>>> +
>>>>> +             ret = -EFAULT;
>>>>> +             if (copy_to_user(argp, &entry, sizeof(entry))) {
>>>>> +                     fput(f);
>>>>> +                     break;
>>>>> +             }
>>>>> +             ret = receive_fd(f, perm_to_file_flags(entry.perm));
>>>>> +             fput(f);
>>>>> +             break;
>>>>> +     }
>>>>> +     case VDUSE_DEV_GET_FEATURES:
>>>>> +             ret = put_user(dev->features, (u64 __user *)argp);
>>>>> +             break;
>>>>> +     case VDUSE_DEV_UPDATE_CONFIG: {
>>>>> +             struct vduse_config_update config;
>>>>> +             unsigned long size = offsetof(struct vduse_config_update,
>>>>> +                                           buffer);
>>>>> +
>>>>> +             ret = -EFAULT;
>>>>> +             if (copy_from_user(&config, argp, size))
>>>>> +                     break;
>>>>> +
>>>>> +             ret = -EINVAL;
>>>>> +             if (config.length == 0 ||
>>>>> +                 config.length > dev->config_size - config.offset)
>>>>> +                     break;
>>>>> +
>>>>> +             ret = -EFAULT;
>>>>> +             if (copy_from_user(dev->config + config.offset, argp + size,
>>>>> +                                config.length))
>>>>> +                     break;
>>>>> +
>>>>> +             ret = 0;
>>>>> +             queue_work(vduse_irq_wq, &dev->inject);
>>>> I wonder if it's better to separate config interrupt out of config
>>>> update or we need document this.
>>>>
>>> I have documented it in the docs. Looks like a config update should be
>>> always followed by a config interrupt. I didn't find a case that uses
>>> them separately.
>> The uAPI doesn't prevent us from the following scenario:
>>
>> update_config(mac[0], ..);
>> update_config(mac[1], ..);
>>
>> So it looks to me it's better to separate the config interrupt from the
>> config updating.
>>
> Fine.
>
>>>>> +             break;
>>>>> +     }
>>>>> +     case VDUSE_VQ_GET_INFO: {
>>>> Do we need to limit this only when DRIVER_OK is set?
>>>>
>>> Any reason to add this limitation?
>> Otherwise the vq is not fully initialized, e.g the desc_addr might not
>> be correct.
>>
> The vq_info->ready can be used to tell userspace whether the vq is
> initialized or not.


Yes, this will work as well.
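
For reference, a minimal userspace sketch of that check. It assumes the
vduse_vq_info layout from the v8 uapi header in this series (copied
locally so the snippet stands alone); vq_is_usable() is a hypothetical
helper, not something the patch provides:

```c
#include <stdbool.h>
#include <stdint.h>

/* Local copy of struct vduse_vq_info as posted in the v8 uapi header. */
struct vduse_vq_info {
	uint32_t index;       /* virtqueue index */
	uint32_t avail_idx;   /* virtqueue state (last_avail_idx) */
	uint64_t desc_addr;   /* address of desc area */
	uint64_t driver_addr; /* address of driver area */
	uint64_t device_addr; /* address of device area */
	uint32_t num;         /* the size of virtqueue */
	uint8_t ready;        /* ready status of virtqueue */
};

/*
 * Hypothetical helper: the ring addresses are only trusted once the
 * driver has marked the virtqueue ready.
 */
static bool vq_is_usable(const struct vduse_vq_info *info)
{
	return info->ready != 0;
}
```

A daemon would fill the structure with ioctl(VDUSE_VQ_GET_INFO) and skip
any virtqueue for which the helper returns false.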

Thanks


>
> Thanks,
> Yongji
>

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
       [not found]             ` <CACycT3uuooKLNnpPHewGZ=q46Fap2P4XCFirdxxn=FxK+X1ECg@mail.gmail.com>
@ 2021-06-23  3:30               ` Jason Wang
       [not found]                 ` <CACycT3u8=_D3hCtJR+d5BgeUQMce6S7c_6P3CVfvWfYhCQeXFA@mail.gmail.com>
  0 siblings, 1 reply; 41+ messages in thread
From: Jason Wang @ 2021-06-23  3:30 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, joro, Matthew Wilcox, Christoph Hellwig,
	Dan Carpenter, Al Viro, Stefan Hajnoczi, songmuchun, Jens Axboe,
	Greg KH, Randy Dunlap, linux-kernel, iommu, bcrl, netdev,
	linux-fsdevel, Mika Penttilä


On 2021/6/22 4:14 PM, Yongji Xie wrote:
> On Tue, Jun 22, 2021 at 3:50 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/6/22 3:22 PM, Yongji Xie wrote:
>>>> We need to find a way to propagate the error to the userspace.
>>>>
>>>> E.g. if we want to stop the device, we will delay the status reset until
>>>> we get a response from the userspace?
>>>>
>>> I didn't get how to delay the status reset. And should it be a DoS
>>> that we want to fix if the userspace doesn't give a response forever?
>>
>> You're right. So let's make set_status() can fail first, then propagate
>> its failure via VHOST_VDPA_SET_STATUS.
>>
> OK. So we only need to propagate the failure in the vhost-vdpa case, right?


I think not, we need to deal with the reset for virtio as well:

E.g. in register_virtio_device(), we have:

        /* We always start by resetting the device, in case a previous
         * driver messed it up.  This also tests that code path a little. */
        dev->config->reset(dev);

We probably need to allow reset to fail, and then fail
register_virtio_device() as well.
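
A rough sketch of that idea, modelled in plain C with simplified
stand-ins for the kernel structures. The int return type on reset() and
the *_sketch/toy names are assumptions for illustration only; the real
virtio_config_ops->reset() returns void at this point:

```c
#include <errno.h>

struct virtio_device;

/* Simplified stand-in: only the callback discussed here. */
struct virtio_config_ops {
	/* Assumed change: reset() reports 0 or a negative errno. */
	int (*reset)(struct virtio_device *dev);
};

struct virtio_device {
	const struct virtio_config_ops *config;
	int registered;
};

/* Model of register_virtio_device(): give up if the initial reset fails. */
static int register_virtio_device_sketch(struct virtio_device *dev)
{
	int err;

	err = dev->config->reset(dev);
	if (err)
		return err;

	dev->registered = 1;
	return 0;
}

/* Toy transports: one resets cleanly, one never completes the reset. */
static int toy_reset_ok(struct virtio_device *dev)
{
	(void)dev;
	return 0;
}

static int toy_reset_timeout(struct virtio_device *dev)
{
	(void)dev;
	return -ETIMEDOUT;
}
```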

Thanks


>
> Thanks,
> Yongji
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
       [not found]                 ` <CACycT3u8=_D3hCtJR+d5BgeUQMce6S7c_6P3CVfvWfYhCQeXFA@mail.gmail.com>
@ 2021-06-24  3:34                   ` Jason Wang
       [not found]                     ` <CACycT3uCSLUDVpQHdrmuxSuoBDg-4n22t+N-Jm2GoNNp9JYB2w@mail.gmail.com>
  0 siblings, 1 reply; 41+ messages in thread
From: Jason Wang @ 2021-06-24  3:34 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, joro, Matthew Wilcox, Christoph Hellwig,
	Dan Carpenter, Al Viro, Stefan Hajnoczi, songmuchun, Jens Axboe,
	Greg KH, Randy Dunlap, linux-kernel, iommu, bcrl, netdev,
	linux-fsdevel, Mika Penttilä


On 2021/6/23 1:50 PM, Yongji Xie wrote:
> On Wed, Jun 23, 2021 at 11:31 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/6/22 4:14 PM, Yongji Xie wrote:
>>> On Tue, Jun 22, 2021 at 3:50 PM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/6/22 3:22 PM, Yongji Xie wrote:
>>>>>> We need to find a way to propagate the error to the userspace.
>>>>>>
>>>>>> E.g. if we want to stop the device, we will delay the status reset until
>>>>>> we get a response from the userspace?
>>>>>>
>>>>> I didn't get how to delay the status reset. And should it be a DoS
>>>>> that we want to fix if the userspace doesn't give a response forever?
>>>> You're right. So let's make set_status() can fail first, then propagate
>>>> its failure via VHOST_VDPA_SET_STATUS.
>>>>
>>> OK. So we only need to propagate the failure in the vhost-vdpa case, right?
>>
>> I think not, we need to deal with the reset for virtio as well:
>>
>> E.g. in register_virtio_device(), we have:
>>
>>           /* We always start by resetting the device, in case a previous
>>            * driver messed it up.  This also tests that code path a little. */
>>           dev->config->reset(dev);
>>
>> We probably need to allow reset to fail, and then fail
>> register_virtio_device() as well.
>>
> OK, looks like virtio_add_status() and virtio_device_ready()[1] should
> be also modified if we need to propagate the failure in the
> virtio-vdpa case. Or do we only need to care about the reset case?
>
> [1] https://lore.kernel.org/lkml/20210517093428.670-1-xieyongji@bytedance.com/


My understanding is DRIVER_OK is not something that needs to be validated:

"

DRIVER_OK (4)
Indicates that the driver is set up and ready to drive the device.

"

The spec doesn't require the driver to re-read the status and check
whether DRIVER_OK is set; see 3.1.1 Driver Requirements: Device
Initialization.

It's more about "telling the device that driver is ready."

But we do have some status bits that require synchronization with
the device.

1) FEATURES_OK: the spec requires re-reading the status bit to check
whether or not it was set by the device:

"

Re-read device status to ensure the FEATURES_OK bit is still set: 
otherwise, the device does not support our subset of features and the 
device is unusable.

"

This is useful for some device which can only support a subset of the 
features. E.g a device that can only work for packed virtqueue. This 
means the current design of set_features won't work, we need either:

1a) relay the set_features request to userspace

or

1b) introduce a mandated_device_features during device creation and 
validate the driver features during the set_features(), and don't set 
FEATURES_OK if they don't match.


2) Some transports (PCI) require re-reading the status to ensure
synchronization.

"

After writing 0 to device_status, the driver MUST wait for a read of 
device_status to return 0 before reinitializing the device.

"

So we need to deal with both FEATURES_OK and reset, but probably not 
DRIVER_OK.
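
To make option 1b concrete, here is a small model in plain C. The
mandated_features field and the sketch_* names are hypothetical (nothing
in v8 provides them); the only fixed value is FEATURES_OK = 8 from the
virtio spec. The point is that set_status() quietly leaves FEATURES_OK
clear when the driver did not accept a mandated feature, so the driver's
re-read of the status detects the mismatch:

```c
#include <stdint.h>

#define VIRTIO_CONFIG_S_FEATURES_OK 8 /* device status bit, per virtio spec */

struct vduse_dev_sketch {
	uint64_t device_features;   /* features offered by the device */
	uint64_t mandated_features; /* hypothetical: driver must accept all of these */
	uint64_t driver_features;   /* features accepted by the driver */
	uint8_t status;
};

/* Record the driver's choice, limited to what the device offers. */
static void sketch_set_features(struct vduse_dev_sketch *dev, uint64_t features)
{
	dev->driver_features = features & dev->device_features;
}

/* Drop FEATURES_OK from the written status if a mandated feature is missing. */
static void sketch_set_status(struct vduse_dev_sketch *dev, uint8_t status)
{
	uint64_t missing = dev->mandated_features & ~dev->driver_features;

	if ((status & VIRTIO_CONFIG_S_FEATURES_OK) && missing)
		status = (uint8_t)(status & ~VIRTIO_CONFIG_S_FEATURES_OK);

	dev->status = status;
}
```

With a device that mandates VIRTIO_F_RING_PACKED (bit 34), a driver that
only accepts VIRTIO_F_VERSION_1 (bit 32) would read back a status without
FEATURES_OK and treat the device as unusable, which is what the spec text
quoted above asks for.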

Thanks


>
> Thanks,
> Yongji
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
       [not found]                     ` <CACycT3uCSLUDVpQHdrmuxSuoBDg-4n22t+N-Jm2GoNNp9JYB2w@mail.gmail.com>
@ 2021-06-24  8:13                       ` Jason Wang
       [not found]                         ` <CACycT3tS=10kcUCNGYm=dUZsK+vrHzDvB3FSwAzuJCu3t+QuUQ@mail.gmail.com>
  0 siblings, 1 reply; 41+ messages in thread
From: Jason Wang @ 2021-06-24  8:13 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, joro, Matthew Wilcox, Christoph Hellwig,
	Dan Carpenter, Al Viro, Stefan Hajnoczi, songmuchun, Jens Axboe,
	Greg KH, Randy Dunlap, linux-kernel, iommu, bcrl, netdev,
	linux-fsdevel, Mika Penttilä


On 2021/6/24 12:46 PM, Yongji Xie wrote:
>> So we need to deal with both FEATURES_OK and reset, but probably not
>> DRIVER_OK.
>>
> OK, I see. Thanks for the explanation. One more question is how about
> clearing the corresponding status bit in get_status() rather than
> making set_status() fail. Since the spec recommends this way for
> validation which is done in virtio_dev_remove() and
> virtio_finalize_features().
>
> Thanks,
> Yongji
>

I think you can. Or it would be even better that we just don't set the 
bit during set_status().

I just realized that in vdpa_reset() we have:

static inline void vdpa_reset(struct vdpa_device *vdev)
{
         const struct vdpa_config_ops *ops = vdev->config;

         vdev->features_valid = false;
         ops->set_status(vdev, 0);
}

We probably need to add the synchronization here. E.g re-read with a 
timeout.
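
Something along these lines, as a userspace model rather than a patch.
The int return value, RESET_MAX_POLLS and the toy device are assumptions
for illustration; a kernel version would sleep between polls instead of
spinning:

```c
#include <errno.h>
#include <stdint.h>

struct vdpa_device;

/* Simplified stand-in with only the two callbacks used here. */
struct vdpa_config_ops {
	void (*set_status)(struct vdpa_device *vdev, uint8_t status);
	uint8_t (*get_status)(struct vdpa_device *vdev);
};

struct vdpa_device {
	const struct vdpa_config_ops *config;
	int features_valid;
	/* Toy device state, not part of the real structure. */
	uint8_t status;
	int reset_pending;
	int polls_left; /* reads needed before the reset completes; -1 = never */
};

#define RESET_MAX_POLLS 100 /* assumed bound on the re-read loop */

/* Write 0, then re-read the status until it reads 0 or the bound is hit. */
static int vdpa_reset_sketch(struct vdpa_device *vdev)
{
	const struct vdpa_config_ops *ops = vdev->config;
	int i;

	vdev->features_valid = 0;
	ops->set_status(vdev, 0);

	for (i = 0; i < RESET_MAX_POLLS; i++) {
		if (ops->get_status(vdev) == 0)
			return 0;
	}
	return -ETIMEDOUT;
}

/* Toy device: a reset request only takes effect after some status reads. */
static void toy_set_status(struct vdpa_device *vdev, uint8_t status)
{
	if (status == 0)
		vdev->reset_pending = 1;
	else
		vdev->status = status;
}

static uint8_t toy_get_status(struct vdpa_device *vdev)
{
	if (vdev->reset_pending && vdev->polls_left == 0) {
		vdev->status = 0;
		vdev->reset_pending = 0;
	} else if (vdev->reset_pending && vdev->polls_left > 0) {
		vdev->polls_left--;
	}
	return vdev->status;
}
```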

Thanks

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE
       [not found] ` <20210615141331.407-11-xieyongji@bytedance.com>
@ 2021-06-24 13:01   ` Stefan Hajnoczi
       [not found]     ` <CACycT3uxnQmXWsgmNVxQtiRhz1UXXTAJFY3OiAJqokbJH6ifMA@mail.gmail.com>
  0 siblings, 1 reply; 41+ messages in thread
From: Stefan Hajnoczi @ 2021-06-24 13:01 UTC (permalink / raw)
  To: Xie Yongji
  Cc: kvm, mst, virtualization, christian.brauner, corbet, joro, willy,
	hch, dan.carpenter, viro, songmuchun, axboe, gregkh, rdunlap,
	linux-kernel, iommu, bcrl, netdev, linux-fsdevel, mika.penttila


On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote:
> VDUSE (vDPA Device in Userspace) is a framework to support
> implementing software-emulated vDPA devices in userspace. This
> document is intended to clarify the VDUSE design and usage.
> 
> Signed-off-by: Xie Yongji <xieyongji@bytedance.com>
> ---
>  Documentation/userspace-api/index.rst |   1 +
>  Documentation/userspace-api/vduse.rst | 222 ++++++++++++++++++++++++++++++++++
>  2 files changed, 223 insertions(+)
>  create mode 100644 Documentation/userspace-api/vduse.rst
> 
> diff --git a/Documentation/userspace-api/index.rst b/Documentation/userspace-api/index.rst
> index 0b5eefed027e..c432be070f67 100644
> --- a/Documentation/userspace-api/index.rst
> +++ b/Documentation/userspace-api/index.rst
> @@ -27,6 +27,7 @@ place where this information is gathered.
>     iommu
>     media/index
>     sysfs-platform_profile
> +   vduse
>  
>  .. only::  subproject and html
>  
> diff --git a/Documentation/userspace-api/vduse.rst b/Documentation/userspace-api/vduse.rst
> new file mode 100644
> index 000000000000..2f9cd1a4e530
> --- /dev/null
> +++ b/Documentation/userspace-api/vduse.rst
> @@ -0,0 +1,222 @@
> +==================================
> +VDUSE - "vDPA Device in Userspace"
> +==================================
> +
> +vDPA (virtio data path acceleration) device is a device that uses a
> +datapath which complies with the virtio specifications with vendor
> +specific control path. vDPA devices can be both physically located on
> +the hardware or emulated by software. VDUSE is a framework that makes it
> +possible to implement software-emulated vDPA devices in userspace. And
> +to make it simple, the emulated vDPA device's control path is handled in
> +the kernel and only the data path is implemented in the userspace.
> +
> +Note that only virtio block device is supported by VDUSE framework now,
> +which can reduce security risks when the userspace process that implements
> +the data path is run by an unprivileged user. The Support for other device
> +types can be added after the security issue is clarified or fixed in the future.
> +
> +Start/Stop VDUSE devices
> +------------------------
> +
> +VDUSE devices are started as follows:
> +
> +1. Create a new VDUSE instance with ioctl(VDUSE_CREATE_DEV) on
> +   /dev/vduse/control.
> +
> +2. Begin processing VDUSE messages from /dev/vduse/$NAME. The first
> +   messages will arrive while attaching the VDUSE instance to vDPA bus.
> +
> +3. Send the VDPA_CMD_DEV_NEW netlink message to attach the VDUSE
> +   instance to vDPA bus.
> +
> +VDUSE devices are stopped as follows:
> +
> +1. Send the VDPA_CMD_DEV_DEL netlink message to detach the VDUSE
> +   instance from vDPA bus.
> +
> +2. Close the file descriptor referring to /dev/vduse/$NAME
> +
> +3. Destroy the VDUSE instance with ioctl(VDUSE_DESTROY_DEV) on
> +   /dev/vduse/control
> +
> +The netlink messages metioned above can be sent via vdpa tool in iproute2
> +or use the below sample codes:
> +
> +.. code-block:: c
> +
> +	static int netlink_add_vduse(const char *name, enum vdpa_command cmd)
> +	{
> +		struct nl_sock *nlsock;
> +		struct nl_msg *msg;
> +		int famid;
> +
> +		nlsock = nl_socket_alloc();
> +		if (!nlsock)
> +			return -ENOMEM;
> +
> +		if (genl_connect(nlsock))
> +			goto free_sock;
> +
> +		famid = genl_ctrl_resolve(nlsock, VDPA_GENL_NAME);
> +		if (famid < 0)
> +			goto close_sock;
> +
> +		msg = nlmsg_alloc();
> +		if (!msg)
> +			goto close_sock;
> +
> +		if (!genlmsg_put(msg, NL_AUTO_PORT, NL_AUTO_SEQ, famid, 0, 0, cmd, 0))
> +			goto nla_put_failure;
> +
> +		NLA_PUT_STRING(msg, VDPA_ATTR_DEV_NAME, name);
> +		if (cmd == VDPA_CMD_DEV_NEW)
> +			NLA_PUT_STRING(msg, VDPA_ATTR_MGMTDEV_DEV_NAME, "vduse");
> +
> +		if (nl_send_sync(nlsock, msg))
> +			goto close_sock;
> +
> +		nl_close(nlsock);
> +		nl_socket_free(nlsock);
> +
> +		return 0;
> +	nla_put_failure:
> +		nlmsg_free(msg);
> +	close_sock:
> +		nl_close(nlsock);
> +	free_sock:
> +		nl_socket_free(nlsock);
> +		return -1;
> +	}
> +
> +How VDUSE works
> +---------------
> +
> +Since the emuldated vDPA device's control path is handled in the kernel,

s/emuldated/emulated/

> +a message-based communication protocol and few types of control messages
> +are introduced by VDUSE framework to make userspace be aware of the data
> +path related changes:
> +
> +- VDUSE_GET_VQ_STATE: Get the state for virtqueue from userspace
> +
> +- VDUSE_START_DATAPLANE: Notify userspace to start the dataplane
> +
> +- VDUSE_STOP_DATAPLANE: Notify userspace to stop the dataplane
> +
> +- VDUSE_UPDATE_IOTLB: Notify userspace to update the memory mapping in device IOTLB
> +
> +Userspace needs to read()/write() on /dev/vduse/$NAME to receive/reply
> +those control messages from/to VDUSE kernel module as follows:
> +
> +.. code-block:: c
> +
> +	static int vduse_message_handler(int dev_fd)
> +	{
> +		int len;
> +		struct vduse_dev_request req;
> +		struct vduse_dev_response resp;
> +
> +		len = read(dev_fd, &req, sizeof(req));
> +		if (len != sizeof(req))
> +			return -1;
> +
> +		resp.request_id = req.request_id;
> +
> +		switch (req.type) {
> +
> +		/* handle different types of message */
> +
> +		}
> +
> +		if (req.flags & VDUSE_REQ_FLAGS_NO_REPLY)
> +			return 0;
> +
> +		len = write(dev_fd, &resp, sizeof(resp));
> +		if (len != sizeof(resp))
> +			return -1;
> +
> +		return 0;
> +	}
> +
> +After VDUSE_START_DATAPLANE messages is received, userspace should start the
> +dataplane processing with the help of some ioctls on /dev/vduse/$NAME:
> +
> +- VDUSE_IOTLB_GET_FD: get the file descriptor to the first overlapped iova region.
> +  Userspace can access this iova region by passing fd and corresponding size, offset,
> +  perm to mmap(). For example:
> +
> +.. code-block:: c
> +
> +	static int perm_to_prot(uint8_t perm)
> +	{
> +		int prot = 0;
> +
> +		switch (perm) {
> +		case VDUSE_ACCESS_WO:
> +			prot |= PROT_WRITE;
> +			break;
> +		case VDUSE_ACCESS_RO:
> +			prot |= PROT_READ;
> +			break;
> +		case VDUSE_ACCESS_RW:
> +			prot |= PROT_READ | PROT_WRITE;
> +			break;
> +		}
> +
> +		return prot;
> +	}
> +
> +	static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len)
> +	{
> +		int fd;
> +		void *addr;
> +		size_t size;
> +		struct vduse_iotlb_entry entry;
> +
> +		entry.start = iova;
> +		entry.last = iova + 1;

Why +1?

I expected the request to include *len so that VDUSE can create a bounce
buffer for the full iova range, if necessary.
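
To illustrate what I mean, a sketch of building the request from
(iova, len). It assumes the range is closed, [start, last], which is what
the sample's "size = entry.last - entry.start + 1" suggests;
iova_range_from_len() and struct iova_range are hypothetical names:

```c
#include <stdint.h>

/* Only the range fields of struct vduse_iotlb_entry matter here. */
struct iova_range {
	uint64_t start; /* first byte of the range */
	uint64_t last;  /* last byte of the range (inclusive) */
};

/*
 * Fill a closed range covering len bytes starting at iova.
 * Returns 0 on success, -1 if len is 0 or the range would wrap.
 */
static int iova_range_from_len(uint64_t iova, uint64_t len,
			       struct iova_range *r)
{
	if (len == 0 || iova > UINT64_MAX - (len - 1))
		return -1;

	r->start = iova;
	r->last = iova + len - 1;
	return 0;
}
```

The caller would copy start/last into the vduse_iotlb_entry before
calling ioctl(VDUSE_IOTLB_GET_FD).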

> +		fd = ioctl(dev_fd, VDUSE_IOTLB_GET_FD, &entry);
> +		if (fd < 0)
> +			return NULL;
> +
> +		size = entry.last - entry.start + 1;
> +		*len = entry.last - iova + 1;
> +		addr = mmap(0, size, perm_to_prot(entry.perm), MAP_SHARED,
> +			    fd, entry.offset);
> +		close(fd);
> +		if (addr == MAP_FAILED)
> +			return NULL;
> +
> +		/* do something to cache this iova region */

How is userspace expected to manage iotlb mmaps? When should munmap(2)
be called?

Should userspace expect VDUSE_IOTLB_GET_FD to return a full chunk of
guest RAM (e.g. multiple gigabytes) that can be cached permanently or
will it return just enough pages to cover [start, last)?

> +
> +		return addr + iova - entry.start;
> +	}
> +
> +- VDUSE_DEV_GET_FEATURES: Get the negotiated features

Are these VIRTIO feature bits? Please explain how feature negotiation
works. There must be a way for userspace to report the device's
supported feature bits to the kernel.

> +- VDUSE_DEV_UPDATE_CONFIG: Update the configuration space and inject a config interrupt

Does this mean the contents of the configuration space are cached by
VDUSE? The downside is that the userspace code cannot generate the
contents on demand. Most devices don't need to generate the contents
on demand, so I think this is okay but I had expected a different
interface:

kernel->userspace VDUSE_DEV_GET_CONFIG
userspace->kernel VDUSE_DEV_INJECT_CONFIG_IRQ

I think you can leave it the way it is, but I wanted to mention this in
case someone thinks it's important to support generating the contents of
the configuration space on demand.
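
For comparison, a toy model of a cached configuration space where the
update and the interrupt are separate steps, so several fields can be
written before a single interrupt is injected. All names (cfg_sketch,
cfg_update, cfg_inject_irq, CFG_SIZE) are made up for illustration. The
bounds check also rejects an offset beyond the end of the space, so the
unsigned subtraction cannot wrap:

```c
#include <stdint.h>
#include <string.h>

#define CFG_SIZE 8u /* assumed size of the toy configuration space */

struct cfg_sketch {
	uint8_t config[CFG_SIZE]; /* cached configuration space */
	unsigned int irq_count;   /* config interrupts injected so far */
};

/* Copy length bytes into the cache at offset; -1 if out of bounds. */
static int cfg_update(struct cfg_sketch *d, uint32_t offset, uint32_t length,
		      const uint8_t *buf)
{
	if (offset > CFG_SIZE || length == 0 || length > CFG_SIZE - offset)
		return -1;

	memcpy(d->config + offset, buf, length);
	return 0;
}

/* Separate step: tell the driver that the configuration changed. */
static void cfg_inject_irq(struct cfg_sketch *d)
{
	d->irq_count++;
}
```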

> +- VDUSE_VQ_GET_INFO: Get the specified virtqueue's metadata
> +
> +- VDUSE_VQ_SETUP_KICKFD: set the kickfd for virtqueue, this eventfd is used
> +  by VDUSE kernel module to notify userspace to consume the vring.
> +
> +- VDUSE_INJECT_VQ_IRQ: inject an interrupt for specific virtqueue

This information is useful but it's not enough to be able to implement a
userspace device. Please provide more developer documentation or at
least refer to uapi header files, published documents, etc that contain
the details.

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
       [not found] ` <20210615141331.407-10-xieyongji@bytedance.com>
  2021-06-21  9:13   ` [PATCH v8 09/10] vduse: " Jason Wang
@ 2021-06-24 14:46   ` Stefan Hajnoczi
       [not found]     ` <CACycT3vaXQ4dxC9QUzXXJs7og6TVqqVGa8uHZnTStacsYAiFwQ@mail.gmail.com>
  2021-07-07  8:52   ` Stefan Hajnoczi
  2 siblings, 1 reply; 41+ messages in thread
From: Stefan Hajnoczi @ 2021-06-24 14:46 UTC (permalink / raw)
  To: Xie Yongji
  Cc: kvm, mst, virtualization, christian.brauner, corbet, joro, willy,
	hch, dan.carpenter, viro, songmuchun, axboe, gregkh, rdunlap,
	linux-kernel, iommu, bcrl, netdev, linux-fsdevel, mika.penttila


[-- Attachment #1.1: Type: text/plain, Size: 6126 bytes --]

On Tue, Jun 15, 2021 at 10:13:30PM +0800, Xie Yongji wrote:
> diff --git a/include/uapi/linux/vduse.h b/include/uapi/linux/vduse.h
> new file mode 100644
> index 000000000000..f21b2e51b5c8
> --- /dev/null
> +++ b/include/uapi/linux/vduse.h
> @@ -0,0 +1,143 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_VDUSE_H_
> +#define _UAPI_VDUSE_H_
> +
> +#include <linux/types.h>
> +
> +#define VDUSE_API_VERSION	0
> +
> +#define VDUSE_NAME_MAX	256
> +
> +/* the control messages definition for read/write */
> +
> +enum vduse_req_type {
> +	/* Get the state for virtqueue from userspace */
> +	VDUSE_GET_VQ_STATE,
> +	/* Notify userspace to start the dataplane, no reply */
> +	VDUSE_START_DATAPLANE,
> +	/* Notify userspace to stop the dataplane, no reply */
> +	VDUSE_STOP_DATAPLANE,
> +	/* Notify userspace to update the memory mapping in device IOTLB */
> +	VDUSE_UPDATE_IOTLB,
> +};
> +
> +struct vduse_vq_state {
> +	__u32 index; /* virtqueue index */
> +	__u32 avail_idx; /* virtqueue state (last_avail_idx) */
> +};
> +
> +struct vduse_iova_range {
> +	__u64 start; /* start of the IOVA range */
> +	__u64 last; /* end of the IOVA range */

Please clarify whether this describes a closed range [start, last] or an
open range [start, last).
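
The difference matters for every length and overlap computation in
userspace. A small sketch of both conventions (helper names are
hypothetical):

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Closed range [start, last]: last is the final valid byte.
 * Note: the result wraps to 0 for the full 64-bit range.
 */
static uint64_t closed_len(uint64_t start, uint64_t last)
{
	return last - start + 1;
}

/* Half-open range [start, end): end is one past the final valid byte. */
static uint64_t half_open_len(uint64_t start, uint64_t end)
{
	return end - start;
}

/* Two closed ranges overlap if each one starts before the other ends. */
static bool closed_overlap(uint64_t a_start, uint64_t a_last,
			   uint64_t b_start, uint64_t b_last)
{
	return a_start <= b_last && b_start <= a_last;
}
```

With closed ranges, [0x1000, 0x1fff] and [0x1fff, 0x2fff] share one byte;
read as half-open ranges, the same pairs of numbers would not overlap.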

> +};
> +
> +struct vduse_dev_request {
> +	__u32 type; /* request type */
> +	__u32 request_id; /* request id */
> +#define VDUSE_REQ_FLAGS_NO_REPLY	(1 << 0) /* No need to reply */
> +	__u32 flags; /* request flags */
> +	__u32 reserved; /* for future use */
> +	union {
> +		struct vduse_vq_state vq_state; /* virtqueue state */
> +		struct vduse_iova_range iova; /* iova range for updating */
> +		__u32 padding[16]; /* padding */
> +	};
> +};
> +
> +struct vduse_dev_response {
> +	__u32 request_id; /* corresponding request id */
> +#define VDUSE_REQ_RESULT_OK	0x00
> +#define VDUSE_REQ_RESULT_FAILED	0x01
> +	__u32 result; /* the result of request */
> +	__u32 reserved[2]; /* for future use */
> +	union {
> +		struct vduse_vq_state vq_state; /* virtqueue state */
> +		__u32 padding[16]; /* padding */
> +	};
> +};
> +
> +/* ioctls */
> +
> +struct vduse_dev_config {
> +	char name[VDUSE_NAME_MAX]; /* vduse device name */
> +	__u32 vendor_id; /* virtio vendor id */
> +	__u32 device_id; /* virtio device id */
> +	__u64 features; /* device features */
> +	__u64 bounce_size; /* bounce buffer size for iommu */
> +	__u16 vq_size_max; /* the max size of virtqueue */

The VIRTIO specification allows per-virtqueue sizes. A device can have
two virtqueues, where the first one allows up to 1024 descriptors and
the second one allows only 128 descriptors, for example.

This constant seems to impose the constraint that all virtqueues have
the same maximum size. Is this really necessary?
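
A per-virtqueue limit could look roughly like this. This is only a
sketch of the idea: vq_limits_sketch, SKETCH_MAX_VQS and vq_size_check()
are hypothetical and not a proposal for the actual uapi layout:

```c
#include <stdint.h>

#define SKETCH_MAX_VQS 4u /* assumed upper bound for the toy table */

struct vq_limits_sketch {
	uint32_t vq_num;                      /* number of virtqueues */
	uint16_t vq_size_max[SKETCH_MAX_VQS]; /* max descriptors per virtqueue */
};

/* Return 0 if the driver's queue size is acceptable for this virtqueue. */
static int vq_size_check(const struct vq_limits_sketch *l, uint32_t index,
			 uint16_t num)
{
	if (index >= l->vq_num || index >= SKETCH_MAX_VQS)
		return -1;
	if (num == 0 || num > l->vq_size_max[index])
		return -1;
	return 0;
}
```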

> +	__u16 padding; /* padding */
> +	__u32 vq_num; /* the number of virtqueues */
> +	__u32 vq_align; /* the allocation alignment of virtqueue's metadata */

I'm not sure what this is?

> +	__u32 config_size; /* the size of the configuration space */
> +	__u32 reserved[15]; /* for future use */
> +	__u8 config[0]; /* the buffer of the configuration space */
> +};
> +
> +struct vduse_iotlb_entry {
> +	__u64 offset; /* the mmap offset on fd */
> +	__u64 start; /* start of the IOVA range */
> +	__u64 last; /* last of the IOVA range */

Same here, please specify whether this is an open range or a closed
range.

> +#define VDUSE_ACCESS_RO 0x1
> +#define VDUSE_ACCESS_WO 0x2
> +#define VDUSE_ACCESS_RW 0x3
> +	__u8 perm; /* access permission of this range */
> +};
> +
> +struct vduse_config_update {
> +	__u32 offset; /* offset from the beginning of configuration space */
> +	__u32 length; /* the length to write to configuration space */
> +	__u8 buffer[0]; /* buffer used to write from */
> +};
> +
> +struct vduse_vq_info {
> +	__u32 index; /* virtqueue index */
> +	__u32 avail_idx; /* virtqueue state (last_avail_idx) */
> +	__u64 desc_addr; /* address of desc area */
> +	__u64 driver_addr; /* address of driver area */
> +	__u64 device_addr; /* address of device area */
> +	__u32 num; /* the size of virtqueue */
> +	__u8 ready; /* ready status of virtqueue */
> +};
> +
> +struct vduse_vq_eventfd {
> +	__u32 index; /* virtqueue index */
> +#define VDUSE_EVENTFD_DEASSIGN -1
> +	int fd; /* eventfd, -1 means de-assigning the eventfd */
> +};
> +
> +#define VDUSE_BASE	0x81
> +
> +/* Get the version of VDUSE API. This is used for future extension */
> +#define VDUSE_GET_API_VERSION	_IOR(VDUSE_BASE, 0x00, __u64)
> +
> +/* Set the version of VDUSE API. */
> +#define VDUSE_SET_API_VERSION	_IOW(VDUSE_BASE, 0x01, __u64)
> +
> +/* Create a vduse device which is represented by a char device (/dev/vduse/<name>) */
> +#define VDUSE_CREATE_DEV	_IOW(VDUSE_BASE, 0x02, struct vduse_dev_config)
> +
> +/* Destroy a vduse device. Make sure there are no references to the char device */
> +#define VDUSE_DESTROY_DEV	_IOW(VDUSE_BASE, 0x03, char[VDUSE_NAME_MAX])
> +
> +/*
> + * Get a file descriptor for the first overlapped iova region,
> + * -EINVAL means the iova region doesn't exist.
> + */
> +#define VDUSE_IOTLB_GET_FD	_IOWR(VDUSE_BASE, 0x04, struct vduse_iotlb_entry)
> +
> +/* Get the negotiated features */
> +#define VDUSE_DEV_GET_FEATURES	_IOR(VDUSE_BASE, 0x05, __u64)
> +
> +/* Update the configuration space */
> +#define VDUSE_DEV_UPDATE_CONFIG	_IOW(VDUSE_BASE, 0x06, struct vduse_config_update)
> +
> +/* Get the specified virtqueue's information */
> +#define VDUSE_VQ_GET_INFO	_IOWR(VDUSE_BASE, 0x07, struct vduse_vq_info)
> +
> +/* Setup an eventfd to receive kick for virtqueue */
> +#define VDUSE_VQ_SETUP_KICKFD	_IOW(VDUSE_BASE, 0x08, struct vduse_vq_eventfd)
> +
> +/* Inject an interrupt for specific virtqueue */
> +#define VDUSE_VQ_INJECT_IRQ	_IOW(VDUSE_BASE, 0x09, __u32)

There is not enough documentation to use this header file. For example,
which ioctls are used with /dev/vduse and which are used with
/dev/vduse/<name>?

Please document that ioctl API fully. It will not only help userspace
developers but also define what is part of the interface and what is an
implementation detail that can change in the future.
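
As an example of the level of detail I am after, this is what I could
piece together from the documentation patch alone, written as a lookup
table. The table and helper are hypothetical. The version ioctls are
marked NODE_UNSTATED because the documentation does not say which file
descriptor they apply to, and the .rst text spells the last entry
VDUSE_INJECT_VQ_IRQ while the header uses VDUSE_VQ_INJECT_IRQ:

```c
#include <stddef.h>
#include <string.h>

enum vduse_node {
	NODE_CONTROL,  /* /dev/vduse/control */
	NODE_DEVICE,   /* /dev/vduse/$NAME */
	NODE_UNSTATED, /* not stated in the documentation patch */
};

struct ioctl_doc {
	const char *name;
	enum vduse_node node;
};

static const struct ioctl_doc vduse_ioctl_docs[] = {
	{ "VDUSE_GET_API_VERSION",   NODE_UNSTATED },
	{ "VDUSE_SET_API_VERSION",   NODE_UNSTATED },
	{ "VDUSE_CREATE_DEV",        NODE_CONTROL },
	{ "VDUSE_DESTROY_DEV",       NODE_CONTROL },
	{ "VDUSE_IOTLB_GET_FD",      NODE_DEVICE },
	{ "VDUSE_DEV_GET_FEATURES",  NODE_DEVICE },
	{ "VDUSE_DEV_UPDATE_CONFIG", NODE_DEVICE },
	{ "VDUSE_VQ_GET_INFO",       NODE_DEVICE },
	{ "VDUSE_VQ_SETUP_KICKFD",   NODE_DEVICE },
	{ "VDUSE_VQ_INJECT_IRQ",     NODE_DEVICE },
};

/* Look up which node an ioctl is documented for; unknown names are unstated. */
static enum vduse_node ioctl_node(const char *name)
{
	size_t i;

	for (i = 0; i < sizeof(vduse_ioctl_docs) / sizeof(vduse_ioctl_docs[0]); i++) {
		if (strcmp(vduse_ioctl_docs[i].name, name) == 0)
			return vduse_ioctl_docs[i].node;
	}
	return NODE_UNSTATED;
}
```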

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace
       [not found] <20210615141331.407-1-xieyongji@bytedance.com>
       [not found] ` <20210615141331.407-4-xieyongji@bytedance.com>
       [not found] ` <20210615141331.407-11-xieyongji@bytedance.com>
@ 2021-06-24 15:12 ` Stefan Hajnoczi
  2021-06-28 10:33 ` Liu Xiaodong
       [not found] ` <20210615141331.407-10-xieyongji@bytedance.com>
  4 siblings, 0 replies; 41+ messages in thread
From: Stefan Hajnoczi @ 2021-06-24 15:12 UTC (permalink / raw)
  To: Xie Yongji
  Cc: kvm, mst, virtualization, christian.brauner, corbet, joro, willy,
	hch, dan.carpenter, viro, songmuchun, axboe, gregkh, rdunlap,
	linux-kernel, iommu, bcrl, netdev, linux-fsdevel, mika.penttila


[-- Attachment #1.1: Type: text/plain, Size: 515 bytes --]

On Tue, Jun 15, 2021 at 10:13:21PM +0800, Xie Yongji wrote:
> This series introduces a framework that makes it possible to implement
> software-emulated vDPA devices in userspace. And to make it simple, the
> emulated vDPA device's control path is handled in the kernel and only the
> data path is implemented in the userspace.

This looks interesting. Unfortunately I don't have enough time to do a
full review, but I looked at the documentation and uapi header file to
give feedback on the userspace ABI.

Stefan

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
       [not found]                         ` <CACycT3tS=10kcUCNGYm=dUZsK+vrHzDvB3FSwAzuJCu3t+QuUQ@mail.gmail.com>
@ 2021-06-25  3:08                           ` Jason Wang
       [not found]                             ` <CACycT3vpMFbc9Fzuo9oksMaA-pVb1dEVTEgjNoft16voryPSWQ@mail.gmail.com>
  0 siblings, 1 reply; 41+ messages in thread
From: Jason Wang @ 2021-06-25  3:08 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, joro, Matthew Wilcox, Christoph Hellwig,
	Dan Carpenter, Al Viro, Stefan Hajnoczi, songmuchun, Jens Axboe,
	Greg KH, Randy Dunlap, linux-kernel, iommu, bcrl, netdev,
	linux-fsdevel, Mika Penttilä


On 2021/6/24 5:16 PM, Yongji Xie wrote:
> On Thu, Jun 24, 2021 at 4:14 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/6/24 12:46 PM, Yongji Xie wrote:
>>>> So we need to deal with both FEATURES_OK and reset, but probably not
>>>> DRIVER_OK.
>>>>
>>> OK, I see. Thanks for the explanation. One more question is how about
>>> clearing the corresponding status bit in get_status() rather than
>>> making set_status() fail. Since the spec recommends this way for
>>> validation which is done in virtio_dev_remove() and
>>> virtio_finalize_features().
>>>
>>> Thanks,
>>> Yongji
>>>
>> I think you can. Or it would be even better that we just don't set the
>> bit during set_status().
>>
> Yes, that's what I mean.
>
>> I just realize that in vdpa_reset() we had:
>>
>> static inline void vdpa_reset(struct vdpa_device *vdev)
>> {
>>           const struct vdpa_config_ops *ops = vdev->config;
>>
>>           vdev->features_valid = false;
>>           ops->set_status(vdev, 0);
>> }
>>
>> We probably need to add the synchronization here. E.g re-read with a
>> timeout.
>>
> Looks like the timeout is already in set_status().


Do you mean the VDUSE's implementation?


>   Do we really need a
> duplicated one here?


1) this is the timeout at the vDPA layer instead of the VDUSE layer.
2) it really depends on the meaning of the timeout for set_status
of VDUSE.

Do we want:

2a) for set_status(): relay the message to userspace and wait for
userspace to quiesce the datapath

or

2b) for set_status(): simply relay the message to userspace; no reply is
needed. Userspace will use a command to update the status when the
datapath is stopped. Then the status could be fetched via get_status().

2b looks more spec compliant.
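
A toy model of 2b, to show the intended ordering. The structure and the
relay_*/userspace_* names are hypothetical; in particular the "command to
update the status" does not exist in v8:

```c
#include <stdint.h>

struct vduse_status_sketch {
	uint8_t requested; /* last status written by the driver */
	uint8_t reported;  /* status userspace has confirmed so far */
	int pending_msgs;  /* messages relayed but not yet acted on */
};

/* set_status(): record the request and relay it; do not wait for a reply. */
static void relay_set_status(struct vduse_status_sketch *s, uint8_t status)
{
	s->requested = status;
	s->pending_msgs++;
}

/* Hypothetical userspace command, issued once the datapath has stopped. */
static void userspace_update_status(struct vduse_status_sketch *s,
				    uint8_t status)
{
	s->reported = status;
	if (s->pending_msgs > 0)
		s->pending_msgs--;
}

/* get_status(): return what userspace has confirmed, not what was asked. */
static uint8_t relay_get_status(const struct vduse_status_sketch *s)
{
	return s->reported;
}
```

A driver that writes 0 and then polls get_status() keeps seeing the old
value until userspace confirms, which matches the "wait for a read of
device_status to return 0" wording in the spec.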

> And how to handle failure? Adding a return value
> to virtio_config_ops->reset() and passing the error to the upper
> layer?


Something like this.

Thanks


>
> Thanks,
> Yongji
>

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace
  2021-06-28 10:33 ` Liu Xiaodong
@ 2021-06-28  4:35   ` Jason Wang
  2021-06-28  5:54     ` Liu, Xiaodong
  2021-06-28 10:32   ` Yongji Xie
  1 sibling, 1 reply; 41+ messages in thread
From: Jason Wang @ 2021-06-28  4:35 UTC (permalink / raw)
  To: Liu Xiaodong, Xie Yongji, mst, stefanha, sgarzare, parav, hch,
	christian.brauner, rdunlap, willy, viro, axboe, bcrl, corbet,
	mika.penttila, dan.carpenter, joro, gregkh
  Cc: kvm, netdev, linux-kernel, virtualization, iommu, songmuchun,
	linux-fsdevel


On 2021/6/28 6:33 PM, Liu Xiaodong wrote:
> On Tue, Jun 15, 2021 at 10:13:21PM +0800, Xie Yongji wrote:
>> This series introduces a framework that makes it possible to implement
>> software-emulated vDPA devices in userspace. To keep it simple, the
>> emulated vDPA device's control path is handled in the kernel and only the
>> data path is implemented in userspace.
>>
>> Since the emulated vDPA device's control path is handled in the kernel,
>> a message mechanism is introduced to make userspace aware of data
>> path related changes. Userspace can use read()/write() to receive and
>> reply to the control messages.
>>
>> In the data path, the core task is mapping the DMA buffer into the VDUSE
>> daemon's address space, which can be implemented in different ways
>> depending on the vdpa bus to which the vDPA device is attached.
>>
>> In the virtio-vdpa case, we implement an MMU-based on-chip IOMMU driver with
>> a bounce-buffering mechanism to achieve that. In the vhost-vdpa case, the DMA
>> buffer resides in a userspace memory region which can be shared with the
>> VDUSE userspace process by transferring the shmfd.
>>
>> The details and our use case are shown below:
>>
>> ------------------------    -------------------------   ----------------------------------------------
>> |            Container |    |              QEMU(VM) |   |                               VDUSE daemon |
>> |       ---------      |    |  -------------------  |   | ------------------------- ---------------- |
>> |       |dev/vdx|      |    |  |/dev/vhost-vdpa-x|  |   | | vDPA device emulation | | block driver | |
>> ------------+-----------     -----------+------------   -------------+----------------------+---------
>>              |                           |                            |                      |
>>              |                           |                            |                      |
>> ------------+---------------------------+----------------------------+----------------------+---------
>> |    | block device |           |  vhost device |            | vduse driver |          | TCP/IP |    |
>> |    -------+--------           --------+--------            -------+--------          -----+----    |
>> |           |                           |                           |                       |        |
>> | ----------+----------       ----------+-----------         -------+-------                |        |
>> | | virtio-blk driver |       |  vhost-vdpa driver |         | vdpa device |                |        |
>> | ----------+----------       ----------+-----------         -------+-------                |        |
>> |           |      virtio bus           |                           |                       |        |
>> |   --------+----+-----------           |                           |                       |        |
>> |                |                      |                           |                       |        |
>> |      ----------+----------            |                           |                       |        |
>> |      | virtio-blk device |            |                           |                       |        |
>> |      ----------+----------            |                           |                       |        |
>> |                |                      |                           |                       |        |
>> |     -----------+-----------           |                           |                       |        |
>> |     |  virtio-vdpa driver |           |                           |                       |        |
>> |     -----------+-----------           |                           |                       |        |
>> |                |                      |                           |    vdpa bus           |        |
>> |     -----------+----------------------+---------------------------+------------           |        |
>> |                                                                                        ---+---     |
>> -----------------------------------------------------------------------------------------| NIC |------
>>                                                                                           ---+---
>>                                                                                              |
>>                                                                                     ---------+---------
>>                                                                                     | Remote Storages |
>>                                                                                     -------------------
>>
>> We make use of it to implement a block device connecting to
>> our distributed storage, which can be used both in containers and
>> VMs. Thus, we can have a unified technology stack in these two cases.
>>
>> To test it with null-blk:
>>
>>    $ qemu-storage-daemon \
>>        --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \
>>        --monitor chardev=charmonitor \
>>        --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0 \
>>        --export type=vduse-blk,id=test,node-name=disk0,writable=on,name=vduse-null,num-queues=16,queue-size=128
>>
>> The qemu-storage-daemon can be found at https://github.com/bytedance/qemu/tree/vduse
>>
>> To allow userspace VDUSE processes such as qemu-storage-daemon to be
>> run by an unprivileged user, we did some work on the virtio drivers to
>> avoid trusting the device, including:
>>
>>    - validating the used length:
>>
>>      * https://lore.kernel.org/lkml/20210531135852.113-1-xieyongji@bytedance.com/
>>      * https://lore.kernel.org/lkml/20210525125622.1203-1-xieyongji@bytedance.com/
>>
>>    - validating the device config:
>>
>>      * https://lore.kernel.org/lkml/20210615104810.151-1-xieyongji@bytedance.com/
>>
>>    - validating the device response:
>>
>>      * https://lore.kernel.org/lkml/20210615105218.214-1-xieyongji@bytedance.com/
>>
>> Since I'm not sure whether I missed something during auditing, especially in
>> virtio device drivers that I'm not familiar with, we currently limit the
>> supported device type to the virtio block device. Support for other device
>> types can be added after the security issues of the corresponding device
>> drivers are clarified or fixed in the future.
>>
>> Future work:
>>    - Improve performance
>>    - Userspace library (find a way to reuse device emulation code in qemu/rust-vmm)
>>    - Support more device types
>>
>> V7 to V8:
>> - Rebased to newest kernel tree
>> - Rework VDUSE driver to handle the device's control path in kernel
>> - Limit the supported device type to virtio block device
>> - Export free_iova_fast()
>> - Remove the virtio-blk and virtio-scsi patches (will send them alone)
>> - Remove all module parameters
>> - Use the same MAJOR for both control device and VDUSE devices
>> - Avoid eventfd cleanup in vduse_dev_release()
>>
>> V6 to V7:
>> - Export alloc_iova_fast()
>> - Add get_config_size() callback
>> - Add some patches to avoid trusting virtio devices
>> - Add limited device emulation
>> - Add some documents
>> - Use workqueue to inject config irq
>> - Add parameter on vq irq injecting
>> - Rename vduse_domain_get_mapping_page() to vduse_domain_get_coherent_page()
>> - Add WARN_ON() to catch message failure
>> - Add some padding/reserved fields to uAPI structure
>> - Fix some bugs
>> - Rebase to vhost.git
>>
>> V5 to V6:
>> - Export receive_fd() instead of __receive_fd()
>> - Factor out the unmapping logic of pa and va separately
>> - Remove the logic of bounce page allocation in page fault handler
>> - Use PAGE_SIZE as IOVA allocation granule
>> - Add EPOLLOUT support
>> - Enable setting API version in userspace
>> - Fix some bugs
>>
>> V4 to V5:
>> - Remove the patch for irq binding
>> - Use a single IOTLB for all types of mapping
>> - Factor out vhost_vdpa_pa_map()
>> - Add some sample codes in document
>> - Use receive_fd_user() to pass file descriptor
>> - Fix some bugs
>>
>> V3 to V4:
>> - Rebase to vhost.git
>> - Split some patches
>> - Add some documents
>> - Use ioctl to inject interrupt rather than eventfd
>> - Enable config interrupt support
>> - Support binding irq to the specified cpu
>> - Add two module parameters to limit bounce/iova size
>> - Create char device rather than anon inode per vduse
>> - Reuse vhost IOTLB for iova domain
>> - Rework the message mechanism in control path
>>
>> V2 to V3:
>> - Rework the MMU-based IOMMU driver
>> - Use the iova domain as iova allocator instead of genpool
>> - Support transferring vma->vm_file in vhost-vdpa
>> - Add SVA support in vhost-vdpa
>> - Remove the patches on bounce pages reclaim
>>
>> V1 to V2:
>> - Add vhost-vdpa support
>> - Add some documents
>> - Based on the vdpa management tool
>> - Introduce a workqueue for irq injection
>> - Replace interval tree with array map to store the iova_map
>>
>> Xie Yongji (10):
>>    iova: Export alloc_iova_fast() and free_iova_fast();
>>    file: Export receive_fd() to modules
>>    eventfd: Increase the recursion depth of eventfd_signal()
>>    vhost-iotlb: Add an opaque pointer for vhost IOTLB
>>    vdpa: Add an opaque pointer for vdpa_config_ops.dma_map()
>>    vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap()
>>    vdpa: Support transferring virtual addressing during DMA mapping
>>    vduse: Implement an MMU-based IOMMU driver
>>    vduse: Introduce VDUSE - vDPA Device in Userspace
>>    Documentation: Add documentation for VDUSE
>>
>>   Documentation/userspace-api/index.rst              |    1 +
>>   Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>>   Documentation/userspace-api/vduse.rst              |  222 +++
>>   drivers/iommu/iova.c                               |    2 +
>>   drivers/vdpa/Kconfig                               |   10 +
>>   drivers/vdpa/Makefile                              |    1 +
>>   drivers/vdpa/ifcvf/ifcvf_main.c                    |    2 +-
>>   drivers/vdpa/mlx5/net/mlx5_vnet.c                  |    2 +-
>>   drivers/vdpa/vdpa.c                                |    9 +-
>>   drivers/vdpa/vdpa_sim/vdpa_sim.c                   |    8 +-
>>   drivers/vdpa/vdpa_user/Makefile                    |    5 +
>>   drivers/vdpa/vdpa_user/iova_domain.c               |  545 ++++++++
>>   drivers/vdpa/vdpa_user/iova_domain.h               |   73 +
>>   drivers/vdpa/vdpa_user/vduse_dev.c                 | 1453 ++++++++++++++++++++
>>   drivers/vdpa/virtio_pci/vp_vdpa.c                  |    2 +-
>>   drivers/vhost/iotlb.c                              |   20 +-
>>   drivers/vhost/vdpa.c                               |  148 +-
>>   fs/eventfd.c                                       |    2 +-
>>   fs/file.c                                          |    6 +
>>   include/linux/eventfd.h                            |    5 +-
>>   include/linux/file.h                               |    7 +-
>>   include/linux/vdpa.h                               |   21 +-
>>   include/linux/vhost_iotlb.h                        |    3 +
>>   include/uapi/linux/vduse.h                         |  143 ++
>>   24 files changed, 2641 insertions(+), 50 deletions(-)
>>   create mode 100644 Documentation/userspace-api/vduse.rst
>>   create mode 100644 drivers/vdpa/vdpa_user/Makefile
>>   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
>>   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
>>   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>>   create mode 100644 include/uapi/linux/vduse.h
>>
>> --
>> 2.11.0
> Hi, Yongji
>
> Great work! Implementing a software IOMMU so that the data path can be
> processed efficiently by a userspace application is a really smart approach.
> Sorry, I have only just noticed your work and patches.
>
>
> I was working on something similar, aiming to export a vhost-user-blk device
> from the SPDK vhost-target as a local host kernel block device.
> Its diagram looks like this:
>
>
>                                  -----------------------------
> ------------------------        |    -----------------      |    ---------------------------------------
> |   <RunC Container>   |     <<<<<<<<| Shared-Memory |>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>        |
> |       ---------      |     v  |    -----------------      |    |                            v        |
> |       |dev/vdx|      |     v  |   <virtio-local-agent>    |    |      <Vhost-user Target>   v        |
> ------------+-----------     v  | ------------------------  |    |  --------------------------v------  |
>              |                v  | |/dev/virtio-local-ctrl|  |    |  | unix socket |   |block driver |  |
>              |                v  ------------+----------------    --------+--------------------v---------
>              |                v              |                            |                    v
> ------------+----------------v--------------+----------------------------+--------------------v--------|
> |    | block device |        v      |  Misc device |                     |                    v        |
> |    -------+--------        v      --------+-------                     |                    v        |
> |           |                v              |                            |                    v        |
> | ----------+----------      v              |                            |                    v        |
> | | virtio-blk driver |      v              |                            |                    v        |
> | ----------+----------      v              |                            |                    v        |
> |           | virtio bus     v              |                            |                    v        |
> |   --------+---+-------     v              |                            |                    v        |
> |               |            v              |                            |                    v        |
> |               |            v              |                            |                    v        |
> |     ----------+----------  v     ---------+-----------                 |                    v        |
> |     | virtio-blk device |--<----| virtio-local driver |----------------<                    v        |
> |     ----------+----------       ----------+-----------                                      v        |
> |                                                                                    ---------+--------|
> -------------------------------------------------------------------------------------| RNIC |--| PCIe |-
>                                                                                       ----+---  | NVMe |
>                                                                                           |     --------
>                                                                                  ---------+---------
>                                                                                  | Remote Storages |
>                                                                                  -------------------
>
>
> I have only drafted an initial proof-of-concept version. When I saw your
> RFC mail, I thought the SPDK target might be able to depend on your work,
> so I could simply drop mine.
> But after a glance at the RFC patches, it seems it is not so easy or
> efficient for SPDK to leverage vduse.
> (Please correct me if I have misunderstood vduse. :) )
>
> The big barrier is the bounce-buffer mapping: SPDK requires hugepages
> for NVMe over PCIe and RDMA, so it needs to use some preallocated hugepages
> as the bounce buffer. Otherwise it is hard to avoid an extra
> memcpy from the bounce buffer to a hugepage.
> If you can add an option to map hugepages as the bounce buffer,
> then SPDK could also be a potential user of vduse.


Several issues:

- VDUSE needs to limit the total size of the bounce buffers (64M if I
remember correctly). Does that work for SPDK?
- VDUSE can use hugepages, but I'm not sure we can mandate hugepages (or
we would need to introduce new flags to support this).

Thanks


>
> It would be better if the SPDK vhost-target could leverage the vduse
> datapath directly and efficiently. Even though the control path is vdpa
> based, we could write a daemon that acts as an agent to bridge the SPDK
> vhost-target with vduse. Then users who have already deployed the SPDK
> vhost-target can simply run an agent daemon without any code modification
> to the SPDK vhost-target itself.
> (This is only nice-to-have for the SPDK vhost-target app, not mandatory for SPDK.) :)
> There is at least one small barrier that prevents a vhost-target from using
> the vduse datapath efficiently:
> - The IO completion irq of vduse is currently ioctl based. If an option
> were added to make it eventfd based, the vhost-target could notify IO
> completion directly via the negotiated eventfd.
>
>
> Thanks
>  From Xiaodong
>
>
>
>
>
> 									
>
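
As a rough illustration of the eventfd-based completion notification
requested in the quoted text, the userspace side could look like the sketch
below. It only shows the generic Linux eventfd(2) counter semantics; the
function names are made up for this example, and how such an eventfd would be
negotiated with VDUSE is not defined anywhere in this thread.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Signal one IO completion on the eventfd; returns 0 on success, -1 on error. */
static int notify_completion(int efd)
{
	uint64_t one = 1;

	assert(efd >= 0);
	return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Drain the eventfd counter. Returns the number of completions signalled
 * since the last read, or 0 if nothing is pending (or on error).
 */
static uint64_t consume_completions(int efd)
{
	uint64_t count = 0;

	assert(efd >= 0);
	if (read(efd, &count, sizeof(count)) != (ssize_t)sizeof(count))
		return 0;
	return count;
}
```

With an eventfd created via eventfd(0, EFD_NONBLOCK), several notifications
are coalesced into one counter value, so the consumer can pick up a batch of
completions with a single read() instead of one ioctl per completion.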


* Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
       [not found]                             ` <CACycT3vpMFbc9Fzuo9oksMaA-pVb1dEVTEgjNoft16voryPSWQ@mail.gmail.com>
@ 2021-06-28  4:40                               ` Jason Wang
       [not found]                                 ` <CACycT3u9-id2DxPpuVLtyg4tzrUF9xCAGr7nBm=21HfUJJasaQ@mail.gmail.com>
  0 siblings, 1 reply; 41+ messages in thread
From: Jason Wang @ 2021-06-28  4:40 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, joro, Matthew Wilcox, Christoph Hellwig,
	Dan Carpenter, Al Viro, Stefan Hajnoczi, songmuchun, Jens Axboe,
	Greg KH, Randy Dunlap, linux-kernel, iommu, bcrl, netdev,
	linux-fsdevel, Mika Penttilä


On 2021/6/25 12:19 PM, Yongji Xie wrote:
>> 2b) for set_status(): simply relay the message to userspace; no reply is
>> needed. Userspace will use a command to update the status when the
>> datapath is stopped. The status could then be fetched via get_status().
>>
>> 2b looks more spec compliant.
>>
> Looks good to me. And I think we can use the reply of the message to
> update the status instead of introducing a new command.
>

I just noticed this part in virtio_finalize_features():

         virtio_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK);
         status = dev->config->get_status(dev);
         if (!(status & VIRTIO_CONFIG_S_FEATURES_OK)) {

So no-reply doesn't work for FEATURES_OK.

So my understanding is:

1) We must not use noreply for set_status()
2) We can use noreply for get_status(), but it requires a new ioctl to 
update the status.

So it looks to me like we need to synchronize for both get_status() and
set_status().
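
To make the FEATURES_OK point concrete, here is a small userspace sketch of
the read-back check. It is only a simulation: struct sim_dev and the two
sim_set_status_*() helpers are invented for this example, and the "noreply"
variant simply models a kernel that records the driver's write without
waiting for the device's verdict.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_CONFIG_S_FEATURES_OK 8	/* status bit value from the virtio spec */

/* Hypothetical simulated device. */
struct sim_dev {
	uint8_t status;
	bool accepts_features;
};

typedef void (*sim_set_status_fn)(struct sim_dev *dev, uint8_t status);

/* Synchronous: the device may refuse FEATURES_OK before the call returns. */
static void sim_set_status_sync(struct sim_dev *dev, uint8_t status)
{
	if ((status & VIRTIO_CONFIG_S_FEATURES_OK) && !dev->accepts_features)
		status &= (uint8_t)~VIRTIO_CONFIG_S_FEATURES_OK;
	dev->status = status;
}

/* No reply: only the driver's write is recorded; the device has not answered. */
static void sim_set_status_noreply(struct sim_dev *dev, uint8_t status)
{
	dev->status = status;
}

/* Mirrors the check quoted above: set FEATURES_OK, then read it back. */
static int sim_finalize_features(struct sim_dev *dev, sim_set_status_fn set_status)
{
	assert(dev && set_status);

	set_status(dev, dev->status | VIRTIO_CONFIG_S_FEATURES_OK);
	if (!(dev->status & VIRTIO_CONFIG_S_FEATURES_OK))
		return -ENODEV;	/* device refused the features */
	return 0;
}
```

With a device that rejects the features, the synchronous variant makes the
read-back fail as intended, whereas the noreply variant reports success
because the status that is read back does not yet reflect the device's
decision.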

Thanks



* RE: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace
  2021-06-28  4:35   ` Jason Wang
@ 2021-06-28  5:54     ` Liu, Xiaodong
  2021-06-29  4:10       ` Jason Wang
  0 siblings, 1 reply; 41+ messages in thread
From: Liu, Xiaodong @ 2021-06-28  5:54 UTC (permalink / raw)
  To: Jason Wang, Xie Yongji, mst@redhat.com, stefanha@redhat.com,
	sgarzare@redhat.com, parav@nvidia.com, hch@infradead.org,
	christian.brauner@canonical.com, rdunlap@infradead.org,
	willy@infradead.org, viro@zeniv.linux.org.uk, axboe@kernel.dk,
	bcrl@kvack.org, corbet@lwn.net, mika.penttila@nextfour.com,
	dan.carpenter@oracle.com, joro@8bytes.org,
	gregkh@linuxfoundation.org
  Cc: kvm@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org,
	iommu@lists.linux-foundation.org, songmuchun@bytedance.com,
	linux-fsdevel@vger.kernel.org



>-----Original Message-----
>From: Jason Wang <jasowang@redhat.com>
>Sent: Monday, June 28, 2021 12:35 PM
>To: Liu, Xiaodong <xiaodong.liu@intel.com>; Xie Yongji
><xieyongji@bytedance.com>; mst@redhat.com; stefanha@redhat.com;
>sgarzare@redhat.com; parav@nvidia.com; hch@infradead.org;
>christian.brauner@canonical.com; rdunlap@infradead.org; willy@infradead.org;
>viro@zeniv.linux.org.uk; axboe@kernel.dk; bcrl@kvack.org; corbet@lwn.net;
>mika.penttila@nextfour.com; dan.carpenter@oracle.com; joro@8bytes.org;
>gregkh@linuxfoundation.org
>Cc: songmuchun@bytedance.com; virtualization@lists.linux-foundation.org;
>netdev@vger.kernel.org; kvm@vger.kernel.org; linux-fsdevel@vger.kernel.org;
>iommu@lists.linux-foundation.org; linux-kernel@vger.kernel.org
>Subject: Re: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace
>
>
>On 2021/6/28 6:33 PM, Liu Xiaodong wrote:
>> On Tue, Jun 15, 2021 at 10:13:21PM +0800, Xie Yongji wrote:
>>> This series introduces a framework that makes it possible to
>>> implement software-emulated vDPA devices in userspace. To keep it
>>> simple, the emulated vDPA device's control path is handled in the
>>> kernel and only the data path is implemented in userspace.
>>>
>>> Since the emulated vDPA device's control path is handled in the
>>> kernel, a message mechanism is introduced to make userspace aware
>>> of data path related changes. Userspace can use read()/write() to
>>> receive and reply to the control messages.
>>>
>>> In the data path, the core task is mapping the DMA buffer into the
>>> VDUSE daemon's address space, which can be implemented in different
>>> ways depending on the vdpa bus to which the vDPA device is attached.
>>>
>>> In the virtio-vdpa case, we implement an MMU-based on-chip IOMMU driver
>>> with a bounce-buffering mechanism to achieve that. In the vhost-vdpa
>>> case, the DMA buffer resides in a userspace memory region which can
>>> be shared with the VDUSE userspace process by transferring the shmfd.
>>>
>>> The details and our use case are shown below:
>>>
>>> ------------------------    -------------------------   ----------------------------------------------
>>> |            Container |    |              QEMU(VM) |   |                               VDUSE daemon |
>>> |       ---------      |    |  -------------------  |   | ------------------------- ---------------- |
>>> |       |dev/vdx|      |    |  |/dev/vhost-vdpa-x|  |   | | vDPA device emulation | | block driver | |
>>> ------------+-----------     -----------+------------   -------------+----------------------+---------
>>>              |                           |                            |                      |
>>>              |                           |                            |                      |
>>> ------------+---------------------------+----------------------------+----------------------+---------
>>> |    | block device |           |  vhost device |            | vduse driver |          | TCP/IP |    |
>>> |    -------+--------           --------+--------            -------+--------          -----+----    |
>>> |           |                           |                           |                       |        |
>>> | ----------+----------       ----------+-----------         -------+-------                |        |
>>> | | virtio-blk driver |       |  vhost-vdpa driver |         | vdpa device |                |        |
>>> | ----------+----------       ----------+-----------         -------+-------                |        |
>>> |           |      virtio bus           |                           |                       |        |
>>> |   --------+----+-----------           |                           |                       |        |
>>> |                |                      |                           |                       |        |
>>> |      ----------+----------            |                           |                       |        |
>>> |      | virtio-blk device |            |                           |                       |        |
>>> |      ----------+----------            |                           |                       |        |
>>> |                |                      |                           |                       |        |
>>> |     -----------+-----------           |                           |                       |        |
>>> |     |  virtio-vdpa driver |           |                           |                       |        |
>>> |     -----------+-----------           |                           |                       |        |
>>> |                |                      |                           |    vdpa bus           |        |
>>> |     -----------+----------------------+---------------------------+------------           |        |
>>> |                                                                                        ---+---     |
>>> -----------------------------------------------------------------------------------------| NIC |------
>>>                                                                                           ---+---
>>>                                                                                              |
>>>                                                                                     ---------+---------
>>>                                                                                     | Remote Storages |
>>>                                                                                     -------------------
>>>
>>> We make use of it to implement a block device connecting to our
>>> distributed storage, which can be used both in containers and VMs.
>>> Thus, we can have a unified technology stack in these two cases.
>>>
>>> To test it with null-blk:
>>>
>>>    $ qemu-storage-daemon \
>>>        --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \
>>>        --monitor chardev=charmonitor \
>>>        --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0 \
>>>        --export type=vduse-blk,id=test,node-name=disk0,writable=on,name=vduse-null,num-queues=16,queue-size=128
>>>
>>> The qemu-storage-daemon can be found at
>>> https://github.com/bytedance/qemu/tree/vduse
>>>
>>> To allow userspace VDUSE processes such as qemu-storage-daemon to be
>>> run by an unprivileged user, we did some work on the virtio drivers
>>> to avoid trusting the device, including:
>>>
>>>    - validating the used length:
>>>
>>>      * https://lore.kernel.org/lkml/20210531135852.113-1-xieyongji@bytedance.com/
>>>      * https://lore.kernel.org/lkml/20210525125622.1203-1-xieyongji@bytedance.com/
>>>
>>>    - validating the device config:
>>>
>>>      * https://lore.kernel.org/lkml/20210615104810.151-1-xieyongji@bytedance.com/
>>>
>>>    - validating the device response:
>>>
>>>      * https://lore.kernel.org/lkml/20210615105218.214-1-xieyongji@bytedance.com/
>>>
>>> Since I'm not sure whether I missed something during auditing, especially
>>> in virtio device drivers that I'm not familiar with, we currently limit
>>> the supported device type to the virtio block device. Support for
>>> other device types can be added after the security issues of the
>>> corresponding device drivers are clarified or fixed in the future.
>>>
>>> Future work:
>>>    - Improve performance
>>>    - Userspace library (find a way to reuse device emulation code in qemu/rust-vmm)
>>>    - Support more device types
>>>
>>> V7 to V8:
>>> - Rebased to newest kernel tree
>>> - Rework VDUSE driver to handle the device's control path in kernel
>>> - Limit the supported device type to virtio block device
>>> - Export free_iova_fast()
>>> - Remove the virtio-blk and virtio-scsi patches (will send them alone)
>>> - Remove all module parameters
>>> - Use the same MAJOR for both control device and VDUSE devices
>>> - Avoid eventfd cleanup in vduse_dev_release()
>>>
>>> V6 to V7:
>>> - Export alloc_iova_fast()
>>> - Add get_config_size() callback
>>> - Add some patches to avoid trusting virtio devices
>>> - Add limited device emulation
>>> - Add some documents
>>> - Use workqueue to inject config irq
>>> - Add parameter on vq irq injecting
>>> - Rename vduse_domain_get_mapping_page() to vduse_domain_get_coherent_page()
>>> - Add WARN_ON() to catch message failure
>>> - Add some padding/reserved fields to uAPI structure
>>> - Fix some bugs
>>> - Rebase to vhost.git
>>>
>>> V5 to V6:
>>> - Export receive_fd() instead of __receive_fd()
>>> - Factor out the unmapping logic of pa and va separately
>>> - Remove the logic of bounce page allocation in page fault handler
>>> - Use PAGE_SIZE as IOVA allocation granule
>>> - Add EPOLLOUT support
>>> - Enable setting API version in userspace
>>> - Fix some bugs
>>>
>>> V4 to V5:
>>> - Remove the patch for irq binding
>>> - Use a single IOTLB for all types of mapping
>>> - Factor out vhost_vdpa_pa_map()
>>> - Add some sample codes in document
>>> - Use receive_fd_user() to pass file descriptor
>>> - Fix some bugs
>>>
>>> V3 to V4:
>>> - Rebase to vhost.git
>>> - Split some patches
>>> - Add some documents
>>> - Use ioctl to inject interrupt rather than eventfd
>>> - Enable config interrupt support
>>> - Support binding irq to the specified cpu
>>> - Add two module parameters to limit bounce/iova size
>>> - Create char device rather than anon inode per vduse
>>> - Reuse vhost IOTLB for iova domain
>>> - Rework the message mechanism in control path
>>>
>>> V2 to V3:
>>> - Rework the MMU-based IOMMU driver
>>> - Use the iova domain as iova allocator instead of genpool
>>> - Support transferring vma->vm_file in vhost-vdpa
>>> - Add SVA support in vhost-vdpa
>>> - Remove the patches on bounce pages reclaim
>>>
>>> V1 to V2:
>>> - Add vhost-vdpa support
>>> - Add some documents
>>> - Based on the vdpa management tool
>>> - Introduce a workqueue for irq injection
>>> - Replace interval tree with array map to store the iova_map
>>>
>>> Xie Yongji (10):
>>>    iova: Export alloc_iova_fast() and free_iova_fast();
>>>    file: Export receive_fd() to modules
>>>    eventfd: Increase the recursion depth of eventfd_signal()
>>>    vhost-iotlb: Add an opaque pointer for vhost IOTLB
>>>    vdpa: Add an opaque pointer for vdpa_config_ops.dma_map()
>>>    vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap()
>>>    vdpa: Support transferring virtual addressing during DMA mapping
>>>    vduse: Implement an MMU-based IOMMU driver
>>>    vduse: Introduce VDUSE - vDPA Device in Userspace
>>>    Documentation: Add documentation for VDUSE
>>>
>>>   Documentation/userspace-api/index.rst              |    1 +
>>>   Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>>>   Documentation/userspace-api/vduse.rst              |  222 +++
>>>   drivers/iommu/iova.c                               |    2 +
>>>   drivers/vdpa/Kconfig                               |   10 +
>>>   drivers/vdpa/Makefile                              |    1 +
>>>   drivers/vdpa/ifcvf/ifcvf_main.c                    |    2 +-
>>>   drivers/vdpa/mlx5/net/mlx5_vnet.c                  |    2 +-
>>>   drivers/vdpa/vdpa.c                                |    9 +-
>>>   drivers/vdpa/vdpa_sim/vdpa_sim.c                   |    8 +-
>>>   drivers/vdpa/vdpa_user/Makefile                    |    5 +
>>>   drivers/vdpa/vdpa_user/iova_domain.c               |  545 ++++++++
>>>   drivers/vdpa/vdpa_user/iova_domain.h               |   73 +
>>>   drivers/vdpa/vdpa_user/vduse_dev.c                 | 1453 ++++++++++++++++++++
>>>   drivers/vdpa/virtio_pci/vp_vdpa.c                  |    2 +-
>>>   drivers/vhost/iotlb.c                              |   20 +-
>>>   drivers/vhost/vdpa.c                               |  148 +-
>>>   fs/eventfd.c                                       |    2 +-
>>>   fs/file.c                                          |    6 +
>>>   include/linux/eventfd.h                            |    5 +-
>>>   include/linux/file.h                               |    7 +-
>>>   include/linux/vdpa.h                               |   21 +-
>>>   include/linux/vhost_iotlb.h                        |    3 +
>>>   include/uapi/linux/vduse.h                         |  143 ++
>>>   24 files changed, 2641 insertions(+), 50 deletions(-)
>>>   create mode 100644 Documentation/userspace-api/vduse.rst
>>>   create mode 100644 drivers/vdpa/vdpa_user/Makefile
>>>   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
>>>   create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
>>>   create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>>>   create mode 100644 include/uapi/linux/vduse.h
>>>
>>> --
>>> 2.11.0
>> Hi, Yongji
>>
>> Great work! your method is really wise that implements a software
>> IOMMU so that data path gets processed by userspace application efficiently.
>> Sorry, I've just realized your work and patches.
>>
>>
>> I was working on a similar thing aiming to get vhost-user-blk device
>> from SPDK vhost-target to be exported as local host kernel block device.
>> Its diagram is like this:
>>
>>
>>                                 -----------------------------
>> ------------------------        |    -----------------      |    ---------------------------------------
>> |   <RunC Container>   |     <<<<<<<<| Shared-Memory |>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>        |
>> |       ---------      |     v  |    -----------------      |    |                            v        |
>> |       |dev/vdx|      |     v  |   <virtio-local-agent>    |    |      <Vhost-user Target>   v        |
>> ------------+-----------     v  | ------------------------  |    |  --------------------------v------  |
>>             |                v  | |/dev/virtio-local-ctrl|  |    |  | unix socket |   |block driver |  |
>>             |                v  ------------+----------------    --------+--------------------v---------
>>             |                v              |                            |                    v
>> ------------+----------------v--------------+----------------------------+--------------------v--------|
>> |    | block device |        v      |  Misc device |                     |                    v        |
>> |    -------+--------        v      --------+-------                     |                    v        |
>> |           |                v              |                            |                    v        |
>> | ----------+----------      v              |                            |                    v        |
>> | | virtio-blk driver |      v              |                            |                    v        |
>> | ----------+----------      v              |                            |                    v        |
>> |           | virtio bus     v              |                            |                    v        |
>> |   --------+---+-------     v              |                            |                    v        |
>> |               |            v              |                            |                    v        |
>> |               |            v              |                            |                    v        |
>> |     ----------+----------  v     ---------+-----------                 |                    v        |
>> |     | virtio-blk device |--<----| virtio-local driver |----------------<                    v        |
>> |     ----------+----------       ----------+-----------                                      v        |
>> |                                                                                    ---------+--------|
>> -------------------------------------------------------------------------------------| RNIC |--| PCIe |-
>>                                                                                      ----+---  | NVMe |
>>                                                                                          |     --------
>>                                                                                 ---------+---------
>>                                                                                 | Remote Storages |
>>                                                                                 -------------------
>>
>>
>> I just draft out an initial proof version. When seeing your RFC mail,
>> I'm thinking that SPDK target may depends on your work, so I could
>> directly drop mine.
>> But after a glance of the RFC patches, seems it is not so easy or
>> efficient to get vduse leveraged by SPDK.
>> (Please correct me, if I get wrong understanding on vduse. :) )
>>
>> The large barrier is bounce-buffer mapping: SPDK requires hugepages
>> for NVMe over PCIe and RDMA, so mapping some preallocated hugepages
>> as the bounce buffer is necessary. Otherwise it's hard to avoid an extra
>> memcpy from bounce-buffer to hugepage.
>> If you can add an option to map hugepages as bounce-buffer, then SPDK
>> could also be a potential user of vduse.
>
>
>Several issues:
>
>- VDUSE needs to limit the total size of the bounce buffers (64M if I was not
>wrong). Does it work for SPDK?

Yes, Jason. It is enough and works for SPDK.
Since it is a bounce buffer used mainly for in-flight IO, a limited size such
as 64MB is enough.

>- VDUSE can use hugepages but I'm not sure we can mandate hugepages (or we
>need introduce new flags for supporting this)

I share your worry: it is hard for a kernel module to preallocate
hugepages internally.
What I tried is this:
1. A simple agent daemon (representing one device) `preallocates` and maps
    dozens of 2MB hugepages (64MB, for example) for that device.
2. The daemon passes its mapping addr&len and hugepage fd to the kernel
    module through a newly created IOCTL.
3. The kernel module remaps the hugepages inside the kernel.
4. The vhost-user target gets the hugepage fd from the kernel module
    in a vhost-user msg through Unix Domain Socket cmsg, and maps it.
Then the kernel module and the target map the same hugepage-based
bounce buffer for in-flight IO.
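
A rough userspace-only sketch of steps 1, 2 and 4 above, under stated
assumptions: one process maps a hugepage-backed memfd and hands the fd to a
peer over a Unix domain socket with SCM_RIGHTS. This is not VDUSE, SPDK or
vhost-user code. The helper names try_map(), map_bounce(), send_fd() and
recv_fd() are made up for illustration, and the fallback to ordinary pages
exists only so the sketch also runs on hosts without reserved hugepages.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MFD_HUGETLB
#define MFD_HUGETLB 0x0004U
#endif

/* Hypothetical helper: create a memfd of `len` bytes and map it shared. */
static void *try_map(size_t len, unsigned int flags, int *fd_out)
{
	void *p;
	int fd = memfd_create("bounce", MFD_CLOEXEC | flags);

	if (fd < 0)
		return NULL;
	if (ftruncate(fd, (off_t)len) == 0) {
		p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
		if (p != MAP_FAILED) {
			*fd_out = fd;
			return p;
		}
	}
	close(fd);
	return NULL;
}

/* Step 1: map a bounce buffer, preferring hugepages (len should be a
 * multiple of the huge page size); fall back to ordinary pages. */
static void *map_bounce(size_t len, int *fd_out)
{
	void *p;

	assert(fd_out != NULL && len > 0);
	p = try_map(len, MFD_HUGETLB, fd_out);
	return p ? p : try_map(len, 0, fd_out);
}

/* Steps 2/4: pass `fd` to the peer as SCM_RIGHTS ancillary data. */
static int send_fd(int sock, int fd)
{
	char byte = 0;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&u, 0, sizeof(u));
	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	cmsg = CMSG_FIRSTHDR(&msg);
	cmsg->cmsg_level = SOL_SOCKET;
	cmsg->cmsg_type = SCM_RIGHTS;
	cmsg->cmsg_len = CMSG_LEN(sizeof(int));
	memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

	return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

/* Receive one fd sent by send_fd(); returns the new fd or -1. */
static int recv_fd(int sock)
{
	char byte;
	int fd;
	struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
	union { char buf[CMSG_SPACE(sizeof(int))]; struct cmsghdr align; } u;
	struct msghdr msg;
	struct cmsghdr *cmsg;

	memset(&msg, 0, sizeof(msg));
	msg.msg_iov = &iov;
	msg.msg_iovlen = 1;
	msg.msg_control = u.buf;
	msg.msg_controllen = sizeof(u.buf);

	if (recvmsg(sock, &msg, 0) != 1)
		return -1;
	cmsg = CMSG_FIRSTHDR(&msg);
	if (!cmsg || cmsg->cmsg_level != SOL_SOCKET ||
	    cmsg->cmsg_type != SCM_RIGHTS)
		return -1;
	memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
	return fd;
}
```

The receiver can mmap() the fd it gets with MAP_SHARED, after which both
sides see the same pages. Step 3 (remapping inside the kernel module) has no
userspace equivalent and is not shown.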

If there is an option in VDUSE to map userspace preallocated memory, then
VDUSE should be able to mandate it even if it is hugepage based.

>Thanks
>
>
>>
>> It would be better if SPDK vhost-target could leverage the datapath of
>> vduse directly and efficiently. Even the control path is vdpa based,
>> we may work out one daemon as agent to bridge SPDK vhost-target with vduse.
>> Then users who already deployed SPDK vhost-target, can smoothly run
>> some agent daemon without code modification on SPDK vhost-target itself.
>> (It is only better-to-have for SPDK vhost-target app, not mandatory
>> for SPDK) :) At least, some small barrier is there that blocked a
>> vhost-target use vduse datapath efficiently:
>> - Current IO completion irq of vduse is IOCTL based. If add one option
>> to get it eventfd based, then vhost-target can directly notify IO
>> completion via negotiated eventfd.
>>
>>
>> Thanks
>>  From Xiaodong
>>
>>
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

* Re: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace
  2021-06-28 10:33 ` Liu Xiaodong
  2021-06-28  4:35   ` Jason Wang
@ 2021-06-28 10:32   ` Yongji Xie
  2021-06-29  4:12     ` Jason Wang
  1 sibling, 1 reply; 41+ messages in thread
From: Yongji Xie @ 2021-06-28 10:32 UTC (permalink / raw)
  To: Liu Xiaodong
  Cc: kvm, Michael S. Tsirkin, virtualization, christian.brauner,
	corbet, joro, willy, hch, Xie Yongji, dan.carpenter, viro,
	Stefan Hajnoczi, songmuchun, axboe, gregkh, rdunlap, linux-kernel,
	iommu, bcrl, netdev, linux-fsdevel, mika.penttila

On Mon, 28 Jun 2021 at 10:55, Liu Xiaodong <xiaodong.liu@intel.com> wrote:
>
> On Tue, Jun 15, 2021 at 10:13:21PM +0800, Xie Yongji wrote:
> >
> > This series introduces a framework that makes it possible to implement
> > software-emulated vDPA devices in userspace. And to make it simple, the
> > emulated vDPA device's control path is handled in the kernel and only the
> > data path is implemented in the userspace.
> >
> > Since the emulated vDPA device's control path is handled in the kernel,
> > a message mechanism is introduced to make userspace aware of the data
> > path related changes. Userspace can use read()/write() to receive/reply
> > the control messages.
> >
> > In the data path, the core is mapping dma buffer into VDUSE daemon's
> > address space, which can be implemented in different ways depending on
> > the vdpa bus to which the vDPA device is attached.
> >
> > In virtio-vdpa case, we implement an MMU-based on-chip IOMMU driver with
> > bounce-buffering mechanism to achieve that. And in vhost-vdpa case, the dma
> > buffer resides in a userspace memory region which can be shared with the
> > VDUSE userspace process via transferring the shmfd.
> >
> > The details and our use case are shown below:
> >
> > ------------------------    -------------------------   ----------------------------------------------
> > |            Container |    |              QEMU(VM) |   |                               VDUSE daemon |
> > |       ---------      |    |  -------------------  |   | ------------------------- ---------------- |
> > |       |dev/vdx|      |    |  |/dev/vhost-vdpa-x|  |   | | vDPA device emulation | | block driver | |
> > ------------+-----------     -----------+------------   -------------+----------------------+---------
> >             |                           |                            |                      |
> >             |                           |                            |                      |
> > ------------+---------------------------+----------------------------+----------------------+---------
> > |    | block device |           |  vhost device |            | vduse driver |          | TCP/IP |    |
> > |    -------+--------           --------+--------            -------+--------          -----+----    |
> > |           |                           |                           |                       |        |
> > | ----------+----------       ----------+-----------         -------+-------                |        |
> > | | virtio-blk driver |       |  vhost-vdpa driver |         | vdpa device |                |        |
> > | ----------+----------       ----------+-----------         -------+-------                |        |
> > |           |      virtio bus           |                           |                       |        |
> > |   --------+----+-----------           |                           |                       |        |
> > |                |                      |                           |                       |        |
> > |      ----------+----------            |                           |                       |        |
> > |      | virtio-blk device |            |                           |                       |        |
> > |      ----------+----------            |                           |                       |        |
> > |                |                      |                           |                       |        |
> > |     -----------+-----------           |                           |                       |        |
> > |     |  virtio-vdpa driver |           |                           |                       |        |
> > |     -----------+-----------           |                           |                       |        |
> > |                |                      |                           |    vdpa bus           |        |
> > |     -----------+----------------------+---------------------------+------------           |        |
> > |                                                                                        ---+---     |
> > -----------------------------------------------------------------------------------------| NIC |------
> >                                                                                          ---+---
> >                                                                                             |
> >                                                                                    ---------+---------
> >                                                                                    | Remote Storages |
> >                                                                                    -------------------
> >
> > We make use of it to implement a block device connecting to
> > our distributed storage, which can be used both in containers and
> > VMs. Thus, we can have a unified technology stack in these two cases.
> >
> > To test it with null-blk:
> >
> >   $ qemu-storage-daemon \
> >       --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \
> >       --monitor chardev=charmonitor \
> >       --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0 \
> >       --export type=vduse-blk,id=test,node-name=disk0,writable=on,name=vduse-null,num-queues=16,queue-size=128
> >
> > The qemu-storage-daemon can be found at https://github.com/bytedance/qemu/tree/vduse
> >
> > To allow userspace VDUSE processes such as qemu-storage-daemon to be
> > run by an unprivileged user, we did some work on the virtio drivers to
> > avoid trusting the device, including:
> >
> >   - validating the used length:
> >
> >     * https://lore.kernel.org/lkml/20210531135852.113-1-xieyongji@bytedance.com/
> >     * https://lore.kernel.org/lkml/20210525125622.1203-1-xieyongji@bytedance.com/
> >
> >   - validating the device config:
> >
> >     * https://lore.kernel.org/lkml/20210615104810.151-1-xieyongji@bytedance.com/
> >
> >   - validating the device response:
> >
> >     * https://lore.kernel.org/lkml/20210615105218.214-1-xieyongji@bytedance.com/
> >
> > Since I'm not sure if I am missing something during auditing, especially on some
> > virtio device drivers that I'm not familiar with, we limit the supported device
> > type to virtio block device currently. The support for other device types can be
> > added after the security issue of corresponding device driver is clarified or
> > fixed in the future.
> >
> > Future work:
> >   - Improve performance
> >   - Userspace library (find a way to reuse device emulation code in qemu/rust-vmm)
> >   - Support more device types
> >
> > V7 to V8:
> > - Rebased to newest kernel tree
> > - Rework VDUSE driver to handle the device's control path in kernel
> > - Limit the supported device type to virtio block device
> > - Export free_iova_fast()
> > - Remove the virtio-blk and virtio-scsi patches (will send them alone)
> > - Remove all module parameters
> > - Use the same MAJOR for both control device and VDUSE devices
> > - Avoid eventfd cleanup in vduse_dev_release()
> >
> > V6 to V7:
> > - Export alloc_iova_fast()
> > - Add get_config_size() callback
> > - Add some patches to avoid trusting virtio devices
> > - Add limited device emulation
> > - Add some documents
> > - Use workqueue to inject config irq
> > - Add parameter on vq irq injecting
> > - Rename vduse_domain_get_mapping_page() to vduse_domain_get_coherent_page()
> > - Add WARN_ON() to catch message failure
> > - Add some padding/reserved fields to uAPI structure
> > - Fix some bugs
> > - Rebase to vhost.git
> >
> > V5 to V6:
> > - Export receive_fd() instead of __receive_fd()
> > - Factor out the unmapping logic of pa and va separately
> > - Remove the logic of bounce page allocation in page fault handler
> > - Use PAGE_SIZE as IOVA allocation granule
> > - Add EPOLLOUT support
> > - Enable setting API version in userspace
> > - Fix some bugs
> >
> > V4 to V5:
> > - Remove the patch for irq binding
> > - Use a single IOTLB for all types of mapping
> > - Factor out vhost_vdpa_pa_map()
> > - Add some sample codes in document
> > - Use receive_fd_user() to pass file descriptor
> > - Fix some bugs
> >
> > V3 to V4:
> > - Rebase to vhost.git
> > - Split some patches
> > - Add some documents
> > - Use ioctl to inject interrupt rather than eventfd
> > - Enable config interrupt support
> > - Support binding irq to the specified cpu
> > - Add two module parameters to limit bounce/iova size
> > - Create char device rather than anon inode per vduse
> > - Reuse vhost IOTLB for iova domain
> > - Rework the message mechanism in control path
> >
> > V2 to V3:
> > - Rework the MMU-based IOMMU driver
> > - Use the iova domain as iova allocator instead of genpool
> > - Support transferring vma->vm_file in vhost-vdpa
> > - Add SVA support in vhost-vdpa
> > - Remove the patches on bounce pages reclaim
> >
> > V1 to V2:
> > - Add vhost-vdpa support
> > - Add some documents
> > - Based on the vdpa management tool
> > - Introduce a workqueue for irq injection
> > - Replace interval tree with array map to store the iova_map
> >
> > Xie Yongji (10):
> >   iova: Export alloc_iova_fast() and free_iova_fast();
> >   file: Export receive_fd() to modules
> >   eventfd: Increase the recursion depth of eventfd_signal()
> >   vhost-iotlb: Add an opaque pointer for vhost IOTLB
> >   vdpa: Add an opaque pointer for vdpa_config_ops.dma_map()
> >   vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap()
> >   vdpa: Support transferring virtual addressing during DMA mapping
> >   vduse: Implement an MMU-based IOMMU driver
> >   vduse: Introduce VDUSE - vDPA Device in Userspace
> >   Documentation: Add documentation for VDUSE
> >
> >  Documentation/userspace-api/index.rst              |    1 +
> >  Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
> >  Documentation/userspace-api/vduse.rst              |  222 +++
> >  drivers/iommu/iova.c                               |    2 +
> >  drivers/vdpa/Kconfig                               |   10 +
> >  drivers/vdpa/Makefile                              |    1 +
> >  drivers/vdpa/ifcvf/ifcvf_main.c                    |    2 +-
> >  drivers/vdpa/mlx5/net/mlx5_vnet.c                  |    2 +-
> >  drivers/vdpa/vdpa.c                                |    9 +-
> >  drivers/vdpa/vdpa_sim/vdpa_sim.c                   |    8 +-
> >  drivers/vdpa/vdpa_user/Makefile                    |    5 +
> >  drivers/vdpa/vdpa_user/iova_domain.c               |  545 ++++++++
> >  drivers/vdpa/vdpa_user/iova_domain.h               |   73 +
> >  drivers/vdpa/vdpa_user/vduse_dev.c                 | 1453 ++++++++++++++++++++
> >  drivers/vdpa/virtio_pci/vp_vdpa.c                  |    2 +-
> >  drivers/vhost/iotlb.c                              |   20 +-
> >  drivers/vhost/vdpa.c                               |  148 +-
> >  fs/eventfd.c                                       |    2 +-
> >  fs/file.c                                          |    6 +
> >  include/linux/eventfd.h                            |    5 +-
> >  include/linux/file.h                               |    7 +-
> >  include/linux/vdpa.h                               |   21 +-
> >  include/linux/vhost_iotlb.h                        |    3 +
> >  include/uapi/linux/vduse.h                         |  143 ++
> >  24 files changed, 2641 insertions(+), 50 deletions(-)
> >  create mode 100644 Documentation/userspace-api/vduse.rst
> >  create mode 100644 drivers/vdpa/vdpa_user/Makefile
> >  create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
> >  create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
> >  create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
> >  create mode 100644 include/uapi/linux/vduse.h
> >
> > --
> > 2.11.0
>
> Hi, Yongji
>
> Great work! your method is really wise that implements a software IOMMU
> so that data path gets processed by userspace application efficiently.
> Sorry, I've just realized your work and patches.
>
>
> I was working on a similar thing aiming to get vhost-user-blk device
> from SPDK vhost-target to be exported as local host kernel block device.
> Its diagram is like this:
>
>
>                                 -----------------------------
> ------------------------        |    -----------------      |    ---------------------------------------
> |   <RunC Container>   |     <<<<<<<<| Shared-Memory |>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>        |
> |       ---------      |     v  |    -----------------      |    |                            v        |
> |       |dev/vdx|      |     v  |   <virtio-local-agent>    |    |      <Vhost-user Target>   v        |
> ------------+-----------     v  | ------------------------  |    |  --------------------------v------  |
>             |                v  | |/dev/virtio-local-ctrl|  |    |  | unix socket |   |block driver |  |
>             |                v  ------------+----------------    --------+--------------------v---------
>             |                v              |                            |                    v
> ------------+----------------v--------------+----------------------------+--------------------v--------|
> |    | block device |        v      |  Misc device |                     |                    v        |
> |    -------+--------        v      --------+-------                     |                    v        |
> |           |                v              |                            |                    v        |
> | ----------+----------      v              |                            |                    v        |
> | | virtio-blk driver |      v              |                            |                    v        |
> | ----------+----------      v              |                            |                    v        |
> |           | virtio bus     v              |                            |                    v        |
> |   --------+---+-------     v              |                            |                    v        |
> |               |            v              |                            |                    v        |
> |               |            v              |                            |                    v        |
> |     ----------+----------  v     ---------+-----------                 |                    v        |
> |     | virtio-blk device |--<----| virtio-local driver |----------------<                    v        |
> |     ----------+----------       ----------+-----------                                      v        |
> |                                                                                    ---------+--------|
> -------------------------------------------------------------------------------------| RNIC |--| PCIe |-
>                                                                                      ----+---  | NVMe |
>                                                                                          |     --------
>                                                                                 ---------+---------
>                                                                                 | Remote Storages |
>                                                                                 -------------------
>

Oh, yes, this design is similar to VDUSE.

>
> I just draft out an initial proof version. When seeing your RFC mail,
> I'm thinking that SPDK target may depends on your work, so I could
> directly drop mine.

Great to hear that! I think we can extend VDUSE to meet your needs.
But I prefer to do that after this initial version is merged.

> But after a glance of the RFC patches, seems it is not so easy or
> efficient to get vduse leveraged by SPDK.
> (Please correct me, if I get wrong understanding on vduse. :) )
>
> The large barrier is bounce-buffer mapping: SPDK requires hugepages
> for NVMe over PCIe and RDMA, so mapping some preallocated hugepages
> as the bounce buffer is necessary. Otherwise it's hard to avoid an extra
> memcpy from bounce-buffer to hugepage.
> If you can add an option to map hugepages as bounce-buffer,
> then SPDK could also be a potential user of vduse.
>

I think we can support registering user space memory for bounce-buffer
use like XDP does. But this needs to pin the pages, so I didn't
consider it in this initial version.

> It would be better if SPDK vhost-target could leverage the datapath of
> vduse directly and efficiently. Even the control path is vdpa based,
> we may work out one daemon as agent to bridge SPDK vhost-target with vduse.
> Then users who already deployed SPDK vhost-target, can smoothly run
> some agent daemon without code modification on SPDK vhost-target itself.

That's a good idea!

> (It is only better-to-have for SPDK vhost-target app, not mandatory for SPDK) :)
> At least, some small barrier is there that blocked a vhost-target use vduse
> datapath efficiently:
> - Current IO completion irq of vduse is IOCTL based. If add one option
> to get it eventfd based, then vhost-target can directly notify IO
> completion via negotiated eventfd.
>

Makes sense. Actually we did use the eventfd mechanism for this purpose
in an older version. But using an ioctl is simpler, so we chose it
in this initial version.
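
For comparison, the eventfd style of completion notification discussed here
can be sketched in plain userspace C. This is not the VDUSE uAPI:
notify_completion() and drain_completions() are hypothetical names, and the
sketch relies only on the documented eventfd(2) counter semantics (each
write() adds to a 64-bit counter, and a read() returns the sum and resets it).

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Producer side (e.g. a vhost target): add `n` completions to the counter.
 * Returns 0 on success, -1 on error. */
static int notify_completion(int efd, uint64_t n)
{
	assert(n > 0 && n < UINT64_MAX);
	return write(efd, &n, sizeof(n)) == (ssize_t)sizeof(n) ? 0 : -1;
}

/* Consumer side: drain the counter of a non-blocking eventfd.
 * Returns the number of completions signalled since the last drain,
 * 0 if there were none, or -1 on error. */
static int64_t drain_completions(int efd)
{
	uint64_t count;
	ssize_t r = read(efd, &count, sizeof(count));

	if (r == (ssize_t)sizeof(count))
		return (int64_t)count;
	return (r < 0 && errno == EAGAIN) ? 0 : -1;
}
```

The fd would be created with eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC) and
shared between the two sides. Several notifications issued before a drain
are coalesced into one counter value, which is what makes this cheaper than
one ioctl per completion.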

Thanks,
Yongji

* Re: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace
       [not found] <20210615141331.407-1-xieyongji@bytedance.com>
                   ` (2 preceding siblings ...)
  2021-06-24 15:12 ` [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace Stefan Hajnoczi
@ 2021-06-28 10:33 ` Liu Xiaodong
  2021-06-28  4:35   ` Jason Wang
  2021-06-28 10:32   ` Yongji Xie
       [not found] ` <20210615141331.407-10-xieyongji@bytedance.com>
  4 siblings, 2 replies; 41+ messages in thread
From: Liu Xiaodong @ 2021-06-28 10:33 UTC (permalink / raw)
  To: Xie Yongji, mst, jasowang, stefanha, sgarzare, parav, hch,
	christian.brauner, rdunlap, willy, viro, axboe, bcrl, corbet,
	mika.penttila, dan.carpenter, joro, gregkh, xiaodong.liu
  Cc: kvm, netdev, linux-kernel, virtualization, iommu, songmuchun,
	linux-fsdevel

On Tue, Jun 15, 2021 at 10:13:21PM +0800, Xie Yongji wrote:
> 
> This series introduces a framework that makes it possible to implement
> software-emulated vDPA devices in userspace. And to make it simple, the
> emulated vDPA device's control path is handled in the kernel and only the
> data path is implemented in the userspace.
> 
> Since the emulated vDPA device's control path is handled in the kernel,
> a message mechanism is introduced to make userspace aware of the data
> path related changes. Userspace can use read()/write() to receive/reply
> the control messages.
> 
> In the data path, the core is mapping dma buffer into VDUSE daemon's
> address space, which can be implemented in different ways depending on
> the vdpa bus to which the vDPA device is attached.
> 
> In virtio-vdpa case, we implement an MMU-based on-chip IOMMU driver with
> bounce-buffering mechanism to achieve that. And in vhost-vdpa case, the dma
> buffer resides in a userspace memory region which can be shared with the
> VDUSE userspace process via transferring the shmfd.
> 
> The details and our use case are shown below:
> 
> ------------------------    -------------------------   ----------------------------------------------
> |            Container |    |              QEMU(VM) |   |                               VDUSE daemon |
> |       ---------      |    |  -------------------  |   | ------------------------- ---------------- |
> |       |dev/vdx|      |    |  |/dev/vhost-vdpa-x|  |   | | vDPA device emulation | | block driver | |
> ------------+-----------     -----------+------------   -------------+----------------------+---------
>             |                           |                            |                      |
>             |                           |                            |                      |
> ------------+---------------------------+----------------------------+----------------------+---------
> |    | block device |           |  vhost device |            | vduse driver |          | TCP/IP |    |
> |    -------+--------           --------+--------            -------+--------          -----+----    |
> |           |                           |                           |                       |        |
> | ----------+----------       ----------+-----------         -------+-------                |        |
> | | virtio-blk driver |       |  vhost-vdpa driver |         | vdpa device |                |        |
> | ----------+----------       ----------+-----------         -------+-------                |        |
> |           |      virtio bus           |                           |                       |        |
> |   --------+----+-----------           |                           |                       |        |
> |                |                      |                           |                       |        |
> |      ----------+----------            |                           |                       |        |
> |      | virtio-blk device |            |                           |                       |        |
> |      ----------+----------            |                           |                       |        |
> |                |                      |                           |                       |        |
> |     -----------+-----------           |                           |                       |        |
> |     |  virtio-vdpa driver |           |                           |                       |        |
> |     -----------+-----------           |                           |                       |        |
> |                |                      |                           |    vdpa bus           |        |
> |     -----------+----------------------+---------------------------+------------           |        |
> |                                                                                        ---+---     |
> -----------------------------------------------------------------------------------------| NIC |------
>                                                                                          ---+---
>                                                                                             |
>                                                                                    ---------+---------
>                                                                                    | Remote Storages |
>                                                                                    -------------------
> 
> We make use of it to implement a block device connecting to
> our distributed storage, which can be used both in containers and
> VMs. Thus, we can have a unified technology stack in these two cases.
> 
> To test it with null-blk:
> 
>   $ qemu-storage-daemon \
>       --chardev socket,id=charmonitor,path=/tmp/qmp.sock,server,nowait \
>       --monitor chardev=charmonitor \
>       --blockdev driver=host_device,cache.direct=on,aio=native,filename=/dev/nullb0,node-name=disk0 \
>       --export type=vduse-blk,id=test,node-name=disk0,writable=on,name=vduse-null,num-queues=16,queue-size=128
> 
> The qemu-storage-daemon can be found at https://github.com/bytedance/qemu/tree/vduse
> 
> To allow userspace VDUSE processes such as qemu-storage-daemon to be
> run by an unprivileged user, we did some work on the virtio drivers to
> avoid trusting the device, including:
> 
>   - validating the used length:
> 
>     * https://lore.kernel.org/lkml/20210531135852.113-1-xieyongji@bytedance.com/
>     * https://lore.kernel.org/lkml/20210525125622.1203-1-xieyongji@bytedance.com/
> 
>   - validating the device config:
> 
>     * https://lore.kernel.org/lkml/20210615104810.151-1-xieyongji@bytedance.com/
> 
>   - validating the device response:
> 
>     * https://lore.kernel.org/lkml/20210615105218.214-1-xieyongji@bytedance.com/
> 
> Since I'm not sure if I am missing something during auditing, especially on some
> virtio device drivers that I'm not familiar with, we limit the supported device
> type to virtio block device currently. The support for other device types can be
> added after the security issue of corresponding device driver is clarified or
> fixed in the future.
> 
> Future work:
>   - Improve performance
>   - Userspace library (find a way to reuse device emulation code in qemu/rust-vmm)
>   - Support more device types
> 
> V7 to V8:
> - Rebased to newest kernel tree
> - Rework VDUSE driver to handle the device's control path in kernel
> - Limit the supported device type to virtio block device
> - Export free_iova_fast()
> - Remove the virtio-blk and virtio-scsi patches (will send them alone)
> - Remove all module parameters
> - Use the same MAJOR for both control device and VDUSE devices
> - Avoid eventfd cleanup in vduse_dev_release()
> 
> V6 to V7:
> - Export alloc_iova_fast()
> - Add get_config_size() callback
> - Add some patches to avoid trusting virtio devices
> - Add limited device emulation
> - Add some documents
> - Use workqueue to inject config irq
> - Add parameter on vq irq injecting
> - Rename vduse_domain_get_mapping_page() to vduse_domain_get_coherent_page()
> - Add WARN_ON() to catch message failure
> - Add some padding/reserved fields to uAPI structure
> - Fix some bugs
> - Rebase to vhost.git
> 
> V5 to V6:
> - Export receive_fd() instead of __receive_fd()
> - Factor out the unmapping logic of pa and va separately
> - Remove the logic of bounce page allocation in page fault handler
> - Use PAGE_SIZE as IOVA allocation granule
> - Add EPOLLOUT support
> - Enable setting API version in userspace
> - Fix some bugs
> 
> V4 to V5:
> - Remove the patch for irq binding
> - Use a single IOTLB for all types of mapping
> - Factor out vhost_vdpa_pa_map()
> - Add some sample code to the documentation
> - Use receive_fd_user() to pass file descriptor
> - Fix some bugs
> 
> V3 to V4:
> - Rebase to vhost.git
> - Split some patches
> - Add some documents
> - Use ioctl to inject interrupt rather than eventfd
> - Enable config interrupt support
> - Support binding irq to the specified cpu
> - Add two module parameters to limit bounce/iova size
> - Create char device rather than anon inode per vduse
> - Reuse vhost IOTLB for iova domain
> - Rework the message mechanism in control path
> 
> V2 to V3:
> - Rework the MMU-based IOMMU driver
> - Use the iova domain as iova allocator instead of genpool
> - Support transferring vma->vm_file in vhost-vdpa
> - Add SVA support in vhost-vdpa
> - Remove the patches on bounce pages reclaim
> 
> V1 to V2:
> - Add vhost-vdpa support
> - Add some documents
> - Based on the vdpa management tool
> - Introduce a workqueue for irq injection
> - Replace interval tree with array map to store the iova_map
> 
> Xie Yongji (10):
>   iova: Export alloc_iova_fast() and free_iova_fast();
>   file: Export receive_fd() to modules
>   eventfd: Increase the recursion depth of eventfd_signal()
>   vhost-iotlb: Add an opaque pointer for vhost IOTLB
>   vdpa: Add an opaque pointer for vdpa_config_ops.dma_map()
>   vdpa: factor out vhost_vdpa_pa_map() and vhost_vdpa_pa_unmap()
>   vdpa: Support transferring virtual addressing during DMA mapping
>   vduse: Implement an MMU-based IOMMU driver
>   vduse: Introduce VDUSE - vDPA Device in Userspace
>   Documentation: Add documentation for VDUSE
> 
>  Documentation/userspace-api/index.rst              |    1 +
>  Documentation/userspace-api/ioctl/ioctl-number.rst |    1 +
>  Documentation/userspace-api/vduse.rst              |  222 +++
>  drivers/iommu/iova.c                               |    2 +
>  drivers/vdpa/Kconfig                               |   10 +
>  drivers/vdpa/Makefile                              |    1 +
>  drivers/vdpa/ifcvf/ifcvf_main.c                    |    2 +-
>  drivers/vdpa/mlx5/net/mlx5_vnet.c                  |    2 +-
>  drivers/vdpa/vdpa.c                                |    9 +-
>  drivers/vdpa/vdpa_sim/vdpa_sim.c                   |    8 +-
>  drivers/vdpa/vdpa_user/Makefile                    |    5 +
>  drivers/vdpa/vdpa_user/iova_domain.c               |  545 ++++++++
>  drivers/vdpa/vdpa_user/iova_domain.h               |   73 +
>  drivers/vdpa/vdpa_user/vduse_dev.c                 | 1453 ++++++++++++++++++++
>  drivers/vdpa/virtio_pci/vp_vdpa.c                  |    2 +-
>  drivers/vhost/iotlb.c                              |   20 +-
>  drivers/vhost/vdpa.c                               |  148 +-
>  fs/eventfd.c                                       |    2 +-
>  fs/file.c                                          |    6 +
>  include/linux/eventfd.h                            |    5 +-
>  include/linux/file.h                               |    7 +-
>  include/linux/vdpa.h                               |   21 +-
>  include/linux/vhost_iotlb.h                        |    3 +
>  include/uapi/linux/vduse.h                         |  143 ++
>  24 files changed, 2641 insertions(+), 50 deletions(-)
>  create mode 100644 Documentation/userspace-api/vduse.rst
>  create mode 100644 drivers/vdpa/vdpa_user/Makefile
>  create mode 100644 drivers/vdpa/vdpa_user/iova_domain.c
>  create mode 100644 drivers/vdpa/vdpa_user/iova_domain.h
>  create mode 100644 drivers/vdpa/vdpa_user/vduse_dev.c
>  create mode 100644 include/uapi/linux/vduse.h
> 
> --
> 2.11.0

Hi, Yongji

Great work! Your approach of implementing a software IOMMU is really smart,
since it lets the data path be processed efficiently by a userspace application.
Sorry, I've only just become aware of your work and patches.


I was working on something similar, aiming to export a vhost-user-blk device
from the SPDK vhost-target as a local host kernel block device.
Its diagram looks like this:


                                -----------------------------                
------------------------        |    -----------------      |    ---------------------------------------
|   <RunC Container>   |     <<<<<<<<| Shared-Memory |>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>        |
|       ---------      |     v  |    -----------------      |    |                            v        |
|       |dev/vdx|      |     v  |   <virtio-local-agent>    |    |      <Vhost-user Target>   v        |
------------+-----------     v  | ------------------------  |    |  --------------------------v------  |
            |                v  | |/dev/virtio-local-ctrl|  |    |  | unix socket |   |block driver |  |
            |                v  ------------+----------------    --------+--------------------v---------
            |                v              |                            |                    v
------------+----------------v--------------+----------------------------+--------------------v--------|
|    | block device |        v      |  Misc device |                     |                    v        |
|    -------+--------        v      --------+-------                     |                    v        |
|           |                v              |                            |                    v        |
| ----------+----------      v              |                            |                    v        |
| | virtio-blk driver |      v              |                            |                    v        |
| ----------+----------      v              |                            |                    v        |
|           | virtio bus     v              |                            |                    v        |
|   --------+---+-------     v              |                            |                    v        |
|               |            v              |                            |                    v        |
|               |            v              |                            |                    v        |
|     ----------+----------  v     ---------+-----------                 |                    v        |
|     | virtio-blk device |--<----| virtio-local driver |----------------<                    v        |
|     ----------+----------       ----------+-----------                                      v        |
|                                                                                    ---------+--------|
-------------------------------------------------------------------------------------| RNIC |--| PCIe |-
                                                                                     ----+---  | NVMe |
                                                                                         |     --------
                                                                                ---------+---------
                                                                                | Remote Storages |
                                                                                -------------------


I have just drafted an initial proof-of-concept version. When I saw your RFC mail,
I thought that the SPDK target could depend on your work, so I could
drop mine directly.
But after a glance at the RFC patches, it seems that it is not so easy or
efficient for SPDK to leverage vduse.
(Please correct me if I have misunderstood vduse. :) )

The main barrier is bounce-buffer mapping: SPDK requires hugepages
for NVMe over PCIe and RDMA, so it is necessary to map some preallocated
hugepages as the bounce buffer. Otherwise it's hard to avoid an extra
memcpy from the bounce buffer to the hugepages.
If you can add an option to map hugepages as the bounce buffer,
then SPDK could also be a potential user of vduse.

It would be better if the SPDK vhost-target could leverage the datapath of
vduse directly and efficiently. Even though the control path is vdpa based,
we could work out a daemon that acts as an agent to bridge the SPDK vhost-target
with vduse.
Then users who have already deployed the SPDK vhost-target can smoothly run
such an agent daemon without any code modification to the SPDK vhost-target itself.
(This is only nice-to-have for the SPDK vhost-target app, not mandatory for SPDK.) :)
At the least, there is one small barrier that prevents a vhost-target from using
the vduse datapath efficiently:
- The current IO completion irq of vduse is ioctl based. If an option is added
to make it eventfd based, then the vhost-target can directly signal IO
completion via a negotiated eventfd.
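
To make the eventfd point concrete, here is a minimal userspace sketch of
eventfd-based completion signaling. It uses only the standard eventfd(2)
interface. The helper names notify_completion() and drain_completions() are
made up for illustration and are not part of VDUSE or SPDK; how such an eventfd
would be negotiated with VDUSE is left open here.

```c
#include <assert.h>
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/*
 * Hypothetical helper: the backend adds n completed requests to the
 * eventfd counter, which wakes up anyone polling or reading the fd.
 */
static int notify_completion(int efd, uint64_t n)
{
	assert(n > 0);	/* adding zero would not signal anything useful */
	return write(efd, &n, sizeof(n)) == sizeof(n) ? 0 : -1;
}

/*
 * Hypothetical helper: the consumer reads the accumulated counter,
 * which also resets it. Returns 0 when nothing is pending (the fd is
 * assumed to be opened with EFD_NONBLOCK) or on error.
 */
static uint64_t drain_completions(int efd)
{
	uint64_t count = 0;

	if (read(efd, &count, sizeof(count)) != sizeof(count))
		return 0;
	return count;
}
```

With an fd from eventfd(0, EFD_NONBLOCK), two notifications of 1 and 2 are
coalesced into a single read that returns 3, so the consumer does not need one
syscall per completed request.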


Thanks
From Xiaodong





									
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
       [not found]                                 ` <CACycT3u9-id2DxPpuVLtyg4tzrUF9xCAGr7nBm=21HfUJJasaQ@mail.gmail.com>
@ 2021-06-29  3:29                                   ` Jason Wang
       [not found]                                     ` <CACycT3ucVz3D4Tcr1C6uzWyApZy7Xk4o17VH2gvLO3w1Ra+skg@mail.gmail.com>
  0 siblings, 1 reply; 41+ messages in thread
From: Jason Wang @ 2021-06-29  3:29 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, joro, Matthew Wilcox, Christoph Hellwig,
	Dan Carpenter, Al Viro, Stefan Hajnoczi, songmuchun, Jens Axboe,
	Greg KH, Randy Dunlap, linux-kernel, iommu, bcrl, netdev,
	linux-fsdevel, Mika Penttilä


On 2021/6/29 at 10:26 AM, Yongji Xie wrote:
> On Mon, Jun 28, 2021 at 12:40 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/6/25 at 12:19 PM, Yongji Xie wrote:
>>>> 2b) for set_status(): simply relay the message to userspace, no reply is
>>>> needed. Userspace will use a command to update the status when the
>>>> datapath is stopped. Then the status could be fetched via get_status().
>>>>
>>>> 2b looks more spec compliant.
>>>>
>>> Looks good to me. And I think we can use the reply of the message to
>>> update the status instead of introducing a new command.
>>>
>> Just noticed this part in virtio_finalize_features():
>>
>>           virtio_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK);
>>           status = dev->config->get_status(dev);
>>           if (!(status & VIRTIO_CONFIG_S_FEATURES_OK)) {
>>
>> So no reply doesn't work for FEATURES_OK.
>>
>> So my understanding is:
>>
>> 1) We must not use noreply for set_status()
>> 2) We can use noreply for get_status(), but it requires a new ioctl to
>> update the status.
>>
>> So it looks to me like we need to synchronize both get_status() and
>> set_status().
>>
> We should not send messages to userspace in the FEATURES_OK case. So
> the synchronization is not necessary.


As discussed previously, there could be a device that mandates some
features (e.g. VIRTIO_F_RING_PACKED). So it can choose not to accept
FEATURES_OK if the packed virtqueue is not negotiated.

In this case we need to relay the message to userspace.
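
A minimal sketch of what such a userspace device might do when the status
message is relayed to it. The two constants carry the values defined by the
virtio specification; struct demo_dev and dev_handle_set_status() are
hypothetical names used only for illustration and do not reflect the VDUSE
uAPI.

```c
#include <assert.h>
#include <stdint.h>

#define VIRTIO_F_RING_PACKED		34	/* feature bit number (virtio spec) */
#define VIRTIO_CONFIG_S_FEATURES_OK	8	/* status bit value (virtio spec) */

/* Hypothetical state kept by the userspace device implementation. */
struct demo_dev {
	uint64_t driver_features;	/* features acked by the driver */
	uint64_t required_features;	/* features this device insists on */
	uint8_t status;			/* current device status */
};

/*
 * Hypothetical handler for a relayed set_status message. The device
 * clears FEATURES_OK when a mandatory feature was not negotiated, so a
 * later get_status() lets the driver notice the rejection.
 */
static uint8_t dev_handle_set_status(struct demo_dev *dev, uint8_t status)
{
	assert(dev);
	if ((status & VIRTIO_CONFIG_S_FEATURES_OK) &&
	    (dev->driver_features & dev->required_features) !=
	    dev->required_features)
		status &= (uint8_t)~VIRTIO_CONFIG_S_FEATURES_OK;

	dev->status = status;
	return dev->status;
}
```

The point of the sketch is that only the device knows its mandatory features,
which is why the kernel cannot answer the FEATURES_OK write on its own.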

Thanks


>
> Thanks,
> Yongji
>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
       [not found]                                     ` <CACycT3ucVz3D4Tcr1C6uzWyApZy7Xk4o17VH2gvLO3w1Ra+skg@mail.gmail.com>
@ 2021-06-29  4:03                                       ` Jason Wang
  0 siblings, 0 replies; 41+ messages in thread
From: Jason Wang @ 2021-06-29  4:03 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, joro, Matthew Wilcox, Christoph Hellwig,
	Dan Carpenter, Al Viro, Stefan Hajnoczi, songmuchun, Jens Axboe,
	Greg KH, Randy Dunlap, linux-kernel, iommu, bcrl, netdev,
	linux-fsdevel, Mika Penttilä


On 2021/6/29 at 11:56 AM, Yongji Xie wrote:
> On Tue, Jun 29, 2021 at 11:29 AM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/6/29 at 10:26 AM, Yongji Xie wrote:
>>> On Mon, Jun 28, 2021 at 12:40 PM Jason Wang <jasowang@redhat.com> wrote:
>>>> On 2021/6/25 at 12:19 PM, Yongji Xie wrote:
>>>>>> 2b) for set_status(): simply relay the message to userspace, no reply is
>>>>>> needed. Userspace will use a command to update the status when the
>>>>>> datapath is stopped. Then the status could be fetched via get_status().
>>>>>>
>>>>>> 2b looks more spec compliant.
>>>>>>
>>>>> Looks good to me. And I think we can use the reply of the message to
>>>>> update the status instead of introducing a new command.
>>>>>
>>>> Just noticed this part in virtio_finalize_features():
>>>>
>>>>            virtio_add_status(dev, VIRTIO_CONFIG_S_FEATURES_OK);
>>>>            status = dev->config->get_status(dev);
>>>>            if (!(status & VIRTIO_CONFIG_S_FEATURES_OK)) {
>>>>
>>>> So no reply doesn't work for FEATURES_OK.
>>>>
>>>> So my understanding is:
>>>>
>>>> 1) We must not use noreply for set_status()
>>>> 2) We can use noreply for get_status(), but it requires a new ioctl to
>>>> update the status.
>>>>
>>>> So it looks to me like we need to synchronize both get_status() and
>>>> set_status().
>>>>
>>> We should not send messages to userspace in the FEATURES_OK case. So
>>> the synchronization is not necessary.
>>
>> As discussed previously, there could be a device that mandates some
>> features (e.g. VIRTIO_F_RING_PACKED). So it can choose not to accept
>> FEATURES_OK if the packed virtqueue is not negotiated.
>>
>> In this case we need to relay the message to userspace.
>>
> OK, I see. If so, I prefer to only use noreply for set_status(). We do
> not set the status bit if the message fails. In this way, we don't
> need to change a lot of virtio core code to handle the failure of
> set_status()/get_status().


It should work.
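
For illustration, the behavior agreed on here can be sketched as below: the
status bit is committed only when the relayed message succeeds, so the existing
check in virtio_finalize_features() catches the failure without further
changes. Everything prefixed with demo_ is hypothetical and stands in for the
real driver code; the relay callbacks simply simulate userspace accepting or
rejecting the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DEMO_S_FEATURES_OK 8	/* same value as VIRTIO_CONFIG_S_FEATURES_OK */

/* Hypothetical kernel-side device state; not the real struct vduse_dev. */
struct demo_vdev {
	uint8_t status;				/* status cached in the kernel */
	bool (*relay)(uint8_t new_status);	/* sends the message, true on success */
};

/* Stand-ins for userspace accepting or rejecting the relayed status. */
static bool demo_relay_ok(uint8_t new_status)
{
	(void)new_status;
	return true;
}

static bool demo_relay_fail(uint8_t new_status)
{
	(void)new_status;
	return false;
}

/* Commit the new status only if the message to userspace succeeded. */
static void demo_vdev_set_status(struct demo_vdev *d, uint8_t status)
{
	assert(d && d->relay);
	if (d->relay(status))
		d->status = status;
}

/* get_status() is answered from the cached value, no message needed. */
static uint8_t demo_vdev_get_status(const struct demo_vdev *d)
{
	return d->status;
}

/* Mirrors the set-then-read-back check in virtio_finalize_features(). */
static int demo_finalize_features(struct demo_vdev *d)
{
	demo_vdev_set_status(d, demo_vdev_get_status(d) | DEMO_S_FEATURES_OK);
	return (demo_vdev_get_status(d) & DEMO_S_FEATURES_OK) ? 0 : -1;
}
```

With demo_relay_fail installed, demo_finalize_features() returns -1 and the
cached status stays unchanged, which is the failure path the virtio core
already handles.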

Thanks


>
> Thanks,
> Yongji
>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace
  2021-06-28  5:54     ` Liu, Xiaodong
@ 2021-06-29  4:10       ` Jason Wang
  2021-06-29  7:56         ` Liu, Xiaodong
  0 siblings, 1 reply; 41+ messages in thread
From: Jason Wang @ 2021-06-29  4:10 UTC (permalink / raw)
  To: Liu, Xiaodong, Xie Yongji, mst@redhat.com, stefanha@redhat.com,
	sgarzare@redhat.com, parav@nvidia.com, hch@infradead.org,
	christian.brauner@canonical.com, rdunlap@infradead.org,
	willy@infradead.org, viro@zeniv.linux.org.uk, axboe@kernel.dk,
	bcrl@kvack.org, corbet@lwn.net, mika.penttila@nextfour.com,
	dan.carpenter@oracle.com, joro@8bytes.org,
	gregkh@linuxfoundation.org
  Cc: kvm@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org,
	iommu@lists.linux-foundation.org, songmuchun@bytedance.com,
	linux-fsdevel@vger.kernel.org


On 2021/6/28 at 1:54 PM, Liu, Xiaodong wrote:
>> Several issues:
>>
>> - VDUSE needs to limit the total size of the bounce buffers (64M if I'm not
>> mistaken). Does it work for SPDK?
> Yes, Jason. It is enough and works for SPDK.
> Since it's a kind of bounce buffer mainly for in-flight IO, a limited size like
> 64MB is enough.


Ok.


>
>> - VDUSE can use hugepages but I'm not sure we can mandate hugepages (or we
>> need to introduce new flags for supporting this)
> I share your worry; I'm also afraid that it is hard for a kernel module
> to preallocate hugepages internally.
> What I tried is the following:
> 1. A simple agent daemon (representing one device) preallocates and maps
>      dozens of 2MB hugepages (e.g. 64MB in total) for one device.
> 2. The daemon passes its mapping addr&len and hugepage fd to the kernel
>      module through a newly created ioctl.
> 3. The kernel module remaps the hugepages inside the kernel.


Such a model should work, but the main "issue" is that it introduces
overhead in the case of vhost-vDPA.

Note that in the case of vhost-vDPA, we don't use a bounce buffer; the
userspace pages are shared directly.

And since DMA is not done per page, it prevents us from using tricks
like vm_insert_page() in those cases.


> 4. The vhost-user target gets the hugepage fd from the kernel module
>      in a vhost-user msg through Unix Domain Socket cmsg and maps it.
> Then the kernel module and the target map the same hugepage-based
> bounce buffer for in-flight IO.
>
> If there is an option in VDUSE to map userspace preallocated memory, then
> VDUSE should be able to mandate it even if it is hugepage based.
>

As above, this requires some kind of redesign since VDUSE depends on
the model of mmap(MAP_SHARED) instead of umem registration.

Thanks


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace
  2021-06-28 10:32   ` Yongji Xie
@ 2021-06-29  4:12     ` Jason Wang
       [not found]       ` <CACycT3vVhNdhtyohKJQuMXTic5m6jDjEfjzbzvp=2FJgwup8mg@mail.gmail.com>
  0 siblings, 1 reply; 41+ messages in thread
From: Jason Wang @ 2021-06-29  4:12 UTC (permalink / raw)
  To: Yongji Xie, Liu Xiaodong
  Cc: kvm, Michael S. Tsirkin, virtualization, christian.brauner,
	corbet, joro, willy, hch, Xie Yongji, dan.carpenter, viro,
	Stefan Hajnoczi, songmuchun, axboe, gregkh, rdunlap, linux-kernel,
	iommu, bcrl, netdev, linux-fsdevel, mika.penttila


On 2021/6/28 at 6:32 PM, Yongji Xie wrote:
>> The main barrier is bounce-buffer mapping: SPDK requires hugepages
>> for NVMe over PCIe and RDMA, so it is necessary to map some preallocated
>> hugepages as the bounce buffer. Otherwise it's hard to avoid an extra
>> memcpy from the bounce buffer to the hugepages.
>> If you can add an option to map hugepages as the bounce buffer,
>> then SPDK could also be a potential user of vduse.
>>
> I think we can support registering user space memory for bounce-buffer
> use like XDP does. But this needs to pin the pages, so I didn't
> consider it in this initial version.
>

Note that userspace should be unaware of the existence of the bounce buffer.

So we need to think carefully about mmap() vs umem registration.

Thanks


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace
       [not found]       ` <CACycT3vVhNdhtyohKJQuMXTic5m6jDjEfjzbzvp=2FJgwup8mg@mail.gmail.com>
@ 2021-06-29  7:33         ` Jason Wang
  0 siblings, 0 replies; 41+ messages in thread
From: Jason Wang @ 2021-06-29  7:33 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, joro, Matthew Wilcox, Christoph Hellwig,
	Dan Carpenter, Liu Xiaodong, Al Viro, Stefan Hajnoczi, songmuchun,
	Jens Axboe, Greg KH, Randy Dunlap, linux-kernel, iommu, bcrl,
	netdev, linux-fsdevel, Mika Penttilä


On 2021/6/29 at 2:40 PM, Yongji Xie wrote:
> On Tue, Jun 29, 2021 at 12:13 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/6/28 at 6:32 PM, Yongji Xie wrote:
>>>> The main barrier is bounce-buffer mapping: SPDK requires hugepages
>>>> for NVMe over PCIe and RDMA, so it is necessary to map some preallocated
>>>> hugepages as the bounce buffer. Otherwise it's hard to avoid an extra
>>>> memcpy from the bounce buffer to the hugepages.
>>>> If you can add an option to map hugepages as the bounce buffer,
>>>> then SPDK could also be a potential user of vduse.
>>>>
>>> I think we can support registering user space memory for bounce-buffer
>>> use like XDP does. But this needs to pin the pages, so I didn't
>>> consider it in this initial version.
>>>
>> Note that userspace should be unaware of the existence of the bounce buffer.
>>
> If so, it might be hard to use umem, because we can't use umem for
> coherent mappings, which need physically contiguous memory.
>
> Thanks,
> Yongji


We probably can use umem for memory other than the virtqueue (still via 
mmap()).

Thanks



^ permalink raw reply	[flat|nested] 41+ messages in thread

* RE: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace
  2021-06-29  4:10       ` Jason Wang
@ 2021-06-29  7:56         ` Liu, Xiaodong
  0 siblings, 0 replies; 41+ messages in thread
From: Liu, Xiaodong @ 2021-06-29  7:56 UTC (permalink / raw)
  To: Jason Wang, Xie Yongji, mst@redhat.com, stefanha@redhat.com,
	sgarzare@redhat.com, parav@nvidia.com, hch@infradead.org,
	christian.brauner@canonical.com, rdunlap@infradead.org,
	willy@infradead.org, viro@zeniv.linux.org.uk, axboe@kernel.dk,
	bcrl@kvack.org, corbet@lwn.net, mika.penttila@nextfour.com,
	dan.carpenter@oracle.com, joro@8bytes.org,
	gregkh@linuxfoundation.org
  Cc: kvm@vger.kernel.org, netdev@vger.kernel.org,
	linux-kernel@vger.kernel.org,
	virtualization@lists.linux-foundation.org,
	iommu@lists.linux-foundation.org, songmuchun@bytedance.com,
	linux-fsdevel@vger.kernel.org



>-----Original Message-----
>From: Jason Wang <jasowang@redhat.com>
>Sent: Tuesday, June 29, 2021 12:11 PM
>To: Liu, Xiaodong <xiaodong.liu@intel.com>; Xie Yongji
><xieyongji@bytedance.com>; mst@redhat.com; stefanha@redhat.com;
>sgarzare@redhat.com; parav@nvidia.com; hch@infradead.org;
>christian.brauner@canonical.com; rdunlap@infradead.org; willy@infradead.org;
>viro@zeniv.linux.org.uk; axboe@kernel.dk; bcrl@kvack.org; corbet@lwn.net;
>mika.penttila@nextfour.com; dan.carpenter@oracle.com; joro@8bytes.org;
>gregkh@linuxfoundation.org
>Cc: songmuchun@bytedance.com; virtualization@lists.linux-foundation.org;
>netdev@vger.kernel.org; kvm@vger.kernel.org; linux-fsdevel@vger.kernel.org;
>iommu@lists.linux-foundation.org; linux-kernel@vger.kernel.org
>Subject: Re: [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace
>
>
>On 2021/6/28 at 1:54 PM, Liu, Xiaodong wrote:
>>> Several issues:
>>>
>>> - VDUSE needs to limit the total size of the bounce buffers (64M if I'm not
>>> mistaken). Does it work for SPDK?
>> Yes, Jason. It is enough and works for SPDK.
>> Since it's a kind of bounce buffer mainly for in-flight IO, a limited size like
>> 64MB is enough.
>
>
>Ok.
>
>
>>
>>> - VDUSE can use hugepages but I'm not sure we can mandate hugepages (or we
>>> need to introduce new flags for supporting this)
>> I share your worry; I'm also afraid that it is hard for a kernel module
>> to preallocate hugepages internally.
>> What I tried is the following:
>> 1. A simple agent daemon (representing one device) preallocates and maps
>>      dozens of 2MB hugepages (e.g. 64MB in total) for one device.
>> 2. The daemon passes its mapping addr&len and hugepage fd to the kernel
>>      module through a newly created ioctl.
>> 3. The kernel module remaps the hugepages inside the kernel.
>
>
>Such a model should work, but the main "issue" is that it introduces
>overhead in the case of vhost-vDPA.
>
>Note that in the case of vhost-vDPA, we don't use a bounce buffer; the
>userspace pages are shared directly.
>
>And since DMA is not done per page, it prevents us from using tricks
>like vm_insert_page() in those cases.
>

Yes, handling the vhost-vDPA case is indeed a problem.
But there are already several solutions for serving VMs, like vhost-user and
vfio-user, so at least SPDK won't serve VMs through VDUSE. If a user
still wants to do that, then the user has to tolerate the introduced overhead.

In other words, a software backend like SPDK will appreciate the virtio
datapath of VDUSE for serving the local host instead of VMs. That's why I also
drafted a "virtio-local" to bridge the vhost-user target and the local host
kernel virtio-blk.

>
>> 4. The vhost-user target gets the hugepage fd from the kernel module
>>      in a vhost-user msg through Unix Domain Socket cmsg and maps it.
>> Then the kernel module and the target map the same hugepage-based
>> bounce buffer for in-flight IO.
>>
>> If there is an option in VDUSE to map userspace preallocated memory, then
>> VDUSE should be able to mandate it even if it is hugepage based.
>>
>
>As above, this requires some kind of redesign since VDUSE depends on
>the model of mmap(MAP_SHARED) instead of umem registration.

Got it, Jason. This may be hard for the current version of VDUSE.
Maybe we can consider these options later, after VDUSE is merged.

If the VDUSE datapath could be leveraged directly by a vhost-user target,
its value would be realized immediately.

>
>Thanks


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
       [not found]     ` <CACycT3vaXQ4dxC9QUzXXJs7og6TVqqVGa8uHZnTStacsYAiFwQ@mail.gmail.com>
@ 2021-06-30  9:51       ` Stefan Hajnoczi
       [not found]         ` <CACycT3t6M5i0gznABm52v=rdmeeLZu8smXAOLg+WsM3WY1fgTw@mail.gmail.com>
  0 siblings, 1 reply; 41+ messages in thread
From: Stefan Hajnoczi @ 2021-06-30  9:51 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, joro, Matthew Wilcox, Christoph Hellwig,
	Dan Carpenter, Al Viro, songmuchun, Jens Axboe, Greg KH,
	Randy Dunlap, linux-kernel, iommu, bcrl, netdev, linux-fsdevel,
	Mika Penttilä


[-- Attachment #1.1: Type: text/plain, Size: 2121 bytes --]

On Tue, Jun 29, 2021 at 10:59:51AM +0800, Yongji Xie wrote:
> On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Tue, Jun 15, 2021 at 10:13:30PM +0800, Xie Yongji wrote:
> > > +/* ioctls */
> > > +
> > > +struct vduse_dev_config {
> > > +     char name[VDUSE_NAME_MAX]; /* vduse device name */
> > > +     __u32 vendor_id; /* virtio vendor id */
> > > +     __u32 device_id; /* virtio device id */
> > > +     __u64 features; /* device features */
> > > +     __u64 bounce_size; /* bounce buffer size for iommu */
> > > +     __u16 vq_size_max; /* the max size of virtqueue */
> >
> > The VIRTIO specification allows per-virtqueue sizes. A device can have
> > two virtqueues, where the first one allows up to 1024 descriptors and
> > the second one allows only 128 descriptors, for example.
> >
> 
> Good point! But it looks like virtio-vdpa/virtio-pci doesn't support
> that now. All virtqueues have the same maximum size.

I see struct vdpa_config_ops only supports a per-device max vq size:
u16 (*get_vq_num_max)(struct vdpa_device *vdev);

virtio-pci supports per-virtqueue sizes because the struct
virtio_pci_common_cfg->queue_size register is per-queue (controlled by
queue_select).

I guess this is a question for Jason: will vdpa keep this limitation?
If yes, then VDUSE can stick to it too without running into problems in
the future.

> > > +     __u16 padding; /* padding */
> > > +     __u32 vq_num; /* the number of virtqueues */
> > > +     __u32 vq_align; /* the allocation alignment of virtqueue's metadata */
> >
> > I'm not sure what this is?
> >
> 
>  This will be used by vring_create_virtqueue() too.

If there is no official definition for the meaning of this value then
"/* same as vring_create_virtqueue()'s vring_align parameter */" would
be clearer. That way the reader knows what to research in order to
understand how this field works.

I don't remember exactly, but maybe it was used to support vrings when the
host/guest have non-4KB page sizes. Does anyone have an official
definition for this value?
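
For readers researching the vring_align parameter: in the split ring layout
the alignment decides where the used ring starts after the descriptor table
and the available ring. The sketch below mirrors the arithmetic of
vring_size() in include/uapi/linux/virtio_ring.h with the structure sizes
written out as constants so that it is self-contained. The demo_ names are
hypothetical, and whether vq_align has any further meaning for VDUSE is not
settled here.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_DESC_SIZE		16	/* sizeof(struct vring_desc) */
#define DEMO_USED_ELEM_SIZE	8	/* sizeof(struct vring_used_elem) */

/*
 * Offset of the used ring from the start of a split vring: descriptor
 * table plus available ring (flags, idx, num entries and used_event,
 * 2 bytes each), rounded up to the requested alignment.
 */
static size_t demo_used_offset(unsigned int num, size_t align)
{
	/* The rounding trick below needs a power-of-two alignment. */
	assert(align && (align & (align - 1)) == 0);

	size_t avail_end = DEMO_DESC_SIZE * (size_t)num + 2 * (3 + (size_t)num);

	return (avail_end + align - 1) & ~(align - 1);
}

/* Total size of the vring: aligned first part plus the used ring. */
static size_t demo_vring_size(unsigned int num, size_t align)
{
	return demo_used_offset(num, align) + 2 * 3 +
	       DEMO_USED_ELEM_SIZE * (size_t)num;
}
```

For a 128-entry queue, an alignment of 4096 places the used ring at offset
4096, while an alignment of 64 places it at offset 2368, so a smaller alignment
packs the ring more tightly.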

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE
       [not found]     ` <CACycT3uxnQmXWsgmNVxQtiRhz1UXXTAJFY3OiAJqokbJH6ifMA@mail.gmail.com>
@ 2021-06-30 10:06       ` Stefan Hajnoczi
       [not found]         ` <CACycT3taKhf1cWp3Jd0aSVekAZvpbR-_fkyPLQ=B+jZBB5H=8Q@mail.gmail.com>
  0 siblings, 1 reply; 41+ messages in thread
From: Stefan Hajnoczi @ 2021-06-30 10:06 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, joro, Matthew Wilcox, Christoph Hellwig,
	Dan Carpenter, Al Viro, songmuchun, Jens Axboe, Greg KH,
	Randy Dunlap, linux-kernel, iommu, bcrl, netdev, linux-fsdevel,
	Mika Penttilä


[-- Attachment #1.1: Type: text/plain, Size: 5708 bytes --]

On Tue, Jun 29, 2021 at 01:43:11PM +0800, Yongji Xie wrote:
> On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote:
> > > +     static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len)
> > > +     {
> > > +             int fd;
> > > +             void *addr;
> > > +             size_t size;
> > > +             struct vduse_iotlb_entry entry;
> > > +
> > > +             entry.start = iova;
> > > +             entry.last = iova + 1;
> >
> > Why +1?
> >
> > I expected the request to include *len so that VDUSE can create a bounce
> > buffer for the full iova range, if necessary.
> >
> 
> The function is used to translate iova to va. And the *len is not
> specified by the caller. Instead, it's used to tell the caller the
> length of the contiguous iova region from the specified iova. And the
> ioctl VDUSE_IOTLB_GET_FD will get the file descriptor to the first
> overlapped iova region. So using iova + 1 should be enough here.

Does the entry.last field have any purpose with VDUSE_IOTLB_GET_FD? I
wonder why userspace needs to assign a value at all if it's always +1.

> 
> > > +             fd = ioctl(dev_fd, VDUSE_IOTLB_GET_FD, &entry);
> > > +             if (fd < 0)
> > > +                     return NULL;
> > > +
> > > +             size = entry.last - entry.start + 1;
> > > +             *len = entry.last - iova + 1;
> > > +             addr = mmap(0, size, perm_to_prot(entry.perm), MAP_SHARED,
> > > +                         fd, entry.offset);
> > > +             close(fd);
> > > +             if (addr == MAP_FAILED)
> > > +                     return NULL;
> > > +
> > > +             /* do something to cache this iova region */
> >
> > How is userspace expected to manage iotlb mmaps? When should munmap(2)
> > be called?
> >
> 
> The simple way is using a list to store the iotlb mappings. And we
> should call the munmap(2) for the old mappings when VDUSE_UPDATE_IOTLB
> or VDUSE_STOP_DATAPLANE message is received.

Thanks for explaining. It would be helpful to have a description of
IOTLB operation in this document.

> > Should userspace expect VDUSE_IOTLB_GET_FD to return a full chunk of
> > guest RAM (e.g. multiple gigabytes) that can be cached permanently or
> > will it return just enough pages to cover [start, last)?
> >
> 
> It should return one iotlb mapping that covers [start, last). In
> vhost-vdpa cases, it might be a full chunk of guest RAM. In
> virtio-vdpa cases, it might be the whole bounce buffer or one coherent
> mapping (produced by dma_alloc_coherent()).

Great, thanks. Adding something about this to the documentation would
help others implementing VDUSE devices or libraries.

> > > +
> > > +             return addr + iova - entry.start;
> > > +     }
> > > +
> > > +- VDUSE_DEV_GET_FEATURES: Get the negotiated features
> >
> > Are these VIRTIO feature bits? Please explain how feature negotiation
> > works. There must be a way for userspace to report the device's
> > supported feature bits to the kernel.
> >
> 
> Yes, these are VIRTIO feature bits. Userspace will specify the
> device's supported feature bits when creating a new VDUSE device with
> ioctl(VDUSE_CREATE_DEV).

Can the VDUSE device influence feature bit negotiation? For example, if
the VDUSE virtio-blk device does not implement discard/write-zeroes, how
does QEMU or the guest find out about this?

> > > +- VDUSE_DEV_UPDATE_CONFIG: Update the configuration space and inject a config interrupt
> >
> > Does this mean the contents of the configuration space are cached by
> > VDUSE?
> 
> Yes, but the kernel will also store the same contents.
> 
> > The downside is that the userspace code cannot generate the
> > contents on demand. Most devices don't need to generate the contents
> > on demand, so I think this is okay but I had expected a different
> > interface:
> >
> > kernel->userspace VDUSE_DEV_GET_CONFIG
> > userspace->kernel VDUSE_DEV_INJECT_CONFIG_IRQ
> >
> 
> The problem is how to handle the failure of VDUSE_DEV_GET_CONFIG. We
> will need lots of modification of virtio codes to support that. So to
> make it simple, we choose this way:
> 
> userspace -> kernel VDUSE_DEV_SET_CONFIG
> userspace -> kernel VDUSE_DEV_INJECT_CONFIG_IRQ
> 
> > I think you can leave it the way it is, but I wanted to mention this in
> > case someone thinks it's important to support generating the contents of
> > the configuration space on demand.
> >
> 
> Sorry, I didn't get you here. Can't VDUSE_DEV_SET_CONFIG and
> VDUSE_DEV_INJECT_CONFIG_IRQ achieve that?

If the contents of the configuration space change continuously, then the
VDUSE_DEV_SET_CONFIG approach is inefficient and might have race
conditions. For example, imagine a device where the driver can read a
timer from the configuration space. I think the VIRTIO device model
allows that although I'm not aware of any devices that do something like
it today. The problem is that VDUSE_DEV_SET_CONFIG would have to be
called frequently to keep the timer value updated even though the guest
driver probably isn't accessing it.

What's worse is that there might be race conditions where other
driver->device operations are supposed to update the configuration space
but VDUSE_DEV_SET_CONFIG means that the VDUSE kernel code is caching an
outdated copy.

Again, I don't think it's a problem for existing devices in the VIRTIO
specification. But I'm not 100% sure and future devices might require
what I've described, so the VDUSE_DEV_SET_CONFIG interface could become
a problem.

Stefan

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]

[-- Attachment #2: Type: text/plain, Size: 183 bytes --]

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
       [not found]         ` <CACycT3t6M5i0gznABm52v=rdmeeLZu8smXAOLg+WsM3WY1fgTw@mail.gmail.com>
@ 2021-07-01  7:55           ` Jason Wang
       [not found]             ` <CACycT3v7pYXAFtijPgWCMZ2WXxjT2Y-DUwS3hN_T7dhfE5o_6g@mail.gmail.com>
  0 siblings, 1 reply; 41+ messages in thread
From: Jason Wang @ 2021-07-01  7:55 UTC (permalink / raw)
  To: Yongji Xie, Stefan Hajnoczi
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, joro, Matthew Wilcox, Christoph Hellwig,
	Dan Carpenter, Al Viro, songmuchun, Jens Axboe, Greg KH,
	Randy Dunlap, linux-kernel, iommu, bcrl, netdev, linux-fsdevel,
	Mika Penttilä


On 2021/7/1 at 2:50 PM, Yongji Xie wrote:
> On Wed, Jun 30, 2021 at 5:51 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>> On Tue, Jun 29, 2021 at 10:59:51AM +0800, Yongji Xie wrote:
>>> On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>> On Tue, Jun 15, 2021 at 10:13:30PM +0800, Xie Yongji wrote:
>>>>> +/* ioctls */
>>>>> +
>>>>> +struct vduse_dev_config {
>>>>> +     char name[VDUSE_NAME_MAX]; /* vduse device name */
>>>>> +     __u32 vendor_id; /* virtio vendor id */
>>>>> +     __u32 device_id; /* virtio device id */
>>>>> +     __u64 features; /* device features */
>>>>> +     __u64 bounce_size; /* bounce buffer size for iommu */
>>>>> +     __u16 vq_size_max; /* the max size of virtqueue */
>>>> The VIRTIO specification allows per-virtqueue sizes. A device can have
>>>> two virtqueues, where the first one allows up to 1024 descriptors and
>>>> the second one allows only 128 descriptors, for example.
>>>>
>>> Good point! But it looks like virtio-vdpa/virtio-pci doesn't support
>>> that now. All virtqueues have the same maximum size.
>> I see struct vdpa_config_ops only supports a per-device max vq size:
>> u16 (*get_vq_num_max)(struct vdpa_device *vdev);
>>
>> virtio-pci supports per-virtqueue sizes because the struct
>> virtio_pci_common_cfg->queue_size register is per-queue (controlled by
>> queue_select).
>>
> Oh, yes. I miss queue_select.
>
>> I guess this is a question for Jason: will vdpa keep this limitation?
>> If yes, then VDUSE can stick to it too without running into problems in
>> the future.


I think it's better to extend the get_vq_num_max() per virtqueue.

Currently, vDPA assumes the parent to have a global max size. This seems 
to work on most of the parents but not vp-vDPA (which could be backed by 
QEMU, in that case cvq's size is smaller).

Fortunately, we haven't enabled cvq support in userspace yet.

I can post the fixes.


>>
>>>>> +     __u16 padding; /* padding */
>>>>> +     __u32 vq_num; /* the number of virtqueues */
>>>>> +     __u32 vq_align; /* the allocation alignment of virtqueue's metadata */
>>>> I'm not sure what this is?
>>>>
>>>   This will be used by vring_create_virtqueue() too.
>> If there is no official definition for the meaning of this value then
>> "/* same as vring_create_virtqueue()'s vring_align parameter */" would
>> be clearer. That way the reader knows what to research in order to
>> understand how this field works.
>>
> OK.
>
>> I don't remember but maybe it was used to support vrings when the
>> host/guest have non-4KB page sizes. I wonder if anyone has an official
>> definition for this value?
> Not sure. Maybe we might need some alignment which is less than
> PAGE_SIZE sometimes.


So I see CCW always uses 4096, but I'm not sure whether or not that is 
smaller than PAGE_SIZE.

Thanks


>
> Thanks,
> Yongji
>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: Re: Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE
       [not found]         ` <CACycT3taKhf1cWp3Jd0aSVekAZvpbR-_fkyPLQ=B+jZBB5H=8Q@mail.gmail.com>
@ 2021-07-01 13:15           ` Stefan Hajnoczi
       [not found]             ` <CACycT3vo-diHgTSLw_FS2E+5ia5VjihE3qw7JmZR7JT55P-wQA@mail.gmail.com>
  0 siblings, 1 reply; 41+ messages in thread
From: Stefan Hajnoczi @ 2021-07-01 13:15 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, joro, Matthew Wilcox, Christoph Hellwig,
	Dan Carpenter, Al Viro, songmuchun, Jens Axboe, Greg KH,
	Randy Dunlap, linux-kernel, iommu, bcrl, netdev, linux-fsdevel,
	Mika Penttilä


[-- Attachment #1.1: Type: text/plain, Size: 7762 bytes --]

On Thu, Jul 01, 2021 at 06:00:48PM +0800, Yongji Xie wrote:
> On Wed, Jun 30, 2021 at 6:06 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Tue, Jun 29, 2021 at 01:43:11PM +0800, Yongji Xie wrote:
> > > On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > On Tue, Jun 15, 2021 at 10:13:31PM +0800, Xie Yongji wrote:
> > > > > +     static void *iova_to_va(int dev_fd, uint64_t iova, uint64_t *len)
> > > > > +     {
> > > > > +             int fd;
> > > > > +             void *addr;
> > > > > +             size_t size;
> > > > > +             struct vduse_iotlb_entry entry;
> > > > > +
> > > > > +             entry.start = iova;
> > > > > +             entry.last = iova + 1;
> > > >
> > > > Why +1?
> > > >
> > > > I expected the request to include *len so that VDUSE can create a bounce
> > > > buffer for the full iova range, if necessary.
> > > >
> > >
> > > The function is used to translate iova to va. And the *len is not
> > > specified by the caller. Instead, it's used to tell the caller the
> > > length of the contiguous iova region from the specified iova. And the
> > > ioctl VDUSE_IOTLB_GET_FD will get the file descriptor to the first
> > > overlapped iova region. So using iova + 1 should be enough here.
> >
> > Does the entry.last field have any purpose with VDUSE_IOTLB_GET_FD? I
> > wonder why userspace needs to assign a value at all if it's always +1.
> >
> 
> If we need to get some iova regions in the specified range, we need
> the entry.last field. For example, we can use [0, ULONG_MAX] to get
> the first overlapped iova region which might be [4096, 8192]. But in
> this function, we don't use VDUSE_IOTLB_GET_FD like this. We need to
> get the iova region including the specified iova.

I see, thanks for explaining!
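
For readers following along, the lookup semantics described above can be
sketched as below. This is only an illustration: struct iova_region and
first_overlap() are hypothetical names, the table is assumed to be sorted
by start address and non-overlapping, and the kernel's actual data structure
may differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one IOTLB mapping; bounds are inclusive. */
struct iova_region {
	uint64_t start;
	uint64_t last;
};

/*
 * Return the first region (lowest start) that overlaps [start, last],
 * or NULL if none does. The table must be sorted by start address and
 * must not contain overlapping regions.
 */
static const struct iova_region *
first_overlap(const struct iova_region *tbl, size_t n,
	      uint64_t start, uint64_t last)
{
	size_t i;

	assert(start <= last);
	for (i = 0; i < n; i++) {
		if (tbl[i].start > last)
			break;	/* sorted: nothing further can overlap */
		if (tbl[i].last >= start)
			return &tbl[i];
	}
	return NULL;
}
```

A wide query such as [0, UINT64_MAX] returns whichever region comes first,
while a narrow query such as [iova, iova + 1] returns the region that
contains iova, which is what iova_to_va() wants.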

> > > > > +             return addr + iova - entry.start;
> > > > > +     }
> > > > > +
> > > > > +- VDUSE_DEV_GET_FEATURES: Get the negotiated features
> > > >
> > > > Are these VIRTIO feature bits? Please explain how feature negotiation
> > > > works. There must be a way for userspace to report the device's
> > > > supported feature bits to the kernel.
> > > >
> > >
> > > Yes, these are VIRTIO feature bits. Userspace will specify the
> > > device's supported feature bits when creating a new VDUSE device with
> > > ioctl(VDUSE_CREATE_DEV).
> >
> > Can the VDUSE device influence feature bit negotiation? For example, if
> > the VDUSE virtio-blk device does not implement discard/write-zeroes, how
> > does QEMU or the guest find out about this?
> >
> 
> There is a "features" field in struct vduse_dev_config which is used
> to do feature negotiation.

This approach is more restrictive than required by the VIRTIO
specification:

  "The device SHOULD accept any valid subset of features the driver
  accepts, otherwise it MUST fail to set the FEATURES_OK device status
  bit when the driver writes it."

  https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-130002

The spec allows a device to reject certain subsets of features. For
example, feature B may depend on feature A, in which case it can only be
enabled when feature A is also enabled.

From your description I think VDUSE would accept feature B without
feature A since the device implementation has no opportunity to fail
negotiation with custom logic.

Ideally VDUSE would send a SET_FEATURES message to userspace, allowing
the device implementation full flexibility in which subsets of features
to accept.
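
To make the corner case concrete, here is a minimal sketch of the kind of
custom logic a device implementation might want to run when the driver
writes FEATURES_OK. FEAT_A, FEAT_B and features_acceptable() are
hypothetical names for illustration only. No such hook exists in the
interface as posted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits: FEAT_B only makes sense together with FEAT_A. */
#define FEAT_A (1ULL << 0)
#define FEAT_B (1ULL << 1)

/*
 * Decide whether the device accepts the feature subset chosen by the
 * driver, i.e. whether FEATURES_OK may be set.
 */
static bool features_acceptable(uint64_t offered, uint64_t driver)
{
	/* Sanity check: the device must not offer B without offering A. */
	assert(!(offered & FEAT_B) || (offered & FEAT_A));

	if (driver & ~offered)
		return false;	/* driver acked a bit that was never offered */
	if ((driver & FEAT_B) && !(driver & FEAT_A))
		return false;	/* B depends on A */
	return true;
}
```

A plain mask check against the "features" field of struct vduse_dev_config
corresponds to the first test only. The second test is what a SET_FEATURES
message to userspace would make possible.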

This is a corner case. Many or maybe even all existing VIRTIO devices
don't need this flexibility, but I want to point out this limitation in
the VDUSE interface because it may cause issues in the future.

> > > > > +- VDUSE_DEV_UPDATE_CONFIG: Update the configuration space and inject a config interrupt
> > > >
> > > > Does this mean the contents of the configuration space are cached by
> > > > VDUSE?
> > >
> > > Yes, but the kernel will also store the same contents.
> > >
> > > > The downside is that the userspace code cannot generate the
> > > > contents on demand. Most devices don't need to generate the contents
> > > > on demand, so I think this is okay but I had expected a different
> > > > interface:
> > > >
> > > > kernel->userspace VDUSE_DEV_GET_CONFIG
> > > > userspace->kernel VDUSE_DEV_INJECT_CONFIG_IRQ
> > > >
> > >
> > > The problem is how to handle the failure of VDUSE_DEV_GET_CONFIG. We
> > > will need lots of modification of virtio codes to support that. So to
> > > make it simple, we choose this way:
> > >
> > > userspace -> kernel VDUSE_DEV_SET_CONFIG
> > > userspace -> kernel VDUSE_DEV_INJECT_CONFIG_IRQ
> > >
> > > > I think you can leave it the way it is, but I wanted to mention this in
> > > > case someone thinks it's important to support generating the contents of
> > > > the configuration space on demand.
> > > >
> > >
> > > Sorry, I didn't get you here. Can't VDUSE_DEV_SET_CONFIG and
> > > VDUSE_DEV_INJECT_CONFIG_IRQ achieve that?
> >
> > If the contents of the configuration space change continuously, then the
> > VDUSE_DEV_SET_CONFIG approach is inefficient and might have race
> > conditions. For example, imagine a device where the driver can read a
> > timer from the configuration space. I think the VIRTIO device model
> > allows that although I'm not aware of any devices that do something like
> > it today. The problem is that VDUSE_DEV_SET_CONFIG would have to be
> > called frequently to keep the timer value updated even though the guest
> > driver probably isn't accessing it.
> >
> 
> OK, I get you now. Since the VIRTIO specification says "Device
> configuration space is generally used for rarely-changing or
> initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> ioctl should not be called frequently.

The spec uses MUST and other terms to define the precise requirements.
Here the language (especially the word "generally") is weaker and means
there may be exceptions.

Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
approach is reads that have side-effects. For example, imagine a field
containing an error code if the device encounters a problem unrelated to
a specific virtqueue request. Reading from this field resets the error
code to 0, saving the driver an extra configuration space write access
and possibly race conditions. It isn't possible to implement those
semantics using VDUSE_DEV_SET_CONFIG. It's another corner case, but it
makes me think that the interface does not allow full VIRTIO semantics.

> > What's worse is that there might be race conditions where other
> > driver->device operations are supposed to update the configuration space
> > but VDUSE_DEV_SET_CONFIG means that the VDUSE kernel code is caching an
> > outdated copy.
> >
> 
> I'm not sure. Should the device and driver be able to access the same
> fields concurrently?

Yes. The VIRTIO spec has a generation count to handle multi-field
accesses so that consistency can be ensured:
https://docs.oasis-open.org/virtio/virtio/v1.1/cs01/virtio-v1.1-cs01.html#x1-180004
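
A minimal sketch of how a driver uses the generation count is shown below.
It is a single-threaded simulation: struct fake_dev and the dev_*() helpers
are hypothetical, and the "racing" update is triggered from inside the byte
accessor purely so that the retry path is exercised.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical device model with a 4-byte configuration space. */
struct fake_dev {
	uint8_t generation;	/* bumped whenever the config changes */
	uint8_t config[4];
	int racing_updates;	/* updates that will land mid-read */
};

static uint8_t dev_read_generation(const struct fake_dev *d)
{
	return d->generation;
}

static uint8_t dev_read_byte(struct fake_dev *d, unsigned int off)
{
	uint8_t v;

	assert(off < sizeof(d->config));
	v = d->config[off];
	/* Simulate the device changing all fields while a read is in flight. */
	if (off == 0 && d->racing_updates > 0) {
		d->racing_updates--;
		memset(d->config, d->config[0] + 1, sizeof(d->config));
		d->generation++;
	}
	return v;
}

/*
 * Driver side: read len bytes and retry until the generation is the same
 * before and after the read. Returns the number of attempts needed.
 */
static int driver_read_config(struct fake_dev *d, uint8_t *buf,
			      unsigned int len)
{
	uint8_t before, after;
	unsigned int i;
	int attempts = 0;

	do {
		before = dev_read_generation(d);
		for (i = 0; i < len; i++)
			buf[i] = dev_read_byte(d, i);
		after = dev_read_generation(d);
		attempts++;
	} while (before != after);

	return attempts;
}
```

The retry loop only works if the generation the driver observes is updated
together with the cached contents, which is one more thing a kernel-side
cache has to get right.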

> 
> > Again, I don't think it's a problem for existing devices in the VIRTIO
> > specification. But I'm not 100% sure and future devices might require
> > what I've described, so the VDUSE_DEV_SET_CONFIG interface could become
> > a problem.
> >
> 
> If so, maybe a new interface can be added at that time. The
> VDUSE_DEV_GET_CONFIG might be better, but I still did not find a good
> way for failure handling.

I'm not aware of the details of why the current approach was necessary,
so I don't have any concrete suggestions. Sorry!

Stefan

[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 488 bytes --]


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
       [not found]             ` <CACycT3v7pYXAFtijPgWCMZ2WXxjT2Y-DUwS3hN_T7dhfE5o_6g@mail.gmail.com>
@ 2021-07-02  3:25               ` Jason Wang
  0 siblings, 0 replies; 41+ messages in thread
From: Jason Wang @ 2021-07-02  3:25 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, joro, Matthew Wilcox, Christoph Hellwig,
	Dan Carpenter, Al Viro, Stefan Hajnoczi, songmuchun, Jens Axboe,
	Greg KH, Randy Dunlap, linux-kernel, iommu, bcrl, netdev,
	linux-fsdevel, Mika Penttilä


On 2021/7/1 at 6:26 PM, Yongji Xie wrote:
> On Thu, Jul 1, 2021 at 3:55 PM Jason Wang <jasowang@redhat.com> wrote:
>>
>> On 2021/7/1 at 2:50 PM, Yongji Xie wrote:
>>> On Wed, Jun 30, 2021 at 5:51 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>> On Tue, Jun 29, 2021 at 10:59:51AM +0800, Yongji Xie wrote:
>>>>> On Mon, Jun 28, 2021 at 9:02 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>>> On Tue, Jun 15, 2021 at 10:13:30PM +0800, Xie Yongji wrote:
>>>>>>> +/* ioctls */
>>>>>>> +
>>>>>>> +struct vduse_dev_config {
>>>>>>> +     char name[VDUSE_NAME_MAX]; /* vduse device name */
>>>>>>> +     __u32 vendor_id; /* virtio vendor id */
>>>>>>> +     __u32 device_id; /* virtio device id */
>>>>>>> +     __u64 features; /* device features */
>>>>>>> +     __u64 bounce_size; /* bounce buffer size for iommu */
>>>>>>> +     __u16 vq_size_max; /* the max size of virtqueue */
>>>>>> The VIRTIO specification allows per-virtqueue sizes. A device can have
>>>>>> two virtqueues, where the first one allows up to 1024 descriptors and
>>>>>> the second one allows only 128 descriptors, for example.
>>>>>>
>>>>> Good point! But it looks like virtio-vdpa/virtio-pci doesn't support
>>>>> that now. All virtqueues have the same maximum size.
>>>> I see struct vdpa_config_ops only supports a per-device max vq size:
>>>> u16 (*get_vq_num_max)(struct vdpa_device *vdev);
>>>>
>>>> virtio-pci supports per-virtqueue sizes because the struct
>>>> virtio_pci_common_cfg->queue_size register is per-queue (controlled by
>>>> queue_select).
>>>>
>>> Oh, yes. I miss queue_select.
>>>
>>>> I guess this is a question for Jason: will vdpa keep this limitation?
>>>> If yes, then VDUSE can stick to it too without running into problems in
>>>> the future.
>>
>> I think it's better to extend the get_vq_num_max() per virtqueue.
>>
>> Currently, vDPA assumes the parent to have a global max size. This seems
>> to work on most of the parents but not vp-vDPA (which could be backed by
>> QEMU, in that case cvq's size is smaller).
>>
>> Fortunately, we haven't enabled cvq support in userspace yet.
>>
>> I can post the fixes.
>>
> OK. If so, it looks like we need to support the per-vq configuration.
> I wonder if it's better to use something like: VDUSE_CREATE_DEVICE ->
> VDUSE_SETUP_VQ -> VDUSE_SETUP_VQ -> ... -> VDUSE_ENABLE_DEVICE to do
> initialization rather than only use VDUSE_CREATE_DEVICE.


This should be fine.

Thanks


>
> Thanks,
> Yongji
>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH] eventfd: Enlarge recursion limit to allow vhost to work
  2021-06-18  8:44       ` [PATCH] eventfd: Enlarge recursion limit to allow vhost to work He Zhe
@ 2021-07-03  8:31         ` Michael S. Tsirkin
  0 siblings, 0 replies; 41+ messages in thread
From: Michael S. Tsirkin @ 2021-07-03  8:31 UTC (permalink / raw)
  To: He Zhe
  Cc: kvm, virtualization, christian.brauner, qiang.zhang, corbet,
	willy, hch, xieyongji, dan.carpenter, viro, stefanha, songmuchun,
	axboe, gregkh, rdunlap, linux-kernel, iommu, bcrl, linux-fsdevel,
	mika.penttila

On Fri, Jun 18, 2021 at 04:44:12PM +0800, He Zhe wrote:
> commit b5e683d5cab8 ("eventfd: track eventfd_signal() recursion depth")
> introduces a percpu counter that tracks the percpu recursion depth and
> warn if it greater than zero, to avoid potential deadlock and stack
> overflow.
> 
> However sometimes different eventfds may be used in parallel. Specifically,
> when heavy network load goes through kvm and vhost, working as below, it
> would trigger the following call trace.
> 
> -  100.00%
>    - 66.51%
>         ret_from_fork
>         kthread
>       - vhost_worker
>          - 33.47% handle_tx_kick
>               handle_tx
>               handle_tx_copy
>               vhost_tx_batch.isra.0
>               vhost_add_used_and_signal_n
>               eventfd_signal
>          - 33.05% handle_rx_net
>               handle_rx
>               vhost_add_used_and_signal_n
>               eventfd_signal
>    - 33.49%
>         ioctl
>         entry_SYSCALL_64_after_hwframe
>         do_syscall_64
>         __x64_sys_ioctl
>         ksys_ioctl
>         do_vfs_ioctl
>         kvm_vcpu_ioctl
>         kvm_arch_vcpu_ioctl_run
>         vmx_handle_exit
>         handle_ept_misconfig
>         kvm_io_bus_write
>         __kvm_io_bus_write
>         eventfd_signal
> 
> 001: WARNING: CPU: 1 PID: 1503 at fs/eventfd.c:73 eventfd_signal+0x85/0xa0
> ---- snip ----
> 001: Call Trace:
> 001:  vhost_signal+0x15e/0x1b0 [vhost]
> 001:  vhost_add_used_and_signal_n+0x2b/0x40 [vhost]
> 001:  handle_rx+0xb9/0x900 [vhost_net]
> 001:  handle_rx_net+0x15/0x20 [vhost_net]
> 001:  vhost_worker+0xbe/0x120 [vhost]
> 001:  kthread+0x106/0x140
> 001:  ? log_used.part.0+0x20/0x20 [vhost]
> 001:  ? kthread_park+0x90/0x90
> 001:  ret_from_fork+0x35/0x40
> 001: ---[ end trace 0000000000000003 ]---
> 
> This patch enlarges the limit to 1 which is the maximum recursion depth we
> have found so far.
> 
> The credit of modification for eventfd_signal_count goes to
> Xie Yongji <xieyongji@bytedance.com>
> 

And maybe:

Fixes: b5e683d5cab8 ("eventfd: track eventfd_signal() recursion depth")

who's merging this?

> Signed-off-by: He Zhe <zhe.he@windriver.com>
> ---
>  fs/eventfd.c            | 3 ++-
>  include/linux/eventfd.h | 5 ++++-
>  2 files changed, 6 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/eventfd.c b/fs/eventfd.c
> index e265b6dd4f34..add6af91cacf 100644
> --- a/fs/eventfd.c
> +++ b/fs/eventfd.c
> @@ -71,7 +71,8 @@ __u64 eventfd_signal(struct eventfd_ctx *ctx, __u64 n)
>  	 * it returns true, the eventfd_signal() call should be deferred to a
>  	 * safe context.
>  	 */
> -	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count)))
> +	if (WARN_ON_ONCE(this_cpu_read(eventfd_wake_count) >
> +	    EFD_WAKE_COUNT_MAX))
>  		return 0;
>  
>  	spin_lock_irqsave(&ctx->wqh.lock, flags);
> diff --git a/include/linux/eventfd.h b/include/linux/eventfd.h
> index fa0a524baed0..74be152ebe87 100644
> --- a/include/linux/eventfd.h
> +++ b/include/linux/eventfd.h
> @@ -29,6 +29,9 @@
>  #define EFD_SHARED_FCNTL_FLAGS (O_CLOEXEC | O_NONBLOCK)
>  #define EFD_FLAGS_SET (EFD_SHARED_FCNTL_FLAGS | EFD_SEMAPHORE)
>  
> +/* This is the maximum recursion depth we find so far */
> +#define EFD_WAKE_COUNT_MAX 1
> +
>  struct eventfd_ctx;
>  struct file;
>  
> @@ -47,7 +50,7 @@ DECLARE_PER_CPU(int, eventfd_wake_count);
>  
>  static inline bool eventfd_signal_count(void)
>  {
> -	return this_cpu_read(eventfd_wake_count);
> +	return this_cpu_read(eventfd_wake_count) > EFD_WAKE_COUNT_MAX;
>  }
>  
>  #else /* CONFIG_EVENTFD */
> -- 
> 2.17.1


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE
       [not found]             ` <CACycT3vo-diHgTSLw_FS2E+5ia5VjihE3qw7JmZR7JT55P-wQA@mail.gmail.com>
@ 2021-07-05  3:36               ` Jason Wang
  2021-07-05 12:49                 ` Stefan Hajnoczi
  0 siblings, 1 reply; 41+ messages in thread
From: Jason Wang @ 2021-07-05  3:36 UTC (permalink / raw)
  To: Yongji Xie, Stefan Hajnoczi
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, joro, Matthew Wilcox, Christoph Hellwig,
	Dan Carpenter, Al Viro, songmuchun, Jens Axboe, Greg KH,
	Randy Dunlap, linux-kernel, iommu, bcrl, netdev, linux-fsdevel,
	Mika Penttilä


On 2021/7/4 at 5:49 PM, Yongji Xie wrote:
>>> OK, I get you now. Since the VIRTIO specification says "Device
>>> configuration space is generally used for rarely-changing or
>>> initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
>>> ioctl should not be called frequently.
>> The spec uses MUST and other terms to define the precise requirements.
>> Here the language (especially the word "generally") is weaker and means
>> there may be exceptions.
>>
>> Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
>> approach is reads that have side-effects. For example, imagine a field
>> containing an error code if the device encounters a problem unrelated to
>> a specific virtqueue request. Reading from this field resets the error
>> code to 0, saving the driver an extra configuration space write access
>> and possibly race conditions. It isn't possible to implement those
>> semantics using VDUSE_DEV_SET_CONFIG. It's another corner case, but it
>> makes me think that the interface does not allow full VIRTIO semantics.


Note that though you're correct, my understanding is that config space 
is not suitable for this kind of error propagation. And it would be very 
hard to implement this kind of semantics in some transports.  A virtqueue 
should be much better. As Yongji quoted, the config space is used for 
"rarely-changing or initialization-time parameters".


> Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> handle the message failure, I'm going to add a return value to
> virtio_config_ops.get() and virtio_cread_* API so that the error can
> be propagated to the virtio device driver. Then the virtio-blk device
> driver can be modified to handle that.
>
> Jason and Stefan, what do you think of this way?


I'd like to stick to the current assumption that get_config won't fail. 
That is to say,

1) maintain a config in the kernel, make sure the config space read can 
always succeed
2) introduce an ioctl for the vduse userspace to update the config space.
3) we can synchronize with the vduse userspace during set_config

Does this work?

Thanks


>


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE
  2021-07-05  3:36               ` Jason Wang
@ 2021-07-05 12:49                 ` Stefan Hajnoczi
  2021-07-06  2:34                   ` Jason Wang
       [not found]                   ` <CACycT3t-BTMrpNTwBUfbvaxTh6tLthxbo3OJwMk_iuiSpMuZPg@mail.gmail.com>
  0 siblings, 2 replies; 41+ messages in thread
From: Stefan Hajnoczi @ 2021-07-05 12:49 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, joro, Matthew Wilcox, Christoph Hellwig,
	Yongji Xie, Dan Carpenter, Al Viro, songmuchun, Jens Axboe,
	Greg KH, Randy Dunlap, linux-kernel, iommu, bcrl, netdev,
	linux-fsdevel, Mika Penttilä


[-- Attachment #1.1: Type: text/plain, Size: 4867 bytes --]

On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> 
> On 2021/7/4 at 5:49 PM, Yongji Xie wrote:
> > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > configuration space is generally used for rarely-changing or
> > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > > > ioctl should not be called frequently.
> > > The spec uses MUST and other terms to define the precise requirements.
> > > Here the language (especially the word "generally") is weaker and means
> > > there may be exceptions.
> > > 
> > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
> > > approach is reads that have side-effects. For example, imagine a field
> > > containing an error code if the device encounters a problem unrelated to
> > > a specific virtqueue request. Reading from this field resets the error
> > > code to 0, saving the driver an extra configuration space write access
> > > and possibly race conditions. It isn't possible to implement those
> > > semantics using VDUSE_DEV_SET_CONFIG. It's another corner case, but it
> > > makes me think that the interface does not allow full VIRTIO semantics.
> 
> 
> Note that though you're correct, my understanding is that config space is
> not suitable for this kind of error propagation. And it would be very hard
> to implement this kind of semantics in some transports.  A virtqueue should
> be much better. As Yongji quoted, the config space is used for
> "rarely-changing or initialization-time parameters".
> 
> 
> > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > handle the message failure, I'm going to add a return value to
> > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > be propagated to the virtio device driver. Then the virtio-blk device
> > driver can be modified to handle that.
> > 
> > Jason and Stefan, what do you think of this way?

Why does VDUSE_DEV_GET_CONFIG need to support an error return value?

The VIRTIO spec provides no way for the device to report errors from
config space accesses.

The QEMU virtio-pci implementation returns -1 from invalid
virtio_config_read*() and silently discards virtio_config_write*()
accesses.

VDUSE can take the same approach with
VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
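
A minimal sketch of that behaviour, assuming a fixed-size config space and
32-bit accessors. CFG_SIZE, struct cfg_space, cfg_read32() and cfg_write32()
are hypothetical names for illustration and are not taken from QEMU or from
the VDUSE patches.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define CFG_SIZE 16	/* hypothetical config space size in bytes */

struct cfg_space {
	uint8_t bytes[CFG_SIZE];
};

/* Read a 32-bit field; an out-of-range access yields all-ones, not an error. */
static uint32_t cfg_read32(const struct cfg_space *c, uint32_t off)
{
	uint32_t v;

	assert(c);
	if (off > CFG_SIZE - sizeof(v))
		return UINT32_MAX;
	memcpy(&v, c->bytes + off, sizeof(v));
	return v;
}

/* Write a 32-bit field; an out-of-range access is silently discarded. */
static void cfg_write32(struct cfg_space *c, uint32_t off, uint32_t v)
{
	assert(c);
	if (off > CFG_SIZE - sizeof(v))
		return;
	memcpy(c->bytes + off, &v, sizeof(v));
}
```

Neither accessor can fail from the caller's point of view, so no error
return has to be threaded through virtio_config_ops.get() or the
virtio_cread_* helpers.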

> I'd like to stick to the current assumption that get_config won't fail.
> That is to say,
> 
> 1) maintain a config in the kernel, make sure the config space read can
> always succeed
> 2) introduce an ioctl for the vduse userspace to update the config space.
> 3) we can synchronize with the vduse userspace during set_config
> 
> Does this work?

I noticed that caching is also allowed by the vhost-user protocol
messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
know whether or not caching is in effect. The interface you outlined
above requires caching.

Is there a reason why the host kernel vDPA code needs to cache the
configuration space?

Here are the vhost-user protocol messages:

  Virtio device config space
  ^^^^^^^^^^^^^^^^^^^^^^^^^^

  +--------+------+-------+---------+
  | offset | size | flags | payload |
  +--------+------+-------+---------+

  :offset: a 32-bit offset of virtio device's configuration space

  :size: a 32-bit configuration space access size in bytes

  :flags: a 32-bit value:
    - 0: Vhost master messages used for writeable fields
    - 1: Vhost master messages used for live migration

  :payload: Size bytes array holding the contents of the virtio
            device's configuration space

  ...

  ``VHOST_USER_GET_CONFIG``
    :id: 24
    :equivalent ioctl: N/A
    :master payload: virtio device config space
    :slave payload: virtio device config space

    When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
    submitted by the vhost-user master to fetch the contents of the
    virtio device configuration space, vhost-user slave's payload size
    MUST match master's request, vhost-user slave uses zero length of
    payload to indicate an error to vhost-user master. The vhost-user
    master may cache the contents to avoid repeated
    ``VHOST_USER_GET_CONFIG`` calls.

  ``VHOST_USER_SET_CONFIG``
    :id: 25
    :equivalent ioctl: N/A
    :master payload: virtio device config space
    :slave payload: N/A

    When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
    submitted by the vhost-user master when the Guest changes the virtio
    device configuration space and also can be used for live migration
    on the destination host. The vhost-user slave must check the flags
    field, and slaves MUST NOT accept SET_CONFIG for read-only
    configuration space fields unless the live migration bit is set.
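
As a rough illustration of the rules quoted above (a sketch only, not
QEMU's implementation), the message layout and the two checks could look
like this. The struct and function names and the 256-byte payload bound
are assumptions made for the example:

```c
#include <assert.h>
#include <stdint.h>

#define CFG_MAX_SIZE 256u            /* assumed payload bound for this sketch */
#define CFG_FLAG_WRITABLE 0u         /* flags value 0: writeable fields */
#define CFG_FLAG_LIVE_MIGRATION 1u   /* flags value 1: live migration */

/* Mirrors the offset/size/flags/payload layout quoted above. */
struct cfg_msg {
	uint32_t offset;
	uint32_t size;
	uint32_t flags;
	uint8_t payload[CFG_MAX_SIZE];
};

/* GET_CONFIG reply check: the reply size must match the request, and a
 * zero-length payload signals an error. */
int get_config_reply_ok(const struct cfg_msg *req, uint32_t reply_size)
{
	assert(req);
	return reply_size != 0 && reply_size == req->size;
}

/* SET_CONFIG check: read-only fields may only be written when the
 * message carries the live migration flag. */
int set_config_allowed(const struct cfg_msg *msg, int field_is_read_only)
{
	assert(msg);
	if (msg->size == 0 || msg->size > CFG_MAX_SIZE)
		return 0;
	if (msg->offset > CFG_MAX_SIZE - msg->size)
		return 0;
	if (field_is_read_only)
		return msg->flags == CFG_FLAG_LIVE_MIGRATION;
	return 1;
}
```

Note that GET_CONFIG does have an error path in vhost-user (the
zero-length reply), which is where it differs from the VIRTIO config
space itself.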

Stefan


_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


* Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE
  2021-07-05 12:49                 ` Stefan Hajnoczi
@ 2021-07-06  2:34                   ` Jason Wang
  2021-07-06 10:14                     ` Stefan Hajnoczi
       [not found]                   ` <CACycT3t-BTMrpNTwBUfbvaxTh6tLthxbo3OJwMk_iuiSpMuZPg@mail.gmail.com>
  1 sibling, 1 reply; 41+ messages in thread
From: Jason Wang @ 2021-07-06  2:34 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, joro, Matthew Wilcox, Christoph Hellwig,
	Yongji Xie, Dan Carpenter, Al Viro, songmuchun, Jens Axboe,
	Greg KH, Randy Dunlap, linux-kernel, iommu, bcrl, netdev,
	linux-fsdevel, Mika Penttilä


On 2021/7/5 at 8:49 PM, Stefan Hajnoczi wrote:
> On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
>> On 2021/7/4 at 5:49 PM, Yongji Xie wrote:
>>>>> OK, I get you now. Since the VIRTIO specification says "Device
>>>>> configuration space is generally used for rarely-changing or
>>>>> initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
>>>>> ioctl should not be called frequently.
>>>> The spec uses MUST and other terms to define the precise requirements.
>>>> Here the language (especially the word "generally") is weaker and means
>>>> there may be exceptions.
>>>>
>>>> Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
>>>> approach is reads that have side-effects. For example, imagine a field
>>>> containing an error code if the device encounters a problem unrelated to
>>>> a specific virtqueue request. Reading from this field resets the error
>>>> code to 0, saving the driver an extra configuration space write access
>>>> and possibly race conditions. It isn't possible to implement those
>>>> semantics using VDUSE_DEV_SET_CONFIG. It's another corner case, but it
>>>> makes me think that the interface does not allow full VIRTIO semantics.
>>
>> Note that though you're correct, my understanding is that config space is
>> not suitable for this kind of error propagation. And it would be very hard
>> to implement this kind of semantics in some transports.  Virtqueue should be
>> much better. As Yong Ji quoted, the config space is used for
>> "rarely-changing or initialization-time parameters".
>>
>>
>>> Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
>>> handle the message failure, I'm going to add a return value to
>>> virtio_config_ops.get() and virtio_cread_* API so that the error can
>>> be propagated to the virtio device driver. Then the virtio-blk device
>>> driver can be modified to handle that.
>>>
>>> Jason and Stefan, what do you think of this way?
> Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
>
> The VIRTIO spec provides no way for the device to report errors from
> config space accesses.
>
> The QEMU virtio-pci implementation returns -1 from invalid
> virtio_config_read*() and silently discards virtio_config_write*()
> accesses.
>
> VDUSE can take the same approach with
> VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
>
>> I'd like to stick to the current assumption that get_config won't fail.
>> That is to say,
>>
>> 1) maintain a config in the kernel, make sure the config space read can
>> always succeed
>> 2) introduce an ioctl for the vduse userspace to update the config space.
>> 3) we can synchronize with the vduse userspace during set_config
>>
>> Does this work?
> I noticed that caching is also allowed by the vhost-user protocol
> messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
> know whether or not caching is in effect. The interface you outlined
> above requires caching.
>
> Is there a reason why the host kernel vDPA code needs to cache the
> configuration space?


Because:

1) The kernel cannot wait forever in get_config(); this is the major
difference from vhost-user.
2) Stick to the current assumption that virtio_cread() should always 
succeed.
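
A minimal sketch of what I have in mind, as plain C rather than the
actual VDUSE code. The names (cfg_cache, cache_get, cache_update) and the
64-byte size are assumptions for illustration. The driver-facing read is
served from a kernel-side copy and has no error path, while the update
requested by userspace is the only call that can fail:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CACHE_SIZE 64u /* assumed config space size for this sketch */

/* Hypothetical kernel-side copy of the device configuration space. */
struct cfg_cache {
	uint8_t data[CACHE_SIZE];
	uint32_t generation; /* bumped on every accepted update */
};

/* Driver-facing read: served from the cache, never waits on userspace.
 * The caller guarantees the range is valid, so there is no error path. */
void cache_get(const struct cfg_cache *c, size_t off, void *buf, size_t len)
{
	assert(c && buf && off <= CACHE_SIZE && len <= CACHE_SIZE - off);
	memcpy(buf, c->data + off, len);
}

/* Userspace-facing update (the proposed ioctl): errors go back to the
 * device emulation process, not to the virtio driver. */
int cache_update(struct cfg_cache *c, size_t off, const void *buf, size_t len)
{
	if (!c || !buf || off > CACHE_SIZE || len > CACHE_SIZE - off)
		return -1;
	memcpy(c->data + off, buf, len);
	c->generation++;
	return 0;
}
```

A real implementation would also need locking and a config change
notification; both are left out here.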

Thanks


>
> Here are the vhost-user protocol messages:
>
>    Virtio device config space
>    ^^^^^^^^^^^^^^^^^^^^^^^^^^
>
>    +--------+------+-------+---------+
>    | offset | size | flags | payload |
>    +--------+------+-------+---------+
>
>    :offset: a 32-bit offset of virtio device's configuration space
>
>    :size: a 32-bit configuration space access size in bytes
>
>    :flags: a 32-bit value:
>      - 0: Vhost master messages used for writeable fields
>      - 1: Vhost master messages used for live migration
>
>    :payload: Size bytes array holding the contents of the virtio
>              device's configuration space
>
>    ...
>
>    ``VHOST_USER_GET_CONFIG``
>      :id: 24
>      :equivalent ioctl: N/A
>      :master payload: virtio device config space
>      :slave payload: virtio device config space
>
>      When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
>      submitted by the vhost-user master to fetch the contents of the
>      virtio device configuration space, vhost-user slave's payload size
>      MUST match master's request, vhost-user slave uses zero length of
>      payload to indicate an error to vhost-user master. The vhost-user
>      master may cache the contents to avoid repeated
>      ``VHOST_USER_GET_CONFIG`` calls.
>
>    ``VHOST_USER_SET_CONFIG``
>      :id: 25
>      :equivalent ioctl: N/A
>      :master payload: virtio device config space
>      :slave payload: N/A
>
>      When ``VHOST_USER_PROTOCOL_F_CONFIG`` is negotiated, this message is
>      submitted by the vhost-user master when the Guest changes the virtio
>      device configuration space and also can be used for live migration
>      on the destination host. The vhost-user slave must check the flags
>      field, and slaves MUST NOT accept SET_CONFIG for read-only
>      configuration space fields unless the live migration bit is set.
>
> Stefan


* Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE
  2021-07-06  2:34                   ` Jason Wang
@ 2021-07-06 10:14                     ` Stefan Hajnoczi
       [not found]                       ` <CACGkMEs2HHbUfarum8uQ6wuXoDwLQUSXTsa-huJFiqr__4cwRg@mail.gmail.com>
  0 siblings, 1 reply; 41+ messages in thread
From: Stefan Hajnoczi @ 2021-07-06 10:14 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, joro, Matthew Wilcox, Christoph Hellwig,
	Yongji Xie, Dan Carpenter, Al Viro, songmuchun, Jens Axboe,
	Greg KH, Randy Dunlap, linux-kernel, iommu, bcrl, netdev,
	linux-fsdevel, Mika Penttilä



On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:
> 
> On 2021/7/5 at 8:49 PM, Stefan Hajnoczi wrote:
> > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > > On 2021/7/4 at 5:49 PM, Yongji Xie wrote:
> > > > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > > > configuration space is generally used for rarely-changing or
> > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > > > > > ioctl should not be called frequently.
> > > > > The spec uses MUST and other terms to define the precise requirements.
> > > > > Here the language (especially the word "generally") is weaker and means
> > > > > there may be exceptions.
> > > > > 
> > > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
> > > > > approach is reads that have side-effects. For example, imagine a field
> > > > > containing an error code if the device encounters a problem unrelated to
> > > > > a specific virtqueue request. Reading from this field resets the error
> > > > > code to 0, saving the driver an extra configuration space write access
> > > > > and possibly race conditions. It isn't possible to implement those
> > > > > > semantics using VDUSE_DEV_SET_CONFIG. It's another corner case, but it
> > > > > > makes me think that the interface does not allow full VIRTIO semantics.
> > > > 
> > > > Note that though you're correct, my understanding is that config space is
> > > > not suitable for this kind of error propagation. And it would be very hard
> > > > to implement this kind of semantics in some transports.  Virtqueue should be
> > > > much better. As Yong Ji quoted, the config space is used for
> > > > "rarely-changing or initialization-time parameters".
> > > 
> > > 
> > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > > > handle the message failure, I'm going to add a return value to
> > > > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > > > be propagated to the virtio device driver. Then the virtio-blk device
> > > > driver can be modified to handle that.
> > > > 
> > > > Jason and Stefan, what do you think of this way?
> > Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
> > 
> > The VIRTIO spec provides no way for the device to report errors from
> > config space accesses.
> > 
> > The QEMU virtio-pci implementation returns -1 from invalid
> > virtio_config_read*() and silently discards virtio_config_write*()
> > accesses.
> > 
> > VDUSE can take the same approach with
> > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
> > 
> > > I'd like to stick to the current assumption that get_config won't fail.
> > > That is to say,
> > > 
> > > 1) maintain a config in the kernel, make sure the config space read can
> > > always succeed
> > > 2) introduce an ioctl for the vduse userspace to update the config space.
> > > 3) we can synchronize with the vduse userspace during set_config
> > > 
> > > Does this work?
> > I noticed that caching is also allowed by the vhost-user protocol
> > messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
> > know whether or not caching is in effect. The interface you outlined
> > above requires caching.
> > 
> > Is there a reason why the host kernel vDPA code needs to cache the
> > configuration space?
> 
> 
> Because:
> 
> 1) Kernel can not wait forever in get_config(), this is the major difference
> with vhost-user.

virtio_cread() can sleep:

  #define virtio_cread(vdev, structname, member, ptr)                     \
          do {                                                            \
                  typeof(((structname*)0)->member) virtio_cread_v;        \
                                                                          \
                  might_sleep();                                          \
                  ^^^^^^^^^^^^^^

Which code path cannot sleep?

> 2) Stick to the current assumption that virtio_cread() should always
> succeed.

That can be done by reading -1 (like QEMU does) when the read fails.

Stefan


* Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE
       [not found]                   ` <CACycT3t-BTMrpNTwBUfbvaxTh6tLthxbo3OJwMk_iuiSpMuZPg@mail.gmail.com>
@ 2021-07-06 10:22                     ` Stefan Hajnoczi
       [not found]                       ` <CACycT3t=V-VV7LYDda8mt=QxN_Ay-N+3dgWp382TObkeei9MOg@mail.gmail.com>
  0 siblings, 1 reply; 41+ messages in thread
From: Stefan Hajnoczi @ 2021-07-06 10:22 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, joro, Matthew Wilcox, Christoph Hellwig,
	Dan Carpenter, Al Viro, songmuchun, Jens Axboe, Greg KH,
	Randy Dunlap, linux-kernel, iommu, bcrl, netdev, linux-fsdevel,
	Mika Penttilä



On Tue, Jul 06, 2021 at 11:04:18AM +0800, Yongji Xie wrote:
> On Mon, Jul 5, 2021 at 8:50 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > >
> > > > On 2021/7/4 at 5:49 PM, Yongji Xie wrote:
> > > > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > > > configuration space is generally used for rarely-changing or
> > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > > > > > ioctl should not be called frequently.
> > > > > The spec uses MUST and other terms to define the precise requirements.
> > > > > Here the language (especially the word "generally") is weaker and means
> > > > > there may be exceptions.
> > > > >
> > > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
> > > > > approach is reads that have side-effects. For example, imagine a field
> > > > > containing an error code if the device encounters a problem unrelated to
> > > > > a specific virtqueue request. Reading from this field resets the error
> > > > > code to 0, saving the driver an extra configuration space write access
> > > > > and possibly race conditions. It isn't possible to implement those
> > > > > > semantics using VDUSE_DEV_SET_CONFIG. It's another corner case, but it
> > > > > > makes me think that the interface does not allow full VIRTIO semantics.
> > >
> > >
> > > > Note that though you're correct, my understanding is that config space is
> > > > not suitable for this kind of error propagation. And it would be very hard
> > > > to implement this kind of semantics in some transports.  Virtqueue should be
> > > > much better. As Yong Ji quoted, the config space is used for
> > > > "rarely-changing or initialization-time parameters".
> > >
> > >
> > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > > > handle the message failure, I'm going to add a return value to
> > > > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > > > be propagated to the virtio device driver. Then the virtio-blk device
> > > > driver can be modified to handle that.
> > > >
> > > > Jason and Stefan, what do you think of this way?
> >
> > Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
> >
> 
> We add a timeout and return error in case userspace never replies to
> the message.
> 
> > The VIRTIO spec provides no way for the device to report errors from
> > config space accesses.
> >
> > The QEMU virtio-pci implementation returns -1 from invalid
> > virtio_config_read*() and silently discards virtio_config_write*()
> > accesses.
> >
> > VDUSE can take the same approach with
> > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
> >
> 
> I noticed that virtio_config_read*() only returns -1 when we access an
> invalid field. But in the VDUSE case, VDUSE_DEV_GET_CONFIG might fail
> when we access a valid field. Not sure if it's ok to silently ignore
> this kind of error.

That's a good point but it's a general VIRTIO issue. Any device
implementation (QEMU userspace, hardware vDPA, etc) can fail, so the
VIRTIO specification needs to provide a way for the driver to detect
this.

If userspace violates the contract then VDUSE needs to mark the device
broken. QEMU's device emulation does something similar with the
vdev->broken flag.

The VIRTIO Device Status field DEVICE_NEEDS_RESET bit can be set by
vDPA/VDUSE to indicate that the device is not operational and must be
reset.

The driver code may still process the -1 value read from the
configuration space. Hopefully this isn't a problem. There is currently
no VIRTIO interface besides DEVICE_NEEDS_RESET to indicate configuration
space access failures. On the other hand, drivers need to handle
malicious devices so they should be able to cope with the -1 value
anyway.
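
Roughly, and only as a sketch with made-up names (struct dev, fetch_fn,
dev_get_config), the failure handling could look like the following. The
0x40 value is the DEVICE_NEEDS_RESET bit from the VIRTIO Device Status
field. The two fetch stubs exist only to exercise the example:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STATUS_DEVICE_NEEDS_RESET 0x40u /* VIRTIO Device Status bit (64) */

/* Hypothetical callback that asks the userspace device for config bytes.
 * Returns 0 on success, nonzero if userspace failed or timed out. */
typedef int (*fetch_fn)(void *opaque, size_t off, void *buf, size_t len);

struct dev {
	uint8_t status;
	int broken;      /* similar in spirit to QEMU's vdev->broken */
	fetch_fn fetch;
	void *opaque;
};

/* Never reports an error to the driver: on failure the device is marked
 * broken, DEVICE_NEEDS_RESET is set, and the driver reads all-ones. */
void dev_get_config(struct dev *d, size_t off, void *buf, size_t len)
{
	assert(d && buf);
	if (!d->broken && d->fetch && d->fetch(d->opaque, off, buf, len) == 0)
		return;
	d->broken = 1;
	d->status |= STATUS_DEVICE_NEEDS_RESET;
	memset(buf, 0xff, len);
}

/* Stub backends used only to exercise the sketch. */
int fetch_zeroes(void *opaque, size_t off, void *buf, size_t len)
{
	(void)opaque; (void)off;
	memset(buf, 0, len);
	return 0;
}

int fetch_timeout(void *opaque, size_t off, void *buf, size_t len)
{
	(void)opaque; (void)off; (void)buf; (void)len;
	return -1;
}
```

Once broken, the device stays broken until it is reset, so later reads
keep returning -1 even if userspace starts responding again.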

Stefan


* Re: [PATCH v8 09/10] vduse: Introduce VDUSE - vDPA Device in Userspace
       [not found] ` <20210615141331.407-10-xieyongji@bytedance.com>
  2021-06-21  9:13   ` [PATCH v8 09/10] vduse: " Jason Wang
  2021-06-24 14:46   ` Stefan Hajnoczi
@ 2021-07-07  8:52   ` Stefan Hajnoczi
  2 siblings, 0 replies; 41+ messages in thread
From: Stefan Hajnoczi @ 2021-07-07  8:52 UTC (permalink / raw)
  To: Xie Yongji
  Cc: kvm, mst, virtualization, christian.brauner, corbet, joro, willy,
	hch, dan.carpenter, viro, songmuchun, axboe, gregkh, rdunlap,
	linux-kernel, iommu, bcrl, netdev, linux-fsdevel, mika.penttila



On Tue, Jun 15, 2021 at 10:13:30PM +0800, Xie Yongji wrote:
> +static bool vduse_validate_config(struct vduse_dev_config *config)
> +{

Does the name field need to be NUL-terminated?

> +	case VDUSE_CREATE_DEV: {
> +		struct vduse_dev_config config;
> +		unsigned long size = offsetof(struct vduse_dev_config, config);
> +		void *buf;
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(&config, argp, size))
> +			break;
> +
> +		ret = -EINVAL;
> +		if (vduse_validate_config(&config) == false)
> +			break;
> +
> +		buf = vmemdup_user(argp + size, config.config_size);
> +		if (IS_ERR(buf)) {
> +			ret = PTR_ERR(buf);
> +			break;
> +		}
> +		ret = vduse_create_dev(&config, buf, control->api_version);
> +		break;
> +	}
> +	case VDUSE_DESTROY_DEV: {
> +		char name[VDUSE_NAME_MAX];
> +
> +		ret = -EFAULT;
> +		if (copy_from_user(name, argp, VDUSE_NAME_MAX))
> +			break;

Is this missing a NUL terminator?
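
For illustration, one way to validate a fixed-size name buffer is
sketched below in plain userspace C. NAME_MAX_LEN and copy_name_checked
are hypothetical names, and the 256-byte size is an assumption standing
in for VDUSE_NAME_MAX:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define NAME_MAX_LEN 256 /* assumed buffer size; stands in for VDUSE_NAME_MAX */

/* Copies a fixed-size, untrusted name buffer and rejects it unless it holds
 * a non-empty, NUL-terminated string. Returns 0 on success, -1 otherwise. */
int copy_name_checked(char dst[NAME_MAX_LEN], const char src[NAME_MAX_LEN])
{
	assert(dst && src);
	const char *nul = memchr(src, '\0', NAME_MAX_LEN);

	if (!nul || nul == src)
		return -1;
	memcpy(dst, src, (size_t)(nul - src) + 1);
	return 0;
}
```

A simpler alternative is to overwrite the last byte of the buffer with
'\0' after the copy, which truncates an overlong name instead of
rejecting it.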


* Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE
       [not found]                             ` <YOVr801d01YOPzLL@stefanha-x1.localdomain>
@ 2021-07-07  9:24                               ` Jason Wang
  2021-07-07 15:54                                 ` Stefan Hajnoczi
  0 siblings, 1 reply; 41+ messages in thread
From: Jason Wang @ 2021-07-07  9:24 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, joro, Matthew Wilcox, Christoph Hellwig,
	Xie Yongji, Dan Carpenter, Al Viro, songmuchun@bytedance.com,
	Jens Axboe, gregkh, Randy Dunlap, linux-kernel,
	iommu@lists.linux-foundation.org, bcrl@kvack.org, netdev,
	linux-fsdevel, Mika Penttilä


On 2021/7/7 at 4:55 PM, Stefan Hajnoczi wrote:
> On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote:
>> On 2021/7/7 at 1:11 AM, Stefan Hajnoczi wrote:
>>> On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote:
>>>> On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>> On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:
>>>>>> On 2021/7/5 at 8:49 PM, Stefan Hajnoczi wrote:
>>>>>>> On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
>>>>>>>> On 2021/7/4 at 5:49 PM, Yongji Xie wrote:
>>>>>>>>>>> OK, I get you now. Since the VIRTIO specification says "Device
>>>>>>>>>>> configuration space is generally used for rarely-changing or
>>>>>>>>>>> initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
>>>>>>>>>>> ioctl should not be called frequently.
>>>>>>>>>> The spec uses MUST and other terms to define the precise requirements.
>>>>>>>>>> Here the language (especially the word "generally") is weaker and means
>>>>>>>>>> there may be exceptions.
>>>>>>>>>>
>>>>>>>>>> Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
>>>>>>>>>> approach is reads that have side-effects. For example, imagine a field
>>>>>>>>>> containing an error code if the device encounters a problem unrelated to
>>>>>>>>>> a specific virtqueue request. Reading from this field resets the error
>>>>>>>>>> code to 0, saving the driver an extra configuration space write access
>>>>>>>>>> and possibly race conditions. It isn't possible to implement those
>>>>>>>>>> semantics using VDUSE_DEV_SET_CONFIG. It's another corner case, but it
>>>>>>>>>> makes me think that the interface does not allow full VIRTIO semantics.
>>>>>>>> Note that though you're correct, my understanding is that config space is
>>>>>>>> not suitable for this kind of error propagation. And it would be very hard
>>>>>>>> to implement this kind of semantics in some transports.  Virtqueue should be
>>>>>>>> much better. As Yong Ji quoted, the config space is used for
>>>>>>>> "rarely-changing or initialization-time parameters".
>>>>>>>>
>>>>>>>>
>>>>>>>>> Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
>>>>>>>>> handle the message failure, I'm going to add a return value to
>>>>>>>>> virtio_config_ops.get() and virtio_cread_* API so that the error can
>>>>>>>>> be propagated to the virtio device driver. Then the virtio-blk device
>>>>>>>>> driver can be modified to handle that.
>>>>>>>>>
>>>>>>>>> Jason and Stefan, what do you think of this way?
>>>>>>> Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
>>>>>>>
>>>>>>> The VIRTIO spec provides no way for the device to report errors from
>>>>>>> config space accesses.
>>>>>>>
>>>>>>> The QEMU virtio-pci implementation returns -1 from invalid
>>>>>>> virtio_config_read*() and silently discards virtio_config_write*()
>>>>>>> accesses.
>>>>>>>
>>>>>>> VDUSE can take the same approach with
>>>>>>> VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
>>>>>>>
>>>>>>>> I'd like to stick to the current assumption that get_config won't fail.
>>>>>>>> That is to say,
>>>>>>>>
>>>>>>>> 1) maintain a config in the kernel, make sure the config space read can
>>>>>>>> always succeed
>>>>>>>> 2) introduce an ioctl for the vduse userspace to update the config space.
>>>>>>>> 3) we can synchronize with the vduse userspace during set_config
>>>>>>>>
>>>>>>>> Does this work?
>>>>>>> I noticed that caching is also allowed by the vhost-user protocol
>>>>>>> messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
>>>>>>> know whether or not caching is in effect. The interface you outlined
>>>>>>> above requires caching.
>>>>>>>
>>>>>>> Is there a reason why the host kernel vDPA code needs to cache the
>>>>>>> configuration space?
>>>>>> Because:
>>>>>>
>>>>>> 1) Kernel can not wait forever in get_config(), this is the major difference
>>>>>> with vhost-user.
>>>>> virtio_cread() can sleep:
>>>>>
>>>>>     #define virtio_cread(vdev, structname, member, ptr)                     \
>>>>>             do {                                                            \
>>>>>                     typeof(((structname*)0)->member) virtio_cread_v;        \
>>>>>                                                                             \
>>>>>                     might_sleep();                                          \
>>>>>                     ^^^^^^^^^^^^^^
>>>>>
>>>>> Which code path cannot sleep?
>>>> Well, it can sleep but it can't sleep forever. For VDUSE, a
>>>> buggy/malicious userspace may refuse to respond to the get_config.
>>>>
>>>> It looks to me the ideal case, with the current virtio spec, for VDUSE is to
>>>>
>>>> 1) maintain the device and its state in the kernel, userspace may sync
>>>> with the kernel device via ioctls
>>>> 2) offload the datapath (virtqueue) to the userspace
>>>>
>>>> This seems more robust and safe than simply relaying everything to
>>>> userspace and waiting for its response.
>>>>
>>>> And we know for sure this model can work, an example is TUN/TAP:
>>>> netdevice is abstracted in the kernel and datapath is done via
>>>> sendmsg()/recvmsg().
>>>>
>>>> Maintaining the config in the kernel follows this model and it can
>>>> simplify the device generation implementation.
>>>>
>>>> For config space write, it requires more thought but fortunately it's
>>>> not commonly used. So VDUSE can choose to filter out the
>>>> devices/features that depend on config writes.
>>> This is the problem. There are other messages like SET_FEATURES where I
>>> guess we'll face the same challenge.
>>
>> Probably not, userspace device can tell the kernel about the device_features
>> and mandated_features during creation, and the feature negotiation could be
>> done purely in the kernel without bothering the userspace.


(For some reason I dropped the list accidentally; adding it back, sorry.)


> Sorry, I confused the messages. I meant SET_STATUS. It's a synchronous
> interface where the driver waits for the device.


It depends on how we define "synchronous" here. If I understand
correctly, the spec doesn't expect any kind of failure from the
set_status operation itself.

Instead, whenever the driver wants synchronization, it should be done via
get_status():

1) re-read the device status to make sure FEATURES_OK is set during feature
negotiation
2) re-read the device status until it is 0 to make sure the device has
finished the reset
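
A small sketch of the two patterns in plain C, not the virtio core code.
The status_ops struct, the function names and the fake device are all
hypothetical and only there to make the example self-contained. 0x08 is
the FEATURES_OK bit from the VIRTIO Device Status field:

```c
#include <assert.h>
#include <stdint.h>

#define STATUS_FEATURES_OK 0x08u /* VIRTIO Device Status bit (8) */

/* Hypothetical accessors standing in for set_status()/get_status(). */
struct status_ops {
	void (*set_status)(void *dev, uint8_t status);
	uint8_t (*get_status)(void *dev);
};

/* Writes FEATURES_OK, then re-reads the status to learn whether the device
 * accepted the features. set_status() itself has no failure path. */
int negotiate_features_ok(const struct status_ops *ops, void *dev)
{
	assert(ops && ops->set_status && ops->get_status);
	uint8_t s = ops->get_status(dev);

	ops->set_status(dev, s | STATUS_FEATURES_OK);
	return (ops->get_status(dev) & STATUS_FEATURES_OK) ? 0 : -1;
}

/* Writes 0 to request a reset, then polls get_status() until it reads 0.
 * The poll is bounded so an unresponsive device is detected, not awaited. */
int reset_and_wait(const struct status_ops *ops, void *dev, unsigned max_polls)
{
	assert(ops && ops->set_status && ops->get_status);
	ops->set_status(dev, 0);
	for (unsigned i = 0; i < max_polls; i++)
		if (ops->get_status(dev) == 0)
			return 0;
	return -1;
}

/* Fake device used only to exercise the sketch: a reset completes after
 * `reset_delay` further status reads, and FEATURES_OK may be refused. */
struct fake_dev {
	uint8_t status;
	int reset_pending;
	unsigned reset_delay;
	int refuse_features;
};

void fake_set_status(void *p, uint8_t status)
{
	struct fake_dev *d = p;

	if (status == 0) {
		d->reset_pending = 1; /* completes later, see fake_get_status() */
		return;
	}
	if (d->refuse_features)
		status &= (uint8_t)~STATUS_FEATURES_OK;
	d->status = status;
}

uint8_t fake_get_status(void *p)
{
	struct fake_dev *d = p;

	if (d->reset_pending) {
		if (d->reset_delay == 0) {
			d->status = 0;
			d->reset_pending = 0;
		} else {
			d->reset_delay--;
		}
	}
	return d->status;
}
```

When the bounded poll gives up, the caller still has to decide what to do
with the device; that policy is outside this sketch.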


>
> VDUSE currently doesn't wait for the device emulation process to handle
> this message (no reply is needed) but I think this is a mistake because
> VDUSE is not following the VIRTIO device model.


With the trick that is done for FEATURES_OK above, I think we don't need 
to wait for the reply.

If userspace takes too long to respond, this can be detected because
get_status() doesn't return the expected value for a long time.

And for the case that needs a timeout, we probably can use NEEDS_RESET.


>
> I strongly suggest designing the VDUSE interface to match the VIRTIO
> device model (or at least the vDPA interface).


I fully agree with you and that is what we want to achieve in this series.


> Defining a custom
> interface for VDUSE avoids some implementation complexity and makes it
> easier to deal with untrusted userspace, but it's impossible to
> implement certain VIRTIO features or devices. It also fragments VIRTIO
> more than necessary; we have a standard, let's stick to it.


Yes.


>
>>> I agree that caching the contents of configuration space in the kernel
>>> helps, but if there are other VDUSE messages with the same problem then
>>> an attacker will exploit them instead.
>>>
>>> I think a systematic solution is needed. It would be necessary to
>>> enumerate the virtio_vdpa and vhost_vdpa cases separately to figure out
>>> where VDUSE messages are synchronous/time-sensitive.
>>
>> This is the case of reset and needs more thought. We should stick to a
>> consistent uAPI for the userspace.
>>
>> For vhost-vDPA, it needs to be synchronized with the userspace and we can
>> wait forever.
> The VMM should still be able to handle signals when a vhost_vdpa ioctl
> is waiting for a reply from the VDUSE userspace process. Or if that's
> not possible then there needs to be a way to force disconnection from
> VDUSE so the VMM can be killed.


Note that VDUSE works under the vDPA bus, so vhost should be transparent to VDUSE.

But we can detect this via whether or not the bounce buffer is used.

Thanks


>
> Stefan


* Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE
  2021-07-07  9:24                               ` Jason Wang
@ 2021-07-07 15:54                                 ` Stefan Hajnoczi
  2021-07-08  4:17                                   ` Jason Wang
  0 siblings, 1 reply; 41+ messages in thread
From: Stefan Hajnoczi @ 2021-07-07 15:54 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, joro, Matthew Wilcox, Christoph Hellwig,
	Xie Yongji, Dan Carpenter, Al Viro, songmuchun@bytedance.com,
	Jens Axboe, gregkh, Randy Dunlap, linux-kernel,
	iommu@lists.linux-foundation.org, bcrl@kvack.org, netdev,
	linux-fsdevel, Mika Penttilä



On Wed, Jul 07, 2021 at 05:24:08PM +0800, Jason Wang wrote:
> 
> On 2021/7/7 at 4:55 PM, Stefan Hajnoczi wrote:
> > On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote:
> > > On 2021/7/7 at 1:11 AM, Stefan Hajnoczi wrote:
> > > > On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote:
> > > > > On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:
> > > > > > > On 2021/7/5 at 8:49 PM, Stefan Hajnoczi wrote:
> > > > > > > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > > > > > > > > On 2021/7/4 at 5:49 PM, Yongji Xie wrote:
> > > > > > > > > > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > > > > > > > > > configuration space is generally used for rarely-changing or
> > > > > > > > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > > ioctl should not be called frequently.
> > > > > > > > > > > The spec uses MUST and other terms to define the precise requirements.
> > > > > > > > > > > Here the language (especially the word "generally") is weaker and means
> > > > > > > > > > > there may be exceptions.
> > > > > > > > > > > 
> > > > > > > > > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > approach is reads that have side-effects. For example, imagine a field
> > > > > > > > > > > containing an error code if the device encounters a problem unrelated to
> > > > > > > > > > > a specific virtqueue request. Reading from this field resets the error
> > > > > > > > > > > code to 0, saving the driver an extra configuration space write access
> > > > > > > > > > > and possibly race conditions. It isn't possible to implement those
> > > > > > > > > > > semantics suing VDUSE_DEV_SET_CONFIG. It's another corner case, but it
> > > > > > > > > > > makes me think that the interface does not allow full VIRTIO semantics.
> > > > > > > > > > Note that although you're correct, my understanding is that config space is
> > > > > > > > > > not suitable for this kind of error propagation. And it would be very hard
> > > > > > > > > > to implement this kind of semantics in some transports. A virtqueue should be
> > > > > > > > > > much better. As Yongji quoted, the config space is used for
> > > > > > > > > > "rarely-changing or initialization-time parameters".
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > > > > > > > > > handle the message failure, I'm going to add a return value to
> > > > > > > > > > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > > > > > > > > > be propagated to the virtio device driver. Then the virtio-blk device
> > > > > > > > > > driver can be modified to handle that.
> > > > > > > > > > 
> > > > > > > > > > Jason and Stefan, what do you think of this way?
> > > > > > > > Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
> > > > > > > > 
> > > > > > > > The VIRTIO spec provides no way for the device to report errors from
> > > > > > > > config space accesses.
> > > > > > > > 
> > > > > > > > The QEMU virtio-pci implementation returns -1 from invalid
> > > > > > > > virtio_config_read*() and silently discards virtio_config_write*()
> > > > > > > > accesses.
> > > > > > > > 
> > > > > > > > VDUSE can take the same approach with
> > > > > > > > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
> > > > > > > > 
> > > > > > > > > > I'd like to stick to the current assumption that get_config won't fail.
> > > > > > > > > That is to say,
> > > > > > > > > 
> > > > > > > > > 1) maintain a config in the kernel, make sure the config space read can
> > > > > > > > > always succeed
> > > > > > > > > > 2) introduce an ioctl for the vduse userspace to update the config space.
> > > > > > > > > 3) we can synchronize with the vduse userspace during set_config
> > > > > > > > > 
> > > > > > > > > Does this work?
> > > > > > > > I noticed that caching is also allowed by the vhost-user protocol
> > > > > > > > messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
> > > > > > > > know whether or not caching is in effect. The interface you outlined
> > > > > > > > above requires caching.
> > > > > > > > 
> > > > > > > > Is there a reason why the host kernel vDPA code needs to cache the
> > > > > > > > configuration space?
> > > > > > > Because:
> > > > > > > 
> > > > > > > 1) Kernel can not wait forever in get_config(), this is the major difference
> > > > > > > with vhost-user.
> > > > > > virtio_cread() can sleep:
> > > > > > 
> > > > > >     #define virtio_cread(vdev, structname, member, ptr)                     \
> > > > > >             do {                                                            \
> > > > > >                     typeof(((structname*)0)->member) virtio_cread_v;        \
> > > > > >                                                                             \
> > > > > >                     might_sleep();                                          \
> > > > > >                     ^^^^^^^^^^^^^^
> > > > > > 
> > > > > > Which code path cannot sleep?
> > > > > Well, it can sleep but it can't sleep forever. For VDUSE, a
> > > > > buggy/malicious userspace may refuse to respond to the get_config.
> > > > > 
> > > > > It looks to me the ideal case, with the current virtio spec, for VDUSE is to
> > > > > 
> > > > > 1) maintain the device and its state in the kernel, userspace may sync
> > > > > with the kernel device via ioctls
> > > > > 2) offload the datapath (virtqueue) to the userspace
> > > > > 
> > > > > This seems more robust and safe than simply relaying everything to
> > > > > userspace and waiting for its response.
> > > > > 
> > > > > And we know for sure this model can work, an example is TUN/TAP:
> > > > > netdevice is abstracted in the kernel and datapath is done via
> > > > > sendmsg()/recvmsg().
> > > > > 
> > > > > Maintaining the config in the kernel follows this model and it can
> > > > > simplify the device generation implementation.
> > > > > 
> > > > > For config space write, it requires more thought but fortunately it's
> > > > > not commonly used. So VDUSE can choose to filter out the
> > > > > device/features that depends on the config write.
> > > > This is the problem. There are other messages like SET_FEATURES where I
> > > > guess we'll face the same challenge.
> > > 
> > > Probably not, userspace device can tell the kernel about the device_features
> > > and mandated_features during creation, and the feature negotiation could be
> > > done purely in the kernel without bothering the userspace.
> 
> 
> (For some reason I dropped the list accidentally; adding it back, sorry)
> 
> 
> > Sorry, I confused the messages. I meant SET_STATUS. It's a synchronous
> > interface where the driver waits for the device.
> 
> 
> It depends on how we define "synchronous" here. If I understand correctly,
> the spec doesn't expect there will be any kind of failure for the operation
> of set_status itself.
> 
> Instead, any time it wants synchronization, it should be done via
> get_status():
> 
> 1) re-read the device status to make sure FEATURES_OK is set during feature
> negotiation
> 2) re-read the device status until it is 0 to make sure the device has
> finished the reset
> 
> 
> > 
> > VDUSE currently doesn't wait for the device emulation process to handle
> > this message (no reply is needed) but I think this is a mistake because
> > VDUSE is not following the VIRTIO device model.
> 
> 
> With the trick that is done for FEATURES_OK above, I think we don't need to
> wait for the reply.
> 
> If userspace takes too long to respond, it can be detected since
> get_status() doesn't return the expected value for a long time.
> 
> And for the case that needs a timeout, we probably can use NEEDS_RESET.

I think you're right. get_status is the synchronization point, not
set_status.

Currently there is no VDUSE GET_STATUS message. The
VDUSE_START/STOP_DATAPLANE messages could be changed to SET_STATUS so
that the device emulation program can participate in emulating the
Device Status field. This change could affect VDUSE's VIRTIO feature
interface since the device emulation program can reject features by not
setting FEATURES_OK.
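
To illustrate the point about get_status being the synchronization
point, here is a minimal userspace sketch. It is not VDUSE or kernel
code: struct toy_dev, toy_set_status(), toy_get_status() and
toy_negotiate() are hypothetical names made up for this example. Only
the status bit values come from the VIRTIO specification. The idea is
that set_status returns nothing, the device rejects features by
leaving FEATURES_OK clear, and the driver finds out by re-reading.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Device Status bits as defined by the VIRTIO specification. */
#define VIRTIO_CONFIG_S_ACKNOWLEDGE 0x01
#define VIRTIO_CONFIG_S_DRIVER      0x02
#define VIRTIO_CONFIG_S_DRIVER_OK   0x04
#define VIRTIO_CONFIG_S_FEATURES_OK 0x08

/* Hypothetical toy device, not a VDUSE structure. */
struct toy_dev {
	uint8_t status;        /* value reported by get_status */
	bool accept_features;  /* does the device accept the driver's features? */
};

/* set_status has no return value, so it cannot report failure. */
static void toy_set_status(struct toy_dev *dev, uint8_t status)
{
	assert(dev);
	/* The device rejects the features by leaving FEATURES_OK clear. */
	if ((status & VIRTIO_CONFIG_S_FEATURES_OK) && !dev->accept_features)
		status &= (uint8_t)~VIRTIO_CONFIG_S_FEATURES_OK;
	dev->status = status;
}

static uint8_t toy_get_status(const struct toy_dev *dev)
{
	assert(dev);
	return dev->status;
}

/* Driver side: write FEATURES_OK, then re-read to see if it stuck. */
static bool toy_negotiate(struct toy_dev *dev)
{
	uint8_t s = toy_get_status(dev) | VIRTIO_CONFIG_S_FEATURES_OK;

	toy_set_status(dev, s);
	return (toy_get_status(dev) & VIRTIO_CONFIG_S_FEATURES_OK) != 0;
}
```

A driver would call toy_negotiate() after writing its feature bits and
give up on the device if it returns false.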

Stefan

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE
  2021-07-07 15:54                                 ` Stefan Hajnoczi
@ 2021-07-08  4:17                                   ` Jason Wang
  2021-07-08  9:06                                     ` Stefan Hajnoczi
  0 siblings, 1 reply; 41+ messages in thread
From: Jason Wang @ 2021-07-08  4:17 UTC (permalink / raw)
  To: Stefan Hajnoczi
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, joro, Matthew Wilcox, Christoph Hellwig,
	Xie Yongji, Dan Carpenter, Al Viro, songmuchun@bytedance.com,
	Jens Axboe, gregkh, Randy Dunlap, linux-kernel,
	iommu@lists.linux-foundation.org, bcrl@kvack.org, netdev,
	linux-fsdevel, Mika Penttilä


On 2021/7/7 at 11:54 PM, Stefan Hajnoczi wrote:
> On Wed, Jul 07, 2021 at 05:24:08PM +0800, Jason Wang wrote:
>> On 2021/7/7 at 4:55 PM, Stefan Hajnoczi wrote:
>>> On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote:
>>>> On 2021/7/7 at 1:11 AM, Stefan Hajnoczi wrote:
>>>>> On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote:
>>>>>> On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
>>>>>>> On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:
>>>>>>>> On 2021/7/5 at 8:49 PM, Stefan Hajnoczi wrote:
>>>>>>>>> On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
>>>>>>>>>> On 2021/7/4 at 5:49 PM, Yongji Xie wrote:
>>>>>>>>>>>>> OK, I get you now. Since the VIRTIO specification says "Device
>>>>>>>>>>>>> configuration space is generally used for rarely-changing or
>>>>>>>>>>>>> initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
>>>>>>>>>>>>> ioctl should not be called frequently.
>>>>>>>>>>>> The spec uses MUST and other terms to define the precise requirements.
>>>>>>>>>>>> Here the language (especially the word "generally") is weaker and means
>>>>>>>>>>>> there may be exceptions.
>>>>>>>>>>>>
>>>>>>>>>>>> Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
>>>>>>>>>>>> approach is reads that have side-effects. For example, imagine a field
>>>>>>>>>>>> containing an error code if the device encounters a problem unrelated to
>>>>>>>>>>>> a specific virtqueue request. Reading from this field resets the error
>>>>>>>>>>>> code to 0, saving the driver an extra configuration space write access
>>>>>>>>>>>> and possibly race conditions. It isn't possible to implement those
>>>>>>>>>>>> semantics using VDUSE_DEV_SET_CONFIG. It's another corner case, but it
>>>>>>>>>>>> makes me think that the interface does not allow full VIRTIO semantics.
>>>>>>>>>> Note that although you're correct, my understanding is that config space is
>>>>>>>>>> not suitable for this kind of error propagation. And it would be very hard
>>>>>>>>>> to implement this kind of semantics in some transports. A virtqueue should be
>>>>>>>>>> much better. As Yongji quoted, the config space is used for
>>>>>>>>>> "rarely-changing or initialization-time parameters".
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
>>>>>>>>>>> handle the message failure, I'm going to add a return value to
>>>>>>>>>>> virtio_config_ops.get() and virtio_cread_* API so that the error can
>>>>>>>>>>> be propagated to the virtio device driver. Then the virtio-blk device
>>>>>>>>>>> driver can be modified to handle that.
>>>>>>>>>>>
>>>>>>>>>>> Jason and Stefan, what do you think of this way?
>>>>>>>>> Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
>>>>>>>>>
>>>>>>>>> The VIRTIO spec provides no way for the device to report errors from
>>>>>>>>> config space accesses.
>>>>>>>>>
>>>>>>>>> The QEMU virtio-pci implementation returns -1 from invalid
>>>>>>>>> virtio_config_read*() and silently discards virtio_config_write*()
>>>>>>>>> accesses.
>>>>>>>>>
>>>>>>>>> VDUSE can take the same approach with
>>>>>>>>> VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
>>>>>>>>>
>>>>>>>>>> I'd like to stick to the current assumption that get_config won't fail.
>>>>>>>>>> That is to say,
>>>>>>>>>>
>>>>>>>>>> 1) maintain a config in the kernel, make sure the config space read can
>>>>>>>>>> always succeed
>>>>>>>>>> 2) introduce an ioctl for the vduse userspace to update the config space.
>>>>>>>>>> 3) we can synchronize with the vduse userspace during set_config
>>>>>>>>>>
>>>>>>>>>> Does this work?
>>>>>>>>> I noticed that caching is also allowed by the vhost-user protocol
>>>>>>>>> messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
>>>>>>>>> know whether or not caching is in effect. The interface you outlined
>>>>>>>>> above requires caching.
>>>>>>>>>
>>>>>>>>> Is there a reason why the host kernel vDPA code needs to cache the
>>>>>>>>> configuration space?
>>>>>>>> Because:
>>>>>>>>
>>>>>>>> 1) Kernel can not wait forever in get_config(), this is the major difference
>>>>>>>> with vhost-user.
>>>>>>> virtio_cread() can sleep:
>>>>>>>
>>>>>>>      #define virtio_cread(vdev, structname, member, ptr)                     \
>>>>>>>              do {                                                            \
>>>>>>>                      typeof(((structname*)0)->member) virtio_cread_v;        \
>>>>>>>                                                                              \
>>>>>>>                      might_sleep();                                          \
>>>>>>>                      ^^^^^^^^^^^^^^
>>>>>>>
>>>>>>> Which code path cannot sleep?
>>>>>> Well, it can sleep but it can't sleep forever. For VDUSE, a
>>>>>> buggy/malicious userspace may refuse to respond to the get_config.
>>>>>>
>>>>>> It looks to me the ideal case, with the current virtio spec, for VDUSE is to
>>>>>>
>>>>>> 1) maintain the device and its state in the kernel, userspace may sync
>>>>>> with the kernel device via ioctls
>>>>>> 2) offload the datapath (virtqueue) to the userspace
>>>>>>
>>>>>> This seems more robust and safe than simply relaying everything to
>>>>>> userspace and waiting for its response.
>>>>>>
>>>>>> And we know for sure this model can work, an example is TUN/TAP:
>>>>>> netdevice is abstracted in the kernel and datapath is done via
>>>>>> sendmsg()/recvmsg().
>>>>>>
>>>>>> Maintaining the config in the kernel follows this model and it can
>>>>>> simplify the device generation implementation.
>>>>>>
>>>>>> For config space write, it requires more thought but fortunately it's
>>>>>> not commonly used. So VDUSE can choose to filter out the
>>>>>> device/features that depends on the config write.
>>>>> This is the problem. There are other messages like SET_FEATURES where I
>>>>> guess we'll face the same challenge.
>>>> Probably not, userspace device can tell the kernel about the device_features
>>>> and mandated_features during creation, and the feature negotiation could be
>>>> done purely in the kernel without bothering the userspace.
>>
>> (For some reason I dropped the list accidentally; adding it back, sorry)
>>
>>
>>> Sorry, I confused the messages. I meant SET_STATUS. It's a synchronous
>>> interface where the driver waits for the device.
>>
>> It depends on how we define "synchronous" here. If I understand correctly,
>> the spec doesn't expect there will be any kind of failure for the operation
>> of set_status itself.
>>
>> Instead, any time it wants synchronization, it should be done via
>> get_status():
>>
>> 1) re-read the device status to make sure FEATURES_OK is set during feature
>> negotiation
>> 2) re-read the device status until it is 0 to make sure the device has
>> finished the reset
>>
>>
>>> VDUSE currently doesn't wait for the device emulation process to handle
>>> this message (no reply is needed) but I think this is a mistake because
>>> VDUSE is not following the VIRTIO device model.
>>
>> With the trick that is done for FEATURES_OK above, I think we don't need to
>> wait for the reply.
>>
>> If userspace takes too long to respond, it can be detected since
>> get_status() doesn't return the expected value for a long time.
>>
>> And for the case that needs a timeout, we probably can use NEEDS_RESET.
> I think you're right. get_status is the synchronization point, not
> set_status.
>
> Currently there is no VDUSE GET_STATUS message. The
> VDUSE_START/STOP_DATAPLANE messages could be changed to SET_STATUS so
> that the device emulation program can participate in emulating the
> Device Status field.


I'm not sure I get this, but isn't this what has already been done?

+static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
+{
+    struct vduse_dev *dev = vdpa_to_vduse(vdpa);
+    bool started = !!(status & VIRTIO_CONFIG_S_DRIVER_OK);
+
+    dev->status = status;
+
+    if (dev->started == started)
+        return;
+
+    dev->started = started;
+    if (dev->started) {
+        vduse_dev_start_dataplane(dev);
+    } else {
+        vduse_dev_reset(dev);
+        vduse_dev_stop_dataplane(dev);
+    }
+}


But this does not look correct:

1) !DRIVER_OK doesn't mean a reset, does it?
2) We need to deal with FEATURES_OK
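
One possible reading of the two points above, as a self-contained
userspace toy model rather than a patch against the VDUSE code:
struct sketch_dev, sketch_set_status() and the features_acceptable
flag are hypothetical names. The sketch assumes that only a status
write of 0 is a reset, that the dataplane simply follows DRIVER_OK,
and that FEATURES_OK is left clear when the features are not
acceptable.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Device Status bits as defined by the VIRTIO specification. */
#define S_ACKNOWLEDGE 0x01
#define S_DRIVER      0x02
#define S_DRIVER_OK   0x04
#define S_FEATURES_OK 0x08

/* Hypothetical device state for illustration only. */
struct sketch_dev {
	uint8_t status;
	bool started;              /* dataplane running? */
	bool features_acceptable;  /* assumed result of a feature check */
	unsigned int resets;       /* counts reset requests seen */
};

static void sketch_set_status(struct sketch_dev *dev, uint8_t status)
{
	bool started;

	assert(dev);

	/* Point 1: only a write of 0 requests a reset. */
	if (status == 0) {
		dev->started = false;  /* stop the dataplane first */
		dev->status = 0;
		dev->resets++;
		return;
	}

	/* Point 2: refuse FEATURES_OK if the features are unacceptable. */
	if ((status & S_FEATURES_OK) && !dev->features_acceptable)
		status &= (uint8_t)~S_FEATURES_OK;

	dev->status = status;

	/* The dataplane follows DRIVER_OK without implying a reset. */
	started = (status & S_DRIVER_OK) != 0;
	if (dev->started != started)
		dev->started = started;
}
```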

Thanks


>   This change could affect VDUSE's VIRTIO feature
> interface since the device emulation program can reject features by not
> setting FEATURES_OK.
>
> Stefan


^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE
  2021-07-08  4:17                                   ` Jason Wang
@ 2021-07-08  9:06                                     ` Stefan Hajnoczi
  0 siblings, 0 replies; 41+ messages in thread
From: Stefan Hajnoczi @ 2021-07-08  9:06 UTC (permalink / raw)
  To: Jason Wang
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, joro, Matthew Wilcox, Christoph Hellwig,
	Xie Yongji, Dan Carpenter, Al Viro, songmuchun@bytedance.com,
	Jens Axboe, gregkh, Randy Dunlap, linux-kernel,
	iommu@lists.linux-foundation.org, bcrl@kvack.org, netdev,
	linux-fsdevel, Mika Penttilä


On Thu, Jul 08, 2021 at 12:17:56PM +0800, Jason Wang wrote:
> 
> On 2021/7/7 at 11:54 PM, Stefan Hajnoczi wrote:
> > On Wed, Jul 07, 2021 at 05:24:08PM +0800, Jason Wang wrote:
> > > On 2021/7/7 at 4:55 PM, Stefan Hajnoczi wrote:
> > > > On Wed, Jul 07, 2021 at 11:43:28AM +0800, Jason Wang wrote:
> > > > > On 2021/7/7 at 1:11 AM, Stefan Hajnoczi wrote:
> > > > > > On Tue, Jul 06, 2021 at 09:08:26PM +0800, Jason Wang wrote:
> > > > > > > On Tue, Jul 6, 2021 at 6:15 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > > > > > > On Tue, Jul 06, 2021 at 10:34:33AM +0800, Jason Wang wrote:
> > > > > > > > > On 2021/7/5 at 8:49 PM, Stefan Hajnoczi wrote:
> > > > > > > > > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > > > > > > > > > > On 2021/7/4 at 5:49 PM, Yongji Xie wrote:
> > > > > > > > > > > > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > > > > > > > > > > > configuration space is generally used for rarely-changing or
> > > > > > > > > > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > > > > ioctl should not be called frequently.
> > > > > > > > > > > > > The spec uses MUST and other terms to define the precise requirements.
> > > > > > > > > > > > > Here the language (especially the word "generally") is weaker and means
> > > > > > > > > > > > > there may be exceptions.
> > > > > > > > > > > > > 
> > > > > > > > > > > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
> > > > > > > > > > > > > approach is reads that have side-effects. For example, imagine a field
> > > > > > > > > > > > > containing an error code if the device encounters a problem unrelated to
> > > > > > > > > > > > > a specific virtqueue request. Reading from this field resets the error
> > > > > > > > > > > > > code to 0, saving the driver an extra configuration space write access
> > > > > > > > > > > > > and possibly race conditions. It isn't possible to implement those
> > > > > > > > > > > > > semantics using VDUSE_DEV_SET_CONFIG. It's another corner case, but it
> > > > > > > > > > > > > makes me think that the interface does not allow full VIRTIO semantics.
> > > > > > > > > > > Note that although you're correct, my understanding is that config space is
> > > > > > > > > > > not suitable for this kind of error propagation. And it would be very hard
> > > > > > > > > > > to implement this kind of semantics in some transports. A virtqueue should be
> > > > > > > > > > > much better. As Yongji quoted, the config space is used for
> > > > > > > > > > > "rarely-changing or initialization-time parameters".
> > > > > > > > > > > 
> > > > > > > > > > > 
> > > > > > > > > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > > > > > > > > > > > handle the message failure, I'm going to add a return value to
> > > > > > > > > > > > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > > > > > > > > > > > be propagated to the virtio device driver. Then the virtio-blk device
> > > > > > > > > > > > driver can be modified to handle that.
> > > > > > > > > > > > 
> > > > > > > > > > > > Jason and Stefan, what do you think of this way?
> > > > > > > > > > Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
> > > > > > > > > > 
> > > > > > > > > > The VIRTIO spec provides no way for the device to report errors from
> > > > > > > > > > config space accesses.
> > > > > > > > > > 
> > > > > > > > > > The QEMU virtio-pci implementation returns -1 from invalid
> > > > > > > > > > virtio_config_read*() and silently discards virtio_config_write*()
> > > > > > > > > > accesses.
> > > > > > > > > > 
> > > > > > > > > > VDUSE can take the same approach with
> > > > > > > > > > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
> > > > > > > > > > 
> > > > > > > > > > > I'd like to stick to the current assumption that get_config won't fail.
> > > > > > > > > > > That is to say,
> > > > > > > > > > > 
> > > > > > > > > > > 1) maintain a config in the kernel, make sure the config space read can
> > > > > > > > > > > always succeed
> > > > > > > > > > > 2) introduce an ioctl for the vduse userspace to update the config space.
> > > > > > > > > > > 3) we can synchronize with the vduse userspace during set_config
> > > > > > > > > > > 
> > > > > > > > > > > Does this work?
> > > > > > > > > > I noticed that caching is also allowed by the vhost-user protocol
> > > > > > > > > > messages (QEMU's docs/interop/vhost-user.rst), but the device doesn't
> > > > > > > > > > know whether or not caching is in effect. The interface you outlined
> > > > > > > > > > above requires caching.
> > > > > > > > > > 
> > > > > > > > > > Is there a reason why the host kernel vDPA code needs to cache the
> > > > > > > > > > configuration space?
> > > > > > > > > Because:
> > > > > > > > > 
> > > > > > > > > 1) Kernel can not wait forever in get_config(), this is the major difference
> > > > > > > > > with vhost-user.
> > > > > > > > virtio_cread() can sleep:
> > > > > > > > 
> > > > > > > >      #define virtio_cread(vdev, structname, member, ptr)                     \
> > > > > > > >              do {                                                            \
> > > > > > > >                      typeof(((structname*)0)->member) virtio_cread_v;        \
> > > > > > > >                                                                              \
> > > > > > > >                      might_sleep();                                          \
> > > > > > > >                      ^^^^^^^^^^^^^^
> > > > > > > > 
> > > > > > > > Which code path cannot sleep?
> > > > > > > Well, it can sleep but it can't sleep forever. For VDUSE, a
> > > > > > > buggy/malicious userspace may refuse to respond to the get_config.
> > > > > > > 
> > > > > > > It looks to me the ideal case, with the current virtio spec, for VDUSE is to
> > > > > > > 
> > > > > > > 1) maintain the device and its state in the kernel, userspace may sync
> > > > > > > with the kernel device via ioctls
> > > > > > > 2) offload the datapath (virtqueue) to the userspace
> > > > > > > 
> > > > > > > This seems more robust and safe than simply relaying everything to
> > > > > > > userspace and waiting for its response.
> > > > > > > 
> > > > > > > And we know for sure this model can work, an example is TUN/TAP:
> > > > > > > netdevice is abstracted in the kernel and datapath is done via
> > > > > > > sendmsg()/recvmsg().
> > > > > > > 
> > > > > > > Maintaining the config in the kernel follows this model and it can
> > > > > > > simplify the device generation implementation.
> > > > > > > 
> > > > > > > For config space write, it requires more thought but fortunately it's
> > > > > > > not commonly used. So VDUSE can choose to filter out the
> > > > > > > device/features that depends on the config write.
> > > > > > This is the problem. There are other messages like SET_FEATURES where I
> > > > > > guess we'll face the same challenge.
> > > > > Probably not, userspace device can tell the kernel about the device_features
> > > > > and mandated_features during creation, and the feature negotiation could be
> > > > > done purely in the kernel without bothering the userspace.
> > > 
> > > (For some reason I dropped the list accidentally; adding it back, sorry)
> > > 
> > > 
> > > > Sorry, I confused the messages. I meant SET_STATUS. It's a synchronous
> > > > interface where the driver waits for the device.
> > > 
> > > It depends on how we define "synchronous" here. If I understand correctly,
> > > the spec doesn't expect there will be any kind of failure for the operation
> > > of set_status itself.
> > > 
> > > Instead, any time it wants synchronization, it should be done via
> > > get_status():
> > > 
> > > 1) re-read the device status to make sure FEATURES_OK is set during feature
> > > negotiation
> > > 2) re-read the device status until it is 0 to make sure the device has
> > > finished the reset
> > > 
> > > 
> > > > VDUSE currently doesn't wait for the device emulation process to handle
> > > > this message (no reply is needed) but I think this is a mistake because
> > > > VDUSE is not following the VIRTIO device model.
> > > 
> > > With the trick that is done for FEATURES_OK above, I think we don't need to
> > > wait for the reply.
> > > 
> > > If userspace takes too long to respond, it can be detected since
> > > get_status() doesn't return the expected value for a long time.
> > > 
> > > And for the case that needs a timeout, we probably can use NEEDS_RESET.
> > I think you're right. get_status is the synchronization point, not
> > set_status.
> > 
> > Currently there is no VDUSE GET_STATUS message. The
> > VDUSE_START/STOP_DATAPLANE messages could be changed to SET_STATUS so
> > that the device emulation program can participate in emulating the
> > Device Status field.
> 
> 
> I'm not sure I get this, but isn't this what has already been done?
> 
> +static void vduse_vdpa_set_status(struct vdpa_device *vdpa, u8 status)
> +{
> +    struct vduse_dev *dev = vdpa_to_vduse(vdpa);
> +    bool started = !!(status & VIRTIO_CONFIG_S_DRIVER_OK);
> +
> +    dev->status = status;
> +
> +    if (dev->started == started)
> +        return;
> +
> +    dev->started = started;
> +    if (dev->started) {
> +        vduse_dev_start_dataplane(dev);
> +    } else {
> +        vduse_dev_reset(dev);
> +        vduse_dev_stop_dataplane(dev);
> +    }
> +}
> 
> 
> But this does not look correct:
> 
> 1) !DRIVER_OK doesn't mean a reset, does it?
> 2) We need to deal with FEATURES_OK

I'm not sure if this reply was to me or to Yongji Xie?

Currently vduse_vdpa_set_status() does not allow the device emulation
program to participate fully in Device Status field changes. It hides
the status bits and only sends VDUSE_START/STOP_DATAPLANE.

I suggest having GET_STATUS/SET_STATUS messages instead, allowing the
device emulation program to handle these parts of the VIRTIO device
model (e.g. rejecting combinations of features that are mutually
exclusive).
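
As a rough sketch of what this could look like on the device
emulation side (all names are hypothetical: VDUSE at this revision
has no GET_STATUS/SET_STATUS messages, and EMU_F_MODE_A/EMU_F_MODE_B
are invented feature bits standing in for any mutually exclusive
pair), the program keeps the Device Status field itself and leaves
FEATURES_OK clear when the driver asked for both features:

```c
#include <assert.h>
#include <stdint.h>

/* FEATURES_OK bit of the Device Status field (VIRTIO specification). */
#define VIRTIO_CONFIG_S_FEATURES_OK 0x08

/* Hypothetical message types, not part of the VDUSE interface. */
enum emu_msg_type { EMU_GET_STATUS, EMU_SET_STATUS };

/* Hypothetical feature bits that must not be negotiated together. */
#define EMU_F_MODE_A (1ULL << 0)
#define EMU_F_MODE_B (1ULL << 1)

/* State held by the (hypothetical) device emulation program. */
struct emu_dev {
	uint64_t driver_features;  /* features written by the driver */
	uint8_t status;            /* Device Status field */
};

/*
 * Handle one status message and return the resulting status.
 * For EMU_GET_STATUS the value argument is ignored.
 */
static uint8_t emu_handle_status(struct emu_dev *dev,
				 enum emu_msg_type type, uint8_t value)
{
	const uint64_t exclusive = EMU_F_MODE_A | EMU_F_MODE_B;

	assert(dev);

	if (type == EMU_SET_STATUS) {
		/* Reject the combination by not setting FEATURES_OK. */
		if ((value & VIRTIO_CONFIG_S_FEATURES_OK) &&
		    (dev->driver_features & exclusive) == exclusive)
			value &= (uint8_t)~VIRTIO_CONFIG_S_FEATURES_OK;
		dev->status = value;
	}
	return dev->status;
}
```

The driver would then notice the rejection when it re-reads the
status and finds FEATURES_OK still clear.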

Stefan

^ permalink raw reply	[flat|nested] 41+ messages in thread

* Re: [PATCH v8 10/10] Documentation: Add documentation for VDUSE
       [not found]                       ` <CACycT3t=V-VV7LYDda8mt=QxN_Ay-N+3dgWp382TObkeei9MOg@mail.gmail.com>
@ 2021-07-08  9:07                         ` Stefan Hajnoczi
  0 siblings, 0 replies; 41+ messages in thread
From: Stefan Hajnoczi @ 2021-07-08  9:07 UTC (permalink / raw)
  To: Yongji Xie
  Cc: kvm, Michael S. Tsirkin, virtualization, Christian Brauner,
	Jonathan Corbet, joro, Matthew Wilcox, Christoph Hellwig,
	Dan Carpenter, Al Viro, songmuchun, Jens Axboe, Greg KH,
	Randy Dunlap, linux-kernel, iommu, bcrl, netdev, linux-fsdevel,
	Mika Penttilä


On Wed, Jul 07, 2021 at 05:09:13PM +0800, Yongji Xie wrote:
> On Tue, Jul 6, 2021 at 6:22 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> >
> > On Tue, Jul 06, 2021 at 11:04:18AM +0800, Yongji Xie wrote:
> > > On Mon, Jul 5, 2021 at 8:50 PM Stefan Hajnoczi <stefanha@redhat.com> wrote:
> > > >
> > > > On Mon, Jul 05, 2021 at 11:36:15AM +0800, Jason Wang wrote:
> > > > >
> > > > > On 2021/7/4 at 5:49 PM, Yongji Xie wrote:
> > > > > > > > OK, I get you now. Since the VIRTIO specification says "Device
> > > > > > > > configuration space is generally used for rarely-changing or
> > > > > > > > initialization-time parameters". I assume the VDUSE_DEV_SET_CONFIG
> > > > > > > > ioctl should not be called frequently.
> > > > > > > The spec uses MUST and other terms to define the precise requirements.
> > > > > > > Here the language (especially the word "generally") is weaker and means
> > > > > > > there may be exceptions.
> > > > > > >
> > > > > > > Another type of access that doesn't work with the VDUSE_DEV_SET_CONFIG
> > > > > > > approach is reads that have side-effects. For example, imagine a field
> > > > > > > containing an error code if the device encounters a problem unrelated to
> > > > > > > a specific virtqueue request. Reading from this field resets the error
> > > > > > > code to 0, saving the driver an extra configuration space write access
> > > > > > > and possibly race conditions. It isn't possible to implement those
> > > > > > > semantics using VDUSE_DEV_SET_CONFIG. It's another corner case, but it
> > > > > > > makes me think that the interface does not allow full VIRTIO semantics.
> > > > >
> > > > >
> > > > > Note that although you're correct, my understanding is that config space is
> > > > > not suitable for this kind of error propagation. And it would be very hard
> > > > > to implement this kind of semantics in some transports. A virtqueue should be
> > > > > much better. As Yongji quoted, the config space is used for
> > > > > "rarely-changing or initialization-time parameters".
> > > > >
> > > > >
> > > > > > Agreed. I will use VDUSE_DEV_GET_CONFIG in the next version. And to
> > > > > > handle the message failure, I'm going to add a return value to
> > > > > > virtio_config_ops.get() and virtio_cread_* API so that the error can
> > > > > > be propagated to the virtio device driver. Then the virtio-blk device
> > > > > > driver can be modified to handle that.
> > > > > >
> > > > > > Jason and Stefan, what do you think of this way?
> > > >
> > > > Why does VDUSE_DEV_GET_CONFIG need to support an error return value?
> > > >
> > >
> > > We add a timeout and return an error in case userspace never replies to
> > > the message.
> > >
> > > > The VIRTIO spec provides no way for the device to report errors from
> > > > config space accesses.
> > > >
> > > > The QEMU virtio-pci implementation returns -1 from invalid
> > > > virtio_config_read*() and silently discards virtio_config_write*()
> > > > accesses.
> > > >
> > > > VDUSE can take the same approach with
> > > > VDUSE_DEV_GET_CONFIG/VDUSE_DEV_SET_CONFIG.
> > > >
> > >
> > > I noticed that virtio_config_read*() only returns -1 when we access an
> > > invalid field. But in the VDUSE case, VDUSE_DEV_GET_CONFIG might fail
> > > when we access a valid field. Not sure if it's ok to silently ignore
> > > this kind of error.
> >
> > That's a good point but it's a general VIRTIO issue. Any device
> > implementation (QEMU userspace, hardware vDPA, etc) can fail, so the
> > VIRTIO specification needs to provide a way for the driver to detect
> > this.
> >
> > If userspace violates the contract then VDUSE needs to mark the device
> > broken. QEMU's device emulation does something similar with the
> > vdev->broken flag.
> >
> > The VIRTIO Device Status field DEVICE_NEEDS_RESET bit can be set by
> > vDPA/VDUSE to indicate that the device is not operational and must be
> > reset.
> >
> 
> It might be a solution. But DEVICE_NEEDS_RESET is not implemented
> currently. So I'm wondering whether it's OK to add a check of the
> DEVICE_NEEDS_RESET status bit in the probe function of the virtio device
> driver (e.g. the virtio-blk driver). Then VDUSE can make use of it to fail
> device initialization when a configuration space access fails.

Okay.

Stefan
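
For readers of the archive, the approach agreed on above can be sketched
in plain userspace C. This is only an illustration under stated
assumptions, not kernel code and not the patch that was eventually
posted: struct sketch_vdev, sketch_read_capacity() and sketch_probe()
are hypothetical names, and the backend_alive flag merely simulates a
VDUSE userspace daemon that never replies. The two status bit values are
the ones defined by the VIRTIO specification. The config read itself has
no error return; on failure the device sets DEVICE_NEEDS_RESET and the
read yields all ones (similar to the -1 QEMU returns for invalid reads),
and the probe path checks the status bit afterwards.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Device Status bits, values as defined by the VIRTIO specification. */
#define VIRTIO_CONFIG_S_DRIVER_OK   0x04
#define VIRTIO_CONFIG_S_NEEDS_RESET 0x40

/* Hypothetical stand-in for a virtio device; not the kernel's struct. */
struct sketch_vdev {
	uint8_t status;       /* Device Status field */
	uint64_t capacity;    /* example config space field */
	int backend_alive;    /* 0 simulates userspace never replying */
};

/*
 * Config space read with no error return. If the (simulated) backend
 * does not answer, mark the device as needing a reset and return all
 * ones.
 */
uint64_t sketch_read_capacity(struct sketch_vdev *dev)
{
	assert(dev != NULL);
	if (!dev->backend_alive) {
		dev->status |= VIRTIO_CONFIG_S_NEEDS_RESET;
		return UINT64_MAX;
	}
	return dev->capacity;
}

/*
 * Hypothetical probe path: read the config, then fail initialization
 * if the device reported DEVICE_NEEDS_RESET in the meantime.
 */
int sketch_probe(struct sketch_vdev *dev, uint64_t *capacity)
{
	uint64_t cap = sketch_read_capacity(dev);

	if (dev->status & VIRTIO_CONFIG_S_NEEDS_RESET)
		return -EIO;

	*capacity = cap;
	dev->status |= VIRTIO_CONFIG_S_DRIVER_OK;
	return 0;
}
```

The point of checking the status bit rather than the returned value is
that all ones may be a legitimate value for a valid field, so the value
alone cannot signal failure.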


^ permalink raw reply	[flat|nested] 41+ messages in thread

end of thread, other threads:[~2021-07-08  9:07 UTC | newest]

Thread overview: 41+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <20210615141331.407-1-xieyongji@bytedance.com>
     [not found] ` <20210615141331.407-4-xieyongji@bytedance.com>
2021-06-17  8:33   ` [PATCH v8 03/10] eventfd: Increase the recursion depth of eventfd_signal() He Zhe
     [not found]     ` <CACycT3t1Dgrzsr7LbBrDhRLDa3qZ85ZOgj9H7r1fqPi-kf7r6Q@mail.gmail.com>
2021-06-18  8:41       ` He Zhe
2021-06-18  8:44       ` [PATCH] eventfd: Enlarge recursion limit to allow vhost to work He Zhe
2021-07-03  8:31         ` Michael S. Tsirkin
     [not found] ` <20210615141331.407-11-xieyongji@bytedance.com>
2021-06-24 13:01   ` [PATCH v8 10/10] Documentation: Add documentation for VDUSE Stefan Hajnoczi
     [not found]     ` <CACycT3uxnQmXWsgmNVxQtiRhz1UXXTAJFY3OiAJqokbJH6ifMA@mail.gmail.com>
2021-06-30 10:06       ` Stefan Hajnoczi
     [not found]         ` <CACycT3taKhf1cWp3Jd0aSVekAZvpbR-_fkyPLQ=B+jZBB5H=8Q@mail.gmail.com>
2021-07-01 13:15           ` Stefan Hajnoczi
     [not found]             ` <CACycT3vo-diHgTSLw_FS2E+5ia5VjihE3qw7JmZR7JT55P-wQA@mail.gmail.com>
2021-07-05  3:36               ` Jason Wang
2021-07-05 12:49                 ` Stefan Hajnoczi
2021-07-06  2:34                   ` Jason Wang
2021-07-06 10:14                     ` Stefan Hajnoczi
     [not found]                       ` <CACGkMEs2HHbUfarum8uQ6wuXoDwLQUSXTsa-huJFiqr__4cwRg@mail.gmail.com>
     [not found]                         ` <YOSOsrQWySr0andk@stefanha-x1.localdomain>
     [not found]                           ` <100e6788-7fdf-1505-d69c-bc28a8bc7a78@redhat.com>
     [not found]                             ` <YOVr801d01YOPzLL@stefanha-x1.localdomain>
2021-07-07  9:24                               ` Jason Wang
2021-07-07 15:54                                 ` Stefan Hajnoczi
2021-07-08  4:17                                   ` Jason Wang
2021-07-08  9:06                                     ` Stefan Hajnoczi
     [not found]                   ` <CACycT3t-BTMrpNTwBUfbvaxTh6tLthxbo3OJwMk_iuiSpMuZPg@mail.gmail.com>
2021-07-06 10:22                     ` Stefan Hajnoczi
     [not found]                       ` <CACycT3t=V-VV7LYDda8mt=QxN_Ay-N+3dgWp382TObkeei9MOg@mail.gmail.com>
2021-07-08  9:07                         ` Stefan Hajnoczi
2021-06-24 15:12 ` [PATCH v8 00/10] Introduce VDUSE - vDPA Device in Userspace Stefan Hajnoczi
2021-06-28 10:33 ` Liu Xiaodong
2021-06-28  4:35   ` Jason Wang
2021-06-28  5:54     ` Liu, Xiaodong
2021-06-29  4:10       ` Jason Wang
2021-06-29  7:56         ` Liu, Xiaodong
2021-06-28 10:32   ` Yongji Xie
2021-06-29  4:12     ` Jason Wang
     [not found]       ` <CACycT3vVhNdhtyohKJQuMXTic5m6jDjEfjzbzvp=2FJgwup8mg@mail.gmail.com>
2021-06-29  7:33         ` Jason Wang
     [not found] ` <20210615141331.407-10-xieyongji@bytedance.com>
2021-06-21  9:13   ` [PATCH v8 09/10] vduse: " Jason Wang
     [not found]     ` <CACycT3tAON+-qZev+9EqyL2XbgH5HDspOqNt3ohQLQ8GqVK=EA@mail.gmail.com>
2021-06-22  5:06       ` Jason Wang
     [not found]         ` <CACycT3uzMJS7vw6MVMOgY4rb=SPfT2srV+8DPdwUVeELEiJgbA@mail.gmail.com>
2021-06-22  7:49           ` Jason Wang
     [not found]             ` <CACycT3uuooKLNnpPHewGZ=q46Fap2P4XCFirdxxn=FxK+X1ECg@mail.gmail.com>
2021-06-23  3:30               ` Jason Wang
     [not found]                 ` <CACycT3u8=_D3hCtJR+d5BgeUQMce6S7c_6P3CVfvWfYhCQeXFA@mail.gmail.com>
2021-06-24  3:34                   ` Jason Wang
     [not found]                     ` <CACycT3uCSLUDVpQHdrmuxSuoBDg-4n22t+N-Jm2GoNNp9JYB2w@mail.gmail.com>
2021-06-24  8:13                       ` Jason Wang
     [not found]                         ` <CACycT3tS=10kcUCNGYm=dUZsK+vrHzDvB3FSwAzuJCu3t+QuUQ@mail.gmail.com>
2021-06-25  3:08                           ` Jason Wang
     [not found]                             ` <CACycT3vpMFbc9Fzuo9oksMaA-pVb1dEVTEgjNoft16voryPSWQ@mail.gmail.com>
2021-06-28  4:40                               ` Jason Wang
     [not found]                                 ` <CACycT3u9-id2DxPpuVLtyg4tzrUF9xCAGr7nBm=21HfUJJasaQ@mail.gmail.com>
2021-06-29  3:29                                   ` Jason Wang
     [not found]                                     ` <CACycT3ucVz3D4Tcr1C6uzWyApZy7Xk4o17VH2gvLO3w1Ra+skg@mail.gmail.com>
2021-06-29  4:03                                       ` Jason Wang
2021-06-24 14:46   ` Stefan Hajnoczi
     [not found]     ` <CACycT3vaXQ4dxC9QUzXXJs7og6TVqqVGa8uHZnTStacsYAiFwQ@mail.gmail.com>
2021-06-30  9:51       ` Stefan Hajnoczi
     [not found]         ` <CACycT3t6M5i0gznABm52v=rdmeeLZu8smXAOLg+WsM3WY1fgTw@mail.gmail.com>
2021-07-01  7:55           ` Jason Wang
     [not found]             ` <CACycT3v7pYXAFtijPgWCMZ2WXxjT2Y-DUwS3hN_T7dhfE5o_6g@mail.gmail.com>
2021-07-02  3:25               ` Jason Wang
2021-07-07  8:52   ` Stefan Hajnoczi
