* Re: [PATCH] vhost: correct the related warning message
From: Michael S. Tsirkin @ 2018-12-13 15:21 UTC (permalink / raw)
To: Sergei Shtylyov; +Cc: kvm, netdev, virtualization, piaojun, viro, hch, wangyan
In-Reply-To: <7c1a2b4c-670f-2be8-88e1-7485237029f6@cogentembedded.com>
On Thu, Dec 13, 2018 at 12:05:31PM +0300, Sergei Shtylyov wrote:
> On 13.12.2018 4:10, wangyan wrote:
>
> > Fixes: 'commit d588cf8f618d ("target: Fix se_tpg_tfo->tf_subsys regression + remove tf_subsystem")'
> > 'commit cbbd26b8b1a6 ("[iov_iter] new primitives - copy_from_iter_full() and friends")'
>
> Fixes: d588cf8f618d ("target: Fix se_tpg_tfo->tf_subsys regression + remove
> tf_subsystem")
> Fixes: cbbd26b8b1a6 ("[iov_iter] new primitives - copy_from_iter_full() and
> friends")
>
> But no line wrapping allowed (can't find how to turn it off in my mailer).
Well is the standard really sane if people can't find a way to
follow it without being forced to use a specific mailer?
> > Signed-off-by: Yan Wang <wangyan122@huawei.com>
> > Reviewed-by: Jun Piao <piaojun@huawei.com>
> [...]
>
> MBR, Sergei
^ permalink raw reply
* Re: [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
From: Michael S. Tsirkin @ 2018-12-13 15:27 UTC (permalink / raw)
To: Jason Wang; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181213101022.12475-1-jasowang@redhat.com>
On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
> Hi:
>
> This series tries to access virtqueue metadata through kernel virtual
> address instead of copy_user() friends since they had too much
> overheads like checks, spec barriers or even hardware feature
> toggling.
Userspace accesses through remapping tricks and next time there's a need
for a new barrier we are left to figure it out by ourselves. I don't
like the idea I have to say. As a first step, why don't we switch to
unsafe_put_user/unsafe_get_user etc?
That would be more of an apples to apples comparison, would it not?
> Test shows about 24% improvement on TX PPS. It should benefit other
> cases as well.
>
> Please review
>
> Jason Wang (3):
> vhost: generalize adding used elem
> vhost: fine grain userspace memory accessors
> vhost: access vq metadata through kernel virtual address
>
> drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
> drivers/vhost/vhost.h | 11 ++
> 2 files changed, 266 insertions(+), 26 deletions(-)
>
> --
> 2.17.1
^ permalink raw reply
* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
From: Michael S. Tsirkin @ 2018-12-13 15:44 UTC (permalink / raw)
To: Jason Wang; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181213101022.12475-4-jasowang@redhat.com>
On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote:
> It was noticed that the copy_user() friends that was used to access
> virtqueue metdata tends to be very expensive for dataplane
> implementation like vhost since it involves lots of software check,
> speculation barrier, hardware feature toggling (e.g SMAP). The
> extra cost will be more obvious when transferring small packets.
>
> This patch tries to eliminate those overhead by pin vq metadata pages
> and access them through vmap(). During SET_VRING_ADDR, we will setup
> those mappings and memory accessors are modified to use pointers to
> access the metadata directly.
>
> Note, this was only done when device IOTLB is not enabled. We could
> use similar method to optimize it in the future.
>
> Tests shows about ~24% improvement on TX PPS when using virtio-user +
> vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):
>
> Before: ~5.0Mpps
> After: ~6.1Mpps
>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
> ---
> drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
> drivers/vhost/vhost.h | 11 +++
> 2 files changed, 189 insertions(+)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index bafe39d2e637..1bd24203afb6 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
> vq->indirect = NULL;
> vq->heads = NULL;
> vq->dev = dev;
> + memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
> + memset(&vq->used_ring, 0, sizeof(vq->used_ring));
> + memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
> mutex_init(&vq->mutex);
> vhost_vq_reset(dev, vq);
> if (vq->handle_kick)
> @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
> spin_unlock(&dev->iotlb_lock);
> }
>
> +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
> + size_t size, int write)
> +{
> + struct page **pages;
> + int npages = DIV_ROUND_UP(size, PAGE_SIZE);
> + int npinned;
> + void *vaddr;
> +
> + pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
> + if (!pages)
> + return -ENOMEM;
> +
> + npinned = get_user_pages_fast(uaddr, npages, write, pages);
> + if (npinned != npages)
> + goto err;
> +
As I said I have doubts about the whole approach, but this
implementation in particular isn't a good idea
as it keeps the page around forever.
So no THP, no NUMA rebalancing, userspace-controlled
amount of memory locked up and not accounted for.
Don't get me wrong it's a great patch in an ideal world.
But then in an ideal world no barriers smap etc are necessary at all.
> + vaddr = vmap(pages, npages, VM_MAP, PAGE_KERNEL);
> + if (!vaddr)
> + goto err;
> +
> + map->pages = pages;
> + map->addr = vaddr + (uaddr & (PAGE_SIZE - 1));
> + map->npages = npages;
> +
> + return 0;
> +
> +err:
> + if (npinned > 0)
> + release_pages(pages, npinned);
> + kfree(pages);
> + return -EFAULT;
> +}
> +
> +static void vhost_uninit_vmap(struct vhost_vmap *map)
> +{
> + if (!map->addr)
> + return;
> +
> + vunmap(map->addr);
> + release_pages(map->pages, map->npages);
> + kfree(map->pages);
> +
> + map->addr = NULL;
> + map->pages = NULL;
> + map->npages = 0;
> +}
> +
> +static void vhost_clean_vmaps(struct vhost_virtqueue *vq)
> +{
> + vhost_uninit_vmap(&vq->avail_ring);
> + vhost_uninit_vmap(&vq->desc_ring);
> + vhost_uninit_vmap(&vq->used_ring);
> +}
> +
> +static int vhost_setup_vmaps(struct vhost_virtqueue *vq, unsigned long avail,
> + unsigned long desc, unsigned long used)
> +{
> + size_t event = vhost_has_feature(vq, VIRTIO_RING_F_EVENT_IDX) ? 2 : 0;
> + size_t avail_size, desc_size, used_size;
> + int ret;
> +
> + vhost_clean_vmaps(vq);
> +
> + avail_size = sizeof(*vq->avail) +
> + sizeof(*vq->avail->ring) * vq->num + event;
> + ret = vhost_init_vmap(&vq->avail_ring, avail, avail_size, false);
> + if (ret) {
> + vq_err(vq, "Fail to setup vmap for avail ring!\n");
> + goto err_avail;
> + }
> +
> + desc_size = sizeof(*vq->desc) * vq->num;
> + ret = vhost_init_vmap(&vq->desc_ring, desc, desc_size, false);
> + if (ret) {
> + vq_err(vq, "Fail to setup vmap for desc ring!\n");
> + goto err_desc;
> + }
> +
> + used_size = sizeof(*vq->used) +
> + sizeof(*vq->used->ring) * vq->num + event;
> + ret = vhost_init_vmap(&vq->used_ring, used, used_size, true);
> + if (ret) {
> + vq_err(vq, "Fail to setup vmap for used ring!\n");
> + goto err_used;
> + }
> +
> + return 0;
> +
> +err_used:
> + vhost_uninit_vmap(&vq->used_ring);
> +err_desc:
> + vhost_uninit_vmap(&vq->avail_ring);
> +err_avail:
> + return -EFAULT;
> +}
> +
> void vhost_dev_cleanup(struct vhost_dev *dev)
> {
> int i;
> @@ -626,6 +725,7 @@ void vhost_dev_cleanup(struct vhost_dev *dev)
> if (dev->vqs[i]->call_ctx)
> eventfd_ctx_put(dev->vqs[i]->call_ctx);
> vhost_vq_reset(dev, dev->vqs[i]);
> + vhost_clean_vmaps(dev->vqs[i]);
> }
> vhost_dev_free_iovecs(dev);
> if (dev->log_ctx)
> @@ -873,6 +973,14 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
>
> static inline int vhost_put_avail_event(struct vhost_virtqueue *vq)
> {
> + if (!vq->iotlb) {
> + struct vring_used *used = vq->used_ring.addr;
> +
> + *((__virtio16 *)&used->ring[vq->num]) =
> + cpu_to_vhost16(vq, vq->avail_idx);
> + return 0;
> + }
> +
> return vhost_put_user(vq, cpu_to_vhost16(vq, vq->avail_idx),
> vhost_avail_event(vq));
> }
> @@ -881,6 +989,13 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
> struct vring_used_elem *head, int idx,
> int count)
> {
> + if (!vq->iotlb) {
> + struct vring_used *used = vq->used_ring.addr;
> +
> + memcpy(used->ring + idx, head, count * sizeof(*head));
> + return 0;
> + }
> +
> return vhost_copy_to_user(vq, vq->used->ring + idx, head,
> count * sizeof(*head));
> }
> @@ -888,6 +1003,13 @@ static inline int vhost_put_used(struct vhost_virtqueue *vq,
> static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
>
> {
> + if (!vq->iotlb) {
> + struct vring_used *used = vq->used_ring.addr;
> +
> + used->flags = cpu_to_vhost16(vq, vq->used_flags);
> + return 0;
> + }
> +
> return vhost_put_user(vq, cpu_to_vhost16(vq, vq->used_flags),
> &vq->used->flags);
> }
> @@ -895,6 +1017,13 @@ static inline int vhost_put_used_flags(struct vhost_virtqueue *vq)
> static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
>
> {
> + if (!vq->iotlb) {
> + struct vring_used *used = vq->used_ring.addr;
> +
> + used->idx = cpu_to_vhost16(vq, vq->last_used_idx);
> + return 0;
> + }
> +
> return vhost_put_user(vq, cpu_to_vhost16(vq, vq->last_used_idx),
> &vq->used->idx);
> }
> @@ -926,12 +1055,26 @@ static inline int vhost_put_used_idx(struct vhost_virtqueue *vq)
> static inline int vhost_get_avail_idx(struct vhost_virtqueue *vq,
> __virtio16 *idx)
> {
> + if (!vq->iotlb) {
> + struct vring_avail *avail = vq->avail_ring.addr;
> +
> + *idx = avail->idx;
> + return 0;
> + }
> +
> return vhost_get_avail(vq, *idx, &vq->avail->idx);
> }
>
> static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> __virtio16 *head, int idx)
> {
> + if (!vq->iotlb) {
> + struct vring_avail *avail = vq->avail_ring.addr;
> +
> + *head = avail->ring[idx & (vq->num - 1)];
> + return 0;
> + }
> +
> return vhost_get_avail(vq, *head,
> &vq->avail->ring[idx & (vq->num - 1)]);
> }
> @@ -939,24 +1082,52 @@ static inline int vhost_get_avail_head(struct vhost_virtqueue *vq,
> static inline int vhost_get_avail_flags(struct vhost_virtqueue *vq,
> __virtio16 *flags)
> {
> + if (!vq->iotlb) {
> + struct vring_avail *avail = vq->avail_ring.addr;
> +
> + *flags = avail->flags;
> + return 0;
> + }
> +
> return vhost_get_avail(vq, *flags, &vq->avail->flags);
> }
>
> static inline int vhost_get_used_event(struct vhost_virtqueue *vq,
> __virtio16 *event)
> {
> + if (!vq->iotlb) {
> + struct vring_avail *avail = vq->avail_ring.addr;
> +
> + *event = (__virtio16)avail->ring[vq->num];
> + return 0;
> + }
> +
> return vhost_get_avail(vq, *event, vhost_used_event(vq));
> }
>
> static inline int vhost_get_used_idx(struct vhost_virtqueue *vq,
> __virtio16 *idx)
> {
> + if (!vq->iotlb) {
> + struct vring_used *used = vq->used_ring.addr;
> +
> + *idx = used->idx;
> + return 0;
> + }
> +
> return vhost_get_used(vq, *idx, &vq->used->idx);
> }
>
> static inline int vhost_get_desc(struct vhost_virtqueue *vq,
> struct vring_desc *desc, int idx)
> {
> + if (!vq->iotlb) {
> + struct vring_desc *d = vq->desc_ring.addr;
> +
> + *desc = *(d + idx);
> + return 0;
> + }
> +
> return vhost_copy_from_user(vq, desc, vq->desc + idx, sizeof(*desc));
> }
>
> @@ -1551,6 +1722,13 @@ long vhost_vring_ioctl(struct vhost_dev *d, unsigned int ioctl, void __user *arg
> }
> }
>
> + if (!vq->iotlb && vhost_setup_vmaps(vq, a.avail_user_addr,
> + a.desc_user_addr,
> + a.used_user_addr)) {
> + r = -EINVAL;
> + break;
> + }
> +
> vq->log_used = !!(a.flags & (0x1 << VHOST_VRING_F_LOG));
> vq->desc = (void __user *)(unsigned long)a.desc_user_addr;
> vq->avail = (void __user *)(unsigned long)a.avail_user_addr;
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 466ef7542291..89dc0ad3d055 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -80,6 +80,12 @@ enum vhost_uaddr_type {
> VHOST_NUM_ADDRS = 3,
> };
>
> +struct vhost_vmap {
> + struct page **pages;
> + void *addr;
> + int npages;
> +};
> +
> /* The virtqueue structure describes a queue attached to a device. */
> struct vhost_virtqueue {
> struct vhost_dev *dev;
> @@ -90,6 +96,11 @@ struct vhost_virtqueue {
> struct vring_desc __user *desc;
> struct vring_avail __user *avail;
> struct vring_used __user *used;
> +
> + struct vhost_vmap avail_ring;
> + struct vhost_vmap desc_ring;
> + struct vhost_vmap used_ring;
> +
> const struct vhost_umem_node *meta_iotlb[VHOST_NUM_ADDRS];
> struct file *kick;
> struct eventfd_ctx *call_ctx;
> --
> 2.17.1
^ permalink raw reply
* Re: [PATCH v2 2/5] VSOCK: support fill data to mergeable rx buffer in host
From: Stefan Hajnoczi @ 2018-12-13 15:46 UTC (permalink / raw)
To: jiangyiwen; +Cc: kvm, mst, netdev, virtualization, David Miller
In-Reply-To: <5C120D69.10201@huawei.com>
[-- Attachment #1.1: Type: text/plain, Size: 1792 bytes --]
On Thu, Dec 13, 2018 at 03:42:33PM +0800, jiangyiwen wrote:
> On 2018/12/13 13:59, David Miller wrote:
> > From: jiangyiwen <jiangyiwen@huawei.com>
> > Date: Thu, 13 Dec 2018 11:11:48 +0800
> >
> >> I hope Host can fill fewer bytes into rx virtqueue, so
> >> I keep structure virtio_vsock_mrg_rxbuf_hdr one byte
> >> alignment.
> >
> > The question is if this actully matters.
> >
> > Do you know?
> >
> > If the obejct this is embeeded inside of is at least 2 byte aligned,
> > you are marking it packed for nothing.
> >
> > There are only %100 downsides to using the packed attribute.
> >
> > Simply define your datastructures properly, with fixed sized types,
> > and all padding defined explicitly.
> >
> > .
> >
>
> Hi David,
>
> Thanks a lot, I need to send number buffers from Host to Guest, so I think
> we need to keep the structure size the same between host and guest.
> But after your reminder, I feel my code may exist a serious problem,
> that in mergeable mode, I send the total structure virtio_vsock_pkt
> from Host to Guest, however, this structure size may be different
> under different compilers (Guest and Host are different). Then, Guest
> may parse the wrong packet length.
>
> David, I want to ask if there is such a problem?
>
> In addition, why I send total virtio_vsock_pkt structure from Host to Guest?
> - In order to avoid to allocate virtio_vsock_pkt memory when receiving
> packets, in case of insufficient memory, it may have some advantages, and
> we may keep consistent with old version.
Yes, virtio_vsock_pkt is internal driver state and should not be part of
the host<->guest interface (also for security reasons it's not good to
expose internal state structs across the interface).
Stefan
[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]
[-- Attachment #2: Type: text/plain, Size: 183 bytes --]
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH v2 2/5] VSOCK: support fill data to mergeable rx buffer in host
From: Stefan Hajnoczi @ 2018-12-13 15:49 UTC (permalink / raw)
To: jiangyiwen; +Cc: netdev, virtualization, kvm, Michael S. Tsirkin
In-Reply-To: <5C11CD14.3040809@huawei.com>
[-- Attachment #1.1: Type: text/plain, Size: 1764 bytes --]
On Thu, Dec 13, 2018 at 11:08:04AM +0800, jiangyiwen wrote:
> On 2018/12/12 23:37, Michael S. Tsirkin wrote:
> > On Wed, Dec 12, 2018 at 05:29:31PM +0800, jiangyiwen wrote:
> >> When vhost support VIRTIO_VSOCK_F_MRG_RXBUF feature,
> >> it will merge big packet into rx vq.
> >>
> >> Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
> >
> > I feel this approach jumps into making interface changes for
> > optimizations too quickly. For example, what prevents us
> > from taking a big buffer, prepending each chunk
> > with the header and writing it out without
> > host/guest interface changes?
> >
> > This should allow optimizations such as vhost_add_used_n
> > batching.
> >
> > I realize a header in each packet does have a cost,
> > but it also has advantages such as improved robustness,
> > I'd like to see more of an apples to apples comparison
> > of the performance gain from skipping them.
> >
> >
>
> Hi Michael,
>
> I don't fully understand what you mean, do you want to
> see a performance comparison that before performance and
> only use batching?
>
> In my opinion, guest don't fill big buffer in rx vq because
> the balance performance and guest memory pressure, add
> mergeable feature can improve big packets performance,
> as for small packets, I try to find out the reason, may be
> the fluctuation of test results, or in mergeable mode, when
> Host send a 4k packet to Guest, we should call vhost_get_vq_desc()
> twice in host(hdr + 4k data), and in guest we also should call
> virtqueue_get_buf() twice.
I like the idea of making optimizations in small steps and measuring the
effect of each step. This way we'll know which aspect caused the
differences in benchmark results.
Stefan
[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]
[-- Attachment #2: Type: text/plain, Size: 183 bytes --]
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH v2 3/5] VSOCK: support receive mergeable rx buffer in guest
From: Stefan Hajnoczi @ 2018-12-13 16:20 UTC (permalink / raw)
To: jiangyiwen; +Cc: netdev, virtualization, kvm, Michael S. Tsirkin
In-Reply-To: <5C10D57B.3070701@huawei.com>
[-- Attachment #1.1: Type: text/plain, Size: 1750 bytes --]
On Wed, Dec 12, 2018 at 05:31:39PM +0800, jiangyiwen wrote:
> +static struct virtio_vsock_pkt *receive_mergeable(struct virtqueue *vq,
> + struct virtio_vsock *vsock, unsigned int *total_len)
> +{
> + struct virtio_vsock_pkt *pkt;
> + u16 num_buf;
> + void *buf;
> + unsigned int len;
> + size_t vsock_hlen = sizeof(struct virtio_vsock_pkt);
> +
> + buf = virtqueue_get_buf(vq, &len);
> + if (!buf)
> + return NULL;
> +
> + *total_len = len;
> + vsock->rx_buf_nr--;
> +
> + if (unlikely(len < vsock_hlen)) {
> + put_page(virt_to_head_page(buf));
> + return NULL;
> + }
> +
> + pkt = buf;
> + num_buf = le16_to_cpu(pkt->mrg_rxbuf_hdr.num_buffers);
> + if (!num_buf || num_buf > VIRTIO_VSOCK_MAX_VEC_NUM) {
> + put_page(virt_to_head_page(buf));
> + return NULL;
> + }
> +
> + /* Initialize pkt residual structure */
> + memset(&pkt->work, 0, vsock_hlen - sizeof(struct virtio_vsock_hdr) -
> + sizeof(struct virtio_vsock_mrg_rxbuf_hdr));
struct virtio_vsock_pkt is an internal driver state structure. It must
be able to change without breaking the host<->guest device interface.
Exposing struct virtio_vsock_pkt across the device interface makes this
impossible.
One way to solve this is by placing a header length field at the
beginning of the mergeable buffer. That way the device knows at which
offset the payload should be placed and the driver can continue to use
the allocation scheme you've chosen (where a single page contains the
virtio_vsock_pkt and payload).
Another issue is that virtio requires exclusive read OR write
descriptors. You cannot have a descriptor that is both read and write.
So there's not a huge advantage to combining the driver state struct and
payload, except maybe the driver can save a memory allocation.
[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]
[-- Attachment #2: Type: text/plain, Size: 183 bytes --]
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH v2 0/5] VSOCK: support mergeable rx buffer in vhost-vsock
From: Stefan Hajnoczi @ 2018-12-13 16:34 UTC (permalink / raw)
To: jiangyiwen; +Cc: netdev, virtualization, kvm, Michael S. Tsirkin
In-Reply-To: <5C10D41E.9050002@huawei.com>
[-- Attachment #1.1: Type: text/plain, Size: 1297 bytes --]
On Wed, Dec 12, 2018 at 05:25:50PM +0800, jiangyiwen wrote:
> Now vsock only support send/receive small packet, it can't achieve
> high performance. As previous discussed with Jason Wang, I revisit the
> idea of vhost-net about mergeable rx buffer and implement the mergeable
> rx buffer in vhost-vsock, it can allow big packet to be scattered in
> into different buffers and improve performance obviously.
Sorry, I've been a bad maintainer. I was focussed on other projects and
my email backlog is huge.
I like the idea of trying out optimizations on virtio-vsock, seeing if
code can be shared with virtio-net, and maybe later switching to a
virtio-net transport for vsock (if it turns out enough code can be
shared).
Another optimization that could be interesting:
Userspace processes reading from a socket sleep in
vsock_stream_recvmsg(). I wonder if we can bypass struct
virtio_vsock_pkt and copying the payload into pkt->buf in this case.
(This doesn't improve poll(2)/select(2) though!)
Imagine a userspace process waiting for data on a socket. When the
virtqueue becomes ready, we can read in struct virtio_vsock_hdr and find
the socket for that connection. Then we could copy the payload directly
to userspace instead of creating a virtio_vsock_pkt and copying to
pkt->buf first.
[-- Attachment #1.2: signature.asc --]
[-- Type: application/pgp-signature, Size: 455 bytes --]
[-- Attachment #2: Type: text/plain, Size: 183 bytes --]
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH] drm/virtio: switch to generic fbdev emulation
From: Dave Airlie @ 2018-12-13 19:18 UTC (permalink / raw)
To: Gerd Hoffmann; +Cc: Dave Airlie, LKML, dri-devel, open list:VIRTIO CORE, NET...
In-Reply-To: <20181213134915.24722-1-kraxel@redhat.com>
On Thu, 13 Dec 2018 at 23:49, Gerd Hoffmann <kraxel@redhat.com> wrote:
>
> Signed-off-by: Gerd Hoffmann <kraxel@redhat.com>
Seems correct to me,
Reviewed-by: Dave Airlie <airlied@redhat.com>
^ permalink raw reply
* Re: [PATCH net-next 1/3] vhost: generalize adding used elem
From: Michael S. Tsirkin @ 2018-12-13 19:41 UTC (permalink / raw)
To: Jason Wang; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181213101022.12475-2-jasowang@redhat.com>
On Thu, Dec 13, 2018 at 06:10:20PM +0800, Jason Wang wrote:
> Use one generic vhost_copy_to_user() instead of two dedicated
> accessor. This will simplify the conversion to fine grain accessors.
>
> Signed-off-by: Jason Wang <jasowang@redhat.com>
The reason we did it like this is because it was faster.
Want to try benchmarking before we change it?
> ---
> drivers/vhost/vhost.c | 11 +----------
> 1 file changed, 1 insertion(+), 10 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 3a5f81a66d34..1c54ec1b82f8 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2164,16 +2164,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
>
> start = vq->last_used_idx & (vq->num - 1);
> used = vq->used->ring + start;
> - if (count == 1) {
> - if (vhost_put_user(vq, heads[0].id, &used->id)) {
> - vq_err(vq, "Failed to write used id");
> - return -EFAULT;
> - }
> - if (vhost_put_user(vq, heads[0].len, &used->len)) {
> - vq_err(vq, "Failed to write used len");
> - return -EFAULT;
> - }
> - } else if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
> + if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
> vq_err(vq, "Failed to write used");
> return -EFAULT;
> }
> --
> 2.17.1
^ permalink raw reply
* Re: [PATCH] vhost: return EINVAL if iovecs size does not match the message size
From: Michael S. Tsirkin @ 2018-12-13 19:55 UTC (permalink / raw)
To: Pavel Tikhomirov
Cc: kvm, netdev, linux-kernel, virtualization, Konstantin Khorenko
In-Reply-To: <20181213145350.5454-1-ptikhomirov@virtuozzo.com>
On Thu, Dec 13, 2018 at 05:53:50PM +0300, Pavel Tikhomirov wrote:
> We've failed to copy and process vhost_iotlb_msg so let userspace at
> least know about it. For instance before these patch the code below runs
> without any error:
>
> int main()
> {
> struct vhost_msg msg;
> struct iovec iov;
> int fd;
>
> fd = open("/dev/vhost-net", O_RDWR);
> if (fd == -1) {
> perror("open");
> return 1;
> }
>
> iov.iov_base = &msg;
> iov.iov_len = sizeof(msg)-4;
>
> if (writev(fd, &iov,1) == -1) {
> perror("writev");
> return 1;
> }
>
> return 0;
> }
>
> Signed-off-by: Pavel Tikhomirov <ptikhomirov@virtuozzo.com>
Thanks for the patch!
> ---
> drivers/vhost/vhost.c | 8 ++++++--
> 1 file changed, 6 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 3a5f81a66d34..03014224ef13 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1024,8 +1024,10 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
> int type, ret;
>
> ret = copy_from_iter(&type, sizeof(type), from);
> - if (ret != sizeof(type))
> + if (ret != sizeof(type)) {
> + ret = -EINVAL;
> goto done;
> + }
>
> switch (type) {
> case VHOST_IOTLB_MSG:
should this be EFAULT rather?
> @@ -1044,8 +1046,10 @@ ssize_t vhost_chr_write_iter(struct vhost_dev *dev,
>
> iov_iter_advance(from, offset);
> ret = copy_from_iter(&msg, sizeof(msg), from);
> - if (ret != sizeof(msg))
> + if (ret != sizeof(msg)) {
> + ret = -EINVAL;
> goto done;
> + }
> if (vhost_process_iotlb_msg(dev, &msg)) {
> ret = -EFAULT;
> goto done;
This too?
> --
> 2.17.1
^ permalink raw reply
* Re: [PATCH] vhost: correct the related warning message
From: Sergei Shtylyov @ 2018-12-13 20:06 UTC (permalink / raw)
To: Michael S. Tsirkin
Cc: kvm, netdev, virtualization, piaojun, viro, hch, wangyan
In-Reply-To: <20181213102024-mutt-send-email-mst@kernel.org>
On 12/13/2018 06:21 PM, Michael S. Tsirkin wrote:
>>> Fixes: 'commit d588cf8f618d ("target: Fix se_tpg_tfo->tf_subsys regression + remove tf_subsystem")'
>>> 'commit cbbd26b8b1a6 ("[iov_iter] new primitives - copy_from_iter_full() and friends")'
>>
>> Fixes: d588cf8f618d ("target: Fix se_tpg_tfo->tf_subsys regression + remove
>> tf_subsystem")
>> Fixes: cbbd26b8b1a6 ("[iov_iter] new primitives - copy_from_iter_full() and
>> friends")
>>
>> But no line wrapping allowed (can't find how to turn it off in my mailer).
>
> Well is the standard really sane if people can't find a way to
> follow it without being forced to use a specific mailer?
Hum, my mail looked correct, although I wasn't able to get the editor to NOT
wrap my lines. Go figure... :-)
>>> Signed-off-by: Yan Wang <wangyan122@huawei.com>
>>> Reviewed-by: Jun Piao <piaojun@huawei.com>
>> [...]
MBR, Sergei
^ permalink raw reply
* Re: [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
From: Michael S. Tsirkin @ 2018-12-13 20:12 UTC (permalink / raw)
To: Jason Wang; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181213101022.12475-1-jasowang@redhat.com>
On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
> Hi:
>
> This series tries to access virtqueue metadata through kernel virtual
> address instead of copy_user() friends since they had too much
> overheads like checks, spec barriers or even hardware feature
> toggling.
>
> Test shows about 24% improvement on TX PPS. It should benefit other
> cases as well.
>
> Please review
I think the idea of speeding up userspace access is a good one.
However I think that moving all checks to start is way too aggressive.
Instead, let's batch things up but let's not keep them
around forever.
Here are some ideas:
1. Disable preemption, process a small number of small packets
directly in an atomic context. This should cut latency
down significantly, the tricky part is to only do it
on a light load and disable this
for the streaming case otherwise it's unfair.
This might fail, if it does just bounce things out to
a thread.
2. Switch to unsafe_put_user/unsafe_get_user,
and batch up multiple accesses.
3. Allow adding a fixup point manually,
such that multiple independent get_user accesses
can get a single fixup (will allow better compiler
optimizations).
> Jason Wang (3):
> vhost: generalize adding used elem
> vhost: fine grain userspace memory accessors
> vhost: access vq metadata through kernel virtual address
>
> drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
> drivers/vhost/vhost.h | 11 ++
> 2 files changed, 266 insertions(+), 26 deletions(-)
>
> --
> 2.17.1
^ permalink raw reply
* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
From: Konrad Rzeszutek Wilk @ 2018-12-13 21:18 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181213102713-mutt-send-email-mst@kernel.org>
.giant snip..
> > + npinned = get_user_pages_fast(uaddr, npages, write, pages);
> > + if (npinned != npages)
> > + goto err;
> > +
>
> As I said I have doubts about the whole approach, but this
> implementation in particular isn't a good idea
> as it keeps the page around forever.
> So no THP, no NUMA rebalancing, userspace-controlled
> amount of memory locked up and not accounted for.
>
> Don't get me wrong it's a great patch in an ideal world.
> But then in an ideal world no barriers smap etc are necessary at all.
So .. suggestions on how this could be accepted? As in other ways
where we still get vmap and the issues you mentioned are not troubling you?
Thanks!
^ permalink raw reply
* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
From: Michael S. Tsirkin @ 2018-12-13 21:58 UTC (permalink / raw)
To: Konrad Rzeszutek Wilk; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181213211840.GC18692@char.us.oracle.com>
On Thu, Dec 13, 2018 at 04:18:40PM -0500, Konrad Rzeszutek Wilk wrote:
> .giant snip..
> > > + npinned = get_user_pages_fast(uaddr, npages, write, pages);
> > > + if (npinned != npages)
> > > + goto err;
> > > +
> >
> > As I said I have doubts about the whole approach, but this
> > implementation in particular isn't a good idea
> > as it keeps the page around forever.
> > So no THP, no NUMA rebalancing, userspace-controlled
> > amount of memory locked up and not accounted for.
> >
> > Don't get me wrong it's a great patch in an ideal world.
> > But then in an ideal world no barriers smap etc are necessary at all.
>
> So .. suggestions on how this could be accepted? As in other ways
> where we still get vmap and the issues you mentioned are not troubling you?
>
> Thanks!
I'd suggest leave vmap alone and find ways to speed up accesses
that can fault.
--
MST
^ permalink raw reply
* kernel vhost demands an interrupt from guest when the ring is full in order to enable guest to submit new packets to the queue
From: Steven Luong (sluong) via Virtualization @ 2018-12-13 23:24 UTC (permalink / raw)
To: virtualization@lists.linux-foundation.org
Cc: Damjan Marion (damarion), Jerome Tollet (jtollet)
[-- Attachment #1.1: Type: text/plain, Size: 2057 bytes --]
Folks,
We came across a memory race condition between VPP vhost driver and the kernel vhost. VPP is running a tap interface over vhost backend. In this case, VPP is acting as the vhost driver mode and the kernel vhost is acting as the vhost device mode.
In the kernel vhost’s TX traffic direction which is VPP’s RX traffic direction, kernel vhost is the producer and VPP is the consumer. Kernel vhost places traffic in kernel vhost’s TX vring. VPP removes traffic in VPP’s RX vring. It is inevitable that the vring may become full under heavy traffic and the kernel vhost may have to stop and wait for the guest (VPP) to empty the vring and to refill the vring with descriptors. When that case happens, kernel vhost clears the bit in the vring’s used flag to request an interrupt/notification. Due to shared memory race condition, VPP may miss the clearing of the vring’s used flag from kernel vhost and didn’t send kernel vhost an interrupt after VPP empties and refills the vring with new descriptors. Unfortunately, this missed notification causes the kernel vhost to be stuck because once the kernel vhost is waiting for an interrupt/notification from the guest, only an interrupt/notification from the guest can resume/re-enable the guest from submitting new packets to the vring. This design seems vulnerable. Should the kernel vhost totally depend on the notification from the guest? Quoting the text from
http://docs.oasis-open.org/virtio/virtio/v1.0/virtio-v1.0.html
/* The device uses this in used->flags to advise the driver: don’t kick me
* when you add a buffer. It’s unreliable, so it’s simply an
* optimization. */
#define VIRTQ_USED_F_NO_NOTIFY 1
I interpret that the notification is simply an optimization, not a reliable notification mechanism. So the kernel vhost should not bet the farm on it.
We encounter the same race condition in kernel vhost’s RX traffic direction which causes the kernel vhost to be stuck due to missed interrupt/notification although it is less frequent.
Steven
[-- Attachment #1.2: Type: text/html, Size: 4987 bytes --]
[-- Attachment #2: Type: text/plain, Size: 183 bytes --]
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH v2 5/5] VSOCK: batch sending rx buffer to increase bandwidth
From: jiangyiwen @ 2018-12-14 1:05 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: netdev, virtualization, kvm, Michael S. Tsirkin
In-Reply-To: <20181213151752.GK23318@stefanha-x1.localdomain>
On 2018/12/13 23:17, Stefan Hajnoczi wrote:
> On Wed, Dec 12, 2018 at 05:35:27PM +0800, jiangyiwen wrote:
>> Batch sending rx buffer can improve total bandwidth.
>>
>> Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
>> ---
>
> Please send patches with git-send-email --thread --no-chain-reply-to so
> that your patch series email thread looks like this:
>
> * [PATCH 00/NN] My feature
> +-- [PATCH 01/NN] First patch
> +-- [PATCH 02/NN] Second patch
> .
> .
> .
> +-- [PATCH NN/NN] Last patch
>
> This way it's much easier to view the entire series. At the moment you
> are sending each patch as an independent email and there is no
> relationship between the emails.
>
> Thanks,
> Stefan
>
Thanks Stefan, I have not send a series of patches before, and
I will use email thread form in the later version.
Thanks again,
Yiwen.
^ permalink raw reply
* Re: [PATCH net V2 4/4] vhost: log dirty page correctly
From: Jason Wang @ 2018-12-14 2:43 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: Jintack Lim, netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181213092930-mutt-send-email-mst@kernel.org>
On 2018/12/13 下午10:31, Michael S. Tsirkin wrote:
>> Just to make sure I understand this. It looks to me we should:
>>
>> - allow passing GIOVA->GPA through UAPI
>>
>> - cache GIOVA->GPA somewhere but still use GIOVA->HVA in device IOTLB for
>> performance
>>
>> Is this what you suggest?
>>
>> Thanks
> Not really. We already have GPA->HVA, so I suggested a flag to pass
> GIOVA->GPA in the IOTLB.
>
> This has advantages for security since a single table needs
> then to be validated to ensure guest does not corrupt
> QEMU memory.
>
I wonder how much we can gain through this. Currently, qemu IOMMU gives
GIOVA->GPA mapping, and qemu vhost code will translate GPA to HVA then
pass GIOVA->HVA to vhost. It looks no difference to me.
Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
From: Jason Wang @ 2018-12-14 3:42 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181213102315-mutt-send-email-mst@kernel.org>
On 2018/12/13 下午11:27, Michael S. Tsirkin wrote:
> On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
>> Hi:
>>
>> This series tries to access virtqueue metadata through kernel virtual
>> address instead of copy_user() friends since they had too much
>> overheads like checks, spec barriers or even hardware feature
>> toggling.
> Userspace accesses through remapping tricks and next time there's a need
> for a new barrier we are left to figure it out by ourselves.
I don't get here, do you mean spec barriers? It's completely unnecessary
for vhost which is kernel thread. And even if you're right, vhost is not
the only place, there's lots of vmap() based accessing in kernel. Think
in another direction, this means we won't suffer form unnecessary
barriers for kthread like vhost in the future, we will manually pick the
one we really need (but it should have little possibility).
Please notice we only access metdata through remapping not the data
itself. This idea has been used for high speed userspace backend for
years, e.g packet socket or recent AF_XDP. The only difference is the
page was remap to from kernel to userspace.
> I don't
> like the idea I have to say. As a first step, why don't we switch to
> unsafe_put_user/unsafe_get_user etc?
Several reasons:
- They only have x86 variant, it won't have any difference for the rest
of architecture.
- unsafe_put_user/unsafe_get_user is not sufficient for accessing
structures (e.g accessing descriptor) or arrays (batching).
- Unless we can batch at least the accessing of two places in three of
avail, used and descriptor in one run. There will be no difference. E.g
we can batch updating used ring, but it won't make any difference in
this case.
> That would be more of an apples to apples comparison, would it not?
Apples to apples comparison only help if we are the No.1. But the fact
is we are not. If we want to compete with e.g dpdk or AF_XDP, vmap() is
the fastest method AFAIK.
Thanks
>
>
>> Test shows about 24% improvement on TX PPS. It should benefit other
>> cases as well.
>>
>> Please review
>>
>> Jason Wang (3):
>> vhost: generalize adding used elem
>> vhost: fine grain userspace memory accessors
>> vhost: access vq metadata through kernel virtual address
>>
>> drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
>> drivers/vhost/vhost.h | 11 ++
>> 2 files changed, 266 insertions(+), 26 deletions(-)
>>
>> --
>> 2.17.1
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH net-next 3/3] vhost: access vq metadata through kernel virtual address
From: Jason Wang @ 2018-12-14 3:57 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181213102713-mutt-send-email-mst@kernel.org>
On 2018/12/13 下午11:44, Michael S. Tsirkin wrote:
> On Thu, Dec 13, 2018 at 06:10:22PM +0800, Jason Wang wrote:
>> It was noticed that the copy_user() friends that was used to access
>> virtqueue metdata tends to be very expensive for dataplane
>> implementation like vhost since it involves lots of software check,
>> speculation barrier, hardware feature toggling (e.g SMAP). The
>> extra cost will be more obvious when transferring small packets.
>>
>> This patch tries to eliminate those overhead by pin vq metadata pages
>> and access them through vmap(). During SET_VRING_ADDR, we will setup
>> those mappings and memory accessors are modified to use pointers to
>> access the metadata directly.
>>
>> Note, this was only done when device IOTLB is not enabled. We could
>> use similar method to optimize it in the future.
>>
>> Tests shows about ~24% improvement on TX PPS when using virtio-user +
>> vhost_net + xdp1 on TAP (CONFIG_HARDENED_USERCOPY is not enabled):
>>
>> Before: ~5.0Mpps
>> After: ~6.1Mpps
>>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
>> ---
>> drivers/vhost/vhost.c | 178 ++++++++++++++++++++++++++++++++++++++++++
>> drivers/vhost/vhost.h | 11 +++
>> 2 files changed, 189 insertions(+)
>>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index bafe39d2e637..1bd24203afb6 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -443,6 +443,9 @@ void vhost_dev_init(struct vhost_dev *dev,
>> vq->indirect = NULL;
>> vq->heads = NULL;
>> vq->dev = dev;
>> + memset(&vq->avail_ring, 0, sizeof(vq->avail_ring));
>> + memset(&vq->used_ring, 0, sizeof(vq->used_ring));
>> + memset(&vq->desc_ring, 0, sizeof(vq->desc_ring));
>> mutex_init(&vq->mutex);
>> vhost_vq_reset(dev, vq);
>> if (vq->handle_kick)
>> @@ -614,6 +617,102 @@ static void vhost_clear_msg(struct vhost_dev *dev)
>> spin_unlock(&dev->iotlb_lock);
>> }
>>
>> +static int vhost_init_vmap(struct vhost_vmap *map, unsigned long uaddr,
>> + size_t size, int write)
>> +{
>> + struct page **pages;
>> + int npages = DIV_ROUND_UP(size, PAGE_SIZE);
>> + int npinned;
>> + void *vaddr;
>> +
>> + pages = kmalloc_array(npages, sizeof(struct page *), GFP_KERNEL);
>> + if (!pages)
>> + return -ENOMEM;
>> +
>> + npinned = get_user_pages_fast(uaddr, npages, write, pages);
>> + if (npinned != npages)
>> + goto err;
>> +
> As I said I have doubts about the whole approach, but this
> implementation in particular isn't a good idea
> as it keeps the page around forever.
> So no THP, no NUMA rebalancing,
This is the price of all GUP users not only vhost itself. What's more
important, the goal is not to be left too much behind for other backends
like DPDK or AF_XDP (all of which are using GUP).
> userspace-controlled
> amount of memory locked up and not accounted for.
It's pretty easy to add this since the slow path was still kept. If we
exceeds the limitation, we can switch back to slow path.
>
> Don't get me wrong it's a great patch in an ideal world.
> But then in an ideal world no barriers smap etc are necessary at all.
Again, this is only for metadata accessing not the data which has been
used for years for real use cases.
For SMAP, it makes senses for the address that kernel can not forcast.
But it's not the case for the vhost metadata since we know the address
will be accessed very frequently. For speculation barrier, it helps
nothing for the data path of vhost which is a kthread. Packet or AF_XDP
benefit from accessing metadata directly, we should do it as well.
Thanks
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH net-next 1/3] vhost: generalize adding used elem
From: Jason Wang @ 2018-12-14 4:00 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181213144027-mutt-send-email-mst@kernel.org>
On 2018/12/14 上午3:41, Michael S. Tsirkin wrote:
> On Thu, Dec 13, 2018 at 06:10:20PM +0800, Jason Wang wrote:
>> Use one generic vhost_copy_to_user() instead of two dedicated
>> accessor. This will simplify the conversion to fine grain accessors.
>>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
>
> The reason we did it like this is because it was faster.
> Want to try benchmarking before we change it?
Faster is a good side effect of such conversion. Otherwise we need two
new fine grain memory accessor.
But, anyway, I can test its impact.
Thanks
>
>> ---
>> drivers/vhost/vhost.c | 11 +----------
>> 1 file changed, 1 insertion(+), 10 deletions(-)
>>
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index 3a5f81a66d34..1c54ec1b82f8 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -2164,16 +2164,7 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
>>
>> start = vq->last_used_idx & (vq->num - 1);
>> used = vq->used->ring + start;
>> - if (count == 1) {
>> - if (vhost_put_user(vq, heads[0].id, &used->id)) {
>> - vq_err(vq, "Failed to write used id");
>> - return -EFAULT;
>> - }
>> - if (vhost_put_user(vq, heads[0].len, &used->len)) {
>> - vq_err(vq, "Failed to write used len");
>> - return -EFAULT;
>> - }
>> - } else if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
>> + if (vhost_copy_to_user(vq, used, heads, count * sizeof *used)) {
>> vq_err(vq, "Failed to write used");
>> return -EFAULT;
>> }
>> --
>> 2.17.1
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH net-next 0/3] vhost: accelerate metadata access through vmap()
From: Jason Wang @ 2018-12-14 4:29 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181213144116-mutt-send-email-mst@kernel.org>
On 2018/12/14 上午4:12, Michael S. Tsirkin wrote:
> On Thu, Dec 13, 2018 at 06:10:19PM +0800, Jason Wang wrote:
>> Hi:
>>
>> This series tries to access virtqueue metadata through kernel virtual
>> address instead of copy_user() friends since they had too much
>> overheads like checks, spec barriers or even hardware feature
>> toggling.
>>
>> Test shows about 24% improvement on TX PPS. It should benefit other
>> cases as well.
>>
>> Please review
> I think the idea of speeding up userspace access is a good one.
> However I think that moving all checks to start is way too aggressive.
So did packet and AF_XDP. Anyway, sharing address space and access them
directly is the fastest way. Performance is the major consideration for
people to choose backend. Compare to userspace implementation, vhost
does not have security advantages at any level. If vhost is still slow,
people will start to develop backends based on e.g AF_XDP.
> Instead, let's batch things up but let's not keep them
> around forever.
> Here are some ideas:
>
>
> 1. Disable preemption, process a small number of small packets
> directly in an atomic context. This should cut latency
> down significantly, the tricky part is to only do it
> on a light load and disable this
> for the streaming case otherwise it's unfair.
> This might fail, if it does just bounce things out to
> a thread.
I'm not sure what context you meant here. Is this for TX path of TUN?
But a fundamental difference is my series is targeted for extreme heavy
load not light one, 100% cpu for vhost is expected.
>
> 2. Switch to unsafe_put_user/unsafe_get_user,
> and batch up multiple accesses.
As I said, unless we can batch accessing of two difference places of
three of avail, descriptor and used. It won't help for batching the
accessing of a single place like used. I'm even not sure this can be
done consider the case of packed virtqueue, we have a single descriptor
ring. Batching through unsafe helpers may not help in this case since
it's equivalent to safe ones . And This requires non trivial refactoring
of vhost. And such refactoring itself make give us noticeable impact
(e.g it may lead regression).
>
> 3. Allow adding a fixup point manually,
> such that multiple independent get_user accesses
> can get a single fixup (will allow better compiler
> optimizations).
>
So for metadata access, I don't see how you suggest here can help in the
case of heavy workload.
For data access, this may help but I've played to batch the data copy to
reduce SMAP/spec barriers in vhost-net but I don't see performance
improvement.
Thanks
>
>
>
>> Jason Wang (3):
>> vhost: generalize adding used elem
>> vhost: fine grain userspace memory accessors
>> vhost: access vq metadata through kernel virtual address
>>
>> drivers/vhost/vhost.c | 281 ++++++++++++++++++++++++++++++++++++++----
>> drivers/vhost/vhost.h | 11 ++
>> 2 files changed, 266 insertions(+), 26 deletions(-)
>>
>> --
>> 2.17.1
_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization
^ permalink raw reply
* Re: [PATCH v2 2/5] VSOCK: support fill data to mergeable rx buffer in host
From: jiangyiwen @ 2018-12-14 7:41 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: netdev, kvm, Stefan Hajnoczi, virtualization
In-Reply-To: <20181213094615-mutt-send-email-mst@kernel.org>
On 2018/12/13 22:48, Michael S. Tsirkin wrote:
> On Thu, Dec 13, 2018 at 11:08:04AM +0800, jiangyiwen wrote:
>> On 2018/12/12 23:37, Michael S. Tsirkin wrote:
>>> On Wed, Dec 12, 2018 at 05:29:31PM +0800, jiangyiwen wrote:
>>>> When vhost support VIRTIO_VSOCK_F_MRG_RXBUF feature,
>>>> it will merge big packet into rx vq.
>>>>
>>>> Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
>>>
>>> I feel this approach jumps into making interface changes for
>>> optimizations too quickly. For example, what prevents us
>>> from taking a big buffer, prepending each chunk
>>> with the header and writing it out without
>>> host/guest interface changes?
>>>
>>> This should allow optimizations such as vhost_add_used_n
>>> batching.
>>>
>>> I realize a header in each packet does have a cost,
>>> but it also has advantages such as improved robustness,
>>> I'd like to see more of an apples to apples comparison
>>> of the performance gain from skipping them.
>>>
>>>
>>
>> Hi Michael,
>>
>> I don't fully understand what you mean, do you want to
>> see a performance comparison that before performance and
>> only use batching?
>>
>> In my opinion, guest don't fill big buffer in rx vq because
>> the balance performance and guest memory pressure, add
>> mergeable feature can improve big packets performance,
>> as for small packets, I try to find out the reason, may be
>> the fluctuation of test results, or in mergeable mode, when
>> Host send a 4k packet to Guest, we should call vhost_get_vq_desc()
>> twice in host(hdr + 4k data), and in guest we also should call
>> virtqueue_get_buf() twice.
>>
>> Thanks,
>> Yiwen.
>
> What I mean is that at least some of the gain here is because
> larger skbs are passed up the stack.
>
Yes, the main gain is from larger skbs.
> You do not need larger packets to build larger skbs though.
> Just check that the addresses match and you can combine
> multiple fragments in a single skb.
>
>
I understand what you mean, if use batching send that the
performance also can be improved, I can test the real
performance result only use batching.
Thanks,
Yiwen.
>
>>>> ---
>>>> drivers/vhost/vsock.c | 111 ++++++++++++++++++++++++++++++--------
>>>> include/linux/virtio_vsock.h | 1 +
>>>> include/uapi/linux/virtio_vsock.h | 5 ++
>>>> 3 files changed, 94 insertions(+), 23 deletions(-)
>>>>
>>>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>>>> index 34bc3ab..dc52b0f 100644
>>>> --- a/drivers/vhost/vsock.c
>>>> +++ b/drivers/vhost/vsock.c
>>>> @@ -22,7 +22,8 @@
>>>> #define VHOST_VSOCK_DEFAULT_HOST_CID 2
>>>>
>>>> enum {
>>>> - VHOST_VSOCK_FEATURES = VHOST_FEATURES,
>>>> + VHOST_VSOCK_FEATURES = VHOST_FEATURES |
>>>> + (1ULL << VIRTIO_VSOCK_F_MRG_RXBUF),
>>>> };
>>>>
>>>> /* Used to track all the vhost_vsock instances on the system. */
>>>> @@ -80,6 +81,69 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
>>>> return vsock;
>>>> }
>>>>
>>>> +/* This segment of codes are copied from drivers/vhost/net.c */
>>>> +static int get_rx_bufs(struct vhost_virtqueue *vq,
>>>> + struct vring_used_elem *heads, int datalen,
>>>> + unsigned *iovcount, unsigned int quota)
>>>> +{
>>>> + unsigned int out, in;
>>>> + int seg = 0;
>>>> + int headcount = 0;
>>>> + unsigned d;
>>>> + int ret;
>>>> + /*
>>>> + * len is always initialized before use since we are always called with
>>>> + * datalen > 0.
>>>> + */
>>>> + u32 uninitialized_var(len);
>>>> +
>>>> + while (datalen > 0 && headcount < quota) {
>>>> + if (unlikely(seg >= UIO_MAXIOV)) {
>>>> + ret = -ENOBUFS;
>>>> + goto err;
>>>> + }
>>>> +
>>>> + ret = vhost_get_vq_desc(vq, vq->iov + seg,
>>>> + ARRAY_SIZE(vq->iov) - seg, &out,
>>>> + &in, NULL, NULL);
>>>> + if (unlikely(ret < 0))
>>>> + goto err;
>>>> +
>>>> + d = ret;
>>>> + if (d == vq->num) {
>>>> + ret = 0;
>>>> + goto err;
>>>> + }
>>>> +
>>>> + if (unlikely(out || in <= 0)) {
>>>> + vq_err(vq, "unexpected descriptor format for RX: "
>>>> + "out %d, in %d\n", out, in);
>>>> + ret = -EINVAL;
>>>> + goto err;
>>>> + }
>>>> +
>>>> + heads[headcount].id = cpu_to_vhost32(vq, d);
>>>> + len = iov_length(vq->iov + seg, in);
>>>> + heads[headcount].len = cpu_to_vhost32(vq, len);
>>>> + datalen -= len;
>>>> + ++headcount;
>>>> + seg += in;
>>>> + }
>>>> +
>>>> + heads[headcount - 1].len = cpu_to_vhost32(vq, len + datalen);
>>>> + *iovcount = seg;
>>>> +
>>>> + /* Detect overrun */
>>>> + if (unlikely(datalen > 0)) {
>>>> + ret = UIO_MAXIOV + 1;
>>>> + goto err;
>>>> + }
>>>> + return headcount;
>>>> +err:
>>>> + vhost_discard_vq_desc(vq, headcount);
>>>> + return ret;
>>>> +}
>>>> +
>>>> static void
>>>> vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
>>>> struct vhost_virtqueue *vq)
>>>> @@ -87,22 +151,34 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
>>>> struct vhost_virtqueue *tx_vq = &vsock->vqs[VSOCK_VQ_TX];
>>>> bool added = false;
>>>> bool restart_tx = false;
>>>> + int mergeable;
>>>> + size_t vsock_hlen;
>>>>
>>>> mutex_lock(&vq->mutex);
>>>>
>>>> if (!vq->private_data)
>>>> goto out;
>>>>
>>>> + mergeable = vhost_has_feature(vq, VIRTIO_VSOCK_F_MRG_RXBUF);
>>>> + /*
>>>> + * Guest fill page for rx vq in mergeable case, so it will not
>>>> + * allocate pkt structure, we should reserve size of pkt in advance.
>>>> + */
>>>> + if (likely(mergeable))
>>>> + vsock_hlen = sizeof(struct virtio_vsock_pkt);
>>>> + else
>>>> + vsock_hlen = sizeof(struct virtio_vsock_hdr);
>>>> +
>>>> /* Avoid further vmexits, we're already processing the virtqueue */
>>>> vhost_disable_notify(&vsock->dev, vq);
>>>>
>>>> for (;;) {
>>>> struct virtio_vsock_pkt *pkt;
>>>> struct iov_iter iov_iter;
>>>> - unsigned out, in;
>>>> + unsigned out = 0, in = 0;
>>>> size_t nbytes;
>>>> size_t len;
>>>> - int head;
>>>> + s16 headcount;
>>>>
>>>> spin_lock_bh(&vsock->send_pkt_list_lock);
>>>> if (list_empty(&vsock->send_pkt_list)) {
>>>> @@ -116,16 +192,9 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
>>>> list_del_init(&pkt->list);
>>>> spin_unlock_bh(&vsock->send_pkt_list_lock);
>>>>
>>>> - head = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
>>>> - &out, &in, NULL, NULL);
>>>> - if (head < 0) {
>>>> - spin_lock_bh(&vsock->send_pkt_list_lock);
>>>> - list_add(&pkt->list, &vsock->send_pkt_list);
>>>> - spin_unlock_bh(&vsock->send_pkt_list_lock);
>>>> - break;
>>>> - }
>>>> -
>>>> - if (head == vq->num) {
>>>> + headcount = get_rx_bufs(vq, vq->heads, vsock_hlen + pkt->len,
>>>> + &in, likely(mergeable) ? UIO_MAXIOV : 1);
>>>> + if (headcount <= 0) {
>>>> spin_lock_bh(&vsock->send_pkt_list_lock);
>>>> list_add(&pkt->list, &vsock->send_pkt_list);
>>>> spin_unlock_bh(&vsock->send_pkt_list_lock);
>>>> @@ -133,24 +202,20 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
>>>> /* We cannot finish yet if more buffers snuck in while
>>>> * re-enabling notify.
>>>> */
>>>> - if (unlikely(vhost_enable_notify(&vsock->dev, vq))) {
>>>> + if (!headcount && unlikely(vhost_enable_notify(&vsock->dev, vq))) {
>>>> vhost_disable_notify(&vsock->dev, vq);
>>>> continue;
>>>> }
>>>> break;
>>>> }
>>>>
>>>> - if (out) {
>>>> - virtio_transport_free_pkt(pkt);
>>>> - vq_err(vq, "Expected 0 output buffers, got %u\n", out);
>>>> - break;
>>>> - }
>>>> -
>>>> len = iov_length(&vq->iov[out], in);
>>>> iov_iter_init(&iov_iter, READ, &vq->iov[out], in, len);
>>>>
>>>> - nbytes = copy_to_iter(&pkt->hdr, sizeof(pkt->hdr), &iov_iter);
>>>> - if (nbytes != sizeof(pkt->hdr)) {
>>>> + if (likely(mergeable))
>>>> + pkt->mrg_rxbuf_hdr.num_buffers = cpu_to_le16(headcount);
>>>> + nbytes = copy_to_iter(&pkt->hdr, vsock_hlen, &iov_iter);
>>>> + if (nbytes != vsock_hlen) {
>>>> virtio_transport_free_pkt(pkt);
>>>> vq_err(vq, "Faulted on copying pkt hdr\n");
>>>> break;
>>>> @@ -163,7 +228,7 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
>>>> break;
>>>> }
>>>>
>>>> - vhost_add_used(vq, head, sizeof(pkt->hdr) + pkt->len);
>>>> + vhost_add_used_n(vq, vq->heads, headcount);
>>>> added = true;
>>>>
>>>> if (pkt->reply) {
>>>> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
>>>> index bf84418..da9e1fe 100644
>>>> --- a/include/linux/virtio_vsock.h
>>>> +++ b/include/linux/virtio_vsock.h
>>>> @@ -50,6 +50,7 @@ struct virtio_vsock_sock {
>>>>
>>>> struct virtio_vsock_pkt {
>>>> struct virtio_vsock_hdr hdr;
>>>> + struct virtio_vsock_mrg_rxbuf_hdr mrg_rxbuf_hdr;
>>>> struct work_struct work;
>>>> struct list_head list;
>>>> /* socket refcnt not held, only use for cancellation */
>>>> diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
>>>> index 1d57ed3..2292f30 100644
>>>> --- a/include/uapi/linux/virtio_vsock.h
>>>> +++ b/include/uapi/linux/virtio_vsock.h
>>>> @@ -63,6 +63,11 @@ struct virtio_vsock_hdr {
>>>> __le32 fwd_cnt;
>>>> } __attribute__((packed));
>>>>
>>>> +/* It add mergeable rx buffers feature */
>>>> +struct virtio_vsock_mrg_rxbuf_hdr {
>>>> + __le16 num_buffers; /* number of mergeable rx buffers */
>>>> +} __attribute__((packed));
>>>> +
>>>> enum virtio_vsock_type {
>>>> VIRTIO_VSOCK_TYPE_STREAM = 1,
>>>> };
>>>> --
>>>> 1.8.3.1
>>>>
>>>
>>> .
>>>
>>
>
> .
>
^ permalink raw reply
* Re: [PATCH v2 2/5] VSOCK: support fill data to mergeable rx buffer in host
From: jiangyiwen @ 2018-12-14 7:47 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: kvm, netdev, virtualization, stefanha, David Miller
In-Reply-To: <20181213094935-mutt-send-email-mst@kernel.org>
On 2018/12/13 22:50, Michael S. Tsirkin wrote:
> On Thu, Dec 13, 2018 at 11:11:48AM +0800, jiangyiwen wrote:
>> On 2018/12/13 3:09, David Miller wrote:
>>> From: jiangyiwen <jiangyiwen@huawei.com>
>>> Date: Wed, 12 Dec 2018 17:29:31 +0800
>>>
>>>> diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
>>>> index 1d57ed3..2292f30 100644
>>>> --- a/include/uapi/linux/virtio_vsock.h
>>>> +++ b/include/uapi/linux/virtio_vsock.h
>>>> @@ -63,6 +63,11 @@ struct virtio_vsock_hdr {
>>>> __le32 fwd_cnt;
>>>> } __attribute__((packed));
>>>>
>>>> +/* It add mergeable rx buffers feature */
>>>> +struct virtio_vsock_mrg_rxbuf_hdr {
>>>> + __le16 num_buffers; /* number of mergeable rx buffers */
>>>> +} __attribute__((packed));
>>>> +
>>>
>>> I know the rest of this file uses 'packed' but this attribute should
>>> only be used if absolutely necessary as it incurs a
>>> non-trivial performance penalty for some architectures.
>>>
>>> .
>>>
>>
>> Hi David,
>>
>> I hope Host can fill fewer bytes into rx virtqueue, so
>> I keep structure virtio_vsock_mrg_rxbuf_hdr one byte
>> alignment.
>>
>> Thanks,
>> Yiwen.
>
> It doesn't work like this now though, does it?
> Buffers are preallocated and they are always aligned.
> So I do not see the point.
>
Hi Michael,
Now my patch has a serious problem, I use virtio_vsock_pkt as
the transport header from host to guest, it will cause
guest parse the wrong packet length. Because this structure
size may be different under different compilers
(guest and host are different). I will solve the problem
in later version.
Thanks,
Yiwen.
^ permalink raw reply
* Re: [PATCH v2 2/5] VSOCK: support fill data to mergeable rx buffer in host
From: jiangyiwen @ 2018-12-14 7:49 UTC (permalink / raw)
To: Stefan Hajnoczi; +Cc: netdev, virtualization, kvm, Michael S. Tsirkin
In-Reply-To: <20181213154923.GN23318@stefanha-x1.localdomain>
On 2018/12/13 23:49, Stefan Hajnoczi wrote:
> On Thu, Dec 13, 2018 at 11:08:04AM +0800, jiangyiwen wrote:
>> On 2018/12/12 23:37, Michael S. Tsirkin wrote:
>>> On Wed, Dec 12, 2018 at 05:29:31PM +0800, jiangyiwen wrote:
>>>> When vhost support VIRTIO_VSOCK_F_MRG_RXBUF feature,
>>>> it will merge big packet into rx vq.
>>>>
>>>> Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
>>>
>>> I feel this approach jumps into making interface changes for
>>> optimizations too quickly. For example, what prevents us
>>> from taking a big buffer, prepending each chunk
>>> with the header and writing it out without
>>> host/guest interface changes?
>>>
>>> This should allow optimizations such as vhost_add_used_n
>>> batching.
>>>
>>> I realize a header in each packet does have a cost,
>>> but it also has advantages such as improved robustness,
>>> I'd like to see more of an apples to apples comparison
>>> of the performance gain from skipping them.
>>>
>>>
>>
>> Hi Michael,
>>
>> I don't fully understand what you mean, do you want to
>> see a performance comparison that before performance and
>> only use batching?
>>
>> In my opinion, guest don't fill big buffer in rx vq because
>> the balance performance and guest memory pressure, add
>> mergeable feature can improve big packets performance,
>> as for small packets, I try to find out the reason, may be
>> the fluctuation of test results, or in mergeable mode, when
>> Host send a 4k packet to Guest, we should call vhost_get_vq_desc()
>> twice in host(hdr + 4k data), and in guest we also should call
>> virtqueue_get_buf() twice.
>
> I like the idea of making optimizations in small steps and measuring the
> effect of each step. This way we'll know which aspect caused the
> differences in benchmark results.
>
> Stefan
>
Yes, now I also focus on other project, but I will use some
extra time to measure it.
Thanks,
Yiwen.
^ permalink raw reply
* Re: [PATCH v2 3/5] VSOCK: support receive mergeable rx buffer in guest
From: jiangyiwen @ 2018-12-14 8:11 UTC (permalink / raw)
To: Michael S. Tsirkin; +Cc: netdev, kvm, Stefan Hajnoczi, virtualization
In-Reply-To: <20181213092619-mutt-send-email-mst@kernel.org>
On 2018/12/13 22:29, Michael S. Tsirkin wrote:
> On Thu, Dec 13, 2018 at 10:38:09AM +0800, jiangyiwen wrote:
>> Hi Michael,
>>
>> On 2018/12/12 23:31, Michael S. Tsirkin wrote:
>>> On Wed, Dec 12, 2018 at 05:31:39PM +0800, jiangyiwen wrote:
>>>> Guest receive mergeable rx buffer, it can merge
>>>> scatter rx buffer into a big buffer and then copy
>>>> to user space.
>>>>
>>>> In addition, it also use iovec to replace buf in struct
>>>> virtio_vsock_pkt, keep tx and rx consistency. The only
>>>> difference is now tx still uses a segment of continuous
>>>> physical memory to implement.
>>>>
>>>> Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
>>>> ---
>>>> drivers/vhost/vsock.c | 31 +++++++---
>>>> include/linux/virtio_vsock.h | 6 +-
>>>> net/vmw_vsock/virtio_transport.c | 105 ++++++++++++++++++++++++++++----
>>>> net/vmw_vsock/virtio_transport_common.c | 59 ++++++++++++++----
>>>> 4 files changed, 166 insertions(+), 35 deletions(-)
>>>
>>>
>>> This was supposed to be a guest patch, why is vhost changed here?
>>>
>>
>> In mergeable rx buff cases, it need to scatter big packets into several
>> buffers, so I add kvec variable in struct virtio_vsock_pkt, at the same
>> time, in order to keep tx and rx consistency, I use kvec to replace
>> variable buf, because vhost use the variable pkt->buf, so this patch
>> caused vhost is changed.
>
> You'd want to split these patches imho.
>
Ok, I should do it.
>>>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>>>> index dc52b0f..c7ab0dd 100644
>>>> --- a/drivers/vhost/vsock.c
>>>> +++ b/drivers/vhost/vsock.c
>>>> @@ -179,6 +179,8 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
>>>> size_t nbytes;
>>>> size_t len;
>>>> s16 headcount;
>>>> + size_t remain_len;
>>>> + int i;
>>>>
>>>> spin_lock_bh(&vsock->send_pkt_list_lock);
>>>> if (list_empty(&vsock->send_pkt_list)) {
>>>> @@ -221,11 +223,19 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
>>>> break;
>>>> }
>>>>
>>>> - nbytes = copy_to_iter(pkt->buf, pkt->len, &iov_iter);
>>>> - if (nbytes != pkt->len) {
>>>> - virtio_transport_free_pkt(pkt);
>>>> - vq_err(vq, "Faulted on copying pkt buf\n");
>>>> - break;
>>>> + remain_len = pkt->len;
>>>> + for (i = 0; i < pkt->nr_vecs; i++) {
>>>> + int tmp_len;
>>>> +
>>>> + tmp_len = min(remain_len, pkt->vec[i].iov_len);
>>>> + nbytes = copy_to_iter(pkt->vec[i].iov_base, tmp_len, &iov_iter);
>>>> + if (nbytes != tmp_len) {
>>>> + virtio_transport_free_pkt(pkt);
>>>> + vq_err(vq, "Faulted on copying pkt buf\n");
>>>> + break;
>>>> + }
>>>> +
>>>> + remain_len -= tmp_len;
>>>> }
>>>>
>>>> vhost_add_used_n(vq, vq->heads, headcount);
>>>> @@ -341,6 +351,7 @@ static void vhost_transport_send_pkt_work(struct vhost_work *work)
>>>> struct iov_iter iov_iter;
>>>> size_t nbytes;
>>>> size_t len;
>>>> + void *buf;
>>>>
>>>> if (in != 0) {
>>>> vq_err(vq, "Expected 0 input buffers, got %u\n", in);
>>>> @@ -375,13 +386,17 @@ static void vhost_transport_send_pkt_work(struct vhost_work *work)
>>>> return NULL;
>>>> }
>>>>
>>>> - pkt->buf = kmalloc(pkt->len, GFP_KERNEL);
>>>> - if (!pkt->buf) {
>>>> + buf = kmalloc(pkt->len, GFP_KERNEL);
>>>> + if (!buf) {
>>>> kfree(pkt);
>>>> return NULL;
>>>> }
>>>>
>>>> - nbytes = copy_from_iter(pkt->buf, pkt->len, &iov_iter);
>>>> + pkt->vec[0].iov_base = buf;
>>>> + pkt->vec[0].iov_len = pkt->len;
>>>> + pkt->nr_vecs = 1;
>>>> +
>>>> + nbytes = copy_from_iter(buf, pkt->len, &iov_iter);
>>>> if (nbytes != pkt->len) {
>>>> vq_err(vq, "Expected %u byte payload, got %zu bytes\n",
>>>> pkt->len, nbytes);
>>>> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
>>>> index da9e1fe..734eeed 100644
>>>> --- a/include/linux/virtio_vsock.h
>>>> +++ b/include/linux/virtio_vsock.h
>>>> @@ -13,6 +13,8 @@
>>>> #define VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE (1024 * 4)
>>>> #define VIRTIO_VSOCK_MAX_BUF_SIZE 0xFFFFFFFFUL
>>>> #define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE (1024 * 64)
>>>> +/* virtio_vsock_pkt + max_pkt_len(default MAX_PKT_BUF_SIZE) */
>>>> +#define VIRTIO_VSOCK_MAX_VEC_NUM ((VIRTIO_VSOCK_MAX_PKT_BUF_SIZE / PAGE_SIZE) + 1)
>>>>
>>>> /* Virtio-vsock feature */
>>>> #define VIRTIO_VSOCK_F_MRG_RXBUF 0 /* Host can merge receive buffers. */
>>>> @@ -55,10 +57,12 @@ struct virtio_vsock_pkt {
>>>> struct list_head list;
>>>> /* socket refcnt not held, only use for cancellation */
>>>> struct vsock_sock *vsk;
>>>> - void *buf;
>>>> + struct kvec vec[VIRTIO_VSOCK_MAX_VEC_NUM];
>>>> + int nr_vecs;
>>>> u32 len;
>>>> u32 off;
>>>> bool reply;
>>>> + bool mergeable;
>>>> };
>>>>
>>>> struct virtio_vsock_pkt_info {
>>>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>>>> index c4a465c..148b58a 100644
>>>> --- a/net/vmw_vsock/virtio_transport.c
>>>> +++ b/net/vmw_vsock/virtio_transport.c
>>>> @@ -155,8 +155,10 @@ static int virtio_transport_send_pkt_loopback(struct virtio_vsock *vsock,
>>>>
>>>> sg_init_one(&hdr, &pkt->hdr, sizeof(pkt->hdr));
>>>> sgs[out_sg++] = &hdr;
>>>> - if (pkt->buf) {
>>>> - sg_init_one(&buf, pkt->buf, pkt->len);
>>>> + if (pkt->len) {
>>>> + /* Currently only support a segment of memory in tx */
>>>> + BUG_ON(pkt->vec[0].iov_len != pkt->len);
>>>> + sg_init_one(&buf, pkt->vec[0].iov_base, pkt->vec[0].iov_len);
>>>> sgs[out_sg++] = &buf;
>>>> }
>>>>
>>>> @@ -304,23 +306,28 @@ static int fill_old_rx_buff(struct virtqueue *vq)
>>>> struct virtio_vsock_pkt *pkt;
>>>> struct scatterlist hdr, buf, *sgs[2];
>>>> int ret;
>>>> + void *pkt_buf;
>>>>
>>>> pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
>>>> if (!pkt)
>>>> return -ENOMEM;
>>>>
>>>> - pkt->buf = kmalloc(buf_len, GFP_KERNEL);
>>>> - if (!pkt->buf) {
>>>> + pkt_buf = kmalloc(buf_len, GFP_KERNEL);
>>>> + if (!pkt_buf) {
>>>> virtio_transport_free_pkt(pkt);
>>>> return -ENOMEM;
>>>> }
>>>>
>>>> + pkt->vec[0].iov_base = pkt_buf;
>>>> + pkt->vec[0].iov_len = buf_len;
>>>> + pkt->nr_vecs = 1;
>>>> +
>>>> pkt->len = buf_len;
>>>>
>>>> sg_init_one(&hdr, &pkt->hdr, sizeof(pkt->hdr));
>>>> sgs[0] = &hdr;
>>>>
>>>> - sg_init_one(&buf, pkt->buf, buf_len);
>>>> + sg_init_one(&buf, pkt->vec[0].iov_base, buf_len);
>>>> sgs[1] = &buf;
>>>> ret = virtqueue_add_sgs(vq, sgs, 0, 2, pkt, GFP_KERNEL);
>>>> if (ret)
>>>> @@ -388,11 +395,78 @@ static bool virtio_transport_more_replies(struct virtio_vsock *vsock)
>>>> return val < virtqueue_get_vring_size(vq);
>>>> }
>>>>
>>>> +static struct virtio_vsock_pkt *receive_mergeable(struct virtqueue *vq,
>>>> + struct virtio_vsock *vsock, unsigned int *total_len)
>>>> +{
>>>> + struct virtio_vsock_pkt *pkt;
>>>> + u16 num_buf;
>>>> + void *buf;
>>>> + unsigned int len;
>>>> + size_t vsock_hlen = sizeof(struct virtio_vsock_pkt);
>>>> +
>>>> + buf = virtqueue_get_buf(vq, &len);
>>>> + if (!buf)
>>>> + return NULL;
>>>> +
>>>> + *total_len = len;
>>>> + vsock->rx_buf_nr--;
>>>> +
>>>> + if (unlikely(len < vsock_hlen)) {
>>>> + put_page(virt_to_head_page(buf));
>>>> + return NULL;
>>>> + }
>>>> +
>>>> + pkt = buf;
>>>> + num_buf = le16_to_cpu(pkt->mrg_rxbuf_hdr.num_buffers);
>>>> + if (!num_buf || num_buf > VIRTIO_VSOCK_MAX_VEC_NUM) {
>>>> + put_page(virt_to_head_page(buf));
>>>> + return NULL;
>>>> + }
>>>
>>> So everything just stops going, and host and user don't even
>>> know what the reason is. And not only that - the next
>>> packet will be corrupted because we skipped the first one.
>>>
>>>
>>
>> I understand this case will not encountered unless the code has
>> *BUG*, like Host send some problematic packages (shorten/longer than
>> expected). In this case, I think we should ignore/drop these packets.
>
> If there's a specific packet length expected, e.g. in this case vector
> size, it needs to be negotiated between host and guest in some way.
>
Actually, I still keep the negotiated in the virtio_transport_rx_work(),
in the receive_mergeable() I will calculate total len, and compared with
the pkt->hdr.len(host packed).
>>>
>>>> +
>>>> + /* Initialize pkt residual structure */
>>>> + memset(&pkt->work, 0, vsock_hlen - sizeof(struct virtio_vsock_hdr) -
>>>> + sizeof(struct virtio_vsock_mrg_rxbuf_hdr));
>>>> +
>>>> + pkt->mergeable = true;
>>>> + pkt->len = le32_to_cpu(pkt->hdr.len);
>>>> + if (!pkt->len)
>>>> + return pkt;
>>>> +
>>>> + len -= vsock_hlen;
>>>> + if (len) {
>>>> + pkt->vec[pkt->nr_vecs].iov_base = buf + vsock_hlen;
>>>> + pkt->vec[pkt->nr_vecs].iov_len = len;
>>>> + /* Shared page with pkt, so get page in advance */
>>>> + get_page(virt_to_head_page(buf));
>>>> + pkt->nr_vecs++;
>>>> + }
>>>> +
>>>> + while (--num_buf) {
>>>> + buf = virtqueue_get_buf(vq, &len);
>>>> + if (!buf)
>>>> + goto err;
>>>> +
>>>> + *total_len += len;
>>>> + vsock->rx_buf_nr--;
>>>> +
>>>> + pkt->vec[pkt->nr_vecs].iov_base = buf;
>>>> + pkt->vec[pkt->nr_vecs].iov_len = len;
>>>> + pkt->nr_vecs++;
>>>> + }
>>>> +
>>>> + return pkt;
>>>> +err:
>>>> + virtio_transport_free_pkt(pkt);
>>>> + return NULL;
>>>> +}
>>>> +
>>>> static void virtio_transport_rx_work(struct work_struct *work)
>>>> {
>>>> struct virtio_vsock *vsock =
>>>> container_of(work, struct virtio_vsock, rx_work);
>>>> struct virtqueue *vq;
>>>> + size_t vsock_hlen = vsock->mergeable ? sizeof(struct virtio_vsock_pkt) :
>>>> + sizeof(struct virtio_vsock_hdr);
>>>>
>>>> vq = vsock->vqs[VSOCK_VQ_RX];
>>>>
>>>> @@ -412,21 +486,26 @@ static void virtio_transport_rx_work(struct work_struct *work)
>>>> goto out;
>>>> }
>>>>
>>>> - pkt = virtqueue_get_buf(vq, &len);
>>>> - if (!pkt) {
>>>> - break;
>>>> - }
>>>> + if (likely(vsock->mergeable)) {
>>>> + pkt = receive_mergeable(vq, vsock, &len);
>>>> + if (!pkt)
>>>> + break;
>>>> + } else {
>>>> + pkt = virtqueue_get_buf(vq, &len);
>>>> + if (!pkt)
>>>> + break;
>>>>
>>>
>>> So looking at it, this seems to be the main source of the gain.
>>> But why does this require host/guest changes?
>>>
>>>
>>> The way I see it:
>>> - get a buffer and create an skb
>>> - get the next one, check header matches, if yes
>>> tack it on the skb as a fragment. If not then
>>> don't, deliver previous one and queue the new one.
>>>
>>>
>>
>> Vhost change reason I explain as above, and I hope use kvec
>> to instead buf, after all buf only can express a contiguous
>> physical memory.
>>
>> Thanks,
>> Yiwen.
>
>
> I got the reason but I am not yet convinced it's a good one.
> You don't necessarily need host to skip headers
> in all but the first chunk. Guest can do this just as well.
>
>
I understand what you mean, right, only waste some memory, but it
can gain more convenient, like I don't need to skip headers and
host don't need to transport total pkt. I will fix it in the later
version.
Thanks,
Yiwen.
>
>>>
>>>> - vsock->rx_buf_nr--;
>>>> + vsock->rx_buf_nr--;
>>>> + }
>>>>
>>>> /* Drop short/long packets */
>>>> - if (unlikely(len < sizeof(pkt->hdr) ||
>>>> - len > sizeof(pkt->hdr) + pkt->len)) {
>>>> + if (unlikely(len < vsock_hlen ||
>>>> + len > vsock_hlen + pkt->len)) {
>>>> virtio_transport_free_pkt(pkt);
>>>> continue;
>>>> }
>>>>
>>>> - pkt->len = len - sizeof(pkt->hdr);
>>>> + pkt->len = len - vsock_hlen;
>>>> virtio_transport_deliver_tap_pkt(pkt);
>>>> virtio_transport_recv_pkt(pkt);
>>>> }
>>>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>>>> index 3ae3a33..123a8b6 100644
>>>> --- a/net/vmw_vsock/virtio_transport_common.c
>>>> +++ b/net/vmw_vsock/virtio_transport_common.c
>>>> @@ -44,6 +44,7 @@ static const struct virtio_transport *virtio_transport_get_ops(void)
>>>> {
>>>> struct virtio_vsock_pkt *pkt;
>>>> int err;
>>>> + void *buf = NULL;
>>>>
>>>> pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
>>>> if (!pkt)
>>>> @@ -62,12 +63,16 @@ static const struct virtio_transport *virtio_transport_get_ops(void)
>>>> pkt->vsk = info->vsk;
>>>>
>>>> if (info->msg && len > 0) {
>>>> - pkt->buf = kmalloc(len, GFP_KERNEL);
>>>> - if (!pkt->buf)
>>>> + buf = kmalloc(len, GFP_KERNEL);
>>>> + if (!buf)
>>>> goto out_pkt;
>>>> - err = memcpy_from_msg(pkt->buf, info->msg, len);
>>>> + err = memcpy_from_msg(buf, info->msg, len);
>>>> if (err)
>>>> goto out;
>>>> +
>>>> + pkt->vec[0].iov_base = buf;
>>>> + pkt->vec[0].iov_len = len;
>>>> + pkt->nr_vecs = 1;
>>>> }
>>>>
>>>> trace_virtio_transport_alloc_pkt(src_cid, src_port,
>>>> @@ -80,7 +85,7 @@ static const struct virtio_transport *virtio_transport_get_ops(void)
>>>> return pkt;
>>>>
>>>> out:
>>>> - kfree(pkt->buf);
>>>> + kfree(buf);
>>>> out_pkt:
>>>> kfree(pkt);
>>>> return NULL;
>>>> @@ -92,6 +97,7 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
>>>> struct virtio_vsock_pkt *pkt = opaque;
>>>> struct af_vsockmon_hdr *hdr;
>>>> struct sk_buff *skb;
>>>> + int i;
>>>>
>>>> skb = alloc_skb(sizeof(*hdr) + sizeof(pkt->hdr) + pkt->len,
>>>> GFP_ATOMIC);
>>>> @@ -134,7 +140,8 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
>>>> skb_put_data(skb, &pkt->hdr, sizeof(pkt->hdr));
>>>>
>>>> if (pkt->len) {
>>>> - skb_put_data(skb, pkt->buf, pkt->len);
>>>> + for (i = 0; i < pkt->nr_vecs; i++)
>>>> + skb_put_data(skb, pkt->vec[i].iov_base, pkt->vec[i].iov_len);
>>>> }
>>>>
>>>> return skb;
>>>> @@ -260,6 +267,9 @@ static int virtio_transport_send_credit_update(struct vsock_sock *vsk,
>>>>
>>>> spin_lock_bh(&vvs->rx_lock);
>>>> while (total < len && !list_empty(&vvs->rx_queue)) {
>>>> + size_t copy_bytes, last_vec_total = 0, vec_off;
>>>> + int i;
>>>> +
>>>> pkt = list_first_entry(&vvs->rx_queue,
>>>> struct virtio_vsock_pkt, list);
>>>>
>>>> @@ -272,14 +282,28 @@ static int virtio_transport_send_credit_update(struct vsock_sock *vsk,
>>>> */
>>>> spin_unlock_bh(&vvs->rx_lock);
>>>>
>>>> - err = memcpy_to_msg(msg, pkt->buf + pkt->off, bytes);
>>>> - if (err)
>>>> - goto out;
>>>> + for (i = 0; i < pkt->nr_vecs; i++) {
>>>> + if (pkt->off > last_vec_total + pkt->vec[i].iov_len) {
>>>> + last_vec_total += pkt->vec[i].iov_len;
>>>> + continue;
>>>> + }
>>>> +
>>>> + vec_off = pkt->off - last_vec_total;
>>>> + copy_bytes = min(pkt->vec[i].iov_len - vec_off, bytes);
>>>> + err = memcpy_to_msg(msg, pkt->vec[i].iov_base + vec_off,
>>>> + copy_bytes);
>>>> + if (err)
>>>> + goto out;
>>>> +
>>>> + bytes -= copy_bytes;
>>>> + pkt->off += copy_bytes;
>>>> + total += copy_bytes;
>>>> + last_vec_total += pkt->vec[i].iov_len;
>>>> + if (!bytes)
>>>> + break;
>>>> + }
>>>>
>>>> spin_lock_bh(&vvs->rx_lock);
>>>> -
>>>> - total += bytes;
>>>> - pkt->off += bytes;
>>>> if (pkt->off == pkt->len) {
>>>> virtio_transport_dec_rx_pkt(vvs, pkt);
>>>> list_del(&pkt->list);
>>>> @@ -1050,8 +1074,17 @@ void virtio_transport_recv_pkt(struct virtio_vsock_pkt *pkt)
>>>>
>>>> void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
>>>> {
>>>> - kfree(pkt->buf);
>>>> - kfree(pkt);
>>>> + int i;
>>>> +
>>>> + if (pkt->mergeable) {
>>>> + for (i = 0; i < pkt->nr_vecs; i++)
>>>> + put_page(virt_to_head_page(pkt->vec[i].iov_base));
>>>> + put_page(virt_to_head_page((void *)pkt));
>>>> + } else {
>>>> + for (i = 0; i < pkt->nr_vecs; i++)
>>>> + kfree(pkt->vec[i].iov_base);
>>>> + kfree(pkt);
>>>> + }
>>>> }
>>>> EXPORT_SYMBOL_GPL(virtio_transport_free_pkt);
>>>>
>>>> --
>>>> 1.8.3.1
>>>>
>>>
>>> .
>>>
>>
>
> .
>
^ permalink raw reply
page: next (older) | prev (newer) | latest
- recent:[subjects (threaded)|topics (new)|topics (active)]
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox