Linux virtualization list
* Re: [PATCH v2 2/5] VSOCK: support fill data to mergeable rx buffer in host
From: jiangyiwen @ 2018-12-13  3:08 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, kvm, Stefan Hajnoczi, virtualization
In-Reply-To: <20181212103138-mutt-send-email-mst@kernel.org>

On 2018/12/12 23:37, Michael S. Tsirkin wrote:
> On Wed, Dec 12, 2018 at 05:29:31PM +0800, jiangyiwen wrote:
>> When vhost support VIRTIO_VSOCK_F_MRG_RXBUF feature,
>> it will merge big packet into rx vq.
>>
>> Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
> 
> I feel this approach jumps into making interface changes for
> optimizations too quickly. For example, what prevents us
> from taking a big buffer, prepending each chunk
> with the header and writing it out without
> host/guest interface changes?
> 
> This should allow optimizations such as vhost_add_used_n
> batching.
> 
> I realize a header in each packet does have a cost,
> but it also has advantages such as improved robustness,
> I'd like to see more of an apples to apples comparison
> of the performance gain from skipping them.
> 
> 

Hi Michael,

I don't fully understand what you mean. Do you want to see a
performance comparison between the current code and a version
that only adds batching?

In my opinion, the guest doesn't fill big buffers into the rx vq
because it has to balance performance against guest memory
pressure. Adding the mergeable feature can improve performance for
big packets. As for small packets, I am still trying to find out the
reason: it may be fluctuation in the test results, or it may be that
in mergeable mode, when the host sends a 4k packet to the guest, we
have to call vhost_get_vq_desc() twice in the host (hdr + 4k data),
and the guest likewise has to call virtqueue_get_buf() twice.
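
To make the "hdr + 4k data needs two buffers" point concrete, here is a
minimal userspace sketch of the arithmetic. It is an illustration only, not
code from the patch: the name sketch_bufs_needed, the 4096-byte buffer size
and the header length passed by the caller are all assumptions.

```c
#include <stddef.h>

/* Assumed rx buffer size: the guest fills page-sized buffers (4 KiB here). */
#define SKETCH_RX_BUF_SIZE 4096u

/*
 * Number of rx buffers one packet consumes when the header is written in
 * front of the payload and the data spills over into the following buffers.
 * hdr_len is whatever header size the host reserves (hypothetical here).
 */
static unsigned int sketch_bufs_needed(size_t hdr_len, size_t payload_len)
{
	size_t total = hdr_len + payload_len;

	/* Round up to whole buffers. */
	return (unsigned int)((total + SKETCH_RX_BUF_SIZE - 1) / SKETCH_RX_BUF_SIZE);
}
```

With any non-zero header, a payload of exactly 4096 bytes no longer fits in
one 4 KiB buffer, so each such packet costs two descriptor fetches on the
host and two buffer fetches in the guest.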

Thanks,
Yiwen.

>> ---
>>  drivers/vhost/vsock.c             | 111 ++++++++++++++++++++++++++++++--------
>>  include/linux/virtio_vsock.h      |   1 +
>>  include/uapi/linux/virtio_vsock.h |   5 ++
>>  3 files changed, 94 insertions(+), 23 deletions(-)
>>
>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>> index 34bc3ab..dc52b0f 100644
>> --- a/drivers/vhost/vsock.c
>> +++ b/drivers/vhost/vsock.c
>> @@ -22,7 +22,8 @@
>>  #define VHOST_VSOCK_DEFAULT_HOST_CID	2
>>
>>  enum {
>> -	VHOST_VSOCK_FEATURES = VHOST_FEATURES,
>> +	VHOST_VSOCK_FEATURES = VHOST_FEATURES |
>> +			(1ULL << VIRTIO_VSOCK_F_MRG_RXBUF),
>>  };
>>
>>  /* Used to track all the vhost_vsock instances on the system. */
>> @@ -80,6 +81,69 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
>>  	return vsock;
>>  }
>>
>> +/* This segment of codes are copied from drivers/vhost/net.c */
>> +static int get_rx_bufs(struct vhost_virtqueue *vq,
>> +		struct vring_used_elem *heads, int datalen,
>> +		unsigned *iovcount, unsigned int quota)
>> +{
>> +	unsigned int out, in;
>> +	int seg = 0;
>> +	int headcount = 0;
>> +	unsigned d;
>> +	int ret;
>> +	/*
>> +	 * len is always initialized before use since we are always called with
>> +	 * datalen > 0.
>> +	 */
>> +	u32 uninitialized_var(len);
>> +
>> +	while (datalen > 0 && headcount < quota) {
>> +		if (unlikely(seg >= UIO_MAXIOV)) {
>> +			ret = -ENOBUFS;
>> +			goto err;
>> +		}
>> +
>> +		ret = vhost_get_vq_desc(vq, vq->iov + seg,
>> +				ARRAY_SIZE(vq->iov) - seg, &out,
>> +				&in, NULL, NULL);
>> +		if (unlikely(ret < 0))
>> +			goto err;
>> +
>> +		d = ret;
>> +		if (d == vq->num) {
>> +			ret = 0;
>> +			goto err;
>> +		}
>> +
>> +		if (unlikely(out || in <= 0)) {
>> +			vq_err(vq, "unexpected descriptor format for RX: "
>> +					"out %d, in %d\n", out, in);
>> +			ret = -EINVAL;
>> +			goto err;
>> +		}
>> +
>> +		heads[headcount].id = cpu_to_vhost32(vq, d);
>> +		len = iov_length(vq->iov + seg, in);
>> +		heads[headcount].len = cpu_to_vhost32(vq, len);
>> +		datalen -= len;
>> +		++headcount;
>> +		seg += in;
>> +	}
>> +
>> +	heads[headcount - 1].len = cpu_to_vhost32(vq, len + datalen);
>> +	*iovcount = seg;
>> +
>> +	/* Detect overrun */
>> +	if (unlikely(datalen > 0)) {
>> +		ret = UIO_MAXIOV + 1;
>> +		goto err;
>> +	}
>> +	return headcount;
>> +err:
>> +	vhost_discard_vq_desc(vq, headcount);
>> +	return ret;
>> +}
>> +
>>  static void
>>  vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
>>  			    struct vhost_virtqueue *vq)
>> @@ -87,22 +151,34 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
>>  	struct vhost_virtqueue *tx_vq = &vsock->vqs[VSOCK_VQ_TX];
>>  	bool added = false;
>>  	bool restart_tx = false;
>> +	int mergeable;
>> +	size_t vsock_hlen;
>>
>>  	mutex_lock(&vq->mutex);
>>
>>  	if (!vq->private_data)
>>  		goto out;
>>
>> +	mergeable = vhost_has_feature(vq, VIRTIO_VSOCK_F_MRG_RXBUF);
>> +	/*
>> +	 * Guest fill page for rx vq in mergeable case, so it will not
>> +	 * allocate pkt structure, we should reserve size of pkt in advance.
>> +	 */
>> +	if (likely(mergeable))
>> +		vsock_hlen = sizeof(struct virtio_vsock_pkt);
>> +	else
>> +		vsock_hlen = sizeof(struct virtio_vsock_hdr);
>> +
>>  	/* Avoid further vmexits, we're already processing the virtqueue */
>>  	vhost_disable_notify(&vsock->dev, vq);
>>
>>  	for (;;) {
>>  		struct virtio_vsock_pkt *pkt;
>>  		struct iov_iter iov_iter;
>> -		unsigned out, in;
>> +		unsigned out = 0, in = 0;
>>  		size_t nbytes;
>>  		size_t len;
>> -		int head;
>> +		s16 headcount;
>>
>>  		spin_lock_bh(&vsock->send_pkt_list_lock);
>>  		if (list_empty(&vsock->send_pkt_list)) {
>> @@ -116,16 +192,9 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
>>  		list_del_init(&pkt->list);
>>  		spin_unlock_bh(&vsock->send_pkt_list_lock);
>>
>> -		head = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
>> -					 &out, &in, NULL, NULL);
>> -		if (head < 0) {
>> -			spin_lock_bh(&vsock->send_pkt_list_lock);
>> -			list_add(&pkt->list, &vsock->send_pkt_list);
>> -			spin_unlock_bh(&vsock->send_pkt_list_lock);
>> -			break;
>> -		}
>> -
>> -		if (head == vq->num) {
>> +		headcount = get_rx_bufs(vq, vq->heads, vsock_hlen + pkt->len,
>> +				&in, likely(mergeable) ? UIO_MAXIOV : 1);
>> +		if (headcount <= 0) {
>>  			spin_lock_bh(&vsock->send_pkt_list_lock);
>>  			list_add(&pkt->list, &vsock->send_pkt_list);
>>  			spin_unlock_bh(&vsock->send_pkt_list_lock);
>> @@ -133,24 +202,20 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
>>  			/* We cannot finish yet if more buffers snuck in while
>>  			 * re-enabling notify.
>>  			 */
>> -			if (unlikely(vhost_enable_notify(&vsock->dev, vq))) {
>> +			if (!headcount && unlikely(vhost_enable_notify(&vsock->dev, vq))) {
>>  				vhost_disable_notify(&vsock->dev, vq);
>>  				continue;
>>  			}
>>  			break;
>>  		}
>>
>> -		if (out) {
>> -			virtio_transport_free_pkt(pkt);
>> -			vq_err(vq, "Expected 0 output buffers, got %u\n", out);
>> -			break;
>> -		}
>> -
>>  		len = iov_length(&vq->iov[out], in);
>>  		iov_iter_init(&iov_iter, READ, &vq->iov[out], in, len);
>>
>> -		nbytes = copy_to_iter(&pkt->hdr, sizeof(pkt->hdr), &iov_iter);
>> -		if (nbytes != sizeof(pkt->hdr)) {
>> +		if (likely(mergeable))
>> +			pkt->mrg_rxbuf_hdr.num_buffers = cpu_to_le16(headcount);
>> +		nbytes = copy_to_iter(&pkt->hdr, vsock_hlen, &iov_iter);
>> +		if (nbytes != vsock_hlen) {
>>  			virtio_transport_free_pkt(pkt);
>>  			vq_err(vq, "Faulted on copying pkt hdr\n");
>>  			break;
>> @@ -163,7 +228,7 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
>>  			break;
>>  		}
>>
>> -		vhost_add_used(vq, head, sizeof(pkt->hdr) + pkt->len);
>> +		vhost_add_used_n(vq, vq->heads, headcount);
>>  		added = true;
>>
>>  		if (pkt->reply) {
>> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
>> index bf84418..da9e1fe 100644
>> --- a/include/linux/virtio_vsock.h
>> +++ b/include/linux/virtio_vsock.h
>> @@ -50,6 +50,7 @@ struct virtio_vsock_sock {
>>
>>  struct virtio_vsock_pkt {
>>  	struct virtio_vsock_hdr	hdr;
>> +	struct virtio_vsock_mrg_rxbuf_hdr mrg_rxbuf_hdr;
>>  	struct work_struct work;
>>  	struct list_head list;
>>  	/* socket refcnt not held, only use for cancellation */
>> diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
>> index 1d57ed3..2292f30 100644
>> --- a/include/uapi/linux/virtio_vsock.h
>> +++ b/include/uapi/linux/virtio_vsock.h
>> @@ -63,6 +63,11 @@ struct virtio_vsock_hdr {
>>  	__le32	fwd_cnt;
>>  } __attribute__((packed));
>>
>> +/* It add mergeable rx buffers feature */
>> +struct virtio_vsock_mrg_rxbuf_hdr {
>> +	__le16  num_buffers;    /* number of mergeable rx buffers */
>> +} __attribute__((packed));
>> +
>>  enum virtio_vsock_type {
>>  	VIRTIO_VSOCK_TYPE_STREAM = 1,
>>  };
>> -- 
>> 1.8.3.1
>>


* [PATCH net V3 3/3] Revert "net: vhost: lock the vqs one by one"
From: Jason Wang @ 2018-12-13  2:53 UTC (permalink / raw)
  To: mst, jasowang, kvm, virtualization, netdev, linux-kernel
In-Reply-To: <20181213025339.14023-1-jasowang@redhat.com>

This reverts commit 78139c94dc8c96a478e67dab3bee84dc6eccb5fd. We don't
protect the device IOTLB with the vq mutex, which can lead to, e.g., a
use-after-free of device IOTLB entries. And since we've switched to
mutex_trylock() in the previous patch, it's safe to revert it without
risking a deadlock.

Fixes: commit 78139c94dc8c ("net: vhost: lock the vqs one by one")
Cc: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 21 +++++++++++++++++----
 1 file changed, 17 insertions(+), 4 deletions(-)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 5915f240275a..55e5aa662ad5 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -295,11 +295,8 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
 {
 	int i;
 
-	for (i = 0; i < d->nvqs; ++i) {
-		mutex_lock(&d->vqs[i]->mutex);
+	for (i = 0; i < d->nvqs; ++i)
 		__vhost_vq_meta_reset(d->vqs[i]);
-		mutex_unlock(&d->vqs[i]->mutex);
-	}
 }
 
 static void vhost_vq_reset(struct vhost_dev *dev,
@@ -895,6 +892,20 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
 #define vhost_get_used(vq, x, ptr) \
 	vhost_get_user(vq, x, ptr, VHOST_ADDR_USED)
 
+static void vhost_dev_lock_vqs(struct vhost_dev *d)
+{
+	int i = 0;
+	for (i = 0; i < d->nvqs; ++i)
+		mutex_lock_nested(&d->vqs[i]->mutex, i);
+}
+
+static void vhost_dev_unlock_vqs(struct vhost_dev *d)
+{
+	int i = 0;
+	for (i = 0; i < d->nvqs; ++i)
+		mutex_unlock(&d->vqs[i]->mutex);
+}
+
 static int vhost_new_umem_range(struct vhost_umem *umem,
 				u64 start, u64 size, u64 end,
 				u64 userspace_addr, int perm)
@@ -976,6 +987,7 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
 	int ret = 0;
 
 	mutex_lock(&dev->mutex);
+	vhost_dev_lock_vqs(dev);
 	switch (msg->type) {
 	case VHOST_IOTLB_UPDATE:
 		if (!dev->iotlb) {
@@ -1009,6 +1021,7 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
 		break;
 	}
 
+	vhost_dev_unlock_vqs(dev);
 	mutex_unlock(&dev->mutex);
 
 	return ret;
-- 
2.17.1


* [PATCH net V3 2/3] vhost_net: switch to use mutex_trylock() in vhost_net_busy_poll()
From: Jason Wang @ 2018-12-13  2:53 UTC (permalink / raw)
  To: mst, jasowang, kvm, virtualization, netdev, linux-kernel
In-Reply-To: <20181213025339.14023-1-jasowang@redhat.com>

We used to hold the mutex of the paired virtqueue in
vhost_net_busy_poll(). But this results in an inconsistent lock
order, which may cause a deadlock if we try to bring back the
protection of the device IOTLB with the vq mutex, since that requires
holding the mutexes of all virtqueues at the same time.

Fix this simply by switching to mutex_trylock(); when it fails, just
skip the busy polling. This can happen while the device IOTLB is being
updated, which should be rare.
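
The locking pattern described above can be sketched in userspace as follows.
This is only an illustration under stated assumptions: the sketch_* names are
hypothetical, and a C11 atomic_flag stands in for the kernel mutex so the
example stays self-contained.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical userspace stand-in for a virtqueue and its mutex. */
struct sketch_vq {
	atomic_flag locked;   /* set while the "mutex" is held */
	unsigned int polls;   /* how many times busy polling actually ran */
};

/* Blocking lock: spin until the flag is acquired. */
static void sketch_lock(struct sketch_vq *vq)
{
	while (atomic_flag_test_and_set_explicit(&vq->locked, memory_order_acquire))
		;
}

/* Non-blocking lock: returns true only if we took the lock. */
static bool sketch_trylock(struct sketch_vq *vq)
{
	return !atomic_flag_test_and_set_explicit(&vq->locked, memory_order_acquire);
}

static void sketch_unlock(struct sketch_vq *vq)
{
	atomic_flag_clear_explicit(&vq->locked, memory_order_release);
}

/* IOTLB-update side: take every vq lock, always in index order. */
static void sketch_lock_all(struct sketch_vq *vqs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		sketch_lock(&vqs[i]);
}

static void sketch_unlock_all(struct sketch_vq *vqs, size_t n)
{
	for (size_t i = 0; i < n; i++)
		sketch_unlock(&vqs[i]);
}

/*
 * Busy-poll side: the caller already holds its own vq lock, so blocking on
 * the paired vq could invert the index order used by sketch_lock_all().
 * Trying the lock and skipping the poll on failure avoids that.
 */
static bool sketch_busy_poll(struct sketch_vq *paired)
{
	if (!sketch_trylock(paired))
		return false;   /* someone holds it: skip busy polling */

	paired->polls++;        /* placeholder for the actual polling work */
	sketch_unlock(paired);
	return true;
}
```

The trade-off is that a poll is occasionally skipped while all locks are
held, which matches the expectation that IOTLB updates are rare.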

Fixes: commit 78139c94dc8c ("net: vhost: lock the vqs one by one")
Cc: Tonghao Zhang <xiangxia.m.yue@gmail.com>
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/net.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
index ab11b2bee273..ad7a6f475a44 100644
--- a/drivers/vhost/net.c
+++ b/drivers/vhost/net.c
@@ -513,7 +513,13 @@ static void vhost_net_busy_poll(struct vhost_net *net,
 	struct socket *sock;
 	struct vhost_virtqueue *vq = poll_rx ? tvq : rvq;
 
-	mutex_lock_nested(&vq->mutex, poll_rx ? VHOST_NET_VQ_TX: VHOST_NET_VQ_RX);
+	/* Try to hold the vq mutex of the paired virtqueue. We can't
+	 * use mutex_lock() here since we could not guarantee a
+	 * consistent lock ordering.
+	 */
+	if (!mutex_trylock(&vq->mutex))
+		return;
+
 	vhost_disable_notify(&net->dev, vq);
 	sock = rvq->private_data;
 
-- 
2.17.1


* [PATCH net V3 1/3] vhost: make sure used idx is seen before log in vhost_add_used_n()
From: Jason Wang @ 2018-12-13  2:53 UTC (permalink / raw)
  To: mst, jasowang, kvm, virtualization, netdev, linux-kernel
In-Reply-To: <20181213025339.14023-1-jasowang@redhat.com>

We are missing a write barrier that guarantees the used idx is updated
and seen before the log. Without it, userspace may sync and copy the
used ring before the used idx is updated. Fix this by adding a barrier
before log_write().
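
The ordering requirement can be illustrated with a small userspace analogy.
This sketch is an assumption-laden stand-in, not kernel code: C11 fences play
the role of smp_wmb() and its read-side counterpart, and the structure and
function names are hypothetical.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical stand-ins for vq->used->idx and the dirty-log bit. */
struct sketch_ring {
	_Atomic uint16_t used_idx;
	_Atomic uint8_t log_dirty;
};

/* Writer: publish the new used idx, then mark the log. */
static void sketch_publish_used(struct sketch_ring *r, uint16_t new_idx)
{
	atomic_store_explicit(&r->used_idx, new_idx, memory_order_relaxed);
	/* Plays the role of smp_wmb(): the idx store is ordered before the log store. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&r->log_dirty, 1, memory_order_relaxed);
}

/*
 * Reader (the "userspace sync" side): if the log is seen as dirty, the
 * acquire fence guarantees the idx read below observes the published value.
 * Returns 1 and fills *idx_out when the log is dirty, 0 otherwise.
 */
static int sketch_sync(struct sketch_ring *r, uint16_t *idx_out)
{
	if (!atomic_load_explicit(&r->log_dirty, memory_order_relaxed))
		return 0;

	atomic_thread_fence(memory_order_acquire);
	*idx_out = atomic_load_explicit(&r->used_idx, memory_order_relaxed);
	return 1;
}
```

Without the release fence, the two stores could become visible in the
opposite order, so a reader could see the dirty log and still copy a stale
idx.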

Fixes: 8dd014adfea6f ("vhost-net: mergeable buffers support")
Acked-by: Michael S. Tsirkin <mst@redhat.com>
Signed-off-by: Jason Wang <jasowang@redhat.com>
---
 drivers/vhost/vhost.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
index 6b98d8e3a5bf..5915f240275a 100644
--- a/drivers/vhost/vhost.c
+++ b/drivers/vhost/vhost.c
@@ -2220,6 +2220,8 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
 		return -EFAULT;
 	}
 	if (unlikely(vq->log_used)) {
+		/* Make sure used idx is seen before log. */
+		smp_wmb();
 		/* Log used index update. */
 		log_write(vq->log_base,
 			  vq->log_addr + offsetof(struct vring_used, idx),
-- 
2.17.1


* [PATCH net V3 0/3] Fix various issue of vhost
From: Jason Wang @ 2018-12-13  2:53 UTC (permalink / raw)
  To: mst, jasowang, kvm, virtualization, netdev, linux-kernel

Hi:

This series tries to fix various issues of vhost:

- Patch 1 adds a missing write barrier between used idx updating and
  logging.
- Patches 2-3 bring back the protection of device IOTLB through vq
  mutex; this fixes a possible use-after-free of device IOTLB entries.

Please consider them for -stable.

Thanks

Changes from V2:
- drop the dirty page fix and retarget it to net-next
Changes from V1:
- silence compiler warning for 32bit.
- use mutex_trylock() on slowpath instead of mutex_lock() even on fast
  path.

Jason Wang (3):
  vhost: make sure used idx is seen before log in vhost_add_used_n()
  vhost_net: switch to use mutex_trylock() in vhost_net_busy_poll()
  Revert "net: vhost: lock the vqs one by one"

 drivers/vhost/net.c   |  8 +++++++-
 drivers/vhost/vhost.c | 23 +++++++++++++++++++----
 2 files changed, 26 insertions(+), 5 deletions(-)

-- 
2.17.1


* Re: [PATCH v2 1/5] VSOCK: support fill mergeable rx buffer in guest
From: jiangyiwen @ 2018-12-13  2:47 UTC (permalink / raw)
  To: David Miller; +Cc: kvm, mst, netdev, virtualization, stefanha
In-Reply-To: <20181212.110829.1327856253463467975.davem@davemloft.net>

On 2018/12/13 3:08, David Miller wrote:
> From: jiangyiwen <jiangyiwen@huawei.com>
> Date: Wed, 12 Dec 2018 17:28:16 +0800
> 
>> +static int fill_mergeable_rx_buff(struct virtio_vsock *vsock,
>> +		struct virtqueue *vq)
>> +{
>> +	struct page_frag *alloc_frag = &vsock->alloc_frag;
>> +	struct scatterlist sg;
>> +	/* Currently we don't use ewma len, use PAGE_SIZE instead, because too
>> +	 * small size can't fill one full packet, sadly we only 128 vq num now.
>> +	 */
>> +	unsigned int len = PAGE_SIZE, hole;
>> +	void *buf;
>> +	int err;
> 
> Please don't break up a set of local variable declarations with a
> comment like this.  The comment seems to be about the initialization
> of 'len', so move that initialization into the code below the variable
> declarations and bring the comment along for the ride as well.
> 

Hi David,

Thanks for your suggestions. If the maintainers approve of using this
patch series rather than the "vsock over virtio-net" idea, I will fix
this in the next version. Otherwise, I hope it can give the maintainers
motivation to consolidate vsock (the virtio_transport-related code) and
virtio-net.

Thanks,
Yiwen.


* Re: [PATCH net V2 0/4] Fix various issue of vhost
From: Jason Wang @ 2018-12-13  2:42 UTC (permalink / raw)
  To: David Miller; +Cc: netdev, virtualization, linux-kernel, kvm, mst
In-Reply-To: <20181212.153134.1249380470735153418.davem@davemloft.net>


On 2018/12/13 7:31 AM, David Miller wrote:
> From: Jason Wang <jasowang@redhat.com>
> Date: Wed, 12 Dec 2018 18:08:15 +0800
>
>> This series tries to fix various issues of vhost:
>>
>> - Patch 1 adds a missing write barrier between used idx updating and
>>    logging.
>> - Patch 2-3 brings back the protection of device IOTLB through vq
>>    mutex, this fixes possible use after free in device IOTLB entries.
>> - Patch 4-7 fixes the dirty page logging when device IOTLB is
>>    enabled. We should log through GPA instead of GIOVA; this is done
>>    by introducing an HVA->GPA reverse mapping and converting HVA to GPA
>>    when logging dirty pages.
>>
>> Please consider them for -stable.
>>
>> Thanks
>>
>> Changes from V1:
>> - silent compiler warning for 32bit.
>> - use mutex_trylock() on slowpath instead of mutex_lock() even on fast
>>    path.
> Hello Jason.
>
> Look like Michael wants you to split out patch #4 and target
> net-next with it.
>
> So please do that and respin the first 3 patches here with Michael's
> ACKs.
>
> Thanks.


Yes, will send V3.

Thanks

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization


* Re: [PATCH net V2 4/4] vhost: log dirty page correctly
From: Jason Wang @ 2018-12-13  2:39 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: Jintack Lim, netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181212092435-mutt-send-email-mst@kernel.org>


On 2018/12/12 10:32 PM, Michael S. Tsirkin wrote:
> On Wed, Dec 12, 2018 at 06:08:19PM +0800, Jason Wang wrote:
>> Vhost dirty page logging API is designed to sync through GPA. But we
>> try to log GIOVA when device IOTLB is enabled. This is wrong and may
>> lead to missing data after migration.
>>
>> To solve this issue, when logging with device IOTLB enabled, we will:
>>
>> 1) reuse the device IOTLB translation result of GIOVA->HVA mapping to
>>     get HVA, for writable descriptor, get HVA through iovec. For used
>>     ring update, translate its GIOVA to HVA
>> 2) traverse the GPA->HVA mapping to get the possible GPA and log
>>     through GPA. Pay attention this reverse mapping is not guaranteed
>>     to be unique, so we should log each possible GPA in this case.
>>
>> This fix the failure of scp to guest during migration. In -next, we
>> will probably support passing GIOVA->GPA instead of GIOVA->HVA.
>>
>> Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
>> Reported-by: Jintack Lim <jintack@cs.columbia.edu>
>> Cc: Jintack Lim <jintack@cs.columbia.edu>
>> Signed-off-by: Jason Wang <jasowang@redhat.com>
> It's a nasty bug for sure but it's been like this for a long
> time so I'm inclined to say let's put it in 4.21,
> and queue for stable.
>
> So please split this out from this series.


Ok.


>
> Also, I'd like to see a feature bit that allows GPA in IOTLBs.


Just to make sure I understand this. It looks to me we should:

- allow passing GIOVA->GPA through UAPI

- cache GIOVA->GPA somewhere but still use GIOVA->HVA in device IOTLB 
for performance

Is this what you suggest?

Thanks


>
>> ---
>>   drivers/vhost/net.c   |  3 +-
>>   drivers/vhost/vhost.c | 79 +++++++++++++++++++++++++++++++++++--------
>>   drivers/vhost/vhost.h |  3 +-
>>   3 files changed, 69 insertions(+), 16 deletions(-)
>>
>> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
>> index ad7a6f475a44..784df2b49628 100644
>> --- a/drivers/vhost/net.c
>> +++ b/drivers/vhost/net.c
>> @@ -1192,7 +1192,8 @@ static void handle_rx(struct vhost_net *net)
>>   		if (nvq->done_idx > VHOST_NET_BATCH)
>>   			vhost_net_signal_used(nvq);
>>   		if (unlikely(vq_log))
>> -			vhost_log_write(vq, vq_log, log, vhost_len);
>> +			vhost_log_write(vq, vq_log, log, vhost_len,
>> +					vq->iov, in);
>>   		total_len += vhost_len;
>>   		if (unlikely(vhost_exceeds_weight(++recv_pkts, total_len))) {
>>   			vhost_poll_queue(&vq->poll);
>> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
>> index 55e5aa662ad5..3660310604fd 100644
>> --- a/drivers/vhost/vhost.c
>> +++ b/drivers/vhost/vhost.c
>> @@ -1733,11 +1733,67 @@ static int log_write(void __user *log_base,
>>   	return r;
>>   }
>>   
>> +static int log_write_hva(struct vhost_virtqueue *vq, u64 hva, u64 len)
>> +{
>> +	struct vhost_umem *umem = vq->umem;
>> +	struct vhost_umem_node *u;
>> +	u64 gpa;
>> +	int r;
>> +	bool hit = false;
>> +
>> +	list_for_each_entry(u, &umem->umem_list, link) {
>> +		if (u->userspace_addr < hva &&
>> +		    u->userspace_addr + u->size >=
>> +		    hva + len) {
>> +			gpa = u->start + hva - u->userspace_addr;
>> +			r = log_write(vq->log_base, gpa, len);
>> +			if (r < 0)
>> +				return r;
>> +			hit = true;
>> +		}
>> +	}
>> +
>> +	/* No reverse mapping, should be a bug */
>> +	WARN_ON(!hit);
> Maybe it should but userspace can trigger this easily I think.
> We need to stop the device not warn in kernel log.
>
> Also there's an error fd: VHOST_SET_VRING_ERR, need to wake it up.
>

Ok.
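
For reference, the HVA->GPA reverse lookup under discussion can be sketched
in userspace like this. It is a hedged illustration rather than the patch's
code: the sketch_* names are hypothetical, the lower bound is treated as
inclusive (an assumption, the quoted patch uses a strict comparison), and a
miss is reported to the caller instead of triggering a warning.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one GPA->HVA memory region. */
struct sketch_region {
	uint64_t gpa_start;
	uint64_t hva_start;
	uint64_t size;
};

/*
 * Collect every GPA that maps to [hva, hva + len). The reverse mapping is
 * not guaranteed to be unique, so all matches are reported. At most out_cap
 * entries are written to out; the return value is the total number of
 * matching regions. A return of 0 means "no reverse mapping", which the
 * caller can turn into a device error rather than a kernel warning.
 */
static size_t sketch_hva_to_gpas(const struct sketch_region *regions, size_t n,
				 uint64_t hva, uint64_t len,
				 uint64_t *out, size_t out_cap)
{
	size_t hits = 0;

	for (size_t i = 0; i < n; i++) {
		const struct sketch_region *r = &regions[i];

		/* Overflow-safe check that the range lies inside the region. */
		if (hva < r->hva_start || len > r->size ||
		    hva - r->hva_start > r->size - len)
			continue;

		if (hits < out_cap)
			out[hits] = r->gpa_start + (hva - r->hva_start);
		hits++;
	}

	return hits;
}
```

Returning the hit count leaves the policy decision (stop the device, signal
the error fd) to the caller, which is the direction suggested in the review.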


>> +	return 0;
>> +}
>> +
>> +static void log_used(struct vhost_virtqueue *vq, u64 used_offset, u64 len)
>> +{
>> +	struct iovec iov[64];
>> +	int i, ret;
>> +
>> +	if (!vq->iotlb) {
>> +		log_write(vq->log_base, vq->log_addr + used_offset, len);
>> +		return;
>> +	}
> This change seems questionable. used ring writes
> use their own machinery it does not go through iotlb.
> Same should apply to log I think.


The problem is that the used ring may not be physically contiguous when
device IOTLB is enabled, so logging it should go through the IOTLB too.


>
>> +
>> +	ret = translate_desc(vq, (u64)(uintptr_t)vq->used + used_offset,
>> +			     len, iov, 64, VHOST_ACCESS_WO);
>> +	WARN_ON(ret < 0);
>
> Same thing here. translation failures can be triggered from guest.
> warn on is not a good error handling strategy ...


Ok. Let me fix it.


Thanks


>> +
>> +	for (i = 0; i < ret; i++) {
>> +		ret = log_write_hva(vq,	(u64)(uintptr_t)iov[i].iov_base,
>> +				    iov[i].iov_len);
>> +		WARN_ON(ret);
>> +	}
>> +}
>> +
>>   int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
>> -		    unsigned int log_num, u64 len)
>> +		    unsigned int log_num, u64 len, struct iovec *iov, int count)
>>   {
>>   	int i, r;
>>   
>> +	if (vq->iotlb) {
>> +		for (i = 0; i < count; i++) {
>> +			r = log_write_hva(vq, (u64)(uintptr_t)iov[i].iov_base,
>> +					  iov[i].iov_len);
>> +			if (r < 0)
>> +				return r;
>> +		}
>> +		return 0;
>> +	}
>> +
>>   	/* Make sure data written is seen before log. */
>>   	smp_wmb();
>>   	for (i = 0; i < log_num; ++i) {
>> @@ -1769,9 +1825,8 @@ static int vhost_update_used_flags(struct vhost_virtqueue *vq)
>>   		smp_wmb();
>>   		/* Log used flag write. */
>>   		used = &vq->used->flags;
>> -		log_write(vq->log_base, vq->log_addr +
>> -			  (used - (void __user *)vq->used),
>> -			  sizeof vq->used->flags);
>> +		log_used(vq, (used - (void __user *)vq->used),
>> +			 sizeof vq->used->flags);
>>   		if (vq->log_ctx)
>>   			eventfd_signal(vq->log_ctx, 1);
>>   	}
>> @@ -1789,9 +1844,8 @@ static int vhost_update_avail_event(struct vhost_virtqueue *vq, u16 avail_event)
>>   		smp_wmb();
>>   		/* Log avail event write */
>>   		used = vhost_avail_event(vq);
>> -		log_write(vq->log_base, vq->log_addr +
>> -			  (used - (void __user *)vq->used),
>> -			  sizeof *vhost_avail_event(vq));
>> +		log_used(vq, (used - (void __user *)vq->used),
>> +			 sizeof *vhost_avail_event(vq));
>>   		if (vq->log_ctx)
>>   			eventfd_signal(vq->log_ctx, 1);
>>   	}
>> @@ -2191,10 +2245,8 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
>>   		/* Make sure data is seen before log. */
>>   		smp_wmb();
>>   		/* Log used ring entry write. */
>> -		log_write(vq->log_base,
>> -			  vq->log_addr +
>> -			   ((void __user *)used - (void __user *)vq->used),
>> -			  count * sizeof *used);
>> +		log_used(vq, ((void __user *)used - (void __user *)vq->used),
>> +			 count * sizeof *used);
>>   	}
>>   	old = vq->last_used_idx;
>>   	new = (vq->last_used_idx += count);
>> @@ -2236,9 +2288,8 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
>>   		/* Make sure used idx is seen before log. */
>>   		smp_wmb();
>>   		/* Log used index update. */
>> -		log_write(vq->log_base,
>> -			  vq->log_addr + offsetof(struct vring_used, idx),
>> -			  sizeof vq->used->idx);
>> +		log_used(vq, offsetof(struct vring_used, idx),
>> +			 sizeof vq->used->idx);
>>   		if (vq->log_ctx)
>>   			eventfd_signal(vq->log_ctx, 1);
>>   	}
>> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
>> index 466ef7542291..1b675dad5e05 100644
>> --- a/drivers/vhost/vhost.h
>> +++ b/drivers/vhost/vhost.h
>> @@ -205,7 +205,8 @@ bool vhost_vq_avail_empty(struct vhost_dev *, struct vhost_virtqueue *);
>>   bool vhost_enable_notify(struct vhost_dev *, struct vhost_virtqueue *);
>>   
>>   int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
>> -		    unsigned int log_num, u64 len);
>> +		    unsigned int log_num, u64 len,
>> +		    struct iovec *iov, int count);
>>   int vq_iotlb_prefetch(struct vhost_virtqueue *vq);
>>   
>>   struct vhost_msg_node *vhost_new_msg(struct vhost_virtqueue *vq, int type);
>> -- 
>> 2.17.1

* Re: [PATCH v2 3/5] VSOCK: support receive mergeable rx buffer in guest
From: jiangyiwen @ 2018-12-13  2:38 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, kvm, Stefan Hajnoczi, virtualization
In-Reply-To: <20181212101936-mutt-send-email-mst@kernel.org>

Hi Michael,

On 2018/12/12 23:31, Michael S. Tsirkin wrote:
> On Wed, Dec 12, 2018 at 05:31:39PM +0800, jiangyiwen wrote:
>> Guest receive mergeable rx buffer, it can merge
>> scatter rx buffer into a big buffer and then copy
>> to user space.
>>
>> In addition, it also use iovec to replace buf in struct
>> virtio_vsock_pkt, keep tx and rx consistency. The only
>> difference is now tx still uses a segment of continuous
>> physical memory to implement.
>>
>> Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
>> ---
>>  drivers/vhost/vsock.c                   |  31 +++++++---
>>  include/linux/virtio_vsock.h            |   6 +-
>>  net/vmw_vsock/virtio_transport.c        | 105 ++++++++++++++++++++++++++++----
>>  net/vmw_vsock/virtio_transport_common.c |  59 ++++++++++++++----
>>  4 files changed, 166 insertions(+), 35 deletions(-)
> 
> 
> This was supposed to be a guest patch, why is vhost changed here?
> 

In the mergeable rx buffer case, big packets need to be scattered across
several buffers, so I added a kvec array to struct virtio_vsock_pkt. At
the same time, to keep tx and rx consistent, I used the kvec to replace
the buf variable. Because vhost uses pkt->buf, this patch has to change
vhost as well.
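
As a rough userspace illustration of what replacing buf with a vector means
for the copy path, the sketch below walks a segment array and copies at most
pkt_len bytes. It uses struct iovec from <sys/uio.h> in place of the kernel's
struct kvec, and the function name is hypothetical.

```c
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Copy up to pkt_len bytes out of a scattered packet into a flat buffer.
 * Each segment contributes min(remaining, iov_len) bytes, mirroring the
 * per-vector loop in the patch. Returns the number of bytes copied, which
 * is less than pkt_len if the segments hold less data than advertised.
 */
static size_t sketch_copy_from_vecs(void *dst, const struct iovec *vec,
				    int nr_vecs, size_t pkt_len)
{
	size_t remain = pkt_len;
	size_t copied = 0;

	for (int i = 0; i < nr_vecs && remain > 0; i++) {
		size_t chunk = vec[i].iov_len < remain ? vec[i].iov_len : remain;

		memcpy((char *)dst + copied, vec[i].iov_base, chunk);
		copied += chunk;
		remain -= chunk;
	}

	return copied;
}
```

A caller can compare the return value with pkt_len to detect a short packet
instead of assuming the vector always covers the full length.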

>> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
>> index dc52b0f..c7ab0dd 100644
>> --- a/drivers/vhost/vsock.c
>> +++ b/drivers/vhost/vsock.c
>> @@ -179,6 +179,8 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
>>  		size_t nbytes;
>>  		size_t len;
>>  		s16 headcount;
>> +		size_t remain_len;
>> +		int i;
>>
>>  		spin_lock_bh(&vsock->send_pkt_list_lock);
>>  		if (list_empty(&vsock->send_pkt_list)) {
>> @@ -221,11 +223,19 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
>>  			break;
>>  		}
>>
>> -		nbytes = copy_to_iter(pkt->buf, pkt->len, &iov_iter);
>> -		if (nbytes != pkt->len) {
>> -			virtio_transport_free_pkt(pkt);
>> -			vq_err(vq, "Faulted on copying pkt buf\n");
>> -			break;
>> +		remain_len = pkt->len;
>> +		for (i = 0; i < pkt->nr_vecs; i++) {
>> +			int tmp_len;
>> +
>> +			tmp_len = min(remain_len, pkt->vec[i].iov_len);
>> +			nbytes = copy_to_iter(pkt->vec[i].iov_base, tmp_len, &iov_iter);
>> +			if (nbytes != tmp_len) {
>> +				virtio_transport_free_pkt(pkt);
>> +				vq_err(vq, "Faulted on copying pkt buf\n");
>> +				break;
>> +			}
>> +
>> +			remain_len -= tmp_len;
>>  		}
>>
>>  		vhost_add_used_n(vq, vq->heads, headcount);
>> @@ -341,6 +351,7 @@ static void vhost_transport_send_pkt_work(struct vhost_work *work)
>>  	struct iov_iter iov_iter;
>>  	size_t nbytes;
>>  	size_t len;
>> +	void *buf;
>>
>>  	if (in != 0) {
>>  		vq_err(vq, "Expected 0 input buffers, got %u\n", in);
>> @@ -375,13 +386,17 @@ static void vhost_transport_send_pkt_work(struct vhost_work *work)
>>  		return NULL;
>>  	}
>>
>> -	pkt->buf = kmalloc(pkt->len, GFP_KERNEL);
>> -	if (!pkt->buf) {
>> +	buf = kmalloc(pkt->len, GFP_KERNEL);
>> +	if (!buf) {
>>  		kfree(pkt);
>>  		return NULL;
>>  	}
>>
>> -	nbytes = copy_from_iter(pkt->buf, pkt->len, &iov_iter);
>> +	pkt->vec[0].iov_base = buf;
>> +	pkt->vec[0].iov_len = pkt->len;
>> +	pkt->nr_vecs = 1;
>> +
>> +	nbytes = copy_from_iter(buf, pkt->len, &iov_iter);
>>  	if (nbytes != pkt->len) {
>>  		vq_err(vq, "Expected %u byte payload, got %zu bytes\n",
>>  		       pkt->len, nbytes);
>> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
>> index da9e1fe..734eeed 100644
>> --- a/include/linux/virtio_vsock.h
>> +++ b/include/linux/virtio_vsock.h
>> @@ -13,6 +13,8 @@
>>  #define VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE	(1024 * 4)
>>  #define VIRTIO_VSOCK_MAX_BUF_SIZE		0xFFFFFFFFUL
>>  #define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE		(1024 * 64)
>> +/* virtio_vsock_pkt + max_pkt_len(default MAX_PKT_BUF_SIZE) */
>> +#define VIRTIO_VSOCK_MAX_VEC_NUM ((VIRTIO_VSOCK_MAX_PKT_BUF_SIZE / PAGE_SIZE) + 1)
>>
>>  /* Virtio-vsock feature */
>>  #define VIRTIO_VSOCK_F_MRG_RXBUF 0 /* Host can merge receive buffers. */
>> @@ -55,10 +57,12 @@ struct virtio_vsock_pkt {
>>  	struct list_head list;
>>  	/* socket refcnt not held, only use for cancellation */
>>  	struct vsock_sock *vsk;
>> -	void *buf;
>> +	struct kvec vec[VIRTIO_VSOCK_MAX_VEC_NUM];
>> +	int nr_vecs;
>>  	u32 len;
>>  	u32 off;
>>  	bool reply;
>> +	bool mergeable;
>>  };
>>
>>  struct virtio_vsock_pkt_info {
>> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
>> index c4a465c..148b58a 100644
>> --- a/net/vmw_vsock/virtio_transport.c
>> +++ b/net/vmw_vsock/virtio_transport.c
>> @@ -155,8 +155,10 @@ static int virtio_transport_send_pkt_loopback(struct virtio_vsock *vsock,
>>
>>  		sg_init_one(&hdr, &pkt->hdr, sizeof(pkt->hdr));
>>  		sgs[out_sg++] = &hdr;
>> -		if (pkt->buf) {
>> -			sg_init_one(&buf, pkt->buf, pkt->len);
>> +		if (pkt->len) {
>> +			/* Currently only support a segment of memory in tx */
>> +			BUG_ON(pkt->vec[0].iov_len != pkt->len);
>> +			sg_init_one(&buf, pkt->vec[0].iov_base, pkt->vec[0].iov_len);
>>  			sgs[out_sg++] = &buf;
>>  		}
>>
>> @@ -304,23 +306,28 @@ static int fill_old_rx_buff(struct virtqueue *vq)
>>  	struct virtio_vsock_pkt *pkt;
>>  	struct scatterlist hdr, buf, *sgs[2];
>>  	int ret;
>> +	void *pkt_buf;
>>
>>  	pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
>>  	if (!pkt)
>>  		return -ENOMEM;
>>
>> -	pkt->buf = kmalloc(buf_len, GFP_KERNEL);
>> -	if (!pkt->buf) {
>> +	pkt_buf = kmalloc(buf_len, GFP_KERNEL);
>> +	if (!pkt_buf) {
>>  		virtio_transport_free_pkt(pkt);
>>  		return -ENOMEM;
>>  	}
>>
>> +	pkt->vec[0].iov_base = pkt_buf;
>> +	pkt->vec[0].iov_len = buf_len;
>> +	pkt->nr_vecs = 1;
>> +
>>  	pkt->len = buf_len;
>>
>>  	sg_init_one(&hdr, &pkt->hdr, sizeof(pkt->hdr));
>>  	sgs[0] = &hdr;
>>
>> -	sg_init_one(&buf, pkt->buf, buf_len);
>> +	sg_init_one(&buf, pkt->vec[0].iov_base, buf_len);
>>  	sgs[1] = &buf;
>>  	ret = virtqueue_add_sgs(vq, sgs, 0, 2, pkt, GFP_KERNEL);
>>  	if (ret)
>> @@ -388,11 +395,78 @@ static bool virtio_transport_more_replies(struct virtio_vsock *vsock)
>>  	return val < virtqueue_get_vring_size(vq);
>>  }
>>
>> +static struct virtio_vsock_pkt *receive_mergeable(struct virtqueue *vq,
>> +		struct virtio_vsock *vsock, unsigned int *total_len)
>> +{
>> +	struct virtio_vsock_pkt *pkt;
>> +	u16 num_buf;
>> +	void *buf;
>> +	unsigned int len;
>> +	size_t vsock_hlen = sizeof(struct virtio_vsock_pkt);
>> +
>> +	buf = virtqueue_get_buf(vq, &len);
>> +	if (!buf)
>> +		return NULL;
>> +
>> +	*total_len = len;
>> +	vsock->rx_buf_nr--;
>> +
>> +	if (unlikely(len < vsock_hlen)) {
>> +		put_page(virt_to_head_page(buf));
>> +		return NULL;
>> +	}
>> +
>> +	pkt = buf;
>> +	num_buf = le16_to_cpu(pkt->mrg_rxbuf_hdr.num_buffers);
>> +	if (!num_buf || num_buf > VIRTIO_VSOCK_MAX_VEC_NUM) {
>> +		put_page(virt_to_head_page(buf));
>> +		return NULL;
>> +	}
> 
> So everything just stops going, and host and user don't even
> know what the reason is. And not only that - the next
> packet will be corrupted because we skipped the first one.
> 
> 

As I understand it, this case is not hit unless the code has a
*BUG*, e.g. the Host sends malformed packets (shorter or longer than
expected). In that case, I think we should ignore/drop these packets.
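
To make the concern quoted above concrete, here is a toy model in
plain C (not kernel code). It assumes the count in the head buffer can
still be read, and that a packet occupies num_buffers consecutive used
buffers. All toy_* names are hypothetical. It only illustrates why
dropping the head buffer alone leaves the receiver out of sync, while
draining the remaining buffers of the packet does not.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for one used rx buffer; not a kernel structure. */
struct toy_buf {
	int is_head;          /* 1 if this buffer starts a packet */
	uint16_t num_buffers; /* only meaningful when is_head is set */
};

/* Toy stand-in for the used ring: buffers are consumed in order. */
struct toy_ring {
	const struct toy_buf *bufs;
	size_t count;
	size_t next;
};

static const struct toy_buf *toy_get_buf(struct toy_ring *r)
{
	return r->next < r->count ? &r->bufs[r->next++] : NULL;
}

/* Drop only the head buffer, like an early return after one get_buf. */
static void toy_drop_head_only(struct toy_ring *r)
{
	(void)toy_get_buf(r);
}

/* Drop the head and the num_buffers - 1 buffers that belong to it. */
static void toy_drop_whole_packet(struct toy_ring *r)
{
	const struct toy_buf *head = toy_get_buf(r);
	uint16_t left;

	if (!head)
		return;
	assert(head->is_head);
	for (left = head->num_buffers; left > 1; left--)
		if (!toy_get_buf(r))
			break;
}

/* 1 if the next buffer to be read starts a packet, or the ring is empty. */
static int toy_ring_in_sync(const struct toy_ring *r)
{
	return r->next >= r->count || r->bufs[r->next].is_head;
}
```

With a ring holding a 3-buffer packet followed by a 1-buffer packet,
toy_drop_head_only() leaves the next read pointing at a continuation
buffer, which would then be parsed as a header.
toy_drop_whole_packet() leaves it at the next head.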

> 
>> +
>> +	/* Initialize pkt residual structure */
>> +	memset(&pkt->work, 0, vsock_hlen - sizeof(struct virtio_vsock_hdr) -
>> +			sizeof(struct virtio_vsock_mrg_rxbuf_hdr));
>> +
>> +	pkt->mergeable = true;
>> +	pkt->len = le32_to_cpu(pkt->hdr.len);
>> +	if (!pkt->len)
>> +		return pkt;
>> +
>> +	len -= vsock_hlen;
>> +	if (len) {
>> +		pkt->vec[pkt->nr_vecs].iov_base = buf + vsock_hlen;
>> +		pkt->vec[pkt->nr_vecs].iov_len = len;
>> +		/* Shared page with pkt, so get page in advance */
>> +		get_page(virt_to_head_page(buf));
>> +		pkt->nr_vecs++;
>> +	}
>> +
>> +	while (--num_buf) {
>> +		buf = virtqueue_get_buf(vq, &len);
>> +		if (!buf)
>> +			goto err;
>> +
>> +		*total_len += len;
>> +		vsock->rx_buf_nr--;
>> +
>> +		pkt->vec[pkt->nr_vecs].iov_base = buf;
>> +		pkt->vec[pkt->nr_vecs].iov_len = len;
>> +		pkt->nr_vecs++;
>> +	}
>> +
>> +	return pkt;
>> +err:
>> +	virtio_transport_free_pkt(pkt);
>> +	return NULL;
>> +}
>> +
>>  static void virtio_transport_rx_work(struct work_struct *work)
>>  {
>>  	struct virtio_vsock *vsock =
>>  		container_of(work, struct virtio_vsock, rx_work);
>>  	struct virtqueue *vq;
>> +	size_t vsock_hlen = vsock->mergeable ? sizeof(struct virtio_vsock_pkt) :
>> +			sizeof(struct virtio_vsock_hdr);
>>
>>  	vq = vsock->vqs[VSOCK_VQ_RX];
>>
>> @@ -412,21 +486,26 @@ static void virtio_transport_rx_work(struct work_struct *work)
>>  				goto out;
>>  			}
>>
>> -			pkt = virtqueue_get_buf(vq, &len);
>> -			if (!pkt) {
>> -				break;
>> -			}
>> +			if (likely(vsock->mergeable)) {
>> +				pkt = receive_mergeable(vq, vsock, &len);
>> +				if (!pkt)
>> +					break;
>> +			} else {
>> +				pkt = virtqueue_get_buf(vq, &len);
>> +				if (!pkt)
>> +					break;
>>
> 
> So looking at it, this seems to be the main source of the gain.
> But why does this require host/guest changes?
> 
> 
> The way I see it:
> 	- get a buffer and create an skb
> 	- get the next one, check header matches, if yes
> 	  tack it on the skb as a fragment. If not then
> 	  don't, deliver previous one and queue the new one.
> 
> 

I explained the reason for the vhost change above. I would also like to
use kvec instead of buf, since buf can only describe a single
physically contiguous region of memory.

Thanks,
Yiwen.
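
The receive-side alternative outlined in the quoted list above (attach
a buffer to the previous packet when its header matches, otherwise
deliver the previous one and start a new one) can be sketched as a toy
model in plain C. This is not kernel code: the toy_* types are
hypothetical, and treating "header matches" as "same source and
destination port" is an assumption made only for this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One received buffer with its own header (toy model). */
struct toy_chunk {
	uint32_t src_port;
	uint32_t dst_port;
	uint32_t len;
};

/* One delivered packet, possibly built from several chunks. */
struct toy_pkt {
	uint32_t src_port;
	uint32_t dst_port;
	uint32_t len;
	unsigned int nr_frags;
};

/* Assumption: two headers "match" when both ports are equal. */
static int toy_hdr_matches(const struct toy_pkt *p, const struct toy_chunk *c)
{
	return p->src_port == c->src_port && p->dst_port == c->dst_port;
}

/*
 * Walk the chunks in arrival order. A chunk whose header matches the
 * packet being built is added to it as a fragment. Otherwise the
 * packet being built is considered delivered and a new one is started.
 * Returns the number of packets written to out.
 */
static size_t toy_coalesce(const struct toy_chunk *in, size_t n,
			   struct toy_pkt *out, size_t out_cap)
{
	size_t i, nr_out = 0;

	assert(in && out);
	for (i = 0; i < n; i++) {
		if (nr_out && toy_hdr_matches(&out[nr_out - 1], &in[i])) {
			out[nr_out - 1].len += in[i].len;
			out[nr_out - 1].nr_frags++;
			continue;
		}
		if (nr_out == out_cap)
			break;
		out[nr_out].src_port = in[i].src_port;
		out[nr_out].dst_port = in[i].dst_port;
		out[nr_out].len = in[i].len;
		out[nr_out].nr_frags = 1;
		nr_out++;
	}
	return nr_out;
}
```

Nothing in this model needs a buffer count in the first header, which
is the point of the suggestion: the grouping is recovered on the
receive side from headers that are already there.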

> 
>> -			vsock->rx_buf_nr--;
>> +				vsock->rx_buf_nr--;
>> +			}
>>
>>  			/* Drop short/long packets */
>> -			if (unlikely(len < sizeof(pkt->hdr) ||
>> -				     len > sizeof(pkt->hdr) + pkt->len)) {
>> +			if (unlikely(len < vsock_hlen ||
>> +				     len > vsock_hlen + pkt->len)) {
>>  				virtio_transport_free_pkt(pkt);
>>  				continue;
>>  			}
>>
>> -			pkt->len = len - sizeof(pkt->hdr);
>> +			pkt->len = len - vsock_hlen;
>>  			virtio_transport_deliver_tap_pkt(pkt);
>>  			virtio_transport_recv_pkt(pkt);
>>  		}
>> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
>> index 3ae3a33..123a8b6 100644
>> --- a/net/vmw_vsock/virtio_transport_common.c
>> +++ b/net/vmw_vsock/virtio_transport_common.c
>> @@ -44,6 +44,7 @@ static const struct virtio_transport *virtio_transport_get_ops(void)
>>  {
>>  	struct virtio_vsock_pkt *pkt;
>>  	int err;
>> +	void *buf = NULL;
>>
>>  	pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
>>  	if (!pkt)
>> @@ -62,12 +63,16 @@ static const struct virtio_transport *virtio_transport_get_ops(void)
>>  	pkt->vsk		= info->vsk;
>>
>>  	if (info->msg && len > 0) {
>> -		pkt->buf = kmalloc(len, GFP_KERNEL);
>> -		if (!pkt->buf)
>> +		buf = kmalloc(len, GFP_KERNEL);
>> +		if (!buf)
>>  			goto out_pkt;
>> -		err = memcpy_from_msg(pkt->buf, info->msg, len);
>> +		err = memcpy_from_msg(buf, info->msg, len);
>>  		if (err)
>>  			goto out;
>> +
>> +		pkt->vec[0].iov_base = buf;
>> +		pkt->vec[0].iov_len = len;
>> +		pkt->nr_vecs = 1;
>>  	}
>>
>>  	trace_virtio_transport_alloc_pkt(src_cid, src_port,
>> @@ -80,7 +85,7 @@ static const struct virtio_transport *virtio_transport_get_ops(void)
>>  	return pkt;
>>
>>  out:
>> -	kfree(pkt->buf);
>> +	kfree(buf);
>>  out_pkt:
>>  	kfree(pkt);
>>  	return NULL;
>> @@ -92,6 +97,7 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
>>  	struct virtio_vsock_pkt *pkt = opaque;
>>  	struct af_vsockmon_hdr *hdr;
>>  	struct sk_buff *skb;
>> +	int i;
>>
>>  	skb = alloc_skb(sizeof(*hdr) + sizeof(pkt->hdr) + pkt->len,
>>  			GFP_ATOMIC);
>> @@ -134,7 +140,8 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
>>  	skb_put_data(skb, &pkt->hdr, sizeof(pkt->hdr));
>>
>>  	if (pkt->len) {
>> -		skb_put_data(skb, pkt->buf, pkt->len);
>> +		for (i = 0; i < pkt->nr_vecs; i++)
>> +			skb_put_data(skb, pkt->vec[i].iov_base, pkt->vec[i].iov_len);
>>  	}
>>
>>  	return skb;
>> @@ -260,6 +267,9 @@ static int virtio_transport_send_credit_update(struct vsock_sock *vsk,
>>
>>  	spin_lock_bh(&vvs->rx_lock);
>>  	while (total < len && !list_empty(&vvs->rx_queue)) {
>> +		size_t copy_bytes, last_vec_total = 0, vec_off;
>> +		int i;
>> +
>>  		pkt = list_first_entry(&vvs->rx_queue,
>>  				       struct virtio_vsock_pkt, list);
>>
>> @@ -272,14 +282,28 @@ static int virtio_transport_send_credit_update(struct vsock_sock *vsk,
>>  		 */
>>  		spin_unlock_bh(&vvs->rx_lock);
>>
>> -		err = memcpy_to_msg(msg, pkt->buf + pkt->off, bytes);
>> -		if (err)
>> -			goto out;
>> +		for (i = 0; i < pkt->nr_vecs; i++) {
>> +			if (pkt->off > last_vec_total + pkt->vec[i].iov_len) {
>> +				last_vec_total += pkt->vec[i].iov_len;
>> +				continue;
>> +			}
>> +
>> +			vec_off = pkt->off - last_vec_total;
>> +			copy_bytes = min(pkt->vec[i].iov_len - vec_off, bytes);
>> +			err = memcpy_to_msg(msg, pkt->vec[i].iov_base + vec_off,
>> +					copy_bytes);
>> +			if (err)
>> +				goto out;
>> +
>> +			bytes -= copy_bytes;
>> +			pkt->off += copy_bytes;
>> +			total += copy_bytes;
>> +			last_vec_total += pkt->vec[i].iov_len;
>> +			if (!bytes)
>> +				break;
>> +		}
>>
>>  		spin_lock_bh(&vvs->rx_lock);
>> -
>> -		total += bytes;
>> -		pkt->off += bytes;
>>  		if (pkt->off == pkt->len) {
>>  			virtio_transport_dec_rx_pkt(vvs, pkt);
>>  			list_del(&pkt->list);
>> @@ -1050,8 +1074,17 @@ void virtio_transport_recv_pkt(struct virtio_vsock_pkt *pkt)
>>
>>  void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
>>  {
>> -	kfree(pkt->buf);
>> -	kfree(pkt);
>> +	int i;
>> +
>> +	if (pkt->mergeable) {
>> +		for (i = 0; i < pkt->nr_vecs; i++)
>> +			put_page(virt_to_head_page(pkt->vec[i].iov_base));
>> +		put_page(virt_to_head_page((void *)pkt));
>> +	} else {
>> +		for (i = 0; i < pkt->nr_vecs; i++)
>> +			kfree(pkt->vec[i].iov_base);
>> +		kfree(pkt);
>> +	}
>>  }
>>  EXPORT_SYMBOL_GPL(virtio_transport_free_pkt);
>>
>> -- 
>> 1.8.3.1
>>
> 
> .
> 

^ permalink raw reply

* Re: [PATCH net V2 3/4] Revert "net: vhost: lock the vqs one by one"
From: Jason Wang @ 2018-12-13  2:27 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181212092102-mutt-send-email-mst@kernel.org>


On 2018/12/12 10:24 PM, Michael S. Tsirkin wrote:
> On Wed, Dec 12, 2018 at 06:08:18PM +0800, Jason Wang wrote:
>> This reverts commit 78139c94dc8c96a478e67dab3bee84dc6eccb5fd. We don't
>> protect the device IOTLB with the vq mutex, which can lead to e.g.
>> use-after-free of device IOTLB entries. And since we've switched to
>> mutex_trylock() in the previous patch, it's safe to revert it without
>> risking a deadlock.
>>
>> Fixes: commit 78139c94dc8c ("net: vhost: lock the vqs one by one")
>> Cc: Tonghao Zhang<xiangxia.m.yue@gmail.com>
>> Signed-off-by: Jason Wang<jasowang@redhat.com>
> Acked-by: Michael S. Tsirkin<mst@redhat.com>
>
> I'd try to put this in 4.20 if we can
> and it's needed for -stable I think.
>
> Also looks like we should allow iotlb entries per vq
> to improve locking. What do you think?
>

Yes, we can do it for -next.

Thanks

_______________________________________________
Virtualization mailing list
Virtualization@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/virtualization

^ permalink raw reply

* Re: [PATCH v2 0/5] VSOCK: support mergeable rx buffer in vhost-vsock
From: jiangyiwen @ 2018-12-13  2:14 UTC (permalink / raw)
  To: Michael S. Tsirkin; +Cc: netdev, kvm, Stefan Hajnoczi, virtualization
In-Reply-To: <20181212100835-mutt-send-email-mst@kernel.org>

On 2018/12/12 23:09, Michael S. Tsirkin wrote:
> On Wed, Dec 12, 2018 at 05:25:50PM +0800, jiangyiwen wrote:
>> Currently vsock only supports sending/receiving small packets, so it
>> can't achieve high performance. As previously discussed with Jason Wang,
>> I revisited the vhost-net idea of mergeable rx buffers and implemented
>> them in vhost-vsock. This allows a big packet to be scattered into
>> different buffers and improves performance noticeably.
>>
>> This series of patches mainly did three things:
>> - mergeable buffer implementation
>> - increase the max send pkt size
>> - add used and signal guest in a batch
>>
>> I wrote a tool to test vhost-vsock performance, mainly sending big
>> packets (64K) in both directions, Guest->Host and Host->Guest. I tested
>> each configuration independently; the results are as follows:
>>
>> Before:
>>               Single socket            Multiple sockets(Max Bandwidth)
>> Guest->Host   ~400MB/s                 ~480MB/s
>> Host->Guest   ~1450MB/s                ~1600MB/s
>>
>> After, with only the mergeable rx buffer implemented:
>>               Single socket            Multiple sockets(Max Bandwidth)
>> Guest->Host   ~400MB/s                 ~480MB/s
>> Host->Guest   ~1280MB/s                ~1350MB/s
>>
>> In this case, the max send pkt size is still limited to 4K, so Host->Guest
>> performance is worse than before.
> 
> It's concerning though, what if application sends small packets?
> What is the source of the slowdown? Do you know?
> 

Hi Michael,

I measured the "before" numbers a month ago and did not rerun them this
time, so the results may fluctuate somewhat. Today I will retest all cases,
including small and big packets, and try to find the reason for the slowdown.

Thanks,
Yiwen.

>> After, with the max send pkt size increased to 64K:
>>               Single socket            Multiple sockets(Max Bandwidth)
>> Guest->Host   ~1700MB/s                ~2900MB/s
>> Host->Guest   ~1500MB/s                ~2440MB/s
>>
>> After, with all patches applied:
>>               Single socket            Multiple sockets(Max Bandwidth)
>> Guest->Host   ~1700MB/s                ~2900MB/s
>> Host->Guest   ~1700MB/s                ~2900MB/s
>>
>> From the test results, performance is clearly improved, and guest
>> memory is not wasted.
>>
>> In addition, in order to support mergeable rx buffers in virtio-vsock,
>> a qemu patch is needed so that the feature can be parsed.
>>
>> ---
>> v1 -> v2:
>>  * Addressed comments from Jason Wang.
>>  * Add performance test result independently.
>>  * Use skb_page_frag_refill(), which can use high-order pages and reduce
>>    the stress on the page allocator.
>>  * Still use a fixed size (PAGE_SIZE) to fill rx buffers, because too small
>>    a size can't hold one full packet; the vq has only 128 entries now.
>>  * Use iovec to replace buf in struct virtio_vsock_pkt, keeping tx and rx
>>    consistent.
>>  * Add a virtio_transport op to get the max pkt len, in order to be
>>    compatible with old versions.
>> ---
>>
>> Yiwen Jiang (5):
>>   VSOCK: support fill mergeable rx buffer in guest
>>   VSOCK: support fill data to mergeable rx buffer in host
>>   VSOCK: support receive mergeable rx buffer in guest
>>   VSOCK: increase send pkt len in mergeable mode to improve performance
>>   VSOCK: batch sending rx buffer to increase bandwidth
>>
>>  drivers/vhost/vsock.c                   | 183 ++++++++++++++++++++-----
>>  include/linux/virtio_vsock.h            |  13 +-
>>  include/uapi/linux/virtio_vsock.h       |   5 +
>>  net/vmw_vsock/virtio_transport.c        | 229 +++++++++++++++++++++++++++-----
>>  net/vmw_vsock/virtio_transport_common.c |  66 ++++++---
>>  5 files changed, 411 insertions(+), 85 deletions(-)
>>
>> -- 
>> 1.8.3.1
> 
> .
> 



^ permalink raw reply

* Re: [PATCH] vhost: correct the related warning message
From: Michael S. Tsirkin @ 2018-12-13  2:03 UTC (permalink / raw)
  To: wangyan; +Cc: kvm, netdev, virtualization, piaojun, viro, hch
In-Reply-To: <5C11B176.4040704@huawei.com>

On Thu, Dec 13, 2018 at 09:10:14AM +0800, wangyan wrote:
> Fixes: 'commit d588cf8f618d ("target: Fix se_tpg_tfo->tf_subsys regression + remove tf_subsystem")'
>        'commit cbbd26b8b1a6 ("[iov_iter] new primitives - copy_from_iter_full() and friends")'
> 
> Signed-off-by: Yan Wang <wangyan122@huawei.com>
> Reviewed-by: Jun Piao <piaojun@huawei.com>

Applied, thanks!

> ---
>  drivers/vhost/scsi.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/vhost/scsi.c b/drivers/vhost/scsi.c
> index 50dffe8..b459b69 100644
> --- a/drivers/vhost/scsi.c
> +++ b/drivers/vhost/scsi.c
> @@ -889,7 +889,7 @@ static void vhost_scsi_submission_work(struct work_struct *work)
>  
>  	if (unlikely(!copy_from_iter_full(vc->req, vc->req_size,
>  					  &vc->out_iter))) {
> -		vq_err(vq, "Faulted on copy_from_iter\n");
> +		vq_err(vq, "Faulted on copy_from_iter_full\n");
>  	} else if (unlikely(*vc->lunp != 1)) {
>  		/* virtio-scsi spec requires byte 0 of the lun to be 1 */
>  		vq_err(vq, "Illegal virtio-scsi lun: %u\n", *vc->lunp);
> @@ -1441,7 +1441,7 @@ static void vhost_scsi_flush(struct vhost_scsi *vs)
>  			se_tpg = &tpg->se_tpg;
>  			ret = target_depend_item(&se_tpg->tpg_group.cg_item);
>  			if (ret) {
> -				pr_warn("configfs_depend_item() failed: %d\n", ret);
> +				pr_warn("target_depend_item() failed: %d\n", ret);
>  				kfree(vs_tpg);
>  				mutex_unlock(&tpg->tv_tpg_mutex);
>  				goto out;
> -- 
> 1.8.3.1
> 

^ permalink raw reply

* Re: [PATCH net V2 0/4] Fix various issue of vhost
From: David Miller @ 2018-12-12 23:31 UTC (permalink / raw)
  To: jasowang; +Cc: netdev, virtualization, linux-kernel, kvm, mst
In-Reply-To: <20181212100819.21295-1-jasowang@redhat.com>

From: Jason Wang <jasowang@redhat.com>
Date: Wed, 12 Dec 2018 18:08:15 +0800

> This series tries to fix various issues of vhost:
> 
> - Patch 1 adds a missing write barrier between used idx updating and
>   logging.
> - Patch 2-3 brings back the protection of device IOTLB through vq
>   mutex, this fixes possible use after free in device IOTLB entries.
> - Patch 4-7 fixes the dirty page logging when device IOTLB is
>   enabled. It should be done through GPA instead of GIOVA; this is done
>   by introducing an HVA->GPA reverse mapping and converting HVA to GPA
>   when logging dirty pages.
> 
> Please consider them for -stable.
> 
> Thanks
> 
> Changes from V1:
> - silence compiler warning for 32bit.
> - use mutex_trylock() on slowpath instead of mutex_lock() even on fast
>   path.

Hello Jason.

Looks like Michael wants you to split out patch #4 and target
net-next with it.

So please do that and respin the first 3 patches here with Michael's
ACKs.

Thanks.

^ permalink raw reply

* Re: [PATCH v2 2/5] VSOCK: support fill data to mergeable rx buffer in host
From: David Miller @ 2018-12-12 19:09 UTC (permalink / raw)
  To: jiangyiwen; +Cc: kvm, mst, netdev, virtualization, stefanha
In-Reply-To: <5C10D4FB.6070009@huawei.com>

From: jiangyiwen <jiangyiwen@huawei.com>
Date: Wed, 12 Dec 2018 17:29:31 +0800

> diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
> index 1d57ed3..2292f30 100644
> --- a/include/uapi/linux/virtio_vsock.h
> +++ b/include/uapi/linux/virtio_vsock.h
> @@ -63,6 +63,11 @@ struct virtio_vsock_hdr {
>  	__le32	fwd_cnt;
>  } __attribute__((packed));
> 
> +/* It add mergeable rx buffers feature */
> +struct virtio_vsock_mrg_rxbuf_hdr {
> +	__le16  num_buffers;    /* number of mergeable rx buffers */
> +} __attribute__((packed));
> +

I know the rest of this file uses 'packed' but this attribute should
only be used if absolutely necessary as it incurs a
non-trivial performance penalty for some architectures.
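
A small userspace illustration of the point, assuming GCC or Clang
(the packed attribute is a compiler extension) and a platform where
uint16_t has natural alignment 2. The toy_* names are hypothetical.
For a struct that holds a single 16-bit field, packed does not change
the size. It only lowers the alignment the compiler may assume to 1,
which is what can force slower byte-wise accesses on
strict-alignment architectures.

```c
#include <assert.h>
#include <stdalign.h>
#include <stdint.h>

/* Same layout as the proposed header, with the attribute. */
struct toy_mrg_hdr_packed {
	uint16_t num_buffers;
} __attribute__((packed));

/* Same field without the attribute. */
struct toy_mrg_hdr_natural {
	uint16_t num_buffers;
};

/* Packing cannot shrink a struct that has no padding to begin with. */
static_assert(sizeof(struct toy_mrg_hdr_packed) ==
	      sizeof(struct toy_mrg_hdr_natural),
	      "packed does not shrink a lone 16-bit field");

/*
 * Read a little-endian 16-bit value from an arbitrary byte address.
 * This works at any alignment and on any host byte order, without
 * relying on a packed struct.
 */
static uint16_t toy_read_le16(const unsigned char *p)
{
	return (uint16_t)(p[0] | (p[1] << 8));
}
```

On such a platform, alignof() reports 1 for the packed variant and 2
for the natural one, while both are 2 bytes in size.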

^ permalink raw reply

* Re: [PATCH v2 1/5] VSOCK: support fill mergeable rx buffer in guest
From: David Miller @ 2018-12-12 19:08 UTC (permalink / raw)
  To: jiangyiwen; +Cc: kvm, mst, netdev, virtualization, stefanha
In-Reply-To: <5C10D4B0.8080504@huawei.com>

From: jiangyiwen <jiangyiwen@huawei.com>
Date: Wed, 12 Dec 2018 17:28:16 +0800

> +static int fill_mergeable_rx_buff(struct virtio_vsock *vsock,
> +		struct virtqueue *vq)
> +{
> +	struct page_frag *alloc_frag = &vsock->alloc_frag;
> +	struct scatterlist sg;
> +	/* Currently we don't use ewma len, use PAGE_SIZE instead, because too
> +	 * small size can't fill one full packet, sadly we only 128 vq num now.
> +	 */
> +	unsigned int len = PAGE_SIZE, hole;
> +	void *buf;
> +	int err;

Please don't break up a set of local variable declarations with a
comment like this.  The comment seems to be about the initialization
of 'len', so move that initialization into the code below the variable
declarations and bring the comment along for the ride as well.
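
A sketch of the layout being asked for, written as a standalone C
function rather than the real fill_mergeable_rx_buff(). The
declarations come first in one uninterrupted group, and the comment
travels with the assignment of 'len'. TOY_PAGE_SIZE stands in for
PAGE_SIZE, and the rest of the body is hypothetical, invented only so
that the example compiles and can be exercised.

```c
#include <assert.h>

#define TOY_PAGE_SIZE 4096u /* assumption: stands in for the kernel's PAGE_SIZE */

/*
 * Pick the length of the next rx buffer carved out of a page fragment of
 * frag_size bytes, frag_off of which are already used. Returns 0 when
 * the fragment has less than one page left.
 */
static unsigned int toy_pick_rx_len(unsigned int frag_size,
				    unsigned int frag_off)
{
	unsigned int len, hole;

	assert(frag_off <= frag_size);

	/* Currently we don't use ewma len, use PAGE_SIZE instead, because too
	 * small size can't fill one full packet, sadly we only 128 vq num now.
	 */
	len = TOY_PAGE_SIZE;

	if (frag_size - frag_off < len)
		return 0;

	/* Hypothetical: absorb a tail too small to hold another buffer. */
	hole = frag_size - frag_off - len;
	if (hole < len)
		len += hole;

	return len;
}
```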

^ permalink raw reply

* Re: [PATCH v2 2/5] VSOCK: support fill data to mergeable rx buffer in host
From: Michael S. Tsirkin @ 2018-12-12 15:37 UTC (permalink / raw)
  To: jiangyiwen; +Cc: netdev, kvm, Stefan Hajnoczi, virtualization
In-Reply-To: <5C10D4FB.6070009@huawei.com>

On Wed, Dec 12, 2018 at 05:29:31PM +0800, jiangyiwen wrote:
> When vhost supports the VIRTIO_VSOCK_F_MRG_RXBUF feature,
> it will merge big packets into the rx vq.
> 
> Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>

I feel this approach jumps into making interface changes for
optimizations too quickly. For example, what prevents us
from taking a big buffer, prepending each chunk
with the header and writing it out without
host/guest interface changes?

This should allow optimizations such as vhost_add_used_n
batching.

I realize a header in each packet does have a cost,
but it also has advantages such as improved robustness,
I'd like to see more of an apples to apples comparison
of the performance gain from skipping them.
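
As a rough sketch of what that alternative costs, the toy C model
below splits one large payload into chunks that each fit an existing
rx buffer and counts the per-chunk headers. It is not kernel code. The
4096-byte buffer size matches VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE from
the patch context; the 44-byte header size is an assumption about the
packed struct virtio_vsock_hdr, and the toy_* names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_HDR_LEN    44u   /* assumption: size of the packed virtio_vsock_hdr */
#define TOY_RX_BUF_LEN 4096u /* payload capacity of one existing rx buffer */

/* One chunk of the original payload; each chunk gets its own header. */
struct toy_chunk {
	uint32_t off; /* offset of this chunk in the original payload */
	uint32_t len; /* payload bytes carried by this chunk */
};

/*
 * Split payload_len bytes into chunks of at most TOY_RX_BUF_LEN bytes.
 * Returns the number of chunks written to out (at most out_cap).
 */
static size_t toy_split(uint32_t payload_len, struct toy_chunk *out,
			size_t out_cap)
{
	size_t n = 0;
	uint32_t off = 0;

	assert(out);
	while (off < payload_len && n < out_cap) {
		uint32_t len = payload_len - off;

		if (len > TOY_RX_BUF_LEN)
			len = TOY_RX_BUF_LEN;
		out[n].off = off;
		out[n].len = len;
		n++;
		off += len;
	}
	return n;
}

/* Total header bytes written when every chunk carries a full header. */
static size_t toy_hdr_overhead(size_t nr_chunks)
{
	return nr_chunks * TOY_HDR_LEN;
}
```

Under these assumptions a 64K payload becomes 16 chunks carrying 704
bytes of headers in total, about 1% of the payload. That is the cost
an apples to apples comparison would weigh against the interface
change.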


> ---
>  drivers/vhost/vsock.c             | 111 ++++++++++++++++++++++++++++++--------
>  include/linux/virtio_vsock.h      |   1 +
>  include/uapi/linux/virtio_vsock.h |   5 ++
>  3 files changed, 94 insertions(+), 23 deletions(-)
> 
> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index 34bc3ab..dc52b0f 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -22,7 +22,8 @@
>  #define VHOST_VSOCK_DEFAULT_HOST_CID	2
> 
>  enum {
> -	VHOST_VSOCK_FEATURES = VHOST_FEATURES,
> +	VHOST_VSOCK_FEATURES = VHOST_FEATURES |
> +			(1ULL << VIRTIO_VSOCK_F_MRG_RXBUF),
>  };
> 
>  /* Used to track all the vhost_vsock instances on the system. */
> @@ -80,6 +81,69 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
>  	return vsock;
>  }
> 
> +/* This segment of codes are copied from drivers/vhost/net.c */
> +static int get_rx_bufs(struct vhost_virtqueue *vq,
> +		struct vring_used_elem *heads, int datalen,
> +		unsigned *iovcount, unsigned int quota)
> +{
> +	unsigned int out, in;
> +	int seg = 0;
> +	int headcount = 0;
> +	unsigned d;
> +	int ret;
> +	/*
> +	 * len is always initialized before use since we are always called with
> +	 * datalen > 0.
> +	 */
> +	u32 uninitialized_var(len);
> +
> +	while (datalen > 0 && headcount < quota) {
> +		if (unlikely(seg >= UIO_MAXIOV)) {
> +			ret = -ENOBUFS;
> +			goto err;
> +		}
> +
> +		ret = vhost_get_vq_desc(vq, vq->iov + seg,
> +				ARRAY_SIZE(vq->iov) - seg, &out,
> +				&in, NULL, NULL);
> +		if (unlikely(ret < 0))
> +			goto err;
> +
> +		d = ret;
> +		if (d == vq->num) {
> +			ret = 0;
> +			goto err;
> +		}
> +
> +		if (unlikely(out || in <= 0)) {
> +			vq_err(vq, "unexpected descriptor format for RX: "
> +					"out %d, in %d\n", out, in);
> +			ret = -EINVAL;
> +			goto err;
> +		}
> +
> +		heads[headcount].id = cpu_to_vhost32(vq, d);
> +		len = iov_length(vq->iov + seg, in);
> +		heads[headcount].len = cpu_to_vhost32(vq, len);
> +		datalen -= len;
> +		++headcount;
> +		seg += in;
> +	}
> +
> +	heads[headcount - 1].len = cpu_to_vhost32(vq, len + datalen);
> +	*iovcount = seg;
> +
> +	/* Detect overrun */
> +	if (unlikely(datalen > 0)) {
> +		ret = UIO_MAXIOV + 1;
> +		goto err;
> +	}
> +	return headcount;
> +err:
> +	vhost_discard_vq_desc(vq, headcount);
> +	return ret;
> +}
> +
>  static void
>  vhost_transport_do_send_pkt(struct vhost_vsock *vsock,
>  			    struct vhost_virtqueue *vq)
> @@ -87,22 +151,34 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
>  	struct vhost_virtqueue *tx_vq = &vsock->vqs[VSOCK_VQ_TX];
>  	bool added = false;
>  	bool restart_tx = false;
> +	int mergeable;
> +	size_t vsock_hlen;
> 
>  	mutex_lock(&vq->mutex);
> 
>  	if (!vq->private_data)
>  		goto out;
> 
> +	mergeable = vhost_has_feature(vq, VIRTIO_VSOCK_F_MRG_RXBUF);
> +	/*
> +	 * Guest fill page for rx vq in mergeable case, so it will not
> +	 * allocate pkt structure, we should reserve size of pkt in advance.
> +	 */
> +	if (likely(mergeable))
> +		vsock_hlen = sizeof(struct virtio_vsock_pkt);
> +	else
> +		vsock_hlen = sizeof(struct virtio_vsock_hdr);
> +
>  	/* Avoid further vmexits, we're already processing the virtqueue */
>  	vhost_disable_notify(&vsock->dev, vq);
> 
>  	for (;;) {
>  		struct virtio_vsock_pkt *pkt;
>  		struct iov_iter iov_iter;
> -		unsigned out, in;
> +		unsigned out = 0, in = 0;
>  		size_t nbytes;
>  		size_t len;
> -		int head;
> +		s16 headcount;
> 
>  		spin_lock_bh(&vsock->send_pkt_list_lock);
>  		if (list_empty(&vsock->send_pkt_list)) {
> @@ -116,16 +192,9 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
>  		list_del_init(&pkt->list);
>  		spin_unlock_bh(&vsock->send_pkt_list_lock);
> 
> -		head = vhost_get_vq_desc(vq, vq->iov, ARRAY_SIZE(vq->iov),
> -					 &out, &in, NULL, NULL);
> -		if (head < 0) {
> -			spin_lock_bh(&vsock->send_pkt_list_lock);
> -			list_add(&pkt->list, &vsock->send_pkt_list);
> -			spin_unlock_bh(&vsock->send_pkt_list_lock);
> -			break;
> -		}
> -
> -		if (head == vq->num) {
> +		headcount = get_rx_bufs(vq, vq->heads, vsock_hlen + pkt->len,
> +				&in, likely(mergeable) ? UIO_MAXIOV : 1);
> +		if (headcount <= 0) {
>  			spin_lock_bh(&vsock->send_pkt_list_lock);
>  			list_add(&pkt->list, &vsock->send_pkt_list);
>  			spin_unlock_bh(&vsock->send_pkt_list_lock);
> @@ -133,24 +202,20 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
>  			/* We cannot finish yet if more buffers snuck in while
>  			 * re-enabling notify.
>  			 */
> -			if (unlikely(vhost_enable_notify(&vsock->dev, vq))) {
> +			if (!headcount && unlikely(vhost_enable_notify(&vsock->dev, vq))) {
>  				vhost_disable_notify(&vsock->dev, vq);
>  				continue;
>  			}
>  			break;
>  		}
> 
> -		if (out) {
> -			virtio_transport_free_pkt(pkt);
> -			vq_err(vq, "Expected 0 output buffers, got %u\n", out);
> -			break;
> -		}
> -
>  		len = iov_length(&vq->iov[out], in);
>  		iov_iter_init(&iov_iter, READ, &vq->iov[out], in, len);
> 
> -		nbytes = copy_to_iter(&pkt->hdr, sizeof(pkt->hdr), &iov_iter);
> -		if (nbytes != sizeof(pkt->hdr)) {
> +		if (likely(mergeable))
> +			pkt->mrg_rxbuf_hdr.num_buffers = cpu_to_le16(headcount);
> +		nbytes = copy_to_iter(&pkt->hdr, vsock_hlen, &iov_iter);
> +		if (nbytes != vsock_hlen) {
>  			virtio_transport_free_pkt(pkt);
>  			vq_err(vq, "Faulted on copying pkt hdr\n");
>  			break;
> @@ -163,7 +228,7 @@ static struct vhost_vsock *vhost_vsock_get(u32 guest_cid)
>  			break;
>  		}
> 
> -		vhost_add_used(vq, head, sizeof(pkt->hdr) + pkt->len);
> +		vhost_add_used_n(vq, vq->heads, headcount);
>  		added = true;
> 
>  		if (pkt->reply) {
> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> index bf84418..da9e1fe 100644
> --- a/include/linux/virtio_vsock.h
> +++ b/include/linux/virtio_vsock.h
> @@ -50,6 +50,7 @@ struct virtio_vsock_sock {
> 
>  struct virtio_vsock_pkt {
>  	struct virtio_vsock_hdr	hdr;
> +	struct virtio_vsock_mrg_rxbuf_hdr mrg_rxbuf_hdr;
>  	struct work_struct work;
>  	struct list_head list;
>  	/* socket refcnt not held, only use for cancellation */
> diff --git a/include/uapi/linux/virtio_vsock.h b/include/uapi/linux/virtio_vsock.h
> index 1d57ed3..2292f30 100644
> --- a/include/uapi/linux/virtio_vsock.h
> +++ b/include/uapi/linux/virtio_vsock.h
> @@ -63,6 +63,11 @@ struct virtio_vsock_hdr {
>  	__le32	fwd_cnt;
>  } __attribute__((packed));
> 
> +/* It add mergeable rx buffers feature */
> +struct virtio_vsock_mrg_rxbuf_hdr {
> +	__le16  num_buffers;    /* number of mergeable rx buffers */
> +} __attribute__((packed));
> +
>  enum virtio_vsock_type {
>  	VIRTIO_VSOCK_TYPE_STREAM = 1,
>  };
> -- 
> 1.8.3.1
> 

^ permalink raw reply

* Re: [PATCH v2 3/5] VSOCK: support receive mergeable rx buffer in guest
From: Michael S. Tsirkin @ 2018-12-12 15:31 UTC (permalink / raw)
  To: jiangyiwen; +Cc: netdev, kvm, Stefan Hajnoczi, virtualization
In-Reply-To: <5C10D57B.3070701@huawei.com>

On Wed, Dec 12, 2018 at 05:31:39PM +0800, jiangyiwen wrote:
> When the guest receives mergeable rx buffers, it can merge the
> scattered rx buffers into one big buffer and then copy it
> to user space.
> 
> In addition, use iovec to replace buf in struct
> virtio_vsock_pkt, keeping tx and rx consistent. The only
> difference is that tx still uses a single segment of contiguous
> physical memory.
> 
> Signed-off-by: Yiwen Jiang <jiangyiwen@huawei.com>
> ---
>  drivers/vhost/vsock.c                   |  31 +++++++---
>  include/linux/virtio_vsock.h            |   6 +-
>  net/vmw_vsock/virtio_transport.c        | 105 ++++++++++++++++++++++++++++----
>  net/vmw_vsock/virtio_transport_common.c |  59 ++++++++++++++----
>  4 files changed, 166 insertions(+), 35 deletions(-)


This was supposed to be a guest patch, why is vhost changed here?

> diff --git a/drivers/vhost/vsock.c b/drivers/vhost/vsock.c
> index dc52b0f..c7ab0dd 100644
> --- a/drivers/vhost/vsock.c
> +++ b/drivers/vhost/vsock.c
> @@ -179,6 +179,8 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
>  		size_t nbytes;
>  		size_t len;
>  		s16 headcount;
> +		size_t remain_len;
> +		int i;
> 
>  		spin_lock_bh(&vsock->send_pkt_list_lock);
>  		if (list_empty(&vsock->send_pkt_list)) {
> @@ -221,11 +223,19 @@ static int get_rx_bufs(struct vhost_virtqueue *vq,
>  			break;
>  		}
> 
> -		nbytes = copy_to_iter(pkt->buf, pkt->len, &iov_iter);
> -		if (nbytes != pkt->len) {
> -			virtio_transport_free_pkt(pkt);
> -			vq_err(vq, "Faulted on copying pkt buf\n");
> -			break;
> +		remain_len = pkt->len;
> +		for (i = 0; i < pkt->nr_vecs; i++) {
> +			int tmp_len;
> +
> +			tmp_len = min(remain_len, pkt->vec[i].iov_len);
> +			nbytes = copy_to_iter(pkt->vec[i].iov_base, tmp_len, &iov_iter);
> +			if (nbytes != tmp_len) {
> +				virtio_transport_free_pkt(pkt);
> +				vq_err(vq, "Faulted on copying pkt buf\n");
> +				break;
> +			}
> +
> +			remain_len -= tmp_len;
>  		}
> 
>  		vhost_add_used_n(vq, vq->heads, headcount);
> @@ -341,6 +351,7 @@ static void vhost_transport_send_pkt_work(struct vhost_work *work)
>  	struct iov_iter iov_iter;
>  	size_t nbytes;
>  	size_t len;
> +	void *buf;
> 
>  	if (in != 0) {
>  		vq_err(vq, "Expected 0 input buffers, got %u\n", in);
> @@ -375,13 +386,17 @@ static void vhost_transport_send_pkt_work(struct vhost_work *work)
>  		return NULL;
>  	}
> 
> -	pkt->buf = kmalloc(pkt->len, GFP_KERNEL);
> -	if (!pkt->buf) {
> +	buf = kmalloc(pkt->len, GFP_KERNEL);
> +	if (!buf) {
>  		kfree(pkt);
>  		return NULL;
>  	}
> 
> -	nbytes = copy_from_iter(pkt->buf, pkt->len, &iov_iter);
> +	pkt->vec[0].iov_base = buf;
> +	pkt->vec[0].iov_len = pkt->len;
> +	pkt->nr_vecs = 1;
> +
> +	nbytes = copy_from_iter(buf, pkt->len, &iov_iter);
>  	if (nbytes != pkt->len) {
>  		vq_err(vq, "Expected %u byte payload, got %zu bytes\n",
>  		       pkt->len, nbytes);
> diff --git a/include/linux/virtio_vsock.h b/include/linux/virtio_vsock.h
> index da9e1fe..734eeed 100644
> --- a/include/linux/virtio_vsock.h
> +++ b/include/linux/virtio_vsock.h
> @@ -13,6 +13,8 @@
>  #define VIRTIO_VSOCK_DEFAULT_RX_BUF_SIZE	(1024 * 4)
>  #define VIRTIO_VSOCK_MAX_BUF_SIZE		0xFFFFFFFFUL
>  #define VIRTIO_VSOCK_MAX_PKT_BUF_SIZE		(1024 * 64)
> +/* virtio_vsock_pkt + max_pkt_len(default MAX_PKT_BUF_SIZE) */
> +#define VIRTIO_VSOCK_MAX_VEC_NUM ((VIRTIO_VSOCK_MAX_PKT_BUF_SIZE / PAGE_SIZE) + 1)
> 
>  /* Virtio-vsock feature */
>  #define VIRTIO_VSOCK_F_MRG_RXBUF 0 /* Host can merge receive buffers. */
> @@ -55,10 +57,12 @@ struct virtio_vsock_pkt {
>  	struct list_head list;
>  	/* socket refcnt not held, only use for cancellation */
>  	struct vsock_sock *vsk;
> -	void *buf;
> +	struct kvec vec[VIRTIO_VSOCK_MAX_VEC_NUM];
> +	int nr_vecs;
>  	u32 len;
>  	u32 off;
>  	bool reply;
> +	bool mergeable;
>  };
> 
>  struct virtio_vsock_pkt_info {
> diff --git a/net/vmw_vsock/virtio_transport.c b/net/vmw_vsock/virtio_transport.c
> index c4a465c..148b58a 100644
> --- a/net/vmw_vsock/virtio_transport.c
> +++ b/net/vmw_vsock/virtio_transport.c
> @@ -155,8 +155,10 @@ static int virtio_transport_send_pkt_loopback(struct virtio_vsock *vsock,
> 
>  		sg_init_one(&hdr, &pkt->hdr, sizeof(pkt->hdr));
>  		sgs[out_sg++] = &hdr;
> -		if (pkt->buf) {
> -			sg_init_one(&buf, pkt->buf, pkt->len);
> +		if (pkt->len) {
> +			/* Currently only support a segment of memory in tx */
> +			BUG_ON(pkt->vec[0].iov_len != pkt->len);
> +			sg_init_one(&buf, pkt->vec[0].iov_base, pkt->vec[0].iov_len);
>  			sgs[out_sg++] = &buf;
>  		}
> 
> @@ -304,23 +306,28 @@ static int fill_old_rx_buff(struct virtqueue *vq)
>  	struct virtio_vsock_pkt *pkt;
>  	struct scatterlist hdr, buf, *sgs[2];
>  	int ret;
> +	void *pkt_buf;
> 
>  	pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
>  	if (!pkt)
>  		return -ENOMEM;
> 
> -	pkt->buf = kmalloc(buf_len, GFP_KERNEL);
> -	if (!pkt->buf) {
> +	pkt_buf = kmalloc(buf_len, GFP_KERNEL);
> +	if (!pkt_buf) {
>  		virtio_transport_free_pkt(pkt);
>  		return -ENOMEM;
>  	}
> 
> +	pkt->vec[0].iov_base = pkt_buf;
> +	pkt->vec[0].iov_len = buf_len;
> +	pkt->nr_vecs = 1;
> +
>  	pkt->len = buf_len;
> 
>  	sg_init_one(&hdr, &pkt->hdr, sizeof(pkt->hdr));
>  	sgs[0] = &hdr;
> 
> -	sg_init_one(&buf, pkt->buf, buf_len);
> +	sg_init_one(&buf, pkt->vec[0].iov_base, buf_len);
>  	sgs[1] = &buf;
>  	ret = virtqueue_add_sgs(vq, sgs, 0, 2, pkt, GFP_KERNEL);
>  	if (ret)
> @@ -388,11 +395,78 @@ static bool virtio_transport_more_replies(struct virtio_vsock *vsock)
>  	return val < virtqueue_get_vring_size(vq);
>  }
> 
> +static struct virtio_vsock_pkt *receive_mergeable(struct virtqueue *vq,
> +		struct virtio_vsock *vsock, unsigned int *total_len)
> +{
> +	struct virtio_vsock_pkt *pkt;
> +	u16 num_buf;
> +	void *buf;
> +	unsigned int len;
> +	size_t vsock_hlen = sizeof(struct virtio_vsock_pkt);
> +
> +	buf = virtqueue_get_buf(vq, &len);
> +	if (!buf)
> +		return NULL;
> +
> +	*total_len = len;
> +	vsock->rx_buf_nr--;
> +
> +	if (unlikely(len < vsock_hlen)) {
> +		put_page(virt_to_head_page(buf));
> +		return NULL;
> +	}
> +
> +	pkt = buf;
> +	num_buf = le16_to_cpu(pkt->mrg_rxbuf_hdr.num_buffers);
> +	if (!num_buf || num_buf > VIRTIO_VSOCK_MAX_VEC_NUM) {
> +		put_page(virt_to_head_page(buf));
> +		return NULL;
> +	}

So everything just stops, and neither the host nor the user knows
why. And not only that: the next packet will be corrupted, because
we skipped the first one without consuming its remaining buffers.



> +
> +	/* Initialize pkt residual structure */
> +	memset(&pkt->work, 0, vsock_hlen - sizeof(struct virtio_vsock_hdr) -
> +			sizeof(struct virtio_vsock_mrg_rxbuf_hdr));
> +
> +	pkt->mergeable = true;
> +	pkt->len = le32_to_cpu(pkt->hdr.len);
> +	if (!pkt->len)
> +		return pkt;
> +
> +	len -= vsock_hlen;
> +	if (len) {
> +		pkt->vec[pkt->nr_vecs].iov_base = buf + vsock_hlen;
> +		pkt->vec[pkt->nr_vecs].iov_len = len;
> +		/* Shared page with pkt, so get page in advance */
> +		get_page(virt_to_head_page(buf));
> +		pkt->nr_vecs++;
> +	}
> +
> +	while (--num_buf) {
> +		buf = virtqueue_get_buf(vq, &len);
> +		if (!buf)
> +			goto err;
> +
> +		*total_len += len;
> +		vsock->rx_buf_nr--;
> +
> +		pkt->vec[pkt->nr_vecs].iov_base = buf;
> +		pkt->vec[pkt->nr_vecs].iov_len = len;
> +		pkt->nr_vecs++;
> +	}
> +
> +	return pkt;
> +err:
> +	virtio_transport_free_pkt(pkt);
> +	return NULL;
> +}
> +
>  static void virtio_transport_rx_work(struct work_struct *work)
>  {
>  	struct virtio_vsock *vsock =
>  		container_of(work, struct virtio_vsock, rx_work);
>  	struct virtqueue *vq;
> +	size_t vsock_hlen = vsock->mergeable ? sizeof(struct virtio_vsock_pkt) :
> +			sizeof(struct virtio_vsock_hdr);
> 
>  	vq = vsock->vqs[VSOCK_VQ_RX];
> 
> @@ -412,21 +486,26 @@ static void virtio_transport_rx_work(struct work_struct *work)
>  				goto out;
>  			}
> 
> -			pkt = virtqueue_get_buf(vq, &len);
> -			if (!pkt) {
> -				break;
> -			}
> +			if (likely(vsock->mergeable)) {
> +				pkt = receive_mergeable(vq, vsock, &len);
> +				if (!pkt)
> +					break;
> +			} else {
> +				pkt = virtqueue_get_buf(vq, &len);
> +				if (!pkt)
> +					break;
> 

So looking at it, this seems to be the main source of the gain.
But why does this require host/guest changes?


The way I see it:
	- get a buffer and create an skb
	- get the next one and check whether its header matches; if it
	  does, tack it onto the skb as a fragment. If it doesn't,
	  deliver the previous skb and queue the new one.
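
To make the idea concrete, here is a minimal userspace sketch of that
receive-side coalescing. It is not kernel code and it is untested against
the real driver: struct chunk_hdr, struct pending and rx_coalesce() are
hypothetical stand-ins (an skb with fragments would play the role of
struct pending), and which header fields have to match is an assumption.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_FRAGS 16	/* arbitrary limit for the sketch */

/* Hypothetical per-buffer header; not the real struct virtio_vsock_hdr. */
struct chunk_hdr {
	uint64_t src_cid, dst_cid;
	uint32_t src_port, dst_port;
	uint16_t op;
};

/* One buffer taken from the rx queue: header plus payload. */
struct chunk {
	struct chunk_hdr hdr;
	const void *data;
	size_t len;
};

/* Stand-in for an skb being assembled from fragments. */
struct pending {
	struct chunk_hdr hdr;
	const void *frag[MAX_FRAGS];
	size_t frag_len[MAX_FRAGS];
	int nr_frags;
	size_t total_len;
};

static bool hdr_matches(const struct chunk_hdr *a, const struct chunk_hdr *b)
{
	return a->src_cid == b->src_cid && a->dst_cid == b->dst_cid &&
	       a->src_port == b->src_port && a->dst_port == b->dst_port &&
	       a->op == b->op;
}

static void pending_add(struct pending *p, const struct chunk *c)
{
	p->frag[p->nr_frags] = c->data;
	p->frag_len[p->nr_frags] = c->len;
	p->nr_frags++;
	p->total_len += c->len;
}

/*
 * Feed one received buffer into the pending packet. Returns true if a
 * completed packet was moved to *done and should be delivered; the new
 * buffer then starts the next pending packet.
 */
static bool rx_coalesce(struct pending *p, const struct chunk *c,
			struct pending *done)
{
	bool deliver = false;

	assert(p && c && done);

	if (p->nr_frags &&
	    (p->nr_frags == MAX_FRAGS || !hdr_matches(&p->hdr, &c->hdr))) {
		*done = *p;		/* deliver the previous one */
		p->nr_frags = 0;
		p->total_len = 0;
		deliver = true;
	}
	if (!p->nr_frags)
		p->hdr = c->hdr;	/* first buffer defines the packet */
	pending_add(p, c);		/* tack it on as a fragment */
	return deliver;
}
```

The caller would still have to flush the last pending packet once the
queue is drained; the sketch leaves that out.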



> -			vsock->rx_buf_nr--;
> +				vsock->rx_buf_nr--;
> +			}
> 
>  			/* Drop short/long packets */
> -			if (unlikely(len < sizeof(pkt->hdr) ||
> -				     len > sizeof(pkt->hdr) + pkt->len)) {
> +			if (unlikely(len < vsock_hlen ||
> +				     len > vsock_hlen + pkt->len)) {
>  				virtio_transport_free_pkt(pkt);
>  				continue;
>  			}
> 
> -			pkt->len = len - sizeof(pkt->hdr);
> +			pkt->len = len - vsock_hlen;
>  			virtio_transport_deliver_tap_pkt(pkt);
>  			virtio_transport_recv_pkt(pkt);
>  		}
> diff --git a/net/vmw_vsock/virtio_transport_common.c b/net/vmw_vsock/virtio_transport_common.c
> index 3ae3a33..123a8b6 100644
> --- a/net/vmw_vsock/virtio_transport_common.c
> +++ b/net/vmw_vsock/virtio_transport_common.c
> @@ -44,6 +44,7 @@ static const struct virtio_transport *virtio_transport_get_ops(void)
>  {
>  	struct virtio_vsock_pkt *pkt;
>  	int err;
> +	void *buf = NULL;
> 
>  	pkt = kzalloc(sizeof(*pkt), GFP_KERNEL);
>  	if (!pkt)
> @@ -62,12 +63,16 @@ static const struct virtio_transport *virtio_transport_get_ops(void)
>  	pkt->vsk		= info->vsk;
> 
>  	if (info->msg && len > 0) {
> -		pkt->buf = kmalloc(len, GFP_KERNEL);
> -		if (!pkt->buf)
> +		buf = kmalloc(len, GFP_KERNEL);
> +		if (!buf)
>  			goto out_pkt;
> -		err = memcpy_from_msg(pkt->buf, info->msg, len);
> +		err = memcpy_from_msg(buf, info->msg, len);
>  		if (err)
>  			goto out;
> +
> +		pkt->vec[0].iov_base = buf;
> +		pkt->vec[0].iov_len = len;
> +		pkt->nr_vecs = 1;
>  	}
> 
>  	trace_virtio_transport_alloc_pkt(src_cid, src_port,
> @@ -80,7 +85,7 @@ static const struct virtio_transport *virtio_transport_get_ops(void)
>  	return pkt;
> 
>  out:
> -	kfree(pkt->buf);
> +	kfree(buf);
>  out_pkt:
>  	kfree(pkt);
>  	return NULL;
> @@ -92,6 +97,7 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
>  	struct virtio_vsock_pkt *pkt = opaque;
>  	struct af_vsockmon_hdr *hdr;
>  	struct sk_buff *skb;
> +	int i;
> 
>  	skb = alloc_skb(sizeof(*hdr) + sizeof(pkt->hdr) + pkt->len,
>  			GFP_ATOMIC);
> @@ -134,7 +140,8 @@ static struct sk_buff *virtio_transport_build_skb(void *opaque)
>  	skb_put_data(skb, &pkt->hdr, sizeof(pkt->hdr));
> 
>  	if (pkt->len) {
> -		skb_put_data(skb, pkt->buf, pkt->len);
> +		for (i = 0; i < pkt->nr_vecs; i++)
> +			skb_put_data(skb, pkt->vec[i].iov_base, pkt->vec[i].iov_len);
>  	}
> 
>  	return skb;
> @@ -260,6 +267,9 @@ static int virtio_transport_send_credit_update(struct vsock_sock *vsk,
> 
>  	spin_lock_bh(&vvs->rx_lock);
>  	while (total < len && !list_empty(&vvs->rx_queue)) {
> +		size_t copy_bytes, last_vec_total = 0, vec_off;
> +		int i;
> +
>  		pkt = list_first_entry(&vvs->rx_queue,
>  				       struct virtio_vsock_pkt, list);
> 
> @@ -272,14 +282,28 @@ static int virtio_transport_send_credit_update(struct vsock_sock *vsk,
>  		 */
>  		spin_unlock_bh(&vvs->rx_lock);
> 
> -		err = memcpy_to_msg(msg, pkt->buf + pkt->off, bytes);
> -		if (err)
> -			goto out;
> +		for (i = 0; i < pkt->nr_vecs; i++) {
> +			if (pkt->off > last_vec_total + pkt->vec[i].iov_len) {
> +				last_vec_total += pkt->vec[i].iov_len;
> +				continue;
> +			}
> +
> +			vec_off = pkt->off - last_vec_total;
> +			copy_bytes = min(pkt->vec[i].iov_len - vec_off, bytes);
> +			err = memcpy_to_msg(msg, pkt->vec[i].iov_base + vec_off,
> +					copy_bytes);
> +			if (err)
> +				goto out;
> +
> +			bytes -= copy_bytes;
> +			pkt->off += copy_bytes;
> +			total += copy_bytes;
> +			last_vec_total += pkt->vec[i].iov_len;
> +			if (!bytes)
> +				break;
> +		}
> 
>  		spin_lock_bh(&vvs->rx_lock);
> -
> -		total += bytes;
> -		pkt->off += bytes;
>  		if (pkt->off == pkt->len) {
>  			virtio_transport_dec_rx_pkt(vvs, pkt);
>  			list_del(&pkt->list);
> @@ -1050,8 +1074,17 @@ void virtio_transport_recv_pkt(struct virtio_vsock_pkt *pkt)
> 
>  void virtio_transport_free_pkt(struct virtio_vsock_pkt *pkt)
>  {
> -	kfree(pkt->buf);
> -	kfree(pkt);
> +	int i;
> +
> +	if (pkt->mergeable) {
> +		for (i = 0; i < pkt->nr_vecs; i++)
> +			put_page(virt_to_head_page(pkt->vec[i].iov_base));
> +		put_page(virt_to_head_page((void *)pkt));
> +	} else {
> +		for (i = 0; i < pkt->nr_vecs; i++)
> +			kfree(pkt->vec[i].iov_base);
> +		kfree(pkt);
> +	}
>  }
>  EXPORT_SYMBOL_GPL(virtio_transport_free_pkt);
> 
> -- 
> 1.8.3.1
> 

* Re: [virtio-dev] Re: [PATCH v5 5/7] iommu: Add virtio-iommu driver
From: Auger Eric @ 2018-12-12 15:27 UTC (permalink / raw)
  To: Michael S. Tsirkin, Jean-Philippe Brucker
  Cc: mark.rutland, virtio-dev, lorenzo.pieralisi, tnowicki, devicetree,
	marc.zyngier, linux-pci, joro, will.deacon, virtualization, iommu,
	robh+dt, bhelgaas, robin.murphy, kvmarm
In-Reply-To: <20181212093709-mutt-send-email-mst@kernel.org>

Hi,

On 12/12/18 3:56 PM, Michael S. Tsirkin wrote:
> On Fri, Dec 07, 2018 at 06:52:31PM +0000, Jean-Philippe Brucker wrote:
>> Sorry for the delay, I wanted to do a little more performance analysis
>> before continuing.
>>
>> On 27/11/2018 18:10, Michael S. Tsirkin wrote:
>>> On Tue, Nov 27, 2018 at 05:55:20PM +0000, Jean-Philippe Brucker wrote:
>>>>>> +	if (!virtio_has_feature(vdev, VIRTIO_F_VERSION_1) ||
>>>>>> +	    !virtio_has_feature(vdev, VIRTIO_IOMMU_F_MAP_UNMAP))
>>>>>
>>>>> Why bother with a feature bit for this then btw?
>>>>
>>>> We'll need a new feature bit for sharing page tables with the hardware,
>>>> because they require different requests (attach_table/invalidate instead
>>>> of map/unmap.) A future device supporting page table sharing won't
>>>> necessarily need to support map/unmap.
>>>>
>>> I don't see virtio iommu being extended to support ARM specific
>>> requests. This just won't scale, too many different
>>> descriptor formats out there.
>>
>> They aren't really ARM specific requests. The two new requests are
>> ATTACH_TABLE and INVALIDATE, which would be used by x86 IOMMUs as well.
>>
>> Sharing CPU address space with the HW IOMMU (SVM) has been in the scope
>> of virtio-iommu since the first RFC, and I've been working with that
>> extension in mind since the beginning. As an example you can have a look
>> at my current draft for this [1], which is inspired from the VFIO work
>> we've been doing with Intel.
>>
>> The negotiation phase inevitably requires vendor-specific fields in the
>> descriptors - host tells which formats are supported, guest chooses a
>> format and attaches page tables. But invalidation and fault reporting
>> descriptors are fairly generic.
> 
> We need to tread carefully here.  People expect that if a user runs
> lspci and sees a virtio device, then it's reasonably portable.
> 
>>> If you want to go that way down the road, you should avoid
>>> virtio iommu, instead emulate and share code with the ARM SMMU (probably
>>> with a different vendor id so you can implement the
>>> report on map for devices without PRI).
>>
>> vSMMU has to stay in userspace though. The main reason we're proposing
>> virtio-iommu is that emulating every possible vIOMMU model in the kernel
>> would be unmaintainable. With virtio-iommu we can process the fast path
>> in the host kernel, through vhost-iommu, and do the heavy lifting in
>> userspace.
> 
> Interesting.
> 
>> As said above, I'm trying to keep the fast path for
>> virtio-iommu generic.
>>
>> More notes on what I consider to be the fast path, and comparison with
>> vSMMU:
>>
>> (1) The primary use-case we have in mind for vIOMMU is something like
>> DPDK in the guest, assigning a hardware device to guest userspace. DPDK
>> maps a large amount of memory statically, to be used by a pass-through
>> device. For this case I don't think we care about vIOMMU performance.
>> Setup and teardown need to be reasonably fast, sure, but the MAP/UNMAP
>> requests don't have to be optimal.
>>
>>
>> (2) If the assigned device is owned by the guest kernel, then mappings
>> are dynamic and require dma_map/unmap() to be fast, but there generally
>> is no need for a vIOMMU, since device and drivers are trusted by the
>> guest kernel. Even when the user does enable a vIOMMU for this case
>> (allowing to over-commit guest memory, which needs to be pinned
>> otherwise),
> 
> BTW that's in theory; in practice it doesn't really work.
> 
>> we generally play tricks like lazy TLBI (non-strict mode) to
>> make it faster.
> 
> Simple lazy TLB for guest/userspace drivers would be a big no no.
> You need something smarter.
> 
>> Here device and drivers are trusted, therefore the
>> vulnerability window of lazy mode isn't a concern.
>>
>> If the reason to enable the vIOMMU is over-committing guest memory
>> however, you can't use nested translation because it requires pinning
>> the second-level tables. For this case performance matters a bit,
>> because your invalidate-on-map needs to be fast, even if you enable lazy
>> mode and only receive inval-on-unmap every 10ms. It won't ever be as
>> fast as nested translation, though. For this case I think vSMMU+Caching
>> Mode and userspace virtio-iommu with MAP/UNMAP would perform similarly
>> (given page-sized payloads), because the pagetable walk doesn't add a
>> lot of overhead compared to the context switch. But given the results
>> below, vhost-iommu would be faster than vSMMU+CM.
>>
>>
>> (3) Then there is SVM. For SVM, any destructive change to the process
>> address space requires a synchronous invalidation command to the
>> hardware (at least when using PCI ATS). Given that SVM is based on page
>> faults, fault reporting from host to guest also needs to be fast, as
>> well as fault response from guest to host.
>>
>> I think this is where performance matters the most. To get a feel of the
>> advantage we get with virtio-iommu, I compared the vSMMU page-table
>> sharing implementation [2] and vhost-iommu + VFIO with page table
>> sharing (based on Tomasz Nowicki's vhost-iommu prototype). That's on a
>> ThunderX2 with a 10Gb NIC assigned to the guest kernel, which
>> corresponds to case (2) above, with nesting page tables and without the
>> lazy mode. The host's only job is forwarding invalidation to the HW SMMU.
>>
>> vhost-iommu performed on average 1.8x and 5.5x better than vSMMU on
>> netperf TCP_STREAM and TCP_MAERTS respectively (~200 samples). I think
>> this can be further optimized (that was still polling under the vq
>> lock), and unlike vSMMU, virtio-iommu offers the possibility of
>> multi-queue for improved scalability. In addition, the guest will need
>> to send both TLB and ATC invalidations with vSMMU, but virtio-iommu
>> allows to multiplex those, and to invalidate ranges. Similarly for fault
>> injection, having the ability to report page faults to the guest from
>> the host kernel should be significantly faster than having to go to
>> userspace and back to the kernel.
> 
> Fascinating. Any data about host CPU utilization?
> 
> Eric, what do you think?
> 
> Is it true that SMMUv3 is fundamentally slow at the architecture level
> and so a PV interface will always scale better until
> a new hardware interface is designed?

As far as I understand, the figures above compare vhost-iommu against
vSMMUv3. In both cases the guest owns the stage 1 tables, so the
difference comes from the IOTLB invalidation handling. With vhost we
avoid a kernel <-> userspace round trip, which probably explains most of
the difference.

Regarding SMMUv3 issues, I already reported one big limitation with respect
to hugepage invalidation. See [RFC v2 4/4] iommu/arm-smmu-v3: add
CMD_TLBI_NH_VA_AM command for iova range invalidation
(https://lkml.org/lkml/2017/8/11/428).

At the SMMUv3 guest driver level, arm_smmu_tlb_inv_range_nosync(), when
called with a hugepage size, invalidates each 4K/64K page of the region
rather than the whole region at once. Each of these invalidations is
trapped by the SMMUv3 device, which forwards it to the host. This stalls
the guest. The issue can be observed in the DPDK case (not the use case
benchmarked above).
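
To put rough numbers on this, here is a back-of-the-envelope sketch. The
helper names are hypothetical and this is not the driver's code; it only
counts how many trapped commands a per-granule invalidation generates,
compared with a single range-based command.

```c
#include <assert.h>
#include <stdint.h>

/* Commands issued when every granule of the region is invalidated. */
static uint64_t inv_cmds_per_granule(uint64_t size, uint64_t granule)
{
	assert(granule != 0);
	return (size + granule - 1) / granule;	/* round up */
}

/* Commands issued if a whole range could be invalidated at once. */
static uint64_t inv_cmds_range(uint64_t size)
{
	return size ? 1 : 0;
}
```

For a 2M block with a 4K granule, inv_cmds_per_granule(2 << 20, 4096)
gives 512 trapped commands where inv_cmds_range() gives one; for a 512M
block with a 64K granule it is 8192.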

I raised this point again in recent discussions, and it is unclear
whether this is an SMMUv3 driver limitation or an architecture
limitation. It seems that a single invalidation within the block mapping
should invalidate the whole mapping at HW level. In the past I hacked a
workaround by defining an implementation-defined invalidation command.

Robin/Will, could you please explain the rationale behind the
arm_smmu_tlb_inv_range_nosync() implementation?

Thanks

Eric



> 
> 
>>
>> (4) Virtio and vhost endpoints weren't really a priority for the base
>> virtio-iommu device, we were looking mainly at device pass-through. I
>> have optimizations in mind for this, although a lot of them are based on
>> page tables, not MAP/UNMAP requests. But just getting the vIOMMU closer
>> to vhost devices, avoiding the trip to userspace through vhost-tlb,
>> should already improve things.
>>
>> The important difference when DMA is done by software is that you don't
>> need to mirror all mappings into the HW IOMMU - you don't need
>> inval-on-map. The endpoint can ask the vIOMMU for mappings when it needs
>> them, like vhost-iotlb does for example. So the MAP/UNMAP interface of
>> virtio-iommu performs poorly for emulated/PV endpoints compared to an
>> emulated IOMMU, since it requires three context switches for DMA
>> (MAP/DMA/UNMAP) between host and guest, rather than two (DMA/INVAL).
>> There is a feature I call "posted MAP", that avoids the kick on MAP and
>> instead lets the device fetch the MAP request on TLB miss, but I haven't
>> spent enough time experimenting with this.
>>
>>> Others on the TC might feel differently.
>>>
>>> If someone's looking into adding virtio iommu support in hardware,
>>> that's a different matter. Which is it?
>>
>> I'm not aware of anything like that, and suspect that no one would
>> consider it until virtio-iommu is more widely adopted.
>>
>> Thanks,
>> Jean
>>
>>
>> [1] Diff between current spec and page table sharing draft
>>     (Very rough, missing page fault support and I'd like to rework the
>>      PASID model a bit, but table descriptors p.24-26 for both Arm
>>      SMMUv2 and SMMUv3.)
>>
>> http://jpbrucker.net/virtio-iommu/spec-table/diffs/virtio-iommu-pdf-diff-v0.9-v0.10.dev03.pdf
>>
>> [2] [RFC v2 00/28] vSMMUv3/pSMMUv3 2 stage VFIO integration
>>     https://www.mail-archive.com/qemu-devel@nongnu.org/msg562369.html

* Re: [PATCH v2 0/5] VSOCK: support mergeable rx buffer in vhost-vsock
From: Michael S. Tsirkin @ 2018-12-12 15:09 UTC (permalink / raw)
  To: jiangyiwen; +Cc: netdev, kvm, Stefan Hajnoczi, virtualization
In-Reply-To: <5C10D41E.9050002@huawei.com>

On Wed, Dec 12, 2018 at 05:25:50PM +0800, jiangyiwen wrote:
> Now vsock only supports sending/receiving small packets, so it can't
> achieve high performance. As previously discussed with Jason Wang, I
> revisited the idea of mergeable rx buffers from vhost-net and implemented
> mergeable rx buffers in vhost-vsock. This allows a big packet to be
> scattered into different buffers and clearly improves performance.
> 
> This series of patches mainly did three things:
> - mergeable buffer implementation
> - increase the max send pkt size
> - add used and signal guest in a batch
> 
> And I write a tool to test the vhost-vsock performance, mainly send big
> packet(64K) included guest->Host and Host->Guest. I test performance
> independently and the result as follows:
> 
> Performance before:
>               Single socket            Multiple sockets(Max Bandwidth)
> Guest->Host   ~400MB/s                 ~480MB/s
> Host->Guest   ~1450MB/s                ~1600MB/s
> 
> Performance with only the mergeable rx buffer implemented:
>               Single socket            Multiple sockets(Max Bandwidth)
> Guest->Host   ~400MB/s                 ~480MB/s
> Host->Guest   ~1280MB/s                ~1350MB/s
> 
> In this case, max send pkt size is still limited to 4K, so Host->Guest
> performance will be worse than before.

It's concerning though: what if an application sends small packets?
What is the source of the slowdown? Do you know?

> Performance after increasing the max send pkt size to 64K:
>               Single socket            Multiple sockets(Max Bandwidth)
> Guest->Host   ~1700MB/s                ~2900MB/s
> Host->Guest   ~1500MB/s                ~2440MB/s
> 
> Performance with all patches applied:
>               Single socket            Multiple sockets(Max Bandwidth)
> Guest->Host   ~1700MB/s                ~2900MB/s
> Host->Guest   ~1700MB/s                ~2900MB/s
> 
> From the test results, the performance is improved obviously, and guest
> memory will not be wasted.
> 
> In addition, in order to support mergeable rx buffer in virtio-vsock,
> we need to add a qemu patch to support parsing the feature.
> 
> ---
> v1 -> v2:
>  * Addressed comments from Jason Wang.
>  * Add performance test result independently.
>  * Use skb_page_frag_refill(), which can use high-order pages and reduce
>    the stress on the page allocator.
>  * Still use a fixed size (PAGE_SIZE) to fill the rx buffer, because too
>    small a size can't hold one full packet; we only have 128 vq num now.
>  * Use iovec to replace buf in struct virtio_vsock_pkt, keep tx and rx
>    consistency.
>  * Add virtio_transport ops to get max pkt len, in order to be compatible
>    with old version.
> ---
> 
> Yiwen Jiang (5):
>   VSOCK: support fill mergeable rx buffer in guest
>   VSOCK: support fill data to mergeable rx buffer in host
>   VSOCK: support receive mergeable rx buffer in guest
>   VSOCK: increase send pkt len in mergeable mode to improve performance
>   VSOCK: batch sending rx buffer to increase bandwidth
> 
>  drivers/vhost/vsock.c                   | 183 ++++++++++++++++++++-----
>  include/linux/virtio_vsock.h            |  13 +-
>  include/uapi/linux/virtio_vsock.h       |   5 +
>  net/vmw_vsock/virtio_transport.c        | 229 +++++++++++++++++++++++++++-----
>  net/vmw_vsock/virtio_transport_common.c |  66 ++++++---
>  5 files changed, 411 insertions(+), 85 deletions(-)
> 
> -- 
> 1.8.3.1

* Re: [virtio-dev] Re: [PATCH v5 5/7] iommu: Add virtio-iommu driver
From: Michael S. Tsirkin @ 2018-12-12 14:56 UTC (permalink / raw)
  To: Jean-Philippe Brucker
  Cc: mark.rutland, virtio-dev, lorenzo.pieralisi, tnowicki, devicetree,
	marc.zyngier, linux-pci, joro, will.deacon, virtualization,
	eric.auger, iommu, robh+dt, bhelgaas, robin.murphy, kvmarm
In-Reply-To: <e1dde79c-0bc4-ea99-1bb0-9e70b56955fb@arm.com>

On Fri, Dec 07, 2018 at 06:52:31PM +0000, Jean-Philippe Brucker wrote:
> Sorry for the delay, I wanted to do a little more performance analysis
> before continuing.
> 
> On 27/11/2018 18:10, Michael S. Tsirkin wrote:
> > On Tue, Nov 27, 2018 at 05:55:20PM +0000, Jean-Philippe Brucker wrote:
> >>>> +	if (!virtio_has_feature(vdev, VIRTIO_F_VERSION_1) ||
> >>>> +	    !virtio_has_feature(vdev, VIRTIO_IOMMU_F_MAP_UNMAP))
> >>>
> >>> Why bother with a feature bit for this then btw?
> >>
> >> We'll need a new feature bit for sharing page tables with the hardware,
> >> because they require different requests (attach_table/invalidate instead
> >> of map/unmap.) A future device supporting page table sharing won't
> >> necessarily need to support map/unmap.
> >>
> > I don't see virtio iommu being extended to support ARM specific
> > requests. This just won't scale, too many different
> > descriptor formats out there.
> 
> They aren't really ARM specific requests. The two new requests are
> ATTACH_TABLE and INVALIDATE, which would be used by x86 IOMMUs as well.
> 
> Sharing CPU address space with the HW IOMMU (SVM) has been in the scope
> of virtio-iommu since the first RFC, and I've been working with that
> extension in mind since the beginning. As an example you can have a look
> at my current draft for this [1], which is inspired from the VFIO work
> we've been doing with Intel.
> 
> The negotiation phase inevitably requires vendor-specific fields in the
> descriptors - host tells which formats are supported, guest chooses a
> format and attaches page tables. But invalidation and fault reporting
> descriptors are fairly generic.

We need to tread carefully here.  People expect that if a user runs
lspci and sees a virtio device, then it's reasonably portable.

> > If you want to go that way down the road, you should avoid
> > virtio iommu, instead emulate and share code with the ARM SMMU (probably
> > with a different vendor id so you can implement the
> > report on map for devices without PRI).
> 
> vSMMU has to stay in userspace though. The main reason we're proposing
> virtio-iommu is that emulating every possible vIOMMU model in the kernel
> would be unmaintainable. With virtio-iommu we can process the fast path
> in the host kernel, through vhost-iommu, and do the heavy lifting in
> userspace.

Interesting.

> As said above, I'm trying to keep the fast path for
> virtio-iommu generic.
> 
> More notes on what I consider to be the fast path, and comparison with
> vSMMU:
> 
> (1) The primary use-case we have in mind for vIOMMU is something like
> DPDK in the guest, assigning a hardware device to guest userspace. DPDK
> maps a large amount of memory statically, to be used by a pass-through
> device. For this case I don't think we care about vIOMMU performance.
> Setup and teardown need to be reasonably fast, sure, but the MAP/UNMAP
> requests don't have to be optimal.
> 
> 
> (2) If the assigned device is owned by the guest kernel, then mappings
> are dynamic and require dma_map/unmap() to be fast, but there generally
> is no need for a vIOMMU, since device and drivers are trusted by the
> guest kernel. Even when the user does enable a vIOMMU for this case
> (allowing to over-commit guest memory, which needs to be pinned
> otherwise),

BTW that's in theory; in practice it doesn't really work.

> we generally play tricks like lazy TLBI (non-strict mode) to
> make it faster.

Simple lazy TLB for guest/userspace drivers would be a big no no.
You need something smarter.

> Here device and drivers are trusted, therefore the
> vulnerability window of lazy mode isn't a concern.
> 
> If the reason to enable the vIOMMU is over-committing guest memory
> however, you can't use nested translation because it requires pinning
> the second-level tables. For this case performance matters a bit,
> because your invalidate-on-map needs to be fast, even if you enable lazy
> mode and only receive inval-on-unmap every 10ms. It won't ever be as
> fast as nested translation, though. For this case I think vSMMU+Caching
> Mode and userspace virtio-iommu with MAP/UNMAP would perform similarly
> (given page-sized payloads), because the pagetable walk doesn't add a
> lot of overhead compared to the context switch. But given the results
> below, vhost-iommu would be faster than vSMMU+CM.
> 
> 
> (3) Then there is SVM. For SVM, any destructive change to the process
> address space requires a synchronous invalidation command to the
> hardware (at least when using PCI ATS). Given that SVM is based on page
> faults, fault reporting from host to guest also needs to be fast, as
> well as fault response from guest to host.
> 
> I think this is where performance matters the most. To get a feel of the
> advantage we get with virtio-iommu, I compared the vSMMU page-table
> sharing implementation [2] and vhost-iommu + VFIO with page table
> sharing (based on Tomasz Nowicki's vhost-iommu prototype). That's on a
> ThunderX2 with a 10Gb NIC assigned to the guest kernel, which
> corresponds to case (2) above, with nesting page tables and without the
> lazy mode. The host's only job is forwarding invalidation to the HW SMMU.
> 
> vhost-iommu performed on average 1.8x and 5.5x better than vSMMU on
> netperf TCP_STREAM and TCP_MAERTS respectively (~200 samples). I think
> this can be further optimized (that was still polling under the vq
> lock), and unlike vSMMU, virtio-iommu offers the possibility of
> multi-queue for improved scalability. In addition, the guest will need
> to send both TLB and ATC invalidations with vSMMU, but virtio-iommu
> allows to multiplex those, and to invalidate ranges. Similarly for fault
> injection, having the ability to report page faults to the guest from
> the host kernel should be significantly faster than having to go to
> userspace and back to the kernel.

Fascinating. Any data about host CPU utilization?

Eric, what do you think?

Is it true that SMMUv3 is fundamentally slow at the architecture level
and so a PV interface will always scale better until
a new hardware interface is designed?


> 
> (4) Virtio and vhost endpoints weren't really a priority for the base
> virtio-iommu device, we were looking mainly at device pass-through. I
> have optimizations in mind for this, although a lot of them are based on
> page tables, not MAP/UNMAP requests. But just getting the vIOMMU closer
> to vhost devices, avoiding the trip to userspace through vhost-tlb,
> should already improve things.
> 
> The important difference when DMA is done by software is that you don't
> need to mirror all mappings into the HW IOMMU - you don't need
> inval-on-map. The endpoint can ask the vIOMMU for mappings when it needs
> them, like vhost-iotlb does for example. So the MAP/UNMAP interface of
> virtio-iommu performs poorly for emulated/PV endpoints compared to an
> emulated IOMMU, since it requires three context switches for DMA
> (MAP/DMA/UNMAP) between host and guest, rather than two (DMA/INVAL).
> There is a feature I call "posted MAP", that avoids the kick on MAP and
> instead lets the device fetch the MAP request on TLB miss, but I haven't
> spent enough time experimenting with this.
> 
> > Others on the TC might feel differently.
> > 
> > If someone's looking into adding virtio iommu support in hardware,
> > that's a different matter. Which is it?
> 
> I'm not aware of anything like that, and suspect that no one would
> consider it until virtio-iommu is more widely adopted.
> 
> Thanks,
> Jean
> 
> 
> [1] Diff between current spec and page table sharing draft
>     (Very rough, missing page fault support and I'd like to rework the
>      PASID model a bit, but table descriptors p.24-26 for both Arm
>      SMMUv2 and SMMUv3.)
> 
> http://jpbrucker.net/virtio-iommu/spec-table/diffs/virtio-iommu-pdf-diff-v0.9-v0.10.dev03.pdf
> 
> [2] [RFC v2 00/28] vSMMUv3/pSMMUv3 2 stage VFIO integration
>     https://www.mail-archive.com/qemu-devel@nongnu.org/msg562369.html

* Re: [PATCH v6 0/7] Add virtio-iommu driver
From: Michael S. Tsirkin @ 2018-12-12 14:35 UTC (permalink / raw)
  To: Joerg Roedel
  Cc: mark.rutland, virtio-dev, lorenzo.pieralisi, tnowicki, devicetree,
	Jean-Philippe Brucker, linux-pci, will.deacon, robin.murphy,
	virtualization, eric.auger, iommu, robh+dt, marc.zyngier,
	bhelgaas, kvmarm
In-Reply-To: <20181212103545.GV16835@8bytes.org>

On Wed, Dec 12, 2018 at 11:35:45AM +0100, Joerg Roedel wrote:
> Hi,
> 
> to make progress on this, we should first agree on the protocol used
> between guest and host. I have a few points to discuss on the protocol
> first.
> 
> On Tue, Dec 11, 2018 at 06:20:57PM +0000, Jean-Philippe Brucker wrote:
> > [1] Virtio-iommu specification v0.9, sources and pdf
> >     git://linux-arm.org/virtio-iommu.git virtio-iommu/v0.9
> >     http://jpbrucker.net/virtio-iommu/spec/v0.9/virtio-iommu-v0.9.pdf
> 
> Looking at this I wonder why it doesn't make the IOTLB visible to the
> guest. The UNMAP requests seem to require that the TLB is already
> flushed to make the unmap visible.
> 
> I think that will cost significant performance for both, vfio and
> dma-iommu use-cases which both do (vfio at least to some degree),
> deferred flushing.
> 
> I also wonder whether the protocol should implement a
> protocol version handshake

virtio has a built-in version handshake, so devices don't need their own.
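
As a rough illustration of how that handshake works through feature bits,
here is a minimal userspace sketch. negotiate() and has_feature() are
hypothetical helpers, not the kernel's virtio API, and the value used for
VIRTIO_IOMMU_F_MAP_UNMAP is an assumption for illustration only.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define VIRTIO_F_VERSION_1		32	/* bit number from the virtio spec */
#define VIRTIO_IOMMU_F_MAP_UNMAP	2	/* assumed value, illustration only */

/* The negotiated set is what both device and driver offer. */
static uint64_t negotiate(uint64_t device_features, uint64_t driver_features)
{
	return device_features & driver_features;
}

/* Test one feature bit in a 64-bit feature set. */
static bool has_feature(uint64_t features, unsigned int bit)
{
	assert(bit < 64);
	return (features >> bit) & 1;
}
```

A driver would refuse to bind when has_feature() is false for
VIRTIO_F_VERSION_1 in the negotiated set, which is roughly what the
quoted probe check does with virtio_has_feature().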

> and iommu-feature set queries.
> 
> > [3] git://linux-arm.org/linux-jpb.git virtio-iommu/v0.9.1
> >     git://linux-arm.org/kvmtool-jpb.git virtio-iommu/v0.9
> 
> Unfortunately gitweb seems to be broken on linux-arm.org. What is missing
> in this patch-set to make this work on x86?

And I wonder about pcc too.

> Regards,
> 
> 	Joerg

* Re: [PATCH net V2 1/4] vhost: make sure used idx is seen before log in vhost_add_used_n()
From: Michael S. Tsirkin @ 2018-12-12 14:33 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181212100819.21295-2-jasowang@redhat.com>

On Wed, Dec 12, 2018 at 06:08:16PM +0800, Jason Wang wrote:
> We miss a write barrier that guarantees used idx is updated and seen
> before log. This will let userspace sync and copy used ring before
> used idx is updated. Fix this by adding a barrier before log_write().
> 
> Fixes: 8dd014adfea6f ("vhost-net: mergeable buffers support")
> Signed-off-by: Jason Wang <jasowang@redhat.com>


Acked-by: Michael S. Tsirkin <mst@redhat.com>

Also for 4.20, and this seems like a stable candidate.

> ---
>  drivers/vhost/vhost.c | 2 ++
>  1 file changed, 2 insertions(+)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 6b98d8e3a5bf..5915f240275a 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -2220,6 +2220,8 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
>  		return -EFAULT;
>  	}
>  	if (unlikely(vq->log_used)) {
> +		/* Make sure used idx is seen before log. */
> +		smp_wmb();
>  		/* Log used index update. */
>  		log_write(vq->log_base,
>  			  vq->log_addr + offsetof(struct vring_used, idx),
> -- 
> 2.17.1
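
The ordering argument in the commit message can be modelled in user space.
The following is a minimal sketch, not the vhost code: the struct and the
function names (model_ring, model_publish, model_sync) are hypothetical, and
C11 fences stand in for smp_wmb() and the matching read-side barrier. It
shows that if the reader observes the log entry, it also observes the index
store that preceded the barrier.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical user-space model; field names only mirror the patch. */
struct model_ring {
	_Atomic uint16_t used_idx;  /* stands in for vq->used->idx */
	_Atomic uint8_t dirty_log;  /* stands in for one dirty-log bit */
};

/* Writer: store the index first, then mark the page dirty. */
static void model_publish(struct model_ring *r, uint16_t idx)
{
	atomic_store_explicit(&r->used_idx, idx, memory_order_relaxed);
	/* Plays the role of smp_wmb(): index store ordered before log store. */
	atomic_thread_fence(memory_order_release);
	atomic_store_explicit(&r->dirty_log, 1, memory_order_relaxed);
}

/* Reader: returns 1 and fills *idx_out only if the log bit is set. */
static int model_sync(struct model_ring *r, uint16_t *idx_out)
{
	if (!atomic_load_explicit(&r->dirty_log, memory_order_relaxed))
		return 0;
	/* Pairs with the release fence in model_publish(). */
	atomic_thread_fence(memory_order_acquire);
	*idx_out = atomic_load_explicit(&r->used_idx, memory_order_relaxed);
	return 1;
}
```

Without the release fence, the two relaxed stores could become visible in
either order, which is the window the patch closes.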


* Re: [PATCH net V2 4/4] vhost: log dirty page correctly
From: Michael S. Tsirkin @ 2018-12-12 14:32 UTC (permalink / raw)
  To: Jason Wang; +Cc: Jintack Lim, netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181212100819.21295-5-jasowang@redhat.com>

On Wed, Dec 12, 2018 at 06:08:19PM +0800, Jason Wang wrote:
> Vhost dirty page logging API is designed to sync through GPA. But we
> try to log GIOVA when device IOTLB is enabled. This is wrong and may
> lead to missing data after migration.
> 
> To solve this issue, when logging with device IOTLB enabled, we will:
> 
> 1) reuse the device IOTLB translation result of the GIOVA->HVA mapping
>    to get the HVA: for a writable descriptor, get the HVA through the
>    iovec; for a used ring update, translate its GIOVA to HVA
> 2) traverse the GPA->HVA mapping to get the possible GPAs and log
>    through GPA. Note that this reverse mapping is not guaranteed
>    to be unique, so we should log each possible GPA in this case.
> 
> This fixes the failure of scp to guest during migration. In -next, we
> will probably support passing GIOVA->GPA instead of GIOVA->HVA.
> 
> Fixes: 6b1e6cc7855b ("vhost: new device IOTLB API")
> Reported-by: Jintack Lim <jintack@cs.columbia.edu>
> Cc: Jintack Lim <jintack@cs.columbia.edu>
> Signed-off-by: Jason Wang <jasowang@redhat.com>

It's a nasty bug for sure, but it has been like this for a long
time, so I'm inclined to say let's put it in 4.21
and queue it for stable.

So please split this out from this series.

Also, I'd like to see a feature bit that allows GPA in IOTLBs.

> ---
>  drivers/vhost/net.c   |  3 +-
>  drivers/vhost/vhost.c | 79 +++++++++++++++++++++++++++++++++++--------
>  drivers/vhost/vhost.h |  3 +-
>  3 files changed, 69 insertions(+), 16 deletions(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index ad7a6f475a44..784df2b49628 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -1192,7 +1192,8 @@ static void handle_rx(struct vhost_net *net)
>  		if (nvq->done_idx > VHOST_NET_BATCH)
>  			vhost_net_signal_used(nvq);
>  		if (unlikely(vq_log))
> -			vhost_log_write(vq, vq_log, log, vhost_len);
> +			vhost_log_write(vq, vq_log, log, vhost_len,
> +					vq->iov, in);
>  		total_len += vhost_len;
>  		if (unlikely(vhost_exceeds_weight(++recv_pkts, total_len))) {
>  			vhost_poll_queue(&vq->poll);
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 55e5aa662ad5..3660310604fd 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -1733,11 +1733,67 @@ static int log_write(void __user *log_base,
>  	return r;
>  }
>  
> +static int log_write_hva(struct vhost_virtqueue *vq, u64 hva, u64 len)
> +{
> +	struct vhost_umem *umem = vq->umem;
> +	struct vhost_umem_node *u;
> +	u64 gpa;
> +	int r;
> +	bool hit = false;
> +
> +	list_for_each_entry(u, &umem->umem_list, link) {
> +		if (u->userspace_addr < hva &&
> +		    u->userspace_addr + u->size >=
> +		    hva + len) {
> +			gpa = u->start + hva - u->userspace_addr;
> +			r = log_write(vq->log_base, gpa, len);
> +			if (r < 0)
> +				return r;
> +			hit = true;
> +		}
> +	}
> +
> +	/* No reverse mapping, should be a bug */
> +	WARN_ON(!hit);

Maybe it should be, but I think userspace can trigger this easily.
We need to stop the device, not warn in the kernel log.

Also, there's an error fd (VHOST_SET_VRING_ERR); we need to wake it up.


> +	return 0;
> +}
> +
> +static void log_used(struct vhost_virtqueue *vq, u64 used_offset, u64 len)
> +{
> +	struct iovec iov[64];
> +	int i, ret;
> +
> +	if (!vq->iotlb) {
> +		log_write(vq->log_base, vq->log_addr + used_offset, len);
> +		return;
> +	}

This change seems questionable. Used ring writes
use their own machinery; they do not go through the IOTLB.
The same should apply to the log, I think.

> +
> +	ret = translate_desc(vq, (u64)(uintptr_t)vq->used + used_offset,
> +			     len, iov, 64, VHOST_ACCESS_WO);
> +	WARN_ON(ret < 0);


Same thing here. Translation failures can be triggered from the guest.
WARN_ON is not a good error handling strategy ...

> +
> +	for (i = 0; i < ret; i++) {
> +		ret = log_write_hva(vq,	(u64)(uintptr_t)iov[i].iov_base,
> +				    iov[i].iov_len);
> +		WARN_ON(ret);
> +	}
> +}
> +
>  int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
> -		    unsigned int log_num, u64 len)
> +		    unsigned int log_num, u64 len, struct iovec *iov, int count)
>  {
>  	int i, r;
>  
> +	if (vq->iotlb) {
> +		for (i = 0; i < count; i++) {
> +			r = log_write_hva(vq, (u64)(uintptr_t)iov[i].iov_base,
> +					  iov[i].iov_len);
> +			if (r < 0)
> +				return r;
> +		}
> +		return 0;
> +	}
> +
>  	/* Make sure data written is seen before log. */
>  	smp_wmb();
>  	for (i = 0; i < log_num; ++i) {
> @@ -1769,9 +1825,8 @@ static int vhost_update_used_flags(struct vhost_virtqueue *vq)
>  		smp_wmb();
>  		/* Log used flag write. */
>  		used = &vq->used->flags;
> -		log_write(vq->log_base, vq->log_addr +
> -			  (used - (void __user *)vq->used),
> -			  sizeof vq->used->flags);
> +		log_used(vq, (used - (void __user *)vq->used),
> +			 sizeof vq->used->flags);
>  		if (vq->log_ctx)
>  			eventfd_signal(vq->log_ctx, 1);
>  	}
> @@ -1789,9 +1844,8 @@ static int vhost_update_avail_event(struct vhost_virtqueue *vq, u16 avail_event)
>  		smp_wmb();
>  		/* Log avail event write */
>  		used = vhost_avail_event(vq);
> -		log_write(vq->log_base, vq->log_addr +
> -			  (used - (void __user *)vq->used),
> -			  sizeof *vhost_avail_event(vq));
> +		log_used(vq, (used - (void __user *)vq->used),
> +			 sizeof *vhost_avail_event(vq));
>  		if (vq->log_ctx)
>  			eventfd_signal(vq->log_ctx, 1);
>  	}
> @@ -2191,10 +2245,8 @@ static int __vhost_add_used_n(struct vhost_virtqueue *vq,
>  		/* Make sure data is seen before log. */
>  		smp_wmb();
>  		/* Log used ring entry write. */
> -		log_write(vq->log_base,
> -			  vq->log_addr +
> -			   ((void __user *)used - (void __user *)vq->used),
> -			  count * sizeof *used);
> +		log_used(vq, ((void __user *)used - (void __user *)vq->used),
> +			 count * sizeof *used);
>  	}
>  	old = vq->last_used_idx;
>  	new = (vq->last_used_idx += count);
> @@ -2236,9 +2288,8 @@ int vhost_add_used_n(struct vhost_virtqueue *vq, struct vring_used_elem *heads,
>  		/* Make sure used idx is seen before log. */
>  		smp_wmb();
>  		/* Log used index update. */
> -		log_write(vq->log_base,
> -			  vq->log_addr + offsetof(struct vring_used, idx),
> -			  sizeof vq->used->idx);
> +		log_used(vq, offsetof(struct vring_used, idx),
> +			 sizeof vq->used->idx);
>  		if (vq->log_ctx)
>  			eventfd_signal(vq->log_ctx, 1);
>  	}
> diff --git a/drivers/vhost/vhost.h b/drivers/vhost/vhost.h
> index 466ef7542291..1b675dad5e05 100644
> --- a/drivers/vhost/vhost.h
> +++ b/drivers/vhost/vhost.h
> @@ -205,7 +205,8 @@ bool vhost_vq_avail_empty(struct vhost_dev *, struct vhost_virtqueue *);
>  bool vhost_enable_notify(struct vhost_dev *, struct vhost_virtqueue *);
>  
>  int vhost_log_write(struct vhost_virtqueue *vq, struct vhost_log *log,
> -		    unsigned int log_num, u64 len);
> +		    unsigned int log_num, u64 len,
> +		    struct iovec *iov, int count);
>  int vq_iotlb_prefetch(struct vhost_virtqueue *vq);
>  
>  struct vhost_msg_node *vhost_new_msg(struct vhost_virtqueue *vq, int type);
> -- 
> 2.17.1
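
The reverse HVA->GPA lookup described in step 2 of the commit message can be
sketched in user space as follows. This is a variant for illustration, not
the patch's log_write_hva(): the names (struct region, log_fn, log_hva_range,
count_bytes) are hypothetical, it uses an inclusive overlap test rather than
the containment check in the patch, and it reports "no reverse mapping" to
the caller as a return value of 0 instead of warning, so the caller can
decide how to fail (for example by stopping the device, as suggested in the
review).

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one GPA->HVA region. */
struct region {
	uint64_t gpa_start;
	uint64_t hva_start;
	uint64_t size;  /* bytes, assumed non-zero */
};

/* Receives each GPA range to mark dirty; returns < 0 on error. */
typedef int (*log_fn)(uint64_t gpa, uint64_t len, void *opaque);

/*
 * Log every GPA range that backs [hva, hva + len).
 * Returns the number of ranges logged, 0 if there is no reverse
 * mapping, or a negative value if the callback failed.
 * Assumes hva + len does not wrap around.
 */
static int log_hva_range(const struct region *r, size_t n,
			 uint64_t hva, uint64_t len,
			 log_fn log, void *opaque)
{
	int hits = 0;

	if (len == 0)
		return 0;

	uint64_t last = hva + len - 1;  /* inclusive end of the write */

	for (size_t i = 0; i < n; i++) {
		uint64_t r_last = r[i].hva_start + r[i].size - 1;

		/* Skip regions that do not overlap the written range. */
		if (r[i].hva_start > last || r_last < hva)
			continue;

		/* Clamp to the overlapping part and convert to GPA. */
		uint64_t lo = hva > r[i].hva_start ? hva : r[i].hva_start;
		uint64_t hi = last < r_last ? last : r_last;
		int ret = log(r[i].gpa_start + (lo - r[i].hva_start),
			      hi - lo + 1, opaque);
		if (ret < 0)
			return ret;
		hits++;
	}
	return hits;
}

/* Example callback: sums the logged bytes into a uint64_t counter. */
static int count_bytes(uint64_t gpa, uint64_t len, void *opaque)
{
	(void)gpa;
	*(uint64_t *)opaque += len;
	return 0;
}
```

Because the mapping is not guaranteed to be unique, one HVA backed by two
GPA regions produces two log calls, one per region.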


* Re: [PATCH net V2 3/4] Revert "net: vhost: lock the vqs one by one"
From: Michael S. Tsirkin @ 2018-12-12 14:24 UTC (permalink / raw)
  To: Jason Wang; +Cc: netdev, linux-kernel, kvm, virtualization
In-Reply-To: <20181212100819.21295-4-jasowang@redhat.com>

On Wed, Dec 12, 2018 at 06:08:18PM +0800, Jason Wang wrote:
> This reverts commit 78139c94dc8c96a478e67dab3bee84dc6eccb5fd. We don't
> protect the device IOTLB with the vq mutex, which can lead to e.g.
> use-after-free of device IOTLB entries. And since we've switched to
> mutex_trylock() in the previous patch, it's safe to revert it without
> introducing a deadlock.
> 
> Fixes: commit 78139c94dc8c ("net: vhost: lock the vqs one by one")
> Cc: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>


Acked-by: Michael S. Tsirkin <mst@redhat.com>

I'd try to put this in 4.20 if we can,
and I think it's needed for -stable.

Also, it looks like we should allow IOTLB entries per vq
to improve locking. What do you think?

> ---
>  drivers/vhost/vhost.c | 21 +++++++++++++++++----
>  1 file changed, 17 insertions(+), 4 deletions(-)
> 
> diff --git a/drivers/vhost/vhost.c b/drivers/vhost/vhost.c
> index 5915f240275a..55e5aa662ad5 100644
> --- a/drivers/vhost/vhost.c
> +++ b/drivers/vhost/vhost.c
> @@ -295,11 +295,8 @@ static void vhost_vq_meta_reset(struct vhost_dev *d)
>  {
>  	int i;
>  
> -	for (i = 0; i < d->nvqs; ++i) {
> -		mutex_lock(&d->vqs[i]->mutex);
> +	for (i = 0; i < d->nvqs; ++i)
>  		__vhost_vq_meta_reset(d->vqs[i]);
> -		mutex_unlock(&d->vqs[i]->mutex);
> -	}
>  }
>  
>  static void vhost_vq_reset(struct vhost_dev *dev,
> @@ -895,6 +892,20 @@ static inline void __user *__vhost_get_user(struct vhost_virtqueue *vq,
>  #define vhost_get_used(vq, x, ptr) \
>  	vhost_get_user(vq, x, ptr, VHOST_ADDR_USED)
>  
> +static void vhost_dev_lock_vqs(struct vhost_dev *d)
> +{
> +	int i = 0;
> +	for (i = 0; i < d->nvqs; ++i)
> +		mutex_lock_nested(&d->vqs[i]->mutex, i);
> +}
> +
> +static void vhost_dev_unlock_vqs(struct vhost_dev *d)
> +{
> +	int i = 0;
> +	for (i = 0; i < d->nvqs; ++i)
> +		mutex_unlock(&d->vqs[i]->mutex);
> +}
> +
>  static int vhost_new_umem_range(struct vhost_umem *umem,
>  				u64 start, u64 size, u64 end,
>  				u64 userspace_addr, int perm)
> @@ -976,6 +987,7 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
>  	int ret = 0;
>  
>  	mutex_lock(&dev->mutex);
> +	vhost_dev_lock_vqs(dev);
>  	switch (msg->type) {
>  	case VHOST_IOTLB_UPDATE:
>  		if (!dev->iotlb) {
> @@ -1009,6 +1021,7 @@ static int vhost_process_iotlb_msg(struct vhost_dev *dev,
>  		break;
>  	}
>  
> +	vhost_dev_unlock_vqs(dev);
>  	mutex_unlock(&dev->mutex);
>  
>  	return ret;
> -- 
> 2.17.1
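
The locking scheme this revert restores, taking every vq mutex in a fixed
order around an IOTLB update, can be sketched with pthreads. This is only a
user-space model under stated assumptions: struct model_dev, MODEL_NVQS and
the model_* functions are hypothetical names, and iotlb_generation merely
stands in for the shared device IOTLB state.

```c
#include <pthread.h>
#include <stddef.h>

#define MODEL_NVQS 2  /* hypothetical: number of virtqueues in the model */

struct model_dev {
	pthread_mutex_t vq_mutex[MODEL_NVQS];
	int iotlb_generation;  /* stands in for shared IOTLB state */
};

static void model_dev_init(struct model_dev *d)
{
	for (size_t i = 0; i < MODEL_NVQS; i++)
		pthread_mutex_init(&d->vq_mutex[i], NULL);
	d->iotlb_generation = 0;
}

/* Always lock in ascending index order so all lockers agree. */
static void model_lock_vqs(struct model_dev *d)
{
	for (size_t i = 0; i < MODEL_NVQS; i++)
		pthread_mutex_lock(&d->vq_mutex[i]);
}

static void model_unlock_vqs(struct model_dev *d)
{
	for (size_t i = 0; i < MODEL_NVQS; i++)
		pthread_mutex_unlock(&d->vq_mutex[i]);
}

/* Update shared state while no vq worker can be using it. */
static void model_iotlb_update(struct model_dev *d)
{
	model_lock_vqs(d);
	d->iotlb_generation++;
	model_unlock_vqs(d);
}
```

Any path that takes a second vq mutex in a different order while already
holding one would have to use a trylock instead, which is what the previous
patch in the series does for busy polling.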


* Re: [PATCH net V2 2/4] vhost_net: switch to use mutex_trylock() in vhost_net_busy_poll()
From: Michael S. Tsirkin @ 2018-12-12 14:20 UTC (permalink / raw)
  To: Jason Wang; +Cc: kvm, netdev, linux-kernel, virtualization, David Miller
In-Reply-To: <20181212100819.21295-3-jasowang@redhat.com>

On Wed, Dec 12, 2018 at 06:08:17PM +0800, Jason Wang wrote:
> We used to hold the mutex of the paired virtqueue in
> vhost_net_busy_poll(). But this results in an inconsistent lock
> order, which may cause a deadlock if we try to bring back the
> protection of the device IOTLB with the vq mutex, which requires
> holding the mutexes of all virtqueues at the same time.
> 
> Fix this simply by switching to mutex_trylock(); when it fails, just
> skip the busy polling. This can happen while the device IOTLB is being
> updated, which should be rare.
> 
> Fixes: commit 78139c94dc8c ("net: vhost: lock the vqs one by one")
> Cc: Tonghao Zhang <xiangxia.m.yue@gmail.com>
> Signed-off-by: Jason Wang <jasowang@redhat.com>

Acked-by: Michael S. Tsirkin <mst@redhat.com>

I think we should try to put this fix in 4.20 too.


> ---
>  drivers/vhost/net.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/vhost/net.c b/drivers/vhost/net.c
> index ab11b2bee273..ad7a6f475a44 100644
> --- a/drivers/vhost/net.c
> +++ b/drivers/vhost/net.c
> @@ -513,7 +513,13 @@ static void vhost_net_busy_poll(struct vhost_net *net,
>  	struct socket *sock;
>  	struct vhost_virtqueue *vq = poll_rx ? tvq : rvq;
>  
> -	mutex_lock_nested(&vq->mutex, poll_rx ? VHOST_NET_VQ_TX: VHOST_NET_VQ_RX);
> +	/* Try to hold the vq mutex of the paired virtqueue. We can't
> +	 * use mutex_lock() here since we could not guarantee a
> +	 * consistenet lock ordering.
> +	 */
> +	if (!mutex_trylock(&vq->mutex))
> +		return;
> +
>  	vhost_disable_notify(&net->dev, vq);
>  	sock = rvq->private_data;
>  
> -- 
> 2.17.1
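
The trylock-and-skip pattern in this patch can be illustrated outside the
kernel. Below is a minimal pthreads sketch, assuming a hypothetical
struct model_vq and function model_busy_poll(); it is not the vhost_net
code. The point is that an opportunistic path never blocks on the peer's
mutex, so it cannot take part in a lock-order cycle.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical model of the paired virtqueue. */
struct model_vq {
	pthread_mutex_t mutex;
	unsigned long polls;  /* how many times busy polling actually ran */
};

/*
 * Try to busy poll on the paired virtqueue.
 * Returns true if polling ran, false if the mutex was contended and
 * polling was skipped.
 */
static bool model_busy_poll(struct model_vq *peer)
{
	/* Never block here: blocking could invert the lock order. */
	if (pthread_mutex_trylock(&peer->mutex) != 0)
		return false;

	peer->polls++;  /* stands in for the real polling work */
	pthread_mutex_unlock(&peer->mutex);
	return true;
}
```

Skipping is acceptable here because busy polling is only an optimization;
the regular notification path still delivers the work.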
